# SI 650 / EECS 549: Homework 3 Part 3

Homework 3 Part 3 will have you working with deep learning models in a variety of ways. You will likely need to run this on Great Lakes unless you have access to a GPU elsewhere (or are prepared to wait a long time). You should complete Parts 1 and 2 before attempting this notebook so that you are familiar with how PyTerrier works.

In Part 3, you'll try the following tasks:
 - Use a large language model to re-rank content
 - Use a text-to-text model to perform query augmentation
 - Train a deep learning IR model and compare its performance.
 
The first two of these tasks will rely on models that we've pretrained for you. However, we've also provided code for how to train these. In the third task, you'll do one simple training and evaluate its results.

For the first two tasks, we've provided most of the code. *You are expected to submit results showing that you have successfully executed it*. You'll need to understand some of the code to complete task 3, which requires you to write new code.

As with the past notebooks, you'll need to have `JAVA_HOME` set before PyTerrier starts, including when you run on Great Lakes. You can set it in the notebook with a Jupyter magic command like
```
%env JAVA_HOME=/fill/in/the/path/to/the/JDK/here
```
after first figuring out where the JDK is installed. (Note that `!export JAVA_HOME=...` does not work here: each `!` command runs in its own subshell, so the variable does not persist into the notebook's Python process.)
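The variable can also be set from Python itself, as long as that happens before `pt.init()` is called. The sketch below is one hedged way to do it; `ensure_java_home` is a hypothetical helper name introduced only for illustration, and the path is the same placeholder as above rather than a real location.

```python
import os

def ensure_java_home(jdk_path, environ=os.environ):
    """Set JAVA_HOME in the given environment mapping unless it is already set."""
    # setdefault keeps an existing value (e.g. one set by a module system)
    return environ.setdefault("JAVA_HOME", jdk_path)

# Placeholder path: fill in wherever the JDK is installed on your system.
# Run this before pt.init(), since PyTerrier starts the JVM at that point.
java_home = ensure_java_home("/fill/in/the/path/to/the/JDK/here")
print("JAVA_HOME =", java_home)
```

Using `setdefault` means a value that is already present in the environment is left alone.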

### Install the PyTerrier extensions

You'll need two PyTerrier extensions: [OpenNIR](https://opennir.net/) and [doc2query](https://github.com/terrierteam/pyterrier_doc2query). We've provided the package install commands in the cells below.

In [None]:
!pip install --upgrade "git+https://github.com/Georgetown-IR-Lab/OpenNIR"
!pip install --upgrade "git+https://github.com/terrierteam/pyterrier_doc2query.git"

In [None]:
!pip install python-terrier

## Getting started

Start PyTerrier as we have in past notebooks.

In [4]:
import pyterrier as pt
if not pt.started():
    pt.init(tqdm='notebook')
import onir_pt
import pyterrier_doc2query
import os
import pandas as pd

terrier-assemblies 5.6 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.6 jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.7.1 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


### [TREC-COVID19](https://ir.nist.gov/covidSubmit/) Dataset download

The following cell downloads the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) dataset that we will use periodically throughout this notebook.

In [5]:
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
topics = dataset.get_topics(variant='description')
qrels = dataset.get_qrels()

[INFO] [starting] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml: [7ms] [18.7kB] [2.59MB/s]
[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt: [166ms] [1.14MB] [6.88MB/s]


# Task 1: Build the inverted index for the TREC-COVID19 collection. (2 points)

Build the index for the TREC Covid-19 (CORD19) data like we have in past notebooks but without any fancy options (e.g., no positional indexing).

In [11]:
cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')
pt_index_path = './terrier_cord19'

if not os.path.exists(pt_index_path + "/data.properties"):
    # create the index, using the IterDictIndexer indexer 
    iter_indexer = pt.IterDictIndexer(pt_index_path, overwrite=True)

    # we give the dataset get_corpus_iter() directly to the indexer
    # while specifying the fields to index and the metadata to record
    index_ref = iter_indexer.index(cord19.get_corpus_iter(), fields=('abstract', ), meta=['docno', 'text'], meta_lengths=[20, 4096])

else:
    # if you already have the index, use it.
    index_ref = pt.IndexRef.of(pt_index_path)

index = pt.IndexFactory.of(index_ref)

[INFO] If you have a local copy of https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/80d664e496b8b7e50a39c6f6bb92e0ef
[INFO] [starting] https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv
[INFO] [finished] https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv: [6.67s] [269MB] [40.4MB/s]
[INFO] [starting] building docstore
docs_iter: 100%|█████████████████████| 192509/192509 [13.51s<0ms, 14247.56doc/s]
[INFO] [finished] docs_iter: [13.51s] [192509doc] [14246.02doc/s]
[INFO] [finished] building docstore [13.53s]


cord19/trec-covid documents:   0%|          | 0/192509 [25ms<?, ?it/s]

02:11:01.057 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 54937 empty documents
02:11:01.205 [ForkJoinPool-1-worker-3] ERROR org.terrier.structures.indexing.Indexer - Could not finish MetaIndexBuilder: 
java.io.IOException: Key 8lqzfj2e is not unique: 37597,11755
For MetaIndex, to suppress, set metaindex.compressed.reverse.allow.duplicates=true
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.mergeTwo(FSOrderedMapFile.java:1374)
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.close(FSOrderedMapFile.java:1308)
	at org.terrier.structures.indexing.BaseMetaIndexBuilder.close(BaseMetaIndexBuilder.java:321)
	at org.terrier.structures.indexing.classical.BasicIndexer.createDirectIndex(BasicIndexer.java:346)
	at org.terrier.structures.indexing.Indexer.index(Indexer.java:369)
	at org.terrier.python.ParallelIndexer$1.apply(ParallelIndexer.java:63)
	at org.terrier.python.ParallelIndexer$1.apply(ParallelIndexer.j

## Using untuned re-rankers

This notebook will have you work with a few neural re-ranking methods that you've used in class. We can build them from scratch using `onir_pt.reranker` or load them from pretrained models. The models we load from scratch won't have been trained to do IR (yet), however.

An OpenNIR reranking model consists of:
 - `ranker` (e.g., `drmm`, `knrm`, or `pacrr`). This defines the neural ranking architecture. We discussed the `knrm` approach in class.
 - `vocab` (e.g., `wordvec_hash`, or `bert`). This defines how text is encoded by the model. This approach makes it easy to swap out different text representations. 
 
Let's start with the `knrm` method we discussed in class:

In [None]:
knrm = onir_pt.reranker('knrm', 'wordvec_hash', text_field='abstract')

config file not found: config
[2021-11-22 22:37:54,698][WordvecHashVocab][DEBUG] [starting] downloading https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
[2021-11-22 22:38:17,787][onir.util.download][DEBUG] downloaded https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip [22.64s] [682M] [31.5MB/s]
[2021-11-22 22:38:17,797][WordvecHashVocab][DEBUG] [finished] downloading https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip [23.10s]
[2021-11-22 22:38:17,797][WordvecHashVocab][DEBUG] [starting] extracting vecs
[2021-11-22 22:38:41,094][WordvecHashVocab][DEBUG] [finished] extracting vecs [23.30s]
[2021-11-22 22:38:41,095][WordvecHashVocab][DEBUG] [starting] loading vecs into memory
[2021-11-22 22:41:01,168][WordvecHashVocab][DEBUG] [finished] loading vecs into memory [02:20]
[2021-11-22 22:41:01,388][WordvecHashVocab][DEBUG] [starting] writing cached at /root/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p
[2021-11-22 22:4

Let's look at how well this model works at ranking compared with our default `BatchRetrieve`.

In [None]:
br = pt.BatchRetrieve(index) % 100
pipeline = br >> pt.text.get_text(dataset, 'abstract') >> knrm
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> KNRM'],
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)

[2021-11-22 22:41:44,522][onir_pt][DEBUG] using GPU (deterministic)
[2021-11-22 22:42:04,493][onir_pt][DEBUG] [starting] batches

batches:   0%|          | 0/1250 [48ms<?, ?it/s]

[2021-11-22 22:42:08,697][onir_pt][DEBUG] [finished] batches: [4.20s] [1250it] [297.40it/s]


Unnamed: 0,name,map,ndcg,ndcg_cut.10,P.10,mrt
0,DPH,0.068056,0.165653,0.609058,0.658,134.035137
1,DPH >> KNRM,0.054801,0.14545,0.359716,0.45,596.670733


The `knrm` model's performance is lower! The model doesn't work very well because it hasn't yet been trained for IR; it's using random weights to combine the scores from the similarity matrix--but this is at least a start.
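To make "random weights combining scores from the similarity matrix" concrete, here is a minimal pure-Python sketch of KNRM-style kernel pooling. It is a toy under stated assumptions, not OpenNIR's implementation: the 8-dimensional random embeddings, the four kernel means in `mus`, the `sigma` value, and all function names are made up for illustration.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def kernel_features(query_vecs, doc_vecs, mus, sigma=0.1):
    """One feature per kernel: sum over query terms of log(soft match count)."""
    # similarity matrix: rows are query terms, columns are document terms
    sim = [[cosine(q, d) for d in doc_vecs] for q in query_vecs]
    feats = []
    for mu in mus:
        total = 0.0
        for row in sim:
            # RBF kernel softly counts document terms whose similarity is near mu
            soft_tf = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            total += math.log(max(soft_tf, 1e-10))  # clamp to avoid log(0)
        feats.append(total)
    return feats

def knrm_score(feats, weights, bias=0.0):
    """Linear combination of kernel features squashed by tanh."""
    return math.tanh(sum(w * f for w, f in zip(weights, feats)) + bias)

rng = random.Random(0)
# Hypothetical toy embeddings; a real model would load pretrained word vectors.
embed = {t: [rng.gauss(0, 1) for _ in range(8)]
         for t in ("covid", "origin", "virus", "bat")}
query = [embed[t] for t in ("covid", "origin")]
doc = [embed[t] for t in ("virus", "bat", "origin")]

mus = [1.0, 0.5, 0.0, -0.5]                       # assumed kernel centers
feats = kernel_features(query, doc, mus)
untrained = [rng.uniform(-1, 1) for _ in mus]     # random, i.e. not trained for IR
print(feats, knrm_score(feats, untrained))
```

Training only has to learn how much each kernel (exact match, close match, weak match) should count; with random weights that combination is arbitrary, which is consistent with the drop seen above.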

## Loading a trained re-ranker

You can train re-ranking models in PyTerrier using the `fit` method. Here's an example of how to train the `knrm` model on the MS MARCO dataset, which is a large IR collection.

```python
# transfer training signals from a medical sample of MS MARCO
from sklearn.model_selection import train_test_split
train_ds = pt.datasets.get_dataset('irds:msmarco-passage/train/medical')
train_topics, valid_topics = train_test_split(train_ds.get_topics(), test_size=50, random_state=42) # split into training and validation sets

# Index MS MARCO
indexer = pt.index.IterDictIndexer('./terrier_msmarco-passage')
tr_index_ref = indexer.index(train_ds.get_corpus_iter(), fields=('text',), meta=('docno',))

pipeline = (pt.BatchRetrieve(tr_index_ref) % 100 # get top 100 results
            >> pt.text.get_text(train_ds, 'text') # fetch the document text
            >> pt.apply.generic(lambda df: df.rename(columns={'text': 'abstract'})) # rename columns
            >> knrm) # apply neural re-ranker

pipeline.fit(
    train_topics,
    train_ds.get_qrels(),
    valid_topics,
    train_ds.get_qrels())
```

Training deep learning models takes a bit of time (especially for large datasets like MS MARCO), so we've provided a model that's already been trained for you to download.

In [None]:
del knrm # free up the memory before loading a new version of the ranker (helpful for the GPU)
knrm = onir_pt.reranker.from_checkpoint('http://jurgens.people.si.umich.edu/ir/knrm.medmarco.tar.gz', text_field='abstract', 
                                        expected_md5="d70b1d4f899690dae51161537e69ed5a")



[2021-11-22 22:42:23,891][onir.util.download][DEBUG] downloaded http://jurgens.people.si.umich.edu/ir/knrm.medmarco.tar.gz [7ms] [1.43k] [207kB/s] [md5 hash verified]
[2021-11-22 22:42:23,912][WordvecHashVocab][DEBUG] [starting] reading cached at /root/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p
[2021-11-22 22:43:24,768][WordvecHashVocab][DEBUG] [finished] reading cached at /root/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p [01:01]


In [None]:
pipeline2 = br >> pt.text.get_text(dataset, 'abstract') >> knrm
pt.Experiment(
    [br, pipeline2],
    topics,
    qrels,
    names=['DPH', 'DPH >> KNRM'],
    baseline=0,
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)

[2021-11-22 22:43:39,420][onir_pt][DEBUG] using GPU (deterministic)
[2021-11-22 22:43:39,687][onir_pt][DEBUG] [starting] batches

batches:   0%|          | 0/1250 [33ms<?, ?it/s]

[2021-11-22 22:43:43,686][onir_pt][DEBUG] [finished] batches: [4.00s] [1250it] [312.77it/s]


Unnamed: 0,name,map,P.10,ndcg,ndcg_cut.10,mrt,map +,map -,map p-value,P.10 +,P.10 -,P.10 p-value,ndcg +,ndcg -,ndcg p-value,ndcg_cut.10 +,ndcg_cut.10 -,ndcg_cut.10 p-value
0,DPH,0.068056,0.658,0.165653,0.609058,94.499726,,,,,,,,,,,,
1,DPH >> KNRM,0.065099,0.598,0.160549,0.532602,198.171466,20.0,30.0,0.095533,12.0,26.0,0.024604,20.0,30.0,0.028324,20.0,30.0,0.00597


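The tuned performance is a little better than before (nDCG@10 rises from 0.360 to 0.533), but `knrm` still underperforms our first-stage ranking model (0.609). The drops relative to DPH in P@10, nDCG, and nDCG@10 have p-values below 0.05; the drop in MAP does not.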
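When `baseline=0` is passed, the results table gains `+`, `-`, and `p-value` columns for each metric: the number of queries that improved over the baseline, the number that got worse, and the significance of a paired t-test. A minimal sketch of the first two, plus the paired t statistic, is below. The per-query scores and the name `compare_to_baseline` are made up for illustration, and turning the statistic into a p-value needs a t-distribution (for example from SciPy), which is omitted here.

```python
import math
from statistics import mean, stdev

def compare_to_baseline(baseline, system):
    """baseline/system: dicts mapping qid -> metric value for the same queries."""
    diffs = [system[q] - baseline[q] for q in baseline]
    improved = sum(d > 0 for d in diffs)   # the "+" column
    degraded = sum(d < 0 for d in diffs)   # the "-" column
    # paired t statistic: mean per-query difference over its standard error
    sd = stdev(diffs)
    t_stat = mean(diffs) / (sd / math.sqrt(len(diffs))) if sd > 0 else 0.0
    return improved, degraded, t_stat

# Hypothetical per-query nDCG@10 values for four queries
base = {"q1": 0.60, "q2": 0.40, "q3": 0.70, "q4": 0.50}
rerank = {"q1": 0.65, "q2": 0.30, "q3": 0.70, "q4": 0.45}
print(compare_to_baseline(base, rerank))
```

Ties count toward neither column, so `+` and `-` do not always sum to the number of queries (compare the P.10 columns in the table above, which sum to 38 out of 50).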

## Reranking with BERT

Large language models such as [BERT](https://arxiv.org/abs/1810.04805) are much more powerful neural models that, as we discussed in class, have been shown to be effective for ranking.

As with `knrm`, we'll start by using BERT for re-ranking with its initial parameters. These parameters have been tuned for masked language modeling (i.e., filling in a blanked-out word) and for next-sentence prediction, but have not been tuned for IR at all.

In [None]:
del knrm # clear out memory from KNRM (useful for GPU)
bert = onir_pt.reranker('vanilla_transformer', 'bert', text_field='abstract', vocab_config={'train': True})

100%|██████████| 231508/231508 [108ms<0ms, 2150451.63B/s] 
100%|██████████| 433/433 [2ms<0ms, 264280.21B/s]
100%|██████████| 440473133/440473133 [12.93s<0ms, 34054441.14B/s] 


Let's see how this non-IR-trained model does on the CORD19 data.

In [None]:
pipeline3 = br % 100 >> pt.text.get_text(dataset, 'abstract') >> bert
pt.Experiment(
    [br, pipeline3],
    topics,
    qrels,
    names=['DPH', 'DPH >> VBERT'],
    baseline=0,
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)

[2021-11-22 23:32:12,921][onir_pt][DEBUG] using GPU (deterministic)
[2021-11-22 23:32:13,157][onir_pt][DEBUG] [starting] batches

batches:   0%|          | 0/1250 [24ms<?, ?it/s]

[2021-11-22 23:38:01,728][onir_pt][DEBUG] [finished] batches: [05:49] [1250it] [ 3.59it/s]


Unnamed: 0,name,map,P.10,ndcg,ndcg_cut.10,mrt,map +,map -,map p-value,P.10 +,P.10 -,P.10 p-value,ndcg +,ndcg -,ndcg p-value,ndcg_cut.10 +,ndcg_cut.10 -,ndcg_cut.10 p-value
0,DPH,0.068056,0.658,0.165653,0.609058,83.899746,,,,,,,,,,,,
1,DPH >> VBERT,0.056413,0.458,0.147048,0.374197,7053.683518,8.0,42.0,6e-06,6.0,37.0,9.462198e-07,5.0,45.0,1e-06,8.0,41.0,4.134415e-09


As we see, although the BERT model is pre-trained for language modeling, it doesn't do very well at ranking on our benchmark. To get better performance, we'll need to tune it for the task of relevance ranking.

We can train the model for ranking (as shown above for KNRM) or we can download a trained model. Here, we will use the [SLEDGE](https://arxiv.org/abs/2010.05987) model, which is a BERT model trained on scientific text and tuned on medical queries.

In [None]:
bert = onir_pt.reranker.from_checkpoint('http://jurgens.people.si.umich.edu/ir/scibert-medmarco.tar.gz', 
                                         text_field='abstract', expected_md5="854966d0b61543ffffa44cea627ab63b")



[2021-11-22 23:38:10,489][onir.util.download][DEBUG] downloaded http://jurgens.people.si.umich.edu/ir/scibert-medmarco.tar.gz [8.06s] [499M] [60.8MB/s] [md5 hash verified]
[2021-11-22 23:38:28,390][onir.util.download][DEBUG] downloaded https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar [13.38s] [411M] [42.6MB/s] [md5 hash verified]


extracting: 411MB [2.02s, 204MB/s]
extracting: 821MB [9.39s, 87.4MB/s]


Let's run another experiment to see how this new model trained for IR does.

In [None]:
pipeline4 = br % 100 >> pt.text.get_text(dataset, 'abstract') >> bert
pt.Experiment(
    [br, pipeline4],
    topics,
    qrels,
    names=['DPH', 'DPH >> Trained-BERT'],
    baseline=0,
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)

[2021-11-22 23:38:59,449][onir_pt][DEBUG] using GPU (deterministic)
[2021-11-22 23:38:59,667][onir_pt][DEBUG] [starting] batches

batches:   0%|          | 0/1250 [30ms<?, ?it/s]

[2021-11-22 23:44:26,697][onir_pt][DEBUG] [finished] batches: [05:27] [1250it] [ 3.82it/s]


Unnamed: 0,name,map,P.10,ndcg,ndcg_cut.10,mrt,map +,map -,map p-value,P.10 +,P.10 -,P.10 p-value,ndcg +,ndcg -,ndcg p-value,ndcg_cut.10 +,ndcg_cut.10 -,ndcg_cut.10 p-value
0,DPH,0.068056,0.658,0.165653,0.609058,83.647015,,,,,,,,,,,,
1,DPH >> Trained-BERT,0.07571,0.77,0.173079,0.701995,6620.828306,36.0,14.0,0.001278,28.0,11.0,0.000851,36.0,14.0,0.010118,31.0,19.0,0.012156


Training helped a lot! We're able to improve upon the initial ranking from `BatchRetrieve` (for example, nDCG@10 rises from 0.609 to 0.702). However, from looking at `mrt` we can see that this is pretty slow to run, at roughly 6,600 ms per query versus about 84 ms for DPH alone, and this was using a GPU! This response time underscores the trade-off in using large language models at retrieval time: they may perform better, but could be much slower.

# Deep learning at indexing time: doc2query

Instead of using our large language models to rerank, another option is to use them at _indexing time_ to augment our documents. In class, we discussed one such option, doc2query, that augments an inverted index structure by predicting queries that may be used to search for the document, and appending those to the document text.
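The effect of this kind of document expansion on an inverted index can be sketched in a few lines of plain Python. This is an illustrative toy, not how `pyterrier_doc2query` or Terrier work internally: the document text is shortened, the "predicted" queries are hard-coded stand-ins for model output, and `expand` and `build_inverted_index` are hypothetical helper names.

```python
from collections import defaultdict

def expand(doc_text, predicted_queries):
    """Append the predicted queries to the document text."""
    return doc_text + " " + " ".join(predicted_queries)

def build_inverted_index(docs):
    """Map each lowercased whitespace token to the set of docnos containing it."""
    index = defaultdict(set)
    for docno, text in docs.items():
        for term in text.lower().split():
            index[term].add(docno)
    return index

# Shortened toy document and hard-coded stand-ins for doc2query output
docs = {"d1": "UMSI delivers ethical solutions connecting people information and technology"}
predicted = {"d1": ["what is umsi", "university of michigan school of information"]}

plain = build_inverted_index(docs)
expanded = build_inverted_index({d: expand(t, predicted[d]) for d, t in docs.items()})

# "michigan" never occurs in the original text, so only the expanded index matches it
print("michigan" in plain, "michigan" in expanded)
```

The cost of running the language model is paid once at indexing time, while retrieval itself stays an ordinary inverted-index lookup.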

We can run doc2query with the `pyterrier_doc2query` package, which was loaded at the top.

### Loading a pre-trained model

We'll start by using a version of the doc2query model released by the authors that is trained on the MS MARCO collection.

In [None]:
if not os.path.exists("t5-base.zip"):
  !wget http://jurgens.people.si.umich.edu/ir/t5-base.zip
  !unzip t5-base.zip

--2021-11-22 23:47:29--  http://jurgens.people.si.umich.edu/ir/t5-base.zip
Resolving jurgens.people.si.umich.edu (jurgens.people.si.umich.edu)... 141.211.184.98
Connecting to jurgens.people.si.umich.edu (jurgens.people.si.umich.edu)|141.211.184.98|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 357139559 (341M) [application/zip]
Saving to: ‘t5-base.zip’


2021-11-22 23:47:37 (43.9 MB/s) - ‘t5-base.zip’ saved [357139559/357139559]

Archive:  t5-base.zip
  inflating: model.ckpt-1004000.data-00000-of-00002  
  inflating: model.ckpt-1004000.data-00001-of-00002  
  inflating: model.ckpt-1004000.index  
  inflating: model.ckpt-1004000.meta  


We can load the model weights by specifying the checkpoint.

In [None]:
doc2query = pyterrier_doc2query.Doc2Query('model.ckpt-1004000', batch_size=8)

Downloading:   0%|          | 0.00/773k [44ms<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [24ms<?, ?B/s]

Downloading:   0%|          | 0.00/1.17k [30ms<?, ?B/s]

Doc2query using cuda


### Running doc2query on sample text

Let's see what queries it predicts for the sample document that we've made up:

In [None]:
df = pd.DataFrame([{"docno" : "d1", "text" :"The University of Michigan School of Information (UMSI) delivers innovative, elegant and ethical solutions connecting people, information and technology. The school was one of the first iSchools in the nation and is the premier institution studying and using technology to improve human computer interactions."}])
df.iloc[0].text

'The University of Michigan School of Information (UMSI) delivers innovative, elegant and ethical solutions connecting people, information and technology. The school was one of the first iSchools in the nation and is the premier institution studying and using technology to improve human computer interactions.'

In [None]:
doc2query_df = doc2query(df)
doc2query_df.iloc[0].querygen

'what is umsi university of michigan school of information what is umsi'

Not too bad, though the questions are somewhat generic.

### Loading an index of doc2query documents

Let's see how doc2query does on improving the performance in the TREC COVID data. Since indexing with doc2query takes a while (due to needing to run the deep learning models), we've provided an index with the text already added.

If you would like to index the collection with doc2query yourself (or use doc2query for your course project), you can use the following code:

```python
dataset = pt.get_dataset("irds:cord19/trec-covid")
indexer = (
  pyterrier_doc2query.Doc2Query('model.ckpt-1004000', doc_attr='abstract', batch_size=8, append=True) # apply doc2query on abstracts and append
  >> pt.apply.generic(lambda df: df.rename(columns={'abstract': 'text'})) # rename "abstract" column to "text" for indexing
  >> pt.IterDictIndexer("./doc2query_index_path") # index the expanded documents
)
indexref = indexer.index(dataset.get_corpus_iter())
```


In [None]:
if not os.path.exists('doc2query_marco_cord19.zip'):
  !wget http://jurgens.people.si.umich.edu/ir/doc2query_marco_cord19.zip
  !unzip doc2query_marco_cord19.zip
doc2query_indexref = pt.IndexRef.of('./doc2query_index_path/data.properties')

--2021-11-22 23:47:53--  http://jurgens.people.si.umich.edu/ir/doc2query_marco_cord19.zip
Resolving jurgens.people.si.umich.edu (jurgens.people.si.umich.edu)... 141.211.184.98
Connecting to jurgens.people.si.umich.edu (jurgens.people.si.umich.edu)|141.211.184.98|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45804576 (44M) [application/zip]
Saving to: ‘doc2query_marco_cord19.zip’


2021-11-22 23:47:55 (58.3 MB/s) - ‘doc2query_marco_cord19.zip’ saved [45804576/45804576]

Archive:  doc2query_marco_cord19.zip
   creating: doc2query_index_path/
  inflating: doc2query_index_path/data.document.fsarrayfile  
  inflating: doc2query_index_path/data.inverted.bf  
  inflating: doc2query_index_path/data.direct.bf  
  inflating: doc2query_index_path/data.lexicon.fsomapid  
  inflating: doc2query_index_path/data.lexicon.fsomaphash  
  inflating: doc2query_index_path/data.lexicon.fsomapfile  
  inflating: doc2query_index_path/data.meta.zdata  
  inflating: doc2query_index_pa

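Let's look at the results on TREC COVID by merging the top-ranked document for each query with its relevance judgment from the qrels. Because this is a left merge, a missing `label` (NaN) means that the document was never judged for that query.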

In [None]:
dataset = pt.get_dataset('irds:cord19/trec-covid')
pipeline = pt.BatchRetrieve(doc2query_indexref) % 1 >> pt.text.get_text(dataset, 'title')
res = pipeline(dataset.get_topics('title'))
res.merge(dataset.get_qrels(), how='left').head()

Unnamed: 0,qid,docid,docno,rank,score,query,title,label,iteration
0,1,101299,jwmrgy5d,0,8.427298,coronavirus origin,COVID-19 in the heart and the lungs: could we ...,0.0,5.0
1,2,182167,g8grcy5j,0,13.922648,coronavirus response to weather changes,The Stirling Protocol – Putting the environmen...,0.0,4.0
2,3,85678,tl30wlpy,0,7.22418,coronavirus immunity,Receptor-dependent coronavirus infection of de...,,
3,4,145871,l5fxswfz,0,12.773362,how do people die from the coronavirus,Analysis on 54 Mortality Cases of Coronavirus ...,2.0,1.5
4,5,180990,3sepefqa,0,12.99598,animal models of covid 19,Current global vaccine and drug efforts agains...,0.0,4.0


What kind of queries does doc2query generate for the CORD19 documents?

In [None]:
df = pd.DataFrame(doc for doc in dataset.get_corpus_iter() if doc['docno'] in ('3sepefqa', 'l5fxswfz'))
df = df.rename(columns={'abstract': 'text'})
doc2query_df = doc2query(df)
for querygen, docno, text in zip(doc2query_df['querygen'], doc2query_df['docno'], df['text']):
    print(docno)
    print(querygen)
    print(text)

cord19/trec-covid documents:   0%|          | 0/192509 [23ms<?, ?it/s]

l5fxswfz
what is the current number of cases of coronavirus disease worldwide? how many coronavirus cases in the world how many coronaviruses are there
Since the identification of the first case of coronavirus disease 2019 (COVID-19), the global number of confirmed cases as of March 15, 2020, is 156,400, with total death in 5,833 (3.7%) worldwide. Here, we summarize the morality data from February 19 when the first mortality occurred to 0 am, March 10, 2020, in Korea with comparison to other countries. The overall case fatality rate of COVID-19 in Korea was 0.7% as of 0 am, March 10, 2020.
3sepefqa
what is the medicine used for copid what is copid 19 medication what is the cure for the comid epidemic
COVID-19 has become one of the biggest health concern, along with huge economic burden. With no clear remedies to treat the disease, doctors are repurposing drugs like chloroquine and remdesivir to treat COVID-19 patients. In parallel, research institutes in collaboration with biotech comp

## Evaluating the effects of doc2query

Here, we'll change our evaluation setup a bit from what we did before. Rather than compare two models for the same index, we'll instead compare the same model (BM25) with two different ways of indexing (two indices)! Our baseline will be an index of CORD19 without the doc2query additions.

Let's load a copy of the CORD19 index that we used earlier.

In [None]:
indexref = pt.IndexRef.of('./terrier_cord19/data.properties')

### Task 2: Write the `Experiment` to compare indices (3 points)
Run an `Experiment` using a `BM25` ranker that compares the indices `indexref` and `doc2query_indexref`. Compare your models using MAP, NDCG, and NDCG@10. Note that our doc2query model was trained on MS MARCO, which isn't the same kind of collection as CORD19, so this performance tells us how well that model can transfer to a new setting.

In [None]:
pt.Experiment(
    # row 0: BM25 over the doc2query-expanded index; row 1: BM25 over the original CORD19 index
    [pt.BatchRetrieve(doc2query_indexref, wmodel='BM25'), pt.BatchRetrieve(indexref, wmodel='BM25')],
    topics,
    qrels,
    eval_metrics=["map", "ndcg", "ndcg_cut_10"])

Unnamed: 0,name,map,ndcg,ndcg_cut_10
0,BR(BM25),0.196293,0.412673,0.623652
1,BR(BM25),0.195498,0.411103,0.644463


# Task 3: Train a new model! (25 points)

All of the prior exercises have had you working with either off-the-shelf models (not trained for IR) or models that someone else has trained for you. To give you a sense of how to train a model, your primary task in this notebook is to train a simple `knrm` model, which should be relatively efficient to train on a GPU. 

To keep things simple, we'll use the same setup for CORD19 that we did in Part 2 (30 queries in train, 5 in dev, 15 queries in test), which is still relatively small for training a deep learning model but will get you started on the process.

Your tasks are the following:
- Load the CORD19 dataset and split it into train, dev, and test
- Create a new `knrm` ranker and a pipeline that uses it
- Run an `Experiment` comparing four models:
  - a default `BatchRetrieve`, filtering to the top 100 results
  - BM25, filtering to the top 100 results
  - a pipeline that feeds the top 100 results of the default `BatchRetrieve` to your `knrm` model
  - a pipeline that feeds the top 100 results of BM25 to your `knrm` model
  
Your `Experiment` should evaluate models using MAP, NDCG, NDCG@10, Precision@10, and Mean Response Time.
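As a reminder of what these measures compute for a single query, here is a hedged pure-Python sketch. It assumes graded labels where any value above 0 counts as relevant and uses linear gain with a log2 rank discount for nDCG; evaluation tools differ in such conventions, so treat the function names and details as illustrative rather than as PyTerrier's implementation. MAP is the mean of average precision over all queries, and Mean Response Time is a timing measurement rather than a relevance measure.

```python
import math

def precision_at_k(ranked_labels, k=10):
    """Fraction of the top-k positions holding a relevant document."""
    return sum(1 for rel in ranked_labels[:k] if rel > 0) / k

def dcg(labels):
    """Discounted cumulative gain with linear gain and log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(labels))

def ndcg_at_k(ranked_labels, all_labels, k=10):
    """DCG of the ranking divided by the DCG of an ideal ordering of judged docs."""
    ideal = dcg(sorted(all_labels, reverse=True)[:k])
    return dcg(ranked_labels[:k]) / ideal if ideal > 0 else 0.0

def average_precision(ranked_labels, num_relevant):
    """Mean of precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

# Hypothetical query: labels of the retrieved documents in rank order, and the
# labels of all judged relevant documents (one relevant document was not retrieved)
ranked = [2, 0, 1, 0]
judged = [2, 1, 1]
print(precision_at_k(ranked), ndcg_at_k(ranked, judged), average_precision(ranked, len(judged)))
```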
  
We expect to see the `Experiment`'s results in the final cell. You are, of course, welcome to try training any of the fancier models to see how they do as well!

In [6]:
topics = dataset.get_topics(variant='title')
qrels = dataset.get_qrels()

In [7]:
from sklearn.model_selection import train_test_split

tr_va_topics, test_topics = train_test_split(topics, test_size=15, random_state=42)
train_topics, valid_topics =  train_test_split(tr_va_topics, test_size=5, random_state=42)

In [9]:
knrm = onir_pt.reranker.from_checkpoint('http://jurgens.people.si.umich.edu/ir/knrm.medmarco.tar.gz', text_field='abstract', 
                                        expected_md5="d70b1d4f899690dae51161537e69ed5a")

config file not found: config

[2021-11-23 02:01:51,751][onir.util.download][DEBUG] downloaded http://jurgens.people.si.umich.edu/ir/knrm.medmarco.tar.gz [45ms] [1.43k] [31.7kB/s] [md5 hash verified]
[2021-11-23 02:01:51,773][WordvecHashVocab][DEBUG] [starting] downloading https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
[2021-11-23 02:02:17,788][onir.util.download][DEBUG] downloaded https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip [25.57s] [682M] [26.5MB/s]
[2021-11-23 02:02:17,796][WordvecHashVocab][DEBUG] [finished] downloading https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip [26.02s]
[2021-11-23 02:02:17,796][WordvecHashVocab][DEBUG] [starting] extracting vecs
[2021-11-23 02:02:42,787][WordvecHashVocab][DEBUG] [finished] extracting vecs [24.99s]
[2021-11-23 02:02:42,788][WordvecHashVocab][DEBUG] [starting] loading vecs into memory
[2021-11-23 02:05:21,173][WordvecHashVocab][DEBUG] [finished] loading vecs into memory [02:38]
[2021-11-23 02:05:21,358][WordvecHashVocab][DEBUG] [starting] writing cached at /root/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p
[2021-11-23 02:0

In [13]:
br = pt.BatchRetrieve(index) % 100
bm25 = pt.BatchRetrieve(index, wmodel='BM25') % 100
br_knrm = br >> pt.text.get_text(dataset, 'abstract') >> knrm
bm25_knrm = bm25 >> pt.text.get_text(dataset, 'abstract') >> knrm

In [14]:
pt.Experiment(
    [br, bm25, br_knrm, bm25_knrm],
    valid_topics,
    qrels,
    names=['Default', 'BM25', 'Default >> KNRM', 'BM25 >> KNRM'],
    baseline=0,
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)

[INFO] NumExpr defaulting to 2 threads.

[2021-11-23 02:12:06,866][onir_pt][ERROR] gpu=True, but CUDA is not available. Falling back on CPU.
[2021-11-23 02:12:06,882][onir_pt][DEBUG] [starting] batches

batches:   0%|          | 0/125 [32ms<?, ?it/s]

[2021-11-23 02:12:07,868][onir_pt][DEBUG] [finished] batches: [984ms] [125it] [126.98it/s]
[2021-11-23 02:12:08,303][onir_pt][ERROR] gpu=True, but CUDA is not available. Falling back on CPU.
[2021-11-23 02:12:08,304][onir_pt][DEBUG] [starting] batches

batches:   0%|          | 0/125 [23ms<?, ?it/s]

[2021-11-23 02:12:08,976][onir_pt][DEBUG] [finished] batches: [671ms] [125it] [186.37it/s]


Unnamed: 0,name,map,P.10,ndcg,ndcg_cut.10,mrt,map +,map -,map p-value,P.10 +,P.10 -,P.10 p-value,ndcg +,ndcg -,ndcg p-value,ndcg_cut.10 +,ndcg_cut.10 -,ndcg_cut.10 p-value
0,Default,0.118251,0.82,0.228417,0.77356,270.245959,,,,,,,,,,,,
1,BM25,0.115882,0.8,0.227022,0.741581,108.740266,2.0,3.0,0.684073,0.0,1.0,0.373901,2.0,3.0,0.801761,1.0,4.0,0.095643
2,Default >> KNRM,0.107604,0.7,0.219123,0.641592,318.768676,0.0,5.0,0.060864,0.0,2.0,0.208,1.0,4.0,0.105601,1.0,4.0,0.222509
3,BM25 >> KNRM,0.11278,0.7,0.223038,0.636257,216.041404,1.0,4.0,0.247847,0.0,3.0,0.108701,1.0,4.0,0.32766,0.0,5.0,0.113137
