## SI 650 / EECS 549: Homework 3 Part 2

Homework 3 Part 2 will have you working with deep learning models in a variety of ways. You will likely need to run this on Great Lakes unless you have access to a GPU elsewhere (otherwise, be prepared to wait a long time). You should have completed Part 1 before attempting this notebook so that you are familiar with how PyTerrier works.

In Part 2, you'll try the following tasks:
 - Use a large language model to re-rank content
 - Use a text-to-text model to perform query augmentation
 - Train a deep learning IR model and compare its performance

The first two of these tasks will rely on models that we've pretrained for you. However, we've also provided code showing how to train them. In the third task, you'll do one simple round of training and evaluation.

For the first two tasks, we've provided most of the code. *You are expected to submit results showing that you have successfully executed it*. You'll need to understand some of the code to complete task 3, which requires you to write new code.

As with the past notebooks, you'll need to have `JAVA_HOME` set before starting PyTerrier on Great Lakes. After figuring out where the JDK is installed, you can set it from within the notebook with a Jupyter magic command like
```
%env JAVA_HOME=/fill/in/the/path/to/the/JDK/here
```
(A `!export ...` line would not work here: each `!` command runs in its own subshell, so the variable would not persist in the notebook's own process.)
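Setting the variable from Python is another dependable route, because it changes the environment of the process that will later start the JVM. A minimal sketch, assuming a hypothetical helper named `set_java_home` (it is not part of PyTerrier) and a placeholder path that you must replace with your own:

```python
import os

def set_java_home(jdk_path):
    """Point JAVA_HOME at jdk_path for this Python process (hypothetical helper)."""
    # Fail early if the directory is wrong, rather than when PyTerrier starts.
    if not os.path.isdir(jdk_path):
        raise FileNotFoundError(f"JDK directory not found: {jdk_path}")
    os.environ["JAVA_HOME"] = jdk_path
    return os.environ["JAVA_HOME"]

# Usage (placeholder path, fill in your own before uncommenting):
# set_java_home("/fill/in/the/path/to/the/JDK/here")
```

Call it before `pt.init()` so that the value is already in place when PyTerrier looks for Java.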

You can work on Parts 1 and 2 separately, so feel free to do the GPU part (Part 2) when the resources are available on Great Lakes and the non-GPU part on your laptop or on a non-GPU Great Lakes machine (which will likely be much faster to obtain). Note that there will be no extensions due to Great Lakes bottlenecks, so you are encouraged to complete Part 2 as soon as possible.

### Install the PyTerrier extensions

You'll need two PyTerrier extensions: [OpenNIR](https://opennir.net/) and [doc2query](https://github.com/terrierteam/pyterrier_doc2query). We've provided the package install commands below.

In [2]:
!pip install --upgrade git+https://github.com/Georgetown-IR-Lab/OpenNIR
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git

Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/Georgetown-IR-Lab/OpenNIR
  Cloning https://github.com/Georgetown-IR-Lab/OpenNIR to /tmp/pip-req-build-kqj247bw
  Running command git clone -q https://github.com/Georgetown-IR-Lab/OpenNIR /tmp/pip-req-build-kqj247bw
  Resolved https://github.com/Georgetown-IR-Lab/OpenNIR to commit 88a4679372f471a04d284a99404ffce2b7a1dc49
Collecting torch>=1.3.1
  Downloading torch-1.12.1-cp39-cp39-manylinux1_x86_64.whl (776.4 MB)
Collecting pytorch-pretrained-bert==0.6.1
  Downloading pytorch_pretrained_bert-0.6.1-py3-none-any.whl (114 kB)
Collecting pytorch-transformers==1.1.0
  Downloading pytorch_transformers-1.1.0-py3-none-any.whl (158 kB)
Collecting token

## Getting started

Start PyTerrier as we have in past notebooks.

In [1]:
# Set JAVA_HOME through os.environ below; a `!export` line would run in a subshell and not persist.
import os
os.environ['JAVA_HOME'] = '/sw/pkgs/arc/openjdk/jdk-18.0.1.1'
os.environ['JVM_PATH']='/sw/pkgs/arc/openjdk/jdk-18.0.1.1/lib/server/libjvm.so'
os.environ['CUDA_VISIBLE_DEVICES'] = '6,7'

In [2]:
import pyterrier as pt
if not pt.started():
    pt.init(tqdm='notebook')
import onir_pt
import pyterrier_doc2query
import os
import pandas as pd

PyTerrier 0.8.1 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


### [TREC-COVID19](https://ir.nist.gov/covidSubmit/) Dataset download

The following cell downloads the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) dataset that we will use periodically throughout this notebook.

In [3]:
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
topics = dataset.get_topics(variant='description')
qrels = dataset.get_qrels()

[INFO] [starting] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml: [1ms] [18.7kB] [17.3MB/s]
  df.drop(df.columns.difference(['qid','query']), 1, inplace=True)
[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt: [174ms] [1.14MB] [6.56MB/s]
                                                                                           

# Task 1: Build the inverted index for the TREC-COVID19 collection (5 points)

Build the index for the TREC Covid-19 (CORD19) data like we have in past notebooks but without any fancy options (e.g., no positional indexing).
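To make "no positional indexing" concrete, here is a toy sketch of a non-positional inverted index: each term maps to the documents that contain it and a term frequency, and nothing records where in the document the term occurred. The stopword list, the function name, and the two documents are made up for illustration; Terrier's real index is a compressed on-disk structure and works differently in its details.

```python
from collections import defaultdict

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}  # tiny illustrative list

def build_inverted_index(docs):
    """Map each non-stopword term to {docno: term frequency}; no positions stored."""
    index = defaultdict(dict)
    for docno, text in docs.items():
        for term in text.lower().split():
            if term in STOPWORDS:
                continue  # mirrors a stopword-only term pipeline (no stemming)
            index[term][docno] = index[term].get(docno, 0) + 1
    return index

docs = {
    "d1": "the origin of the coronavirus",
    "d2": "coronavirus immunity and coronavirus vaccines",
}
index = build_inverted_index(docs)
print(index["coronavirus"])  # {'d1': 1, 'd2': 2}
```

Because only frequencies are kept, this structure supports bag-of-words models such as BM25 or DPH but not phrase or proximity queries.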

In [4]:
cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')

pt_index_path = './cord19'
if not os.path.exists(pt_index_path + "/data.properties"):
    indexer = pt.index.IterDictIndexer(pt_index_path, overwrite=True)
    indexer.setProperty("termpipelines", "Stopwords")
    index_ref = indexer.index(cord19.get_corpus_iter(),fields=('abstract',),meta=('docno',))

else:
    # if you already have the index, create an IndexRef from the data in pt_index_path
    # that we can use to load using the IndexFactory
    index_ref = pt.IndexRef.of(pt_index_path + "/data.properties")

index = pt.IndexFactory.of(index_ref)

[INFO] [starting] building docstore
[INFO] If you have a local copy of https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv, you can symlink it here to avoid downloading it again: /home/maojy/.ir_datasets/downloads/80d664e496b8b7e50a39c6f6bb92e0ef
[INFO] [starting] https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv

cord19/trec-covid documents:   0%|          | 0/192509 [16ms<?, ?it/s]

  index_ref = indexer.index(cord19.get_corpus_iter(),fields=('abstract',),meta=('docno',))


17:18:14.468 [ForkJoinPool-1-worker-1] WARN org.terrier.structures.indexing.Indexer - Indexed 54937 empty documents
17:18:14.642 [ForkJoinPool-1-worker-1] ERROR org.terrier.structures.indexing.Indexer - Could not finish MetaIndexBuilder: 
java.io.IOException: Key 8lqzfj2e is not unique: 37597,11755
For MetaIndex, to suppress, set metaindex.compressed.reverse.allow.duplicates=true
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.mergeTwo(FSOrderedMapFile.java:1374)
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.close(FSOrderedMapFile.java:1308)
	at org.terrier.structures.indexing.BaseMetaIndexBuilder.close(BaseMetaIndexBuilder.java:321)
	at org.terrier.structures.indexing.classical.BasicIndexer.createDirectIndex(BasicIndexer.java:346)
	at org.terrier.structures.indexing.Indexer.index(Indexer.java:369)
	at org.terrier.python.ParallelIndexer$1.apply(ParallelIndexer.java:63)
	at org.terrier.python.ParallelIndexer$1.apply(ParallelIndexer.j

## Using untuned re-rankers

This notebook will have you work with a few neural re-ranking methods that you've used in class. We can build them from scratch using `onir_pt.reranker` or load them from pretrained models. The models we build from scratch won't have been trained to do IR (yet), however.

An OpenNIR reranking model consists of:
 - `ranker` (e.g., `drmm`, `knrm`, or `pacrr`). This defines the neural ranking architecture. We discussed the `knrm` approach in class.
 - `vocab` (e.g., `wordvec_hash`, or `bert`). This defines how text is encoded by the model. This approach makes it easy to swap out different text representations. 
 
Let's start with the `knrm` method we discussed in class:

In [5]:
knrm = onir_pt.reranker('knrm', 'wordvec_hash', text_field='abstract')

config file not found: config
[2022-11-01 17:18:41,594][WordvecHashVocab][DEBUG] [starting] downloading https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
[2022-11-01 17:18:58,902][onir.util.download][DEBUG] downloaded https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip [16.71s] [682M] [34.9MB/s]
[2022-11-01 17:18:58,907][WordvecHashVocab][DEBUG] [finished] downloading https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip [17.32s]
[2022-11-01 17:18:58,908][WordvecHashVocab][DEBUG] [starting] extracting vecs
[2022-11-01 17:19:10,239][WordvecHashVocab][DEBUG] [finished] extracting vecs [11.33s]
[2022-11-01 17:19:10,243][WordvecHashVocab][DEBUG] [starting] loading vecs into memory
[2022-11-01 17:21:51,886][WordvecHashVocab][DEBUG] [finished] loading vecs into memory [02:42]
[2022-11-01 17:21:52,118][WordvecHashVocab][DEBUG] [starting] writing cached at /home/maojy/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p
[2022-11-01 17:21:58,498][WordvecHashVocab][DEBUG] [finished] writing cached at /home/maojy/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p [6.38s]


Let's look at how well this model works at ranking compared with our default `BatchRetrieve`.

In [6]:
br = pt.BatchRetrieve(index) % 100
pipeline = br >> pt.text.get_text(dataset, 'abstract') >> knrm
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> KNRM'],
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)

[2022-11-01 17:24:17,533][onir_pt][ERROR] gpu=True, but CUDA is not available. Falling back on CPU.
[2022-11-01 17:24:17,665][onir_pt][DEBUG] [starting] batches


batches:   0%|          | 0/1250 [14ms<?, ?it/s]

[2022-11-01 17:24:22,425][onir_pt][DEBUG] [finished] batches: [4.76s] [1250it] [262.64it/s]


| | name | map | ndcg | ndcg_cut.10 | P.10 | mrt |
|---|---|---|---|---|---|---|
| 0 | DPH | 0.058707 | 0.151733 | 0.576693 | 0.622 | 59.544833 |
| 1 | DPH >> KNRM | 0.045271 | 0.132041 | 0.320171 | 0.372 | 134.598549 |


The `knrm` model's performance is lower! The model doesn't work very well because it hasn't yet been trained for IR; it's using random weights to combine the scores from the similarity matrix. Still, this is at least a start.
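To see where those weights enter, here is a rough pure-Python sketch of kernel pooling as we understand the `knrm` design: Gaussian kernels turn the query-document similarity matrix into soft match counts, and a learned linear layer combines them into a score. The function name, kernel means, and weights below are illustrative assumptions, and OpenNIR's actual implementation may differ in its details.

```python
import math

def knrm_score(sim_matrix, mus, sigma, weights, bias=0.0):
    """Score one query-document pair from its term-similarity matrix.

    sim_matrix[i][j] is the cosine similarity of query term i and document term j.
    """
    features = []
    for mu in mus:
        pooled = 0.0
        for row in sim_matrix:  # one row per query term
            # Soft count of document terms whose similarity is close to mu
            k = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            pooled += math.log(max(k, 1e-10))  # clamp to avoid log(0)
        features.append(pooled)
    # The linear layer is what training adjusts; tanh squashes the score to (-1, 1)
    return math.tanh(sum(w * f for w, f in zip(weights, features)) + bias)

sim = [[1.0, 0.2], [0.1, 0.9]]   # made-up similarities for 2 query x 2 doc terms
mus = [1.0, 0.5, 0.0]            # kernel centers: exact, partial, and no match
print(knrm_score(sim, mus, sigma=0.1, weights=[0.5, 0.1, -0.2]))
```

With untrained (random) `weights`, the model has no reason to reward the exact-match kernel over the others, which is consistent with the drop in effectiveness seen above.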

## Loading a trained re-ranker

You can train re-ranking models in PyTerrier using the `fit` method. Here's an example of how to train the `knrm` model on the MS MARCO dataset, which is a large IR collection.

```python
# transfer training signals from a medical sample of MS MARCO
from sklearn.model_selection import train_test_split
train_ds = pt.datasets.get_dataset('irds:msmarco-passage/train/medical')
train_topics, valid_topics = train_test_split(train_ds.get_topics(), test_size=50, random_state=42) # split into training and validation sets

# Index MS MARCO
indexer = pt.index.IterDictIndexer('./terrier_msmarco-passage')
tr_index_ref = indexer.index(train_ds.get_corpus_iter(), fields=('text',), meta=('docno',))

pipeline = (pt.BatchRetrieve(tr_index_ref) % 100 # get top 100 results
            >> pt.text.get_text(train_ds, 'text') # fetch the document text
            >> pt.apply.generic(lambda df: df.rename(columns={'text': 'abstract'})) # rename columns
            >> knrm) # apply neural re-ranker

pipeline.fit(
    train_topics,
    train_ds.get_qrels(),
    valid_topics,
    train_ds.get_qrels())
```

Training deep learning models takes a bit of time (especially for large datasets like MS MARCO), so we've provided a model that's already been trained for you to download.

In [7]:
del knrm # free up the memory before loading a new version of the ranker (helpful for the GPU)
knrm = onir_pt.reranker.from_checkpoint('https://macavaney.us/knrm.medmarco.tar.gz', text_field='abstract', 
                                        expected_md5="d70b1d4f899690dae51161537e69ed5a")

                                                                         

[2022-11-01 17:24:37,886][onir.util.download][DEBUG] downloaded https://macavaney.us/knrm.medmarco.tar.gz [1ms] [1.43k] [1.66MB/s] [md5 hash verified]


                                                                                     

[2022-11-01 17:24:37,897][WordvecHashVocab][DEBUG] [starting] reading cached at /home/maojy/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p
[2022-11-01 17:24:46,423][WordvecHashVocab][DEBUG] [finished] reading cached at /home/maojy/data/onir/vocab/wordvec_hash/fasttext-wiki-news-300d-1M.p [8.53s]


In [8]:
pipeline2 = br >> pt.text.get_text(dataset, 'abstract') >> knrm
pt.Experiment(
    [br, pipeline2],
    topics,
    qrels,
    names=['DPH', 'DPH >> KNRM'],
    baseline=0,
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)

[2022-11-01 17:24:54,822][onir_pt][ERROR] gpu=True, but CUDA is not available. Falling back on CPU.
[2022-11-01 17:24:54,824][onir_pt][DEBUG] [starting] batches


batches:   0%|          | 0/1250 [15ms<?, ?it/s]

[2022-11-01 17:24:58,858][onir_pt][DEBUG] [finished] batches: [4.03s] [1250it] [309.96it/s]


| | name | map | P.10 | ndcg | ndcg_cut.10 | mrt | map + | map - | map p-value | P.10 + | P.10 - | P.10 p-value | ndcg + | ndcg - | ndcg p-value | ndcg_cut.10 + | ndcg_cut.10 - | ndcg_cut.10 p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DPH | 0.058707 | 0.622 | 0.151733 | 0.576693 | 38.694083 | | | | | | | | | | | | |
| 1 | DPH >> KNRM | 0.056452 | 0.584 | 0.147442 | 0.523946 | 117.81626 | 18.0 | 32.0 | 0.169177 | 18.0 | 21.0 | 0.138233 | 16.0 | 34.0 | 0.052831 | 20.0 | 29.0 | 0.038403 |


The tuned performance is a little better than before, but `knrm` still underperforms our first-stage ranking model.
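Passing `baseline=0` adds three columns per metric: `+` and `-` count the queries on which a system scores higher or lower than the baseline, and the p-value comes from a paired significance test over the per-query scores. A minimal sketch of the counting and of a paired t statistic, using made-up per-query scores (the function name and data are hypothetical; PyTerrier computes all of this for you):

```python
import math

def compare_to_baseline(baseline_scores, system_scores):
    """Return (#queries improved, #queries degraded, paired t statistic)."""
    diffs = [s - b for b, s in zip(baseline_scores, system_scores)]
    improved = sum(d > 0 for d in diffs)
    degraded = sum(d < 0 for d in diffs)  # ties count toward neither column
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t_stat = mean / math.sqrt(var / n) if var > 0 else float("nan")
    return improved, degraded, t_stat

baseline = [0.50, 0.40, 0.70, 0.20]   # e.g., per-query P.10 for the baseline
system = [0.60, 0.35, 0.70, 0.30]     # per-query P.10 for the re-ranker
print(compare_to_baseline(baseline, system))
```

Ties explain why `+` and `-` need not add up to the number of topics, as in the `P.10` columns above.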

## Reranking with BERT

Large language models such as [BERT](https://arxiv.org/abs/1810.04805) are much more powerful neural models that have been shown to be effective for ranking, as we discussed in class.

As with `knrm`, we'll start by using BERT for re-ranking with its initial parameters. These parameters have been tuned for masked language modeling (i.e., filling in a blanked-out word) and next-sentence prediction, but have not been tuned for IR at all.

In [9]:
del knrm # clear out memory from KNRM (useful for GPU)
vbert = onir_pt.reranker('vanilla_transformer', 'bert', text_field='abstract', vocab_config={'train': True})

100%|██████████| 231508/231508 [66ms<0ms, 3526462.34B/s]
100%|██████████| 433/433 [1ms<0ms, 647693.88B/s]
100%|██████████| 440473133/440473133 [5.49s<0ms, 80298554.35B/s]  


Let's see how this non-IR-trained model does on the CORD19 data.

In [10]:
pipeline3 = br % 100 >> pt.text.get_text(dataset, 'abstract') >> vbert
pt.Experiment(
    [br, pipeline3],
    topics,
    qrels,
    names=['DPH', 'DPH >> VBERT'],
    baseline=0,
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)

[2022-11-01 17:25:25,032][onir_pt][ERROR] gpu=True, but CUDA is not available. Falling back on CPU.
[2022-11-01 17:25:25,037][onir_pt][DEBUG] [starting] batches


batches:   0%|          | 0/1250 [14ms<?, ?it/s]

[2022-11-01 17:33:16,958][onir_pt][DEBUG] [finished] batches: [07:52] [1250it] [ 2.65it/s]


| | name | map | P.10 | ndcg | ndcg_cut.10 | mrt | map + | map - | map p-value | P.10 + | P.10 - | P.10 p-value | ndcg + | ndcg - | ndcg p-value | ndcg_cut.10 + | ndcg_cut.10 - | ndcg_cut.10 p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DPH | 0.058707 | 0.622 | 0.151733 | 0.576693 | 34.342954 | | | | | | | | | | | | |
| 1 | DPH >> VBERT | 0.046275 | 0.38 | 0.131818 | 0.310408 | 9477.003205 | 7.0 | 43.0 | 1e-06 | 4.0 | 41.0 | 1.698767e-08 | 5.0 | 45.0 | 9.261981e-08 | 4.0 | 45.0 | 1.048371e-09 |


As we see, although the BERT model is pre-trained for modeling language, it doesn't do very well at ranking on our benchmark. To get better performance, we'll need to tune it for the task of relevance ranking.

We can train the model for ranking (as shown above for KNRM) or we can download a trained model. Here, we will use the [SLEDGE](https://arxiv.org/abs/2010.05987) model, which is a BERT model trained on scientific text and tuned on medical queries.

In [11]:
vbert = onir_pt.reranker.from_checkpoint('https://macavaney.us/scibert-medmarco.tar.gz', 
                                         text_field='abstract', expected_md5="854966d0b61543ffffa44cea627ab63b")

                                                                                        

[2022-11-01 17:33:23,083][onir.util.download][DEBUG] downloaded https://macavaney.us/scibert-medmarco.tar.gz [5.86s] [499M] [92.8MB/s] [md5 hash verified]


                                                                                                                                                  

[2022-11-01 17:33:40,469][onir.util.download][DEBUG] downloaded https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar [11.00s] [411M] [42.4MB/s] [md5 hash verified]


extracting: 411MB [2.54s, 162MB/s]                                                                                                                   
extracting: 821MB [7.19s, 114MB/s]  


Let's run another experiment to see how this new model trained for IR does.

In [12]:
pipeline4 = br % 100 >> pt.text.get_text(dataset, 'abstract') >> vbert
pt.Experiment(
    [br, pipeline4],
    topics,
    qrels,
    names=['DPH', 'DPH >> Trained-BERT'],
    baseline=0,
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)

[2022-11-01 17:34:04,772][onir_pt][ERROR] gpu=True, but CUDA is not available. Falling back on CPU.
[2022-11-01 17:34:04,776][onir_pt][DEBUG] [starting] batches


batches:   0%|          | 0/1250 [14ms<?, ?it/s]

[2022-11-01 17:41:00,991][onir_pt][DEBUG] [finished] batches: [06:56] [1250it] [ 3.00it/s]


| | name | map | P.10 | ndcg | ndcg_cut.10 | mrt | map + | map - | map p-value | P.10 + | P.10 - | P.10 p-value | ndcg + | ndcg - | ndcg p-value | ndcg_cut.10 + | ndcg_cut.10 - | ndcg_cut.10 p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DPH | 0.058707 | 0.622 | 0.151733 | 0.576693 | 32.964087 | | | | | | | | | | | | |
| 1 | DPH >> Trained-BERT | 0.066863 | 0.758 | 0.15998 | 0.683735 | 8362.906671 | 36.0 | 14.0 | 0.000264 | 31.0 | 13.0 | 0.000171 | 35.0 | 15.0 | 0.004615 | 32.0 | 18.0 | 0.007476 |


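Training helped a lot! We're able to improve upon the initial ranking from `BatchRetrieve`. However, `mrt` shows that the re-ranking pipeline is slow to run: about 8,363 ms per query versus about 33 ms for `BatchRetrieve` alone. (The log above reports that CUDA was unavailable and the model fell back to the CPU for this run; expect a GPU to be faster, though still well behind the first stage.) This running time underscores the trade-off in using large language models at retrieval time: they may rank better, but they can be much slower.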
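A back-of-the-envelope sketch of this trade-off, assuming that re-ranking time grows roughly linearly with the number of candidates handed to the neural model (an assumption, not a measurement). The per-document cost is derived from the `mrt` values in the table above, which also include the time spent fetching abstracts; the function name is hypothetical.

```python
def estimated_mrt_ms(first_stage_ms, per_doc_ms, cutoff):
    """Estimate mean response time when re-ranking the top `cutoff` documents."""
    return first_stage_ms + per_doc_ms * cutoff

first_stage_ms = 32.96    # DPH alone, from the table above
pipeline_ms = 8362.91     # DPH >> Trained-BERT with a cutoff of 100
per_doc_ms = (pipeline_ms - first_stage_ms) / 100

# Rough estimates for smaller re-ranking depths
for cutoff in (10, 50, 100):
    print(cutoff, round(estimated_mrt_ms(first_stage_ms, per_doc_ms, cutoff), 1))
```

Lowering the cutoff (for example `br % 10`) buys speed, but it also limits how many relevant documents the re-ranker can promote, so effectiveness should be re-measured rather than assumed.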

# Deep learning at indexing time: doc2query

Instead of using our large language models to rerank, another option is to use them at _indexing time_ to augment our documents. In class, we discussed one such option, doc2query, that augments an inverted index structure by predicting queries that may be used to search for the document, and appending those to the document text.
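The idea can be sketched in a few lines: predict some queries for a document and append them to its text before indexing. The generator below is a deliberately silly stand-in for a trained text-to-text model, and both function names are hypothetical; the sketch shows only the data flow, not the real doc2query model.

```python
def expand_document(doc, generate_queries, num_queries=3):
    """Return a copy of doc with predicted queries appended to its text."""
    queries = generate_queries(doc["text"], num_queries)
    expanded = dict(doc)  # leave the original document untouched
    expanded["querygen"] = " ".join(queries)
    expanded["text"] = doc["text"] + " " + expanded["querygen"]
    return expanded

def fake_query_generator(text, n):
    """Stand-in for a trained text-to-text model (purely illustrative)."""
    words = text.lower().split()
    return [f"what is {w}" for w in words[:n]]

doc = {"docno": "d1", "text": "Remdesivir trials for COVID-19"}
print(expand_document(doc, fake_query_generator)["text"])
```

The expensive model runs once per document at indexing time, so query-time cost stays that of an ordinary inverted-index lookup.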

We can run doc2query with the `pyterrier_doc2query` package, which was loaded at the top.

### Loading a pre-trained model

We'll start by using a version of the doc2query model released by the authors that is trained on the MS MARCO collection.

In [13]:
if not os.path.exists("t5-base.zip"):
  !wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip
  !unzip t5-base.zip

--2022-11-01 18:02:07--  https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip
Resolving proxy.arc-ts.umich.edu (proxy.arc-ts.umich.edu)... 141.213.136.200
Connecting to proxy.arc-ts.umich.edu (proxy.arc-ts.umich.edu)|141.213.136.200|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 357139559 (341M) [application/octet-stream]
Saving to: ‘t5-base.zip’


2022-11-01 18:02:13 (87.8 MB/s) - ‘t5-base.zip’ saved [357139559/357139559]

Archive:  t5-base.zip
  inflating: model.ckpt-1004000.data-00000-of-00002  
  inflating: model.ckpt-1004000.data-00001-of-00002  
  inflating: model.ckpt-1004000.index  
  inflating: model.ckpt-1004000.meta  


We can load the model weights by specifying the checkpoint.

In [16]:
!pip install tensorflow

Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow
  Downloading tensorflow-2.10.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (578.1 MB)
Collecting grpcio<2.0,>=1.24.3
  Downloading grpcio-1.50.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB)
Collecting google-auth<3,>=1.6.3
  Downloading google_auth-2.14.0-py2.py3-none-any.whl (175 kB)
Collecting tensorboard-plugin-wit>=1.6.0
  Downloading tensorboard_plugin_wit-1.8.1-py3-none-any.whl (781 kB)
Collecting tensorboard-data-server<0.7.0,>=0.6.0
  Downloading tensorboard_data_server-0.6.1-py3-none-manylinux2010_x86_64.whl (4.9 MB)
Collecting google-auth-oauthlib<0.5,>=0.4.1
  Downloading google_auth_oauthlib-0.4.6-py2.py3-none-any.whl (18 kB)
Collecting markdown>=2.6.8
  Downloading Markdown-3.4.1-py3-none-any.whl (93 kB)
Collecting rsa<5,>=3.1.4
  Downloading rsa-4.9-py3-none-any.whl (34 kB)
Collecting cachetools<6.0,>=2.

In [17]:
doc2query = pyterrier_doc2query.Doc2Query('model.ckpt-1004000', batch_size=8)

2022-11-01 18:05:28.925233: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-01 18:05:29.949690: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/slurm/lib64:
2022-11-01 18:05:29.949711: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-11-01 18:05:30.033203: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-11-01 18:05:32.0850

Doc2query using cpu


### Running doc2query on sample text

Let's see what queries it predicts for the sample document that we've made up:

In [18]:
df = pd.DataFrame([{"docno" : "d1", "text" :"The University of Michigan School of Information (UMSI) delivers innovative, elegant and ethical solutions connecting people, information and technology. The school was one of the first iSchools in the nation and is the premier institution studying and using technology to improve human computer interactions."}])
df.iloc[0].text

'The University of Michigan School of Information (UMSI) delivers innovative, elegant and ethical solutions connecting people, information and technology. The school was one of the first iSchools in the nation and is the premier institution studying and using technology to improve human computer interactions.'

In [19]:
doc2query_df = doc2query(df)
doc2query_df.iloc[0].querygen

'what is umsi what is the name of university of michigan school of information what is umsi'

Not too bad, though the questions are somewhat generic.

### Loading an index of doc2query documents

Let's see how doc2query does at improving performance on the TREC COVID data. Since indexing with doc2query takes a while (because it has to run the deep learning model), we've provided an index with the text already added.

If you would like to index the collection with doc2query yourself (or use doc2query for your course project), you can use the following code:

```python
dataset = pt.get_dataset("irds:cord19/trec-covid")
indexer = (
  pyterrier_doc2query.Doc2Query('model.ckpt-1004000', doc_attr='abstract', batch_size=8, append=True) # apply doc2query on abstracts and append
  >> pt.apply.generic(lambda df: df.rename(columns={'abstract': 'text'})) # rename "abstract" column to "text" for indexing
  >> pt.IterDictIndexer("./doc2query_index_path")) # index the expanded documents
indexref = indexer.index(dataset.get_corpus_iter())
```


In [20]:
if not os.path.exists('doc2query_marco_cord19.zip'):
  !wget http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/doc2query_marco_cord19.zip
  !unzip doc2query_marco_cord19.zip
doc2query_indexref = pt.IndexRef.of('./doc2query_index_path/data.properties')

--2022-11-01 18:05:47--  http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/doc2query_marco_cord19.zip
Resolving proxy.arc-ts.umich.edu (proxy.arc-ts.umich.edu)... 141.213.136.200
Connecting to proxy.arc-ts.umich.edu (proxy.arc-ts.umich.edu)|141.213.136.200|:3128... connected.
Proxy request sent, awaiting response... 302 Found
Location: https://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/doc2query_marco_cord19.zip [following]
--2022-11-01 18:05:48--  https://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/doc2query_marco_cord19.zip
Connecting to proxy.arc-ts.umich.edu (proxy.arc-ts.umich.edu)|141.213.136.200|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 45804576 (44M) [application/zip]
Saving to: ‘doc2query_marco_cord19.zip’


2022-11-01 18:05:51 (18.4 MB/s) - ‘doc2query_marco_cord19.zip’ saved [45804576/45804576]

Archive:  doc2query_marco_cord19.zip
   creating: doc2query_index_path/
  inflating: doc2query_index_path/data.document.fsarrayfile  
  inflating: 

Let's look at the results on TREC COVID by retrieving the top document for each query and merging the results with the relevance judgments (qrels).

In [21]:
dataset = pt.get_dataset('irds:cord19/trec-covid')
pipeline = pt.BatchRetrieve(doc2query_indexref) % 1 >> pt.text.get_text(dataset, 'title')
res = pipeline(dataset.get_topics('title'))
res.merge(dataset.get_qrels(), how='left').head()

  df.drop(df.columns.difference(['qid','query']), 1, inplace=True)


| | qid | docid | docno | rank | score | query | title | label | iteration |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 101299 | jwmrgy5d | 0 | 8.427298 | coronavirus origin | COVID-19 in the heart and the lungs: could we ... | 0.0 | 5.0 |
| 1 | 2 | 182167 | g8grcy5j | 0 | 13.922648 | coronavirus response to weather changes | The Stirling Protocol – Putting the environmen... | 0.0 | 4.0 |
| 2 | 3 | 85678 | tl30wlpy | 0 | 7.22418 | coronavirus immunity | Receptor-dependent coronavirus infection of de... | | |
| 3 | 4 | 145871 | l5fxswfz | 0 | 12.773362 | how do people die from the coronavirus | Analysis on 54 Mortality Cases of Coronavirus ... | 2.0 | 1.5 |
| 4 | 5 | 180990 | 3sepefqa | 0 | 12.99598 | animal models of covid 19 | Current global vaccine and drug efforts agains... | 0.0 | 4.0 |
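A `how='left'` merge keeps every retrieved row and leaves `label` empty when the document was never judged for that query, as for query 3 above. The same logic in plain Python, as a small sketch with a hypothetical function name and two rows taken from the table:

```python
def left_join_qrels(results, qrels):
    """Attach a relevance label to each result; None marks an unjudged document."""
    labels = {(q["qid"], q["docno"]): q["label"] for q in qrels}
    return [dict(r, label=labels.get((r["qid"], r["docno"]))) for r in results]

results = [
    {"qid": "3", "docno": "tl30wlpy", "rank": 0},
    {"qid": "4", "docno": "l5fxswfz", "rank": 0},
]
qrels = [{"qid": "4", "docno": "l5fxswfz", "label": 2}]

for row in left_join_qrels(results, qrels):
    print(row["qid"], row["docno"], row["label"])
```

An unjudged document is not the same as a judged non-relevant one (label 0), although most evaluation measures treat the two alike.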


What kind of queries does doc2query generate for the CORD19 documents?

In [22]:
df = pd.DataFrame(doc for doc in dataset.get_corpus_iter() if doc['docno'] in ('3sepefqa', 'l5fxswfz'))
df = df.rename(columns={'abstract': 'text'})
doc2query_df = doc2query(df)
for querygen, docno, text in zip(doc2query_df['querygen'], doc2query_df['docno'], df['text']):
    print(docno)
    print(querygen)
    print(text)

cord19/trec-covid documents:   0%|          | 0/192509 [15ms<?, ?it/s]

l5fxswfz
number of cases of coronavirus how many people are alive from the coronavirus coronavirus deaths
Since the identification of the first case of coronavirus disease 2019 (COVID-19), the global number of confirmed cases as of March 15, 2020, is 156,400, with total death in 5,833 (3.7%) worldwide. Here, we summarize the morality data from February 19 when the first mortality occurred to 0 am, March 10, 2020, in Korea with comparison to other countries. The overall case fatality rate of COVID-19 in Korea was 0.7% as of 0 am, March 10, 2020.
3sepefqa
what is copid medicine covid 19 causes and treatment covid19 is a disease that causes symptoms and effects
COVID-19 has become one of the biggest health concern, along with huge economic burden. With no clear remedies to treat the disease, doctors are repurposing drugs like chloroquine and remdesivir to treat COVID-19 patients. In parallel, research institutes in collaboration with biotech companies have identified strategies to use vir

## Evaluating the effects of doc2query

Here, we'll change our evaluation setup a bit from what we did before. Rather than compare two models for the same index, we'll instead compare the same model (BM25) with two different ways of indexing (two indices)! Our baseline will be an index of CORD19 without the doc2query additions.
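
To get a feel for why the index alone can change BM25's results, here is a toy sketch that scores one query against the same documents with and without generated queries appended. This is a textbook BM25 variant, not Terrier's implementation. The document texts, the generated queries, and the `bm25_scores` helper are all made up for illustration, and no stemming is applied.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in docs (docno -> text) against query with BM25."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # term absent from this document
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[d] = score
    return scores


# Toy "original" documents and toy doc2query-style generated queries.
docs = {
    "d1": "total death and case fatality rate in korea",
    "d2": "doctors are repurposing drugs like chloroquine and remdesivir",
}
generated = {
    "d1": "how many coronavirus deaths",
    "d2": "covid 19 causes and treatment",
}
# The expanded "index" is just each document with its generated queries appended.
expanded = {d: docs[d] + " " + generated[d] for d in docs}

query = "coronavirus deaths"
plain_scores = bm25_scores(query, docs)
expanded_scores = bm25_scores(query, expanded)
print("original:", plain_scores)
print("expanded:", expanded_scores)
```

In this toy setup the query terms only appear in the generated text, so the expanded collection is the only one where `d1` gets a nonzero score.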

Let's load a copy of the CORD19 index that we used earlier.

In [23]:
indexref = pt.IndexRef.of('./terrier_cord19/data.properties')

### Task 2: Write the `Experiment` to compare indices (5 points)
Run an `Experiment` that compares a `BM25` ranker over the two indices `indexref` and `doc2query_indexref`. Note that our doc2query model was trained on MS MARCO, which isn't the same kind of collection as CORD19, so this performance tells us how well that model can transfer to a new setting.

You should evaluate using the metrics "map", "ndcg", and "ndcg_cut.10".
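
If it helps to see what these metrics measure, here is a rough sketch of average precision (averaged over queries to get "map") and nDCG with an optional rank cutoff (as in "ndcg_cut.10"). It assumes linear gains and a log2 rank discount. The function names and the toy ranking are hypothetical, and this is not the evaluation code that PyTerrier runs.

```python
import math


def average_precision(ranking, rels):
    """ranking: list of docnos; rels: docno -> graded label (> 0 means relevant)."""
    n_rel = sum(1 for g in rels.values() if g > 0)
    if n_rel == 0:
        return 0.0
    hits, total = 0, 0.0
    for i, docno in enumerate(ranking, start=1):
        if rels.get(docno, 0) > 0:
            hits += 1
            total += hits / i  # precision at this relevant document's rank
    return total / n_rel


def ndcg(ranking, rels, k=None):
    """nDCG over the full ranking, or over the top k results if k is given."""
    cut = ranking if k is None else ranking[:k]
    dcg = sum(
        max(rels.get(docno, 0), 0) / math.log2(i + 1)
        for i, docno in enumerate(cut, start=1)
    )
    # Ideal ranking: all relevant documents sorted by decreasing label.
    ideal = sorted((g for g in rels.values() if g > 0), reverse=True)
    if k is not None:
        ideal = ideal[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: "e" is relevant but never retrieved.
ranking = ["a", "b", "c", "d"]
rels = {"a": 0, "b": 2, "d": 1, "e": 1}
print("AP:", average_precision(ranking, rels))
print("nDCG:", ndcg(ranking, rels))
print("nDCG@10:", ndcg(ranking, rels, k=10))
```

Unretrieved and unjudged documents both count as gain 0 here, so a system is penalized for relevant documents it never returns.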

In [24]:
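# Load both indices: the original CORD19 index and the doc2query-expanded one.
index1 = pt.IndexFactory.of(indexref)
index2 = pt.IndexFactory.of(doc2query_indexref)

# Use the same BM25 ranker over each index, so only the indexing differs.
idx = pt.BatchRetrieve(index1, wmodel="BM25")
doc2q = pt.BatchRetrieve(index2, wmodel="BM25")

pt.Experiment(
    [doc2q, idx],
    topics,
    qrels,
    names=["BM25 (doc2query index)", "BM25 (original index)"],
    eval_metrics=["map", "ndcg", 'ndcg_cut.10']
)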

JavaException: JVM exception occurred: No IndexLoaders were supported for indexref ./terrier_cord19/data.properties; It may be your ref has the wrong location. Alternatively, Terrier is misconfigured - did you import the correct package to deal with this indexref? java.lang.UnsupportedOperationException
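
The exception above suggests the index reference may point to the wrong location. One way to catch that before calling into Terrier is to check that the `data.properties` file exists. This is a small sketch; `find_index_properties` is a hypothetical helper, not part of PyTerrier.

```python
import os


def find_index_properties(index_dir):
    """Return the path to data.properties in index_dir, or raise a clear error.

    Hypothetical helper for illustration; not part of PyTerrier.
    """
    path = os.path.join(index_dir, "data.properties")
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"No index found at {os.path.abspath(path)}; "
            "check the directory or rebuild the index."
        )
    return path


# Intended use in the notebook (commented out because it needs PyTerrier):
# indexref = pt.IndexRef.of(find_index_properties('./terrier_cord19'))
```

The absolute path in the error message makes it easier to spot a notebook that was started from a different working directory.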

# Task 3: Train a new model! (30 points)

All of the prior exercises have had you working with either off-the-shelf models (not trained for IR) or models that someone else has trained for you. To give you a sense of how to train a model, your primary task in this notebook is to train a simple `knrm` model, which should be relatively efficient to train on a GPU. 

To keep things simple, we'll use the same setup for CORD19 that we did in Part 2 (30 queries in train, 5 in dev, 15 queries in test), which is still relatively small for training a deep learning model but will get you started on the process.

Your tasks are the following:
- Load the CORD19 dataset and split it into train, dev, and test. Your test set should have 15 queries, the dev set 5 queries, and the rest go in train. Use a seed of 42 for the `random_state` so that your results are consistent with what we expect.
- Create a new `knrm` ranker and a pipeline that uses it
- Run an `Experiment` comparing four models:
  - a default `BatchRetrieve`, filtering to the top 100 results
  - BM25, filtering to the top 100 results
  - a pipeline that feeds the top 100 results of the default `BatchRetrieve` to your `knrm` model
  - a pipeline that feeds the top 100 results of BM25 to your `knrm` model
  
Your `Experiment` should evaluate with metrics `"map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt'`
  
We expect to see the `Experiment`'s results in the final cell. You are, of course, welcome to try training any of the fancier models to see how they do as well!

In [25]:
from sklearn.model_selection import train_test_split

RANK_CUTOFF = 10
SEED = 42

# Hold out 15 test queries first, then 5 dev queries; the rest are used for training.
tr_va_topics, test_topics = train_test_split(topics, test_size=15, random_state=SEED)
train_topics, valid_topics = train_test_split(tr_va_topics, test_size=5, random_state=SEED)
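
A quick sanity check on any split is that the three sets have the expected sizes and share no queries. The sketch below is a plain-Python stand-in for the two-stage split. It assumes 50 topics (30 + 5 + 15), uses the hypothetical helper `split_topics`, and will not reproduce the exact assignment that `train_test_split` makes.

```python
import random


def split_topics(qids, n_test=15, n_dev=5, seed=42):
    """Shuffle the query ids with a fixed seed, then carve out test and dev sets."""
    shuffled = list(qids)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# Assumption: 50 topics with ids "1" to "50".
qids = [str(i) for i in range(1, 51)]
train, dev, test = split_topics(qids)

# Sizes should be 30 / 5 / 15, and no query should appear in two sets.
print(len(train), len(dev), len(test))
print(set(train) & set(dev), set(train) & set(test), set(dev) & set(test))
```

The same two checks can be run on the real split by comparing the `qid` columns of the three topic frames.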

In [26]:
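# KNRM re-ranker over document abstracts; both pipelines share this one model.
knrm = onir_pt.reranker('knrm', 'bert', text_field='abstract', vocab_config={'train': True})

# First-stage retrievers, each cut off at the top 100 results.
br = pt.BatchRetrieve(index) % 100
bm25 = pt.BatchRetrieve(index, wmodel="BM25") % 100

# Re-ranking pipelines: retrieve, fetch the abstract text, then score with KNRM.
pipeline1 = br >> pt.text.get_text(dataset, 'abstract') >> knrm
pipeline2 = bm25 >> pt.text.get_text(dataset, 'abstract') >> knrm

# Train KNRM on the training queries, validating on the dev queries.
pipeline1.fit(train_topics, qrels, valid_topics, qrels)

# Evaluate on the held-out test queries; one name per system, in the same order.
pt.Experiment(
    [br, bm25, pipeline1, pipeline2],
    test_topics,
    qrels,
    names=['DPH', 'BM25', 'DPH >> KNRM', 'BM25 >> KNRM'],
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)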