# JHU JSALT Summer School IR Laboratory -- Part 2 (extra)

This notebook is mainly borrowed from the series of Colab notebooks created for the [CIKM 2021](https://www.cikm2021.org/) Tutorial entitled '**IR From Bag-of-words to BERT and Beyond through Practical Experiments**'. For more information, please visit their [repository](https://github.com/terrier-org/cikm2021tutorial).

In particular, in this notebook you will:

 - Re-rank documents using neural models like KNRM, Vanilla BERT, EPIC, and monoT5.
 - Use DeepCT and doc2query to augment documents for lexical retrieval functions like BM25.

## Setup

In the following, we will set up the libraries required to execute the notebook.

### PyTerrier installation

The following cell installs the latest release of the [PyTerrier](https://github.com/terrier-org/pyterrier) package.

In [None]:
!pip install -q --upgrade python-terrier

### PyTerrier plugins installation

We install the [OpenNIR](https://opennir.net/), [monoT5](https://github.com/terrierteam/pyterrier_t5), [DeepCT](https://github.com/terrierteam/pyterrier_deepct) and [doc2query](https://github.com/terrierteam/pyterrier_doc2query) PyTerrier plugins. You can safely ignore the package versioning errors.

In [None]:
!pip install -q --upgrade git+https://github.com/Georgetown-IR-Lab/OpenNIR
!pip install -q --upgrade git+https://github.com/terrierteam/pyterrier_t5
!pip install -q --upgrade git+https://github.com/terrierteam/pyterrier_deepct.git
!pip install -q --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git

## Preliminary steps

These lines are needed for DeepCT and to make TensorFlow quieter.

In [None]:
%tensorflow_version 1.x
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
assert tf.__version__.startswith("1"), "TF 1 is required by DeepCT; on Colab, use %tensorflow_version 1.x"
tf.logging.set_verbosity(tf.logging.ERROR)

**[PyTerrier](https://github.com/terrier-org/pyterrier) initialization**

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org/) IR platform. We also import the [OpenNIR](https://opennir.net/) PyTerrier bindings.

In [None]:
import pyterrier as pt
if not pt.started():
    pt.init()
from pyterrier.measures import * # allow for natural measure names
import onir_pt

### [TREC-COVID19](https://ir.nist.gov/covidSubmit/) Dataset download

The following cell downloads the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) dataset that we will use in the remainder of this notebook.

In [None]:
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
topics = dataset.get_topics(variant='description')
qrels = dataset.get_qrels()

### Terrier inverted index download

To save a few minutes, we use a pre-built Terrier inverted index for the TREC-COVID19 collection ([`'terrier_stemmed'`](http://data.terrier.org/trec-covid.dataset.html#terrier_stemmed) version). Note that the cell below loads the `'terrier_stemmed_positions'` variant of this index. The download took a few seconds for us.

In [None]:
index = pt.get_dataset('trec-covid').get_index('terrier_stemmed_positions')

## Re-Rankers from scratch

Let's start exploring a few neural re-ranking methods! We can build them from scratch using `onir_pt.reranker`.

An OpenNIR reranking model consists of:
 - `ranker` (e.g., `drmm`, `knrm`, or `pacrr`). This defines the neural ranking architecture.
 - `vocab` (e.g., `wordvec_hash`, or `bert`). This defines how text is encoded by the model. This approach makes it easy to swap out different text representations.

This line will take a few minutes to run as it downloads and prepares the word vectors.

In [None]:
knrm = onir_pt.reranker('knrm', 'wordvec_hash', text_field='title_abstract')

Let's look at how well these models work at ranking!

In [None]:
br = pt.BatchRetrieve(index) % 50
# build a sub-pipeline to get the concatenated title and abstract text
get_title_abstract = pt.text.get_text(dataset, 'title') >> pt.text.get_text(dataset, 'abstract') >> pt.apply.title_abstract(lambda r: r['title'] + ' ' + r['abstract'])
pipeline = br >> get_title_abstract >> knrm
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> KNRM'],
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10]
)

This doesn't work very well because the model is not trained; it's using random weights to combine the scores from the similarity matrix.
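
To make "combining the scores from the similarity matrix" more concrete, here is a minimal pure-Python sketch of KNRM-style kernel pooling. The kernel centres, the kernel width, the toy similarity values and the random weights are all illustrative assumptions, not the values OpenNIR uses, and the final `tanh` of the original model is omitted because it does not change the ranking.

```python
import math
import random

def kernel_features(sim_matrix, mus, sigma=0.1):
    """Soft-count query/document term similarities with Gaussian kernels.

    sim_matrix[i][j] is the cosine similarity between query term i and
    document term j. Returns one pooled feature per kernel.
    """
    features = []
    for mu in mus:
        total = 0.0
        for row in sim_matrix:  # one row per query term
            # soft count of document terms whose similarity is close to mu
            soft_tf = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            total += math.log(max(soft_tf, 1e-10))  # log, then sum over query terms
        features.append(total)
    return features

def knrm_score(sim_matrix, mus, weights, bias=0.0):
    """Linear combination of the kernel features; these weights must be learned."""
    feats = kernel_features(sim_matrix, mus)
    return sum(w * f for w, f in zip(weights, feats)) + bias

# Toy similarity matrix: 2 query terms x 3 document terms (made-up values).
sim = [[1.0, 0.2, -0.1],
       [0.3, 0.9, 0.0]]
mus = [-0.5, 0.0, 0.5, 1.0]  # illustrative kernel centres

random.seed(0)
random_weights = [random.uniform(-1, 1) for _ in mus]  # an "untrained" model
print(knrm_score(sim, mus, random_weights))
```

With random weights, nothing tells the model that the exact-match kernel (centre 1.0) should count for more than the others, which is why the untrained re-ranker above tends to shuffle the first-stage results rather than improve them.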

## Loading a trained re-ranker

You can train re-ranking models in PyTerrier using the `fit` method. This takes a bit of time, so we'll download a model that's already been trained. If you'd like to train the model yourself, you can use:

```python
# transfer training signals from a medical sample of MS MARCO
from sklearn.model_selection import train_test_split
train_ds = pt.datasets.get_dataset('irds:msmarco-passage/train/medical')
train_topics, valid_topics = train_test_split(train_ds.get_topics(), test_size=50, random_state=42) # split into training and validation sets

# Index MS MARCO
indexer = pt.index.IterDictIndexer('./terrier_msmarco-passage')
tr_index_ref = indexer.index(train_ds.get_corpus_iter(), fields=('text',), meta=('docno',))

pipeline = (pt.BatchRetrieve(tr_index_ref) % 100 # get top 100 results
            >> pt.text.get_text(train_ds, 'text') # fetch the document text
            >> pt.apply.generic(lambda df: df.rename(columns={'text': 'title_abstract'})) # rename to the column that knrm reads (its text_field)
            >> knrm) # apply neural re-ranker

pipeline.fit(
    train_topics,
    train_ds.get_qrels(),
    valid_topics,
    train_ds.get_qrels())
```

In [None]:
del knrm # free up the memory before loading a new version of the ranker
knrm = onir_pt.reranker.from_checkpoint('https://macavaney.us/knrm.medmarco.tar.gz', text_field='title_abstract', expected_md5="d70b1d4f899690dae51161537e69ed5a")

In [None]:
pipeline = br >> get_title_abstract >> knrm
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> KNRM'],
    baseline=0,
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10]
)

That's a little better than before, but it still underperforms our first-stage ranking model.

## Vanilla BERT

Contextualized language models, such as [BERT](https://arxiv.org/abs/1810.04805), are much more powerful neural models that have been shown to be effective for ranking.

We'll try using a "vanilla" (or "mono") version of the BERT model. The BERT model is pre-trained on the tasks of masked language modeling and next sentence prediction.
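
In a "vanilla" setup, the query and the document are typically passed through BERT together as a single sequence, and the representation of the `[CLS]` token is mapped to a relevance score. The sketch below only shows how such an input is laid out. The function name, whitespace tokenization (a stand-in for WordPiece) and `max_len` are assumptions for illustration, not OpenNIR's actual code.

```python
def build_cross_encoder_input(query, doc, max_len=16):
    """Lay out a query/document pair as a single BERT-style input sequence."""
    q_tokens = query.lower().split()  # stand-in for WordPiece tokenization
    d_tokens = doc.lower().split()
    # reserve room for [CLS] and two [SEP] markers, then truncate the document
    budget = max_len - len(q_tokens) - 3
    d_tokens = d_tokens[:max(budget, 0)]
    tokens = ['[CLS]'] + q_tokens + ['[SEP]'] + d_tokens + ['[SEP]']
    # segment 0 covers [CLS], the query and the first [SEP]; segment 1 the rest
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(d_tokens) + 1)
    return tokens, segment_ids

tokens, segments = build_cross_encoder_input(
    'coronavirus origin',
    'The origin of the novel coronavirus is still under investigation')
print(tokens)
print(segments)
```

Because every query/document pair needs its own forward pass through the full model, this design is accurate but expensive at query time.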

In [None]:
del knrm # clear out memory from KNRM
vbert = onir_pt.reranker('vanilla_transformer', 'bert', text_field='title_abstract', vocab_config={'train': True})

Let's see how this model does on TREC COVID.

In [None]:
pipeline = br % 50 >> get_title_abstract >> vbert
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> VBERT'],
    baseline=0,
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10]
)

As we see, although the model is pre-trained, it doesn't do very well at ranking on our benchmark. This is because it's not tuned for the task of relevance ranking.

We can train the model for ranking (as shown above for KNRM) or we can download a trained model. Here, we use the [SLEDGE](https://arxiv.org/abs/2010.05987) model, which is a Vanilla BERT model trained on scientific text and tuned on medical queries.

In [None]:
sledge = onir_pt.reranker.from_checkpoint('https://macavaney.us/scibert-medmarco.tar.gz', text_field='title_abstract', expected_md5="854966d0b61543ffffa44cea627ab63b")

In [None]:
pipeline = br % 50 >> get_title_abstract >> sledge
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> SLEDGE'],
    baseline=0,
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10, 'mrt']
)

That's much better! We're able to significantly improve upon the first stage ranker. But the mean response time (`mrt`) shows that this is pretty slow to run.

## EPIC

Some models focus on query-time computational efficiency. The [EPIC](https://arxiv.org/abs/2004.14245) model builds light-weight document representations that are independent of the query. This means that they can be computed ahead of time. You can index the corpus yourself with the following code (but it takes a while):

```python
indexed_epic = onir_pt.indexed_epic.from_checkpoint('https://macavaney.us/epic.msmarco.tar.gz', index_path='./epic_cord19')
indexed_epic.index(dataset.get_corpus_iter(), fields=('title', 'abstract'))
```

Instead, we'll download a copy of the EPIC-processed documents:

In [None]:
import os
if not os.path.exists('epic_cord19.zip'):
  !wget http://macavaney.us/epic_cord19.zip
  !unzip epic_cord19.zip
indexed_epic = onir_pt.indexed_epic.from_checkpoint('https://macavaney.us/epic.msmarco.tar.gz', index_path='./epic_cord19')

We can now run this model over the results of a first-stage ranker. Note how we do not need to fetch the document text with `pt.text.get_text`, which further saves time.

In [None]:
br = pt.BatchRetrieve(index) % 50
pipeline = (br >> indexed_epic.reranker())
pt.Experiment(
    [br, pipeline],
    dataset.get_topics('description'),
    dataset.get_qrels(),
    names=['DPH', 'DPH >> EPIC (indexed)'],
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10, "mrt"]
)
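
The reason no document text is needed at query time can be sketched as follows: document representations are computed once, offline, and re-ranking reduces to a dot product with a lightweight query representation. This is a simplified illustration under assumed data structures (a plain dict of term weights with made-up values); it is not the storage format or scoring code used by `indexed_epic`.

```python
# Offline: each document is mapped once to {term: importance}; values are made up.
doc_index = {
    'd1': {'coronavirus': 2.1, 'origin': 1.7, 'bat': 0.9},
    'd2': {'coronavirus': 1.8, 'weather': 1.5, 'humidity': 1.2},
}

def score(query_weights, doc_vector):
    """Dot product over the query's terms; no document text or model call needed."""
    return sum(w * doc_vector.get(term, 0.0) for term, w in query_weights.items())

def rerank(query_weights, candidates):
    """Re-order first-stage candidates using the precomputed document vectors."""
    scored = [(docno, score(query_weights, doc_index[docno])) for docno in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Query time: only the (short) query has to be encoded; weights are made up.
query = {'coronavirus': 1.0, 'weather': 2.0}
print(rerank(query, ['d1', 'd2']))
```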

## Tuning re-ranking threshold

[Prior work suggests](https://arxiv.org/pdf/1904.12683.pdf) that the re-ranking cutoff threshold is an important model hyperparameter. Let's see how this parameter affects EPIC.

In [None]:
cutoffs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
dph = pt.BatchRetrieve(index)
res = pt.Experiment(
    [dph % cutoff >> indexed_epic.reranker() for cutoff in cutoffs],
    dataset.get_topics('description'),
    dataset.get_qrels(),
    names=[f'c={cutoff}' for cutoff in cutoffs],
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10, "mrt"]
)
res

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline
plt.plot(res['name'], res['nDCG@10'], label='nDCG@10')
plt.plot(res['name'], res['P(rel=2)@10'], label='P(rel=2)@10')
plt.ylabel('value')
plt.legend()
plt.show()
plt.clf()
plt.plot(res['name'], res['mrt'])
plt.ylabel('mrt')
plt.show()

It appears that the optimal re-ranking threshold for this collection is around 50-70. This also avoids excessive re-ranking time, which grows roughly linearly with larger thresholds. In practice, this parameter should be tuned on a held-out validation set to avoid over-fitting.
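
The roughly linear growth in re-ranking time follows directly from what the cutoff does: only the top-k first-stage results are passed to the re-ranker, and each of them costs one scorer call. A minimal sketch with a hypothetical scorer that counts its own calls (the function, the class and the toy ranking are made up for illustration):

```python
def rerank_top_k(first_stage, scorer, k):
    """Keep the top-k first-stage results and re-order them with `scorer`.

    first_stage: list of (docno, score) pairs sorted by descending score.
    scorer: callable mapping a docno to a re-ranker score.
    """
    candidates = first_stage[:k]  # plays the role of `retriever % k`
    rescored = [(docno, scorer(docno)) for docno, _ in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

class CountingScorer:
    """Stand-in for an expensive neural scorer; counts how often it is called."""
    def __init__(self):
        self.calls = 0

    def __call__(self, docno):
        self.calls += 1
        return -int(docno[1:])  # arbitrary rule: prefer lower document numbers

# Toy first-stage ranking with descending scores.
first_stage = [(docno, 10 - rank)
               for rank, docno in enumerate(['d5', 'd3', 'd9', 'd1', 'd7', 'd2'])]

for k in (2, 4, 6):
    scorer = CountingScorer()
    ranking = rerank_top_k(first_stage, scorer, k)
    print(k, scorer.calls, [docno for docno, _ in ranking])
```

A larger cutoff gives the re-ranker more chances to promote relevant documents, but also more chances to promote non-relevant ones, which is why effectiveness does not have to keep increasing with k.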

## monoT5

The [monoT5](https://arxiv.org/abs/2003.06713) model scores documents using a sequence-to-sequence language model (T5). Let's see how this approach works on TREC COVID.
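
As a rough illustration of how monoT5 turns text generation into a relevance score: the query and document are placed in a prompt template, and the score is derived from how strongly the model prefers generating "true" over "false". The sketch below follows the template from the monoT5 paper. The function names and the logits are assumptions standing in for real model output, and the exact score that `pyterrier_t5` returns may be computed differently.

```python
import math

def monot5_prompt(query, doc):
    """Input template described in the monoT5 paper."""
    return f'Query: {query} Document: {doc} Relevant:'

def relevance_score(true_logit, false_logit):
    """Probability of 'true' under a softmax restricted to the two target tokens."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)

print(monot5_prompt('what causes death from covid 19', 'Some abstract text ...'))
# The logits below are made-up numbers standing in for the model's output.
print(relevance_score(true_logit=2.0, false_logit=-1.0))
```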

The `MonoT5ReRanker` class from `pyterrier_t5` automatically loads a version of the monoT5 ranker that is trained on the MS MARCO passage dataset.

In [None]:
from pyterrier_t5 import MonoT5ReRanker
monoT5 = MonoT5ReRanker(text_field='title_abstract')

In [None]:
br = pt.BatchRetrieve(index) % 50
pipeline = (br >> get_title_abstract >> monoT5)
pt.Experiment(
    [br, pipeline],
    dataset.get_topics('description'),
    dataset.get_qrels(),
    names=['DPH', 'DPH >> T5'],
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10, "mrt"]
)

## DeepCT

Recall that the DeepCT model repeats terms based on their estimated importance. This repetition boosts the weight of those terms in an inverted index structure.
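
The idea can be sketched in a few lines: each term receives an importance estimate, which is scaled and rounded to an integer number of repetitions, so a standard indexer sees it as a term frequency. The function name, the weights and the scaling factor of 100 are illustrative assumptions; the transformer in `pyterrier_deepct` may scale and round differently.

```python
def expand_by_importance(term_weights, scale=100):
    """Repeat each term round(weight * scale) times; terms rounding to 0 vanish."""
    repeated = []
    for term, weight in term_weights:
        repeated.extend([term] * round(weight * scale))
    return ' '.join(repeated)

# Made-up importance estimates in [0, 1] for a short passage.
weights = [('cikm', 0.05), ('conference', 0.03), ('held', 0.01), ('virtually', 0.002)]
print(expand_by_importance(weights))
```

Note that a term with a very low estimated importance is repeated zero times, so it disappears from the indexed text entirely.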

We provide an interface to the DeepCT model in the `pyterrier_deepct` package:

In [None]:
import pyterrier_deepct

### Loading a pre-trained model

We will load the pre-trained version of DeepCT provided by the authors.

In [None]:
if not os.path.exists("marco.zip"):
  !wget http://boston.lti.cs.cmu.edu/appendices/arXiv2019-DeepCT-Zhuyun-Dai/outputs/marco.zip
  !unzip marco.zip
if not os.path.exists("uncased_L-12_H-768_A-12.zip"):
  !wget https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip
  !unzip uncased_L-12_H-768_A-12.zip
  !mkdir -p bert-base-uncased
  !mv vocab.txt bert_* bert-base-uncased/

Loading a model is as simple as specifying the model configuration and weight file:

In [None]:
deepct = pyterrier_deepct.DeepCTTransformer("bert-base-uncased/bert_config.json", "marco/model.ckpt-65816")

### Running on sample text

We can transform a dataframe with a sample document to observe the effect of DeepCT:

In [None]:
import pandas as pd
df = pd.DataFrame([{"docno" : "d1", "text" :"The 30th ACM International Conference on Information and Knowledge Management (CIKM) is held virtually due to the COVID-19 pandemic."}])
df.iloc[0].text

In [None]:
deepct_df = deepct(df)
deepct_df.iloc[0].text

(You may need to expand the text using the \[...\] button at the end of the text.)

Interesting, right? We can see a lot of terms are expanded. Let's use `Counter` to see which are the most important terms.

In [None]:
from collections import Counter
Counter(deepct_df.iloc[0].text.split()).most_common()

As you can see, DeepCT considers "Conference", "CIKM", and "ACM" to be the most important terms in the document. Not bad choices. However, it completely removes the word "virtually".

### Loading an index of DeepCT documents

It takes too long to run DeepCT over the entire CORD19 collection in a tutorial setting, so we provide a version of the index for download.

If you would like to index the collection with DeepCT yourself, you can use:

```python
dataset = pt.get_dataset("irds:cord19/trec-covid")
indexer = (
  pt.apply.generic(lambda df: df.rename(columns={'abstract': 'text'})) # rename "abstract" column to "text"
  >> deepct # apply DeepCT transformation
  >> pt.IterDictIndexer("./deepct_index_path")) # index the modified documents
indexref = indexer.index(dataset.get_corpus_iter())
```

In [None]:
if not os.path.exists('deepct_marco_cord19.zip'):
  !wget http://www.dcs.gla.ac.uk/~craigm/cikm2021-tutorial/deepct_marco_cord19.zip
  !unzip deepct_marco_cord19.zip
deepct_indexref = pt.IndexRef.of('./deepct_index_path')

How well does DeepCT perform on TREC COVID? Let's run an experiment.

In [None]:
pt.Experiment(
    [br, pt.BatchRetrieve(deepct_indexref)],
    dataset.get_topics('description'),
    dataset.get_qrels(),
    names=['DPH', 'DeepCT'],
    baseline=0,
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10]
)

Recall improves over DPH (so unbounded nDCG improves), but the top results suffer (nDCG@10 and P(rel=2)@10 are reduced). Let's dig into the top results for each of the TREC COVID queries to see what's happening.

In [None]:
pipeline = pt.BatchRetrieve(deepct_indexref) % 1 >> pt.text.get_text(dataset, 'title')
res = pipeline(topics)
res.merge(qrels, how='left').head()

Ouch: the top documents for queries 2, 3, and 4 are non-relevant (the top doc for query 1 wasn't judged). Let's dig deeper into those source documents.

In [None]:
df = pd.DataFrame(doc for doc in dataset.get_corpus_iter() if doc['docno'] in ('g8grcy5j', '2c4jk2ms', 'mtjs9zv9'))
df = df.rename(columns={'abstract': 'text'})
deepct_df = deepct(df)
print('deepct-transformed documents')
for deepct_text, docno, text in zip(deepct_df['text'], deepct_df['docno'], df['text']):
  print(docno)
  print(Counter(deepct_text.split()).most_common(10))
  print(text)

As we can see, the document ranked highest for "*will sars cov2 infected people develop immunity*" (2c4jk2ms) gives high scores to the term sars, overpowering the other query terms.

The top document for "*what causes death from covid 19*" (mtjs9zv9) has high scores for covid, 19, and death, but is discussing the topic with respect to HIV rather than COVID itself. This underscores the limitations of bag-of-words scoring, which ignores how the terms relate to each other in context.

The top document for "*how does the coronavirus respond to changes in weather*" (g8grcy5j) discusses the potential for change in climate policy as a result of COVID-19, not how the virus responds to weather. DeepCT picks up on this theme and gives weather-related words high importance.

## doc2query

Recall that doc2query augments an inverted index structure by predicting queries that may be used to search for the document, and appending those to the document text.
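
The effect on the indexed text can be sketched without any model: the predicted queries are simply concatenated onto the document, which adds new terms and raises the frequency of existing ones. The function name and the "predicted" queries below are made up for illustration.

```python
from collections import Counter

def expand_document(text, predicted_queries):
    """Append predicted queries to the document text before indexing."""
    return text + ' ' + ' '.join(predicted_queries)

doc = 'CIKM is held virtually due to the COVID-19 pandemic.'
queries = ['where is cikm held', 'why is cikm virtual']  # made-up model output
expanded = expand_document(doc, queries)
print(expanded)

# Compare term frequencies before and after expansion.
before = Counter(doc.lower().split())
after = Counter(expanded.lower().split())
print({term: (before[term], after[term]) for term in ('cikm', 'held', 'where')})
```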

We provide an interface to doc2query using the `pyterrier_doc2query` package:

In [None]:
import pyterrier_doc2query

### Loading a pre-trained model

We will again use a version of the doc2query model released by the authors that is trained on the MS MARCO collection.

In [None]:
import os
if not os.path.exists("t5-base.zip"):
  !wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip
  !unzip t5-base.zip

We can load the model weights by specifying the checkpoint.

In [None]:
doc2query = pyterrier_doc2query.Doc2Query('model.ckpt-1004000', batch_size=8)

### Running on sample text

Let's see what queries it predicts for the sample document:

In [None]:
import pandas as pd
df = pd.DataFrame([{"docno" : "d1", "text" :"The 30th ACM International Conference on Information and Knowledge Management (CIKM) is held virtually due to the COVID-19 pandemic."}])
df.iloc[0].text

In [None]:
doc2query_df = doc2query(df)
doc2query_df.iloc[0].querygen

Doc2query can generate some reasonable questions (e.g., "*where is cikm held*"), but it can also generate some that are off-topic and introduce non-relevant terms (e.g., cicm, cmim).

### Loading an index of doc2query documents

Let's see how it does on TREC COVID. Again, it takes too long to index in a tutorial setting, so we provide an index.

If you would like to index the collection with doc2query yourself, you can use:

```python
dataset = pt.get_dataset("irds:cord19/trec-covid")
indexer = (
  pyterrier_doc2query.Doc2Query('model.ckpt-1004000', doc_attr='abstract', batch_size=8, append=True) # apply doc2query on abstracts and append
  >> pt.apply.generic(lambda df: df.rename(columns={'abstract': 'text'})) # rename "abstract" column to "text" for indexing
  >> pt.IterDictIndexer("./doc2query_index_path")) # index the expanded documents
indexref = indexer.index(dataset.get_corpus_iter())
```

In [None]:
if not os.path.exists('doc2query_marco_cord19.zip'):
  !wget http://www.dcs.gla.ac.uk/~craigm/cikm2021-tutorial/doc2query_marco_cord19.zip
  !unzip doc2query_marco_cord19.zip
doc2query_indexref = pt.IndexRef.of('./doc2query_index_path')

Let's see how doc2query performs on TREC COVID:

In [None]:
pt.Experiment(
    [br, pt.BatchRetrieve(doc2query_indexref)],
    topics,
    qrels,
    names=['DPH', 'doc2query'],
    baseline=0,
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10]
)

Similar to DeepCT, we see that the approach can significantly improve recall-oriented measures, but doesn't help with precision-oriented measures.
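
To see how a run can gain on unbounded nDCG while a cutoff measure such as nDCG@10 does not move, consider the toy example below. It uses the common log2 discount with made-up binary labels; the helper functions are hypothetical and are a simplified stand-in for the evaluation tools behind `pt.Experiment`, not a reimplementation of them.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_gains, all_gains, k=None):
    """nDCG of a ranking; `all_gains` lists the labels of every relevant document."""
    ideal = sorted(all_gains, reverse=True)
    if k is not None:
        ranked_gains, ideal = ranked_gains[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Made-up example: 12 relevant documents (label 1) exist for the query.
all_gains = [1] * 12
# Run A retrieves 4 of them, all inside the top 10.
run_a = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
# Run B has the same top 10 but retrieves 6 more relevant documents further down.
run_b = run_a + [0] * 10 + [1] * 6

for name, run in [('A', run_a), ('B', run_b)]:
    print(name, round(ndcg(run, all_gains), 3), round(ndcg(run, all_gains, k=10), 3))
```

Run B scores higher on unbounded nDCG because of the extra relevant documents deep in the ranking, while both runs are identical on nDCG@10.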

Let's again investigate the top results.

In [None]:
pipeline = pt.BatchRetrieve(doc2query_indexref) % 1 >> pt.text.get_text(dataset, 'title')
res = pipeline(topics)
res.merge(qrels, how='left').head()

Let's take a look at what queries it generates for some of these documents:

In [None]:
df = pd.DataFrame(doc for doc in dataset.get_corpus_iter() if doc['docno'] in ('124czudi', 'gtp01rna'))
df = df.rename(columns={'abstract': 'text'})
doc2query_df = doc2query(df)
for querygen, docno, text in zip(doc2query_df['querygen'], doc2query_df['docno'], df['text']):
  print(docno)
  print(querygen)
  print(text)

For "*what causes death from covid 19*" (gtp01rna), the top document focuses on the deaths from COVID in the US, but not on the specific causes due to COVID.

For "*how does the coronavirus respond to changes in weather*" (124czudi), the top document is about climate change (similar to DeepCT).

# That's all folks

If you aren't coming back for Part 4 of the tutorial, please don't forget to complete our exit quiz: https://forms.office.com/r/RiYSAxAKhk