# MSMARCO Document Ranking Task using PyTerrier - uogTrBaseDPH

This notebook demonstrates indexing and performing a baseline DPH run for the MSMARCO Document Ranking task using [PyTerrier](https://github.com/terrier-org/pyterrier).

Author: Craig Macdonald, University of Glasgow

## PyTerrier Setup

First, install PyTerrier. To do this with pip, uncomment the line below.

In [None]:
#!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

If your JAVA_HOME environment variable does not specify a directory for Java 11, you should set it here.

In [None]:
#import os
#os.environ["JAVA_HOME"] = "/local/trmaster/opt/jdk-11.0.6/"

Let's now start PyTerrier.

In [1]:
import pyterrier as pt
if not pt.started():
  pt.init(mem=8000)

## Dataset Setup

PyTerrier contains a Datasets API that allows you to index and retrieve from a number of standard datasets. We can see which datasets are supported using `pt.list_datasets()`:

In [2]:
pt.list_datasets()

| | dataset | topics | qrels | corpus | index |
|---|---|---|---|---|---|
| 0 | 50pct | | | | True |
| 1 | vaswani | True | True | True | True |
| 2 | trec-deep-learning-docs | (train, dev, test, test-2020) | (train, dev, test) | True | |
| 3 | trec-robust-2004 | True | True | | |
| 4 | trec-robust-2005 | True | True | | |
| 5 | trec-covid | (round1, round2, round3) | (round1, round2) | | |
| 6 | trec-wt2g | True | True | | |
| 7 | trec-wt-2002 | (td, np) | (np, td) | | |
| 8 | trec-wt-2003 | (td, np) | (np, td) | | |
| 9 | trec-wt-2004 | (all, np, hp, td) | (hp, td, np, all) | | |


For the MSMARCO document ranking task, the corresponding dataset is `"trec-deep-learning-docs"`, which we can see provides various topics and qrels sets, and provides a copy of the corpus.

In [3]:
dataset = pt.get_dataset("trec-deep-learning-docs")

If we run `get_corpus()`, it will download the TREC-formatted version of the corpus. Note that this is 22GB, which is unfortunately too large for Google Colab.

In [8]:
dataset.get_corpus()

['/users/craigm/.pyterrier/corpora/trec-deep-learning-docs/corpus/msmarco-docs.trec.gz']

## Indexing

Let's get set up for indexing. This is a basic configuration, without any stopword removal or stemming. Indexing on our machine took just under 1 hour using a single thread (3514 seconds, according to the log below), plus a few minutes to merge and optimise the lexicon.

In [10]:
!rm -rf index/
!mkdir -p index
props = {
  'indexer.meta.reverse.keys':'docno',
  'termpipelines' : '',
}

pt.logging('INFO')
indexer = pt.TRECCollectionIndexer("./index")
indexer.setProperties(**props)
indexref = indexer.index(dataset.get_corpus())

10:44:59.728 [main] WARN  o.t.i.MultiDocumentFileCollection - trec.encoding is not set; resorting to platform default (ISO-8859-1). Indexing may be platform dependent. Recommend trec.encoding=UTF-8
10:44:59.824 [main] INFO  o.t.i.MultiDocumentFileCollection - TRECCollection 0% processing /users/craigm/.pyterrier/corpora/trec-deep-learning-docs/corpus/msmarco-docs.trec.gz
10:44:59.883 [main] INFO  o.t.structures.indexing.Indexer - creating the data structures data_1
10:44:59.928 [main] INFO  o.t.s.indexing.LexiconBuilder - LexiconBuilder active - flushing every 100000 documents, or when memory threshold hit
11:43:34.096 [main] INFO  o.t.structures.indexing.Indexer - Collection #0 took 3514 seconds to index (3213835 documents)
11:44:06.810 [main] INFO  o.t.s.indexing.LexiconBuilder - 33 lexicons to merge
11:45:48.357 [main] INFO  o.t.s.indexing.LexiconBuilder - Optimising structure lexicon
11:45:48.366 [main] INFO  o.t.s.i.FSOMapFileLexiconUtilities - Optimising lexicon with 17470544 ent

Let's see the statistics of the generated index.

In [13]:
pt.logging('WARN')
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())


Number of documents: 3213835
Number of terms: 17470544
Number of fields: 0
Number of tokens: 3667907097
Field names: []
Positions:   false



All being well, you should have indexed 3213835 documents.

## Retrieval

This notebook contains a demonstration of how to execute a baseline retrieval run, using a Divergence from Randomness weighting model called [DPH](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/DPH.html). You could also use [BM25](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/BM25.html) or many of the [other weighting models that Terrier provides](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html).

We use an object called BatchRetrieve. The constructor parameters are as follows:
 - `wmodel` - name of the Terrier weighting model class
 - `properties` - Terrier configuration properties - here we re-specify `termpipelines` to match the indexing configuration
 - `verbose` - we set this to True, so we can view progress (using [TQDM](https://github.com/tqdm/tqdm)) when retrieving for these large topic sets.
 
Finally, we only want 100 results per query, so we apply the rank cutoff operator `%`.

In [14]:
DPH_br = pt.BatchRetrieve(index, wmodel="DPH", properties={"termpipelines": ""}, verbose=True) % 100
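
To make the effect of the `% 100` cutoff concrete, here is a minimal sketch in plain Python of what a per-query rank cutoff does: for each query, keep only the `k` highest-scoring documents. The helper `rank_cutoff` and the toy data are hypothetical and only illustrate the idea; PyTerrier itself works on results DataFrames, and its implementation may differ.

```python
from collections import defaultdict

def rank_cutoff(results, k):
    """Keep the k highest-scoring (qid, docno, score) tuples per query.

    Hypothetical helper for illustration; not part of PyTerrier.
    """
    by_query = defaultdict(list)
    for qid, docno, score in results:
        by_query[qid].append((qid, docno, score))
    kept = []
    for qid in by_query:  # dicts preserve first-seen query order
        ranked = sorted(by_query[qid], key=lambda row: row[2], reverse=True)
        kept.extend(ranked[:k])
    return kept

# Toy results: two queries with three retrieved documents each.
toy_results = [
    ("q1", "d1", 3.0), ("q1", "d2", 1.0), ("q1", "d3", 2.0),
    ("q2", "d4", 0.5), ("q2", "d5", 2.5), ("q2", "d6", 1.5),
]
top2 = rank_cutoff(toy_results, 2)
print(top2)
```

With a cutoff of 2, the lowest-scoring document of each toy query is dropped, in the same way that `% 100` limits each MSMARCO query to 100 results.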

Let's now evaluate performance on the MSMARCO Dev set. `pt.Experiment` is a declarative notation for running one or more retrieval pipelines on a standard set of topics, then evaluating them all against the same qrels. We report the mean reciprocal rank (MRR) measure, requested here as `recip_rank`.

The dev set is quite large (more than 5000 queries). This took about 1 hour to run for us.

In [18]:
pt.Experiment([DPH_br], dataset.get_topics("dev"), dataset.get_qrels("dev"), eval_metrics=["recip_rank"])

12:32:40.100 [main] WARN  o.t.a.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8


100%|██████████| 5193/5193 [1:01:49<00:00,  1.40q/s]


| | name | recip_rank | map |
|---|---|---|---|
| 0 | BR(DPH) | 0.25827 | 0.25827 |
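
For readers unfamiliar with the measure, the following is a rough sketch of how a mean reciprocal rank can be computed: for each query, take 1/rank of the first relevant document retrieved (0 if none is retrieved), then average over the queries. The function name and toy data are hypothetical. Averaging over every query in the qrels is an assumption of this sketch; evaluation tools can differ in which queries they average over.

```python
def mean_reciprocal_rank(ranked_docs, relevant_docs):
    """Average 1/rank of the first relevant document over all queries.

    ranked_docs: dict mapping qid -> list of docnos, best first.
    relevant_docs: dict mapping qid -> set of relevant docnos.
    A query with no relevant document retrieved contributes 0.
    """
    total = 0.0
    for qid, relevant in relevant_docs.items():
        for rank, docno in enumerate(ranked_docs.get(qid, []), start=1):
            if docno in relevant:
                total += 1.0 / rank
                break
    return total / len(relevant_docs)

# Toy run: q1 finds its relevant document at rank 2, q2 at rank 1, q3 never.
toy_run = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"], "q3": ["d9"]}
toy_qrels = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d7"}}
mrr = mean_reciprocal_rank(toy_run, toy_qrels)
print(mrr)  # (1/2 + 1 + 0) / 3 = 0.5
```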


## Prepare Leaderboard results

Finally, let's prepare a results file to submit to the leaderboard. Again, with 5793 topics, this took about 1 hour.

In [19]:
pt.io.write_results(DPH_br(dataset.get_topics("leaderboard-2020")), "uogTrBaseDPH.res.gz", run_name="uogTrBaseDPH")

100%|██████████| 5793/5793 [1:07:51<00:00,  1.42q/s]
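
As an illustration of what such a results file typically looks like, here is a small sketch that writes a gzipped run file in the conventional six-column TREC run layout (query id, the literal `Q0`, document id, rank, score, run name). This assumes the leaderboard file follows that layout. The helper `write_trec_run`, the toy rows and the run name `toyRun` are hypothetical, and this is not PyTerrier's implementation.

```python
import gzip
import os
import tempfile

def write_trec_run(rows, path, run_name):
    """Write (qid, docno, rank, score) tuples as gzipped TREC run lines.

    Hypothetical helper for illustration; not PyTerrier's implementation.
    """
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for qid, docno, rank, score in rows:
            # Columns: query id, literal "Q0", document id, rank, score, run name
            out.write(f"{qid} Q0 {docno} {rank} {score} {run_name}\n")

# Toy rows; the ranks are taken as given rather than recomputed.
toy_rows = [("q1", "d1", 0, 3.0), ("q1", "d3", 1, 2.0)]
toy_path = os.path.join(tempfile.mkdtemp(), "toy.res.gz")
write_trec_run(toy_rows, toy_path, "toyRun")

# Read the file back to inspect the lines that were written.
with gzip.open(toy_path, "rt", encoding="utf-8") as run_file:
    written_lines = run_file.read().splitlines()
print(written_lines)
```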
