Refactor monoT5 TPU instructions, clean up to Data Prep, add relevant data links #113

Merged 2 commits on Nov 13, 2020
97 changes: 59 additions & 38 deletions docs/experiments-monot5-tpu.md
@@ -1,88 +1,125 @@
# Neural Ranking Baselines on [MS MARCO Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking) - with TPU

This page contains instructions for running monoT5 on the MS MARCO *passage* ranking task.

We will focus on using monoT5-3B to rerank, since it is difficult to run such a large model without a TPU. We also mention the changes required to run monoT5-base for those with a more constrained compute budget.
- monoT5: Document Ranking with a Pretrained Sequence-to-Sequence Model [(Nogueira et al., 2020)](https://arxiv.org/pdf/2003.06713.pdf)

Note that there are also separate documents to run MS MARCO ranking tasks on regular GPU. Please see [MS MARCO *document* ranking task](https://github.com/castorini/pygaggle/blob/master/docs/experiments-msmarco-document.md), [MS MARCO *passage* ranking task - Subset](https://github.com/castorini/pygaggle/blob/master/docs/experiments-msmarco-passage-subset.md) and [MS MARCO *passage* ranking task - Entire](https://github.com/castorini/pygaggle/blob/master/docs/experiments-msmarco-passage-entire.md).

We rerank the BM25 run files that contain ~1000 passages per query using monoT5.
monoT5 is a pointwise reranker, meaning that each document is scored independently using T5.
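Pointwise reranking can be sketched as follows. This is an illustrative snippet, not PyGaggle code: the `score` callable stands in for a monoT5 forward pass over one query-passage pair.

```python
def rerank(query, passages, score):
    """Score each (query, passage) pair independently, then sort
    passages by descending relevance score -- the pointwise setup."""
    scored = [(score(query, p), p) for p in passages]  # one model call per pair
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored]
```

Because each pair is scored in isolation, the candidate list can be split across machines (or TPU cores) arbitrarily and the scores merged afterwards.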

## Data Prep

Since we will use some scripts from PyGaggle to process data and evaluate results, we first install it from source.
```
git clone --recursive https://github.com/castorini/pygaggle.git
cd pygaggle
pip install .
```

We first download the queries, qrels, run, and corpus files corresponding to the entire MS MARCO dev set.
We store all the files in the `data/msmarco_passage` directory.
```
export DATA_DIR=data/msmarco_passage
mkdir -p ${DATA_DIR}
```

We provide specific data prep instructions for the train and dev set.

### Train Set

First, download the MS MARCO train triples:
```
cd ${DATA_DIR}
wget https://storage.googleapis.com/duobert_git/triples.train.small.tar.gz
tar -xvf triples.train.small.tar.gz
rm triples.train.small.tar.gz
cd ../../
```

Then convert the train triples file to the monoT5 input format:
```
python pygaggle/data/create_msmarco_t5_training_pairs.py --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_doc_pairs.train.tsv
```
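Conceptually, the conversion turns each (query, positive passage, negative passage) triple into two T5 training examples. The exact prompt format is defined by the script above; the sketch below assumes the "Query: … Document: … Relevant:" template with `true`/`false` targets described in the monoT5 paper.

```python
def triple_to_t5_pairs(query, positive, negative):
    """Turn one training triple into two (input, target) examples:
    T5 learns to generate 'true' for the relevant passage and
    'false' for the non-relevant one."""
    prompt = "Query: {} Document: {} Relevant:"
    return [
        (prompt.format(query, positive), "true"),
        (prompt.format(query, negative), "false"),
    ]
```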

Next, copy the monoT5 input file to Google Storage. TPU training will read data directly from `gs`.
```
export GS_FOLDER=<google storage folder to store input/output data>
gsutil cp ${DATA_DIR}/query_doc_pairs.train.tsv ${GS_FOLDER}/
```

This file is made available in our [bucket](https://console.cloud.google.com/storage/browser/castorini/monot5/data).

### Dev Set

We download the query, qrels, run, and corpus files corresponding to the MS MARCO passage dev set.

The run file is generated by following Anserini's [BM25 ranking instructions](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md).

In short, the files are:
- `queries.dev.small.tsv`: 6,980 queries from the MS MARCO dev set.
- `qrels.dev.small.tsv`: 7,437 pairs of query and relevant passage ids from the MS MARCO dev set.
- `run.dev.small.tsv`: Approximately 6,980,000 pairs of dev set queries and passages retrieved using Anserini's BM25.
- `collection.tar.gz`: All passages (8,841,823) in the MS MARCO passage corpus. In this tsv file, the first column is the passage id, and the second is the passage text.
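The two-column layout of the corpus file can be parsed line by line; a minimal sketch (the helper name is ours, not from PyGaggle):

```python
def parse_collection_line(line):
    """Each collection.tsv line is: passage_id <TAB> passage_text.
    Split on the first tab only, since passage text may contain tabs."""
    pid, text = line.rstrip("\n").split("\t", 1)
    return pid, text
```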

A more detailed description of the data is available [here](https://github.com/castorini/duobert#data-and-trained-models).

Let's start.
```
cd ${DATA_DIR}
wget https://storage.googleapis.com/duobert_git/run.bm25.dev.small.tsv
wget https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv
wget https://www.dropbox.com/s/5t6e2225rt6ikym/qrels.dev.small.tsv
wget https://www.dropbox.com/s/m1n2wf80l1lb9j1/collection.tar.gz
tar -xvf collection.tar.gz
rm collection.tar.gz
mv run.bm25.dev.small.tsv run.dev.small.tsv
cd ../../
```

As a sanity check, we can evaluate the first-stage retrieved documents using the official MS MARCO evaluation script.
```
python tools/eval/msmarco_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.dev.small.tsv
```

The output should be:

```
#####################
MRR @10: 0.18736452221767383
QueriesRanked: 6980
#####################
```
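The metric reported above, MRR@10, is the mean over queries of the reciprocal rank of the first relevant passage within the top 10. A minimal sketch of the computation (not the official script, which also handles ties and malformed runs):

```python
def mrr_at_10(qrels, run):
    """qrels: {query_id: set of relevant passage ids}
    run:   {query_id: ranked list of passage ids}
    Returns mean reciprocal rank, cut off at depth 10."""
    total = 0.0
    for qid, ranking in run.items():
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)
```

A query whose first relevant passage sits below rank 10 contributes 0, which is why improving early precision is what moves this metric.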

Then, we prepare the query-doc pairs in the monoT5 input format.
```
python pygaggle/data/create_msmarco_monot5_input.py --queries ${DATA_DIR}/queries.dev.small.tsv \
--run ${DATA_DIR}/run.dev.small.tsv \
--corpus ${DATA_DIR}/collection.tsv \
--t5_input ${DATA_DIR}/query_doc_pairs.dev.small.txt \
--t5_input_ids ${DATA_DIR}/query_doc_pair_ids.dev.small.tsv
```
We will get two output files here:
- `query_doc_pairs.dev.small.txt`: The query-doc pairs for monoT5 input.
- `query_doc_pair_ids.dev.small.tsv`: The `query_id`s and `doc_id`s that map to the query-doc pairs. We will use this to map query-doc pairs to their corresponding monoT5 output scores.

The files are made available in our [bucket](https://console.cloud.google.com/storage/browser/castorini/monot5/data).
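Since the ids file and the monoT5 scores are produced in the same order, joining them back into a run is a simple zip-group-sort. The helper below is our own illustration of that step, not a PyGaggle API:

```python
def scores_to_run(pair_ids, scores):
    """pair_ids: list of (query_id, doc_id) in monoT5 input order;
    scores: list of relevance scores in the same order.
    Returns {query_id: [doc_id, ...]} ranked by descending score."""
    per_query = {}
    for (qid, docid), s in zip(pair_ids, scores):
        per_query.setdefault(qid, []).append((s, docid))
    return {qid: [d for _, d in sorted(docs, reverse=True)]
            for qid, docs in per_query.items()}
```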

Note that there might be a memory issue if the monoT5 input file is too large to fit in the instance's memory. We thus split the input file into multiple files.

```
split --suffix-length 3 --numeric-suffixes --lines 1000000 ${DATA_DIR}/query_doc_pairs.dev.small.txt ${DATA_DIR}/query_doc_pairs.dev.small.txt
```

For `query_doc_pairs.dev.small.txt`, we get 7 files after splitting (`query_doc_pairs.dev.small.txt000` to `query_doc_pairs.dev.small.txt006`).
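The `split` invocation above can be mimicked in a few lines of Python; the chunk size here is tiny for illustration, versus 1,000,000 lines in the actual command.

```python
def chunk_lines(lines, chunk_size):
    """Yield successive chunks of at most chunk_size lines
    (mirrors `split --lines`)."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]

def split_names(prefix, n_chunks):
    """File names produced by `split --suffix-length 3 --numeric-suffixes
    PREFIX PREFIX`: the prefix plus a zero-padded three-digit suffix."""
    return [f"{prefix}{i:03d}" for i in range(n_chunks)]
```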

We copy these input files to Google Storage. TPU inference will read data directly from `gs`.
```
export GS_FOLDER=<google storage folder to store input/output data>
gsutil cp ${DATA_DIR}/query_doc_pairs.dev.small.txt??? ${GS_FOLDER}
```
These files can also be found in our [bucket](https://console.cloud.google.com/storage/browser/castorini/monot5/data).

## Start a VM with TPU on Google Cloud

@@ -213,24 +250,8 @@ If you were able to replicate these results, please submit a PR adding to the re

## Train monoT5

Recall the environment variables:

```
export MODEL=<t5 pretrain model, e.g. base, large, 3B>
export GS_FOLDER=<gs folder to store checkpoints>