Refactor monoT5 TPU instructions, clean up to Data Prep, add relevant data links #113

Merged 2 commits on Nov 13, 2020
97 changes: 59 additions & 38 deletions docs/experiments-monot5-tpu.md
@@ -1,88 +1,125 @@
# Neural Ranking Baselines on [MS MARCO Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking) - with TPU

This page contains instructions for running monoT5 on the MS MARCO *passage* ranking task.

We will focus on using monoT5-3B to rerank, since it is difficult to run such a large model without a TPU. We also mention the changes required to run monoT5-base for those with a more constrained compute budget.
- monoT5: Document Ranking with a Pretrained Sequence-to-Sequence Model [(Nogueira et al., 2020)](https://arxiv.org/pdf/2003.06713.pdf)

Note that there are also separate documents to run MS MARCO ranking tasks on regular GPU. Please see [MS MARCO *document* ranking task](https://github.com/castorini/pygaggle/blob/master/docs/experiments-msmarco-document.md), [MS MARCO *passage* ranking task - Subset](https://github.com/castorini/pygaggle/blob/master/docs/experiments-msmarco-passage-subset.md) and [MS MARCO *passage* ranking task - Entire](https://github.com/castorini/pygaggle/blob/master/docs/experiments-msmarco-passage-entire.md).

We rerank the BM25 run files that contain ~1000 passages per query using monoT5.
monoT5 is a pointwise reranker, meaning that each document is scored independently using T5.
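Pointwise reranking can be sketched as follows. This is an illustrative snippet, not PyGaggle code: the `score` callable stands in for a monoT5 forward pass over one query-passage pair.

```python
def rerank(query, passages, score):
    """Score each (query, passage) pair independently, then sort
    passages by descending relevance score -- the pointwise setup."""
    scored = [(score(query, p), p) for p in passages]  # one model call per pair
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored]
```

Because each pair is scored in isolation, the candidate list can be split across machines (or TPU cores) arbitrarily and the scores merged afterwards.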

## Data Prep

Since we will use some scripts from PyGaggle to process data and evaluate results, we first install it from source.
```
git clone --recursive https://github.com/castorini/pygaggle.git
cd pygaggle
pip install .
```

We first download the queries, qrels, run, and corpus files corresponding to the entire MS MARCO dev set.
We store all the files in the `data/msmarco_passage` directory.
```
export DATA_DIR=data/msmarco_passage
mkdir -p ${DATA_DIR}
```

We provide specific data prep instructions for the train and dev set.

### Train Set

First, download the MS MARCO train triples:
```
cd ${DATA_DIR}
wget https://storage.googleapis.com/duobert_git/triples.train.small.tar.gz
tar -xvf triples.train.small.tar.gz
rm triples.train.small.tar.gz
cd ../../
```

Then convert the train triples file to the monoT5 input format:
```
python pygaggle/data/create_msmarco_t5_training_pairs.py --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_doc_pairs.train.tsv
```
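Conceptually, the conversion turns each (query, positive passage, negative passage) triple into two T5 training examples. The exact prompt format is defined by the script above; the sketch below assumes the "Query: … Document: … Relevant:" template with `true`/`false` targets described in the monoT5 paper.

```python
def triple_to_t5_pairs(query, positive, negative):
    """Turn one training triple into two (input, target) examples:
    T5 learns to generate 'true' for the relevant passage and
    'false' for the non-relevant one."""
    prompt = "Query: {} Document: {} Relevant:"
    return [
        (prompt.format(query, positive), "true"),
        (prompt.format(query, negative), "false"),
    ]
```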

Next, copy the monoT5 input file to Google Storage. TPU training will read data directly from `gs`.
```
export GS_FOLDER=<google storage folder to store input/output data>
gsutil cp ${DATA_DIR}/query_doc_pairs.train.tsv ${GS_FOLDER}/
```

This file is made available in our [bucket](https://console.cloud.google.com/storage/browser/castorini/monot5/data).

### Dev Set

We download the query, qrels, run, and corpus files corresponding to the MS MARCO passage dev set.

The run file is generated by following Anserini's [BM25 ranking instructions](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md).

In short, the files are:
- `queries.dev.small.tsv`: 6,980 queries from the MS MARCO dev set.
- `qrels.dev.small.tsv`: 7,437 pairs of query and relevant passage ids from the MS MARCO dev set.
- `run.dev.small.tsv`: Approximately 6,980,000 pairs of dev set queries and passages retrieved using Anserini's BM25.
- `collection.tar.gz`: All passages (8,841,823) in the MS MARCO passage corpus. In this tsv file, the first column is the passage id, and the second is the passage text.
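The two-column layout of the corpus file can be parsed line by line; a minimal sketch (the helper name is ours, not from PyGaggle):

```python
def parse_collection_line(line):
    """Each collection.tsv line is: passage_id <TAB> passage_text.
    Split on the first tab only, since passage text may contain tabs."""
    pid, text = line.rstrip("\n").split("\t", 1)
    return pid, text
```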

A more detailed description of the data is available [here](https://github.com/castorini/duobert#data-and-trained-models).

Let's start.
```
cd ${DATA_DIR}
wget https://storage.googleapis.com/duobert_git/run.bm25.dev.small.tsv
wget https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv
wget https://www.dropbox.com/s/5t6e2225rt6ikym/qrels.dev.small.tsv
wget https://www.dropbox.com/s/m1n2wf80l1lb9j1/collection.tar.gz
tar -xvf collection.tar.gz
rm collection.tar.gz
mv run.bm25.dev.small.tsv run.dev.small.tsv
cd ../../
```

As a sanity check, we can evaluate the first-stage retrieved documents using the official MS MARCO evaluation script.
```
python tools/eval/msmarco_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.dev.small.tsv
```

The output should be:

```
#####################
MRR @10: 0.18736452221767383
QueriesRanked: 6980
#####################
```
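The metric reported above, MRR@10, is the mean over queries of the reciprocal rank of the first relevant passage within the top 10. A minimal sketch of the computation (not the official script, which also handles ties and malformed runs):

```python
def mrr_at_10(qrels, run):
    """qrels: {query_id: set of relevant passage ids}
    run:   {query_id: ranked list of passage ids}
    Returns mean reciprocal rank, cut off at depth 10."""
    total = 0.0
    for qid, ranking in run.items():
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)
```

A query whose first relevant passage sits below rank 10 contributes 0, which is why improving early precision is what moves this metric.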

Then, we prepare the query-doc pairs in the monoT5 input format.
```
python pygaggle/data/create_msmarco_monot5_input.py --queries ${DATA_DIR}/queries.dev.small.tsv \
--run ${DATA_DIR}/run.dev.small.tsv \
--corpus ${DATA_DIR}/collection.tsv \
--t5_input ${DATA_DIR}/query_doc_pairs.dev.small.txt \
--t5_input_ids ${DATA_DIR}/query_doc_pair_ids.dev.small.tsv
```
We will get two output files here:
- `query_doc_pairs.dev.small.txt`: The query-doc pairs for monoT5 input.
- `query_doc_pair_ids.dev.small.tsv`: The `query_id`s and `doc_id`s that map to the query-doc pairs. We will use this to map query-doc pairs to their corresponding monoT5 output scores.

The files are made available in our [bucket](https://console.cloud.google.com/storage/browser/castorini/monot5/data).
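Since the ids file and the monoT5 scores are produced in the same order, joining them back into a run is a simple zip-group-sort. The helper below is our own illustration of that step, not a PyGaggle API:

```python
def scores_to_run(pair_ids, scores):
    """pair_ids: list of (query_id, doc_id) in monoT5 input order;
    scores: list of relevance scores in the same order.
    Returns {query_id: [doc_id, ...]} ranked by descending score."""
    per_query = {}
    for (qid, docid), s in zip(pair_ids, scores):
        per_query.setdefault(qid, []).append((s, docid))
    return {qid: [d for _, d in sorted(docs, reverse=True)]
            for qid, docs in per_query.items()}
```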

Note that there might be a memory issue if the monoT5 input file is too large to fit in the instance's memory. We thus split the input file into multiple files.

```
split --suffix-length 3 --numeric-suffixes --lines 1000000 ${DATA_DIR}/query_doc_pairs.dev.small.txt ${DATA_DIR}/query_doc_pairs.dev.small.txt
```

For `query_doc_pairs.dev.small.txt`, we get 7 files after splitting (`query_doc_pairs.dev.small.txt000` to `query_doc_pairs.dev.small.txt006`).
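The `split` invocation above can be mimicked in a few lines of Python; the chunk size here is tiny for illustration, versus 1,000,000 lines in the actual command.

```python
def chunk_lines(lines, chunk_size):
    """Yield successive chunks of at most chunk_size lines
    (mirrors `split --lines`)."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]

def split_names(prefix, n_chunks):
    """File names produced by `split --suffix-length 3 --numeric-suffixes
    PREFIX PREFIX`: the prefix plus a zero-padded three-digit suffix."""
    return [f"{prefix}{i:03d}" for i in range(n_chunks)]
```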

We copy these input files to Google Storage. TPU inference will read data directly from `gs`.
```
export GS_FOLDER=<google storage folder to store input/output data>
gsutil cp ${DATA_DIR}/query_doc_pairs.dev.small.txt??? ${GS_FOLDER}
```
These files can also be found in our [bucket](https://console.cloud.google.com/storage/browser/castorini/monot5/data).

## Start a VM with TPU on Google Cloud

@@ -213,24 +250,8 @@ If you were able to replicate these results, please submit a PR adding to the re

## Train monoT5

Recall the environment variables:

```
export MODEL=<t5 pretrain model, e.g. base, large, 3B>
export GS_FOLDER=<gs folder to store checkpoints>