add doc&scripts for msmarco experiments on tpu (#100)
* add doc&scripts for msmarco experiments on tpu
* fix typo in experiments-monot5-tpu document
* add description and fix typo in scripts for monot5 tf data processing
* update monot5 tpu doc to tf2
Showing 3 changed files with 326 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,214 @@ | ||
# Neural Ranking Baselines on [MS MARCO Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking) - with TPU | ||
|
||
This page contains instructions for running monoT5 on the MS MARCO *passage* ranking task using a TPU. We will run on the entire dev set. | ||
|
||
We will focus on using monoT5-3B to rerank, since it is difficult to run such a large model without a TPU. | ||
- monoT5-3B: Document Ranking with a Pretrained Sequence-to-Sequence Model [(Nogueira et al., 2020)](https://arxiv.org/pdf/2003.06713.pdf) | ||
|
||
Note that there are also separate documents for running MS MARCO ranking tasks on a regular GPU. Please see [MS MARCO *document* ranking task](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc.md), [MS MARCO *passage* ranking task - Subset](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-subset.md) and [MS MARCO *passage* ranking task - Entire](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-entrie.md). | ||
|
||
Prior to running this, we suggest looking at our first-stage [BM25 ranking instructions](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md). | ||
We rerank the BM25 run files that contain ~1000 passages per query using monoT5. | ||
monoT5 is a pointwise reranker. This means that each document is scored independently using T5. | ||
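
To make "pointwise" concrete, here is a minimal sketch of the reranking loop. The function `toy_overlap_score` is a hypothetical stand-in for the T5 model (it only counts shared words), and `rerank_pointwise` is an illustrative name rather than a PyGaggle API. The point is only that each passage is scored without looking at the other candidates.

```python
from typing import Callable, List, Tuple


def toy_overlap_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for monoT5: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / max(len(query_words), 1)


def rerank_pointwise(
    query: str,
    candidates: List[Tuple[str, str]],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Scores every (doc_id, passage) pair on its own, then sorts by score."""
    scored = [(doc_id, score_fn(query, passage)) for doc_id, passage in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy candidates; the ids and texts are made up for illustration.
candidates = [
    ("d1", "the weather is nice today"),
    ("d2", "a tpu is a tensor processing unit"),
]
ranking = rerank_pointwise("what is a tpu", candidates, toy_overlap_score)
print(ranking)
```

Because the scores are independent, the candidate list can be split into chunks and scored in any order, which is what the file splitting later on this page relies on.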
|
||
## Data Preparation | ||
|
||
Since we will use some scripts from PyGaggle to process data and evaluate results, you need to install PyGaggle. | ||
``` | ||
git clone --recursive https://github.com/castorini/pygaggle.git | ||
cd pygaggle | ||
pip install . | ||
``` | ||
|
||
We're first going to download the queries, qrels, run, and corpus for the entire MS MARCO dev set. | ||
|
||
The run file is generated by following Anserini's BM25 ranking instructions. Please see [here](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md) for details. | ||
|
||
We'll store all of the following files in the `data/msmarco_dev` directory. | ||
|
||
- `queries.dev.small.tsv`: 6,980 queries from the MS MARCO dev set. | ||
- `qrels.dev.small.tsv`: 7,437 pairs of query ids and relevant passage ids from the MS MARCO dev set. | ||
- `run.dev.small.tsv`: Approximately 6,980,000 pairs of dev set queries and retrieved passages using BM25. | ||
- `collection.tar.gz`: All passages (8,841,823) in the MS MARCO passage corpus. In the extracted tsv file, the first column is the passage id, and the second is the passage text. | ||
|
||
For a more detailed description of the data, please see [here](https://github.com/castorini/duobert#data-and-trained-models). | ||
``` | ||
mkdir -p data/msmarco_dev && cd data/msmarco_dev | ||
wget https://storage.googleapis.com/duobert_git/run.bm25.dev.small.tsv | ||
wget https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv | ||
wget https://www.dropbox.com/s/5t6e2225rt6ikym/qrels.dev.small.tsv | ||
wget https://www.dropbox.com/s/m1n2wf80l1lb9j1/collection.tar.gz | ||
tar -xvf collection.tar.gz | ||
mv run.bm25.dev.small.tsv run.dev.small.tsv | ||
cd ../../ | ||
``` | ||
As a sanity check, we can evaluate the first-stage retrieved documents using the official MS MARCO evaluation script. | ||
|
||
``` | ||
export DATA_DIR=data/msmarco_dev | ||
python tools/eval/msmarco_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.dev.small.tsv | ||
``` | ||
|
||
The output should be: | ||
|
||
``` | ||
##################### | ||
MRR @10: 0.18736452221767383 | ||
QueriesRanked: 6980 | ||
##################### | ||
``` | ||
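
For readers unfamiliar with the metric, the following is a simplified sketch of MRR@10, not the official evaluation script. It assumes the qrels are held as a dict of relevant passage ids per query and the run as a ranked list per query, and that the mean is taken over the queries in the qrels. The names and toy ids are illustrative.

```python
from typing import Dict, List, Set


def mrr_at_k(qrels: Dict[str, Set[str]], run: Dict[str, List[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for query_id, relevant in qrels.items():
        ranked = run.get(query_id, [])[:k]
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels) if qrels else 0.0


# Toy data: q1 has its relevant passage at rank 2, q2 has none in the list.
qrels = {"q1": {"d3"}, "q2": {"d7"}}
run = {"q1": ["d1", "d3", "d5"], "q2": ["d2", "d4", "d6"]}
score = mrr_at_k(qrels, run)
print(score)
```

Use the official script for any reported numbers; this sketch only illustrates what the `MRR @10` line measures.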
|
||
Then create query-doc pairs in the monoT5 input format. | ||
``` | ||
python -m pygaggle.data.create_msmarco_monot5_input --queries ${DATA_DIR}/queries.dev.small.tsv \ | ||
--run ${DATA_DIR}/run.dev.small.tsv \ | ||
--corpus ${DATA_DIR}/collection/collection.tsv \ | ||
--t5_input ${DATA_DIR}/query_doc_pairs.dev.small.txt \ | ||
--t5_input_ids ${DATA_DIR}/query_doc_pair_ids.dev.small.tsv | ||
``` | ||
We will get two output files here: | ||
- `query_doc_pairs.dev.small.txt`: The query-doc pairs for monoT5 input. | ||
- `query_doc_pair_ids.dev.small.tsv`: The `query_id`s and `doc_id`s that map to the query-doc pairs. We will use this to map monoT5 output scores back to query-doc pairs. | ||
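
As an illustration of how the two files line up, here is a small sketch that builds one line of each. The input template is the one stated in the docstring of the input-creation script in this commit. The helper names, the query, the passage, and the ids are made up for the example.

```python
def monot5_input_line(query: str, document: str) -> str:
    """One line of the monoT5 input file, using the template from the script."""
    return f"Query: {query} Document: {document} Relevant:\n"


def pair_ids_line(query_id: str, doc_id: str) -> str:
    """The matching line of the ids file: query id and doc id, tab-separated."""
    return f"{query_id}\t{doc_id}\n"


# Hypothetical query and passage, for illustration only.
input_line = monot5_input_line("what is a tpu", "A TPU is a tensor processing unit.")
ids_line = pair_ids_line("q1", "d42")
print(input_line, end="")
print(ids_line, end="")
```

Line *n* of the ids file describes line *n* of the input file, so the two files must always be kept in the same order.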
|
||
Note that a large monoT5 input file can cause memory issues. Thus, we split the input file into multiple files. | ||
|
||
``` | ||
split --suffix-length 3 --numeric-suffixes --lines 1000000 ${DATA_DIR}/query_doc_pairs.dev.small.txt ${DATA_DIR}/query_doc_pairs.dev.small.txt | ||
``` | ||
|
||
For `query_doc_pairs.dev.small.txt`, the split produces 7 files, `query_doc_pairs.dev.small.txt000` to `query_doc_pairs.dev.small.txt006`. | ||
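
The file count follows from the sizes given above: roughly 6,980,000 lines at 1,000,000 lines per file. A short sketch of that arithmetic, with `split_output_names` as an illustrative helper that mimics the numeric-suffix naming of the `split` command above:

```python
import math
from typing import List


def split_output_names(prefix: str, total_lines: int, lines_per_file: int,
                       suffix_length: int = 3) -> List[str]:
    """Names of the chunk files for a given line count, with zero-padded suffixes."""
    num_files = math.ceil(total_lines / lines_per_file)
    return [f"{prefix}{index:0{suffix_length}d}" for index in range(num_files)]


# 6,980,000 is the approximate number of query-passage pairs stated above.
names = split_output_names("query_doc_pairs.dev.small.txt", 6_980_000, 1_000_000)
print(len(names), names[0], names[-1])
```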
|
||
Then copy these input files to Google Cloud Storage. TPU inference will read data directly from `gs`. | ||
``` | ||
export GS_FOLDER=<google storage folder to store input/output data> | ||
gsutil cp ${DATA_DIR}/query_doc_pairs.dev.small.txt??? ${GS_FOLDER} | ||
``` | ||
|
||
## Start a VM with TPU on Google Cloud | ||
|
||
Define environment variables. | ||
``` | ||
export PROJECT_NAME=<gcloud project name> | ||
export PROJECT_ID=<gcloud project id> | ||
export INSTANCE_NAME=<name of vm to create> | ||
export TPU_NAME=<name of tpu to create> | ||
``` | ||
|
||
Create the VM. | ||
``` | ||
gcloud beta compute --project=${PROJECT_NAME} instances create ${INSTANCE_NAME} --zone=europe-west4-a --machine-type=n1-standard-4 --subnet=default --network-tier=PREMIUM --maintenance-policy=MIGRATE --service-account=${PROJECT_ID}-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --image=debian-9-stretch-v20191014 --image-project=debian-cloud --boot-disk-size=200GB --boot-disk-type=pd-standard --boot-disk-device-name=${INSTANCE_NAME} --reservation-affinity=any | ||
``` | ||
|
||
After the VM is created, we can `ssh` to the machine. | ||
Then create a TPU. | ||
|
||
``` | ||
curl -O https://dl.google.com/cloud_tpu/ctpu/latest/linux/ctpu && chmod a+x ctpu | ||
./ctpu up --name=${TPU_NAME} --project=${PROJECT_NAME} --zone=europe-west4-a --tpu-size=v3-8 --tpu-only --noconf | ||
``` | ||
|
||
## Set up the environment on the VM | ||
Install required tools. | ||
``` | ||
sudo apt-get update | ||
sudo apt-get install git gcc screen --yes | ||
``` | ||
|
||
Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html). | ||
``` | ||
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh | ||
bash ./Miniconda3-latest-Linux-x86_64.sh | ||
``` | ||
When the installation finishes, run: | ||
``` | ||
source ~/.bashrc | ||
``` | ||
|
||
Then create a Python virtual environment for the experiments. | ||
``` | ||
conda init | ||
conda create --yes --name py36 python=3.6 | ||
conda activate py36 | ||
``` | ||
Install dependencies. | ||
``` | ||
conda install -c conda-forge httptools jsonnet --yes | ||
pip install tensorflow tensorflow-text t5[gcp] | ||
git clone https://github.com/castorini/mesh.git | ||
pip install --editable mesh | ||
``` | ||
|
||
## Reranking with monoT5 | ||
On the TPU machine, define the model type and checkpoint. | ||
(Here we use our pretrained monoT5-3B as an example.) | ||
``` | ||
export MODEL=3B | ||
export CHECKPOINT=gs://neuralresearcher_data/doc2query/experiments/363 | ||
``` | ||
|
||
Then run the following command to start the process in the background and monitor the log: | ||
``` | ||
export EXPNO=001 | ||
for ITER in {000..006}; do | ||
echo "Running iter: $ITER" >> out.log_eval_exp${EXPNO} | ||
nohup t5_mesh_transformer \ | ||
--tpu="${TPU_NAME}" \ | ||
--gcp_project=${PROJECT_NAME} \ | ||
--tpu_zone="europe-west4-a" \ | ||
--model_dir="${CHECKPOINT}" \ | ||
--gin_file="gs://t5-data/pretrained_models/$MODEL/operative_config.gin" \ | ||
--gin_file="infer.gin" \ | ||
--gin_file="beam_search.gin" \ | ||
--gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \ | ||
--gin_param="infer_checkpoint_step = 1100000" \ | ||
--gin_param="utils.run.sequence_length = {'inputs': 512, 'targets': 64}" \ | ||
--gin_param="Bitransformer.decode.max_decode_length = 64" \ | ||
--gin_param="input_filename = '${GS_FOLDER}/query_doc_pairs.dev.small.txt${ITER}'" \ | ||
--gin_param="output_filename = '${GS_FOLDER}/query_doc_pair_scores.dev.small.txt${ITER}'" \ | ||
--gin_param="tokens_per_batch = 65536" \ | ||
--gin_param="Bitransformer.decode.beam_size = 1" \ | ||
--gin_param="Bitransformer.decode.temperature = 0.0" \ | ||
--gin_param="Unitransformer.sample_autoregressive.sampling_keep_top_k = -1" \ | ||
>> out.log_eval_exp${EXPNO} 2>&1 | ||
done & | ||
tail -100f out.log_eval_exp${EXPNO} | ||
``` | ||
|
||
It takes about 35 hours to rerank on a TPU v3-8. | ||
|
||
NOTE: We strongly encourage you to run the above processes in `screen` to make sure they don't get interrupted. | ||
|
||
## Evaluate Results | ||
After reranking finishes, copy the results from GS to your working directory and concatenate all score files back into one file. | ||
``` | ||
gsutil cp ${GS_FOLDER}/query_doc_pair_scores.dev.small.txt???-1100000 ${DATA_DIR}/ | ||
cat ${DATA_DIR}/query_doc_pair_scores.dev.small.txt???-1100000 > ${DATA_DIR}/query_doc_pair_scores.dev.small.txt | ||
``` | ||
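
The order of concatenation matters, because the scores are later matched to query-doc ids line by line. The shell glob above expands in sorted order. The sketch below shows the same idea in Python, assuming the part files share a prefix and a zero-padded suffix. The helper name, file names, and score values are illustrative.

```python
import glob
import os
import tempfile
from typing import List


def concat_score_parts(pattern: str, output_path: str) -> List[str]:
    """Concatenates the files matching `pattern` in sorted order into `output_path`."""
    part_paths = sorted(glob.glob(pattern))
    with open(output_path, "w") as fout:
        for part_path in part_paths:
            with open(part_path) as fin:
                for line in fin:
                    fout.write(line)
    return part_paths


with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two toy part files out of order; labels and scores are placeholders.
    for suffix, score in [("001", "-2.5"), ("000", "-0.1")]:
        part = os.path.join(tmp_dir, f"scores.txt{suffix}-1100000")
        with open(part, "w") as f:
            f.write(f"true\t{score}\n")
    merged_path = os.path.join(tmp_dir, "scores.txt")
    concat_score_parts(os.path.join(tmp_dir, "scores.txt???-1100000"), merged_path)
    with open(merged_path) as f:
        merged_lines = f.read().splitlines()
print(merged_lines)
```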
|
||
Then convert the monoT5 output back to the MS MARCO run format. | ||
``` | ||
python -m pygaggle.data.convert_t5_output_to_msmarco_run --t5_output ${DATA_DIR}/query_doc_pair_scores.dev.small.txt \ | ||
--t5_output_ids ${DATA_DIR}/query_doc_pair_ids.dev.small.tsv \ | ||
--msmarco_run ${DATA_DIR}/run.monot5_3b.dev.tsv | ||
``` | ||
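
Conceptually, the conversion groups scores by query, sorts each group by descending score, and writes 1-based ranks. The following in-memory sketch mirrors that logic on toy data. It is an illustration rather than a replacement for the script, and the function name, ids, and scores are made up.

```python
import collections
from typing import List, Tuple


def scores_to_run(pair_ids: List[Tuple[str, str]],
                  scores: List[float]) -> List[Tuple[str, str, int]]:
    """Turns per-pair scores into (query_id, doc_id, rank) rows."""
    by_query = collections.defaultdict(list)
    for (query_id, doc_id), score in zip(pair_ids, scores):
        by_query[query_id].append((doc_id, score))

    rows = []
    for query_id, doc_scores in by_query.items():
        # A higher score means more relevant, so sort in descending order.
        doc_scores.sort(key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, _) in enumerate(doc_scores, start=1):
            rows.append((query_id, doc_id, rank))
    return rows


pair_ids = [("q1", "d1"), ("q1", "d2"), ("q2", "d3")]
scores = [-3.2, -0.4, -1.0]
run_rows = scores_to_run(pair_ids, scores)
print(run_rows)
```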
|
||
Now we can evaluate the reranked results using the official MS MARCO evaluation script. | ||
``` | ||
python tools/eval/msmarco_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.monot5_3b.dev.tsv | ||
``` | ||
|
||
The output should be: | ||
``` | ||
##################### | ||
MRR @10: 0.3983799517896949 | ||
QueriesRanked: 6980 | ||
##################### | ||
``` | ||
|
||
You should see the same result. | ||
|
||
If you were able to replicate these results, please submit a PR adding to the replication log! Please mention in your PR any differences you find! | ||
|
||
## Replication Log |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
""" | ||
This script converts the monoT5 output file to an MS MARCO run file | ||
""" | ||
import collections | ||
import argparse | ||
|
||
parser = argparse.ArgumentParser() | ||
parser.add_argument("--t5_output", type=str, required=True, | ||
help="tsv file with two columns, <label> and <score>") | ||
parser.add_argument("--t5_output_ids", type=str, required=True, | ||
help="tsv file with two columns, <query_id> and <doc_id>") | ||
parser.add_argument("--msmarco_run", type=str, required=True, | ||
help="path to msmarco_run, tsv file, with <query_id>, <doc_id> and <rank>") | ||
args = parser.parse_args() | ||
|
||
examples = collections.defaultdict(list) | ||
with open(args.t5_output_ids) as f_gt, open(args.t5_output) as f_pred: | ||
    for line_gt, line_pred in zip(f_gt, f_pred): | ||
        query_id, doc_id = line_gt.strip().split('\t') | ||
        _, score = line_pred.strip().split('\t') | ||
        score = float(score) | ||
        examples[query_id].append((doc_id, score)) | ||
|
||
with open(args.msmarco_run, 'w') as fout: | ||
    for query_id, doc_ids_scores in examples.items(): | ||
        doc_ids_scores.sort(key=lambda x: x[1], reverse=True) | ||
        for rank, (doc_id, _) in enumerate(doc_ids_scores): | ||
            fout.write(f'{query_id}\t{doc_id}\t{rank + 1}\n') |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
""" | ||
This script creates monoT5 input files from the corpus, | ||
the queries and the retrieval run file for those queries. | ||
Each line in the monoT5 input file follows the format: | ||
f'Query: {query} Document: {document} Relevant:\n' | ||
""" | ||
import collections | ||
from tqdm import tqdm | ||
import argparse | ||
|
||
parser = argparse.ArgumentParser() | ||
parser.add_argument("--queries", type=str, required=True, | ||
help="tsv file with two columns, <query_id> and <query_text>") | ||
parser.add_argument("--run", type=str, required=True, | ||
help="tsv file with three columns <query_id>, <doc_id> and <rank>") | ||
parser.add_argument("--corpus", type=str, required=True) | ||
parser.add_argument("--t5_input", type=str, required=True, | ||
help="path to store t5_input, txt format") | ||
parser.add_argument("--t5_input_ids", type=str, required=True, | ||
help="path to store the query-doc ids of t5_input, tsv format") | ||
args = parser.parse_args() | ||
|
||
|
||
def load_corpus(path): | ||
    print('Loading corpus...') | ||
    corpus = {} | ||
    with open(path) as f: | ||
        for line in tqdm(f): | ||
            doc_id, doc = line.rstrip().split('\t') | ||
            corpus[doc_id] = doc | ||
    return corpus | ||
|
||
|
||
def load_queries(path): | ||
    """Loads queries into a dict of key: query_id, value: query text.""" | ||
    print('Loading queries...') | ||
    queries = {} | ||
    with open(path) as f: | ||
        for line in tqdm(f): | ||
            query_id, query = line.rstrip().split('\t') | ||
            queries[query_id] = query | ||
    return queries | ||
|
||
|
||
def load_run(path): | ||
    """Loads run into a dict of key: query_id, value: list of candidate doc | ||
    ids.""" | ||
|
||
    # We want to preserve the order of runs so we can pair the run file with | ||
    # the TFRecord file. | ||
    print('Loading run...') | ||
    run = collections.OrderedDict() | ||
    with open(path) as f: | ||
        for line in tqdm(f): | ||
            query_id, doc_title, rank = line.split('\t') | ||
            if query_id not in run: | ||
                run[query_id] = [] | ||
            run[query_id].append((doc_title, int(rank))) | ||
|
||
    # Sort candidate docs by rank. | ||
    print('Sorting candidate docs by rank...') | ||
    sorted_run = collections.OrderedDict() | ||
    for query_id, doc_titles_ranks in tqdm(run.items()): | ||
        # Sort in place so the rank order is actually applied. | ||
        doc_titles_ranks.sort(key=lambda x: x[1]) | ||
        doc_titles = [doc_title for doc_title, _ in doc_titles_ranks] | ||
        sorted_run[query_id] = doc_titles | ||
|
||
    return sorted_run | ||
|
||
|
||
corpus = load_corpus(path=args.corpus) | ||
queries = load_queries(path=args.queries) | ||
run = load_run(path=args.run) | ||
|
||
print("Writing t5 input and ids") | ||
with open(args.t5_input, 'w') as fout_t5, open(args.t5_input_ids, 'w') as fout_tsv: | ||
    for num_examples, (query_id, candidate_doc_ids) in enumerate( | ||
            tqdm(run.items(), total=len(run))): | ||
        query = queries[query_id] | ||
        for candidate_doc_id in candidate_doc_ids: | ||
            fout_t5.write( | ||
                f'Query: {query} Document: {corpus[candidate_doc_id]} Relevant:\n') | ||
            fout_tsv.write(f'{query_id}\t{candidate_doc_id}\n') |