
Adds doc2query experiments on TREC-CAR (#742)

* Adds doc2query experiments on TREC-CAR

* Adds doc2query experiments on TREC-CAR

* Removes unnecessary files. Fixes permissions.

* Fixes typo and indentations

* Adds explanation for MAP difference

* Removes extra lines

* Fixes TREC CAR name

* Fixes TREC CAR name

* Fixes style.

* Change TREC-CAR folder name

* Remove document expansions reference
rodrigonogueira4 committed Jul 9, 2019
1 parent 7cdf354 commit ba18e77c51459886813213fb684cc558a90f4a05
@@ -10,9 +10,9 @@ Here, we run through how to replicate the BM25+Doc2query condition with our copy
## MS MARCO Passage Ranking

To replicate our Doc2query results on the [MS MARCO Passage Ranking Task](https://github.com/microsoft/MSMARCO-Passage-Ranking), follow these instructions.
Before going through this guide, it is recommended that you [replicate our BM25 baselines](experiments-msmarco-passage.md) first.

To start, grab the predicted queries:

```
wget https://www.dropbox.com/s/709q495d9hohcmh/pred-test_topk10.tar.gz -P msmarco-passage
@@ -28,7 +28,7 @@ $ wc msmarco-passage/pred-test_topk10.txt
8841823 536446170 2962345659 msmarco-passage/pred-test_topk10.txt
```

These are the predicted queries from our seq2seq model, generated with top-_k_ sampling, 10 samples for each document in the corpus.
There are as many lines in the above file as there are documents; all 10 predicted queries are concatenated on a single line.
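
To get a feel for the file, peek at the predictions for the first passage; a minimal sketch, assuming the archive has been unpacked to the path above:

```
# Print the predicted queries for the first passage: all 10 sampled queries
# for a document sit together on a single line.
with open('msmarco-passage/pred-test_topk10.txt') as f:
    print(f.readline().strip())
```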

Now let's create a new document collection by concatenating the predicted queries to the original documents:
@@ -87,5 +87,64 @@ So, this simple trick improves MRR by a bit over baseline Doc2query.

## TREC CAR

We will now describe how to reproduce the TREC CAR results of our BM25+doc2query model, as presented in the paper.

To start, download the TREC CAR dataset and the predicted queries:
```
mkdir trec_car
wget http://trec-car.cs.unh.edu/datareleases/v2.0/paragraphCorpus.v2.0.tar.xz -P trec_car
wget https://storage.googleapis.com/neuralresearcher_data/doc2query/data/aligned5/pred-test_topk10.tar.gz -P trec_car
tar -xf trec_car/paragraphCorpus.v2.0.tar.xz -C trec_car
tar -xf trec_car/pred-test_topk10.tar.gz -C trec_car
```

To confirm, `paragraphCorpus.v2.0.tar.xz` should have an MD5 checksum of `a404e9256d763ddcacc3da1e34de466a` and
`pred-test_topk10.tar.gz` should have an MD5 checksum of `b9f98b55e6260c64e830b34d80a7afd7`.
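
If you prefer to verify programmatically rather than with a command-line tool, a small Python sketch along these lines works:

```
# Compute the MD5 checksum of each downloaded archive in 1 MB chunks.
import hashlib

def md5(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(md5('trec_car/paragraphCorpus.v2.0.tar.xz'))  # expect a404e9256d763ddcacc3da1e34de466a
print(md5('trec_car/pred-test_topk10.tar.gz'))      # expect b9f98b55e6260c64e830b34d80a7afd7
```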

These are the predicted queries from our seq2seq model, generated with top-_k_ sampling, 10 samples for each document in the corpus.
There are as many lines in the above file as there are documents; all 10 predicted queries are concatenated on a single line.

Now let's create a new document collection by concatenating the predicted queries to the original documents:

```
python src/main/python/treccar/augment_collection_with_predictions.py \
--collection_path trec_car/paragraphCorpus/dedup.articles-paragraphs.cbor --output_folder trec_car/collection_jsonl_expanded_topk10 \
--predictions trec_car/pred-test_topk10.txt --stride 1
```

This augmentation process might take 2-3 hours.
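
Each output file is in Anserini's jsonl format: one JSON object per line with `id` and `contents` fields, where `contents` is the original paragraph text followed by its predicted queries. As a quick sanity check, you can inspect the first expanded document (the `docs00.json` name comes from the script's `docsNN.json` naming scheme):

```
# Load and print the first expanded document written by the script above.
import json

with open('trec_car/collection_jsonl_expanded_topk10/docs00.json') as f:
    doc = json.loads(f.readline())
print(doc['id'])        # TREC CAR paragraph id
print(doc['contents'])  # original paragraph text + predicted queries
```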

We can then index the expanded documents:

```
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \
 -generator LuceneDocumentGenerator -threads 40 -input trec_car/collection_jsonl_expanded_topk10 \
 -index trec_car/lucene-index.car17v2.0 &
```

And retrieve the test queries:

```
sh target/appassembler/bin/SearchCollection -topicreader Car \
-index trec_car/lucene-index.car17v2.0 \
-topics src/main/resources/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt \
-output trec_car/run.car17v2.0.bm25.topics.car17v2.0.benchmarkY1test.txt -bm25
```
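
The run file uses the standard six-column TREC format (`topic Q0 docid rank score tag`). As a quick sanity check, a sketch like the following prints the top-ranked paragraph for each topic:

```
# Print the top-ranked paragraph id and its BM25 score for every topic.
with open('trec_car/run.car17v2.0.bm25.topics.car17v2.0.benchmarkY1test.txt') as f:
    for line in f:
        topic, _, docid, rank, score, _ = line.split()
        if rank == '1':
            print(topic, docid, score)
```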

Evaluation is performed with `trec_eval`:
```
eval/trec_eval.9.0.4/trec_eval -c -m map -m recip_rank \
src/main/resources/topics-and-qrels/qrels.car17v2.0.benchmarkY1test.txt \
trec_car/run.car17v2.0.bm25.topics.car17v2.0.benchmarkY1test.txt
```

With the above commands, you should be able to replicate the following results:
```
map all 0.1807
recip_rank all 0.2750
```

Note that this MAP is slightly higher than in the arXiv paper (0.178) because this experiment uses the TREC CAR corpus v2.0 instead of the corpus v1.5 used in the paper.
src/main/python/treccar/augment_collection_with_predictions.py
@@ -0,0 +1,84 @@
# -*- coding: utf-8 -*-
"""
Anserini: A Lucene toolkit for replicable information retrieval research
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import json
import os
import argparse
from trec_car_classes import *  # provides iter_paragraphs and ParaText

def convert_collection(args):
    print('Converting collection...')

    predictions_file = open(args.predictions)
    file_index = 0
    with open(args.collection_path, 'rb') as f:
        for i, para_obj in enumerate(iter_paragraphs(f)):

            # Start writing to a new file when the current one reaches its maximum capacity.
            if i % args.max_docs_per_file == 0:
                if i > 0:
                    output_jsonl_file.close()
                output_path = os.path.join(args.output_folder, 'docs{:02d}.json'.format(file_index))
                output_jsonl_file = open(output_path, 'w')
                file_index += 1

            doc_id = para_obj.para_id
            # A paragraph body mixes plain-text and anchor-text elements; flatten both to strings.
            para_txt = [elem.text if isinstance(elem, ParaText)
                        else elem.anchor_text
                        for elem in para_obj.bodies]

            doc_text = ' '.join(para_txt)
            doc_text = doc_text.replace('\n', ' ')
            doc_text = ' '.join(doc_text.split())

            if not doc_text:
                doc_text = 'dummy document.'

            # Read the predicted queries for this document and append them to the original text.
            pred_text = []
            for _ in range(args.stride):
                pred_text.append(predictions_file.readline().strip())
            pred_text = ' '.join(pred_text)
            pred_text = pred_text.replace(' / ', ' ')
            text = (doc_text + ' ') * args.original_copies + pred_text

            output_dict = {'id': doc_id, 'contents': text}
            output_jsonl_file.write(json.dumps(output_dict) + '\n')

            if i % 100000 == 0:
                print('Converted {} docs in {} files'.format(i, file_index))

    output_jsonl_file.close()
    predictions_file.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Augments TREC CAR collection with predicted queries to create Anserini jsonl collection')
    parser.add_argument('--collection_path', required=True, help='TREC CAR cbor collection')
    parser.add_argument('--predictions', required=True, help='query predictions file')
    parser.add_argument('--output_folder', required=True, help='output folder for jsonl collection')
    parser.add_argument('--stride', required=True, type=int,
                        help='every [stride] lines in the predictions file are associated with one document')
    parser.add_argument('--max_docs_per_file', default=1000000, type=int,
                        help='maximum number of documents in each jsonl file')
    parser.add_argument('--original_copies', default=1, type=int,
                        help='number of times the original document text is repeated in the expanded document')

    args = parser.parse_args()

    if not os.path.exists(args.output_folder):
        os.makedirs(args.output_folder)

    convert_collection(args)
    print('Done!')
