Skip to content

Latest commit

 

History

History
227 lines (157 loc) · 11 KB

experiments-msmarco-doc-leaderboard.md

File metadata and controls

227 lines (157 loc) · 11 KB

Anserini: Baselines for MS MARCO Document Leaderboard

The page provides instructions for reproducing Anserini baseline runs for the MS MARCO Document Leaderboard. Prebuilt indexes can be found here. For convenience, we use Pyserini's feature to automatically download prebuilt indexes to fetch the right indexes, which are downloaded to ~/.cache/pyserini/indexes/.

Note that we are only able to evaluate on the dev queries. Scores on the test topics are only available via submission to the official leaderboard.

Note that prior to December 2021, runs generated with SearchCollection in the TREC format and then converted into the MS MARCO format give slightly different results from runs generated by SearchMsmarco directly in the MS MARCO format, due to tie-breaking effects. This was fixed with #1458, which also introduced (intra-configuration) multi-threading. As a result, SearchMsmarco has been deprecated and replaced by SearchCollection; both have been verified to generate identical output.

BM25 Baselines

We have two BM25 baselines, a "per-document" configuration (indexing each document individually, as you'd expect) and a "per-passage" configuration (where we break up each document into passages and index each passage separately, see here).

To fetch the indexes:

python -c "from pyserini.search import SimpleSearcher; SimpleSearcher.from_prebuilt_index('msmarco-doc')"
python -c "from pyserini.search import SimpleSearcher; SimpleSearcher.from_prebuilt_index('msmarco-doc-per-passage')"

Per-Document

Run with per-document configuration, BM25 default parameters:

mkdir runs/bm25base/

sh target/appassembler/bin/SearchMsmarco -hits 100 -k1 0.9 -b 0.4 -threads 9 \
 -index ~/.cache/pyserini/indexes/index-msmarco-doc-20201117-f87c94.ac747860e7a37aed37cc30ed3990f273 \
 -queries tools/topics-and-qrels/topics.msmarco-doc.dev.txt -output runs/bm25base/dev.txt &

sh target/appassembler/bin/SearchMsmarco -hits 100 -k1 0.9 -b 0.4 -threads 9 \
 -index ~/.cache/pyserini/indexes/index-msmarco-doc-20201117-f87c94.ac747860e7a37aed37cc30ed3990f273 \
 -queries tools/topics-and-qrels/topics.msmarco-doc.test.txt -output runs/bm25base/eval.txt &

Evaluation:

$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/bm25base/dev.txt
#####################
MRR @100: 0.23005723505603573
QueriesRanked: 5193
#####################

Note that this run corresponds to the MS MARCO document ranking leaderboard entry "Anserini's BM25, default parameters (k1=0.9, b=0.4)" dated 2020/08/16.

Run with per-document configuration, BM25 tuned parameters, optimized for recall@100 (k1=4.46, b=0.82):

mkdir runs/bm25tuned/

sh target/appassembler/bin/SearchMsmarco -hits 100 -k1 4.46 -b 0.82 -threads 9 \
 -index ~/.cache/pyserini/indexes/index-msmarco-doc-20201117-f87c94.ac747860e7a37aed37cc30ed3990f273 \
 -queries tools/topics-and-qrels/topics.msmarco-doc.dev.txt -output runs/bm25tuned/dev.txt &

sh target/appassembler/bin/SearchMsmarco -hits 100 -k1 4.46 -b 0.82 -threads 9 \
 -index ~/.cache/pyserini/indexes/index-msmarco-doc-20201117-f87c94.ac747860e7a37aed37cc30ed3990f273 \
 -queries tools/topics-and-qrels/topics.msmarco-doc.test.txt -output runs/bm25tuned/eval.txt &

Evaluation:

$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/bm25tuned/dev.txt
#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

This run was not submitted to the MS MARCO document ranking leaderboard.

Per-Passage

The passage retrieval functionality is only available in SearchCollection; we use a simple script to convert back into MS MARCO format.

Run with per-passage configuration, BM25 default parameters:

mkdir runs/bm25pbase/

target/appassembler/bin/SearchCollection -topicReader TsvString -topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -index ~/.cache/pyserini/indexes/index-msmarco-doc-per-passage-20201204-f50dcc.797367406a7542b649cefa6b41cf4c33/ \
  -output runs/bm25pbase/dev.trec.txt \
  -bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 1000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 &

python tools/scripts/msmarco/convert_trec_to_msmarco_run.py --input runs/bm25pbase/dev.trec.txt --output runs/bm25pbase/dev.txt

target/appassembler/bin/SearchCollection -topicReader TsvString -topics tools/topics-and-qrels/topics.msmarco-doc.test.txt \
  -index ~/.cache/pyserini/indexes/index-msmarco-doc-per-passage-20201204-f50dcc.797367406a7542b649cefa6b41cf4c33/ \
  -output runs/bm25pbase/eval.trec.txt \
  -bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 1000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 &

python tools/scripts/msmarco/convert_trec_to_msmarco_run.py --input runs/bm25pbase/eval.trec.txt --output runs/bm25pbase/eval.txt

Evaluation:

$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/bm25pbase/dev.txt
#####################
MRR @100: 0.2682349308946578
QueriesRanked: 5193
#####################

This run was not submitted to the MS MARCO document ranking leaderboard.

Run with per-passage configuration, BM25 tuned parameters, optimized for recall@100 (k1=2.16, b=0.61):

mkdir runs/bm25ptuned/

target/appassembler/bin/SearchCollection -topicReader TsvString -topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -index ~/.cache/pyserini/indexes/index-msmarco-doc-per-passage-20201204-f50dcc.797367406a7542b649cefa6b41cf4c33/ \
  -output runs/bm25ptuned/dev.trec.txt \
  -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 1000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 &

python tools/scripts/msmarco/convert_trec_to_msmarco_run.py --input runs/bm25ptuned/dev.trec.txt --output runs/bm25ptuned/dev.txt

target/appassembler/bin/SearchCollection -topicReader TsvString -topics tools/topics-and-qrels/topics.msmarco-doc.test.txt \
  -index ~/.cache/pyserini/indexes/index-msmarco-doc-per-passage-20201204-f50dcc.797367406a7542b649cefa6b41cf4c33/ \
  -output runs/bm25ptuned/eval.trec.txt \
  -bm25 -bm25.k1 2.16 -bm25.b 0.61 -hits 1000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 &

python tools/scripts/msmarco/convert_trec_to_msmarco_run.py --input runs/bm25ptuned/eval.trec.txt --output runs/bm25ptuned/eval.txt

Evaluation:

$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/bm25ptuned/dev.txt
#####################
MRR @100: 0.2751202109946902
QueriesRanked: 5193
#####################

This run corresponds to the MS MARCO document ranking leaderboard entry "Anserini's BM25 (per passage), parameters tuned for recall@100 (k1=2.16, b=0.61)" dated 2021/01/20.

Document Expansion Baselines

To fetch the indexes:

python -c "from pyserini.search import SimpleSearcher; SimpleSearcher.from_prebuilt_index('msmarco-doc-expanded-per-passage')"
python -c "from pyserini.search import SimpleSearcher; SimpleSearcher.from_prebuilt_index('msmarco-doc-expanded-per-doc')"

Per-Document

Anserini's BM25 + doc2query-T5 expansion (per document), parameters tuned for recall@100 (k1=4.68, b=0.87):

mkdir runs/doc2query-t5-per-doc/

sh target/appassembler/bin/SearchMsmarco -hits 100 -k1 4.68 -b 0.87 -threads 9 \
 -index ~/.cache/pyserini/indexes/index-msmarco-doc-expanded-per-doc-20201126-1b4d0a.f7056191842ab77a01829cff68004782 \
 -queries tools/topics-and-qrels/topics.msmarco-doc.dev.txt -output runs/doc2query-t5-per-doc/dev.txt &

sh target/appassembler/bin/SearchMsmarco -hits 100 -k1 4.68 -b 0.87 -threads 9 \
 -index ~/.cache/pyserini/indexes/index-msmarco-doc-expanded-per-doc-20201126-1b4d0a.f7056191842ab77a01829cff68004782 \
 -queries tools/topics-and-qrels/topics.msmarco-doc.test.txt -output runs/doc2query-t5-per-doc/eval.txt &

Evaluation:

$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/doc2query-t5-per-doc/dev.txt
#####################
MRR @100: 0.3265190296491929
QueriesRanked: 5193
#####################

This run corresponds to the MS MARCO document ranking leaderboard entry "Anserini's BM25 + doc2query-T5 expansion (per document), parameters tuned for recall@100 (k1=4.68, b=0.87)" dated 2020/12/11.

Per-Passage

The passage retrieval functionality is only available in SearchCollection; we use a simple script to convert back into MS MARCO format.

Anserini's BM25 + doc2query-T5 expansion (per passage), parameters tuned for recall@100 (k1=2.56, b=0.59):

mkdir runs/doc2query-t5-per-passage/

target/appassembler/bin/SearchCollection -topicReader TsvString -topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -index ~/.cache/pyserini/indexes/index-msmarco-doc-expanded-per-passage-20201126-1b4d0a.54ea30c64515edf3c3741291b785be53 \
  -output runs/doc2query-t5-per-passage/dev.trec.txt \
  -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 1000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 &

python tools/scripts/msmarco/convert_trec_to_msmarco_run.py --input runs/doc2query-t5-per-passage/dev.trec.txt --output runs/doc2query-t5-per-passage/dev.txt

target/appassembler/bin/SearchCollection -topicReader TsvString -topics tools/topics-and-qrels/topics.msmarco-doc.test.txt \
  -index ~/.cache/pyserini/indexes/index-msmarco-doc-expanded-per-passage-20201126-1b4d0a.54ea30c64515edf3c3741291b785be53 \
  -output runs/doc2query-t5-per-passage/eval.trec.txt \
  -bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 1000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 &

python tools/scripts/msmarco/convert_trec_to_msmarco_run.py --input runs/doc2query-t5-per-passage/eval.trec.txt --output runs/doc2query-t5-per-passage/eval.txt

Evaluation:

$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/doc2query-t5-per-passage/dev.txt
#####################
MRR @100: 0.32081861579183746
QueriesRanked: 5193
#####################

This run corresponds to the MS MARCO document ranking leaderboard entry "Anserini's BM25 + doc2query-T5 expansion (per passage), parameters tuned for recall@100 (k1=2.56, b=0.59)" dated 2020/12/11.

Reproduction Log*