msmarco-v2-passage-d2q-t5

Lucene index of the MS MARCO V2 passage corpus, with doc2query-T5 expansions.

Note that there are two variants:

  • msmarco-v2-passage-d2q-t5 (19G uncompressed): stores term frequencies only, which supports bag-of-words queries but not phrase queries or relevance feedback. The raw text is not stored, so it cannot be fetched from this index.
  • msmarco-v2-passage-d2q-t5-docvectors (71G uncompressed): stores term frequencies and the docvectors, which enables pseudo-relevance feedback.
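
As a rough illustration of choosing between the two variants, the sketch below maps a retrieval mode to an index directory. The function name pick_index and the mode labels bow (bag-of-words) and prf (pseudo-relevance feedback) are hypothetical and not part of Anserini; the directory names are assumed to match the -index arguments of the indexing commands in this README.

```shell
# Hypothetical helper (not part of Anserini): choose the index variant
# that supports the requested retrieval mode.
pick_index() {
  base="indexes/lucene-index.msmarco-v2-passage-d2q-t5"
  suffix="20220808.4d6d2a"
  case "$1" in
    # Bag-of-words ranking only needs term frequencies.
    bow) echo "${base}.${suffix}/" ;;
    # Pseudo-relevance feedback needs the stored docvectors.
    prf) echo "${base}-docvectors.${suffix}/" ;;
    *)   echo "usage: pick_index bow|prf" >&2; return 1 ;;
  esac
}

pick_index bow
pick_index prf
```

The smaller index is enough unless feedback is needed, so a script can default to bow and switch to prf only for runs that use pseudo-relevance feedback.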

These indexes were generated on 2022/08/08 at Anserini commit 4d6d2a on damiano with the following commands:

nohup target/appassembler/bin/IndexCollection -collection MsMarcoV2PassageCollection \
  -generator DefaultLuceneDocumentGenerator -threads 18 \
  -input /scratch2/collections/msmarco/msmarco_v2_passage_d2q-t5/ \
  -index indexes/lucene-index.msmarco-v2-passage-d2q-t5.20220808.4d6d2a/ \
  -optimize \
  >& logs/log.msmarco-v2-passage-d2q-t5.20220808.4d6d2a.txt &

nohup target/appassembler/bin/IndexCollection -collection MsMarcoV2PassageCollection \
  -generator DefaultLuceneDocumentGenerator -threads 18 \
  -input /scratch2/collections/msmarco/msmarco_v2_passage_d2q-t5/ \
  -index indexes/lucene-index.msmarco-v2-passage-d2q-t5-docvectors.20220808.4d6d2a/ \
  -storeDocvectors -optimize \
  >& logs/log.msmarco-v2-passage-d2q-t5-docvectors.20220808.4d6d2a.txt &
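
Once both background jobs have finished, a quick sanity check is to compare the on-disk size of each index with the sizes quoted above (19G and 71G). The sketch below is one possible way to do that; the function name report_index_sizes is hypothetical, and it assumes it is run from the same working directory as the indexing commands.

```shell
# Hypothetical helper: print the on-disk size of each index directory,
# or note that the directory does not exist (yet).
report_index_sizes() {
  for idx in "$@"; do
    if [ -d "$idx" ]; then
      du -sh "$idx"
    else
      echo "missing: $idx"
    fi
  done
}

report_index_sizes \
  indexes/lucene-index.msmarco-v2-passage-d2q-t5.20220808.4d6d2a/ \
  indexes/lucene-index.msmarco-v2-passage-d2q-t5-docvectors.20220808.4d6d2a/
```

A directory reported as missing, or one far smaller than expected, suggests checking the corresponding log file under logs/ before using the index.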