
Update docs for uniCOIL and other learned sparse models (#997)
pyserini.search -> pyserini.search.lucene
pyserini.index -> pyserini.index.lucene
lintool committed Feb 12, 2022
1 parent 7e21271 commit 3732c8e
Showing 5 changed files with 177 additions and 176 deletions.
46 changes: 23 additions & 23 deletions docs/experiments-deepimpact.md
@@ -1,4 +1,4 @@
-# Pyserini: DeepImpact for MS MARCO V1 Passage Ranking
+# Pyserini: DeepImpact on MS MARCO V1 Passage Ranking

This page describes how to reproduce the DeepImpact experiments in the following paper:

@@ -7,8 +7,6 @@ This page describes how to reproduce the DeepImpact experiments in the following
Here, we start with a version of the MS MARCO passage corpus that has already been processed with DeepImpact, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.

-Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-deepimpact.md) based on Java.

## Data Prep

> You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
@@ -17,28 +15,28 @@ We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with DeepImpact processing:

```bash
-# Alternate mirrors of the same data, pick one:
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact-b8.tar -P collections/
-wget https://vault.cs.uwaterloo.ca/s/57AE5aAjzw2ox2n/download -O collections/msmarco-passage-deepimpact-b8.tar
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact.tar -P collections/

-tar xvf collections/msmarco-passage-deepimpact-b8.tar -C collections/
+tar xvf collections/msmarco-passage-deepimpact.tar -C collections/
```

-To confirm, `msmarco-passage-deepimpact-b8.tar` is ~3.6 GB and has MD5 checksum `3c317cb4f9f9bcd3bbec60f05047561a`.
+To confirm, `msmarco-passage-deepimpact.tar` is 3.6 GB and has MD5 checksum `fe827eb13ca3270bebe26b3f6b99f550`.

## Indexing

We can now index these docs:

```bash
-python -m pyserini.index -collection JsonVectorCollection \
-  -input collections/msmarco-passage-deepimpact-b8/ \
-  -index indexes/lucene-index.msmarco-passage.deepimpact-b8 \
-  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-  -threads 12
+python -m pyserini.index.lucene \
+  --collection JsonVectorCollection \
+  --input collections/msmarco-passage-deepimpact/ \
+  --index indexes/lucene-index.msmarco-passage-deepimpact/ \
+  --generator DefaultLuceneDocumentGenerator \
+  --threads 12 \
+  --impact --pretokenized
```
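For reference, each line of the `JsonVectorCollection` input is a JSON object carrying a document id and a map from terms to integer impact weights. A toy sketch of that shape (the document id, terms, and weights below are made up for illustration):

```python
import json

# A toy impact-weighted document in the shape JsonVectorCollection expects:
# an "id", optional "contents", and a "vector" mapping each (pre-tokenized)
# term to an integer impact weight.
doc = {
    "id": "doc0",
    "contents": "pyserini is a toolkit for reproducible ir research",
    "vector": {"pyserini": 42, "toolkit": 17, "reproducible": 9, "ir": 31},
}

# The input files contain one such JSON object per line.
line = json.dumps(doc)
```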

-The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the DeepImpact tokens.
+The important indexing options to note here are `--impact --pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the DeepImpact tokens.

Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 15 minutes.
@@ -48,7 +46,7 @@ The indexing speed may vary; on a modern desktop with an SSD (using 12 threads,
To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries.
First, fetch the MS MARCO passage ranking dev set queries:

-```
+```bash
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz -P collections/
wget https://vault.cs.uwaterloo.ca/s/NYibRJ9bXs5PspH/download -O collections/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz
@@ -60,21 +58,23 @@ The MD5 checksum of the topics file is `88a2987d6a25b1be11c82e87677a262e`.
We can now run retrieval:

```bash
-python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz \
-  --index indexes/lucene-index.msmarco-passage.deepimpact-b8 \
-  --output runs/run.msmarco-passage.deepimpact-b8.tsv \
-  --impact \
-  --hits 1000 --batch 36 --threads 12 \
-  --output-format msmarco
+python -m pyserini.search.lucene \
+  --index indexes/lucene-index.msmarco-passage-deepimpact/ \
+  --topics collections/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz \
+  --output runs/run.msmarco-passage-deepimpact.tsv \
+  --output-format msmarco \
+  --batch 36 --threads 12 \
+  --hits 1000 \
+  --impact
```

-Note that the important option here is `-impact`, where we specify impact scoring.
+Note that the important option here is `--impact`, where we specify impact scoring.
A complete run should take around five minutes.
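Under impact scoring, a document's score is simply the sum of its stored integer weights for the query's terms; no BM25 weighting or length normalization is applied at search time. A toy sketch of the idea (not Pyserini's actual implementation; the index contents below are made up):

```python
from collections import defaultdict

# Toy inverted index: term -> {doc_id: integer impact weight}.
index = {
    "deep":    {"d1": 4, "d2": 1},
    "impact":  {"d1": 3, "d3": 5},
    "ranking": {"d2": 2, "d3": 2},
}

def impact_score(query_terms, index):
    """Score each document as the sum of its stored weights for the query terms."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Rank by descending score.
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(impact_score(["deep", "impact"], index))  # d1 ranks first: 4 + 3 = 7
```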

The output is in MS MARCO output format, so we can directly evaluate:

```bash
-python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.deepimpact-b8.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-deepimpact.tsv
```
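The headline metric this evaluation reports is MRR@10. As a rough sketch of what that metric computes (the official script handles more cases; the run and qrels here are toy data):

```python
def mrr_at_10(run, qrels):
    """run: qid -> ranked list of doc_ids; qrels: qid -> set of relevant doc_ids.
    Each query contributes 1/rank of its first relevant hit in the top 10, else 0."""
    total = 0.0
    for qid, ranked in run.items():
        for rank, doc_id in enumerate(ranked[:10], start=1):
            if doc_id in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(run)

# Toy check: q1 finds its relevant doc at rank 2, q2 misses entirely.
run = {"q1": ["d9", "d3", "d7"], "q2": ["d1", "d2"]}
qrels = {"q1": {"d3"}, "q2": {"d5"}}
print(mrr_at_10(run, qrels))  # (1/2 + 0) / 2 = 0.25
```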

The results should be as follows:
57 changes: 23 additions & 34 deletions docs/experiments-msmarco-v2-unicoil.md
@@ -1,4 +1,4 @@
-# Pyserini: uniCOIL w/ doc2query-T5 for MS MARCO V2
+# Pyserini: uniCOIL w/ doc2query-T5 on MS MARCO V2

This page describes how to reproduce retrieval experiments with the uniCOIL model on the MS MARCO V2 collections.
Details about our model can be found in the following paper:
@@ -30,30 +30,28 @@ To confirm, `msmarco_v2_passage_unicoil_noexp_0shot.tar` is 24 GB and has an MD5
Index the sparse vectors:

```bash
-python -m pyserini.index \
+python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_passage_unicoil_noexp_0shot \
--index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
-  --impact \
-  --pretokenized
+  --impact --pretokenized
```

> If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-passage-unicoil-noexp-0shot` in the command below.

Sparse retrieval with uniCOIL:

```bash
-python -m pyserini.search \
+python -m pyserini.search.lucene \
--topics msmarco-v2-passage-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
--output runs/run.msmarco-v2-passage.unicoil-noexp.0shot.txt \
-  --impact \
+  --batch 144 --threads 36 \
   --hits 1000 \
-  --batch 144 \
-  --threads 36
+  --impact
```
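With `--encoder`, the query itself is encoded on the fly into a bag of weighted terms, and a document's score is then (conceptually) the dot product between the query's term weights and the document's stored impact weights. A toy illustration with entirely made-up weights:

```python
# Hypothetical query-encoder output: term -> integer weight.
query_weights = {"who": 1, "invented": 5, "telescope": 6, "first": 3}

# Hypothetical stored document vector from the impact index.
doc_vector = {"invented": 3, "telescope": 4, "galileo": 2}

# Score = sum over shared terms of (query weight * document weight).
score = sum(w * doc_vector.get(t, 0) for t, w in query_weights.items())
print(score)  # 5*3 + 6*4 = 39
```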

To evaluate, using `trec_eval`:
@@ -91,28 +89,26 @@ To confirm, `msmarco_v2_passage_unicoil_0shot.tar` is 41 GB and has an MD5 check
Index the sparse vectors:

```bash
-python -m pyserini.index \
+python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_passage_unicoil_0shot \
--index indexes/lucene-index.msmarco-v2-passage.unicoil-0shot \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
-  --impact \
-  --pretokenized
+  --impact --pretokenized
```

Sparse retrieval with uniCOIL:

```bash
-python -m pyserini.search \
+python -m pyserini.search.lucene \
--topics msmarco-v2-passage-dev \
--encoder castorini/unicoil-msmarco-passage \
--index indexes/lucene-index.msmarco-v2-passage.unicoil-0shot \
--output runs/run.msmarco-v2-passage.unicoil.0shot.txt \
-  --impact \
+  --batch 144 --threads 36 \
   --hits 1000 \
-  --batch 144 \
-  --threads 36
+  --impact
```

To evaluate, using `trec_eval`:
@@ -152,32 +148,29 @@ To confirm, `msmarco_v2_doc_segmented_unicoil_noexp_0shot.tar` is 54 GB and has
Index the sparse vectors:

```bash
-python -m pyserini.index \
+python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot \
--index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
-  --impact \
-  --pretokenized
+  --impact --pretokenized
```

> If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-doc-per-passage-unicoil-noexp-0shot` in the command below.

Sparse retrieval with uniCOIL:

```bash
-python -m pyserini.search \
+python -m pyserini.search.lucene \
--topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
--output runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt \
-  --impact \
+  --batch 144 --threads 36 \
   --hits 10000 \
-  --batch 144 \
-  --threads 36 \
-  --max-passage-hits 1000 \
-  --max-passage
+  --max-passage --max-passage-hits 1000 \
+  --impact
```

For the document corpus, since we are searching the segmented version, we retrieve the top 10k _segments_ and perform MaxP to obtain the top 1000 _documents_.
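The MaxP aggregation step can be sketched as follows, assuming segment ids of the form `docid#segment` (a simplification of the actual corpus ids, used here only for illustration):

```python
def maxp(segment_run, k=1000):
    """Collapse a ranked list of (segment_id, score) pairs to documents,
    keeping each document's best-scoring segment (MaxP), then re-rank."""
    best = {}
    for seg_id, score in segment_run:
        doc_id = seg_id.split("#")[0]  # strip the segment suffix
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda kv: -kv[1])
    return ranked[:k]

# Toy segment run: d1 appears twice; only its best segment survives.
segs = [("d1#2", 9.5), ("d2#0", 9.1), ("d1#0", 8.7), ("d3#4", 7.2)]
print(maxp(segs, k=2))  # [('d1', 9.5), ('d2', 9.1)]
```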
@@ -217,30 +210,27 @@ To confirm, `msmarco_v2_doc_segmented_unicoil_0shot.tar` is 62 GB and has an MD5
Index the sparse vectors:

```bash
-python -m pyserini.index \
+python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_doc_segmented_unicoil_0shot \
--index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil.0shot \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
-  --impact \
-  --pretokenized
+  --impact --pretokenized
```

Sparse retrieval with uniCOIL:

```bash
-python -m pyserini.search \
+python -m pyserini.search.lucene \
--topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-msmarco-passage \
--index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil.0shot \
--output runs/run.msmarco-doc-v2-segmented.unicoil.0shot.txt \
-  --impact \
+  --batch 144 --threads 36 \
   --hits 10000 \
-  --batch 144 \
-  --threads 36 \
-  --max-passage-hits 1000 \
-  --max-passage
+  --max-passage --max-passage-hits 1000 \
+  --impact
```

For the document corpus, since we are searching the segmented version, we retrieve the top 10k _segments_ and perform MaxP to obtain the top 1000 _documents_.
@@ -259,7 +249,6 @@ recall_1000 all 0.9056
recall_1000 all 0.9056
```


## Reproduction Log[*](reproducibility.md)

+ Results reproduced by [@lintool](https://github.com/lintool) on 2021-08-13 (commit [`2b96b9`](https://github.com/castorini/pyserini/commit/2b96b99773302315e4d7dbe4a373b36b3eadeaa6))
56 changes: 30 additions & 26 deletions docs/experiments-spladev2.md
@@ -1,4 +1,4 @@
-# Pyserini: SPLADEv2 for MS MARCO V1 Passage Ranking
+# Pyserini: SPLADEv2 on MS MARCO V1 Passage Ranking

This page describes how to reproduce with Pyserini the DistilSPLADE-max experiments in the following paper:

@@ -7,8 +7,6 @@ This page describes how to reproduce with Pyserini the DistilSPLADE-max experime
Here, we start with a version of the MS MARCO passage corpus that has already been processed with SPLADE, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved. Since SPLADE weights are given in fp16, they have been converted to integers by rounding `weight * 100`.
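The corpus distributed here already has this quantization applied; purely as illustration, the conversion described above looks like:

```python
def quantize(weight: float, scale: int = 100) -> int:
    """Convert an fp16 SPLADE weight to an integer impact: round(weight * scale).
    Note that Python's round() uses banker's rounding at exact .5 ties."""
    return round(weight * scale)

print(quantize(0.1234))  # 12
print(quantize(1.077))   # 108
```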

-Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-splade-v2.md) based on Java.

## Data Prep

> You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
@@ -24,21 +22,23 @@ wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msm
tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/
```

-To confirm, `msmarco-passage-distill-splade-max.tar` is ~9.8 GB and has MD5 checksum `95b89a7dfd88f3685edcc2d1ffb120d1`.
+To confirm, `msmarco-passage-distill-splade-max.tar` is 9.9 GB and has MD5 checksum `95b89a7dfd88f3685edcc2d1ffb120d1`.

## Indexing

We can now index these documents:

```bash
-python -m pyserini.index -collection JsonVectorCollection \
-  -input collections/msmarco-passage-distill-splade-max \
-  -index indexes/lucene-index.msmarco-passage.distill-splade-max \
-  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-  -threads 12
+python -m pyserini.index.lucene \
+  --collection JsonVectorCollection \
+  --input collections/msmarco-passage-distill-splade-max \
+  --index indexes/lucene-index.msmarco-passage-distill-splade-max \
+  --generator DefaultLuceneDocumentGenerator \
+  --threads 12 \
+  --impact --pretokenized
```

-The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the SPLADEv2 tokens.
+The important indexing options to note here are `--impact --pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the SPLADEv2 tokens.

Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 30 minutes.
@@ -61,23 +61,25 @@ The MD5 checksum of the topics file is `621a58df9adfbba8d1a23e96d8b21cb7`.
We can now run retrieval:

```bash
-python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz \
-  --index indexes/lucene-index.msmarco-passage.distill-splade-max \
-  --output runs/run.msmarco-passage.distill-splade-max.tsv \
-  --impact \
-  --hits 1000 --batch 36 --threads 12 \
-  --output-format msmarco
+python -m pyserini.search.lucene \
+  --index indexes/lucene-index.msmarco-passage-distill-splade-max \
+  --topics collections/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz \
+  --output runs/run.msmarco-passage-distill-splade-max.tsv \
+  --output-format msmarco \
+  --batch 36 --threads 12 \
+  --hits 1000 \
+  --impact
```

-Note that the important option here is `-impact`, where we specify impact scoring.
+Note that the important option here is `--impact`, where we specify impact scoring.
A complete run can take around half an hour.

*Note from authors*: We are still investigating why retrieval takes so long with Pyserini, while the same model (including the DistilBERT query encoder forward pass on CPU) takes only **10 minutes** on similar hardware using a numba implementation of the inverted index with sequential processing (one query at a time).

The output is in MS MARCO output format, so we can directly evaluate:

```bash
-python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.distill-splade-max.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-distill-splade-max.tsv
```

The results should be as follows:
@@ -104,19 +106,21 @@ mv distilsplade_max distill-splade-max
Then run retrieval with `--encoder distill-splade-max`:

```bash
-python -m pyserini.search --topics msmarco-passage-dev-subset \
-  --index indexes/lucene-index.msmarco-passage.distill-splade-max \
-  --encoder distill-splade-max \
-  --output runs/run.msmarco-passage.distill-splade-max.tsv \
-  --impact \
-  --hits 1000 --batch 36 --threads 12 \
-  --output-format msmarco
+python -m pyserini.search.lucene \
+  --index indexes/lucene-index.msmarco-passage-distill-splade-max \
+  --topics msmarco-passage-dev-subset \
+  --encoder distill-splade-max \
+  --output runs/run.msmarco-passage-distill-splade-max.tsv \
+  --output-format msmarco \
+  --batch 36 --threads 12 \
+  --hits 1000 \
+  --impact
```

And then evaluate:

```bash
-python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.distill-splade-max.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-distill-splade-max.tsv
```

The results should be something along these lines:
