Reproduction of DistilBERT KD and SBERT experiments (#507)
lintool committed Apr 26, 2021
1 parent 683d3fa commit 4b6e900
Showing 2 changed files with 26 additions and 15 deletions.
24 changes: 18 additions & 6 deletions docs/experiments-distilbert_kd.md
@@ -1,8 +1,16 @@
# Pyserini: Reproducing DistilBERT KD Results

## Dense Retrieval
This guide provides instructions to reproduce the DistilBERT KD dense retrieval model on the MS MARCO passage ranking task, described in the following paper:

Dense retrieval with DistilBERT KD, brute-force index:
> Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. [Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation.](https://arxiv.org/abs/2010.02666) arXiv:2010.02666, October 2020.

You'll need a Pyserini [development installation](https://github.com/castorini/pyserini#development-installation) to get started.
Note that we have observed minor differences in scores between different computing environments (e.g., Linux vs. macOS).
However, the differences usually appear in the fifth digit after the decimal point, and do not appear to be a cause for concern from a reproducibility perspective.
Thus, while the scoring script provides results to much higher precision, we have intentionally rounded to four digits after the decimal point.

Dense retrieval with brute-force index:

```bash
$ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
@@ -13,15 +21,15 @@ $ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
--output runs/run.msmarco-passage.distilbert-dot-margin_mse-T2.bf.tsv \
--msmarco
```
> _Optional_: replace `--encoded-queries` by `--encoder sebastian-hofstaetter/distilbert-dot-margin_mse-T2-msmarco`
> for on-the-fly query encoding.

Replace `--encoded-queries` with `--encoder sebastian-hofstaetter/distilbert-dot-margin_mse-T2-msmarco` for on-the-fly query encoding.
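On-the-fly query encoding simply means the query text is embedded by the DistilBERT model at search time instead of being looked up in the precomputed `--encoded-queries` set. The following is a minimal sketch of what that encoding amounts to, using Hugging Face Transformers directly rather than Pyserini's own encoder classes; the [CLS]-token pooling is an assumption based on how dot-product DistilBERT rankers are typically used, and Pyserini's `--encoder` flag handles all of this internally.

```python
# Hedged sketch of on-the-fly query encoding for the DistilBERT KD run.
# Not Pyserini's internal code path; CLS pooling is an assumption.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "sebastian-hofstaetter/distilbert-dot-margin_mse-T2-msmarco"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

query = "what is the capital of france"  # illustrative query
inputs = tokenizer(query, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Take the [CLS] vector as the query representation.
q_emb = outputs.last_hidden_state[:, 0, :]  # shape (1, 768)

# Relevance is then scored as the dot product between this query vector and
# the precomputed passage vectors stored in the brute-force index.
print(q_emb.shape)
```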

To evaluate:

```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.distilbert-dot-margin_mse-T2.bf.tsv
#####################
MRR @10: 0.32505867103288255
MRR @10: 0.3251
QueriesRanked: 6980
#####################
```
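The evaluation script reports MRR@10 over the 6,980 dev queries. As a sanity check on what that number means, here is a minimal sketch that recomputes MRR@10 from the run written with `--msmarco` (tab-separated `qid`, `pid`, `rank`); the qrels path and its whitespace-separated `qid 0 pid rel` layout are assumptions, and the official script remains the reference implementation.

```python
# Hedged sketch: recompute MRR@10 from an MS MARCO-format run and qrels.
# The qrels path below is a placeholder; adjust to your local copy.
from collections import defaultdict

def load_qrels(path):
    relevant = defaultdict(set)
    with open(path) as f:
        for line in f:
            qid, _, pid, rel = line.split()
            if int(rel) > 0:
                relevant[qid].add(pid)
    return relevant

def mrr_at_10(run_path, relevant):
    best_rank = {}   # qid -> rank of the highest-ranked relevant passage
    queries = set()
    with open(run_path) as f:
        for line in f:
            qid, pid, rank = line.strip().split("\t")
            queries.add(qid)
            rank = int(rank)
            if rank <= 10 and pid in relevant.get(qid, set()):
                if qid not in best_rank or rank < best_rank[qid]:
                    best_rank[qid] = rank
    return sum(1.0 / r for r in best_rank.values()) / len(queries)

relevant = load_qrels("collections/msmarco-passage/qrels.dev.small.tsv")  # assumed path
print(mrr_at_10("runs/run.msmarco-passage.distilbert-dot-margin_mse-T2.bf.tsv", relevant))
```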
@@ -34,4 +42,8 @@ $ python -m pyserini.eval.convert_msmarco_run_to_trec_run --input runs/run.msmar
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset runs/run.msmarco-passage.distilbert-dot-margin_mse-T2.bf.trec
map all 0.3308
recall_1000 all 0.9553
```
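The conversion step turns the tab-separated MS MARCO run into the six-column TREC run format (`qid Q0 docid rank score tag`) that `trec_eval` expects. A minimal sketch of that conversion follows; the synthetic score derived from the rank and the run tag are illustrative assumptions, not necessarily what `pyserini.eval.convert_msmarco_run_to_trec_run` writes.

```python
# Hedged sketch of the MS MARCO -> TREC run conversion.
# Score and tag below are assumptions for illustration only.
def convert(msmarco_run, trec_run, tag="pyserini"):
    with open(msmarco_run) as fin, open(trec_run, "w") as fout:
        for line in fin:
            qid, pid, rank = line.strip().split("\t")
            score = 1.0 / int(rank)  # synthetic score consistent with the ranking
            fout.write(f"{qid} Q0 {pid} {rank} {score} {tag}\n")

convert("runs/run.msmarco-passage.distilbert-dot-margin_mse-T2.bf.tsv",
        "runs/run.msmarco-passage.distilbert-dot-margin_mse-T2.bf.trec")
```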

## Reproduction Log[*](reproducibility.md)

+ Results reproduced by [@lintool](https://github.com/lintool) on 2021-04-26 (commit [`854c19`](https://github.com/castorini/pyserini/commit/854c1930ba00819245c0a9fbcf2090ce14db4db0))
17 changes: 8 additions & 9 deletions docs/experiments-sbert.md
@@ -1,8 +1,8 @@
# Pyserini: Reproducing SBERT MS MARCO Results
# Pyserini: Reproducing SBERT Results

## Dense Retrieval
This guide provides instructions to reproduce the SBERT dense retrieval models for MS MARCO passage ranking (v3) described [here](https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/msmarco-v3.md).

Dense retrieval with SBERT, brute-force index:
Dense retrieval, brute-force index:

```bash
$ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
@@ -13,8 +13,8 @@ $ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
--output runs/run.msmarco-passage.sbert.bf.tsv \
--msmarco
```
> _Optional_: replace `--encoded-queries` by `--encoder sentence-transformers/msmarco-distilbert-base-v3`
> for on-the-fly query encoding.

Replace `--encoded-queries` with `--encoder sentence-transformers/msmarco-distilbert-base-v3` for on-the-fly query encoding.
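For the SBERT run, on-the-fly query encoding corresponds to embedding the query with the sentence-transformers checkpoint. Below is a minimal sketch using the sentence-transformers library directly, rather than Pyserini's internal encoder; the example queries are placeholders.

```python
# Hedged sketch of on-the-fly query encoding for the SBERT run, using the
# sentence-transformers library directly instead of Pyserini's --encoder flag.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v3")
queries = ["what is the capital of france", "how long do you cook a turkey"]  # placeholders
embeddings = model.encode(queries)  # numpy array, shape (2, 768)

# Retrieval then compares these query vectors against the precomputed passage
# vectors in the msmarco-passage-sbert-bf index.
print(embeddings.shape)
```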

To evaluate:

@@ -36,8 +36,6 @@ map all 0.3372
recall_1000 all 0.9558
```

## Hybrid Dense-Sparse Retrieval

Hybrid retrieval with dense-sparse representations (without document expansion), combining two runs (a conceptual score-fusion sketch follows this list):
- dense retrieval with SBERT, brute-force index.
- sparse retrieval with BM25 `msmarco-passage` (i.e., default bag-of-words) index.
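Conceptually, the hybrid run fuses, per query, the candidate lists and scores from the dense and sparse retrievers. The sketch below shows one common recipe (a weighted sum of min-max-normalized scores over the union of candidates); it is offered as an illustration of the idea, not necessarily the exact interpolation that `pyserini.hsearch` implements.

```python
# Hedged sketch of dense-sparse hybrid fusion for a single query: weighted sum
# of normalized scores over the union of candidates. One common recipe, not
# necessarily the exact formula pyserini.hsearch uses.
def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(dense, sparse, alpha=0.5, k=1000):
    """dense, sparse: dicts mapping docid -> score for one query."""
    dense_n, sparse_n = normalize(dense), normalize(sparse)
    fused = {}
    for docid in set(dense) | set(sparse):
        fused[docid] = alpha * dense_n.get(docid, 0.0) + (1 - alpha) * sparse_n.get(docid, 0.0)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy example with hypothetical scores:
print(hybrid({"d1": 72.3, "d2": 70.1}, {"d2": 11.8, "d3": 10.5}, alpha=0.6))
```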
@@ -52,8 +50,8 @@ $ python -m pyserini.hsearch dense --index msmarco-passage-sbert-bf \
--batch-size 36 --threads 12 \
--msmarco
```
> _Optional_: replace `--encoded-queries` by `--encoder sentence-transformers/msmarco-distilbert-base-v3`
> for on-the-fly query encoding.

Replace `--encoded-queries` with `--encoder sentence-transformers/msmarco-distilbert-base-v3` for on-the-fly query encoding.

To evaluate:

@@ -73,3 +71,4 @@ recall_1000 all 0.9659
## Reproduction Log[*](reproducibility.md)

+ Results reproduced by [@lintool](https://github.com/lintool) on 2021-04-02 (commit [`8dcf99`](https://github.com/castorini/pyserini/commit/8dcf99982a7bfd447ce9182ff219a9dad2ddd1f2))
+ Results reproduced by [@lintool](https://github.com/lintool) on 2021-04-26 (commit [`854c19`](https://github.com/castorini/pyserini/commit/854c1930ba00819245c0a9fbcf2090ce14db4db0))
