
DPR replication docs #336

Closed
lintool opened this issue Jan 20, 2021 · 14 comments

lintool (Member) commented Jan 20, 2021:

Hi @MXueguang - when everything is implemented, DPR should probably get its own separate replication page, like the one for MS MARCO: https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md

Containing sparse, hybrid, and dense retrieval.

Then we can also add a replication log as a starting point for people interested in working more on it.

MXueguang (Member) commented Jan 20, 2021:

The retrieval-stage evaluation in the original DPR paper is top-k retrieval accuracy: the percentage of questions for which at least one of the top-20/top-100 retrieved passages contains the answer.

Top20:  0.784
Top100: 0.854
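
For reference, the metric can be sketched as follows. This is a simplified version with hypothetical data structures; the actual DPR/Pyserini evaluation normalizes and tokenizes text before matching answer strings.

```python
from typing import Dict, List


def has_answer(passage_text: str, answers: List[str]) -> bool:
    # Simplified matching: the real evaluation tokenizes and normalizes
    # both the passage and the answer strings before comparing.
    text = passage_text.lower()
    return any(ans.lower() in text for ans in answers)


def top_k_accuracy(run: Dict[str, List[str]],
                   answers: Dict[str, List[str]],
                   k: int) -> float:
    # run:     question id -> ranked list of retrieved passage texts
    # answers: question id -> list of gold answer strings
    hits = sum(
        1 for qid, passages in run.items()
        if any(has_answer(p, answers[qid]) for p in passages[:k])
    )
    return hits / len(run)
```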

The original DPR repo provides the encoded corpus as a Faiss brute-force (flat) index. However, no pre-encoded queries are provided.

Using the brute-force index (provided by the DPR repo) with on-the-fly CPU query encoding (our DPRQueryEncoder), Pyserini gives:

Top20:  0.7826881352061208
Top100: 0.8516615279205207
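
For concreteness, the dense run looks roughly like the sketch below. Module and class names follow the Pyserini dense-search API of that era and may have shifted in later releases, and the prebuilt index and encoder names are illustrative rather than the exact artifacts used here.

```python
# Minimal sketch: dense retrieval over a brute-force (flat) Faiss index with
# on-the-fly query encoding. Names are illustrative and may differ by version.
from pyserini.dsearch import SimpleDenseSearcher, DprQueryEncoder

encoder = DprQueryEncoder('facebook/dpr-question_encoder-multiset-base')
searcher = SimpleDenseSearcher.from_prebuilt_index('wikipedia-dpr-multi-bf', encoder)

hits = searcher.search('who got the first nobel prize in physics', k=100)
for i, hit in enumerate(hits[:5]):
    print(f'{i + 1:2} {hit.docid} {hit.score:.5f}')
```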

Using an HNSW index (converted from their Faiss index) with on-the-fly CPU query encoding (our DPRQueryEncoder), Pyserini gives:

Top20:  0.7814319972593354
Top100: 0.8497202238209433
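
The flat-to-HNSW conversion can be sketched with plain Faiss as below. This is not the actual conversion script: the index paths and HNSW parameters (M, efConstruction, efSearch) are illustrative, and the original DPR code uses its own indexer to handle the inner-product metric.

```python
# Rough sketch: rebuild a flat (brute-force) Faiss index as an HNSW index.
import faiss

flat_index = faiss.read_index('dpr_flat.index')            # hypothetical path
vectors = flat_index.reconstruct_n(0, flat_index.ntotal)   # pull out all stored vectors

hnsw = faiss.IndexHNSWFlat(flat_index.d, 512, faiss.METRIC_INNER_PRODUCT)
hnsw.hnsw.efConstruction = 200   # build-time quality knob
hnsw.add(vectors)                # builds the HNSW graph
hnsw.hnsw.efSearch = 128         # search-time recall/speed knob
faiss.write_index(hnsw, 'dpr_hnsw.index')
```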

MXueguang (Member) commented:

Pre-encoded, CPU, and GPU query encoding all give the same evaluation results as above, but the run files from CPU vs. GPU encoding are slightly different.
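
Such tiny run-file differences are what you would expect from floating-point noise between devices. A quick (hypothetical) check, assuming a GPU is available:

```python
# Encode the same question on CPU and GPU and compare the embeddings; they
# agree only up to floating-point noise, which can perturb scores and ties.
import torch
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

name = 'facebook/dpr-question_encoder-multiset-base'
tok = DPRQuestionEncoderTokenizer.from_pretrained(name)
model = DPRQuestionEncoder.from_pretrained(name).eval()

inputs = tok('who got the first nobel prize in physics', return_tensors='pt')

with torch.no_grad():
    emb_cpu = model(**inputs).pooler_output
    model_gpu = model.to('cuda')
    emb_gpu = model_gpu(**{k: v.to('cuda') for k, v in inputs.items()}).pooler_output

print((emb_cpu - emb_gpu.cpu()).abs().max().item())  # small but nonzero
```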

lintool (Member, Author) commented Jan 23, 2021:

I know the difference is relatively small, so it might not be worth tracking down all the way... but what other possible differences are there?

MXueguang (Member) commented:

> other possible differences

I think in the retrieval stage we've checked all the potential differences that are measurable, i.e., HNSW vs. brute-force and CPU vs. GPU.

MXueguang (Member) commented:

btw BM25 retrieval on nq-dev:

Top20: 0.6327509421034601
Top100: 0.7673860911270983

In the paper, with the Lucene implementation (reported as percentages):

Top20: 59.1
Top100: 73.7
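
The BM25 baseline can be reproduced roughly as sketched below, using Anserini's default parameters (k1=0.9, b=0.4, discussed further down). The prebuilt index name is illustrative, and class/module names may differ across Pyserini versions.

```python
# Minimal sketch: BM25 retrieval over the Wikipedia/DPR passage collection.
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('wikipedia-dpr')
searcher.set_bm25(k1=0.9, b=0.4)

hits = searcher.search('who got the first nobel prize in physics', k=100)
for i, hit in enumerate(hits[:5]):
    print(f'{i + 1:2} {hit.docid} {hit.score:.4f}')
```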

lintool (Member, Author) commented Jan 23, 2021:

In other words, the paper under-reports BM25 effectiveness?

lintool (Member, Author) commented Jan 23, 2021:

> other possible differences
>
> I think in the retrieval stage we've checked all the potential differences that are measurable, i.e., HNSW vs. brute-force and CPU vs. GPU.

Well, the differences must come from somewhere? HuggingFace transformers versions? Are we sure we're encoding the questions exactly the same way?

MXueguang (Member) commented:

> In other words, the paper under-reports BM25 effectiveness?

I am wondering where the difference comes from. We are using the same BM25 parameters (i.e., the defaults k1=0.9, b=0.4). Could the difference come from the different implementations, Lucene vs. Anserini? But the differences seem big for that.

MXueguang (Member) commented:

> Well, the differences must come from somewhere? HuggingFace transformers versions? Are we sure we're encoding the questions exactly the same way?

ah, I see. versus the original implementation. There might be some inconsistency. I'll check.

MXueguang (Member) commented:

Hybrid BM25 + DPR, with our implementation and alpha = 0.24 (without tuning; not the same setting as in the paper):

Top20: 0.7951353203151764
Top100: 0.8582848007308439

In the paper (reported as percentages):

Top20:  76.6
Top100: 83.8
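
The fusion behind the hybrid run can be sketched as below. The exact normalization and weighting convention in Pyserini's hybrid searcher may differ; this just illustrates combining dense and sparse scores with a weight alpha = 0.24 on the sparse side.

```python
# Sketch: fuse dense (DPR) and sparse (BM25) scores, then re-rank.
from collections import defaultdict
from typing import Dict, List, Tuple


def hybrid_fuse(dense_scores: Dict[str, float],
                sparse_scores: Dict[str, float],
                alpha: float = 0.24,
                k: int = 100) -> List[Tuple[str, float]]:
    # Both inputs map docid -> score; a missing score counts as 0.
    fused = defaultdict(float)
    for docid, score in dense_scores.items():
        fused[docid] += score
    for docid, score in sparse_scores.items():
        fused[docid] += alpha * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]
```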

MXueguang (Member) commented:

> Well, the differences must come from somewhere? HuggingFace transformers versions? Are we sure we're encoding the questions exactly the same way?

I'll use their repo to generate a copy of the encoded queries; that will help us figure out where the difference comes from.

MXueguang (Member) commented:

I ran the original DPR repo code and got the following result:

Top20: 0.7813178029005368
Top100: 0.8498344181797419

which is close to ours (vs. the numbers in the paper).

I also found a small inconsistency in our evaluation: I included the title in the doc text, but it shouldn't be there. After fixing that, when I use their code to generate the encoded queries and our code for retrieval, I get:

Top20: 0.7813178029005368
Top100: 0.8498344181797419

which is exactly the same as the output of their code (although different from the paper)

When I use our code (our query encoding and our retrieval), I get:

Top20: 0.7813178029005368
Top100: 0.8499486125385406

In summary:

  • Their original repo gives results close enough to our implementation (both differ a bit from the numbers in the paper).
  • Our query embedding leads to a Top100 score about 0.0001 higher, which is a tiny difference. I believe this is caused upstream (transformers), since I follow the same encoding procedure, e.g., no padding, with truncation; see the sketch after this list.
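
The question-encoding path can be sketched as below, assuming the HuggingFace DPR question encoder; the model name and max length are illustrative, and this is not the exact Pyserini code.

```python
# Sketch: encode a question as described above (no padding, truncation on).
# Version-to-version numerical differences in transformers would surface here.
import torch
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

name = 'facebook/dpr-question_encoder-multiset-base'
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(name)
model = DPRQuestionEncoder.from_pretrained(name).eval()


def encode_question(question: str) -> torch.Tensor:
    inputs = tokenizer(question, padding=False, truncation=True,
                       max_length=256, return_tensors='pt')
    with torch.no_grad():
        return model(**inputs).pooler_output.squeeze(0)   # 768-dim embedding


print(encode_question('who got the first nobel prize in physics').shape)
```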

lintool (Member, Author) commented Jan 24, 2021:

Okay, so my understanding is that results from our code base are very close to results from their code base, but results from their code base are slightly lower than what they report in their paper.

If this is the case, then yes, I agree the differences are upstream, so there's nothing we can do. We can consider this issue closed and the results successfully replicated.

lintool (Member, Author) commented Jan 26, 2021:

Closed by #346

lintool closed this as completed Jan 26, 2021.