DPR replication docs #336
The retrieval stage evaluation in the original DPR paper is:
The original DPR repo provides the encoded corpus as a Faiss brute-force (flat) index. Using the brute-force index (provided by the DPR repo):
Using the HNSW index (converted from their Faiss index):
Pre-encoded/CPU/GPU query encoding gives the same results as above.
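For context, a minimal sketch of the two index types being compared, with random vectors standing in for the 768-d DPR embeddings (dimensions and parameters here are illustrative; the DPR converter appends an auxiliary dimension so that L2-based HNSW reproduces inner-product ranking):

```python
import faiss
import numpy as np

d = 768  # DPR embedding dimension
xb = np.random.rand(10000, d).astype('float32')  # stand-in passage embeddings
xq = np.random.rand(16, d).astype('float32')     # stand-in query embeddings

# Brute-force inner-product (flat) index, as provided by the DPR repo.
flat = faiss.IndexFlatIP(d)
flat.add(xb)
flat_scores, flat_ids = flat.search(xq, 100)

# HNSW conversion: Faiss HNSW is L2-based, so append an auxiliary
# dimension that turns max inner product into min L2 distance.
norms = (xb ** 2).sum(axis=1)
phi = norms.max()
aux = np.sqrt(phi - norms).astype('float32').reshape(-1, 1)
xb_l2 = np.hstack([xb, aux])
xq_l2 = np.hstack([xq, np.zeros((xq.shape[0], 1), dtype='float32')])

hnsw = faiss.IndexHNSWFlat(d + 1, 512)  # large neighbor count, as in DPR's converter
hnsw.hnsw.efSearch = 128
hnsw.add(xb_l2)
hnsw_dists, hnsw_ids = hnsw.search(xq_l2, 100)
```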
I know the difference is relatively small, so it might not be worth tracking down all the way... but what other possible differences are there?
I think we've checked all the potential differences in the retrieval stage that are measurable?
btw, BM25 retrieval on nq-dev:
In the paper, with their Lucene implementation:
In other words, the paper under-reports BM25 effectiveness?
Well, the differences must come from somewhere? HuggingFace versions? Are we sure we're encoding the questions exactly the same way?
I am wondering where the difference comes from. We are using the same BM25 parameters (i.e., the defaults k1=0.9, b=0.4). Could the difference come from a different implementation, their Lucene code vs. Anserini? But the differences seem big here.
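For reference, this is the kind of Anserini-side run being compared; a sketch assuming Pyserini's prebuilt wikipedia-dpr index (the index name and query are illustrative):

```python
from pyserini.search import SimpleSearcher

# BM25 retrieval over the DPR Wikipedia corpus with the default
# parameters discussed above (k1=0.9, b=0.4).
searcher = SimpleSearcher.from_prebuilt_index('wikipedia-dpr')
searcher.set_bm25(0.9, 0.4)

hits = searcher.search('who got the first nobel prize in physics', k=100)
for i, hit in enumerate(hits[:5]):
    print(f'{i + 1:2} {hit.docid:15} {hit.score:.5f}')
```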
Ah, I see, versus the original implementation. There might be some inconsistency. I'll check.
Hybrid BM25+DPR, with our implementation and alpha = 0.24 (without tuning; not the same as the paper):
In the paper:
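Mechanically, the hybrid run interpolates the two score lists. A minimal sketch, assuming per-query docid-to-score dicts from the sparse and dense runs; weighting the sparse score by alpha and backfilling missing documents with the minimum observed score follows Pyserini's hybrid approach, but the helper itself is illustrative:

```python
def hybrid_fusion(sparse_scores, dense_scores, alpha=0.24, k=100):
    """Linearly interpolate sparse and dense scores: alpha * sparse + dense."""
    min_sparse = min(sparse_scores.values())
    min_dense = min(dense_scores.values())
    fused = {}
    for docid in set(sparse_scores) | set(dense_scores):
        s = sparse_scores.get(docid, min_sparse)  # backfill missing docs
        d = dense_scores.get(docid, min_dense)
        fused[docid] = alpha * s + d
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]
```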
I'll use their repo to generate a copy of the encoded queries. It will help us figure out where the difference comes from.
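In the meantime, a sketch of query encoding on our side via HuggingFace, using the public single-nq checkpoint (assumed to correspond to the checkpoint used in these runs):

```python
import torch
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

name = 'facebook/dpr-question_encoder-single-nq-base'
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(name)
model = DPRQuestionEncoder.from_pretrained(name)
model.eval()

with torch.no_grad():
    inputs = tokenizer('who got the first nobel prize in physics',
                       return_tensors='pt')
    embedding = model(**inputs).pooler_output  # shape: (1, 768)
```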
I ran the original DPR repo code and got the following result:
which is close to ours (vs. the number in the paper). I found a small inconsistency in our evaluation: I included the title in the doc text, but it shouldn't be there.
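A simplified sketch of that fix (the helper is hypothetical, and the real evaluation uses token-level matching rather than plain substring matching):

```python
import unicodedata

def _norm(s):
    return unicodedata.normalize('NFD', s.lower())

# Check whether any gold answer appears in the retrieved passage.
# The fix: match against the passage text only, not title + text.
def has_answer(answers, title, text):
    body = _norm(text)  # previously: _norm(title + ' ' + text)
    return any(_norm(a) in body for a in answers)
```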
After fixing that, our result is exactly the same as the output of their code (although different from the paper). When I use our code (our query encoding and our retrieval), I get:
In summary:
Okay, so my understanding is that results from our code base are very close to results from their code base, but results from their code base are slightly lower than what they report in their paper. If this is the case, then yes, I agree the differences are upstream, so there's nothing we can do. We can consider this issue closed and the results successfully replicated.
Closed by #346
Hi @MXueguang, when everything is implemented, DPR should probably get its own separate replication page, like the one for MS MARCO: https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md
It should cover sparse, hybrid, and dense retrieval.
Then we can also add a replication log, as a starting point for people interested in working more on it.