
DPR replication docs #336

Closed
lintool opened this issue Jan 20, 2021 · 14 comments

lintool (Member) commented Jan 20, 2021:

Hi @MXueguang - when everything is implemented, DPR should probably get its own separate replication page, like the one for MS MARCO: https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md

Containing sparse, hybrid, and dense retrieval.

Then we can also add a replication log as a starting point for people interested in working more on it.

MXueguang (Member) commented Jan 20, 2021:

The retrieval-stage evaluation in the original DPR paper is top-k retrieval accuracy: the percentage of questions for which at least one of the top-20/top-100 retrieved passages contains the answer.

Top20:  0.784
Top100: 0.854
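
For reference, the metric can be sketched as follows. This is a simplified version with hypothetical data structures; the actual DPR/Pyserini evaluation normalizes and tokenizes text before matching answer strings.

```python
from typing import Dict, List


def has_answer(passage_text: str, answers: List[str]) -> bool:
    # Simplified matching: the real evaluation tokenizes and normalizes
    # both the passage and the answer strings before comparing.
    text = passage_text.lower()
    return any(ans.lower() in text for ans in answers)


def top_k_accuracy(run: Dict[str, List[str]],
                   answers: Dict[str, List[str]],
                   k: int) -> float:
    # run:     question id -> ranked list of retrieved passage texts
    # answers: question id -> list of gold answer strings
    hits = sum(
        1 for qid, passages in run.items()
        if any(has_answer(p, answers[qid]) for p in passages[:k])
    )
    return hits / len(run)
```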

The original DPR repo provides the encoded corpus as a Faiss brute-force (flat) index. However, no pre-encoded queries are provided.

Using the brute-force index (provided by the DPR repo) with on-the-fly CPU query encoding (our DPRQueryEncoder), Pyserini gives:

Top20:  0.7826881352061208
Top100: 0.8516615279205207
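
For concreteness, the dense run looks roughly like the sketch below. Module and class names follow the Pyserini dense-search API of that era and may have shifted in later releases, and the prebuilt index and encoder names are illustrative rather than the exact artifacts used here.

```python
# Minimal sketch: dense retrieval over a brute-force (flat) Faiss index with
# on-the-fly query encoding. Names are illustrative and may differ by version.
from pyserini.dsearch import SimpleDenseSearcher, DprQueryEncoder

encoder = DprQueryEncoder('facebook/dpr-question_encoder-multiset-base')
searcher = SimpleDenseSearcher.from_prebuilt_index('wikipedia-dpr-multi-bf', encoder)

hits = searcher.search('who got the first nobel prize in physics', k=100)
for i, hit in enumerate(hits[:5]):
    print(f'{i + 1:2} {hit.docid} {hit.score:.5f}')
```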

Using an HNSW index (converted from their Faiss index) with on-the-fly CPU query encoding (our DPRQueryEncoder), Pyserini gives:

Top20:  0.7814319972593354
Top100: 0.8497202238209433
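
The flat-to-HNSW conversion can be sketched with plain Faiss as below. This is not the actual conversion script: the index paths and HNSW parameters (M, efConstruction, efSearch) are illustrative, and the original DPR code uses its own indexer to handle the inner-product metric.

```python
# Rough sketch: rebuild a flat (brute-force) Faiss index as an HNSW index.
import faiss

flat_index = faiss.read_index('dpr_flat.index')            # hypothetical path
vectors = flat_index.reconstruct_n(0, flat_index.ntotal)   # pull out all stored vectors

hnsw = faiss.IndexHNSWFlat(flat_index.d, 512, faiss.METRIC_INNER_PRODUCT)
hnsw.hnsw.efConstruction = 200   # build-time quality knob
hnsw.add(vectors)                # builds the HNSW graph
hnsw.hnsw.efSearch = 128         # search-time recall/speed knob
faiss.write_index(hnsw, 'dpr_hnsw.index')
```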

MXueguang (Member) commented:

Pre-encoded, CPU, and GPU query encoding all give the same evaluation results as above, but the run files from CPU vs. GPU encoding are slightly different.
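
Such tiny run-file differences are what you would expect from floating-point noise between devices. A quick (hypothetical) check, assuming a GPU is available:

```python
# Encode the same question on CPU and GPU and compare the embeddings; they
# agree only up to floating-point noise, which can perturb scores and ties.
import torch
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

name = 'facebook/dpr-question_encoder-multiset-base'
tok = DPRQuestionEncoderTokenizer.from_pretrained(name)
model = DPRQuestionEncoder.from_pretrained(name).eval()

inputs = tok('who got the first nobel prize in physics', return_tensors='pt')

with torch.no_grad():
    emb_cpu = model(**inputs).pooler_output
    model_gpu = model.to('cuda')
    emb_gpu = model_gpu(**{k: v.to('cuda') for k, v in inputs.items()}).pooler_output

print((emb_cpu - emb_gpu.cpu()).abs().max().item())  # small but nonzero
```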

lintool (Member, Author) commented Jan 23, 2021:

I know the difference is relatively small, so it might not be worth tracking down all the way... but what other possible differences are there?

MXueguang (Member) commented:

> other possible differences

I think in the retrieval stage we've checked all the potential differences that are measurable, i.e., HNSW vs. brute-force and CPU vs. GPU.

MXueguang (Member) commented:

btw BM25 retrieval on nq-dev:

Top20: 0.6327509421034601
Top100: 0.7673860911270983

In the paper, with the Lucene implementation (reported as percentages):

Top20: 59.1
Top100: 73.7
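
The BM25 baseline can be reproduced roughly as sketched below, using Anserini's default parameters (k1=0.9, b=0.4, discussed further down). The prebuilt index name is illustrative, and class/module names may differ across Pyserini versions.

```python
# Minimal sketch: BM25 retrieval over the Wikipedia/DPR passage collection.
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('wikipedia-dpr')
searcher.set_bm25(k1=0.9, b=0.4)

hits = searcher.search('who got the first nobel prize in physics', k=100)
for i, hit in enumerate(hits[:5]):
    print(f'{i + 1:2} {hit.docid} {hit.score:.4f}')
```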

lintool (Member, Author) commented Jan 23, 2021:

In other words, the paper under-reports BM25 effectiveness?

lintool (Member, Author) commented Jan 23, 2021:

> other possible differences
>
> I think in the retrieval stage we've checked all the potential differences that are measurable, i.e., HNSW vs. brute-force and CPU vs. GPU.

Well, the differences must come from somewhere? HuggingFace transformers versions? Are we sure we're encoding the questions exactly the same way?

MXueguang (Member) commented:

> In other words, the paper under-reports BM25 effectiveness?

I am wondering where the difference comes from. We are using the same BM25 parameters (i.e., the defaults k1=0.9, b=0.4). Could the difference come from the different implementations, Lucene vs. Anserini? But the differences seem big for that.

MXueguang (Member) commented:

> Well, the differences must come from somewhere? HuggingFace transformers versions? Are we sure we're encoding the questions exactly the same way?

ah, I see. versus the original implementation. There might be some inconsistency. I'll check.

MXueguang (Member) commented:

Hybrid BM25 + DPR, with our implementation and alpha = 0.24 (without tuning; not the same setting as in the paper):

Top20: 0.7951353203151764
Top100: 0.8582848007308439

In the paper (reported as percentages):

Top20:  76.6
Top100: 83.8
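
The fusion behind the hybrid run can be sketched as below. The exact normalization and weighting convention in Pyserini's hybrid searcher may differ; this just illustrates combining dense and sparse scores with a weight alpha = 0.24 on the sparse side.

```python
# Sketch: fuse dense (DPR) and sparse (BM25) scores, then re-rank.
from collections import defaultdict
from typing import Dict, List, Tuple


def hybrid_fuse(dense_scores: Dict[str, float],
                sparse_scores: Dict[str, float],
                alpha: float = 0.24,
                k: int = 100) -> List[Tuple[str, float]]:
    # Both inputs map docid -> score; a missing score counts as 0.
    fused = defaultdict(float)
    for docid, score in dense_scores.items():
        fused[docid] += score
    for docid, score in sparse_scores.items():
        fused[docid] += alpha * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]
```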

MXueguang (Member) commented:

> Well, the differences must come from somewhere? HuggingFace transformers versions? Are we sure we're encoding the questions exactly the same way?

I'll use their repo to generate a copy of the encoded queries; that will help us figure out where the difference comes from.

MXueguang (Member) commented:

I ran the original DPR repo code and got the following result:

Top20: 0.7813178029005368
Top100: 0.8498344181797419

which is close to ours (vs. the numbers in the paper).

I also found a small inconsistency in our evaluation: I included the title in the doc text, but it shouldn't be there. After fixing that, when I use their code to generate the encoded queries and our code for retrieval, I get:

Top20: 0.7813178029005368
Top100: 0.8498344181797419

which is exactly the same as the output of their code (although different from the paper)

When I use our code (our query encoding and our retrieval), I get:

Top20: 0.7813178029005368
Top100: 0.8499486125385406

In summary:

  • Their original repo gives results close enough to our implementation (both differ a bit from the numbers in the paper).
  • Our query embedding leads to a Top100 score about 0.0001 higher, which is a tiny difference. I believe this is caused upstream (transformers), since I follow the same encoding procedure, e.g., no padding, with truncation; see the sketch after this list.
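
The question-encoding path can be sketched as below, assuming the HuggingFace DPR question encoder; the model name and max length are illustrative, and this is not the exact Pyserini code.

```python
# Sketch: encode a question as described above (no padding, truncation on).
# Version-to-version numerical differences in transformers would surface here.
import torch
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

name = 'facebook/dpr-question_encoder-multiset-base'
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(name)
model = DPRQuestionEncoder.from_pretrained(name).eval()


def encode_question(question: str) -> torch.Tensor:
    inputs = tokenizer(question, padding=False, truncation=True,
                       max_length=256, return_tensors='pt')
    with torch.no_grad():
        return model(**inputs).pooler_output.squeeze(0)   # 768-dim embedding


print(encode_question('who got the first nobel prize in physics').shape)
```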

lintool (Member, Author) commented Jan 24, 2021:

Okay, so my understanding is that results from our code base are very close to results from their code base, but results from their code base are slightly lower than what they report in their paper.

If this is the case, then yes, I agree the differences are upstream, so there's nothing we can do. We can consider this issue closed and the results successfully replicated.

lintool (Member, Author) commented Jan 26, 2021:

Closed by #346

lintool closed this as completed Jan 26, 2021.