Dense retrieval: incorporate DPR collections #294

Closed
lintool opened this issue Jan 5, 2021 · 12 comments

@lintool (Member) commented Jan 5, 2021

We can fold all the DPR collections into Pyserini, so that the retriever stage of a QA system can be run directly in Pyserini.

@MXueguang (Member):

How about we make the current QueryEncoder an abstract class, and add subclasses TCTColBERTQueryEncoder and DPRQueryEncoder?

DPRQueryEncoder would wrap https://huggingface.co/transformers/model_doc/dpr.html#transformers.DPRQuestionEncoder
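
A minimal sketch of what that hierarchy could look like (class and method names here are illustrative, not the final Pyserini API); DPRQueryEncoder wraps the Hugging Face DPRQuestionEncoder linked above:

```python
from abc import ABC, abstractmethod

import numpy as np
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer


class QueryEncoder(ABC):
    @abstractmethod
    def encode(self, query: str) -> np.ndarray:
        """Encode a query string into a dense vector."""


class DPRQueryEncoder(QueryEncoder):
    def __init__(self, model_name: str = 'facebook/dpr-question_encoder-single-nq-base'):
        self.tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(model_name)
        self.model = DPRQuestionEncoder.from_pretrained(model_name)

    def encode(self, query: str) -> np.ndarray:
        inputs = self.tokenizer(query, return_tensors='pt')
        outputs = self.model(**inputs)
        # pooler_output is the [CLS]-based question embedding (768-dim for the base model).
        return outputs.pooler_output.detach().numpy().flatten()
```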

@lintool (Member, Author) commented Jan 15, 2021

Yes, I think this is the right approach, although TCTColBERTQueryEncoder looks really ugly. I don't have any better suggestions though.

@lintool (Member, Author) commented Jan 15, 2021

As an aside, this also means that at some point we'll need to build sparse indexes for the Wikipedia collection used in DPR.

@lintool (Member, Author) commented Jan 17, 2021

Ref: #325 - code merged!

@MXueguang We need a replication guide for this also...

Currently, we have: https://github.com/castorini/pyserini/blob/master/docs/dense-retrieval.md

Would it make sense to break it into:

  • dense-retrieval-msmarco-passage.md
  • dense-retrieval-msmarco-doc.md
  • dense-retrieval-dpr.md

Thoughts?

@MXueguang (Member):

Yes.
For msmarco-doc: we'll do that after we finish the msmarco-doc experiment.
For DPR, I guess we need to evaluate the results via the downstream QA evaluation?

@lintool (Member, Author) commented Jan 17, 2021

For msmarco-doc: we'll do that after we finish the msmarco-doc experiment.

Yup.

For DPR, I guess we need to evaluate the results via the downstream QA evaluation?

No, let's focus only on the retriever stage. The architecture is retriever-reader, right? And the DPR paper reports the component-level effectiveness of the retriever stage alone. Let's try to match those numbers.

@MXueguang (Member):

How do we deal with the DPR retrieval evaluation, given that it differs from regular IR tasks (i.e., evaluation against qrels)?
Two solutions:

  1. Write a script to evaluate DPR. This is straightforward; a rough sketch follows below.
  2. Craft a qrels file: given a question, label each document 1 if it contains the answer to that question, and create a topic file as well. This would make DPR work the same way as the other tasks.
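
A rough sketch of what option (1) might look like (this is not the official DPR evaluation code; it uses plain substring matching, whereas DPR normalizes and tokenizes answers, so scores may differ slightly):

```python
from typing import Dict, List


def has_answer(passage_text: str, answers: List[str]) -> bool:
    # Simple case-insensitive substring match; the official DPR code does
    # tokenization/normalization, so this is only an approximation.
    text = passage_text.lower()
    return any(answer.lower() in text for answer in answers)


def top_k_accuracy(run: Dict[str, List[str]], gold: Dict[str, List[str]], k: int) -> float:
    # run:  question id -> ranked list of retrieved passage texts
    # gold: question id -> list of acceptable answer strings
    hits = sum(
        1 for qid, passages in run.items()
        if any(has_answer(p, gold[qid]) for p in passages[:k])
    )
    return hits / len(run)


# e.g., top_k_accuracy(run, gold, k=20) and top_k_accuracy(run, gold, k=100)
# correspond to the Top20/Top100 numbers reported in the DPR paper.
```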

@lintool (Member, Author) commented Jan 19, 2021

Let's do (1) for now and just check in the official DPR eval script, just like we've checked in the MS MARCO scripts. We might want to put it in tools/ so PyGaggle can use it as well, right @ronakice?

@MXueguang (Member) commented Jan 19, 2021

Hmm, I don't think they have an official "script" for evaluation; they wrap the evaluation inside their retrieval functions. I am evaluating with a script I wrote myself.

@MXueguang (Member):

With my script, I am getting:

Top20: 0.7794906931597579
Top100: 0.8460660043393856

Theirs are:

Top20: 0.784
Top100: 0.854

A bit lower, but I am using the HNSW index right now; I'll evaluate with the brute-force (flat) index next.
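
For context on the HNSW vs. brute-force gap, here is a small Faiss sketch of the two index types (illustrative only, using random vectors rather than Pyserini's prebuilt indexes; HNSW search is approximate, so a slightly lower top-k accuracy is expected):

```python
import faiss
import numpy as np

d = 768  # DPR embedding dimension
passages = np.random.rand(10000, d).astype('float32')  # stand-in passage embeddings
queries = np.random.rand(5, d).astype('float32')       # stand-in query embeddings

# Brute-force ("bf") flat inner-product index: exact maximum inner product search.
bf_index = faiss.IndexFlatIP(d)
bf_index.add(passages)

# HNSW index: approximate search, much faster on large collections,
# but recall can be slightly lower than with the flat index.
hnsw_index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
hnsw_index.add(passages)

exact_scores, exact_ids = bf_index.search(queries, 100)
approx_scores, approx_ids = hnsw_index.search(queries, 100)
```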

@MXueguang (Member):

Closed by #335.

@MXueguang (Member):

Will continue the discussion about the replication results in #336.
