Dense retrieval: incorporate DPR collections #294

Closed
lintool opened this issue Jan 5, 2021 · 12 comments

@lintool (Member) commented Jan 5, 2021

We can fold all the DPR collections into Pyserini, so that the retriever stage of a QA system can be run directly in Pyserini.

@MXueguang (Member):

How about we make the current QueryEncoder an abstract class, and add subclasses TCTColBERTQueryEncoder and DPRQueryEncoder?

DPRQueryEncoder would wrap https://huggingface.co/transformers/model_doc/dpr.html#transformers.DPRQuestionEncoder
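
A minimal sketch of what that hierarchy could look like (class and method names here are illustrative, not the final Pyserini API); DPRQueryEncoder wraps the Hugging Face DPRQuestionEncoder linked above:

```python
from abc import ABC, abstractmethod

import numpy as np
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer


class QueryEncoder(ABC):
    @abstractmethod
    def encode(self, query: str) -> np.ndarray:
        """Encode a query string into a dense vector."""


class DPRQueryEncoder(QueryEncoder):
    def __init__(self, model_name: str = 'facebook/dpr-question_encoder-single-nq-base'):
        self.tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(model_name)
        self.model = DPRQuestionEncoder.from_pretrained(model_name)

    def encode(self, query: str) -> np.ndarray:
        inputs = self.tokenizer(query, return_tensors='pt')
        outputs = self.model(**inputs)
        # pooler_output is the [CLS]-based question embedding (768-dim for the base model).
        return outputs.pooler_output.detach().numpy().flatten()
```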

@lintool (Member, Author) commented Jan 15, 2021

Yes, I think this is the right approach, although TCTColBERTQueryEncoder looks really ugly. I don't have any better suggestions though.

@lintool (Member, Author) commented Jan 15, 2021

As an aside, this also means that at some point we'll need to build sparse indexes for the Wikipedia collection used in DPR.

@lintool (Member, Author) commented Jan 17, 2021

Ref: #325 - code merged!

@MXueguang We need a replication guide for this also...

Currently, we have: https://github.com/castorini/pyserini/blob/master/docs/dense-retrieval.md

Would it make sense to break it into:

  • dense-retrieval-msmarco-passage.md
  • dense-retrieval-msmarco-doc.md
  • dense-retrieval-dpr.md

Thoughts?

@MXueguang (Member):

Yes.
For msmarco-doc: we'll do that after we finish the msmarco-doc experiment.
For DPR, I guess we need to evaluate the results via the downstream QA evaluation?

@lintool (Member, Author) commented Jan 17, 2021

For msmarco-doc: we'll do that after we finish the msmarco-doc experiment.

Yup.

For DPR, I guess we need to evaluate the results via the downstream QA evaluation?

No, let's focus only on the retriever stage. The architecture is retriever-reader, right? And the DPR paper reports the component-level effectiveness of the retriever stage alone. Let's try to match those numbers.

@MXueguang (Member):

How do we deal with the DPR retrieval evaluation, given that it differs from regular IR tasks (i.e., evaluation against qrels)?
Two solutions:

  1. Write a script to evaluate DPR. This is straightforward; a rough sketch follows below.
  2. Craft a qrels file: given a question, label each document 1 if it contains the answer to that question, and create a topic file as well. This would make DPR work the same way as the other tasks.
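
A rough sketch of what option (1) might look like (this is not the official DPR evaluation code; it uses plain substring matching, whereas DPR normalizes and tokenizes answers, so scores may differ slightly):

```python
from typing import Dict, List


def has_answer(passage_text: str, answers: List[str]) -> bool:
    # Simple case-insensitive substring match; the official DPR code does
    # tokenization/normalization, so this is only an approximation.
    text = passage_text.lower()
    return any(answer.lower() in text for answer in answers)


def top_k_accuracy(run: Dict[str, List[str]], gold: Dict[str, List[str]], k: int) -> float:
    # run:  question id -> ranked list of retrieved passage texts
    # gold: question id -> list of acceptable answer strings
    hits = sum(
        1 for qid, passages in run.items()
        if any(has_answer(p, gold[qid]) for p in passages[:k])
    )
    return hits / len(run)


# e.g., top_k_accuracy(run, gold, k=20) and top_k_accuracy(run, gold, k=100)
# correspond to the Top20/Top100 numbers reported in the DPR paper.
```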

@lintool (Member, Author) commented Jan 19, 2021

Let's do (1) for now and just check in the official DPR eval script, just like we've checked in the MS MARCO scripts. We might want to put it in tools/ so PyGaggle can use it as well, right @ronakice?

@MXueguang (Member) commented Jan 19, 2021

Hmm, I don't think they have an official "script" for evaluation; they wrap the evaluation inside their retrieval functions. I am evaluating with a script I wrote myself.

@MXueguang (Member):

With my script, I am getting:

Top20: 0.7794906931597579
Top100: 0.8460660043393856

Theirs are:

Top20: 0.784
Top100: 0.854

A bit lower, but I am using the HNSW index right now; I'll evaluate with the brute-force (flat) index next.
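
For context on the HNSW vs. brute-force gap, here is a small Faiss sketch of the two index types (illustrative only, using random vectors rather than Pyserini's prebuilt indexes; HNSW search is approximate, so a slightly lower top-k accuracy is expected):

```python
import faiss
import numpy as np

d = 768  # DPR embedding dimension
passages = np.random.rand(10000, d).astype('float32')  # stand-in passage embeddings
queries = np.random.rand(5, d).astype('float32')       # stand-in query embeddings

# Brute-force ("bf") flat inner-product index: exact maximum inner product search.
bf_index = faiss.IndexFlatIP(d)
bf_index.add(passages)

# HNSW index: approximate search, much faster on large collections,
# but recall can be slightly lower than with the flat index.
hnsw_index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
hnsw_index.add(passages)

exact_scores, exact_ids = bf_index.search(queries, 100)
approx_scores, approx_ids = hnsw_index.search(queries, 100)
```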

@MXueguang (Member):

Closed by #335.

@MXueguang (Member):

Will continue the discussion about the replication results in #336.
