Seeking KILT meta-data for DPR #186

jzhoubu · 2021-09-18T10:00:05Z

Hi, I noticed that some updates have been made here to support the KILT dataset format.
In addition, I wonder whether the following meta-data from KILT could also be shared here:

the 22,220,793 passages split from the KILT knowledge source
the corresponding passage_id of positive and negative passages for each query of the NQ dataset (mined by the DPR checkpoint)

Both pieces of meta-data are necessary for reproducing or improving DPR on KILT, and I think it would be more convenient for people to follow up if they were shared. Thanks.


vlad-karpukhin · 2021-09-20T20:23:46Z

Hi @sysu-zjw,
Unfortunately, I don't have time to add all our code changes for the KILT multi-task training setup, but I can share some of the data we used:

https://dl.fbaipublicfiles.com/ur/wikipedia_split/psgs_w100.tsv.gz - this is the KILT passages set split into DPR 100-words format. It has ~ 34 mln passages afair

Datasets with new adversarial hard negative samples:
https://dl.fbaipublicfiles.com/ur/data/retriever/trivia-train-adv.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/hotpot-train-adv.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/nq-train-adv.json.gz

Validation datasets in the tsv format:
https://dl.fbaipublicfiles.com/ur/data/retriever/trex-dev-kilt.tsv
https://dl.fbaipublicfiles.com/ur/data/retriever/zeroshot-dev-kilt.tsv
https://dl.fbaipublicfiles.com/ur/data/retriever/hotpot-dev-kilt.tsv
https://dl.fbaipublicfiles.com/ur/data/retriever/trivia-dev-kilt.tsv
https://dl.fbaipublicfiles.com/ur/data/retriever/nq-dev-kilt.tsv

jzhoubu · 2021-09-21T04:06:33Z

Thanks, @vlad-karpukhin . This helps a lot.

According to footnote 21 in the KILT paper, they use 22,220,793 passages for the DPR reproduction, while the version you provided has 34 mln (35,678,076). Can you confirm whether this is the same corpus used to reproduce the scores reported in the KILT paper? Thanks.

vlad-karpukhin · 2021-09-21T05:03:51Z

34 mln is what we got when we split KILT's Wikipedia set into 100-word passages.
Reproducing the results reported in the KILT paper is tricky and requires some code modifications on both the DPR and the KILT evaluation sides. KILT has its own codebase suitable for replication.
The links above are the data used in this paper (https://arxiv.org/pdf/2101.00117.pdf)
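For readers who want to see what a 100-word split looks like in practice, here is a minimal sketch. It assumes plain whitespace tokenization and non-overlapping chunks; the function name `split_into_passages` is hypothetical, and this is not necessarily the exact procedure used to build the file linked above.

```python
from typing import Iterator, List


def split_into_passages(text: str, words_per_passage: int = 100) -> Iterator[str]:
    """Yield consecutive, non-overlapping chunks of at most `words_per_passage` words."""
    # Assumption: words are separated by whitespace; the real pipeline may tokenize differently.
    words: List[str] = text.split()
    for start in range(0, len(words), words_per_passage):
        yield " ".join(words[start:start + words_per_passage])


# Toy article of 250 words: two full passages and one shorter tail.
article = " ".join(f"w{i}" for i in range(250))
passages = list(split_into_passages(article))
print([len(p.split()) for p in passages])  # [100, 100, 50]
```

Whether the short tail is kept, padded, or merged into the previous passage changes the passage count, so a different choice here may give a different total on the same source.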

vlad-karpukhin · 2021-09-21T05:05:31Z

UPD: the dataset above actually has 36 mln passages, not 34 mln

jzhoubu · 2021-09-21T05:34:48Z

Thanks @vlad-karpukhin. I just found that 48950 passages are NaN in the text field. Could you confirm whether this is fine for the following experiments? Thanks.
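A check like the one described above can be sketched with the standard library alone. The column layout (id, text, title, with a header row) is an assumption about the file, and `count_missing_text` is a hypothetical helper, not part of DPR.

```python
import csv
import io
from typing import Iterable, Tuple


def count_missing_text(lines: Iterable[str], text_col: int = 1,
                       skip_header: bool = True) -> Tuple[int, int]:
    """Return (total_rows, rows_with_missing_text) for a tab-separated passage file."""
    reader = csv.reader(lines, delimiter="\t")
    if skip_header:
        next(reader, None)
    total = missing = 0
    for row in reader:
        total += 1
        # A row counts as missing when the text column is absent or blank.
        if len(row) <= text_col or not row[text_col].strip():
            missing += 1
    return total, missing


# Toy input: one good row, one row with empty text, one truncated row.
sample = io.StringIO("id\ttext\ttitle\n1\tsome passage text\tA\n2\t\tB\n3\n")
print(count_missing_text(sample))  # (3, 2)
```

For the real file, the same function could be fed from `gzip.open(path, "rt", encoding="utf-8", newline="")`. If the NaN values were observed through pandas, note that `pandas.read_csv` by default also converts strings such as "NA" or "null" to NaN (`keep_default_na=False` disables this); whether that applies here is not established in this thread.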

light42 · 2021-09-30T16:26:27Z

https://dl.fbaipublicfiles.com/ur/wikipedia_split/psgs_w100.tsv.gz - this is the KILT passages set split into DPR 100-words format. It has ~ 34 mln passages afair

@vlad-karpukhin What's the format of the tsv file? I want to experiment using my own passages.

vlad-karpukhin · 2021-09-30T16:49:10Z

The format is flexible.
Have a look at DPR/dpr/data/retriever_data.py, line 224 in 1ee31c6:

class CsvCtxSrc(RetrieverData):

You can configure which column in your tsv is the id, body or title.
Or you can write your own data parsing class; you just need to implement the load_data_to() method.
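To illustrate the idea of configurable columns, here is a standalone sketch that reads a TSV into a dictionary keyed by passage id. It does not use or reproduce the DPR classes; `Passage`, `load_passages` and the parameter names are hypothetical, and the real `CsvCtxSrc` configuration may look different.

```python
import csv
import io
from typing import Dict, Iterable, NamedTuple


class Passage(NamedTuple):
    text: str
    title: str


def load_passages(lines: Iterable[str], id_col: int = 0, text_col: int = 1,
                  title_col: int = 2, skip_header: bool = True) -> Dict[str, Passage]:
    """Read tab-separated rows and map each passage id to its text and title."""
    reader = csv.reader(lines, delimiter="\t")
    if skip_header:
        next(reader, None)
    passages: Dict[str, Passage] = {}
    for row in reader:
        passages[row[id_col]] = Passage(row[text_col], row[title_col])
    return passages


# Hypothetical file whose columns are ordered title, id, text.
custom = io.StringIO("title\tid\ttext\nParis\tp1\tParis is the capital of France.\n")
ctxs = load_passages(custom, id_col=1, text_col=2, title_col=0)
print(ctxs["p1"].title)  # Paris
```

Passing the column indices as parameters keeps the parsing logic independent of any particular file layout, which is the same idea as configuring the id, body and title columns.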

light42 · 2021-09-30T16:55:20Z

Thanks, that really cleared things up.

oklen · 2022-07-18T08:32:11Z

https://dl.fbaipublicfiles.com/ur/data/retriever/aidayago2-dev-multikilt.json.gz

Hi @vlad-karpukhin, the above link is no longer available. Could you fix it?

vladk232 · 2022-07-18T19:53:24Z

Hi @oklen, I'm no longer at Meta and can't update this repo, sorry.

philhoonoh · 2022-11-27T23:43:00Z

Hi @vlad-karpukhin, thank you for sharing the dataset.
I believe my assumptions are correct, but I want to confirm how 'hard_negative_ctxs' was built in each dataset.

Train & dev sets for this passages split:

'hard_negative_ctxs' in this dataset are retrieved by BM25

Datasets with new adversarial hard negative samples:

'hard_negative_ctxs' in this dataset are retrieved by DPR

Validation datasets in the tsv format:

same as the first one

Thank you again for your work
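For anyone who wants to look at the 'hard_negative_ctxs' field directly, the sketch below loads a gzipped JSON file and counts the hard negatives per sample. It assumes the file is a single JSON list of objects; the "question" and "text" keys in the toy data and the helper names are hypothetical. Inspecting the field this way does not reveal which retriever produced the negatives.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Dict, List


def load_samples(path: str) -> List[Dict[str, Any]]:
    """Load a gzipped JSON file that is assumed to hold one list of sample objects."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def hard_negative_counts(samples: List[Dict[str, Any]]) -> List[int]:
    """Number of entries under 'hard_negative_ctxs' for each sample (0 if the key is absent)."""
    return [len(s.get("hard_negative_ctxs", [])) for s in samples]


# Toy data written to a temporary file so the example runs without any download.
toy = [
    {"question": "q1", "hard_negative_ctxs": [{"text": "a"}, {"text": "b"}]},
    {"question": "q2", "hard_negative_ctxs": []},
]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.json.gz")
    with gzip.open(toy_path, "wt", encoding="utf-8") as f:
        json.dump(toy, f)
    print(hard_negative_counts(load_samples(toy_path)))  # [2, 0]
```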

jzhoubu closed this as completed Oct 7, 2021

jzhoubu mentioned this issue Nov 22, 2021

Fail to reproduce 22,220,793 passages for DPR facebookresearch/KILT#52

Closed
