This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Seeking KILT meta-data for DPR #186

Closed
jzhoubu opened this issue Sep 18, 2021 · 11 comments

Comments

@jzhoubu

jzhoubu commented Sep 18, 2021

Hi, I noticed that some updates have been made here to support the KILT dataset format.
In addition, I wonder if the following KILT metadata could also be shared here:

  1. the 22,220,793 passages split from the KILT knowledge source
  2. the corresponding passage_id of positive and negative passages for each query of the NQ dataset (mined by the DPR checkpoint)

Both pieces of metadata are necessary for reproducing or improving DPR on KILT, and I think sharing them would make it much easier for people to follow up. Thanks.

@vlad-karpukhin
Contributor

Hi @sysu-zjw,
Unfortunately, I don't have time to add all our code changes for the KILT multi-task training setup, but I can share some of the data we used:

https://dl.fbaipublicfiles.com/ur/wikipedia_split/psgs_w100.tsv.gz - this is the KILT passage set split into DPR's 100-word format. It has ~34 mln passages, AFAIR.

Train & dev sets for this passages split:
https://dl.fbaipublicfiles.com/ur/data/retriever/nq-train-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/wow-dev-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/wow-train-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/nq-dev-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/trex-dev-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/trex-train-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/aidayago2-dev-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/aidayago2-train-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/fever-dev-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/zeroshot-dev-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/zeroshot-train-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/hotpotqa-dev-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/hotpotqa-train-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/triviaqa-dev-multikilt.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/triviaqa-train-multikilt.json.gz

Datasets with new adversarial hard negative samples:
https://dl.fbaipublicfiles.com/ur/data/retriever/trivia-train-adv.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/hotpot-train-adv.json.gz
https://dl.fbaipublicfiles.com/ur/data/retriever/nq-train-adv.json.gz

Validation datasets in the tsv format:
https://dl.fbaipublicfiles.com/ur/data/retriever/trex-dev-kilt.tsv
https://dl.fbaipublicfiles.com/ur/data/retriever/zeroshot-dev-kilt.tsv
https://dl.fbaipublicfiles.com/ur/data/retriever/hotpot-dev-kilt.tsv
https://dl.fbaipublicfiles.com/ur/data/retriever/trivia-dev-kilt.tsv
https://dl.fbaipublicfiles.com/ur/data/retriever/nq-dev-kilt.tsv
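
For anyone downloading the json.gz files above, a quick sanity check can help before training. The following is a minimal sketch, not code from the DPR repository. It assumes each file is a single gzipped JSON array of records and that records use the field names common in DPR retriever data (`positive_ctxs`, `negative_ctxs`, `hard_negative_ctxs`). Only `hard_negative_ctxs` is mentioned in this thread, so the other two names are assumptions, and `summarize_retriever_file` is a hypothetical helper.

```python
import gzip
import json
from collections import Counter


def summarize_retriever_file(path):
    """Return the record count and how many records have each context field non-empty."""
    # Assumption: the file is one gzipped JSON array of dict records.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        records = json.load(f)

    non_empty = Counter()
    for rec in records:
        # Field names are assumed; missing or empty fields are simply not counted.
        for key in ("positive_ctxs", "negative_ctxs", "hard_negative_ctxs"):
            if rec.get(key):
                non_empty[key] += 1
    return {"records": len(records), "non_empty": dict(non_empty)}
```

Calling `summarize_retriever_file("nq-dev-multikilt.json.gz")` on a downloaded file would return the counts as a dict. Note that `json.load` reads the whole file into memory, so the larger training files may need a lot of RAM.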

@jzhoubu
Author

jzhoubu commented Sep 21, 2021

Thanks, @vlad-karpukhin. This helps a lot.

According to footnote 21 of the KILT paper, they use 22,220,793 passages for the DPR reproduction, while the version you provided has 34 mln (35,678,076). Could you confirm whether this is the same corpus used to produce the scores reported in the KILT paper? Thanks.

@vlad-karpukhin
Contributor

34 mln is what we got when we split KILT's Wikipedia set into 100-word passages.
Reproducing the results reported in the KILT paper is tricky and requires some code modifications on both the DPR and the KILT evaluation sides. KILT has its own codebase suitable for replication.
The links above are the data used in this paper (https://arxiv.org/pdf/2101.00117.pdf)
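
To illustrate what "split into 100-word passages" can mean, here is a minimal sketch. The thread does not describe the actual splitting script, so whitespace tokenization, non-overlapping chunks, and the helper name `split_into_passages` are all assumptions. The resulting passage count depends on choices like these.

```python
def split_into_passages(text, words_per_passage=100):
    """Split text into consecutive, non-overlapping chunks of at most N words."""
    # Assumption: words are whitespace-separated tokens.
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]
```

With this sketch, a 250-word article yields three passages, the last one holding the remaining 50 words.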

@vlad-karpukhin
Contributor

UPD: the passage set above actually has 36 mln passages, not 34 mln.

@jzhoubu
Author

jzhoubu commented Sep 21, 2021

Thanks, @vlad-karpukhin. I just found that 48,950 passages have NaN in the text field. Could you confirm whether this is fine for the follow-up experiments? Thanks.
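
A check like the one described above can be sketched with the standard library alone. This is a rough sketch under stated assumptions: the file is tab-separated with a header row and the passage text in the second column (the thread does not confirm the column layout), and `count_suspect_passages` and `NA_LIKE` are hypothetical names. The sketch separates truly empty text from text that merely looks like a missing-value token, since parsers such as `pandas.read_csv` treat strings like "NA" or "null" as NaN by default. Whether that explains the count reported here is not confirmed in the thread.

```python
import csv
import gzip

# Assumption: a small sample of strings that pandas.read_csv treats as missing by default.
NA_LIKE = {"NA", "N/A", "NaN", "nan", "null", "NULL"}


def count_suspect_passages(path, text_col=1, has_header=True):
    """Return (total rows, rows with empty text, rows whose text is an NA-like token)."""
    total = empty = na_like = 0
    # Read .gz files directly; fall back to plain open() otherwise.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            total += 1
            text = row[text_col].strip() if len(row) > text_col else ""
            if text == "":
                empty += 1
            elif text in NA_LIKE:
                na_like += 1
    return total, empty, na_like
```

Streaming the file row by row keeps memory use flat, which matters for a passage set with tens of millions of rows.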

@light42

light42 commented Sep 30, 2021

https://dl.fbaipublicfiles.com/ur/wikipedia_split/psgs_w100.tsv.gz - this is the KILT passage set split into DPR's 100-word format. It has ~34 mln passages, AFAIR.

@vlad-karpukhin What's the format of the tsv file? I want to experiment using my own passages.

@vlad-karpukhin
Contributor

The format is flexible.
Have a look at

class CsvCtxSrc(RetrieverData):

You can configure which column in your tsv is the id, body, or title.
Or you can write your own data parsing class; you just need to implement the load_data_to() method.
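
The second option above could look roughly like the following. This is a standalone sketch, not DPR code: `MyTsvCtxSrc` and `Passage` are hypothetical names, and the class deliberately does not subclass `RetrieverData`, whose constructor and the passage type DPR expects should be checked in the repository. It only shows the idea of configurable column positions and a `load_data_to()` method that fills a dict keyed by passage id.

```python
import csv
from collections import namedtuple

# Hypothetical stand-in for the passage container used by the retriever.
Passage = namedtuple("Passage", ["text", "title"])


class MyTsvCtxSrc:
    """Hypothetical passage source with configurable column positions."""

    def __init__(self, file, id_col=0, text_col=1, title_col=2, skip_header=True):
        self.file = file
        self.id_col = id_col
        self.text_col = text_col
        self.title_col = title_col
        self.skip_header = skip_header

    def load_data_to(self, ctxs):
        """Fill the given dict with {passage_id: Passage(text, title)}."""
        with open(self.file, encoding="utf-8", newline="") as f:
            reader = csv.reader(f, delimiter="\t")
            if self.skip_header:
                next(reader, None)  # drop the header row
            for row in reader:
                ctxs[row[self.id_col]] = Passage(row[self.text_col], row[self.title_col])
```

For a file whose columns are ordered title, id, text, one would construct `MyTsvCtxSrc(path, id_col=1, text_col=2, title_col=0)` and call `load_data_to({})` on it.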

@light42

light42 commented Sep 30, 2021

Thanks, that really cleared things up.

@oklen

oklen commented Jul 18, 2022

https://dl.fbaipublicfiles.com/ur/data/retriever/aidayago2-dev-multikilt.json.gz

Hi @vlad-karpukhin, the link above is no longer available. Could you fix it?

@vladk232

Hi @oklen, I'm no longer at Meta and can't update this repo, sorry.

@philhoonoh

Hi @vlad-karpukhin, thank you for sharing the dataset.
I believe my assumptions are correct, but I want to double-check how the 'hard_negative_ctxs' in each dataset were obtained.

  1. Train & dev sets for this passages split:
     • 'hard_negative_ctxs' in this dataset are retrieved by BM25
  2. Datasets with new adversarial hard negative samples:
     • 'hard_negative_ctxs' in this dataset are retrieved by DPR
  3. Validation datasets in the tsv format:
     • same as the first one

Thank you again for your work
