Seeking KILT meta-data for DPR #186
Comments
Thanks, @vlad-karpukhin. This helps a lot. According to footnote 21 in the KILT paper, they use 22,220,793 passages for the DPR reproduction, while the version you provided has 34 mln (35,678,076). Can I confirm whether this is the same corpus used to reproduce the scores reported in the KILT paper? Thanks. |
34 mln: this is what we got when we split KILT's Wikipedia set into 100-word passages. |
UPD: the dataset above actually has 36 mln passages, not 34 mln. |
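For readers who want to see what "split into 100-word passages" could look like, here is a minimal sketch. It assumes plain whitespace tokenization and non-overlapping chunks; the function name `split_into_passages` is hypothetical, and the preprocessing actually used for the shared dataset may differ (for example in how titles or paragraph boundaries are handled).

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into consecutive, non-overlapping chunks of N words.

    Assumption: words are whitespace-separated tokens. The last chunk
    may be shorter than words_per_passage.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


# Usage: a 250-word article yields chunks of 100, 100 and 50 words.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])
```

The number of passages depends heavily on such choices, which may be one reason different splits of the same Wikipedia dump end up with different passage counts.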
Thanks, @vlad-karpukhin. I just found that 48,950 passages have NaN in the text field. Could you confirm whether this is fine for the following experiments? Thanks. |
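A quick way to check a passage file for missing text before running experiments is sketched below. It assumes a tab-separated file with a header row, the passage id in the first column and the text in the second; the function name `find_empty_text_ids` and the column layout are assumptions, not something confirmed in this thread.

```python
import csv
import io
from typing import Iterable, List


def find_empty_text_ids(lines: Iterable[str], id_col: int = 0,
                        text_col: int = 1, skip_header: bool = True) -> List[str]:
    """Return the ids of rows whose text field is missing or blank."""
    reader = csv.reader(lines, delimiter="\t")
    if skip_header:
        next(reader, None)  # drop the header row if present
    empty_ids = []
    for row in reader:
        if not row:
            continue  # skip completely blank lines
        text = row[text_col] if len(row) > text_col else ""
        if not text.strip():
            empty_ids.append(row[id_col])
    return empty_ids


# Usage with an in-memory sample; pass an open file handle for a real TSV.
sample = io.StringIO("id\ttext\ttitle\n1\tsome text\tA\n2\t\tB\n3\t   \tC\n")
print(find_empty_text_ids(sample))
```

Note that a tool like pandas, with default settings, also reads fields containing strings such as `NA` or `null` as NaN, so a NaN count from pandas can be higher than the number of truly empty fields.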
@vlad-karpukhin What's the format of the tsv file? I want to experiment using my own passages. |
The format is flexible; see DPR/dpr/data/retriever_data.py, line 224 at commit 1ee31c6.
You can configure which column in your tsv is the id, body, or title. Or you can write your own data-parsing class; it just needs to implement the load_data_to() method. |
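To illustrate the idea of configurable columns, here is a standalone sketch of a TSV reader where the id, text and title column indices are parameters. This is not DPR's own class and does not implement its `load_data_to()` interface; the names `Passage` and `load_passages` and the default column order are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, NamedTuple


class Passage(NamedTuple):
    text: str
    title: str


def load_passages(lines: Iterable[str], id_col: int = 0, text_col: int = 1,
                  title_col: int = 2, skip_header: bool = True) -> Dict[str, Passage]:
    """Read tab-separated rows into a dict keyed by passage id.

    Column positions are parameters, so the same reader works for files
    that order id, text and title differently.
    """
    reader = csv.reader(lines, delimiter="\t")
    if skip_header:
        next(reader, None)  # drop the header row if present
    passages = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        passages[row[id_col]] = Passage(text=row[text_col], title=row[title_col])
    return passages


# Usage: a file whose columns are ordered title, id, text.
sample = io.StringIO("title\tid\ttext\nAaron\t7\tAaron is a prophet.\n")
print(load_passages(sample, id_col=1, text_col=2, title_col=0))
```

For your own passages, the equivalent step in DPR would be pointing its configuration at the right columns, or wrapping logic like this in a custom parsing class as described above.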
Thanks, that really cleared things up. |
Hi @vlad-karpukhin, the link above is no longer available. Could you fix it? |
Hi @oklen, I'm no longer at Meta and can't update this repo, sorry. |
Hi @vlad-karpukhin, Thank you for sharing the dataset.
Thank you again for your work |
Hi, I notice some updates have been made here to facilitate the KILT dataset format.
In addition, I wonder if the meta-data below from KILT could also be shared here:
passage_id of positive and negative passages for each query of the NQ dataset (mined by the DPR checkpoint)
Both meta-data are necessary for reproducing or improving DPR on KILT, and I think it will be more convenient for people to follow up if the above meta-data are shared. Thanks.