This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Question about reader training. #69

Closed
ReyonRen opened this issue Oct 7, 2020 · 11 comments

Comments

@ReyonRen

ReyonRen commented Oct 7, 2020

Hi, I'm here again :)

I tried to test the reader model (trained on your provided training data) with test data built from my own retrieved passages on the NQ dataset. The reader results are not very good, even though my retrieval performance is quite strong.

I suspect the problem is that my data does not match the training data, so I'd like to ask how the reader's training data is structured. For example, what are the queries, and where do the passages come from?

Thank you!

@ReyonRen
Author

ReyonRen commented Oct 7, 2020

I found that a gold-info file is used when constructing the pkl files. What is its role in constructing them?

If I construct the reader training data myself, can I keep this file unchanged, or is there anything I should watch out for?

The retriever training data I used was released by your DPR work; in the end we used 58,812 of its queries.

@ReyonRen
Author

ReyonRen commented Oct 7, 2020

Or maybe I made a mistake when constructing the test pkl file?

I took the top 100 retrieved results for the 3,610 test questions and built a JSON file with the same structure as your nq-test.json. I ignored the "has_answer" field and set it to 1 for all passages; I'm not sure whether that affects the result.
Then I used your nq-test_gold_info.json as an input to "preprocess_reader_data.py" to produce the pkl file.

Did I make a mistake somewhere?
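For clarity, here is roughly how I build each record. The field names follow what I see in the released nq-test.json, so please correct me if the real schema differs:

```python
import json

# Sketch of one reader-input record built from my top-100 retrieval
# results. Field names are my guess from the released nq-test.json.
def make_record(question, answers, retrieved):
    """retrieved: list of (passage_id, title, text, score) tuples."""
    return {
        "question": question,
        "answers": answers,
        "ctxs": [
            {
                "id": pid,
                "title": title,
                "text": text,
                "score": str(score),
                "has_answer": 1,  # I set this to 1 for every passage
            }
            for pid, title, text, score in retrieved
        ],
    }

record = make_record(
    "who wrote the song hotel california",
    ["Don Felder", "Don Henley"],
    [("psg_1", "Hotel California", "Hotel California is a song ...", 81.2)],
)
json.dumps([record])  # the file itself is a JSON list of such records
```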

@vlad-karpukhin
Contributor

Our reader training data is taken directly from the retriever results - that is the idea of our paper. But for reader training, and for those datasets which provide gold passages, we have some special reader-data filtering logic.

@vlad-karpukhin
Contributor

gold-info is the parameter for datasets that come with gold positive contexts (ctx+) - the reader training data composition logic applies some special heuristics when those are available.
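Schematically, the gold-info file just maps each question to its gold (original) passage. This sketch is illustrative only - check the released nq-train_gold_info.json for the exact field names:

```python
# Illustrative shape only; the exact schema may differ from the
# released nq-*_gold_info.json files.
gold_info = {
    "data": [
        {
            "question": "who wrote the song hotel california",
            "title": "Hotel California",  # gold Wikipedia page title
            "context": "Hotel California is a song by the Eagles ...",
        }
    ]
}

# During reader data composition we mainly need: question -> gold page title
gold_titles = {d["question"]: d["title"] for d in gold_info["data"]}
```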

@ReyonRen
Author

ReyonRen commented Oct 7, 2020

Got it - so I need to construct a gold-info file for my training data, where each query maps to one positive passage (the first passage in positive_ctxs of your retriever training data)?

Thank you so much for your friendly help!

P.S. Is it OK that I used the same test gold-info file as yours, since the test set also seems to contain 3,610 queries?

@vlad-karpukhin
Contributor

Hi,
gold_passages_src & gold_passages_src_dev are not "needed".
It can only improve model training slightly and is applicable for train/dev set ONLY.
It should NOT be used for the test set.

@ReyonRen
Author

Hi Vladimir,
I have a question about the reader training process: are the passages in your training data the top 100 retrieved for each query? And do you artificially control the ratio of positive to negative examples?

Thank you!

@vlad-karpukhin
Contributor

Hi,
yes, there are a set of heuristics we use to convert the retriever results into reader training batches.
As mentioned above, if gold positive passages are available, we use only the retrieved positive passages that come from them. If not, we use the top-k positive (answer-containing) retrieved passages.
You can have a detailed look at our code starting at
https://github.com/facebookresearch/DPR/blob/master/dpr/data/reader_data.py#L103
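A simplified, illustrative sketch of that selection logic (the function and parameter names here are made up; the actual implementation in reader_data.py is more involved):

```python
# Illustrative sketch of positive-passage selection for reader training.
def select_positives(ctxs, gold_title=None, max_positives=20):
    """ctxs: retrieved passages as dicts with 'title' and 'has_answer'."""
    positives = [c for c in ctxs if c["has_answer"]]
    if gold_title is not None:
        # Gold page known: keep only positives retrieved from that page.
        from_gold = [c for c in positives if c["title"] == gold_title]
        if from_gold:
            positives = from_gold
    # Otherwise (or if nothing matched) fall back to top answer-bearing hits.
    return positives[:max_positives]

ctxs = [
    {"title": "Hotel California", "has_answer": True},
    {"title": "Eagles (band)", "has_answer": True},
    {"title": "California", "has_answer": False},
]
select_positives(ctxs, gold_title="Hotel California")  # only the gold-page hit
select_positives(ctxs)                                 # both answer-bearing hits
```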

@ReyonRen
Author

Thank you!

I have another question again :)

In

include_gold_passage=False, gold_page_only_positives=True,

I'd like to ask about the effect of include_gold_passage. I noticed that it defaults to False - how would performance change if I set it to True?

Thank you again!

@vlad-karpukhin
Contributor

vlad-karpukhin commented Oct 20, 2020

Hi,
we planned to evaluate that idea of forcing gold passages into the training process, but we actually never ran that experiment.
Based on our experience with forcing gold positive contexts into retriever training, and on the heuristics used for reader training, it would probably give you a 0.5-1 exact match point gain, but hardly more.

@vlad-karpukhin
Contributor

guess I can close this now
