This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Question about reader training. #69

Closed
ReyonRen opened this issue Oct 7, 2020 · 11 comments

Comments

@ReyonRen

ReyonRen commented Oct 7, 2020

Hi, I'm here again :)

I tried to test the reader model (trained on your provided training data) with test data built from my own retrieved passages on the NQ dataset. The reader results are not very good, even though my retrieval performance is quite strong.

I suspect the problem is that my data does not match the training data, so I'd like to ask how the reader's training data is structured. For example, what are the queries, and where do the passages come from?

Thank you!

@ReyonRen
Author

ReyonRen commented Oct 7, 2020

I found that a gold-info file is used when constructing the pkl files. What is its role in constructing them?

If I construct the reader training data myself, can I keep this file unchanged, or is there anything I should watch out for?

The retriever training data I used was released by your DPR work; in the end we used 58,812 of its queries.

@ReyonRen
Author

ReyonRen commented Oct 7, 2020

Or maybe I made a mistake when constructing the test pkl file?

I took the top 100 retrieved results for the 3,610 test questions and built a JSON file with the same structure as your nq-test.json. I ignored the "has_answer" field and set it to 1 for all passages; I'm not sure whether that affects the result.
Then I used your nq-test_gold_info.json as an input to "preprocess_reader_data.py" to produce the pkl file.

Did I make a mistake somewhere?
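For clarity, here is roughly how I build each record. The field names follow what I see in the released nq-test.json, so please correct me if the real schema differs:

```python
import json

# Sketch of one reader-input record built from my top-100 retrieval
# results. Field names are my guess from the released nq-test.json.
def make_record(question, answers, retrieved):
    """retrieved: list of (passage_id, title, text, score) tuples."""
    return {
        "question": question,
        "answers": answers,
        "ctxs": [
            {
                "id": pid,
                "title": title,
                "text": text,
                "score": str(score),
                "has_answer": 1,  # I set this to 1 for every passage
            }
            for pid, title, text, score in retrieved
        ],
    }

record = make_record(
    "who wrote the song hotel california",
    ["Don Felder", "Don Henley"],
    [("psg_1", "Hotel California", "Hotel California is a song ...", 81.2)],
)
json.dumps([record])  # the file itself is a JSON list of such records
```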

@vlad-karpukhin
Contributor

Our reader training data is taken directly from the retriever results - that is the idea of our paper. But for reader training, and for those datasets which provide gold passages, we have some special reader-data filtering logic.

@vlad-karpukhin
Contributor

gold-info is the parameter for datasets that come with gold positive contexts (ctx+) - the reader training data composition logic applies some special heuristics when those are available.
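Schematically, the gold-info file just maps each question to its gold (original) passage. This sketch is illustrative only - check the released nq-train_gold_info.json for the exact field names:

```python
# Illustrative shape only; the exact schema may differ from the
# released nq-*_gold_info.json files.
gold_info = {
    "data": [
        {
            "question": "who wrote the song hotel california",
            "title": "Hotel California",  # gold Wikipedia page title
            "context": "Hotel California is a song by the Eagles ...",
        }
    ]
}

# During reader data composition we mainly need: question -> gold page title
gold_titles = {d["question"]: d["title"] for d in gold_info["data"]}
```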

@ReyonRen
Author

ReyonRen commented Oct 7, 2020

Got it - so I need to construct a gold-info file for my training data, where each query maps to one positive passage (the first passage in positive_ctxs of your retriever training data)?

Thank you so much for your friendly help!

P.S. Is it OK that I used the same test gold-info file as yours, since the test set also seems to contain 3,610 queries?

@vlad-karpukhin
Contributor

Hi,
gold_passages_src & gold_passages_src_dev are not "needed".
It can only improve model training slightly and is applicable for train/dev set ONLY.
It should NOT be used for the test set.

@ReyonRen
Author

Hi Vladimir,
I have a question about the reader training process: are the passages in your training data the top 100 retrieved for each query? And do you artificially control the ratio of positive to negative examples?

Thank you!

@vlad-karpukhin
Contributor

Hi,
yes, there are a set of heuristics we use to convert the retriever results into reader training batches.
As mentioned above, if gold positive passages are available, we use only the retrieved positive passages that come from them. If not, we use the top-k positive (answer-containing) retrieved passages.
You can have a detailed look at our code starting at
https://github.com/facebookresearch/DPR/blob/master/dpr/data/reader_data.py#L103
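A simplified, illustrative sketch of that selection logic (the function and parameter names here are made up; the actual implementation in reader_data.py is more involved):

```python
# Illustrative sketch of positive-passage selection for reader training.
def select_positives(ctxs, gold_title=None, max_positives=20):
    """ctxs: retrieved passages as dicts with 'title' and 'has_answer'."""
    positives = [c for c in ctxs if c["has_answer"]]
    if gold_title is not None:
        # Gold page known: keep only positives retrieved from that page.
        from_gold = [c for c in positives if c["title"] == gold_title]
        if from_gold:
            positives = from_gold
    # Otherwise (or if nothing matched) fall back to top answer-bearing hits.
    return positives[:max_positives]

ctxs = [
    {"title": "Hotel California", "has_answer": True},
    {"title": "Eagles (band)", "has_answer": True},
    {"title": "California", "has_answer": False},
]
select_positives(ctxs, gold_title="Hotel California")  # only the gold-page hit
select_positives(ctxs)                                 # both answer-bearing hits
```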

@ReyonRen
Author

Thank you!

I have another question again :)

In

include_gold_passage=False, gold_page_only_positives=True,

I'd like to ask about the effect of include_gold_passage. I noticed that it defaults to False - how would performance change if I set it to True?

Thank you again!

@vlad-karpukhin
Contributor

vlad-karpukhin commented Oct 20, 2020

Hi,
we planned to evaluate that idea of forcing gold passages into the training process, but we actually never ran that experiment.
Based on our experience with forcing gold positive contexts into retriever training, and on the heuristics used for reader training, it would probably give you a 0.5-1 exact match point gain, but hardly more.

@vlad-karpukhin
Contributor

guess I can close this now
