This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Questions on new implementation #114

Closed
yeliu918 opened this issue Mar 16, 2021 · 19 comments

Comments

@yeliu918

Hi,
Nice work on the new performance!
I saw you mentioned that the new model is trained on new training data combined with your original training data.
I have some confusion here.

  1. How did you get "nq-adv-hn-train.json"? Are the gold and hard_negative passages retrieved with a pre-trained DPR model rather than BM25? And which pre-trained DPR model did you use? Is it "single-adv-hn.nq.bert-base-encoder"?
  2. If I use the new training data "nq-adv-hn-train.json", should I still use nq-train.json to reproduce your performance? If so, does that mean I need to add one BM25 hard_negative from nq-train.json?
@vlad-karpukhin
Contributor

Hi @yeliu918 ,

  1. nq-adv-hn-train.json was obtained by using the original NQ train set we provided, but with hard negatives mined by the DPR NQ model itself (checkpoint=checkpoint.retriever.single.nq.bert-base-encoder), not BM25.
    single-adv-hn.nq.bert-base-encoder - is a checkpoint trained by concatenating the original NQ train set we provided with this new dataset.
  2. Yes, you should use both train datasets. Please have a look at the "New model training combines two NQ datasets:" example on the main readme page for how to train with two datasets.
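At the data level, the two-dataset recipe above amounts to concatenating the question records from both files. A minimal sketch, assuming each file is a JSON list of DPR-style training records (the helper name and paths are illustrative, not from the repo; in practice the repo's trainer accepts a list of dataset names directly):

```python
import json

def load_combined_train_set(paths):
    """Concatenate several DPR-style train files (each a JSON list of
    question records) into one training list, mirroring the README's
    two-dataset recipe at the data level."""
    combined = []
    for path in paths:
        with open(path) as f:
            combined.extend(json.load(f))
    return combined
```

With nq-train.json and nq-adv-hn-train.json this would yield one list containing every record from both files, which the trainer then iterates over.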

@yeliu918
Author

Hi @vlad-karpukhin,

Thanks for the clarification. It's very helpful!

@yeliu918 reopened this Mar 18, 2021
@yeliu918
Author

Hi @vlad-karpukhin,

I forgot to ask: for the second step, did you train on [nq_train, nq_train_hn1] from scratch or from the checkpoint (checkpoint=checkpoint.retriever.single.nq.bert-base-encoder)? And how many more epochs did you train?

@vlad-karpukhin
Contributor

The new checkpoint 'single-adv-hn.nq.bert-base-encoder' is exactly the model that is trained from scratch on [nq_train,nq_train_hn1] combined dataset using the same params as our original model (i.e. trained for 40 epochs)

'data.retriever_results.nq.single-adv-hn.test' - is the NQ test set results from this model
'data.retriever.nq-adv-hn-train' - is the training data

And I've just added links to this model's Wikipedia embeddings.
Resource name = data.retriever_results.nq.single-adv-hn.wikipedia_passages

@yeliu918
Author

yeliu918 commented Mar 18, 2021

Thanks for the clarification. May I ask why you trained the model from scratch rather than continuing training from the "single.nq.bert-base-encoder" checkpoint? I think that might lead to better performance.

@vlad-karpukhin
Contributor

vlad-karpukhin commented Mar 18, 2021

I tried both ways in many different combinations, including re-indexing the entire Wikipedia index during training and mining new hard negatives (during training), and got more or less the same results.
Training from scratch on both datasets is slightly better overall.

@yeliu918
Author

yeliu918 commented Mar 19, 2021

Hi @vlad-karpukhin,

Thank you a lot for letting me know that.
I saw that the gold passage in "nq-adv-hn-train" is the gold one from the NQ dataset, so it is the same as in "nq_train".
So why do you use the gold passage, BM25 negative passages, and DPR negative passages in each training example, rather than doubling the training examples with the gold passage and DPR negative passages?

And why does "nq-adv-hn-train" have 69639 examples, while "nq_train" has 58880 (the same as in the paper)? Also, the score of every gold passage is 1000. Does that mean it comes from the "context" in "gold_passages_info"?

@vlad-karpukhin
Contributor

"I saw the gold passage in "nq-adv-hn-train" is gold one from NQ dataset, so it is as same as that in "nq_train". -why should they be different?

"So why do you use gold passage, BM25 negative passage and DPR negative passage in each training example. Rather than double the training example with the gold passage and DPR negative passage?" - you can try to train on nq-adv-hn-train only from scratch and let me know the result.

The quality of hard negatives drops rapidly with rank, so if you use 60 instead of 30 hard negatives from DPR, those 'second-tier' 30 new samples probably won't be as good as the top 30 from BM25. Another reason may be that BM25 samples present a different kind of challenge for the model (lexical matching).
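The DPR-based mining described above can be sketched as follows (function and argument names are mine, not the repo's): retrieve passages with the trained retriever, then keep the top-k retrieved passages that are not known positives.

```python
def mine_hard_negatives(retrieved_ids, positive_ids, k=30):
    """Sketch of DPR-based hard-negative mining: keep the top-k
    retrieved passage ids that are not known positives. Quality drops
    with rank, which is why the comment above suggests 30 DPR negatives
    plus 30 BM25 negatives rather than 60 DPR negatives."""
    negatives = []
    for pid in retrieved_ids:  # assumed sorted by retriever score, best first
        if pid in positive_ids:
            continue
        negatives.append(pid)
        if len(negatives) == k:
            break
    return negatives
```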

@vlad-karpukhin
Contributor

"And all the score of the gold passage is 1000" - the score is not used anywhere, it is just a marker.
'Gold passage' and gold_passages_info is basically the same.
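For reference, one nq-adv-hn-train record has roughly this shape (field names follow the public DPR retriever data format; all values here are invented placeholders):

```python
# Schematic shape of one nq-adv-hn-train record. Field names follow the
# public DPR retriever training-data format; values are placeholders.
record = {
    "question": "example question",
    "answers": ["example answer"],
    "positive_ctxs": [
        # score == 1000 is only a marker for the gold passage;
        # training does not use it.
        {"title": "Example", "text": "example passage text",
         "score": 1000, "passage_id": "0"},
    ],
    "negative_ctxs": [],       # BM25-mined negatives
    "hard_negative_ctxs": [],  # DPR-mined hard negatives
}
```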

@yeliu918
Author

Hi @vlad-karpukhin,

Thanks for the response! When I train the model using ["nq_train", "nq-adv-hn-train"], I find the val loss is lowest at epoch 1 and then keeps growing. Did you see this problem too? And do you think we also need to set the DPR negative in the Val dataset?

@vlad-karpukhin
Contributor

"I find the Val loss gets lowest at epoch 1, and then it keeps growing larger. Do you have this problem? " - yes, I can see the same - NLL loss grows but the ration of correct predictions also grows at the same time.
I have no explanation for this.

But NLL loss has always been a bad quality marker for this model. "Average rank" is better but also far from being a perfect validation criteria.
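The two validation signals discussed here can be computed from one batch of in-batch-negatives scores. A minimal sketch (not the repo's code): `scores[i][j]` is the similarity of question i to passage j, and `gold_idx[i]` is the column of question i's positive; the NLL and the gold passage's rank can then move in opposite directions, as observed above.

```python
import math

def nll_and_avg_rank(scores, gold_idx):
    """Mean NLL loss and mean gold-passage rank (0 = best) for one
    batch of question-passage similarity scores. Sketch only; DPR's
    actual validation code differs."""
    nll, rank_sum = 0.0, 0
    for row, g in zip(scores, gold_idx):
        log_z = math.log(sum(math.exp(s) for s in row))
        nll += log_z - row[g]                            # -log softmax at gold
        rank_sum += sum(1 for s in row if s > row[g])    # passages beating gold
    n = len(scores)
    return nll / n, rank_sum / n
```

Note that NLL keeps shrinking only if the gold score dominates by an ever-larger margin, while rank (and accuracy) saturate once the gold passage is merely first, which is one way the two metrics can diverge.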

"And do you think we also need to set the DPR negative in the Val dataset?" - Sorry, I didn't get this question.

@yeliu918
Author

Thanks for that info.

So after training 40 epochs, which checkpoint do you consider the best? The one with the highest correct predictions?

""And do you think we also need to set the DPR negative in the Val dataset?" - Sorry, I didn't get this question." I mean, just as in training you use ["nq_train", "nq-adv-hn-train"] as the training examples, why not use ["nq_dev", "nq-adv-hn-dev"] as the validation examples, where "nq-adv-hn-dev" would use DPR negatives on the dev set?

@vlad-karpukhin
Contributor

"So after training 40 epochs, which checkpoint do you view as the best? The one with the highest correct predictions?" - I always try 2 checkpoints for full index evaluation

  1. The one with best (lowest) average gold passage rank
  2. The last one.

They usually don't differ significantly and are often the same checkpoints.

"Why not using ["nq_dev", "nq-adv-hn-dev"] as the validation examples." - to be able to directly compare to previous models.
I haven't even created nq-adv-hn-dev version.

@yeliu918
Author

Got that! Thanks a lot.

@hangzhang-nlp

Why the "nq-adv-hn-train" has 69639 examples, while "nq_train" has 58880 (same as in paper) ?

@vlad-karpukhin
Contributor

vlad-karpukhin commented Apr 6, 2021

It includes all NQ train set samples, even those that don't have a good gold ctx+ mapping to our Wikipedia split.
For those, it uses semi-supervised positive passages returned by the DPR index instead.
It might be better to use the same approach for the original train data as well; I just didn't have time to do that.
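One common way to realize the "semi-supervised positive" idea above is answer-string matching over the retrieved passages. A sketch under that assumption (function and arguments are illustrative, not the repo's API): take the highest-ranked DPR-retrieved passage that contains an answer string.

```python
def semi_supervised_positive(retrieved_passages, answers):
    """Return the highest-ranked retrieved passage containing any answer
    string, to serve as a positive when no gold-context mapping exists.
    Assumes retrieved_passages is sorted best-first. Sketch only."""
    for passage in retrieved_passages:
        if any(ans.lower() in passage.lower() for ans in answers):
            return passage
    return None
```

This would explain why nq-adv-hn-train can cover more questions (69639) than the gold-mapped nq_train (58880): questions without a mapped gold context still get a usable positive.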

@hangzhang-nlp

Hi, I found that in the iterator implementation, samples within a single batch come from the same dataset. Why not mix samples from "nq-adv-hn-train" and "nq_train" within a batch? Is this a deliberate design, or just a convenient implementation?

@vlad-karpukhin
Contributor

Hi @onedoge ,
It is a good question - I did try to mix samples from different datasets.
Just not these two: I mixed samples from different tasks (i.e. mixing Q&A samples with other tasks' samples).
The result wasn't great, so I just released this version, where every batch consists of samples from a single dataset only.

@hangzhang-nlp

Thanks a lot.
