This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Questions on new implementation #114

Closed
yeliu918 opened this issue Mar 16, 2021 · 19 comments

Comments

@yeliu918

Hi,
Nice work on the new performance!
I saw you mentioned that the new model is trained on new training data combined with your original training data.
I have some confusion here.

  1. How did you get "nq-adv-hn-train.json"? Are the gold and hard_negative passages retrieved with a pre-trained DPR model rather than BM25? And which pre-trained DPR model did you use? Is it "single-adv-hn.nq.bert-base-encoder"?
  2. If I use the new training data "nq-adv-hn-train.json", should I still use nq-train.json to reproduce your performance? If so, does that mean I need to add one BM25 hard_negative from nq-train.json?
@vlad-karpukhin
Contributor

Hi @yeliu918 ,

  1. nq-adv-hn-train.json was obtained by using the original NQ train set we provided, but with hard negatives mined by the DPR NQ model itself (checkpoint=checkpoint.retriever.single.nq.bert-base-encoder), not BM25.
    single-adv-hn.nq.bert-base-encoder - is a checkpoint trained by concatenating the original NQ train set we provided with this new dataset.
  2. Yes, you should use both train datasets. Please have a look at the "New model training combines two NQ datasets:" example on the main readme page for how to train with two datasets.
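At the data level, the two-dataset recipe above amounts to concatenating the question records from both files. A minimal sketch, assuming each file is a JSON list of DPR-style training records (the helper name and paths are illustrative, not from the repo; in practice the repo's trainer accepts a list of dataset names directly):

```python
import json

def load_combined_train_set(paths):
    """Concatenate several DPR-style train files (each a JSON list of
    question records) into one training list, mirroring the README's
    two-dataset recipe at the data level."""
    combined = []
    for path in paths:
        with open(path) as f:
            combined.extend(json.load(f))
    return combined
```

With nq-train.json and nq-adv-hn-train.json this would yield one list containing every record from both files, which the trainer then iterates over.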

@yeliu918
Author

Hi @vlad-karpukhin,

Thanks for the clarification. It's very helpful!

@yeliu918 reopened this Mar 18, 2021
@yeliu918
Author

Hi @vlad-karpukhin,

I forgot to ask: for the second step, did you train on [nq_train, nq_train_hn1] from scratch or from the checkpoint (checkpoint=checkpoint.retriever.single.nq.bert-base-encoder)? And how many more epochs did you train?

@vlad-karpukhin
Contributor

The new checkpoint 'single-adv-hn.nq.bert-base-encoder' is exactly the model that is trained from scratch on [nq_train,nq_train_hn1] combined dataset using the same params as our original model (i.e. trained for 40 epochs)

'data.retriever_results.nq.single-adv-hn.test' - is the NQ test set results from this model
'data.retriever.nq-adv-hn-train' - is the training data

And I've just added links to this model's Wikipedia embeddings.
Resource name = data.retriever_results.nq.single-adv-hn.wikipedia_passages

@yeliu918
Author

yeliu918 commented Mar 18, 2021

Thanks for the clarification. May I ask why you trained the model from scratch rather than continuing training from the "single.nq.bert-base-encoder" checkpoint? I think that might lead to better performance.

@vlad-karpukhin
Contributor

vlad-karpukhin commented Mar 18, 2021

I tried both ways in many different combinations, including re-indexing the entire Wikipedia index during training and mining new hard negatives (during training), and got more or less the same results.
Training from scratch on both datasets is slightly better overall.

@yeliu918
Author

yeliu918 commented Mar 19, 2021

Hi @vlad-karpukhin,

Thank you a lot for letting me know that.
I saw that the gold passage in "nq-adv-hn-train" is the gold one from the NQ dataset, so it is the same as in "nq_train".
So why do you use the gold passage, BM25 negative passages, and DPR negative passages in each training example, rather than doubling the training examples with the gold passage and DPR negative passages?

And why does "nq-adv-hn-train" have 69639 examples, while "nq_train" has 58880 (the same as in the paper)? Also, the score of every gold passage is 1000. Does that mean it comes from the "context" in "gold_passages_info"?

@vlad-karpukhin
Contributor

"I saw the gold passage in "nq-adv-hn-train" is gold one from NQ dataset, so it is as same as that in "nq_train". -why should they be different?

"So why do you use gold passage, BM25 negative passage and DPR negative passage in each training example. Rather than double the training example with the gold passage and DPR negative passage?" - you can try to train on nq-adv-hn-train only from scratch and let me know the result.

The quality of hard negatives drops rapidly with rank, so if you use 60 instead of 30 hard negatives from DPR, those 'second-tier' 30 new samples probably won't be as good as the top 30 from BM25. Another reason may be that BM25 samples present a different kind of challenge for the model (lexical matching).
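The DPR-based mining described above can be sketched as follows (function and argument names are mine, not the repo's): retrieve passages with the trained retriever, then keep the top-k retrieved passages that are not known positives.

```python
def mine_hard_negatives(retrieved_ids, positive_ids, k=30):
    """Sketch of DPR-based hard-negative mining: keep the top-k
    retrieved passage ids that are not known positives. Quality drops
    with rank, which is why the comment above suggests 30 DPR negatives
    plus 30 BM25 negatives rather than 60 DPR negatives."""
    negatives = []
    for pid in retrieved_ids:  # assumed sorted by retriever score, best first
        if pid in positive_ids:
            continue
        negatives.append(pid)
        if len(negatives) == k:
            break
    return negatives
```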

@vlad-karpukhin
Contributor

"And all the score of the gold passage is 1000" - the score is not used anywhere, it is just a marker.
'Gold passage' and gold_passages_info is basically the same.
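For reference, one nq-adv-hn-train record has roughly this shape (field names follow the public DPR retriever data format; all values here are invented placeholders):

```python
# Schematic shape of one nq-adv-hn-train record. Field names follow the
# public DPR retriever training-data format; values are placeholders.
record = {
    "question": "example question",
    "answers": ["example answer"],
    "positive_ctxs": [
        # score == 1000 is only a marker for the gold passage;
        # training does not use it.
        {"title": "Example", "text": "example passage text",
         "score": 1000, "passage_id": "0"},
    ],
    "negative_ctxs": [],       # BM25-mined negatives
    "hard_negative_ctxs": [],  # DPR-mined hard negatives
}
```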

@yeliu918
Author

Hi @vlad-karpukhin,

Thanks for the response! When I train the model using ["nq_train", "nq-adv-hn-train"], I find the val loss is lowest at epoch 1 and then keeps growing. Did you see this problem too? And do you think we also need to set the DPR negative in the Val dataset?

@vlad-karpukhin
Contributor

"I find the Val loss gets lowest at epoch 1, and then it keeps growing larger. Do you have this problem? " - yes, I can see the same - NLL loss grows but the ration of correct predictions also grows at the same time.
I have no explanation for this.

But NLL loss has always been a bad quality marker for this model. "Average rank" is better but also far from being a perfect validation criteria.
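The two validation signals discussed here can be computed from one batch of in-batch-negatives scores. A minimal sketch (not the repo's code): `scores[i][j]` is the similarity of question i to passage j, and `gold_idx[i]` is the column of question i's positive; the NLL and the gold passage's rank can then move in opposite directions, as observed above.

```python
import math

def nll_and_avg_rank(scores, gold_idx):
    """Mean NLL loss and mean gold-passage rank (0 = best) for one
    batch of question-passage similarity scores. Sketch only; DPR's
    actual validation code differs."""
    nll, rank_sum = 0.0, 0
    for row, g in zip(scores, gold_idx):
        log_z = math.log(sum(math.exp(s) for s in row))
        nll += log_z - row[g]                            # -log softmax at gold
        rank_sum += sum(1 for s in row if s > row[g])    # passages beating gold
    n = len(scores)
    return nll / n, rank_sum / n
```

Note that NLL keeps shrinking only if the gold score dominates by an ever-larger margin, while rank (and accuracy) saturate once the gold passage is merely first, which is one way the two metrics can diverge.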

"And do you think we also need to set the DPR negative in the Val dataset?" - Sorry, I didn't get this question.

@yeliu918
Author

Thanks for that info.

So after training 40 epochs, which checkpoint do you consider the best? The one with the highest correct predictions?

""And do you think we also need to set the DPR negative in the Val dataset?" - Sorry, I didn't get this question." I mean, just as in training you use ["nq_train", "nq-adv-hn-train"] as the training examples, why not use ["nq_dev", "nq-adv-hn-dev"] as the validation examples, where "nq-adv-hn-dev" would use DPR negatives on the dev set?

@vlad-karpukhin
Contributor

"So after training 40 epochs, which checkpoint do you view as the best? The one with the highest correct predictions?" - I always try 2 checkpoints for full index evaluation

  1. The one with best (lowest) average gold passage rank
  2. The last one.

They usually don't differ significantly and are often the same checkpoints.

"Why not using ["nq_dev", "nq-adv-hn-dev"] as the validation examples." - to be able to directly compare to previous models.
I haven't even created nq-adv-hn-dev version.

@yeliu918
Author

Got that! Thanks a lot.

@hangzhang-nlp

Why the "nq-adv-hn-train" has 69639 examples, while "nq_train" has 58880 (same as in paper) ?

@vlad-karpukhin
Contributor

vlad-karpukhin commented Apr 6, 2021

It includes all NQ train set samples, even those that don't have a good gold ctx+ mapping to our Wikipedia split.
For those, it uses semi-supervised positive passages returned by the DPR index instead.
It might be better to use the same approach for the original train data as well; I just didn't have time to do that.
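One common way to realize the "semi-supervised positive" idea above is answer-string matching over the retrieved passages. A sketch under that assumption (function and arguments are illustrative, not the repo's API): take the highest-ranked DPR-retrieved passage that contains an answer string.

```python
def semi_supervised_positive(retrieved_passages, answers):
    """Return the highest-ranked retrieved passage containing any answer
    string, to serve as a positive when no gold-context mapping exists.
    Assumes retrieved_passages is sorted best-first. Sketch only."""
    for passage in retrieved_passages:
        if any(ans.lower() in passage.lower() for ans in answers):
            return passage
    return None
```

This would explain why nq-adv-hn-train can cover more questions (69639) than the gold-mapped nq_train (58880): questions without a mapped gold context still get a usable positive.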

@hangzhang-nlp

Hi, I found that in the iterator implementation, samples within a single batch come from the same dataset. Why not mix samples from "nq-adv-hn-train" and "nq_train" within a batch? Is this a deliberate design, or just a convenient implementation?

@vlad-karpukhin
Contributor

Hi @onedoge ,
It is a good question - I did try to mix samples from different datasets.
Just not these two: I mixed samples from different tasks (i.e. mixing Q&A samples with other tasks' samples).
The result wasn't great, so I just released this version, where every batch consists of samples from a single dataset only.

@hangzhang-nlp

Thanks a lot.
