Questions on new implementation #114
Comments
Hi @yeliu918 ,
|
Hi @vlad-karpukhin, Thanks for the clarification. It's very helpful! |
Hi @vlad-karpukhin, I forgot to ask: for the second step, did you train on nq_train,nq_train_hn1 from scratch or from the checkpoint (checkpoint=checkpoint.retriever.single.nq.bert-base-encoder)? And how many more epochs did you train? |
The new checkpoint 'single-adv-hn.nq.bert-base-encoder' is exactly the model trained from scratch on the combined [nq_train, nq_train_hn1] dataset using the same params as our original model (i.e. trained for 40 epochs). 'data.retriever_results.nq.single-adv-hn.test' is the NQ test set results from this model. And I've just added links to this model's Wikipedia embeddings. |
Thanks for the clarification. May I ask why you train the model from scratch rather than continuing training from the "single.nq.bert-base-encoder" checkpoint? I think that might lead to better performance. |
I tried both ways in many different combinations, including re-indexing the entire Wikipedia index during training and mining new hard negatives (during training), and got more or less the same results. |
Hi @vlad-karpukhin, Thanks a lot for letting me know. Also, why does "nq-adv-hn-train" have 69639 examples while "nq_train" has 58880 (the same as in the paper)? And the score of every gold passage is 1000. Does that mean it comes from the "context" in "gold_passages_info"? |
"I saw the gold passage in "nq-adv-hn-train" is gold one from NQ dataset, so it is as same as that in "nq_train". -why should they be different? "So why do you use gold passage, BM25 negative passage and DPR negative passage in each training example. Rather than double the training example with the gold passage and DPR negative passage?" - you can try to train on nq-adv-hn-train only from scratch and let me know the result. The quality of hard negatives drops rapidly, so if you use 60 instead of 30 hard negs from DPR, those 'second' quality 30 new samples probably won't be that good as top 30 from bm25. Another reason may be that bm25 samples present a different form of challenge for the model (lexical matching) |
"And all the score of the gold passage is 1000" - the score is not used anywhere, it is just a marker. |
Hi @vlad-karpukhin, Thanks for the response! When I train the model using ["nq_train", "nq-adv-hn-train"], I find the validation loss is lowest at epoch 1 and then keeps growing. Do you have this problem? And do you think we also need to include the DPR negatives in the validation dataset? |
"I find the Val loss gets lowest at epoch 1, and then it keeps growing larger. Do you have this problem? " - yes, I can see the same - NLL loss grows but the ration of correct predictions also grows at the same time. But NLL loss has always been a bad quality marker for this model. "Average rank" is better but also far from being a perfect validation criteria. "And do you think we also need to set the DPR negative in the Val dataset?" - Sorry, I didn't get this question. |
Thanks for that info. So after training 40 epochs, which checkpoint do you view as the best? The one with the highest ratio of correct predictions? ""And do you think we also need to set the DPR negative in the Val dataset?" - Sorry, I didn't get this question." I mean: just as you use ["nq_train", "nq-adv-hn-train"] as the training sets, why not use ["nq_dev", "nq-adv-hn-dev"] as the validation sets, where "nq-adv-hn-dev" would use DPR negatives on dev? |
"So after training 40 epochs, which checkpoint do you view as the best? The one with the highest correct predictions?" - I always try 2 checkpoints for full index evaluation
They usually don't differ significantly and are often the same checkpoints. "Why not using ["nq_dev", "nq-adv-hn-dev"] as the validation examples." - to be able to directly compare to previous models. |
Got that! Thanks a lot. |
Why the "nq-adv-hn-train" has 69639 examples, while "nq_train" has 58880 (same as in paper) ? |
It includes all NQ train set samples, even those which don't have good gold ctx+ mapping to our wikipedia split. |
Hi, I found that in the iterator implementation, all samples in a single batch come from the same dataset. Why not mix samples from "nq-adv-hn-train" and "nq_train" within a batch? Is this a deliberate design choice or just a convenient implementation? |
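For reference, the batching behavior being asked about can be sketched like this (a simplified stand-in, not DPR's actual iterator; the function name and signature are illustrative): batches are formed inside each dataset first, and only the batch order is shuffled across datasets, so sources interleave between batches but never mix within one.

```python
import random

def multi_set_batches(datasets, batch_size, seed=0):
    """Yield (dataset_name, batch) pairs where every batch is drawn
    from exactly one dataset.

    datasets: dict mapping dataset name -> list of samples
    """
    rng = random.Random(seed)
    batches = []
    for name, samples in datasets.items():
        samples = samples[:]          # don't mutate the caller's list
        rng.shuffle(samples)          # shuffle within each dataset
        for i in range(0, len(samples), batch_size):
            batches.append((name, samples[i:i + batch_size]))
    rng.shuffle(batches)              # interleave datasets at the batch level
    return batches
```

One plausible reason to keep batches homogeneous is that with in-batch negatives, mixing sources in one batch would make every sample's negatives span both datasets, changing the difficulty profile of the loss.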
Hi @onedoge , |
Thanks a lot. |
Hi,
Nice work on the new performance!
I saw you mentioned that the new model is trained on new training data combined with your original training data.
I have some confusion here.