
Best results reproduction instructions #7

Closed
iftachg opened this issue Jun 3, 2020 · 12 comments

Comments

@iftachg

iftachg commented Jun 3, 2020

Hello,

I am trying to train a model following your instructions and ran train_dense_encoder.py.
In the instructions you refer to --dev_file {path to downloaded data.retriever.qas.nq-dev resource}, but it is unclear which file is meant.

Is it retriever/qas/nq-dev.csv or retriever/nq-dev.json? The first option fails because the code expects a JSON file, while the second one doesn't look like a "retriever.qas" resource based on its name.

@vlad-karpukhin
Contributor

Hi,

The --dev_file argument for retriever training should point to the retriever/nq-dev.json file, since it contains the pools of negative passages as well as positive ones. The 'qas' files are basically just questions and answers, which you need when doing the full evaluation (i.e., against the entire Wikipedia collection).
Please also note that our dev split of NQ is not the official dev set.
You should use our NQ 'test' split for comparison with other works; it is the set reported in our paper. It is actually the official 'dev' set and is the one used in the papers we refer to.
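For illustration, a retriever training invocation with the dev file pointed at the JSON resource might look like this (a sketch only: apart from --dev_file, the flags and values shown are assumptions based on the tool's usual arguments, and the paths are placeholders):

python train_dense_encoder.py --encoder_model_type hf_bert --pretrained_model_cfg bert-base-uncased --train_file {path to retriever/nq-train.json} --dev_file {path to retriever/nq-dev.json} --batch_size 16 --output_dir {your output dir}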

@iftachg
Author

iftachg commented Jun 3, 2020

Just to be certain: the *.json files are not the splits used in the papers, but the *.csv files are?

@vlad-karpukhin
Contributor

"the *.json files are not the splits used in the papers but the *.csv are? " - json are subsets of their csv counterparts since we lost some portion of data while preprocessing them.
The numbers reported in the papers (ours and the ones were are referring to) are for NQ 'test' split.

@morningmoni

Hi there,

Is there any instruction on the inference/test of the reader too? I'd like to use the provided checkpoint checkpoint.reader.nq-single.hf-bert-base and the data file data.reader.nq.single.test to reproduce the results (if I understand correctly). It looks to me like the reader input comes from data.retriever_results.nq.single.test? What is data.gold_passages_info.nq_test used for, then?

Thank you!

@vlad-karpukhin
Contributor

vlad-karpukhin commented Jun 3, 2020

Hi,
thanks for raising this question - we do need more instructions.
Inference/test for the reader is much easier than for the retriever - one uses exactly the same train_reader.py tool, but with two differences:

  • remove the train_file parameter.
  • specify the reader checkpoint via the model_file parameter.

You are correct that you need to use data.retriever_results.nq.single.test as the input for validation. You should NOT use data.gold_passages_info.nq_test when doing inference.
We provided that file just for consistency and experiments; we did not use it when evaluating the NQ test set (reported in the paper).
The gold_passages_info files are used for train-time data processing (look at the preprocess_retriever_data method). Specifically, they are used to check whether a positive passage from the retriever comes from the "gold" wikipedia page or not. One obviously should not do that for the test set. The code actually has a defense against accidental usage of gold_passages_info - it is not used for dev/test split preprocessing (https://github.com/facebookresearch/DPR/blob/master/dpr/data/reader_data.py#L272).
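As a rough illustration of that train-only gold-page check (a sketch with hypothetical names, not the actual DPR code), the idea is:

# Illustrative sketch only - function and field names are hypothetical,
# not the real DPR implementation.
def preprocess_split(samples, is_train, gold_page_by_question=None):
    for sample in samples:
        positives = sample["positive_ctxs"]
        if is_train and gold_page_by_question is not None:
            # At train time, prefer positives that come from the question's
            # "gold" wikipedia page; fall back to all positives if none match.
            gold_title = gold_page_by_question.get(sample["question"])
            from_gold = [c for c in positives if c.get("title") == gold_title]
            positives = from_gold or positives
        # For dev/test, gold_page_by_question is simply never consulted.
        sample["positive_ctxs"] = positives
        yield sample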

Here is an example of how we evaluate the reader on NQ test set:

python train_reader.py --prediction_results_file {some dir}/results.json --eval_top_docs 10 20 40 50 80 100 --dev_file {path to data.retriever_results.nq.single.test file} --model_file {path to the reader checkpoint} --dev_batch_size 80 --passages_per_question_predict 100 --sequence_length 350

The example above is for an 8x32GB GPU server. One should tune dev_batch_size for their hardware.

@morningmoni

Thanks for your detailed reply! I have obtained results close to those in the paper.
n=10 EM 40.66
n=20 EM 40.94
n=40 EM 41.25
n=50 EM 41.39
n=80 EM 41.22
n=100 EM 41.14

@vlad-karpukhin
Contributor

hmm...
We've got:
n=10 EM 40.78
n=20 EM 41.02
n=40 EM 41.39
n=50 EM 41.52
n=80 EM 41.36
n=100 EM 41.27

Could you please send the exact command you used?
It is possible I shared a slightly different checkpoint which gives slightly worse numbers.

@morningmoni

CUDA_VISIBLE_DEVICES=4 python train_reader.py --prediction_results_file /workspace/DPR/data/res/nq-test-reader.json --eval_top_docs 10 20 40 50 80 100 --dev_file /workspace/DPR/data/retriever_results/nq/single/test.json --model_file /workspace/DPR/checkpoint/reader/nq-single/hf-bert-base.cp --dev_batch_size 2 --passages_per_question_predict 100 --sequence_length 350

I simply reduced dev_batch_size, which took about 9GB in my case.
FYI, Loading checkpoint @ batch=1736 and epoch=18

I can also paste the whole output if you like, but I don't think I changed anything so..

@vlad-karpukhin
Contributor

"Loading checkpoint @ batch=1736 and epoch=18" - this is the correct one.
Let me re-run using your settings.

@vlad-karpukhin
Contributor

Hi @morningmoni,
I got the same results as you. The issue is in the input data; the checkpoint is the correct one.
There seems to be some minor mismatch between the 'raw' data.retriever_results.nq.single.test file and its version preprocessed by preprocess_reader_data.py, i.e., data.reader.nq.single.test.
It might be related to slightly different preprocessing parameters.
I need to have a closer look.

In the meantime you could use the data.reader.nq.single.test file instead of data.retriever_results.nq.single.test to get those extra 0.1-0.2 points of accuracy.
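For example, that would be the same evaluation command as above with only the --dev_file path changed (paths are placeholders):

python train_reader.py --prediction_results_file {some dir}/results.json --eval_top_docs 10 20 40 50 80 100 --dev_file {path to data.reader.nq.single.test file} --model_file {path to the reader checkpoint} --dev_batch_size 80 --passages_per_question_predict 100 --sequence_length 350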

Note: the reader model is quite sensitive to all data preprocessing and hyperparameter settings.
Variations of 0.5-1 points in the final EM metric are expected when changing even one of the parameters.

@morningmoni

Thanks! I got the same results when using the .pkl directly. When you say "changing even one of the parameters", could you let me know which parameters you are referring to (what are the important ones)?

@vlad-karpukhin
Contributor

"Could you let me know when you say changing even one of the parameters," - basically all the hyper parameters for the reader training and data preprocessing ones: random seed, learning rate, gradient clip, and all reader data preparation settings which we don't expose yet via command line args(https://github.com/facebookresearch/DPR/blob/master/dpr/data/reader_data.py#L95)

@iftachg closed this as completed Jun 15, 2020