Reproduce Table 3 in paper #185
Comments
Hi @yxliu-ntu,
Thanks for your advice! I will switch to the same server with 8 V100 GPUs and try it again.
Hi @vlad-karpukhin, I wonder how you select the best checkpoint or conduct grid searches. I have tried different checkpoints but found that the checkpoint with the best average rank on the validation set does not give the best performance at test time. The different document candidate sets we use for validation and testing might explain this inconsistency.
I always evaluate the last checkpoint and the one selected by average rank. Average rank doesn't always correlate with the full evaluation, but it is still a better checkpoint selection criterion than the NLL loss. In my experience the best checkpoint is usually the last one, but it doesn't differ much from the one selected by best average rank.
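For anyone following along, here is a minimal sketch of what an average-rank criterion over a validation set might look like. The function name, shapes, and scoring setup are my assumptions for illustration, not DPR's actual validation code:

```python
import numpy as np

def average_rank(scores, gold_idx):
    """scores: (n_questions, n_candidates) similarity matrix.
    gold_idx: index of the gold passage for each question.
    Returns the mean 1-based rank of the gold passage."""
    gold_scores = scores[np.arange(len(scores)), gold_idx]
    # rank = number of candidates scoring strictly higher than the gold, plus 1
    ranks = (scores > gold_scores[:, None]).sum(axis=1) + 1
    return ranks.mean()

scores = np.array([[0.9, 0.1, 0.3],
                   [0.2, 0.8, 0.5]])
print(average_rank(scores, np.array([0, 2])))  # golds ranked 1st and 2nd -> 1.5
```

A lower average rank means the retriever places gold passages closer to the top, which is why it tracks the full top-k evaluation better than the raw NLL loss.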
Hi @vlad-karpukhin. I have run more experiments in DDP mode but still cannot get the results in Table 3. The last four experiments were all conducted on the same server with 8 V100 GPUs, whose CUDA version is 10.1.
The training logs are attached at https://drive.google.com/file/d/1qK_nAgSDaOh_LblnvV4H8LXwb3hNOVmy/view?usp=sharing.
Hi @yxliu-ntu,
@vlad-karpukhin We want to reproduce the second block of Table 3 first, so I set hard_negatives=0.
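With hard_negatives=0, training presumably relies only on in-batch gold negatives: each question's positive passage serves as a negative for every other question in the batch. A minimal sketch of that objective (my assumption about the setup, not the repo's implementation):

```python
import numpy as np

def in_batch_nll(sim):
    """sim: (B, B) matrix of question-passage similarity scores,
    where sim[i, i] is the gold pair for question i. Each row treats
    the other B-1 gold passages as negatives."""
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

# When gold pairs score much higher than negatives, the loss is near zero.
print(in_batch_nll(np.array([[10.0, 0.0], [0.0, 10.0]])))
```

This also illustrates why batch size matters for the second-block ablation: a larger batch gives each question more in-batch negatives to discriminate against.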
Your bsize128 (DP mode) line roughly corresponds to our Gold-127 line, with your 57.6% top-5 even better than our reported 55.8%. That is expected, since the reported numbers were for a pretty old HF version and a slightly different setup. The Gold-31 line should have batch size set to 4, and Gold-7 to 1. Overall, I can clearly see the accuracy improving with batch size in your results table, with the Gold-127 value matching (even slightly exceeding) our results, which is exactly the point of that ablation. I could provide you with the second-block checkpoints, but unfortunately we don't have them anymore, and honestly I don't see the value in reproducing the exact numbers for that ablation. It should just show that model results grow as batch size increases, saturating on the 127→256 step, and that is what you got already.
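If I read the ablation right, Gold #N is simply the effective batch size across all GPUs minus one (each question's own gold passage is excluded). The arithmetic below is my reading of the thread, not code from the repo:

```python
def gold_n(per_gpu_batch_size, n_gpus):
    """In-batch gold negatives per question, assuming negatives are
    pooled across the whole effective batch."""
    effective_batch = per_gpu_batch_size * n_gpus
    return effective_batch - 1  # every other question's gold is a negative

# On an 8-GPU server, the batch sizes suggested above give:
print(gold_n(16, 8))  # 127 -> Gold-127
print(gold_n(4, 8))   # 31  -> Gold-31
print(gold_n(1, 8))   # 7   -> Gold-7
```

This is why the Gold-7 run needs per-GPU batch size 1 on 8 GPUs, and why running the same per-GPU batch size on a 4-GPU server halves the number of in-batch negatives.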
@vlad-karpukhin Thank you very much! |
Hi, all.
I am reproducing the second block of Table 3 in the paper but ran into problems with Gold #N = 7, and I wonder what causes the inconsistency.
The results I get are shown below.
For Gold #N = 7 the result shows a large gap from the paper, while Gold #N = 127 is fine.
I ran the Gold #N = 7 experiment on a server with 4 V100 GPUs and the Gold #N = 127 experiment on a server with 8 V100 GPUs. The two servers have the same CUDA version and virtual env; the only difference is the batch-size setting.
The script I used for experiments is shown below.
Note that I ran these experiments in DataParallel (single node, multi-GPU) mode.