Reproduce Table 3 in paper #185

Closed
yxliu-ntu opened this issue Aug 30, 2021 · 9 comments

Comments

yxliu-ntu commented Aug 30, 2021

Hi, all.

I am trying to reproduce the second block of Table 3 in the paper, but I run into problems for Gold #N = 7 and wonder what causes the inconsistency.
[screenshot: Table 3 from the paper]

The results I get are shown below.

batch size                   top-5    top-20    top-100
bsize=8 (Gold #N = 7)        44.0%    64.2%     78.2%
bsize=128 (Gold #N = 127)    57.6%    74.5%     84.0%

For Gold #N = 7 there is a large gap from the paper's numbers, while the Gold #N = 127 result is fine.

I ran the Gold #N = 7 experiment on a server with 4 V100 GPUs and the Gold #N = 127 experiment on a server with 8 V100 GPUs. The two servers have the same CUDA version and virtual environment; the only difference between the two runs is the batch size setting.

The script I used for experiments is shown below.

#!/bin/bash

set -x
HYDRA_FULL_ERROR=1 python train_dense_encoder.py \
    train_datasets=[nq_train] \
    dev_datasets=[nq_dev] \
    train=biencoder_nq \
    train.batch_size=$1 \
    train.hard_negatives=0 \
    output_dir=./runs/

Note that I ran these experiments in DataParallel (single-node, multi-GPU) mode.

@vlad-karpukhin
Contributor

Hi @yxliu-ntu,
I recommend using train_dense_encoder in DDP mode - this is the mode we used for all the results reported in the paper.
Also, in DDP mode the batch size is measured per GPU, so on a 4-GPU setup you will need to set batch_size=2 and hard_negatives=0.
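
For the 4-GPU server that would look roughly like this (just a sketch reusing the overrides from your script, not a command I have re-verified):

# DDP: train.batch_size is per GPU, so 4 GPUs x 2 = 8 total (Gold #N = 7)
python3 -m torch.distributed.launch --nproc_per_node=4 train_dense_encoder.py \
    train_datasets=[nq_train] \
    dev_datasets=[nq_dev] \
    train=biencoder_nq \
    train.batch_size=2 \
    train.hard_negatives=0 \
    output_dir=./runs/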

@yxliu-ntu
Author

Thanks for your advice! I will switch to the server with 8 V100 GPUs (the one used for the Gold #N = 127 run) and try again with the script below.

#!/bin/bash

source ./venv/bin/activate
set -x
python3 -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    train_datasets=[nq_train] \
    dev_datasets=[nq_dev] \
    train=biencoder_nq \
    train.batch_size=1 \
    train.hard_negatives=0 \
    output_dir=./runs/
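
If I understand the DDP setting correctly, the effective batch size here works out as follows (my own arithmetic, not taken from the logs):

# effective batch = nproc_per_node * train.batch_size = 8 * 1 = 8
# in-batch gold negatives per question: Gold #N = 8 - 1 = 7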

@yxliu-ntu
Author

Hi @vlad-karpukhin, I wonder how you select the best checkpoint or conduct grid searches. I have tried different checkpoints and found that the checkpoint with the best average rank on the validation set does not give the best performance at test time. The inconsistency might come from the different document candidate sets used for validation and testing.

@vlad-karpukhin
Contributor

I always evaluate both the last checkpoint and the one selected by average rank. Average rank doesn't always correlate with the full evaluation, but it is still a better checkpoint-selection criterion than the NLL loss. In my experience the last checkpoint is usually the best, but it doesn't differ much from the one selected by the best average rank.
As far as I remember, we reported the numbers from the checkpoint selected by average rank.

@yxliu-ntu
Author

Hi @vlad-karpukhin, I have run more experiments in DDP mode, but I still cannot match the results in Table 3. The last four experiments were all run on the same server with 8 V100 GPUs and CUDA 10.1.

bert-base                 top-5    top-20    top-100
bsize=8 (DP mode)         44.0%    64.2%     78.2%
bsize=8 (DDP mode)        41.3%    61.5%     77.2%
bsize=16 (DDP mode)       48.3%    67.7%     80.5%
bsize=128 (DP mode)       57.6%    74.5%     84.0%
bsize=256 (DDP mode)      59.1%    75.4%     84.5%

The training logs are available at https://drive.google.com/file/d/1qK_nAgSDaOh_LblnvV4H8LXwb3hNOVmy/view?usp=sharing.

@vlad-karpukhin
Contributor

Hi @yxliu-ntu,
I opened the log for the 256 batch size run and see that you have hard_negatives=0.
You should set it to 1 to get to 65% accuracy at top-5.
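
For example, keeping your launch command and only changing the negatives setting (a sketch; a per-GPU batch of 32 matches the 256 total from that log):

python3 -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    train_datasets=[nq_train] \
    dev_datasets=[nq_dev] \
    train=biencoder_nq \
    train.batch_size=32 \
    train.hard_negatives=1 \
    output_dir=./runs/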

@yxliu-ntu
Author

@vlad-karpukhin We want to reproduce the second block of Table 3 first, which is why I set hard_negatives=0.
[screenshot: second block of Table 3 from the paper]

@vlad-karpukhin
Contributor

Your bsize=128 (DP mode) line roughly corresponds to our Gold-127 line, with your 57.6% top-5 even better than our reported 55.8%. That is expected, since the reported numbers were obtained with a fairly old HuggingFace version and slightly different hyperparameters than the currently recommended values.

The Gold-31 line should have the (per-GPU) batch size set to 4 and the Gold-7 line to 1.
Unfortunately, I no longer have my logs for those ablation runs.
I might have used a different number of training epochs - 30 vs the default 40 - and the reported results are actually for our NQ dev set, not for the test set. Also, I might have made some learning rate adjustments, which is common practice when you change the batch size.
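
On your 8-GPU DDP setup, the Gold-7 / Gold-31 / Gold-127 rows map to per-GPU batch sizes of 1, 4 and 16, e.g. something along these lines (a sketch; the per-run output directory naming is arbitrary):

for bsz in 1 4 16; do   # Gold #N = 8 * batch_size - 1 -> 7, 31, 127
    python3 -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
        train_datasets=[nq_train] \
        dev_datasets=[nq_dev] \
        train=biencoder_nq \
        train.batch_size=$bsz \
        train.hard_negatives=0 \
        output_dir=./runs/gold_$((8 * bsz - 1))/
done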

Overall, I can clearly see accuracy improving with batch size in your results table, with your Gold-127 value matching (and even slightly exceeding) our results - which is exactly the point of that ablation.

I would provide you with the second-block checkpoints, but unfortunately we no longer have them, and honestly I don't see much value in reproducing the exact numbers for that ablation. It should just show that results improve as the batch size increases and saturate around the 127 -> 256 step, and that is what you already have.

@yxliu-ntu
Author

@vlad-karpukhin Thank you very much!
