Reproduce Table 3 in paper #185

Closed
yxliu-ntu opened this issue Aug 30, 2021 · 9 comments

Comments

yxliu-ntu commented Aug 30, 2021

Hi, all.

I am trying to reproduce the second block of Table 3 in the paper, but I run into problems for Gold #N = 7 and wonder what causes the inconsistency.
[screenshot: Table 3 from the paper]

The results I get are shown below.

batch size                   top-5    top-20    top-100
bsize=8 (Gold #N = 7)        44.0%    64.2%     78.2%
bsize=128 (Gold #N = 127)    57.6%    74.5%     84.0%

For Gold #N = 7 there is a large gap from the paper's numbers, while the Gold #N = 127 result is fine.

I ran the Gold #N = 7 experiment on a server with 4 V100 GPUs and the Gold #N = 127 experiment on a server with 8 V100 GPUs. The two servers have the same CUDA version and virtual environment; the only difference between the two runs is the batch size setting.

The script I used for experiments is shown below.

#!/bin/bash

set -x
HYDRA_FULL_ERROR=1 python train_dense_encoder.py \
    train_datasets=[nq_train] \
    dev_datasets=[nq_dev] \
    train=biencoder_nq \
    train.batch_size=$1 \
    train.hard_negatives=0 \
    output_dir=./runs/

Note that I ran these experiments in DataParallel (single-node, multi-GPU) mode.

@vlad-karpukhin
Contributor

Hi @yxliu-ntu,
I recommend using train_dense_encoder in DDP mode - this is the mode we used for all the results reported in the paper.
Also, in DDP mode the batch size is measured per GPU, so on a 4-GPU setup you will need to set batch_size=2 and hard_negatives=0.
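
For the 4-GPU server that would look roughly like this (just a sketch reusing the overrides from your script, not a command I have re-verified):

# DDP: train.batch_size is per GPU, so 4 GPUs x 2 = 8 total (Gold #N = 7)
python3 -m torch.distributed.launch --nproc_per_node=4 train_dense_encoder.py \
    train_datasets=[nq_train] \
    dev_datasets=[nq_dev] \
    train=biencoder_nq \
    train.batch_size=2 \
    train.hard_negatives=0 \
    output_dir=./runs/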

@yxliu-ntu
Author

Thanks for your advice! I will switch to the server with 8 V100 GPUs (the one used for the Gold #N = 127 run) and try again with the script below.

#!/bin/bash

source ./venv/bin/activate
set -x
python3 -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    train_datasets=[nq_train] \
    dev_datasets=[nq_dev] \
    train=biencoder_nq \
    train.batch_size=1 \
    train.hard_negatives=0 \
    output_dir=./runs/
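
If I understand the DDP setting correctly, the effective batch size here works out as follows (my own arithmetic, not taken from the logs):

# effective batch = nproc_per_node * train.batch_size = 8 * 1 = 8
# in-batch gold negatives per question: Gold #N = 8 - 1 = 7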

@yxliu-ntu
Author

Hi @vlad-karpukhin, I wonder how you select the best checkpoint or conduct grid searches. I have tried different checkpoints and found that the checkpoint with the best average rank on the validation set does not give the best performance at test time. The inconsistency might come from the different document candidate sets used for validation and testing.

@vlad-karpukhin
Contributor

I always evaluate both the last checkpoint and the one selected by average rank. Average rank doesn't always correlate with the full evaluation, but it is still a better checkpoint-selection criterion than the NLL loss. In my experience the last checkpoint is usually the best, but it doesn't differ much from the one selected by the best average rank.
As far as I remember, we reported the numbers from the checkpoint selected by average rank.

@yxliu-ntu
Author

Hi @vlad-karpukhin, I have run more experiments in DDP mode, but I still cannot match the results in Table 3. The last four experiments were all run on the same server with 8 V100 GPUs and CUDA 10.1.

bert-base                 top-5    top-20    top-100
bsize=8 (DP mode)         44.0%    64.2%     78.2%
bsize=8 (DDP mode)        41.3%    61.5%     77.2%
bsize=16 (DDP mode)       48.3%    67.7%     80.5%
bsize=128 (DP mode)       57.6%    74.5%     84.0%
bsize=256 (DDP mode)      59.1%    75.4%     84.5%

The training logs are available at https://drive.google.com/file/d/1qK_nAgSDaOh_LblnvV4H8LXwb3hNOVmy/view?usp=sharing.

@vlad-karpukhin
Contributor

Hi @yxliu-ntu,
I opened the log for the 256 batch size run and see that you have hard_negatives=0.
You should set it to 1 to get to 65% accuracy at top-5.
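
For example, keeping your launch command and only changing the negatives setting (a sketch; a per-GPU batch of 32 matches the 256 total from that log):

python3 -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    train_datasets=[nq_train] \
    dev_datasets=[nq_dev] \
    train=biencoder_nq \
    train.batch_size=32 \
    train.hard_negatives=1 \
    output_dir=./runs/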

@yxliu-ntu
Author

@vlad-karpukhin We want to reproduce the second block of Table 3 first, which is why I set hard_negatives=0.
[screenshot: second block of Table 3 from the paper]

@vlad-karpukhin
Contributor

Your bsize=128 (DP mode) line roughly corresponds to our Gold-127 line, with your 57.6% top-5 even better than our reported 55.8%. That is expected, since the reported numbers were obtained with a fairly old HuggingFace version and slightly different hyperparameters than the currently recommended values.

The Gold-31 line should have the (per-GPU) batch size set to 4 and the Gold-7 line to 1.
Unfortunately, I no longer have my logs for those ablation runs.
I might have used a different number of training epochs - 30 vs the default 40 - and the reported results are actually for our NQ dev set, not for the test set. Also, I might have made some learning rate adjustments, which is common practice when you change the batch size.
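
On your 8-GPU DDP setup, the Gold-7 / Gold-31 / Gold-127 rows map to per-GPU batch sizes of 1, 4 and 16, e.g. something along these lines (a sketch; the per-run output directory naming is arbitrary):

for bsz in 1 4 16; do   # Gold #N = 8 * batch_size - 1 -> 7, 31, 127
    python3 -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
        train_datasets=[nq_train] \
        dev_datasets=[nq_dev] \
        train=biencoder_nq \
        train.batch_size=$bsz \
        train.hard_negatives=0 \
        output_dir=./runs/gold_$((8 * bsz - 1))/
done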

Overall, I can clearly see accuracy improving with batch size in your results table, with your Gold-127 value matching (and even slightly exceeding) our results - which is exactly the point of that ablation.

I would provide you with the second-block checkpoints, but unfortunately we no longer have them, and honestly I don't see much value in reproducing the exact numbers for that ablation. It should just show that results improve as the batch size increases and saturate around the 127 -> 256 step, and that is what you already have.

@yxliu-ntu
Author

@vlad-karpukhin Thank you very much!
