accuracy variation depending on the number of GPUs used #2

Open · zhl98 opened this issue Mar 31, 2022 · 10 comments

zhl98 commented Mar 31, 2022

Hello, thank you very much for your code!
I used the DyTox settings from the code for 10 steps of training, but I could not reach the accuracy reported in the paper.
bash train.sh 0 --options options/data/cifar100_10-10.yaml options/data/cifar100_order1.yaml options/model/cifar_dytox.yaml --name dytox --data-path MY_PATH_TO_DATASET --output-basedir PATH_TO_SAVE_CHECKPOINTS
Here is the reproduction result:
[screenshot of the training log]
The average accuracy is 69.54%.
Can you give me some advice? Thank you very much!

arthurdouillard (Owner) commented:

After cleaning the code I've only re-tested CIFAR100 with 50 steps, where the results were exactly reproduced. I'm re-launching the 10-step setting to check it.

zhl98 (Author) commented Mar 31, 2022

OK, thank you very much!

arthurdouillard (Owner) commented Apr 1, 2022

Hey, so I haven't had time to fully reproduce the 10 steps with a single GPU, but the first 5 steps are indeed similar to yours.
However, when run with 2 GPUs, I got exactly the results from my paper (even a little better).

I think the discrepancy comes from the fact that with two GPUs I'm actually using a batch size twice as large (PyTorch's DDP uses batch_size on each GPU). So my effective batch size is bigger than yours, which can explain the difference.
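[Editor's note] For context, a minimal sketch of what this means in PyTorch; this is not DyTox's actual code, and the linear LR-scaling rule at the end is only a common heuristic, not necessarily what the repo does:

import torch.distributed as dist

# With PyTorch DDP, every process builds its own DataLoader with `batch_size`
# samples, so one optimizer step effectively sees batch_size * world_size samples.
per_gpu_batch_size = 128  # `batch_size` in options/model/cifar_dytox.yaml
world_size = dist.get_world_size() if dist.is_initialized() else 1
effective_batch_size = per_gpu_batch_size * world_size  # 256 with 2 GPUs, 128 with 1

# Common linear LR-scaling heuristic (illustration only):
base_lr = 0.0005  # `incremental_lr` in the YAML below
scaled_lr = base_lr * effective_batch_size / 256  # assumes a reference batch of 256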

So what you can do is modify cifar_dytox.yaml and increase the batch size to 256 (128 * 2).
This option file should work:

#######################
# DyTox, for CIFAR100 #
#######################

# Model definition
model: convit
embed_dim: 384
depth: 6
num_heads: 12
patch_size: 4
input_size: 32
local_up_to_layer: 5
class_attention: true

# Training setting
no_amp: true
eval_every: 50

# Base hyperparameter
weight_decay: 0.000001
batch_size: 128
incremental_lr: 0.0005
incremental_batch_size: 256  # UPDATE VALUE
rehearsal: icarl_all

# Knowledge Distillation
auto_kd: true

# Finetuning
finetuning: balanced
finetuning_epochs: 20

# Dytox model
dytox: true
freeze_task: [old_task_tokens, old_heads]
freeze_ft: [sab]

# Divergence head to get diversity
head_div: 0.1
head_div_mode: tr

# Independent Classifiers
ind_clf: 1-1
bce_loss: true


# Advanced Augmentations, here disabled

## Erasing
reprob: 0.0
remode: pixel
recount: 1
resplit: false

## MixUp & CutMix
mixup: 0.0
cutmix: 0.0

If you have time to tell me whether it works better, great; otherwise I'll check it in the coming weeks.

Since I'm 100% sure the results are reproducible with two GPUs, the problem must be there.

arthurdouillard self-assigned this on Apr 1, 2022
zhl98 (Author) commented Apr 3, 2022

Hey, after updating incremental_batch_size to 256 and running with 1 GPU, the result is still only 69.50%.
[screenshot of the training log]

But it does seem that two GPUs work better:
I tested dytox_plus with 2 GPUs and got an average of 76.17% (even a little better than in your paper).

arthurdouillard (Owner) commented:

Hum... I'm launching experiments with a batch size of 256 (the YAML I gave you only changed it for steps t > 1, not t = 0, my bad), with an LR of 0.0005 (the default one) and an LR of 0.001 (twice as big, as it would have been if using two GPUs).

I'm also enabling mixed-precision (no_amp: false) to go faster.

I'll keep you updated.
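
[Editor's note] If a single GPU cannot fit batch_size: 256 directly, gradient accumulation is one generic way to emulate the 2-GPU effective batch. This is not something the repo does out of the box; the sketch below only illustrates the idea, with hypothetical model / criterion / optimizer / loader arguments:

def train_one_epoch_accum(model, criterion, optimizer, loader, accum_steps=2):
    # Sketch: emulate an effective batch of accum_steps * loader.batch_size
    # on a single GPU by accumulating gradients before each optimizer step.
    model.train()
    optimizer.zero_grad()
    for i, (images, targets) in enumerate(loader):
        loss = criterion(model(images), targets) / accum_steps  # average over micro-batches
        loss.backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

Note that this only matches the gradient averaging of a larger batch; it does not reproduce everything a 2-GPU DDP run does (e.g. per-process data sampling), so it may not close the gap by itself.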

arthurdouillard changed the title from "accuracy" to "accuracy variation depending on the number of GPUs used" on Apr 3, 2022
Kishaan commented May 10, 2022

Hi,

Posting here because I'm having the same issue. I ran the DyTox model on CIFAR-100 with the same settings as in the first comment, on a single GPU, and I'm getting the following log:

{"task": 0, "epoch": 499, "acc": 92.5, "avg_acc": 92.5, "forgetting": 0.0, "acc_per_task": [92.5], "train_lr": 1.0004539958280581e-05, "bwt": 0.0, "fwt": 0.0, "test_acc1": 92.5, "test_acc5": 99.4, "mean_acc5": 99.4, "train_loss": 0.05053, "test_loss": 0.36721, "token_mean_dist": 0.0, "token_min_dist": 0.0, "token_max_dist": 0.0}
{"task": 1, "epoch": 19, "acc": 85.55, "avg_acc": 89.02, "forgetting": 0.0, "acc_per_task": [87.7, 83.4], "train_lr": 1.2500000000000004e-05, "bwt": 0.0, "fwt": 87.7, "test_acc1": 85.55, "test_acc5": 96.95, "mean_acc5": 98.18, "train_loss": 0.03499, "test_loss": 0.80777, "token_mean_dist": 0.54355, "token_min_dist": 0.54355, "token_max_dist": 0.54355}
{"task": 2, "epoch": 19, "acc": 78.67, "avg_acc": 85.57, "forgetting": 6.25, "acc_per_task": [80.0, 74.0, 82.0], "train_lr": 1.2500000000000004e-05, "bwt": -4.17, "fwt": 80.57, "test_acc1": 78.67, "test_acc5": 94.9, "mean_acc5": 97.08, "train_loss": 0.0259, "test_loss": 1.07032, "token_mean_dist": 0.58243, "token_min_dist": 0.53487, "token_max_dist": 0.61953}
{"task": 3, "epoch": 19, "acc": 73.32, "avg_acc": 82.51, "forgetting": 11.6, "acc_per_task": [71.3, 69.8, 70.6, 81.6], "train_lr": 1.2500000000000004e-05, "bwt": -7.88, "fwt": 75.57, "test_acc1": 73.33, "test_acc5": 93.1, "mean_acc5": 96.09, "train_loss": 0.02083, "test_loss": 1.37981, "token_mean_dist": 0.58081, "token_min_dist": 0.52581, "token_max_dist": 0.61908}
{"task": 4, "epoch": 19, "acc": 69.46, "avg_acc": 79.9, "forgetting": 16.5, "acc_per_task": [65.3, 65.9, 60.7, 71.7, 83.7], "train_lr": 1.2500000000000004e-05, "bwt": -11.33, "fwt": 71.7, "test_acc1": 69.46, "test_acc5": 92.04, "mean_acc5": 95.28, "train_loss": 0.0163, "test_loss": 1.65585, "token_mean_dist": 0.58517, "token_min_dist": 0.51872, "token_max_dist": 0.62832}
{"task": 5, "epoch": 19, "acc": 68.23, "avg_acc": 77.96, "forgetting": 19.32, "acc_per_task": [64.1, 59.3, 54.6, 64.9, 79.3, 87.2], "train_lr": 1.2500000000000004e-05, "bwt": -13.99, "fwt": 69.28, "test_acc1": 68.23, "test_acc5": 91.15, "mean_acc5": 94.59, "train_loss": 0.01265, "test_loss": 1.64966, "token_mean_dist": 0.6064, "token_min_dist": 0.5128, "token_max_dist": 0.70423}
{"task": 6, "epoch": 19, "acc": 64.01, "avg_acc": 75.96, "forgetting": 22.3, "acc_per_task": [60.5, 52.0, 48.8, 56.2, 71.9, 80.3, 78.4], "train_lr": 1.2500000000000004e-05, "bwt": -16.37, "fwt": 67.09, "test_acc1": 64.01, "test_acc5": 89.11, "mean_acc5": 93.81, "train_loss": 0.01232, "test_loss": 1.96759, "token_mean_dist": 0.60002, "token_min_dist": 0.50834, "token_max_dist": 0.7036}
{"task": 7, "epoch": 19, "acc": 60.25, "avg_acc": 74.0, "forgetting": 25.642857, "acc_per_task": [55.3, 46.9, 43.2, 50.9, 60.3, 74.3, 65.3, 85.8], "train_lr": 1.2500000000000004e-05, "bwt": -18.69, "fwt": 64.47, "test_acc1": 60.25, "test_acc5": 87.64, "mean_acc5": 93.04, "train_loss": 0.00952, "test_loss": 2.14214, "token_mean_dist": 0.59949, "token_min_dist": 0.50265, "token_max_dist": 0.70439}
{"task": 8, "epoch": 19, "acc": 58.38, "avg_acc": 72.26, "forgetting": 28.075, "acc_per_task": [53.6, 42.7, 41.5, 48.0, 53.9, 67.2, 57.3, 77.7, 83.5], "train_lr": 1.2500000000000004e-05, "bwt": -20.77, "fwt": 62.42, "test_acc1": 58.38, "test_acc5": 85.98, "mean_acc5": 92.25, "train_loss": 0.00978, "test_loss": 2.24582, "token_mean_dist": 0.59777, "token_min_dist": 0.49842, "token_max_dist": 0.70554}
{"task": 9, "epoch": 19, "acc": 54.61, "avg_acc": 70.5, "forgetting": 31.277778, "acc_per_task": [50.0, 39.4, 32.4, 44.1, 47.7, 63.2, 49.8, 66.5, 74.0, 79.0], "train_lr": 1.2500000000000004e-05, "bwt": -22.87, "fwt": 60.31, "test_acc1": 54.61, "test_acc5": 83.76, "mean_acc5": 91.4, "train_loss": 0.00789, "test_loss": 2.54448, "token_mean_dist": 0.59817, "token_min_dist": 0.49496, "token_max_dist": 0.70778}
{"avg": 70.49870843967983}

Is this accuracy expected? The final accuracy (54.61) is lower than the number I see in the paper for CIFAR-100, 10 steps. I'm trying to understand how multi-GPU training alone can bring such a big improvement. Any help would be much appreciated.

arthurdouillard (Owner) commented:

Hello, I'm still trying to improve performance on a single GPU. I'll keep this issue updated if I find ways to do it.

In the meantime, try running on two GPUs, as the results have been reproduced by multiple people (including @zhl98, who opened this issue).

Kishaan commented May 23, 2022

Hi,

Just a short update. I thought repeated augmentation (RA) could be the reason behind the improved multi-GPU results, so I ran it without RA, but I was still getting around 59% accuracy, which means that cannot be the reason. Please let us know if you figure out how to make it work in the single-GPU setting.
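
[Editor's note] For readers unfamiliar with the acronym: repeated augmentation (RA), as used in DeiT-style distributed training, draws fewer distinct images per epoch but feeds several differently-augmented copies of each sampled image to the model. A rough, simplified sketch of the sampling idea (not the repo's actual distributed sampler):

import random

def repeated_augmentation_indices(num_samples, num_repeats=3, seed=0):
    # Shuffle once, then repeat every index `num_repeats` times; each copy
    # later receives a different random augmentation in the data pipeline.
    rng = random.Random(seed)
    order = list(range(num_samples))
    rng.shuffle(order)
    repeated = [i for i in order for _ in range(num_repeats)]
    # Truncate so an "epoch" keeps its usual length but covers fewer distinct images.
    return repeated[:num_samples]

In the DeiT-style distributed sampler the repeated copies are typically spread across processes, which ties its behaviour to the number of GPUs.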

arthurdouillard (Owner) commented:

Yeah, I chatted with Hugo Touvron (the main DeiT author) and he also suspected RA. I've tried multi-GPU without RA and single-GPU with RA, and nothing changed significantly.

I'll keep you updated.

arthurdouillard (Owner) commented:

The accuracy variation is in major part explained in the following erratum.
We are trying to see how we could emulate our distributed memory (see the erratum) in the single-GPU setting.
