Proper Benchmarking for DER and DER++ #6

Closed
cl-for-life opened this issue Mar 22, 2021 · 6 comments

@cl-for-life

Hi!

I'm looking to add DER / DER++ results as a baseline in my paper. I'm interested in the setting where only a single pass is made through the dataset (i.e. num_epochs == 1). Do you have any advice or suggestions regarding hyperparameter selection in this setting, specifically alpha, beta, and the learning rate? Thanks in advance :)

And great job on the library! I'm usually not a fan of such frameworks, as I find them too convoluted. This one is super minimal and easy to play with 👌

@cl-for-life (Author)

Also, is there an easy way to disable data augmentations in the pipeline? I could not find an argument for that.

@cl-for-life (Author)

cl-for-life commented Mar 24, 2021

I ended up running a grid search over the following parameters, drawing inspiration from the best arguments reported.

'--alpha': [0.1, 0.2, 0.3, 0.4, 0.5],
'--beta': [0.25, 0.5, 0.75, 1.0],
'--lr': [0.01, 0.05, 0.1],

I'm interested in reproducing the method in the setting used by (Aljundi et al.), i.e. I'm running with arguments --dataset seq-cifar10 --n_epochs 1 --batch_size 10 --minibatch_size 10, with --buffer_size set to 200, 500, and 1000.
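
For completeness, here's a rough sketch of how such a sweep can be driven (the `--model derpp` flag and the `utils/main.py` entry point are my assumptions about the repo's CLI; the other flags are the ones listed above):

```python
import itertools
import subprocess

# Hypothetical sweep driver: launches one run per grid configuration.
# Assumes a `python utils/main.py` entry point and a `--model derpp` flag;
# adjust the script path and flags to match the repository.
grid = {
    '--alpha': [0.1, 0.2, 0.3, 0.4, 0.5],
    '--beta': [0.25, 0.5, 0.75, 1.0],
    '--lr': [0.01, 0.05, 0.1],
}
fixed = ['--model', 'derpp', '--dataset', 'seq-cifar10',
         '--n_epochs', '1', '--batch_size', '10',
         '--minibatch_size', '10', '--validation']

for buffer_size in [200, 500, 1000]:
    for alpha, beta, lr in itertools.product(*grid.values()):
        cmd = ['python', 'utils/main.py', *fixed,
               '--buffer_size', str(buffer_size),
               '--alpha', str(alpha), '--beta', str(beta), '--lr', str(lr)]
        subprocess.run(cmd, check=True)
```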

I thought it would be good to post my results here, in case someone wanted to benchmark in a similar setting. The winning args for DER++ were

derpp     alpha=0.1  beta=0.50 lr=0.01 M=1000
derpp     alpha=0.4  beta=1.00 lr=0.01 M=500
derpp     alpha=0.4  beta=0.50 lr=0.01 M=200

(chosen by selecting the best accuracy averaged over 3 runs, using --validation)

If one of the authors could validate this or point out flaws in the methodology, that would be awesome. That way I can make sure the DER results are properly reported.

@JosephKJ @baraldilorenzo @mbosc @angpo

@mbosc (Collaborator)

mbosc commented Mar 24, 2021

Hello there,

Sorry for not replying sooner, we are a little busy with a paper rebuttal atm.

Thank you for your compliments to the framework, we use it all the time ourselves and designed it to be easily extended, so we're very glad it works for you! :D

To disable augmentations, you can modify the dataset directly and change the transforms in the TRANSFORM variable. If you simply comment out the lines containing RandomCrop and RandomHorizontalFlip, you are left with just the normalisation. The bare minimum you need to leave in place for everything to work is transforms.ToTensor(). Notice, however, that as we write in our paper, we believe augmentation is very important for the models to reach their best accuracy.
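
For example, a trimmed pipeline could look roughly like this (the normalisation statistics below are placeholders; keep whatever values are already defined in the dataset file you are editing):

```python
from torchvision import transforms

# Augmentation-free pipeline: only conversion to tensor and normalisation.
TRANSFORM = transforms.Compose([
    # transforms.RandomCrop(32, padding=4),   # disabled
    # transforms.RandomHorizontalFlip(),      # disabled
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # placeholder CIFAR-10 stats
                         (0.2470, 0.2435, 0.2616)),
])
```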

I guess your method for finding the best parameters is OK; just a word of warning: as we've written in our paper, we are very critical of the single-epoch setting usually promoted by Aljundi, Lopez-Paz, Chaudhry and others when applied to non-trivial datasets such as CIFAR-10, CIFAR-100 or even ImageNet (see section F.3 of our appendix), as we believe it does not allow disentangling forgetting from underfitting. However, as you did, you can easily obtain the desired behaviour by reducing the number of epochs.

Best of luck for your sub. Feel free to reach out if you need anything else!

@cl-for-life (Author)

Thanks for getting back! Yeah, I understand the concerns about separating forgetting from underfitting, and I agree with them. Interesting discussion in the appendix; I had not examined it previously. For what it's worth, DER++ was able to reach stronger performance under a single epoch (58%, 57%, 50% for M = 1K, 200, 100) than what was reported in the appendix. I'm guessing you guys didn't run another hparam search given that this was not the main setting.

Good luck with the rebuttal!

@cl-for-life (Author)

Actually, I think the discrepancy in the results may be due to the batch size more than anything else. Using bs == 10 vs., say, 32 gives the learner more optimization steps per epoch, leading to better performance.
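
As a back-of-the-envelope check (assuming the standard 5-task split of seq-cifar10, i.e. roughly 10,000 training images per task): one epoch at bs = 10 gives about 1,000 updates per task, versus only ~312 at bs = 32, so the smaller batch size roughly triples the number of gradient steps.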

@mbosc (Collaborator)

mbosc commented Mar 24, 2021

Thanks for the info, we look forward to reading more about it in the final paper!

@mbosc closed this as completed Mar 24, 2021