Improving reproducibility by additional settings in set_random_seeds #333

Merged
merged 2 commits into braindecode:master from set_random_seeds_fix on Nov 7, 2021

Conversation

sliwy
Collaborator

@sliwy sliwy commented Oct 10, 2021

In the last few days I ran into reproducibility issues when using set_random_seeds.

First, set_random_seeds does not ensure reproducibility on its own, because it sets neither torch.use_deterministic_algorithms() nor torch.backends.cudnn.benchmark = False.
Those settings can slow down computation, which is a known trade-off of reproducibility in PyTorch (see https://pytorch.org/docs/stable/notes/randomness.html). Still, I believe everyone who sets random seeds wants reproducibility, not just the same weight initialization and batch sampling order.
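For reference, here is a minimal sketch of the extra settings described in the PyTorch reproducibility notes, on top of plain seeding. The seed_everything helper is hypothetical, for illustration only, and is not braindecode's set_random_seeds:

import random
import numpy as np
import torch

def seed_everything(seed, cuda=False):
    # Plain seeding, roughly what set_random_seeds already covers
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)
        # Disable the cudnn autotuner, which may pick different
        # (nondeterministic) kernels on each run
        torch.backends.cudnn.benchmark = False
    # Make PyTorch raise an error when an op has no deterministic variant
    torch.use_deterministic_algorithms(True)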

However, some operations may not be possible with deterministic behavior on CUDA >= 10.2, so I guess we might want to provide a parameter that controls whether to enable deterministic behavior (defaulting to deterministic). What do you think? For example, it currently fails in our tests because we use set_random_seeds in tests_acceptance, so use_deterministic_algorithms affects all tests.

I also added a note about the PYTHONHASHSEED setting, which in some cases is needed to ensure reproducibility. I spent a long time looking for a solution to this problem, so I think it is worth keeping as a note in the docstring. I had to set PYTHONHASHSEED before launching the script, which is different from the usually suggested os.environ['PYTHONHASHSEED'].
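As a quick illustration of why that matters (a hypothetical train.py is assumed here): hash randomization is fixed when the interpreter starts, so assigning os.environ['PYTHONHASHSEED'] inside the script comes too late; the variable has to be set in the shell before launching, e.g. PYTHONHASHSEED=0 python train.py:

import os

# Reflects the value only if it was exported before Python started
print(os.environ.get('PYTHONHASHSEED'))
# hash() of a str is stable across runs only when hash randomization
# was disabled/seeded at interpreter startup
print(hash('braindecode'))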

This caused me a lot of trouble last week, because I believed that set_random_seeds would make my code reproducible 😂

@codecov

codecov bot commented Oct 10, 2021

Codecov Report

Merging #333 (b1d9bdd) into master (d9ad8fb) will increase coverage by 0.03%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master     #333      +/-   ##
==========================================
+ Coverage   81.75%   81.78%   +0.03%     
==========================================
  Files          51       51              
  Lines        3485     3491       +6     
==========================================
+ Hits         2849     2855       +6     
  Misses        636      636              

@sliwy sliwy force-pushed the set_random_seeds_fix branch 2 times, most recently from 2e1e5f4 to bcffb04 on October 10, 2021, 20:35
@robintibor
Contributor

robintibor commented Oct 14, 2021

I am quite against this. By default I wouldn't want the slowdowns coming from deterministic mode. set_random_seeds at the moment should give you results that are very similar, but not exactly the same, which in my view is enough for many cases of scientific reproducibility. I would be fine with adding a deterministic flag, but with default False... Then one could set it to True if one wants exact reproducibility (keeping in mind that it may only be exactly reproducible on the same machine).

@sliwy
Collaborator Author

sliwy commented Oct 14, 2021

I think we do not see those differences in our CI because we don't use CUDA, and maybe our datasets are not big enough (I am not sure about the dataset size). I can show you what happens to reproducibility when setting random seeds without cudnn.benchmark = False. I am running the plot_sleep_staging.py example on a cluster (on my desktop the results are exactly the same, but I think the differences depend on the type of hardware you use). Below are the training logs and confusion matrices (the format is not optimal, but I wanted to put this together quickly; it should give you a feeling for what goes wrong when cudnn.benchmark is not handled).

cudnn.benchmark = True and same random seed:

First run:

epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.3188
      2           0.2021        1.6084           0.2667        1.6095  0.1696
      3           0.2258        1.6001           0.2000        1.6110  0.1670
      4           0.2000        1.5655           0.2000        1.6145  0.1634
      5           0.2100        1.5134           0.2000        1.6331  0.1641
      6           0.2384        1.4419           0.2000        1.7156  0.1633
      7           0.2533        1.3695           0.2000        1.8053  0.1726
      8           0.2835        1.3215           0.2000        1.8395  0.1641
      9           0.3248        1.2944           0.2056        1.8472  0.1730
     10           0.3281        1.2747           0.2203        1.8473  0.1639
[[  0   0   3  58   0]
 [  0   0   8  13   0]
 [  0   0  27 167   0]
 [  0   0   8  67   0]
 [  0   0   4  28   0]]

Second run:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.3220
      2           0.2021        1.6084           0.2667        1.6095  0.1744
      3           0.2258        1.6001           0.2000        1.6110  0.1726
      4           0.2000        1.5655           0.2000        1.6145  0.1697
      5           0.2100        1.5133           0.2000        1.6331  0.1698
      6           0.2384        1.4420           0.2000        1.7158  0.1700
      7           0.2533        1.3697           0.2000        1.8075  0.1701
      8           0.2835        1.3213           0.2000        1.8381  0.1700
      9           0.3153        1.2950           0.2113        1.8506  0.1724
     10           0.2947        1.2753           0.2553        1.7925  0.1810
[[  0   0   8  53   0]
 [  0   0   9  12   0]
 [  0   0  61 133   0]
 [  0   0  25  50   0]
 [  0   0  12  20   0]]

cudnn.benchmark = False and different random seeds:

First seed:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.3003
      2           0.2001        1.6084           0.2667        1.6095  0.2471
      3           0.2278        1.6002           0.2000        1.6112  0.2407
      4           0.2000        1.5653           0.2000        1.6144  0.2394
      5           0.2100        1.5134           0.2000        1.6333  0.2389
      6           0.2384        1.4418           0.2000        1.7156  0.2400
      7           0.2533        1.3697           0.2000        1.8068  0.2398
      8           0.2953        1.3218           0.2000        1.8405  0.2390
      9           0.3204        1.2945           0.2113        1.8492  0.2451
     10           0.3127        1.2740           0.2682        1.7797  0.2415
[[  0   0  12  49   0]
 [  0   0  10  11   0]
 [  0   0  69 125   0]
 [  0   0  30  45   0]
 [  0   0  17  15   0]]

Second seed:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1515        1.6403           0.2499        1.6105  0.3094
      2           0.2159        1.5812           0.2000        1.6109  0.2465
      3           0.2048        1.5229           0.2037        1.5922  0.2376
      4           0.2092        1.4716           0.2009        1.5920  0.2362
      5           0.2097        1.3990           0.2009        1.5825  0.2369
      6           0.2193        1.3485           0.2018        1.5893  0.2365
      7           0.2373        1.3037           0.2065        1.5896  0.2371
      8           0.2817        1.2681           0.2195        1.5838  0.2369
      9           0.3147        1.2418           0.2520        1.5861  0.2484
     10           0.3304        1.2302           0.3253        1.5654  0.2380
[[ 16   0   4  40   1]
 [  1   0   4  16   0]
 [  4   0  31 157   2]
 [  1   0  14  60   0]
 [  0   0   6  26   0]]

cudnn.benchmark = False and same seed (only small differences in the results):

First run:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.2961
      2           0.2001        1.6084           0.2667        1.6095  0.2447
      3           0.2278        1.6002           0.2000        1.6112  0.2403
      4           0.2000        1.5653           0.2000        1.6144  0.2359
      5           0.2100        1.5134           0.2000        1.6333  0.2367
      6           0.2384        1.4418           0.2000        1.7156  0.2387
      7           0.2533        1.3696           0.2000        1.8067  0.2399
      8           0.2953        1.3218           0.2000        1.8406  0.2364
      9           0.3171        1.2945           0.2113        1.8474  0.2395
     10           0.3146        1.2742           0.2664        1.7863  0.2398
[[  0   0  11  50   0]
 [  0   0  10  11   0]
 [  0   0  66 128   0]
 [  0   0  29  46   0]
 [  0   0  17  15   0]]

Second run:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.2988
      2           0.2001        1.6084           0.2667        1.6095  0.2498
      3           0.2278        1.6002           0.2000        1.6112  0.2456
      4           0.2000        1.5653           0.2000        1.6144  0.2363
      5           0.2100        1.5135           0.2000        1.6333  0.2391
      6           0.2418        1.4418           0.2000        1.7156  0.2375
      7           0.2533        1.3696           0.2000        1.8061  0.2377
      8           0.2953        1.3219           0.2000        1.8399  0.2372
      9           0.3237        1.2947           0.2113        1.8504  0.2411
     10           0.3011        1.2742           0.2664        1.7833  0.2366
[[  0   0  11  50   0]
 [  0   0  10  11   0]
 [  0   0  67 127   0]
 [  0   0  29  46   0]
 [  0   0  17  15   0]]

Third run:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.3003
      2           0.2001        1.6084           0.2667        1.6095  0.2471
      3           0.2278        1.6002           0.2000        1.6112  0.2407
      4           0.2000        1.5653           0.2000        1.6144  0.2394
      5           0.2100        1.5134           0.2000        1.6333  0.2389
      6           0.2384        1.4418           0.2000        1.7156  0.2400
      7           0.2533        1.3697           0.2000        1.8068  0.2398
      8           0.2953        1.3218           0.2000        1.8405  0.2390
      9           0.3204        1.2945           0.2113        1.8492  0.2451
     10           0.3127        1.2740           0.2682        1.7797  0.2415
[[  0   0  12  49   0]
 [  0   0  10  11   0]
 [  0   0  69 125   0]
 [  0   0  30  45   0]
 [  0   0  17  15   0]]

Conclusion:

  1. Setting the random seed may work well when CUDA is not used, but I haven't checked that yet. On CPU, torch can also select some nondeterministic operations.
  2. Setting the random seed on GPU without setting cudnn.benchmark = False may make your work not reproducible at all; the differences are comparable to running the model with different random seeds. On a different problem, I observed prediction differences of around 15-20% between different random seeds, and around 10-15% for the same seed with cudnn.benchmark = True (comparing the predictions to each other, not their accuracy against the true labels; see the sketch after this list).
  3. I feel this function should show a big warning that, if you're using CUDA, your results won't be reproducible on some devices.
  4. Setting random seeds without handling cudnn.benchmark can, in some cases, make you believe your computations are reproducible when they actually are not, which is really bad. People usually set random seeds for reproducibility, and this alone won't ensure it.
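
For concreteness, this is the kind of comparison point 2 refers to (a hypothetical sketch, assuming predicted labels from two runs were saved to the files named below):

import numpy as np

# Hypothetical files holding the predicted class labels from two runs
preds_run1 = np.load('preds_run1.npy')
preds_run2 = np.load('preds_run2.npy')

# Fraction of samples where the two runs disagree with each other
disagreement = np.mean(preds_run1 != preds_run2)
print(f'{disagreement:.1%} of predictions differ between runs')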

@sliwy
Collaborator Author

sliwy commented Oct 14, 2021

One more thing regarding torch.use_deterministic_algorithms(deterministic): I spent more time investigating it. A lot of operations cannot run with it enabled, so I am now more on the side of not including it in set_random_seeds. However, I would still keep the cudnn.benchmark = False setting, as it improves reproducibility.

@sliwy
Collaborator Author

sliwy commented Oct 19, 2021

@robintibor what do you think about this behavior?

@robintibor
Contributor

robintibor commented Nov 2, 2021

I feel like cudnn.benchmark is something where, if you set it, you should know why you are setting it and that it will result in some nondeterminism. So maybe what we could do is, inside set_random_seeds, give out a warning if cudnn.benchmark is set to True, saying that results may not be so well reproducible. The function could also have a flag to suppress that warning: for example an argument cudnn_benchmark=None. If it is not passed and benchmark was set to True outside the function, there is a warning; the warning is suppressed if you supply True explicitly. What do you think? A rough sketch of the idea is below.
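
A rough sketch of how such a flag could behave (just to illustrate the proposal; the details and warning text may differ from what actually gets merged):

import warnings
import torch

def set_random_seeds(seed, cuda, cudnn_benchmark=None):
    if cuda:
        if isinstance(cudnn_benchmark, bool):
            # Explicit choice by the caller, apply it without warning
            torch.backends.cudnn.benchmark = cudnn_benchmark
        elif cudnn_benchmark is None:
            # No explicit choice: warn if benchmark was enabled elsewhere,
            # since it can make runs with the same seed diverge on GPU
            if torch.backends.cudnn.benchmark:
                warnings.warn(
                    'torch.backends.cudnn.benchmark was set to True, which '
                    'may result in a lack of reproducibility. Set it to '
                    'False, or pass cudnn_benchmark=True explicitly to '
                    'silence this warning.')
        else:
            raise ValueError('cudnn_benchmark expected to be bool or None')
    # ... then seed random, numpy and torch as before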

@robintibor
Contributor

Great, maybe add a small note to whats_new?

@sliwy
Collaborator Author

sliwy commented Nov 3, 2021

@robintibor thanks for taking a look at this.

It's a good idea to have this warning, and I think it's enough to warn users about the lack of reproducibility that may occur in some cases and to point them to the solution.

Sure, I'll add a line to whats_new. I'm just waiting for the docs to render so I can check that everything is OK.

@sliwy
Collaborator Author

sliwy commented Nov 3, 2021

@robintibor I think it's ready, let me know if we need something more here :)

@sliwy
Collaborator Author

sliwy commented Nov 3, 2021

One more thing @robintibor

Should we remove the line about reproducing results in all the examples? It may be misleading given that we set benchmark=True:

if cuda:
    torch.backends.cudnn.benchmark = True
# Set random seed to be able to reproduce results
set_random_seeds(seed=random_state, cuda=cuda)

@robintibor
Contributor

robintibor commented Nov 3, 2021

We could enhance it to:

# Set random seed to be able to roughly reproduce results
# Note that with cudnn benchmark set to True, GPU indeterminism
# may still make results substantially different between runs
set_random_seeds(seed=random_state, cuda=cuda)

@sliwy
Collaborator Author

sliwy commented Nov 3, 2021

@robintibor done :)

@robintibor
Contributor

Great that we have a version that works for both of us! Thanks for the work!

@robintibor robintibor merged commit 9136057 into braindecode:master Nov 7, 2021
Collaborator

@agramfort agramfort left a comment


sorry to be a bit late to the party

@sliwy do you think it's relevant? if so can you open a new PR to fix this?

🙏
