Improving reproducibility by additional settings in set_random_seeds #333

Merged
merged 2 commits into braindecode:master from set_random_seeds_fix on Nov 7, 2021

Conversation

sliwy
Collaborator

@sliwy sliwy commented Oct 10, 2021

In the last few days I ran into reproducibility issues when using set_random_seeds.

First, set_random_seeds does not ensure reproducibility on its own, because it sets neither torch.use_deterministic_algorithms() nor torch.backends.cudnn.benchmark = False.
Those settings can slow down computation, which is a known trade-off of reproducibility in PyTorch (see https://pytorch.org/docs/stable/notes/randomness.html). Still, I believe everyone who sets random seeds wants reproducibility, not just the same weight initialization and batch sampling order.
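For reference, here is a minimal sketch of the extra settings described in the PyTorch reproducibility notes, on top of plain seeding. The seed_everything helper is hypothetical, for illustration only, and is not braindecode's set_random_seeds:

import random
import numpy as np
import torch

def seed_everything(seed, cuda=False):
    # Plain seeding, roughly what set_random_seeds already covers
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)
        # Disable the cudnn autotuner, which may pick different
        # (nondeterministic) kernels on each run
        torch.backends.cudnn.benchmark = False
    # Make PyTorch raise an error when an op has no deterministic variant
    torch.use_deterministic_algorithms(True)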

However, some operations may not be possible with deterministic behavior on CUDA >= 10.2, so I guess we might want to provide a parameter that controls whether to enable deterministic behavior (defaulting to deterministic). What do you think? For example, it currently fails in our tests because we use set_random_seeds in tests_acceptance, so use_deterministic_algorithms affects all tests.

I also added a note about the PYTHONHASHSEED setting, which in some cases is needed to ensure reproducibility. I spent a long time looking for a solution to this problem, so I think it is worth keeping as a note in the docstring. I had to set PYTHONHASHSEED before launching the script, which is different from the usually suggested os.environ['PYTHONHASHSEED'].
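As a quick illustration of why that matters (a hypothetical train.py is assumed here): hash randomization is fixed when the interpreter starts, so assigning os.environ['PYTHONHASHSEED'] inside the script comes too late; the variable has to be set in the shell before launching, e.g. PYTHONHASHSEED=0 python train.py:

import os

# Reflects the value only if it was exported before Python started
print(os.environ.get('PYTHONHASHSEED'))
# hash() of a str is stable across runs only when hash randomization
# was disabled/seeded at interpreter startup
print(hash('braindecode'))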

This caused me a lot of trouble last week, because I believed that set_random_seeds would make my code reproducible 😂

@codecov

codecov bot commented Oct 10, 2021

Codecov Report

Merging #333 (b1d9bdd) into master (d9ad8fb) will increase coverage by 0.03%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master     #333      +/-   ##
==========================================
+ Coverage   81.75%   81.78%   +0.03%     
==========================================
  Files          51       51              
  Lines        3485     3491       +6     
==========================================
+ Hits         2849     2855       +6     
  Misses        636      636              

@sliwy sliwy force-pushed the set_random_seeds_fix branch 2 times, most recently from 2e1e5f4 to bcffb04 on October 10, 2021, 20:35
@robintibor
Contributor

robintibor commented Oct 14, 2021

I am quite against this. By default I wouldn't want the slowdowns coming from deterministic mode. set_random_seeds at the moment should give you results that are very similar, but not exactly the same, which in my view is enough for many cases of scientific reproducibility. I would be fine with adding a deterministic flag, but with default False... Then one could set it to True if one wants exact reproducibility (keeping in mind that it may only be exactly reproducible on the same machine).

@sliwy
Collaborator Author

sliwy commented Oct 14, 2021

I think we do not see those differences in our CI because we don't use CUDA, and maybe our datasets are not big enough (I am not sure about the dataset size). I can show you what happens to reproducibility when setting random seeds without cudnn.benchmark = False. I am running the plot_sleep_staging.py example on a cluster (on my desktop the results are exactly the same, but I think the differences depend on the type of hardware you use). Below are the training logs and confusion matrices (the format is not optimal, but I wanted to put this together quickly; it should give you a feeling for what goes wrong when cudnn.benchmark is not handled).

cudnn.benchmark = True and same random seed:

First run:

epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.3188
      2           0.2021        1.6084           0.2667        1.6095  0.1696
      3           0.2258        1.6001           0.2000        1.6110  0.1670
      4           0.2000        1.5655           0.2000        1.6145  0.1634
      5           0.2100        1.5134           0.2000        1.6331  0.1641
      6           0.2384        1.4419           0.2000        1.7156  0.1633
      7           0.2533        1.3695           0.2000        1.8053  0.1726
      8           0.2835        1.3215           0.2000        1.8395  0.1641
      9           0.3248        1.2944           0.2056        1.8472  0.1730
     10           0.3281        1.2747           0.2203        1.8473  0.1639
[[  0   0   3  58   0]
 [  0   0   8  13   0]
 [  0   0  27 167   0]
 [  0   0   8  67   0]
 [  0   0   4  28   0]]

Second run:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.3220
      2           0.2021        1.6084           0.2667        1.6095  0.1744
      3           0.2258        1.6001           0.2000        1.6110  0.1726
      4           0.2000        1.5655           0.2000        1.6145  0.1697
      5           0.2100        1.5133           0.2000        1.6331  0.1698
      6           0.2384        1.4420           0.2000        1.7158  0.1700
      7           0.2533        1.3697           0.2000        1.8075  0.1701
      8           0.2835        1.3213           0.2000        1.8381  0.1700
      9           0.3153        1.2950           0.2113        1.8506  0.1724
     10           0.2947        1.2753           0.2553        1.7925  0.1810
[[  0   0   8  53   0]
 [  0   0   9  12   0]
 [  0   0  61 133   0]
 [  0   0  25  50   0]
 [  0   0  12  20   0]]

cudnn.benchmark = False and different random seeds:

First seed:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.3003
      2           0.2001        1.6084           0.2667        1.6095  0.2471
      3           0.2278        1.6002           0.2000        1.6112  0.2407
      4           0.2000        1.5653           0.2000        1.6144  0.2394
      5           0.2100        1.5134           0.2000        1.6333  0.2389
      6           0.2384        1.4418           0.2000        1.7156  0.2400
      7           0.2533        1.3697           0.2000        1.8068  0.2398
      8           0.2953        1.3218           0.2000        1.8405  0.2390
      9           0.3204        1.2945           0.2113        1.8492  0.2451
     10           0.3127        1.2740           0.2682        1.7797  0.2415
[[  0   0  12  49   0]
 [  0   0  10  11   0]
 [  0   0  69 125   0]
 [  0   0  30  45   0]
 [  0   0  17  15   0]]

Second seed:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1515        1.6403           0.2499        1.6105  0.3094
      2           0.2159        1.5812           0.2000        1.6109  0.2465
      3           0.2048        1.5229           0.2037        1.5922  0.2376
      4           0.2092        1.4716           0.2009        1.5920  0.2362
      5           0.2097        1.3990           0.2009        1.5825  0.2369
      6           0.2193        1.3485           0.2018        1.5893  0.2365
      7           0.2373        1.3037           0.2065        1.5896  0.2371
      8           0.2817        1.2681           0.2195        1.5838  0.2369
      9           0.3147        1.2418           0.2520        1.5861  0.2484
     10           0.3304        1.2302           0.3253        1.5654  0.2380
[[ 16   0   4  40   1]
 [  1   0   4  16   0]
 [  4   0  31 157   2]
 [  1   0  14  60   0]
 [  0   0   6  26   0]]

cudnn.benchmark = False and same seed (only small differences in the results):

First run:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.2961
      2           0.2001        1.6084           0.2667        1.6095  0.2447
      3           0.2278        1.6002           0.2000        1.6112  0.2403
      4           0.2000        1.5653           0.2000        1.6144  0.2359
      5           0.2100        1.5134           0.2000        1.6333  0.2367
      6           0.2384        1.4418           0.2000        1.7156  0.2387
      7           0.2533        1.3696           0.2000        1.8067  0.2399
      8           0.2953        1.3218           0.2000        1.8406  0.2364
      9           0.3171        1.2945           0.2113        1.8474  0.2395
     10           0.3146        1.2742           0.2664        1.7863  0.2398
[[  0   0  11  50   0]
 [  0   0  10  11   0]
 [  0   0  66 128   0]
 [  0   0  29  46   0]
 [  0   0  17  15   0]]

Second run:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.2988
      2           0.2001        1.6084           0.2667        1.6095  0.2498
      3           0.2278        1.6002           0.2000        1.6112  0.2456
      4           0.2000        1.5653           0.2000        1.6144  0.2363
      5           0.2100        1.5135           0.2000        1.6333  0.2391
      6           0.2418        1.4418           0.2000        1.7156  0.2375
      7           0.2533        1.3696           0.2000        1.8061  0.2377
      8           0.2953        1.3219           0.2000        1.8399  0.2372
      9           0.3237        1.2947           0.2113        1.8504  0.2411
     10           0.3011        1.2742           0.2664        1.7833  0.2366
[[  0   0  11  50   0]
 [  0   0  10  11   0]
 [  0   0  67 127   0]
 [  0   0  29  46   0]
 [  0   0  17  15   0]]

Third run:

  epoch    train_bal_acc    train_loss    valid_bal_acc    valid_loss     dur
-------  ---------------  ------------  ---------------  ------------  ------
      1           0.1500        1.6480           0.2236        1.6100  0.3003
      2           0.2001        1.6084           0.2667        1.6095  0.2471
      3           0.2278        1.6002           0.2000        1.6112  0.2407
      4           0.2000        1.5653           0.2000        1.6144  0.2394
      5           0.2100        1.5134           0.2000        1.6333  0.2389
      6           0.2384        1.4418           0.2000        1.7156  0.2400
      7           0.2533        1.3697           0.2000        1.8068  0.2398
      8           0.2953        1.3218           0.2000        1.8405  0.2390
      9           0.3204        1.2945           0.2113        1.8492  0.2451
     10           0.3127        1.2740           0.2682        1.7797  0.2415
[[  0   0  12  49   0]
 [  0   0  10  11   0]
 [  0   0  69 125   0]
 [  0   0  30  45   0]
 [  0   0  17  15   0]]

Conclusion:

  1. Setting the random seed may work well when CUDA is not used, but I haven't checked that yet. On CPU, torch can also select some nondeterministic operations.
  2. Setting the random seed on GPU without setting cudnn.benchmark = False may make your work not reproducible at all; the differences are comparable to running the model with different random seeds. On a different problem, I observed prediction differences of around 15-20% between different random seeds, and around 10-15% for the same seed with cudnn.benchmark = True (comparing the predictions to each other, not their accuracy against the true labels; see the sketch after this list).
  3. I feel this function should show a big warning that, if you're using CUDA, your results won't be reproducible on some devices.
  4. Setting random seeds without handling cudnn.benchmark can, in some cases, make you believe your computations are reproducible when they actually are not, which is really bad. People usually set random seeds for reproducibility, and this alone won't ensure it.
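
For concreteness, this is the kind of comparison point 2 refers to (a hypothetical sketch, assuming predicted labels from two runs were saved to the files named below):

import numpy as np

# Hypothetical files holding the predicted class labels from two runs
preds_run1 = np.load('preds_run1.npy')
preds_run2 = np.load('preds_run2.npy')

# Fraction of samples where the two runs disagree with each other
disagreement = np.mean(preds_run1 != preds_run2)
print(f'{disagreement:.1%} of predictions differ between runs')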

@sliwy
Collaborator Author

sliwy commented Oct 14, 2021

One more thing regarding torch.use_deterministic_algorithms(deterministic): I spent more time investigating it. A lot of operations cannot run with it enabled, so I am now more on the side of not including it in set_random_seeds. However, I would still keep the cudnn.benchmark = False setting, as it improves reproducibility.

@sliwy
Collaborator Author

sliwy commented Oct 19, 2021

@robintibor what do you think about this behavior?

@robintibor
Contributor

robintibor commented Nov 2, 2021

I feel like cudnn.benchmark is something where, if you set it, you should know why you are setting it and that it will result in some nondeterminism. So maybe what we could do is, inside set_random_seeds, give out a warning if cudnn.benchmark is set to True, saying that results may not be so well reproducible. The function could also have a flag to suppress that warning: for example an argument cudnn_benchmark=None. If it is not passed and benchmark was set to True outside the function, there is a warning; the warning is suppressed if you supply True explicitly. What do you think? A rough sketch of the idea is below.
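
A rough sketch of how such a flag could behave (just to illustrate the proposal; the details and warning text may differ from what actually gets merged):

import warnings
import torch

def set_random_seeds(seed, cuda, cudnn_benchmark=None):
    if cuda:
        if isinstance(cudnn_benchmark, bool):
            # Explicit choice by the caller, apply it without warning
            torch.backends.cudnn.benchmark = cudnn_benchmark
        elif cudnn_benchmark is None:
            # No explicit choice: warn if benchmark was enabled elsewhere,
            # since it can make runs with the same seed diverge on GPU
            if torch.backends.cudnn.benchmark:
                warnings.warn(
                    'torch.backends.cudnn.benchmark was set to True, which '
                    'may result in a lack of reproducibility. Set it to '
                    'False, or pass cudnn_benchmark=True explicitly to '
                    'silence this warning.')
        else:
            raise ValueError('cudnn_benchmark expected to be bool or None')
    # ... then seed random, numpy and torch as before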

@robintibor
Contributor

Great, maybe add a small note to whats_new?

@sliwy
Collaborator Author

sliwy commented Nov 3, 2021

@robintibor thanks for taking a look at this.

It's a good idea to have this warning, and I think it's enough to warn users about the lack of reproducibility that may occur in some cases and to point them to the solution.

Sure, I'll add a line to whats_new. I'm just waiting for the docs to render so I can check that everything is OK.

@sliwy
Collaborator Author

sliwy commented Nov 3, 2021

@robintibor I think it's ready, let me know if we need something more here :)

@sliwy
Collaborator Author

sliwy commented Nov 3, 2021

One more thing @robintibor

Should we remove the line about reproducing results in all the examples? It may be misleading given that we set benchmark=True:

if cuda:
    torch.backends.cudnn.benchmark = True
# Set random seed to be able to reproduce results
set_random_seeds(seed=random_state, cuda=cuda)

@robintibor
Contributor

robintibor commented Nov 3, 2021

We could enhance it to:

# Set random seed to be able to roughly reproduce results
# Note that with cudnn benchmark set to True, GPU indeterminism
# may still make results substantially different between runs
set_random_seeds(seed=random_state, cuda=cuda)

@sliwy
Collaborator Author

sliwy commented Nov 3, 2021

@robintibor done :)

@robintibor
Contributor

Great that we have a version that works for both of us! Thanks for the work!

@robintibor robintibor merged commit 9136057 into braindecode:master Nov 7, 2021
Collaborator

@agramfort agramfort left a comment


sorry to be a bit late to the party

@sliwy do you think it's relevant? if so can you open a new PR to fix this?

🙏
