
Different results after running lr_find() at different times #3295

Closed
ekdnam opened this issue Apr 3, 2021 · 6 comments

@ekdnam commented Apr 3, 2021

Libraries

  • fastai: 2.1.10
  • fastcore: 1.3.19
  • torch: 1.8.1+cpu
  • sklearn: 0.23.2

Describe the bug
Running learn.lr_find() leads to different results per run.

To Reproduce
Steps to reproduce the behavior:

  1. The code is shown here.
  2. Run till this cell:
from fastai.callback import schedule

lr_min, lr_steep = learn.lr_find()

print('Learning rate with the minimum loss:', lr_min)
print('Learning rate with the steepest gradient:', lr_steep)
  3. Earlier, the results were (as given in the notebook):
Learning rate with the minimum loss: 0.006918309628963471
Learning rate with the steepest gradient: 0.033113110810518265

Running the same cell now (on my local machine) gives:

Learning rate with the minimum loss: 0.012022644281387329
Learning rate with the steepest gradient: 0.010964781977236271

Expected behavior

  • Ideally, the results per run should be the same (a seeding sketch is shown below).
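
A minimal sketch of pinning the randomness before the call, assuming fastai's set_seed helper (the seed value is arbitrary):

from fastai.torch_core import set_seed

set_seed(42)  # pin the python/numpy/torch RNGs so repeated runs start from the same state
lr_min, lr_steep = learn.lr_find()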

Thanks to the maintainers and contributors for lowering the barrier to AI!

@ekdnam changed the title from "Different results after running learn.lr_find()" to "Different results after running lr_find() at different times" on Apr 3, 2021
@tmabraham (Contributor)

Are the LR finder curves vastly different? This could be just due to randomness from one machine to another...

@ekdnam (Author) commented Apr 11, 2021

Yes. They were very different.

@tcapelle (Contributor) commented Apr 12, 2021

Related to #2892 and #3013.

@hamelsmu added the bug label May 1, 2021
@warner-benjamin (Collaborator)

I investigated this issue, and it does not appear to be related to #2892, as the optimizer is restored correctly after lr_find. It is, however, the same issue as #3013.

I have uploaded my investigation as a gist here. I have annotated my examples with the notebook headings they were run under.

Restating the Issue

Liberally using with no_random() to maintain a reproducible state, I'd expect the following code (Trial 1):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()

with no_random():
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]
to have the same result as this code (Trial 2):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]
However, the results differ despite using with no_random().
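
For reference, no_random is fastai's context manager that (roughly) snapshots the RNG states, seeds everything inside the block, and restores the previous states on exit. A simplified sketch of that behavior (the real helper also covers the CUDA RNGs and the cudnn deterministic/benchmark flags):

import random
from contextlib import contextmanager

import numpy as np
import torch

@contextmanager
def no_random(seed=42):
    # Simplified: snapshot the python/numpy/torch RNG states, seed them inside
    # the block, then restore the saved states on exit
    py_state, np_state, torch_state = random.getstate(), np.random.get_state(), torch.get_rng_state()
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
    try:
        yield
    finally:
        random.setstate(py_state)
        np.random.set_state(np_state)
        torch.set_rng_state(torch_state)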

Recreating the dataloader does result in the same training output (Trial 3):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()

with no_random():
    dls = get_dls(192, False, 64)
    learn.dls = dls

with no_random():
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]
and so does Trial 4 (not shown here) with less use of no_random.

Solution 1: Save and Restore Dataloader

Modify LRFinder.before_fit and LRFinder.after_fit to save and restore the dataloader:

    def before_fit(self):
        super().before_fit()
        self.learn.save('_tmp')
        self.old_dls = deepcopy(self.learn.dls) # Save the current dataloaders so they can be restored after lr_find
        self.best_loss = float('inf')

    def after_fit(self):
        self.learn.opt.zero_grad() # Needed before detaching the optimizer for future fits
        tmp_f = self.path/self.model_dir/'_tmp.pth'
        if tmp_f.exists():
            self.learn.load('_tmp', with_opt=True)
            os.remove(tmp_f)
        self.learn.dls = self.old_dls # Restore the saved dataloaders

This fixes the issue of Trial 1 resulting in different training than Trial 2. Trial 5 is shown below:

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()

with no_random():
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]
However, calling lr_find still changes the random state, so the following code results in different training output (Trial 7):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]
This solution would still require the user to manually reset the random state between lr_find and training.
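
One way to do that, as a sketch (set_seed is fastai's seeding helper; the seed value is arbitrary):

with no_random():
    learn.lr_find()
    set_seed(42) # Manually reset the random state so training starts from a known state
    learn.fit_one_cycle(2, 3e-3)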

Solution 2: Save and Restore Both Dataloader & Random State

This leads to the second possible solution: modifying LRFinder.before_fit and LRFinder.after_fit to save and restore both the dataloader and the random state:

    def before_fit(self):
        super().before_fit()
        self.learn.save('_tmp')
        self.old_dls = deepcopy(self.learn.dls)
        self.states = get_random_states() # Save the random state so it can be restored after lr_find
        self.best_loss = float('inf')

    def after_fit(self):
        self.learn.opt.zero_grad() # Needed before detaching the optimizer for future fits
        tmp_f = self.path/self.model_dir/'_tmp.pth'
        if tmp_f.exists():
            self.learn.load('_tmp', with_opt=True)
            os.remove(tmp_f)
        self.learn.dls = self.old_dls
        set_random_states(**self.states) # Restore the saved random state

Then training the model directly after calling lr_find results in the same training output as training without lr_find (Trial 10):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]

Potential Issue with These Solutions

While these changes will resolve the issue of lr_find affecting training, they will limit lr_find to the same images in the dataloader, which will result in less variation in the results returned by lr_find when it is called multiple times without no_random. With the random state restored as well, the primary remaining difference between repeated lr_find calls appears to come from CUDA not being set to deterministic mode.

Allowing users to control the restoring of the dataloaders and random state via a restore_state option on lr_find would resolve this potential issue (a sketch of how the flag could gate the save/restore logic is shown below). Thoughts on whether it should default to True or False?
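
As a sketch (the restore_state name and plumbing here are a proposal, not existing fastai API; the rest mirrors Solution 2):

    def __init__(self, start_lr=1e-7, end_lr=10, num_it=100, stop_div=True, restore_state=True):
        self.restore_state = restore_state
        # ... rest of the existing LRFinder.__init__ unchanged ...

    def before_fit(self):
        super().before_fit()
        self.learn.save('_tmp')
        if self.restore_state:
            self.old_dls = deepcopy(self.learn.dls)
            self.states = get_random_states()
        self.best_loss = float('inf')

    def after_fit(self):
        self.learn.opt.zero_grad() # Needed before detaching the optimizer for future fits
        tmp_f = self.path/self.model_dir/'_tmp.pth'
        if tmp_f.exists():
            self.learn.load('_tmp', with_opt=True)
            os.remove(tmp_f)
        if self.restore_state:
            self.learn.dls = self.old_dls
            set_random_states(**self.states)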

@muellerzr (Contributor)

I'd be open to the restore_state option (or something along those lines); I think that's a good approach. What I'm not sure of, again, is whether it should default to True or False. Personally I'd leave it as False, maintaining the original behavior while also putting in an explicit fix for those that want it.

@jph00 what are your thoughts here?

@jph00 (Member) commented Oct 26, 2021

Personally I don't think there's anything to fix here - it's behaving as I'd expect it to behave. There is random state, and doing any training will change that state. Personally I don't see the point of having restore_state. If folks really need this functionality, it's easy enough to create a modified LR find class, or to just manually save and restore state.
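
For example, a manual save/restore around lr_find might look roughly like this (a sketch; get_random_states / set_random_states are the same fastai helpers used above, and the import path is an assumption):

from copy import deepcopy
from fastai.torch_core import get_random_states, set_random_states # assumed location of these helpers

states, old_dls = get_random_states(), deepcopy(learn.dls)
learn.lr_find()
learn.dls = old_dls         # put back the saved dataloaders
set_random_states(**states) # put back the saved random state
learn.fit_one_cycle(2, 3e-3)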

@jph00 closed this as completed Oct 26, 2021