
Different results after running lr_find() at different times #3295

Closed
ekdnam opened this issue Apr 3, 2021 · 6 comments

@ekdnam commented Apr 3, 2021

Libraries

  • fastai: 2.1.10
  • fastcore: 1.3.19
  • torch: 1.8.1+cpu
  • sklearn: 0.23.2

Describe the bug
Running learn.lr_find() leads to different results per run.

To Reproduce
Steps to reproduce the behavior:

  1. The code is shown here.
  2. Run till this cell:
from fastai.callback import schedule

lr_min, lr_steep = learn.lr_find()

print('Learning rate with the minimum loss:', lr_min)
print('Learning rate with the steepest gradient:', lr_steep)
  3. Earlier, the results were (as given in the notebook):
Learning rate with the minimum loss: 0.006918309628963471
Learning rate with the steepest gradient: 0.033113110810518265

Running the same cell now (on my local machine) gives:

Learning rate with the minimum loss: 0.012022644281387329
Learning rate with the steepest gradient: 0.010964781977236271

Expected behavior

  • Ideally, the results per run should be the same (a seeding sketch is shown below).
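
A minimal sketch of pinning the randomness before the call, assuming fastai's set_seed helper (the seed value is arbitrary):

from fastai.torch_core import set_seed

set_seed(42)  # pin the python/numpy/torch RNGs so repeated runs start from the same state
lr_min, lr_steep = learn.lr_find()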

Thanks to the maintainers and contributors for lowering the barrier to AI!

@ekdnam changed the title from "Different results after running learn.lr_find()" to "Different results after running lr_find() at different times" on Apr 3, 2021
@tmabraham (Contributor)

Are the LR finder curves vastly different? This could be just due to randomness from one machine to another...

@ekdnam (Author) commented Apr 11, 2021

Yes. They were very different.

@tcapelle (Contributor) commented Apr 12, 2021

Related to #2892 and #3013.

@hamelsmu added the bug label May 1, 2021
@warner-benjamin (Collaborator)

I investigated this issue, and it does not appear to be related to #2892, as the optimizer is restored correctly after lr_find. It is, however, the same issue as #3013.

I have uploaded my investigation as a gist here. I have annotated my examples with the notebook headings they were run under.

Restating the Issue

Liberally using with no_random() to maintain a reproducible state, I'd expect the following code (Trial 1):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()

with no_random():
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]
to have the same result as this code (Trial 2):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]
However, the results differ despite using with no_random().
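
For reference, no_random is fastai's context manager that (roughly) snapshots the RNG states, seeds everything inside the block, and restores the previous states on exit. A simplified sketch of that behavior (the real helper also covers the CUDA RNGs and the cudnn deterministic/benchmark flags):

import random
from contextlib import contextmanager

import numpy as np
import torch

@contextmanager
def no_random(seed=42):
    # Simplified: snapshot the python/numpy/torch RNG states, seed them inside
    # the block, then restore the saved states on exit
    py_state, np_state, torch_state = random.getstate(), np.random.get_state(), torch.get_rng_state()
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
    try:
        yield
    finally:
        random.setstate(py_state)
        np.random.set_state(np_state)
        torch.set_rng_state(torch_state)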

Recreating the dataloader does result in the same training output (Trial 3):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()

with no_random():
    dls = get_dls(192, False, 64)
    learn.dls = dls

with no_random():
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]
and so does Trial 4 (not shown here) with less use of no_random.

Solution 1: Save and Restore Dataloader

Modify LRFinder.before_fit and LRFinder.after_fit to save and restore the dataloader:

    def before_fit(self):
        super().before_fit()
        self.learn.save('_tmp')
        self.old_dls = deepcopy(self.learn.dls) # Save the current dataloaders so they can be restored after lr_find
        self.best_loss = float('inf')

    def after_fit(self):
        self.learn.opt.zero_grad() # Needed before detaching the optimizer for future fits
        tmp_f = self.path/self.model_dir/'_tmp.pth'
        if tmp_f.exists():
            self.learn.load('_tmp', with_opt=True)
            os.remove(tmp_f)
        self.learn.dls = self.old_dls # Restore the saved dataloaders

This fixes the issue of Trial 1 resulting in different training than Trial 2. Trial 5 is shown below:

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()

with no_random():
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]
However, calling lr_find still changes the random state, so the following code results in different training output (Trial 7):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]
This solution would still require the user to manually reset the random state between lr_find and training.
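
One way to do that, as a sketch (set_seed is fastai's seeding helper; the seed value is arbitrary):

with no_random():
    learn.lr_find()
    set_seed(42) # Manually reset the random state so training starts from a known state
    learn.fit_one_cycle(2, 3e-3)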

Solution 2: Save and Restore Both Dataloader & Random State

This leads to the second possible solution: modifying LRFinder.before_fit and LRFinder.after_fit to save and restore both the dataloader and the random state:

    def before_fit(self):
        super().before_fit()
        self.learn.save('_tmp')
        self.old_dls = deepcopy(self.learn.dls)
        self.states = get_random_states() # Save the random state so it can be restored after lr_find
        self.best_loss = float('inf')

    def after_fit(self):
        self.learn.opt.zero_grad() # Needed before detaching the optimizer for future fits
        tmp_f = self.path/self.model_dir/'_tmp.pth'
        if tmp_f.exists():
            self.learn.load('_tmp', with_opt=True)
            os.remove(tmp_f)
        self.learn.dls = self.old_dls
        set_random_states(**self.states) # Restore the saved random state

Then training the model directly after calling lr_find results in the same training output as training without lr_find (Trial 10):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()
    learn.fit_one_cycle(2, 3e-3)

[training output screenshot]

Potential Issue with These Solutions

While these changes will resolve the issue of lr_find affecting training, they will limit lr_find to the same images in the dataloader, which will result in less variation in the results returned by lr_find when it is called multiple times without no_random. With the random state restored as well, the primary remaining difference between repeated lr_find calls appears to come from CUDA not being set to deterministic mode.

Allowing users to control the restoring of the dataloaders and random state via a restore_state option on lr_find would resolve this potential issue (a sketch of how the flag could gate the save/restore logic is shown below). Thoughts on whether it should default to True or False?
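
As a sketch (the restore_state name and plumbing here are a proposal, not existing fastai API; the rest mirrors Solution 2):

    def __init__(self, start_lr=1e-7, end_lr=10, num_it=100, stop_div=True, restore_state=True):
        self.restore_state = restore_state
        # ... rest of the existing LRFinder.__init__ unchanged ...

    def before_fit(self):
        super().before_fit()
        self.learn.save('_tmp')
        if self.restore_state:
            self.old_dls = deepcopy(self.learn.dls)
            self.states = get_random_states()
        self.best_loss = float('inf')

    def after_fit(self):
        self.learn.opt.zero_grad() # Needed before detaching the optimizer for future fits
        tmp_f = self.path/self.model_dir/'_tmp.pth'
        if tmp_f.exists():
            self.learn.load('_tmp', with_opt=True)
            os.remove(tmp_f)
        if self.restore_state:
            self.learn.dls = self.old_dls
            set_random_states(**self.states)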

@muellerzr (Contributor)

I'd be open to the restore_state option (or something along those lines); I think that's a good approach. What I'm not sure of, again, is whether it should default to True or False. Personally I'd leave it as False, maintaining the original behavior while also putting in an explicit fix for those that want it.

@jph00 what are your thoughts here?

@jph00 (Member) commented Oct 26, 2021

Personally I don't think there's anything to fix here - it's behaving as I'd expect it to behave. There is random state, and doing any training will change that state. Personally I don't see the point of having restore_state. If folks really need this functionality, it's easy enough to create a modified LR find class, or to just manually save and restore state.
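
For example, a manual save/restore around lr_find might look roughly like this (a sketch; get_random_states / set_random_states are the same fastai helpers used above, and the import path is an assumption):

from copy import deepcopy
from fastai.torch_core import get_random_states, set_random_states # assumed location of these helpers

states, old_dls = get_random_states(), deepcopy(learn.dls)
learn.lr_find()
learn.dls = old_dls         # put back the saved dataloaders
set_random_states(**states) # put back the saved random state
learn.fit_one_cycle(2, 3e-3)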

@jph00 closed this as completed Oct 26, 2021