Different results after running lr_find() at different times #3295
Comments
Are the LR finder curves vastly different? This could just be due to randomness from one machine to another.
Yes. They were very different.

I investigated this issue and it does not appear to be related to #2892, as the optimizer is restored correctly after lr_find(). I have uploaded my investigation as a gist here, annotated with the notebook headings the examples were run under.

Restating the Issue

Liberally using no_random() (a rough sketch of what it does appears after the trials below), calling lr_find() before training (Trial 1):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()

with no_random():
    learn.fit_one_cycle(2, 3e-3)

produces different training results than fitting the same model without calling lr_find() first (Trial 2):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.fit_one_cycle(2, 3e-3)
Recreating the dataloader does result in the same training output (Trial 3):

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()

with no_random():
    dls = get_dls(192, False, 64)
    learn.dls = dls

with no_random():
    learn.fit_one_cycle(2, 3e-3)
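For reference, here is a minimal sketch of what a no_random() helper along these lines does. This is an approximation for readers who have not used it, not fastai's exact implementation, and the default seed is an assumption:

import contextlib
import random

import numpy as np
import torch

@contextlib.contextmanager
def no_random(seed=42):
    # Capture the current global RNG states so they can be restored on exit
    py_state = random.getstate()
    np_state = np.random.get_state()
    torch_state = torch.get_rng_state()
    # Seed everything so the wrapped block runs reproducibly
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    try:
        yield
    finally:
        # Restore the previous global RNG states
        random.setstate(py_state)
        np.random.set_state(np_state)
        torch.set_rng_state(torch_state)

One limitation worth noting: a context manager like this only saves and restores global RNG state. Objects that carry their own internal random state (fastai DataLoaders keep their own rng, for instance) are unaffected, which appears to be why restoring or recreating the dataloaders matters in the trials above.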
Solution 1: Save and Restore Dataloader

Modify LRFinder.before_fit and LRFinder.after_fit to save a copy of the learner's dataloaders and restore it afterwards:

def before_fit(self):
    super().before_fit()
    self.learn.save('_tmp')
    self.old_dls = deepcopy(self.learn.dls)  # keep a copy of the dataloaders
    self.best_loss = float('inf')

def after_fit(self):
    self.learn.opt.zero_grad()  # Needed before detaching the optimizer for future fits
    tmp_f = self.path/self.model_dir/'_tmp.pth'
    if tmp_f.exists():
        self.learn.load('_tmp', with_opt=True)
        os.remove(tmp_f)
    self.learn.dls = self.old_dls  # restore the original dataloaders

This fixes the issue of Trial 1 resulting in different training than Trial 2. Trial 5 is shown below:

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()

with no_random():
    learn.fit_one_cycle(2, 3e-3)

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)

with no_random():
    learn.lr_find()
    learn.fit_one_cycle(2, 3e-3)
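For anyone who wants to try Solution 1 without editing the library source, a rough sketch using fastcore's @patch decorator might look like the following. The explicit super(LRFinder, self) call is needed because zero-argument super() does not work inside a patched function; treat this as a sketch, not a tested patch:

import os
from copy import deepcopy

from fastcore.basics import patch
from fastai.callback.schedule import LRFinder

@patch
def before_fit(self: LRFinder):
    super(LRFinder, self).before_fit()       # call the parent callback's before_fit
    self.learn.save('_tmp')                  # checkpoint model and optimizer state
    self.old_dls = deepcopy(self.learn.dls)  # keep a copy of the dataloaders
    self.best_loss = float('inf')

@patch
def after_fit(self: LRFinder):
    self.learn.opt.zero_grad()               # needed before detaching the optimizer
    tmp_f = self.path/self.model_dir/'_tmp.pth'
    if tmp_f.exists():
        self.learn.load('_tmp', with_opt=True)
        os.remove(tmp_f)
    self.learn.dls = self.old_dls            # restore the original dataloaders

After running these patches in a session, learn.lr_find() followed by learn.fit_one_cycle(...) should behave as in the Trial 5 code above.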
Solution 2: Save and Restore Both Dataloader & Random State

Which leads to the second possible solution: modifying LRFinder.before_fit and LRFinder.after_fit to save and restore both the dataloaders and the random state:

def before_fit(self):
    super().before_fit()
    self.learn.save('_tmp')
    self.old_dls = deepcopy(self.learn.dls)  # keep a copy of the dataloaders
    self.states = get_random_states()        # capture the current RNG states
    self.best_loss = float('inf')

def after_fit(self):
    self.learn.opt.zero_grad()  # Needed before detaching the optimizer for future fits
    tmp_f = self.path/self.model_dir/'_tmp.pth'
    if tmp_f.exists():
        self.learn.load('_tmp', with_opt=True)
        os.remove(tmp_f)
    self.learn.dls = self.old_dls  # restore the original dataloaders
    set_random_states(**self.states)  # restore the RNG states captured in before_fit
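For readers who have not seen them, here is a rough sketch of what get_random_states / set_random_states helpers along these lines would do. The key names and the CUDA handling here are assumptions for the sketch, not necessarily fastai's implementation:

import random

import numpy as np
import torch

def get_random_states():
    # Capture the global RNG states (Python, NumPy, PyTorch CPU and CUDA)
    return {
        'random_state': random.getstate(),
        'numpy_state': np.random.get_state(),
        'torch_state': torch.get_rng_state(),
        'cuda_states': [torch.cuda.get_rng_state(d) for d in range(torch.cuda.device_count())],
    }

def set_random_states(random_state, numpy_state, torch_state, cuda_states):
    # Restore the states captured by get_random_states
    random.setstate(random_state)
    np.random.set_state(numpy_state)
    torch.set_rng_state(torch_state)
    for d, state in enumerate(cuda_states):
        torch.cuda.set_rng_state(state, d)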
Then training the model directly after calling lr_find, inside the same no_random() block, now gives the same training results:

with no_random():
    dls = get_dls(192, False, 64)
    learn = Learner(dls, xresnet18(n_out=dls.c), metrics=accuracy)
with no_random():
    learn.lr_find()
    learn.fit_one_cycle(2, 3e-3)

Potential Issue with Solution

While these changes would resolve the issue of lr_find changing the results of subsequent training, there is a potential issue with always doing so. Allowing users to control the restoring of dataloaders and random state via an argument could address this; a hypothetical sketch of that idea follows.
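As an illustration of that option, building on the helpers sketched above, something like the following could work. The restore_state name and this structure are assumptions for discussion, not fastai's actual API:

from copy import deepcopy

from fastai.callback.schedule import LRFinder

class StatefulLRFinder(LRFinder):
    "LRFinder variant that can optionally restore dataloaders and RNG state."
    def __init__(self, restore_state=True, **kwargs):
        super().__init__(**kwargs)           # pass start_lr, end_lr, etc. through
        self.restore_state = restore_state

    def before_fit(self):
        if self.restore_state:
            self.old_dls = deepcopy(self.learn.dls)  # copy of the dataloaders
            self.states = get_random_states()        # RNG snapshot (helper sketched above)
        super().before_fit()

    def after_fit(self):
        super().after_fit()
        if self.restore_state:
            self.learn.dls = self.old_dls
            set_random_states(**self.states)

A variant of lr_find that attaches this callback instead of the stock LRFinder would then expose restore_state to users.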
I'd be open to either approach. @jph00 what are your thoughts here?
Personally I don't think there's anything to fix here - it's behaving as I'd expect it to behave. There is random state, and doing any training will change that state. Personally I don't see the point of having lr_find restore it.
Libraries
Describe the bug
Running learn.lr_find() leads to different results per run.

To Reproduce
Steps to reproduce the behavior: run learn.lr_find() in a cell, then run the same cell again. Running the same cell again (on my local machine) gives a noticeably different curve and suggested learning rate. An illustrative sketch of such a reproduction follows.
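A minimal reproduction along these lines (the dataset, model, and learner setup here are placeholders for the sketch, not the original poster's code):

from fastai.vision.all import *

# Illustrative setup; the original report used the poster's own data and model.
path = untar_data(URLs.MNIST_SAMPLE)
dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, metrics=accuracy)

print(learn.lr_find())  # first run
print(learn.lr_find())  # running the same call again typically gives a different suggestion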
Expected behavior
I expected lr_find() to give consistent results across runs on the same learner and data.
Thanks to the maintainers and contributors for lowering the barrier to AI!