
lr_find() may fail if run in parallel from the same directory #3240

Closed
brunoasm opened this issue Mar 3, 2021 · 0 comments · Fixed by #3528
brunoasm commented Mar 3, 2021

Hi,

I am using fastai to fit several models on more or less the same data. To speed up the work, I am running the fits in parallel and using lr_find() to automatically find a decent learning rate.

I found that sometimes, but not always, I get the following error when calling a custom function that does the fitting:

Traceback (most recent call last):
  File "07.1-fit_one_model.py", line 162, in <module>
    res = fit_one(sp_filtered.copy(), 
  File "07.1-fit_one_model.py", line 79, in fit_one
    lr = learn.lr_find()
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/callback/schedule.py", line 222, in lr_find
    with self.no_logging(): self.fit(n_epoch, cbs=cb)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 162, in _with_events
    self(f'after_{event_type}');  final()
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 141, in __call__
    def __call__(self, event_name): L(event_name).map(self._call_one)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastcore/foundation.py", line 154, in map
    def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastcore/basics.py", line 666, in map_ex
    return list(res)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastcore/basics.py", line 651, in __call__
    return self.func(*fargs, **kwargs)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 145, in _call_one
    for cb in self.cbs.sorted('order'): cb(event_name)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/callback/core.py", line 44, in __call__
    if self.run and _run: res = getattr(self, event_name, noop)()
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/callback/schedule.py", line 193, in after_fit
    os.remove(tmp_f)
FileNotFoundError: [Errno 2] No such file or directory: 'models/_tmp.pth'

I believe this can be reproduced whenever lr_find() is run in parallel from the same folder; depending on the exact timing of the different runs, the error shows up sometimes but not always.
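
For reference, I think a minimal reproduction would look roughly like the sketch below (untested; the toy Learner built from TensorDatasets is just a stand-in for my real models, and any Learner should trigger the same race):

import torch
from torch import nn
from torch.utils.data import TensorDataset
from multiprocessing import Process
from fastai.data.core import DataLoaders
from fastai.learner import Learner
from fastai.callback.schedule import *  # patches lr_find onto Learner

def make_learner():
    # Toy binary-classification data and model, just to have something to fit
    x = torch.randn(256, 4)
    y = (x.sum(1, keepdim=True) > 0).float()
    ds = TensorDataset(x, y)
    dls = DataLoaders.from_dsets(ds, ds, bs=32)
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    return Learner(dls, model, loss_func=nn.BCEWithLogitsLoss())

def run_lr_find():
    learn = make_learner()
    learn.lr_find()  # each process writes and then removes models/_tmp.pth

if __name__ == '__main__':
    # Two workers start lr_find() at about the same time from the same
    # working directory; one of them may hit FileNotFoundError in after_fit
    procs = [Process(target=run_lr_find) for _ in range(2)]
    for p in procs: p.start()
    for p in procs: p.join()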

Looking at the source code of LRFinder, the problem seems to be that it saves a temporary file with a fixed name relative to the working directory (models/_tmp.pth) and then deletes it in its after_fit() method. So if two jobs happen to run lr_find() at the same time from the same folder, they share the same temporary file, and whichever finishes first deletes it out from under the other. A possible workaround is of course not to run these computations in parallel from the same folder, but maybe the library could be updated to let several instances of lr_find() run simultaneously from the same folder, or at least document how to set the path for temporary training files so conflicts can be avoided.
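
For now I am working around it with something along these lines (a rough sketch; safe_lr_find is just a helper name I made up), pointing each process at its own model directory so the temporary checkpoint is not shared:

import os, tempfile
from pathlib import Path

def safe_lr_find(learn):
    # Give this Learner a per-process model directory so LRFinder's temporary
    # checkpoint (_tmp.pth) is not shared with other jobs in the same folder
    old_model_dir = learn.model_dir
    learn.model_dir = Path(tempfile.mkdtemp(prefix=f'lr_find_{os.getpid()}_'))
    try:
        return learn.lr_find()
    finally:
        learn.model_dir = old_model_dir  # restore the original setting

This assumes that save()/load() honour model_dir when it is set to an absolute path, which, as far as I can tell from the source, they do, since the checkpoint path is built as self.path/self.model_dir.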
