
lr_find() may fail if run in parallel from the same directory #3240

Closed
brunoasm opened this issue Mar 3, 2021 · 0 comments · Fixed by #3528
brunoasm commented Mar 3, 2021

Hi,

I am using fastai to fit several models on more or less the same data. To speed up the work, I am running the fits in parallel and using lr_find() to automatically find a decent learning rate.

I found that sometimes, but not always, I get the following error when calling a custom function that does the fitting:

Traceback (most recent call last):
  File "07.1-fit_one_model.py", line 162, in <module>
    res = fit_one(sp_filtered.copy(), 
  File "07.1-fit_one_model.py", line 79, in fit_one
    lr = learn.lr_find()
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/callback/schedule.py", line 222, in lr_find
    with self.no_logging(): self.fit(n_epoch, cbs=cb)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 162, in _with_events
    self(f'after_{event_type}');  final()
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 141, in __call__
    def __call__(self, event_name): L(event_name).map(self._call_one)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastcore/foundation.py", line 154, in map
    def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastcore/basics.py", line 666, in map_ex
    return list(res)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastcore/basics.py", line 651, in __call__
    return self.func(*fargs, **kwargs)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 145, in _call_one
    for cb in self.cbs.sorted('order'): cb(event_name)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/callback/core.py", line 44, in __call__
    if self.run and _run: res = getattr(self, event_name, noop)()
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/callback/schedule.py", line 193, in after_fit
    os.remove(tmp_f)
FileNotFoundError: [Errno 2] No such file or directory: 'models/_tmp.pth'

I believe this can be reproduced whenever lr_find() is run in parallel from the same folder; depending on the exact timing of the different runs, the error shows up sometimes but not always.
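
For reference, I think a minimal reproduction would look roughly like the sketch below (untested; the toy Learner built from TensorDatasets is just a stand-in for my real models, and any Learner should trigger the same race):

import torch
from torch import nn
from torch.utils.data import TensorDataset
from multiprocessing import Process
from fastai.data.core import DataLoaders
from fastai.learner import Learner
from fastai.callback.schedule import *  # patches lr_find onto Learner

def make_learner():
    # Toy binary-classification data and model, just to have something to fit
    x = torch.randn(256, 4)
    y = (x.sum(1, keepdim=True) > 0).float()
    ds = TensorDataset(x, y)
    dls = DataLoaders.from_dsets(ds, ds, bs=32)
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    return Learner(dls, model, loss_func=nn.BCEWithLogitsLoss())

def run_lr_find():
    learn = make_learner()
    learn.lr_find()  # each process writes and then removes models/_tmp.pth

if __name__ == '__main__':
    # Two workers start lr_find() at about the same time from the same
    # working directory; one of them may hit FileNotFoundError in after_fit
    procs = [Process(target=run_lr_find) for _ in range(2)]
    for p in procs: p.start()
    for p in procs: p.join()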

Looking at the source code of LRFinder, the problem seems to be that it saves a temporary file with a fixed name relative to the working directory (models/_tmp.pth) and then deletes it in its after_fit() method. So if two jobs happen to run lr_find() at the same time from the same folder, they share the same temporary file, and whichever finishes first deletes it out from under the other. A possible workaround is of course not to run these computations in parallel from the same folder, but maybe the library could be updated to let several instances of lr_find() run simultaneously from the same folder, or at least document how to set the path for temporary training files so conflicts can be avoided.
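
For now I am working around it with something along these lines (a rough sketch; safe_lr_find is just a helper name I made up), pointing each process at its own model directory so the temporary checkpoint is not shared:

import os, tempfile
from pathlib import Path

def safe_lr_find(learn):
    # Give this Learner a per-process model directory so LRFinder's temporary
    # checkpoint (_tmp.pth) is not shared with other jobs in the same folder
    old_model_dir = learn.model_dir
    learn.model_dir = Path(tempfile.mkdtemp(prefix=f'lr_find_{os.getpid()}_'))
    try:
        return learn.lr_find()
    finally:
        learn.model_dir = old_model_dir  # restore the original setting

This assumes that save()/load() honour model_dir when it is set to an absolute path, which, as far as I can tell from the source, they do, since the checkpoint path is built as self.path/self.model_dir.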
