I am using fastai to fit several models on more or less the same data. To speed things up, I run these fits in parallel, and I use lr_find() to automatically find a decent learning rate.
I found that sometimes, but not always, I get the following error when calling a custom function that does the fitting:
Traceback (most recent call last):
  File "07.1-fit_one_model.py", line 162, in <module>
    res = fit_one(sp_filtered.copy(),
  File "07.1-fit_one_model.py", line 79, in fit_one
    lr = learn.lr_find()
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/callback/schedule.py", line 222, in lr_find
    with self.no_logging(): self.fit(n_epoch, cbs=cb)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 211, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 162, in _with_events
    self(f'after_{event_type}'); final()
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 141, in __call__
    def __call__(self, event_name): L(event_name).map(self._call_one)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastcore/foundation.py", line 154, in map
    def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastcore/basics.py", line 666, in map_ex
    return list(res)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastcore/basics.py", line 651, in __call__
    return self.func(*fargs, **kwargs)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/learner.py", line 145, in _call_one
    for cb in self.cbs.sorted('order'): cb(event_name)
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/callback/core.py", line 44, in __call__
    if self.run and _run: res = getattr(self, event_name, noop)()
  File "/n/home08/souzademedeiros/.conda/envs/fastai_v2/lib/python3.8/site-packages/fastai/callback/schedule.py", line 193, in after_fit
    os.remove(tmp_f)
FileNotFoundError: [Errno 2] No such file or directory: 'models/_tmp.pth'
I believe this can be reproduced whenever lr_find() is run in parallel from the same folder; whether the error actually occurs depends on the relative timing of the runs, which is why it happens only sometimes.
Looking at the source code of LRFinder, the problem seems to be that it saves a temporary checkpoint under a fixed name in a fixed folder (in particular models/_tmp.pth) and later deletes it (see the after_fit() method of the LRFinder class). If by chance two jobs run lr_find() at the same time, both instances share that temporary file, and one of them deletes it out from under the other. A possible workaround is of course not to run these computations in parallel from the same folder, but maybe the library could be updated to allow several instances of lr_find() to run simultaneously from the same folder? Or at least the documentation could explain how to set the path for temporary training files so the conflict can be avoided.
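Until something like that lands, a possible workaround is to give each parallel job its own model directory. Below is a minimal sketch, assuming a standard fastai v2 Learner; build_learner is a hypothetical stand-in for however the Learner is actually constructed:

import os

def fit_one(data):
    learn = build_learner(data)  # hypothetical: build your Learner as usual
    # Learner.model_dir defaults to 'models'; making it unique per process
    # means lr_find()'s temporary checkpoint lands in models_<pid>/_tmp.pth
    # instead of the shared models/_tmp.pth.
    learn.model_dir = f'models_{os.getpid()}'
    suggestion = learn.lr_find()
    return learn, suggestion

With this, the os.remove() in after_fit() only ever touches the calling process's own file, so concurrent runs from the same folder no longer race. The per-process directories do need to be cleaned up afterwards.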