Async model checkpoint upload fails when the file is renamed #123

Closed
iakremnev opened this issue Apr 13, 2020 · 6 comments
Labels
bug: Something isn't working

Comments

@iakremnev

Bug

When TrainsLogger is used with Pytorch-Lightning, the latter saves model checkpoints via its _atomic_save function, which renames the file right after calling torch.save. Consequently, the Trains hook on torch.save tries to asynchronously upload the .ckpt.part file, which by then no longer exists under that name, and the upload fails.

The original issue on Pytorch-Lightning Github: Lightning-AI/pytorch-lightning#1466
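
To make the failure mode concrete, here is a minimal sketch that uses only the standard library. It is not the Trains or Pytorch-Lightning code: atomic_save, fake_upload and the on_save callback are hypothetical stand-ins for _atomic_save, the background uploader and the torch.save hook, assuming the hook only records the path and the upload runs later.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fake_upload(path):
    """Hypothetical background uploader: it simply reads the file at `path`."""
    with open(path, "rb") as f:
        return len(f.read())


def atomic_save(data, final_path, on_save):
    """Write to '<final_path>.part', notify the hook, then rename the file."""
    part_path = final_path + ".part"
    with open(part_path, "wb") as f:
        f.write(data)
    # The hook only records the path it saw; the upload itself happens later.
    on_save(part_path)
    # After this rename, nothing exists under the '.part' name any more.
    os.replace(part_path, final_path)


workdir = tempfile.mkdtemp()
final_path = os.path.join(workdir, "model.ckpt")
seen_paths = []
atomic_save(b"weights", final_path, seen_paths.append)

# The deferred upload still points at the '.ckpt.part' path and cannot open it.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fake_upload, seen_paths[0])
    try:
        future.result()
        upload_failed = False
    except FileNotFoundError:
        upload_failed = True
```

In this sketch the upload is submitted only after the rename, so the failure is deterministic. In the real setup it depends on whether the background upload reads the file before or after the rename.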

@iakremnev
Author

https://allegroai-trains.slack.com/archives/CTK20V944/p1586771229061800

I think the best solution is for Trains to create a copy of the file and upload that copy in the background. That said, the name will still end with .part

@bmartinn
Member

Hi @iakremnev,
It seems that renaming or deleting a stored model file before the async upload has completed will cause this error. What we will do is create a temporary copy of the model file, upload it, and then delete the temporary copy once the upload has completed.
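
The approach described above could look roughly like the following sketch. It is not the actual Trains fix: upload_copy_async and the in-memory uploaded dict are hypothetical names used for illustration. The point is that the copy is made synchronously, while the file still exists under its original path, and only the copy is handed to the background thread.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

uploaded = {}  # hypothetical in-memory stand-in for remote storage


def upload_copy_async(pool, path, remote_name):
    """Copy `path` to a temp file now, then upload the copy in the background."""
    fd, tmp_path = tempfile.mkstemp(suffix="_" + os.path.basename(path))
    os.close(fd)
    # Synchronous copy: the source file is guaranteed to still exist here.
    shutil.copy2(path, tmp_path)

    def _upload():
        try:
            with open(tmp_path, "rb") as f:
                uploaded[remote_name] = f.read()
        finally:
            # Delete the temporary copy once the upload is done (or has failed).
            os.remove(tmp_path)
        return tmp_path

    return pool.submit(_upload)


workdir = tempfile.mkdtemp()
part_path = os.path.join(workdir, "model.ckpt.part")
final_path = os.path.join(workdir, "model.ckpt")
with open(part_path, "wb") as f:
    f.write(b"weights")

with ThreadPoolExecutor(max_workers=1) as pool:
    future = upload_copy_async(pool, part_path, "model.ckpt.part")
    # Renaming the original no longer affects the pending upload.
    os.replace(part_path, final_path)
    tmp_used = future.result()
```

The trade-off is an extra copy of every checkpoint on local disk for the duration of the upload, and the uploaded name still ends with .part because it is taken from the path seen at save time.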

@bmartinn added the bug label on Apr 13, 2020
@bmartinn
Member

Hi @iakremnev,

Please see if the latest RC fixes the problem:
pip install trains==0.14.2rc0

@iakremnev
Author

Yes, the issue is solved! Thanks for the quick fix ❤️

@S-aiueo32

@bmartinn Is this really fixed? I'm facing the same issue. I use trains==0.15.1 and trains-server==0.15.1-367 through the TrainsLogger of pytorch-lightning-bolts==0.1.0, as follows:

logger = TrainsLogger(auto_connect_frameworks=True)

@bmartinn
Member

@S-aiueo32 what exactly are you experiencing?
Yes, the issue was fixed, but the TrainsLogger was recently moved to pytorch-lightning-bolts and I lost track of it.
