Training auto resume #154
Conversation
Codecov Report

```diff
@@           Coverage Diff           @@
##              main     #154   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           66        66
  Lines         1905      1905
=========================================
  Hits          1905      1905
```
And in case I want a new experiment based on the state of the last one, I should specify #ENCODER_CKPT and #DECODER_CKPT? @supersergiy
Yes, if you want to resume with a change of learning rate, the easiest way is to load model weights rather than the full state. There are other ways to do it that involve specific modifications of the training regime; if this becomes a problem, we can make a base zetta regime class that does this. However, if you're changing train/val data and not hyperparameters, a full state checkpoint should work. CC Lightning-AI/pytorch-lightning#5339, Lightning-AI/pytorch-lightning#12118
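
A minimal sketch of the two options in standard PyTorch Lightning; `MyRegime`, the GCS paths, and the `lr` hyperparameter are hypothetical illustrations, not the actual zetta regime classes:

```python
import torch
import pytorch_lightning as pl


class MyRegime(pl.LightningModule):
    """Hypothetical stand-in for the actual training regime class."""

    def __init__(self, lr: float = 1e-4):
        super().__init__()
        self.save_hyperparameters()  # makes `lr` restorable and overridable
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def train_dataloader(self):
        ds = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
        return torch.utils.data.DataLoader(ds, batch_size=16)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)


# Option 1: load model weights only and override the learning rate.
# Optimizer/scheduler state and step counters from the old run are discarded.
model = MyRegime.load_from_checkpoint(
    "gs://my-bucket/experiment-1/last.ckpt",  # hypothetical path
    lr=1e-5,  # new hyperparameter for the new experiment
)
pl.Trainer(max_epochs=100).fit(model)

# Option 2: resume the full training state (weights, optimizer, scheduler,
# epoch/step counters). Appropriate when only train/val data changed.
model = MyRegime()
pl.Trainer(max_epochs=100).fit(
    model, ckpt_path="gs://my-bucket/experiment-1/last.ckpt"
)
```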
Re-running the same spec will now automatically resume from the last checkpoint, if present. Unfortunately it wasn't as easy as passing `ckpt_path="last"` to `trainer.fit`, because that is not properly implemented for GCS checkpoints.
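
For reference, a rough sketch of what such a workaround amounts to, assuming checkpoints are written as `last.ckpt` under a GCS prefix and that `gcsfs` is installed so fsspec (and Lightning) can read `gs://` paths; `CKPT_DIR` and `find_last_checkpoint` are hypothetical names, not this PR's actual implementation:

```python
from typing import Optional

import fsspec
import pytorch_lightning as pl


def find_last_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the explicit path of last.ckpt under ckpt_dir, or None."""
    last = f"{ckpt_dir.rstrip('/')}/last.ckpt"
    fs, _, _ = fsspec.get_fs_token_paths(last)
    return last if fs.exists(last) else None


CKPT_DIR = "gs://my-bucket/my-experiment/checkpoints"  # hypothetical prefix

model = MyRegime()  # the hypothetical module from the sketch above
trainer = pl.Trainer(default_root_dir=CKPT_DIR, max_epochs=100)
# Pass a fully resolved GCS path instead of ckpt_path="last", which does not
# resolve correctly for GCS checkpoints; None starts training from scratch.
trainer.fit(model, ckpt_path=find_last_checkpoint(CKPT_DIR))
```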