Training auto resume #154

Merged: 3 commits merged into main on Dec 8, 2022
Conversation

supersergiy (Member)

Re-running the same spec will now automatically resume from the last checkpoint, if present.

Unfortunately it wasn't as easy as passing ckpt_path="last" to trainer.fit, because that option is not properly implemented for GCS checkpoints.
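For illustration, a minimal sketch of the resume logic, assuming checkpoints live under a known GCS prefix and ModelCheckpoint keeps an up-to-date last.ckpt there; fit_with_auto_resume and ckpt_dir are illustrative names, not the actual zetta_utils interface:

```python
# Hedged sketch only: resolve the last checkpoint on GCS ourselves instead of
# relying on ckpt_path="last". Names here (fit_with_auto_resume, ckpt_dir) are
# illustrative placeholders.
import fsspec
from pytorch_lightning import Trainer


def fit_with_auto_resume(trainer: Trainer, model, datamodule, ckpt_dir: str) -> None:
    """Resume from <ckpt_dir>/last.ckpt when it exists, otherwise start fresh."""
    last_ckpt = f"{ckpt_dir.rstrip('/')}/last.ckpt"
    fs, _ = fsspec.core.url_to_fs(last_ckpt)  # handles gs:// paths via gcsfs
    ckpt_path = last_ckpt if fs.exists(last_ckpt) else None
    trainer.fit(model, datamodule=datamodule, ckpt_path=ckpt_path)
```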


codecov bot commented Dec 8, 2022

Codecov Report

Merging #154 (36e7507) into main (2b75dce) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##              main      #154   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           66        66           
  Lines         1905      1905           
=========================================
  Hits          1905      1905           


supersergiy self-assigned this on Dec 8, 2022
supersergiy added the enhancement (New feature or request) label on Dec 8, 2022
nkemnitz (Collaborator) left a comment

And in case I want a new experiment based on the state of the last one, I should specify #ENCODER_CKPT and #DECODER_CKPT?

nkemnitz (Collaborator) commented Dec 8, 2022

@supersergiy Is the last checkpoint always the end of an epoch, or is that name also used for checkpoints in the middle of an epoch? I saw a warning when I explicitly tried to restart in the middle of an epoch: https://pytorch-lightning.readthedocs.io/en/stable/clouds/fault_tolerant_training_basic.html

supersergiy (Member, Author) commented Dec 8, 2022

Yes, if you want to resume with a change of learning rate, the easiest way is to load the model weights rather than the full training state. There are other ways to do it that involve specific modifications to the training regime; if this becomes a problem, we can make a base zetta regime class that handles it. However, if you're changing the train/val data and not the hyperparameters, a full-state checkpoint should work.
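For concreteness, a hedged sketch of the two resume styles; the TinyRegime module, learning rates, and checkpoint paths below are hypothetical stand-ins, not anything from this repo:

```python
# Sketch contrasting full-state resume vs. weights-only resume.
import torch
from torch import nn
import pytorch_lightning as pl


class TinyRegime(pl.LightningModule):  # hypothetical stand-in for a real training regime
    def __init__(self, lr: float = 1e-4):
        super().__init__()
        self.lr = lr
        self.model = nn.Linear(8, 8)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


# 1) Full-state resume: optimizer state, schedulers, and epoch counters come
#    back, so a learning-rate change in the spec is overridden by the checkpoint.
# trainer.fit(TinyRegime(), datamodule=dm, ckpt_path="gs://bucket/run/last.ckpt")

# 2) Weights-only resume: fresh optimizer and learning rate, old weights.
regime = TinyRegime(lr=1e-5)
state = torch.load("last.ckpt", map_location="cpu")["state_dict"]  # hypothetical local path
regime.load_state_dict(state, strict=False)
# trainer.fit(regime, datamodule=dm)  # no ckpt_path -> fresh training state
```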

CC Lightning-AI/pytorch-lightning#5339 Lightning-AI/pytorch-lightning#12118

The last checkpoint is not the end of the last epoch, but the most recent checkpoint written out. I think the warning is caused by the fact that the checkpoint doesn't record its position within the epoch, since it saves nothing about the dataset state. That means training is resumed from the beginning of the epoch, so some samples are going to be used more often than others. This is a minor concern in our use case, and resuming only from epoch starts would be impractical.
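As a reference point, a checkpoint callback along these lines is what produces mid-epoch "last" checkpoints; the dirpath and step interval are illustrative, not the project's actual values:

```python
# Illustrative ModelCheckpoint setup; dirpath and the step interval are made up.
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="gs://my-bucket/my-run/checkpoints",  # hypothetical GCS prefix
    save_last=True,            # keep an always-current "last.ckpt"
    every_n_train_steps=500,   # checkpoints mid-epoch, not only at epoch end
)
# The dataloader position is not part of the checkpoint, so resuming from a
# mid-epoch "last.ckpt" replays the current epoch from its start.
```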

supersergiy merged commit 1e567b8 into main on Dec 8, 2022
supersergiy deleted the sergiy/auto_resume_training branch on January 18, 2023