Training auto resume #154

Merged: 3 commits merged into main on Dec 8, 2022
Conversation

supersergiy (Member)

Re-running the same spec will now automatically resume from the last checkpoint, if present.

Unfortunately it wasn't as easy as passing ckpt_path="last" to trainer.fit, because that option is not properly implemented for GCS checkpoints.
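For illustration, a minimal sketch of the resume logic, assuming checkpoints live under a known GCS prefix and ModelCheckpoint keeps an up-to-date last.ckpt there; fit_with_auto_resume and ckpt_dir are illustrative names, not the actual zetta_utils interface:

```python
# Hedged sketch only: resolve the last checkpoint on GCS ourselves instead of
# relying on ckpt_path="last". Names here (fit_with_auto_resume, ckpt_dir) are
# illustrative placeholders.
import fsspec
from pytorch_lightning import Trainer


def fit_with_auto_resume(trainer: Trainer, model, datamodule, ckpt_dir: str) -> None:
    """Resume from <ckpt_dir>/last.ckpt when it exists, otherwise start fresh."""
    last_ckpt = f"{ckpt_dir.rstrip('/')}/last.ckpt"
    fs, _ = fsspec.core.url_to_fs(last_ckpt)  # handles gs:// paths via gcsfs
    ckpt_path = last_ckpt if fs.exists(last_ckpt) else None
    trainer.fit(model, datamodule=datamodule, ckpt_path=ckpt_path)
```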


codecov bot commented Dec 8, 2022

Codecov Report

Merging #154 (36e7507) into main (2b75dce) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##              main      #154   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           66        66           
  Lines         1905      1905           
=========================================
  Hits          1905      1905           


supersergiy self-assigned this on Dec 8, 2022
supersergiy added the enhancement (New feature or request) label on Dec 8, 2022
nkemnitz (Collaborator) left a comment

And in case I want a new experiment based on the state of the last one, I should specify #ENCODER_CKPT and #DECODER_CKPT?

nkemnitz (Collaborator) commented Dec 8, 2022

@supersergiy Is the last checkpoint always the end of an epoch, or is that name also used for checkpoints in the middle of an epoch? I saw a warning when I explicitly tried to restart in the middle of an epoch: https://pytorch-lightning.readthedocs.io/en/stable/clouds/fault_tolerant_training_basic.html

supersergiy (Member, Author) commented Dec 8, 2022

Yes, if you want to resume with a change of learning rate, the easiest way is to load the model weights rather than the full training state. There are other ways to do it that involve specific modifications to the training regime; if this becomes a problem, we can make a base zetta regime class that handles it. However, if you're changing the train/val data and not the hyperparameters, a full-state checkpoint should work.
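For concreteness, a hedged sketch of the two resume styles; the TinyRegime module, learning rates, and checkpoint paths below are hypothetical stand-ins, not anything from this repo:

```python
# Sketch contrasting full-state resume vs. weights-only resume.
import torch
from torch import nn
import pytorch_lightning as pl


class TinyRegime(pl.LightningModule):  # hypothetical stand-in for a real training regime
    def __init__(self, lr: float = 1e-4):
        super().__init__()
        self.lr = lr
        self.model = nn.Linear(8, 8)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


# 1) Full-state resume: optimizer state, schedulers, and epoch counters come
#    back, so a learning-rate change in the spec is overridden by the checkpoint.
# trainer.fit(TinyRegime(), datamodule=dm, ckpt_path="gs://bucket/run/last.ckpt")

# 2) Weights-only resume: fresh optimizer and learning rate, old weights.
regime = TinyRegime(lr=1e-5)
state = torch.load("last.ckpt", map_location="cpu")["state_dict"]  # hypothetical local path
regime.load_state_dict(state, strict=False)
# trainer.fit(regime, datamodule=dm)  # no ckpt_path -> fresh training state
```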

CC Lightning-AI/pytorch-lightning#5339 Lightning-AI/pytorch-lightning#12118

The last checkpoint is not the end of the last epoch, but the most recent checkpoint written out. I think the warning is caused by the fact that the checkpoint doesn't record its position within the epoch, since it saves nothing about the dataset state. That means training is resumed from the beginning of the epoch, so some samples are going to be used more often than others. This is a minor concern in our use case, and resuming only from epoch starts would be impractical.
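As a reference point, a checkpoint callback along these lines is what produces mid-epoch "last" checkpoints; the dirpath and step interval are illustrative, not the project's actual values:

```python
# Illustrative ModelCheckpoint setup; dirpath and the step interval are made up.
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="gs://my-bucket/my-run/checkpoints",  # hypothetical GCS prefix
    save_last=True,            # keep an always-current "last.ckpt"
    every_n_train_steps=500,   # checkpoints mid-epoch, not only at epoch end
)
# The dataloader position is not part of the checkpoint, so resuming from a
# mid-epoch "last.ckpt" replays the current epoch from its start.
```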

supersergiy merged commit 1e567b8 into main on Dec 8, 2022
supersergiy deleted the sergiy/auto_resume_training branch on January 18, 2023