SaveModelCallback every nth epoch #3375
Merged
Minor modification to allow saving a model every nth epoch by passing an integer to the existing `every_epoch` parameter. The main motivation is to reduce the disk space occupied by saved models, especially when using `with_opt=True` during long runs. For example, a 300-epoch training run can create 100+ GB of checkpoint files, while most people are fine saving every 10/20/50 epochs during training. Also added a test with a synthetic learner that trains for 4 epochs and saves every 2nd epoch, starting from epoch 0.
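The core of the change is just a modulo check on the epoch counter. A minimal, self-contained sketch of the idea (the function name and signature here are hypothetical, not the actual fastai implementation):

```python
def should_save(epoch: int, every_epoch) -> bool:
    """Decide whether to write a checkpoint at this epoch.

    `every_epoch` may be a bool (original behaviour: True saves every
    epoch) or an int n (new behaviour: save every nth epoch, starting
    from epoch 0).
    """
    # bool must be checked first, since bool is a subclass of int
    if isinstance(every_epoch, bool):
        return every_epoch
    return epoch % every_epoch == 0

# Mirrors the added test: a 4-epoch run saving every 2nd epoch
saved = [e for e in range(4) if should_save(e, 2)]
print(saved)  # [0, 2]
```

With `every_epoch=2`, epochs 0 and 2 are saved, matching the "every 2nd epoch starting from epoch 0" behaviour described above.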
P.S. If desired, we can also extend this callback to save every nth iteration. This would be particularly useful when the training set is large and users would like to save checkpoints without waiting for a full epoch to complete. I made a similar modification while training CLIP with 40M image-text pairs, simply by adding something like:
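The author's original snippet is not included above. As a rough illustration only, an every-nth-iteration check could be sketched as below; all class and method names are hypothetical stand-ins, not fastai's actual callback API:

```python
class SaveEveryNIterations:
    """Hypothetical callback that saves a checkpoint every nth batch."""

    def __init__(self, n_iterations: int):
        self.n = n_iterations
        self.iter_count = 0
        self.saved_at = []  # iterations at which a checkpoint was written

    def after_batch(self):
        # Called once per training batch; save every nth iteration.
        self.iter_count += 1
        if self.iter_count % self.n == 0:
            self.save_checkpoint()

    def save_checkpoint(self):
        # Stand-in for the real save call (e.g. learner.save(...));
        # here we just record when a save would have happened.
        self.saved_at.append(self.iter_count)


# Simulate 2500 batches with a save interval of 1000 iterations
cb = SaveEveryNIterations(1000)
for _ in range(2500):
    cb.after_batch()
print(cb.saved_at)  # [1000, 2000]
```

The same modulo pattern as the epoch-based version, just driven by a batch counter instead of the epoch number.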