
Fix PP checkpoint bloat #1324

Merged
tjruwase merged 1 commit into big-science from olruwase/pp_checkpoint_bloat
Aug 25, 2021

Conversation

@tjruwase
Contributor

Use tensor cloning to avoid checkpoint bloat resulting from a bad interaction between torch.save() and PP.

See here for more details of the relevant torch.save() behavior.

@stas00, FYI
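For context, the torch.save() behavior at issue is that a tensor is serialized together with its entire underlying storage. A pipeline-parallel engine that hands out small views into a large flat buffer will therefore write the whole buffer once per view unless each tensor is cloned first. A minimal sketch of the effect and the cloning fix (the variable names are illustrative, not taken from the PR):

```python
import io

import torch


def saved_bytes(obj):
    """Serialize obj with torch.save and return the byte count."""
    buf = io.BytesIO()
    torch.save(obj, buf)
    return buf.getbuffer().nbytes


# A large flat buffer, standing in for a fused parameter/optimizer buffer.
flat = torch.zeros(1_000_000)  # ~4 MB of float32 storage

# A small view: 16 elements, but it shares flat's entire storage.
param = flat[:16]

bloated = saved_bytes({"w": param})          # serializes the full ~4 MB storage
compact = saved_bytes({"w": param.clone()})  # clone owns only its 16 elements
```

Saving the view costs roughly the size of the whole buffer, while saving the clone costs only the view's own data, which is exactly why cloning before torch.save() removes the bloat.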

@jeffra (Collaborator) left a comment


This is great, thanks Tunji!

@stas00
Collaborator

stas00 commented Aug 25, 2021

Thanks a lot, Tunji! I validated that it works.

Please don't forget to replay it into the master branch too!

@tjruwase tjruwase merged commit 72ce55a into big-science Aug 25, 2021
@tjruwase tjruwase deleted the olruwase/pp_checkpoint_bloat branch August 25, 2021 17:31
@stas00
Collaborator

stas00 commented Aug 25, 2021

If you need to fix bloated checkpoints created before @tjruwase's fix here, you can use this script to do the work for you:
https://github.com/stas00/toolbox/blob/master/pytorch/pt-checkpoint-shrink.py

4 participants