ckpt: add open/close for reading checkpoint by adammoody · Pull Request #3009 · deepspeedai/DeepSpeed

adammoody · 2023-03-13T22:00:44Z

For consideration, this adds open() and close() calls to the CheckpointEngine to serve as start and end markers during a restart. Similar to how create() and commit() define the start and end markers when writing a checkpoint, these new calls are useful for checkpoint engines while reading a checkpoint.

The open() function can be used by the checkpoint engine to find and prepare a checkpoint for reading. The caller may specify a directory in load_dir and a checkpoint name in tag. If tag == None or "latest", the checkpoint engine loads the most recent checkpoint that it can find. Otherwise, the checkpoint engine attempts to load the checkpoint named in tag. open() returns the tag value of the checkpoint that it actually loaded, or None if it fails to find a checkpoint.

The close() call can be used to free any resources that were allocated by the checkpoint engine during open().

All load() calls that read checkpoint files should be placed between open() and close() bookends.

Implementation details and TODOs:

This moves the processing for the latest file to TorchCheckpointEngine. It is written during commit() and read during open(). For now, save_latest is a member variable of the class that is hard coded to be True. To make this dynamic, this could either be a config parameter or an option passed to create(). So that only rank 0 creates the latest file, I pass the rank of each process in the TorchCheckpointEngine constructor. If it's acceptable to use dist within the checkpoint engine, we could alternatively get the rank directly.
I took an educated guess at updating Nebula to use open() and close(), but it's not possible for me to test the code. The main change is that the search logic that locates the checkpoint has been moved from load() to open(). In particular, one should review my changes to tag_flag.
As an extension, the load_dir parameter that is passed to open() could be made optional, in which case, one could use the current working directory, or load_dir could be passed as a parameter to the checkpoint engine constructor. Similar changes could be made for save_dir. I'm guessing that in most cases, these directory paths do not change during a run.

adammoody · 2023-03-13T22:13:10Z

TODO: the handling of optional params in open() needs to be fixed in case either load_dir or tag is None.

adammoody · 2023-03-14T20:47:40Z

TODO: the handling of optional params in open() needs to be fixed in case either load_dir or tag is None.

Decided to make load_dir required, and updated TorchCheckpointEngine to handle tag=None

adammoody · 2023-04-18T00:41:48Z

@tjruwase , I'd like to get your feedback on this one, too. Do you have some time to review and consider the types of changes suggested in this PR?

tjruwase · 2023-04-18T03:13:25Z

@adammoody, apologies for the delay on this PR. I will get to it no later than next week. Thanks for your patience.

tjruwase · 2023-05-03T21:33:07Z

@adammoody, apologies for the delay. This looks good. Thanks!

adammoody · 2023-05-03T22:01:19Z

Great. Thanks, @tjruwase ! There are a few spots in here that I know could use some cleanup / attention.

For one, I could not test the nebula code changes at all, so it'd be good to have someone comb through those in detail and make any necessary changes.

I see there are some conflicts with the main branch. I'll try to refresh the PR soon.

adammoody · 2023-06-22T19:28:41Z

@tjruwase , I'm working to refresh this PR again.

Since I can't test the changes I made to nebula checkpoint engine, is there someone from the team who could help me there?

adammoody · 2023-11-03T21:19:57Z

Hi, @loadams . If someone is willing to look this over and if the idea is still acceptable, I'll try to get this back up to date. In particular, I'd need help to make sure it doesn't break Nebula (which I can't test) or other checkpoint/restart paths I may have missed.

These open/close hooks are needed for some checkpoint libraries like SCR/VeloC, but they look to be useful for the existing checkpoint engines in torch/Nebula. It requires two new global collectives (open and close), and one must be sure that any load calls are placed in between.

adammoody · 2023-11-28T05:06:45Z

Hi, @loadams . I hope you had a good Thanksgiving!

I could commit some time to this again. However, I'd still need an assist from your side to really verify and test things, assuming the DeepSpeed team agrees with the approach here. If you have a chance, please let me know. Thanks.

adammoody requested review from jeffra and tjruwase as code owners March 13, 2023 22:00

adammoody mentioned this pull request Mar 13, 2023

SCR checkpoint engine #2972

Open

adammoody force-pushed the ckptread2 branch from e98d546 to a98c414 Compare March 14, 2023 20:54

tjruwase approved these changes May 3, 2023

View reviewed changes

HeyangQin and others added 3 commits June 21, 2023 11:51

zero++ tutorial PR (deepspeedai#3783)

df1859d

ckpt: add open/close for reading checkpoint

2ffeefa

ckpt: make load_dir required, update README

d7bc406

adammoody force-pushed the ckptread2 branch from a98c414 to d7bc406 Compare June 21, 2023 19:37

formatting

b25b1a0

add save_dir to save_16bit_model checkpoint create

5fd670f

jeffra force-pushed the master branch from f6e2e38 to bafaf3c Compare June 23, 2023 21:31

jeffra requested a review from mrwyattii as a code owner June 23, 2023 21:31

Merge branch 'master' into ckptread2

f366a82

adammoody mentioned this pull request Nov 3, 2023

ckpt: add open and close methods to checkpoint engine #4608

Closed

ckpt: fix open/close merge conflict

a7da1ac

Merge branch 'master' into ckptread2

5034afc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ckpt: add open/close for reading checkpoint#3009

ckpt: add open/close for reading checkpoint#3009
adammoody wants to merge 8 commits intodeepspeedai:masterfrom
adammoody:ckptread2

adammoody commented Mar 13, 2023 •

edited

Loading

Uh oh!

adammoody commented Mar 13, 2023

Uh oh!

adammoody commented Mar 14, 2023

Uh oh!

adammoody commented Apr 18, 2023

Uh oh!

tjruwase commented Apr 18, 2023

Uh oh!

tjruwase commented May 3, 2023

Uh oh!

adammoody commented May 3, 2023

Uh oh!

adammoody commented Jun 22, 2023

Uh oh!

adammoody commented Nov 3, 2023 •

edited

Loading

Uh oh!

adammoody commented Nov 28, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

adammoody commented Mar 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adammoody commented Mar 13, 2023

Uh oh!

adammoody commented Mar 14, 2023

Uh oh!

adammoody commented Apr 18, 2023

Uh oh!

tjruwase commented Apr 18, 2023

Uh oh!

tjruwase commented May 3, 2023

Uh oh!

adammoody commented May 3, 2023

Uh oh!

adammoody commented Jun 22, 2023

Uh oh!

adammoody commented Nov 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adammoody commented Nov 28, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

adammoody commented Mar 13, 2023 •

edited

Loading

adammoody commented Nov 3, 2023 •

edited

Loading