Skip to content

ckpt: add open/close for reading checkpoint#3009

Open
adammoody wants to merge 8 commits intodeepspeedai:masterfrom
adammoody:ckptread2
Open

ckpt: add open/close for reading checkpoint#3009
adammoody wants to merge 8 commits intodeepspeedai:masterfrom
adammoody:ckptread2

Conversation

@adammoody
Copy link
Contributor

@adammoody adammoody commented Mar 13, 2023

For consideration, this adds open() and close() calls to the CheckpointEngine to serve as start and end markers during a restart. Similar to how create() and commit() define the start and end markers when writing a checkpoint, these new calls are useful for checkpoint engines while reading a checkpoint.

The open() function can be used by the checkpoint engine to find and prepare a checkpoint for reading. The caller may specify a directory in load_dir and a checkpoint name in tag. If tag == None or "latest", the checkpoint engine loads the most recent checkpoint that it can find. Otherwise, the checkpoint engine attempts to load the checkpoint named in tag. open() returns the tag value of the checkpoint that it actually loaded, or None if it fails to find a checkpoint.

The close() call can be used to free any resources that were allocated by the checkpoint engine during open().

All load() calls that read checkpoint files should be placed between open() and close() bookends.

Implementation details and TODOs:

  • This moves the processing for the latest file to TorchCheckpointEngine. It is written during commit() and read during open(). For now, save_latest is a member variable of the class that is hard coded to be True. To make this dynamic, this could either be a config parameter or an option passed to create(). So that only rank 0 creates the latest file, I pass the rank of each process in the TorchCheckpointEngine constructor. If it's acceptable to use dist within the checkpoint engine, we could alternatively get the rank directly.
  • I took an educated guess at updating Nebula to use open() and close(), but it's not possible for me to test the code. The main change is that the search logic that locates the checkpoint has been moved from load() to open(). In particular, one should review my changes to tag_flag.
  • As an extension, the load_dir parameter that is passed to open() could be made optional, in which case, one could use the current working directory, or load_dir could be passed as a parameter to the checkpoint engine constructor. Similar changes could be made for save_dir. I'm guessing that in most cases, these directory paths do not change during a run.

@adammoody
Copy link
Contributor Author

TODO: the handling of optional params in open() needs to be fixed in case either load_dir or tag is None.

@adammoody
Copy link
Contributor Author

TODO: the handling of optional params in open() needs to be fixed in case either load_dir or tag is None.

Decided to make load_dir required, and updated TorchCheckpointEngine to handle tag=None

@adammoody
Copy link
Contributor Author

@tjruwase , I'd like to get your feedback on this one, too. Do you have some time to review and consider the types of changes suggested in this PR?

@tjruwase
Copy link
Contributor

@adammoody, apologies for the delay on this PR. I will get to it no later than next week. Thanks for your patience.

@tjruwase
Copy link
Contributor

tjruwase commented May 3, 2023

@adammoody, apologies for the delay. This looks good. Thanks!

@adammoody
Copy link
Contributor Author

Great. Thanks, @tjruwase ! There are a few spots in here that I know could use some cleanup / attention.

For one, I could not test the nebula code changes at all, so it'd be good to have someone comb through those in detail and make any necessary changes.

I see there are some conflicts with the main branch. I'll try to refresh the PR soon.

@adammoody
Copy link
Contributor Author

@tjruwase , I'm working to refresh this PR again.

Since I can't test the changes I made to nebula checkpoint engine, is there someone from the team who could help me there?

@adammoody
Copy link
Contributor Author

adammoody commented Nov 3, 2023

Hi, @loadams . If someone is willing to look this over and if the idea is still acceptable, I'll try to get this back up to date. In particular, I'd need help to make sure it doesn't break Nebula (which I can't test) or other checkpoint/restart paths I may have missed.

These open/close hooks are needed for some checkpoint libraries like SCR/VeloC, but they look to be useful for the existing checkpoint engines in torch/Nebula. It requires two new global collectives (open and close), and one must be sure that any load calls are placed in between.

@adammoody
Copy link
Contributor Author

Hi, @loadams . I hope you had a good Thanksgiving!

I could commit some time to this again. However, I'd still need an assist from your side to really verify and test things, assuming the DeepSpeed team agrees with the approach here. If you have a chance, please let me know. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants