Open
Conversation
Collaborator
|
Hi @adammoody - apologies for the very slow reply on this. Is this a PR that you would still like to see merged? If so, would you be able to - when you have time - merge the latest master and then we can review it? Let us know, if not we can close this, but we will get the right people to review it after it is updated. Thanks! |
Contributor
Author
|
Thanks, @loadams . Actually, before getting into details on this PR, there is a related PR to add "open/close" operations to the checkpoint engine: Could we start with that one? It also needs to be refreshed, but it would be help if folks could consider the ideas there and let me know whether the approach sounds reasonable. |
0e01e76 to
5cbd8de
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The existing checkpoint engine model is a close fit for plugging in the Scalable Checkpoint / Restart library, which like Nebula, caches checkpoint files in node-local storage. There are two extensions that could be added that would help. I wanted to open this PR as a means for discussion.
Could we move directory creation into the checkpoint engine, e.g., maybe add a new
checkpoint_engine.makedirs()that could fill in foros.makedirs()calls like this one?https://github.com/microsoft/DeepSpeed/blob/58a4a4d4c19bda86d489ac171fa10f3ddb27c9d6/deepspeed/runtime/pipe/module.py#L590
In some situations, it's useful to delay directory creation until the files are copied. In particular, some checkpoint libraries do not copy every checkpoint, in which case, it's good to avoid creating directories at all.
I opened a separate PR regarding this first request: #2988
Just as the checkpoint sequence has "start" and "end" operations in the form of
create()andcommit(), it would be useful to have start and end operations for the restart sequence. Maybe new functions likecheckpoint_engine.open()andcheckpoint_engine.close()?One would then call
checkpoint_engine.load()in betweenopen()andclose(). This makes it easier for the library to allow the application to read from temporary paths. For the existing engines, one could move some of the logic fromload()toopen(). In the case of SCR, one must read all checkpoint files between calls toscr.start_restart()andscr.complete_restart().A PR adding
openandclosecan be previewed at #3009With the above two extensions to the
CheckpointEngine, this current PR reduces to something like adammoody#5 so that SCR-specific code is essentially fully contained to its engine file.