Skip to content

SCR checkpoint engine#2972

Open
adammoody wants to merge 10 commits intodeepspeedai:masterfrom
adammoody:screngine
Open

SCR checkpoint engine#2972
adammoody wants to merge 10 commits intodeepspeedai:masterfrom
adammoody:screngine

Conversation

@adammoody
Copy link
Contributor

@adammoody adammoody commented Mar 8, 2023

The existing checkpoint engine model is a close fit for plugging in the Scalable Checkpoint / Restart library, which like Nebula, caches checkpoint files in node-local storage. There are two extensions that could be added that would help. I wanted to open this PR as a means for discussion.

  1. Delegate checkpoint directory creation to CheckpoinEngine

Could we move directory creation into the checkpoint engine, e.g., maybe add a new checkpoint_engine.makedirs() that could fill in for os.makedirs() calls like this one?

https://github.com/microsoft/DeepSpeed/blob/58a4a4d4c19bda86d489ac171fa10f3ddb27c9d6/deepspeed/runtime/pipe/module.py#L590

In some situations, it's useful to delay directory creation until the files are copied. In particular, some checkpoint libraries do not copy every checkpoint, in which case, it's good to avoid creating directories at all.

I opened a separate PR regarding this first request: #2988

  1. Define new open/close calls in CheckpointEngine for reading a checkpoint

Just as the checkpoint sequence has "start" and "end" operations in the form of create() and commit(), it would be useful to have start and end operations for the restart sequence. Maybe new functions like checkpoint_engine.open() and checkpoint_engine.close()?

One would then call checkpoint_engine.load() in between open() and close(). This makes it easier for the library to allow the application to read from temporary paths. For the existing engines, one could move some of the logic from load() to open(). In the case of SCR, one must read all checkpoint files between calls to scr.start_restart() and scr.complete_restart().

A PR adding open and close can be previewed at #3009

With the above two extensions to the CheckpointEngine, this current PR reduces to something like adammoody#5 so that SCR-specific code is essentially fully contained to its engine file.

@loadams
Copy link
Collaborator

loadams commented Oct 9, 2023

Hi @adammoody - apologies for the very slow reply on this. Is this a PR that you would still like to see merged? If so, would you be able to - when you have time - merge the latest master and then we can review it?

Let us know, if not we can close this, but we will get the right people to review it after it is updated. Thanks!

@adammoody
Copy link
Contributor Author

Thanks, @loadams . Actually, before getting into details on this PR, there is a related PR to add "open/close" operations to the checkpoint engine:

#3009

Could we start with that one?

It also needs to be refreshed, but it would be help if folks could consider the ideas there and let me know whether the approach sounds reasonable.

@adammoody adammoody force-pushed the screngine branch 2 times, most recently from 0e01e76 to 5cbd8de Compare December 18, 2023 02:20
@adammoody adammoody requested a review from mrwyattii as a code owner March 26, 2024 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants