ckpt: add open/close for reading checkpoint #3009
adammoody wants to merge 8 commits into deepspeedai:master from
|
TODO: the handling of optional params in |
Decided to make |
|
@tjruwase , I'd like to get your feedback on this one, too. Do you have some time to review and consider the types of changes suggested in this PR? |
|
@adammoody, apologies for the delay on this PR. I will get to it no later than next week. Thanks for your patience. |
|
@adammoody, apologies for the delay. This looks good. Thanks! |
|
Great. Thanks, @tjruwase ! There are a few spots in here that I know could use some cleanup / attention. For one, I could not test the nebula code changes at all, so it'd be good to have someone comb through those in detail and make any necessary changes. I see there are some conflicts with the main branch. I'll try to refresh the PR soon. |
|
@tjruwase , I'm working to refresh this PR again. Since I can't test the changes I made to nebula checkpoint engine, is there someone from the team who could help me there? |
|
Hi, @loadams . If someone is willing to look this over and if the idea is still acceptable, I'll try to get this back up to date. In particular, I'd need help to make sure it doesn't break Nebula (which I can't test) or other checkpoint/restart paths I may have missed. These open/close hooks are needed for some checkpoint libraries like SCR/VeloC, but they look to be useful for the existing checkpoint engines in torch/Nebula. It requires two new global collectives ( |
|
Hi, @loadams . I hope you had a good Thanksgiving! I could commit some time to this again. However, I'd still need an assist from your side to really verify and test things, assuming the DeepSpeed team agrees with the approach here. If you have a chance, please let me know. Thanks. |
For consideration, this adds `open()` and `close()` calls to the `CheckpointEngine` to serve as start and end markers during a restart. Similar to how `create()` and `commit()` define the start and end markers when writing a checkpoint, these new calls are useful for checkpoint engines while reading a checkpoint.

The `open()` function can be used by the checkpoint engine to find and prepare a checkpoint for reading. The caller may specify a directory in `load_dir` and a checkpoint name in `tag`. If `tag == None` or `"latest"`, the checkpoint engine loads the most recent checkpoint that it can find. Otherwise, the checkpoint engine attempts to load the checkpoint named in `tag`. `open()` returns the `tag` value of the checkpoint that it actually loaded, or `None` if it fails to find a checkpoint.

The `close()` call can be used to free any resources that were allocated by the checkpoint engine during `open()`.

All `load()` calls that read checkpoint files should be placed between `open()` and `close()` bookends.

Implementation details and TODOs:

- […] `latest` file to `TorchCheckpointEngine`. It is written during `commit()` and read during `open()`. For now, `save_latest` is a member variable of the class that is hard-coded to be `True`. To make this dynamic, this could either be a config parameter or an option passed to `create()`. So that only rank 0 creates the `latest` file, I pass the rank of each process in the `TorchCheckpointEngine` constructor. If it's acceptable to use `dist` within the checkpoint engine, we could alternatively get the rank directly.
- […] `open()` and `close()`, but it's not possible for me to test the code. The main change is that the search logic that locates the checkpoint has been moved from `load()` to `open()`. In particular, one should review my changes to `tag_flag`.
- The `load_dir` parameter that is passed to `open()` could be made optional, in which case, one could use the current working directory, or `load_dir` could be passed as a parameter to the checkpoint engine constructor. Similar changes could be made for `save_dir`. I'm guessing that in most cases, these directory paths do not change during a run.
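
To make the proposed call order concrete, here is a minimal, self-contained sketch of the `create()`/`commit()` and `open()`/`close()` bookends with a `latest` file. It is not the code in this PR and not DeepSpeed's API: the class name `SketchCheckpointEngine`, every method signature, the JSON file format, and the `model.json` file name are assumptions chosen for illustration only.

```python
import json
import os
import tempfile
from pathlib import Path


class SketchCheckpointEngine:
    """Toy engine for illustration; all names and signatures are hypothetical."""

    def __init__(self, rank=0, save_latest=True):
        self.rank = rank                # passed in so only rank 0 writes `latest`
        self.save_latest = save_latest  # hard-coded default, as described above
        self._save_dir = None
        self._open_dir = None

    # Write path: create() ... save() ... commit()
    def create(self, save_dir, tag):
        self._save_dir = save_dir
        os.makedirs(os.path.join(save_dir, tag), exist_ok=True)

    def save(self, state, path):
        Path(path).write_text(json.dumps(state))

    def commit(self, tag):
        # Record the newest tag so a later open() can find it.
        if self.save_latest and self.rank == 0:
            Path(self._save_dir, "latest").write_text(tag)
        self._save_dir = None

    # Read path: open() ... load() ... close()
    def open(self, load_dir, tag=None):
        if tag is None or tag == "latest":
            latest = Path(load_dir, "latest")
            if not latest.is_file():
                return None             # no checkpoint has been committed
            tag = latest.read_text().strip()
        ckpt_dir = os.path.join(load_dir, tag)
        if not os.path.isdir(ckpt_dir):
            return None                 # the requested tag does not exist
        self._open_dir = ckpt_dir
        return tag                      # the tag that was actually opened

    def load(self, path):
        if self._open_dir is None:
            raise RuntimeError("load() must be called between open() and close()")
        return json.loads(Path(path).read_text())

    def close(self):
        # Free whatever open() set up; here that is only the directory handle.
        self._open_dir = None


def demo():
    with tempfile.TemporaryDirectory() as root:
        engine = SketchCheckpointEngine(rank=0)
        for step in (1, 2):
            tag = f"global_step{step}"
            engine.create(root, tag)
            engine.save({"step": step}, os.path.join(root, tag, "model.json"))
            engine.commit(tag)

        tag = engine.open(root)         # tag=None, so follow the `latest` file
        state = engine.load(os.path.join(root, tag, "model.json"))
        engine.close()

        missing = engine.open(root, "global_step99")  # unknown tag
        return tag, state, missing
```

In this sketch, `demo()` opens the most recently committed tag when no tag is given, and `open()` returns `None` for a tag that was never written. A real engine would additionally need to agree on the result across ranks, which is outside the scope of this toy example.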