
New-style checkpointing (again) #307

Merged: epwalsh merged 15 commits from petew/checkpointing into main on Oct 3, 2023
Conversation

@epwalsh (Member) commented Oct 2, 2023

Switches to PyTorch's new recommended checkpointing functionality: torch.distributed.checkpoint.

The benefit of using the new checkpointing module is that we can save a (sharded) checkpoint from world size M and load it at a different world size N, even if M or N is 1, and the total size of the checkpoints should be much smaller than the artifacts from our current sharded checkpointing method.

In order to make this work smoothly on MosaicML or other platforms where there isn't a shared file system between nodes, I had to implement a custom StorageWriter and StorageReader. These classes, RemoteFileSystemWriter and RemoteFileSystemReader respectively, work just like the standard PyTorch FileSystemWriter and FileSystemReader when writing/reading checkpoints to/from a local directory, but they can also write/read to/from cloud storage, which is necessary when nodes don't have access to a shared file system.
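As a rough illustration (not the exact OLMo implementation), such a writer can be built by subclassing PyTorch's FileSystemWriter and uploading the files it has just written; the upload_to argument and the upload() helper below are illustrative assumptions:

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Optional

from torch.distributed.checkpoint import FileSystemWriter


def upload(local_path: Path, remote_url: str) -> None:
    """Placeholder for an S3/GCS/etc. upload helper."""
    ...


class RemoteFileSystemWriter(FileSystemWriter):
    """Sketch: write shard files locally, then mirror them to cloud storage."""

    def __init__(self, path, upload_to: Optional[str] = None, **kwargs):
        super().__init__(path, **kwargs)
        self.upload_to = upload_to.rstrip("/") if upload_to else None

    def write_data(self, plan, planner):
        # Let the base class write the local shard files first.
        fut = super().write_data(plan, planner)
        fut.wait()
        if self.upload_to is not None:
            # Assumption: each WriteResult's storage_data records the relative path
            # of the local file that was written.
            files_to_upload = {wr.storage_data.relative_path for wr in fut.value()}
            with ThreadPoolExecutor() as executor:
                futures = [
                    executor.submit(upload, Path(self.path) / name, f"{self.upload_to}/{name}")
                    for name in files_to_upload
                ]
                for f in as_completed(futures):
                    f.result()  # re-raise any upload errors
        return fut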

These changes are backwards compatible in that we can still load our "old-style" sharded checkpoints so we can resume an existing run after this merges.

@epwalsh (Member Author) commented Oct 2, 2023

Still in "draft" mode because I have yet to test this on LUMI. I will tomorrow when it's back up.

@dirkgr (Member) left a comment:

This format means we'd need a new unsharder, but we can write it in gloo and without hacks. We just spin up 256 ranks in separate processes on a single machine, and have them load a checkpoint, and call the save_unsharded() function.
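A bare-bones skeleton of that approach, with the actual shard-loading and unsharded-saving steps left as placeholders (they are not real APIs here):

import os

import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 256  # the world size the sharded checkpoint was written with


def unshard_worker(rank: int, checkpoint_dir: str, output_dir: str) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=WORLD_SIZE)
    # Placeholder: build the model/trainer, load this rank's shard from
    # checkpoint_dir, then call something like save_unsharded(output_dir).
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(unshard_worker, args=("path/to/sharded", "path/to/unsharded"), nprocs=WORLD_SIZE)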

fut = super().write_data(plan, planner)
if self.upload_to is not None:
    files_to_upload = set()
    for write_result in fut.value():
@dirkgr (Member):

Calling .value() here means the first future needs to be completed at this point. I assume it waits? Wouldn't it be better (and possibly necessary for correctness?) to wait inside the thread pool?


@dirkgr (Member):

Why do they have this system with the futures if they are not using it?

@epwalsh (Member Author):

It's the API for this class. Maybe they had another use case in mind when they designed it, I don't know. I added a .wait() just in case FileSystemWriter changes in a later release. 7c7f6dc
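Presumably that change amounts to something like the following (a sketch of the pattern, not the actual commit):

fut = super().write_data(plan, planner)
fut.wait()  # explicitly block until the future completes before reading its value
if self.upload_to is not None:
    for write_result in fut.value():
        ...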

Comment on lines +72 to +73
for f in as_completed(futures):
    f.result()
@dirkgr (Member):

Wait, this thing is returning futures, but not the futures that are doing the uploading? Is that right? Why?

@epwalsh (Member Author):

Right. The future it returns is a PyTorch Future, not a Future from the Python std lib.

@dirkgr (Member):

😵‍💫
No way to convert one to the other? No benefit of doing so?

@epwalsh (Member Author):

Not that I know of
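For context, the two Future types are separate classes with different APIs, which is why the std-lib as_completed/result machinery can't be applied directly to what write_data returns. A small illustration (not from the PR):

from concurrent.futures import ThreadPoolExecutor

import torch.futures

torch_fut = torch.futures.Future()  # the kind returned by write_data
torch_fut.set_result(42)
print(torch_fut.wait(), torch_fut.value())  # PyTorch Futures use wait()/value()/then()

with ThreadPoolExecutor() as pool:
    py_fut = pool.submit(lambda: 42)  # the kind returned by an executor
    print(py_fut.result())  # std-lib Futures use result()/add_done_callback()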

olmo/train.py Outdated
# Load the model state dict in place.
log.info("Loading model state...")
model_state = {"model": self.fsdp_model.state_dict()}
load_state_dict(model_state, RemoteFileSystemReader(f"{load_path}/model_and_optim"))
@dirkgr (Member):

By the presence of the model_and_optim directory will we know that this is a new-style checkpoint?

@epwalsh (Member Author):

Yes

Comment on lines +380 to +382
# Restoring RNG state isn't necessary and in the case of going from world size 1 to world size N
# we probably don't want every rank to have the exact same RNG state.
del trainer_state["rng"]
@dirkgr (Member):

From a RNG point of view, this means we get a different model when switching world sizes mid-training, I assume?

@epwalsh (Member Author):

Yes, but there's no way to avoid that.

@dirkgr (Member):

As long as the files with the indices reflect what happened, it's all good.

olmo/train.py Outdated

barrier()

def restore_legacy_sharded_checkpoint(self, load_path: PathOrStr):
@dirkgr (Member):

It seems like this is not too much code, but I'd be fine saying we only restore legacy unsharded checkpoints. At least long term. Tomorrow I want to start running immediately on one of those, which hasn't been unsharded yet. So maybe it's good this is here.

@2015aroras (Collaborator) left a comment:

Curious to see how it goes on LUMI

try:
    resource_path(load_path, f"rank{get_global_rank()}.pt")
    legacy_mode = True
except FileNotFoundError:
@2015aroras (Collaborator):

This seems to implicitly assume that FileNotFoundError will be raised if the file passed to resource_path does not exist. However, the else block of resource_path looks like it can be satisfied by a local file that does not exist.

@2015aroras (Collaborator):

Maybe a bit irrelevant, but why not just remove the else block of resource_path? It looks like cached_path will check existence of local files.

@epwalsh (Member Author):

> This seems to implicitly assume that FileNotFoundError will be raised if the file passed to resource_path does not exist. However, the else block of resource_path looks like it can be satisfied by a local file that does not exist.

Good catch. f357b5e

> Maybe a bit irrelevant, but why not just remove the else block of resource_path? It looks like cached_path will check existence of local files.

Technically that would break for local files on Windows, where the path separator is not "/". Not like we'll be training on Windows anyway, but might as well avoid potential bugs when possible.

@2015aroras (Collaborator) commented Oct 3, 2023:

> Technically that would break for local files on Windows, where the path separator is not "/".

I would think that the Python path abstraction would be able to deal with the alternate path separator. The docs do seem to allow forward slashes for Windows paths: https://docs.python.org/3/library/pathlib.html#pathlib.PureWindowsPath.

@epwalsh (Member Author):

@2015aroras you're right. deeb8fb
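For reference, if the else branch were dropped entirely, resource_path could reduce to a thin wrapper around cached_path, which raises FileNotFoundError for missing local files. This is a hedged sketch; the real helper's signature and behavior may differ:

from pathlib import Path

from cached_path import cached_path


def resource_path(folder: str, fname: str) -> Path:
    # cached_path handles local paths, URLs, and cloud storage, and raises
    # FileNotFoundError when a local file does not exist.
    return Path(cached_path(f"{folder}/{fname}"))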

olmo/train.py Outdated
try:
    train_state_dict = torch.load(resource_path(load_path, "other.pt"))  # for backwards compatibility
except FileNotFoundError:
    train_state_dict = torch.load(resource_path(load_path, "train.pt"))
@2015aroras (Collaborator):

nit: Maybe let's try train.pt first, and fall back to the legacy other.pt if needed.
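That is, roughly this reordering of the hunk above (a sketch of the suggestion, using the same torch.load/resource_path calls):

try:
    train_state_dict = torch.load(resource_path(load_path, "train.pt"))
except FileNotFoundError:
    # fall back to the legacy name used by older checkpoints
    train_state_dict = torch.load(resource_path(load_path, "other.pt"))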



@epwalsh (Member Author) commented Oct 3, 2023

For the medium model on LUMI, this decreases the total sharded checkpoint size from 200G to 28G! 😮

@dirkgr (Member) commented Oct 3, 2023

^ @2015aroras, this is why we have to unshard all the checkpoints. Even the ones we already made with the old method.

@epwalsh (Member Author) commented Oct 3, 2023

Saving / loading from LUMI to S3 is pretty quick too for the medium model (1-2 mins).

@epwalsh (Member Author) commented Oct 3, 2023

As of bb16bd7, unsharding a new-style sharded checkpoint is as simple as this:

import torch

from olmo import Olmo, CheckpointType

model = Olmo.from_checkpoint(
    "path/to/sharded/checkpoint",
    device="cpu",  # "cuda" works fine too, and might be faster
    checkpoint_type=CheckpointType.sharded,
)
torch.save(model.state_dict(), "path/to/unsharded/checkpoint/model.pt")

epwalsh marked this pull request as ready for review on October 3, 2023, 22:47
epwalsh merged commit 602968a into main on Oct 3, 2023
10 checks passed
epwalsh deleted the petew/checkpointing branch on October 3, 2023, 22:53