Adds a YAML validator to automatically find the last checkpoint #348
I will use this all the time just for the ability to find the latest checkpoint regardless of whether it is sharded or not. This is an ability the symlinks do not have.
    if not checkpoint_name.startswith("step"):
        continue
    try:
        step = int(checkpoint_name.replace("step", "").replace("-unsharded", ""))
Between sharded and unsharded checkpoints, this will always pick the latest one?
Latest is always picked, but as of 3a38d9a, if there's a sharded and an unsharded checkpoint at the same step, the sharded checkpoint will be prioritized.
Can we stick an example somewhere prominent, like in a comment at the end of a run script or something?
Do you have any ideas? This doesn't really help with kempner/lumi configs since we have the symlinks.
olmo/util.py
Outdated
    elif parsed.scheme == "file":
        return find_latest_checkpoint(str(dir).replace("file://", "", 1))
    else:
        raise NotImplementedError(f"file size not implemented for '{parsed.scheme}' files")
nit: change "file size" to "find latest checkpoint"
Oops, good catch.
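For reference, the dispatch under discussion with the corrected error message might look roughly like the sketch below. It is a simplified illustration and not the code in olmo/util.py: the function name latest_checkpoint_for and the local_finder parameter are hypothetical, and the remote branches (such as s3://) that precede the "file" branch in the real diff are left out.

```python
from typing import Callable, Optional
from urllib.parse import urlparse


def latest_checkpoint_for(
    location: str, local_finder: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Dispatch on the URL scheme of `location` (hypothetical helper).

    `local_finder` stands in for a function that scans a local directory.
    """
    parsed = urlparse(location)
    if parsed.scheme == "":
        # Plain local path, no scheme to strip.
        return local_finder(location)
    elif parsed.scheme == "file":
        # Strip the leading "file://" once, as in the diff above.
        return local_finder(location.replace("file://", "", 1))
    else:
        # Error message now names the operation that is actually unsupported.
        raise NotImplementedError(
            f"find latest checkpoint not implemented for '{parsed.scheme}' files"
        )
```

Passing the local finder in as a parameter is only a convenience for keeping the sketch self-contained; the PR calls find_latest_checkpoint directly.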
olmo/util.py
Outdated
        step = int(path.name.replace("step", "").replace("-unsharded", ""))
    except ValueError:
        continue
    if step > latest_step:
Instead of relying on sorting and using a comment about prioritizing sharded checkpoints, you could be more direct and use a condition like step > latest_step or (step == latest_step and not path.name.endswith("-unsharded"))
Done
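To make the suggested condition concrete, here is a minimal sketch of a local-directory scan that uses it. It assumes checkpoint directories are named step<N> or step<N>-unsharded, as the snippets in this thread imply; the exact signature and return type are assumptions and may differ from the function in olmo/util.py.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the newest step<N> or step<N>-unsharded directory, or None."""
    latest_step = -1
    latest_path: Optional[Path] = None
    for path in Path(checkpoint_dir).glob("step*"):
        if not path.is_dir():
            continue
        try:
            step = int(path.name.replace("step", "").replace("-unsharded", ""))
        except ValueError:
            continue  # name starts with "step" but has no integer step count
        # Take a strictly newer step; on a tie, take the sharded checkpoint.
        if step > latest_step or (
            step == latest_step and not path.name.endswith("-unsharded")
        ):
            latest_step = step
            latest_path = path
    return latest_path
```

Because the tie-break is part of the comparison, the result does not depend on the order in which the directory entries are visited.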
Adds an OmegaConf YAML validator called path.last_checkpoint for automatically finding the latest checkpoint in a local or remote directory. For example:

    python scripts/train.py configs/v1_5-mix-medium-mitch-ish-s3.yaml ... \
        --load_path='${path.last_checkpoint:s3://ai2-llm/checkpoints/7b/v1_5-mix-mitch-ish}'

As a result, the latest checkpoint gets resolved while loading the config, so the config that's saved with the checkpoints and the W&B run will have --load_path set to the resolved path.