Adds a YAML validator to automatically find the last checkpoint #348

Merged: 5 commits merged on Oct 31, 2023

epwalsh
Member

@epwalsh epwalsh commented Oct 31, 2023

Adds an OmegaConf YAML validator called path.last_checkpoint that automatically finds the latest checkpoint in a local or remote directory. For example:

python scripts/train.py configs/v1_5-mix-medium-mitch-ish-s3.yaml ... \
  --load_path='${path.last_checkpoint:s3://ai2-llm/checkpoints/7b/v1_5-mix-mitch-ish}'

As a result, the latest checkpoint is resolved while the config is being loaded, so the config that's saved with the checkpoints and the W&B run will have --load_path set to the resolved path.

@epwalsh epwalsh changed the title from "Add a YAML validator to automatically find the last checkpoint" to "Adds a YAML validator to automatically find the last checkpoint" on Oct 31, 2023
Member

@dirkgr dirkgr left a comment

I will use this all the time just for the ability to find the latest checkpoint regardless of whether it is sharded or not. This is an ability the symlinks do not have.

if not checkpoint_name.startswith("step"):
    continue
try:
    step = int(checkpoint_name.replace("step", "").replace("-unsharded", ""))
Member

Between sharded and unsharded checkpoints, this will always pick the latest one?

Member Author

Latest is always picked, but as of 3a38d9a, if there's a sharded and an unsharded checkpoint at the same step, the sharded checkpoint will be prioritized.

@dirkgr
Member

dirkgr commented Oct 31, 2023

Can we stick an example somewhere prominent, like in a comment at the end of a run script or something?

@epwalsh
Member Author

epwalsh commented Oct 31, 2023

> Can we stick an example somewhere prominent, like in a comment at the end of a run script or something?

Do you have any ideas? This doesn't really help with kempner/lumi configs since we have the symlinks.

olmo/util.py Outdated
elif parsed.scheme == "file":
    return find_latest_checkpoint(str(dir).replace("file://", "", 1))
else:
    raise NotImplementedError(f"file size not implemented for '{parsed.scheme}' files")
Collaborator

nit: change "file size" to "find latest checkpoint" in the error message

Member Author

Oops, good catch.
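
For reference, the scheme dispatch discussed in this thread could look roughly like the sketch below, with the error message reworded as the nit suggests. This is an assumption-laden illustration, not the merged code: resolve_local_dir is a hypothetical helper name, and the remote branches (for example S3) are deliberately left out because their helpers are not shown in this conversation.

```python
from urllib.parse import urlparse


def resolve_local_dir(dir: str) -> str:
    """Return a local filesystem path for a plain path or a file:// URL.

    Hypothetical helper: any other scheme raises, since remote handling
    is not part of this sketch.
    """
    parsed = urlparse(dir)
    if parsed.scheme == "":
        # Plain local path, nothing to strip.
        return dir
    elif parsed.scheme == "file":
        # Drop only the leading "file://" prefix.
        return dir.replace("file://", "", 1)
    else:
        raise NotImplementedError(
            f"find latest checkpoint not implemented for '{parsed.scheme}' files"
        )


print(resolve_local_dir("file:///tmp/checkpoints"))
```

Naming the operation in the message means an unsupported scheme fails with an error that points at the checkpoint lookup rather than at an unrelated file-size helper.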

olmo/util.py Outdated
    step = int(path.name.replace("step", "").replace("-unsharded", ""))
except ValueError:
    continue
if step > latest_step:
Collaborator

Instead of relying on sorting and using a comment about prioritizing sharded checkpoints, you could be more direct and use a condition like step > latest_step or (step == latest_step and not path.name.endswith("-unsharded"))

Member Author

Done
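
A minimal local-directory sketch of the suggested condition follows, assuming checkpoint directories are named stepN or stepN-unsharded as in the snippets above. The function body is an illustration written for this thread, not the merged implementation, and the temporary directory names are made up for the demo.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    latest_step = -1
    latest: Optional[Path] = None
    for path in Path(dir).glob("step*"):
        if not path.is_dir():
            continue
        try:
            step = int(path.name.replace("step", "").replace("-unsharded", ""))
        except ValueError:
            # Not a stepN or stepN-unsharded directory, skip it.
            continue
        # Higher step wins; on a tie, the sharded checkpoint wins.
        if step > latest_step or (
            step == latest_step and not path.name.endswith("-unsharded")
        ):
            latest_step = step
            latest = path
    return latest


with tempfile.TemporaryDirectory() as tmp:
    for name in ("step100", "step200-unsharded", "step200", "stepfoo"):
        (Path(tmp) / name).mkdir()
    found = find_latest_checkpoint(tmp)
    latest_name = found.name if found is not None else None

print(latest_name)  # prints: step200
```

Because the tie-break lives in the comparison itself, the result does not depend on the order in which the directory entries are listed.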

@epwalsh epwalsh merged commit fd2425f into main Oct 31, 2023
9 of 10 checks passed
@epwalsh epwalsh deleted the epwalsh/auto-find-last-checkpoint branch October 31, 2023 23:07