Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage Cleaner] Unsharding improvements #483

Merged
merged 6 commits into from
Mar 6, 2024

Conversation

2015aroras
Copy link
Contributor

This is the first of a few PRs with refactored versions of changes I have been using locally for checkpoint management. This PR focusses on my changes to the storage cleaner's unsharding functionality:

  1. Add the option to unshard just 1 checkpoint within a run. This is helpful if you want to unshard checkpoints in parallel using something like GNU parallel.
  2. Add the option to delete a sharded checkpoint after it has been unsharded. This stops local file storage from blowing up.
  3. Add support for unsharding checkpoints whose configs are not compatible with the current schema in TrainConfig. This has been done in a hacky way (passing an empty TrainConfig to the checkpointer), relying on the fact that unsharding doesn't use the config at all.

Copy link
Member

@epwalsh epwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one question

for sharded_checkpoint_dir in sharded_checkpoint_directories
if _get_checkpoint_number(sharded_checkpoint_dir) == checkpoint_num
]
assert len(sharded_checkpoint_directories) <= 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this script warn anywhere if there are no matching checkpoints here?

Copy link
Contributor Author

@2015aroras 2015aroras Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will say (as an info log, line 817) there are no matching directories and then will exit gracefully. There will be ~10 lines of logs from the whole program in this scenario, so it won't be a needle in a haystack.

@2015aroras 2015aroras merged commit fd3a57b into main Mar 6, 2024
10 checks passed
@2015aroras 2015aroras deleted the shanea/storage-cleaner-unshard-improvements branch March 6, 2024 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants