mitchish LUMI run on its own branch #368
@@ -684,7 +684,7 @@ def _make_optim_state_dict_compatible(
     # This state dict comes in two forms: one where the state keys are integers and one where the
     # keys are fully qualified parameter names. The latter case is easier to deal with here so we
     # first transform the integer key form into the FQN key form.
-    if isinstance(next(iter(optim_state_dict["state"].keys())), int):
+    if isinstance(optim_state_dict["param_groups"][0]["params"][0], int):
I had to make these fixes after trying to load an unsharded checkpoint with an uninitialized optimizer.
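For readers following along, here is a minimal sketch of why the old check presumably breaks in that situation. It assumes that an optimizer which has not yet taken a step reports an empty `"state"` mapping, as a fresh PyTorch optimizer does. The helper names and the example state dicts are hypothetical and only illustrate the two conditions from the diff.

```python
def uses_integer_keys_old(optim_state_dict):
    # Old condition: inspects the first key of "state".
    # Raises StopIteration when "state" is empty.
    return isinstance(next(iter(optim_state_dict["state"].keys())), int)


def uses_integer_keys_new(optim_state_dict):
    # New condition: inspects the first parameter of the first param group,
    # which is present even when no optimizer state has been created yet.
    return isinstance(optim_state_dict["param_groups"][0]["params"][0], int)


# Hypothetical state dict of an optimizer that has not taken a step:
# "state" is empty, but the param groups still list the parameter ids.
fresh_int_keys = {"state": {}, "param_groups": [{"lr": 1e-3, "params": [0, 1]}]}

# Hypothetical state dict keyed by fully qualified parameter names (FQNs).
fresh_fqn_keys = {"state": {}, "param_groups": [{"lr": 1e-3, "params": ["wte.weight"]}]}

try:
    uses_integer_keys_old(fresh_int_keys)
    old_check_failed = False
except StopIteration:
    old_check_failed = True

print("old check failed on empty state:", old_check_failed)
print("new check, integer keys:", uses_integer_keys_new(fresh_int_keys))
print("new check, FQN keys:", uses_integer_keys_new(fresh_fqn_keys))
```

The new condition only relies on `param_groups`, so it gives an answer whether or not any per-parameter state exists.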
--save_interval=10000 \
--save_interval_ephemeral=1000 \
That's a really low save interval? Total number of steps is only like 400k, right?
I made this change to save disk space on LUMI. So we still save every 1k steps, and those get uploaded to S3, but since they are "ephemeral" (locally) they get cleaned up from LUMI.
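A rough sketch of how the two intervals could interact, based only on the description above. The function `checkpoint_kind` is hypothetical and is not the trainer's actual logic; it assumes that a step matching both intervals counts as a permanent checkpoint and that step 0 is never saved.

```python
def checkpoint_kind(step, save_interval=10000, save_interval_ephemeral=1000):
    """Classify a training step under the two save intervals (illustrative only)."""
    if step <= 0:
        return None
    if step % save_interval == 0:
        # Kept on local disk (assumed to take priority over ephemeral).
        return "permanent"
    if step % save_interval_ephemeral == 0:
        # Uploaded, then cleaned up locally by a later save.
        return "ephemeral"
    return None


for step in (500, 1000, 9000, 10000, 11000):
    print(step, checkpoint_kind(step))
```

Under these assumptions, nine out of every ten saved checkpoints are ephemeral, which is where the local disk savings would come from.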
grad_clip_warmup_steps: 1000
grad_clip_warmup_factor: 10.0
Should this also be in the s3 yaml?
Yea, good catch. 8c81318
See also #350.
To restart the run on LUMI:
git checkout mitchish-lumi
git pull
sbatch scripts/lumi/mitch-ish-7b.sh --load_path=$FLASH_DIR/checkpoints/{LAST_RUN_ID}/latest