mitchish LUMI run on its own branch #368

Merged: epwalsh merged 29 commits into main from mitchish-lumi on Jan 9, 2024

epwalsh (Member) commented Nov 7, 2023

See also #350.

To restart the run on LUMI:

git checkout mitchish-lumi
git pull
sbatch scripts/lumi/mitch-ish-7b.sh --load_path=$FLASH_DIR/checkpoints/{LAST_RUN_ID}/latest

@epwalsh epwalsh marked this pull request as ready for review January 5, 2024 18:53
@@ -684,7 +684,7 @@ def _make_optim_state_dict_compatible(
 # This state dict comes in two forms: one where the state keys are integers and one where the
 # keys are fully qualified parameter names. The latter case is easier to deal with here so we
 # first transform the integer key form into the FQN key form.
-if isinstance(next(iter(optim_state_dict["state"].keys())), int):
+if isinstance(optim_state_dict["param_groups"][0]["params"][0], int):
Member Author

I had to make these fixes after trying to load an unsharded checkpoint with an uninitialized optimizer.
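
For context, here is a minimal sketch of why the old check is fragile. It relies on the fact that a PyTorch optimizer that has not yet taken a step has an empty "state" mapping while its "param_groups" are already populated. The helper name `uses_integer_param_ids`, the toy state dicts, and the parameter name in them are made up for illustration and are not the repository's code.

```python
def uses_integer_param_ids(optim_state_dict):
    """Return True when the state dict refers to parameters by integer ID."""
    # Assumption: the first param group lists at least one parameter.
    return isinstance(optim_state_dict["param_groups"][0]["params"][0], int)


# Toy state dict for an optimizer that has never stepped: "state" is empty.
uninitialized = {"state": {}, "param_groups": [{"lr": 1e-3, "params": [0, 1]}]}

# Toy state dict in the fully qualified name (FQN) form; the name is hypothetical.
fqn_form = {"state": {}, "param_groups": [{"lr": 1e-3, "params": ["layer.weight"]}]}

# The old check peeks at the first key of "state", which fails on an empty mapping.
try:
    isinstance(next(iter(uninitialized["state"].keys())), int)
    old_check_failed = False
except StopIteration:
    old_check_failed = True

print(old_check_failed)
print(uses_integer_param_ids(uninitialized))
print(uses_integer_param_ids(fqn_form))
```

Reading the key type from "param_groups" works whether or not any optimizer state has been created yet.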

@epwalsh epwalsh requested a review from dirkgr January 5, 2024 18:55
Comment on lines +67 to +68
--save_interval=10000 \
--save_interval_ephemeral=1000 \
Member

That's a really low save interval? Total number of steps is only like 400k, right?

Member Author

I made this change to save disk space on LUMI. So we still save every 1k steps, and those get uploaded to S3, but since they are "ephemeral" (locally) they get cleaned up from LUMI.
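
The resulting schedule can be sketched roughly as follows. This is an illustration under stated assumptions, not the trainer's actual logic: it assumes a step that matches both intervals counts as a permanent save, and that only the most recent ephemeral checkpoint is kept on local disk (`keep_ephemeral=1` is a made-up parameter). The function names are hypothetical.

```python
def checkpoint_kind(step, save_interval=10_000, save_interval_ephemeral=1_000):
    """Classify a training step as a permanent save, an ephemeral save, or neither."""
    if step > 0 and step % save_interval == 0:
        return "permanent"
    if step > 0 and step % save_interval_ephemeral == 0:
        return "ephemeral"
    return None


def local_checkpoints_after(last_step, keep_ephemeral=1):
    """Simulate which checkpoint steps remain on local disk after `last_step`."""
    permanent, ephemeral = [], []
    for step in range(1, last_step + 1):
        kind = checkpoint_kind(step)
        if kind == "permanent":
            permanent.append(step)
        elif kind == "ephemeral":
            ephemeral.append(step)
            # Assumed cleanup policy: drop the oldest ephemeral checkpoints locally.
            while len(ephemeral) > keep_ephemeral:
                ephemeral.pop(0)
    return permanent, ephemeral


print(local_checkpoints_after(12_000))
```

Under these assumptions, local disk usage stays bounded by the permanent checkpoints plus a fixed number of ephemeral ones, while every 1k-step checkpoint can still exist remotely.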

Comment on lines +55 to +56
grad_clip_warmup_steps: 1000
grad_clip_warmup_factor: 10.0
Collaborator

Should this also be in the s3 yaml?

Member Author

Yea, good catch. 8c81318
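
The thread does not spell out what the two options do. One plausible reading, sketched below purely as an assumption, is that the gradient clipping threshold is multiplied by `grad_clip_warmup_factor` for the first `grad_clip_warmup_steps` steps and then returns to its base value. The function name and the base `max_grad_norm=1.0` are hypothetical.

```python
def max_grad_norm_at(step, max_grad_norm=1.0, warmup_steps=1000, warmup_factor=10.0):
    """Return the clipping threshold at `step` under the assumed warmup rule."""
    # Assumption: clipping is loosened by `warmup_factor` during warmup only.
    if step <= warmup_steps:
        return max_grad_norm * warmup_factor
    return max_grad_norm


print(max_grad_norm_at(500))
print(max_grad_norm_at(2000))
```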

@epwalsh epwalsh merged commit a2e1d13 into main Jan 9, 2024
10 checks passed
@epwalsh epwalsh deleted the mitchish-lumi branch January 9, 2024 03:02
3 participants