mitchish LUMI run on its own branch #368

epwalsh · 2023-11-07T23:06:23Z

See also #350.

To restart the run on LUMI:

git checkout mitchish-lumi
git pull
sbatch scripts/lumi/mitch-ish-7b.sh --load_path=$FLASH_DIR/checkpoints/{LAST_RUN_ID}/latest

epwalsh · 2024-01-05T18:55:11Z

olmo/checkpoint.py

@@ -684,7 +684,7 @@ def _make_optim_state_dict_compatible(
        # This state dict comes in two forms: one where the state keys are integers and one where the
        # keys are fully qualified parameter names. The latter case is easier to deal with here so we
        # first transform the integer key form into the FQN key form.
-        if isinstance(next(iter(optim_state_dict["state"].keys())), int):
+        if isinstance(optim_state_dict["param_groups"][0]["params"][0], int):


I had to make these fixes after trying to load an unsharded checkpoint with an uninitialized optimizer.
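A minimal sketch of why the old check is fragile, using plain dicts in place of a real optimizer state dict. It relies on one known fact: a PyTorch optimizer that has not taken a step reports an empty `"state"` mapping while still listing its parameters under `"param_groups"`. The helper names `old_check` and `new_check` are hypothetical and only mirror the two lines in the diff; whether this is the exact failure hit in the run is an assumption.

```python
def old_check(optim_state_dict):
    # Mirrors the removed line: inspects the first key of "state".
    return isinstance(next(iter(optim_state_dict["state"].keys())), int)


def new_check(optim_state_dict):
    # Mirrors the added line: inspects the first param of the first group.
    return isinstance(optim_state_dict["param_groups"][0]["params"][0], int)


# Shape of a state dict from an optimizer that has not taken a step yet:
# "state" is empty, but "param_groups" still lists the parameters.
uninitialized = {"state": {}, "param_groups": [{"lr": 1e-3, "params": [0, 1]}]}

# The same optimizer keyed by fully qualified names (names are made up).
fqn_form = {"state": {}, "param_groups": [{"lr": 1e-3, "params": ["wte.weight", "ln_f.weight"]}]}

try:
    old_check(uninitialized)
    old_check_raised = False
except StopIteration:
    # next() on an empty iterator raises StopIteration.
    old_check_raised = True

int_keys = new_check(uninitialized)
fqn_keys = new_check(fqn_form)
```

The new check only depends on `"param_groups"`, which is populated regardless of whether the optimizer has any per-parameter state yet.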

dirkgr · 2024-01-05T19:42:00Z

scripts/lumi/mitch-ish-7b.sh

+      --save_interval=10000 \
+      --save_interval_ephemeral=1000 \


That's a really low save interval? Total number of steps is only like 400k, right?

I made this change to save disk space on LUMI. We still save every 1k steps, and those checkpoints get uploaded to S3, but since they are "ephemeral" locally, they get cleaned up from LUMI.
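To make the two intervals concrete, here is a rough sketch of how a training loop might decide which kind of checkpoint to write at a given step. The function `checkpoint_kind` is hypothetical and is not taken from the PR; it assumes a permanent checkpoint takes precedence when a step is a multiple of both intervals, and it uses the approximate 400k total steps mentioned above.

```python
def checkpoint_kind(step, save_interval=10000, save_interval_ephemeral=1000):
    """Classify a step as "permanent", "ephemeral", or None (no checkpoint)."""
    if step <= 0:
        return None
    # Assumption: the permanent interval wins when both intervals match.
    if step % save_interval == 0:
        return "permanent"
    if step % save_interval_ephemeral == 0:
        return "ephemeral"
    return None


total_steps = 400_000  # approximate figure from the comment above

# Count how many checkpoints of each kind a full run would write.
counts = {"permanent": 0, "ephemeral": 0}
for step in range(1, total_steps + 1):
    kind = checkpoint_kind(step)
    if kind is not None:
        counts[kind] += 1
```

Under these assumptions only the permanent checkpoints accumulate on local disk, while every 1k-step checkpoint still exists remotely once uploaded.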

2015aroras · 2024-01-08T23:27:16Z

configs/v1_5-mix-medium-mitch-ish.yaml

+  grad_clip_warmup_steps: 1000
+  grad_clip_warmup_factor: 10.0


Should this also be in the s3 yaml?

Yeah, good catch. Fixed in 8c81318.
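The PR does not show how these two settings are consumed, so the following is only one plausible reading: during the first `grad_clip_warmup_steps` steps the clipping threshold is loosened by `grad_clip_warmup_factor`, and afterwards the base threshold applies. The function name `effective_max_grad_norm` and the base threshold of 1.0 are assumptions made for illustration.

```python
def effective_max_grad_norm(step, max_grad_norm, warmup_steps=1000, warmup_factor=10.0):
    """Return the clipping threshold to use at `step` (hypothetical semantics)."""
    # Assumption: the threshold is scaled up while step <= warmup_steps.
    if step <= warmup_steps:
        return max_grad_norm * warmup_factor
    return max_grad_norm


base = 1.0  # hypothetical base threshold, not taken from the config
during_warmup = effective_max_grad_norm(step=500, max_grad_norm=base)
after_warmup = effective_max_grad_norm(step=5000, max_grad_norm=base)
```

Whatever the exact semantics, both keys need the same values in every config that describes the same run, which is what commit 8c81318 ("synchronize grad clip warmup between configs") addresses.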

Merge branch 'mitchish' into mitchish-lumi

fdb27da

epwalsh mentioned this pull request Nov 7, 2023

Mitchish mosaic run on its own branch #350

Merged

epwalsh added 27 commits November 7, 2023 15:19

Add grad clip warmup

92a849e

update scripts

7c3532f

update settings for LUMI run

0cfd0ab

switch to local sharded chpts

3b2b15b

set export GPU_MAX_HW_QUEUES=8

79962e7

decrease min time to 12hrs

419f7e6

handle case where FSDP instance doesn't have params

1aa1bad

make more robust

70a0307

update run script

ac3fd52

Merge branch 'epwalsh/threaded-data-loading' into mitchish-lumi

d012452

Merge branch 'epwalsh/threaded-data-loading' into mitchish-lumi

3a9b591

increase min time

94ca56c

Merge branch 'epwalsh/threaded-data-loading' into mitchish-lumi

b378b87

Merge branch 'main' into mitchish-lumi

9dc35b4

set max_duration in tokens

b449731

fix

4e569a0

init process group before logging

12cb296

another fix

09aa457

another fix

4d75dd9

fix merge conflicts

0214ec9

Merge branch 'main' into mitchish-lumi

fd2d329

Merge branch 'main' into mitchish-lumi

cc3504e

updates

d5a26a1

Merge branch 'main' into mitchish-lumi

774f476

Merge branch 'main' into mitchish-lumi

23fb39b

Merge branch 'main' into mitchish-lumi

1a6ee5c

Merge branch 'main' into mitchish-lumi

b320a74

epwalsh marked this pull request as ready for review January 5, 2024 18:53

epwalsh commented Jan 5, 2024

epwalsh requested a review from dirkgr January 5, 2024 18:55

dirkgr reviewed Jan 5, 2024

2015aroras reviewed Jan 8, 2024

synchronize grad clip warmup between configs

8c81318

dirkgr approved these changes Jan 9, 2024

2015aroras approved these changes Jan 9, 2024

epwalsh merged commit a2e1d13 into main Jan 9, 2024
10 checks passed

epwalsh deleted the mitchish-lumi branch January 9, 2024 03:02
