Add scripts for Dave #516

Draft · wants to merge 169 commits into base: main from dave/annealing

Commits (169)
645d320
Add scripts for Dave
epwalsh Mar 21, 2024
f24d098
Merge branch 'main' into dave/annealing
dwadden Mar 22, 2024
a802176
Training config for first annealing run.
dwadden Mar 22, 2024
fcf0ddf
Fix image.
dwadden Mar 23, 2024
4426835
Add config for 1T starting checkpoint.
dwadden Mar 24, 2024
6a0bf66
Load from S3 instead of R2.
dwadden Mar 25, 2024
f6ecec6
Run on 8 nodes instead of 2.
dwadden Mar 25, 2024
0472b29
Kick off annealing run that goes for 50B steps.
Mar 26, 2024
f9d9833
Restart failed run from last checkpoint.
Mar 26, 2024
18982e1
Run on 6 nodes.
dwadden Mar 26, 2024
65d2754
Run on 8 nodes.
dwadden Mar 26, 2024
9641479
Kick off shorter job on 2 nodes.
dwadden Mar 26, 2024
3d9daf1
Annealing 50B on MosaicML
dirkgr Mar 28, 2024
8178fb4
Unclear why it is reading the "default" one? Where does it even get t…
dirkgr Mar 28, 2024
8b0bea9
Make config with no Flan.
dwadden Mar 29, 2024
321a830
Merge branch 'dave/annealing' of https://github.com/allenai/OLMo into…
dwadden Mar 29, 2024
895347b
Merge branch 'main' into dave/annealing
dwadden Mar 29, 2024
f2ce790
Kick off a run with no flan.
dwadden Mar 29, 2024
5c01987
Mcli config for v0-step_1T-warmup_true
dirkgr Mar 30, 2024
393fae3
We got nodes!
dirkgr Mar 30, 2024
d58aa7a
superhigh25
soldni Apr 2, 2024
b9fbe9a
superweb config for mcli
dirkgr Apr 2, 2024
558f5c4
Fix what was probably a copy and paste error?
dirkgr Apr 2, 2024
cad1d42
Save override
dirkgr Apr 2, 2024
6ccd999
overwrite
dirkgr Apr 2, 2024
1fc6a55
Update config to continue the existing job
dirkgr Apr 2, 2024
d8e4f66
Try debugging this job
dirkgr Apr 2, 2024
1d90cb8
Debug logging for torchrun
dirkgr Apr 2, 2024
1300199
strace
dirkgr Apr 2, 2024
1be0448
yes
dirkgr Apr 2, 2024
5ee283e
Debug differently
dirkgr Apr 3, 2024
255f849
We didn't need this after all.
dirkgr Apr 3, 2024
7125997
Annealing for 1T checkpoint of new 7B run.
dwadden Apr 7, 2024
1bec77f
Copy unsharding script from `train-olmo-large`.
dwadden Apr 8, 2024
4ee0089
Kick off annealing run for new 7b model at 2T.
dwadden Apr 9, 2024
5c949e6
Add annealing for new 7b with no flan.
dwadden Apr 9, 2024
83705eb
Change num nodes.
dwadden Apr 9, 2024
af1ee77
Fix typo in filename.
dwadden Apr 9, 2024
0c0e407
Remove extra `/` in S3 path.
dwadden Apr 9, 2024
5e9f85a
Change number of train nodes.
dwadden Apr 10, 2024
6f18656
Run on 8 nodes.
dwadden Apr 10, 2024
025c89b
Add cfg option `--scheduler.warmup_min_lr`
epwalsh Apr 11, 2024
89c6fde
Merge branch 'main' into dave/annealing
dwadden Apr 11, 2024
34a617f
Merge branch 'epwalsh/warmup-min-lr' into dave/annealing
dwadden Apr 11, 2024
ff80bf8
Create new config for 1.7B run.
dwadden Apr 11, 2024
2e0d6cf
Script to launch annealing jobs.
dwadden Apr 12, 2024
6f21b06
Small typo in launch script.
dwadden Apr 12, 2024
cc07bad
Small typo.
dwadden Apr 12, 2024
c28938b
Don't print config info.
dwadden Apr 12, 2024
447f68c
Set the cluster.
dwadden Apr 12, 2024
d4ea229
Update annealing configs for olmo v1.7.
dwadden Apr 12, 2024
e77f0f3
Merge branch 'train-olmo-large' into dave/annealing
dwadden Apr 12, 2024
ada52da
Use legacy checkpointer.
dwadden Apr 12, 2024
d0bf954
Launch annealing run with downweighted flan.
dwadden Apr 14, 2024
1088aad
Launch 2.1T run.
dwadden Apr 14, 2024
1e1616e
Create 100B config for 2T of OLMo 1.7.
dwadden Apr 15, 2024
b589d12
v1.7-step_2T-resume_optimizer-steps_100B on MosaicML
dirkgr Apr 15, 2024
d06ff7e
Fix cluster name
dirkgr Apr 15, 2024
2ce181a
Some more updates to the config
dirkgr Apr 15, 2024
c2a9678
Create config for 200B annealing run.
dwadden Apr 15, 2024
3c7d0e7
Add new data to get to 200B tokens.
dwadden Apr 15, 2024
ac77472
Fix redpajama data.
dwadden Apr 15, 2024
dc11fdd
Add `stop_at` flag.
dwadden Apr 16, 2024
c86e6fd
Make 50B config with new data.
dwadden Apr 16, 2024
ee3f216
`stop_at` needs to be set in steps
dirkgr Apr 16, 2024
3bbb463
Add 200B config.
dwadden Apr 16, 2024
dfa3b94
Config for fixed data
dirkgr Apr 16, 2024
6def8e8
Merge branch 'dave/annealing' of https://github.com/allenai/OLMo into…
dwadden Apr 16, 2024
5a9f8ad
Merge branch 'dave/annealing' of https://github.com/allenai/OLMo into…
dwadden Apr 16, 2024
ac19e90
Add `stop_at` in steps for all configs.
dwadden Apr 16, 2024
d8b24cc
Configs for 200B and 50B
dirkgr Apr 16, 2024
f70868c
Continue the 200B anneal.
dwadden Apr 20, 2024
a1a3167
Continue 200B anneal, corrected.
dwadden Apr 20, 2024
2de7b99
Add config for baseline with no data changes.
dwadden Apr 20, 2024
d80f6bd
Switch image and kick off on Jupiter.
dwadden Apr 21, 2024
fc5285f
Update launcher script.
dwadden Apr 21, 2024
11d707b
Update launcher script again.
dwadden Apr 21, 2024
f245138
Resume baseline model training.
dwadden Apr 23, 2024
d89ad7a
Add a scheduler to do cosine in linear envelope.
dwadden Apr 24, 2024
c1fe834
Config for cosine schedule with linear envelope.
dwadden Apr 24, 2024
be6573b
Keep trucking on the 200B.
dwadden Apr 24, 2024
7999213
Resume training for cosine run.
dwadden Apr 25, 2024
9a2c128
Do cosine run with correct checkpointer.
dwadden Apr 25, 2024
719bdb5
Add flag if model is already unsharded.
dwadden Apr 26, 2024
dc523f5
Create config for 70B anneal.
dwadden May 2, 2024
dd565fd
Merge branch 'train-olmo-large' into dave/annealing
dwadden May 2, 2024
c6bbbed
Update anneal config to match 70B run.
dwadden May 2, 2024
1e54882
Merge branch 'train-olmo-large' into dave/annealing
dwadden May 2, 2024
fc39f37
Mosaic config for annealing
dirkgr May 3, 2024
270695c
Continue the run with less Flan.
dwadden May 3, 2024
03ae5f0
Start from an earlier checkpoint
dirkgr May 3, 2024
631210d
Merge branch 'dave/annealing' of https://github.com/allenai/LLM into …
dirkgr May 3, 2024
ebc403c
Merge branch 'train-olmo-large' into dave/annealing
dwadden May 7, 2024
f6cbd70
Add flag if model is already downloaded.
dwadden May 7, 2024
b02c17b
Small updates to unshard -> hf script.
dwadden May 7, 2024
4ae8d4b
Merge branch 'main' into dave/annealing
dwadden May 7, 2024
184e7e0
Add new-style checkpointing to unshard->hf script.
dwadden May 9, 2024
d803dbb
Merge remote-tracking branch 'origin/train-olmo-large' into dave/anne…
dirkgr May 11, 2024
98e47d1
100B annealing config
dirkgr May 11, 2024
21404d2
Try the Beaker way of rdzv
dirkgr May 11, 2024
4822804
Needs one more flag
dirkgr May 11, 2024
ea4248d
Argh
dirkgr May 11, 2024
3affc7e
300B anneal config.
dwadden May 13, 2024
788baa0
300B annealing config
dirkgr May 13, 2024
c329398
We never actually ran the lower LR, so I'm setting it back to what we…
dirkgr May 17, 2024
babecfe
Cleanup
dirkgr May 17, 2024
417317c
Annealing config from ~900B tokens
dirkgr May 17, 2024
7e8202d
Merge branch 'main' into dave/annealing
dwadden May 18, 2024
bc6234f
Merge branch 'dave/annealing' of https://github.com/allenai/OLMo into…
dwadden May 18, 2024
43e61a9
150B anneal config.
dwadden May 18, 2024
a7fa0e0
Fix name and assign to olmo-large project.
dwadden May 18, 2024
5e06ac2
OLMo core main is broken
dirkgr May 18, 2024
df29f2f
Can't specify SHAs this way
dirkgr May 18, 2024
713cfac
Run died, need to restart
dirkgr May 18, 2024
8528969
Specify nodes by hand
dirkgr May 18, 2024
ecf2120
Iterating towards working nodes
dirkgr May 18, 2024
f858a08
Wrong setting
dirkgr May 18, 2024
c496a6e
More hunting for good nodes
dirkgr May 18, 2024
3469604
Continue 200B anneal.
dwadden May 21, 2024
e92f111
Merge branch 'dave/annealing' of https://github.com/allenai/OLMo into…
dwadden May 21, 2024
888792d
Run anneal with different random seed.
dwadden May 21, 2024
a559481
Script to launch runs on Jupiter.
dwadden May 21, 2024
9de9bd6
Don't change the image.
dwadden May 21, 2024
b2c5b04
Fix environment variables.
dwadden May 21, 2024
3e65728
Get rid of `--no-python`.
dwadden May 21, 2024
4532376
Restart the run with flan downweighted.
dwadden May 24, 2024
d35f0fb
Merge branch 'main' into dave/annealing
dwadden Jun 14, 2024
ebe7684
Anneal for final OLMo 7 v1.7.
dwadden Jun 14, 2024
d0585ae
Save fewer checkpoints.
dwadden Jun 18, 2024
035e5b8
Rename config and put under `olmo-medium`.
dwadden Jun 18, 2024
c8565d5
Get rid of profiles.
dwadden Jun 18, 2024
a6499b8
Modify anneal launcher.
dwadden Jun 18, 2024
95908ab
Fix s3 path.
dwadden Jun 20, 2024
a13dada
Update train config.
dwadden Jun 20, 2024
175c165
Update config.
dwadden Jun 20, 2024
a06d6f3
Start again.
dwadden Jun 21, 2024
eb5f89e
Merge branch 'main' into dave/annealing
dwadden Jun 25, 2024
2d94314
Anneal config for flan fix.
dwadden Jun 26, 2024
37110a8
Fix name.
dwadden Jun 26, 2024
8099be4
Add preemptible.
dwadden Jun 26, 2024
a59322e
Specify distributed strategy.
dwadden Jun 27, 2024
4bfe9ad
Look at variation due to random seed.
dwadden Jun 27, 2024
1957a8f
Add FSDP flag.
dwadden Jun 27, 2024
38cac62
Save fewer checkpoints.
dwadden Jul 1, 2024
555ac82
Anneal final checkpoint with fixed flan.
dwadden Jul 4, 2024
689dc7c
Anneal final checkpoint.
dwadden Jul 4, 2024
f55e500
Fix `num_checkpoints_to_keep` argument.
dwadden Jul 4, 2024
d1e57af
Merge branch 'main' into dave/annealing
dwadden Jul 10, 2024
b809a00
Start on config for amberish anneal run.
dwadden Jul 10, 2024
25095cb
Progress on Amberish config.
dwadden Jul 11, 2024
cef0fde
Final config for Amberish anneal.
dwadden Jul 11, 2024
ec888ab
Merge branch 'main' into dave/annealing
dwadden Jul 12, 2024
b42a3ed
Update anneal config.
dwadden Jul 14, 2024
4e27b32
Fix model path.
dwadden Jul 14, 2024
55cd195
Merge branch 's3_unshard_to_hf' into dave/annealing
dwadden Jul 19, 2024
909adbd
Get rid of git conflicts.
dwadden Jul 22, 2024
7822b04
Merge branch 'main' into dave/annealing
dwadden Aug 2, 2024
8de24c1
Merge branch 'main' into dave/annealing
dwadden Aug 5, 2024
93083bd
Try anneal on later checkpoint with LR warmup.
dwadden Aug 6, 2024
dfd12ee
Fix num checkpoints to save.
dwadden Aug 6, 2024
e85020c
Merge branch 'main' into dave/annealing
dwadden Aug 8, 2024
cf46005
Resume train from checkpoint.
dwadden Aug 9, 2024
8aa06b2
Merge branch 'main' into dave/annealing
dwadden Aug 16, 2024
47f47ef
Copy annealing config.
dwadden Aug 16, 2024
4a90d30
Merge branch 'main' into dave/annealing
dwadden Aug 16, 2024
8405780
Annealing configs for peteish.
dwadden Aug 16, 2024
46b8c67
Delete peteish anneals
dwadden Aug 20, 2024
fbda5d5
Delete old launch scripts
dwadden Aug 20, 2024
eb9df7b
Remove peteish launch
dwadden Aug 20, 2024
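
Several commits above revolve around the annealing learning-rate schedule: 025c89b adds a `--scheduler.warmup_min_lr` option, and d89ad7a adds a scheduler that does "cosine in a linear envelope". The following is a minimal Python sketch of what those two schedules might look like; the function names and exact formulas are illustrative assumptions, not OLMo's implementation.

import math

def linear_with_warmup(step: int, max_steps: int, lr_max: float,
                       t_warmup: int = 100, alpha_f: float = 0.1,
                       warmup_min_lr: float = 0.0) -> float:
    """Linear warmup from warmup_min_lr to lr_max, then linear decay to alpha_f * lr_max."""
    if step < t_warmup:
        return warmup_min_lr + (lr_max - warmup_min_lr) * step / t_warmup
    frac = min(1.0, (step - t_warmup) / max(1, max_steps - t_warmup))
    return lr_max * (1.0 - (1.0 - alpha_f) * frac)

def cos_linear_envelope(step: int, max_steps: int, lr_max: float,
                        t_warmup: int = 100, alpha_f: float = 0.1) -> float:
    """One plausible reading of "cosine in linear envelope": a half-cosine
    sweep scaled pointwise by the linear schedule above (an assumption)."""
    envelope = linear_with_warmup(step, max_steps, lr_max, t_warmup, alpha_f)
    if step < t_warmup:
        return envelope
    frac = min(1.0, (step - t_warmup) / max(1, max_steps - t_warmup))
    return envelope * 0.5 * (1.0 + math.cos(math.pi * frac))

With the values in the config below (t_warmup: 100, alpha_f: 0.1, peak 3.0e-4), linear_with_warmup ramps up over 100 steps and decays to 3.0e-5 by the end of the run.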
206 changes: 206 additions & 0 deletions configs/annealing/OLMo-7B.yaml
@@ -0,0 +1,206 @@
run_name: OLMo-7B
seed: 6198
dry_run: false

wandb:
  name: ${run_name}
  project: olmo-medium
  group: OLMo-7B-annealing # TODO: change to what you like

model:
  d_model: 4096
  n_heads: 32
  n_layers: 32
  mlp_hidden_size: 22016
  weight_tying: false
  alibi: false
  rope: true
  flash_attention: true
  attention_dropout: 0.0
  attention_layer_norm: false
  multi_query_attention: false
  include_bias: false
  block_type: sequential
  layer_norm_type: default
  layer_norm_with_affine: false
  bias_for_layer_norm: false
  attention_layer_norm_with_affine: false
  activation_type: swiglu
  residual_dropout: 0.0
  embedding_dropout: 0.0
  max_sequence_length: 2048
  vocab_size: 50280
  embedding_size: 50304
  eos_token_id: 50279
  pad_token_id: 1
  init_device: meta
  init_fn: mitchell

compile:
  fullgraph: false

optimizer:
  name: adamw
  learning_rate: 3.0e-4 # TODO: change to your peak learning rate
  weight_decay: 0.1
  betas:
    - 0.9
    - 0.95
  metrics_log_interval: 10

scheduler: # TODO: change to what you want
  name: linear_with_warmup
  t_warmup: 100
  alpha_f: 0.1

tokenizer:
  identifier: tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
  truncate_direction: right

save_folder: runs/${run_name}
remote_save_folder: s3://ai2-llm/checkpoints/oe-data-annealing/${run_name}
save_overwrite: true
# Sharded checkpoints (best for restarts)
save_interval: 500
save_num_checkpoints_to_keep: -1
# Unsharded checkpoints (for final storage)
save_interval_unsharded: null
save_num_unsharded_checkpoints_to_keep: -1

restore_dataloader: false # TODO: this should only be 'false' initially

load_path: /net/nfs/allennlp/llm-checkpoints/step551000-unsharded #TODO: change this

max_duration: null
global_train_batch_size: 2048 # TODO: adjust as needed
device_train_microbatch_size: 2 # TODO: adjust as needed
time_limit: null

precision: amp_bf16

fsdp:
  wrapping_strategy: by_block
  precision: mixed

max_grad_norm: 1.0
max_grad_norm_ratio: null

speed_monitor:
  window_size: 1

eval_interval: ${save_interval}
eval_subset_num_batches: -1
device_eval_batch_size: ${device_train_microbatch_size}
evaluators:
  - label: v3-small-ppl-validation
    data:
      num_workers: 0
      drop_last: true
      datasets:
        v3-small-c4_en-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/c4_en/val/part-0-00000.npy
        v3-small-dolma_books-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_books/val/part-0-00000.npy
        v3-small-dolma_common-crawl-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_common-crawl/val/part-0-00000.npy
        v3-small-dolma_pes2o-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_pes2o/val/part-0-00000.npy
        v3-small-dolma_reddit-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_reddit/val/part-0-00000.npy
        v3-small-dolma_stack-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_stack/val/part-0-00000.npy
        v3-small-dolma_wiki-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_wiki/val/part-0-00000.npy
        v3-small-ice-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/ice/val/part-0-00000.npy
        v3-small-m2d2_s2orc-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/m2d2_s2orc/val/part-0-00000.npy
        v3-small-pile-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/pile/val/part-0-00000.npy
        v3-small-wikitext_103-validation:
          - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/wikitext_103/val/part-0-00000.npy

  - label: v2-small-ppl-validation
    data:
      num_workers: 0
      drop_last: true
      datasets:
        v2-small-4chan-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/4chan/val.npy
        v2-small-c4_100_domains-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/c4_100_domains/val.npy
        v2-small-c4_en-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/c4_en/val.npy
        v2-small-gab-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/gab/val.npy
        v2-small-ice-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/ice/val.npy
        v2-small-m2d2_s2orc-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/m2d2_s2orc/val.npy
        v2-small-m2d2_wiki-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/m2d2_wiki/val.npy
        v2-small-manosphere-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/manosphere/val.npy
        v2-small-mc4_en-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/mc4_en/val.npy
        v2-small-pile-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/pile/val.npy
        v2-small-ptb-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/ptb/val.npy
        v2-small-twitterAEE-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/twitterAEE/val.npy
        v2-small-wikitext_103-validation:
          - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/wikitext_103/val.npy

  ##########################
  # Downstream evaluations #
  ##########################
  - label: piqa
    type: downstream

  - label: hellaswag
    type: downstream

  - label: winogrande
    type: downstream

  - label: openbook_qa
    type: downstream

  # - label: boolq # requires implementation of the pmi_dc matrix
  #   type: downstream

  - label: sciq
    type: downstream

  - label: arc_easy
    type: downstream

  # - label: arc_challenge # requires implementation of the pmi_dc matrix
  #   type: downstream

  - label: copa
    type: downstream

  - label: rte
    type: downstream

  - label: commitment_bank
    type: downstream

  - label: mrpc
    type: downstream

  - label: sst2
    type: downstream

data:
  pad_direction: right
  num_workers: 8
  drop_last: true
  pin_memory: true
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
  paths:
    - s3://ai2-llm/data/... # TODO: update these paths
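
The `${run_name}` and `${save_interval}` references above are OmegaConf-style interpolations, and several commits in this PR apply dotted overrides (e.g. `--scheduler.warmup_min_lr`). Below is a minimal sketch of loading and overriding this config with OmegaConf; it assumes the file is OmegaConf-compatible, the checkpoint path is a hypothetical placeholder, and the repo's own entry point layers more logic on top of this.

from omegaconf import OmegaConf

# Load the annealing config; ${run_name} etc. resolve on access.
cfg = OmegaConf.load("configs/annealing/OLMo-7B.yaml")

# Dotted overrides, mirroring CLI flags like --scheduler.warmup_min_lr
# seen in the commit history (the load_path value is a placeholder).
overrides = OmegaConf.from_dotlist([
    "scheduler.warmup_min_lr=0.0",
    "load_path=s3://my-bucket/checkpoints/step123456-unsharded",  # hypothetical
])
cfg = OmegaConf.merge(cfg, overrides)

print(cfg.wandb.name)      # -> "OLMo-7B", via ${run_name} interpolation
print(cfg.scheduler.name)  # -> "linear_with_warmup"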