Reshape deepspeed checkpoint #239
Conversation
If you give it a bogus path, it doesn't report that the path is bogus and instead fails with:
This could be a small test.
Also, there is a bit of asymmetry between the input and output paths. Intuitively this:
should generate a new checkpoint under the given output folder. This is odd:
since the user has no idea.
# XXX: fails to handle:
# --embed-layernorm
#
# stderr: RuntimeError: Error(s) in loading state_dict for VocabParallelEmbedding:
# stderr: size mismatch for norm.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
# stderr: size mismatch for norm.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
Any idea how to make the checkpoint tool discover new entries in the model as its architecture grows?
When the optional --embed-layernorm flag is added, it pushes a layer norm into the embedding layer; the reshaper then fails to resize it and errors out as pasted above.
Unless the only way is to create a map of all possible param names and perhaps flag which ones are optional?
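A hypothetical sketch of that idea (names and structure are purely illustrative, not the reshaper's actual code): list every param the embedding layer could possibly contain, flag the optional ones, and have the reshaper skip entries a given checkpoint lacks instead of crashing on a size mismatch.
# Illustrative only: a registry of all possible embedding-layer params,
# flagging which ones are optional.
KNOWN_EMBED_PARAMS = {
    "word_embeddings.weight": {"optional": False},
    "norm.weight":            {"optional": True},   # only present with --embed-layernorm
    "norm.bias":              {"optional": True},
}

def embed_params_to_reshape(state_dict):
    """Yield the params present in this checkpoint, erroring only on missing required ones."""
    for name, meta in KNOWN_EMBED_PARAMS.items():
        if name not in state_dict:
            if meta["optional"]:
                continue
            raise KeyError(f"required param {name} missing from checkpoint")
        yield name, state_dict[name]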
Good point. For this case, is it reasonable for the tool to write latest.txt (and others) in checkpoints-out/? Or perhaps the tool should not be doing that, and should simply convert checkpoints-in/global_step20 to the corresponding checkpoints-out/global_step20?
Added the following --input_folder validation:
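The actual check isn't quoted above; as a rough sketch of what such a validation could look like (messages and structure are illustrative), consistent with the AssertionError the test below expects:
# Rough sketch of an --input_folder validation; not the exact code added in the PR.
import os

def validate_input_folder(input_folder):
    assert os.path.isdir(input_folder), \
        f"--input_folder {input_folder} does not exist or is not a directory"
    assert os.listdir(input_folder), \
        f"--input_folder {input_folder} is empty, expected a deepspeed checkpoint"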
Have a flag for that feature? And by default overwrite latest so that the converted checkpoint can be used right away, but give an option not to overwrite it? E.g. for testing, this default would make things simpler.
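Something along these lines could implement the suggestion (the flag name and wiring are made up for illustration, not the tool's actual CLI):
# Illustrative only: an opt-out flag for writing `latest` in the output folder.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--no_overwrite_latest", action="store_true",
                    help="do not write the `latest` file into the output folder")
args, _ = parser.parse_known_args()

def maybe_write_latest(output_folder, step_folder_name, overwrite=True):
    # by default point `latest` at the converted step, e.g. "global_step20"
    if not overwrite:
        return
    with open(os.path.join(output_folder, "latest"), "w") as f:
        f.write(step_folder_name)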
tests/test_checkpoints.py (Outdated)
output_dir1 = self.get_auto_remove_tmp_dir("./xxx1", after=False)
output_dir2 = self.get_auto_remove_tmp_dir("./xxx2", after=False)
with self.assertRaises(AssertionError) as context:
@stas00, this test is meant to validate that we assert on an empty input folder. However, the test currently fails. Do you have any idea what could be wrong? I am still looking into it though.
I will have a look.
I originally committed a test with a hardcoded path, which shouldn't be in the committed version. It'll automatically pick a random empty dir and self-destruct at the end.
Use the hardcoded one only while testing, because it then requires manually removing the dir.
So I pushed a change that undid the hard-coding, but you may want to restore it while you test on your own so that it's easier to debug.
So that was the main source of the problem: you had tests rely on each other, and thus the order they run in makes a difference.
Also, for your testing needs you can further override the self.get_auto_remove_tmp_dir defaults to automatically clean up the output dir before or after testing. See the full doc here: https://huggingface.co/docs/transformers/master/testing#temporary-files-and-directories
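For instance, a minimal sketch of that override, assuming the TestCasePlus helper from transformers.testing_utils described in the linked doc (the class name and test body are illustrative):
# Sketch of overriding the tmp-dir behavior while debugging; the test body is elided.
from transformers.testing_utils import TestCasePlus

class TestCheckpointReshaping(TestCasePlus):
    def test_reshape(self):
        # default: unique tmp dir, removed automatically after the test
        output_dir = self.get_auto_remove_tmp_dir()

        # while debugging: pin the path, wipe it before the test so reruns
        # start clean, and keep it afterwards for inspection
        # output_dir = self.get_auto_remove_tmp_dir("./xxx1", before=True, after=False)
        ...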
tests/test_checkpoints.py (Outdated)
self.reshape_checkpoint(input_dir=output_dir1, output_dir=output_dir2, target_tp_size=1, target_pp_size=1)

# 3. check we can resume training from a reshaped checkpoint with TP=1 / PP=1
self.resume_from_checkpoint(output_dir2, tp_size=1, pp_size=1, dp_size=1)
on my 2-gpu box this test fails with:
stderr: Traceback (most recent call last):
stderr: File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/pretrain_gpt.py", line 239, in <module>
stderr: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr: File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/training.py", line 99, in pretrain
stderr: initialize_megatron(extra_args_provider=extra_args_provider,
stderr: File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/initialize.py", line 155, in initialize_megatron
stderr: finish_mpu_init()
stderr: File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/initialize.py", line 95, in finish_mpu_init
stderr: _initialize_distributed()
stderr: File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/initialize.py", line 285, in _initialize_distributed
stderr: assert args.local_rank == device, \
stderr: AssertionError: expected local-rank to be the same as rank % device-count.
Just dropping in here because I saw you referenced our merging PR over at gpt-neox (EleutherAI/gpt-neox#466) - we have the merging "technically" working fine, but we see a performance regression in the merged model (about a 1-2% drop in lambada score; it doesn't totally break the model, but it brings the performance of our 20B model down to about a 6B equivalent, see EleutherAI/gpt-neox#466 (comment)). I'd been meaning to reach out to you guys and the Megatron team to see if you were having a similar problem. The source of the problem, I believe, is that the parameters that should be replicated across model parallel groups (e.g. 'input_layernorm.weight') are actually slightly different. I'm not sure why this is the case in neox (I guess they're never explicitly synced?) and/or whether this is something unique to our codebase, so it'd be great to know whether you have the same problem. I'll send an email to the Megatron devs with the same question.
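One quick way to check for the same drift in this codebase would be something like the following sketch (paths are illustrative; it assumes the usual DeepSpeed mp_rank_XX_model_states.pt layout with the module weights under the 'module' key):
# Hypothetical diagnostic: compare params that should be replicated across
# tensor-parallel ranks between two shards of the same checkpoint.
import torch

sd0 = torch.load("global_step1000/mp_rank_00_model_states.pt", map_location="cpu")["module"]
sd1 = torch.load("global_step1000/mp_rank_01_model_states.pt", map_location="cpu")["module"]

for name, p0 in sd0.items():
    if "layernorm" in name:  # expected to be identical on every tp rank
        diff = (p0 - sd1[name]).abs().max().item()
        if diff > 0:
            print(f"{name}: max abs diff {diff}")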
@sdtblck, thanks for reaching out. Yes, the entire reshaping effort is one of slow progress on our side. The plan is to push our updates into this PR and others on DeepSpeed. Perhaps we can sync in a few weeks when we should have a bit more clarity/results on our approach.
cc: @DanielHesslow and @conglongli - for awareness wrt the lm-harness integration here: #212 - the deepspeed checkpoint utils are moving into deepspeed, so once this PR is complete some syncing might be required.
Cross-link to the other half: microsoft/DeepSpeed#1953
Okay if I merge master into this PR? I need some of the latest changes.
                              pp_index):
    sd = ds_checkpoint.get_2d_parallel_state(tp_index=tp_index,
                                             pp_index=pp_index)
    sd[MP_WORLD_SIZE] = ds_checkpoint.tp_degree
Shouldn't this be sd[MP_WORLD_SIZE] = ds_checkpoint.tp_degree * ds_checkpoint.pp_degree?
This snippet in DeepSpeed sets mp_world_size to be all gpus that are not data parallel, thus tp & pp:
for dp in range(self.data_parallel_size):
    ranks = sorted(self._topo.get_axis_list(axis='data', idx=dp))
    if self.global_rank == 0:
        #print(f'RANK={self.global_rank} building DeepSpeed model group: {ranks}')
        pass
    proc_group = dist.new_group(ranks=ranks)
    if self.global_rank in ranks:
        self.ds_model_proc_group = proc_group
        self.ds_model_world_size = len(ranks)
        self.ds_model_rank = ranks.index(self.global_rank)
assert self.ds_model_rank > -1
assert self.ds_model_proc_group is not None
The size of self.ds_model_proc_group is used to set MP_WORLD_SIZE.
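If I read the topology helper right, a tiny check of that claim could look like this (assuming deepspeed.runtime.pipe.topology is importable; the sizes are illustrative):
# For dp=2, pp=2, tp=2, the ranks sharing one data-parallel index (the group
# whose size becomes mp_world_size above) number pp * tp = 4, not just tp.
from deepspeed.runtime.pipe.topology import PipeModelDataParallelTopology

topo = PipeModelDataParallelTopology(num_pp=2, num_mp=2, num_dp=2)
ranks = sorted(topo.get_axis_list(axis='data', idx=0))
print(len(ranks))  # 4 == pp * tp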
My understanding is that self.ds_model_proc_group is of size tp_degree. Historically, MP_WORLD_SIZE stood for model-parallel world size and was independent of pipeline parallelism. Please see here.
Are you observing a different behavior or some error with this?
Yeah, in the 176B trained with this codebase:
Parallelism:
TP_SIZE=4
PP_SIZE=12
Inside of mp_rank_00_model_states.pt, it says mp_world_size=48, i.e. 4*12:
...
'lr_scheduler': {'max_lr': 6e-05, 'warmup_steps': 183105, 'num_steps': 147682224, 'warmup_tokens': 0, 'num_tokens': 302449000448, 'decay_style': 'cosine', 'decay_steps': 200000000, 'min_lr': 6e-06}, 'sparse_tensor_module_names': set(), 'skipped_steps': 0, 'global_steps': 80000, 'global_samples': 147682224, 'dp_world_size': 8, 'mp_world_size': 48, 'ds_config': './ds_config.472098.json', 'ds_version': '0.6.1+b7d64fd', 'args':
...
I think this is because of the snippet I pasted above.
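For reference, a quick way to confirm this on any checkpoint of that run (the path is illustrative; the keys are the ones visible in the dump above):
# Inspect the stored parallelism sizes in a DeepSpeed model-states file.
import torch

sd = torch.load("global_step80000/mp_rank_00_model_states.pt", map_location="cpu")
print(sd["mp_world_size"])  # 48 here, i.e. TP_SIZE * PP_SIZE = 4 * 12
print(sd["dp_world_size"])  # 8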
* Reshape deepspeed checkpoint (#239) * Reshape deepspeed checkpoint * add checkpoint tests * Validate input folder * Tests for tp/pp reshape * remove debug folders * fix test_checkpoint_reshaping_empty_dir * Fix unit tests * Remove deepspeed checkpoint utils * Use DS 3D reshaping utils * convert to bf16 * wip universal chkpt * rename * rename * wip on fragments dealing * cleanup * Loading universal checkpoint with reshaping * all gpu1<->2 reshapes work * param attrs * make the tests adaptable to the number of available gpus * WIP * WIP * WIP * WIP * Debug functions * args should be required, don't create another latest file * Parallelize shard extraction * close+join pool; add tqdm; comment out noise * rename * parameterize * Parallel slice merging * Cleanup * allow inspection on a machine w/o gpus * test against the right DS branch * DS size was merged Co-authored-by: Stas Bekman <stas@stason.org> * BLOOM Inference via DeepSpeed-Inference, Accelerate and DeepSpeed-ZeRO (#308) * hardcode the dtype depending on the model * change the mp based on the world_size * remove hardcoded world_size * add bigscience/bigscience-small-testing * fixes * add zero-inference script * fixes * fix * working script * renames * fixes * fix for offline use * add benchmark * add benchmark * update * cleanup * update * msecs * cleanup * improve * fix benchmark, add warmup * update * fix; thanks Michael Wyatt * clarify * add bloom batch-inference script * removed the names :-) * fold the bs functionality from the other script * fix * restore do_sample * dump generate args * fix * fix * support any batchsize * div by bs * mul by bs * add cpu_offload; sync scripts * wip * improvements * fixes * fixes * add accelerate script * fix * wip * wip * stats * add OnDevice and remove zero-inference (#316) * wip * rework generate + benchmark * figure out the memory map dynamically * bug fix * fix ds-zero-inference wrt device * bug fix * update * update * fix Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Stas Bekman <stas@stason.org> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Reshape deepspeed checkpoint * add checkpoint tests * Validate input folder * Tests for tp/pp reshape * remove debug folders * fix test_checkpoint_reshaping_empty_dir * Fix unit tests * Remove deepspeed checkpoint utils * Use DS 3D reshaping utils * convert to bf16 * wip universal chkpt * rename * rename * wip on fragments dealing * cleanup * Loading universal checkpoint with reshaping * all gpu1<->2 reshapes work * param attrs * make the tests adaptable to the number of available gpus * WIP * WIP * WIP * WIP * Debug functions * args should be required, don't create another latest file * Parallelize shard extraction * close+join pool; add tqdm; comment out noise * rename * parameterize * Parallel slice merging * Cleanup * allow inspection on a machine w/o gpus * test against the right DS branch * DS size was merged Co-authored-by: Stas Bekman <stas@stason.org>
Reshape deepspeed checkpoints on tp/pp dimensions
TODO before merging:
@olruwase/elastic-ckpt-refresh in requirements.txt