
Reshape deepspeed checkpoint #239

Merged: 39 commits from ds_ckpt_reshape into main on Jul 20, 2022
Conversation

@tjruwase (Collaborator) commented on Jan 24, 2022

Reshape deepspeed checkpoints on tp/pp dimensions


TODO before merging:

@stas00 (Member) commented on Jan 25, 2022

If you give it a bogus path, it doesn't report that the path is bogus and instead fails with:

stderr:   warnings.warn("Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings")
stderr: Traceback (most recent call last):
stderr:   File "tools/convert_checkpoint/deepspeed_to_deepspeed.py", line 135, in <module>
stderr:     main()
stderr:   File "tools/convert_checkpoint/deepspeed_to_deepspeed.py", line 116, in main
stderr:     ds_checkpoint = DeepSpeedCheckpoint(args.input_folder, args.target_tp,
stderr:   File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/checkpoint/deepspeed_checkpoint.py", line 47, in __init__
stderr:     self.original_pp_degree = len(
stderr: ZeroDivisionError: integer division or modulo by zero

This could be a small test.
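
A minimal sketch of such a test, assuming the eventual fix makes DeepSpeedCheckpoint (the class in the traceback above) raise a clear error for a missing path; the constructor arguments and the exception types are illustrative, not confirmed by this PR:

    import pytest

    from megatron.checkpoint.deepspeed_checkpoint import DeepSpeedCheckpoint

    def test_bogus_input_folder(tmp_path):
        # A non-existent path should be reported as such rather than
        # surfacing as a ZeroDivisionError deep inside __init__.
        bogus = tmp_path / "no_such_checkpoint"
        with pytest.raises((FileNotFoundError, ValueError, AssertionError)):
            # Positional args follow the usage in the traceback: input folder,
            # then target tp/pp degrees (values here are arbitrary).
            DeepSpeedCheckpoint(str(bogus), 1, 1)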

@stas00 (Member) commented on Jan 25, 2022

Also there is a bit of asymmetry between input and output paths. Intuitively this:

            python tools/convert_checkpoint/deepspeed_to_deepspeed.py
            --input_folder  checkpoints-in/global_step20
            --output_folder checkpoints-out/global_step20

should generate a new checkpoint under checkpoints-out/global_step20, but instead it creates checkpoints-out/global_step20/global_step20.

This is odd:

            python tools/convert_checkpoint/deepspeed_to_deepspeed.py
            --input_folder  checkpoints-in/global_step20
            --output_folder checkpoints-out

since the user has no idea that the global_step20 folder name will be replicated under the output path. Does this make sense?

Review comment on tests/test_checkpoints.py, lines +121 to +127 (outdated):
# XXX: fails to handle:
#--embed-layernorm
#
# stderr: RuntimeError: Error(s) in loading state_dict for VocabParallelEmbedding:
# stderr: size mismatch for norm.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
# stderr: size mismatch for norm.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).

@stas00 (Member) commented on Jan 25, 2022

Any idea how to make the checkpoint tool discover new entries in the model as its architecture grows?

When the optional --embed-layernorm flag is added, it pushes a layer norm into the embedding layer; the reshaper then fails to resize it and errors out as pasted above.

Unless the only way is to create a map of all possible param names and maybe flag which ones are optional?
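
A rough sketch of that "map of all possible param names" idea, purely illustrative (the param names, partition dims, and helper below are assumptions, not part of this PR):

    import torch

    # Each known embedding-layer param records how it is partitioned across TP ranks
    # and whether it is optional (only present when a flag like --embed-layernorm is on).
    EMBEDDING_PARAM_SPECS = {
        "word_embeddings.weight": {"partition_dim": 0,    "optional": False},
        "norm.weight":            {"partition_dim": None, "optional": True},   # --embed-layernorm
        "norm.bias":              {"partition_dim": None, "optional": True},   # --embed-layernorm
    }

    def merge_embedding_param(name, shards):
        """Merge (or replicate) one embedding-layer param from its TP shards."""
        spec = EMBEDDING_PARAM_SPECS.get(name)
        if spec is None:
            raise ValueError(f"unknown embedding param {name!r}; add it to the spec map")
        if spec["partition_dim"] is None:
            # Replicated param (e.g. a layer norm): every shard should hold the same values.
            return shards[0].clone()
        return torch.cat(shards, dim=spec["partition_dim"])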

@tjruwase (Collaborator, Author):

> Also there is a bit of asymmetry between input and output paths. Intuitively this:
>
>             python tools/convert_checkpoint/deepspeed_to_deepspeed.py
>             --input_folder  checkpoints-in/global_step20
>             --output_folder checkpoints-out/global_step20

Good point. For this case, is it reasonable for the tool to write latest.txt (and others) in checkpoints-out/? Or perhaps the tool should not be doing that, but should simply convert checkpoints-in/global_step20 to the corresponding checkpoints-out/global_step20?

@tjruwase (Collaborator, Author):

> If you give it a bogus path, it doesn't report that the path is bogus and instead fails with:
>
> stderr: ZeroDivisionError: integer division or modulo by zero
>
> This could be a small test.

Added the following --input_folder validation (see the sketch after this list):

  1. Path exists
  2. Path is a folder
  3. Contains mp_rank* files
  4. Contains layer_* files
  5. Contains layer_01* files
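
A rough sketch of those checks; the actual implementation in the PR may differ in function name, messages, and details:

    import glob
    import os

    def validate_input_folder(input_folder):
        # The five checks listed above, expressed as assertions.
        assert os.path.exists(input_folder), f"{input_folder} does not exist"
        assert os.path.isdir(input_folder), f"{input_folder} is not a folder"
        assert glob.glob(os.path.join(input_folder, "mp_rank*")), \
            f"{input_folder} contains no mp_rank* files"
        assert glob.glob(os.path.join(input_folder, "layer_*")), \
            f"{input_folder} contains no layer_* files"
        assert glob.glob(os.path.join(input_folder, "layer_01*")), \
            f"{input_folder} contains no layer_01* files"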

@stas00 (Member) commented on Jan 25, 2022

> Also there is a bit of asymmetry between input and output paths. Intuitively this:
>
>             python tools/convert_checkpoint/deepspeed_to_deepspeed.py
>             --input_folder  checkpoints-in/global_step20
>             --output_folder checkpoints-out/global_step20
>
> Good point. For this case, is it reasonable for the tool to write latest.txt (and others) in checkpoints-out/? Or perhaps the tool should not be doing that, but should simply convert checkpoints-in/global_step20 to the corresponding checkpoints-out/global_step20?

How about a flag for that feature? By default, overwrite latest so that the converted checkpoint can be used right away, but give an option not to overwrite it.

E.g. for testing, this default would make things simpler.
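
A sketch of what such an opt-out could look like; the flag name --no-update-latest and the helper below are hypothetical, not what the PR ended up with:

    import argparse
    import os

    def add_latest_flag(parser: argparse.ArgumentParser):
        # Hypothetical flag name for opting out of updating `latest`.
        parser.add_argument("--no-update-latest", action="store_true",
                            help="do not (over)write the `latest` file in the output folder")

    def maybe_write_latest(output_folder, step_folder_name, no_update_latest=False):
        """By default, point `latest` at the reshaped checkpoint so it can be
        resumed from right away; skip when the user opts out."""
        if not no_update_latest:
            with open(os.path.join(output_folder, "latest"), "w") as f:
                f.write(step_folder_name)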


output_dir1 = self.get_auto_remove_tmp_dir("./xxx1", after=False)
output_dir2 = self.get_auto_remove_tmp_dir("./xxx2", after=False)
with self.assertRaises(AssertionError) as context:
@tjruwase (Collaborator, Author):

@stas00, this test is meant to validate that we assert on an empty input folder. However, the test currently fails. Do you have any idea what could be wrong? I am still looking into it, though.

@stas00 (Member):

I will have a look.

I originally committed the test with a hardcoded path, which shouldn't be in the committed version; without the hardcoding it automatically picks a random empty dir and self-destructs at the end.

Use the hardcoded one only while testing locally, because it then requires manually removing the dir.

So I pushed a change that undid the hardcoding, but you may want to restore it while you test on your own so that it's easier to debug.

@stas00 (Member) commented on Jan 25, 2022

So that was the main source of the problem: the tests relied on each other, and thus the order in which they run makes a difference.

@stas00 (Member):

Also, for your testing needs you can further override the self.get_auto_remove_tmp_dir defaults to automatically clean up the output dir before and/or after testing. See the full docs here: https://huggingface.co/docs/transformers/master/testing#temporary-files-and-directories
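
For reference, the two usage patterns that matter here, following the linked docs (the exact before/after keyword behavior is as described there, so treat this as a sketch):

    # Default: a unique random tmp dir that is deleted at the end of the test.
    output_dir = self.get_auto_remove_tmp_dir()

    # Debugging: a fixed dir that is emptied before the test and kept afterwards,
    # so its contents can be inspected (remember to remove it manually later).
    output_dir = self.get_auto_remove_tmp_dir("./xxx", before=True, after=False)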

self.reshape_checkpoint(input_dir=output_dir1, output_dir=output_dir2, target_tp_size=1, target_pp_size=1)

# 3. check we can resume training from a reshaped checkpoint with TP=1 / PP=1
self.resume_from_checkpoint(output_dir2, tp_size=1, pp_size=1, dp_size=1)
@stas00 (Member) commented on Jan 25, 2022

On my 2-GPU box this test fails with:

stderr: Traceback (most recent call last):
stderr:   File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/pretrain_gpt.py", line 239, in <module>
stderr:     pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr:   File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/training.py", line 99, in pretrain
stderr:     initialize_megatron(extra_args_provider=extra_args_provider,
stderr:   File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/initialize.py", line 155, in initialize_megatron
stderr:     finish_mpu_init()
stderr:   File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/initialize.py", line 95, in finish_mpu_init
stderr:     _initialize_distributed()
stderr:   File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/initialize.py", line 285, in _initialize_distributed
stderr:     assert args.local_rank == device, \
stderr: AssertionError: expected local-rank to be the same as rank % device-count.

@sdtblck commented on Feb 7, 2022

Just dropping in here because I saw you referenced our merging PR over at gpt-neox (EleutherAI/gpt-neox#466) - we have the merging "technically" working fine, but we see a performance regression in the merged model (about a 1-2% drop in lambada score; it doesn't totally break the model, but it brings the performance of our 20B model down to roughly a 6B equivalent, see EleutherAI/gpt-neox#466 (comment)).

I'd been meaning to reach out to you and the Megatron team to see if you were having a similar problem. The source of the problem, I believe, is that parameters that should be replicated across model-parallel groups (e.g. 'input_layernorm.weight') actually end up slightly different.

I'm not sure why this is the case in neox (I guess they're never explicitly synced?), or whether it is unique to our codebase, so it'd be great to know whether you have the same problem. I'll send an email to the Megatron devs with the same question.
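
A quick way to check this on a Megatron-DeepSpeed-style checkpoint could look roughly like the sketch below; the layer_*-...-model_states.pt filename pattern and the state_dict layout are assumptions, so adapt as needed:

    import glob
    import os
    import re

    import torch

    def report_replicated_param_drift(checkpoint_dir, name_fragment="input_layernorm"):
        """Compare params that should be identical across the TP shards of each layer."""
        # Group shard files by layer index parsed from the filename.
        by_layer = {}
        for path in sorted(glob.glob(os.path.join(checkpoint_dir, "layer_*model_states.pt"))):
            m = re.match(r"layer_(\d+)", os.path.basename(path))
            if m:
                by_layer.setdefault(m.group(1), []).append(path)

        for layer, paths in sorted(by_layer.items()):
            shards = [torch.load(p, map_location="cpu") for p in paths]
            for key, ref in shards[0].items():
                if name_fragment not in key:
                    continue
                for sd, path in zip(shards[1:], paths[1:]):
                    diff = (ref.float() - sd[key].float()).abs().max().item()
                    if diff > 0:
                        print(f"layer {layer} {key}: max abs diff {diff:.3e} vs {os.path.basename(path)}")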

@tjruwase (Collaborator, Author) commented on Feb 8, 2022

@sdtblck, thanks for reaching out. Yes, the entire reshaping line of work is making slow progress on our side. The plan is to push our updates into this PR and others on DeepSpeed. Perhaps we can sync in a few weeks when we should have a bit more clarity/results on our approach.

@stas00 (Member) commented on Feb 23, 2022

cc: @DanielHesslow and @conglongli - for awareness wrt lm-harness integration here: #212 - the deepspeed checkpoint utils are moving into deepspeed, so once this PR is complete some syncing might be required.

@stas00 mentioned this pull request on Feb 23, 2022
@stas00 (Member) commented on Jun 4, 2022

Cross link the other half: microsoft/DeepSpeed#1953

@Muennighoff (Collaborator):

Okay if I merge master into this PR? I need some of the latest changes.

                                     pp_index):
        sd = ds_checkpoint.get_2d_parallel_state(tp_index=tp_index,
                                                 pp_index=pp_index)
        sd[MP_WORLD_SIZE] = ds_checkpoint.tp_degree
Collaborator commented:

Shouldn't this be sd[MP_WORLD_SIZE] = ds_checkpoint.tp_degree * ds_checkpoint.pp_degree?
This snippet in DeepSpeed sets mp_world_size to cover all GPUs that are not data parallel, i.e. tp & pp:

        for dp in range(self.data_parallel_size):
            ranks = sorted(self._topo.get_axis_list(axis='data', idx=dp))
            if self.global_rank == 0:
                #print(f'RANK={self.global_rank} building DeepSpeed model group: {ranks}')
                pass
            proc_group = dist.new_group(ranks=ranks)
            if self.global_rank in ranks:
                self.ds_model_proc_group = proc_group
                self.ds_model_world_size = len(ranks)
                self.ds_model_rank = ranks.index(self.global_rank)
        assert self.ds_model_rank > -1
        assert self.ds_model_proc_group is not None

The size of self.ds_model_proc_group is used to set MP_WORLD_SIZE.

@tjruwase (Collaborator, Author):

My understanding is that the size of self.ds_model_proc_group is tp_degree. Historically, MP_WORLD_SIZE stood for the model-parallel world size and was independent of pipeline parallelism. Please see here.
Are you observing a different behavior or some error with this?

Collaborator commented:

Yeah, in the 176B model trained with this codebase:

Parallelism:

TP_SIZE=4
PP_SIZE=12

Inside of mp_rank_00_model_states.pt, it says mp_world_size=48, i.e. 4*12.

...
'lr_scheduler': {'max_lr': 6e-05, 'warmup_steps': 183105, 'num_steps': 147682224, 'warmup_tokens': 0, 'num_tokens': 302449000448, 'decay_style': 'cosine', 'decay_steps': 200000000, 'min_lr': 6e-06}, 'sparse_tensor_module_names': set(), 'skipped_steps': 0, 'global_steps': 80000, 'global_samples': 147682224, 'dp_world_size': 8, 'mp_world_size': 48, 'ds_config': './ds_config.472098.json', 'ds_version': '0.6.1+b7d64fd', 'args':
...

I think this is because of the snippet I pasted above.
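
For anyone who wants to check their own checkpoint, the value can be read straight out of the model-states file (keys as in the dump above; the path is relative to the global_step folder):

    import torch

    sd = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
    print(sd["mp_world_size"], sd["dp_world_size"])  # here: 48 and 8, i.e. TP=4 * PP=12, DP=8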

@stas00 merged commit 0f23a72 into main on Jul 20, 2022
@stas00 deleted the ds_ckpt_reshape branch on Jul 20, 2022, 20:45
Muennighoff added a commit that referenced this pull request on Aug 17, 2022:
* Reshape deepspeed checkpoint (#239)

* Reshape deepspeed checkpoint

* add checkpoint tests

* Validate input folder

* Tests for tp/pp reshape

* remove debug folders

* fix test_checkpoint_reshaping_empty_dir

* Fix unit tests

* Remove deepspeed checkpoint utils

* Use DS 3D reshaping utils

* convert to bf16

* wip universal chkpt

* rename

* rename

* wip on fragments dealing

* cleanup

* Loading universal checkpoint with reshaping

* all gpu1<->2 reshapes work

* param attrs

* make the tests adaptable to the number of available gpus

* WIP

* WIP

* WIP

* WIP

* Debug functions

* args should be required, don't create another latest file

* Parallelize shard extraction

* close+join pool; add tqdm; comment out noise

* rename

* parameterize

* Parallel slice merging

* Cleanup

* allow inspection on a machine w/o gpus

* test against the right DS branch

* DS size was merged

Co-authored-by: Stas Bekman <stas@stason.org>

* BLOOM Inference via DeepSpeed-Inference, Accelerate and DeepSpeed-ZeRO (#308)

* hardcode the dtype depending on the model

* change the mp based on the world_size

* remove hardcoded world_size

* add bigscience/bigscience-small-testing

* fixes

* add zero-inference script

* fixes

* fix

* working script

* renames

* fixes

* fix for offline use

* add benchmark

* add benchmark

* update

* cleanup

* update

* msecs

* cleanup

* improve

* fix benchmark, add warmup

* update

* fix; thanks Michael Wyatt

* clarify

* add bloom batch-inference script

* removed the names :-)

* fold the bs functionality from the other script

* fix

* restore do_sample

* dump generate args

* fix

* fix

* support any batchsize

* div by bs

* mul by bs

* add cpu_offload; sync scripts

* wip

* improvements

* fixes

* fixes

* add accelerate script

* fix

* wip

* wip

* stats

* add OnDevice and remove zero-inference (#316)

* wip

* rework generate + benchmark

* figure out the memory map dynamically

* bug fix

* fix ds-zero-inference wrt device

* bug fix

* update

* update

* fix

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
younesbelkada pushed a commit to younesbelkada/Megatron-DeepSpeed that referenced this pull request on Sep 28, 2022, with the same commit message as the "Reshape deepspeed checkpoint (#239)" commit above.