Reshape deepspeed checkpoint #239
Conversation
If you give it a bogus path, it doesn't report that the path is bogus and instead fails with:
This could be a small test.
Also, there is a bit of asymmetry between the input and output paths. Intuitively this:
should generate a new checkpoint under the given output folder. This is odd:
since the user has no idea.
# XXX: fails to handle:
# --embed-layernorm
#
# stderr: RuntimeError: Error(s) in loading state_dict for VocabParallelEmbedding:
# stderr: size mismatch for norm.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
# stderr: size mismatch for norm.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
Any idea how to make the checkpoint tool discover new entries in the model as its architecture grows?
When the optional --embed-layernorm flag is added, it pushes a layer norm into the embedding layer; the reshaper then fails to resize it and errors out as pasted above.
Unless the only way is to create a map of all possible param names and perhaps flag which ones are optional?
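A hypothetical sketch of that idea (names and structure are purely illustrative, not the reshaper's actual code): list every param the embedding layer could possibly contain, flag the optional ones, and have the reshaper skip entries a given checkpoint lacks instead of crashing on a size mismatch.
# Illustrative only: a registry of all possible embedding-layer params,
# flagging which ones are optional.
KNOWN_EMBED_PARAMS = {
    "word_embeddings.weight": {"optional": False},
    "norm.weight":            {"optional": True},   # only present with --embed-layernorm
    "norm.bias":              {"optional": True},
}

def embed_params_to_reshape(state_dict):
    """Yield the params present in this checkpoint, erroring only on missing required ones."""
    for name, meta in KNOWN_EMBED_PARAMS.items():
        if name not in state_dict:
            if meta["optional"]:
                continue
            raise KeyError(f"required param {name} missing from checkpoint")
        yield name, state_dict[name]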
Good point. For this case, is it reasonable for the tool to write latest.txt (and others) in checkpoints-out/? Or perhaps the tool should not be doing that, and should simply convert checkpoints-in/global_step20 to the corresponding checkpoints-out/global_step20?
Added the following --input_folder validation:
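The actual check isn't quoted above; as a rough sketch of what such a validation could look like (messages and structure are illustrative), consistent with the AssertionError the test below expects:
# Rough sketch of an --input_folder validation; not the exact code added in the PR.
import os

def validate_input_folder(input_folder):
    assert os.path.isdir(input_folder), \
        f"--input_folder {input_folder} does not exist or is not a directory"
    assert os.listdir(input_folder), \
        f"--input_folder {input_folder} is empty, expected a deepspeed checkpoint"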
Have a flag for that feature? And by default overwrite latest so that the converted checkpoint can be used right away, but give an option not to overwrite it? E.g. for testing, this default would make things simpler.
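Something along these lines could implement the suggestion (the flag name and wiring are made up for illustration, not the tool's actual CLI):
# Illustrative only: an opt-out flag for writing `latest` in the output folder.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--no_overwrite_latest", action="store_true",
                    help="do not write the `latest` file into the output folder")
args, _ = parser.parse_known_args()

def maybe_write_latest(output_folder, step_folder_name, overwrite=True):
    # by default point `latest` at the converted step, e.g. "global_step20"
    if not overwrite:
        return
    with open(os.path.join(output_folder, "latest"), "w") as f:
        f.write(step_folder_name)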
tests/test_checkpoints.py (Outdated)
output_dir1 = self.get_auto_remove_tmp_dir("./xxx1", after=False)
output_dir2 = self.get_auto_remove_tmp_dir("./xxx2", after=False)
with self.assertRaises(AssertionError) as context:
@stas00, this test is meant to validate that we assert on an empty input folder. However, the test currently fails. Do you have any idea what could be wrong? I am still looking into it though.
I will have a look.
I originally committed a test with a hardcoded path, which shouldn't be in the committed version. It'll automatically pick a random empty dir and self-destruct at the end.
Use the hardcoded one only while testing, because it then requires manually removing the dir.
So I pushed a change that undid the hard-coding, but you may want to restore it while you test on your own so that it's easier to debug.
So that was the main source of the problem: you had tests rely on each other, and thus the order they run in makes a difference.
Also, for your testing needs you can further override the self.get_auto_remove_tmp_dir defaults to automatically clean up the output dir before or after testing. See the full doc here: https://huggingface.co/docs/transformers/master/testing#temporary-files-and-directories
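For instance, a minimal sketch of that override, assuming the TestCasePlus helper from transformers.testing_utils described in the linked doc (the class name and test body are illustrative):
# Sketch of overriding the tmp-dir behavior while debugging; the test body is elided.
from transformers.testing_utils import TestCasePlus

class TestCheckpointReshaping(TestCasePlus):
    def test_reshape(self):
        # default: unique tmp dir, removed automatically after the test
        output_dir = self.get_auto_remove_tmp_dir()

        # while debugging: pin the path, wipe it before the test so reruns
        # start clean, and keep it afterwards for inspection
        # output_dir = self.get_auto_remove_tmp_dir("./xxx1", before=True, after=False)
        ...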
tests/test_checkpoints.py (Outdated)
self.reshape_checkpoint(input_dir=output_dir1, output_dir=output_dir2, target_tp_size=1, target_pp_size=1)

# 3. check we can resume training from a reshaped checkpoint with TP=1 / PP=1
self.resume_from_checkpoint(output_dir2, tp_size=1, pp_size=1, dp_size=1)
on my 2-gpu box this test fails with:
stderr: Traceback (most recent call last):
stderr: File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/pretrain_gpt.py", line 239, in <module>
stderr: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr: File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/training.py", line 99, in pretrain
stderr: initialize_megatron(extra_args_provider=extra_args_provider,
stderr: File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/initialize.py", line 155, in initialize_megatron
stderr: finish_mpu_init()
stderr: File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/initialize.py", line 95, in finish_mpu_init
stderr: _initialize_distributed()
stderr: File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/Megatron-DeepSpeed-master-3/megatron/initialize.py", line 285, in _initialize_distributed
stderr: assert args.local_rank == device, \
stderr: AssertionError: expected local-rank to be the same as rank % device-count.
Just dropping in here because I saw you referenced our merging PR over at gpt-neox (EleutherAI/gpt-neox#466) - we have the merging "technically" working fine, but we see a performance regression in the merged model (about a 1-2% drop in lambada score; it doesn't totally break the model, but it brings the performance of our 20B model down to about a 6B equivalent, see EleutherAI/gpt-neox#466 (comment)). I'd been meaning to reach out to you guys and the Megatron team to see if you were having a similar problem. The source of the problem, I believe, is that the parameters that should be replicated across model parallel groups (e.g. 'input_layernorm.weight') are actually slightly different. I'm not sure why this is the case in neox (I guess they're never explicitly synced?) and/or whether this is something unique to our codebase, so it'd be great to know whether you have the same problem. I'll send an email to the Megatron devs with the same question.
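One quick way to check for the same drift in this codebase would be something like the following sketch (paths are illustrative; it assumes the usual DeepSpeed mp_rank_XX_model_states.pt layout with the module weights under the 'module' key):
# Hypothetical diagnostic: compare params that should be replicated across
# tensor-parallel ranks between two shards of the same checkpoint.
import torch

sd0 = torch.load("global_step1000/mp_rank_00_model_states.pt", map_location="cpu")["module"]
sd1 = torch.load("global_step1000/mp_rank_01_model_states.pt", map_location="cpu")["module"]

for name, p0 in sd0.items():
    if "layernorm" in name:  # expected to be identical on every tp rank
        diff = (p0 - sd1[name]).abs().max().item()
        if diff > 0:
            print(f"{name}: max abs diff {diff}")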
@sdtblck, thanks for reaching out. Yes, the entire reshaping effort is one of slow progress on our side. The plan is to push our updates into this PR and others on DeepSpeed. Perhaps we can sync in a few weeks when we should have a bit more clarity/results on our approach.
cc: @DanielHesslow and @conglongli - for awareness wrt the lm-harness integration here: #212 - the deepspeed checkpoint utils are moving into deepspeed, so once this PR is complete some syncing might be required.
Cross-link to the other half: microsoft/DeepSpeed#1953
Okay if I merge master into this PR? I need some of the latest changes.
                              pp_index):
    sd = ds_checkpoint.get_2d_parallel_state(tp_index=tp_index,
                                             pp_index=pp_index)
    sd[MP_WORLD_SIZE] = ds_checkpoint.tp_degree
Shouldn't this be sd[MP_WORLD_SIZE] = ds_checkpoint.tp_degree * ds_checkpoint.pp_degree?
This snippet in DeepSpeed sets mp_world_size to be all gpus that are not data parallel, thus tp & pp:
for dp in range(self.data_parallel_size):
    ranks = sorted(self._topo.get_axis_list(axis='data', idx=dp))
    if self.global_rank == 0:
        #print(f'RANK={self.global_rank} building DeepSpeed model group: {ranks}')
        pass
    proc_group = dist.new_group(ranks=ranks)
    if self.global_rank in ranks:
        self.ds_model_proc_group = proc_group
        self.ds_model_world_size = len(ranks)
        self.ds_model_rank = ranks.index(self.global_rank)
assert self.ds_model_rank > -1
assert self.ds_model_proc_group is not None
The size of self.ds_model_proc_group is used to set MP_WORLD_SIZE.
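If I read the topology helper right, a tiny check of that claim could look like this (assuming deepspeed.runtime.pipe.topology is importable; the sizes are illustrative):
# For dp=2, pp=2, tp=2, the ranks sharing one data-parallel index (the group
# whose size becomes mp_world_size above) number pp * tp = 4, not just tp.
from deepspeed.runtime.pipe.topology import PipeModelDataParallelTopology

topo = PipeModelDataParallelTopology(num_pp=2, num_mp=2, num_dp=2)
ranks = sorted(topo.get_axis_list(axis='data', idx=0))
print(len(ranks))  # 4 == pp * tp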
My understanding is that self.ds_model_proc_group is of size tp_degree. Historically, MP_WORLD_SIZE stood for model-parallel world size and was independent of pipeline parallelism. Please see here.
Are you observing a different behavior or some error with this?
Yeah, in the 176B trained with this codebase:
Parallelism:
TP_SIZE=4
PP_SIZE=12
Inside of mp_rank_00_model_states.pt, it says mp_world_size=48, i.e. 4*12:
...
'lr_scheduler': {'max_lr': 6e-05, 'warmup_steps': 183105, 'num_steps': 147682224, 'warmup_tokens': 0, 'num_tokens': 302449000448, 'decay_style': 'cosine', 'decay_steps': 200000000, 'min_lr': 6e-06}, 'sparse_tensor_module_names': set(), 'skipped_steps': 0, 'global_steps': 80000, 'global_samples': 147682224, 'dp_world_size': 8, 'mp_world_size': 48, 'ds_config': './ds_config.472098.json', 'ds_version': '0.6.1+b7d64fd', 'args':
...
I think this is because of the snippet I pasted above.
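For reference, a quick way to confirm this on any checkpoint of that run (the path is illustrative; the keys are the ones visible in the dump above):
# Inspect the stored parallelism sizes in a DeepSpeed model-states file.
import torch

sd = torch.load("global_step80000/mp_rank_00_model_states.pt", map_location="cpu")
print(sd["mp_world_size"])  # 48 here, i.e. TP_SIZE * PP_SIZE = 4 * 12
print(sd["dp_world_size"])  # 8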
* Reshape deepspeed checkpoint (#239) * Reshape deepspeed checkpoint * add checkpoint tests * Validate input folder * Tests for tp/pp reshape * remove debug folders * fix test_checkpoint_reshaping_empty_dir * Fix unit tests * Remove deepspeed checkpoint utils * Use DS 3D reshaping utils * convert to bf16 * wip universal chkpt * rename * rename * wip on fragments dealing * cleanup * Loading universal checkpoint with reshaping * all gpu1<->2 reshapes work * param attrs * make the tests adaptable to the number of available gpus * WIP * WIP * WIP * WIP * Debug functions * args should be required, don't create another latest file * Parallelize shard extraction * close+join pool; add tqdm; comment out noise * rename * parameterize * Parallel slice merging * Cleanup * allow inspection on a machine w/o gpus * test against the right DS branch * DS size was merged Co-authored-by: Stas Bekman <stas@stason.org> * BLOOM Inference via DeepSpeed-Inference, Accelerate and DeepSpeed-ZeRO (#308) * hardcode the dtype depending on the model * change the mp based on the world_size * remove hardcoded world_size * add bigscience/bigscience-small-testing * fixes * add zero-inference script * fixes * fix * working script * renames * fixes * fix for offline use * add benchmark * add benchmark * update * cleanup * update * msecs * cleanup * improve * fix benchmark, add warmup * update * fix; thanks Michael Wyatt * clarify * add bloom batch-inference script * removed the names :-) * fold the bs functionality from the other script * fix * restore do_sample * dump generate args * fix * fix * support any batchsize * div by bs * mul by bs * add cpu_offload; sync scripts * wip * improvements * fixes * fixes * add accelerate script * fix * wip * wip * stats * add OnDevice and remove zero-inference (#316) * wip * rework generate + benchmark * figure out the memory map dynamically * bug fix * fix ds-zero-inference wrt device * bug fix * update * update * fix Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Stas Bekman <stas@stason.org> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Reshape deepspeed checkpoint * add checkpoint tests * Validate input folder * Tests for tp/pp reshape * remove debug folders * fix test_checkpoint_reshaping_empty_dir * Fix unit tests * Remove deepspeed checkpoint utils * Use DS 3D reshaping utils * convert to bf16 * wip universal chkpt * rename * rename * wip on fragments dealing * cleanup * Loading universal checkpoint with reshaping * all gpu1<->2 reshapes work * param attrs * make the tests adaptable to the number of available gpus * WIP * WIP * WIP * WIP * Debug functions * args should be required, don't create another latest file * Parallelize shard extraction * close+join pool; add tqdm; comment out noise * rename * parameterize * Parallel slice merging * Cleanup * allow inspection on a machine w/o gpus * test against the right DS branch * DS size was merged Co-authored-by: Stas Bekman <stas@stason.org>
Reshape deepspeed checkpoints on tp/pp dimensions
TODO before merging:
@olruwase/elastic-ckpt-refresh in requirements.txt