
[zero_to_fp32.py] support param groups#1017

Merged
tjruwase merged 2 commits into deepspeedai:master from stas00:zero_to_fp32-param_groups
Apr 29, 2021

Conversation

Collaborator

@stas00 stas00 commented Apr 29, 2021

In my original version I happened to test with a model that had a single param group, so I wasn't aware that there could be multiple flattened tensors, one per group; my reconstruction script was breaking when it ran into more than one flat tensor.

This PR tries to fix that.

There might be a more efficient way to do it, but for now just trying to make sure that it functions correctly.

I also left some disabled debug code in place for now, while the code is new and likely to need further debugging. We can remove it later once we feel it's solid.

While I tested this on a few live models, it'd be great to have a functional test for zero2 and zero3 for this code. But I'm not quite familiar with how your test suite is done and unfortunately don't really have time right now to sort it out.

The simplest conceptual test would be:

# original_fp32_state_dict_path points at the reference fp32 weights
model = (multi-param-group model).from_pretrained(original_fp32_state_dict_path)
engine, *_ = deepspeed.initialize(model, ...)  # initialize() returns a tuple
engine.save_checkpoint(save_dir)
! ./zero_to_fp32.py global_step1 pytorch_model.bin
! diff pytorch_model.bin original_fp32_state_dict_path  # should be identical

Fixes: #1009

@exelents, please check that this PR solves the problem for you.

Comment on lines +63 to +64
torch.cat(state_dicts[i]['optimizer_state_dict'][fp32_groups_key],
0) for i in range(len(state_dicts))
Collaborator Author

@stas00 stas00 Apr 29, 2021


This is the only functional change in this PR. Instead of using just the first element, it now uses them all.
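As a toy illustration of the change (plain Python lists stand in for torch tensors, and `merge_groups`, the rank dicts, and the `fp32_flat_groups` key are made-up stand-ins for the script's real structures):

```python
def merge_groups(state_dicts, key):
    # Old behavior: take only the first param group from each rank, e.g.
    #   [sd[key][0] for sd in state_dicts]
    # New behavior: concatenate ALL param-group tensors per rank
    # (the list-concat below mirrors torch.cat(..., 0) on flat tensors).
    return [sum(sd[key], []) for sd in state_dicts]

# Two ranks, each holding two flattened param groups.
rank0 = {"fp32_flat_groups": [[1.0, 2.0], [10.0]]}
rank1 = {"fp32_flat_groups": [[3.0, 4.0], [20.0]]}

print(merge_groups([rank0, rank1], "fp32_flat_groups"))
# [[1.0, 2.0, 10.0], [3.0, 4.0, 20.0]]
```

With only the first group, rank 0 would lose the `[10.0]` group entirely, which is exactly the failure mode described above.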

Contributor


This seems fine for now. I agree we have to revisit, especially for very large models that could cause CPU OOM.

@exelents

exelents commented Apr 29, 2021

Converting the "siamese" model based on t5-11b encoders finished successfully. But when I load it into CPU memory I get a strange message:

Some weights of T5Siamese were not initialized from the model checkpoint at
 ./siamese_train_deepspeed/models/siamese-t5-11b-fp16/checkpoint-1625 and are newly initialized: 
['encoder_left.encoder.embed_tokens.weight', 'encoder_right.encoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


@tjruwase tjruwase merged commit a8cf887 into deepspeedai:master Apr 29, 2021
@stas00 stas00 deleted the zero_to_fp32-param_groups branch April 29, 2021 16:12
@stas00
Collaborator Author

stas00 commented Apr 29, 2021

Thank you for running the checks, @exelents

./siamese_train_deepspeed/models/siamese-t5-11b-fp16/checkpoint-1625 and are newly initialized:
['encoder_left.encoder.embed_tokens.weight', 'encoder_right.encoder.embed_tokens.weight']

You have encoder_right.shared.weight - aren't those tied/aliased? (Same for left)

The weights are restored based on this dict, which gets saved when the checkpoint is created:

    def _get_param_shapes(self):
        param_shapes = OrderedDict()
        for name, param in self.module.named_parameters():
            param_shapes[name] = param.ds_shape if hasattr(param, "ds_shape") else param.shape
            # print(f"saving param {name} {param_shapes[name]}")
        return param_shapes

so if you do this on your model, you won't find encoder_left.encoder.embed_tokens.weight in the names.

Give it a run. `self` is the DS engine, so instead of `self.module` it'd be just your transformers model.
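For illustration, here is a standalone sketch of that same loop run against a plain model object (`DummyParam`, `DummyModel`, and `get_param_shapes` are made-up stand-ins; the real method lives on the DeepSpeed engine):

```python
from collections import OrderedDict

class DummyParam:
    """Stand-in for a parameter tensor; real ZeRO-3 params carry ds_shape."""
    def __init__(self, shape):
        self.shape = shape

class DummyModel:
    """Stand-in for a transformers model exposing named_parameters()."""
    def named_parameters(self):
        yield "encoder_left.shared.weight", DummyParam((32128, 512))
        yield "encoder_right.shared.weight", DummyParam((32128, 512))

def get_param_shapes(model):
    param_shapes = OrderedDict()
    for name, param in model.named_parameters():
        # ZeRO-3 partitions params, so prefer ds_shape when the attribute exists
        param_shapes[name] = getattr(param, "ds_shape", param.shape)
    return param_shapes

print(get_param_shapes(DummyModel()))
```

Running this against the siamese model would show that only the `shared.weight` names are recorded, not the `embed_tokens` aliases.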

@exelents

You have encoder_right.shared.weight - aren't those tied/aliased? (Same for left)

Yes, I have these weights returned from named_parameters() function of my model, so I suppose they should exist in the checkpoint.

dd = list(model_engine.module.named_parameters())
dd = list(filter(lambda x: 'shared' in x[0], dd))
dd
[('encoder_left.shared.weight',
  Parameter containing:
  tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True)),
 ('encoder_right.shared.weight',
  Parameter containing:
  tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True))]

@stas00
Collaborator Author

stas00 commented Apr 29, 2021

But that's what I'm saying: the checkpoint does have encoder_left.shared.weight and encoder_right.shared.weight and those are restored.

The loader complains about encoder_left.encoder.embed_tokens.weight and encoder_right.encoder.embed_tokens.weight.

Your code above doesn't check for these 2.

BTW, the new version of zero_to_fp32.py has a debug flag - turn it on and when you run it you will see each weight as it gets loaded.

@exelents

exelents commented Apr 29, 2021

I turned on the debug flag and it showed me all the weights in the checkpoint, including:

└──>$ cat debug.txt | grep shared
encoder_left.shared.weight full shape: torch.Size([32128, 512]) partition0 numel=16449536 partitioned_padding_numel=0
encoder_right.shared.weight full shape: torch.Size([32128, 512]) partition0 numel=16449536 partitioned_padding_numel=0

@stas00
Collaborator Author

stas00 commented Apr 29, 2021

We have already established that.

Please review #1017 (comment) the warning is for 2 other names.

@exelents

exelents commented Apr 29, 2021

Ah, okay, I understand now. In the code of T5EncoderModel I see that the variable encoder_right.encoder.embed_tokens is initialized from the external variable encoder_right.shared when T5Stack is created, so it doesn't need to be saved.
Thank you.
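A rough sketch of why the alias never reaches the checkpoint (no torch here; `Tensor` and the loop are made-up stand-ins): PyTorch's `named_parameters()` deduplicates parameters by object identity, so a tied weight is listed only under one canonical name.

```python
class Tensor:
    """Dummy stand-in for a weight tensor."""
    pass

shared = Tensor()
params = {
    "encoder_right.shared.weight": shared,
    "encoder_right.encoder.embed_tokens.weight": shared,  # alias: same object
}

# named_parameters() skips parameters it has already seen, so the alias
# never makes it into the saved param_shapes dict or the checkpoint.
seen, named = set(), []
for name, p in params.items():
    if id(p) not in seen:
        seen.add(id(p))
        named.append(name)
print(named)  # ['encoder_right.shared.weight']
```

On load, the aliased name looks "missing" and triggers the newly-initialized warning, even though the tying re-creates it from the shared weight.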

@stas00
Collaborator Author

stas00 commented Apr 29, 2021

Yes, we could probably clean that up so that it doesn't produce a misleading warning.

The key is to please check that the resumed checkpoint scores well for you. I did only a quick 100 or so steps and the loss looked correct. Also please re-check with zero2. I did test it as well, but a second pair of eyes is always better.

I was just concerned that the saved weights might somehow not be in the same order as the param_names dict; since the total number of elements is the same once everything is flattened into a single tensor, the reshape would always succeed even if the order were wrong. But I checked, and the order appears to be correct.
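The order concern can be sketched as a toy unflatten step (plain lists instead of tensors; `unflatten` and the example shapes are hypothetical): slicing the flat buffer in saved order reconstructs each weight, and a wrong order would still fit numerically but silently scramble the weights.

```python
from collections import OrderedDict
from math import prod

def unflatten(flat, param_shapes):
    # Walk the flat buffer in the same order param_shapes was saved in.
    # A wrong order would still consume the same total numel, which is
    # why an order mismatch cannot be caught by a size check alone.
    state_dict, offset = OrderedDict(), 0
    for name, shape in param_shapes.items():
        numel = prod(shape)
        state_dict[name] = flat[offset:offset + numel]  # .view(shape) in torch
        offset += numel
    assert offset == len(flat), "flat tensor size mismatch"
    return state_dict

shapes = OrderedDict([("a.weight", (2, 2)), ("b.bias", (3,))])
sd = unflatten([1, 2, 3, 4, 5, 6, 7], shapes)
print(sd)  # a.weight -> [1, 2, 3, 4], b.bias -> [5, 6, 7]
```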


Development

Successfully merging this pull request may close these issues.

Reconstruction of fp32 weights on stage3 doesn't work

3 participants