Samyamr/largest-partitioned-params-calculation-fix #1150
Conversation
`largest_partitioned_param_numel` was being calculated incorrectly.
deepspeed/runtime/zero/stage3.py (Outdated)

```diff
 # Largest partitioned param
-largest_partitioned_param_numel = max(self.fp16_partitioned_groups_flat_numel)
+largest_partitioned_param_numel = max([max([tensor.numel() for tensor in fp16_partitioned_group]) for fp16_partitioned_group in self.fp16_partitioned_groups])
```
Does self.fp16_partitioned_groups_flat_numel need to be updated?
The change uses `tensor.numel()` to retrieve the sizes. However, if the model was built inside a `deepspeed.zero.Init()` context, `tensor.numel()` and `tensor.data` point only at a placeholder: after partitioning, the original tensor is offloaded into `ds_tensor` and its true size is recorded in `ds_numel`. In some cases this can cause problems, including during initialization of a DeepSpeed model engine after `deepspeed.zero.Init()`. Please check this situation and confirm the change is valid. The original value, `self.fp16_partitioned_groups_flat_numel`, appears to have been derived from `ds_numel`: https://github.com/microsoft/DeepSpeed/blob/7567c76c05626c5acd8b5700bedfc412c55d5354/deepspeed/runtime/zero/stage3.py#L1151
`largest_partitioned_param_numel` is calculated incorrectly, making it much larger than it needs to be. This PR fixes the calculation.
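To illustrate the bug, here is a minimal sketch using plain integers in place of tensors (the group sizes below are made up, not taken from DeepSpeed): each flattened group's numel is the *sum* of its tensors' sizes, so taking `max()` over the flat numels yields an entire group's size rather than the largest single partitioned parameter.

```python
# Stand-ins for tensor.numel() values in each fp16 partitioned group
# (hypothetical sizes, for illustration only).
fp16_partitioned_groups = [
    [1_000, 2_000, 3_000],  # group 0
    [500, 4_000],           # group 1
]

# Old (buggy) calculation: max over per-group totals --
# this returns the size of a whole flattened group.
flat_numels = [sum(group) for group in fp16_partitioned_groups]
old_result = max(flat_numels)

# Fixed calculation: max over individual tensor sizes.
new_result = max(max(group) for group in fp16_partitioned_groups)

print(old_result, new_result)  # prints "6000 4000"
```

The old formula reports 6000 (the total of group 0), while the largest individual partitioned parameter is only 4000, which is why the buffer sized from this value was much larger than necessary.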