
Samyamr/largest partitioned params calculation fix#1150

Merged
tjruwase merged 7 commits into master from samyamr/largest-partitioned-params-calculation-fix
Jun 16, 2021

Conversation

@samyam
Contributor

@samyam samyam commented Jun 9, 2021

largest_partitioned_params is calculated incorrectly, making it much larger than it needs to be. This PR fixes the calculation.

samyam added 3 commits June 8, 2021 16:44
largest partitioned params was getting calculated incorrectly

# Largest partitioned param
- largest_partitioned_param_numel = max(self.fp16_partitioned_groups_flat_numel)
+ largest_partitioned_param_numel = max([max([tensor.numel() for tensor in fp16_partitioned_group]) for fp16_partitioned_group in self.fp16_partitioned_groups])
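A minimal sketch of why the old calculation overestimates (the group shapes below are made up for illustration, not taken from DeepSpeed): a flattened group's numel is the sum of its tensors' numels, so taking the max over group totals can be much larger than the largest individual partitioned parameter.

```python
import torch

# Hypothetical partitioned parameter groups (sizes invented for illustration):
fp16_partitioned_groups = [
    [torch.zeros(1000), torch.zeros(2000), torch.zeros(500)],
    [torch.zeros(1500), torch.zeros(1500)],
]

# Old calculation: max over per-group *flattened* sizes. A flattened group's
# numel is the SUM of its tensors' numels, so this overestimates.
flat_numels = [sum(t.numel() for t in group) for group in fp16_partitioned_groups]
old_result = max(flat_numels)  # 3500 = 1000 + 2000 + 500

# Fixed calculation: max over the individual tensor sizes.
new_result = max(
    max(t.numel() for t in group) for group in fp16_partitioned_groups
)  # 2000, the single largest partitioned parameter

print(old_result, new_result)
```

With the fix, the buffer sized from this value only needs to hold the single largest partitioned parameter (2000 elements here) rather than the largest flattened group (3500 elements).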
Contributor


Does self.fp16_partitioned_groups_flat_numel need to be updated?

Contributor


Do references like this need updating?

@tjruwase tjruwase merged commit 4eaf910 into master Jun 16, 2021

mrgomdev commented Nov 8, 2021

The change uses tensor.numel() to retrieve the sizes. However, if the model was already built inside a deepspeed.zero.Init() context, tensor.numel() and tensor.data point only at a placeholder, because the original tensor has been partitioned and offloaded into ds_tensor with its true size recorded in ds_numel. In some cases this can cause problems, including when initializing a DeepSpeed model engine after deepspeed.zero.Init(). Please check this situation and confirm the change is valid.

The original version, self.fp16_partitioned_groups_flat_numel, seems to have been derived from ds_numel: https://github.com/microsoft/DeepSpeed/blob/7567c76c05626c5acd8b5700bedfc412c55d5354/deepspeed/runtime/zero/stage3.py#L1151
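A minimal sketch of the discrepancy mrgomdev describes. This only mimics, conceptually, what deepspeed.zero.Init does (DeepSpeed's real partitioning mechanism is more involved): after partitioning, the parameter's .data is swapped for a tiny placeholder while the true size is stashed on a ds_numel attribute, so tensor.numel() no longer reflects the real parameter size.

```python
import torch

# Hypothetical parameter of 4096 elements.
p = torch.zeros(4096)

# Simulate, conceptually, what happens after zero.Init partitions it:
# stash the true element count on the tensor, then replace .data with a
# placeholder. Afterwards, numel() reports the placeholder size, not the
# original parameter size.
p.ds_numel = p.numel()       # 4096, the original size
p.data = torch.zeros(1)      # placeholder

print(p.numel(), p.ds_numel)  # 1 4096
```

This is why sizing a buffer from tensor.numel() can go wrong under zero.Init(), while a value derived from ds_numel still sees the original size.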

@mrwyattii mrwyattii deleted the samyamr/largest-partitioned-params-calculation-fix branch July 7, 2023 02:40