
Consolidate norm computation logic #1839

@tjruwase

Description


I wonder if it's possible to consolidate some of the implementations of global norm? I count at least seven implementations in DeepSpeed: five in this file and one each in stage 1/2 and stage 3.

Summarizing the implementations in utils.py:

- `clip_grad_norm_`: has additional MOE code, but does not handle model parallelism correctly (uses `get_model_parallel_rank()` instead of `bwc_tensor_model_parallel_rank()`).
- `get_grad_norm`: does not include the MOE considerations; handles model parallelism.
- `get_weight_norm`: handles model parallelism, but if the list of input grads is already extracted from the parameters, then we will not have the model-parallel attributes.
- `get_global_norm_of_tensors`: does not consider model parallelism. I see it's used in conjunction with filtering tensors based on MP attributes. I like this refactor :)
- `get_global_norm`: used to get the total of a list of existing partial norms. No parallelism considered. FP32, fused FP16, and unfused FP16 all use this combined with `split_params_grads...` and `get_weight_norm`, which I believe means we will not account for any model parallelism.
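For context on what these helpers have in common: combining a list of per-group partial norms into one global norm is just the p-norm of the partial norms. Below is a minimal, hypothetical sketch of that core step in plain Python (the real DeepSpeed helpers operate on tensors and would additionally need an all-reduce across model-parallel ranks, which is exactly the part the implementations above handle inconsistently):

```python
import math

def combine_partial_norms(norm_list, norm_type=2.0):
    """Combine partial norms (one per parameter group/partition)
    into a single global norm.

    Hypothetical simplified sketch: each entry of norm_list is the
    norm of one group of gradients; the global norm is the p-norm
    of those partial norms. A model-parallel-aware version would
    all-reduce the summed powers across ranks before the final root.
    """
    if norm_type == math.inf:
        return max(norm_list)
    return sum(n ** norm_type for n in norm_list) ** (1.0 / norm_type)
```

For example, partial 2-norms of 3.0 and 4.0 combine to a global norm of 5.0, since sqrt(3² + 4²) = 5.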

Credit to @ShadenSmith for raising this issue while reviewing #1801.
