I wonder if it's possible to consolidate some of the implementations of global norm? I count at least seven in DeepSpeed: five in this file and one each in stage 1/2 and stage 3. I'm summarizing the implementations in utils.py:
- `clip_grad_norm_`: has additional MoE handling, but does not handle model parallelism correctly (it uses `get_model_parallel_rank()` instead of `bwc_tensor_model_parallel_rank()`).
- `get_grad_norm`: handles model parallelism, but does not include the MoE considerations.
- `get_weight_norm`: handles model parallelism, but if the input list of grads has already been extracted from the parameters, we will not have the model-parallel attributes.
- `get_global_norm_of_tensors`: does not consider model parallelism, though I see it's used in conjunction with filtering tensors based on MP attributes. I like this refactor :)
- `get_global_norm`: computes the total of a list of existing partial norms; no parallelism is considered. FP32, fused FP16, and unfused FP16 all use this together with `split_params_grads...` and `get_weight_norm`, which I believe means we will not account for any model parallelism.
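For context on the last point: combining a list of partial norms into a total reduces to a p-root of the sum of p-th powers, since each partial norm already covers its own parameter group. A minimal sketch of that reduction in plain Python (the helper name `combine_norms` is mine, not DeepSpeed's, and this ignores the model-parallel deduplication discussed above):

```python
def combine_norms(partial_norms, norm_type=2.0):
    """Combine per-group partial p-norms into one global p-norm.

    Assumes each entry is already a p-norm over a disjoint parameter
    group, so the total is (sum_i n_i**p)**(1/p). For the inf-norm,
    the combination is simply the max.
    """
    if norm_type == float("inf"):
        return max(partial_norms)
    return sum(n ** norm_type for n in partial_norms) ** (1.0 / norm_type)
```

Any consolidated implementation would presumably layer the MP-aware filtering (and the MoE expert-parallel reductions) on top of a single reduction like this.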
Credit to @ShadenSmith for raising this issue while reviewing #1801.