I wonder if it's possible to consolidate some of the implementations of global norm? I count at least seven in DeepSpeed: five in this file and one each in stage 1/2 and stage 3. I'm summarizing the implementations in utils.py:
- `clip_grad_norm_`: has additional MoE handling, but does not handle model parallelism correctly (it uses `get_model_parallel_rank()` instead of `bwc_tensor_model_parallel_rank()`).
- `get_grad_norm`: handles model parallelism, but does not include the MoE considerations.
- `get_weight_norm`: handles model parallelism, but if the input list of grads has already been extracted from the parameters, we will not have the model-parallel attributes.
- `get_global_norm_of_tensors`: does not consider model parallelism, though I see it's used in conjunction with filtering tensors based on MP attributes. I like this refactor :)
- `get_global_norm`: computes the total of a list of existing partial norms; no parallelism is considered. FP32, fused FP16, and unfused FP16 all use this together with `split_params_grads...` and `get_weight_norm`, which I believe means we will not account for any model parallelism.
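For context on the last point: combining a list of partial norms into a total reduces to a p-root of the sum of p-th powers, since each partial norm already covers its own parameter group. A minimal sketch of that reduction in plain Python (the helper name `combine_norms` is mine, not DeepSpeed's, and this ignores the model-parallel deduplication discussed above):

```python
def combine_norms(partial_norms, norm_type=2.0):
    """Combine per-group partial p-norms into one global p-norm.

    Assumes each entry is already a p-norm over a disjoint parameter
    group, so the total is (sum_i n_i**p)**(1/p). For the inf-norm,
    the combination is simply the max.
    """
    if norm_type == float("inf"):
        return max(partial_norms)
    return sum(n ** norm_type for n in partial_norms) ** (1.0 / norm_type)
```

Any consolidated implementation would presumably layer the MP-aware filtering (and the MoE expert-parallel reductions) on top of a single reduction like this.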
Credit to @ShadenSmith for raising this issue while reviewing #1801.