Conversation
Propagate the aux loss along GPTModelPipe layers by forwarding the aggregated loss from each transformer layer to the next. In addition, add a layer to GPTModelPipe, after the last transformer layer, to catch the final aggregated aux loss and cache it for use in the loss function. Signed-off-by: Moshe Island <misland@habana.ai>
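To illustrate the mechanism described in this commit, here is a minimal, hypothetical sketch (not the actual Megatron-DeepSpeed code; names such as `TransformerLayerWithAux` and `AuxLossCatcher` are invented for illustration, and the wrapped layer is assumed to return a `(hidden_states, aux_loss)` tuple):

```python
import torch

class TransformerLayerWithAux(torch.nn.Module):
    """Wraps a transformer layer so its MoE aux loss is accumulated and
    forwarded to the next pipeline layer along with the activations."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, inputs):
        # The first layer receives only hidden_states; later layers receive
        # a (hidden_states, aggregated_aux_loss) tuple from the previous layer.
        if torch.is_tensor(inputs):
            hidden_states, aggregated_aux_loss = inputs, None
        else:
            hidden_states, aggregated_aux_loss = inputs
        # Assumption: the wrapped layer returns its own aux loss.
        hidden_states, layer_aux_loss = self.layer(hidden_states)
        if aggregated_aux_loss is None:
            aggregated_aux_loss = layer_aux_loss
        else:
            aggregated_aux_loss = aggregated_aux_loss + layer_aux_loss
        # Both tensors keep requires_grad, so the aux loss is backpropagated.
        return hidden_states, aggregated_aux_loss

class AuxLossCatcher(torch.nn.Module):
    """Placed after the last transformer layer: caches the final aggregated
    aux loss for the loss function and passes activations through unchanged."""
    def __init__(self):
        super().__init__()
        self.cached_aux_loss = None

    def forward(self, inputs):
        hidden_states, aggregated_aux_loss = inputs
        self.cached_aux_loss = aggregated_aux_loss
        return hidden_states
```

In a DeepSpeed PipelineModule, whatever a stage's forward returns is what gets communicated to the next stage, which is why the aux loss has to travel alongside the activations rather than being collected out of band.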
Currently, PipelineEngine supports partitioning only a single tensor with grad. An MoE model needs to forward both the activations and the aux_loss with grad. Therefore, until this PipelineEngine limitation is removed, verify that no partitioning is used with MoE. Signed-off-by: Moshe Island <misland@habana.ai>
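This restriction can be expressed as a simple startup check. A hedged sketch, assuming hypothetical inputs (a `num_experts` argument and a DeepSpeed pipeline config dict with `pipe_partitioned`/`grad_partitioned` settings; these names are assumptions, not the PR's actual code):

```python
def verify_no_pipe_partitioning_for_moe(num_experts, ds_pipeline_config):
    """Fail fast if MoE is combined with pipeline tensor partitioning.

    DeepSpeed's PipelineEngine currently supports partitioning only a single
    tensor that requires grad between stages, while an MoE model must forward
    two such tensors (the activations and the aggregated aux_loss).
    """
    using_moe = num_experts is not None and num_experts > 1
    if using_moe:
        assert not ds_pipeline_config.get("pipe_partitioned", False) and \
               not ds_pipeline_config.get("grad_partitioned", False), \
            "MoE with GPTModelPipe requires pipe/grad partitioning disabled: " \
            "PipelineEngine can partition only a single tensor with grad."
```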
Thank you @mosheisland, this work is truly amazing! I deeply appreciate your effort on the thorough verification as well.
Moshe: For non-MoE runs, the reported losses are the LM loss.
Moshe: Sure, see below: total loss (LM + Aux), with a color legend. Note there are two runs called gpt_3d_moe_with_pr and gpt_3d_moe_with_pr_new.
Moshe: I am mainly interested in running with BF16_Optimizer.
@mosheisland Thank you for sharing the results! They all look good to me.
@tohtana, deepspeedai/DeepSpeed#5338 is merged, so I think we can now proceed with this one.
Thank you @mosheisland, merged now.




Main changes:
NOTE: this PR depends on DeepSpeed PR #5338: https://github.com/microsoft/DeepSpeed/pull/5338
Below are TensorBoard captures of tests that verify MoE support for the pipeline path and check for regressions.
Testing was done with the following configurations, comparing training runs with and without this PR:
Training loss curve of a GPTModel with fp16, no MoE (i.e. a dense network).

Scaling: 8xA100, DP=4 TP=2 PP=1, ZeRO=0, using GPTModel
Comparing runs without vs. with this PR.
Training loss curve of a GPTModel with fp16, with MoE (4 experts, top-2).

Scaling: 8xA100, DP=4 TP=2 PP=1, ZeRO=0, using GPTModel
Comparing runs without vs. with this PR.
Training loss curve of a GPTModelPipe with BF16_Optimizer, no MoE (i.e. a dense network).

Scaling: 8xA100, DP=2 TP=2 PP=2, ZeRO=0, using GPTModelPipe
Comparing runs without vs. with this PR.
Comparing runs that both use this PR:
At the beginning of training, GPTModel with fp16 lags slightly behind due to a few steps of loss-scale adjustment. However, both configurations end up with very close loss.
