Starting point: https://github.com/microsoft/DeepSpeed/issues/966

Test matrix

1. gradient accumulation: one vs many
2. #gpus: one vs many
3. stages: 1 vs 2 vs 3
4. dtype: bf16 vs fp16 vs fp32

@stas00
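For reference, here is a minimal sketch of how the matrix above could be enumerated so that no combination gets skipped. Everything in it is an assumption for illustration: the helper name `build_test_matrix`, the dict keys, and the concrete values standing in for "many" (4 accumulation steps, 2 GPUs) are hypothetical and not taken from the issue or from DeepSpeed itself.

```python
from itertools import product

# Axes of the test matrix. The concrete values standing in for "many"
# (4 accumulation steps, 2 GPUs) are placeholders, not from the issue.
GRAD_ACCUM_STEPS = [1, 4]
NUM_GPUS = [1, 2]
STAGES = [1, 2, 3]
DTYPES = ["bf16", "fp16", "fp32"]


def build_test_matrix():
    """Return every combination of the four axes as a list of dicts.

    Hypothetical helper: it only enumerates configurations, it does not
    launch any training run.
    """
    return [
        {"grad_accum": ga, "num_gpus": n, "stage": s, "dtype": d}
        for ga, n, s, d in product(GRAD_ACCUM_STEPS, NUM_GPUS, STAGES, DTYPES)
    ]


if __name__ == "__main__":
    for case in build_test_matrix():
        print(case)
```

With two values for each of the first two axes and three for each of the last two, this yields 2 x 2 x 3 x 3 = 36 cases, which could then be fed to whatever test runner ends up being used.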