Queries regarding communication calls in DeepSpeed for Stage 1 (ZeRO-1) and Stage 2 (ZeRO-2):
-
As per the ZeRO paper, Stage 1 (ZeRO-1) partitions the optimizer state, so we expected to see the following calls on 2 GPUs (1-layer GPT2/345m model) with NCCL:
1 call - Reduce_Scatter for Gradients
1 call - All_Gather for optimizer
According to a prior DeepSpeed GitHub issue, reduce_scatter is not implemented for Stage 1, which is why the Stage 1 ZeRO optimizer calls all_reduce instead. Is it still not implemented?
Based on the logs, we are seeing All_Reduce calls in the backward pass plus Reduce_Scatter calls after the backward pass, before the optimizer step. Can you explain why reduce_scatter is needed here, given that we could just do all_reduce for the gradients and all_gather for the optimizer?
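To make the question concrete, here is a minimal single-process sketch of how we understand the three collectives. It does not use NCCL or DeepSpeed; the helper functions are hypothetical stand-ins that simulate each rank's buffer as a Python list, and the numbers are made up.

```python
from typing import List

Buffers = List[List[float]]  # one flat gradient buffer per simulated rank


def all_reduce(grads: Buffers) -> Buffers:
    """Every rank ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*grads)]
    return [list(total) for _ in grads]


def reduce_scatter(grads: Buffers) -> Buffers:
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    world = len(grads)
    n = len(grads[0])
    assert n % world == 0, "toy version assumes an evenly divisible buffer"
    chunk = n // world
    total = [sum(vals) for vals in zip(*grads)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_gather(shards: Buffers) -> Buffers:
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


# Two simulated GPUs, four gradient elements each (illustrative values).
grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]

shards = reduce_scatter(grads)   # each rank holds half of the summed gradient
full = all_reduce(grads)         # each rank holds the whole summed gradient
```

In this toy model, `all_gather(reduce_scatter(grads))` gives the same result as `all_reduce(grads)`, so the two patterns differ only in how much reduced data each rank holds between the two steps. Our question is about which of these DeepSpeed actually issues for Stage 1, and why.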
We are seeing an overflow at the end of the optimizer step (we expected to see All_Gather comms at the end of the optimizer step, which we don't observe). Can you please explain why the overflow occurs, as shown in the logs below:
rank 0 detected overflow nan in tensor 0:1 shape torch.Size([50304, 1024])
[deepspeed] OVERFLOW! Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
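One possible reading of this log, which we would like confirmed, is that a dynamic loss scaler with hysteresis tolerates the first overflow without lowering the scale, which would explain why the scale is "reduced" to the same value. The sketch below is only our assumption of how such a scaler could behave; `ToyDynamicLossScaler` and its parameters are hypothetical and are not taken from the DeepSpeed source. It also omits the logic that grows the scale after a run of clean steps.

```python
class ToyDynamicLossScaler:
    """Hypothetical loss scaler with hysteresis (not DeepSpeed's code)."""

    def __init__(self, init_scale=2.0 ** 32, scale_factor=2.0,
                 hysteresis=2, min_scale=1.0):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.remaining = hysteresis  # overflows left before the scale drops
        self.min_scale = min_scale

    def update(self, overflow: bool) -> bool:
        """Return True if the optimizer step should be skipped."""
        if not overflow:
            return False
        if self.remaining > 1:
            # Tolerate this overflow: skip the step, keep the scale as is.
            self.remaining -= 1
        else:
            # Hysteresis used up: lower the scale, but not below the minimum.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
        return True


scaler = ToyDynamicLossScaler()
skipped_first = scaler.update(overflow=True)
scale_after_first = scaler.scale      # unchanged under this assumption
skipped_second = scaler.update(overflow=True)
scale_after_second = scaler.scale     # halved under this assumption
```

If the step is skipped in this way, that may also be why we see no All_Gather at the end of the optimizer step, but we have not verified this.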
-
For Stage 2 (ZeRO-2), i.e. ZeRO with optimizer + gradient partitioning: in the backward pass, each gradient should be reduced onto the GPU that owns that gradient chunk, followed by an all-gather for the optimizer (as per the paper). In the run logs we are seeing:
No calls in the backward pass, which means there is no overlap of communication and compute in the backward pass, because the fusion buffer is enabled. As a result, we see only 2 All_Reduce calls after the backward pass for the gradients. Please confirm whether our understanding is correct.
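For context, this is a sketch of the kind of Stage 2 settings we believe are relevant to the behaviour above. The keys are the ones we know from the DeepSpeed `zero_optimization` config section; the values are illustrative only and are not our actual run configuration. We are also assuming that what we call the fusion buffer corresponds to the reduce bucket.

```python
import json

# Illustrative ZeRO Stage 2 settings (values are placeholders, not our run).
ds_config = {
    "zero_optimization": {
        "stage": 2,
        # Reduce gradients with reduce-scatter rather than all-reduce.
        "reduce_scatter": True,
        # Try to overlap gradient reduction with backward computation.
        "overlap_comm": True,
        # Number of elements reduced per bucket; a bucket larger than the
        # model could delay all reduction until the end of the backward pass.
        "reduce_bucket_size": 500_000_000,
        # Copy gradients into a contiguous buffer as they are produced.
        "contiguous_gradients": True,
        # All-gather the updated partitions after the optimizer step.
        "allgather_partitions": True,
        "allgather_bucket_size": 500_000_000,
    }
}

config_text = json.dumps(ds_config, indent=2)
```

If these are the settings that decide whether reduction happens inside or after the backward pass, it would help to know which combination produces the reduce-scatter pattern described in the paper.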