Why doesn't the stage 3 implementation use broadcast and reduce-scatter? #1173

Hi there,
(Thanks for sharing your implementations!)

After reading the paper and partially going through the `stage3.py` implementation, I have two general questions:
1) Is there a specific reason for using all-gather instead of the broadcast operation mentioned in the paper?
2) When profiling stage 3 with Nsight Systems, I didn't see any NCCL reduce-scatter kernel on the timeline (see the screenshot below), but I do see the NCCL all-reduce kernel. Does that mean the ZeRO-3 implementation uses all-reduce for gradient reduction?

![image](https://user-images.githubusercontent.com/1150493/122625353-bff0a380-d072-11eb-93ac-cca6f8076053.png)
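To make the two questions concrete, here is a toy sketch of the collectives I am comparing. It is pure Python with each rank modeled as a list entry; the helper names (`all_reduce`, `reduce_scatter`, `all_gather`, `broadcast`) are hypothetical stand-ins I wrote for illustration, not NCCL or DeepSpeed APIs, and it assumes the gradient length is divisible by the number of ranks. It only shows the data movement semantics, not what `stage3.py` actually does.

```python
# Toy simulation: "ranks" are entries of a Python list. These helpers are
# hypothetical stand-ins for illustration, not NCCL or DeepSpeed calls.

def all_reduce(grads):
    """Every rank ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*grads)]
    return [list(total) for _ in grads]

def reduce_scatter(grads):
    """Rank r ends up with only the r-th shard of the element-wise sum."""
    world = len(grads)
    total = [sum(vals) for vals in zip(*grads)]
    shard = len(total) // world  # assumes length is divisible by world size
    return [total[r * shard:(r + 1) * shard] for r in range(world)]

def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]

def broadcast(shards, src):
    """Every rank receives the shard owned by rank `src`."""
    return [list(shards[src]) for _ in shards]

# Two ranks, four gradient elements each.
grads = [[1, 2, 3, 4], [10, 20, 30, 40]]

reduced = all_reduce(grads)      # full summed gradient on every rank
sharded = reduce_scatter(grads)  # each rank keeps only its own shard

# One broadcast per owning rank, concatenated, moves the same data as a
# single all-gather over the shards.
via_broadcasts = [
    [x for src in range(len(sharded)) for x in broadcast(sharded, src)[rank]]
    for rank in range(len(sharded))
]
```

In this toy model, an all-reduce yields the same values as a reduce-scatter followed by an all-gather, and one broadcast per shard owner yields the same values as a single all-gather. So my questions are about which collective is actually issued (and why), not about the numerical result.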

