Hi there,
(Thanks for sharing your implementations!)
After reading the paper, and partially going throught the stage3.py implementation.
I have two questions in general:
- is there any specific reason for picking all-gather instead of broadcast operation, which is mentioned in the paper.
- by profiling the stage-3, using Nsight System, I didn't see any nccl reduce-scatter kernel on the timeline. As shown in the following. While, I do see the nccl-all-reduce kernel on the timeline, does that mean the ZeRO3 implementation using all-reduce for gradient reduction?
