
Why doesn't the stage 3 implementation use broadcast and reduce-scatter? #1173

@zarzen


Hi there,
(Thanks for sharing your implementations!)

After reading the paper and partially going through the stage3.py implementation, I have two general questions:

  1. Is there a specific reason for using all-gather instead of the broadcast operation mentioned in the paper?
  2. When profiling stage 3 with Nsight Systems, I didn't see any NCCL reduce-scatter kernel on the timeline, as shown below. I do, however, see the NCCL all-reduce kernel. Does that mean the ZeRO-3 implementation uses all-reduce for gradient reduction?

image
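
To make the two questions concrete, here is a minimal pure-Python sketch of the collectives involved. It is a toy simulation over lists, not DeepSpeed's or NCCL's code, and every function name in it is hypothetical. It only illustrates two things: an all-gather produces the same result as one broadcast per owning rank, and keeping your own slice of an all-reduce result matches what reduce-scatter would return, although with all-reduce every rank receives the full reduced vector rather than only its shard.

```python
from typing import List

Vector = List[float]


def broadcast(shards: List[Vector], src: int) -> List[Vector]:
    """Every rank receives a copy of the shard owned by rank `src`."""
    return [list(shards[src]) for _ in shards]


def all_gather(shards: List[Vector]) -> List[Vector]:
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def gather_via_broadcasts(shards: List[Vector]) -> List[Vector]:
    """Rebuild the full parameter on every rank with one broadcast per owner."""
    world = len(shards)
    out: List[Vector] = [[] for _ in range(world)]
    for src in range(world):
        received = broadcast(shards, src)
        for rank in range(world):
            out[rank].extend(received[rank])
    return out


def all_reduce(grads: List[Vector]) -> List[Vector]:
    """Every rank ends up with the element-wise sum of all ranks' gradients."""
    total = [sum(vals) for vals in zip(*grads)]
    return [list(total) for _ in grads]


def reduce_scatter(grads: List[Vector]) -> List[Vector]:
    """Rank r ends up with only the r-th chunk of the element-wise sum.

    Assumes the gradient length is divisible by the number of ranks.
    """
    world = len(grads)
    total = [sum(vals) for vals in zip(*grads)]
    chunk = len(total) // world
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def slice_own_shard(reduced: List[Vector]) -> List[Vector]:
    """After an all-reduce, each rank keeps only the chunk it owns."""
    world = len(reduced)
    chunk = len(reduced[0]) // world
    return [reduced[r][r * chunk:(r + 1) * chunk] for r in range(world)]


# Two simulated ranks, each owning one parameter shard and a full local gradient.
param_shards = [[1.0, 2.0], [3.0, 4.0]]
local_grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]

gathered = all_gather(param_shards)
gathered_by_broadcast = gather_via_broadcasts(param_shards)
scattered = reduce_scatter(local_grads)
sliced = slice_own_shard(all_reduce(local_grads))
```

In this toy setting `gathered` equals `gathered_by_broadcast` and `scattered` equals `sliced`, so the results alone do not tell you which collective an implementation picked; that is why I am asking about the kernels on the timeline.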
