Comparison of Deepspeed Stage 1,2 and 3 vs DDP #4815

@jpatel-bdai

Describe the bug
When the model fits on a single GPU, how does DeepSpeed ZeRO stage 1 compare with DDP? In my experiments, the overall training loss progresses similarly in both cases at first, but after a few iterations the DeepSpeed ZeRO stage 1 and stage 2 runs degrade.

Expected behavior
I would expect DDP and DeepSpeed ZeRO stage 1 to give similar results when run on a single GPU. The total loss is a combination of several losses, one of which is the trans loss. Do you have experiments comparing DDP with DeepSpeed ZeRO stage 1 or 2 that I can refer to? Are these supposed to give similar performance? The attached screenshots show total loss and trans loss for the single-GPU and 2-GPU experiments.
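A like-for-like comparison would presumably hold everything fixed except the ZeRO stage. The snippet below is a minimal sketch of such a set of DeepSpeed config dicts. The helper `make_zero_config`, the batch sizes, and the choice to disable mixed precision are illustrative assumptions, not the settings used in the runs above. Stage 0 (ZeRO disabled) is included as the closest in-framework baseline to DDP.

```python
import json


def make_zero_config(stage, micro_batch_size=8, grad_accum_steps=1, world_size=1):
    """Build a DeepSpeed-style config dict that varies only in the ZeRO stage.

    Hypothetical helper for illustration; all default values are assumptions.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        # DeepSpeed expects: train_batch_size ==
        #   micro batch per GPU * gradient accumulation steps * number of GPUs.
        "train_batch_size": micro_batch_size * grad_accum_steps * world_size,
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        # Mixed precision is turned off here (an assumption) so that precision
        # differences do not confound the comparison between stages.
        "fp16": {"enabled": False},
        "bf16": {"enabled": False},
        "zero_optimization": {"stage": stage},
    }


# One config per stage for a 2-GPU run; only "zero_optimization" differs.
configs = {stage: make_zero_config(stage, world_size=2) for stage in (0, 1, 2, 3)}
print(json.dumps(configs[1], indent=2))
```

Each dict could be written to a JSON file and passed as the DeepSpeed config for the corresponding run, with the same seed and data order used across runs.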

Screenshots
[Four screenshots, images not preserved: total loss and trans loss for the single-GPU and 2-GPU experiments]

System info (please complete the following information):

  • OS: [e.g. Ubuntu 20.04]
  • GPU: A100s (single-GPU and 2-GPU runs)
  • Python version: 3.10

Docker context
No

Labels: bug, training
