
Severe performance degradation with deepspeed > v.0.3.12 #1073

@g-karthik


Creating a new issue with more details around what I reported in the comments of an unrelated issue #1057

I have been training an 8B-parameter GPT-2 model with DeepSpeed Stage 2 enabled (micro-batch size 2, world size 64).

I used the base image nvcr.io/nvidia/pytorch:21.03-py3 and installed the latest DeepSpeed on top of it. It has PyTorch 1.9, Python 3.8, and NCCL 2.8.4.

I was seeing the following stats per iteration when profiling:

forward: 294.88 | backward: 17065.34 | backward_inner: 15988.85 | backward_allreduce: 1076.32 | step: 2102.37
SamplesPerSec: ~7.5

This contradicted my previous tests, which used older dependencies, so I downgraded DeepSpeed to v0.3.12 and got the following stats (keeping all else equal):

forward: 294.88 | backward: 3244.52 | backward_inner: 3051.20 | backward_allreduce: 193.21 | step: 1651.08
SamplesPerSec: ~33.4

I don't use a user-defined optimizer and I see from my logs that DeepSpeed's FusedAdam is picked in both cases.
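For reference, the optimizer section of my DeepSpeed config looks roughly like this (the hyperparameter values shown here are illustrative, not my exact settings):

```json
{
  "train_micro_batch_size_per_gpu": 2,
  "optimizer": {
    "type": "adam",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8
    }
  },
  "zero_optimization": {
    "stage": 2
  }
}
```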

Clearly, the latest DeepSpeed is severely hurting performance, as seen in the backward time (a difference of ~14,000 ms per iteration).

I want to upgrade to the latest DeepSpeed. What is the cause of this performance degradation?

@jeffra

EDIT: Just a correction: my logs show that DeepSpeed's FusedAdam is picked in the second case (v0.3.12), but DeepSpeedCPUAdam is picked in the first case (latest). That may explain the difference in backward times. But this happens despite my using "type": "adam" for the optimizer in the DeepSpeed config JSON in both cases. Why does this inconsistency exist?
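To illustrate the kind of behavior I'm describing, here is a hypothetical sketch of config-driven optimizer dispatch where a changed default could silently flip the choice. This is my own illustrative logic, not DeepSpeed's actual source; the function name and config keys used here are assumptions:

```python
def pick_adam_variant(config):
    """Illustrative only: choose an Adam implementation from a
    DeepSpeed-style config dict. Not DeepSpeed's actual selection logic."""
    opt_type = config.get("optimizer", {}).get("type", "").lower()
    zero_cfg = config.get("zero_optimization", {})
    # If optimizer state is offloaded to CPU, a CPU-side Adam must be used;
    # otherwise the fused GPU kernel can be used. If a library release
    # changed the default of a flag like this, the same "type": "adam"
    # config would yield a different (much slower) optimizer.
    cpu_offload = bool(zero_cfg.get("cpu_offload", False))
    if opt_type == "adam":
        return "DeepSpeedCPUAdam" if cpu_offload else "FusedAdam"
    raise ValueError(f"unhandled optimizer type: {opt_type!r}")

# Identical optimizer config, different offload flag, different optimizer:
same_opt = {"optimizer": {"type": "adam"}}
print(pick_adam_variant({**same_opt, "zero_optimization": {"cpu_offload": False}}))
print(pick_adam_variant({**same_opt, "zero_optimization": {"cpu_offload": True}}))
```

The point of the sketch: two configs with identical "type": "adam" sections can resolve to different optimizer classes based on an unrelated flag or its default.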
