Creating a new issue with more details on what I reported in the comments of an unrelated issue, #1057.
I have been training an 8B-parameter GPT-2 model with DeepSpeed ZeRO Stage 2 enabled (micro-batch size 2, world size 64).
I used the base image nvcr.io/nvidia/pytorch:21.03-py3 and installed the latest DeepSpeed on top of it. It has PyTorch 1.9, Python 3.8, and NCCL 2.8.4.
I was seeing the following stats per iteration when profiling:
forward: 294.88 | backward: 17065.34 | backward_inner: 15988.85 | backward_allreduce: 1076.32 | step: 2102.37
SamplesPerSec: ~7.5
This contradicted my previous tests, which used older dependencies, so I downgraded DeepSpeed to v0.3.12 and got the following stats (keeping all else equal):
forward: 294.88 | backward: 3244.52 | backward_inner: 3051.20 | backward_allreduce: 193.21 | step: 1651.08
SamplesPerSec: ~33.4
I don't use a user-defined optimizer, and my logs show that DeepSpeed's FusedAdam is picked in both cases.
The latest DeepSpeed is severely hurting performance, as seen in the backward time (a difference of ~14,000 ms).
I want to upgrade to the latest DeepSpeed; what is the cause of this performance degradation?
EDIT: A correction: my logs show that DeepSpeed's FusedAdam is picked in the second case (v0.3.12), but DeepSpeedCPUAdam is picked in the first case (latest). That may explain the difference in backward times. However, this happens even though I use "type": "adam" for the optimizer in the DeepSpeed config JSON in both cases. Why does this inconsistency exist?
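For context, a minimal sketch of the config I'm describing (the micro-batch size and ZeRO stage match my setup above; the learning rate and other optimizer params here are illustrative placeholders, not my exact values):

```json
{
  "train_micro_batch_size_per_gpu": 2,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "zero_optimization": {
    "stage": 2
  }
}
```

With this config I would expect the same optimizer class (FusedAdam) to be selected regardless of the DeepSpeed version, yet the latest version selects DeepSpeedCPUAdam instead.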