Creating a new issue with more details on what I reported in the comments of an unrelated issue, #1057.
I have been training an 8B-parameter GPT-2 model with DeepSpeed ZeRO Stage 2 enabled (micro-batch size 2, world size 64).
I used the base image nvcr.io/nvidia/pytorch:21.03-py3 and installed the latest DeepSpeed on top of it. It has PyTorch 1.9, Python 3.8, and NCCL 2.8.4.
I was seeing the following stats per iteration when profiling:
forward: 294.88 | backward: 17065.34 | backward_inner: 15988.85 | backward_allreduce: 1076.32 | step: 2102.37
SamplesPerSec: ~7.5
This contradicted my previous tests, which used older dependencies, so I downgraded DeepSpeed to v0.3.12 and got the following stats (keeping all else equal):
forward: 294.88 | backward: 3244.52 | backward_inner: 3051.20 | backward_allreduce: 193.21 | step: 1651.08
SamplesPerSec: ~33.4
I don't use a user-defined optimizer, and my logs show that DeepSpeed's FusedAdam is picked in both cases.
The latest DeepSpeed is severely hurting performance, as seen in the backward time (a difference of ~14,000 ms).
I want to upgrade to the latest DeepSpeed; what is the cause of this performance degradation?
EDIT: A correction: my logs show that DeepSpeed's FusedAdam is picked in the second case (v0.3.12), but DeepSpeedCPUAdam is picked in the first case (latest). That may explain the difference in backward times. However, this happens even though I use "type": "adam" for the optimizer in the DeepSpeed config JSON in both cases. Why does this inconsistency exist?
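For context, a minimal sketch of the config I'm describing (the micro-batch size and ZeRO stage match my setup above; the learning rate and other optimizer params here are illustrative placeholders, not my exact values):

```json
{
  "train_micro_batch_size_per_gpu": 2,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "zero_optimization": {
    "stage": 2
  }
}
```

With this config I would expect the same optimizer class (FusedAdam) to be selected regardless of the DeepSpeed version, yet the latest version selects DeepSpeedCPUAdam instead.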