Fix ZeRO-3 optimizer initialization validation (#7844)#7929

Open
amadhan882 wants to merge 8 commits into deepspeedai:master from amadhan882:master

Conversation

@amadhan882

Overview

This PR addresses issue #7844 by adding a validation check to ensure the ZeRO-3 optimizer is correctly initialized before training begins.

Changes

  • Added a check for the .step attribute on the optimizer specifically for ZeRO-Stage 3 configurations.
  • This prevents the engine from entering a deadlock state during the first training step when JIT kernels fail to compile or toolchain mismatches occur.

Related Issue

Fixes #7844

@amadhan882 force-pushed the master branch 2 times, most recently from 6339f8c to c7417b9 on March 28, 2026 17:14
…ckward pass

Signed-off-by: amadhan882 <amadhan882@gmail.com>
…F16/ZenFlow integration

Signed-off-by: amadhan882 <amadhan882@gmail.com>
Signed-off-by: amadhan882 <amadhan882@gmail.com>
raise RuntimeError(
    "DeepSpeedEngine: Optimizer initialization failed. Check for JIT compilation errors.")

optimizer_methods = ['step', 'load_state_dict']
Collaborator


Please add backward to this list.

)

# Validate engine separately
if not hasattr(self, "backward") or not callable(getattr(self, "backward")):
Collaborator


Apologies if I have not been clear previously, but self.optimizer.backward needs validating, not self.backward.

# ZeRO stage >= 2 communicates during non gradient accumulation boundaries as well
if self.zero_optimization_partition_gradients():
    self.optimizer.overlapping_partition_gradients_reduce_epilogue()
if hasattr(self.optimizer, 'overlapping_partition_gradients_reduce_epilogue'):
Collaborator


It seems this check is now redundant due to line 425.

Signed-off-by: amadhan882 <amadhan882@gmail.com>
@amadhan882
Author

@sfc-gh-truwase Thanks for the clarification!

  • Added backward to optimizer validation.

  • Updated validation to check self.optimizer.backward instead of self.backward.

  • Removed the redundant hasattr check for the ZeRO-3 epilogue.

Please let me know if anything else needs adjustment.

@sfc-gh-truwase
Collaborator

@amadhan882 can you please address the formatting issues?

@amadhan882
Author

amadhan882 commented Apr 1, 2026

> @amadhan882 can you please address the formatting issues?

Thanks for the feedback. I am currently reviewing the changes to resolve the formatting issues and will push the updated commits shortly.


Development

Successfully merging this pull request may close these issues.

[BUG] DeepSpeed ZeRO-3 deadlock in engine.step() at step 0 under 2-GPU execution (RTX 3090, torch 2.2.1, DS 0.14.2)