Fix ZeRO-3 optimizer initialization validation (#7844)#7929

Open
amadhan882 wants to merge 8 commits into deepspeedai:master from amadhan882:master

Conversation

@amadhan882

Overview

This PR addresses issue #7844 by adding a validation check to ensure the ZeRO-3 optimizer is correctly initialized before training begins.

Changes

  • Added a check for the .step attribute on the optimizer specifically for ZeRO-Stage 3 configurations.
  • This prevents the engine from entering a deadlock state during the first training step when JIT kernels fail to compile or toolchain mismatches occur.

Related Issue

Fixes #7844

@amadhan882 force-pushed the master branch 2 times, most recently from 6339f8c to c7417b9 on March 28, 2026 17:14
…ckward pass

Signed-off-by: amadhan882 <amadhan882@gmail.com>
…F16/ZenFlow integration

Signed-off-by: amadhan882 <amadhan882@gmail.com>
Signed-off-by: amadhan882 <amadhan882@gmail.com>
raise RuntimeError(
    "DeepSpeedEngine: Optimizer initialization failed. Check for JIT compilation errors.")

optimizer_methods = ['step', 'load_state_dict']
Collaborator


Please add backward to this list.

)

# Validate engine separately
if not hasattr(self, "backward") or not callable(getattr(self, "backward")):
Collaborator


Apologies if I have not been clear previously, but self.optimizer.backward needs validating, not self.backward.

# ZeRO stage >= 2 communicates during non gradient accumulation boundaries as well
if self.zero_optimization_partition_gradients():
    self.optimizer.overlapping_partition_gradients_reduce_epilogue()
if hasattr(self.optimizer, 'overlapping_partition_gradients_reduce_epilogue'):
Collaborator


It seems this check is now redundant due to line 425.

Signed-off-by: amadhan882 <amadhan882@gmail.com>
@amadhan882
Author

@sfc-gh-truwase Thanks for the clarification!

  • Added backward to optimizer validation.

  • Updated validation to check self.optimizer.backward instead of self.backward.

  • Removed the redundant hasattr check for the ZeRO-3 epilogue.

Please let me know if anything else needs adjustment.

@sfc-gh-truwase
Collaborator

@amadhan882 can you please address the formatting issues?

@amadhan882
Author

amadhan882 commented Apr 1, 2026

> @amadhan882 can you please address the formatting issues?

Thanks for the feedback. I am currently reviewing the changes to resolve the formatting issues and will push the updated commits shortly.


Development

Successfully merging this pull request may close these issues.

[BUG] DeepSpeed ZeRO-3 deadlock in engine.step() at step 0 under 2-GPU execution (RTX 3090, torch 2.2.1, DS 0.14.2)