Fix ZeRO-3 optimizer initialization validation (#7844) #7929
amadhan882 wants to merge 8 commits into deepspeedai:master
…ckward pass Signed-off-by: amadhan882 <amadhan882@gmail.com>
…F16/ZenFlow integration Signed-off-by: amadhan882 <amadhan882@gmail.com>
Signed-off-by: amadhan882 <amadhan882@gmail.com>
deepspeed/runtime/engine.py
Outdated
raise RuntimeError(
    "DeepSpeedEngine: Optimizer initialization failed. Check for JIT compilation errors.")

optimizer_methods = ['step', 'load_state_dict']
Please add backward to this list.
deepspeed/runtime/engine.py
Outdated
)

# Validate engine separately
if not hasattr(self, "backward") or not callable(getattr(self, "backward")):
Apologies if this was not clear previously, but self.optimizer.backward needs validating, not self.backward.
deepspeed/runtime/engine.py
Outdated
# ZeRO stage >= 2 communicates during non gradient accumulation boundaries as well
if self.zero_optimization_partition_gradients():
    self.optimizer.overlapping_partition_gradients_reduce_epilogue()
    if hasattr(self.optimizer, 'overlapping_partition_gradients_reduce_epilogue'):
It seems this check is now redundant due to line 425.
Signed-off-by: amadhan882 <amadhan882@gmail.com>
@sfc-gh-truwase Thanks for the clarification!
Please let me know if anything else needs adjustment.
Signed-off-by: amadhan882 <amadhan882@gmail.com>
@amadhan882 can you please address the formatting issues?
Thanks for the feedback. I am currently reviewing the changes to resolve the formatting issues and will push the updated commits shortly.
Overview
This PR addresses issue #7844 by adding a validation check to ensure the ZeRO-3 optimizer is correctly initialized before training begins.
Changes
Adds a validation check for the .step attribute on the optimizer, specifically for ZeRO Stage 3 configurations.
Related Issue
Fixes #7844