fix(pt): remove meaningless error raising #5411
Conversation
📝 Walkthrough
The Trainer constructor now automatically downgrades zero_stage to 0 when training is not launched in distributed mode, instead of raising an error.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes
🚥 Pre-merge checks: ✅ 5 passed checks
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepmd/pt/train/training.py`:
- Around line 185-186: The code silently resets self.zero_stage to 0 when
self.is_distributed is False; change this to emit a clear warning before
mutating the value so users know their requested ZeRO/FSDP stage was ignored
(e.g., call warnings.warn(...) or self.logger.warning(...) with a message like
"zero_stage X requested but distributed launch not detected; forcing
zero_stage=0"), then set self.zero_stage = 0 as before; update the block
containing self.zero_stage and self.is_distributed to perform the warning first
and ensure the message includes the original requested value and guidance about
using torchrun.
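The suggested fix can be sketched as follows. The attribute names `zero_stage` and `is_distributed` come from the review comment above, but the surrounding `Trainer` class here is a hypothetical stand-in, not the real class in `deepmd/pt/train/training.py`:

```python
import warnings


class Trainer:
    """Minimal stand-in illustrating the warn-then-fallback suggestion."""

    def __init__(self, zero_stage: int = 0, is_distributed: bool = False) -> None:
        self.is_distributed = is_distributed
        self.zero_stage = zero_stage
        if self.zero_stage > 0 and not self.is_distributed:
            # Warn first so users know their requested ZeRO/FSDP stage
            # was ignored, including the original value and guidance on
            # using torchrun; then fall back as before.
            warnings.warn(
                f"zero_stage {self.zero_stage} requested but distributed "
                "launch not detected; forcing zero_stage=0. Launch with "
                "torchrun to enable ZeRO/FSDP.",
                stacklevel=2,
            )
            self.zero_stage = 0
```

With this shape, a non-distributed run still proceeds (no exception), but the downgrade is visible in the logs instead of happening silently.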
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 23b375ec-8cc0-4f65-ba75-efbeb613411d
📒 Files selected for processing (1)
deepmd/pt/train/training.py
Pull request overview
This PR changes the PyTorch Trainer initialization behavior so that requesting training.zero_stage > 0 without an initialized distributed process group no longer raises an error and instead falls back to zero_stage = 0.
Changes:
- Removed the `ValueError` raised when `zero_stage > 0` but training is not running under distributed initialization.
- Added an implicit fallback that sets `self.zero_stage = 0` in that case.
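As a sketch, the behavior change described above amounts to replacing a raise with a silent fallback. The helper below is hypothetical (the real logic is inline in the `Trainer` constructor in `deepmd/pt/train/training.py`); it only mirrors the described semantics:

```python
def resolve_zero_stage(requested_stage: int, is_distributed: bool) -> int:
    """Hypothetical helper mirroring the new Trainer behavior.

    Before this PR, requesting ZeRO without an initialized distributed
    process group raised a ValueError; now the stage is downgraded to 0.
    """
    if requested_stage > 0 and not is_distributed:
        # Old behavior:
        # raise ValueError("zero_stage > 0 requires distributed training")
        return 0  # new behavior: implicit fallback to no ZeRO
    return requested_stage
```

For example, `resolve_zero_stage(2, False)` now returns `0` instead of raising, while distributed runs keep their requested stage.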
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5411 +/- ##
==========================================
- Coverage 80.47% 80.46% -0.01%
==========================================
Files 820 820
Lines 86005 86005
Branches 4139 4139
==========================================
- Hits 69209 69208 -1
Misses 15521 15521
- Partials 1275 1276 +1
☔ View full report in Codecov by Sentry.
njzjz left a comment
This change is reasonable to me.
5d9cbdf