fix(cc): handle nloc==0 in DeepSpinPTExpt with phantom-atom padding#5485
fix(cc): handle nloc==0 in DeepSpinPTExpt with phantom-atom padding#5485wanghan-iapcm wants to merge 2 commits into
Conversation
Multi-rank spin MD can leave a rank with zero real local atoms when all atoms migrate to other subdomains. The with-comm AOTI artifact hits an intermittent SIGFPE (integer divide by zero) at runtime in inductor-generated shape arithmetic that uses nloc as a divisor. The graph was traced with nloc_min=1 and inductor lowered an even stricter nloc>=2 runtime-check which is silently bypassed because AOTI_RUNTIME_CHECK_INPUTS is unset by default. Whether the offending divide is actually emitted depends on inductor's code-gen choices, which vary across compiles -- hence the random nature of the failure (reproduced on CI run 26667802665). Fix: prepend two phantom atoms with empty neighbour lists ahead of the real atoms when nloc_real==0. The AOTI graph then runs with nloc==2, satisfying the inductor specialisation. Phantoms have no neighbours so they contribute zero atomic energy / force / virial, preserving the physically-correct 'this rank has no real atoms' result. comm_dict's nlocal is set to 2 so border_op writes received ghost features past the phantom slots; outputs are stripped of the phantom prefix before being scattered back to LAMMPS via select_map.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 47f15b41a6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // received ghost features past the phantom slots; outputs are stripped | ||
| // of the phantom prefix before being scattered back to LAMMPS atoms | ||
| // via ``select_map``. | ||
| const int phantom_n = (nloc_real == 0 && nall_real > 0) ? 2 : 0; |
There was a problem hiding this comment.
Subtract phantom atoms before returning energy
When an empty-subdomain rank takes this new path, the two inserted type-0 phantom atoms are included in the model's energy_redu, and that reduced energy is assigned directly before any phantom prefix is stripped. DeepMD energy fitting nets can add per-type biases / nonzero zero-neighbor outputs, so phantoms are not guaranteed to have zero atomic energy; in those models every rank with nloc_real == 0 will contribute extra phantom energy to the MPI-reduced LAMMPS total. Please either remove/subtract the phantom atomic energies from energy_redu or make the model see them through a mask that excludes them from reductions.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit c43468fb7 (force-pushed). I went one step beyond a phantom-only subtract: a rank with no real local atoms contributes zero to the total energy by definition, so I just zero ener directly when phantom_n > 0.
The simpler-subtract approach would have been incomplete here — the spin path doubles atoms internally (real + spin half), so both halves carry the per-type-bias and zero-neighbour MLP output into energy_redu. output_map["energy"] only exposes the real half after the SpinModel's [:, :nloc] slice, so subtracting that alone would leave the spin half leaking through. (Confirmed empirically: an earlier attempt to subtract the real half showed mpi-2=-2.45 vs mpi-1=-1.49 in CI run 26796476553. The zero-out has no such residual.)
Forces / force_mag / virial don't need an analogous correction because phantom atomic outputs are coord-independent (no neighbours) so their derivatives are zero.
📝 WalkthroughWalkthroughThis PR modifies DeepSpinPTExpt::compute to handle ranks with zero real local atoms by prepending phantom atoms for model inputs, adjusting mapping and neighbor-list tensors so phantoms are isolated, zeroing reduced energy when needed, and stripping phantom entries from force and atomic output tensors before returning to LAMMPS. ChangesPhantom-atom padding for zero-local-atoms corner case
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/api_cc/src/DeepSpinPTExpt.cc`:
- Around line 487-489: The mapping assignment for real atoms fails to account
for phantom padding: when phantom_n > 0 the indices returned by fwd_map(...) are
in the pre-padding space and must be shifted by phantom_n. Update the loop that
sets mapping[ii] (which uses fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]])
to add phantom_n to the fwd_map result so real-atom targets point into the
post-padding index range; keep all other indexing (ii, bkw_map,
lmp_list.mapping) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e4e67575-7512-48ab-a532-4f17dad67d93
📒 Files selected for processing (1)
source/api_cc/src/DeepSpinPTExpt.cc
| for (int ii = phantom_n; ii < nall_real; ii++) { | ||
| mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]]; | ||
| } |
There was a problem hiding this comment.
Missing phantom_n offset in mapping values for real atoms.
When phantom_n > 0, fwd_map[...] returns indices in the pre-padding coordinate space [0, nall_real_orig). After phantom padding, real atoms are shifted to indices [phantom_n, nall_real), so the mapping values must be offset accordingly.
For example, with phantom_n=2: if fwd_map[...] returns 0 (first real ghost in pre-padding space), the mapping should point to index 2 (first real ghost in post-padding space), not 0 (which is now a phantom slot).
🐛 Proposed fix
for (int ii = phantom_n; ii < nall_real; ii++) {
- mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]];
+ mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]] + phantom_n;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for (int ii = phantom_n; ii < nall_real; ii++) { | |
| mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]]; | |
| } | |
| for (int ii = phantom_n; ii < nall_real; ii++) { | |
| mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]] + phantom_n; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@source/api_cc/src/DeepSpinPTExpt.cc` around lines 487 - 489, The mapping
assignment for real atoms fails to account for phantom padding: when phantom_n >
0 the indices returned by fwd_map(...) are in the pre-padding space and must be
shifted by phantom_n. Update the loop that sets mapping[ii] (which uses
fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]]) to add phantom_n to the
fwd_map result so real-atom targets point into the post-padding index range;
keep all other indexing (ii, bkw_map, lmp_list.mapping) unchanged.
There was a problem hiding this comment.
Not reachable in current code, so I'm not applying the suggestion. The reasoning:
-
The
if (lmp_list.mapping)branch is single-rank-only. Bothpair_deepmd.cpp:194andpair_deepspin.cpp:204-257gatelmp_list.set_mapping(...)oncomm->nprocs == 1. So in multi-rank,lmp_list.mapping == nullptrand we take the else (identity-fallback) branch instead. -
phantom_n > 0is multi-rank-only — it only fires whennloc_real == 0 && nall_real > 0, which can only happen in a multi-rank LAMMPS run with an empty subdomain. -
Combining:
phantom_n > 0implies multi-rank, which implieslmp_list.mapping == nullptr, which means the branch you're flagging is never entered. The identity-fallback branch handles phantoms correctly becausemapping[ii] = iiis self-consistent regardless of phantom offsets, and the with-comm AOTI artifact ignoresmappingfor ghost feature exchange anyway (border_opdoes that).
If a future change ever populates lmp_list.mapping in multi-rank, the fix would become applicable. Until then, applying it would be defensive against an unreachable scenario.
There was a problem hiding this comment.
@wanghan-iapcm, that's a clear and well-documented analysis — thank you. The mutual exclusivity is airtight:
lmp_list.mapping != nullptr↔ single-rank (nprocs == 1)phantom_n > 0↔ multi-rank empty subdomain (nloc_real == 0 && nall_real > 0)
So the two conditions can never hold simultaneously in the current LAMMPS pair-style code, and the branch is indeed unreachable. I'll withdraw the suggestion and record this constraint to avoid flagging it again.
(ᵒ ᵕ ᵒ)ᵇ
✏️ Learnings added
Learnt from: wanghan-iapcm
Repo: deepmodeling/deepmd-kit PR: 5485
File: source/api_cc/src/DeepSpinPTExpt.cc:487-489
Timestamp: 2026-06-02T06:34:42.158Z
Learning: In `source/api_cc/src/DeepSpinPTExpt.cc`, the combination `(phantom_n > 0) && (lmp_list.mapping != nullptr)` is structurally unreachable:
- `lmp_list.mapping` is populated only when `comm->nprocs == 1` (single-rank), gated in `source/lmp/pair_deepmd.cpp:194` and `source/lmp/pair_deepspin.cpp:204-257`.
- `phantom_n > 0` only fires when `nloc_real == 0 && nall_real > 0`, which can only occur in a multi-rank LAMMPS run with an empty subdomain.
Therefore the `if (lmp_list.mapping)` block inside the `ago == 0` section is never entered when phantom padding is active; the identity-fallback `else` branch handles that case correctly. Do not flag the missing `+ phantom_n` offset in the `lmp_list.mapping` branch as a bug.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: wanghan-iapcm
Repo: deepmodeling/deepmd-kit PR: 5407
File: source/api_cc/src/DeepSpinPTExpt.cc:135-151
Timestamp: 2026-04-20T13:36:43.594Z
Learning: In deepmd-kit, `.pt2` model files do not include any version/schema field. When reviewing code in `*PTExpt.cc` that reads `.pt2` metadata, do not add or require a version/schema lookup/expectation. Ensure backward compatibility is implemented via metadata fallback defaults—e.g., if `do_atomic_virial` is missing, default it to `true`, and if `nnei` metadata is missing, default `nnei` to `sum(sel)`—so older exported `.pt2` models continue to work.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5485 +/- ##
==========================================
- Coverage 81.36% 81.36% -0.01%
==========================================
Files 868 868
Lines 96437 96463 +26
Branches 4233 4241 +8
==========================================
+ Hits 78463 78484 +21
+ Misses 16674 16673 -1
- Partials 1300 1306 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Codex flagged on PR deepmodeling#5485: phantoms have constant atomic-energy outputs that flow into 'energy_redu'. On the spin path the SpinModel doubles atoms internally, so both real and spin phantom halves contribute -- and 'output_map["energy"]' only exposes the real half after the '[:, :nloc]' slice. Subtracting only that real half (a first attempt) left the spin half leaking into the MPI-reduced LAMMPS total: CI run 26796476553 showed mpi-2 = -2.45 vs mpi-1 ref = -1.49. Simpler exact fix: a rank with no real local atoms contributes zero to the total energy by definition. The phantoms are pure scaffolding to satisfy inductor's nloc>=2 specialisation; their fitting output is a numerical artifact, not physics. Zero 'ener' directly when phantom_n > 0. Forces / force_mag / virial are unaffected because phantom outputs are coord-independent (no neighbours) so their derivatives are zero -- no analogous correction is needed there.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
source/api_cc/src/DeepSpinPTExpt.cc (1)
393-401:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPad
aparam_when you synthesize phantom locals.Lines 398-400 change the local-atom count from
0to2, butaparam_stays in the pre-padding state. On an empty-local rank withdaparam > 0, Lines 551-560 still send an empty tensor, so this path can still break for spin models that were exported with atomic parameters.🧩 Suggested fix
if (phantom_n > 0) { dcoord.insert(dcoord.begin(), static_cast<size_t>(phantom_n) * 3, static_cast<VALUETYPE>(0)); datype.insert(datype.begin(), static_cast<size_t>(phantom_n), 0); + if (daparam > 0) { + aparam_.insert(aparam_.begin(), + static_cast<size_t>(phantom_n) * daparam, + static_cast<VALUETYPE>(0)); + } nall_real += phantom_n; nloc_real = phantom_n; nloc = nall_real - nghost_real; }Also applies to: 550-560
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/api_cc/src/DeepSpinPTExpt.cc` around lines 393 - 401, When synthesizing phantom locals (phantom_n > 0) you must pad the per-atom parameter array aparam_ to match the new phantom atoms: insert static_cast<size_t>(phantom_n) * static_cast<size_t>(daparam) default-valued entries at the front of aparam_ (similar to how dcoord and datype are padded) so subsequent send logic (the path around the existing tensor sends) sees the correct size; perform this insert in the same block that updates dcoord, datype, nall_real, nloc_real, and nloc.
♻️ Duplicate comments (1)
source/api_cc/src/DeepSpinPTExpt.cc (1)
487-489:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winOffset real-atom mapping targets into the post-padding index space.
This is still using
fwd_map[...]from the pre-padding layout. After Lines 395-400 prepend two phantom rows, every real/ghost row has moved byphantom_n, so the current mapping can resolve to a phantom slot instead of the intended atom row.🐛 Minimal fix
for (int ii = phantom_n; ii < nall_real; ii++) { - mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]]; + mapping[ii] = + fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]] + phantom_n; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/api_cc/src/DeepSpinPTExpt.cc` around lines 487 - 489, The mapping loop uses fwd_map from the pre-padding layout so indices can point into phantom rows; update the resolved index by offsetting into the post-padding layout by phantom_n. Concretely, in the loop that sets mapping[ii] (which references fwd_map, lmp_list.mapping and bkw_map), add phantom_n to the index into fwd_map (e.g. use fwd_map[ lmp_list.mapping[bkw_map[ii - phantom_n]] + phantom_n ] or otherwise shift the resolved value by phantom_n) so every real/ghost row maps into the post-padding index space.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@source/api_cc/src/DeepSpinPTExpt.cc`:
- Around line 393-401: When synthesizing phantom locals (phantom_n > 0) you must
pad the per-atom parameter array aparam_ to match the new phantom atoms: insert
static_cast<size_t>(phantom_n) * static_cast<size_t>(daparam) default-valued
entries at the front of aparam_ (similar to how dcoord and datype are padded) so
subsequent send logic (the path around the existing tensor sends) sees the
correct size; perform this insert in the same block that updates dcoord, datype,
nall_real, nloc_real, and nloc.
---
Duplicate comments:
In `@source/api_cc/src/DeepSpinPTExpt.cc`:
- Around line 487-489: The mapping loop uses fwd_map from the pre-padding layout
so indices can point into phantom rows; update the resolved index by offsetting
into the post-padding layout by phantom_n. Concretely, in the loop that sets
mapping[ii] (which references fwd_map, lmp_list.mapping and bkw_map), add
phantom_n to the index into fwd_map (e.g. use fwd_map[
lmp_list.mapping[bkw_map[ii - phantom_n]] + phantom_n ] or otherwise shift the
resolved value by phantom_n) so every real/ghost row maps into the post-padding
index space.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 1b0da563-36d4-460d-8300-e08ca6d9f052
📒 Files selected for processing (1)
source/api_cc/src/DeepSpinPTExpt.cc
|
Thanks for the update. CI is green now, but I think two small correctness issues still need to be addressed before approval:
mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]];When
Could you please add a small follow-up commit for these two cases? After that this should be good to approve. — OpenClaw 2026.5.28 (model: gpt-5.5) |
Problem
Multi-rank spin MD can leave a rank with zero real local atoms (
nloc_real == 0) when atoms migrate to other subdomains. The with-comm AOTI artifact hits an intermittent SIGFPE (integer divide by zero) at runtime in inductor-generated shape arithmetic that usesnlocas a divisor.Reproduced on master CI run
26667802665:Root cause:
nloc_min=1(serialization.py:362) and inductor lowered an even stricternloc >= 2runtime-check (visible in the generatedwrapper.cpp'scheck_input_3).AOTI_RUNTIME_CHECK_INPUTS(default OFF), so withnloc = 0the check is silently bypassed and the compiled graph runs through its own divide-by-zero on shape arithmetic.Fix
Prepend two phantom atoms with empty neighbour lists when
nloc_real == 0so the AOTI graph runs withnloc == 2and never reaches the integer-divide-by-zero path. Phantoms have no neighbours so they contribute zero atomic energy / force / virial, preserving the physically-correct "this rank has no real atoms" result.Key details (all in
source/api_cc/src/DeepSpinPTExpt.cc):dcoord/datype/dspinget two zero-valued rows prepended.firstneigh_tensorgets two-1rows prepended (no neighbours).mapping_tensorgets two identity entries prepended.comm_dict.nlocalis set to2(not the LAMMPS-reported0) soborder_opwrites received ghost features past the phantom slots.dforce,dforce_mag,datom_energy,datom_virial) get the phantom prefix stripped before being scattered back to LAMMPS viaselect_map.Why phantoms rather than
Dim(min=0)re-exportBumping the trace constraint to
min=0would require:nloc-dependent divide indeepmd/dpmodel/{descriptor,fitting,model}/and protecting withxp.maximum(nloc, 1);torch.exportre-emitting compatible guards (currently fails because spin-side shape relationships requirenloc >= 1to be inferable);.pt2archive insource/tests/infer/.The phantom approach is a strict superset of correctness and self-contained in one C++ file. The two approaches aren't mutually exclusive — the
min=0route can land as a follow-up once the dpmodel audit is done.Test plan
runUnitTests_cc --gtest_filter='*Spin*': 42 / 42 spin C++ regression tests pass (12 TF-backend tests skipped, as expected in the PT-only venv).test_pair_deepmd_mpi_dpa3_spin_empty_subdomainshould now pass deterministically. Local Python LAMMPS-MPI verification is blocked by a pre-existing OpenMPI/MPICH ABI mismatch in my local venv (the plugin'sompi_mpi_*symbols can't resolve against MPICH'slibmpi.so.12), so end-to-end verification falls to CI.Known limitations
nloc_real > 0(theif (phantom_n > 0)branch never fires), so the common path is unchanged.nloclower-bound to >2,phantom_nwill need to track that minimum.DeepSpinPTExptonly. The corresponding non-spin path inDeepPotPTExpthas the same code shape; non-spin DPA3 empty-subdomain currently passes in CI but could regress similarly with a future inductor change. Deferred to a follow-up if observed.Summary by CodeRabbit