feat(pt_expt): O(N) on-device NeighborGraph builders (vesin & nvalchemiops, PR-C) #5714
Whole-branch review:
- M1: mirror ase_builder's empty-part-list guard so the vesin builder handles nf==0 without a torch.cat([]) RuntimeError.
- M3: extend the dpmodel fail-fast test to cover method='nv' (not just 'vesin').
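A minimal sketch of the M1 guard, written in plain Python so it runs without torch. `strict_cat` is a hypothetical stand-in for torch.cat that, like torch.cat([]), raises on an empty list. The per-frame edge layout (local indices offset by f * nloc) is an assumption for illustration and is not the builder's actual data structure.

```python
from typing import List, Sequence, Tuple

# (center i, neighbor j, integer lattice shift S): hypothetical edge layout
Edge = Tuple[int, int, Tuple[int, int, int]]


def strict_cat(parts: Sequence[List[Edge]]) -> List[Edge]:
    """Stand-in for torch.cat: rejects an empty list, as torch.cat([]) does."""
    if len(parts) == 0:
        raise RuntimeError("expected a non-empty list of parts")
    out: List[Edge] = []
    for part in parts:
        out.extend(part)
    return out


def gather_frames(per_frame_edges: Sequence[List[Edge]], nloc: int) -> List[Edge]:
    """Offset each frame's local indices by f * nloc, then concatenate."""
    parts: List[List[Edge]] = []
    for f, edges in enumerate(per_frame_edges):
        offset = f * nloc
        parts.append([(i + offset, j + offset, s) for (i, j, s) in edges])
    # Empty-part-list guard: with nf == 0 the loop never runs, so return an
    # empty result instead of calling strict_cat([]).
    if not parts:
        return []
    return strict_cat(parts)
```

A frame with no edges still adds one (empty) part, so only nf == 0 needs the guard.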
📝 Walkthrough: The PR adds NV and vesin neighbor-graph builders for PT experimental graph flows, routes inference and model dispatch through method selection, filters virtual atoms in ASE graph construction, and expands regression coverage for decoding, parity, and import behavior.
Changes: Neighbor-graph builders and dispatch
Estimated code review effort: 4 (Complex) | ~60 minutes
Pre-merge checks: ✅ 5 passed
Codecov Report
❌ Patch coverage is …
@@            Coverage Diff             @@
##           master    #5714      +/-   ##
==========================================
- Coverage   81.24%   81.13%   -0.12%
==========================================
  Files         981      990       +9
  Lines      109860   111007    +1147
  Branches     4234     4232       -2
==========================================
+ Hits        89257    90063     +806
- Misses      19080    19420     +340
- Partials     1523     1524       +1
… (review) Address iProzd's deepmodeling#5714 review:
- extract the nv dense-matrix -> (i, j, S) decode into nv_matrix_to_ijs (pure torch, device-agnostic) and unit-test it on the default CPU CI with hand-checked, empty, and random-vs-oracle synthetic inputs; the CUDA neighbor_list search stays behind the opt-in GPU suite
- document why nv searches/recomputes on NORMALIZED coords while vesin uses the ORIGINAL coords (nvalchemiops needs in-cell positions; vesin handles unwrapped positions natively), each self-consistent with its S
- document the vesin per-frame Python loop scope: intended for nf==1 inference / CPU; never a default (neighbor_graph_method=None resolves to 'dense'); batched multi-frame GPU work should use nv
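The plain-Python sketch below illustrates the first two points under stated assumptions. The row-padded matrix with a per-row neighbor count is an assumed layout, and nvalchemiops' real output format may differ. `matrix_to_ijs`, `edge_vectors`, and `shifts_for_wrapped` are hypothetical names, not the PR's nv_matrix_to_ijs. The sketch also shows why normalized and original coordinates are both acceptable as long as S matches. If atom a is wrapped by n_a cells, the shift becomes S + n_j - n_i and the edge vector is unchanged.

```python
from typing import List, Sequence, Tuple

Shift = Tuple[int, int, int]
Vec = Tuple[float, float, float]


def matrix_to_ijs(
    nbr_matrix: Sequence[Sequence[int]],
    shift_matrix: Sequence[Sequence[Shift]],
    num_neighbors: Sequence[int],
) -> Tuple[List[int], List[int], List[Shift]]:
    """Decode a row-padded neighbor matrix into flat (i, j, S) lists.

    Assumed layout: row i holds atom i's neighbors; only the first
    num_neighbors[i] slots are valid and the rest is padding.
    """
    ii: List[int] = []
    jj: List[int] = []
    ss: List[Shift] = []
    for i, (row, srow, n) in enumerate(zip(nbr_matrix, shift_matrix, num_neighbors)):
        for k in range(n):
            ii.append(i)
            jj.append(row[k])
            ss.append(tuple(srow[k]))
    return ii, jj, ss


def edge_vectors(coord, box, ii, jj, ss) -> List[Vec]:
    """edge_vec = coord[j] + S @ box - coord[i]; rows of box are lattice vectors."""
    out: List[Vec] = []
    for i, j, s in zip(ii, jj, ss):
        shift = [sum(s[k] * box[k][d] for k in range(3)) for d in range(3)]
        out.append(tuple(coord[j][d] + shift[d] - coord[i][d] for d in range(3)))
    return out


def shifts_for_wrapped(ii, jj, ss, cell_offset) -> List[Shift]:
    """If x_wrapped = x - n @ box per atom, the matching shift is S + n_j - n_i."""
    return [
        tuple(s[d] + cell_offset[j][d] - cell_offset[i][d] for d in range(3))
        for i, j, s in zip(ii, jj, ss)
    ]
```

A problem only arises when the two conventions are mixed, for example by pairing wrapped coordinates with shifts computed for unwrapped ones.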
iProzd left a comment:
Agree with @OutisLi, and I think this is a correctness gap for the opt-in O(N) builders, not just a docstring scope note. Confirmed in code:
- build_neighbor_graph (dense) drops virtual-atom edges via keep_mask = ... & ~(extended_atype<0) & ~(atype<0).
- vesin/nv/ase build geometric edges from coords only and never look at atype.
- forward_common_atomic_graph masks edges only for pair_excl; for atype<0 it only clamps the type and zeroes the virtual atom's own output. It does not drop edges from a virtual neighbor into a real center.
So for inputs with atype<0 (e.g. mixed-natoms padding), a real center that has a virtual atom within rcut gets an extra type-0 edge under vesin/nv/ase but not under dense, polluting the real atom's descriptor/energy/force. The "SAME neighbor set / parity" claim then holds only for all-real-atom inputs, which is likely all the current set-equality fixtures cover.
Rather than only documenting the caller's responsibility, could the O(N) builders filter edges touching atype<0 (drop where center or neighbor-owner is virtual) to match dense, so the parity claim stays literally true? A set-equality test with a virtual-atom fixture would lock it in. Default users are unaffected (default resolves to dense), so this is opt-in-only severity, but it undermines the stated dense parity.
OutisLi left a comment:
Please rework the nv graph path to reuse the existing NvNeighborList implementation rather than adding a second nvalchemiops search stack. The new graph builder duplicates the kernel setup, capacity loop, device handling, and matrix decode that NvNeighborList already owns, which increases maintenance cost and has already diverged in behavior. This can be fixed by extending NvNeighborList with a graph-oriented return mode or a thin build_graph method while preserving the existing extended/edges modes so current SeZM users are not affected.
Others LGTM.
OutisLi left a comment:
The same design concern applies to both nv and vesin: graph-form NeighborGraph construction should extend the existing NeighborList strategies instead of adding separate parallel builders. VesinNeighborList and NvNeighborList already provide the O(N) search backends used by SeZM through extended/edges return modes; the graph path should be another output contract on the same strategy layer, not a second copy of each backend. Please unify this design while preserving the existing extended and edges modes for current SeZM users.
OutisLi left a comment:
Please also restore the graph-lower eligibility check in the pt_expt graph dispatch. Explicit neighbor_graph_method values should not bypass the same mixed_types/uses_graph_lower contract enforced by dpmodel and by graph export.
OutisLi left a comment:
Please revisit the public neighbor-backend API design. The PR now exposes two knobs, nlist_backend and neighbor_graph_method, whose scopes are hard to infer from their names and whose supported values differ. From a user perspective both choose the neighbor construction backend, so the API should be made more natural and unambiguous. In particular, if nv is a supported O(N) backend, it should not work only for graph-form .pt2 while the nlist path has no nv option, especially since NvNeighborList already exists.
test_plugin popped deepmd.pt_expt from sys.modules without restoring it,
leaving the package's cached submodules bound to a dead parent. Any later
import of a cached submodule (e.g. deepmd.pt_expt.infer.deep_eval)
re-created a BARE parent whose utils/infer attributes were never rebound,
and mock.patch('deepmd.pt_expt.utils...') in
test_deep_eval_serialize_api failed with AttributeError under py3.10's
mock target resolution. Shard-order dependent: this PR's new test files
reshuffled CI shard 4 and exposed the latent master-side hygiene bug.
Snapshot the whole deepmd.pt_expt module tree before the re-import and
restore it (including the parent-package attribute) afterwards.
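Here is a minimal standard-library sketch of snapshotting and restoring the module tree. `preserve_module_tree` is a hypothetical helper name and is not the test's actual code. The last step matters because re-importing a submodule rebinds the attribute on its parent package. Restoring only sys.modules would therefore leave the parent pointing at the fresh module.

```python
import sys
from contextlib import contextmanager
from types import ModuleType
from typing import Dict, Iterator


def _in_tree(name: str, prefix: str) -> bool:
    return name == prefix or name.startswith(prefix + ".")


@contextmanager
def preserve_module_tree(prefix: str) -> Iterator[None]:
    """Snapshot every sys.modules entry under `prefix`; restore it on exit."""
    saved: Dict[str, ModuleType] = {
        n: m for n, m in sys.modules.items() if _in_tree(n, prefix)
    }
    parent_name, _, child = prefix.rpartition(".")
    parent = sys.modules.get(parent_name) if parent_name else None
    had_attr = parent is not None and hasattr(parent, child)
    old_attr = getattr(parent, child) if had_attr else None
    try:
        yield
    finally:
        # Drop whatever the body (re)imported under the prefix ...
        for n in [n for n in sys.modules if _in_tree(n, prefix)]:
            del sys.modules[n]
        # ... put the original module objects back ...
        sys.modules.update(saved)
        # ... and rebind the parent-package attribute a re-import overwrote.
        if parent is not None:
            if had_attr:
                setattr(parent, child, old_attr)
            elif hasattr(parent, child):
                delattr(parent, child)
```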
The dense reference builder filters virtual atoms (atype < 0) as both centers and neighbors during construction, but the geometric ase/vesin/nv searches are type-blind. With virtual/padding atoms present, their edge sets therefore diverged from the dense builder's, violating the 'SAME neighbor set' contract (OutisLi review, deepmodeling#5714). Post-filter the (i, j, S) edge list in all three builders by the atype of both endpoints. Add set-equality tests against the dense builder, with a virtual atom added, for ase (default CI), vesin (guarded), and nv (CUDA-gated).
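Here is a minimal plain-Python sketch of the post-filter. It assumes that j indexes the local owner atom of each neighbor, including periodic images, so atype[j] gives the neighbor's type. `drop_virtual_edges` is a hypothetical name and is not the PR's code.

```python
from typing import List, Sequence, Tuple

Shift = Tuple[int, int, int]


def drop_virtual_edges(
    ii: Sequence[int],
    jj: Sequence[int],
    ss: Sequence[Shift],
    atype: Sequence[int],
) -> Tuple[List[int], List[int], List[Shift]]:
    """Keep an edge only if both the center i and the neighbor owner j are real (atype >= 0)."""
    keep = [atype[i] >= 0 and atype[j] >= 0 for i, j in zip(ii, jj)]
    return (
        [i for i, k in zip(ii, keep) if k],
        [j for j, k in zip(jj, keep) if k],
        [s for s, k in zip(ss, keep) if k],
    )
```

Filtering on both endpoints matches the dense builder's `~(extended_atype<0) & ~(atype<0)` mask described in the review above.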
…egacy strategies delegate

The shared search helpers now live in the World-2 graph-builder modules (vesin_graph_builder.vesin_search_ijs, nv_graph_builder.nv_search_matrix), which are the long-term primary path. The legacy VesinNeighborList and NvNeighborList classes (scheduled for deprecation with the dense-nlist path) import and call these helpers rather than owning their own copies of the same logic. Direction is graph-builder-centric (World-1 strategies are callers, not owners), matching the reviewer's intent and the project's deprecation roadmap.

Key details:
- vesin_search_ijs: returns (ii, jj, ss) as int64; handles the zeros-box for non-periodic internally; pins torch.device for vesin's internal allocs. _build_single casts ss to float at the call site (unchanged math for ss@box). _build_single_edges uses int64 ss directly (edge_schema_from_ij_shifts accepts int shifts). Function-level import in vesin_neighbor_list avoids a module-level cycle (vesin_graph_builder imports is_vesin_torch_available from vesin_neighbor_list).
- nv_search_matrix: wraps the full search pipeline including _input_device_context pinning (restoring the guard the nv graph builder previously omitted), periodic normalization, batch tensor construction, and the grow-until-fit capacity loop. NvNeighborList.build uses a function-level import (lazy, avoids pt->pt_expt module-level cycle). Post-search code in build() is wrapped in its own _input_device_context block so device behavior is byte-unchanged relative to the original single-context design.
- _matrix_to_extended_inputs, edge_schema_from_neighbor_matrix, and _truncate_to_sel_compiled are entirely unchanged (extended/edges modes byte-identical).
- _grow_search_capacity and _input_device_context remain in nv_nlist.py for callers that import them directly (test_nv_nlist imports _input_device_context).
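Here is a sketch of the grow-until-fit pattern. The max(64, nloc) start and the 1.25× growth factor come from the PR description. Everything else is an assumption, including the `search` callable that returns a truncated matrix together with the largest true neighbor count. The real nv_search_matrix loop may differ in its details; for example, it might jump straight to the reported maximum.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical search: search(capacity) -> (matrix truncated to `capacity`
# columns, largest true neighbor count seen for any atom)
SearchFn = Callable[[int], Tuple[List[List[int]], int]]


def search_until_fit(
    search: SearchFn, nloc: int, growth: float = 1.25, max_rounds: int = 50
) -> Tuple[List[List[int]], int]:
    capacity = max(64, nloc)  # initial heuristic quoted in the PR description
    for _ in range(max_rounds):
        matrix, max_count = search(capacity)
        # On GPU, reading max_count back is one .item() host sync per attempt.
        if max_count <= capacity:
            return matrix, capacity
        capacity = math.ceil(capacity * growth)
    raise RuntimeError(f"neighbor search did not fit after {max_rounds} rounds")
```

With a true maximum of 100 neighbors, the loop tries capacities 64, 80, and 100, so it performs three searches and three host syncs.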
…ble descriptors

Mirror the dpmodel guard in the pt_expt _call_common_graph override: _resolve_graph_method's eligibility check only protects the default (None) path, so an EXPLICIT method reached the builders for descriptors without a graph lower. Regression test drives se_e2_a (mixed_types False) through all four explicit methods and expects NotImplementedError. Addresses OutisLi review on deepmodeling#5714.
…backend split

neighbor_graph_method is consumed only by graph-form .pt2 eval; a non-default value on a nlist-form artifact was silently ignored, misleading users who wanted an O(N) builder (that knob is nlist_backend). Document both knobs in the DeepEval docstring (incl. the planned consolidation at the dense-nlist deprecation) and raise at construction when neighbor_graph_method != 'dense' on a non-graph artifact; regression test on a nlist-form .pt2. Addresses OutisLi review on deepmodeling#5714.
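Here is a hedged sketch of construction-time validation. The graph-form check mirrors the behavior this commit describes. The vesin/nv availability checks follow CodeRabbit's early-validation suggestion in the review and may not match the merged code. Availability is passed in as booleans so the sketch runs without torch or vesin, and `validate_neighbor_graph_method` is a hypothetical name.

```python
from typing import Mapping


def validate_neighbor_graph_method(
    method: str,
    metadata: Mapping[str, str],
    vesin_available: bool,
    cuda_available: bool,
) -> None:
    """Raise at construction time instead of at the first eval() (hypothetical helper)."""
    # Non-default methods only make sense for graph-form .pt2 artifacts.
    if method != "dense" and metadata.get("lower_input_kind") != "graph":
        raise ValueError(
            f"neighbor_graph_method={method!r} only applies to graph-form .pt2 "
            "artifacts; use nlist_backend to select the nlist-path builder."
        )
    # Availability checks (CodeRabbit's suggestion; may differ from merged code).
    if method == "vesin" and not vesin_available:
        raise ImportError("neighbor_graph_method='vesin' requires vesin.torch")
    if method == "nv" and not cuda_available:
        raise ValueError("neighbor_graph_method='nv' requires CUDA")
```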
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deepmd/pt_expt/utils/nv_graph_builder.py (1)
242-244: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Handle the documented 2D coord form here. Any (nf, nloc*3) tensor is reshaped with nf = 1, so a flattened multi-frame input is merged into one neighbor graph; periodic inputs will also fail once box is reshaped with the wrong frame count. Derive nf from atype (or reject 2D inputs explicitly).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt_expt/utils/nv_graph_builder.py` around lines 242 - 244, The coord reshaping logic in nv_graph_builder’s neighbor-graph setup does not correctly handle the documented 2D coord form, since it forces nf=1 for any non-3D tensor and can merge flattened multi-frame inputs into one graph. Update the frame-count derivation in this coord/box preprocessing path to use atype (or explicitly reject unsupported 2D inputs), and make sure the same nf is applied consistently before reshaping coord and box.
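The shape rule behind this fix is sketched below in plain Python on nested lists rather than tensors. `split_frames` is a hypothetical name, and the sketch is not the builder's code. In torch terms it roughly corresponds to taking nf from atype's first dimension and reshaping coord to (nf, -1, 3) and box to (nf, 3, 3). The key point is that coord and box use the same nf.

```python
from typing import List, Optional, Sequence, Tuple

Frames = List[List[List[float]]]  # (nf, nloc, 3)
Boxes = List[List[List[float]]]   # (nf, 3, 3)


def split_frames(
    coord: Sequence[Sequence[float]],
    atype: Sequence[Sequence[int]],
    box: Optional[Sequence[Sequence[float]]] = None,
) -> Tuple[Frames, Optional[Boxes]]:
    """Reshape 2D coord (nf, nloc*3) and box (nf, 9) using nf taken from atype."""
    nf = len(atype)  # frame count comes from atype, never assumed to be 1
    if len(coord) != nf:
        raise ValueError(f"coord has {len(coord)} rows but atype has {nf} frames")
    frames: Frames = []
    for row, types in zip(coord, atype):
        nloc = len(types)
        if len(row) != 3 * nloc:
            raise ValueError(f"expected {3 * nloc} coords per frame, got {len(row)}")
        frames.append([list(row[3 * a : 3 * a + 3]) for a in range(nloc)])
    boxes: Optional[Boxes] = None
    if box is not None:
        if len(box) != nf:  # the same nf must apply to coord and box
            raise ValueError(f"box has {len(box)} rows but atype has {nf} frames")
        boxes = [[list(b[3 * r : 3 * r + 3]) for r in range(3)] for b in box]
    return frames, boxes
```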
🧹 Nitpick comments (2)
deepmd/pt_expt/infer/deep_eval.py (1)
149-208: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win
Consider validating vesin/nv availability at construction, not first eval().
The new fail-fast check (Lines 181-194) only validates that neighbor_graph_method is used on a graph-form artifact; it does not validate that "vesin" (needs vesin.torch) or "nv" (CUDA-only per PR objectives) are actually usable. _setup_nlist_backend (Lines 209-251) eagerly validates its equivalent vesin option (raises ImportError/ValueError at construction), but neighbor_graph_method="vesin"/"nv" will only fail once _build_eval_graph is reached inside the first eval() call, a less consistent, later failure point for the same class of misconfiguration.
💡 Possible early-validation addition
         if (
             neighbor_graph_method != "dense"
             and getattr(self, "metadata", {}).get("lower_input_kind") != "graph"
         ):
             raise ValueError(
                 f"neighbor_graph_method={neighbor_graph_method!r} only applies to "
                 "graph-form .pt2 artifacts (lower_input_kind == 'graph'); this "
                 f"model is not graph-form. Use nlist_backend to select the "
                 "neighbor-list builder for the nlist path."
             )
+        if neighbor_graph_method == "vesin" and not is_vesin_torch_available():
+            raise ImportError(
+                "neighbor_graph_method='vesin' requires 'vesin.torch'; "
+                "install it (`pip install vesin[torch]`) or choose another method."
+            )
+        if neighbor_graph_method == "nv" and not torch.cuda.is_available():
+            raise ValueError(
+                "neighbor_graph_method='nv' requires CUDA; use 'vesin' or 'dense' "
+                "on CPU."
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt_expt/infer/deep_eval.py` around lines 149 - 208, Validate neighbor_graph_method availability in __init__ alongside the existing graph-form check, so misconfigurations fail before first eval(). In deep_eval.py, extend the constructor logic around neighbor_graph_method to eagerly reject "vesin" when vesin.torch is unavailable and "nv" when CUDA is unavailable or unsupported, mirroring the early validation already done by _setup_nlist_backend. Use the same error style as _setup_nlist_backend and keep the check tied to _neighbor_graph_method / _is_pt2 / metadata["lower_input_kind"] so graph-form .pt2 is the only path affected.
deepmd/pt_expt/model/make_model.py (1)
459-469: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Duplicate eligibility-check logic with _resolve_graph_method.
This new guard repeats the exact descriptor = getattr(...); uses_graph_lower = getattr(...) pattern from _resolve_graph_method (Lines 405-409). Extracting a shared helper (e.g. _graph_lower_eligible()) would avoid the two copies drifting if the eligibility rule ever changes.
♻️ Suggested extraction
+    def _graph_lower_eligible(self) -> bool:
+        descriptor = getattr(self.atomic_model, "descriptor", None)
+        uses_graph_lower = getattr(descriptor, "uses_graph_lower", lambda: False)
+        return self.mixed_types() and uses_graph_lower()
+
     def _resolve_graph_method(
         self, neighbor_graph_method: str | None
     ) -> str | None:
         ...
-        descriptor = getattr(self.atomic_model, "descriptor", None)
-        uses_graph_lower = getattr(descriptor, "uses_graph_lower", lambda: False)
-        if self.mixed_types() and uses_graph_lower():
+        if self._graph_lower_eligible():
             return "dense"
         return None

     def _call_common_graph(...):
         ...
-        descriptor = getattr(self.atomic_model, "descriptor", None)
-        uses_graph_lower = getattr(descriptor, "uses_graph_lower", lambda: False)
-        if not (self.mixed_types() and uses_graph_lower()):
+        if not self._graph_lower_eligible():
             raise NotImplementedError(
                 "neighbor_graph_method requires a mixed_types descriptor with a "
                 "graph lower (e.g. dpa1 attn_layer=0)"
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt_expt/model/make_model.py` around lines 459 - 469, The new neighbor_graph_method guard in make_model repeats the same descriptor/uses_graph_lower eligibility check already implemented in _resolve_graph_method, so extract that shared logic into a helper such as _graph_lower_eligible() and use it in both places. Update the eligibility check inside the model-building path around atomic_model/descriptor handling so the rule lives in one place and cannot drift between _resolve_graph_method and the new guard.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 2d7badff-46b5-40f8-8717-74ed796a6485
📒 Files selected for processing (8)
- deepmd/pt/utils/nv_nlist.py
- deepmd/pt_expt/infer/deep_eval.py
- deepmd/pt_expt/model/make_model.py
- deepmd/pt_expt/utils/nv_graph_builder.py
- deepmd/pt_expt/utils/vesin_graph_builder.py
- deepmd/pt_expt/utils/vesin_neighbor_list.py
- source/tests/pt_expt/model/test_graph_builder_dispatch.py
- source/tests/pt_expt/utils/test_graph_pt2_metadata.py
🚧 Files skipped from review as they are similar to previous changes (2)
- source/tests/pt_expt/model/test_graph_builder_dispatch.py
- deepmd/pt_expt/utils/vesin_graph_builder.py
NeighborGraph PR-C — O(N) on-device graph builders (vesin / nv)
Third PR of the NeighborGraph series (after #5581 foundation, #5583 PR-A dpa1 graph forward, #5604 PR-B .pt2/C++). Adds two O(N) carry-all NeighborGraph builders behind the World-2 neighbor_graph_method dispatcher, replacing PR-A's O(N²) dense search / per-frame ASE stopgap with on-device cell lists.
What
- build_neighbor_graph_vesin (deepmd/pt_expt/utils/vesin_graph_builder.py): vesin.torch cell list. Device-following: runs the search on the input tensor's device (CUDA kernel on CUDA input, CPU cell list otherwise).
- build_neighbor_graph_nv (deepmd/pt_expt/utils/nv_graph_builder.py): nvalchemiops GPU cell list, frame-batched (batch_idx/batch_ptr, one kernel for all frames, no Python loop). CUDA-only.
- build_neighbor_graph_ase: search → per-frame local (i, j, S) → neighbor_graph_from_ijs(...), which recomputes edge_vec differentiably from the original grad-carrying coords.
- neighbor_graph_method ∈ {"legacy", "dense", "ase", "vesin", "nv"}) and DeepEval .pt2 graph inference (new neighbor_graph_method kwarg, default "dense" → existing inference byte-identical).
- dpmodel/jax fail-fast on vesin/nv (torch/CUDA-only).
Perf-only: all builders emit the SAME neighbor set as dense (carry-all, sel = normalization-only), proven by exact set-equality; energy/force/virial are unchanged (parity 1e-12 CPU / 1e-10 CUDA).
Layering
- dpmodel stays torch-free: vesin/nv builders live in pt_expt; the dpmodel dispatch only carries a fail-fast message.
- is_vesin_torch_available() / is_nv_available(), ImportError with an install hint on absence.
- deepmd/pt/utils/nv_nlist.py: _matrix_to_extended_inputs Step-1 extraction.
Testing
- test_nv_graph_builder.py: 4 passed (set-equality vs dense periodic + non-periodic, frame-batch, differentiable edge_vec).
- test_graph_builder_dispatch.py: 3 passed on CUDA (vesin parity 1e-10, dpmodel reject vesin + nv, nv parity 1e-10).
- test_graph_deepeval.py: 6 passed on CUDA (.pt2 graph dense parity + vesin, AOTI compile + torch.as_tensor extraction).
Known limitations
- vesin.torch.compute is single-system (no batch dim), so multi-frame vesin loops over frames (each an O(N) search; nf=1 inference has zero loop cost). nv batches natively, making it the loop-free path for batched training.
- pyproject; no "auto" selector (explicit strings only).
- search_capacity = max(64, nloc) initial heuristic + 1.25× grow loop (.item() host-sync per grow).
- from_ijs (differentiable lattice translation, identity gradient; edge_vec is lattice-invariant), consistent with the pt nv-nlist path.
Implements
plan_neighbor_graph_prC_implementation; design spec: discussion wanghan-iapcm#4.
Summary by CodeRabbit
- neighbor_graph_method for graph-form .pt2 inference and tightened dispatch support to dense/ase/vesin/nv.
- type < 0) consistently from neighbor-graph edges for ASE, Vesin, and NV paths.