Skip to content

Harden lazy border_op registration and comm-artifact gating in pt_expt export#5452

Closed
Copilot wants to merge 4 commits into
masterfrom
copilot/fix-review-comments
Closed

Harden lazy border_op registration and comm-artifact gating in pt_expt export#5452
Copilot wants to merge 4 commits into
masterfrom
copilot/fix-review-comments

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented May 22, 2026

This addresses review feedback on PR #5451 around pt_expt lazy comm-op initialization. The fixes make registration atomic/idempotent, prevent premature op-library loading for non-comm models, and align non-serialization comm call paths with explicit initialization.

  • Thread-safe, idempotent comm registration

    • Added a module-level lock in deepmd/pt_expt/utils/comm.py.
    • ensure_comm_registered() now double-checks _registered inside the lock.
    • Duplicate-registration tolerance now also covers torch.library.register_autograd(...) runtime “already registered/has” cases.
  • Correct ordering in with-comm export path

    • In deepmd/pt_expt/utils/serialization.py, _needs_with_comm_artifact(model) is evaluated before importing/calling ensure_comm_registered().
    • Non-comm models now fail fast with the existing ValueError instead of triggering unrelated op-load failures.
  • Explicit initialization for non-serialization comm paths

    • Added explicit ensure_comm_registered() calls at descriptor comm_dict call sites:
      • deepmd/pt_expt/descriptor/repflows.py
      • deepmd/pt_expt/descriptor/repformers.py
    • Updated deepmd/pt_expt/utils/__init__.py and pt_expt test comments to reflect required lazy-init behavior.
  • Focused regression coverage

    • Added tests for registration idempotency/duplicate autograd handling and for export-path gating-before-registration behavior:
      • source/tests/pt_expt/utils/test_comm_registration.py
      • source/tests/pt_expt/utils/test_serialization_with_comm_guard.py
# comm.py
if _registered:
    return
with _register_lock:
    if _registered:
        return
    _check_underlying_ops_loaded()
    ...
    try:
        torch.library.register_autograd(...)
    except RuntimeError as e:
        if "already has" not in str(e) and "already registered" not in str(e):
            raise
    _registered = True

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • download-r2.pytorch.org
    • Triggering command: /opt/hostedtoolcache/uv/0.11.16/x86_64/uv uv pip install torch --index-url REDACTED (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

anyangml and others added 3 commits May 22, 2026 05:17
Co-authored-by: anyangml <137014849+anyangml@users.noreply.github.com>
for more information, see https://pre-commit.ci

Co-authored-by: anyangml <137014849+anyangml@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix code based on review comments Harden lazy border_op registration and comm-artifact gating in pt_expt export May 22, 2026
Copilot AI requested a review from anyangml May 22, 2026 05:22
except RuntimeError as e:
if "already has" not in str(e) and "already registered" not in str(e):
raise
_registered = True
@anyangml anyangml closed this May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants