fix(test): tear down LAMMPS before MPI.Finalize() in mpirun test runners #5455

Merged
wanghan-iapcm merged 1 commit into deepmodeling:master from wanghan-iapcm:fix-spin-empty-subdomain-shutdown-race
May 23, 2026
Conversation

@wanghan-iapcm
Collaborator

@wanghan-iapcm wanghan-iapcm commented May 23, 2026

Summary

  • Adds an explicit del lammps before MPI.Finalize() in all four mpirun-driven LAMMPS test runners (run_mpi_pair_deepmd.py, run_mpi_pair_deepmd_spin.py, run_mpi_pair_deepmd_dpa3_pt2.py, run_mpi_pair_deepmd_spin_dpa3_pt2.py).
  • Fixes a teardown-order race that intermittently manifests as subprocess exit code 136 (SIGFPE) for test_pair_deepmd_mpi_dpa3_spin_empty_subdomain on the GitHub Actions CUDA runner image.

Background

Recent CI runs on multiple unrelated PRs (#5446, #5450) hit the identical failure signature:

short test summary info ============================
FAILED source/lmp/tests/test_lammps_spin_dpa3_pt2.py::test_pair_deepmd_mpi_dpa3_spin_empty_subdomain
  - subprocess.CalledProcessError: ... returned non-zero exit status 136.
  • Reproduces ~1 in 5 runs on the GitHub Actions CUDA image (nvidia/cuda:12.9.1-cudnn-devel-ubuntu22.04).
  • Does not reproduce on a V100 Bohrium dev box — 60/60 consecutive passes.

So it's a pre-existing flake, not caused by either of the recent PRs.
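
For readers unfamiliar with the `136` in the failure signature, the mapping to SIGFPE can be checked with a few lines of Python. This is a minimal sketch, assuming the launcher follows the POSIX shell convention of reporting a signal death as 128 + N; the helper name `describe_exit_status` is hypothetical and not part of the test suite.

```python
import signal


def describe_exit_status(status):
    """Return the signal name implied by an exit status, or None if there is none."""
    if status < 0:
        # subprocess reports a child killed directly by signal N as -N
        signum = -status
    elif status > 128:
        # shell convention: 128 + N (assumed to be what mpirun propagates here)
        signum = status - 128
    else:
        return None
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Not a known signal number on this platform
        return None


if __name__ == "__main__":
    print(describe_exit_status(136))
```

On Linux, where SIGFPE is signal 8, this resolves 136 to `SIGFPE`, matching the interpretation in the Summary.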

Root cause (empirically confirmed)

The runner ends with:

forces_global = lammps.lmp.gather_atoms(...)
...
MPI.Finalize()

lammps is still alive when MPI.Finalize() returns. Python then garbage-collects it during interpreter shutdown, which triggers LAMMPS::~LAMMPS → Finish::end() → MPI_Allreduce for timing aggregation. By that time, MPI has already been finalized, so the call is undefined behavior.
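
The ordering problem can be illustrated without MPI or LAMMPS installed. The sketch below uses two hypothetical stand-ins, `FakeMPI` and `FakeLammps`, which are not the real mpi4py or LAMMPS APIs; `close()` plays the role of the C++ destructor, and the stand-in raises an exception where real MPI would give undefined behavior.

```python
class FakeMPI:
    """Hypothetical stand-in for the MPI runtime; tracks only finalization state."""

    def __init__(self):
        self.finalized = False
        self.log = []

    def allreduce(self, value):
        if self.finalized:
            # Real MPI is undefined here; the stand-in fails loudly instead.
            raise RuntimeError("collective called after Finalize")
        self.log.append("allreduce")
        return value

    def Finalize(self):
        self.finalized = True
        self.log.append("finalize")


class FakeLammps:
    """Hypothetical stand-in for the LAMMPS wrapper object."""

    def __init__(self, mpi):
        self.mpi = mpi

    def close(self):
        # Mimics the destructor-time collective used for timing aggregation.
        self.mpi.allreduce(0.0)


def run_buggy_order():
    """Finalize first, tear down second: the order the runners used before the fix."""
    mpi = FakeMPI()
    lmp = FakeLammps(mpi)
    mpi.Finalize()
    try:
        lmp.close()
    except RuntimeError as exc:
        return str(exc)
    return "ok"


def run_fixed_order():
    """Tear down first, finalize second: the order after the fix."""
    mpi = FakeMPI()
    lmp = FakeLammps(mpi)
    lmp.close()
    mpi.Finalize()
    return mpi.log
```

In the real runners the teardown is triggered by `del lammps`. Note that in CPython `del` only drops one reference, so the destructor runs at that point only if no other reference to the object remains.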

I instrumented the runner with timestamped prints to verify the order directly. Without the fix:

t=3311.770  R1: BEFORE MPI.Finalize
t=3311.778  R0/R1: AFTER MPI.Finalize     ← MPI is finalized
t=3311.778  R0/R1: PY ATEXIT
… process exit, LAMMPS destructor runs HERE

With the fix:

t=3423.100  R1: AFTER del lammps (LAMMPS destructor done)   ← MPI still up
t=3423.108  R0/R1: BEFORE MPI.Finalize
t=3423.108  R0/R1: AFTER MPI.Finalize

So the LAMMPS destructor now runs while MPI is still up, which is what its MPI_Allreduce/MPI_Gather calls require.
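
The instrumentation itself is not included in this PR. A minimal sketch of how such timestamped markers could be produced is shown below; the helper names `stamp` and `install_atexit_marker` are hypothetical, and the output format is only modeled on the traces above.

```python
import atexit
import time


def stamp(rank, message, clock=time.monotonic):
    """Print and return one trace line in the form 't=<seconds>  R<rank>: <message>'."""
    line = f"t={clock():.3f}  R{rank}: {message}"
    # flush=True so the line is not lost in a buffer if the process dies abruptly
    print(line, flush=True)
    return line


def install_atexit_marker(rank):
    """Register a marker that prints when the interpreter runs its atexit handlers."""
    atexit.register(stamp, rank, "PY ATEXIT")


if __name__ == "__main__":
    install_atexit_marker(0)
    stamp(0, "BEFORE MPI.Finalize")
    stamp(0, "AFTER MPI.Finalize")
```

Calling `stamp` immediately before and after each teardown step gives an ordering that can be compared across ranks, provided each rank's clock is only compared against itself.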

The reason this manifests as SIGFPE only on the CUDA CI image (not on V100) is most likely that the CI image (or one of its preloaded libraries) enables FP-exception trapping; on V100 the same MPI-after-Finalize errors return silently. The flake is environment-specific, but the underlying antipattern is unconditional and worth fixing in any environment.

Test plan

  • Local CPU: 29/29 LAMMPS tests pass (test_lammps_dpa3_pt2.py, test_lammps_spin_dpa3_pt2.py)
  • Remote V100: 50/50 stress runs of the previously-failing test
  • Empirical confirmation that the fix flips the LAMMPS-destructor-vs-MPI.Finalize ordering (see Root cause)
  • CI: re-run the spin LAMMPS suite multiple times to confirm the SIGFPE no longer appears
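
The stress runs listed above amount to repeating one command and recording every non-zero exit status. A minimal sketch of such a loop follows; the function name `stress` is hypothetical, and the demo command is a placeholder rather than the actual pytest invocation used for this PR.

```python
import subprocess
import sys


def stress(cmd, runs):
    """Run cmd `runs` times and return the list of non-zero exit statuses."""
    failures = []
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures.append(result.returncode)
    return failures


if __name__ == "__main__":
    # Placeholder command; substitute the real test invocation.
    demo_cmd = [sys.executable, "-c", "pass"]
    total = 5
    failed = stress(demo_cmd, total)
    print(f"{total - len(failed)}/{total} passed, failing statuses: {failed}")
```

Keeping the raw statuses, rather than only a pass count, makes it possible to tell a signal death such as 136 apart from an ordinary assertion failure.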

Known limitations

  • Cannot directly observe the SIGFPE on V100, so the fix has not been observed preventing the actual crash — only correcting the antipattern that we have strong reason to believe causes it.
  • If the failure persists after merge, the next candidate root cause is CUDA stream destruction order, and we should revisit.

Summary by CodeRabbit

  • Bug Fixes
    • Improved MPI cleanup sequence in multiple test runners to prevent finalization-related crashes when executing tests in distributed MPI environments.

Review Change Stack

…n runners

The mpirun-driven LAMMPS test runners called ``MPI.Finalize()`` at the
end of the script with the ``lammps`` Python object still alive.  When
the interpreter then shut down, the LAMMPS C++ destructor ran in a
state where MPI was already finalized — and LAMMPS' ``Finish::end``,
fix/compute teardown, and the deep[m|spin] pair-style destructor chain
all issue MPI collectives (``MPI_Gather`` / ``MPI_Reduce``) during
cleanup.  On the empty-subdomain rank (no local atoms but live ghost
atoms), the asymmetric MPI traffic during destruction occasionally
hit an MPI-after-Finalize error path and crashed the rank with SIGFPE,
manifesting in CUDA CI as ``exit status 136`` of the subprocess for
``test_pair_deepmd_mpi_dpa3_spin_empty_subdomain``.

The crash was intermittent (1 fail in ~5 runs) on the GitHub Actions
CUDA runner, not reproducible on a V100 dev box.  PR deepmodeling#5446 (unrelated
to MPI / spin / CUDA code) hit the same flake — confirming it's a
pre-existing teardown race in the test runners, not a regression in
either PR.

The fix is mechanical and identical in all four runners: ``del lammps``
before ``MPI.Finalize()`` so the LAMMPS instance is torn down while
the communicator is still valid.
@dosubot dosubot Bot added the bug label May 23, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 102fb446-43ed-41d6-a05a-669cde866127

📥 Commits

Reviewing files that changed from the base of the PR and between 4604131 and f1e144e.

📒 Files selected for processing (4)
  • source/lmp/tests/run_mpi_pair_deepmd.py
  • source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py
  • source/lmp/tests/run_mpi_pair_deepmd_spin.py
  • source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py

📝 Walkthrough


Four MPI test runner scripts are updated to explicitly tear down the PyLammps instance immediately before calling MPI.Finalize(). This prevents MPI-related crashes from the LAMMPS destructor invoking MPI operations after finalization, ensuring proper cleanup order across all runners.

Changes

MPI Test Cleanup

| Layer / File(s) | Summary |
| --- | --- |
| Explicit PyLammps teardown before MPI.Finalize(): source/lmp/tests/run_mpi_pair_deepmd.py, source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py, source/lmp/tests/run_mpi_pair_deepmd_spin.py, source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py | All four MPI runners explicitly delete the PyLammps object before MPI.Finalize() with comments explaining the fix prevents LAMMPS destructor-time MPI calls after finalization. |

🎯 2 (Simple) | ⏱️ ~7 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: explicitly tearing down the PyLammps instance before calling MPI.Finalize() in four mpirun-driven test runners. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@wanghan-iapcm wanghan-iapcm requested a review from njzjz May 23, 2026 06:58
Contributor

@njzjz-bot njzjz-bot left a comment

Thanks for the detailed root-cause analysis. This teardown-order fix looks good to me: all four mpirun LAMMPS runners now release the lammps object before MPI.Finalize(), so any destructor-side MPI calls still happen while MPI is valid. The comments are also helpful and scoped to the observed CI flake.

CI is still running on this PR, so I’m approving the code change contingent on the remaining checks finishing green.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

@codecov

codecov Bot commented May 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.50%. Comparing base (4604131) to head (f1e144e).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5455   +/-   ##
=======================================
  Coverage   82.50%   82.50%           
=======================================
  Files         830      830           
  Lines       88559    88559           
  Branches     4241     4241           
=======================================
+ Hits        73065    73066    +1     
  Misses      14201    14201           
+ Partials     1293     1292    -1     


@njzjz njzjz added this pull request to the merge queue May 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 23, 2026
@wanghan-iapcm wanghan-iapcm added this pull request to the merge queue May 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 23, 2026
@wanghan-iapcm wanghan-iapcm added this pull request to the merge queue May 23, 2026
Merged via the queue into deepmodeling:master with commit 9245a7b May 23, 2026
73 checks passed
@wanghan-iapcm wanghan-iapcm deleted the fix-spin-empty-subdomain-shutdown-race branch May 23, 2026 20:00