fix(test): tear down LAMMPS before MPI.Finalize() in mpirun test runners by wanghan-iapcm · Pull Request #5455 · deepmodeling/deepmd-kit

wanghan-iapcm · 2026-05-23T06:54:10Z

Summary

Adds an explicit del lammps before MPI.Finalize() in all four mpirun-driven LAMMPS test runners (run_mpi_pair_deepmd.py, run_mpi_pair_deepmd_spin.py, run_mpi_pair_deepmd_dpa3_pt2.py, run_mpi_pair_deepmd_spin_dpa3_pt2.py).
Fixes a teardown-order race that intermittently manifests as subprocess exit code 136 (SIGFPE) for test_pair_deepmd_mpi_dpa3_spin_empty_subdomain on the GitHub Actions CUDA runner image.

Background

Recent CI runs on multiple unrelated PRs (#5446, #5450) hit the identical failure signature:

=========================== short test summary info ============================
FAILED source/lmp/tests/test_lammps_spin_dpa3_pt2.py::test_pair_deepmd_mpi_dpa3_spin_empty_subdomain
  - subprocess.CalledProcessError: ... returned non-zero exit status 136.

Reproduces ~1 in 5 runs on the GitHub Actions CUDA image (nvidia/cuda:12.9.1-cudnn-devel-ubuntu22.04).
Does not reproduce on a V100 Bohrium dev box — 60/60 consecutive passes.

So it's a pre-existing flake, not caused by either of the recent PRs.
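For readers decoding the number: exit status 136 is conventionally read as 128 + 8, and 8 is SIGFPE on Linux. A minimal sketch of that decoding follows. It assumes the launcher reports a signal-killed rank using the shell-style 128 + signum convention, and `describe_exit_status` is a hypothetical helper, not something in the test suite.

```python
import signal


def describe_exit_status(status):
    """Map a shell-style exit status to a signal name when it looks like 128 + signum."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exit {status}"


if __name__ == "__main__":
    print(describe_exit_status(136))
```

Statuses that do not map to a known signal fall through to the plain `exit N` form, so the helper can be applied to any return code.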

Root cause (empirically confirmed)

The runner ends with:

forces_global = lammps.lmp.gather_atoms(...)
...
MPI.Finalize()

lammps is still alive when MPI.Finalize() returns. Python then garbage-collects it during interpreter shutdown, which triggers LAMMPS::~LAMMPS → Finish::end() → MPI_Allreduce for timing aggregation. By that time, MPI has already been finalized, which is undefined behavior.
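The ordering problem can be illustrated without MPI or LAMMPS at all. The sketch below uses toy stand-ins (`FakeMPI`, `FakeLammps`, and `run` are all hypothetical names) and relies on CPython's reference counting, where `del` on the last reference runs `__del__` immediately. In the unfixed branch, the late `del` stands in for garbage collection at interpreter shutdown.

```python
# Toy stand-ins only: no real MPI runtime or LAMMPS instance is involved.
class FakeMPI:
    """Tracks whether the pretend MPI runtime is still usable."""

    def __init__(self):
        self.finalized = False

    def finalize(self):
        self.finalized = True


class FakeLammps:
    """Pretend LAMMPS handle whose destructor would issue an MPI collective."""

    def __init__(self, mpi, log):
        self.mpi = mpi
        self.log = log

    def __del__(self):
        # The real destructor would call into MPI here; we only record the state.
        if self.mpi.finalized:
            self.log.append("destructor: after Finalize (undefined behavior)")
        else:
            self.log.append("destructor: MPI still up")


def run(fixed):
    """Return the order of events with or without the explicit `del`."""
    mpi, log = FakeMPI(), []
    lammps = FakeLammps(mpi, log)
    if fixed:
        del lammps  # last reference dropped, so __del__ runs right here
    mpi.finalize()
    log.append("finalize")
    if not fixed:
        del lammps  # stands in for garbage collection at interpreter shutdown
    return log


if __name__ == "__main__":
    print(run(fixed=False))
    print(run(fixed=True))
```

With `fixed=True` the destructor entry precedes `finalize` in the returned log; with `fixed=False` it follows it, which is the case the real runners were hitting.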

I instrumented the runner with timestamped prints to verify the order directly. Without the fix:

t=3311.770  R1: BEFORE MPI.Finalize
t=3311.778  R0/R1: AFTER MPI.Finalize     ← MPI is finalized
t=3311.778  R0/R1: PY ATEXIT
… process exit, LAMMPS destructor runs HERE

With the fix:

t=3423.100  R1: AFTER del lammps (LAMMPS destructor done)   ← MPI still up
t=3423.108  R0/R1: BEFORE MPI.Finalize
t=3423.108  R0/R1: AFTER MPI.Finalize

So the LAMMPS destructor now runs while MPI is still up, which is what its MPI_Allreduce/MPI_Gather calls require.
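Trace lines in the shape shown above can be produced with a small helper like the following. This is a sketch, not the instrumentation actually used: `stamp` and `trace` are hypothetical names, and the choice of `time.monotonic()` as the clock is an assumption.

```python
import time


def stamp(rank, label, now=None):
    """Format one trace line; `now` is injectable so the output can be tested."""
    t = time.monotonic() if now is None else now
    return f"t={t:.3f}  R{rank}: {label}"


def trace(rank, label):
    # flush=True so the line is not lost if the process dies right after printing
    print(stamp(rank, label), flush=True)


if __name__ == "__main__":
    trace(0, "BEFORE MPI.Finalize")
```

Registering `trace` with the standard-library `atexit.register(trace, rank, "PY ATEXIT")` would give a marker comparable to the `PY ATEXIT` line, since `atexit.register` forwards extra positional arguments to the callback.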

The failure most likely surfaces as SIGFPE only on the CUDA CI image because that image (or one of its preloaded libraries) enables floating-point exception trapping; on the V100 box the same MPI-after-Finalize calls fail silently. The flake is environment-specific, but the underlying antipattern is present in every environment and is worth fixing regardless.

Test plan

Local CPU: 29/29 LAMMPS tests pass (test_lammps_dpa3_pt2.py, test_lammps_spin_dpa3_pt2.py)
Remote V100: 50/50 stress runs of the previously-failing test
Empirical confirmation that the fix flips the LAMMPS-destructor-vs-MPI.Finalize ordering (see Root cause)
CI: re-run the spin LAMMPS suite multiple times to confirm the SIGFPE no longer appears
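A stress run of this kind can be scripted along these lines. The sketch is illustrative only: `stress` is a hypothetical helper, and the command in the `__main__` block is a placeholder for the real pytest invocation of the flaky test.

```python
import collections
import subprocess
import sys


def stress(cmd, runs):
    """Run `cmd` repeatedly and tally its return codes."""
    tally = collections.Counter()
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True)
        tally[result.returncode] += 1
    return tally


if __name__ == "__main__":
    # Placeholder command; substitute the actual test command when using this.
    print(stress([sys.executable, "-c", "raise SystemExit(0)"], runs=3))
```

Tallying return codes rather than stopping at the first failure makes it easy to report a rate such as "1 in 5" and to see whether a given status shows up at all.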

Known limitations

The SIGFPE cannot be reproduced on V100, so the fix has not been observed to prevent the actual crash; it only corrects the antipattern that we have strong reason to believe causes it.
If the failure persists after merge, the next candidate root cause is CUDA stream destruction order, and we should revisit.

Summary by CodeRabbit

Bug Fixes
- Improved MPI cleanup sequence in multiple test runners to prevent finalization-related crashes when executing tests in distributed MPI environments.

…n runners

The mpirun-driven LAMMPS test runners called ``MPI.Finalize()`` at the end of the script with the ``lammps`` Python object still alive. When the interpreter then shut down, the LAMMPS C++ destructor ran in a state where MPI was already finalized, and LAMMPS' ``Finish::end``, fix/compute teardown, and the deep[m|spin] pair-style destructor chain all issue MPI collectives (``MPI_Gather`` / ``MPI_Reduce``) during cleanup.

On the empty-subdomain rank (no local atoms but live ghost atoms), the asymmetric MPI traffic during destruction occasionally hit an MPI-after-Finalize error path and crashed the rank with SIGFPE, manifesting in CUDA CI as ``exit status 136`` of the subprocess for ``test_pair_deepmd_mpi_dpa3_spin_empty_subdomain``.

The crash was intermittent (1 fail in ~5 runs) on the GitHub Actions CUDA runner, not reproducible on a V100 dev box. PR deepmodeling#5446 (unrelated to MPI / spin / CUDA code) hit the same flake, confirming it's a pre-existing teardown race in the test runners, not a regression in either PR.

The fix is mechanical and identical in all four runners: ``del lammps`` before ``MPI.Finalize()`` so the LAMMPS instance is torn down while the communicator is still valid.

coderabbitai · 2026-05-23T06:56:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 102fb446-43ed-41d6-a05a-669cde866127

📥 Commits

Reviewing files that changed from the base of the PR and between 4604131 and f1e144e.

📒 Files selected for processing (4)

source/lmp/tests/run_mpi_pair_deepmd.py
source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py
source/lmp/tests/run_mpi_pair_deepmd_spin.py
source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py

Walkthrough

Four MPI test runner scripts are updated to explicitly tear down the PyLammps instance immediately before calling MPI.Finalize(). This prevents MPI-related crashes from the LAMMPS destructor invoking MPI operations after finalization, ensuring proper cleanup order across all runners.

Changes

MPI Test Cleanup

Layer / File(s)	Summary
Explicit PyLammps teardown before MPI.Finalize() `source/lmp/tests/run_mpi_pair_deepmd.py`, `source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py`, `source/lmp/tests/run_mpi_pair_deepmd_spin.py`, `source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py`	All four MPI runners explicitly delete the PyLammps object before MPI.Finalize() with comments explaining the fix prevents LAMMPS destructor-time MPI calls after finalization.

🎯 2 (Simple) | ⏱️ ~7 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and specifically describes the main change: explicitly tearing down the PyLammps instance before calling MPI.Finalize() in four mpirun-driven test runners.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

njzjz-bot

Thanks for the detailed root-cause analysis. This teardown-order fix looks good to me: all four mpirun LAMMPS runners now release the lammps object before MPI.Finalize(), so any destructor-side MPI calls still happen while MPI is valid. The comments are also helpful and scoped to the observed CI flake.

CI is still running on this PR, so I’m approving the code change contingent on the remaining checks finishing green.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

codecov · 2026-05-23T07:54:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.50%. Comparing base (4604131) to head (f1e144e).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5455   +/-   ##
=======================================
  Coverage   82.50%   82.50%           
=======================================
  Files         830      830           
  Lines       88559    88559           
  Branches     4241     4241           
=======================================
+ Hits        73065    73066    +1     
  Misses      14201    14201           
+ Partials     1293     1292    -1

dosubot Bot added the bug label May 23, 2026

github-actions Bot added the LAMMPS label May 23, 2026

wanghan-iapcm requested a review from njzjz May 23, 2026 06:58

njzjz-bot approved these changes May 23, 2026

njzjz approved these changes May 23, 2026

njzjz added this pull request to the merge queue May 23, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 23, 2026

wanghan-iapcm added this pull request to the merge queue May 23, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 23, 2026

wanghan-iapcm added this pull request to the merge queue May 23, 2026

Merged via the queue into deepmodeling:master with commit 9245a7b May 23, 2026
73 checks passed

wanghan-iapcm deleted the fix-spin-empty-subdomain-shutdown-race branch May 23, 2026 20:00
