fix(test): tear down LAMMPS before MPI.Finalize() in mpirun test runners #5455
…n runners

The mpirun-driven LAMMPS test runners called ``MPI.Finalize()`` at the end of the script with the ``lammps`` Python object still alive. When the interpreter then shut down, the LAMMPS C++ destructor ran after MPI had already been finalized, even though LAMMPS' ``Finish::end``, fix/compute teardown, and the deep[m|spin] pair-style destructor chain all issue MPI collectives (``MPI_Gather`` / ``MPI_Reduce``) during cleanup.

On the empty-subdomain rank (no local atoms but live ghost atoms), the asymmetric MPI traffic during destruction occasionally hit an MPI-after-Finalize error path and crashed the rank with SIGFPE, manifesting in CUDA CI as ``exit status 136`` of the subprocess for ``test_pair_deepmd_mpi_dpa3_spin_empty_subdomain``.

The crash was intermittent (about 1 failure in 5 runs) on the GitHub Actions CUDA runner and was not reproducible on a V100 dev box. PR deepmodeling#5446 (unrelated to MPI / spin / CUDA code) hit the same flake, which confirms that it is a pre-existing teardown race in the test runners and not a regression in either PR.

The fix is mechanical and identical in all four runners: ``del lammps`` before ``MPI.Finalize()``, so the LAMMPS instance is torn down while the communicator is still valid.
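As a small aside on the failure signature, the sketch below shows how an exit status such as 136 decodes to a signal, assuming the common "128 + N" convention that POSIX shells use for processes killed by signal N. The helper name `describe_exit_status` is hypothetical and is not part of the test suite.

```python
import signal


def describe_exit_status(status):
    """Return the signal name for a shell-style 128 + N exit status, else None."""
    if status <= 128:
        return None  # ordinary exit code, not a signal death
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


# 136 - 128 = 8, which is SIGFPE on Linux.
print(describe_exit_status(136))  # prints SIGFPE on Linux
```

Python's own `subprocess` module reports a signal death as a negative return code (-N); the 128 + N form typically appears when a shell or launcher sits between the test and the crashing process.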
Walkthrough
Four MPI test runner scripts are updated to explicitly tear down the PyLammps instance immediately before calling MPI.Finalize(). This prevents MPI-related crashes from the LAMMPS destructor invoking MPI operations after finalization, ensuring proper cleanup order across all runners.
njzjz-bot left a comment
Thanks for the detailed root-cause analysis. This teardown-order fix looks good to me: all four mpirun LAMMPS runners now release the lammps object before MPI.Finalize(), so any destructor-side MPI calls still happen while MPI is valid. The comments are also helpful and scoped to the observed CI flake.
CI is still running on this PR, so I’m approving the code change contingent on the remaining checks finishing green.
— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff           @@
##           master    #5455   +/-   ##
=======================================
  Coverage   82.50%   82.50%
=======================================
  Files         830      830
  Lines       88559    88559
  Branches     4241     4241
=======================================
+ Hits        73065    73066    +1
  Misses      14201    14201
+ Partials     1293     1292    -1
Summary
- `del lammps` before `MPI.Finalize()` in all four mpirun-driven LAMMPS test runners (`run_mpi_pair_deepmd.py`, `run_mpi_pair_deepmd_spin.py`, `run_mpi_pair_deepmd_dpa3_pt2.py`, `run_mpi_pair_deepmd_spin_dpa3_pt2.py`).
- Addresses the intermittent crash in `test_pair_deepmd_mpi_dpa3_spin_empty_subdomain` on the GitHub Actions CUDA runner image.
Background
Recent CI runs on multiple unrelated PRs (#5446, #5450) hit the identical failure signature:
nvidia/cuda:12.9.1-cudnn-devel-ubuntu22.04).
So it's a pre-existing flake, not caused by either of the recent PRs.
Root cause (empirically confirmed)
The runner ends with:
`lammps` is still alive when `MPI.Finalize()` returns. Python then garbage-collects it during interpreter shutdown, which triggers `LAMMPS::~LAMMPS` → `Finish::end()` → `MPI_Allreduce` for timing aggregation. By that time, MPI has already been finalized, which is undefined behavior.
I instrumented the runner with timestamped prints to verify the order directly. Without the fix:
With the fix:
So the LAMMPS destructor now runs while MPI is still up, which is what its `MPI_Allreduce`/`MPI_Gather` calls require.
The reason this manifests as SIGFPE only on the CUDA CI image (not on V100) is most likely that the CI image (or one of its preloaded libraries) enables FP-exception trapping; on V100 the same MPI-after-Finalize errors return silently. The flake is environment-specific, but the underlying antipattern is unconditional and worth fixing in any environment.
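The ordering argument can be illustrated without MPI or LAMMPS installed. The sketch below is only a simulation: `FakeMPI` and `FakeLammps` are hypothetical stand-ins, not the mpi4py or LAMMPS Python APIs, and it relies on CPython's reference counting running `__del__` as soon as the last reference is dropped.

```python
class FakeMPI:
    """Stand-in (hypothetical) that tracks whether the runtime is finalized."""

    def __init__(self):
        self.finalized = False

    def finalize(self):
        self.finalized = True

    def allreduce(self, value):
        # A real MPI library may crash or misbehave here; we raise instead.
        if self.finalized:
            raise RuntimeError("collective called after finalize")
        return value


class FakeLammps:
    """Stand-in (hypothetical) whose destructor issues one collective."""

    def __init__(self, mpi, log):
        self._mpi = mpi
        self._log = log

    def __del__(self):
        # Exceptions escaping __del__ are ignored by Python, so record them.
        try:
            self._mpi.allreduce(0.0)
            self._log.append("destructor ok")
        except RuntimeError as exc:
            self._log.append(f"destructor failed: {exc}")


def run(teardown_first):
    """Return the destructor log for one of the two teardown orders."""
    mpi, log = FakeMPI(), []
    engine = FakeLammps(mpi, log)
    if teardown_first:
        del engine        # destructor runs while "MPI" is still valid
        mpi.finalize()
    else:
        mpi.finalize()
        del engine        # destructor runs after finalize, as in the flake
    return log


print(run(teardown_first=True))
print(run(teardown_first=False))
```

`del` only removes one reference, so in a real runner the destructor runs at that point only if no other variable still refers to the object.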
Test plan
`test_lammps_dpa3_pt2.py`, `test_lammps_spin_dpa3_pt2.py`)
Known limitations
Summary by CodeRabbit