Skip to content

Conversation

@EddyMM
Copy link
Contributor

@EddyMM EddyMM commented Aug 23, 2023

Description of changes

  • One of the test assertions (_test_mpi_job_termination) checks if cancellation of an MPI process does not leave any stray processes
  • The MPI job executed is an All-to-All MPI job. Timing when the job is running in order to cancel it is hard to predict.
  • These code changes use a simple sleep process instead with a deterministic runtime (5 mins). This way the test's check intervals have a higher guarantee of catching the MPI process while it's running.
  • The test also separates the clean up step (removing the shared file) from the retry steps

Tests

  • Ran the test (test_slurm::test_slurm) with Ubuntu2204

References

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…special timing conditions.

Signed-off-by: Eddy Mwiti <eddmwiti@amazon.com>
@EddyMM EddyMM added skip-changelog-update Disables the check that enforces changelog updates in PRs 3.x labels Aug 23, 2023
@EddyMM EddyMM changed the title [release-3.7] Use a fixed time MPI job to test job cancellation of MPI processes [release-3.7] [integ-tests] Use a fixed time MPI job to test job cancellation of MPI processes Aug 23, 2023
@codecov
Copy link

codecov bot commented Aug 23, 2023

Codecov Report

Merging #5665 (0f1ea5a) into release-3.7 (d7c44f5) will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff              @@
##           release-3.7    #5665   +/-   ##
============================================
  Coverage        89.89%   89.89%           
============================================
  Files              179      179           
  Lines            15366    15366           
============================================
  Hits             13813    13813           
  Misses            1553     1553           
Flag Coverage Δ
unittests 89.89% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

@EddyMM EddyMM marked this pull request as ready for review August 23, 2023 10:25
@EddyMM EddyMM requested review from a team as code owners August 23, 2023 10:25

module load intelmpi
mpirun -n 6 IMB-MPI1 Alltoall -npmin 2
mpirun -n 6 bash -c 'sleep 300' -npmin 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this we are not using MPI at all because each node will launch a process that will sleep for a fixed amount of time, but no interaction will happen between the nodes.

Let's sync up about this.

Copy link
Contributor Author

@EddyMM EddyMM Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test involves checking if cancellation of an MPI process does not leave any stray processes. The message exchange may not be needed but as discussed, this test "implicitly" covers IntelMPI tests so this PR has been opened pending follow up on reviewing the timing issue on Ubuntu2204: #5666

@EddyMM
Copy link
Contributor Author

EddyMM commented Aug 23, 2023

Closed in preference of: #5665

@EddyMM EddyMM reopened this Aug 23, 2023
@EddyMM EddyMM enabled auto-merge (squash) August 23, 2023 16:09
@EddyMM EddyMM merged commit 1605833 into aws:release-3.7 Aug 23, 2023
EddyMM added a commit to EddyMM/aws-parallelcluster that referenced this pull request Aug 28, 2023
…special timing conditions. (aws#5665)

Signed-off-by: Eddy Mwiti <eddmwiti@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.x skip-changelog-update Disables the check that enforces changelog updates in PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants