
[Test] Fix issues causing intermittent test failures: test_efa, test_ad_integration, test_queue_parameters_update, test_slurm, test_multiple_efs #7357

Merged
gmarciani merged 8 commits into aws:develop from gmarciani:wip/mgiacomo/3160/fix-tests-0429-1 on Apr 30, 2026

Conversation

@gmarciani (Contributor) commented Apr 29, 2026

Description of changes

Fixed issues causing intermittent failures on the following tests:

  • test_efa: Replaced hpc5a.48xlarge on us-east-2 with hpc6a.48xlarge on eu-north-1 to reduce the risk of insufficient capacity exceptions.
  • test_ad_integration: Fixed a race condition in the SSH key generation check by adding a sleep before the ls command inside the switch-user session, giving the PAM hook time to finish generating the user's SSH key and avoiding sporadic "Permission denied" failures.
  • test_queue_parameters_update: Prevented the test from hanging indefinitely when dumping job output, caused by a weak condition and the lack of an SSH timeout. Hangs in this test could also trigger the watchdog to kill unrelated tests sharing the same VPC.
  • test_multiple_efs: Switched the auxiliary instance that writes EFS data from AL2 to AL2023. AL2 reaches end of life in June 2026, and AL2023 should provide a more robust networking driver bootstrap, preventing the sporadic networking failures observed in this test.
  • test_slurm: Removed an unnecessary dependency on NFS shared storage by replacing the shared-file round-trip with an srun that streams output back directly. NFS cross-client visibility is already covered by storage-specific tests and was just noise here.

Also included the following improvements:

  1. Improved RCA retrieval by including the chef client log captured in the head node console logs.
  2. Removed redundant use of fixture marks so that our tests can run with pytest 8+, which is the version pulled in when we run the tests with Python 3.10+.

Tests

  • SUCCESS test_efa
  • SUCCESS test_ad_integration
  • SUCCESS test_queue_parameters_update
  • SUCCESS test_slurm
  • SUCCESS test_multiple_efs

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani changed the title from "Wip/mgiacomo/3160/fix tests 0429 1" to "[Test] Fix issues causing intermittent test failures: test_efa, test_ad_integration, test_queue_parameters_update, test_slurm, test_multiple_efs" Apr 29, 2026
@gmarciani added the skip-changelog-update, 3.x, and Test labels Apr 29, 2026
@gmarciani marked this pull request as ready for review April 30, 2026 13:05
@gmarciani requested review from a team as code owners April 30, 2026 13:05
Comment thread tests/integration-tests/tests/schedulers/test_slurm.py Outdated
@gmarciani force-pushed the wip/mgiacomo/3160/fix-tests-0429-1 branch from 1742f7b to dd37b98 April 30, 2026 14:07
himani2411 previously approved these changes Apr 30, 2026
…large on eu-north-1

to reduce the risk of insufficient capacity exceptions.
…heck

Add a sleep before the ls command inside the switch-user session to allow
the PAM hook time to finish generating the user's SSH key. Without it, ls
may run before key generation completes, causing sporadic "Permission
denied" failures.
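
The race fix in this commit can be sketched roughly as below. This is an illustration only, not the code from the PR: the helper name `build_key_check_command`, the default delay, and the key path used in the usage note are all assumptions.

```python
import shlex


def build_key_check_command(key_path: str, delay_seconds: int = 5) -> str:
    """Return a shell command that waits briefly, then lists the SSH key.

    Hypothetical helper: the delay and the key path are illustrative
    assumptions, not the values used by test_ad_integration.
    """
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")
    # The sleep runs inside the switch-user session, so the PAM hook has
    # time to finish writing the key before ls looks for it.
    return f"sleep {delay_seconds} && ls {shlex.quote(key_path)}"
```

For example, `build_key_check_command("/home/user1/.ssh/id_rsa", 3)` yields `sleep 3 && ls /home/user1/.ssh/id_rsa`. A fixed sleep is the approach the commit describes; polling for the file until a deadline would be an alternative with different trade-offs.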
…s EFS data from AL2 to AL2023.

AL2 is reaching end of life in June 2026. We also believe that AL2023
provides a more robust networking driver bootstrap, which should prevent
the sporadic networking failures observed in this test.
…sary dependency on NFS shared storage.

In particular, we replaced the shared-file round-trip with an `srun` that streams output back right away.
This removes a flaky dependency on NFS cross-client visibility, which is already covered by storage-specific tests and is just noise in this test.
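
A minimal sketch of the idea, assuming the job simply runs `hostname` on each node: the output is read from the stdout of `srun` instead of from a file on shared storage. The function names, the `hostname` payload, and the timeout are hypothetical and are not taken from test_slurm.

```python
import subprocess
from typing import List


def srun_command(nodes: int, remote_command: List[str]) -> List[str]:
    """Build an srun invocation; srun relays the job's stdout to the caller."""
    return ["srun", "-N", str(nodes)] + list(remote_command)


def parse_hostnames(srun_stdout: str) -> List[str]:
    """Return the sorted, de-duplicated, non-empty lines printed by the job."""
    return sorted({line.strip() for line in srun_stdout.splitlines() if line.strip()})


def run_and_collect(nodes: int, timeout_seconds: int = 120) -> List[str]:
    """Run hostname on the requested nodes and read the result from stdout.

    No file is written to or read from NFS, so the check does not depend
    on cross-client visibility of shared storage.
    """
    result = subprocess.run(
        srun_command(nodes, ["hostname"]),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_seconds,  # assumed bound so the call cannot hang forever
    )
    return parse_hostnames(result.stdout)
```

`run_and_collect` needs a working Slurm cluster; the two pure helpers can be exercised anywhere.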
…initely.

The test could hang indefinitely when dumping job
output due to a weak condition and the lack of an SSH timeout.

When this happens, it can also affect other tests: if the test hangs for too long, our watchdog mechanism kills all the tests sharing the same VPC with this one, as protection against infinite test executions.
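
One way to picture the fix is a wait loop with an explicit deadline, so a condition that never becomes true raises an error instead of blocking forever. The sketch below is an assumption about the general technique, not the code in this PR; `wait_until` and its parameters are hypothetical names.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout_seconds: float,
    interval_seconds: float = 1.0,
) -> None:
    """Poll condition until it returns True or the deadline passes.

    Raises TimeoutError when the deadline is reached, so the caller
    fails fast instead of hanging.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_seconds} seconds")
        time.sleep(interval_seconds)
```

The same principle applies to the remote command itself: whatever SSH client the test uses, giving each call an explicit timeout keeps a stuck session from stalling the run long enough to trip the watchdog.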
@gmarciani force-pushed the wip/mgiacomo/3160/fix-tests-0429-1 branch from 51bdbf4 to a9495d7 April 30, 2026 15:05
This is required to prevent test failures with pytest 8+,
which is pulled in when upgrading to Python 3.10+.
@gmarciani force-pushed the wip/mgiacomo/3160/fix-tests-0429-1 branch from a9495d7 to 7a28d8c April 30, 2026 15:38
@gmarciani enabled auto-merge (rebase) April 30, 2026 19:11
@gmarciani merged commit 6431a3c into aws:develop Apr 30, 2026
24 checks passed
@gmarciani deleted the wip/mgiacomo/3160/fix-tests-0429-1 branch April 30, 2026 19:31

Labels

3.x, skip-changelog-update (disables the check that enforces changelog updates in PRs), Test

2 participants