@gmarciani gmarciani commented Aug 8, 2025

Description of changes

Add integration tests to validate support for GB200. In particular, the tests verify the automated configuration of NVIDIA IMEX.

This test creates a cluster with the necessary custom actions to configure NVIDIA IMEX and verifies the following:

  1. On the compute resource supporting IMEX (q1-cr1), the IMEX nodes file is configured by the prolog,
    the IMEX service is healthy, and no errors are reported in the IMEX or prolog logs.
    Also, IMEX gets reconfigured when nodes belonging to the same compute resource are replaced.
  2. On the compute resource not supporting IMEX (q1-cr2), the IMEX nodes file is not configured by the prolog
    and keeps its default values, and IMEX is not started.
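
The file-level part of these checks can be sketched in shell as follows. This is only an illustration, not the test's actual code: the function name `imex_nodes_file_matches` is hypothetical, and so is the assumption that the nodes file lists one address per line, with blank lines and `#` comments ignored.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): succeed only if the nodes file
# lists exactly the expected addresses, regardless of order.
# Assumption: one address per line; blank lines and '#' comments are ignored.
imex_nodes_file_matches() {
  local nodes_file="$1"
  shift
  local actual expected
  # Drop comment and blank lines, then normalize the order.
  actual="$(grep -Ev '^[[:space:]]*(#|$)' "$nodes_file" | sort -u)"
  expected="$(printf '%s\n' "$@" | sort -u)"
  [ "$actual" = "$expected" ]
}
```

For q1-cr1 such a check would be called with the addresses of the nodes in the compute resource; for q1-cr2 it would instead be compared against whatever the default file contains.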

The test prints the full IMEX status to the test log to facilitate troubleshooting.

Important Notes

  • The test is added to the develop configuration for daily execution.
  • The test uses instance type g4dn to simulate a p6e-gb200 instance. This is a reasonable approximation because the test focuses on IMEX configuration, which can also be exercised on g4dn.
  • The compute node custom action will be removed once [NVIDIA-IMEX] Add test attribute for NVIDIA-imex simulation aws-parallelcluster-cookbook#3001 is merged. At that point, the test will replace the custom action by injecting the cookbook attribute that forces the IMEX configuration.
  • The test makes use of a compute node prolog, following the approach recommended by NVIDIA for per-Job Deployment. The advantage of this approach is that IMEX can be easily restarted locally by each compute node at the right time. If the same were done from the head node, an additional mechanism to orchestrate the IMEX reload would have to be introduced.
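
The node-local step of such a prolog could look roughly like the sketch below. It is not the prolog from this PR: the function name `write_imex_nodes_file` is hypothetical, and the restart of the IMEX service is deliberately left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): write one node address per line
# to the given nodes file. The file is replaced atomically via a temporary
# file in the same directory, so a reader never sees a partial list.
write_imex_nodes_file() {
  local nodes_file="$1"
  shift
  local tmp_file
  tmp_file="$(mktemp "${nodes_file}.XXXXXX")" || return 1
  printf '%s\n' "$@" > "$tmp_file" \
    && chmod 0644 "$tmp_file" \
    && mv -f "$tmp_file" "$nodes_file"
}
```

In a Slurm prolog the node list could plausibly come from `scontrol show hostnames "$SLURM_JOB_NODELIST"`, followed by a local restart of the IMEX service. Both steps are assumptions here and may differ from what the PR's prolog actually does.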

Limitations

  1. To use p6e-gb200 instance types, we need to manually change the capacity reservation in the test, because we do not have a way to automate such a reservation today.

Tests

  • [SUCCESS] test_gb200

References

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch from 3029b18 to 2f5e4b8 Compare August 8, 2025 14:59
@gmarciani gmarciani added skip-changelog-update Disables the check that enforces changelog updates in PRs 3.x Test labels Aug 8, 2025
@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch 3 times, most recently from 57f0db7 to 9818b38 Compare August 8, 2025 15:19
@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch 2 times, most recently from 37f558c to 5f07db2 Compare August 8, 2025 16:07
@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch 8 times, most recently from 9115cd8 to 04a16f6 Compare August 8, 2025 21:51

codecov bot commented Aug 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.13%. Comparing base (c8cc980) to head (950f401).
⚠️ Report is 72 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #6934      +/-   ##
===========================================
- Coverage    90.21%   90.13%   -0.08%     
===========================================
  Files          181      181              
  Lines        16213    16396     +183     
===========================================
+ Hits         14627    14779     +152     
- Misses        1586     1617      +31     
Flag Coverage Δ
unittests 90.13% <ø> (-0.08%) ⬇️

@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch from 04a16f6 to ecc0b96 Compare August 8, 2025 22:16
@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch 7 times, most recently from 714a028 to 4fe6ef8 Compare August 9, 2025 04:15
@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch 10 times, most recently from 3107585 to d71ba84 Compare August 12, 2025 02:33
@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch from d71ba84 to fc95126 Compare August 12, 2025 02:59
@gmarciani gmarciani marked this pull request as ready for review August 12, 2025 03:08
@gmarciani gmarciani requested review from a team as code owners August 12, 2025 03:08
@gmarciani gmarciani enabled auto-merge (rebase) August 12, 2025 14:45
@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch 5 times, most recently from 0b45c81 to 083201e Compare August 12, 2025 21:46
timeout ${IMEX_STOP_TIMEOUT} systemctl stop ${IMEX_SERVICE}
pkill -9 ${IMEX_SERVICE}

#TODO Improvement: rotate server port to prevent race condition

@himani2411 himani2411 Aug 12, 2025

[Non-Blocking] When do we plan to do this? In the next phase, or in another iteration in the coming weeks?

Contributor Author

I suggest addressing this improvement in a follow-up PR once we fully understand the implications.
So far I have not observed any race condition, even when re-executing the same integ test on an existing cluster multiple times, but that may be because we are using only 2 nodes and not a real CUDA application.
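
One possible reading of the "rotate server port" TODO is to derive the port from the job ID, so that every node in a job computes the same port and consecutive jobs do not reuse a port that a previous IMEX instance may still hold. The sketch below shows only the arithmetic. The function name, base port and range size are hypothetical, and how the port would be passed to IMEX is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): map a numeric job ID onto a port
# in [base_port, base_port + port_count). Default values are illustrative.
imex_port_for_job() {
  local job_id="$1"
  local base_port="${2:-50000}"
  local port_count="${3:-100}"
  echo $(( base_port + job_id % port_count ))
}

imex_port_for_job 1234   # prints 50034
```

Jobs whose IDs differ by a multiple of the range size still map to the same port, so this narrows the window for a collision rather than eliminating it.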

@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch from 083201e to 5a9d5e4 Compare August 12, 2025 22:14
himani2411 previously approved these changes Aug 12, 2025
…favor of the cookbook attribute to force IMEX configuration.
@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/test-gb200-0804-1 branch from 5a9d5e4 to 84e4988 Compare August 12, 2025 22:36
@gmarciani gmarciani merged commit 839f0e2 into aws:develop Aug 12, 2025
24 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3140/test-gb200-0804-1 branch August 12, 2025 22:42