[Test] Add integration tests to validate support for GB200. #6934
tests/integration-tests/tests/gb200/test_gb200/test_gb200/pcluster.config.yaml
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@             Coverage Diff             @@
##           develop    #6934      +/-   ##
===========================================
- Coverage    90.21%   90.13%   -0.08%
===========================================
  Files          181      181
  Lines        16213    16396     +183
===========================================
+ Hits         14627    14779     +152
- Misses        1586     1617      +31
timeout ${IMEX_STOP_TIMEOUT} systemctl stop ${IMEX_SERVICE}
pkill -9 ${IMEX_SERVICE}

#TODO Improvement: rotate server port to prevent race condition
[Non-Blocking] When do we plan to do this? In the next phase, or in another iteration in the coming weeks?
I suggest addressing this improvement in a follow-up PR once we fully understand the implications.
So far I have not observed any race condition, even when re-executing the same integ test on an existing cluster multiple times, but that may be because we are using only 2 nodes and not a real CUDA application.
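For reference in that follow-up, here is a minimal sketch of one possible reading of "rotate server port": derive the port from a value that changes between runs, so that a new IMEX instance does not reuse the port of one that is still shutting down. This is not what the PR implements. The function name, the base port, the range size, and the use of a job ID as the rotating value are all assumptions made for illustration.

```python
def rotated_port(base_port: int, port_span: int, job_id: int) -> int:
    """Pick a port in [base_port, base_port + port_span) that changes with job_id.

    Hypothetical helper: the PR only carries a TODO and does not define
    how the rotation would actually work.
    """
    if port_span <= 0:
        raise ValueError("port_span must be positive")
    if base_port <= 0 or base_port + port_span - 1 > 65535:
        raise ValueError("port range must fit within 1..65535")
    return base_port + (job_id % port_span)


# Consecutive job IDs map to different ports as long as port_span > 1.
# 50000 and 10 are illustrative values, not taken from the PR.
assert rotated_port(50000, 10, 41) == 50001
assert rotated_port(50000, 10, 42) == 50002
```

A modulo over a small range keeps the set of ports predictable for security group rules, at the cost of reusing a port every `port_span` jobs.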
…favor of the cookbook attribute to force IMEX configuration.
Description of changes
Add integration tests to validate support for GB200. In particular, it verifies the automated configuration of NVIDIA IMEX.
This test creates a cluster with the necessary custom actions to configure NVIDIA IMEX and verifies the following:
The IMEX service is healthy and no errors are reported in the IMEX or prolog logs.
IMEX gets reconfigured when nodes belonging to the same compute resource get replaced.
keeping the default values and IMEX is not started.
The test prints the full IMEX status in the test log to facilitate troubleshooting.
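The "no errors in the logs" check described above could look roughly like the sketch below. It is an illustration only and not the test code from this PR: the helper name, the regex patterns, and the sample log lines are hypothetical, and the real test may look for different error markers.

```python
import re
from typing import Iterable, List

# Hypothetical patterns; the actual test may match different markers.
DEFAULT_ERROR_PATTERNS = (r"\berror\b", r"\bfail(ed|ure)?\b")


def find_error_lines(
    log_text: str, patterns: Iterable[str] = DEFAULT_ERROR_PATTERNS
) -> List[str]:
    """Return the log lines matching any pattern (case-insensitive)."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [
        line
        for line in log_text.splitlines()
        if any(rx.search(line) for rx in compiled)
    ]


# Made-up log excerpts, used only to exercise the helper.
healthy_log = "INFO starting service\nINFO all nodes connected\n"
broken_log = "INFO starting service\nERROR cannot reach peer 10.0.0.2\n"

assert find_error_lines(healthy_log) == []
assert find_error_lines(broken_log) == ["ERROR cannot reach peer 10.0.0.2"]
```

Returning the offending lines, instead of a boolean, lets an assertion failure show exactly what was found in the log.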
Important Notes
The test uses g4dn to simulate a p6e-gb200 instance. This is a reasonable approximation because the focus of the test is on IMEX configuration, which can be executed on g4dn as well.
Limitations
Tests
test_gb200
References
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.