-
Notifications
You must be signed in to change notification settings - Fork 315
Change network interface setup logic to account for gb200 #6930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Please add in the Pr description a reference to official documentation driving this choice. |
"""Generate launch template network interfaces list""" | ||
|
||
is_gb200 = compute_resource.instance_types[0].split(".")[0] == P6E_GB200 | ||
interface = "efa" if compute_resource.efa and compute_resource.efa.enabled and not is_gb200 else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor] Here and below: this is not an interface, but an interface type. What about renaming it to interface_type
?
for network_card in compute_resource.network_cards_list[1:]: | ||
efa_enabled = True if compute_resource.efa and compute_resource.efa.enabled else False | ||
even = network_card.network_card_index() % 2 == 0 | ||
# if efa is disabled, and we have a gb200 instance we skip configuring odd numbered indexes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, include in the comment the reason why we skip odd indexes.
if is_gb200 and not efa_enabled and not even: | ||
continue | ||
|
||
interface = "efa" if compute_resource.efa and compute_resource.efa.enabled else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor] can be = "efa" if efa_enabled else None"
[ | ||
( | ||
True, | ||
"p6e-gb200.36xlarge", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our logic does not check for a specific size, but for whatever GB200 size. In this case it is better in unit test to use as instance_type value p6e.gb200.WHATEVER
.
This approach has the following advantages:
- (functional) it tests the actual logic, where the size is ignored
- (documentation) it uses unit test to document our behaviour, making more explicit that we ignore the size
), | ||
( | ||
True, | ||
"c6in.32xlarge", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. In this case it would be better to use as instance_Type NOTp6e-gb200.WHATEVER_SIZE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes more sense to keep the instance type so that we have more confidence on what we are testing rather than keeping a generalized name like NOTp6e-gb200.WHATEVER_SIZE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so.
If you keep a specific value, your test verifies that the logic works on that specific value. If you use general values instead you cover the actual logic. This increases the reliability of the test and its value as a way to document behaviors.
Example to demonstrate the advantages:
- (functional) if in your logic you're taking the expected decision only for c6in.32xlarge, the test with the specific value will succeed, even if the logic is incorrect. If you use a generic value that is clearly not used in the logic to take decisions, your test will be able to capture bugs.
- (documentation) tests are meant to validate the correctness of behaviors, but also to document them. Ideally, we should be able to infer our behavior from the test, rather than looking at the underlying source code. If the test uses a specific value
c6in.32xlarge
, I may think that you're taking a decision around that specific value. If you useNOTp6e-gb200.WHATEVER_SIZE
it is clear that the described behavior is for whatever instance type that is not p6e-gb200.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are using NOTp6e-gb200.WHATEVER_SIZE
or p6e-gb200.WHATEVER_SIZE
which basically indicates that we are using p6e-gb200 for our logic. I dont really care about the naming convention here we could essentially specify the instance type as a comment, but i want the test to cover the different setup of NICs from the major multi-nic instances like hpc6id or p4d.
If in future, there is an immediate need to check for this specific logic working for these specific instance type I dont want to go and check the InstanceTypeData of these instances and compare it with the test cases here to see that it works, which is why the instance type (or a comment) as the name gives more confidence rather than a generic name.
): | ||
"""Generate launch template network interfaces list.""" | ||
is_gb200 = compute_resource.instance_types[0].split(".")[0] == P6E_GB200 | ||
interface_type = "efa" if compute_resource.efa and compute_resource.efa.enabled and not is_gb200 else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should mention that we skip the ODD indices if EFA is disabled for Gb200 instance in a validator as well as the public documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In documentation for sure, but I would not include it in a validator as it is a too low level detail for a validator.
] | ||
|
||
for network_card in compute_resource.network_cards_list[1:]: | ||
efa_enabled = compute_resource.efa and compute_resource.efa.enabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are already using this as a constant why not use if above at line 372?
Can you also describe the test done for testing Gb200 launch template generation if you have done it? |
|
||
compute_lt_nw_interfaces.append( | ||
ec2.CfnLaunchTemplate.NetworkInterfaceProperty( | ||
device_index=0 if network_card.maximum_network_interfaces() == 1 else 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true for p4d's too?or gb200?
Maximum Network interfaces of Gb200 is 17 so the device Index would be 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MaximumNetworkInterfaces
for the network card is 1 for odd indexes and 2 for even indexes. The maximum network cards is 17.
This logic also works for all other instance types.
Description of changes
EfaEnabled: False
, all even indices are set up with the regular interface, the odd indices are not usedEfaEnabled: True
, all even indices need an efa interface type and all odd indicies need an efa-only interface typeTests
test_multiple_nics
andtest_efa
integration testsReferences
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.