Skip to content

Conversation

hgreebe
Copy link
Contributor

@hgreebe hgreebe commented Aug 6, 2025

Description of changes

  • gb200 requires the following setup:
    • if EfaEnabled: False, all even indices are set up with the regular interface, the odd indices are not used
    • if EfaEnabled: True, all even indices need an efa interface type and all odd indicies need an efa-only interface type
  • This change sets up the network interface configuration for gb200 without altering the exisitng configuration for other instance types
  • Adds unit test that tests the network configuration for the launch template

Tests

  • Ran test_multiple_nics and test_efa integration tests

References

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hgreebe hgreebe marked this pull request as ready for review August 6, 2025 18:59
@hgreebe hgreebe requested review from a team as code owners August 6, 2025 18:59
@gmarciani
Copy link
Contributor

Please add in the Pr description a reference to official documentation driving this choice.

"""Generate launch template network interfaces list"""

is_gb200 = compute_resource.instance_types[0].split(".")[0] == P6E_GB200
interface = "efa" if compute_resource.efa and compute_resource.efa.enabled and not is_gb200 else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] Here and below: this is not an interface, but an interface type. What about renaming it to interface_type?

for network_card in compute_resource.network_cards_list[1:]:
efa_enabled = True if compute_resource.efa and compute_resource.efa.enabled else False
even = network_card.network_card_index() % 2 == 0
# if efa is disabled, and we have a gb200 instance we skip configuring odd numbered indexes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, include in the comment the reason why we skip odd indexes.

if is_gb200 and not efa_enabled and not even:
continue

interface = "efa" if compute_resource.efa and compute_resource.efa.enabled else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] can be = "efa" if efa_enabled else None"

[
(
True,
"p6e-gb200.36xlarge",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our logic does not check for a specific size, but for whatever GB200 size. In this case it is better in unit test to use as instance_type value p6e.gb200.WHATEVER.

This approach has the following advantages:

  1. (functional) it tests the actual logic, where the size is ignored
  2. (documentation) it uses unit test to document our behaviour, making more explicit that we ignore the size

),
(
True,
"c6in.32xlarge",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. In this case it would be better to use as instance_Type NOTp6e-gb200.WHATEVER_SIZE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes more sense to keep the instance type so that we have more confidence on what we are testing rather than keeping a generalized name like NOTp6e-gb200.WHATEVER_SIZE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so.

If you keep a specific value, your test verifies that the logic works on that specific value. If you use general values instead you cover the actual logic. This increases the reliability of the test and its value as a way to document behaviors.

Example to demonstrate the advantages:

  1. (functional) if in your logic you're taking the expected decision only for c6in.32xlarge, the test with the specific value will succeed, even if the logic is incorrect. If you use a generic value that is clearly not used in the logic to take decisions, your test will be able to capture bugs.
  2. (documentation) tests are meant to validate the correctness of behaviors, but also to document them. Ideally, we should be able to infer our behavior from the test, rather than looking at the underlying source code. If the test uses a specific value c6in.32xlarge, I may think that you're taking a decision around that specific value. If you use NOTp6e-gb200.WHATEVER_SIZE it is clear that the described behavior is for whatever instance type that is not p6e-gb200.

Copy link
Contributor

@himani2411 himani2411 Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are using NOTp6e-gb200.WHATEVER_SIZE or p6e-gb200.WHATEVER_SIZE which basically indicates that we are using p6e-gb200 for our logic. I dont really care about the naming convention here we could essentially specify the instance type as a comment, but i want the test to cover the different setup of NICs from the major multi-nic instances like hpc6id or p4d.
If in future, there is an immediate need to check for this specific logic working for these specific instance type I dont want to go and check the InstanceTypeData of these instances and compare it with the test cases here to see that it works, which is why the instance type (or a comment) as the name gives more confidence rather than a generic name.

):
"""Generate launch template network interfaces list."""
is_gb200 = compute_resource.instance_types[0].split(".")[0] == P6E_GB200
interface_type = "efa" if compute_resource.efa and compute_resource.efa.enabled and not is_gb200 else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should mention that we skip the ODD indices if EFA is disabled for Gb200 instance in a validator as well as the public documentation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In documentation for sure, but I would not include it in a validator as it is a too low level detail for a validator.

]

for network_card in compute_resource.network_cards_list[1:]:
efa_enabled = compute_resource.efa and compute_resource.efa.enabled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are already using this as a constant why not use if above at line 372?

@himani2411
Copy link
Contributor

Can you also describe the test done for testing Gb200 launch template generation if you have done it?


compute_lt_nw_interfaces.append(
ec2.CfnLaunchTemplate.NetworkInterfaceProperty(
device_index=0 if network_card.maximum_network_interfaces() == 1 else 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true for p4d's too?or gb200?

Maximum Network interfaces of Gb200 is 17 so the device Index would be 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MaximumNetworkInterfaces for the network card is 1 for odd indexes and 2 for even indexes. The maximum network cards is 17.

This logic also works for all other instance types.

gmarciani
gmarciani previously approved these changes Aug 8, 2025
@hgreebe hgreebe added the skip-changelog-update Disables the check that enforces changelog updates in PRs label Aug 8, 2025
@hgreebe hgreebe enabled auto-merge (rebase) August 8, 2025 15:55
@hgreebe hgreebe disabled auto-merge August 8, 2025 15:56
@hgreebe hgreebe enabled auto-merge (rebase) August 8, 2025 15:56
@hgreebe hgreebe merged commit d764202 into aws:develop Aug 8, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
skip-changelog-update Disables the check that enforces changelog updates in PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants