Skip to content

Conversation

@hanwen-cluster
Copy link
Contributor

This commit fixes a bug from #2855, which made the route table/metric number larger (meaning lower priority). Thereafter, some unwanted default rules on AL2023 took priority and failed test_multiple_nics integration test on AL2023.

This commit makes the number smaller (meaning higher priority) to fix the issue. e.g.
Prior to this commit, the number for table for 1,1 is 1001001. After this commit, the number is 101+10=111. The "+10" is to properly handle table for 0,0, which has number 10. Without "+10", the table would conflict with table 0 from OS.

FYI: the number of unwanted default AL2023 rule starts with 10101

Tests

  • test_multiple_nics on AL2023 has been passed

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

This commit fixes a bug from aws#2855, which made the route table/metric number larger (meaning lower priority). Thereafter, some unwanted default rules on AL2023 took priority and failed test_multiple_nics integration test on AL2023.

This commit makes the number smaller (meaning higher priority) to fix the issue.
e.g.
Prior to this commit, the number for table for 1,1 is 1001001. After this commit, the number is 101+10=111. The "+10" is to properly handle table for 0,0, which has number 10. Without "+10", the table would conflict with table 0 from OS.

FYI: the number of unwanted default AL2023 rule starts with 10101

Signed-off-by: Hanwen <hanwenli@amazon.com>
@hgreebe hgreebe merged commit e479a86 into aws:develop Dec 24, 2024
33 of 44 checks passed
hanwen-cluster added a commit to hanwen-cluster/aws-parallelcluster-cookbook that referenced this pull request Dec 31, 2024
aws#2855 made the pcluster route table/metric number larger (meaning lower priority). Thereafter, some unwanted default rules on AL2023 ec2-net-utils took priority and failed test_multiple_nics integration test on AL2023.
Then, aws#2857 made the number too small, interfering route table configurations from IMDS on AL2.

Therefore, this commit tries to imitate the priority prior to these two PRs. This is not the cleanest fix, because it is staying in the lucky priority rand instead of fully resolving the issue (i.e. prevent IMDS and ec2-net-utils from configuring the route tables). However, this commit is the least breaking change. So I propose to go with this commit.

Metric number range before the two PRs
Network card (0,0): 1000
Network card (0,1): 1000 (which was causing conflicts and the reason for all these PRs)
Network card (n,1): 100n (for p5, which has 32 network card, it will be 1000-10031)

Metric number range after the first PR:
Network card (0,0): 1000000
Network card (0,1): 1000001 (conflict fixed :) )
Network card (n,1): 1000001+n*1000 (for p5, it will be 1000000-1031001)

Metric number range after the second PR:
Network card (0,0): 10
Network card (0,1): 75
Network card (n,1): 0x(hexadecimal number)n01+10 (for p5, it will be 10-12555. The hexadecimal number was accidentally introduced because bash automatically interpret numbers start with "00" as hexadecimal number)

Metric number range after this commit:
Network card (0,0): 1000
Network card (0,1): 1001
Network card (n,1): n01+1000 (for p5, it will be 1000-4101)

Signed-off-by: Hanwen <hanwenli@amazon.com>
hanwen-cluster added a commit that referenced this pull request Dec 31, 2024
#2855 made the pcluster route table/metric number larger (meaning lower priority). Thereafter, some unwanted default rules on AL2023 ec2-net-utils took priority and failed test_multiple_nics integration test on AL2023.
Then, #2857 made the number too small, interfering route table configurations from IMDS on AL2.

Therefore, this commit tries to imitate the priority prior to these two PRs. This is not the cleanest fix, because it is staying in the lucky priority rand instead of fully resolving the issue (i.e. prevent IMDS and ec2-net-utils from configuring the route tables). However, this commit is the least breaking change. So I propose to go with this commit.

Metric number range before the two PRs
Network card (0,0): 1000
Network card (0,1): 1000 (which was causing conflicts and the reason for all these PRs)
Network card (n,1): 100n (for p5, which has 32 network card, it will be 1000-10031)

Metric number range after the first PR:
Network card (0,0): 1000000
Network card (0,1): 1000001 (conflict fixed :) )
Network card (n,1): 1000001+n*1000 (for p5, it will be 1000000-1031001)

Metric number range after the second PR:
Network card (0,0): 10
Network card (0,1): 75
Network card (n,1): 0x(hexadecimal number)n01+10 (for p5, it will be 10-12555. The hexadecimal number was accidentally introduced because bash automatically interpret numbers start with "00" as hexadecimal number)

Metric number range after this commit:
Network card (0,0): 1000
Network card (0,1): 1001
Network card (n,1): n01+1000 (for p5, it will be 1000-4101)

Signed-off-by: Hanwen <hanwenli@amazon.com>
hanwen-cluster added a commit to hanwen-cluster/aws-parallelcluster-cookbook that referenced this pull request Jan 7, 2025
This commit fixes a bug from aws#2855, which made the route table/metric number larger (meaning lower priority). Thereafter, some unwanted default rules on AL2023 took priority and failed test_multiple_nics integration test on AL2023.

This commit makes the number smaller (meaning higher priority) to fix the issue.
e.g.
Prior to this commit, the number for table for 1,1 is 1001001. After this commit, the number is 101+10=111. The "+10" is to properly handle table for 0,0, which has number 10. Without "+10", the table would conflict with table 0 from OS.

FYI: the number of unwanted default AL2023 rule starts with 10101

Signed-off-by: Hanwen <hanwenli@amazon.com>
hanwen-cluster added a commit to hanwen-cluster/aws-parallelcluster-cookbook that referenced this pull request Jan 7, 2025
aws#2855 made the pcluster route table/metric number larger (meaning lower priority). Thereafter, some unwanted default rules on AL2023 ec2-net-utils took priority and failed test_multiple_nics integration test on AL2023.
Then, aws#2857 made the number too small, interfering route table configurations from IMDS on AL2.

Therefore, this commit tries to imitate the priority prior to these two PRs. This is not the cleanest fix, because it is staying in the lucky priority rand instead of fully resolving the issue (i.e. prevent IMDS and ec2-net-utils from configuring the route tables). However, this commit is the least breaking change. So I propose to go with this commit.

Metric number range before the two PRs
Network card (0,0): 1000
Network card (0,1): 1000 (which was causing conflicts and the reason for all these PRs)
Network card (n,1): 100n (for p5, which has 32 network card, it will be 1000-10031)

Metric number range after the first PR:
Network card (0,0): 1000000
Network card (0,1): 1000001 (conflict fixed :) )
Network card (n,1): 1000001+n*1000 (for p5, it will be 1000000-1031001)

Metric number range after the second PR:
Network card (0,0): 10
Network card (0,1): 75
Network card (n,1): 0x(hexadecimal number)n01+10 (for p5, it will be 10-12555. The hexadecimal number was accidentally introduced because bash automatically interpret numbers start with "00" as hexadecimal number)

Metric number range after this commit:
Network card (0,0): 1000
Network card (0,1): 1001
Network card (n,1): n01+1000 (for p5, it will be 1000-4101)

Signed-off-by: Hanwen <hanwenli@amazon.com>
hanwen-cluster added a commit to hanwen-cluster/aws-parallelcluster-cookbook that referenced this pull request Jan 7, 2025
This commit fixes a bug from aws#2855, which made the route table/metric number larger (meaning lower priority). Thereafter, some unwanted default rules on AL2023 took priority and failed test_multiple_nics integration test on AL2023.

This commit makes the number smaller (meaning higher priority) to fix the issue.
e.g.
Prior to this commit, the number for table for 1,1 is 1001001. After this commit, the number is 101+10=111. The "+10" is to properly handle table for 0,0, which has number 10. Without "+10", the table would conflict with table 0 from OS.

FYI: the number of unwanted default AL2023 rule starts with 10101

Signed-off-by: Hanwen <hanwenli@amazon.com>
hanwen-cluster added a commit to hanwen-cluster/aws-parallelcluster-cookbook that referenced this pull request Jan 7, 2025
aws#2855 made the pcluster route table/metric number larger (meaning lower priority). Thereafter, some unwanted default rules on AL2023 ec2-net-utils took priority and failed test_multiple_nics integration test on AL2023.
Then, aws#2857 made the number too small, interfering route table configurations from IMDS on AL2.

Therefore, this commit tries to imitate the priority prior to these two PRs. This is not the cleanest fix, because it is staying in the lucky priority rand instead of fully resolving the issue (i.e. prevent IMDS and ec2-net-utils from configuring the route tables). However, this commit is the least breaking change. So I propose to go with this commit.

Metric number range before the two PRs
Network card (0,0): 1000
Network card (0,1): 1000 (which was causing conflicts and the reason for all these PRs)
Network card (n,1): 100n (for p5, which has 32 network card, it will be 1000-10031)

Metric number range after the first PR:
Network card (0,0): 1000000
Network card (0,1): 1000001 (conflict fixed :) )
Network card (n,1): 1000001+n*1000 (for p5, it will be 1000000-1031001)

Metric number range after the second PR:
Network card (0,0): 10
Network card (0,1): 75
Network card (n,1): 0x(hexadecimal number)n01+10 (for p5, it will be 10-12555. The hexadecimal number was accidentally introduced because bash automatically interpret numbers start with "00" as hexadecimal number)

Metric number range after this commit:
Network card (0,0): 1000
Network card (0,1): 1001
Network card (n,1): n01+1000 (for p5, it will be 1000-4101)

Signed-off-by: Hanwen <hanwenli@amazon.com>
SUFFIX=$(printf "%03d" $NETWORK_CARD_INDEX)$(printf "%03d" $DEVICE_NUMBER)
ROUTE_TABLE=1${SUFFIX}
SUFFIX=$(printf "%03d" $NETWORK_CARD_INDEX)$(printf "%02d" $DEVICE_NUMBER)
ROUTE_TABLE="$(( $SUFFIX + 10 ))"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to have a comment in the code to describe why this is required.

SUFFIX=$(printf "%03d" $NETWORK_CARD_INDEX)$(printf "%03d" $DEVICE_NUMBER)
ROUTE_TABLE="1${SUFFIX}"
SUFFIX=$(printf "%03d" $NETWORK_CARD_INDEX)$(printf "%02d" $DEVICE_NUMBER)
ROUTE_TABLE="$(( $SUFFIX + 10 ))"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected a change in the kitchen tests after this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please execute kitchen tests, verify if they are still correct and eventually improve coverage ? Thanks!

hanwen-cluster added a commit to hanwen-cluster/aws-parallelcluster-cookbook that referenced this pull request Jan 24, 2025
You can use this file to apply changes on ParallelCluster AMIs from the following PRs:

aws#2855
aws#2857
aws#2858

Signed-off-by: Hanwen <hanwenli@amazon.com>
hanwen-cluster added a commit that referenced this pull request Jan 24, 2025
You can use this file to apply changes on ParallelCluster AMIs from the following PRs:

#2855
#2857
#2858

Signed-off-by: Hanwen <hanwenli@amazon.com>
gmarciani pushed a commit to gmarciani/aws-parallelcluster-cookbook that referenced this pull request Feb 5, 2025
You can use this file to apply changes on ParallelCluster AMIs from the following PRs:

aws#2855
aws#2857
aws#2858

Signed-off-by: Hanwen <hanwenli@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants