@himani2411 himani2411 commented Mar 24, 2025

Description of changes

  • Remove route metric from NetworkManager

Adding a route-metric to NetworkManager creates routes whose metrics (priorities) clash between the two NICs, which is why some tests pass and others fail. This happens on the head node (HN), where network interface (NI) 0 is the default interface and holds the Elastic public IP.
Clashing route metrics cause errors like the one below when connecting to endpoints.

ERROR - Failed when getting instance info from EC2 with exception Connect timeout on endpoint URL: "https://ec2.us-east-1.amazonaws.com/"

Routing table of ParallelCluster 3.12.0, where NI 0 has the highest priority (lowest metric) and NI 1 has the second highest

ip route show table main
default via <IP-IG> dev eth0 proto dhcp src 192.168.23.21 metric 100 # NI 0
default via <IP-IG> dev eth1 proto dhcp src 192.168.17.156 metric 101 # NI 1
default via <IP-IG> dev eth0 metric 1000 # PC added
default via <IP-IG> dev eth1 metric 1001 # PC added
<IP-IG>/20 dev eth0 proto kernel scope link src 192.168.23.21 metric 100
<IP-IG>/20 dev eth1 proto kernel scope link src 192.168.17.156 metric 101

Routing table of ParallelCluster 3.13.0

default via <IP-IG> dev eth1 proto dhcp src 192.168.17.156 metric 100 # clashes with the NI 0 line below
default via <IP-IG> dev eth0 proto dhcp src 192.168.23.21 metric 100
default via <IP-IG> dev eth1 proto dhcp src 192.168.17.156 metric 101
default via <IP-IG> dev eth0 metric 1000 # PC added 
default via <IP-IG> dev eth1 metric 1101  # PC added 
<IP-IG>/20 dev eth0 proto kernel scope link src 192.168.23.21 metric 100
<IP-IG>/20 dev eth1 proto kernel scope link src 192.168.17.156 metric 100 # clashes with the NI 0 line above
<IP-IG>/20 dev eth1 proto kernel scope link src 192.168.17.156 metric 101
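
The clash in the 3.13.0 table can be spotted mechanically. The following is a minimal sketch, not part of this PR: it parses `ip route show` text and reports any metric shared by default routes on more than one device. The function name `find_metric_clashes`, the regular expression, and the gateway `10.0.0.1` (standing in for the redacted `<IP-IG>`) are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Default routes adapted from the 3.13.0 table above.
# 10.0.0.1 is a placeholder for the redacted <IP-IG> gateway (assumption).
SAMPLE_3_13 = """\
default via 10.0.0.1 dev eth1 proto dhcp src 192.168.17.156 metric 100
default via 10.0.0.1 dev eth0 proto dhcp src 192.168.23.21 metric 100
default via 10.0.0.1 dev eth1 proto dhcp src 192.168.17.156 metric 101
default via 10.0.0.1 dev eth0 metric 1000
default via 10.0.0.1 dev eth1 metric 1101
"""

# Captures the device and metric of a default route line.
ROUTE_RE = re.compile(
    r"^default\b.*?\bdev\s+(?P<dev>\S+).*?\bmetric\s+(?P<metric>\d+)"
)


def find_metric_clashes(route_output):
    """Return {metric: [devices]} for metrics used by more than one device."""
    devices_by_metric = defaultdict(set)
    for line in route_output.splitlines():
        match = ROUTE_RE.match(line.strip())
        if match:
            metric = int(match.group("metric"))
            devices_by_metric[metric].add(match.group("dev"))
    return {
        metric: sorted(devices)
        for metric, devices in devices_by_metric.items()
        if len(devices) > 1
    }


print(find_metric_clashes(SAMPLE_3_13))  # {100: ['eth0', 'eth1']}
```

On a live node the same function could be fed the output of `ip route show table main`; the equivalent 3.12.0 routes (metrics 100, 101, 1000, 1001) produce an empty result.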

Tests

ONGOING

test-suites:
  multiple_nics:
    test_multiple_nics.py::test_multiple_nics:
      dimensions:
        - regions: ["use1-az1"]
          instances: ["c6in.32xlarge"]
          oss: ["rhel8", "rhel9", "rocky8", "rocky9"]
          schedulers: ["slurm"]

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@himani2411 himani2411 requested review from a team as code owners March 24, 2025 23:56
@himani2411 himani2411 force-pushed the release-3.13-route-mertic branch from c569e8c to a7c8977 Compare March 25, 2025 13:42
@himani2411 himani2411 enabled auto-merge (squash) March 25, 2025 13:42
@himani2411 himani2411 merged commit 7991177 into aws:release-3.13 Mar 25, 2025
27 of 31 checks passed
himani2411 added a commit to himani2411/aws-parallelcluster-cookbook that referenced this pull request Apr 2, 2025
Co-authored-by: Himani Anil Deshpande <himanidp@amazon.com>
himani2411 added a commit that referenced this pull request Apr 3, 2025
Co-authored-by: Himani Anil Deshpande <himanidp@amazon.com>