Unexpected egress bandwidth out to internet #1911

Closed
jwolski2 opened this issue Mar 6, 2022 · 10 comments

Comments

@jwolski2

jwolski2 commented Mar 6, 2022

What happened:

Hey team, I'm wondering if you can help me sort out some network bandwidth issues we're having. We are running our application on Kubernetes on a c5n.9xlarge. I understand that a c5n.9xlarge has a maximum network throughput of 50Gbps. I also understand that the same instance type can drive 25Gbps to an internet gateway (Source). However, our application is only able to drive less than 15Gbps to the internet gateway, and I've been able to reproduce similar behavior in our development environment with iperf, where we seem to be limited to 10Gbps.
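(For reference, the advertised per-instance-type throughput can be checked with the AWS CLI; a rough sketch, assuming a configured CLI, not anything specific to this issue:)

aws ec2 describe-instance-types \
  --instance-types c5n.9xlarge \
  --query 'InstanceTypes[0].NetworkInfo.NetworkPerformance' \
  --output text
# should print the advertised aggregate throughput, e.g. "50 Gigabit" for c5n.9xlarge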

In our development environment, I have run the iperf client and server on 2 different c5n.9xlarge instances, with a mix of on-host and container combinations. The client and server commands typically look like this:

Client: iperf -c PUBLIC_OR_PRIVATE_IP -p 32293 -P 20 -t 120 -i 5
Server: iperf -s -p 32293
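(For the pod-as-client test cases, the only difference is that the client runs inside a pod instead of on the host; a rough sketch of that variant, where IPERF_IMAGE is a placeholder for any image that ships iperf:)

# server side, on the other c5n.9xlarge (same as above)
iperf -s -p 32293

# client side, from a pod on the first instance
kubectl run iperf-client --image=IPERF_IMAGE --restart=Never --command -- \
  iperf -c PUBLIC_OR_PRIVATE_IP -p 32293 -P 20 -t 120 -i 5

# stream the client's results
kubectl logs -f iperf-client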

Here are the results I found:

Client           | Server          | Server IP Address | Expected | Actual
On-Host          | On-Host         | Private           | 50Gbps   | 50Gbps
On-Host          | On-Host         | Public            | 25Gbps   | 25Gbps
Container (Pod)  | On-Host         | Private           | 50Gbps   | 50Gbps
Container (Pod)  | On-Host         | Public            | 25Gbps   | 10Gbps*
On-Host          | Container (Pod) | Public            | 25Gbps   | 25Gbps

*Based on the expected and actual results above, only the 4th test case does not meet expectations. This leads me to believe there is a bottleneck when egressing from the container to the internet gateway, and that's the issue I'm trying to sort out.

Here's what the veth configuration looks like from inside the container:

3: eth0@if89: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
    link/ether 1e:28:3d:2e:5b:e1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.10.4.220/32 scope global eth0
       valid_lft forever preferred_lft forever

And from outside on the host:

89: eni8cb21860b24@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
    link/ether d6:9c:12:4f:81:31 brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::d49c:12ff:fe4f:8131/64 scope link
       valid_lft forever preferred_lft forever
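(Both views were captured with plain iproute2 commands; a quick way to reproduce that, assuming the pod image ships iproute2 and POD_NAME is the pod in question:)

# inside the pod
kubectl exec POD_NAME -- ip addr show eth0

# on the host: eth0@if89 in the pod pairs with host ifindex 89, i.e. eni8cb21860b24
ip -d link show eni8cb21860b24
tc qdisc show dev eni8cb21860b24   # confirms noqueue, i.e. no traffic shaping on the veth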

And the iptables rules:

:AWS-SNAT-CHAIN-0 - [0:0]
:AWS-SNAT-CHAIN-1 - [0:0]
-A POSTROUTING -m comment --comment "AWS SNAT CHAIN" -j AWS-SNAT-CHAIN-0
-A AWS-SNAT-CHAIN-0 ! -d 10.10.0.0/16 -m comment --comment "AWS SNAT CHAIN" -j AWS-SNAT-CHAIN-1
-A AWS-SNAT-CHAIN-1 ! -o vlan+ -m comment --comment "AWS, SNAT" -m addrtype ! --dst-type LOCAL -j SNAT --to-source 10.10.6.108 --random-fully
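(Those rules are from the nat table's POSTROUTING path. To double-check that pod egress to a public IP really is being SNAT'd to the node's primary IP (10.10.6.108) while iperf runs, something like the following works, assuming conntrack-tools is installed:)

# dump the CNI's SNAT chains
sudo iptables -t nat -S POSTROUTING
sudo iptables -t nat -S AWS-SNAT-CHAIN-0
sudo iptables -t nat -S AWS-SNAT-CHAIN-1

# while the test runs, look for the pod IP's connections to the iperf port
sudo conntrack -L -p tcp 2>/dev/null | grep 'dport=32293' | head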

During my investigation, I stumbled upon this previously filed issue #1087, which suggests the problem may have been resolved by an EKS AMI / kernel update (NOTE: we are not running EKS or the EKS AMI). So I'm wondering whether we're running into a similar kernel issue, and whether my report above can be validated or invalidated by your team.

(After filing the issue, I'll also send along the results of running the CNI Log Collection tool to your email address).

Thanks!

Environment:

  • Kubernetes version (use kubectl version): 1.19.15
  • CNI Version 1.7.10
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.3 LTS
  • Kernel (e.g. uname -a): 5.11.0-1021-aws # 22~20.04.2-Ubuntu SMP
@jayanthvn
Contributor

Thanks for the detailed info @jwolski2. I will test it out on my cluster and let you know.

@jayanthvn jayanthvn self-assigned this Mar 7, 2022
@jwolski2
Author

Hey @jayanthvn, I'm curious whether you've had any time to dive into this issue. I've been working with AWS Support on a case about this same issue, and after a few days of performing their own tests, they've concluded:

  • EC2 infrastructure looks fine
  • "Something is going on in Kubernetes or the AWS CNI that is restricting bandwidth"
  • Since our setup is not part of EKS, their continued support would be "best effort."

I'm not sure how to interpret that last bullet point so I'm coming back around to this issue to see if you've reached your own conclusions. Thanks!

@jayanthvn
Contributor

@jwolski2 - Sorry, I didn't get time this week to verify this behavior. I will test it out by next week and update you. I can try both Ubuntu and the EKS AMI.

@jwolski2
Author

Apologies for piling on the nudges, but I'm curious @jayanthvn. Has there been any movement on this issue from your side? Thanks!

@jayanthvn
Contributor

@jwolski2 Sorry, I got busy with a few release activities. Will take a look ASAP.

@jwolski2
Author

I've got good news. I've heard back from AWS Support, and they suggested we upgrade either our Linux kernel or our Kubernetes version, because they found evidence of an issue with the way the kernel handles Kubernetes network traffic.

Anyway, I upgraded from kernel version 5.11.0-1021-aws to 5.13.0-1021-aws and, sure enough, where I was driving < 10Gbps to the internet gateway before, I am now able to drive 17-19Gbps. It's not exactly the 25Gbps I was hoping for, but it's a solid improvement.
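(For anyone else hitting this on Ubuntu 20.04, the change on our side was just a kernel package upgrade plus a reboot; roughly the following, with the package names inferred from the version string rather than anything AWS Support prescribed:)

sudo apt-get update
# install the newer AWS-tuned kernel (5.13.0-1021-aws in our case)
sudo apt-get install -y linux-image-5.13.0-1021-aws linux-modules-5.13.0-1021-aws
sudo reboot

# after the reboot
uname -r    # should now report 5.13.0-1021-aws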

I'm not exactly sure where the issue lies. If you have any details about it, let me know!

@jayanthvn
Contributor

Thanks for the update. We also have the ticket in our queue and will look into it.

The EKS AMI kernel version is 5.4.181-99.354.amzn2.x86_64, so I might be able to repro this with the EKS AMI too.

@jwolski2
Author

One last update from our side: we pushed the kernel upgrade out to production, and our app is now able to achieve at least 25Gbps, whereas before it was topping out at 13-15Gbps. I haven't tested how far we can push the net I/O, as we're happy enough now with performance.

You may close this issue whenever you find it appropriate, @jayanthvn.

@jayanthvn
Contributor

Thanks for the confirmation. So it looks like it was a kernel issue.
