Unexpected egress bandwidth out to internet #1911

Closed
jwolski2 opened this issue Mar 6, 2022 · 10 comments

Comments

@jwolski2

jwolski2 commented Mar 6, 2022

What happened:

Hey team, I'm wondering if you can help me sort out some network bandwidth issues we're having. We are running our application on Kubernetes on a c5n.9xlarge. I understand that a c5n.9xlarge has a maximum network throughput of 50Gbps. I also understand that the same instance type can drive 25Gbps to an internet gateway (Source). However, our application is only able to drive less than 15Gbps to the internet gateway, and I've been able to reproduce similar behavior in our development environment with iperf, where we seem to be limited to 10Gbps.
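(For reference, the advertised per-instance-type throughput can be checked with the AWS CLI; a rough sketch, assuming a configured CLI, not anything specific to this issue:)

aws ec2 describe-instance-types \
  --instance-types c5n.9xlarge \
  --query 'InstanceTypes[0].NetworkInfo.NetworkPerformance' \
  --output text
# should print the advertised aggregate throughput, e.g. "50 Gigabit" for c5n.9xlarge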

In our development environment, I have run the iperf client and server on 2 different c5n.9xlarge instances, with a mix of on-host and container combinations. The client and server commands typically look like this:

Client: iperf -c PUBLIC_OR_PRIVATE_IP -p 32293 -P 20 -t 120 -i 5
Server: iperf -s -p 32293
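(For the pod-as-client test cases, the only difference is that the client runs inside a pod instead of on the host; a rough sketch of that variant, where IPERF_IMAGE is a placeholder for any image that ships iperf:)

# server side, on the other c5n.9xlarge (same as above)
iperf -s -p 32293

# client side, from a pod on the first instance
kubectl run iperf-client --image=IPERF_IMAGE --restart=Never --command -- \
  iperf -c PUBLIC_OR_PRIVATE_IP -p 32293 -P 20 -t 120 -i 5

# stream the client's results
kubectl logs -f iperf-client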

Here are the results I found:

Client           | Server          | Server IP Address | Expected | Actual
On-Host          | On-Host         | Private           | 50Gbps   | 50Gbps
On-Host          | On-Host         | Public            | 25Gbps   | 25Gbps
Container (Pod)  | On-Host         | Private           | 50Gbps   | 50Gbps
Container (Pod)  | On-Host         | Public            | 25Gbps   | 10Gbps*
On-Host          | Container (Pod) | Public            | 25Gbps   | 25Gbps

*Based on the expected and actual results above, only the 4th test case does not meet expectations. This leads me to believe there is a bottleneck when egressing from the container to the internet gateway, and that's the issue I'm trying to sort out.

Here's what the veth configuration looks like from inside the container:

3: eth0@if89: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
    link/ether 1e:28:3d:2e:5b:e1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.10.4.220/32 scope global eth0
       valid_lft forever preferred_lft forever

And from outside on the host:

89: eni8cb21860b24@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
    link/ether d6:9c:12:4f:81:31 brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::d49c:12ff:fe4f:8131/64 scope link
       valid_lft forever preferred_lft forever
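(Both views were captured with plain iproute2 commands; a quick way to reproduce that, assuming the pod image ships iproute2 and POD_NAME is the pod in question:)

# inside the pod
kubectl exec POD_NAME -- ip addr show eth0

# on the host: eth0@if89 in the pod pairs with host ifindex 89, i.e. eni8cb21860b24
ip -d link show eni8cb21860b24
tc qdisc show dev eni8cb21860b24   # confirms noqueue, i.e. no traffic shaping on the veth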

And the iptables rules:

:AWS-SNAT-CHAIN-0 - [0:0]
:AWS-SNAT-CHAIN-1 - [0:0]
-A POSTROUTING -m comment --comment "AWS SNAT CHAIN" -j AWS-SNAT-CHAIN-0
-A AWS-SNAT-CHAIN-0 ! -d 10.10.0.0/16 -m comment --comment "AWS SNAT CHAIN" -j AWS-SNAT-CHAIN-1
-A AWS-SNAT-CHAIN-1 ! -o vlan+ -m comment --comment "AWS, SNAT" -m addrtype ! --dst-type LOCAL -j SNAT --to-source 10.10.6.108 --random-fully
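(Those rules are from the nat table's POSTROUTING path. To double-check that pod egress to a public IP really is being SNAT'd to the node's primary IP (10.10.6.108) while iperf runs, something like the following works, assuming conntrack-tools is installed:)

# dump the CNI's SNAT chains
sudo iptables -t nat -S POSTROUTING
sudo iptables -t nat -S AWS-SNAT-CHAIN-0
sudo iptables -t nat -S AWS-SNAT-CHAIN-1

# while the test runs, look for the pod IP's connections to the iperf port
sudo conntrack -L -p tcp 2>/dev/null | grep 'dport=32293' | head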

During my investigation, I stumbled upon this previously filed issue #1087, which suggests the problem may have been resolved by an EKS AMI / kernel update (NOTE: we are not running EKS or the EKS AMI). So I'm wondering whether we're running into a similar kernel issue, and whether my report above can be validated or invalidated by your team.

(After filing the issue, I'll also send along the results of running the CNI Log Collection tool to your email address).

Thanks!

Environment:

  • Kubernetes version (use kubectl version): 1.19.15
  • CNI Version 1.7.10
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.3 LTS
  • Kernel (e.g. uname -a): 5.11.0-1021-aws # 22~20.04.2-Ubuntu SMP
@jayanthvn
Contributor

Thanks for the detailed info @jwolski2. I will test it out on my cluster and let you know.

@jayanthvn jayanthvn self-assigned this Mar 7, 2022
@jwolski2
Author

Hey @jayanthvn, I'm curious whether you've had any time to dive into this issue. I've been working with AWS Support on a case about this same issue, and after a few days of performing their own tests, they've concluded:

  • EC2 infrastructure looks fine
  • "Something is going on in Kubernetes or the AWS CNI that is restricting bandwidth"
  • Since our setup is not part of EKS, their continued support would be "best effort."

I'm not sure how to interpret that last bullet point so I'm coming back around to this issue to see if you've reached your own conclusions. Thanks!

@jayanthvn
Contributor

@jwolski2 - Sorry, I didn't get time this week to verify this behavior. I will test it out by next week and update you. I can try both Ubuntu and the EKS AMI.

@jwolski2
Author

Apologies for piling on the nudges, but I'm curious @jayanthvn. Has there been any movement on this issue from your side? Thanks!

@jayanthvn
Contributor

@jwolski2 Sorry, I got busy with a few release activities. Will take a look ASAP.

@jwolski2
Author

I've got good news. I've heard back from AWS Support, and they suggested we upgrade either our Linux kernel or our Kubernetes version, because they found evidence of an issue with the way the kernel handles Kubernetes network traffic.

Anyway, I upgraded from kernel version 5.11.0-1021-aws to 5.13.0-1021-aws and, sure enough, where I was driving < 10Gbps to the internet gateway before, I am now able to drive 17-19Gbps. It's not exactly the 25Gbps I was hoping for, but it's a solid improvement.
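(For anyone else hitting this on Ubuntu 20.04, the change on our side was just a kernel package upgrade plus a reboot; roughly the following, with the package names inferred from the version string rather than anything AWS Support prescribed:)

sudo apt-get update
# install the newer AWS-tuned kernel (5.13.0-1021-aws in our case)
sudo apt-get install -y linux-image-5.13.0-1021-aws linux-modules-5.13.0-1021-aws
sudo reboot

# after the reboot
uname -r    # should now report 5.13.0-1021-aws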

I'm not exactly sure where the issue lies. If you have any details about it, let me know!

@jayanthvn
Contributor

Thanks for the update. We also have the ticket in our queue and will look into it.

The EKS AMI kernel version is 5.4.181-99.354.amzn2.x86_64, so I might be able to repro this with the EKS AMI too.

@jwolski2
Author

One last update from our side: we pushed the kernel upgrade out to production, and our app is now able to achieve at least 25Gbps, whereas before it was topping out at 13-15Gbps. I haven't tested how far we can push the net I/O, as we're happy enough now with performance.

You may close this issue whenever you find it appropriate, @jayanthvn.

@jayanthvn
Contributor

Thanks for the confirmation. So it looks like it was a kernel issue.
