setting AWS_VPC_K8S_CNI_EXTERNALSNAT=false adds latency #1087

Closed · bhaveshph opened this issue Jul 13, 2020 · 13 comments

bhaveshph commented Jul 13, 2020

EKS / k8s version : 1.16
aws-vpc-cni version : 1.6.1
calico version (using this only for network policy) : 3.8.1
iptables --version => iptables v1.8.2 (legacy)

The following has been noticed when using the primary IP range for pods and also when using a secondary IP range (with AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true).

  • setting AWS_VPC_K8S_CNI_EXTERNALSNAT=false (uses an iptables rule to SNAT):
    egress calls to the peered VPC take ~0.300 ms, per 'traceroute -T -p 443 app1-dns'

  • setting AWS_VPC_K8S_CNI_EXTERNALSNAT=true (does not use an iptables rule to SNAT):
    egress calls to the peered VPC take ~0.019 ms, per 'traceroute -T -p 443 app1-dns'

I also tried setting AWS_VPC_K8S_CNI_RANDOMIZESNAT=prng. The iptables rules are added fine (without a node reboot) and seem to be working correctly, but the latency remains the same as mentioned above.
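
For reference, this is roughly how I'm checking that the SNAT rules (and the --random-fully flag added by the prng option) are in place on a worker node; the exact chain names depend on the CNI version, so treat this as a sketch:

  # on the worker node: dump the NAT rules installed by the CNI
  sudo iptables -t nat -S | grep -i snat
  # with AWS_VPC_K8S_CNI_RANDOMIZESNAT=prng the SNAT rule should carry --random-fully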

Has anyone seen this before?
Am I missing something?
Is this a known issue where SNATing with iptables on the EC2 node adds latency?

mogren (Contributor) commented Jul 15, 2020

Hi @bhaveshph,

If you have a peered VPC, you can set AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS (see the README) with the CIDRs you want to avoid SNAT for. See #53 for some background on this issue.
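
For example, the setting can be applied to the aws-node DaemonSet roughly like this (the CIDR below is just a placeholder for your peered VPC's range):

  kubectl set env daemonset aws-node -n kube-system \
    AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS="10.1.0.0/16"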

I'm not sure where the increased latency comes from, though. It used to not work at all with peered VPCs when SNAT was enabled, so it is possible that using SNAT triggers some alternate routing that adds latency. This blog post on the subject is pretty good: https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/

bhaveshph (Author) commented Jul 15, 2020

Hi @mogren

I saw awslabs/amazon-eks-ami#505, upgraded EKS from 1.16 to 1.17, and am now using kube-proxy 1.17.7.
So the prng option works and adds --random-fully, the iptables version is good, and kube-proxy is happy as well.

I tried the following options for the CNI, and the latency still remains the same.

env:
  AWS_VPC_K8S_CNI_LOGLEVEL: DEBUG
  AWS_VPC_K8S_CNI_VETHPREFIX: eni
  AWS_VPC_ENI_MTU: "9001"
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: true
  ENI_CONFIG_LABEL_DEF: failure-domain.beta.kubernetes.io/zone
  AWS_VPC_K8S_CNI_EXTERNALSNAT: false
  AWS_VPC_K8S_CNI_RANDOMIZESNAT: prng
  AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS: 100.67.0.0/16

Let me also mention: when I set AWS_VPC_K8S_CNI_EXTERNALSNAT: true, which disables SNAT, the latency does go down.
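
For completeness, in the aws-node DaemonSet pod spec these settings end up as string values, roughly like this (trimmed to the SNAT-related ones):

  env:
    - name: AWS_VPC_K8S_CNI_EXTERNALSNAT
      value: "false"
    - name: AWS_VPC_K8S_CNI_RANDOMIZESNAT
      value: "prng"
    - name: AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS
      value: "100.67.0.0/16"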

bhaveshph (Author) commented Jul 17, 2020

@mogren
So I learned that the CIDR in AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS is the peered VPC's CIDR range, which basically disables SNAT for that range rather than globally via AWS_VPC_K8S_CNI_EXTERNALSNAT: true.

Now, with that exclude in place (AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS), the nodes started advertising the 100.* pod IPs to the peered VPC, which is neither whitelisted nor feasible for us.
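
For anyone reproducing this, a quick way to see which source IP the peered VPC actually receives is something like the following on an instance in the peered VPC (the interface name is just an example):

  sudo tcpdump -ni eth0 tcp port 443
  # with the exclude in place, source addresses show up as 100.x.x.x pod IPs
  # instead of the node's primary IP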

So we are back to square one: we are still facing the latency issue with AWS_VPC_K8S_CNI_EXTERNALSNAT: false.
I see from the README that this is the default behavior, so everyone would be hitting this for egress calls. I am not sure if it has been reported before.

Thanks,
Bhavesh

CarloColumna commented Aug 3, 2020

We have also experienced increased latency with AWS_VPC_K8S_CNI_EXTERNALSNAT: false on our peered VPCs. When we disabled SNAT by setting it to true, the latency went down. This followed an upgrade to EKS v1.15.

mogren (Contributor) commented Aug 11, 2020

Just a quick update @bhaveshph and @CarloColumna.

We have done some testing and narrowed it down to a kernel issue affecting connection tracking when doing SNAT. In our tests, things look fine with kernel-4.14.173 (4.14.173-137.229.amzn2.x86_64). When the kernel is upgraded to the next available version, kernel-4.14.177 (4.14.177-139.254.amzn2), we see increased latency when SNAT is enabled.
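
For anyone who wants to check which of these kernels their nodes are running, something like this is enough:

  # on the worker node
  uname -r
  # e.g. 4.14.173-137.229.amzn2.x86_64 (looks fine) vs 4.14.177-139.254.amzn2 (shows the latency)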

We still have to bisect down to the exact change that is triggering this, which will take some more time; just letting you know that we are not ignoring the issue.

bhaveshph (Author) commented

Hi @mogren
Appreciate the update very much.

Hope this gets resolved, as it would help a lot with CIDR expansion without much overhead.

Thanks,
Bhavesh

bhaveshph (Author) commented

Hi @mogren and @SaranBalaji90,
Please let me know if you have any update on this one.

Thanks,
Bhavesh

mmerkes (Contributor) commented Oct 1, 2020

@bhaveshph We're going to discuss this with the Amazon Linux team to get some help. We'll post any updates here if we can narrow down the issue. Once we identify the issue, we can fix it and release new AMIs.

bhaveshph (Author) commented

Hi @mmerkes
Just wondering if you have an update; please let me know.

Thanks,
Bhavesh

jayanthvn (Contributor) commented

Hi @bhaveshph

Can you please upgrade to the latest AMI and see if the issue is resolved? If not, we can debug this further.

Thank you.

bhaveshph (Author) commented

@jayanthvn

From an internal support ticket, I learned that the fix went into the Amazon Linux 2 kernel version 5.4 (5.4.58-32.125.amzn2.x86_64) [kernel-ng, 5.9] and that there is a plan to backport this fix to the Amazon Linux 2 4.14 kernel.
It was also mentioned that, unfortunately, there is no AMI planned with this kernel version.

So can you please let me know which amazon-eks-ami comes with a kernel that has the fix?

My bandwidth to try this out is low right now, but once an AMI is available, I can definitely make some time.

Thanks,
Bhavesh

jayanthvn (Contributor) commented

Hi,

The current EKS AMI kernel version is 4.14.203 (https://github.com/awslabs/amazon-eks-ami/releases).
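
You can check which kernel version your nodes are currently running without logging into them, for example:

  kubectl get nodes -o wide
  # the KERNEL-VERSION column shows the running kernel for each node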

jayanthvn (Contributor) commented

Hi,

Closing this issue for now. Please try the latest AMI and feel free to reach out if the issue still exists.
