Routing Issue Outside VPC #53

Closed
incognick opened this Issue Mar 29, 2018 · 24 comments

8 participants

incognick commented Mar 29, 2018

I'm experiencing a routing issue from outside the VPC where my EKS cluster is located. My setup is as follows:

  • VPC A with three private subnets, plus a fourth, public subnet with a NAT gateway.
  • VPC B with VPN access.
  • A peering connection between the two.

VPC A houses my EKS cluster with 3 worker nodes each in a different subnet. VPC B is our existing infrastructure (different region) with VPN access.

Sometimes (not always), I have trouble reaching a pod from VPC B: the connection times out, and ping doesn't work either. If I SSH into one of the worker nodes in VPC A, I can reach the pod just fine.

  • I have confirmed this is not an ACL or security group (SG) issue, as I can reach other pods on the same node.
  • This is not confined to a single subnet.

Let me know if you need more information, as I can reproduce this pretty easily. I posted this question in the AWS EKS Slack channel and they directed me to create an issue here.

Thank you!

incognick (Author) commented Mar 29, 2018

FYI @bchav

lbernail (Contributor) commented Mar 29, 2018

@incognick : this will happen when the pod IP is on a secondary ENI

The plugin (version 0.1.4) does not work across VPC peering (see issue #44).
We are looking at using this plugin not in EKS but on our own cluster. Here are a few more details on the issue with the plugin today:

  • it SNATs all traffic sent outside the VPC CIDR block to the main interface IP:
    sudo iptables -t nat -nL POSTROUTING
    SNAT all -- 0.0.0.0/0 !172.30.0.0/16 /* AWS, SNAT */ ADDRTYPE match dst-type !LOCAL to:172.30.70.171

    => this works for incoming traffic thanks to conntrack, but you lose the pod IP when traffic is sent from a pod

  • it uses an ip rule to force all traffic sent to IP addresses outside the VPC through the primary interface:
    ip rule
    1024:	not from all to 172.16.0.0/16 lookup main

    => so incoming traffic is dropped by the reverse path filter (example with a pod with IP 172.16.0.100 on ENI 2 (ens6) and an instance in the second VPC with IP 172.17.0.200):

    ip route get 172.16.0.100 from 172.17.0.200 iif ens6
    RTNETLINK answers: Invalid cross-device link
  • disabling rp_filter is not enough, because the traffic is then dropped by the AWS source/destination check (traffic from a pod IP associated with ENI 2 is sent via ENI 1):
    echo 0 | sudo tee /proc/sys/net/ipv4/conf/{all,ens5,ens6}/rp_filter
  • it works if you disable the source/destination check, but it does not make sense for this traffic to go through the primary interface anyway

Today, we use a patched version of the image. It is not really something I can include in a PR, because I simply removed the call to the function setting up the rule and NAT, but I'm happy to discuss it.
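
To make the SNAT behaviour concrete, here is a minimal Python model of the POSTROUTING rule listed above. It is a sketch, not plugin code: `snat_source` is a hypothetical helper, conntrack is not modelled, and the pod IP and the 10.10.0.5 destination are illustrative values; only the VPC CIDR and the node IP come from the listing.

```python
import ipaddress


def snat_source(src: str, dst: str, vpc_cidr: str, primary_ip: str,
                local_addrs: frozenset = frozenset()) -> str:
    """Source address a packet leaves the node with, per the rule
    'SNAT all -- 0.0.0.0/0 !<VPC CIDR> ADDRTYPE match dst-type !LOCAL to:<primary IP>'.
    Conntrack (which reverses the translation for replies) is not modelled."""
    # "!<VPC CIDR>": only destinations outside the VPC are rewritten.
    outside_vpc = ipaddress.ip_address(dst) not in ipaddress.ip_network(vpc_cidr)
    # "dst-type !LOCAL": addresses owned by this host are never rewritten.
    not_local = dst not in local_addrs
    return primary_ip if outside_vpc and not_local else src


VPC_CIDR = "172.30.0.0/16"
NODE_IP = "172.30.70.171"   # primary interface IP from the listing above
POD_IP = "172.30.12.34"     # hypothetical pod IP on a secondary ENI

# Pod -> off-VPC destination (peered VPC, VPN, Direct Connect): pod IP is lost.
print(snat_source(POD_IP, "10.10.0.5", VPC_CIDR, NODE_IP))    # 172.30.70.171
# Pod -> destination inside the VPC: pod IP is preserved.
print(snat_source(POD_IP, "172.30.99.1", VPC_CIDR, NODE_IP))  # 172.30.12.34
```

This is why the remote side sees the node IP instead of the pod IP for any destination outside the VPC CIDR.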

incognick (Author) commented Mar 30, 2018

@lbernail Thanks for the response. Hopefully this can be addressed soon!

edwize commented Mar 30, 2018

@lbernail, I've been dealing with a similar issue routing over my VPN from the VPC. Pods with IPs from the primary eth0 pool work fine with the office network (172.33.x.x <-> 10.10.x.x). Pods using a secondary interface work fine for pod->office, but not for office->pod: the office->pod traffic comes in on eth1 but goes out on eth0. I'd like to discuss "I simply removed the call to the function setting up the rule and NAT".
Since this project is rapidly evolving, I don't mind having my own patched version until routing for additional CIDRs is standardized.

edwize commented Mar 30, 2018

@lbernail I got my POC working by doing the following directly on my Node, but would like to have the plug-in fixed to automatically apply to new Nodes and in a flexible way:

SOURCE/DEST CHECK on ENIs:
Set the Source/Destination check to "false" on eth0-eth2 via the GUI

REVERSE PATH FILTERING was already off (zero):
sysctl -a | grep rp_filter | grep -v arp_filter

DELETE SNAT RULE:
iptables -t nat -L POSTROUTING
iptables -t nat -D POSTROUTING ! -d 172.33.0.0/16 -m comment --comment "AWS, SNAT" -m addrtype ! --dst-type LOCAL -j SNAT --to-source 172.33.16.129
iptables -t nat -L POSTROUTING

DELETE IP RULE:
sudo ip rule show
sudo ip rule del prio 1024    # the "not from all to 172.16.0.0/16 lookup main" rule
sudo ip route flush cache
sudo ip rule show

I was then able to curl IPs from both the eth0 and eth1 ENIs, and pod<->office traffic over the VPN worked.
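
The rp_filter check in the steps above can also be scripted. The kernel's ip-sysctl documentation states that the maximum of conf/all/rp_filter and conf/<interface>/rp_filter is used for source validation on an interface, so a per-interface 0 is not enough if `all` is set. This Python sketch parses `sysctl -a`-style output (the sample text is made up, and `effective_rp_filter` is a hypothetical helper) and reports the effective value per interface; `default` is skipped because it only seeds interfaces created later.

```python
import re

# Matches e.g. "net.ipv4.conf.eth0.rp_filter = 1" (arp_filter lines do not match).
_RP_LINE = re.compile(r"^net\.ipv4\.conf\.([^.\s]+)\.rp_filter\s*=\s*(\d+)$")


def effective_rp_filter(sysctl_output: str) -> dict:
    """Effective rp_filter per interface: max(conf/all, conf/<iface>)."""
    values = {}
    for line in sysctl_output.splitlines():
        match = _RP_LINE.match(line.strip())
        if match:
            values[match.group(1)] = int(match.group(2))
    all_value = values.pop("all", 0)
    values.pop("default", None)  # only seeds interfaces created later
    return {iface: max(all_value, value) for iface, value in values.items()}


SAMPLE = """\
net.ipv4.conf.all.arp_filter = 0
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 0
net.ipv4.conf.eth1.rp_filter = 1
"""

# 0 = no source validation, 1 = strict, 2 = loose
print(effective_rp_filter(SAMPLE))  # {'eth0': 0, 'eth1': 1}
```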

lbernail (Contributor) commented Mar 31, 2018

@edwize : yes, this limitation applies to any traffic outside the VPC CIDR (peered VPCs, VPN connections and Direct Connect links).

A quick note on what you need: once you remove the IP rule (ip rule del prio 1024), you don't need to disable rp_filter (if it is enabled) or the source/destination check, because traffic from pods with IPs on secondary ENIs will use the proper ENI thanks to the priority-1536 rules added by the plugin, such as:

ip rule 
1536:	from 172.16.0.100 lookup 2
1536:	from 172.16.0.101 lookup 3

With route table 2 forcing traffic through ENI 2 and route table 3 forcing it through ENI 3 (in my case the primary ENI is ens5, and ens6 and ens7 are ENIs 2 and 3):

ip route show table 2
default via 172.16.0.1 dev ens6
172.16.0.1 dev ens6  scope link
ip route show table 3
default via 172.16.0.1 dev ens7
172.16.0.1 dev ens7  scope link
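
A simplified Python model of the rule lookup and of a strict reverse path check may help show why removing the 1024 rule is enough. Assumptions: each routing table is reduced to a single egress device (main -> ens5, 2 -> ens6, 3 -> ens7, as in the output above), a catch-all `32766: from all lookup main` rule stands in for the kernel's default rules, and `Rule`, `egress_device` and `strict_rp_accepts` are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

# Each table reduced to its egress device (from the route tables above).
TABLE_DEVICE = {"main": "ens5", "2": "ens6", "3": "ens7"}


@dataclass(frozen=True)
class Rule:
    prio: int
    table: str
    src: Optional[str] = None     # "from <prefix>"; None means "from all"
    not_to: Optional[str] = None  # "not from all to <prefix>": dst outside prefix

    def matches(self, src: str, dst: str) -> bool:
        if self.src and ipaddress.ip_address(src) not in ipaddress.ip_network(self.src):
            return False
        if self.not_to and ipaddress.ip_address(dst) in ipaddress.ip_network(self.not_to):
            return False
        return True


def egress_device(rules: list, src: str, dst: str) -> str:
    """Walk the rules in priority order; the first match picks the table."""
    for rule in sorted(rules, key=lambda r: r.prio):
        if rule.matches(src, dst):
            return TABLE_DEVICE[rule.table]
    raise LookupError("no rule matched")


def strict_rp_accepts(rules: list, pkt_src: str, pkt_dst: str, iif: str) -> bool:
    """Strict rp_filter: the route back to the sender must use the incoming interface."""
    return egress_device(rules, src=pkt_dst, dst=pkt_src) == iif


POD_RULES = [
    Rule(1536, "2", src="172.16.0.100/32"),
    Rule(1536, "3", src="172.16.0.101/32"),
    Rule(32766, "main"),  # stands in for the kernel's "from all lookup main"
]
WITH_1024 = POD_RULES + [Rule(1024, "main", not_to="172.16.0.0/16")]

# Packet from 172.17.0.200 (peered VPC) to the pod, arriving on ENI 2 (ens6).
print(strict_rp_accepts(WITH_1024, "172.17.0.200", "172.16.0.100", "ens6"))  # False
print(strict_rp_accepts(POD_RULES, "172.17.0.200", "172.16.0.100", "ens6"))  # True
```

With rule 1024 present, the reply route for 172.16.0.100 -> 172.17.0.200 resolves to ens5 while the packet arrived on ens6, which matches the "Invalid cross-device link" error shown earlier in the thread; without it, the 1536 rule selects table 2 and ens6.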

Bear in mind that if you do this, you can't use nodes in public subnets (with public IP addresses on the primary interface): no public IPs are associated with the pod IPs, so the Internet Gateway will not NAT their traffic to a public IP. This is not an issue in our case because we run our cluster in private subnets only.
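
The public-subnet caveat can be modelled in two steps: the node's SNAT rule (when present) rewrites the source, then the Internet Gateway translates it 1:1 only if that private address has an associated public IP. A minimal Python sketch, where `internet_source`, the private addresses and the 203.0.113.10 public IP (a documentation address) are all hypothetical:

```python
from typing import Dict, Optional


def internet_source(pod_ip: str, node_ip: str, node_snat: bool,
                    public_ip_for: Dict[str, str]) -> Optional[str]:
    """Public source address seen on the Internet, or None if the Internet
    Gateway has no public IP to map the private source to."""
    # On the node: the plugin's SNAT rule (if installed) replaces the pod IP.
    private_src = node_ip if node_snat else pod_ip
    # At the IGW: 1:1 NAT, only for private IPs with an associated public IP.
    return public_ip_for.get(private_src)


NODE_IP, POD_IP = "172.16.0.10", "172.16.0.100"   # hypothetical addresses
PUBLIC = {NODE_IP: "203.0.113.10"}               # only the node has a public IP

print(internet_source(POD_IP, NODE_IP, node_snat=True, public_ip_for=PUBLIC))   # 203.0.113.10
print(internet_source(POD_IP, NODE_IP, node_snat=False, public_ip_for=PUBLIC))  # None
```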

I'll create a quick branch with the fix we currently use so you can test it if you want.

lbernail (Contributor) commented Mar 31, 2018

@edwize : you can build a custom image from this branch: https://github.com/lbernail/amazon-vpc-cni-k8s/tree/lbernail/disable-nat-rule

It contains 2 additional commits compared to master:

  • one ensuring that logs are flushed properly (I created a PR for this on the main repo) which makes debugging a lot easier
  • one removing the call to the function c.networkClient.SetupHostNetwork, which configures the priority-1024 rule forcing traffic exiting the VPC through the main interface and creates the SNAT rule that changes the source address to the node's main IP address (the other commented-out lines are only there so the code builds without errors about unused variables or imports)

edwize commented Mar 31, 2018

@lbernail Thanks for the branch and the advice. I assume this project lags behind internal EKS work, because the lack of VPN and VPC-to-VPC support is surprising.

eswarbala commented Apr 4, 2018

Really appreciate the discussion here. We are planning to add a flag that disables the NATing to support the scenarios as discussed here.

edwize commented Apr 5, 2018

@eswarbala That would be appreciated. I was able to use @lbernail's branch with the single NAT change to build a custom container, and it deployed successfully. I'm using KOPS and found that the "amazon-k8s-cni:0.1.1" container was hard-coded, so now I have a custom build of that project too. Whee!

Dieler commented Apr 19, 2018

@edwize I haven't worked with Go yet but would really like to build my own custom container from that branch. Can you please provide some guidance on how to build this project?

edwize commented Apr 23, 2018

@Dieler I hadn't worked with Go either, and I found that setting up Go on my Mac may not have created the exact environment needed to recompile AWS's CNI. However, the Kubernetes project includes a Dockerized build environment with all the tools and libraries, so I hacked together this not-so-pretty method on an Ubuntu 16.04 server:

  1. Get the Kubernetes project: git clone https://github.com/kubernetes/kubernetes
  2. Enter the build container: cd kubernetes; build/shell.sh
  3. Add AWS's CNI to test a clean build
    go get github.com/aws/amazon-vpc-cni-k8s
    go get -u github.com/golang/dep/cmd/dep
    cd $GOPATH/src/github.com/aws/amazon-vpc-cni-k8s
    git rm --cached vendor/k8s.io/kubernetes
    dep status    # dep is the Go dependency tool installed above with "go get -u github.com/golang/dep/cmd/dep"
    dep ensure
    make          # builds successfully
  4. Add the modified CNI inside the Kubernetes project's Docker container
    go get github.com/lbernail/amazon-vpc-cni-k8s
    cd $GOPATH/src/github.com/lbernail/amazon-vpc-cni-k8s
    git rm --cached vendor/k8s.io/kubernetes
    git checkout origin/lbernail/disable-nat-rule
  5. Go back to the AWS directory
    cd $GOPATH/src/github.com/aws/amazon-vpc-cni-k8s
    cp ../../lbernail/amazon-vpc-cni-k8s/ipamd/ipamd.go ipamd/ipamd.go
    rm verify-network verify-aws aws-cni aws-k8s-agent
    make
  6. Get CNI out of the build container:
    (from another terminal on the Ubuntu server)
    docker ps ==> 4f54652ae68f ( new amazon CNI container )
    docker cp 4f54652ae68f:/go/src/github.com/aws/amazon-vpc-cni-k8s/aws-k8s-agent ~/eddie/
  7. From terminal outside of build environment, run Docker
    cd ~/eddie/
    git clone http://github.com/lbernail/amazon-vpc-cni-k8s
    cd amazon-vpc-cni-k8s/
    cp ../aws-* .
    docker build -f scripts/dockerfiles/Dockerfile.release -t "amazon/amazon-k8s-cni:latest" .
    docker images ==> "amazon/amazon-k8s-cni"
  8. TAG image and put in your private repo
    docker images
    docker tag a9f6a99f9ccc yourcompany/amazon-k8s-cni:0.0.1
    docker push yourcompany/amazon-k8s-cni:0.0.1
  9. Now if you are using KOPS, you have to rebuild it too, because the CNI image is hard-coded to AWS's 0.1.1 release
    ( the essential change is below; the KOPS project has build info that is useful )
    vi upup/models/cloudup/resources/addons/networking.amazon-vpc-routed-eni/0.1.1-kops.1.yaml.template
    (was) image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:0.1.1
    (now) image: yourcompany/amazon-k8s-cni:0.0.1
    git commit -am "swap routed-eni to yourcompany/amazon-k8s-cni:0.0.1"
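
Step 8 only works if the repository name is lowercase: Docker rejects references such as `YOURCOMPANY/amazon-k8s-cni:0.0.1`. The Python check below is a simplified version of the Docker reference grammar (no digests, no length limits), written for illustration; `is_valid_image_ref` is a hypothetical helper, not part of any Docker tooling.

```python
import re

# Simplified from the Docker reference grammar (no digests, no length limits).
_COMPONENT = re.compile(r"[a-z0-9]+(?:(?:[._]|__|-+)[a-z0-9]+)*")
_DOMAIN = re.compile(r"[a-zA-Z0-9](?:[a-zA-Z0-9.-]*[a-zA-Z0-9])?(?::[0-9]+)?")
_TAG = re.compile(r"\w[\w.-]{0,127}")


def is_valid_image_ref(ref: str) -> bool:
    """Check a 'name[:tag]' image reference such as 'yourcompany/amazon-k8s-cni:0.0.1'."""
    name = ref
    colon, slash = ref.rfind(":"), ref.rfind("/")
    if colon > slash:  # a ':' after the last '/' starts the tag
        name, tag = ref[:colon], ref[colon + 1:]
        if not _TAG.fullmatch(tag):
            return False
    parts = name.split("/")
    first = parts[0]
    # The first component is a registry host if it looks like one.
    if len(parts) > 1 and ("." in first or ":" in first or first == "localhost"):
        if not _DOMAIN.fullmatch(first):
            return False
        parts = parts[1:]
    # Repository path components must be lowercase.
    return all(_COMPONENT.fullmatch(part) for part in parts)


for ref in ("yourcompany/amazon-k8s-cni:0.0.1",
            "YOURCOMPANY/amazon-k8s-cni:0.0.1",
            "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:0.1.1"):
    print(ref, is_valid_image_ref(ref))
```

Running a check like this before `docker tag` catches the uppercase mistake before the push fails.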

robbrockbank (Contributor) commented May 9, 2018

I hit this same issue today where I've set up a customer gateway/VPN to another network. I'm using BGP to advertise routes between AWS and my other network.

Whilst I am able to successfully route from my AWS EKS pods to my remote network, the SNAT-ing of the pod IP is causing other issues on my remote node (in particular when setting up policy rules using Calico, where I assume the source address is the pod IP).

@eswarbala : You mention adding a flag to disable the NAT-ing. I was wondering what that might look like in terms of API and behavior, would it be a simple "disable always" which would omit adding the SNAT rules?

I've also hit the RPF check issue, seemingly for some secondary ENIs but not for others.

robbrockbank (Contributor) commented May 14, 2018

@edwize: Regarding building on your Mac: I was able to get this working without any additional changes. You'll need to tell the compiler to generate a Linux binary; setting GOOS=linux before calling the make target did the trick for me:

rob$ GOOS=linux make static
go build -o aws-k8s-agent main.go
go build -o aws-cni plugins/routed-eni/cni.go
go build verify-aws.go
go build verify-network.go
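
The same cross-compilation can be scripted from Python. This is a sketch: it assumes the `make static` target shown above and x86_64 worker nodes (hence `GOARCH=amd64`, which matters when the build machine's native architecture differs); `linux_build_env` and `build_static` are hypothetical names.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def linux_build_env(base: Optional[Mapping[str, str]] = None,
                    goarch: str = "amd64") -> Dict[str, str]:
    """Copy of the environment with Go targeting Linux (GOOS=linux, as above)."""
    env = dict(os.environ if base is None else base)
    env["GOOS"] = "linux"
    env["GOARCH"] = goarch  # assumption: x86_64 worker nodes
    return env


def build_static(repo_dir: str) -> None:
    """Run `make static` in the amazon-vpc-cni-k8s checkout; raises on failure."""
    subprocess.run(["make", "static"], cwd=repo_dir, env=linux_build_env(), check=True)
```

Usage: `build_static("/path/to/amazon-vpc-cni-k8s")` from a checkout with Go and make installed.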

robbrockbank (Contributor) commented May 16, 2018

@eswarbala : I was playing around with removing the SNAT iptables rule and the VPC ip routing rule. That works great for my cluster to cluster communication but as discussed in this thread means I can't access the internet from my AWS EKS pods. I thought it might be sufficient to add a NAT gateway to my subnets, but couldn't seem to get that working.

I was wondering - would you expect that configuring a NAT gateway would cover the case where we are disabling the SNAT, and if so, are there any pointers you could give on how to set it up?

lbernail (Contributor) commented May 17, 2018

@robbrockbank It should work with NAT gateways (this is what we did). Maybe you are missing a default route to your NAT gateways?
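
VPC route tables use the most specific matching route, so a missing default route means there is no 0.0.0.0/0 entry in the pod subnets' table (or, on the NAT gateway's own subnet, none pointing at the Internet Gateway). A minimal longest-prefix-match sketch in Python; the CIDRs and the target names `nat-gateway`, `igw` and `peering` are placeholders, not real resource IDs, and `route_target` is a hypothetical helper:

```python
import ipaddress
from typing import Dict, Optional


def route_target(route_table: Dict[str, str], dst: str) -> Optional[str]:
    """Target of the most specific (longest-prefix) route matching dst, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for cidr, target in route_table.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, target)
    return best[1] if best else None


PRIVATE_SUBNET = {          # where the nodes and pods live
    "172.16.0.0/16": "local",
    "172.17.0.0/16": "peering",
    "0.0.0.0/0": "nat-gateway",
}
NAT_GW_SUBNET = {           # where the NAT gateway itself lives
    "172.16.0.0/16": "local",
    "0.0.0.0/0": "igw",
}

print(route_target(PRIVATE_SUBNET, "172.17.0.200"))  # peering
print(route_target(PRIVATE_SUBNET, "198.51.100.7"))  # nat-gateway
print(route_target(NAT_GW_SUBNET, "198.51.100.7"))   # igw
```

Both tables need their default route for pod traffic to reach the Internet once the node no longer SNATs it.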

robbrockbank (Contributor) commented May 18, 2018

@lbernail: Thanks for the follow-up, I'll give it another try. I figured it might just be a misconfiguration on my part; it's good to know that it should work! I had a default route to my NAT gateway, but presumably I hadn't set up my internet gateway correctly.

I have another follow up, which may not really belong here, but I think is useful to overall discussion of off-VPC routing.

I was looking into service discovery options to allow me to access a service via a local IP rather than routing over the public internet. To this end I configured my service to use an internal Network Load Balancer (which I realize is only supposed to be Beta at the moment and possibly not even Beta for an EKS cluster):

- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
    labels:
      app: nginx
      run: nginx
    name: nginx-internal
    namespace: rlb-cloud
  spec:
    externalTrafficPolicy: Local
    ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80
    selector:
      run: nginx
    type: LoadBalancer

To get this working I made a modification to the IAM permissions for the cluster (copied from https://gist.github.com/micahhausler/4f3a2ee540f5714e6dd91b4bacace3ae#file-create-cluster-sh-L30).

This created the NLB, and the DNS entry which points to the internal (VPC) address for the NLB. This DNS entry is globally distributed so I'm able to get the internal address of the NLB from my peered network which is promising.

Unfortunately this didn't work for a couple of reasons:

  • From within the VPC it wasn't possible to use the NLB address, as we'd hit the reverse path filtering issue for some secondary ENIs. With the patched CNI from @lbernail's branch this scenario works.
  • From outside the VPC (my peered network) this doesn't work because I can't seem to access the NLB address, and there doesn't seem to be a security group associated with it that I can modify. If I was able to access the NLB then I believe I'd be able to hit the pod addresses using the NLB internal address without any source NATing.

Anyways - I'm sharing here in case it's a useful thing to consider.

lbernail (Contributor) commented May 20, 2018

@robbrockbank maybe you were missing routes to the IGW on the subnets where you have NAT gateways?

Regarding NLB, I'm really not a specialist, but it seems that it is not possible to access an NLB across peerings: "Connectivity from clients to your load balancer is not supported over AWS managed VPN connections or VPC peering." (from https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html)
You should be able to achieve this using AWS PrivateLink: https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/endpoint-service.html

robbrockbank (Contributor) commented May 23, 2018

@lbernail : thanks for your follow-up, greatly appreciated. I'd misconfigured my routing tables, so once I figured that out it all seems to be OK now; thanks for the push. Regarding the NLB, good grief, I just didn't see that comment in the docs. Thanks for pointing that out :-)

robbrockbank (Contributor) commented May 23, 2018

I put up a PR that makes it configurable, via an environment variable, whether the aws-node image installs the SNAT and off-VPC rules (which seem to be the cause of the off-VPC routing issues). The idea is that SNAT for the containers would be handled by an explicitly configured NAT gateway. I'm not sure the approach I've taken is sensible, but I'm happy to iterate on it.
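
A sketch of the kind of switch described here, in Python for illustration. The environment variable name `EXTERNAL_SNAT` and the helper names are hypothetical (the thread does not give the name used in the PR); the two returned commands mirror the ip rule and SNAT rule discussed earlier in the thread and are rendered as strings, not executed.

```python
from typing import List, Mapping

EXTERNAL_SNAT_VAR = "EXTERNAL_SNAT"  # hypothetical variable name


def external_snat(env: Mapping[str, str]) -> bool:
    """True when SNAT is delegated to an external NAT device (e.g. a NAT gateway)."""
    return env.get(EXTERNAL_SNAT_VAR, "false").strip().lower() in ("1", "true", "yes")


def host_network_rules(vpc_cidr: str, primary_ip: str, env: Mapping[str, str]) -> List[str]:
    """Host rules to install; empty when SNAT is handled outside the node."""
    if external_snat(env):
        return []
    return [
        # Send traffic for destinations outside the VPC through the main table.
        f"ip rule add not to {vpc_cidr} table main priority 1024",
        # Rewrite the source of off-VPC traffic to the node's primary IP.
        f'iptables -t nat -A POSTROUTING ! -d {vpc_cidr} -m comment --comment "AWS, SNAT" '
        f"-m addrtype ! --dst-type LOCAL -j SNAT --to-source {primary_ip}",
    ]
```

Defaulting to "false" keeps the current behaviour for nodes in public subnets, while private-subnet clusters that route through a NAT gateway can opt out of both rules.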

IIUC, one thing that would make it more useful, though, is being able to allocate the node IPs from a different subnet than the secondary (container) IPs. That would allow the nodes to use a routing table with a default route to an IGW, and the containers to use a routing table with a default route to a NAT gateway. As it stands, configuring the EKS subnets with a default route to a NAT gateway means you have to configure specific routes to an internet gateway to allow traffic to reach the node's public IP (e.g. for SSH). (Please let me know if my thinking is wrong here, though.)

lbernail (Contributor) commented May 24, 2018

@robbrockbank I like the idea of using different subnets for the main host interface and the additional ENIs (we use this feature with another CNI plugin: https://github.com/lyft/cni-ipvlan-vpc-k8s). But this requires modifying the plugin's logic to identify the secondary ENIs' subnet (probably using tags) and to avoid adding secondary IPs to the main interface.

robbrockbank (Contributor) commented May 24, 2018

@lbernail - I was thinking of going further and having, I guess, 4 subnets: two for primary and two for secondary interfaces, so that the subnets are still split across availability zones. I'm assuming at that point the tagging would be done as part of the CloudFormation templating? Apologies if I'm talking rubbish; I'm rather new to all this, so it's a steep and slow learning curve.
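
Selecting the pod (secondary ENI) subnet by tag, per availability zone, could look like the Python sketch below. The tag key, subnet IDs and AZ names are hypothetical, and `pod_subnet_for` is an illustrative helper; a real plugin would obtain the subnet list from the EC2 API (for example DescribeSubnets) rather than a hard-coded list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

POD_SUBNET_TAG = "example.com/pod-subnet"  # hypothetical tag key


@dataclass
class Subnet:
    subnet_id: str
    availability_zone: str
    tags: Dict[str, str] = field(default_factory=dict)


def pod_subnet_for(node_az: str, subnets: List[Subnet]) -> Optional[Subnet]:
    """Tagged pod subnet in the node's AZ (lowest ID wins if several match)."""
    candidates = [s for s in subnets
                  if s.availability_zone == node_az and s.tags.get(POD_SUBNET_TAG) == "true"]
    return min(candidates, key=lambda s: s.subnet_id, default=None)


SUBNETS = [  # two node subnets and two pod subnets across two AZs (made-up IDs)
    Subnet("subnet-node-a", "us-west-2a"),
    Subnet("subnet-pods-a", "us-west-2a", {POD_SUBNET_TAG: "true"}),
    Subnet("subnet-node-b", "us-west-2b"),
    Subnet("subnet-pods-b", "us-west-2b", {POD_SUBNET_TAG: "true"}),
]

print(pod_subnet_for("us-west-2b", SUBNETS).subnet_id)  # subnet-pods-b
```

Keeping the selection per AZ matters because an ENI can only be attached to an instance in the same availability zone as its subnet.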

liwenwu-amazon added this to the v1.1 milestone Jun 22, 2018

isz-paul commented Jun 26, 2018

Something that's missing from several of these discussions: Why was the SNAT iptables entry introduced in the first place?

I'd like to completely get away from iptables connection tracking if at all possible. Ever tried a SYN flood on an iptables machine?

liwenwu-amazon (Contributor) commented Jun 28, 2018

This issue should be addressed by PR #81.
