Fix return path of NodePort traffic. #130

Merged
merged 1 commit into aws:master on Aug 21, 2018

Conversation

@fasaxc (Contributor) commented Jul 10, 2018

Issue #, if available:

Fixes #75

Description of changes:

Add iptables and routing rules that

  • connmark traffic that arrives at the host over eth0
  • restore the mark when the traffic leaves a pod veth
  • force marked traffic to use the main routing table so that it
    exits via eth0.

Testing performed

  • Spun up a 2 node cluster and installed the plugin (and Calico with the IptablesMangleAllowAction=RETURN).
  • Started a 2-pod nginx service; at least one nginx pod was on a non-primary ENI (checked with ip rule)
  • Exposed the nginx pods as a NodePort service
  • Repeatedly checked connectivity from both nodes to the service, going via the other node's NodePort. Worked as expected. (Without fix, saw 50% connection failure rate.)
  • Checked connectivity to the service from a busybox pod to its cluster IP and domain name.
  • Killed/restarted the aws-node pod.
  • Checked logs for errors/warnings.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@liwenwu-amazon (Contributor) left a comment:

This change may become incompatible with other features such as:

  • adding Pod IPs to an NLB/ALB target group and using the NLB/ALB to send traffic directly to the Pod IP

  • Pods using different subnets and security groups from eth0.

@lxpollitt commented:

@liwenwu-amazon Can you expand on your comments? I thought the only traffic this PR affects is NodePort traffic arriving via eth0, so it's not clear to me why this change would be incompatible with NLB/ALB routing directly to the pod IP, or with pods using different subnets or security groups than eth0. It doesn't sound like either of those future use cases involves traffic being sent to a NodePort?

@fasaxc (Contributor, Author) commented Jul 11, 2018

@liwenwu-amazon I think @lxpollitt is right, this PR only affects traffic that arrives at eth0 and then returns from a pod. I think that:

  • ALB/NLB direct to pod traffic will arrive over ethN and return via ethN. (If it arrives at eth0 then it is going to a pod that is homed on eth0 so, although my change will be activated it'll be a no-op.)
  • Any solution to the SG/subnet issue that uses ENIs will need to route NodePort traffic to the correct ENI, which means that it won't trigger my new rules so the behaviour will be correct.
  • Doing direct ALB/NLB is desirable but it won't fix NodePorts, which are a basic feature of Kubernetes. NodePorts can be used for other purposes in addition to ALB/NLB so I think we should fix them even if ALB/NLB is going to be improved.

@liwenwu-amazon (Contributor) commented:

@fasaxc @lxpollitt, thanks for the PR. Can you add more detail on how you verified it, such as:

  • what tools you used to verify that the iptables rules are invoked in the expected order by the Linux kernel
    • for NodePort traffic to pods using non-eth0 secondary IPs
    • for other, non-NodePort traffic
  • an analysis of whether there is any performance impact on non-NodePort traffic
  • any extra guidance on how to troubleshoot

@fasaxc (Contributor, Author) commented Jul 11, 2018

@liwenwu-amazon the rules are static, and they amount to

iptables -t mangle -A PREROUTING -i eth0 -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-mark 0x80/0x80
iptables -t mangle -A PREROUTING -i eni+ -j CONNMARK --restore-mark --mask=0x80

and a routing rule that matches on the 0x80 mark bit to use the main routing table.
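
For illustration only, a rough command-line equivalent of that routing rule (not necessarily how the plugin installs it; the preference value 1024 simply matches the ip rule show output quoted later in this thread) would be:

ip rule add fwmark 0x80/0x80 lookup main pref 1024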

The mangle PREROUTING chain is executed just before the nat PREROUTING chain, before the routing decision.

The first rule only takes effect if the incoming packet comes in via eth0 (-i eth0) and is going to an IP owned by eth0 (--dst-type LOCAL --limit-iface-in). The latter check ensures that we're really dealing with a host IP and not a packet that is going to a pod that happens to be homed on eth0. Traffic that comes in on a secondary ENI won't match so we won't set the connmark for those packets.

The second rule matches only packets that are leaving local pods and it restores the mark bit. That means that the mark bit will only be set on packets that are part of a connection that started by coming in via eth0 to a host IP. Hence, only connections that arrived on eth0 and were NATted to a pod will have the mark bit set, which should only catch NodePort traffic.

This PR doesn't do anything for non-eth0 secondary IPs right now, those packets should be processed as before. I wasn't sure what the desired behaviour of such IPs was. The PR could be extended to use a mark value per ENI but that would require quite a bit of additional work. If a user is already ENI-aware, presumably they'll only be accessing pods attached to that ENI so the routing should just work in that case?

We use similar rules in Calico and the impact of 1-2 rules of that kind should be negligible. The non-NodePort traffic should fail the first match criteria on each rule, which amounts to a 4-byte string comparison.

For troubleshooting:

  • I found it useful to use tcpdump on each ENI when diagnosing the issue originally. We observed the correct packets (i.e. correct source and dest) being routed out of the wrong interface.
  • The real rules have "AWS" in the comment, so you can see if they're being hit with sudo iptables-save -c | grep AWS, for example.
  • Worth looking for any other uses of CONNMARK or the configured mark bit in the dataplane. (I chose a bit for the default value that is not used by either kube-proxy or Calico.)
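
As a concrete illustration of those checks, on a node one could run something like:

sudo iptables-save -t mangle -c | grep AWS    # packet/byte counters for the connmark rules
ip rule show | grep fwmark                    # the mark-based routing rule
sudo iptables-save -c | grep -i connmark      # any other users of CONNMARK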

@liwenwu-amazon (Contributor) commented:

@fasaxc I am testing the change using the https://kubernetes.io/docs/tasks/access-application-cluster/connecting-frontend-backend/ example on a 1-node cluster. I am not able to see the LB instance become healthy on the ELB.

@fasaxc (Contributor, Author) commented Jul 12, 2018

@liwenwu-amazon That's odd, I'll try to reproduce here. Please can you give me details of your setup?

  • Was this using the AWS cloud provider to set up a LoadBalancer (or a manually configured ELB)?
  • Was this an EKS system or set up using a different installer?
  • Was the backing pod on the main ENI or another one?

@fasaxc (Contributor, Author) commented Jul 12, 2018

I was able to reproduce locally. Previously I was testing on a kops cluster but after switching to EKS I hit issues with the fix. I'll keep digging to see if I can get to the bottom of it.

@fasaxc (Contributor, Author) commented Jul 12, 2018

It looks like the difference between kops and EKS is that EKS has strict RPF filtering on eth0 whereas it was disabled on the kops cluster. I find that the fix does seem to work if I set the RPF check on eth0 to "loose" mode with sysctl -w net.ipv4.conf.eth0.rp_filter=2. Does that work for you? Is it something that you'd consider accepting as part of the fix?
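
For anyone reproducing this, an illustrative way to inspect and change that setting on a node (the kernel uses the higher of the "all" and per-interface rp_filter values):

sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter    # 0=off, 1=strict, 2=loose
sudo sysctl -w net.ipv4.conf.eth0.rp_filter=2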

@liwenwu-amazon (Contributor) commented:

@fasaxc thanks for the RPF finding. Do you know why the unhealthy instance target only happens on a cluster that contains 1 node? If a cluster has 2 nodes, both of them become healthy.

@liwenwu-amazon (Contributor) commented:

@fasaxc, can you explain why changing eth0 to "loose" mode makes it work? In other words, if it is not set (as is the case today in EKS), which particular packet gets dropped and makes the instance unhealthy? Thanks.

@fasaxc (Contributor, Author) commented Jul 13, 2018

@liwenwu-amazon With "strict" RPF, what happens is

  • ELB packet arrives via eth0, gets DNATed to (local) pod IP on secondary ENI
  • DNATted packet is RPF checked
    • kernel tries to find a route from <pod IP on secondary ENI> to <ELB IP>
    • it finds the route out of eth1
    • RPF check fails because strict mode requires that the reverse route is for the same interface.

With "loose" RPF, the check passes because the kernel accepts any return route rather than requiring the return route to leave the same interface.

In my testing I used two nodes and 2 "frontend" pods, one on each host; in that scenario, I do understand why both nodes get marked as healthy. This happens because the DNAT on each host chooses the frontend pod at random. In addition, after an RPF failure, the kernel seems to drop the NAT state for the connection (so retried packets may go to a different DNAT rule). That means that you get this scenario:

  • remote host's SYN packet arrives via eth0, gets DNATed to (local) pod IP on secondary ENI
  • RPF check fails, packet dropped
  • remote host resends the same SYN packet (standard TCP behaviour)
  • remote host's SYN packet arrives via eth0, gets DNATed to remote pod IP
  • (since there are no special routing rules for remote pod IPs) RPF check passes and packet is forwarded to remote pod
  • connection succeeds

In manual testing, I was seeing the SYN retries with tcpdump and the latency of a connection that went through this retry was 1s more than normal. With an ELB it was showing the nodes as healthy but I was seeing random latency spikes through the ELB.
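
A rough way to observe those retried SYNs is something like the following (the interface and port are illustrative; substitute the actual NodePort):

sudo tcpdump -ni eth0 'tcp[tcpflags] & tcp-syn != 0 and port 30080'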

@fasaxc (Contributor, Author) commented Jul 13, 2018

I've updated the PR to set the "loose" setting on eth0. PTAL.

@liwenwu-amazon (Contributor) commented:

@fasaxc I am a bit confused by your notes on why the RPF check failed.
Here is my setup:
ELB (192.168.167.142) <----> Node (eth0, 192.168.132.34) <----> FrontEnd-Pod(192.168.157.136)

kubectl get pod -o wide
NAME                        READY     STATUS    RESTARTS   AGE       IP                NODE
frontend-766c875db4-wqr5p   1/1       Running   0          18h       192.168.157.136   ip-192-168-132-34.us-west-2.compute.internal
hello-7ff54bc875-4lfjq      1/1       Running   0          19h       192.168.154.231   ip-192-168-132-34.us-west-2.compute.internal
hello-7ff54bc875-4n429      1/1       Running   0          19h       192.168.131.110   ip-192-168-132-34.us-west-2.compute.internal
hello-7ff54bc875-9vqk7      1/1       Running   0          19h       192.168.173.131   ip-192-168-132-34.us-west-2.compute.internal
hello-7ff54bc875-gth7n      1/1       Running   0          19h       192.168.146.111   ip-192-168-132-34.us-west-2.compute.internal
hello-7ff54bc875-q4mbn      1/1       Running   0          19h       192.168.180.61    ip-192-168-132-34.us-west-2.compute.internal
hello-7ff54bc875-rwp57      1/1       Running   0          19h       192.168.161.31    ip-192-168-132-34.us-west-2.compute.internal
hello-7ff54bc875-tcdl5      1/1       Running   0          19h       192.168.159.100   ip-192-168-132-34.us-west-2.compute.internal

Here is the route table for the FrontEnd-Pod:

ip route
default via 192.168.128.1 dev eth0 
169.254.169.254 dev eth0 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.128.0/18 dev eth0 proto kernel scope link src 192.168.132.34 
192.168.131.110 dev eni6669baf3723 scope link 
192.168.146.111 dev eni4763e566de6 scope link 
192.168.154.231 dev enib36cd009927 scope link 
192.168.157.136 dev enibc52a89a1fc scope link <---FrontEnd-Pod
192.168.159.100 dev eni837ae39767c scope link 
192.168.161.31 dev enicea187a2214 scope link 
192.168.163.189 dev eni459f009cb4c scope link 
192.168.173.131 dev enia645ac38938 scope link 
192.168.180.61 dev eni6188ad2106a scope link 

ip rule shows that any packet destined to the FrontEnd-Pod uses the main routing table:

ip rule show
0:	from all lookup local 
512:	from all to 192.168.163.189 lookup main 
512:	from all to 192.168.173.131 lookup main 
512:	from all to 192.168.161.31 lookup main 
512:	from all to 192.168.131.110 lookup main 
512:	from all to 192.168.159.100 lookup main 
512:	from all to 192.168.180.61 lookup main 
512:	from all to 192.168.154.231 lookup main 
512:	from all to 192.168.146.111 lookup main 
512:	from all to 192.168.157.136 lookup main <----FrontEnd-Pod
1024:	not from all to 192.168.0.0/16 lookup main 
1024:	from all fwmark 0x80/0x80 lookup main 
1536:	from 192.168.173.131 lookup 2 
1536:	from 192.168.131.110 lookup 2 
1536:	from 192.168.159.100 lookup 2 
1536:	from 192.168.180.61 lookup 3 
1536:	from 192.168.154.231 lookup 3 
1536:	from 192.168.157.136 lookup 2 
32766:	from all lookup main 
32767:	from all lookup default 

Here is incoming ELB Health check flow

1. ELB --> Node eth0 (IP_SA=192.168.167.142, IP_DA=192.168.132.34, DA_PORT=32481)
2. Node NATs it (IP_SA=192.168.132.34, IP_DA=192.168.157.136, DA_PORT=88)
3. the route table should send it to enibc52a89a1fc

My understanding of the RPF check is that it will drop the packet if the packet is received from the interface that should be the output interface for the IP-DA. I don't quite follow why it fails the RPF check in this case.

@spikecurtis commented:

@liwenwu-amazon strict RPF drops a packet if the routing lookup for the reverse path yields a different interface than it came in on. So, when it is NAT'ed to (IP_SA=192.168.132.34, IP_DA=192.168.157.136), the kernel does a routing lookup for (IP_SA=192.168.157.136, IP_DA=192.168.132.34). The IP rules say it should use routing table 2

1536:	from 192.168.157.136 lookup 2 

which presumably would forward out a different interface than eth0, and thus it doesn't pass a strict RPF check.

A loose RPF check just verifies that it can route the reverse packet, but does not require that it goes out the same interface it came in, so would be successful in this case.
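
An illustrative way to confirm that on the node is to dump the routing table that this rule selects (table 2 here) and check which interface its routes use:

ip route show table 2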

@liwenwu-amazon (Contributor) commented:

@spikecurtis I thought routing table selection was done before DNAT, so it would use the main routing table to perform the RPF check. Do you see any stats (e.g. netfilter stats) to support your finding? Also, do you have a kernel code trace showing that the kernel forwarding code performs ip rule selection again during RPF checking? Thanks.

@spikecurtis commented:

@liwenwu-amazon the routing decision, including table selection has to happen after DNAT, otherwise the packet would not get forwarded correctly with the new destination. Reverse path filtering also happens during the routing decision.

You can see in the kernel sources that the code that validates the reverse path, __fib_validate_source, inverts the source and destination addresses, then calls the main fib_lookup procedure, which also does table selection.

https://elixir.bootlin.com/linux/v4.17.6/source/net/ipv4/fib_frontend.c#L325

@liwenwu-amazon (Contributor) commented:

@spikecurtis, @fasaxc thanks for the reference. Interesting to know that in this case the kernel uses the main routing table for the destination lookup and another routing table for the RPF lookup.

For monitoring/debugging/auditing purposes, do you know of any statistics we can use for this change? In other words, should we add anything additional to /opt/cni/bin/aws-cni-support.sh? Thanks.

@fasaxc (Contributor, Author) commented Jul 16, 2018

The only diagnostic for the RPF check I'm aware of is the following sysctl:

net.ipv4.conf.all.log_martians=1 

It causes the kernel to log all packets that are dropped by the RPF check. I haven't found any kernel stats for that.
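
An illustrative way to use it while reproducing the problem:

sudo sysctl -w net.ipv4.conf.all.log_martians=1
dmesg | grep -i martian    # RPF drops show up as "martian source" log messages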

For the new iptables rules that we added, I suggest including iptables-save -c output in the diagnostics dump. That can be very useful; especially for spotting interactions between different apps that use iptables.

One slight correction to the health check flow; I think the SNAT happens after the routing decision so IP_SA=192.168.132.34 should be IP_SA=192.168.167.142 (the ELB's IP).

@liwenwu-amazon (Contributor) commented:

@fasaxc Thanks again for the notes. The only concern I have now is "disabling the RPF check"; RPF is a security feature that can help limit malicious traffic and prevent IP address spoofing.

How about adding a configuration knob so that:

  • deployments that do NOT care about the RPF security check and also do NOT want to use NLB/ALB for service/Pod IP mapping can use the knob to turn on NodePort support
  • deployments that care about the RPF security check and use NLB/ALB for service/Pod IP mapping can disable this feature.

@fasaxc
Copy link
Contributor Author

fasaxc commented Jul 16, 2018

OK, how about AWS_VPC_IGNORE_STRICT_RP_FILTER=true?

@liwenwu-amazon (Contributor) commented:

@fasaxc how about using AWS_VPC_CNI_NODE_PORT_SUPPORT? When AWS_VPC_CNI_NODE_PORT_SUPPORT=true, the CNI will enable the code in this PR, i.e.:

  • relax the RPF check
  • add one more ip rule: 1024: from all fwmark 0x80/0x80 lookup main
  • add two more rules to the iptables mangle table

If AWS_VPC_CNI_NODE_PORT_SUPPORT is not set, it behaves the same as today.
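
As a purely hypothetical illustration, a user could toggle such a knob on the aws-node daemonset like this (assuming it is read from the pod environment, like the existing AWS_VPC_K8S_CNI_EXTERNALSNAT variable):

kubectl -n kube-system set env daemonset/aws-node AWS_VPC_CNI_NODE_PORT_SUPPORT=true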

@spikecurtis commented:

@liwenwu-amazon

I think it's very important that nodePorts work correctly with the default settings, and would therefore advocate that the default value of this config flag is to enable the changes in this PR.

NodePorts are a part of standard Kubernetes networking, and if they don't work by default, EKS and self-managed clusters that use the CNI plugin might not be considered conformant with the Kubernetes API spec.

While NLB/ALB with direct pod access is a good technology, the reality is that it will not be adopted overnight, nor will it be adopted by 100% of users. Having trouble getting NodePorts working could end up being a source of adoption friction for our users, and I think we should avoid it as much as possible.

I want to emphasize that this change does not disable RPF checks; it simply switches them from strict to loose. Loose mode is the recommended mode "if using asymmetric routing or other complicated routing", according to the kernel's ip-sysctl documentation: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt. It's also worth pointing out that, unless the user has explicitly disabled it, the AWS network will already check for the kind of IP spoofing that RPF is designed to prevent.

@fasaxc (Contributor, Author) commented Jul 17, 2018

@liwenwu-amazon I agree with @spikecurtis; I think it should default to making NodePorts work or users will get tripped up. WDYT?

@fasaxc (Contributor, Author) commented Jul 18, 2018

I've updated the patch along those lines.

@liwenwu-amazon (Contributor) commented:

@fasaxc can you add some unit tests too? Thanks.

@liwenwu-amazon (Contributor) commented:

@fasaxc @lxpollitt @spikecurtis I have the following concerns with NodePort as the default:

  • for each k8s service (e.g. the k8s frontend example), the load balancer needs to add ALL nodes in the cluster; e.g. for a 5000-node cluster, the load balancer needs 5000 instances in its target group and constantly sends health checks to all 5000 nodes.
  • each k8s service will generate ip_conntrack state on every node in the cluster. This may increase the memory requirement on every node.
  • health checks and traffic are sent over multiple hops, which is not efficient and is very difficult to debug:
    • for an endpoint (e.g. nginx in the frontend example), the incoming traffic can have any node's IP address as its source IP
    • health checks and traffic can be sent to any node in the cluster, SNATed with that node's IP and then forwarded to the endpoint (nginx)
  • for deployments that require multiple node groups and NO communication between nodes from different node groups, these k8s services will trigger health-check traffic between nodes from different node groups.

@liwenwu-amazon (Contributor) commented Jul 19, 2018

Here is a comparison between NodePort and using the Pod VPC IP as the target:
git-node-port.pdf
[screenshot attachment: "screen shot 2018-07-19 at 5 21 31 am"]

@fasaxc (Contributor, Author) commented Jul 19, 2018

@liwenwu-amazon I think we're all agreed that NLB/ALB/ELB -> Pod VPC IP is a good solution. The reason for supporting NodePorts too is that they're a standard Kubernetes feature. The user doesn't have to be using an NLB/ALB/ELB to use a NodePort.

@fasaxc (Contributor, Author) commented Jul 19, 2018

I've added some UT to the PR.

@liwenwu-amazon (Contributor) left a comment:

When addressing CR comments, is it possible to amend one of the previous commits? That way we will not have too many commits by the time the PR merges.

@fasaxc force-pushed the nodeport-fix branch 2 times, most recently from 173d128 to 0b8fba1 on August 1, 2018 10:14
@fasaxc (Contributor, Author) commented Aug 1, 2018

@liwenwu-amazon I've updated the PR to include the diagnostics you requested.

@liwenwu-amazon (Contributor) commented:

@fasaxc @spikecurtis, can you help come up with an e2e test which we can manually verify for now:

  • when both network policy and a NodePort service are configured, both work as expected.
  • ideally, this should cover the following cases:
    • sending Pod and receiving Pod are on 2 different worker nodes
    • sending Pod and receiving Pod are on the same worker node

thanks

@liwenwu-amazon (Contributor) commented Aug 1, 2018

@fasaxc, I am getting the following (with the default settings), which does NOT seem right. I think it is better to explicitly print out the values; otherwise, if a different version of the code uses a different default, it is harder to figure out which value is in use:

[root@ip-192-168-232-52 bin]# curl http://localhost:61678/v1/env-settings | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   100  100   100    0     0    100      0  0:00:01 --:--:--  0:00:01 20000
{
    "AWS_VPC_CNI_NODE_PORT_SUPPORT": "",
    "AWS_VPC_K8S_CNI_CONNMARK": "",
    "AWS_VPC_K8S_CNI_EXTERNALSNAT": ""
}

Commit message:

Add iptables and routing rules that

- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it
  exits via eth0.

Configure eth0 RPF check for "loose" filtering to prevent
NodePort traffic from being blocked due to incorrect reverse
path lookup in the kernel.  (The kernel is unable to undo the
NAT as part of its RPF check so it calculates the incorrect
reverse route.)

Add diagnostics for env var configuration and sysctls.

Fixes aws#75
@fasaxc (Contributor, Author) commented Aug 2, 2018

I've updated the PR to print the active state of the environment variables instead of the raw text.

For the E2E tests, I think this should be sufficient:

  • create 2-node cluster and external test node
  • create server pod on each node, using non-default ENI
  • create separate NodePort service for each server pod
  • verify connectivity from each host to each NodePort on the primary ENI (local and remote)
  • verify connectivity from external test node to NodePorts
  • with network policy enabled (e.g. Calico policy-only mode installed and configured to expect eni* interfaces instead of cali* interfaces)
    • repeat test
    • add NetworkPolicy to pods to allow traffic from hosts via CIDR
    • repeat test
    • remove policy, install default-deny policy
    • verify traffic is dropped from other hosts (for Calico, the local host can always connect to itself)

@liwenwu-amazon (Contributor) commented:

@fasaxc thanks for the E2E test case!
Based on your suggestion, I have come up with the following detailed steps for testing this. Please review them and see if they are accurate.
Also, can you share a network policy yaml which I can use for it?

Here are the detailed steps I am going to use to test this:

  • I am planning to use https://kubernetes.io/docs/tasks/access-application-cluster/connecting-frontend-backend/
  • start out with a 1-node (t2.medium) cluster first (I think this will force testing of the NodePort feature)
    • modify hello.yaml with replicas: 12
    • kubectl apply -f hello.yaml
    • create the hello service: kubectl create -f https://k8s.io/examples/service/access/hello-service.yaml
    • create the frontend deployment and service: kubectl create -f https://k8s.io/examples/service/access/frontend.yaml
    • to test local to local:
      • kubectl exec -ti hello-xxx sh into one of the hello pods
      • verify wget <elb's address> returns index.html
    • to test remote to local:
      • verify curl <elb's address> from an external node
  • scale up to 2 nodes and also scale the hello pods up to 24, so some of the hello pods get scheduled onto the new node
    • perform the same tests as above
  • deploy the calico policy-engine add-on: kubectl apply -f calico.yaml
  • TODO (details to be added): add an allow policy (?? yaml) and verify ??
  • TODO (details to be added): add a disallow policy (?? yaml) and verify ??

@fasaxc (Contributor, Author) commented Aug 6, 2018

@liwenwu-amazon k8s is allow-by-default; this page explains how to install a default-deny policy (in the "Default Policies" section): https://kubernetes.io/docs/concepts/services-networking/network-policies/

Note: NodePorts and policy don't interact as you might hope. kube-proxy SNATs the traffic to ensure that return traffic goes via the ingress node, so Kubernetes policy actually sees the ingress node as the source of the traffic. Calico supports pre-NAT policy that allows for securing NodePorts, but that is not exposed by the Kubernetes API; it has to be applied to the host instead of the workload, since that is where the NAT happens.

@bchav added this to the v1.2 milestone on Aug 9, 2018
@liwenwu-amazon (Contributor) commented:

@fasaxc thanks for the input. Here is the yaml for the policy part; please let me know if it is correct and if you have any comments/suggestions:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-hello2hello
  namespace: default
spec:
  podSelector:
    matchLabels:
      tier: backend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
           tier: frontend
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.0.0/16 # VPC CIDR
  egress:
  - to:
    - podSelector:
        matchLabels:
           tier: frontend
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.0.0/16 # VPC CIDR
  • here are the verifications I have done:
    • find out the DNS name of the frontend service
      • kubectl describe svc frontend and note the IP address
    • verify I am able to reach the NodePort service from outside the cluster
      • curl http://<frontend's dns> should return {"message":"Hello"}
    • verify I am able to reach the NodePort service from a backend pod
      • kubectl exec -ti hello-xxxx sh
      • curl http://<frontend's IP> should return {"message":"Hello"}
    • verify a backend pod can NOT communicate with another backend pod
      • kubectl exec -ti hello-xxxx sh
      • ping another Pod's IP; it should fail

@spikecurtis commented:

Hi @liwenwu-amazon

Since this issue affects node ports, it's more important to test that Network Policy works correctly when restricting access to the frontend, rather than the backend. So a policy such as

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-hosts
  namespace: default
spec:
  podSelector:
    matchLabels:
      tier: frontend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 192.168.0.0/16 # VPC CIDR

would allow connections from other hosts in the VPC (including K8s nodes, which is the important part). Then, you should be able to access the service by connecting to the node port.
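
For example (illustrative; the frontend service comes from the example above, and <node-ip> is any worker node's VPC address):

NODE_PORT=$(kubectl get svc frontend -o jsonpath='{.spec.ports[0].nodePort}')
curl http://<node-ip>:${NODE_PORT}/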

If you delete the allow policy and replace with a deny:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: default
spec:
  podSelector:
    matchLabels:
      tier: frontend
  policyTypes:
  - Ingress
  ingress: []

then you should see that some requests are denied if they are sent to a pod on a remote host.

@fasaxc suggested that you may want to create two different services, and restrict the pods for each service to be on different nodes. This would make the test more deterministic, so that we can separately test nodePort -> local pod vs nodePort -> remote pod.

@liwenwu-amazon (Contributor) commented:

@spikecurtis Thanks for the suggestions. Since the frontend service only has 1 nginx Pod, if from a Pod (kubectl exec -ti <hello-pod> sh) I do a wget http://<node>:<port> for every node in the cluster, I should have tested nodePort -> local pod vs nodePort -> remote pod, right?

@spikecurtis commented:

Yeah, if there is only one pod and you hit the node port on each node, you will exercise both nodePort -> local pod and nodePort -> remote pod. However, with only one pod, how do you ensure that the pod is on a secondary ENI? The way I thought you were going to do it is to create more pods than can fit on a single ENI.

@liwenwu-amazon (Contributor) commented:

@spikecurtis Good catch! Yes, I will need to deploy some pods first to make sure the IPs of the primary ENI are all used before deploying the frontend Pod and service.

@liwenwu-amazon merged commit 9d05e90 into aws:master on Aug 21, 2018
@liwenwu-amazon (Contributor) commented:

@fasaxc @spikecurtis Thank you very much for this PR!

Successfully merging this pull request may close these issues:

NodePort not working properly for pods on secondary ENIs (#75)

6 participants