
Add a configuration knob to allow Pod to use different VPC SecurityGroups and Subnet #165

Merged 2 commits on Sep 21, 2018

Conversation

liwenwu-amazon
Contributor

Issue #131

Problem

Today ipamD uses the node's primary ENI's SecurityGroups and Subnet when allocating secondary ENIs for Pods.

Here are a few use cases which require Pods to use different SecurityGroups and Subnets than the primary ENI:

  • There are limited IP addresses available in the primary ENI's subnet. This limits the number of Pods that can be created in the cluster.
  • For security reasons, Pods need to use a different SecurityGroups and Subnet than the Node's SecurityGroups and Subnet.
  • For security and availability reasons, some Pods in the cluster need to use one set of SecurityGroups and Subnet, whereas other Pods need to use a different set.

Pod's Custom Network Config

ENIConfig CRD

Here we define a new CRD ENIConfig to allow users to configure the SecurityGroups and Subnet for the Pod network config:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: eniconfigs.crd.k8s.amazonaws.com
spec:
  scope: Cluster
  group: crd.k8s.amazonaws.com
  version: v1alpha1
  names:
    plural: eniconfigs
    singular: eniconfig
    kind: ENIConfig

Node Annotation

We will use a Node Annotation to indicate which ENIConfig should be used for this Node's Pod network.

kubectl annotate node <node-name> k8s.amazonaws.com/eniConfig=<ENIConfig name>

default ENIConfig

If a node does not have the annotation k8s.amazonaws.com/eniConfig, it will use the ENIConfig whose name is default.
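For illustration, a minimal default ENIConfig might look like the following (the subnet and security group IDs are placeholders):

apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: default
spec:
  subnet: subnet-0123456789abcdef0      # placeholder subnet ID
  securityGroups:
  - sg-0123456789abcdef0                # placeholder security group ID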

Workflow

  • Set the environment variable AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG to true (see the DaemonSet snippet after this list)
    • This will cause ipamD to use the SecurityGroups and Subnet in the node's ENIConfig for ENI allocation
  • Create a VPC Subnet for the pod network config (e.g. subnet-0c4678ec01ce68b24)
  • Create VPC Security groups for the pod network config (e.g. sg-066c7927a794cf7e7, sg-08f5f22cfb70d8405, sg-0bb293fc16f5c2058)
  • Create an ENIConfig custom resource, for example:
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
 name: group1-pod-netconfig
spec:
 subnet: subnet-0c4678ec01ce68b24
 securityGroups:
 - sg-066c7927a794cf7e7
 - sg-08f5f22cfb70d8405
 - sg-0bb293fc16f5c2058
  • Annotate a node to use ENIConfig group1-pod-netconfig
kubectl annotate node <node-xxx> k8s.amazonaws.com/eniConfig=group1-pod-netconfig
  • ipamD will then use this subnet and these security groups when allocating ENIs for the node
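The environment variable from the first step is typically set on the aws-node DaemonSet. A sketch of the relevant fragment (only the env entry for this feature is shown; surrounding DaemonSet fields are omitted):

# Fragment of the aws-node container spec in the aws-node DaemonSet manifest.
env:
- name: AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
  value: "true"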

Behavior when there is an ENIConfig Configuration Change

If the user changes the ENIConfig CRD definition (e.g. to use a different subnet or different security groups), or changes a node's annotation to point to a different ENIConfig, ipamD will use the new configuration when allocating new ENIs.

Alternative Solutions Considered But Not Adopted:

Use Environment Variables for Pod Subnet and SecurityGroups

  • Every config change would trigger a CNI/aws-node DaemonSet rolling upgrade
  • All Pods in the cluster MUST use the same Subnet and SecurityGroups

Use /etc/cni/net.d/aws.conf

  • In addition to the issues mentioned above, you would need to rebuild the aws-vpc-cni Docker image.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@stuartnelson3

Currently running this on a test cluster, seems to be working as advertised

curl http://localhost:61678/v1/env-settings > ${LOG_DIR}/env.output
curl http://localhost:61678/v1/networkutils-env-settings > ${LOG_DIR}/networkutils-env.output
curl http://localhost:61678/v1/ipamd-env-settings > ${LOG_DIR}/ipamd-env.output
curl http://localhost:61678/v1/eni-configs > ${LOG_DIR}/eni-configs.out
Contributor

Nit: Why does the eni-configs log file have a .out extension when all the others have .output

ipamd/ipamd.go Outdated
@@ -43,11 +45,40 @@ const (
ipPoolMonitorInterval = 5 * time.Second
maxRetryCheckENI = 5
eniAttachTime = 10 * time.Second
defaultWarmENITarget = 1
nodeIPPoolReconcileInterval = 60 * time.Second
maxK8SRetries = 12
retryK8SInterval = 5 * time.Second
noWarmIPTarget = 0
Contributor

Maybe rename this to defaultWarmIPTarget and move it to line 62?

@stuartnelson3

I've noticed some unexpected behavior while using this:

I'm attempting to ping from a physical machine inside a non-AWS datacenter to a pod running on a k8s cluster on EC2 instances. The two are linked via Direct Connect.

When pinging from the pod to the physical machine, the ping request and reply packets make it through fine:

13:52:40.197673 IP pod-host-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo request, id 968, seq 4, length 64
13:52:40.197708 IP bare-metal-XX.XX.XX.XX > pod-host-XX.XX.XX.XX: ICMP echo reply, id 968, seq 4, length 64

Of note: I seem to be receiving packets from the EC2 instance itself, over eth0; the pod's IP addr is not being recorded by tcpdump as the origin/destination of the request/reply messages. The pod is attached to device=3 according to the output of /v1/enis. Checking ip a on the EC2 instance, there are eth0, eth1, and eth2.

When pinging from the physical machine to the pod ip, the physical machine sees no replies.

If I tcpdump on the EC2 instance while pinging from the physical machine to the pod, I see requests coming in on eth2, but then leaving on eth0:

$ tcpdump -n -i eth2 icmp and host bare-metal-XX.XX.XX.XX
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 262144 bytes
13:37:31.639213 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 1, length 64
13:37:32.693195 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 2, length 64
13:37:33.717124 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 3, length 64
13:37:34.741126 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 4, length 64
13:37:35.765154 IP bare-metal-XX.XX.XX.XX > pod-XX.XX.XX.XX: ICMP echo request, id 3649, seq 5, length 64
$ tcpdump -n -i eth0 icmp and host bare-metal-XX.XX.XX.XX
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
13:37:46.265847 IP pod-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo reply, id 3843, seq 1, length 64
13:37:47.285361 IP pod-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo reply, id 3843, seq 2, length 64
13:37:48.309237 IP pod-XX.XX.XX.XX > bare-metal-XX.XX.XX.XX: ICMP echo reply, id 3843, seq 3, length 64

So it appears something is incorrectly configured with ip routes/rules, since the reply is leaving on the wrong device.

When I check the routes/rules:

$ ip rule show | grep pod-XX.XX.XX.XX
512:    from all to pod-XX.XX.XX.XX lookup main 
1536:    from pod-XX.XX.XX.XX lookup 2 
$ ip route show table 2
default via 10.132.0.1 dev eth1 
10.132.0.1 dev eth1  scope link 

According to this, it seems like traffic from the pod should in fact be leaving over eth1. This contradictory behavior is confirmed:

$ ip route get bare-metal-XX.XX.XX.XX from pod-XX.XX.XX.XX iif eniab037e126c3
bare-metal-XX.XX.XX.XX from pod-XX.XX.XX.XX via pod-host-XX.XX.XX.XX dev eth0 
    cache  iif eniab037e126c3
$ ip route get 10.132.0.1 from pod-XX.XX.XX.XX iif eniab037e126c3
10.132.0.1 from pod-XX.XX.XX.XX via pod-host-XX.XX.XX.XX dev eth0 
    cache  iif eniab037e126c3

I'm hoping that this isn't the expected behavior? Is there any further information I can provide in helping to debug this? Thanks for the great work, we're very excited to start using this feature!

@liwenwu-amazon
Contributor Author

@stuartnelson3 can you set AWS_VPC_K8S_CNI_EXTERNALSNAT to true and see if it works?

@mogren
Contributor

mogren commented Sep 21, 2018

LGTM!

Contributor

@mogren left a comment

Make sure all the tests have passed with the latest changes.

@liwenwu-amazon merged commit c30ede2 into aws:master Sep 21, 2018
@stuartnelson3

When I set AWS_VPC_K8S_CNI_EXTERNALSNAT to true, the container is unable to ping the baremetal machine, and the baremetal machine cannot ping the container.

The default setting, AWS_VPC_K8S_CNI_EXTERNALSNAT=false, allows the container to ping the baremetal machine, but the baremetal machine is not able to ping the container.

Do you have any suggestions on how to troubleshoot this further, or is there any more information I could add to my previous comment (#165 (comment))?

@liwenwu-amazon
Contributor Author

@stuartnelson3 What's your VPC topology? In other words, how is your baremetal machine connected to the Pods in the Pods' VPC? Are your pods using the Pod-specific subnet and security groups from this PR?

@stuartnelson3

My mistake! The routers in our baremetal datacenter were blocking the CIDR block of the secondary subnet. Setting AWS_VPC_K8S_CNI_EXTERNALSNAT=true fixed the issue!

- crd.k8s.amazonaws.com
resources:
- "*"
- namespaecs


@liwenwu-amazon I realize this is already merged, but it looks like there's a typo in this ClusterRole rule: namespaecs -> namespaces, though with the wildcard above it this line may be redundant.

@sdavids13

Are there instructions on how this could be used in a multi-AZ autoscaling group? When a machine comes up, how can it know that if it comes up in us-east-1a it should use alternate subnet X (eniConfig=groupX-pod-netconfig), while anything launching in us-east-1b uses alternate subnet Y (eniConfig=groupY-pod-netconfig)? Also, where can you set the node annotation so that it is already configured before the node joins the cluster (i.e. you don't need to run kubectl annotate node <node-name> k8s.amazonaws.com/eniConfig=<ENIConfig name> after it has already joined the cluster) -- an EC2 tag?

@bnutt

bnutt commented Oct 18, 2018

I am wondering this as well, @sdavids13. My thought is that you could have the user_data in the launch config for the autoscaling group check which AZ the instance is in via curl http://169.254.169.254/latest/meta-data/placement/availability-zone. Based on the response, you could map it to an ENIConfig which contains a subnet in that AZ and then set the label for the node in kubelet on startup. If you use the Amazon AMI, you can pass the labels you want to the bootstrap script https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh.

Edit: Sorry, I realized you need annotations, not labels. I don't see a way to assign annotations in kubelet either, so it somehow needs to be done after the node has joined the cluster, which really isn't feasible since you may have nodes scaling up or down. If an EC2 tag could set the ENIConfig, it would be easy to specify which one to use for each subnet.

After testing this more, it is mainly usable when you run all your instances in one AZ: you can specify a default ENIConfig and then all your instances will automatically allocate IPs from the specified subnet. However, if you want to run across different AZs, you would need to create multiple ENIConfigs, say one per AZ. If you don't have a default config, your nodes will not allocate any pod IPs based on the subnet the worker node was created in, so the worker node is not operating. This means you would need to annotate every node that comes up by hand if you cannot do it during node bootstrap. Would it be possible for the default behavior to just use the worker node's subnet to allocate pod IPs even if the flag is enabled, instead of having to define a default ENIConfig? Or is there a programmatic approach you can recommend, @liwenwu-amazon, for annotating nodes? Or even switching to labels?

@sdavids13

@liwenwu-amazon @bchav Could either of you please help explain how this feature can be used in a multi-AZ environment on node startup? Looking at the kubelet documentation there doesn't appear to be a way to specify annotations at node startup/registration time. Could you please provide a mechanism to allow us to set the eniConfig value somewhere in the node startup process?

@liwenwu-amazon
Contributor Author

@sdavids13 One note on eniConfig:

  • There is NO hard requirement that a node MUST be annotated at node startup time.
  • If a node comes up with NO eniConfig annotation, or there is NO matching eniConfig CRD found, ipamD will NOT try to allocate any ENIs or IPs.
  • You can write an external controller to watch the node object and annotate the node based on your business need.
  • Once the node gets annotated with an eniConfig name and the matching eniConfig CRD is configured, ipamD will start allocating ENIs and IPs using the security groups and subnet specified in that eniConfig CRD.

@liwenwu-amazon
Contributor Author

@bnutt ,

"Would it be possible to have the default behavior to just use the worker node subnet to allocate pod ip's even if the flag is enabled instead of having to define a default eniconfig, or is there a programatic approach you can recommend @liwenwu-amazon on how to annotate nodes? Or even switch to use labels?"

Yes, today's default behavior (AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=false) is that all ENIs use the same security groups and subnet as the worker node.

@sdavids13

sdavids13 commented Oct 24, 2018

@liwenwu-amazon The problem with doing it after the worker node joins the cluster is that pods will already start being scheduled on the host. There will effectively be a race condition between when pods are scheduled/run on the node and when the external "watcher" process can update the annotation. Then what ends up happening? The pods might be launched into the incorrect/undesired subnet, and then you would need to kill all of those pods so they can be rescheduled in the correct subnet? Please correct me if there is a better approach that won't allow pods to be scheduled on the host until the annotation is applied. (It is important to note that if you don't supply a "default" eniConfig, then the critical "watcher" pod wouldn't be able to be deployed and hence would never allow any other pods to start, so nothing would be able to be launched on a new cluster.)

Alternatively could we provide an environment variable through kubelet (or a similar mechanism) that can be applied in the user-data script to provide the "default" eniConfig value if the annotation isn't present on the node via the k8s API? @liwenwu-amazon / @bchav

@liwenwu-amazon
Contributor Author

@sdavids13 I have a question about your deployment:

After a node has been annotated with an eniConfig, how do you prevent a new Pod (e.g. a Pod which is supposed to use a different subnet) from being scheduled on this node?

@sdavids13

@liwenwu-amazon I'm not quite following your question - I would only want pods to be scheduled after the node has been annotated, so I'm not sure why I would want to prevent them from being scheduled after it is annotated. But to answer your question... you would generally cordon the node to prevent new pods from being scheduled. Unfortunately, I don't believe you can cordon the node when it registers itself, but even if you could, you would still need to at least get the node watcher pod up and running in order to perform the annotating, hence we have a chicken-and-egg problem.

Taking a step back, this is my goal:

  1. Configure a multi-AZ EKS cluster where the primary EC2 ENI/IP is in a routable subnet (to other peered VPCs) while all pod ENIs/IPs are run in a subnet in a different CIDR range/non-routable space in each corresponding AZ. This was described in the original issue.
  2. A cluster can be launched via a Terraform script that installs Helm and various Helm charts, and requires zero human intervention throughout the process.
  3. Minimize (ideally have zero) "false errors" coming from the cluster.

@liwenwu-amazon
Contributor Author

@sdavids13 Here is one way to achieve your goal:

  • Configure the node watcher pod to use hostNetwork: true, so that the node watcher can run before CNI allocates ENIs and IPs (a minimal pod-spec sketch follows below).
  • If a pod is scheduled to a node before the node gets annotated with the eniConfig, this pod will NOT get an IP. After the node is annotated with the eniConfig, ipamD will start allocating IPs and ENIs, and the Pod will then get an IP from the subnet configured in the eniConfig CRD.
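For illustration, a hypothetical node-watcher pod spec along these lines (the name, image, and service account are assumptions, not part of this PR):

apiVersion: v1
kind: Pod
metadata:
  name: node-watcher                        # hypothetical name
  namespace: kube-system
spec:
  hostNetwork: true                         # uses the node's network, so no pod ENI/IP is required
  serviceAccountName: node-watcher          # assumes RBAC allowing get/patch on nodes
  containers:
  - name: watcher
    image: example.com/node-watcher:latest  # placeholder image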

Will this satisfy your requirement?

@ewbankkit
Contributor

@sdavids13 I have exactly the same scenario I need to solve.
@liwenwu-amazon My understanding of the proposed solution is:

  1. Set AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true in the aws-node Daemonset
  2. Deploy the node watcher pod with hostNetwork: true
  3. Start EC2 instance workers without the k8s.amazonaws.com/eniConfig=<ENIConfig name> annotation
  4. node watcher (and other pods with hostNetwork: true) will get scheduled and use the IP address of the worker node it's scheduled on
  5. node watcher runs a standard controller loop, ensuring Nodes are annotated with the "correct" ENIConfig annotation (see below)
  6. Once a worker node is annotated correctly, regular hostNetwork: false pods will run on it and use IP addresses based on the associated ENIConfig

Determining the "correct" ENIConfig could be done in a number of ways:

  • Using the failure-domain.beta.kubernetes.io/zone label on the node to determine the node's AZ and then looking up the corresponding ENIConfig in a ConfigMap (see the sketch after this list)
  • Using the node's ExternalID attribute (=EC2 instance ID) to determine the node's AZ by making EC2 API calls and then looking up the ENIConfig
  • Looking at all the registered ENIConfigs and building an AZ-to-ENIConfig map from their associated subnet IDs, instead of using a ConfigMap
  • ...
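As a sketch of the ConfigMap option, a hypothetical mapping from AZ to ENIConfig name might look like this (the ConfigMap name, namespace, and keys are illustrative only, not defined by this PR):

apiVersion: v1
kind: ConfigMap
metadata:
  name: az-to-eniconfig        # hypothetical name
  namespace: kube-system
data:
  us-east-1a: groupX-pod-netconfig
  us-east-1b: groupY-pod-netconfig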

Sound about right?

@taylorb-syd
Contributor

taylorb-syd commented Feb 1, 2019

Just FYI for those of you who find yourself here trying to solve the problem that @sdavids13 raised.

The following PR is currently in the master build and theoretically should be included in the 1.4.0 and later releases:

Feature: ENIConfig set by custom annotation or label names #280

This feature does two things:

  • Expands control of the ENIConfig selection to a Label as well as an Annotation.
  • Adds the control variables ENI_CONFIG_LABEL_DEF and ENI_CONFIG_ANNOTATION_DEF to change the controlling label/annotation from k8s.amazonaws.com/eniConfig to an arbitrary label/annotation.

The upside of this is that if you set ENI_CONFIG_LABEL_DEF to failure-domain.beta.kubernetes.io/zone and then create an ENIConfig for each Availability Zone in your VPC (e.g. us-east-1a and us-east-1b), it will automatically select the correct ENIConfig for your availability zone, without requiring a watcher, custom labels, or any external infrastructure.
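For example (a sketch only; the subnet and security group IDs below are placeholders), set the label definition on the aws-node container:

# Fragment of the aws-node DaemonSet container env; only the relevant entries are shown.
env:
- name: AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
  value: "true"
- name: ENI_CONFIG_LABEL_DEF
  value: failure-domain.beta.kubernetes.io/zone

and create one ENIConfig per zone, named after the zone label value:

apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a                     # matches the node's zone label value
spec:
  subnet: subnet-0aaaaaaaaaaaaaaaa     # placeholder: a subnet in us-east-1a
  securityGroups:
  - sg-0aaaaaaaaaaaaaaaa               # placeholder security group ID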

Additionally, since the code is written to prefer an annotation over a label, if you want to override the ENIConfig you can annotate the node to override this "default for the availability zone" behavior.

Edit: Changed expected release based upon comment by @mogren

@mogren
Contributor

mogren commented Feb 1, 2019

Unfortunately, 1.3.1 won't have this change. I created a tracking ticket on the AWS container roadmap board for the next CNI release.
