
[EKS] [request]: Add/Delete/Update Subnets Registered with the Control Plane #170

Open
christopherhein opened this issue Feb 23, 2019 · 95 comments
Labels
EKS (Amazon Elastic Kubernetes Service), Proposed (Community submitted issue)

Comments

christopherhein commented Feb 23, 2019

Tell us about your request
The ability to update the Subnets that the EKS Control plane is registered with.

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
https://twitter.com/julien_fabre/status/1099071498621411329

Are you currently working around this issue?

Additional context

christopherhein added the EKS (Amazon Elastic Kubernetes Service) and Proposed (Community submitted issue) labels Feb 23, 2019

christopherhein (Author) commented Feb 23, 2019

/cc @Pryz

jmeichle commented Feb 23, 2019

This would be a nice improvement

Pryz commented Feb 23, 2019

To add some color, here are some use cases:

  1. You have a multi-tenant cluster configured with X subnets, but you are getting close to IP exhaustion and want to extend the setup with Y more subnets, without losing the current configuration of course.

  2. You are expanding your setup to new availability zones and so want to use new subnets to schedule pods there.

  3. You were using your cluster on private subnets only and now want to extend it to use some public subnets.

Generally, in many environments, the network setup changes over time, and EKS needs to be flexible enough to embrace such changes.

Thanks!

alfredkrohmer commented Feb 23, 2019

@Pryz Your worker nodes don't have to be in the same subnets that your control plane is configured for. The latter are used for creating the ENIs that are used for kubectl logs|exec|attach and for ELB/NLB placement.

Pryz commented Mar 4, 2019

@devkid yes, but that's a problem. You can basically schedule pods in subnets which are not configured on the control plane, but then you can't access them (logs, proxy, whatever).

alfredkrohmer commented Mar 4, 2019

@Pryz If you have proper routing between the different subnets, this is not a problem. We have configured our control plane for one set of subnets, our workers run in a second, disjoint set of subnets, and logs, proxy and exec are working just fine.

hanjunlee commented Apr 10, 2019

@Pryz Could you explain in detail how to set up routing between the different subnets? I am just wondering how to access a disjoint set of subnets. I hope for your reply! Thanks

Pryz commented Apr 10, 2019

@hanjunlee I'm not sure I understand your question. My setup is quite simple: 1 VPC, up to 5 CIDRs with 6 subnets each (3 private subnets, 3 public subnets).
Each AZ gets 2 routing tables (1 for private subnets and 1 for public subnets). There is no issue in this routing setup: any IP in one subnet can talk to any IP in any of the other subnets.

aschonefeld commented May 3, 2019

I am seeing the same issue. I had configured 2 subnets initially, but my CIDR range was too small for IP assignments from the cluster. So I added new subnets to the VPC, and the worker nodes are running fine in these new subnets.

When using kubectl proxy and accessing the URL I get an error:

Error: 'Address is not allowed'
Trying to reach: 'https://secondary-ip-of-worker-node:8443/'

The control plane ENIs (in the old subnets), the worker nodes (in the new subnets) and the kubectl host all have inbound and outbound rules for each other. I would think this is related to the issue of the new subnets not having an attached ENI for the control plane. Any help would be appreciated.

ckassen commented May 24, 2019

@aschonefeld did you tag the new subnets with the correct kubernetes.io/cluster/<clustername> tag set to shared?

I could not reproduce the error when I tagged the subnet that way. Spun up a node in the subnet and started a pod on that node. Afterwards I could run kubectl logs|exec|port-forward on that Pod.
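
For illustration, a minimal Terraform sketch of applying that tag to an already-existing subnet; the subnet ID and cluster name are placeholders, and it assumes an AWS provider version that includes the aws_ec2_tag resource:

# Sketch only: tag an existing subnet so EKS/Kubernetes treats it as shared with the cluster.
resource "aws_ec2_tag" "cluster_shared" {
  resource_id = "subnet-0123456789abcdef0"          # placeholder subnet ID
  key         = "kubernetes.io/cluster/my-cluster"  # replace my-cluster with your cluster name
  value       = "shared"
}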

aschonefeld commented May 28, 2019

@ckassen the CloudFormation template for the worker nodes tagged the subnet as shared. kubectl logs, for example, is also working for me, but proxying to the dashboard is not: the command goes through, but no connection to the dashboard is possible. Is kubectl proxy working for you?

willthames commented May 31, 2019

If you decide your load balancers are in the wrong subnets and create new ones, as far as I can tell EKS doesn't detect the new subnets and still creates the load balancers in the old subnets, even though those are no longer tagged with kubernetes.io/role/internal-elb. Being able to add the new subnets to EKS would be useful.

thomasjungblut commented Aug 6, 2019

Any work-around known so far? Terraform wants to create a new EKS cluster for me after adding new subnets :-/

sjmiller609 commented Aug 6, 2019

@thomasjungblut are you using this module? https://github.com/terraform-aws-modules/terraform-aws-eks Is it the cluster's aws_eks_cluster.this resource that wants to be replaced?

thomasjungblut commented Aug 7, 2019

We built our own module, but effectively it's the same TF resource that wants to be replaced, yes.

It makes sense, since the API doesn't support changing the subnets: https://docs.aws.amazon.com/eks/latest/APIReference/API_UpdateClusterConfig.html

Would be cool to at least have the option of adding new subnets. The background is that we want to switch from public to private subnets, so we could add our private subnets on top of the existing public ones and just change the routes a bit. Would certainly make our life a bit easier :)

theothermike commented Aug 8, 2019

We just ran into this exact issue, using that EKS TF module too. A workaround that seems to work:

  1. Create the new subnets, set up the routes, etc.
  2. Manually edit the ASG for the worker nodes and add the subnets (sketched below).
  3. Edit the control plane SG and add the CIDR ranges of the new subnets.

This, of course, breaks running TF with the EKS module for that cluster again. We're hoping to mitigate that by tightening up the TF code so we can just create new, properly sized VPC/subnets and kill the old EKS cluster entirely.

We're trying to make a custom TF module that will do the above work without using the EKS module, so at least we can apply it programmatically in the future if needed while that cluster is still around.
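
A rough Terraform sketch of step 2, assuming a self-managed worker ASG; the variable names and the referenced launch configuration are hypothetical:

variable "original_subnet_ids" { type = list(string) }
variable "new_subnet_ids"      { type = list(string) }

# Sketch only: widen the worker ASG so it can launch nodes in the new subnets too.
resource "aws_autoscaling_group" "workers" {
  name                 = "eks-workers"
  min_size             = 3
  max_size             = 6
  vpc_zone_identifier  = concat(var.original_subnet_ids, var.new_subnet_ids)
  launch_configuration = aws_launch_configuration.workers.name # assumed to be defined elsewhere
}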

jahzielHA commented Aug 30, 2019

We are having the same problem. We needed to add new subnets to the EKS cluster and had to rebuild it, since aws eks update-cluster-config --resources-vpc-config does not allow updating subnets or security groups once the cluster has been built:

An error occurred (InvalidParameterException) when calling the UpdateClusterConfig operation: subnetIds and securityGroupIds cannot be updated.

flythebluesky commented Sep 19, 2019

Do we really need to rebuild the entire cluster to add a subnet?

henkka commented Oct 24, 2019

Workaround: if your EKS cluster is behind the current latest version and you are planning to upgrade, note that the control plane discovers the subnets during the initialization process. Tag the new subnets and upgrade the cluster, and the newly tagged subnets will be discovered automatically.

This workaround didn't work for us :( We created new subnets, tagged them with the kubernetes.io/cluster/eks: shared tag and ran the EKS upgrade, but there was no change to the subnets attached to EKS. Anything we missed?

QCU266 commented Nov 11, 2019

Workaround: if your EKS cluster is behind the current latest version and you are planning to upgrade, note that the control plane discovers the subnets during the initialization process. Tag the new subnets and upgrade the cluster, and the newly tagged subnets will be discovered automatically.

Same as @henkka, the workaround didn't work for me either.
What should I do?

BertieW commented Nov 29, 2019

@qcu Did you modify the control plane security groups with the new CIDRs?

casutherland commented Dec 2, 2019

@qcu Did you modify the control plane security groups with the new CIDRs?

^-- @QCU266: this was for you.

BertieW commented Dec 11, 2019

The process that worked for us using Terraform:

  1. Created new subnet(s).
  2. Tagged the new subnets with the kubernetes.io/cluster/<cluster-name> tag set to shared -- our subnets share the same route table, but if they didn't, we'd have tagged that too.
  3. Modified the security group for the control plane to add the new CIDRs (see the sketch below).

The cluster schedules pods just fine, and stuff like logs|proxy|etc. works with no issue.

We use the EKS Terraform module, and all of this was doable with Terraform. The worker node block will happily accept a subnet that isn't one of those declared with the cluster initially. No manual changes required.

Assuming that the assets are properly tagged, I'd venture that the kubectl issues encountered above are down to SG configuration, not inherently EKS-related.
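
A minimal Terraform sketch of the security-group step; the security group ID, port, and CIDRs below are placeholders:

# Sketch only: let workloads in the new subnet CIDRs reach the cluster/control plane security group.
resource "aws_security_group_rule" "allow_new_subnet_cidrs" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["10.0.64.0/19", "10.0.96.0/19"] # placeholder CIDRs of the new subnets
  security_group_id = "sg-0123456789abcdef0"           # placeholder control plane SG ID
}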

khacminh commented Jan 15, 2020

Is there any plan for this proposal? It was created almost a year ago and AWS doesn't seem to have any plan for it.

cazter commented Feb 13, 2020

@BertieW using the TF module with the new subnets added to the worker groups "forces replacement" of the EKS cluster.

All subnets (with tags) are added to the VPC and SGs, and they share the same route table within the same VPC. As @thomasjungblut pointed out, the TF module appears to be restricted by the AWS API limitations and cannot add the additional subnets to the existing cluster. I don't see any way to get around replacing the EKS cluster if deploying from the TF module :(

cazter commented Feb 13, 2020

Alright found a solution for the TF EKS module.

You can't make changes to the module's subnets parameter. So if you were referencing these subnets via a variable, as I was, like so:

subnets = module.vpc.private_subnets

you'll need to provide a static definition of precisely those subnets used when creating your cluster, for example:

subnets = ["subnet-085cxxxx", "subnet-0a60xxxx", "subnet-0be8xxxx"]

Then, within your worker groups, you can add your new subnets. Afterwards you'll be able to apply the TF without forcing a cluster replacement.
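
Roughly, the resulting module call could look like the sketch below; attribute names such as worker_groups and its per-group subnets key vary between releases of the terraform-aws-eks module, so treat this as an assumption-laden example rather than a drop-in config:

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name = "my-cluster"      # placeholder
  vpc_id       = module.vpc.vpc_id

  # Pin to exactly the subnets the control plane was created with, so Terraform
  # does not detect a change that forces cluster replacement.
  subnets = ["subnet-085cxxxx", "subnet-0a60xxxx", "subnet-0be8xxxx"]

  # The new subnets appear only here, on the worker groups.
  worker_groups = [
    {
      name          = "workers"
      instance_type = "m5.large"
      subnets       = module.vpc.private_subnets # includes the newly added subnets
    },
  ]
}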

dogzzdogzz commented Jul 30, 2021

@donkeyx Just make sure all the routing and the required EKS tags are in place in your new subnets, then create the ASG. I expanded my IP pool by creating secondary subnets in each AZ for 20 clusters without any problem. I'm using self-managed ASGs; I'm not sure if there is any configuration difference for managed node groups.

Here is my Terraform code for the new subnets:

locals {
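    # Build the EKS discovery tag (kubernetes.io/cluster/<name> = "shared") and the ELB
    # role tags; the discovery and internal-ELB tags are merged onto the subnets below.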
    eks_clusters_tag = { for i in var.eks_clusters : "kubernetes.io/cluster/${i}" => "shared" if length(i) > 0 }
    eks_public_load_balancer_tag = length(var.eks_clusters) > 0 ? {"kubernetes.io/role/elb": "1"} : {}
    eks_internal_load_balancer_tag = length(var.eks_clusters) > 0 ? {"kubernetes.io/role/internal-elb": "1"} : {}
}

resource "aws_subnet" "secondary_private_subnet" {
  count             = var.enabled && var.secondary_private_subnet_enabled ? var.eks_cluster_az_count : 0
  vpc_id            = aws_vpc.mod.0.id

  cidr_block        = cidrsubnet(var.vpc_cidr, 3, count.index + local.az_count)
  availability_zone = element(data.aws_availability_zones.available.names, count.index)
  tags              = merge(
    module.default_label.tags, 
    map("Name", format("%s-private-secondary-%s", module.default_label.id, element(data.aws_availability_zones.available.names, count.index))), 
    local.eks_clusters_tag,
    local.eks_internal_load_balancer_tag,
  )
}

resource "aws_route_table" "secondary_private_subnet" {
  count            = var.enabled && var.secondary_private_subnet_enabled ? var.eks_cluster_az_count : 0
  vpc_id           = aws_vpc.mod.0.id
  tags             = merge(module.default_label.tags, map("Name", format("%s-private-secondary", module.default_label.id)))
}

resource "aws_route" "secondary_private_subnet_nat_gateway" {
  count                     = var.enabled && var.secondary_private_subnet_enabled ? var.eks_cluster_az_count : 0
  route_table_id            = element(aws_route_table.secondary_private_subnet.*.id, count.index)
  destination_cidr_block    = "0.0.0.0/0"
  nat_gateway_id            = element(aws_nat_gateway.natgw.*.id, count.index)
}

resource "aws_route_table_association" "secondary_private_subnet" {
  count          = var.enabled && var.secondary_private_subnet_enabled ? var.eks_cluster_az_count : 0
  subnet_id      = element(aws_subnet.secondary_private_subnet.*.id, count.index)
  route_table_id = element(aws_route_table.secondary_private_subnet.*.id, count.index)
}

mikestef9 (Contributor) commented Jul 30, 2021

@donkeyx see my comment above #170 (comment)

The subnets passed as part of cluster creation are not related to where you can run worker nodes.

toabi commented Jul 30, 2021

Well, I think everybody knows by now that putting workers in other subnets is no problem. The issue that remains is when you made a mistake during subnet creation in the past, sized the subnets badly, and now want to clean up.

Network layout cleanup is still not possible without recreating the control plane.

owengo commented Jul 30, 2021

@mikestef9 yes but it would be really useful to be able to reclaim the subnets passed as part of cluster creation. Right now the only solution is to destroy the cluster.

sacredwx commented Aug 3, 2021

+1

donkeyx commented Aug 4, 2021

@toabi and @owengo raise good points which match my scenario.

In our case we had existing subnets with a /24 range, which is pretty standard for our org. But only after creating and scaling the EKS cluster did I notice that we had exhausted the IPs in the subnet. I then looked at how to use a custom range for the pods so it would have no impact on the subnet CIDRs, which, I think, can only be done by not using managed node groups? The other solution, adding more subnets, would work but never releases the subnets that were part of the original creation.

Ideally, it would be nice for managed node groups to use an internal CIDR range so they do not impact existing subnet sizing. An alternative, though, is a simple API- or UI-based option to "change" subnets; this is a managed service after all.

hameno commented Aug 4, 2021

@donkeyx Managed nodes can already use separate subnets. I did exactly that: deployed new managed node groups in new, bigger subnets, and everything works normally.

olfway commented Oct 8, 2021

It seems it's not a problem to place node groups in different subnets; the problem is if you place them in a CIDR block which wasn't associated with the VPC before EKS cluster creation.

Today I attached new CIDRs to my VPC, created subnets with all the required tags, etc., and nodes can successfully join my cluster from those new subnets.

But some queries to the API server return errors like:

Get "https://10.24.50.202:443/apis/metrics.k8s.io/v1beta1": Address is not allowed
Post "https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s": Address is not allowed

I then found that the api-server has a parameter on start:

--proxy-cidr-whitelist="10.24.0.0/24,10.24.2.0/24,10.24.1.0/24,10.24.3.0/24,10.24.5.0/24,10.24.6.0/24,10.24.4.0/24"

And the whitelisted CIDRs are the original CIDR list which was associated with the VPC.

So the question is: how can I add an additional CIDR to an already existing EKS cluster?

mikestef9 (Contributor) commented Oct 8, 2021

@olfway how long did you wait? Did this eventually work? We have a reconciler running on the control plane that looks for new CIDR blocks periodically, and updates the proxy-cidr-whitelist value.

olfway commented Oct 8, 2021

It seems the problem lasted for about one hour; after that it started working.
Now I see the updated whitelist value in the logs.

Also, I tried changing "Cluster endpoint access" to "Public and private" and then back to "Private", but I'm not sure if that helped or not.

Update: I mean that I actually got "Address is not allowed" errors for about one hour, and I had added the new CIDR to the VPC another hour or two earlier.

EmilyShepherd commented Oct 9, 2021

@mikestef9 is there any way to speed this up or trigger this refresh? I've run into the same issue today and the delay is adding complexity to an already complex migration 😅

connatix-cradulescu commented Oct 11, 2021

@olfway you need to open a ticket with AWS Support to get the new CIDRs allowed.

diranged commented Oct 11, 2021

I'm a bit confused - it seems like a bunch of the conversation here is around the idea that you might not be able to add nodes to an EKS cluster if the subnets for those nodes are not part of the "control plane", but that's not true. My understanding is that the control plane just needs to be routable, not specifically in the same subnets.

We actually made the mistake early on of configuring most of our EKS clusters so that the control planes are in "all" the subnets, when in reality we only want them in 2-3 subnets. I am actually watching this ticket hoping that AWS will provide a way to reconfigure a live EKS cluster to remove subnets that we do not need from the control plane. Is that something that is possible at all yet - even through a support ticket?

mikestef9 (Contributor) commented Oct 11, 2021

The reconciler runs every 4 hours, and it's rolled out for all EKS versions. So worst case, the control plane should recognize the new CIDRs within 4 hours.

@diranged correct, you can run worker nodes in any subnets; they do not have to be the same ones registered with the control plane. See https://docs.aws.amazon.com/eks/latest/userguide/network_reqs.html

EmilyShepherd commented Oct 11, 2021

Thanks @mikestef9 - anecdotally, I also found that toggling a setting like private/public availability of the cluster caused it to pick up the new CIDR blocks quickly. 👍

mnovacyu commented Oct 12, 2021

@mikestef9 do the subnets require specific tags of any sort? We've seen cases where the new secondary subnets are not getting recognized by the control plane.

emaildanwilson commented Nov 5, 2021

I'm trying to decommission a subnet which is part of an EKS cluster's config, and it seems there is no way to do this other than by deleting the cluster. Is there a fix in progress for this?

chrissnell-okta commented Jan 21, 2022

For what it's worth, toggling public/private had no effect on recognizing the new subnets. Perhaps those who claim this works are just experiencing a coincidence of the reconciler running at the same time?

Really wish there was a way to force a reconciler run.

jamiegs commented Mar 18, 2022

So it seems that when your managed node group subnets and EKS cluster subnets don't match, you get a health issue error like this. That means I can't modify the node group any longer, as any change fails with this error:

AutoScalingGroupInvalidConfiguration	The Amazon AutoScalingGroup eks-node_workers has 
subnets ([subnet-....]) which is not expected by Amazon EKS. Expected subnets : ([subnet-...])

Is this a new health issue that was added? I don't see many references to it.

armujahid commented Apr 6, 2022

I am also getting the "AutoScalingGroupInvalidConfiguration: The Amazon AutoScalingGroup eks-node_workers has subnets ([subnet-....]) which is not expected by Amazon EKS. Expected subnets : ([subnet-...])" issue after modifying the ASG to use 3 private subnets instead of 3 public subnets. Note that all 6 subnets were created during cluster creation.

Solution:
Recreating a new node group with the private subnets fixed my issue. Subnet modification of existing node groups isn't supported (or is very limited), so this is the best workaround for this issue.
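
For reference, a minimal Terraform sketch of creating such a replacement managed node group in the private subnets; the cluster name, role ARN, and subnet IDs are placeholders:

# Sketch only: new managed node group that launches nodes in the private subnets.
resource "aws_eks_node_group" "private_workers" {
  cluster_name    = "my-cluster"
  node_group_name = "private-workers"
  node_role_arn   = "arn:aws:iam::123456789012:role/eks-node-group-role" # placeholder
  subnet_ids      = ["subnet-0aaa1111", "subnet-0bbb2222", "subnet-0ccc3333"] # the private subnets

  scaling_config {
    desired_size = 3
    max_size     = 6
    min_size     = 3
  }
}

Once the new group is healthy, the old node group can be drained and deleted.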

visla-xugeng commented Apr 18, 2022

My VPC has 3 public and 3 private subnets. I selected only the 3 private subnets when I created EKS within this VPC, and I created several managed node groups based on these 3 private subnets. Then I created a new managed node group in a public subnet in the EKS cluster and it worked. This is really confusing, because the public subnets were not selected when I created the EKS cluster. I thought I needed to add the public subnets to the EKS cluster first and then create a managed node group in them; actually, that's not necessary. I can directly add a node group in the public subnets. I am not sure how this works.

Can anyone explain? (The Kubernetes version is 1.21 in EKS.)

mikestef9 (Contributor) commented Apr 19, 2022

@visla-xugeng see earlier comment #170 (comment) from this thread

visla-xugeng commented Apr 19, 2022

@mikestef9
Thank you for your reply. I have read several docs mentioned in #170 (comment). I just have one question: why do we still need to select several subnets when we create an EKS cluster, since nodes can be launched in any subnet in the cluster's VPC? What's the purpose of specifying subnets?
In my case, I tagged both the private and public subnets with kubernetes.io/cluster/my-cluster: shared when I created the VPC. As a result, I can build node groups in both the private and public subnets, and launch EC2 instances inside them. It looks like selecting several private subnets during the creation of the EKS cluster is not necessary. Is this right?

dz902 commented Jun 10, 2022

This seems to be only a UI problem. Why not just add it, after all?

sethatron commented Jun 24, 2022

My organization really needs this functionality as well.

mikestef9 moved this from "We're Working On It" to "Researching" in containers-roadmap Jul 29, 2022

BEvgeniyS commented Aug 2, 2022

Hi @mikestef9,

Does the fact that this ticket moved back to 'Researching' mean we shouldn't expect any progress anytime soon?

apetrosyan1613 commented Aug 2, 2022

Hi all! I have an EKS cluster in two subnets. Recently I added one public and one private subnet in a new AZ. I created a node pool in the private subnet and added an NLB in the public one. Nodes were successfully created and pods were successfully scheduled. I can see pod logs and run kubectl exec commands, and kubectl top also shows the new nodes and pods. At first sight, everything looks good. What issues could I face?

Sheepux commented Aug 25, 2022

Adding our vote/use case.
We're moving our cluster to a private endpoint and at the same time moving the control plane endpoint outside the node subnets. However, without being able to update the subnets, we have to recreate the cluster itself, and it's quite a struggle to explain to your users that your "maintenance" will involve the service being unavailable.
This capability is needed when making "small" architecture changes, and especially when you manage hundreds of clusters.
