
Provide more control over node upgrading #1018

Closed
m00lecule opened this issue Dec 17, 2021 · 12 comments
Labels
api (Issues that require API changes), consolidation, feature (New feature or request), needs-design (Design required)

Comments

@m00lecule

m00lecule commented Dec 17, 2021

Hello, I would like to submit a feature request.

Currently I am trying to replace cluster-autoscaler with Karpenter, but one missing feature is a blocker - upgrading the cluster nodes spawned by Karpenter. Referencing the official docs - https://karpenter.sh/docs/concepts/#upgrading-nodes - the only way to update your current cluster setup is to configure ttlSecondsUntilExpired for each node. I believe this won't be acceptable for applications that must be running 24/7 - in some corner cases a major part of the nodes might be deleted at the same time, causing a cluster outage. It also brings too much entropy into the cluster configuration - I can imagine monitoring engineers being terrified when, during a midnight shift, they find out that some production-facing instances have been unexpectedly deleted.
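For context, a minimal sketch of that expiry-based setup (the TTL value is just an example):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # nodes are terminated once they reach this age (30 days here);
  # nothing coordinates how many of them expire at the same time
  ttlSecondsUntilExpired: 2592000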

More mature tools like kOps handle updates in an HA manner - please see kops cluster rolling-update (https://kops.sigs.k8s.io/operations/rolling-update/) or https://docs.aws.amazon.com/eks/latest/userguide/update-managed-node-group.html.

My proposition is to implement Provisioner CRD versioning.

Motivation - Karpenter would assign a label to each node with the version of the Provisioner that was used to spawn the EC2 instance. Karpenter would then detect changes applied to the Provisioner and roll them out across the cluster while keeping it highly available.

Example scenario

1. Create the original Provisioner CRD

provisioner-01.yml

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    (...)
    - key: "node.kubernetes.io/instance-type" 
      operator: In
      values: ["m5.xlarge","m5.large"]
    (...)
  provider:
    (...)
    launchTemplate: eks-XXXX
    launchTemplateVersion: 1 

2. Karpenter spawns instances based on provisioner-01.yml

Imagine that we spawned 10 instances using the AMI from launch template eks-XXXX in version 1. Each of those instances would be sized either m5.xlarge or m5.large.

3. Apply some changes to the Provisioner named default

provisioner-02.yml

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    (...)
    - key: "node.kubernetes.io/instance-type" 
      operator: In
      values: ["r4.xlarge"]
    (...)
  provider:
    (...)
    launchTemplate: eks-XXXX
    launchTemplateVersion: 2 

4. Now the interesting part takes place

Karpenter detects that changes were applied to the Provisioner named default and tries to replace all instances spawned using provisioner-01.yml with the most up-to-date configuration from provisioner-02.yml, keeping the cluster highly available.

For convenience, nodes spawned using the Provisioner from provisioner-02.yml would be labeled default-node-02 and those from the original configuration default-node-01.
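As a sketch of what such a node could look like (the revision label key is hypothetical - nothing like it exists in Karpenter today - and the node name is a placeholder):

apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-12-34.eu-west-1.compute.internal   # placeholder
  labels:
    karpenter.sh/provisioner-name: default
    # hypothetical label carrying the Provisioner revision proposed here
    karpenter.sh/provisioner-revision: default-node-02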

Those changes would be applied in a rolling-update manner, and the loop would look like:

  1. Karpenter cordons all nodes labeled default-node-01 (preventing new pods from being scheduled on them).
  2. Initially Karpenter spins up an instance labeled default-node-02 using the launch template in version 2 (from provisioner-02.yml); that instance would be sized r4.xlarge.
  3. When the new node is ready, Karpenter tries to drain and evict a node labeled default-node-01 - if there is no room for the pods from the picked node, another instance labeled default-node-02 is created.
  4. Repeat step 3 until no node labeled default-node-01 remains.

thanks to @akestner and @ellistarn for bringing me here

I am looking forward to your feedback on the proposed approach.

Implementation

If I had to do it myself, I would propose a MutatingAdmissionWebhook for the Provisioner CRD that notifies the karpenter-controller and triggers reconfiguration of the Karpenter-managed nodes.

https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
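A rough sketch of such a webhook registration, assuming the karpenter controller exposed a mutating endpoint for this (the webhook name, service name, namespace, and path below are assumptions, not existing Karpenter objects):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: provisioner-revision-webhook          # hypothetical name
webhooks:
  - name: provisioner-revision.karpenter.sh   # hypothetical name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore
    clientConfig:
      service:
        name: karpenter                       # assumed controller service
        namespace: karpenter                  # assumed namespace
        path: /mutate-provisioner-revision    # hypothetical path
    rules:
      - apiGroups: ["karpenter.sh"]
        apiVersions: ["v1alpha5"]
        operations: ["UPDATE"]
        resources: ["provisioners"]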

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@m00lecule m00lecule added the feature New feature or request label Dec 17, 2021
@m00lecule m00lecule changed the title feature-request: karpenter Provider CRD versioning feature-request: karpenter Provisioner CRD versioning Dec 17, 2021
@m00lecule m00lecule changed the title feature-request: karpenter Provisioner CRD versioning feature: karpenter Provisioner CRD versioning Dec 17, 2021
@m00lecule m00lecule changed the title feature: karpenter Provisioner CRD versioning feature: cluster rolling update - karpenter Provisioner CRD versioning Dec 17, 2021
@ellistarn ellistarn changed the title feature: cluster rolling update - karpenter Provisioner CRD versioning Support more control over node upgrade Dec 17, 2021
@ellistarn ellistarn changed the title Support more control over node upgrade Provide more control over node upgrading Dec 17, 2021
@ellistarn
Contributor

terrified when, during a midnight shift, they find out that some production-facing instances have been unexpectedly deleted.

I think that this is massively important. We've heard this a couple times. We need some time-gated mechanism for allowing node termination. I think this particular facet operates outside of just upgrade and includes expiration, consolidation, and other termination use cases.

@ellistarn
Contributor

Karpenter detects that changes were applied to the Provisioner named default and tries to replace all instances spawned using provisioner-01.yml

If I'm reading this correctly, it seems like you're looking for a similar relationship to Deployment -> ReplicaSet -> Pod, but for NodeDeployment -> NodeReplicaSet -> Node. I can't help but think of CAPI here, but I need to noodle on how this would apply to a dynamic instance count.

@m00lecule
Author

m00lecule commented Dec 17, 2021

Karpenter detects that changes were applied to the Provisioner named default and tries to replace all instances spawned using provisioner-01.yml

If I'm reading this correctly, it seems like you're looking for a similar relationship to Deployment -> ReplicaSet -> Pod, but for NodeDeployment -> NodeReplicaSet -> Node. I can't help but think of CAPI here, but I need to noodle on how this would apply to a dynamic instance count.

Yes, a rolling update at the Provisioner CRD level. The algorithm would be a bit greedy - spawn enough instances using the latest Provisioner CRD revision to fit all pods from the node considered for deletion, and then reschedule those pods once the new instance is ready (respecting PDBs). I can imagine that in some scenarios it would require spawning multiple instances using the latest configuration (decreased sizing) and in others the number of nodes would be reduced (increased instance sizing).

@ellistarn ellistarn added consolidation api Issues that require API changes needs-design Design required labels Dec 17, 2021
@ellistarn
Contributor

Yes, a rolling update at the Provisioner CRD level. The algorithm would be somewhat greedy - spawn enough instances using the latest Provisioner CRD revision to fit all pods from the previous one. I can imagine that in some scenarios it would require spawning multiple instances using the latest configuration and in others the number of nodes would be reduced because of increased instance sizing.

I do wonder how much churn this would cause on provisioner update. I'm imagining the scenario where a provisioner manages 5,000 nodes. Right now, an operator could update requirements and they would be applied immediately for new instances. With the proposed mode, this would force a refresh on the entire cluster.

I do think we need to figure out how to support the case where you do want to recycle instances that no longer match requirements, but I want to be cautious that we don't lose out on the use case mentioned above.

@m00lecule
Author

m00lecule commented Dec 17, 2021

Yes, a rolling update at the Provisioner CRD level. The algorithm would be somewhat greedy - spawn enough instances using the latest Provisioner CRD revision to fit all pods from the previous one. I can imagine that in some scenarios it would require spawning multiple instances using the latest configuration and in others the number of nodes would be reduced because of increased instance sizing.

I do wonder how much churn this would cause on provisioner update. I'm imagining the scenario where a provisioner manages 5,000 nodes. Right now, an operator could update requirements and they would be applied immediately for new instances. With the proposed mode, this would force a refresh on the entire cluster.

I do think we need to figure out how to support the case where you do want to recycle instances that no longer match requirements, but I want to be cautious that we don't lose out on the use case mentioned above.

Agree - the cluster administrator must be able to pick their own strategy for handling cluster updates. Still, if I were in charge of that number of nodes, and they were not just some periodic spike, I would never pick the ttlSecondsUntilExpired option. There should be a provisioning strategy switch - RollingUpdate or some sort of DelayedProvision strategy.
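Purely as an illustration of what such a switch could look like (none of these fields exist in the current Provisioner API - the names are hypothetical):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # hypothetical field: how to react when the Provisioner spec changes
  updateStrategy:
    type: RollingUpdate          # or DelayedProvision
    rollingUpdate:
      maxUnavailable: 1          # drain at most one out-of-date node at a time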

@ellistarn
Contributor

We haven't gone anywhere. @njtran is looking into this. :)

@johngmyers

In general, you only want to update nodes when they no longer match the current spec - that is, when they no longer fit within the requirements, subnetSelector, or securityGroupSelector, or no longer match the taints, labels, kubeletConfiguration, instanceProfile, launchTemplate, tags, or metadataOptions.

@johngmyers

kOps determines an instance is out of date by seeing if its launch template version is not the most recent one. It attempts to ensure that a relevant spec change will result in a new template version, in some cases putting hashes of relevant configuration into the userdata.

An instance that does not match the current spec should be treated as if it has expired.

How quickly Karpenter should replace expired/out-of-spec instances is a separate issue. kOps has per-provisioner-equivalent configuration to limit how many additional instances should be launched when updating such nodes. It currently defaults to one instance, but can be set to either a scalar or a percentage of up to 100% of the instances needing update.
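For reference, the kOps setting in question is rollingUpdate.maxSurge on the InstanceGroup spec (values here are illustrative):

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes                    # illustrative group name
spec:
  rollingUpdate:
    # extra instances allowed while replacing out-of-date nodes;
    # accepts an absolute count or a percentage such as "20%"
    maxSurge: 1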

@johngmyers

Some information and experience from maintaining and using node reconciliation in kOps.

kOps usually surges, creating new nodes and waiting for them to be determined useful before cordoning and draining nodes needing reconciliation. As I understand it, Karpenter reconciles in deficit, draining nodes needing reconciliation and then letting the resulting unschedulable replacement pods drive creation of new nodes. Having Karpenter surge seems quite challenging and out of scope for this issue. The concurrency one might want with deficit reconciliation is likely lower than that with surging.

kOps reconciles one group at a time. This is because concurrency limits are specified per-group and it would be difficult to give kOps a way to configure limits that apply across groups.

For the general-purpose group we specify a concurrency limit of 15 nodes. This is not highly tuned, but significantly above that we had problems with the control plane and our CNI keeping up with the rate of change. We also use the same limit for service-specific groups belonging to stateless services.

Groups belonging to stateful services usually have a concurrency limit of 1 node. This is because the disruption budget for those services is usually set to 1 and more surging would be a waste of money. Nodes for stateful services take significantly longer to reconcile, as their disruption budgets take longer to permit progress. This is particularly so for services that store data locally on storage-optimized instances.
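The "disruption budget set to 1" here is just a standard PodDisruptionBudget on the stateful service, e.g. (names are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stateful-service         # placeholder
spec:
  maxUnavailable: 1              # only one pod may be evicted at a time
  selector:
    matchLabels:
      app: stateful-service      # placeholder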

We have logic (not in kOps) to reconcile the stateful groups after the stateless ones are done.

We limit node reconciliation to approximately business hours, but avoid peak times.

@ellistarn
Contributor

Related: #1457

@m00lecule
Author

m00lecule commented Jun 12, 2022

Hello back @ellistarn, @njtran!

I would like to share some thoughts with you. Currently Karpenter spawns plain EC2 instances and we would like to control their rollout process. It seems that this is just duplicating ASG logic inside Karpenter. What if we changed Karpenter's behavior a bit by granting it the capability to create and manage ASGs? Configuring a provisioner would then trigger ASG creation - for example, a Provisioner annotated with something like provisioner: asg and 3 instance types ["r4.xlarge", "r4.2xlarge", "r4.4xlarge"] would create three ASGs in AWS, and updating the provisioner would trigger an ASG rollout.
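A hypothetical sketch of such a Provisioner (the annotation is invented purely for illustration):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
  annotations:
    provisioner: asg             # hypothetical: back capacity with ASGs
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["r4.xlarge", "r4.2xlarge", "r4.4xlarge"]

With this, Karpenter would create one ASG per listed instance type and roll each ASG whenever the Provisioner changes.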

@ellistarn
Contributor

It seems that this is just duplicating ASG logic inside Karpenter.

Disagree. Karpenter reacts to pending pods, ASG uses a replica count.

Closing in favor of #1738
