
Provide more control over node upgrading #1018

Closed
m00lecule opened this issue Dec 17, 2021 · 12 comments
Labels
api (Issues that require API changes), consolidation, feature (New feature or request), needs-design (Design required)

Comments

@m00lecule

m00lecule commented Dec 17, 2021

Hello, I would like to submit a feature request.

Currently I am trying to replace cluster-autoscaler with Karpenter, but one missing feature is a blocker - upgrading the cluster nodes spawned by Karpenter. Referencing the official docs - https://karpenter.sh/docs/concepts/#upgrading-nodes - the only way to update your current cluster setup is to configure ttlSecondsUntilExpired for each node. I believe this won't be acceptable for applications that must be running 24/7 - in some corner cases a major part of the nodes might be deleted at the same time, causing a cluster outage. It also brings too much entropy into the cluster configuration - I can imagine monitoring engineers being terrified when, during a midnight shift, they find out that some production-facing instances have been unexpectedly deleted.
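For context, a minimal sketch of that expiry-based setup (the TTL value is just an example):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # nodes are terminated once they reach this age (30 days here);
  # nothing coordinates how many of them expire at the same time
  ttlSecondsUntilExpired: 2592000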

More mature tools like kOps handle updates in an HA manner - please see kops cluster rolling-update (https://kops.sigs.k8s.io/operations/rolling-update/) or https://docs.aws.amazon.com/eks/latest/userguide/update-managed-node-group.html.

My proposition is to implement Provisioner CRD versioning.

Motivation - Karpenter would assign a label to each node with the version of the Provisioner that was used to spawn the EC2 instance. Karpenter would then detect changes applied to the Provisioner and roll them out across the cluster while keeping it highly available.

Example scenario

1. Create the original Provisioner CRD

provisioner-01.yml

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    (...)
    - key: "node.kubernetes.io/instance-type" 
      operator: In
      values: ["m5.xlarge","m5.large"]
    (...)
  provider:
    (...)
    launchTemplate: eks-XXXX
    launchTemplateVersion: 1 

2. Karpenter spawns instances based on provisioner-01.yml

Imagine that we spawned 10 instances using the AMI from launch template eks-XXXX in version 1. Each of those instances would be sized either m5.xlarge or m5.large.

3. Apply some changes to the Provisioner named default

provisioner-02.yml

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    (...)
    - key: "node.kubernetes.io/instance-type" 
      operator: In
      values: ["r4.xlarge"]
    (...)
  provider:
    (...)
    launchTemplate: eks-XXXX
    launchTemplateVersion: 2 

4. Now the interesting part takes place

Karpenter detects that changes were applied to the Provisioner named default and tries to replace all instances spawned using provisioner-01.yml with the most up-to-date configuration from provisioner-02.yml, keeping the cluster highly available.

For convenience, nodes spawned using the Provisioner from provisioner-02.yml would be labeled default-node-02 and those from the original configuration default-node-01.
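As a sketch of what such a node could look like (the revision label key is hypothetical - nothing like it exists in Karpenter today - and the node name is a placeholder):

apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-12-34.eu-west-1.compute.internal   # placeholder
  labels:
    karpenter.sh/provisioner-name: default
    # hypothetical label carrying the Provisioner revision proposed here
    karpenter.sh/provisioner-revision: default-node-02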

Those changes would be applied in a rolling-update manner, and the loop would look like:

  1. Karpenter cordons all nodes labeled default-node-01 (preventing new pods from being scheduled on them).
  2. Initially Karpenter spins up an instance labeled default-node-02 using the launch template in version 2 (from provisioner-02.yml); that instance would be sized r4.xlarge.
  3. When the new node is ready, Karpenter tries to drain and evict a node labeled default-node-01 - if there is no room for the pods from the picked node, another instance labeled default-node-02 is created.
  4. Repeat step 3 until no node labeled default-node-01 remains.

thanks to @akestner and @ellistarn for bringing me here

I am looking forward to your feedback on the proposed approach.

Implementation

If I had to do it myself, I would propose a MutatingAdmissionWebhook for the Provisioner CRD that notifies the karpenter-controller and triggers reconfiguration of the Karpenter-managed nodes.

https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
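A rough sketch of such a webhook registration, assuming the karpenter controller exposed a mutating endpoint for this (the webhook name, service name, namespace, and path below are assumptions, not existing Karpenter objects):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: provisioner-revision-webhook          # hypothetical name
webhooks:
  - name: provisioner-revision.karpenter.sh   # hypothetical name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore
    clientConfig:
      service:
        name: karpenter                       # assumed controller service
        namespace: karpenter                  # assumed namespace
        path: /mutate-provisioner-revision    # hypothetical path
    rules:
      - apiGroups: ["karpenter.sh"]
        apiVersions: ["v1alpha5"]
        operations: ["UPDATE"]
        resources: ["provisioners"]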

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@m00lecule m00lecule added the feature New feature or request label Dec 17, 2021
@m00lecule m00lecule changed the title feature-request: karpenter Provider CRD versioning feature-request: karpenter Provisioner CRD versioning Dec 17, 2021
@m00lecule m00lecule changed the title feature-request: karpenter Provisioner CRD versioning feature: karpenter Provisioner CRD versioning Dec 17, 2021
@m00lecule m00lecule changed the title feature: karpenter Provisioner CRD versioning feature: cluster rolling update - karpenter Provisioner CRD versioning Dec 17, 2021
@ellistarn ellistarn changed the title feature: cluster rolling update - karpenter Provisioner CRD versioning Support more control over node upgrade Dec 17, 2021
@ellistarn ellistarn changed the title Support more control over node upgrade Provide more control over node upgrading Dec 17, 2021
@ellistarn
Contributor

terrified when, during a midnight shift, they find out that some production-facing instances have been unexpectedly deleted.

I think that this is massively important. We've heard this a couple times. We need some time-gated mechanism for allowing node termination. I think this particular facet operates outside of just upgrade and includes expiration, consolidation, and other termination use cases.

@ellistarn
Contributor

Karpenter detects that changes were applied to the Provisioner named default and tries to replace all instances spawned using provisioner-01.yml

If I'm reading this correctly, it seems like you're looking for a similar relationship to Deployment -> ReplicaSet -> Pod, but for NodeDeployment -> NodeReplicaSet -> Node. I can't help but think of CAPI here, but I need to noodle on how this would apply to a dynamic instance count.

@m00lecule
Author

m00lecule commented Dec 17, 2021

Karpenter detects that changes were applied to the Provisioner named default and tries to replace all instances spawned using provisioner-01.yml

If I'm reading this correctly, it seems like you're looking for a similar relationship to Deployment -> ReplicaSet -> Pod, but for NodeDeployment -> NodeReplicaSet -> Node. I can't help but think of CAPI here, but I need to noodle on how this would apply to a dynamic instance count.

Yes, a rolling update at the Provisioner CRD level. The algorithm would be a bit greedy - spawn enough instances using the latest Provisioner CRD revision to fit all pods from the node considered for deletion, and then reschedule those pods once the new instance is ready (respecting PDBs). I can imagine that in some scenarios it would require spawning multiple instances using the latest configuration (decreased sizing) and in others the number of nodes would be reduced (increased instance sizing).

@ellistarn ellistarn added consolidation api Issues that require API changes needs-design Design required labels Dec 17, 2021
@ellistarn
Contributor

Yes, a rolling update at the Provisioner CRD level. The algorithm would be somewhat greedy - spawn enough instances using the latest Provisioner CRD revision to fit all pods from the previous one. I can imagine that in some scenarios it would require spawning multiple instances using the latest configuration and in others the number of nodes would be reduced because of increased instance sizing.

I do wonder how much churn this would cause on provisioner update. I'm imagining the scenario where a provisioner manages 5,000 nodes. Right now, an operator could update requirements and they would be applied immediately for new instances. With the proposed mode, this would force a refresh on the entire cluster.

I do think we need to figure out how to support the case where you do want to recycle instances that no longer match requirements, but I want to be cautious that we don't lose out on the use case mentioned above.

@m00lecule
Author

m00lecule commented Dec 17, 2021

Yes, a rolling update at the Provisioner CRD level. The algorithm would be somewhat greedy - spawn enough instances using the latest Provisioner CRD revision to fit all pods from the previous one. I can imagine that in some scenarios it would require spawning multiple instances using the latest configuration and in others the number of nodes would be reduced because of increased instance sizing.

I do wonder how much churn this would cause on provisioner update. I'm imagining the scenario where a provisioner manages 5,000 nodes. Right now, an operator could update requirements and they would be applied immediately for new instances. With the proposed mode, this would force a refresh on the entire cluster.

I do think we need to figure out how to support the case where you do want to recycle instances that no longer match requirements, but I want to be cautious that we don't lose out on the use case mentioned above.

Agree - the cluster administrator must be able to pick their own strategy for handling cluster updates. Still, if I were in charge of that number of nodes, and they were not just some periodic spike, I would never pick the ttlSecondsUntilExpired option. There should be a provisioning strategy switch - RollingUpdate or some sort of DelayedProvision strategy.
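Purely as an illustration of what such a switch could look like (none of these fields exist in the current Provisioner API - the names are hypothetical):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # hypothetical field: how to react when the Provisioner spec changes
  updateStrategy:
    type: RollingUpdate          # or DelayedProvision
    rollingUpdate:
      maxUnavailable: 1          # drain at most one out-of-date node at a time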

@ellistarn
Contributor

We haven't gone anywhere. @njtran is looking into this. :)

@johngmyers

In general, you only want to update nodes when they no longer match the current spec - that is, when they no longer fit within the requirements, subnetSelector, or securityGroupSelector, or no longer match the taints, labels, kubeletConfiguration, instanceProfile, launchTemplate, tags, or metadataOptions.

@johngmyers

kOps determines an instance is out of date by seeing if its launch template version is not the most recent one. It attempts to ensure that a relevant spec change will result in a new template version, in some cases putting hashes of relevant configuration into the userdata.

An instance that does not match the current spec should be treated as if it has expired.

How quickly Karpenter should replace expired/out-of-spec instances is a separate issue. kOps has per-provisioner-equivalent configuration to limit how many additional instances should be launched when updating such nodes. It currently defaults to one instance, but can be set to either a scalar or a percentage of up to 100% of the instances needing update.
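For reference, the kOps setting in question is rollingUpdate.maxSurge on the InstanceGroup spec (values here are illustrative):

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes                    # illustrative group name
spec:
  rollingUpdate:
    # extra instances allowed while replacing out-of-date nodes;
    # accepts an absolute count or a percentage such as "20%"
    maxSurge: 1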

@johngmyers

Some information and experience from maintaining and using node reconciliation in kOps.

kOps usually surges, creating new nodes and waiting for them to be determined useful before cordoning and draining nodes needing reconciliation. As I understand it, Karpenter reconciles in deficit, draining nodes needing reconciliation and then letting the resulting unschedulable replacement pods drive creation of new nodes. Having Karpenter surge seems quite challenging and out of scope for this issue. The concurrency one might want with deficit reconciliation is likely lower than that with surging.

kOps reconciles one group at a time. This is because concurrency limits are specified per-group and it would be difficult to give kOps a way to configure limits that apply across groups.

For the general-purpose group we specify a concurrency limit of 15 nodes. This is not highly tuned, but significantly above that we had problems with the control plane and our CNI keeping up with the rate of change. We also use the same limit for service-specific groups belonging to stateless services.

Groups belonging to stateful services usually have a concurrency limit of 1 node. This is because the disruption budget for those services is usually set to 1 and more surging would be a waste of money. Nodes for stateful services take significantly longer to reconcile, as their disruption budgets take longer to permit progress. This is particularly so for services that store data locally on storage-optimized instances.
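The "disruption budget set to 1" here is just a standard PodDisruptionBudget on the stateful service, e.g. (names are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stateful-service         # placeholder
spec:
  maxUnavailable: 1              # only one pod may be evicted at a time
  selector:
    matchLabels:
      app: stateful-service      # placeholder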

We have logic (not in kOps) to reconcile the stateful groups after the stateless ones are done.

We limit node reconciliation to approximately business hours, but avoid peak times.

@ellistarn
Contributor

Related: #1457

@m00lecule
Author

m00lecule commented Jun 12, 2022

Hello back @ellistarn, @njtran!

I would like to share some thoughts with you. Currently Karpenter spawns plain EC2 instances and we would like to control their rollout process. It seems that this is just duplicating ASG logic inside Karpenter. What if we changed Karpenter's behavior a bit by granting it the capability to create and manage ASGs? Configuring a provisioner would then trigger ASG creation - for example, a Provisioner annotated with something like provisioner: asg and 3 instance types ["r4.xlarge", "r4.2xlarge", "r4.4xlarge"] would create three ASGs in AWS, and updating the provisioner would trigger an ASG rollout.
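A hypothetical sketch of such a Provisioner (the annotation is invented purely for illustration):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
  annotations:
    provisioner: asg             # hypothetical: back capacity with ASGs
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["r4.xlarge", "r4.2xlarge", "r4.4xlarge"]

With this, Karpenter would create one ASG per listed instance type and roll each ASG whenever the Provisioner changes.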

@ellistarn
Contributor

It seems that this is just duplicating ASG logic inside Karpenter.

Disagree. Karpenter reacts to pending pods, ASG uses a replica count.

Closing in favor of #1738
