Provide more control over node upgrading #1018
Comments
I think that this is massively important. We've heard this a couple of times. We need some time-gated mechanism for allowing node termination. I think this particular facet operates outside of just upgrade and includes expiration, consolidation, and other termination use cases.
If I'm reading this correctly, it seems like you're looking for a relationship similar to Deployment -> ReplicaSet -> Pod, but for NodeDeployment -> NodeReplicaSet -> Node. I can't help but think of CAPI here, but I need to noodle on how this would apply to a dynamic instance count.
Yes, a rolling update at the Provisioner CRD level. The algorithm would be a bit greedy: spawn enough instances using the latest Provisioner CRD revision to fit all pods from the node considered for deletion, and then reschedule those pods once the new instances are ready (respecting PDBs). I can imagine that in some scenarios this would require spawning multiple instances using the latest configuration (decreased sizing), and in others the number of nodes would be reduced (increased instance sizing).
I do wonder how much churn this would cause on provisioner update. I'm imagining the scenario where a provisioner manages 5,000 nodes. Right now, an operator could update requirements and they would be applied immediately for new instances. With the proposed mode, this would force a refresh of the entire cluster. I do think we need to figure out how to support the case where you do want to recycle instances that no longer match requirements, but I want to be cautious that we don't lose the use case mentioned above.
Agreed, the cluster administrator must be able to pick their own strategy for handling cluster updates. Still, if I were in charge of handling those numbers of nodes, and they were not some periodic spike, then I would never pick a
We haven't gone anywhere. @njtran is looking into this. :)
In general, you only want to update nodes when they no longer match the current spec. That would be when they no longer fit within the requirements, subnetSelector, or securityGroupSelector, or no longer match the taints, labels, kubeletConfiguration, instanceProfile, launchTemplate, tags, or metadataOptions.
kOps determines that an instance is out of date by checking whether its launch template version is the most recent one. It attempts to ensure that a relevant spec change results in a new template version, in some cases putting hashes of relevant configuration into the userdata. An instance that does not match the current spec should be treated as if it has expired. How quickly Karpenter should replace expired/out-of-spec instances is a separate issue. kOps has per provisioner-equivalent configuration to limit how many additional instances should be launched when updating such nodes. It currently defaults to one instance, but can be set to either a scalar or a percentage up to 100% of the instances needing update.
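For reference, the surge limit described above is configured per InstanceGroup in kOps. A minimal sketch, assuming the `rollingUpdate` settings on the InstanceGroup spec; the group name and limit are illustrative, so check the kOps rolling-update docs for the fields your version supports:

```yaml
# kOps InstanceGroup with a rolling-update surge limit (illustrative values).
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: general-purpose      # illustrative group name
spec:
  role: Node
  rollingUpdate:
    maxSurge: 15             # scalar count; a percentage string such as "15%" also works
```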
Some information and experience from maintaining and using node reconciliation in kOps.

kOps usually surges, creating new nodes and waiting for them to be determined useful before cordoning and draining nodes needing reconciliation. As I understand it, Karpenter reconciles in deficit, draining nodes needing reconciliation and then letting the resulting unschedulable replacement pods drive creation of new nodes. Having Karpenter surge seems quite challenging and out of scope for this issue. The concurrency one might want with deficit reconciliation is likely lower than with surging.

kOps reconciles one group at a time. This is because concurrency limits are specified per group, and it would be difficult to give kOps a way to configure limits that apply across groups.

For the general-purpose group we specify a concurrency limit of 15 nodes. This is not highly tuned, but significantly above it we had problems with the control plane and our CNI keeping up with the rate of change. We also use the same limit for service-specific groups belonging to stateless services.

Groups belonging to stateful services usually have a concurrency limit of 1 node. This is because the disruption budget for those services is usually set to 1, and more surging would be a waste of money. Nodes for stateful services take significantly longer to reconcile, as their disruption budgets take longer to permit progress. This is particularly so for services that store data locally on storage-optimized instances. We have logic (not in kOps) to reconcile the stateful groups after the stateless ones are done.

We limit node reconciliation to approximately business hours, but avoid peak times.
Related: #1457
Hello back @ellistarn, @njtran! I would like to share some thoughts with you. Currently Karpenter spawns plain EC2 instances, and we would like to control their rollout process. It seems that this just duplicates ASG logic inside Karpenter. What if we changed Karpenter's behavior a bit by granting it the capability to create and manage ASGs? Configuring provisioners would trigger ASG creation; for example, a Provisioner typed with specs
Disagree. Karpenter reacts to pending pods; an ASG uses a replica count. Closing in favor of #1738
Hello, I would like to submit a feature request.

Currently I am trying to replace Cluster Autoscaler with Karpenter, but one missing feature is a blocker: updating cluster nodes spawned by Karpenter. According to the official docs (https://karpenter.sh/docs/concepts/#upgrading-nodes), the only way to update your current cluster setup is to configure `ttlSecondsUntilExpired` for each node. I believe this won't be acceptable for applications that must run 24/7: in some corner cases a major part of the nodes might be deleted at the same time, causing a cluster outage. It also brings too much entropy into the cluster configuration; I can imagine monitoring engineers being terrified when, during a midnight shift, they find out that some production-facing instances have been unexpectedly deleted.

More mature tools like `kops` handle updates in an HA manner; see `kops rolling-update cluster` (https://kops.sigs.k8s.io/operations/rolling-update/) or https://docs.aws.amazon.com/eks/latest/userguide/update-managed-node-group.html.

My proposition is to implement Provisioner CRD versioning.
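For context, the expiry-driven approach mentioned above looks roughly like this. A sketch against Karpenter's v1alpha5 `Provisioner` API, which was current when this issue was filed; the TTL and instance types are illustrative values, not from the docs:

```yaml
# Nodes created by this Provisioner are terminated and replaced
# once they are older than ttlSecondsUntilExpired.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsUntilExpired: 2592000   # 30 days; illustrative value
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.xlarge"]
```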
Motivation: Karpenter would assign a label to each node with the version of the `Provisioner` that was used to spawn the EC2 instance. Karpenter would detect changes applied to the `Provisioner` and apply them to the cluster while keeping the cluster highly available.

Example scenario
1. Create the original Provisioner CRD, `provisioner-01.yml`.
2. Karpenter spawns instances based on `provisioner-01.yml`. Imagine that we spawned 10 instances with the AMI from launch template `eks-XXXX` in version `1`. Those instances would be sized either `m5.xlarge` or `m5.large`.
3. Apply some changes to the `Provisioner` named `default` (`provisioner-02.yml`).
4. Now the interesting part takes place: Karpenter detects that changes were applied to the `Provisioner` named `default` and tries to replace all instances spawned using `provisioner-01.yml` with the most up-to-date configuration from `provisioner-02.yml`, keeping the cluster highly available.

For convenience, nodes spawned using the Provisioner from `provisioner-02.yml` would be labeled `default-node-02`, and those from the original configuration `default-node-01`.
Those changes would be applied in a rolling-update manner, and the loop would look like:

1. Cordon a node labeled `default-node-01` (prevents pods from being scheduled on it).
2. Spawn a replacement instance labeled `default-node-02` using the launch template in version `2` (from `provisioner-02.yml`); the instance would be sized `r4.xlarge`.
3. Drain the cordoned `default-node-01` node; if there were no room for the pods from the picked node, another instance labeled `default-node-02` would be created.
4. Repeat while nodes labeled `default-node-01` exist.

Thanks to @akestner and @ellistarn for bringing me here.
I am looking forward to your feedback on the proposed approach.
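To make the scenario above concrete, the difference between the two revisions could be as small as the instance-type requirement. A sketch against the v1alpha5 `Provisioner` API; the file contents are illustrative, not taken from the issue:

```yaml
# provisioner-01.yml (original revision): nodes sized m5.large / m5.xlarge
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.xlarge"]
---
# provisioner-02.yml (updated revision): replacement nodes become r4.xlarge
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["r4.xlarge"]
```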
Implementation

If I had to do it myself, I would propose a `MutatingAdmissionWebhook` for the `Provisioner` CRD that would notify `karpenter-controller` and trigger the reconfiguration of Karpenter nodes. See https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
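A rough sketch of what registering such a webhook could look like. The webhook, service, and path names here are hypothetical, chosen only to illustrate the wiring; the real registration would live in Karpenter's deployment manifests:

```yaml
# Registers a hypothetical webhook that fires whenever a Provisioner is updated,
# letting the controller kick off a rolling replacement of out-of-date nodes.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: provisioner-revision-webhook        # hypothetical name
webhooks:
  - name: provisioner-revision.karpenter.sh # hypothetical name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    rules:
      - apiGroups: ["karpenter.sh"]
        apiVersions: ["v1alpha5"]
        operations: ["UPDATE"]
        resources: ["provisioners"]
    clientConfig:
      service:
        name: karpenter-webhook             # hypothetical service
        namespace: karpenter
        path: /provisioner-revision
```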