[Bare Metal] Scale out of workload cluster worker node group causes control plane to roll #7993
Update: When scaling out a single node cluster with a worker node group, the control plane node "reverts" back to being a plain control plane node. This means the explicit empty `taints: []` entry is dropped from the kubeadm config, so the default control-plane taint is applied again. In the code/spec this is the difference between the two specs below.

Single node cluster control plane spec:

```yaml
initConfiguration:
  localAPIEndpoint: {}
  nodeRegistration:
    imagePullPolicy: IfNotPresent
    kubeletExtraArgs:
      anonymous-auth: "false"
      provider-id: PROVIDER_ID
      read-only-port: "0"
      tls-cipher-suites: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    taints: []
joinConfiguration:
  bottlerocketAdmin: {}
  bottlerocketBootstrap: {}
  bottlerocketControl: {}
  discovery: {}
  nodeRegistration:
    ignorePreflightErrors:
    - DirAvailable--etc-kubernetes-manifests
    imagePullPolicy: IfNotPresent
    kubeletExtraArgs:
      anonymous-auth: "false"
      provider-id: PROVIDER_ID
      read-only-port: "0"
      tls-cipher-suites: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    taints: []
pause: {}
proxy: {}
registryMirror: {}
```

Single node cluster control plane spec scaled out with a worker node group:

```yaml
initConfiguration:
  localAPIEndpoint: {}
  nodeRegistration:
    imagePullPolicy: IfNotPresent
    kubeletExtraArgs:
      anonymous-auth: "false"
      provider-id: PROVIDER_ID
      read-only-port: "0"
      tls-cipher-suites: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
joinConfiguration:
  bottlerocketAdmin: {}
  bottlerocketBootstrap: {}
  bottlerocketControl: {}
  discovery: {}
  nodeRegistration:
    ignorePreflightErrors:
    - DirAvailable--etc-kubernetes-manifests
    imagePullPolicy: IfNotPresent
    kubeletExtraArgs:
      anonymous-auth: "false"
      provider-id: PROVIDER_ID
      read-only-port: "0"
      tls-cipher-suites: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
pause: {}
proxy: {}
registryMirror: {}
```

Diff:

```diff
--- singleNode.yaml	2024-04-18 12:56:22.396824833 -0600
+++ scaledout.yaml	2024-04-18 12:56:31.960841460 -0600
@@ -7,7 +7,6 @@
       provider-id: PROVIDER_ID
       read-only-port: "0"
       tls-cipher-suites: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
-    taints: []
 joinConfiguration:
   bottlerocketAdmin: {}
   bottlerocketBootstrap: {}
@@ -22,7 +21,6 @@
       provider-id: PROVIDER_ID
       read-only-port: "0"
       tls-cipher-suites: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
-    taints: []
 pause: {}
 proxy: {}
 registryMirror: {}
```
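For context on what dropping `taints: []` implies: when the kubeadm config carries no explicit taints list for a control plane node, kubeadm falls back to its default control-plane taint, roughly the following (a sketch of the kubeadm default for this Kubernetes version, not taken from this cluster):

```yaml
# Default taint kubeadm applies to control plane nodes when
# nodeRegistration.taints is unset; this is what keeps ordinary
# workloads off the node once the explicit empty list is removed.
taints:
- key: node-role.kubernetes.io/control-plane
  effect: NoSchedule
```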
I also observed that the new worker node does join the cluster; `kubectl get nodes` shows:

```
NAME                 STATUS   ROLES           AGE    VERSION
<new worker node>    Ready    <none>          145m   v1.27.4-eks-cedffd4
<original CP node>   Ready    control-plane   21h    v1.27.4-eks-cedffd4
```

The workload cluster is left in a bad state, though, and subsequent lifecycle commands will fail with:

```
❌ Validation failed	{"validation": "control plane ready", "error": "1 control plane replicas are unavailable", "remediation": "ensure control plane nodes and pods for cluster workload-test are Ready"}
```

as the control plane needs to be rolled but there is no available hardware:

```
kubectl get kcp -n eksa-system workload-cluster
NAME               CLUSTER            INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
workload-cluster   workload-cluster   true          true                   2          1       1         1             4h    v1.27.11-eks-1-27-25
```
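For anyone wanting to reproduce the spec comparison above, capturing the rendered KubeadmControlPlane object before and after the scale-out and diffing the two should show the same change; a sketch, assuming the kcp name and namespace from this cluster (the file names mirror the diff above, though the original diff may have been taken on just the kubeadm config portion):

```sh
# Capture the rendered control plane spec before the scale-out...
kubectl get kcp -n eksa-system workload-cluster -o yaml > singleNode.yaml
# ...add the worker node group and run the upgrade, then capture again...
kubectl get kcp -n eksa-system workload-cluster -o yaml > scaledout.yaml
# ...and compare the two.
diff -u singleNode.yaml scaledout.yaml
```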
Final update: It appears that versions of EKSA before v0.19 did not have this behavior and would not roll control plane nodes when adding or removing the only worker node group configuration. This, for better or worse, was a bug rather than the intended behavior, and v0.19 has "fixed" it. We discussed this internally and we will not be pursuing any change to this behavior. There will be some doc updates to make this clear.

One last note on why we won't be changing this behavior (this will be in the docs too): going from a control-plane-only cluster to a cluster with worker node(s) changes the nature and fundamental behavior of the control plane nodes. There are significant internal code, behavior, and spec consequences to a change like this. These are a couple of the reasons we have decided not to pursue a change to the current v0.19 behavior.
What happened:
I have a single node workload cluster. I added only one additional machine to my hardware.csv in order to add one worker node group with one worker node to the cluster. When I run `eksctl anywhere upgrade cluster`, EKSA starts to roll my one control plane node. As I don't have any spare hardware, the cluster does not upgrade and is left in an unmanageable state, meaning I cannot perform any other cluster lifecycle phases with the CLI. This potentially has something to do with #7991. (See the spec sketch below for the change that triggers the roll.)
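The triggering change in the Cluster spec looks roughly like the following; this is a hypothetical fragment for illustration, with the node group name and machine config ref as placeholders, not taken from the actual cluster:

```yaml
# Hypothetical EKS Anywhere Bare Metal Cluster spec fragment; the
# workerNodeGroupConfigurations block is the part newly added to the
# previously single-node cluster.
spec:
  controlPlaneConfiguration:
    count: 1
  workerNodeGroupConfigurations:
  - count: 1
    name: md-0                      # placeholder node group name
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: workload-test-worker    # placeholder machine config name
```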
What you expected to happen:
The control plane should not roll. The cluster should not become unmanageable.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment: