What happened
When an eksctl create nodegroup run fails and leaves a CloudFormation stack in the ROLLBACK_COMPLETE state, a subsequent eksctl create nodegroup with the same config silently does nothing and logs:
[ℹ] checking security group configuration for all nodegroups
[ℹ] all nodegroups have up-to-date cloudformation templates
The user sees a reassuring "all good" message and only discovers later that the nodegroup is still broken. The ROLLBACK_COMPLETE stack is unusable (CloudFormation refuses to update it), yet eksctl treats it as a healthy existing nodegroup, excludes it from the create plan, and exits successfully.
Related: #4006 (same symptom, closed by stale-bot without a fix or root-cause analysis).
What was expected
eksctl should either warn/error that one or more nodegroup stacks are in ROLLBACK_COMPLETE and point to a fix, or recreate the rolled-back stack automatically. At a minimum, it must not claim "all nodegroups have up-to-date cloudformation templates" when it has not checked template freshness at all.
Steps to reproduce
- Run eksctl create nodegroup with a config that causes CloudFormation to fail mid-create (e.g. an invalid launch template).
- Observe that the nodegroup stack reaches ROLLBACK_COMPLETE.
- Fix the config issue and re-run eksctl create nodegroup with the same config file.
- Observe that the rolled-back stack is silently treated as an existing nodegroup and excluded from the create plan; the compatibility check runs, "all nodegroups have up-to-date cloudformation templates" is logged, and the command exits 0 without creating anything.
Environment
Reproduces against current master at commit b86e8bdfb. Not a regression from a specific recent version — the code paths involved have existed for a long time.
Root cause
Three independent issues combine:
1. Misleading log line in ValidateExistingNodeGroupsForCompatibility
pkg/eks/nodegroup_service.go line 309:
	logger.Info("all nodegroups have up-to-date cloudformation templates")
The function (pkg/eks/nodegroup_service.go lines 282-322) does not check template freshness. It only checks whether existing nodegroup stacks expose the NodeGroupFeatureSharedSecurityGroup CloudFormation output via isNodeGroupCompatible (pkg/eks/compatibility.go lines 44-96). The log message dates from that narrow shared-security-group compatibility check but reads like a general "your stacks are clean and current" health assertion.
2. ROLLBACK_COMPLETE is grouped with healthy terminal states
pkg/cfn/manager/api.go lines 533-550:
	func (*StackCollection) StackStatusIsNotTransitional(s *Stack) bool {
		for _, state := range nonTransitionalReadyStackStatuses() {
			if s.StackStatus == state {
				return true
			}
		}
		return false
	}

	func nonTransitionalReadyStackStatuses() []types.StackStatus {
		return []types.StackStatus{
			types.StackStatusCreateComplete,
			types.StackStatusUpdateComplete,
			types.StackStatusRollbackComplete, // <-- problem
			types.StackStatusUpdateRollbackComplete,
		}
	}
ROLLBACK_COMPLETE means the stack's initial CREATE failed and was rolled back — all resources are gone, only the empty stack shell remains, and CloudFormation refuses to update it. Grouping it with CREATE_COMPLETE/UPDATE_COMPLETE causes callers to treat broken stacks as healthy. Note: UPDATE_ROLLBACK_COMPLETE is genuinely healthy (a failed update rolled back to a known-good state), so only ROLLBACK_COMPLETE is wrong here.
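To illustrate the distinction, here is a minimal, self-contained sketch of a readiness check that leaves ROLLBACK_COMPLETE out. It is not the eksctl code: StackStatus and the Status* constants are local stand-ins for the AWS SDK's types.StackStatus values, and readyStackStatuses and isReady are hypothetical names chosen for this example.

```go
package main

import "fmt"

// StackStatus is a local stand-in for the AWS SDK's types.StackStatus,
// used so that this sketch compiles without external dependencies.
type StackStatus string

const (
	StatusCreateComplete         StackStatus = "CREATE_COMPLETE"
	StatusUpdateComplete         StackStatus = "UPDATE_COMPLETE"
	StatusRollbackComplete       StackStatus = "ROLLBACK_COMPLETE"
	StatusUpdateRollbackComplete StackStatus = "UPDATE_ROLLBACK_COMPLETE"
)

// readyStackStatuses (hypothetical name) lists terminal states in which the
// stack still owns usable resources. ROLLBACK_COMPLETE is deliberately absent:
// the initial create failed, so only an empty stack shell remains.
func readyStackStatuses() []StackStatus {
	return []StackStatus{
		StatusCreateComplete,
		StatusUpdateComplete,
		StatusUpdateRollbackComplete, // failed update, rolled back to a known-good state
	}
}

// isReady reports whether s is one of the ready terminal states.
func isReady(s StackStatus) bool {
	for _, state := range readyStackStatuses() {
		if s == state {
			return true
		}
	}
	return false
}

func main() {
	for _, s := range []StackStatus{
		StatusCreateComplete,
		StatusUpdateComplete,
		StatusRollbackComplete,
		StatusUpdateRollbackComplete,
	} {
		fmt.Printf("%-26s ready=%t\n", s, isReady(s))
	}
}
```

With this split, only ROLLBACK_COMPLETE is reported as not ready, while UPDATE_ROLLBACK_COMPLETE keeps its current treatment.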
3. The create-nodegroup filter excludes rolled-back stacks
Separately — and this is why the command exits 0 without creating anything — NodeGroupFilter.SetOnlyLocal (pkg/ctl/cmdutils/filter/nodegroup_filter.go line 80) calls loadLocalAndRemoteNodegroups, which treats any existing stack (including ROLLBACK_COMPLETE) as a "remote" nodegroup to be excluded from creation. The underlying ListNodeGroupStacks (pkg/cfn/manager/nodegroup.go line 238) only filters out DELETE_COMPLETE/DELETE_FAILED, so ROLLBACK_COMPLETE stacks pass through as "existing".
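The effect can be reproduced with a simplified model of the filter. This is only a sketch of the behaviour described above, not the actual eksctl implementation: remoteStack, listExisting and createPlan are hypothetical names, and statuses are plain strings rather than SDK types.

```go
package main

import "fmt"

// remoteStack is a simplified, hypothetical view of a nodegroup stack.
type remoteStack struct {
	NodeGroupName string
	Status        string
}

// listExisting models the listing behaviour described above: only deleted
// stacks are dropped, so a ROLLBACK_COMPLETE stack still counts as existing.
func listExisting(stacks []remoteStack) map[string]bool {
	existing := map[string]bool{}
	for _, s := range stacks {
		if s.Status == "DELETE_COMPLETE" || s.Status == "DELETE_FAILED" {
			continue
		}
		existing[s.NodeGroupName] = true
	}
	return existing
}

// createPlan returns the local nodegroups that have no existing stack,
// i.e. the ones a create run would actually act on.
func createPlan(local []string, stacks []remoteStack) []string {
	existing := listExisting(stacks)
	var plan []string
	for _, name := range local {
		if !existing[name] {
			plan = append(plan, name)
		}
	}
	return plan
}

func main() {
	stacks := []remoteStack{
		{NodeGroupName: "ng-1", Status: "ROLLBACK_COMPLETE"},
	}
	plan := createPlan([]string{"ng-1"}, stacks)
	// The rolled-back stack hides ng-1 from the plan, so nothing is created.
	fmt.Println("nodegroups to create:", len(plan))
}
```

In this model the plan is empty for ng-1, which matches the observed "exits 0 without creating anything" behaviour.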
Suggested fix direction
- Fail fast in SetOnlyLocal when a nodegroup in the user's config has an existing stack in ROLLBACK_COMPLETE, and surface an actionable error like: "nodegroup(s) %q have a CloudFormation stack in ROLLBACK_COMPLETE state; delete the failed stack(s) first with 'eksctl delete nodegroup --region=%s --cluster=%s --name=<name>' and then retry".
- Reword the log line at pkg/eks/nodegroup_service.go:309 to describe what is actually checked (shared-SG compatibility), not template freshness.
- Remove ROLLBACK_COMPLETE from nonTransitionalReadyStackStatuses. The helper has only one caller (StackStatusIsNotTransitional → ValidateExistingNodeGroupsForCompatibility), so the blast radius is tiny and the semantic fix is desirable there too. Keep UPDATE_ROLLBACK_COMPLETE. Do not touch allNonDeletedStackStatuses: delete/describe paths legitimately need to see ROLLBACK_COMPLETE stacks.
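The fail-fast check in the first bullet could look roughly like the sketch below. It is an illustration under stated assumptions rather than a patch: checkRolledBack is a hypothetical helper, the stack statuses are passed in as a plain map from nodegroup name to status string, and how SetOnlyLocal would obtain that map is left open.

```go
package main

import (
	"fmt"
	"sort"
)

// checkRolledBack (hypothetical helper) returns an error naming every
// nodegroup from the user's config whose stack is in ROLLBACK_COMPLETE.
// remoteStatus maps nodegroup name to CloudFormation stack status.
func checkRolledBack(region, cluster string, local []string, remoteStatus map[string]string) error {
	var failed []string
	for _, name := range local {
		if remoteStatus[name] == "ROLLBACK_COMPLETE" {
			failed = append(failed, name)
		}
	}
	if len(failed) == 0 {
		return nil
	}
	sort.Strings(failed) // stable, readable error output
	return fmt.Errorf("nodegroup(s) %q have a CloudFormation stack in ROLLBACK_COMPLETE state; "+
		"delete the failed stack(s) first with 'eksctl delete nodegroup --region=%s --cluster=%s --name=<name>' and then retry",
		failed, region, cluster)
}

func main() {
	// Example values only; region, cluster and nodegroup names are made up.
	remote := map[string]string{
		"ng-1": "ROLLBACK_COMPLETE",
		"ng-2": "CREATE_COMPLETE",
	}
	if err := checkRolledBack("us-west-2", "demo", []string{"ng-1", "ng-2"}, remote); err != nil {
		fmt.Println("Error:", err)
	}
}
```

Returning an error here, before the create plan is computed, would turn the silent no-op into an explicit failure with a recovery step.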