[Bug] ROLLBACK_COMPLETE stacks pass compatibility check with misleading log #8712

@costela

Description

What happened

When a create nodegroup run fails and leaves a CloudFormation stack in ROLLBACK_COMPLETE state, a subsequent eksctl create nodegroup with the same config silently does nothing and logs:

[ℹ]  checking security group configuration for all nodegroups
[ℹ]  all nodegroups have up-to-date cloudformation templates

The user sees a reassuring "all good" message and only discovers later that the nodegroup is still broken. The ROLLBACK_COMPLETE stack is unusable (CloudFormation refuses to update it), yet eksctl treats it as a healthy existing nodegroup, excludes it from the create plan, and exits successfully.

Related: #4006 (same symptom, closed by stale-bot without a fix or root-cause analysis).

What was expected

eksctl should either warn/error that one or more nodegroup stacks are in ROLLBACK_COMPLETE and point to a fix, or recreate the rolled-back stack automatically. At a minimum, it must not claim "all nodegroups have up-to-date cloudformation templates" when it has not checked template freshness at all.

Steps to reproduce

  1. eksctl create nodegroup with a config that causes CloudFormation to fail mid-create (e.g. an invalid launch template).
  2. Observe the nodegroup stack reaches ROLLBACK_COMPLETE.
  3. Fix the config issue and re-run eksctl create nodegroup with the same config file.
  4. The rolled-back stack is silently treated as an existing nodegroup and excluded from the create plan. The compatibility check runs, "all nodegroups have up-to-date cloudformation templates" is logged, and the command exits 0 without creating anything.

Environment

Reproduces against current master at commit b86e8bdfb. Not a regression from a specific recent version — the code paths involved have existed for a long time.

Root cause

Three independent issues combine:

1. Misleading log line in ValidateExistingNodeGroupsForCompatibility

pkg/eks/nodegroup_service.go line 309:

logger.Info("all nodegroups have up-to-date cloudformation templates")

The function (pkg/eks/nodegroup_service.go lines 282-322) does not check template freshness. It only checks whether existing nodegroup stacks expose the NodeGroupFeatureSharedSecurityGroup CloudFormation output via isNodeGroupCompatible (pkg/eks/compatibility.go lines 44-96). The log message dates from that narrow shared-security-group compatibility check but reads like a general "your stacks are clean and current" health assertion.

2. ROLLBACK_COMPLETE is grouped with healthy terminal states

pkg/cfn/manager/api.go lines 533-550:

func (*StackCollection) StackStatusIsNotTransitional(s *Stack) bool {
    for _, state := range nonTransitionalReadyStackStatuses() {
        if s.StackStatus == state {
            return true
        }
    }
    return false
}

func nonTransitionalReadyStackStatuses() []types.StackStatus {
    return []types.StackStatus{
        types.StackStatusCreateComplete,
        types.StackStatusUpdateComplete,
        types.StackStatusRollbackComplete,        // <-- problem 
        types.StackStatusUpdateRollbackComplete,
    }
}

ROLLBACK_COMPLETE means the stack's initial CREATE failed and was rolled back — all resources are gone, only the empty stack shell remains, and CloudFormation refuses to update it. Grouping it with CREATE_COMPLETE/UPDATE_COMPLETE causes callers to treat broken stacks as healthy. Note: UPDATE_ROLLBACK_COMPLETE is genuinely healthy (a failed update rolled back to a known-good state), so only ROLLBACK_COMPLETE is wrong here.
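The distinction can be sketched as a readiness check that accepts UPDATE_ROLLBACK_COMPLETE but rejects ROLLBACK_COMPLETE. This is only an illustration, not the eksctl code: the StackStatus string type stands in for the AWS SDK's types.StackStatus, and isReady is a hypothetical name.

```go
package main

import "fmt"

// StackStatus is a plain string stand-in for the AWS SDK's types.StackStatus,
// used here so the sketch has no external dependencies (assumption).
type StackStatus string

const (
	CreateComplete         StackStatus = "CREATE_COMPLETE"
	UpdateComplete         StackStatus = "UPDATE_COMPLETE"
	RollbackComplete       StackStatus = "ROLLBACK_COMPLETE"
	UpdateRollbackComplete StackStatus = "UPDATE_ROLLBACK_COMPLETE"
)

// isReady (hypothetical helper) reports whether a stack has settled in a
// state that still has usable resources. ROLLBACK_COMPLETE is deliberately
// excluded: the initial create failed and only an empty shell remains.
func isReady(s StackStatus) bool {
	switch s {
	case CreateComplete, UpdateComplete, UpdateRollbackComplete:
		return true
	}
	return false
}

func main() {
	for _, s := range []StackStatus{CreateComplete, RollbackComplete, UpdateRollbackComplete} {
		fmt.Printf("%s ready=%t\n", s, isReady(s))
	}
}
```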

3. The create-nodegroup filter excludes rolled-back stacks

Separately — and this is why the command exits 0 without creating anything — NodeGroupFilter.SetOnlyLocal (pkg/ctl/cmdutils/filter/nodegroup_filter.go line 80) calls loadLocalAndRemoteNodegroups, which treats any existing stack (including ROLLBACK_COMPLETE) as a "remote" nodegroup to be excluded from creation. The underlying ListNodeGroupStacks (pkg/cfn/manager/nodegroup.go line 238) only filters out DELETE_COMPLETE/DELETE_FAILED, so ROLLBACK_COMPLETE stacks pass through as "existing".
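One way to picture the missing distinction is a partition step that separates rolled-back stacks from stacks that really count as existing nodegroups. The stack type and partitionStacks function below are hypothetical simplifications, not the actual filter code.

```go
package main

import "fmt"

// stack is a simplified, hypothetical stand-in for a nodegroup stack description.
type stack struct {
	NodeGroup string
	Status    string
}

// partitionStacks (hypothetical) splits nodegroup stacks into those that count
// as existing nodegroups and those left behind by a failed create.
func partitionStacks(stacks []stack) (existing, rolledBack []string) {
	for _, s := range stacks {
		switch s.Status {
		case "DELETE_COMPLETE", "DELETE_FAILED":
			// Skipped, matching the filtering the issue attributes to ListNodeGroupStacks.
		case "ROLLBACK_COMPLETE":
			rolledBack = append(rolledBack, s.NodeGroup)
		default:
			existing = append(existing, s.NodeGroup)
		}
	}
	return existing, rolledBack
}

func main() {
	existing, rolledBack := partitionStacks([]stack{
		{NodeGroup: "ng-1", Status: "CREATE_COMPLETE"},
		{NodeGroup: "ng-2", Status: "ROLLBACK_COMPLETE"},
		{NodeGroup: "ng-3", Status: "DELETE_COMPLETE"},
	})
	fmt.Println("existing:", existing)
	fmt.Println("rolled back:", rolledBack)
}
```

With such a split, the create path could exclude only the "existing" names and handle the rolled-back ones explicitly.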

Suggested fix direction

  1. Fail fast in SetOnlyLocal when a nodegroup in the user's config has an existing stack in ROLLBACK_COMPLETE, and surface an actionable error such as: "nodegroup(s) %q have a CloudFormation stack in ROLLBACK_COMPLETE state; delete the failed stack(s) first with 'eksctl delete nodegroup --region=%s --cluster=%s --name=<name>' and then retry".
  2. Reword the log line at pkg/eks/nodegroup_service.go:309 to describe what is actually checked (shared-SG compatibility), not template freshness.
  3. Remove ROLLBACK_COMPLETE from nonTransitionalReadyStackStatuses. The helper has only one caller (StackStatusIsNotTransitional, used by ValidateExistingNodeGroupsForCompatibility), so the blast radius is tiny and the semantic fix is desirable there too. Keep UPDATE_ROLLBACK_COMPLETE. Do not touch allNonDeletedStackStatuses: delete/describe paths legitimately need to see ROLLBACK_COMPLETE stacks.
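
A minimal sketch of the fail-fast error from item 1, assuming the list of rolled-back nodegroup names has already been collected. The rollbackError function is hypothetical; only the message text comes from the suggestion above.

```go
package main

import (
	"fmt"
	"strings"
)

// rollbackError (hypothetical helper) returns an actionable error when any
// nodegroup in the config has a stack stuck in ROLLBACK_COMPLETE, or nil if
// there are none.
func rollbackError(names []string, region, cluster string) error {
	if len(names) == 0 {
		return nil
	}
	return fmt.Errorf(
		"nodegroup(s) %q have a CloudFormation stack in ROLLBACK_COMPLETE state; "+
			"delete the failed stack(s) first with "+
			"'eksctl delete nodegroup --region=%s --cluster=%s --name=<name>' and then retry",
		strings.Join(names, ", "), region, cluster,
	)
}

func main() {
	if err := rollbackError([]string{"ng-1"}, "us-west-2", "demo"); err != nil {
		fmt.Println("Error:", err)
	}
}
```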
