Skip to content

fix rack uninstall autoscaler update loop#3793

Closed
ntner wants to merge 1 commit into
masterfrom
rack-uninstall-autoscaler-race
Closed

fix rack uninstall autoscaler update loop#3793
ntner wants to merge 1 commit into
masterfrom
rack-uninstall-autoscaler-race

Conversation

@ntner
Copy link
Copy Markdown
Contributor

@ntner ntner commented Apr 19, 2026

What is the feature/update/fix?

Fix: Rack Uninstall No Longer Hangs When Autoscaling Is Enabled

We've fixed a bug where convox rack uninstall would hang indefinitely on racks with Autoscale=Yes (the default) and at least one installed app. The root cause was a livelock between the uninstall flow and the rack's own autoscaler Lambda. The Lambda, scheduled by an EventBridge cron rule inside the rack's CloudFormation stack, fires every 60 seconds and calls UpdateStack. During uninstall, the CLI force-deletes the Instances Auto Scaling Group via the AutoScaling API, and the next Lambda-triggered UpdateStack would try to apply the rack's AutoScalingRollingUpdate policy — which calls SuspendProcesses on an ASG that is now pending-delete. AWS rejects the call, CloudFormation rolls back, the Lambda fires again, and the loop continues until manually intervened.

This release adds three coordinated guards:

  • CLI-side rule disable. SystemUninstall now calls a new disableAutoscalerRule helper that best-effort disables the InstancesAutoscalerEvent EventBridge rule on the rack stack before the cleanAsg loop starts. The helper uses cloudformation.DescribeStackResource to resolve the physical rule name, then calls events.DisableRule. Any error is logged and swallowed — uninstall proceeds regardless. This works against any rack version, including racks that predate the Lambda change.
  • Lambda-side stack-state allowlist. The autoscaler Lambda now skips its UpdateStack call unless the rack stack is in CREATE_COMPLETE or UPDATE_COMPLETE. Every in-progress, failed, rollback, delete, and review state now causes the Lambda to log skipping autoscale: stack <name> is <status> to CloudWatch and return without side-effects. This breaks the livelock after at most one UPDATE_ROLLBACK_COMPLETE cycle on racks that do not yet have the new CLI.
  • Template self-healing. The InstancesAutoscalerEvent CloudFormation resource now sets "State": "ENABLED" explicitly. This matches CloudFormation's existing default, so install behavior is unchanged — but the first convox rack update after upgrading to this version pushes PutRule with State=ENABLED, so any rule that was manually disabled during an aborted uninstall recovery is automatically re-enabled.

How to use it?

This fix is automatically applied when you update your rack. No additional configuration is required.

$ convox rack update

The next time you run convox rack uninstall on a rack with Autoscale=Yes, the CLI silently disables the autoscaler cron rule first, and the uninstall completes cleanly:

$ convox rack uninstall my-rack
disabled autoscaler rule: my-rack-InstancesAutoscalerEvent-ABC123
...

If your rack was previously stuck in an aborted uninstall — for example, the EventBridge rule was manually disabled to break the loop — the next successful convox rack update reconciles the rule back to the enabled state, so the autoscaler resumes normal operation.


Does it have a breaking change?

No breaking changes. The CLI-side change uses public AWS APIs that are additive to the uninstall flow; the CLI works against any rack version. The Lambda change is internal to the rack and self-contained. The CloudFormation change adds an explicit "State": "ENABLED" property that matches the existing default, so install behavior is unchanged and existing racks reconcile to the enabled state on their next update.

One latent behavior note: because the Lambda now requires CREATE_COMPLETE or UPDATE_COMPLETE, racks sitting in UPDATE_ROLLBACK_COMPLETE for extended periods — typically after a failed convox rack params set — will have autoscaling halted until the operator resolves the rollback. This is an intentional safety property of the allowlist; the previous behavior of continuously attempting UpdateStack in this state was the bug.

The uninstall path makes two additional AWS API calls (cloudformation:DescribeStackResource and events:DisableRule). These are covered by the standard operator policy required to run convox rack install; operators with tightly scoped custom IAM policies may need to add these two actions — if they are missing, the CLI logs a skip message and the uninstall proceeds.


Requirements

To receive this fix, you must update to rack version 20260421192651 or newer.

  • Check your rack's version with convox rack
  • Update your rack with convox rack update

@ntner ntner force-pushed the rack-uninstall-autoscaler-race branch from 50288cf to 4c654ae Compare April 19, 2026 01:08
@ntner ntner mentioned this pull request Apr 19, 2026
@ntner ntner closed this in #3792 Apr 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant