fix rack uninstall autoscaler update loop #3793
Closed
ntner wants to merge 1 commit into
50288cf to 4c654ae
What is the feature/update/fix?
Fix: Rack Uninstall No Longer Hangs When Autoscaling Is Enabled
We've fixed a bug where `convox rack uninstall` would hang indefinitely on racks with `Autoscale=Yes` (the default) and at least one installed app. The root cause was a livelock between the uninstall flow and the rack's own autoscaler Lambda. The Lambda, scheduled by an EventBridge cron rule inside the rack's CloudFormation stack, fires every 60 seconds and calls `UpdateStack`. During uninstall, the CLI force-deletes the `Instances` Auto Scaling Group via the AutoScaling API, and the next Lambda-triggered `UpdateStack` would try to apply the rack's `AutoScalingRollingUpdate` policy, which calls `SuspendProcesses` on an ASG that is now pending-delete. AWS rejects the call, CloudFormation rolls back, the Lambda fires again, and the loop continues until someone intervenes manually.
This release adds three coordinated guards:
- `SystemUninstall` now calls a new `disableAutoscalerRule` helper that best-effort disables the `InstancesAutoscalerEvent` EventBridge rule on the rack stack before the `cleanAsg` loop starts. The helper uses `cloudformation.DescribeStackResource` to resolve the physical rule name, then calls `events.DisableRule`. Any error is logged and swallowed; uninstall proceeds regardless. This works against any rack version, including racks that predate the Lambda change.
- The autoscaler Lambda now skips its `UpdateStack` call unless the rack stack is in `CREATE_COMPLETE` or `UPDATE_COMPLETE`. Every in-progress, failed, rollback, delete, and review state now causes the Lambda to log `skipping autoscale: stack <name> is <status>` to CloudWatch and return without side effects. This breaks the livelock after at most one `UPDATE_ROLLBACK_COMPLETE` cycle on racks that do not yet have the new CLI.
- The `InstancesAutoscalerEvent` CloudFormation resource now sets `"State": "ENABLED"` explicitly. This matches CloudFormation's existing default, so install behavior is unchanged, but the first `convox rack update` after upgrading to this version pushes `PutRule` with `State=ENABLED`, so any rule that was manually disabled during an aborted uninstall recovery is automatically re-enabled.
How to use it?
This fix is automatically applied when you update your rack. No additional configuration is required.
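For readers who want to see the shape of the CLI-side guard, here is a minimal sketch of the "resolve, then disable, never fail" ordering described above. It is not the code from this PR: the `resolveFunc` and `disableFunc` callback types, the `errNoRule` sentinel, and the helper's signature are assumptions made for illustration, standing in for the real `cloudformation.DescribeStackResource` and `events.DisableRule` calls.

```go
package main

import "errors"

// resolveFunc and disableFunc are hypothetical stand-ins for the two AWS calls
// named in the description (DescribeStackResource and DisableRule). They are
// not the AWS SDK's real signatures.
type resolveFunc func(stack, logicalID string) (physicalName string, err error)
type disableFunc func(ruleName string) error

// errNoRule is an illustrative sentinel for a stack with no autoscaler rule.
var errNoRule = errors.New("autoscaler rule not found in stack")

// disableAutoscalerRule makes a best-effort attempt to disable the autoscaler
// cron rule. It reports whether the rule was disabled, but never returns an
// error: every failure is logged through logf and swallowed so that the caller
// can carry on with the uninstall.
func disableAutoscalerRule(stack string, resolve resolveFunc, disable disableFunc, logf func(string, ...interface{})) bool {
	name, err := resolve(stack, "InstancesAutoscalerEvent")
	if err == nil && name == "" {
		err = errNoRule
	}
	if err != nil {
		logf("skipping autoscaler rule disable: %v", err)
		return false
	}
	if err := disable(name); err != nil {
		logf("skipping autoscaler rule disable: %v", err)
		return false
	}
	return true
}
```

Passing the two calls in as functions keeps the sketch free of SDK dependencies. In a real CLI the callbacks would presumably wrap the SDK clients, and the boolean result would be informational only, since the uninstall continues either way.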
The next time you run `convox rack uninstall` on a rack with `Autoscale=Yes`, the CLI silently disables the autoscaler cron rule first, and the uninstall completes cleanly.
If your rack was previously stuck in an aborted uninstall (for example, the EventBridge rule was manually disabled to break the loop), the next successful `convox rack update` reconciles the rule back to the enabled state, so the autoscaler resumes normal operation.
Does it have a breaking change?
No breaking changes. The CLI-side change uses public AWS APIs that are additive to the uninstall flow; the CLI works against any rack version. The Lambda change is internal to the rack and self-contained. The CloudFormation change adds an explicit `"State": "ENABLED"` property that matches the existing default, so install behavior is unchanged and existing racks reconcile to the enabled state on their next update.
One latent behavior note: because the Lambda now requires `CREATE_COMPLETE` or `UPDATE_COMPLETE`, racks sitting in `UPDATE_ROLLBACK_COMPLETE` for extended periods, typically after a failed `convox rack params set`, will have autoscaling halted until the operator resolves the rollback. This is an intentional safety property of the allowlist; the previous behavior of continuously attempting `UpdateStack` in this state was the bug.
The uninstall path makes two additional AWS API calls (`cloudformation:DescribeStackResource` and `events:DisableRule`). These are covered by the standard operator policy required to run `convox rack install`; operators with tightly scoped custom IAM policies may need to add these two actions. If they are missing, the CLI logs a skip message and the uninstall proceeds.
Requirements
To receive this fix, you must update to rack version `20260421192651` or newer. Run `convox rack` to check the current version and `convox rack update` to update.
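The stack-status allowlist described in the second guard can be sketched in a few lines. This is an illustration under stated assumptions rather than the Lambda's actual source: the function names `shouldAutoscale` and `skipMessage` are hypothetical, and only the two allowed statuses and the log format are taken from the description above.

```go
package main

import "fmt"

// shouldAutoscale reports whether the autoscaler may call UpdateStack.
// Only the two stable, successful statuses are allowed; every in-progress,
// failed, rollback, delete, and review status falls through to false.
// (Hypothetical helper name; the allowlist itself comes from the description.)
func shouldAutoscale(status string) bool {
	switch status {
	case "CREATE_COMPLETE", "UPDATE_COMPLETE":
		return true
	default:
		return false
	}
}

// skipMessage formats the log line quoted in the description for a stack
// that is not in an allowed status.
func skipMessage(stack, status string) string {
	return fmt.Sprintf("skipping autoscale: stack %s is %s", stack, status)
}
```

An allowlist is used here instead of a denylist so that any status not explicitly named, including ones added by AWS later, defaults to skipping. That is also why a rack left in `UPDATE_ROLLBACK_COMPLETE` stops autoscaling until the rollback is resolved.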