fix: stop Knative apps privately + keep UI state accurate after stop/start#7343
fix: stop Knative apps privately + keep UI state accurate after stop/start#7343AdilFayyaz merged 2 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates Flyte’s “Stop App” / “Start App” behavior for Knative-backed apps to reliably (a) scale to zero and (b) make public ingress inaccessible, while also improving UI correctness by avoiding caching during stop→start transitional windows.
Changes:
- Replace the prior (incorrect)
autoscaling.knative.dev/max-scale="0"“stop” mechanism with Knative-native stop semantics: mark Servicesnetworking.knative.dev/visibility=cluster-local, label themflyte.org/app-stopped=true, force scale-to-zero via min/initial scale annotations, and delete the latest ready Revision to terminate pods promptly. - Ensure “Start” (Deploy) clears stop/private labels and does not skip updates when the spec SHA is unchanged but the Service is currently stopped.
- Add
allow-zero-initial-scale="true"to devbox Knative autoscaler configuration (rendered manifests, kustomize overlays, and manual setup Makefile), and add control-plane Get caching that skips caching transitional responses.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| docker/devbox-bundled/manifests/dev.yaml | Enables Knative autoscaler allow-zero-initial-scale in rendered dev manifests. |
| docker/devbox-bundled/manifests/complete.yaml | Enables Knative autoscaler allow-zero-initial-scale in rendered complete manifests. |
| docker/devbox-bundled/Makefile | Patches config-autoscaler during manual Knative setup to allow zero initial scale. |
| docker/devbox-bundled/kustomize/dev/kustomization.yaml | Adds kustomize patch to set allow-zero-initial-scale in dev overlay. |
| docker/devbox-bundled/kustomize/complete/kustomization.yaml | Adds kustomize patch to set allow-zero-initial-scale in complete overlay. |
| app/service/app_service.go | Avoids caching Get responses during stop→start transitional windows via isTransitionalState. |
| app/service/app_service_test.go | Adds tests to validate caching behavior for transitional vs stable stopped states. |
| app/internal/k8s/app_client.go | Implements new stop/start semantics (labels + cluster-local visibility + scale-to-zero + delete latest ready Revision) and updates STOPPED status mapping. |
| app/internal/k8s/app_client_test.go | Updates/adds unit tests for the new stop/start behavior and STOPPED mapping. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // isTransitionalState returns true when the app has a non-stopped desired state | ||
| // but is currently reporting a stopped status. This happens in the window between | ||
| // a Start action and K8s actually bringing the pod up; caching in that window | ||
| // would lock the UI into showing "stopped" for the full TTL on every poll cycle. | ||
| func isTransitionalState(app *flyteapp.App) bool { | ||
| if app == nil { | ||
| return false | ||
| } | ||
| if app.GetSpec().GetDesiredState() == flyteapp.Spec_DESIRED_STATE_STOPPED { | ||
| return false | ||
| } | ||
| for _, cond := range app.GetStatus().GetConditions() { | ||
| if cond.GetDeploymentStatus() == flyteapp.Status_DEPLOYMENT_STATUS_STOPPED { | ||
| return true |
| // Updating the KService template alone is not sufficient — it does not immediately terminate existing pods. | ||
| // for the autoscaler and does not kill running pods; they only scale down after | ||
| // the stable window (~60s) with no traffic. |
🐳 Docker CI Image BuiltThe CI Docker image has been built and pushed for this PR! Image: This image will be automatically used by CI workflows in this PR. To test locally: make gen DOCKER_CI_IMAGE=ghcr.io/flyteorg/flyte/ci:pr-7343 |
2784775 to
6b42349
Compare
pingsutw
left a comment
There was a problem hiding this comment.
Looks good, one nit, should we override the desired state (if labelAppStopped is set to true) here before returning the spec?
flyte/app/internal/k8s/app_client.go
Line 449 in 6b42349
|
@pingsutw yes we should override the desired state. Although, the UI did seem to toggle the button correctly. It should be reading from the spec but probably its reading the status from the watch events. |
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
| patch := []byte(fmt.Sprintf( | ||
| `{"metadata":{"labels":{"%s":"true","%s":"%s"}},"spec":{"template":{"metadata":{"annotations":{"autoscaling.knative.dev/min-scale":"%s","autoscaling.knative.dev/initial-scale":"%s"}}}}}`, | ||
| labelAppStopped, | ||
| labelKnativeVisibility, | ||
| visibilityClusterLocal, | ||
| scaleZero, | ||
| scaleZero, | ||
| )) |
| if err := c.k8sClient.Get(ctx, client.ObjectKey{Name: name, Namespace: ns}, current); err == nil { | ||
| if revName := current.Status.LatestReadyRevisionName; revName != "" { | ||
| rev := &servingv1.Revision{} | ||
| rev.Name = revName | ||
| rev.Namespace = ns | ||
| if delErr := c.k8sClient.Delete(ctx, rev); delErr != nil && !k8serrors.IsNotFound(delErr) { | ||
| logger.Warnf(ctx, "Failed to delete Revision %s/%s to stop: %v", ns, revName, delErr) | ||
| } |
| // Updating the KService template alone is not sufficient — it does not immediately terminate existing pods. | ||
| // for the autoscaler and does not kill running pods; they only scale down after | ||
| // the stable window (~60s) with no traffic. |
Why are the changes needed?
Users expect “Stop App” to (1) scale the app down to zero and (2) make the public ingress inaccessible. The prior approach relied on
autoscaling.knative.dev/max-scale: "0"as a stop signal, but in Knative that value represents “unlimited” upper bound, so it didn’t reliably enforce the intended stop semantics. Separately, the UI polls/Get calls can briefly show stale “stopped” state during stop→start transitions because the control plane can cache transitional responses while the data plane is still converging.What changes were proposed in this pull request?
networking.knative.dev/visibility=cluster-localso it is not published to the external gateway (public ingress becomes inaccessible).flyte.org/app-stopped=trueas the control-plane source of truth for STOPPED state.autoscaling.knative.dev/min-scale="0"andautoscaling.knative.dev/initial-scale="0"to converge cleanly to zero pods.flyte.org/app-stopped(not max-scale).Devbox / cluster configuration
config-autoscaler.allow-zero-initial-scale="true"in devbox-bundled kustomize overlays and rendered manifests.docker/devbox-bundled/Makefilesetup-knative to also patch config-autoscaler so the “manual install” path matches the bundled manifests behavior.UI correctness / control-plane behavior
How was this patch tested?
transitions to ACTIVE without being stuck in STOPPED due to caching.
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Stack
If you do use
git townto manage PR Stacks, the stack relevant to this PRwill show below. Otherwise, you can ignore this section.
Docs link