fix: stop Knative apps privately + keep UI state accurate after stop/start#7343

Merged
AdilFayyaz merged 2 commits into main from adil/apps-fix-stop-knative
May 6, 2026

Conversation

@AdilFayyaz
Contributor

Why are the changes needed?

Users expect “Stop App” to (1) scale the app down to zero and (2) make the public ingress inaccessible. The prior approach used autoscaling.knative.dev/max-scale: "0" as a stop signal, but in Knative that value means an unlimited upper bound, so it did not reliably enforce the intended stop semantics. Separately, UI polling via Get calls can briefly show a stale “stopped” state during stop→start transitions, because the control plane can cache transitional responses while the data plane is still converging.

What changes were proposed in this pull request?

  • Replaced the incorrect max-scale=0 stop semantics with Knative-native behavior:
    • Stop() labels the Knative Service networking.knative.dev/visibility=cluster-local so it is not published to the external gateway (public ingress becomes inaccessible).
    • Stop() marks the Service with flyte.org/app-stopped=true as the control-plane source of truth for STOPPED state.
    • Stop() sets autoscaling.knative.dev/min-scale="0" and autoscaling.knative.dev/initial-scale="0" to converge cleanly to zero pods.
    • Stop() deletes the latest ready Revision so existing pods terminate immediately (no waiting for the stable window).
    • Deploy() (“Start”) clears the stopped/private labels so the service becomes externally routable again.
    • Status mapping now treats STOPPED based on flyte.org/app-stopped (not max-scale).
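
The stop/start label handling above can be sketched in Go roughly as follows. This is a minimal illustration, not the PR's implementation: it builds the patches with encoding/json rather than the format string used in the diff, and it assumes the patches are applied as JSON merge patches (RFC 7396), where a null value deletes a key. The helpers stopPatch, startPatch, and isStopped are hypothetical names; the label and annotation keys are the ones listed in the description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Label and annotation keys as named in the PR description.
const (
	labelAppStopped        = "flyte.org/app-stopped"
	labelKnativeVisibility = "networking.knative.dev/visibility"
	visibilityClusterLocal = "cluster-local"
	annotationMinScale     = "autoscaling.knative.dev/min-scale"
	annotationInitialScale = "autoscaling.knative.dev/initial-scale"
)

// stopPatch (hypothetical helper) builds a merge patch that marks the Service
// stopped, makes it cluster-local, and pins min/initial scale to zero.
func stopPatch() ([]byte, error) {
	return json.Marshal(map[string]any{
		"metadata": map[string]any{
			"labels": map[string]any{
				labelAppStopped:        "true",
				labelKnativeVisibility: visibilityClusterLocal,
			},
		},
		"spec": map[string]any{
			"template": map[string]any{
				"metadata": map[string]any{
					"annotations": map[string]any{
						annotationMinScale:     "0",
						annotationInitialScale: "0",
					},
				},
			},
		},
	})
}

// startPatch (hypothetical helper) clears both labels. This assumes a JSON
// merge patch, in which a null value removes the key.
func startPatch() ([]byte, error) {
	return json.Marshal(map[string]any{
		"metadata": map[string]any{
			"labels": map[string]any{
				labelAppStopped:        nil,
				labelKnativeVisibility: nil,
			},
		},
	})
}

// isStopped (hypothetical helper) derives STOPPED from the stopped label only.
func isStopped(labels map[string]string) bool {
	return labels[labelAppStopped] == "true"
}

func main() {
	stop, _ := stopPatch()
	start, _ := startPatch()
	fmt.Println(string(stop))
	fmt.Println(string(start))
	fmt.Println(isStopped(map[string]string{labelAppStopped: "true"}))
}
```

Building the patch from maps avoids hand-escaping JSON inside a format string, at the cost of a slightly longer function.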

Devbox / cluster configuration

  • Enabled config-autoscaler.allow-zero-initial-scale="true" in devbox-bundled kustomize overlays and rendered manifests.
  • Updated docker/devbox-bundled/Makefile setup-knative to also patch config-autoscaler so the “manual install” path matches the bundled manifests behavior.

UI correctness / control-plane behavior

  • Added a small TTL cache in the control-plane AppService for Get responses to reduce cross-plane chatter, while explicitly avoiding caching “transitional” stop/start windows (desired state running, but status still stopped) so the UI doesn’t get stuck showing STOPPED after clicking Start.
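
A minimal sketch of the caching rule described above, assuming a simplified App type in place of flyteapp.App. The getCache type, its methods, and the field names are hypothetical and only illustrate the idea: stable responses are cached for the TTL, while a response whose desired state is running but whose status is still stopped is returned without being stored.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// App is a simplified, hypothetical stand-in for flyteapp.App.
type App struct {
	DesiredStopped bool // desired state from the spec
	StatusStopped  bool // a status condition currently reports STOPPED
}

// isTransitional reports the stop→start window: desired running, status stopped.
func isTransitional(a *App) bool {
	return a != nil && !a.DesiredStopped && a.StatusStopped
}

type entry struct {
	app     *App
	expires time.Time
}

// getCache is a small TTL cache keyed by app identifier.
type getCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	items map[string]entry
}

func newGetCache(ttl time.Duration) *getCache {
	return &getCache{ttl: ttl, items: map[string]entry{}}
}

// Get returns a fresh cached app if present; otherwise it calls fetch and
// stores the result unless the app is in a transitional state.
func (c *getCache) Get(key string, fetch func() (*App, error)) (*App, error) {
	c.mu.Lock()
	e, ok := c.items[key]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.app, nil
	}
	app, err := fetch()
	if err != nil {
		return nil, err
	}
	if isTransitional(app) {
		return app, nil // never cached, so the next poll sees fresh state
	}
	c.mu.Lock()
	c.items[key] = entry{app: app, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return app, nil
}

func main() {
	cache := newGetCache(time.Second)
	calls := 0
	fetch := func() (*App, error) {
		calls++
		return &App{DesiredStopped: false, StatusStopped: true}, nil
	}
	cache.Get("my-app", fetch)
	cache.Get("my-app", fetch)
	fmt.Println("fetches for transitional app:", calls)
}
```

Skipping the store rather than shortening the TTL keeps the rule simple: every poll during the transition reaches the data plane, and caching resumes once the status converges.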

How was this patch tested?

  • Unit tests:
    • go test ./app/internal/k8s
    • go test ./app/service
  • Manual verification (devbox / cluster):
    • Create app → confirm public URL reachable.
    • Stop app → confirm pods scale to 0 and public URL is not reachable (service is cluster-local).
    • Start app → confirm visibility label cleared, service becomes publicly reachable again, and UI transitions to ACTIVE without being stuck in STOPPED due to caching.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

@AdilFayyaz AdilFayyaz self-assigned this May 5, 2026
@AdilFayyaz AdilFayyaz added the bug Something isn't working label May 5, 2026
Copilot AI review requested due to automatic review settings May 5, 2026 21:53
@github-actions github-actions Bot added the flyte2 label May 5, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR updates Flyte’s “Stop App” / “Start App” behavior for Knative-backed apps to reliably (a) scale to zero and (b) make public ingress inaccessible, while also improving UI correctness by avoiding caching during stop→start transitional windows.

Changes:

  • Replace the prior (incorrect) autoscaling.knative.dev/max-scale="0" “stop” mechanism with Knative-native stop semantics: mark Services networking.knative.dev/visibility=cluster-local, label them flyte.org/app-stopped=true, force scale-to-zero via min/initial scale annotations, and delete the latest ready Revision to terminate pods promptly.
  • Ensure “Start” (Deploy) clears stop/private labels and does not skip updates when the spec SHA is unchanged but the Service is currently stopped.
  • Add allow-zero-initial-scale="true" to devbox Knative autoscaler configuration (rendered manifests, kustomize overlays, and manual setup Makefile), and add control-plane Get caching that skips caching transitional responses.
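
The second point, not skipping the update when the spec SHA is unchanged but the Service is stopped, might look roughly like the check below. This is a sketch under stated assumptions: shouldUpdate and its parameters are hypothetical names, and the PR does not show how the real code stores or compares the SHA.

```go
package main

import "fmt"

const labelAppStopped = "flyte.org/app-stopped"

// shouldUpdate (hypothetical helper) reports whether Deploy needs to write
// the Service. A stopped Service is always updated so its stop/private labels
// can be cleared, even when the spec SHA has not changed.
func shouldUpdate(deployedSHA, desiredSHA string, labels map[string]string) bool {
	if labels[labelAppStopped] == "true" {
		return true
	}
	return deployedSHA != desiredSHA
}

func main() {
	stopped := map[string]string{labelAppStopped: "true"}
	fmt.Println(shouldUpdate("abc", "abc", nil))     // unchanged and running
	fmt.Println(shouldUpdate("abc", "abc", stopped)) // unchanged but stopped
	fmt.Println(shouldUpdate("abc", "def", nil))     // spec changed
}
```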

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| docker/devbox-bundled/manifests/dev.yaml | Enables Knative autoscaler allow-zero-initial-scale in rendered dev manifests. |
| docker/devbox-bundled/manifests/complete.yaml | Enables Knative autoscaler allow-zero-initial-scale in rendered complete manifests. |
| docker/devbox-bundled/Makefile | Patches config-autoscaler during manual Knative setup to allow zero initial scale. |
| docker/devbox-bundled/kustomize/dev/kustomization.yaml | Adds kustomize patch to set allow-zero-initial-scale in dev overlay. |
| docker/devbox-bundled/kustomize/complete/kustomization.yaml | Adds kustomize patch to set allow-zero-initial-scale in complete overlay. |
| app/service/app_service.go | Avoids caching Get responses during stop→start transitional windows via isTransitionalState. |
| app/service/app_service_test.go | Adds tests to validate caching behavior for transitional vs stable stopped states. |
| app/internal/k8s/app_client.go | Implements new stop/start semantics (labels + cluster-local visibility + scale-to-zero + delete latest ready Revision) and updates STOPPED status mapping. |
| app/internal/k8s/app_client_test.go | Updates/adds unit tests for the new stop/start behavior and STOPPED mapping. |

Comment on lines +78 to +91
// isTransitionalState returns true when the app has a non-stopped desired state
// but is currently reporting a stopped status. This happens in the window between
// a Start action and K8s actually bringing the pod up; caching in that window
// would lock the UI into showing "stopped" for the full TTL on every poll cycle.
func isTransitionalState(app *flyteapp.App) bool {
	if app == nil {
		return false
	}
	if app.GetSpec().GetDesiredState() == flyteapp.Spec_DESIRED_STATE_STOPPED {
		return false
	}
	for _, cond := range app.GetStatus().GetConditions() {
		if cond.GetDeploymentStatus() == flyteapp.Status_DEPLOYMENT_STATUS_STOPPED {
			return true
Comment on lines +192 to +194
// Updating the KService template alone is not sufficient — it does not immediately terminate existing pods.
// for the autoscaler and does not kill running pods; they only scale down after
// the stable window (~60s) with no traffic.
@github-actions

github-actions Bot commented May 5, 2026

🐳 Docker CI Image Built

The CI Docker image has been built and pushed for this PR!

Image: ghcr.io/flyteorg/flyte/ci:pr-7343

This image will be automatically used by CI workflows in this PR.

To test locally:

make gen DOCKER_CI_IMAGE=ghcr.io/flyteorg/flyte/ci:pr-7343

Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
@AdilFayyaz AdilFayyaz force-pushed the adil/apps-fix-stop-knative branch from 2784775 to 6b42349 Compare May 5, 2026 22:36
@AdilFayyaz AdilFayyaz requested review from pingsutw May 5, 2026 22:40
Member

@pingsutw pingsutw left a comment

Looks good. One nit: should we override the desired state (if labelAppStopped is set to true) here before returning the spec?

spec := &flyteapp.Spec{}

@AdilFayyaz
Contributor Author

@pingsutw Yes, we should override the desired state, although the UI did seem to toggle the button correctly. It should be reading from the spec, but it's probably reading the status from the watch events.

Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 6, 2026 20:26
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comment on lines +216 to +223
patch := []byte(fmt.Sprintf(
	`{"metadata":{"labels":{"%s":"true","%s":"%s"}},"spec":{"template":{"metadata":{"annotations":{"autoscaling.knative.dev/min-scale":"%s","autoscaling.knative.dev/initial-scale":"%s"}}}}}`,
	labelAppStopped,
	labelKnativeVisibility,
	visibilityClusterLocal,
	scaleZero,
	scaleZero,
))
Comment on lines +238 to +245
if err := c.k8sClient.Get(ctx, client.ObjectKey{Name: name, Namespace: ns}, current); err == nil {
	if revName := current.Status.LatestReadyRevisionName; revName != "" {
		rev := &servingv1.Revision{}
		rev.Name = revName
		rev.Namespace = ns
		if delErr := c.k8sClient.Delete(ctx, rev); delErr != nil && !k8serrors.IsNotFound(delErr) {
			logger.Warnf(ctx, "Failed to delete Revision %s/%s to stop: %v", ns, revName, delErr)
		}
@AdilFayyaz AdilFayyaz merged commit 190976b into main May 6, 2026
25 checks passed
@AdilFayyaz AdilFayyaz deleted the adil/apps-fix-stop-knative branch May 6, 2026 22:54
Labels

bug Something isn't working flyte2

3 participants