fix(apps): add startupProbe to homepage, headlamp, actual-budget #1636

Open
devantler wants to merge 4 commits into main from claude/repo-assist-add-startupprobes-only

@devantler devantler commented May 28, 2026

🤖 Generated by the Daily AI Assistant

Background

Prod-cluster warning-event investigation surfaced 14 Unhealthy events in a recent window, all of which mapped to cold-start probe failures on container creation — not shutdown races, not real downtime. Each affected chart hardcodes livenessProbe/readinessProbe with Kubernetes defaults (timeoutSeconds: 1, periodSeconds: 10, failureThreshold: 3) and ships no startupProbe. Containers take ~10–13 s to begin serving, so each pod creation logs 1–3 Unhealthy warnings and leaves only ~17 s of headroom before the liveness restart would fire.
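The ~17 s headroom figure follows from the chart defaults quoted above. A minimal sketch of the arithmetic, assuming liveness probing starts at container creation and ignoring kubelet jitter (the function name is illustrative, not part of any chart or tool):

```python
def liveness_headroom_s(period_s, failure_threshold, cold_start_s):
    """Seconds left before a liveness restart for a container that needs
    cold_start_s to begin serving (simplified: no jitter, no timeouts)."""
    restart_after_s = period_s * failure_threshold  # 10 s * 3 = 30 s
    return restart_after_s - cold_start_s


# Chart defaults (periodSeconds: 10, failureThreshold: 3) and the slowest
# observed cold start (~13 s).
print(liveness_headroom_s(period_s=10, failure_threshold=3, cold_start_s=13))  # 17
```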

Observed (prod, 14:59–15:44 UTC 2026-05-28):

  • homepage: 3 Unhealthy events per pod, 5 pods affected during one rollout
  • actual-budget: 2 events at cold start
  • headlamp: 1 event per KEDA scale-from-zero (recurring)

Fix

Add a startupProbe to each Deployment via the existing postRenderer block:

startupProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 20    # past observed ~13s cold start → first probe lands on serving container
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 12       # 60s grace beyond initialDelay = 80s total startup budget

Total startup budget before liveness takes over: 20 s + 12 × 5 s = 80 s. Liveness/readiness untouched — once startup succeeds the tight chart defaults are fine on a warm pod. None of the three charts expose startupProbe as a values knob, so all three use a strategic-merge/JSON patch on the rendered Deployment.
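As a quick check of the budget above, here is a small sketch that computes it from the probe fields. The helper name is hypothetical, and the model is the same simplification used in this description (delay plus period times threshold):

```python
def startup_budget_s(initial_delay_s, period_s, failure_threshold):
    """Approximate time a container has to pass its startupProbe
    before the kubelet restarts it."""
    return initial_delay_s + period_s * failure_threshold


# Current revision: initialDelaySeconds: 20, periodSeconds: 5, failureThreshold: 12
print(startup_budget_s(20, 5, 12))  # 80

# First revision: no initialDelaySeconds, periodSeconds: 2, failureThreshold: 30
print(startup_budget_s(0, 2, 30))  # 60
```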

Why initialDelaySeconds, not faster polling

A first revision used periodSeconds: 2, failureThreshold: 30 (no initialDelaySeconds). That failed the merge-queue check-event-warnings action (commit 19fb2b3 fixes it). kubelet emits the same reason=Unhealthy Warning event regardless of whether startup, liveness, or readiness failed — so a startupProbe alone doesn't silence the noise. And periodSeconds: 2 actually made it worse: a ~13 s cold start produced 5–7 failed-probe events per pod (vs. 1–3 with the chart-default periodSeconds: 10). The action records a marker post-reconcile and fails on any Warning with lastTimestamp within a 90 s settle window; this PR's rollouts now created cold-start failures inside that window.

The current revision sets initialDelaySeconds: 20 so the first probe lands on an already-serving container — zero failed probes on a normal rollout, and the chart-default liveness/readiness only start after startup succeeds (i.e. once HTTP is healthy).
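The effect of the period and the initial delay on event noise can be sketched with a simplified model: one probe every `period_s` seconds starting at `initial_delay_s`, and every probe that fires before the container serves counts as a failure. This ignores kubelet start jitter and `timeoutSeconds`, so it only approximates the observed ranges; the function name is an assumption for illustration.

```python
def failed_probes_during_cold_start(cold_start_s, initial_delay_s, period_s):
    """Count probes that fire before the container starts serving."""
    failures = 0
    t = initial_delay_s
    while t < cold_start_s:
        failures += 1
        t += period_s
    return failures


# ~13 s cold start under the three configurations discussed above.
print(failed_probes_during_cold_start(13, 0, 2))   # 7  (first revision)
print(failed_probes_during_cold_start(13, 0, 10))  # 2  (chart-default period)
print(failed_probes_during_cold_start(13, 20, 5))  # 0  (current revision)
```

The results fall inside the 5–7 and 1–3 ranges reported above, and the current revision produces no failures as long as the cold start stays under the 20 s delay.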

Trade-offs

  • Pods are NotReady for ~20 s instead of ~13 s on cold start; rollouts take ~7 s longer per pod. Acceptable for a clean steady-state CI signal.
  • For headlamp / actual-budget the patch lands on containers[0] (the main app container); the headlamp-plugin sidecar has no probes and is unaffected.

Validation

  • ksail workload validate: 256 files validated (local)
  • ksail --config ksail.prod.yaml workload validate: 256 files validated
  • kubectl kustomize k8s/providers/{docker,hetzner}/apps/ both succeed; all three startupProbe blocks resolve into the rendered HelmRelease post-renderer.

Each of these charts hardcodes liveness/readiness with Kubernetes
defaults (timeoutSeconds: 1, periodSeconds: 10, failureThreshold: 3)
and does not template a startupProbe. On a cold start the container
takes ~10–13s to begin serving HTTP, so each pod creation logs 1–3
'Unhealthy' Warning events and leaves only ~17s of headroom before
the liveness restart fires.

Add a strategic-merge startupProbe via the existing postRenderer block
(60s startup window, 2s period/timeout) so liveness/readiness are
gated until the container is actually serving. No change to
liveness/readiness — once startup succeeds the tight defaults are fine
on a warm pod.

Observed (prod, 14:59–15:44 UTC on 2026-05-28):
  - homepage:       3 Unhealthy events per rollout pod (5 pods affected)
  - actual-budget:  2 events at cold start
  - headlamp:       1 event per KEDA scale-from-zero

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR reduces noisy cold-start Unhealthy events for selected applications by adding startupProbe patches through existing Flux HelmRelease post-renderers.

Changes:

  • Adds a 60s HTTP startup window for homepage.
  • Adds the same startup gating for KEDA-scaled headlamp.
  • Adds startup gating for actual-budget while leaving existing liveness/readiness behavior unchanged.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| k8s/bases/apps/homepage/helm-release.yaml | Adds a JSON6902 post-renderer patch for the main container startup probe. |
| k8s/bases/apps/headlamp/helm-release.yaml | Adds a startup probe to the main Headlamp container without affecting the plugin sidecar. |
| k8s/bases/apps/actual-budget/helm-release.yaml | Adds a startup probe to the rendered Actual Budget Deployment. |

@devantler devantler marked this pull request as ready for review May 28, 2026 17:24
@botantler botantler Bot enabled auto-merge May 28, 2026 17:24
@botantler botantler Bot added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
@devantler devantler added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
The original probe (periodSeconds: 2, failureThreshold: 30, no
initialDelay) silenced cold-start liveness/readiness *restarts* but
not the underlying "Unhealthy" Warning events — kubelet emits the
same event for startup, liveness, and readiness probe failures, and
the 2s period generates 5-7 failures during the ~13s cold start
instead of the chart-default 1-3 (periodSeconds: 10).

Merge-queue deploy of #1636 failed the check-event-warnings action,
which records a marker post-reconcile and fails if any Warning event
has lastTimestamp within a 90s settle window. The rollout these
patches force created new pods during that window; their startup
probes fired every 2s during cold start; their events landed past
the marker.

Set initialDelaySeconds: 20 (past the observed ~13s cold start) and
periodSeconds: 5 so the first probe lands on a serving container.
Zero failure events on a normal rollout; failureThreshold: 12 leaves
60s of grace if a container is unusually slow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 21:14
@devantler

🤖 Generated by the Daily AI Assistant

Merge-queue failure analysis & fix

The merge-queue's "Deploy to Prod" job failed at the "🔎 Require no new event warnings" step, with 5 distinct Unhealthy events fired post-marker:

2026-05-28T20:57:04Z  [homepage] Pod/homepage-855f8f9786-wgbp5  Unhealthy (x1): Startup probe failed: ... connection refused
2026-05-28T20:57:14Z  [homepage] Pod/homepage-855f8f9786-wgbp5  Unhealthy (x4): Startup probe failed: ... context deadline exceeded
2026-05-28T20:57:19Z  [homepage] Pod/homepage-855f8f9786-lf7px  Unhealthy (x1): Startup probe failed: ... connection refused
2026-05-28T20:57:26Z  [actual-budget] Pod/actual-budget-actualbudget-7d495bbb65-mkw8x  Unhealthy (x7): Startup probe failed: ...
2026-05-28T20:57:31Z  [homepage] Pod/homepage-855f8f9786-lf7px  Unhealthy (x5): Startup probe failed: ...

Root cause — own goal:

  1. Kubelet emits the same Unhealthy Warning event for startup, liveness, and readiness probe failures. Adding a startupProbe doesn't silence those events; the reason stays Unhealthy and only the message prefix changes (Startup probe failed instead of Liveness or Readiness probe failed).
  2. The original probe (periodSeconds: 2, failureThreshold: 30) polled 5× faster than the chart-default periodSeconds: 10 liveness/readiness it was meant to replace, so a 13 s cold start now produces 5–7 failures (x4, x5, x7 above) instead of the original 1–3.
  3. This PR forces a rollout of homepage/headlamp/actual-budget (postRenderer changes). The new pods cold-start during the action's 90 s settle window, so their failures land past the marker and fail the check.
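To make point 3 concrete, here is a hedged sketch of a settle-window check. It is not the actual check-event-warnings implementation, which is not shown in this PR. The `type` and `lastTimestamp` keys mirror Kubernetes Event fields, the timestamps are the five events listed above, and the marker time is a hypothetical value chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    """Parse an RFC 3339 UTC timestamp such as 2026-05-28T20:57:04Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def warnings_in_settle_window(events, marker, settle_s=90):
    """Return Warning events whose lastTimestamp falls after the marker
    and within the settle window."""
    window_end = marker + timedelta(seconds=settle_s)
    return [
        e for e in events
        if e["type"] == "Warning" and marker < parse_ts(e["lastTimestamp"]) <= window_end
    ]


events = [
    {"type": "Warning", "reason": "Unhealthy", "lastTimestamp": ts}
    for ts in (
        "2026-05-28T20:57:04Z",
        "2026-05-28T20:57:14Z",
        "2026-05-28T20:57:19Z",
        "2026-05-28T20:57:26Z",
        "2026-05-28T20:57:31Z",
    )
]

# Hypothetical marker: the real post-reconcile marker time is not in the log.
marker = parse_ts("2026-05-28T20:56:30Z")
print(len(warnings_in_settle_window(events, marker)))  # 5
```

Under this model any probe failure from a pod that cold-starts inside the window fails the check, regardless of which probe type produced it.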

Fix (19fb2b3):

startupProbe:
  httpGet: { path: /, port: http }
  initialDelaySeconds: 20   # past the observed ~13s cold start
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 12      # 60s grace beyond initialDelay = 80s total

The first probe lands on a serving container, so the probe never fails on a normal rollout — zero events in the settle window. failureThreshold: 12 still leaves a generous 60 s of post-delay grace for unusually slow starts.

Trade-off: pods are NotReady for ~20 s instead of ~13 s on cold start (rollouts take 7 s longer per pod), since initialDelaySeconds blocks readiness from succeeding earlier. Acceptable cost for clean steady-state CI.

ksail workload validate and ksail --config ksail.prod.yaml workload validate both green on the new commit.

@botantler botantler Bot enabled auto-merge May 28, 2026 21:14

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread k8s/bases/apps/homepage/helm-release.yaml
Comment thread k8s/bases/apps/headlamp/helm-release.yaml
Comment thread k8s/bases/apps/actual-budget/helm-release.yaml
# Conflicts:
#	k8s/bases/apps/homepage/helm-release.yaml
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
@devantler devantler added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler

🤖 Generated by the Daily AI Assistant

Merge-queue failure analysis: prod cluster degraded, NOT a PR regression

PR #1636 only modifies three app HelmRelease postRenderers (homepage / headlamp / actual-budget) — confirmed via git diff main...HEAD --stat. None of the namespaces or workloads that triggered the check-event-warnings failure are touched by this branch.

The three flagged events:

BackOff (x1861) — cert-manager/cert-manager-cainjector-585b7d746-kgp62
BackOff (x1023) — kubevirt/virt-handler-lf2dd
BackOff (x1620) — kube-system/spire-agent-nx4rw

BackOff event counts of x1023 and higher on 5-day-old pods show the gate correctly catching pre-existing crash loops. A read-only investigation against admin@prod found the following.

What's actually broken

Root pattern: pods crash-loop trying to reach the kube-apiserver ClusterIP (10.96.0.1:443), observed across at least 13 critical workloads (cert-manager-cainjector, virt-handler, all 6 spire-agents, kustomize-controller, flux-operator, fleet, keda-add-ons-http-external-scaler, kube-state-metrics, trust-manager, origin-ca-issuer, longhorn csi-provisioner). Sample logs:

spire-agent:   create attestation client: failed to dial dns:///spire-server:8081:
               context deadline exceeded ... dial tcp 10.96.193.18:8081: i/o timeout
cainjector:    failed to get server groups: ... dial tcp 10.96.0.1:443: i/o timeout
virt-handler:  failed to list *v1.ConfigMap: ... dial tcp 10.96.0.1:443: i/o timeout
spire-server:  notifier(k8sbundle): unable to update: ... dial tcp 10.96.0.1:443: i/o timeout

Smoking gun: prod-worker-2

Restarts grouped by node:

| Node | Total restarts (>10) |
| --- | --- |
| prod-worker-2 | 1072 |
| prod-worker-1 | 400 |
| prod-control-plane-{1,2,3} / prod-worker-3 | ~90 each |

In addition, cilium-2z7fv on prod-worker-2 was OOMKilled at 2026-05-28T22:07:32Z (exit 137). Spire-server's apiserver-notifier failures begin at exactly that timestamp. After the Cilium agent restart, ClusterIP routing on prod-worker-2 hasn't fully recovered: pods on that node keep timing out on Service IPs.

Why Cilium got OOMed

prod-worker-2 resource state:

memory   7453078434 (98%)  33132Mi (457%)   ← requests / limits
cpu      3921m (99%)       32150m (813%)
usage    5289Mi (72%)      — top node

98% memory requests committed, no slack. And 25 critical infra pods are QoS: BestEffort — including all 6 spire-agents, the SPIRE server, both Cilium operators, all 6 cilium-envoy DaemonSet pods, and the HCloud CSI controller/nodes. BestEffort is first to evict under memory pressure, and Cilium itself ships with resources: {} in the HelmRelease values. When prod-worker-2 hit memory pressure, the kernel killed the highest-RSS BestEffort process — cilium-agent — and the BPF state hasn't fully reconciled since.
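For reference, the QoS class follows mechanically from a pod's container resources. The sketch below is a simplified version of the Kubernetes rules, assuming quantities are compared as plain strings (so "1Gi" and "1024Mi" would not be treated as equal) and ignoring init containers; `qos_class` is an illustrative helper, not a Kubernetes API.

```python
def qos_class(containers):
    """Classify a pod from a list of container resource dicts,
    e.g. {"requests": {"cpu": "200m"}, "limits": {}}."""
    has_any = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests") or {}
        limits = c.get("limits") or {}
        for res in ("cpu", "memory"):
            limit = limits.get(res)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(res, limit)
            if request is not None or limit is not None:
                has_any = True
            if limit is None or request != limit:
                guaranteed = False
    if not has_any:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# resources: {} as shipped in the Cilium HelmRelease values
print(qos_class([{}]))  # BestEffort

# Requests only, no limits
print(qos_class([{"requests": {"cpu": "200m", "memory": "512Mi"}}]))  # Burstable
```

Setting requests without limits is enough to move a pod out of BestEffort, which is what item 1 of the follow-up below relies on.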

Cascade

prod-worker-2 saturated → cilium-2z7fv OOM → BPF maps degraded
  → pods on prod-worker-2 can't reach Service ClusterIPs (10.96.0.1, 10.96.193.18)
  → spire-agent on prod-worker-2 can't attest → no /run/spire/sockets/admin.sock on host
  → cilium-2z7fv now logging continuously: "SPIRE Delegate API Client failed to init"
  → cert-manager / kubevirt / fleet / flux all crash-loop their apiserver-bound clients

What this PR should do: nothing

The probe changes here are sound and the merge-queue check is correctly surfacing a real prod incident. The right place to address this is a dedicated PR (see follow-up task spawned in the parent session) that:

  1. Sets explicit requests (and optional limits) on Cilium agent, operator, envoy, and on SPIRE agent/server — promote them out of BestEffort QoS so the kernel doesn't pick them when prod-worker-2 saturates.
  2. Investigates prod-worker-2's high allocation (98% requests committed) and rebalances if needed.
  3. Operator follow-up to clear the stuck BPF state on prod-worker-2: kubectl delete pod -n kube-system cilium-2z7fv (Cilium DaemonSet recreates it) or a node reboot.

I will not modify PR #1636 to add those fixes — mixing scopes would make the rollback story for a probe-tuning change much worse if any of the Cilium changes turn out to need iteration.

devantler added a commit that referenced this pull request May 29, 2026
…ut of BestEffort QoS (#1649)

* fix(cilium,spire): set resource requests to promote critical agents out of BestEffort QoS

Cilium agent, operator, envoy, and the embedded SPIRE server/agent run in
BestEffort QoS by default — the upstream chart leaves `resources:` empty
everywhere. On 2026-05-28 prod-worker-2 (at 98% memory request commitment)
OOMKilled `cilium-2z7fv`; the restarted agent left ClusterIP routing on
that node degraded, then got stuck retrying `SPIRE admin socket
(/run/spire/sockets/admin.sock) does not exist` because the spire-agent
DaemonSet pod for the node was also BestEffort and crash-looping. ~13
workloads cascaded into i/o timeout against `10.96.0.1:443` and
`10.96.193.18:8081` (cert-manager-cainjector, virt-handler, all
spire-agents, spire-server, kustomize-controller, flux-operator, fleet,
keda http external scaler, kube-state-metrics, trust-manager,
origin-ca-issuer, csi-provisioner).

Add explicit requests to:
- `resources` (agent DaemonSet) — 200m / 512Mi (observed steady-state
  ~165m / 340Mi)
- `envoy.resources` (standalone cilium-envoy DaemonSet) — 50m / 128Mi
- `operator.resources` — 100m / 256Mi
- `authentication.mutual.spire.install.server.resources` — 50m / 128Mi
- `authentication.mutual.spire.install.agent.resources` — 50m / 128Mi

All five pods are now Burstable instead of BestEffort, so they're no
longer first in line for kubelet eviction / OOMKill under node memory
pressure. Limits intentionally unset — Cilium recommends against capping
the agent.

Out of scope: prod-worker-2 sits at 98% memory commitment for unrelated
reasons (VPA recommendations + workload density). Adding ~768Mi of new
DaemonSet requests per node will tip it further; a follow-up rebalance
or worker scale-up is likely needed. Flagged in PR body.
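The ~768Mi figure is the sum of the memory requests for the three per-node DaemonSet pods listed above (agent, cilium-envoy, spire-agent). The operator and SPIRE server are left out here on the assumption that they do not run on every node. A short sketch of the sum, with dictionary keys chosen for illustration:

```python
# Memory requests (Mi) and CPU requests (millicores) for per-node DaemonSet pods,
# taken from the values listed in this commit message.
daemonset_requests = {
    "cilium-agent": {"cpu_m": 200, "memory_mi": 512},
    "cilium-envoy": {"cpu_m": 50, "memory_mi": 128},
    "spire-agent": {"cpu_m": 50, "memory_mi": 128},
}

per_node_memory_mi = sum(r["memory_mi"] for r in daemonset_requests.values())
per_node_cpu_m = sum(r["cpu_m"] for r in daemonset_requests.values())

print(per_node_memory_mi)  # 768
print(per_node_cpu_m)      # 300
```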

Recovery action (separate from this PR): once Flux has reconciled the
new resources, restart the wedged agent with
`kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv`.
If prod-worker-2 doesn't recover within ~5 min, reboot the node via
talosctl / Hetzner console.

Validated with:
- ksail workload validate (256 files ok)
- ksail --config ksail.prod.yaml workload validate (256 files ok)
- kubectl kustomize k8s/clusters/{local,prod}/ — clean
- kubectl kustomize k8s/providers/{docker,hetzner}/infrastructure/controllers/ — clean

Refs: #1636

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger after transient ksail-workload-validate EOF on bases/apps/headlamp

Previous run (26621107142) failed in System Test at the kubeconform
step with `validation failed: EOF` for `bases/apps/headlamp` — a
schema-fetch network blip, not a content failure. Headlamp is
untouched by this PR (diff is the Cilium HelmRelease only), the
manifest validates cleanly locally on this branch, and the previous
successful main-line run validated headlamp from the same files.
Empty commit to retrigger CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026