fix(apps): add startupProbe to homepage, headlamp, actual-budget by devantler · Pull Request #1636 · devantler-tech/platform

devantler · 2026-05-28T16:10:54Z

🤖 Generated by the Daily AI Assistant

Background

Prod-cluster warning-event investigation surfaced 14 Unhealthy events in a recent window, all of which mapped to cold-start probe failures on container creation — not shutdown races, not real downtime. Each affected chart hardcodes livenessProbe/readinessProbe with Kubernetes defaults (timeoutSeconds: 1, periodSeconds: 10, failureThreshold: 3) and ships no startupProbe. Containers take ~10–13 s to begin serving, so each pod creation logs 1–3 Unhealthy warnings and leaves only ~17 s of headroom before the liveness restart would fire.

Observed (prod, 14:59–15:44 UTC 2026-05-28):

homepage: 3 Unhealthy events per pod, 5 pods affected during one rollout
actual-budget: 2 events at cold start
headlamp: 1 event per KEDA scale-from-zero (recurring)

Fix

Add a startupProbe to each Deployment via the existing postRenderer block:

startupProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 20    # past observed ~13s cold start → first probe lands on serving container
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 12       # 60s grace beyond initialDelay = 80s total startup budget

Total startup budget before liveness takes over: 20 s + 12 × 5 s = 80 s. Liveness/readiness untouched — once startup succeeds the tight chart defaults are fine on a warm pod. None of the three charts expose startupProbe as a values knob, so all three use a strategic-merge/JSON patch on the rendered Deployment.
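As a rough sketch of what such a patch can look like for homepage, the fragment below uses the Flux HelmRelease `spec.postRenderers[].kustomize.patches` field with a JSON6902 operation. The Deployment name `homepage` is inferred from the pod names in the events, and the container index `0` follows the PR's note that the patch lands on `containers[0]`. The rest of the HelmRelease (chart reference, interval, values) is omitted, so the layout in the repository may differ.

```yaml
# Fragment of a HelmRelease spec; illustrative, not copied from the repository.
spec:
  postRenderers:
    - kustomize:
        patches:
          - target:
              kind: Deployment
              name: homepage            # assumed Deployment name
            patch: |
              # JSON6902: add a startupProbe to the first (main) container
              - op: add
                path: /spec/template/spec/containers/0/startupProbe
                value:
                  httpGet:
                    path: /
                    port: http
                  initialDelaySeconds: 20   # past the observed ~13s cold start
                  periodSeconds: 5
                  timeoutSeconds: 3
                  failureThreshold: 12      # 12 x 5s = 60s after the initial delay
```

Addressing the container by index is fragile if the chart ever reorders its containers; a strategic-merge patch keyed on the container name would avoid that, at the cost of needing each chart's container name.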

Why `initialDelaySeconds`, not faster polling

A first revision used periodSeconds: 2, failureThreshold: 30 (no initialDelaySeconds). That failed the merge-queue check-event-warnings action (commit 19fb2b3 fixes it). kubelet emits the same reason=Unhealthy Warning event regardless of whether startup, liveness, or readiness failed — so a startupProbe alone doesn't silence the noise. And periodSeconds: 2 actually made it worse: a ~13 s cold start produced 5–7 failed-probe events per pod (vs. 1–3 with the chart-default periodSeconds: 10). The action records a marker post-reconcile and fails on any Warning with lastTimestamp within a 90 s settle window; this PR's rollouts now created cold-start failures inside that window.

The current revision sets initialDelaySeconds: 20 so the first probe lands on an already-serving container — zero failed probes on a normal rollout, and the chart-default liveness/readiness only start after startup succeeds (i.e. once HTTP is healthy).
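For comparison, the two revisions can be written side by side. The wrapper keys `firstRevision` and `currentRevision` are hypothetical labels for this illustration only. The first revision's `httpGet` target is assumed to match the current one, and its `timeoutSeconds: 2` is taken from the "2s period/timeout" wording in the original commit message.

```yaml
# Illustration only; the two top-level keys are not part of any manifest.
firstRevision:
  startupProbe:
    httpGet: { path: /, port: http }   # assumed unchanged between revisions
    # initialDelaySeconds omitted, so it defaults to 0 and probing starts right away
    periodSeconds: 2
    timeoutSeconds: 2
    failureThreshold: 30               # 30 x 2s = 60s startup window
    # ~13s cold start at a 2s period: roughly 5-7 failed probes per pod

currentRevision:
  startupProbe:
    httpGet: { path: /, port: http }
    initialDelaySeconds: 20            # first probe fires after the ~13s cold start
    periodSeconds: 5
    timeoutSeconds: 3
    failureThreshold: 12               # 20s + 12 x 5s = 80s total budget
    # normal rollout: the first probe succeeds, so no Unhealthy events
```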

Trade-offs

Pods are NotReady for ~20 s instead of ~13 s on cold start; rollouts take ~7 s longer per pod. Acceptable for a clean steady-state CI signal.
For headlamp / actual-budget the patch lands on containers[0] (the main app container); the headlamp-plugin sidecar has no probes and is unaffected.

Validation

ksail workload validate → 256 files validated (local)
ksail --config ksail.prod.yaml workload validate → 256 files validated
kubectl kustomize k8s/providers/{docker,hetzner}/apps/ both succeed; all three startupProbe blocks resolve into the rendered HelmRelease post-renderer.

Each of these charts hardcodes liveness/readiness with Kubernetes defaults (timeoutSeconds: 1, periodSeconds: 10, failureThreshold: 3) and does not template a startupProbe. On a cold start the container takes ~10–13s to begin serving HTTP, so each pod creation logs 1–3 'Unhealthy' Warning events and leaves only ~17s of headroom before the liveness restart fires.

Add a strategic-merge startupProbe via the existing postRenderer block (60s startup window, 2s period/timeout) so liveness/readiness are gated until the container is actually serving. No change to liveness/readiness: once startup succeeds the tight defaults are fine on a warm pod.

Observed (prod, 14:59–15:44 UTC on 2026-05-28):

- homepage: 3 Unhealthy events per rollout pod (5 pods affected)
- actual-budget: 2 events at cold start
- headlamp: 1 event per KEDA scale-from-zero

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR reduces noisy cold-start Unhealthy events for selected applications by adding startupProbe patches through existing Flux HelmRelease post-renderers.

Changes:

Adds a 60s HTTP startup window for homepage.
Adds the same startup gating for KEDA-scaled headlamp.
Adds startup gating for actual-budget while leaving existing liveness/readiness behavior unchanged.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
`k8s/bases/apps/homepage/helm-release.yaml`	Adds a JSON6902 post-renderer patch for the main container startup probe.
`k8s/bases/apps/headlamp/helm-release.yaml`	Adds a startup probe to the main Headlamp container without affecting the plugin sidecar.
`k8s/bases/apps/actual-budget/helm-release.yaml`	Adds a startup probe to the rendered Actual Budget Deployment.

The original probe (periodSeconds: 2, failureThreshold: 30, no initialDelay) silenced cold-start liveness/readiness *restarts* but not the underlying "Unhealthy" Warning events: kubelet emits the same event for startup, liveness, and readiness probe failures, and the 2s period generates 5-7 failures during the ~13s cold start instead of the chart-default 1-3 (periodSeconds: 10).

Merge-queue deploy of #1636 failed the check-event-warnings action, which records a marker post-reconcile and fails if any Warning event has lastTimestamp within a 90s settle window. The rollout these patches force created new pods during that window; their startup probes fired every 2s during cold start; their events landed past the marker.

Set initialDelaySeconds: 20 (past the observed ~13s cold start) and periodSeconds: 5 so the first probe lands on a serving container. Zero failure events on a normal rollout; failureThreshold: 12 leaves 60s of grace if a container is unusually slow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

devantler · 2026-05-28T21:14:30Z

🤖 Generated by the Daily AI Assistant

Merge-queue failure analysis & fix

The merge-queue's Deploy to Prod → 🔎 Require no new event warnings step failed with 5 distinct Unhealthy events fired post-marker:

2026-05-28T20:57:04Z  [homepage] Pod/homepage-855f8f9786-wgbp5  Unhealthy (x1): Startup probe failed: ... connection refused
2026-05-28T20:57:14Z  [homepage] Pod/homepage-855f8f9786-wgbp5  Unhealthy (x4): Startup probe failed: ... context deadline exceeded
2026-05-28T20:57:19Z  [homepage] Pod/homepage-855f8f9786-lf7px  Unhealthy (x1): Startup probe failed: ... connection refused
2026-05-28T20:57:26Z  [actual-budget] Pod/actual-budget-actualbudget-7d495bbb65-mkw8x  Unhealthy (x7): Startup probe failed: ...
2026-05-28T20:57:31Z  [homepage] Pod/homepage-855f8f9786-lf7px  Unhealthy (x5): Startup probe failed: ...

Root cause — own goal:

Kubelet emits the same Unhealthy Warning event for startup, liveness, and readiness probe failures. Adding a startupProbe doesn't silence those events; the reason stays Unhealthy and only the message prefix changes (to "Startup probe failed", as in the events above).
The original probe (periodSeconds: 2, failureThreshold: 30) polled 5× faster than the chart-default periodSeconds: 10 liveness/readiness it was meant to replace, so a 13 s cold start now produces 5–7 failures (x4, x5, x7 above) instead of the original 1–3.
This PR forces a rollout of homepage/headlamp/actual-budget (postRenderer changes). The new pods cold-start during the action's 90 s settle window, so their failures land past the marker and fail the check.

Fix (19fb2b3):

startupProbe:
  httpGet: { path: /, port: http }
  initialDelaySeconds: 20   # past the observed ~13s cold start
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 12      # 60s grace beyond initialDelay = 80s total

The first probe lands on a serving container, so the probe never fails on a normal rollout — zero events in the settle window. failureThreshold: 12 still leaves a generous 60 s of post-delay grace for unusually slow starts.

Trade-off: pods are NotReady for ~20 s instead of ~13 s on cold start (rollouts take 7 s longer per pod), since initialDelaySeconds blocks readiness from succeeding earlier. Acceptable cost for clean steady-state CI.

ksail workload validate and ksail --config ksail.prod.yaml workload validate both green on the new commit.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

# Conflicts: # k8s/bases/apps/homepage/helm-release.yaml

devantler · 2026-05-29T05:55:25Z

🤖 Generated by the Daily AI Assistant

Merge-queue failure analysis: prod cluster degraded, NOT a PR regression

PR #1636 only modifies three app HelmRelease postRenderers (homepage / headlamp / actual-budget) — confirmed via git diff main...HEAD --stat. None of the namespaces or workloads that triggered the check-event-warnings failure are touched by this branch.

The three flagged events:

BackOff (x1861) — cert-manager/cert-manager-cainjector-585b7d746-kgp62
BackOff (x1023) — kubevirt/virt-handler-lf2dd
BackOff (x1620) — kube-system/spire-agent-nx4rw

BackOff counts of x1023 and higher on 5-day-old pods show the gate correctly catching pre-existing crash loops, not anything this PR introduced. Findings from a read-only investigation against admin@prod:

What's actually broken

Root pattern: pods crash-loop trying to reach the kube-apiserver ClusterIP (10.96.0.1:443), observed across at least 13 critical workloads (cert-manager-cainjector, virt-handler, all 6 spire-agents, kustomize-controller, flux-operator, fleet, keda-add-ons-http-external-scaler, kube-state-metrics, trust-manager, origin-ca-issuer, longhorn csi-provisioner). Sample logs:

spire-agent:   create attestation client: failed to dial dns:///spire-server:8081:
               context deadline exceeded ... dial tcp 10.96.193.18:8081: i/o timeout
cainjector:    failed to get server groups: ... dial tcp 10.96.0.1:443: i/o timeout
virt-handler:  failed to list *v1.ConfigMap: ... dial tcp 10.96.0.1:443: i/o timeout
spire-server:  notifier(k8sbundle): unable to update: ... dial tcp 10.96.0.1:443: i/o timeout

Smoking gun: `prod-worker-2`

Restarts grouped by node:

Node	Total restarts (>10)
prod-worker-2	1072
prod-worker-1	400
prod-control-plane-{1,2,3} / prod-worker-3	~90 each

In addition, cilium-2z7fv on prod-worker-2 was OOMKilled at 2026-05-28T22:07:32Z (exit 137). Spire-server's apiserver-notifier failures begin at exactly that timestamp. Since the Cilium agent restarted, ClusterIP routing on prod-worker-2 has not fully recovered: pods on that node keep timing out on Service IPs.

Why Cilium got OOMed

prod-worker-2 resource state:

memory   7453078434 (98%)  33132Mi (457%)   ← requests / limits
cpu      3921m (99%)       32150m (813%)
usage    5289Mi (72%)      — top node

98% memory requests committed, no slack. And 25 critical infra pods are QoS: BestEffort — including all 6 spire-agents, the SPIRE server, both Cilium operators, all 6 cilium-envoy DaemonSet pods, and the HCloud CSI controller/nodes. BestEffort is first to evict under memory pressure, and Cilium itself ships with resources: {} in the HelmRelease values. When prod-worker-2 hit memory pressure, the kernel killed the highest-RSS BestEffort process — cilium-agent — and the BPF state hasn't fully reconciled since.

Cascade

prod-worker-2 saturated → cilium-2z7fv OOM → BPF maps degraded
  → pods on prod-worker-2 can't reach Service ClusterIPs (10.96.0.1, 10.96.193.18)
  → spire-agent on prod-worker-2 can't attest → no /run/spire/sockets/admin.sock on host
  → cilium-2z7fv now logging continuously: "SPIRE Delegate API Client failed to init"
  → cert-manager / kubevirt / fleet / flux all crash-loop their apiserver-bound clients

What this PR should do: nothing

The probe changes here are sound and the merge-queue check is correctly surfacing a real prod incident. The right place to address this is a dedicated PR (see follow-up task spawned in the parent session) that:

Sets explicit requests (and optional limits) on Cilium agent, operator, envoy, and on SPIRE agent/server — promote them out of BestEffort QoS so the kernel doesn't pick them when prod-worker-2 saturates.
Investigates prod-worker-2's high allocation (98% requests committed) and rebalances if needed.
Operator follow-up to clear the stuck BPF state on prod-worker-2: kubectl delete pod -n kube-system cilium-2z7fv (Cilium DaemonSet recreates it) or a node reboot.

I will not modify PR #1636 to add those fixes — mixing scopes would make the rollback story for a probe-tuning change much worse if any of the Cilium changes turn out to need iteration.

…ut of BestEffort QoS (#1649)

* fix(cilium,spire): set resource requests to promote critical agents out of BestEffort QoS

Cilium agent, operator, envoy, and the embedded SPIRE server/agent run in BestEffort QoS by default; the upstream chart leaves `resources:` empty everywhere. On 2026-05-28 prod-worker-2 (at 98% memory request commitment) OOMKilled `cilium-2z7fv`; the restarted agent left ClusterIP routing on that node degraded, then got stuck retrying `SPIRE admin socket (/run/spire/sockets/admin.sock) does not exist` because the spire-agent DaemonSet pod for the node was also BestEffort and crash-looping. ~13 workloads cascaded into i/o timeout against `10.96.0.1:443` and `10.96.193.18:8081` (cert-manager-cainjector, virt-handler, all spire-agents, spire-server, kustomize-controller, flux-operator, fleet, keda http external scaler, kube-state-metrics, trust-manager, origin-ca-issuer, csi-provisioner).

Add explicit requests to:

- `resources` (agent DaemonSet): 200m / 512Mi (observed steady-state ~165m / 340Mi)
- `envoy.resources` (standalone cilium-envoy DaemonSet): 50m / 128Mi
- `operator.resources`: 100m / 256Mi
- `authentication.mutual.spire.install.server.resources`: 50m / 128Mi
- `authentication.mutual.spire.install.agent.resources`: 50m / 128Mi

All five pods are now Burstable instead of BestEffort, so they're no longer first in line for kubelet eviction / OOMKill under node memory pressure. Limits intentionally unset: Cilium recommends against capping the agent.

Out of scope: prod-worker-2 sits at 98% memory commitment for unrelated reasons (VPA recommendations + workload density). Adding ~768Mi of new DaemonSet requests per node will tip it further; a follow-up rebalance or worker scale-up is likely needed. Flagged in PR body.

Recovery action (separate from this PR): once Flux has reconciled the new resources, restart the wedged agent with `kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv`. If prod-worker-2 doesn't recover within ~5 min, reboot the node via talosctl / Hetzner console.

Validated with:

- ksail workload validate (256 files ok)
- ksail --config ksail.prod.yaml workload validate (256 files ok)
- kubectl kustomize k8s/clusters/{local,prod}/: clean
- kubectl kustomize k8s/providers/{docker,hetzner}/infrastructure/controllers/: clean

Refs: #1636

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger after transient ksail-workload-validate EOF on bases/apps/headlamp

Previous run (26621107142) failed in System Test at the kubeconform step with `validation failed: EOF` for `bases/apps/headlamp`, a schema-fetch network blip rather than a content failure. Headlamp is untouched by this PR (diff is the Cilium HelmRelease only), the manifest validates cleanly locally on this branch, and the previous successful main-line run validated headlamp from the same files. Empty commit to retrigger CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
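The requests listed in the commit message above would map onto Cilium Helm values roughly as follows. This is a sketch built only from the keys and numbers quoted there; it assumes each "X / Y" pair means CPU / memory, and it is not copied from the actual HelmRelease in #1649.

```yaml
# Cilium Helm values fragment (illustrative). Requests only; limits left unset.
resources:                 # cilium-agent DaemonSet
  requests:
    cpu: 200m
    memory: 512Mi
envoy:
  resources:               # standalone cilium-envoy DaemonSet
    requests:
      cpu: 50m
      memory: 128Mi
operator:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
authentication:
  mutual:
    spire:
      install:
        server:
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
        agent:
          resources:       # spire-agent DaemonSet
            requests:
              cpu: 50m
              memory: 128Mi
# Per-node DaemonSet memory requests: 512Mi + 128Mi + 128Mi = 768Mi
```

Setting requests without limits places these pods in the Burstable QoS class; Guaranteed would require limits equal to requests on every container.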

Copilot AI review requested due to automatic review settings May 28, 2026 16:10

github-project-automation Bot added this to 🌊 Project Board May 28, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 28, 2026

devantler temporarily deployed to ci May 28, 2026 16:11 — with GitHub Actions Inactive

Copilot started reviewing on behalf of devantler May 28, 2026 16:11

Copilot AI reviewed May 28, 2026

devantler marked this pull request as ready for review May 28, 2026 17:24

Merge branch 'main' into claude/repo-assist-add-startupprobes-only

d7295fb

botantler Bot approved these changes May 28, 2026

devantler temporarily deployed to ci May 28, 2026 17:24 — with GitHub Actions Inactive

botantler Bot enabled auto-merge May 28, 2026 17:24

botantler Bot added this pull request to the merge queue May 28, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026

devantler added this pull request to the merge queue May 28, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026

Copilot AI review requested due to automatic review settings May 28, 2026 21:14

Copilot started reviewing on behalf of devantler May 28, 2026 21:14

devantler had a problem deploying to ci May 28, 2026 21:14 — with GitHub Actions Error

botantler Bot approved these changes May 28, 2026

botantler Bot enabled auto-merge May 28, 2026 21:14

Copilot AI reviewed May 28, 2026

Comment thread k8s/bases/apps/homepage/helm-release.yaml

Comment thread k8s/bases/apps/headlamp/helm-release.yaml

Comment thread k8s/bases/apps/actual-budget/helm-release.yaml

Merge branch 'main' into claude/repo-assist-add-startupprobes-only

487ed87

# Conflicts: # k8s/bases/apps/homepage/helm-release.yaml

botantler Bot approved these changes May 28, 2026

devantler had a problem deploying to ci May 28, 2026 21:32 — with GitHub Actions Failure

devantler mentioned this pull request May 28, 2026

fix(opencost): treat custom pricing as USD/month and lift memory limit #1637

Merged

devantler temporarily deployed to ci May 28, 2026 22:05 — with GitHub Actions Inactive

botantler Bot added this pull request to the merge queue May 28, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026

devantler added this pull request to the merge queue May 28, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026

devantler added this pull request to the merge queue May 29, 2026