fix(openbao): switch readinessProbe to HTTP /sys/health to unblock bootstrap by devantler · Pull Request #1656 · devantler-tech/platform

devantler · 2026-05-29T13:02:41Z

🤖 Generated by the Daily AI Assistant

The bug

System Test on PR #1636 (run 26603473269) failed with:

Kustomization/flux-system/infrastructure — health check failed after 3m0.432401941s:
  timeout waiting for: [Job/openbao/vault-config status: 'InProgress'] (HealthCheckFailed)

PR #1636 only modified homepage/headlamp/actual-budget HelmReleases — the failure was unrelated to its diff. The Job/openbao/vault-config regularly takes a long time to bootstrap on fresh clusters; the vault-config Job's own comment claims "openbao bootstrap can take 20-40 min on cold CI runners" and sets backoffLimit: 30, activeDeadlineSeconds: 3600 accordingly.

My first attempt at fixing this was a per-cluster timeout patch (PR #1648, now closed) — bumping the local infrastructure Flux Kustomization timeout to 20m. Reviewer push-back was correct: 30 min is extreme, not a normal cold-start time; this is a symptom, not the disease. This PR fixes the disease.

Root cause

The openbao-helm 0.28.3 chart's default readinessProbe is exec: bao status -tls-skip-verify (server-statefulset.yaml:157-178). bao status exits 2 when sealed — so on a fresh cluster, the StatefulSet pod stays NotReady until something unseals it.
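For orientation, here is a rough sketch of what the default exec branch might render to in the StatefulSet pod spec. The `/bin/sh -ec` wrapper is an assumption modelled on the upstream vault-helm template, and timing fields are omitted; check `server-statefulset.yaml` for the actual output.

```yaml
# Sketch only: approximate rendering of the chart's default readiness probe.
# The shell wrapper is an assumption; timing fields are omitted.
readinessProbe:
  exec:
    # `bao status` exits 0 when unsealed, 2 when sealed, 1 on error.
    # On a fresh cluster the server is sealed, so the probe fails every time.
    command: ["/bin/sh", "-ec", "bao status -tls-skip-verify"]
```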

That "something" is the vault-config Job at k8s/bases/infrastructure/vault-config/job.yaml — which runs in the downstream infrastructure Flux Kustomization, gated on infrastructure-controllers (which contains this HelmRelease) becoming Ready first. The chain:

| Layer | Has wait? | Contains |
| --- | --- | --- |
| `infrastructure-controllers` | `wait: true` | openbao HelmRelease |
| `infrastructure` | `wait: true`, `dependsOn: infrastructure-controllers` | vault-config Job (would unseal) |

Flux's HelmController uses --wait by default (install.disableWait: false), so the HelmRelease cannot converge while the pod is NotReady; install.remediation.retries: -1 (helm-release.yaml:11-13) drives an endless install → wait timeout → uninstall → reinstall churn for the entire bootstrap window. Bootstrap only escapes via a fragile race where the Job pod eventually catches a transient window during the chart's install/uninstall thrash — historically 20-40 min.
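The dependency chain and retry settings described above could look roughly like the following. This is a sketch under stated assumptions: the `path`, `interval`, and `sourceRef` values and the HelmRelease namespace are hypothetical placeholders, not copied from the repository. Only `wait`, `dependsOn`, the 3m timeout, `disableWait`, and `retries: -1` come from the description.

```yaml
# Sketch only: path, interval, sourceRef, and the HelmRelease namespace are hypothetical.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure-controllers
  namespace: flux-system
spec:
  interval: 10m                            # hypothetical
  path: ./infrastructure-controllers       # hypothetical
  prune: true
  wait: true                               # blocks until the openbao HelmRelease is Ready
  sourceRef:
    kind: GitRepository                    # hypothetical source kind and name
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m                            # hypothetical
  path: ./infrastructure                   # hypothetical
  prune: true
  wait: true
  timeout: 3m                              # the health check that timed out in CI
  dependsOn:
    - name: infrastructure-controllers     # vault-config Job cannot start before this is Ready
  sourceRef:
    kind: GitRepository                    # hypothetical source kind and name
    name: flux-system
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: openbao
  namespace: openbao                       # assumed; the Job lives in this namespace
spec:
  # chart reference, interval, and values omitted
  install:
    disableWait: false                     # default: Helm waits for the pod to become Ready
    remediation:
      retries: -1                          # retry forever: install, time out, uninstall, reinstall
```

Read top to bottom, the cycle is visible: the first Kustomization waits on a HelmRelease that waits on a pod that only becomes Ready after a Job in the second Kustomization runs.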

The Job's vault-init init container also has an unbounded until bao status … ; sleep 3; done loop (job.yaml:90-95) with no timeout, so a single Pod can sit in init for up to activeDeadlineSeconds: 3600 (1 hour) before getting killed and retried — amplifying the race window.

The fix

Setting server.readinessProbe.path makes the chart template render the httpGet branch instead of the exec branch:

{{- if .Values.server.readinessProbe.path }}
httpGet:
  path: {{ .Values.server.readinessProbe.path | quote }}
  port: {{ .Values.server.readinessProbe.port }}
  scheme: {{ include "openbao.scheme" . | upper }}

The HashiCorp Vault Helm chart uses exactly this pattern for the same reason — /v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204 makes the /sys/health endpoint return HTTP 204 even when the server is sealed and uninitialized, so the Pod reports Ready as soon as the listener is up. The HelmRelease then converges Ready on first install, infrastructure-controllers becomes Ready, the infrastructure layer runs, and the vault-config Job completes in ~1-2 min instead of waiting for the deadlock to self-resolve.
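With the path set, the rendered probe in the pod spec would look approximately like this. It is a sketch: port 8200 is assumed from the chart's default `server.readinessProbe.port`, timing fields are omitted, and the status codes in the comments are the defaults documented for Vault's `/sys/health`, which OpenBao is assumed to share.

```yaml
# Sketch only: approximate rendering of the httpGet branch.
readinessProbe:
  httpGet:
    # Default /sys/health codes: 200 active, 429 standby, 501 uninitialized, 503 sealed.
    # standbyok=true  -> standby nodes return the active code instead of 429
    # sealedcode=204  -> sealed nodes return 204 instead of 503
    # uninitcode=204  -> uninitialized nodes return 204 instead of 501
    path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
    port: 8200        # assumed chart default
    scheme: HTTP      # from openbao.scheme when global.tlsDisable is true
```

The kubelet treats any HTTP status from 200 up to but not including 400 as a probe success, so a 204 from a sealed server is enough to mark the pod Ready.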

Scheme handling

openbao.scheme returns http when global.tlsDisable: true (_helpers.tpl — chart default; matches our tls_disable = 1 listener config in helm-release.yaml:80-83), so the probe stays HTTP — no TLS plumbing required.

LivenessProbe

The chart's server.livenessProbe.enabled defaults to false (values.yaml:643), so there is no parallel liveness fix needed. (If liveness were enabled with the same exec, the kubelet would kill sealed pods — but it isn't.)

Validation

Static checks only — no cluster (per AGENTS.md):

$ ksail workload validate
✔ 256 files validated

$ ksail --config ksail.prod.yaml workload validate
✔ 256 files validated

Rendered openbao HelmRelease shows the path landing correctly:

server:
  readinessProbe:
    path: /v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204

The full Talos+Docker system-test will run in CI on this PR — it should now pass in normal time without the 3m flake.

What this means for the existing safety nets

Once this fix is in, the Job's own backoffLimit: 30 and activeDeadlineSeconds: 3600 become massively oversized — they were sized for the 20-40 min race that no longer exists. I'm not lowering them in this PR (keeps the diff surgical, oversized safety nets are harmless), but a follow-up can bring them down to ~backoffLimit: 10, activeDeadlineSeconds: 600 once a few CI runs confirm the new bootstrap time. The misleading "20-40 min on cold CI runners" comment in job.yaml:37-40 is also worth rewriting at that point — leaving it alone here to keep the diff a single, easily-reviewed file.
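As a sketch of that possible follow-up (not part of this PR), the Job spec could be tightened along these lines. The namespace is inferred from the `Job/openbao/vault-config` reference, the pod template is omitted, and the numbers are the approximate targets mentioned above rather than measured values.

```yaml
# Sketch only: a possible follow-up, to be confirmed against real CI timings first.
apiVersion: batch/v1
kind: Job
metadata:
  name: vault-config
  namespace: openbao
spec:
  backoffLimit: 10             # down from 30
  activeDeadlineSeconds: 600   # down from 3600 (1 hour)
  # template omitted: init container and bootstrap script stay unchanged
```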

Similarly, the local infrastructure Kustomization's 3m timeout should now be comfortable (the rest of the layer is fast-converging; the only slow resource was vault-config-as-a-symptom-of-OpenBao-not-Ready). No timeout bump needed.

What this does NOT change

The Job script itself — unchanged.
The Job's place in the Flux DAG — unchanged.
OpenBao server config (listener, storage, TLS) — unchanged.
ESO / vault-seed / ExternalSecret semantics — unchanged.

Only the chart's probe handler is switched from exec to httpGet.

🤖 Generated with Claude Code

…otstrap

Root cause of the System Test "Job/openbao/vault-config status: 'InProgress' (HealthCheckFailed)" flake (PR #1636 run 26603473269):

The openbao-helm 0.28.3 chart's default readinessProbe is `exec: bao status -tls-skip-verify` (server-statefulset.yaml:157-178), which returns exit code 2 on a sealed server. On a fresh cluster, the StatefulSet pod therefore stays NotReady until something unseals it.

That "something" is the vault-config Job, which lives in the downstream 'infrastructure' Flux Kustomization and is gated on 'infrastructure-controllers' (which contains this HelmRelease) becoming Ready first. Flux's HelmController uses --wait by default (install.disableWait: false), so the HelmRelease cannot converge while the pod is NotReady; install.remediation.retries: -1 then drives an endless install -> wait timeout -> uninstall -> reinstall churn for the full bootstrap window. Bootstrap only escapes via a fragile race between Flux retries and the Job pod eventually catching a transient window where the OpenBao server is listening (historically 20-40 min, as the Job's backoffLimit=30 comment notes).

Setting readinessProbe.path makes the chart template render the httpGet branch instead of the exec branch:

    {{- if .Values.server.readinessProbe.path }}
    httpGet:
      path: {{ .Values.server.readinessProbe.path | quote }}
      port: {{ .Values.server.readinessProbe.port }}
      scheme: {{ include "openbao.scheme" . | upper }}

With sealedcode=204 and uninitcode=204, the /sys/health endpoint returns HTTP 204 even on a sealed-and-uninitialized server, so the Pod reports Ready as soon as the listener is up. The HelmRelease then converges Ready on first install, infrastructure-controllers becomes Ready, the infrastructure layer runs, and the vault-config Job completes in ~1-2 min instead of waiting 20-40 min for the deadlock to self-resolve.

Scheme handling: 'openbao.scheme' returns 'http' when global.tlsDisable: true (chart default; matches our 'tls_disable = 1' listener), so the probe stays HTTP and no TLS plumbing is required. The chart's livenessProbe defaults to enabled: false, so no parallel liveness fix is needed.

This is the same pattern HashiCorp's official Vault Helm chart uses for the same reason (see vault-helm/values.yaml: readinessProbe.path defaults to '/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates the OpenBao HelmRelease values to avoid a bootstrap deadlock caused by the chart’s default exec-based readiness probe failing while the server is sealed/uninitialized. By switching readiness to an HTTP /v1/sys/health endpoint that returns a 2xx/204 during sealed/uninitialized states, Flux/Helm can mark the release ready and allow the downstream vault-config Job to run promptly.

Changes:

Override server.readinessProbe to use an HTTP health endpoint (/v1/sys/health) with sealedcode=204 and uninitcode=204.
Add in-file rationale documenting the Flux dependency deadlock this prevents.

Copilot AI review requested due to automatic review settings May 29, 2026 13:02

github-project-automation Bot added this to 🌊 Project Board May 29, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 29, 2026

devantler temporarily deployed to ci May 29, 2026 13:02 — with GitHub Actions Inactive

Copilot started reviewing on behalf of devantler May 29, 2026 13:02

Copilot AI reviewed May 29, 2026


devantler marked this pull request as ready for review May 29, 2026 13:08

devantler added this pull request to the merge queue May 29, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026

devantler added this pull request to the merge queue May 29, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026

devantler added this pull request to the merge queue May 29, 2026

devantler mentioned this pull request May 29, 2026

feat(cluster-policies): require workloads to spread across nodes #1661

Open

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026

devantler wants to merge 1 commit into main from claude/openbao-readinessprobe-fix
