
fix(openbao): switch readinessProbe to HTTP /sys/health to unblock bootstrap #1656

Open
devantler wants to merge 1 commit into main from claude/openbao-readinessprobe-fix

@devantler

🤖 Generated by the Daily AI Assistant

The bug

System Test on PR #1636 (run 26603473269) failed with:

Kustomization/flux-system/infrastructure — health check failed after 3m0.432401941s:
  timeout waiting for: [Job/openbao/vault-config status: 'InProgress'] (HealthCheckFailed)

PR #1636 only modified homepage/headlamp/actual-budget HelmReleases — the failure was unrelated to its diff. The Job/openbao/vault-config regularly takes a long time to bootstrap on fresh clusters; the vault-config Job's own comment claims "openbao bootstrap can take 20-40 min on cold CI runners" and sets backoffLimit: 30, activeDeadlineSeconds: 3600 accordingly.

My first attempt at fixing this was a per-cluster timeout patch (PR #1648, now closed) — bumping the local infrastructure Flux Kustomization timeout to 20m. Reviewer push-back was correct: 30 min is extreme, not a normal cold-start time; this is a symptom, not the disease. This PR fixes the disease.

Root cause

The openbao-helm 0.28.3 chart's default readinessProbe is exec: bao status -tls-skip-verify (server-statefulset.yaml:157-178). bao status exits 2 when sealed — so on a fresh cluster, the StatefulSet pod stays NotReady until something unseals it.
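
For orientation, an exec-based readiness probe of this kind would look roughly like the fragment below once rendered. This is an illustrative sketch rather than a copy of the chart output: the `/bin/sh -ec` wrapper is an assumption, and the chart's timing fields are left out.

```yaml
# Sketch of an exec-based readiness probe (illustrative, not the chart's exact rendering).
readinessProbe:
  exec:
    # Assumed shell wrapper; the PR only states that the probe runs
    # `bao status -tls-skip-verify`.
    command: ["/bin/sh", "-ec", "bao status -tls-skip-verify"]
# `bao status` exits 2 while the server is sealed, so the kubelet
# keeps the pod NotReady until an unseal happens.
```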

That "something" is the vault-config Job at k8s/bases/infrastructure/vault-config/job.yaml — which runs in the downstream infrastructure Flux Kustomization, gated on infrastructure-controllers (which contains this HelmRelease) becoming Ready first. The chain:

| Layer | Has wait? | Contains |
| --- | --- | --- |
| infrastructure-controllers | wait: true | openbao HelmRelease |
| infrastructure | wait: true, dependsOn: infrastructure-controllers | vault-config Job (would unseal) |
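
The dependency chain above can be sketched as two Flux Kustomizations. The names, namespace, `wait`, `dependsOn`, and the 3m timeout come from this PR; the `interval`, `path`, `prune`, and `sourceRef` values are assumptions added only to make the manifests complete.

```yaml
# Sketch of the two Flux layers; interval, path, prune, and sourceRef are assumptions.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure-controllers
  namespace: flux-system
spec:
  interval: 10m                        # assumed
  path: ./infrastructure-controllers   # hypothetical path
  prune: true                          # assumed
  wait: true   # Ready only once every resource, including the openbao HelmRelease, is Ready
  sourceRef:
    kind: GitRepository                # assumed source kind
    name: flux-system                  # assumed source name
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m                        # assumed
  path: ./infrastructure               # hypothetical path
  prune: true                          # assumed
  wait: true
  timeout: 3m                          # the local timeout that the system test hit
  dependsOn:
    - name: infrastructure-controllers # not applied until the layer above is Ready
  sourceRef:
    kind: GitRepository                # assumed source kind
    name: flux-system                  # assumed source name
```

With this shape, the vault-config Job in `infrastructure` cannot start before the openbao HelmRelease in `infrastructure-controllers` reports Ready, which is the gate the exec probe was holding shut.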

Flux's HelmController uses --wait by default (install.disableWait: false), so the HelmRelease cannot converge while the pod is NotReady; install.remediation.retries: -1 (helm-release.yaml:11-13) drives an endless install → wait timeout → uninstall → reinstall churn for the entire bootstrap window. Bootstrap only escapes via a fragile race where the Job pod eventually catches a transient window during the chart's install/uninstall thrash — historically 20-40 min.
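
The two install settings named above sit in the HelmRelease roughly as follows. This is a fragment, not the repository's actual file: the `apiVersion`, metadata, and omitted chart reference are assumptions.

```yaml
# Fragment of a Flux HelmRelease showing only the install fields discussed here.
apiVersion: helm.toolkit.fluxcd.io/v2   # assumed API version
kind: HelmRelease
metadata:
  name: openbao        # assumed name
  namespace: openbao   # assumed from Job/openbao/vault-config
spec:
  # chart, interval, and values omitted
  install:
    disableWait: false   # default: the install waits for resources to become Ready
    remediation:
      retries: -1        # retry without limit after a failed install
```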

The Job's vault-init init container also has an unbounded until bao status … ; sleep 3; done loop (job.yaml:90-95) with no timeout, so a single Pod can sit in init for up to activeDeadlineSeconds: 3600 (1 hour) before getting killed and retried — amplifying the race window.

The fix

Setting server.readinessProbe.path makes the chart template render the httpGet branch instead of the exec branch:

{{- if .Values.server.readinessProbe.path }}
httpGet:
  path: {{ .Values.server.readinessProbe.path | quote }}
  port: {{ .Values.server.readinessProbe.port }}
  scheme: {{ include "openbao.scheme" . | upper }}

The HashiCorp Vault Helm chart uses exactly this pattern for the same reason. The query parameters in /v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204 make the /sys/health endpoint return HTTP 204 even when the server is sealed or uninitialized, so the Pod reports Ready as soon as the listener is up. The HelmRelease then converges Ready on first install, infrastructure-controllers becomes Ready, the infrastructure layer runs, and the vault-config Job should complete in ~1-2 min instead of waiting for the deadlock to self-resolve.
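
With the path set, the rendered probe should come out roughly as below. This is a sketch under stated assumptions: port 8200 is assumed to be what `server.readinessProbe.port` resolves to (it is OpenBao's default API port), and the chart's timing fields are omitted.

```yaml
# Sketch of the httpGet readiness probe after the template renders (timing fields omitted).
readinessProbe:
  httpGet:
    # 204 is returned for sealed and uninitialized servers; standby nodes also count as healthy.
    path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
    port: 8200     # assumed value of server.readinessProbe.port
    scheme: HTTP   # from openbao.scheme, upper-cased by the template
```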

Scheme handling

openbao.scheme returns http when global.tlsDisable: true (_helpers.tpl — chart default; matches our tls_disable = 1 listener config in helm-release.yaml:80-83), so the probe stays HTTP — no TLS plumbing required.
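
As a rough illustration of how the two TLS settings line up, the values might look like the fragment below. Only `global.tlsDisable: true` and `tls_disable = 1` come from this PR; the standalone-mode layout and the listener address are assumptions.

```yaml
# Sketch of Helm values; the server.standalone layout and the address are assumptions.
global:
  tlsDisable: true   # chart default; makes openbao.scheme resolve to http
server:
  standalone:
    enabled: true
    config: |
      listener "tcp" {
        tls_disable = 1             # plain-HTTP listener, matching the probe scheme
        address     = "[::]:8200"   # assumed listen address
      }
```

If TLS were later enabled on the listener, `global.tlsDisable` would need to change with it so the probe scheme follows.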

LivenessProbe

The chart's server.livenessProbe.enabled defaults to false (values.yaml:643), so there is no parallel liveness fix needed. (If liveness were enabled with the same exec, the kubelet would kill sealed pods — but it isn't.)

Validation

Static checks only — no cluster (per AGENTS.md):

$ ksail workload validate
✔ 256 files validated

$ ksail --config ksail.prod.yaml workload validate
✔ 256 files validated

Rendered openbao HelmRelease shows the path landing correctly:

server:
  readinessProbe:
    path: /v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204

The full Talos+Docker system-test will run in CI on this PR — it should now pass in normal time without the 3m flake.

What this means for the existing safety nets

Once this fix is in, the Job's own backoffLimit: 30 and activeDeadlineSeconds: 3600 become massively oversized — they were sized for the 20-40 min race that no longer exists. I'm not lowering them in this PR (keeps the diff surgical, oversized safety nets are harmless), but a follow-up can bring them down to ~backoffLimit: 10, activeDeadlineSeconds: 600 once a few CI runs confirm the new bootstrap time. The misleading "20-40 min on cold CI runners" comment in job.yaml:37-40 is also worth rewriting at that point — leaving it alone here to keep the diff a single, easily-reviewed file.
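
A possible shape for that follow-up, shown only as a sketch: the two numbers are the ones proposed above, the namespace is inferred from Job/openbao/vault-config, and the pod template is omitted because it would stay unchanged.

```yaml
# Possible follow-up for the vault-config Job (not part of this PR).
apiVersion: batch/v1
kind: Job
metadata:
  name: vault-config
  namespace: openbao
spec:
  backoffLimit: 10             # currently 30
  activeDeadlineSeconds: 600   # currently 3600
  # template: unchanged, omitted here
```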

Similarly, the local infrastructure Kustomization's 3m timeout should now be comfortable: the rest of the layer converges quickly, and the only slow resource was the vault-config Job, which was slow only because OpenBao was not Ready. No timeout bump is needed.

What this does NOT change

  • The Job script itself — unchanged.
  • The Job's place in the Flux DAG — unchanged.
  • OpenBao server config (listener, storage, TLS) — unchanged.
  • ESO / vault-seed / ExternalSecret semantics — unchanged.

Only the chart's probe handler is switched from exec to httpGet.

🤖 Generated with Claude Code

…otstrap

Root cause of the System Test "Job/openbao/vault-config status:
'InProgress' (HealthCheckFailed)" flake (PR #1636 run 26603473269):

The openbao-helm 0.28.3 chart's default readinessProbe is
`exec: bao status -tls-skip-verify` (server-statefulset.yaml:157-178),
which returns exit code 2 on a sealed server. On a fresh cluster, the
StatefulSet pod therefore stays NotReady until something unseals it.
That "something" is the vault-config Job, which lives in the
downstream 'infrastructure' Flux Kustomization and is gated on
'infrastructure-controllers' (which contains this HelmRelease) becoming
Ready first. Flux's HelmController uses --wait by default
(install.disableWait: false), so the HelmRelease cannot converge while
the pod is NotReady; install.remediation.retries: -1 then drives an
endless install -> wait timeout -> uninstall -> reinstall churn for the
full bootstrap window. Bootstrap only escapes via a fragile race
between Flux retries and the Job pod eventually catching a transient
window where the OpenBao server is listening — historically 20-40 min,
as the Job's backoffLimit=30 comment notes.

Setting readinessProbe.path makes the chart template render the
httpGet branch instead of the exec branch:

  {{- if .Values.server.readinessProbe.path }}
  httpGet:
    path: {{ .Values.server.readinessProbe.path | quote }}
    port: {{ .Values.server.readinessProbe.port }}
    scheme: {{ include "openbao.scheme" . | upper }}

With sealedcode=204 and uninitcode=204, the /sys/health endpoint
returns HTTP 204 even on a sealed-and-uninitialized server, so the
Pod reports Ready as soon as the listener is up. The HelmRelease then
converges Ready on first install, infrastructure-controllers becomes
Ready, the infrastructure layer runs, and the vault-config Job
completes in ~1-2 min instead of waiting 20-40 min for the deadlock to
self-resolve.

Scheme handling: 'openbao.scheme' returns 'http' when
global.tlsDisable: true (chart default; matches our 'tls_disable = 1'
listener), so the probe stays HTTP — no TLS plumbing required. The
chart's livenessProbe defaults to enabled: false, so no parallel
liveness fix is needed.

This is the same pattern HashiCorp's official Vault Helm chart uses
for the same reason
(see vault-helm/values.yaml: readinessProbe.path defaults to
'/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR updates the OpenBao HelmRelease values to avoid a bootstrap deadlock caused by the chart’s default exec-based readiness probe failing while the server is sealed/uninitialized. By switching readiness to an HTTP /v1/sys/health endpoint that returns a 2xx/204 during sealed/uninitialized states, Flux/Helm can mark the release ready and allow the downstream vault-config Job to run promptly.

Changes:

  • Override server.readinessProbe to use an HTTP health endpoint (/v1/sys/health) with sealedcode=204 and uninitcode=204.
  • Add in-file rationale documenting the Flux dependency deadlock this prevents.

@devantler devantler marked this pull request as ready for review May 29, 2026 13:08
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026