Add storage backend probe to /health (closes #73) #119

Open

larsborn wants to merge 11 commits into git-pkgs:main from larsborn:feat/storage-health-probe

Conversation


@larsborn larsborn commented May 12, 2026

Summary

  • /health now performs an active write → size-check → read → verify → delete round-trip against the configured storage backend, in addition to the existing database check (sketched after this list). Closes #73 (Add storage backend probe to health check).
  • Result is cached for a configurable interval (health.storage_probe_interval, env PROXY_HEALTH_STORAGE_PROBE_INTERVAL, default 30s; "0" disables caching). The probe runs under a detached context.WithTimeout(context.Background(), 10s) so a client disconnect can't poison the cache.
  • Response shape changes from plain text ("ok" / "database error") to JSON:
    {"status":"ok","checks":{"database":{"status":"ok"},"storage":{"status":"ok"}}}
    Status codes are unchanged (200 healthy / 503 unhealthy). Failures include an error field and (for storage) a step label.
  • New metric: proxy_health_probe_failures_total{step="write|size|read|verify|delete"}, following the existing proxy_integrity_failures_total pattern.
  • Probe path layout: .healthcheck/<unix-nano>-<crypto/rand hex> — unique per call, collision-safe under concurrent replicas. Object is deleted after verify; delete failures surface as probe failures.
  • Transition-only logging (ok↔error), so Kubernetes-rate probing doesn't spam logs.
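
Roughly how the probe fits together: a minimal sketch, assuming a backend interface with Put/Size/Get/Delete methods and the standard Prometheus client; the actual interface and counter wiring in this PR may differ.

```go
// Sketch only: the Storage interface, its method signatures, and the counter
// wiring are assumptions for illustration, not necessarily the PR's code.
package health

import (
	"bytes"
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"io"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var probeFailures = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "proxy_health_probe_failures_total",
	Help: "Storage health probe failures, labelled by failing step.",
}, []string{"step"})

// Storage is a stand-in for the proxy's backend interface.
type Storage interface {
	Put(ctx context.Context, path string, r io.Reader) error
	Size(ctx context.Context, path string) (int64, error)
	Get(ctx context.Context, path string) (io.ReadCloser, error)
	Delete(ctx context.Context, path string) error
}

// probeStorage performs the write -> size -> read -> verify -> delete
// round-trip against a unique object under .healthcheck/.
func probeStorage(ctx context.Context, s Storage) error {
	fail := func(step string, err error) error {
		probeFailures.WithLabelValues(step).Inc()
		return fmt.Errorf("%s: %w", step, err)
	}

	// Unique, collision-safe path: .healthcheck/<unix-nano>-<random hex>.
	suffix := make([]byte, 8)
	if _, err := rand.Read(suffix); err != nil {
		return fail("write", err)
	}
	path := fmt.Sprintf(".healthcheck/%d-%s", time.Now().UnixNano(), hex.EncodeToString(suffix))
	payload := []byte("storage health probe")

	if err := s.Put(ctx, path, bytes.NewReader(payload)); err != nil {
		return fail("write", err)
	}
	if n, err := s.Size(ctx, path); err != nil || n != int64(len(payload)) {
		return fail("size", fmt.Errorf("size=%d err=%v", n, err))
	}
	rc, err := s.Get(ctx, path)
	if err != nil {
		return fail("read", err)
	}
	got, readErr := io.ReadAll(rc)
	rc.Close() // closed before Delete; see the behavioral notes below
	if readErr != nil {
		return fail("read", readErr)
	}
	if !bytes.Equal(got, payload) {
		return fail("verify", fmt.Errorf("payload mismatch"))
	}
	if err := s.Delete(ctx, path); err != nil {
		return fail("delete", err)
	}
	return nil
}
```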
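The caching layer around it, continuing the sketch above (field names are illustrative, sync/log imports and the handler plumbing are omitted; the real handler additionally runs the existing database check):

```go
// storageChecker caches the probe result for probeInterval (0 = no caching),
// runs the probe on a detached context so a client disconnect cannot poison
// the cache, and logs only on ok<->error transitions.
type storageChecker struct {
	mu            sync.Mutex
	storage       Storage
	probeInterval time.Duration // health.storage_probe_interval, default 30s
	lastRun       time.Time
	lastErr       error
	healthy       bool // first-run edge cases glossed over in this sketch
}

func (c *storageChecker) Check() error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.probeInterval > 0 && !c.lastRun.IsZero() && time.Since(c.lastRun) < c.probeInterval {
		return c.lastErr // cached result
	}

	// Detached from the request context: only the 10s budget applies.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	err := probeStorage(ctx, c.storage)
	healthy := err == nil
	if healthy != c.healthy {
		log.Printf("storage health transition: healthy=%v err=%v", healthy, err)
	}
	c.healthy, c.lastErr, c.lastRun = healthy, err, time.Now()
	return err
}
```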
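The JSON body above corresponds to response types along these lines (struct and field names are guesses, not the PR's exact definitions):

```go
// Illustrative response types behind the JSON body shown above.
type checkResult struct {
	Status string `json:"status"`          // "ok" or "error"
	Error  string `json:"error,omitempty"` // set on failure
	Step   string `json:"step,omitempty"`  // storage only: write|size|read|verify|delete
}

type healthResponse struct {
	Status string                 `json:"status"` // "ok" or "error"; 200 vs 503
	Checks map[string]checkResult `json:"checks"` // keys: "database", "storage"
}
```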

Behavioral notes / breaking changes

  • Response shape: any monitor that greps the body for "ok" will break; status-code-based monitors keep working. Documented in the README's new ### Health Check subsection and in the regenerated Swagger spec.
  • Probe-object cleanup: if Delete fails, the probe object is left under .healthcheck/. With a 30s TTL and a continuously-failing delete that's ~3 KB/hour per replica. The proxy_health_probe_failures_total{step="delete"} counter surfaces this. A future Storage.List extension would enable a startup sweep — explicitly out of scope here.
  • Deliberate spec deviation: health.go calls rc.Close() explicitly (not deferred) between ReadAll and Delete so the file handle is released before deletion. On Windows the deferred-close ordering caused Delete to fail with "file in use" — caught when wiring up TestHealthEndpoint against the real filesystem backend. Commented in the source.
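
For reference, the sequence in question, isolated from the probe sketch above (simplified):

```go
rc, err := s.Get(ctx, path)
if err != nil {
	return fail("read", err)
}
got, readErr := io.ReadAll(rc)
// Explicit Close instead of `defer rc.Close()`: a deferred Close runs after
// Delete, and on Windows the filesystem backend then rejects the Delete
// with "file in use".
rc.Close()
if readErr != nil {
	return fail("read", readErr)
}
if err := s.Delete(ctx, path); err != nil {
	return fail("delete", err)
}
```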

Untested

I have not validated this against a remote backend (S3/Azure).
