Add storage backend probe to /health (closes #73) #119

Open

larsborn wants to merge 11 commits into git-pkgs:main from larsborn:feat/storage-health-probe

Conversation


@larsborn larsborn commented May 12, 2026

Summary

  • /health now performs an active write → size-check → read → verify → delete round-trip against the configured storage backend, in addition to the existing database check (sketched after this list). Closes #73 (Add storage backend probe to health check).
  • Result is cached for a configurable interval (health.storage_probe_interval, env PROXY_HEALTH_STORAGE_PROBE_INTERVAL, default 30s; "0" disables caching). The probe runs under a detached context.WithTimeout(context.Background(), 10s) so a client disconnect can't poison the cache.
  • Response shape changes from plain text ("ok" / "database error") to JSON:
    {"status":"ok","checks":{"database":{"status":"ok"},"storage":{"status":"ok"}}}
    Status codes are unchanged (200 healthy / 503 unhealthy). Failures include an error field and (for storage) a step label.
  • New metric: proxy_health_probe_failures_total{step="write|size|read|verify|delete"}, following the existing proxy_integrity_failures_total pattern.
  • Probe path layout: .healthcheck/<unix-nano>-<crypto/rand hex> — unique per call, collision-safe under concurrent replicas. Object is deleted after verify; delete failures surface as probe failures.
  • Transition-only logging (ok↔error), so Kubernetes-rate probing doesn't spam logs.
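
Roughly how the probe fits together: a minimal sketch, assuming a backend interface with Put/Size/Get/Delete methods and the standard Prometheus client; the actual interface and counter wiring in this PR may differ.

```go
// Sketch only: the Storage interface, its method signatures, and the counter
// wiring are assumptions for illustration, not necessarily the PR's code.
package health

import (
	"bytes"
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"io"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var probeFailures = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "proxy_health_probe_failures_total",
	Help: "Storage health probe failures, labelled by failing step.",
}, []string{"step"})

// Storage is a stand-in for the proxy's backend interface.
type Storage interface {
	Put(ctx context.Context, path string, r io.Reader) error
	Size(ctx context.Context, path string) (int64, error)
	Get(ctx context.Context, path string) (io.ReadCloser, error)
	Delete(ctx context.Context, path string) error
}

// probeStorage performs the write -> size -> read -> verify -> delete
// round-trip against a unique object under .healthcheck/.
func probeStorage(ctx context.Context, s Storage) error {
	fail := func(step string, err error) error {
		probeFailures.WithLabelValues(step).Inc()
		return fmt.Errorf("%s: %w", step, err)
	}

	// Unique, collision-safe path: .healthcheck/<unix-nano>-<random hex>.
	suffix := make([]byte, 8)
	if _, err := rand.Read(suffix); err != nil {
		return fail("write", err)
	}
	path := fmt.Sprintf(".healthcheck/%d-%s", time.Now().UnixNano(), hex.EncodeToString(suffix))
	payload := []byte("storage health probe")

	if err := s.Put(ctx, path, bytes.NewReader(payload)); err != nil {
		return fail("write", err)
	}
	if n, err := s.Size(ctx, path); err != nil || n != int64(len(payload)) {
		return fail("size", fmt.Errorf("size=%d err=%v", n, err))
	}
	rc, err := s.Get(ctx, path)
	if err != nil {
		return fail("read", err)
	}
	got, readErr := io.ReadAll(rc)
	rc.Close() // closed before Delete; see the behavioral notes below
	if readErr != nil {
		return fail("read", readErr)
	}
	if !bytes.Equal(got, payload) {
		return fail("verify", fmt.Errorf("payload mismatch"))
	}
	if err := s.Delete(ctx, path); err != nil {
		return fail("delete", err)
	}
	return nil
}
```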
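The caching layer around it, continuing the sketch above (field names are illustrative, sync/log imports and the handler plumbing are omitted; the real handler additionally runs the existing database check):

```go
// storageChecker caches the probe result for probeInterval (0 = no caching),
// runs the probe on a detached context so a client disconnect cannot poison
// the cache, and logs only on ok<->error transitions.
type storageChecker struct {
	mu            sync.Mutex
	storage       Storage
	probeInterval time.Duration // health.storage_probe_interval, default 30s
	lastRun       time.Time
	lastErr       error
	healthy       bool // first-run edge cases glossed over in this sketch
}

func (c *storageChecker) Check() error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.probeInterval > 0 && !c.lastRun.IsZero() && time.Since(c.lastRun) < c.probeInterval {
		return c.lastErr // cached result
	}

	// Detached from the request context: only the 10s budget applies.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	err := probeStorage(ctx, c.storage)
	healthy := err == nil
	if healthy != c.healthy {
		log.Printf("storage health transition: healthy=%v err=%v", healthy, err)
	}
	c.healthy, c.lastErr, c.lastRun = healthy, err, time.Now()
	return err
}
```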
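The JSON body above corresponds to response types along these lines (struct and field names are guesses, not the PR's exact definitions):

```go
// Illustrative response types behind the JSON body shown above.
type checkResult struct {
	Status string `json:"status"`          // "ok" or "error"
	Error  string `json:"error,omitempty"` // set on failure
	Step   string `json:"step,omitempty"`  // storage only: write|size|read|verify|delete
}

type healthResponse struct {
	Status string                 `json:"status"` // "ok" or "error"; 200 vs 503
	Checks map[string]checkResult `json:"checks"` // keys: "database", "storage"
}
```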

Behavioral notes / breaking changes

  • Response shape: any monitor that greps the body for "ok" will break; status-code-based monitors keep working. Documented in the README's new ### Health Check subsection and in the regenerated Swagger spec.
  • Probe-object cleanup: if Delete fails, the probe object is left under .healthcheck/. With a 30s TTL and a continuously-failing delete that's ~3 KB/hour per replica. The proxy_health_probe_failures_total{step="delete"} counter surfaces this. A future Storage.List extension would enable a startup sweep — explicitly out of scope here.
  • Deliberate spec deviation: health.go calls rc.Close() explicitly (not deferred) between ReadAll and Delete so the file handle is released before deletion. On Windows the deferred-close ordering caused Delete to fail with "file in use" — caught when wiring up TestHealthEndpoint against the real filesystem backend. Commented in the source.
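
For reference, the sequence in question, isolated from the probe sketch above (simplified):

```go
rc, err := s.Get(ctx, path)
if err != nil {
	return fail("read", err)
}
got, readErr := io.ReadAll(rc)
// Explicit Close instead of `defer rc.Close()`: a deferred Close runs after
// Delete, and on Windows the filesystem backend then rejects the Delete
// with "file in use".
rc.Close()
if readErr != nil {
	return fail("read", readErr)
}
if err := s.Delete(ctx, path); err != nil {
	return fail("delete", err)
}
```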

Untested

I have not validated this against a remote backend (S3/Azure).
