Add storage backend probe to /health (closes #73) #119
Open
larsborn wants to merge 11 commits into
…TestServer Also fix Windows file-locking issue in storageProbe: close the reader explicitly before Delete so the file handle is released prior to os.Remove.
Summary
- `/health` now performs an active write → size-check → read → verify → delete round-trip against the configured storage backend, in addition to the existing database check. Closes #73 (Add storage backend probe to health check).
- Probe results are cached with a configurable TTL (`health.storage_probe_interval`, env `PROXY_HEALTH_STORAGE_PROBE_INTERVAL`, default `30s`; `"0"` disables caching). The probe runs under a detached `context.WithTimeout(context.Background(), 10*time.Second)` so a client disconnect can't poison the cache.
- The response body changes from a bare string (`"ok"` / `"database error"`) to JSON: `{"status":"ok","checks":{"database":{"status":"ok"},"storage":{"status":"ok"}}}`
- A failing check carries an `error` field and (for storage) a `step` label.
- New counter `proxy_health_probe_failures_total{step="write|size|read|verify|delete"}`, following the existing `proxy_integrity_failures_total` pattern.
- Probe objects are written to `.healthcheck/<unix-nano>-<crypto/rand hex>`: unique per call, collision-safe under concurrent replicas. The object is deleted after verify; delete failures surface as probe failures.

Behavioral notes / breaking changes
- Monitors that match the response body against `"ok"` will break. Status-code-based monitors keep working. Documented in the README's new `### Health Check` subsection and in the regenerated Swagger.
- If `Delete` fails, the probe object is left under `.healthcheck/`. With a 30s TTL and a continuously failing delete that's ~3 KB/hour per replica. The `proxy_health_probe_failures_total{step="delete"}` counter surfaces this. A future `Storage.List` extension would enable a startup sweep; that is explicitly out of scope here.
- `health.go` calls `rc.Close()` explicitly (not deferred) between `ReadAll` and `Delete` so the file handle is released before deletion. On Windows the deferred-close ordering caused `Delete` to fail with "file in use"; this was caught when wiring up `TestHealthEndpoint` against the real filesystem backend. Commented in the source.

Untested
I have not validated this against a remote backend (S3/Azure).
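
For anyone who wants to try the round-trip against another backend, here is a minimal, self-contained sketch of the write → size-check → read → verify → delete sequence. It is not the code in `health.go`: the `Storage` interface, `ProbeError`, `memStorage`, and every method name below are hypothetical stand-ins, and the real interface in this repository may look different. The sketch also chooses to attempt the delete even when an earlier step failed, which the PR does not promise.

```go
package main

import (
	"bytes"
	"context"
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"time"
)

// Storage is a hypothetical backend interface used only for this sketch.
type Storage interface {
	Put(ctx context.Context, key string, data []byte) error
	Size(ctx context.Context, key string) (int64, error)
	Get(ctx context.Context, key string) (io.ReadCloser, error)
	Delete(ctx context.Context, key string) error
}

// ProbeError records which step of the round-trip failed.
type ProbeError struct {
	Step string
	Err  error
}

func (e *ProbeError) Error() string { return e.Step + ": " + e.Err.Error() }

// probe runs one round-trip under a context detached from any HTTP request.
func probe(s Storage) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	suffix := make([]byte, 8)
	if _, err := rand.Read(suffix); err != nil {
		return &ProbeError{Step: "write", Err: err}
	}
	key := fmt.Sprintf(".healthcheck/%d-%s", time.Now().UnixNano(), hex.EncodeToString(suffix))
	payload := []byte("healthcheck")

	if err := s.Put(ctx, key, payload); err != nil {
		return &ProbeError{Step: "write", Err: err}
	}
	stepErr := readBack(ctx, s, key, payload)
	// Sketch-only choice: always attempt cleanup; report the first failure.
	if err := s.Delete(ctx, key); err != nil && stepErr == nil {
		stepErr = &ProbeError{Step: "delete", Err: err}
	}
	return stepErr
}

// readBack covers the size, read, and verify steps.
func readBack(ctx context.Context, s Storage, key string, want []byte) error {
	size, err := s.Size(ctx, key)
	if err != nil {
		return &ProbeError{Step: "size", Err: err}
	}
	if size != int64(len(want)) {
		return &ProbeError{Step: "size", Err: fmt.Errorf("got %d bytes, want %d", size, len(want))}
	}
	rc, err := s.Get(ctx, key)
	if err != nil {
		return &ProbeError{Step: "read", Err: err}
	}
	got, readErr := io.ReadAll(rc)
	// Close before returning so the handle is released ahead of Delete.
	closeErr := rc.Close()
	if readErr != nil {
		return &ProbeError{Step: "read", Err: readErr}
	}
	if closeErr != nil {
		return &ProbeError{Step: "read", Err: closeErr}
	}
	if !bytes.Equal(got, want) {
		return &ProbeError{Step: "verify", Err: errors.New("payload mismatch")}
	}
	return nil
}

// failedStep returns the failing step label, or "" when the probe passed.
func failedStep(err error) string {
	var pe *ProbeError
	if errors.As(err, &pe) {
		return pe.Step
	}
	return ""
}

// memStorage is an in-memory backend for exercising the probe.
type memStorage struct {
	objects    map[string][]byte
	failDelete bool
}

func newMemStorage() *memStorage { return &memStorage{objects: map[string][]byte{}} }
func (m *memStorage) Put(_ context.Context, k string, d []byte) error {
	m.objects[k] = append([]byte(nil), d...)
	return nil
}
func (m *memStorage) Size(_ context.Context, k string) (int64, error) {
	return int64(len(m.objects[k])), nil
}
func (m *memStorage) Get(_ context.Context, k string) (io.ReadCloser, error) {
	return io.NopCloser(bytes.NewReader(m.objects[k])), nil
}
func (m *memStorage) Delete(_ context.Context, k string) error {
	if m.failDelete {
		return errors.New("delete refused")
	}
	delete(m.objects, k)
	return nil
}

func main() {
	s := newMemStorage()
	fmt.Printf("healthy backend: failed step = %q\n", failedStep(probe(s)))
	s.failDelete = true
	fmt.Printf("delete refused: failed step = %q\n", failedStep(probe(s)))
}
```

Swapping `memStorage` for an adapter around an S3 or Azure client would be one way to exercise the same sequence against a remote backend. With `failDelete` set, the probe object stays behind in the map, which mirrors the leftover-object behavior described in the notes above.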