Widen Sentry capture beyond panic recovery#181
Merged
Conversation
The initial integration only captured panics via a top-level defer, but Go services rarely panic and the codebase handles most errors with log.Printf, so nothing was reaching Sentry in practice. Adds three capture layers: - Echo middleware on both control plane and worker HTTP servers — captures panics (before the existing Recover middleware handles them) and any 5xx responses, attaching request context. - Unary + stream gRPC interceptors on the worker — capture server-side error codes (Internal, Unknown, Unavailable, etc.) while skipping client-side codes (InvalidArgument, NotFound, Canceled, etc.) that represent expected flow. - observability.Go wrapper for background goroutines with Sentry-aware panic recovery; applied to the maintenance loop (control plane) and UploadBaseImageIfNew / MigrateStaleCheckpoints / RollingUpgrade loops (worker). Also adds observability.CaptureError for targeted error capture, used in the control plane maintenance loop to report DB recovery failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #179 / #180. The initial Sentry wiring only captured panics via a top-level
defer, but Go services rarely panic and the codebase handles errors withlog.Printf— so nothing was reaching Sentry in practice (hence the empty dashboard).This PR adds three capture layers so real errors actually surface:
middleware.Recover()converts them to 500s) and any handler error that resolves to HTTP 5xx. Client errors (4xx) are skipped — they're expected flow.Internal,Unknown,Unavailable,DataLoss, etc.) and skips client-side codes (InvalidArgument,NotFound,Canceled,DeadlineExceeded,PermissionDenied, etc.) that represent expected flow rather than bugs.observability.Gowrapper for background goroutines. A panic in a long-lived background loop would otherwise die silently; now it's captured with agoroutinetag. Applied to the maintenance loop in the control plane andUploadBaseImageIfNew/MigrateStaleCheckpoints/RollingUpgradeHibernatedin the worker.Also adds
observability.CaptureError(err, tags...)as a generic helper, and uses it in the control plane maintenance loop to report DB recovery failures that were previously only logged.All capture calls are safe no-ops when Sentry is not initialized.
Test plan
go build ./cmd/server ./cmd/worker ./internal/observability ./internal/api ./internal/workerpassesgo vetclean on touched packagessentry: enabled (...)still appears at startup for both servicesservice=control-planeandhttp.routetagInternal— event appears withservice=workerandgrpc.methodtaggoroutinetagWhat's still not captured
log.Printf/log.Fatalfat arbitrary sites across internal packages (would require either replacing the standard logger or adding targetedCaptureErrorcalls site-by-site — out of scope for this PR; can be added incrementally as we find gaps)runExecStreamQEMU,forwardStdin) — errors there bubble up through the gRPC interceptor already🤖 Generated with Claude Code