feat(observability): add otel pipeline and system tracing#116
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a comprehensive observability pipeline to WaveHouse, enabling deeper system inspection and simplified debugging. By integrating OpenTelemetry and enhancing the logging stack, the system can now export telemetry data to external collectors while maintaining robust error reporting. The changes also include necessary updates to deployment configurations to ensure secure and reliable observability data collection. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request integrates OpenTelemetry logging, introduces configurable log levels, and updates the SigNoz deployment configuration. It also modifies context handling in the query API and adds automatic stack trace capturing for error logs. Feedback identifies critical issues including improper context propagation that prevents resource cleanup, hardcoded credentials in configuration files, and variable shadowing that ignores environment-based log levels. Performance concerns were raised regarding stack trace generation on all errors, along with a request for minor code cleanup. Iterate.
8909b4d to
6bc0d79
Compare
|
2 [MUST], 1 [SHOULD], 1 [MAY] — see inline threads for detail.
Iterate — fix the |
- Remove logLevel shadow in main.go so WH_LOG_LEVEL applies to OTel logger - Drop context.WithoutCancel in query.go (regressed multi-waiter singleflight, see #120 for the underlying tradeoff) - Drop debug.Stack() injection in logger.go — captures wrong frames and allocates several KB on every error log - Move SigNoz ClickHouse credentials to env-substituted vars + .env.example so default:password no longer ships in source-controlled YAML - Remove `// --- END PASTE ---` artifact in provider.go - Add Observability + Logging sections to configuration.md and deployment.md, plus CHANGELOG entry Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review summary — 0 [MUST], 4 [SHOULD], 3 [MAY]All findings from prior reviewers (logLevel shadowing, stray paste artifact, `debug.Stack()` in TraceHandler, credentials in SigNoz compose/collector YAML, `context.WithoutCancel` in `query.go`, missing doc-sync for `configuration.md`/`deployment.md`/`CHANGELOG.md`) are addressed in the commit history. The core architecture is sound: private `prometheus.Registry`, WARN+ non-configurable safety floor, bounded-shutdown defer, `WithAttrs`/`WithGroup`-safe `TraceHandler`, lazy gRPC dialing. See inline threads for detail. Iterate — no blockers, but two [SHOULD]s are worth a quick pass before merge:
The other [SHOULD]s (shutdown error discarded in `handleErr`, Prometheus sidecar missing `ReadTimeout`/`WriteTimeout`) and [MAY]s (reserved-path validation, sampling test tolerance, `api.md` doc-sync) are lower urgency. `api.md` doc-sync note [MAY]: `docs/src/content/docs/api.md` does not document the new `GET /metrics` endpoint. AGENTS.md §"Documentation & Consistency Sync" requires `api.md` to be updated for every new endpoint. `deployment.md` covers it well — a short entry in the API reference page is all that's needed (unauthenticated, Prometheus-format, active only when `prometheus.enabled: true`). |
Seven valid items from the post-refactor review: - main.go: gate RegisterSystemMetrics on otel.enabled || prometheus.enabled so it doesn't silently no-op against the no-op global MeterProvider. - main.go: add ReadTimeout/WriteTimeout on the Prometheus sidecar http.Server. Single-shot scrape endpoint — no streaming reason to skip them, and they bound slow-client exposure. - provider.go: handleErr now surfaces partial-init shutdown errors via slog.Warn instead of dropping them silently. The setup error is still returned to the caller; the shutdown error is diagnostic-only but matters when partial-init fails in production. - router.go: extend the existing /v1/stream/* otelhttp bypass to also skip /health, /ready, and the configured Prometheus path. Avoids ~4 spans/min of pure scrape-noise plus a self-loop when the same backend stores both traces and scraped metrics. - config.Validate: reject prometheus.path of /health or /ready — Chi registers /metrics before the health routes, so a collision would silently shadow the probe (first-registered-wins). - otel_test.go: tighten TestOTel_TraceSampling_HalfRate tolerance from ±500 (~22σ) to ±200 (~9σ). Still flake-proof but actually catches a sampler accidentally pinned to 25% / 75%. - otel_test.go: rename `infoCount` → `lowSevCount` and `warnCount` → `warnAndAboveCount` to match what the expressions actually compute (DEBUG+INFO and WARN+ERROR respectively). Declined: a sync.Once guard around runtime.Start. The upstream package provides no shutdown, but adding sync.Once would make a 2nd InitProvider call behave differently from the 1st, masking the leak rather than fixing it. Integration tests intentionally accept the leak. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (2)
internal/observability/provider.go (1)
155-158:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard
runtime.Startso repeated init doesn't register runtime instrumentation twice.
runtime.Start(...)is still invoked on every successful metrics/prometheus init. A secondInitProviderin the same process can therefore register duplicate runtime instrumentation and leave extra background work behind. Please wrap this in a process-widesync.Once(or equivalent) and make the one-time behavior explicit.🧵 Proposed one-time guard
+var runtimeStartOnce sync.Once + func InitProvider(ctx context.Context, serviceName string, cfg ProviderConfig) (func(context.Context) error, http.Handler, error) { @@ - if err := runtime.Start(runtime.WithMinimumReadMemStatsInterval(15 * time.Second)); err != nil { + var runtimeErr error + runtimeStartOnce.Do(func() { + runtimeErr = runtime.Start(runtime.WithMinimumReadMemStatsInterval(15 * time.Second)) + }) + if runtimeErr != nil { handleErr(err) - return shutdown, nil, err + return shutdown, nil, runtimeErr } }Does go.opentelemetry.io/contrib/instrumentation/runtime expose a shutdown or stop API for runtime.Start, and is runtime.Start intended to be called only once per process?🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/observability/provider.go` around lines 155 - 158, InitProvider currently calls runtime.Start each time it initializes metrics which can register duplicate instrumentation; guard the runtime.Start call with a process-wide sync.Once so it runs only once. Add a package-level sync.Once (e.g., runtimeStartOnce) and an error holder (e.g., runtimeStartErr), then replace the direct runtime.Start call inside InitProvider with runtimeStartOnce.Do(func(){ runtimeStartErr = runtime.Start(...)}); after the Once.Do check runtimeStartErr and call handleErr/return as before; reference runtime.Start, InitProvider, handleErr and ensure the one-time behavior is explicit and process-wide.cmd/wavehouse/main.go (1)
344-365:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast if the dedicated Prometheus listener cannot bind.
The sidecar server is started asynchronously, so
run()can keep serving the API even when the metrics listener immediately dies withEADDRINUSE. If Prometheus is explicitly enabled on a dedicated port, bind the socket synchronously and treat bind failures as startup failures instead of degrading after boot.📡 Proposed startup handling
+ ln, err := net.Listen("tcp", fmt.Sprintf(":%d", promPort)) + if err != nil { + logger.Error("prometheus server listen", "error", err) + return 1 + } + promSrv = &http.Server{ Addr: fmt.Sprintf(":%d", promPort), Handler: mux, @@ go func() { logger.Info("starting prometheus metrics server", "addr", promSrv.Addr, "path", promPath) - if err := promSrv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) { + if err := promSrv.Serve(ln); err != nil && !errors.Is(err, http.ErrServerClosed) { logger.Error("prometheus server error", "error", err) } }()Also applies to: 398-404
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/wavehouse/main.go` around lines 344 - 365, When cfg.Prometheus.Port is non-zero and promHandler != nil, ensure the Prometheus sidecar socket is bound synchronously and any bind error aborts startup: replace the current late-fail async start of promSrv by creating a net.Listener (e.g. via net.Listen(fmt.Sprintf(":%d", cfg.Prometheus.Port))) before calling run() or launching the goroutine, assign that listener to the http.Server (promSrv) and return the bind error immediately so run() fails fast; update the code paths that currently create promSrv and start it asynchronously (refer to promSrv, promHandler, cfg.Prometheus.Port and the async startup around run()) to use the listener-based, fail-fast semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/config/config.go`:
- Around line 197-218: The Prometheus path validation currently only forbids
exact collisions with reserved endpoints; update the check in the Prometheus
block (c.Prometheus / local p variable and p.Path) so that when prometheus is
mounted on the API server (p.Port == 0) it also rejects any path that is equal
to or is a descendant of the API namespace (startsWith "/v1"), not just
"/health" and "/ready". Modify the loop/conditional that examines p.Path to
return an error for p.Port == 0 && strings.HasPrefix(p.Path, "/v1") (and keep
the existing reserved checks) so config like "/v1" or "/v1/anything" is rejected
when port==0.
In `@internal/observability/provider.go`:
- Around line 69-79: Save the current global providers at the start of init
(capture the return values of the current tracer, meter and logger providers
before calling otel.SetTracerProvider, otel.SetMeterProvider, and
global.SetLoggerProvider), and in handleErr restore those saved providers back
to the globals before/after invoking shutdown; specifically, modify handleErr to
call otel.SetTracerProvider(prevTracer), otel.SetMeterProvider(prevMeter) and
global.SetLoggerProvider(prevLogger) (using the saved prev* variables) so that
on any initialization failure the original globals are reinstated prior to
leaving the function.
In `@tests/integration/otel_test.go`:
- Around line 66-141: Collapse the three TestOTel_TraceSampling_* functions into
a single table-driven test: create a slice tests := []struct{name string, rate
float64, iterations int, expectFunc func(int) bool, expectMsg string} with
entries for FullRate, HalfRate, ZeroRate; inside a loop call t.Run(tt.name,
func(t *testing.T){ ... }) and move common setup/teardown (guardOTelGlobals, r
:= testutil.NewFakeOTLP, shutdown,_ := initAndShutdown(...
ProviderConfig{Endpoint: r.Addr(), TracesEnabled: true, TracesSampleRate:
tt.rate}), tracer := otel.Tracer("test"), span loop, drain context and shutdown
assertion) into the subtest body, then replace individual assertions with a
single check using tt.expectFunc(r.SpanCount()) / assert helpers and
tt.expectMsg; keep references to initAndShutdown, observability.ProviderConfig,
guardOTelGlobals, testutil.NewFakeOTLP, otel.Tracer, shutdown and r.SpanCount to
locate and refactor the code.
---
Duplicate comments:
In `@cmd/wavehouse/main.go`:
- Around line 344-365: When cfg.Prometheus.Port is non-zero and promHandler !=
nil, ensure the Prometheus sidecar socket is bound synchronously and any bind
error aborts startup: replace the current late-fail async start of promSrv by
creating a net.Listener (e.g. via net.Listen(fmt.Sprintf(":%d",
cfg.Prometheus.Port))) before calling run() or launching the goroutine, assign
that listener to the http.Server (promSrv) and return the bind error immediately
so run() fails fast; update the code paths that currently create promSrv and
start it asynchronously (refer to promSrv, promHandler, cfg.Prometheus.Port and
the async startup around run()) to use the listener-based, fail-fast semantics.
In `@internal/observability/provider.go`:
- Around line 155-158: InitProvider currently calls runtime.Start each time it
initializes metrics which can register duplicate instrumentation; guard the
runtime.Start call with a process-wide sync.Once so it runs only once. Add a
package-level sync.Once (e.g., runtimeStartOnce) and an error holder (e.g.,
runtimeStartErr), then replace the direct runtime.Start call inside InitProvider
with runtimeStartOnce.Do(func(){ runtimeStartErr = runtime.Start(...)}); after
the Once.Do check runtimeStartErr and call handleErr/return as before; reference
runtime.Start, InitProvider, handleErr and ensure the one-time behavior is
explicit and process-wide.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: c21dedad-d0c2-43a7-8f4b-9b7feb43a5e7
📒 Files selected for processing (5)
cmd/wavehouse/main.gointernal/api/router.gointernal/config/config.gointernal/observability/provider.gotests/integration/otel_test.go
|
1 [SHOULD], 0 [MAY] — see the inline thread for detail.
Iterate — the [SHOULD] above is the only new finding; pin the SigNoz images to concrete version tags. State of CodeRabbit's open CHANGES_REQUESTED (from Two items from that review are still present in the code as of HEAD
If these threads were resolved by Eric with a commit-to-follow-up or counter-argument, the two findings are below ship-blocking level. The core implementation — per-signal gating, WARN+ERROR floor invariant, bounded 5s shutdown, private Prometheus registry, lazy gRPC dial — is correct and well-tested. |
Three valid items from this round: - provider.go: snapshot OTel globals (propagator + tracer/meter/log providers) on entry; restore them in handleErr before shutdown so a partial init failure leaves the process with the no-op defaults rather than dead, already-shut-down providers. main.go logs and continues on init error, so the prior behavior would have left the caller running against unusable globals. - config.Validate: reject prometheus.path == "/v1" or "/v1/*" when prometheus.port == 0. Same-port mode mounts the unauthenticated metrics handler on the main router; a /v1-namespaced path would silently shadow the authenticated API subtree at an authenticated-looking URL. Sidecar-port mode (port != 0) is unaffected. - otel_test.go: collapse the three TestOTel_TraceSampling_* functions into a single table-driven TestOTel_TraceSampling with full / half / zero subtests. New cases now share the arrange/act/drain scaffolding. New tests: TestValidate_PrometheusPathReservedConflicts covers /health, /ready, /v1, and /v1/* in same-port mode; TestValidate_PrometheusV1PathAllowedOnSidecarPort verifies the same-port-only restriction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review — round 61 [MUST], 1 [SHOULD], 0 [MAY] — see inline threads for detail.
Everything else is solid: credentials are env-substituted with a gitignored Iterate — fix the AGENTS.md File Structure entry; the |
…tart leak Two items from this review round: - AGENTS.md: File Structure section now lists internal/observability/ between internal/mq/ and internal/pipes/ to keep alphabetical order; package count corrected from "Eleven" to "Twelve". The earlier Architecture + Key Design Decisions edits had updated the prose sections but not the structure block — AGENTS.md's own Documentation & Consistency Sync rule explicitly requires this. - provider.go: wrap runtime.Start in a package-level sync.Once so test suites that re-init the provider don't accumulate goroutines that the upstream package has no way to stop. The trade-off (subsequent re-inits miss runtime metrics on their new MeterProvider) is acceptable: no test asserts on runtime metric presence and production calls InitProvider exactly once. Errors from runtime.Start now log via slog.Warn instead of routing through handleErr — instrumentation failure shouldn't tear down a fully-initialized pipeline and roll back the globals. The sync.Once item is a reversal from round-4 where I declined a similar suggestion. The earlier decline was about "subsequent inits behave differently" — Claude's reframing here (instrumentation errors via Warn, not handleErr) showed the path that avoids the over-aggressive rollback on a non-fatal error. The goroutine accumulation in tests was concrete while the cost (re-init missing runtime metrics) is only observable in a hypothetical hot-reload path that doesn't exist today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
0 [MUST], 1 [SHOULD], 0 [MAY] — see the inline thread for detail. The HEAD commit ( The only new finding is the carryover [SHOULD] from prior rounds:
Ship it — no correctness or security blockers. The core implementation is solid: per-signal gating, WARN+ERROR floor invariant, bounded 5 s shutdown, private Prometheus registry, lazy gRPC dial, and a thorough integration suite. Address the [SHOULD] before the dev-loop integration tracked in #121, or inline it with that PR. |
Closes the remaining items from #121: integrate the SigNoz stack into the Makefile so `make dev-obs` is a one-shot "dev mode + SigNoz + WaveHouse pointed at OTLP collector" experience, and finish the stack hardening that PR #116's review surfaced. Makefile — new `##@ Observability` section mirroring the `deps-*` family: - signoz-up (idempotent, --wait on UI + collector health) - signoz-down (preserves UI history + admin account) - signoz-logs (tails UI + collector) - signoz-wipe (destroys volumes — admin account reset) - signoz-dashboards (upsert via load-dashboards.sh; guards on creds) - dev-obs (deps-up + signoz-up + air with WH_OTEL_ENABLED=true WH_OTEL_ADDR=127.0.0.1:4317; auto-loads dashboards iff SIGNOZ_TOKEN or SIGNOZ_EMAIL+PASSWORD in env, otherwise prints a pointer at signoz-dashboards) clean-all also tears down the SigNoz compose project so `make clean-all` is a true full reset. Compose hardening (deployments/signoz/compose.yaml — renamed from docker-compose.yaml per modern Compose v2 convention): - Drop `container_name:` from all six services so the project-prefix naming actually works (and multiple instances no longer collide). - Drop the explicit `signoz-net` network block — project-default works and `standalone.signoz.yaml` reaches the collector via host.docker.internal. - Add an `otel-collector` healthcheck against its `health_check` extension on :13133. The signoz/signoz-otel-collector image has no curl/wget but does have bash, so probe via /dev/tcp + GET / and grep for `200 OK`. This is what `--wait signoz otel-collector` blocks on. - Gate `signoz` and `otel-collector` on `signoz-telemetrystore-migrator: service_completed_successfully`. Without this, `--wait signoz otel-collector` skipped the migrator entirely (nothing pulled it into the closure) and the collector's `migrate sync check` retried forever against a nonexistent `signoz_traces` schema. Warm restarts hid the bug because the schema persisted in the volume — only a wipe + up surfaced it. - Header comment now points operators at `make signoz-up`, calls out the dev-only posture, and points at `deploy.resources.limits` as the prod knob. Docs: - docs/development.md — new "Running with observability" section with the target table + first-run admin-account guidance. - docs/deployment.md — refresh the SigNoz bring-up + dashboards flow around the new Make wrappers; update compose-path references. - README.md — quick-start bullet pointing at `make dev-obs`. WHissues.md, standalone.signoz.yaml: update `docker-compose.yaml` → `compose.yaml` path references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Establishes WaveHouse's observability infrastructure: an opt-in OpenTelemetry pipeline (traces, metrics, logs) with per-signal toggles, configurable sampling, and an optional Prometheus
/metricsexposition. The slog logger fans out to stdout (always 100%) and OTLP, so logs never disappear regardless of collector state.Config (all under
otel:):enabled— master switch (defaultfalse)addr— OTLP gRPC collector endpoint (plaintext, see TLS limitation below)traces.{enabled, sample_rate}— head-based samplingmetrics.enabled— OTLP push for metrics + Go runtime metricsmetrics.prometheus.{enabled, path, port}—/metricsscrape endpoint (port0= mount on API server, non-zero = sidecar listener)logs.{enabled, sample_rate}— DEBUG/INFO sampled to OTLP; WARN/ERROR always 100% (non-configurable safety floor); stdout always 100%Changes:
internal/observability/provider.go—InitProvider(ctx, serviceName, ProviderConfig) → (shutdown, promHandler, err). Each signal independently gated. Privateprometheus.Registryto keepDefaultRegisterer's auto-registered process/Go collectors out of/metrics.internal/observability/logger.go— slog handler that fans out to stdout AND OTLP, with a level-aware OTLP sampler (WARN+ floor non-configurable).TraceHandlerinjectstrace_id/span_idfrom the active span.defer otelShutdowninmain.gois now bounded to 5s so an unreachable collector can't hang process exit (the OTel SDK doesn't fully honor the shutdown context against an unreachable gRPC endpoint).deployments/signoz/— self-contained compose setup with env-substituted credentials (.env.exampletemplate,.envgitignored). The stack is functional but not yet wired intomake dev— see feat(observability): o11y dev-stack cleanup + wire into make #121 for the dev-loop integration follow-up.TLS limitation (acknowledged in docs, tracked in #97 for the follow-up feature): exporters call
WithInsecure(), so direct shipping to TLS-protected cloud OTLP endpoints (Grafana Cloud, Honeycomb, Datadog OTLP) is not yet supported — sidecar collector remains the workaround.Test plan
internal/configvalidation cases for sample-rate range, port collision, path leading-slash, and "ignored when disabled" semantics.otlpSamplerFntested directly for the WARN+ floor invariant.tests/integration/otel_test.go) via in-process OTLP receiver (testutil.FakeOTLP): trace sampling at 1.0/0.5/0.0 with binomial tolerance; log WARN+ floor withsample_rate=0.0; per-signal disable (traces-only, logs-only); unreachable-endpoint emit-non-blocking; Prometheus/metricsscrape exposing custom counter with OTel→Prom name translation.make cipasses locally (unit 73.5%, integration 19.4%, e2e 51.9%, total 80.7%).make dev-obsintegration). Code paths are integration-tested against a fake OTLP receiver, which is the right layer for this PR's scope.Related Issues
Closes #96 — adds
WH_OTEL_ENABLEDmaster switch + per-signal toggles that short-circuit telemetry init. Default flipped from the issue's suggestedtruetofalsefor the "no collector running" case (more conservative). Naming matches the issue's originalWH_OTEL_ENABLEDproposal.Refs #97 — the docs portion (env var reference table + TLS-constraint acknowledgment + sidecar workaround) shipped here. Issue stays open and is re-scoped to the actual feature work: endpoint scheme sniffing (
https://...→ TLS) +WH_OTEL_HEADERSfor direct shipping to Grafana Cloud / Honeycomb / Datadog OTLP without a sidecar.Refs #121 — SigNoz dev-stack cleanup (version pinning, healthchecks, drop
container_name:) +make dev-obswiring for first-class "dev mode + traces visible in SigNoz UI" workflow. Filed as separate PR for another contributor to pick up.Tracked separately: #120 — singleflight context-cancellation tradeoff in
internal/api/query.go, surfaced during this PR's review but not specific to observability.Summary by CodeRabbit