feat(observability): add otel pipeline and system tracing#116

Merged
EricAndrechek merged 19 commits into main from observability_2 on May 13, 2026

@taitelee taitelee commented May 12, 2026

Contributor

Summary

Establishes WaveHouse's observability infrastructure: an opt-in OpenTelemetry pipeline (traces, metrics, logs) with per-signal toggles, configurable sampling, and an optional Prometheus /metrics exposition. The slog logger fans out to stdout (always 100%) and OTLP, so logs never disappear regardless of collector state.

Config (all under otel:):

  • enabled — master switch (default false)
  • addr — OTLP gRPC collector endpoint (plaintext, see TLS limitation below)
  • traces.{enabled, sample_rate} — head-based sampling
  • metrics.enabled — OTLP push for metrics + Go runtime metrics
  • metrics.prometheus.{enabled, path, port} (/metrics scrape endpoint; port 0 = mount on API server, non-zero = sidecar listener)
  • logs.{enabled, sample_rate} — DEBUG/INFO sampled to OTLP; WARN/ERROR always 100% (non-configurable safety floor); stdout always 100%

Changes:

  • OTel pipeline: internal/observability/provider.go exposes InitProvider(ctx, serviceName, ProviderConfig) → (shutdown, promHandler, err). Each signal is independently gated. A private prometheus.Registry keeps DefaultRegisterer's auto-registered process/Go collectors out of /metrics.
  • Structured logging: internal/observability/logger.go — slog handler that fans out to stdout AND OTLP, with a level-aware OTLP sampler (WARN+ floor non-configurable). TraceHandler injects trace_id/span_id from the active span.
  • Bounded shutdown: defer otelShutdown in main.go is now bounded to 5s so an unreachable collector can't hang process exit (the OTel SDK doesn't fully honor the shutdown context against an unreachable gRPC endpoint).
  • SigNoz dev stack: deployments/signoz/ — self-contained compose setup with env-substituted credentials (.env.example template, .env gitignored). The stack is functional but not yet wired into make dev — see feat(observability): o11y dev-stack cleanup + wire into make #121 for the dev-loop integration follow-up.

TLS limitation (acknowledged in docs, tracked in #97 for the follow-up feature): exporters call WithInsecure(), so direct shipping to TLS-protected cloud OTLP endpoints (Grafana Cloud, Honeycomb, Datadog OTLP) is not yet supported — sidecar collector remains the workaround.

Test plan

  • Unit: internal/config validation cases for sample-rate range, port collision, path leading-slash, and "ignored when disabled" semantics. otlpSamplerFn tested directly for the WARN+ floor invariant.
  • Integration (tests/integration/otel_test.go) via in-process OTLP receiver (testutil.FakeOTLP): trace sampling at 1.0/0.5/0.0 with binomial tolerance; log WARN+ floor with sample_rate=0.0; per-signal disable (traces-only, logs-only); unreachable-endpoint emit-non-blocking; Prometheus /metrics scrape exposing custom counter with OTel→Prom name translation.
  • CI: full make ci passes locally (unit 73.5%, integration 19.4%, e2e 51.9%, total 80.7%).
  • [-] Manual SigNoz UI verification: deferred to feat(observability): o11y dev-stack cleanup + wire into make #121 (the dev-stack hardening + make dev-obs integration). Code paths are integration-tested against a fake OTLP receiver, which is the right layer for this PR's scope.
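
The WARN+ floor invariant exercised by the unit test above can be illustrated with a small sketch. The real otlpSamplerFn signature is not shown in this PR description, so `shouldExportOTLP` and its `roll` parameter are assumptions made for illustration; passing the random draw in as an argument just keeps the function deterministic.

```go
package main

import (
	"fmt"
	"log/slog"
)

// shouldExportOTLP (hypothetical name) decides whether a record is forwarded
// to the OTLP exporter. roll is a uniform random number in [0, 1).
func shouldExportOTLP(level slog.Level, sampleRate, roll float64) bool {
	if level >= slog.LevelWarn {
		return true // WARN and ERROR bypass sampling entirely
	}
	return roll < sampleRate // DEBUG and INFO are sampled
}

func main() {
	// With sample_rate=0.0, INFO is dropped while WARN still ships.
	fmt.Println("info exported:", shouldExportOTLP(slog.LevelInfo, 0.0, 0.5))
	fmt.Println("warn exported:", shouldExportOTLP(slog.LevelWarn, 0.0, 0.5))
}
```

Stdout is unaffected by this decision in the design described above; the sampler only gates the OTLP leg of the fan-out.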

Related Issues

Closes #96 — adds WH_OTEL_ENABLED master switch + per-signal toggles that short-circuit telemetry init. Default flipped from the issue's suggested true to false for the "no collector running" case (more conservative). Naming matches the issue's original WH_OTEL_ENABLED proposal.

Refs #97 — the docs portion (env var reference table + TLS-constraint acknowledgment + sidecar workaround) shipped here. Issue stays open and is re-scoped to the actual feature work: endpoint scheme sniffing (https://... → TLS) + WH_OTEL_HEADERS for direct shipping to Grafana Cloud / Honeycomb / Datadog OTLP without a sidecar.

Refs #121 — SigNoz dev-stack cleanup (version pinning, healthchecks, drop container_name:) + make dev-obs wiring for first-class "dev mode + traces visible in SigNoz UI" workflow. Filed as separate PR for another contributor to pick up.

Tracked separately: #120 — singleflight context-cancellation tradeoff in internal/api/query.go, surfaced during this PR's review but not specific to observability.

Summary by CodeRabbit

  • New Features
    • OpenTelemetry observability: per-signal enablement and sampling, OTLP log exporter with WARN+ export floor, WH_LOG_LEVEL-controlled stdout/OTLP log levels, and optional Prometheus /metrics endpoint (main port or dedicated).
  • Bug Fixes
    • Bounded OTLP shutdown (5s); unreachable collectors no longer block startup or exports.
  • Configuration
    • New otel and prometheus config blocks with validation and sensible defaults.
  • Tests
    • In-process OTLP test receiver and integration tests for sampling, per-signal gating, Prometheus scrape, and unreachable-endpoint behavior.
  • Documentation
    • Updated architecture, configuration, deployment docs; local SigNoz compose and .env example added.

@taitelee taitelee requested a review from EricAndrechek May 12, 2026 16:50
@github-actions github-actions Bot added the dependencies, go, area/observability, area/api, area/ingest, and area/infra labels May 12, 2026
@gemini-code-assist

Summary of Changes

This pull request introduces a comprehensive observability pipeline to WaveHouse, enabling deeper system inspection and simplified debugging. By integrating OpenTelemetry and enhancing the logging stack, the system can now export telemetry data to external collectors while maintaining robust error reporting. The changes also include necessary updates to deployment configurations to ensure secure and reliable observability data collection.

Highlights

  • Observability Infrastructure: Integrated OpenTelemetry for unified tracing, metrics, and logging, with support for OTLP-compatible backends.
  • Structured Logging: Upgraded the global logger to an OTel-aware slog handler that automatically attaches Trace IDs and stack traces for error-level logs.
  • Configuration & Deployment: Added observability configuration to the system and updated the SigNoz deployment setup to include authentication and proper volume management.

Comment thread cmd/wavehouse/main.go Outdated
Comment thread internal/observability/provider.go Outdated
Comment thread internal/observability/logger.go Outdated
Comment thread deployments/signoz/otel-collector-config.yaml Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request integrates OpenTelemetry logging, introduces configurable log levels, and updates the SigNoz deployment configuration. It also modifies context handling in the query API and adds automatic stack trace capturing for error logs. Feedback identifies critical issues including improper context propagation that prevents resource cleanup, hardcoded credentials in configuration files, and variable shadowing that ignores environment-based log levels. Performance concerns were raised regarding stack trace generation on all errors, along with a request for minor code cleanup. Iterate.

Comment thread internal/api/query.go Outdated
Comment thread deployments/signoz/otel-collector-config.yaml Outdated
Comment thread cmd/wavehouse/main.go Outdated
Comment thread internal/observability/logger.go Outdated
Comment thread internal/observability/provider.go Outdated
@EricAndrechek EricAndrechek moved this from Backlog to In progress in WaveHouse Task Board May 12, 2026
Comment thread cmd/wavehouse/main.go Outdated
Comment thread internal/observability/logger.go Outdated
Comment thread internal/config/config.go Outdated
Comment thread deployments/signoz/docker-compose.yaml Outdated
Comment thread internal/observability/provider.go Outdated
@claude

claude Bot commented May 12, 2026

2 [MUST], 2 [SHOULD], 1 [MAY]; see inline threads for detail.

| Severity | Finding | Location |
| --- | --- | --- |
| [MUST] | logLevel variable shadow silently ignores WH_LOG_LEVEL when OTel is enabled | cmd/wavehouse/main.go:105 |
| [MUST] | Missing doc-sync: no docs/configuration.md entry, no CHANGELOG.md entry | internal/config/config.go:31 |
| [SHOULD] | Hardcoded password in compose + collector config (MEDIUM security) | deployments/signoz/docker-compose.yaml:22–59 |
| [SHOULD] | debug.Stack() captures slog-internal stack, not caller; hot-path allocation on every error log | internal/observability/logger.go:31 |
| [MAY] | // --- END PASTE --- leftover comment | internal/observability/provider.go:113 |

Iterate — fix the logLevel shadow and the doc-sync before merge.

@claude

EricAndrechek added a commit that referenced this pull request May 12, 2026
- Remove logLevel shadow in main.go so WH_LOG_LEVEL applies to OTel logger
- Drop context.WithoutCancel in query.go (regressed multi-waiter singleflight,
  see #120 for the underlying tradeoff)
- Drop debug.Stack() injection in logger.go — captures wrong frames and
  allocates several KB on every error log
- Move SigNoz ClickHouse credentials to env-substituted vars + .env.example
  so default:password no longer ships in source-controlled YAML
- Remove `// --- END PASTE ---` artifact in provider.go
- Add Observability + Logging sections to configuration.md and deployment.md,
  plus CHANGELOG entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the documentation and area/docs labels and removed the area/api label May 12, 2026
Comment thread cmd/wavehouse/main.go
@claude

claude Bot commented May 13, 2026

Review summary — 0 [MUST], 4 [SHOULD], 3 [MAY]

All findings from prior reviewers (logLevel shadowing, stray paste artifact, `debug.Stack()` in TraceHandler, credentials in SigNoz compose/collector YAML, `context.WithoutCancel` in `query.go`, missing doc-sync for `configuration.md`/`deployment.md`/`CHANGELOG.md`) are addressed in the commit history. The core architecture is sound: private `prometheus.Registry`, WARN+ non-configurable safety floor, bounded-shutdown defer, `WithAttrs`/`WithGroup`-safe `TraceHandler`, lazy gRPC dialing.

See inline threads for detail.

Iterate — no blockers, but two [SHOULD]s are worth a quick pass before merge:

  1. router.go:82 — `otelhttp` middleware wraps `/metrics`, `/health`, and `/ready`; every Prometheus scrape and liveness probe generates a trace span. Add those paths to the existing `/v1/stream/` bypass filter.
  2. main.go:104 — `RegisterSystemMetrics` is called unconditionally outside the `if cfg.OTel.Enabled || cfg.Prometheus.Enabled` gate; it silently no-ops against the global no-op meter when observability is fully disabled.
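
As a rough illustration of the first item, the bypass can be written as a request predicate. A `func(*http.Request) bool` of this shape is what `otelhttp.WithFilter` accepts (returning false skips tracing), but the names `shouldTracePath` and `newTraceFilter` are hypothetical and the actual filter in router.go may be structured differently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// shouldTracePath (hypothetical) reports whether a request path deserves a
// trace span. promPath is the configured Prometheus scrape path, or "" when
// the scrape endpoint is not mounted on this router.
func shouldTracePath(path, promPath string) bool {
	switch {
	case strings.HasPrefix(path, "/v1/stream/"):
		return false // existing bypass for streaming routes
	case path == "/health", path == "/ready":
		return false // liveness and readiness probes
	case promPath != "" && path == promPath:
		return false // Prometheus scrapes
	}
	return true
}

// newTraceFilter adapts shouldTracePath to the predicate shape used by
// HTTP tracing middleware.
func newTraceFilter(promPath string) func(*http.Request) bool {
	return func(r *http.Request) bool {
		return shouldTracePath(r.URL.Path, promPath)
	}
}

func main() {
	filter := newTraceFilter("/metrics")
	for _, path := range []string{"/v1/query", "/v1/stream/events", "/health", "/metrics"} {
		req := httptest.NewRequest(http.MethodGet, path, nil)
		fmt.Printf("%-18s traced=%v\n", path, filter(req))
	}
}
```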

The other [SHOULD]s (shutdown error discarded in `handleErr`, Prometheus sidecar missing `ReadTimeout`/`WriteTimeout`) and [MAY]s (reserved-path validation, sampling test tolerance, `api.md` doc-sync) are lower urgency.


`api.md` doc-sync note [MAY]: `docs/src/content/docs/api.md` does not document the new `GET /metrics` endpoint. AGENTS.md §"Documentation & Consistency Sync" requires `api.md` to be updated for every new endpoint. `deployment.md` covers it well — a short entry in the API reference page is all that's needed (unauthenticated, Prometheus-format, active only when `prometheus.enabled: true`).

Seven valid items from the post-refactor review:

- main.go: gate RegisterSystemMetrics on otel.enabled || prometheus.enabled
  so it doesn't silently no-op against the no-op global MeterProvider.
- main.go: add ReadTimeout/WriteTimeout on the Prometheus sidecar
  http.Server. Single-shot scrape endpoint — no streaming reason to skip
  them, and they bound slow-client exposure.
- provider.go: handleErr now surfaces partial-init shutdown errors via
  slog.Warn instead of dropping them silently. The setup error is still
  returned to the caller; the shutdown error is diagnostic-only but
  matters when partial-init fails in production.
- router.go: extend the existing /v1/stream/* otelhttp bypass to also
  skip /health, /ready, and the configured Prometheus path. Avoids ~4
  spans/min of pure scrape-noise plus a self-loop when the same backend
  stores both traces and scraped metrics.
- config.Validate: reject prometheus.path of /health or /ready — Chi
  registers /metrics before the health routes, so a collision would
  silently shadow the probe (first-registered-wins).
- otel_test.go: tighten TestOTel_TraceSampling_HalfRate tolerance from
  ±500 (~22σ) to ±200 (~9σ). Still flake-proof but actually catches a
  sampler accidentally pinned to 25% / 75%.
- otel_test.go: rename `infoCount` → `lowSevCount` and `warnCount` →
  `warnAndAboveCount` to match what the expressions actually compute
  (DEBUG+INFO and WARN+ERROR respectively).

Declined: a sync.Once guard around runtime.Start. The upstream package
provides no shutdown, but adding sync.Once would make a 2nd
InitProvider call behave differently from the 1st, masking the leak
rather than fixing it. Integration tests intentionally accept the leak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
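
The sigma figures in the tolerance item can be checked with a few lines of arithmetic. The commit message does not state how many spans the half-rate test emits; the sketch below assumes n = 2000 because that value is consistent with both quoted figures (±500 ≈ 22σ, ±200 ≈ 9σ), so treat it as an inference rather than a fact about the test.

```go
package main

import (
	"fmt"
	"math"
)

// binomialSigma returns the standard deviation of the number of sampled spans
// when n spans are each kept independently with probability p.
func binomialSigma(n int, p float64) float64 {
	return math.Sqrt(float64(n) * p * (1 - p))
}

func main() {
	const n = 2000 // ASSUMPTION: span count inferred from the quoted sigmas
	sigma := binomialSigma(n, 0.5)
	fmt.Printf("sigma at 50%% sampling: %.2f spans\n", sigma)

	for _, tol := range []float64{500, 200} {
		fmt.Printf("tolerance ±%.0f is %.1f sigma\n", tol, tol/sigma)
	}

	// A sampler accidentally pinned to 25% shifts the mean by n*0.25 spans.
	// Only a tolerance well below that shift can detect the regression.
	shift := math.Abs(float64(n)*0.25 - float64(n)*0.5)
	fmt.Printf("mean shift for a 25%% sampler: %.0f spans\n", shift)
}
```

Under that assumption the mean shift for a 25% sampler equals the old ±500 band, which is why the wider tolerance could not reliably catch it.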

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
internal/observability/provider.go (1)

155-158: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard runtime.Start so repeated init doesn't register runtime instrumentation twice.

runtime.Start(...) is still invoked on every successful metrics/prometheus init. A second InitProvider in the same process can therefore register duplicate runtime instrumentation and leave extra background work behind. Please wrap this in a process-wide sync.Once (or equivalent) and make the one-time behavior explicit.

🧵 Proposed one-time guard
+var runtimeStartOnce sync.Once
+
 func InitProvider(ctx context.Context, serviceName string, cfg ProviderConfig) (func(context.Context) error, http.Handler, error) {
@@
-		if err := runtime.Start(runtime.WithMinimumReadMemStatsInterval(15 * time.Second)); err != nil {
+		var runtimeErr error
+		runtimeStartOnce.Do(func() {
+			runtimeErr = runtime.Start(runtime.WithMinimumReadMemStatsInterval(15 * time.Second))
+		})
+		if runtimeErr != nil {
-			handleErr(err)
+			handleErr(runtimeErr)
-			return shutdown, nil, err
+			return shutdown, nil, runtimeErr
 		}
 	}
Does go.opentelemetry.io/contrib/instrumentation/runtime expose a shutdown or stop API for runtime.Start, and is runtime.Start intended to be called only once per process?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/observability/provider.go` around lines 155 - 158, InitProvider
currently calls runtime.Start each time it initializes metrics which can
register duplicate instrumentation; guard the runtime.Start call with a
process-wide sync.Once so it runs only once. Add a package-level sync.Once
(e.g., runtimeStartOnce) and an error holder (e.g., runtimeStartErr), then
replace the direct runtime.Start call inside InitProvider with
runtimeStartOnce.Do(func(){ runtimeStartErr = runtime.Start(...)}); after the
Once.Do check runtimeStartErr and call handleErr/return as before; reference
runtime.Start, InitProvider, handleErr and ensure the one-time behavior is
explicit and process-wide.
cmd/wavehouse/main.go (1)

344-365: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast if the dedicated Prometheus listener cannot bind.

The sidecar server is started asynchronously, so run() can keep serving the API even when the metrics listener immediately dies with EADDRINUSE. If Prometheus is explicitly enabled on a dedicated port, bind the socket synchronously and treat bind failures as startup failures instead of degrading after boot.

📡 Proposed startup handling
+	ln, err := net.Listen("tcp", fmt.Sprintf(":%d", promPort))
+	if err != nil {
+		logger.Error("prometheus server listen", "error", err)
+		return 1
+	}
+
 	promSrv = &http.Server{
 		Addr:              fmt.Sprintf(":%d", promPort),
 		Handler:           mux,
@@
 	go func() {
 		logger.Info("starting prometheus metrics server", "addr", promSrv.Addr, "path", promPath)
-		if err := promSrv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
+		if err := promSrv.Serve(ln); err != nil && !errors.Is(err, http.ErrServerClosed) {
 			logger.Error("prometheus server error", "error", err)
 		}
 	}()

Also applies to: 398-404

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/wavehouse/main.go` around lines 344 - 365, When cfg.Prometheus.Port is
non-zero and promHandler != nil, ensure the Prometheus sidecar socket is bound
synchronously and any bind error aborts startup: replace the current late-fail
async start of promSrv by creating a net.Listener (e.g. via
net.Listen(fmt.Sprintf(":%d", cfg.Prometheus.Port))) before calling run() or
launching the goroutine, assign that listener to the http.Server (promSrv) and
return the bind error immediately so run() fails fast; update the code paths
that currently create promSrv and start it asynchronously (refer to promSrv,
promHandler, cfg.Prometheus.Port and the async startup around run()) to use the
listener-based, fail-fast semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/config/config.go`:
- Around line 197-218: The Prometheus path validation currently only forbids
exact collisions with reserved endpoints; update the check in the Prometheus
block (c.Prometheus / local p variable and p.Path) so that when prometheus is
mounted on the API server (p.Port == 0) it also rejects any path that is equal
to or is a descendant of the API namespace (startsWith "/v1"), not just
"/health" and "/ready". Modify the loop/conditional that examines p.Path to
return an error for p.Port == 0 && strings.HasPrefix(p.Path, "/v1") (and keep
the existing reserved checks) so config like "/v1" or "/v1/anything" is rejected
when port==0.

In `@internal/observability/provider.go`:
- Around line 69-79: Save the current global providers at the start of init
(capture the return values of the current tracer, meter and logger providers
before calling otel.SetTracerProvider, otel.SetMeterProvider, and
global.SetLoggerProvider), and in handleErr restore those saved providers back
to the globals before/after invoking shutdown; specifically, modify handleErr to
call otel.SetTracerProvider(prevTracer), otel.SetMeterProvider(prevMeter) and
global.SetLoggerProvider(prevLogger) (using the saved prev* variables) so that
on any initialization failure the original globals are reinstated prior to
leaving the function.

In `@tests/integration/otel_test.go`:
- Around line 66-141: Collapse the three TestOTel_TraceSampling_* functions into
a single table-driven test: create a slice tests := []struct{name string, rate
float64, iterations int, expectFunc func(int) bool, expectMsg string} with
entries for FullRate, HalfRate, ZeroRate; inside a loop call t.Run(tt.name,
func(t *testing.T){ ... }) and move common setup/teardown (guardOTelGlobals, r
:= testutil.NewFakeOTLP, shutdown,_ := initAndShutdown(...
ProviderConfig{Endpoint: r.Addr(), TracesEnabled: true, TracesSampleRate:
tt.rate}), tracer := otel.Tracer("test"), span loop, drain context and shutdown
assertion) into the subtest body, then replace individual assertions with a
single check using tt.expectFunc(r.SpanCount()) / assert helpers and
tt.expectMsg; keep references to initAndShutdown, observability.ProviderConfig,
guardOTelGlobals, testutil.NewFakeOTLP, otel.Tracer, shutdown and r.SpanCount to
locate and refactor the code.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: c21dedad-d0c2-43a7-8f4b-9b7feb43a5e7

📥 Commits

Reviewing files that changed from the base of the PR and between 329db0c and 86fcf65.

📒 Files selected for processing (5)
  • cmd/wavehouse/main.go
  • internal/api/router.go
  • internal/config/config.go
  • internal/observability/provider.go
  • tests/integration/otel_test.go

Comment thread internal/config/config.go
Comment thread internal/observability/provider.go
Comment thread tests/integration/otel_test.go Outdated
Comment thread deployments/signoz/docker-compose.yaml
@claude

claude Bot commented May 13, 2026

1 [SHOULD], 0 [MAY] — see the inline thread for detail.

| Severity | Finding | Location |
| --- | --- | --- |
| [SHOULD] | Three SigNoz images pinned to :latest (migrator, query-service, otel-collector) | deployments/signoz/docker-compose.yaml:33,49,64 |

Iterate — the [SHOULD] above is the only new finding; pin the SigNoz images to concrete version tags.


State of CodeRabbit's open CHANGES_REQUESTED (from 2026-05-13T14:04):

Two items from that review are still present in the code as of HEAD 86fcf65:

  1. AGENTS.md File Structure section missing internal/observability/ — The package description IS in AGENTS.md (line 24 and Key Design Decisions Dual-Mode Metric Export (OTLP Push vs. Prometheus Scrape) #15); only the quick-reference text block at §File Structure is stale. Low severity but it's the doc-sync rule.
  2. runtime.Start goroutine not tracked in shutdown — The goroutine IS cleaned up through meterProvider.Shutdown() in shutdownFuncs for normal operation. The unreachable-endpoint integration test explicitly comments on the edge case and works around it with a fire-and-forget goroutine. No production correctness risk; the test suite is intentionally not using goleak. Acceptable as-is or addressable via runtime.WithContext.

If these threads were resolved by Eric with a commit-to-follow-up or counter-argument, the two findings are below ship-blocking level. The core implementation — per-signal gating, WARN+ERROR floor invariant, bounded 5s shutdown, private Prometheus registry, lazy gRPC dial — is correct and well-tested.

@claude

@EricAndrechek EricAndrechek moved this from Ready to In progress in WaveHouse Task Board May 13, 2026
Three valid items from this round:

- provider.go: snapshot OTel globals (propagator + tracer/meter/log
  providers) on entry; restore them in handleErr before shutdown so a
  partial init failure leaves the process with the no-op defaults
  rather than dead, already-shut-down providers. main.go logs and
  continues on init error, so the prior behavior would have left the
  caller running against unusable globals.
- config.Validate: reject prometheus.path == "/v1" or "/v1/*" when
  prometheus.port == 0. Same-port mode mounts the unauthenticated
  metrics handler on the main router; a /v1-namespaced path would
  silently shadow the authenticated API subtree at an
  authenticated-looking URL. Sidecar-port mode (port != 0) is unaffected.
- otel_test.go: collapse the three TestOTel_TraceSampling_* functions
  into a single table-driven TestOTel_TraceSampling with full / half /
  zero subtests. New cases now share the arrange/act/drain scaffolding.

New tests: TestValidate_PrometheusPathReservedConflicts covers
/health, /ready, /v1, and /v1/* in same-port mode;
TestValidate_PrometheusV1PathAllowedOnSidecarPort verifies the
same-port-only restriction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
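
A minimal sketch of the path rules described in this commit, for readers who want to see them in one place. `validatePrometheusPath` is a hypothetical standalone function, not the code in config.Validate. One assumption to flag: the sketch rejects /health and /ready regardless of port, whereas the commit only states that the /v1 restriction is limited to same-port mode.

```go
package main

import (
	"fmt"
	"strings"
)

// validatePrometheusPath (hypothetical) checks a scrape path against routes
// it must not shadow. port == 0 means the handler is mounted on the API
// server; a non-zero port means a separate sidecar listener.
func validatePrometheusPath(path string, port int) error {
	if !strings.HasPrefix(path, "/") {
		return fmt.Errorf("prometheus.path %q must start with /", path)
	}
	// ASSUMPTION: probe paths are rejected in both modes in this sketch.
	if path == "/health" || path == "/ready" {
		return fmt.Errorf("prometheus.path %q collides with a health probe", path)
	}
	// Same-port mode only: keep the unauthenticated handler out of the
	// authenticated /v1 subtree. "/v1x" is not part of that subtree.
	if port == 0 && (path == "/v1" || strings.HasPrefix(path, "/v1/")) {
		return fmt.Errorf("prometheus.path %q shadows the /v1 API when port is 0", path)
	}
	return nil
}

func main() {
	cases := []struct {
		path string
		port int
	}{
		{"/metrics", 0},
		{"/v1/metrics", 0},
		{"/v1/metrics", 9090},
		{"/health", 0},
	}
	for _, c := range cases {
		fmt.Printf("path=%-12s port=%-5d err=%v\n", c.path, c.port, validatePrometheusPath(c.path, c.port))
	}
}
```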
coderabbitai Bot previously approved these changes May 13, 2026
@github-project-automation github-project-automation Bot moved this from In progress to Ready in WaveHouse Task Board May 13, 2026
Comment thread internal/observability/provider.go Outdated
Comment thread AGENTS.md
@claude

claude Bot commented May 13, 2026

Review — round 6

1 [MUST], 1 [SHOULD], 0 [MAY] — see inline threads for detail.

  • [MUST] AGENTS.md File Structure section still omits internal/observability/ — added to Architecture and Key Design Decisions but the table was not updated; mandatory per the doc-sync rules.
  • [SHOULD] provider.go:169: runtime.Start is called without a sync.Once guard; the goroutines it spawns cannot be shut down and accumulate on repeated InitProvider calls (three in the current integration suite). The production path is fine (called once), but this was already flagged by Gemini and is still unaddressed.

Everything else is solid: credentials are env-substituted with a gitignored .env, the hardcoded-10% trace sample rate is gone and replaced with the configurable field, context propagation is correct, handleErr properly restores OTel globals before partial shutdown, TraceHandler.WithAttrs/WithGroup are implemented with a regression test, and the integration-test suite is thorough and well-structured.

Iterate — fix the AGENTS.md File Structure entry; the sync.Once is a quality improvement worth doing in the same pass.

@taitelee

…tart leak

Two items from this review round:

- AGENTS.md: File Structure section now lists internal/observability/
  between internal/mq/ and internal/pipes/ to keep alphabetical order;
  package count corrected from "Eleven" to "Twelve". The earlier
  Architecture + Key Design Decisions edits had updated the prose
  sections but not the structure block — AGENTS.md's own Documentation
  & Consistency Sync rule explicitly requires this.
- provider.go: wrap runtime.Start in a package-level sync.Once so test
  suites that re-init the provider don't accumulate goroutines that the
  upstream package has no way to stop. The trade-off (subsequent re-inits
  miss runtime metrics on their new MeterProvider) is acceptable: no
  test asserts on runtime metric presence and production calls
  InitProvider exactly once. Errors from runtime.Start now log via
  slog.Warn instead of routing through handleErr — instrumentation
  failure shouldn't tear down a fully-initialized pipeline and roll back
  the globals.

The sync.Once item is a reversal from round-4 where I declined a similar
suggestion. The earlier decline was about "subsequent inits behave
differently" — Claude's reframing here (instrumentation errors via Warn,
not handleErr) showed the path that avoids the over-aggressive rollback
on a non-fatal error. The goroutine accumulation in tests was concrete
while the cost (re-init missing runtime metrics) is only observable in
a hypothetical hot-reload path that doesn't exist today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
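
The guard described in this commit follows a common pattern, sketched here with the standard library only. `startRuntimeMetrics` is a hypothetical stand-in for the upstream runtime.Start call, `initProvider` is a stripped-down placeholder for InitProvider, and the simulated failure exists only to show that the error is logged rather than returned.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
	"sync"
)

// runtimeStartOnce ensures runtime instrumentation starts at most once per
// process, however many times the provider is initialized.
var runtimeStartOnce sync.Once

// startCalls counts invocations of the stand-in so the demo can show the
// guard working.
var startCalls int

// startRuntimeMetrics is a hypothetical stand-in for the upstream call that
// spawns background collection with no way to stop it.
func startRuntimeMetrics() error {
	startCalls++
	return errors.New("simulated instrumentation failure")
}

// initProvider is a placeholder for provider setup. A failure to start
// runtime metrics is logged as a warning and does not fail initialization.
func initProvider() error {
	// Exporters and providers would be built here.
	runtimeStartOnce.Do(func() {
		if err := startRuntimeMetrics(); err != nil {
			slog.Warn("runtime metrics unavailable", "error", err)
		}
	})
	return nil
}

func main() {
	for i := 0; i < 3; i++ {
		if err := initProvider(); err != nil {
			fmt.Println("init failed:", err)
		}
	}
	fmt.Println("start calls:", startCalls)
}
```

As the commit notes, the cost of this pattern is that a second initialization silently skips runtime instrumentation for its new meter provider.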
Comment thread deployments/signoz/docker-compose.yaml
@claude

claude Bot commented May 13, 2026

0 [MUST], 1 [SHOULD], 0 [MAY] — see the inline thread for detail.

The HEAD commit (6de31ee) closes the two open [MUST] threads from CodeRabbit's last CHANGES_REQUESTED: the internal/observability/ entry is now in the AGENTS.md File Structure table, and runtimeStartOnce guards runtime.Start against repeated InitProvider calls. CodeRabbit approved on this commit.

The only new finding is the carryover [SHOULD] from prior rounds:

| Severity | File | Finding |
| --- | --- | --- |
| [SHOULD] | deployments/signoz/docker-compose.yaml:33,49,64 | migrator, signoz, otel-collector images on :latest while clickhouse in the same file is pinned |

Ship it — no correctness or security blockers. The core implementation is solid: per-signal gating, WARN+ERROR floor invariant, bounded 5 s shutdown, private Prometheus registry, lazy gRPC dial, and a thorough integration suite. Address the [SHOULD] before the dev-loop integration tracked in #121, or inline it with that PR.

@claude

@EricAndrechek EricAndrechek merged commit d1007c1 into main May 13, 2026
11 checks passed
@EricAndrechek EricAndrechek deleted the observability_2 branch May 13, 2026 15:14
@github-project-automation github-project-automation Bot moved this from Ready to Done in WaveHouse Task Board May 13, 2026
jfwoods added a commit that referenced this pull request May 14, 2026
Closes the remaining items from #121: integrate the SigNoz stack into the
Makefile so `make dev-obs` is a one-shot "dev mode + SigNoz + WaveHouse
pointed at OTLP collector" experience, and finish the stack hardening that
PR #116's review surfaced.

Makefile — new `##@ Observability` section mirroring the `deps-*` family:

  - signoz-up         (idempotent, --wait on UI + collector health)
  - signoz-down       (preserves UI history + admin account)
  - signoz-logs       (tails UI + collector)
  - signoz-wipe       (destroys volumes — admin account reset)
  - signoz-dashboards (upsert via load-dashboards.sh; guards on creds)
  - dev-obs           (deps-up + signoz-up + air with WH_OTEL_ENABLED=true
                       WH_OTEL_ADDR=127.0.0.1:4317; auto-loads dashboards
                       iff SIGNOZ_TOKEN or SIGNOZ_EMAIL+PASSWORD in env,
                       otherwise prints a pointer at signoz-dashboards)

  clean-all also tears down the SigNoz compose project so `make clean-all`
  is a true full reset.

Compose hardening (deployments/signoz/compose.yaml — renamed from
docker-compose.yaml per modern Compose v2 convention):

  - Drop `container_name:` from all six services so the project-prefix
    naming actually works (and multiple instances no longer collide).
  - Drop the explicit `signoz-net` network block — project-default works
    and `standalone.signoz.yaml` reaches the collector via host.docker.internal.
  - Add an `otel-collector` healthcheck against its `health_check`
    extension on :13133. The signoz/signoz-otel-collector image has no
    curl/wget but does have bash, so probe via /dev/tcp + GET / and grep
    for `200 OK`. This is what `--wait signoz otel-collector` blocks on.
  - Gate `signoz` and `otel-collector` on
    `signoz-telemetrystore-migrator: service_completed_successfully`.
    Without this, `--wait signoz otel-collector` skipped the migrator
    entirely (nothing pulled it into the closure) and the collector's
    `migrate sync check` retried forever against a nonexistent
    `signoz_traces` schema. Warm restarts hid the bug because the schema
    persisted in the volume — only a wipe + up surfaced it.
  - Header comment now points operators at `make signoz-up`, calls out
    the dev-only posture, and points at `deploy.resources.limits` as the
    prod knob.

Docs:
  - docs/development.md — new "Running with observability" section with
    the target table + first-run admin-account guidance.
  - docs/deployment.md — refresh the SigNoz bring-up + dashboards flow
    around the new Make wrappers; update compose-path references.
  - README.md — quick-start bullet pointing at `make dev-obs`.

WHissues.md, standalone.signoz.yaml: update `docker-compose.yaml` →
`compose.yaml` path references.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

area/api HTTP handlers, routing, middleware area/docs Documentation, site/, README area/infra CI, build, deploy, Docker, release area/ingest Ingest pipeline (Bento, batching, DLQ) area/observability Metrics, logs, traces, health, profiling dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation go Pull requests that update go code
