feat(observability): add otel pipeline and system tracing by taitelee · Pull Request #116 · Wave-RF/WaveHouse

taitelee · 2026-05-12T16:50:49Z

Summary

Establishes WaveHouse's observability infrastructure: an opt-in OpenTelemetry pipeline (traces, metrics, logs) with per-signal toggles, configurable sampling, and an optional Prometheus /metrics exposition. The slog logger fans out to stdout (always 100%) and OTLP, so logs never disappear regardless of collector state.

Config (all under otel:):

enabled — master switch (default false)
addr — OTLP gRPC collector endpoint (plaintext, see TLS limitation below)
traces.{enabled, sample_rate} — head-based sampling
metrics.enabled — OTLP push for metrics + Go runtime metrics
metrics.prometheus.{enabled, path, port} — /metrics scrape endpoint (port 0 = mount on API server, non-zero = sidecar listener)
logs.{enabled, sample_rate} — DEBUG/INFO sampled to OTLP; WARN/ERROR always 100% (non-configurable safety floor); stdout always 100%

Changes:

OTel pipeline: internal/observability/provider.go — InitProvider(ctx, serviceName, ProviderConfig) → (shutdown, promHandler, err). Each signal independently gated. Private prometheus.Registry to keep DefaultRegisterer's auto-registered process/Go collectors out of /metrics.
Structured logging: internal/observability/logger.go — slog handler that fans out to stdout AND OTLP, with a level-aware OTLP sampler (WARN+ floor non-configurable). TraceHandler injects trace_id/span_id from the active span.
Bounded shutdown: defer otelShutdown in main.go is now bounded to 5s so an unreachable collector can't hang process exit (the OTel SDK doesn't fully honor the shutdown context against an unreachable gRPC endpoint).
SigNoz dev stack: deployments/signoz/ — self-contained compose setup with env-substituted credentials (.env.example template, .env gitignored). The stack is functional but not yet wired into make dev — see feat(observability): o11y dev-stack cleanup + wire into make #121 for the dev-loop integration follow-up.

TLS limitation (acknowledged in docs, tracked in #97 for the follow-up feature): exporters call WithInsecure(), so direct shipping to TLS-protected cloud OTLP endpoints (Grafana Cloud, Honeycomb, Datadog OTLP) is not yet supported — sidecar collector remains the workaround.

Test plan

Unit: internal/config validation cases for sample-rate range, port collision, path leading-slash, and "ignored when disabled" semantics. otlpSamplerFn tested directly for the WARN+ floor invariant.
Integration (tests/integration/otel_test.go) via in-process OTLP receiver (testutil.FakeOTLP): trace sampling at 1.0/0.5/0.0 with binomial tolerance; log WARN+ floor with sample_rate=0.0; per-signal disable (traces-only, logs-only); unreachable-endpoint emit-non-blocking; Prometheus /metrics scrape exposing custom counter with OTel→Prom name translation.
CI: full make ci passes locally (unit 73.5%, integration 19.4%, e2e 51.9%, total 80.7%).
[-] Manual SigNoz UI verification: deferred to feat(observability): o11y dev-stack cleanup + wire into make #121 (the dev-stack hardening + make dev-obs integration). Code paths are integration-tested against a fake OTLP receiver, which is the right layer for this PR's scope.

Related Issues

Closes #96 — adds WH_OTEL_ENABLED master switch + per-signal toggles that short-circuit telemetry init. Default flipped from the issue's suggested true to false for the "no collector running" case (more conservative). Naming matches the issue's original WH_OTEL_ENABLED proposal.

Refs #97 — the docs portion (env var reference table + TLS-constraint acknowledgment + sidecar workaround) shipped here. Issue stays open and is re-scoped to the actual feature work: endpoint scheme sniffing (https://... → TLS) + WH_OTEL_HEADERS for direct shipping to Grafana Cloud / Honeycomb / Datadog OTLP without a sidecar.

Refs #121 — SigNoz dev-stack cleanup (version pinning, healthchecks, drop container_name:) + make dev-obs wiring for first-class "dev mode + traces visible in SigNoz UI" workflow. Filed as separate PR for another contributor to pick up.

Tracked separately: #120 — singleflight context-cancellation tradeoff in internal/api/query.go, surfaced during this PR's review but not specific to observability.

Summary by CodeRabbit

New Features
- OpenTelemetry observability: per-signal enablement and sampling, OTLP log exporter with WARN+ export floor, WH_LOG_LEVEL-controlled stdout/OTLP log levels, and optional Prometheus /metrics endpoint (main port or dedicated).
Bug Fixes
- Bounded OTLP shutdown (5s); unreachable collectors no longer block startup or exports.
Configuration
- New otel and prometheus config blocks with validation and sensible defaults.
Tests
- In-process OTLP test receiver and integration tests for sampling, per-signal gating, Prometheus scrape, and unreachable-endpoint behavior.
Documentation
- Updated architecture, configuration, deployment docs; local SigNoz compose and .env example added.

gemini-code-assist · 2026-05-12T16:51:49Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive observability pipeline to WaveHouse, enabling deeper system inspection and simplified debugging. By integrating OpenTelemetry and enhancing the logging stack, the system can now export telemetry data to external collectors while maintaining robust error reporting. The changes also include necessary updates to deployment configurations to ensure secure and reliable observability data collection.

Highlights

Observability Infrastructure: Integrated OpenTelemetry for unified tracing, metrics, and logging, with support for OTLP-compatible backends.
Structured Logging: Upgraded the global logger to an OTel-aware slog handler that automatically attaches Trace IDs and stack traces for error-level logs.
Configuration & Deployment: Added observability configuration to the system and updated the SigNoz deployment setup to include authentication and proper volume management.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request integrates OpenTelemetry logging, introduces configurable log levels, and updates the SigNoz deployment configuration. It also modifies context handling in the query API and adds automatic stack trace capturing for error logs. Feedback identifies critical issues including improper context propagation that prevents resource cleanup, hardcoded credentials in configuration files, and variable shadowing that ignores environment-based log levels. Performance concerns were raised regarding stack trace generation on all errors, along with a request for minor code cleanup. Iterate.

claude · 2026-05-12T18:08:42Z

2 [MUST], 1 [SHOULD], 1 [MAY] — see inline threads for detail.

Severity	Finding	Location
[MUST]	`logLevel` variable shadow silently ignores `WH_LOG_LEVEL` when OTel is enabled	`cmd/wavehouse/main.go:105`
[MUST]	Missing doc-sync: no `docs/configuration.md` entry, no `CHANGELOG.md` entry	`internal/config/config.go:31`
[SHOULD]	Hardcoded `password` in compose + collector config (MEDIUM security)	`deployments/signoz/docker-compose.yaml:22–59`
[SHOULD]	`debug.Stack()` captures slog-internal stack, not caller; hot-path allocation on every error log	`internal/observability/logger.go:31`
[MAY]	`// --- END PASTE ---` leftover comment	`internal/observability/provider.go:113`

Iterate — fix the logLevel shadow and the doc-sync before merge.

@claude

- Remove logLevel shadow in main.go so WH_LOG_LEVEL applies to OTel logger - Drop context.WithoutCancel in query.go (regressed multi-waiter singleflight, see #120 for the underlying tradeoff) - Drop debug.Stack() injection in logger.go — captures wrong frames and allocates several KB on every error log - Move SigNoz ClickHouse credentials to env-substituted vars + .env.example so default:password no longer ships in source-controlled YAML - Remove `// --- END PASTE ---` artifact in provider.go - Add Observability + Logging sections to configuration.md and deployment.md, plus CHANGELOG entry Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude · 2026-05-13T14:06:40Z

Review summary — 0 [MUST], 4 [SHOULD], 3 [MAY]

All findings from prior reviewers (logLevel shadowing, stray paste artifact, `debug.Stack()` in TraceHandler, credentials in SigNoz compose/collector YAML, `context.WithoutCancel` in `query.go`, missing doc-sync for `configuration.md`/`deployment.md`/`CHANGELOG.md`) are addressed in the commit history. The core architecture is sound: private `prometheus.Registry`, WARN+ non-configurable safety floor, bounded-shutdown defer, `WithAttrs`/`WithGroup`-safe `TraceHandler`, lazy gRPC dialing.

See inline threads for detail.

Iterate — no blockers, but two [SHOULD]s are worth a quick pass before merge:

router.go:82 — `otelhttp` middleware wraps `/metrics`, `/health`, and `/ready`; every Prometheus scrape and liveness probe generates a trace span. Add those paths to the existing `/v1/stream/` bypass filter.
main.go:104 — `RegisterSystemMetrics` is called unconditionally outside the `if cfg.OTel.Enabled || cfg.Prometheus.Enabled` gate; it silently no-ops against the global no-op meter when observability is fully disabled.

The other [SHOULD]s (shutdown error discarded in `handleErr`, Prometheus sidecar missing `ReadTimeout`/`WriteTimeout`) and [MAY]s (reserved-path validation, sampling test tolerance, `api.md` doc-sync) are lower urgency.

`api.md` doc-sync note [MAY]: `docs/src/content/docs/api.md` does not document the new `GET /metrics` endpoint. AGENTS.md §"Documentation & Consistency Sync" requires `api.md` to be updated for every new endpoint. `deployment.md` covers it well — a short entry in the API reference page is all that's needed (unauthenticated, Prometheus-format, active only when `prometheus.enabled: true`).

Seven valid items from the post-refactor review: - main.go: gate RegisterSystemMetrics on otel.enabled || prometheus.enabled so it doesn't silently no-op against the no-op global MeterProvider. - main.go: add ReadTimeout/WriteTimeout on the Prometheus sidecar http.Server. Single-shot scrape endpoint — no streaming reason to skip them, and they bound slow-client exposure. - provider.go: handleErr now surfaces partial-init shutdown errors via slog.Warn instead of dropping them silently. The setup error is still returned to the caller; the shutdown error is diagnostic-only but matters when partial-init fails in production. - router.go: extend the existing /v1/stream/* otelhttp bypass to also skip /health, /ready, and the configured Prometheus path. Avoids ~4 spans/min of pure scrape-noise plus a self-loop when the same backend stores both traces and scraped metrics. - config.Validate: reject prometheus.path of /health or /ready — Chi registers /metrics before the health routes, so a collision would silently shadow the probe (first-registered-wins). - otel_test.go: tighten TestOTel_TraceSampling_HalfRate tolerance from ±500 (~22σ) to ±200 (~9σ). Still flake-proof but actually catches a sampler accidentally pinned to 25% / 75%. - otel_test.go: rename `infoCount` → `lowSevCount` and `warnCount` → `warnAndAboveCount` to match what the expressions actually compute (DEBUG+INFO and WARN+ERROR respectively). Declined: a sync.Once guard around runtime.Start. The upstream package provides no shutdown, but adding sync.Once would make a 2nd InitProvider call behave differently from the 1st, masking the leak rather than fixing it. Integration tests intentionally accept the leak. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (2)

internal/observability/provider.go (1)

155-158: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard runtime.Start so repeated init doesn't register runtime instrumentation twice.

runtime.Start(...) is still invoked on every successful metrics/prometheus init. A second InitProvider in the same process can therefore register duplicate runtime instrumentation and leave extra background work behind. Please wrap this in a process-wide sync.Once (or equivalent) and make the one-time behavior explicit.

🧵 Proposed one-time guard

+var runtimeStartOnce sync.Once
+
 func InitProvider(ctx context.Context, serviceName string, cfg ProviderConfig) (func(context.Context) error, http.Handler, error) {
@@
-		if err := runtime.Start(runtime.WithMinimumReadMemStatsInterval(15 * time.Second)); err != nil {
+		var runtimeErr error
+		runtimeStartOnce.Do(func() {
+			runtimeErr = runtime.Start(runtime.WithMinimumReadMemStatsInterval(15 * time.Second))
+		})
+		if runtimeErr != nil {
 			handleErr(err)
-			return shutdown, nil, err
+			return shutdown, nil, runtimeErr
 		}
 	}

Does go.opentelemetry.io/contrib/instrumentation/runtime expose a shutdown or stop API for runtime.Start, and is runtime.Start intended to be called only once per process?

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/observability/provider.go` around lines 155 - 158, InitProvider
currently calls runtime.Start each time it initializes metrics which can
register duplicate instrumentation; guard the runtime.Start call with a
process-wide sync.Once so it runs only once. Add a package-level sync.Once
(e.g., runtimeStartOnce) and an error holder (e.g., runtimeStartErr), then
replace the direct runtime.Start call inside InitProvider with
runtimeStartOnce.Do(func(){ runtimeStartErr = runtime.Start(...)}); after the
Once.Do check runtimeStartErr and call handleErr/return as before; reference
runtime.Start, InitProvider, handleErr and ensure the one-time behavior is
explicit and process-wide.

cmd/wavehouse/main.go (1)

344-365: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast if the dedicated Prometheus listener cannot bind.

The sidecar server is started asynchronously, so run() can keep serving the API even when the metrics listener immediately dies with EADDRINUSE. If Prometheus is explicitly enabled on a dedicated port, bind the socket synchronously and treat bind failures as startup failures instead of degrading after boot.

📡 Proposed startup handling

+	ln, err := net.Listen("tcp", fmt.Sprintf(":%d", promPort))
+	if err != nil {
+		logger.Error("prometheus server listen", "error", err)
+		return 1
+	}
+
 	promSrv = &http.Server{
 		Addr:              fmt.Sprintf(":%d", promPort),
 		Handler:           mux,
@@
 	go func() {
 		logger.Info("starting prometheus metrics server", "addr", promSrv.Addr, "path", promPath)
-		if err := promSrv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
+		if err := promSrv.Serve(ln); err != nil && !errors.Is(err, http.ErrServerClosed) {
 			logger.Error("prometheus server error", "error", err)
 		}
 	}()

Also applies to: 398-404

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/wavehouse/main.go` around lines 344 - 365, When cfg.Prometheus.Port is
non-zero and promHandler != nil, ensure the Prometheus sidecar socket is bound
synchronously and any bind error aborts startup: replace the current late-fail
async start of promSrv by creating a net.Listener (e.g. via
net.Listen(fmt.Sprintf(":%d", cfg.Prometheus.Port))) before calling run() or
launching the goroutine, assign that listener to the http.Server (promSrv) and
return the bind error immediately so run() fails fast; update the code paths
that currently create promSrv and start it asynchronously (refer to promSrv,
promHandler, cfg.Prometheus.Port and the async startup around run()) to use the
listener-based, fail-fast semantics.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/config/config.go`:
- Around line 197-218: The Prometheus path validation currently only forbids
exact collisions with reserved endpoints; update the check in the Prometheus
block (c.Prometheus / local p variable and p.Path) so that when prometheus is
mounted on the API server (p.Port == 0) it also rejects any path that is equal
to or is a descendant of the API namespace (startsWith "/v1"), not just
"/health" and "/ready". Modify the loop/conditional that examines p.Path to
return an error for p.Port == 0 && strings.HasPrefix(p.Path, "/v1") (and keep
the existing reserved checks) so config like "/v1" or "/v1/anything" is rejected
when port==0.

In `@internal/observability/provider.go`:
- Around line 69-79: Save the current global providers at the start of init
(capture the return values of the current tracer, meter and logger providers
before calling otel.SetTracerProvider, otel.SetMeterProvider, and
global.SetLoggerProvider), and in handleErr restore those saved providers back
to the globals before/after invoking shutdown; specifically, modify handleErr to
call otel.SetTracerProvider(prevTracer), otel.SetMeterProvider(prevMeter) and
global.SetLoggerProvider(prevLogger) (using the saved prev* variables) so that
on any initialization failure the original globals are reinstated prior to
leaving the function.

In `@tests/integration/otel_test.go`:
- Around line 66-141: Collapse the three TestOTel_TraceSampling_* functions into
a single table-driven test: create a slice tests := []struct{name string, rate
float64, iterations int, expectFunc func(int) bool, expectMsg string} with
entries for FullRate, HalfRate, ZeroRate; inside a loop call t.Run(tt.name,
func(t *testing.T){ ... }) and move common setup/teardown (guardOTelGlobals, r
:= testutil.NewFakeOTLP, shutdown,_ := initAndShutdown(...
ProviderConfig{Endpoint: r.Addr(), TracesEnabled: true, TracesSampleRate:
tt.rate}), tracer := otel.Tracer("test"), span loop, drain context and shutdown
assertion) into the subtest body, then replace individual assertions with a
single check using tt.expectFunc(r.SpanCount()) / assert helpers and
tt.expectMsg; keep references to initAndShutdown, observability.ProviderConfig,
guardOTelGlobals, testutil.NewFakeOTLP, otel.Tracer, shutdown and r.SpanCount to
locate and refactor the code.

---

Duplicate comments:
In `@cmd/wavehouse/main.go`:
- Around line 344-365: When cfg.Prometheus.Port is non-zero and promHandler !=
nil, ensure the Prometheus sidecar socket is bound synchronously and any bind
error aborts startup: replace the current late-fail async start of promSrv by
creating a net.Listener (e.g. via net.Listen(fmt.Sprintf(":%d",
cfg.Prometheus.Port))) before calling run() or launching the goroutine, assign
that listener to the http.Server (promSrv) and return the bind error immediately
so run() fails fast; update the code paths that currently create promSrv and
start it asynchronously (refer to promSrv, promHandler, cfg.Prometheus.Port and
the async startup around run()) to use the listener-based, fail-fast semantics.

In `@internal/observability/provider.go`:
- Around line 155-158: InitProvider currently calls runtime.Start each time it
initializes metrics which can register duplicate instrumentation; guard the
runtime.Start call with a process-wide sync.Once so it runs only once. Add a
package-level sync.Once (e.g., runtimeStartOnce) and an error holder (e.g.,
runtimeStartErr), then replace the direct runtime.Start call inside InitProvider
with runtimeStartOnce.Do(func(){ runtimeStartErr = runtime.Start(...)}); after
the Once.Do check runtimeStartErr and call handleErr/return as before; reference
runtime.Start, InitProvider, handleErr and ensure the one-time behavior is
explicit and process-wide.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: c21dedad-d0c2-43a7-8f4b-9b7feb43a5e7

📥 Commits

Reviewing files that changed from the base of the PR and between 329db0c and 86fcf65.

📒 Files selected for processing (5)

cmd/wavehouse/main.go
internal/api/router.go
internal/config/config.go
internal/observability/provider.go
tests/integration/otel_test.go

claude · 2026-05-13T14:20:34Z

1 [SHOULD], 0 [MAY] — see the inline thread for detail.

Severity	Finding	Location
[SHOULD]	Three SigNoz images pinned to `:latest` — migrator, query-service, otel-collector	`deployments/signoz/docker-compose.yaml:33,49,64`

Iterate — the [SHOULD] above is the only new finding; pin the SigNoz images to concrete version tags.

State of CodeRabbit's open CHANGES_REQUESTED (from 2026-05-13T14:04):

Two items from that review are still present in the code as of HEAD 86fcf65:

AGENTS.md File Structure section missing internal/observability/ — The package description IS in AGENTS.md (line 24 and Key Design Decisions Dual-Mode Metric Export (OTLP Push vs. Prometheus Scrape) #15); only the quick-reference text block at §File Structure is stale. Low severity but it's the doc-sync rule.
runtime.Start goroutine not tracked in shutdown — The goroutine IS cleaned up through meterProvider.Shutdown() in shutdownFuncs for normal operation. The unreachable-endpoint integration test explicitly comments on the edge case and works around it with a fire-and-forget goroutine. No production correctness risk; the test suite is intentionally not using goleak. Acceptable as-is or addressable via runtime.WithContext.

If these threads were resolved by Eric with a commit-to-follow-up or counter-argument, the two findings are below ship-blocking level. The core implementation — per-signal gating, WARN+ERROR floor invariant, bounded 5s shutdown, private Prometheus registry, lazy gRPC dial — is correct and well-tested.

@claude

Three valid items from this round: - provider.go: snapshot OTel globals (propagator + tracer/meter/log providers) on entry; restore them in handleErr before shutdown so a partial init failure leaves the process with the no-op defaults rather than dead, already-shut-down providers. main.go logs and continues on init error, so the prior behavior would have left the caller running against unusable globals. - config.Validate: reject prometheus.path == "/v1" or "/v1/*" when prometheus.port == 0. Same-port mode mounts the unauthenticated metrics handler on the main router; a /v1-namespaced path would silently shadow the authenticated API subtree at an authenticated-looking URL. Sidecar-port mode (port != 0) is unaffected. - otel_test.go: collapse the three TestOTel_TraceSampling_* functions into a single table-driven TestOTel_TraceSampling with full / half / zero subtests. New cases now share the arrange/act/drain scaffolding. New tests: TestValidate_PrometheusPathReservedConflicts covers /health, /ready, /v1, and /v1/* in same-port mode; TestValidate_PrometheusV1PathAllowedOnSidecarPort verifies the same-port-only restriction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude · 2026-05-13T14:31:15Z

Review — round 6

1 [MUST], 1 [SHOULD], 0 [MAY] — see inline threads for detail.

[MUST] AGENTS.md File Structure section still omits internal/observability/ — added to Architecture and Key Design Decisions but the table was not updated; mandatory per the doc-sync rules.
[SHOULD] provider.go:169 — runtime.Start called without a sync.Once guard; goroutines it spawns are unshuttable and accumulate on repeated InitProvider calls (three in the current integration suite). Production path is fine (called once), but this was already flagged by Gemini and is still unaddressed.

Everything else is solid: credentials are env-substituted with a gitignored .env, the hardcoded-10% trace sample rate is gone and replaced with the configurable field, context propagation is correct, handleErr properly restores OTel globals before partial shutdown, TraceHandler.WithAttrs/WithGroup are implemented with a regression test, and the integration-test suite is thorough and well-structured.

Iterate — fix the AGENTS.md File Structure entry; the sync.Once is a quality improvement worth doing in the same pass.

@taitelee

…tart leak Two items from this review round: - AGENTS.md: File Structure section now lists internal/observability/ between internal/mq/ and internal/pipes/ to keep alphabetical order; package count corrected from "Eleven" to "Twelve". The earlier Architecture + Key Design Decisions edits had updated the prose sections but not the structure block — AGENTS.md's own Documentation & Consistency Sync rule explicitly requires this. - provider.go: wrap runtime.Start in a package-level sync.Once so test suites that re-init the provider don't accumulate goroutines that the upstream package has no way to stop. The trade-off (subsequent re-inits miss runtime metrics on their new MeterProvider) is acceptable: no test asserts on runtime metric presence and production calls InitProvider exactly once. Errors from runtime.Start now log via slog.Warn instead of routing through handleErr — instrumentation failure shouldn't tear down a fully-initialized pipeline and roll back the globals. The sync.Once item is a reversal from round-4 where I declined a similar suggestion. The earlier decline was about "subsequent inits behave differently" — Claude's reframing here (instrumentation errors via Warn, not handleErr) showed the path that avoids the over-aggressive rollback on a non-fatal error. The goroutine accumulation in tests was concrete while the cost (re-init missing runtime metrics) is only observable in a hypothetical hot-reload path that doesn't exist today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude · 2026-05-13T14:39:32Z

0 [MUST], 1 [SHOULD], 0 [MAY] — see the inline thread for detail.

The HEAD commit (6de31ee) closes the two open [MUST] threads from CodeRabbit's last CHANGES_REQUESTED: the internal/observability/ entry is now in the AGENTS.md File Structure table, and runtimeStartOnce guards runtime.Start against repeated InitProvider calls. CodeRabbit approved on this commit.

The only new finding is the carryover [SHOULD] from prior rounds:

Severity	File	Finding
[SHOULD]	`deployments/signoz/docker-compose.yaml:33,49,64`	`migrator`, `signoz`, `otel-collector` images on `:latest` while `clickhouse` in the same file is pinned

Ship it — no correctness or security blockers. The core implementation is solid: per-signal gating, WARN+ERROR floor invariant, bounded 5 s shutdown, private Prometheus registry, lazy gRPC dial, and a thorough integration suite. Address the [SHOULD] before the dev-loop integration tracked in #121, or inline it with that PR.

@claude

Closes the remaining items from #121: integrate the SigNoz stack into the Makefile so `make dev-obs` is a one-shot "dev mode + SigNoz + WaveHouse pointed at OTLP collector" experience, and finish the stack hardening that PR #116's review surfaced. Makefile — new `##@ Observability` section mirroring the `deps-*` family: - signoz-up (idempotent, --wait on UI + collector health) - signoz-down (preserves UI history + admin account) - signoz-logs (tails UI + collector) - signoz-wipe (destroys volumes — admin account reset) - signoz-dashboards (upsert via load-dashboards.sh; guards on creds) - dev-obs (deps-up + signoz-up + air with WH_OTEL_ENABLED=true WH_OTEL_ADDR=127.0.0.1:4317; auto-loads dashboards iff SIGNOZ_TOKEN or SIGNOZ_EMAIL+PASSWORD in env, otherwise prints a pointer at signoz-dashboards) clean-all also tears down the SigNoz compose project so `make clean-all` is a true full reset. Compose hardening (deployments/signoz/compose.yaml — renamed from docker-compose.yaml per modern Compose v2 convention): - Drop `container_name:` from all six services so the project-prefix naming actually works (and multiple instances no longer collide). - Drop the explicit `signoz-net` network block — project-default works and `standalone.signoz.yaml` reaches the collector via host.docker.internal. - Add an `otel-collector` healthcheck against its `health_check` extension on :13133. The signoz/signoz-otel-collector image has no curl/wget but does have bash, so probe via /dev/tcp + GET / and grep for `200 OK`. This is what `--wait signoz otel-collector` blocks on. - Gate `signoz` and `otel-collector` on `signoz-telemetrystore-migrator: service_completed_successfully`. Without this, `--wait signoz otel-collector` skipped the migrator entirely (nothing pulled it into the closure) and the collector's `migrate sync check` retried forever against a nonexistent `signoz_traces` schema. Warm restarts hid the bug because the schema persisted in the volume — only a wipe + up surfaced it. - Header comment now points operators at `make signoz-up`, calls out the dev-only posture, and points at `deploy.resources.limits` as the prod knob. Docs: - docs/development.md — new "Running with observability" section with the target table + first-run admin-account guidance. - docs/deployment.md — refresh the SigNoz bring-up + dashboards flow around the new Make wrappers; update compose-path references. - README.md — quick-start bullet pointing at `make dev-obs`. WHissues.md, standalone.signoz.yaml: update `docker-compose.yaml` → `compose.yaml` path references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

taitelee requested a review from EricAndrechek May 12, 2026 16:50

taitelee assigned EricAndrechek and taitelee May 12, 2026

github-project-automation Bot added this to WaveHouse Task Board May 12, 2026

github-project-automation Bot moved this to Backlog in WaveHouse Task Board May 12, 2026