Add built-in observability with structured logging and tracing #12

Merged
chinmaymk merged 11 commits into main from claude/add-observability-logging-rSZQn
Mar 10, 2026

Conversation

@chinmaymk
Owner

Adds a production-grade observability system that captures all agent
actions as structured JSON logs with trace/span correlation IDs.
Enabled by default at info level on stderr.

  • Structured logger with levels (debug/info/warn/error), output to
    stderr/stdout/file, and JSON-formatted log lines
  • Span-based tracer with nested spans for loop, iteration, model call,
    and tool execution phases
  • Full instrumentation of the agent loop: loop lifecycle, model
    responses (with token usage and response preview), tool execution
    (with input/output previews and timing), and errors
  • Config support via ra.config.json and RA_OBSERVABILITY_* env vars
  • Logger/tracer threaded through CLI, REPL, HTTP, and MCP interfaces
  • 17 new tests covering logger, tracer, and factory
  • Documentation with visualization guides for jq, Grafana+Loki,
    Jaeger, ELK, and OpenTelemetry Collector
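
As a rough illustration of the first bullet, the sketch below shows one way a level-filtered logger could emit JSON lines that carry trace/span correlation IDs. It is not the code from this PR: createSketchLogger, the field names (timestamp, traceId, spanId), and the tool name read_file are assumptions made for the example, and the sink is an injected function rather than stderr so the snippet stays self-contained.

```typescript
// Hypothetical sketch; names and field layout are assumptions, not ra's actual logger.
type Level = "debug" | "info" | "warn" | "error";

// Numeric ranks make the "is this level enabled?" check a simple comparison.
const LEVEL_RANK: Record<Level, number> = { debug: 0, info: 1, warn: 2, error: 3 };

interface TraceContext {
  traceId?: string;
  spanId?: string;
}

function createSketchLogger(minLevel: Level, write: (line: string) => void) {
  return function log(
    level: Level,
    message: string,
    fields: Record<string, unknown> = {},
    ctx: TraceContext = {},
  ): void {
    // Drop anything below the configured minimum level.
    if (LEVEL_RANK[level] < LEVEL_RANK[minLevel]) return;
    // One JSON object per line, with correlation IDs next to the event fields.
    write(JSON.stringify({ timestamp: new Date().toISOString(), level, message, ...ctx, ...fields }));
  };
}

// Usage: collect lines in memory instead of writing to stderr.
const lines: string[] = [];
const log = createSketchLogger("info", (line) => {
  lines.push(line);
});
log("debug", "model request prepared"); // filtered out at info level
log(
  "info",
  "tool execution complete",
  { tool: "read_file", durationMs: 12 },
  { traceId: "trace-1", spanId: "span-7" },
);
```

Because every line is a standalone JSON object, tools such as jq can filter on level or traceId without a custom parser.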

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx

chinmaymk force-pushed the claude/add-observability-logging-rSZQn branch from a4b6b79 to b4e71db on March 8, 2026 22:08
Implement production-grade observability as middleware — no changes
to the agent loop, tools, or interfaces. Observability hooks into
ra's existing 9 lifecycle points via createObservabilityMiddleware().

New files:
- src/observability/logger.ts — Structured JSON logger with levels
- src/observability/tracer.ts — Span-based tracer with timing
- src/observability/middleware.ts — All 9 hooks in one place
- src/observability/index.ts — Factory with split log/trace config
- docs/observability.md — Config, log reference, visualization guides

Existing files touched:
- src/config/ — Add ObservabilityConfig type and defaults
- src/index.ts — Create and wire observability middleware

Enabled by default. Logs to stderr. Configure via config file or
RA_LOG_LEVEL, RA_LOG_OUTPUT, RA_TRACE_OUTPUT env vars.
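
A minimal sketch of how these settings might be resolved is shown here. Only the environment variable names and the info/stderr defaults come from the description above. The config shape (logLevel, logOutput, traceOutput), the stderr default for traces, and the precedence order (environment over config file over defaults) are assumptions for illustration.

```typescript
// Hypothetical sketch; the config shape and precedence order are assumptions.
type Level = "debug" | "info" | "warn" | "error";

interface ObservabilityConfigSketch {
  logLevel: Level;
  logOutput: string; // "stderr", "stdout", or a file path
  traceOutput: string; // assumed to accept the same values as logOutput
}

const DEFAULTS: ObservabilityConfigSketch = {
  logLevel: "info",
  logOutput: "stderr",
  traceOutput: "stderr", // assumption: the source only states the log default
};

const LEVELS: readonly string[] = ["debug", "info", "warn", "error"];

// Ignore unknown level strings instead of failing at startup.
function parseLevel(value: string | undefined): Level | undefined {
  return value !== undefined && LEVELS.includes(value) ? (value as Level) : undefined;
}

function resolveObservabilityConfig(
  env: Record<string, string | undefined>,
  fileConfig: Partial<ObservabilityConfigSketch> = {},
): ObservabilityConfigSketch {
  return {
    logLevel: parseLevel(env["RA_LOG_LEVEL"]) ?? fileConfig.logLevel ?? DEFAULTS.logLevel,
    logOutput: env["RA_LOG_OUTPUT"] ?? fileConfig.logOutput ?? DEFAULTS.logOutput,
    traceOutput: env["RA_TRACE_OUTPUT"] ?? fileConfig.traceOutput ?? DEFAULTS.traceOutput,
  };
}

// Usage: the env var overrides the level, the file sets the log output,
// and the trace output falls back to the default.
const resolved = resolveObservabilityConfig({ RA_LOG_LEVEL: "debug" }, { logOutput: "stdout" });
```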

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
chinmaymk force-pushed the claude/add-observability-logging-rSZQn branch from b4e71db to a244e4f on March 8, 2026 22:15
claude added 10 commits March 8, 2026 22:20
The merge helper was a one-off function in index.ts with type casts.
Now it lives alongside runMiddlewareChain, is properly typed, and
accepts any number of middleware configs via rest params.
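
For readers unfamiliar with the pattern, a rest-parameter merge helper along these lines could look like the sketch below. The Hook and MiddlewareConfig types are simplified stand-ins rather than ra's real types, and concatenating hooks in argument order is an assumption about the intended behaviour.

```typescript
// Hypothetical sketch; Hook and MiddlewareConfig are simplified stand-ins.
type Hook = (ctx: Record<string, unknown>) => void | Promise<void>;
type MiddlewareConfig = Partial<Record<string, Hook[]>>;

// Accepts any number of configs and concatenates the hook lists per
// lifecycle point, preserving argument order.
function mergeMiddleware(...configs: MiddlewareConfig[]): MiddlewareConfig {
  const merged: MiddlewareConfig = {};
  for (const config of configs) {
    for (const [hookName, hooks] of Object.entries(config)) {
      if (!hooks) continue;
      merged[hookName] = [...(merged[hookName] ?? []), ...hooks];
    }
  }
  return merged;
}

// Usage: observability hooks run before user hooks on the same lifecycle point.
const calls: string[] = [];
const observability: MiddlewareConfig = {
  beforeLoopBegin: [() => { calls.push("obs"); }],
};
const user: MiddlewareConfig = {
  beforeLoopBegin: [() => { calls.push("user"); }],
  onError: [() => { calls.push("user-error"); }],
};
const merged = mergeMiddleware(observability, user);
for (const hook of merged.beforeLoopBegin ?? []) {
  void hook({});
}
```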

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
Same pattern as resolver and memory middleware: just prepend into the
existing middleware object. No separate merge step needed.

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
- Drop no-op onStreamChunk hook from obs middleware
- Add error.stack to onError log for debuggability
- Add onCompact callback to CompactionConfig so compaction events are logged
- Replace middleware: middleware with the shorthand property in index.ts
- Add Observability section to README documenting logs, traces, and config

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
- Replace stale mergeMiddleware reference with prepend-to-chain approach
- Add context compacted log event to reference table
- Add stack field to agent loop failed event
- Note onCompact callback pattern for compaction logging

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
Bugs fixed:
- onError only closed loopSpan, leaving iterationSpan, modelSpan, and
  toolSpans orphaned in the tracer's activeSpans Map. Added
  drainOpenSpans() to end all child spans before closing the root.
- Middleware reuse across loop runs could leak stale spans from a
  crashed run. beforeLoopBegin now drains any leftover state on entry.
- error.stack was logged but not traced — added to the loopSpan
  error attributes for consistency.
- All span variables are now nullable (Span | undefined) with guards,
  preventing endSpan calls on uninitialized spans.
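
The span-leak fix can be sketched roughly as follows. The names activeSpans, drainOpenSpans, and endSpan come from the commit message; the Span shape, the factory function, and the newest-first drain order are assumptions made so the example is self-contained.

```typescript
// Hypothetical sketch; only activeSpans, drainOpenSpans and endSpan are names from the commit.
interface Span {
  id: string;
  name: string;
  status: "ok" | "error";
  attributes: Record<string, unknown>;
  startMs: number;
  endMs?: number;
}

function createSketchTracer(emit: (span: Span) => void) {
  const activeSpans = new Map<string, Span>();
  let nextId = 0;

  function startSpan(name: string): Span {
    nextId += 1;
    const span: Span = { id: `span-${nextId}`, name, status: "ok", attributes: {}, startMs: Date.now() };
    activeSpans.set(span.id, span);
    return span;
  }

  // Guarded: a missing or already-ended span is ignored instead of throwing.
  function endSpan(span: Span | undefined, status: "ok" | "error", attributes: Record<string, unknown> = {}): void {
    if (!span || !activeSpans.has(span.id)) return;
    span.status = status;
    span.attributes = { ...span.attributes, ...attributes };
    span.endMs = Date.now();
    activeSpans.delete(span.id);
    emit(span);
  }

  // End every open span except the root, newest first, so children close before parents.
  function drainOpenSpans(root: Span | undefined, status: "ok" | "error"): void {
    const open = [...activeSpans.values()].filter((s) => s.id !== root?.id).reverse();
    for (const span of open) endSpan(span, status);
  }

  return { startSpan, endSpan, drainOpenSpans, openCount: () => activeSpans.size };
}

// Usage: simulate an error while a tool span is still open.
const emitted: Span[] = [];
const tracer = createSketchTracer((span) => { emitted.push(span); });
const loopSpan = tracer.startSpan("loop");
tracer.startSpan("iteration");
tracer.startSpan("model");
tracer.startSpan("tool");
const failure = new Error("tool crashed");
tracer.drainOpenSpans(loopSpan, "error");
tracer.endSpan(loopSpan, "error", { stack: failure.stack });
```

A Map iterates in insertion order, so reversing the list of open spans ends the most recently started span first. Calling drainOpenSpans with an undefined root at the start of a run would clear anything left behind by a crashed run.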

Tests added:
- Error path: verifies all 3 span types are emitted with error status
  and stack trace is present in both log and trace output
- Reuse safety: same middleware instance across multiple successful runs
- Crash recovery: successful run after a failed run with same middleware
- onCompact callback: called with correct info, not called when skipped
  or when summarization fails
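
The onCompact behaviour these tests exercise can be illustrated with the sketch below. Only the onCompact callback name comes from the commits. The other config fields (maxMessages, summarize), the CompactionInfo shape, and the synchronous summarizer are assumptions chosen to keep the example short.

```typescript
// Hypothetical sketch; everything except the onCompact name is an assumption.
interface CompactionInfo {
  originalMessages: number;
  compactedMessages: number;
}

interface CompactionConfigSketch {
  maxMessages: number;
  summarize: (messages: string[]) => string; // synchronous here for brevity
  onCompact?: (info: CompactionInfo) => void;
}

function compactIfNeeded(messages: string[], config: CompactionConfigSketch): string[] {
  // Skipped: history is within budget, so the callback is not invoked.
  if (messages.length <= config.maxMessages) return messages;

  let summary: string;
  try {
    summary = config.summarize(messages.slice(0, -1));
  } catch {
    // Summarization failed: keep the original history and do not report a compaction.
    return messages;
  }

  // Keep the summary plus the most recent message.
  const compacted = [summary, ...messages.slice(-1)];
  config.onCompact?.({ originalMessages: messages.length, compactedMessages: compacted.length });
  return compacted;
}

// Usage: one successful compaction, one skipped, one failed.
const events: CompactionInfo[] = [];
const onCompact = (info: CompactionInfo): void => { events.push(info); };

const compactedOk = compactIfNeeded(["a", "b", "c"], {
  maxMessages: 2,
  summarize: (m) => `summary of ${m.length}`,
  onCompact,
});
const skipped = compactIfNeeded(["a"], { maxMessages: 2, summarize: () => "unused", onCompact });
const failed = compactIfNeeded(["a", "b", "c"], {
  maxMessages: 2,
  summarize: () => { throw new Error("model unavailable"); },
  onCompact,
});
```

An observability layer can then pass an onCompact that logs the event, which keeps the compaction code free of any direct logger dependency.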

Also added Observability link to README nav header.

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
Startup/shutdown logs:
- custom middleware loaded (info, hookCount)
- session storage initialized (debug, path)
- resuming session (info, sessionId, messageCount)
- shutting down (info)

Test additions:
- Tool execution failure: verifies error log + error span status
- Remove unused firstRunOutput variable from reuse test

Docs:
- README: add toolCallId to tool execution complete/failed (was
  inconsistent with executing tool row), add startup event summary
  with link to full reference
- docs/observability.md: add all 4 new log events to reference table

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
tsc with emitDeclarationOnly cannot resolve .ts extension imports
on value exports (type-only exports are erased and fine). Remove
the extensions so build:types passes.

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
- Add parameter signatures to NoopLogger overrides to match base class
- Add parameter signatures to NoopTracer overrides to match base class
- Remove stale @ts-expect-error (Tracer constructor already accepts null)

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
- Add onCompact callback to compaction type in src/config/types.ts
- Use type assertion in middleware merge loop to avoid union discrimination issue

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
- Merge origin/main (subagent tool feature)
- Add subagent-specific observability: log task count, per-task status,
  and aggregate token usage in beforeToolExecution/afterToolExecution
- Fix HTTP integration test flake: keep draining stderr pipe after port
  detection so observability log writes don't block/crash the server

https://claude.ai/code/session_01RpTTaxAzxVxHEqjj3WwnUx
chinmaymk merged commit de67852 into main on Mar 10, 2026
1 check passed
2 participants