Skip to content

Feature/metrics observability plan#27

Merged
loning merged 10 commits intodevfrom
feature/metrics-observability-plan
Mar 8, 2026
Merged

Feature/metrics observability plan#27
loning merged 10 commits intodevfrom
feature/metrics-observability-plan

Conversation

@louis4li
Copy link
Copy Markdown
Contributor

@louis4li louis4li commented Mar 6, 2026

Refactor metrics instrumentation and add observability baseline

Summary

Unifies API and runtime metrics instrumentation using scope-based patterns, adds Prometheus alerting rules, and provisions Grafana dashboards for local development.

Changes

  • Metrics instrumentation: Refactor to scope-based (ApiRequestScope, EventHandleScope) for cleaner separation of concerns
  • API metrics: Add request rate, latency (p95/p99), first-response time, and error ratio tracking
  • Runtime metrics: Add event handle duration, active actors count, and event rate by direction/result
  • Prometheus alerts: Add baseline alert rules for error ratio and latency SLO violations
  • Grafana provisioning: Auto-provision dashboard and datasource via Docker Compose
  • Tests: Add 223 lines of API metrics tests (ChatEndpointsInternalTests)

Validation

  • ✅ All tests passing (261 passed)
  • ✅ Local observability stack verified (Prometheus + Grafana)
  • ✅ Metrics collection confirmed via Prometheus queries

louis4li added 10 commits March 4, 2026 17:43
…rst-response contract

- Remove pipeline=ai|core label from runtime metrics (fixes CI guard violation)
- Fix duplicate Meter name: ApiMetrics uses Aevatar.Api instead of Aevatar.Agents
- Remove dead AddMeter("Aevatar.GenAI") registration
- Update Grafana dashboard: remove AI/Core panels, expand first-response panel
- Add first-response semantics documentation and tests
- Centralize metric tag/value constants in ApiMetrics and AgentMetrics
- Update all call sites to use constants instead of hard-coded strings

Made-with: Cursor
- Classify only 5xx errors as 'error' in API metrics (4xx are 'ok')
- Split First Response and Full Request latency into separate panels
- Fix Error Ratio calculation for stable values at low traffic
- Update observability README with latest dashboard structure

Made-with: Cursor
- Add AgentMetrics.RecordEventHandled() helper to match ApiMetrics.RecordRequest() pattern
  - Eliminates 8-line duplication in LocalActor and RuntimeActorGrain
  - Maintains API symmetry between runtime and API metric layers
- Move ResolveMetricResult from ChatEndpoints to ApiMetrics.ResolveResult()
  - Centralizes all result classification semantics in metrics layer
  - Removes dependency on StatusCodes constant, uses pure int comparison
- Add metrics instrumentation to HandleCommand endpoint
- Add first-response metric for WebSocket parse error path
- Add error classification tests (4xx→ok, 5xx→error)
- Update metrics baseline plan documentation

All tests pass (26 tests), architecture guards pass.

Made-with: Cursor
- Set requestResult via ApiMetrics.ResolveResult(statusCode) in the
  result.Error != None branch before return, so 503/5xx are recorded
  as result=error instead of ok
- Add HandleCommand_WhenProjectionDisabled_ShouldClassify503AsResultError
  test to assert 503 -> error classification

Made-with: Cursor
- Fix HTTP first-response metric: record after WriteAsync succeeds (not before)
- Use Interlocked.CompareExchange for thread-safe first-response recording
- Add WebSocket first-response metric tests (parse error, started, execution throws)
- Dashboard improvements:
  - Remove API Request Latency p95, keep only p99 for consistency
  - Fix Runtime Self Events / API Request: compute only when API requests > 0
  - Fix Error Ratio panel: use empty-as-zero to show both API and runtime event error ratio lines
  - Rename 'runtime error ratio' to 'runtime event error ratio' for clarity
- Add metrics quick reference runbook with PromQL, thresholds, and common misreads
- Document error ratio panel empty-as-zero implementation

Made-with: Cursor
Resolve merge conflicts in runtime grain and chat capability metrics paths, keeping first-response metrics hardening while adopting latest request normalization and retry handling from dev.

Made-with: Cursor
- Create EventHandleScope: unified scope for runtime event handling
  - Composes Activity (tracing) + log scope + AgentMetrics (metrics)
  - Single Stopwatch, single error classification per operation
  - Fix scope placement: only wrap actual HandleEventAsync call
    (filtered envelopes no longer counted as handled)

- Create ApiRequestScope: unified scope for API request handling
  - Encapsulates Stopwatch, result tracking, first-response recording
  - Fix first-response timing: await write completion before recording
  - Fix OperationCanceledException: handle in HandleCommand (return 499)

- Refactor ChatEndpoints: replace manual metrics boilerplate with scopes
- Refactor ChatWebSocketRunCoordinator: remove metrics return value coupling
- Extract shared ApiMetricCapture test helper
- Update TracingContextHelpers: remove replaced HandleEnvelopeInstrumentation

Fixes:
- Runtime metrics scope drift (filtered envelopes excluded)
- SSE first-response recorded before write completion
- HandleCommand cancellation classified as error

All tests pass (1064). Architecture guards pass.

Made-with: Cursor
…Prometheus alert coverage.

This change aligns runtime telemetry, tests, and ops documentation so regressions versus dev are easier to detect and score.

Made-with: Cursor
- Replace Prometheus scraping endpoint with OTLP export
- Add OpenTelemetry Collector to local stack (docker-compose)
- Configure collector to forward traces to Jaeger and expose metrics to Prometheus
- Update documentation to clarify OTLP endpoints and data flow

Made-with: Cursor
@loning loning merged commit 6e922a2 into dev Mar 8, 2026
8 checks passed
eanzhao added a commit that referenced this pull request May 8, 2026
Address review batch on PR #562 (10 inline comments). All in files I have
recent ownership of and require no architectural shifts:

- #16 (blocker, security): ssh_exec is now opt-in via NyxIdToolOptions.
  EnableSshExecTool. Hosts that haven't wired the approval middleware no
  longer see the tool by default. Mainnet host opts in (Lark bot needs it).
- #21 (major, bug): code_execute keeps the modern /execute + {language,
  script} contract, but on a NyxID-proxy upstream 404 it retries the legacy
  /run + {language, code} contract so deployments still pinned to old
  chrono-sandbox-service builds keep working.
- #22 (major, bug): SkillRegistry.IsFresh now exempts SkillSource != Remote
  from TTL — local skills are baked in at registration and don't need
  expiring; prior behavior dropped them from use_skill after the first 5min.
- #18 (major, bug): TurnRunner.TryResolveSenderBindingAsync narrows the
  catch to transient infra errors (Http/Timeout/IO/JSON) and surfaces
  non-transient (logic, NRE, serialization) at Error level so ops can
  distinguish "sender unbound" from "binding store broken".
- #19 (major, bug): ConversationReplyGenerator narrows the
  sender-route-fallback catch to transient errors via
  IsRetryableSenderRouteFailure. Programmer errors no longer cost an LLM
  round on retry.
- #29 + #30 (minor): inbox runtime gives metadata enrichment its own 15s
  budget separate from the LLM run, surfacing
  errorCode=llm_reply_metadata_timeout when scope/UserConfig lookup is
  slow. ResolveFallbackTimeout treats ResponseTimeoutSeconds<=0 as "no
  timeout" rather than silently snapping back to 120s.
- #12 (minor): ConversationGAgent's stream-chunk and final-stream-chunk
  edits run under a 10s CTS now; the failure path already uses one. A hung
  relay can no longer pin the actor turn forever.
- #27 (minor, security): ConstantTimeEquals docstring tightened — removed
  the "for future callers" line and added a SCOPE comment that this helper
  is rebuild-admin-only and shouldn't be promoted to internal/public
  without replacing its length-leak with a length-padding scheme.
- #23 (major, bug): CLI ornn skills slug default → ornn-api (matches the
  registered slug; bare "ornn" is the SPA frontend that returns HTML).

Build clean (NyxId / Skills / NyxidChat / Mainnet hosts), 30 AI tests +
15 inbox runtime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants