Feature/metrics observability plan by louis4li · Pull Request #27 · aevatarAI/aevatar

louis4li · 2026-03-06T03:41:16Z

Refactor metrics instrumentation and add observability baseline

Summary

Unifies API and runtime metrics instrumentation using scope-based patterns, adds Prometheus alerting rules, and provisions Grafana dashboards for local development.

Changes

Metrics instrumentation: Refactor to scope-based (ApiRequestScope, EventHandleScope) for cleaner separation of concerns
API metrics: Add request rate, latency (p95/p99), first-response time, and error ratio tracking
Runtime metrics: Add event handle duration, active actors count, and event rate by direction/result
Prometheus alerts: Add baseline alert rules for error ratio and latency SLO violations
Grafana provisioning: Auto-provision dashboard and datasource via Docker Compose
Tests: Add 223 lines of API metrics tests (ChatEndpointsInternalTests)

Validation

✅ All tests passing (261 passed)
✅ Local observability stack verified (Prometheus + Grafana)
✅ Metrics collection confirmed via Prometheus queries

…rst-response contract - Remove pipeline=ai|core label from runtime metrics (fixes CI guard violation) - Fix duplicate Meter name: ApiMetrics uses Aevatar.Api instead of Aevatar.Agents - Remove dead AddMeter("Aevatar.GenAI") registration - Update Grafana dashboard: remove AI/Core panels, expand first-response panel - Add first-response semantics documentation and tests - Centralize metric tag/value constants in ApiMetrics and AgentMetrics - Update all call sites to use constants instead of hard-coded strings Made-with: Cursor

- Classify only 5xx errors as 'error' in API metrics (4xx are 'ok') - Split First Response and Full Request latency into separate panels - Fix Error Ratio calculation for stable values at low traffic - Update observability README with latest dashboard structure Made-with: Cursor

- Add AgentMetrics.RecordEventHandled() helper to match ApiMetrics.RecordRequest() pattern - Eliminates 8-line duplication in LocalActor and RuntimeActorGrain - Maintains API symmetry between runtime and API metric layers - Move ResolveMetricResult from ChatEndpoints to ApiMetrics.ResolveResult() - Centralizes all result classification semantics in metrics layer - Removes dependency on StatusCodes constant, uses pure int comparison - Add metrics instrumentation to HandleCommand endpoint - Add first-response metric for WebSocket parse error path - Add error classification tests (4xx→ok, 5xx→error) - Update metrics baseline plan documentation All tests pass (26 tests), architecture guards pass. Made-with: Cursor

- Set requestResult via ApiMetrics.ResolveResult(statusCode) in the result.Error != None branch before return, so 503/5xx are recorded as result=error instead of ok - Add HandleCommand_WhenProjectionDisabled_ShouldClassify503AsResultError test to assert 503 -> error classification Made-with: Cursor

- Fix HTTP first-response metric: record after WriteAsync succeeds (not before) - Use Interlocked.CompareExchange for thread-safe first-response recording - Add WebSocket first-response metric tests (parse error, started, execution throws) - Dashboard improvements: - Remove API Request Latency p95, keep only p99 for consistency - Fix Runtime Self Events / API Request: compute only when API requests > 0 - Fix Error Ratio panel: use empty-as-zero to show both API and runtime event error ratio lines - Rename 'runtime error ratio' to 'runtime event error ratio' for clarity - Add metrics quick reference runbook with PromQL, thresholds, and common misreads - Document error ratio panel empty-as-zero implementation Made-with: Cursor

Resolve merge conflicts in runtime grain and chat capability metrics paths, keeping first-response metrics hardening while adopting latest request normalization and retry handling from dev. Made-with: Cursor

- Create EventHandleScope: unified scope for runtime event handling - Composes Activity (tracing) + log scope + AgentMetrics (metrics) - Single Stopwatch, single error classification per operation - Fix scope placement: only wrap actual HandleEventAsync call (filtered envelopes no longer counted as handled) - Create ApiRequestScope: unified scope for API request handling - Encapsulates Stopwatch, result tracking, first-response recording - Fix first-response timing: await write completion before recording - Fix OperationCanceledException: handle in HandleCommand (return 499) - Refactor ChatEndpoints: replace manual metrics boilerplate with scopes - Refactor ChatWebSocketRunCoordinator: remove metrics return value coupling - Extract shared ApiMetricCapture test helper - Update TracingContextHelpers: remove replaced HandleEnvelopeInstrumentation Fixes: - Runtime metrics scope drift (filtered envelopes excluded) - SSE first-response recorded before write completion - HandleCommand cancellation classified as error All tests pass (1064). Architecture guards pass. Made-with: Cursor

…Prometheus alert coverage. This change aligns runtime telemetry, tests, and ops documentation so regressions versus dev are easier to detect and score. Made-with: Cursor

Made-with: Cursor

- Replace Prometheus scraping endpoint with OTLP export - Add OpenTelemetry Collector to local stack (docker-compose) - Configure collector to forward traces to Jaeger and expose metrics to Prometheus - Update documentation to clarify OTLP endpoints and data flow Made-with: Cursor

Address review batch on PR #562 (10 inline comments). All in files I have recent ownership of and require no architectural shifts: - #16 (blocker, security): ssh_exec is now opt-in via NyxIdToolOptions. EnableSshExecTool. Hosts that haven't wired the approval middleware no longer see the tool by default. Mainnet host opts in (Lark bot needs it). - #21 (major, bug): code_execute keeps the modern /execute + {language, script} contract, but on a NyxID-proxy upstream 404 it retries the legacy /run + {language, code} contract so deployments still pinned to old chrono-sandbox-service builds keep working. - #22 (major, bug): SkillRegistry.IsFresh now exempts SkillSource != Remote from TTL — local skills are baked in at registration and don't need expiring; prior behavior dropped them from use_skill after the first 5min. - #18 (major, bug): TurnRunner.TryResolveSenderBindingAsync narrows the catch to transient infra errors (Http/Timeout/IO/JSON) and surfaces non-transient (logic, NRE, serialization) at Error level so ops can distinguish "sender unbound" from "binding store broken". - #19 (major, bug): ConversationReplyGenerator narrows the sender-route-fallback catch to transient errors via IsRetryableSenderRouteFailure. Programmer errors no longer cost an LLM round on retry. - #29 + #30 (minor): inbox runtime gives metadata enrichment its own 15s budget separate from the LLM run, surfacing errorCode=llm_reply_metadata_timeout when scope/UserConfig lookup is slow. ResolveFallbackTimeout treats ResponseTimeoutSeconds<=0 as "no timeout" rather than silently snapping back to 120s. - #12 (minor): ConversationGAgent's stream-chunk and final-stream-chunk edits run under a 10s CTS now; the failure path already uses one. A hung relay can no longer pin the actor turn forever. - #27 (minor, security): ConstantTimeEquals docstring tightened — removed the "for future callers" line and added a SCOPE comment that this helper is rebuild-admin-only and shouldn't be promoted to internal/public without replacing its length-leak with a length-padding scheme. - #23 (major, bug): CLI ornn skills slug default → ornn-api (matches the registered slug; bare "ornn" is the SPA frontend that returns HTML). Build clean (NyxId / Skills / NyxidChat / Mainnet hosts), 30 AI tests + 15 inbox runtime tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

louis4li added 10 commits March 4, 2026 17:43

Merge latest dev into metrics observability branch.

4151e9f

Resolve merge conflicts in runtime grain and chat capability metrics paths, keeping first-response metrics hardening while adopting latest request normalization and retry handling from dev. Made-with: Cursor

Refine workflow API observability with internal endpoint metrics and …

7fdd6e6

…Prometheus alert coverage. This change aligns runtime telemetry, tests, and ops documentation so regressions versus dev are easier to detect and score. Made-with: Cursor

Update metrics observability scorecard with local validation evidence

e14f89f

Made-with: Cursor

loning merged commit 6e922a2 into dev Mar 8, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/metrics observability plan#27

Feature/metrics observability plan#27
loning merged 10 commits intodevfrom
feature/metrics-observability-plan

louis4li commented Mar 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants