60 changes: 60 additions & 0 deletions docker-compose.observability.yml
@@ -0,0 +1,60 @@
```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    container_name: aevatar-jaeger
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
    ports:
      - "16686:16686"
    restart: unless-stopped

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    container_name: aevatar-otel-collector
    command:
      - --config=/etc/otelcol-contrib/config.yml
    ports:
      - "4317:4317"
      - "4318:4318"
      - "9464:9464"
    volumes:
      - ./tools/observability/otel-collector-config.yml:/etc/otelcol-contrib/config.yml:ro
    depends_on:
      - jaeger
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: aevatar-prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.enable-lifecycle
    ports:
      - "9090:9090"
    volumes:
      - ./tools/observability/prometheus:/etc/prometheus:ro
      - prometheus-data:/prometheus
    depends_on:
      - otel-collector
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: aevatar-grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./tools/observability/grafana/provisioning:/etc/grafana/provisioning:ro
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
```
249 changes: 249 additions & 0 deletions docs/architecture/metrics-baseline-plan.md
@@ -0,0 +1,249 @@
# Metrics Baseline Plan

> Status: In progress on branch `feature/metrics-observability-plan`.

## 1. Purpose

Provide a low-cardinality, production-safe observability baseline for the workflow runtime and the API layer:

- Separate platform health metrics from user-perceived latency metrics.
- Support SSE/WS first-response analysis (TTFB-like) in addition to full request duration.
- Keep instrumentation generic, minimal, and layered.

## 2. Design Principles

- One meter per domain boundary: `Aevatar.Agents` for runtime, `Aevatar.Api` for API layer.
- Prefer low-cardinality labels over high-cardinality identifiers.
- Track both user-perceived latency (API) and backend processing overhead (runtime).
- Use OpenTelemetry OTLP export with Prometheus + Grafana as the default metrics display stack.
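
A minimal sketch of the first two principles in .NET metrics terms (instrument and label names come from the contract in section 5; the real wiring lives in the host's observability setup):

```csharp
using System.Diagnostics.Metrics;

// One meter per domain boundary; every instrument hangs off that meter.
var apiMeter = new Meter("Aevatar.Api");
var requests = apiMeter.CreateCounter<long>("aevatar_api_requests_total");

// Low-cardinality labels: every value comes from a small, fixed set.
requests.Add(1, new("transport", "http"), new("result", "ok"));

// Anti-pattern (section 3): an unbounded identifier creates one series per value.
// requests.Add(1, new("session_id", sessionId));  // do NOT do this
```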

## 3. Anti-Patterns to Avoid

- Per-actor/per-session/per-command labels in metrics.
- Parallel metric systems with overlapping names and meanings.
- Multiple `Meter` objects sharing the same name across assemblies.
- Dashboard-only metrics that are not used in SLO/SLA decisions.
- `TypeUrl.Contains(...)` anywhere under `src/` (blocked by a CI guard).

## 4. Current Implementation Status

Implemented:

- OTLP metric export from `Workflow.Host.Api`.
- Collector-centered local stack (`docker-compose.observability.yml`) with Jaeger, Prometheus, and Grafana.
- Runtime metric cleanup and low-cardinality refactor:
  - removed high-cardinality labels (`agent_id`, `publisher_id`).
  - removed low-value instruments (`RouteTargets`, `StateLoads`, `StateSaves`, `HandlerDuration`).
- Runtime metrics emitted from both Local and Orleans paths.
- API metrics for request count and full duration (meter `Aevatar.Api`) on all interaction endpoints:
  - `POST /api/chat`
  - `POST /api/workflows/resume`
  - `POST /api/workflows/signal`
  - command-style HTTP request path
  - `GET /api/ws/chat`
- API first-response duration metric for streaming paths and WS parse error responses.
- Unified instrumentation scopes that compose tracing + logging + metrics (sketched after this list):
  - `EventHandleScope` (runtime): a single scope drives the Activity span, log scope, and metrics recording.
  - `ApiRequestScope` (API): a single scope drives the stopwatch, result classification, and first-response tracking.
  - eliminates duplicated stopwatches and independent error tracking across tracing/metrics.
- `OperationCanceledException` consistently classified as `result=ok` across HTTP and WebSocket paths.
- `ChatWebSocketRunCoordinator` decoupled from metrics (no metrics return value; scope passed from caller).
- Histogram views are configured explicitly for AI-oriented request latency and runtime event latency buckets.
- Grafana dashboard panels for:
  - health/error ratio
  - runtime/API throughput and latency
  - window totals (`increase`)
  - first-response vs full-response comparison
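
A condensed sketch of the `ApiRequestScope` idea (hypothetical shape, assuming .NET 7+ for `Stopwatch.GetElapsedTime`; member names and wiring are illustrative, not the actual implementation):

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public sealed class ApiRequestScope : IDisposable
{
    private static readonly Meter Meter = new("Aevatar.Api");
    private static readonly Counter<long> Requests =
        Meter.CreateCounter<long>("aevatar_api_requests_total");
    private static readonly Histogram<double> Duration =
        Meter.CreateHistogram<double>("aevatar_api_request_duration_ms");
    private static readonly Histogram<double> FirstResponse =
        Meter.CreateHistogram<double>("aevatar_api_first_response_duration_ms");

    private readonly string _transport;                 // "http" or "ws"
    private readonly long _start = Stopwatch.GetTimestamp();
    private string _result = "ok";
    private bool _firstResponseRecorded;

    public ApiRequestScope(string transport) => _transport = transport;

    // Called when the first frame/ack/error reaches the client; later calls are no-ops.
    public void MarkFirstResponse()
    {
        if (_firstResponseRecorded) return;
        _firstResponseRecorded = true;
        FirstResponse.Record(Stopwatch.GetElapsedTime(_start).TotalMilliseconds,
            new("transport", _transport), new("result", _result));
    }

    // Client-initiated cancellation stays result=ok (see section 5.5).
    public void MarkError(Exception ex)
    {
        if (ex is not OperationCanceledException) _result = "error";
    }

    public void Dispose()
    {
        Requests.Add(1, new("transport", _transport), new("result", _result));
        Duration.Record(Stopwatch.GetElapsedTime(_start).TotalMilliseconds,
            new("transport", _transport));
    }
}
```

An endpoint creates one scope per request, calls `MarkFirstResponse()` from its streaming or WebSocket write path, routes exceptions through `MarkError(...)`, and lets `Dispose()` record the request count and full duration.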

Pending:

- Add explicit SLO panel with thresholds and status coloring.
- Tune alert thresholds with production baselines after traffic observation.
- Tune histogram bucket boundaries with production latency samples if workload profile changes.

## 5. Metric Contract (Current)

### 5.1 Runtime Metrics (Meter: `Aevatar.Agents`)

| Name | Type | Labels | Meaning |
|---|---|---|---|
| `aevatar_runtime_events_handled_total` | Counter | `direction`, `result` | Number of handled runtime events |
| `aevatar_runtime_event_handle_duration_ms` | Histogram | `result` | Event handling duration (platform overhead) |
| `aevatar_runtime_active_actors` | UpDownCounter | none | Active actor count |
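
A sketch of how these three instruments are emitted (illustrative only; the `direction` value shown matches the dashboard's "Self" panels in section 7):

```csharp
using System.Diagnostics.Metrics;

var meter = new Meter("Aevatar.Agents");
var handled = meter.CreateCounter<long>("aevatar_runtime_events_handled_total");
var handleMs = meter.CreateHistogram<double>("aevatar_runtime_event_handle_duration_ms");
var actors = meter.CreateUpDownCounter<long>("aevatar_runtime_active_actors");

actors.Add(1);                                            // actor activated
handled.Add(1, new("direction", "self"), new("result", "ok"));
handleMs.Record(3.2, new("result", "ok"));                // platform overhead only, in ms
actors.Add(-1);                                           // actor deactivated
```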

### 5.2 API Metrics (Meter: `Aevatar.Api`)

| Name | Type | Labels | Meaning |
|---|---|---|---|
| `aevatar_api_requests_total` | Counter | `transport`, `result` | API request volume |
| `aevatar_api_request_duration_ms` | Histogram | `transport` | End-to-end request duration |
| `aevatar_api_first_response_duration_ms` | Histogram | `transport`, `result` | Time to first response frame/ack |

`transport` values: `http`, `ws`

`result` values: `ok`, `error`

### 5.3 First-Response Semantics (Explicit Contract)

`aevatar_api_first_response_duration_ms` is defined as "time from request accepted to first response frame/ack/error sent to client".

- `http` path:
  - recorded on the first response signal sent to the client:
    - the run-context bootstrap frame (`aevatar.run.context`), or
    - the first streamed output frame (`emitAsync`) if no bootstrap frame was written first.
  - not recorded for the prompt-validation early return (400), because no response stream frame is produced.
- `ws` path:
  - recorded on the first outbound message among `command ack`, `agui event`, or `command error`.
  - if the request fails before any WebSocket message is sent, no first-response metric is recorded.

This contract intentionally tracks "first observable response signal" rather than "request finished".
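
For the `http` streaming path, the recording point can be pictured like this (hypothetical handler excerpt; `ApiRequestScope` is the sketch from section 4, and the frame and response types are assumptions):

```csharp
using Microsoft.AspNetCore.Http;

// The first frame written to the client marks first-response; later calls are no-ops.
static async Task StreamAsync(ApiRequestScope scope, IAsyncEnumerable<string> frames,
    HttpResponse response, CancellationToken ct)
{
    await foreach (var frame in frames.WithCancellation(ct))
    {
        scope.MarkFirstResponse();
        await response.WriteAsync(frame, ct);
    }
}
```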

### 5.4 Request-Duration Tagging Trade-Off

`aevatar_api_request_duration_ms` intentionally omits the `result` label. This is a deliberate cardinality and dashboard-simplicity trade-off:

- error ratio is derived from `aevatar_api_requests_total{result=...}`.
- latency panels focus on user wait time split by `transport`.
- adding `result` to the histogram would double the API latency series without materially improving the primary SLO view (with the default 15 API bucket boundaries, each `transport` value already yields 18 series: 16 `le` buckets including `+Inf`, plus `_sum` and `_count`).

If per-result latency distributions become operationally necessary later, introduce them as a separate histogram with an explicit need, rather than retrofitting the core baseline.

### 5.5 Cancellation Semantics

`OperationCanceledException` is treated as `result=ok` in the request metric. Client-initiated cancellation is not a service error; it reflects normal connection lifecycle behavior (e.g., user navigates away, timeout). This keeps the error ratio focused on genuine service-side failures.

## 6. Diagnostic Model

When AI is part of the core request path, end-to-end latency alone cannot diagnose health. The layered metric approach separates concerns:

| Metric | Answers | Normal Range |
|---|---|---|
| **error ratio** | Is the service stable? | < 1% |
| **first_response p95** | Is user-perceived responsiveness OK? | Acceptable TTFB |
| **request_duration p95** | Is total time reasonable? | High variance expected with AI |
| **runtime event duration p95** | Is platform overhead normal? | Should be consistently low |

Interpretation rules:

- High full latency + normal first-response + low error ratio → AI generation variance (normal).
- Rising error ratio + dropping throughput → service incident.
- Rising runtime event latency → platform bottleneck (investigate regardless of AI content).
- Full request duration minus first response → AI generation time (expected to vary); a rough PromQL proxy is sketched below.
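
A rough PromQL proxy for the last rule (quantiles of two different histograms do not subtract exactly, but the trend is what matters here):

```promql
histogram_quantile(0.95, sum by (le) (rate(aevatar_api_request_duration_ms_bucket[5m])))
-
histogram_quantile(0.95, sum by (le) (rate(aevatar_api_first_response_duration_ms_bucket[5m])))
```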

### Why NOT split AI/core at the runtime event level

The runtime event layer processes individual envelopes. A single AI chat request fans out to many events (ChatRequest, TextMessageStart, N × TextMessageContent, TextMessageEnd). Only one event (the LLM call trigger) is slow; the rest are fast streaming relays. A `pipeline=ai|core` label at this level gives misleading distributions (p50 looks fast because most AI events are fast) and adds label cardinality without clear diagnostic value.

The correct separation is at the API level: `first_response_duration` captures the real AI-latency impact on user experience, and `request_duration` captures the total including AI generation.

## 7. Dashboard and Display

Grafana dashboard file:

- `tools/observability/grafana/provisioning/dashboards/aevatar-runtime-overview.json`

Panels (SLO section, default view):

1. SLO Read Guide (text)
2. Error Ratio — API and runtime (timeseries)
3. User Latency: First Response p95 (timeseries)
4. User Latency: Full Request p95 (timeseries)
5. API Request Latency p99 (timeseries)
6. Runtime Event Handle Latency p95/p99 (timeseries)

Panels (Runtime Diagnostics section, drill-down):

7. Runtime Diagnostics Guide (text)
8. Active Actors (stat)
9. Runtime Events — Self, Window Total (stat)
10. API Requests — Window Total (stat)
11. Runtime Self Events / API Request (stat)
12. Runtime Events Rate — Self, by result (timeseries)
13. API Requests Rate by result (timeseries)

Note: "Runtime Self Events / API Request" is computed only when window API request count is greater than 0; otherwise the panel is intentionally empty to avoid misleading inflation from background self events.

Local stack:

- `docker-compose.observability.yml`
- OpenTelemetry Collector OTLP HTTP ingest endpoint: `http://localhost:4318` (ingest only, no UI)
- Prometheus: `http://localhost:9090`
- Grafana: `http://localhost:3000`
- Jaeger: `http://localhost:16686`
- Prometheus alert examples: `tools/observability/prometheus/alerts.yml`
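
Bringing the stack up locally:

```bash
docker compose -f docker-compose.observability.yml up -d

# UIs, once the containers are healthy:
#   Grafana     http://localhost:3000   (admin/admin unless credentials were already persisted)
#   Prometheus  http://localhost:9090
#   Jaeger      http://localhost:16686
```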

## 8. Verification Checklist

- Prometheus target `aevatar-otel-collector` is `UP`.
- Collector-exposed metrics include:
  - `aevatar_runtime_events_handled_total`
  - `aevatar_api_requests_total`
  - `aevatar_api_first_response_duration_ms`
- No high-cardinality labels appear in metric series.
- No `pipeline` label in runtime metrics.
- Dashboard shows first-response vs full-response latency.
- `alerts.yml` loads successfully in Prometheus rule status page.
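
Most of these checks can be scripted against the standard Prometheus HTTP API (assuming `jq` is installed and the endpoints from section 7):

```bash
# Scrape target health: expect "up" for the collector target.
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Core series exist and return at least one sample.
curl -s 'http://localhost:9090/api/v1/query?query=aevatar_api_requests_total' \
  | jq '.data.result | length'

# Alert rule groups loaded from alerts.yml.
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
```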

## 9. Metric Quick Reference (Runbook)

### 9.1 Core SLO Metrics

| Metric | PromQL (Reference) | Why it matters | Suggested alert threshold | Common misread |
|---|---|---|---|---|
| API error ratio (5m) | `sum(increase(aevatar_api_requests_total{result="error"}[5m])) / clamp_min(sum(increase(aevatar_api_requests_total[5m])), 1)` | Primary service stability signal for client requests | `> 1%` for 5-10 minutes | Counting 4xx as service error (current contract treats only 5xx as `error`) |
| Runtime event error ratio (5m) | `sum(increase(aevatar_runtime_events_handled_total{result="error"}[5m])) / clamp_min(sum(increase(aevatar_runtime_events_handled_total[5m])), 1)` | Runtime pipeline health signal | `> 1%` for 5-10 minutes | Comparing directly with API ratio without considering event fan-out |
| First response latency p95 | `histogram_quantile(0.95, sum by (le) (rate(aevatar_api_first_response_duration_ms_bucket[$__rate_interval])))` | User-perceived responsiveness (TTFB-like) | Set per workload; start from your current p95 baseline + 30% | Treating missing first-response samples as zero (they are "not emitted", not "fast") |
| Full request latency p95 | `histogram_quantile(0.95, sum by (le) (rate(aevatar_api_request_duration_ms_bucket[$__rate_interval])))` | End-to-end user waiting time | Set per workflow family; usually looser than first response | Assuming high full latency alone means platform issue |
| API request latency p99 | `histogram_quantile(0.99, sum by (le) (rate(aevatar_api_request_duration_ms_bucket[$__rate_interval])))` | Tail-latency regression detection | Alert when sustained spike exceeds SLO budget | Using p99 as primary product KPI instead of engineering diagnostic |
| Runtime event handle latency p95/p99 | `histogram_quantile(0.95, sum by (le) (rate(aevatar_runtime_event_handle_duration_ms_bucket[$__rate_interval])))` and p99 equivalent | Detects platform/runtime overhead changes independent of model generation variance | Trigger when both p95 and p99 trend up with stable traffic | Correlating directly to user latency without checking first/full API latency pair |

### 9.2 Diagnostic Order (Fast Triage)

1. Check API and runtime event error ratio (incident vs non-incident).
2. Check first-response p95 (user "is it responsive?" signal).
3. Check full-request p95 and API p99 (overall wait and tail behavior).
4. Check runtime event latency p95/p99 (platform overhead confirmation).
5. If only full-request worsens while first-response is stable, prioritize model/downstream generation analysis.

Error-ratio panel implementation note: dashboard queries use "empty-as-zero" (`or on() vector(0)`) to keep both API and runtime ratio series visible even when a 5-minute window has no error samples.
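
Spelled out for the API series (the runtime ratio uses the same shape; the panel JSON may differ slightly):

```promql
(
  sum(increase(aevatar_api_requests_total{result="error"}[5m])) or on() vector(0)
)
/
clamp_min(sum(increase(aevatar_api_requests_total[5m])), 1)
```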

## 10. Next Plan Items

1. Add dashboard status coloring for core SLO panels.
2. Re-baseline alert thresholds after collecting production traffic.
3. Add tests:
   - WebSocket path first-response integration tests (requires WebSocket mock infrastructure).
4. Re-evaluate `IMeterFactory` only if host lifecycle isolation or meter injection becomes a proven need.

## 11. Default Histogram Buckets and Alerts

### 11.1 Default Histogram Buckets

Configured in `ObservabilityExtensions`:

- API latency histograms (`aevatar_api_request_duration_ms`, `aevatar_api_first_response_duration_ms`):
  - `25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 20000, 30000, 45000, 60000, 90000, 120000` ms
- Runtime event latency histogram (`aevatar_runtime_event_handle_duration_ms`):
  - `1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000` ms
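
A sketch of how such views are typically registered with the OpenTelemetry .NET SDK (illustrative only; the actual `ObservabilityExtensions` wiring may differ):

```csharp
double[] apiBucketsMs = { 25, 50, 100, 250, 500, 1000, 2500, 5000,
                          10000, 20000, 30000, 45000, 60000, 90000, 120000 };
double[] runtimeBucketsMs = { 1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000 };

builder.Services.AddOpenTelemetry().WithMetrics(metrics => metrics
    .AddMeter("Aevatar.Api", "Aevatar.Agents")
    .AddView("aevatar_api_request_duration_ms",
        new ExplicitBucketHistogramConfiguration { Boundaries = apiBucketsMs })
    .AddView("aevatar_api_first_response_duration_ms",
        new ExplicitBucketHistogramConfiguration { Boundaries = apiBucketsMs })
    .AddView("aevatar_runtime_event_handle_duration_ms",
        new ExplicitBucketHistogramConfiguration { Boundaries = runtimeBucketsMs })
    .AddOtlpExporter());
```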

Configuration overrides:

- `Observability:Metrics:ApiLatencyBucketsMs`
- `Observability:Metrics:RuntimeLatencyBucketsMs`

Both settings accept comma-separated millisecond boundaries in ascending order.
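
For example, in `appsettings.json` (a sketch; whether the binder expects a single comma-separated string, as shown, or a JSON array depends on how `ObservabilityExtensions` parses the setting):

```json
{
  "Observability": {
    "Metrics": {
      "ApiLatencyBucketsMs": "25,50,100,250,500,1000,2500,5000,10000,20000,30000,45000,60000,90000,120000",
      "RuntimeLatencyBucketsMs": "1,5,10,25,50,100,250,500,1000,2500,5000,10000"
    }
  }
}
```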

### 11.2 Default Alert Examples

The repository now includes Prometheus alert examples in `tools/observability/prometheus/alerts.yml`:

- API error ratio above 1% for 10 minutes
- Runtime event error ratio above 1% for 10 minutes
- API first-response p95 above 5000 ms for 10 minutes

These are default guardrails for local and pre-production validation. They should be tuned after collecting real workload baselines.
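
For reference, the first guardrail could look like this in Prometheus rule syntax (a sketch; the shipped `alerts.yml` is authoritative):

```yaml
groups:
  - name: aevatar-api
    rules:
      - alert: AevatarApiErrorRatioHigh
        expr: |
          sum(increase(aevatar_api_requests_total{result="error"}[5m]))
            / clamp_min(sum(increase(aevatar_api_requests_total[5m])), 1) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: API error ratio above 1% for 10 minutes
```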

## 12. Related Documents

- `docs/architecture/stream-first-tracing-design.md`
- `docs/architecture/workflow-jaeger-observability-guide.md`
14 changes: 5 additions & 9 deletions docs/architecture/workflow-jaeger-observability-guide.md
@@ -37,15 +37,10 @@ Expected runtime logging behavior:

## 3. Local Setup

Start the local observability stack:

```bash
docker compose -f docker-compose.observability.yml up -d jaeger otel-collector prometheus grafana
```

Set OTEL environment variables before host startup:
@@ -88,7 +83,8 @@ Validate all checks:
2. Response body includes `correlationId` for async accepted flow.
3. Runtime logs include `trace_id`, `correlation_id`, and `causation_id`.
4. Jaeger UI (`http://localhost:16686`) shows trace under `Aevatar.Workflow.Host.Api`.
5. Prometheus target page (`http://localhost:9090/targets`) shows `aevatar-otel-collector` as `UP`.
6. Log `trace_id` equals Jaeger trace id (trace_id is not exposed in API responses, only in logs).

## 5. Automated Test Checklist

@@ -105,7 +101,7 @@ Recommended minimum checks:

No traces in Jaeger:

- verify the collector and Jaeger containers are healthy and OTLP ports are reachable
- verify OTEL environment variables are visible to host process
- verify sampling is not effectively zero

@@ -0,0 +1,45 @@
# Metrics Observability Dev-Diff Scorecard

> Branch: `feature/metrics-observability-plan` vs `dev`
> Date: 2026-03-06
> Scope: API/runtime metrics instrumentation, Prometheus alerting baseline, Grafana dashboard provisioning

## Executive Score

**Overall Score: 9.1 / 10**

The branch is in a strong state for observability baseline quality.
Instrumentation boundaries are clean, metric cardinality remains controlled, tests are passing, and local observability stack provisioning works end-to-end.

## Dimension Scores

| # | Dimension | Score | Weight | Weighted |
|---|---|:---:|:---:|:---:|
| 1 | Architectural alignment (layering and scope ownership) | 9.2 | 20% | 1.84 |
| 2 | Metric correctness and low-cardinality safety | 9.2 | 20% | 1.84 |
| 3 | Test coverage and regression confidence | 9.0 | 20% | 1.80 |
| 4 | Operational readiness (Prometheus/Grafana/alerts) | 9.0 | 20% | 1.80 |
| 5 | Documentation quality and maintainability | 9.0 | 20% | 1.80 |
| | **Total** | | | **9.08** |

Rounded verdict: **9.1 / 10**

## Validation Evidence (Local)

- `dotnet test test/Aevatar.Workflow.Host.Api.Tests/Aevatar.Workflow.Host.Api.Tests.csproj --nologo`
  - Result: **261 passed, 0 failed, 0 skipped**
- `docker compose -f docker-compose.observability.yml up -d prometheus grafana`
  - Result: both `aevatar-prometheus` and `aevatar-grafana` are up
- Generated API traffic to produce metrics:
  - `POST /api/chat` returned HTTP 200 three times
- Prometheus query validation:
  - `sum(increase(aevatar_api_requests_total[5m]))` returned value `3.080946740700676` (greater than zero)
- Grafana startup/provisioning logs:
  - dashboard provisioning finished
  - dashboard live channel initialized for UID `aevatar-runtime-overview`

## Notes and Follow-Ups

- Grafana HTTP auth check returned `401` with `admin/admin`, indicating that credentials persisted in the local volume differ from the defaults. This does not block provisioning validation, but UI login verification should use the locally persisted credentials.
- Alert thresholds and histogram buckets are good defaults; tune with real traffic after baseline observation.
- Keep API and runtime metric contracts stable to preserve dashboard/query continuity across future refactors.