feat(metrics): implement Phase 2 Grafana stack and Phase 3 histogram metrics (#2865) #2875

Merged
bug-ops merged 1 commit into main from prometheus-metrics-export
Apr 10, 2026


bug-ops commented Apr 10, 2026

Summary

  • Phase 2 (Grafana stack): docker/docker-compose.metrics.yml overlay (Prometheus v3.4.0 + Grafana 11.6.0) with an auto-provisioned datasource and a pre-built 7-row dashboard covering all 25 Phase 1 metrics, plus an mdbook setup guide with Linux/macOS networking notes
  • Phase 3 (histogram metrics): HistogramRecorder trait in zeph-core::metrics and 3 histograms (zeph_llm_latency_seconds, zeph_turn_duration_seconds, zeph_tool_execution_seconds) with 8 buckets from 0.1 s to 60 s. Recording points sit in the agent loop, each call site captures elapsed time once, and the sync task uses borrow_and_update()
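As a rough illustration of what the bucket layout means, here is a minimal std-only sketch of a cumulative histogram using the bucket bounds listed in this PR. `SketchHistogram` and its methods are hypothetical names invented for this example; the PR itself records through a Prometheus client, not this type. It assumes the usual Prometheus convention that each bucket counts every observation less than or equal to its upper bound.

```rust
/// Bucket upper bounds in seconds, as listed in this PR.
pub const BUCKETS_SECONDS: [f64; 8] = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0];

/// Hypothetical in-memory histogram, for illustration only.
#[derive(Debug, Default)]
pub struct SketchHistogram {
    /// counts[i] is the number of observations <= BUCKETS_SECONDS[i] (cumulative).
    counts: [u64; 8],
    /// Total number of observations, i.e. the implicit +Inf bucket.
    total: u64,
    /// Sum of all observed values, in seconds.
    sum: f64,
}

impl SketchHistogram {
    /// Records one observation in every bucket whose bound is >= the value.
    pub fn observe(&mut self, seconds: f64) {
        for (count, bound) in self.counts.iter_mut().zip(BUCKETS_SECONDS.iter()) {
            if seconds <= *bound {
                *count += 1;
            }
        }
        self.total += 1;
        self.sum += seconds;
    }

    pub fn cumulative_counts(&self) -> [u64; 8] {
        self.counts
    }

    pub fn total(&self) -> u64 {
        self.total
    }

    pub fn sum(&self) -> f64 {
        self.sum
    }
}
```

With this layout, an observation above 60 s lands only in the implicit +Inf bucket, so very long turns or tool calls are counted but not resolved any further.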

Closes #2865

Non-blocking follow-ups filed

Test plan

  • cargo +nightly fmt --check — clean
  • cargo build --features prometheus — compiles
  • cargo build --features full — compiles
  • cargo nextest run --workspace --features full --lib --bins — 8109 passed
  • Phase 2: docker files present and validated, mdbook entry confirmed
  • Phase 3: 11 metrics-specific tests pass (histogram observation, bucket config, trait dispatch)
  • Live test: docker compose -f docker/docker-compose.metrics.yml up + curl localhost:8090/metrics (manual, post-merge)

…metrics (#2865)

Phase 2 — Grafana stack:
- docker/docker-compose.metrics.yml: Prometheus v3.4.0 + Grafana 11.6.0 overlay
  with health checks, host.docker.internal networking for macOS/Linux
- docker/prometheus/prometheus.yml: scrape config targeting zeph /metrics endpoint
- docker/grafana/provisioning: auto-provisioned Prometheus datasource + dashboard
- docker/grafana/dashboards/zeph-overview.json: 7-row dashboard covering all 25 metrics
- book/src/guides/prometheus.md: setup guide with Linux networking troubleshooting

Phase 3 — Histogram metrics:
- HistogramRecorder trait in zeph-core::metrics (object-safe, Send+Sync)
- 3 histograms in PrometheusMetrics: zeph_llm_latency_seconds,
  zeph_turn_duration_seconds, zeph_tool_execution_seconds
  with buckets [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]s
- Impl HistogramRecorder for PrometheusMetrics in src/metrics_export.rs
- Agent builder wiring: Agent::with_histogram_recorder(Arc<dyn HistogramRecorder>)
- Recording points: LLM call (native.rs record_chat_metrics_and_compact),
  turn end (utils.rs flush_turn_timings), per-tool (native.rs handle_native_tool_calls)
- Single elapsed capture per call site (gauge and histogram use same Duration)
- spawn_metrics_sync uses borrow_and_update() for efficient watch channel consumption
- Wired in runner.rs via shared Arc<PrometheusMetrics> for sync task and recorder
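The trait wiring and the single elapsed capture described above could look roughly like the sketch below. Only `HistogramRecorder` and `with_histogram_recorder(Arc<dyn HistogramRecorder>)` are named in this PR; the trait method names, `InMemoryRecorder`, `AgentSketch`, and `timed_llm_call` are assumptions made up for illustration and may not match the real code.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Assumed shape of the trait: no generics and no `Self` returns, so it stays
/// object-safe, and `Send + Sync` so it can be shared behind `Arc<dyn ...>`.
pub trait HistogramRecorder: Send + Sync {
    fn observe_llm_latency(&self, elapsed: Duration);
    fn observe_tool_execution(&self, elapsed: Duration);
}

/// Hypothetical test double that keeps observations in memory.
#[derive(Default)]
pub struct InMemoryRecorder {
    pub llm: Mutex<Vec<Duration>>,
    pub tool: Mutex<Vec<Duration>>,
}

impl HistogramRecorder for InMemoryRecorder {
    fn observe_llm_latency(&self, elapsed: Duration) {
        self.llm.lock().unwrap().push(elapsed);
    }

    fn observe_tool_execution(&self, elapsed: Duration) {
        self.tool.lock().unwrap().push(elapsed);
    }
}

/// Hypothetical stand-in for the agent; the field below plays the role of the
/// existing gauge-style "last latency" metric.
#[derive(Default)]
pub struct AgentSketch {
    last_llm_latency: Option<Duration>,
    recorder: Option<Arc<dyn HistogramRecorder>>,
}

impl AgentSketch {
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder-style wiring, mirroring the signature named in this PR.
    pub fn with_histogram_recorder(mut self, recorder: Arc<dyn HistogramRecorder>) -> Self {
        self.recorder = Some(recorder);
        self
    }

    /// Runs `call`, reads the clock once, and feeds the same `Duration`
    /// to both the gauge-like field and the histogram recorder.
    pub fn timed_llm_call<T>(&mut self, call: impl FnOnce() -> T) -> T {
        let started = Instant::now();
        let out = call();
        let elapsed = started.elapsed(); // single capture per call site
        self.last_llm_latency = Some(elapsed);
        if let Some(recorder) = &self.recorder {
            recorder.observe_llm_latency(elapsed);
        }
        out
    }

    pub fn last_llm_latency(&self) -> Option<Duration> {
        self.last_llm_latency
    }
}
```

Reading the clock once means the gauge and the histogram can never disagree about the same call, and keeping the recorder optional lets builds without the prometheus feature skip the histogram path.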

Closes #2865
github-actions bot added the enhancement, size/XL, documentation, rust, and core labels and removed the size/XL label Apr 10, 2026
bug-ops enabled auto-merge (squash) April 10, 2026 23:29
bug-ops merged commit 20e0b38 into main Apr 10, 2026
34 checks passed
bug-ops deleted the prometheus-metrics-export branch April 10, 2026 23:35

Labels

core, documentation, enhancement, rust
