feat(metrics): implement Phase 2 Grafana stack and Phase 3 histogram metrics (#2865) #2875
Merged
…metrics (#2865)

Phase 2 (Grafana stack):
- docker/docker-compose.metrics.yml: Prometheus v3.4.0 + Grafana 11.6.0 overlay with health checks, host.docker.internal networking for macOS/Linux
- docker/prometheus/prometheus.yml: scrape config targeting zeph /metrics endpoint
- docker/grafana/provisioning: auto-provisioned Prometheus datasource + dashboard
- docker/grafana/dashboards/zeph-overview.json: 7-row dashboard covering all 25 metrics
- book/src/guides/prometheus.md: setup guide with Linux networking troubleshooting

Phase 3 (histogram metrics):
- HistogramRecorder trait in zeph-core::metrics (object-safe, Send+Sync)
- 3 histograms in PrometheusMetrics: zeph_llm_latency_seconds, zeph_turn_duration_seconds, zeph_tool_execution_seconds with buckets [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]s
- Impl HistogramRecorder for PrometheusMetrics in src/metrics_export.rs
- Agent builder wiring: Agent::with_histogram_recorder(Arc<dyn HistogramRecorder>)
- Recording points: LLM call (native.rs record_chat_metrics_and_compact), turn end (utils.rs flush_turn_timings), per-tool (native.rs handle_native_tool_calls)
- Single elapsed capture per call site (gauge and histogram use same Duration)
- spawn_metrics_sync uses borrow_and_update() for efficient watch channel consumption
- Wired in runner.rs via shared Arc<PrometheusMetrics> for sync task and recorder

Closes #2865
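The wiring described in the Phase 3 bullets can be illustrated with a minimal, std-only sketch. Only `HistogramRecorder`, `Agent::with_histogram_recorder(Arc<dyn HistogramRecorder>)`, and the "single elapsed capture" rule come from the commit message. The trait's method names, the `InMemoryRecorder` type, the `last_llm_latency` field standing in for the gauge, and `record_llm_call` are hypothetical and are not taken from the zeph codebase.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Assumed shape of the trait. It is object-safe (no generic methods, no
/// `Self` in return position) and `Send + Sync`, so it can be shared as
/// `Arc<dyn HistogramRecorder>`. Method names are hypothetical.
pub trait HistogramRecorder: Send + Sync {
    fn observe_llm_latency(&self, elapsed: Duration);
    fn observe_turn_duration(&self, elapsed: Duration);
    fn observe_tool_execution(&self, elapsed: Duration);
}

/// Hypothetical in-memory recorder used here in place of `PrometheusMetrics`.
#[derive(Default)]
pub struct InMemoryRecorder {
    samples: Mutex<Vec<(&'static str, Duration)>>,
}

impl InMemoryRecorder {
    /// Returns a copy of everything recorded so far.
    pub fn samples(&self) -> Vec<(&'static str, Duration)> {
        self.samples.lock().unwrap().clone()
    }
}

impl HistogramRecorder for InMemoryRecorder {
    fn observe_llm_latency(&self, elapsed: Duration) {
        self.samples.lock().unwrap().push(("llm", elapsed));
    }
    fn observe_turn_duration(&self, elapsed: Duration) {
        self.samples.lock().unwrap().push(("turn", elapsed));
    }
    fn observe_tool_execution(&self, elapsed: Duration) {
        self.samples.lock().unwrap().push(("tool", elapsed));
    }
}

/// Heavily simplified stand-in for the real agent type.
pub struct Agent {
    recorder: Option<Arc<dyn HistogramRecorder>>,
    /// Stand-in for the existing gauge value (hypothetical field).
    pub last_llm_latency: Duration,
}

impl Agent {
    pub fn new() -> Self {
        Agent { recorder: None, last_llm_latency: Duration::ZERO }
    }

    /// Builder-style wiring, matching the signature named in the commit message.
    pub fn with_histogram_recorder(mut self, recorder: Arc<dyn HistogramRecorder>) -> Self {
        self.recorder = Some(recorder);
        self
    }

    /// Hypothetical recording point: `elapsed` is captured once and the same
    /// `Duration` feeds both the gauge stand-in and the histogram.
    pub fn record_llm_call(&mut self, started: Instant) {
        let elapsed = started.elapsed();
        self.last_llm_latency = elapsed;
        if let Some(recorder) = &self.recorder {
            recorder.observe_llm_latency(elapsed);
        }
    }
}
```

Capturing `elapsed` once means the gauge and the histogram can never disagree about the same call, which would be possible if each called `started.elapsed()` separately.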
This was linked to issues on Apr 10, 2026
Summary
- `docker/docker-compose.metrics.yml` overlay (Prometheus v3.4.0 + Grafana 11.6.0), auto-provisioned datasource + pre-built 7-row dashboard covering all 25 Phase 1 metrics, mdbook setup guide with Linux/macOS networking notes
- `HistogramRecorder` trait in `zeph-core::metrics`, 3 histograms (`zeph_llm_latency_seconds`, `zeph_turn_duration_seconds`, `zeph_tool_execution_seconds`) with buckets [0.1..60s], recording points in agent loop, single `elapsed` capture per call site, `borrow_and_update()` in sync task

Closes #2865
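For readers unfamiliar with how the bucket list is interpreted: Prometheus histogram buckets are cumulative, so each `le` bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts all observations. The sketch below illustrates that semantics with the bounds listed in this PR. It is a std-only illustration, not the `PrometheusMetrics` implementation, and `cumulative_counts` is a hypothetical helper.

```rust
/// Bucket upper bounds in seconds, as listed in the PR description.
pub const BUCKETS: [f64; 8] = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0];

/// Returns cumulative counts for each bucket in `BUCKETS`, followed by the
/// `+Inf` bucket in the last slot. Hypothetical helper for illustration only.
pub fn cumulative_counts(samples_secs: &[f64]) -> [u64; 9] {
    let mut counts = [0u64; 9];
    for &sample in samples_secs {
        for (i, &upper_bound) in BUCKETS.iter().enumerate() {
            // A sample lands in every bucket whose bound is >= the sample.
            if sample <= upper_bound {
                counts[i] += 1;
            }
        }
        // The +Inf bucket counts every observation.
        counts[8] += 1;
    }
    counts
}
```

With these bounds, any call slower than 60 s is visible only in the `+Inf` bucket, so quantile estimates above that point lose resolution.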
Non-blocking follow-ups filed
- `zeph_turn_duration_seconds` histogram
- `ZEPH_METRICS_HOST` env var in docker-compose
- `HistogramRecorder` wiring through agent builder and recording call sites

Test plan
- `cargo +nightly fmt --check`: clean
- `cargo build --features prometheus`: compiles
- `cargo build --features full`: compiles
- `cargo nextest run --workspace --features full --lib --bins`: 8109 passed
- `docker compose -f docker/docker-compose.metrics.yml up` + `curl localhost:8090/metrics` (manual, post-merge)