server: expose speculative decoding counters in Prometheus metrics #23328

Draft
boxcee wants to merge 1 commit into ggml-org:master from boxcee:feat/prometheus-spec-metrics

boxcee commented May 19, 2026

Adds two new counters to the /metrics Prometheus endpoint:

  • llamacpp:spec_tokens_drafted_total
  • llamacpp:spec_tokens_accepted_total

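Divide accepted by drafted to get the acceptance rate. Both values are cumulative counters, so a rate for a given time window comes from dividing their increases over that window.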

The counters reuse n_draft_total / n_draft_accepted introduced in #22673. Those fields are already tracked per slot during speculative decoding. This PR adds no new tracking logic; it only plumbs those values to the metrics handler via server_metrics::on_prediction().

Without this, the only way to watch the acceptance rate is to grep the server logs. With these counters, Grafana and similar tools can track it directly.

Changes:

  • server_metrics: two new uint64_t fields, incremented in on_prediction()
  • server_task_result_metrics: two new fields to carry the values to the HTTP handler
  • Prometheus handler: two new counter entries in all_metrics_def
  • README: metrics table updated
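
To make the shape of these changes concrete, here is a minimal, self-contained C++ sketch of the plumbing. It is not the code from this PR: slot_sketch, metrics_sketch, write_counter, render_metrics, the accumulator field names and the help strings are all hypothetical. Only n_draft_total, n_draft_accepted, on_prediction() and the two metric names are taken from the description. It also assumes on_prediction() is called once per completed request with that request's per-slot totals.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in for the per-slot state. Only the two field names
// come from the PR description.
struct slot_sketch {
    uint64_t n_draft_total    = 0;
    uint64_t n_draft_accepted = 0;
};

// Hypothetical stand-in for the server-wide metrics accumulator.
struct metrics_sketch {
    uint64_t n_spec_drafted_total  = 0;  // assumed field name
    uint64_t n_spec_accepted_total = 0;  // assumed field name

    // Assumption: called once per finished request with that request's totals.
    void on_prediction(const slot_sketch & slot) {
        n_spec_drafted_total  += slot.n_draft_total;
        n_spec_accepted_total += slot.n_draft_accepted;
    }
};

// Writes one counter in the Prometheus text exposition format.
static void write_counter(std::ostream & out, const std::string & name,
                          const std::string & help, uint64_t value) {
    out << "# HELP " << name << " " << help << "\n"
        << "# TYPE " << name << " counter\n"
        << name << " " << value << "\n";
}

// Renders both counters the way a /metrics handler might.
std::string render_metrics(const metrics_sketch & m) {
    std::ostringstream out;
    write_counter(out, "llamacpp:spec_tokens_drafted_total",
                  "Number of draft tokens generated.", m.n_spec_drafted_total);
    write_counter(out, "llamacpp:spec_tokens_accepted_total",
                  "Number of draft tokens accepted.", m.n_spec_accepted_total);
    return out.str();
}
```

Keeping the accumulator as plain uint64_t sums matches the counter semantics Prometheus expects: the values only ever grow while the server is running, and consumers derive rates from the differences between scrapes.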

AI disclosure: I used Claude Code to help locate the relevant code paths and scaffold the initial implementation. I've reviewed every line and can explain it.

ggml-gh-bot Bot commented May 19, 2026

Hi @boxcee, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

boxcee force-pushed the feat/prometheus-spec-metrics branch from e3f5d53 to d735725 May 19, 2026 08:47
Adds two new counters to the /metrics endpoint:
- llamacpp:spec_tokens_drafted_total
- llamacpp:spec_tokens_accepted_total

Accumulated in server_metrics::on_prediction() from the per-slot
n_draft_total and n_draft_accepted fields. Divide accepted by drafted
to get acceptance rate.
boxcee force-pushed the feat/prometheus-spec-metrics branch from d735725 to 122fb87 May 19, 2026 08:59