server: expose speculative decoding counters in Prometheus metrics #23328
Draft
boxcee wants to merge 1 commit into
Hi @boxcee, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
e3f5d53 to d735725
Adds two new counters to the /metrics endpoint:
- llamacpp:spec_tokens_drafted_total
- llamacpp:spec_tokens_accepted_total

Accumulated in server_metrics::on_prediction() from the per-slot n_draft_total and n_draft_accepted fields. Divide accepted by drafted to get acceptance rate.
d735725 to 122fb87
Adds two new counters to the `/metrics` Prometheus endpoint:
- `llamacpp:spec_tokens_drafted_total`
- `llamacpp:spec_tokens_accepted_total`

Divide accepted by drafted to get acceptance rate.
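Because both metrics are cumulative counters, a rate for a recent window comes from the difference between two scrapes rather than from the raw totals. The following is a minimal sketch of that arithmetic, not code from this PR; the `spec_counters` struct and `acceptance_rate` function are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical snapshot of the two counters as read from one /metrics scrape.
struct spec_counters {
    uint64_t drafted;   // value of llamacpp:spec_tokens_drafted_total
    uint64_t accepted;  // value of llamacpp:spec_tokens_accepted_total
};

// Acceptance rate over the interval between two scrapes.
// Returns std::nullopt when nothing was drafted in the interval, or when a
// counter went backwards (for example after a server restart).
std::optional<double> acceptance_rate(const spec_counters & prev, const spec_counters & cur) {
    if (cur.drafted <= prev.drafted || cur.accepted < prev.accepted) {
        return std::nullopt;
    }
    const uint64_t d_drafted  = cur.drafted  - prev.drafted;
    const uint64_t d_accepted = cur.accepted - prev.accepted;
    return static_cast<double>(d_accepted) / static_cast<double>(d_drafted);
}
```

Returning an empty optional instead of `0.0` keeps "no speculation happened" distinguishable from "every draft token was rejected".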
The counters reuse `n_draft_total` / `n_draft_accepted` introduced in #22673. Those fields are already tracked per slot during speculative decoding. No new tracking logic, just plumbing those values to the metrics handler via `server_metrics::on_prediction()`.

Without this, the only way to watch acceptance rate is to grep the server logs. Grafana and similar tools can now track it directly.
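The plumbing described above can be sketched roughly as follows. This is a simplified stand-in, not the actual diff: `slot_sketch`, `metrics_sketch`, the aggregate field names, the field types, and the HELP strings are all assumptions made for illustration. It also assumes the per-slot values cover a single request, so adding them on each prediction does not double-count.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Simplified stand-in for a server slot; only the two fields named in the PR.
// The real field types may differ.
struct slot_sketch {
    uint64_t n_draft_total;     // draft tokens proposed for this request
    uint64_t n_draft_accepted;  // draft tokens accepted for this request
};

// Simplified stand-in for the server-wide metrics aggregate.
struct metrics_sketch {
    uint64_t n_spec_drafted_total  = 0;  // hypothetical field name
    uint64_t n_spec_accepted_total = 0;  // hypothetical field name

    // Called once per finished prediction; folds the slot values into the totals.
    void on_prediction(const slot_sketch & slot) {
        n_spec_drafted_total  += slot.n_draft_total;
        n_spec_accepted_total += slot.n_draft_accepted;
    }

    // Renders the two counters in Prometheus text exposition format.
    // The HELP strings are illustrative only.
    std::string to_prometheus() const {
        std::ostringstream out;
        out << "# HELP llamacpp:spec_tokens_drafted_total Draft tokens proposed.\n"
            << "# TYPE llamacpp:spec_tokens_drafted_total counter\n"
            << "llamacpp:spec_tokens_drafted_total " << n_spec_drafted_total << "\n"
            << "# HELP llamacpp:spec_tokens_accepted_total Draft tokens accepted.\n"
            << "# TYPE llamacpp:spec_tokens_accepted_total counter\n"
            << "llamacpp:spec_tokens_accepted_total " << n_spec_accepted_total << "\n";
        return out.str();
    }
};
```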
Changes:
- `server_metrics`: two new `uint64_t` fields, incremented in `on_prediction()`
- `server_task_result_metrics`: two new fields to carry the values to the HTTP handler
- `counter` entries in `all_metrics_def`

AI disclosure: I used Claude Code to help locate the relevant code paths and scaffold the initial implementation. I've reviewed every line and can explain it.