
spec: add backend sampling support for eagle3 #24655

Merged
ggerganov merged 1 commit into ggml-org:master from ruixiang63:eagle3-backend-sampling on Jun 16, 2026

@ruixiang63

Member

Overview

Follow-up to #23287: this PR adds backend sampling support for eagle3.

Performance results on SpeedBench

  • eagle3 baseline
python tools/server/bench/speed-bench/speed_bench.py --url localhost:8080 --bench qualitative --category all --osl 512 --concurrency 1 --limit 5 --output eagle3-qwen3-baseline.json

Summary (elapsed=353.27s)
category       samples  avg_prompt_t/s  avg_pred_t/s  avg_latency  accept_rate
-------------  -------  --------------  ------------  -----------  -----------
coding         5        1804.50         96.32         6.580s       0.4810     
humanities     5        446.18          82.61         7.500s       0.3669     
math           5        46.57           82.13         6.273s       0.3605     
qa             5        605.86          92.40         4.808s       0.4444     
rag            5        4787.48         97.18         6.307s       0.5010     
reasoning      5        201.42          82.05         6.298s       0.3574     
stem           5        47.51           82.29         6.259s       0.3605     
writing        5        4234.16         91.77         6.931s       0.4415     
multilingual   5        2644.99         95.54         5.463s       0.4741     
summarization  5        628.52          87.96         3.667s       0.4104     
roleplay       5        2782.64         89.17         10.565s      0.4230     
overall        55       1657.26         89.04         6.423s       0.4185     
  • eagle3 with backend sampling
python tools/server/bench/speed-bench/speed_bench.py --url localhost:8080 --bench qualitative --category all --osl 512 --concurrency 1 --limit 5 --output eagle3-qwen3-backend-sampling.json

Summary (elapsed=351.79s)
category       samples  avg_prompt_t/s  avg_pred_t/s  avg_latency  accept_rate
-------------  -------  --------------  ------------  -----------  -----------
coding         5        1817.48         97.18         6.577s       0.4810     
humanities     5        455.30          83.50         7.414s       0.3669     
math           5        46.17           83.14         6.196s       0.3605     
qa             5        622.08          93.79         4.736s       0.4444     
rag            5        3754.37         94.52         6.738s       0.5010     
reasoning      5        202.27          82.78         6.243s       0.3574     
stem           5        47.31           83.19         6.191s       0.3605     
writing        5        4273.63         93.15         6.835s       0.4415     
multilingual   5        2613.75         96.83         5.393s       0.4741     
summarization  5        642.39          89.23         3.602s       0.4104     
roleplay       5        2700.37         90.38         10.431s      0.4230     
overall        55       1561.37         89.79         6.396s       0.4185 
  • comparison
python tools/server/bench/speed-bench/speed_bench_compare.py --baseline eagle3-qwen3-baseline.json --speculative eagle3-qwen3-backend-sampling.json

Comparison: baseline=eagle3-qwen3-baseline.json speculative=eagle3-qwen3-backend-sampling.json
category       base_avg_pred_t/s  spec_avg_pred_t/s  decode_speedup  base_avg_latency  spec_avg_latency  latency_speedup  accept_rate
-------------  -----------------  -----------------  --------------  ----------------  ----------------  ---------------  -----------
coding         96.32              95.56              0.99x           6.580s            6.668s            0.99x            0.4810     
humanities     82.61              83.96              1.02x           7.500s            7.381s            1.02x            0.3669     
math           82.13              83.57              1.02x           6.273s            6.163s            1.02x            0.3605     
qa             92.40              93.98              1.02x           4.808s            4.725s            1.02x            0.4444     
rag            97.18              98.35              1.01x           6.307s            6.329s            1.00x            0.5010     
reasoning      82.05              83.00              1.01x           6.298s            6.225s            1.01x            0.3574     
stem           82.29              83.26              1.01x           6.259s            6.187s            1.01x            0.3605     
writing        91.77              92.99              1.01x           6.931s            6.843s            1.01x            0.4415     
multilingual   95.54              96.79              1.01x           5.463s            5.431s            1.01x            0.4741     
summarization  87.96              89.06              1.01x           3.667s            3.612s            1.02x            0.4104     
roleplay       89.17              90.50              1.01x           10.565s           10.430s           1.01x            0.4230     
overall        89.04              90.09              1.01x           6.423s            6.363s            1.01x            0.4185    
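
For readers checking the comparison table, here is a minimal sketch of how the two speedup columns appear to be derived. The formulas are an assumption inferred from the printed numbers, not taken from `speed_bench_compare.py`: decode speedup is taken as spec t/s divided by base t/s, and latency speedup as base latency divided by spec latency. The helper names `speedup_ratios` and `fmt` are hypothetical.

```python
# Assumed formulas (inferred from the table, not from the script itself):
#   decode_speedup  = spec_avg_pred_t/s / base_avg_pred_t/s
#   latency_speedup = base_avg_latency  / spec_avg_latency

def speedup_ratios(base_tps, spec_tps, base_latency_s, spec_latency_s):
    """Return (decode_speedup, latency_speedup) for one category row."""
    decode = spec_tps / base_tps                # higher throughput -> ratio > 1
    latency = base_latency_s / spec_latency_s   # lower latency -> ratio > 1
    return decode, latency


def fmt(ratio):
    """Format a ratio the way the table prints it, e.g. 1.01x."""
    return f"{ratio:.2f}x"


# Rows copied from the comparison table above:
# (base t/s, spec t/s, base latency in s, spec latency in s)
rows = {
    "coding": (96.32, 95.56, 6.580, 6.668),
    "humanities": (82.61, 83.96, 7.500, 7.381),
    "overall": (89.04, 90.09, 6.423, 6.363),
}

for name, values in rows.items():
    decode, latency = speedup_ratios(*values)
    print(f"{name:<12} decode={fmt(decode)} latency={fmt(latency)}")
```

Under these assumed formulas, the three sample rows reproduce the `decode_speedup` and `latency_speedup` values printed in the table.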

Requirements

@ruixiang63 ruixiang63 requested a review from a team as a code owner June 15, 2026 14:21
@ggerganov ggerganov self-assigned this Jun 15, 2026
@ggerganov ggerganov added the merge ready label Jun 16, 2026
@ggerganov ggerganov merged commit a182490 into ggml-org:master Jun 16, 2026
25 checks passed
@ruixiang63 ruixiang63 deleted the eagle3-backend-sampling branch June 16, 2026 13:29
papamoose pushed a commit to papamoose/llama.cpp that referenced this pull request Jun 27, 2026