Find the optimal draft model for speculative decoding on your hardware.
The problem: Speculative decoding can speed up LLM inference by 50-80%, but only if you pick the right draft model. Too small and too few of its proposed tokens are accepted. Too large and the drafting overhead kills your gains. The optimal choice depends on your target model, quantization, and hardware.
The solution: draftbench automatically tests every combination of target + draft models you give it, measures the throughput, and shows you which pairing works best. Instead of guessing, you get data.
How it works:
- You provide a list of target models (the big ones you want to run fast)
- You provide a list of draft models (smaller models from the same family)
- draftbench tests each combination: baseline speed, then speed with each draft
- Results are saved to JSON and visualized as interactive charts
Speculative decoding uses a small "draft" model to propose tokens that a larger "target" model then verifies. When the draft model predicts correctly, multiple tokens are accepted in a single forward pass, significantly speeding up generation.
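The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration assuming greedy decoding; `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, not part of draftbench or llama.cpp, and a real engine verifies all proposed tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int = 4) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy case)."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position in order.
    emitted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target's pass yields one bonus token.
    emitted.append(target(ctx))
    return emitted


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except right after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(toy_target, toy_draft, [0]))  # [1, 2, 3, 4]
print(speculative_step(toy_target, toy_draft, [5]))  # [6, 7, 8, 9, 0]
```

In both cases the output is identical to what the target alone would produce; the draft only changes how many tokens come out of each target pass.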
Key findings from our benchmarks:
- Slow targets (72B Q8_0 @ 6 tok/s): +80% speedup with the right draft model
- Fast targets (72B Q4_K_M @ 9.5 tok/s): +12% speedup - diminishing returns
- Sweet spot: 3B Q4_K_M draft works well across different target sizes
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. # Add -DLLAMA_CUDA=ON for NVIDIA or -DLLAMA_METAL=ON for Apple Silicon
cmake --build . --config Release -j
You need GGUF models from the same family (same tokenizer). We use Qwen 2.5:
Target models (large):
- Qwen2.5-72B-Instruct-Q8_0.gguf or Q4_K_M
- Qwen2.5-32B-Instruct-Q8_0.gguf or Q4_K_M
- Qwen2.5-14B-Instruct-Q8_0.gguf
- Qwen2.5-7B-Instruct-Q8_0.gguf
Draft models (small):
- qwen2.5-0.5b-instruct-q8_0.gguf
- qwen2.5-1.5b-instruct-q4_k_m.gguf
- qwen2.5-3b-instruct-q4_k_m.gguf
Download from Hugging Face: Qwen or bartowski
pip install requests
No other dependencies - charts use Plotly.js CDN.
Test a single target + draft combination:
# Start server with speculative decoding
llama-server \
-m /path/to/target-model.gguf \
--model-draft /path/to/draft-model.gguf \
-ngl 99 -c 4096 --port 8080
# In another terminal, run benchmark
python bench.py --url http://127.0.0.1:8080 --requests 5 --max-tokens 512
# Without draft (baseline)
python server.py llama-cpp --model-path /path/to/72b-model.gguf
# With draft (speculative decoding)
python server.py llama-cpp \
--model-path /path/to/72b-model.gguf \
--draft-path /path/to/3b-draft.gguf
Options: --port 8080, --gpu-layers 99, --ctx-size 4096
A sweep tests all combinations of target and draft models automatically.
Create configs/my_sweep.json:
{
"name": "qwen25-72b",
"hardware": "rtx4090-24gb",
"backend": "llamacpp",
"model_family": "Qwen2.5",
"targets": [
{"label": "72B Q8_0", "path": "/path/to/Qwen2.5-72B-Instruct-Q8_0.gguf"},
{"label": "72B Q4_K_M", "path": "/path/to/Qwen2.5-72B-Instruct-Q4_K_M.gguf"}
],
"drafts": [
{"label": "0.5B Q8_0", "path": "/path/to/qwen2.5-0.5b-instruct-q8_0.gguf"},
{"label": "1.5B Q4_K_M", "path": "/path/to/qwen2.5-1.5b-instruct-q4_k_m.gguf"},
{"label": "3B Q4_K_M", "path": "/path/to/qwen2.5-3b-instruct-q4_k_m.gguf"}
],
"settings": {
"llama_bin": "/path/to/llama-server",
"runs": 1,
"max_tokens": 1024,
"temperature": 0.0,
"gpu_layers": 99,
"ctx_size": 4096,
"port": 8080
}
}
Metadata fields:
- name: Short identifier for this sweep (used in filenames)
- hardware: Hardware identifier (e.g., rtx4090-24gb, a100-80gb)
- backend: Inference backend (llamacpp, vllm, lmstudio)
- model_family: Model family name for chart titles
Settings:
- llama_bin: Path to the llama-server binary (auto-detected from PATH if omitted)
# Run a single config
python sweep.py --config configs/my_sweep.json
# Creates: results/<hardware>_<backend>_<name>.json
# results/<hardware>_<backend>_<name>.html
# Or specify custom output paths
python sweep.py --config configs/my_sweep.json --results results/custom.json --chart results/custom.html
This will:
- Test each target model without a draft (baseline)
- Test each target + draft combination
- Save results incrementally to JSON (with hardware/backend metadata)
- Generate interactive charts in HTML
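The run order listed above can be sketched as a small generator over the config. This is an illustrative sketch, not the actual code in sweep.py; `enumerate_runs` is a hypothetical helper, and the inline config repeats only the `targets` and `drafts` keys from the example config.

```python
from typing import Iterator, Optional, Tuple


def enumerate_runs(config: dict) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (target_label, draft_label) pairs; draft_label is None for the baseline."""
    for target in config["targets"]:
        # Baseline first, so later runs can be compared against it.
        yield target["label"], None
        for draft in config["drafts"]:
            yield target["label"], draft["label"]


config = {
    "targets": [{"label": "72B Q8_0"}, {"label": "72B Q4_K_M"}],
    "drafts": [{"label": "0.5B Q8_0"}, {"label": "1.5B Q4_K_M"}, {"label": "3B Q4_K_M"}],
}

runs = list(enumerate_runs(config))
for i, (target, draft) in enumerate(runs, start=1):
    suffix = "(baseline)" if draft is None else f"+ {draft}"
    print(f"[{i}/{len(runs)}] {target} {suffix}")
```

With 2 targets and 3 drafts this yields 2 x (1 + 3) = 8 runs, since each target gets one baseline run in addition to its draft runs.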
If you have several config files (e.g., one per target model size), you can run them all in sequence:
python sweep.py --config-dir configs/
This finds all *.json files in the directory (excluding example_*.json templates), runs each sweep back-to-back, and generates separate results and charts for each. If one config fails, it skips to the next and reports a summary at the end.
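The discovery and skip-on-failure behavior described above could look roughly like the following. This is a sketch under assumptions: `find_configs` and `run_all` are hypothetical names, and the real sweep.py may order files or report failures differently.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Callable, List, Tuple


def find_configs(config_dir: str) -> List[Path]:
    """Return *.json files in config_dir, skipping example_*.json templates."""
    return sorted(
        p for p in Path(config_dir).glob("*.json")
        if not fnmatch(p.name, "example_*.json")
    )


def run_all(configs: list, run_one: Callable) -> List[Tuple[object, Exception]]:
    """Run each config in sequence; collect failures instead of aborting."""
    failed = []
    for cfg in configs:
        try:
            run_one(cfg)
        except Exception as exc:  # one bad config should not stop the batch
            failed.append((cfg, exc))
    return failed


if __name__ == "__main__":
    configs = find_configs("configs")
    # The lambda stands in for the real per-config sweep.
    failures = run_all(configs, lambda cfg: print(f"running {cfg}"))
    print(f"{len(configs) - len(failures)} succeeded, {len(failures)} failed")
```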
Example output:
============================================================
Sweep: 2 targets x 3 drafts = 8 runs
============================================================
[1/8] 72B Q8_0 (baseline)
Starting server ... ready
Benchmarking ... 5.93 tok/s
Server stopped
[2/8] 72B Q8_0 + 0.5B Q8_0
Starting server ... ready
Benchmarking ... 9.83 tok/s (acceptance: 57%)
Server stopped
...
=== Sweep complete ===
Results saved to results.json
Chart saved to chart.html
If you stopped a sweep early or want to regenerate charts:
python sweep.py --results results.json --chart chart.html --chart-only
open chart.html
The generated HTML file contains three interactive charts:
Bar chart showing tokens/second for each target model with:
- Baseline (no draft)
- Best draft from each size category (0.5B, 1.5B, 3B, 7B)
Bar chart showing percentage improvement over baseline for each draft size.
Color-coded matrix showing speedup % for every target + draft combination:
- Green = good speedup (60-80%+)
- Yellow = moderate speedup (30-50%)
- Orange/Red = minimal or negative impact
Hover over any cell for details.
- Slow target models: The slower your target, the more you gain
  - 72B Q8_0 (6 tok/s baseline) → +80% with 3B draft
  - 72B Q4_K_M (9.5 tok/s baseline) → +12% with 3B draft
- Same model family: Draft and target must share the same tokenizer
  - Qwen 2.5 family: 0.5B through 72B all compatible
  - Mixing families (e.g., Llama 3 + Llama 3.2) causes token translation overhead
- Draft size sweet spot:
  - Too small (0.5B): ~57% acceptance rate, limited gains
  - Sweet spot (1.5B-3B): ~68-70% acceptance, best throughput
  - Too large (7B): ~72% acceptance but draft is too slow
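The trade-off between acceptance rate and draft cost can be made concrete with the standard back-of-the-envelope model for speculative decoding. This sketch assumes each drafted token is accepted independently with the same probability, which real workloads only approximate; the function names and the `draft_cost` values in the usage loop are hypothetical, and the acceptance rate reported by a server may be defined differently.

```python
def expected_tokens_per_step(acceptance: float, k: int) -> float:
    """Expected tokens emitted per target pass when k tokens are drafted.

    With independent per-token acceptance probability a, this is
    1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if acceptance >= 1.0:
        return float(k + 1)
    return (1 - acceptance ** (k + 1)) / (1 - acceptance)


def estimated_speedup(acceptance: float, k: int, draft_cost: float) -> float:
    """Throughput ratio vs. baseline.

    draft_cost is the time for one draft token divided by the time for one
    target pass, so a step costs (1 + k * draft_cost) target passes.
    """
    return expected_tokens_per_step(acceptance, k) / (1 + k * draft_cost)


# Illustrative only: acceptance rates echo the list above, costs are made up.
for label, acceptance, cost in [("0.5B", 0.57, 0.02), ("3B", 0.69, 0.06), ("7B", 0.72, 0.25)]:
    print(f"{label}: {estimated_speedup(acceptance, k=4, draft_cost=cost):.2f}x")
```

Under this model a slightly higher acceptance rate cannot make up for a draft that costs a large fraction of a target pass, which is consistent with the 7B observation.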
| Target Speed | Recommended Draft | Expected Gain |
|---|---|---|
| < 8 tok/s | 3B Q4_K_M | +60-80% |
| 8-15 tok/s | 1.5B-3B Q4_K_M | +10-30% |
| 15-30 tok/s | 0.5B-1.5B Q4_K_M | +5-15% |
| > 30 tok/s | Not recommended | Overhead > gains |
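The table can be encoded as a simple lookup if you want to pick a starting draft programmatically. This is a sketch: `recommend_draft` is a hypothetical helper, it reads the third row as covering 15-30 tok/s, and the thresholds come from the Qwen 2.5 benchmarks above, so they may not transfer to other model families or hardware.

```python
def recommend_draft(baseline_tps: float) -> str:
    """Map a target's baseline speed (tok/s) to the table's suggested draft."""
    if baseline_tps < 8:
        return "3B Q4_K_M"
    if baseline_tps <= 15:
        return "1.5B-3B Q4_K_M"
    if baseline_tps <= 30:
        return "0.5B-1.5B Q4_K_M"
    return "not recommended"


print(recommend_draft(5.93))  # 3B Q4_K_M
print(recommend_draft(9.5))   # 1.5B-3B Q4_K_M
print(recommend_draft(45.0))  # not recommended
```

Treat the result as a first guess to confirm with a sweep, not a substitute for measuring.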
Results are saved as JSON with full metadata:
{
"timestamp": "2026-02-05T01:18:17.066992+00:00",
"name": "qwen25-72b",
"hardware": "rtx4090-24gb",
"backend": "llamacpp",
"model_family": "Qwen2.5",
"settings": { ... },
"results": [
{
"target": "72B Q8_0",
"draft": null,
"mean_tps": 5.93,
"median_tps": 6.05,
"mean_ttft": 0.901,
"mean_total_time": 87.34,
"acceptance_rate": null
},
{
"target": "72B Q8_0",
"draft": "3B Q4_K_M",
"mean_tps": 10.56,
"median_tps": 10.06,
"mean_ttft": 0.671,
"mean_total_time": 49.93,
"acceptance_rate": 0.6888
}
]
}
draftbench/
├── bench.py # Core benchmark logic
├── server.py # Server launcher (llama.cpp, LM Studio, vLLM)
├── sweep.py # Automated sweep + chart generation
├── configs/ # Sweep configuration files
│ └── example_sweep.json # Template - copy and customize
├── results/ # Benchmark results and charts (auto-generated)
│ ├── *.json # Raw results with metadata
│ └── *.html # Interactive Plotly visualizations
└── README.md
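Files in results/ can be post-processed directly. The sketch below computes percentage speedup over baseline from the results format shown earlier; `speedups` is a hypothetical helper rather than something shipped in the repository, and the inline data repeats only the fields it needs.

```python
from typing import List, Tuple


def speedups(data: dict) -> List[Tuple[str, str, float]]:
    """Return (target, draft, percent speedup over that target's baseline)."""
    # Baseline rows are the ones with no draft model.
    baselines = {
        r["target"]: r["mean_tps"]
        for r in data["results"]
        if r["draft"] is None
    }
    out = []
    for r in data["results"]:
        base = baselines.get(r["target"])
        if r["draft"] is None or not base:
            continue  # skip baselines and targets without a usable baseline
        out.append((r["target"], r["draft"], 100.0 * (r["mean_tps"] / base - 1.0)))
    return out


sample = {
    "results": [
        {"target": "72B Q8_0", "draft": None, "mean_tps": 5.93},
        {"target": "72B Q8_0", "draft": "3B Q4_K_M", "mean_tps": 10.56},
    ]
}

for target, draft, pct in speedups(sample):
    print(f"{target} + {draft}: {pct:+.1f}%")  # 72B Q8_0 + 3B Q4_K_M: +78.1%
```

To use it on a real file, load it with `json.load` and pass the resulting dict to `speedups`.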
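- If the server fails to start because the port is still in use, wait a few seconds between runs or change the port in your config.
- If speculative decoding fails or the acceptance rate is very low, your draft and target models may have incompatible tokenizers. Use models from the same family.
- If throughput with a draft is lower than the baseline, your target model is already fast enough that draft overhead hurts. Try a smaller draft or skip speculative decoding.
- If you run out of memory, reduce gpu_layers or use more aggressive quantization (Q4_0 instead of Q8_0).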