Benchmark local LLMs with precision: tokens/sec, latency percentiles, VRAM profiling, multi-format support.
ModelBench is a comprehensive benchmarking framework for local Large Language Models (LLMs) running on GPU. It provides detailed performance metrics including token generation speed, latency percentiles (P50/P95/P99/P99.9), VRAM profiling, time-to-first-token (TTFT), and multi-model comparison with beautiful HTML reports.
+------------------------------------------------------------------+
|                          ModelBench CLI                          |
|    modelbench run | compare | report | gpu-info | list-formats   |
+------------------------------------------------------------------+
        |                |            |              |
        v                v            v              v
+----------------+ +----------+ +-----------+ +-------------+
| BenchmarkRunner| | Comparer | | Reporter  | | GPU Profiler|
|                | |          | |           | |             |
| - warmup       | | - multi  | | - HTML    | | - VRAM      |
| - multi-prompt | |   model  | | - JSON    | | - Util %    |
| - batch test   | |   bench  | | - Markdown| | - Temp      |
| - TTFT         | | - ranking| | - CSV     | | - Power     |
| - percentiles  | |          | | - SVG     | | - Timeline  |
+-------+--------+ +----+-----+ +----+------+ +------+------+
        |                |            |              |
        v                v            v              v
+------------------------------------------------------------------+
|                      Model Loader (Unified)                      |
|  HuggingFace (safetensors, bf16) | GGUF (llama.cpp) | GPTQ/AWQ   |
+------------------------------------------------------------------+
           |                     |                     |
           v                     v                     v
+------------------------------------------------------------------+
|                        GPU (NVIDIA CUDA)                         |
+------------------------------------------------------------------+
- Multi-Format Support: HuggingFace Transformers, GGUF (llama-cpp-python), GPTQ, AWQ
- Comprehensive Metrics: tokens/sec, prompt processing speed, TTFT, latency percentiles
- GPU Profiling: Real-time VRAM tracking, GPU utilization, temperature, power draw
- Rich Reports: HTML with inline SVG charts, JSON, Markdown, CSV
- Multi-Model Comparison: Side-by-side benchmarks with ranking
- Statistical Stability: Multiple runs with warmup, percentile calculations (see the sketch after this list)
- Auto-Detection: Automatic model format detection from path/name
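
The Statistical Stability item above can be pictured with a short sketch: warmup iterations are discarded and percentiles are computed over the remaining per-run latencies. This is illustrative only, not ModelBench's internal code, and it assumes a user-supplied `generate_fn` callable that performs one full generation.

import time

import numpy as np

def measure_latencies(generate_fn, num_runs=5, warmup_runs=2):
    """Time generate_fn over several runs, discarding warmup iterations."""
    latencies_ms = []
    for i in range(warmup_runs + num_runs):
        start = time.perf_counter()
        generate_fn()  # one full generation (prompt -> max_new_tokens)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup_runs:  # keep only post-warmup measurements
            latencies_ms.append(elapsed_ms)
    return {
        "p50_ms": float(np.percentile(latencies_ms, 50)),
        "p95_ms": float(np.percentile(latencies_ms, 95)),
        "p99_ms": float(np.percentile(latencies_ms, 99)),
        "p99_9_ms": float(np.percentile(latencies_ms, 99.9)),
    }
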
================================================================
MODEL: meta-llama/Llama-2-7b-hf (bfloat16)
GPU: NVIDIA RTX 4090 (24GB)
================================================================
Generation Speed: 52.3 tok/s
Prompt Processing: 487.1 tok/s
Time to First Token: 68.2 ms
Peak VRAM: 13,847 MB
Latency Percentiles:
+----------+------------+
| P50      |   1,412 ms |
| P95      |   2,034 ms |
| P99      |   2,198 ms |
| P99.9    |   2,389 ms |
+----------+------------+
VRAM Timeline (ASCII):
14.0 GB | ************************************
13.5 GB | **
13.0 GB | *
 0.5 GB |*
        +-------------------------------------->
        0s          5s          10s         15s
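
A VRAM timeline like the one above can be collected by polling NVML from a background thread while the benchmark runs. The sketch below assumes the nvidia-ml-py (pynvml) package and a hypothetical `sample_gpu_timeline` helper; it illustrates the idea rather than ModelBench's actual profiler.

import threading
import time

import pynvml  # pip install nvidia-ml-py

def sample_gpu_timeline(samples, stop_event, interval_s=0.5, device_index=0):
    """Append one GPU sample (VRAM, utilization, temperature, power) per interval."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    t0 = time.monotonic()
    while not stop_event.is_set():
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        samples.append({
            "t_s": time.monotonic() - t0,
            "vram_mb": mem.used / (1024 ** 2),
            "gpu_util_pct": util.gpu,
            "temp_c": pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU),
            "power_w": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,  # mW -> W
        })
        stop_event.wait(interval_s)
    pynvml.nvmlShutdown()

# Usage sketch: start the sampler, run the benchmark, then stop it.
samples, stop = [], threading.Event()
sampler = threading.Thread(target=sample_gpu_timeline, args=(samples, stop), daemon=True)
sampler.start()
# ... run generation here ...
stop.set()
sampler.join()
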
# Clone the repository
git clone https://github.com/ayinedjimi/ModelBench.git
cd ModelBench
# Install with pip
pip install -e .
# With GGUF support
pip install -e ".[gguf]"
# With all optional dependencies
pip install -e ".[all]"

# Benchmark a HuggingFace model
modelbench run meta-llama/Llama-2-7b-hf
# Benchmark a GGUF model
modelbench run ./models/llama-2-7b.Q4_K_M.gguf
# Compare multiple models
modelbench compare meta-llama/Llama-2-7b-hf ./models/llama-2-7b.Q4_K_M.gguf
# Generate reports from saved JSON
modelbench report benchmark_result.json --formats html,markdown
# Display GPU information
modelbench gpu-info

from modelbench import BenchmarkRunner, ReportGenerator
from modelbench.models import BenchmarkConfig, PromptCategory
# Configure benchmark
config = BenchmarkConfig(
model_path="meta-llama/Llama-2-7b-hf",
max_new_tokens=128,
num_runs=5,
warmup_runs=2,
prompt_categories=[PromptCategory.SHORT, PromptCategory.MEDIUM],
gpu_profiling=True,
)
# Run benchmark
runner = BenchmarkRunner(config)
result = runner.run()
# Print key metrics
print(f"Generation: {result.throughput.generation_tokens_per_sec:.1f} tok/s")
print(f"TTFT: {result.throughput.time_to_first_token_ms:.1f} ms")
print(f"P99 Latency: {result.latency.p99_ms:.2f} ms")
print(f"Peak VRAM: {result.peak_vram_mb:.0f} MB")
# Generate all reports
reporter = ReportGenerator(result)
reporter.generate_all("./reports")

| Command | Description |
|---|---|
| `modelbench run <model>` | Benchmark a single model |
| `modelbench compare <m1> <m2> ...` | Compare multiple models |
| `modelbench report <json>` | Generate reports from JSON |
| `modelbench gpu-info` | Display GPU information |
| `modelbench list-formats` | List supported model formats |
| Format | Library | Detection | Example |
|---|---|---|---|
| HuggingFace | transformers | Default | meta-llama/Llama-2-7b-hf |
| GGUF | llama-cpp-python | `.gguf` extension | model.Q4_K_M.gguf |
| GPTQ | auto-gptq | `gptq` in name | TheBloke/Model-GPTQ |
| AWQ | transformers | `awq` in name | TheBloke/Model-AWQ |
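
The Detection column boils down to simple path and name heuristics. A minimal sketch of what such auto-detection could look like follows (the function name and exact rules are illustrative, not necessarily the ones ModelBench applies):

from pathlib import Path

def detect_model_format(model_path: str) -> str:
    """Guess the model format from a local path or Hugging Face repo name."""
    name = Path(model_path).name.lower()
    if name.endswith(".gguf"):
        return "gguf"          # e.g. model.Q4_K_M.gguf -> llama-cpp-python
    if "gptq" in name:
        return "gptq"          # e.g. TheBloke/Model-GPTQ -> auto-gptq
    if "awq" in name:
        return "awq"           # e.g. TheBloke/Model-AWQ -> transformers
    return "huggingface"       # default: standard Transformers / safetensors
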
ModelBench is a comprehensive benchmarking framework for local Large Language Models (LLMs) on GPU. It provides detailed performance metrics including token generation speed, latency percentiles (P50/P95/P99/P99.9), VRAM profiling, time-to-first-token (TTFT), and multi-model comparison with HTML reports.
- Multi-Format Support: HuggingFace Transformers, GGUF (llama-cpp-python), GPTQ, AWQ
- Comprehensive Metrics: tokens/sec, prompt processing speed, TTFT, latency percentiles
- GPU Profiling: Real-time VRAM tracking, GPU utilization, temperature, power draw
- Rich Reports: HTML with inline SVG charts, JSON, Markdown, CSV
- Multi-Model Comparison: Side-by-side benchmarks with ranking
- Statistical Stability: Multiple runs with warmup, percentile calculations
- Auto-Detection: Automatic model format detection from path/name
# Clone the repository
git clone https://github.com/ayinedjimi/ModelBench.git
cd ModelBench
# Install
pip install -e ".[all]"
# Run a benchmark
modelbench run meta-llama/Llama-2-7b-hf
# Compare models
modelbench compare model1 model2
# GPU information
modelbench gpu-info

from modelbench import BenchmarkRunner
from modelbench.models import BenchmarkConfig
config = BenchmarkConfig(
    model_path="meta-llama/Llama-2-7b-hf",
    max_new_tokens=128,
    num_runs=5,
)
runner = BenchmarkRunner(config)
result = runner.run()
print(f"Vitesse: {result.throughput.generation_tokens_per_sec:.1f} tok/s")
print(f"VRAM max: {result.peak_vram_mb:.0f} MB")