Benchmark local LLMs with precision: tokens/sec, latency percentiles, VRAM profiling, multi-format support.
ModelBench is a comprehensive benchmarking framework for local Large Language Models (LLMs) running on GPU. It provides detailed performance metrics including token generation speed, latency percentiles (P50/P95/P99/P99.9), VRAM profiling, time-to-first-token (TTFT), and multi-model comparison with beautiful HTML reports.
+------------------------------------------------------------------+
|                          ModelBench CLI                          |
|    modelbench run | compare | report | gpu-info | list-formats   |
+------------------------------------------------------------------+
        |                |            |              |
        v                v            v              v
+----------------+ +----------+ +-----------+ +-------------+
| BenchmarkRunner| | Comparer | | Reporter  | | GPU Profiler|
|                | |          | |           | |             |
| - warmup       | | - multi  | | - HTML    | | - VRAM      |
| - multi-prompt | |   model  | | - JSON    | | - Util %    |
| - batch test   | |   bench  | | - Markdown| | - Temp      |
| - TTFT         | | - ranking| | - CSV     | | - Power     |
| - percentiles  | |          | | - SVG     | | - Timeline  |
+-------+--------+ +----+-----+ +----+------+ +------+------+
        |                |            |              |
        v                v            v              v
+------------------------------------------------------------------+
|                      Model Loader (Unified)                      |
|  HuggingFace (safetensors, bf16) | GGUF (llama.cpp) | GPTQ/AWQ   |
+------------------------------------------------------------------+
           |                     |                     |
           v                     v                     v
+------------------------------------------------------------------+
|                        GPU (NVIDIA CUDA)                         |
+------------------------------------------------------------------+
- Multi-Format Support: HuggingFace Transformers, GGUF (llama-cpp-python), GPTQ, AWQ
- Comprehensive Metrics: tokens/sec, prompt processing speed, TTFT, latency percentiles
- GPU Profiling: Real-time VRAM tracking, GPU utilization, temperature, power draw
- Rich Reports: HTML with inline SVG charts, JSON, Markdown, CSV
- Multi-Model Comparison: Side-by-side benchmarks with ranking
- Statistical Stability: Multiple runs with warmup, percentile calculations (see the sketch after this list)
- Auto-Detection: Automatic model format detection from path/name
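
The Statistical Stability item above can be pictured with a short sketch: warmup iterations are discarded and percentiles are computed over the remaining per-run latencies. This is illustrative only, not ModelBench's internal code, and it assumes a user-supplied `generate_fn` callable that performs one full generation.

import time

import numpy as np

def measure_latencies(generate_fn, num_runs=5, warmup_runs=2):
    """Time generate_fn over several runs, discarding warmup iterations."""
    latencies_ms = []
    for i in range(warmup_runs + num_runs):
        start = time.perf_counter()
        generate_fn()  # one full generation (prompt -> max_new_tokens)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup_runs:  # keep only post-warmup measurements
            latencies_ms.append(elapsed_ms)
    return {
        "p50_ms": float(np.percentile(latencies_ms, 50)),
        "p95_ms": float(np.percentile(latencies_ms, 95)),
        "p99_ms": float(np.percentile(latencies_ms, 99)),
        "p99_9_ms": float(np.percentile(latencies_ms, 99.9)),
    }
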
================================================================
MODEL: meta-llama/Llama-2-7b-hf (bfloat16)
GPU: NVIDIA RTX 4090 (24GB)
================================================================
Generation Speed: 52.3 tok/s
Prompt Processing: 487.1 tok/s
Time to First Token: 68.2 ms
Peak VRAM: 13,847 MB
Latency Percentiles:
+----------+------------+
| P50      |   1,412 ms |
| P95      |   2,034 ms |
| P99      |   2,198 ms |
| P99.9    |   2,389 ms |
+----------+------------+
VRAM Timeline (ASCII):
14.0 GB | ************************************
13.5 GB | **
13.0 GB | *
 0.5 GB |*
        +-------------------------------------->
        0s          5s          10s         15s
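
A VRAM timeline like the one above can be collected by polling NVML from a background thread while the benchmark runs. The sketch below assumes the nvidia-ml-py (pynvml) package and a hypothetical `sample_gpu_timeline` helper; it illustrates the idea rather than ModelBench's actual profiler.

import threading
import time

import pynvml  # pip install nvidia-ml-py

def sample_gpu_timeline(samples, stop_event, interval_s=0.5, device_index=0):
    """Append one GPU sample (VRAM, utilization, temperature, power) per interval."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    t0 = time.monotonic()
    while not stop_event.is_set():
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        samples.append({
            "t_s": time.monotonic() - t0,
            "vram_mb": mem.used / (1024 ** 2),
            "gpu_util_pct": util.gpu,
            "temp_c": pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU),
            "power_w": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,  # mW -> W
        })
        stop_event.wait(interval_s)
    pynvml.nvmlShutdown()

# Usage sketch: start the sampler, run the benchmark, then stop it.
samples, stop = [], threading.Event()
sampler = threading.Thread(target=sample_gpu_timeline, args=(samples, stop), daemon=True)
sampler.start()
# ... run generation here ...
stop.set()
sampler.join()
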
# Clone the repository
git clone https://github.com/ayinedjimi/ModelBench.git
cd ModelBench
# Install with pip
pip install -e .
# With GGUF support
pip install -e ".[gguf]"
# With all optional dependencies
pip install -e ".[all]"

# Benchmark a HuggingFace model
modelbench run meta-llama/Llama-2-7b-hf
# Benchmark a GGUF model
modelbench run ./models/llama-2-7b.Q4_K_M.gguf
# Compare multiple models
modelbench compare meta-llama/Llama-2-7b-hf ./models/llama-2-7b.Q4_K_M.gguf
# Generate reports from saved JSON
modelbench report benchmark_result.json --formats html,markdown
# Display GPU information
modelbench gpu-info

from modelbench import BenchmarkRunner, ReportGenerator
from modelbench.models import BenchmarkConfig, PromptCategory
# Configure benchmark
config = BenchmarkConfig(
model_path="meta-llama/Llama-2-7b-hf",
max_new_tokens=128,
num_runs=5,
warmup_runs=2,
prompt_categories=[PromptCategory.SHORT, PromptCategory.MEDIUM],
gpu_profiling=True,
)
# Run benchmark
runner = BenchmarkRunner(config)
result = runner.run()
# Print key metrics
print(f"Generation: {result.throughput.generation_tokens_per_sec:.1f} tok/s")
print(f"TTFT: {result.throughput.time_to_first_token_ms:.1f} ms")
print(f"P99 Latency: {result.latency.p99_ms:.2f} ms")
print(f"Peak VRAM: {result.peak_vram_mb:.0f} MB")
# Generate all reports
reporter = ReportGenerator(result)
reporter.generate_all("./reports")

| Command | Description |
|---|---|
| `modelbench run <model>` | Benchmark a single model |
| `modelbench compare <m1> <m2> ...` | Compare multiple models |
| `modelbench report <json>` | Generate reports from JSON |
| `modelbench gpu-info` | Display GPU information |
| `modelbench list-formats` | List supported model formats |
| Format | Library | Detection | Example |
|---|---|---|---|
| HuggingFace | transformers | Default | meta-llama/Llama-2-7b-hf |
| GGUF | llama-cpp-python | `.gguf` extension | model.Q4_K_M.gguf |
| GPTQ | auto-gptq | `gptq` in name | TheBloke/Model-GPTQ |
| AWQ | transformers | `awq` in name | TheBloke/Model-AWQ |
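
The Detection column boils down to simple path and name heuristics. A minimal sketch of what such auto-detection could look like follows (the function name and exact rules are illustrative, not necessarily the ones ModelBench applies):

from pathlib import Path

def detect_model_format(model_path: str) -> str:
    """Guess the model format from a local path or Hugging Face repo name."""
    name = Path(model_path).name.lower()
    if name.endswith(".gguf"):
        return "gguf"          # e.g. model.Q4_K_M.gguf -> llama-cpp-python
    if "gptq" in name:
        return "gptq"          # e.g. TheBloke/Model-GPTQ -> auto-gptq
    if "awq" in name:
        return "awq"           # e.g. TheBloke/Model-AWQ -> transformers
    return "huggingface"       # default: standard Transformers / safetensors
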
ModelBench is a comprehensive benchmarking framework for local Large Language Models (LLMs) on GPU. It provides detailed performance metrics including token generation speed, latency percentiles (P50/P95/P99/P99.9), VRAM profiling, time-to-first-token (TTFT), and multi-model comparison with HTML reports.
- Multi-Format Support: HuggingFace Transformers, GGUF (llama-cpp-python), GPTQ, AWQ
- Comprehensive Metrics: tokens/sec, prompt processing speed, TTFT, latency percentiles
- GPU Profiling: Real-time VRAM tracking, GPU utilization, temperature, power draw
- Rich Reports: HTML with inline SVG charts, JSON, Markdown, CSV
- Multi-Model Comparison: Side-by-side benchmarks with ranking
- Statistical Stability: Multiple runs with warmup, percentile calculations
- Auto-Detection: Automatic model format detection from path/name
# Clone the repository
git clone https://github.com/ayinedjimi/ModelBench.git
cd ModelBench
# Install
pip install -e ".[all]"
# Run a benchmark
modelbench run meta-llama/Llama-2-7b-hf
# Compare models
modelbench compare model1 model2
# GPU information
modelbench gpu-info

from modelbench import BenchmarkRunner
from modelbench.models import BenchmarkConfig
config = BenchmarkConfig(
    model_path="meta-llama/Llama-2-7b-hf",
    max_new_tokens=128,
    num_runs=5,
)
runner = BenchmarkRunner(config)
result = runner.run()
print(f"Vitesse: {result.throughput.generation_tokens_per_sec:.1f} tok/s")
print(f"VRAM max: {result.peak_vram_mb:.0f} MB")