AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution. It provides detailed metrics via a command-line display as well as extensive benchmark performance reports.
This quick start guide leverages Ollama via Docker Desktop.
To set up an Ollama server running granite4:350m, use the following commands:
```bash
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  ollama/ollama:latest

docker exec -it ollama ollama pull granite4:350m
```

Create a virtual environment and install AIPerf:
```bash
python3 -m venv venv
source venv/bin/activate
pip install aiperf
```

To run a simple benchmark against your Ollama server:
```bash
aiperf profile \
  --model "granite4:350m" \
  --streaming \
  --endpoint-type chat \
  --tokenizer ibm-granite/granite-4.0-micro \
  --url http://localhost:11434
```

To apply more load, add concurrency and a fixed request count:

```bash
aiperf profile \
  --model "granite4:350m" \
  --streaming \
  --endpoint-type chat \
  --tokenizer ibm-granite/granite-4.0-micro \
  --url http://localhost:11434 \
  --concurrency 5 \
  --request-count 10
```

Example output:
NOTE: The example performance is reflective of a CPU-only run and does not represent an official benchmark.
```
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric                               ┃ avg       ┃ min      ┃ max       ┃ p99       ┃ p90       ┃ p50       ┃ std      ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ Time to First Token (ms)             │ 7,463.28  │ 7,125.81 │ 9,484.24  │ 9,295.48  │ 7,596.62  │ 7,240.23  │ 677.23   │
│ Time to Second Token (ms)            │ 68.73     │ 32.01    │ 102.86    │ 102.55    │ 99.80     │ 67.37     │ 24.95    │
│ Time to First Output Token (ms)      │ 7,463.28  │ 7,125.81 │ 9,484.24  │ 9,295.48  │ 7,596.62  │ 7,240.23  │ 677.23   │
│ Request Latency (ms)                 │ 13,829.40 │ 9,029.36 │ 27,905.46 │ 27,237.77 │ 21,228.48 │ 11,338.31 │ 5,614.32 │
│ Inter Token Latency (ms)             │ 65.31     │ 53.06    │ 81.31     │ 81.24     │ 80.64     │ 63.79     │ 9.09     │
│ Output Token Throughput Per User     │ 15.60     │ 12.30    │ 18.85     │ 18.77     │ 18.08     │ 15.68     │ 2.05     │
│ (tokens/sec/user)                    │           │          │           │           │           │           │          │
│ Output Sequence Length (tokens)      │ 95.20     │ 29.00    │ 295.00    │ 283.12    │ 176.20    │ 63.00     │ 77.08    │
│ Input Sequence Length (tokens)       │ 550.00    │ 550.00   │ 550.00    │ 550.00    │ 550.00    │ 550.00    │ 0.00     │
│ Output Token Throughput (tokens/sec) │ 6.85      │ N/A      │ N/A       │ N/A       │ N/A       │ N/A       │ N/A      │
│ Request Throughput (requests/sec)    │ 0.07      │ N/A      │ N/A       │ N/A       │ N/A       │ N/A       │ N/A      │
│ Request Count (requests)             │ 10.00     │ N/A      │ N/A       │ N/A       │ N/A       │ N/A       │ N/A      │
└──────────────────────────────────────┴───────────┴──────────┴───────────┴───────────┴───────────┴───────────┴──────────┘
CLI Command: aiperf profile --model 'granite4:350m' --streaming --endpoint-type 'chat' --tokenizer 'ibm-granite/granite-4.0-micro' --url 'http://localhost:11434'
Benchmark Duration: 138.89 sec
CSV Export: /home/user/aiperf/artifacts/granite4:350m-openai-chat-concurrency1/profile_export_aiperf.csv
JSON Export: /home/user/Code/aiperf/artifacts/granite4:350m-openai-chat-concurrency1/profile_export_aiperf.json
Log File: /home/user/Code/aiperf/artifacts/granite4:350m-openai-chat-concurrency1/logs/aiperf.log
```

Key features:

- Scalable multiprocess architecture with 9 services communicating via ZMQ
- 3 UI modes: `dashboard` (real-time TUI), `simple` (progress bars), `none` (headless)
- Multiple benchmarking modes: concurrency, request-rate, request-rate with max concurrency, trace replay
- Extensible plugin system for endpoints, datasets, transports, and metrics
- Public dataset support including ShareGPT and custom formats
- OpenAI chat completions, completions, embeddings, audio, images
- NIM embeddings, rankings
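The streaming latency metrics in the example output above (TTFT, inter-token latency, percentiles) can be reproduced from raw per-request timings. A minimal sketch of the arithmetic, using invented timing data rather than AIPerf's actual internals:

```python
# Sketch of how streaming latency metrics are typically derived.
# The timing data below is invented for illustration; AIPerf computes
# these values from real response streams.

def ttft_ms(request_start: float, first_token_time: float) -> float:
    """Time to First Token: delay until the first streamed token (ms)."""
    return (first_token_time - request_start) * 1000.0

def inter_token_latency_ms(token_times: list[float]) -> float:
    """Mean gap between consecutive streamed tokens (ms)."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return 1000.0 * sum(gaps) / len(gaps)

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, as commonly used in latency reports."""
    ordered = sorted(values)
    rank = round(p / 100 * (len(ordered) - 1))
    return ordered[max(0, min(len(ordered) - 1, rank))]

# One fake streamed request: tokens arrive at these absolute times (sec).
start = 0.0
token_times = [0.50, 0.56, 0.62, 0.70, 0.76]

print(ttft_ms(start, token_times[0]))                  # 500.0
print(round(inter_token_latency_ms(token_times), 2))   # 65.0
print(percentile([10, 20, 30, 40, 50], 90))            # 50
```

The real tool aggregates these per-request values into the avg/min/max/percentile columns shown in the table.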
- Basic Tutorial - Profile Qwen3-0.6B with vLLM
- Comprehensive Benchmarking Guide - 5 real-world use cases
- User Interface - Dashboard, simple, or headless
- Hugging Face TGI - Profile Hugging Face TGI models
- OpenAI Text Endpoints - Profile OpenAI-compatible text APIs
- Request Rate with Max Concurrency - Dual request control
- Arrival Patterns - Constant, Poisson, gamma traffic
- Prefill Concurrency - Memory-safe long-context benchmarking
- Gradual Ramping - Smooth ramp-up of concurrency and request rate
- Warmup Phase - Eliminate cold-start effects
- User-Centric Timing - Per-user rate limiting for KV cache benchmarking
- Request Cancellation - Timeout and resilience testing
- Multi-URL Load Balancing - Distribute across servers
- Trace Benchmarking - Deterministic workload replay
- Bailian Traces - Bailian production trace replay
- Custom Prompt Benchmarking - Send exact prompts as-is
- Custom Dataset - Custom dataset formats
- ShareGPT Dataset - Profile with ShareGPT dataset
- Synthetic Dataset Generation - Generate synthetic datasets
- Fixed Schedule - Precise timestamp-based execution
- Time-based Benchmarking - Duration-based testing
- Sequence Distributions - Mixed ISL/OSL pairings
- Prefix Synthesis - Prefix data synthesis for KV cache testing
- Reproducibility - Deterministic datasets with `--random-seed`
- Template Endpoint - Custom Jinja2 request templates
- Multi-Turn Conversations - Multi-turn conversation benchmarking
- Local Tokenizer - Use local tokenizers without HuggingFace
- Embeddings - Profile embedding models
- Rankings - Profile ranking models
- Audio - Profile audio language models
- Vision - Profile vision language models
- Image Generation - Benchmark any OpenAI-compatible image generation API
- SGLang Video Generation - Video generation benchmarking
- Synthetic Video - Synthetic video generation
- Timeslice Metrics - Per-timeslice performance analysis
- Goodput - SLO-based throughput measurement
- HTTP Trace Metrics - DNS, TCP/TLS, TTFB timing
- Multi-Run Confidence - Confidence intervals across repeated runs
- Profile Exports - Post-processing with Pydantic models
- Visualization and Plotting - PNG charts and multi-run comparison
- GPU Telemetry - DCGM metrics collection
- Server Metrics - Prometheus-compatible metrics
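Goodput, listed above, counts only requests that meet a service-level objective rather than all completed requests. A minimal sketch of the idea with invented latencies; AIPerf's own goodput configuration may differ:

```python
# Goodput: throughput counting only SLO-compliant requests.
# The latencies, SLO, and duration below are invented for illustration.

def goodput(latencies_ms: list[float], slo_ms: float, duration_s: float) -> float:
    """Requests per second that completed within the latency SLO."""
    good = sum(1 for latency in latencies_ms if latency <= slo_ms)
    return good / duration_s

latencies = [120.0, 95.0, 310.0, 88.0, 150.0, 400.0]   # per-request latency
print(goodput(latencies, slo_ms=200.0, duration_s=2.0))  # 2.0 (4 good / 2 s)
```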
| Document | Purpose |
|---|---|
| Architecture | Three-plane architecture, core components, credit system, data flow |
| CLI Options | Complete command and option reference |
| Metrics Reference | All metric definitions, formulas, and requirements |
| Environment Variables | All AIPERF_* configuration variables |
| Plugin System | Plugin architecture, 25+ categories, creation guide |
| Creating Plugins | Step-by-step plugin tutorial |
| Accuracy Benchmarks | Accuracy evaluation stubs and datasets |
| Benchmark Modes | Trace replay and timing modes |
| Server Metrics | Prometheus-compatible server metrics collection |
| Tokenizer Auto-Detection | Pre-flight tokenizer detection |
| Dataset Synthesis API | Synthesis module API reference |
| Code Patterns | Code examples for services, models, messages, plugins |
| Migrating from Genai-Perf | Migration guide and feature comparison |
| Design Proposals | Enhancement proposals and discussions |
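As a quick sanity check of the formulas documented in the Metrics Reference, the aggregate throughput values from the quick-start example output can be reproduced by hand:

```python
# Aggregate throughput from the quick-start example output:
# 10 requests, mean output sequence length 95.20 tokens, 138.89 s duration.

requests = 10
mean_output_tokens = 95.20
duration_s = 138.89

total_output_tokens = requests * mean_output_tokens   # 952 tokens overall
token_throughput = total_output_tokens / duration_s
request_throughput = requests / duration_s

print(round(token_throughput, 2))    # 6.85 tokens/sec, matching the table
print(round(request_throughput, 2))  # 0.07 requests/sec, matching the table
```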
See CONTRIBUTING.md for development setup, coding conventions, and contribution guidelines.
Known issues:

- Output sequence length constraints (`--output-tokens-mean`) cannot be guaranteed unless you pass `ignore_eos` and/or `min_tokens` via `--extra-inputs` to an inference server that supports them.
- Very high concurrency settings (typically >15,000) may lead to port exhaustion on some systems. Adjust system limits or reduce concurrency if connection failures occur.
- Startup errors caused by invalid configuration settings can cause AIPerf to hang indefinitely. Terminate the process and check configuration settings.
- Copying selected text may not work reliably in the dashboard UI. Use the `c` key to copy all logs.
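A misconfigured URL is a common trigger for the startup hang described above. One way to catch it early is to verify the endpoint is reachable before launching a run; a minimal sketch, assuming the Ollama server from the quick start (the URL and timeout are illustrative):

```python
import urllib.error
import urllib.request

def server_ready(url: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers with a non-5xx HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except (urllib.error.URLError, OSError, ValueError):
        return False

# Ollama answers "Ollama is running" on its root path.
if server_ready("http://localhost:11434"):
    print("server reachable; safe to run aiperf profile")
else:
    print("server not reachable; check the --url value before benchmarking")
```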