Community-driven benchmark database for running LLMs locally on Apple Silicon Macs.
Goal: Build a comprehensive, reproducible performance database so anyone can look up how fast a given LLM runs on their specific Mac — and find the optimal settings for it.
Browse results by chip generation:
| Generation | Link | Status |
|---|---|---|
| Apple M1 | View results | Awaiting contributions |
| Apple M2 | View results | Awaiting contributions |
| Apple M3 | View results | Awaiting contributions |
| Apple M4 | View results | Awaiting contributions |
| Apple M5 | View results | 1 config, 22 models |
Each generation page contains separate tables for every variant (base, Pro, Max, Ultra) and hardware configuration (CPU cores, GPU cores, RAM).
Full results index with cross-generation comparison: `results/README.md`
```sh
git clone https://github.com/enescingoz/mac-llm-bench.git
cd mac-llm-bench

# Install dependencies
brew install llama.cpp
pip3 install huggingface-hub

# Run a quick smoke test (~0.8GB download)
./bench.sh --quick

# Benchmark all models that fit in your RAM
./bench.sh --auto

# Regenerate result tables after benchmarking
python3 scripts/generate_results.py
```

We use llama-bench as the core benchmark — standardized, content-agnostic, and fully reproducible. It measures raw prompt processing and token generation speed at fixed token counts (pp128, pp256, pp512, tg128, tg256). No custom prompts, no subjectivity, and no need to ever re-benchmark if test cases change.
| Metric | Source | Description |
|---|---|---|
| pp128 / pp256 / pp512 (tok/s) | llama-bench | Prompt processing speed |
| tg128 / tg256 (tok/s) | llama-bench | Text generation speed |
| Peak Memory (GB) | /usr/bin/time | Maximum RAM usage |
| Perplexity | llama-perplexity | Quality on WikiText-2 (optional) |
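The metrics above reduce to simple formulas: throughput is tokens divided by wall time, and perplexity is the exponential of the mean negative log-likelihood per token. A minimal sketch for intuition; the numeric inputs are made up, not real benchmark data:

```python
import math

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput as reported for the pp/tg metrics: tokens / wall time."""
    return n_tokens / elapsed_s

def perplexity(nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical numbers for illustration only:
print(round(tokens_per_second(512, 1.6), 1))   # pp512 over 1.6 s -> 320.0
print(round(perplexity([2.0, 2.0, 2.0]), 3))   # exp(2.0) -> 7.389
```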
Currently benchmarking 5 model families (22 models total):
| Family | Models | Sizes |
|---|---|---|
| Gemma 3 (Google) | 4 models | 1B, 4B, 12B, 27B |
| Qwen 3 (Alibaba) | 7 models | 0.6B, 1.7B, 4B, 8B, 14B, 32B, 30B-A3B MoE |
| DeepSeek R1 Distill | 3 models | 7B, 14B, 32B |
| Phi-4 (Microsoft) | 4 models | Mini 3.8B, Mini Reasoning 3.8B, 14B, Reasoning Plus 14B |
| Mistral | 4 models | 7B v0.3, Nemo 12B, Small 3.1 24B, Devstral Small 24B |
All models are ungated — no HuggingFace login required. More model families can be added via PR. Run `./bench.sh --list` to see all available models.
We aim to cover every Apple Silicon configuration:
```
M1 / M2 / M3 / M4 / M5
  × base / Pro / Max / Ultra
  × various CPU/GPU core counts
  × various RAM sizes (8GB – 256GB)
```
Results are organized by generation → variant → hardware config. See CONTRIBUTING.md for how to add your machine.
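Raw results for a machine land in a directory named after its hardware, following the `{chip}_{cpu}c-{gpu}g_{ram}gb` pattern shown in the repository layout. A sketch of building that name; the exact spelling of the chip token (e.g. `m4-pro`) is an assumption, not confirmed by the repo:

```python
def config_dir(chip: str, cpu_cores: int, gpu_cores: int, ram_gb: int) -> str:
    """Build a results directory name: {chip}_{cpu}c-{gpu}g_{ram}gb."""
    return f"{chip}_{cpu_cores}c-{gpu_cores}g_{ram_gb}gb"

# Hypothetical M4 Pro with 12 CPU cores, 16 GPU cores, 48 GB RAM:
print(config_dir("m4-pro", 12, 16, 48))  # -> m4-pro_12c-16g_48gb
```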
Find optimal settings for each model on your hardware:
```sh
./bench.sh --model gemma-3-4b --sweep        # Quick sweep
./bench.sh --model gemma-3-4b --sweep-full   # Exhaustive sweep
```

Repository layout:

```
mac-llm-bench/
├── bench.sh                    # Main CLI
├── models.yaml                 # Model registry
├── requirements.txt            # Python dependencies
├── lib/                        # Benchmark scripts
├── scripts/
│   └── generate_results.py     # Generates result tables from raw data
├── results/
│   ├── README.md               # Auto-generated index
│   └── m1/ ... m5/             # Per-generation results
│       ├── README.md           # Auto-generated tables
│       └── raw/                # Raw JSON benchmark data
│           └── {chip}_{cpu}c-{gpu}g_{ram}gb/
│               └── {model}_{quant}_ngl{n}.json
├── schemas/
│   └── result.schema.json      # Result JSON format
├── CONTRIBUTING.md             # How to submit results
└── GUIDE.md                    # User guide
```
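Each raw result filename encodes the model, quantization, and `-ngl` (GPU layer count) value as `{model}_{quant}_ngl{n}.json`. A sketch of splitting off the `ngl` suffix; since both model and quant names may contain underscores (e.g. GGUF quants like `Q4_K_M`), the sketch leaves the model/quant split to the registry, and the example filename is illustrative:

```python
import re

# Match {stem}_ngl{n}.json, where stem = {model}_{quant}.
NGL_RE = re.compile(r"^(?P<stem>.+)_ngl(?P<ngl>\d+)\.json$")

def parse_result_name(filename: str):
    """Return (model_and_quant_stem, ngl) for a raw result filename."""
    m = NGL_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized result filename: {filename!r}")
    return m.group("stem"), int(m.group("ngl"))

stem, ngl = parse_result_name("gemma-3-4b_Q4_K_M_ngl99.json")
print(stem, ngl)  # -> gemma-3-4b_Q4_K_M 99
```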
- GUIDE.md — Detailed user guide for benchmarking
- CONTRIBUTING.md — How to submit results and add models
- results/ — All benchmark results
- macOS on Apple Silicon (M1/M2/M3/M4/M5)
- llama.cpp — `brew install llama.cpp`
- huggingface-hub — `pip3 install huggingface-hub`
- Python 3 (pre-installed on macOS)
MIT