Benchmark local LLMs (via Ollama) and Claude side-by-side on your own tasks.
Runs the same prompts through every model and generates a readable markdown report so you can compare quality and speed at a glance. Built for practical ops tasks — summarization, priority extraction, message drafting, and structured data output.
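The mechanics are simple: send the identical prompt to each model, wait for the full response, and record wall-clock time and throughput. Here is a minimal sketch of one timed call, assuming Ollama's standard HTTP API on localhost:11434 — illustrative only; the real script's internals may differ:

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def run_once(model: str, prompt: str, system: str = "") -> dict:
    """Send one prompt to one local model and record timing stats."""
    start = time.perf_counter()
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "system": system,
        "stream": False,  # wait for the complete response so timing is simple
    }, timeout=300)
    resp.raise_for_status()
    data = resp.json()
    elapsed = time.perf_counter() - start
    # Ollama's non-streaming response includes eval_count (output tokens)
    # and eval_duration (nanoseconds spent generating them).
    tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
    return {"model": model, "seconds": elapsed,
            "tokens_per_sec": tokens_per_sec, "output": data["response"]}
```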
Built as part of a real automation project — we used this to choose the model running a daily ops bot for a small business. The full story is in the article series on Medium:
- Why I Don't Trust AI Models I Can't Run Offline
- We Built a Daily Ops Bot for a Small Business on a $1000 Mac Mini
- The 50-Line Python Script That Runs a Team Standup Every Morning
- You wouldn't crack a walnut with a sledgehammer, would you?
Four task types, every model, same prompts:
| Task | What it tests |
|---|---|
| summarize | Condense a week of open items into a short digest |
| priorities | Extract the 3 most urgent items |
| classify | Return structured JSON from unstructured task data |
| draft | Write a short message based on context |
Output: a markdown report comparing every model side by side — response time, tokens/sec, and raw output.
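As an illustration of the report step, here is one way per-run results could be rendered into that side-by-side table. The results shape matches the `run_once` sketch above; it's an assumption, not the script's actual internals:

```python
def write_report(results, path="benchmark.md"):
    """Render a list of per-run result dicts as a markdown comparison table."""
    lines = ["| Model | Time (s) | Tokens/sec | Output |",
             "|---|---|---|---|"]
    for r in results:
        snippet = r["output"].replace("\n", " ")[:80]  # truncate so the table stays readable
        lines.append(f"| {r['model']} | {r['seconds']:.1f} "
                     f"| {r['tokens_per_sec']:.1f} | {snippet} |")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```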
- Ollama installed and running (Ollama is the local model runner used to pull and serve models like Gemma and Mistral)
- At least one model pulled: `ollama pull gemma`
- Python 3.8+
- Optional: Claude CLI for cloud comparison (Claude Sonnet, Claude Haiku)
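Before a run, it's worth confirming Ollama is actually up and seeing which models you've pulled. A quick check against Ollama's standard `/api/tags` endpoint (a standalone sketch, not part of the script):

```python
import requests

def list_local_models():
    """Return the names of models Ollama has pulled; raises if the server is down."""
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

if __name__ == "__main__":
    try:
        print("Pulled models:", list_local_models())
    except requests.ConnectionError:
        print("Ollama isn't reachable on localhost:11434 - start it first.")
```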
```bash
# Pull a model
ollama pull gemma

# Run the benchmark
python3 llm-benchmark.py

# Local models only (no Claude)
python3 llm-benchmark.py --skip-claude

# Specific models and tasks
python3 llm-benchmark.py --models gemma mistral --tasks summarize classify
```

Everything is in the `TASKS` dict near the top of the script. Swap in your own prompts, system instructions, and sample data:
```python
TASKS = {
    "your-task": {
        "description": "one line explaining what this tests",
        "system": "optional system prompt / persona",
        "prompt": "the actual prompt sent to every model",
    },
}
```

Update `SAMPLE_CONTEXT` with your own data, or load it from a file.
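For longer sample data, a file read keeps the script clean. A minimal sketch (`context.txt` is a placeholder name, not a file the repo ships with):

```python
from pathlib import Path

# Swap the inline constant for file contents; context.txt is a placeholder.
SAMPLE_CONTEXT = Path("context.txt").read_text(encoding="utf-8")
```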
- `--models`: Ollama model names to benchmark (must be pulled first)
- `--tasks`: Which tasks to run (default: all)
- `--skip-claude`: Skip Claude CLI, run local models only
- `--output`: Output report filename (default: `benchmark-YYYY-MM-DD.md`)
See what a benchmark report actually looks like: sample-output/benchmark-sample.md
Or see the results live with context: devonclemente.com/#llm-benchmark
See what I build once the model is chosen: devonclemente.com
