Benchmark local LLMs (via Ollama) and Claude side-by-side on your own tasks.
Runs the same prompts through every model and generates a readable markdown report so you can compare quality and speed at a glance. Built for practical ops tasks — summarization, priority extraction, message drafting, and structured data output.
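The mechanics are simple: send the identical prompt to each model, wait for the full response, and record wall-clock time and throughput. Here is a minimal sketch of one timed call, assuming Ollama's standard HTTP API on localhost:11434 — illustrative only; the real script's internals may differ:

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def run_once(model: str, prompt: str, system: str = "") -> dict:
    """Send one prompt to one local model and record timing stats."""
    start = time.perf_counter()
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "system": system,
        "stream": False,  # wait for the complete response so timing is simple
    }, timeout=300)
    resp.raise_for_status()
    data = resp.json()
    elapsed = time.perf_counter() - start
    # Ollama's non-streaming response includes eval_count (output tokens)
    # and eval_duration (nanoseconds spent generating them).
    tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
    return {"model": model, "seconds": elapsed,
            "tokens_per_sec": tokens_per_sec, "output": data["response"]}
```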
Built as part of a real automation project — we used this to choose the model running a daily ops bot for a small business. The full story is in the article series on Medium:
- Why I Don't Trust AI Models I Can't Run Offline
- We Built a Daily Ops Bot for a Small Business on a $1000 Mac Mini
- The 50-Line Python Script That Runs a Team Standup Every Morning
- You wouldn't crack a walnut with a sledgehammer, would you?
Four task types, every model, same prompts:
| Task | What it tests |
|---|---|
| summarize | Condense a week of open items into a short digest |
| priorities | Extract the 3 most urgent items |
| classify | Return structured JSON from unstructured task data |
| draft | Write a short message based on context |
Output: a markdown report comparing every model side by side — response time, tokens/sec, and raw output.
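As an illustration of the report step, here is one way per-run results could be rendered into that side-by-side table. The results shape matches the `run_once` sketch above; it's an assumption, not the script's actual internals:

```python
def write_report(results, path="benchmark.md"):
    """Render a list of per-run result dicts as a markdown comparison table."""
    lines = ["| Model | Time (s) | Tokens/sec | Output |",
             "|---|---|---|---|"]
    for r in results:
        snippet = r["output"].replace("\n", " ")[:80]  # truncate so the table stays readable
        lines.append(f"| {r['model']} | {r['seconds']:.1f} "
                     f"| {r['tokens_per_sec']:.1f} | {snippet} |")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```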
- Ollama installed and running (Ollama is the local model runner used to pull and serve models like Gemma and Mistral)
- At least one model pulled: `ollama pull gemma`
- Python 3.8+
- Optional: Claude CLI for cloud comparison (Claude Sonnet, Claude Haiku)
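Before a run, it's worth confirming Ollama is actually up and seeing which models you've pulled. A quick check against Ollama's standard `/api/tags` endpoint (a standalone sketch, not part of the script):

```python
import requests

def list_local_models():
    """Return the names of models Ollama has pulled; raises if the server is down."""
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

if __name__ == "__main__":
    try:
        print("Pulled models:", list_local_models())
    except requests.ConnectionError:
        print("Ollama isn't reachable on localhost:11434 - start it first.")
```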
```bash
# Pull a model
ollama pull gemma

# Run the benchmark
python3 llm-benchmark.py

# Local models only (no Claude)
python3 llm-benchmark.py --skip-claude

# Specific models and tasks
python3 llm-benchmark.py --models gemma mistral --tasks summarize classify
```

Everything is in the `TASKS` dict near the top of the script. Swap in your own prompts, system instructions, and sample data:
```python
TASKS = {
    "your-task": {
        "description": "one line explaining what this tests",
        "system": "optional system prompt / persona",
        "prompt": "the actual prompt sent to every model",
    },
}
```

Update `SAMPLE_CONTEXT` with your own data, or load it from a file.
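For longer sample data, a file read keeps the script clean. A minimal sketch (`context.txt` is a placeholder name, not a file the repo ships with):

```python
from pathlib import Path

# Swap the inline constant for file contents; context.txt is a placeholder.
SAMPLE_CONTEXT = Path("context.txt").read_text(encoding="utf-8")
```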
- `--models`: Ollama model names to benchmark (must be pulled first)
- `--tasks`: Which tasks to run (default: all)
- `--skip-claude`: Skip Claude CLI, run local models only
- `--output`: Output report filename (default: `benchmark-YYYY-MM-DD.md`)
See what a benchmark report actually looks like: sample-output/benchmark-sample.md
Or see the results live with context: devonclemente.com/#llm-benchmark
See what I build once the model is chosen: devonclemente.com
