
# llm-benchmark

Benchmark local LLMs (via Ollama) and Claude side by side on your own tasks.

Runs the same prompts through every model and generates a readable markdown report so you can compare quality and speed at a glance. Built for practical ops tasks: summarization, priority extraction, message drafting, and structured data output.


*AI Royal Rumble: Cloud vs. Local*

## Background

Built as part of a real automation project: we used this benchmark to choose the model behind a daily ops bot for a small business. The full story is in the article series on Medium.

## What it does

Four task types, every model, same prompts:

| Task | What it tests |
|------|---------------|
| `summarize` | Condense a week of open items into a short digest |
| `priorities` | Extract the 3 most urgent items |
| `classify` | Return structured JSON from unstructured task data |
| `draft` | Write a short message based on context |

Output: a markdown report comparing every model side by side on response time, tokens/sec, and raw output.
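
For reference, a tokens/sec figure for a local model can be derived from the metadata Ollama attaches to each generation. This is a minimal sketch against Ollama's `/api/generate` endpoint, not necessarily how the script computes its numbers; note that `eval_duration` is reported in nanoseconds:

```python
# Sketch: derive tokens/sec from Ollama's /api/generate response metadata.
# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
import json
import urllib.request

payload = json.dumps({"model": "gemma", "prompt": "Say hello.", "stream": False}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

tokens_per_sec = result["eval_count"] / (result["eval_duration"] / 1e9)
print(f"{tokens_per_sec:.1f} tokens/sec")
```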


## Requirements

- Ollama installed and running. Ollama is the local runner that pulls and serves models like Gemma and Mistral; a quick reachability check is sketched after this list.
- At least one model pulled: `ollama pull gemma`
- Python 3.8+
- Optional: Claude CLI for cloud comparison (Claude Sonnet, Claude Haiku)
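
If you're not sure whether Ollama is up, you can hit its local HTTP API directly. A minimal sketch, assuming Ollama's default port of 11434; this is independent of the benchmark script itself:

```python
# Reachability check for a local Ollama server (default port 11434).
# GET /api/tags lists pulled models; a connection error means Ollama isn't running.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp)["models"]

print("Pulled models:", [m["name"] for m in models])
```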

## Quick start

```bash
# Pull a model
ollama pull gemma

# Run the benchmark
python3 llm-benchmark.py

# Local models only (no Claude)
python3 llm-benchmark.py --skip-claude

# Specific models and tasks
python3 llm-benchmark.py --models gemma mistral --tasks summarize classify
```

## Customizing for your own tasks

Everything is in the `TASKS` dict near the top of the script. Swap in your own prompts, system instructions, and sample data:

```python
TASKS = {
    "your-task": {
        "description": "one line explaining what this tests",
        "system": "optional system prompt / persona",
        "prompt": "the actual prompt sent to every model",
    },
}
```

Update `SAMPLE_CONTEXT` with your own data, or load it from a file.
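
As a sketch of the file-based option (the `context.txt` name here is hypothetical; `SAMPLE_CONTEXT` is the variable in the script):

```python
# Load the benchmark context from a file instead of the inline string.
# context.txt is a hypothetical file name; point it at whatever holds your data.
from pathlib import Path

SAMPLE_CONTEXT = Path("context.txt").read_text(encoding="utf-8")
```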


## CLI options

| Flag | Description |
|------|-------------|
| `--models` | Ollama model names to benchmark (must be pulled first) |
| `--tasks` | Which tasks to run (default: all) |
| `--skip-claude` | Skip the Claude CLI and run local models only |
| `--output` | Output report filename (default: `benchmark-YYYY-MM-DD.md`) |
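
For orientation, flags like these map naturally onto `argparse`. This is a sketch only; the defaults shown are assumptions, not necessarily the script's exact values:

```python
# Sketch of an argparse setup matching the flags above (defaults are assumptions).
import argparse
from datetime import date

parser = argparse.ArgumentParser(description="Benchmark local LLMs and Claude side by side")
parser.add_argument("--models", nargs="+", default=["gemma"],
                    help="Ollama model names to benchmark (must be pulled first)")
parser.add_argument("--tasks", nargs="+", default=None,
                    help="which tasks to run (default: all)")
parser.add_argument("--skip-claude", action="store_true",
                    help="skip the Claude CLI and run local models only")
parser.add_argument("--output", default=f"benchmark-{date.today()}.md",
                    help="output report filename")
args = parser.parse_args()
```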


## Sample output

See what a benchmark report actually looks like: [sample-output/benchmark-sample.md](sample-output/benchmark-sample.md)

Or see the results live with context: [devonclemente.com/#llm-benchmark](https://devonclemente.com/#llm-benchmark)


See what I build when the model is chosen: [devonclemente.com](https://devonclemente.com)
