Lightweight, configurable benchmark harness for LLM providers. Measure accuracy (EM/F1/MC), latency/throughput, and resource timelines (CPU/RAM, optional GPU).
This repository is intentionally small and pluggable — add adapters for your provider and drop in any JSONL dataset.
- Create and activate a Python virtualenv. The project supports Python 3.10+.
- (Optional) install dev deps for testing and plotting:

```bash
python -m pip install -r requirements-dev.txt
# Optional GPU support: python -m pip install -e .[gpu]
```

- Run the example harness (uses an in-repo MockProvider):
```bash
python - <<'PY'
import asyncio, sys
sys.path.insert(0, '')  # ensure the current directory (the repo checkout) is importable
from benches.harness import run_bench
asyncio.run(run_bench('bench_config.yaml'))
PY
```

Output files are written to the `reports/` prefix declared in `bench_config.yaml` (a per-sample JSONL, a CSV summary, a resources timeline, and a compact Markdown report).
Key fields:

- `provider`: select `kind: mock | ollama | openai` and provider-specific connection options.
- `io.dataset_path`: path to the JSONL dataset.
- `io.output_prefix`: prefix for the output artifacts in `reports/`.
- `prompt.system` and `prompt.template`: system message and per-sample template using `{input}` and other fields from the dataset.
- `load.concurrency` and `load.batch_size`: concurrency/batch settings.
- `limits.max_samples`: limit the number of samples for fast experiments.
- `metrics.normalization`: optional normalization (e.g., `lower_strip`) applied to accuracy metrics.
An example config is included as `bench_config.yaml`.
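For quick experiments you can also patch the config programmatically before a run. The sketch below is only an illustration: it assumes the config is plain YAML with the keys documented above and that `run_bench` accepts a config path, as in the quick-start example.

```python
# Sketch: run a fast smoke test by shrinking limits.max_samples.
# Assumes bench_config.yaml is plain YAML and run_bench() takes a config path.
import asyncio
import tempfile

import yaml

from benches.harness import run_bench

with open("bench_config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("limits", {})["max_samples"] = 5  # keep the run tiny

with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as tmp:
    yaml.safe_dump(cfg, tmp)

asyncio.run(run_bench(tmp.name))
```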
- Free‑text QA JSONL (one object per line):

```json
{"id":"1","input":"Capital of France?","target":"Paris"}
```

- Multiple-choice JSONL:

```json
{"id":"1","input":"Capital of France?","choices":["Paris","Lyon"],"answer":"Paris"}
```

Implement a `Provider` with two async methods:

- `generate(prompt, system=None, options=None) -> dict`: returns at least `output` and may provide `latency_s`, `ttft_s`, `prompt_eval_count`, `eval_count`.
- `tokenize(text) -> int`: optional but helpful for token counts.
Included adapters:

- `OllamaProvider` (calls `/api/generate` and `/api/tokenize`)
- `OpenAIStyleProvider` (calls `/v1/chat/completions`)
- `MockProvider` (local, for testing and CI)
Add your provider implementation to `benches/providers.py` and register it in `_load_provider` in `benches/harness.py`.
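As a starting point, here is a minimal sketch of a custom provider that follows the interface above. The class name, the echo behaviour, and the whitespace tokenizer are invented for the example; replace the body of `generate()` with real calls to your backend, then register the class under a new `kind` in `_load_provider`.

```python
# Sketch of a custom provider (hypothetical EchoProvider). It only follows the
# generate()/tokenize() contract described above; swap the echo for real calls.
import time


class EchoProvider:
    async def generate(self, prompt, system=None, options=None) -> dict:
        start = time.perf_counter()
        output = f"echo: {prompt}"  # stand-in for a real model call
        return {
            "output": output,                          # required
            "latency_s": time.perf_counter() - start,  # optional extras below
            "prompt_eval_count": await self.tokenize(prompt),
            "eval_count": await self.tokenize(output),
        }

    async def tokenize(self, text) -> int:
        # Crude whitespace count; good enough for rough token statistics.
        return len(text.split())
```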
- Exact Match (EM), token-level F1, and multiple-choice accuracy are implemented in `benches/metrics.py`.
- BLEU and ROUGE-L are optional; they require `sacrebleu` and `rouge-score` respectively.
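For reference, EM and token-level F1 are conventionally computed as in the sketch below; this is a generic illustration with an inline `lower_strip`-style normalization, not a copy of `benches/metrics.py`.

```python
# Conventional EM / token-F1 definitions (illustrative only; see
# benches/metrics.py for the harness's actual implementation).
from collections import Counter


def normalize(text: str) -> str:
    # Roughly what a "lower_strip" normalization would do.
    return text.lower().strip()


def exact_match(pred: str, target: str) -> float:
    return float(normalize(pred) == normalize(target))


def token_f1(pred: str, target: str) -> float:
    pred_tokens = normalize(pred).split()
    target_tokens = normalize(target).split()
    if not pred_tokens or not target_tokens:
        return float(pred_tokens == target_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)
```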
`benches/monitor.py` samples process CPU/RAM (via `psutil`) and optionally GPU stats via NVML.
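The sampling idea, in simplified form, looks roughly like this. It is a `psutil`-only sketch (no NVML), and the CSV column names are made up for the example, so check `benches/monitor.py` and your actual `*_resources.csv` for the real ones.

```python
# Simplified sketch of process CPU/RAM sampling with psutil (illustrative;
# benches/monitor.py is the real implementation and also handles NVML/GPU).
import csv
import time

import psutil


def sample_resources(out_path: str, duration_s: float = 10.0, interval_s: float = 0.5) -> None:
    proc = psutil.Process()
    proc.cpu_percent(None)  # prime the counter so the first reading is meaningful
    end = time.time() + duration_s
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "cpu_percent", "rss_mb"])  # illustrative column names
        while time.time() < end:
            writer.writerow([
                time.time(),
                proc.cpu_percent(None),
                proc.memory_info().rss / 1e6,
            ])
            time.sleep(interval_s)
```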
- GPU sampling is optional and controlled by the environment variable `LLM_BENCH_SKIP_GPU=1` (CI sets this variable by default).
- GPU support is available via the optional package extra `gpu` (recommended package `nvidia-ml-py`; fallback to `pynvml` is supported).

Install the GPU extra locally with:

```bash
python -m pip install -e .[gpu]
```

Note: GitHub-hosted runners do not provide GPUs; the CI workflow sets `LLM_BENCH_SKIP_GPU=1`.
- `*.jsonl`: per-sample detailed results
- `*_summary.csv`: single-row summary (latency percentiles, accuracy means, tokens)
- `*_resources.csv`: timeline of CPU/RAM/(optional) GPU samples
- `*_report.md`: compact human-readable report
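To take a quick look at the artifacts after a run, something like the following works; the file names assume an output prefix of `reports/run`, so substitute whatever `io.output_prefix` you configured.

```python
# Quick inspection of run artifacts (file names assume an output prefix of
# "reports/run"; adjust to your io.output_prefix).
import json

import pandas as pd

# Per-sample results: one JSON object per line.
with open("reports/run.jsonl") as f:
    first = json.loads(f.readline())
print("per-sample fields:", sorted(first))

# Single-row summary (latency percentiles, accuracy means, token counts).
print(pd.read_csv("reports/run_summary.csv").T)
```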
- Unit and integration tests live in `tests/`.
- Run tests locally with `pytest` or `make test`.
- CI (`.github/workflows/ci.yml`) runs the tests and sets `LLM_BENCH_SKIP_GPU=1` so GPU sampling is skipped on GitHub runners.
- `examples/run_mock.py`: programmatic example that runs the harness against the MockProvider.
- `benches/plot.py`: helper to plot `resources.csv` (requires `matplotlib` + `pandas`).
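If you want to plot the timeline by hand instead of using `benches/plot.py`, a rough version is below; it only assumes the first column of the resources CSV is a timestamp and plots whatever numeric columns follow.

```python
# Hand-rolled plot of a resources timeline (benches/plot.py does this properly).
# Assumes the first CSV column is a timestamp; all other numeric columns are plotted.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("reports/run_resources.csv")  # adjust to your output prefix
time_col = df.columns[0]
metric_cols = [c for c in df.columns[1:] if pd.api.types.is_numeric_dtype(df[c])]
df.plot(x=time_col, y=metric_cols, subplots=True, figsize=(8, 2 * max(len(metric_cols), 1)))
plt.tight_layout()
plt.savefig("resources.png")
```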
- Add a provider: implement `Provider.generate()` and `tokenize()`, and register it in `_load_provider`.
- Add a metric: implement it in `benches/metrics.py` and wire it into `benches/harness.py`.
- Throughput sweeps: write a wrapper that modifies the `bench_config.yaml` concurrency/batch settings and re-runs the harness to gather scaling data (see the sketch below).
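A concurrency sweep can reuse the same pattern as the smoke-test snippet above: load the YAML, rewrite `load.concurrency` and `io.output_prefix`, write a temporary config, and re-run the harness once per setting. A rough sketch, under the same assumptions, follows.

```python
# Sketch of a throughput sweep over load.concurrency. Key names follow the
# config fields documented above; adjust if your bench_config.yaml differs.
import asyncio
import copy
import tempfile

import yaml

from benches.harness import run_bench


def run_sweep(base_config: str = "bench_config.yaml", levels=(1, 2, 4, 8)) -> None:
    with open(base_config) as f:
        base = yaml.safe_load(f)

    for concurrency in levels:
        cfg = copy.deepcopy(base)
        cfg.setdefault("load", {})["concurrency"] = concurrency
        cfg.setdefault("io", {})["output_prefix"] = f"reports/sweep_c{concurrency}"

        with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as tmp:
            yaml.safe_dump(cfg, tmp)

        asyncio.run(run_bench(tmp.name))


if __name__ == "__main__":
    run_sweep()
```

Comparing the resulting `*_summary.csv` files then gives the latency/throughput scaling data.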
MIT — do what you want, but please share interesting improvements.
GPU (optional):
The harness can sample NVIDIA GPU stats via NVML. This is optional — GitHub Actions runners don't have GPUs, and CI skips GPU sampling by default.
To install the optional GPU dependency locally:
```bash
python -m pip install -e .[gpu]
# or: pip install nvidia-ml-py
```

CI note: the provided GitHub Actions workflow sets `LLM_BENCH_SKIP_GPU=1`, so GPU sampling is disabled in CI.