Lightweight, configurable benchmark harness for LLM providers. Measure accuracy (EM/F1/MC), latency/throughput, and resource timelines (CPU/RAM, optional GPU).
This repository is intentionally small and pluggable — add adapters for your provider and drop in any JSONL dataset.
- Create and activate a Python virtualenv. The project supports Python 3.10+.
- (Optional) install dev deps for testing and plotting:

```bash
python -m pip install -r requirements-dev.txt
# Optional GPU support: python -m pip install -e .[gpu]
```

- Run the example harness (uses an in-repo MockProvider):
```bash
python - <<'PY'
import asyncio, sys
sys.path.insert(0, '')  # ensure the current directory (the repo checkout) is importable
from benches.harness import run_bench
asyncio.run(run_bench('bench_config.yaml'))
PY
```

Output files are written to the `reports/` prefix declared in `bench_config.yaml` (a per-sample JSONL, a CSV summary, a resources timeline, and a compact Markdown report).
Key fields:

- `provider`: select `kind: mock | ollama | openai` and provider-specific connection options.
- `io.dataset_path`: path to the JSONL dataset.
- `io.output_prefix`: prefix for the output artifacts in `reports/`.
- `prompt.system` and `prompt.template`: system message and per-sample template using `{input}` and other fields from the dataset.
- `load.concurrency` and `load.batch_size`: concurrency/batch settings.
- `limits.max_samples`: limit the number of samples for fast experiments.
- `metrics.normalization`: optional normalization (e.g., `lower_strip`) applied to accuracy metrics.
An example config is included as `bench_config.yaml`.
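For quick experiments you can also patch the config programmatically before a run. The sketch below is only an illustration: it assumes the config is plain YAML with the keys documented above and that `run_bench` accepts a config path, as in the quick-start example.

```python
# Sketch: run a fast smoke test by shrinking limits.max_samples.
# Assumes bench_config.yaml is plain YAML and run_bench() takes a config path.
import asyncio
import tempfile

import yaml

from benches.harness import run_bench

with open("bench_config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("limits", {})["max_samples"] = 5  # keep the run tiny

with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as tmp:
    yaml.safe_dump(cfg, tmp)

asyncio.run(run_bench(tmp.name))
```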
- Free‑text QA JSONL (one object per line):

```json
{"id":"1","input":"Capital of France?","target":"Paris"}
```

- Multiple-choice JSONL:

```json
{"id":"1","input":"Capital of France?","choices":["Paris","Lyon"],"answer":"Paris"}
```

Implement a `Provider` with two async methods:

- `generate(prompt, system=None, options=None) -> dict`: returns at least `output` and may provide `latency_s`, `ttft_s`, `prompt_eval_count`, `eval_count`.
- `tokenize(text) -> int`: optional but helpful for token counts.
Included adapters:

- `OllamaProvider` (calls `/api/generate` and `/api/tokenize`)
- `OpenAIStyleProvider` (calls `/v1/chat/completions`)
- `MockProvider` (local, for testing and CI)
Add your provider implementation to `benches/providers.py` and register it in `_load_provider` in `benches/harness.py`.
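As a starting point, here is a minimal sketch of a custom provider that follows the interface above. The class name, the echo behaviour, and the whitespace tokenizer are invented for the example; replace the body of `generate()` with real calls to your backend, then register the class under a new `kind` in `_load_provider`.

```python
# Sketch of a custom provider (hypothetical EchoProvider). It only follows the
# generate()/tokenize() contract described above; swap the echo for real calls.
import time


class EchoProvider:
    async def generate(self, prompt, system=None, options=None) -> dict:
        start = time.perf_counter()
        output = f"echo: {prompt}"  # stand-in for a real model call
        return {
            "output": output,                          # required
            "latency_s": time.perf_counter() - start,  # optional extras below
            "prompt_eval_count": await self.tokenize(prompt),
            "eval_count": await self.tokenize(output),
        }

    async def tokenize(self, text) -> int:
        # Crude whitespace count; good enough for rough token statistics.
        return len(text.split())
```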
- Exact Match (EM), token-level F1, and multiple-choice accuracy are implemented in `benches/metrics.py`.
- BLEU and ROUGE-L are optional; they require `sacrebleu` and `rouge-score` respectively.
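For reference, EM and token-level F1 are conventionally computed as in the sketch below; this is a generic illustration with an inline `lower_strip`-style normalization, not a copy of `benches/metrics.py`.

```python
# Conventional EM / token-F1 definitions (illustrative only; see
# benches/metrics.py for the harness's actual implementation).
from collections import Counter


def normalize(text: str) -> str:
    # Roughly what a "lower_strip" normalization would do.
    return text.lower().strip()


def exact_match(pred: str, target: str) -> float:
    return float(normalize(pred) == normalize(target))


def token_f1(pred: str, target: str) -> float:
    pred_tokens = normalize(pred).split()
    target_tokens = normalize(target).split()
    if not pred_tokens or not target_tokens:
        return float(pred_tokens == target_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)
```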
`benches/monitor.py` samples process CPU/RAM (via `psutil`) and optionally GPU stats via NVML.
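The sampling idea, in simplified form, looks roughly like this. It is a `psutil`-only sketch (no NVML), and the CSV column names are made up for the example, so check `benches/monitor.py` and your actual `*_resources.csv` for the real ones.

```python
# Simplified sketch of process CPU/RAM sampling with psutil (illustrative;
# benches/monitor.py is the real implementation and also handles NVML/GPU).
import csv
import time

import psutil


def sample_resources(out_path: str, duration_s: float = 10.0, interval_s: float = 0.5) -> None:
    proc = psutil.Process()
    proc.cpu_percent(None)  # prime the counter so the first reading is meaningful
    end = time.time() + duration_s
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "cpu_percent", "rss_mb"])  # illustrative column names
        while time.time() < end:
            writer.writerow([
                time.time(),
                proc.cpu_percent(None),
                proc.memory_info().rss / 1e6,
            ])
            time.sleep(interval_s)
```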
- GPU sampling is optional and controlled by the environment variable `LLM_BENCH_SKIP_GPU=1` (CI sets this variable by default).
- GPU support is available via the optional package extra `gpu` (recommended package `nvidia-ml-py`; fallback to `pynvml` is supported).

Install the GPU extra locally with:

```bash
python -m pip install -e .[gpu]
```

Note: GitHub-hosted runners do not provide GPUs; the CI workflow sets `LLM_BENCH_SKIP_GPU=1`.
- `*.jsonl`: per-sample detailed results
- `*_summary.csv`: single-row summary (latency percentiles, accuracy means, tokens)
- `*_resources.csv`: timeline of CPU/RAM/(optional) GPU samples
- `*_report.md`: compact human-readable report
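To take a quick look at the artifacts after a run, something like the following works; the file names assume an output prefix of `reports/run`, so substitute whatever `io.output_prefix` you configured.

```python
# Quick inspection of run artifacts (file names assume an output prefix of
# "reports/run"; adjust to your io.output_prefix).
import json

import pandas as pd

# Per-sample results: one JSON object per line.
with open("reports/run.jsonl") as f:
    first = json.loads(f.readline())
print("per-sample fields:", sorted(first))

# Single-row summary (latency percentiles, accuracy means, token counts).
print(pd.read_csv("reports/run_summary.csv").T)
```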
- Unit and integration tests live in `tests/`.
- Run tests locally with `pytest` or `make test`.
- CI (`.github/workflows/ci.yml`) runs the tests and sets `LLM_BENCH_SKIP_GPU=1` so GPU sampling is skipped on GitHub runners.
- `examples/run_mock.py`: programmatic example that runs the harness against the MockProvider.
- `benches/plot.py`: helper to plot `resources.csv` (requires `matplotlib` + `pandas`).
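If you want to plot the timeline by hand instead of using `benches/plot.py`, a rough version is below; it only assumes the first column of the resources CSV is a timestamp and plots whatever numeric columns follow.

```python
# Hand-rolled plot of a resources timeline (benches/plot.py does this properly).
# Assumes the first CSV column is a timestamp; all other numeric columns are plotted.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("reports/run_resources.csv")  # adjust to your output prefix
time_col = df.columns[0]
metric_cols = [c for c in df.columns[1:] if pd.api.types.is_numeric_dtype(df[c])]
df.plot(x=time_col, y=metric_cols, subplots=True, figsize=(8, 2 * max(len(metric_cols), 1)))
plt.tight_layout()
plt.savefig("resources.png")
```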
- Add a provider: implement `Provider.generate()` and `tokenize()`, and register it in `_load_provider`.
- Add a metric: implement it in `benches/metrics.py` and wire it into `benches/harness.py`.
- Throughput sweeps: write a wrapper that modifies the `bench_config.yaml` concurrency/batch settings and re-runs the harness to gather scaling data (see the sketch below).
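A concurrency sweep can reuse the same pattern as the smoke-test snippet above: load the YAML, rewrite `load.concurrency` and `io.output_prefix`, write a temporary config, and re-run the harness once per setting. A rough sketch, under the same assumptions, follows.

```python
# Sketch of a throughput sweep over load.concurrency. Key names follow the
# config fields documented above; adjust if your bench_config.yaml differs.
import asyncio
import copy
import tempfile

import yaml

from benches.harness import run_bench


def run_sweep(base_config: str = "bench_config.yaml", levels=(1, 2, 4, 8)) -> None:
    with open(base_config) as f:
        base = yaml.safe_load(f)

    for concurrency in levels:
        cfg = copy.deepcopy(base)
        cfg.setdefault("load", {})["concurrency"] = concurrency
        cfg.setdefault("io", {})["output_prefix"] = f"reports/sweep_c{concurrency}"

        with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as tmp:
            yaml.safe_dump(cfg, tmp)

        asyncio.run(run_bench(tmp.name))


if __name__ == "__main__":
    run_sweep()
```

Comparing the resulting `*_summary.csv` files then gives the latency/throughput scaling data.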
MIT — do what you want, but please share interesting improvements.
GPU (optional):
The harness can sample NVIDIA GPU stats via NVML. This is optional — GitHub Actions runners don't have GPUs, and CI skips GPU sampling by default.
To install the optional GPU dependency locally:
```bash
python -m pip install -e .[gpu]
# or: pip install nvidia-ml-py
```

CI note: the provided GitHub Actions workflow sets `LLM_BENCH_SKIP_GPU=1`, so GPU sampling is disabled in CI.