Baseweight Benchmark

A configurable pipeline for comparing QLoRA fine-tuned open-source models against frontier API models on various tasks. Measures accuracy, latency, cost per query, and 12-month TCO under training-data-constrained conditions (zero-shot, 5-shot, and LoRA fine-tuned on 500 or full examples).

Results: baseweight.co/benchmark Methodology: baseweight.co/methodology

What this benchmarks

Open-source models (QLoRA fine-tuned via Unsloth + vLLM):

Model	Parameter count
Qwen3-8B	8B
Gemma 3 4B	4B
Phi-4 Mini	3.8B

Frontier API baseline (zero-shot and 5-shot):

Model	Provider	Role
GPT-5.4 Mini	OpenAI	v1 benchmark baseline
GPT-5.4 Nano	OpenAI	`--smoke-test` stand-in

OpenAI SFT was dropped from the benchmark (deprecated), so API models are evaluated zero-shot and 5-shot only — there is no API fine-tuning condition.

Tasks and metrics:

Task	Dataset	Type	Metric
Customer support routing	BANKING77	Classification	Weighted F1
Contract clause extraction	CUAD	Extraction	Token F1 (+ answer-detection F1, AUPR)
Legal document classification	LEDGAR	Classification	Macro F1
Financial sentiment	FPB	Classification	Macro F1
Medical QA	MedMCQA	Classification	Accuracy
Code generation	MBPP	Code	Pass@1

CUAD uses the full Atticus CUAD QA grid (theatticusproject/cuad-qa), balanced 50/50 between questions whose clause is present and questions where it is absent — so it measures abstention as well as extraction. Long contracts are handled with sliding-window chunking (one inference per window, aggregated to a per-question score). See configs/tasks/cuad.yaml for the methodology detail.

Conditions per model:

Condition	Description
`zero-shot`	System + user prompt, no examples
`5-shot`	5 in-context examples
`lora`	QLoRA fine-tuned on the task's training set (open-source models only)

Repository layout

The repo is the persistence boundary — everything the pipeline needs to keep across runs lives inside it. Nothing under /workspace, $HOME, or any other host-specific path.

configs/         Task and model YAML configs, pricing
data/            Raw and prepared datasets        (gitignored)
prompts/         Per-task prompt templates
results/         Predictions, classified outputs, summaries (gitignored)
checkpoints/     HF Trainer checkpoints, train_state.json   (gitignored)
.cache/          HF, vLLM, torch_inductor, triton, pip caches (gitignored)
.conda-envs/     Project conda env                (gitignored)
scripts/         Pipeline scripts
site/            Static dashboard (Chart.js)

The only state outside the repo is $HOME/miniconda3 (the conda installer itself, treated as a system tool). start.sh reinstalls it if missing, so a fresh container or new machine reaches the same working state with one command.

Quick start

git clone https://github.com/baseweight/baseweight-benchmark.git
cd baseweight-benchmark
./start.sh                    # installs miniconda + creates the conda env in-repo
cp .env.example .env          # add OPENAI_API_KEY, GOOGLE_API_KEY, HF_TOKEN
conda activate ./.conda-envs/baseweight-benchmark

start.sh is idempotent — safe to re-run any time. It detects what's missing (miniconda, conda env, env-vars wiring) and provisions only that. When environment.yml changes, it runs conda env update --prune. Force a clean rebuild with ./start.sh --recreate-env.

The conda env's activate hook auto-exports HF_HOME, VLLM_CACHE_ROOT, TORCHINDUCTOR_CACHE_DIR, TRITON_CACHE_DIR, and PIP_CACHE_DIR into <repo>/.cache/, so conda activate is the only setup step a user has to remember.

1. Download and prepare a task

python scripts/download_data.py --task banking77
python scripts/prepare_datasets.py --task banking77

Pass --task all to operate on all six tasks. Both scripts require an explicit --task argument — they will not run all tasks by default.

2. Fine-tune an open-source model

# Train on 500 examples and full set
python scripts/train_local.py --model qwen3-8b --task banking77 --condition all

# With HuggingFace auto-upload (recommended for remote GPU persistence)
python scripts/train_local.py --model qwen3-8b --task banking77 --condition all --auto-upload

3. Evaluate

# Local model via vLLM
python scripts/eval_local.py --model qwen3-8b --task banking77 --condition all

# Frontier API baseline (zero-shot and 5-shot)
python scripts/eval_api.py --model gpt-5.4-mini --task banking77 --condition all

4. Classify errors and compute metrics

python scripts/classify_errors.py --task banking77

5. Generate dashboard data

python scripts/generate_dashboard_data.py

Configuring your own run

Each task has a YAML in configs/tasks/<task_id>.yaml. Key fields:

metric_id: which metric to compute (weighted_f1, macro_f1, accuracy, token_f1)
max_seq_length: overrides the model's default for that task
training_cap: caps the full training set size
test_sample_size: caps test set for faster evaluation

Model training configs live in configs/training/<model_id>.yaml and control LoRA hyperparameters, sequence length, and enable_thinking for Qwen3.

API pricing is in configs/pricing.yaml and feeds cost-per-query and TCO calculations in the dashboard.

Artifact persistence

Everything the pipeline produces — datasets, prepared splits, training checkpoints, adapters, predictions, summaries — lives inside the repo. On a RunPod-style host where the repo is cloned onto the persistent volume, that volume IS the persistence boundary; no separate "network volume" path is involved.

For offsite backup or cross-host sharing, use sync_artifacts.py:

# Sync everything to HuggingFace (safe to run any time)
python scripts/sync_artifacts.py --what all

# Download adapters and predictions on a new pod
python scripts/sync_artifacts.py --what adapters --direction down

Smoke mode

--smoke-test runs a tiny end-to-end pipeline (12-row dataset, 1-epoch training) against a small stand-in model. All smoke outputs live under a parallel smoke/ namespace so they cannot clobber real benchmark data:

data/smoke/raw/ — tiny raw downloads
data/smoke/prepared/ — smoke-prepared JSONL
checkpoints/smoke/ — smoke training checkpoints
results/smoke/ — predictions, summaries, classified errors, training metadata, adapters
dashboard-data/smoke/ — smoke dashboard render

Read-only paths (configs in configs/, prompts in prompts/) are shared between smoke and real runs. The --smoke-test flag is plumbed through every stage by scripts/run.py and is also accepted at each stage script directly. Auto-upload to HuggingFace (train_local.py --auto-upload) is disabled for smoke runs. prepare_datasets.py validates raw row counts against EXPECTED_COUNTS (in pipeline/data_quality.py) on non-smoke runs and fails loudly if the raw artifact looks truncated (e.g. left over from a stale smoke download under an older layout).

Limitations

Hyperparameters are not exhaustively tuned. LoRA rank, learning rate, weight decay, and warmup are documented defaults (configs/training/), with one principled per-task override (configs/tasks/fpb.yaml: a lower learning rate and higher weight decay for that small dataset). They were chosen a priori — never tuned on the test set — and not searched via a per-task sweep or k-fold cross-validation. Within a run, the epoch checkpoint is selected by load_best_model_at_end on a held-out, stratified validation split (val.jsonl, produced by prepare_datasets.py). A fuller hyperparameter search could yield modest gains, but the test sets are small enough that such gains may fall within measurement noise.
Test sets are modest in size. Metrics are reported as mean ± spread across n_eval_seeds resamples of the test set (configs/run_defaults.yaml) to quantify sampling noise; single-seed numbers should not be over-interpreted.

License

Code: MIT. Model adapters follow each model's original license. Datasets: see individual dataset cards on HuggingFace.

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
.vscode		.vscode
configs		configs
dashboard-data		dashboard-data
principles		principles
prompts		prompts
scripts		scripts
tests		tests
.coveragerc		.coveragerc
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
README.md		README.md
environment.yml		environment.yml
pytest.ini		pytest.ini
run.sh		run.sh
start.sh		start.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Baseweight Benchmark

What this benchmarks

Repository layout

Quick start

1. Download and prepare a task

2. Fine-tune an open-source model

3. Evaluate

4. Classify errors and compute metrics

5. Generate dashboard data

Configuring your own run

Artifact persistence

Smoke mode

Limitations

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Baseweight Benchmark

What this benchmarks

Repository layout

Quick start

1. Download and prepare a task

2. Fine-tune an open-source model

3. Evaluate

4. Classify errors and compute metrics

5. Generate dashboard data

Configuring your own run

Artifact persistence

Smoke mode

Limitations

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages