A 3B-parameter text-to-SQL model that matches Claude Haiku 4.5 — at $0 per query.
| NicheLM v1 | Claude Haiku 4.5 | Llama 3.2 3B (base) | |
|---|---|---|---|
| Execution accuracy | 85.1 % | 85.0 % | 65.0 % |
| Cost per query | $0 (local) | $0.0006 | $0 |
| Mean latency | 1.7 s (RTX 4090) | 0.9 s | 2.7 s |
| Trainable parameters | 24.3 M (0.75 % of 3 B) | — | — |
| Training cost | ~$0.50 / ~14 min on a 4090 | — | — |
NicheLM is a fine-tuned Llama 3.2 3B Instruct that takes a SQL schema and a natural-language question and produces a valid SQLite query. It is trained on the public Spider dataset (200+ databases, ~7k pairs) and evaluated on a held-out, never-seen e-commerce schema. The target metric is execution accuracy; the goal is to match or beat Claude Haiku 4.5 at a fraction of the cost per query — and it does.
System prompt — schema (excerpt):
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT, signup_date TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT, total REAL,
FOREIGN KEY (customer_id) REFERENCES customers(id));User question:
Which 3 customers from Germany have placed the most orders?
NicheLM output:
SELECT c.name, COUNT(o.id) AS n_orders
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.country = 'Germany'
GROUP BY c.id
ORDER BY n_orders DESC
LIMIT 3;The model never saw this schema during training. It generalizes from 200+ unrelated Spider databases.
First training run complete — QLoRA fine-tune on Spider, evaluated on the held-out e-commerce schema. NicheLM matches Claude Haiku 4.5 on execution accuracy at $0 per query (local inference) versus Haiku's $0.0006 per query. Baselines below were sampled at n=20 while the tuned model ran the full n=342 test set; a uniform n=500 sweep is on the roadmap.
| model | exec_acc | exact | valid_sql | mean_latency_s | mean_cost_usd | n |
|---|---|---|---|---|---|---|
| claude-haiku-4-5 | 0.850 | 0.500 | 1.000 | 0.874 | 0.000602 | 20 |
| llama-3.2-3b-base | 0.650 | 0.350 | 0.900 | 2.714 | 0.000000 | 20 |
| nichelm-v1 | 0.851 | 0.056 | 0.991 | 1.696 | 0.000000 | 342 |
QLoRA fine-tune on a single RTX 4090 — 313 steps, effective batch size 16, ~2.6M tokens seen, total wall time ~14 minutes. Loss collapses fast (Spider is a forgiving format) and converges by step ~150.
erDiagram
customers ||--o{ orders : places
customers ||--o{ reviews : writes
categories ||--o{ categories : "parent_id"
categories ||--o{ products : "categorizes"
products ||--o{ order_items : "in"
orders ||--o{ order_items : contains
products ||--o{ reviews : "about"
customers {
INTEGER id PK
TEXT name
TEXT email "UNIQUE"
TEXT country
TEXT signup_date
}
categories {
INTEGER id PK
TEXT name
INTEGER parent_id FK "NULL = top-level"
}
products {
INTEGER id PK
TEXT name
INTEGER category_id FK
REAL price
INTEGER stock
}
orders {
INTEGER id PK
INTEGER customer_id FK
TEXT order_date
TEXT status
REAL total
}
order_items {
INTEGER id PK
INTEGER order_id FK
INTEGER product_id FK
INTEGER quantity
REAL unit_price
}
reviews {
INTEGER id PK
INTEGER product_id FK
INTEGER customer_id FK
INTEGER rating
TEXT body
TEXT created_at
}
Six tables, foreign keys declared, indexed on every FK, 2-level category hierarchy. See data/schema.sql.
# 1. Offline pipeline (laptop, ~1 min total)
make install # uv sync --extra dev
make seed # creates data/processed/ecom.sqlite
make dataset-eval # creates test.jsonl (500 rows)
make inspect FILE=data/processed/test.jsonl # eyeball the eval set
make qc # validates against the real DB
# 2. Spider train set (needs `data` extra + network, ~10-15 min)
uv sync --extra dev --extra data
make dataset-train # auto-discovers Spider in HF cache
make inspect FILE=data/processed/train.jsonl SAMPLE=10
make qc # full QC including denylist + overlap
# 3. Baselines (needs `eval` extra + ANTHROPIC_API_KEY in .env)
uv sync --extra dev --extra data --extra eval
make baselines-smoke # 5-prompt Claude smoke test (~$0.01)
make baselines # full 500 — includes local Llama (overnight on CPU)Windows / PowerShell — same targets via the bundled wrapper:
.\make.ps1 install
.\make.ps1 seed
.\make.ps1 dataset-eval
.\make.ps1 inspect -InspectFile data/processed/test.jsonl
.\make.ps1 qc
.\make.ps1 ci # = lint + test + fixture qcThe synthetic e-commerce test set is generated locally and never overlaps with Spider's training distribution by construction (see "Decisions" below).
- Repo scaffold (CI green: lint + tests + fixture quality check)
- Synthetic e-commerce DB seeded
- Spider train/val JSONL built
- E-commerce 500-question test set built
- Baselines run (Claude Haiku 4.5, raw Llama 3.2 3B)
- First training run (Unsloth QLoRA on RunPod 4090, ~14 min)
Data. Training uses xlangai/spider from the Hugging Face Hub. Each row's prompt embeds the full CREATE TABLE statements for that database in the system message — the same format the model sees at inference time. The eval harness is a fixed e-commerce schema with deterministic Faker-seeded data and a registry of ~30 SQL "shapes" that produce 500 validated (question, SQL) pairs against the seeded DB.
Train/eval domain isolation. A small denylist of clearly e-commerce-flavored Spider databases (e_commerce, store_product, customers_card_transactions, …) is excluded from training so the held-out evaluation is genuinely unseen. The denylist lives in data/build_train_dataset.py and is enforced by data/quality_check.py.
Evaluation. The primary metric is execution accuracy — predicted and gold SQL are both run against a fresh seeded DB and the row sets are compared as sorted lists of tuples. Secondary metrics (exact match after sqlglot normalization, valid-SQL rate) are reported but not optimized.
Training. Unsloth + TRL SFTTrainer, QLoRA rank 16 / alpha 32 on q/k/v/o/gate/up/down proj, 4-bit base, 2 epochs over 5,000 Spider examples at lr 2e-4. Driven by train/configs/default.yaml. Designed to fit on a single 24 GB GPU.
Tested on macOS / Linux / WSL. Windows works for the scaffold (lint/test/qc) but the Spider build and baselines need data / eval extras.
git clone <this repo>
cd nichelm
cp .env.example .env # then fill in real keys
# CI parity (no GPU, no network):
make install # uv sync --extra dev
make ci # lint + tests + fixture qc
# Full pipeline (needs network for Spider + API keys for baselines):
uv sync --extra dev --extra data --extra eval
make seed
make dataset # uses --seed 42 by default
make qc
make baselinesTraining and tuned-model eval run on a rented GPU (24 GB+):
uv sync --extra train
make train # QLoRA fine-tune, writes outputs/nichelm-v1/final/
make eval # runs the checkpoint over test.jsonl
# override target with: make eval CHECKPOINT=outputs/my-run/final- One DB engine. SQLite dialect only (Spider's native dialect for testing). PostgreSQL / MySQL would need re-tokenizing the schema and probably re-tuning.
- One natural language. English only; Spider is en-only and so is the e-commerce eval.
- One target schema for the headline number. The held-out evaluation is fixed e-commerce. The model is general (Spider trained), but the published exec-accuracy figure speaks to e-commerce transfer specifically.
- Synthetic data. The e-commerce DB is Faker-generated. Real-world distributions (long-tail product names, multilingual reviews, NULLs everywhere) are not represented.
- No agentic loops. This is a single-shot prompt → SQL model. No tool use, no result-set inspection, no self-correction.
Design choices made during scaffolding, with brief justifications:
- Spider as training corpus, fixed e-commerce as held-out eval — the portfolio narrative ("trained on 200 DBs, beats Claude Haiku on a brand new schema") is more compelling than "fine-tuned on a synthetic single-domain set". Confirmed with the user.
- Schema-in-system-prompt format — required to generalize across Spider's many DBs; carries no overhead at e-commerce eval time and lets the deployed server work against arbitrary user schemas.
- Denylist of e-commerce-shaped Spider DBs in training — keeps the "domain it never saw" claim honest. Documented in
data/build_train_dataset.py::DENYLIST. - Spider DB source:
SALT-NLP/spider_VALUE/data.zip— the standardxlangai/spiderHF dataset is parquet-only (text fields).SALT-NLP/spider_VALUE/data.zipis a public mirror of the canonical Yale tarball with all 166 per-DB SQLite files atdata/database/<db_id>/<db_id>.sqlite. Auto-downloaded bybuild_train_dataset.pyto.hf_cache/spider_databases/on first run. max_seq_length = 4096— Spider DDL plus a long question can run past 2 k tokens; 4 k gives headroom without halving throughput.sqlglotfor SQL normalization — handles SQLite-flavored parsing for exact-match. Falls back to whitespace-lower if a query refuses to parse.- Faker
en_US, fixed seed 42 — full data reproducibility from a single flag. - Anthropic model id
claude-haiku-4-5-20251001(full id) — the API takes the full id; we keepclaude-haiku-4-5as the CLI alias for ergonomics. uvextras:dev,data,eval,train,serve— keeps CI installs lean (justdev) so torch / transformers / datasets aren't pulled into a PR check that doesn't need them.- Not committing generated data —
data/processed/*.{jsonl,sqlite}is.gitignored. Reproduce frommake seed && make dataset(requires--seed 42for byte-identical output).
nichelm/
├── data/
│ ├── schema.sql # 6-table e-commerce schema
│ ├── seed.py # Faker-seeded SQLite population
│ ├── _common.py # shared seeding / JSONL / DDL helpers
│ ├── build_train_dataset.py # Spider → train.jsonl + val.jsonl
│ ├── build_eval_dataset.py # synthetic e-commerce → test.jsonl
│ └── quality_check.py # validates everything
├── eval/
│ ├── metrics.py # execution_accuracy / exact_match / valid_sql_rate
│ ├── run_baselines.py # Claude Haiku 4.5 + raw Llama 3.2 3B
│ └── run_eval.py # tuned-checkpoint runner
├── train/
│ ├── train.py # Unsloth QLoRA (RUN ON GPU ONLY)
│ └── configs/default.yaml
├── serve/
│ ├── server.py # FastAPI generate endpoint
│ └── Modelfile # Ollama template (Llama 3.2 chat tokens)
├── tests/ # pytest + JSONL fixtures (no DB / network)
├── Makefile
├── pyproject.toml # uv-managed, four optional extras
└── .github/workflows/ci.yml # lint + test + fixture qc
Zsombor Horvath — built as a portfolio project to demonstrate end-to-end LLM fine-tuning:
data pipeline → training → evaluation → serving, all driven by a single Makefile and reproducible from --seed 42.
Feedback, issues, and pull requests welcome.
MIT — free to use, modify, and redistribute.



