-
Notifications
You must be signed in to change notification settings - Fork 1
Evaluation
SQL Query Engine includes two independent evaluation pipelines that measure the impact of the self-healing loop on query accuracy. Both use a 3-config ablation study tested across 5 LLM backends.
- Synthetic Evaluation (this page) — 120 questions across 3 controlled PostgreSQL databases with reproducible Faker-generated data
- BIRD Benchmark — 500 real-world questions from the BIRD mini-dev set across 11 databases, with automatic SQLite-to-PostgreSQL migration
The central claim of SQL Query Engine is that its self-healing loop — where PostgreSQL diagnostics (SQLSTATE codes, hints, tracebacks) are fed back to the LLM for iterative repair — improves query accuracy over raw single-shot generation. The evaluation pipelines quantify that claim.
Both pipelines execute three ablation configurations:
| Config | retryCount | What It Tests |
|---|---|---|
| Config C | 0 (generation only) | Baseline — raw LLM SQL generation quality with no execution feedback |
| Config B | 1 (single attempt) | Single-shot evaluation — can the LLM fix a query with one try? |
| Config A | 5 (full pipeline) | Full self-healing loop — iterative repair with execution feedback |
All three configurations share the same schema context and LLM model. The only difference is how many repair iterations are allowed. The delta between Config A and Config C is the headline result — it directly measures how many initially-failing queries the self-healing loop recovers.
evaluation/
├── shared/ # Shared utilities
│ ├── resultComparator.py # Order-independent result comparison
│ └── resourceMetrics.py # Wall time, memory, throughput tracking
├── synthetic/ # Synthetic evaluation (this page)
│ ├── entrypoint.py # Pipeline orchestrator
│ ├── evalRunner.py # 3-config ablation runner
│ ├── evalConfig.py # Environment-driven config
│ ├── seedData.py # Database seeding (Faker)
│ ├── schemaDefinitions.py # DDL for 3 databases
│ ├── questionRunner.py # Gold query executor
│ ├── scoreReport.py # Summary tables + summary.json
│ ├── questions/ # 120 gold questions (40 per DB)
│ └── results/runs/ # Archived results per model
└── bird/ # BIRD benchmark (see BIRD Evaluation page)
├── birdEntrypoint.py # Pipeline orchestrator
├── birdDataLoader.py # Dataset loading + dialect conversion
├── sqliteToPostgres.py # SQLite → PostgreSQL migration
├── birdEvalRunner.py # 3-config ablation runner
├── birdScoreReport.py # BIRD-specific scoring
└── bird_results/runs/ # Archived results per model
The synthetic pipeline seeds three PostgreSQL databases using Faker with a fixed random seed for reproducibility:
| Database | Domain | Tables | Questions |
|---|---|---|---|
eval_ecommerce |
E-commerce (customers, orders, products, payments, reviews) | 7 | 40 |
eval_hospital |
Hospital management (patients, doctors, appointments, prescriptions, lab results) | 8 | 40 |
eval_university |
University (students, courses, enrollments, faculty, departments) | 7 | 40 |
Each database has 40 gold-annotated questions across 4 difficulty tiers: easy (simple SELECT/COUNT), medium (JOINs, GROUP BY), hard (subqueries, CTEs, window functions), and extra_hard (multi-step aggregations, conditional logic).
Five LLM backends were tested across all three databases and configurations (75 questions per run, 25 per database):
| Model | Config C | Config B | Config A | Delta (A−C) | Healed | Regressions |
|---|---|---|---|---|---|---|
| Llama 4 Scout 17B | 48.0% | 50.7% | 57.3% | +9.3pp | 7 | 0 |
| GPT-OSS 20B | 45.3% | 45.3% | 53.3% | +8.0pp | 6 | 0 |
| Llama 3.3 70B | 53.3% | 54.7% | 54.7% | +1.4pp | 3 | 2 |
| GPT-OSS 120B | 50.7% | 49.3% | 48.0% | −2.7pp | 2 | 4 |
| Qwen3 32B | 48.0% | 42.7% | 46.7% | −1.3pp | 5 | 6 |
Key findings:
- Llama 4 Scout 17B is the best model on synthetic data: highest Config A accuracy (57.3%), largest healing delta (+9.3pp), and zero regressions.
- Self-healing provides up to +9.3pp improvement for models that handle it well.
- Regressions occur when the LLM rewrites a working query into a broken one. The engine's early-accept design (accept on execution success without an LLM call) mitigates this, but some models still regress on queries that return rows with incorrect content.
- GPT-OSS 120B shows a negative delta (−2.7pp) with 4 regressions, demonstrating that larger models don't always perform better at self-healing.
| Difficulty | Config C | Config A | Delta |
|---|---|---|---|
| Easy | 100.0% | 100.0% | — |
| Medium | 77.8% | 77.8% | — |
| Hard | 5.6% | 44.4% | +38.8pp |
| Extra Hard | 0.0% | 0.0% | — |
Self-healing provides the largest gains on hard-difficulty queries, where initial generation often fails but iterative repair can recover the correct result.
| Metric | Config C | Config B | Config A |
|---|---|---|---|
| Wall time | 70.6s | 135.3s | 166.5s |
| Peak memory | 62.1 MB | 64.8 MB | 64.9 MB |
| Throughput | 63.8 q/min | 33.3 q/min | 27.0 q/min |
| p95 latency | 2.72s | 7.08s | 9.90s |
The evaluation compares predicted query results against gold query results using normalized comparison:
- Execute both the predicted and gold queries against the same database
- Sort rows deterministically (by all columns)
- Normalize values: collapse numeric trailing zeros (100.0 → 100), strip whitespace
- Compare row-by-row — an exact match counts as correct
This logic lives in evaluation/shared/resultComparator.py and is shared by both the synthetic and BIRD pipelines.
cd sqlqueryengine
# 1. Set LLM credentials in docker-compose-synthetic-evaluation.yml
# Edit: LLM_BASE_URL, LLM_MODEL, LLM_API_KEY
# 2. Build and launch
docker compose -f docker-compose-synthetic-evaluation.yml build
docker compose -f docker-compose-synthetic-evaluation.yml up -d
# 3. Monitor progress
docker logs eval-runner --tail 10 -f
# 4. Results appear in evaluation/synthetic/results/The evaluation stack starts four services: PostgreSQL (seeded automatically), Redis, the SQL Query Engine, and the evaluation runner.
| Variable | Default | Description |
|---|---|---|
LLM_BASE_URL |
— | OpenAI-compatible endpoint (Groq, vLLM, Ollama, etc.) |
LLM_MODEL |
— | Model ID to benchmark |
LLM_API_KEY |
your-api-key-here |
API key (replace before running) |
EVAL_MAX_WORKERS |
3 |
Parallel question count (1–6, depends on rate limits) |
QUESTIONS_PER_DB |
25 |
Questions per database (25 for quick runs, 40 for full) |
DEFAULT_SCHEMA_EXAMPLES |
5 |
Sample rows per table in schema context (3–5) |
evaluation/synthetic/results/
├── questions.json ← Gold questions with expected results
├── data_manifest.json ← Seeded database snapshot (row counts)
└── runs/ ← Archived results per model
├── llama-4-scout-17b-16e-instruct/
│ ├── summary.json ← Aggregated scores (overall, by difficulty, by DB)
│ ├── results_config_a.json ← Full pipeline results (retryCount=5)
│ ├── results_config_b.json ← Single-shot results (retryCount=1)
│ ├── results_config_c.json ← Generation-only results (no execution)
│ ├── metrics_config_a.json ← Resource metrics for Config A
│ ├── metrics_config_b.json ← Resource metrics for Config B
│ └── metrics_config_c.json ← Resource metrics for Config C
├── llama-3.3-70b-versatile/
│ └── ...
└── ...
Each run is self-contained: the data_manifest.json records the exact database state so results are reproducible even if the Faker seed changes.
| Module | Location | Description |
|---|---|---|
resultComparator.py |
evaluation/shared/ |
Normalized comparison of predicted vs gold results |
resourceMetrics.py |
evaluation/shared/ |
Wall time, memory, throughput, latency percentiles |
entrypoint.py |
evaluation/synthetic/ |
Orchestrates: seed → questions → evaluate → score |
evalRunner.py |
evaluation/synthetic/ |
3-config ablation runner with parallel execution |
evalConfig.py |
evaluation/synthetic/ |
Environment-driven configuration and connection helpers |
questionRunner.py |
evaluation/synthetic/ |
Executes gold queries to produce reference results |
seedData.py |
evaluation/synthetic/ |
Creates databases, applies schemas, populates with Faker data |
schemaDefinitions.py |
evaluation/synthetic/ |
Inline DDL for the three evaluation databases |
scoreReport.py |
evaluation/synthetic/ |
Generates summary tables (overall, by difficulty, by database) |
- BIRD Benchmark — Real-world evaluation against the BIRD mini-dev dataset
📄 Paper: arXiv:2604.16511 | 📊 Dataset: Hugging Face | 💻 Source: GitHub
SQL Query Engine
Design
Setup
API
Internals
Evaluation
Help