Evaluation

SQL Query Engine includes two independent evaluation pipelines that measure the impact of the self-healing loop on query accuracy. Both use a 3-config ablation study tested across 5 LLM backends.

Synthetic Evaluation (this page) — 120 questions across 3 controlled PostgreSQL databases with reproducible Faker-generated data
BIRD Benchmark — 500 real-world questions from the BIRD mini-dev set across 11 databases, with automatic SQLite-to-PostgreSQL migration

What This Evaluates

The central claim of SQL Query Engine is that its self-healing loop — where PostgreSQL diagnostics (SQLSTATE codes, hints, tracebacks) are fed back to the LLM for iterative repair — improves query accuracy over raw single-shot generation. The evaluation pipelines quantify that claim.

Ablation Configurations

Both pipelines execute three ablation configurations:

Config	retryCount	What It Tests
Config C	0 (generation only)	Baseline — raw LLM SQL generation quality with no execution feedback
Config B	1 (single attempt)	Single-shot evaluation — can the LLM fix a query with one try?
Config A	5 (full pipeline)	Full self-healing loop — iterative repair with execution feedback

All three configurations share the same schema context and LLM model. The only difference is how many repair iterations are allowed. The delta between Config A and Config C is the headline result — it directly measures how many initially-failing queries the self-healing loop recovers.

Evaluation Framework

evaluation/
├── shared/                          # Shared utilities
│   ├── resultComparator.py          # Order-independent result comparison
│   └── resourceMetrics.py           # Wall time, memory, throughput tracking
├── synthetic/                       # Synthetic evaluation (this page)
│   ├── entrypoint.py                # Pipeline orchestrator
│   ├── evalRunner.py                # 3-config ablation runner
│   ├── evalConfig.py                # Environment-driven config
│   ├── seedData.py                  # Database seeding (Faker)
│   ├── schemaDefinitions.py         # DDL for 3 databases
│   ├── questionRunner.py            # Gold query executor
│   ├── scoreReport.py               # Summary tables + summary.json
│   ├── questions/                   # 120 gold questions (40 per DB)
│   └── results/runs/                # Archived results per model
└── bird/                            # BIRD benchmark (see BIRD Evaluation page)
    ├── birdEntrypoint.py            # Pipeline orchestrator
    ├── birdDataLoader.py            # Dataset loading + dialect conversion
    ├── sqliteToPostgres.py          # SQLite → PostgreSQL migration
    ├── birdEvalRunner.py            # 3-config ablation runner
    ├── birdScoreReport.py           # BIRD-specific scoring
    └── bird_results/runs/           # Archived results per model

Synthetic Evaluation

Databases and Questions

The synthetic pipeline seeds three PostgreSQL databases using Faker with a fixed random seed for reproducibility:

Database	Domain	Tables	Questions
`eval_ecommerce`	E-commerce (customers, orders, products, payments, reviews)	7	40
`eval_hospital`	Hospital management (patients, doctors, appointments, prescriptions, lab results)	8	40
`eval_university`	University (students, courses, enrollments, faculty, departments)	7	40

Each database has 40 gold-annotated questions across 4 difficulty tiers: easy (simple SELECT/COUNT), medium (JOINs, GROUP BY), hard (subqueries, CTEs, window functions), and extra_hard (multi-step aggregations, conditional logic).

Synthetic Benchmark Results

Five LLM backends were tested across all three databases and configurations (75 questions per run, 25 per database):

Model	Config C	Config B	Config A	Delta (A−C)	Healed	Regressions
Llama 4 Scout 17B	48.0%	50.7%	57.3%	+9.3pp	7	0
GPT-OSS 20B	45.3%	45.3%	53.3%	+8.0pp	6	0
Llama 3.3 70B	53.3%	54.7%	54.7%	+1.4pp	3	2
GPT-OSS 120B	50.7%	49.3%	48.0%	−2.7pp	2	4
Qwen3 32B	48.0%	42.7%	46.7%	−1.3pp	5	6

Key findings:

Llama 4 Scout 17B is the best model on synthetic data: highest Config A accuracy (57.3%), largest healing delta (+9.3pp), and zero regressions.
Self-healing provides up to +9.3pp improvement for models that handle it well.
Regressions occur when the LLM rewrites a working query into a broken one. The engine's early-accept design (accept on execution success without an LLM call) mitigates this, but some models still regress on queries that return rows with incorrect content.
GPT-OSS 120B shows a negative delta (−2.7pp) with 4 regressions, demonstrating that larger models don't always perform better at self-healing.

Accuracy by Difficulty (Llama 4 Scout 17B)

Difficulty	Config C	Config A	Delta
Easy	100.0%	100.0%	—
Medium	77.8%	77.8%	—
Hard	5.6%	44.4%	+38.8pp
Extra Hard	0.0%	0.0%	—

Self-healing provides the largest gains on hard-difficulty queries, where initial generation often fails but iterative repair can recover the correct result.

Resource Metrics (Llama 4 Scout 17B)

Metric	Config C	Config B	Config A
Wall time	70.6s	135.3s	166.5s
Peak memory	62.1 MB	64.8 MB	64.9 MB
Throughput	63.8 q/min	33.3 q/min	27.0 q/min
p95 latency	2.72s	7.08s	9.90s

Result Comparison Logic

The evaluation compares predicted query results against gold query results using normalized comparison:

Execute both the predicted and gold queries against the same database
Sort rows deterministically (by all columns)
Normalize values: collapse numeric trailing zeros (100.0 → 100), strip whitespace
Compare row-by-row — an exact match counts as correct

This logic lives in evaluation/shared/resultComparator.py and is shared by both the synthetic and BIRD pipelines.

Running the Synthetic Evaluation

cd sqlqueryengine

# 1. Set LLM credentials in docker-compose-synthetic-evaluation.yml
#    Edit: LLM_BASE_URL, LLM_MODEL, LLM_API_KEY

# 2. Build and launch
docker compose -f docker-compose-synthetic-evaluation.yml build
docker compose -f docker-compose-synthetic-evaluation.yml up -d

# 3. Monitor progress
docker logs eval-runner --tail 10 -f

# 4. Results appear in evaluation/synthetic/results/

The evaluation stack starts four services: PostgreSQL (seeded automatically), Redis, the SQL Query Engine, and the evaluation runner.

Configuration Reference

Variable	Default	Description
`LLM_BASE_URL`	—	OpenAI-compatible endpoint (Groq, vLLM, Ollama, etc.)
`LLM_MODEL`	—	Model ID to benchmark
`LLM_API_KEY`	`your-api-key-here`	API key (replace before running)
`EVAL_MAX_WORKERS`	`3`	Parallel question count (1–6, depends on rate limits)
`QUESTIONS_PER_DB`	`25`	Questions per database (25 for quick runs, 40 for full)
`DEFAULT_SCHEMA_EXAMPLES`	`5`	Sample rows per table in schema context (3–5)

Results Directory Structure

evaluation/synthetic/results/
├── questions.json                     ← Gold questions with expected results
├── data_manifest.json                 ← Seeded database snapshot (row counts)
└── runs/                              ← Archived results per model
    ├── llama-4-scout-17b-16e-instruct/
    │   ├── summary.json               ← Aggregated scores (overall, by difficulty, by DB)
    │   ├── results_config_a.json      ← Full pipeline results (retryCount=5)
    │   ├── results_config_b.json      ← Single-shot results (retryCount=1)
    │   ├── results_config_c.json      ← Generation-only results (no execution)
    │   ├── metrics_config_a.json      ← Resource metrics for Config A
    │   ├── metrics_config_b.json      ← Resource metrics for Config B
    │   └── metrics_config_c.json      ← Resource metrics for Config C
    ├── llama-3.3-70b-versatile/
    │   └── ...
    └── ...

Each run is self-contained: the data_manifest.json records the exact database state so results are reproducible even if the Faker seed changes.

Module Map

Module	Location	Description
`resultComparator.py`	`evaluation/shared/`	Normalized comparison of predicted vs gold results
`resourceMetrics.py`	`evaluation/shared/`	Wall time, memory, throughput, latency percentiles
`entrypoint.py`	`evaluation/synthetic/`	Orchestrates: seed → questions → evaluate → score
`evalRunner.py`	`evaluation/synthetic/`	3-config ablation runner with parallel execution
`evalConfig.py`	`evaluation/synthetic/`	Environment-driven configuration and connection helpers
`questionRunner.py`	`evaluation/synthetic/`	Executes gold queries to produce reference results
`seedData.py`	`evaluation/synthetic/`	Creates databases, applies schemas, populates with Faker data
`schemaDefinitions.py`	`evaluation/synthetic/`	Inline DDL for the three evaluation databases
`scoreReport.py`	`evaluation/synthetic/`	Generates summary tables (overall, by difficulty, by database)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluation

What This Evaluates

Ablation Configurations

Evaluation Framework

Synthetic Evaluation

Databases and Questions

Synthetic Benchmark Results

Accuracy by Difficulty (Llama 4 Scout 17B)

Resource Metrics (Llama 4 Scout 17B)

Result Comparison Logic

Running the Synthetic Evaluation

Configuration Reference

Results Directory Structure

Module Map

See Also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally