Skip to content

Evaluation

codeadeel edited this page Apr 7, 2026 · 2 revisions

SQL Query Engine includes two independent evaluation pipelines that measure the impact of the self-healing loop on query accuracy. Both use a 3-config ablation study tested across 5 LLM backends.

  • Synthetic Evaluation (this page) — 120 questions across 3 controlled PostgreSQL databases with reproducible Faker-generated data
  • BIRD Benchmark — 500 real-world questions from the BIRD mini-dev set across 11 databases, with automatic SQLite-to-PostgreSQL migration

What This Evaluates

The central claim of SQL Query Engine is that its self-healing loop — where PostgreSQL diagnostics (SQLSTATE codes, hints, tracebacks) are fed back to the LLM for iterative repair — improves query accuracy over raw single-shot generation. The evaluation pipelines quantify that claim.

Ablation Configurations

Both pipelines execute three ablation configurations:

Config retryCount What It Tests
Config C 0 (generation only) Baseline — raw LLM SQL generation quality with no execution feedback
Config B 1 (single attempt) Single-shot evaluation — can the LLM fix a query with one try?
Config A 5 (full pipeline) Full self-healing loop — iterative repair with execution feedback

All three configurations share the same schema context and LLM model. The only difference is how many repair iterations are allowed. The delta between Config A and Config C is the headline result — it directly measures how many initially-failing queries the self-healing loop recovers.

Evaluation Framework

evaluation/
├── shared/                          # Shared utilities
│   ├── resultComparator.py          # Order-independent result comparison
│   └── resourceMetrics.py           # Wall time, memory, throughput tracking
├── synthetic/                       # Synthetic evaluation (this page)
│   ├── entrypoint.py                # Pipeline orchestrator
│   ├── evalRunner.py                # 3-config ablation runner
│   ├── evalConfig.py                # Environment-driven config
│   ├── seedData.py                  # Database seeding (Faker)
│   ├── schemaDefinitions.py         # DDL for 3 databases
│   ├── questionRunner.py            # Gold query executor
│   ├── scoreReport.py               # Summary tables + summary.json
│   ├── questions/                   # 120 gold questions (40 per DB)
│   └── results/runs/                # Archived results per model
└── bird/                            # BIRD benchmark (see BIRD Evaluation page)
    ├── birdEntrypoint.py            # Pipeline orchestrator
    ├── birdDataLoader.py            # Dataset loading + dialect conversion
    ├── sqliteToPostgres.py          # SQLite → PostgreSQL migration
    ├── birdEvalRunner.py            # 3-config ablation runner
    ├── birdScoreReport.py           # BIRD-specific scoring
    └── bird_results/runs/           # Archived results per model

Synthetic Evaluation

Databases and Questions

The synthetic pipeline seeds three PostgreSQL databases using Faker with a fixed random seed for reproducibility:

Database Domain Tables Questions
eval_ecommerce E-commerce (customers, orders, products, payments, reviews) 7 40
eval_hospital Hospital management (patients, doctors, appointments, prescriptions, lab results) 8 40
eval_university University (students, courses, enrollments, faculty, departments) 7 40

Each database has 40 gold-annotated questions across 4 difficulty tiers: easy (simple SELECT/COUNT), medium (JOINs, GROUP BY), hard (subqueries, CTEs, window functions), and extra_hard (multi-step aggregations, conditional logic).

Synthetic Benchmark Results

Five LLM backends were tested across all three databases and configurations (75 questions per run, 25 per database):

Model Config C Config B Config A Delta (A−C) Healed Regressions
Llama 4 Scout 17B 48.0% 50.7% 57.3% +9.3pp 7 0
GPT-OSS 20B 45.3% 45.3% 53.3% +8.0pp 6 0
Llama 3.3 70B 53.3% 54.7% 54.7% +1.4pp 3 2
GPT-OSS 120B 50.7% 49.3% 48.0% −2.7pp 2 4
Qwen3 32B 48.0% 42.7% 46.7% −1.3pp 5 6

Key findings:

  • Llama 4 Scout 17B is the best model on synthetic data: highest Config A accuracy (57.3%), largest healing delta (+9.3pp), and zero regressions.
  • Self-healing provides up to +9.3pp improvement for models that handle it well.
  • Regressions occur when the LLM rewrites a working query into a broken one. The engine's early-accept design (accept on execution success without an LLM call) mitigates this, but some models still regress on queries that return rows with incorrect content.
  • GPT-OSS 120B shows a negative delta (−2.7pp) with 4 regressions, demonstrating that larger models don't always perform better at self-healing.

Accuracy by Difficulty (Llama 4 Scout 17B)

Difficulty Config C Config A Delta
Easy 100.0% 100.0%
Medium 77.8% 77.8%
Hard 5.6% 44.4% +38.8pp
Extra Hard 0.0% 0.0%

Self-healing provides the largest gains on hard-difficulty queries, where initial generation often fails but iterative repair can recover the correct result.

Resource Metrics (Llama 4 Scout 17B)

Metric Config C Config B Config A
Wall time 70.6s 135.3s 166.5s
Peak memory 62.1 MB 64.8 MB 64.9 MB
Throughput 63.8 q/min 33.3 q/min 27.0 q/min
p95 latency 2.72s 7.08s 9.90s

Result Comparison Logic

The evaluation compares predicted query results against gold query results using normalized comparison:

  1. Execute both the predicted and gold queries against the same database
  2. Sort rows deterministically (by all columns)
  3. Normalize values: collapse numeric trailing zeros (100.0 → 100), strip whitespace
  4. Compare row-by-row — an exact match counts as correct

This logic lives in evaluation/shared/resultComparator.py and is shared by both the synthetic and BIRD pipelines.

Running the Synthetic Evaluation

cd sqlqueryengine

# 1. Set LLM credentials in docker-compose-synthetic-evaluation.yml
#    Edit: LLM_BASE_URL, LLM_MODEL, LLM_API_KEY

# 2. Build and launch
docker compose -f docker-compose-synthetic-evaluation.yml build
docker compose -f docker-compose-synthetic-evaluation.yml up -d

# 3. Monitor progress
docker logs eval-runner --tail 10 -f

# 4. Results appear in evaluation/synthetic/results/

The evaluation stack starts four services: PostgreSQL (seeded automatically), Redis, the SQL Query Engine, and the evaluation runner.

Configuration Reference

Variable Default Description
LLM_BASE_URL OpenAI-compatible endpoint (Groq, vLLM, Ollama, etc.)
LLM_MODEL Model ID to benchmark
LLM_API_KEY your-api-key-here API key (replace before running)
EVAL_MAX_WORKERS 3 Parallel question count (1–6, depends on rate limits)
QUESTIONS_PER_DB 25 Questions per database (25 for quick runs, 40 for full)
DEFAULT_SCHEMA_EXAMPLES 5 Sample rows per table in schema context (3–5)

Results Directory Structure

evaluation/synthetic/results/
├── questions.json                     ← Gold questions with expected results
├── data_manifest.json                 ← Seeded database snapshot (row counts)
└── runs/                              ← Archived results per model
    ├── llama-4-scout-17b-16e-instruct/
    │   ├── summary.json               ← Aggregated scores (overall, by difficulty, by DB)
    │   ├── results_config_a.json      ← Full pipeline results (retryCount=5)
    │   ├── results_config_b.json      ← Single-shot results (retryCount=1)
    │   ├── results_config_c.json      ← Generation-only results (no execution)
    │   ├── metrics_config_a.json      ← Resource metrics for Config A
    │   ├── metrics_config_b.json      ← Resource metrics for Config B
    │   └── metrics_config_c.json      ← Resource metrics for Config C
    ├── llama-3.3-70b-versatile/
    │   └── ...
    └── ...

Each run is self-contained: the data_manifest.json records the exact database state so results are reproducible even if the Faker seed changes.

Module Map

Module Location Description
resultComparator.py evaluation/shared/ Normalized comparison of predicted vs gold results
resourceMetrics.py evaluation/shared/ Wall time, memory, throughput, latency percentiles
entrypoint.py evaluation/synthetic/ Orchestrates: seed → questions → evaluate → score
evalRunner.py evaluation/synthetic/ 3-config ablation runner with parallel execution
evalConfig.py evaluation/synthetic/ Environment-driven configuration and connection helpers
questionRunner.py evaluation/synthetic/ Executes gold queries to produce reference results
seedData.py evaluation/synthetic/ Creates databases, applies schemas, populates with Faker data
schemaDefinitions.py evaluation/synthetic/ Inline DDL for the three evaluation databases
scoreReport.py evaluation/synthetic/ Generates summary tables (overall, by difficulty, by database)

See Also

  • BIRD Benchmark — Real-world evaluation against the BIRD mini-dev dataset

SQL Query Engine

Design

Setup

API

Internals

Evaluation

Help

Clone this wiki locally