BIRD Evaluation

SQL Query Engine has been benchmarked against the BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) dataset, one of the most widely used and challenging NL-to-SQL benchmarks in the research community. This page covers the evaluation methodology, results, and how to reproduce the benchmark.

For the synthetic evaluation on controlled databases, see Evaluation.

What Is BIRD

BIRD is a cross-domain text-to-SQL benchmark containing 12,751 question-SQL pairs across 95 real-world databases spanning 37 professional domains (healthcare, finance, sports, education, etc.). Unlike earlier benchmarks like Spider, BIRD emphasizes real-world database complexity: messy data, ambiguous column names, and external knowledge requirements.

We evaluate against the BIRD mini-dev subset: 500 curated questions across 11 databases.

BIRD Databases Used

Database	Domain	Questions
`california_schools`	Education	Public school performance data
`card_games`	Entertainment	Trading card game database
`codebase_community`	Software	Stack Overflow-style Q&A data
`debit_card_specializing`	Finance	Banking transaction records
`european_football_2`	Sports	European football match data
`financial`	Finance	Czech bank financial data
`formula_1`	Sports	Formula 1 racing statistics
`student_club`	Education	University student organizations
`superhero`	Entertainment	Superhero character database
`thrombosis_prediction`	Healthcare	Medical diagnostic records
`toxicology`	Science	Chemical toxicology data

SQLite to PostgreSQL Migration

BIRD databases are distributed as SQLite files. Since SQL Query Engine runs against PostgreSQL, the evaluation pipeline includes an automatic migration layer that converts each BIRD database at runtime.

What Gets Converted

Schema conversion (sqliteToPostgres.py):

Type affinity mapping: INTEGER → BIGINT, REAL → DOUBLE PRECISION, TEXT → TEXT, BLOB → TEXT
Foreign key dependencies processed in topological order
Identifier names lowercased (PostgreSQL convention)
Bulk data insertion with batch sizes of 1,000 rows
Sequence counters reset after bulk insert

Gold SQL dialect conversion (birdDataLoader.py):

The pipeline converts gold SQL queries from SQLite dialect to PostgreSQL. This involves 14 transformation rules:

Backtick identifiers → double quotes
IIF(cond, t, f) → CASE WHEN cond THEN t ELSE f END
STRFTIME('%Y', col) → EXTRACT(YEAR FROM col::TEXT::TIMESTAMP)::TEXT
JULIANDAY() → PostgreSQL date arithmetic
GROUP_CONCAT(x) → STRING_AGG(x::TEXT, ',')
SUBSTR() → SUBSTRING()
IFNULL() → COALESCE()
INSTR(a, b) → POSITION(b IN a)
LIKE → ILIKE (case-insensitive matching)
DATE('now') → CURRENT_DATE
DATETIME('now') → NOW()
CAST(x AS FLOAT) → CAST(x AS DOUBLE PRECISION)
CAST(x AS REAL) → CAST(x AS NUMERIC)
LIMIT offset, count → LIMIT count OFFSET offset

Conversion Error Rate

Not all gold queries convert cleanly. Out of 500 mini-dev questions, 63 (12.6%) fail conversion due to semantic differences between SQLite and PostgreSQL (text-based date arithmetic, multi-word unquoted column names, PostgreSQL's stricter GROUP BY requirements). These are excluded from accuracy calculations and reported separately.

Difficulty	Conversion Errors
Simple	6
Moderate	35
Challenging	22

The most affected databases are thrombosis_prediction (20 errors) and california_schools (18 errors), which rely heavily on SQLite-specific date functions and column naming conventions.

Benchmark Results

Five LLM backends were tested on the BIRD mini-dev set (437 evaluated questions after excluding 63 conversion errors). All runs use temperature 0.1 with evidence hints enabled (BIRD_USE_EVIDENCE=true).

Overall Execution Accuracy

Model	Config C	Config B	Config A	Delta (A−C)	Healed	Regressions
GPT-OSS 120B	44.4%	42.6%	49.0%	+4.6pp	39	19
Llama 3.3 70B	43.7%	38.0%	46.5%	+2.8pp	35	23
Llama 4 Scout 17B	37.1%	34.3%	40.5%	+3.4pp	35	20
GPT-OSS 20B	43.5%	39.8%	43.2%	−0.3pp	26	24
Qwen3 32B	40.7%	41.6%	39.4%	−1.3pp	31	17

Accuracy by Difficulty (GPT-OSS 120B, Config A)

Difficulty	Config C	Config A	Delta
Simple	54.2%	59.9%	+5.7pp
Moderate	44.7%	48.8%	+4.1pp
Challenging	26.2%	30.0%	+3.8pp

Accuracy by Database (GPT-OSS 120B, Config A)

Database	Accuracy
student_club	73.3%
superhero	63.5%
toxicology	55.0%
european_football_2	50.0%
codebase_community	46.8%
formula_1	45.8%
debit_card_specializing	42.3%
california_schools	41.7%
financial	37.0%
thrombosis_prediction	36.7%
card_games	31.4%

Self-Healing Analysis

Healing Breakdown (All 5 Models, Config A)

Metric	GPT-OSS 120B	Llama 3.3 70B	Scout 17B	GPT-OSS 20B	Qwen3 32B
Correct on 1st attempt	175	168	142	163	141
Fixed by healing	39	35	35	26	31
Exhausted retries	204	211	238	206	184
Errors/timeouts	0	0	2	18	64
Regressions	19	23	20	24	17

Key Findings

Self-healing is effective on real-world data, but with more regressions than synthetic data. On synthetic databases, the best model (Llama Scout 17B) achieved +9.3pp with zero regressions. On BIRD, the best model (GPT-OSS 120B) achieves +4.6pp but with 19 regressions. This is expected: BIRD databases have messier schemas, more ambiguous column names, and more complex query patterns that increase the risk of over-correction.

Config B (single-shot) often underperforms Config C. On BIRD, most models do worse with a single repair attempt than with no repair at all. This suggests that a single error-feedback cycle is insufficient for complex real-world queries; the self-healing loop needs multiple iterations (Config A) to be beneficial.

Model size correlates with BIRD accuracy but not with synthetic accuracy. GPT-OSS 120B leads on BIRD (49.0%) while trailing on synthetic data (48.0%). Llama Scout 17B leads on synthetic (57.3%) but scores lower on BIRD (40.5%). Larger models better handle the complexity and ambiguity of real-world schemas.

Timeout rates vary significantly. Qwen3 32B has 64 timeouts (14.6% of questions), likely due to its longer reasoning chains hitting the 180-second timeout. GPT-OSS 120B and Llama 3.3 70B have zero timeouts.

Comparison with Published Baselines

For context, here are published zero-shot execution accuracy results on the BIRD dev set:

System	EX Accuracy	Notes
ChatGPT (zero-shot)	40.08%	Official BIRD paper baseline
GPT-4 (zero-shot)	46.35%	Official BIRD paper baseline
GPT-4 + evidence	54.89%	With BIRD evidence hints
DIN-SQL + GPT-4	55.9%	Multi-stage pipeline
CHASE-SQL (SOTA)	73.01%	Multi-agent ensemble
Ours (GPT-OSS 120B, Config A)	49.0%	With evidence, single-LLM pipeline
Ours (Llama 3.3 70B, Config A)	46.5%	With evidence, single-LLM pipeline

Our results are competitive with published zero-shot baselines while using a lightweight, open-source, single-LLM pipeline. The self-healing delta (+2.8pp to +4.6pp on BIRD) demonstrates that execution-grounded iterative repair adds measurable value on real-world databases.

Resource Metrics

GPT-OSS 120B (Best BIRD Accuracy)

Metric	Config C	Config B	Config A
Wall time	1,177s	1,816s	1,972s
Peak memory	103.1 MB	103.1 MB	103.1 MB
Throughput	22.3 q/min	14.4 q/min	13.3 q/min
p95 latency	10.37s	18.67s	17.83s

Llama 4 Scout 17B

Metric	Config C	Config B	Config A
Wall time	1,165s	2,191s	2,068s
Peak memory	106.6 MB	106.6 MB	106.6 MB
Throughput	22.5 q/min	12.0 q/min	12.7 q/min
p95 latency	11.45s	19.98s	19.67s

Running the BIRD Evaluation

1. Download the BIRD Dataset

Place the BIRD mini-dev dataset at evaluation/bird/bird_data/:

evaluation/bird/bird_data/
├── mini_dev_sqlite.json              # 500 questions (auto-download attempted)
└── dev_databases/
    ├── california_schools/
    │   └── california_schools.sqlite
    ├── card_games/
    │   └── card_games.sqlite
    ├── ...
    └── toxicology/
        └── toxicology.sqlite

The dataset can be downloaded from:

If the question JSON is missing, the runner attempts automatic download via the HuggingFace datasets library. The SQLite database files must be placed manually.

2. Configure and Launch

cd sqlqueryengine

# 1. Set LLM credentials in docker-compose-bird-evaluation.yml
#    Edit: LLM_BASE_URL, LLM_MODEL, LLM_API_KEY

# 2. Build and launch
docker compose -f docker-compose-bird-evaluation.yml build
docker compose -f docker-compose-bird-evaluation.yml up -d

# 3. Monitor progress
docker logs bird-runner --tail 10 -f

# 4. Results appear in evaluation/bird/bird_results/

The BIRD stack starts four services: PostgreSQL (with 2 GB shared memory for large databases), Redis, the SQL Query Engine, and the BIRD evaluation runner.

3. Pipeline Execution Flow

1. Wait for PostgreSQL
2. Load BIRD dataset (birdDataLoader.py)
3. Migrate SQLite databases → PostgreSQL (sqliteToPostgres.py)
4. Execute converted gold queries, save bird_results/questions.json
5. Wait for engine
6. Run 3-config ablation (birdEvalRunner.py)
7. Generate score report (birdScoreReport.py)

Expected runtime: 2-4 hours for 500 questions across 3 configurations, depending on LLM speed and rate limits.

Configuration Reference

Variable	Default	Description
`LLM_BASE_URL`	—	OpenAI-compatible endpoint
`LLM_MODEL`	—	Model ID to benchmark
`LLM_API_KEY`	`your-api-key-here`	API key (replace before running)
`BIRD_DATASET`	`mini`	Dataset variant: `mini` (500 questions) or `full` (1,534)
`BIRD_USE_EVIDENCE`	`true`	Include BIRD evidence hints in prompts
`EVAL_MAX_WORKERS`	`4`	Parallel question workers (tune based on rate limits)
`TIMEOUT_SECONDS`	`180`	Per-request timeout in seconds

Results Directory Structure

evaluation/bird/bird_results/
├── questions.json                     ← Converted questions with PostgreSQL gold results
└── runs/                              ← Archived results per model
    ├── llama-4-scout-17b-16e-instruct/
    │   ├── summary.json               ← All metrics (accuracy, healing, resources)
    │   ├── results_config_a.json      ← Full pipeline per-question results
    │   ├── results_config_b.json      ← Single-shot per-question results
    │   ├── results_config_c.json      ← Generation-only per-question results
    │   ├── metrics_config_a.json      ← Resource metrics for Config A
    │   ├── metrics_config_b.json      ← Resource metrics for Config B
    │   ├── metrics_config_c.json      ← Resource metrics for Config C
    │   ├── predictions.txt            ← Official BIRD submission format
    │   └── conversion_report.json     ← SQLite→PostgreSQL conversion stats
    ├── gpt-oss-120b/
    │   └── ...
    └── ...

Module Map

Module	Description
`birdEntrypoint.py`	Orchestrates: load → migrate → evaluate → score
`birdDataLoader.py`	Loads BIRD questions, converts gold SQL dialect (14 transformation rules)
`sqliteToPostgres.py`	Introspects SQLite schemas, creates PostgreSQL databases, bulk-inserts data
`birdEvalRunner.py`	3-config ablation runner with per-database schema warmup and parallel execution
`birdScoreReport.py`	Generates summary tables, comparison with published baselines, official prediction file
`birdConfig.py`	Environment-driven configuration for the BIRD pipeline

BIRD Evaluation

What Is BIRD

BIRD Databases Used

SQLite to PostgreSQL Migration

What Gets Converted

Conversion Error Rate

Benchmark Results

Overall Execution Accuracy

Accuracy by Difficulty (GPT-OSS 120B, Config A)

Accuracy by Database (GPT-OSS 120B, Config A)

Self-Healing Analysis

Healing Breakdown (All 5 Models, Config A)

Key Findings

Comparison with Published Baselines

Resource Metrics

GPT-OSS 120B (Best BIRD Accuracy)

Llama 4 Scout 17B

Running the BIRD Evaluation

1. Download the BIRD Dataset

2. Configure and Launch

3. Pipeline Execution Flow

Configuration Reference

Results Directory Structure

Module Map

See Also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally