Skip to content

BIRD Evaluation

codeadeel edited this page Apr 7, 2026 · 1 revision

SQL Query Engine has been benchmarked against the BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) dataset, one of the most widely used and challenging NL-to-SQL benchmarks in the research community. This page covers the evaluation methodology, results, and how to reproduce the benchmark.

For the synthetic evaluation on controlled databases, see Evaluation.

What Is BIRD

BIRD is a cross-domain text-to-SQL benchmark containing 12,751 question-SQL pairs across 95 real-world databases spanning 37 professional domains (healthcare, finance, sports, education, etc.). Unlike earlier benchmarks like Spider, BIRD emphasizes real-world database complexity: messy data, ambiguous column names, and external knowledge requirements.

We evaluate against the BIRD mini-dev subset: 500 curated questions across 11 databases.

BIRD Databases Used

Database Domain Questions
california_schools Education Public school performance data
card_games Entertainment Trading card game database
codebase_community Software Stack Overflow-style Q&A data
debit_card_specializing Finance Banking transaction records
european_football_2 Sports European football match data
financial Finance Czech bank financial data
formula_1 Sports Formula 1 racing statistics
student_club Education University student organizations
superhero Entertainment Superhero character database
thrombosis_prediction Healthcare Medical diagnostic records
toxicology Science Chemical toxicology data

SQLite to PostgreSQL Migration

BIRD databases are distributed as SQLite files. Since SQL Query Engine runs against PostgreSQL, the evaluation pipeline includes an automatic migration layer that converts each BIRD database at runtime.

What Gets Converted

Schema conversion (sqliteToPostgres.py):

  • Type affinity mapping: INTEGERBIGINT, REALDOUBLE PRECISION, TEXTTEXT, BLOBTEXT
  • Foreign key dependencies processed in topological order
  • Identifier names lowercased (PostgreSQL convention)
  • Bulk data insertion with batch sizes of 1,000 rows
  • Sequence counters reset after bulk insert

Gold SQL dialect conversion (birdDataLoader.py):

The pipeline converts gold SQL queries from SQLite dialect to PostgreSQL. This involves 14 transformation rules:

  1. Backtick identifiers → double quotes
  2. IIF(cond, t, f)CASE WHEN cond THEN t ELSE f END
  3. STRFTIME('%Y', col)EXTRACT(YEAR FROM col::TEXT::TIMESTAMP)::TEXT
  4. JULIANDAY() → PostgreSQL date arithmetic
  5. GROUP_CONCAT(x)STRING_AGG(x::TEXT, ',')
  6. SUBSTR()SUBSTRING()
  7. IFNULL()COALESCE()
  8. INSTR(a, b)POSITION(b IN a)
  9. LIKEILIKE (case-insensitive matching)
  10. DATE('now')CURRENT_DATE
  11. DATETIME('now')NOW()
  12. CAST(x AS FLOAT)CAST(x AS DOUBLE PRECISION)
  13. CAST(x AS REAL)CAST(x AS NUMERIC)
  14. LIMIT offset, countLIMIT count OFFSET offset

Conversion Error Rate

Not all gold queries convert cleanly. Out of 500 mini-dev questions, 63 (12.6%) fail conversion due to semantic differences between SQLite and PostgreSQL (text-based date arithmetic, multi-word unquoted column names, PostgreSQL's stricter GROUP BY requirements). These are excluded from accuracy calculations and reported separately.

Difficulty Conversion Errors
Simple 6
Moderate 35
Challenging 22

The most affected databases are thrombosis_prediction (20 errors) and california_schools (18 errors), which rely heavily on SQLite-specific date functions and column naming conventions.

Benchmark Results

Five LLM backends were tested on the BIRD mini-dev set (437 evaluated questions after excluding 63 conversion errors). All runs use temperature 0.1 with evidence hints enabled (BIRD_USE_EVIDENCE=true).

Overall Execution Accuracy

Model Config C Config B Config A Delta (A−C) Healed Regressions
GPT-OSS 120B 44.4% 42.6% 49.0% +4.6pp 39 19
Llama 3.3 70B 43.7% 38.0% 46.5% +2.8pp 35 23
Llama 4 Scout 17B 37.1% 34.3% 40.5% +3.4pp 35 20
GPT-OSS 20B 43.5% 39.8% 43.2% −0.3pp 26 24
Qwen3 32B 40.7% 41.6% 39.4% −1.3pp 31 17

Accuracy by Difficulty (GPT-OSS 120B, Config A)

Difficulty Config C Config A Delta
Simple 54.2% 59.9% +5.7pp
Moderate 44.7% 48.8% +4.1pp
Challenging 26.2% 30.0% +3.8pp

Accuracy by Database (GPT-OSS 120B, Config A)

Database Accuracy
student_club 73.3%
superhero 63.5%
toxicology 55.0%
european_football_2 50.0%
codebase_community 46.8%
formula_1 45.8%
debit_card_specializing 42.3%
california_schools 41.7%
financial 37.0%
thrombosis_prediction 36.7%
card_games 31.4%

Self-Healing Analysis

Healing Breakdown (All 5 Models, Config A)

Metric GPT-OSS 120B Llama 3.3 70B Scout 17B GPT-OSS 20B Qwen3 32B
Correct on 1st attempt 175 168 142 163 141
Fixed by healing 39 35 35 26 31
Exhausted retries 204 211 238 206 184
Errors/timeouts 0 0 2 18 64
Regressions 19 23 20 24 17

Key Findings

Self-healing is effective on real-world data, but with more regressions than synthetic data. On synthetic databases, the best model (Llama Scout 17B) achieved +9.3pp with zero regressions. On BIRD, the best model (GPT-OSS 120B) achieves +4.6pp but with 19 regressions. This is expected: BIRD databases have messier schemas, more ambiguous column names, and more complex query patterns that increase the risk of over-correction.

Config B (single-shot) often underperforms Config C. On BIRD, most models do worse with a single repair attempt than with no repair at all. This suggests that a single error-feedback cycle is insufficient for complex real-world queries; the self-healing loop needs multiple iterations (Config A) to be beneficial.

Model size correlates with BIRD accuracy but not with synthetic accuracy. GPT-OSS 120B leads on BIRD (49.0%) while trailing on synthetic data (48.0%). Llama Scout 17B leads on synthetic (57.3%) but scores lower on BIRD (40.5%). Larger models better handle the complexity and ambiguity of real-world schemas.

Timeout rates vary significantly. Qwen3 32B has 64 timeouts (14.6% of questions), likely due to its longer reasoning chains hitting the 180-second timeout. GPT-OSS 120B and Llama 3.3 70B have zero timeouts.

Comparison with Published Baselines

For context, here are published zero-shot execution accuracy results on the BIRD dev set:

System EX Accuracy Notes
ChatGPT (zero-shot) 40.08% Official BIRD paper baseline
GPT-4 (zero-shot) 46.35% Official BIRD paper baseline
GPT-4 + evidence 54.89% With BIRD evidence hints
DIN-SQL + GPT-4 55.9% Multi-stage pipeline
CHASE-SQL (SOTA) 73.01% Multi-agent ensemble
Ours (GPT-OSS 120B, Config A) 49.0% With evidence, single-LLM pipeline
Ours (Llama 3.3 70B, Config A) 46.5% With evidence, single-LLM pipeline

Our results are competitive with published zero-shot baselines while using a lightweight, open-source, single-LLM pipeline. The self-healing delta (+2.8pp to +4.6pp on BIRD) demonstrates that execution-grounded iterative repair adds measurable value on real-world databases.

Resource Metrics

GPT-OSS 120B (Best BIRD Accuracy)

Metric Config C Config B Config A
Wall time 1,177s 1,816s 1,972s
Peak memory 103.1 MB 103.1 MB 103.1 MB
Throughput 22.3 q/min 14.4 q/min 13.3 q/min
p95 latency 10.37s 18.67s 17.83s

Llama 4 Scout 17B

Metric Config C Config B Config A
Wall time 1,165s 2,191s 2,068s
Peak memory 106.6 MB 106.6 MB 106.6 MB
Throughput 22.5 q/min 12.0 q/min 12.7 q/min
p95 latency 11.45s 19.98s 19.67s

Running the BIRD Evaluation

1. Download the BIRD Dataset

Place the BIRD mini-dev dataset at evaluation/bird/bird_data/:

evaluation/bird/bird_data/
├── mini_dev_sqlite.json              # 500 questions (auto-download attempted)
└── dev_databases/
    ├── california_schools/
    │   └── california_schools.sqlite
    ├── card_games/
    │   └── card_games.sqlite
    ├── ...
    └── toxicology/
        └── toxicology.sqlite

The dataset can be downloaded from:

If the question JSON is missing, the runner attempts automatic download via the HuggingFace datasets library. The SQLite database files must be placed manually.

2. Configure and Launch

cd sqlqueryengine

# 1. Set LLM credentials in docker-compose-bird-evaluation.yml
#    Edit: LLM_BASE_URL, LLM_MODEL, LLM_API_KEY

# 2. Build and launch
docker compose -f docker-compose-bird-evaluation.yml build
docker compose -f docker-compose-bird-evaluation.yml up -d

# 3. Monitor progress
docker logs bird-runner --tail 10 -f

# 4. Results appear in evaluation/bird/bird_results/

The BIRD stack starts four services: PostgreSQL (with 2 GB shared memory for large databases), Redis, the SQL Query Engine, and the BIRD evaluation runner.

3. Pipeline Execution Flow

1. Wait for PostgreSQL
2. Load BIRD dataset (birdDataLoader.py)
3. Migrate SQLite databases → PostgreSQL (sqliteToPostgres.py)
4. Execute converted gold queries, save bird_results/questions.json
5. Wait for engine
6. Run 3-config ablation (birdEvalRunner.py)
7. Generate score report (birdScoreReport.py)

Expected runtime: 2-4 hours for 500 questions across 3 configurations, depending on LLM speed and rate limits.

Configuration Reference

Variable Default Description
LLM_BASE_URL OpenAI-compatible endpoint
LLM_MODEL Model ID to benchmark
LLM_API_KEY your-api-key-here API key (replace before running)
BIRD_DATASET mini Dataset variant: mini (500 questions) or full (1,534)
BIRD_USE_EVIDENCE true Include BIRD evidence hints in prompts
EVAL_MAX_WORKERS 4 Parallel question workers (tune based on rate limits)
TIMEOUT_SECONDS 180 Per-request timeout in seconds

Results Directory Structure

evaluation/bird/bird_results/
├── questions.json                     ← Converted questions with PostgreSQL gold results
└── runs/                              ← Archived results per model
    ├── llama-4-scout-17b-16e-instruct/
    │   ├── summary.json               ← All metrics (accuracy, healing, resources)
    │   ├── results_config_a.json      ← Full pipeline per-question results
    │   ├── results_config_b.json      ← Single-shot per-question results
    │   ├── results_config_c.json      ← Generation-only per-question results
    │   ├── metrics_config_a.json      ← Resource metrics for Config A
    │   ├── metrics_config_b.json      ← Resource metrics for Config B
    │   ├── metrics_config_c.json      ← Resource metrics for Config C
    │   ├── predictions.txt            ← Official BIRD submission format
    │   └── conversion_report.json     ← SQLite→PostgreSQL conversion stats
    ├── gpt-oss-120b/
    │   └── ...
    └── ...

Module Map

Module Description
birdEntrypoint.py Orchestrates: load → migrate → evaluate → score
birdDataLoader.py Loads BIRD questions, converts gold SQL dialect (14 transformation rules)
sqliteToPostgres.py Introspects SQLite schemas, creates PostgreSQL databases, bulk-inserts data
birdEvalRunner.py 3-config ablation runner with per-database schema warmup and parallel execution
birdScoreReport.py Generates summary tables, comparison with published baselines, official prediction file
birdConfig.py Environment-driven configuration for the BIRD pipeline

See Also

  • Evaluation — Synthetic evaluation on controlled databases
  • Architecture — System design and self-healing pipeline details

SQL Query Engine

Design

Setup

API

Internals

Evaluation

Help

Clone this wiki locally