-
Notifications
You must be signed in to change notification settings - Fork 1
BIRD Evaluation
SQL Query Engine has been benchmarked against the BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) dataset, one of the most widely used and challenging NL-to-SQL benchmarks in the research community. This page covers the evaluation methodology, results, and how to reproduce the benchmark.
For the synthetic evaluation on controlled databases, see Evaluation.
BIRD is a cross-domain text-to-SQL benchmark containing 12,751 question-SQL pairs across 95 real-world databases spanning 37 professional domains (healthcare, finance, sports, education, etc.). Unlike earlier benchmarks like Spider, BIRD emphasizes real-world database complexity: messy data, ambiguous column names, and external knowledge requirements.
We evaluate against the BIRD mini-dev subset: 500 curated questions across 11 databases.
| Database | Domain | Questions |
|---|---|---|
california_schools |
Education | Public school performance data |
card_games |
Entertainment | Trading card game database |
codebase_community |
Software | Stack Overflow-style Q&A data |
debit_card_specializing |
Finance | Banking transaction records |
european_football_2 |
Sports | European football match data |
financial |
Finance | Czech bank financial data |
formula_1 |
Sports | Formula 1 racing statistics |
student_club |
Education | University student organizations |
superhero |
Entertainment | Superhero character database |
thrombosis_prediction |
Healthcare | Medical diagnostic records |
toxicology |
Science | Chemical toxicology data |
BIRD databases are distributed as SQLite files. Since SQL Query Engine runs against PostgreSQL, the evaluation pipeline includes an automatic migration layer that converts each BIRD database at runtime.
Schema conversion (sqliteToPostgres.py):
- Type affinity mapping:
INTEGER→BIGINT,REAL→DOUBLE PRECISION,TEXT→TEXT,BLOB→TEXT - Foreign key dependencies processed in topological order
- Identifier names lowercased (PostgreSQL convention)
- Bulk data insertion with batch sizes of 1,000 rows
- Sequence counters reset after bulk insert
Gold SQL dialect conversion (birdDataLoader.py):
The pipeline converts gold SQL queries from SQLite dialect to PostgreSQL. This involves 14 transformation rules:
- Backtick identifiers → double quotes
-
IIF(cond, t, f)→CASE WHEN cond THEN t ELSE f END -
STRFTIME('%Y', col)→EXTRACT(YEAR FROM col::TEXT::TIMESTAMP)::TEXT -
JULIANDAY()→ PostgreSQL date arithmetic -
GROUP_CONCAT(x)→STRING_AGG(x::TEXT, ',') -
SUBSTR()→SUBSTRING() -
IFNULL()→COALESCE() -
INSTR(a, b)→POSITION(b IN a) -
LIKE→ILIKE(case-insensitive matching) -
DATE('now')→CURRENT_DATE -
DATETIME('now')→NOW() -
CAST(x AS FLOAT)→CAST(x AS DOUBLE PRECISION) -
CAST(x AS REAL)→CAST(x AS NUMERIC) -
LIMIT offset, count→LIMIT count OFFSET offset
Not all gold queries convert cleanly. Out of 500 mini-dev questions, 63 (12.6%) fail conversion due to semantic differences between SQLite and PostgreSQL (text-based date arithmetic, multi-word unquoted column names, PostgreSQL's stricter GROUP BY requirements). These are excluded from accuracy calculations and reported separately.
| Difficulty | Conversion Errors |
|---|---|
| Simple | 6 |
| Moderate | 35 |
| Challenging | 22 |
The most affected databases are thrombosis_prediction (20 errors) and california_schools (18 errors), which rely heavily on SQLite-specific date functions and column naming conventions.
Five LLM backends were tested on the BIRD mini-dev set (437 evaluated questions after excluding 63 conversion errors). All runs use temperature 0.1 with evidence hints enabled (BIRD_USE_EVIDENCE=true).
| Model | Config C | Config B | Config A | Delta (A−C) | Healed | Regressions |
|---|---|---|---|---|---|---|
| GPT-OSS 120B | 44.4% | 42.6% | 49.0% | +4.6pp | 39 | 19 |
| Llama 3.3 70B | 43.7% | 38.0% | 46.5% | +2.8pp | 35 | 23 |
| Llama 4 Scout 17B | 37.1% | 34.3% | 40.5% | +3.4pp | 35 | 20 |
| GPT-OSS 20B | 43.5% | 39.8% | 43.2% | −0.3pp | 26 | 24 |
| Qwen3 32B | 40.7% | 41.6% | 39.4% | −1.3pp | 31 | 17 |
| Difficulty | Config C | Config A | Delta |
|---|---|---|---|
| Simple | 54.2% | 59.9% | +5.7pp |
| Moderate | 44.7% | 48.8% | +4.1pp |
| Challenging | 26.2% | 30.0% | +3.8pp |
| Database | Accuracy |
|---|---|
| student_club | 73.3% |
| superhero | 63.5% |
| toxicology | 55.0% |
| european_football_2 | 50.0% |
| codebase_community | 46.8% |
| formula_1 | 45.8% |
| debit_card_specializing | 42.3% |
| california_schools | 41.7% |
| financial | 37.0% |
| thrombosis_prediction | 36.7% |
| card_games | 31.4% |
| Metric | GPT-OSS 120B | Llama 3.3 70B | Scout 17B | GPT-OSS 20B | Qwen3 32B |
|---|---|---|---|---|---|
| Correct on 1st attempt | 175 | 168 | 142 | 163 | 141 |
| Fixed by healing | 39 | 35 | 35 | 26 | 31 |
| Exhausted retries | 204 | 211 | 238 | 206 | 184 |
| Errors/timeouts | 0 | 0 | 2 | 18 | 64 |
| Regressions | 19 | 23 | 20 | 24 | 17 |
Self-healing is effective on real-world data, but with more regressions than synthetic data. On synthetic databases, the best model (Llama Scout 17B) achieved +9.3pp with zero regressions. On BIRD, the best model (GPT-OSS 120B) achieves +4.6pp but with 19 regressions. This is expected: BIRD databases have messier schemas, more ambiguous column names, and more complex query patterns that increase the risk of over-correction.
Config B (single-shot) often underperforms Config C. On BIRD, most models do worse with a single repair attempt than with no repair at all. This suggests that a single error-feedback cycle is insufficient for complex real-world queries; the self-healing loop needs multiple iterations (Config A) to be beneficial.
Model size correlates with BIRD accuracy but not with synthetic accuracy. GPT-OSS 120B leads on BIRD (49.0%) while trailing on synthetic data (48.0%). Llama Scout 17B leads on synthetic (57.3%) but scores lower on BIRD (40.5%). Larger models better handle the complexity and ambiguity of real-world schemas.
Timeout rates vary significantly. Qwen3 32B has 64 timeouts (14.6% of questions), likely due to its longer reasoning chains hitting the 180-second timeout. GPT-OSS 120B and Llama 3.3 70B have zero timeouts.
For context, here are published zero-shot execution accuracy results on the BIRD dev set:
| System | EX Accuracy | Notes |
|---|---|---|
| ChatGPT (zero-shot) | 40.08% | Official BIRD paper baseline |
| GPT-4 (zero-shot) | 46.35% | Official BIRD paper baseline |
| GPT-4 + evidence | 54.89% | With BIRD evidence hints |
| DIN-SQL + GPT-4 | 55.9% | Multi-stage pipeline |
| CHASE-SQL (SOTA) | 73.01% | Multi-agent ensemble |
| Ours (GPT-OSS 120B, Config A) | 49.0% | With evidence, single-LLM pipeline |
| Ours (Llama 3.3 70B, Config A) | 46.5% | With evidence, single-LLM pipeline |
Our results are competitive with published zero-shot baselines while using a lightweight, open-source, single-LLM pipeline. The self-healing delta (+2.8pp to +4.6pp on BIRD) demonstrates that execution-grounded iterative repair adds measurable value on real-world databases.
| Metric | Config C | Config B | Config A |
|---|---|---|---|
| Wall time | 1,177s | 1,816s | 1,972s |
| Peak memory | 103.1 MB | 103.1 MB | 103.1 MB |
| Throughput | 22.3 q/min | 14.4 q/min | 13.3 q/min |
| p95 latency | 10.37s | 18.67s | 17.83s |
| Metric | Config C | Config B | Config A |
|---|---|---|---|
| Wall time | 1,165s | 2,191s | 2,068s |
| Peak memory | 106.6 MB | 106.6 MB | 106.6 MB |
| Throughput | 22.5 q/min | 12.0 q/min | 12.7 q/min |
| p95 latency | 11.45s | 19.98s | 19.67s |
Place the BIRD mini-dev dataset at evaluation/bird/bird_data/:
evaluation/bird/bird_data/
├── mini_dev_sqlite.json # 500 questions (auto-download attempted)
└── dev_databases/
├── california_schools/
│ └── california_schools.sqlite
├── card_games/
│ └── card_games.sqlite
├── ...
└── toxicology/
└── toxicology.sqlite
The dataset can be downloaded from:
If the question JSON is missing, the runner attempts automatic download via the HuggingFace datasets library. The SQLite database files must be placed manually.
cd sqlqueryengine
# 1. Set LLM credentials in docker-compose-bird-evaluation.yml
# Edit: LLM_BASE_URL, LLM_MODEL, LLM_API_KEY
# 2. Build and launch
docker compose -f docker-compose-bird-evaluation.yml build
docker compose -f docker-compose-bird-evaluation.yml up -d
# 3. Monitor progress
docker logs bird-runner --tail 10 -f
# 4. Results appear in evaluation/bird/bird_results/The BIRD stack starts four services: PostgreSQL (with 2 GB shared memory for large databases), Redis, the SQL Query Engine, and the BIRD evaluation runner.
1. Wait for PostgreSQL
2. Load BIRD dataset (birdDataLoader.py)
3. Migrate SQLite databases → PostgreSQL (sqliteToPostgres.py)
4. Execute converted gold queries, save bird_results/questions.json
5. Wait for engine
6. Run 3-config ablation (birdEvalRunner.py)
7. Generate score report (birdScoreReport.py)
Expected runtime: 2-4 hours for 500 questions across 3 configurations, depending on LLM speed and rate limits.
| Variable | Default | Description |
|---|---|---|
LLM_BASE_URL |
— | OpenAI-compatible endpoint |
LLM_MODEL |
— | Model ID to benchmark |
LLM_API_KEY |
your-api-key-here |
API key (replace before running) |
BIRD_DATASET |
mini |
Dataset variant: mini (500 questions) or full (1,534) |
BIRD_USE_EVIDENCE |
true |
Include BIRD evidence hints in prompts |
EVAL_MAX_WORKERS |
4 |
Parallel question workers (tune based on rate limits) |
TIMEOUT_SECONDS |
180 |
Per-request timeout in seconds |
evaluation/bird/bird_results/
├── questions.json ← Converted questions with PostgreSQL gold results
└── runs/ ← Archived results per model
├── llama-4-scout-17b-16e-instruct/
│ ├── summary.json ← All metrics (accuracy, healing, resources)
│ ├── results_config_a.json ← Full pipeline per-question results
│ ├── results_config_b.json ← Single-shot per-question results
│ ├── results_config_c.json ← Generation-only per-question results
│ ├── metrics_config_a.json ← Resource metrics for Config A
│ ├── metrics_config_b.json ← Resource metrics for Config B
│ ├── metrics_config_c.json ← Resource metrics for Config C
│ ├── predictions.txt ← Official BIRD submission format
│ └── conversion_report.json ← SQLite→PostgreSQL conversion stats
├── gpt-oss-120b/
│ └── ...
└── ...
| Module | Description |
|---|---|
birdEntrypoint.py |
Orchestrates: load → migrate → evaluate → score |
birdDataLoader.py |
Loads BIRD questions, converts gold SQL dialect (14 transformation rules) |
sqliteToPostgres.py |
Introspects SQLite schemas, creates PostgreSQL databases, bulk-inserts data |
birdEvalRunner.py |
3-config ablation runner with per-database schema warmup and parallel execution |
birdScoreReport.py |
Generates summary tables, comparison with published baselines, official prediction file |
birdConfig.py |
Environment-driven configuration for the BIRD pipeline |
- Evaluation — Synthetic evaluation on controlled databases
- Architecture — System design and self-healing pipeline details
📄 Paper: arXiv:2604.16511 | 📊 Dataset: Hugging Face | 💻 Source: GitHub
SQL Query Engine
Design
Setup
API
Internals
Evaluation
Help