# Benchmarking Text-to-SQL: Evaluating AI on Complex Natural Language to SQL Tasks

**Author:** [Your Name]  
**Date:** [Today's Date]

## Why This Notebook?

As Natural Language Processing (NLP) models grow more capable, so does the ambition to query databases in plain English. Bridging the gap between a human question and database logic is not trivial, though, especially for complex, real-world business questions.

This notebook walks you through:
- Why Text-to-SQL is challenging
- How to build your own benchmark set of complex Natural Language Query (NLQ) and SQL pairs
- How to set up, document, and share your evaluation process

*Let’s dive in!*

## Why Focus on Complex Queries?

Simple queries (like "Show all customers") do little to test the capabilities of a Text-to-SQL system. In real business settings, users ask questions involving:

- **Aggregations:** Summaries like averages, counts, sums
- **Joins:** Merging information across tables
- **Filtering and Grouping:** Extracting patterns and segments
- **Orderings and Limits:** Prioritizing the most relevant results

These are the cases where models tend to struggle, and the ones your benchmark most needs to cover.
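
To make the feature list above concrete, here is a minimal sketch that tags a SQL string with the features it appears to use. The `FEATURE_PATTERNS` table and the `tag_sql_features` helper are hypothetical names introduced for illustration, and a keyword scan is only a rough heuristic: it is not a SQL parser and can be fooled by keywords inside string literals. The example query is one of the gold queries used later in this notebook.

```python
import re

# Hypothetical keyword patterns, one per feature named in the list above.
# This is a heuristic scan, not a real SQL parser.
FEATURE_PATTERNS = {
    "Aggregation": r"\b(?:AVG|COUNT|SUM|MIN|MAX)\s*\(",
    "Join": r"\bJOIN\b",
    "Filtering": r"\b(?:WHERE|HAVING)\b",
    "Group By": r"\bGROUP\s+BY\b",
    "Ordering": r"\bORDER\s+BY\b",
    "Limit": r"\bLIMIT\b",
}


def tag_sql_features(sql: str) -> list[str]:
    """Return the feature names whose pattern occurs in the SQL text."""
    return [
        name
        for name, pattern in FEATURE_PATTERNS.items()
        if re.search(pattern, sql, flags=re.IGNORECASE)
    ]


example_sql = """
SELECT T1.name_full, T1.college_id
FROM college AS T1
JOIN player_college AS T2 ON T1.college_id = T2.college_id
GROUP BY T1.college_id
ORDER BY COUNT(*) DESC
LIMIT 1;
"""

print(tag_sql_features(example_sql))
# ['Aggregation', 'Join', 'Group By', 'Ordering', 'Limit']
```

Tags produced this way can serve as a quick cross-check on hand-written complexity labels, but they should not replace them.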

In [None]:
import pandas as pd

# Define a list of complex NLQ (Natural Language Query) and Gold SQL pairs
complex_samples = [
    {
        "NLQ": "What is the full name and id of the college with the largest number of baseball players?",
        "Gold SQL": """
        SELECT T1.name_full, T1.college_id
        FROM college AS T1
        JOIN player_college AS T2 ON T1.college_id = T2.college_id
        GROUP BY T1.college_id
        ORDER BY COUNT(*) DESC
        LIMIT 1;
        """,
        "Complexity": "Aggregation, Join, Group By, Ordering, Limit",
        "Why_Complex": "This query combines a join between two tables, groups the result, counts players per college, orders by that count, and limits to the top college. It reflects business questions like 'Who is our top customer segment?'"
    },
    {
        "NLQ": "What is average salary of the players in the team named 'Boston Red Stockings'?",
        "Gold SQL": """
        SELECT AVG(T1.salary)
        FROM salary AS T1
        JOIN team AS T2 ON T1.team_id = T2.team_id_br
        WHERE T2.name = 'Boston Red Stockings';
        """,
        "Complexity": "Join, Aggregation, Filtering",
        "Why_Complex": "Requires finding the correct team by name, joining salary and team tables, and computing an average. It mimics executive analytics like 'What’s the average cost per department?'"
    },
    {
        "NLQ": "What are first and last names of players participating in all star game in 1998?",
        "Gold SQL": """
        SELECT T1.name_first, T1.name_last
        FROM player AS T1
        JOIN all_star AS T2 ON T1.player_id = T2.player_id
        WHERE T2.year = 1998;
        """,
        "Complexity": "Join, Filtering",
        "Why_Complex": "Links two tables on player ID and filters by year—a step up from just querying one table."
    },
    {
        "NLQ": "What are the first name, last name and id of the player with the most all star game experiences?",
        "Gold SQL": """
        SELECT T1.name_first, T1.name_last, T1.player_id
        FROM player AS T1
        JOIN all_star AS T2 ON T1.player_id = T2.player_id
        GROUP BY T1.player_id
        ORDER BY COUNT(*) DESC
        LIMIT 1;
        """,
        "Complexity": "Join, Aggregation, Group By, Ordering, Limit",
        "Why_Complex": "Finds the player with the maximum number of all-star games—a multi-step analytic involving joins, grouping, and aggregation."
    }
]

df = pd.DataFrame(complex_samples)
df[['NLQ', 'Complexity', 'Why_Complex']]

## Visualizing Our Evaluation Set

Below, we display the NLQs, what makes them complex, and which SQL features are exercised.  
*You can expand this table as your business use cases evolve.*

In [None]:
# Display the benchmark set with explanations
pd.set_option('display.max_colwidth', None)
df[['NLQ', 'Complexity', 'Why_Complex']]
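
As the table grows, it can help to check which SQL features are well covered and which are thin. The following is a minimal sketch using only the standard library. The `feature_coverage` helper is a hypothetical name, and the tag strings are copied from the four samples above so that the block runs on its own.

```python
from collections import Counter

# Complexity strings copied from the four samples in this notebook.
complexity_tags = [
    "Aggregation, Join, Group By, Ordering, Limit",
    "Join, Aggregation, Filtering",
    "Join, Filtering",
    "Join, Aggregation, Group By, Ordering, Limit",
]


def feature_coverage(tag_strings):
    """Count how many benchmark items exercise each SQL feature."""
    counts = Counter()
    for tags in tag_strings:
        counts.update(tag.strip() for tag in tags.split(","))
    return counts


coverage = feature_coverage(complexity_tags)
for feature, n in coverage.most_common():
    print(f"{feature}: {n}")
```

With the DataFrame in scope, `feature_coverage(df["Complexity"])` should give the same counts, since iterating a pandas Series yields its values.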

## How to Use

- **Expand the Set:** Add your own business-critical NLQ-SQL pairs.
- **Connect Your Model:** Use your favorite Text-to-SQL model to generate SQL for each NLQ.
- **Evaluate:** Compare model output to the 'Gold SQL' using:
  - Exact match (for structure)
  - Execution accuracy (do the results match?)
  - Partial credit (for close but not exact answers)
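
Two of these comparisons need nothing but string handling. The sketch below shows one possible reading of them: exact match after light normalization, and partial credit as an F1 score over token sets. The notebook does not prescribe a partial-credit formula, so `token_f1` is an assumption, as are the helper names `normalize_sql` and `exact_match`.

```python
import re


def normalize_sql(sql: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase.

    Lowercasing also affects string literals, so this is a loose comparison.
    """
    collapsed = re.sub(r"\s+", " ", sql.strip())
    return collapsed.rstrip("; ").lower()


def exact_match(predicted: str, gold: str) -> bool:
    """True when the two queries are identical after normalization."""
    return normalize_sql(predicted) == normalize_sql(gold)


def token_f1(predicted: str, gold: str) -> float:
    """Partial credit (assumed definition): F1 over the sets of SQL tokens."""
    pred_tokens = set(re.findall(r"\w+|[^\w\s]", normalize_sql(predicted)))
    gold_tokens = set(re.findall(r"\w+|[^\w\s]", normalize_sql(gold)))
    overlap = len(pred_tokens & gold_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = (
    "SELECT AVG(T1.salary) FROM salary AS T1 "
    "JOIN team AS T2 ON T1.team_id = T2.team_id_br "
    "WHERE T2.name = 'Boston Red Stockings';"
)
reformatted = (
    "select avg(T1.salary)\n"
    "from salary as T1 join team as T2 on T1.team_id = T2.team_id_br\n"
    "where T2.name = 'Boston Red Stockings'"
)
near_miss = "SELECT AVG(salary) FROM salary"

print(exact_match(reformatted, gold))  # True: only case and whitespace differ
print(exact_match(near_miss, gold))    # False
print(round(token_f1(near_miss, gold), 2))
```

Exact match is strict: two queries that return the same rows but use different aliases or join order will still fail it, which is why execution accuracy is usually reported alongside.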

**Why this matters:**  
Domain-relevant queries expose gaps that public datasets may miss, especially in regulated, enterprise, or niche sectors.

## Model Evaluation (To Be Implemented)

You could extend this notebook to:
- Generate SQL queries using your LLM or Text-to-SQL system
- Automatically compare generated SQL with gold SQL
- Compute accuracy metrics (exact match, execution match, etc.)

*Code for evaluation can be added here in future versions!*
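
As a starting point, the steps listed above could be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption made for illustration: the toy tables and rows in `build_toy_db` are made up, `generate_sql` is a hypothetical stand-in that returns canned answers instead of calling a model, and the comparison ignores row order, which is not appropriate when the ordering itself is part of the question.

```python
import sqlite3
from collections import Counter


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE player (player_id TEXT, name_first TEXT, name_last TEXT);
        CREATE TABLE all_star (player_id TEXT, year INTEGER);
        INSERT INTO player VALUES ('p1', 'Ada', 'Reyes'), ('p2', 'Sam', 'Okoro');
        INSERT INTO all_star VALUES ('p1', 1998), ('p2', 1997), ('p2', 1999);
    """)
    return conn


def run_query(conn, sql):
    """Return rows as an order-insensitive multiset, or None if the SQL fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same multiset of rows."""
    gold_rows = run_query(conn, gold_sql)
    pred_rows = run_query(conn, predicted_sql)
    return gold_rows is not None and pred_rows == gold_rows


NLQ_1998 = "What are first and last names of players participating in all star game in 1998?"
NLQ_MOST = ("What are the first name, last name and id of the player "
            "with the most all star game experiences?")

items = [
    {"NLQ": NLQ_1998,
     "Gold SQL": "SELECT T1.name_first, T1.name_last FROM player AS T1 "
                 "JOIN all_star AS T2 ON T1.player_id = T2.player_id "
                 "WHERE T2.year = 1998"},
    {"NLQ": NLQ_MOST,
     "Gold SQL": "SELECT T1.name_first, T1.name_last, T1.player_id FROM player AS T1 "
                 "JOIN all_star AS T2 ON T1.player_id = T2.player_id "
                 "GROUP BY T1.player_id ORDER BY COUNT(*) DESC LIMIT 1"},
]


def generate_sql(nlq: str) -> str:
    """Hypothetical stand-in for a model call; replace with your own system."""
    canned = {
        # Differently written but equivalent query: should count as correct.
        NLQ_1998: "SELECT p.name_first, p.name_last FROM all_star a "
                  "JOIN player p ON a.player_id = p.player_id WHERE a.year = 1998",
        # Ignores the all_star table entirely: should count as wrong.
        NLQ_MOST: "SELECT name_first, name_last, player_id FROM player "
                  "ORDER BY player_id LIMIT 1",
    }
    return canned.get(nlq, "SELECT 1")


def evaluate(conn, benchmark):
    """Fraction of items whose generated SQL matches the gold SQL on execution."""
    results = [execution_match(conn, generate_sql(item["NLQ"]), item["Gold SQL"])
               for item in benchmark]
    return sum(results) / len(results)


conn = build_toy_db()
accuracy = evaluate(conn, items)
print(f"Execution accuracy: {accuracy:.2f}")
```

Execution match depends on the data in the database: a wrong query can return the right rows by coincidence on a small table, so a realistic run should use a populated copy of the real schema.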

## Conclusion

Building a benchmark set for Text-to-SQL is an **iterative, ongoing process**.  
As your business grows or your schema evolves, revisit your test queries.  

**Share this notebook:**  
- On GitHub (with your additions)
- With your team to ensure everyone speaks the same 'language' when it comes to analytics

---

**Inspired by hands-on AI work in real enterprises and co-authored with ChatGPT by OpenAI.**