## 🤖 Agent Logic Overview: How the SQL Agent Works

This notebook demonstrates how to build a reflective SQL Agent using Large Language Models (LLMs). The agent is designed to take a natural-language question, generate an initial SQL query, evaluate its correctness using real execution results, and refine the query if needed. This process ensures that the final output better matches the user's intent.

---
### 🎯 Agent Objective

To create a reliable SQL Agent that:
- Understands natural-language questions
- Generates SQL queries based on schema context
- Evaluates query results to detect semantic errors
- Refines queries through iterative feedback
- Produces accurate, business-aligned answers

---

### 🧩 Agent Workflow Structure

The agent follows a multi-step workflow, inspired by modular LLM orchestration:

1. **LLM #1: SQL Generation**  
   Receives the user's question and schema, and generates an initial SQL query (V1).

2. **Execution Engine**  
   Runs the V1 query against the database and returns the actual output.

3. **LLM #2: Reflection**  
   Reviews the original query and its output to identify logical or semantic issues.

4. **LLM #3: Refinement**  
   Produces a refined SQL query (V2) that corrects any issues found.

5. **Execution Engine**  
   Runs the V2 query and returns the final answer.

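> *Note: Each LLM step can use a different model with specialized capabilities (e.g., generation vs. evaluation). In the implementation below, reflection and refinement share a single model call (`refine_sql_external_feedback`), so only two models are configured.*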

---

### 🔁 Agentic Reflection Mechanisms

- **Static Reflection**:  
  Reviews the SQL query structure without executing it. Useful for catching syntax issues, missing filters, or incorrect aggregations.

- **Dynamic Reflection**:  
  Evaluates the SQL query based on its actual output. Detects subtle semantic errors such as negative totals, empty results, or incorrect logic.
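
The practical difference between the two modes is what the reviewing model gets to see. The sketch below assumes a hypothetical helper, `build_reflection_prompt`, which is not part of this notebook: called without `output` it builds a static-reflection prompt, and called with `output` it appends the execution result for dynamic reflection. The one-line schema string is an illustrative format, not the output of `utils.get_schema`.

```python
from typing import Optional


def build_reflection_prompt(
    question: str,
    sql_query: str,
    schema: str,
    output: Optional[str] = None,
) -> str:
    """Build a review prompt (hypothetical helper, for illustration only).

    output=None  -> static reflection: the reviewer sees only the SQL text.
    output="..." -> dynamic reflection: the reviewer also sees the query result.
    """
    parts = [
        "You are a SQL reviewer.",
        f"User asked:\n{question}",
        f"Original SQL:\n{sql_query}",
        f"Table Schema:\n{schema}",
    ]
    if output is not None:
        # Only dynamic reflection includes the executed result.
        parts.append(f"SQL Output:\n{output}")
    parts.append("Point out any logical or semantic issues.")
    return "\n\n".join(parts)


question = "What is the total sales amount per color?"
sql = "SELECT color, SUM(qty_delta * unit_price) FROM transactions GROUP BY color;"
schema = "transactions(color, qty_delta, unit_price, action)"  # assumed format

static_prompt = build_reflection_prompt(question, sql, schema)
dynamic_prompt = build_reflection_prompt(
    question, sql, schema, output="color,total\nblack,-275176.15"
)
```

With only the static prompt, a reviewer has to infer from the SQL text that the total will be negative; with the dynamic prompt, the negative value is directly visible.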

---

### 📊 Agent Logic Flow
![SQL Agent Workflow](image.png)

---



## 🧠 Part 1: Key Takeaways (What I Learned)

### 1.1 Technical Skills

*   How to use an LLM to translate natural language into SQL queries
*   How to build an event-sourced database and extract its schema
*   How to execute SQL and display the results with pandas
*   How to use reflection (static + dynamic) to fix semantic errors

### 1.2 Agentic Mindset

*   Why the model's first output should not be trusted on its own
*   How execution results reveal hidden problems
*   How to build a "reality-aware" SQL Agent
*   The shift from "calling a model" to "agent behavior"


## 🔁 Part 2: End-to-End SQL Agent Workflow

### Environment Setup

In [5]:
import json
import utils
import pandas as pd
from dotenv import load_dotenv

_ = load_dotenv()

import aisuite as ai

client = ai.Client()

### Data Preparation

In [6]:
utils.create_transactions_db()

utils.print_html(utils.get_schema('products.db'))

SQLite database 'products.db' created with a single 'transactions' table (event-sourced).
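
In an event-sourced table, each row records a change rather than a current state, so a sale shows up as a negative quantity delta. The real table comes from `utils.create_transactions_db()`, whose definition is not shown here. The sketch below uses an in-memory SQLite table with assumed columns (only `color`, `qty_delta`, `unit_price`, and `action` appear in the queries later in this notebook) and made-up rows, to show why a naive `SUM(qty_delta * unit_price)` over sales comes out negative.

```python
import sqlite3

# Illustrative stand-in for the real 'transactions' table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        color TEXT,
        action TEXT,        -- 'sale' appears in the notebook; 'restock' is an assumed value
        qty_delta INTEGER,  -- positive when stock comes in, negative when it is sold
        unit_price REAL
    )
    """
)
conn.executemany(
    "INSERT INTO transactions (color, action, qty_delta, unit_price) VALUES (?, ?, ?, ?)",
    [
        ("red", "restock", 10, 5.0),
        ("red", "sale", -3, 5.0),
        ("blue", "sale", -2, 4.0),
    ],
)

# Naive total: sums the raw (negative) deltas.
naive = conn.execute(
    "SELECT color, SUM(qty_delta * unit_price) FROM transactions "
    "WHERE action = 'sale' GROUP BY color ORDER BY color"
).fetchall()

# Sign-corrected total: negates the delta so revenue is positive.
fixed = conn.execute(
    "SELECT color, SUM(-qty_delta * unit_price) FROM transactions "
    "WHERE action = 'sale' GROUP BY color ORDER BY color"
).fetchall()

print(naive)  # [('blue', -8.0), ('red', -15.0)]
print(fixed)  # [('blue', 8.0), ('red', 15.0)]
```

This is the same sign problem that the reflection step is expected to catch in the workflow run later in this part.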


### Function Definitions

1. generate_sql()

> Purpose: generate a SQL query from the user's natural-language question and the database schema.

In [7]:
def generate_sql(question: str, schema: str, model: str) -> str:
    """
    Generate a SQL query from the user's question and the database schema.

    Args:
    - question: the user's question (natural language)
    - schema: the database table structure (as a string)
    - model: name of the language model to use (e.g. dashscope:qwen3-vl-plus)

    Returns:
    - the SQL query (string)
    """

    # Build the prompt: tell the model it is a SQL assistant,
    # provide the schema and the user's question, and ask for SQL only.
    prompt = f"""
    You are a SQL assistant. Given the schema and the user's question, write a SQL query for SQLite.

    Schema:
    {schema}

    User question:
    {question}

    Respond with the SQL only.
    """

    # Call the language model to generate the SQL query
    response = client.chat.completions.create(
        model=model,  # model to use
        messages=[{"role": "user", "content": prompt}],  # the prompt is sent as the user message
        temperature=0,  # 0 makes the output more stable and deterministic
    )

    # Extract the SQL returned by the model and strip leading/trailing whitespace
    return response.choices[0].message.content.strip()
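
Even when asked for SQL only, chat models often wrap the answer in a Markdown code fence; the `sql_v1` value in the run output later in this notebook starts with one. Whether `utils.execute_sql` strips such fences is not shown here, so the following is only a sketch of a hypothetical helper, `strip_sql_fences`, that could be applied to the model output before it is executed or compared.

```python
import re

# Matches an optional language tag after the opening fence and a closing fence.
_FENCE_RE = re.compile(r"^```[a-zA-Z]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_sql_fences(text: str) -> str:
    """Return the SQL inside a Markdown code fence, or the text unchanged."""
    text = text.strip()
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text


fenced = "```sql\nSELECT color FROM transactions;\n```"
print(strip_sql_fences(fenced))       # SELECT color FROM transactions;
print(strip_sql_fences("SELECT 1;"))  # SELECT 1;
```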


2. execute_sql()

> 💡 Note: The function `execute_sql()` is defined in `utils.py` for modularity and reuse. It takes a SQL string and a database path, and returns the query result as a pandas DataFrame.


3. refine_sql_external_feedback()

Reflects on the executed result to find semantic errors and repair the query.


In [8]:
def refine_sql_external_feedback(
    question: str,
    sql_query: str,
    df_feedback: pd.DataFrame,
    schema: str,
    model: str,
) -> tuple[str, str]:
    """
    Reflect on the execution result of a SQL query and produce a repaired query.

    Args:
    - question: the user's question (natural language)
    - sql_query: the initial SQL query (V1)
    - df_feedback: the result of executing the SQL (DataFrame)
    - schema: the database table schema
    - model: name of the language model to use

    Returns:
    - feedback: the model's evaluation of and suggestions for the initial SQL (string)
    - refined_sql: the repaired SQL query (V2)
    """

    # Build the prompt: ask the model to judge whether the SQL output is reasonable and to suggest a fix
    prompt = f"""
    You are a SQL reviewer and refiner.

    User asked:
    {question}

    Original SQL:
    {sql_query}

    SQL Output:
    {df_feedback.to_markdown(index=False)}

    Table Schema:
    {schema}

    Step 1: Briefly evaluate if the SQL output answers the user's question.
    Step 2: If the SQL could be improved, provide a refined SQL query.
    If the original SQL is already correct, return it unchanged.

    Return a strict JSON object with two fields:
    - "feedback": brief evaluation and suggestions
    - "refined_sql": the final SQL to run
    """

    # Call the model to reflect and refine
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # 1.0 encourages more varied repair suggestions
    )

    # Try to parse the model's reply as JSON
    content = response.choices[0].message.content
    try:
        obj = json.loads(content)
        feedback = str(obj.get("feedback", "")).strip()
        refined_sql = str(obj.get("refined_sql", sql_query)).strip()
        if not refined_sql:
            refined_sql = sql_query
    except Exception:
        # If the reply is not valid JSON, keep the original SQL and use the reply as feedback
        feedback = content.strip()
        refined_sql = sql_query

    return feedback, refined_sql
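
`json.loads` fails when the model wraps its JSON in a code fence or adds surrounding prose, and in that case the function above keeps the original SQL without signalling that parsing failed. A more tolerant parser is one option. The sketch below assumes a hypothetical helper, `parse_refinement`, that extracts the outermost `{...}` span before parsing and otherwise falls back the same way the function above does.

```python
import json


def parse_refinement(content: str, fallback_sql: str) -> tuple[str, str]:
    """Extract (feedback, refined_sql) from a model reply (hypothetical helper)."""
    # Take the span from the first '{' to the last '}' so that code fences
    # or surrounding prose do not break json.loads.
    start, end = content.find("{"), content.rfind("}")
    if start != -1 and end > start:
        try:
            obj = json.loads(content[start : end + 1])
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            feedback = str(obj.get("feedback", "")).strip()
            # An empty or missing refined_sql falls back to the original query.
            refined_sql = str(obj.get("refined_sql") or "").strip() or fallback_sql
            return feedback, refined_sql
    # Not valid JSON: treat the whole reply as feedback and keep the original SQL.
    return content.strip(), fallback_sql


reply = '```json\n{"feedback": "Negate the sum.", "refined_sql": "SELECT 1;"}\n```'
print(parse_refinement(reply, "SELECT 0;"))           # ('Negate the sum.', 'SELECT 1;')
print(parse_refinement("no json here", "SELECT 0;"))  # ('no json here', 'SELECT 0;')
```

Brace matching like this is deliberately crude: it suits replies that contain a single JSON object, not arbitrary text that happens to include braces.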


### SQL Agent Workflow

In [21]:
def run_sql_workflow(
    db_path: str,
    question: str,
    model_generation: str = "deepseek:deepseek-chat",
    model_evaluation: str = "dashscope:qwen3-vl-plus",
    return_results: bool = True
):
    print(f"🔧 Using model: {model_generation} for generation, {model_evaluation} for evaluation")

    # Step 1: Schema
    schema = utils.get_schema(db_path)
    utils.print_html(schema, title="📘 Step 1: Extract Database Schema")

    # Step 2: Generate SQL (V1)
    sql_v1 = generate_sql(question, schema, model_generation)
    utils.print_html(sql_v1, title="🧠 Step 2: Generate SQL (V1)")

    # Step 3: Execute V1
    df_v1 = utils.execute_sql(sql_v1, db_path)
    utils.print_html(df_v1, title="🧪 Step 3: Execute V1 (SQL Output)")

    # Step 4: Reflect + Refine
    feedback, sql_v2 = refine_sql_external_feedback(
        question=question,
        sql_query=sql_v1,
        df_feedback=df_v1,
        schema=schema,
        model=model_evaluation,
    )
    utils.print_html(feedback, title="🧭 Step 4: Reflect on V1 (Feedback)")
    utils.print_html(sql_v2, title="🔁 Step 4: Refined SQL (V2)")

    # Step 5: Execute V2
    df_v2 = utils.execute_sql(sql_v2, db_path)
    utils.print_html(df_v2, title="✅ Step 5: Execute V2 (Final Answer)")

    if return_results:
        return {
            "sql_v1": sql_v1,
            "df_v1": df_v1,
            "feedback": feedback,
            "sql_v2": sql_v2,
            "df_v2": df_v2
        }


In [22]:
run_sql_workflow(
    "products.db", 
    "What is the total sales amount per color?",
    model_generation="deepseek:deepseek-chat",
    model_evaluation="dashscope:qwen3-vl-plus"
)

🔧 Using model: deepseek:deepseek-chat for generation, dashscope:qwen3-vl-plus for evaluation


color,total_sales_amount
black,-275176.15
blue,-190571.46
green,-214464.7
red,-242075.23
white,-358315.09


color,total_sales_amount
black,275176.15
blue,190571.46
green,214464.7
red,242075.23
white,358315.09


{'sql_v1': "```sql\nSELECT \n    color,\n    SUM(qty_delta * unit_price) AS total_sales_amount\nFROM transactions\nWHERE action = 'sale' AND qty_delta < 0\nGROUP BY color;\n```",
 'df_v1':    color  total_sales_amount
 0  black          -275176.15
 1   blue          -190571.46
 2  green          -214464.70
 3    red          -242075.23
 4  white          -358315.09,
 'feedback': 'The SQL correctly calculates total sales amount per color, but the negative values may be misleading since sales amounts are typically expressed as positive. The query should multiply by -1 to reflect actual revenue.',
 'sql_v2': "SELECT \n    color,\n    SUM(qty_delta * unit_price * -1) AS total_sales_amount\nFROM transactions\nWHERE action = 'sale' AND qty_delta < 0\nGROUP BY color;",
 'df_v2':    color  total_sales_amount
 0  black           275176.15
 1   blue           190571.46
 2  green           214464.70
 3    red           242075.23
 4  white           358315.09}

### Model Comparison Experiment

In [24]:
def compare_model_combinations(
    db_path: str,
    question: str,
    model_pairs: list[tuple[str, str]]
) -> list[dict]:
    """
    Compare how different model combinations perform in the SQL Agent workflow.

    Args:
    - db_path: path to the database
    - question: the user's question
    - model_pairs: list of model combinations, e.g. [("deepseek-chat", "qwen3-vl-plus"), ...]

    Returns:
    - a list with one result dictionary per model pair
    """
    results = []

    for gen_model, eval_model in model_pairs:
        print(f"\n🔍 Testing: Gen={gen_model}, Eval={eval_model}")

        output = run_sql_workflow(
            db_path=db_path,
            question=question,
            model_generation=gen_model,
            model_evaluation=eval_model,
            return_results=True
        )

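        # Did the refinement step change the query?
        # (Raw string comparison, so formatting-only differences also count as a repair.)
        repaired = output["sql_v1"].strip() != output["sql_v2"].strip()

        # Treat the final result as logically correct if every numeric value is positive.
        # Caveat: an empty result passes this check vacuously.
        df = output["df_v2"]
        valid = df.select_dtypes(include="number").gt(0).all().all()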

        results.append({
            "gen_model": gen_model,
            "eval_model": eval_model,
            "sql_v1": output["sql_v1"],
            "sql_v2": output["sql_v2"],
            "repaired": repaired,
            "valid_result": valid,
            "feedback": output["feedback"],
            "df_v2": df
        })

    return results
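
The `valid` check above only tests that every numeric value is positive, and an empty result satisfies that vacuously. A stricter variant could also require at least one row and at least one numeric value. The sketch below is a hypothetical helper, `is_valid_result`, written against plain row tuples so it runs without the database; for a DataFrame, `df.itertuples(index=False, name=None)` yields such tuples. Using positive totals as the success criterion is specific to this sales question.

```python
from numbers import Real


def is_valid_result(rows: list[tuple]) -> bool:
    """Return True only if there is at least one row, at least one numeric
    value, and every numeric value is strictly positive (hypothetical helper)."""
    numeric = [
        value
        for row in rows
        for value in row
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, Real) and not isinstance(value, bool)
    ]
    return bool(rows) and bool(numeric) and all(v > 0 for v in numeric)


print(is_valid_result([("black", 275176.15), ("blue", 190571.46)]))  # True
print(is_valid_result([("black", -275176.15)]))                      # False
print(is_valid_result([]))                                           # False
```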


In [26]:
model_pairs = [
    ("deepseek:deepseek-chat", "dashscope:qwen3-vl-plus"),
    ("dashscope:qwen3-vl-plus", "deepseek:deepseek-chat"),
    ("dashscope:qwen3-vl-plus", "dashscope:qwen3-vl-plus"),
    ("deepseek:deepseek-chat", "deepseek:deepseek-chat"),
]

results = compare_model_combinations(
    db_path="products.db",
    question="What is the total sales amount per color?",
    model_pairs=model_pairs
)



🔍 Testing: Gen=deepseek:deepseek-chat, Eval=dashscope:qwen3-vl-plus
🔧 Using model: deepseek:deepseek-chat for generation, dashscope:qwen3-vl-plus for evaluation


color,total_sales_amount
black,-275176.15
blue,-190571.46
green,-214464.7
red,-242075.23
white,-358315.09


color,total_sales_amount



🔍 Testing: Gen=dashscope:qwen3-vl-plus, Eval=deepseek:deepseek-chat
🔧 Using model: dashscope:qwen3-vl-plus for generation, deepseek:deepseek-chat for evaluation


color,total_sales_amount
black,-275176.15
blue,-190571.46
green,-214464.7
red,-242075.23
white,-358315.09


color,total_sales_amount
black,275176.15
blue,190571.46
green,214464.7
red,242075.23
white,358315.09



🔍 Testing: Gen=dashscope:qwen3-vl-plus, Eval=dashscope:qwen3-vl-plus
🔧 Using model: dashscope:qwen3-vl-plus for generation, dashscope:qwen3-vl-plus for evaluation


color,total_sales_amount
black,-275176.15
blue,-190571.46
green,-214464.7
red,-242075.23
white,-358315.09


color,total_sales_amount
black,-275176.15
blue,-190571.46
green,-214464.7
red,-242075.23
white,-358315.09



🔍 Testing: Gen=deepseek:deepseek-chat, Eval=deepseek:deepseek-chat
🔧 Using model: deepseek:deepseek-chat for generation, deepseek:deepseek-chat for evaluation


color,total_sales_amount
black,-275176.15
blue,-190571.46
green,-214464.7
red,-242075.23
white,-358315.09


color,total_sales_amount
black,-275176.15
blue,-190571.46
green,-214464.7
red,-242075.23
white,-358315.09


In [27]:
pd.DataFrame([{
    "Gen": r["gen_model"],
    "Eval": r["eval_model"],
    "Repaired": r["repaired"],
    "Valid": r["valid_result"]
} for r in results])

,Gen,Eval,Repaired,Valid
0,deepseek:deepseek-chat,dashscope:qwen3-vl-plus,True,True
1,dashscope:qwen3-vl-plus,deepseek:deepseek-chat,True,True
2,dashscope:qwen3-vl-plus,dashscope:qwen3-vl-plus,False,False
3,deepseek:deepseek-chat,deepseek:deepseek-chat,False,False


#### 🧠 Model Comparison Summary

This experiment tested four combinations of SQL generation and evaluation models:

| Generation Model | Evaluation Model | Repaired | Valid |
|------------------|------------------|----------|-------|
| DeepSeek         | Qwen             | ✅        | ✅     |
| Qwen             | DeepSeek         | ✅        | ✅     |
| Qwen             | Qwen             | ❌        | ❌     |
| DeepSeek         | DeepSeek         | ❌        | ❌     |

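> 💡 Insight: In this run, only the two cross-model pairings changed the query; both same-model pairings returned V1 unchanged. This suggests that cross-model reflection can catch logical errors that a single-model workflow misses, but the evidence is thin: one question and one run per pair. The DeepSeek-generated, Qwen-evaluated run also appears to have returned no rows for V2 (only a header is shown in the output above), so its "Valid" mark comes from the positivity check passing vacuously.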
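
Each pair was run once here, and the refinement step samples at `temperature=1.0`, so outcomes can vary between runs. One way to firm up a comparison like this is to repeat each pair several times and report rates instead of single outcomes. The sketch below assumes a hypothetical `summarize_trials` helper that takes result dictionaries shaped like those returned by `compare_model_combinations`; the sample records are made up to exercise the function and are not measured results.

```python
from collections import defaultdict


def summarize_trials(results: list[dict]) -> dict[tuple[str, str], dict[str, float]]:
    """Aggregate repeated runs into per-pair repair and validity rates."""
    grouped = defaultdict(list)
    for r in results:
        grouped[(r["gen_model"], r["eval_model"])].append(r)

    summary = {}
    for pair, runs in grouped.items():
        n = len(runs)
        summary[pair] = {
            "runs": n,
            "repair_rate": sum(bool(r["repaired"]) for r in runs) / n,
            "valid_rate": sum(bool(r["valid_result"]) for r in runs) / n,
        }
    return summary


# Made-up records, only to show the input and output shapes.
fake_results = [
    {"gen_model": "A", "eval_model": "B", "repaired": True, "valid_result": True},
    {"gen_model": "A", "eval_model": "B", "repaired": False, "valid_result": False},
    {"gen_model": "A", "eval_model": "A", "repaired": False, "valid_result": False},
]
summary = summarize_trials(fake_results)
print(summary[("A", "B")])  # {'runs': 2, 'repair_rate': 0.5, 'valid_rate': 0.5}
```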


## ✅ Part 3 Summary: End-to-End SQL Agent & Model Comparison

We built and tested a modular SQL Agent workflow with the following capabilities:

### 🔁 Workflow Steps
1. Extract database schema
2. Generate SQL (V1) using a language model
3. Execute SQL and display output
4. Reflect on SQL output and propose refined SQL (V2)
5. Execute V2 and display final answer

### 🧠 Modular Design
- All functions are defined in `utils.py` or notebook cells
- Supports flexible model selection for generation and evaluation
- Results are displayed inline using `print_html()`

---

### 🧪 Model Comparison Results

We tested four combinations of generation and evaluation models:

| Generation Model          | Evaluation Model          | Repaired | Valid |
|---------------------------|---------------------------|----------|-------|
| `deepseek:deepseek-chat`  | `dashscope:qwen3-vl-plus` | ✅        | ✅     |
| `dashscope:qwen3-vl-plus` | `deepseek:deepseek-chat`  | ✅        | ✅     |
| `dashscope:qwen3-vl-plus` | `dashscope:qwen3-vl-plus` | ❌        | ❌     |
| `deepseek:deepseek-chat`  | `deepseek:deepseek-chat`  | ❌        | ❌     |

> 💡 Insight: In this run, only the cross-model pairings repaired the query. Treat this as a preliminary signal that cross-model reflection helps, since it rests on one question and one run per pair.

---

## 📦 Next Steps (Optional Extensions)

- Add chart visualization to compare V1 vs V2 results
- Support user-uploaded databases for custom queries
- Build a reusable knowledge base summarizing schema, queries, and model 