# PyDough LLM Demo

This notebook showcases how an LLM can generate PyDough queries from natural language instructions. The goal is to demonstrate how AI can automate complex data analysis, making querying faster, more intuitive, and accessible without needing deep technical expertise.

Each example highlights different capabilities, including aggregations, filtering, ranking, and calculations across multiple collections.

## Setup

First, we import the `LLMClient` class.

In [None]:
from llm import LLMClient

SyntaxError: invalid syntax (3373058556.py, line 1)

Then, we define the `provider` and `model` as variables and initialize the client with the selected values. 

These can be adjusted as needed.

In [6]:
provider = "aws"
model = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

client = LLMClient(provider, model)

Use the `ask()` method to send a question to the model.

It returns a `result` object with the following attributes:

- `pydough_code`: The PyDough query generated by the LLM.
- `full_explanation`: A detailed explanation of how the query works.
- `df`: The dataframe containing the query results.
- `exception`: Stores any errors encountered while executing the query.
- `original_question`: The natural language question input by the user.
- `sql_output`: The SQL equivalent of the generated PyDough query.
- `base_prompt`: The initial instruction given to the LLM to generate the query.
- `cheat_sheet`: A reference guide or example queries to help the LLM structure responses.
- `knowledge_graph`: The metadata structure that informs the LLM about available collections and relationships.
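
Since `exception` and `df` are separate attributes, it can help to check the former before reading the latter. The sketch below uses a hypothetical stand-in class, `DemoResult`, that mirrors only a few of the attribute names listed above. It is not the real object returned by `ask()`, and it assumes that `exception` is `None` when a query succeeds. The `summarize` helper is likewise an illustration, not part of `LLMClient`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DemoResult:
    """Hypothetical stand-in that mirrors a few attributes of the real result."""
    original_question: str
    df: Any = None
    # Assumption: None means the query ran without errors.
    exception: Optional[Exception] = None


def summarize(result) -> str:
    """Return a one-line status, checking `exception` before touching `df`."""
    if result.exception is not None:
        return f"FAILED: {result.original_question!r} ({result.exception})"
    if result.df is None:
        return f"NO DATA: {result.original_question!r}"
    return f"OK: {result.original_question!r} ({len(result.df)} rows)"


# Made-up results for illustration only.
ok = DemoResult("How many parts are there?", df=[{"n": 1}])
bad = DemoResult("Bad question", exception=ValueError("no such column"))
print(summarize(ok))
print(summarize(bad))
```

With the real client, the same helper could be called on the object returned by `ask()`, provided its attributes behave as assumed here.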
 
Example:

In [16]:
result = client.ask("For each of the 5 largest part sizes, find the part of that size with the largest retail price")

print(result.full_explanation)

I'll create a PyDough code snippet to find the part with the largest retail price for each of the 5 largest part sizes.

First, I need to:
1. Group parts by their size
2. For each size, find the part with the maximum retail price
3. Order the sizes in descending order
4. Take the top 5 largest sizes

Here's the PyDough code:

```python
# First, calculate the maximum retail price for each part size
max_price_by_size = GROUP_BY(parts, name="parts_group", by=size).CALCULATE(
    size=size,
    max_retail_price=MAX(parts_group.retail_price)
)

# For each size, find the part with the matching maximum retail price
# Then order by size in descending order and take the top 5
largest_parts = parts.CALCULATE(
    size=size,
    name=name,
    retail_price=retail_price,
    part_key=key
).WHERE(
    retail_price == max_price_by_size.WHERE(size == parts.size).max_retail_price
).ORDER_BY(
    size.DESC()
).TOP_K(5, by=size.DESC())

# The result contains the part with the largest retail price for ea
```

## Demonstration Queries

### 1. Find the names of all customers and the number of orders placed in 1995 in Europe.

In [17]:
query = "Find the names of all customers and the number of orders placed in 1995 in Europe."

result = client.ask(query)

print(result.full_explanation)

ERROR WHILE EXECUTING QUERY:
SELECT
  name AS customer_name,
  COALESCE(agg_0, 0) AS orders_in_1995
FROM (
  SELECT
    agg_0,
    name
  FROM (
    SELECT
      key,
      name
    FROM (
      SELECT
        _table_alias_2.key AS key,
        name,
        name_3
      FROM (
        SELECT
          c_custkey AS key,
          c_name AS name,
          c_nationkey AS nation_key
        FROM main.CUSTOMER
      ) AS _table_alias_2
      LEFT JOIN (
        SELECT
          _table_alias_0.key AS key,
          name AS name_3
        FROM (
          SELECT
            n_nationkey AS key,
            n_regionkey AS region_key
          FROM main.NATION
        ) AS _table_alias_0
        INNER JOIN (
          SELECT
            r_regionkey AS key,
            r_name AS name
          FROM main.REGION
        ) AS _table_alias_1
          ON region_key = _table_alias_1.key
      ) AS _table_alias_3
        ON nation_key = _table_alias_3.key
    )
    WHERE
      name_3 = 'EUROPE'
  )
 

**Follow up**: Now, give me the ones who have an account balance greater than $700 and placed at least one order in that same year.

In [18]:
query += " Who have an account balance greater than $700 and placed at least one order in 1995."

result = client.ask(query)

print(result.full_explanation)

ERROR WHILE EXECUTING QUERY:
SELECT
  name AS customer_name,
  COALESCE(agg_0, 0) AS orders_in_1995
FROM (
  SELECT
    agg_0,
    name
  FROM (
    SELECT
      acctbal,
      agg_0,
      name,
      name_3
    FROM (
      SELECT
        _table_alias_2.key AS key,
        acctbal,
        name,
        name_3
      FROM (
        SELECT
          c_acctbal AS acctbal,
          c_custkey AS key,
          c_name AS name,
          c_nationkey AS nation_key
        FROM main.CUSTOMER
      ) AS _table_alias_2
      LEFT JOIN (
        SELECT
          _table_alias_0.key AS key,
          name AS name_3
        FROM (
          SELECT
            n_nationkey AS key,
            n_regionkey AS region_key
          FROM main.NATION
        ) AS _table_alias_0
        INNER JOIN (
          SELECT
            r_regionkey AS key,
            r_name AS name
          FROM main.REGION
        ) AS _table_alias_1
          ON region_key = _table_alias_1.key
      ) AS _table_alias_3
        

### 2. Find the region name with the highest total order value in 1996.

Total order value is defined here as potential revenue: the sum of `extended_price * (1 - discount)`.
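
As a quick sanity check on that definition, the arithmetic can be sketched in plain Python. The line items below are made up for illustration and are not taken from the demo database. The function name `potential_revenue` is also hypothetical.

```python
# Made-up line items for illustration only.
line_items = [
    {"extended_price": 1000.0, "discount": 0.25},
    {"extended_price": 500.0, "discount": 0.0},
    {"extended_price": 200.0, "discount": 0.5},
]


def potential_revenue(items):
    """Sum of extended_price * (1 - discount) over all line items."""
    return sum(item["extended_price"] * (1 - item["discount"]) for item in items)


# 1000 * 0.75 + 500 * 1.0 + 200 * 0.5 = 750 + 500 + 100
print(potential_revenue(line_items))  # 1350.0
```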

In [19]:
query = "Find the region name with the highest total order value in 1996. The total order value is defined as potential revenue, defined as the sum of extended_price * (1 - discount)"

result = client.ask(query)

print(result.full_explanation)

ERROR WHILE EXECUTING QUERY:
SELECT
  region_name,
  total_revenue
FROM (
  SELECT
    ordering_1,
    region_name,
    total_revenue
  FROM (
    SELECT
      COALESCE(agg_0, 0) AS ordering_1,
      COALESCE(agg_0, 0) AS total_revenue,
      name AS region_name
    FROM (
      SELECT
        agg_0,
        name
      FROM (
        SELECT
          r_regionkey AS key,
          r_name AS name
        FROM main.REGION
      )
      LEFT JOIN (
        SELECT
          SUM(extended_price * (
            1 - discount
          )) AS agg_0,
          region_key
        FROM (
          SELECT
            discount,
            extended_price,
            region_key
          FROM (
            SELECT
              key_5,
              region_key
            FROM (
              SELECT
                key AS key_5,
                order_date,
                region_key
              FROM (
                SELECT
                  _table_alias_1.key AS key_2,
                  region_key
  

### 3. Which 10 customers purchased the highest quantity of products during 1998?

In [33]:
query = "Which 10 customers purchased the highest quantity of products during 1998?"

result = client.ask(query)

print(result.full_explanation)

ERROR WHILE EXECUTING QUERY:
SELECT
  customer_key,
  customer_name,
  total_quantity
FROM (
  SELECT
    customer_key,
    customer_name,
    ordering_1,
    total_quantity
  FROM (
    SELECT
      COALESCE(agg_0, 0) AS ordering_1,
      COALESCE(agg_0, 0) AS total_quantity,
      key AS customer_key,
      name AS customer_name
    FROM (
      SELECT
        agg_0,
        key,
        name
      FROM (
        SELECT
          c_custkey AS key,
          c_name AS name
        FROM main.CUSTOMER
      )
      LEFT JOIN (
        SELECT
          SUM(quantity) AS agg_0,
          customer_key
        FROM (
          SELECT
            customer_key,
            quantity
          FROM (
            SELECT
              customer_key,
              key
            FROM (
              SELECT
                o_custkey AS customer_key,
                o_orderdate AS order_date,
                o_orderkey AS key
              FROM main.ORDERS
            )
            WHERE
            

**Follow up**: Give me the ones that have "green" in the product name.

In [35]:
query += " That have 'green' on the product name."

result = client.ask(query)

print(result.full_explanation)

ERROR WHILE EXECUTING QUERY:
SELECT
  customer_name,
  total_green_quantity
FROM (
  SELECT
    customer_name,
    ordering_1,
    total_green_quantity
  FROM (
    SELECT
      total_green_quantity AS ordering_1,
      customer_name,
      total_green_quantity
    FROM (
      SELECT
        COALESCE(agg_0, 0) AS total_green_quantity,
        name AS customer_name
      FROM (
        SELECT
          agg_0,
          name
        FROM (
          SELECT
            c_custkey AS key,
            c_name AS name
          FROM main.CUSTOMER
        )
        LEFT JOIN (
          SELECT
            SUM(quantity) AS agg_0,
            customer_key
          FROM (
            SELECT
              customer_key,
              quantity
            FROM (
              SELECT
                customer_key,
                name,
                quantity
              FROM (
                SELECT
                  customer_key,
                  part_key,
                  quantity
           

### 4. List customers who ordered in 1996 but not in 1997, with a total spend of over $1000.

In [39]:
query = "List customers who ordered in 1996 but not in 1997 with a total spent of over 1000$?"

result = client.ask(query)

print(result.df)

ERROR WHILE EXECUTING QUERY:
SELECT
  customer_name,
  customer_key_3 AS customer_key,
  total_spent_1996
FROM (
  SELECT
    customer_key_2 AS customer_key_3,
    total_spent_1996 AS ordering_1,
    customer_name,
    total_spent_1996
  FROM (
    SELECT
      COALESCE(agg_0, 0) AS total_spent_1996,
      key AS customer_key_2,
      name AS customer_name
    FROM (
      SELECT
        agg_0,
        key,
        name
      FROM (
        SELECT
          agg_0,
          key,
          name
        FROM (
          SELECT
            c_custkey AS key,
            c_name AS name
          FROM main.CUSTOMER
        )
        INNER JOIN (
          SELECT
            SUM(total_price) AS agg_0,
            customer_key
          FROM (
            SELECT
              customer_key,
              total_price
            FROM (
              SELECT
                o_custkey AS customer_key,
                o_orderdate AS order_date,
                o_totalprice AS total_price
           

**Follow up**: Include the number of months since the last order and sort by total spent, highest first.

In [36]:
query += " Include the number of months since the last order and sort by total spent, highest first."

result = client.ask(query)

print(result.full_explanation)

ERROR WHILE EXECUTING QUERY:
SELECT
  customer_name,
  total_quantity,
  total_spent,
  last_order_date,
  months_since_last_order
FROM (
  SELECT
    customer_name,
    last_order_date,
    months_since_last_order,
    ordering_4,
    total_quantity,
    total_spent
  FROM (
    SELECT
      total_spent AS ordering_4,
      customer_name,
      last_order_date,
      months_since_last_order,
      total_quantity,
      total_spent
    FROM (
      SELECT
        COALESCE(agg_2, 0) AS total_quantity,
        COALESCE(agg_3, 0) AS total_spent,
        agg_0 AS last_order_date,
        name AS customer_name,
        (
          CAST(STRFTIME('%Y', DATETIME('now')) AS INTEGER) - CAST(STRFTIME('%Y', agg_1) AS INTEGER)
        ) * 12 + CAST(STRFTIME('%m', DATETIME('now')) AS INTEGER) - CAST(STRFTIME('%m', agg_1) AS INTEGER) AS months_since_last_order
      FROM (
        SELECT
          agg_0,
          agg_1,
          agg_2,
          agg_3,
          name
        FROM (
          SELECT

### 5. What is the February 1996 SPM for the almond antique blue royal burnished part in China?

SPM (Selling Profit Margin) = (Total Amount from Sales - (Tax + Commission)) / Total Amount from Sales * 100
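
To make the formula concrete, here is a minimal Python sketch of the SPM calculation. The function name `selling_profit_margin` and the input figures are assumptions for illustration only. How tax and commission are derived from the underlying data is not specified here and is left out.

```python
def selling_profit_margin(total_sales, tax, commission):
    """SPM = (total_sales - (tax + commission)) / total_sales * 100."""
    if total_sales == 0:
        # The ratio is undefined when nothing was sold.
        raise ValueError("SPM is undefined when total sales are zero")
    return (total_sales - (tax + commission)) / total_sales * 100


# Made-up figures: (200 - (20 + 30)) / 200 * 100
print(selling_profit_margin(200.0, 20.0, 30.0))  # 75.0
```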

In [None]:
query = "What is the february 1996 SPM for the almond antique blue royal burnished part in China? SPM (Selling Profit Margin) = (Total Amount from Sells - (Tax + Commission)) / Total Amount from Sells * 100"

result = client.ask(query)

print(result.full_explanation)

I need to analyze this query to find the "SPM" (which I assume means "Sales Per Month") for a specific part in China during February 1996.

Let me break down what we need:
1. Find the specific part with attributes "almond antique blue royal burnished"
2. Filter for sales in China
3. Filter for February 1996
4. Calculate the sales metric (SPM)

```python
# First, find the part that matches the description
target_part = parts.WHERE(
    CONTAINS(name, "almond") & 
    CONTAINS(name, "antique") & 
    CONTAINS(name, "blue") & 
    CONTAINS(name, "royal") & 
    CONTAINS(name, "burnished")
)

# Find all line items for this part, in China, from February 1996
china_feb_1996_sales = lines.WHERE(
    (part.key == target_part.key) &
    (supplier.nation.name == "CHINA") &
    (YEAR(ship_date) == 1996) &
    (MONTH(ship_date) == 2)
)

# Calculate the SPM (Sales Per Month) for this part in China for February 1996
result = china_feb_1996_sales.CALCULATE(
    part_name=part.name,
    nation=supplie
```

**Follow up**: Compare that to the November 1995 SPM. Have we seen an increase?

In [None]:
query += " Compare that to november 1995 SPM, have we seen an increase?"

result = client.ask(query)

print(result.full_explanation)

I'll analyze this request and create a PyDough query to find the Sales Per Month (SPM) for the specified part in China, comparing February 1996 with November 1995.

First, I need to understand what we're looking for:
1. Find a specific part: "almond antique blue royal burnished"
2. Calculate SPM (Sales Per Month) in China
3. Compare February 1996 vs November 1995

```python
# First, identify the specific part by its description
target_part = parts.WHERE(
    CONTAINS(name, "almond") & 
    CONTAINS(name, "antique") & 
    CONTAINS(name, "blue") & 
    CONTAINS(name, "royal") & 
    CONTAINS(name, "burnished")
)

# Find China in the nations collection
china = nations.WHERE(name == "CHINA")

# Calculate SPM (Sales Per Month) for the target part in China for Feb 1996
feb_1996_spm = lines.WHERE(
    (part_key == target_part.key) &
    (supplier.nation.key == china.key) &
    (YEAR(ship_date) == 1996) &
    (MONTH(ship_date) == 2)
).CALCULATE(
    part_name=part.name,
    nation=supplier.na
```

**Follow up**: Now exclude supplier Supplier#000001305 and focus only on supplier Supplier#000008802

In [None]:
query += " Now exclude supplier Supplier#000001305 and focus only on supplier Supplier#000008802"

result = client.ask(query)

print(result.full_explanation)

## Why These 5?

They progressively introduce PyDough features:

- Queries 1-2: Basic filtering, aggregation, and ranking.
- Queries 3-4: More complex filtering and calculations across years.
- Query 5: Advanced partitioning and comparisons.

They represent real-world business cases:

- Customer segmentation (Queries 1 and 4).
- Revenue and sales performance (Queries 2 and 5).
- Product trend analysis (Query 3).

The follow-ups make the demo dynamic:

- They show how queries can be refined and modified based on insights.
- They demonstrate the LLM’s flexibility in query generation.
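
Each follow-up cell in this notebook appends to `query`, so the model receives the original question plus every refinement as a single string. A small helper can make the separator between sentences explicit. The `build_query` function below is a hypothetical illustration and not part of `LLMClient`.

```python
def build_query(base, follow_ups):
    """Join the base question and its follow-ups with single spaces."""
    parts = [base.strip()] + [follow_up.strip() for follow_up in follow_ups]
    return " ".join(parts)


base = "Which 10 customers purchased the highest quantity of products during 1998?"
follow_ups = ["That have 'green' on the product name."]

query = build_query(base, follow_ups)
print(query)
```

The resulting string could then be passed to `client.ask(query)` in the same way as the cells above.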