# PyDough LLM Demo

This notebook showcases how an LLM can generate PyDough queries from natural language instructions. The goal is to demonstrate how AI can automate complex data analysis, making querying faster, more intuitive, and accessible without needing deep technical expertise.

Each example highlights different capabilities, including aggregations, filtering, ranking, and calculations across multiple collections.

## Setup

First, we import the `LLMClient` class.

In [2]:
from llm import LLMClient

Then, we define the `provider` and `model` as variables and initialize the client with the selected values. 

These can be adjusted as needed.

In [3]:
provider = "aws"
model = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

client = LLMClient(provider, model)
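If the provider and model need to change between runs, one option is to read them from environment variables instead of editing the cell. The sketch below assumes two hypothetical variable names, `LLM_PROVIDER` and `LLM_MODEL`, which are not part of the client; the defaults are the values used above.

```python
import os


def load_llm_settings(env=None):
    """Return (provider, model), falling back to the notebook's defaults."""
    env = os.environ if env is None else env
    # LLM_PROVIDER and LLM_MODEL are hypothetical names chosen for this sketch.
    provider = env.get("LLM_PROVIDER", "aws")
    model = env.get("LLM_MODEL", "us.anthropic.claude-3-7-sonnet-20250219-v1:0")
    return provider, model


# Passing a plain dict keeps the example independent of the real environment.
provider, model = load_llm_settings({"LLM_PROVIDER": "aws"})
print(provider, model)
```

The returned pair can then be passed to `LLMClient(provider, model)` as in the cell above.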

Use the `ask()` method to send a question to the model.

It returns a `result` object with the following attributes:

- `code`: The PyDough query generated by the LLM.
- `full_explanation`: A detailed explanation of how the query works.
- `df`: The dataframe containing the query results.
- `exception`: Stores any errors encountered while executing the query.
- `original_question`: The natural language question input by the user.
- `sql_output`: The SQL equivalent of the generated PyDough query.
- `base_prompt`: The initial instruction given to the LLM to generate the query.
- `cheat_sheet`: A reference guide or example queries to help the LLM structure responses.
- `knowledge_graph`: The metadata structure that informs the LLM about available collections and relationships.
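To make the shape of the result concrete, here is a minimal stand-in, assuming the attributes listed above behave like plain fields. `QueryResult` is a hypothetical name used only for illustration; the actual class returned by `ask()` may differ, and `df` is typed loosely to avoid a pandas dependency in the sketch.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class QueryResult:
    """Hypothetical stand-in mirroring the attributes of the result object."""

    original_question: str
    code: Optional[str] = None
    full_explanation: Optional[str] = None
    df: Any = None  # a dataframe in the real client
    exception: Optional[str] = None
    sql_output: Optional[str] = None
    base_prompt: Optional[str] = None
    cheat_sheet: Optional[str] = None
    knowledge_graph: Optional[str] = None


example = QueryResult(
    original_question="Give me all the suppliers name from United States",
    code='nations.WHERE(name == "UNITED STATES").suppliers.CALCULATE(supplier_name=name)',
)

# Fields that were not populated keep their None default.
print(example.code)
print(example.exception)
```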
 


### Example:

First, we give the client the question we need PyDough code for.

In [6]:
result = client.ask("Give me all the suppliers name from United States")

After that, we can inspect any of the result's attributes.

First, we print the PyDough code together with its full explanation.

In [7]:
print(result.full_explanation)

I'll create a PyDough code snippet to retrieve all supplier names from the United States.

First, I need to identify suppliers whose nation is the United States, then return just their names.

```python
# Find all suppliers from the United States and return their names
us_suppliers = nations.WHERE(name == "UNITED STATES").suppliers.CALCULATE(
    supplier_name=name
)
```

This query:
1. Starts with the `nations` collection
2. Filters to only include the nation with name "UNITED STATES"
3. Accesses the `suppliers` sub-collection for that nation
4. Uses CALCULATE to return only the supplier names, with the field renamed to "supplier_name"


We can also ask for the PyDough code without the explanation.

In [9]:
print(result.code)

# Find all suppliers from the United States and return their names
us_suppliers = nations.WHERE(name == "UNITED STATES").suppliers.CALCULATE(
    supplier_name=name
)


We can also inspect the dataframe directly and apply pandas functions such as `head()`.

In [13]:
result.df

|   | supplier_name |
|---|---|
| 0 | Supplier#000000010 |
| 1 | Supplier#000000019 |
| 2 | Supplier#000000046 |
| 3 | Supplier#000000049 |
| 4 | Supplier#000000055 |
| ... | ... |
| 388 | Supplier#000009819 |
| 389 | Supplier#000009829 |
| 390 | Supplier#000009859 |
| 391 | Supplier#000009906 |


In [27]:
result = client.ask("For each of the 5 largest part sizes, find the part of that size with the largest retail price")

print(result.full_explanation)

I'll analyze this request and create a PyDough code snippet to find the part with the largest retail price for each of the 5 largest part sizes.

First, I need to:
1. Group parts by their size
2. For each size, find the part with the highest retail price
3. Sort by part size in descending order
4. Take only the top 5 sizes

Here's the PyDough code to accomplish this:

```python
# First, calculate the maximum retail price for each part size
max_price_by_size = GROUP_BY(parts.CALCULATE(size=size, retail_price=retail_price), 
                            name="parts_group", 
                            by=size).CALCULATE(
    size=size,
    max_retail_price=MAX(parts_group.retail_price)
)

# Now join back to find the specific part with that maximum price for each size
parts_with_max_price = parts.CALCULATE(
    size=size,
    name=name,
    retail_price=retail_price,
    key=key
).WHERE(
    (size == max_price_by_size.size) & 
    (retail_price == max_price_by_size.max_retail_price)
)

# G

If the result contains an error, `None`, or an empty dataframe, the generated PyDough code may have raised an exception. We can check this by running:

In [28]:
print(result.exception)

An error occurred while processing the code: Unrecognized term of simple table collection 'parts' in graph 'TPCH': 'parts'
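The failure signs mentioned above (an error, `None`, or an empty dataframe) can be folded into one check. This is a minimal sketch, assuming `exception` is `None` or empty when the query succeeded; `needs_correction` is a hypothetical helper, and the stand-in objects replace a real result so the example runs on its own.

```python
from types import SimpleNamespace


def needs_correction(result) -> bool:
    """Return True when the result shows any of the failure signs."""
    if result is None:
        return True
    if getattr(result, "exception", None):
        return True
    df = getattr(result, "df", None)
    # len() works for dataframes and for the list used in this sketch.
    return df is None or len(df) == 0


# Stand-in objects; a real result comes from client.ask().
failed = SimpleNamespace(exception="Unrecognized term 'parts'", df=None)
succeeded = SimpleNamespace(exception=None, df=[{"supplier_name": "Supplier#000000010"}])

print(needs_correction(failed))
print(needs_correction(succeeded))
```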


You can try to fix the error using the `correct` method included in the client. We declare a new variable to hold the corrected result.

In [None]:
corrected_result = client.correct(result)

To see how the model tries to solve the issue, print the full explanation of `corrected_result`.

In [None]:
print(corrected_result.full_explanation)

I see the issue. The error message indicates that the term 'parts' is not recognized in the graph. Looking at the database structure reference file, it seems that the collection name should be singular - `parts` instead of `part`. Let me correct the code:

```python
# First, calculate the maximum retail price for each part size
max_price_by_size = parts.CALCULATE(
    size=size,
    retail_price=retail_price
)

# Group by size to find the maximum retail price for each size
size_max_price = GROUP_BY(max_price_by_size, name="parts_group", by=size).CALCULATE(
    size=size,
    max_retail_price=MAX(parts_group.retail_price)
)

# Join back to the parts collection to find parts with the maximum retail price for their size
parts_with_max_price = parts.CALCULATE(
    size=size,
    name=name,
    retail_price=retail_price,
    part_type=part_type
)

# Filter to only include parts that have the maximum retail price for their size
max_price_parts = parts_with_max_price.WHERE(
    retail_price =

Note: You can repeat this as many times as needed if an exception keeps occurring.
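Repeated correction can be wrapped in a small loop with an upper bound on attempts. The sketch below assumes that `client.correct(result)` returns a new result whose `exception` is empty on success; `correct_with_retries` and `FakeClient` are hypothetical names, and the fake client only exists so the example runs without an LLM.

```python
from types import SimpleNamespace


def correct_with_retries(client, result, max_attempts: int = 3):
    """Call client.correct() until the exception clears or attempts run out."""
    attempts = 0
    while getattr(result, "exception", None) and attempts < max_attempts:
        result = client.correct(result)
        attempts += 1
    return result, attempts


class FakeClient:
    """Stand-in for LLMClient that succeeds on the second correction."""

    def __init__(self):
        self.calls = 0

    def correct(self, result):
        self.calls += 1
        error = None if self.calls >= 2 else "still failing"
        return SimpleNamespace(exception=error)


final, used = correct_with_retries(FakeClient(), SimpleNamespace(exception="boom"))
print(final.exception, used)
```

Capping the number of attempts avoids looping indefinitely when the model cannot repair the query.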

## Test Cases

### Customer Segmentation

#### 1. Find the names of all customers and the number of orders placed in 1995 in Europe.

Demonstrates simple filtering, counting, and sorting while being business-relevant for regional market analysis. The follow-up adds a second filtering layer on account balance and order activity, making it more dynamic.

In [5]:
query = "Find the names of all customers and the number of orders placed in 1995 in Europe."

result = client.ask(query)

print(result.full_explanation)
result.df.head()

I'll create a PyDough code snippet to find the names of all customers and the number of orders they placed in 1995 in Europe.

```python
# Find customers in Europe
# Count orders placed in 1995 for each customer
european_customers_1995_orders = customers.WHERE(
    nation.region.name == "EUROPE"
).CALCULATE(
    customer_name=name,
    order_count=COUNT(orders.WHERE(YEAR(order_date) == 1995))
)
```

This code:
1. Starts with the `customers` collection
2. Filters for customers in Europe using `WHERE(nation.region.name == "EUROPE")`
3. Uses `CALCULATE` to:
   - Include the customer name as `customer_name`
   - Count orders placed in 1995 using `COUNT(orders.WHERE(YEAR(order_date) == 1995))`

The result will contain each European customer's name and the number of orders they placed in 1995.


|   | customer_name | order_count |
|---|---|---|
| 0 | Customer#000000011 | 1 |
| 1 | Customer#000000015 | 0 |
| 2 | Customer#000000018 | 0 |
| 3 | Customer#000000020 | 3 |
| 4 | Customer#000000026 | 2 |


**Follow up**: Now, give me the ones who have an account balance greater than $700 and placed at least one order in that same year. Sorted in descending order by the number of orders.

In [None]:
result = client.discourse(result,
"""Now, give me the ones who have an account balance greater than $700 and placed at least one order in that same year. 
Sorted in descending order by the number of orders.""")

print(result.full_explanation)
result.df.head()

I'll create a PyDough code snippet to find customers from Europe who placed orders in 1995, have an account balance greater than $700, and placed at least one order in that year, sorted by the number of orders in descending order.

```python
# First, identify customers from Europe with orders in 1995
european_customers_with_orders_1995 = customers.CALCULATE(
    customer_name=name,
    account_balance=acctbal,
    order_count=COUNT(orders.WHERE(YEAR(order_date) == 1995))
).WHERE(
    # Filter for European customers
    (nation.region.name == "EUROPE") &
    # Filter for account balance > 700
    (account_balance > 700) &
    # Filter for at least one order in 1995
    (order_count > 0)
).ORDER_BY(
    # Sort by number of orders in descending order
    order_count.DESC()
)
```

This query:
1. Calculates the customer name, account balance, and counts orders placed in 1995 for each customer
2. Filters for customers who:
   - Are from Europe (by checking if their nation's region name is "E

|   | customer_name | account_balance | order_count |
|---|---|---|---|
| 0 | Customer#000107440 | 2464.61 | 12 |
| 1 | Customer#000014920 | 7026.03 | 11 |
| 2 | Customer#000079606 | 989.51 | 11 |
| 3 | Customer#000108496 | 7388.38 | 11 |
| 4 | Customer#000009019 | 2247.06 | 10 |


#### 2. List customers who ordered in 1996 but not in 1997, with a total spent of over $1000.

Showcases PyDough’s HAS() and HASNOT() functions, helping analyze customer retention and spending patterns. Also incorporates a time-based calculation.
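The filter this question describes can be stated in plain Python on toy data: a customer qualifies when at least one order exists in 1996, none exists in 1997, and total spend exceeds 1000. This is only an illustration of the logic, not PyDough code; the sample records are made up, and summing over all of a customer's orders is one possible reading of "total spent".

```python
# Made-up orders, for illustration only.
orders = [
    {"customer": "A", "year": 1996, "total_price": 800.0},
    {"customer": "A", "year": 1996, "total_price": 400.0},
    {"customer": "B", "year": 1996, "total_price": 5000.0},
    {"customer": "B", "year": 1997, "total_price": 100.0},
    {"customer": "C", "year": 1996, "total_price": 300.0},
]


def qualifying_customers(orders, threshold=1000.0):
    """Customers with a 1996 order, no 1997 order, and spend above threshold."""
    by_customer = {}
    for order in orders:
        by_customer.setdefault(order["customer"], []).append(order)

    selected = []
    for customer, rows in sorted(by_customer.items()):
        has_1996 = any(r["year"] == 1996 for r in rows)  # existence check
        has_1997 = any(r["year"] == 1997 for r in rows)  # must be absent
        total = sum(r["total_price"] for r in rows)
        if has_1996 and not has_1997 and total > threshold:
            selected.append((customer, total))
    return selected


print(qualifying_customers(orders))
```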

In [None]:
query = "List customers who ordered in 1996 but not in 1997 with a total spent of over 1000$?"

result = client.ask(query)

print(result.full_explanation)
result.df.head()

**Follow up**: Include the number of months since the last order and sort by total spent, highest first.

In [None]:
result = client.discourse(result,
"Include the number of months since the last order and sort by total spent, highest first.")

print(result.full_explanation)
result.df.head()

### Sales Performance

#### 3. Find the region name with the highest total order value in 1996.

The total order value is defined as potential revenue: the sum of `extended_price * (1 - discount)`.

It introduces a precise calculation within the query, yielding revenue insights.
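As a worked example of the revenue definition, the sketch below applies extended_price * (1 - discount) to a few made-up line items and sums the result per region. The records and the `revenue_by_region` helper are hypothetical and serve only to show the arithmetic.

```python
# Made-up line items, for illustration only.
line_items = [
    {"region": "EUROPE", "extended_price": 1000.0, "discount": 0.10},
    {"region": "EUROPE", "extended_price": 500.0, "discount": 0.00},
    {"region": "ASIA", "extended_price": 2000.0, "discount": 0.50},
]


def revenue_by_region(items):
    """Sum extended_price * (1 - discount) for each region."""
    totals = {}
    for item in items:
        revenue = item["extended_price"] * (1 - item["discount"])
        totals[item["region"]] = totals.get(item["region"], 0.0) + revenue
    return totals


totals = revenue_by_region(line_items)
# The region with the highest total is the answer to the question.
top_region = max(totals, key=totals.get)
print(top_region)
```

Here EUROPE totals 900 + 500 = 1400 and ASIA totals 1000, so EUROPE ranks first.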

In [None]:
query = """Find the region name with the highest total order value in 1996. 
The total order value is defined as potential revenue, defined as the sum of extended_price * (1 - discount)"""

result = client.ask(query)

print(result.full_explanation)
result.df.head()

### Product Trends

#### 4. Which 10 customers purchased the highest quantity of products during 1998?

Highlights ranking queries (TOP_K()), customer segmentation, and purchasing trends. 
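Conceptually, a top-k ranking keeps only the k records with the largest value of some key. The plain-Python sketch below illustrates that idea with `heapq.nlargest` on made-up quantities; it is not PyDough code and says nothing about how `TOP_K()` is implemented.

```python
import heapq

# Made-up total quantities per customer, for illustration only.
quantity_by_customer = {
    "Customer#A": 120,
    "Customer#B": 340,
    "Customer#C": 95,
    "Customer#D": 210,
}


def top_k_customers(quantities, k):
    """Return the k (customer, quantity) pairs with the largest quantity."""
    return heapq.nlargest(k, quantities.items(), key=lambda pair: pair[1])


print(top_k_customers(quantity_by_customer, 2))
```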

In [None]:
query = "Which 10 customers purchased the highest quantity of products during 1998?"

result = client.ask(query)

print(result.full_explanation)
result.df.head()

**Follow up**: Now only the ones that have "green" in the product name.

In [None]:
result = client.discourse(result, "Only the ones that have 'green' on the product name.")

print(result.full_explanation)
result.df.head()

### Revenue Performance

#### 5. What is the February 1996 SPM for the almond antique blue royal burnished part in China?

SPM (Selling Profit Margin) = (Total Amount from Sales - (Tax + Commission)) / Total Amount from Sales * 100

This query was provided as a representative example of potential stakeholder inquiries.

Showcases advanced partitioning and filtering, demonstrating how PyDough can be used for highly specific business KPIs. The follow-ups compare the result with a previous time period and exclude or include specific suppliers, making it a progressive data exploration example.
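The SPM formula can be checked by hand with a small function. This is a sketch of the arithmetic only; the input figures are made up, and `selling_profit_margin` is a hypothetical helper, not part of PyDough or the client.

```python
def selling_profit_margin(total_sales: float, tax: float, commission: float) -> float:
    """SPM = (total sales - (tax + commission)) / total sales * 100."""
    if total_sales == 0:
        raise ValueError("SPM is undefined when total sales are zero")
    return (total_sales - (tax + commission)) / total_sales * 100


# Made-up figures for two periods, to mirror the period comparison.
current = selling_profit_margin(total_sales=10_000.0, tax=800.0, commission=200.0)
previous = selling_profit_margin(total_sales=8_000.0, tax=800.0, commission=400.0)

print(round(current, 2), round(previous, 2))
print("increase" if current > previous else "no increase")
```

With these inputs the margins are 90 and 85 percent, so the comparison reports an increase.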

In [None]:
query = """What is the february 1996 SPM for the almond antique blue royal burnished part in China? 
SPM (Selling Profit Margin) = (Total Amount from Sells - (Tax + Commission)) / Total Amount from Sells * 100"""

result = client.ask(query)

print(result.full_explanation)
result.df.head()

**Follow up**: Compare that to the November 1995 SPM. Have we seen an increase?

In [None]:
# Start on a new line so the follow-up does not run into the previous sentence.
query += """\nCompare that to november 1995 SPM, have we seen an increase?"""

result = client.ask(query)

print(result.full_explanation)
result.df.head()

**Follow up**: Now exclude supplier Supplier#000001305 and focus only on supplier Supplier#000008802

In [None]:
# Start on a new line so the follow-up does not run into the previous sentence.
query += """\nNow exclude supplier Supplier#000001305 and focus only on supplier Supplier#000008802"""

result = client.ask(query)

print(result.full_explanation)
result.df.head()