# PyDough LLM Demo

This notebook showcases how an LLM can generate PyDough queries from natural language instructions. The goal is to demonstrate how AI can automate complex data analysis, making querying faster, more intuitive, and accessible without needing deep technical expertise.

Each example highlights different capabilities, including aggregations, filtering, ranking, and calculations across multiple collections.

## Setup

First, we import the `LLMClient` class.

In [1]:
from llm import LLMClient

Then we initialize the client.

In [2]:
client = LLMClient()

Use the `ask()` method to send a question to the model.

It returns a `result` object with the following attributes:

- `code`: The PyDough query generated by the LLM.
- `full_explanation`: A detailed explanation of how the query works.
- `df`: The dataframe containing the query results.
- `exception`: Stores any errors encountered while executing the query.
- `original_question`: The natural language question input by the user.
- `sql_output`: The SQL equivalent of the generated PyDough query.
- `base_prompt`: The initial instruction given to the LLM to generate the query.
- `cheat_sheet`: A reference guide or example queries to help the LLM structure responses.
- `knowledge_graph`: The metadata structure that informs the LLM about available collections and relationships.
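As a rough mental model of the attributes listed above, the result can be pictured as a simple record. The sketch below is only an illustration: `ResultSketch` and `attribute_names` are hypothetical names, the field types are assumptions, and the real class in the `llm` module may be defined quite differently.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ResultSketch:
    """Hypothetical stand-in for the object returned by ask(); not the real class."""

    code: str = ""                    # generated PyDough query
    full_explanation: str = ""        # explanation of how the query works
    df: Optional[Any] = None          # query results (assumed to be a dataframe)
    exception: Optional[Any] = None   # assumed to be None when execution succeeds
    original_question: str = ""       # the natural language question
    sql_output: str = ""              # SQL equivalent of the PyDough query
    base_prompt: str = ""             # initial instruction given to the LLM
    cheat_sheet: str = ""             # reference guide or example queries
    knowledge_graph: Any = None       # metadata about collections and relationships


def attribute_names(result_type) -> list:
    """Return the field names of a dataclass in declaration order."""
    return [f.name for f in fields(result_type)]


names = attribute_names(ResultSketch)
```

Listing the fields this way makes it easy to check which attributes are available before reading them from a real result.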
 


### Example:

First, we give the client the question we want PyDough code for.

In [3]:
result = client.ask("Give me all the suppliers name from United States")

After that, we can inspect any of the attributes on the result.

First, we print the PyDough code together with its full explanation.

In [4]:
print(result.full_explanation)

```python
suppliers_from_usa = nations.WHERE(name == "UNITED STATES").suppliers.CALCULATE(name=name)
```



We can also print the PyDough code without the explanation.

In [5]:
print(result.code)

suppliers_from_usa = nations.WHERE(name == "UNITED STATES").suppliers.CALCULATE(name=name)


The dataframe is also available if we want to inspect, analyze, or edit the results.

In [6]:
result.df

Unnamed: 0,name
0,Supplier#000000010
1,Supplier#000000019
2,Supplier#000000046
3,Supplier#000000049
4,Supplier#000000055
...,...
388,Supplier#000009819
389,Supplier#000009829
390,Supplier#000009859
391,Supplier#000009906


The next example shows how to check for, and try to correct, a PyDough exception.

In [7]:
result = client.ask("For each of the 5 largest part sizes, find the part of that size with the largest retail price")

print(result.full_explanation)

```python
largest_parts = parts.TOP_K(5, by=size.DESC())
result = largest_parts.TOP_K(1, by=retail_price.DESC())
```



If accessing the dataframe raises an error, returns nothing, or yields an empty dataframe, the generated query may have raised a PyDough exception. We can check this by running:

In [8]:
print(result.exception)

None


You can try to fix the error using the `correct` method. We assign its output to a new variable so that the original result stays available.

In [9]:
corrected_result = client.correct(result)

To see how the model tries to solve the issue, you can print the full explanation of the `corrected_result`.

In [10]:
print(corrected_result.full_explanation)

```python
largest_parts = parts.TOP_K(5, by=size.DESC())
result = largest_parts.TOP_K(1, by=retail_price.DESC())
```



Note: You can try this as many times as you like if an exception keeps occurring.

## Test Cases

### Customer Segmentation

#### 1. Find the names of all customers and the number of orders placed in 1995 in Europe.

Demonstrates simple filtering and counting in a business-relevant setting: regional market analysis. The follow-up adds a second filtering layer on account balance and order activity, then sorts the results.

In [11]:
query = "Find the names of all customers and the number of orders placed in 1995 in Europe."

result = client.ask(query)

print(result.full_explanation)
result.df.head()

```python
customer_order_counts = customers.CALCULATE(
    customer_name=name,
    num_orders=COUNT(orders.WHERE(YEAR(order_date) == 1995))
).WHERE(nation.region.name == "EUROPE")

final_result = customer_order_counts.CALCULATE(customer_name, num_orders)
```



Unnamed: 0,customer_name,num_orders
0,Customer#000000011,1
1,Customer#000000015,0
2,Customer#000000018,0
3,Customer#000000020,3
4,Customer#000000026,2


**Follow up**: Now, give me the ones who have an account balance greater than $700 and placed at least one order in that same year. Sorted in descending order by the number of orders.

In [12]:
result = client.discourse(result,
"""Now, give me the ones who have an account balance greater than $700 and placed at least one order in that same year. 
Sorted in descending order by the number of orders.""")

print(result.full_explanation)
result.df.head()

```python
customer_order_counts = customers.CALCULATE(
    customer_name=name,
    nation_name=nation.name,
    region_name=nation.region.name,
    order_count=COUNT(orders.WHERE(YEAR(order_date) == 1995))
).WHERE(
    (nation.region.name == "EUROPE") & (acctbal > 700) & (COUNT(orders.WHERE(YEAR(order_date) == 1995)) >= 1)
).ORDER_BY(order_count.DESC())

final_result = customer_order_counts.CALCULATE(customer_name, order_count)
```



Unnamed: 0,customer_name,order_count
0,Customer#000107440,12
1,Customer#000014920,11
2,Customer#000079606,11
3,Customer#000108496,11
4,Customer#000009019,10


#### 2. List customers who ordered in 1996 but not in 1997, with a total spent of over $1000.

Intended to showcase PyDough’s `HAS()` and `HASNOT()` functions, helping analyze customer retention and spending patterns. The follow-up adds a time-based calculation.

In [15]:
query = "List customers who ordered in 1996 but not in 1997 with a total spent of over 1000$?"

result = client.ask(query)

print(result.full_explanation)
result.df.head()

```python
customers_1996 = customers.CALCULATE(
    customer_key=key,
    name=name,
    total_spent_1996=SUM(orders.WHERE(YEAR(order_date) == 1996).total_price)
)
customers_1997 = customers.CALCULATE(
    customer_key=key,
    name=name,
    total_spent_1997=SUM(orders.WHERE(YEAR(order_date) == 1997).total_price)
)

customers_ordered_1996_not_1997 = customers_1996.WHERE(
    (total_spent_1996 > 1000) & (customers_1997.total_spent_1997 == 0)
).CALCULATE(
    customer_name=name,
    total_spent_1996=total_spent_1996
)
```



AttributeError: 'NoneType' object has no attribute 'head'
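The `AttributeError` above occurs because `result.df` is `None`, so calling `.head()` on it fails. A small guard avoids the crash and surfaces the stored exception instead. This is a minimal sketch: `preview` is a hypothetical helper, not part of the client, and the stub object only mimics the two attributes it reads.

```python
from types import SimpleNamespace


def preview(result, rows=5):
    """Return the first rows of result.df, or None if no dataframe was produced."""
    if result.df is None:
        # Report why there is nothing to show instead of raising AttributeError.
        print(f"No dataframe returned. Exception: {result.exception}")
        return None
    return result.df.head(rows)


# Stub standing in for a failed result (hypothetical values).
failed = SimpleNamespace(df=None, exception="example error")
shown = preview(failed)
```

With a real result, `preview(result)` can replace `result.df.head()` in the cells of this notebook.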

**Follow up**: Include the number of months since the last order and sort by total spent, highest first.

In [None]:
result = client.discourse(result,
"Include the number of months since the last order and sort by total spent, highest first.")

print(result.full_explanation)
result.df.head()

### Sales Performance

#### 3. Find the region name with the highest total order value in 1996.

The total order value is defined as potential revenue: the sum of `extended_price * (1 - discount)`.

This example places an explicit calculation inside the question, so the revenue figure follows a stated definition.
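To make the revenue definition concrete, here is the same arithmetic in plain Python. The line items are made-up numbers for illustration only, not values from the dataset, and `potential_revenue` is a hypothetical helper name.

```python
def potential_revenue(line_items):
    """Sum extended_price * (1 - discount) over (extended_price, discount) pairs."""
    return sum(price * (1 - discount) for price, discount in line_items)


# Illustrative (extended_price, discount) pairs; not real data.
sample_lines = [(1000.0, 0.05), (2500.0, 0.10), (400.0, 0.0)]

# 950 + 2250 + 400, so the total is about 3600.
total = potential_revenue(sample_lines)
```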

In [None]:
query = """Find the region name with the highest total order value in 1996.
The total order value is defined as potential revenue, defined as the sum of extended_price * (1 - discount)"""

result = client.ask(query)

print(result.full_explanation)
result.df.head()

### Product Trends

#### 4. Which 10 customers purchased the highest quantity of products during 1998?

Highlights ranking queries (TOP_K()), customer segmentation, and purchasing trends. 

In [None]:
query = "Which 10 customers purchased the highest quantity of products during 1998?"

result = client.ask(query)

print(result.full_explanation)
result.df.head()

**Follow up**: Now only the ones that have "green" in the product name.

In [None]:
result = client.discourse(result, "Only the ones that have 'green' in the product name.")

print(result.full_explanation)
result.df.head()

### Revenue Performance

#### 5. What is the February 1996 SPM for the almond antique blue royal burnished part in China?

SPM (Selling Profit Margin) = (Total Amount from Sells - (Tax + Commission)) / Total Amount from Sells * 100
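The SPM formula can be checked by hand with a few lines of Python. The figures below are invented for illustration and the function name is hypothetical. The zero check is an added assumption, since the formula is undefined when nothing was sold.

```python
def selling_profit_margin(total_sales, tax, commission):
    """SPM = (total sales - (tax + commission)) / total sales * 100."""
    if total_sales == 0:
        raise ValueError("SPM is undefined when total sales are zero")
    return (total_sales - (tax + commission)) / total_sales * 100


# Illustrative amounts: (2000 - 160) / 2000 * 100 is about 92.
spm = selling_profit_margin(total_sales=2000.0, tax=100.0, commission=60.0)
```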

This query was provided as a representative example of potential stakeholder inquiries.

Showcases advanced partitioning and filtering, demonstrating how PyDough can be used for highly specific business KPIs. The follow-ups compare the result with a previous time period and then exclude or include specific suppliers, making it a progressive data exploration example.

In [None]:
query = """What is the february 1996 SPM for the almond antique blue royal burnished part in China?
SPM (Selling Profit Margin) = (Total Amount from Sells - (Tax + Commission)) / Total Amount from Sells * 100"""

result = client.ask(query)

print(result.full_explanation)
result.df.head()

**Follow up**: Compare that to November 1995 SPM, have we seen an increase?

In [None]:
# Start the follow-up on a new line so it does not run into the previous text.
query += "\nCompare that to November 1995 SPM, have we seen an increase?"

result = client.ask(query)

print(result.full_explanation)
result.df.head()
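The follow-up cells in this section build on the previous question by appending text to `query` and calling `ask()` again. When concatenating this way, the pieces need an explicit separator, otherwise the follow-up runs straight into the end of the previous text. A minimal sketch of a helper for this, where `extend_question` is a hypothetical name:

```python
def extend_question(question, follow_up, separator="\n"):
    """Append a follow-up to a question, keeping exactly one separator between them."""
    return question.rstrip() + separator + follow_up.strip()


# Shortened example strings, for illustration only.
base = "What is the SPM? SPM = (total - costs) / total * 100"
combined = extend_question(base, "Compare that to the previous period.")
```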

**Follow up**: Now exclude supplier Supplier#000001305 and focus only on supplier Supplier#000008802

In [None]:
# Start the follow-up on a new line so it does not run into the previous text.
query += "\nNow exclude supplier Supplier#000001305 and focus only on supplier Supplier#000008802"

result = client.ask(query)

print(result.full_explanation)
result.df.head()