# PyDough LLM Demo

This notebook showcases how an LLM can generate PyDough queries from natural language instructions. The goal is to demonstrate how AI can automate complex data analysis, making querying faster, more intuitive, and accessible without needing deep technical expertise.

Each example highlights different capabilities, including aggregations, filtering, ranking, and calculations across multiple collections.

## Setup

First, we import the `LLMClient` class.

In [None]:
from llm import LLMClient

Then, we define the `provider` and `model` as variables and initialize the client with the selected values. 

These can be adjusted as needed.

In [None]:
provider = "aws"
model = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

client = LLMClient(provider, model)

Use the `ask()` method to send a question to the model.

It returns a `result` object with the following attributes:

- `pydough_code`: The PyDough query generated by the LLM.
- `full_explanation`: A detailed explanation of how the query works.
- `df`: The dataframe containing the query results.
- `exception`: Stores any errors encountered while executing the query.
- `original_question`: The natural language question input by the user.
- `sql_output`: The SQL equivalent of the generated PyDough query.
- `base_prompt`: The initial instruction given to the LLM to generate the query.
- `cheat_sheet`: A reference guide or example queries to help the LLM structure responses.
- `knowledge_graph`: The metadata structure that informs the LLM about available collections and relationships.
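
Since `exception` stores any execution error, it is worth checking it before reading `df`. The snippet below is a minimal sketch, assuming `exception` is `None` when the query ran successfully (the notebook does not state this). It uses hypothetical stand-in objects in place of a real `client.ask()` result so that it runs on its own:

```python
from types import SimpleNamespace


def summarize(result):
    """Return a short status string for a result-like object.

    Assumption: `exception` is None when the generated query executed
    without errors; the real client may signal success differently.
    """
    if result.exception is not None:
        return f"Query failed: {result.exception}"
    return f"Query succeeded with {len(result.df)} rows"


# Hypothetical stand-ins for objects returned by client.ask();
# a plain list plays the role of the dataframe here.
ok = SimpleNamespace(exception=None, df=["Supplier#000000010", "Supplier#000000019"])
failed = SimpleNamespace(exception=ValueError("unknown collection"), df=None)

print(summarize(ok))
print(summarize(failed))
```

With a real client, the same check would be applied to the object returned by `client.ask(...)`.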
 
Example:

In [8]:
result = client.ask("Give me all the suppliers name from United States")

print(result.full_explanation)

print(result.df.head())

I'll create a PyDough code snippet to retrieve all supplier names from the United States.

First, I need to identify suppliers whose nation is the United States, then return just their names.

```python
# Find all suppliers from the United States
us_suppliers = suppliers.WHERE(nation.name == "UNITED STATES").CALCULATE(
    supplier_name=name
)
```

This query:
1. Starts with the suppliers collection
2. Filters to only include suppliers where the associated nation's name is "UNITED STATES"
3. Returns only the supplier name field (renamed to supplier_name for clarity)
        supplier_name
0  Supplier#000000010
1  Supplier#000000019
2  Supplier#000000046
3  Supplier#000000049
4  Supplier#000000055


## Test Cases

### Customer Segmentation

#### 1. Find the names of all customers and the number of orders placed in 1995 in Europe.

This example demonstrates simple filtering and counting in a business-relevant setting: regional market analysis. The follow-up adds a second filtering layer (account balance and order activity) and sorts the results.

In [None]:
query = "Find the names of all customers and the number of orders placed in 1995 in Europe."

result = client.ask(query)

print(result.full_explanation)
print(result.df)

**Follow up**: Now, give me the ones who have an account balance greater than $700 and placed at least one order in that same year. Sorted in descending order by the number of orders.

In [None]:
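query += " Now, give me the ones who have an account balance greater than $700 and placed at least one order in that same year. Sorted in descending order by the number of orders."

result = client.ask(query)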

print(result.full_explanation)
print(result.df)
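
The follow-ups in this notebook are sent by appending text to the previous question and calling `ask()` again with the combined string. A small helper can keep the pieces separated by a single space. `with_follow_up` is a hypothetical name used for illustration, not part of `LLMClient`:

```python
def with_follow_up(query: str, follow_up: str) -> str:
    """Append a follow-up to a question, separated by exactly one space."""
    return f"{query.rstrip()} {follow_up.lstrip()}"


query = "Find the names of all customers and the number of orders placed in 1995 in Europe."
query = with_follow_up(query, "Sort them in descending order by the number of orders.")
print(query)
```

The combined string can then be passed to `client.ask(query)` as in the cells above.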

#### 2. List customers who ordered in 1996 but not in 1997, with a total spent of over 1000$?

This example showcases PyDough's `HAS()` and `HASNOT()` functions, which help analyze customer retention and spending patterns. The follow-up adds a time-based calculation.

In [None]:
query = "List customers who ordered in 1996 but not in 1997 with a total spent of over 1000$?"

result = client.ask(query)

print(result.full_explanation)
print(result.df)
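
The condition behind this question can be sketched in plain Python, independent of how the LLM expresses it in PyDough. This is not PyDough code and the data is made up. The question does not say which orders count toward "total spent", so the sketch assumes all of a customer's orders:

```python
# Made-up orders: (customer, year, amount). Not real data.
orders = [
    ("alice", 1996, 800.0),
    ("alice", 1996, 400.0),
    ("bob", 1996, 1500.0),
    ("bob", 1997, 300.0),
    ("carol", 1996, 200.0),
]

years = {}  # customer -> set of years with at least one order
spent = {}  # customer -> total amount over all orders (an assumption)
for customer, year, amount in orders:
    years.setdefault(customer, set()).add(year)
    spent[customer] = spent.get(customer, 0.0) + amount

# Keep customers with an order in 1996, none in 1997, and more than 1000 spent.
selected = sorted(
    c for c in years
    if 1996 in years[c] and 1997 not in years[c] and spent[c] > 1000
)
print(selected)
```

Here only `alice` qualifies: `bob` also ordered in 1997, and `carol` spent too little.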

**Follow up**: Include the number of months since the last order and sort by total spent, highest first.

In [None]:
query += " Include the number of months since the last order and sort by total spent, highest first."

result = client.ask(query)

print(result.full_explanation)
print(result.df)

### Sales Performance

#### 3. Find the region name with the highest total order value in 1996.

The total order value is defined here as potential revenue, that is, the sum of `extended_price * (1 - discount)`.

This example embeds a precise calculation in the question itself, so the revenue figure follows the stated definition.

In [None]:
query = """Find the region name with the highest total order value in 1996. 
The total order value is defined as potential revenue, defined as the sum of extended_price * (1 - discount)"""

result = client.ask(query)

print(result.full_explanation)
print(result.df)
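
To make the revenue definition concrete, here is the same arithmetic in plain Python on made-up line items. The prices and discounts are illustrative and not taken from the dataset:

```python
# Illustrative (extended_price, discount) pairs; not real data.
line_items = [
    (1000.0, 0.10),  # 1000 * (1 - 0.10)
    (500.0, 0.00),   # 500 * (1 - 0.00)
    (200.0, 0.25),   # 200 * (1 - 0.25)
]

# Potential revenue: sum of extended_price * (1 - discount).
total_order_value = sum(price * (1 - discount) for price, discount in line_items)
print(total_order_value)
```

The generated query is expected to apply this sum per region and then pick the region with the largest total.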

### Product Trends

#### 4. Which 10 customers purchased the highest quantity of products during 1998?

This example highlights ranking queries (`TOP_K()`), customer segmentation, and purchasing trends.

In [None]:
query = "Which 10 customers purchased the highest quantity of products during 1998?"

result = client.ask(query)

print(result.full_explanation)
print(result.df)

**Follow up**: Now only the ones that have "green" in the product name.

In [None]:
query += " Only the ones that have 'green' in the product name."

result = client.ask(query)

print(result.full_explanation)
print(result.df)

### Revenue Performance

#### 5. What is the February 1996 SPM for the almond antique blue royal burnished part in China?

SPM (Selling Profit Margin) = (Total Amount from Sales - (Tax + Commission)) / Total Amount from Sales * 100

This query was provided as a representative example of potential stakeholder inquiries.

This example showcases advanced partitioning and filtering, demonstrating how PyDough can be used for highly specific business KPIs. The follow-ups compare the result with a previous time period and then exclude or include specific suppliers, making it a progressive data exploration example.
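
As a quick check of the SPM formula above, the arithmetic can be written as a small Python function. The function name and the input figures are made up for illustration and do not come from the dataset:

```python
def selling_profit_margin(total_sales: float, tax: float, commission: float) -> float:
    """SPM as a percentage: (sales - (tax + commission)) / sales * 100."""
    if total_sales == 0:
        raise ValueError("SPM is undefined when total sales are zero")
    return (total_sales - (tax + commission)) / total_sales * 100


# Illustrative figures only.
spm = selling_profit_margin(total_sales=2000.0, tax=100.0, commission=150.0)
print(spm)
```

Computing the value by hand like this gives a reference point for sanity-checking the number returned by the generated query.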

In [None]:
query = "What is the February 1996 SPM for the almond antique blue royal burnished part in China? SPM (Selling Profit Margin) = (Total Amount from Sales - (Tax + Commission)) / Total Amount from Sales * 100"

result = client.ask(query)

print(result.full_explanation)
print(result.df)

**Follow up**: Compare that to November 1995 SPM, have we seen an increase?

In [None]:
query += " Compare that to November 1995 SPM, have we seen an increase?"

result = client.ask(query)

print(result.full_explanation)
print(result.df)

**Follow up**: Now exclude supplier Supplier#000001305 and focus only on supplier Supplier#000008802

In [None]:
query += " Now exclude supplier Supplier#000001305 and focus only on supplier Supplier#000008802"

result = client.ask(query)

print(result.full_explanation)
print(result.df)