# PyDough LLM Demo

This notebook showcases how an LLM can generate PyDough queries from natural language instructions. The goal is to demonstrate how AI can automate complex data analysis, making querying faster, more intuitive, and accessible without needing deep technical expertise.

Each example highlights different capabilities, including aggregations, filtering, ranking, and calculations across multiple collections.

## Setup

First, we import the `LLMClient` class from the `llm` module.

In [6]:
from llm import LLMClient

Then, we define the `provider` and `model` as variables and initialize the client with the selected values. 

These can be adjusted as needed.

In [7]:
provider = "aws"
model = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

client = LLMClient(provider, model)

Use the `ask()` method to send a question to the model.

It returns a `result` object with the following attributes:

- `pydough_code`: The PyDough query generated by the LLM.
- `full_explanation`: A detailed explanation of how the query works.
- `df`: The dataframe containing the query results.
- `exception`: Stores any errors encountered while executing the query.
- `original_question`: The natural language question input by the user.
- `sql_output`: The SQL equivalent of the generated PyDough query.
- `base_prompt`: The initial instruction given to the LLM to generate the query.
- `cheat_sheet`: A reference guide or example queries to help the LLM structure responses.
- `knowledge_graph`: The metadata structure that informs the LLM about available collections and relationships.
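
As a rough sketch of how these attributes might be consumed, the helper below checks `exception` before relying on anything else. The function `summarize_result` and the `SimpleNamespace` stand-in are hypothetical and only mimic the attribute names listed above; a real `result` comes from `client.ask()`. The sketch also assumes that `exception` is `None` when the query succeeds, which the notebook does not state.

```python
from types import SimpleNamespace


def summarize_result(result):
    """Return a short status string for a result-like object.

    Hypothetical helper: it relies only on the attribute names listed above
    and assumes `exception` is None when the query ran without errors.
    """
    if result.exception is not None:
        # The query failed, so `df` should not be trusted.
        return f"FAILED: {result.original_question!r} -> {result.exception}"
    return f"OK: {result.original_question!r}\n{result.pydough_code}"


# Stand-in object for illustration only (not produced by LLMClient).
fake_result = SimpleNamespace(
    original_question="How many nations are there?",
    pydough_code="<generated PyDough code>",
    exception=None,
    df=None,
)
print(summarize_result(fake_result))
```

Checking `exception` first avoids printing an empty or stale `df` when the generated query could not be executed.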
 
For example, the following cell asks a question and prints the explanation:

In [3]:
result = client.ask("For each of the 5 largest part sizes, find the part of that size with the largest retail price")

print(result.full_explanation)

I'll create a PyDough code snippet to find the part with the largest retail price for each of the 5 largest part sizes.

First, I need to:
1. Get the 5 largest distinct part sizes
2. For each of those sizes, find the part with the highest retail price

```python
# Step 1: Calculate the 5 largest distinct part sizes
largest_sizes = parts.CALCULATE(size).ORDER_BY(size.DESC()).TOP_K(5, by=size.DESC())

# Step 2: For each of those sizes, find the part with the highest retail price
result = parts.WHERE(ISIN(size, largest_sizes.size)).CALCULATE(
    part_key=key,
    part_name=name,
    size=size,
    retail_price=retail_price,
    size_rank=RANKING(by=size.DESC(), levels=0),
    price_rank_within_size=RANKING(by=retail_price.DESC(), levels=1)
).WHERE(price_rank_within_size == 1).ORDER_BY(size.DESC())
```

This code works by:
1. First identifying the 5 largest distinct part sizes by ordering parts by size in descending order and taking the top 5
2. Then finding all parts that have those size

## Test Cases

### Customer Segmentation

#### 1. Find the names of all customers and the number of orders placed in 1995 in Europe.

In [8]:
query = "Find the names of all customers and the number of orders placed in 1995 in Europe."

result = client.ask(query)

print(result.full_explanation)

ERROR WHILE EXECUTING QUERY:
SELECT
  name AS customer_name,
  COALESCE(agg_0, 0) AS orders_in_1995
FROM (
  SELECT
    agg_0,
    name
  FROM (
    SELECT
      key,
      name
    FROM (
      SELECT
        _table_alias_2.key AS key,
        name,
        name_3
      FROM (
        SELECT
          c_custkey AS key,
          c_name AS name,
          c_nationkey AS nation_key
        FROM main.CUSTOMER
      ) AS _table_alias_2
      LEFT JOIN (
        SELECT
          _table_alias_0.key AS key,
          name AS name_3
        FROM (
          SELECT
            n_nationkey AS key,
            n_regionkey AS region_key
          FROM main.NATION
        ) AS _table_alias_0
        INNER JOIN (
          SELECT
            r_regionkey AS key,
            r_name AS name
          FROM main.REGION
        ) AS _table_alias_1
          ON region_key = _table_alias_1.key
      ) AS _table_alias_3
        ON nation_key = _table_alias_3.key
    )
    WHERE
      name_3 = 'EUROPE'
  )
 

**Follow up**: Now, give me the ones who have an account balance greater than $700 and placed at least one order in that same year.

In [None]:
# Leading space keeps the follow-up separate from the previous sentence.
query += " Who have an account balance greater than $700 and placed at least one order in 1995."

result = client.ask(query)

print(result.full_explanation)
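
The follow-up cells in this notebook extend the previous prompt by appending text to `query`. When strings are concatenated this way, adjacent sentences need a separating space or they run together. A small helper along these lines could handle that; `add_follow_up` is a hypothetical name introduced here for illustration and is not part of `LLMClient`.

```python
def add_follow_up(query, follow_up):
    """Append a follow-up sentence to a prompt, separated by a single space."""
    return f"{query.rstrip()} {follow_up.lstrip()}"


base = "Find the names of all customers and the number of orders placed in 1995 in Europe."
extended = add_follow_up(
    base,
    "Who have an account balance greater than $700 and placed at least one order in 1995.",
)
print(extended)
```

The resulting string can then be passed to `client.ask()` in the same way as the original question.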

#### 2. List customers who ordered in 1996 but not in 1997, with total spending over $1000.

In [None]:
query = "List customers who ordered in 1996 but not in 1997 with a total spent of over 1000$?"

result = client.ask(query)

print(result.df)

**Follow up**: Include the number of months since the last order and sort by total spent, highest first.

In [None]:
# Leading space keeps the follow-up separate from the previous sentence.
query += " Include the number of months since the last order and sort by total spent, highest first."

result = client.ask(query)

print(result.full_explanation) 

### Sales Performance

#### 3. Find the region name with the highest total order value in 1996.

Here, total order value means potential revenue: the sum of `extended_price * (1 - discount)`.
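
To make the formula concrete, here is a small worked sketch in plain Python. The line items are made-up numbers chosen for illustration, not rows from the dataset, and `potential_revenue` is a hypothetical helper rather than anything PyDough provides.

```python
# Illustrative line items only; these values are invented, not query results.
line_items = [
    {"extended_price": 100.0, "discount": 0.10},
    {"extended_price": 250.0, "discount": 0.00},
    {"extended_price": 80.0, "discount": 0.25},
]


def potential_revenue(items):
    """Sum extended_price * (1 - discount) over all line items."""
    return sum(item["extended_price"] * (1 - item["discount"]) for item in items)


# 90 + 250 + 60, i.e. roughly 400 up to floating-point rounding.
total = potential_revenue(line_items)
print(total)
```

The generated PyDough query is expected to perform the same aggregation per region before picking the highest total.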

In [None]:
query = "Find the region name with the highest total order value in 1996. The total order value is defined as potential revenue, defined as the sum of extended_price * (1 - discount)"

result = client.ask(query)

print(result.full_explanation)

### Product Trends

#### 4. Which 10 customers purchased the highest quantity of products during 1998?

In [None]:
query = "Which 10 customers purchased the highest quantity of products during 1998?"

result = client.ask(query)

print(result.full_explanation)

**Follow up**: Now only the ones that have "green" in the product name.

In [None]:
# Leading space keeps the follow-up separate from the previous sentence.
query += " Only the ones that have 'green' on the product name."

result = client.ask(query)

print(result.full_explanation)

### Revenue Performance

#### 5. What is the February 1996 SPM for the almond antique blue royal burnished part in China?

SPM (Selling Profit Margin) = (Total Amount from Sells - (Tax + Commission)) / Total Amount from Sells * 100
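
The SPM definition can be sketched as a plain Python function. The name `selling_profit_margin`, its parameter names, and the sample numbers are assumptions made for illustration. How the LLM maps "Total Amount from Sells", tax, and commission onto actual columns is left to the generated query.

```python
def selling_profit_margin(total_sales, tax, commission):
    """Compute SPM = (total_sales - (tax + commission)) / total_sales * 100."""
    if total_sales == 0:
        # The ratio is undefined without any sales.
        raise ValueError("SPM is undefined when total sales are zero")
    return (total_sales - (tax + commission)) / total_sales * 100


# Made-up figures: 1000 in sales, 50 in tax, 30 in commission.
spm = selling_profit_margin(total_sales=1000.0, tax=50.0, commission=30.0)
print(f"SPM: {spm:.2f}%")
```

Guarding against zero sales matters here because a part may have no sales at all in a given month and country.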

In [None]:
query = "What is the february 1996 SPM for the almond antique blue royal burnished part in China? SPM (Selling Profit Margin) = (Total Amount from Sells - (Tax + Commission)) / Total Amount from Sells * 100"

result = client.ask(query)

print(result.full_explanation)

**Follow up**: Compare that to the November 1995 SPM. Have we seen an increase?

In [None]:
# Leading space keeps the follow-up separate from the previous sentence.
query += " Compare that to november 1995 SPM, have we seen an increase?"

result = client.ask(query)

print(result.full_explanation)

**Follow up**: Now exclude supplier Supplier#000001305 and focus only on supplier Supplier#000008802

In [None]:
# Leading space keeps the follow-up separate from the previous sentence.
query += " Now exclude supplier Supplier#000001305 and focus only on supplier Supplier#000008802"

result = client.ask(query)

print(result.full_explanation)


Taken together, the test cases above progressively introduce PyDough features:

- Queries 1-2: Basic filtering, aggregation, and ranking.
- Queries 3-4: More complex filtering and calculations across years.
- Query 5: Advanced partitioning and comparisons.

They also represent real-world business cases, matching the section headings above:

- Customer segmentation (Queries 1 and 2).
- Sales and revenue performance (Queries 3 and 5).
- Product trend analysis (Query 4).
