In [1]:
%load_ext pydough.jupyter_extensions

import pydough
import datetime
import pandas as pd
# Setup demo metadata
pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH");
pydough.active_session.connect_database("sqlite", database="../../tpch.db");
pd.options.display.float_format = '{:.6f}'.format

# Tutorial: From SQL to PyDough – A Step-by-Step Guide

This tutorial shows how to express queries you would typically write in SQL using PyDough. We focus on the Customers, Orders, Nations, and Regions tables from the TPC-H database. For each query, we first show the SQL version and then convert it into its PyDough equivalent, explaining each step.



### 1. Get All Nations

In SQL, the query to retrieve all nations would look like this:

```SQL
SELECT * FROM nation; 
```


In PyDough, we can perform the same operation using the nations collection.
A collection in PyDough is an abstraction over any set of "documents" (records), and in most cases it represents a table. To access the nation table, we use the corresponding PyDough collection and list the properties we want to retrieve.

In [7]:
%%pydough

all_nations = nations(key, name, comment)

pydough.to_df(all_nations)


Unnamed: 0,key,name,comment
0,0,ALGERIA,haggle. carefully final deposits detect slyly...
1,1,ARGENTINA,al foxes promise slyly according to the regula...
2,2,BRAZIL,y alongside of the pending deposits. carefully...
3,3,CANADA,"eas hang ironic, silent packages. slyly regula..."
4,4,EGYPT,y above the carefully unusual theodolites. fin...
5,5,ETHIOPIA,ven packages wake quickly. regu
6,6,FRANCE,"refully final requests. regular, ironi"
7,7,GERMANY,"l platelets. regular accounts x-ray: unusual, ..."
8,8,INDIA,ss excuses cajole slyly across the packages. d...
9,9,INDONESIA,slyly express asymptotes. regular deposits ha...


nations(...): Refers to the nations collection (similar to the nation table in SQL). The arguments select which properties appear in the result.


### 2. Find Nations Whose Name Starts with "A"
Next, we’ll use WHERE and LIKE to filter nations whose names start with the letter "A".

```SQL
SELECT N_NAME, N_COMMENT
FROM nation
WHERE N_NAME LIKE 'A%';
```

In PyDough, the WHERE operation plays the same role as the WHERE clause in SQL: it filters unwanted records out of a context. To match a prefix, you can use either the STARTSWITH() function or LIKE(), which uses the same pattern syntax as SQL. Here's how you would do it:

In [None]:
%%pydough

# PyDough equivalent using STARTSWITH
nations_startwith = nations(n_name=name, n_comment=comment).WHERE(STARTSWITH(name, 'A'))

# PyDough equivalent using LIKE
nations_like = nations(n_name=name, n_comment=comment).WHERE(LIKE(name, 'A%'))

print(pydough.to_df(nations_startwith))
pydough.to_df(nations_like)
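
The two filters above should select the same rows, because the SQL pattern `'A%'` means "an A followed by anything". The sketch below checks this on a handful of nation names using only Python's standard library. The `like_to_regex` helper is hypothetical, written for illustration only; it is not part of PyDough and handles just the `%` and `_` wildcards.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern (% and _ wildcards only) into a regex.

    Hypothetical helper for illustration: it ignores ESCAPE clauses and
    matches case-sensitively.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # % matches any run of characters
        elif ch == "_":
            parts.append(".")  # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


nation_names = ["ALGERIA", "ARGENTINA", "BRAZIL", "CANADA", "EGYPT"]

# Prefix test, in the spirit of STARTSWITH(name, 'A')
by_prefix = [n for n in nation_names if n.startswith("A")]

# Pattern test, in the spirit of LIKE(name, 'A%')
by_like = [n for n in nation_names if like_to_regex("A%").fullmatch(n)]

print(by_prefix)
print(by_like)
```

One caveat: SQLite's own LIKE is case-insensitive for ASCII letters by default, whereas this sketch is case-sensitive, so the two can differ on lowercase data.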

### 3. Filter Customers by Nation

The next situation involves a client seeking to identify all customers from a specific country, Peru.

SQL example: 

```SQL
SELECT C.C_NAME
FROM customer C
JOIN nation N
ON C.C_NATIONKEY = N.N_NATIONKEY
WHERE N.N_NAME = 'PERU';
```

In PyDough, the query starts from the customers collection. Instead of writing an explicit join, it follows the nation relationship defined in the metadata and filters on the nation's name inside WHERE, keeping only customers from Peru. The result is a list of Peruvian customer names.

Navigating these paths between collections is how PyDough combines data from multiple tables, without requiring explicit joins in the code.

In [43]:
%%pydough

peru_customers = customers.WHERE(nation.name == "PERU")

customers_from_peru = peru_customers(customer_name=name)

pydough.to_df(customers_from_peru)

Unnamed: 0,customer_name
0,Customer#000000008
1,Customer#000000033
2,Customer#000000035
3,Customer#000000061
4,Customer#000000077
...,...
5970,Customer#000149914
5971,Customer#000149928
5972,Customer#000149939
5973,Customer#000149948


### 4. Get a specific customer

The next situation involves analyzing a specific set of customers based on several conditions. The goal is to identify customers who are in debt, meaning their account balance is negative, and who have placed at least 5 orders. Additionally, the focus is on customers from the AMERICA region, excluding those from Brazil.

This query checks whether a customer's account balance is negative, whether they have placed at least 5 orders, whether their region is "AMERICA", and whether their nation is not Brazil. The WHERE clause combines these conditions with & (AND), so all of them must be true for a customer to be included in the results.

PyDough does not support Python's and, or, not, and in keywords, nor chained comparisons such as (1 < x < 5). Use the &, |, and ~ operators and the ISIN function instead.

In [42]:
%%pydough

customer_in_debt= customers(
    name
).WHERE(
    (acctbal < 0) &
    (COUNT(orders) >= 5) &
    (nation.region.name == "AMERICA") &
    (nation.name != "BRAZIL"))

pydough.to_df(customer_in_debt)

Unnamed: 0,name
0,Customer#000000064
1,Customer#000000478
2,Customer#000000488
3,Customer#000000632
4,Customer#000000872
...,...
1441,Customer#000149812
1442,Customer#000149815
1443,Customer#000149831
1444,Customer#000149890


### 5. Find the total number of orders per customer placed in 1998

The next situation consists of analyzing customer order activity in 1998. The query retrieves each customer's key and name and counts how many orders that customer placed in 1998. The goal is to identify the most active customers that year, so the results are sorted by the number of orders in descending order.

```SQL
SELECT c.c_custkey, c.c_name, COUNT(o.o_orderkey) AS total_orders
FROM customer c
LEFT JOIN orders o
    ON c.c_custkey = o.o_custkey
    AND strftime('%Y', o.o_orderdate) = '1998'
GROUP BY c.c_custkey, c.c_name
ORDER BY total_orders DESC;
```

In this query, we introduce ORDER_BY. In PyDough, the ORDER_BY operation works like ORDER BY in SQL. The query applies COUNT to the orders sub-collection, filtered with WHERE to the orders placed in 1998, and then sorts the results by that count in descending order so that the most active customers appear first. Customers with no orders in 1998 are kept with a count of 0.

Here's how you would perform this task in PyDough:

In [7]:
%%pydough

customer_order_counts  = customers(
    key,
    name,
    # Get the total number of orders placed in 1998 by customer
    num_orders=COUNT(
        orders.WHERE(YEAR(order_date) == 1998)
    ),
).ORDER_BY(num_orders.DESC())
pydough.to_df(customer_order_counts )


Unnamed: 0,key,name,num_orders
0,11719,Customer#000011719,9
1,93778,Customer#000093778,9
2,102295,Customer#000102295,9
3,111394,Customer#000111394,9
4,4789,Customer#000004789,8
...,...,...,...
149995,149991,Customer#000149991,0
149996,149993,Customer#000149993,0
149997,149994,Customer#000149994,0
149998,149997,Customer#000149997,0
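
The PyDough result above keeps customers whose count is 0. In SQL, that corresponds to a LEFT JOIN with the year filter in the ON clause; an inner JOIN, or a year filter placed in WHERE, would drop those customers. The sketch below reproduces both behaviours with Python's built-in `sqlite3` module on a three-customer toy dataset. The rows are made up for illustration, and the LEFT JOIN form is one way to get the zero-count rows, not necessarily the SQL that PyDough generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders (
        o_orderkey INTEGER PRIMARY KEY,
        o_custkey INTEGER,
        o_orderdate TEXT
    );
    INSERT INTO customer VALUES (1, 'Customer#1'), (2, 'Customer#2'), (3, 'Customer#3');
    INSERT INTO orders VALUES
        (10, 1, '1998-02-01'),
        (11, 1, '1998-05-17'),
        (12, 2, '1997-11-30');
    """
)

# Inner join with the filter in WHERE: customers without 1998 orders vanish.
inner_join = conn.execute(
    """
    SELECT c.c_custkey, COUNT(o.o_orderkey) AS total_orders
    FROM customer c
    JOIN orders o ON c.c_custkey = o.o_custkey
    WHERE strftime('%Y', o.o_orderdate) = '1998'
    GROUP BY c.c_custkey
    ORDER BY total_orders DESC, c.c_custkey
    """
).fetchall()

# Left join with the filter in ON: every customer is kept, possibly with 0.
left_join = conn.execute(
    """
    SELECT c.c_custkey, COUNT(o.o_orderkey) AS total_orders
    FROM customer c
    LEFT JOIN orders o
        ON c.c_custkey = o.o_custkey
        AND strftime('%Y', o.o_orderdate) = '1998'
    GROUP BY c.c_custkey
    ORDER BY total_orders DESC, c.c_custkey
    """
).fetchall()

print(inner_join)
print(left_join)
conn.close()
```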


The cell below solves the same query in a different way, filtering on a date range instead of extracting the year. It keeps only the orders whose order_date falls between January 1, 1998 (inclusive) and January 1, 1999 (exclusive). The filtered orders expression is defined once and then counted per customer; key and name are simply the customer properties included in the result.

In [24]:
%%pydough

# Filter orders placed in 1998
orders_in_1998 = orders.WHERE(
    (order_date >= datetime.date(1998, 1, 1)) 
    & (order_date < datetime.date(1999, 1, 1))
)

# Retrieve customer information along with the total number of orders placed in 1998 for each customer
customer_order_summary = customers(
    key,
    name,
    total_orders_in_1998=COUNT(orders_in_1998),
).ORDER_BY(total_orders_in_1998.DESC())

pydough.to_df(customer_order_summary)

Unnamed: 0,key,name,total_orders_in_1998
0,11719,Customer#000011719,9
1,93778,Customer#000093778,9
2,102295,Customer#000102295,9
3,111394,Customer#000111394,9
4,4789,Customer#000004789,8
...,...,...,...
149995,149991,Customer#000149991,0
149996,149993,Customer#000149993,0
149997,149994,Customer#000149994,0
149998,149997,Customer#000149997,0


### 6. Count the Number of Orders in Each Nation

This query counts the number of orders placed in each nation, showing them in descending order of count.
```SQL

SELECT n.n_name, COUNT(o.o_orderkey) AS order_count
FROM nation n
JOIN customer c ON n.n_nationkey = c.c_nationkey
JOIN orders o ON c.c_custkey = o.o_custkey
GROUP BY n.n_name
ORDER BY order_count DESC;
```


In PyDough, we follow the metadata relationships and use the PARTITION operation, which is the equivalent of GROUP BY in SQL. The by argument specifies the grouping keys, and the name argument gives a name to the grouped records so that they can be referenced in aggregations such as COUNT.

In [23]:
%%pydough

# For each order, step to its customer and attach the name of the customer's nation
orders_by_nation = orders.customer(
    nation_name=nation.name  # The name of the nation the customer belongs to
)

# Partition the records by nation and count the total orders in each nation
grouped_orders_by_nation = PARTITION(
    orders_by_nation, name="order", by=(nation_name)  # Group by 'nation_name'
)(
    nation_name=nation_name,  # The name of the nation for which orders are grouped
    total_orders_in_nation=COUNT(order)  # Counts the total number of orders in each nation
).ORDER_BY(total_orders_in_nation.DESC())  # Orders the result by the total number of orders, descending

pydough.to_df(grouped_orders_by_nation)

Unnamed: 0,nation_name,total_orders_in_nation
0,FRANCE,61600
1,RUSSIA,61495
2,INDONESIA,61377
3,MOZAMBIQUE,61267
4,ROMANIA,61012
5,CHINA,60784
6,JORDAN,60736
7,CANADA,60480
8,VIETNAM,60347
9,BRAZIL,60137


### 7. Determine the number of orders placed in each month of a year

The next situation consists of analyzing the number of orders placed each month during 1995. The query extracts the month from each order date and counts how many orders were placed within each month of that year.

```SQL
SELECT
    strftime('%m', o_orderdate) AS order_month,
    COUNT(o_orderkey) AS num_orders
FROM
    orders
WHERE
    o_orderdate >= '1995-01-01'
    AND o_orderdate < '1996-01-01'
GROUP BY
    order_month
ORDER BY
    order_month;
```

In PyDough, this query requires a Boolean expression. Here we use the **&** operator for AND; the **|** and **~** operators serve as logical OR and NOT. The query filters the orders with WHERE to include only those placed in 1995 and uses the MONTH() function to extract the month from each order's date. The PARTITION operation then groups the filtered orders by month, and COUNT counts the number of orders in each month.

In [27]:
%%pydough
# Keep only the orders placed in 1995 and compute the month of each one
orders_in_1995_by_month = orders.WHERE(
    (order_date >= datetime.date(1995, 1, 1))
    & (order_date < datetime.date(1996, 1, 1))
)(order_month=MONTH(order_date))

# Group the filtered orders by month and count the total number of orders per month in 1995
monthly_order_summary_1995 = PARTITION(orders_in_1995_by_month, name="order", by=(order_month))(
    order_month=order_month,
    total_orders_in_month=COUNT(order)
).ORDER_BY(order_month.ASC())  # Sort by month, matching the SQL ORDER BY
pydough.to_df(monthly_order_summary_1995)


Unnamed: 0,order_month,total_orders_in_month
0,1,19472
1,2,17721
2,3,19313
3,4,18901
4,5,19342
5,6,18874
6,7,19471
7,8,19460
8,9,18746
9,10,19502
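
Python's `and`, `or`, and `not` keywords cannot be overloaded, which is the usual reason expression-building libraries rely on `&`, `|`, and `~` instead. The toy class below shows the mechanism. `Expr` and `col` are hypothetical names invented for this illustration; they do not reflect PyDough's internals.

```python
class Expr:
    """A toy expression node that records operations instead of evaluating them."""

    def __init__(self, text: str):
        self.text = text

    # Comparison operators build new expression nodes.
    def __ge__(self, other):
        return Expr(f"({self.text} >= {other!r})")

    def __lt__(self, other):
        return Expr(f"({self.text} < {other!r})")

    # Bitwise operators stand in for boolean logic.
    def __and__(self, other):
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self):
        return Expr(f"(NOT {self.text})")

    def __bool__(self):
        # `and`, `or`, `not`, and chained comparisons all call bool() on an
        # operand, which would silently discard the expression tree.
        raise TypeError("use &, | and ~ instead of and, or and not")


def col(name: str) -> Expr:
    """Create a column reference (hypothetical helper)."""
    return Expr(name)


order_date = col("order_date")
in_1995 = (order_date >= "1995-01-01") & (order_date < "1996-01-01")
print(in_1995.text)
print((~in_1995).text)

try:
    (order_date >= "1995-01-01") and (order_date < "1996-01-01")
except TypeError as err:
    print("rejected:", err)
```

The parentheses around each comparison matter: `&` binds more tightly than `>=` and `<` in Python, so leaving them out changes how the expression is parsed.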


### 8. Find Customers Whose Orders Total More Than $1000

Let's find the customers in the ASIA region whose orders total more than $1000.

```SQL
SELECT c.c_custkey, c.c_name, SUM(o.o_totalprice) AS total_spent
FROM customer c
JOIN orders o ON c.c_custkey = o.o_custkey
JOIN nation n ON c.c_nationkey = n.n_nationkey
JOIN region r ON n.n_regionkey = r.r_regionkey
WHERE r.r_name = 'ASIA'
GROUP BY c.c_custkey, c.c_name
HAVING SUM(o.o_totalprice) > 1000;
```



In PyDough, we can use aggregation functions just as in SQL. In this example, SUM plays the same role as SQL's SUM and computes each customer's total spending. Because the total is an ordinary calculated property, filtering on it with WHERE takes the place of HAVING.

In [29]:
%%pydough

# Retrieve customers with total order price greater than 1000 and from the "ASIA" region
high_value_customers_in_asia = customers(
    customer_key=key,  
    customer_name=name,  
    total_spent=SUM(orders.total_price) 
).WHERE((total_spent > 1000)  & (nation.region.name == "ASIA"))

pydough.to_df(high_value_customers_in_asia)

Unnamed: 0,customer_key,customer_name,total_spent
0,7,Customer#000000007,2957861.160000
1,19,Customer#000000019,3611713.600000
2,25,Customer#000000025,3135039.320000
3,28,Customer#000000028,2429022.210000
4,37,Customer#000000037,2860377.420000
...,...,...,...
20019,149980,Customer#000149980,3115223.230000
20020,149981,Customer#000149981,1700503.960000
20021,149984,Customer#000149984,1153164.880000
20022,149987,Customer#000149987,472026.460000


Another way to solve the same query is to start from the nations in the ASIA region and then step into their customers:

In [32]:
%%pydough

# Filter nations to include only those in the "ASIA" region
asian_nations = nations.WHERE(region.name == "ASIA")

# Retrieve high-value customers from Asian nations who have spent more than 1000
high_value_customers_in_asia = asian_nations.customers(
    customer_key=key,  
    customer_name=name,  
    total_spent=SUM(orders.total_price)  
).WHERE(
    total_spent > 1000  
)

pydough.to_df(high_value_customers_in_asia)

Unnamed: 0,customer_key,customer_name,total_spent
0,7,Customer#000000007,2957861.160000
1,19,Customer#000000019,3611713.600000
2,25,Customer#000000025,3135039.320000
3,28,Customer#000000028,2429022.210000
4,37,Customer#000000037,2860377.420000
...,...,...,...
20019,149980,Customer#000149980,3115223.230000
20020,149981,Customer#000149981,1700503.960000
20021,149984,Customer#000149984,1153164.880000
20022,149987,Customer#000149987,472026.460000


### 9. Average Order Value by Region 

```SQL 
SELECT 
    r.r_name AS Region, 
    AVG(o.o_totalprice) AS AvgOrderValue 
FROM 
    orders o
JOIN 
    customer c ON o.o_custkey = c.c_custkey
JOIN 
    nation n ON c.c_nationkey = n.n_nationkey
JOIN 
    region r ON n.n_regionkey = r.r_regionkey
GROUP BY 
    r.r_name;
```


The main idea is to link customers with their orders. We first store each customer's region name, then step into the orders sub-collection to retrieve each order's total price. BACK(1) refers to the ancestor collection one level up (the customer), which lets us carry over the region name we defined earlier. Finally, PARTITION groups the orders by region name and AVG computes the average order value per region.

In [34]:
%%pydough

# For each customer, store the region name, then step into the customer's orders
selected_orders_by_region = customers(customer_region_name=nation.region.name).orders(
    order_price=total_price,
    customer_region_name=BACK(1).customer_region_name
)

# Group the orders by region and calculate the average order price for each region
region_order_summary = PARTITION(selected_orders_by_region, name="order", by=customer_region_name)(
    region_name=customer_region_name,
    avg_order_value=AVG(order.order_price)
)

pydough.to_df(region_order_summary)

Unnamed: 0,region_name,avg_order_value
0,AFRICA,151274.687459
1,AMERICA,151476.057596
2,ASIA,151167.942741
3,EUROPE,150990.370343
4,MIDDLE EAST,151192.10578
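
One way to picture `BACK(1)`: while iterating over a sub-collection, it reaches one level up to the record that owns the current one. The nested loop below mimics that on made-up data, copying each customer's region onto its orders before grouping. It is a conceptual sketch in plain Python, not PyDough's implementation.

```python
from collections import defaultdict
from statistics import mean

# Made-up data: each customer carries a region name and a list of order prices.
customers = [
    {"region": "ASIA", "orders": [100.0, 300.0]},
    {"region": "ASIA", "orders": [200.0]},
    {"region": "EUROPE", "orders": [50.0, 150.0]},
    {"region": "EUROPE", "orders": []},
]

# Step 1: walk customers -> orders, attaching the parent's region to each order
# (the role played by BACK(1).customer_region_name).
order_rows = [
    {"order_price": price, "customer_region_name": customer["region"]}
    for customer in customers
    for price in customer["orders"]
]

# Step 2: group the order rows by region (the PARTITION step).
prices_by_region = defaultdict(list)
for row in order_rows:
    prices_by_region[row["customer_region_name"]].append(row["order_price"])

# Step 3: average the prices within each group (the AVG step).
avg_order_value = {
    region: mean(prices) for region, prices in prices_by_region.items()
}
print(avg_order_value)
```

Note that the customer with no orders contributes nothing to the average, just as it contributes no rows to the SQL join.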


### 10. Identifying the 5 Most Profitable Regions

Find regions contributing the most to revenue.
```SQL
SELECT 
    r.r_name AS RegionName, 
    SUM(o.o_totalprice) AS TotalRevenue
FROM 
    region r
JOIN nation n ON r.r_regionkey = n.n_regionkey
JOIN customer c ON n.n_nationkey = c.c_nationkey
JOIN orders o ON c.c_custkey = o.o_custkey
GROUP BY 
    r.r_name
ORDER BY 
    TotalRevenue DESC
LIMIT 5;
```

In PyDough, to achieve this, we would use the TOP_K operation, which sorts a collection and then selects the first k records from the ordered results. The cell below takes a simpler route: for each nation it records the region name and sums the total price of all orders placed by that nation's customers, then sorts the rows by revenue in descending order. Because it aggregates per nation rather than per region, a region name can appear on several rows; grouping these rows by region_name with PARTITION and then applying TOP_K would give one row per region, matching the SQL above.


In [45]:
%%pydough

selected_customers = nations(region_name= region.name, TOTALREVENUE= SUM(customers.orders.total_price))

pydough.to_df(selected_customers.ORDER_BY(TOTALREVENUE.DESC()))

Unnamed: 0,region_name,TOTALREVENUE
0,EUROPE,9318715232.78
1,ASIA,9300830039.29
2,EUROPE,9282323186.28
3,AFRICA,9249114609.66
4,MIDDLE EAST,9229296044.98
5,EUROPE,9196280024.51
6,ASIA,9161685172.34
7,AMERICA,9143635385.19
8,ASIA,9121689438.2
9,AMERICA,9107367126.7
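
As described above, TOP_K amounts to sorting and keeping the first k rows. The sketch below rolls made-up per-nation revenue up to one total per region and then keeps the top 2. The figures are invented for illustration, and `heapq.nlargest` is a standard-library stand-in for the sort-and-cut step, not what PyDough runs.

```python
import heapq
from collections import defaultdict

# Made-up per-nation revenue rows: (region name, nation name, revenue).
nation_revenue = [
    ("EUROPE", "FRANCE", 30.0),
    ("EUROPE", "RUSSIA", 25.0),
    ("ASIA", "CHINA", 40.0),
    ("ASIA", "JAPAN", 10.0),
    ("AFRICA", "EGYPT", 20.0),
]

# Aggregate to one total per region, so each region appears only once.
region_revenue = defaultdict(float)
for region, _nation, revenue in nation_revenue:
    region_revenue[region] += revenue

# Keep the k regions with the largest totals, in descending order.
top_2 = heapq.nlargest(2, region_revenue.items(), key=lambda item: item[1])
print(top_2)
```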


### 11. Max and Min Order Value Difference by Nation and Region

Retrieve the name of the region and nation for each unique combination of region and nation. For each region-nation pair, calculate the maximum and minimum order values. Then, find the difference between the maximum and minimum order values. Additionally, count the total number of orders for each region-nation pair. Group the data by region and nation to ensure calculations are done separately for each pair. Finally, sort the results by the difference in order values, with the highest difference appearing first.

```SQL
SELECT 
    r.r_name AS region_name,
    n.n_name AS nation_name,
    MAX(o.o_totalprice) AS max_order_value,
    MIN(o.o_totalprice) AS min_order_value,
    MAX(o.o_totalprice) - MIN(o.o_totalprice) AS order_value_difference,
    COUNT(o.o_orderkey) AS total_orders
FROM region r
JOIN nation n ON r.r_regionkey = n.n_regionkey  
JOIN customer c ON c.c_nationkey = n.n_nationkey
JOIN orders o ON o.o_custkey = c.c_custkey
GROUP BY r.r_name, n.n_name
ORDER BY order_value_difference DESC;
```

In PyDough, we start from the nations collection, so each nation (together with its region) already forms one group and no PARTITION is needed. For each nation, we reach its orders through customers.orders and compute the maximum and minimum order values with MAX and MIN, their difference, and the total number of orders with COUNT. Finally, we sort the results by the order value difference in descending order using ORDER_BY.

In [46]:
%%pydough
# For each nation, compute the max and min order values, their difference, and the number of orders
selected_customers_by_nation = nations(
    region_name=region.name,  
    nation_name=name,  
    max_order_value=MAX(customers.orders.total_price), 
    min_order_value=MIN(customers.orders.total_price),  
    order_value_difference=MAX(customers.orders.total_price) - MIN(customers.orders.total_price), 
    total_orders=COUNT(customers.orders.total_price)  
)

# Sort the nations by the difference between max and min order values, in descending order
sorted_customers_by_order_value_difference = selected_customers_by_nation.ORDER_BY(order_value_difference.DESC())

pydough.to_df(sorted_customers_by_order_value_difference)

Unnamed: 0,region_name,nation_name,max_order_value,min_order_value,order_value_difference,total_orders
0,EUROPE,RUSSIA,555285.16,932.41,554352.75,61495
1,AMERICA,PERU,544089.09,891.74,543197.35,59018
2,AMERICA,ARGENTINA,530604.44,877.3,529727.14,59547
3,AMERICA,UNITED STATES,525590.57,913.45,524677.12,59921
4,MIDDLE EAST,IRAN,522644.48,924.51,521719.97,59675
5,AMERICA,CANADA,515531.82,908.18,514623.64,60480
6,EUROPE,FRANCE,508668.52,885.75,507782.77,61600
7,AFRICA,MOZAMBIQUE,508047.99,896.59,507151.4,61267
8,ASIA,VIETNAM,504509.06,911.67,503597.39,60347
9,ASIA,JAPAN,502742.76,857.71,501885.05,59405


### 12. Nations with the Most Customers in a Specific Market Segment

Find the nations with the most customers in the MACHINERY and AUTOMOBILE market segments.

```SQL

SELECT 
    n.n_name AS nation_name,
    COUNT(c.c_custkey) AS customer_count
FROM nation n
JOIN customer c ON c.c_nationkey = n.n_nationkey
WHERE c.c_mktsegment IN ('MACHINERY', 'AUTOMOBILE') 
GROUP BY n.n_name
ORDER BY customer_count DESC;
```

In PyDough, we filter the customers by market segment using WHERE together with the ISIN function, which plays the role of the IN operator in SQL. First, we select the customers in the "MACHINERY" and "AUTOMOBILE" segments. Then, for each selected customer, we compute the nation name and the customer key. Afterward, we group these customers by nation and count the customers in each nation. Finally, we sort the results by customer count in descending order, which identifies the nations with the most customers in the target market segments.

In [49]:
%%pydough

# Retrieve customers in the 'MACHINERY' or 'AUTOMOBILE' market segments
customer_in_target_mktsegment = customers.WHERE(ISIN(mktsegment, ('MACHINERY', 'AUTOMOBILE')))

# For each selected customer, compute the nation name and the customer key
selected_customers_by_nation = customer_in_target_mktsegment(
    nation_name=nation.name,  
    customer_key=key 
)

# Group selected customers by nation and calculate the count of customers in each nation
nation_customer_count = PARTITION(selected_customers_by_nation, name="customer", by=(nation_name))(
    nation_name=nation_name, 
    customer_count=COUNT(customer.key)  
).ORDER_BY(customer_count.DESC())  


pydough.to_df(nation_customer_count)


Unnamed: 0,nation_name,customer_count
0,ROMANIA,2545
1,INDONESIA,2489
2,CHINA,2481
3,ETHIOPIA,2423
4,BRAZIL,2419
5,EGYPT,2414
6,RUSSIA,2414
7,GERMANY,2402
8,UNITED STATES,2399
9,JORDAN,2397
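
The filter-then-group pattern above can be mimicked in plain Python to make the roles of the membership test, the grouping key, and the count concrete. This is only a conceptual sketch on made-up rows; it says nothing about how PyDough executes the query.

```python
from collections import Counter

# Made-up rows: (customer key, nation name, market segment).
customers = [
    (1, "ROMANIA", "MACHINERY"),
    (2, "ROMANIA", "AUTOMOBILE"),
    (3, "CHINA", "BUILDING"),
    (4, "CHINA", "MACHINERY"),
    (5, "BRAZIL", "FURNITURE"),
]
target_segments = ("MACHINERY", "AUTOMOBILE")

# WHERE(ISIN(mktsegment, ...)): keep rows whose segment is in the target set.
selected = [row for row in customers if row[2] in target_segments]

# PARTITION(..., by=nation_name) with COUNT: one output row per distinct nation.
counts = Counter(nation for _key, nation, _segment in selected)

# ORDER_BY(customer_count.DESC()), with the nation name as a tie-breaker.
result = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(result)
```

Nations whose customers were all filtered out (BRAZIL here) produce no group at all, which is also how GROUP BY behaves after a WHERE filter.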


### 14. Region with the Highest Percentage of High-Priority Orders

Calculate the percentage of high-priority orders (e.g., '1-URGENT', '2-HIGH') for each region.

```SQL

SELECT r.r_name AS region_name, 
  ROUND(
    SUM(
      CASE 
        WHEN o.o_orderpriority IN ('1-URGENT', '2-HIGH') THEN 1 
        ELSE 0 
      END
    ) * 100.0 / COUNT(o.o_orderkey),
    2
  ) AS high_priority_percentage
  
FROM orders o
JOIN customer c ON o.o_custkey = c.c_custkey
JOIN nation n ON c.c_nationkey = n.n_nationkey
JOIN region r ON n.n_regionkey = r.r_regionkey
GROUP BY r.r_name
ORDER BY high_priority_percentage DESC;
```

In PyDough, we first attach to each customer the name of its region through nation.region.name. Then, we step into the orders placed by these customers, using BACK(1) to carry the region name down to each order. Afterward, we partition the orders by region and calculate the percentage of high-priority orders for each region: the number of orders with a priority of "1-URGENT" or "2-HIGH", multiplied by 100 and divided by the total number of orders in the region. Finally, we sort the results by the high-priority percentage in descending order, which identifies the regions with the highest share of urgent orders.

In [50]:
%%pydough

# Attach the region name to each customer
customers_in_region = customers(region_name=nation.region.name)

# Step into each customer's orders, carrying the region name down with BACK(1)
selected_orders_by_region = customers_in_region.orders(
    order_key=key,
    region_name=BACK(1).region_name
)

# Partition the selected orders by region and calculate the high-priority order percentage for each region
region_priority_summary = PARTITION(selected_orders_by_region, name="order", by=(region_name))(
    region_name=region_name,
    high_priority_percentage=ROUND(
        (SUM(ISIN(order.order_priority, ('1-URGENT', '2-HIGH'))) * 100) / COUNT(order.order_key), 2
    )
).ORDER_BY(high_priority_percentage.DESC()) 

pydough.to_df(region_priority_summary)


Unnamed: 0,region_name,high_priority_percentage
0,MIDDLE EAST,40.2
1,AMERICA,40.16
2,EUROPE,39.99
3,ASIA,39.91
4,AFRICA,39.89
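
The percentage formula above is a conditional count divided by a total count. A small Python sketch on made-up orders shows the arithmetic; adding a true/false test as 1 or 0 inside a running sum mirrors the `CASE WHEN ... THEN 1 ELSE 0 END` in the SQL.

```python
# Made-up orders: (region name, order priority).
orders = [
    ("ASIA", "1-URGENT"),
    ("ASIA", "5-LOW"),
    ("ASIA", "2-HIGH"),
    ("EUROPE", "3-MEDIUM"),
    ("EUROPE", "1-URGENT"),
    ("EUROPE", "4-NOT SPECIFIED"),
]
high_priorities = ("1-URGENT", "2-HIGH")

totals = {}
high = {}
for region, priority in orders:
    totals[region] = totals.get(region, 0) + 1
    # A bool adds as 1 (True) or 0 (False).
    high[region] = high.get(region, 0) + (priority in high_priorities)

# Multiply by 100.0 before dividing, then round to two decimals.
percentages = {
    region: round(high[region] * 100.0 / totals[region], 2) for region in totals
}
ranked = sorted(percentages.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```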


### 15. Customers Who Have Never Placed Orders

The next situation consists of identifying customers who have not placed any orders. The query retrieves the customer details, such as their customer key and name.

```SQL
SELECT c.c_custkey, c.c_name
FROM customer c
LEFT JOIN orders o ON c.c_custkey = o.o_custkey
WHERE o.o_orderkey IS NULL;
```

In PyDough, we use the HAS and HASNOT functions to express this kind of IS NULL check. HAS returns True if at least one record of the sub-collection exists, and HASNOT returns True if no record of the sub-collection exists. So the first step is to filter the customers who have not placed any orders, using WHERE combined with HASNOT. Then, we select each remaining customer's key and name as customer_key and customer_name.

In [2]:
%%pydough

# Retrieve customers who have not placed any orders
customers_without_orders = customers.WHERE(HASNOT(orders))

# Select the customers who do not have any orders, retrieving their unique key and name
selected_customers_without_orders = customers_without_orders(
    customer_key=key, 
    customer_name=name  
)

pydough.to_df(selected_customers_without_orders)

Unnamed: 0,customer_key,customer_name
0,3,Customer#000000003
1,6,Customer#000000006
2,9,Customer#000000009
3,12,Customer#000000012
4,15,Customer#000000015
...,...,...
49999,149988,Customer#000149988
50000,149991,Customer#000149991
50001,149994,Customer#000149994
50002,149997,Customer#000149997
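
Conceptually, HAS asks whether a sub-collection is non-empty and HASNOT asks whether it is empty. In plain Python that is a length test on a list, as the sketch below shows on made-up data. The `has` and `hasnot` helpers are hypothetical stand-ins written for this illustration, not PyDough functions.

```python
def has(records) -> bool:
    """True when the sub-collection contains at least one record."""
    return len(records) > 0


def hasnot(records) -> bool:
    """True when the sub-collection contains no records at all."""
    return len(records) == 0


# Made-up customers mapped to the keys of their orders.
orders_by_customer = {
    "Customer#1": [101, 102],
    "Customer#2": [],
    "Customer#3": [103],
    "Customer#4": [],
}

without_orders = [
    name for name, orders in orders_by_customer.items() if hasnot(orders)
]
with_orders = [
    name for name, orders in orders_by_customer.items() if has(orders)
]
print(without_orders)
print(with_orders)
```

Every customer lands in exactly one of the two lists, since the two tests are complements of each other.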


### 16. Customer Activity by Nation
Question: How many total, active, and inactive customers are there in each nation, sorted by the total number of customers?

Purpose: This query shows the total number of customers, active customers (those with orders), and inactive customers (those without orders) for each nation. The results are sorted by the total number of customers.

```SQL

SELECT
    n.n_name,
    COUNT(DISTINCT c.c_custkey) AS total_customers,
    COUNT(DISTINCT CASE WHEN o.o_orderkey IS NOT NULL THEN c.c_custkey END) AS active_customers,
    COUNT(DISTINCT CASE WHEN o.o_orderkey IS NULL THEN c.c_custkey END) AS inactive_customers
FROM
    nation n
JOIN customer c ON n.n_nationkey = c.c_nationkey
LEFT JOIN orders o ON c.c_custkey = o.o_custkey
GROUP BY n.n_name
ORDER BY total_customers DESC;
```

In PyDough, we first attach to each customer its nation name and classify it as active or inactive based on its order history using the KEEP_IF function, which is similar to the SQL expression CASE WHEN b THEN a END: it keeps the customer key when the condition holds and yields null otherwise. Then, we partition the customers by nation and calculate the total, active, and inactive customer counts for each nation, with NDISTINCT counting the distinct non-null keys. Finally, we sort the results by total customers in descending order, which shows the nations with the most customers alongside their activity levels.

In [6]:
%%pydough

# Retrieve customers by nation, classifying them into active and inactive based on order history
customers_by_nation = customers(
    customer_nation_name=nation.name,  
    active_customers=KEEP_IF(key, HAS(orders)),  
    inactive_customers=KEEP_IF(key, HASNOT(orders))  
)

# Partition the selected customers by nation and calculate customer statistics
nation_customer_summary = PARTITION(customers_by_nation, "customer", by=customer_nation_name)(
    nation_name=customer_nation_name,  
    total_customers=COUNT(customer), 
    active_customers=NDISTINCT(customer.active_customers),  
    inactive_customers=NDISTINCT(customer.inactive_customers), 
).ORDER_BY(total_customers.DESC())  # Sort by total customers in descending order

pydough.to_df(nation_customer_summary)

Unnamed: 0,nation_name,total_customers,active_customers,inactive_customers
0,INDONESIA,6161,4081,2080
1,FRANCE,6100,4149,1951
2,ROMANIA,6100,4087,2013
3,RUSSIA,6078,4089,1989
4,INDIA,6042,3958,2084
5,JORDAN,6033,4025,2008
6,CHINA,6024,4011,2013
7,CANADA,6020,4006,2014
8,UNITED KINGDOM,6011,3989,2022
9,IRAN,6009,4013,1996
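
KEEP_IF can be read as "the value if the condition holds, otherwise null", and a distinct count then skips the nulls. The sketch below imitates that on made-up customers. `keep_if` and `ndistinct` are hypothetical helpers for illustration; the null-skipping follows SQL's COUNT(DISTINCT ...) and is consistent with the active and inactive counts above adding up to the total.

```python
def keep_if(value, condition):
    """Return value when condition is true, otherwise None (like CASE WHEN ... END)."""
    return value if condition else None


def ndistinct(values):
    """Count distinct non-null values, as COUNT(DISTINCT ...) does in SQL."""
    return len({v for v in values if v is not None})


# Made-up customers: (key, nation name, number of orders).
customers = [
    (1, "FRANCE", 3),
    (2, "FRANCE", 0),
    (3, "FRANCE", 1),
    (4, "INDIA", 0),
]

summary = {}
for nation in sorted({nat for _key, nat, _n in customers}):
    rows = [row for row in customers if row[1] == nation]
    summary[nation] = {
        "total_customers": len(rows),
        "active_customers": ndistinct(keep_if(key, n > 0) for key, _nat, n in rows),
        "inactive_customers": ndistinct(keep_if(key, n == 0) for key, _nat, n in rows),
    }
print(summary)
```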


### 17. Customers with High Balance but Low Spending

Question: Which customers belong to the top 10% by account balance but rank in the bottom 25% by order activity?

Purpose: Identify customers with high account balances who place few orders.

```SQL
SELECT c_name, c_acctbal, total_orders
FROM (
    SELECT 
        c.c_name,
        c.c_acctbal,
        COUNT(o.o_orderkey) AS total_orders,
        PERCENT_RANK() OVER (ORDER BY c.c_acctbal DESC) AS balance_percentile,
        PERCENT_RANK() OVER (ORDER BY COUNT(o.o_orderkey)) AS order_activity_percentile
    FROM customer c
    LEFT JOIN orders o ON c.c_custkey = o.o_custkey
    GROUP BY c.c_custkey, c.c_name, c.c_acctbal
) sub
WHERE 
    balance_percentile <= 0.1  
    AND order_activity_percentile <= 0.25 
ORDER BY c_acctbal DESC;
```


In [7]:
%%pydough

# Filter customers who are in the top 10% by account balance and the bottom 25% by order count
customers_in_low_percentiles = customers.WHERE(
    (PERCENTILE(by=acctbal.DESC()) <= 10)  
    & (PERCENTILE(by=COUNT(orders.key).ASC()) <= 25) 
)

# Select the filtered customers, retrieving their key, name, and account balance
selected_customers_in_low_percentiles = customers_in_low_percentiles(
    customer_key=key,  
    customer_name=name, 
    account_balance=acctbal  
)

# Partition the selected customers by key, name, and account balance, and order by account balance
customer_activity_summary = PARTITION(selected_customers_in_low_percentiles, name="customer", by=(key, name, acctbal))(
    key=key,  
    name=name,  
    acctbal=acctbal  
).ORDER_BY(acctbal.DESC())  

pydough.to_df(customer_activity_summary)



Unnamed: 0,key,name,acctbal
0,69321,Customer#000069321,9999.960000
1,2487,Customer#000002487,9999.720000
2,43044,Customer#000043044,9999.490000
3,76146,Customer#000076146,9999.230000
4,34047,Customer#000034047,9998.970000
...,...,...,...
3708,62682,Customer#000062682,8894.780000
3709,82611,Customer#000082611,8894.490000
3710,13560,Customer#000013560,8894.430000
3711,78429,Customer#000078429,8894.390000
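
The SQL version relies on PERCENT_RANK, which standard SQL defines as (rank - 1) / (rows - 1), with tied values sharing a rank. The sketch below computes it by hand on made-up balances to show which rows a `<= 0.1` filter keeps. `percent_rank` is a hypothetical helper; the PyDough cell above instead compares PERCENTILE values against 10 and 25, so treat the correspondence between the two as approximate.

```python
def percent_rank(values, descending=False):
    """Return PERCENT_RANK for each value: (rank - 1) / (n - 1), ties share a rank."""
    n = len(values)
    if n <= 1:
        return [0.0] * n
    ordered = sorted(values, reverse=descending)
    # rank - 1 equals the position of the first occurrence in the ordering.
    first_position = {}
    for position, value in enumerate(ordered):
        first_position.setdefault(value, position)
    return [first_position[v] / (n - 1) for v in values]


# Made-up account balances for eleven customers.
balances = [100, 900, 500, 300, 700, 200, 800, 400, 600, 1000, 50]

# Rank by balance, highest first, then keep the rows at or below 0.1.
ranks = percent_rank(balances, descending=True)
top_decile = [b for b, r in zip(balances, ranks) if r <= 0.1]
print(top_decile)
```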
