#### SQL-PYDOUGH CODE TESTING NOTEBOOK

Setup for the PyDough package is done in the next cell. Run it to import the necessary packages.

In [1]:
import pydough

%load_ext pydough.jupyter_extensions
#%reload_ext pydough.jupyter_extensions

#Necessary for comparison
import pandas as pd
from pandas.testing import assert_frame_equal, assert_series_equal
import re
import dfcompare

import collections
import numpy as np
import sqlite3 as sql
import os


### Now we can set up the SQLite database and connect it to PyDough. Please change the following strings to match:
1. The .sql filename used to initialize the database
2. The metadata path for the graphs
3. The name of the graph you want to use
4. The desired database name

In [2]:
#YOUR .SQL FILE TO CREATE THE DATABASE, COPY IT TO THIS FOLDER.
SQL_filename = 'ewallet.sql'

#METADATA FOR THE GRAPH .JSON
metadata_path = "ewallet_graph.json"

#GRAPH NAME
graph_name = "Ewallet"

#DESIRED DATABASE NAME
DB_name = "DATABASE.db"



#with open(SQL_filename, 'r') as sql_file:
#    sql_script = sql_file.read()

#os.remove(DB_name)
#connection = sql.connect(DB_name)
#cursor = connection.cursor()
#cursor.executescript(sql_script)

pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH");
pydough.active_session.connect_database("sqlite", database="../../tpch.db");
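
The SQL template further down reads from a variable named `connection`, which only exists if the commented-out lines in this cell are run. As a minimal sketch of what those lines do, the snippet below builds a SQLite database from a SQL script and queries it. The two-column table and its rows are a hypothetical stand-in for the real .sql file, and `":memory:"` is used in place of `DB_name` so that the example is self-contained.

```python
import sqlite3

# Hypothetical stand-in for the contents of the .sql file. In the notebook the
# script would be read from SQL_filename instead.
sql_script = """
CREATE TABLE sbCustomer (sbCustId TEXT PRIMARY KEY, sbCustName TEXT);
INSERT INTO sbCustomer VALUES ('C001', 'john doe');
INSERT INTO sbCustomer VALUES ('C002', 'Jane Smith');
"""

# ":memory:" keeps the sketch self-contained; the notebook would pass DB_name.
connection = sqlite3.connect(":memory:")

# executescript runs every statement in the script in one call.
connection.executescript(sql_script)

rows = connection.execute(
    "SELECT sbCustId, sbCustName FROM sbCustomer ORDER BY sbCustId"
).fetchall()
print(rows)
```

With a file-backed database, deleting the old file first (as the commented-out `os.remove(DB_name)` does) avoids "table already exists" errors when the script is run a second time.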

### Graph Structure
To see the structure of the graph and understand its relations and names, run the next cell. If the result does not display the complete list, select "View as a scrollable element" at the bottom of the result to see the full structure.

In [46]:
graph = pydough.active_session.metadata
print(pydough.explain_structure(graph))

Structure of PyDough graph: TPCH

  customers
  ├── acctbal
  ├── address
  ├── comment
  ├── key
  ├── mktsegment
  ├── name
  ├── nation_key
  ├── phone
  ├── nation [one member of nations] (reverse of nations.customers)
  └── orders [multiple orders] (reverse of orders.customer)

  lines
  ├── comment
  ├── commit_date
  ├── discount
  ├── extended_price
  ├── line_number
  ├── order_key
  ├── part_key
  ├── quantity
  ├── receipt_date
  ├── return_flag
  ├── ship_date
  ├── ship_instruct
  ├── ship_mode
  ├── status
  ├── supplier_key
  ├── tax
  ├── order [one member of orders] (reverse of orders.lines)
  ├── part [one member of parts] (reverse of parts.lines)
  ├── part_and_supplier [one member of supply_records] (reverse of supply_records.lines)
  └── supplier [one member of suppliers] (reverse of suppliers.lines)

  nations
  ├── comment
  ├── key
  ├── name
  ├── region_key
  ├── customers [multiple customers] (reverse of customers.nation)
  ├── region [one member of regions] 

### SQL Test template
You can use this template to run your SQL code and visually compare the results to those of the PyDough code.
Just paste your SQL code between the ''' ''' delimiters. You can also copy this template and paste it wherever you need to.
Remember to use the column and table names from the original .sql file.

In [99]:
query = '''
 SELECT
    *
 FROM
    sbCustomer
'''
sql_output = pd.read_sql_query(query, connection)
sql_output

Unnamed: 0,sbCustId,sbCustName,sbCustEmail,sbCustPhone,sbCustAddress1,sbCustAddress2,sbCustCity,sbCustState,sbCustCountry,sbCustPostalCode,sbCustJoinDate,sbCustStatus
0,C001,john doe,john.doe@email.com,555-123-4567,123 Main St,,Anytown,CA,USA,90001,2020-01-01,active
1,C002,Jane Smith,jane.smith@email.com,555-987-6543,456 Oak Rd,,Someville,NY,USA,10002,2019-03-15,active
2,C003,Bob Johnson,bob.johnson@email.com,555-246-8135,789 Pine Ave,,Mytown,TX,USA,75000,2022-06-01,inactive
3,C004,Samantha Lee,samantha.lee@email.com,555-135-7902,246 Elm St,,Yourtown,CA,USA,92101,2018-09-22,suspended
4,C005,Michael Chen,michael.chen@email.com,555-864-2319,159 Cedar Ln,,Anothertown,FL,USA,33101,2021-02-28,active
5,C006,Emily Davis,emily.davis@email.com,555-753-1904,753 Maple Dr,,Mytown,TX,USA,75000,2020-07-15,active
6,C007,David Kim,david.kim@email.com,555-370-2648,864 Oak St,,Anothertown,FL,USA,33101,2022-11-05,active
7,C008,Sarah Nguyen,sarah.nguyen@email.com,555-623-7419,951 Pine Rd,,Yourtown,CA,USA,92101,2019-04-01,closed
8,C009,William Garcia,william.garcia@email.com,555-148-5326,258 Elm Ave,,Anytown,CA,USA,90001,2021-08-22,active
9,C010,Jessica Hernandez,jessica.hernandez@email.com,555-963-8520,147 Cedar Blvd,,Someville,NY,USA,10002,2020-03-10,inactive


### PyDough template
The important part of this template is to run the PyDough code and store the result in a variable called pydough_output for later comparison.

In [111]:
%%pydough

#Setting up the tables that we will need information from in the context
tables = Customers

#The condition we would like the content to fulfill
filter = Customers

#The information we want to receive in the resulting table
output = filter

#Execute the PyDough code
pydough_output = pydough.to_df(output)
pydough_output

Unnamed: 0,_id,name,email,phone,address1,address2,city,state,country,postal_code,join_date,status
0,C001,john doe,john.doe@email.com,555-123-4567,123 Main St,,Anytown,CA,USA,90001,2020-01-01,active
1,C002,Jane Smith,jane.smith@email.com,555-987-6543,456 Oak Rd,,Someville,NY,USA,10002,2019-03-15,active
2,C003,Bob Johnson,bob.johnson@email.com,555-246-8135,789 Pine Ave,,Mytown,TX,USA,75000,2022-06-01,inactive
3,C004,Samantha Lee,samantha.lee@email.com,555-135-7902,246 Elm St,,Yourtown,CA,USA,92101,2018-09-22,suspended
4,C005,Michael Chen,michael.chen@email.com,555-864-2319,159 Cedar Ln,,Anothertown,FL,USA,33101,2021-02-28,active
5,C006,Emily Davis,emily.davis@email.com,555-753-1904,753 Maple Dr,,Mytown,TX,USA,75000,2020-07-15,active
6,C007,David Kim,david.kim@email.com,555-370-2648,864 Oak St,,Anothertown,FL,USA,33101,2022-11-05,active
7,C008,Sarah Nguyen,sarah.nguyen@email.com,555-623-7419,951 Pine Rd,,Yourtown,CA,USA,92101,2019-04-01,closed
8,C009,William Garcia,william.garcia@email.com,555-148-5326,258 Elm Ave,,Anytown,CA,USA,90001,2021-08-22,active
9,C010,Jessica Hernandez,jessica.hernandez@email.com,555-963-8520,147 Cedar Blvd,,Someville,NY,USA,10002,2020-03-10,inactive


### Comparison template
Run this cell to compare the two DataFrames obtained from the queries.

In [None]:
dfcompare.compare_df(pydough_output, sql_output, query_category="a", question="a")

True
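
The internals of `dfcompare.compare_df` are not shown in this notebook. Note that the two results above use different column names (`sbCustId` versus `_id`), so a comparison that returns `True` for them has to match columns by position rather than by name. The snippet below is a hypothetical sketch of that idea in plain Python, not the `dfcompare` implementation: `same_rows` is an illustrative helper that ignores column names and row order.

```python
def same_rows(left, right):
    """Return True if both results contain the same rows.

    Rows are compared by position within each row, so column names do not
    matter, and both sides are sorted first so row order does not matter.
    """
    left_sorted = sorted(map(tuple, left), key=repr)
    right_sorted = sorted(map(tuple, right), key=repr)
    return left_sorted == right_sorted


# Same data, different row order: treated as equal.
sql_rows = [("C001", "john doe"), ("C002", "Jane Smith")]
pydough_rows = [("C002", "Jane Smith"), ("C001", "john doe")]
print(same_rows(sql_rows, pydough_rows))

# One value differs: treated as different.
changed_rows = [("C001", "john doe"), ("C002", "Jane Smyth")]
print(same_rows(sql_rows, changed_rows))
```

This sketch compares values exactly. Results that contain computed floating-point columns, such as averages, would probably need a tolerance instead.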

SELECT t.sbTxType, COUNT(DISTINCT t.sbTxCustId) AS num_customers, AVG(t.sbTxShares) AS avg_shares FROM sbTransaction AS t WHERE t.sbTxDateTime BETWEEN '2023-01-01' AND '2023-03-31 23:59:59' GROUP BY t.sbTxType ORDER BY CASE WHEN num_customers IS NULL THEN 1 ELSE 0 END DESC, num_customers DESC LIMIT 3;

SELECT c.sbCustId, c.sbCustName FROM sbCustomer AS c LEFT JOIN sbTransaction AS t ON c.sbCustId = t.sbTxCustId WHERE t.sbTxCustId IS NULL;

SELECT DISTINCT c.sbCustId FROM sbCustomer AS c JOIN sbTransaction AS t ON c.sbCustId = t.sbTxCustId WHERE t.sbTxType = 'buy';

SELECT DISTINCT tk.sbTickerId FROM sbTicker AS tk JOIN sbDailyPrice AS dp ON tk.sbTickerId = dp.sbDpTickerId WHERE dp.sbDpDate >= '2023-04-01';

SELECT tk.sbTickerId, tk.sbTickerSymbol FROM sbTicker AS tk LEFT JOIN sbDailyPrice AS dp ON tk.sbTickerId = dp.sbDpTickerId WHERE dp.sbDpTickerId IS NULL;

SELECT tk.sbTickerSymbol, COUNT(tx.sbTxId) AS num_transactions, SUM(tx.sbTxAmount) AS total_amount FROM sbTicker AS tk JOIN sbTransaction AS tx ON tk.sbTickerId = tx.sbTxTickerId GROUP BY tk.sbTickerSymbol ORDER BY CASE WHEN total_amount IS NULL THEN 1 ELSE 0 END DESC, total_amount DESC LIMIT 10;

SELECT sbTxStatus, COUNT(*) AS num_transactions FROM sbTransaction GROUP BY sbTxStatus ORDER BY CASE WHEN num_transactions IS NULL THEN 1 ELSE 0 END DESC, num_transactions DESC LIMIT 3;

SELECT c.sbCustState, t.sbTickerType, COUNT(*) AS num_transactions FROM sbTransaction AS tx JOIN sbCustomer AS c ON tx.sbTxCustId = c.sbCustId JOIN sbTicker AS t ON tx.sbTxTickerId = t.sbTickerId GROUP BY c.sbCustState, t.sbTickerType ORDER BY CASE WHEN num_transactions IS NULL THEN 1 ELSE 0 END DESC, num_transactions DESC LIMIT 5;

SELECT sbCustCountry, COUNT(*) AS num_customers FROM sbCustomer GROUP BY sbCustCountry ORDER BY CASE WHEN num_customers IS NULL THEN 1 ELSE 0 END DESC, num_customers DESC LIMIT 5;

SELECT c.sbCustCountry, COUNT(t.sbTxId) AS num_transactions, SUM(t.sbTxAmount) AS total_amount FROM sbCustomer AS c JOIN sbTransaction AS t ON c.sbCustId = t.sbTxCustId WHERE t.sbTxDateTime >= DATE('now', '-30 days') GROUP BY c.sbCustCountry ORDER BY total_amount DESC LIMIT 5;
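
Several of the queries above order by `CASE WHEN ... IS NULL THEN 1 ELSE 0 END DESC` before the real sort column. The sketch below shows the effect of that clause on a small hypothetical table: rows with a NULL value are moved to the front, and the remaining rows follow in descending order. The table `t` and its contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (label TEXT, amount REAL);
INSERT INTO t VALUES ('a', 10);
INSERT INTO t VALUES ('b', NULL);
INSERT INTO t VALUES ('c', 30);
""")

# The CASE expression yields 1 for NULL amounts and 0 otherwise, so sorting it
# in descending order puts the NULL rows first.
nulls_first = conn.execute("""
SELECT label, amount FROM t
ORDER BY CASE WHEN amount IS NULL THEN 1 ELSE 0 END DESC, amount DESC
""").fetchall()

# Without the CASE expression, SQLite places NULLs last in a DESC sort.
plain_desc = conn.execute(
    "SELECT label, amount FROM t ORDER BY amount DESC"
).fetchall()

print(nulls_first)
print(plain_desc)
```

When the sort column is a `COUNT(...)`, which never returns NULL, the extra clause has no effect on the order.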

# 1.

In [102]:
query = '''
SELECT
    t.sbTxType,
    COUNT(DISTINCT t.sbTxCustId) AS num_customers,
    AVG(t.sbTxShares) AS avg_shares
FROM
    sbTransaction AS t
WHERE
    t.sbTxDateTime BETWEEN '2023-01-01' AND '2023-03-31 23:59:59'
GROUP BY
    t.sbTxType
ORDER BY
    CASE
        WHEN num_customers IS NULL THEN 1
        ELSE 0
    END DESC,
    num_customers DESC
LIMIT 3;
'''
sql_output = pd.read_sql_query(query, connection)
sql_output


Unnamed: 0,sbTxType,num_customers,avg_shares
0,buy,3,41.75
1,sell,3,43.333333


In [None]:
%%pydough

total_shipped = lines.CALCULATE(shipping_mode = ship_mode).part.CALCULATE(name, total_purchases = COUNT(lines))

shipped_by_mode = PARTITION(total_shipped, name = 'total', by = (shipping_mode, name)
                            ).CALCULATE(mode_purchases = COUNT(total), 
                                total_purchases = MAX(total.total_purchases)
                            ).CALCULATE(percentage = 100* mode_purchases/total_purchases)

part_shipped = PARTITION(shipped_by_mode, name = 'mode', by = name)

result = part_shipped.mode.WHERE(RANKING(by=percentage.DESC(), levels=1) ==1
                                ).TOP_K(10, by= (percentage.DESC(), name.ASC())
                                ).CALCULATE(name, shipping_mode, mode_purchases, percentage)

pydough.to_df(result)



In [4]:
%%pydough

revenue = lines.CALCULATE(supp_name = supplier.name, line_revenue = extended_price*(1-discount)
                        ).order.customer.CALCULATE(cust_name = name)

pair_revenue = PARTITION(revenue, name = 'rev', by = (supp_name, cust_name)
                            ).CALCULATE(cust_supp_revenue = SUM(rev.line_revenue))

revenue_by_sup = PARTITION(pair_revenue, name = 'cust_supp_rev', by = supp_name
                        ).CALCULATE(supp_revenue = SUM(cust_supp_rev.cust_supp_revenue)
                        ).cust_supp_rev.WHERE(RANKING(by = cust_supp_revenue.DESC(), levels = 1) ==1)

result = revenue_by_sup.CALCULATE(supp_name = supp_name, cust_name = cust_name, revenue_percentage = (cust_supp_revenue/supp_revenue), total_revenue = supp_revenue)

pydough.to_df(result)



Unnamed: 0,cust_supp_revenue
0,109000.0860
1,95745.0200
2,109736.3250
3,99399.0000
4,103535.7834
...,...
9995,107918.7768
9996,99763.5100
9997,100336.0050
9998,156035.9266


In [4]:
%%pydough

customers_1996_not_1997 = customers.CALCULATE(
    cust_key=key,
    customer_name=name,
    total_spent=SUM(orders.WHERE(YEAR(order_date) == 1996).total_price)
).WHERE(
    (total_spent > 1000) & 
    (HAS(orders.WHERE(YEAR(order_date) == 1996))==1) & 
    (HASNOT(orders.WHERE(YEAR(order_date) == 1997))==1)
)

pydough.to_df(customers_1996_not_1997)

Unnamed: 0,cust_key,customer_name,total_spent
0,5,Customer#000000005,98790.79
1,17,Customer#000000017,481076.02
2,20,Customer#000000020,538764.56
3,31,Customer#000000031,720459.29
4,38,Customer#000000038,788657.01
...,...,...,...
10773,149957,Customer#000149957,319602.15
10774,149966,Customer#000149966,502448.93
10775,149984,Customer#000149984,28386.08
10776,149989,Customer#000149989,157433.72


In [20]:
%%pydough

single_order_customers = PARTITION(orders, name="o", by=customer_key).CALCULATE( 
    customer_key, 
    first_order_year=YEAR(MIN(o.order_date)),  
    order_count=COUNT(o.key)  
).WHERE(order_count == 1)  

only_orders_per_year = PARTITION(single_order_customers, name="s", by=first_order_year).CALCULATE(
    year=first_order_year,
    n_only_order=COUNT(s.customer_key)  
).ORDER_BY(year.ASC())

pydough.to_df(only_orders_per_year)

Unnamed: 0,year,n_only_order
0,1992,1
1,1993,2
2,1994,2
3,1995,1
4,1996,6
5,1997,3
6,1998,2


Break down the average delivery delay for shipments by the number of suppliers who are shipping at least one part in the order. If a product arrives early, its delay is negative.

In [None]:
%%pydough 

delay = lines.CALCULATE(indiv_delay = DATEDIFF('days', commit_date, receipt_date))

order_delay = PARTITION(delay, name ='d', by = order_key
                    ).CALCULATE(supp_amount = NDISTINCT(d.supplier_key), 
                    orders_delay = AVG(d.indiv_delay))
                    
grouped_delay = PARTITION(order_delay, name = 'o', by = supp_amount
                    ).CALCULATE(supp_amount = supp_amount, 
                        AVG_delay = AVG(o.orders_delay)
                    ).ORDER_BY(supp_amount.ASC())

pydough.to_df(grouped_delay)

DATEDIFF unsupported for 'DAYS'.


Unnamed: 0,supp_amount,AVG_delay
0,1,16.618643
1,2,16.476543
2,3,16.393852
3,4,16.513126
4,5,16.459925
5,6,16.489189
6,7,16.501912
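
To make the three grouping steps of this question easier to follow, here is a plain-Python sketch of the same calculation on a handful of hypothetical line items. The data and variable names are illustrative only. The delay is the number of days between `commit_date` and `receipt_date`, which is negative when the shipment arrives early.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical line items: (order_key, supplier_key, commit_date, receipt_date).
lines = [
    (1, "S1", date(1996, 1, 10), date(1996, 1, 15)),  # 5 days late
    (1, "S2", date(1996, 1, 10), date(1996, 1, 7)),   # 3 days early, delay -3
    (2, "S1", date(1996, 2, 1), date(1996, 2, 11)),   # 10 days late
    (3, "S3", date(1996, 3, 1), date(1996, 3, 5)),    # 4 days late
]

# Step 1: per order, collect the individual delays and the distinct suppliers.
delays = defaultdict(list)
suppliers = defaultdict(set)
for order_key, supplier_key, commit, receipt in lines:
    delays[order_key].append((receipt - commit).days)
    suppliers[order_key].add(supplier_key)

# Step 2: group each order's average delay by its number of suppliers.
by_supplier_count = defaultdict(list)
for order_key, order_delays in delays.items():
    by_supplier_count[len(suppliers[order_key])].append(mean(order_delays))

# Step 3: average the per-order averages within each group.
result = {n: mean(vals) for n, vals in sorted(by_supplier_count.items())}
print(result)
```

Like the PyDough cell, this averages per-order averages, so an order with many line items carries the same weight as an order with one.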


Identify how often each month of the year is the month with the highest number of orders that entire year for a part type.

In [None]:
%%pydough

part_month =lines.CALCULATE(part_type = part.part_type, 
                            month = MONTH(order.order_date), 
                            year = YEAR(order.order_date))

monthyear_frequency = PARTITION(part_month, name = 'pm', by = (part_type, month, year)
                                ).CALCULATE(frequency = COUNT(pm))

part_frequency = PARTITION(monthyear_frequency, name = 'my', by = (part_type, year)
                           ).my.WHERE(RANKING(by = frequency.DESC(), levels = 1)==1)

indiv_month = PARTITION(part_frequency, name = 'pf', by = (month)
                        ).CALCULATE(month = month, best_month_freq = COUNT(pf))

pydough.to_df(indiv_month)



Unnamed: 0,month,best_month_freq
0,1,146
1,2,4
2,3,138
3,4,49
4,5,125
5,6,38
6,7,152
7,8,129
8,9,31
9,10,93
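
The same count, rank, and count-again pattern can be sketched in plain Python on hypothetical data. The part types and dates below are made up, and ties are broken by keeping the first month encountered, which is an assumption of this sketch rather than something the PyDough cell specifies.

```python
from collections import Counter

# Hypothetical line items: (part_type, year, month) taken from the parent order.
lines = [
    ("BRASS", 1996, 1), ("BRASS", 1996, 1), ("BRASS", 1996, 2),
    ("BRASS", 1997, 3), ("BRASS", 1997, 3), ("BRASS", 1997, 1),
    ("STEEL", 1996, 1), ("STEEL", 1996, 1), ("STEEL", 1996, 7),
]

# Step 1: count line items per (part_type, year, month).
frequency = Counter(lines)

# Step 2: for each (part_type, year), keep the month with the highest count.
best = {}
for (part_type, year, month), count in frequency.items():
    current = best.get((part_type, year))
    if current is None or count > current[1]:
        best[(part_type, year)] = (month, count)

# Step 3: count how often each month is the winner.
best_month_freq = Counter(month for month, _ in best.values())
print(sorted(best_month_freq.items()))
```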


For every year, identify what percentage of all revenue generated that year came from repeat customers who have made a previous purchase from the same supplier.

In [69]:
%%pydough

table = lines.CALCULATE(revenue = extended_price*(1-discount), supplier_key = supplier_key).order.CALCULATE(customer_key = customer_key, year = YEAR(order_date))

cust_supp = PARTITION(table, name = 't', by = (customer_key, supplier_key, year)
                ).CALCULATE(number_orders = COUNT(t), cust_sup_yearly_revenue = SUM(t.revenue))
                
yearly_rev = PARTITION(cust_supp, name = 'c', by = year
                       ).CALCULATE(total_year_rev = SUM(c.cust_sup_yearly_revenue), 
                                   repeat_rev = SUM(IFF(c.number_orders > 1, c.cust_sup_yearly_revenue, 0)))

result = yearly_rev.CALCULATE(year = year, repeat_cust_percentage = 100*repeat_rev/total_year_rev)

pydough.to_df(result)

Unnamed: 0,year,repeat_cust_percentage
0,1992,0.141876
1,1993,0.131191
2,1994,0.143702
3,1995,0.144899
4,1996,0.143942
5,1997,0.14754
6,1998,0.095502
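
The cell above treats a customer and supplier pair as "repeat" when it has more than one line item in the same year. The sketch below reproduces that logic in plain Python on hypothetical data, so that the two aggregation levels are visible. It mirrors the cell's approximation and does not check whether an earlier purchase exists in a previous year.

```python
from collections import defaultdict

# Hypothetical line items: (customer_key, supplier_key, year, revenue).
lines = [
    (1, "S1", 1996, 100.0),
    (1, "S1", 1996, 50.0),   # same customer, supplier, and year: repeat pair
    (2, "S1", 1996, 50.0),
    (2, "S2", 1997, 80.0),
]

# Step 1: count line items and sum revenue per (customer, supplier, year).
counts = defaultdict(int)
revenue = defaultdict(float)
for cust, supp, year, rev in lines:
    counts[(cust, supp, year)] += 1
    revenue[(cust, supp, year)] += rev

# Step 2: per year, sum all revenue and the revenue of pairs seen more than once.
total = defaultdict(float)
repeat = defaultdict(float)
for (cust, supp, year), rev in revenue.items():
    total[year] += rev
    if counts[(cust, supp, year)] > 1:
        repeat[year] += rev

# Step 3: express the repeat revenue as a percentage of the yearly total.
pct = {year: 100 * repeat[year] / total[year] for year in sorted(total)}
print(pct)
```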


Identify the 4 suppliers who have the highest total revenue generated by repeat customers who have already made a purchase from them. Include the suppliers' names, the repeat revenue, and the percentage of their total revenue that is from the repeat revenue.

In [75]:
%%pydough

table = lines.CALCULATE(revenue = extended_price*(1-discount), supplier_name = supplier.name).order.CALCULATE(customer_key = customer_key)

cust_supp = PARTITION(table, name = 't', by = (customer_key, supplier_name)
                ).CALCULATE(number_orders = COUNT(t), cust_sup_revenue = SUM(t.revenue))
                
sup_rev = PARTITION(cust_supp, name = 'c', by = supplier_name
                       ).CALCULATE(total_sup_rev = SUM(c.cust_sup_revenue), 
                                   repeat_rev = SUM(IFF(c.number_orders > 1, c.cust_sup_revenue, 0)))

result = sup_rev.CALCULATE(supplier_name = supplier_name, repeat_rev = repeat_rev, repeat_cust_percentage = 100*repeat_rev/total_sup_rev)

pydough.to_df(result)

Unnamed: 0,supplier_name,repeat_rev,repeat_cust_percentage
0,Supplier#000000001,202459.1100,0.884690
1,Supplier#000000002,106869.5124,0.550822
2,Supplier#000000003,284854.3567,1.415726
3,Supplier#000000004,59471.9000,0.284091
4,Supplier#000000005,197706.5540,0.950184
...,...,...,...
9995,Supplier#000009996,107918.7768,0.397780
9996,Supplier#000009997,46716.6585,0.191722
9997,Supplier#000009998,90250.3708,0.373343
9998,Supplier#000009999,176773.1088,0.747253
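
The output above lists every supplier, while the question asks for the top 4 by repeat revenue. As a plain-Python illustration of that final selection step, the sketch below takes the first five rows shown in the output and keeps the four with the highest `repeat_rev`. It only ranks those five displayed rows, so it is not the answer for the full data set.

```python
import heapq

# First five rows of the output above: (supplier_name, repeat_rev).
rows = [
    ("Supplier#000000001", 202459.1100),
    ("Supplier#000000002", 106869.5124),
    ("Supplier#000000003", 284854.3567),
    ("Supplier#000000004", 59471.9000),
    ("Supplier#000000005", 197706.5540),
]

# nlargest returns the rows in descending order of the key.
top4 = heapq.nlargest(4, rows, key=lambda row: row[1])
for name, repeat_rev in top4:
    print(name, repeat_rev)
```

In the PyDough cell itself, a similar restriction could probably be expressed with `TOP_K`, as other cells in this notebook do.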


Which 5 nations' suppliers generate the most total revenue from customers in other nations? Include the revenue from those international shipments and the percentage of all revenue in that nation that comes from them.

In [None]:
%%pydough

nations_revenue = nations.CALCULATE(nation_revenue = SUM(suppliers.lines.extended_price*(1-suppliers.lines.discount)), current_key = key)
                                    
matching = suppliers.lines.WHERE((order.customer.nation_key == current_key) == 0)
                                    
inter_rev = nations_revenue.CALCULATE(international_revenue = SUM(matching.extended_price*(1-matching.discount)))
                                                          
result = inter_rev.CALCULATE(name = name, nation_revenue = nation_revenue, international_revenue = international_revenue, 
                            international_pct = (100*international_revenue/nation_revenue)).TOP_K(5, international_revenue.DESC())

pydough.to_df(result)

Unnamed: 0,name,nation_revenue,international_revenue,international_pct
0,IRAQ,9473890000.0,9096169000.0,96.01303
1,ALGERIA,9216901000.0,8851360000.0,96.034011
2,PERU,9143177000.0,8781083000.0,96.039732
3,EGYPT,9139913000.0,8780982000.0,96.07293
4,CANADA,9046350000.0,8679203000.0,95.941491
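
As a cross-check of the logic, the same calculation can be sketched in plain Python on hypothetical data. Each line item carries the nation of its supplier and the nation of the ordering customer, and a shipment counts as international when the two differ. The nation names and amounts below are illustrative only.

```python
from collections import defaultdict

# Hypothetical line items: (supplier_nation, customer_nation, revenue).
lines = [
    ("PERU", "PERU", 100.0),
    ("PERU", "IRAQ", 300.0),
    ("IRAQ", "PERU", 50.0),
    ("IRAQ", "IRAQ", 150.0),
]

# Sum total and international revenue per supplier nation.
total = defaultdict(float)
international = defaultdict(float)
for supp_nation, cust_nation, rev in lines:
    total[supp_nation] += rev
    if supp_nation != cust_nation:
        international[supp_nation] += rev

# Build (name, nation_revenue, international_revenue, international_pct) rows
# and keep the 5 nations with the highest international revenue.
result = sorted(
    (
        (n, total[n], international[n], 100 * international[n] / total[n])
        for n in total
    ),
    key=lambda row: row[2],
    reverse=True,
)[:5]
print(result)
```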


In [1]:
dfcompare.compare_df(pydough_output, sql_output, query_category="a", question="a")

NameError: name 'dfcompare' is not defined