Initial notebook for teaching purposes with SQL + PyDough

This notebook uses TPC-H schema metadata and a SQLite database connection for all examples.

We will now do the setup steps to load PyDough and its Jupyter extension:

In [2]:
import pydough

pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH")
pydough.active_session.connect_database("sqlite", database="../../tpch.db")
%load_ext pydough.jupyter_extensions

import pandas as pd
import numpy as np
import sqlite3 as sql
connection = sql.connect("../../tpch.db")

Let's start with a very simple SQL request that looks for all of the part names available in the TPC-H database:

In [3]:
query = '''
 SELECT
     p.p_name
 FROM
     part p
'''
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,P_NAME
0,goldenrod lavender spring chocolate lace
1,blush thistle blue yellow saddle
2,spring green yellow purple cornsilk
3,cornflower chocolate smoke green pink
4,forest brown coral puff cream
...,...
199995,cream navajo saddle dodger navy
199996,peru maroon snow grey chartreuse
199997,pink wheat powder burlywood snow
199998,goldenrod drab brown salmon mint


To convert this into PyDough code, we first establish the context we will be working in, in this case the parts table. Then, inside the parentheses, we list the information we want from that context.

Lastly, we call pydough.to_df() to execute the SQL code that PyDough generates and get the result back as a DataFrame. To take a look at the generated SQL code instead, we can call pydough.to_sql() with the same output as its parameter.

In [4]:
%%pydough

output = parts(name)

#pydough.to_sql(output)
pydough.to_df(output)

Unnamed: 0,name
0,goldenrod lavender spring chocolate lace
1,blush thistle blue yellow saddle
2,spring green yellow purple cornsilk
3,cornflower chocolate smoke green pink
4,forest brown coral puff cream
...,...
199995,cream navajo saddle dodger navy
199996,peru maroon snow grey chartreuse
199997,pink wheat powder burlywood snow
199998,goldenrod drab brown salmon mint


We can also use a filter to specify which results we want. In SQL we use a WHERE clause, and PyDough conveniently offers WHERE as well. The following code selects only the parts that have a retail price higher than 2000.

Since our current context is the parts table, we can reference the retail price column in the WHERE operation and apply numeric comparisons to it. In this case we use >, but <, ==, and != work as well.

PyDough does not yet support the Python keywords and, or, not, and in inside expressions, nor chained comparisons like (1 < x < 5). Conditions are combined with operators instead, such as the & operator used in later cells of this notebook.
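This kind of restriction is typical of Python libraries that build queries through operator overloading: a class can redefine & and >, but not the and keyword. The following is a minimal sketch of that mechanism, not PyDough's actual implementation. The Expr class and its sql attribute are hypothetical names invented for illustration.

```python
class Expr:
    """Hypothetical query-expression node (illustration only, not PyDough)."""

    def __init__(self, sql):
        self.sql = sql

    # Comparison operators can be overloaded to build SQL text.
    def __gt__(self, other):
        return Expr(f"({self.sql} > {other!r})")

    def __lt__(self, other):
        return Expr(f"({self.sql} < {other!r})")

    # & can be overloaded, so it can stand in for SQL's AND.
    def __and__(self, other):
        return Expr(f"({self.sql} AND {other.sql})")

    # `and`, `or`, `not` and chained comparisons call bool() on the operand,
    # which cannot return an expression, so this sketch rejects it outright.
    def __bool__(self):
        raise TypeError("use & instead of 'and' or chained comparisons")


retail_price = Expr("retail_price")
size = Expr("size")

combined = (retail_price > 2000) & (size < 10)
print(combined.sql)  # ((retail_price > 2000) AND (size < 10))

try:
    1 < retail_price < 5  # Python expands this to (1 < x) and (x < 5)
    chained_ok = True
except TypeError:
    chained_ok = False
print(chained_ok)  # False
```

The parentheses around each comparison matter, because & binds more tightly than > and < in Python.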

In [5]:
%%pydough

parts_list = parts.WHERE(retail_price > 2000)
pydough.to_sql(parts_list)

'SELECT key, name, manufacturer, brand, part_type, size, container, retail_price, comment FROM (SELECT p_name AS name, p_container AS container, p_brand AS brand, p_size AS size, p_partkey AS key, p_type AS part_type, p_comment AS comment, p_retailprice AS retail_price, p_mfgr AS manufacturer FROM main.PART) WHERE retail_price > 2000'
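The generated SQL looks more verbose than a handwritten query because it first renames every column in a subquery and only then applies the filter. Here is a minimal sketch to check that both forms return the same rows, assuming a tiny in-memory stand-in for the PART table (the two sample rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE part (p_partkey INTEGER, p_name TEXT, p_mfgr TEXT, "
    "p_brand TEXT, p_type TEXT, p_size INTEGER, p_container TEXT, "
    "p_retailprice REAL, p_comment TEXT)"
)
# Made-up sample rows: one below and one above the 2000 threshold.
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "cheap part", "Manufacturer#1", "Brand#11", "PROMO PLATED BRASS",
         5, "LG CASE", 901.0, "c1"),
        (2, "pricey part", "Manufacturer#5", "Brand#55", "PROMO PLATED BRASS",
         24, "LG CASE", 2098.99, "c2"),
    ],
)

# The SQL string printed by pydough.to_sql in the cell above, copied verbatim.
generated = (
    "SELECT key, name, manufacturer, brand, part_type, size, container, "
    "retail_price, comment FROM (SELECT p_name AS name, p_container AS "
    "container, p_brand AS brand, p_size AS size, p_partkey AS key, p_type AS "
    "part_type, p_comment AS comment, p_retailprice AS retail_price, p_mfgr AS "
    "manufacturer FROM main.PART) WHERE retail_price > 2000"
)

# A handwritten query with the same columns in the same order.
handwritten = (
    "SELECT p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, "
    "p_retailprice, p_comment FROM part WHERE p_retailprice > 2000"
)

rows_generated = conn.execute(generated).fetchall()
rows_handwritten = conn.execute(handwritten).fetchall()
print(rows_generated == rows_handwritten)  # True
```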

In [6]:
%%pydough

priciest_parts = parts.TOP_K(100, by=retail_price.DESC())
pydough.to_df(priciest_parts)



Unnamed: 0,key,name,manufacturer,brand,part_type,size,container,retail_price,comment
0,199999,goldenrod drab brown salmon mint,Manufacturer#5,Brand#55,PROMO PLATED BRASS,24,LG CASE,2098.99,he quickly ironic
1,198999,forest azure almond antique violet,Manufacturer#3,Brand#32,SMALL POLISHED BRASS,49,JUMBO BAG,2097.99,"ven, ir"
2,199998,pink wheat powder burlywood snow,Manufacturer#5,Brand#52,MEDIUM BURNISHED BRASS,49,LG BOX,2097.99,. special deposits hag
3,197999,cornflower almond powder forest slate,Manufacturer#4,Brand#44,MEDIUM PLATED COPPER,34,JUMBO PKG,2096.99,"ges. even,"
4,198998,thistle lavender linen azure sandy,Manufacturer#3,Brand#35,STANDARD POLISHED STEEL,18,WRAP CAN,2096.99,ickly af
...,...,...,...,...,...,...,...,...,...
95,190995,green dodger pale cyan lace,Manufacturer#4,Brand#41,STANDARD BRUSHED TIN,34,JUMBO BOX,2085.99,efully bold instr
96,191994,cornflower frosted cyan violet green,Manufacturer#1,Brand#15,ECONOMY BURNISHED STEEL,1,LG CASE,2085.99,thely pendi
97,192993,rose burlywood spring white orchid,Manufacturer#2,Brand#23,PROMO BRUSHED BRASS,9,LG PKG,2085.99,across th
98,193992,olive smoke goldenrod pink violet,Manufacturer#3,Brand#33,ECONOMY ANODIZED BRASS,19,WRAP PACK,2085.99,riously idle foxes


Let's say a customer comes in looking for a brass instrument to play, and we want a list of the brass items so we can determine whether any of them would suffice. They also wish to know each item's supplier and its corresponding price.

We will now fulfill the request with a simple query that returns the part name, the supplier name, and the supply cost of every part whose type includes BRASS, like so:

In [7]:
query = '''
 SELECT
     p.p_name,
     s.s_name,
     ps.ps_supplycost
 FROM
     part p
 JOIN
     partsupp ps ON p.p_partkey = ps.ps_partkey
 JOIN
     supplier s ON ps.ps_suppkey = s.s_suppkey
 WHERE
     p.p_type LIKE '%BRASS%';
'''
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,P_NAME,S_NAME,PS_SUPPLYCOST
0,blush thistle blue yellow saddle,Supplier#000000003,378.49
1,blush thistle blue yellow saddle,Supplier#000002503,915.27
2,blush thistle blue yellow saddle,Supplier#000005003,438.37
3,blush thistle blue yellow saddle,Supplier#000007503,306.39
4,spring green yellow purple cornsilk,Supplier#000000004,920.92
...,...,...,...
160227,pink wheat powder burlywood snow,Supplier#000007556,520.48
160228,goldenrod drab brown salmon mint,Supplier#000010000,606.64
160229,goldenrod drab brown salmon mint,Supplier#000002519,231.95
160230,goldenrod drab brown salmon mint,Supplier#000005038,457.93


This query includes some of the basic building blocks of SQL: basic filtering with a WHERE clause, and joins to look up information across more than one table. It involves the tables part, partsupp and supplier.

A PyDough translation for this is the following:

In [8]:
%%pydough

#Setting up the tables that we will need information from in the context
tables = suppliers.supply_records.part

#The condition we would like the content to fulfill
filter = tables.WHERE(CONTAINS(part_type, "BRASS"))

#The information we want to receive in the resulting table
output = filter(product_name=name, supplier_name = BACK(2).name, supply_cost = BACK(1).supplycost)

#Execute the PyDough code
pydough.to_df(output)

Unnamed: 0,product_name,supplier_name,supply_cost
0,blush thistle blue yellow saddle,Supplier#000000003,378.49
1,blush thistle blue yellow saddle,Supplier#000002503,915.27
2,blush thistle blue yellow saddle,Supplier#000005003,438.37
3,blush thistle blue yellow saddle,Supplier#000007503,306.39
4,spring green yellow purple cornsilk,Supplier#000000004,920.92
...,...,...,...
160227,pink wheat powder burlywood snow,Supplier#000007556,520.48
160228,goldenrod drab brown salmon mint,Supplier#000010000,606.64
160229,goldenrod drab brown salmon mint,Supplier#000002519,231.95
160230,goldenrod drab brown salmon mint,Supplier#000005038,457.93


As we can see, the language is quite different, but in reality what PyDough does is create a query that satisfies the conditions, so let's break down the code we just executed.

First we have the assignment of the tables we are going to need. We can navigate them with the BACK() function, which lets us access data from earlier contexts. In this case we are currently in the part table: BACK(1) takes us to supply records, the context we entered just before it, and BACK(2) takes us to the suppliers context.

Next comes the WHERE filter, where we provide the column to compare and the string or other data we want to compare it to. In the cell above the type only has to contain BRASS, so we use CONTAINS. In the cells that follow, the customer asks for the product type to be exactly STANDARD BRUSHED TIN, so we compare with == instead.

Lastly, we construct the output by indicating which columns we want included, as well as their names. We use the BACK() function to include the supply cost and the supplier name from earlier contexts.
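One way to picture BACK() is as a set of nested loops, where each loop variable is one of the contexts we walked through. The sketch below is only an analogy in plain Python with made-up sample data. It is not how PyDough executes the query, and the dictionary layout is an assumption chosen to mirror suppliers.supply_records.part.

```python
# Made-up sample data shaped like suppliers -> supply_records -> part.
suppliers = [
    {"name": "Supplier#A", "supply_records": [
        {"supplycost": 10.0,
         "part": {"name": "blue widget", "part_type": "PROMO PLATED BRASS"}},
        {"supplycost": 20.0,
         "part": {"name": "red widget", "part_type": "STANDARD BRUSHED TIN"}},
    ]},
    {"name": "Supplier#B", "supply_records": [
        {"supplycost": 5.0,
         "part": {"name": "blue widget", "part_type": "PROMO PLATED BRASS"}},
    ]},
]

rows = []
for supplier in suppliers:                     # plays the role of BACK(2)
    for record in supplier["supply_records"]:  # plays the role of BACK(1)
        part = record["part"]                  # the current context
        if "BRASS" in part["part_type"]:       # like CONTAINS(part_type, "BRASS")
            rows.append((part["name"], supplier["name"], record["supplycost"]))

print(rows)
# [('blue widget', 'Supplier#A', 10.0), ('blue widget', 'Supplier#B', 5.0)]
```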

In [9]:
query = '''
SELECT
    s.s_name
 FROM
    supplier s
 JOIN
    partsupp ps ON s.s_suppkey = ps.ps_suppkey
JOIN 
    part p on ps.ps_partkey = p.p_partkey
WHERE
    p.p_type = 'STANDARD BRUSHED TIN'
'''
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,S_NAME
0,Supplier#000000032
1,Supplier#000002532
2,Supplier#000005032
3,Supplier#000007532
4,Supplier#000000092
...,...
5543,Supplier#000009864
5544,Supplier#000002449
5545,Supplier#000004968
5546,Supplier#000007487


In [10]:
%%pydough

#Purpose: List of suppliers that have a product with the type STANDARD BRUSHED TIN

#This is similar to a join operation of the tables
parts_supp_partsupp= suppliers.supply_records.part
#The where clause we add onto the previous clause's result
filter = parts_supp_partsupp.WHERE(part_type == "STANDARD BRUSHED TIN")

#Generating the SQL for the PyDough code (pydough.to_df would execute it in the database)
pydough.to_sql(filter(BACK(2).name, part_type, name))

"SELECT part_type, name AS name FROM (SELECT part_key FROM (SELECT s_suppkey AS key FROM main.SUPPLIER) INNER JOIN (SELECT ps_partkey AS part_key, ps_suppkey AS supplier_key FROM main.PARTSUPP) ON key = supplier_key) INNER JOIN (SELECT p_name AS name, p_partkey AS key, p_type AS part_type FROM main.PART) ON part_key = key WHERE part_type = 'STANDARD BRUSHED TIN'"

In this next example we will see a query from a customer who wants to acquire Lavender Spring parts. This customer wants to know which of the available suppliers offers the best supply cost. Whilst this is a very simple request, the query needs several pieces: it has to order the results, filter by the item name, and display the other relevant information for the client. In the SQL version this requires a numeric comparison and a text comparison, using the operators >= and LIKE.

In [11]:
query = '''
SELECT
    p.p_name,
    s.s_name,
    ps.ps_supplycost,
    ps.ps_availqty
 FROM
    supplier s
 JOIN
    partsupp ps ON s.s_suppkey = ps.ps_suppkey
JOIN 
    part p on ps.ps_partkey = p.p_partkey
WHERE
    p.p_name LIKE 'LAVENDER SPRING%' AND
    ps.ps_availqty >= 10
ORDER BY
    ps.ps_supplycost ASC
LIMIT 1;
'''
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,P_NAME,S_NAME,PS_SUPPLYCOST,PS_AVAILQTY
0,lavender spring lime puff powder,Supplier#000004053,5.96,8183


In the PyDough version we can see how the WHERE function takes these comparisons as parameters. For text comparisons we can use CONTAINS, STARTSWITH and ENDSWITH, giving each one the column to be compared and the string we want to compare it to.
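The to_sql outputs in this notebook show CONTAINS being translated into a LIKE pattern with % on both sides. Assuming STARTSWITH and ENDSWITH map to the analogous prefix and suffix patterns (an assumption, not something this notebook prints), the three variants behave as in the following sketch. It uses an in-memory table with three part names taken from the outputs above, and it also shows that SQLite's LIKE ignores ASCII case by default, which is why an uppercase pattern matches the lowercase names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (p_name TEXT)")
conn.executemany(
    "INSERT INTO part VALUES (?)",
    [
        ("lavender spring lime puff powder",),
        ("goldenrod lavender spring chocolate lace",),
        ("cream navajo saddle dodger navy",),
    ],
)


def names_like(pattern):
    """Return the part names matching a LIKE pattern, sorted by name."""
    cursor = conn.execute(
        "SELECT p_name FROM part WHERE p_name LIKE ? ORDER BY p_name",
        (pattern,),
    )
    return [row[0] for row in cursor]


contains = names_like("%LAVENDER SPRING%")  # substring match
starts = names_like("LAVENDER SPRING%")     # prefix match
ends = names_like("%LACE")                  # suffix match

print(contains)  # both lavender spring parts
print(starts)    # only the name that begins with "lavender spring"
print(ends)      # only the name that ends with "lace"
```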

In [12]:
%%pydough

tables = suppliers.supply_records

filter = tables.WHERE(availqty >= 10).part.WHERE(CONTAINS(name, "LAVENDER SPRING")).TOP_K(1, by=BACK(1).supplycost.ASC())

output = filter(product_name=name, supplier_name = BACK(2).name, supply_cost = BACK(1).supplycost, avail_quantity = BACK(1).availqty)

pydough.to_sql(output)

"SELECT product_name, supplier_name, supply_cost, avail_quantity FROM (SELECT ordering_0, availqty AS avail_quantity, name_4 AS product_name, name AS supplier_name, supplycost AS supply_cost FROM (SELECT name, availqty, supplycost, name_4, supplycost AS ordering_0 FROM (SELECT _table_alias_0.name AS name, availqty, supplycost, _table_alias_1.name AS name_4 FROM (SELECT name, availqty, part_key, supplycost FROM (SELECT s_name AS name, s_suppkey AS key FROM main.SUPPLIER) INNER JOIN (SELECT ps_partkey AS part_key, ps_availqty AS availqty, ps_suppkey AS supplier_key, ps_supplycost AS supplycost FROM main.PARTSUPP) ON key = supplier_key WHERE availqty >= 10) AS _table_alias_0 INNER JOIN (SELECT p_name AS name, p_partkey AS key FROM main.PART) AS _table_alias_1 ON part_key = key) WHERE name_4 LIKE '%LAVENDER SPRING%') ORDER BY ordering_0 LIMIT 1) ORDER BY ordering_0"

In [13]:
%%pydough

##This is a test cell to try and write the same code without the use of any backs while maintaining the structure I have been working with
tables = suppliers.supply_records

filter = tables.WHERE(availqty >= 10).ORDER_BY(supply_cost.DESC()).part.WHERE(STARTSWITH(name, "LAVENDER SPRING"))

output = filter(product_name=name, supplier_name = BACK(2).name, supply_cost = BACK(1).supplycost, avail_quantity = BACK(1).availqty)

pydough.to_df(output)

PyDoughQDAGException: Unrecognized term of simple table collection 'supply_records' in graph 'TPCH': 'supply_cost'

The next situation involves a client who wants a very specific part: a product by Brand#41, a green dodger with a bold look. The client wants to know the top 10 suppliers ordered by available quantity, along with the supply cost and the country each supplier comes from, since apparently batches from Egypt are lower quality than the rest and they would like to avoid them depending on the price difference. This will require an ORDER BY, a LIMIT, several joins, and the nation name, which belongs to a different table.

In [14]:
query = '''
SELECT
    p.p_name,
    s.s_name,
    ps.ps_supplycost,
    ps.ps_availqty,
    n.n_name
 FROM
    supplier s
 JOIN
    partsupp ps ON s.s_suppkey = ps.ps_suppkey
JOIN 
    part p on ps.ps_partkey = p.p_partkey
JOIN
    nation n on s.s_nationkey = n.n_nationkey
WHERE
    p.p_name LIKE '%GREEN DODGER%' AND
    p.p_brand LIKE '%Brand#41%' AND
    p.p_comment LIKE '%bold%'
ORDER BY
    ps.ps_availqty DESC
LIMIT 10;
'''
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,P_NAME,S_NAME,PS_SUPPLYCOST,PS_AVAILQTY,N_NAME
0,green dodger pale cyan lace,Supplier#000003515,297.36,9171,IRAN
1,firebrick cream green dodger coral,Supplier#000004698,299.32,7323,MOZAMBIQUE
2,firebrick cream green dodger coral,Supplier#000009674,852.27,4422,EGYPT
3,green dodger pale cyan lace,Supplier#000000996,137.78,4086,GERMANY
4,firebrick cream green dodger coral,Supplier#000002186,571.28,3836,ETHIOPIA
5,firebrick cream green dodger coral,Supplier#000007162,636.88,3487,VIETNAM
6,green dodger pale cyan lace,Supplier#000008553,669.27,1913,FRANCE
7,green dodger pale cyan lace,Supplier#000006034,904.58,501,GERMANY


As we can see, PyDough expresses LIMIT together with ORDER BY in a single function, TOP_K. It receives two parameters: the number of elements to keep, and the column to order by together with its direction, ASC or DESC.
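The pairing of ORDER BY and LIMIT that TOP_K stands for can be checked on a small scale. A minimal sketch, using an in-memory table with made-up rows, compares the SQL form against sorting and slicing in plain Python:

```python
import sqlite3

# Made-up (supplier key, available quantity) rows.
rows = [(1, 4422), (2, 9171), (3, 501), (4, 7323)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE partsupp (ps_suppkey INTEGER, ps_availqty INTEGER)")
conn.executemany("INSERT INTO partsupp VALUES (?, ?)", rows)

# SQL: order by quantity, descending, and keep the first two rows.
top2_sql = conn.execute(
    "SELECT ps_suppkey, ps_availqty FROM partsupp "
    "ORDER BY ps_availqty DESC LIMIT 2"
).fetchall()

# Plain Python: sort by the same key and slice off the first two rows.
top2_py = sorted(rows, key=lambda row: row[1], reverse=True)[:2]

print(top2_sql)  # [(2, 9171), (4, 7323)]
print(top2_py)   # [(2, 9171), (4, 7323)]
```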

In [33]:
%%pydough

tables = suppliers.supply_records.part

filter = tables.WHERE(CONTAINS(name, "GREEN DODGER") & (brand == "Brand#41") & CONTAINS(comment, "bold")).TOP_K(10, by=BACK(1).availqty.DESC())

output = filter(product_name=name, supplier_name = BACK(2).name, supply_cost = BACK(1).supplycost, avail_quantity = BACK(1).availqty, country_of_origin = BACK(2).nation.name)

pydough.to_df(output)

NotImplementedError: BackReferenceCollection

In the next query I want to know how many parts each supplier handles, with the supplier's name in the first column and the count in the second. This will require the use of COUNT and GROUP BY in the SQL version of the code.

In [16]:
query = '''
SELECT
    s.s_name,
    COUNT(*)
 FROM
    partsupp ps
 JOIN
     supplier s ON s.s_suppkey = ps.ps_suppkey
 GROUP BY
     s.s_name
'''
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,S_NAME,COUNT(*)
0,Supplier#000000001,80
1,Supplier#000000002,80
2,Supplier#000000003,80
3,Supplier#000000004,80
4,Supplier#000000005,80
...,...,...
9995,Supplier#000009996,80
9996,Supplier#000009997,80
9997,Supplier#000009998,80
9998,Supplier#000009999,80


In the PyDough equivalent we can see that COUNT is still used when building the output, much as we would use it in SQL, whilst the grouping is very different. SQL's GROUP BY corresponds to PyDough's PARTITION, which is used by providing 3 parameters:
1. The table to be partitioned
2. The name we will use to refer to the partitioned table and reference it from now on
3. The column we want to group the table by
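The grouping described in this list can be pictured as bucketing rows by a key and then aggregating each bucket. The sketch below uses only the standard library and made-up sample rows to compare SQL's GROUP BY with the same bucketing done by hand. It illustrates the idea and does not reproduce PyDough's internals.

```python
import sqlite3
from collections import defaultdict

# Made-up sample rows.
suppliers = [(1, "Supplier#000000001"), (2, "Supplier#000000002")]
supply_records = [(10, 1), (11, 1), (12, 2)]  # (part key, supplier key)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier (s_suppkey INTEGER, s_name TEXT)")
conn.execute("CREATE TABLE partsupp (ps_partkey INTEGER, ps_suppkey INTEGER)")
conn.executemany("INSERT INTO supplier VALUES (?, ?)", suppliers)
conn.executemany("INSERT INTO partsupp VALUES (?, ?)", supply_records)

# SQL: join, group by supplier name, count the rows in each group.
sql_counts = dict(
    conn.execute(
        "SELECT s.s_name, COUNT(*) FROM partsupp ps "
        "JOIN supplier s ON s.s_suppkey = ps.ps_suppkey "
        "GROUP BY s.s_name"
    ).fetchall()
)

# By hand: one bucket per supplier name, then count each bucket.
name_by_key = dict(suppliers)
buckets = defaultdict(list)
for part_key, supplier_key in supply_records:
    buckets[name_by_key[supplier_key]].append(part_key)
py_counts = {name: len(parts) for name, parts in buckets.items()}

print(sql_counts == py_counts)  # True
```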

In [17]:
%%pydough

tables = supply_records.supplier

filter = PARTITION(tables, name="t", by=(name))

output = filter(name, products_by_supp = COUNT(t.key))

pydough.to_sql(output)

'SELECT name, COALESCE(agg_0, 0) AS products_by_supp FROM (SELECT name, COUNT(key) AS agg_0 FROM (SELECT key, name FROM (SELECT ps_suppkey AS supplier_key FROM main.PARTSUPP) INNER JOIN (SELECT s_name AS name, s_suppkey AS key FROM main.SUPPLIER) ON supplier_key = key) GROUP BY name)'

Query: Find the average order amount for each customer segment within each nation. (The SQL below labels the nation name column region_name.)

In [18]:
query = '''
SELECT
    n.n_name AS region_name,
    c.c_mktsegment AS customer_segment,
    AVG(l.l_extendedprice * (1 - l.l_discount)) AS average_order_amount
FROM
    CUSTOMER AS c
JOIN
    ORDERS AS o ON c.c_custkey = o.o_custkey
JOIN
    LINEITEM AS l ON o.o_orderkey = l.l_orderkey
JOIN
    NATION AS n ON c.c_nationkey = n.n_nationkey
JOIN
    REGION AS r ON n.n_regionkey = r.r_regionkey
GROUP BY
    n.n_name,
    c.c_mktsegment
ORDER BY
    n.n_name,
    c.c_mktsegment;
'''
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,region_name,customer_segment,average_order_amount
0,ALGERIA,AUTOMOBILE,36345.368392
1,ALGERIA,BUILDING,36573.715297
2,ALGERIA,FURNITURE,36267.347935
3,ALGERIA,HOUSEHOLD,36374.753854
4,ALGERIA,MACHINERY,36356.958836
...,...,...,...
120,VIETNAM,AUTOMOBILE,36378.471551
121,VIETNAM,BUILDING,36285.335072
122,VIETNAM,FURNITURE,36434.419682
123,VIETNAM,HOUSEHOLD,36307.110974


In [19]:
%%pydough

tables = regions.nations.customers.orders.lines(countries_names = tables.BACK(3).name , customer_mktsgm = tables.BACK(2).mktsegment)

partitioned_tables = PARTITION(tables, name="tabs", by=(countries_names, customer_mktsgm))(average_order_amount = AVG(tabs.extended_price * (1 - tabs.discount)))

ordered_tables= partitioned_tables.ORDER_BY(countries_names.ASC(), customer_mktsgm.ASC())

output = ordered_tables(countries_names, customer_mktsgm, average_order_amount)

pydough.to_sql(output)

'SELECT countries_names, customer_mktsgm, average_order_amount FROM (SELECT customer_mktsgm, countries_names, AVG(extended_price * (1 - discount)) AS average_order_amount, countries_names AS ordering_1, customer_mktsgm AS ordering_2 FROM (SELECT discount, extended_price, name_3 AS countries_names, mktsegment AS customer_mktsgm FROM (SELECT name_3, mktsegment, key AS key_8 FROM (SELECT name_3, key AS key_5, mktsegment FROM (SELECT _table_alias_1.key AS key_2, name AS name_3 FROM (SELECT r_regionkey AS key FROM main.REGION) AS _table_alias_0 INNER JOIN (SELECT n_name AS name, n_regionkey AS region_key, n_nationkey AS key FROM main.NATION) AS _table_alias_1 ON _table_alias_0.key = region_key) INNER JOIN (SELECT c_nationkey AS nation_key, c_mktsegment AS mktsegment, c_custkey AS key FROM main.CUSTOMER) ON key_2 = nation_key) INNER JOIN (SELECT o_custkey AS customer_key, o_orderkey AS key FROM main.ORDERS) ON key_5 = customer_key) INNER JOIN (SELECT l_discount AS discount, l_extendedprice A

Find the suppliers of each region that supply STEEL parts with a size bigger than 20, and provide the number of parts each one supplies that meet those conditions.

In [20]:
query = '''
SELECT
    s.S_NAME AS SupplierName,
    n.N_NAME AS Nation,
    r.R_NAME AS Region,
    COUNT(DISTINCT p.P_PARTKEY) AS NumberOfPartsSupplied
FROM
    SUPPLIER s
JOIN
    NATION n ON s.S_NATIONKEY = n.N_NATIONKEY
JOIN
    REGION r ON n.N_REGIONKEY = r.R_REGIONKEY
JOIN
    PARTSUPP ps ON s.S_SUPPKEY = ps.PS_SUPPKEY
JOIN
    PART p ON ps.PS_PARTKEY = p.P_PARTKEY
WHERE
    p.P_SIZE > 20  AND p.P_TYPE LIKE '%STEEL%'  -- Example filtering criteria
GROUP BY
    s.S_NAME, n.N_NAME, r.R_NAME
ORDER BY
    r.R_NAME, n.N_NAME, s.S_NAME;
'''
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,SupplierName,Nation,Region,NumberOfPartsSupplied
0,Supplier#000000024,ALGERIA,AFRICA,18
1,Supplier#000000028,ALGERIA,AFRICA,10
2,Supplier#000000037,ALGERIA,AFRICA,10
3,Supplier#000000118,ALGERIA,AFRICA,5
4,Supplier#000000161,ALGERIA,AFRICA,7
...,...,...,...,...
9993,Supplier#000009855,SAUDI ARABIA,MIDDLE EAST,13
9994,Supplier#000009861,SAUDI ARABIA,MIDDLE EAST,8
9995,Supplier#000009891,SAUDI ARABIA,MIDDLE EAST,8
9996,Supplier#000009896,SAUDI ARABIA,MIDDLE EAST,6


In [None]:
%%pydough

tables = regions.nations.suppliers.supply_records.part(sup_name = BACK(2).name, nation_name = BACK(3).name, region_name = BACK(4).name)

filtered_tables = tables.WHERE((size > 20) & (CONTAINS(part_type, 'STEEL')))

partitioned_tables = PARTITION(filtered_tables, name = "tabs", by=(sup_name, nation_name, region_name))(number_of_parts_supplied = NDISTINCT(tabs.key))

ordered_tables = partitioned_tables.ORDER_BY(region_name.ASC(), nation_name.ASC(), sup_name.ASC())

output = ordered_tables(sup_name, nation_name, region_name, number_of_parts_supplied)

print(pydough.to_sql(output))



       number_of_parts_supplied
0                             1
1                             1
2                             1
3                             1
4                             1
...                         ...
23719                         1
23720                         1
23721                         1
23722                         1
23723                         1

[23724 rows x 1 columns]


Analyze the revenue by part, category and year of Brand#13

In [22]:
query = '''
SELECT
    SUBSTR(CAST(o.O_ORDERDATE AS VARCHAR(10)), 1, 4) AS OrderYear,  -- Extract the year
    p.P_MFGR AS PartManufacturer,
    SUM(l.L_EXTENDEDPRICE * (1 - l.L_DISCOUNT)) AS TotalRevenue
FROM
    ORDERS o
JOIN
    LINEITEM l ON o.O_ORDERKEY = l.L_ORDERKEY
JOIN
    PART p ON l.L_PARTKEY = p.P_PARTKEY
WHERE
    p.P_BRAND LIKE 'Brand#13%' -- Example brand filter
GROUP BY
    OrderYear, p.P_MFGR
ORDER BY
    OrderYear, TotalRevenue DESC;
'''
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,OrderYear,PartManufacturer,TotalRevenue
0,1992,Manufacturer#1,1328763000.0
1,1993,Manufacturer#1,1317337000.0
2,1994,Manufacturer#1,1321738000.0
3,1995,Manufacturer#1,1316251000.0
4,1996,Manufacturer#1,1327652000.0
5,1997,Manufacturer#1,1331357000.0
6,1998,Manufacturer#1,767687700.0


In [None]:
%%pydough

tables = orders.lines.part

filtered_tables = tables.WHERE(brand == "Brand#13")(order_year = YEAR(BACK(2).order_date))

total_revenue = SUM(filtered_tables.BACK(1).)

partitioned_tables = PARTITION(filtered_tables, name = "tabs", by=(order_year, manufacturer))(number_of_parts_supplied = NDISTINCT(tabs.key))

#ordered_tables = partitioned_tables.ORDER_BY(region_name.ASC(), nation_name.ASC(), sup_name.ASC())

#output = ordered_tables(sup_name, nation_name, region_name, number_of_parts_supplied)

print(pydough.to_df(partitioned_tables))

SyntaxError: invalid syntax (<unknown>, line 6)

Find which nation has the highest average order amount

In [None]:
query = '''
SELECT
    n.N_NAME
FROM
    NATION n
JOIN
    CUSTOMER c ON c.C_NATIONKEY = n.N_NATIONKEY
JOIN (
        -- Total amount of each order, computed from its line items
        SELECT
            o.O_CUSTKEY,
            SUM(l.L_EXTENDEDPRICE * (1 - l.L_DISCOUNT)) AS order_total
        FROM
            ORDERS o
        JOIN
            LINEITEM l ON l.L_ORDERKEY = o.O_ORDERKEY
        GROUP BY
            o.O_ORDERKEY, o.O_CUSTKEY
    ) t ON t.O_CUSTKEY = c.C_CUSTKEY
GROUP BY
    n.N_NATIONKEY, n.N_NAME
ORDER BY
    AVG(t.order_total) DESC
LIMIT 1;
'''
df = pd.read_sql_query(query, connection)
df