# PyDough Operations

This notebook provides an overview of the PyDough operations that can be used as building blocks for solving analytics questions. It is not intended to be exhaustive; in particular, the list of functions is incomplete. However, we believe these operations are a solid foundation for getting started.

In [1]:
%load_ext pydough_jupyter_extensions

import pydough
# Setup demo metadata
pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH");
pydough.active_session.connect_database("sqlite", database="../tpch.db");

## Collections

The natural place to start is accessing a collection. A collection in PyDough is an abstraction for any "document", but in most cases represents a table. Starting with the TPC-H schema, if we want to access the regions table, we will use our corresponding PyDough collection.

In [2]:
%%pydough

print(pydough.to_sql(regions))
pydough.to_df(regions)

SELECT r_regionkey AS key, r_name AS name, r_comment AS comment FROM main.REGION


Unnamed: 0,key,name,comment
0,0,AFRICA,lar deposits. blithely final packages cajole. ...
1,1,AMERICA,"hs use ironic, even requests. s"
2,2,ASIA,ges. thinly even pinto beans ca
3,3,EUROPE,ly final courts cajole furiously final excuse
4,4,MIDDLE EAST,uickly special accounts cajole carefully blith...


Collections contain properties, which correspond either to the entries within a document or to a subcollection (another document that can be reached from the current document). This is explored in more detail in our notebook on metadata, but what is important to understand is that the path between collections is how we integrate data across multiple tables.

For example, each region is associated with 1 or more nations, so rather than just looking at the region we can look at "each nation for each region". This results in outputting 1 entry per nation.

In [3]:
%%pydough

print(pydough.to_sql(regions.nations))
pydough.to_df(regions.nations)

SELECT _table_alias_1.key AS key, name AS name, region_key, comment AS comment FROM (SELECT r_regionkey AS key FROM main.REGION) AS _table_alias_0 INNER JOIN (SELECT n_comment AS comment, n_regionkey AS region_key, n_nationkey AS key, n_name AS name FROM main.NATION) AS _table_alias_1 ON _table_alias_0.key = region_key


Unnamed: 0,key,name,region_key,comment
0,0,ALGERIA,0,haggle. carefully final deposits detect slyly...
1,1,ARGENTINA,1,al foxes promise slyly according to the regula...
2,2,BRAZIL,1,y alongside of the pending deposits. carefully...
3,3,CANADA,1,"eas hang ironic, silent packages. slyly regula..."
4,4,EGYPT,4,y above the carefully unusual theodolites. fin...
5,5,ETHIOPIA,0,ven packages wake quickly. regu
6,6,FRANCE,3,"refully final requests. regular, ironi"
7,7,GERMANY,3,"l platelets. regular accounts x-ray: unusual, ..."
8,8,INDIA,2,ss excuses cajole slyly across the packages. d...
9,9,INDONESIA,2,slyly express asymptotes. regular deposits ha...


Notice how in the generated SQL we create a join between `region` and `nation`. The metadata holds this relationship, effectively abstracting joins away from the developer whenever possible.
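
As a rough plain-Python analogy (not how PyDough executes the query), `regions.nations` behaves like an inner join that keeps the nation rows. The sketch below uses only the rows shown in the outputs above; `regions_nations` is a hypothetical helper name.

```python
# Plain-Python analogy only; PyDough generates SQL instead of looping.
# Rows copied from the outputs above (first ten nations).
REGIONS = [(0, "AFRICA"), (1, "AMERICA"), (2, "ASIA"), (3, "EUROPE"), (4, "MIDDLE EAST")]
NATIONS = [  # (key, name, region_key)
    (0, "ALGERIA", 0), (1, "ARGENTINA", 1), (2, "BRAZIL", 1), (3, "CANADA", 1),
    (4, "EGYPT", 4), (5, "ETHIOPIA", 0), (6, "FRANCE", 3), (7, "GERMANY", 3),
    (8, "INDIA", 2), (9, "INDONESIA", 2),
]

def regions_nations(regions, nations):
    """Inner join on region key == nation region_key, keeping the nation rows."""
    region_keys = {key for key, _ in regions}
    return [nation for nation in nations if nation[2] in region_keys]

rows = regions_nations(REGIONS, NATIONS)
print(len(rows))  # one row per nation in the sample
```

If the region list is narrowed first, for example to AMERICA only, the same function returns only that region's nations.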

## Calc

The next important operation is the `CALC` operation, which is used by "calling" a collection as a function.

In [4]:
%%pydough

pydough.to_sql(nations(key))

'SELECT n_nationkey AS key FROM main.NATION'

Calc has a few purposes:
* Select which entries you want in the output.
* Define new fields by calling functions.
* Allow operations to be evaluated for each entry in the outermost collection expression.

In [5]:
%%pydough

pydough.to_sql(nations(key + 1))

'SELECT key + 1 AS _expr0 FROM (SELECT n_nationkey AS key FROM main.NATION)'

Here the context is the "nations" at the root of the graph. This means that for each entry within nations, we compute the result. This has important implications when we get to more complex expressions. For example, if we want to know how many nations we have stored in each region, we can do so via CALC.

In [6]:
%%pydough

pydough.to_df(regions(name, nation_count=COUNT(nations)))

Unnamed: 0,name,nation_count
0,AFRICA,5
1,AMERICA,5
2,ASIA,5
3,EUROPE,5
4,MIDDLE EAST,5


Internally, this process evaluates `COUNT(nations)` in the context of `regions`. That means that for each entry in regions we navigate to `regions.nations` and apply the `COUNT` aggregation to reduce the results to a scalar.
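
To make "one scalar per region" concrete, here is a plain-Python sketch of the same idea. It is an analogy rather than PyDough's implementation, and it uses only the ten nations shown earlier, so its counts are smaller than the 5 per region reported for the full table.

```python
from collections import Counter

REGIONS = {0: "AFRICA", 1: "AMERICA", 2: "ASIA", 3: "EUROPE", 4: "MIDDLE EAST"}
# (nation name, region key) pairs from the first ten rows shown earlier
NATIONS = [
    ("ALGERIA", 0), ("ARGENTINA", 1), ("BRAZIL", 1), ("CANADA", 1), ("EGYPT", 4),
    ("ETHIOPIA", 0), ("FRANCE", 3), ("GERMANY", 3), ("INDIA", 2), ("INDONESIA", 2),
]

def nation_counts(regions, nations):
    """For each region (the context), reduce its nations to a single count."""
    per_region = Counter(region_key for _, region_key in nations)
    return [(name, per_region[key]) for key, name in regions.items()]

counts = nation_counts(REGIONS, NATIONS)
print(counts)
```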

This shows a very important restriction of CALC: each final entry in a calc expression must be scalar with respect to the current context. For example, the expression `regions(region_name=name, nation_name=nations.name)` is not legal because region to nation is a one-to-many relationship, so there is not a single scalar value for each region.

In [7]:
%%pydough

pydough.to_df(regions(region_name=name, nation_name=nations.name))

PyDoughASTException: Expected all terms in (region_name=name, nation_name=nations.name) to be singular, but encountered a plural expression: nations.name

In contrast, we know that every nation has 1 region (and this is defined in the metadata). As a result, the alternative expression `nations(nation_name=name, region_name=region.name)` is legal.

In [8]:
%%pydough

pydough.to_df(nations(nation_name=name, region_name=region.name))

Unnamed: 0,nation_name,region_name
0,ALGERIA,AFRICA
1,ARGENTINA,AMERICA
2,BRAZIL,AMERICA
3,CANADA,AMERICA
4,EGYPT,MIDDLE EAST
5,ETHIOPIA,AFRICA
6,FRANCE,EUROPE
7,GERMANY,EUROPE
8,INDIA,ASIA
9,INDONESIA,ASIA


This illustrates one of the important properties of the metadata: defining one:one, many:one, one:many, and many:many relationships gives developers the flexibility to write simpler queries.
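
The singular versus plural rule can be sketched as a walk over relationship cardinalities. Everything in the block below is hypothetical: the `CARDINALITY` and `TARGET` tables and the `is_singular` helper are illustrations of the idea, not PyDough's metadata format or API.

```python
# Hypothetical cardinality table; PyDough reads this information from its
# metadata graph, whose real format is not reproduced here.
CARDINALITY = {
    ("regions", "nations"): "one_to_many",
    ("nations", "region"): "many_to_one",
}
# Hypothetical map from (collection, subcollection) to the collection reached.
TARGET = {
    ("regions", "nations"): "nations",
    ("nations", "region"): "regions",
}

def is_singular(context, path):
    """Return True if following `path` from `context` yields at most one row."""
    current = context
    for step in path:
        if CARDINALITY[(current, step)] in ("one_to_many", "many_to_many"):
            return False
        current = TARGET[(current, step)]
    return True

print(is_singular("regions", ["nations"]))  # plural, like nations.name from regions
print(is_singular("nations", ["region"]))   # singular, like region.name from nations
```

A plural path becomes usable in a calc once an aggregation such as `COUNT` reduces it to one value per context entry.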

### Functions

PyDough supports many functions to enable analytics calculations. Whenever possible we try to support standard Python operators. However, this is not always possible. In addition, to avoid namespace conflicts, functions that require regular function call semantics are written in all capital letters by convention. Here are some examples.

In [9]:
%%pydough

# Numeric operations
print(pydough.to_sql(nations(key + 1, key - 1, key * 1, key / 1)))

# Comparison operators
print(pydough.to_sql(nations(key == 0, key < 0, key != 0, key >= 5)))

# String Operations
print(pydough.to_sql(nations(LENGTH(name), UPPER(name), LOWER(name), STARTSWITH(name, "A"))))

# Boolean operations
print(pydough.to_sql(nations((key != 1) & (LENGTH(name) > 5)))) # Boolean AND
print(pydough.to_sql(nations((key != 1) | (LENGTH(name) > 5)))) # Boolean OR
print(pydough.to_sql(nations(~(LENGTH(name) > 5)))) # Boolean NOT          
print(pydough.to_sql(nations(ISIN(name, ("KENYA", "JAPAN"))))) # In

# Datetime Operations
# Note: Since this is based on SQLite the underlying date is a bit strange.
print(pydough.to_sql(lines(YEAR(ship_date), MONTH(ship_date), DAY(ship_date))))

# Aggregation operations
print(pydough.to_sql(TPCH(NDISTINCT(nations.comment), SUM(nations.key))))
# Count can be used on a column for non-null entries or a collection
# for total entries.
print(pydough.to_sql(TPCH(COUNT(nations), COUNT(nations.comment))))

SELECT key + 1 AS _expr0, key - 1 AS _expr1, key * 1 AS _expr2, CAST(key AS REAL) / 1 AS _expr3 FROM (SELECT n_nationkey AS key FROM main.NATION)
SELECT key = 0 AS _expr0, key < 0 AS _expr1, key <> 0 AS _expr2, key >= 5 AS _expr3 FROM (SELECT n_nationkey AS key FROM main.NATION)
SELECT LENGTH(name) AS _expr0, UPPER(name) AS _expr1, LOWER(name) AS _expr2, name LIKE 'A%' AS _expr3 FROM (SELECT n_name AS name FROM main.NATION)
SELECT (key <> 1) AND (LENGTH(name) > 5) AS _expr0 FROM (SELECT n_nationkey AS key, n_name AS name FROM main.NATION)
SELECT (key <> 1) OR (LENGTH(name) > 5) AS _expr0 FROM (SELECT n_nationkey AS key, n_name AS name FROM main.NATION)
SELECT NOT LENGTH(name) > 5 AS _expr0 FROM (SELECT n_name AS name FROM main.NATION)
SELECT name IN ('KENYA', 'JAPAN') AS _expr0 FROM (SELECT n_name AS name FROM main.NATION)
SELECT CAST(STRFTIME('%Y', ship_date) AS INTEGER) AS _expr0, CAST(STRFTIME('%m', ship_date) AS INTEGER) AS _expr1, CAST(STRFTIME('%d', ship_date) AS INTEGER) AS _exp

#### Limitations

There are a few limitations on the regular Python syntax that can be used in PyDough expressions. Most notably:
* You cannot use Python's builtin `and`, `or`, `not`, or `in` with PyDough expressions.
* We do not support chained comparisons (e.g. `2 < x < 5`).
* We only support Python literals that are `integers`, `floats`, `strings`, `datetime.date`, or a `tuple`/`list` of those supported types.
* Lists and tuples can only be used with `ISIN`.
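
Some background on the first two limitations: Python does not let a class overload `and`, `or`, or `not`. These keywords call `bool()` on their operands, and a chained comparison such as `2 < x < 5` expands to `(2 < x) and (x < 5)`. An expression that is meant to be evaluated later in SQL has no truth value yet, so libraries that build expressions lazily usually rely on `&`, `|`, and `~` instead. The toy `Expr` class below is a hypothetical stand-in to show the mechanism; it is not PyDough's actual class.

```python
class Expr:
    """Toy expression node; a hypothetical stand-in, not PyDough's real class."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Expr(f"({self.text}) AND ({other.text})")

    def __or__(self, other):
        return Expr(f"({self.text}) OR ({other.text})")

    def __invert__(self):
        return Expr(f"NOT ({self.text})")

    def __gt__(self, value):
        return Expr(f"{self.text} > {value!r}")

    def __ne__(self, value):
        return Expr(f"{self.text} <> {value!r}")

    def __bool__(self):
        # `and`, `or`, `not`, and chained comparisons call bool() on an
        # operand, which cannot be answered before the query runs.
        raise TypeError("use &, |, ~ instead of and, or, not")


key = Expr("key")
name_len = Expr("LENGTH(name)")
combined = (key != 1) & (name_len > 5)
print(combined.text)  # (key <> 1) AND (LENGTH(name) > 5)
```

Writing `(key != 1) and (name_len > 5)` with this class raises `TypeError`, because Python asks the left operand for its truth value before looking at the right one.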

### BACK

Sometimes you need to load a value from a previous context to use at a later step in a PyDough statement. That can be done using the `BACK` operation. `BACK(k)` moves back `k` contexts from the current one to find the name you are searching for. This is useful to avoid repeating computation.
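
One way to picture `BACK(1)` is as a reference to the loop variable of an enclosing loop. The plain-Python sketch below is an analogy under that assumption, using the regions and the ten nations listed earlier; `nations_with_region_name` is a hypothetical helper.

```python
REGIONS = {0: "AFRICA", 1: "AMERICA", 2: "ASIA", 3: "EUROPE", 4: "MIDDLE EAST"}
# (nation name, region key) pairs from the first ten rows shown earlier
NATIONS = [
    ("ALGERIA", 0), ("ARGENTINA", 1), ("BRAZIL", 1), ("CANADA", 1), ("EGYPT", 4),
    ("ETHIOPIA", 0), ("FRANCE", 3), ("GERMANY", 3), ("INDIA", 2), ("INDONESIA", 2),
]

def nations_with_region_name(regions, nations):
    rows = []
    for region_key, region_name in regions.items():   # ancestor context
        for nation_name, nation_region in nations:     # current context
            if nation_region == region_key:
                # `region_name` plays the role of BACK(1).name here.
                rows.append({"region_name": region_name, "nation_name": nation_name})
    return rows

rows = nations_with_region_name(REGIONS, NATIONS)
print(rows[0])
```

The row order of this sketch follows the outer loop over regions and is not meant to match the order PyDough returns.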

In [10]:
%%pydough

pydough.to_df(regions.nations(region_name=BACK(1).name, nation_name=name))

Unnamed: 0,region_name,nation_name
0,AFRICA,ALGERIA
1,AMERICA,ARGENTINA
2,AMERICA,BRAZIL
3,AMERICA,CANADA
4,MIDDLE EAST,EGYPT
5,AFRICA,ETHIOPIA
6,EUROPE,FRANCE
7,EUROPE,GERMANY
8,ASIA,INDIA
9,ASIA,INDONESIA


Here is a more complex example showing intermediate values. We first compute `total_value` and then reuse it via `BACK`.

In [11]:
%%pydough

nations_value = nations(name, total_value=SUM(suppliers.account_balance))
pydough.to_df(nations_value)

Unnamed: 0,name,total_value
0,ALGERIA,1813079.63
1,ARGENTINA,1814336.1
2,BRAZIL,1749634.48
3,CANADA,2041622.22
4,EGYPT,1805131.05
5,ETHIOPIA,1796278.37
6,FRANCE,1853385.17
7,GERMANY,1698097.02
8,INDIA,1858708.57
9,INDONESIA,1880366.21


In [15]:
%%pydough
suppliers_value = nations_value.suppliers(
 key,
 name,
 nation_name=BACK(1).name,
 account_balance=account_balance,
 percentage_of_national_value=100 * account_balance / BACK(1).total_value
)
top_suppliers = suppliers_value.TOP_K(20, by=percentage_of_national_value.DESC())
pydough.to_df(top_suppliers)

Unnamed: 0,key,name,nation_name,account_balance,percentage_of_national_value
0,4194,Supplier#000004194,JORDAN,9973.93,0.640133
1,6716,Supplier#000006716,JORDAN,9895.14,0.635076
2,1943,Supplier#000001943,JORDAN,9889.66,0.634724
3,2753,Supplier#000002753,JORDAN,9882.68,0.634276
4,7901,Supplier#000007901,JORDAN,9869.16,0.633409
5,4196,Supplier#000004196,JORDAN,9825.61,0.630614
6,4778,Supplier#000004778,JORDAN,9818.79,0.630176
7,4160,Supplier#000004160,JORDAN,9812.39,0.629765
8,5141,Supplier#000005141,JORDAN,9639.46,0.618666
9,5305,Supplier#000005305,JORDAN,9611.79,0.616891


## WHERE

The WHERE operation may be used to filter unwanted entries in a context. For example, we can filter `nations` to only consider the `AMERICA` and `EUROPE` regions. A WHERE's context functions similarly to a calc except that it cannot be used to assign new properties.
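
In plain-Python terms, this kind of filter keeps or drops whole rows without changing their contents. The sketch below is only an analogy; it uses the ten nations listed earlier, and `where_region_in` is a hypothetical helper.

```python
REGIONS = {0: "AFRICA", 1: "AMERICA", 2: "ASIA", 3: "EUROPE", 4: "MIDDLE EAST"}
# (nation name, region key) pairs from the first ten rows shown earlier
NATIONS = [
    ("ALGERIA", 0), ("ARGENTINA", 1), ("BRAZIL", 1), ("CANADA", 1), ("EGYPT", 4),
    ("ETHIOPIA", 0), ("FRANCE", 3), ("GERMANY", 3), ("INDIA", 2), ("INDONESIA", 2),
]

def where_region_in(nations, regions, wanted):
    """Keep nations whose (singular) region name is in `wanted`; rows are unchanged."""
    return [row for row in nations if regions[row[1]] in wanted]

selected = where_region_in(NATIONS, REGIONS, {"AMERICA", "EUROPE"})
print([name for name, _ in selected])
```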

In [16]:
%%pydough

pydough.to_df(nations.WHERE((region.name == "AMERICA") | (region.name == "EUROPE")))

Unnamed: 0,key,name,region_key,comment
0,1,ARGENTINA,1,al foxes promise slyly according to the regula...
1,2,BRAZIL,1,y alongside of the pending deposits. carefully...
2,3,CANADA,1,"eas hang ironic, silent packages. slyly regula..."
3,6,FRANCE,3,"refully final requests. regular, ironi"
4,7,GERMANY,3,"l platelets. regular accounts x-ray: unusual, ..."
5,17,PERU,1,platelets. blithely pending dependencies use f...
6,19,ROMANIA,3,ular asymptotes are about the furious multipli...
7,22,RUSSIA,3,requests against the platelets use never acco...
8,23,UNITED KINGDOM,3,eans boost carefully special requests. account...
9,24,UNITED STATES,1,y final packages. slow foxes cajole quickly. q...


## TOP_K

The top k operation is used to reduce a collection to at most `k` entries based on some ordering condition. The `by` argument is used to order the output based on a sorting condition. As an example, consider the query below, which returns only the first 5 nations in alphabetical order by name.

For the `by` argument:
* Anything that can be an expression used in a `CALC` or a `WHERE` can be used as a component of a `by`.
* The value in the `by` must end with either `.ASC()` or `.DESC()`
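
The ordering rules can be sketched in plain Python as a stable multi-pass sort followed by a slice. This is an analogy, not PyDough's implementation: `top_k` is a hypothetical helper, its `by` argument is a list of `(column index, descending?)` pairs rather than `.ASC()`/`.DESC()` expressions, and the rows are a small sample taken from this notebook's outputs.

```python
# (key, name, region_key) rows that appear in this notebook's outputs
NATIONS = [
    (0, "ALGERIA", 0), (1, "ARGENTINA", 1), (2, "BRAZIL", 1),
    (3, "CANADA", 1), (4, "EGYPT", 4), (18, "CHINA", 2),
]

def top_k(rows, k, by):
    """`by` lists (column index, descending?) pairs, highest priority first."""
    ordered = list(rows)
    # Sort by the lowest-priority key first. Python's sort is stable, so each
    # earlier pass acts as the tie-breaker for the passes that follow.
    for column, descending in reversed(by):
        ordered.sort(key=lambda row: row[column], reverse=descending)
    return ordered[:k]

first_five = top_k(NATIONS, 5, by=[(1, False)])           # name ascending
tie_broken = top_k(NATIONS, 3, by=[(2, True), (1, False)])  # region_key desc, then name asc
print([row[1] for row in first_five])
print([row[1] for row in tie_broken])
```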

In [17]:
%%pydough

pydough.to_df(nations.TOP_K(5, by=name.ASC()))

Unnamed: 0,key,name,region_key,comment
0,0,ALGERIA,0,haggle. carefully final deposits detect slyly...
1,1,ARGENTINA,1,al foxes promise slyly according to the regula...
2,2,BRAZIL,1,y alongside of the pending deposits. carefully...
3,3,CANADA,1,"eas hang ironic, silent packages. slyly regula..."
4,18,CHINA,2,c dependencies. furiously express notornis sle...


You can also provide a tuple to `by` if you need to break ties. Consider this alternative, which selects the 20 parts with the largest size, breaking ties by the smallest part key.

In [18]:
%%pydough

pydough.to_df(parts.TOP_K(20, by=(size.DESC(), key.ASC())))

Unnamed: 0,key,name,manufacturer,brand,part_type,size,container,retail_price,comment
0,232,ivory peru lavender orange dark,Manufacturer#5,Brand#53,LARGE BURNISHED NICKEL,50,SM PKG,1132.23,"r, unusual requests"
1,273,pink white sky burnished coral,Manufacturer#2,Brand#25,STANDARD BRUSHED BRASS,50,LG BOX,1173.27,ackages along the
2,414,pink brown purple puff snow,Manufacturer#4,Brand#41,SMALL BURNISHED STEEL,50,WRAP CASE,1314.41,efully. dolph
3,436,turquoise yellow dim purple antique,Manufacturer#1,Brand#14,LARGE POLISHED BRASS,50,WRAP CASE,1336.43,the regul
4,679,purple blanched linen metallic indian,Manufacturer#4,Brand#41,SMALL BURNISHED TIN,50,MED BOX,1579.67,iously ironic in
5,763,wheat seashell azure chartreuse dodger,Manufacturer#4,Brand#44,LARGE BRUSHED TIN,50,SM PKG,1663.76,counts. regu
6,767,blush firebrick misty blanched purple,Manufacturer#2,Brand#24,LARGE POLISHED TIN,50,MED DRUM,1667.76,ts. carefully unu
7,777,blanched indian pink frosted grey,Manufacturer#5,Brand#53,SMALL ANODIZED STEEL,50,JUMBO JAR,1677.77,theodolites
8,796,beige frosted cyan hot puff,Manufacturer#5,Brand#51,ECONOMY BRUSHED STEEL,50,WRAP CAN,1696.79,yly fina
9,803,brown navy tan salmon honeydew,Manufacturer#5,Brand#52,SMALL ANODIZED TIN,50,MED PKG,1703.8,ly at the accou


## ORDER_BY

If you just want to return your output in a sorted order, you can use `ORDER_BY`. The functionality is the same as in `TOP_K` except that there is no `K` argument so the rows are not reduced. Each argument must be an expression that can be used for sorting.

Below we transform our nations collection to sort the output alphabetically by nation name.

In [19]:
%%pydough

pydough.to_df(nations.ORDER_BY(name.ASC()))

Unnamed: 0,key,name,region_key,comment
0,0,ALGERIA,0,haggle. carefully final deposits detect slyly...
1,1,ARGENTINA,1,al foxes promise slyly according to the regula...
2,2,BRAZIL,1,y alongside of the pending deposits. carefully...
3,3,CANADA,1,"eas hang ironic, silent packages. slyly regula..."
4,18,CHINA,2,c dependencies. furiously express notornis sle...
5,4,EGYPT,4,y above the carefully unusual theodolites. fin...
6,5,ETHIOPIA,0,ven packages wake quickly. regu
7,6,FRANCE,3,"refully final requests. regular, ironi"
8,7,GERMANY,3,"l platelets. regular accounts x-ray: unusual, ..."
9,8,INDIA,2,ss excuses cajole slyly across the packages. d...


## PARTITION

The partition operation allows grouping collections under interesting keys, similar to a SQL `GROUP BY`. Keys are specified using the `by` argument, and the grouped data to be aggregated can be referenced through the subcollection named by the `name` argument. For example, we can use this to bucket nations by name length.
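
As a plain-Python analogy (not PyDough's implementation), partitioning by name length amounts to building groups keyed by the length and then aggregating each group. The sketch uses only the ten nation names listed earlier, so its counts are smaller than those for the full table; `partition_by_length` is a hypothetical helper.

```python
from collections import defaultdict

NAMES = [
    "ALGERIA", "ARGENTINA", "BRAZIL", "CANADA", "EGYPT",
    "ETHIOPIA", "FRANCE", "GERMANY", "INDIA", "INDONESIA",
]

def partition_by_length(names):
    groups = defaultdict(list)  # key: name_length, value: the grouped rows
    for name in names:
        groups[len(name)].append(name)
    # One output row per key, with the group reduced to a count.
    return [(length, len(members)) for length, members in sorted(groups.items())]

buckets = partition_by_length(NAMES)
print(buckets)
```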

In [21]:
%%pydough

updated_nations = nations(key, name_length=LENGTH(name))
grouped_nations = PARTITION(
    updated_nations, name="n", by=(name_length)
)(
    name_length,
    nation_count=COUNT(n.key)
)
pydough.to_df(grouped_nations)

Unnamed: 0,name_length,nation_count
0,4,3
1,5,5
2,6,5
3,7,5
4,8,1
5,9,2
6,10,1
7,12,1
8,13,1
9,14,1


A few important usage details:
* The `name` argument specifies the name of the subcollection used to access the original unpartitioned data from each partition.
* `by` can be either a single expression or a tuple of them, but each one must be a reference to an expression that already exists in the context of the data (e.g. `name`, not `LOWER(name)` or `region.name`).
* `BACK` can be used after stepping from a partition down into its subcollection to reference values computed for that partition. An example is shown below where we select brass parts of size 15 whose retail price is below the average retail price of their part type.

In [None]:
%%pydough

selected_parts = parts.WHERE(ENDSWITH(part_type, "BRASS") & (size == 15))
part_types = PARTITION(selected_parts, name="p", by=part_type)(avg_price=AVG(p.retail_price))
output = part_types.p.WHERE(retail_price < BACK(1).avg_price)
pydough.to_df(output)

Unnamed: 0,avg_price
0,1521.141667
1,1573.617778
2,1492.116296
3,1455.88963
4,1462.17375
5,1513.770417
6,1429.555417
7,1530.551818
8,1443.439412
9,1578.760625


## HAS and HASNOT

The `HAS` and `HASNOT` operations are used for filtering based on whether any match occurs between an entry and another collection. For example, consider only the regions that have at least 1 nation whose name is longer than 10 characters.
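
In plain Python, `HAS` resembles `any(...)` over the related rows and `HASNOT` resembles its negation. The sketch below is an analogy under that assumption. It uses a subset of nations, and SAUDI ARABIA is taken from the standard TPC-H data rather than from the outputs in this notebook; `has_long_name` is a hypothetical helper.

```python
REGIONS = {0: "AFRICA", 1: "AMERICA", 2: "ASIA", 3: "EUROPE", 4: "MIDDLE EAST"}
# Subset of (nation name, region key) rows; SAUDI ARABIA comes from the
# standard TPC-H nation list, not from the outputs shown in this notebook.
NATIONS = [
    ("ALGERIA", 0), ("ARGENTINA", 1), ("EGYPT", 4), ("FRANCE", 3), ("INDIA", 2),
    ("UNITED KINGDOM", 3), ("UNITED STATES", 1), ("SAUDI ARABIA", 4),
]

def has_long_name(region_key, nations, min_length=10):
    """True if the region has at least one nation with len(name) > min_length."""
    return any(rk == region_key and len(name) > min_length for name, rk in nations)

has_rows = [name for key, name in REGIONS.items() if has_long_name(key, NATIONS)]
hasnot_rows = [name for key, name in REGIONS.items() if not has_long_name(key, NATIONS)]
print(has_rows)
print(hasnot_rows)
```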

In [23]:
%%pydough

length_10_nations = nations.WHERE(LENGTH(name) > 10)
pydough.to_df(regions.WHERE(HAS(length_10_nations)))

Unnamed: 0,key,name,comment
0,1,AMERICA,"hs use ironic, even requests. s"
1,3,EUROPE,ly final courts cajole furiously final excuse
2,4,MIDDLE EAST,uickly special accounts cajole carefully blith...


Alternatively, we can consider only the regions where every nation's name has length 10 or less.

In [24]:
%%pydough

pydough.to_df(regions.WHERE(HASNOT(length_10_nations)))

Unnamed: 0,key,name,comment
0,0,AFRICA,lar deposits. blithely final packages cajole. ...
1,2,ASIA,ges. thinly even pinto beans ca
