# PyDough Operations

This notebook provides an overview of the builtin PyDough operations. It is not exhaustive, and the list of functions in particular is incomplete, but these operations should serve as a foundation for getting started.

In [24]:
%load_ext pydough.jupyter_extensions

import pydough
# Setup demo metadata
pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH");
pydough.active_session.connect_database("sqlite", database="../../tpch.db");

The pydough.jupyter_extensions extension is already loaded. To reload it, use:
  %reload_ext pydough.jupyter_extensions


## Collections

A collection in PyDough is an abstraction for any "document", but in most cases represents a table. Starting with the TPC-H schema, if we want to access the regions table, we will use our corresponding PyDough collection.

In [25]:
%%pydough

print(pydough.to_sql(regions))
pydough.to_df(regions)

SELECT
  r_regionkey AS key,
  r_name AS name,
  r_comment AS comment
FROM main.region


Unnamed: 0,key,name,comment
0,0,AFRICA,lar deposits. blithely final packages cajole. ...
1,1,AMERICA,"hs use ironic, even requests. s"
2,2,ASIA,ges. thinly even pinto beans ca
3,3,EUROPE,ly final courts cajole furiously final excuse
4,4,MIDDLE EAST,uickly special accounts cajole carefully blith...


Collections contain properties, which correspond either to the entries within a document or to a sub-collection (another document that can be reached from the current document). This is explored in more detail in our notebook on metadata, but the important point is that the path between collections is how we integrate data across multiple tables.

For example, each region is associated with 1 or more nations, so rather than just looking at the region we can look at "each nation for each region". This will result in outputting 1 entry per nation.

In [50]:
%%pydough

print(pydough.to_sql(regions.nations))
# pydough.to_df(regions.nations)

pydough.to_df(nations)

SELECT
  nation.n_nationkey AS key,
  nation.n_regionkey AS region_key,
  nation.n_name AS name,
  nation.n_comment AS comment
FROM main.region AS region
JOIN main.nation AS nation
  ON nation.n_regionkey = region.r_regionkey


Unnamed: 0,key,region_key,name,comment
0,0,0,ALGERIA,haggle. carefully final deposits detect slyly...
1,1,1,ARGENTINA,al foxes promise slyly according to the regula...
2,2,1,BRAZIL,y alongside of the pending deposits. carefully...
3,3,1,CANADA,"eas hang ironic, silent packages. slyly regula..."
4,4,4,EGYPT,y above the carefully unusual theodolites. fin...
5,5,0,ETHIOPIA,ven packages wake quickly. regu
6,6,3,FRANCE,"refully final requests. regular, ironi"
7,7,3,GERMANY,"l platelets. regular accounts x-ray: unusual, ..."
8,8,2,INDIA,ss excuses cajole slyly across the packages. d...
9,9,2,INDONESIA,slyly express asymptotes. regular deposits ha...


Notice how in the generated SQL we create a join between `region` and `nation`. The metadata holds this relationship, effectively abstracting joins away from the developer whenever possible.

## Calculate

The next important operation is the `CALCULATE` operation, which takes a variable number of positional and/or keyword arguments.

In [27]:
%%pydough

print(pydough.to_sql(nations.CALCULATE(key, nation_name=name)))

SELECT
  n_nationkey AS key,
  n_name AS nation_name
FROM main.nation


Calculate has a few purposes:
* Select which entries you want in the output.
* Define new fields by calling functions.
* Allow operations to be evaluated for each entry in the outermost collection's "context".
* Define aliases for terms that get down-streamed to descendants ([see here](#down-streaming)).

The terms of the last `CALCULATE` in the PyDough logic are the terms that are included in the result (unless the `columns` argument of `to_sql` or `to_df` is used).

In [51]:
%%pydough
nations_plus_key = nations.CALCULATE(adjusted_key = key + 1)
print(pydough.to_sql(nations_plus_key))
pydough.to_df(nations_plus_key)

SELECT
  n_nationkey + 1 AS adjusted_key
FROM main.nation


Unnamed: 0,adjusted_key
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
9,10


Here the context is the "nations" at the root of the graph. This means that for each entry within nations, we compute the result. This has important implications when we get to more complex expressions. For example, if we want to know how many nations we have stored in each region, we can do so via `CALCULATE`.

In [29]:
%%pydough

pydough.to_df(regions.CALCULATE(name, nation_count=COUNT(nations)))

Unnamed: 0,name,nation_count
0,AFRICA,5
1,AMERICA,5
2,ASIA,5
3,EUROPE,5
4,MIDDLE EAST,5


Internally, this process evaluates `COUNT(nations)` grouped on each region and then joins the result with the original `regions` table. Importantly, this outputs a "scalar" value for each region.
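
The aggregate-then-join idea can be sketched with Python's builtin `sqlite3` module. This is only an illustration under stated assumptions: the in-memory tables hold a hand-picked subset of the TPC-H rows shown in this notebook, the query is written by hand, and the SQL that PyDough actually generates may differ.

```python
import sqlite3

# Tiny in-memory stand-in for the TPC-H region and nation tables.
# Only a hand-picked subset of rows is loaded, so the counts differ
# from the full database used elsewhere in this notebook.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE region (r_regionkey INTEGER, r_name TEXT);
    CREATE TABLE nation (n_nationkey INTEGER, n_regionkey INTEGER, n_name TEXT);
    INSERT INTO region VALUES (0, 'AFRICA'), (1, 'AMERICA'), (2, 'ASIA');
    INSERT INTO nation VALUES
        (0, 0, 'ALGERIA'), (5, 0, 'ETHIOPIA'),
        (1, 1, 'ARGENTINA'), (2, 1, 'BRAZIL'), (3, 1, 'CANADA'),
        (8, 2, 'INDIA');
    """
)

# Step 1 (inner query): aggregate nations per region key.
# Step 2 (outer query): join the aggregate back to the region table.
query = """
SELECT region.r_name AS name, agg.nation_count
FROM region
JOIN (
    SELECT n_regionkey, COUNT(*) AS nation_count
    FROM nation
    GROUP BY n_regionkey
) AS agg
  ON agg.n_regionkey = region.r_regionkey
ORDER BY region.r_name
"""
rows = conn.execute(query).fetchall()
for name, nation_count in rows:
    print(name, nation_count)
```

Because the inner query produces exactly one row per region key, every region ends up with a single `nation_count` value.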

The scalar requirement is a very important restriction of `CALCULATE`: each final entry in the operation must be scalar with respect to the current context. For example, the expression `regions.CALCULATE(region_name=name, nation_name=nations.name)` is not legal because regions and nations are in a one-to-many relationship, so there is not a single nation name for each region.

**The cell below will result in an error because it violates this restriction.**

In [30]:
%%pydough

pydough.to_df(regions.CALCULATE(region_name=name, nation_name=nations.name))

PyDoughQDAGException: Expected all terms in CALCULATE(region_name=name, nation_name=nations.name) to be singular, but encountered a plural expression: nations.name

In contrast, we know that every nation has exactly 1 region (and this is defined in the metadata). As a result, the alternative expression `nations.CALCULATE(nation_name=name, region_name=region.name)` is legal.

In [31]:
%%pydough

pydough.to_df(nations.CALCULATE(nation_name=name, region_name=region.name))

Unnamed: 0,nation_name,region_name
0,ALGERIA,AFRICA
1,ARGENTINA,AMERICA
2,BRAZIL,AMERICA
3,CANADA,AMERICA
4,EGYPT,MIDDLE EAST
5,ETHIOPIA,AFRICA
6,FRANCE,EUROPE
7,GERMANY,EUROPE
8,INDIA,ASIA
9,INDONESIA,ASIA


This illustrates one of the important properties of the metadata: defining one:one, many:one, one:many, and many:many relationships gives developers the flexibility to write simpler queries.
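
To make the singular/plural rule concrete, here is a minimal plain-Python sketch of how relationship cardinality could decide whether a path is legal inside a `CALCULATE`. The `RELATIONSHIPS` table and the `is_singular` helper are hypothetical names invented for this illustration; they are not PyDough's metadata format or API.

```python
# Hypothetical relationship table: (collection, property) -> (target, is_plural).
# This is NOT PyDough's metadata format, only an illustration of the idea.
RELATIONSHIPS = {
    ("regions", "nations"): ("nations", True),    # one region has many nations
    ("nations", "region"): ("regions", False),    # each nation has one region
    ("nations", "suppliers"): ("suppliers", True),
}


def is_singular(start, path):
    """Return True if following `path` from `start` never crosses a plural hop."""
    current = start
    for prop in path:
        target, is_plural = RELATIONSHIPS[(current, prop)]
        if is_plural:
            return False
        current = target
    return True


# nations -> region is many:one, so region.name is a single value per nation.
nation_to_region = is_singular("nations", ["region"])
# regions -> nations is one:many, so nations.name is plural per region.
region_to_nations = is_singular("regions", ["nations"])

print(nation_to_region, region_to_nations)
```

A single plural hop anywhere along the path is enough to make the whole expression plural, which is why `nations.name` is rejected in the context of `regions`.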

### Functions

PyDough supports many builtin functions. Whenever possible we try to support standard Python operators, but this is not always possible. In addition, to avoid namespace conflicts, functions that require regular function call semantics use all-uppercase names by convention. Here are some examples.

In [32]:
%%pydough

# Numeric operations
print("Q1")
print(pydough.to_sql(nations.CALCULATE(key + 1, key - 1, key * 1, key / 1)))

# Comparison operators
print("\nQ2")
print(pydough.to_sql(nations.CALCULATE(key == 0, key < 0, key != 0, key >= 5)))

# String Operations
print("\nQ3")
print(pydough.to_sql(nations.CALCULATE(LENGTH(name), UPPER(name), LOWER(name), STARTSWITH(name, "A"))))

# Boolean operations
print("\nQ4")
print(pydough.to_sql(nations.CALCULATE((key != 1) & (LENGTH(name) > 5)))) # Boolean AND
print("\nQ5")
print(pydough.to_sql(nations.CALCULATE((key != 1) | (LENGTH(name) > 5)))) # Boolean OR
print("\nQ6")
print(pydough.to_sql(nations.CALCULATE(~(LENGTH(name) > 5)))) # Boolean NOT        
print("\nQ7")  
print(pydough.to_sql(nations.CALCULATE(ISIN(name, ("KENYA", "JAPAN"))))) # In

# Datetime Operations
# Note: SQLite has no native date type, so the date parts are extracted with STRFTIME.
print("\nQ8")
print(pydough.to_sql(lines.CALCULATE(YEAR(ship_date), MONTH(ship_date), DAY(ship_date), HOUR(ship_date), MINUTE(ship_date), SECOND(ship_date))))

# Aggregation operations
print("\nQ9")
print(pydough.to_sql(TPCH.CALCULATE(NDISTINCT(nations.comment), SUM(nations.key))))
# Count can be used on a column for non-null entries or a collection
# for total entries.
print("\nQ10")
print(pydough.to_sql(TPCH.CALCULATE(COUNT(nations), COUNT(nations.comment))))

Q1
SELECT
  n_nationkey + 1 AS _expr0,
  n_nationkey - 1 AS _expr1,
  n_nationkey * 1 AS _expr2,
  CAST(n_nationkey AS REAL) / 1 AS _expr3
FROM main.nation

Q2
SELECT
  n_nationkey = 0 AS _expr0,
  n_nationkey < 0 AS _expr1,
  n_nationkey <> 0 AS _expr2,
  n_nationkey >= 5 AS _expr3
FROM main.nation

Q3
SELECT
  LENGTH(n_name) AS _expr0,
  UPPER(n_name) AS _expr1,
  LOWER(n_name) AS _expr2,
  n_name LIKE 'A%' AS _expr3
FROM main.nation

Q4
SELECT
  LENGTH(n_name) > 5 AND n_nationkey <> 1 AS _expr0
FROM main.nation

Q5
SELECT
  LENGTH(n_name) > 5 OR n_nationkey <> 1 AS _expr0
FROM main.nation

Q6
SELECT
  LENGTH(n_name) <= 5 AS _expr0
FROM main.nation

Q7
SELECT
  n_name IN ('KENYA', 'JAPAN') AS _expr0
FROM main.nation

Q8
SELECT
  CAST(STRFTIME('%Y', l_shipdate) AS INTEGER) AS _expr0,
  CAST(STRFTIME('%m', l_shipdate) AS INTEGER) AS _expr1,
  CAST(STRFTIME('%d', l_shipdate) AS INTEGER) AS _expr2,
  CAST(STRFTIME('%H', l_shipdate) AS INTEGER) AS _expr3,
  CAST(STRFTIME('%M', l_shipdate)

#### Limitations

PyDough expressions cannot use all of regular Python's syntax. Most notably:
* You cannot use Python's builtin `and`, `or`, `not`, or `in` with PyDough expressions.
* We do not support chained comparisons (e.g. `2 < x < 5`).
* We only support Python literals that are `integers`, `floats`, `strings`, `datetime.date`, or a `tuple`/`list` of those supported types.
* Lists and tuples can only be used with `ISIN`.
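
The first two limitations follow from how Python itself works: `&`, `|`, and `~` can be overloaded to return new objects, while `and`, `or`, `not`, and chained comparisons all force an operand through `bool()`, which must return `True` or `False`. Python likewise coerces the result of `in` to a bool. The sketch below shows this with a hypothetical `Expr` class; it is a toy written for this illustration, not PyDough's actual expression type.

```python
class Expr:
    """Hypothetical expression node; not PyDough's actual class."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self):
        return Expr(f"(NOT {self.text})")

    def __lt__(self, other):
        return Expr(f"({self.text} < {other})")

    def __gt__(self, other):
        return Expr(f"({self.text} > {other})")

    def __bool__(self):
        # `and`, `or`, `not`, and chained comparisons call bool() on an
        # operand, and bool() can never return an expression object.
        raise TypeError("expression has no truth value; use &, | and ~")


key = Expr("key")

# Overloadable operators build a new expression.
combined = (key > 2) & (key < 5)
negated = ~(key > 2)
print(combined.text)
print(negated.text)

# `2 < key < 5` expands to `(2 < key) and (key < 5)`, which needs bool().
try:
    2 < key < 5
    chained_ok = True
except TypeError:
    chained_ok = False

# `not key` also needs bool().
try:
    not key
    not_ok = True
except TypeError:
    not_ok = False

print(chained_ok, not_ok)
```

This is why the examples above write `(key != 1) & (LENGTH(name) > 5)` with explicit parentheses instead of `and`.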

### Down-Streaming

Sometimes you need to load a value from a previous context to use at a later step in a PyDough statement. Any expression from an ancestor context that is placed in a `CALCULATE` is automatically made available to all descendants of that context. However, an error will occur if the name of the term defined in the ancestor collides with a name of a term or property of a descendant context, since PyDough will not know which one to use.

Notice how in the example below, `region_name` is defined in a `CALCULATE` within the context of `regions`, so the calculate within the context of `nations` also has access to `region_name` (interpreted as "the name of the region that this nation belongs to").

In [33]:
%%pydough

pydough.to_df(regions.CALCULATE(region_name=name).nations.CALCULATE(region_name, nation_name=name))

Unnamed: 0,region_name,nation_name
0,AFRICA,ALGERIA
1,AMERICA,ARGENTINA
2,AMERICA,BRAZIL
3,AMERICA,CANADA
4,MIDDLE EAST,EGYPT
5,AFRICA,ETHIOPIA
6,EUROPE,FRANCE
7,EUROPE,GERMANY
8,ASIA,INDIA
9,ASIA,INDONESIA
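
One way to picture down-streaming is as a nested loop in which a variable bound in the outer (ancestor) loop stays visible in the inner (descendant) loop. The plain-Python sketch below uses a hand-copied subset of the regions and nations shown in this notebook; it illustrates the semantics only and says nothing about how PyDough executes the query.

```python
# Hand-copied subset of the regions and nations shown in this notebook.
regions = [
    {"key": 0, "name": "AFRICA"},
    {"key": 1, "name": "AMERICA"},
]
nations = [
    {"key": 0, "region_key": 0, "name": "ALGERIA"},
    {"key": 1, "region_key": 1, "name": "ARGENTINA"},
    {"key": 2, "region_key": 1, "name": "BRAZIL"},
]

rows = []
for region in regions:
    # Term defined in the ancestor (region) context under a new alias.
    region_name = region["name"]
    for nation in nations:
        if nation["region_key"] != region["key"]:
            continue
        # The descendant (nation) context can still read region_name.
        rows.append({"region_name": region_name, "nation_name": nation["name"]})

for row in rows:
    print(row)
```

Both collections have a `name` property, so the region's name has to travel under a different alias (`region_name`); without the alias it would collide with the nation's own `name`.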


Here is a more complex example showing intermediate values. Here we will first compute `total_value` and then reuse it downstream.

In [34]:
%%pydough

nations_value = nations.CALCULATE(nation_name=name, total_value=SUM(suppliers.account_balance))
pydough.to_df(nations_value)

Unnamed: 0,nation_name,total_value
0,ALGERIA,1813079.63
1,ARGENTINA,1814336.1
2,BRAZIL,1749634.48
3,CANADA,2041622.22
4,EGYPT,1805131.05
5,ETHIOPIA,1796278.37
6,FRANCE,1853385.17
7,GERMANY,1698097.02
8,INDIA,1858708.57
9,INDONESIA,1880366.21


In [35]:
%%pydough
suppliers_value = nations_value.suppliers.CALCULATE(
 key,
 name,
 nation_name,
 account_balance=account_balance,
 percentage_of_national_value=100 * account_balance / total_value
)
top_suppliers = suppliers_value.TOP_K(20, by=percentage_of_national_value.DESC())
pydough.to_df(top_suppliers)

Unnamed: 0,key,name,nation_name,account_balance,percentage_of_national_value
0,4194,Supplier#000004194,JORDAN,9973.93,0.640133
1,6716,Supplier#000006716,JORDAN,9895.14,0.635076
2,1943,Supplier#000001943,JORDAN,9889.66,0.634724
3,2753,Supplier#000002753,JORDAN,9882.68,0.634276
4,7901,Supplier#000007901,JORDAN,9869.16,0.633409
5,4196,Supplier#000004196,JORDAN,9825.61,0.630614
6,4778,Supplier#000004778,JORDAN,9818.79,0.630176
7,4160,Supplier#000004160,JORDAN,9812.39,0.629765
8,5141,Supplier#000005141,JORDAN,9639.46,0.618666
9,5305,Supplier#000005305,JORDAN,9611.79,0.616891


## WHERE

The `WHERE` operation may be used to filter unwanted entries in a context. For example, we can filter `nations` to only consider the `AMERICA` and `EUROPE` regions. A `WHERE`'s context functions similarly to a `CALCULATE` except that it cannot be used to assign new properties; it takes a single positional argument: the predicate to filter on.

In [36]:
%%pydough

pydough.to_df(nations.WHERE((region.name == "AMERICA") | (region.name == "EUROPE")))

Unnamed: 0,key,region_key,name,comment
0,1,1,ARGENTINA,al foxes promise slyly according to the regula...
1,2,1,BRAZIL,y alongside of the pending deposits. carefully...
2,3,1,CANADA,"eas hang ironic, silent packages. slyly regula..."
3,6,3,FRANCE,"refully final requests. regular, ironi"
4,7,3,GERMANY,"l platelets. regular accounts x-ray: unusual, ..."
5,17,1,PERU,platelets. blithely pending dependencies use f...
6,19,3,ROMANIA,ular asymptotes are about the furious multipli...
7,22,3,RUSSIA,requests against the platelets use never acco...
8,23,3,UNITED KINGDOM,eans boost carefully special requests. account...
9,24,1,UNITED STATES,y final packages. slow foxes cajole quickly. q...


## TOP_K

The `TOP_K` operation is used to reduce a collection to at most a given number of entries. The `by` argument is used to order the output based on a sorting condition. As an example, consider this query that keeps only the first 5 nations in alphabetical order by name.

In [37]:
%%pydough

pydough.to_df(nations.TOP_K(5, by=name.ASC()))

Unnamed: 0,key,region_key,name,comment
0,0,0,ALGERIA,haggle. carefully final deposits detect slyly...
1,1,1,ARGENTINA,al foxes promise slyly according to the regula...
2,2,1,BRAZIL,y alongside of the pending deposits. carefully...
3,3,1,CANADA,"eas hang ironic, silent packages. slyly regula..."
4,18,2,CHINA,c dependencies. furiously express notornis sle...


The `by` argument requirements are:
* Any expression that can be used in a `CALCULATE` or a `WHERE` can be used as a component of a `by`.
* The value in the `by` must end with either `.ASC()` or `.DESC()`.

You can also provide a tuple to `by` if you need to break ties. Consider this alternative, which selects the 5 parts with the largest size, breaking ties in favor of the smallest part key.

In [53]:
%%pydough

pydough.to_df(parts.TOP_K(5, by=(size.DESC(), key.ASC())))

Unnamed: 0,key,name,manufacturer,brand,part_type,size,container,retail_price,comment
0,232,ivory peru lavender orange dark,Manufacturer#5,Brand#53,LARGE BURNISHED NICKEL,50,SM PKG,1132.23,"r, unusual requests"
1,273,pink white sky burnished coral,Manufacturer#2,Brand#25,STANDARD BRUSHED BRASS,50,LG BOX,1173.27,ackages along the
2,414,pink brown purple puff snow,Manufacturer#4,Brand#41,SMALL BURNISHED STEEL,50,WRAP CASE,1314.41,efully. dolph
3,436,turquoise yellow dim purple antique,Manufacturer#1,Brand#14,LARGE POLISHED BRASS,50,WRAP CASE,1336.43,the regul
4,679,purple blanched linen metallic indian,Manufacturer#4,Brand#41,SMALL BURNISHED TIN,50,MED BOX,1579.67,iously ironic in
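
The tie-breaking behavior of a tuple `by` resembles sorting on a composite key and then truncating. The sketch below shows that idea in plain Python. The five parts of size 50 reuse the keys from the output above, while the two rows marked hypothetical are made up for contrast, and `top_k` is a helper written for this illustration rather than a PyDough function.

```python
# Parts of size 50 reuse keys from the output above; rows marked
# "hypothetical" are invented so that the sort has something to reject.
parts = [
    {"key": 679, "size": 50},
    {"key": 232, "size": 50},
    {"key": 900, "size": 49},  # hypothetical
    {"key": 436, "size": 50},
    {"key": 273, "size": 50},
    {"key": 5, "size": 12},    # hypothetical
    {"key": 414, "size": 50},
]


def top_k(rows, k):
    """Keep the first k rows ordered by size descending, then key ascending."""
    ordered = sorted(rows, key=lambda row: (-row["size"], row["key"]))
    return ordered[:k]


top_three = top_k(parts, 3)
print([row["key"] for row in top_three])
```

All five largest parts share the same size, so the second component of the sort key alone decides which of them come first.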


## ORDER_BY

If you just want to return your output in a sorted order, you can use `ORDER_BY`. The functionality is the same as in `TOP_K` except that there is no `K` argument so the rows are not reduced. Each argument must be an expression that can be used for sorting.

Below we transform our nations collection to sort the output alphabetically by nation name.

In [39]:
%%pydough

pydough.to_df(nations.ORDER_BY(name.ASC()))

Unnamed: 0,key,region_key,name,comment
0,0,0,ALGERIA,haggle. carefully final deposits detect slyly...
1,1,1,ARGENTINA,al foxes promise slyly according to the regula...
2,2,1,BRAZIL,y alongside of the pending deposits. carefully...
3,3,1,CANADA,"eas hang ironic, silent packages. slyly regula..."
4,18,2,CHINA,c dependencies. furiously express notornis sle...
5,4,4,EGYPT,y above the carefully unusual theodolites. fin...
6,5,0,ETHIOPIA,ven packages wake quickly. regu
7,6,3,FRANCE,"refully final requests. regular, ironi"
8,7,3,GERMANY,"l platelets. regular accounts x-ray: unusual, ..."
9,8,2,INDIA,ss excuses cajole slyly across the packages. d...


## PARTITION

The `PARTITION` operation groups the entries of a collection by one or more keys, similar to a SQL `GROUP BY`. The keys are specified with the `by` argument, and the `name` argument gives a name to the resulting collection of partitioned data. For example, we can use this to bucket nations by name length.

In [57]:
%%pydough

updated_nations = nations.CALCULATE(key, name_length=LENGTH(name))

grouped_nations = updated_nations.PARTITION(
    name="lengths", by=(name_length)
).CALCULATE(
    name_length,
    nation_count=COUNT(nations)
)

# pydough.to_df(updated_nations)
pydough.to_df(grouped_nations)

Unnamed: 0,name_length,nation_count
0,4,3
1,5,5
2,6,5
3,7,5
4,8,1
5,9,2
6,10,1
7,12,1
8,13,1
9,14,1
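
For intuition, the same bucketing can be emulated in plain Python by grouping on the computed key and counting the members of each group. This sketch only uses the ten nation names printed earlier in this notebook, so its counts differ from the full table above.

```python
from collections import defaultdict

# The ten nation names printed earlier in this notebook (not the full table).
nation_names = [
    "ALGERIA", "ARGENTINA", "BRAZIL", "CANADA", "EGYPT",
    "ETHIOPIA", "FRANCE", "GERMANY", "INDIA", "INDONESIA",
]

# Group the names by the partition key (the name length).
groups = defaultdict(list)
for name in nation_names:
    groups[len(name)].append(name)

# One output row per distinct key, with an aggregate over the group.
grouped = [
    {"name_length": length, "nation_count": len(members)}
    for length, members in sorted(groups.items())
]

for row in grouped:
    print(row)
```

Each bucket still holds its member names, which mirrors how the partitioned data remains reachable as a sub-collection of each group.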


A couple important usage details:
* The data inside each partitioned group can be accessed as a sub-collection using its original name (see `nations` in the example above).
* The `name` argument specifies the name of the collection of partitioned data (needed in case the partitioned data gets partitioned again and needs to be accessed by name).
* The `by` argument can be either a single expression or a tuple of them, but it can only contain references to expressions that already exist in the context of the data (e.g. `name`, not `LOWER(name)` or `region.name`).
* Terms defined from the context of the `PARTITION` can be down-streamed to its descendants. An example is shown below where we select brass parts of size 15, but only the ones whose retail price is below the average retail price of all such parts with the same part type.

In [41]:
%%pydough

selected_parts = parts.WHERE(ENDSWITH(part_type, "BRASS") & (size == 15))
part_types = selected_parts.PARTITION(name="types", by=part_type).CALCULATE(avg_price=AVG(parts.retail_price))
output = part_types.parts.WHERE(retail_price < avg_price)
pydough.to_df(output)

Unnamed: 0,key,name,manufacturer,brand,part_type,size,container,retail_price,comment
0,249,hot sandy lavender saddle rosy,Manufacturer#4,Brand#44,ECONOMY BURNISHED BRASS,15,LG JAR,1149.24,excuses kindle f
1,323,moccasin goldenrod tan maroon bisque,Manufacturer#4,Brand#41,MEDIUM BRUSHED BRASS,15,MED CASE,1223.32,ular pi
2,1015,cornflower brown rosy seashell mint,Manufacturer#4,Brand#41,MEDIUM BRUSHED BRASS,15,MED CAN,916.01,ial deposits. acc
3,2037,black blanched cornsilk cornflower metallic,Manufacturer#1,Brand#12,PROMO PLATED BRASS,15,WRAP CAN,939.03,rding to the blithe
4,2156,yellow beige cream deep grey,Manufacturer#5,Brand#53,LARGE ANODIZED BRASS,15,JUMBO BAG,1058.15,grouche
...,...,...,...,...,...,...,...,...,...
366,195065,lawn lime sienna yellow bisque,Manufacturer#3,Brand#32,SMALL ANODIZED BRASS,15,MED BAG,1160.06,s are slyly. pendi
367,197144,steel mint linen smoke beige,Manufacturer#1,Brand#12,STANDARD ANODIZED BRASS,15,MED CAN,1241.14,ld accounts sle
368,197339,slate lime medium midnight saddle,Manufacturer#2,Brand#22,SMALL PLATED BRASS,15,JUMBO DRUM,1436.33,beans nag sly
369,198189,bisque thistle misty green yellow,Manufacturer#4,Brand#43,MEDIUM ANODIZED BRASS,15,MED BOX,1287.18,packages im
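
The pattern above (compute an aggregate per group, then filter each group's members against it) can be sketched in plain Python as follows. The part keys and prices are hypothetical values chosen for easy arithmetic; only the part type names are borrowed from the output above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical parts: keys and prices are invented for this illustration.
parts = [
    {"key": 1, "part_type": "PROMO PLATED BRASS", "retail_price": 900.0},
    {"key": 2, "part_type": "PROMO PLATED BRASS", "retail_price": 1100.0},
    {"key": 3, "part_type": "SMALL PLATED BRASS", "retail_price": 1500.0},
    {"key": 4, "part_type": "SMALL PLATED BRASS", "retail_price": 1300.0},
    {"key": 5, "part_type": "SMALL PLATED BRASS", "retail_price": 1000.0},
]

# Partition the parts by part_type.
by_type = defaultdict(list)
for part in parts:
    by_type[part["part_type"]].append(part)

output = []
for part_type, members in by_type.items():
    # Term defined in the context of the partition ...
    avg_price = mean(p["retail_price"] for p in members)
    # ... and used by the members of that partition.
    output.extend(p for p in members if p["retail_price"] < avg_price)

print([p["key"] for p in output])
```

Each part is compared against the average of its own part type, not against a global average.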


## HAS and HASNOT

The `HAS` and `HASNOT` operations are used for filtering based on whether any match exists between an entry and another collection. For example, consider only the regions that have at least 1 nation whose name is longer than 10 characters.

In [42]:
%%pydough

length_10_nations = nations.WHERE(LENGTH(name) > 10)
pydough.to_df(regions.WHERE(HAS(length_10_nations)))

Unnamed: 0,key,name,comment
0,1,AMERICA,"hs use ironic, even requests. s"
1,3,EUROPE,ly final courts cajole furiously final excuse
2,4,MIDDLE EAST,uickly special accounts cajole carefully blith...


Alternatively, we can consider only the regions where all of the nations have names of length 10 or less.

In [43]:
%%pydough

pydough.to_df(regions.WHERE(HASNOT(length_10_nations)))

Unnamed: 0,key,name,comment
0,0,AFRICA,lar deposits. blithely final packages cajole. ...
1,2,ASIA,ges. thinly even pinto beans ca
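
In plain Python terms, `HAS` behaves like an `any(...)` test over the matching sub-collection and `HASNOT` like its negation. The sketch below uses a small subset of nations. Most names appear in the outputs above, while `SAUDI ARABIA` and its region key are assumed from the standard TPC-H nation list; `has_long_nation` is a helper invented for this illustration.

```python
# (key, name) pairs for the five regions shown earlier.
regions = [
    (0, "AFRICA"), (1, "AMERICA"), (2, "ASIA"),
    (3, "EUROPE"), (4, "MIDDLE EAST"),
]

# (region_key, name) pairs; a small subset of the nation table.
# SAUDI ARABIA is assumed from the standard TPC-H data.
nations = [
    (0, "ALGERIA"), (1, "UNITED STATES"), (2, "INDIA"),
    (3, "UNITED KINGDOM"), (4, "EGYPT"), (4, "SAUDI ARABIA"),
]


def has_long_nation(region_key):
    """True if the region has at least one nation name longer than 10 characters."""
    return any(
        rk == region_key and len(name) > 10 for rk, name in nations
    )


has_regions = [name for key, name in regions if has_long_nation(key)]
hasnot_regions = [name for key, name in regions if not has_long_nation(key)]

print(has_regions)
print(hasnot_regions)
```

Every region lands in exactly one of the two lists, since `HASNOT` is the complement of `HAS` for the same sub-collection.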


## SINGULAR

In PyDough, a term accessed through a sub-collection must be singular with regard to the current collection context. For example, the following PyDough code results in an error:

In [44]:
%%pydough
pydough.to_df(regions.CALCULATE(name, nation_name=nations.name))

PyDoughQDAGException: Expected all terms in CALCULATE(name=name, nation_name=nations.name) to be singular, but encountered a plural expression: nations.name

This results in an error because `nations` is plural with regard to `regions`, so PyDough does not know which nation name to use for each region. Suppose we want a field `nation_4_name` that contains the name of the nation with key 4. A first attempt looks as follows:

In [45]:
%%pydough
nation_4 = nations.WHERE(key == 4)
pydough.to_df(regions.CALCULATE(name, nation_4_name=nation_4.name))

PyDoughQDAGException: Expected all terms in CALCULATE(name=name, nation_4_name=nations.WHERE(key == 4).name) to be singular, but encountered a plural expression: nations.WHERE(key == 4).name

The above code also results in an error: even though we know that there is at most one `nation_4` for each instance of `regions`, PyDough does not know this and therefore prohibits the operation.
To fix this, we can use the `.SINGULAR()` modifier to tell PyDough that the data should be treated as singular.

In [60]:
%%pydough
nation_plural = nations.WHERE(key == 4).SINGULAR()
nation_4 = nations.WHERE(key == 4).SINGULAR()
pydough.to_df(regions.CALCULATE(name, nation_4_name=nation_4.name))
# pydough.to_df(nation_plural)

Unnamed: 0,name,nation_4_name
0,AFRICA,
1,AMERICA,
2,ASIA,
3,EUROPE,
4,MIDDLE EAST,EGYPT
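
The result above can be pictured as "take the only match, or nothing" for each region. The plain-Python sketch below makes that explicit with a hypothetical helper, `singular_name`. Raising an error when there is more than one match is a choice made for this sketch only; the notebook states only that PyDough's behavior is undefined in that case.

```python
# (key, name) for two of the regions and (key, region_key, name) for three
# of the nations shown in this notebook.
regions = [(0, "AFRICA"), (4, "MIDDLE EAST")]
nations = [(0, 0, "ALGERIA"), (4, 4, "EGYPT"), (5, 0, "ETHIOPIA")]


def singular_name(region_key, nation_key):
    """Return the single matching nation name for a region, or None."""
    matches = [
        name
        for key, rk, name in nations
        if rk == region_key and key == nation_key
    ]
    if len(matches) > 1:
        # Sketch-only choice: fail loudly instead of picking arbitrarily.
        raise ValueError("filter matched more than one nation")
    return matches[0] if matches else None


result = [(name, singular_name(key, 4)) for key, name in regions]
print(result)
```

Regions without a nation whose key is 4 get an empty value, which matches the blank cells in the output above.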


In summary, certain PyDough operations, such as specific filters, can cause plural data to become singular. In this case, PyDough will still ban the plural data from being treated as singular unless the `.SINGULAR()` modifier is used to tell PyDough that the data should be treated as singular. It is very important that this only be used if the user is certain that the data will be singular, since otherwise it can result in undefined behavior when the PyDough code is executed.