# PyDough Operations

This notebook gives an overview of the builtin PyDough operations. It is not exhaustive, and the list of functions in particular is incomplete, but these operations should provide a foundation for getting started.

In [None]:
%load_ext pydough.jupyter_extensions

import pydough
# Setup demo metadata
pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH");
pydough.active_session.connect_database("sqlite", database="../../tpch.db");

## Collections

A collection in PyDough is an abstraction for any "document", but in most cases it represents a table. With the TPC-H schema, for example, we access the regions table through its corresponding PyDough collection, `regions`.

In [None]:
%%pydough

print(pydough.to_sql(regions))
pydough.to_df(regions)

Collections contain properties, which correspond either to the entries within a document or to a sub-collection (another document that can be reached from the current document). This is explored in more detail in our notebook on metadata, but what is important to understand is that the path between collections is how we integrate data across multiple tables.

For example, each region is associated with 1 or more nations, so rather than just looking at the region we can look at "each nation for each region". This will result in outputting 1 entry per nation.

In [None]:
%%pydough

print(pydough.to_sql(regions.nations))
pydough.to_df(regions.nations)

Notice how in the generated SQL we create a join between `region` and `nation`. The metadata holds this relationship, effectively abstracting joins away from the developer whenever possible.
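
To make the hidden join concrete, here is a minimal hand-written sketch using Python's `sqlite3` module. The table layout, column names, and rows are toy assumptions for illustration, and the query is not the SQL that PyDough actually generates.

```python
import sqlite3

# Toy stand-ins for the TPC-H tables; column names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region (r_regionkey INTEGER PRIMARY KEY, r_name TEXT);
    CREATE TABLE nation (n_nationkey INTEGER PRIMARY KEY, n_name TEXT, n_regionkey INTEGER);
    INSERT INTO region VALUES (0, 'AFRICA'), (1, 'AMERICA');
    INSERT INTO nation VALUES (0, 'ALGERIA', 0), (1, 'ARGENTINA', 1), (2, 'BRAZIL', 1);
""")

# Hand-written join in the spirit of `regions.nations`: one output row per nation.
rows = conn.execute("""
    SELECT n.n_nationkey, n.n_name
    FROM region AS r
    JOIN nation AS n ON n.n_regionkey = r.r_regionkey
    ORDER BY n.n_nationkey
""").fetchall()
print(rows)
```

The join condition is the piece of information that the metadata stores, which is why it never has to appear in the PyDough code.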

## Calculate

The next important operation is the `CALCULATE` operation, which takes in a variable number of positional and/or keyword arguments.

In [None]:
%%pydough

print(pydough.to_sql(nations.CALCULATE(key, nation_name=name)))

Calculate has a few purposes:
* Select which entries you want in the output.
* Define new fields by calling functions.
* Allow operations to be evaluated for each entry in the outermost collection's "context".
* Define aliases for terms that get down-streamed to descendants ([see here](#down-streaming)).

The terms of the last `CALCULATE` in the PyDough logic are the terms that are included in the result (unless the `columns` argument of `to_sql` or `to_df` is used).

In [None]:
%%pydough

print(pydough.to_sql(nations.CALCULATE(adjusted_key = key + 1)))

Here the context is the "nations" at the root of the graph. This means that for each entry within nations, we compute the result. This has important implications for when we get to more complex expressions. For example, if we want to know how many nations we have stored in each region, we can do so via `CALCULATE`.

In [None]:
%%pydough

pydough.to_df(regions.CALCULATE(name, nation_count=COUNT(nations)))

Internally, this process evaluates `COUNT(nations)` grouped on each region and then joins the result with the original `regions` table. Importantly, this outputs a "scalar" value for each region.
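
The group-then-join idea can be sketched by hand with `sqlite3`. The tables and rows below are toy assumptions, and the `LEFT JOIN` with `COALESCE` is a choice made for this sketch (so that a region without nations shows 0); the SQL that PyDough generates may look different.

```python
import sqlite3

# Toy data (assumed): ASIA deliberately has no nations in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region (r_regionkey INTEGER PRIMARY KEY, r_name TEXT);
    CREATE TABLE nation (n_name TEXT, n_regionkey INTEGER);
    INSERT INTO region VALUES (0, 'AFRICA'), (1, 'AMERICA'), (2, 'ASIA');
    INSERT INTO nation VALUES ('ALGERIA', 0), ('ARGENTINA', 1), ('BRAZIL', 1);
""")

# Step 1: aggregate nations per region key.
# Step 2: join the aggregate back onto the region table, one row per region.
nation_counts = conn.execute("""
    SELECT r.r_name, COALESCE(c.nation_count, 0) AS nation_count
    FROM region AS r
    LEFT JOIN (
        SELECT n_regionkey, COUNT(*) AS nation_count
        FROM nation
        GROUP BY n_regionkey
    ) AS c ON c.n_regionkey = r.r_regionkey
    ORDER BY r.r_name
""").fetchall()
print(nation_counts)
```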

This shows a very important restriction of `CALCULATE`: each final entry in the operation must be scalar with respect to the current context. For example, the expression `regions.CALCULATE(region_name=name, nation_name=nations.name)` is not legal because region to nation is a one-to-many relationship, so there is not a single nation name for each region.

**The cell below will result in an error because it violates this restriction.**

In [None]:
%%pydough

pydough.to_df(regions.CALCULATE(region_name=name, nation_name=nations.name))

In contrast, we know that every nation has exactly 1 region (this is defined in the metadata). As a result, the alternative expression `nations.CALCULATE(nation_name=name, region_name=region.name)` is legal.

In [None]:
%%pydough

pydough.to_df(nations.CALCULATE(nation_name=name, region_name=region.name))

This illustrates one of the important properties of the metadata: defining one:one, many:one, one:many, and many:many relationships gives developers the flexibility to write simpler queries.
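
The difference between the legal and illegal expressions above can be pictured with plain Python data structures. This is only an analogy with made-up toy rows; it does not use PyDough.

```python
# Toy data (assumed): region key -> region name, and nations pointing at a region.
regions = {0: "AFRICA", 1: "AMERICA"}
nations = [
    {"name": "ALGERIA", "region_key": 0},
    {"name": "ARGENTINA", "region_key": 1},
    {"name": "BRAZIL", "region_key": 1},
]

# Many-to-one: each nation maps to exactly one region name, so the value is scalar.
nation_to_region = {n["name"]: regions[n["region_key"]] for n in nations}

# One-to-many: each region maps to a list of nation names, so there is no
# single value that could fill one output cell per region.
region_to_nations = {
    region_name: [n["name"] for n in nations if n["region_key"] == key]
    for key, region_name in regions.items()
}

print(nation_to_region["BRAZIL"])
print(region_to_nations["AMERICA"])
```

An aggregation such as `COUNT` turns the list back into a single value per region, which is why it is allowed in a `CALCULATE`.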

### Functions

PyDough supports many builtin functions. Whenever possible, we support standard Python operators, but this is not always possible. In addition, to avoid namespace conflicts, functions that require regular function call semantics are written in all capital letters by convention. Here are some examples.

In [None]:
%%pydough

# Numeric operations
print("Q1")
print(pydough.to_sql(nations.CALCULATE(key + 1, key - 1, key * 1, key / 1)))

# Comparison operators
print("\nQ2")
print(pydough.to_sql(nations.CALCULATE(key == 0, key < 0, key != 0, key >= 5)))

# String Operations
print("\nQ3")
print(pydough.to_sql(nations.CALCULATE(LENGTH(name), UPPER(name), LOWER(name), STARTSWITH(name, "A"))))

# Boolean operations
print("\nQ4")
print(pydough.to_sql(nations.CALCULATE((key != 1) & (LENGTH(name) > 5)))) # Boolean AND
print("\nQ5")
print(pydough.to_sql(nations.CALCULATE((key != 1) | (LENGTH(name) > 5)))) # Boolean OR
print("\nQ6")
print(pydough.to_sql(nations.CALCULATE(~(LENGTH(name) > 5)))) # Boolean NOT
print("\nQ7")
print(pydough.to_sql(nations.CALCULATE(ISIN(name, ("KENYA", "JAPAN"))))) # In

# Datetime Operations
# Note: Since this is based on SQLite, the underlying date representation is a bit unusual.
print("\nQ8")
print(pydough.to_sql(lines.CALCULATE(YEAR(ship_date), MONTH(ship_date), DAY(ship_date), HOUR(ship_date), MINUTE(ship_date), SECOND(ship_date))))

# Aggregation operations
print("\nQ9")
print(pydough.to_sql(TPCH.CALCULATE(NDISTINCT(nations.comment), SUM(nations.key))))
# Count can be used on a column for non-null entries or a collection
# for total entries.
print("\nQ10")
print(pydough.to_sql(TPCH.CALCULATE(COUNT(nations), COUNT(nations.comment))))
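
The difference between counting a collection and counting a column mirrors the difference between `COUNT(*)` and `COUNT(column)` in SQL. Here is a small `sqlite3` sketch with made-up rows; the `NULL` comment is an assumption added purely to make the two counts differ.

```python
import sqlite3

# Toy table (assumed); one comment is NULL on purpose.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nation (n_name TEXT, n_comment TEXT);
    INSERT INTO nation VALUES
        ('ALGERIA', 'first comment'),
        ('ARGENTINA', NULL),
        ('BRAZIL', 'another comment');
""")

# COUNT(*) counts rows; COUNT(n_comment) counts only non-null comments.
total_rows, non_null_comments = conn.execute(
    "SELECT COUNT(*), COUNT(n_comment) FROM nation"
).fetchone()
print(total_rows, non_null_comments)
```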

#### Limitations

There are a few limitations relative to regular Python. Most notably:
* You cannot use Python's builtin `and`, `or`, `not`, or `in` with PyDough expressions.
* We do not support chained comparisons (e.g. `2 < x < 5`).
* We only support Python literals that are `integers`, `floats`, `strings`, `datetime.date`, or a `tuple`/`list` of those supported types.
* Lists and tuples can only be used with `ISIN`.
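
The first two limitations follow from how Python itself works: `&`, `|`, and `~` can be overloaded by a class, while `and`, `or`, and `not` cannot and instead ask for the truth value of their operand. A chained comparison such as `2 < x < 5` is evaluated as `(2 < x) and (x < 5)`, so it hits the same wall. The `Expr` class below is a hypothetical stand-in written for this sketch, not a PyDough class, and raising in `__bool__` is a choice of the sketch.

```python
class Expr:
    """Hypothetical symbolic expression used only for this illustration."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):  # invoked by `a & b`
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):  # invoked by `a | b`
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self):  # invoked by `~a`
        return Expr(f"(NOT {self.text})")

    def __gt__(self, other):
        return Expr(f"({self.text} > {other!r})")

    def __lt__(self, other):
        return Expr(f"({self.text} < {other!r})")

    def __bool__(self):
        # `and`, `or`, `not`, and chained comparisons all end up here.
        raise TypeError("symbolic expression has no truth value")


x = Expr("x")

# Works: the operators build a new symbolic expression.
combined = (x > 2) & (x < 5)
print(combined.text)

# Fails: the chained form needs a real True/False for `2 < x`.
chained_error = None
try:
    2 < x < 5
except TypeError as err:
    chained_error = str(err)
print(chained_error)
```

This is also why the boolean examples earlier wrap each comparison in parentheses: `&` and `|` bind more tightly than comparison operators in Python.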

### Down-Streaming

Sometimes you need to load a value from a previous context to use at a later step in a PyDough statement. Any expression from an ancestor context that is placed in a `CALCULATE` is automatically made available to all descendants of that context. However, an error will occur if the name of the term defined in the ancestor collides with a name of a term or property of a descendant context, since PyDough will not know which one to use.

Notice how in the example below, `region_name` is defined in a `CALCULATE` within the context of `regions`, so the calculate within the context of `nations` also has access to `region_name` (interpreted as "the name of the region that this nation belongs to").

In [None]:
%%pydough

pydough.to_df(regions.CALCULATE(region_name=name).nations.CALCULATE(region_name, nation_name=name))

Here is a more complex example showing intermediate values. Here we will first compute `total_value` and then reuse it downstream.

In [None]:
%%pydough

nations_value = nations.CALCULATE(nation_name=name, total_value=SUM(suppliers.account_balance))
pydough.to_df(nations_value)

In [None]:
%%pydough
suppliers_value = nations_value.suppliers.CALCULATE(
    key,
    name,
    nation_name,
    account_balance=account_balance,
    percentage_of_national_value=100 * account_balance / total_value
)
top_suppliers = suppliers_value.TOP_K(20, by=percentage_of_national_value.DESC())
pydough.to_df(top_suppliers)
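
As a rough mental model of down-streaming, the same computation can be written as nested loops in plain Python. The data below is made up and the structure is only an analogy: the value computed in the outer loop (the ancestor context) is visible in the inner loop (the descendant context).

```python
# Toy data (assumed): two nations with a few suppliers each.
nations = [
    {
        "name": "NATION A",
        "suppliers": [
            {"name": "S1", "account_balance": 300.0},
            {"name": "S2", "account_balance": 100.0},
        ],
    },
    {
        "name": "NATION B",
        "suppliers": [{"name": "S3", "account_balance": 50.0}],
    },
]

rows = []
for nation in nations:
    # Computed once per nation (the ancestor context) ...
    total_value = sum(s["account_balance"] for s in nation["suppliers"])
    for supplier in nation["suppliers"]:
        # ... and reused for every supplier of that nation (the descendant context).
        rows.append(
            {
                "name": supplier["name"],
                "nation_name": nation["name"],
                "percentage_of_national_value": 100 * supplier["account_balance"] / total_value,
            }
        )

# Analogue of TOP_K(2, by=percentage_of_national_value.DESC()).
top_suppliers = sorted(
    rows, key=lambda r: r["percentage_of_national_value"], reverse=True
)[:2]
print([r["name"] for r in top_suppliers])
```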

## WHERE

The `WHERE` operation can be used to filter unwanted entries in a context. For example, we can filter `nations` to only consider the `AMERICA` and `EUROPE` regions. A `WHERE`'s context functions similarly to a `CALCULATE`, except that it cannot be used to assign new properties; it takes a single positional argument: the predicate to filter on.

In [None]:
%%pydough

pydough.to_df(nations.WHERE((region.name == "AMERICA") | (region.name == "EUROPE")))
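
Because the predicate refers to `region.name`, the filter needs information from a second table. A hand-written `sqlite3` sketch of that idea is shown here; the tables and rows are toy assumptions, and the query is not necessarily what PyDough generates.

```python
import sqlite3

# Toy data (assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region (r_regionkey INTEGER PRIMARY KEY, r_name TEXT);
    CREATE TABLE nation (n_name TEXT, n_regionkey INTEGER);
    INSERT INTO region VALUES (0, 'AFRICA'), (1, 'AMERICA'), (3, 'EUROPE');
    INSERT INTO nation VALUES ('ALGERIA', 0), ('BRAZIL', 1), ('FRANCE', 3), ('KENYA', 0);
""")

# Join each nation to its region, then keep only the two wanted regions.
filtered = [
    name
    for (name,) in conn.execute("""
        SELECT n.n_name
        FROM nation AS n
        JOIN region AS r ON r.r_regionkey = n.n_regionkey
        WHERE r.r_name = 'AMERICA' OR r.r_name = 'EUROPE'
        ORDER BY n.n_name
    """)
]
print(filtered)
```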

## TOP_K

The `TOP_K` operation reduces a collection to at most a given number of entries. The `by` argument orders the output based on a sorting condition. As an example, consider this query, which keeps only the first 5 nations in alphabetical order by name.

In [None]:
%%pydough

pydough.to_df(nations.TOP_K(5, by=name.ASC()))

The `by` argument requirements are:
* Anything that can be an expression used in a `CALCULATE` or a `WHERE` can be used as a component of a `by`.
* The value in the `by` must end with either `.ASC()` or `.DESC()`.

You can also provide a tuple to `by` if you need to break ties. Consider this alternative, which selects the 20 parts with the largest size, breaking ties in favor of the smallest part key.

In [None]:
%%pydough

pydough.to_df(parts.TOP_K(20, by=(size.DESC(), key.ASC())))
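
The tie-breaking behaviour of a tuple `by` can be mimicked in plain Python with a compound sort key. The rows below are made up, and this is only an analogy for how the ordering works.

```python
# Toy data (assumed): three parts share the largest size.
parts = [
    {"key": 3, "size": 50},
    {"key": 1, "size": 50},
    {"key": 2, "size": 49},
    {"key": 4, "size": 50},
]

# Sort by size descending first, then by key ascending to break ties,
# and keep the first 3 rows (the analogue of K).
top_3 = sorted(parts, key=lambda p: (-p["size"], p["key"]))[:3]
print([p["key"] for p in top_3])
```

Without the second component, the order among the three parts of size 50 would not be pinned down by the sort condition alone.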

## ORDER_BY

If you just want to return your output in a sorted order, you can use `ORDER_BY`. The functionality is the same as in `TOP_K` except that there is no `K` argument so the rows are not reduced. Each argument must be an expression that can be used for sorting.

Below, we sort our nations collection by the alphabetical ordering of the nation names.

In [None]:
%%pydough

pydough.to_df(nations.ORDER_BY(name.ASC()))

## PARTITION

The `PARTITION` operation groups a collection under keys of interest, similar to a SQL `GROUP BY`. Keys are specified using the `by` argument, and the data to be aggregated can be referenced using the `name` argument. For example, we can use this to bucket nations by name length.

In [None]:
%%pydough

updated_nations = nations.CALCULATE(key, name_length=LENGTH(name))
grouped_nations = PARTITION(
    updated_nations, name="lengths", by=(name_length)
).CALCULATE(
    name_length,
    nation_count=COUNT(nations)
)
pydough.to_df(grouped_nations)
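
The bucketing performed above can be pictured in plain Python as grouping names by their length and then counting each group. The list of names is a small hand-picked sample, so the counts will not match the full dataset.

```python
from collections import defaultdict

# Small sample of nation names (assumed for illustration).
nation_names = ["PERU", "IRAN", "IRAQ", "CHINA", "INDIA", "JAPAN", "KENYA", "BRAZIL"]

# Partition step: bucket the names under the key `name_length`.
groups = defaultdict(list)
for name in nation_names:
    groups[len(name)].append(name)

# Aggregation step: one output row per key, with a count of the bucket.
grouped_nations = [
    {"name_length": length, "nation_count": len(members)}
    for length, members in sorted(groups.items())
]
print(grouped_nations)
```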

A few important usage details:
* The data inside each partitioned group can be accessed as a sub-collection using its original name (see `nations` in the example above).
* The `name` argument specifies the name of the collection of partitioned data (needed in case the partitioned data gets partitioned again and needs to be accessed by name).
* The `by` keys can be either a single expression or a tuple of them, but they can only be references to expressions that already exist in the context of the data (e.g. `name`, not `LOWER(name)` or `region.name`).
* Terms defined from the context of the `PARTITION` can be down-streamed to its descendants. An example is shown below where we select brass parts of size 15, but only the ones whose retail price is below the average for such parts of the same part type.

In [None]:
%%pydough

selected_parts = parts.WHERE(ENDSWITH(part_type, "BRASS") & (size == 15))
part_types = PARTITION(selected_parts, name="types", by=part_type).CALCULATE(avg_price=AVG(parts.retail_price))
output = part_types.types.WHERE(retail_price < avg_price)
pydough.to_df(output)
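
In plain Python terms, the query above computes one average per part type and then compares every part in that group against it. The sketch below uses made-up rows (the part types and prices are not real TPC-H values) and is only an analogy for the down-streamed `avg_price`.

```python
from collections import defaultdict
from statistics import mean

# Toy data (assumed): already filtered to brass parts of one size.
selected_parts = [
    {"key": 1, "part_type": "SMALL BRASS", "retail_price": 900.0},
    {"key": 2, "part_type": "SMALL BRASS", "retail_price": 1100.0},
    {"key": 3, "part_type": "LARGE BRASS", "retail_price": 1500.0},
    {"key": 4, "part_type": "LARGE BRASS", "retail_price": 1300.0},
    {"key": 5, "part_type": "LARGE BRASS", "retail_price": 1400.0},
]

# Partition step: group the parts by part type.
by_type = defaultdict(list)
for part in selected_parts:
    by_type[part["part_type"]].append(part)

output = []
for part_type, members in by_type.items():
    # Computed once per group, then visible to every part in the group.
    avg_price = mean(p["retail_price"] for p in members)
    output.extend(p for p in members if p["retail_price"] < avg_price)

print([p["key"] for p in output])
```

Part 5 is excluded because its price equals the group average and the comparison is strict.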

## HAS and HASNOT

The `HAS` and `HASNOT` operations are used for filtering based on whether any match occurs between an entry and another collection. For example, consider only regions that have at least 1 nation whose name is longer than 10 characters.

In [None]:
%%pydough

length_10_nations = nations.WHERE(LENGTH(name) > 10)
pydough.to_df(regions.WHERE(HAS(length_10_nations)))

Alternatively, we can consider only regions where all of the nations have names of length 10 or less.

In [None]:
%%pydough

pydough.to_df(regions.WHERE(HASNOT(length_10_nations)))
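
A plain Python analogy for the two filters is `any(...)` and `not any(...)` over the nations of each region. The data below is a small made-up sample, so the result is not the result of the cells above.

```python
# Toy data (assumed): region name -> names of its nations.
regions = {
    "AMERICA": ["ARGENTINA", "BRAZIL", "UNITED STATES"],
    "ASIA": ["CHINA", "INDIA", "JAPAN"],
}


def is_long(name):
    """Predicate matching the filter used for `length_10_nations`."""
    return len(name) > 10


# HAS analogue: keep regions with at least one matching nation.
has_long = [r for r, names in regions.items() if any(is_long(n) for n in names)]

# HASNOT analogue: keep regions with no matching nation at all.
has_no_long = [r for r, names in regions.items() if not any(is_long(n) for n in names)]

print(has_long, has_no_long)
```

Every region lands in exactly one of the two lists, since the second condition is the negation of the first.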

## SINGULAR

In PyDough, when we access a sub-collection from a collection context, the sub-collection must be singular with regard to that collection. For example, consider the following PyDough code, which results in an error:

In [None]:
%%pydough
pydough.to_df(regions.CALCULATE(name, nation_name=nations.name))

This results in an error because `nations` is plural with regard to `regions`, so PyDough does not know which nation name to use for each region. Let's say we want a field `nation_4_name` that contains the name of the nation with key 4. A first attempt at the PyDough code looks as follows:

In [None]:
%%pydough
nation_4 = nations.WHERE(key == 4)
pydough.to_df(regions.CALCULATE(name, nation_4_name=nation_4.name))

The above code also results in an error: even though we know that there is at most one `nation_4` for each instance of `regions`, PyDough does not know this and therefore prohibits the operation.
To fix this, we can use the `.SINGULAR()` modifier to tell PyDough that the data should be treated as singular.

In [None]:
%%pydough
nation_4 = nations.WHERE(key == 4).SINGULAR()
pydough.to_df(regions.CALCULATE(name, nation_4_name=nation_4.name))

In summary, certain PyDough operations, such as specific filters, can cause plural data to become singular. In this case, PyDough will still ban the plural data from being treated as singular unless the `.SINGULAR()` modifier is used to tell PyDough that the data should be treated as singular. It is very important that this only be used if the user is certain that the data will be singular, since otherwise it can result in undefined behavior when the PyDough code is executed.
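
To illustrate the assumption that `.SINGULAR()` asks us to make, here is a plain Python sketch with a hypothetical helper named `singular`. Unlike `.SINGULAR()`, which leaves the plural case undefined as noted above, this helper raises an error so that a broken assumption is visible. The data is made up, and using `None` for a region without a match is a choice of the sketch.

```python
def singular(matches):
    """Hypothetical helper: return the only match, or None if there is none."""
    if len(matches) > 1:
        raise ValueError(f"expected at most one match, got {len(matches)}")
    return matches[0] if matches else None


# Toy data (assumed): region name -> its nations.
regions = {
    "AFRICA": [{"key": 0, "name": "ALGERIA"}],
    "MIDDLE EAST": [{"key": 4, "name": "EGYPT"}, {"key": 10, "name": "IRAN"}],
}

nation_4_names = {}
for region_name, nations in regions.items():
    # A key filter yields at most one nation per region, so `singular` is safe here.
    nation_4 = singular([n for n in nations if n["key"] == 4])
    nation_4_names[region_name] = nation_4["name"] if nation_4 else None
print(nation_4_names)

# A filter that can match several nations breaks the assumption.
plural_error = None
try:
    singular([n for n in regions["MIDDLE EAST"] if n["key"] >= 4])
except ValueError as err:
    plural_error = str(err)
print(plural_error)
```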