# PyDough Operations

This notebook provides an overview of the PyDough operations that serve as building blocks for answering analytics questions. It is not exhaustive; in particular, the list of functions is incomplete. However, we believe these operations are a solid foundation for getting started.

In [2]:
%load_ext pydough_jupyter_extensions

import pydough
# Setup demo metadata
pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH");
pydough.active_session.connect_database("sqlite", database="../tpch.db");



## Collections

The natural place to start is accessing a collection. A collection in PyDough is an abstraction for any "document", but in most cases it represents a table. Starting with the TPC-H schema, if we want to access the regions table, we use the corresponding PyDough collection:

In [4]:
%%pydough

print(pydough.to_sql(regions))
pydough.to_df(regions)

SELECT r_regionkey AS key, r_name AS name, r_comment AS comment FROM main.REGION


|   | key | name | comment |
|---|-----|------|---------|
| 0 | 0 | AFRICA | lar deposits. blithely final packages cajole. ... |
| 1 | 1 | AMERICA | hs use ironic, even requests. s |
| 2 | 2 | ASIA | ges. thinly even pinto beans ca |
| 3 | 3 | EUROPE | ly final courts cajole furiously final excuse |
| 4 | 4 | MIDDLE EAST | uickly special accounts cajole carefully blith... |


Collections contain properties, which correspond either to the entries within a document or to a subcollection (another document that can be reached from the current document). Our notebook on metadata covers this in more detail; the important point here is that the path between collections is how we integrate data across multiple tables.

For example, each region is associated with 1 or more nations, so rather than just looking at the region we can look at "each nation for each region".

In [5]:
%%pydough

print(pydough.to_sql(regions.nations))
pydough.to_df(regions.nations)

SELECT _table_alias_1.key AS key, name AS name, region_key, comment AS comment FROM (SELECT r_regionkey AS key FROM main.REGION) AS _table_alias_0 INNER JOIN (SELECT n_regionkey AS region_key, n_nationkey AS key, n_comment AS comment, n_name AS name FROM main.NATION) AS _table_alias_1 ON _table_alias_0.key = region_key


|   | key | name | region_key | comment |
|---|-----|------|------------|---------|
| 0 | 0 | ALGERIA | 0 | haggle. carefully final deposits detect slyly... |
| 1 | 1 | ARGENTINA | 1 | al foxes promise slyly according to the regula... |
| 2 | 2 | BRAZIL | 1 | y alongside of the pending deposits. carefully... |
| 3 | 3 | CANADA | 1 | eas hang ironic, silent packages. slyly regula... |
| 4 | 4 | EGYPT | 4 | y above the carefully unusual theodolites. fin... |
| 5 | 5 | ETHIOPIA | 0 | ven packages wake quickly. regu |
| 6 | 6 | FRANCE | 3 | refully final requests. regular, ironi |
| 7 | 7 | GERMANY | 3 | l platelets. regular accounts x-ray: unusual, ... |
| 8 | 8 | INDIA | 2 | ss excuses cajole slyly across the packages. d... |
| 9 | 9 | INDONESIA | 2 | slyly express asymptotes. regular deposits ha... |


Notice how in the generated SQL we create a join between `region` and `nation`. The metadata holds this relationship, effectively abstracting joins away from the developer whenever possible.
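To make the implied join concrete, here is a minimal sketch that runs a hand-written equivalent against an in-memory SQLite database. The table layout mirrors the column names in the generated SQL above, but the tables hold only three nations taken from the output shown earlier, and the query is our own simplification, not the exact SQL that PyDough emits.

```python
import sqlite3

# Toy stand-in for the TPC-H REGION and NATION tables (small subset only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE REGION (r_regionkey INTEGER PRIMARY KEY, r_name TEXT);
    CREATE TABLE NATION (
        n_nationkey INTEGER PRIMARY KEY,
        n_name TEXT,
        n_regionkey INTEGER
    );
    INSERT INTO REGION VALUES (0, 'AFRICA'), (1, 'AMERICA');
    INSERT INTO NATION VALUES
        (0, 'ALGERIA', 0), (1, 'ARGENTINA', 1), (2, 'BRAZIL', 1);
    """
)

# Hand-written equivalent of the join implied by `regions.nations`:
# each nation row is matched to its region through the region key.
join_rows = conn.execute(
    """
    SELECT n.n_nationkey AS key, n.n_name AS name, n.n_regionkey AS region_key
    FROM REGION AS r
    INNER JOIN NATION AS n ON r.r_regionkey = n.n_regionkey
    ORDER BY n.n_nationkey
    """
).fetchall()
conn.close()

print(join_rows)
```

In PyDough the join condition never appears in the query itself; it comes from the relationship recorded in the metadata.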

## Calc

The next important operation is the `CALC` operation, which is used by "calling" a collection as a function.

In [8]:
%%pydough

pydough.to_sql(nations(key))

'SELECT n_nationkey AS key FROM main.NATION'

Calc has a few purposes:
* Select which entries you want in the output.
* Define new fields by calling functions.
* Allow operations to be evaluated for each entry in the outermost collection expression.

In [9]:
%%pydough

pydough.to_sql(nations(key + 1))

'SELECT key + 1 AS _expr0 FROM (SELECT n_nationkey AS key FROM main.NATION)'

Here the context is the "nations" at the root of the graph. This means that we compute the result for each entry within nations. This has important implications once we get to more complex expressions. For example, if we want to know how many nations are stored in each region, we can do so via CALC.

In [12]:
%%pydough

pydough.to_df(regions(name, nation_count=COUNT(nations)))

|   | name | nation_count |
|---|------|--------------|
| 0 | AFRICA | 5 |
| 1 | AMERICA | 5 |
| 2 | ASIA | 5 |
| 3 | EUROPE | 5 |
| 4 | MIDDLE EAST | 5 |


Internally, this process evaluates `COUNT(nations)` in the context of `regions`. That means that for each entry in regions we navigate to `regions.nations` and perform the calc operation to reduce the results to a scalar.
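The "for each region, navigate to its nations, then reduce to a scalar" evaluation can be pictured in plain Python. This is only a conceptual sketch: the helper `nations_of` is hypothetical, the data is a five-nation subset of the output shown earlier (so the counts differ from the real result), and PyDough itself compiles the expression to SQL rather than looping in Python.

```python
# Toy subset of the region and nation data.
regions = [{"key": 0, "name": "AFRICA"}, {"key": 1, "name": "AMERICA"}]
nations = [
    {"name": "ALGERIA", "region_key": 0},
    {"name": "ARGENTINA", "region_key": 1},
    {"name": "BRAZIL", "region_key": 1},
    {"name": "CANADA", "region_key": 1},
    {"name": "ETHIOPIA", "region_key": 0},
]


def nations_of(region):
    """Hypothetical helper: follow the region -> nations path."""
    return [n for n in nations if n["region_key"] == region["key"]]


# One output row per region: the plural sub-collection is reduced
# to a single value by counting its entries.
nation_counts = [
    {"name": r["name"], "nation_count": len(nations_of(r))} for r in regions
]

print(nation_counts)
```

The output has exactly one row per region because the count collapses each region's list of nations into one number.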

This shows a very important restriction of CALC: each final entry in a calc expression must be scalar with respect to the current context. For example, the expression `regions(region_name=name, nation_name=nations.name)` is not legal because region to nation is a one-to-many relationship, so there is no single scalar value of `nations.name` for each region.

In [14]:
%%pydough

pydough.to_df(regions(region_name=name, nation_name=nations.name))

PyDoughASTException: Expected all terms in (region_name=name, nation_name=nations.name) to be singular, but encountered a plural expression: nations.name

In contrast, we know that every nation has 1 region (and this is defined in the metadata). As a result the alternative expression, `nations(nation_name=name, region_name=region.name)` is legal.

In [16]:
%%pydough

pydough.to_df(nations(nation_name=name, region_name=region.name))

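|   | nation_name | region_name |
|---|-------------|-------------|
| 0 | ALGERIA | AFRICA |
| 1 | ARGENTINA | AMERICA |
| 2 | BRAZIL | AMERICA |
| 3 | CANADA | AMERICA |
| 4 | EGYPT | MIDDLE EAST |
| 5 | ETHIOPIA | AFRICA |
| 6 | FRANCE | EUROPE |
| 7 | GERMANY | EUROPE |
| 8 | INDIA | ASIA |
| 9 | INDONESIA | ASIA |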


This illustrates one of the important properties of the metadata: defining one:one, many:one, one:many, and many:many relationships gives developers the flexibility to write simpler queries.
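One way to picture how relationship cardinality decides whether a term is singular or plural is a small lookup over the relationships. The sketch below is purely illustrative: the `RELATIONSHIPS` table and the `is_singular` function are hypothetical names of our own and do not reflect how PyDough stores its metadata or performs this check.

```python
# Hypothetical relationship table:
# (collection, property) -> (target collection, True if at most one record).
RELATIONSHIPS = {
    ("regions", "nations"): ("nations", False),  # one region : many nations
    ("nations", "region"): ("regions", True),    # many nations : one region
}


def is_singular(start, path):
    """Return True if every hop along `path` yields at most one record."""
    current = start
    for prop in path:
        current, singular = RELATIONSHIPS[(current, prop)]
        if not singular:
            return False
    return True


# `region.name` from the context of nations: one region per nation.
print(is_singular("nations", ["region"]))

# `nations.name` from the context of regions: many nations per region.
print(is_singular("regions", ["nations"]))
```

Under this model, a path is plural as soon as any hop is one-to-many, which matches the error raised for `nations.name` in the context of `regions`.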

### Functions

PyDough supports many functions for analytics calculations. Whenever possible, we try to support standard Python operators, but this is not always possible. In addition, to avoid namespace conflicts, functions that require regular function call semantics are written in all capital letters by convention. Here are some examples.

In [17]:
%%pydough

# Numeric operations

# Comparison operators

# String Operations

# Datetime Operations

# Aggregation operations

#### Limitations

A few regular Python constructs are not supported. Most notably:
* You cannot use Python's builtin `and` or `or` with PyDough expressions.
* We do not support chained comparisons (e.g. `2 < x < 5`).
* We only support Python literals that are `integers`, `floats`, `strings`, `datetime.date`, or a `tuple`/`list` of those supported types.
* Lists and tuples can only be used with `in`.
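
The first two limitations follow from how Python itself works: `and` and `or` test the truth value of their operands and cannot be overloaded, and a chained comparison such as `2 < x < 5` is evaluated as `(2 < x) and (x < 5)`. The sketch below demonstrates this with a toy `Expr` class, which is hypothetical and is not PyDough's implementation. It uses the overloadable `&` operator as the alternative, a common choice in expression-building libraries; check the PyDough documentation for the exact operators it supports.

```python
class Expr:
    """Hypothetical expression node that builds text instead of computing."""

    def __init__(self, text):
        self.text = text

    def __lt__(self, other):
        return Expr(f"({self.text} < {_text(other)})")

    def __gt__(self, other):
        return Expr(f"({self.text} > {_text(other)})")

    def __and__(self, other):
        # `&` can be overloaded, unlike the `and` keyword.
        return Expr(f"({self.text} AND {_text(other)})")

    def __bool__(self):
        # `and`, `or`, and chained comparisons all end up here.
        raise TypeError("expression has no truth value")


def _text(value):
    """Render an operand: nested expressions by text, literals by repr."""
    return value.text if isinstance(value, Expr) else repr(value)


x = Expr("x")

# Works: each comparison returns an Expr, and `&` combines them.
combined = (x > 2) & (x < 5)
print(combined.text)

# Fails: Python asks for the truth value of `(2 < x)` before continuing.
try:
    2 < x < 5
    chained_error = None
except TypeError as err:
    chained_error = str(err)
print(chained_error)
```

Because an expression describes a computation over every row rather than a single `True` or `False`, raising an error in `__bool__` is safer than silently returning a value.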