# Introduction

Welcome to the PyDough demos. This notebook explains the structure of a PyDough notebook
and its fundamental operations. Depending on where you are executing this notebook, some of the setup steps may already be done for you.

## Importing PyDough

The first step in working with PyDough is to `import pydough`, which every notebook must do. All later operations depend on it.

In [1]:
import pydough

## Metadata Setup

When working with PyDough, metadata is essential to the analytics experience. For these demos we provide all metadata for you, but future demos may walk through the metadata creation process. A follow-up notebook will take a deeper look at the inner workings of the metadata.

To set up the metadata, we use the PyDough `active_session`, a simple class that encapsulates configuration information. To attach our metadata graph, we call `pydough.active_session.load_metadata_graph(filename, graphname)`:

In [2]:
pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH");

Now our PyDough graph is loaded, so we can construct PyDough expressions that refer to the relationships defined by this graph. This graph is a metadata representation of the standard TPC-H schema often used for benchmarking. We won't dive into the details of the relationships in this notebook, but note that PyDough can reference the tables involved and move between them based on their standard join key relationships, although the names may be different.

## Jupyter Extension

To actually execute PyDough, we have written a Jupyter extension that allows cells to contain PyDough expressions that are not valid Python. For example, now that the metadata is loaded, you can refer directly to `nations`, even though this name is not defined in regular Python.

To enable PyDough, we have therefore defined our own Jupyter cell magic, `%%pydough`. When used, the body of the cell is converted to a PyDough expression by resolving names first against regular Python variables and then against the metadata. In the future we intend for this to be a cell drop-down option that automatically adds this magic to the relevant cells, but for now we feel this accurately represents the "feel" of working in PyDough.
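
The lookup order described above (Python variables first, metadata second) can be pictured with a toy example built on the standard library's `ChainMap`. This is only a sketch of the idea, not how the extension is implemented; the names `top_n` and the placeholder strings are hypothetical.

```python
from collections import ChainMap

# Hypothetical stand-ins: one mapping for Python variables, one for metadata names.
python_variables = {"top_n": 2}
metadata_names = {
    "nations": "collection 'nations' from the graph",
    "top_n": "metadata entry that is shadowed",
}

# ChainMap searches its mappings left to right, so Python variables take priority.
resolver = ChainMap(python_variables, metadata_names)

print(resolver["top_n"])    # found among the Python variables
print(resolver["nations"])  # not a Python variable, so it falls back to the metadata
```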

*Note*: If running outside our demo setup, you may need to execute the following steps to enable the extension.

```
import pydough_jupyter_extensions
%load_ext pydough_jupyter_extensions
```

In [3]:
%load_ext pydough_jupyter_extensions

Now we can define our access to nations, where we may only want to load the key and name.

In [4]:
%%pydough

nations(key, name)

?.nations(key=?.key, name=?.name)

As you can see, this step resolved without any Python errors. However, this isn't actually useful because we can't use this result anywhere. Instead, we can assign this result to a Python variable to make it accessible in future cells, whether they are Python cells or more PyDough cells.

In [5]:
%%pydough

nation_keys = nations(key, name)

It's important to note that so far we haven't actually **executed** anything. Although we have determined that `nations` and `key` may be valid names in the graph, this is an example of what we call a **contextless expression**, meaning that its actual resolution may differ depending on how this variable is inserted into an expression that we actually execute.

We won't showcase this behavior in this demo, but it's important to note that the name binding for `nations` does not require it to refer only to a top-level collection/table.
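
As a rough intuition for what "contextless" means, consider the toy class below. It is a hypothetical illustration, not PyDough's actual implementation: the name is stored unbound, and it only acquires a value once a context is supplied. The context dictionaries are made-up sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnresolvedName:
    """Hypothetical toy: a name that has not yet been bound to any data."""

    name: str

    def resolve(self, context: dict) -> object:
        # Binding happens only here, against whatever context is supplied.
        return context[self.name]


key = UnresolvedName("key")

# The same unresolved name yields different results in different contexts.
nation_context = {"key": 7, "name": "GERMANY"}
region_context = {"key": 3, "name": "EUROPE"}

print(key.resolve(nation_context))  # 7
print(key.resolve(region_context))  # 3
```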

We can then reuse this name if we decide to select only the 2 nations with the fewest customers. This step consists of a few parts, but in essence we are saying:
1. Select our nations as defined above.
2. Use the `TOP_K` operation, which gives the first K elements as ordered by the `by` argument. In our example K is 2.
3. We define our sorting key to be the customer count. What this is saying is that for each nation, count the number of customers connected to it, which works because `nations.customers` is a defined path in our metadata. By calling the `COUNT` operation we reduce this result to 1 value per entry, which can then be used for sorting.
4. We indicate that we want to sort in ascending order.

In [6]:
%%pydough

lowest_customer_nations = nation_keys(key, name, cust_count=COUNT(customers)).TOP_K(2, by=cust_count.ASC())
lowest_customer_nations

?.nations(key=?.key, name=?.name)(key=?.key, name=?.name, cust_count=COUNT(?.customers)).TOP_K(2, by=(?.cust_count.ASC(na_pos='last')))
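
For intuition, the four steps above correspond roughly to the following plain-Python computation. This is a sketch over made-up in-memory data (the customer lists are hypothetical), and it says nothing about how PyDough evaluates the expression internally.

```python
import heapq

# Hypothetical stand-in data: each nation with the customers connected to it.
nations = [
    {"key": 0, "name": "ALGERIA", "customers": ["c1", "c2", "c3"]},
    {"key": 2, "name": "BRAZIL", "customers": ["c4"]},
    {"key": 3, "name": "CANADA", "customers": ["c5", "c6"]},
    {"key": 6, "name": "FRANCE", "customers": ["c7", "c8", "c9", "c10"]},
]

# Steps 1 and 3: keep key and name, and reduce the customers to one count per nation.
with_counts = [
    {"key": n["key"], "name": n["name"], "cust_count": len(n["customers"])}
    for n in nations
]

# Steps 2 and 4: take the K=2 rows with the smallest cust_count, in ascending order.
lowest = heapq.nsmallest(2, with_counts, key=lambda row: row["cust_count"])

for row in lowest:
    print(row["name"], row["cust_count"])
```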

## Evaluating PyDough

Right now there are two primary ways to evaluate PyDough expressions.
1. Convert PyDough expressions into a SQL query using `pydough.to_sql()`
2. Execute PyDough expressions on a SQL database using `pydough.to_df()`

Setting up either situation requires making a change to the underlying active session. By default PyDough can generate ANSI SQL (but cannot execute it on a database).

These APIs can be called from regular Python only when applied directly to a Python variable that has already been resolved from PyDough. For more complex expressions, please use PyDough cells.

In [7]:
%%pydough

pydough.to_sql(nation_keys)

'SELECT n_nationkey AS key, n_name AS name FROM main.NATION'

In [8]:
%%pydough

pydough.to_df(nation_keys)

ValueError: No SQL Database is specified.

As you can see, we can generate the underlying SQL, but the system does not know how to execute it. To enable execution, we call `pydough.active_session.connect_database()` to indicate which database our SQL should target. This API is intended to work with any DB API 2.0 compatible database connection, although right now it is only set up to work with SQLite. This step is done by providing the database name and the required `connect(...)` arguments. The API then automatically loads the appropriate dialect, which is maintained through integration with `SQLGlot`.

It is also possible to define your own database connection and pass it to PyDough, but this is advanced usage that will be shown in a follow-up demo.
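
For readers unfamiliar with the Python DB API 2.0 (PEP 249), the sketch below shows what such a connection looks like using the standard library's `sqlite3` module. It does not use PyDough at all: it runs the SQL text printed by `pydough.to_sql(nation_keys)` earlier against a small in-memory table whose three rows are illustrative sample data, not the real `tpch.db`.

```python
import sqlite3

# In-memory database with a tiny stand-in NATION table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NATION (n_nationkey INTEGER PRIMARY KEY, n_name TEXT)")
conn.executemany(
    "INSERT INTO NATION (n_nationkey, n_name) VALUES (?, ?)",
    [(0, "ALGERIA"), (1, "ARGENTINA"), (2, "BRAZIL")],
)

# The query text that pydough.to_sql(nation_keys) printed earlier in this notebook.
query = "SELECT n_nationkey AS key, n_name AS name FROM main.NATION"

# Standard DB API 2.0 flow: cursor, execute, read column names, fetch rows.
cursor = conn.cursor()
cursor.execute(query)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

print(columns)
print(rows)
```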

In [9]:
pydough.active_session.connect_database("sqlite", database="../tpch.db");

In [10]:
%%pydough

pydough.to_df(nation_keys)

Unnamed: 0,key,name
0,0,ALGERIA
1,1,ARGENTINA
2,2,BRAZIL
3,3,CANADA
4,4,EGYPT
5,5,ETHIOPIA
6,6,FRANCE
7,7,GERMANY
8,8,INDIA
9,9,INDONESIA


Similarly, we can compute our more complex expression with the two nations containing the fewest customers.

In [11]:
%%pydough

pydough.to_df(lowest_customer_nations)



Unnamed: 0,key,name,cust_count
0,20,SAUDI ARABIA,5904
1,7,GERMANY,5908


Finally, while building a statement from smaller components is best practice in PyDough, you can always evaluate an entire expression at once within a PyDough cell, such as this example, which loads all Asian nations in the dataset.

In [12]:
%%pydough

pydough.to_df(nations.WHERE(region.name == "ASIA"))

Unnamed: 0,key,name,region_key,comment
0,8,INDIA,2,ss excuses cajole slyly across the packages. d...
1,9,INDONESIA,2,slyly express asymptotes. regular deposits ha...
2,12,JAPAN,2,"ously. final, express gifts cajole a"
3,18,CHINA,2,c dependencies. furiously express notornis sle...
4,21,VIETNAM,2,"hely enticingly express accounts. even, final"
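
Conceptually, filtering nations on `region.name` corresponds to a join between the nation and region tables followed by a filter. The sketch below shows that idea with `sqlite3` on a tiny in-memory dataset. The table contents are a hand-picked sample, and the hand-written SQL is an assumption about the general shape of such a query; the SQL that PyDough actually generates may look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE REGION (r_regionkey INTEGER PRIMARY KEY, r_name TEXT);
    CREATE TABLE NATION (
        n_nationkey INTEGER PRIMARY KEY,
        n_name TEXT,
        n_regionkey INTEGER REFERENCES REGION (r_regionkey)
    );
    INSERT INTO REGION VALUES (2, 'ASIA'), (3, 'EUROPE');
    INSERT INTO NATION VALUES (7, 'GERMANY', 3), (8, 'INDIA', 2), (12, 'JAPAN', 2);
    """
)

# Join each nation to its region, then keep only nations whose region is ASIA.
asian_nations = conn.execute(
    """
    SELECT n.n_nationkey, n.n_name
    FROM NATION AS n
    JOIN REGION AS r ON n.n_regionkey = r.r_regionkey
    WHERE r.r_name = 'ASIA'
    ORDER BY n.n_nationkey
    """
).fetchall()
conn.close()

print(asian_nations)  # [(8, 'INDIA'), (12, 'JAPAN')]
```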
