# Introduction

Welcome to the PyDough demos. PyDough is a new Python-compatible DSL that leverages a rich document model to simplify analytics queries. PyDough's goal is to enable "WHAT-IF" style questions by making it easier to reuse components of previous questions, and to improve iteration speed by making queries both faster to construct and easier to debug.

This notebook explains the structure of a PyDough notebook and the fundamental operations. At the end we link to follow-up notebooks that explain how to use PyDough in more detail.

In [None]:
import pydough

## Metadata Setup

PyDough simplifies the query writing experience by leveraging metadata that describes the relevant tables and their relationships. For these demos we provide all of the metadata for you, but future demos may walk through the metadata creation process. A follow-up notebook takes a deeper look into the inner workings of the metadata.

To set up the metadata, we use the PyDough `active_session`. This is a simple class that encapsulates configuration information. To attach our metadata graph, we can use:

In [None]:
pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH");

The graph we loaded is a metadata representation of the standard TPC-H schema, with some names modified to be more human readable. In all future cells we will be able to reference the entities from this metadata.

We won't dive into the details of the relationships in this notebook, but note that PyDough is able to reference the tables involved and move between them based on pre-defined relationships.
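
To make the idea of moving between tables along pre-defined relationships more concrete, here is a minimal sketch in plain Python. The dictionary layout and the `follow` helper are hypothetical and are not the actual format of `tpch_demo_graph.json`; they only illustrate how named relationships let a query hop from one collection to another.

```python
# Hypothetical, simplified stand-in for a metadata graph. This is NOT the
# real PyDough JSON schema; it only illustrates the concept.
toy_graph = {
    "nations": {
        "properties": ["key", "name"],
        "relationships": {"customers": "customers", "region": "regions"},
    },
    "customers": {
        "properties": ["key", "name"],
        "relationships": {"nation": "nations"},
    },
    "regions": {
        "properties": ["key", "name"],
        "relationships": {"nations": "nations"},
    },
}


def follow(graph, start, path):
    """Return the collection reached from `start` by following `path`,
    a dotted string of relationship names such as "nation.region"."""
    current = start
    for step in path.split("."):
        relationships = graph[current]["relationships"]
        if step not in relationships:
            raise KeyError(f"{current!r} has no relationship named {step!r}")
        current = relationships[step]
    return current


print(follow(toy_graph, "nations", "customers"))      # customers
print(follow(toy_graph, "customers", "nation.region"))  # regions
```

In this toy model, a path such as `nations.customers` is valid only because the relationship was declared up front, which mirrors the role the metadata plays for PyDough.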

## Jupyter Extension

To actually execute PyDough, we have written a Jupyter extension that allows defining cells containing PyDough expressions that are not valid Python. For example, now that the metadata is loaded, you can refer directly to `nations` in such a cell, even though `nations` is not a defined name in regular Python.

This is done with the Jupyter cell magic `%%pydough`. The magic first attempts to resolve any variables with the regular Python environment and then leverages the metadata. In the future we intend for this to be a cell drop-down option that automatically adds the magic to the relevant cells, but for now we feel this accurately represents the "feel" of working in PyDough.

When you first set up the notebook you need to load the extension:
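
The resolution order described here can be pictured with a small sketch. The `resolve_name` function and its inputs are hypothetical and are not part of the PyDough extension; a real implementation may, for instance, defer the metadata lookup until the expression is executed.

```python
def resolve_name(name, python_env, graph_collections):
    """Toy resolution order: consult the regular Python environment first,
    then fall back to the names known to the metadata graph."""
    if name in python_env:
        return ("python", python_env[name])
    if name in graph_collections:
        return ("metadata", name)
    raise NameError(f"{name!r} is neither a Python variable nor a known collection")


# Hypothetical inputs: one Python variable and two collection names.
python_env = {"threshold": 10}
graph_collections = {"nations", "customers"}

print(resolve_name("threshold", python_env, graph_collections))  # ('python', 10)
print(resolve_name("nations", python_env, graph_collections))    # ('metadata', 'nations')
```

One consequence of this ordering, at least in the sketch, is that a Python variable that happens to share a name with a collection takes precedence over the collection.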

```
%load_ext pydough.jupyter_extensions
```

*Note*: If running outside our demo setup you may need to manually install the extension, which is also found in this repository.

In [None]:
%load_ext pydough.jupyter_extensions

Now we can define our access to nations, where we may only want to load the key and the name.

In [None]:
%%pydough

nations.CALCULATE(nkey=key, nname=name)

As you can see, this step resolved without any Python errors. However, this isn't actually useful because we can't use this result anywhere. Instead, we can assign this result to a Python variable to make it accessible in future cells, whether they are Python cells or more PyDough cells.

In [None]:
%%pydough

nation_keys = nations.CALCULATE(nkey=key, nname=name)

It's important to note that so far we haven't actually **executed** anything. Although we have determined that `nations` and `key` may be valid names in the graph, this is an example of what we call a **contextless expression**, meaning that its actual resolution may differ depending on how this variable is inserted into an actual expression that we execute.

We won't showcase this behavior in this demo, but it's important to note that the name binding for `nations` does not require it to refer only to a top-level collection/table.

We can then reuse this name if we decide we want to select only the 2 nations with the fewest customers. This step consists of a few parts, but in essence what we are doing is saying:
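
As a rough mental model of a contextless expression, consider the sketch below. The `Deferred` class is a hypothetical toy and does not reflect PyDough's internal classes; it only shows how an object can record the operations applied to a name without looking anything up or running a query.

```python
class Deferred:
    """Toy stand-in for a contextless expression: it only records the
    operations applied to a name and never touches any data."""

    def __init__(self, description):
        self.description = description

    def CALCULATE(self, **terms):
        # Record the requested terms as text; nothing is resolved here.
        rendered = ", ".join(f"{alias}={source}" for alias, source in terms.items())
        return Deferred(f"{self.description}.CALCULATE({rendered})")

    def __repr__(self):
        return f"<deferred {self.description}>"


# Building the expression performs no lookup and no query execution.
toy_nation_keys = Deferred("nations").CALCULATE(nkey="key", nname="name")
print(toy_nation_keys)  # <deferred nations.CALCULATE(nkey=key, nname=name)>
```

Because the toy object is just a description, what `nations` ultimately refers to can be decided later, when the description is placed into a larger expression and evaluated.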
1. Select our nations as defined above.
2. Use the `TOP_K` operation, which gives the first 2 elements as ordered by the `by` argument.
3. Define the sort key to be the customer count. This says: for each nation, count the number of customers connected to it, which works because `nations.customers` is a defined path in our metadata. Calling the `COUNT` operation reduces this result to 1 value per entry, which can then be used for sorting.
4. Indicate that we want to sort in ascending order.
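
In plain Python terms, the steps above amount to roughly the following. The data is invented for illustration and is not taken from the TPC-H dataset, and this is only an analogy for the logic, not how PyDough evaluates the expression.

```python
# Toy data, invented for illustration; not taken from the TPC-H dataset.
toy_nations = [
    {"nkey": 1, "nname": "A", "customers": ["c1", "c2", "c3"]},
    {"nkey": 2, "nname": "B", "customers": ["c4"]},
    {"nkey": 3, "nname": "C", "customers": ["c5", "c6"]},
]

# Step 3: reduce each nation's customers to a single count.
with_counts = [
    {"nkey": n["nkey"], "nname": n["nname"], "cust_count": len(n["customers"])}
    for n in toy_nations
]

# Steps 2 and 4: sort ascending by the count and keep the first 2 entries.
lowest_two = sorted(with_counts, key=lambda row: row["cust_count"])[:2]

print(lowest_two)
# [{'nkey': 2, 'nname': 'B', 'cust_count': 1}, {'nkey': 3, 'nname': 'C', 'cust_count': 2}]
```

The PyDough version expresses the same logic declaratively: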

In [None]:
%%pydough

lowest_customer_nations = nation_keys.CALCULATE(nkey, nname, cust_count=COUNT(customers)).TOP_K(2, by=cust_count.ASC())
lowest_customer_nations

## Evaluating PyDough

Right now there are two primary ways to evaluate PyDough expressions.
1. Convert PyDough expressions into a SQL query using `pydough.to_sql()`
2. Execute PyDough expressions on a SQL database using `pydough.to_df()`

Setting up either option requires making a change to the underlying active session. By default, PyDough allows generating ANSI SQL (but not executing it on a database).

These APIs can also be called from regular Python, but only when applied directly to a Python variable that has already been resolved from PyDough. For more complex expressions, please use PyDough cells.

In [None]:
%%pydough

pydough.to_sql(nation_keys)

To enable execution on a database, we need to call `pydough.active_session.connect_database()` to indicate where we want to target our SQL. This API is intended to work with any DB API 2.0 compatible database connection, although right now it is only set up to work with SQLite. This step is done by providing the database name and the required `connect(...)` arguments. The API then automatically loads the appropriate dialect, which is maintained through integration with `SQLGlot`.
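
For readers unfamiliar with this connection style, the sketch below uses Python's built-in `sqlite3` module, which follows the DB API 2.0 specification (PEP 249). The in-memory database and the toy table are assumptions made to keep the example self-contained; the sketch does not show what PyDough does internally.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; the demo itself
# points at a TPC-H database file instead.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Toy table, invented for illustration.
cursor.execute("CREATE TABLE nation (n_nationkey INTEGER, n_name TEXT)")
cursor.executemany(
    "INSERT INTO nation VALUES (?, ?)",
    [(0, "ALGERIA"), (1, "ARGENTINA")],
)

# Standard DB API 2.0 flow: execute a statement, then fetch the rows.
cursor.execute("SELECT n_nationkey, n_name FROM nation ORDER BY n_nationkey")
rows = cursor.fetchall()
print(rows)  # [(0, 'ALGERIA'), (1, 'ARGENTINA')]
conn.close()
```

The arguments given to `connect_database` after the database name play the role of the arguments passed to `sqlite3.connect` here.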

In [None]:
pydough.active_session.connect_database("sqlite", database="../../tpch.db");

In [None]:
%%pydough

pydough.to_df(nation_keys)

Similarly, we can compute our more complex expression with the two nations containing the fewest customers.

In [None]:
%%pydough

pydough.to_df(lowest_customer_nations)

Finally, while building a statement from smaller components is best practice in PyDough, you can always evaluate the entire expression all at once within a PyDough cell, such as this example that loads all of the Asian nations in the dataset.

We can use the optional `columns` argument to `to_sql` or `to_df` to specify which columns to include, or even what they should be renamed as.
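
The two forms of the `columns` argument can be pictured with the sketch below. The `select_columns` helper and the toy row are hypothetical and are not part of PyDough; the dictionary direction (output name mapped to original name) is inferred from the example call in the next cell.

```python
def select_columns(rows, columns):
    """Keep only the requested columns. A list keeps the original names;
    a dict maps each output name to the original name it comes from."""
    if isinstance(columns, dict):
        mapping = dict(columns)
    else:
        mapping = {name: name for name in columns}
    return [{out: row[src] for out, src in mapping.items()} for row in rows]


# Toy row, invented for illustration.
toy_rows = [{"key": 8, "name": "INDIA", "comment": "toy comment"}]

print(select_columns(toy_rows, ["name", "key"]))
# [{'name': 'INDIA', 'key': 8}]
print(select_columns(toy_rows, {"nation_name": "name", "id": "key"}))
# [{'nation_name': 'INDIA', 'id': 8}]
```

In both forms, the order of the entries determines the order of the output columns in this sketch.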

In [None]:
%%pydough

asian_countries = nations.WHERE(region.name == "ASIA")
print(pydough.to_df(asian_countries, columns=["name", "key"]))
pydough.to_df(asian_countries, columns={"nation_name": "name", "id": "key"})

# Additional Notebooks

This notebook is intended as a simple introduction to PyDough and how it runs in a Jupyter notebook. To get a better understanding of PyDough's most impactful features, we have the following additional notebooks:
* [2_pydough_operations](./2_pydough_operations.ipynb): Provides a detailed overview of many of the core operations you can currently perform in PyDough, including some best practices and possible limitations.
* [3_exploration](./3_exploration.ipynb): Explores our provided TPC-H metadata to help highlight some of the key metadata features and to teach users how to interact with the metadata.
* [4_tpch](./4_tpch.ipynb): Compares the SQL queries used in the TPC-H benchmarks to equivalent statements written in PyDough.
* [5_what_if](./5_what_if.ipynb): Shows how to do WHAT-IF analysis with PyDough.
