# Overview

This notebook provides a basic example of the process of working in PyDough to answer an analytics question. It highlights how PyDough can be leveraged to guide the development and experimentation process, with an emphasis on solving partial sub-problems.

We believe that this approach is more composable and scalable. Rather than focusing on building a single query to answer the question at hand, building question components lets the effort scale more proportionally with the complexity of the problem. The PyDough team is optimistic that such an approach can make humans more productive and LLMs more accurate at solving the problem at hand.

The majority of this notebook will be focused on building up a more complex question piece by piece. In some places we will also deviate from this question to emphasize the powerful PyDough features that enable this type of experimentation.

In [1]:
%load_ext pydough_jupyter_extensions

In [2]:
import pandas as pd
import pydough

In [3]:
# Setup demo metadata
pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH");
pydough.active_session.connect_database("sqlite", database="../tpch.db");
# Avoid scientific notation
pd.options.display.float_format = '{:.6f}'.format

## Schema Context

For this demo we will be working in an example crafted to match the schema for the TPC-H benchmark. The actual data for this benchmark is generated in SQLite using the standard TPC-H data generation tools. The underlying schema of this data matches this example image from [TPC Benchmark H Standard Specification](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf).

![TPC-H schema from the Specification Document as of December 12, 2024](../images/tpc_h_schema.png)

## Problem Statement

We are going to solve a problem where we seek to **identify the 5 lines that represent the smallest percentage of an order's revenue**.

We will build an answer to this question in PyDough and use this to explain how PyDough can allow easy iteration in the analytics process. For a point of comparison, here is a possible SQL query that could be used to answer this question.

```SQL
with total_order_revenue as (
    select
        o_orderkey as order_key,
        SUM(l_extendedprice * (1 - l_discount)) as order_total
    from orders
    inner join lineitem
    on l_orderkey = o_orderkey
    group by o_orderkey
)
select
    (l_extendedprice * (1 - l_discount)) / order_total as revenue_ratio,
    order_key,
    l_linenumber as line_number
from lineitem
inner join total_order_revenue
on l_orderkey = order_key
order by revenue_ratio ASC, order_key DESC, line_number DESC
LIMIT 5
```

The sample answer uses a CTE to produce one result per order that can then be recombined with the original lines. However, there are also some subtle issues that could arise. Namely:
* Should the joins between lines and orders be inner joins or left joins?
* If the joins were left joins, what should we do if a line wasn't associated with an order (and vice versa)?

In this particular example, **because we know the exact constraints of the TPC-H setup**, we can verify that the provided query is correct and we don't need to worry about left joins. However, in a production setting, such differences can result in subtle bugs that, due to SQL's holistic nature, can be very difficult to debug, even though this is a relatively simple query. It also means that if the constraints differ or change between two different data sources, the SQL query developed could ultimately lead to incorrect results.
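
To make the join question concrete, here is a minimal sketch using Python's built-in `sqlite3` module on a made-up dataset. The `toy_orders` and `toy_lines` tables and their rows are assumptions for illustration and are not part of the TPC-H database. It shows that an inner join silently drops an order that has no lines, while a left join keeps it with a NULL total.

```python
import sqlite3

# Toy data, invented for illustration only (not the TPC-H database).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_orders (order_key INTEGER PRIMARY KEY);
    CREATE TABLE toy_lines (order_key INTEGER, line_number INTEGER, price REAL);
    INSERT INTO toy_orders VALUES (1), (2);  -- order 2 has no lines
    INSERT INTO toy_lines VALUES (1, 1, 10.0), (1, 2, 30.0);
    """
)

# The same aggregation, parameterized only by the join type.
query = """
    SELECT o.order_key, SUM(l.price) AS order_total
    FROM toy_orders AS o
    {join} JOIN toy_lines AS l ON l.order_key = o.order_key
    GROUP BY o.order_key
    ORDER BY o.order_key
"""

inner_rows = conn.execute(query.format(join="INNER")).fetchall()
left_rows = conn.execute(query.format(join="LEFT")).fetchall()

print(inner_rows)  # order 2 is missing entirely
print(left_rows)   # order 2 is present, with None as its total
```

Neither result is wrong in itself; which one is correct depends on what the question means for orders without lines, which is exactly the kind of decision that is easy to overlook in a larger query.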

## Metadata

The primary mechanism through which PyDough simplifies life for the user is the metadata. This eliminates the need to worry about correctness questions like the `LEFT` vs `INNER` join choice above and gives us a mechanism to run the same code in different environments: you can have slightly different metadata but the same PyDough!

The PyDough metadata defines many things, most of which we won't explore in this demo, but two important details are the interactions between collections and the naming process. To demonstrate this, let's look at the immediate collections that are available in our graph.

In [4]:
%%pydough

print(pydough.explain(pydough.active_session.metadata))

PyDough graph: TPCH
Collections: customers, lines, nations, orders, parts, regions, suppliers, supply_records
Call pydough.explain(graph[collection_name]) to learn more about any of these collections.
Call pydough.explain_structure(graph) to see how all of the collections in the graph are connected.


Here we see the collections available in our TPCH graph. These are similar to the schema above, but there are two notable renames:
 * LineItems -> lines
 * PartSupp -> supply_records

Let's take a look at lines. We will use `verbose=True` to understand the mapping between this metadata and the underlying database.

In [5]:
%%pydough

print(pydough.explain(pydough.active_session.metadata["lines"], verbose=True))

PyDough collection: lines
Table path: main.LINEITEM
Unique properties of collection: [['order_key', 'line_number'], ['part_key', 'supplier_key', 'order_key']]
Scalar properties:
  comment
  commit_date
  discount
  extended_price
  line_number
  order_key
  part_key
  quantity
  receipt_date
  return_flag
  ship_date
  ship_instruct
  ship_mode
  status
  supplier_key
  tax
Subcollection properties:
  order
  part
  part_and_supplier
  supplier
Call pydough.explain(graph['lines'][property_name]) to learn more about any of these properties.


Here we see that `lines` maps to our `LINEITEM` table. It also contains additional information, such as which combinations of columns are unique and, importantly, which collections it's attached to. Additionally, notice that the names aren't required to match those in the TPC-H schema; for example, we remap `l_extendedprice` to `extended_price`. We can also define unique names for traversing collections based on different characteristics, though we will not do this in this demo.

This highlights that PyDough metadata can be used to capture the semantics of your organization without requiring total control of the data, allowing each part of an organization to use the most accurate naming. For example, if your organization consistently says `PNL` but for whatever reason the data owners refuse to use that name, you can effortlessly remap this business logic on top of the underlying data.

Now let's look at what happens when we consider going from lines to orders.

In [6]:
%%pydough

print(pydough.explain(pydough.active_session.metadata["lines"]["order"], verbose=True))

PyDough property: lines.order
This property connects collection lines to orders.
Cardinality of connection: Many -> One
Is reversible: yes
Reverse property: orders.lines
The subcollection relationship is defined by the following join conditions:
    lines.order_key == orders.key


This translation encapsulates the join logic, but also records that there are many lines for a single order. This allows PyDough to be smarter about answering the underlying question and allows developers to leverage these properties to generate simpler statements.

Right now this metadata always dictates generating a LEFT JOIN and automatically handles processing missing records for aggregation functions, but in the future this will become more extensible via both metadata and configuration.
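
As a rough illustration of what "processing missing records for aggregation functions" can involve, the sketch below uses `sqlite3` on invented `toy_orders` and `toy_lines` tables. It is an assumption about the general technique, not a reproduction of the SQL that PyDough generates. After a left join, an order with no lines is padded with NULLs, so `COUNT(*)` over-counts while `COUNT(column)` and `COALESCE(SUM(...), 0)` give the values most people expect.

```python
import sqlite3

# Toy tables invented for illustration; order 2 deliberately has no lines.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_orders (order_key INTEGER PRIMARY KEY);
    CREATE TABLE toy_lines (order_key INTEGER, price REAL);
    INSERT INTO toy_orders VALUES (1), (2);
    INSERT INTO toy_lines VALUES (1, 10.0), (1, 30.0);
    """
)

rows = conn.execute(
    """
    SELECT
        o.order_key,
        COUNT(*) AS naive_count,              -- also counts the NULL-padded row
        COUNT(l.order_key) AS line_count,     -- skips NULL-padded rows
        COALESCE(SUM(l.price), 0.0) AS total  -- replaces a NULL sum with 0
    FROM toy_orders AS o
    LEFT JOIN toy_lines AS l ON l.order_key = o.order_key
    GROUP BY o.order_key
    ORDER BY o.order_key
    """
).fetchall()

# Each row is (order_key, naive_count, line_count, total).
print(rows)
```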

## PyDough Solution

To build the PyDough solution, we are going to take a different approach than immediately trying to write the query directly. Rather than solely trying to write the shortest statement to answer our original question, we will approach the question as one that we need to build in parts, from the components to a solution.

The reason for this is twofold:
1. We believe this is a good representation for how PyDough can be leveraged to build answers to complex analytics questions.
2. We believe this reflects an investigative approach to analytics, where someone may know roughly at a high level what needs to be done, but not necessarily the "path" to get there.

To do this we will first need to define revenue. Here we will say that the revenue is the price information after removing any discounts.

In [7]:
%%pydough

price_def = extended_price*(1-discount)

This might seem shocking. We have defined `price_def` out of nowhere using an `extended_price` and `discount`. What has actually happened here is that we have generated what is called a `Contextless Expression`. This fundamental building block is the key to PyDough's composability.

On its own this expression doesn't mean anything. In fact, if we inspect this object in regular Python we will see that PyDough itself has a lot of questions.

In [8]:
price_def

(?.extended_price * (1 - ?.discount))

As you see, PyDough now knows that this expression is composed of `extended_price` and `discount`, but it doesn't know **WHICH** `extended_price` and `discount`. To ultimately develop a legal PyDough statement, we will need to bind uses of this expression to a context that can access `extended_price` and `discount`.

This might seem very minor, but this allows us to define definitions up front, allowing reuse in vastly different contexts.
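
To build some intuition for how an expression can exist before it has a context, here is a toy sketch in plain Python. The `ToyExpr` class and the `field` and `wrap` helpers are hypothetical names invented for this illustration and are not PyDough's implementation. The sketch only shows the general idea of recording an expression first and binding it to a record later.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class ToyExpr:
    """Hypothetical stand-in for an unbound expression (not a PyDough class)."""

    text: str  # human-readable form of the expression
    evaluate: Callable[[Mapping[str, float]], float]  # binds to a record later

    def __mul__(self, other) -> "ToyExpr":
        rhs = wrap(other)
        return ToyExpr(
            f"({self.text} * {rhs.text})",
            lambda ctx: self.evaluate(ctx) * rhs.evaluate(ctx),
        )

    def __rsub__(self, other) -> "ToyExpr":
        # Handles `number - expression`, such as `1 - discount`.
        lhs = wrap(other)
        return ToyExpr(
            f"({lhs.text} - {self.text})",
            lambda ctx: lhs.evaluate(ctx) - self.evaluate(ctx),
        )


def wrap(value) -> ToyExpr:
    """Turn a plain number into a constant expression; pass expressions through."""
    if isinstance(value, ToyExpr):
        return value
    return ToyExpr(str(value), lambda ctx: value)


def field(name: str) -> ToyExpr:
    """Reference a field by name without saying which record it comes from."""
    return ToyExpr(f"?.{name}", lambda ctx: ctx[name])


# Define the expression up front, with no record in sight.
price_def = field("extended_price") * (1 - field("discount"))
print(price_def.text)

# Bind it later: the same definition works on any record with these fields.
print(price_def.evaluate({"extended_price": 100.0, "discount": 0.25}))
```

The same `price_def` object could be evaluated against any mapping that supplies `extended_price` and `discount`, which is the reuse property described above.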

Now our schema is flexible, but this price definition is actually tied to the lines collection. Let's now incorporate lines.

In [9]:
%%pydough

line_price = lines(line_price=price_def).line_price

At this stage our expression is more complete (we know it is associated with lines), but it's actually still contextless. Let's extend this by now asking for something more concrete, the SUM.

In [10]:
%%pydough

total_price = SUM(line_price)

Now this expression is more meaningful. If we assign this statement to the global context, our actual TPCH graph, then we can compute the total price across all lines.

In [11]:
%%pydough

pydough.to_df(TPCH(total_line_price=total_price))

Unnamed: 0,total_line_price
0,218102223885.0001


In practice though, this may not solve our core question. Instead, we may want to apply a different **context**, say for example total_price for each order. We can instead represent that as follows.

In [12]:
%%pydough

order_total_price = orders(total_line_price=total_price)
pydough.to_df(order_total_price)

Unnamed: 0,total_line_price
0,167183.229600
1,44694.460000
2,190548.724800
3,29770.173000
4,137930.604600
...,...
1499995,111787.875600
1499996,68224.320000
1499997,88356.494400
1499998,62681.145600


Notice that we are able to reuse the exact same code, but by swapping the context we can ultimately modify the semantics. This makes testing the underlying behavior much more scalable. To ask whether this statement is correct, we can instead break our question down to ask:
* Is this underlying expression correct?
* Is this context correct?

Since these can be verified independently, we can develop greater confidence in our answer since it arises from composable building blocks. We could also generate selected contexts to build clear tests.

In [13]:
%%pydough

# Compute the total price of the first 5 lines (by line number, then order key), a small case whose value can be known for testing.
top_five_lines = lines.TOP_K(5, by=(line_number.ASC(), order_key.ASC()))
top_five_line_price = TPCH(total_price=SUM(top_five_lines(line_price=price_def).line_price))
pydough.to_df(top_five_line_price)

Unnamed: 0,total_price
0,168805.6798
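
The idea of checking the expression and the context separately can also be sketched in plain Python. The records and the helper names `price_def` and `first_k_lines` below are invented for illustration and do not use PyDough. The point is only that each piece has its own small test before the two are composed.

```python
# Toy line records, invented for illustration (not TPC-H data).
toy_lines = [
    {"order_key": 1, "line_number": 1, "extended_price": 100.0, "discount": 0.5},
    {"order_key": 1, "line_number": 2, "extended_price": 200.0, "discount": 0.0},
    {"order_key": 2, "line_number": 1, "extended_price": 40.0, "discount": 0.25},
]


def price_def(line):
    """The expression: price after discount for a single line."""
    return line["extended_price"] * (1 - line["discount"])


def first_k_lines(all_lines, k):
    """The context: the k lines with the smallest (line_number, order_key)."""
    ordered = sorted(all_lines, key=lambda l: (l["line_number"], l["order_key"]))
    return ordered[:k]


# Check 1: is the expression correct on a record we can compute by hand?
assert price_def(toy_lines[0]) == 50.0

# Check 2: does the context select the records we expect?
selected = first_k_lines(toy_lines, 2)
assert [(l["order_key"], l["line_number"]) for l in selected] == [(1, 1), (2, 1)]

# Composition: once both checks pass, the combined result is easier to trust.
total = sum(price_def(line) for line in selected)
print(total)
```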


Now let's return to extending our question. Being able to compute order sums is great, but we care about results per line. As a result, we can now extend our orders to an additional context within lines. We will once again add more definitions. Our ratio definition will now ask us to propagate the `total_price` that we computed previously and compare it to the result of `price_def`.

In [14]:
%%pydough

ratio = price_def / BACK(1).total_price

Now we will build our final question by moving from the context of the orders to the context of orders and lines together. Since Orders -> Lines is a One to Many relationship, this places us in the context of lines with some additional information.

To fully solve our original question, we will compute the ratio and then select the 5 smallest ratio values, breaking ties with a combination of the order key and line number.

In [15]:
%%pydough

line_ratios = order_total_price.lines(revenue_ratio=ratio, order_key=order_key, line_number=line_number)
lowest_ratios = line_ratios.TOP_K(5, by=(revenue_ratio.ASC(), order_key.DESC(), line_number.DESC()))

In [16]:
pydough.to_df(lowest_ratios)

Unnamed: 0,revenue_ratio,order_key,line_number
0,0.002341,363876,1
1,0.002359,4920774,1
2,0.002368,3274400,4
3,0.002399,2230976,2
4,0.0024,497094,4
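
For readers who want to see the same three steps without PyDough or SQL, here is a plain-Python sketch on invented rows (per-order total, per-line ratio, then the smallest ratios with ties broken by descending order key and line number). The data and the `revenue` helper are assumptions for illustration, and the sketch keeps only 2 rows rather than 5 because the toy dataset is tiny.

```python
from collections import defaultdict

# Toy (order_key, line_number, extended_price, discount) rows, invented for illustration.
rows = [
    (1, 1, 100.0, 0.0),
    (1, 2, 300.0, 0.0),
    (2, 1, 50.0, 0.0),
    (2, 2, 150.0, 0.0),
    (3, 1, 80.0, 0.5),
]


def revenue(price, discount):
    """Price after removing the discount."""
    return price * (1 - discount)


# Step 1: total revenue per order.
order_totals = defaultdict(float)
for order_key, _, price, discount in rows:
    order_totals[order_key] += revenue(price, discount)

# Step 2: each line's share of its order's revenue.
ratios = [
    (revenue(price, discount) / order_totals[order_key], order_key, line_number)
    for order_key, line_number, price, discount in rows
]

# Step 3: smallest ratios first; ties broken by order_key DESC, then line_number DESC.
lowest = sorted(ratios, key=lambda r: (r[0], -r[1], -r[2]))[:2]
print(lowest)
```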


Now we have arrived at a solution to our underlying question. We can save this result to a Python variable, as the output is already a Pandas DataFrame. Alternatively, we could expand on our question to collect more information, such as filtering by a minimum number of entries. For completeness we will show that example, where we only **consider lines that are part of orders with more than 3 lines**.

In [17]:
%%pydough

total_lines = COUNT(lines)

In [18]:
%%pydough

order_total_price = orders(total_line_price=total_price, line_count=total_lines)

In [19]:
%%pydough

line_ratios = order_total_price.lines(revenue_ratio=ratio, line_count=BACK(1).line_count, order_key=order_key, line_number=line_number)
filtered_ratios = line_ratios.WHERE(line_count > 3)(revenue_ratio, order_key, line_number)

lowest_ratios = filtered_ratios.TOP_K(5, by=(revenue_ratio.ASC(), order_key.DESC(), line_number.DESC()))
highest_ratios = filtered_ratios.TOP_K(5, by=(revenue_ratio.DESC(), order_key.DESC(), line_number.DESC()))

In [20]:
%%pydough

pydough.to_df(lowest_ratios)

Unnamed: 0,revenue_ratio,order_key,line_number
0,0.002341,363876,1
1,0.002359,4920774,1
2,0.002368,3274400,4
3,0.002399,2230976,2
4,0.0024,497094,4


In [21]:
%%pydough

pydough.to_df(highest_ratios)

Unnamed: 0,revenue_ratio,order_key,line_number
0,0.911963,5165158,2
1,0.906611,5625127,3
2,0.903499,5677511,4
3,0.902529,4483520,1
4,0.898926,1512614,2


The lowest ratios don't seem to change, most likely because they already come from orders with multiple lines. However, now we can meaningfully ask about the highest ratios without being dominated by single-line orders.
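
The effect of single-line orders is easy to reproduce on a small invented example. In the plain-Python sketch below, the rows and the `highest_ratio` helper are assumptions for illustration. A line that is alone in its order always has a ratio of 1.0, so it wins the "highest ratio" ranking until orders are filtered by line count.

```python
from collections import Counter, defaultdict

# Toy (order_key, line_number, revenue) rows: order 1 has one line, order 2 has four.
rows = [(1, 1, 10.0), (2, 1, 10.0), (2, 2, 20.0), (2, 3, 30.0), (2, 4, 20.0)]

totals = defaultdict(float)
counts = Counter()
for order_key, _, rev in rows:
    totals[order_key] += rev
    counts[order_key] += 1


def highest_ratio(more_than):
    """Largest (ratio, order_key, line_number) among orders with more than `more_than` lines."""
    candidates = [
        (rev / totals[order_key], order_key, line_number)
        for order_key, line_number, rev in rows
        if counts[order_key] > more_than
    ]
    return max(candidates)


print(highest_ratio(0))  # no filtering: the single-line order dominates
print(highest_ratio(3))  # only orders with more than 3 lines are considered
```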