# Introduction

This notebook demonstrates the PyDough APIs for explaining and exploring PyDough metadata and PyDough logical operations.

## Setup

This notebook uses our TPC-H schema metadata and SQLite database connection for all examples.

In [1]:
import pydough

pydough.active_session.load_metadata_graph("../metadata/tpch_demo_graph.json", "TPCH")
pydough.active_session.connect_database("sqlite", database="../tpch.db")
%load_ext pydough_jupyter_extensions

## Using `pydough.explain` on metadata

The API `pydough.explain` can be used to dump information about various PyDough metadata objects or logical operations. The simplest version is to call it on the entire graph, which displays basic information about the graph such as its name and the collections inside it. The `pydough.explain` API takes an optional `verbose` argument. If it is True, the output includes more detail about the object being explained; if it is False, the output is a more compact summary.

`pydough.explain` can be called from either inside a normal Python cell or inside a pydough cell.

Below is an example of using this API on the TPCH graph that has been loaded into the active session.

In [2]:
graph = pydough.active_session.metadata

print(pydough.explain(graph, verbose=True))

PyDough graph: TPCH
Collections:
  customers
  lines
  nations
  orders
  parts
  regions
  suppliers
  supply_records
Call pydough.explain(graph[collection_name]) to learn more about any of these collections.
Call pydough.explain_structure(graph) to see how all of the collections in the graph are connected.


Just like the hint suggests, below is an example of calling `pydough.explain` to learn more about the `regions` collection of TPCH:

In [3]:
print(pydough.explain(graph["regions"], verbose=True))

PyDough collection: regions
Table path: main.REGION
Unique properties of collection: ['key']
Scalar properties:
  comment
  key
  name
Subcollection properties:
  nations
Call pydough.explain(graph['regions'][property_name]) to learn more about any of these properties.


The output tells us several things about this collection, including:

* The real data table that `regions` maps to in the database is called `main.REGION`.
* Every record in the `regions` collection has a `comment`, a `key`, and a `name`, which are scalar properties of the region.
* Each unique `key` value corresponds to a single unique record in `regions`.
* There is a subcollection of `regions` called `nations`.

Below is another example, this time on the collection `supply_records`:

In [4]:
print(pydough.explain(graph["supply_records"], verbose=True))

PyDough collection: supply_records
Table path: main.PARTSUPP
Unique properties of collection: [['part_key', 'supplier_key']]
Scalar properties:
  availqty
  comment
  part_key
  supplier_key
  supplycost
Subcollection properties:
  lines
  part
  supplier
Call pydough.explain(graph['supply_records'][property_name]) to learn more about any of these properties.


This time, the information displayed tells us the following:

* The real data table that `supply_records` maps to in the database is called `main.PARTSUPP`.
* Every record in the `supply_records` collection has an `availqty`, a `comment`, a `part_key`, a `supplier_key`, and a `supplycost`.
* Each unique combination of the values of `part_key` and `supplier_key` corresponds to a single unique record in `supply_records`.
* There are subcollections of `supply_records` called `lines`, `part`, and `supplier`.
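To make the composite uniqueness rule concrete, here is a minimal plain-Python sketch that does not use PyDough at all. The rows and the `is_unique` helper are invented for illustration; only the column names come from the output above.

```python
from collections import Counter

# Hypothetical sample rows shaped like supply_records; the values are invented.
supply_records = [
    {"part_key": 1, "supplier_key": 10, "availqty": 500},
    {"part_key": 1, "supplier_key": 20, "availqty": 120},
    {"part_key": 2, "supplier_key": 10, "availqty": 75},
]


def is_unique(rows, columns):
    """Return True if every combination of `columns` appears in at most one row."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return all(n == 1 for n in counts.values())


# Neither column identifies a record on its own, but the pair does.
print(is_unique(supply_records, ["part_key"]))                  # False
print(is_unique(supply_records, ["supplier_key"]))              # False
print(is_unique(supply_records, ["part_key", "supplier_key"]))  # True
```

This is why the unique properties are reported as the nested list `[['part_key', 'supplier_key']]`, whereas `regions` reports the single property `['key']`.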

Just like the hint suggests, below is an example of calling `pydough.explain` to learn more about one of the scalar properties, in this case: the `name` property of `regions`.

In [5]:
print(pydough.explain(graph["regions"]["name"], verbose=True))

PyDough property: regions.name
Column name: main.REGION.r_name
Data type: string


The information displayed tells us that the property `regions.name` corresponds to the `r_name` column of the table `main.REGION`, which has a type of `string`. 

Below is an example of using `pydough.explain` to learn about a subcollection property, in this case: the `nations` property of `regions`.

In [6]:
print(pydough.explain(graph["regions"]["nations"], verbose=True))

PyDough property: regions.nations
This property connects collection regions to nations.
Cardinality of connection: One -> Many
Is reversible: yes
Reverse property: nations.region
The subcollection relationship is defined by the following join conditions:
    regions.key == nations.region_key


The information displayed explains how `regions.nations` connects the `regions` collection to the `nations` collection, including:

* Each record of `regions` can connect to multiple records in `nations`
* Each record of `nations` can connect to at most one record of `regions`
* A record of `regions` is connected to a record of `nations` if the `key` property of `regions` equals the `region_key` property of `nations`.
* The `nations` collection has a corresponding twin property to `regions.nations` called `nations.region`, displayed below:

In [7]:
print(pydough.explain(graph["nations"]["region"], verbose=True))

PyDough property: nations.region
This property connects collection nations to regions.
Cardinality of connection: Many -> One
Is reversible: yes
Reverse property: regions.nations
The subcollection relationship is defined by the following join conditions:
    nations.region_key == regions.key


The information displayed is essentially the same as `regions.nations`, but from the perspective of each record of `nations` instead of each record of `regions`.
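As a rough illustration of how one join condition yields both a `One -> Many` property and its `Many -> One` reverse, here is a plain-Python sketch. It does not call PyDough; the rows are a small illustrative subset, and the variable names `nations_by_region` and `region_of_nation` are made up for this example.

```python
from collections import defaultdict

# Illustrative rows only; a small subset shaped like the two collections.
regions = [{"key": 2, "name": "ASIA"}, {"key": 3, "name": "EUROPE"}]
nations = [
    {"key": 8, "name": "INDIA", "region_key": 2},
    {"key": 18, "name": "CHINA", "region_key": 2},
    {"key": 6, "name": "FRANCE", "region_key": 3},
]

# regions.nations (One -> Many): each region collects every nation whose
# region_key equals the region's key.
nations_by_region = defaultdict(list)
for nation in nations:
    nations_by_region[nation["region_key"]].append(nation["name"])

# nations.region (Many -> One): the same condition read from the other side,
# so each nation resolves to at most one region.
region_name_by_key = {region["key"]: region["name"] for region in regions}
region_of_nation = {n["name"]: region_name_by_key[n["region_key"]] for n in nations}

print(dict(nations_by_region))
print(region_of_nation)
```

Both lookups are built from `regions.key == nations.region_key`; only the direction of traversal differs, which is what "reversible" means in the output above.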

## Using `pydough.explain_structure`

The API `pydough.explain_structure` can be called on an entire metadata graph, just like `pydough.explain`. Instead of providing a summary of a single graph, collection, or property, `pydough.explain_structure` gives a holistic summary of how everything in the graph connects. It displays all of the collections in the graph and the names of their properties. For the properties that are subcollections, it also gives a brief summary of which collection they map to and the cardinality of the connection.

Below is an example of using this API on the TPCH graph to gain a summary of the entire graph.

In [8]:
print(pydough.explain_structure(graph))

Structure of PyDough graph: TPCH

  customers
  ├── acctbal
  ├── address
  ├── comment
  ├── key
  ├── mktsegment
  ├── name
  ├── nation_key
  ├── phone
  ├── nation [one member of nations] (reverse of nations.customers)
  └── orders [multiple orders] (reverse of orders.customer)

  lines
  ├── comment
  ├── commit_date
  ├── discount
  ├── extended_price
  ├── line_number
  ├── order_key
  ├── part_key
  ├── quantity
  ├── receipt_date
  ├── return_flag
  ├── ship_date
  ├── ship_instruct
  ├── ship_mode
  ├── status
  ├── supplier_key
  ├── tax
  ├── order [one member of orders] (reverse of orders.lines)
  ├── part [one member of parts] (reverse of parts.lines)
  ├── part_and_supplier [one member of supply_records] (reverse of supply_records.lines)
  └── supplier [one member of suppliers] (reverse of suppliers.lines)

  nations
  ├── comment
  ├── key
  ├── name
  ├── region_key
  ├── customers [multiple customers] (reverse of customers.nation)
  ├── region [one member of regions] 

Focusing on the `regions` section of this information shows us some of the information observed earlier with `pydough.explain`:

* The scalar properties of `regions` are displayed (`comment`, `key` and `name`)
* The `nations` property is shown, and indicates that it connects each record of `regions` to potentially multiple records of `nations`, and that the reverse is `nations.region`.

## Using `pydough.explain` and `pydough.explain_term` on PyDough code

The API `pydough.explain` can also be called on PyDough code to display information about what it logically does. There is a key constraint: `pydough.explain` can only be called on PyDough code that resolves into a collection (e.g. `regions` or `nations.suppliers.WHERE(account_balance > 0)`).

To explain other PyDough code, such as an expression, `pydough.explain_term` must be used (more on that later).

Below is an example of using `pydough.explain` on simple PyDough code to learn about the `nations` collection:

In [9]:
%%pydough

print(pydough.explain(nations, verbose=True))

PyDough collection representing the following logic:
  ──┬─ TPCH
    └─── TableCollection[nations]

This node, specifically, accesses the collection nations.
Call pydough.explain(graph['nations']) to learn more about this collection.

The following terms will be included in the result if this collection is executed:
  comment, key, name, region_key

It is possible to use BACK to go up to 1 level above this collection.

The collection has access to the following expressions:
  comment, key, name, region_key

The collection has access to the following collections:
  customers, region, suppliers

Call pydough.explain_term(collection, term) to learn more about any of these
expressions or collections that the collection has access to.


There are several pieces of information displayed here, including:

* The PyDough code `nations` just accesses the data from the `nations` collection, which we can learn more about as shown earlier by calling `pydough.explain` on the metadata for `nations`.
* If `pydough.to_sql` or `pydough.to_df` is called on `nations`, all four of its scalar properties will be included in the result.
* The properties `comment`, `key`, `name`, and `region_key` can be accessed by the collection as scalar expressions.
* The properties `customers`, `region` and `suppliers` can be accessed by the collection as subcollections (this output alone does not say whether each one is singular or plural).

Below is an example of how to use `pydough.explain_term` to learn more about the `name` expression of `nations`:

In [10]:
%%pydough

print(pydough.explain_term(nations, name, verbose=True))

Collection:
  ──┬─ TPCH
    └─── TableCollection[nations]

The term is the following expression: name

This is column 'name' of collection 'nations'

This term is singular with regards to the collection, meaning it can be placed in a CALC of a collection.
For example, the following is valid:
  TPCH.nations(name)


And below is an example of how to use `pydough.explain_term` to learn more about the `suppliers` collection of `nations`:

In [11]:
%%pydough

print(pydough.explain_term(nations, suppliers, verbose=True))

Collection:
  ──┬─ TPCH
    └─── TableCollection[nations]

The term is the following child of the collection:
  └─┬─ AccessChild
    └─── SubCollection[suppliers]

This child is plural with regards to the collection, meaning its scalar terms can only be accessed by the collection if they are aggregated.
For example, the following are valid:
  TPCH.nations(COUNT(suppliers.account_balance))
  TPCH.nations.WHERE(HAS(suppliers))
  TPCH.nations.ORDER_BY(COUNT(suppliers).DESC())

To learn more about this child, you can try calling pydough.explain on the following:
  TPCH.nations.suppliers


`pydough.explain` and `pydough.explain_term` are not limited to simple collections and columns. Below is a slightly more complex example that will be dissected in several steps by calling `pydough.explain` and `pydough.explain_term` on various snippets of it. The code in question calculates the top 3 Asian countries by the number of orders made by customers in those nations in the year 1995.

In [12]:
%%pydough

asian_countries = nations.WHERE(region.name == "ASIA")

orders_1995 = customers.orders.WHERE(YEAR(order_date) == 1995)

asian_countries_info = asian_countries(country_name=LOWER(name), total_orders=COUNT(orders_1995))

top_asian_countries = asian_countries_info.TOP_K(3, by=total_orders.DESC())

pydough.to_df(top_asian_countries)

Unnamed: 0,country_name,total_orders
0,indonesia,9299
1,china,9185
2,india,9121
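For readers who want to see the same logic without PyDough, here is a minimal plain-Python sketch of the four steps. All of the data is invented toy data (so the counts differ from the real result above), and the dictionary layout is an assumption made for this example, not how PyDough stores or executes anything.

```python
from collections import Counter
from datetime import date

# Invented toy data; only the shape loosely mirrors the TPCH collections.
nations = [
    {"key": 1, "name": "INDIA", "region": "ASIA"},
    {"key": 2, "name": "CHINA", "region": "ASIA"},
    {"key": 3, "name": "JAPAN", "region": "ASIA"},
    {"key": 4, "name": "VIETNAM", "region": "ASIA"},
    {"key": 5, "name": "FRANCE", "region": "EUROPE"},
]
nation_of_customer = {100: 1, 101: 1, 102: 2, 103: 3, 104: 4, 105: 5}
orders = [
    (100, date(1995, 1, 5)), (100, date(1995, 6, 1)), (101, date(1995, 3, 9)),
    (101, date(1996, 3, 9)), (102, date(1995, 2, 2)), (102, date(1995, 8, 8)),
    (103, date(1995, 7, 7)), (104, date(1994, 7, 7)), (105, date(1995, 4, 4)),
]

# Step 1: asian_countries = nations.WHERE(region.name == "ASIA")
asian_countries = [n for n in nations if n["region"] == "ASIA"]

# Step 2: orders_1995, counted per nation by going through each order's customer
orders_1995_per_nation = Counter(
    nation_of_customer[customer_key]
    for customer_key, order_date in orders
    if order_date.year == 1995
)

# Step 3: one output row per nation with a lowercased name and an order count
asian_countries_info = [
    {"country_name": n["name"].lower(), "total_orders": orders_1995_per_nation[n["key"]]}
    for n in asian_countries
]

# Step 4: keep the 3 rows with the largest total_orders
top_asian_countries = sorted(
    asian_countries_info, key=lambda row: row["total_orders"], reverse=True
)[:3]

for row in top_asian_countries:
    print(row)
```

Note that the count is computed once per nation, which is the same "one value per record of the collection" idea that the explanations below describe as a singular term.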


First, here are the results of calling `pydough.explain` on the final result:

In [13]:
%%pydough

print(pydough.explain(top_asian_countries, verbose=True))

PyDough collection representing the following logic:
  ──┬─ TPCH
    ├─── TableCollection[nations]
    ├─┬─ Where[$1.name == 'ASIA']
    │ └─┬─ AccessChild
    │   └─── SubCollection[region]
    ├─┬─ Calc[country_name=LOWER(name), total_orders=COUNT($1)]
    │ └─┬─ AccessChild
    │   └─┬─ SubCollection[customers]
    │     ├─── SubCollection[orders]
    │     └─── Where[YEAR(order_date) == 1995]
    └─── TopK[3, total_orders.DESC(na_pos='last')]

The main task of this node is to sort the collection on the following and keep the first 3 records:
  total_orders, in descending order with nulls at the end

The following terms will be included in the result if this collection is executed:
  country_name, total_orders

It is possible to use BACK to go up to 1 level above this collection.

The collection has access to the following expressions:
  comment, country_name, key, name, region_key, total_orders

The collection has access to the following collections:
  customers, region, suppliers


There are several pieces of information displayed, including the following:

* The structure of the entire logic is shown, but the information being displayed is specifically focused on the last operation (the `TopK` at the bottom of the structure).
* The operation is ordering by `total_orders` in descending order, then keeping the top 3 entries.
* There are 6 total expressions that are accessible from `top_asian_countries`, but only 2 of them are included in the answer when executed: `country_name` and `total_orders`.
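The phrase "in descending order with nulls at the end" can be sketched in plain Python as follows. This is only an illustration of the ordering rule on invented rows; the helper name `top_k_desc_nulls_last` is hypothetical and is not part of PyDough.

```python
def top_k_desc_nulls_last(rows, k, column):
    """Sort rows by `column` descending, with None values last, and keep the first k."""
    ordered = sorted(
        rows,
        # False sorts before True, so rows with a value come first;
        # negating the value turns the ascending sort into a descending one.
        key=lambda row: (row[column] is None, -(row[column] or 0)),
    )
    return ordered[:k]


# Invented rows, including one missing value.
rows = [
    {"country_name": "a", "total_orders": 5},
    {"country_name": "b", "total_orders": None},
    {"country_name": "c", "total_orders": 9},
    {"country_name": "d", "total_orders": 7},
    {"country_name": "e", "total_orders": 1},
]

print([r["country_name"] for r in top_k_desc_nulls_last(rows, 3, "total_orders")])
# ['c', 'd', 'a']
```

With nulls placed last, a record with a missing `total_orders` can only appear in the top 3 if fewer than 3 records have a value.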

More can be learned about these expressions included in the answer with `pydough.explain_term`:

In [14]:
%%pydough

print(pydough.explain_term(top_asian_countries, country_name, verbose=True))

Collection:
  ──┬─ TPCH
    ├─── TableCollection[nations]
    ├─┬─ Where[$1.name == 'ASIA']
    │ └─┬─ AccessChild
    │   └─── SubCollection[region]
    ├─┬─ Calc[country_name=LOWER(name), total_orders=COUNT($1)]
    │ └─┬─ AccessChild
    │   └─┬─ SubCollection[customers]
    │     ├─── SubCollection[orders]
    │     └─── Where[YEAR(order_date) == 1995]
    └─── TopK[3, total_orders.DESC(na_pos='last')]

The term is the following expression: country_name

This expression calls the function 'LOWER' on the following arguments:
  name

Call pydough.explain_term with this collection and any of the arguments to learn more about them.

This term is singular with regards to the collection, meaning it can be placed in a CALC of a collection.
For example, the following is valid:
  TPCH.nations.WHERE(region.name == 'ASIA')(country_name=LOWER(name), total_orders=COUNT(customers.orders.WHERE(YEAR(order_date) == 1995))).TOP_K(3, total_orders.DESC(na_pos='last'))(country_name)


Here, we get the structure of everything done up until this point, and information specifically about `country_name`. In this case, we learn it is the result of calling `LOWER` on `name`. Calling `explain_term(top_asian_countries, name)` would display more or less the same information as `explain_term(nations, name)`. Instead, here is the result of using `explain_term` to learn about `total_orders`:

In [15]:
%%pydough

print(pydough.explain_term(top_asian_countries, total_orders, verbose=True))

Collection:
  ──┬─ TPCH
    ├─── TableCollection[nations]
    ├─┬─ Where[$1.name == 'ASIA']
    │ └─┬─ AccessChild
    │   └─── SubCollection[region]
    ├─┬─ Calc[country_name=LOWER(name), total_orders=COUNT($1)]
    │ └─┬─ AccessChild
    │   └─┬─ SubCollection[customers]
    │     ├─── SubCollection[orders]
    │     └─── Where[YEAR(order_date) == 1995]
    └─── TopK[3, total_orders.DESC(na_pos='last')]

The term is the following expression: total_orders

This expression counts how many records of the following subcollection exist for each record of the collection:
  customers.orders.WHERE(YEAR(order_date) == 1995)

Call pydough.explain_term with this collection and any of the arguments to learn more about them.

This term is singular with regards to the collection, meaning it can be placed in a CALC of a collection.
For example, the following is valid:
  TPCH.nations.WHERE(region.name == 'ASIA')(country_name=LOWER(name), total_orders=COUNT(customers.orders.WHERE(YEAR(order_date) ==

Here, we learn that `total_orders` counts how many records of `customers.orders` placed in 1995 exist for each record of `nations`. Now, `explain_term` is used to learn more about the argument to `COUNT`:

In [16]:
%%pydough

print(pydough.explain_term(asian_countries, orders_1995, verbose=True))

Collection:
  ──┬─ TPCH
    ├─── TableCollection[nations]
    └─┬─ Where[$1.name == 'ASIA']
      └─┬─ AccessChild
        └─── SubCollection[region]

The term is the following child of the collection:
  └─┬─ AccessChild
    └─┬─ SubCollection[customers]
      ├─── SubCollection[orders]
      └─── Where[YEAR(order_date) == 1995]

This child is plural with regards to the collection, meaning its scalar terms can only be accessed by the collection if they are aggregated.
For example, the following are valid:
  TPCH.nations.WHERE(region.name == 'ASIA')(COUNT(customers.orders.WHERE(YEAR(order_date) == 1995).clerk))
  TPCH.nations.WHERE(region.name == 'ASIA').WHERE(HAS(customers.orders.WHERE(YEAR(order_date) == 1995)))
  TPCH.nations.WHERE(region.name == 'ASIA').ORDER_BY(COUNT(customers.orders.WHERE(YEAR(order_date) == 1995)).DESC())

To learn more about this child, you can try calling pydough.explain on the following:
  TPCH.nations.WHERE(region.name == 'ASIA').customers.orders.WHERE(YEAR(o

Here, we learn that `orders_1995` invokes a child of the current context (`nations.WHERE(region.name == 'ASIA')`) by accessing the `customers` subcollection, then accessing its `orders` subcollection, then filtering it on the condition `YEAR(order_date) == 1995`.

We also know that this resulting child is plural with regards to the context, meaning that `asian_countries(orders_1995.order_date)` would be illegal, but `asian_countries(MAX(orders_1995.order_date))` is legal.
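The difference between a plural child and an aggregated one can be pictured with a small plain-Python sketch. The data is invented and the dictionary layout is only an assumption for illustration; it does not reflect how PyDough represents collections.

```python
from datetime import date

# Invented data: each nation maps to the order dates reachable through its customers.
order_dates_by_nation = {
    "india": [date(1995, 1, 5), date(1995, 6, 1), date(1995, 3, 9)],
    "china": [date(1995, 2, 2)],
    "japan": [],
}

# Plural: a nation can have zero, one, or many order dates, so there is no single
# value to place in a result that has exactly one row per nation.
# Aggregating (here MAX and COUNT) reduces each list to one value per nation.
latest_order = {
    nation: max(dates, default=None) for nation, dates in order_dates_by_nation.items()
}
order_count = {nation: len(dates) for nation, dates in order_dates_by_nation.items()}

print(latest_order["india"])  # 1995-06-01
print(order_count)
```

A nation with no matching orders still produces one row: the count is 0 and the maximum is missing, which is one reason sort orders such as `DESC(na_pos='last')` state where nulls go.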

Further combinations of `pydough.explain` and `pydough.explain_term` can be used to learn more about what each of these components does.