# Aggregate column

In this example, we demonstrate how to define a column using grouping and aggregation. 

Generally, the definition of an aggregate column has to describe two aspects:

* how to group elements and 
* how to aggregate data in the groups 

There are two general ways to group elements: 

* Partition the elements of one (fact) table into non-overlapping groups, where all elements of one group share the same value of some property and each group corresponds to one element of another (group) table. In SQL, this is implemented by the GROUP BY operation.
* Group the elements of a table around elements of the same table using some binary relation between them, which can be treated as a distance. This is typically implemented as rolling aggregation.
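
The difference between the two kinds of grouping can be sketched in plain pandas. This is only an illustration and does not use Prosto; the `facts` table and the window size of 2 are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical fact table, used only for this illustration
facts = pd.DataFrame({
    "product_name": ["beer", "chips", "chips", "beer", "chips"],
    "quantity": [1, 2, 3, 2, 1],
})

# 1. Partitioning: each record belongs to exactly one group,
#    identified by the value of "product_name" (like SQL GROUP BY)
per_product = facts.groupby("product_name")["quantity"].sum()

# 2. Grouping around each element: every record gets its own group
#    consisting of itself and the record before it (a window of size 2)
rolling_sum = facts["quantity"].rolling(window=2, min_periods=1).sum()

print(per_product.to_dict())   # {'beer': 3, 'chips': 6}
print(rolling_sum.tolist())    # [1.0, 3.0, 5.0, 5.0, 3.0]
```

In the first case the result has one value per group (product), while in the second case it has one value per record of the original table.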

In this example, we describe the first approach: sales records are aggregated and the results are stored as a new column of the product table. More specifically, we want to find the quantity sold of each product by aggregating data in the sales table.

In [11]:
import pandas as pd  # Prosto relies on pandas
import prosto as pr  # Import Prosto toolkit

### Create a new workflow

In [12]:
# Create a workflow
prosto = pr.Prosto("My Prosto Workflow")
# Element name is stored in the id field
print("Workflow name is: ´{}´".format(prosto.id))

Workflow name is: ´My Prosto Workflow´


### Define a source table

We use in-memory data for populating this table and not data from any other table in the workflow.

This table stores sales data. Each time a product is sold, a new record is added to this table.

In [13]:
sales_data = {
    'product_name': ["beer", "chips", "chips", "beer", "chips"],
    'quantity': [1, 2, 3, 2, 1],
    'price': [10.0, 5.0, 6.0, 15.0, 4.0]
}

sales = prosto.populate(
    # A table definition consists of a name and a list of attributes
    table_name="Sales", attributes=["product_name", "quantity", "price"],

    # A table operation consists of a UDF, input tables and a model
    func=lambda **m: pd.DataFrame(data=sales_data), tables=[]
)

### Define a target table

This table stores a list of products. Note that the list of products is loaded from memory and not generated from any other table in the workflow. In particular, it is not generated from the sales table (this could be done by means of the project operation). Product names should be unique because otherwise the linking will be ambiguous.

In [14]:
products = prosto.populate(
    # A table definition consists of a name and a list of attributes
    table_name="Products", attributes=["name"],

    # A table operation consists of a UDF, input tables and a model
    func=lambda **m: pd.DataFrame(data={'name': ["beer", "chips", "tee"]}), tables=[]
)

### Define a link column

A link column stores row ids of some other table. These row ids are interpreted as references and hence it is possible to use these columns in order to access values from other tables.

In this example, we want to define a link from sales records to products.

In [15]:
link_column = prosto.link(
    # In contrast to other columns, a link column specifies its target table name
    name="product", table=sales.id, type=products.id,

    # Linking criterion: the values of the input columns have to be equal to the values of the linked columns
    columns=["product_name"], linked_columns=["name"]
)
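
To make the idea of "row ids as references" more concrete, here is a rough pandas-only sketch of what such a link amounts to. It is not how Prosto implements linking internally; the names `sales_df`, `products_df` and `row_ids` are assumptions made for this illustration.

```python
import pandas as pd

sales_df = pd.DataFrame({"product_name": ["beer", "chips", "chips", "beer", "chips"]})
products_df = pd.DataFrame({"name": ["beer", "chips", "tee"]})

# For every sales record, find the row id of the product with the same name.
# get_indexer requires unique product names and returns -1 if there is no match.
row_ids = pd.Index(products_df["name"]).get_indexer(sales_df["product_name"])
sales_df["product"] = row_ids

# Following the link: use the stored row ids to read values from the target table
linked_names = products_df["name"].iloc[row_ids].tolist()

print(sales_df["product"].tolist())  # [0, 1, 1, 0, 1]
print(linked_names)
```

The lookup only works unambiguously if product names are unique, which is the same requirement stated for the product table above.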

### Define an aggregate column

This column belongs to the target table with all products, that is, it is treated as a new characteristic of each product. Yet, this characteristic is computed using groups of records selected from another table.

In [16]:
total = prosto.aggregate(
    # Column description
    name="quantity_sold", table=products.id,
    # How to group
    tables=["Sales"], link="product",
    # How to aggregate
    func="lambda x: x.sum()", columns=["quantity"], model={}
)
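
For comparison, the same computation can be sketched in plain pandas, assuming the link column already holds the row ids of the products. This is only an approximation of the semantics, not Prosto's implementation; the data frames and the `sums` variable are introduced here for illustration.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "product": [0, 1, 1, 0, 1],    # row ids of products, as a link column would store them
    "quantity": [1, 2, 3, 2, 1],
})
products_df = pd.DataFrame({"name": ["beer", "chips", "tee"]})

# How to group and aggregate: group the fact records by the linked row id and sum the quantities
sums = sales_df.groupby("product")["quantity"].sum()

# Store the result as a column of the product table;
# products without any sales records get the default value 0
products_df["quantity_sold"] = sums.reindex(products_df.index, fill_value=0)

print(products_df["quantity_sold"].tolist())  # [3, 6, 0]
```

Note that this sketch produces integers, whereas the Prosto result shown later in this example contains floating-point values.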

### Execute the workflow

Above we provided only definitions. To actually compute the result, we need to execute the workflow. This operation will build a topology (a graph of table and column operations) and then execute these operations according to their dependencies.

In [17]:
prosto.run()
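
The idea of ordering operations by their dependencies can be illustrated with the standard library module `graphlib` (Python 3.9+). The graph below is a hypothetical model of this example's definitions and is not produced by Prosto or taken from its internals.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each operation is mapped to
# the set of operations that must be computed before it
dependencies = {
    "Sales": set(),
    "Products": set(),
    "Sales.product": {"Sales", "Products"},          # the link needs both tables
    "Products.quantity_sold": {"Sales.product"},     # the aggregate needs the link
}

# A valid execution order: tables first, then the link, then the aggregate
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Any order produced this way populates both tables before the link column and evaluates the link column before the aggregate column.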

### Explore the result

Once the workflow has been executed, we can read the result data. The expected sales counts are: 3.0 for beer, 6.0 for chips and 0.0 for tee. Note that tee is not present in the fact table and 0.0 is the default value.

In [18]:
table_data = products.get_data()
table_data.head()

| | name | quantity_sold |
|---|---|---|
| 0 | beer | 3.0 |
| 1 | chips | 6.0 |
| 2 | tee | 0.0 |


### Summary

* An aggregate column uses a link column as a specification of how to group records in another table
* An aggregate column uses an arbitrary user-defined aggregate function for aggregation
* An aggregate column can be further used in other definitions as a normal column