# Project table

In this example, we demonstrate how to define a table that is populated with the unique values from another (source) table. At the same time, a link column is evaluated that connects the two tables.

In [10]:
import pandas as pd  # Prosto relies on pandas
import prosto as prst  # Import Prosto toolkit

### Create a new workflow

In [11]:
# Create a workflow
workflow = prst.Schema("My workflow")
# Element name is stored in the id field
print("Workflow name is: ´{}´".format(workflow.id))

Workflow name is: ´My workflow´


### Define source data

In this example, we use an in-memory data frame as the data source. In more realistic cases, it could be a CSV file or a database.

In [12]:
sales_data = {
    'product_name': ["beer", "chips", "chips", "beer", "chips"],
    'quantity': [1, 2, 3, 2, 1],
    'price': [10.0, 5.0, 6.0, 15.0, 4.0]
}
sales_df = pd.DataFrame(data=sales_data)
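
As a rough illustration of the CSV case mentioned above, the sketch below parses the same records from CSV text into a column-oriented dictionary with the same shape as `sales_data`. The CSV content and the helper `load_sales_columns` are hypothetical and use only the standard library. With a real file, `pd.read_csv` would normally be the more direct choice.

```python
import csv
import io

# Hypothetical CSV content standing in for a real file on disk
csv_text = """product_name,quantity,price
beer,1,10.0
chips,2,5.0
chips,3,6.0
beer,2,15.0
chips,1,4.0
"""

def load_sales_columns(text):
    """Parse CSV text into a column-oriented dict shaped like sales_data."""
    columns = {"product_name": [], "quantity": [], "price": []}
    for row in csv.DictReader(io.StringIO(text)):
        columns["product_name"].append(row["product_name"])
        columns["quantity"].append(int(row["quantity"]))   # CSV values arrive as strings
        columns["price"].append(float(row["price"]))
    return columns

sales_from_csv = load_sales_columns(csv_text)
print(sales_from_csv["product_name"])
```

The resulting dictionary can be passed to `pd.DataFrame(data=...)` in the same way as `sales_data`.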

### Define a source populate table

Populating a table by means of a user-defined function is the simplest way to define it.

To define a table, we need to provide a name as well as a list of its attributes. Attributes contain data that is set when the table is populated.

How a table is populated is specified by a user-defined function. In this example, the function simply returns a data frame without performing any computations. Therefore, the list of input tables is empty, and we also use an empty model (the parameters of the population procedure).

The last parameter indicates that the user-defined function returns a whole table.

In [13]:
sales = workflow.create_populate_table(
    # A table definition consists of a name and a list of attributes
    table_name="Sales", attributes=["product_name", "quantity", "price"],

    # The table operation consists of a UDF, input tables and a model
    func=lambda **m: sales_df, tables=[], model={},

    # The user-defined function returns a complete data frame
    input_length='table'
)

### Define a project table

Our goal is to create a table with a list of all (unique) products by using data from the source table.

A project table contains data projected from another table. The data is projected along the specified list of columns.

In [14]:
products = workflow.create_project_table(
    # A table definition consists of a name and a list of attributes
    table_name="Products", attributes=["name"],

    # This parameter specifies a column of the input table which will store row ids of this table
    link="product",

    # Here we specify an input table to be projected
    tables=["Sales"]
)
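
To make the idea of projection concrete, here is a minimal plain-Python sketch of what the definition above is expected to produce once the workflow runs. It is not Prosto's implementation: the helper `project` is hypothetical, and keeping values in order of first appearance is an assumption of this sketch.

```python
# Same values as the product_name column of the Sales table
product_names = ["beer", "chips", "chips", "beer", "chips"]

def project(values):
    """Return the unique values, keeping the order of first appearance."""
    unique = []
    seen = set()
    for value in values:
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique

product_rows = project(product_names)
print(product_rows)  # ['beer', 'chips']
```

The position of each value in the resulting list plays the role of its row id in the project table.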

### Define a link column

A link column stores row ids of some other table. These row ids are interpreted as references, so link columns can be used to access values from other tables.

In [15]:
link_column = workflow.create_link_column(
    # In contrast to other columns, a link column specifies its target table name
    name="product", table=sales.id, type=products.id,

    # It is a criterion of linking: all input columns have to be equal to the output columns
    columns=["product_name"], linked_columns=["name"]
)
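
The linking criterion can be sketched in plain Python as a lookup from each source value to the row id of the matching target row. This is only an illustration under the assumption of a single linking column. The helper `link` is hypothetical and is not part of Prosto.

```python
# Source column (Sales.product_name) and target column (Products.name)
product_names = ["beer", "chips", "chips", "beer", "chips"]
products = ["beer", "chips"]

def link(source_values, target_values):
    """For every source value, find the row id of the equal target value."""
    row_id_of = {value: row_id for row_id, value in enumerate(target_values)}
    return [row_id_of[value] for value in source_values]

product_link = link(product_names, products)
print(product_link)  # [0, 1, 1, 0, 1]
```

Each entry is a reference from a row of the source table to a row of the target table.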

### Execute the workflow

Above, we provided only definitions. To actually compute the result, we need to execute the workflow. This operation builds a topology (a graph of table and column operations) and then executes these operations according to their dependencies.

In [16]:
workflow.run()
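
Executing operations according to their dependencies amounts to a topological ordering of the graph. The sketch below uses the standard-library `graphlib` module (Python 3.9+) on a hypothetical dependency graph for this example. The node names and edges are assumptions made for illustration and do not reflect Prosto's internal representation.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each key depends on the names in its set
dependencies = {
    "Sales": set(),                          # populated directly from the data frame
    "Products": {"Sales"},                   # projected from Sales
    "Sales.product": {"Sales", "Products"},  # link column needs both tables
}

# static_order() yields every node after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['Sales', 'Products', 'Sales.product']
```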

### Explore the result

Once the workflow has been executed, we can read the result data.

In [17]:
from IPython.display import display

table_data = products.get_data()
display(table_data)

| | name |
|---|---|
| 0 | beer |
| 1 | chips |


In [18]:
table_data = sales.get_data()
display(table_data)

| | product_name | quantity | price | product |
|---|---|---|---|---|
| 0 | beer | 1 | 10.0 | 0 |
| 1 | chips | 2 | 5.0 | 1 |
| 2 | chips | 3 | 6.0 | 1 |
| 3 | beer | 2 | 15.0 | 0 |
| 4 | chips | 1 | 4.0 | 1 |


### Summary

* A project table stores all unique combinations of the specified columns in an input table. In this example, we generated a list of all unique products in a separate table. In the future, we can define new columns in this project table.
* A link column has to be defined along with a project table. This column stores a mapping from the input table to the target table. It can be used in other definitions where it is necessary to access related records.
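
As an illustration of accessing related records through the link, the following plain-Python sketch uses the values from the result tables to sum revenue per product. The `revenue` column is hypothetical and is not defined in this workflow.

```python
# Values taken from the result tables above
products = ["beer", "chips"]
quantity = [1, 2, 3, 2, 1]
price = [10.0, 5.0, 6.0, 15.0, 4.0]
product_link = [0, 1, 1, 0, 1]  # Sales.product: row ids in Products

# Hypothetical new column of Products: total revenue of the linked sales rows
revenue = [0.0] * len(products)
for row, target in enumerate(product_link):
    revenue[target] += quantity[row] * price[row]

print(dict(zip(products, revenue)))  # {'beer': 40.0, 'chips': 32.0}
```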