# Link column

This example shows how to define a link column, which connects two tables by matching the values of specified columns.

In [1]:
import pandas as pd  # Prosto relies on pandas
import prosto as pr  # Import Prosto toolkit

### Create a new workflow

In [2]:
# Create a workflow
prosto = pr.Prosto("My Prosto Workflow")
# Element name is stored in the id field
print("Workflow name is: ´{}´".format(prosto.id))

Workflow name is: ´My Prosto Workflow´


### Define a source table

This table is populated from in-memory data rather than from another table in the workflow.

It stores sales data: each time a product is sold, a new record is added to this table.

In [3]:
sales_data = {
    'product_name': ["beer", "chips", "chips", "beer", "chips"],
    'quantity': [1, 2, 3, 2, 1],
    'price': [10.0, 5.0, 6.0, 15.0, 4.0]
}
sales_df = pd.DataFrame(data=sales_data)

sales = prosto.populate(
    # A table definition consists of a name and a list of attributes
    table_name="Sales", attributes=["product_name", "quantity", "price"],

    # A table operation is defined by a UDF, its input tables, and a model
    func=lambda **m: sales_df, tables=[]
)

### Define a target table

This table stores a list of products. The list is loaded from memory and not generated from any other table in the workflow. In particular, it is not derived from the sales table, although that could be done with the project operation. Product names must be unique, because otherwise the linking would be ambiguous.

In [4]:
products = prosto.populate(
    # A table definition consists of a name and a list of attributes
    table_name="Products", attributes=["name"],

    # A table operation is defined by a UDF, its input tables, and a model
    func=lambda **m: pd.DataFrame(data={'name': ["beer", "chips", "tee"]}), tables=[]
)
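The uniqueness requirement can be checked before the data is handed to the workflow. The following is a minimal sketch in plain pandas, not part of the Prosto API; the variable names `products_df`, `names_are_unique` and `duplicate_names` are introduced here only for illustration.

```python
import pandas as pd

# Same product names as in the Products table above
products_df = pd.DataFrame(data={'name': ["beer", "chips", "tee"]})

# True if every name occurs exactly once, so each name identifies one row
names_are_unique = products_df['name'].is_unique
print(names_are_unique)

# Collect any names that occur more than once (none in this data)
is_duplicated = products_df['name'].duplicated(keep=False)
duplicate_names = products_df.loc[is_duplicated, 'name'].unique().tolist()
print(duplicate_names)
```

If `duplicate_names` were not empty, a sales record with one of those names would match several product rows, which is the ambiguity mentioned above.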

### Define a link column

A link column stores row ids of another table. These row ids are interpreted as references, so a link column can be used to access values from the other table.

In this example, we want to define a link from sales records to products.

In [5]:
link_column = prosto.link(
    # In contrast to other columns, a link column specifies its target table name
    name="product", table=sales.id, type=products.id,

    # The linking criterion: values of `columns` in this table must equal values of `linked_columns` in the target table
    columns=["product_name"], linked_columns=["name"]
)
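To make the linking criterion concrete, here is a rough pandas-only sketch of the same idea: for each sales record, look up the row id of the product whose `name` equals the record's `product_name`. This illustrates the semantics only and is not how Prosto implements the operation; `name_to_row_id` is a helper name assumed for this example.

```python
import pandas as pd

sales_df = pd.DataFrame({'product_name': ["beer", "chips", "chips", "beer", "chips"]})
products_df = pd.DataFrame({'name': ["beer", "chips", "tee"]})

# Lookup from product name to the row id (index) of that product
name_to_row_id = pd.Series(products_df.index, index=products_df['name'])

# For each sales record, find the row id of the product with the same name
sales_df['product'] = sales_df['product_name'].map(name_to_row_id)

print(sales_df['product'].tolist())  # [0, 1, 1, 0, 1]
```

In this sketch, a product name that is absent from `products_df` would map to `NaN`, since pandas has no row id to return for it.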

### Execute the workflow

So far we have provided only definitions. To compute the results, we need to execute the workflow. Running it builds a topology (a graph of table and column operations) and then executes these operations in the order of their dependencies.

In [6]:
prosto.run()
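The idea of executing operations according to their dependencies can be sketched with the standard library's `graphlib` (Python 3.9+). This is only an illustration of dependency ordering for this example, not Prosto's internal implementation; the operation labels are hypothetical and are not Prosto identifiers.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each key depends on the operations in its set.
# The link column needs both tables to be populated first.
dependencies = {
    "populate Sales": set(),
    "populate Products": set(),
    "link Sales::product": {"populate Sales", "populate Products"},
}

# A valid execution order places every operation after its dependencies
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

The two populate operations may appear in either order, but the link operation always comes last.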

### Explore the result

Once the workflow has been executed, we can read the result data.

In [7]:
table_data = products.get_df()
table_data.head()

| | name |
|---|---|
| 0 | beer |
| 1 | chips |
| 2 | tee |


In [8]:
table_data = sales.get_df()
table_data.head()

| | product_name | quantity | price | product |
|---|---|---|---|---|
| 0 | beer | 1 | 10.0 | 0 |
| 1 | chips | 2 | 5.0 | 1 |
| 2 | chips | 3 | 6.0 | 1 |
| 3 | beer | 2 | 15.0 | 0 |
| 4 | chips | 1 | 4.0 | 1 |


### Summary

* A link column stores a mapping from the input table to the target table. In this example, for each product name found in the sales records, we find and store a row id of the corresponding record in the table of products.
* A link column can be used in other definitions that need to access related records.
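
As an illustration of accessing related records, the sketch below follows the stored row ids back into the products table using plain pandas. It reuses the values shown in the result above and is an assumption about how one might read the link outside of Prosto, not a Prosto API call.

```python
import pandas as pd

products_df = pd.DataFrame({'name': ["beer", "chips", "tee"]})
sales_df = pd.DataFrame({
    'product_name': ["beer", "chips", "chips", "beer", "chips"],
    'product': [0, 1, 1, 0, 1],  # link column values from the result above
})

# Follow the link: use the stored row ids to fetch values from the target table
linked_names = products_df.loc[sales_df['product'], 'name'].to_numpy()

print(linked_names.tolist())  # ['beer', 'chips', 'chips', 'beer', 'chips']
```

Here the target table has only a `name` attribute, so the lookup returns the product names again. With more attributes in the products table, the same lookup would fetch any of them.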