# Calculate column

In this example, we define a simple in-memory table and a calculate column whose output values are computed as the product of two input values.

In [8]:
import pandas as pd  # Prosto relies on pandas
import prosto as pr  # Import Prosto toolkit

### Create a new workflow

In [9]:
# Create a workflow
workflow = pr.Schema("My workflow")
# Element name is stored in the id field
print(f"Workflow name is: '{workflow.id}'")

### Define source data

In this example, we use an in-memory data frame as the data source. In more realistic cases, it can be a CSV file or a database.

In [None]:
sales_data = {
    'product_name': ["beer", "chips", "chips", "beer", "chips"],
    'quantity': [1, 2, 3, 2, 1],
    'price': [10.0, 5.0, 6.0, 15.0, 4.0]
}
sales_df = pd.DataFrame(data=sales_data)
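
If the data lived in a CSV file instead, the same data frame could be loaded with pandas. The sketch below is only an illustration: it reads from an in-memory string through `io.StringIO` so that it runs without a file on disk, and the names `csv_text` and `sales_from_csv` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """product_name,quantity,price
beer,1,10.0
chips,2,5.0
chips,3,6.0
beer,2,15.0
chips,1,4.0
"""

# pd.read_csv accepts a file path or any file-like object
sales_from_csv = pd.read_csv(io.StringIO(csv_text))
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path.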

### Define a source populate table

A populate table is the simplest way to define a table: its contents are produced by a user-defined function.

To define a table, we need to provide a name as well as a list of its attributes. Attributes contain data which will be set during the table population.

How a table is populated is specified by a user-defined function. In this example, the function simply returns the data frame without any computations. Therefore, the list of input tables is empty, and the model (the parameters of the population procedure) is empty as well.

The last parameter indicates that the user-defined function returns a whole table.

In [None]:
sales = workflow.create_populate_table(
    # A table definition consists of a name and a list of attributes
    table_name="Sales", attributes=["product_name", "quantity", "price"],

    # Table operation is UDF, input tables and model
    func=lambda **m: sales_df, tables=[], model={},

    # The user-defined function returns a complete data frame
    input_length='table'
)
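
The `**m` signature of the lambda above suggests that the model is handed to the function as keyword arguments. Under that assumption, a named function with the same shape could look like the sketch below. It uses plain pandas without Prosto, and `populate_sales` and the `min_quantity` parameter are hypothetical names introduced only for illustration.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "product_name": ["beer", "chips", "chips", "beer", "chips"],
    "quantity": [1, 2, 3, 2, 1],
    "price": [10.0, 5.0, 6.0, 15.0, 4.0],
})


def populate_sales(**model):
    """Return a whole table; `min_quantity` is a hypothetical model entry."""
    min_quantity = model.get("min_quantity", 0)
    return sales_df[sales_df["quantity"] >= min_quantity]


all_rows = populate_sales()                 # empty model: every row is kept
some_rows = populate_sales(min_quantity=2)  # only rows with quantity >= 2
```

A named function like this is easier to test on its own than a lambda, because it can be called directly with different models.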

### Define a calculate column

To define a new column, we need to provide a name as well as a table it belongs to.

A calculate column computes its data from the data in the same row in the input columns by means of a user-defined function. In this case, the output is computed as a product of the quantity and price attributes. The model is empty because the user-defined function does not use external parameters.

The last argument indicates that the user-defined function computes one value for one row and not the whole column.

In [10]:
calc_column = workflow.create_calculate_column(
    # Column definition consists of a name and a table it belongs to
    name="amount", table=sales.id,

    # Column operation is UDF, input columns and model
    func=lambda x: x[0]*x[1], columns=["quantity", "price"], model=None,

    # The user-defined function returns one value for one row
    input_length='value'
)
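
To see what such a row-level function computes, the same calculation can be written in plain pandas. This is a sketch of the idea only and is not necessarily how Prosto evaluates the column internally; the helper name `product` is hypothetical.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "product_name": ["beer", "chips", "chips", "beer", "chips"],
    "quantity": [1, 2, 3, 2, 1],
    "price": [10.0, 5.0, 6.0, 15.0, 4.0],
})


def product(x):
    # x holds the values of one row, in the order of the selected columns
    return x[0] * x[1]


# raw=True passes each row as a NumPy array, so positional indexing works
amount = sales_df[["quantity", "price"]].apply(product, axis=1, raw=True)
print(amount.tolist())  # [10.0, 10.0, 18.0, 30.0, 4.0]
```

The function sees one row at a time and returns one value, which matches the `input_length='value'` setting in the column definition.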

### Execute the workflow

So far we have provided only definitions. To actually compute the result, we need to execute the workflow. This operation builds a topology (a graph of table and column operations) and then executes these operations in the order required by their dependencies.

In [11]:
workflow.run()
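
The idea of ordering operations by their dependencies can be sketched with the standard library alone. The example below is not Prosto's implementation; it only shows, with the hypothetical `dependencies` mapping, why the `Sales` table has to be populated before the `amount` column is calculated.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each operation maps to the set of operations it depends on.
# The names mirror this example: the "amount" column needs the "Sales" table.
dependencies = {
    "Sales": set(),
    "amount": {"Sales"},
}

# static_order() yields the operations so that dependencies always come first
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # ['Sales', 'amount']
```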

### Explore the result

Once the workflow has been executed, we can read the result data.

In [12]:
table_data = sales.get_data()
column_data = sales.get_column_data("amount")

table_data.head()

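|   | product_name | quantity | price | amount |
|---|--------------|----------|-------|--------|
| 0 | beer         | 1        | 10.0  | 10.0   |
| 1 | chips        | 2        | 5.0   | 10.0   |
| 2 | chips        | 3        | 6.0   | 18.0   |
| 3 | beer         | 2        | 15.0  | 30.0   |
| 4 | chips        | 1        | 4.0   | 4.0    |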


### Summary

* The logic of the computation is specified by an arbitrary user-defined function.
* A user-defined function manipulates individual values and is unaware of columns and tables, much like an Excel formula.
* We can change the source data, execute the workflow again, and get a new result.