# Deep Feature Synthesis

Deep Feature Synthesis (DFS) is an automated method for performing feature engineering on relational and temporal data.

## Input Data

Deep Feature Synthesis requires structured datasets in order to perform feature engineering. To demonstrate the capabilities of DFS, we will use a mock customer transactions dataset.


In [1]:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
es





Entityset: transactions
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 3]
    sessions [Rows: 35, Columns: 5]
    customers [Rows: 5, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

Once data is prepared as an `EntitySet`, we are ready to automatically generate features for a target dataframe, e.g. `customers`.

## Running DFS

Typically, without automated feature engineering, a data scientist would write code to aggregate each customer's data and apply various statistical functions, producing features that quantify the customer's behavior. In this example, an expert might be interested in features such as *total number of sessions* or *month the customer signed up*.

DFS can generate these features when we specify `customers` as the `target_dataframe_name` and `"count"` and `"month"` as primitives.

In [2]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["count"],
                                      trans_primitives=["month"],
                                      max_depth=1)
feature_matrix

| customer_id | zip_code | COUNT(sessions) | MONTH(_ft_last_time) | MONTH(date_of_birth) | MONTH(join_date) |
| --- | --- | --- | --- | --- | --- |
| 5 | 60091 | 6 | 1 | 7 | 7 |
| 4 | 60091 | 8 | 1 | 8 | 4 |
| 1 | 60091 | 8 | 1 | 7 | 4 |
| 3 | 13244 | 6 | 1 | 11 | 8 |
| 2 | 13244 | 7 | 1 | 8 | 4 |


In the example above, `"count"` is an **aggregation primitive** because it computes a single value from the many sessions related to one customer. `"month"` is a **transform primitive** because it takes one value for a customer and transforms it into another.
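To make the distinction concrete, here is a minimal plain-Python sketch of what the two kinds of primitive compute. It does not use Featuretools, and the `customers` and `sessions` records are small made-up values, not the mock dataset above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical toy data, not the mock customer dataset used above.
customers = [
    {"customer_id": 1, "join_date": datetime(2011, 4, 17, 10, 48)},
    {"customer_id": 2, "join_date": datetime(2012, 8, 13, 15, 42)},
]
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 2},
    {"session_id": 3, "customer_id": 1},
]

# Aggregation: many child rows (sessions) collapse into one value per customer.
session_counts = Counter(s["customer_id"] for s in sessions)

# Transform: one value in a customer's row maps to one new value in the same row.
features = {
    c["customer_id"]: {
        "COUNT(sessions)": session_counts[c["customer_id"]],
        "MONTH(join_date)": c["join_date"].month,
    }
    for c in customers
}

for customer_id, row in features.items():
    print(customer_id, row)
```

The aggregation needs the related `sessions` rows, while the transform only needs the customer's own row.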

## Creating "Deep Features"

The name Deep Feature Synthesis comes from the algorithm's ability to stack primitives to generate more complex features. Each time we stack a primitive, we increase the "depth" of a feature. The `max_depth` parameter controls the maximum depth of the features returned by DFS. Let us try running DFS with `max_depth=2`:

In [3]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["mean", "sum", "mode"],
                                      trans_primitives=["month", "hour"],
                                      max_depth=2)
feature_matrix

| customer_id | zip_code | MODE(sessions.device) | MEAN(transactions.amount) | MODE(transactions.product_id) | SUM(transactions.amount) | HOUR(_ft_last_time) | HOUR(date_of_birth) | HOUR(join_date) | MONTH(_ft_last_time) | MONTH(date_of_birth) | MONTH(join_date) | MEAN(sessions.MEAN(transactions.amount)) | MEAN(sessions.SUM(transactions.amount)) | MODE(sessions.HOUR(_ft_last_time)) | MODE(sessions.HOUR(session_start)) | MODE(sessions.MODE(transactions.product_id)) | MODE(sessions.MONTH(_ft_last_time)) | MODE(sessions.MONTH(session_start)) | SUM(sessions.MEAN(transactions.amount)) | MODE(transactions.sessions.device) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 60091 | mobile | 80.375443 | 5 | 6349.66 | 8 | 0 | 5 | 1 | 7 | 7 | 78.705187 | 1058.276667 | 5 | 0 | 3 | 1 | 1 | 472.231119 | mobile |
| 4 | 60091 | mobile | 80.070459 | 2 | 8727.68 | 5 | 0 | 20 | 1 | 8 | 4 | 81.207189 | 1090.96 | 3 | 1 | 1 | 1 | 1 | 649.657515 | mobile |
| 1 | 60091 | mobile | 71.631905 | 4 | 9025.62 | 7 | 0 | 10 | 1 | 7 | 4 | 72.77414 | 1128.2025 | 1 | 6 | 4 | 1 | 1 | 582.193117 | mobile |
| 3 | 13244 | desktop | 67.06043 | 1 | 6236.62 | 9 | 0 | 15 | 1 | 11 | 8 | 67.539577 | 1039.436667 | 1 | 5 | 1 | 1 | 1 | 405.237462 | desktop |
| 2 | 13244 | desktop | 77.422366 | 4 | 7200.28 | 8 | 0 | 23 | 1 | 8 | 4 | 78.415122 | 1028.611429 | 3 | 3 | 3 | 1 | 1 | 548.905851 | desktop |


With a depth of 2, a number of features are generated using the supplied primitives. The algorithm to synthesize these definitions is described in this [paper](http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf). Let us examine one of the depth 2 features in the returned feature matrix:

In [4]:
feature_matrix[['MEAN(sessions.SUM(transactions.amount))']]

| customer_id | MEAN(sessions.SUM(transactions.amount)) |
| --- | --- |
| 5 | 1058.276667 |
| 4 | 1090.96 |
| 1 | 1128.2025 |
| 3 | 1039.436667 |
| 2 | 1028.611429 |


For each customer, this feature:

1. calculates the `sum` of all transaction amounts in each session to get the total amount per session, then
2. applies the `mean` to those totals across the customer's sessions to get the *average amount spent per session*.

We call this feature a "deep feature" with a depth of 2.
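The two steps can be sketched by hand in plain Python. This is only an illustration of the stacking on made-up data (the `sessions` and `transactions` values are hypothetical), not how Featuretools computes the feature internally. The last part checks that the values reported above are consistent with this reading: because each transaction belongs to exactly one session, the mean of per-session sums should equal `SUM(transactions.amount)` divided by `COUNT(sessions)`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: each transaction belongs to one session,
# and each session belongs to one customer.
sessions = {1: "a", 2: "a", 3: "b"}  # session_id -> customer_id
transactions = [  # (session_id, amount)
    (1, 10.0), (1, 30.0),
    (2, 20.0),
    (3, 5.0), (3, 15.0),
]

# Depth 1: SUM(transactions.amount) for every session.
session_totals = defaultdict(float)
for session_id, amount in transactions:
    session_totals[session_id] += amount

# Depth 2: MEAN(sessions.SUM(transactions.amount)) for every customer.
totals_by_customer = defaultdict(list)
for session_id, customer_id in sessions.items():
    totals_by_customer[customer_id].append(session_totals[session_id])
avg_per_session = {c: mean(t) for c, t in totals_by_customer.items()}
print(avg_per_session)  # {'a': 30.0, 'b': 20.0}

# Cross-check against the outputs shown above, per customer_id:
# (SUM(transactions.amount), COUNT(sessions), MEAN(sessions.SUM(transactions.amount)))
reported = {
    5: (6349.66, 6, 1058.276667),
    4: (8727.68, 8, 1090.96),
    1: (9025.62, 8, 1128.2025),
    3: (6236.62, 6, 1039.436667),
    2: (7200.28, 7, 1028.611429),
}
consistent = all(
    abs(total / n_sessions - reported_mean) < 1e-4
    for total, n_sessions, reported_mean in reported.values()
)
print(consistent)  # True
```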

Let's look at another depth 2 feature, which calculates for every customer *the most common hour of the day when they start a session*:

In [5]:
feature_matrix[['MODE(sessions.HOUR(session_start))']]

| customer_id | MODE(sessions.HOUR(session_start)) |
| --- | --- |
| 5 | 0 |
| 4 | 1 |
| 1 | 6 |
| 3 | 5 |
| 2 | 3 |


For each customer, this feature:

1. calculates the `hour` of the day at which each of the customer's sessions started, then
2. applies the statistical function `mode` to find the most common hour at which the customer started a session.
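Here a transform primitive is stacked under an aggregation primitive. A minimal plain-Python sketch of the same two steps, on made-up session start times (hypothetical values), might look like this. Ties are broken by first occurrence in this sketch, which is an assumption and may differ from how Featuretools resolves them.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical toy data: (customer_id, session_start)
sessions = [
    ("a", datetime(2014, 1, 1, 6, 5)),
    ("a", datetime(2014, 1, 1, 6, 40)),
    ("a", datetime(2014, 1, 1, 9, 15)),
    ("b", datetime(2014, 1, 1, 3, 0)),
]

# Depth 1 (transform): HOUR(session_start) for every session.
hours_by_customer = defaultdict(list)
for customer_id, session_start in sessions:
    hours_by_customer[customer_id].append(session_start.hour)

# Depth 2 (aggregation): MODE(sessions.HOUR(session_start)) for every customer.
most_common_hour = {
    customer_id: Counter(hours).most_common(1)[0][0]
    for customer_id, hours in hours_by_customer.items()
}
print(most_common_hour)  # {'a': 6, 'b': 3}
```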

Stacking results in features that are more expressive than individual primitives themselves. This enables the automatic creation of complex patterns for machine learning.

## Changing Target DataFrame

DFS is powerful because we can create a feature matrix for any dataframe in our `EntitySet`. If we switch our target dataframe to `sessions`, we can synthesize features for each session instead of each customer. We can then use these features to predict the outcome of a session.

In [6]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="sessions",
                                      agg_primitives=["mean", "sum", "mode"],
                                      trans_primitives=["month", "hour"],
                                      max_depth=2)
feature_matrix.head(5)

| session_id | customer_id | device | MEAN(transactions.amount) | MODE(transactions.product_id) | SUM(transactions.amount) | HOUR(_ft_last_time) | HOUR(session_start) | MONTH(_ft_last_time) | MONTH(session_start) | customers.zip_code | ... | customers.MODE(sessions.device) | customers.MEAN(transactions.amount) | customers.MODE(transactions.product_id) | customers.SUM(transactions.amount) | customers.HOUR(_ft_last_time) | customers.HOUR(date_of_birth) | customers.HOUR(join_date) | customers.MONTH(_ft_last_time) | customers.MONTH(date_of_birth) | customers.MONTH(join_date) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | desktop | 76.813125 | 3 | 1229.01 | 0 | 0 | 1 | 1 | 13244 | ... | desktop | 77.422366 | 4 | 7200.28 | 8 | 0 | 23 | 1 | 8 | 4 |
| 2 | 5 | mobile | 74.696 | 5 | 746.96 | 0 | 0 | 1 | 1 | 60091 | ... | mobile | 80.375443 | 5 | 6349.66 | 8 | 0 | 5 | 1 | 7 | 7 |
| 3 | 4 | mobile | 88.6 | 1 | 1329.0 | 0 | 0 | 1 | 1 | 60091 | ... | mobile | 80.070459 | 2 | 8727.68 | 5 | 0 | 20 | 1 | 8 | 4 |
| 4 | 1 | mobile | 64.5572 | 5 | 1613.93 | 1 | 0 | 1 | 1 | 60091 | ... | mobile | 71.631905 | 4 | 9025.62 | 7 | 0 | 10 | 1 | 7 | 4 |
| 5 | 4 | mobile | 70.638182 | 5 | 777.02 | 1 | 1 | 1 | 1 | 60091 | ... | mobile | 80.070459 | 2 | 8727.68 | 5 | 0 | 20 | 1 | 8 | 4 |


As we can see, DFS will also build deep features based on a parent dataframe, in this case the customer of a particular session. For example, the feature below gives, for each session, the mean transaction amount of the customer who owns that session.

In [7]:
feature_matrix[['customers.MEAN(transactions.amount)']].head(5)

| session_id | customers.MEAN(transactions.amount) |
| --- | --- |
| 1 | 77.422366 |
| 2 | 80.375443 |
| 3 | 80.070459 |
| 4 | 71.631905 |
| 5 | 80.070459 |
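Note that sessions 3 and 5 show the same value because both belong to customer 4. The feature is computed once per customer and then copied down to each of that customer's sessions. A minimal plain-Python sketch of this pattern, on made-up data (hypothetical values, not Featuretools' internal implementation), is:

```python
from statistics import mean

# Hypothetical toy data.
sessions = {1: "a", 2: "b", 3: "a"}  # session_id -> customer_id
transactions = [(1, 10.0), (2, 40.0), (3, 30.0), (3, 20.0)]  # (session_id, amount)

# Aggregate on the parent: MEAN(transactions.amount) for every customer.
amounts_by_customer = {}
for session_id, amount in transactions:
    customer_id = sessions[session_id]
    amounts_by_customer.setdefault(customer_id, []).append(amount)
customer_mean = {c: mean(amounts) for c, amounts in amounts_by_customer.items()}

# Copy the parent's value down to each child row:
# customers.MEAN(transactions.amount) for every session.
session_feature = {
    session_id: customer_mean[customer_id]
    for session_id, customer_id in sessions.items()
}
print(session_feature)  # {1: 20.0, 2: 40.0, 3: 20.0}
```

Sessions 1 and 3 share a customer in this toy data, so they receive the same value, mirroring sessions 3 and 5 in the output above.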
