Tuning Deep Feature Synthesis
=============================

There are several parameters that can be tuned to change the output of DFS.

In [1]:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
es

Entityset: transactions
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 3]
    sessions [Rows: 35, Columns: 5]
    customers [Rows: 5, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

Using "Seed Features"
*********************

Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis then automatically stacks new features on top of these seed features where it can.

By using seed features, we can include domain-specific knowledge in automated feature engineering.

In [4]:
# Seed feature: a boolean flag for transactions with an amount above 125
expensive_purchase = ft.Feature(es["transactions"].ww["amount"]) > 125

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["percent_true"],
                                      seed_features=[expensive_purchase])
feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]

customer_id,PERCENT_TRUE(transactions.amount > 125)
5,0.227848
4,0.220183
1,0.119048
3,0.182796
2,0.129032


We can now see that ``PERCENT_TRUE`` was automatically applied to this boolean variable.
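
To make the stacked feature concrete, here is a minimal pandas sketch of the quantity that ``PERCENT_TRUE`` reports for a boolean seed feature. It is not how Featuretools computes it internally. The toy data is hypothetical, and it puts ``customer_id`` directly on the transactions table as a simplification (in the mock dataset, transactions reach customers through sessions).

```python
import pandas as pd

# Hypothetical toy data, not the mock customer dataset.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "amount": [130.0, 50.0, 200.0, 20.0, 126.0, 10.0, 5.0, 3.0],
})

# The seed feature: a boolean column defined by hand.
transactions["expensive_purchase"] = transactions["amount"] > 125

# The fraction of True values per customer is the mean of the boolean column.
percent_true = transactions.groupby("customer_id")["expensive_purchase"].mean()
print(percent_true)
```

Customer 1 has two of four transactions above 125 and customer 2 has one of four, so the sketch yields 0.5 and 0.25.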

Add "interesting" values to variables
*************************************

Sometimes we want to create features that are calculated only over the rows that satisfy a condition on another column. We call this extra filter a "where clause".

By default, where clauses are built using the ``interesting_values`` of a variable.
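
As a rough illustration of what a where clause means, the pandas sketch below builds one filtered count per interesting value. The toy ``sessions`` table and the ``where_counts`` name are assumptions made for this example; it only imitates the naming of the generated features and is not the Featuretools implementation.

```python
import pandas as pd

# Hypothetical toy sessions table.
sessions = pd.DataFrame({
    "session_id": [1, 2, 3, 4, 5],
    "customer_id": [1, 1, 2, 2, 2],
    "device": ["tablet", "desktop", "tablet", "tablet", "mobile"],
})
interesting_values = ["desktop", "mobile", "tablet"]

customers = sorted(sessions["customer_id"].unique())
where_counts = pd.DataFrame(index=customers)

for value in interesting_values:
    # The "where clause": keep only rows matching one interesting value.
    filtered = sessions[sessions["device"] == value]
    counts = filtered.groupby("customer_id")["session_id"].count()
    # Customers with no matching rows get a count of 0.
    where_counts[f"COUNT(sessions WHERE device = {value})"] = (
        counts.reindex(customers, fill_value=0)
    )

print(where_counts)
```

Each interesting value produces its own column, which is why adding many interesting values can grow the feature matrix quickly.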


.. Interesting values can be automatically added to all variables by calling ``EntitySet.add_interesting_values``. We can also specify interesting values manually.

.. Currently, interesting values are only considered for variables of type :class:`.variable_types.Categorical`, :class:`.variable_types.Ordinal`, and :class:`.variable_types.Boolean`.

In [11]:
values_dict = {'device': ["desktop", "mobile", "tablet"]}
es.add_interesting_values(dataframe_name='sessions', values=values_dict)

es['sessions'].ww.columns['device'].metadata

{'interesting_values': ['desktop', 'mobile', 'tablet']}

In [12]:
es['customers']

Unnamed: 0,customer_id,zip_code,join_date,date_of_birth,_ft_last_time
5,5,60091,2010-07-17 05:27:50,1984-07-28,2014-01-01 08:09:40
4,4,60091,2011-04-08 20:08:14,2006-08-15,2014-01-01 05:31:30
1,1,60091,2011-04-17 10:48:33,1994-07-18,2014-01-01 07:26:20
3,3,13244,2011-08-13 15:42:34,2003-11-21,2014-01-01 09:00:35
2,2,13244,2012-04-15 23:31:04,1986-08-18,2014-01-01 08:23:45


We then use ``where_primitives`` to specify which aggregation primitives should be combined with where clauses.

In [13]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["count", "avg_time_between"],
                                      where_primitives=["count", "avg_time_between"],
                                      trans_primitives=[])
feature_matrix

customer_id,zip_code,AVG_TIME_BETWEEN(sessions.session_start),COUNT(sessions),AVG_TIME_BETWEEN(transactions.transaction_time),COUNT(transactions),AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet),AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop),AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile),COUNT(sessions WHERE device = tablet),COUNT(sessions WHERE device = desktop),...,AVG_TIME_BETWEEN(transactions.sessions.session_start),AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop),AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet),AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile),AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop),AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet),AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile),COUNT(transactions WHERE sessions.device = desktop),COUNT(transactions WHERE sessions.device = tablet),COUNT(transactions WHERE sessions.device = mobile)
5,60091,5577.0,6,363.333333,79,,9685.0,13942.5,1,2,...,357.5,345.892857,0.0,796.714286,376.071429,65.0,809.714286,29,14,36
4,60091,2516.428571,8,168.518519,109,,4127.5,3336.666667,1,3,...,163.101852,223.108108,0.0,192.5,238.918919,65.0,206.25,38,18,53
1,60091,3305.714286,8,192.92,126,8807.5,7150.0,11570.0,3,2,...,185.12,275.0,419.404762,420.727273,302.5,442.619048,438.454545,27,43,56
3,13244,5096.0,6,287.554348,93,,4745.0,,1,4,...,276.956522,233.360656,0.0,0.0,251.47541,65.0,65.0,62,15,16
2,13244,4907.5,7,328.532609,93,5330.0,6890.0,1690.0,2,3,...,320.054348,417.575758,197.407407,56.333333,435.30303,226.296296,82.333333,34,28,31


We now have several new, potentially useful features. For example, the two features below tell us *how many sessions a customer completed on a tablet* and *the average time between those sessions*.

In [14]:
feature_matrix[["COUNT(sessions WHERE device = tablet)", "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)"]]

customer_id,COUNT(sessions WHERE device = tablet),AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
5,1,
4,1,
1,3,8807.5
3,1,
2,2,5330.0


We can see that customers who had only 0 or 1 sessions on a tablet have ``NaN`` values for the average time between such sessions, because at least two sessions are needed to measure a gap.
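
The ``NaN`` pattern can be reproduced with a small pandas sketch. It assumes that the average time between events is the mean gap, in seconds, between consecutive timestamps; the helper ``avg_seconds_between`` and the toy data are hypothetical and only approximate what the ``avg_time_between`` primitive reports.

```python
import pandas as pd

# Hypothetical toy sessions: customer 1 has three tablet sessions,
# customer 2 has one, and customer 3 has none.
sessions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 3],
    "device": ["tablet", "tablet", "tablet", "tablet", "desktop"],
    "session_start": pd.to_datetime([
        "2014-01-01 00:00:00",
        "2014-01-01 01:00:00",
        "2014-01-01 03:00:00",
        "2014-01-01 02:00:00",
        "2014-01-01 04:00:00",
    ]),
})


def avg_seconds_between(times):
    """Mean gap in seconds between consecutive timestamps (NaN if fewer than two)."""
    gaps = times.sort_values().diff().dt.total_seconds()
    return gaps.mean()


tablet = sessions[sessions["device"] == "tablet"]
avg_gap = (
    tablet.groupby("customer_id")["session_start"]
    .apply(avg_seconds_between)
    .reindex([1, 2, 3])  # customers with no tablet sessions become NaN
)
print(avg_gap)
```

Customer 1 has gaps of 3600 and 7200 seconds, giving 5400.0, while customers 2 and 3 have no gap to average and come out as ``NaN``.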


Encoding categorical features
*****************************

Machine learning algorithms typically expect purely numeric input. When Deep Feature Synthesis generates categorical features, we need to encode them before modeling.

In [16]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["mode"],
                                      max_depth=1)

feature_matrix

customer_id,zip_code,MODE(sessions.device),DAY(_ft_last_time),DAY(date_of_birth),DAY(join_date),MONTH(_ft_last_time),MONTH(date_of_birth),MONTH(join_date),WEEKDAY(_ft_last_time),WEEKDAY(date_of_birth),WEEKDAY(join_date),YEAR(_ft_last_time),YEAR(date_of_birth),YEAR(join_date)
5,60091,mobile,1,28,17,1,7,7,2,5,5,2014,1984,2010
4,60091,mobile,1,15,8,1,8,4,2,1,4,2014,2006,2011
1,60091,mobile,1,18,17,1,7,4,2,0,6,2014,1994,2011
3,13244,desktop,1,21,13,1,11,8,2,4,5,2014,2003,2011
2,13244,desktop,1,18,15,1,8,4,2,0,6,2014,1986,2012


This feature matrix contains 2 categorical variables, ``zip_code`` and ``MODE(sessions.device)``. We can use the feature matrix and feature definitions to encode these categorical values. Featuretools offers functionality to apply one-hot encoding to the output of DFS.

In [17]:
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
feature_matrix_enc

customer_id,zip_code = 60091,zip_code = 13244,zip_code is unknown,MODE(sessions.device) = mobile,MODE(sessions.device) = desktop,MODE(sessions.device) is unknown,DAY(_ft_last_time) = 1,DAY(_ft_last_time) is unknown,DAY(date_of_birth) = 18,DAY(date_of_birth) = 28,...,YEAR(date_of_birth) = 2006,YEAR(date_of_birth) = 2003,YEAR(date_of_birth) = 1994,YEAR(date_of_birth) = 1986,YEAR(date_of_birth) = 1984,YEAR(date_of_birth) is unknown,YEAR(join_date) = 2011,YEAR(join_date) = 2012,YEAR(join_date) = 2010,YEAR(join_date) is unknown
5,True,False,False,True,False,False,True,False,False,True,...,False,False,False,False,True,False,False,False,True,False
4,True,False,False,True,False,False,True,False,False,False,...,True,False,False,False,False,False,True,False,False,False
1,True,False,False,True,False,False,True,False,True,False,...,False,False,True,False,False,False,True,False,False,False
3,False,True,False,False,True,False,True,False,False,False,...,False,True,False,False,False,False,True,False,False,False
2,False,True,False,False,True,False,True,False,True,False,...,False,False,False,True,False,False,False,True,False,False


The returned feature matrix is now encoded as Boolean indicator columns, which machine learning algorithms can treat as numeric. Additionally, we get a new set of feature definitions that contain the encoded values.

In [18]:
features_enc

[<Feature: zip_code = 60091>,
 <Feature: zip_code = 13244>,
 <Feature: zip_code is unknown>,
 <Feature: MODE(sessions.device) = mobile>,
 <Feature: MODE(sessions.device) = desktop>,
 <Feature: MODE(sessions.device) is unknown>,
 <Feature: DAY(_ft_last_time) = 1>,
 <Feature: DAY(_ft_last_time) is unknown>,
 <Feature: DAY(date_of_birth) = 18>,
 <Feature: DAY(date_of_birth) = 28>,
 <Feature: DAY(date_of_birth) = 21>,
 <Feature: DAY(date_of_birth) = 15>,
 <Feature: DAY(date_of_birth) is unknown>,
 <Feature: DAY(join_date) = 17>,
 <Feature: DAY(join_date) = 15>,
 <Feature: DAY(join_date) = 13>,
 <Feature: DAY(join_date) = 8>,
 <Feature: DAY(join_date) is unknown>,
 <Feature: MONTH(_ft_last_time) = 1>,
 <Feature: MONTH(_ft_last_time) is unknown>,
 <Feature: MONTH(date_of_birth) = 8>,
 <Feature: MONTH(date_of_birth) = 7>,
 <Feature: MONTH(date_of_birth) = 11>,
 <Feature: MONTH(date_of_birth) is unknown>,
 <Feature: MONTH(join_date) = 4>,
 <Feature: MONTH(join_date) = 8>,
 <Feature: MONTH(join

These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read :doc:`/guides/deployment`.
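
The idea of reusing a fixed encoding on new data can be sketched in plain pandas. This is not the Featuretools implementation: the helper ``encode_column`` and the toy data are hypothetical, and the column names only imitate the ``= value`` and ``is unknown`` naming shown above.

```python
import pandas as pd


def encode_column(frame, column, known_values):
    """One-hot encode `column` against a fixed list of known values."""
    encoded = pd.DataFrame(index=frame.index)
    for value in known_values:
        encoded[f"{column} = {value}"] = frame[column] == value
    # Anything outside the known values falls into the "is unknown" column.
    encoded[f"{column} is unknown"] = ~frame[column].isin(known_values)
    return encoded


# Values observed when the encoding was first defined (hypothetical data).
train = pd.DataFrame({"zip_code": ["60091", "60091", "13244"]})
known_values = list(train["zip_code"].unique())

# New data contains a zip code that was never seen before.
new_data = pd.DataFrame({"zip_code": ["13244", "02139"]})
new_encoded = encode_column(new_data, "zip_code", known_values)
print(new_encoded)
```

Because the list of known values is fixed up front, the new data gets exactly the same columns as the original encoding, and the unseen zip code is flagged only in the ``is unknown`` column.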