# Introduction: Automated Feature Engineering with Featuretools

Automated feature engineering lets us create hundreds or thousands of relevant features from a relational dataset in a few lines of code, and the same code can be reused across problems. At the time of writing, Featuretools, an open-source Python library, is the only option for automated feature engineering across many related tables.

In this notebook, we'll work with Featuretools to develop an automated feature engineering workflow for a single partition of the customer churn data. After developing a method that works for one partition, we can take this idea and apply it to many partitions in parallel.

In [4]:
import pandas as pd 
import numpy as np

import featuretools as ft

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

N_PARTITIONS = 1000

In [5]:
PARTITION = '500'
directory = '/data/churn/partitions/p' + PARTITION

import os
os.listdir(directory)

['logs.csv',
 'members.csv',
 'train.csv',
 'cancel_cutoff_times.csv',
 'test.csv',
 'transactions.csv',
 'cutoff_times.csv']

In [6]:
members = pd.read_csv(f'{directory}/members.csv', 
                      parse_dates=['registration_init_time'], 
                      infer_datetime_format = True, 
                      dtype = {'gender': 'category'})

trans = pd.read_csv(f'{directory}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], 
                    infer_datetime_format = True)

logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])
cutoff_times = pd.read_csv(f'{directory}/cutoff_times.csv', parse_dates = ['cutoff'])
cutoff_times = cutoff_times.drop_duplicates()

In [7]:
cutoff_times.shape

(21856, 3)

# Define Entities

The entityset structure for this problem is fairly simple because there are only three entities. `members` is the parent, and `logs` and `transactions` are both its children. In both relationships, the parent variable and the child variable are `msno`, the customer id.
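Before registering the tables, it can help to confirm that the key really behaves like a parent/child key. The following is a minimal pandas sketch on made-up toy tables (the values and the helper name `check_parent_child` are illustrative assumptions, not part of Featuretools): it checks that `msno` is unique in the parent and lists any child keys missing from the parent.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration)
members = pd.DataFrame({"msno": ["a", "b", "c"], "city": [5, 1, 22]})
transactions = pd.DataFrame({"msno": ["a", "a", "b"],
                             "actual_amount_paid": [149, 149, 99]})
logs = pd.DataFrame({"msno": ["a", "c", "c", "z"],
                     "total_secs": [120.0, 30.5, 400.0, 10.0]})


def check_parent_child(parent, child, key="msno"):
    """Hypothetical helper: return (parent key is unique, child keys absent from parent)."""
    parent_unique = bool(parent[key].is_unique)
    orphans = sorted(set(child[key]) - set(parent[key]))
    return parent_unique, orphans


print(check_parent_child(members, transactions))
print(check_parent_child(members, logs))
```

In this toy data the `logs` table contains one key, `z`, that has no matching member, which is the kind of row worth inspecting before building relationships.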

In [8]:
import featuretools.variable_types as vtypes

es = ft.EntitySet(id = 'customers')

#### Members

In [9]:
members.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time
0,IPcy704aIqoa4MY5NBAKhVw1qZCWvQcYICBVMufSbcg=,5,0,male,3,2014-11-02
1,N7VphdA9MRD/ojyO/jSWydNrQqfZMe2d1eDl5kwB+vg=,5,17,female,4,2016-12-26
2,wnOtVWT2Hi28usrU9Yb0JCdl/TGO48HUfJlgehG0kDw=,1,0,,4,2017-01-20
3,DEIygRcw0Soz4FguDgJQnSrlHoTYHmlvTcoOLB9dF2Y=,1,0,,4,2017-01-21
4,q4k48ZA18embL69OlVhGpT/8sB5nhETBpH5B6Ud+JXI=,1,0,,4,2016-08-15


In [10]:
members['msno'].is_unique

True

In [11]:
es.entity_from_dataframe(entity_id='members', dataframe=members,
                         index = 'msno', time_index = 'registration_init_time', 
                         variable_types = {'city': vtypes.Categorical, 'bd': vtypes.Categorical,
                                           'registered_via': vtypes.Categorical})

Entityset: customers
  Entities:
    members [Rows: 6684, Columns: 6]
  Relationships:
    No relationships

#### Transactions

In [12]:
trans.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,/4OzeklvQKOIr804cYEbcsy4xbpWHQFF40oeMTMuTak=,41,30,129,129,1,2016-02-29,2016-03-31,0
1,1lhQM//dvJCyWLTaCw7x+aDrCFNhNk/8QzlMwiRgB4Y=,41,30,149,149,1,2015-12-31,2016-01-31,0
2,bSgrbAUbyZDpkoQgVxeH4dQ7v8yEoucUK0lB0x6F2R0=,21,30,149,149,1,2015-12-02,2016-01-08,0
3,c3HjpBgEcGfa+mkJVtC47gE2CaW+KTBUxijgvrnBUuY=,28,30,150,150,0,2016-07-02,2016-08-01,0
4,vWLvk74sFSINQPmCbcIMqAh1MDdzxroTKIjaxKWEQHA=,41,30,99,99,1,2016-09-13,2016-10-13,0


In [13]:
trans['price_difference'] = trans['plan_list_price'] - trans['actual_amount_paid']
trans['planned_daily_price'] = trans['plan_list_price'] / trans['payment_plan_days']
trans['daily_price'] = trans['actual_amount_paid'] / trans['payment_plan_days']

In [14]:
es.entity_from_dataframe(entity_id='transactions', dataframe=trans,
                         index = 'transactions_index', make_index = True,
                         time_index = 'transaction_date', 
                         variable_types = {'payment_method_id': vtypes.Categorical, 
                                           'is_auto_renew': vtypes.Boolean, 'is_cancel': vtypes.Boolean})

es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                         index = 'logs_index', make_index = True,
                         time_index = 'date')

Entityset: customers
  Entities:
    members [Rows: 6684, Columns: 6]
    transactions [Rows: 22859, Columns: 13]
  Relationships:
    No relationships

Entityset: customers
  Entities:
    members [Rows: 6684, Columns: 6]
    transactions [Rows: 22859, Columns: 13]
    logs [Rows: 401596, Columns: 10]
  Relationships:
    No relationships

#### Logs

In [15]:
logs.head()

Unnamed: 0,msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
0,1T6cC9wlTNDxYh+ikIsljHO3LJ62pdNxeo0uC6b9iUk=,2017-03-17,17,1,1,0,55,51,12650.427
1,Qbj5QJcK+N/z9h4fR82QYmABCS9g3EIbGijYxqOAw3M=,2017-03-01,4,2,2,1,38,46,10247.052
2,tk3KXVctKu4yERExEwFvMMOrpU88K083pDNRONhpMzY=,2017-03-30,0,0,0,0,18,18,4565.533
3,q9u6CM2lMNSyc0mHPnH9O/yWvMGqeTcMqBHRnS7s0MI=,2017-03-20,0,0,0,0,21,21,5523.67
4,a/vnjfU45TFglx+JFOPBWQHOaQdEY/lYUw8cxLurbwA=,2017-03-10,4,1,1,0,12,11,3670.509


In [16]:
logs['total'] = logs[['num_25', 'num_50', 'num_75', 'num_985', 'num_100']].sum(axis = 1)
logs['percent_100'] = logs['num_100'] / logs['total']
logs['percent_unique'] = logs['num_unq'] / logs['total']

In [17]:
es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                         index = 'logs_index', make_index = True,
                         time_index = 'date')

Entityset: customers
  Entities:
    members [Rows: 6684, Columns: 6]
    transactions [Rows: 22859, Columns: 13]
    logs [Rows: 401596, Columns: 13]
  Relationships:
    No relationships

### Interesting Values

In order to create conditional features, we can set interesting values for existing columns in the data. The following code will be used to build features conditional on the value of `is_cancel` and `is_auto_renew` in the transactions data. The primitives used for the conditional features are specified as `where_primitives` in the call to Deep Feature Synthesis.
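To make the idea of a conditional feature concrete, here is a rough pandas analogy on toy data (the values are made up, and this is not how Featuretools computes the feature internally). It compares an unconditional sum with the same sum restricted to rows where `is_cancel` equals 1.

```python
import pandas as pd

# Toy transactions table (illustrative values only)
trans = pd.DataFrame({
    "msno": ["a", "a", "a", "b", "b"],
    "actual_amount_paid": [149, 149, 0, 99, 99],
    "is_cancel": [0, 0, 1, 0, 1],
})

# Unconditional aggregation, analogous to SUM(transactions.actual_amount_paid)
total_paid = trans.groupby("msno")["actual_amount_paid"].sum()

# Conditional aggregation: the same sum, restricted to rows where is_cancel == 1
cancel_paid = (
    trans[trans["is_cancel"] == 1]
    .groupby("msno")["actual_amount_paid"]
    .sum()
    .reindex(total_paid.index, fill_value=0)  # keep members with no matching rows
)

print(total_paid)
print(cancel_paid)
```

Each interesting value yields its own conditional column, so two interesting values on two variables can multiply the number of generated features quickly.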

In [18]:
es['transactions']['is_cancel'].interesting_values = [0, 1]
es['transactions']['is_auto_renew'].interesting_values = [0, 1]

## Relationships

There are two relationships: one linking `members` to `transactions` and one linking `members` to `logs`. The order for relationships is parent variable, child variable.

In [19]:
r_member_transactions = ft.Relationship(es['members']['msno'], es['transactions']['msno'])
r_member_logs = ft.Relationship(es['members']['msno'], es['logs']['msno'])

es.add_relationships([r_member_transactions, r_member_logs])

Entityset: customers
  Entities:
    members [Rows: 6684, Columns: 6]
    transactions [Rows: 22859, Columns: 13]
    logs [Rows: 401596, Columns: 13]
  Relationships:
    transactions.msno -> members.msno
    logs.msno -> members.msno

# Deep Feature Synthesis

With the entities and relationships fully defined, we are ready to run Deep Feature Synthesis (DFS). To start, we'll use the default aggregation and transformation primitives as well as two `where_primitives` and see how many features this generates.
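A depth of 2 means that primitives can be stacked, for example a transform applied to the result of an aggregation. The sketch below reproduces one such feature by hand in pandas on toy data (the values are made up; the feature name `MONTH(LAST(transactions.transaction_date))` does appear in the feature matrix later in this notebook).

```python
import pandas as pd

# Toy transactions table (illustrative values only)
trans = pd.DataFrame({
    "msno": ["a", "a", "b"],
    "transaction_date": pd.to_datetime(["2016-01-31", "2016-03-02", "2016-07-02"]),
})

# Depth 1: aggregation LAST over each member's transactions, ordered by time
last_date = (
    trans.sort_values("transaction_date")
    .groupby("msno")["transaction_date"]
    .last()
)

# Depth 2: transform MONTH applied to the depth-1 result
month_of_last = last_date.dt.month

print(month_of_last)
```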

In [20]:
feature_defs = ft.dfs(entityset=es, target_entity='members', 
                      cutoff_time = cutoff_times,
                      where_primitives = ['sum', 'mean'],
                      max_depth=2, features_only=True)

In [21]:
print(f'This will generate {len(feature_defs)} features.')

This will generate 182 features.


## Specify Primitives 

Now we'll do a call to `ft.dfs` specifying the primitives to use. Often, these will depend on the problem and can involve domain knowledge. We can also build our own custom primitives to use on the dataset.
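As a rough mental model of the two primitive families, an aggregation collapses many child rows into one value per parent, while a transform produces one value per input row. The pandas sketch below illustrates the difference on toy data; it is an analogy under that assumption, not the Featuretools implementation, and the grouping used for the cumulative sum is a choice made for this example.

```python
import pandas as pd

# Toy transactions table (illustrative values only)
trans = pd.DataFrame({
    "msno": ["a", "a", "b"],
    "transaction_date": pd.to_datetime(["2016-01-01", "2016-01-31", "2016-02-15"]),
    "actual_amount_paid": [100, 150, 99],
})

# Aggregation: many child rows collapse to one value per member
sum_paid = trans.groupby("msno")["actual_amount_paid"].sum()

# Transforms: one output value per input row
trans["month"] = trans["transaction_date"].dt.month
trans["cum_sum_paid"] = trans.groupby("msno")["actual_amount_paid"].cumsum()

print(sum_paid)
print(trans)
```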

### Aggregation Primitives

In [22]:
all_p = ft.list_primitives()
trans_p = all_p.loc[all_p['type'] == 'transform'].copy()
agg_p = all_p.loc[all_p['type'] == 'aggregation'].copy()

pd.options.display.max_rows = 50
agg_p

Unnamed: 0,name,type,description
0,time_since_last,aggregation,Time since last related instance.
1,avg_time_between,aggregation,Computes the average time between consecutive ...
2,count,aggregation,Counts the number of non null values.
3,all,aggregation,Test if all values are 'True'.
4,num_true,aggregation,Finds the number of 'True' values in a boolean.
5,n_most_common,aggregation,Finds the N most common elements in a categori...
6,num_unique,aggregation,Returns the number of unique categorical varia...
7,any,aggregation,Test if any value is 'True'.
8,percent_true,aggregation,Finds the percent of 'True' values in a boolea...
9,max,aggregation,Finds the maximum non-null value of a numeric ...


In [23]:
agg_primitives = ['sum', 'time_since_last', 'avg_time_between', 'all', 'mode', 'num_unique', 'min', 'last', 
                  'mean', 'percent_true', 'max', 'std', 'count']

### Transform Primitives

In [24]:
trans_p

Unnamed: 0,name,type,description
19,cum_mean,transform,Calculates the mean of previous values of an i...
20,divide,transform,Creates a transform feature that divides two f...
21,not,transform,"For each value of the base feature, negates th..."
22,week,transform,Transform a Datetime feature into the week.
23,days_since,transform,"For each value of the base feature, compute th..."
24,hours,transform,Transform a Timedelta feature into the number ...
25,minute,transform,Transform a Datetime feature into the minute.
26,isin,transform,"For each value of the base feature, checks whe..."
27,or,transform,"For two boolean values, determine if one value..."
28,subtract,transform,Creates a transform feature that subtracts two...


In [25]:
trans_primitives = ['weekend', 'cum_sum', 'day', 'month', 'diff', 'time_since_previous']

### Where Primitives


In [26]:
where_primitives = ['sum', 'count', 'mean', 'percent_true', 'all', 'any']

## Deep Feature Synthesis with Specified Primitives

In [27]:
feature_defs = ft.dfs(entityset=es, target_entity='members', 
                      cutoff_time = cutoff_times, 
                      agg_primitives = agg_primitives,
                      trans_primitives = trans_primitives,
                      where_primitives = where_primitives,
                      max_depth = 2, features_only = True)

In [28]:
print(f'This will generate {len(feature_defs)} features.')

This will generate 230 features.


# Run Deep Feature Synthesis

Once we're happy with the features that will be generated, we can run Deep Feature Synthesis to compute the actual feature values. We need to change `features_only` to `False`, and then we're good to go.
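Because `cutoff_time` is passed to `ft.dfs`, each row of the feature matrix is built only from data available at that row's cutoff. The sketch below imitates this for a single feature in pandas on toy data. It assumes rows dated exactly at the cutoff are included, which may differ between Featuretools versions, and the helper `sum_paid_before` is hypothetical.

```python
import pandas as pd

# Toy transactions and cutoff times (illustrative values only)
trans = pd.DataFrame({
    "msno": ["a", "a", "a", "b"],
    "transaction_date": pd.to_datetime(
        ["2016-01-01", "2016-02-01", "2016-03-01", "2016-01-15"]),
    "actual_amount_paid": [100, 100, 100, 99],
})
cutoff_times = pd.DataFrame({
    "msno": ["a", "a", "b"],
    "cutoff": pd.to_datetime(["2016-02-01", "2016-03-01", "2016-02-01"]),
})


def sum_paid_before(row):
    """Hypothetical helper: sum payments for one member up to (and including) the cutoff."""
    mask = (trans["msno"] == row["msno"]) & (trans["transaction_date"] <= row["cutoff"])
    return trans.loc[mask, "actual_amount_paid"].sum()


cutoff_times["sum_paid"] = cutoff_times.apply(sum_paid_before, axis=1)
print(cutoff_times)
```

The same member can appear with several cutoff times, and each one gets its own row with values computed from a different window of history.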

In [29]:
from timeit import default_timer as timer

start = timer()
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='members', 
                                      cutoff_time = cutoff_times, 
                                      agg_primitives = agg_primitives,
                                      trans_primitives = trans_primitives,
                                      where_primitives = where_primitives,
                                      max_depth = 2, features_only = False,
                                      verbose = 1, chunk_size = 100)
end = timer()
print(f'{round(end - start)} seconds elapsed.')

Built 230 features
Elapsed: 07:20 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 219/219 chunks
442 seconds elapsed.


In [30]:
feature_matrix.head()

msno,city,bd,registered_via,gender,SUM(transactions.payment_plan_days),SUM(transactions.plan_list_price),SUM(transactions.actual_amount_paid),SUM(transactions.price_difference),SUM(transactions.planned_daily_price),SUM(transactions.daily_price),...,WEEKEND(LAST(transactions.transaction_date)),WEEKEND(LAST(transactions.membership_expire_date)),WEEKEND(LAST(logs.date)),DAY(LAST(transactions.transaction_date)),DAY(LAST(transactions.membership_expire_date)),DAY(LAST(logs.date)),MONTH(LAST(transactions.transaction_date)),MONTH(LAST(transactions.membership_expire_date)),MONTH(LAST(logs.date)),label
+V6OulljdDvq43dsTyzLK6+x7YwZMjfkXwHAWYw0Kds=,5.0,0.0,3.0,,31,149,0,149,4.806452,0.0,...,0.0,1.0,0.0,1.0,31.0,1.0,1.0,1.0,1.0,0.0
10rVobB7ZAi69b7fWLGJMSRUX0b+QZ8kDQRc7J8VEV8=,4.0,23.0,9.0,male,31,149,149,0,4.806452,4.806452,...,0.0,1.0,0.0,1.0,1.0,,1.0,2.0,,0.0
1uW7L6j+0Hsgn2Khzia/hQQcFyv+ncIcRdJlTy+XmSY=,22.0,0.0,3.0,male,31,149,149,0,4.806452,4.806452,...,0.0,1.0,0.0,1.0,1.0,1.0,1.0,2.0,1.0,0.0
5Rg2ghHz158LML0RK8cW+EzKvZAYWhzaPxsikIPlAGY=,5.0,0.0,3.0,,31,149,149,0,4.806452,4.806452,...,0.0,1.0,0.0,1.0,1.0,,1.0,2.0,,
9TRUYe9vqSq9AZFrtJGF/+RJL57bySU6+Jyyt0o+LVU=,22.0,0.0,9.0,,31,149,149,0,4.806452,4.806452,...,0.0,1.0,0.0,1.0,1.0,,1.0,2.0,,0.0


In [31]:
cutoff_times['label'].value_counts()

0.0    18861
1.0      603
Name: label, dtype: int64

In [32]:
from sklearn.ensemble import RandomForestClassifier

feature_matrix = feature_matrix[feature_matrix['label'].notnull()].copy()
labels = np.array(feature_matrix.pop('label'))

In [33]:
from sklearn.model_selection import train_test_split

feature_matrix = pd.get_dummies(feature_matrix).replace({np.inf: np.nan, -np.inf:np.nan}).fillna(0)

X_train, X_test, y_train, y_test = train_test_split(feature_matrix, labels, stratify = labels)

In [35]:
random_forest = RandomForestClassifier(n_estimators = 1000, max_depth = 10)
random_forest.fit(X_train, y_train)
random_forest.score(X_test, y_test)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

0.9751335799424579

In [36]:
np.mean(y_test == 0)

0.9689683518290176

In [37]:
ft.save_features(feature_defs, '/data/churn/features.txt')

# Partition to Feature Matrix

Now we'll write a function that takes in a partition number, computes the feature matrix for that partition, and saves it as a CSV file in the partition's directory.

In [38]:
feature_defs = ft.load_features('/data/churn/features.txt')
print(f'There are {len(feature_defs)} features.')

There are 230 features.


In [44]:
def partition_to_feature_matrix(partition, feature_defs):
    """Take in a partition number, then compute and save its feature matrix"""
    # Build the path from the `partition` argument, not the global PARTITION
    directory = '/data/churn/partitions/p' + str(partition)
    
    # Read in the data files
    members = pd.read_csv(f'{directory}/members.csv', 
                      parse_dates=['registration_init_time'], 
                      infer_datetime_format = True, 
                      dtype = {'gender': 'category'})

    trans = pd.read_csv(f'{directory}/transactions.csv',
                       parse_dates=['transaction_date', 'membership_expire_date'], 
                        infer_datetime_format = True)

    logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])
    cutoff_times = pd.read_csv(f'{directory}/cutoff_times.csv', parse_dates = ['cutoff'])
    cutoff_times = cutoff_times.drop_duplicates()
    
    # Create empty entityset
    es = ft.EntitySet(id = 'customers')

    # Add the members parent table
    es.entity_from_dataframe(entity_id='members', dataframe=members,
                             index = 'msno', time_index = 'registration_init_time', 
                             variable_types = {'city': vtypes.Categorical, 'bd': vtypes.Categorical,
                                               'registered_via': vtypes.Categorical})
    # Create new features in transactions
    trans['price_difference'] = trans['plan_list_price'] - trans['actual_amount_paid']
    trans['planned_daily_price'] = trans['plan_list_price'] / trans['payment_plan_days']
    trans['daily_price'] = trans['actual_amount_paid'] / trans['payment_plan_days']

    # Add the transactions child table
    es.entity_from_dataframe(entity_id='transactions', dataframe=trans,
                             index = 'transactions_index', make_index = True,
                             time_index = 'transaction_date', 
                             variable_types = {'payment_method_id': vtypes.Categorical, 
                                               'is_auto_renew': vtypes.Boolean, 'is_cancel': vtypes.Boolean})

    # Add transactions interesting values
    es['transactions']['is_cancel'].interesting_values = [0, 1]
    es['transactions']['is_auto_renew'].interesting_values = [0, 1]
    
    # Create new features in logs
    logs['total'] = logs[['num_25', 'num_50', 'num_75', 'num_985', 'num_100']].sum(axis = 1)
    logs['percent_100'] = logs['num_100'] / logs['total']
    logs['percent_unique'] = logs['num_unq'] / logs['total']
    
    # Add the logs child table
    es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                         index = 'logs_index', make_index = True,
                         time_index = 'date')

    # Add the relationships
    r_member_transactions = ft.Relationship(es['members']['msno'], es['transactions']['msno'])
    r_member_logs = ft.Relationship(es['members']['msno'], es['logs']['msno'])
    es.add_relationships([r_member_transactions, r_member_logs])

    # return cutoff_times
    
    # Calculate and save the feature matrix
    feature_matrix = ft.calculate_feature_matrix(entityset=es, features=feature_defs, cutoff_time=cutoff_times)
    
    feature_matrix.to_csv(f'{directory}/feature_matrix.csv')
    
    # Report progress every 10th of number of partitions
    if partition % (N_PARTITIONS // 10) == 0:
        # Scale to a percentage before rounding; rounding first always printed 0%
        print(f'{round(100 * partition / N_PARTITIONS)}% complete.')
        
    

In [46]:
for i in range(0, 1000, 50):
    partition_to_feature_matrix(i, feature_defs)

0% complete.
10% complete.
20% complete.
30% complete.


KeyboardInterrupt: 

In [45]:
partition_to_feature_matrix(100, feature_defs)

10% complete.
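
The introduction mentions applying the per-partition method to many partitions in parallel. One possible way to sketch that with the standard library is shown below. The worker `process_partition` is a hypothetical stand-in that only returns the directory it would process, and a thread pool is used so the example runs anywhere; for the real CPU-bound work, a process pool or a distributed framework would likely be a better fit.

```python
from concurrent.futures import ThreadPoolExecutor

N_PARTITIONS = 1000


def process_partition(partition):
    """Hypothetical stand-in for partition_to_feature_matrix.

    Returns the directory it would read from instead of doing real work.
    """
    return f"/data/churn/partitions/p{partition}"


def run_all(partitions, max_workers=4):
    """Map the worker over the partitions; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_partition, partitions))


results = run_all(range(0, N_PARTITIONS, 250))
print(results)
```

Because each partition is read, processed, and written independently, no state needs to be shared between workers.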
