# Introduction: Automated Feature Engineering with Featuretools

Automated feature engineering allows us to create hundreds or thousands of relevant features from a relational dataset within a framework that can be reused across problems. This approach overcomes the limitations of traditional manual feature engineering, letting us develop better predictive models in a fraction of the time.

At the time of writing, the only option for automated feature engineering using multiple related tables is [Featuretools](https://github.com/Featuretools/featuretools), an open-source Python library.

In this notebook, we'll work with Featuretools to develop an automated feature engineering workflow for the customer churn dataset. After developing a function that works to build features from a single partition, we'll be able to apply this function to all of the partitions in parallel using Spark with PySpark.

## Featuretools Resources

We won't spend too much time on the basics of Featuretools here, so refer to the following sources for more information:

* [Featuretools Documentation](https://docs.featuretools.com/)
* [Featuretools GitHub](https://github.com/Featuretools/featuretools)
* [Introductory tutorial on Featuretools](https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219)
* [Why Automated Feature Engineering Will Change Machine Learning](https://towardsdatascience.com/why-automated-feature-engineering-will-change-the-way-you-do-machine-learning-5c15bf188b96)

The basics are relatively easy to pick up, and if you're new, you can probably follow along with all the code here! 
With that in mind, let's get started.

In [1]:
# Data science helpers
import pandas as pd 
import numpy as np

import featuretools as ft

# Useful for showing multiple outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

N_PARTITIONS = 1000

All of the data is stored on S3. To access it, first configure your AWS credentials from the command line using `aws configure`.

In [2]:
PARTITION = '50'
BASE_DIR = 's3://customer-churn-spark/partitions/'
PARTITION_DIR = BASE_DIR + 'p' + PARTITION

In [7]:
# Read in all data
members = pd.read_csv(f'{PARTITION_DIR}/members.csv', 
                      parse_dates=['registration_init_time'], 
                      infer_datetime_format = True, 
                      dtype = {'gender': 'category'})

trans = pd.read_csv(f'{PARTITION_DIR}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], 
                    infer_datetime_format = True)

logs = pd.read_csv(f'{PARTITION_DIR}/logs.csv', parse_dates = ['date'])

cutoff_times = pd.read_csv(f'{PARTITION_DIR}/cutoff_times.csv', parse_dates = ['cutoff'])

# Define Entities and EntitySet

The first step in using Featuretools is to make an `EntitySet` and add all the entities (tables) to it. An `EntitySet` is a data structure that holds the tables and the relationships between them. This makes it easier to keep track of all the data in a problem with multiple relational tables.

In [4]:
import featuretools.variable_types as vtypes

es = ft.EntitySet(id = 'customers')

## Entities

When creating entities from a dataframe, we need to make sure to include:

* The `index`, a unique identifier for each observation. If the table already has one, we pass its name.
* `make_index = True` if the table has no index. In that case we supply a name for the new index under `index`.
* A `time_index` if present. This is the time at which the information in the row becomes known.
* `variable_types`. In some cases our data will have discrete variables represented as integers, which should be specified explicitly.

For this problem, these are the only arguments we'll need.
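As a rough illustration of the first two bullets, the sketch below uses plain pandas to decide whether an existing column can serve as the `index` or whether a new one has to be created, which is what `make_index = True` asks Featuretools to do. The `toy` dataframe and the helper name `pick_index` are made up for this example and are not part of Featuretools.

```python
import pandas as pd

def pick_index(df, candidate, new_name):
    """Return (dataframe, index_name, made_index) for a table.

    Hypothetical helper for illustration only; not part of Featuretools.
    """
    if candidate in df.columns and df[candidate].is_unique:
        # The candidate column uniquely identifies each row
        return df, candidate, False
    # Otherwise add a fresh integer identifier, one value per row
    df = df.reset_index(drop=True)
    df[new_name] = range(len(df))
    return df, new_name, True

# 'msno' repeats, so it cannot identify individual rows of this table
toy = pd.DataFrame({'msno': ['a', 'a', 'b'], 'amount': [99, 149, 149]})
toy, index_name, made_index = pick_index(toy, 'msno', 'transactions_index')
print(index_name, made_index)
```

In the `members` table `msno` is unique, so it can be used directly. In tables with several rows per customer, a new index is needed.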

#### Members

The `members` table holds basic information about each customer. The important point for this table is to specify that the `city` and `registered_via` columns are discrete, categorical variables and not numerical. The `msno` is the unique index identifying each customer. 

In [5]:
members.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time
0,8hW4+CV3D1oNM0CIsA39YljsF8M3m7g1LAX6AQd3C8I=,4,24,male,3,2014-11-04
1,yhcODfebyTYezE6KAPklcV1us9zdOYJ+7eHS7f/xgoU=,8,37,male,9,2007-02-11
2,sBlgSL0AIq49XsmBQ2KceKZNUyIxT1BwSkN/xYQLGMc=,15,21,male,3,2013-02-08
3,Xy3Au8sZKlEeHBQ+C7ro8Ni3X/dxgrtmx0Tt+jqM1zY=,1,0,,9,2015-02-01
4,NiCu2GVWgT5QZbI85oYRBEDqHUZbzz2azS48jvM+khg=,12,21,male,3,2015-02-12


In [6]:
members['msno'].is_unique

True

In [8]:
# Create entity from members
es.entity_from_dataframe(entity_id='members', dataframe=members,
                         index = 'msno', time_index = 'registration_init_time', 
                         variable_types = {'city': vtypes.Categorical, 
                                           'registered_via': vtypes.Categorical})

Entityset: customers
  Entities:
    members [Rows: 6658, Columns: 6]
  Relationships:
    No relationships

#### Transactions

The transactions concern payments made by the customers. Each row records one payment. 

In [9]:
trans.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,5F7G3pHKf5ijGQpoKuko0G7Jm3Bde6ktfPKBZySWoDI=,41,30,99,99,1,2017-02-10,2017-03-10,0
1,DQMPoCSc6EB39ytgnKCRsUIZnR6ZWSrHeDmX7nbxAKs=,41,30,149,149,1,2016-02-01,2016-03-02,0
2,Lrais3nsgqYwpfpSoyK3fHuPutf6cloTI5T5dQfs4lA=,38,30,149,149,0,2016-02-23,2016-04-23,0
3,ZPOjgxQw1/J7v5xgBJTCLXWuwq5Xmk33nO6AoUO1+mY=,41,30,149,119,1,2015-09-06,2016-08-01,0
4,MvR23u4bIiWM+U+VE1Mvw3qqdj/0Ixs1sf7avavjhRs=,38,30,149,149,0,2016-10-28,2016-11-27,0


Before creating the entity, we can create a few new variables based on domain knowledge.

In [10]:
# Difference between listing price and price paid
trans['price_difference'] = trans['plan_list_price'] - trans['actual_amount_paid']

# Planned price per day
trans['planned_daily_price'] = trans['plan_list_price'] / trans['payment_plan_days']

# Actual price per day
trans['daily_price'] = trans['actual_amount_paid'] / trans['payment_plan_days']

There is no `index` in this dataframe so we have to specify to make an index and pass in a name. There is a `time_index`, the time of the transaction, which will be critical when filtering data based on cutoff times to make features. Again, we also need to specify several variable types.

There is one slight anomaly in the transactions: some membership expiration dates fall on or before the transaction date. We keep only the rows where the membership expires after the transaction date.

In [11]:
trans = trans[trans['membership_expire_date'] > trans['transaction_date']]
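Before dropping rows, it can help to count how many the filter removes. The sketch below applies the same comparison to a made-up `toy_trans` dataframe (the dates are invented for illustration), keeping only rows whose `membership_expire_date` is strictly later than `transaction_date`.

```python
import pandas as pd

# Toy data: the second row expires before it was purchased, the third on the same day
toy_trans = pd.DataFrame({
    'transaction_date': pd.to_datetime(['2017-02-10', '2016-02-23', '2015-09-06']),
    'membership_expire_date': pd.to_datetime(['2017-03-10', '2016-02-01', '2015-09-06']),
})

# Same condition as the filter applied to `trans`
valid = toy_trans['membership_expire_date'] > toy_trans['transaction_date']
n_dropped = int((~valid).sum())
filtered = toy_trans[valid]
print(f'Dropping {n_dropped} of {len(toy_trans)} rows.')
```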

In [12]:
es.entity_from_dataframe(entity_id='transactions', dataframe=trans,
                         index = 'transactions_index', make_index = True,
                         time_index = 'transaction_date', 
                         variable_types = {'payment_method_id': vtypes.Categorical, 
                                           'is_auto_renew': vtypes.Boolean, 'is_cancel': vtypes.Boolean})

Entityset: customers
  Entities:
    members [Rows: 6658, Columns: 6]
    transactions [Rows: 22329, Columns: 13]
  Relationships:
    No relationships

#### Logs

The `logs` contain user listening behavior. As before, we'll make a few domain knowledge columns before adding the table to the `EntitySet`. There is again a `time_index`, although no `index` is present.

In [13]:
logs.head()

Unnamed: 0,msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
0,6+/V1NwBbqjBOCvRSDueeJZ58F4DY7h7fG6fSZtHaAE=,2017-03-04,29,28,18,11,111,79,34727.142
1,E2aBGFTKR6jzp+1knh7JOOF39gLuu+CoZMWaAL/DA0M=,2017-03-27,1,0,2,0,184,173,33408.719
2,g7exJzakJlHXwzUydnShY5w24WXSwJyS6QqgoFeyr7g=,2017-03-15,0,0,0,0,21,21,4951.0
3,X+i9OmM3P42cETt5gPkOnz8vXGViQL5/M/NMiMQ+Olc=,2017-03-13,3,1,0,0,33,27,8755.599
4,tbl8blAVl6j4A8zW1Gnyg78Hc0LAQzzcYesmzgJ7ofs=,2017-03-27,6,5,0,0,2,6,1035.853


In [15]:
# Make a few features by hand
logs['total'] = logs[['num_25', 'num_50', 'num_75', 'num_985', 'num_100']].sum(axis = 1)
logs['percent_100'] = logs['num_100'] / logs['total']
logs['percent_unique'] = logs['num_unq'] / logs['total']
logs['seconds_per_song'] = logs['total_secs'] / logs['total'] 

In [16]:
es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                         index = 'logs_index', make_index = True,
                         time_index = 'date')

Entityset: customers
  Entities:
    members [Rows: 6658, Columns: 6]
    transactions [Rows: 22329, Columns: 13]
    logs [Rows: 424252, Columns: 14]
  Relationships:
    No relationships

Making features by hand may seem counterintuitive when we are using automated feature engineering, but doing so before running Featuretools has a benefit: Deep Feature Synthesis can stack primitives on top of these features to build deeper ones. Automated feature engineering therefore takes our hand-built features and extracts more value from them by combining them with other features.

### Interesting Values

In order to create conditional features, we can set interesting values for existing columns in the data. The following code will be used to build features conditional on the value of `is_cancel` and `is_auto_renew` in the transactions data. The primitives used for the conditional features are specified as `where_primitives` in the call to Deep Feature Synthesis.

In [17]:
es['transactions']['is_cancel'].interesting_values = [0, 1]
es['transactions']['is_auto_renew'].interesting_values = [0, 1]
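To make the idea of a conditional feature concrete, here is a pandas sketch of roughly what a feature such as `SUM(transactions.actual_amount_paid WHERE is_auto_renew = 0)` represents: an aggregation over only the child rows that match an interesting value. The `toy_trans` data is invented, and filling members with no matching rows with 0 is an assumption of this sketch, not a statement about Featuretools' output.

```python
import pandas as pd

toy_trans = pd.DataFrame({
    'msno': ['a', 'a', 'a', 'b', 'b'],
    'actual_amount_paid': [99, 149, 119, 149, 149],
    'is_auto_renew': [0, 1, 0, 1, 1],
})

# Keep only the rows matching the interesting value, then aggregate per member
mask = toy_trans['is_auto_renew'] == 0
conditional_sum = (toy_trans[mask]
                   .groupby('msno')['actual_amount_paid'].sum()
                   # Members with no matching rows get 0 (an assumption of this sketch)
                   .reindex(toy_trans['msno'].unique(), fill_value=0))
print(conditional_sum)
```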

# Relationships

The entityset structure for this problem is fairly simple as there are only three entities with two relationships.  `members` is the parent of `logs` and `transactions`. In both relationships, the parent and child variable is `msno`, the customer id.

The two relationships link `members` to `transactions` and `members` to `logs`. When defining a `ft.Relationship`, the parent variable comes first, followed by the child variable.

In [18]:
# Relationships (parent, child)
r_member_transactions = ft.Relationship(es['members']['msno'], es['transactions']['msno'])
r_member_logs = ft.Relationship(es['members']['msno'], es['logs']['msno'])

es.add_relationships([r_member_transactions, r_member_logs])

Entityset: customers
  Entities:
    members [Rows: 6658, Columns: 6]
    transactions [Rows: 22329, Columns: 13]
    logs [Rows: 424252, Columns: 14]
  Relationships:
    transactions.msno -> members.msno
    logs.msno -> members.msno
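A parent-child relationship assumes that the key is unique in the parent table and that child rows point at existing parents. The pandas sketch below checks both conditions on invented `toy_members` and `toy_logs` tables. It is a sanity check one might run before adding relationships, not something Featuretools is claimed to do in this form.

```python
import pandas as pd

toy_members = pd.DataFrame({'msno': ['a', 'b', 'c']})
toy_logs = pd.DataFrame({'msno': ['a', 'a', 'c', 'd']})

# The parent key must identify exactly one row
parent_key_unique = toy_members['msno'].is_unique

# Child rows whose key is missing from the parent have nothing to aggregate to
orphans = toy_logs.loc[~toy_logs['msno'].isin(toy_members['msno']), 'msno'].unique()

# Number of child rows per parent, including parents with no children
children_per_parent = (toy_logs['msno'].value_counts()
                       .reindex(toy_members['msno'], fill_value=0))
print(parent_key_unique, list(orphans))
print(children_per_parent)
```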

# Deep Feature Synthesis

With the entities and relationships fully defined, we are ready to run [Deep Feature Synthesis (DFS)](https://www.featurelabs.com/blog/deep-feature-synthesis/). This process applies feature engineering building blocks called [feature primitives](https://docs.featuretools.com/automated_feature_engineering/primitives.html) to the dataset to build hundreds of features. Feature primitives are basic operations in two categories - transforms and aggregations - that stack to build deep features. 

The call to `ft.dfs` needs the entityset which holds all the tables and relationships between them, the `target_entity` to make features for, the specific primitives, the maximum stacking of primitives (`max_depth`), the `cutoff_times`, and a number of optional parameters. The `cutoff_times` is a critical piece of any time based machine learning problem. This is the label dataframe that holds the member id, cutoff time, and label associated with each cutoff time. __For each cutoff time, only data from before the cutoff time can be used to build features for that label.__ This is one of the greatest advantages of Featuretools compared to manual feature engineering: __Featuretools automatically filters our data based on the cutoff times to ensure that all the features are valid for machine learning.__
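The effect of cutoff times can be sketched by hand in pandas. In the made-up example below, each label row only sees transactions dated before its own cutoff, so the same member gets different feature values at different cutoff times. The data, the strict `<` comparison, and the output column name are assumptions of this sketch. How Featuretools treats rows dated exactly at the cutoff is not something this example tries to reproduce.

```python
import pandas as pd

toy_trans = pd.DataFrame({
    'msno': ['a', 'a', 'a', 'b'],
    'transaction_date': pd.to_datetime(
        ['2016-01-05', '2016-02-05', '2016-03-05', '2016-02-20']),
    'actual_amount_paid': [99, 149, 149, 119],
})
toy_cutoffs = pd.DataFrame({
    'msno': ['a', 'a', 'b'],
    'cutoff': pd.to_datetime(['2016-02-01', '2016-03-01', '2016-03-01']),
    'label': [0, 1, 0],
})

# Pair every label with all of that member's transactions ...
paired = toy_cutoffs.merge(toy_trans, on='msno', how='left')
# ... then keep only the transactions dated before the cutoff time
paired = paired[paired['transaction_date'] < paired['cutoff']]

# One feature value per (member, cutoff), built from the allowed data only
feature = (paired.groupby(['msno', 'cutoff'])['actual_amount_paid'].sum()
           .rename('SUM(transactions.actual_amount_paid)').reset_index())
print(feature)
```

Doing this by hand for every feature is where mistakes that leak future information tend to creep in, which is why having the filtering handled automatically is valuable.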

To start, we'll use the default aggregation and transformation primitives as well as two `where_primitives` and see how many features this generates. To only generate the definitions of the features, we pass in `features_only = True`.

In [19]:
feature_defs = ft.dfs(entityset=es, target_entity='members', 
                      cutoff_time = cutoff_times,
                      where_primitives = ['sum', 'mean'],
                      max_depth=2, features_only=True)

In [20]:
print(f'This will generate {len(feature_defs)} features.')

This will generate 188 features.


In [21]:
import random; random.seed(42)

random.sample(feature_defs, 10)

[<Feature: MEAN(transactions.daily_price WHERE is_auto_renew = 0)>,
 <Feature: MIN(transactions.payment_plan_days)>,
 <Feature: SUM(transactions.actual_amount_paid)>,
 <Feature: MAX(logs.num_985)>,
 <Feature: STD(logs.total_secs)>,
 <Feature: STD(logs.num_50)>,
 <Feature: MEAN(transactions.plan_list_price)>,
 <Feature: SKEW(transactions.planned_daily_price)>,
 <Feature: MODE(transactions.DAY(membership_expire_date))>,
 <Feature: SUM(transactions.daily_price WHERE is_auto_renew = 0)>]

We can see that Featuretools has built almost 200 features automatically for us using the table relationships and feature primitives. If built by hand, each of these features would require minutes of work, totaling many hours to build 188 features.
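A feature like `MODE(transactions.DAY(membership_expire_date))` from the sample above is a stack of two primitives. The pandas sketch below rebuilds it by hand on invented data to show the two levels. It ignores cutoff times and is only meant to illustrate stacking.

```python
import pandas as pd

toy_trans = pd.DataFrame({
    'msno': ['a', 'a', 'a', 'b', 'b'],
    'membership_expire_date': pd.to_datetime(
        ['2017-03-10', '2017-04-10', '2017-05-02', '2016-11-27', '2016-12-27']),
})

# Depth 1: the transform primitive DAY, applied within the transactions table
toy_trans['DAY(membership_expire_date)'] = toy_trans['membership_expire_date'].dt.day

# Depth 2: the aggregation primitive MODE, applied per member across the relationship
stacked = (toy_trans.groupby('msno')['DAY(membership_expire_date)']
           .agg(lambda days: days.mode().iloc[0]))
print(stacked)
```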

## Specify Primitives 

Now we'll do a call to `ft.dfs` specifying the primitives to use. Often, these will depend on the problem and can involve domain knowledge. We can also build our own custom primitives to use on the dataset.

### Aggregation Primitives

In [24]:
all_p = ft.list_primitives()
trans_p = all_p.loc[all_p['type'] == 'transform'].copy()
agg_p = all_p.loc[all_p['type'] == 'aggregation'].copy()

pd.options.display.max_colwidth = 100
agg_p.head()

Unnamed: 0,name,type,description
0,avg_time_between,aggregation,Computes the average time between consecutive events.
1,mean,aggregation,Computes the average value of a numeric feature.
2,all,aggregation,Test if all values are 'True'.
3,mode,aggregation,Finds the most common element in a categorical feature.
4,n_most_common,aggregation,Finds the N most common elements in a categorical feature.


In [25]:
# Specify aggregation primitives
agg_primitives = ['sum', 'time_since_last', 'avg_time_between', 'all', 'mode', 'num_unique', 'min', 'last', 
                  'mean', 'percent_true', 'max', 'std', 'count']

### Transform Primitives

In [26]:
trans_p.tail()

Unnamed: 0,name,type,description
57,or,transform,"For two boolean values, determine if one value is 'True'."
58,subtract,transform,Creates a transform feature that subtracts two features.
59,weekend,transform,Transform Datetime feature into the boolean of Weekend.
60,year,transform,Transform a Datetime feature into the year.
61,month,transform,Transform a Datetime feature into the month.


In [27]:
# Specify transformation primitives
trans_primitives = ['weekend', 'cum_sum', 'day', 'month', 'diff', 'time_since_previous']

### Where Primitives


In [28]:
# Specify where primitives
where_primitives = ['sum', 'count', 'mean', 'percent_true', 'all', 'any']

## Custom Primitives

[Custom primitives](https://docs.featuretools.com/automated_feature_engineering/primitives.html#defining-custom-primitives) are one of the most powerful options in Featuretools. We use custom primitives to write our own functions based on domain knowledge and then pass them to `dfs` like any other primitives. Featuretools will then stack our custom primitives with the other primitives, again, in effect, amplifying our domain knowledge.

For this problem, I wrote a custom primitive that calculates the sum of a value in the month prior to the cutoff time. This is actually a primitive I wrote for another problem that I can apply to this problem as well. That's one of the benefits of feature primitives: they can work for any problem. Writing a custom primitive once will pay off far down the road.

In [65]:
from featuretools.primitives import make_agg_primitive

def total_previous_month_func(numeric, datetime, time):
    """Return total of `numeric` column in the month prior to `time`."""
    df = pd.DataFrame({'value': numeric, 'date': datetime})
    previous_month = time.month - 1
    year = time.year
   
    # Handle January
    if previous_month == 0:
        previous_month = 12
        year = time.year - 1
        
    # Filter data and sum up total
    df = df[(df['date'].dt.month == previous_month) & (df['date'].dt.year == year)]
    total = df['value'].sum()
    
    return total

In [67]:
numeric = [10, 12, 14, 15, 19, 22, 9, 8, 8, 11]
dates = pd.date_range('2018-01-01', '2018-03-01', periods = len(numeric))
pd.DataFrame({'value': numeric, 'date': dates}).head(6)
total_previous_month_func(numeric, dates, pd.Timestamp(2018, 2, 1))

Unnamed: 0,value,date
0,10,2018-01-01 00:00:00
1,12,2018-01-07 13:20:00
2,14,2018-01-14 02:40:00
3,15,2018-01-20 16:00:00
4,19,2018-01-27 05:20:00
5,22,2018-02-02 18:40:00


70

In [68]:
numeric = [10, 12, 14, 5, 7, 8]
dates = pd.date_range('2018-01-01', '2018-03-01', periods = len(numeric))
pd.DataFrame({'value': numeric, 'date': dates}).head(6)
total_previous_month_func(numeric, dates, pd.Timestamp(2018, 3, 1))

Unnamed: 0,value,date
0,10,2018-01-01 00:00:00
1,12,2018-01-12 19:12:00
2,14,2018-01-24 14:24:00
3,5,2018-02-05 09:36:00
4,7,2018-02-17 04:48:00
5,8,2018-03-01 00:00:00


12
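As an alternative way to express the same calculation, the sketch below uses monthly `Period` arithmetic so that the January case needs no special handling. The function name `total_previous_month_period` is hypothetical and this is a variant for illustration, not the function used in the rest of the notebook.

```python
import pandas as pd

def total_previous_month_period(numeric, datetime, time):
    """Sum `numeric` over the calendar month before `time` (illustrative variant)."""
    # Subtracting 1 from a monthly Period rolls over the year boundary for us
    target = pd.Timestamp(time).to_period('M') - 1
    months = pd.DatetimeIndex(datetime).to_period('M')
    values = pd.Series(list(numeric))
    return values[months == target].sum()

numeric = [10, 12, 14, 15, 19, 22, 9, 8, 8, 11]
dates = pd.date_range('2018-01-01', '2018-03-01', periods=len(numeric))
february_cutoff = total_previous_month_period(numeric, dates, pd.Timestamp(2018, 2, 1))

# A January cutoff looks back to December of the previous year
january_cutoff = total_previous_month_period(
    [1, 2, 3], ['2017-12-01', '2017-12-31', '2018-01-02'], pd.Timestamp(2018, 1, 15))
print(february_cutoff, january_cutoff)
```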

The first step is to make a function (`total_previous_month_func`) that calculates the primitive. The second step is to specify the input and output types. This primitive is an aggregation primitive because it takes in multiple values and returns a single number.

In [46]:
# Takes in a numeric column and a datetime column and outputs a single number
total_previous = make_agg_primitive(total_previous_month_func,
                                    input_types = [ft.variable_types.Numeric,
                                                   ft.variable_types.Datetime],
                                    return_type = ft.variable_types.Numeric,
                                    uses_calc_time = True)

We just have to pass this in as another aggregation primitive for Featuretools to use it in calculations.

The second custom primitive finds the time since a previous true value. It was originally intended for the `is_cancel` variable in the `transactions` dataframe, but it can work for any Boolean variable. For each row, it finds the time elapsed since the most recent True (1) example.


In [60]:
def time_since_true_func(boolean, datetime):
    """Calculate time since previous true value"""
    
    # Handle case with no true values
    if np.all(np.array(boolean) == 0):
        return [np.nan for _ in range(len(boolean))]
    
    # Create dataframe sorted from newest to oldest
    df = pd.DataFrame({'value': boolean, 'date': datetime}).\
            sort_values('date', ascending = False)
    
    older_date = None
    
    # Iterate through the dates of the true values, newest first
    for date in df.loc[df['value'] == 1, 'date']:
        
        # If there was no older true value
        if older_date is None:
            # Subset to times on or after true
            times_after_idx = df.loc[df['date'] >= date].index
            
            
        else:
            # Subset to times on or after true but before previous true
            times_after_idx = df.loc[(df['date'] >= date) & (df['date'] < older_date)].index
        older_date = date
        # Calculate time since previous true
        df.loc[times_after_idx, 'time_since_previous'] = (df.loc[times_after_idx, 'date'] - date).dt.total_seconds()
        
    # Note: values are returned in the sorted (newest-first) row order
    return list(df['time_since_previous'])

In [61]:
booleans = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1]
dates = pd.date_range('2018-01-01', '2018-03-01', periods = len(booleans))

time_since_true_func(booleans, dates)

[0.0,
 1960615.384615424,
 1568492.3076922882,
 1176369.2307694082,
 784246.153846272,
 392123.076923136,
 0.0,
 0.0,
 0.0,
 1568492.3076922882,
 1176369.230769152,
 784246.153846272,
 392123.076923136,
 0.0]

In [62]:
booleans = [1, 0, 1]
dates = pd.date_range('2018-01-01', '2018-03-01', periods = len(booleans))

time_since_true_func(booleans, dates)

[0.0, 2548800.0, 0.0]

In [64]:
booleans = [0, 0]
dates = pd.date_range('2018-01-01', '2018-03-01', periods = len(booleans))

time_since_true_func(booleans, dates)

[nan, nan]
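For comparison, here is a vectorized sketch of the same idea that returns its results in the input row order, which may be easier to line up with the original column. The function name `time_since_last_true` is hypothetical, and the use of forward-filling is a design choice of this sketch rather than the approach taken above.

```python
import numpy as np
import pandas as pd

def time_since_last_true(boolean, datetime):
    """Seconds since the most recent true value, aligned with the input rows."""
    dates = pd.Series(pd.DatetimeIndex(datetime))
    flags = pd.Series(np.asarray(boolean)).astype(bool)

    # Row positions in chronological order
    order = dates.sort_values(kind='stable').index

    # Date of the latest true value seen so far (NaT before the first one)
    last_true = dates.where(flags).loc[order].ffill()

    elapsed = (dates.loc[order] - last_true).dt.total_seconds()
    # Restore the original row order before returning
    return elapsed.sort_index().tolist()

three = time_since_last_true([1, 0, 1],
                             pd.date_range('2018-01-01', '2018-03-01', periods=3))
no_true = time_since_last_true([0, 0],
                               pd.date_range('2018-01-01', '2018-03-01', periods=2))
# Rows given out of chronological order still line up with the input
unsorted = time_since_last_true([0, 1], ['2018-01-02', '2018-01-01'])
print(three, no_true, unsorted)
```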

This is a transformation primitive since it operates on columns within a single table and returns one value per row, so the output has the same length as the original column.

In [None]:
from featuretools.primitives import make_trans_primitive

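# Takes in a boolean column and a datetime column and outputs a numeric column
time_since_true = make_trans_primitive(time_since_true_func,
                                       input_types = [ft.variable_types.Boolean,
                                                      ft.variable_types.Datetime],
                                       return_type = ft.variable_types.Numeric)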

## Deep Feature Synthesis with Specified Primitives

In [None]:
feature_defs = ft.dfs(entityset=es, target_entity='members', 
                      cutoff_time = cutoff_times, 
                      agg_primitives = agg_primitives,
                      trans_primitives = trans_primitives,
                      where_primitives = where_primitives,
                      max_depth = 2, features_only = True)

In [None]:
print(f'This will generate {len(feature_defs)} features.')

# Run Deep Feature Synthesis

Once we're happy with the features that will be generated, we can run Deep Feature Synthesis to make the actual features. We need to change `features_only` to `False` and then we're good to go.

In [None]:
from timeit import default_timer as timer

start = timer()
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='members', 
                                      cutoff_time = cutoff_times, 
                                      agg_primitives = agg_primitives,
                                      trans_primitives = trans_primitives,
                                      where_primitives = where_primitives,
                                      max_depth = 2, features_only = False,
                                      verbose = 1, chunk_size = 100)
end = timer()
print(f'{round(end - start)} seconds elapsed.')

In [None]:
feature_matrix.head()

In [None]:
cutoff_times['label'].value_counts()

In [None]:
from sklearn.ensemble import RandomForestClassifier

feature_matrix = feature_matrix[feature_matrix['label'].notnull()].copy()
labels = np.array(feature_matrix.pop('label'))

In [None]:
from sklearn.model_selection import train_test_split

feature_matrix = pd.get_dummies(feature_matrix).replace({np.inf: np.nan, -np.inf:np.nan}).fillna(0)

X_train, X_test, y_train, y_test = train_test_split(feature_matrix, labels, stratify = labels)
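The cleaning step above does three things in one line. The sketch below runs the same chain on a tiny invented feature matrix (`toy_fm`, with made-up column names and values) so each effect is visible: categorical columns are one-hot encoded, infinite values are treated as missing, and every missing value becomes 0.

```python
import numpy as np
import pandas as pd

toy_fm = pd.DataFrame({
    'MODE(transactions.payment_method_id)': ['41', '38', '41'],
    'MEAN(logs.percent_100)': [0.8, np.inf, np.nan],
})

cleaned = (pd.get_dummies(toy_fm)                          # one-hot encode categoricals
             .replace({np.inf: np.nan, -np.inf: np.nan})  # treat infinities as missing
             .fillna(0))                                   # fill every missing value with 0
print(cleaned)
```

Filling with 0 is a simple default. Whether it is appropriate depends on the feature, since 0 can be a meaningful value in its own right.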

In [None]:
import numpy as np

In [None]:
random_forest = RandomForestClassifier(n_estimators = 1000, max_depth = 10)
random_forest.fit(X_train, y_train)
random_forest.score(X_test, y_test)

In [None]:
np.mean(y_test == 0)

In [None]:
ft.save_features(feature_defs, '/data/churn/features.txt')

# Partition to Feature Matrix

Now we'll write a function that takes in the partition number and outputs a feature matrix.

In [None]:
feature_defs = ft.load_features('/data/churn/features.txt')
print(f'There are {len(feature_defs)} features.')

In [None]:
def partition_to_feature_matrix(partition, feature_defs):
    """Take in a partition number, build its feature matrix, and save it to CSV"""
    directory = '/data/churn/partitions/p' + str(partition)
    
    # Read in the data files
    members = pd.read_csv(f'{directory}/members.csv', 
                      parse_dates=['registration_init_time'], 
                      infer_datetime_format = True, 
                      dtype = {'gender': 'category'})

    trans = pd.read_csv(f'{directory}/transactions.csv',
                       parse_dates=['transaction_date', 'membership_expire_date'], 
                        infer_datetime_format = True)

    logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])
    cutoff_times = pd.read_csv(f'{directory}/cutoff_times.csv', parse_dates = ['cutoff'])
    cutoff_times = cutoff_times.drop_duplicates()
    
    # Create empty entityset
    es = ft.EntitySet(id = 'customers')

    # Add the members parent table
    es.entity_from_dataframe(entity_id='members', dataframe=members,
                             index = 'msno', time_index = 'registration_init_time', 
                             variable_types = {'city': vtypes.Categorical, 'bd': vtypes.Categorical,
                                               'registered_via': vtypes.Categorical})
    # Create new features in transactions
    trans['price_difference'] = trans['plan_list_price'] - trans['actual_amount_paid']
    trans['planned_daily_price'] = trans['plan_list_price'] / trans['payment_plan_days']
    trans['daily_price'] = trans['actual_amount_paid'] / trans['payment_plan_days']

    # Add the transactions child table
    es.entity_from_dataframe(entity_id='transactions', dataframe=trans,
                             index = 'transactions_index', make_index = True,
                             time_index = 'transaction_date', 
                             variable_types = {'payment_method_id': vtypes.Categorical, 
                                               'is_auto_renew': vtypes.Boolean, 'is_cancel': vtypes.Boolean})

    # Add transactions interesting values
    es['transactions']['is_cancel'].interesting_values = [0, 1]
    es['transactions']['is_auto_renew'].interesting_values = [0, 1]
    
    # Create new features in logs
    logs['total'] = logs[['num_25', 'num_50', 'num_75', 'num_985', 'num_100']].sum(axis = 1)
    logs['percent_100'] = logs['num_100'] / logs['total']
    logs['percent_unique'] = logs['num_unq'] / logs['total']
    logs['seconds_per_song'] = logs['total_secs'] / logs['total']
    
    # Add the logs child table
    es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                         index = 'logs_index', make_index = True,
                         time_index = 'date')

    # Add the relationships
    r_member_transactions = ft.Relationship(es['members']['msno'], es['transactions']['msno'])
    r_member_logs = ft.Relationship(es['members']['msno'], es['logs']['msno'])
    es.add_relationships([r_member_transactions, r_member_logs])

    # return cutoff_times
    
    # Calculate and save the feature matrix
    feature_matrix = ft.calculate_feature_matrix(entityset=es, features=feature_defs, cutoff_time=cutoff_times)
    
    feature_matrix.to_csv(f'{directory}/feature_matrix.csv')
    
    # Report progress every 10th of number of partitions
    if (partition % (N_PARTITIONS / 10) == 0):
        print(f'{round(100 * partition / N_PARTITIONS)}% complete.')
        
    

In [None]:
for i in range(0, 1000, 50):
    partition_to_feature_matrix(i, feature_defs)

In [None]:
partition_to_feature_matrix(100, feature_defs)