![](../../images/featuretools.png)

# Automated Feature Engineering on Home Credit Default Risk Data

In this notebook, we will apply automated feature engineering using featuretools to the Home Credit Default Risk competition dataset. This problem is a machine learning competition currently running on Kaggle where the objective is to predict whether an applicant will default on a loan given historical information on past loans. The data is spread across seven different tables, which makes this an ideal problem for automated feature engineering: all of the data must be gathered into a single dataframe for training (and one for testing) with the aim of capturing as much usable information for the prediction problem as possible. 

Our focus will be on using featuretools as efficiently as possible for a first pass on the problem. That means we will not spend much time maximizing the potential of the featuretools library, but will instead focus on getting a usable solution as quickly as possible. 

## Approach 

Automated feature engineering is not meant to replace the data scientist, but rather to augment her work. While featuretools allows us to create thousands of features requiring no domain knowledge whatsoever, it can also magnify our domain knowledge by _building features on top of existing domain knowledge features_. Therefore, if we build features by hand that are useful for the problem, we can potentially make them even more valuable by stacking additional features on top of them. In this notebook, we will include several of the features that we constructed in the traditional manual feature engineering section of the Manual Feature Engineering notebook as seed features. We will get both the domain knowledge encoded in these features and the efficient creation of thousands of features from featuretools. 

## Problem and Dataset

The [Home Credit Default Risk competition](https://www.kaggle.com/c/home-credit-default-risk) currently running on Kaggle is a supervised classification task where the objective is to predict whether or not an applicant for a loan (known as a client) will default on the loan. The data comprises socio-economic indicators for the clients, loan-specific financial information, and comprehensive data on previous loans at Home Credit (the institution sponsoring the competition) and other credit agencies. The metric for this competition is Receiver Operating Characteristic Area Under the Curve (ROC AUC) with predictions made in terms of the probability of default. We can evaluate our submissions either through cross-validation on the training data (for which we have the labels) or by submitting our test predictions to Kaggle to see where we place on the public leaderboard (which is calculated with only 10% of the testing data). 
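To make the metric concrete, here is a minimal sketch of ROC AUC computed from its pairwise definition: the probability that a randomly chosen defaulted loan receives a higher predicted score than a randomly chosen repaid loan, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are made up for illustration and are not part of the competition code; in practice a library routine would normally be used instead of this quadratic-time version.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    This O(n_pos * n_neg) version is for illustration only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = loan not repaid, 0 = loan repaid
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because only the ordering of the scores matters, any monotonic rescaling of the predicted probabilities leaves the metric unchanged.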

The Home Credit Default Risk dataset ([available for download here](https://www.kaggle.com/c/home-credit-default-risk/data)) consists of seven tables of data:

* application_train/application_test: the main training/testing data for each client at Home Credit. The information includes both socioeconomic indicators for the client and loan-specific characteristics. Each loan has its own row and is uniquely identified by the feature `SK_ID_CURR`. The training application data comes with the `TARGET` indicating 0: the loan was repaid or 1: the loan was not repaid. 
* bureau: data concerning client's previous credits from other financial institutions (not Home Credit). Each previous credit has its own row in bureau, but one client in the application data can have multiple previous credits. The previous credits are uniquely identified by the feature `SK_ID_BUREAU`.
* bureau_balance: monthly balance data about the credits in bureau. Each row has information for one month about a previous credit and a single previous credit can have multiple rows. This is linked back to the bureau loan data by `SK_ID_BUREAU` (not unique in this dataframe).
* previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each client in the application data can have multiple previous loans. Each previous application has one row in this dataframe and is uniquely identified by the feature `SK_ID_PREV`. 
* POS_CASH_BALANCE: monthly data about previous point of sale or cash loans from the previous loan data. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows. This is linked back to the previous loan data by `SK_ID_PREV` (not unique in this dataframe).
* credit_card_balance: monthly data about previous credit card loans from the previous loan data. Each row is one month of a credit card balance, and a single credit card can have many rows. This is linked back to the previous loan data by `SK_ID_PREV` (not unique in this dataframe).
* installments_payments: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment. This is linked back to the previous loan data by `SK_ID_PREV` (not unique in this dataframe).

The image below shows the seven tables and the variables linking them:

![](../../images/kaggle_home_credit/home_credit_data.png)

In [1]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# featuretools for automated feature engineering
import featuretools as ft

# matplotlib and seaborn for visualizations
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 22
import seaborn as sns

In [2]:
# Read in the datasets 
app_train = pd.read_csv('../../data/kaggle_home_credit/application_train.csv').replace({365243: np.nan})
app_test = pd.read_csv('../../data/kaggle_home_credit/application_test.csv').replace({365243: np.nan})
bureau = pd.read_csv('../../data/kaggle_home_credit/bureau.csv').replace({365243: np.nan})
bureau_balance = pd.read_csv('../../data/kaggle_home_credit/bureau_balance.csv').replace({365243: np.nan})
cash = pd.read_csv('../../data/kaggle_home_credit/POS_CASH_balance.csv').replace({365243: np.nan})
credit = pd.read_csv('../../data/kaggle_home_credit/credit_card_balance.csv').replace({365243: np.nan})
previous = pd.read_csv('../../data/kaggle_home_credit/previous_application.csv').replace({365243: np.nan})
installments = pd.read_csv('../../data/kaggle_home_credit/installments_payments.csv').replace({365243: np.nan})

We'll join the train and test sets together but add a separate column identifying the set. This is important because we want to apply exactly the same procedures to each dataset, and the safest way to do that is to join them and treat them as a single dataframe. Later we can separate out the training and testing data using `train = app[app["TARGET"].notnull()]` and `test = app[app["TARGET"].isnull()]`.

In [3]:
# Add identifying column
app_train['set'] = 'train'
app_test['set'] = 'test'
app_test["TARGET"] = np.nan

# Append the dataframes
app = pd.concat([app_train, app_test], ignore_index = True, sort = True)
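As a quick sanity check of the join-then-split idea, the sketch below runs the same steps on two tiny made-up frames. The names `toy_train`, `toy_test`, and `toy_app` and their values are hypothetical stand-ins, not the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for app_train and app_test
toy_train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
toy_test = pd.DataFrame({"SK_ID_CURR": [4, 5]})

# Mark the origin of each row and give the test rows a missing label
toy_train["set"] = "train"
toy_test["set"] = "test"
toy_test["TARGET"] = np.nan

# Stack the two frames into one
toy_app = pd.concat([toy_train, toy_test], ignore_index=True, sort=True)

# Recover the original pieces from the missing / non-missing labels
train = toy_app[toy_app["TARGET"].notnull()]
test = toy_app[toy_app["TARGET"].isnull()]

print(train.shape, test.shape)
```

The split relies on every training row having a non-missing `TARGET`; the `set` column gives an independent way to confirm that the two pieces came back intact.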

# Featuretools Basics

[Featuretools](https://docs.featuretools.com/#minute-quick-start) is an open-source Python library for automatically creating features out of a set of related tables using a technique called [deep feature synthesis](http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf). Automated feature engineering, like many topics in machine learning, is a complex subject built upon a foundation of simpler ideas. By going through these ideas one at a time, we can build up our understanding of how featuretools works, which will later allow us to get the most out of it.

There are a few concepts that we will cover along the way:

* [Entities and EntitySets](https://docs.featuretools.com/loading_data/using_entitysets.html)
* [Relationships between tables](https://docs.featuretools.com/loading_data/using_entitysets.html#adding-a-relationship)
* [Feature primitives](https://docs.featuretools.com/automated_feature_engineering/primitives.html): aggregations and transformations
* [Deep feature synthesis](https://docs.featuretools.com/automated_feature_engineering/afe.html)

# Entities and Entitysets

An entity is simply a table or, in Pandas, a `dataframe`. The observations are in the rows and the features in the columns. An entity in featuretools must have a unique index where none of the elements are duplicated. Currently, only `app`, `bureau`, and `previous` have unique indices (`SK_ID_CURR`, `SK_ID_BUREAU`, and `SK_ID_PREV` respectively). For the other dataframes, we must pass in `make_index = True` and then specify the name of the index. Entities can also have time indices where each entry is identified by a unique time. (There are no datetimes in any of the data, but there are relative times, given in months or days, that we could consider treating as time variables.)
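The uniqueness requirement can be checked directly in pandas before building any entity. The sketch below uses two tiny made-up frames (`toy_bureau` and `toy_balance` are hypothetical) and then adds a hand-made integer index as a rough analogue of what `make_index = True` asks featuretools to do; that analogy is an assumption for illustration, not a description of the library's internals.

```python
import pandas as pd

# Hypothetical miniature tables
toy_bureau = pd.DataFrame({"SK_ID_BUREAU": [5, 6, 7], "SK_ID_CURR": [1, 1, 2]})
toy_balance = pd.DataFrame({"SK_ID_BUREAU": [5, 5, 6], "MONTHS_BALANCE": [-1, -2, -1]})

# One row per credit: SK_ID_BUREAU can serve as the index
bureau_ok = toy_bureau["SK_ID_BUREAU"].is_unique

# Several monthly rows per credit: SK_ID_BUREAU repeats, so it cannot
balance_ok = toy_balance["SK_ID_BUREAU"].is_unique

# Add a new column of row numbers to act as a unique index
toy_balance = toy_balance.reset_index().rename(columns={"index": "bureaubalance_index"})

print(bureau_ok, balance_ok)
print(toy_balance)
```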

An [EntitySet](https://docs.featuretools.com/loading_data/using_entitysets.html) is a collection of tables and the relationships between them. This can be thought of as a data structure with its own methods and attributes. Using an EntitySet allows us to group together multiple tables and manipulate them much more quickly than individual tables. 

First we'll make an empty entityset named clients to keep track of all the data.

In [4]:
# Entity set with id clients
es = ft.EntitySet(id = 'clients')

## Variable Types

Featuretools will automatically infer the variable types. However, there may be some cases where we need to explicitly tell featuretools the variable type such as when a boolean variable is represented as an integer. Variable types in featuretools must be specified as a dictionary. 

We will first work with the `app` data to specify the proper variable types.

In [5]:
import featuretools.variable_types as vtypes

In [6]:
app_types = {}

# Handle the Boolean variables:
for col in app:
    if app[col].nunique() == 2:
        app_types[col] = vtypes.Boolean

print('There are {} Boolean variables.'.format(len(app_types)))

There are 38 Boolean variables.


In [7]:
app_types['REGION_RATING_CLIENT'] = vtypes.Ordinal
app_types['REGION_RATING_CLIENT_W_CITY'] = vtypes.Ordinal
app_types['WEEKDAY_APPR_PROCESS_START'] = vtypes.Ordinal
app_types['HOUR_APPR_PROCESS_START'] = vtypes.Ordinal

We can check the other data tables to see if there are any columns that should be recorded as Boolean. We also need to read through the column descriptions to determine if there are any other variable types (such as ordinal).

In [8]:
print(np.any(bureau.nunique() == 2))
print(np.any(bureau_balance.nunique() == 2))
print(np.any(credit.nunique() == 2))
print(np.any(cash.nunique() == 2))
print(np.any(previous.nunique() == 2))
print(np.any(installments.nunique() == 2))

False
False
False
False
True
False


In [9]:
previous_types = {}

# Handle the Boolean variables:
for col in previous:
    if previous[col].nunique() == 2:
        previous_types[col] = vtypes.Boolean

print('There are {} Boolean variables.'.format(len(previous_types)))

There are 3 Boolean variables.


The `credit`, `cash`, and `installments` data all have the `SK_ID_CURR` key. However, we do not actually need this variable in these dataframes because we link them to `app` through the `previous` dataframe with the `SK_ID_PREV` variable. (We can't link these tables to `app` both directly through `SK_ID_CURR` and indirectly through the `previous` dataframe, because this would create a diamond graph.) We also don't want to make features from `SK_ID_CURR` since it is an arbitrary id and should have no predictive power. Such features are irrelevant and hence would only slow down model training and perhaps lead to poorer performance. 

Our options for handling these columns are either to tell featuretools to ignore them, or to drop them before including the dataframes in the entityset. We will take the latter approach since it removes the need to remember to tell featuretools to ignore them. 

In [10]:
installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])

## Add in Domain Features to Base Data

In the traditional manual feature engineering notebook, we created a number of new features by manipulating columns of the `app` dataframe itself. Since these are in the base dataframe, featuretools cannot build features on top of these, but since these features are useful, we should still include them in the `app` data. Later we will see how we can use seed features to allow featuretools to build on top of our features.

In [None]:
app['LOAN_RATE'] = app['AMT_ANNUITY'] / app['AMT_CREDIT'] 
app['CREDIT_INCOME_RATIO'] = app['AMT_CREDIT'] / app['AMT_INCOME_TOTAL']
app['EMPLOYED_BIRTH_RATIO'] = app['DAYS_EMPLOYED'] / app['DAYS_BIRTH']

app['EXT_SOURCE_SUM'] = app[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].sum(axis = 1)
app['EXT_SOURCE_MEAN'] = app[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis = 1)
app['AMT_REQ_SUM'] = app[[x for x in app.columns if 'AMT_REQ_' in x]].sum(axis = 1)

## Define Entities

Now we define each entity, or table of data. We need to pass in an index if the data has one or `make_index = True` if not. In the cases where we need to make an index, we must supply a name for the index. We also need to pass in the variable types that we identified.

In [11]:
# Entities with a unique index
es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR',
                              variable_types = app_types)

es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')

es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV',
                              variable_types = previous_types)

# Entities that do not have a unique index
es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, 
                              make_index = True, index = 'bureaubalance_index')

es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, 
                              make_index = True, index = 'cash_index')

es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,
                              make_index = True, index = 'installments_index')

es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,
                              make_index = True, index = 'credit_index')

# Relationships

Relationships are a fundamental concept not only in featuretools, but in any relational database. The best way to think of a one-to-many relationship is with the analogy of parent-to-child. A parent is a single individual, but can have multiple children. The children can then have multiple children of their own. In a _parent table_, each individual has a single row. Each individual in the parent table can have multiple rows in the _child table_. 

As an example, the `app` dataframe has one row for each client (`SK_ID_CURR`) while the `bureau` dataframe has multiple previous loans (`SK_ID_BUREAU`) for each parent (`SK_ID_CURR`). Therefore, the `bureau` dataframe is the child of the `app` dataframe. The `bureau` dataframe in turn is the parent of `bureau_balance` because each loan has one row in `bureau` but multiple monthly records in `bureau_balance`. 

In [12]:
print('Parent: app, Parent Variable: SK_ID_CURR\n\n', app.iloc[:, 111:115].head())
print('\nChild: bureau, Child Variable: SK_ID_CURR\n\n', bureau.iloc[10:30, :4].head())

Parent: app, Parent Variable: SK_ID_CURR

    SK_ID_CURR  TARGET  TOTALAREA_MODE WALLSMATERIAL_MODE
0      100002     1.0          0.0149       Stone, brick
1      100003     0.0          0.0714              Block
2      100004     0.0             NaN                NaN
3      100006     0.0             NaN                NaN
4      100007     0.0             NaN                NaN

Child: bureau, Child Variable: SK_ID_CURR

     SK_ID_CURR  SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY
10      162297       5714472        Active      currency 1
11      162297       5714473        Closed      currency 1
12      162297       5714474        Active      currency 1
13      402440       5714475        Active      currency 1
14      238881       5714482        Closed      currency 1


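Each `SK_ID_CURR` has exactly one row in the parent table but can have multiple rows in the child. In the output above, for example, the `SK_ID_CURR` 162297 appears in three rows of `bureau`. 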

Two tables are linked via a shared variable. The `app` and `bureau` dataframe are linked by the `SK_ID_CURR` variable while the `bureau` and `bureau_balance` dataframes are linked with the `SK_ID_BUREAU`. Defining the relationships is relatively straightforward, and the diagram provided by the competition is helpful for seeing the relationships. For each relationship, we need to specify the parent variable and the child variable. Altogether, there are a total of 6 relationships between the tables. Below we specify all six relationships and then add them to the EntitySet.

In [13]:
# Relationship between app and bureau
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

# Relationship between bureau and bureau balance
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])

# Relationship between current app and previous apps
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])

# Relationships between previous apps and cash, installments, and credit
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])

In [14]:
# Add in the defined relationships
es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous,
                           r_previous_cash, r_previous_installments, r_previous_credit])
# Print out the EntitySet
es

Entityset: clients
  Entities:
    app [Rows: 356255, Columns: 123]
    bureau [Rows: 1716428, Columns: 17]
    previous [Rows: 1670214, Columns: 37]
    bureau_balance [Rows: 27299925, Columns: 4]
    cash [Rows: 10001358, Columns: 8]
    installments [Rows: 13605401, Columns: 8]
    credit [Rows: 3840312, Columns: 23]
  Relationships:
    bureau.SK_ID_CURR -> app.SK_ID_CURR
    bureau_balance.SK_ID_BUREAU -> bureau.SK_ID_BUREAU
    previous.SK_ID_CURR -> app.SK_ID_CURR
    cash.SK_ID_PREV -> previous.SK_ID_PREV
    installments.SK_ID_PREV -> previous.SK_ID_PREV
    credit.SK_ID_PREV -> previous.SK_ID_PREV

Slightly advanced note: we need to be careful to not create a [diamond graph](https://en.wikipedia.org/wiki/Diamond_graph) where there are multiple paths from a parent to a child. If we directly link `app` and `cash` via `SK_ID_CURR`; `previous` and `cash` via `SK_ID_PREV`; and `app` and `previous` via `SK_ID_CURR`, then we have created two paths from `app` to `cash`. This results in ambiguity, so the approach we have to take instead is to link `app` to `cash` through `previous`. We establish a relationship between `previous` (the parent) and `cash` (the child) using `SK_ID_PREV`. Then we establish a relationship between `app` (the parent) and `previous` (now the child) using `SK_ID_CURR`. Then featuretools will be able to create features on `app` derived from both `previous` and `cash` by stacking multiple primitives. (We removed the `SK_ID_CURR` from `cash`, `credit`, and `installments` so we didn't have to worry about doing this!)
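The ambiguity described above can be detected by counting the paths between two tables. The following is a minimal sketch in plain Python, assuming the relationships form an acyclic graph; `count_paths` and the `with_direct_link` list are hypothetical helpers written for this illustration and are not part of featuretools.

```python
def count_paths(relationships, parent, child):
    """Count distinct parent-to-child paths in a list of (parent, child) links.

    Assumes the relationship graph has no cycles.
    """
    children = {}
    for p, c in relationships:
        children.setdefault(p, []).append(c)

    def walk(node):
        if node == child:
            return 1
        return sum(walk(nxt) for nxt in children.get(node, []))

    return walk(parent)


# Relationships used in this notebook, written as (parent, child)
linked_through_previous = [("app", "previous"), ("previous", "cash")]

# Hypothetical alternative that also links app directly to cash
with_direct_link = linked_through_previous + [("app", "cash")]

print(count_paths(linked_through_previous, "app", "cash"))
print(count_paths(with_direct_link, "app", "cash"))
```

A count greater than one for any pair of tables signals the diamond structure that we avoid by dropping `SK_ID_CURR` from the lower-level tables.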

All entities in the entityset can be related to each other. In theory this allows us to calculate features for any of the entities, but in practice, we will only calculate features for the `app` dataframe since that will be used for training/testing. 

## Seed Features

Now is where featuretools is able to augment our domain knowledge. If we specify seed features, featuretools can then stack additional features on top of these. 
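To illustrate what stacking on a seed feature means, the sketch below reproduces the idea by hand in pandas: a Boolean flag encodes the domain knowledge (a payment entered after its due date), and an aggregation is then built on top of it. The toy values are made up, and the groupby mean is only an analogue of a percent-true style aggregation, not the featuretools implementation.

```python
import pandas as pd

# Hypothetical installment records for two previous loans
toy_installments = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 10, 11, 11],
    "DAYS_INSTALMENT": [-90, -60, -30, -45, -15],
    "DAYS_ENTRY_PAYMENT": [-92, -55, -30, -40, -10],
})

# Seed feature: hand-written domain knowledge, one Boolean per payment
toy_installments["late"] = (
    toy_installments["DAYS_ENTRY_PAYMENT"] > toy_installments["DAYS_INSTALMENT"]
)

# Stacked feature: share of late payments for each previous loan
percent_late = toy_installments.groupby("SK_ID_PREV")["late"].mean()

print(percent_late)
```

Without the seed feature, an automated search would have no way to know that comparing these two day columns is meaningful; with it, every Boolean aggregation becomes available on top of the flag.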

#### Seed Features from bureau

In [21]:
credit_active = ft.Feature(es['bureau']['CREDIT_ACTIVE']) != 'Closed'
credit_overdue = ft.Feature(es['bureau']['CREDIT_DAY_OVERDUE']) > 0.0

#### Seed Features from bureau balance

In [22]:
balance_past_due = ft.Feature(es['bureau_balance']['STATUS']).isin(['1', '2', '3', '4', '5'])

#### Seed Features from previous

In [26]:
application_not_approved = ft.Feature(es['previous']['NAME_CONTRACT_STATUS']) != 'Approved'

#### Seed Features from credit

In [54]:
credit_card_past_due = ft.Feature(es['credit']['SK_DPD']) > 0.0
credit_card_active = ft.Feature(es['credit']['NAME_CONTRACT_STATUS']) == 'Active'

#### Seed Features from cash

In [55]:
cash_active = ft.Feature(es['cash']['NAME_CONTRACT_STATUS']) == 'Active'
cash_past_due = ft.Feature(es['cash']['SK_DPD']) > 0.0

#### Seed Features from installments

In [56]:
installments_late = ft.Feature(es['installments']['DAYS_ENTRY_PAYMENT']) > ft.Feature(es['installments']['DAYS_INSTALMENT'])
installments_low_payment = ft.Feature(es['installments']['AMT_PAYMENT']) < ft.Feature(es['installments']['AMT_INSTALMENT']) 

In [57]:
seed_features = [installments_low_payment, installments_late,
                 cash_past_due, cash_active,
                 credit_card_active, credit_card_past_due, 
                 application_not_approved, balance_past_due,
                 credit_overdue, credit_active]

# Feature Primitives

A [feature primitive](https://docs.featuretools.com/automated_feature_engineering/primitives.html) is an operation applied to a table or a set of tables to create a feature. These represent simple calculations, many of which we already use in manual feature engineering, that can be stacked on top of each other to create complex features. Feature primitives fall into two categories:

* __Aggregation__: function that groups together child datapoints for each parent and then calculates a statistic such as mean, min, max, or standard deviation. An example is calculating the maximum previous loan amount for each client. An aggregation works across multiple tables using relationships between tables.
* __Transformation__: an operation applied to one or more columns in a single table. An example would be taking the absolute value of a column, or finding the difference between two columns in one table.
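The two categories can be sketched by hand in pandas. The table below is a made-up stand-in for previous loans, and the column name `CREDIT_MINUS_ANNUITY` is hypothetical; the point is only that a transformation yields one value per row of a single table, while an aggregation yields one value per parent.

```python
import pandas as pd

# Hypothetical child table: several previous loans per client
toy_previous = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "AMT_CREDIT": [1000.0, 3000.0, 500.0, 700.0, 600.0],
    "AMT_ANNUITY": [100.0, 150.0, 50.0, 70.0, 30.0],
})

# Transformation: combines columns within one table, one value per row
toy_previous["CREDIT_MINUS_ANNUITY"] = (
    toy_previous["AMT_CREDIT"] - toy_previous["AMT_ANNUITY"]
)

# Aggregation: groups child rows by parent, one value per client
max_credit = toy_previous.groupby("SK_ID_CURR")["AMT_CREDIT"].max()

print(toy_previous)
print(max_credit)
```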

A list of the available feature primitives in featuretools can be viewed below.

In [58]:
# List the primitives in a dataframe
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 100
primitives[primitives['type'] == 'aggregation'].head(10)

| | name | type | description |
|---|---|---|---|
| 0 | avg_time_between | aggregation | Computes the average time between consecutive events. |
| 1 | percent_true | aggregation | Finds the percent of 'True' values in a boolean feature. |
| 2 | any | aggregation | Test if any value is 'True'. |
| 3 | n_most_common | aggregation | Finds the N most common elements in a categorical feature. |
| 4 | sum | aggregation | Counts the number of elements of a numeric or boolean feature. |
| 5 | num_true | aggregation | Finds the number of 'True' values in a boolean. |
| 6 | last | aggregation | Returns the last value. |
| 7 | count | aggregation | Counts the number of non null values. |
| 8 | num_unique | aggregation | Returns the number of unique categorical variables. |
| 9 | std | aggregation | Finds the standard deviation of a numeric feature ignoring null values. |


In [59]:
primitives[primitives['type'] == 'transform'].head(10)

| | name | type | description |
|---|---|---|---|
| 19 | add | transform | Creates a transform feature that adds two features. |
| 20 | day | transform | Transform a Datetime feature into the day. |
| 21 | weekend | transform | Transform Datetime feature into the boolean of Weekend. |
| 22 | cum_mean | transform | Calculates the mean of previous values of an instance for each value in a time-dependent entity. |
| 23 | month | transform | Transform a Datetime feature into the month. |
| 24 | year | transform | Transform a Datetime feature into the year. |
| 25 | and | transform | For two boolean values, determine if both values are 'True'. |
| 26 | hour | transform | Transform a Datetime feature into the hour. |
| 27 | absolute | transform | Absolute value of base feature. |
| 28 | week | transform | Transform a Datetime feature into the week. |


# Deep Feature Synthesis

Deep Feature Synthesis (DFS) is the process featuretools uses to make new features. DFS stacks feature primitives to form features with a "depth" equal to the number of primitives. For example, if we take the maximum value of a client's previous loans (say `MAX(previous.loan_amount)`), that is a "deep feature" with a depth of 1. To create a feature with a depth of 2, we could stack primitives by taking the maximum value of a client's average monthly payments per previous loan (such as `MAX(previous.MEAN(installments.payment))`). The [original paper on automated feature engineering using deep feature synthesis](https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf) is worth a read. 
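The depth-2 example can be worked through by hand to show what the stacked primitives compute. This is a plain-Python sketch on made-up data; the dictionaries and their keys are hypothetical and simply mimic the `app` to `previous` to `installments` chain.

```python
from statistics import mean

# Hypothetical toy data: installment payments keyed by previous loan,
# and previous loans keyed by client
payments_by_loan = {
    "loan_a": [100.0, 200.0],
    "loan_b": [50.0, 50.0, 50.0],
    "loan_c": [300.0],
}
loans_by_client = {
    "client_1": ["loan_a", "loan_b"],
    "client_2": ["loan_c"],
}

# Depth 1: MEAN(installments.payment) for each previous loan
mean_payment = {loan: mean(p) for loan, p in payments_by_loan.items()}

# Depth 2: MAX(previous.MEAN(installments.payment)) for each client
max_mean_payment = {
    client: max(mean_payment[loan] for loan in loans)
    for client, loans in loans_by_client.items()
}

print(max_mean_payment)
```

Each additional level of depth repeats the same pattern one table further up the chain of relationships, which is why the number of candidate features grows so quickly with `max_depth`.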

To perform DFS in featuretools, we use the `dfs` function, passing it an `entityset`, the `target_entity` (where we want to make the features), the `agg_primitives` to use, the `trans_primitives` to use, and the `max_depth` of the features. Here we will use the default aggregation and transformation primitives, a max depth of 2, and calculate features for the `app` entity. Because this process is computationally expensive, we can run the function using `features_only = True` to return only a list of the features and not calculate the features themselves. This can be useful to look at the resulting features before starting an extended computation.

### DFS with Default Primitives

First, we can run deep feature synthesis with the default primitives and generate all of the names.

In [34]:
# Default primitives from featuretools
default_agg_primitives =  ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]
default_trans_primitives =  ["day", "year", "month", "weekday", "haversine", "numwords", "characters"]

# DFS with specified primitives
feature_names = ft.dfs(entityset = es, target_entity = 'app',
                       trans_primitives = default_trans_primitives,
                       agg_primitives=default_agg_primitives, 
                       max_depth = 2, n_jobs = -1, verbose = 1,
                       features_only=True)

print('%d Total Features' % len(feature_names))

Built 1575 features
1575 Total Features


Some of the primitives are not even used in the deep feature synthesis because they are not applicable. 

In [40]:
str(feature_names[1])[10:-1]

'AMT_CREDIT'

In [42]:
feature_list = []
for feature in feature_names:
    feature_list.append(str(feature)[10:-1])

In [45]:
feature_list[-10]

'MEAN(bureau.SUM(bureau_balance.MONTHS_BALANCE))'

In [46]:
for primitive in default_agg_primitives:
    included = False
    for feature in feature_list:
        if primitive.upper() in feature:
            included = True
    if not included:
        print('{} not in features.'.format(primitive))

In [49]:
for primitive in default_trans_primitives:
    included = False
    for feature in feature_list:
        if ('%s(' % primitive.upper()) in feature:
            included = True
    if not included:
        print('{} not in features.'.format(primitive))

haversine not in features.
numwords not in features.
characters not in features.


## DFS with seed features

Now we can include the seed features and look at the number of features that will be created. We will select a few aggregation and transformation primitives to use rather than the entire default list. 

In [52]:
agg_primitives =  ["sum", "max", "min", "mean", "count", "percent_true", "num_unique", "mode"]
trans_primitives = ['percentile', 'and']

In [61]:
feature_names = ft.dfs(entityset=es, target_entity='app',
                       agg_primitives = agg_primitives,
                       trans_primitives = trans_primitives,
                       seed_features = seed_features,
                       n_jobs = -1, verbose = 1, features_only = True,
                       max_depth = 2)

print("{} total features.".format(len(feature_names)))

Built 2080 features



## Run Full Deep Feature Synthesis

If we are content with the features that will be built, we can run deep feature synthesis and create the feature matrix. The following call runs the full deep feature synthesis. 

In [None]:
feature_matrix, feature_names = ft.dfs(entityset=es, target_entity='app',
                                       agg_primitives = agg_primitives,
                                       trans_primitives = trans_primitives,
                                       seed_features = seed_features,
                                       n_jobs = -1, verbose = 1, features_only = False,
                                       max_depth = 2)

In [None]:
feature_matrix.to_csv('../../data/kaggle_home_credit/feature_matrix.csv')