![](../../images/featuretools.png)

# Automated Feature Engineering on Home Credit Default Risk Data

Feature engineering is the process of creating new features (also called predictors or explanatory variables) out of an existing dataset. Traditionally, this process is done by hand using domain knowledge to build new features one at a time. In a previous notebook, we saw that feature engineering is crucial for a data science problem and how the manual approach is time-consuming, tedious, error-prone, and must be re-done for each problem. Automated feature engineering aims to aid the data scientist in this critical process by automatically creating hundreds or thousands of new features from a set of related tables. 

In this notebook, we will apply automated feature engineering to the Home Credit Default Risk loan dataset using featuretools, the only open-source Python library for automated feature engineering. This problem is a machine learning competition currently running on Kaggle where the objective is to predict if an applicant will default on a loan given comprehensive data on past loans and applicants. The data is spread across seven different tables making this an ideal problem for automated feature engineering: all of the data must be gathered into a single dataframe for training (and one for testing) with the aim of capturing as much usable information for the prediction problem as possible. As we will see, featuretools can efficiently carry out the tedious process of using all of these tables to make new features with only a few lines of code required from the data scientist.


## Approach 

Automated feature engineering is not meant to replace the data scientist, but rather to allow for more efficient pipelines. While featuretools allows us to create thousands of features in a few lines of code with no domain knowledge required, it can also amplify any domain knowledge we do have by _building features on top of domain knowledge features_. If we have already built features by hand for a problem, we can use featuretools to increase their value by stacking additional features on top. The original hand-built features are known as _seed features_ because they form the basis for additional features. In this notebook, we will demonstrate not only how to efficiently create thousands of features from a dataset, but also how to augment our existing domain knowledge using seed features. Our approach will be as follows (the background for each step will be covered as we go):

1. Read in the set of related data tables
2. Create a featuretools `EntitySet` and add `entities` to it 
    * Identify correct variable types as required
    * Identify indices in data
3. Add relationships between `entities`
4. Run Deep Feature Synthesis using default settings to generate thousands of new features 
    * Will not use any domain knowledge initially
5. Tune deep feature synthesis using domain knowledge features
    * Interesting values of variables
    * Seed features to build on top of
6. Run Deep Feature Synthesis including the domain knowledge
    * Specify primitives, seed features, and interesting value primitives


## Problem and Dataset

The [Home Credit Default Risk competition](https://www.kaggle.com/c/home-credit-default-risk) currently running on Kaggle is a supervised classification task where the objective is to predict whether or not an applicant for a loan (known as a client) will default on the loan. The data comprises socio-economic indicators for the clients, loan-specific financial information, and comprehensive data on previous loans at Home Credit (the institution sponsoring the competition) and other credit agencies. The metric for this competition is Receiver Operating Characteristic Area Under the Curve (ROC AUC) with predictions made in terms of the probability of default. We can evaluate our submissions either through cross-validation on the training data (for which we have the labels) or by submitting our test predictions to Kaggle to see where we place on the public leaderboard (which is calculated with only 10% of the testing data).

The Home Credit Default Risk dataset ([available for download here](https://www.kaggle.com/c/home-credit-default-risk/data)) consists of seven related tables of data:

* application_train/application_test: the main training/testing data for each client at Home Credit. The information includes both socioeconomic indicators for the client and loan-specific characteristics. Each loan has its own row and is uniquely identified by the feature `SK_ID_CURR`. The training application data comes with the `TARGET` indicating 0: the loan was repaid or 1: the loan was not repaid. 
* bureau: data concerning client's previous credits from other financial institutions (not Home Credit). Each previous credit has its own row in bureau, but one client in the application data can have multiple previous credits. The previous credits are uniquely identified by the feature `SK_ID_BUREAU`.
* bureau_balance: monthly balance data about the credits in bureau. Each row has information for one month about a previous credit and a single previous credit can have multiple rows. This is linked back to the bureau loan data by `SK_ID_BUREAU` (not unique in this dataframe).
* previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each client in the application data can have multiple previous loans. Each previous application has one row in this dataframe and is uniquely identified by the feature `SK_ID_PREV`. 
* POS_CASH_BALANCE: monthly data about previous point of sale or cash loans from the previous loan data. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows. This is linked back to the previous loan data by `SK_ID_PREV` (not unique in this dataframe).
* credit_card_balance: monthly data about previous credit card loans from the previous loan data. Each row is one month of a credit card balance, and a single credit card can have many rows. This is linked back to the previous loan data by `SK_ID_PREV` (not unique in this dataframe).
* installments_payments: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment. This is linked back to the previous loan data by `SK_ID_PREV` (not unique in this dataframe).

The image below shows the seven tables and the variables linking them:

![](../../images/kaggle_home_credit/home_credit_data.png)

The variables that tie the tables together will be important to understand when it comes to adding `relationships` between entities. 

In [1]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# featuretools for automated feature engineering
import featuretools as ft

### Read in Data

First we can read in the seven data tables. We also replace the anomalous values previously identified. 

In [2]:
# Read in the datasets and replace the anomalous values
app_train = pd.read_csv('../../data/kaggle_home_credit/application_train.csv').replace({365243: np.nan})
app_test = pd.read_csv('../../data/kaggle_home_credit/application_test.csv').replace({365243: np.nan})
bureau = pd.read_csv('../../data/kaggle_home_credit/bureau.csv').replace({365243: np.nan})
bureau_balance = pd.read_csv('../../data/kaggle_home_credit/bureau_balance.csv').replace({365243: np.nan})
cash = pd.read_csv('../../data/kaggle_home_credit/POS_CASH_balance.csv').replace({365243: np.nan})
credit = pd.read_csv('../../data/kaggle_home_credit/credit_card_balance.csv').replace({365243: np.nan})
previous = pd.read_csv('../../data/kaggle_home_credit/previous_application.csv').replace({365243: np.nan})
installments = pd.read_csv('../../data/kaggle_home_credit/installments_payments.csv').replace({365243: np.nan})

We will join together the training and testing datasets to make sure we build the same features for each set. Later, after the feature matrix is built, we can separate out the two sets. 

In [3]:
app_test['TARGET'] = np.nan

# Join together training and testing
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
app = pd.concat([app_train, app_test], ignore_index = True, sort = True)
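As a minimal sketch of how the two sets could be separated again later, the toy example below splits a combined dataframe on whether `TARGET` is missing. The toy dataframes and their values are hypothetical, and the approach assumes that no training row has a missing `TARGET`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for app_train / app_test (hypothetical values)
toy_train = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
toy_test = pd.DataFrame({'SK_ID_CURR': [4, 5]})

# Mark the test rows with a missing label, then stack the two sets
toy_test['TARGET'] = np.nan
toy_app = pd.concat([toy_train, toy_test], ignore_index=True, sort=True)

# Later: rows with a label are training data, rows without are testing data
train_part = toy_app[toy_app['TARGET'].notnull()].copy()
test_part = toy_app[toy_app['TARGET'].isnull()].drop(columns=['TARGET'])

print(train_part.shape, test_part.shape)
```

The same split should work on a feature matrix as long as `TARGET` is carried through as a column.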

Several of the keys are an incorrect data type (floats) so we need to make these all the same (integers) for adding relationships. 

In [4]:
for index in ['SK_ID_CURR', 'SK_ID_PREV', 'SK_ID_BUREAU']:
    for dataset in [app, bureau, bureau_balance, cash, credit, previous, installments]:
        if index in list(dataset.columns):
            dataset[index] = dataset[index].fillna(0).astype(np.int64)
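To illustrate why the cell above fills missing values before casting, here is a small sketch on a toy key column (the values are hypothetical). It shows that pandas refuses to cast a float column containing `NaN` directly to integers, and that after `fillna(0)` any missing key becomes the placeholder `0`.

```python
import numpy as np
import pandas as pd

# Toy key column stored as float because of a missing value (hypothetical)
toy_keys = pd.Series([100002.0, np.nan, 100004.0], name='SK_ID_CURR')

# A direct cast fails while NaN is present
try:
    toy_keys.astype(np.int64)
    cast_failed = False
except ValueError:
    cast_failed = True

# Filling first makes the cast possible; missing keys become the placeholder 0
clean_keys = toy_keys.fillna(0).astype(np.int64)
print(cast_failed, clean_keys.tolist())
```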

# Featuretools Basics

[Featuretools](https://docs.featuretools.com/#minute-quick-start) is an open-source Python library (the only one currently available) for automatically creating features out of a set of related tables using a technique called [deep feature synthesis](http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf). Automated feature engineering, like many topics in machine learning, is a complex subject built upon a foundation of simpler ideas. By going through these ideas one at a time, we can build up our understanding of featuretools, which will later allow us to get the most out of it.

There are a few concepts that we will cover along the way:

* [Entities and EntitySets](https://docs.featuretools.com/loading_data/using_entitysets.html): our tables and a data structure for keeping track of them all
* [Relationships between tables](https://docs.featuretools.com/loading_data/using_entitysets.html#adding-a-relationship): how the tables can be related to one another
* [Seed features and interesting values](https://docs.featuretools.com/guides/tuning_dfs.html): Two of the many ways we can amplify our domain knowledge by using featuretools to build on top of our hand-built features
* [Feature primitives](https://docs.featuretools.com/automated_feature_engineering/primitives.html): aggregations and transformations that are stacked to build features
* [Deep feature synthesis](https://docs.featuretools.com/automated_feature_engineering/afe.html): the method that uses feature primitives to generate thousands of new features

# Entities and Entitysets

An entity is simply a table or, in pandas, a `dataframe`. The observations must be in the rows and the features in the columns. An entity in featuretools must have a unique index where none of the elements are duplicated. Currently, only `app`, `bureau`, and `previous` have unique indices (`SK_ID_CURR`, `SK_ID_BUREAU`, and `SK_ID_PREV` respectively). For the other dataframes, we must pass in `make_index = True` and then specify the name of the index. Entities can also have time indices that represent when the information in the row became known. (There are no datetimes in any of the data, but there are relative times, given in months or days, that could be treated as time variables, although we will not use them as time in this notebook.)
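A quick way to see whether a column can serve as an index is to check it for duplicates in pandas. The sketch below uses a toy monthly-balance table with hypothetical values, and the surrogate row id it adds is only conceptually similar to what `make_index = True` asks featuretools to create.

```python
import pandas as pd

# Toy monthly-balance table (hypothetical values): SK_ID_BUREAU repeats
toy_balance = pd.DataFrame({'SK_ID_BUREAU': [10, 10, 11],
                            'MONTHS_BALANCE': [0, -1, 0]})

# A column can serve as an entity index only if it has no duplicates
print(toy_balance['SK_ID_BUREAU'].is_unique)  # False

# Otherwise we can add a surrogate row id as the first column
toy_balance.insert(0, 'bureaubalance_index', range(len(toy_balance)))
print(toy_balance['bureaubalance_index'].is_unique)  # True
```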

An [EntitySet](https://docs.featuretools.com/loading_data/using_entitysets.html) is a collection of tables and the relationships between them. This can be thought of as a data structure with its own methods and attributes. Using an EntitySet allows us to group together multiple tables and will make creating the features much simpler than keeping track of individual tables and relationships.

First we'll make an empty entityset named clients to keep track of all the data.

In [5]:
# Entity set with id clients
es = ft.EntitySet(id = 'clients')

### Variable Types

Featuretools will automatically infer the variable types. However, there may be some cases where we need to explicitly tell featuretools the variable type such as when a boolean variable is represented as an integer (otherwise it will be considered a numeric). Variable types in featuretools can be specified as a dictionary. 

We will first work with the `app` data to specify the proper variable types. To identify the `Boolean` variables that are recorded as numbers (1.0 or 0.0), we can iterate through the data and find any columns where there are only 2 unique values and the data type is numeric. We can also use the column definitions to find any other data types that should be identified, such as `Ordinal` variables.

In [6]:
import featuretools.variable_types as vtypes

In [7]:
app_types = {}

# Handle the Boolean variables:
for col in app:
    if (app[col].nunique() == 2) and (app[col].dtype == float):
        app_types[col] = vtypes.Boolean

# Remove the `TARGET`
del app_types['TARGET']

print('There are {} Boolean variables in the application data.'.format(len(app_types)))

There are 32 Boolean variables in the application data.


In [8]:
# Ordinal variables
app_types['REGION_RATING_CLIENT'] = vtypes.Ordinal
app_types['REGION_RATING_CLIENT_W_CITY'] = vtypes.Ordinal
app_types['HOUR_APPR_PROCESS_START'] = vtypes.Ordinal

The `previous` table is the only other `entity` that has features which should be recorded as Boolean. Correctly identifying the type of column will prevent featuretools from making irrelevant features such as the mean or max of a `Boolean`. 

In [9]:
previous_types = {}

# Handle the Boolean variables:
for col in previous:
    if (previous[col].nunique() == 2) and (previous[col].dtype == float):
        previous_types[col] = vtypes.Boolean

print('There are {} Boolean variables in the previous data.'.format(len(previous_types)))

There are 2 Boolean variables in the previous data.


In addition to identifying Boolean variables, we want to make sure featuretools does not create nonsense features such as statistical aggregations (mean, max, etc.) of ids. The `credit`, `cash`, and `installments` data all have the `SK_ID_CURR` variable. However, we do not actually need this variable in these dataframes because we link them to `app` through the `previous` dataframe with the `SK_ID_PREV` variable. We don't want to make features from `SK_ID_CURR` since it is an arbitrary id and should have no predictive power. Features like the mean of the id are irrelevant and would only slow down model training and probably lead to poorer model performance. 

Our options for handling these variables are either to tell featuretools to ignore them, or to drop the features before including them in the entityset. We will take the latter approach.

In [10]:
installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])

In [11]:
app['LOAN_RATE'] = app['AMT_ANNUITY'] / app['AMT_CREDIT'] 
app['CREDIT_INCOME_RATIO'] = app['AMT_CREDIT'] / app['AMT_INCOME_TOTAL']
app['EMPLOYED_BIRTH_RATIO'] = app['DAYS_EMPLOYED'] / app['DAYS_BIRTH']
app['EXT_SOURCE_SUM'] = app[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].sum(axis = 1)
app['EXT_SOURCE_MEAN'] = app[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis = 1)
app['AMT_REQ_SUM'] = app[[x for x in app.columns if 'AMT_REQ_' in x]].sum(axis = 1)

## Adding Entities

Now we define each entity, or table of data, and add it to the `EntitySet`. We need to pass in an index if the table has one or `make_index = True` if not. In the cases where we need to make an index, we must supply a name for the index. We also need to pass in the dictionary of variable types if there are any specific variables we should identify. The following code adds all seven tables to the `EntitySet` (the training and testing applications were joined into the single `app` table above).

In [12]:
# Entities with a unique index
es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR',
                              variable_types = app_types)

es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')

es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV',
                              variable_types = previous_types)

# Entities that do not have a unique index
es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, 
                              make_index = True, index = 'bureaubalance_index')

es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, 
                              make_index = True, index = 'cash_index')

es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,
                              make_index = True, index = 'installments_index')

es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,
                              make_index = True, index = 'credit_index')

In [13]:
# Display entityset so far
es

Entityset: clients
  Entities:
    app [Rows: 356255, Columns: 128]
    bureau [Rows: 1716428, Columns: 17]
    previous [Rows: 1670214, Columns: 37]
    bureau_balance [Rows: 27299925, Columns: 4]
    cash [Rows: 10001358, Columns: 8]
    installments [Rows: 13605401, Columns: 8]
    credit [Rows: 3840312, Columns: 23]
  Relationships:
    No relationships

The `EntitySet` allows us to group together all of our tables as one data structure. This is much easier than manipulating the tables one at a time (as we have to do in manual feature engineering).

# Relationships

Relationships are a fundamental concept not only in featuretools, but in any relational database. The most common type of relationship is one-to-many. The best way to think of a one-to-many relationship is with the analogy of parent-to-child. A parent is a single individual, but can have multiple children. In the context of tables, a parent table will have one row (observation) for every individual while a child table can have many observations for each parent. In a _parent table_, each individual has a single row and is uniquely identified by an index (also called a key). Each individual in the parent table can have multiple rows in the _child table_. Things get a little more complicated because child tables can have children of their own, making these grandchildren of the original parent.

As an example of a parent-to-child relationship, the `app` dataframe has one row for each client (identified by `SK_ID_CURR`) while the `bureau` dataframe has multiple previous loans for each client. Therefore, the `bureau` dataframe is the child of the `app` dataframe. The `bureau` dataframe in turn is the parent of `bureau_balance` because each loan has one row in `bureau` (identified by `SK_ID_BUREAU`) but multiple monthly records in `bureau_balance`. When we do manual feature engineering, keeping track of all these relationships is a massive pain (and a potential source of error), but we can add these relationships to our `EntitySet` and let featuretools worry about keeping the tables straight!

In [14]:
print('Parent: app, Parent Variable of bureau: SK_ID_CURR\n\n', app.iloc[:, 111:115].head())
print('\nChild: bureau, Child Variable of app: SK_ID_CURR\n\n', bureau.iloc[:, :5].head())

Parent: app, Parent Variable of bureau: SK_ID_CURR

    SK_ID_CURR  TARGET  TOTALAREA_MODE WALLSMATERIAL_MODE
0      100002     1.0          0.0149       Stone, brick
1      100003     0.0          0.0714              Block
2      100004     0.0             NaN                NaN
3      100006     0.0             NaN                NaN
4      100007     0.0             NaN                NaN

Child: bureau, Child Variable of app: SK_ID_CURR

    SK_ID_CURR  SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY  DAYS_CREDIT
0      215354       5714462        Closed      currency 1       -497.0
1      215354       5714463        Active      currency 1       -208.0
2      215354       5714464        Active      currency 1       -203.0
3      215354       5714465        Active      currency 1       -203.0
4      215354       5714466        Active      currency 1       -629.0


The `SK_ID_CURR` 215354 has one row in the parent table and multiple rows in the child. 

Two tables are linked via a shared variable. The `app` and `bureau` dataframe are linked by the `SK_ID_CURR` variable while the `bureau` and `bureau_balance` dataframes are linked with the `SK_ID_BUREAU`. The linking variable (also known as a key) is called the `parent` variable in the parent table and the `child` variable in the child table.
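To make the parent and child roles concrete, here is a minimal sketch with toy tables (hypothetical values). The helper `is_one_to_many` is an illustrative name, not part of featuretools, and its check is stricter than what the real data necessarily satisfies: it requires the key to be unique in the parent and every child key to exist in the parent.

```python
import pandas as pd

# Toy parent and child tables (hypothetical values)
toy_app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
toy_bureau = pd.DataFrame({'SK_ID_BUREAU': [10, 11, 12],
                           'SK_ID_CURR': [1, 1, 3]})

def is_one_to_many(parent, child, key):
    """Return True if `key` is unique in the parent and every child key exists in the parent."""
    parent_unique = parent[key].is_unique
    child_covered = child[key].isin(parent[key]).all()
    return bool(parent_unique and child_covered)

print(is_one_to_many(toy_app, toy_bureau, 'SK_ID_CURR'))

# Number of children per parent; client 2 has no previous credits and does not appear
print(toy_bureau.groupby('SK_ID_CURR').size())
```

Swapping the arguments fails the check, because `SK_ID_CURR` repeats in the toy `bureau` table and so cannot act as a parent key.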

In [15]:
print('Parent: bureau, Parent Variable of bureau_balance: SK_ID_BUREAU\n\n', bureau.iloc[:, :5].head())
print('\nChild: bureau_balance, Child Variable of bureau: SK_ID_BUREAU\n\n', bureau_balance.head())

Parent: bureau, Parent Variable of bureau_balance: SK_ID_BUREAU

    SK_ID_CURR  SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY  DAYS_CREDIT
0      215354       5714462        Closed      currency 1       -497.0
1      215354       5714463        Active      currency 1       -208.0
2      215354       5714464        Active      currency 1       -203.0
3      215354       5714465        Active      currency 1       -203.0
4      215354       5714466        Active      currency 1       -629.0

Child: bureau_balance, Child Variable of bureau: SK_ID_BUREAU

    SK_ID_BUREAU  MONTHS_BALANCE STATUS
0       5715448               0      C
1       5715448              -1      C
2       5715448              -2      C
3       5715448              -3      C
4       5715448              -4      C


Traditionally, we use the relationships between parents and children to aggregate data by grouping together all the children for a single parent and calculating statistics. For example, we might group together all the loans for a single client and calculate the average loan amount. This is straightforward, but can grow extremely tedious when we want to make hundreds of these features. Doing so one at a time is extremely inefficient especially because we end up re-writing much of the code over and over again and this code cannot be used for any different problem! Things get even worse when we have to aggregate the grandchildren because we have to use two steps: first aggregate at the parent level, and then at the grandparent level. Soon we will see that featuretools can do this work automatically for us, generating thousands of features from __all__ of the data tables. When we did this manually it took about 15 minutes per feature (as we saw in the manual feature engineering notebook) so featuretools saves us hundreds of data scientist hours that do not have to be wasted! 
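The manual approach described above can be sketched in pandas as follows. The tables are toy stand-ins with hypothetical values, and the choice of statistics (a mean of credit amounts, then a mean of per-loan minimums) is only for illustration.

```python
import pandas as pd

# Toy tables (hypothetical values)
toy_bureau = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                           'SK_ID_BUREAU': [10, 11, 12],
                           'AMT_CREDIT_SUM': [100.0, 300.0, 50.0]})
toy_balance = pd.DataFrame({'SK_ID_BUREAU': [10, 10, 11, 12],
                            'MONTHS_BALANCE': [-1, -3, -2, -6]})

# One step: average previous credit amount per client (child -> parent)
mean_credit = toy_bureau.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'].mean()

# Two steps for the grandchild table:
# 1) aggregate the monthly rows up to each previous credit
per_loan = (toy_balance.groupby('SK_ID_BUREAU')['MONTHS_BALANCE']
            .min().rename('MIN_MONTHS_BALANCE').reset_index())
# 2) attach the client id, then aggregate the previous credits up to each client
per_client = (toy_bureau[['SK_ID_CURR', 'SK_ID_BUREAU']]
              .merge(per_loan, on='SK_ID_BUREAU', how='left')
              .groupby('SK_ID_CURR')['MIN_MONTHS_BALANCE'].mean())

print(mean_credit)
print(per_client)
```

Every additional statistic or column means another copy of this pattern, which is the repetition that automated feature engineering is meant to remove.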

Defining the relationships is relatively straightforward using the diagram for the data tables. For each relationship, we need to first specify the parent variable and then the child variable. Altogether, there are a total of 6 relationships between the tables (counting the training and testing relationships as one). Below we specify these relationships and then add them to the EntitySet.

In [16]:
# Relationship between app and bureau
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

# Relationship between bureau and bureau balance
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])

# Relationship between current app and previous apps
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])

# Relationships between previous apps and cash, installments, and credit
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])

In [17]:
# Add in the defined relationships
es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous,
                           r_previous_cash, r_previous_installments, r_previous_credit])
# Print out the EntitySet
es

Entityset: clients
  Entities:
    app [Rows: 356255, Columns: 128]
    bureau [Rows: 1716428, Columns: 17]
    previous [Rows: 1670214, Columns: 37]
    bureau_balance [Rows: 27299925, Columns: 4]
    cash [Rows: 10001358, Columns: 8]
    installments [Rows: 13605401, Columns: 8]
    credit [Rows: 3840312, Columns: 23]
  Relationships:
    bureau.SK_ID_CURR -> app.SK_ID_CURR
    bureau_balance.SK_ID_BUREAU -> bureau.SK_ID_BUREAU
    previous.SK_ID_CURR -> app.SK_ID_CURR
    cash.SK_ID_PREV -> previous.SK_ID_PREV
    installments.SK_ID_PREV -> previous.SK_ID_PREV
    credit.SK_ID_PREV -> previous.SK_ID_PREV

Again, we can see the benefits of using an `EntitySet` that is able to track all of the relationships for us. This allows us to work at a higher level of abstraction, thinking about the entire dataset rather than each individual table, greatly increasing our efficiency.

__Slightly advanced note__: we need to be careful to not create a [diamond graph](https://en.wikipedia.org/wiki/Diamond_graph) where there are multiple paths from a parent to a child. If we directly link `app` and `cash` via `SK_ID_CURR`; `previous` and `cash` via `SK_ID_PREV`; and `app` and `previous` via `SK_ID_CURR`, then we have created two paths from `app` to `cash`. This results in ambiguity, so the approach we have to take instead is to link `app` to `cash` through `previous`. We establish a relationship between `previous` (the parent) and `cash` (the child) using `SK_ID_PREV`. Then we establish a relationship between `app` (the parent) and `previous` (now the child) using `SK_ID_CURR`. Then featuretools will be able to create features on `app` derived from both `previous` and `cash` by stacking multiple primitives. 

If this doesn't make too much sense, then just remember to only include one path from a parent to any descendants. For example, link a grandparent to a grandchild through the parent instead of directly through a shared variable.
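One way to check the "single path" rule is to count the paths between two tables in the relationship graph. The sketch below is plain Python rather than a featuretools feature: `count_paths` is a hypothetical helper, and it assumes the relationships contain no cycles.

```python
# Relationships as (parent, child) pairs; entity names follow this notebook
relationships = [('app', 'previous'), ('previous', 'cash')]

def count_paths(edges, start, end):
    """Count distinct parent-to-descendant paths from `start` to `end`."""
    if start == end:
        return 1
    return sum(count_paths(edges, child, end)
               for parent, child in edges if parent == start)

print(count_paths(relationships, 'app', 'cash'))  # 1

# Adding a direct app -> cash link would create a second path (the diamond)
with_direct_link = relationships + [('app', 'cash')]
print(count_paths(with_direct_link, 'app', 'cash'))  # 2
```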

All entities in the entityset can be linked through these relationships. In theory this allows us to calculate features for any of the entities, but in practice, we will only calculate features for the `app` dataframe since that will be used for training/testing. The end outcome will be a dataframe that has one row for each client in `app` with thousands of features for each individual.

We are almost to the point where we can start creating thousands of features but we still have a few foundational topics to understand. The next building block to cover is feature primitives.

# Feature Primitives

A [feature primitive](https://docs.featuretools.com/automated_feature_engineering/primitives.html) is an operation applied to a table or a set of tables to create a feature. These represent simple calculations, many of which we already use in manual feature engineering, that can be stacked on top of each other to create complex features. Feature primitives fall into two categories:

* __Aggregation__: function that groups together children for each parent and calculates a statistic such as mean, min, max, or standard deviation across the children. An example is the maximum previous loan amount for each client. An aggregation uses multiple tables with the relationships between tables.
* __Transformation__: an operation applied to one or more columns in a single table. An example would be taking the absolute value of a column, or finding the difference between two columns in one table.
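The difference between the two categories can be sketched in plain pandas. The table below is a toy stand-in for `previous` with hypothetical values, and the derived column names are illustrative only.

```python
import pandas as pd

# Toy child table of previous loans (hypothetical values)
toy_previous = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                             'AMT_APPLICATION': [500.0, 900.0, 200.0],
                             'AMT_CREDIT': [450.0, 1000.0, 200.0]})

# Transformation: works row by row within a single table
toy_previous['APP_CREDIT_DIFF'] = (toy_previous['AMT_APPLICATION']
                                   - toy_previous['AMT_CREDIT'])
toy_previous['ABS_DIFF'] = toy_previous['APP_CREDIT_DIFF'].abs()

# Aggregation: collapses the children to one value per parent
max_credit = toy_previous.groupby('SK_ID_CURR')['AMT_CREDIT'].max()

print(toy_previous)
print(max_credit)
```

A transformation keeps the number of rows unchanged, while an aggregation returns one row per parent.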

A list of the available feature primitives in featuretools can be viewed below.

In [18]:
# List the primitives in a dataframe
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 100

primitives[primitives['type'] == 'aggregation'].head(10)

Unnamed: 0,name,type,description
0,count,aggregation,Counts the number of non null values.
1,time_since_last,aggregation,Time since last related instance.
2,min,aggregation,Finds the minimum non-null value of a numeric feature.
3,median,aggregation,Finds the median value of any feature with well-ordered values.
4,all,aggregation,Test if all values are 'True'.
5,sum,aggregation,Counts the number of elements of a numeric or boolean feature.
6,mode,aggregation,Finds the most common element in a categorical feature.
7,mean,aggregation,Computes the average value of a numeric feature.
8,avg_time_between,aggregation,Computes the average time between consecutive events.
9,percent_true,aggregation,Finds the percent of 'True' values in a boolean feature.


In [19]:
primitives[primitives['type'] == 'transform'].head(10)

Unnamed: 0,name,type,description
19,and,transform,"For two boolean values, determine if both values are 'True'."
20,multiply,transform,Creates a transform feature that multplies two features.
21,percentile,transform,"For each value of the base feature, determines the percentile in relation"
22,year,transform,Transform a Datetime feature into the year.
23,days,transform,Transform a Timedelta feature into the number of days.
24,add,transform,Creates a transform feature that adds two features.
25,diff,transform,Compute the difference between the value of a base feature and the previous value.
26,month,transform,Transform a Datetime feature into the month.
27,days_since,transform,"For each value of the base feature, compute the number of days between it"
28,week,transform,Transform a Datetime feature into the week.


# Deep Feature Synthesis

[Deep Feature Synthesis (DFS)](https://docs.featuretools.com/automated_feature_engineering/afe.html) is the process featuretools uses to make new features. DFS stacks feature primitives to form features with a "depth" equal to the number of primitives. For example, if we take the maximum value of a client's previous loans (say `MAX(previous.loan_amount)`), that is a "deep feature" with a depth of 1. To create a feature with a depth of 2, we could stack primitives by taking the maximum value of a client's average monthly payments per previous loan (such as `MAX(previous.MEAN(installments.payment))`). In manual feature engineering, this would require two separate groupings and aggregations and would take about 15 minutes of coding per feature. Deep Feature Synthesis is an extremely powerful method that allows us to overcome our human limitations on time and creativity by building features that we would never be able to think of on our own (or would not have the patience to implement).

The [original paper on automated feature engineering using deep feature synthesis](https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf) is worth a read if you want to understand the concepts at a deeper level.

To perform DFS in featuretools, we use the `dfs` function, passing it an `entityset`, the `target_entity` (where we want to make the features), the `agg_primitives` to use, the `trans_primitives` to use, the `max_depth` of the features, and a number of other arguments depending on our use case. For example, when we are using seed features, we specify them with `seed_features` and we build on top of interesting values using the `where_primitives` argument. Since we do not want to make features for the _training_ data with the _testing_ data (which is possible because they are related), we need to pass in `ignore_entities` and then the relevant table. There are also options for multi-processing with `n_jobs` and for the information that is printed out with `verbose`.

One other important argument is __`features_only`__. If we set this to `True`, `dfs` will only make the feature names and not calculate the actual values of the features (called the feature matrix). This is useful when we want to inspect the features that will be created, and we can also save the features to use with a different dataset (for example when we have training and testing data).

## Deep Feature Synthesis with Default Primitives

Without using any domain knowledge we can make thousands of features by using the default primitives in featuretools. This first call will not include any of the `seed` features or `interesting_values` and instead we will use the default aggregation and transformation primitives, a max depth of 2, and calculate primitives for the `app` entity. We will only generate the features themselves (the names and not the values) which we can save and inspect.

In [20]:
# Default primitives from featuretools
default_agg_primitives =  ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]
default_trans_primitives =  ["day", "year", "month", "weekday", "haversine", "numwords", "characters"]

# DFS with specified primitives
feature_names = ft.dfs(entityset = es, target_entity = 'app',
                       trans_primitives = default_trans_primitives,
                       agg_primitives=default_agg_primitives, 
                       where_primitives = [], seed_features = [],
                       max_depth = 2, n_jobs = -1, verbose = 1,
                       features_only=True)

Built 1581 features


Even a completely uninformed call to deep feature synthesis gives us over 1,500 features to work with. Granted, not all of these will be important, but this still represents hundreds of hours of a data scientist's time. Moreover, `dfs` might be able to find important features that we would never have thought of in the first place. 

We can look at some of the feature names:

In [21]:
feature_names[-15:]

[<Feature: MEAN(previous.MIN(cash.CNT_INSTALMENT))>,
 <Feature: MEAN(previous.MIN(cash.CNT_INSTALMENT_FUTURE))>,
 <Feature: MEAN(previous.MIN(cash.SK_DPD))>,
 <Feature: MEAN(previous.MIN(cash.SK_DPD_DEF))>,
 <Feature: MEAN(previous.MEAN(cash.MONTHS_BALANCE))>,
 <Feature: MEAN(previous.MEAN(cash.CNT_INSTALMENT))>,
 <Feature: MEAN(previous.MEAN(cash.CNT_INSTALMENT_FUTURE))>,
 <Feature: MEAN(previous.MEAN(cash.SK_DPD))>,
 <Feature: MEAN(previous.MEAN(cash.SK_DPD_DEF))>,
 <Feature: MEAN(previous.COUNT(cash))>,
 <Feature: MEAN(previous.NUM_UNIQUE(cash.NAME_CONTRACT_STATUS))>,
 <Feature: NUM_UNIQUE(previous.MODE(credit.NAME_CONTRACT_STATUS))>,
 <Feature: NUM_UNIQUE(previous.MODE(cash.NAME_CONTRACT_STATUS))>,
 <Feature: MODE(previous.MODE(credit.NAME_CONTRACT_STATUS))>,
 <Feature: MODE(previous.MODE(cash.NAME_CONTRACT_STATUS))>]

Notice how featuretools stacks multiple primitives on top of each other. This is the benefit behind deep feature synthesis and automated feature engineering. Rather than having to do these groupings and aggregations by ourselves, featuretools is able to handle it all using the framework (`entities`, `relationships`, and `primitives`) that we provide. Next, we can use featuretools to expand upon our domain knowledge work. 
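
As a rough illustration of why stacking produces so many features, the sketch below enumerates depth-two feature names for a tiny, hand-picked set of primitives and `cash` columns. It is only a toy: the primitive and column subsets are assumptions chosen for brevity, and the real `dfs` also applies variable-type rules when deciding which primitives can be applied where.

```python
from itertools import product

# Hypothetical subset of primitives and columns, for illustration only
agg_primitives = ["MEAN", "MIN"]
cash_columns = ["SK_DPD", "CNT_INSTALMENT"]

# Depth 1: aggregate cash columns up to each previous loan
depth_one = [f"{agg}(cash.{col})"
             for agg, col in product(agg_primitives, cash_columns)]

# Depth 2: aggregate each depth-1 feature up to each client
depth_two = [f"{agg}(previous.{feat})"
             for agg, feat in product(agg_primitives, depth_one)]

print(len(depth_one), len(depth_two))
for name in depth_two:
    print(name)
```

With only two primitives and two columns we already get eight depth-two names, including `MEAN(previous.MIN(cash.SK_DPD))` from the listing above.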

# Building on Top of Domain Features

Featuretools will automatically build thousands of features for us, but that does not mean we can't use our own knowledge to improve the predictive performance. Featuretools is able to augment our domain knowledge by stacking additional features on top of our domain-knowledge-based features. We identified and created numerous useful features in the manual feature engineering notebook, based on our own knowledge and that of thousands of data scientists working on this problem on Kaggle. We can put this work to even better use with featuretools, as we will see in this section. Rather than getting only one feature from each piece of domain knowledge, we can effectively get dozens or even hundreds. We will use only two methods to include our domain knowledge here: seed features and interesting values. However, there are other ways in which featuretools can amplify our existing work, such as through [custom primitives](https://docs.featuretools.com/guides/advanced_custom_primitives.html) (arguably one of the most powerful parts of featuretools).  

### Seed Features 

Seed features are domain features that we make in the data that featuretools is then able to build on top of. For example, we saw that the rate of a loan is an important feature because a higher rate loan is likely more risky. In featuretools, we can encode the loan rate (both for the current loan and for previous loans) as a seed feature and featuretools will build additional explanatory variables on this domain knowledge wherever possible. 
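
A minimal sketch of the idea, in plain Python on made-up rows: the seed feature is the loan rate of each previous loan, and the stacked features are aggregations of that rate per client. The rows and values are hypothetical, and the choice of `MEAN` and `MAX` is only an example of what could be stacked on top.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical previous loans: (client_id, AMT_ANNUITY, AMT_CREDIT)
previous = [
    (1, 2500.0, 10000.0),
    (1, 7500.0, 10000.0),
    (2, 1250.0, 10000.0),
]

# Seed feature: loan rate for each previous loan
rates = defaultdict(list)
for client_id, annuity, credit in previous:
    rates[client_id].append(annuity / credit)

# Features stacked on the seed feature: per-client aggregations of the rate
stacked = {client: {"MEAN": mean(r), "MAX": max(r)} for client, r in rates.items()}

print(stacked)
```

One hand-built feature becomes several client-level features, one for each aggregation applied to it.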

### Interesting Values

Interesting values are similar in spirit to seed features, except that they allow us to make conditional features. For example, we might want to find for each client the mean amount of previous loans that have been closed and the mean amount of previous loans that are still active. By specifying interesting values in `bureau` on the `CREDIT_ACTIVE` variable, we can have featuretools do exactly that! Carrying this out by hand would be extremely tedious and present numerous opportunities for errors.
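
To show what such a conditional feature amounts to, here is a small by-hand sketch in plain Python: the mean loan amount per client, computed separately for each interesting value of `CREDIT_ACTIVE`. The rows and amounts are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical bureau rows: (client_id, CREDIT_ACTIVE, loan amount)
bureau = [
    (1, "Active", 1000.0),
    (1, "Closed", 200.0),
    (1, "Closed", 400.0),
    (2, "Active", 500.0),
    (2, "Sold", 900.0),  # not an interesting value, so it is ignored
]

interesting_values = ["Active", "Closed"]

# Group amounts by (client, status), keeping only the interesting values
amounts = defaultdict(list)
for client_id, status, amount in bureau:
    if status in interesting_values:
        amounts[(client_id, status)].append(amount)

# One conditional mean per client and interesting value
conditional_mean = {key: mean(vals) for key, vals in amounts.items()}

print(conditional_mean)
```

Client 2 has no closed loans in this toy data, so no value exists for that combination.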

Below we encode the domain knowledge features for each table, using a few lines of code to generate potentially dozens of features per table. If we were doing this by hand, I do not even want to think how long making each of these features would take! 

#### Domain Features in app

For the `app` data, we will just add in `seed` features. For each of these, we use the `ft.Feature` to identify the feature and then rename it so we can later interpret the results. All of these features were implemented by hand in the manual feature engineering notebook.

#### Domain Features from bureau

For the `bureau` table (recording previous loans at other financial institutions), we will use both interesting values and seed features. The `CREDIT_ACTIVE` variable records the current condition of the previous credit. Two interesting values are "Active" and "Closed". When we specify these interesting values, featuretools will build features on top of them by treating the values as conditionals.

In [22]:
es['bureau']['CREDIT_ACTIVE'].interesting_values = ['Active', 'Closed']

We can also use a seed feature to specify whether or not a loan is overdue. A loan is overdue when `CREDIT_DAY_OVERDUE` (the number of days overdue) is greater than 0.0, and the seed feature tells featuretools to stack new features on this value. A second seed feature records the loan rate of the previous credit.

In [23]:
credit_overdue = ft.Feature(es['bureau']['CREDIT_DAY_OVERDUE']) > 0.0
credit_overdue = credit_overdue.rename('CREDIT_OVERDUE')

credit_loan_rate = ft.Feature(es['bureau']['AMT_ANNUITY']) / ft.Feature(es['bureau']['AMT_CREDIT_SUM'])
credit_loan_rate = credit_loan_rate.rename('PREVIOUS_OTHER_LOAN_RATE')

#### Domain Features from bureau balance

Here we will use a single seed feature that indicates whether the monthly balance was overdue. 

In [24]:
balance_past_due = ft.Feature(es['bureau_balance']['STATUS']).isin(['1', '2', '3', '4', '5'])
balance_past_due = balance_past_due.rename('PREVIOUS_OTHER_MONTHLY_PAST_DUE')

#### Domain Features from previous

For the interesting values here, we will specify whether the loan was approved or rejected. 

In [25]:
es['previous']['NAME_CONTRACT_STATUS'].interesting_values = ['Approved', 'Refused']

The seed features will be the difference between the amount requested `AMT_APPLICATION` and the amount actually awarded `AMT_CREDIT` as well as the loan rate. 

In [26]:
previous_difference = ft.Feature(es['previous']['AMT_APPLICATION']) - ft.Feature(es['previous']['AMT_CREDIT'])
previous_difference = previous_difference.rename('PREVIOUS_APPLICATION_RECEIVED_DIFFERENCE')

previous_loan_rate = ft.Feature(es['previous']['AMT_ANNUITY']) / ft.Feature(es['previous']['AMT_CREDIT'])
previous_loan_rate = previous_loan_rate.rename('PREVIOUS_LOAN_RATE')

#### Domain Features from credit

For the previous credit loans from Home Credit, we will add two interesting values to `NAME_CONTRACT_STATUS` and one seed feature.

In [27]:
es['credit']['NAME_CONTRACT_STATUS'].interesting_values = ['Active', 'Completed']

In [28]:
credit_card_past_due = ft.Feature(es['credit']['SK_DPD']) > 0.0
credit_card_past_due = credit_card_past_due.rename('CREDIT_CARD_PAST_DUE')

#### Domain Features from cash

For the previous cash loans, we will also add interesting values and one seed feature.

In [29]:
es['cash']['NAME_CONTRACT_STATUS'].interesting_values = ['Active', 'Completed']

In [30]:
cash_past_due = ft.Feature(es['cash']['SK_DPD']) > 0.0
cash_past_due = cash_past_due.rename('CASH_PAST_DUE')

#### Seed Features from installments

For the previous installment payments, we will add two seed features: one indicating whether the payment was late, and another indicating if the payment was below the prescribed amount.

In [31]:
installments_late = ft.Feature(es['installments']['DAYS_ENTRY_PAYMENT']) > ft.Feature(es['installments']['DAYS_INSTALMENT'])
installments_late = installments_late.rename('INSTALLMENT_LATE')

installments_low_payment = ft.Feature(es['installments']['AMT_PAYMENT']) < ft.Feature(es['installments']['AMT_INSTALMENT']) 
installments_low_payment = installments_low_payment.rename('INSTALLMENT_LOW')
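
As a rough picture of what featuretools can build from these two boolean seed features, the sketch below computes the flags by hand and then takes the fraction of `True` values per client, which is what a `PERCENT_TRUE` aggregation represents. The rows are made up, and the plain-Python layout is an assumption for illustration.

```python
from collections import defaultdict

# Hypothetical rows:
# (client_id, DAYS_INSTALMENT, DAYS_ENTRY_PAYMENT, AMT_INSTALMENT, AMT_PAYMENT)
installments = [
    (1, -30, -31, 100.0, 100.0),  # paid a day early, in full
    (1, -60, -55, 100.0, 80.0),   # paid late and below the prescribed amount
    (2, -10, -10, 50.0, 50.0),    # paid on time, in full
    (2, -40, -35, 50.0, 50.0),    # paid late, in full
]

late_flags = defaultdict(list)
low_flags = defaultdict(list)
for client_id, days_due, days_paid, amt_due, amt_paid in installments:
    late_flags[client_id].append(days_paid > days_due)  # INSTALLMENT_LATE
    low_flags[client_id].append(amt_paid < amt_due)     # INSTALLMENT_LOW

def percent_true(flags):
    """Fraction of True values in a list of booleans."""
    return sum(flags) / len(flags)

percent_late = {c: percent_true(f) for c, f in late_flags.items()}
percent_low = {c: percent_true(f) for c, f in low_flags.items()}

print(percent_late, percent_low)
```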

We will put all of the seed features into a list that we can then pass to featuretools deep feature synthesis (`ft.dfs`, more on this shortly). For the seed features, we do not have to worry about specifying anything else because featuretools will automatically build on top of these when possible. For the interesting values, we have to pass in another argument called `where_primitives` to tell featuretools what aggregations to place on these conditionals. For example, using `where_primitives = ['mean', 'sum']` will find the `mean` and `sum` of values conditioned on the interesting values. 

In [32]:
seed_features = [installments_low_payment, installments_late,
                       cash_past_due, credit_card_past_due, 
                       previous_difference, previous_loan_rate,
                       balance_past_due, credit_loan_rate, credit_overdue]

## Using DFS to build on Domain Knowledge

Now we can use the domain knowledge we encoded in the seed features and the interesting values. To use the seed features, we simply need to pass them in as a list and to use the interesting values, we only need to specify where primitives to conditionally aggregate depending on the value of the variable.

We can also select a more limited (or expanded) set of primitives if we know which tend to be more important for a problem. In the default call, a number of the primitives were not even used because they are not applicable. Below we select a more limited set of feature primitives to use in Deep Feature Synthesis, include seed features, and also include the `where_primitives` that will be used with the `interesting_values`. 

In [33]:
# Specify primitives
agg_primitives =  ["sum", "max", "min", "mean", "count", "percent_true", "num_unique", "mode"]
trans_primitives = ['percentile', 'and']
where_primitives = ['percent_true', 'mean', 'sum']

In [34]:
# Deep feature synthesis with domain knowledge (only features)
feature_names = ft.dfs(entityset=es, target_entity='app',
                       agg_primitives = agg_primitives,
                       trans_primitives = trans_primitives,
                       seed_features = seed_features,
                       where_primitives = where_primitives,
                       n_jobs = -1, verbose = 1, features_only = True,
                       max_depth = 2)

Built 2800 features


## Run Full Deep Feature Synthesis

If we are content with the features that will be built, we can run deep feature synthesis and create the feature matrix. The following call runs the full deep feature synthesis. This might take a long time depending on your machine. Featuretools does allow for parallel processing, but each core must be able to handle the entire entityset.  

In [35]:
import sys
print('Total size of entityset: {:.5f} gb.'.format(sys.getsizeof(es) / 1e9))

Total size of entityset: 11.64392 gb.


In [36]:
import psutil

print('Total number of cpus detected: {}.'.format(psutil.cpu_count()))
print('Total size of system memory: {:.5f} gb.'.format(psutil.virtual_memory().total / 1e9))

Total number of cpus detected: 8.
Total size of system memory: 17.17987 gb.
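
Given these numbers, a back-of-the-envelope check shows how many workers could each hold a full copy of the entityset. This is only a rough heuristic under the assumption that every worker needs the whole entityset in memory; it ignores the extra memory used by the computation itself, and the variable names are hypothetical.

```python
import math

entityset_gb = 11.64392      # entityset size printed above
system_memory_gb = 17.17987  # system memory printed above
cpu_count = 8                # cpus detected above

# Number of full entityset copies that fit in memory at once
max_workers_by_memory = math.floor(system_memory_gb / entityset_gb)

# Never exceed the cpu count, and always allow at least one worker
n_jobs = max(1, min(cpu_count, max_workers_by_memory))

print('Workers that fit in memory: {}.'.format(n_jobs))
```

On this machine only one copy fits, so a single worker is the safe choice despite the eight available cores.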


In [37]:
# Full deep feature synthesis (commented out because it can take a long time to run)
# feature_matrix, feature_names = ft.dfs(entityset=es, target_entity='app',
#                                        agg_primitives = agg_primitives,
#                                        trans_primitives = trans_primitives,
#                                        seed_features = seed_features,
#                                        where_primitives = where_primitives,
#                                        n_jobs = 1, verbose = 1, features_only = False,
#                                        max_depth = 2, chunk_size = 100)

In [38]:
# feature_matrix.reset_index(inplace = True)
# feature_matrix.to_csv('../../data/kaggle_home_credit/feature_matrix.csv', index = False)

# Conclusions

In this notebook, we saw how to implement automated feature engineering for a data science problem. Automated feature engineering allows us to create thousands of new features from a set of related data tables, aiding us immensely as data scientists. Moreover, we can still use domain knowledge in our features and even augment our domain knowledge by building on top of our own hand-built features. The benefits of automated feature engineering are significant and will considerably help us in our role as data scientists.