# Introduction: Automated Feature Engineering

In this notebook, we will look at an exciting development in data science: automated feature engineering. A machine learning model can only learn from the data we give it, and making sure that data is relevant to the task is one of the most crucial steps in the machine learning pipeline. The importance of creating useful features from existing data has been highlighted by numerous leaders in the field, including Andrew Ng ("applied machine learning is basically feature engineering"), and in the excellent paper "A Few Useful Things to Know about Machine Learning". However, manual feature engineering is tedious and is limited both by human imagination (there are only so many features we can think to create) and by time (creating new features is time-intensive). Ideally, there would be an objective method to create an array of diverse new features that we can then use for a machine learning task.

In this notebook, we will walk through using Featuretools, an open-source Python library for automatically creating features from relational data (where the data is in structured tables). Although there are now many efforts to automate model selection and hyperparameter tuning, there has been far less work on the feature engineering part of the pipeline. This library seeks to close that gap, and the general methodology has proven effective in both machine learning competitions and business use cases.


## Dataset

We will use an example dataset consisting of three tables:

* `clients`: information about clients at a credit union
* `loans`: previous loans taken out by the clients
* `payments`: payments made/missed on the previous loans

The general problem of feature engineering is taking disparate data, often distributed across multiple tables, and combining it into a single table that can be used for training a machine learning model. If we were doing this manually, we could use `pandas` functions to group the data by client and then calculate summary statistics. Feature tools can do this exact same process, but will create far more features than we would have considered.

First, let's load in the data and look at how we might create new features by hand to contrast with feature tools.

In [0]:
# Run this if feature tools is not already installed
# !pip install featuretools

In [1]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# featuretools for automated feature engineering
import featuretools as ft

In [5]:
# Read in the data
clients = pd.read_csv('data/clients.csv', parse_dates = ['joined'])
loans = pd.read_csv('data/loans.csv', parse_dates = ['loan_start', 'loan_end'])
payments = pd.read_csv('data/payments.csv', parse_dates = ['payment_date'])

In [6]:
clients.head()

Unnamed: 0,client_id,joined,income,credit_score
0,46109,2002-04-16,172677,527
1,49545,2007-11-14,104564,770
2,41480,2013-03-11,122607,585
3,46180,2001-11-06,43851,562
4,25707,2006-10-06,211422,621


In [9]:
loans.sample(10)

Unnamed: 0,client_id,loan_type,loan_amount,repaid,loan_id,loan_start,loan_end,rate
318,44387,home,12465,0,10682,2005-03-03,2007-06-01,2.91
294,32885,home,14162,1,11333,2000-08-05,2002-06-01,7.21
258,29841,credit,14721,0,11045,2005-02-09,2007-06-09,6.76
351,26695,home,9808,1,10806,2000-07-27,2003-04-18,1.23
205,26326,home,12760,0,11708,2003-12-11,2006-04-13,5.43
321,44387,cash,2559,0,10690,2014-02-24,2016-06-11,3.27
277,44601,home,12571,1,10429,2003-04-15,2005-07-05,6.57
299,49068,other,10082,1,10131,2014-10-10,2016-05-25,0.63
27,49545,credit,4458,1,10192,2013-02-11,2014-09-11,3.6
366,26695,credit,4776,1,10305,2003-09-16,2005-09-20,0.97


In [8]:
payments.head()

Unnamed: 0,loan_id,payment_amount,payment_date,missed
0,10243,2369,2002-05-31,1
1,10243,2439,2002-06-18,1
2,10243,2662,2002-06-29,0
3,10243,2268,2002-07-20,0
4,10243,2027,2002-07-31,1


### Manual Feature Engineering Examples

The first features we might make by hand are relatively simple: we can take the month of the `joined` column and the natural log of the `income` column. Later, we will come to see these are known in Feature Tools as transformation feature primitives because they act on all values in a column. 

In [0]:
# Create a month column
clients['month'] = clients['joined'].dt.month

# Create a log of income column
clients['log_income'] = np.log(clients['income'])

clients.head()

To incorporate information about the other tables, we would use the `df.groupby` method, followed by a suitable aggregation function, followed by `df.merge`.  For example, let's calculate the average, minimum, and maximum amount of previous loans for each client. In the terms of feature tools, this would be considered an aggregation feature primitive because we are aggregating a statistic for multiple data points.

In [0]:
# Groupby client id and calculate mean, max, min previous loan size
previous_stats = loans.groupby('client_id')['loan_amount'].agg(['mean', 'max', 'min'])
previous_stats.head()

In [0]:
# Merge with the clients dataframe
clients.merge(previous_stats, left_on = 'client_id', right_index=True, how = 'left')

If we then wanted to include information about the `payments`, we would have to group that dataframe by `loan_id`, merge it with `loans`, group the resulting dataframe by `client_id`, and then merge it into the `clients` dataframe. This would allow us to include information about previous payments for each client. Clearly, this process grows tedious with multiple tables, and I certainly don't want to do it by hand! Luckily, feature tools can automatically carry out this entire process and will create more features than we would have ever thought of. Although I love `pandas`, there is only so much manual data manipulation I'm willing to stand!
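
To make the four steps concrete, here is a minimal sketch of that chain in `pandas`. The three small tables and their values are made up for illustration (only the key column names match the real data), and the choice of `mean`, `sum`, and `max` as statistics is arbitrary.

```python
import pandas as pd

# Tiny made-up tables that share key columns with the real data
clients_demo = pd.DataFrame({'client_id': [1, 2]})
loans_demo = pd.DataFrame({'loan_id': [10, 11, 12],
                           'client_id': [1, 1, 2]})
payments_demo = pd.DataFrame({'loan_id': [10, 10, 11, 12],
                              'payment_amount': [100, 300, 500, 200]})

# Step 1: group payments by loan and aggregate
loan_payment_stats = (payments_demo.groupby('loan_id')['payment_amount']
                      .agg(['mean', 'sum'])
                      .add_prefix('payment_'))

# Step 2: merge the per-loan statistics into the loans table
loans_with_payments = loans_demo.merge(loan_payment_stats, left_on='loan_id',
                                       right_index=True, how='left')

# Step 3: group the merged table by client and aggregate again
client_payment_stats = (loans_with_payments.groupby('client_id')
                        [['payment_mean', 'payment_sum']]
                        .agg(['mean', 'max']))
# Flatten two-level column names, e.g. ('payment_mean', 'max') -> 'payment_mean_max'
client_payment_stats.columns = ['_'.join(col) for col in client_payment_stats.columns]

# Step 4: merge the per-client statistics into the clients table
clients_with_payments = clients_demo.merge(client_payment_stats, left_on='client_id',
                                           right_index=True, how='left')
print(clients_with_payments)
```

Even for two statistics at each level, this takes four separate steps plus column bookkeeping, and every additional table adds another group-and-merge round.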

# Feature Tools

Now that we know what we are trying to avoid, let's figure out how to automate this process. Feature tools takes the human limits of time and imagination out of the manual feature engineering process (although it is meant to be used together with a data scientist, not to replace them!). Feature tools operates on an idea known as Deep Feature Synthesis. You can read the original paper here, and although it's quite readable, it's not necessary to understand the details to do automated feature engineering. Basically, feature tools uses building blocks known as feature primitives (like those above) that can be combined to yield many new variables. These variables can then be used for supervised machine learning.

I threw out some terms there, but don't worry because we'll cover them as we go. Feature Tools builds on simple ideas to create a powerful method, and we will build up our understanding in much the same way. 

The first part of Feature Tools to understand is an `entity`. This is simply a table, or in `pandas`, a `DataFrame`. We corral multiple entities into a single object called an `EntitySet`. This is a data structure composed of many entities and is quite useful for Deep Feature Synthesis. The `EntitySet` will hold all of our entities (think tables) and the relationships between them. 

## EntitySet

Creating a new `EntitySet` is pretty simple: 

In [0]:
es = ft.EntitySet(id = 'clients')

## Entities 

An entity is simply a table, which is represented in `pandas` as a `DataFrame`. Each entity must have a uniquely identifying column, known as an index. For the `clients` dataframe, this is the `client_id` because each id only appears once in the data. In the `loans` dataframe, `client_id` is not an index because each id might appear more than once. The index for this dataframe is instead `loan_id`. When we create an `entity` in feature tools, we have to identify which column of the dataframe is the index.

If the data does not have a unique index we can tell feature tools to make an index for the entity by passing in `make_index = True` and specifying a name for the index. 
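
Before choosing an index, it can help to check which columns actually qualify. The sketch below does this in plain `pandas`. The helper `can_be_index`, the demo tables, and the surrogate `payment_id` column are hypothetical illustrations, not part of Featuretools, and the row counter is only a rough stand-in for what `make_index = True` asks for.

```python
import pandas as pd

# Made-up rows shaped like the loans and payments tables (values are illustrative)
loans_demo = pd.DataFrame({'client_id': [46109, 46109, 49545],
                           'loan_id': [10243, 10984, 10990]})
payments_demo = pd.DataFrame({'loan_id': [10243, 10243, 10984],
                              'payment_amount': [2369, 2369, 2662]})

def can_be_index(df, column):
    """Return True if the column has no missing values and no duplicates."""
    return bool(df[column].notna().all() and df[column].is_unique)

print(can_be_index(loans_demo, 'client_id'))   # a client can have several loans
print(can_be_index(loans_demo, 'loan_id'))     # each loan appears exactly once
print(can_be_index(payments_demo, 'loan_id'))  # a loan can have several payments

# No column in payments_demo is unique, so add a surrogate key.
# 'payment_id' is simply a row counter here.
payments_demo.insert(0, 'payment_id', range(len(payments_demo)))
print(can_be_index(payments_demo, 'payment_id'))
```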

If the data also has a time index (a column recording when the information in each row became available), we can pass that in as the `time_index` parameter. Feature tools will automatically infer the variable types (numeric, categorical, datetime) of the columns in our data, but we can also pass in specific datatypes to override this behavior. As an example, even though the `repaid` column in the `loans` dataframe is represented as an integer, we can tell feature tools that this is a categorical feature since it can only take on two discrete values.

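In the code below we create the three entities and add them to the `EntitySet`. The syntax is relatively straightforward with a few notes: for the `payments` dataframe we need to make an index, and we specify that `repaid` (in `loans`) and `missed` (in `payments`) are categorical variables. The `loans` and `payments` dataframes are registered under the entity names `previous_loans` and `previous_payments`, which is how they appear in the generated feature names.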

In [0]:
# Create an entity from the client dataframe
# This dataframe already has an index and a time index
es = es.entity_from_dataframe(entity_id = 'clients', dataframe = clients, 
                              index = 'client_id', time_index = 'joined')

In [0]:
# Create the previous_loans entity from the loans dataframe
# This dataframe already has an index and a time index
es = es.entity_from_dataframe(entity_id = 'previous_loans', dataframe = loans, 
                              variable_types = {'repaid': ft.variable_types.Categorical},
                              index = 'loan_id', time_index = 'loan_start')

In [0]:
# Create the previous_payments entity from the payments dataframe
# This does not yet have a unique index
es = es.entity_from_dataframe(entity_id = 'previous_payments', 
                              dataframe = payments,
                              variable_types = {'missed': ft.variable_types.Categorical},
                              make_index = True,
                              index = 'payment_id')

In [0]:
es

All three entities have been successfully added to the `EntitySet`. We can access any of these using Python dictionary syntax.

In [0]:
es['previous_loans']

Feature tools correctly inferred each of the datatypes when we made this entity. We can also see that we overrode the type for the `repaid` feature, changing it from numeric to categorical.

## Relationships

After defining the entities (tables) in an `EntitySet`, we now need to tell feature tools how they are related. The most intuitive way to think of relationships is with the parent-to-child analogy: a parent-to-child relationship is one-to-many because for each parent, there can be multiple children. The `clients` entity is therefore the parent of the `previous_loans` entity because while there is only one row for each client in `clients`, each client may have several previous loans covering multiple rows in `previous_loans`. Likewise, the `previous_loans` entity is the parent of the `previous_payments` entity because each loan will have multiple payments.

These relationships are what allow us to group together datapoints (called aggregating) and then create new features. As an example, we can group all of the previous loans associated with one client and find the average loan amount. We will discuss the features themselves more in a little bit, but for now let's define the relationships. To define a relationship, we need to specify the parent variable and the child variable. These are the variables that link two entities together. In our example, the `clients` and `previous_loans` entities are linked together by the `client_id` column. Again, this is a parent-to-child relationship because for each `client_id` in the parent `clients` entity, there may be multiple entries of the same `client_id` in the child `previous_loans` entity.
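
The one-to-many condition can be checked directly in `pandas` before declaring a relationship. This is only a sketch on made-up tables, and `check_one_to_many` is a hypothetical helper rather than a Featuretools function: the key must be unique in the parent, and every key in the child must exist in the parent.

```python
import pandas as pd

# Made-up parent (clients) and child (loans) tables
parent_demo = pd.DataFrame({'client_id': [1, 2, 3]})
child_demo = pd.DataFrame({'loan_id': [10, 11, 12, 13],
                           'client_id': [1, 1, 2, 3]})

def check_one_to_many(parent, child, key):
    """Return True if `key` is unique in the parent and every child key has a parent."""
    parent_unique = parent[key].is_unique
    children_have_parent = child[key].isin(parent[key]).all()
    return bool(parent_unique and children_have_parent)

print(check_one_to_many(parent_demo, child_demo, 'client_id'))

# Number of child rows linked to each parent: the "many" side of the relationship
children_per_parent = child_demo.groupby('client_id').size()
print(children_per_parent)
```

Swapping the arguments fails the check, because `client_id` repeats in the loans table and so cannot act as the parent key.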

We codify relationships in the language of feature tools by specifying the parent variable and then the child variable. After creating a relationship, we add it to the `EntitySet`. 

In [0]:
# Relationship between clients and previous loans
r_client_previous = ft.Relationship(es['clients']['client_id'],
                                    es['previous_loans']['client_id'])

# Add the relationship to the entity set
es = es.add_relationship(r_client_previous)

es

The relationship has now been stored in the entity set. The second relationship is between the `previous_loans` and `previous_payments`. These two entities are related by the `loan_id` variable.

In [0]:
# Relationship between previous loans and previous payments
r_previous_payments = ft.Relationship(es['previous_loans']['loan_id'],
                                      es['previous_payments']['loan_id'])

# Add the relationship to the entity set
es = es.add_relationship(r_previous_payments)

# Deep Feature Synthesis

In the paper describing the original idea behind feature tools, the authors named the process of creating new features they developed "Deep Feature Synthesis". The deep in the title refers to the idea that features can have different depth depending on the number of feature primitives used. What is a feature primitive? Well, since you asked: 


## Feature Primitives

A feature primitive is an operation acting on a feature. These fall into two categories:

* __Aggregation__: function that groups together child datapoints for each parent and then calculates a statistic such as mean, min, max, or standard deviation. An example is calculating the maximum loan amount for each client. 
* __Transformation__: an operation applied to every observation in a feature. An example would be extracting the day from dates, or the square root of a numeric. In contrast to aggregations, a transformation does not group observations.
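
The difference between the two categories shows up in the shape of the result. The following `pandas` sketch uses a made-up loans table: the aggregation returns one value per client, while the transformations return one value per original row.

```python
import numpy as np
import pandas as pd

# Made-up loans table (values are illustrative)
loans_demo = pd.DataFrame({
    'client_id': [1, 1, 2],
    'loan_amount': [100.0, 400.0, 900.0],
    'loan_start': pd.to_datetime(['2014-02-24', '2015-03-01', '2016-06-11']),
})

# Aggregation: many child rows collapse to one value per parent (client)
max_loan = loans_demo.groupby('client_id')['loan_amount'].max()
print(max_loan)

# Transformation: one output value for every input row, no grouping
loans_demo['start_day'] = loans_demo['loan_start'].dt.day
loans_demo['sqrt_amount'] = np.sqrt(loans_demo['loan_amount'])
print(loans_demo)
```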


Let's take a look at feature primitives in feature tools. We can view the list of primitives:

In [0]:
primitives = ft.list_primitives()
primitives[primitives['type'] == 'aggregation'].head(10)

In [0]:
primitives[primitives['type'] == 'transform'].head(10)

If feature tools does not have enough primitives for us, we can also make our own. 

To get an idea of what a feature primitive actually does, let's try out a few on our data. Applying primitives is surprisingly easy with the `ft.dfs` function (which stands for deep feature synthesis). In this function, we specify the `entityset` to use; the `target_entity`, which is the entity we want to make the features for; the `agg_primitives`, which are the aggregation primitives; and the `trans_primitives`, which are the transformation primitives to apply.

In the following example, we are using the `EntitySet` we already created, the target entity is the `clients` dataframe because we want to make new features about each client, and then we specify a few aggregation and transformation primitives. 

In [0]:
# Create new features using specified primitives
features, feature_names = ft.dfs(entityset = es, target_entity = 'clients', 
                                 agg_primitives = ['mean', 'max', 'percent_true', 'last'],
                                 trans_primitives = ['years', 'subtract'])

In [0]:
features.head()

Already we can see how useful feature tools is: it performed the same operations we did manually but also many more in addition. Examining the dataframe brings us to the final piece of the puzzle: deep features.

## Feature Depth

While feature primitives are useful by themselves, the main benefit of using feature tools arises when we stack primitives to get deep features. The depth of a feature is simply the number of primitives required to make a feature. So, a feature that relies on a single aggregation would be a deep feature with a depth of 1. The idea itself is a lot simpler than the name "deep feature synthesis" implies. (I think the authors were trying to ride the wave of deep neural network hype when they named the method!)

Already in the dataframe we made by specifying the primitives manually we can see the idea of feature depth. For instance, the MEAN(previous_loans.loan_amount) feature has a depth of 1 because it is made by applying a single aggregation primitive. This feature represents the average size of a client's previous loans.


In [0]:
# Show a feature with a depth of 1
features['MEAN(previous_loans.loan_amount)']

As we scroll through the features, we see a number of features with a depth of 2. For example, LAST(previous_loans.MEAN(previous_payments.payment_amount)) has a depth of 2 because it is made by stacking two feature primitives, both of them aggregations: first the mean of the payments on each loan, then the last of those means across each client's loans. This feature represents the average payment amount for the last (most recent) loan for each client.

In [0]:
# Show a feature with a depth of 2
features['LAST(previous_loans.MEAN(previous_payments.payment_amount))']
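
For comparison, a depth-2 feature of this kind can be rebuilt by hand in `pandas`. The sketch below uses made-up tables and assumes that "last" means the loan with the latest `loan_start`, which is an assumption about ordering and not a statement of how Featuretools computes it internally.

```python
import pandas as pd

# Made-up loans and payments tables (values are illustrative)
loans_demo = pd.DataFrame({
    'loan_id': [10, 11, 12],
    'client_id': [1, 1, 2],
    'loan_start': pd.to_datetime(['2010-01-01', '2012-01-01', '2011-01-01']),
})
payments_demo = pd.DataFrame({
    'loan_id': [10, 10, 11, 11, 12],
    'payment_amount': [100, 200, 300, 500, 50],
})

# Depth 1: mean payment amount for each loan
mean_payment = (payments_demo.groupby('loan_id')['payment_amount']
                .mean()
                .rename('mean_payment'))
loans_depth1 = loans_demo.merge(mean_payment, left_on='loan_id',
                                right_index=True, how='left')

# Depth 2: for each client, take the value from the most recent loan
# (rows are sorted by loan_start so that .last() picks the latest one)
last_mean_payment = (loans_depth1.sort_values('loan_start')
                     .groupby('client_id')['mean_payment']
                     .last())
print(last_mean_payment)
```

Each level of depth corresponds to one more group-and-aggregate step, which is why deeper features quickly become hard to interpret.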

We can create features of arbitrary depth by stacking more primitives. However, when I have used feature tools I've never gone beyond a depth of 2! After this point, the features become very convoluted to understand. I'd encourage anyone interested to experiment with increasing the depth (maybe for a real problem) and see if there is value to "going deeper".

## Automated Deep Feature Synthesis

The main benefit of using feature tools comes not from specifying aggregations and transformations by hand, but from letting feature tools automatically generate many new features. We can do this by making the same `ft.dfs` function call, but without passing in any primitives. We can just set the `max_depth` parameter and feature tools will automatically try out combinations of its default feature primitives up to the specified depth.

When running on large datasets, this process can take quite a while, but for our example data, it will be relatively quick. For this call, we only need to specify the `entityset`, the `target_entity` (which will again be `clients`), and the `max_depth`. 

In [0]:
# Perform deep feature synthesis without specifying primitives
features, feature_names = ft.dfs(entityset=es, target_entity='clients', 
                                 max_depth = 2)

In [0]:
features.head()

Deep feature synthesis has created 94 new features out of the existing data! While we could have created all of these manually, I am glad to not have to write all that code by hand. A major benefit of feature tools is that it generates features systematically instead of relying on whichever ideas happen to occur to us. Even a human with considerable domain knowledge will be limited by their imagination when making new features (not to mention time). Automated feature engineering is far less constrained by these limits and provides a good starting point for feature creation.

While automatic feature engineering solves one problem, it creates another: too many features! Although it's impossible to say in advance which features will be important to a given machine learning task, it's likely that not all of the features made by feature tools add value. In fact, having too many features is a significant issue in machine learning because it makes training a model much harder. The irrelevant features can drown out the important features, leaving a model unable to learn how to map the features to the target. This problem is known as the "curse of dimensionality" and is addressed through the process of feature reduction, which means removing low-value features from the data. Feature reduction will have to be another topic for another day!

In [0]:
features.shape

# Conclusions

In this notebook, we saw how to apply automated feature engineering to an example dataset. This is a powerful method which allows us to overcome the human limits of time and imagination to create many new features. Feature tools is built on the idea of deep feature synthesis, which means stacking multiple simple feature primitives - aggregations and transformations - to create new features. Feature engineering allows us to combine information across many tables into a single dataframe that we can then use for machine learning model training. The next step after creating all of these features is figuring out which ones are important. 

Feature tools is currently the only Python option for this process, but with the recent emphasis on automating aspects of the machine learning pipeline, competitors will probably enter the sphere. While the exact tools will change, the idea of automatically creating new features out of existing data will grow in importance. Staying up-to-date on methods such as automated feature engineering is crucial in the rapidly changing field of data science. Now go out there and find a problem on which to apply feature tools!