# Automated Feature Engineering

* Often a predictive model's performance is limited by its features — you can tune the model for the best parameters and still not have the best performing model. 
* Identifying and engineering features that clearly demonstrate the predictive signal is paramount to model performance.
* The single biggest technical hurdle that machine learning algorithms must overcome is their need for processed data in order to work — they can only make predictions from numeric data. 
* The process of extracting these numeric features is called “feature engineering.”

## Deep Feature Synthesis (DFS)
* Developed at MIT in 2014.
* Generates many of the same features that a human data scientist would create.

* There are three key concepts in understanding Deep Feature Synthesis:

* **1) Features are derived from relationships between the data points in a dataset.**
    * DFS performs feature engineering for multi-table and transactional datasets commonly found in databases or log files.

* **2) Across datasets, many features are derived by using similar mathematical operations.**
    * Dataset-agnostic operations are called “primitives.”
    
* **3) New features are often composed from previously derived features.**
    * Primitives are the building blocks of DFS. 
    * Because primitives define their input and output types, we can stack them to construct complex features that mimic the ones that humans create today.
    * DFS can apply primitives across relationships between entities, so features can be created from datasets with many tables. 
    * We can control the complexity of the features we create by setting a maximum depth for our search.
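
The stacking idea above can be sketched in plain Python. This is only an illustration, not how Featuretools implements DFS: the toy tables, the column names, and the `aggregate` helper are all hypothetical. A depth-1 feature is computed per session and then reused as the input to a depth-2 feature per customer.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: customers -> sessions -> transactions.
sessions = [
    {"session_id": 1, "customer_id": "a"},
    {"session_id": 2, "customer_id": "a"},
    {"session_id": 3, "customer_id": "b"},
]
transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 30.0},
    {"session_id": 2, "amount": 20.0},
    {"session_id": 3, "amount": 5.0},
]

def aggregate(child_rows, parent_key, value_key, primitive):
    """Apply an aggregation primitive to child values grouped by parent key."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[parent_key]].append(row[value_key])
    return {key: primitive(values) for key, values in groups.items()}

# Depth 1: SUM(transactions.amount) for each session.
session_totals = aggregate(transactions, "session_id", "amount", sum)

# Attach the derived feature to the sessions table so it can be reused.
sessions_with_totals = [
    {**s, "SUM(transactions.amount)": session_totals.get(s["session_id"], 0.0)}
    for s in sessions
]

# Depth 2: MEAN(sessions.SUM(transactions.amount)) for each customer.
customer_features = aggregate(
    sessions_with_totals, "customer_id", "SUM(transactions.amount)", mean
)
print(customer_features)
```

Capping the search at depth 1 would stop after `session_totals`; allowing depth 2 is what lets the second primitive consume the first one's output.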

* A second advantage of primitives: they can be used to quickly enumerate many interesting features in a parameterized fashion.

* Since primitives are defined independently of a specific dataset, any new primitive added to Featuretools can be incorporated into any other dataset that contains the same variable data types. In some cases, this might be a dataset in the same domain, but it could also be for a completely different use case.
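
A rough sketch of why typed primitives transfer across datasets: if each primitive declares the column type it accepts, candidate features can be enumerated from a schema alone. The registry, type names, and schemas below are made up for illustration and are not the Featuretools internals.

```python
# Hypothetical primitive registry: name -> the column type it accepts.
PRIMITIVES = {
    "mean": "numeric",
    "max": "numeric",
    "num_unique": "categorical",
    "percent_true": "boolean",
}

def enumerate_features(table_name, schema):
    """List candidate feature names using only column types, no data."""
    features = []
    for column, column_type in schema.items():
        for primitive, input_type in PRIMITIVES.items():
            if input_type == column_type:
                features.append(f"{primitive.upper()}({table_name}.{column})")
    return features

# Two unrelated, made-up schemas that happen to share column types.
retail = {"amount": "numeric", "store": "categorical"}
sensors = {"temperature": "numeric", "is_alarm": "boolean"}

retail_features = enumerate_features("transactions", retail)
sensor_features = enumerate_features("readings", sensors)
print(retail_features)
print(sensor_features)
```

Adding one entry to `PRIMITIVES` would extend the feature list for every schema with a matching column type, which is the reuse the note above describes.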

* It’s easy to accidentally leak information about what you’re trying to predict into a model.
* DFS can be used to develop baseline models with little human intervention.
* The automation of feature engineering should be thought of as a complement to critical human expertise: it enables data scientists to be more precise and productive.

## Deep Feature Synthesis vs. Deep Learning
* Deep Learning automates feature engineering for images, text, and audio where a large training set is typically required, whereas DFS targets the structured transactional and relational datasets that companies work with.
* The features that DFS generates are more explainable to humans because they are based on combinations of primitives that are easily described in natural language. 
* The transformations in deep learning must be possible through matrix multiplication, while the primitives in DFS can be mapped to any function that a domain expert can describe.
* This increases the accessibility of the technology and offers more opportunities for those who are not experienced machine learning professionals to contribute their own expertise.
* Additionally, while deep learning often requires many training examples to train the complex architectures it needs to work, DFS can start creating potential features based only on the schema of a dataset.
* For many enterprise use cases, enough training examples for deep learning are not available.
* DFS offers a way to begin creating interpretable features for smaller datasets that humans can manually validate.
* Automating feature engineering offers the potential to accelerate the process of applying machine learning to the valuable datasets collected by data science teams today. 
* It will help data scientists to quickly address new problems as they arise and, more importantly, make it easier for those new to data science to develop the skills necessary to apply their own domain expertise.

# Automated Feature Engineering pt 2

* Data is like the crude oil of machine learning: it has to be refined into features (predictor variables) to be useful for training a model. Without relevant features, you can’t train an accurate model, no matter how complex the machine learning algorithm. The process of extracting features from a raw dataset is called feature engineering.
* Feature engineering means building features for each label while filtering the data used for the feature based on the label’s cutoff time to make valid features. These features and labels are then passed to modeling where they will be used for training a machine learning algorithm.
* While feature engineering requires label times, in our general-purpose framework, it is not hard-coded for specific labels corresponding to only one prediction problem.
* Instead, we use APIs like Featuretools that can build features for any set of labels without requiring changes to the code.
* This fits with the principles of our machine learning approach: we segment each step of the pipeline while standardizing inputs and outputs. This independence means we can change the problem in prediction engineering without needing to alter the downstream feature engineering and machine learning code.
* The key to making this step of the machine learning process repeatable across prediction problems is automated feature engineering.
* Traditionally, feature engineering is done by hand, building features one at a time using domain knowledge. However, this manual process is error-prone, tedious, must be started from scratch for each dataset, and ultimately is limited by constraints on human creativity and time. Furthermore, in time-dependent problems where we have to filter every feature based on a cutoff time, it’s hard to avoid errors that can invalidate an entire machine learning solution.
* After solving a few problems with machine learning, it becomes clear that many of the operations used to build features are repeated across datasets.
* We can apply the same basic building blocks — called feature primitives — to different relational datasets to build predictor variables.
* Ultimately, automated feature engineering makes us more efficient as data scientists by removing the need to repeat tedious operations across problems.
* At the time these notes were written, the only open-source Python library for automated feature engineering using multiple tables was Featuretools, developed and maintained by Feature Labs.
* Featuretools requires some background code to link together the tables through relationships, but then we can automatically make features for customer churn using the following code (see notebook for complete details):

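```python
import featuretools as ft

# `es` (the EntitySet) and `cutoff_times` (the label-times dataframe) are
# assumed to have been built earlier in the notebook.

# Primitives for deep feature synthesis
trans_primitives = ['weekend', 'cum_sum', 'day', 'month', 'diff', 'time_since_previous']
agg_primitives = ['sum', 'time_since_last', 'avg_time_between', 'all', 'mode',
                  'num_unique', 'min', 'last', 'mean', 'percent_true',
                  'max', 'std', 'count']

# Perform deep feature synthesis.
# The keyword argument is `cutoff_time` (singular). `target_entity` is the
# pre-1.0 Featuretools name; later releases use `target_dataframe_name`.
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       trans_primitives=trans_primitives,
                                       agg_primitives=agg_primitives,
                                       target_entity='customers',
                                       cutoff_time=cutoff_times)
```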
* This single call to `ft.dfs` gives us over 200 features for each label in `cutoff_times`. Each feature is a combination of feature primitives and is built with only data from before the associated cutoff time.
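
To make the cutoff-time idea concrete, here is a dependency-free sketch with made-up data. It is not the Featuretools implementation: the helper name is hypothetical, and treating rows as visible only when they are strictly before the cutoff is an assumption of this sketch.

```python
from datetime import datetime

# Hypothetical transaction log and label times.
transactions = [
    {"customer_id": "a", "time": datetime(2020, 1, 5), "amount": 10.0},
    {"customer_id": "a", "time": datetime(2020, 2, 5), "amount": 50.0},
    {"customer_id": "b", "time": datetime(2020, 1, 20), "amount": 7.0},
]
cutoff_times = [
    {"customer_id": "a", "cutoff_time": datetime(2020, 2, 1)},
    {"customer_id": "a", "cutoff_time": datetime(2020, 3, 1)},
    {"customer_id": "b", "cutoff_time": datetime(2020, 1, 10)},
]

def features_at_cutoff(customer_id, cutoff_time, rows):
    """Build features from rows strictly before the cutoff (sketch assumption)."""
    visible = [
        r["amount"] for r in rows
        if r["customer_id"] == customer_id and r["time"] < cutoff_time
    ]
    return {
        "COUNT(transactions)": len(visible),
        "SUM(transactions.amount)": sum(visible),
    }

# One feature row per label; the same customer can appear at several cutoffs.
feature_rows = [
    {**label, **features_at_cutoff(label["customer_id"], label["cutoff_time"], transactions)}
    for label in cutoff_times
]
for row in feature_rows:
    print(row)
```

Customer `a` gets different feature values at the two cutoffs because the February purchase is hidden from the first one, which is what keeps later information from leaking into earlier labels.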

* To solve a different problem, rather than rewrite the entire pipeline, we:
* 1) Tweak the prediction engineering code to create new label times
* 2) Input the label times to feature engineering and output features
* 3) Use the features to train a supervised machine learning model
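
The three steps above can be sketched end to end without any libraries. Everything here is hypothetical (the data, the function names, and the toy "model"); the point is only that switching to a new prediction problem changes the argument to step 1 and nothing downstream.

```python
from datetime import datetime

# Hypothetical purchase log; all names and values are made up.
purchases = [
    {"customer_id": "a", "time": datetime(2020, 1, 5), "amount": 40.0},
    {"customer_id": "a", "time": datetime(2020, 2, 5), "amount": 90.0},
    {"customer_id": "b", "time": datetime(2020, 1, 9), "amount": 15.0},
    {"customer_id": "b", "time": datetime(2020, 2, 20), "amount": 5.0},
    {"customer_id": "c", "time": datetime(2020, 1, 15), "amount": 20.0},
    {"customer_id": "c", "time": datetime(2020, 2, 10), "amount": 30.0},
]
CUTOFF = datetime(2020, 2, 1)

def make_label_times(rows, cutoff, threshold):
    """Step 1: label is True if spend on or after the cutoff exceeds threshold."""
    future = {}
    for r in rows:
        if r["time"] >= cutoff:
            future[r["customer_id"]] = future.get(r["customer_id"], 0.0) + r["amount"]
    return [
        {"customer_id": c, "cutoff_time": cutoff, "label": future.get(c, 0.0) > threshold}
        for c in sorted({r["customer_id"] for r in rows})
    ]

def make_features(rows, label_times):
    """Step 2: total spend before each cutoff; knows nothing about the labels."""
    return [
        sum(r["amount"] for r in rows
            if r["customer_id"] == lt["customer_id"] and r["time"] < lt["cutoff_time"])
        for lt in label_times
    ]

def train_threshold_model(features, labels):
    """Step 3: toy model, the midpoint between the two class means."""
    pos = [f for f, y in zip(features, labels) if y]
    neg = [f for f, y in zip(features, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

# Two prediction problems: only the argument to step 1 changes.
models = {}
for threshold in (50.0, 20.0):
    label_times = make_label_times(purchases, CUTOFF, threshold)
    features = make_features(purchases, label_times)
    labels = [lt["label"] for lt in label_times]
    models[threshold] = train_threshold_model(features, labels)
print(models)
```

In a real pipeline step 2 would be the `ft.dfs` call and step 3 a proper learning algorithm; the midpoint rule is a stand-in so the sketch stays runnable on its own.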

# Case Study Notes

* The first step in using Featuretools is to make an EntitySet and add all the entities (tables) to it. An EntitySet is a data structure that holds the tables and the relationships between them. This makes it easier to keep track of all the data in a problem with multiple relational tables.

## Entities

When creating entities from a dataframe, we need to make sure to include:

* `index`: the column that uniquely identifies each observation, or the name to give a newly created index if the table has none.
* `make_index = True`: if there is no index, we need to supply a name under `index` and set this to `True`.
* `time_index`, if present: the time at which the information in the row becomes known. Featuretools will use the `time_index` and the `cutoff_time` to make valid features for each label.
* `variable_types`: in some cases our data will have variables for which we should specify the type. An example would be a boolean that is represented as a float. Declaring the type prevents Featuretools from making features such as the min or max of a True/False variable.
* For this problem these are the only arguments we'll need. There are additional arguments that can be used as shown in the documentation.
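
The bookkeeping behind these arguments can be illustrated with a small stand-in. This is deliberately not the Featuretools API: `make_entity` and `allowed_aggregations` are hypothetical helpers, the type labels are invented, and the sample table is made up. It only shows what each argument is for.

```python
def make_entity(rows, index, make_index=False, time_index=None, variable_types=None):
    """Hypothetical stand-in for registering one table; not the Featuretools API."""
    if make_index:
        # No natural identifier: create one under the supplied name.
        rows = [{index: i, **row} for i, row in enumerate(rows)]
    ids = [row[index] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError(f"index {index!r} must be unique per row")
    if time_index is not None and any(time_index not in row for row in rows):
        raise ValueError(f"time_index {time_index!r} missing from some rows")
    return {"rows": rows, "index": index, "time_index": time_index,
            "variable_types": dict(variable_types or {})}

def allowed_aggregations(entity, column):
    """Pick aggregations from the declared type, else from the Python type."""
    sample = entity["rows"][0][column]
    if entity["variable_types"].get(column) == "boolean" or isinstance(sample, bool):
        return ["percent_true"]
    if isinstance(sample, (int, float)):
        return ["min", "max", "mean"]
    return ["num_unique"]

# Made-up log table with no identifier column and a boolean stored as a float.
logs = [
    {"date": "2017-01-01", "num_songs": 12, "is_cancel": 0.0},
    {"date": "2017-01-02", "num_songs": 3, "is_cancel": 1.0},
]

typed = make_entity(logs, index="log_id", make_index=True, time_index="date",
                    variable_types={"is_cancel": "boolean"})
untyped = make_entity(logs, index="log_id", make_index=True, time_index="date")

print(allowed_aggregations(typed, "is_cancel"))
print(allowed_aggregations(untyped, "is_cancel"))
```

Without the declared type, the float-encoded flag is treated as an ordinary number and picks up min, max, and mean, which is the kind of meaningless feature the `variable_types` argument is meant to prevent.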

* Primitives are dataset-agnostic.

# you're one project away from your next job