# Results: Manual vs Automated Feature Engineering

In this notebook, we compare the manual, semi-automated, and fully automated (Featuretools) feature engineering approaches for the Kaggle Home Credit Default Risk competition. The comparison focuses on two things: time (how long it took to make the features) and performance (the score in cross validation and when submitted to the Kaggle leaderboard).

| Dataset | Total Features before feature selection | Total Features after feature selection | Time Spent (conservative estimate) | CV ROC AUC default model | CV ROC AUC optimized model | Public Leaderboard ROC AUC optimized model |
|---|---|---|---|---|---|---|
| Main after one-hot encoding | 241 | 203 | 15 minutes | 0.754497 (0.00600176) | 0.759222 (0.00522168) | 0.745 |
| Manual Feature Engineering | 271 (30 from manual engineering) | 231 (26 from manual engineering) | 7.5 hours (0.058 features per minute) | 0.772477 (0.00440231) | 0.78036 (0.00449839) | 0.786 |
| Manual + Semi-Automated | 1444 (1173 from semi-auto methods) | 880 (657 from semi-auto methods) | 10 hours (1.095 features per minute) | 0.779672 (0.00566087) | 0.788476 (0.00435772) | 0.791 |
| Fully Automated | 3078 (2820 from featuretools) | 1289 (1104 from featuretools) | 2 hours (9.2 features per minute) | 0.777852 (0.00564101) | 0.786656 (0.00468256) | 0.787 |
  
## Explanation of Result Categories
   
* __Dataset__: refers to the method used to construct the set of features. The baseline set is the main dataframe (`app`) after one-hot encoding categorical variables
* __Total Features before feature selection__: the total number of predictor variables after implementing the method. Numbers in parentheses indicate the features built by that method alone, since each method built on the previous one
* __Total Features after feature selection__: same as the previous column, except counted after feature selection
* __Time Spent__: Total time spent creating the set of features. This is a __conservative__ estimate as it does not include the hundreds of hours spent by other data scientists working on the problem or the hours I personally spent reading about the problem. This refers only to the time I spent actively coding the method.
* __CV ROC AUC default model__: The 5-fold cross validation ROC AUC using the default hyperparameter values of the Gradient Boosting Machine (GBM) implemented with the LightGBM library. The number of estimators was found using 100 rounds of early stopping with 5-fold CV. (The number in parentheses is the standard deviation across the five folds.)
* __CV ROC AUC optimized model__: The 5-fold cross validation ROC AUC using the best hyperparameters from 100 iterations of random search on the respective sample of data. (The number in parentheses is the standard deviation across the five folds.)
* __Public Leaderboard ROC AUC__: The ROC AUC of the GBM model trained on each dataset when its predictions are submitted to the public leaderboard on Kaggle. The GBM model used the optimal hyperparameters and early stopping for the number of estimators. Predictions were made on the testing data and then uploaded to Kaggle, where the Public Leaderboard is calculated using 10% of the total testing observations. The final leaderboard will be made known at the end of the competition.
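
The features-per-minute figures in the Time Spent column appear to be the number of features from each method that remain after feature selection, divided by the minutes spent. That reading is inferred from the table's numbers rather than stated explicitly; the sketch below reproduces the three figures under that assumption. The function name `features_per_minute` is made up for illustration.

```python
# Features kept after selection and hours spent, copied from the results table.
methods = {
    "Manual Feature Engineering": (26, 7.5),
    "Manual + Semi-Automated": (657, 10),
    "Fully Automated": (1104, 2),
}


def features_per_minute(n_features, hours):
    """Features retained after selection per minute of coding time."""
    return n_features / (hours * 60)


rates = {name: features_per_minute(n, h) for name, (n, h) in methods.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.3f} features per minute")
```

The baseline row is left out because its 15 minutes cover only one-hot encoding, and the table reports no rate for it.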
 
# Methodology
    
To assess the features, we want to perform several operations:
    
1. Cross Validation (5 folds) ROC AUC with default GBM model in LightGBM library
2. Cross Validation (5 folds) ROC AUC with best hyperparameters from 100 iterations of random search on data sample
3. Public leaderboard ROC AUC from submitting predictions on testing data to Kaggle
4. Correlations with the label (`TARGET`)
5. Feature importances in the trained model
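
The first three operations all report ROC AUC, which can be read as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up toy values. It is only an illustration of the metric, not the routine used in this notebook, and the pairwise loop is far too slow for the full training data.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair. The cost is quadratic in the
    number of observations, so this is suitable only for illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example with made-up values: 1 is the positive class, 0 the negative.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a sense of scale for the differences between datasets in the results table.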
    
## Random Search
   
The "optimal" hyperparameters of the GBM for each dataset were found by applying 100 iterations of random search to a sample of 10% of each set of training data. Performance was measured by the 5-fold cross validation ROC AUC using early stopping to determine the number of estimators to train. The gradient boosting machine was implemented in LightGBM. In addition to testing with the optimal hyperparameter values, we will assess the cross validation using the default hyperparameters to determine the relative effects of hyperparameter tuning versus feature engineering. 
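
As a rough illustration of the procedure, the sketch below shows the general shape of random search: draw a configuration at random, score it, and keep the best one seen. Everything here is an assumption made for illustration. The search ranges are invented, `sample_hyperparameters` and `random_search` are hypothetical names, and `toy_objective` is a stand-in so the code runs without data. In the setting described above, the objective would instead be the 5-fold cross validation ROC AUC of a LightGBM model with early stopping on the 10% sample.

```python
import random


def sample_hyperparameters(rng):
    """Draw one GBM configuration; the ranges are illustrative only."""
    return {
        "num_leaves": rng.randint(20, 150),
        "learning_rate": 10 ** rng.uniform(-3, -0.5),  # log-uniform draw
        "subsample": rng.uniform(0.5, 1.0),
        "colsample_bytree": rng.uniform(0.6, 1.0),
    }


def random_search(objective, n_iter=100, seed=50):
    """Return the best (params, score) over n_iter random draws.

    `objective` maps a hyperparameter dict to a score where higher is better.
    """
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = sample_hyperparameters(rng)
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score


def toy_objective(params):
    """Stand-in for a cross validation score; peaks at arbitrary values."""
    return (-abs(params["learning_rate"] - 0.05)
            - abs(params["num_leaves"] - 60) / 1000)


best_params, best_score = random_search(toy_objective)
print(best_params, best_score)
```

Fixing the seed makes the search reproducible, which matters when the same procedure is repeated on several feature sets and the results are compared.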

### Roadmap 

To apply the same operations to each dataset, we use a function (`evaluate`, imported from `utils`) that calculates the five metrics above. This function takes in the feature matrix and the hyperparameter tuning results and returns the five metrics.

In [1]:
import pandas as pd
import numpy as np

import lightgbm as lgb

# Evaluating dictionary
import ast

# Utilities developed in previous notebooks
from utils import format_data, plot_feature_importances, evaluate

RSEED = 50



# Compare Number of Features

The first order of business is to compare the number of features created by each method that remain after feature selection. We can do this by loading in the first row of the data and comparing the column names.

In [3]:
# Load the raw application data; the test set has no label, so add a placeholder
train = pd.read_csv('../input/application_train.csv')
test = pd.read_csv('../input/application_test.csv')
test['TARGET'] = np.nan

# One-hot encode and keep only the columns present in both sets
train, test = pd.get_dummies(train).align(pd.get_dummies(test), axis = 1, join = 'inner')

# Read only the first row of each selected feature set to get its column names
original_features = list(train.columns)
manual_features = [x for x in pd.read_csv('../input/features_manual_selected.csv', nrows = 1).columns if x not in original_features]
auto_features = [x for x in pd.read_csv('../input/feature_matrix_select.csv', nrows = 1).columns if x not in original_features and x not in manual_features]
semi_features = [x for x in pd.read_csv('../input/features_semi_selected.csv', nrows = 1).columns if x not in original_features and x not in manual_features]


In [4]:
print("There were originally {} features.".format(len(original_features) - 2))
print("{} Manual Features remained after feature selection.".format(len(manual_features)))
print("{} Automated Features remained after feature selection.".format(len(auto_features)))
print("{} Semi-Automated Features remained after feature selection.".format(len(semi_features)))

There were originally 241 features.
26 Manual Features remained after feature selection.
859 Automated Features remained after feature selection.
657 Semi-Automated Features remained after feature selection.


# Default Features

In [None]:
fm = pd.read_csv('../input/features_default_selected.csv')
hyp_results = pd.read_csv('../results/rs_features_default_sample_finished.csv', index_col=0)

results, feature_importances, submission = evaluate(fm, hyp_results)
results

Number of features:  203


In [None]:
results.to_csv('../results/default_results.csv', index = False)
submission.to_csv('../submissions/default_submission.csv', index = False)

norm_feature_importances = plot_feature_importances(feature_importances)
norm_feature_importances.to_csv('../results/default_fi.csv', index = False)
norm_feature_importances.head(15)

This model scores __0.745__ when submitted.

# Manual Engineered Features

In [None]:
fm = pd.read_csv('../input/features_manual_selected.csv')
hyp_results = pd.read_csv('../results/rs_features_manual_sample_finished.csv', index_col=0)

results, feature_importances, submission = evaluate(fm, hyp_results)
results

## Manual Feature Importances

In [None]:
results.to_csv('../results/manual_results.csv', index = False)
submission.to_csv('../submissions/manual_submission.csv', index = False)

norm_feature_importances = plot_feature_importances(feature_importances)
norm_feature_importances.to_csv('../results/manual_fi.csv', index = False)
norm_feature_importances.head(15)

These features score __0.786__ when submitted.

# Automated Features Using Featuretools

In [None]:
fm = pd.read_csv('../input/feature_matrix_selected.csv')
hyp_results = pd.read_csv('../results/rs_feature_matrix_sample_finished.csv', index_col=0)

results, feature_importances, submission = evaluate(fm, hyp_results)
results

## Automated Feature Importances

In [None]:
results.to_csv('../results/auto_results.csv', index = False)
submission.to_csv('../submissions/auto_submission.csv', index = False)

norm_feature_importances = plot_feature_importances(feature_importances)
norm_feature_importances.to_csv('../results/auto_fi.csv', index = False)
norm_feature_importances.head(15)

These features score __0.787__ when submitted.

# Semi-Automated Features

In [None]:
fm = pd.read_csv('../input/features_semi_selected.csv')
hyp_results = pd.read_csv('../results/rs_features_semi_sample_finished.csv', index_col=0)

results, feature_importances, submission = evaluate(fm, hyp_results)
results

## Semi-Automated Feature Importances

In [None]:
results.to_csv('../results/semi_results.csv', index = False)
submission.to_csv('../submissions/semi_submission.csv', index = False)

norm_feature_importances = plot_feature_importances(feature_importances)
norm_feature_importances.to_csv('../results/semi_fi.csv', index = False)
norm_feature_importances.head(15)

This dataset scores __0.791__ when submitted to the competition.

# Conclusions

With the results presented above, this project is now complete. What we have learned is that with about 2 hours of work, Featuretools gets us to a level (0.787 on the public leaderboard) comparable with the 7.5 to 10 hours of manual and semi-automated feature engineering (0.786 and 0.791) on a real-world data science problem.