# Results: Manual vs Automated Feature Engineering

In this notebook, we compare the manual, semi-automated, and fully automated (Featuretools) feature engineering approaches for the Kaggle Home Credit Default Risk competition. The comparison focuses on two things: time (how long it took to build the features) and performance (the ROC AUC in cross validation and when submitted to the Kaggle leaderboard).
  
## Explanation of Result Categories
   
* __Method__: the method used to construct the set of features. The baseline set is the main dataframe (`app`) after one-hot encoding categorical variables.
* __Total Features__: the total number of predictor variables after implementing the method. Numbers in parentheses indicate the features built by that method alone, since each method builds on the previous one.
* __Time Spent__: the total time spent creating the set of features. This is a __conservative__ estimate: it does not include the hundreds of hours spent by other data scientists working on the problem or the hours I personally spent reading about the problem. It covers only the time I spent actively coding the technique.
* __CV ROC AUC default model__: the 5-fold cross validation ROC AUC using the default hyperparameter values of the Gradient Boosting Machine (GBM) implemented in the LightGBM library. The number of estimators was found using 100 rounds of early stopping with the 5-fold cv.
* __Public Leaderboard ROC AUC default model__: the ROC AUC of the GBM model's predictions when submitted to the public leaderboard on Kaggle. The model used the same default hyperparameters, with the number of estimators taken from the cv early stopping results. Predictions were made on the testing data and then uploaded to Kaggle, where the public leaderboard is calculated using 10% of the total testing observations. The final leaderboard will be made known at the end of the competition.
* __CV ROC AUC optimized model__: the 5-fold cross validation ROC AUC using the best hyperparameters from 100 iterations of random search.
* __Public Leaderboard ROC AUC optimized model__: the ROC AUC when predictions are submitted to the Kaggle competition using the best hyperparameters from random search.
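
Every score above is a ROC AUC, which can be read as the probability that a randomly chosen positive example (`TARGET` = 1) receives a higher predicted score than a randomly chosen negative example (`TARGET` = 0). The sketch below spells out that definition in plain Python. The function name `roc_auc` and the toy labels and scores are made up for illustration; in practice a library routine computes the same quantity far more efficiently.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The double loop is O(n_pos * n_neg),
    so this is only meant for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why small differences in the third decimal place can still separate feature sets.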
 
# Methodology
    
To assess the features, we want to perform several operations:
    
1. Cross validation (5 folds) ROC AUC with the default GBM model in the LightGBM library
2. Cross validation (5 folds) ROC AUC with the best hyperparameters from 100 iterations of random search on a sample of the data
3. Public leaderboard ROC AUC from submitting predictions on the testing data to Kaggle
4. Correlations with the label (`TARGET`)
5. Feature importances in the trained model
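
As a rough sketch of step 4, the correlations with the label can be computed with pandas on the rows where `TARGET` is known. The column names `TARGET` and `SK_ID_CURR` come from this notebook; the helper name `target_correlations` and the toy dataframe are assumptions made for illustration only.

```python
import pandas as pd


def target_correlations(features, target="TARGET", id_col="SK_ID_CURR"):
    """Pearson correlation of each numeric feature with the label,
    ordered by absolute value (strongest first)."""
    # Only the training rows carry a label
    train = features[features[target].notnull()]
    # Keep numeric columns and drop the identifier, which is not a feature
    numeric = train.select_dtypes(include="number").drop(columns=[id_col], errors="ignore")
    corrs = numeric.drop(columns=[target]).corrwith(numeric[target])
    order = corrs.abs().sort_values(ascending=False).index
    return corrs.reindex(order)


# Hypothetical feature matrix: four labelled rows and two unlabelled test rows
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4, 5, 6],
    "TARGET": [0, 0, 1, 1, float("nan"), float("nan")],
    "feat_up": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feat_flat": [1.0, -1.0, -1.0, 1.0, 0.0, 0.0],
})
toy_corrs = target_correlations(toy)
print(toy_corrs)
```

Sorting by absolute value keeps strongly negative correlations near the top, since they are just as informative as strongly positive ones.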
    
## Random Search
   
The "optimal" hyperparameters of the GBM for each dataset were found by applying 100 iterations of random search to a 10% sample of each set of training data. Performance was measured by the 5-fold cross validation ROC AUC, using early stopping to determine the number of estimators to train. The gradient boosting machine was implemented in LightGBM. In addition to testing with the optimal hyperparameter values, we will assess cross validation performance with the default hyperparameters to determine the relative effects of hyperparameter tuning versus feature engineering.
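
Random search draws each candidate set of hyperparameters at random, scores it, and keeps the best one. The sketch below shows the idea using only the standard library. The grid `toy_grid` and the scoring function `toy_score` are invented stand-ins: in the actual search the score would be the 5-fold cross validation ROC AUC of a LightGBM model on the 10% sample, and the real search space is not shown in this notebook. The `score` and `params` keys mirror the columns that the tuning results are expected to have, with `params` stored as a string.

```python
import ast
import random


def random_search(param_grid, score_fn, n_iter=100, seed=50):
    """Sample `n_iter` random configurations and return them best first."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and uniformly
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        rows.append({"score": score_fn(params), "params": str(params)})
    # Higher score is better, so sort in descending order
    rows.sort(key=lambda row: row["score"], reverse=True)
    return rows


# Hypothetical search space (not the one used for the results in this notebook)
toy_grid = {"num_leaves": [15, 31, 63], "learning_rate": [0.01, 0.05, 0.1]}


def toy_score(params):
    """Stand-in for a cross validation ROC AUC."""
    return 0.7 + 0.001 * params["num_leaves"] ** 0.5 - abs(params["learning_rate"] - 0.05)


search_rows = random_search(toy_grid, toy_score, n_iter=100)

# The stored string is turned back into a dictionary before use
best_params = ast.literal_eval(search_rows[0]["params"])
print(search_rows[0]["score"], best_params)
```

Storing the parameters as a string is what makes a call such as `ast.literal_eval` necessary when the results are read back from a CSV file.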

### Roadmap 

To apply the same operations to the three datasets, we create a function that calculates the metrics above. It takes in the feature matrix and the hyperparameter tuning results and returns the cross validation scores, the feature importances, and a submission dataframe to upload to Kaggle.
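
The number of estimators used throughout comes from early stopping: boosting rounds are added until the cross validation ROC AUC has not improved for a set number of rounds, and the score history is then cut back to the best round. The sketch below mimics that logic on a made-up list of scores. The name `truncate_at_best` and the toy values are illustrative, and this is a simplified stand-in rather than LightGBM's own implementation.

```python
def truncate_at_best(scores, patience=100):
    """Return the score history up to and including the best round,
    stopping the scan once `patience` rounds pass without improvement."""
    best_score = float("-inf")
    best_round = 0  # 1-indexed round with the best score so far
    for round_number, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_number
        elif round_number - best_round >= patience:
            # No improvement for `patience` rounds: stop training
            break
    return scores[:best_round]


# Hypothetical per-round cross validation scores
toy_history = [0.70, 0.72, 0.75, 0.74, 0.73, 0.76]
kept = truncate_at_best(toy_history, patience=2)
n_estimators = len(kept)
print(kept, n_estimators)  # [0.7, 0.72, 0.75] 3
```

With a patience of 2 the scan stops at round 5 and never sees the later 0.76, which illustrates the trade-off behind the patience setting. It is also why the code in this notebook reads the number of estimators as the length of `cv_results['auc-mean']`.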

In [1]:
import pandas as pd
import numpy as np

import lightgbm as lgb

import ast

from utils import format_data, plot_feature_importances

RSEED = 50

In [2]:
hyp_results = pd.read_csv('../results/rs_feature_matrix_sample.csv_finished.csv', index_col=0)

In [None]:
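# Sort the tuning results so the best score comes first
hyp_results = hyp_results.sort_values('score', ascending = False).reset_index(drop = True)

# The hyperparameters are stored as a string; parse them back into a dictionary
best_hyp = ast.literal_eval(hyp_results.loc[0, 'params'])
best_random_score = hyp_results.loc[0, 'score']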

In [None]:
def format_data(features):
    """Format a set of training and testing features joined together
       into separate sets for machine learning"""
    
    train = features[features['TARGET'].notnull()].copy()
    test = features[features['TARGET'].isnull()].copy()
    
    train_labels = np.array(train['TARGET'].astype(np.int32)).reshape((-1, ))
    test_ids = list(test['SK_ID_CURR'])
    
    train = train.drop(columns = ['TARGET', 'SK_ID_CURR'])
    test = test.drop(columns = ['TARGET', 'SK_ID_CURR'])
    
    feature_names = list(train.columns)
    
    return train, train_labels, test, test_ids, feature_names

In [3]:
fm = pd.read_csv('../input/features_manual_selected.csv')

# Needed by the step-by-step cells below (`evaluate` repeats these steps internally)
train, train_labels, test, test_ids, feature_names = format_data(fm)
train_set = lgb.Dataset(train, label = train_labels)

In [None]:
results = pd.DataFrame(columns = ['default_auc', 'default_auc_std', 
                                  'opt_auc', 'opt_auc_std', 
                                  'random_search_auc'], index = [0])

In [None]:
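model = lgb.LGBMClassifier()
default_hyp = model.get_params()

# n_estimators is found through early stopping;
# `silent` may be absent depending on the LightGBM version
default_hyp.pop('n_estimators', None)
default_hyp.pop('silent', None)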

cv_results = lgb.cv(default_hyp, train_set, nfold = 5, num_boost_round = 10000, early_stopping_rounds = 100, 
                    metrics = 'auc', seed = RSEED)

In [None]:
default_auc = cv_results['auc-mean'][-1]
default_auc_std = cv_results['auc-stdv'][-1]

In [None]:
# n_estimators is found through early stopping (pop is safe if the cell is re-run)
best_hyp.pop('n_estimators', None)

cv_results = lgb.cv(best_hyp, train_set, nfold = 5, num_boost_round = 10000, early_stopping_rounds = 100, 
                    metrics = 'auc', seed = RSEED)

opt_auc = cv_results['auc-mean'][-1]
opt_auc_std = cv_results['auc-stdv'][-1]

In [None]:
opt_n_estimators = len(cv_results['auc-mean'])
model = lgb.LGBMClassifier(n_estimators = opt_n_estimators, **best_hyp)

In [None]:
results.loc[0, 'default_auc'] = default_auc
results.loc[0, 'default_auc_std'] = default_auc_std
results.loc[0, 'random_search_auc'] = best_random_score
results.loc[0, 'opt_auc'] = opt_auc
results.loc[0, 'opt_auc_std'] = opt_auc_std

results

In [None]:
# Fit on whole training set
model.fit(train, train_labels)

# Make predictions on testing data
preds = model.predict_proba(test)[:, 1]

# Make submission dataframe
submission = pd.DataFrame({'SK_ID_CURR': test_ids, 
                           'TARGET': preds})

feature_importances = pd.DataFrame({'feature': feature_names,
                                    'importance': model.feature_importances_})

In [4]:
def evaluate(fm, hyp_results):
    """Evaluate a feature matrix using the hyperparameter tuning results.
    
    Parameters:
        fm (dataframe): feature matrix with observations in the rows and features in the columns. This will
                        be passed to `format_data` and hence must have a train set where the `TARGET` values are 
                        not null and a test set where `TARGET` is null. Must also have the `SK_ID_CURR` column.
        
        hyp_results (dataframe): results from hyperparameter tuning. Must have column `score` (where higher is better)
                                 and `params` holding the model hyperparameters
                                 
    Returns:
        results (dataframe): the cross validation roc auc from the default hyperparameters and the 
                             optimal hyperparameters
        
        feature_importances (dataframe): feature importances from the gradient boosting machine. Columns are 
                                          `feature` and `importance`. This can be used in `plot_feature_importances`.
                                          
        submission (dataframe): Predictions which can be submitted to the Kaggle Home Credit competition. Save
                                these as `submission.to_csv("filename.csv", index = False)` and upload
       """
    
    # Format the feature matrix 
    train, train_labels, test, test_ids, feature_names = format_data(fm)
    
    # Training set 
    train_set = lgb.Dataset(train, label = train_labels)

    # Dataframe to hold results
    results = pd.DataFrame(columns = ['default_auc', 'default_auc_std', 
                                      'opt_auc', 'opt_auc_std', 
                                      'random_search_auc'], index = [0])

    # Create a default model and find the hyperparameters
    model = lgb.LGBMClassifier()
    default_hyp = model.get_params()
    
    # Remove n_estimators because this is found through early stopping;
    # `silent` may be absent depending on the LightGBM version
    default_hyp.pop('n_estimators', None)
    default_hyp.pop('silent', None)

    # Cross validation with default hyperparameters
    default_cv_results = lgb.cv(default_hyp, train_set, nfold = 5, num_boost_round = 10000, early_stopping_rounds = 100, 
                                metrics = 'auc', seed = RSEED)
    
    default_auc = default_cv_results['auc-mean'][-1]
    default_auc_std = default_cv_results['auc-stdv'][-1]
    
    # Locate the optimal hyperparameters
    hyp_results = hyp_results.sort_values('score', ascending = False).reset_index(drop = True)
    best_hyp = ast.literal_eval(hyp_results.loc[0, 'params'])
    best_random_score = hyp_results.loc[0, 'score']

    # n_estimators is found through early stopping
    best_hyp.pop('n_estimators', None)

    # Cross validation with best hyperparameter values
    opt_cv_results = lgb.cv(best_hyp, train_set, nfold = 5, num_boost_round = 10000, early_stopping_rounds = 100, 
                            metrics = 'auc', seed = RSEED)

    opt_auc = opt_cv_results['auc-mean'][-1]
    opt_auc_std = opt_cv_results['auc-stdv'][-1]
    
    # Insert results into dataframe
    results.loc[0, 'default_auc'] = default_auc
    results.loc[0, 'default_auc_std'] = default_auc_std
    results.loc[0, 'random_search_auc'] = best_random_score
    results.loc[0, 'opt_auc'] = opt_auc
    results.loc[0, 'opt_auc_std'] = opt_auc_std
    
    # Extract the optimum number of estimators
    opt_n_estimators = len(opt_cv_results['auc-mean'])
    model = lgb.LGBMClassifier(n_estimators = opt_n_estimators, **best_hyp)
    
    # Fit on whole training set
    model.fit(train, train_labels)

    # Make predictions on testing data
    preds = model.predict_proba(test)[:, 1]

    # Make submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 
                               'TARGET': preds})

    # Make feature importances dataframe
    feature_importances = pd.DataFrame({'feature': feature_names,
                                        'importance': model.feature_importances_})

    return results, feature_importances, submission

In [None]:
results, feature_importances, submission = evaluate(fm, hyp_results)
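
Once `evaluate` has run, the returned `feature_importances` dataframe (columns `feature` and `importance`) can be ranked before plotting or inspection. The sketch below is one possible way to do that. The helper `rank_importances`, the added column names, and the toy values are assumptions for illustration; this is not the `plot_feature_importances` helper from `utils`, whose implementation is not shown here.

```python
import pandas as pd


def rank_importances(feature_importances):
    """Sort features by importance and add normalized and cumulative shares."""
    ranked = (feature_importances
              .sort_values("importance", ascending=False)
              .reset_index(drop=True))
    total = ranked["importance"].sum()
    # Guard against a model that assigned no importance to any feature
    ranked["normalized"] = ranked["importance"] / total if total > 0 else 0.0
    ranked["cumulative"] = ranked["normalized"].cumsum()
    return ranked


# Hypothetical importances in the same layout that `evaluate` returns
toy_importances = pd.DataFrame({
    "feature": ["feat_a", "feat_b", "feat_c"],
    "importance": [10, 60, 30],
})
ranked = rank_importances(toy_importances)
print(ranked)
```

The cumulative column gives a quick way to see how many features account for most of the total importance in each feature set.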