## Feature Selection for Tabular Data

The purpose of this notebook is to demonstrate how to select important features and prune unimportant ones before training a machine learning model. Removing uninformative features can improve prediction performance.

#### Prerequisite
This notebook is a sequel to the [01_preprocessing_tabular_data.ipynb](01_preprocessing_tabular_data.ipynb) notebook. Before running this notebook, run [01_preprocessing_tabular_data.ipynb](01_preprocessing_tabular_data.ipynb) to preprocess the data used in this notebook. 

#### Notes
In this notebook, we use the sklearn framework for data partitioning and `storemagic` to share dataframes with [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb). While we load data into memory here, note that it is possible to skip this and load your partitioned data directly to an S3 bucket.

#### Tabular Data Sets
* [California housing data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)
* [diabetes data](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)


#### Library Dependencies:
* sagemaker >= 2.0.0
* numpy 
* pandas
* plotly
* sklearn 
* matplotlib 
* seaborn
* xgboost

### Setting up the notebook

In [None]:
import os
import sys
import plotly.express as px
import plotly.offline as pyo
import plotly.graph_objs as go
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import ast
from matplotlib import pyplot

## sklearn dependencies
from sklearn.datasets import make_regression

import sklearn.model_selection
from sklearn.neighbors import KNeighborsRegressor
from sklearn.inspection import permutation_importance

!{sys.executable} -m pip install -qU 'xgboost'
import xgboost
from xgboost import XGBRegressor

## SageMaker dependencies
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.image_uris import retrieve

## This instantiates a SageMaker session that we will be operating in. 
session = sagemaker.Session()

## This object represents the IAM role that we are assigned.
role = sagemaker.get_execution_role()
print(role)

### Step 1: Load Relevant Variables from 01_preprocessing_tabular_data.ipynb (Required for this notebook)
Here we load our training, test, and validation data sets. We preprocessed this data in [01_preprocessing_tabular_data.ipynb](01_preprocessing_tabular_data.ipynb) and persisted it using `storemagic`.

In [None]:
# Load the dataframes persisted by 01_preprocessing_tabular_data.ipynb that this notebook requires
%store -r X_train
%store -r X_test
%store -r X_val

### Step 2: Computing Feature Importance Scores to Select Features
We show two approaches for computing an importance score for each feature. Ranking the features by these scores lets us prune the unimportant ones, which can yield a better-performing model.

The first approach uses XGBoost, and the second uses permutation feature importance.

### Step 2a: Ranking features by Feature Importance using XGBoost
Here we use gradient boosting to extract an importance score for each feature. Each score tells us how useful the feature was for constructing the boosted decision trees, so the scores can be ranked and compared with one another for feature selection.

In [None]:
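# Note: make_regression generates a synthetic regression problem with the same shape as X_train,
# so the scores below describe the synthetic columns rather than the real features.
# To score the real features, fit the model on X_train and its labels instead.
X_data, y_label = make_regression(n_samples=X_train.shape[0], n_features=X_train.shape[1], n_informative=10, random_state=1)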
xgboost_model = XGBRegressor()
xgboost_model.fit(X_data, y_label)

feature_importances_xgboost = xgboost_model.feature_importances_
for index, importance_score in enumerate(feature_importances_xgboost):
    print('Feature: {}, Score: {}'.format(X_train.columns[index], importance_score))
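
As a hedged illustration of the same idea on labelled data, the sketch below fits gradient-boosted trees on a small hypothetical dataset and attaches each score to its column name. It deliberately swaps in scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`; the data, the column names `f0` to `f4`, and the variable names are assumptions made for this example only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical labelled data: with shuffle=False the informative columns come first,
# so 'f0' and 'f1' drive the target and 'f2'..'f4' are noise.
X_arr, y_demo = make_regression(
    n_samples=200, n_features=5, n_informative=2, shuffle=False, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=['f{}'.format(i) for i in range(5)])

# Stand-in for XGBRegressor: scikit-learn's gradient-boosted trees also expose a
# feature_importances_ attribute (impurity-based, normalised to sum to 1).
gb_model = GradientBoostingRegressor(random_state=0)
gb_model.fit(X_demo, y_demo)

# Index the scores by column name so each score stays attached to the feature it describes.
tree_importances = pd.Series(gb_model.feature_importances_, index=X_demo.columns)
print(tree_importances.sort_values(ascending=False))
```

Because the model is fitted on the same dataframe whose column names label the scores, the ranking refers to the actual features.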

In [None]:
def create_bar_plot(feature_importances, X_train):
    '''
    Create a bar plot of features against their corresponding feature importance score. 
    '''
    x_indices = list(range(len(feature_importances)))
    plt.figure(figsize = (15, 5))
    plt.bar(x_indices, feature_importances, color='blue')
    plt.xticks(x_indices, X_train.columns)
    plt.xlabel('Feature', fontsize=18)
    plt.ylabel('Importance Score', fontsize=18)
    plt.title('Feature Importance Scores', fontsize=18)
    plt.show()

In [None]:
create_bar_plot(feature_importances_xgboost, X_train)

In the following cell, we rank the features by their importance scores.

In [None]:
def show_ranked_feature_importance_list(scores, data):
    '''
    Prints the features ranked by their corresponding importance score. 
    '''
    lst = list(zip(data.columns, scores))
    ranked_lst = sorted(lst, key= lambda t: t[1], reverse=True)
    print(pd.DataFrame(ranked_lst, columns=['Feature', 'Importance Score']))
    

In [None]:
show_ranked_feature_importance_list(feature_importances_xgboost, X_train)

### Step 2b: Ranking features by Permutation Feature Importance using the Scikit-learn k-NN Algorithm
This approach is commonly used for selecting features in tabular data. We first train a model; in this example we use the k-nearest-neighbours algorithm. We then randomly shuffle the values of a single feature and re-score the fitted model. The permutation feature importance score is the decrease in the model's score when that feature is shuffled, which represents how dependent the model is on the feature. The shuffle can be repeated several times per feature with different permutations, and the resulting scores averaged.
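
To make the mechanics concrete, here is a minimal from-scratch sketch of a single shuffle per feature. It assumes a hypothetical three-column dataset and uses ordinary least squares as a simple stand-in for k-NN; none of these names come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the target depends strongly on column 0, weakly on column 1,
# and not at all on column 2.
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

# Step 1: fit the model once (ordinary least squares as a simple stand-in for k-NN).
coef, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

def neg_mse(X, y):
    """Score where higher is better, mirroring 'neg_mean_squared_error'."""
    return -np.mean((X @ coef - y) ** 2)

baseline_score = neg_mse(X_demo, y_demo)

# Step 2: shuffle one column at a time, re-score the already fitted model,
# and record how far the score drops.
manual_importances = []
for col in range(X_demo.shape[1]):
    X_shuffled = X_demo.copy()
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    manual_importances.append(baseline_score - neg_mse(X_shuffled, y_demo))

print(manual_importances)
```

The model is never refitted: only the input column changes, so a large drop in score indicates that the fitted model relies on that column.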

In [None]:
# Note: this is synthetic data shaped like X_train, so the scores describe the synthetic columns.
X_data, y_label = make_regression(n_samples=X_train.shape[0], n_features=X_train.shape[1], n_informative=10, random_state=1)
k_nn_model = KNeighborsRegressor()
k_nn_model.fit(X_data, y_label)
feature_importances_permutations = permutation_importance(k_nn_model, X_data, y_label, scoring='neg_mean_squared_error').importances_mean

for index, importance_score in enumerate(feature_importances_permutations):
    print('Feature: {}, Score: {}'.format(X_train.columns[index], importance_score))
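
The sketch below shows one possible variation, under stated assumptions: it scores a k-NN model on a held-out split rather than on the data it was fitted on, sets `n_repeats` and `random_state` explicitly, and reports the spread across repeats via `importances_std`. The dataset and variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data; shuffle=False keeps the two informative columns first.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, n_informative=2, shuffle=False, random_state=1
)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(X_demo, y_demo, random_state=1)

knn_demo = KNeighborsRegressor().fit(X_fit, y_fit)

# n_repeats controls how many shuffles are averaged per feature;
# random_state makes the shuffles reproducible.
result = permutation_importance(
    knn_demo, X_holdout, y_holdout,
    scoring='neg_mean_squared_error', n_repeats=10, random_state=1,
)

for col, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print('Column {}: {:.2f} +/- {:.2f}'.format(col, mean, std))
```

A score whose standard deviation is comparable to its mean is hard to distinguish from noise, which is worth keeping in mind when choosing a pruning threshold.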


In [None]:
create_bar_plot(feature_importances_permutations, X_train)

In [None]:
show_ranked_feature_importance_list(feature_importances_permutations, X_train)

### Step 3: Prune Unimportant Features
Thus far, we have discussed two common approaches for obtaining a ranked list of feature importance scores. From these lists we can identify unimportant features and eliminate them from our training, validation, and test sets. For example, if feature A has a higher importance score than feature B, then feature A is more important than feature B. Note that both approaches base the removal of features on the dataset alone, independent of the problem domain.

After selecting your desired approach, move on to the next cell to prune features whose importance score is less than or equal to a threshold value. A suitable `threshold` value depends on the approach you choose and on the distribution of scores.

In this example, we select the first approach with XGBoost and set the threshold value to 0.01.

In [None]:
threshold = 0.01

In [None]:
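def remove_features(lst, data, threshold):
    '''
    Remove features from data, in place, if their importance score in lst is less than or equal to threshold.
    '''
    # Pair each column with its score and collect the columns at or below the threshold.
    features_to_remove = [feature for feature, score in zip(data.columns, lst) if score <= threshold]

    if features_to_remove:
        # inplace=True is required; without it, drop returns a new dataframe and data is left unchanged.
        data.drop(columns=features_to_remove, inplace=True)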

Set `lst` to `feature_importances_xgboost` or `feature_importances_permutations` to use the ranked list from XGBoost or permutation feature importance, respectively.

We remove all features whose scores are at or below `threshold` from our training data, `X_train`, validation data, `X_val`, and testing data, `X_test`.

In [None]:
remove_features(lst=feature_importances_xgboost, data=X_train, threshold=threshold)
remove_features(lst=feature_importances_xgboost, data=X_val, threshold=threshold)
remove_features(lst=feature_importances_xgboost, data=X_test, threshold=threshold)
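
An alternative, sketched here under assumptions, is to decide the surviving columns once and apply the same list to every split without mutating the inputs. The helper `select_columns`, the scores, and the tiny dataframes are hypothetical and serve only as illustration.

```python
import pandas as pd

def select_columns(scores, columns, threshold):
    """Return the column names whose score is strictly greater than threshold."""
    return [col for col, score in zip(columns, scores) if score > threshold]

# Hypothetical scores and splits, for illustration only.
demo_scores = [0.60, 0.005, 0.39, 0.005]
train_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})
val_df = pd.DataFrame({'a': [9], 'b': [10], 'c': [11], 'd': [12]})

# Decide once, from the training columns, then apply the same list to every split.
kept = select_columns(demo_scores, train_df.columns, threshold=0.01)
train_pruned = train_df[kept]
val_pruned = val_df[kept]

print(kept)  # ['a', 'c']
```

Selecting by an explicit list of names keeps the splits aligned even if their column order differs, and it leaves the original dataframes available for comparison.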

### Step 4: Store Variables using `storemagic`
After pruning the unimportant features, use `storemagic` to persist all relevant variables so that they can be reused in the next notebook, [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb), where we focus on model training.




In [None]:
# Using storemagic we persist the variables below so we can access them in 03_training_model_on_tabular_data.ipynb
%store X_train
%store X_test
%store X_val