# Module 6 - Feature Selection

Feature selection is the process of choosing which features to use to answer your central question. Why would anyone want to limit the information available to them? Think of Ockham's razor: when presented with competing hypotheses about the same prediction, one should select the solution with the fewest assumptions. In short, "the simplest explanation is usually the best one". This concept of frugality applied to describing nature is what we call parsimony. In practice, we aim to develop models with the fewest features.

The advantages are that models train faster, are less prone to overfitting, and are often more accurate. In this exercise we will apply various feature selection schemes to the Mobile Price Classification dataset distributed with this notebook to examine how they affect model performance.

## About the Mobile Price dataset
1. The data is already tidy and partitioned into training and testing csv files.
2. There are 2000 observations in the training set and 1000 in the testing set.
3. Each observation consists of 20 phone features (columns) and one categorical label (final column) describing the phone's price range.
4. This is a classification problem, but for our purposes it is an exercise in feature selection.

### Data description
| Feature | Description |
| ------- | ----------- |
| battery_power | Total energy the battery can store at one time, measured in mAh |
| blue | Has Bluetooth or not |
| clock_speed | Speed at which the microprocessor executes instructions |
| dual_sim | Has dual sim support or not |
| fc | Front Camera megapixels |
| four_g | Has 4G or not |
| int_memory | Internal Memory in Gigabytes |
| m_dep | Mobile Depth in cm |
| mobile_wt | Weight of mobile phone |
| n_cores | Number of cores of the processor |
| pc | Primary Camera megapixels |
| px_height | Pixel Resolution Height |
| px_width | Pixel Resolution Width |
| ram | Random Access Memory in MegaBytes |
| sc_h | Screen Height of mobile in cm |
| sc_w | Screen Width of mobile in cm |
| talk_time | the longest time that a single battery charge will last when you are |
| three_g | Has 3G or not |
| touch_screen | Has touch screen or not |
| wifi | Has wifi or not |
| price_range | Target variable with values 0 (low cost), 1 (medium cost), 2 (high cost), and 3 (very high cost) |

## Setup
Let's get all the requirements sorted before we move on to the exercise. Most packages should be familiar at this point. Numpy, pandas, matplotlib, and seaborn were all introduced in Part I of the workshop in modules 1-3, and last week in module 5 we introduced tableone. Notice that today we will be using sklearn for the first time to do some machine learning. Don't worry too much about the models we'll be using or how to train them for now. This will be the topic for modules 7 & 8.

In [None]:
# Requirements
!pip install --upgrade ipykernel
!pip install pandas
!pip install numpy
!pip install tableone
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install boruta

# Globals
seed = 1017

#imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from tableone import TableOne
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

#magic
%matplotlib inline

## What question am I answering?
Well, we want to demonstrate the utility of feature selection. I think a convincing approach would be to compare predictive power in a model with and without feature selection. So, for every parsimonious model we train, let's compare its performance with that of its counterpart prodigious model (i.e. the model that uses all the features). Let's get started.

## Loading the data
As always we should have a look at how the features are distributed grouped by the labels. For this we'll generate a table 1.

In [None]:
# load the data as pandas dataframes
df = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")

# Generate table 1
TableOne(df, groupby=df.columns[-1],
         pval=True,
         dip_test=True,
         normal_test=True,
         tukey_test=True)

## Comparing Models
Let's define a function that will calculate the prodigious and parsimonious model performance.

In [None]:
#define function that compares selected features to full model
def compare_models(dataset, selfeat):
    """Compare parsimonious and full (prodigious) linear regression models."""

    # get predictors and labels
    X = dataset.drop('price_range', axis=1)  # independent columns
    y = dataset['price_range']               # target column, i.e. price range

    # keep only the selected features that exist among the predictors
    selfeat = [feat for feat in selfeat if feat in X.columns]

    # 70-30 split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=seed)

    # define the prodigious and parsimonious linear regression models
    prodmodel = LinearRegression()
    parsmodel = LinearRegression()

    # fit the models
    prodmodel.fit(X_train, y_train)
    parsmodel.fit(X_train[selfeat], y_train)

    # report the R^2 score of each model on the held-out 30%
    display('Prodigious Model Score: %.2f' % prodmodel.score(X_test, y_test))
    display('Parsimonious Model Score: %.2f' % parsmodel.score(X_test[selfeat], y_test))

## Filter Method
The Table 1 has conveniently calculated the association of each feature with the outcome. Let's select only those features that are significantly (p<.05) associated.

In [None]:
selfeat = ['battery_power', 'int_memory', 'mobile_wt', 'px_height', 'px_width', 'ram', 'sc_h']
compare_models(df, selfeat)

By keeping only 7 features the parsimonious model has the same score as the full model that uses all 20 features. 
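The list of features above was read off the Table 1 by hand. As a rough sketch of how the same kind of p-value filter could be automated, the cell below runs a one-way ANOVA F-test with sklearn's `f_classif`. Note the assumptions: the data is synthetic (the columns `signal`, `noise_a`, and `noise_b` are made up for illustration), and a single F-test is applied to every feature, which is not necessarily the test TableOne picks for each variable.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

# Synthetic stand-in for the phone data (hypothetical columns, not the real dataset)
rng = np.random.default_rng(1017)
n = 400
y_demo = rng.integers(0, 4, size=n)                    # four price classes, 0-3
X_demo = pd.DataFrame({
    "signal": y_demo * 500 + rng.normal(0, 100, n),    # strongly tied to the label
    "noise_a": rng.normal(0, 1, n),                    # unrelated to the label
    "noise_b": rng.normal(0, 1, n),                    # unrelated to the label
})

# One-way ANOVA F-test of each feature against the class label
f_stat, p_values = f_classif(X_demo, y_demo)
p_series = pd.Series(p_values, index=X_demo.columns)

# Keep the features whose p-value falls below the chosen cut-off
filtered = p_series[p_series < 0.05].index.tolist()
print(p_series.round(4))
print(filtered)
```

With the real data, `filtered` could be passed straight to `compare_models(df, filtered)`. Keep in mind that a 0.05 cut-off will occasionally let a pure-noise feature through by chance.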

## Unsupervised Methods
**Remove highly correlated features** To remove the correlated features, we can make use of the corr() method of the pandas dataframe. The corr() method returns a correlation matrix containing correlation between all the columns of the dataframe. A useful way to visualize the correlations is with a heatmap. We'll use the seaborn library for this.

In [None]:
#Create a correlation matrix for the columns in the dataset
correlation_matrix = df.corr()

#plot heat map
plt.figure(figsize=(20,20))
g=sns.heatmap(correlation_matrix, annot=True, cmap="RdYlGn")

We can loop through all the columns in correlation_matrix and keep track of the features whose absolute correlation with another feature exceeds 0.5. This 0.5 cut-off is quite strict and chosen for demonstration purposes. A more reasonable value is 0.8 to 0.9.

In [None]:
#init an empty set that will contain the names of the correlated features
correlated_features = set()

#loop over the lower triangle of feature pairs
#     do not consider the last column, which is the label
for i in range(len(correlation_matrix.columns) - 1):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.5:
            #record the name of the earlier feature in each correlated pair
            colname = correlation_matrix.columns[j]
            correlated_features.add(colname)

In [None]:
#display the correlated features
display(correlated_features)

These features are correlated with at least one other feature and can be considered redundant. Let's not include them in our parsimonious set and see how it affects model performance.

In [None]:
#add label to the correlated features which we will drop
correlated_features.add('price_range')
selfeat = df.columns.drop(correlated_features)
compare_models(df, selfeat)

In this case the parsimonious model's score (R^2, a measure of goodness of fit) is lower than that of the full model.
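The nested loop used for the correlation filter can also be written without explicit indexing. The following is a minimal sketch, assuming a hypothetical helper name `find_correlated` and a small synthetic dataframe. Like the loop, it flags the earlier feature of each correlated pair.

```python
import numpy as np
import pandas as pd

def find_correlated(features, threshold=0.8):
    """Return the earlier feature of every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # keep the strict upper triangle so each pair is examined exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # a row is flagged when it correlates strongly with any later column
    return set(upper.index[(upper > threshold).any(axis=1)])

# Synthetic demo: 'a_copy' is a noisy duplicate of 'a', 'b' is independent
rng = np.random.default_rng(1017)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=200),
    "b": rng.normal(size=200),
})

to_drop = find_correlated(demo, threshold=0.8)
print(to_drop)
```

On the real data you would pass only the predictors, for example `find_correlated(df.drop('price_range', axis=1), threshold=0.5)`, so that the label never enters the comparison.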

## Wrapper Methods
**Recursive feature elimination (RFE)** is a stepwise feature selection process implemented in sklearn. Recall that the model used for feature selection does not have to be the same as the predictive model. Here we will use a tree-based model with RFECV, the cross-validated version of RFE.

In [None]:
# get predictors and labels
X = df.drop('price_range', axis=1)  
y = df['price_range']

# use tree based model for RFE
rfe = RFECV(estimator=DecisionTreeClassifier())

# fit RFE
rfe.fit(X, y)

# summarize all features
for i in range(X.shape[1]):
    display('Column: %d, Selected %s, Rank: %.3f' % (i, rfe.support_[i], rfe.ranking_[i]))

We can see which features were selected by their column index. They correspond to the features 'battery_power', 'px_height', 'px_width', and 'ram'. Let's compare the parsimonious linear model with the full model.

In [None]:
#get the column indices
selcol = [0, 11, 12, 13]
#get the column names
selfeat = df.columns[selcol]
#compare models
compare_models(df, selfeat)
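The column indices above were typed in by hand after reading the RFE output. A possible alternative, sketched here on synthetic data from `make_classification` (the `feat_*` column names are placeholders), is to use the boolean mask in `support_` to pull the names directly. Fixing `random_state` on the tree is an added assumption that makes the selection repeatable between runs.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 8 features, 3 of them informative
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=1017)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# Cross-validated recursive feature elimination with a seeded tree
rfe_demo = RFECV(estimator=DecisionTreeClassifier(random_state=1017), cv=5)
rfe_demo.fit(X_demo, y_demo)

# support_ is a boolean mask over the columns, so it can index the names directly
selected_names = X_demo.columns[rfe_demo.support_].tolist()
print(selected_names)
```

Applied to the notebook's objects, the equivalent would be `selfeat = X.columns[rfe.support_]`, provided `X` is still the dataframe of predictors.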

**Boruta** is another wrapper method I like to use. It can be faster than RFE as the number of features increases and stands on a more solid statistical footing. To improve RFE statistics one could employ a repeated k-fold cross validation scheme, but that would increase the computation time even more.

In [None]:
# get predictors and labels
X = np.array(df.drop('price_range', axis=1)) 
y = np.array(df['price_range'])

# define random forest classifier for boruta
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
forest.fit(X, y)

# define Boruta feature selection method
feat_selector = BorutaPy(forest, n_estimators='auto', verbose=0, random_state=seed)

# find all relevant features
feat_selector.fit(X, y)

# zip my names, ranks, and decisions in a single iterable
feature_ranks = list(zip(df.columns.drop('price_range'), 
                         feat_selector.ranking_, 
                         feat_selector.support_))

# iterate through and print out the results
for feat in feature_ranks:
    display('Feature: {:<25} Rank: {},  Keep: {}'.format(feat[0], feat[1], feat[2]))


Looks like Boruta selected battery_power, px_height, px_width, and ram. These are the same features selected by RFE, so we'll move on.
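To give some intuition for Boruta's statistical footing, here is a rough sketch of the idea behind it: each real feature competes against "shadow" features, which are shuffled copies that cannot carry any signal. This is a single round on synthetic data, written with plain sklearn. It is not BorutaPy's actual implementation, which repeats the comparison many times and applies a statistical test to the tally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1017)
n = 500
y_demo = rng.integers(0, 2, size=n)
X_real = np.column_stack([
    y_demo + rng.normal(0, 0.3, n),   # informative column
    rng.normal(0, 1, n),              # pure noise column
])

# Shadow features: every real column shuffled on its own, breaking any link to y
X_shadow = np.column_stack([rng.permutation(X_real[:, j])
                            for j in range(X_real.shape[1])])

# Train one forest on the real and shadow columns together
forest_demo = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=1017)
forest_demo.fit(np.hstack([X_real, X_shadow]), y_demo)

# A real feature scores a "hit" when it beats the best shadow feature
importances = forest_demo.feature_importances_
real_imp, shadow_imp = importances[:2], importances[2:]
beats_shadow = real_imp > shadow_imp.max()
print(beats_shadow)
```

A single round like this is noisy for weak features, which is why repeating it and testing the number of hits matters.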

## Embedded methods
**LASSO**

In [None]:
from sklearn.linear_model import LassoCV

# get predictors and labels
X = np.array(df.drop('price_range', axis=1)) 
y = np.array(df['price_range'])

#train lasso model with 5-fold cross validation
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

#display the model score (R^2 on the training data)
display(lasso.score(X, y))

#plot feature importance based on coefficients
importance = np.abs(lasso.coef_)
feature_names = np.array(df.columns.drop('price_range'))
plt.bar(height=importance, x=feature_names)
plt.xticks(rotation=90)
plt.title("Feature importances via coefficients")
plt.show()

Again we see that battery_power, px_height, px_width, and ram are the most important features that influence price.
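One caveat when reading LASSO coefficients as importances: the penalty acts on coefficient size, so features measured on very different scales (for example ram in megabytes versus a 0/1 flag) are not directly comparable. A minimal sketch of one way to handle this, assuming synthetic data and a standardization step that the cell above does not include, is to put a `StandardScaler` in front of `LassoCV` and keep the features with nonzero coefficients.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with mixed scales; only columns 0 and 1 drive the target
rng = np.random.default_rng(1017)
n = 300
X_demo = rng.normal(size=(n, 6)) * np.array([1000, 1, 1, 50, 1, 1])
y_demo = 0.002 * X_demo[:, 0] + 1.5 * X_demo[:, 1] + rng.normal(0, 0.1, n)

# Standardize first so the penalty treats every feature on the same footing
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
pipe.fit(X_demo, y_demo)

# Features whose coefficient was not shrunk to exactly zero are retained
coefs = pipe.named_steps["lassocv"].coef_
selected_idx = np.flatnonzero(coefs)
print(selected_idx)
```

Cross-validated LASSO often keeps a few small nonzero coefficients for noise features, so the nonzero set is best read as a candidate list rather than a final answer.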

## Conclusions
I hope I have given you a fair overview of different feature selection schemes. Notice that I have not used the testing set to validate any of the relationships we found. The next step would be to aggregate the information you have gained from the various feature selection schemes and use it to decide which features to include in your final model. Also, notice there were some warnings raised by the Table 1 when we first loaded the data. Addressing these warnings could improve your final model's performance; remember: garbage in, garbage out. I'll leave that as an exercise for you.
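As one possible way to aggregate the results, the sketch below tallies how many methods picked each feature, using the selections reported in this notebook. Two things here are assumptions: LASSO's "most important" features are treated as its selection, and the `min_votes` threshold of 3 is an arbitrary choice for illustration.

```python
from collections import Counter

# Feature sets reported by each method earlier in the notebook
selections = {
    "filter": ['battery_power', 'int_memory', 'mobile_wt', 'px_height',
               'px_width', 'ram', 'sc_h'],
    "rfe": ['battery_power', 'px_height', 'px_width', 'ram'],
    "boruta": ['battery_power', 'px_height', 'px_width', 'ram'],
    "lasso": ['battery_power', 'px_height', 'px_width', 'ram'],  # assumed selection
}

# Count how many methods voted for each feature
votes = Counter(feat for feats in selections.values() for feat in feats)

# Keep features chosen by at least min_votes methods (hypothetical threshold)
min_votes = 3
consensus = sorted(feat for feat, count in votes.items() if count >= min_votes)
print(consensus)
```

The resulting list could then be checked once against the held-out testing set, which has not been touched so far.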