EMBEDDED METHODS: 
# FEATURE SELECTION BY TREE DERIVED VARIABLE IMPORTANCE

- DECISION TREES
   - Among the most popular machine learning algorithms
   - Highly accurate
   - Good generalisation (low overfitting)
   - Interpretability
- RANDOM FOREST IMPORTANCE
   - Random Forests consist of several hundred individual decision trees
   - The impurity decrease for each feature is averaged across trees
- Limitations of RANDOM FOREST
   - Correlated features show equal or similar importance
   - The importance of a correlated feature is lower than the importance it would get if the trees were built without its correlated counterparts
   - High-cardinality variables show greater importance (trees are biased towards this type of variable)
   
- SELECTION BY RANDOM FOREST IMPORTANCE:
   - Build a random forest
   - Determine feature importance
   - Select the features with the highest importance (scikit-learn implements this in SelectFromModel)

Recursive feature elimination:

- Build random forests
- Calculate feature importance
- Remove least important feature
- Repeat till a condition is met
- If the removed feature is correlated with another feature in the dataset, removing it reveals the true importance of the remaining feature: its importance increases


GRADIENT BOOSTED TREES FEATURE IMPORTANCE

- Feature importance is calculated in the same way as for random forests
- Biased towards high-cardinality features
- Importance is susceptible to correlated features
- Interpretability of feature importance is less straightforward:
   - Later trees fit the errors of the earlier trees, so feature importance is not necessarily proportional to the influence of the feature on the outcome, but rather to its influence on the mistakes of the previous trees
   - Averaging across trees may not add much information on the true relation between feature and target
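
As a minimal sketch of the points above, gradient boosted trees in scikit-learn expose the same `feature_importances_` attribute, so they can be plugged into `SelectFromModel` in the same way. The synthetic dataset and all parameter values below are assumptions chosen for illustration only; they are not taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption): 10 features, 3 of them informative
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3,
    n_redundant=0, random_state=0)

# Fit the boosted trees inside the selector
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(gbm).fit(X, y)

# Impurity-based importances, normalised to sum to 1
importances = selector.estimator_.feature_importances_
print(importances.round(3))

# Boolean mask of the features kept by the selector
print(selector.get_support())
```

The caveats listed above still apply: the ranking reflects how useful each feature was for correcting earlier trees, so it is worth reading it with some caution.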

## Random Forest importance

Random forests are one of the most popular machine learning algorithms. They are successful because they provide good predictive performance, low overfitting and easy interpretability. This interpretability comes from the fact that it is straightforward to derive the importance of each variable in the trees' decisions. In other words, it is easy to compute how much each variable contributes to the decision.

Random forests typically consist of 400 to 1,200 decision trees, each of them built on a random sample of the observations in the dataset and a random sample of the features. Not every tree sees all the features or all the observations, which de-correlates the trees and makes the forest less prone to over-fitting. Each tree is a sequence of yes-no questions based on a single feature or a combination of features. At each node (that is, at each question), the tree divides the dataset into 2 buckets, each of them hosting observations that are more similar among themselves and different from the ones in the other bucket. Therefore, the importance of each feature is derived from how "pure" each of the buckets is.

For classification, the measure of impurity is either the Gini impurity or the entropy. For regression, the measure of impurity is the variance. When training a tree, it is possible to compute how much each feature decreases the impurity. The more a feature decreases the impurity, the more important the feature is. In random forests, the impurity decrease elicited by each feature is averaged across trees to determine the final importance of the variable.
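
To make the idea of impurity decrease concrete, here is a small sketch that computes the Gini impurity of a node and the weighted decrease produced by a split. The helper names `gini` and `impurity_decrease` and the toy labels are hypothetical, written only for this illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two buckets."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Toy node with 4 observations of each class
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the two classes perfectly
perfect = impurity_decrease(parent, parent[:4], parent[4:])

# A split that leaves both buckets as mixed as the parent
useless = impurity_decrease(parent, parent[[0, 1, 4, 5]], parent[[2, 3, 6, 7]])

print(perfect, useless)
```

A feature used for the first kind of split would accumulate importance, while a feature used for the second kind would accumulate none.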

In general, features that are selected at the top of the trees are more important than features that are selected at the end nodes of the trees, as generally the top splits lead to bigger information gains.
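
The averaging across trees can be checked directly in scikit-learn. The sketch below fits a small forest on synthetic data (an assumption made for illustration, not this notebook's dataset) and compares the mean of the per-tree importances with the forest's own `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 8 features, 3 of them informative
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3,
    n_redundant=0, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importance of each feature in each individual tree
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])

# Average across trees and compare with the forest-level attribute
averaged = per_tree.mean(axis=0)
matches = np.allclose(averaged, rf.feature_importances_)
print(matches)
print(rf.feature_importances_.round(3))
```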

**Note**
- Random Forests and decision trees in general give preference to features with high cardinality
- Correlated features will be given equal or similar importance, but overall reduced importance compared to the same tree built without correlated counterparts.
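
The effect of correlated features can be illustrated with an extreme case: an exact copy of a predictive feature. The data below is synthetic and the helper `importances` is hypothetical; this is a sketch of the behaviour, not a result from this notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 1000
signal = rng.normal(size=n)        # the only predictive feature
noise = rng.normal(size=(n, 3))    # three irrelevant features
y = (signal > 0).astype(int)

X_base = np.column_stack([signal, noise])
X_dup = np.column_stack([signal, noise, signal])  # last column is an exact copy

def importances(X):
    """Fit a forest on X and return its impurity-based importances."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    return rf.fit(X, y).feature_importances_

imp_base = importances(X_base)
imp_dup = importances(X_dup)

print("signal alone:", imp_base[0].round(3))
print("signal with copy:", imp_dup[0].round(3), "copy:", imp_dup[-1].round(3))
```

When the copy is present, the two columns share the importance that the signal would otherwise receive on its own.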

I will demonstrate how to select features based on tree importance using a regression and classification dataset.

## Recursive Feature Selection using Random Forests importance

Random Forests assign equal or similar importance to features that are highly correlated. In addition, when features are correlated, the importance assigned to each of them is lower than the importance it would receive if the trees were built without its correlated counterparts.

Therefore, instead of eliminating features in a single pass based on importance, as we did in the previous notebook, we could get a better selection by removing one feature at a time and recalculating the importance on each round. This procedure is called Recursive Feature Elimination (RFE).

RFE is a hybrid between embedded and wrapper methods: it relies on importances derived while fitting the model, but it also requires fitting several models.

The cycle is as follows:

- Build Random Forests using all features
- Remove least important feature
- Build Random Forests and recalculate importance
- Repeat until a criterion is met

In this situation, when a feature that is highly correlated to another one is removed, the importance of the remaining feature increases. This may lead to a better selection of the feature space. On the downside, building several Random Forests consumes considerable time and compute resources, in particular if the dataset contains a high number of features.
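
The cycle described above can be sketched with scikit-learn's `RFE` class, which refits the estimator after each removal. The synthetic dataset, the number of trees and the choice of keeping 5 features are assumptions for illustration; the stopping criterion here is simply a target number of features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data (assumption): 12 features, some informative, some redundant
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=4,
    n_redundant=2, random_state=0)

# step=1 removes the single least important feature on each round
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
    step=1)
rfe.fit(X, y)

print(rfe.support_)   # True for the features that were kept
print(rfe.ranking_)   # 1 for kept features, higher for earlier removals

X_selected = rfe.transform(X)
print(X_selected.shape)
```

With 12 features and a target of 5, this fits a forest on each of the 7 elimination rounds plus a final one, which is why the approach becomes expensive on wide datasets.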

I will demonstrate how to select features recursively based on Random Forests importance using sklearn on a classification dataset.

## Feature selection with decision trees, review
### Putting it all together

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings("ignore")

In [3]:
data = pd.read_csv('../dataset_1.csv')
data.shape

(50000, 301)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [5]:
# I keep a copy of the dataset with all the variables to compare the performance of machine learning models at the end of the notebook
X_train_original = X_train.copy()
X_test_original = X_test.copy()

In [6]:
# Remove constant features
constant_features = [feat for feat in X_train.columns if X_train[feat].std() == 0]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

In [7]:
# Remove quasi-constant features
sel = VarianceThreshold(threshold=0.01)  

sel.fit(X_train)  # fit finds the features with low variance

sum(sel.get_support()) # how many not quasi-constant?

215
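
To see what the 0.01 threshold does, here is a toy sketch with one constant, one quasi-constant and one varying column. The dataframe and its column names are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "constant": np.zeros(200),                            # variance 0
    "quasi_constant": np.r_[np.ones(1), np.zeros(199)],   # 1 in 200 rows differs
    "informative": rng.normal(size=200),                  # variance close to 1
})

vt = VarianceThreshold(threshold=0.01).fit(toy)

# Variance of each column as seen by the selector
print(dict(zip(toy.columns, vt.variances_.round(4))))

# Columns whose variance exceeds the threshold
kept = toy.columns[vt.get_support()].tolist()
print(kept)
```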

In [8]:
features_to_keep = X_train.columns[sel.get_support()]

In [9]:
# remove the features
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 215), (15000, 215))

In [10]:
# sklearn transformers return numpy arrays, so here we convert back to dataframes

X_train = pd.DataFrame(X_train, columns=features_to_keep)
X_test = pd.DataFrame(X_test, columns=features_to_keep)

In [11]:
# Remove duplicated features
# check for duplicated features in the training set
duplicated_feat = []
for i in range(0, len(X_train.columns)):
    if i % 10 == 0:  # this helps me understand how the loop is going
        print(i)

    col_1 = X_train.columns[i]

    for col_2 in X_train.columns[i + 1:]:
        if X_train[col_1].equals(X_train[col_2]):
            duplicated_feat.append(col_2)
            
len(duplicated_feat)

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210


10
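
As an alternative to the nested loop, duplicated columns can also be found by transposing the dataframe and using pandas' `duplicated`. This is a sketch on a hypothetical toy dataframe; on wide or very large datasets the transpose may be memory hungry, so the loop above can still be the safer choice.

```python
import pandas as pd

# Toy data (assumption): two columns and an exact copy of each
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],
    "b_copy": [4, 5, 6],
})

# After transposing, duplicated rows correspond to duplicated columns;
# the first occurrence is kept and later copies are flagged
mask = toy.T.duplicated().to_numpy()
duplicated_cols = toy.columns[mask].tolist()
print(duplicated_cols)
```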

In [12]:
# remove duplicated features
X_train.drop(labels=duplicated_feat, axis=1, inplace=True)
X_test.drop(labels=duplicated_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 205), (15000, 205))

In [13]:
# I keep a copy of the dataset except constant, quasi-constant and duplicated variables to measure the performance of machine learning models at the end of the notebook

X_train_basic_filter = X_train.copy()
X_test_basic_filter = X_test.copy()

In [14]:
# Remove correlated features
# find and remove correlated features
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

corr_features = correlation(X_train, 0.8)
print('correlated features: ', len(set(corr_features)) )

correlated features:  93
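
The same rule (flag a column when it is correlated above the threshold with any column that comes before it) can be written without explicit loops over the matrix. The function name `correlated_columns` and the toy data are hypothetical; this is a sketch of an equivalent formulation, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

def correlated_columns(dataset, threshold):
    """Names of columns correlated above threshold with an earlier column."""
    corr = dataset.corr().abs()
    # Keep only the strict upper triangle, so each pair is checked once
    # and column i is only compared with the columns that precede it
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return {col for col in upper.columns if (upper[col] > threshold).any()}

rng = np.random.RandomState(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),  # almost a linear copy of "a"
    "c": rng.normal(size=200),                 # independent of the others
})

flagged = correlated_columns(toy, 0.8)
print(flagged)
```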


In [15]:
# remove correlated features
X_train.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 112), (15000, 112))

In [16]:
# keep a copy of the dataset at this stage
X_train_corr = X_train.copy()
X_test_corr = X_test.copy()

### Select Features by Random Forests Importance

In [17]:
# select features using the importance derived from random forests

sel_ = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=10))
sel_.fit(X_train, y_train)

# keep only the features whose importance is above the selector's threshold (by default, the mean importance) and parse again as dataframe (output of sklearn is numpy array)
X_train_rf = pd.DataFrame(sel_.transform(X_train))
X_test_rf = pd.DataFrame(sel_.transform(X_test))

# add the columns name
X_train_rf.columns = X_train.columns[(sel_.get_support())]
X_test_rf.columns = X_train.columns[(sel_.get_support())]

In [18]:
X_train_rf.shape, X_test_rf.shape

((35000, 16), (15000, 16))
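
For tree-based estimators, `SelectFromModel` with no explicit `threshold` keeps the features whose importance is at least the mean importance. The sketch below checks this on synthetic data (an assumption for illustration; the numbers do not correspond to this notebook's dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption): 10 features, 3 of them informative
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3,
    n_redundant=0, random_state=0)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=10)).fit(X, y)

importances = selector.estimator_.feature_importances_

# The threshold used by the selector should equal the mean importance
threshold_is_mean = np.isclose(selector.threshold_, importances.mean())

# Rebuild the selection mask by hand and compare it with get_support()
manual_mask = importances >= selector.threshold_
same_mask = np.array_equal(manual_mask, selector.get_support())

print(threshold_is_mean, same_mask)
```

Passing a different `threshold` (for example a float or a string such as `"median"`) changes how many features survive, so the 16 features selected above depend on this default.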

In [19]:
# Compare the performance in machine learning algorithms
# create a function to build random forests and compare performance in train and test set
def run_randomForests(X_train, X_test, y_train, y_test):
    rf = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
    rf.fit(X_train, y_train)   
    print('Train set')
    pred = rf.predict_proba(X_train)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))    
    print('Test set')
    pred = rf.predict_proba(X_test)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

In [20]:
# original
run_randomForests(X_train_original,  X_test_original, y_train, y_test)

Train set
Random Forests roc-auc: 0.807612232524249
Test set
Random Forests roc-auc: 0.7868832427636059


In [21]:
# filter methods - basic
run_randomForests(X_train_basic_filter, X_test_basic_filter, y_train, y_test)

Train set
Random Forests roc-auc: 0.810290026780428
Test set
Random Forests roc-auc: 0.7914020645941601


In [22]:
# filter methods - correlation
run_randomForests(X_train_corr, X_test_corr, y_train, y_test)

Train set
Random Forests roc-auc: 0.8066004772684517
Test set
Random Forests roc-auc: 0.7859521124929707


In [23]:
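# filter methods - correlation (repeated: no univariate roc-auc selection is applied in this notebook, so this cell runs the same correlation-filtered data as the previous one)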
run_randomForests(X_train_corr, X_test_corr, y_train, y_test)

Train set
Random Forests roc-auc: 0.8066004772684517
Test set
Random Forests roc-auc: 0.7859521124929707


In [24]:
# embedded methods - Random forests
run_randomForests(X_train_rf, X_test_rf, y_train, y_test)

Train set
Random Forests roc-auc: 0.825594244784318
Test set
Random Forests roc-auc: 0.8037861254524954


A random forest built on the 16 selected features performs slightly better on the test set (roc-auc 0.804) than one built on all 300 features (roc-auc 0.787)!