# Feature Selection for ML

The features that you use to train your ML models have a huge influence on the performance you can achieve.

Is "the more the better" true in ML? Perhaps for data volume (and not always even there), but certainly not for features. Irrelevant or partially relevant features can hurt model performance.

Here you will discover automatic feature selection techniques that you can use to prepare your ML data in Python with scikit-learn. 

The focus will be on:
1. Univariate Selection
2. Recursive Feature Elimination
3. Principal Component Analysis
4. Feature Importance

## Introduction

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested. 

Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression. 

Three benefits of performing feature selection before modeling your data are:

1. ***Reduction of Overfitting***: Less redundant data means less opportunity to make decisions based on noise.
1. ***Improvement of Accuracy***: Less misleading data means modeling accuracy improves.
1. ***Reduction of Training Time***: Less data means that algorithms train faster.

More about feature selection with scikit-learn can be found [here](http://scikit-learn.org/stable/modules/feature_selection.html).

## 1. Univariate Selection

First of all, statistical tests can be used to select those features that have the strongest relationship with the output variable. 

The scikit-learn library provides the `SelectKBest` class (info [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)) that can be used with a suite of different statistical tests to select a specific number of features.

The example below uses the chi-squared (chi2) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.

In [1]:
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [2]:
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [3]:
# Feature selection with univariate statistical tests (chi-squared for classification)
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

In [4]:
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)

[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393
   181.304]


In [5]:
names

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [8]:
X[0:3,:]

array([[   6.   ,  148.   ,   72.   ,   35.   ,    0.   ,   33.6  ,
           0.627,   50.   ],
       [   1.   ,   85.   ,   66.   ,   29.   ,    0.   ,   26.6  ,
           0.351,   31.   ],
       [   8.   ,  183.   ,   64.   ,    0.   ,    0.   ,   23.3  ,
           0.672,   32.   ]])

In [9]:
# summarize selected features
features = fit.transform(X)
print(features[0:3,:])

[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]]


You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass and age. I got the names for the chosen attributes by manually mapping the index of the 4 highest scores to the index of the attribute names.
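The manual index-to-name mapping can also be done in code. The sketch below is one possible way, using the selector's `get_support()` mask. It runs on synthetic non-negative data standing in for the diabetes dataset, so `X_demo`, `y_demo` and the scores it prints are illustrative assumptions, not results for the real data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the diabetes data (assumption): 8 non-negative features.
rng = np.random.default_rng(0)
feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
X_demo = rng.integers(0, 100, size=(200, 8)).astype(float)
# The target depends on columns 1 and 5 only, so those two should score highest.
y_demo = (X_demo[:, 1] + X_demo[:, 5] > 100).astype(int)

selector = SelectKBest(score_func=chi2, k=4).fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns.
mask = selector.get_support()
selected_names = [name for name, keep in zip(feature_names, mask) if keep]

# Pair every name with its score and sort from highest to lowest.
ranked = sorted(zip(feature_names, selector.scores_),
                key=lambda pair: pair[1], reverse=True)

print("Selected:", selected_names)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With the real data, the same two lines (`get_support()` plus the `zip`) applied to the fitted `test` object and the first eight entries of `names` would give the selected attribute names directly.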

## 2. Recursive Feature Elimination

Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain. At each step it uses the fitted model's feature weights (coefficients or feature importances) to identify which attributes contribute the least to predicting the target attribute, and drops them.
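To make the elimination loop concrete, here is a minimal hand-written sketch of the idea. It assumes a linear model and ranks features by the absolute value of their coefficients, dropping one feature per round; it is an illustration, not scikit-learn's actual implementation. The data is synthetic (`make_classification`), and names such as `remaining` and `n_to_keep` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
# Standardize so that coefficient magnitudes are comparable across features.
X_demo = StandardScaler().fit_transform(X_demo)

remaining = list(range(X_demo.shape[1]))  # column indexes still in play
n_to_keep = 3

while len(remaining) > n_to_keep:
    # Refit the model on the surviving columns only.
    model = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    # The feature with the smallest absolute coefficient is the weakest.
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    removed = remaining.pop(weakest)
    print(f"dropped column {removed}")

print("kept columns:", remaining)
```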

More about the RFE class in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).

The example below uses RFE with the logistic regression algorithm to select the top 3 features. Note: the choice of algorithm does not matter too much as long as it is skillful and consistent.

In [6]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression 

In [7]:
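# Feature selection with RFE
model = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(model, n_features_to_select=3)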
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]


You can see that RFE chose preg, mass and pedi as the top 3 features. These are marked True in the support array and assigned rank 1 in the ranking array. Again, you can manually map the feature indexes to the indexes of attribute names.
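As with univariate selection, the mapping from indexes to names can be automated. The following sketch indexes a NumPy array of names with the boolean `support_` mask and pairs each name with its rank. It uses synthetic data, so the features it selects are not those of the diabetes example; `feature_names`, `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

feature_names = np.array(['preg', 'plas', 'pres', 'skin',
                          'test', 'mass', 'pedi', 'age'])

# Synthetic stand-in data (assumption), with the same number of columns.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3).fit(X_demo, y_demo)

# support_ is a boolean mask, so it can index the array of names directly.
selected_names = feature_names[rfe.support_].tolist()

# ranking_ assigns 1 to every selected feature; higher ranks were dropped earlier.
ranking_by_name = dict(zip(feature_names.tolist(), rfe.ranking_.tolist()))

print("Selected:", selected_names)
print("Ranking:", ranking_by_name)
```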

## 3. Principal Component Analysis

Principal Component Analysis (or PCA) is, briefly and not rigorously, a dimensionality-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set. It is one of the tools of multivariate statistical analysis.

It follows a mathematical procedure (based on linear algebra) that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components (PCs). Think of it iteratively: the first PC accounts for as much of the variability in the data as possible, and each succeeding PC accounts for as much of the _remaining_ variability as possible.

In scikit-learn, PCA is a linear dimensionality reduction that uses the Singular Value Decomposition (SVD) of the data to project it onto a lower-dimensional space (naively speaking, each PC becomes a new axis in a new Cartesian space). From a linear algebra perspective, it is a linear transformation of the original variables onto a new set of orthogonal axes: the direction with the highest variance becomes the first axis, the direction with the second-highest variance becomes the second, and so on.

This is why PCA is generally called a data reduction technique. The name is not entirely appropriate: the reduction is not really of the data itself but of the resulting complexity, since you analyze only the "principal" (by variance) variables among the new ones.

Concretely: a property of PCA is that you can choose the number of dimensions or PCs in the transformed result. 
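The procedure described above can be sketched directly with NumPy's SVD. This is a simplified illustration on synthetic correlated data (an assumption), not a replacement for the scikit-learn class: it centers the data, takes the SVD, keeps the first `k` right-singular vectors as components, and projects the data onto them.

```python
import numpy as np

# Synthetic correlated data (assumption): 5 observed variables driven by 2 latent ones.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# PCA works on centered data.
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2                                   # number of PCs to keep (your choice)
components = Vt[:k]                     # shape (k, n_features)
X_projected = X_centered @ components.T # shape (n_samples, k)

# The variance carried by each PC follows from the singular values.
explained_variance = S**2 / (X_demo.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()
print("Explained variance ratio:", explained_variance_ratio[:k])

# The projected variables are uncorrelated: off-diagonal covariance is ~0.
cov = np.cov(X_projected, rowvar=False)
print(np.round(cov, 6))
```

Changing `k` changes only how many rows of `Vt` are kept, which is what choosing the number of PCs amounts to in this sketch.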

More about the PCA class in scikit-learn can be found in its API documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In the example below, we use PCA and select 2 principal components.

In [8]:
from sklearn.decomposition import PCA

In [9]:
# Feature Extraction with PCA
pca = PCA(n_components=2)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [ 0.889  0.062]
[[ -2.022e-03   9.781e-02   1.609e-02   6.076e-02   9.931e-01   1.401e-02
    5.372e-04  -3.565e-03]
 [ -2.265e-02  -9.722e-01  -1.419e-01   5.786e-02   9.463e-02  -4.697e-02
   -8.168e-04  -1.402e-01]]


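You can see that the 2 principal components bear little resemblance to the source data: each one is a weighted combination of all eight original attributes. The first component alone accounts for about 89% of the variance.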
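The cell above prints only the components. To look at the transformed dataset itself, you can call `transform` and check its shape. The sketch below does this on synthetic data (an assumption, with deliberately different column scales) and also shows how the cumulative explained variance can guide the choice of `n_components`. Passing a float between 0 and 1 asks scikit-learn to keep as many components as needed to reach that fraction of variance; `svd_solver='full'` is set explicitly for that case.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 8 independent columns with very different scales.
rng = np.random.default_rng(1)
scales = np.array([50, 20, 5, 3, 2, 1, 1, 0.5])
X_demo = rng.normal(size=(150, 8)) * scales

# Fit, then project the data onto the first 2 principal components.
pca = PCA(n_components=2).fit(X_demo)
X_reduced = pca.transform(X_demo)
print(X_demo.shape, "->", X_reduced.shape)

# Cumulative share of variance captured by the first 1, 2, ... components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Cumulative explained variance:", cumulative)

# A float n_components keeps enough components to explain that share of variance.
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_demo)
print("Components needed for 95% of the variance:", pca_95.n_components_)
```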

## 4. Feature Importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features. 

In the example below we construct an ExtraTreesClassifier for our dataset.

This class implements a meta-estimator that fits a number of randomized decision trees (called "extra-trees") on various sub-samples of the original dataset, and uses averaging to improve the predictive accuracy and to control over-fitting.

More about the ExtraTreesClassifier class can be found in the scikit-learn API [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html).

In [32]:
from sklearn.ensemble import ExtraTreesClassifier

In [33]:
# Feature Importance with Extra Trees Classifier
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

[ 0.104  0.228  0.102  0.076  0.079  0.124  0.132  0.156]


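You can see that we are given an importance score for each attribute where the larger the score, the more important the attribute. The scores point to plas, age and pedi as the most important attributes, with mass close behind. Because the trees are randomized, the exact values change from run to run unless you fix `random_state`.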
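Reading a bare array of scores is error-prone, so it can help to sort the importances and print them next to the attribute names. The sketch below shows one way to do that. It uses synthetic data and fixed `n_estimators` and `random_state` values, all of which are assumptions for the sake of a reproducible illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Synthetic stand-in data (assumption), with the same number of columns.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Fixing random_state makes the importance scores repeatable between runs.
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Sort column indexes from most to least important.
order = np.argsort(model.feature_importances_)[::-1]
ranked = [(feature_names[i], float(model.feature_importances_[i])) for i in order]

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The importances sum to 1, so each score can be read as that attribute's share of the total.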

## Summary

What we did:

* we explored feature selection for preparing ML data in Python with scikit-learn, and we discovered 4 different automatic feature selection techniques.

## What's next 

Now we will start looking at how to evaluate ML algorithms on your dataset, starting with resampling methods that can be used to estimate the performance of an ML algorithm on unseen data.