### Feature Selection for ML

The features you choose to train a model have a large impact on its performance, so train only on features that are relevant to the target. Automatic feature selection techniques can help with this step:

- Univariate Selection
- Recursive Feature Elimination
- Principal Component Analysis
- Feature Importance

#### Feature selection

- An automatic process
- Selects the features that contribute most to the prediction or output of interest
- Irrelevant features can hurt performance, especially for linear algorithms (linear and logistic regression)

Benefits

* Reduces overfitting: less redundant data means fewer decisions based on noise
* Improves accuracy: less misleading data means better accuracy
* Reduces training time: less data means faster training

More info: http://scikit-learn.org/stable/modules/feature_selection.html

##### Univariate feature selection

Use statistical tests to select the subset of features most strongly related to the output variable. For example, the chi-squared test, which requires non-negative features, can be used to select the 4 best features (example below).
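
As a rough illustration of what such a test measures, the sketch below computes a chi-squared score per feature with NumPy on a tiny made-up dataset. The helper name `chi2_scores` and the toy data are assumptions for illustration only. The idea is to compare the observed per-class feature totals with the totals expected if feature and class were independent.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared score per feature for non-negative X and class labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot encode the labels: shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                      # per-class feature totals
    class_prob = Y.mean(axis=0)             # share of samples in each class
    feature_total = X.sum(axis=0)           # total of each feature
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data: feature 0 tracks the class, feature 1 is constant
X_toy = np.array([[5, 1], [6, 1], [0, 1], [1, 1]])
y_toy = np.array([1, 1, 0, 0])
scores = chi2_scores(X_toy, y_toy)
print(scores)
```

Feature 0 gets a score of about 8.33 and the constant feature gets 0, so a selector keeping the highest scores would keep feature 0.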


Using scikit-learn's `SelectKBest` class: 

In [34]:
#feature extraction with Univariate Statistical Tests (Chi-squared for classification)

from pandas import read_csv
from numpy import set_printoptions, column_stack, asarray

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

print(names)
# Feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# Summarize scores
set_printoptions(precision=3)
fscore = fit.scores_
print(fscore)
features = fit.transform(X)

# Summarize selected features
print(features[0:5, :])

print('\nFeatures x Score: ')

print(column_stack((names[:-1], fscore.tolist())))

print('\nTop 4: plas, test, mass, age')


['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393
   181.304]
[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]

Features x Score: 
[['preg' '111.519690636']
 ['plas' '1411.88704064']
 ['pres' '17.6053732153']
 ['skin' '53.1080398363']
 ['test' '2175.56527292']
 ['mass' '127.669343331']
 ['pedi' '5.39268154697']
 ['age' '181.303689044']]

Top 4: plas, test, mass, age
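
Instead of reading the top features off the score table by hand, the selector's `get_support()` mask can be mapped back to column names. A minimal sketch, assuming scikit-learn's bundled iris dataset as a stand-in (so it runs without downloading the diabetes file) and `k=2`:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in dataset (an assumption for this sketch); all features are non-negative
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

selector = SelectKBest(score_func=chi2, k=2).fit(X_iris, y_iris)

# get_support() returns a boolean mask over the input columns
mask = selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]

# Pair each feature with its score, highest first
ranked = sorted(zip(iris.feature_names, selector.scores_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("%-20s %.3f" % (name, score))
print("Selected:", selected)
```

The same pattern should work for the diabetes data by zipping `names[:-1]` with `fit.get_support()`.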


##### Recursive Feature Elimination (RFE)

Works recursively by: 

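1. Building a model on the current set of attributes
2. Ranking the attributes by the weights the model assigns to them
3. Removing the attributes with the smallest absolute weights
4. Repeating on the remaining attributes until the desired number is left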


[Feature ranking with recursive feature elimination](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)
>Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose absolute weights are the smallest are pruned from the current set features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
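
The procedure quoted above can be sketched by hand. This is not scikit-learn's implementation: it assumes plain linear least squares (via `numpy.linalg.lstsq`) as the external estimator, removes one feature per round, and uses synthetic data. The function name `rfe_least_squares` is hypothetical.

```python
import numpy as np

def rfe_least_squares(X, y, n_keep):
    """Drop the feature with the smallest absolute weight until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Fit a linear least-squares model on the remaining features
        weights, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Prune the feature whose absolute weight is smallest
        weakest = int(np.argmin(np.abs(weights)))
        del remaining[weakest]
    return remaining

# Synthetic data: only features 0 and 2 influence the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 2] + rng.normal(scale=0.1, size=200)

kept = rfe_least_squares(X_syn, y_syn, n_keep=2)
print("Kept feature indices:", kept)
```

Comparing weights like this is only meaningful when the features are on similar scales, which holds for the synthetic data here but not necessarily for raw measurements.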

The example below uses logistic regression as the estimator to select the top 3 features (any estimator that exposes coefficients or feature importances works):

In [45]:
from pandas import read_csv
from numpy import set_printoptions

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

model = LogisticRegression()
# Keep the 3 highest-ranked features
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)

print("Num features: %d" % fit.n_features_)
print("Selected features %s" % fit.support_)
print("Feature ranking %s" % fit.ranking_)
# Map the boolean support mask back to column names
top = [name for name, keep in zip(names[:-1], fit.support_) if keep]
print("Top features by name: " + " ".join(top))

Num features: 3
Selected features [ True False False False False  True  True False]
Feature ranking [1 2 3 5 6 1 1 4]
Top features by name: preg mass pedi


##### Principal Component Analysis (PCA)

PCA uses linear algebra to compress the dataset into fewer dimensions, a technique also known as data reduction.
[PCA](http://scikit-learn.org/stable/modules/decomposition.html)
> PCA is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance. In scikit-learn, PCA is implemented as a transformer object that learns n components in its fit method, and can be used on new data to project it on these components.


PCA lets you choose the number of dimensions (components) in the transformed result.

In [49]:
from pandas import read_csv
from numpy import set_printoptions

from sklearn.decomposition import PCA

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

#extract features
pca = PCA(n_components=3)
fit = pca.fit(X)

# Summarize components
set_printoptions(precision=3)

print("Explained Variance:        %s " % fit.explained_variance_)
print("Explained Variance Ratio:  %s " % fit.explained_variance_ratio_)
# By choosing k (number of components) = 3, we retain 97.6% of the variance

print("Explained Variance CumSum: %s " % fit.explained_variance_ratio_.cumsum())
print(fit.components_)


Explained Variance:        [ 13439.051    931.546    390.069] 
Explained Variance Ratio:  [ 0.889  0.062  0.026] 
Explained Variance CumSum: [ 0.889  0.95   0.976] 
[[ -2.022e-03   9.781e-02   1.609e-02   6.076e-02   9.931e-01   1.401e-02
    5.372e-04  -3.565e-03]
 [  2.265e-02   9.722e-01   1.419e-01  -5.786e-02  -9.463e-02   4.697e-02
    8.168e-04   1.402e-01]
 [ -2.246e-02   1.434e-01  -9.225e-01  -3.070e-01   2.098e-02  -1.324e-01
   -6.400e-04  -1.255e-01]]
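
To show where the explained variance figures come from, here is a minimal sketch of PCA computed with a singular value decomposition in NumPy. The helper `pca_svd` and the synthetic data are assumptions for illustration, and this is not a claim about how scikit-learn computes it internally.

```python
import numpy as np

def pca_svd(X, n_components):
    """Return components, explained variance and its ratio via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    variance = S ** 2 / (X.shape[0] - 1)
    ratio = variance / variance.sum()
    return Vt[:n_components], variance[:n_components], ratio[:n_components]

# Synthetic data with very different column scales (illustrative only)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 4)) * np.array([100.0, 10.0, 1.0, 0.1])

components, variance, ratio = pca_svd(X_syn, n_components=2)
print("Explained variance ratio:", ratio)
print("Cumulative:", ratio.cumsum())

# Project the data onto the retained components
X_reduced = (X_syn - X_syn.mean(axis=0)) @ components.T
print(X_reduced.shape)
```

Because the first synthetic column has by far the largest scale, it dominates the first component, much as `test` does in the output above (loading 0.993). Standardizing the features before PCA is a common way to avoid that.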


##### Feature Importance

Some algorithms can estimate feature importance, for example ensembles of decision trees such as Random Forest and Extra Trees.

[ExtraTreesClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
> This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting.

The example below fits an `ExtraTreesClassifier` on the same dataset:

In [52]:
from pandas import read_csv
from numpy import column_stack

from sklearn.ensemble import ExtraTreesClassifier

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# Feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
# Pair each feature name with its importance (importances sum to 1)
print(column_stack((names[:-1], model.feature_importances_)))


[['preg' '0.104827680737']
 ['plas' '0.231328053099']
 ['pres' '0.106357372243']
 ['skin' '0.0776213364668']
 ['test' '0.0672802413506']
 ['mass' '0.145849365915']
 ['pedi' '0.120177755314']
 ['age' '0.146558194875']]
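
The importances are easier to read when sorted, and they vary between runs unless the random seed is fixed. A minimal sketch, assuming synthetic data from `make_classification` and hypothetical feature names `f0`..`f5` rather than the diabetes dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (an assumption for this sketch): with shuffle=False the
# 2 informative features are columns 0 and 1, the rest are noise
X_syn, y_syn = make_classification(n_samples=300, n_features=6,
                                   n_informative=2, n_redundant=0,
                                   shuffle=False, random_state=0)
feature_names = ["f%d" % i for i in range(X_syn.shape[1])]  # hypothetical names

# Fixing random_state makes the importances repeatable between runs
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_syn, y_syn)

# Importances sum to 1; sort indices from most to least important
order = np.argsort(model.feature_importances_)[::-1]
for idx in order:
    print("%s %.3f" % (feature_names[idx], model.feature_importances_[idx]))
```

Applied to the example above, the same sorting would put `plas` first, followed by `age` and `mass`.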
