# Feature Selection for ML

We discuss four feature selection techniques that you can use to prepare your machine learning (ML) data in Python with scikit-learn:

1. Univariate Selection
2. Recursive Feature Elimination
3. Principal Component Analysis
4. Feature Importance

## Introduction

Discussed in the hands-on.

Feature selection has three main goals/benefits:
1. ***Reduction of Overfitting***
1. ***Improvement of Accuracy***
1. ***Reduction of Training Time***

More about feature selection with scikit-learn can be found [here](http://scikit-learn.org/stable/modules/feature_selection.html).

## 1. Univariate Selection

The scikit-learn library provides the `SelectKBest` class (info [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)) that can be used with a suite of different statistical tests to select a specific number of features.

In [1]:
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [2]:
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [3]:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

In [4]:
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)

[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393
   181.304]


In [5]:
names

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [6]:
X[0:3,:]

array([[   6.   ,  148.   ,   72.   ,   35.   ,    0.   ,   33.6  ,
           0.627,   50.   ],
       [   1.   ,   85.   ,   66.   ,   29.   ,    0.   ,   26.6  ,
           0.351,   31.   ],
       [   8.   ,  183.   ,   64.   ,    0.   ,    0.   ,   23.3  ,
           0.672,   32.   ]])

In [7]:
# summarize selected features
features = fit.transform(X)
print(features[0:3,:])

[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]]
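The transformed array above no longer carries column names, so it can help to map the selection back to them. The following is a minimal sketch on synthetic stand-in data (not the Pima dataset); the names `X_demo`, `y_demo`, and `demo_names` are made up for illustration. It uses `get_support()` to recover which columns `SelectKBest` kept.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative stand-in data (chi2 requires non-negative features).
rng = np.random.default_rng(0)
n_samples = 200
y_demo = rng.integers(0, 2, size=n_samples)

# Columns f0 and f2 depend on the class label; f1 and f3 are pure noise.
X_demo = np.column_stack([
    rng.poisson(lam=5 + 10 * y_demo),     # informative
    rng.poisson(lam=5, size=n_samples),   # noise
    rng.poisson(lam=3 + 6 * y_demo),      # informative
    rng.poisson(lam=5, size=n_samples),   # noise
]).astype(float)
demo_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical column names

selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns.
mask = selector.get_support()
selected_names = [name for name, keep in zip(demo_names, mask) if keep]

for name, score in zip(demo_names, selector.scores_):
    print(f"{name}: {score:.1f}")
print("Selected:", selected_names)
```

Applied to the notebook's own data, the same pattern would pair `fit.get_support()` with the first eight entries of `names`.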


## 2. Recursive Feature Elimination

Discussed in the hands-on.

More about the RFE class in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).

In [8]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [9]:
# Feature Extraction with RFE
model = LogisticRegression()
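# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(model, n_features_to_select=3)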
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
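In the output above, `support_` marks the kept features with `True` and `ranking_` assigns them rank 1, with higher ranks for features eliminated earlier. A minimal sketch of turning those arrays into a readable listing follows. It assumes synthetic data from `make_classification` and hypothetical names (`demo_names`); `max_iter=1000` is an illustrative choice to help the solver converge, not a setting from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 6 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3,
    n_redundant=0, random_state=0,
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names

# Drop one feature per round until 3 remain.
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

# Features with support_ == True are the ones that survived elimination.
selected = [n for n, keep in zip(demo_names, rfe_demo.support_) if keep]

# Sort by rank so the kept features (rank 1) come first.
ranked = sorted(zip(rfe_demo.ranking_, demo_names))
for rank, name in ranked:
    print(f"rank {rank}: {name}")
print("Selected:", selected)
```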


## 3. Principal Component Analysis

Discussed in the hands-on.

More about the PCA class in scikit-learn can be found in its API documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [10]:
from sklearn.decomposition import PCA

In [11]:
# Feature Extraction with PCA
pca = PCA(n_components=2)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [ 0.889  0.062]
[[ -2.022e-03   9.781e-02   1.609e-02   6.076e-02   9.931e-01   1.401e-02
    5.372e-04  -3.565e-03]
 [ -2.265e-02  -9.722e-01  -1.419e-01   5.786e-02   9.463e-02  -4.697e-02
   -8.168e-04  -1.402e-01]]
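In the output above, the first component loads almost entirely on the fifth column (`test`, loading 0.993). PCA is sensitive to feature scale, so a column with a large raw variance can dominate the leading components. The sketch below illustrates the effect on synthetic data with deliberately mismatched scales and compares plain PCA against PCA after `StandardScaler`. Whether to standardize is a modelling choice and is not something the original notebook does.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three independent columns on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(0, 1, 300),
    rng.normal(0, 10, 300),
    rng.normal(0, 100, 300),
])

# PCA on the raw data: the large-scale column dominates component 1.
raw_pca = PCA(n_components=2).fit(X_demo)
raw_ratio = raw_pca.explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_demo)
scaled_ratio = scaled_pca.named_steps['pca'].explained_variance_ratio_

# transform() projects the data onto the retained components.
X_reduced = scaled_pca.transform(X_demo)

print("Raw explained variance ratio:   ", raw_ratio)
print("Scaled explained variance ratio:", scaled_ratio)
print("Reduced shape:", X_reduced.shape)
```

Note that the projected columns are combinations of all the original attributes, so PCA reduces dimensionality rather than picking a subset of the input features.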


## 4. Feature Importance

Discussed in the hands-on.

More about the ExtraTreesClassifier class can be found in the scikit-learn API [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html).

In [12]:
from sklearn.ensemble import ExtraTreesClassifier

In [13]:
# Feature Importance with Extra Trees Classifier
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

[ 0.118  0.217  0.096  0.086  0.075  0.148  0.114  0.146]


Each attribute receives an importance score; the larger the score, the more important the attribute. Here the scores point to `plas`, `mass`, and `age` as the most important attributes.
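Reading the raw importance array by eye gets harder as the number of attributes grows. The sketch below sorts the scores and pairs them with names. It runs on synthetic data with hypothetical names (`demo_names`); `n_estimators=100` and `random_state=0` are illustrative settings added so that repeated runs give the same scores, since the ensemble is randomized.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data; with shuffle=False the informative
# columns are the first three.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    n_redundant=0, random_state=0, shuffle=False,
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names

# Fixing random_state makes the importance scores reproducible.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

importances = forest.feature_importances_  # non-negative, sums to 1

# Indices of the features from most to least important.
order = np.argsort(importances)[::-1]
for idx in order:
    print(f"{demo_names[idx]}: {importances[idx]:.3f}")
```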

## Summary

What we did:

* We explored feature selection for preparing ML data in Python with scikit-learn, covering four automatic techniques: univariate selection, recursive feature elimination, principal component analysis, and feature importance.

## What's next 

Next, we will look at how to evaluate ML algorithms on your dataset, starting with resampling methods that can be used to estimate the performance of an ML algorithm on unseen data.