# Feature Selection for ML

This notebook covers feature selection techniques that you can use to prepare your ML data in Python with scikit-learn.

The focus is on:
1. Univariate Selection
2. Recursive Feature Elimination
3. Feature Importance
4. Principal Component Analysis

## Introduction

Discussed in the hands-on.

Feature selection has 3 main goals/benefits:
1. ***Reduction of Overfitting***
1. ***Improvement of Accuracy***
1. ***Reduction of Training Time***



More about feature selection with scikit-learn can be found [here](http://scikit-learn.org/stable/modules/feature_selection.html).

## 0. Import the data

In [2]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AML_basic_AA1920/master/datasets/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

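|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |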


In [3]:
data.shape

(768, 9)

## 1. Univariate Selection

The scikit-learn library provides the `SelectKBest` class (info [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)) that can be used with a suite of different statistical tests to select a specific number of features. 

The example below uses the chi-squared (chi2) statistical test, which requires non-negative features, to select the 3 best features from the Pima Indians onset of diabetes dataset.

In [0]:
#from pandas import read_csv
from numpy import set_printoptions

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [0]:
array = data.values
X = array[:,0:8]
Y = array[:,8]

In [6]:
X.shape

(768, 8)

In [7]:
Y.shape

(768,)

In [0]:
# chi-squared test, keeping the k=3 highest-scoring features
test = SelectKBest(score_func=chi2, k=3)
fit = test.fit(X, Y)

In [9]:
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]


In [10]:
names[:-1]

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

So, the **k(=3) best ones** seem to be: **plas, test, age**.
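The ranking can also be read off programmatically rather than by eye. The following is a minimal sketch, assuming the scores printed above are copied by hand into a list; the names `feature_names`, `chi2_scores`, `ranked` and `top_k` are illustrative and not part of the notebook.

```python
# Feature names and chi2 scores copied from the outputs above (illustrative variables).
feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
chi2_scores = [111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304]

# Pair each name with its score and sort from highest to lowest score.
ranked = sorted(zip(feature_names, chi2_scores), key=lambda pair: pair[1], reverse=True)

# Keep the names of the k highest-scoring features.
k = 3
top_k = [name for name, _ in ranked[:k]]
print(top_k)  # ['test', 'plas', 'age']
```

On the fitted selector itself, `fit.get_support()` returns the equivalent boolean mask over the columns of `X`.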

[NOTE]: If you transform the data with the selector, you will see its shape change, since only the k selected columns are kept (careful, this is powerful).

In [0]:
X_new = SelectKBest(score_func=chi2, k=3).fit_transform(X, Y)

In [0]:
X_new.shape

In [0]:
X_new

Let's go back on track.

In [0]:
# summarize selected features
X_new1 = fit.transform(X)
print(X_new1[0:10,:])

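You can do all of the above in one shot with `fit_transform` (careful, this is often dangerous: it fits and transforms the same data in a single call, so it is easy to fit the selector on rows that should have been held out).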

In [0]:
X_new2 = SelectKBest(score_func=chi2, k=3).fit_transform(X,Y)
print(X_new2[0:10,:])
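One reason for caution with the one-shot call: if the selector is fitted on all rows, the rows later used for testing have already influenced which features were kept. Below is a minimal sketch of the alternative, fitting on a training split only. It uses synthetic non-negative data rather than the Pima dataset, and every name in it (`X_demo`, `y_demo`, `selector`, the split sizes, the seeds) is an assumption made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

# Synthetic non-negative data: 200 rows, 8 features; the label depends on column 1.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(200, 8)).astype(float)
y_demo = (X_demo[:, 1] + rng.normal(0, 10, size=200) > 50).astype(int)

# Split first, so the test rows play no part in choosing the features.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit the selector on the training rows only...
selector = SelectKBest(score_func=chi2, k=3)
X_train_sel = selector.fit_transform(X_train, y_train)

# ...then apply the already-fitted selector to the test rows.
X_test_sel = selector.transform(X_test)

print(X_train_sel.shape, X_test_sel.shape)  # (150, 3) (50, 3)
```

The same pattern (fit on the training split, transform everything else) applies to the other selectors in this notebook.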

## 2. Recursive Feature Elimination


More about the RFE class in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).

In [0]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [15]:
# Feature selection with RFE
model = LogisticRegression(solver = 'lbfgs', max_iter = 500)
rfe = RFE(model, n_features_to_select=3)    # my choice: seek 3 features (keyword argument required by recent scikit-learn)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 4 6 5 1 1 3]


In [13]:
names[:-1]

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

You can see that RFE chose the **top 3 features** as **preg, mass, pedi**.
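The mask and ranking printed above can be mapped back to column names in a few lines. This is a minimal sketch, assuming the two printed arrays are copied by hand into lists; `support`, `ranking`, `selected` and `eliminated` are illustrative names.

```python
# Values copied from the RFE output above (illustrative variables).
feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
support = [True, False, False, False, False, True, True, False]
ranking = [1, 2, 4, 6, 5, 1, 1, 3]

# Features whose mask entry is True are the ones RFE kept.
selected = [name for name, keep in zip(feature_names, support) if keep]
print(selected)  # ['preg', 'mass', 'pedi']

# Rank 1 marks a kept feature; a larger rank means the feature was dropped earlier.
eliminated = sorted((rank, name) for rank, name in zip(ranking, feature_names) if rank > 1)
print(eliminated)  # [(2, 'plas'), (3, 'age'), (4, 'pres'), (5, 'test'), (6, 'skin')]
```

Read this way, `plas` was the last feature to be eliminated and `skin` the first.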

## 3. Feature Importance


More about the ExtraTreesClassifier class can be found in the scikit-learn API [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html).

In [0]:
from sklearn.ensemble import ExtraTreesClassifier

In [17]:
# Feature Importance with Extra Trees Classifier
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

[0.112 0.211 0.103 0.079 0.078 0.144 0.114 0.16 ]




In [18]:
names[:-1]

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

**The larger the score, the more important the attribute**. The scores point to the importance of **plas, mass, age**.
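Extra Trees is a randomized ensemble, so the importances change slightly between runs unless the seed is fixed. The sketch below shows one way to make the result repeatable and to sort the columns by importance. It runs on synthetic data (`make_classification` stands in for the Pima dataset), and `n_estimators=100` and `random_state=0` are arbitrary choices made here, not settings from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Fixing random_state makes the importances repeatable across runs.
model_demo = ExtraTreesClassifier(n_estimators=100, random_state=0)
model_demo.fit(X_demo, y_demo)
importances = model_demo.feature_importances_

# Column indices ordered from most to least important.
order = np.argsort(importances)[::-1]
print(order[:3])

# The importances are normalized, so they sum to 1.
print(importances.sum())
```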

## 4. Principal Component Analysis



More about the PCA class in scikit-learn can be found in its API documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [0]:
from sklearn.decomposition import PCA

In [20]:
# Feature Extraction with PCA
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)

Explained Variance: [0.889 0.062 0.026]


In [21]:
print(fit.components_)

[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]


You can see that the transformed dataset (3 principal components) bears little resemblance to the source data: each component is a linear combination of all 8 original features. It is a completely different approach, valuable mainly if you need to reduce the dimensions and the complexity of the problem.
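To make the link between `components_` and the transformed data concrete, here is a minimal sketch on random synthetic data (not the Pima dataset). The names `X_demo`, `pca_demo`, `X_reduced` and `manual` are illustrative. It checks that each transformed column is the centred data projected onto one row of `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 100 rows, 8 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))

# Fit PCA, then project the data onto the 3 principal components.
pca_demo = PCA(n_components=3)
pca_demo.fit(X_demo)
X_reduced = pca_demo.transform(X_demo)
print(X_reduced.shape)  # (100, 3)

# The same projection written out by hand: centre the data, then multiply
# by the transposed component matrix (shape 8 x 3).
manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(X_reduced, manual))  # True
```

None of the 3 new columns corresponds to a single original feature, which is why PCA is better described as feature extraction than as feature selection.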

## <font color='red'>Exercise</font>

<div class="alert alert-block alert-info">
Can you "compare" the first 3 methods above and draw any conclusions on the features you would eventually pick?
</div>

1. **plas, test, age**
2. **preg, mass, pedi**
3. **plas, mass, age**

So, voting: plas 2, test 1, age 2, preg 1, pedi 1, mass 2.
 
What about choosing:

* **plas, age, mass**
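The vote count above can be reproduced with a `Counter`. This is a minimal sketch: the dictionary keys are just labels chosen here for the three methods, and the lists are the selections reported in the sections above.

```python
from collections import Counter

# Features picked by each of the first 3 methods (labels are illustrative).
selections = {
    'univariate (chi2)': ['plas', 'test', 'age'],
    'RFE': ['preg', 'mass', 'pedi'],
    'feature importance': ['plas', 'mass', 'age'],
}

# Count how many methods picked each feature.
votes = Counter(name for picked in selections.values() for name in picked)
print(votes)

# Keep the features with the highest vote count.
best = max(votes.values())
winners = sorted(name for name, count in votes.items() if count == best)
print(winners)  # ['age', 'mass', 'plas']
```

With only three methods voting, this is a rough heuristic rather than a statistical argument.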



## <font color='green'>Solution</font>

In [0]:
# insert your observations here

## Summary

What we did:

* we explored feature selection for preparing ML data in Python with scikit-learn, and we discovered 4 different automatic feature selection techniques.

## What's next 

Next, we will look at how to evaluate ML algorithms on your dataset, starting with resampling methods that can be used to estimate the performance of an ML algorithm on unseen data.