# Feature Selection for ML

We discuss feature selection techniques that you can use to prepare your ML data in Python with scikit-learn. 

The focus is on:
1. Univariate Selection
2. Recursive Feature Elimination
3. Feature Importance

Then we look at another approach: Principal Component Analysis (covered only briefly in this part of the course).

## Introduction

In what follows, keep in mind three possible benefits of a good feature selection strategy:

1. ***Reduction of Overfitting***
1. ***Improvement of Accuracy***
1. ***Reduction of Training Time***



More about feature selection with scikit-learn can be found [here](http://scikit-learn.org/stable/modules/feature_selection.html).

## 0. Import the data

In [6]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AMLBas2122/main/datasets/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

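|     | preg | plas | pres | skin | test | mass | pedi  | age | class |
|-----|------|------|------|------|------|------|-------|-----|-------|
| 0   | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1   | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2   | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3   | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4   | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |
| ... | ...  | ...  | ...  | ...  | ...  | ...  | ...   | ... | ...   |
| 763 | 10   | 101  | 76   | 48   | 180  | 32.9 | 0.171 | 63  | 0     |
| 764 | 2    | 122  | 70   | 27   | 0    | 36.8 | 0.340 | 27  | 0     |
| 765 | 5    | 121  | 72   | 23   | 112  | 26.2 | 0.245 | 30  | 0     |
| 766 | 1    | 126  | 60   | 0    | 0    | 30.1 | 0.349 | 47  | 1     |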


In [7]:
data.shape

(768, 9)

## 1. Univariate Selection

The scikit-learn library provides the `SelectKBest` class (info [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)) that can be used with a suite of different statistical tests to select a specific number of features. Apart from `SelectKBest`, you may use `SelectPercentile` or `GenericUnivariateSelect` (check documentation).

The example below uses the chi-squared (chi2) statistical test, which requires non-negative features, to select the 3 best features from the Pima Indians onset of diabetes dataset.
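
As a side note on the non-negativity requirement, here is a minimal sketch on a hypothetical toy array (`X_toy`, `y_toy` and `X_bad` are illustrative names, not part of the course dataset). It calls `chi2` directly to get one score and one p-value per feature, then shows that a negative entry is rejected.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical toy data: 6 samples, 2 non-negative features, binary target.
X_toy = np.array([[1, 0], [2, 1], [3, 0], [8, 1], [9, 0], [10, 1]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# chi2 returns one score and one p-value per feature.
scores, p_values = chi2(X_toy, y_toy)
print(scores)

# A single negative value makes chi2 refuse the input.
X_bad = X_toy.copy()
X_bad[0, 0] = -1.0
try:
    chi2(X_bad, y_toy)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

In this toy case the first feature separates the two classes much better than the second, so it gets the larger score.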

In [8]:
from numpy import set_printoptions

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [9]:
array = data.values
X = array[:,0:8]
Y = array[:,8]

In [10]:
X.shape

(768, 8)

In [11]:
Y.shape

(768,)

In [12]:
# use the chi2 score to select the k=3 best features
test = SelectKBest(score_func=chi2, k=3)
fit = test.fit(X, Y)

In [13]:
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]


In [15]:
names[:-1]

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

So, the **k(=3) best features** seem to be **test** and **plas**, followed by **age** with a much lower score.
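
Reading the ranking off two separate printouts is error-prone. The sketch below pairs the scores printed above with the feature names and sorts them; the values are copied by hand from the output of this notebook, and `feature_names`, `chi2_scores` and `top3` are illustrative names.

```python
import numpy as np

# Feature names and chi2 scores copied from the outputs above.
feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
chi2_scores = np.array([111.52, 1411.887, 17.605, 53.108,
                        2175.565, 127.669, 5.393, 181.304])

# Indices of the features, from highest to lowest score.
order = np.argsort(chi2_scores)[::-1]
ranked = [(feature_names[i], float(chi2_scores[i])) for i in order]

for name, score in ranked:
    print(f"{name:>5}  {score:10.3f}")

top3 = [name for name, _ in ranked[:3]]
print(top3)  # ['test', 'plas', 'age']
```

On a fitted selector, `get_support()` returns the boolean mask of the kept columns, which can be combined with the names in the same way.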

[NOTE]: If you also apply the transform, the shape of the data changes: only the k selected columns are kept (careful, this is powerful).

In [16]:
X_new = SelectKBest(score_func=chi2, k=3).fit_transform(X, Y)

In [17]:
X_new.shape

(768, 3)

In [18]:
X_new

array([[148.,   0.,  50.],
       [ 85.,   0.,  31.],
       [183.,   0.,  32.],
       ...,
       [121., 112.,  30.],
       [126.,   0.,  47.],
       [ 93.,   0.,  23.]])

Let's get back on track and use the selector we fitted earlier.

In [19]:
# summarize selected features
X_new1 = fit.transform(X)
print(X_new1[0:10,:])

[[148.   0.  50.]
 [ 85.   0.  31.]
 [183.   0.  32.]
 [ 89.  94.  21.]
 [137. 168.  33.]
 [116.   0.  30.]
 [ 78.  88.  26.]
 [115.   0.  29.]
 [197. 543.  53.]
 [125.   0.  54.]]


You can do all of the above in one shot with `fit_transform` (careful: this is often dangerous, for example when the selector is fitted on data that you later use for testing).

In [None]:
X_new2 = SelectKBest(score_func=chi2, k=3).fit_transform(X,Y)
print(X_new2[0:10,:])
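
One common way to limit that risk is to put the selector inside a `Pipeline`, so that it is re-fitted on the training part of each cross-validation split only. The following is a sketch under stated assumptions, not the course's method: it uses synthetic data from `make_classification` instead of the Pima dataset, and a `MinMaxScaler` step is added only because the synthetic features can be negative while chi2 needs non-negative input.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 300 samples, 8 features, 3 informative.
X_syn, y_syn = make_classification(n_samples=300, n_features=8,
                                   n_informative=3, random_state=0)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                      # training features mapped to [0, 1]
    ("select", SelectKBest(score_func=chi2, k=3)),  # fitted on the training fold only
    ("clf", LogisticRegression(max_iter=5000)),
])

# Each of the 5 folds re-fits scaler, selector and classifier from scratch.
cv_scores = cross_val_score(pipe, X_syn, y_syn, cv=5)
print(cv_scores.mean())
```

The selected columns may differ from fold to fold; that variation is itself a hint about how stable the selection is.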

## 2. Recursive Feature Elimination


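Recursive Feature Elimination (RFE) repeatedly fits a model and removes the weakest features until the requested number of features is left. More about the RFE class can be found in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).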

In [20]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [22]:
#model = LogisticRegression()
model = LogisticRegression(solver='lbfgs', max_iter=5000)
#rfe = RFE(model, 3)    # positional form: this started to give an error in recent sklearn versions
rfe = RFE(model, n_features_to_select=3)   # my choice: look for 3 features
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 4 6 5 1 1 3]


In [23]:
names[:-1]

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

You can see that RFE has chosen the **top 3 features** as **preg, mass, pedi**.
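
The boolean mask and the ranking are easier to read once they are paired with the feature names. The sketch below does that with the values printed above, copied by hand; `support`, `ranking`, `selected` and `eliminated_first_to_last` are illustrative names. In RFE, rank 1 marks the selected features, and a higher rank means the feature was eliminated earlier.

```python
import numpy as np

# Names, mask and ranking copied from the outputs above.
feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
support = np.array([True, False, False, False, False, True, True, False])
ranking = np.array([1, 2, 4, 6, 5, 1, 1, 3])

# Features kept by RFE (rank 1).
selected = [name for name, keep in zip(feature_names, support) if keep]
print(selected)  # ['preg', 'mass', 'pedi']

# Discarded features, from the first one eliminated to the last one.
discarded = [(int(r), name) for r, name in zip(ranking, feature_names) if r > 1]
eliminated_first_to_last = [name for _, name in sorted(discarded, reverse=True)]
print(eliminated_first_to_last)  # ['skin', 'test', 'pres', 'age', 'plas']
```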

## 3. Feature Importance


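Ensembles of decision trees, such as Extra Trees, can be used to estimate the importance of each feature. More about the ExtraTreesClassifier class can be found in the scikit-learn API [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html).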

In [24]:
from sklearn.ensemble import ExtraTreesClassifier

In [25]:
# Feature Importance with Extra Trees Classifier
model = ExtraTreesClassifier()
#model = ExtraTreesClassifier(n_estimators=100)
model.fit(X, Y)
print(model.feature_importances_)

[0.111 0.232 0.102 0.078 0.076 0.14  0.116 0.146]


In [26]:
names[:-1]

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

**The larger the score, the more important the attribute**. The scores point to the importance of **plas** and perhaps also **age** and **mass**.
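
Because Extra Trees is a randomized method, the importances change slightly from run to run unless the seed is fixed. The sketch below is an illustration on synthetic data (an assumption, generated with `make_classification`), with `random_state=0` chosen arbitrarily; it also shows that the importances are normalized to sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption): 300 samples, 8 features, 3 informative.
X_syn, y_syn = make_classification(n_samples=300, n_features=8,
                                   n_informative=3, random_state=0)

# Fixing random_state makes the importances reproducible.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_syn, y_syn)

importances = forest.feature_importances_
print(importances.sum())  # normalized, so close to 1.0

# Column indices from most to least important.
order = np.argsort(importances)[::-1]
print(order)
```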

## Another approach: Principal Component Analysis



More about the PCA class in scikit-learn can be found in its API documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [27]:
from sklearn.decomposition import PCA

In [28]:
# Feature Extraction with PCA
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)

Explained Variance: [0.889 0.062 0.026]


In [29]:
print(fit.components_)

[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]


You can see that the transformed dataset (3 principal components) bears little resemblance to the source data! It is a completely different approach, valuable mainly if you need to reduce the dimensions and the complexity of the problem.
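
PCA is sensitive to the scale of the features: a column with a much larger variance tends to dominate the first component (in the output above, the first component loads almost entirely on one column). The sketch below illustrates the effect on hypothetical synthetic data, where one of four independent features is 100 times larger than the others, and compares PCA with and without a `StandardScaler` step. Whether standardizing is appropriate for the Pima data is left as an open choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 4 independent features, the last one on a much larger scale.
X_syn = rng.normal(size=(200, 4)) * np.array([1.0, 1.0, 1.0, 100.0])

# PCA on the raw data: the large-scale feature dominates.
raw_pca = PCA(n_components=3).fit(X_syn)
raw_ratio = raw_pca.explained_variance_ratio_

# PCA after standardizing each feature to zero mean and unit variance.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=3)).fit(X_syn)
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

Without scaling, nearly all the variance is attributed to the first component; after scaling, it is spread much more evenly.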

---

### <font color='red'>Exercise</font>

Can you "compare" the first 3 methods above and draw any conclusions on the features you would eventually pick?

---

## Summary

What we did:

* We explored feature selection for preparing ML data in Python with scikit-learn, and we covered 4 different automatic techniques: univariate selection, recursive feature elimination, feature importance, and PCA.

## What's next 

Next, we will look at how to evaluate ML algorithms on your dataset, starting with resampling methods that can be used to estimate the performance of an ML algorithm on unseen data.