# Feature Selection for ML

We discuss feature selection techniques that you can use to prepare your ML data in Python with scikit-learn. 

We focus on:
1. Univariate Selection
2. Recursive Feature Elimination
3. Feature Importance

Then we look at another approach: Principal Component Analysis (covered only briefly in this part of the course).

## Introduction

In what follows, keep in mind three possible benefits of a well-chosen feature selection strategy:

1. ***Reduction of Overfitting***
1. ***Improvement of Accuracy***
1. ***Reduction of Training Time***



More about feature selection with scikit-learn can be found [here](http://scikit-learn.org/stable/modules/feature_selection.html).

## 0. Import the data

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AML2021Bas/main/datasets/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

In [None]:
data.shape

## 1. Univariate Selection

The scikit-learn library provides the `SelectKBest` class (info [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)) that can be used with a suite of different statistical tests to select a specific number of features. Apart from `SelectKBest`, you may use `SelectPercentile` or `GenericUnivariateSelect` (check documentation).
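
As a rough sketch of how the three selectors relate, the snippet below fits each of them with the chi-squared score. It uses synthetic stand-in data rather than the Pima dataset, and the names `X_demo` and `y_demo`, the seed, and the choice of `percentile=50` are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import (
    GenericUnivariateSelect,
    SelectKBest,
    SelectPercentile,
    chi2,
)

# Synthetic stand-in data (an assumption, not the Pima dataset):
# 200 samples, 8 non-negative features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(200, 8)).astype(float)
y_demo = rng.integers(0, 2, size=200)
X_demo[:, 0] += 5 * y_demo  # shift feature 0 so that it is informative

# Keep the 3 highest-scoring features.
k_best = SelectKBest(score_func=chi2, k=3).fit(X_demo, y_demo)

# Keep the top 50% of the features (4 out of 8 here).
top_half = SelectPercentile(score_func=chi2, percentile=50).fit(X_demo, y_demo)

# The generic class reproduces SelectKBest when mode="k_best".
generic = GenericUnivariateSelect(score_func=chi2, mode="k_best", param=3).fit(X_demo, y_demo)

print(k_best.get_support())
print(top_half.get_support())
print(generic.get_support())
```

`get_support()` returns a boolean mask over the input columns, so the three results can be compared directly.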

The example below uses the chi-squared (chi2) statistical test for non-negative features to select 3 of the best features from the Pima Indians onset of diabetes dataset.

In [None]:
from numpy import set_printoptions

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
array = data.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
X.shape

In [None]:
Y.shape

In [None]:
# use the chi-squared test to select the k=3 best features
test = SelectKBest(score_func=chi2, k=3)
fit = test.fit(X, Y)

In [None]:
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)
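
To make these scores less of a black box, here is a minimal sketch of how a chi-squared score per feature can be computed by hand and checked against `sklearn.feature_selection.chi2`. To the best of my understanding it mirrors what scikit-learn does: each feature is treated as a count, which is why the test needs non-negative values. The data is synthetic, and `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Synthetic stand-in data (an assumption, not the Pima dataset).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(200, 4)).astype(float)
y_demo = rng.integers(0, 2, size=200)

# One-hot encode the target: one column per class.
classes = np.unique(y_demo)
Y_onehot = (y_demo[:, None] == classes[None, :]).astype(float)

# Observed: total of each feature within each class.
observed = Y_onehot.T @ X_demo
# Expected: each feature total split according to the class frequencies.
expected = np.outer(Y_onehot.mean(axis=0), X_demo.sum(axis=0))

# Chi-squared statistic per feature, summed over the classes.
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, p_values = chi2(X_demo, y_demo)

print(manual_scores)
print(sklearn_scores)
```

A large score means the feature totals differ between classes much more than the class frequencies alone would explain.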

In [None]:
names[:-1]

So, the **k(=3) best ones** seem to be **plas** and **test** (and perhaps **age**, but with a much lower score).
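
Reading scores and names side by side is error-prone, so one option is to label the scores and sort them. The sketch below does this on synthetic data with hypothetical column names (`f0` to `f7`). In this notebook you would pass `fit.scores_` and `names[:-1]` instead.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in data with hypothetical column names (not the Pima dataset).
rng = np.random.default_rng(0)
feature_names = [f"f{i}" for i in range(8)]
X_demo = rng.integers(0, 10, size=(200, 8)).astype(float)
y_demo = rng.integers(0, 2, size=200)
X_demo[:, 0] += 5 * y_demo  # make f0 clearly informative

selector = SelectKBest(score_func=chi2, k=3).fit(X_demo, y_demo)

# Label each score with its feature name and sort from best to worst.
ranked = pd.Series(selector.scores_, index=feature_names).sort_values(ascending=False)
print(ranked)

# Names of the k features that the selector keeps.
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print(selected)
```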

[NOTE]: If you also apply the transform, the shape of the data changes, because only the selected columns are kept. (Careful, this is powerful.)

In [None]:
X_new = SelectKBest(score_func=chi2, k=3).fit_transform(X, Y)

In [None]:
X_new.shape

In [None]:
X_new

Let's get back on track.

In [None]:
# summarize selected features
X_new1 = fit.transform(X)
print(X_new1[0:10,:])

You can do all of the above in one shot with `fit_transform` (careful: this is often dangerous).

In [None]:
X_new2 = SelectKBest(score_func=chi2, k=3).fit_transform(X,Y)
print(X_new2[0:10,:])
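
One possible reading of the warning above is that calling `fit_transform` on the full dataset lets the selector see rows that are later used for testing. Under that assumption, a common pattern is to put the selector inside a `Pipeline`, so that it is fitted on the training split only. The sketch below uses synthetic data, and the split size, seed, and `max_iter` value are illustrative choices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption, not the Pima dataset).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(300, 8)).astype(float)
y_demo = rng.integers(0, 2, size=300)
X_demo[:, 0] += 5 * y_demo  # make feature 0 informative

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The selector is fitted on the training rows only, then reused on the test rows.
pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

print(pipe.named_steps["select"].get_support())
print(pipe.score(X_test, y_test))
```

With this layout the test rows never influence which features are kept.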

## 2. Recursive Feature Elimination


More about the RFE class in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)    # my choice: look for 3 features
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

In [None]:
names[:-1]

You can see that RFE has chosen the **top 3 features** as **preg, mass, pedi**.
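
To show what happens inside RFE, here is a hand-written version of the elimination loop next to the library class: fit the model, drop the feature with the smallest squared coefficient, and repeat until 3 features remain. This is a sketch on synthetic data, based on my understanding of how RFE ranks features for a linear model. `X_demo`, `y_demo`, and `max_iter=1000` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the Pima dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Hand-written elimination loop: refit, drop the weakest feature, repeat.
remaining = list(range(X_demo.shape[1]))
while len(remaining) > 3:
    model = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    importance = model.coef_[0] ** 2      # squared coefficient per feature
    weakest = int(np.argmin(importance))  # position within `remaining`
    del remaining[weakest]

# The same procedure through scikit-learn's RFE (one feature removed per step).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X_demo, y_demo)
rfe_selected = np.flatnonzero(rfe.support_).tolist()

print(remaining)
print(rfe_selected)
```

Because the coefficients depend on the scale of each column, standardizing the features first may change which ones survive.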

## 3. Feature Importance


More about the ExtraTreesClassifier class can be found in the scikit-learn API [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html).

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
# Feature Importance with Extra Trees Classifier
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

In [None]:
names[:-1]

**The larger the score, the more important the attribute**. The scores hint at the importance of **plas**, and perhaps also of **mass** and **age**.
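
Extra Trees builds randomized trees, so the importances usually change a little from run to run. A minimal sketch of making them reproducible and sorting them is shown below. The data is synthetic, and `random_state=0` and `n_estimators=100` are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (an assumption, not the Pima dataset).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(300, 8)).astype(float)
y_demo = rng.integers(0, 2, size=300)
X_demo[:, 0] += 5 * y_demo  # make feature 0 informative

# Two fits with the same seed give identical importances.
model_a = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
model_b = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

importances = model_a.feature_importances_
order = np.argsort(importances)[::-1]  # column indices, most important first

print(importances)
print(order)
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.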

## Another approach: Principal Component Analysis



More about the PCA class in scikit-learn can be found in its API documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [None]:
from sklearn.decomposition import PCA

In [None]:
# Feature Extraction with PCA
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)

In [None]:
print(fit.components_)

You can see that the transformed dataset (3 principal components) bears little resemblance to the source data! It is a completely different approach, valuable mainly if you need to reduce the dimensions and the complexity of the problem.
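
The reason for the lack of resemblance is that each principal component is a weighted sum of all the original (centred) columns, not a subset of them. The sketch below checks this on synthetic data by rebuilding the output of `transform` by hand. `X_demo` and the column scaling are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 8 columns with different spreads (not the Pima dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8)) * np.arange(1, 9)

pca = PCA(n_components=3).fit(X_demo)
X_pca = pca.transform(X_demo)

# Each new column is a weighted sum of all the centred original columns.
X_manual = (X_demo - pca.mean_) @ pca.components_.T

print(X_pca.shape)  # (200, 3)
print(np.cumsum(pca.explained_variance_ratio_))
```

Each row of `components_` holds the weights of one component, so it has one entry per original feature.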

---

### <font color='red'>Exercise</font>

Can you "compare" the first 3 methods above and draw any conclusions on the features you would eventually pick?

---

## Summary

What we did:

* we explored feature selection for preparing ML data in Python with scikit-learn, and we discovered 4 different automatic feature selection techniques.

## What's next 

Next, we will look at how to evaluate ML algorithms on your dataset, starting with resampling methods that can be used to estimate the performance of an ML algorithm on unseen data.