# Univariate Feature Selection

## Table of Contents
1. Introduction to univariate feature selection 
1. The `SelectKBest` and `SelectPercentile` transformer classes

## 1. Introduction to univariate feature selection

Univariate feature selection techniques evaluate each feature individually by applying a scoring function to it (usually in relation to the target variable), and then choose features based on their rank under that score.

See _Feature selection_ at Wikipedia for additional details.
- https://en.wikipedia.org/wiki/Feature_selection

To perform univariate feature selection, you need to:

- define the __number of features__ that you want to keep.
- select the __scoring function__ that will evaluate the relationship between the variables.
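
The two choices above can be sketched by hand before reaching for the dedicated classes. This is a minimal illustration, not the library's implementation: the helper name `select_top_k` is hypothetical, and scikit-learn's bundled copy of the iris data is assumed as a stand-in for any feature matrix and target.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

# Assumption: scikit-learn's bundled iris data stands in for any (X, y) pair.
X_demo, y_demo = load_iris(return_X_y=True)


def select_top_k(X, y, score_func, k):
    """Return the column indices of the k highest-scoring features (hypothetical helper)."""
    scores, _ = score_func(X, y)        # f_classif returns (F-values, p-values)
    ranked = np.argsort(scores)[::-1]   # indices ordered from best to worst score
    return np.sort(ranked[:k])          # keep the original column order


top2 = select_top_k(X_demo, y_demo, f_classif, k=2)
X_top2 = X_demo[:, top2]
print(top2, X_top2.shape)
```

The number of features (`k`) and the scoring function (`score_func`) are the only two inputs, which mirrors the two decisions listed above.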

## 2. The `SelectKBest` and `SelectPercentile` transformer classes

The scikit-learn library provides transformer classes that work with a suite of statistical tests to select a specific number (or percentage) of features.

You can set the __number of features__ through either of two classes:
- The `SelectKBest` transformer class. Selects the k highest-scoring features.
- The `SelectPercentile` transformer class. Selects the highest-scoring features up to the percentage that you define.
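
As a rough illustration of how the two classes relate, the sketch below fits both on the same data. It assumes scikit-learn's bundled iris data (four features) as a stand-in dataset, so asking for `k=2` and for `percentile=50` should describe the same selection.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# Assumption: the bundled iris data (4 features) is used as a stand-in dataset.
X_demo, y_demo = load_iris(return_X_y=True)

# Keep a fixed count of features...
kbest_mask = SelectKBest(f_classif, k=2).fit(X_demo, y_demo).get_support()
# ...or keep a fixed share of them.
pct_mask = SelectPercentile(f_classif, percentile=50).fit(X_demo, y_demo).get_support()

print(kbest_mask, pct_mask, np.array_equal(kbest_mask, pct_mask))
```

`SelectKBest` is convenient when the downstream model needs a fixed input width, while `SelectPercentile` scales with the number of columns.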

Regarding the __scoring functions__, you'll have different functions for classification and regression problems. 

For classification problems, two scoring functions are commonly used:
- `f_classif`. Based on analysis of variance (ANOVA).
- `mutual_info_classif`. Based on mutual information.

For regression problems, two scoring functions are commonly used:
- `f_regression`. Based on the correlation between each feature and the target.
- `mutual_info_regression`. Based on mutual information.
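
The rest of this notebook works through a classification example, so here is a brief sketch of the two regression scoring functions called directly. The synthetic dataset from `make_regression` and the variable names are assumptions made for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression, mutual_info_regression

# Assumption: a small synthetic regression problem with 5 features, 2 of them informative.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=5, n_informative=2, random_state=0
)

# f_regression returns one F-statistic and one p-value per feature.
f_scores, p_values = f_regression(X_reg, y_reg)

# mutual_info_regression returns one non-negative MI estimate per feature.
mi_scores = mutual_info_regression(X_reg, y_reg, random_state=0)

print(f_scores.round(2))
print(mi_scores.round(3))
```

Either function can be passed as the scoring function to `SelectKBest` or `SelectPercentile`.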

### 2.1 `SelectKBest`

The class `SelectKBest` in the `sklearn.feature_selection` module can be used to remove all but the k highest-scoring features. For instance, we can apply an analysis of variance (ANOVA) F-test to the iris dataset to keep only the two best features as follows:

Load the libraries.

In [11]:
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

Load iris data.

In [13]:
iris = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/iris.csv')

Create features `X` and target `y`.

In [15]:
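# All columns except the last are features; the last column is the class label.
X = iris.iloc[:, :-1].to_numpy(dtype=float)
y = iris.iloc[:, -1].to_numpy()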

Create an instance of the `SelectKBest` class and store it in an object `selector`. Set the scoring function to `f_classif` and `k=2` to select the two features with the highest ANOVA F-values.

In [17]:
selector = SelectKBest(f_classif, k=2)

The `fit` method computes the ANOVA F-values from `(X, y)`.

In [19]:
selector.fit(X, y)

After calling the `fit` method, the F-values of the features are stored in the `scores_` attribute and the p-values are stored in the `pvalues_` attribute.

In [21]:
selector.scores_, selector.pvalues_
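
The raw arrays are easier to read when paired with feature names. The sketch below is one possible way to do that; it assumes scikit-learn's bundled iris data (and its feature names) instead of the CSV file loaded above, and the names `sel` and `score_table` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: the bundled iris data replaces the CSV used in this notebook.
data = load_iris()
sel = SelectKBest(f_classif, k=2).fit(data.data, data.target)

# One row per feature, sorted so the highest F-value comes first.
score_table = pd.DataFrame(
    {"feature": data.feature_names, "F": sel.scores_, "p_value": sel.pvalues_}
).sort_values("F", ascending=False)

print(score_table)
```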

The `get_support()` method of the object returns a boolean array that indicates, for each feature, whether it is selected for retention.

In [23]:
selector.get_support()

Call the `transform` method to reduce `X` to the selected features. Store the selected features in a new variable `X_kbest`.

In [25]:
X_kbest = selector.transform(X)
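
When the fitted scores do not need to be inspected in between, fitting and transforming can be combined in one call to `fit_transform`. A minimal sketch, assuming the bundled iris data as a stand-in for `X` and `y`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: the bundled iris data replaces the CSV used in this notebook.
X_demo, y_demo = load_iris(return_X_y=True)

# Two-step version: fit the selector, then reduce the matrix.
two_step = SelectKBest(f_classif, k=2).fit(X_demo, y_demo).transform(X_demo)

# One-step version: fit and reduce in a single call.
one_step = SelectKBest(f_classif, k=2).fit_transform(X_demo, y_demo)

print(np.array_equal(two_step, one_step))
```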

Use the `shape` attribute of the two arrays `X` and `X_kbest` to display the number of features before and after feature selection.

In [27]:
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])

Display names of the selected features according to the boolean array returned by the `get_support()` method.

In [29]:
names = iris.columns.values[0:-1]
print('Selected Features:')
# Index the array of names directly with the boolean mask of kept features.
for name in names[selector.get_support()]:
    print(name)
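
As an alternative to the boolean mask, `get_support(indices=True)` returns the integer positions of the kept columns, which can be convenient for lookups. A small sketch, again assuming the bundled iris data and its feature names rather than the CSV columns:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: the bundled iris data replaces the CSV used in this notebook.
data = load_iris()
sel = SelectKBest(f_classif, k=2).fit(data.data, data.target)

# Integer column positions of the selected features instead of a boolean mask.
kept_idx = sel.get_support(indices=True)
selected_names = [data.feature_names[i] for i in kept_idx]

print(kept_idx, selected_names)
```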

This section showed how to use the `SelectKBest` transformer class to select the k highest-scoring features, and how to display the number and names of the selected features.

### 2.2 `SelectPercentile`

The class `SelectPercentile` in the `sklearn.feature_selection` module can be used to remove all but a user-specified percentage of the highest-scoring features. For instance, we can estimate mutual information (MI) on the iris dataset to keep only the top 50% of features as follows.

- _About mutual information (MI): the mutual information between two random variables is a non-negative value that measures the dependency between them. It equals zero if and only if the two variables are independent, and higher values indicate stronger dependency._
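
To make the definition concrete, the sketch below computes MI for two discrete variables directly from a joint probability table. This is the textbook formula only; the helper `mutual_information` is hypothetical, and scikit-learn's `mutual_info_classif` estimates MI for continuous features with a nearest-neighbor method rather than with this table-based sum.

```python
import numpy as np


def mutual_information(joint):
    """MI in nats of two discrete variables, given their joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    py = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    product = px @ py                       # joint table under independence
    nonzero = joint > 0                     # treat 0 * log(0) as 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / product[nonzero])))


# Independent variables: the joint table is the product of the marginals.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
# Fully dependent variables: knowing one value determines the other.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

The independent table gives a value of (numerically) zero, and the fully dependent table gives log 2, the entropy of a fair binary variable.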

Load the libraries.

In [34]:
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import mutual_info_classif

Create an instance of the `SelectPercentile` class and store it in an object `selector_mi`. Set the scoring function to `mutual_info_classif` and `percentile=50` to select the 50% of features with the highest estimated mutual information (MI).

In [36]:
selector_mi = SelectPercentile(mutual_info_classif, percentile=50)

The `fit` method computes the estimated mutual information between each feature and the target.

In [38]:
selector_mi.fit(X, y)

After calling the `fit` method, the estimated MI scores of the features are stored in the `scores_` attribute.

In [40]:
selector_mi.scores_
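
The MI estimates can differ slightly between runs, because `mutual_info_classif` adds a small amount of random noise to continuous features. One way to make the scores repeatable is to fix its `random_state` with `functools.partial`. The sketch below assumes the bundled iris data and uses illustrative names (`seeded_mi`, `run_a`, `run_b`).

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

# Assumption: the bundled iris data replaces the CSV used in this notebook.
X_demo, y_demo = load_iris(return_X_y=True)

# Bind random_state so every call to the scoring function uses the same seed.
seeded_mi = partial(mutual_info_classif, random_state=0)

run_a = SelectPercentile(seeded_mi, percentile=50).fit(X_demo, y_demo)
run_b = SelectPercentile(seeded_mi, percentile=50).fit(X_demo, y_demo)

print(run_a.scores_)
print(np.allclose(run_a.scores_, run_b.scores_))
```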

The `get_support()` method of the object returns a boolean array that indicates, for each feature, whether it is selected for retention.

In [42]:
selector_mi.get_support()

Call the `transform` method to reduce `X` to the selected features. Store the selected features in a new variable `X_pct`.

In [44]:
X_pct = selector_mi.transform(X)

Display the number of features before and after feature selection.

In [46]:
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_pct.shape[1])

Display names of the selected features.

In [48]:
print('Selected Features:')
# Index the array of names directly with the boolean mask of kept features.
for name in names[selector_mi.get_support()]:
    print(name)

Note: The selected features may be multicollinear, because univariate feature selection scores each feature on its own and does not account for relationships between features. Methods of dealing with multicollinearity won't be illustrated in this notebook.
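
A quick way to see the issue, without addressing it, is to check the correlation between the selected columns. The sketch below assumes the bundled iris data and the ANOVA F-test from section 2.1; it only measures the correlation and does not remove anything.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: the bundled iris data replaces the CSV used in this notebook.
X_demo, y_demo = load_iris(return_X_y=True)

# Keep the two features with the highest ANOVA F-values.
X_sel = SelectKBest(f_classif, k=2).fit_transform(X_demo, y_demo)

# Pearson correlation between the two selected columns.
corr = np.corrcoef(X_sel, rowvar=False)[0, 1]
print(round(corr, 3))
```

A correlation close to 1 means the two selected features carry largely overlapping information, even though each scores well individually.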

This section showed how to use the `SelectPercentile` transformer class to select a percentage of the highest-scoring features, and how to display the number and names of the selected features.