Import necessary libraries

In [1]:
from IPython.display import display, Math, Latex

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")

## Feature Selection

The `sklearn.feature_selection` module provides APIs to select features or reduce dimensionality, either to improve estimators' accuracy scores or to boost their performance on very high-dimensional datasets.

Top reasons to use feature selection are:


* It enables the machine learning algorithm to train faster.

* It reduces the complexity of a model and makes it easier to interpret.

* It improves the accuracy of a model if the right subset is chosen.

* It can reduce overfitting by removing noisy or redundant features.

#### **1.FILTER-BASED METHODS**

##### 1.A. Variance Threshold

* This transformer keeps only the high-variance features, as determined by a given threshold.

* Features whose variance is greater than the threshold are kept; the rest are removed.

* By default (`threshold=0`), it removes only features that have the same value in every sample, i.e. zero variance.

In [2]:
data = [{'age': 4, 'height': 96.0},
        {'age': 1, 'height': 73.9},
        {'age': 3,  'height': 88.9},
        {'age': 2, 'height': 81.6}
        ]

from sklearn.feature_extraction import DictVectorizer     
dv = DictVectorizer(sparse=False)

data_transformed = dv.fit_transform(data)
np.var(data_transformed, axis=0)

array([ 1.25 , 67.735])

In [3]:
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=5)

data_new = vt.fit_transform(data_transformed)
data_new

array([[96. ],
       [73.9],
       [88.9],
       [81.6]])

As the output of the above cell shows, the transformer has removed the `age` feature because its variance (1.25) is below the threshold of 5.

##### 1.B. SelectKBest

It keeps the `k` highest-scoring features according to a scoring function and removes the rest.

Let's take the California Housing dataset as an example.

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_regression


X_california, y_california = fetch_california_housing(return_X_y=True)
X, y = X_california[:2000], y_california[:2000]

Let's select the 3 most important features. Since this is a regression problem, we should use a scoring function meant for regression targets, such as `mutual_info_regression` or `f_regression`.

In [5]:
# mutual_info_regression scores each feature by its estimated mutual information with a continuous target

skb = SelectKBest(mutual_info_regression, k=3)
X_new = skb.fit_transform(X, y)

print(f'Shape of feature-matrix before feature selection : {X.shape}')
print(f'Shape of feature-matrix after feature selection : {X_new.shape}')


Shape of feature-matrix before feature selection : (2000, 8)
Shape of feature-matrix after feature selection : (2000, 3)
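To see why particular columns are chosen, it can help to inspect the fitted selector's `scores_` attribute. The following is a minimal sketch under two assumptions that differ from the cell above: it uses synthetic data from `make_regression` (so it runs without downloading anything) and `f_regression` as the scoring function (so the scores are deterministic). The `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data (an assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, n_informative=3,
                                 noise=10.0, random_state=0)

skb_demo = SelectKBest(f_regression, k=3).fit(X_demo, y_demo)

# scores_ holds one score per input feature; a higher score means more relevant.
print(np.round(skb_demo.scores_, 2))

# Column indices of the selected features, in ascending order.
selected = skb_demo.get_support(indices=True)
print(selected)

# The selected columns should be the k columns with the largest scores.
top_k_by_score = np.sort(np.argsort(skb_demo.scores_)[-3:])
print(np.array_equal(selected, top_k_by_score))
```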


##### 1.C. SelectPercentile

* This is very similar to `SelectKBest` from the previous section; the only difference is that it selects the top `percentile` of all features and drops the rest.

* Similar to `SelectKBest`, it also uses a scoring function to decide the importance of features.

Let's use the California housing price dataset for this API.

In [6]:
from sklearn.feature_selection import SelectPercentile

sp = SelectPercentile(mutual_info_regression, percentile=30)
X_new = sp.fit_transform(X, y)

print(f'Shape of feature-matrix before feature selection : {X.shape}')
print(f'Shape of feature-matrix after feature selection : {X_new.shape}')


Shape of feature-matrix before feature selection : (2000, 8)
Shape of feature-matrix after feature selection : (2000, 3)


As you can see from the above output, the transformed data now has only the top 30 percent of features, i.e. only 3 out of 8 features.

In [7]:
skb.get_feature_names_out()

array(['x0', 'x6', 'x7'], dtype=object)
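Since 30 percent of 8 features is not a whole number, it may be worth checking how a `percentile` value translates into a feature count. The sketch below tries a few percentiles on synthetic data; the use of `make_regression` and `f_regression` (chosen for deterministic scores) and the names `X_demo`, `y_demo` and `kept_counts` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression

# Synthetic stand-in data (an assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, n_informative=3,
                                 noise=10.0, random_state=0)

# Record how many of the 8 features survive for each percentile setting.
kept_counts = {}
for pct in (10, 30, 50, 100):
    sp_demo = SelectPercentile(f_regression, percentile=pct).fit(X_demo, y_demo)
    kept_counts[pct] = int(sp_demo.get_support().sum())

print(kept_counts)
```

With only a handful of features, `SelectKBest` is often the more predictable choice, because the count is stated directly.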

##### 1.D. GenericUnivariateSelect

* It applies univariate feature selection with a configurable strategy, which is passed to the API via the `mode` parameter.

* The `mode` can take one of the following values : 

    * `percentile` (top percentage)

    * `k_best` (top k)

    * `fpr` (false positive rate)

    * `fdr` (false discovery rate)

    * `fwe` (family wise error rate) 

* If we want to accomplish the same objective as `SelectKBest`, we can use following code: 

In [8]:
from sklearn.feature_selection import GenericUnivariateSelect 

gus = GenericUnivariateSelect(mutual_info_regression, mode='k_best', param=3)
X_new = gus.fit_transform(X, y)

print(f'Shape of feature-matrix before feature selection : {X.shape}')
print(f'Shape of feature-matrix after feature selection : {X_new.shape}')

Shape of feature-matrix before feature selection : (2000, 8)
Shape of feature-matrix after feature selection : (2000, 3)
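The same wrapper can stand in for `SelectPercentile` by switching `mode`. As a hedged check, the sketch below compares the support masks of `GenericUnivariateSelect` against the dedicated selectors. It assumes synthetic data from `make_regression` and uses `f_regression` rather than `mutual_info_regression`, because the latter involves randomness and could make the comparison flaky; all `*_demo` and `gus_*` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import (GenericUnivariateSelect, SelectKBest,
                                       SelectPercentile, f_regression)

# Synthetic stand-in data (an assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, n_informative=3,
                                 noise=10.0, random_state=0)

# mode='k_best' with param=3 should mirror SelectKBest(k=3).
gus_k = GenericUnivariateSelect(f_regression, mode='k_best', param=3).fit(X_demo, y_demo)
skb_demo = SelectKBest(f_regression, k=3).fit(X_demo, y_demo)

# mode='percentile' with param=30 should mirror SelectPercentile(percentile=30).
gus_pct = GenericUnivariateSelect(f_regression, mode='percentile', param=30).fit(X_demo, y_demo)
sp_demo = SelectPercentile(f_regression, percentile=30).fit(X_demo, y_demo)

same_k = np.array_equal(gus_k.get_support(), skb_demo.get_support())
same_pct = np.array_equal(gus_pct.get_support(), sp_demo.get_support())
print(same_k, same_pct)
```

Having the strategy as a parameter makes it possible to treat `mode` and `param` as hyperparameters in a search.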


#### **2.WRAPPER-BASED METHODS**

##### 2.A. Recursive Feature Elimination (RFE)

* STEP 1 : Fits the model 

* STEP 2 : Ranks the features, then removes one or more of the lowest-ranked features (depending upon the `step` parameter)

These two steps are repeated until the desired number of features is selected.
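The two steps can be sketched as a hand-written loop. This is an illustration of the idea rather than scikit-learn's actual implementation: it assumes a linear model, ranks features by the absolute value of `coef_`, and runs on synthetic data from `make_regression`. The names `remaining`, `n_keep` and `manual_support` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)

n_keep, step = 3, 3
remaining = np.arange(X_demo.shape[1])  # column indices still in play

while remaining.size > n_keep:
    # STEP 1: fit the model on the remaining features.
    model = LinearRegression().fit(X_demo[:, remaining], y_demo)
    # STEP 2: rank by |coefficient| and drop the weakest features,
    # never dropping below the requested number of features.
    n_drop = min(step, remaining.size - n_keep)
    weakest = np.argsort(np.abs(model.coef_))[:n_drop]
    remaining = np.delete(remaining, weakest)

manual_support = np.isin(np.arange(X_demo.shape[1]), remaining)

# Compare against scikit-learn's RFE with the same settings.
rfe_support = RFE(LinearRegression(), n_features_to_select=n_keep,
                  step=step).fit(X_demo, y_demo).support_
print(manual_support)
print(rfe_support)
```

Because coefficients are compared directly, this kind of ranking is only meaningful when the features are on comparable scales.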

In [9]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()

selector = RFE(estimator, n_features_to_select=3, step=3)
selector = selector.fit(X, y)

In [10]:
# support_ attribute is a boolean array marking which features are selected
print(selector.support_)

# rank of each feature
# if its value is 1, the feature is selected
# features with rank 2 or higher were eliminated; the higher the rank, the earlier the elimination
print(f'Rank of each feature is : {selector.ranking_}')

[ True False False False False False  True  True]
Rank of each feature is : [1 3 3 2 3 2 1 1]


In [11]:
X_new = selector.transform(X)

print(f'Shape of feature-matrix before feature selection : {X.shape}')
print(f'Shape of feature-matrix after feature selection : {X_new.shape}')

Shape of feature-matrix before feature selection : (2000, 8)
Shape of feature-matrix after feature selection : (2000, 3)


##### 2.B. SelectFromModel

* Selects the features whose importance, as obtained from the trained estimator, is at or above a threshold, keeping at most `max_features` of them.

* The feature importance is obtained via `coef_` or `feature_importances_` of the trained estimator, or via the `importance_getter` parameter (an attribute name or a callable).

* The feature importance threshold can be specified either numerically or through a string argument based on built-in heuristics such as `mean` and `median`, or float multiples of these like `0.1*mean`.
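To make the interaction between `threshold` and `max_features` concrete, here is a minimal sketch on synthetic data. The data from `make_regression` and the names `n_selected`, `top3` and `expected_idx` are assumptions for illustration. Setting `threshold=-np.inf` is the documented way to select on `max_features` alone.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, n_informative=3,
                                 noise=10.0, random_state=0)

# Count how many features pass each string-based threshold.
n_selected = {}
for threshold in ('mean', 'median', '0.1*mean'):
    sfm = SelectFromModel(LinearRegression(), threshold=threshold).fit(X_demo, y_demo)
    n_selected[threshold] = int(sfm.get_support().sum())
print(n_selected)

# Disable the threshold so that only max_features limits the selection.
top3 = SelectFromModel(LinearRegression(), threshold=-np.inf,
                       max_features=3).fit(X_demo, y_demo)
top3_idx = top3.get_support(indices=True)

# For a linear model, importance is the absolute value of each coefficient.
expected_idx = np.sort(np.argsort(np.abs(top3.estimator_.coef_))[-3:])
print(top3_idx, expected_idx)
```

When both parameters are set, a feature has to pass the threshold and also rank within the top `max_features`, so fewer than `max_features` columns may be returned.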

In [13]:
from sklearn.feature_selection import SelectFromModel

estimator = LinearRegression()
estimator.fit(X, y)

LinearRegression()

In [18]:
print(f'Coefficients of features :\n {estimator.coef_}')
print()
print(f'Intercept : {estimator.intercept_}')
print()
# Rank by absolute value: a large negative coefficient matters as much as a large positive one
print(f'Indices of top {3} features : {np.argsort(np.abs(estimator.coef_))[-3:]}')


Coefficients of features :
 [ 3.64048292e-01  5.56221906e-03  5.13591243e-02 -1.64474348e-01
  5.90411479e-05 -1.64573915e-01 -2.17724525e-01 -1.85343265e-01]

Intercept : -13.720597901356236

Indices of top 3 features : [7 6 0]


In [19]:
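# Indices of the 3 largest coefficients by absolute value, kept for reference
t = np.argsort(np.abs(estimator.coef_))[-3:]

# prefit=True reuses the already-fitted estimator instead of refitting it
model = SelectFromModel(estimator, max_features=3, prefit=True)
X_new = model.transform(X)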

print(f'Shape of feature-matrix before feature selection : {X.shape}')
print(f'Shape of feature-matrix after feature selection : {X_new.shape}')

Shape of feature-matrix before feature selection : (2000, 8)
Shape of feature-matrix after feature selection : (2000, 3)


##### 2.C. SequentialFeatureSelector

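It performs feature selection by adding (`direction='forward'`, the default) or removing (`direction='backward'`) features one at a time in a greedy manner, scoring each candidate with cross-validation.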

In [20]:
from sklearn.feature_selection import SequentialFeatureSelector

In [22]:
%%time

estimator = LinearRegression()
sfs = SequentialFeatureSelector(estimator, n_features_to_select=3)

sfs.fit_transform(X, y)
print(sfs.get_support())

[ True False False False False  True  True False]
Wall time: 135 ms


The features corresponding to `True` in the output of `sfs.get_support()` are selected. In this case, the 1st, 6th and 7th features (indices 0, 5 and 6) are selected.

In [23]:
%%time
estimator = LinearRegression()
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=3, direction='backward')

sfs.fit_transform(X, y)
print(sfs.get_support())

[ True False False False False  True  True False]
Wall time: 194 ms


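A couple of observations: 
* Both `forward` and `backward` selection methods select the same features here.

* The `backward` selection method takes longer than the `forward` selection method in this run.

From the above examples, we can observe that the running time of `SFS` depends on the direction and on the number of features to select: forward selection adds features one at a time until `n_features_to_select` is reached, whereas backward selection starts from all features and removes them one at a time. Here, selecting 3 of 8 features takes 3 forward steps but 5 backward steps.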
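A rough way to reason about the timing difference is to count how many model fits each direction needs. The helper below is a back-of-the-envelope sketch, not part of scikit-learn: `count_model_fits` is a hypothetical function. It assumes that a fixed integer `n_features_to_select` is given, that every remaining candidate is evaluated at each step, and that each evaluation uses 5-fold cross-validation (the default `cv`).

```python
def count_model_fits(n_features, n_to_select, direction='forward', cv=5):
    """Estimate the number of estimator fits in a greedy sequential search."""
    if direction == 'forward':
        # One feature is added per iteration.
        n_iterations = n_to_select
    else:
        # One feature is removed per iteration.
        n_iterations = n_features - n_to_select
    # At iteration i there are (n_features - i) candidates left to try,
    # and each candidate is scored with cv-fold cross-validation.
    n_candidates = sum(n_features - i for i in range(n_iterations))
    return n_candidates * cv


forward_fits = count_model_fits(8, 3, direction='forward')
backward_fits = count_model_fits(8, 3, direction='backward')
print(forward_fits, backward_fits)
```

Under these assumptions, forward selection of 3 out of 8 features evaluates 8 + 7 + 6 = 21 candidates, while backward selection evaluates 8 + 7 + 6 + 5 + 4 = 30, which is in line with the longer wall time measured above. When more than half of the features are to be kept, the balance tips the other way.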
