In [1]:
import numpy as np

### Feature Selection

+ In a real-world dataset, not every feature contributes meaningfully to fitting a model.
+ Features that do not contribute significantly can be removed. This reduces the size of the dataset and hence the computational cost of fitting a model.
+ `sklearn.feature_selection` provides several APIs to accomplish this task.

### Filter Based Methods

#### VarianceThreshold

Removes from the input feature matrix all features whose variance is below a
user-specified threshold.

In [2]:
data = [{'age': 4, 'height':96.0},
{'age': 1, 'height':73.9},
{'age': 3, 'height':88.9},
{'age': 2, 'height':81.6}]

In [3]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
data_transformed = dv.fit_transform(data) #creates a 2-d matrix
print(data_transformed)
np.var(data_transformed, axis=0) # gives the variance of each column

[[ 4.  96. ]
 [ 1.  73.9]
 [ 3.  88.9]
 [ 2.  81.6]]


array([ 1.25 , 67.735])

In [4]:
# selects only the second column, since variance of the first column is below given threshold
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=9) # set the threshold limit, by default only a feature with 0 variance will be removed
vt.fit_transform(data_transformed)

array([[96. ],
       [73.9],
       [88.9],
       [81.6]])

Only the second column is selected because it is the only one whose variance exceeds the specified threshold.
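To see why each column was kept or dropped, the fitted selector can be inspected directly. The sketch below rebuilds the same matrix by hand (the names `data_matrix` and `vt_check` are introduced here only for illustration) and reads the `variances_` attribute and the `get_support()` mask.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Same values as the transformed matrix above: columns are age and height
data_matrix = np.array([[4, 96.0],
                        [1, 73.9],
                        [3, 88.9],
                        [2, 81.6]])

vt_check = VarianceThreshold(threshold=9).fit(data_matrix)

# Per-feature variances learned during fit
print(vt_check.variances_)
# Boolean mask over the input columns: True means the feature is kept
print(vt_check.get_support())
```

The variances should agree with the `np.var` result shown earlier, and the mask marks only the height column as kept.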

#### SelectKBest

Selects the k highest-scoring features according to a given scoring function (e.g. mutual information, chi-squared, or F-statistics).

An example using the California housing dataset is below:

In [8]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, SelectPercentile, GenericUnivariateSelect, mutual_info_regression

X_california, y_california = fetch_california_housing(return_X_y=True)

# selecting a subset of the data because the california housing dataset is quite large
X, y = X_california[:2000, :], y_california[:2000]

print(f"The shape of the feature matrix before feature selection: {X.shape}")

The shape of the feature matrix before feature selection: (2000, 8)


Let's select the 3 most important features using the `mutual_info_regression` scoring function.

Mutual information (MI) measures the dependency between two variables. It returns a non-negative value:
+ MI = 0 for independent variables
+ Higher MI indicates higher dependency
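The two properties above can be checked on a small synthetic example. This is a sketch under assumed data: one column drives the target non-linearly and the other is unrelated noise. All variable names are illustrative, and `random_state` is set only to make the estimate repeatable.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_samples = 500

x_dependent = rng.uniform(-3, 3, size=n_samples)    # drives the target
x_independent = rng.uniform(-3, 3, size=n_samples)  # unrelated to the target
X_demo = np.column_stack([x_dependent, x_independent])

# Non-linear dependence on the first column only
y_demo = x_dependent ** 2 + rng.normal(scale=0.1, size=n_samples)

# random_state fixes the small random noise the estimator adds internally
mi = mutual_info_regression(X_demo, y_demo, random_state=0)
print(mi)
```

The first score should be clearly larger than the second, which should be close to zero. Note that MI picks up the quadratic relationship even though the linear correlation between `x_dependent` and the target is near zero.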

In [9]:
# Select 3 features using mutual_info_regression method
skb = SelectKBest(mutual_info_regression, k=3)
X_new = skb.fit_transform(X, y)

print(f"The shape of the feature matrix after feature selection: {X_new.shape}")

The shape of the feature matrix after feature selection: (2000, 3)


The transformed data now contains only the top 3 features.

In [None]:
skb.get_feature_names_out()

array(['x0', 'x6', 'x7'], dtype=object)

#### SelectPercentile

This is very similar to `SelectKBest`; the only difference is that it selects the highest-scoring features by percentage (the `percentile` parameter) rather than by count.

It also uses a scoring function to decide the importance of features.

In [10]:
# Select the top 30 percent of the features using the mutual_info_regression method
sp = SelectPercentile(mutual_info_regression, percentile=30)
X_new = sp.fit_transform(X, y)
X_new.shape

(2000, 3)

Only 3 out of the 8 features were selected:


In [11]:
sp.get_feature_names_out()

array(['x0', 'x6', 'x7'], dtype=object)

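#### GenericUnivariateSelect

`GenericUnivariateSelect` applies univariate feature selection with a configurable strategy: `mode` chooses the strategy (for example `'k_best'` or `'percentile'`) and `param` supplies that strategy's parameter.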

In [13]:
# Select 3 features using mutual_info_regression method
gus = GenericUnivariateSelect(mutual_info_regression, mode='k_best', param=3)
X_new = gus.fit_transform(X, y)
X_new.shape

(2000, 3)

In [None]:
gus.get_feature_names_out()

array(['x0', 'x2', 'x6', 'x7'], dtype=object)
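The relationship between `GenericUnivariateSelect` and the dedicated selectors can be sketched on synthetic data. The example below assumes a deterministic scoring function (`f_regression`) so that the comparison is repeatable. All variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import (GenericUnivariateSelect, SelectKBest,
                                       SelectPercentile, f_regression)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# mode='k_best' with param=3 should mirror SelectKBest(k=3)
kbest_mask = SelectKBest(f_regression, k=3).fit(X_demo, y_demo).get_support()
gus_kbest_mask = GenericUnivariateSelect(
    f_regression, mode='k_best', param=3).fit(X_demo, y_demo).get_support()

# mode='percentile' with param=30 should mirror SelectPercentile(percentile=30)
pct_mask = SelectPercentile(
    f_regression, percentile=30).fit(X_demo, y_demo).get_support()
gus_pct_mask = GenericUnivariateSelect(
    f_regression, mode='percentile', param=30).fit(X_demo, y_demo).get_support()

print(np.array_equal(kbest_mask, gus_kbest_mask))
print(np.array_equal(pct_mask, gus_pct_mask))
```

Because the strategy is just a parameter, `mode` and `param` can be tuned in a hyperparameter search like any other setting.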

### Wrapper Based Methods
Unlike filter based methods, wrapper based methods use an estimator rather than a scoring
function for feature selection.

#### RFE (Recursive Feature Elimination)

RFE starts with all features in the training dataset and removes features recursively until
the desired number of features is reached. Each iteration has two steps:

+ Step 1: Fit a model on the current set of features.
+ Step 2: Rank the features by importance and remove one or more of the least important ones (depending on the `step` parameter).
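The two steps can be sketched as a manual loop and compared against `RFE` itself. This is an illustration on synthetic data, not the library's implementation: ranking by squared coefficient is an assumption that suits a linear model, and the names `remaining`, `model_demo` and `rfe_demo` are introduced only for this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=3,
                                 noise=1.0, random_state=0)

n_to_keep = 3
remaining = list(range(X_demo.shape[1]))  # indices of features still in play

while len(remaining) > n_to_keep:
    # Step 1: fit a model on the remaining features
    model_demo = LinearRegression().fit(X_demo[:, remaining], y_demo)
    # Step 2: rank by squared coefficient and drop the weakest feature (step=1)
    weakest = int(np.argmin(model_demo.coef_ ** 2))
    del remaining[weakest]

rfe_demo = RFE(LinearRegression(), n_features_to_select=n_to_keep,
               step=1).fit(X_demo, y_demo)
print(remaining)
print(np.flatnonzero(rfe_demo.support_).tolist())
```

Both lines should list the same feature indices. A larger `step` would drop several features per iteration, trading some precision for fewer model fits.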

In [16]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()

# RFE takes the estimator and the number of features to select as input
selector = RFE(estimator, n_features_to_select=3, step=1) # select 3 features
selector = selector.fit(X, y)

# .support_ is a boolean array; True marks a selected feature
print(selector.support_)

# .ranking_ gives the rank of each feature:
# rank 1 means the feature is selected,
# rank 2 and above means it was eliminated (a higher rank means it was eliminated earlier)
print(f"Rank of each feature are {selector.ranking_}")

[ True False False False False False  True  True]
Rank of each feature are [1 5 4 3 6 2 1 1]


From the output above, we can see that the 1st, 7th and 8th features are selected.

In [15]:
X_new = selector.transform(X) # returns only the previously "selected" columns
X_new.shape

(2000, 3)

#### RFECV

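RFECV adds a layer of cross-validation to RFE: instead of fixing the number of features in advance, it uses cross-validated scores to choose how many features to keep.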
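A minimal sketch of `RFECV` on synthetic data follows. The dataset and the parameter values (`cv=5`, `min_features_to_select=1`) are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)

# No n_features_to_select here: cross-validation decides how many to keep
rfecv = RFECV(LinearRegression(), step=1, cv=5, min_features_to_select=1)
rfecv.fit(X_demo, y_demo)

print(rfecv.n_features_)  # number of features chosen by cross-validation
print(rfecv.support_)     # boolean mask of the chosen features
```

As with `RFE`, `transform` then returns only the selected columns.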

#### SelectFromModel

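Selects features whose importance, as reported by a fitted estimator (for example through `coef_` or `feature_importances_`), meets a threshold. The `max_features` parameter caps the number of features selected.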

In [17]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()
estimator.fit(X, y)

print(f'Coefficients of features: {estimator.coef_}')
# SelectFromModel ranks linear-model features by the absolute value of their coefficients
print(f'Indices of top 3 features: {np.argsort(np.abs(estimator.coef_))[-3:]}')

model = SelectFromModel(estimator, max_features=3, prefit=True)
X_new = model.transform(X)
X_new.shape

Coefficients of features: [ 3.64048292e-01  5.56221906e-03  5.13591243e-02 -1.64474348e-01
  5.90411479e-05 -1.64573915e-01 -2.17724525e-01 -1.85343265e-01]
Indices of top 3 features: [7 6 0]


(2000, 3)
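To check which columns `SelectFromModel` actually keeps, its `get_support()` mask can be compared with the coefficients ranked by absolute value. The sketch below uses synthetic data in which the strongest effect is negative. The names are illustrative, and `threshold=-np.inf` is set so that only `max_features` limits the selection.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
# The largest effect is negative, so the sign matters for the comparison below
y_demo = (-4 * X_demo[:, 1] + 2 * X_demo[:, 3] + 0.5 * X_demo[:, 0]
          + rng.normal(scale=0.1, size=300))

# threshold=-np.inf disables the importance threshold, so only max_features applies
sfm_demo = SelectFromModel(LinearRegression(), max_features=2, threshold=-np.inf)
sfm_demo.fit(X_demo, y_demo)

selected = np.flatnonzero(sfm_demo.get_support())
top_by_abs_coef = np.sort(np.argsort(np.abs(sfm_demo.estimator_.coef_))[-2:])
print(selected, top_by_abs_coef)
```

Sorting the raw (signed) coefficients instead would rank the feature with coefficient -4 last, which is why the absolute value is needed when reproducing the selection by hand.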


#### SequentialFeatureSelector

It performs feature selection by adding or removing features one at a time in a greedy manner.
It uses one of two approaches:
+ Forward Selection: start with no features and add the most useful one at each step
+ Backward Selection: start with all features and remove the least useful one at each step

The `direction` parameter controls whether forward or backward selection is used. In general, the two do not yield the same results.
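The greedy forward procedure can be sketched by hand and compared with `SequentialFeatureSelector`. This is an illustration on synthetic data under the assumption that candidates are scored by mean 5-fold cross-validated score. The names `selected`, `candidates` and `sfs_demo` are introduced only for this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=3,
                                 noise=1.0, random_state=0)

selected = []                              # forward selection starts empty
candidates = list(range(X_demo.shape[1]))  # features not yet selected

while len(selected) < 3:
    # Score each candidate by the mean cross-validated score of a model
    # trained on the already-selected features plus that candidate
    scores = [cross_val_score(LinearRegression(),
                              X_demo[:, sorted(selected + [c])],
                              y_demo, cv=5).mean() for c in candidates]
    best = candidates[int(np.argmax(scores))]
    selected.append(best)
    candidates.remove(best)

sfs_demo = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction='forward', cv=5).fit(X_demo, y_demo)
print(sorted(selected))
print(np.flatnonzero(sfs_demo.get_support()).tolist())
```

Backward selection follows the same pattern in reverse: start from all features and, at each step, remove the one whose removal leaves the best cross-validated score.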

In [22]:
%%time

# Forward sequential selection (the default direction)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()

sfs = SequentialFeatureSelector(estimator, n_features_to_select=3)
X_new = sfs.fit_transform(X, y)
X_new.shape

CPU times: user 296 ms, sys: 1.07 ms, total: 297 ms
Wall time: 296 ms


(2000, 3)

The 3 selected features can be observed by using the `get_feature_names_out()` method


In [24]:
sfs.get_feature_names_out()

array(['x0', 'x5', 'x6'], dtype=object)

In [25]:
%%time

# Backward sequential selection: its cost depends on how many features must be removed to reach n_features_to_select
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()

sfs = SequentialFeatureSelector(estimator, n_features_to_select=3, direction='backward')
X_new = sfs.fit_transform(X, y)
X_new.shape

CPU times: user 570 ms, sys: 165 ms, total: 735 ms
Wall time: 506 ms


(2000, 3)

In [None]:
sfs.get_feature_names_out()

array(['x0', 'x5', 'x6'], dtype=object)

Note that in this case forward selection is faster: starting from zero features, it needs only 3 iterations to reach 3 features, whereas backward selection starts from all 8 features and needs 5 iterations to remove the surplus.
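The iteration counts can be turned into a rough cost estimate. The helper below is hypothetical (`candidate_evaluations` is not part of scikit-learn). It counts how many candidate feature subsets a greedy sequential search scores, assuming every remaining candidate is evaluated once per iteration.

```python
def candidate_evaluations(n_features, n_to_select, direction):
    """Count iterations and candidate subsets scored by a greedy sequential search."""
    if direction == 'forward':
        n_iterations = n_to_select               # add features until the target is reached
    else:
        n_iterations = n_features - n_to_select  # remove features until the target is reached
    # Iteration i still has (n_features - i) candidates to add or remove
    n_evaluations = sum(n_features - i for i in range(n_iterations))
    return n_iterations, n_evaluations

print(candidate_evaluations(8, 3, 'forward'))
print(candidate_evaluations(8, 3, 'backward'))
```

For 8 features and 3 to select, this gives 3 iterations and 21 candidate evaluations for forward selection, versus 5 iterations and 30 evaluations for backward selection. Each evaluation is itself cross-validated, so the number of model fits is larger by a factor of the number of folds.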