# **Feature Selection With scikit-learn**

We are going to look at a few methods for feature selection in scikit-learn. Keep in mind that feature selection is not always necessary: whether to apply it depends on the specific task and problem we are working on. Some algorithms struggle with a high number of features, while others handle them well, so the **Curse of Dimensionality** affects each algorithm differently.

We will only cover a few of the available methods. A more in-depth discussion of all the feature selection methods available in scikit-learn can be found here: https://scikit-learn.org/stable/modules/feature_selection.html



## **Variance Threshold Feature Selection**

A feature with low variance has many similar values. Features that have mostly the same values are usually not very useful for discriminating between the different classes. E.g., if almost everybody in this class has the same age, age would not be a good predictor of academic success.

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
import pandas as pd
# We are using another library that also has toy datasets https://seaborn.pydata.org/
import seaborn as sns

# We are only using the numerical attributes in the example
mpg = sns.load_dataset('mpg').select_dtypes('number')
mpg.head()

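|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year |
|---|-----|-----------|--------------|------------|--------|--------------|------------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 |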


Now we standardize the values before applying the feature selection:

In [4]:
scaler = StandardScaler()
mpg = pd.DataFrame(scaler.fit_transform(mpg), columns = mpg.columns)
mpg.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year |
|---|-----|-----------|--------------|------------|--------|--------------|------------|
| 0 | -0.706439 | 1.498191 | 1.090604 | 0.664133 | 0.63087 | -1.295498 | -1.627426 |
| 1 | -1.090751 | 1.498191 | 1.503514 | 1.574594 | 0.854333 | -1.477038 | -1.627426 |
| 2 | -0.706439 | 1.498191 | 1.196232 | 1.184397 | 0.55047 | -1.658577 | -1.627426 |
| 3 | -0.962647 | 1.498191 | 1.061796 | 1.184397 | 0.546923 | -1.295498 | -1.627426 |
| 4 | -0.834543 | 1.498191 | 1.042591 | 0.924265 | 0.565841 | -1.840117 | -1.627426 |


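Keep in mind that standardization gives every column a variance of approximately 1, so which columns end up slightly above that value comes down to floating-point rounding. Variance thresholds are usually more informative on unscaled data or on data scaled to a common range. With that caveat, if we set the threshold to 1, only the weight column is kept: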

In [11]:
# Keep only the features whose variance is strictly greater than 1
selector = VarianceThreshold(threshold=1)
selector.fit(mpg)
mpg.columns[selector.get_support()]

Index(['weight'], dtype='object')

Because it never looks at the target, this method is especially useful for **unsupervised learning**, where we don't have a class label.
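
As a minimal sketch of the idea, the snippet below applies the threshold to a small made-up array and then checks what standardization does to the variances. The column names (`mostly_zero`, `rating`, `price`), the values, and the threshold of 0.2 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the column names and values are made up for illustration.
feature_names = np.array(["mostly_zero", "rating", "price"])
X_toy = np.array([
    [0, 2.0, 10.0],
    [0, 1.0, 20.0],
    [0, 3.0, 30.0],
    [1, 2.0, 40.0],
    [0, 1.0, 50.0],
    [0, 3.0, 60.0],
])

# Population variance of each raw column.
raw_variances = X_toy.var(axis=0)

# Keep only the columns whose variance is strictly greater than the threshold.
vt = VarianceThreshold(threshold=0.2)
vt.fit(X_toy)
kept = feature_names[vt.get_support()]

# After standardization every column has variance 1, so a variance
# threshold can no longer tell the columns apart.
scaled_variances = StandardScaler().fit_transform(X_toy).var(axis=0)

print(dict(zip(feature_names, raw_variances.round(3))))
print("kept:", list(kept))
print("variances after scaling:", scaled_variances.round(3))
```

On the raw data the nearly constant `mostly_zero` column falls below the threshold and is dropped, while the other two columns are kept.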


---



## **Univariate Feature Selection with SelectKBest**

Univariate feature selection scores each feature individually against the target with a univariate statistical test and keeps the best-scoring ones. We can use different tests to do so: chi2, Pearson correlation, mutual information, etc.

Let's start by loading the dataset:

In [12]:
mpg = sns.load_dataset('mpg')
# Divide the features into Independent and Dependent Variables.
# If you don't remember what this means: https://www.pluralsight.com/guides/importing-and-splitting-data-into-dependent-and-independent-features-for-ml
mpg = mpg.select_dtypes('number').dropna()
X = mpg.drop('mpg', axis=1)  # independent variables (features)
y = mpg['mpg']               # dependent variable (target)

And now we apply the feature selection. We are going to use mutual_info_regression and select the top 2 variables:

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_regression
selector = SelectKBest(mutual_info_regression, k=2)
selector.fit(X, y)
X.columns[selector.get_support()]

Index(['displacement', 'weight'], dtype='object')
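
The score function can be swapped without changing the rest of the code. The following sketch compares two of them; it assumes the diabetes dataset bundled with scikit-learn as a stand-in for the mpg data, and it fixes `random_state` through `functools.partial` because mutual information estimation is randomized.

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

# Assumption: the bundled diabetes dataset stands in for the mpg data used above.
data = load_diabetes()
X_d, y_d = data.data, data.target
names = np.array(data.feature_names)

# mutual_info_regression is randomized; fixing random_state makes the ranking repeatable.
score_funcs = {
    "f_regression": f_regression,
    "mutual_info_regression": partial(mutual_info_regression, random_state=0),
}

top_features = {}
for label, score_func in score_funcs.items():
    # Score every feature against the target and keep the two best.
    kbest = SelectKBest(score_func, k=2).fit(X_d, y_d)
    top_features[label] = list(names[kbest.get_support()])
    print(label, "->", top_features[label])
```

Different tests can rank features differently, so it is worth comparing the selections before settling on one.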



---



## **Recursive Feature Elimination (RFE)**

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. In other words, we use a machine learning model to select the features by repeatedly training it and eliminating the least important feature each time.

First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a specific attribute (such as coef_ or feature_importances_) or through a callable. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is reached.
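
The procedure described above can be written by hand as a short loop. This is only an illustrative sketch, not scikit-learn's implementation: the helper `manual_rfe` is hypothetical, the breast cancer dataset (restricted to its first six features) is an assumption made to keep the example self-contained, and the squared coefficient is used as the importance.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: a bundled dataset, restricted to its first six features to keep it small.
data = load_breast_cancer()
X_bc = StandardScaler().fit_transform(data.data[:, :6])
y_bc = data.target
names = np.array(data.feature_names[:6])


def manual_rfe(estimator, X, y, n_features_to_select):
    """Drop the weakest feature one at a time (illustrative, not sklearn's code)."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Train a fresh copy of the model on the features that are still in play.
        model = clone(estimator).fit(X[:, remaining], y)
        # Importance of each remaining feature: its squared coefficient.
        importances = np.ravel(model.coef_) ** 2
        # Remove the least important feature, then repeat on the pruned set.
        remaining.pop(int(np.argmin(importances)))
    return remaining


selected = manual_rfe(LogisticRegression(max_iter=1000), X_bc, y_bc, 2)
print(list(names[selected]))
```

The `RFE` class performs this kind of loop for us and additionally exposes the elimination order through its `ranking_` attribute.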

As always, first we load the dataset:

In [None]:
titanic = sns.load_dataset('titanic')[['survived', 'pclass', 'age', 'parch', 'sibsp', 'fare']].dropna()
X = titanic.drop('survived', axis = 1)
y = titanic['survived']

Now we apply the feature selection. In this example we are using our old friend, the logistic regression:

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=2, step=1)
rfe_selector.fit(X, y)
X.columns[rfe_selector.get_support()]

Index(['pclass', 'parch'], dtype='object')



---



## **Feature Selection via SelectFromModel**

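This method is similar to the previous one, but the model is trained only once and a feature is kept if its importance reaches a threshold. The importance is usually taken from coef_ or feature_importances_, but it can be obtained from any callable. By default, the threshold is the mean importance (for L1-penalized estimators such as Lasso, the default is 1e-5 instead).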

Using the previous example:

In [None]:
from sklearn.feature_selection import SelectFromModel

sfm_selector = SelectFromModel(estimator=LogisticRegression())
sfm_selector.fit(X, y)
X.columns[sfm_selector.get_support()]

Index(['pclass'], dtype='object')
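
To see how the threshold decides the result, the following sketch recomputes it by hand and then switches to a different one. It assumes the breast cancer dataset bundled with scikit-learn (first six features only) instead of the titanic data, so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: a bundled dataset, restricted to its first six features.
data = load_breast_cancer()
X_bc = StandardScaler().fit_transform(data.data[:, :6])
y_bc = data.target
names = np.array(data.feature_names[:6])

sfm = SelectFromModel(estimator=LogisticRegression(max_iter=1000))
sfm.fit(X_bc, y_bc)

# For a binary logistic regression the importance is the absolute coefficient.
importances = np.abs(sfm.estimator_.coef_).ravel()

# With the default setting the cut-off is the mean importance.
print("threshold:", sfm.threshold_)
print("mean importance:", importances.mean())
print("selected:", list(names[sfm.get_support()]))

# A string such as "median" or "1.25*mean", or a plain float, changes the cut-off.
sfm_median = SelectFromModel(LogisticRegression(max_iter=1000), threshold="median")
sfm_median.fit(X_bc, y_bc)
print("selected with median:", list(names[sfm_median.get_support()]))
```

Unlike RFE, the number of selected features is not fixed in advance here: it depends on how many importances reach the threshold.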



---



## **Sequential Feature Selection (SFS)**

Sequential Feature Selection (SFS) is available in the SequentialFeatureSelector transformer. SFS can be either forward or backward:

Forward-SFS is a greedy procedure that iteratively finds the best new feature to add to the set of selected features. Concretely, we initially start with zero features and find the one feature that maximizes a cross-validated score when an estimator is trained on this single feature. Once that first feature is selected, we repeat the procedure by adding a new feature to the set of selected features. The procedure stops when the desired number of selected features is reached, as determined by the n_features_to_select parameter.

Backward-SFS follows the same idea but works in the opposite direction: instead of starting with no features and greedily adding features, we start with all the features and greedily remove features from the set. The direction parameter controls whether forward or backward SFS is used.

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

sfs_selector = SequentialFeatureSelector(estimator=LogisticRegression(), n_features_to_select=3, cv=10, direction='backward')
sfs_selector.fit(X, y)
X.columns[sfs_selector.get_support()]

Index(['pclass', 'age', 'parch'], dtype='object')
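
The forward variant described above can be sketched as a greedy loop over cross-validated scores. This is an illustration under assumptions, not scikit-learn's implementation: the helper `forward_select` is hypothetical, the breast cancer dataset (first six features) replaces the titanic data so the snippet runs on its own, and the data is scaled up front for brevity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Assumption: a bundled dataset, restricted to its first six features.
data = load_breast_cancer()
X_bc = StandardScaler().fit_transform(data.data[:, :6])
y_bc = data.target
names = np.array(data.feature_names[:6])


def forward_select(estimator, X, y, n_features_to_select, cv=5):
    """Greedy forward selection (illustrative sketch, not sklearn's implementation)."""
    selected = []
    candidates = list(range(X.shape[1]))
    while len(selected) < n_features_to_select:
        # Score every candidate when added to the features chosen so far.
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean()
            for j in candidates
        }
        # Keep the candidate with the best mean cross-validated score.
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)
    return selected


chosen = forward_select(LogisticRegression(max_iter=1000), X_bc, y_bc, 3)
print(list(names[chosen]))
```

Because every step retrains the model once per candidate and per fold, SFS is usually slower than RFE or SelectFromModel, but it does not require the estimator to expose coef_ or feature_importances_.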



---



# **Exercises**


1. Apply the Variance Threshold Feature Selection to any dataset included in scikit-learn.
2. Apply SelectKBest to any dataset included in scikit-learn. Use at least 3 different statistical tests.
3. Apply Recursive Feature Elimination (RFE) to any dataset included in scikit-learn. Use 2 algorithms other than logistic regression.
4. Apply SelectFromModel to any dataset included in scikit-learn. Use a threshold other than the default mean.

While doing the exercises, compare the results obtained with each method.


