# Feature Selection

In [7]:
from sklearn.datasets import load_iris

In [2]:
iris = load_iris()
X, y = iris.data, iris.target
X.shape

(150, 4)

## Removing features with low variance (VarianceThreshold)
_**VarianceThreshold**_ is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

In [4]:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=.5)

X_vt = sel.fit_transform(X)
X_vt[:5] # Sepal Width got removed

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2],
       [4.6, 1.5, 0.2],
       [5. , 1.4, 0.2]])
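
To see why sepal width was the feature removed, it can help to look at the per-feature variances the selector computed. The following is a minimal sketch that assumes the same iris data and the same 0.5 threshold as the cell above; it relies on the `variances_` attribute and the `get_support()` method of the fitted selector.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()
X = iris.data

# Fit only; we want to inspect the selector rather than transform the data.
sel = VarianceThreshold(threshold=0.5).fit(X)

# variances_ holds the variance of each feature computed during fit,
# get_support() is a boolean mask of the features that are kept.
for name, var, kept in zip(iris.feature_names, sel.variances_, sel.get_support()):
    print(f"{name:20s} variance={var:.3f} kept={kept}")
```

Only features whose variance exceeds the threshold are marked as kept.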

## Univariate feature selection
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:

* _**SelectKBest**_ removes all but the k highest scoring features
* _**SelectPercentile**_ removes all but a user-specified highest scoring percentage of features
* _**SelectFpr**_, _**SelectFdr**_ and _**SelectFwe**_ apply common univariate statistical tests to each feature and select by false positive rate, false discovery rate, or family-wise error, respectively.
* _**GenericUnivariateSelect**_ performs univariate feature selection with a configurable strategy. This makes it possible to choose the best univariate selection strategy with a hyper-parameter search estimator.
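
As a rough illustration of the configurable strategy mentioned in the last bullet, the sketch below uses GenericUnivariateSelect in `k_best` mode and compares it with SelectKBest. The choice of `chi2` and `param=2` is an assumption made for this example, not something prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# mode picks the strategy ("percentile", "k_best", "fpr", "fdr" or "fwe");
# param is that strategy's parameter (here it plays the role of k).
generic = GenericUnivariateSelect(chi2, mode="k_best", param=2)
X_generic = generic.fit_transform(X, y)

# The same selection written with the dedicated class, for comparison.
X_kbest = SelectKBest(chi2, k=2).fit_transform(X, y)

print(X_generic.shape)
print((X_generic == X_kbest).all())
```

Because `mode` and `param` are ordinary constructor parameters, they can be included in a hyper-parameter search grid like any other setting.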

In [8]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X_sb = SelectKBest(chi2, k=2).fit_transform(X, y)
X_sb.shape
X_sb[:5]

array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])

In [9]:
from sklearn.feature_selection import SelectPercentile
X_sp = SelectPercentile(chi2, percentile=50).fit_transform(X, y)
print(X_sp.shape)
X_sp[:5]

(150, 2)


array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])

In [10]:
X_sp = SelectPercentile(chi2, percentile=75).fit_transform(X, y)
print(X_sp.shape)
X_sp[:5]

(150, 3)


array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2],
       [4.6, 1.5, 0.2],
       [5. , 1.4, 0.2]])
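
The cells above show only the columns that survive. To inspect the ranking behind that choice, one option is to fit the selector and read its `scores_`, `pvalues_` and `get_support()` results, as in this sketch (same iris data and `chi2` test as above).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(chi2, k=2).fit(X, y)

# scores_ and pvalues_ come from the chi2 test of each feature against y;
# get_support() marks the k features with the highest scores.
rows = zip(iris.feature_names, selector.scores_, selector.pvalues_, selector.get_support())
for name, score, pvalue, kept in rows:
    print(f"{name:20s} score={score:8.2f} p={pvalue:.2e} kept={kept}")
```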

## Feature selection using SelectFromModel
_**SelectFromModel**_ is a meta-transformer that can be used with any estimator that exposes a coef_ or feature_importances_ attribute after fitting. Features whose corresponding coef_ or feature_importances_ values fall below the provided threshold parameter are considered unimportant and removed. Besides a numeric threshold, a string argument selects a built-in heuristic: “mean”, “median”, or a float multiple of these such as “0.1*mean”.
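
A minimal sketch of the string heuristics, assuming a RandomForestClassifier as the underlying estimator (the choice of model, `n_estimators=100` and `random_state=0` are assumptions for this example): with `threshold="mean"`, features whose importance is below the mean importance are dropped.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Without prefit=True, SelectFromModel fits a clone of the estimator itself.
selector = SelectFromModel(forest, threshold="mean").fit(X, y)

print(selector.estimator_.feature_importances_)  # importances of the fitted clone
print(selector.threshold_)                       # numeric value that "mean" resolved to
print(selector.get_support())                    # mask of the features that are kept

X_reduced = selector.transform(X)
print(X_reduced.shape)
```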

Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data to use with another classifier, they can be used along with feature_selection.SelectFromModel to select the non-zero coefficients. In particular, sparse estimators useful for this purpose are linear_model.Lasso for regression, and linear_model.LogisticRegression and svm.LinearSVC for classification.

With SVMs and logistic regression, the parameter C controls the sparsity: the smaller C, the fewer features selected. With Lasso, the higher the alpha parameter, the fewer features selected.

In [11]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

clf = LinearSVC(C=0.01, penalty="l1", dual=False) # C is smaller than the default so that fewer features are selected
clf.fit(X, y)
model = SelectFromModel(clf, prefit=True)
X_sfm = model.transform(X)
print(X_sfm.shape)
X_sfm[:5]

(150, 3)


array([[5.1, 3.5, 1.4],
       [4.9, 3. , 1.4],
       [4.7, 3.2, 1.3],
       [4.6, 3.1, 1.5],
       [5. , 3.6, 1.4]])
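
To illustrate how C controls sparsity, the sketch below refits the same L1-penalized LinearSVC for several values of C and counts the features that keep a non-zero coefficient for at least one class. The grid of C values and `max_iter=10000` are assumptions chosen for this example. The count uses exact zeros, so it only approximates what SelectFromModel would keep with its default threshold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

counts = {}
for C in (0.001, 0.01, 0.1, 1.0):
    clf = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    # coef_ has shape (n_classes, n_features); a feature survives if any
    # class assigns it a non-zero coefficient.
    counts[C] = int(np.any(clf.coef_ != 0, axis=0).sum())

for C, n_kept in counts.items():
    print(f"C={C:<6} features with non-zero coefficients: {n_kept}")
```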

resource: https://scikit-learn.org/stable/modules/feature_selection.html