
<br>
============================<br>
Univariate Feature Selection<br>
============================<br>
An example showing univariate feature selection.<br>
Noisy (non-informative) features are added to the iris data and<br>
univariate feature selection is applied. For each feature, we plot the<br>
p-values for the univariate feature selection and the corresponding<br>
weights of an SVM. We can see that univariate feature selection<br>
selects the informative features and that these have larger SVM weights.<br>
Of all the features, only the first four are significant. We<br>
can see that they have the highest scores with univariate feature<br>
selection. The SVM assigns a large weight to one of these features, but also<br>
selects many of the non-informative features.<br>
Applying univariate feature selection before the SVM<br>
increases the SVM weight attributed to the significant features, and will<br>
thus improve classification.<br>


In [None]:
print(__doc__)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif

#############################################################################<br>
Import some data to play with

The iris dataset

In [None]:
X, y = load_iris(return_X_y=True)

Generate some noisy features that are uncorrelated with the target

In [None]:
E = np.random.RandomState(42).uniform(0, 0.1, size=(X.shape[0], 20))

Add the noisy data to the informative features

In [None]:
X = np.hstack((X, E))

Split the dataset to select features and evaluate the classifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0
)

In [None]:
plt.figure(1)
plt.clf()

In [None]:
X_indices = np.arange(X.shape[-1])

#############################################################################<br>
Univariate feature selection with F-test for feature scoring<br>
We use SelectKBest with the F-test (f_classif) to select the four<br>
most significant features
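
To make the F-test scoring more concrete, here is a minimal pure-Python sketch of the one-way ANOVA F statistic, which is the quantity `f_classif` computes for each feature. The helper `anova_f` and the toy values are hypothetical and only illustrate the idea; they are not taken from the iris data.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a single feature.

    `groups` holds one list of feature values per class (hypothetical helper).
    """
    n_total = sum(len(g) for g in groups)
    n_classes = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the samples around their own class mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (n_classes - 1)) / (ss_within / (n_total - n_classes))


# Toy values: class means clearly differ for the first feature,
# while the second feature is small uniform-like noise.
informative = [[1.0, 1.2, 0.8], [3.0, 3.1, 2.9], [5.0, 5.2, 4.8]]
noise = [[0.02, 0.09, 0.05], [0.07, 0.01, 0.06], [0.04, 0.08, 0.03]]

f_informative = anova_f(informative)
f_noise = anova_f(noise)
print(f_informative, f_noise)
```

A feature whose class means are far apart relative to the spread within each class gets a large F value and therefore a small p-value, which is why the informative features rank first.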

In [None]:
selector = SelectKBest(f_classif, k=4)
selector.fit(X_train, y_train)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-\log_{10}(p_{value})$)', color='darkorange',
        edgecolor='black')
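
The score transformation used above can be sketched on its own. The p-values below are hypothetical, chosen only to show how `-log10(p)` followed by division by the maximum maps small p-values to scores close to 1.

```python
import math

# Hypothetical p-values: two clearly significant features,
# one borderline, one non-informative.
p_values = [1e-30, 1e-12, 0.04, 0.5]

# Smaller p-value -> larger score
scores = [-math.log10(p) for p in p_values]

# Rescale so the most significant feature has score 1
max_score = max(scores)
scores = [s / max_score for s in scores]
print(scores)
```

The rescaling only matters for plotting: it puts the univariate scores on a range comparable to the normalized SVM weights.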

#############################################################################<br>
Compare to the weights of an SVM

In [None]:
clf = make_pipeline(MinMaxScaler(), LinearSVC())
clf.fit(X_train, y_train)
print('Classification accuracy without selecting features: {:.3f}'
      .format(clf.score(X_test, y_test)))

In [None]:
svm_weights = np.abs(clf[-1].coef_).sum(axis=0)
svm_weights /= svm_weights.sum()
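
As a rough illustration of the aggregation above: for a multiclass problem, `coef_` has one row per class and one column per feature, so summing absolute values over `axis=0` yields a single weight per feature. The matrix below is hypothetical, not the output of the fitted model.

```python
import numpy as np

# Hypothetical coefficients: 3 classes (rows) x 4 features (columns)
coef = np.array([[0.5, -1.0, 0.0, 0.1],
                 [-0.5, 2.0, 0.0, 0.1],
                 [1.0, -1.0, 0.0, 0.2]])

# Total absolute weight per feature, across all classes
weights = np.abs(coef).sum(axis=0)

# Normalize so the weights sum to 1 and can be compared across models
weights /= weights.sum()
print(weights)
```

Taking absolute values first keeps positive and negative coefficients from cancelling out across classes.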

In [None]:
plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight',
        color='navy', edgecolor='black')

In [None]:
clf_selected = make_pipeline(
        SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC()
)
clf_selected.fit(X_train, y_train)
print('Classification accuracy after univariate feature selection: {:.3f}'
      .format(clf_selected.score(X_test, y_test)))

In [None]:
svm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0)
svm_weights_selected /= svm_weights_selected.sum()

In [None]:
plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected,
        width=.2, label='SVM weights after selection', color='c',
        edgecolor='black')
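
The indexing in the bar plot above relies on `get_support()` returning a boolean mask over the original features. A small sketch with a hypothetical mask shows how the weights of the reduced model are mapped back to the original feature numbers.

```python
import numpy as np

X_indices = np.arange(6)

# Hypothetical mask, standing in for selector.get_support()
support = np.array([True, False, True, True, False, False])

# Original positions of the features that were kept
selected_indices = X_indices[support]

# Hypothetical normalized weights, one per selected feature
weights_selected = np.array([0.5, 0.3, 0.2])

# Each weight is drawn at the position of the feature it belongs to
positions = selected_indices - 0.05
print(list(zip(selected_indices.tolist(), weights_selected.tolist())))
```

Without the mask, the weights of the reduced model would be plotted at positions 0 to k-1 regardless of which features were actually selected.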

In [None]:
plt.title("Comparing feature selection")
plt.xlabel('Feature number')
plt.yticks(())
plt.axis('tight')
plt.legend(loc='upper right')
plt.show()