# Tutorial 3: feature selection

This tutorial shows you how to enforce feature diversity by using scikit-learn feature selection algorithms in a multichannel pipeline.  

Feature selection can improve model performance when the ratio of samples to features is low. This is true even for boosted trees and neural nets. In ML tasks with multiple feature vector inputs, the error in the different inputs is often uncorrelated, which helps when combining them into accurate ensemble predictions. To keep that diversity, we may wish to preserve the best features from each input by applying a selection algorithm to each channel separately, rather than concatenating all the vectors and hoping that each input is still well represented after selection.
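The difference between the two strategies can be seen in a small pure-Python sketch. The scores are made-up stand-ins for the per-feature statistics a univariate selector would compute, and the helper `top_fraction` is a hypothetical name introduced only for this illustration; it is not part of scikit-learn or of the multichannel library used in this tutorial.

```python
import math

def top_fraction(scores, fraction):
    """Return the indices of the highest-scoring `fraction` of `scores` (at least one)."""
    n_keep = max(1, math.floor(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Hypothetical relevance scores for three input channels of four features each.
# Channel 0 scores high across the board; channels 1 and 2 score lower.
channel_scores = [
    [9.0, 8.5, 8.0, 7.5],
    [3.0, 2.0, 1.0, 0.5],
    [4.0, 1.5, 1.0, 0.2],
]

# Strategy A: select the top 25% within each channel separately.
per_channel = [top_fraction(scores, 0.25) for scores in channel_scores]
per_channel_counts = [len(kept) for kept in per_channel]

# Strategy B: concatenate all channels, then select the top 25% overall.
pooled_scores = [s for scores in channel_scores for s in scores]
owner = [c for c, scores in enumerate(channel_scores) for _ in scores]
pooled = top_fraction(pooled_scores, 0.25)
pooled_counts = [sum(1 for i in pooled if owner[i] == c)
                 for c in range(len(channel_scores))]

print("kept per channel, per-channel selection:", per_channel_counts)
print("kept per channel, pooled selection:     ", pooled_counts)
```

Both strategies keep three features in total. Per-channel selection takes one from every channel, while pooled selection takes all three from channel 0, so the other two channels contribute nothing to the downstream model.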

## Select inputs from each channel

In this example, scikit-learn's **SelectPercentile** class is used to select the best 25% of features from each input channel. These features are passed to pipecaster's **MultichannelPredictor**, which was introduced in tutorial #1. MultichannelPredictor concatenates the selected features into a single vector and feeds it to the wrapped model, in this case GradientBoostingClassifier.

In [1]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.ensemble import GradientBoostingClassifier
import skmultichannel as sm

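# Synthetic data: 3 informative and 7 weak input matrices, no random ones (10 channels).
Xs, y, X_types = sm.make_multi_input_classification(
    n_informative_Xs=3, n_random_Xs=0, n_weak_Xs=7)

# Gradient boosting with early stopping: hold out 10% of the training data and
# stop once the validation score has not improved for 3 consecutive iterations.
early_stopping_GBC = GradientBoostingClassifier(n_estimators=1000,
                                                validation_fraction=0.1,
                                                n_iter_no_change=3)

clf = sm.MultichannelPipeline(n_channels=10)
# Layer 0: scale each channel independently.
clf.add_layer(StandardScaler())
# Layer 1: keep the top 25% of features in each channel (f_classif is the default score).
clf.add_layer(SelectPercentile(f_classif, percentile=25))
# Layer 2: concatenate the selected features and fit a single classifier.
clf.add_layer(sm.MultichannelPredictor(early_stopping_GBC))
clf.fit(Xs, y)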

| channel | layer_0 | out_0 | layer_1 | out_1 | layer_2 | out_2 |
|---|---|---|---|---|---|---|
| 0 | StandardScaler | → | SelectPercentile | → | GradientBoostingClassifier_MC | → |
| 1 | StandardScaler | → | SelectPercentile | → | ▽ | |
| 2 | StandardScaler | → | SelectPercentile | → | ▽ | |
| 3 | StandardScaler | → | SelectPercentile | → | ▽ | |
| 4 | StandardScaler | → | SelectPercentile | → | ▽ | |
| 5 | StandardScaler | → | SelectPercentile | → | ▽ | |
| 6 | StandardScaler | → | SelectPercentile | → | ▽ | |
| 7 | StandardScaler | → | SelectPercentile | → | ▽ | |
| 8 | StandardScaler | → | SelectPercentile | → | ▽ | |
| 9 | StandardScaler | → | SelectPercentile | → | ▽ | |
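To see roughly what the three layers amount to, the sketch below performs the same steps by hand with plain scikit-learn and NumPy: scale each channel, keep the top 25% of its features, concatenate the survivors, and fit one classifier. It is an approximation written for illustration, not the multichannel library's implementation. The synthetic data from `make_classification`, the split into three channels of 20 features, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.preprocessing import StandardScaler

# Assumed stand-in data: one 60-feature matrix split into 3 "channels" of 20 features.
X, y = make_classification(n_samples=300, n_features=60, n_informative=8,
                           random_state=0)
channels = np.hsplit(X, 3)

selected = []
for X_channel in channels:
    # Scale, then keep the top 25% of features within this channel only.
    X_scaled = StandardScaler().fit_transform(X_channel)
    selector = SelectPercentile(f_classif, percentile=25)
    selected.append(selector.fit_transform(X_scaled, y))

# Concatenate the per-channel survivors into a single feature matrix.
X_concat = np.hstack(selected)

model = GradientBoostingClassifier(n_estimators=1000, validation_fraction=0.1,
                                   n_iter_no_change=3, random_state=0)
model.fit(X_concat, y)

print([x.shape[1] for x in selected], X_concat.shape)
```

Every channel contributes the same number of features to the final matrix. This manual version fits the scalers and selectors on all of the data, so it should not be used to estimate accuracy. For that, the selection steps need to be refit inside each training fold, which is what cross-validating the whole pipeline does.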


In [2]:
np.mean(sm.cross_val_score(clf, Xs, y, cv=3))

0.8578431372549019