# Tutorial 4: channel selection

This tutorial shows you how to use classes from the **channel_selection** module to screen feature matrices during pipeline fitting.  Feature matrix screening can prevent garbage from flowing into and out of your ML pipeline and reduce computational costs.

In ML tasks with many candidate feature matrices, whether they originate from different data sources or from different feature extraction methods, it can be useful to select the informative matrices before the ML stage of the pipeline.  Although ML algorithms are good at ignoring input noise, their capacity to do so is finite.  Selecting high-quality input matrices can therefore sometimes improve performance by reducing overfitting, and it lowers the computational cost of model training and inference.

## Select channels using aggregate feature scores

The following example illustrates the use of skmultichannel's **SelectKBestScores** class to select the top 3 channels ranked by the average values of their individual feature scores.

See related classes: **SelectPercentBestScores**, **SelectHighPassScores**, **SelectVarianceHighPassScores**

In [2]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.ensemble import GradientBoostingClassifier
import skmultichannel as sm

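# Generate 10 synthetic feature matrices: 3 informative and 7 random.
Xs, y, X_types = sm.make_multi_input_classification(
                        n_informative_Xs=3, n_random_Xs=7, n_weak_Xs=0)

# Final classifier; early stopping halts training when the validation score stalls.
early_stopping_GBC = GradientBoostingClassifier(n_estimators=1000,
                                     validation_fraction=0.1,
                                     n_iter_no_change=3)

clf = sm.MultichannelPipeline(n_channels=10)
clf.add_layer(StandardScaler())
# Keep the top 25% of features in each channel.
clf.add_layer(SelectPercentile(percentile=25))
# Keep the 3 channels with the highest mean f_classif score.
clf.add_layer(sm.SelectKBestScores(feature_scorer=f_classif, aggregator=np.mean, k=3))
clf.add_layer(sm.MultichannelPredictor(early_stopping_GBC))
# Fit the pipeline and tabulate its layers alongside the true input types.
df = clf.fit(Xs, y).get_dataframe()
df['input_type'] = X_types
df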

| channel | layer_0 | out_0 | layer_1 | out_1 | layer_2 | out_2 | layer_3 | out_3 | input_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | StandardScaler | → | SelectPercentile | → | SelectKBestScores | | GradientBoostingClassifier_MC | → | random |
| 1 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 2 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 3 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 4 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 5 | StandardScaler | → | SelectPercentile | → | ▽ | → | ▽ | | informative |
| 6 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 7 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 8 | StandardScaler | → | SelectPercentile | → | ▽ | → | ▽ | | informative |
| 9 | StandardScaler | → | SelectPercentile | → | ▽ | → | ▽ | | informative |


**notes**

* Notice that the arrows in the out_2 column exactly match the matrices labeled as informative in the input_type column.  This means that SelectKBestScores correctly identified the high-quality channels and filtered out the low-quality channels.


* The scikit-learn SelectPercentile layer selects the best 25% of features in each channel.  This pre-selection, applied before channel selection, improves the signal-to-noise ratio by allowing the channel selector to focus on the best features.


* The scikit-learn algorithm SelectPercentile passes through the top-scoring features within a matrix, whereas skmultichannel's SelectKBestScores passes through the top-scoring matrices in their entirety.
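
To make the scoring step concrete, the sketch below reproduces the general idea of aggregate-score channel selection using only NumPy and scikit-learn.  It is not skmultichannel's implementation: the toy data, the `channel_score` helper, and the use of `None` to mark dropped channels are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n_samples, n_features = 200, 8
y = rng.integers(0, 2, size=n_samples)

def make_channel(informative):
    """Toy feature matrix (an assumption, not the tutorial's data generator)."""
    X = rng.normal(size=(n_samples, n_features))
    if informative:
        # Shift every feature of class 1 by one standard deviation.
        X += 1.0 * y[:, None]
    return X

channel_types = ['random', 'informative', 'random', 'informative', 'random']
Xs = [make_channel(t == 'informative') for t in channel_types]

def channel_score(X, y, feature_scorer=f_classif, aggregator=np.mean):
    """Score each feature, then collapse the scores into one number per channel."""
    feature_scores, _pvalues = feature_scorer(X, y)
    return aggregator(feature_scores)

scores = np.array([channel_score(X, y) for X in Xs])

# Keep the k highest-scoring channels; each kept matrix passes through whole.
k = 2
selected = np.sort(np.argsort(scores)[-k:])
selected_Xs = [X if i in selected else None for i, X in enumerate(Xs)]

for i, (t, s) in enumerate(zip(channel_types, scores)):
    print(i, t, round(float(s), 2), 'kept' if i in selected else 'dropped')
```

Swapping `np.mean` for another aggregator such as `np.max` changes the ranking criterion from the average feature quality of a channel to the quality of its single best feature.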

## Select channels using ML probes

The following example illustrates the use of skmultichannel's **SelectKBestProbes** class to estimate the predictive value of each input channel and to select the 3 best channels.  SelectKBestProbes estimates the predictive value of each matrix by measuring model accuracy in an internal cross-validation run with a probe ML algorithm.  The probe models are then discarded, and the selected inputs are passed through to the next pipeline stage.  ML probe scores provide a nonlinear estimate of the predictive value of each matrix.

See related classes: **SelectPercentBestProbes**, **SelectHighPassProbes**, **SelectVarianceHighPassProbes**

In [3]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.ensemble import GradientBoostingClassifier
import skmultichannel as sm

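# Generate 10 synthetic feature matrices: 3 informative and 7 random.
Xs, y, X_types = sm.make_multi_input_classification(
                        n_informative_Xs=3, n_random_Xs=7, n_weak_Xs=0)

# Final classifier; early stopping halts training when the validation score stalls.
early_stopping_GBC = GradientBoostingClassifier(n_estimators=1000,
                                     validation_fraction=0.1,
                                     n_iter_no_change=3)

clf = sm.MultichannelPipeline(n_channels=10)
clf.add_layer(StandardScaler())
# Keep the top 25% of features in each channel.
clf.add_layer(SelectPercentile(percentile=25))
# Score each channel with a small probe model and keep the 3 best channels.
clf.add_layer(sm.SelectKBestProbes(GradientBoostingClassifier(n_estimators=5), k=3))
clf.add_layer(sm.MultichannelPredictor(early_stopping_GBC))
# Fit the pipeline and tabulate its layers alongside the true input types.
df = clf.fit(Xs, y).get_dataframe()
df['input_type'] = X_types
df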

| channel | layer_0 | out_0 | layer_1 | out_1 | layer_2 | out_2 | layer_3 | out_3 | input_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | StandardScaler | → | SelectPercentile | → | SelectKBestProbes | → | GradientBoostingClassifier_MC | → | informative |
| 1 | StandardScaler | → | SelectPercentile | → | ▽ | → | ▽ | | informative |
| 2 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 3 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 4 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 5 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 6 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 7 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 8 | StandardScaler | → | SelectPercentile | → | ▽ | | ▽ | | random |
| 9 | StandardScaler | → | SelectPercentile | → | ▽ | → | ▽ | | informative |


**notes**

* Notice that the arrows in the out_2 column exactly match the matrices labeled as informative in the input_type column.  This means that SelectKBestProbes correctly identified the high-quality channels and filtered out the low-quality channels.
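
The probe-scoring idea described in this section can be approximated with plain scikit-learn, which may help clarify what happens inside the selection layer.  The sketch below is an illustration under stated assumptions, not skmultichannel's code: the toy data, the `probe_score` helper, the 3-fold split, and the use of `None` for dropped channels are all hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features = 300, 8
y = rng.integers(0, 2, size=n_samples)

def make_channel(informative):
    """Toy feature matrix (an assumption, not the tutorial's data generator)."""
    X = rng.normal(size=(n_samples, n_features))
    if informative:
        # Shift every feature of class 1 by 1.5 standard deviations.
        X += 1.5 * y[:, None]
    return X

channel_types = ['random', 'informative', 'random', 'informative', 'random']
Xs = [make_channel(t == 'informative') for t in channel_types]

def probe_score(X, y, probe, cv=3):
    """Mean cross-validated accuracy of a throwaway probe model on one channel."""
    # cross_val_score clones the probe for each fold, so no fitted model is kept.
    return cross_val_score(probe, X, y, cv=cv, scoring='accuracy').mean()

# A deliberately small model keeps the probing step cheap.
probe = GradientBoostingClassifier(n_estimators=5, random_state=0)
scores = np.array([probe_score(X, y, probe) for X in Xs])

# Keep the k channels whose probes were most accurate; pass the raw inputs on.
k = 2
selected = np.sort(np.argsort(scores)[-k:])
selected_Xs = [X if i in selected else None for i, X in enumerate(Xs)]

for i, (t, s) in enumerate(zip(channel_types, scores)):
    print(i, t, round(float(s), 2), 'kept' if i in selected else 'dropped')
```

Note that the probe is used only for scoring: the matrices that move forward are the original inputs, not the probe's predictions.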