# Tutorial 2: ensemble learning

This tutorial shows you how to use classes from the **ensemble_learning** module to:
* make predictions from an ensemble of different algorithms trained on the same input channel using the **Ensemble** class
* make predictions from an ensemble of models trained on different input channels using the **ChannelEnsemble** class

ML models that win competitions often use ensemble learning approaches that combine the inferences of an ensemble of different ML algorithms to make a final prediction.  Predictor voting and stacked generalization are powerful methods for combining inferences that can create accurate predictions from inaccurate models.  Both methods work best when the prediction errors of the ensemble members are largely uncorrelated.
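
The role of uncorrelated errors can be illustrated with a small simulation.  The sketch below is not part of pipecaster; it assumes a binary task, hypothetical models that are each correct 65% of the time, and a simple majority vote.  When errors are independent the vote is more accurate than any single model, and when every model makes the same mistakes the vote gains nothing.

```python
import numpy as np

def majority_vote_accuracy(n_models, p_correct, n_samples, rng):
    """Accuracy of a majority vote over models whose errors are independent."""
    # Each entry is True when that model classifies that sample correctly.
    correct = rng.random((n_samples, n_models)) < p_correct
    # The vote is right when more than half of the models are right.
    return np.mean(correct.sum(axis=1) > n_models / 2)

rng = np.random.default_rng(0)
n_samples, p_correct = 20000, 0.65  # illustrative values, not from the tutorial

single = majority_vote_accuracy(1, p_correct, n_samples, rng)
ensemble = majority_vote_accuracy(15, p_correct, n_samples, rng)

# Perfectly correlated errors: all 15 models share one correctness pattern,
# so the majority vote is right exactly when that shared pattern is right.
shared = rng.random(n_samples) < p_correct
correlated = np.mean(15 * shared > 15 / 2)

print(f"single model:           {single:.3f}")
print(f"15 independent models:  {ensemble:.3f}")
print(f"15 correlated models:   {correlated:.3f}")
```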

Pipecaster makes conventional single channel voting and stacked generalization approaches easy with a single class: **Ensemble**.  In addition, the **ChannelEnsemble** class provides similar functionality for models trained on different input channels.

Both **Ensemble** and **ChannelEnsemble** classes also allow screening of base predictors using internal cross validation.  Screening can sometimes improve performance by dropping inaccurate models.

## Single channel ensembles

### Voting

This example illustrates the use of pipecaster's **Ensemble** class to make a voting ensemble classifier.  The ensemble in this example has six different ML classifier algorithms.  The predicted class probabilities from these classifiers are averaged using pipecaster's **SoftVotingMetaClassifier**, and the class with the highest average probability is used as the ensemble prediction.  Pipecaster also provides a **HardVotingMetaClassifier** class that predicts with the most frequent base prediction, and the **AggregatingMetaRegressor** class to combine the predictions of regressors.
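
To make the difference between soft and hard voting concrete, here is a minimal NumPy sketch.  It does not use pipecaster's meta-classifiers; the probability array is made up for illustration, and the two rules are written out by hand.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for four samples
# (shape: n_classifiers x n_samples x n_classes).
probas = np.array([
    [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55], [0.20, 0.80]],
    [[0.60, 0.40], [0.30, 0.70], [0.45, 0.55], [0.60, 0.40]],
    [[0.70, 0.30], [0.90, 0.10], [0.95, 0.05], [0.30, 0.70]],
])

# Soft voting: average the probabilities across classifiers, then pick the
# class with the highest mean probability.
soft_pred = probas.mean(axis=0).argmax(axis=1)

# Hard voting: each classifier votes for its most probable class, and the
# most frequent vote wins.
votes = probas.argmax(axis=2)  # n_classifiers x n_samples
hard_pred = np.array([np.bincount(col, minlength=2).argmax() for col in votes.T])

print("soft:", soft_pred)  # → [0 0 0 1]
print("hard:", hard_pred)  # → [0 1 1 1]
```

The two rules disagree on the second and third samples, where one confident classifier outweighs two marginal votes under soft voting.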

*Note*: since this ensemble predictor takes only 1 input, feature scaling is added using scikit-learn's Pipeline class rather than MultichannelPipeline. 

In [10]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
import pipecaster as pc

X, y = make_classification(n_classes=2, n_samples=500, n_features=100,
                           n_informative=5, class_sep=0.6)

predictors = [MLPClassifier(), LogisticRegression(),
              KNeighborsClassifier(),  GradientBoostingClassifier(),
              RandomForestClassifier(), GaussianNB()]

ensemble_clf = pc.Ensemble(
                 base_predictors=predictors,
                 meta_predictor=pc.SoftVotingMetaClassifier(),
                 base_processes='max')

clf = Pipeline([('scaler', StandardScaler()), ('ensemble_clf', ensemble_clf)])

pc.cross_val_score(clf, X, y)

[0.8324010327022375, 0.8146156052782558, 0.8313253012048193]

### Stacked generalization

This example illustrates the use of pipecaster's **Ensemble** class to stack ML models for ensemble learning.  The ensemble in this example has six different ML classifier algorithms.  A support vector machine (SVC) is stacked on top of these base classifiers, using their predictions as features for meta-prediction.

**Internal Cross Validation (CV) Training**
When training a meta-predictor (SVC in this example), it is standard practice to use internal CV training of the base classifiers so that the meta-predictor is not trained on inferences the base classifiers made on their own training samples (1).  With internal CV training, each base predictor is trained multiple times on subsets of the data, and the resulting models make predictions only for the held-out samples.  Line 23 in the code below (internal_cv=5) activates internal CV, and line 24 (disable_cv_train=False) specifies that the CV predictions will be used to train the meta-predictor (the default option when CV is activated, shown here to be explicit).

(1) Wolpert, David H. "Stacked generalization."
Neural networks 5.2 (1992): 241-259.
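
The idea behind internal CV training can be sketched with plain scikit-learn.  This is not pipecaster's implementation: it uses scikit-learn's cross_val_predict to produce out-of-fold predictions, and it assumes (as one common choice) that the base models are refit on all of the training data for use at inference time.  The dataset, the two base models, and the helper stacked_predict are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
base_models = [LogisticRegression(max_iter=1000), GaussianNB()]

# Out-of-fold predictions: each row is predicted by a model that was trained
# on the other folds, so no base model predicts its own training samples.
meta_features = np.hstack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1:]
    for model in base_models
])

# The meta-predictor is trained on the out-of-fold predictions.
meta_model = SVC().fit(meta_features, y)

# Assumption: base models are refit on all samples for inference.
fitted_base = [model.fit(X, y) for model in base_models]

def stacked_predict(X_new):
    """Hypothetical helper: base predictions become features for the SVC."""
    feats = np.hstack([m.predict_proba(X_new)[:, 1:] for m in fitted_base])
    return meta_model.predict(feats)

print(meta_features.shape, stacked_predict(X[:5]))
```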

In [18]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
import pipecaster as pc

X, y = make_classification(n_classes=2, n_samples=500, n_features=100,
                           n_informative=5, class_sep=0.6)

predictors = [MLPClassifier(), LogisticRegression(),
              KNeighborsClassifier(),  GradientBoostingClassifier(),
              RandomForestClassifier(), GaussianNB()]

ensemble_clf = pc.Ensemble(
                 base_predictors=predictors,
                 meta_predictor=SVC(),
                 internal_cv=5,
                 disable_cv_train=False, 
                 base_processes='max')

clf = Pipeline([('scaler', StandardScaler()),
                ('ensemble_clf', ensemble_clf)])

pc.cross_val_score(clf, X, y)



[0.7541594951233506, 0.7360154905335627, 0.7289156626506024]

### Stacked generalization with model selection

Internal cross validation was introduced in the previous example as a method for reducing overfitting during stacked generalization.  It can also be used for in-pipeline model selection.  When there are inaccurate models in the ensemble with error that is correlated with higher accuracy models, inferences from these inaccurate models can reduce predictive accuracy.  

Lines 24 and 25 in this example are the only lines that differ from the previous example.  These lines provide methods for scoring and selecting models.  Line 24 (scorer='auto') designates the default scoring method, which in this case will be set to balanced_accuracy_score when Ensemble detects that its predictors are classifiers (or explained_variance_score if they were regressors).  Line 25 instructs Ensemble to select the 3 models with the highest balanced_accuracy scores.  For more information see the API documentation for **Ensemble** and for pipecaster's **score_selection** module.
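
The scoring and rank-based selection described above can be approximated outside pipecaster as follows.  This is a rough sketch rather than the library's code: it scores each candidate with balanced_accuracy_score on out-of-fold predictions from scikit-learn's cross_val_predict and keeps the k best.  The candidate list, dataset, and variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
candidates = [LogisticRegression(max_iter=1000), GaussianNB(),
              KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)]

# Score every candidate on predictions for held-out samples only.
scores = [balanced_accuracy_score(y, cross_val_predict(model, X, y, cv=5))
          for model in candidates]

# Rank the candidates by score and keep the k best.
k = 3
ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
selected = [candidates[i] for i in ranked[:k]]

for i in ranked:
    mark = "+++" if i in ranked[:k] else "-"
    print(f"{candidates[i].__class__.__name__:24s} {scores[i]:.3f} {mark}")
```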

In [1]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
import pipecaster as pc

X, y = make_classification(n_classes=2, n_samples=500, n_features=100,
                           n_informative=5, class_sep=0.6)

predictors = [MLPClassifier(), LogisticRegression(),
              KNeighborsClassifier(),  GradientBoostingClassifier(),
              RandomForestClassifier(), GaussianNB()]

ensemble_clf = pc.Ensemble(
                 base_predictors=predictors,
                 meta_predictor=SVC(),
                 internal_cv=5, 
                 scorer='auto',
                 score_selector=pc.RankScoreSelector(k=3),
                 disable_cv_train=False, 
                 base_processes='max')

clf = Pipeline([('scaler', StandardScaler()),
                ('ensemble_clf', ensemble_clf)])

pc.cross_val_score(clf, X, y)



[0.8023522662076878, 0.7784710269650028, 0.8012048192771084]

In [3]:
clf.fit(X, y)
ensemble_clf = clf.named_steps['ensemble_clf']
ensemble_clf.get_screen_results()

| model | performance | selections |
|---|---|---|
| MLPClassifier() | 0.698 | - |
| LogisticRegression() | 0.716 | - |
| KNeighborsClassifier() | 0.642 | - |
| GradientBoostingClassifier() | 0.796 | +++ |
| RandomForestClassifier() | 0.802 | +++ |
| GaussianNB() | 0.74 | +++ |


### Model selection 

The **Ensemble** class can also be used without a meta_predictor for model selection.  When the meta_predictor parameter is left at the default value of None during Ensemble initialization, Ensemble will use internal CV and scoring to select the best predictor in the ensemble during fitting.  Only the most accurate predictor in the ensemble will be stored and used for inference.

In [4]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
import pipecaster as pc

X, y = make_classification(n_classes=2, n_samples=500, n_features=100,
                           n_informative=5, class_sep=0.6)

predictors = [MLPClassifier(), LogisticRegression(),
              KNeighborsClassifier(),  GradientBoostingClassifier(),
              RandomForestClassifier(), GaussianNB()]

ensemble_clf = pc.Ensemble(
                 base_predictors=predictors,
                 meta_predictor=None,
                 internal_cv=5, 
                 scorer='auto',
                 base_processes='max')

clf = Pipeline([('scaler', StandardScaler()),
                ('ensemble_clf', ensemble_clf)])

pc.cross_val_score(clf, X, y)



[0.7426133103843947, 0.7063253012048193, 0.7706155632984901]

In [5]:
clf.fit(X, y)
ensemble_clf = clf.named_steps['ensemble_clf']
ensemble_clf.get_screen_results()



| model | performance | selections |
|---|---|---|
| MLPClassifier() | 0.573829 | - |
| LogisticRegression() | 0.577893 | - |
| KNeighborsClassifier() | 0.519681 | - |
| GradientBoostingClassifier() | 0.723854 | +++ |
| RandomForestClassifier() | 0.71963 | - |
| GaussianNB() | 0.671723 | - |


## Multichannel ensembles

The **ChannelEnsemble** class supports the same functionality as **Ensemble** (voting, stacked generalization, model selection) but instead of training an ensemble of models for a single channel, it trains an ensemble of models for multiple channels with one model per channel.  Training an ML model on each input and combining inferences through voting or stacking can improve predictive accuracy over a pipeline that simply concatenates the inputs and feeds them to a single ML model.
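
The one-model-per-channel idea can be sketched without pipecaster.  The example below is only an illustration under stated assumptions: three synthetic channels generated with NumPy (two informative, one pure noise), one GradientBoostingClassifier per channel, and hand-written soft voting across channels.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 400
y = rng.integers(0, 2, n_samples)

# Three hypothetical channels: two carry signal about y, one is pure noise.
Xs = [
    rng.normal(size=(n_samples, 5)) + y[:, None] * 1.0,
    rng.normal(size=(n_samples, 5)) + y[:, None] * 0.8,
    rng.normal(size=(n_samples, 5)),
]

idx_train, idx_test = train_test_split(
    np.arange(n_samples), test_size=0.3, random_state=0, stratify=y)

# One model per channel, each trained only on its own feature matrix.
models = [GradientBoostingClassifier(random_state=0).fit(X[idx_train], y[idx_train])
          for X in Xs]

# Soft voting across channels: average the per-channel class probabilities.
probas = np.mean([m.predict_proba(X[idx_test]) for m, X in zip(models, Xs)],
                 axis=0)
y_pred = probas.argmax(axis=1)  # class labels are 0 and 1, so index == label
accuracy = (y_pred == y[idx_test]).mean()
print(f"channel-ensemble accuracy: {accuracy:.3f}")
```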

*Note*:  You can build multichannel ensembles manually without the **ChannelEnsemble** class by using the **transform_wrappers** module to convert predictors into transformers and provide internal CV training functionality, **ChannelConcatenator** to combine outputs from the base predictors, and **MultichannelPipeline** layering to stack the concatenator and meta-predictor.  For more info on this usage style, see the 3 different style examples in the **MultichannelPipeline** and **SoftVotingMetaClassifier** API documentation (or docstrings).

### Voting

In [25]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
import pipecaster as pc

Xs, y, X_types = pc.make_multi_input_classification(n_informative_Xs=3,
                                                    n_random_Xs=7)

clf = pc.MultichannelPipeline(n_channels=10)
clf.add_layer(StandardScaler())
clf.add_layer(pc.ChannelEnsemble(GradientBoostingClassifier(), pc.SoftVotingMetaClassifier()),
              pipe_processes='max')

pc.cross_val_score(clf, Xs, y)

[0.8823529411764706, 0.8492647058823529, 0.8768382352941176]

In [26]:
clf.fit(Xs, y)

| channel | layer_0 | out_0 | layer_1 | out_1 |
|---|---|---|---|---|
| 0 | StandardScaler | → | ChannelEnsemble | → |
| 1 | StandardScaler | → | ▽ |  |
| 2 | StandardScaler | → | ▽ |  |
| 3 | StandardScaler | → | ▽ |  |
| 4 | StandardScaler | → | ▽ |  |
| 5 | StandardScaler | → | ▽ |  |
| 6 | StandardScaler | → | ▽ |  |
| 7 | StandardScaler | → | ▽ |  |
| 8 | StandardScaler | → | ▽ |  |
| 9 | StandardScaler | → | ▽ |  |


### Stacked generalization

In [29]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
import pipecaster as pc

Xs, y, X_types = pc.make_multi_input_classification(n_informative_Xs=3,
                                                    n_random_Xs=7)

clf = pc.MultichannelPipeline(n_channels=10)
clf.add_layer(StandardScaler())
clf.add_layer(pc.ChannelEnsemble(GradientBoostingClassifier(), SVC(), internal_cv=5),
              pipe_processes='max')

pc.cross_val_score(clf, Xs, y)

[0.9117647058823529, 0.96875, 0.9393382352941176]

In [30]:
clf.fit(Xs, y)

| channel | layer_0 | out_0 | layer_1 | out_1 |
|---|---|---|---|---|
| 0 | StandardScaler | → | ChannelEnsemble | → |
| 1 | StandardScaler | → | ▽ |  |
| 2 | StandardScaler | → | ▽ |  |
| 3 | StandardScaler | → | ▽ |  |
| 4 | StandardScaler | → | ▽ |  |
| 5 | StandardScaler | → | ▽ |  |
| 6 | StandardScaler | → | ▽ |  |
| 7 | StandardScaler | → | ▽ |  |
| 8 | StandardScaler | → | ▽ |  |
| 9 | StandardScaler | → | ▽ |  |


### Stacked generalization with model selection

In [4]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
import pipecaster as pc

Xs, y, X_types = pc.make_multi_input_classification(n_informative_Xs=3,
                                                    n_random_Xs=7)

clf = pc.MultichannelPipeline(n_channels=10)
clf.add_layer(StandardScaler())
clf.add_layer(pc.ChannelEnsemble(GradientBoostingClassifier(), SVC(),
                                 internal_cv=5,
                                 scorer='auto',
                                 score_selector=pc.RankScoreSelector(k=3)),
              pipe_processes='max')

pc.cross_val_score(clf, Xs, y)

[0.9705882352941176, 0.9411764705882353, 0.9705882352941176]

In [5]:
clf.fit(Xs, y)

| channel | layer_0 | out_0 | layer_1 | out_1 |
|---|---|---|---|---|
| 0 | StandardScaler | → | ChannelEnsemble | → |
| 1 | StandardScaler | → | ▽ |  |
| 2 | StandardScaler | → | ▽ |  |
| 3 | StandardScaler | → | ▽ |  |
| 4 | StandardScaler | → | ▽ |  |
| 5 | StandardScaler | → | ▽ |  |
| 6 | StandardScaler | → | ▽ |  |
| 7 | StandardScaler | → | ▽ |  |
| 8 | StandardScaler | → | ▽ |  |
| 9 | StandardScaler | → | ▽ |  |


In [6]:
ensemble_clf = clf.get_model(1,0)
df = ensemble_clf.get_screen_results()
df['input type'] = X_types
df

| channel | performance | selections | input type |
|---|---|---|---|
| 0 | 0.42 | - | random |
| 1 | 0.56 | - | random |
| 2 | 0.84 | +++ | informative |
| 3 | 0.92 | +++ | informative |
| 4 | 0.58 | - | random |
| 5 | 0.45 | - | random |
| 6 | 0.56 | - | random |
| 7 | 0.86 | +++ | informative |
| 8 | 0.58 | - | random |
| 9 | 0.49 | - | random |


### Model selection

In [1]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
import pipecaster as pc

Xs, y, X_types = pc.make_multi_input_classification(n_informative_Xs=3,
                                                    n_random_Xs=7)

clf = pc.MultichannelPipeline(n_channels=10)
clf.add_layer(StandardScaler())
clf.add_layer(pc.ChannelEnsemble(GradientBoostingClassifier(), internal_cv=5, scorer='auto'),
              pipe_processes='max')

pc.cross_val_score(clf, Xs, y)

[0.9411764705882353, 0.9705882352941176, 0.7224264705882353]

In [2]:
clf.fit(Xs, y)

| channel | layer_0 | out_0 | layer_1 | out_1 |
|---|---|---|---|---|
| 0 | StandardScaler | → | ChannelEnsemble | → |
| 1 | StandardScaler | → | ▽ |  |
| 2 | StandardScaler | → | ▽ |  |
| 3 | StandardScaler | → | ▽ |  |
| 4 | StandardScaler | → | ▽ |  |
| 5 | StandardScaler | → | ▽ |  |
| 6 | StandardScaler | → | ▽ |  |
| 7 | StandardScaler | → | ▽ |  |
| 8 | StandardScaler | → | ▽ |  |
| 9 | StandardScaler | → | ▽ |  |


In [3]:
ensemble_clf = clf.get_model(1,0)
df = ensemble_clf.get_screen_results()
df['input type'] = X_types
df

| channel | performance | selections | input type |
|---|---|---|---|
| 0 | 0.569628 | - | random |
| 1 | 0.508403 | - | random |
| 2 | 0.479792 | - | random |
| 3 | 0.519808 | - | random |
| 4 | 0.729692 | - | informative |
| 5 | 0.7489 | +++ | informative |
| 6 | 0.69948 | - | informative |
| 7 | 0.510004 | - | random |
| 8 | 0.509604 | - | random |
| 9 | 0.560224 | - | random |
