# Tutorial 3: Ensemble learning

This tutorial shows you how to use classes from the **ensemble_learning** module to:
* make predictions from an ensemble of different algorithms trained on the same input channel using the **Ensemble** class
* make predictions from an ensemble of models trained on different input channels using the **ChannelEnsemble** class

ML models that win competitions often use ensemble learning approaches that combine the inferences of several different ML algorithms to make a final prediction.  Predictor voting and stacked generalization are powerful methods for combining inferences that can create accurate predictions from inaccurate models.  Both methods depend on the prediction errors within the ensemble being at least partly uncorrelated: if every model makes the same mistakes, combining them cannot correct those mistakes.
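The effect of error correlation can be illustrated with a small simulation. This is a minimal NumPy sketch, not pipecaster code: the number of models, the 65% single-model accuracy, and the helper name `majority_vote_accuracy` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_models, p_correct = 10_000, 7, 0.65

# Independent errors: each model is correct on its own random 65% of samples.
independent = rng.random((n_models, n_samples)) < p_correct

# Fully correlated errors: every model is right or wrong on the same samples.
shared = rng.random(n_samples) < p_correct
correlated = np.tile(shared, (n_models, 1))


def majority_vote_accuracy(correct):
    """Fraction of samples on which most models are correct (binary task)."""
    return np.mean(correct.sum(axis=0) > correct.shape[0] / 2)


acc_independent = majority_vote_accuracy(independent)
acc_correlated = majority_vote_accuracy(correlated)
print(f"single model accuracy:         {p_correct:.2f}")
print(f"vote with independent errors:  {acc_independent:.3f}")
print(f"vote with correlated errors:   {acc_correlated:.3f}")
```

With independent errors the majority vote is noticeably more accurate than any single model, while with fully correlated errors it is no better than one model. Real ensembles usually fall somewhere between these two extremes.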

Pipecaster makes conventional single channel voting and stacked generalization approaches easy with a single class: **Ensemble**.  In addition, the **ChannelEnsemble** class provides similar functionality for models trained on different input channels.

Both **Ensemble** and **ChannelEnsemble** classes also allow screening of base predictors using internal cross validation.  Screening can sometimes improve performance by dropping inaccurate models.

## Single channel ensembles

### Voting

This example illustrates the use of pipecaster's **Ensemble** class to make a voting ensemble classifier.  The ensemble in this example has six different ML classifier algorithms.  The predicted class probabilities from these classifiers are averaged using pipecaster's **SoftVotingClassifier**, and the class with the highest average probability is used as the ensemble prediction.  Pipecaster also provides a **HardVotingClassifier** class that predicts the most frequent base prediction, and an **AggregatingRegressor** class that combines the predictions of regressors.
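To make the difference between soft and hard voting concrete, here is a small NumPy sketch for a binary task. The probabilities are made-up values for illustration, and the code shows only the arithmetic of the two voting rules; it does not use or reproduce pipecaster's classes.

```python
import numpy as np

# Hypothetical class-1 probabilities from three classifiers for four samples.
# Rows are classifiers, columns are samples.
proba_class1 = np.array([[0.90, 0.40, 0.45, 0.20],
                         [0.60, 0.45, 0.40, 0.30],
                         [0.55, 0.95, 0.90, 0.10]])

# Soft voting: average the probabilities, then pick the likelier class.
soft_pred = (proba_class1.mean(axis=0) > 0.5).astype(int)

# Hard voting: each classifier casts one vote, the most frequent label wins.
votes = (proba_class1 > 0.5).astype(int)
hard_pred = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

print("soft voting:", soft_pred)   # [1 1 1 0]
print("hard voting:", hard_pred)   # [1 0 0 0]
```

The two rules disagree on the second and third samples: one very confident classifier outweighs two mildly doubtful ones under soft voting, but is outvoted under hard voting.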

*Note*: since this ensemble predictor takes only 1 input, feature scaling is added using scikit-learn's Pipeline class rather than MultichannelPipeline. 

In [10]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
import pipecaster as pc

X, y = make_classification(n_classes=2, n_samples=500, n_features=100,
                           n_informative=5, class_sep=0.6)

predictors = [MLPClassifier(), LogisticRegression(),
              KNeighborsClassifier(),  GradientBoostingClassifier(),
              RandomForestClassifier(), GaussianNB()]

ensemble_clf = pc.Ensemble(
                 base_predictors=predictors,
                 meta_predictor=pc.SoftVotingClassifier(),
                 base_processes='max')

clf = Pipeline([('scaler', StandardScaler()), ('ensemble_clf', ensemble_clf)])

pc.cross_val_score(clf, X, y)

[0.8324010327022375, 0.8146156052782558, 0.8313253012048193]

### Stacked generalization

This example illustrates the use of pipecaster's **Ensemble** class to stack ML models for ensemble learning.  The ensemble in this example has six different ML classifier algorithms.  A support vector machine (SVC) is stacked on top of these base classifiers, using their predictions as features for meta-prediction.

**Internal Cross Validation (CV) Training**  
When training a meta-predictor (SVC in this example), it is standard practice to use internal CV training of the base classifiers so that the meta-predictor is never trained on predictions that the base classifiers made for their own training samples (1).  With internal CV training, each base predictor is trained multiple times on subsets of the data, and each resulting model makes predictions only for the samples held out from its training subset.  Line 23 in the code below (internal_cv=5) activates internal CV, and line 24 (disable_cv_train=False) specifies that the CV predictions will be used to train the meta-predictor (the default option when CV is activated, shown here to be explicit).

(1) Wolpert, David H. "Stacked generalization."
Neural networks 5.2 (1992): 241-259.
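The internal CV idea can be sketched with plain scikit-learn, using `cross_val_predict` to produce held-out predictions. This is an illustrative approximation under assumed settings (two base models, 5 folds, a small synthetic dataset), not a description of how **Ensemble** is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
base_models = [LogisticRegression(max_iter=1000), GaussianNB()]

# Out-of-fold predictions: each sample is scored by a model that was
# trained without it, so the meta-features are not overly optimistic.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method='predict_proba')[:, 1]
    for model in base_models])

# The meta-predictor is trained on the held-out predictions only.
meta_model = SVC().fit(meta_features, y)

# For inference on new data, the base models are refit on all samples.
fitted_base = [model.fit(X, y) for model in base_models]
print(meta_features.shape)
```

Without the cross validation step, the base models would predict on samples they had memorized during training, and the meta-predictor would learn to trust them more than it should.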

In [18]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
import pipecaster as pc

X, y = make_classification(n_classes=2, n_samples=500, n_features=100,
                           n_informative=5, class_sep=0.6)

predictors = [MLPClassifier(), LogisticRegression(),
              KNeighborsClassifier(),  GradientBoostingClassifier(),
              RandomForestClassifier(), GaussianNB()]

ensemble_clf = pc.Ensemble(
                 base_predictors=predictors,
                 meta_predictor=SVC(),
                 internal_cv=5,
                 disable_cv_train=False, 
                 base_processes='max')

clf = Pipeline([('scaler', StandardScaler()),
                ('ensemble_clf', ensemble_clf)])

pc.cross_val_score(clf, X, y)



[0.7541594951233506, 0.7360154905335627, 0.7289156626506024]

### Stacked generalization with model selection

Internal cross validation was introduced in the previous example as a method for reducing overfitting during stacked generalization.  It can also be used for in-pipeline model selection.  When there are inaccurate models in the ensemble with error that is correlated with higher accuracy models, inferences from these inaccurate models can reduce predictive accuracy.  

Lines 24 and 25 in this example are the only lines that differ from the previous example.  They specify how models are scored and selected.  Line 24 (scorer='auto') requests the default scoring method, which in this case is balanced_accuracy_score because Ensemble detects that its predictors are classifiers (it would be explained_variance_score if they were regressors).  Line 25 (score_selector=pc.RankScoreSelector(k=2)) instructs Ensemble to select the 2 models with the highest balanced accuracy scores.  For more information see the API documentation for **Ensemble** and for pipecaster's **score_selection** module.
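The scoring and rank-based selection described above can be approximated in plain scikit-learn as follows. This is a sketch of the general technique, assuming four arbitrary candidate classifiers and 5-fold CV; it is not pipecaster's **RankScoreSelector** implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
candidates = [LogisticRegression(max_iter=1000), GaussianNB(),
              KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)]

# Score each candidate on held-out predictions from internal CV.
scores = np.array([
    balanced_accuracy_score(y, cross_val_predict(model, X, y, cv=5))
    for model in candidates])

# Keep the k candidates with the highest scores.
k = 2
top_k = np.argsort(scores)[::-1][:k]
selected = [candidates[i] for i in sorted(top_k)]

for model, score in zip(candidates, scores):
    print(f"{type(model).__name__}: {score:.3f}")
print("selected:", [type(model).__name__ for model in selected])
```

Setting k to 1 in this sketch reduces it to picking the single best model.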

In [19]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
import pipecaster as pc

X, y = make_classification(n_classes=2, n_samples=500, n_features=100,
                           n_informative=5, class_sep=0.6)

predictors = [MLPClassifier(), LogisticRegression(),
              KNeighborsClassifier(),  GradientBoostingClassifier(),
              RandomForestClassifier(), GaussianNB()]

ensemble_clf = pc.Ensemble(
                 base_predictors=predictors,
                 meta_predictor=SVC(),
                 internal_cv=5, 
                 scorer='auto',
                 score_selector=pc.RankScoreSelector(k=2),
                 disable_cv_train=False, 
                 base_processes='max')

clf = Pipeline([('scaler', StandardScaler()),
                ('ensemble_clf', ensemble_clf)])

pc.cross_val_score(clf, X, y)



[0.7304216867469879, 0.7780407343660356, 0.8012048192771084]

In [20]:
# inspect models selected by Ensemble
clf.fit(X, y)
ensemble_clf = clf.named_steps['ensemble_clf']
selected_models = [p for i, p in enumerate(predictors) if i in ensemble_clf.get_support()]
selected_models



[GradientBoostingClassifier(), RandomForestClassifier()]

### Model selection 

The **Ensemble** class can also be used without a meta-predictor for model selection.  When the meta_predictor parameter is left at its default value of None during Ensemble initialization, Ensemble uses internal CV and scoring to select the best predictor in the ensemble during fitting.  Only the most accurate predictor is stored and used for inference.

In [1]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
import pipecaster as pc

X, y = make_classification(n_classes=2, n_samples=500, n_features=100,
                           n_informative=5, class_sep=0.6)

predictors = [MLPClassifier(), LogisticRegression(),
              KNeighborsClassifier(),  GradientBoostingClassifier(),
              RandomForestClassifier(), GaussianNB()]

ensemble_clf = pc.Ensemble(
                 base_predictors=predictors,
                 meta_predictor=None,
                 internal_cv=5, 
                 scorer='auto',
                 base_processes='max')

clf = Pipeline([('scaler', StandardScaler()),
                ('ensemble_clf', ensemble_clf)])

pc.cross_val_score(clf, X, y)


[0.8443775100401607, 0.7972022955523672, 0.8617886178861789]

In [2]:
# inspect models selected by Ensemble
clf.fit(X, y)
ensemble_clf = clf.named_steps['ensemble_clf']
selected_models = [p for i, p in enumerate(predictors) if i in ensemble_clf.get_support()]
selected_models



[GradientBoostingClassifier()]

## Multichannel ensembles

In [4]:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC 
import pipecaster as pc

early_stopping_GBC = GradientBoostingClassifier(n_estimators=1000, 
                                     validation_fraction=0.1, 
                                     n_iter_no_change=3)

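# 10-channel pipeline: one GradientBoostingClassifier per channel,
# followed by a single SVC meta-classifier that spans all channels
mclf1 = pc.MultichannelPipeline(n_channels=10)
mclf1.add_layer(early_stopping_GBC)
mclf1.add_layer(pc.MultichannelPredictor(SVC()))
mclf1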

| channel | layer_0 | layer_1 |
| --- | --- | --- |
| 0 | GradientBoostingClassifier | SVC_MC |
| 1 | GradientBoostingClassifier | ▽ |
| 2 | GradientBoostingClassifier | ▽ |
| 3 | GradientBoostingClassifier | ▽ |
| 4 | GradientBoostingClassifier | ▽ |
| 5 | GradientBoostingClassifier | ▽ |
| 6 | GradientBoostingClassifier | ▽ |
| 7 | GradientBoostingClassifier | ▽ |
| 8 | GradientBoostingClassifier | ▽ |
| 9 | GradientBoostingClassifier | ▽ |


**Notes on dataframe visualization**:  
* Inverted triangles indicate that the channels are spanned by the multichannel pipe shown directly above. 
* The _MC suffix indicates that the SVC classifier is being used for multichannel prediction (i.e. it's wrapped in the MultichannelPredictor class).
* The integer channel and layer indices seen on the left and top can be used to reference specific object instances in the pipeline (see MultichannelPipeline interface annotations).  

In [4]:
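# Xs and y are assumed to be defined already: Xs is a list of 10 feature
# matrices (one per channel) and y holds the class labels.
mclf1.fit(Xs, y)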

| channel | layer_0 | out_0 | layer_1 | out_1 |
| --- | --- | --- | --- | --- |
| 0 | {GradientBoostingClassifier}cvtr | → | {SVC_MC}tr | → |
| 1 | {GradientBoostingClassifier}cvtr | → | ▽ | |
| 2 | {GradientBoostingClassifier}cvtr | → | ▽ | |
| 3 | {GradientBoostingClassifier}cvtr | → | ▽ | |
| 4 | {GradientBoostingClassifier}cvtr | → | ▽ | |
| 5 | {GradientBoostingClassifier}cvtr | → | ▽ | |
| 6 | {GradientBoostingClassifier}cvtr | → | ▽ | |
| 7 | {GradientBoostingClassifier}cvtr | → | ▽ | |
| 8 | {GradientBoostingClassifier}cvtr | → | ▽ | |
| 9 | {GradientBoostingClassifier}cvtr | → | ▽ | |


**Notes on dataframe visualization**: 
* GradientBoostingClassifier appears in brackets followed by 'cvtr' to indicate that the classifier has been wrapped with a class that provides internal cross validation training to prevent overfitting of downstream metaclassifiers.
* SVC_MC appears in brackets followed by 'tr' to indicate that it has been wrapped to provide a transform() method that predictors ordinarily lack (transform functionality is there in case the user wants to generate inputs for a function external to the pipeline). 
* The out_0 and out_1 columns show, with an arrow (→), which channels produce an output after each layer.
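The architecture shown in the table (one classifier per channel trained with internal CV, then one meta-classifier spanning all channels) can be approximated in plain scikit-learn. The sketch below is an assumption-laden illustration rather than pipecaster's implementation: the three channels are made by splitting a synthetic matrix, and `make_gbc` and `predict` are hypothetical helper names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Hypothetical multichannel input: one feature matrix per channel.
X, y = make_classification(n_samples=200, n_features=12, n_informative=6,
                           random_state=0)
Xs = np.split(X, 3, axis=1)  # three channels with four features each


def make_gbc():
    """Return a fresh, small gradient boosting classifier."""
    return GradientBoostingClassifier(n_estimators=20, random_state=0)


# Layer 0 (the 'cvtr' step): held-out predictions from one model per channel.
meta_features = np.column_stack([
    cross_val_predict(make_gbc(), X_channel, y, cv=3,
                      method='predict_proba')[:, 1]
    for X_channel in Xs])

# Layer 1: a single meta-classifier that spans all channels.
meta_clf = SVC().fit(meta_features, y)

# For inference, each channel model is refit on all training samples.
channel_models = [make_gbc().fit(X_channel, y) for X_channel in Xs]


def predict(Xs_new):
    """Predict labels by feeding channel model outputs to the meta-classifier."""
    features = np.column_stack([
        model.predict_proba(X_channel)[:, 1]
        for model, X_channel in zip(channel_models, Xs_new)])
    return meta_clf.predict(features)


print(predict(Xs)[:10])
```

Each channel contributes a single column of predicted probabilities, so the meta-classifier sees one feature per channel regardless of how many raw features each channel contains.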

### Performance analysis

In [None]:
%matplotlib inline
import time
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt
from scipy.stats import sem

n_cpus = pc.count_cpus()

# Test a MultichannelPipeline in a cross validation experiment
t = time.time()
pc_accuracies = pc.cross_val_score(mclf1, Xs, y, scorer=balanced_accuracy_score, cv=8, n_processes=n_cpus)
pc_time = time.time() - t

# Test a single channel scikit-learn pipeline in a cross validation experiment
X = np.concatenate(Xs, axis=1)
clf = Pipeline([('GradientBoostingClassifier', early_stopping_GBC)])
t = time.time()
sk_accuracies = cross_val_score(clf, X, y, scoring='balanced_accuracy', cv=8, n_jobs=n_cpus)
sk_time = time.time() - t

# Plot the cross validation results
fig, axes = plt.subplots(1, 2)
xlabels = ['concatenated', 'multichannel']
axes[0].bar(xlabels, [np.mean(sk_accuracies), np.mean(pc_accuracies)], 
        yerr=[sem(sk_accuracies), sem(pc_accuracies)], capsize=10)
axes[0].set_ylim(.60)
axes[0].set_ylabel('balanced accuracy', fontsize=12)
# rotate the existing tick labels rather than resetting them
plt.setp(axes[0].get_xticklabels(), rotation=45, ha='right', fontsize=12)
axes[1].bar(xlabels, [sk_time, pc_time])
axes[1].set_ylabel('execution time [s]', fontsize=12)
plt.setp(axes[1].get_xticklabels(), rotation=45, ha='right', fontsize=12)
plt.tight_layout()
# plt.savefig('performance_comparison.svg')

**Results**  
The multichannel pipecaster pipeline performs better on this task than the pipeline architecture that uses concatenated features.  This performance enhancement, which is sometimes seen with real data, may be due to the tendency of meta-predictors to correct errors found in the base predictors and perhaps to increased diversity of the features important to the base predictors.

This example is a proxy for a more thorough demonstration of performance improvement, which would involve a full hyperparameter optimization for the two pipeline architectures.  The early stopping GradientBoostingClassifier method shown here is a decent proxy though, because performance of this classifier is not very sensitive to hyperparameter values, and model complexity is automatically set by increasing the number of boosting rounds until performance on a validation set stops increasing.
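The early stopping behavior described above can be observed directly in scikit-learn. In this minimal sketch the dataset and random seed are assumptions; the classifier settings mirror those used for early_stopping_GBC earlier in the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Allow up to 1000 boosting rounds, but stop once the score on a 10%
# validation split has failed to improve for 3 consecutive rounds.
gbc = GradientBoostingClassifier(n_estimators=1000,
                                 validation_fraction=0.1,
                                 n_iter_no_change=3,
                                 random_state=0)
gbc.fit(X, y)

# n_estimators_ holds the number of boosting rounds actually fitted.
print("boosting rounds used:", gbc.n_estimators_)
```

The number of rounds used is typically far below the n_estimators ceiling, which is how model complexity gets set without a separate hyperparameter search.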