# Tutorial 5: In-pipeline hyperparameter screens

This tutorial shows you how to use the **GridSearchEnsemble** class to:

* screen ML model hyperparameters during model fitting
* make ensemble predictions using the results of a hyperparameter screen

## Stacked generalization with parameter selection

In this example, internal cross-validation (cv) is used both to estimate the quality of each set of hyperparameters and to generate features for meta-prediction with a support vector machine. The top two parameter sets are chosen to create the final model.

In [1]:
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
import pipecaster as pc

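screen_specs = {  # every combination of these values (4 in total) is screened
    'learning_rate': [0.1, 10],
    'n_estimators': [2, 10],
}

X, y = make_classification()
clf = pc.GridSearchEnsemble(
    param_dict=screen_specs,
    base_predictor_cls=GradientBoostingClassifier,
    meta_predictor=SVC(),  # trained on the base predictors' outputs
    internal_cv=5,  # folds used for scoring and for meta-features
    base_score_methods='predict_proba',
    scorer=roc_auc_score,  # metric used to rank the parameter sets
    score_selector=pc.RankScoreSelector(k=2),  # keep the top two parameter sets
    base_processes='max')
clf.fit(X, y)
clf.get_screen_results()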



| parameters | selections | performance |
|---|---|---|
| {'learning_rate': 10, 'n_estimators': 10} | +++ | 0.827931 |
| {'learning_rate': 10, 'n_estimators': 2} | +++ | 0.82493 |
| {'learning_rate': 0.1, 'n_estimators': 10} | - | 0.820928 |
| {'learning_rate': 0.1, 'n_estimators': 2} | - | 0.817927 |


In [2]:
cross_val_score(clf, X, y, scoring='balanced_accuracy', cv=3)

array([0.91176471, 0.96875   , 0.81985294])
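
To make the mechanics of this section concrete, here is a minimal sketch of the same idea written with scikit-learn alone. It is not pipecaster's implementation: the variable names, the fixed `random_state`, and the milder `learning_rate` grid (chosen to keep the probabilities numerically well behaved) are assumptions made for illustration. Each parameter set is scored on its out-of-fold predictions, the two best sets are kept, and their out-of-fold probabilities become the training features for the SVC.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid, cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(random_state=0)

# Illustrative grid (assumption): milder than the tutorial's values.
grid = list(ParameterGrid({'learning_rate': [0.1, 1.0],
                           'n_estimators': [2, 10]}))

# Internal cv: out-of-fold class-1 probabilities for every parameter set.
oof = []
for params in grid:
    model = GradientBoostingClassifier(random_state=0, **params)
    proba = cross_val_predict(model, X, y, cv=5, method='predict_proba')
    oof.append(proba[:, 1])

# Score each parameter set on its out-of-fold predictions.
scores = [roc_auc_score(y, p) for p in oof]

# Keep the k=2 best-scoring parameter sets.
top = np.argsort(scores)[::-1][:2]
selected = [grid[i] for i in top]

# The selected sets' out-of-fold predictions are the meta-features.
meta_X = np.column_stack([oof[i] for i in top])
meta_clf = SVC().fit(meta_X, y)

# Refit the selected base models on all training data for inference.
base_models = [GradientBoostingClassifier(random_state=0, **p).fit(X, y)
               for p in selected]


def predict(X_new):
    """Route new samples through the base models, then the meta-classifier."""
    feats = np.column_stack([m.predict_proba(X_new)[:, 1]
                             for m in base_models])
    return meta_clf.predict(feats)


print(selected)
```

Training the meta-classifier on out-of-fold predictions, rather than on predictions for samples the base models have already seen, is what keeps the meta-features from being overly optimistic.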

## Parameter selection (without ensemble prediction)

In this example, the meta-predictor is dropped and the best parameter set is used to make the final model.

In [1]:
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import balanced_accuracy_score
import pipecaster as pc

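screen_specs = {  # every combination of these values (4 in total) is screened
    'learning_rate': [0.1, 10],
    'n_estimators': [2, 10],
}

X, y = make_classification()
clf = pc.GridSearchEnsemble(  # no meta_predictor: selection only
    param_dict=screen_specs,
    base_predictor_cls=GradientBoostingClassifier,
    internal_cv=5,  # folds used to score each parameter set
    base_score_methods='predict',
    scorer=balanced_accuracy_score,  # metric used to rank the parameter sets
    score_selector=pc.RankScoreSelector(k=1),  # keep only the best parameter set
    base_processes='max')
clf.fit(X, y)
clf.get_screen_results()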



| parameters | selections | performance |
|---|---|---|
| {'learning_rate': 0.1, 'n_estimators': 10} | +++ | 0.89956 |
| {'learning_rate': 0.1, 'n_estimators': 2} | - | 0.889756 |
| {'learning_rate': 10, 'n_estimators': 2} | - | 0.879952 |
| {'learning_rate': 10, 'n_estimators': 10} | - | 0.879552 |


In [2]:
cross_val_score(clf, X, y, scoring='balanced_accuracy', cv=3)

array([0.85294118, 0.9375    , 0.78492647])
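
For comparison, scikit-learn's `GridSearchCV` performs a similar screen-then-refit selection: it scores every parameter combination by cross-validation and, with `refit=True`, retrains the best one on all the data. The sketch below is an analogy, not a description of how **GridSearchEnsemble** works internally; the fixed `random_state` values are assumptions added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Same grid as the tutorial example above.
screen_specs = {
    'learning_rate': [0.1, 10],
    'n_estimators': [2, 10],
}

X, y = make_classification(random_state=0)

# Inner 5-fold cv scores each parameter set; the best one is refit on all data.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid=screen_specs,
    scoring='balanced_accuracy',
    cv=5,
    refit=True,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# Outer 3-fold cv: the screen is repeated inside each training split,
# so the reported scores are not biased by the parameter selection.
outer_scores = cross_val_score(search, X, y,
                               scoring='balanced_accuracy', cv=3)
print(outer_scores)
```

Because the selection step is part of the estimator, wrapping it in an outer `cross_val_score` call gives a nested cross-validation estimate, which is the same pattern used in cell `In [2]` above.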

## Ensemble of ensembles

In this example, a small hyperparameter screen is conducted on each channel using the **GridSearchEnsemble** class, which uses the best parameter sets to generate features for a support vector machine meta-classifier. **ChannelEnsemble** assesses the accuracy of each of these **GridSearchEnsemble** objects and selects the best performers for input into another support vector machine meta-classifier.

In [9]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
import pipecaster as pc

Xs, y, X_types = pc.make_multi_input_classification(n_informative_Xs=3,
                                                    n_random_Xs=7, class_sep=1)

screen_specs = {  # every combination of these values (4 in total) is screened
    'learning_rate': [0.1, 1],
    'n_estimators': [5, 10],
}

channel_screen = pc.GridSearchEnsemble(
    screen_specs, GradientBoostingClassifier, SVC(),
    internal_cv=3, scorer='auto',
    score_selector=pc.RankScoreSelector(k=2))  # top two parameter sets per channel

clf = pc.MultichannelPipeline(n_channels=10)
clf.add_layer(pc.ChannelEnsemble(base_predictors=channel_screen,
                                 meta_predictor=SVC(),
                                 internal_cv=3,
                                 score_selector=pc.RankScoreSelector(k=3)),  # top three channels
              pipe_processes='max')

| channel | layer_0 |
|---|---|
| 0 | ChannelEnsemble |
| 1 | ▽ |
| 2 | ▽ |
| 3 | ▽ |
| 4 | ▽ |
| 5 | ▽ |
| 6 | ▽ |
| 7 | ▽ |
| 8 | ▽ |
| 9 | ▽ |


In [10]:
pc.cross_val_score(clf, Xs, y)

[0.8235294117647058, 0.9411764705882353, 0.7536764705882353]
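
The two-level structure described in this section can be sketched with scikit-learn alone. This is a simplified illustration under stated assumptions, not pipecaster's implementation: the synthetic channels, the helper `make_channel_model`, the use of ROC AUC on out-of-fold decision values as the channel score, and the fixed pair of parameter sets per channel (the inner screen is omitted for brevity) are all hypothetical choices. Each channel gets its own stacked ensemble, the three best-scoring channels are kept, and their out-of-fold outputs train a second SVC.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples = 200
y = rng.integers(0, 2, size=n_samples)

# Ten hypothetical channels: three carry class signal, seven are pure noise.
Xs = [2.0 * y[:, None] + rng.normal(size=(n_samples, 3)) for _ in range(3)]
Xs += [rng.normal(size=(n_samples, 3)) for _ in range(7)]


def make_channel_model():
    """Inner ensemble: two boosted models feeding an SVC meta-classifier."""
    return StackingClassifier(
        estimators=[
            ('gb_a', GradientBoostingClassifier(
                learning_rate=0.1, n_estimators=10, random_state=0)),
            ('gb_b', GradientBoostingClassifier(
                learning_rate=1.0, n_estimators=5, random_state=0)),
        ],
        final_estimator=SVC(),
        cv=3,
    )


# Out-of-fold decision values of the inner ensemble, one vector per channel.
oof = [cross_val_predict(make_channel_model(), X, y, cv=3,
                         method='decision_function') for X in Xs]

# Score every channel and keep the three best.
channel_scores = [roc_auc_score(y, d) for d in oof]
selected = sorted(np.argsort(channel_scores)[::-1][:3].tolist())

# Outer meta-classifier trained on the selected channels' out-of-fold outputs.
meta_X = np.column_stack([oof[i] for i in selected])
outer_meta = SVC().fit(meta_X, y)

# Refit the inner ensembles of the selected channels on all training data.
channel_models = {i: make_channel_model().fit(Xs[i], y) for i in selected}


def predict(Xs_new):
    """Predict from a list of channel matrices ordered like Xs."""
    feats = np.column_stack([
        channel_models[i].decision_function(Xs_new[i]) for i in selected])
    return outer_meta.predict(feats)


print(selected)
```

Channels that are not selected are never consulted at prediction time, which mirrors the way the pipeline above narrows ten input channels down to the best performers before the final meta-classification.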