# 1. Introduction

### 1.1 Overview of ensemble learning
*Ensemble learning* is a machine learning paradigm in which multiple models (often called "weak learners") are trained to solve the same problem and then combined to get better results. The main hypothesis is that weak models, when combined correctly, can be considerably more accurate than any one of them alone.<br>
The core idea behind ensemble learning is that no single model is perfect, but a combination of diverse models can lead to more balanced and accurate predictions. Some common ensemble learning techniques are:
- Bagging (Bootstrap Aggregating)
- Boosting
- Stacking

### 1.2 What is Super Learner Ensemble?
The *Super Learner* is a specific type of stacking ensemble that combines multiple traditional machine learning algorithms and finds the optimal combination of these diverse learners. This involves selecting many different algorithms that may be appropriate for your regression or classification problem and evaluating their performance on your dataset using a resampling technique, such as k-fold cross-validation.<br>
Steps to build a Super Learner model:<br>
1. Select a k-fold split of the training dataset.
2. Select m base-models or model configurations.
3. For each base-model:
    - Evaluate using k-fold cross-validation.
    - Store all out-of-fold predictions.
    - Fit the model on the full training dataset and store.
4. Fit a meta-model on the out-of-fold predictions.
5. Evaluate the model on a holdout dataset or use the model to make predictions.

![image](super_learner.png)<br>
[Image source](https://arxiv.org/abs/1803.02323)


### What could be the inputs and outputs for the meta-model?
- Inputs: Predictions from the base models
- Output: The final prediction for each row (a target value for regression, a class for classification)

For example, if we have 3 base models, then the meta-model will take 3 predictions as input and output the final prediction.<br>
If we had 1000 rows in the training dataset and 3 base models that each output a single value, then the meta-model's training input has 1000 rows and 3 columns, and its training target has 1000 values.
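
As a rough illustration of these shapes, the sketch below builds the meta-model input for 1000 rows and 3 base models. The dataset, the three models, the variable names, and the use of scikit-learn's `cross_val_predict` are assumptions chosen for brevity; the walkthrough in Section 2 builds the same kind of array by hand.

```python
# illustrative sketch: shape of the meta-model input for 3 base models
from numpy import column_stack
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# hypothetical dataset with 1000 rows
X_demo, y_demo = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=1)
base_models = [LinearRegression(), KNeighborsRegressor(), DecisionTreeRegressor(random_state=1)]
cv = KFold(n_splits=5, shuffle=True, random_state=1)
# one column of out-of-fold predictions per base model
meta_X_demo = column_stack([cross_val_predict(m, X_demo, y_demo, cv=cv) for m in base_models])
print(meta_X_demo.shape, y_demo.shape)  # (1000, 3) (1000,)
```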


### Can this work for regression and classification problems?
Yes, the Super Learner ensemble can be used for both regression and classification problems. The only difference is the choice of the meta-model. For regression problems, the meta-model can be a linear regression model, while for classification problems, the meta-model can be a logistic regression model or any other classification model.

### Won't this overfit the training data?
The Super Learner ensemble is designed to limit overfitting by training the meta-model on out-of-fold predictions. During cross-validation, each base model is fit on k-1 folds and predicts the held-out fold, so every prediction the meta-model sees comes from a model that was not trained on that row. This helps the ensemble generalize to new data.
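
To see why out-of-fold predictions matter, the sketch below compares a decision tree's error on the rows it was trained on with its error on held-out folds. The dataset, the model choice, and the variable names are assumptions made for illustration.

```python
# illustrative sketch: in-sample versus out-of-fold error for one base model
from math import sqrt
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=1)
tree = DecisionTreeRegressor(random_state=1)
# predictions on the same rows the tree was fit on
in_sample = tree.fit(X_demo, y_demo).predict(X_demo)
# predictions for each row from a tree that never saw that row
cv = KFold(n_splits=10, shuffle=True, random_state=1)
out_of_fold = cross_val_predict(tree, X_demo, y_demo, cv=cv)
rmse_in = sqrt(mean_squared_error(y_demo, in_sample))
rmse_oof = sqrt(mean_squared_error(y_demo, out_of_fold))
print('in-sample RMSE %.3f, out-of-fold RMSE %.3f' % (rmse_in, rmse_oof))
```

A meta-model trained on the in-sample predictions would see the tree as nearly perfect and could over-weight it; the out-of-fold predictions give a more honest picture of how the tree behaves on unseen rows.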

### How do we make a prediction?
To make a prediction on a new sample (row of data), first, the row of data is provided as input to each base model to generate a prediction from each model.

The predictions from the base-models are then concatenated into a vector and provided as input to the meta-model. The meta-model then makes a final prediction for the row of data.
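
A minimal sketch of this prediction step follows, assuming three hypothetical base models and a linear regression meta-model; the names `base_models`, `new_row`, and `row_vector` are illustrative and not part of the code in Section 2.

```python
# illustrative sketch: predicting one new row with a stacked model
from numpy import column_stack
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=1)
base_models = [LinearRegression(), KNeighborsRegressor(), DecisionTreeRegressor(random_state=1)]
# out-of-fold predictions are used to train the meta-model
oof = column_stack([cross_val_predict(m, X_demo, y_demo, cv=5) for m in base_models])
meta_model = LinearRegression().fit(oof, y_demo)
# the base models are then fit on all training rows
for m in base_models:
    m.fit(X_demo, y_demo)

# a single sample, kept two-dimensional with shape (1, 5)
new_row = X_demo[:1]
# one prediction per base model, concatenated into a vector of shape (1, 3)
row_vector = column_stack([m.predict(new_row) for m in base_models])
# the meta-model turns that vector into the final prediction
final_prediction = meta_model.predict(row_vector)
print(row_vector.shape, final_prediction.shape)  # (1, 3) (1,)
```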

# 2. Implementation of Super Learner Ensemble With scikit-learn

Now, we will implement the Super Learner for both regression and classification problems using scikit-learn.

### 2.1 Super Learner for Regression

In [3]:
# importing necessary libraries
from math import sqrt
from numpy import hstack
from numpy import vstack
from numpy import asarray
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor

First, we will use the `make_regression()` function to generate 1000 examples with 100 features. <br>
We then split the data so that 50% is used for training and 50% is used for testing. 

In [4]:
# create the inputs and outputs
X, y = make_regression(n_samples=1000, n_features=100, noise=0.5)
# split
X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)

Train (500, 100) (500,) Test (500, 100) (500,)


Now, let's define different regression models. We will use the following regression models:
- Linear Regression
- ElasticNet
- Decision Tree Regressor
- Random Forest Regressor
- SVR
- Extra Trees Regressor
- Bagging Regressor
- KNeighbors Regressor
- AdaBoost Regressor

These models will be used as base models.

In [5]:
# create a list of base-models
def get_models():
	models = list()
	models.append(LinearRegression())
	models.append(ElasticNet())
	models.append(SVR(gamma='scale'))
	models.append(DecisionTreeRegressor())
	models.append(KNeighborsRegressor())
	models.append(AdaBoostRegressor())
	models.append(BaggingRegressor(n_estimators=10))
	models.append(RandomForestRegressor(n_estimators=10))
	models.append(ExtraTreesRegressor(n_estimators=10))
	return models

Next, we will use k-fold cross-validation to make out-of-fold predictions that will be used as the dataset to train the meta-model or “super learner.”

For this, we first split the data into k folds, 10 in this case. Then, for each fold, we will:
- Fit each base model on the other nine folds.
- Make predictions on the held-out fold.
- Store the predictions from each base model.

The out-of-fold predictions from each base model form one column of the meta-model input. For one fold of the data, we horizontally stack the prediction columns from all nine algorithms. We then vertically stack the blocks from all folds into one long dataset with 500 rows and nine columns.
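
The stacking can be pictured with a toy example. The sketch below uses made-up predictions from three models on two folds of two rows each; the numbers and names are purely illustrative.

```python
# illustrative sketch: horizontal then vertical stacking of fold predictions
from numpy import array, hstack, vstack

# hypothetical predictions from 3 models on two folds of 2 rows each
fold1 = [array([1.0, 2.0]), array([1.1, 2.1]), array([0.9, 1.9])]
fold2 = [array([3.0, 4.0]), array([3.1, 4.1]), array([2.9, 3.9])]

# each model's predictions become one column: shape (2, 3) per fold
fold1_cols = hstack([p.reshape(len(p), 1) for p in fold1])
fold2_cols = hstack([p.reshape(len(p), 1) for p in fold2])
# the folds are then stacked on top of each other: shape (4, 3)
meta_X_demo = vstack([fold1_cols, fold2_cols])
print(meta_X_demo.shape)  # (4, 3)
print(meta_X_demo)
```

Because the rows end up in fold order rather than in their original order, the target values have to be collected in the same fold order so that each row still lines up with its target.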

The function `get_out_of_fold_predictions()` below performs this stacking of the base-model predictions. Its output is the input dataset for the meta-model.


In [6]:
# collect out-of-fold predictions from k-fold cross-validation
def get_out_of_fold_predictions(X, y, models):
	meta_X, meta_y = list(), list()
	# define split of data
	kfold = KFold(n_splits=10, shuffle=True)
	# enumerate splits
	for train_ix, test_ix in kfold.split(X):
		fold_yhats = list()
		# get data
		train_X, test_X = X[train_ix], X[test_ix]
		train_y, test_y = y[train_ix], y[test_ix]
		meta_y.extend(test_y)
		# fit and make predictions with each sub-model
		for model in models:
			model.fit(train_X, train_y)
			yhat = model.predict(test_X)
			# store columns
			fold_yhats.append(yhat.reshape(len(yhat),1))
		# store fold yhats as columns
		meta_X.append(hstack(fold_yhats))
	return vstack(meta_X), asarray(meta_y)

We can then call the function to get the models and the function to prepare the meta-model dataset.

In [7]:
# get models
models = get_models()
# get out of fold predictions
meta_X, meta_y = get_out_of_fold_predictions(X, y, models)
print('Meta ', meta_X.shape, meta_y.shape)

Meta  (500, 9) (500,)


Next, we can fit all of the base-models on the entire training dataset.

In [8]:
# fit all base models on the training dataset
def fit_base_models(X, y, models):
	for model in models:
		model.fit(X, y)

Then, we can fit the meta-model on the prepared dataset.

In [9]:
# fit a meta model
def fit_meta_model(X, y):
	model = LinearRegression()
	model.fit(X, y)
	return model

Next, we can evaluate the base-models on the holdout dataset (the test dataset).

For the evaluation, we will use the Root Mean Squared Error (RMSE) as the metric.
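
As a small worked example of the metric, the sketch below computes the RMSE for four made-up target values and predictions; the numbers are illustrative only.

```python
# illustrative sketch: RMSE as the square root of the mean squared error
from math import sqrt
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
# squared errors are 0.25, 0.25, 0.0 and 1.0, so their mean is 0.375
mse = mean_squared_error(y_true, y_pred)
rmse_value = sqrt(mse)
print('MSE %.3f, RMSE %.3f' % (mse, rmse_value))  # MSE 0.375, RMSE 0.612
```

Taking the square root puts the error back in the units of the target variable, which makes it easier to interpret than the MSE.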

In [10]:
# evaluate a list of models on a dataset
def evaluate_models(X, y, models):
	for model in models:
		yhat = model.predict(X)
		mse = mean_squared_error(y, yhat)
		print('%s: RMSE %.3f' % (model.__class__.__name__, sqrt(mse)))

Finally, we can use the super learner to make predictions on the holdout dataset.

In [11]:
# make predictions with stacked model
def super_learner_predictions(X, models, meta_model):
	meta_X = list()
	for model in models:
		yhat = model.predict(X)
		meta_X.append(yhat.reshape(len(yhat),1))
	meta_X = hstack(meta_X)
	# predict
	return meta_model.predict(meta_X)

### Putting it all together

In [13]:
# example of a super learner model for regression
from math import sqrt
from numpy import hstack
from numpy import vstack
from numpy import asarray
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor

# create a list of base-models
def get_models():
	models = list()
	models.append(LinearRegression())
	models.append(ElasticNet())
	models.append(SVR(gamma='scale'))
	models.append(DecisionTreeRegressor())
	models.append(KNeighborsRegressor())
	models.append(AdaBoostRegressor())
	models.append(BaggingRegressor(n_estimators=10))
	models.append(RandomForestRegressor(n_estimators=10))
	models.append(ExtraTreesRegressor(n_estimators=10))
	return models

# collect out-of-fold predictions from k-fold cross-validation
def get_out_of_fold_predictions(X, y, models):
	meta_X, meta_y = list(), list()
	# define split of data
	kfold = KFold(n_splits=10, shuffle=True)
	# enumerate splits
	for train_ix, test_ix in kfold.split(X):
		fold_yhats = list()
		# get data
		train_X, test_X = X[train_ix], X[test_ix]
		train_y, test_y = y[train_ix], y[test_ix]
		meta_y.extend(test_y)
		# fit and make predictions with each sub-model
		for model in models:
			model.fit(train_X, train_y)
			yhat = model.predict(test_X)
			# store columns
			fold_yhats.append(yhat.reshape(len(yhat),1))
		# store fold yhats as columns
		meta_X.append(hstack(fold_yhats))
	return vstack(meta_X), asarray(meta_y)

# fit all base models on the training dataset
def fit_base_models(X, y, models):
	for model in models:
		model.fit(X, y)

# fit a meta model
def fit_meta_model(X, y):
	model = LinearRegression()
	model.fit(X, y)
	return model

# evaluate a list of models on a dataset
def evaluate_models(X, y, models):
	for model in models:
		yhat = model.predict(X)
		mse = mean_squared_error(y, yhat)
		print('%s: RMSE %.3f' % (model.__class__.__name__, sqrt(mse)))

# make predictions with stacked model
def super_learner_predictions(X, models, meta_model):
	meta_X = list()
	for model in models:
		yhat = model.predict(X)
		meta_X.append(yhat.reshape(len(yhat),1))
	meta_X = hstack(meta_X)
	# predict
	return meta_model.predict(meta_X)

# create the inputs and outputs
X, y = make_regression(n_samples=1000, n_features=100, noise=0.5)
# split
X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)
# get models
models = get_models()
# get out of fold predictions
meta_X, meta_y = get_out_of_fold_predictions(X, y, models)
print('Meta ', meta_X.shape, meta_y.shape)
# fit base models
fit_base_models(X, y, models)
# fit the meta model
meta_model = fit_meta_model(meta_X, meta_y)
# evaluate base models
evaluate_models(X_val, y_val, models)
# evaluate meta model
yhat = super_learner_predictions(X_val, models, meta_model)
print('Super Learner: RMSE %.3f' % (sqrt(mean_squared_error(y_val, yhat))))

Train (500, 100) (500,) Test (500, 100) (500,)
Meta  (500, 9) (500,)
LinearRegression: RMSE 0.560
ElasticNet: RMSE 62.114
SVR: RMSE 177.241
DecisionTreeRegressor: RMSE 144.971
KNeighborsRegressor: RMSE 150.702
AdaBoostRegressor: RMSE 92.956
BaggingRegressor: RMSE 100.110
RandomForestRegressor: RMSE 98.980
ExtraTreesRegressor: RMSE 96.466
Super Learner: RMSE 0.561


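First, the shapes of the training and test datasets are printed, followed by the shape of the dataset for the meta-model.

The base models are then fit on the full training dataset and the meta-model is fit on the out-of-fold predictions. After that, the performance of each base-model is reported on the holdout dataset, followed by the performance of the super learner. In this run, linear regression alone (RMSE 0.560) performs about as well as the super learner (RMSE 0.561), while every other base model is far worse. This is expected here because `make_regression()` generates a target that is a linear function of the inputs, so the result mainly shows that the super learner matches the best base model.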

### 2.2 Super Learner for Classification

For a classification problem, the inputs to the meta-learner can be class labels or class probabilities. Here we use class probabilities, obtained from each base model with `predict_proba()`.
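
The difference between the two kinds of input can be seen in the shape of what a classifier returns. The sketch below uses a hypothetical two-class dataset and a single logistic regression model as assumptions for illustration.

```python
# illustrative sketch: class labels versus class probabilities
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_blobs(n_samples=100, centers=2, n_features=5, random_state=1)
clf = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)
# one class label per row
labels = clf.predict(X_demo)
# one probability per class per row
probs = clf.predict_proba(X_demo)
print(labels.shape, probs.shape)  # (100,) (100, 2)
```

With probabilities, each base model contributes one column per class, so nine base models on a two-class problem give 18 input columns for the meta-model.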

In [14]:
# importing necessary libraries
from numpy import hstack
from numpy import vstack
from numpy import asarray
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier


Here, we will use the `make_blobs()` function from scikit-learn to generate 1000 examples with 100 features and 2 classes. We will split the data so that 50% is used for training and 50% is used for testing.

In [15]:
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# split
X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)

Train (500, 100) (500,) Test (500, 100) (500,)


Now, as with regression, we will define different classification models. We will use the following classification models:
- Logistic Regression
- Decision Tree Classifier
- SVC
- Gaussian Naive Bayes
- KNeighbors Classifier
- AdaBoost Classifier
- Bagging Classifier
- Random Forest Classifier
- Extra Trees Classifier

In [16]:
def get_models():
	models = list()
	models.append(LogisticRegression(solver='liblinear'))
	models.append(DecisionTreeClassifier())
	models.append(SVC(gamma='scale', probability=True))
	models.append(GaussianNB())
	models.append(KNeighborsClassifier())
	models.append(AdaBoostClassifier())
	models.append(BaggingClassifier(n_estimators=10))
	models.append(RandomForestClassifier(n_estimators=10))
	models.append(ExtraTreesClassifier(n_estimators=10))
	return models

Next, we will change the `get_out_of_fold_predictions()` function to collect class probabilities with `predict_proba()` instead of the values returned by `predict()`.

In [17]:
# collect out-of-fold predictions from k-fold cross-validation
def get_out_of_fold_predictions(X, y, models):
	meta_X, meta_y = list(), list()
	# define split of data
	kfold = KFold(n_splits=10, shuffle=True)
	# enumerate splits
	for train_ix, test_ix in kfold.split(X):
		fold_yhats = list()
		# get data
		train_X, test_X = X[train_ix], X[test_ix]
		train_y, test_y = y[train_ix], y[test_ix]
		meta_y.extend(test_y)
		# fit and make predictions with each sub-model
		for model in models:
			model.fit(train_X, train_y)
			yhat = model.predict_proba(test_X)
			# store columns
			fold_yhats.append(yhat)
		# store fold yhats as columns
		meta_X.append(hstack(fold_yhats))
	return vstack(meta_X), asarray(meta_y)

Instead of Linear Regression, we will use Logistic Regression as the meta-model for the classification problem.

In [18]:
# fit a meta model
def fit_meta_model(X, y):
	model = LogisticRegression(solver='liblinear')
	model.fit(X, y)
	return model

### Putting it all together

In [None]:
# example of a super learner model for binary classification
from numpy import hstack
from numpy import vstack
from numpy import asarray
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

# create a list of base-models
def get_models():
	models = list()
	models.append(LogisticRegression(solver='liblinear'))
	models.append(DecisionTreeClassifier())
	models.append(SVC(gamma='scale', probability=True))
	models.append(GaussianNB())
	models.append(KNeighborsClassifier())
	models.append(AdaBoostClassifier())
	models.append(BaggingClassifier(n_estimators=10))
	models.append(RandomForestClassifier(n_estimators=10))
	models.append(ExtraTreesClassifier(n_estimators=10))
	return models

# collect out-of-fold predictions from k-fold cross-validation
def get_out_of_fold_predictions(X, y, models):
	meta_X, meta_y = list(), list()
	# define split of data
	kfold = KFold(n_splits=10, shuffle=True)
	# enumerate splits
	for train_ix, test_ix in kfold.split(X):
		fold_yhats = list()
		# get data
		train_X, test_X = X[train_ix], X[test_ix]
		train_y, test_y = y[train_ix], y[test_ix]
		meta_y.extend(test_y)
		# fit and make predictions with each sub-model
		for model in models:
			model.fit(train_X, train_y)
			yhat = model.predict_proba(test_X)
			# store columns
			fold_yhats.append(yhat)
		# store fold yhats as columns
		meta_X.append(hstack(fold_yhats))
	return vstack(meta_X), asarray(meta_y)

# fit all base models on the training dataset
def fit_base_models(X, y, models):
	for model in models:
		model.fit(X, y)

# fit a meta model
def fit_meta_model(X, y):
	model = LogisticRegression(solver='liblinear')
	model.fit(X, y)
	return model

# evaluate a list of models on a dataset, using accuracy as the metric
def evaluate_models(X, y, models):
	for model in models:
		yhat = model.predict(X)
		acc = accuracy_score(y, yhat)
		print('%s: %.3f' % (model.__class__.__name__, acc*100))

# make predictions with stacked model
def super_learner_predictions(X, models, meta_model):
	meta_X = list()
	for model in models:
		yhat = model.predict_proba(X)
		meta_X.append(yhat)
	meta_X = hstack(meta_X)
	# predict
	return meta_model.predict(meta_X)

# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# split
X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)
# get models
models = get_models()
# get out of fold predictions
meta_X, meta_y = get_out_of_fold_predictions(X, y, models)
print('Meta ', meta_X.shape, meta_y.shape)
# fit base models
fit_base_models(X, y, models)
# fit the meta model
meta_model = fit_meta_model(meta_X, meta_y)
# evaluate base models
evaluate_models(X_val, y_val, models)
# evaluate meta model
yhat = super_learner_predictions(X_val, models, meta_model)
print('Super Learner: %.3f' % (accuracy_score(y_val, yhat) * 100))

Train (500, 100) (500,) Test (500, 100) (500,)
Meta  (500, 18) (500,)
LogisticRegression: 96.600
DecisionTreeClassifier: 71.200
SVC: 98.000
GaussianNB: 98.400
KNeighborsClassifier: 92.800
AdaBoostClassifier: 91.000
BaggingClassifier: 84.400
RandomForestClassifier: 85.400
ExtraTreesClassifier: 83.000
Super Learner: 98.000


As with regression, the shapes of the training and test datasets and of the meta-model dataset are printed first. The meta-model dataset has 18 columns because each of the nine base models contributes two class probabilities. The performance of each base-model on the holdout dataset is then reported, followed by the performance of the super learner.

# 3. Super Learner With ML-Ensemble Library

Everything above implements the Super Learner manually. Alternatively, we can use the `mlens` (ML-Ensemble) library to implement a Super Learner in a few lines of code.

Install the `mlens` library<br>
`pip install mlens`

We can use the `SuperLearner` class from the `mlens` library to implement the Super Learner. After instantiating the class, we use the `.add()` method to add the base learners and the `.add_meta()` method to add the meta-learner.<br>
```
#configure model
ensemble = SuperLearner(...)
#add list of base learners
ensemble.add(...)
#add meta learner
ensemble.add_meta(...)
```

To configure the Super Learner, we can use the following parameters:
- folds: Number of folds to use in k-fold cross-validation.
- scorer: The scoring function to use to evaluate the performance of the base models.
- shuffle: Whether to shuffle the data before splitting it into folds.
- sample_size: The size of the sample to use when fitting the base models.

### 3.1 Super Learner for Regression using mlens

In [1]:
#importing necessary libraries
import numpy as np
from math import sqrt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from mlens.ensemble import SuperLearner


[MLENS] backend: threading


In [2]:
def get_models():
	models = list()
	models.append(LinearRegression())
	models.append(ElasticNet())
	models.append(SVR(gamma='scale'))
	models.append(DecisionTreeRegressor())
	models.append(KNeighborsRegressor())
	models.append(AdaBoostRegressor())
	models.append(BaggingRegressor(n_estimators=10))
	models.append(RandomForestRegressor(n_estimators=10))
	models.append(ExtraTreesRegressor(n_estimators=10))
	return models

# cost function for base models
def rmse(yreal, yhat):
	return sqrt(mean_squared_error(yreal, yhat))


# create the super learner
def get_super_learner(X):
	ensemble = SuperLearner(scorer=rmse, folds=10, shuffle=True, sample_size=len(X))
	# add base models
	models = get_models()
	ensemble.add(models)
	# add the meta model
	ensemble.add_meta(LinearRegression())
	return ensemble
 
# create the inputs and outputs
X, y = make_regression(n_samples=1000, n_features=100, noise=0.5)
# split
X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)
# create the super learner
ensemble = get_super_learner(X)
# fit the super learner
ensemble.fit(X, y)
# summarize base learners
print(ensemble.data)
# evaluate meta model
yhat = ensemble.predict(X_val)
print('Super Learner: RMSE %.3f' % (rmse(y_val, yhat)))



Train (500, 100) (500,) Test (500, 100) (500,)
                                  score-m  score-s  ft-m  ft-s  pt-m  pt-s
layer-1  adaboostregressor         104.40     8.92  1.33  0.05  0.08  0.03
layer-1  baggingregressor          120.64    10.83  0.48  0.03  0.02  0.03
layer-1  decisiontreeregressor     166.98    15.19  0.09  0.02  0.00  0.00
layer-1  elasticnet                 71.18     5.80  0.01  0.01  0.00  0.00
layer-1  extratreesregressor       116.00    12.53  0.17  0.03  0.01  0.00
layer-1  kneighborsregressor       160.36    11.71  0.00  0.00  0.66  0.04
layer-1  linearregression            0.56     0.03  0.02  0.01  0.00  0.00
layer-1  randomforestregressor     120.27    10.36  0.38  0.04  0.00  0.00
layer-1  svr                       176.40    14.03  0.02  0.00  0.01  0.00

Super Learner: RMSE 0.574


First, the shapes of the training and test datasets are printed.

Next, the performance of each base model, as summarized by `ensemble.data`, is displayed.

Finally, the performance of the super learner on the holdout dataset is displayed.

In the table above, the columns score-m, score-s, ft-m, ft-s, pt-m and pt-s are:
- score-m: Mean score of the base model.
- score-s: Standard deviation of the score of the base model.
- ft-m: Mean fit time of the base model.
- ft-s: Standard deviation of the fit time of the base model.
- pt-m: Mean prediction time of the base model.
- pt-s: Standard deviation of the prediction time of the base model.
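
To make the mean and standard deviation columns concrete, the sketch below computes a per-fold RMSE for one model by hand and then summarizes it. It is only an illustration of the idea using scikit-learn and NumPy; it is not how `mlens` computes these columns internally, and the dataset, the model, and the variable names are assumptions.

```python
# illustrative sketch: mean and standard deviation of per-fold scores
from math import sqrt
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X_demo, y_demo = make_regression(n_samples=500, n_features=100, noise=0.5, random_state=1)
fold_scores = []
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X_demo):
    # fit on nine folds and score on the held-out fold
    model = ElasticNet().fit(X_demo[train_ix], y_demo[train_ix])
    yhat = model.predict(X_demo[test_ix])
    fold_scores.append(sqrt(mean_squared_error(y_demo[test_ix], yhat)))
# analogous to the score-m and score-s columns
print('mean %.2f, std %.2f' % (mean(fold_scores), std(fold_scores)))
```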




### 3.2 Super Learner for Classification using mlens

We can also use the `mlens` library to implement a Super Learner for a classification problem. The process is the same as for regression.

In [1]:
# example of a super learner using the mlens library
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from mlens.ensemble import SuperLearner



[MLENS] backend: threading


In [2]:
# create a list of base-models
def get_models():
	models = list()
	models.append(LogisticRegression(solver='liblinear'))
	models.append(DecisionTreeClassifier())
	models.append(SVC(gamma='scale', probability=True))
	models.append(GaussianNB())
	models.append(KNeighborsClassifier())
	models.append(AdaBoostClassifier())
	models.append(BaggingClassifier(n_estimators=10))
	models.append(RandomForestClassifier(n_estimators=10))
	models.append(ExtraTreesClassifier(n_estimators=10))
	return models

# create the super learner
def get_super_learner(X):
	ensemble = SuperLearner(scorer=accuracy_score, folds=10, shuffle=True, sample_size=len(X))
	# add base models
	models = get_models()
	ensemble.add(models)
	# add the meta model
	ensemble.add_meta(LogisticRegression(solver='lbfgs'))
	return ensemble

# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# split
X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)
# create the super learner
ensemble = get_super_learner(X)
# fit the super learner
ensemble.fit(X, y)
# summarize base learners
print(ensemble.data)
# make predictions on hold out set
yhat = ensemble.predict(X_val)
print('Super Learner: %.3f' % (accuracy_score(y_val, yhat) * 100))

Train (500, 100) (500,) Test (500, 100) (500,)
                                   score-m  score-s  ft-m  ft-s  pt-m  pt-s
layer-1  adaboostclassifier           0.90     0.03  1.27  0.06  0.09  0.02
layer-1  baggingclassifier            0.81     0.07  0.42  0.07  0.05  0.02
layer-1  decisiontreeclassifier       0.70     0.05  0.08  0.02  0.00  0.00
layer-1  extratreesclassifier         0.80     0.04  0.11  0.02  0.02  0.01
layer-1  gaussiannb                   0.97     0.03  0.03  0.01  0.01  0.00
layer-1  kneighborsclassifier         0.94     0.03  0.00  0.00  0.89  0.02
layer-1  logisticregression           0.96     0.02  0.01  0.00  0.00  0.00
layer-1  randomforestclassifier       0.82     0.04  0.06  0.01  0.00  0.00
layer-1  svc                          0.98     0.03  0.10  0.02  0.00  0.00

Super Learner: 95.600


