# Ensemble Machine Learning Algorithms in Python with scikit-learn

Ensembles combine the predictions of several models and can often improve accuracy over a single model on your dataset.

## Combine Model Predictions Into Ensemble Predictions
The three most popular methods for combining the predictions from different models are:

1. Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.

2. Boosting. Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain.

3. Voting. Building multiple models (typically of differing types) and using simple statistics (like calculating the mean) to combine their predictions.

This post will not explain each of these methods.

It assumes you are generally familiar with machine learning algorithms and ensemble methods and that you are looking for information on how to create ensembles in Python.

In [1]:
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

url = "https://gist.githubusercontent.com/ktisha/c21e73a1bd1700294ef790c56c8aec1f/raw/819b69b5736821ccee93d05b51de0510bea00294/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)

array = dataframe.values
# Keep rows from index 9 onward; columns 0-7 are the predictors, column 8 is the class label
X = array[9:, 0:8]
Y = array[9:, 8]
Y = Y.astype('int')

seed = 7
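# random_state only has an effect when shuffle=True, and recent scikit-learn
# releases raise an error if it is set without shuffling, so it is omitted here.
kfold = model_selection.KFold(n_splits=10)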

print("Predictors:")
print(X[:5])
print("Targets:")
print(Y[:5])

Predictors:
[['6' 148.0 72.0 35.0 0.0 33.6 0.627 50.0]
 ['1' 85.0 66.0 29.0 0.0 26.6 0.35100000000000003 31.0]
 ['8' 183.0 64.0 0.0 0.0 23.3 0.672 32.0]
 ['1' 89.0 66.0 23.0 94.0 28.1 0.16699999999999998 21.0]
 ['0' 137.0 40.0 35.0 168.0 43.1 2.2880000000000003 33.0]]
Targets:
[1 0 1 0 1]


## Bagging Algorithms
### 1. Bagged Decision Trees
Bagging performs best with algorithms that have high variance. A popular example is decision trees, often constructed without pruning.
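
The resampling step behind bagging can be sketched with the standard library alone. This is a minimal illustration, not how scikit-learn implements it: `bootstrap_sample` is a hypothetical helper, and the ten integers stand in for training rows.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows, with replacement."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(7)   # fixed seed so the run is repeatable
rows = list(range(10))   # stand-in indices for 10 training rows

# One bootstrap sample per model in the ensemble (3 models here).
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    # Rows missing from a sample are "out of bag" for the model trained on it.
    out_of_bag = sorted(set(rows) - set(sample))
    print(sorted(sample), "out of bag:", out_of_bag)
```

Each sample has the same size as the original data but usually repeats some rows and omits others, which is what makes the individual trees differ from one another.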



In [2]:
# Bagged Decision Trees for Classification


from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

cart = DecisionTreeClassifier()
num_trees = 100

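# The base model is passed positionally because its keyword was renamed
# from base_estimator to estimator in newer scikit-learn releases.
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)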
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.770745044429255


### 2. Random Forest
Random forest is an extension of bagged decision trees.

Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers. Specifically, rather than greedily choosing the best split point across all features when constructing each tree, only a random subset of features is considered for each split.
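
The idea of restricting each split to a random subset of features can be sketched as follows. This is a simplified illustration under stated assumptions: `gini` and `best_split` are hypothetical helpers, the four-row dataset is made up, and only a single split is shown, whereas a real random forest draws a fresh subset at every split of every tree.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y, feature_indices):
    """Return (score, feature, threshold) with the lowest weighted Gini,
    searching only the features listed in feature_indices."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Toy data: features 0 and 1 separate the classes, feature 2 is noise.
X_toy = [[1, 5, 0], [2, 4, 1], [3, 3, 0], [4, 2, 1]]
y_toy = [0, 0, 1, 1]

rng = random.Random(7)
candidates = rng.sample(range(3), 2)  # consider 2 of the 3 features
score, feature, threshold = best_split(X_toy, y_toy, candidates)
print("candidates:", candidates, "chosen feature:", feature, "score:", score)
```

Because different trees see different candidate features, they tend to make different mistakes, which is what lowers the correlation between them.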

You can construct a Random Forest model for classification using the RandomForestClassifier class.

The example below demonstrates Random Forest for classification with 100 trees and split points chosen from a random selection of 3 features.

In [3]:
# Random Forest classification

from sklearn.ensemble import RandomForestClassifier

num_trees = 100
max_features = 3

model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7655844155844156


### 3. Extra Trees
Extra Trees are another modification of bagging where random trees are constructed from samples of the training dataset.

You can construct an Extra Trees model for classification using the ExtraTreesClassifier class.

The example below provides a demonstration of extra trees with the number of trees set to 100 and splits chosen from 7 random features.

In [4]:
# Extra Trees Classification

from sklearn.ensemble import ExtraTreesClassifier

num_trees = 100
max_features = 7

model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7629186602870812


## Boosting Algorithms
Boosting ensemble algorithms create a sequence of models that attempt to correct the mistakes of the models before them in the sequence.

Once created, the models make predictions, which may be weighted by their demonstrated accuracy, and the results are combined to create a final output prediction.

The two most common boosting ensemble machine learning algorithms are:

1. AdaBoost
2. Stochastic Gradient Boosting

### 1. AdaBoost
AdaBoost was perhaps the first successful boosting ensemble algorithm. It generally works by weighting instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay more or less attention to them in the construction of subsequent models.
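
The instance weighting can be illustrated with one round of the classic binary AdaBoost update. This is a sketch of the textbook formulation, and scikit-learn's `AdaBoostClassifier` may differ in its details; `adaboost_round` is a hypothetical helper and the five instances are made up.

```python
import math

def adaboost_round(weights, correct):
    """One weight update of classic binary AdaBoost.

    weights: current instance weights, summing to 1
    correct: whether the current model classified each instance correctly
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - err) / err)  # the model's say in the final vote
    # Shrink weights of correctly classified instances, grow the rest.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5                         # start with uniform weights
correct = [True, True, True, True, False]   # the last instance was misclassified
alpha, new_weights = adaboost_round(weights, correct)
print(alpha, new_weights)
```

After the update the single misclassified instance carries half of the total weight, so the next model in the sequence is pushed to get it right.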

In [5]:
# AdaBoost Classification
from sklearn.ensemble import AdaBoostClassifier

num_trees = 30

model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.760457963089542


### 2. Stochastic Gradient Boosting Classification
Stochastic Gradient Boosting is one of the most sophisticated ensemble techniques. It is also proving to be perhaps one of the best techniques available for improving performance via ensembles.

In [6]:
# Stochastic Gradient Boosting Classification
from sklearn.ensemble import GradientBoostingClassifier

# num_trees (30) is carried over from the AdaBoost cell above
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7733253588516746
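
The cell above leaves `subsample` at its default of 1.0, so every tree is fit on all training rows. In scikit-learn, setting `subsample` below 1.0 fits each tree on a random fraction of the rows, which is the stochastic variant. The sketch below assumes a synthetic dataset from `make_classification` rather than the diabetes data, and the value 0.8 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with 8 features (an assumption, not the diabetes data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

# subsample=0.8: each tree is fit on a random 80% of the training rows.
model = GradientBoostingClassifier(n_estimators=30, subsample=0.8, random_state=7)
scores = cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=5))
print(scores.mean())
```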


## Voting Ensemble
Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms.

It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and combine the predictions of the sub-models (by majority vote by default, or by averaging predicted probabilities) when asked to make predictions for new data.
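
The two common ways of combining sub-model predictions can be sketched in plain Python. `hard_vote` and `soft_vote` are hypothetical helpers, and the probabilities are made up to show a case where the two rules disagree.

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote over the predicted class labels."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average the per-class probabilities and pick the largest."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c])

# [P(class 0), P(class 1)] from three hypothetical sub-models for one row.
probs = [[0.45, 0.55], [0.40, 0.60], [0.90, 0.10]]
labels = [max(range(2), key=lambda c: p[c]) for p in probs]  # each model's label

print("hard vote:", hard_vote(labels))  # two of three models predict class 1
print("soft vote:", soft_vote(probs))   # the confident third model tips the mean
```

In scikit-learn's `VotingClassifier`, these correspond to `voting='hard'` (the default) and `voting='soft'`; soft voting needs every sub-model to provide class probabilities.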

The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how to best weight the predictions from sub-models. This is called stacking (stacked generalization) and was not provided in scikit-learn at the time of writing.
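
scikit-learn 0.22 and later include a `StackingClassifier`, so stacking can be tried there. The following is a minimal sketch assuming such a version is installed; it uses a synthetic dataset from `make_classification` instead of the diabetes data, and the choice of sub-models and of logistic regression as the combiner is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features (an assumption, not the diabetes data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

# Sub-models whose outputs become the inputs of the final estimator.
sub_models = [
    ('cart', DecisionTreeClassifier(random_state=7)),
    ('svm', SVC()),
]
# The final estimator learns how to weight the sub-model outputs.
stack = StackingClassifier(estimators=sub_models,
                           final_estimator=LogisticRegression())
stack_scores = cross_val_score(stack, X_demo, y_demo, cv=KFold(n_splits=5))
print(stack_scores.mean())
```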

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC  # C-Support Vector Classification.
from sklearn.ensemble import VotingClassifier

# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())

0.7304169514695831


## Summary
In this post you discovered ensemble machine learning algorithms for improving the performance of models on your problems.

You learned about:

1. Bagging Ensembles including Bagged Decision Trees, Random Forest and Extra Trees.
2. Boosting Ensembles including AdaBoost and Stochastic Gradient Boosting.
3. Voting Ensembles for combining the predictions of arbitrary models.