# Combining different machine learning algorithms into a model ensemble

**Model ensembling** is a class of techniques for aggregating multiple different predictive algorithms into a sort of mega-algorithm, which can often increase the accuracy and reduce the overfitting of your model. Ensembling approaches often work surprisingly well. Many winners of data science competitions use model ensembling in [one](http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf) [form](https://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/5283/winning-solution-code-and-methodology) [or](http://arxiv.org/pdf/0911.0460.pdf) another. In a previous tutorial, we discussed how optimizing hyperparameters allows you to get the best performance from individual machine learning algorithms. Here, we will take you through the steps of building your own ensemble for a classification problem, consisting of an individually optimized:

1. Random forest
2. Support vector machine and
3. Neural network


![Algorithms we'll use in this tutorial.](https://github.com/nslatysheva/data_science_blogging/blob/master/model_optimization/hyperparam_algos.pdf)

These algorithms have quite different structures, which suggests they might capture different aspects of the dataset and could work well in an ensemble. We’ll be working on the famous `spam` dataset, trying to predict whether a certain email is spam or not, using the standard Python machine learning stack (`scikit-learn`/`numpy`/`pandas`).

## The motivation behind ensembling
The general idea behind ensembling is this: different classes of algorithms (or differently parameterized versions of the same type of algorithm) might be good at picking up on different signals in the dataset. Combining them means that you can model the data better, leading to better predictions. Furthermore, different algorithms might be overfitting to the data in various ways, but by combining them, you can effectively average away some of this overfitting. Finally, if you're trying to improve your model to chase accuracy points, ensembling is often a more computationally efficient way to do this than tuning a single model by searching for ever more optimal hyperparameters.

There are also fundamental reasons for why ensembling together different algorithms often improves accuracy. 

It is best to ensemble together models whose predictions are less correlated, because then they can capture different aspects of the dataset and are less likely to make the same mistakes (see an excellent explanation of ensembling [here](http://mlwave.com/kaggle-ensembling-guide/)).
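
To get a feel for why combining models can help, here is a small worked calculation. It is only a sketch under an idealised assumption that is not part of this tutorial's data: every model is correct with the same probability and the models make their mistakes independently of each other. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """Probability that a majority of n_models is correct.

    Assumes every model is right with the same probability p_correct and
    that the models' errors are statistically independent (an idealisation).
    """
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct)**(n_models - k)
        for k in range(majority, n_models + 1)
    )

# Accuracy of the majority vote for a growing number of 70%-accurate models
for n in (1, 3, 5, 11):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
```

Under these assumptions, three models that are each 70% accurate give a majority vote that is right about 78.4% of the time. Real models are never fully independent, so the gain in practice is smaller, which is exactly why less correlated models are the better ensemble candidates.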

## Examples of ensemble learning
You have probably already encountered several uses of model ensembling. **Random forests** are a type of ensemble algorithm that aggregates many individual classification tree **base learners**, and they are a good system for intuitively understanding what ensembling is. So a random forest is already an ensemble, but it will be just one model in the larger ensemble we build here. 'Ensembling' is a broad term and a recurrent concept throughout machine learning: the individual parts may each go wrong in places, but combining them corrects many of these errors, so the overall prediction is more likely to be right.

If you’re interested in **deep learning**, one common technique for improving classification accuracy is training different networks and getting them to vote on classifications for test instances (look at **dropout** for a related but wacky take on ensembling subnetworks). Combining different models is a recurring trend in machine learning, and it shows up in many different incarnations. If you’re familiar with **bagging** or **boosting** algorithms, these are very explicit examples of ensembling.
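
As a quick illustration of bagging and boosting, the sketch below compares a single decision tree with a bagged and a boosted ensemble of trees. It uses a synthetic dataset from `make_classification` as a stand-in (an assumption; it is not the spam dataset), and the ensemble sizes are arbitrary, so treat the numbers it prints as illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the spam dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

models = {
    # One unpruned tree, for reference
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: many trees, each fit on a bootstrap sample of the training data
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Boosting: shallow trees fit in sequence, each focusing on earlier mistakes
    "boosted trees": AdaBoostClassifier(n_estimators=50, random_state=0),
}

demo_scores = {}
for name, model in models.items():
    demo_scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(name, round(demo_scores[name], 3))
```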

## In this post

We will be working on ensembling different algorithms, using both majority voting and stacking, in order to get improved classification accuracy on the spam dataset. We won’t do fancy visualizations of the dataset, but check out a previous tutorial or our bootcamp to learn Plotly and matplotlib if you're interested. Here, we focus on combining different algorithms to boost performance.

Let's get started!

## 1. Loading up the data

First, we load the dataset. We usually want the input data as a matrix (X) and the instance labels as a separate vector (y).


In [1]:
import pandas as pd
import numpy as np
import wget

# Import the dataset
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data'
# NOTE: the URL for the column names is missing from this draft, so the
# column-name download is left commented out.
# columns_url =
# columns = wget.download(columns_url)
dataset = wget.download(data_url)
dataset = pd.read_csv(dataset, sep=",")

# Take a peek at the data
dataset.head()

Unnamed: 0,email_id,is_spam,word_freq_will,word_freq_original,word_freq_415,word_freq_mail,char_freq_#,char_freq_$,word_freq_internet,word_freq_edu,...,word_freq_receive,word_freq_000,capital_run_length_average,word_freq_address,word_freq_george,word_freq_cs,word_freq_random,word_freq_conference,word_freq_technology,char_freq_(
0,3628,no,0.0,0,0,0.0,0,0,0.0,0,...,0.0,0,2.0,0.0,0.0,0,0,0,0,0.0
1,63,no,0.0,0,0,0.49,0,0,0.0,0,...,0.0,0,2.824,0.0,0.99,0,0,0,0,0.062
2,1540,no,1.31,0,0,0.0,0,0,0.0,0,...,0.0,0,2.176,0.0,0.0,0,0,0,0,0.431
3,4460,yes,0.75,0,0,0.5,0,0,0.5,0,...,0.25,0,1.023,0.75,0.0,0,0,0,0,0.18
4,2771,no,0.0,0,0,0.0,0,0,0.0,0,...,0.0,0,1.5,0.0,1.56,0,0,0,0,0.18


## 2. Cleaning up and summarizing the data
Lookin' good! Let's convert the data into a nicer format. We rearrange some columns, drop the email ID, and check out what the columns are.

In [2]:
# Reorder the data columns and drop email_id
cols = dataset.columns.tolist()
cols = cols[2:] + [cols[1]]
dataset = dataset[cols]

# Examine shape of dataset and some column names
print(dataset.shape)
print(dataset.columns.values[0:10])

# Summarise feature values
dataset.describe()

(1000, 62)
['word_freq_will' 'word_freq_original' 'word_freq_415' 'word_freq_mail'
 'char_freq_#' 'char_freq_$' 'word_freq_internet' 'word_freq_edu'
 'word_freq_hp' 'word_freq_lab']


Unnamed: 0,word_freq_will,word_freq_original,word_freq_415,word_freq_mail,char_freq_#,char_freq_$,word_freq_internet,word_freq_edu,word_freq_hp,word_freq_lab,...,word_freq_receive,word_freq_000,capital_run_length_average,word_freq_address,word_freq_george,word_freq_cs,word_freq_random,word_freq_conference,word_freq_technology,char_freq_(
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000,1000,1000.0,1000.0,1000.0
mean,0.53795,0.03837,0.05469,0.18984,0.022792,0.066014,0.07321,0.181,0.61197,0.11861,...,0.05104,0.0813,4.85761,0.14998,0.77574,0,0,0.03669,0.12558,0.144783
std,0.831747,0.173041,0.365678,0.496022,0.109007,0.248239,0.270431,0.86285,1.734907,0.746169,...,0.192314,0.358906,30.226395,0.955315,3.509211,0,0,0.268434,0.449092,0.232423
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0,0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.541,0.0,0.0,0,0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.2195,0.0,0.0,0,0,0.0,0.0,0.072
75%,0.82,0.0,0.0,0.0,0.0,0.016,0.0,0.0,0.315,0.0,...,0.0,0.0,3.3965,0.0,0.0,0,0,0.0,0.0,0.195
max,6.25,2.22,4.76,5.26,1.41,4.017,3.57,10.0,20.83,14.28,...,2.0,5.45,667.0,14.28,33.33,0,0,5.0,4.76,2.941


In [3]:
# Convert dataframe to numpy array and split
# data into input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-1].astype(float)
y = npArray[:,-1]

## 3. Splitting data into training and testing sets

Our data is now nice and squeaky clean! This definitely always happens in real life.

Next up, let's scale the data and split it into a training and test set. 

In [4]:
from sklearn import preprocessing
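# train_test_split lives in sklearn.model_selection in current scikit-learn
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split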

# Scale and split dataset
X_scaled = preprocessing.scale(X)

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X_scaled, y, random_state=1)
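
One variation worth knowing about: the cell above scales the whole dataset before splitting it, which means the test rows contribute to the scaling statistics. A common alternative is to split first and learn the scaling from the training rows only. The sketch below shows that ordering on a small random stand-in matrix; `X_demo`, `y_demo` and the other names are made up for this illustration, and this is not the approach used in the rest of this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the feature matrix X and the label vector y
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_demo = rng.choice(["no", "yes"], size=200)

# Split first, then learn the scaling parameters from the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std

# Training features now have mean ~0 and standard deviation ~1
print(X_tr_scaled.mean(axis=0).round(3))
print(X_tr_scaled.std(axis=0).round(3))
```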

## 4. Running algorithms on the data

Now it's time to train some algorithms. We are doing binary classification, so we could also have used logistic regression, kNN, and so on.

### 4.1 Random forests

Let’s build a random forest. A great explanation of random forests can be found here. Briefly, random forests build a collection of classification trees, which each try to predict classes by recursively splitting the data on the features that separate the classes best. Each tree is trained on a bootstrapped sample of the data, and each split is only allowed to choose from a random subset of the features. So, an element of randomness is introduced, a variety of different trees are built, and the 'random forest' ensembles together these base learners.
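
To make the recipe above concrete, here is a hand-rolled sketch of the same idea: bootstrap the rows, restrict each split to a random subset of features, and let the trees vote. It runs on synthetic stand-in data (an assumption, not the spam dataset) and is only meant to show the mechanism. scikit-learn's `RandomForestClassifier` differs in its details, for example it averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 0/1 labels (an assumption; not the spam dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):  # odd number of trees, so a 0/1 vote can never tie
    # Bootstrap: sample training rows with replacement
    rows = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_tr[rows], y_tr[rows]))

# Each tree votes; the forest predicts the majority class
votes = np.array([tree.predict(X_te) for tree in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
forest_acc = (forest_pred == y_te).mean()
print("hand-rolled forest accuracy:", round(forest_acc, 3))
```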


In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Build a random forest using previously-optimized hyperparameter values.
# best_n_estim, best_max_features and best_max_depth are assumed to be
# defined already (e.g. by the hyperparameter search in the previous tutorial).
clfRDF = RandomForestClassifier(n_estimators=best_n_estim,
                                max_features=best_max_features,
                                max_depth=best_max_depth)
clfRDF.fit(XTrain, yTrain)
RF_predictions = clfRDF.predict(XTest)

print(metrics.classification_report(yTest, RF_predictions))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions), 2))

             precision    recall  f1-score   support

         no       0.94      0.96      0.95       170
        yes       0.91      0.86      0.88        80

avg / total       0.93      0.93      0.93       250

Overall Accuracy: 0.93


93-95% accuracy, not too shabby! Have a look and see how random forests with suboptimal hyperparameters fare. We got around 91-92% accuracy on the out of the box (untuned) random forests, which actually isn't terrible. 

### 4.2 Second algorithm: support vector machines

Let's train our second algorithm, support vector machines (SVMs), to do the exact same prediction task. A great introduction to the theory behind SVMs can be read [here](https://www.quantstart.com/articles/Support-Vector-Machines-A-Guide-for-Beginners). Briefly, SVMs search for hyperplanes in the feature space which best divide the different classes in your dataset. Crucially, SVMs can find non-linear decision boundaries between classes using a process called kernelling, which projects the data into a higher-dimensional space. This sounds a bit abstract, but if you've ever fit a linear regression to power-transformed variables (e.g. maybe you used x^2, x^3 as features), you're already familiar with the concept.

SVMs can use different types of kernels, like polynomial or Gaussian (radial basis function, RBF) ones, to throw the data into a different space. The main hyperparameters we must tune for an RBF-kernel SVM are gamma (a kernel parameter that controls how far the influence of a single training point reaches; larger values give a more wiggly decision boundary) and C (the penalty for misclassified training points, which controls the [bias-variance tradeoff](http://scott.fortmann-roe.com/docs/BiasVariance.html) of the model).
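
To see what a kernel buys you, here is a small sketch on a toy dataset where no straight line can separate the classes: two concentric rings from `make_circles`. The dataset and the `gamma` and `C` values are assumptions chosen for illustration, not tuned values and not related to the spam data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: the classes are not linearly separable
X_demo, y_demo = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Same algorithm, two different kernels
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_tr, y_tr).score(X_te, y_te)

print("linear kernel accuracy:", round(linear_acc, 3))
print("RBF kernel accuracy:   ", round(rbf_acc, 3))
```

The linear kernel should do little better than guessing on this data, while the RBF kernel can wrap its decision boundary around the inner ring.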

In [11]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Search for good hyperparameter values
# Specify values to grid search over
g_range = 2. ** np.arange(-15, 5, step=2)
C_range = 2. ** np.arange(-5, 15, step=2)

hyperparameters = [{'gamma': g_range, 
                    'C': C_range}] 

# Grid search using cross-validation
grid = GridSearchCV(SVC(), param_grid=hyperparameters, cv= 10)  
grid.fit(XTrain, yTrain)

bestG = grid.best_params_['gamma']
bestC = grid.best_params_['C']

# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', C=bestC, gamma=bestG)
rbfSVM.fit(XTrain, yTrain)
SVM_predictions = rbfSVM.predict(XTest)

print(metrics.classification_report(yTest, SVM_predictions))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, SVM_predictions), 2))

             precision    recall  f1-score   support

         no       0.94      0.95      0.94       170
        yes       0.88      0.86      0.87        80

avg / total       0.92      0.92      0.92       250

Overall Accuracy: 0.92


Looks good! This is similar performance to what we saw in the random forests.

### 4.3 Third algorithm: neural network

Finally, let's jump on the hype wagon and throw neural networks at our problem.

Neural networks (NNs) represent a different way of thinking about machine learning algorithms. A great place to start learning about neural networks and deep learning is [this resource](http://neuralnetworksanddeeplearning.com/about.html). Briefly, NNs are composed of multiple layers of artificial neurons, which individually are simple processing units that weigh up input data. Together, layers of neurons can compute some very complex functions of the data, which in turn can make excellent predictions. You may be aware of some of the crazy results that NN research has recently achieved.

Here, we train a shallow, fully-connected, feedforward neural network on the spam dataset. Other types of neural network implementations in scikit-learn are available here. The hyperparameters we optimize here are the overall architecture (number of neurons in each layer and the number of layers) and the learning rate (which controls how quickly the parameters in our network change during the training phase; see [gradient descent](http://neuralnetworksanddeeplearning.com/chap1.html#learning_with_gradient_descent) and [backpropagation](http://neuralnetworksanddeeplearning.com/chap2.html)).
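
If "layers of neurons weighing up input data" sounds abstract, the sketch below writes out the forward pass of a tiny network with one hidden layer in plain numpy. The layer sizes are arbitrary and the weights are random and untrained (both assumptions for illustration), so the scores it prints mean nothing; the point is only to show how inputs flow through the layers. Training would adjust `W1`, `b1`, `W2` and `b2` by gradient descent.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass of a fully-connected network with one hidden layer."""
    hidden = sigmoid(X @ W1 + b1)     # each hidden neuron weighs up the inputs
    return sigmoid(hidden @ W2 + b2)  # output neuron: a score between 0 and 1

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 3           # illustrative sizes only

# Random, untrained parameters
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)

X_demo = rng.normal(size=(5, n_features))  # five made-up input rows
scores = forward(X_demo, W1, b1, W2, b2)
print(scores.ravel())
```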

In [12]:
# NOTE: `multilayer_perceptron` appears to be a local module (it is not
# imported from scikit-learn). Recent scikit-learn versions ship a similar
# estimator, sklearn.neural_network.MLPClassifier, which accepts the same
# hidden_layer_sizes and learning_rate_init parameters.
from multilayer_perceptron import multilayer_perceptron
from sklearn.model_selection import GridSearchCV

# Search for good hyperparameter values
# Specify values to grid search over
layer_size_range = [(3,2),(10,10),(2,2,2),10,5] # different network shapes
learning_rate_range = np.linspace(.1,1,3)
hyperparameters = [{'hidden_layer_sizes': layer_size_range, 'learning_rate_init': learning_rate_range}]

# Grid search using cross-validation
grid = GridSearchCV(multilayer_perceptron.MultilayerPerceptronClassifier(), param_grid=hyperparameters, cv=10)
grid.fit(XTrain, yTrain)

# Output best hyperparameter values
best_size = grid.best_params_['hidden_layer_sizes']
best_lr   = grid.best_params_['learning_rate_init']

# Train neural network and output predictions
nnet = multilayer_perceptron.MultilayerPerceptronClassifier(hidden_layer_sizes=best_size, learning_rate_init=best_lr)
nnet.fit(XTrain, yTrain)
NN_predictions = nnet.predict(XTest)

print(metrics.classification_report(yTest, NN_predictions))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, NN_predictions), 2))

             precision    recall  f1-score   support

         no       0.95      0.92      0.93       170
        yes       0.84      0.89      0.86        80

avg / total       0.91      0.91      0.91       250

Overall Accuracy: 0.91


Looks like this neural network (given this dataset, architecture, and hyperparameterisation) is doing slightly worse on the spam dataset. That's okay, it could still be picking up on a signal that the random forest and SVM weren't. 

Machine learning algorithms... ensemble!

### 4.4 Majority vote on classifications

In [12]:
# Here's a rough solution
import collections

# Stick all predictions into a dataframe, encoding 'yes' as 1 and 'no' as 0
predictions = pd.DataFrame(np.array([RF_predictions, SVM_predictions, NN_predictions])).T
predictions.columns = ['RF', 'SVM', 'NN']
predictions = pd.DataFrame(np.where(predictions == 'yes', 1, 0),
                           columns=predictions.columns,
                           index=predictions.index)

# Initialise empty array for holding the ensembled predictions
ensembled_predictions = np.zeros(shape=yTest.shape)

# Majority vote and output final predictions
for test_point in range(predictions.shape[0]):
    # Count the votes of the three models for this test point
    counts = collections.Counter(predictions.iloc[test_point, :])
    majority_vote = counts.most_common(1)[0][0]

    # Store the winning class
    ensembled_predictions[test_point] = int(majority_vote)


In [178]:
# Get final accuracy of ensembled model
# Encode the true labels the same way as the predictions ('yes' -> 1, 'no' -> 0)
yTest_binary = np.where(yTest == "yes", 1, 0)
ensembled_predictions = ensembled_predictions.astype(int)

print(metrics.classification_report(yTest_binary, ensembled_predictions))
print("Ensemble Accuracy:", round(metrics.accuracy_score(yTest_binary, ensembled_predictions), 2))

             precision    recall  f1-score   support

          0       0.95      0.96      0.95       170
          1       0.91      0.89      0.90        80

avg / total       0.94      0.94      0.94       250

Ensemble Accuracy: 0.94
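
The row-by-row loop used for the majority vote can also be written as a single vectorised operation. The sketch below shows the idea on a tiny made-up table of 0/1 predictions (`demo_predictions` is hypothetical data, not the models' real output): with an odd number of binary voters, the majority class is 1 exactly when more than half of the models vote 1.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 predictions from three models for five test points
demo_predictions = pd.DataFrame({
    "RF":  [1, 0, 1, 0, 1],
    "SVM": [1, 0, 0, 0, 1],
    "NN":  [0, 0, 1, 1, 1],
})

# Sum the votes per row and compare against half the number of models
n_models = demo_predictions.shape[1]
majority = (demo_predictions.sum(axis=1) > n_models / 2).astype(int)
print(majority.tolist())  # [1, 0, 1, 0, 1]
```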


## 5. Conclusion

There are plenty of ways to do model ensembling. The simplest is the majority voting we used here. We can also do weighted majority voting, where models with higher accuracy get more of a vote. If your output is numerical, you could average the predictions instead. These relatively simple techniques do a great job, but there is more! Stacking (also called blending) is when the predictions from different algorithms are used as input into another algorithm (often good old linear or logistic regression), which then outputs your final predictions. For example, you might train a linear model on the predictions of the random forest, SVM, and neural network.
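
For a rough idea of what weighted voting and stacking look like in code, here is a sketch using scikit-learn's `VotingClassifier` and `StackingClassifier`. Everything specific in it is an assumption for illustration: the data is synthetic rather than the spam dataset, the base models are untuned, and the vote weights are arbitrary rather than derived from measured accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the spam dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Untuned base models, one of each type used in this post
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(random_state=0)),
    ("nn", MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)),
]

# Weighted majority vote: the weights are arbitrary illustrative values
weighted_vote = VotingClassifier(base_models, voting="hard", weights=[2, 1, 1])

# Stacking: a logistic regression learns how to combine the base predictions,
# using cross-validated predictions of the base models as its input features
stack = StackingClassifier(base_models, final_estimator=LogisticRegression(), cv=5)

ensemble_scores = {}
for name, model in [("weighted vote", weighted_vote), ("stacking", stack)]:
    ensemble_scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(name, round(ensemble_scores[name], 3))
```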



It is best to ensemble together models which are less correlated (see an excellent explanation of ensembling [here](http://mlwave.com/kaggle-ensembling-guide/)).


What happens when your dataset isn’t as nice as this? What if there are many more instances of one class versus the other, or if you have a lot of missing values, or a mixture of categorical and numerical variables? Stay tuned for the next blog post where we write up guidance on tackling these types of sticky situations.



## Notes

+ Should we use something cooler like gradient boosting?




Another nice tutorial on doing ensembling in Python is [here](http://sebastianraschka.com/Articles/2014_ensemble_classifier.html).