# Ensemble methods: Tree Bagging; Random Forests; AdaBoost

We mentioned a few ensemble methods in passing when we covered decision trees. Let's take some time to use them! We'll go over the sklearn implementations of random forests and AdaBoost, then implement tree bagging and random forests ourselves. In the 'do it yourself' part, I'll give you a single iteration; it is your job to put it together ;)

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
# load_boston was removed in scikit-learn 1.2; uncomment only on older versions.
# from sklearn.datasets import load_boston
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.datasets import make_gaussian_quantiles

# For producing decision tree diagrams.
from IPython.display import Image, display
from io import StringIO  # replaces sklearn.externals.six.StringIO, which newer scikit-learn no longer ships


Today we'll use simulated data in which the two classes form concentric spheres. For plots and examples, see:

http://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_twoclass.html

In [2]:
X, Y = make_gaussian_quantiles(cov=2.,
                                 n_samples=4000, n_features=10,
                                 n_classes=2, random_state=1)

np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

train_data, train_labels = X[:2000], Y[:2000]
test_data, test_labels = X[2000:], Y[2000:]
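
To see what "concentric spheres" means in practice, here is a minimal sketch on a smaller, separately generated sample (the names `X_demo`, `y_demo`, and `sq_dist` are illustrative, not part of the notebook). It assumes the default mean at the origin and checks that every class-0 point lies closer to the origin than every class-1 point:

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

# Smaller demo sample; same generator settings as above apart from n_samples.
X_demo, y_demo = make_gaussian_quantiles(cov=2.,
                                         n_samples=1000, n_features=10,
                                         n_classes=2, random_state=1)

# Squared distance of each point from the origin (the default mean).
sq_dist = np.sum(X_demo ** 2, axis=1)

# Class 0 should be the inner ball, class 1 the outer shell.
inner_max = sq_dist[y_demo == 0].max()
outer_min = sq_dist[y_demo == 1].min()
print('Largest squared distance in class 0: ', inner_max)
print('Smallest squared distance in class 1:', outer_min)
print('Class counts:', np.bincount(y_demo))
```

A single axis-aligned split cannot separate a ball from its surrounding shell, which is part of why one tree struggles on this data.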

## Ensemble Methods!

Let's explore what sklearn offers in terms of ensemble methods. There are two interesting ones we can use right now: AdaBoost and random forests. We'll start with the sklearn versions, then try implementing random forests ourselves!

Be sure to reference the documentation at:  
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Let's start with just executing some sklearn functions:

In [3]:
dt = DecisionTreeClassifier(criterion="entropy", splitter="best", random_state=0)
dt.fit(train_data, train_labels)

print('Accuracy (a decision tree):', dt.score(test_data, test_labels))

rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(train_data, train_labels)

print('Accuracy (a random forest):', rfc.score(test_data, test_labels))

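# The weak learner is passed positionally: the keyword was `base_estimator` in
# older scikit-learn releases and is `estimator` from version 1.2 onwards.
abc = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, learning_rate=0.1)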

abc.fit(train_data, train_labels)
print('Accuracy (adaboost with decision trees):', abc.score(test_data, test_labels))

Accuracy (a decision tree): 0.759
Accuracy (a random forest): 0.87
Accuracy (adaboost with decision trees): 0.824


It looks like ensemble methods do well: both beat a single tree. Before moving on, try playing around with some of the parameters, such as:

n_estimators in RandomForestClassifier


n_estimators and learning_rate in AdaBoostClassifier

Why do the methods behave as they do when you tweak the parameters?
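
One way to explore this is a small parameter sweep. The following is only a sketch: it regenerates a smaller dataset so that it runs quickly on its own, and the variable names (`X_s`, `rf_scores`, `ada_scores`) and the grid values are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Smaller copy of the simulated problem so the sweep runs in a few seconds.
X_s, y_s = make_gaussian_quantiles(cov=2., n_samples=1000, n_features=10,
                                   n_classes=2, random_state=1)
X_tr, y_tr = X_s[:500], y_s[:500]
X_te, y_te = X_s[500:], y_s[500:]

# Random forest: vary the number of trees.
rf_scores = {}
for n_trees in (1, 10, 50):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf_scores[n_trees] = rf.fit(X_tr, y_tr).score(X_te, y_te)

# AdaBoost with decision stumps: vary the learning rate at a fixed number of rounds.
ada_scores = {}
for rate in (0.01, 0.1, 1.0):
    ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=50, learning_rate=rate, random_state=0)
    ada_scores[rate] = ada.fit(X_tr, y_tr).score(X_te, y_te)

print('Random forest accuracy by n_estimators:', rf_scores)
print('AdaBoost accuracy by learning_rate:    ', ada_scores)
```

Fixing `random_state` makes the runs repeatable, so differences between rows come from the parameter being varied rather than from sampling noise.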

### Tree bagging

Before we consider the more widely used random forests, which are combinations of many decision trees, let's start with a slightly simplified version: **tree bagging**. Here is a simple algorithm for tree bagging:

1. Set B (number of trees to make)
2. Repeat B times:
  1. Draw N random samples from training data, with replacement, where N is the number of training data points
  2. Fit a decision tree to this re-sampled data
  3. Store the predictions from this decision tree on the test data
3. As the final predictions on the test data, use the majority vote classification for the predictions above
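
Two pieces of the steps above can be illustrated on toy numbers before touching real trees. This is a sketch with made-up values (`toy_preds` and the other names are hypothetical): it draws one bootstrap sample to show that roughly 1 - 1/e (about 63%) of the distinct points appear in it, and it takes a majority vote over three pretend trees.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2.1: a bootstrap sample of N indices drawn with replacement.
n_points = 10000
boot_idx = rng.integers(0, n_points, size=n_points)
# Fraction of distinct original points that made it into the sample.
unique_fraction = np.unique(boot_idx).size / n_points
print('Fraction of distinct points in one bootstrap sample:', unique_fraction)

# Step 3: majority vote. Rows are trees, columns are test points.
toy_preds = np.array([[0, 1, 1],
                      [1, 1, 0],
                      [0, 1, 0]])
# With 0/1 labels, the column mean is the share of trees voting for class 1.
majority = (toy_preds.mean(axis=0) >= 0.5).astype(int)
print('Majority vote:', majority)  # → [0 1 0]
```

Because each tree sees a different subset of the points, the trees disagree with one another, and averaging their votes reduces the variance of the final prediction.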

Below, I've given you an implementation of a single iteration of the main loop above. Complete the algorithm by (1) repeating the resampling and fitting B times and (2) implementing step 3 above, the final predictions from tree bagging.

Once you've done that, does bagging do better than a single tree?

In [4]:
np.random.seed(1)

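# Tree bagging: B bootstrap rounds followed by a majority vote
B = 500
n = train_data.shape[0]
sn = int(n*2.0/3.0)   # number of training points drawn per tree (2n/3 here, rather than the N in the steps above)
nf = train_data.shape[1]   # number of features (not used for plain bagging)
all_preds = np.zeros((B,test_data.shape[0]))   # one row of test predictions per tree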

for b in range(B):
    bs_sample_index = np.random.choice(range(n), size=sn, replace=True)

    bs_data = train_data[bs_sample_index, :]
    bs_labels = train_labels[bs_sample_index]
    bs_test_data = test_data
    
    bs_tree = DecisionTreeClassifier(criterion="entropy", splitter="best")
    bs_tree.fit(bs_data, bs_labels)
    
    bs_tree_preds = bs_tree.predict(bs_test_data)
    all_preds[b,:] = bs_tree_preds
    
# Majority vote: share of trees predicting class 1, thresholded at 0.5 (ties go to class 1)
voting = np.mean(all_preds, axis=0)
voting = (voting >= 0.5).astype(int)
np.mean(voting == test_labels)

0.8835

### Random Forest

Now we are ready for **random forests**. Random forests add the twist of subsampling features at each node; typically we take p' = sqrt(p) of the p features. DecisionTreeClassifier implements this per-node sampling through its *max_features* parameter; check out the documentation. The steps below use a simpler variant that draws one feature subset per tree. Either way, a simple change to your code above should give you random forests.

1. Set B (number of trees to make)
2. Repeat B times:
  1. Draw N random samples from training data, with replacement, where N is the number of training data points
  2. Draw p' = sqrt(p) features without replacement
  3. Fit a decision tree to this re-sampled data
  4. Store the predictions from this decision tree on the test data
3. As the final predictions on the test data, use the majority vote classification for the predictions above

Do random forests do better than tree bagging?
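
For comparison, here is a minimal sketch of the per-node variant, which leaves the feature sampling to `max_features="sqrt"` instead of slicing columns by hand. It uses a smaller regenerated dataset and only 25 trees so that it runs quickly; the names (`X_f`, `votes`, `forest_acc`) are illustrative, and the resulting accuracy will differ from the numbers in this notebook.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles
from sklearn.tree import DecisionTreeClassifier

# Smaller copy of the simulated problem.
X_f, y_f = make_gaussian_quantiles(cov=2., n_samples=600, n_features=10,
                                   n_classes=2, random_state=1)
X_tr, y_tr = X_f[:300], y_f[:300]
X_te, y_te = X_f[300:], y_f[300:]

rng = np.random.RandomState(1)
B_small = 25                      # far fewer trees than B = 500, for speed
n_tr = X_tr.shape[0]
votes = np.zeros((B_small, X_te.shape[0]))

for b in range(B_small):
    # Bootstrap sample of the training rows (N draws with replacement).
    idx = rng.choice(n_tr, size=n_tr, replace=True)
    # max_features="sqrt" makes the tree consider a fresh random subset of
    # sqrt(p) features at every split, rather than one subset per tree.
    tree_b = DecisionTreeClassifier(criterion="entropy",
                                    max_features="sqrt", random_state=b)
    tree_b.fit(X_tr[idx], y_tr[idx])
    votes[b] = tree_b.predict(X_te)

# Majority vote across trees.
forest_preds = (votes.mean(axis=0) >= 0.5).astype(int)
forest_acc = np.mean(forest_preds == y_te)
print('Accuracy (hand-rolled forest, per-node feature sampling):', forest_acc)
```

With per-node sampling every tree can still use all features somewhere in its structure, whereas a per-tree subset hides the remaining features from that tree entirely.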

Note: you can also use trees, tree bagging, and random forests for regression! The simulated data above is a classification problem, so you will need to load a dataset with a continuous target. Beyond that, you need only use DecisionTreeRegressor instead of DecisionTreeClassifier; see:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

As a bonus, try implementing trees, tree bagging, and random forests for regression.
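
As a starting point for the bonus, here is a hedged sketch of tree bagging for regression. It assumes a synthetic dataset from `make_regression` (not the data used elsewhere in this notebook), and it replaces the majority vote with an average of the trees' predictions, which is the usual choice for a continuous target. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption made for this sketch).
X_r, y_r = make_regression(n_samples=400, n_features=10, noise=10.0,
                           random_state=0)
X_tr, y_tr = X_r[:200], y_r[:200]
X_te, y_te = X_r[200:], y_r[200:]

# Baseline: a single regression tree.
single_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
single_mse = np.mean((single_tree.predict(X_te) - y_te) ** 2)

# Bagging: fit B_reg trees on bootstrap samples and average their predictions.
rng = np.random.RandomState(1)
B_reg = 25
n_tr = X_tr.shape[0]
reg_preds = np.zeros((B_reg, X_te.shape[0]))
for b in range(B_reg):
    idx = rng.choice(n_tr, size=n_tr, replace=True)
    tree_b = DecisionTreeRegressor(random_state=b).fit(X_tr[idx], y_tr[idx])
    reg_preds[b] = tree_b.predict(X_te)

bagged_preds = reg_preds.mean(axis=0)   # average instead of majority vote
bagged_mse = np.mean((bagged_preds - y_te) ** 2)

print('Test MSE (a single regression tree):', single_mse)
print('Test MSE (bagged regression trees): ', bagged_mse)
```

Passing `max_features` to DecisionTreeRegressor would turn this into a random-forest-style regressor in the same way as for classification.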

In [5]:
np.random.seed(1)

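# Random forest variant: tree bagging plus a random feature subset for each tree
B = 500
n = train_data.shape[0]
sn = int(n*2.0/3.0)   # number of training points drawn per tree (2n/3 here, rather than the N in the steps above)
nf = train_data.shape[1]   # number of features p
all_preds = np.zeros((B,test_data.shape[0]))   # one row of test predictions per tree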

for b in range(B):
    bs_sample_index = np.random.choice(range(n), size=sn, replace=True)
    
    bs_data = train_data[bs_sample_index, :]
    bs_labels = train_labels[bs_sample_index]
    bs_test_data = test_data

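    # Draw p' = sqrt(p) features without replacement, once per tree,
    # and restrict both the training and the test data to those columns.
    bs_sample_index_features = np.random.choice(range(nf), size=int(np.sqrt(nf)), replace=False)
    bs_data = bs_data[:, bs_sample_index_features]
    bs_test_data = bs_test_data[:, bs_sample_index_features]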
    
    bs_tree = DecisionTreeClassifier(criterion="entropy", splitter="best")
    bs_tree.fit(bs_data, bs_labels)
    
    bs_tree_preds = bs_tree.predict(bs_test_data)
    all_preds[b,:] = bs_tree_preds
    
# Majority vote: share of trees predicting class 1, thresholded at 0.5 (ties go to class 1)
voting = np.mean(all_preds, axis=0)
voting = (voting >= 0.5).astype(int)
np.mean(voting == test_labels)

0.872