# Chapter 7: Random Forests and Ensemble Learning
Ensemble learning aggregates the predictions of a group of predictors, relying on the "wisdom of the crowd".
Ensemble methods are often used as a late-stage refinement, once a few good initial models have been built.

An ensemble works best when its predictors are as independent of one another as possible, so that they tend to make different kinds of errors.

In [2]:
#sklearn voting classifier, using moons dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [4]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)), ('rf', RandomF...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [7]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.88
SVC 0.888
VotingClassifier 0.888




Soft voting uses models that provide probabilities, and the ensemble prediction is the class with the highest average probability.
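
As a minimal sketch of soft voting, the cell below rebuilds the same three-model ensemble with `voting='soft'`. It assumes the same moons data as above; the `random_state` values and `n_estimators=100` are illustrative choices, not part of the original notebook. `SVC` only provides `predict_proba()` when it is created with `probability=True`.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same dataset and split as the hard-voting example
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        # probability=True makes SVC expose predict_proba()
        ('svc', SVC(probability=True, random_state=42)),
    ],
    voting='soft')  # average the class probabilities, then pick the argmax
soft_voting_clf.fit(X_train, y_train)

soft_accuracy = accuracy_score(y_test, soft_voting_clf.predict(X_test))
print("soft voting accuracy:", soft_accuracy)
```

Soft voting gives more weight to confident votes, so it can outperform hard voting when the individual models produce reasonably calibrated probabilities.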

## Bagging and Pasting
Instead of using different models in the ensemble, you can use different subsets of the training data.

- Bagging (short for "bootstrap aggregating"): sampling with replacement.
- Pasting: sampling without replacement.
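
The difference between the two sampling schemes can be sketched with plain NumPy index sampling. The training-set size of 1000 and the variable names here are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 1000  # hypothetical training-set size

# Bagging: draw indices WITH replacement, so duplicates are allowed
bagging_idx = rng.choice(n_train, size=n_train, replace=True)

# Pasting: draw indices WITHOUT replacement, so every index is distinct
pasting_idx = rng.choice(n_train, size=n_train // 2, replace=False)

bagging_unique_fraction = len(np.unique(bagging_idx)) / n_train
pasting_has_duplicates = len(np.unique(pasting_idx)) < len(pasting_idx)

print("fraction of distinct instances in the bagging sample:", bagging_unique_fraction)
print("pasting sample contains duplicates:", pasting_has_duplicates)
```

With replacement, a sample as large as the training set contains only about 63% of the distinct instances on average; the instances left out are what out-of-bag evaluation relies on.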

In [10]:
# Bagging and Pasting in sklearn
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# bootstrap=False for Pasting mode.
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
# Automatically uses soft voting if the classifier has a predict_proba() method.
accuracy_score(y_test,y_pred)

0.912

In [11]:
# Out Of Bag evaluation (part of training set not sampled)
# Built in using oob_score param.
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train,y_train)
bag_clf.oob_score_

0.896

The OOB score suggests the ensemble should reach roughly 89.6% accuracy on the test set.

In [12]:
# But let's check against the test set and compare.
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.896

In [13]:
#oob decision function
bag_clf.oob_decision_function_

array([[0.41104294, 0.58895706],
       [0.37142857, 0.62857143],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.10526316, 0.89473684],
       [0.33516484, 0.66483516],
       [0.00990099, 0.99009901],
       [1.        , 0.        ],
       [0.96354167, 0.03645833],
       [0.78888889, 0.21111111],
       [0.00549451, 0.99450549],
       [0.75520833, 0.24479167],
       [0.82954545, 0.17045455],
       [0.96335079, 0.03664921],
       [0.05263158, 0.94736842],
       [0.00546448, 0.99453552],
       [0.97282609, 0.02717391],
       [0.96774194, 0.03225806],
       [0.99502488, 0.00497512],
       [0.02906977, 0.97093023],
       [0.36756757, 0.63243243],
       [0.90419162, 0.09580838],
       [1.        , 0.        ],
       [0.96428571, 0.03571429],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.61666667, 0.38333333],
       ...

## Random Patches and Random Subspaces
You can also sample certain features of the input data, for even more diversity in your models.
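
A minimal sketch of both variants using `BaggingClassifier`, whose `max_features` and `bootstrap_features` parameters control feature sampling. Random Subspaces keeps every training instance and samples only features; Random Patches samples both instances and features. The iris data, the 0.5 and 0.7 fractions, and the variable names are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_iris, y_iris = iris["data"], iris["target"]  # 4 input features

# Random Subspaces: all training instances, random half of the features per tree
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=False, max_features=0.5,
    random_state=42)
subspaces_clf.fit(X_iris, y_iris)

# Random Patches: sample both the training instances and the features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=False, max_features=0.5,
    random_state=42)
patches_clf.fit(X_iris, y_iris)

# Each fitted tree records which feature indices it was given
print(subspaces_clf.estimators_features_[:3])
print(patches_clf.estimators_features_[:3])
```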

## Random Forests
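A Random Forest is an ensemble of decision trees, generally trained via the bagging method, with max_samples typically set to the size of the training set (i.e. each tree is trained on a bootstrap sample as large as the training set).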

In [14]:
from sklearn.ensemble import RandomForestClassifier

#use all cpu cores.
rnd_clf = RandomForestClassifier(n_estimators = 500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=16,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [15]:
y_pred_rf = rnd_clf.predict(X_test)

In [17]:
accuracy_score(y_test, y_pred_rf)

0.92
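
To make the link with bagging concrete, the sketch below builds a roughly equivalent ensemble from a `BaggingClassifier` of decision trees that consider a random subset of features at each split. It assumes the same moons data; `n_estimators=200` and the `random_state` values are illustrative, and the two ensembles are not expected to match exactly.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging of trees; each tree searches a random feature subset at every split
bag_of_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200, max_samples=1.0, bootstrap=True, random_state=42)
bag_of_trees.fit(X_train, y_train)

# The dedicated class, configured the same way
forest = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42)
forest.fit(X_train, y_train)

y_pred_bag = bag_of_trees.predict(X_test)
y_pred_forest = forest.predict(X_test)

# Fraction of test instances on which the two ensembles agree
agreement = np.mean(y_pred_bag == y_pred_forest)
print("bagging accuracy:", accuracy_score(y_test, y_pred_bag))
print("forest accuracy:", accuracy_score(y_test, y_pred_forest))
print("agreement:", agreement)
```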

### Feature Importance
Random forests let you measure the relative importance of each input feature. Importance is measured by how much the tree nodes that use a feature reduce impurity, averaged across all trees in the forest.

In [18]:
from sklearn.datasets import load_iris
iris = load_iris()

rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [19]:
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09980536920905438
sepal width (cm) 0.02270575768013286
petal length (cm) 0.422175217711915
petal width (cm) 0.45531365539889745


This shows that sepal length and width are relatively unimportant features compared with petal length and width. Random forests are a handy way to get a quick idea of which features matter in your data.

## Boosting
Boosting (originally called hypothesis boosting) is any ensemble method that can combine several weak learners into a strong learner.

AdaBoost (Adaptive Boosting)
- Train models sequentially; after each round, increase the weight of every training instance the current model misclassified, so that the next model focuses more on those hard cases.
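
A minimal AdaBoost sketch using scikit-learn's `AdaBoostClassifier` with decision stumps (trees of depth 1) as the weak learners. It assumes the same moons data as the earlier cells; `n_estimators=200` and `learning_rate=0.5` are illustrative values, not tuned ones.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each weak learner is a decision stump; instance weights are updated
# after every round so later stumps focus on earlier mistakes.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200, learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

ada_accuracy = accuracy_score(y_test, ada_clf.predict(X_test))
print("AdaBoost accuracy:", ada_accuracy)
```

Because the models are trained one after another, boosting cannot be parallelized across predictors the way bagging can.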

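### Gradient Boosting
Like AdaBoost, Gradient Boosting adds predictors sequentially, but each new predictor is fit to the residual errors of the previous one instead of to reweighted instances.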

In [23]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X,y)

#train second tree on the residual errors made by first predictor
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X,y2)

#Repeat for 3rd tree
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X,y3)

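#run the whole ensemble by summing the predictions of the three trees.
#X_new was never defined in this notebook; X_test is used here as a stand-in.
X_new = X_test
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))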

In [None]:
#Use GradientBoostingRegressor for a built in version.
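
A minimal sketch of the built-in version. It trains three depth-2 trees with `learning_rate=1.0`, mirroring the manual three-tree ensemble above. The noisy quadratic regression data (`X_reg`, `y_reg`) and the `X_new` points are hypothetical, generated here only so the example is self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical regression data: a noisy quadratic in one feature
rng = np.random.default_rng(42)
X_reg = rng.uniform(-0.5, 0.5, size=(100, 1))
y_reg = 3 * X_reg[:, 0] ** 2 + 0.05 * rng.normal(size=100)

# Three depth-2 trees, each fit to the residuals left by the previous stage
gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt.fit(X_reg, y_reg)

X_new = np.array([[-0.4], [0.0], [0.4]])
y_pred = gbrt.predict(X_new)
train_mse = np.mean((gbrt.predict(X_reg) - y_reg) ** 2)

print("predictions:", y_pred)
print("training MSE:", train_mse)
```

Lowering `learning_rate` shrinks the contribution of each tree, which usually requires more trees but tends to generalize better.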

## Stacking
Instead of using a simple voting scheme to aggregate an ensemble, you can train another model that takes the outputs of the ensemble predictors as input and makes the final prediction. This extra model is called a blender, or meta learner.
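
A minimal stacking sketch using `StackingClassifier`, available in scikit-learn 0.22 and later. It assumes the same moons data and the same three base models as the voting example; using `LogisticRegression` as the blender and `cv=5` are illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('svc', SVC(random_state=42)),
    ],
    # The blender is trained on out-of-fold predictions of the base models
    final_estimator=LogisticRegression(),
    cv=5)
stacking_clf.fit(X_train, y_train)

stacking_accuracy = accuracy_score(y_test, stacking_clf.predict(X_test))
print("stacking accuracy:", stacking_accuracy)
```

Training the blender on cross-validated predictions keeps it from simply learning to trust whichever base model overfit the training set the most.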