# Chapter 7: Ensemble Learning and Random Forests
This work partially combines text and code from the book [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/). It is intended only as a reference, and it is best followed alongside a purchased copy of the book.

In [None]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [None]:
X, y = make_moons(n_samples=200, noise=0.15)
X_train, X_test, y_train, y_test = train_test_split(X, y)

log_clf = LogisticRegression()
rndf_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
                estimators=[('lr', log_clf),
                            ('rf', rndf_clf),
                            ('svc', svm_clf)],
                voting='hard'
)
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

In [None]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rndf_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.88
RandomForestClassifier 0.94
SVC 0.98
VotingClassifier 0.98
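
To make the `voting='hard'` setting more concrete, here is a minimal sketch that reproduces majority voting by hand. It assumes binary labels (0 or 1) and exactly three classifiers; the names `classifiers`, `preds`, and `majority`, as well as the fixed `random_state` values, are illustrative choices and not part of the book's code.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

classifiers = [LogisticRegression(), RandomForestClassifier(random_state=42), SVC()]

# One row of predictions per classifier: shape (3, n_test_instances)
preds = np.array([clf.fit(X_train, y_train).predict(X_test) for clf in classifiers])

# Labels are 0 or 1, so with three voters the majority class is 1
# whenever at least two classifiers predict 1.
majority = (preds.sum(axis=0) >= 2).astype(int)

print("manual hard-voting accuracy:", (majority == y_test).mean())
```

The exact accuracy depends on the random split, so it will not necessarily match the numbers printed above.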


# Bagging and Pasting in Scikit-learn

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
            DecisionTreeClassifier(),
            n_estimators=500,
            max_samples=100,
            bootstrap=True,          # set to False to use pasting instead
            n_jobs=-1,
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
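
The only difference between the two methods is the `bootstrap` flag: bagging samples training instances with replacement, pasting samples them without replacement. The sketch below trains both variants side by side. The smaller `n_estimators`, the fixed `random_state`, and the `scores` dictionary are assumptions made to keep the example quick and repeatable.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores = {}
for name, bootstrap in [("bagging", True), ("pasting", False)]:
    clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=100,
        max_samples=100,
        bootstrap=bootstrap,   # True: with replacement, False: without
        random_state=42,
    )
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)

print(scores)
```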

# Out-of-Bag Evaluation

In [None]:
bag_clf = BaggingClassifier(
            DecisionTreeClassifier(),
            n_estimators=500,
            max_samples=100,
            bootstrap=True,
            n_jobs=-1,
            oob_score=True,
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_                  # OOB estimate of the accuracy to expect on the test set

0.9266666666666666

In [None]:
# Let's Verify
from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.94

In [None]:
# Shape is (150, 2): one row of class probabilities per training instance (X_train has 150 instances).
bag_clf.oob_decision_function_[:5]

array([[0.99609375, 0.00390625],
       [0.07509881, 0.92490119],
       [0.99215686, 0.00784314],
       [0.96538462, 0.03461538],
       [0.73809524, 0.26190476]])
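
Out-of-bag evaluation is possible because each predictor never sees the training instances that its bootstrap sample skipped. As a rough illustration, the sketch below simulates drawing 100 samples with replacement from 150 training instances (the sizes used above) and measures the share left out. The trial count and variable names are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, max_samples, n_trials = 150, 100, 2000

oob_fractions = []
for _ in range(n_trials):
    sampled = rng.integers(0, n_train, size=max_samples)  # sampling with replacement
    n_oob = n_train - np.unique(sampled).size             # instances never drawn
    oob_fractions.append(n_oob / n_train)

simulated = np.mean(oob_fractions)
expected = (1 - 1 / n_train) ** max_samples   # chance that one instance is never drawn
print(f"simulated OOB fraction: {simulated:.3f}, expected: {expected:.3f}")
```

With these sizes roughly half of the training instances are out-of-bag for any given predictor, which is why they can serve as a built-in validation set.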

# Random Patches and Random Subspaces

1. **Random Patches**: sampling both training instances and features.
2. **Random Subspaces**: keeping all training instances (*`bootstrap = False`* and *`max_samples = 1.0`*) ***but*** sampling features (*`bootstrap_features = True`* and *`max_features < 1.0`*).

Sampling features results in even more predictor diversity.
- Trades a bit more bias for a lower variance.
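
The two settings above can be sketched with `BaggingClassifier` as follows. The iris dataset is used here only because it has four features (the moons data has just two, which leaves little to sample); the specific `max_samples`, `max_features`, and `n_estimators` values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random Subspaces: keep every training instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42,
)

# Random Patches: sample both the training instances and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42,
)

for name, clf in [("subspaces", subspaces_clf), ("patches", patches_clf)]:
    clf.fit(X, y)
    # estimators_features_ lists the feature indices drawn for each predictor
    print(name, clf.estimators_features_[:3])
```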

# Random Forests

- Generally trained via the bagging method.
- *`max_samples`* is typically set to the size of the training set.

The following code uses all available CPU cores to train a Random Forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes).

In [None]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred_rf)

0.94

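*RandomForestClassifier* has (with a few exceptions) all the hyperparameters of a *DecisionTreeClassifier*, to control how trees are grown, plus all the hyperparameters of a *BaggingClassifier*, to control the ensemble itself.
- Introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features.
- Trades a higher bias for a lower variance.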


In [None]:
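# The following BaggingClassifier is roughly equivalent to the RandomForestClassifier above.
# max_features="sqrt" makes each split consider a random subset of features, as a Random Forest does
# (splitter="random" would instead pick random thresholds, which is closer to Extra-Trees).

bag_clf = BaggingClassifier(
            DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
            n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)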

## Extra-trees | Extremely Randomized Trees Ensemble

Uses **random thresholds for each feature** rather than searching for the best possible thresholds (like regular Decision Trees do).

Use the `ExtraTreesClassifier` or `ExtraTreesRegressor` class; their APIs are identical to those of `RandomForestClassifier` and `RandomForestRegressor`.
- Trades more bias for a lower variance.
- Much faster to train than regular Random Forests.

  This is because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.
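
Since the two classes share the same API, swapping one for the other is a one-word change. The sketch below trains both on the moons data with identical hyperparameters; the `scores` dictionary and the fixed `random_state` values are assumptions added for repeatability.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores = {}
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=200, max_leaf_nodes=16, random_state=42)
    clf.fit(X_train, y_train)
    scores[Model.__name__] = clf.score(X_test, y_test)

print(scores)
```

Which of the two performs better is hard to tell in advance, so comparing them (ideally with cross-validation) is usually the practical way to choose.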
  

## Feature Importance
Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest).

Scikit-Learn computes this score automatically for each feature after training, then scales the results so that the sum of all importances is equal to 1.
- *`feature_importances_`* holds the result.

**Random Forests are very handy to get a quick understanding of what features actually matter**.

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])

for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.08475828580501725
sepal width (cm) 0.022786389573764637
petal length (cm) 0.4270325182779528
petal width (cm) 0.46542280634326527
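
As a quick check of the scaling described above, the sketch below confirms that the importances sum to 1 and prints the features ranked from most to least important. The variable names `forest` and `importances` and the smaller forest size are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris["data"], iris["target"])

importances = forest.feature_importances_
print("sum of importances:", importances.sum())

# Rank features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"{iris['feature_names'][idx]:<20} {importances[idx]:.3f}")
```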


# Boosting

**Refers to any ensemble method that can combine several weak learners into a strong learner.**
- Train predictors sequentially, each trying to correct its predecessor.

## AdaBoost
**`AdaBoostClassifier`**

New predictors focus more and more on the hard cases by paying more attention to the training instances that the predecessor underfitted.

- The algorithm first trains a base classifier and uses it to make predictions on the training set.
- The algorithm then increases the relative weight of the misclassified training instances.
- It trains a second classifier using the updated weights, makes predictions on the training set again, updates the instance weights, and so on.
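
The steps above can be sketched for a single boosting round as follows. This is one common formulation of the weight update (weighted error rate, predictor weight `alpha`, multiplicative boost of the misclassified instances), written out by hand; it is not a copy of Scikit-Learn's internal implementation, and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
learning_rate = 0.5

# Start with equal instance weights
weights = np.full(len(X), 1 / len(X))

# Train the first weak learner (a decision stump) and find its mistakes
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X, y, sample_weight=weights)
missed = stump.predict(X) != y

# Weighted error rate and the resulting predictor weight
error_rate = weights[missed].sum() / weights.sum()
alpha = learning_rate * np.log((1 - error_rate) / error_rate)

# Boost the misclassified instances, then renormalize
new_weights = weights.copy()
new_weights[missed] *= np.exp(alpha)
new_weights /= new_weights.sum()

print(f"error rate: {error_rate:.3f}, alpha: {alpha:.3f}")
print("weight ratio (missed / correct):", new_weights[missed][0] / new_weights[~missed][0])
```

The next stump would then be trained with `sample_weight=new_weights`, so it concentrates on the instances the first one got wrong.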

*AdaBoost adds predictors to the ensemble, gradually making it better*, instead of tweaking a single predictor's parameters to minimize a cost function, as Gradient Descent does.

**Can't be parallelized, since each predictor can only be trained after the previous predictor has been trained and evaluated**

**IF OVERFITTING:**
1. Reduce the number of estimators
2. More strongly regularize the base estimator.

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), 
    n_estimators=500,
    algorithm="SAMME.R",   # deprecated in recent scikit-learn releases; omit this argument there
    learning_rate=0.5)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=500)

## Gradient Boosting
**`GradientBoostingRegressor`**, much like `RandomForestRegressor`, has hyperparameters to control the growth of Decision Trees, as well as hyperparameters to control the ensemble training (such as *`n_estimators`*).

Just like AdaBoost, Gradient Boosting adds predictors sequentially, but instead of tweaking the instance weights at every iteration, this method

### Tries to fit the new predictor to the *residual errors* made by the previous predictor.


In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

DecisionTreeRegressor(max_depth=2)

In [None]:
# Now we'll train a second DecisionTreeRegressor on the residual errors made by the first predictor.
y2 = y - tree_reg1.predict(X)

tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(max_depth=2)

In [None]:
# Now we'll train a third DecisionTreeRegressor on the residual errors made by the second predictor.
y3 = y2 - tree_reg2.predict(X)

tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(max_depth=2)

- Now we have an ensemble of three trees.
- **Predictions are made by simply adding up the predictions of all the trees.**

In [None]:
X_new = X[:5]  # stand-in: X_new is not defined elsewhere in this notebook
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred

### Using Sklearn class

The *`learning_rate`* hyperparameter scales the contribution of each tree.

- If you set it to a low value (like `0.1`), you'll need more trees in the ensemble to fit the training set.

  **But the predictions will usually generalize better**. This is a regularization technique called *shrinkage*.
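
To see why a low learning rate needs more trees, here is a hand-rolled sketch of residual fitting with shrinkage. The noisy quadratic data and the `boost` helper are hypothetical and exist only for this illustration; the sketch starts from a zero prediction, which is a simplification compared with `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic data, used only for this illustration
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

def boost(X, y, n_trees, learning_rate):
    """Fit each tree to the current residuals and add a scaled-down correction."""
    prediction = np.zeros_like(y)
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=2, random_state=42)
        tree.fit(X, y - prediction)                    # residual errors so far
        prediction += learning_rate * tree.predict(X)  # shrunken contribution
        trees.append(tree)
    return trees, mean_squared_error(y, prediction)

_, mse_fast = boost(X, y, n_trees=3, learning_rate=1.0)
_, mse_slow_few = boost(X, y, n_trees=3, learning_rate=0.1)
_, mse_slow_many = boost(X, y, n_trees=100, learning_rate=0.1)

print("lr=1.0,   3 trees:", mse_fast)
print("lr=0.1,   3 trees:", mse_slow_few)
print("lr=0.1, 100 trees:", mse_slow_many)
```

This only measures the training error; whether the slower ensemble generalizes better has to be checked on a validation set.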

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3)

### In order to find the optimal number of trees, use **Early Stopping**.
The code below first trains 120 trees, then looks back to find the number of trees at which the validation error (`mean_squared_error`) was lowest.

The `staged_predict()` method returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.).

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
best_number_of_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_number_of_estimators)
gbrt_best.fit(X_train, y_train)    # retrain with the best number of trees

#### Actually stopping early
Setting *`warm_start`*`=True` makes Scikit-Learn keep the existing trees when the `fit()` method is called, allowing **incremental training**.

In [None]:
# Stops training if the validation error does not improve for five iterations in a row.

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float('inf')
error_going_up = 0

for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, gbrt.predict(X_val))
    
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break             # EARLY STOPPING
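
As an alternative to the manual loop, `GradientBoostingRegressor` also accepts `n_iter_no_change` and `validation_fraction`, which let it hold out part of the training data and stop on its own. The sketch below is a minimal example of that option, not the book's approach; the specific values (500 trees, 20% validation, patience of 5) are assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

gbrt = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,          # upper bound on the number of trees
    validation_fraction=0.2,   # share of the training data held out internally
    n_iter_no_change=5,        # stop after 5 iterations without improvement
    random_state=42,
)
gbrt.fit(X, y)

print("trees actually trained:", gbrt.n_estimators_)
```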

### Stochastic Gradient Boosting
The *`subsample`* hyperparameter specifies the fraction of training instances to be used for training each tree.

- Trades a higher bias for a lower variance.
- Speeds up training considerably.
- If *`subsample`*`=0.25`, each tree is trained on 25% of the training instances, selected randomly.
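
A minimal sketch of this setting, assuming the same moons data used earlier in the notebook (with class labels treated as regression targets, as in the cells above). The number of trees and the `random_state` values are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Each tree sees a random 25% of the training instances
sgbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=100, subsample=0.25, random_state=42
)
sgbrt.fit(X_train, y_train)

val_mse = mean_squared_error(y_val, sgbrt.predict(X_val))
print("validation MSE:", val_mse)
```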

In [None]:
import xgboost    # third-party package: install it first (e.g. pip install xgboost)

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)[:5]


In [None]:
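# XGBoost also offers several nice features, such as automatically taking care of early stopping.
# Note: recent XGBoost releases expect early_stopping_rounds in the XGBRegressor constructor rather than in fit().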

xgb_reg.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=2)
xgb_reg.predict(X_val)[:5]

# Stacking | Stacked Generalization
Instead of using a trivial function (like hard voting) to aggregate the predictions of all the predictors,

### Train a model to perform this aggregation itself.
- To train this model (called a blender), a common approach is to use a hold-out set.
- Split the training set into two subsets: the first is used to train the predictors in the first layer, and the second is held out.
- The first-layer predictors then make predictions on the hold-out set.
- Each predictor outputs its predictions; these become the input features of the blender's training set, with the original targets kept as labels.
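
The hold-out procedure above can be sketched by hand as follows. The choice of first-layer models, the `LogisticRegression` blender, the 50/50 split, and the helper `layer_predictions` are all assumptions made for illustration; the text does not prescribe them.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.15, random_state=42)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=42)

# Split the training data in two: one part for the first layer, one held out
X_layer1, X_hold, y_layer1, y_hold = train_test_split(
    X_trainval, y_trainval, test_size=0.5, random_state=42
)

# Train the first-layer predictors on the first subset only
first_layer = [LogisticRegression(), RandomForestClassifier(random_state=42), SVC()]
for clf in first_layer:
    clf.fit(X_layer1, y_layer1)

def layer_predictions(X):
    """Stack the first-layer predictions as columns: one feature per predictor."""
    return np.column_stack([clf.predict(X) for clf in first_layer])

# The blender learns from predictions made on instances the first layer never saw
blender = LogisticRegression()
blender.fit(layer_predictions(X_hold), y_hold)

stack_accuracy = blender.score(layer_predictions(X_test), y_test)
print("stacked accuracy:", stack_accuracy)
```

Training the blender on hold-out predictions matters because predictions on data the first layer was trained on would look overly accurate and mislead the blender.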
