## Task 2: 
Review Chapter 7 of the Geron book, complete problems 8 and 9, and create a Jupyter notebook (HW5a.ipynb) with your notes and explanations.
<br><br>

# Chapter 7: Ensemble Learning and Random Forests

A group of predictors is called an *ensemble*. Aggregating the predictions of an ensemble is called *ensemble learning*.

You usually use ensemble methods near the end of a project, once you have built a few good predictors.

### Voting Classifiers

Suppose you have a few (different) trained classifiers of ~80% accuracy. One way to make an even better classifier is to aggregate the predictions of each classifier. The class that gets the most votes is the ensemble's prediction. This majority vote classifier is called a *hard voting* classifier. The voting classifier often achieves higher accuracy than the best classifier in the ensemble. This is due to the *law of large numbers*, and it works best when the classifiers are as independent from one another as possible, so that they make different kinds of errors.

In [1]:
# try SKL's VotingClassifier class on the moons dataset
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)
voting_clf.fit(X_train, y_train)

In [2]:
# take a look at the class that gets voted for
voting_clf.predict(X_test[:1])

array([1])

In [3]:
# look at the predictions for the individual estimators
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]

[array([1]), array([1]), array([0])]

In [4]:
# performance of the voting classifier
voting_clf.score(X_test, y_test)

0.912
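The law-of-large-numbers argument can be checked with a small simulation. This is only a sketch under a strong assumption: the classifiers are simulated as fully independent voters that are each right 80% of the time, which real classifiers trained on the same data are not. The names `n_classifiers`, `n_instances`, and `p_correct` are illustrative and do not come from the book.

```python
import numpy as np

rng = np.random.default_rng(42)
n_classifiers, n_instances, p_correct = 25, 10_000, 0.80  # illustrative values

# correct[i, j] is True when hypothetical classifier j labels instance i correctly
correct = rng.random((n_instances, n_classifiers)) < p_correct

individual_accuracy = correct.mean()
# the majority vote is right when more than half of the voters are right
majority_accuracy = (correct.sum(axis=1) > n_classifiers / 2).mean()

print(f"individual: {individual_accuracy:.3f}, majority vote: {majority_accuracy:.3f}")
```

Because real classifiers share training data and tend to make correlated errors, the gain seen in practice (as in the moons example) is much smaller than in this idealized setting.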

If all classifiers are able to estimate class probabilities (i.e., if they all have a ```predict_proba()``` method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called *soft voting*. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is set the voting classifier’s ```voting``` hyperparameter to "soft", and ensure that all classifiers can estimate class probabilities.

In [7]:
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)

0.92

Soft voting raises the test accuracy from 91.2% to 92%.
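To see why soft voting can disagree with hard voting, here is a minimal sketch for a single instance. The probabilities in `probas` are made up for illustration and stand in for the ```predict_proba()``` outputs of three classifiers.

```python
import numpy as np

# hypothetical predict_proba() outputs of three classifiers for ONE instance
# (columns: class 0, class 1)
probas = np.array([
    [0.45, 0.55],   # weakly prefers class 1
    [0.40, 0.60],   # weakly prefers class 1
    [0.95, 0.05],   # strongly prefers class 0
])

# hard voting: each classifier casts one vote for its most likely class
hard_votes = probas.argmax(axis=1)
hard_prediction = np.bincount(hard_votes, minlength=2).argmax()

# soft voting: average the probabilities, then pick the most likely class
soft_prediction = probas.mean(axis=0).argmax()

print(hard_prediction, soft_prediction)  # prints: 1 0
```

Two lukewarm votes beat one confident vote under hard voting, while averaging the probabilities lets the confident classifier win.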

### Bagging and Pasting

If you use the same training algorithm for every predictor, but train them on different random subsets of the training set, this is called *bagging (bootstrap aggregating)* when sampling with replacement and *pasting* when sampling without replacement.

Once all predictors are trained, the ensemble's prediction is typically the *statistical mode* of all the predictions if it is a classification task and the average if it is a regression task. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.⁠ Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.

In [8]:
# Bagging and Pasting in Scikit-Learn

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, n_jobs=-1, random_state=42)
# set 'bootstrap=False' to use pasting instead of bagging
# 'n_jobs' tells SKL the number of CPU cores to use for training and predictions
#     - '-1' means to use all available cores

bag_clf.fit(X_train, y_train)

Overall, bagging generally produces better models than pasting, but if you have enough time and processing power, you can try and see.

#### Out-of-Bag Evaluation

By default a ```BaggingClassifier``` samples m training instances with replacement (```bootstrap=True```), where m is the size of the training set. With this process, it can be shown mathematically that only about 63% of the training instances are sampled on average for each predictor.⁠ The remaining 37% of the training instances that are not sampled are called out-of-bag (OOB) instances.
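The 63% figure follows from the chance that a given instance is never picked in m draws with replacement, which is $(1 - 1/m)^m$ and tends to $e^{-1} \approx 0.37$ as m grows. The sketch below checks this numerically. The value of `m` is arbitrary and the variable names are illustrative.

```python
import numpy as np

m = 10_000  # illustrative training set size
rng = np.random.default_rng(42)

# draw m instance indices with replacement, as bagging does for one predictor
sample = rng.integers(0, m, size=m)
sampled_fraction = np.unique(sample).size / m

exact = 1 - (1 - 1 / m) ** m   # probability that an instance is drawn at least once
limit = 1 - np.exp(-1)         # value as m grows, about 0.632

print(f"simulated: {sampled_fraction:.3f}, exact: {exact:.3f}, limit: {limit:.3f}")
```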

A bagging ensemble can be evaluated using OOB instances, without the need for a separate validation set. If there are enough estimators, then each instance in the training set will likely be an OOB instance of several estimators, so these estimators can be used to make a fair ensemble prediction for that instance. Once you have a prediction for each instance, you can compute the ensemble’s prediction accuracy (or any other metric).

In Scikit-Learn, you can set ```oob_score=True``` when creating a ```BaggingClassifier``` to request an automatic OOB evaluation after training. 

In [9]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.896

According to this OOB evaluation, this BaggingClassifier is likely to achieve about 89.6% accuracy on the test set. 

In [10]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.92

The accuracy on the test set is actually slightly better than the OOB evaluation.

The OOB decision function for each training instance is also available through the oob_decision_function_ attribute. Since the base estimator has a predict_proba() method, the decision function returns the class probabilities for each training instance. 

In [11]:
bag_clf.oob_decision_function_[:3]  # probas for the first 3 instances

array([[0.32352941, 0.67647059],
       [0.3375    , 0.6625    ],
       [1.        , 0.        ]])

#### Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. Sampling is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Each predictor will be training on a random subset of the input features.

This can speed up training when dealing with high dimensional inputs. Sampling both training instances and features is called the *random patches method*. Keeping all training instances (by setting ```bootstrap=False``` and ```max_samples=1.0```) but sampling features (by setting ```bootstrap_features``` to True and/or ```max_features``` to a value smaller than 1.0) is called the random subspaces method.

Sampling features results in more predictor diversity, trading a bit more bias for lower variance.
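The two feature-sampling setups described above could be configured as follows. This is a sketch on a synthetic 20-feature dataset from `make_classification`. The dataset, the 50% feature fraction, and the `_demo` and `_clf` names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic data with enough features for feature sampling to matter
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# random subspaces: keep every training instance, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42)
subspaces_clf.fit(X_demo, y_demo)

# random patches: sample both the training instances and the features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.5,
    bootstrap_features=True, max_features=0.5,
    random_state=42)
patches_clf.fit(X_demo, y_demo)

# each predictor was trained on 10 of the 20 features (drawn with replacement)
print(len(subspaces_clf.estimators_features_[0]))
```

The fitted `estimators_features_` attribute lists the feature indices each predictor saw, which is a quick way to confirm that feature sampling is active.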

### Random Forests

A *random forest* is an ensemble of decision trees, usually trained via bagging (or pasting), typically with ```max_samples``` set to the size of the training set. There is ```RandomForestClassifier``` for classification tasks and ```RandomForestRegressor``` for regression tasks.

In [12]:
# train a random forest classifier with:
#    - 500 trees
#    - max 16 leaf nodes
#    - using all available CPU cores

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

```RandomForestClassifier``` has (almost) all the hyperparameters of a ```DecisionTreeClassifier``` (to control how trees are grown) and all the hyperparameters of a ```BaggingClassifier``` (to control the ensemble itself).

The following ```BaggingClassifier``` is equivalent to the previous ```RandomForestClassifier```:

In [13]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)

#### Extra-Trees

Trees can be made more random by using random thresholds for each feature rather than searching for the best possible thresholds (like regular decision trees do). For this, set ```splitter="random"``` when creating a DecisionTreeClassifier.

A forest of these extremely random trees is called an *extremely randomized trees (extra-trees) ensemble*. Again, it trades more bias for lower variance.

Scikit-Learn has an ```ExtraTreesClassifier``` class. Its API is identical to ```RandomForestClassifier```, except ```bootstrap=False``` by default. Similarly, ```ExtraTreesRegressor``` has the same API as ```RandomForestRegressor```, except ```bootstrap=False``` by default.

It is difficult to tell in advance how an ```ExtraTreesClassifier``` would perform compared to a ```RandomForestClassifier```. The only way to know is to try both and compare them using cross-validation.
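A side-by-side comparison might look like the sketch below, which scores both ensembles with 5-fold cross-validation on a fresh copy of the moons dataset. The hyperparameters and the `_demo` and `cv_scores` names are illustrative choices. No expected scores are shown because they depend on the dataset and the library version.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, max_leaf_nodes=16,
                                            random_state=42),
    "extra-trees": ExtraTreesClassifier(n_estimators=100, max_leaf_nodes=16,
                                        random_state=42),
}

# mean 5-fold cross-validated accuracy for each ensemble
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f"{name}: {cv_scores[name]:.3f}")
```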

#### Feature Importance

Random forests make it easy to measure the relative importance of each feature. SKL measures a feature's importance by looking at how much the tree nodes that use a feature reduce impurity on average, across all trees in the forest, with each node weighted by the number of training samples it is associated with.

SKL computes this automatically for each feature after training:

In [14]:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score, 2), name)

0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)


### Boosting

*Boosting* refers to any ensemble method that can combine several weak learners into a strong learner. The general idea is to train predictors sequentially, with each model trying to correct its predecessor.

#### AdaBoost

Each new predictor corrects its predecessor by paying more attention to the training instances that the predecessor underfit. To do this, the algorithm trains a first classifier, uses it to make predictions on the training set, and increases the relative weight of the misclassified instances. It then trains the next classifier using the updated weights, and so on.

Once all predictors are trained, the ensemble makes predictions much like bagging or pasting, except that the predictors have different weights depending on their overall accuracy on the weighted training set.

In the AdaBoost algorithm, each instance weight $w^{(i)}$ is initially set to $1/m$. After the first predictor is trained, its weighted error rate $r_1$ is computed on the training set:

**Eq 7-1. Weighted error rate of the $j^{th}$ predictor**

$$
r_j = \sum_{\substack{i=1 \\ \hat{y}_j^{(i)} \neq y^{(i)}}}^{m} w^{(i)} \quad \text{where } \hat{y}_j^{(i)} \text{ is the } j^{\text{th}} \text{ predictor's prediction for the } i^{\text{th}} \text{ instance}
$$

Next, the predictor's weight $\alpha_j$ is computed as follows.

**Eq 7-2. Predictor weight**

$$
\alpha_j = \eta\log\frac{1-r_j}{r_j}
$$

* $\eta$ is the learning rate hyperparameter (defaults to 1). 

The more accurate the predictor is, the higher its weight will be. A predictor that guesses randomly gets a weight close to zero, and one that is wrong more often than not (less accurate than random guessing) gets a negative weight.

**Eq 7-3. Weight update rule**

$$
\text{see book.}
$$

Finally, a new predictor is trained using the updated weights, and the whole process is repeated. 

To make predictions, AdaBoost computes the predictions of all the predictors and weighs them using the predictor weights $\alpha_j$. The predicted class is the one that receives the majority of weighted votes.

SKL uses a multiclass version of AdaBoost called SAMME. According to the book, when the predictors can estimate class probabilities, a variant called SAMME.R can be used instead. It relies on class probabilities rather than predictions and generally performs better.
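The steps above can be sketched by hand with decision stumps. This is a minimal two-class illustration, not Scikit-Learn's implementation. For the weight update it assumes the standard AdaBoost rule (multiply the weights of misclassified instances by $\exp(\alpha_j)$, then renormalize so the weights sum to 1), which is my reading of the rule the notes defer to the book for. The names `n_rounds`, `stumps`, `alphas`, and the `_demo` data are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X_demo)
eta = 1.0        # learning rate
n_rounds = 10    # illustrative number of boosting rounds

weights = np.full(m, 1 / m)   # every instance weight starts at 1/m
stumps, alphas = [], []

for j in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    wrong = stump.predict(X_demo) != y_demo

    # Eq 7-1: weighted error rate (the weights already sum to 1)
    r_j = np.clip(weights[wrong].sum(), 1e-10, 1 - 1e-10)
    # Eq 7-2: predictor weight
    alpha_j = eta * np.log((1 - r_j) / r_j)

    # assumed update rule: boost the misclassified instances, then renormalize
    weights[wrong] *= np.exp(alpha_j)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha_j)

# prediction: each stump votes for a class with strength alpha_j
votes = np.zeros((m, 2))
for stump, alpha_j in zip(stumps, alphas):
    votes[np.arange(m), stump.predict(X_demo)] += alpha_j
y_pred_demo = votes.argmax(axis=1)

first_stump_accuracy = (stumps[0].predict(X_demo) == y_demo).mean()
ensemble_accuracy = (y_pred_demo == y_demo).mean()
print(f"first stump: {first_stump_accuracy:.3f}, ensemble: {ensemble_accuracy:.3f}")
```

Both accuracies are measured on the training data, so they only show the mechanics of the weighted vote and say nothing about generalization.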

In [15]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=30,
    learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

#### Gradient Boosting

Gradient boosting sequentially adds predictors to an ensemble, fitting each new predictor to the residual errors made by the previous predictor.

In [16]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)  # y = 3x² + Gaussian noise

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

In [17]:
# train second tree on the residual errors of the first predictor
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2)

In [18]:
# train third tree on the residual errors of the second predictor
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3)

Now we have an ensemble of three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees:

In [19]:
X_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

array([0.49484029, 0.04021166, 0.75026781])

In [20]:
# train GBRT ensemble using GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=42)
gbrt.fit(X, y)

The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.05, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called *shrinkage*.

To find the optimal number of trees, you could perform cross-validation using GridSearchCV or RandomizedSearchCV, as usual, but there’s a simpler way: if you set the n_iter_no_change hyperparameter to an integer value, say 10, then the GradientBoostingRegressor will automatically stop adding more trees during training if it sees that the last 10 trees didn’t help.

In [21]:
# ensemble with early stopping
gbrt_best = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.05, n_estimators=500,
    n_iter_no_change=10, random_state=42)
gbrt_best.fit(X, y)

If you set n_iter_no_change too low, training may stop too early and the model will underfit. But if you set it too high, it will overfit instead. We also set a fairly small learning rate and a high number of estimators, but the actual number of estimators in the trained ensemble is much lower, thanks to early stopping:

In [22]:
gbrt_best.n_estimators_

92

#### Histogram-Based Gradient Boosting

This is another GBRT implementation provided by SKL, optimized for large datasets. It bins the input features, replacing them with integers. Binning can greatly reduce the number of possible thresholds that the training algorithm needs to evaluate. Moreover, working with integers makes it possible to use faster and more memory-efficient data structures. And the way the bins are built removes the need for sorting the features when training each tree.

HGB classes also support categorical features and missing values (see textbook).

### Stacking

Short for *stacked generalization*. This method trains a model to aggregate the predictions of all predictors in an ensemble. The final aggregate predictor is called a *blender* or *meta learner*.

SKL provides two classes for stacking ensembles: ```StackingClassifier``` and ```StackingRegressor```.

In [24]:
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5  # number of cross-validation folds
)
stacking_clf.fit(X_train, y_train)

In conclusion, ensemble methods are strong and fairly simple to use. Random forests, AdaBoost, and GBRT are among the first models you should test for most machine learning tasks, and they particularly shine with heterogeneous tabular data. Moreover, as they require very little preprocessing, they’re great for getting a prototype up and running quickly. Lastly, ensemble methods like voting classifiers and stacking classifiers can help push your system’s performance to its limits.

# Problem 8.

Load the MNIST dataset (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a random forest classifier, an extra-trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [26]:
# import dataset
from sklearn.datasets import fetch_openml

X_mnist, y_mnist = fetch_openml('mnist_784', return_X_y=True, as_frame=False)

In [27]:
# train valid test split
X_train, y_train = X_mnist[:50_000], y_mnist[:50_000]
X_valid, y_valid = X_mnist[50_000:60_000], y_mnist[50_000:60_000]
X_test, y_test = X_mnist[60_000:], y_mnist[60_000:]

In [28]:
# import models
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

In [30]:
# instantiate estimators
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)
mlp_clf = MLPClassifier(random_state=42)

# train estimators
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(random_state=42)
Training the ExtraTreesClassifier(random_state=42)
Training the LinearSVC(max_iter=100, random_state=42, tol=20)
Training the MLPClassifier(random_state=42)


In [31]:
# print the scores for each individual estimator
[estimator.score(X_valid, y_valid) for estimator in estimators]

[0.9736, 0.9743, 0.8662, 0.9594]

In [32]:
# make a voting model
from sklearn.ensemble import VotingClassifier

named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]

voting_clf = VotingClassifier(named_estimators)
voting_clf.fit(X_train, y_train)

In [35]:
# print voting score
voting_clf.score(X_valid, y_valid)

0.974

In [37]:
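# VotingClassifier trains its clones on encoded labels (class indices),
# so the fitted sub-estimators must be scored against integer targets
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_valid_encoded = encoder.fit_transform(y_valid)

# for MNIST the digits match the class indices, so a plain cast gives the same result
y_valid_encoded = y_valid.astype(np.int64)

[estimator.score(X_valid, y_valid_encoded) for estimator in voting_clf.estimators_]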

[0.9736, 0.9743, 0.8662, 0.9594]

In [38]:
# update params
voting_clf.set_params(svm_clf="drop")

In [39]:
voting_clf.estimators

[('random_forest_clf', RandomForestClassifier(random_state=42)),
 ('extra_trees_clf', ExtraTreesClassifier(random_state=42)),
 ('svm_clf', 'drop'),
 ('mlp_clf', MLPClassifier(random_state=42))]

In [40]:
voting_clf.estimators_

[RandomForestClassifier(random_state=42),
 ExtraTreesClassifier(random_state=42),
 LinearSVC(max_iter=100, random_state=42, tol=20),
 MLPClassifier(random_state=42)]

In [41]:
voting_clf.named_estimators_

{'random_forest_clf': RandomForestClassifier(random_state=42),
 'extra_trees_clf': ExtraTreesClassifier(random_state=42),
 'svm_clf': LinearSVC(max_iter=100, random_state=42, tol=20),
 'mlp_clf': MLPClassifier(random_state=42)}
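The outputs above show that ```set_params(svm_clf="drop")``` changes the configuration in ```estimators``` but leaves the already fitted ```estimators_``` untouched. The sketch below reproduces that behavior on the small moons dataset so it runs quickly. The `demo_clf` and `_demo` names and the estimator choices are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X_demo, y_demo = make_moons(n_samples=200, noise=0.30, random_state=42)

demo_clf = VotingClassifier([
    ("lr", LogisticRegression(random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
    ("svc", SVC(random_state=42)),
])
demo_clf.fit(X_demo, y_demo)

# marking an estimator as "drop" only edits the configuration ...
demo_clf.set_params(svc="drop")
n_fitted_before_refit = len(demo_clf.estimators_)   # the fitted SVC is still there

# ... so the fitted list changes only after refitting
demo_clf.fit(X_demo, y_demo)
n_fitted_after_refit = len(demo_clf.estimators_)

print(n_fitted_before_refit, n_fitted_after_refit)  # prints: 3 2
```

Refitting is the simplest way to drop an estimator. Editing the fitted lists by hand avoids retraining, which matters when each fit on MNIST is slow.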

In [42]:
# remove the trained SVM from the fitted estimators (avoids refitting), then score with hard voting
svm_clf_trained = voting_clf.named_estimators_.pop("svm_clf")
voting_clf.estimators_.remove(svm_clf_trained)

voting_clf.score(X_valid, y_valid)

0.9753

In [43]:
voting_clf.voting = "soft"
voting_clf.score(X_valid, y_valid)

0.9673

In [44]:
# print the test-set score of each remaining estimator
[estimator.score(X_test, y_test.astype(np.int64)) for estimator in voting_clf.estimators_]

[0.968, 0.9703, 0.9603]

The voting classifier reduced the test error rate from about 3% for the best individual model (the extra-trees classifier, 97.03% accuracy) to about 2.7%.

# Problem 9.

Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations—you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier? Now try again using a ```StackingClassifier``` instead. Do you get better performance? If so, why?

In [45]:
# build the blender's training set from the classifiers' validation-set predictions

X_valid_pred = np.empty((len(X_valid), len(estimators)), dtype=object)

for index, estimator in enumerate(estimators):
    X_valid_pred[:, index] = estimator.predict(X_valid)
    
X_valid_pred

array([['3', '3', '3', '3'],
       ['8', '8', '8', '8'],
       ['6', '6', '6', '6'],
       ...,
       ['5', '5', '5', '5'],
       ['6', '6', '6', '6'],
       ['8', '8', '8', '8']], dtype=object)

In [46]:
forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True,
                                        random_state=42)
forest_blender.fit(X_valid_pred, y_valid)

forest_blender.oob_score_

0.973

In [48]:
# evaluate blender on test set

X_test_pred = np.empty((len(X_test), len(estimators)), dtype=object)

for index, estimator in enumerate(estimators):
    X_test_pred[:, index] = estimator.predict(X_test)

# feed the base classifiers' test-set predictions to the blender
y_pred = forest_blender.predict(X_test_pred)

accuracy_score(y_test, y_pred)

0.9684

The stacking ensemble performs slightly worse than the voting classifier.

In [50]:
# try again with a StackingClassifier

# combine the training and validation sets
# no separate validation set is needed since StackingClassifier uses cross-validation
X_train_full, y_train_full = X_mnist[:60_000], y_mnist[:60_000]

# train and fit the model
stack_clf = StackingClassifier(named_estimators,
                               final_estimator=forest_blender)
stack_clf.fit(X_train_full, y_train_full)

In [51]:
stack_clf.score(X_test, y_test)

0.9795

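The ```StackingClassifier``` performs much better than the hand-built blender (97.95% vs. 96.84% test accuracy). This is likely due in part to its larger training set (the training and validation sets combined). In addition, instead of hard class predictions, it feeds the blender each classifier's ```predict_proba()``` output (or ```decision_function()``` when probabilities are not available), so the blender gets much more nuanced inputs to work with.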
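To check which method a ```StackingClassifier``` uses for each base estimator, you can inspect its fitted ```stack_method_``` attribute. The sketch below does this on the moons dataset rather than MNIST so it runs in seconds. The `demo_stack` and `_demo` names and the estimator choices are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X_demo, y_demo = make_moons(n_samples=300, noise=0.30, random_state=42)

demo_stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
        ("svm", LinearSVC(random_state=42)),   # has no predict_proba()
    ],
    final_estimator=LogisticRegression(random_state=42),
    cv=5,
)
demo_stack.fit(X_demo, y_demo)

# method chosen for each base estimator, in order
print(demo_stack.stack_method_)  # prints: ['predict_proba', 'decision_function']

# the continuous-valued features the blender receives for the first 3 instances
blender_inputs = demo_stack.transform(X_demo[:3])
print(blender_inputs)
```

The blender sees probabilities and decision scores, not the hard labels used in the hand-built version, which is one plausible reason for the gap in test accuracy.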