# **CHAPTER 7**
# **Ensemble Learning and Random Forests**

Ensemble Learning is based on the idea that combining multiple predictors often leads to better performance than relying on a single model. This concept is similar to the “wisdom of the crowd,” where aggregated opinions tend to be more accurate than individual judgments. In Machine Learning, an ensemble consists of multiple predictors (classifiers or regressors), and their predictions are combined to produce a final result.
Ensemble methods are usually applied near the end of an ML project, after several good models have already been built. By aggregating diverse predictors, ensembles can reduce variance, improve generalization, and achieve higher accuracy. Random Forests are one of the most well-known and powerful ensemble methods, but other techniques such as bagging, boosting, and stacking are also widely used.

**Voting Classifiers**

A voting classifier combines the predictions of multiple classifiers and predicts the class that receives the most votes. This method works best when the individual classifiers are diverse and make different types of errors. Even weak learners (classifiers that do only slightly better than random guessing) can form a strong learner when combined, provided there are enough of them and their errors are not highly correlated.
There are two main types of voting:
•	Hard voting, which selects the class with the majority of votes.
•	Soft voting, which averages predicted class probabilities and selects the class with the highest average probability.
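
Soft voting can be sketched as follows. This is a minimal example, assuming the Iris dataset and three illustrative classifiers; the name `soft_voting_clf` is hypothetical. Soft voting requires every estimator to expose `predict_proba`, which is why `SVC` is created with `probability=True` here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=42)),
        # probability=True makes SVC estimate class probabilities
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_voting_clf.fit(X_train, y_train)

# Class probabilities averaged over the three classifiers, one row per instance
avg_proba = soft_voting_clf.predict_proba(X_test)
# The predicted class is the one with the highest average probability
soft_pred = soft_voting_clf.predict(X_test)
```

Because soft voting gives more weight to confident predictions, it is worth comparing against hard voting when all estimators can produce probabilities.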


In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris["data"]
y = iris["target"]

# split train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [4]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(max_iter=1000)
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)

voting_clf.fit(X_train, y_train)


In [5]:
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 1.0
RandomForestClassifier 1.0
SVC 1.0
VotingClassifier 1.0


**Bagging and Pasting**

Bagging (Bootstrap Aggregating) and pasting are ensemble techniques that train multiple predictors on different random subsets of the training data. The difference lies in sampling:
•	Bagging samples with replacement.
•	Pasting samples without replacement.
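Bagging often yields better models. Sampling with replacement adds diversity to the subsets each predictor is trained on, which slightly raises each predictor's bias but makes the predictors less correlated, so the ensemble's variance drops. In BaggingClassifier, pasting is selected by setting bootstrap=False. Both methods reduce variance without significantly increasing bias, and because the predictors are trained independently of each other they can be trained in parallel, which makes these methods scale well.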


In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

**Out-of-Bag (OOB) Evaluation**

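In bagging, some training instances are never sampled for a given predictor. When each bootstrap sample has the same size as the training set, a predictor sees only about 63% of the training instances on average; the remaining 37% or so are its out-of-bag (OOB) instances, and they differ from one predictor to the next. Since a predictor never sees its OOB instances during training, they can be used for validation without a separate validation set.
OOB evaluation provides a useful estimate of test performance while saving data and computation. Setting oob_score=True requests this evaluation, and the result is stored in oob_score_.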
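
The proportion of out-of-bag instances can be checked numerically. The sketch below assumes a hypothetical training set of 10,000 instances and draws a single bootstrap sample of the same size; the fraction of instances that are never drawn should be close to 1/e (about 37%).

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000  # hypothetical training-set size

# One bootstrap sample: m indices drawn with replacement
bootstrap_idx = rng.integers(0, m, size=m)

# Instances that were never drawn are out-of-bag for this predictor
oob_mask = np.ones(m, dtype=bool)
oob_mask[bootstrap_idx] = False
oob_fraction = oob_mask.mean()

print(f"OOB fraction:          {oob_fraction:.3f}")
print(f"Theoretical limit 1/e: {np.exp(-1):.3f}")
```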


In [7]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.95

In [8]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

1.0

In [9]:
bag_clf.oob_decision_function_

array([[1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 0.03626943, 0.96373057],
       [0.        , 1.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 0.02150538, 0.97849462],
       [0.        , 0.96987952, 0.03012048],
       [0.        , 1.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        ],
       [0.        , 0.63473054, 0.36526946],
       [0.        , 0.        , 1.        ],
       [0.        , 1.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [0.        , 1.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [0.

**Random Patches and Random Subspaces**

Beyond sampling instances, BaggingClassifier also supports sampling features.
•	Random Subspaces: sample features only.
•	Random Patches: sample both instances and features.
These techniques further increase diversity among predictors, which helps reduce variance, especially in high-dimensional datasets.
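
Both strategies are controlled by the `max_features` and `bootstrap_features` hyperparameters of `BaggingClassifier`, which mirror `max_samples` and `bootstrap` for instances. The sketch below is illustrative only: the variable names and the sampling ratios (0.5 and 0.7) are assumptions, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random Subspaces: keep every training instance, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42)

# Random Patches: sample both the instances and the features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42)

for name, clf in (("subspaces", subspaces_clf), ("patches", patches_clf)):
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```

After fitting, the `estimators_features_` attribute lists the feature indices each predictor was trained on.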

**Random Forests**

Random Forests are ensembles of Decision Trees trained using bagging, with an additional layer of randomness introduced during tree construction. At each split, only a random subset of features is considered. This reduces correlation between trees and improves generalization.

In [10]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

In [11]:
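# Roughly equivalent to the Random Forest above: bagged Decision Trees that
# each consider only a random subset of features at every split
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)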

**Extra-Trees (Extremely Randomized Trees)**

Extra-Trees introduce even more randomness than Random Forests by choosing random split thresholds instead of optimal ones. This makes them faster to train and can further reduce variance, although at the cost of increased bias.
They use the same API as Random Forests, making them easy to experiment with and compare using cross-validation.
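
Since the API is shared, comparing the two takes only a few lines. The sketch below assumes the Iris dataset, 200 trees, and 5-fold cross-validation; these are illustrative choices, and which model comes out ahead depends on the dataset.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=200, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    scores = cross_val_score(model, X, y, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```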

**Feature Importance**

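Random Forests provide a convenient way to measure feature importance. Scikit-Learn measures how much the tree nodes that use a given feature reduce impurity on average across all trees in the forest, weighting each node by the number of training samples associated with it. The scores are scaled so that they sum to 1 and are stored in feature_importances_ after training. These importance scores help with feature selection and model interpretability.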


In [12]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10355874449837586
sepal width (cm) 0.024339552069736804
petal length (cm) 0.44266706254079685
petal width (cm) 0.42943464089109046


**Boosting Overview**

Boosting is an ensemble technique that builds predictors sequentially. Each new predictor focuses on correcting the errors made by previous ones. Unlike bagging, boosting cannot be fully parallelized, but it often produces very strong models.
The most popular boosting algorithms are AdaBoost and Gradient Boosting.

**AdaBoost**

AdaBoost increases the weights of misclassified training instances so that subsequent predictors focus more on difficult cases. Each predictor is assigned a weight based on its accuracy, and predictions are combined using weighted voting.
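
The weight updates can be sketched from scratch for a binary problem. This is a simplified illustration, not Scikit-Learn's implementation: it assumes the `make_moons` toy dataset, three decision stumps, and the classic binary formulas (predictor weight `alpha = learning_rate * log((1 - r) / r)` for weighted error rate `r`). The multiclass SAMME variant adds a term that depends on the number of classes. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=42)
m = len(X)
learning_rate = 0.5
sample_weights = np.full(m, 1 / m)  # start with uniform instance weights

stumps, alphas = [], []
for _ in range(3):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=sample_weights)
    wrong = stump.predict(X) != y

    # Weighted error rate of this predictor
    r = sample_weights[wrong].sum() / sample_weights.sum()
    # Predictor weight: larger for more accurate predictors
    alpha = learning_rate * np.log((1 - r) / r)

    # Boost the weights of misclassified instances, then normalize
    sample_weights[wrong] *= np.exp(alpha)
    sample_weights /= sample_weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Weighted vote: map classes {0, 1} to {-1, +1} and sum the weighted votes
votes = sum(a * (2 * s.predict(X) - 1) for a, s in zip(alphas, stumps))
ensemble_pred = (votes > 0).astype(int)
print("training accuracy:", (ensemble_pred == y).mean())
```

Each round shifts weight toward the instances the previous stump got wrong, so the next stump is pushed to place its split where the earlier ones failed.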


In [14]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME", learning_rate=0.5)
ada_clf.fit(X_train, y_train)



**Gradient Boosting**

Gradient Boosting trains predictors sequentially, where each new model fits the residual errors of the previous ensemble. This method works well for both regression and classification tasks.


In [15]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

In [16]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

In [17]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

In [19]:
X_new = X_train[:5]  # predict on the first 5 samples of X_train


In [21]:
# The ensemble's prediction is the sum of the three trees' predictions:
# tree_reg1 predicts the target, while tree_reg2 and tree_reg3 predict
# the successive residual errors they were trained on. The trees must not
# be refit on the original targets, or the residual structure is lost.
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
print(y_pred)


In [22]:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

In [23]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

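# Hold out a validation set (by default 25% of the data)
X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 120 trees
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
# Number of trees that gives the lowest validation error
bst_n_estimators = int(np.argmin(errors)) + 1

# Retrain using only that many trees
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)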

In [24]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break # early stopping

In [29]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)

**Stacking**

Stacking trains a meta-model (blender) to combine the predictions of base models instead of using simple voting. Base models are trained first, their predictions are used as inputs, and the blender learns how to best combine them.
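Scikit-Learn supports stacking natively through its StackingClassifier and StackingRegressor classes (available since version 0.22). Because the blender learns how much to trust each base model, stacking can outperform simple voting, at the cost of extra training time.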
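
A minimal sketch of the idea using `StackingClassifier` from `sklearn.ensemble` (Scikit-Learn 0.22 or later). The choice of base models, the Logistic Regression blender, and the 5-fold setting are illustrative assumptions rather than a recommended configuration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

stacking_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    # The blender (meta-model) is trained on the base models' predictions
    final_estimator=LogisticRegression(max_iter=1000),
    # Those predictions are produced out-of-fold with 5-fold cross-validation
    cv=5,
)
stacking_clf.fit(X_train, y_train)

stacking_acc = stacking_clf.score(X_test, y_test)
print("stacking accuracy:", stacking_acc)
```

Using out-of-fold predictions to train the blender keeps it from learning on predictions the base models made for instances they had already seen.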
