<a href="https://colab.research.google.com/github/adamd1985/Lectures_On_MLAI/blob/main/4_7_Ensemble_Lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ensemble Methods

Ensemble methods combine the predictions of several other machine learning models, where by aggregating the results of the other models trained on the same dataset. These methods are classified into four general groups:

- Voting Methods: make predictions based on the majority label predicted.
  - Bagging Method: train individual models on random subsets, before voting.
  - Boosting Methods: train individual models sequentially by learning from the errors before voting.
- Stacking Methods: train individual models, and use a meta-model to aggregate their predictions.


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

We'll use the breast cancer dataset again in this notebook.

In [None]:
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer(as_frame=True)
X, y = breast_cancer.data, breast_cancer.target
breast_cancer.frame.sample(3)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
435,13.98,19.62,91.12,599.5,0.106,0.1133,0.1126,0.06463,0.1669,0.06544,...,30.8,113.9,869.3,0.1613,0.3568,0.4069,0.1827,0.3179,0.1055,0
372,21.37,15.1,141.3,1386.0,0.1001,0.1515,0.1932,0.1255,0.1973,0.06183,...,21.84,152.1,1535.0,0.1192,0.284,0.4024,0.1966,0.273,0.08666,0
376,10.57,20.22,70.15,338.3,0.09073,0.166,0.228,0.05941,0.2188,0.0845,...,22.82,76.51,351.9,0.1143,0.3619,0.603,0.1465,0.2597,0.12,1


In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = breast_cancer.data.values, breast_cancer.target.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Voting Methods in Ensemble Learning

Ensemble learning leverages multiple models to improve predictive performance. A fundamental technique in ensemble learning is **voting**, where multiple classifiers are trained independently, and their predictions are aggregated to make a final decision.

Hard voting, also known as majority voting, involves taking the most frequently predicted class label among all base models. Given $m$ classifiers, each producing a prediction $y_i^j$ for sample $x_i$, the ensemble prediction is determined as:

$$
\hat{y}_i = \text{mode}(y_i^1, y_i^2, ..., y_i^m)
$$

where $\text{mode}(\cdot)$ returns the most common class among all base classifiers.

For binary classification, this can be formally expressed using the indicator function:

$$
\hat{y}_i = \mathbb{I} \left( \frac{1}{m} \sum_{j=1}^{m} y_i^j > 0.5 \right)
$$

where $\mathbb{I}$ is the indicator function that assigns class 1 if the majority votes for it; otherwise, class 0.


Soft voting takes into account the predicted probabilities from each classifier rather than just the final class label. Each classifier outputs a probability distribution over the possible classes:

$$
P(y_i = c | x_i) = \frac{1}{m} \sum_{j=1}^{m} P_j(y_i = c | x_i)
$$

where $P_j(y_i = c | x_i)$ is the probability assigned to class $c$ by the $j$-th model.

The final prediction is then assigned to the class with the highest averaged probability:

$$
\hat{y}_i = \arg\max_{c} \left( \frac{1}{m} \sum_{j=1}^{m} P_j(y_i = c | x_i) \right)
$$

Soft voting generally works better when classifiers output well-calibrated probabilities.

For regression tasks, the final prediction is computed as the average of all base model predictions:

$$
\hat{y}_i = \frac{1}{m} \sum_{j=1}^{m} \hat{y}_i^j
$$

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

lr_preds = lr_model.predict(X_test)
lr_acc = accuracy_score(y_test, lr_preds)
print('Logistic Regression accuracy is {0:7.4f} %'.format(lr_acc*100))

Logistic Regression accuracy is 95.6140 %


In [None]:
from sklearn.svm import SVC

# Set probability=True to access probas.
svm_model = SVC(probability=True, random_state=1)
svm_model.fit(X_train, y_train)

svm_preds = svm_model.predict(X_test)
svm_acc = accuracy_score(y_test, svm_preds)
print('SVM accuracy is {0:7.4f} %'.format(svm_acc*100))

SVM accuracy is 94.7368 %


In [None]:
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(random_state=1)
tree_model.fit(X_train, y_train)
tree_preds = tree_model.predict(X_test)

tree_acc = accuracy_score(y_test, tree_preds)
print('Decision Tree accuracy is {0:7.4f} %'.format(tree_acc * 100))

Decision Tree accuracy is 93.8596 %


Now let's build a hard and soft voting mechanism.

In [None]:
# hard voting (majority vote)
predictions = np.array([lr_preds, svm_preds, tree_preds])
ensemble_hard_preds = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=predictions)

# soft voting (average probability)
log_reg_probs = lr_model.predict_proba(X_test)
svc_probs = svm_model.predict_proba(X_test)
tree_probs = tree_model.predict_proba(X_test)
avg_probs = (log_reg_probs + svc_probs + tree_probs) / 3
ensemble_soft_preds = np.argmax(avg_probs, axis=1)

hard_acc = accuracy_score(y_test, ensemble_hard_preds)
soft_acc = accuracy_score(y_test, ensemble_soft_preds)
print(f'Hard Voting Classifier accuracy: {hard_acc * 100:.4f}%')
print(f'Soft Voting Classifier accuracy: {soft_acc * 100:.4f}%')

Hard Voting Classifier accuracy: 96.4912%
Soft Voting Classifier accuracy: 95.6140%


We can validate using Scikit learn [VotingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html).

In [None]:
from sklearn.ensemble import VotingClassifier

hard_voting_clf = VotingClassifier(estimators=[('lr', lr_model), ('svc', svm_model), ('tree', tree_model)], voting='hard')
hard_voting_clf.fit(X_train, y_train)
ensemble_hard_preds = hard_voting_clf.predict(X_test)

soft_voting_clf = VotingClassifier(estimators=[('lr', lr_model), ('svc', svm_model), ('tree', tree_model)], voting='soft')
soft_voting_clf.fit(X_train, y_train)
ensemble_soft_preds = soft_voting_clf.predict(X_test)

hard_acc = accuracy_score(y_test, ensemble_hard_preds)
soft_acc = accuracy_score(y_test, ensemble_soft_preds)
print(f'Hard Voting Classifier accuracy: {hard_acc * 100:.4f}%')
print(f'Soft Voting Classifier accuracy: {soft_acc * 100:.4f}%')

Hard Voting Classifier accuracy: 96.4912%
Soft Voting Classifier accuracy: 95.6140%


## Bagging Ensemble

The term **bagging** is short for *bootstrap aggregating*, where bootstrapping refers to sampling subsets from the training data with replacement. Bagging is an ensemble learning technique that aims to reduce variance and improve model stability by training multiple base learners on different random subsets of the dataset and aggregating their predictions.

Given a training dataset of size $N$, bagging creates $m$ bootstrap samples $D_1, D_2, ..., D_m$ by randomly sampling with replacement from the original dataset.

Each base model $f_j(x)$ is trained independently on a bootstrap sample $D_j$ and produces a prediction $\hat{y}_i^j$ for an input sample $x_i$.



In [None]:
from sklearn.utils import resample

n_estimators = 10
tree_models = []
for _ in range(n_estimators):
    X_resampled, y_resampled = resample(X_train, y_train, random_state=1)
    tree = DecisionTreeClassifier(random_state=1)
    tree.fit(X_resampled, y_resampled)
    tree_models.append(tree)

In [None]:
tree_preds = np.array([tree.predict(X_test) for tree in tree_models])
ensemble_hard_preds = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=tree_preds)

tree_probs = np.mean([tree.predict_proba(X_test) for tree in tree_models], axis=0)
ensemble_soft_preds = np.argmax(tree_probs, axis=1)

hard_acc = accuracy_score(y_test, ensemble_hard_preds)
soft_acc = accuracy_score(y_test, ensemble_soft_preds)
print(f'Bagging Hard Voting Classifier accuracy: {hard_acc * 100:.4f}%')
print(f'Bagging Soft Voting Classifier accuracy: {soft_acc * 100:.4f}%')

Bagging Hard Voting Classifier accuracy: 92.9825%
Bagging Soft Voting Classifier accuracy: 92.9825%


Bagging Ensembles often outperform Voting Ensembles, and they can decrease the overfitting of Decision Trees (which tend to overfit easily). As well as, Bagging Ensembles can be trained in parallel using different CPU cores, which can reduce the processing time.  

Random Forest can be considered a Bagging Ensemble trained on the whole dataset if you set max samples to 1, in fact let's check using scikit [RandomForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=10, max_samples=1.0, bootstrap=True, random_state=1)
rf_model.fit(X_train, y_train)

ensemble_hard_preds = rf_model.predict(X_test)
ensemble_soft_preds = np.argmax(rf_model.predict_proba(X_test), axis=1)
hard_acc = accuracy_score(y_test, ensemble_hard_preds)
soft_acc = accuracy_score(y_test, ensemble_soft_preds)
print(f'Random Forest (Bagging) Hard Voting accuracy: {hard_acc * 100:.4f}%')
print(f'Random Forest (Bagging) Soft Voting accuracy: {soft_acc * 100:.4f}%')

Random Forest (Bagging) Hard Voting accuracy: 95.6140%
Random Forest (Bagging) Soft Voting accuracy: 95.6140%


## Boosting Ensemble

Boosting is an ensemble learning technique that trains individual models **sequentially**, where each subsequent model learns from the errors of the previous models and attempts to improve overall performance. Unlike bagging, which trains models independently in parallel, boosting adjusts model weights dynamically to focus on **misclassified samples**.

Given a dataset with $N$ training samples, boosting maintains a weight distribution $w_i$ for each training sample, adjusting it iteratively to give more importance to **misclassified samples**.

Each model $f_t(x)$ is trained sequentially on a weighted version of the dataset, where misclassified instances receive higher weights.

### Gradient Boosting

Gradient Boosting is an ensemble learning method that sequentially trains weak models, typically decision trees, to correct errors made by the previous models. It leverages gradient descent to minimize a predefined loss function, making it effective for both regression and classification.

The initial model $ F_0(x) $ is chosen to minimize the loss function over the training data. For regression we have:

$$
F_0(x) = \frac{1}{N} \sum_{i=1}^{N} y_i
$$

For classification (e.g., with log loss), the initial prediction is based on the log-odds:

$$
F_0(x) = \log \frac{p}{1 - p}
$$

where $ p $ is the prior probability of the positive class.


For each boosting step $ m $, compute the **pseudo-residuals**, which represent the negative gradient of the loss function with respect to the current model's predictions. For squared error loss $ L(y_i, F) = (y_i - F)^2 $, this simplifies to:

$$
r_i^{(m)} = y_i - F_{m-1}(x_i)
$$

For log loss in binary classification:

$$
r_i^{(m)} = y_i - \sigma(F_{m-1}(x_i))
$$

where $ \sigma(F) = \frac{1}{1 + e^{-F}} $ is the sigmoid function.

Train a weak model (e.g., a decision tree) $ h_m(x) $ to predict the residuals:

$$
h_m(x) = \arg\min_h \sum_{i=1}^{N} \left( r_i^{(m)} - h(x_i) \right)^2
$$

This weak learner captures patterns in the residuals. To determine the optimal step size $ \rho_m $, minimize the loss function:

$$
\rho_m = \arg\min_\rho \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \rho h_m(x_i))
$$

For squared error loss:

$$
\rho_m = \frac{\sum_{i=1}^{N} r_i^{(m)} h_m(x_i)}{\sum_{i=1}^{N} h_m(x_i)^2}
$$


The new model is updated by adding the scaled weak learner:

$$
F_m(x) = F_{m-1}(x) + \eta \rho_m h_m(x)
$$

where $ \eta $ is the learning rate, controlling the contribution of each weak learner.

obtaining the final model:

$$
F(x) = F_0(x) + \sum_{m=1}^{M} \eta \rho_m h_m(x)
$$

For regression, $ F(x) $ is the predicted value. For classification, predictions are obtained via:

$$
P(y = 1 \mid x) = \sigma(F(x))
$$


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import log_loss, accuracy_score

n_estimators = 50
learning_rate = 0.1
max_depth = 3
p = np.mean(y_train)
F_train = np.full_like(y_train, np.log(p / (1 - p)), dtype=np.float64)

for _ in range(n_estimators):
    # Compute pseudo residuals
    p_train = 1 / (1 + np.exp(-F_train))
    residuals = y_train - p_train

    tree = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=5)
    tree.fit(X_train, residuals)

    tree_preds = tree.predict(X_train)

    numerator = np.sum(residuals * tree_preds)
    denominator = np.sum(tree_preds * tree_preds)
    rho = numerator / denominator if denominator != 0 else 0

    F_train += learning_rate * rho * tree_preds

F_test = np.full_like(y_test, np.log(p / (1 - p)), dtype=np.float64)
for _ in range(n_estimators):
    tree_preds_test = tree.predict(X_test)
    F_test += learning_rate * rho * tree_preds_test
ensemble_probs = 1 / (1 + np.exp(-F_test))
ensemble_preds = (ensemble_probs >= 0.5).astype(int)

logloss = log_loss(y_test, ensemble_probs)
accuracy = accuracy_score(y_test, ensemble_preds)
print(f'Gradient Boosting Log Loss: {logloss:.4f}')
print(f'Gradient Boosting Accuracy: {accuracy * 100:.4f}%')

Gradient Boosting Log Loss: 0.3612
Gradient Boosting Accuracy: 96.4912%


Let's test it against Scikit's [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

boosting_model = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=learning_rate, max_depth=max_depth, random_state=1)
boosting_model.fit(X_train, y_train)

ensemble_probs = boosting_model.predict_proba(X_test)[:, 1]  # Probability of class 1
ensemble_preds = boosting_model.predict(X_test)
logloss = log_loss(y_test, ensemble_probs)
accuracy = accuracy_score(y_test, ensemble_preds)
print(f'Gradient Boosting Log Loss: {logloss:.4f}')
print(f'Gradient Boosting Accuracy: {accuracy * 100:.4f}%')

Gradient Boosting Log Loss: 0.1121
Gradient Boosting Accuracy: 95.6140%


# Stacking Methods

Stacking is an ensemble learning technique that combines multiple base models by training a higher-level **meta-model** to aggregate their predictions. Unlike hard or soft voting methods, stacking learns how to best combine base model outputs by leveraging another model that specializes in integrating predictions.

Let $f_1(x), f_2(x), ..., f_m(x)$ represent the $m$ base models. For a given input $x$, the predictions from the base models are:

$$
\mathbf{h}(x) = [f_1(x), f_2(x), ..., f_m(x)]
$$

These predictions form a new feature vector $\mathbf{h}(x)$ for the meta-model $g(\mathbf{h}(x))$. The final stacked prediction is then:

$$
\hat{y} = g(\mathbf{h}(x))
$$


In [None]:
log_reg = LogisticRegression(max_iter=10000, random_state=1)
log_reg.fit(X_train, y_train)
log_reg_preds = log_reg.predict_proba(X_test)[:, 1]

svc_model = SVC(probability=True, random_state=1)
svc_model.fit(X_train, y_train)
svc_preds = svc_model.predict_proba(X_test)[:, 1]

tree_model = DecisionTreeClassifier(random_state=1)
tree_model.fit(X_train, y_train)
tree_preds = tree_model.predict_proba(X_test)[:, 1]

stacked_features = np.column_stack((log_reg_preds, svc_preds, tree_preds))

meta_model = LogisticRegression(max_iter=10000, random_state=1)
meta_model.fit(stacked_features, y_test)
meta_preds = meta_model.predict_proba(stacked_features)[:, 1]
final_preds = meta_model.predict(stacked_features)

logloss = log_loss(y_test, meta_preds)
accuracy = accuracy_score(y_test, final_preds)
print(f'Stacking Log Loss: {logloss:.4f}')
print(f'Stacking Accuracy: {accuracy * 100:.4f}%')

Stacking Log Loss: 0.1162
Stacking Accuracy: 96.4912%


In [None]:
from sklearn.ensemble import StackingClassifier

estimators = [
    ('log_reg', LogisticRegression(max_iter=10000, random_state=1)),
    ('svc', SVC(probability=True, random_state=1)),
    ('tree', DecisionTreeClassifier(random_state=1))
]

stacking_model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(max_iter=10000, random_state=1))
stacking_model.fit(X_train, y_train)

ensemble_probs = stacking_model.predict_proba(X_test)[:, 1]  # Probability of class 1
ensemble_preds = stacking_model.predict(X_test)
logloss = log_loss(y_test, ensemble_probs)
accuracy = accuracy_score(y_test, ensemble_preds)
print(f'Stacking Log Loss: {logloss:.4f}')
print(f'Stacking Accuracy: {accuracy * 100:.4f}%')

Stacking Log Loss: 0.0985
Stacking Accuracy: 95.6140%


# Conclusion

In this notebook, we have explored several key ensemble learning techniques and their unique strengths:

1. **Bagging:** By training multiple models on different bootstrapped subsets of the data, we learned how to reduce variance and create a robust ensemble that outperforms individual models. The Random Forest is a classic example that demonstrates bagging’s power to handle noise and provide stable predictions.

2. **Boosting:** We saw how boosting sequentially trains weak learners, focusing on correcting the errors of prior models. By optimizing the loss function step-by-step, boosting algorithms—such as AdaBoost and Gradient Boosting—achieve high accuracy and are especially effective at handling complex patterns and imbalanced datasets.

3. **Voting:** Hard and soft voting methods taught us how simple aggregations of multiple models’ predictions can yield improved performance. These approaches illustrate that even without sophisticated training strategies, combining diverse model types can enhance overall accuracy.

4. **Stacking:** Finally, stacking introduced the concept of training a meta-model to learn how best to combine the outputs of multiple base models. By using a higher-level model to integrate predictions, we gain the ability to exploit the complementary strengths of various learning algorithms.

We have seen that ensemble methods are some of the most effective tools in the machine learning toolbox. They consistently deliver top-tier performance on tabular data, often with minimal hyperparameter tuning. By understanding and leveraging these techniques, we can tackle a wide range of predictive tasks with confidence and reliability.
