
# Introduction to Ensemble Methods

## What Are Ensemble Methods?

**Ensemble methods** are machine learning techniques that combine the predictions of multiple models to improve predictive performance. The main idea is that different models make different mistakes, so combining them lets the strengths of some compensate for the weaknesses of others, which tends to give more robust and accurate predictions than any single model.
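
To make this intuition concrete, the sketch below computes the accuracy of a majority vote over `n_models` binary classifiers, under the idealized assumption that each one is correct independently with the same probability `p`. The function name `majority_vote_accuracy` and the value `p = 0.6` are illustrative choices, not part of any library. Real models make correlated errors, so gains in practice are usually smaller.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models independent
    binary classifiers, each correct with probability p, is correct."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Odd ensemble sizes avoid ties in a binary vote.
for n in (1, 5, 25, 101):
    print(f"{n:>3} models -> majority-vote accuracy {majority_vote_accuracy(n, 0.6):.3f}")
```

Under these assumptions the vote becomes more accurate as the ensemble grows, which is why diversity among the base models matters.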

## Why Use Ensemble Methods?

Ensemble methods are used for several reasons:

-   **Improved Accuracy**: They often outperform individual models. Averaging-based ensembles such as bagging mainly reduce variance, while boosting mainly reduces bias.
-   **Robustness**: Averaging-based ensembles are typically less sensitive to noise and outliers, leading to more stable predictions.
-   **Reduced Overfitting**: By averaging predictions from multiple models, ensembles can reduce overfitting compared with a single high-variance model.
-   **Interpretability**: Although an ensemble is harder to inspect than a single model, some ensemble methods, like Random Forests, provide insights into feature importance.
-   **Flexibility**: They can be applied to various base models, including decision trees, linear models, and neural networks.
-   **Scalability**: Many ensemble methods can be parallelized, making them suitable for large datasets.
-   **Combining Different Algorithms**: Ensembles can combine predictions from different types of models, enhancing overall performance.

## Types of Ensemble Methods

Ensemble methods can be broadly categorized into the following types:

-   **Bagging (Bootstrap Aggregating)**: Trains multiple models on bootstrap samples of the training data (random subsets drawn with replacement) and combines their predictions. The most common example is the Random Forest, which builds many decision trees and combines their predictions by majority vote (classification) or averaging (regression).
-   **Pasting**: Similar to bagging, but draws the subsets of the training data without replacement.
-   **Boosting**: Builds models sequentially, where each new model focuses on correcting the errors made by the previous models. Examples include AdaBoost, Gradient Boosting, and XGBoost.
-   **Other approaches**: Stacking, Blending, Bayesian Model Averaging, Voting, etc.
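
The difference between bagging, pasting, and boosting can be sketched in scikit-learn as follows. This is a minimal illustration, not a tuned comparison: the choice of 50 estimators, the `max_samples=0.8` subset size, and the use of 5-fold cross-validation on the Iris dataset are all assumptions made for the example. In `BaggingClassifier`, the `bootstrap` flag switches between sampling with replacement (bagging) and without replacement (pasting).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    # Bagging: each tree is trained on a sample drawn WITH replacement.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, max_samples=0.8,
        bootstrap=True, random_state=0,
    ),
    # Pasting: same idea, but each sample is drawn WITHOUT replacement.
    "pasting": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, max_samples=0.8,
        bootstrap=False, random_state=0,
    ),
    # Boosting: models are fitted one after another, each one
    # reweighting the examples its predecessors misclassified.
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
}

cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    cv_means[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {cv_means[name]:.3f}")
```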

## Practical Demonstration

To illustrate the concepts of ensemble methods, we will use the Iris dataset and train both a Random Forest and a Gradient Boosting classifier. We will compare their performance on a held-out test set using accuracy, classification reports, and confusion matrices, and then inspect their feature importances.

-   Loading the `iris` dataset

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
class_names = iris.target_names

-   Train-test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

-   Train a Random Forest Classifier and evaluate its performance

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

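# Train a forest of 100 trees; random_state makes the run reproducible
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict class labels for the held-out test set
y_pred_rf = rf_model.predict(X_test)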

print("Random Forest Classifier Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf, target_names=class_names))

ConfusionMatrixDisplay.from_estimator(rf_model, X_test, y_test,
                                      display_labels=class_names, cmap='Blues')
plt.title('Random Forest Confusion Matrix')
plt.show()

-   Train a Gradient Boosting Classifier and evaluate its performance

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

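# Fit 100 sequential boosting stages; random_state makes the run reproducible
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

# Predict class labels for the held-out test set
y_pred_gb = gb_model.predict(X_test)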

print("Gradient Boosting Classifier Accuracy:", accuracy_score(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb, target_names=class_names))

ConfusionMatrixDisplay.from_estimator(gb_model, X_test, y_test,
                                      display_labels=class_names, cmap='Blues')
plt.title('Gradient Boosting Confusion Matrix')
plt.show()
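
A single test split of 45 samples gives a fairly noisy accuracy estimate, so small differences between the two models should not be over-interpreted. As a hedged sketch of a steadier comparison, the cell below scores both classifiers with 5-fold stratified cross-validation on the full dataset. The fold count, the shuffling seed, and the names `candidates` and `cv_results` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Stratified folds keep the class balance the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

cv_results = {}
for name, model in candidates.items():
    # One accuracy score per fold
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the spread across folds makes it easier to judge whether one model is actually better than the other.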

-   Feature importance for Random Forest

In [None]:
# Impurity-based importances (they sum to 1), sorted from most to least important
importances = rf_model.feature_importances_
indices = importances.argsort()[::-1]
print("Feature importances for Random Forest:")
for i in range(X.shape[1]):
    print(f"{feature_names[indices[i]]}: {importances[indices[i]]:.4f}")

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices], align='center')
plt.xticks(range(X.shape[1]), [feature_names[i] for i in indices], rotation=45)
plt.title('Feature Importances from Random Forest')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.tight_layout()
plt.show()

-   Feature importance for Gradient Boosting

In [None]:
# Impurity-based importances (they sum to 1), sorted from most to least important
gb_importances = gb_model.feature_importances_
gb_indices = gb_importances.argsort()[::-1]
print("Feature importances for Gradient Boosting:")
for i in range(X.shape[1]):
    print(f"{feature_names[gb_indices[i]]}: {gb_importances[gb_indices[i]]:.4f}")

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), gb_importances[gb_indices], align='center')
plt.xticks(range(X.shape[1]), [feature_names[i] for i in gb_indices], rotation=45)
plt.title('Feature Importances from Gradient Boosting')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.tight_layout()
plt.show()

## Hands-on Exercises

**Voting Classifier**: Implement a Voting Classifier that combines the predictions of the Random Forest and Gradient Boosting classifiers. Evaluate its performance on the test set.

-   Import, instantiate, and train a `VotingClassifier` model from `sklearn.ensemble`.

-   Evaluate the performance of the Voting Classifier
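
As a starting point, one possible solution is sketched below. It is self-contained and repeats the same train-test split used earlier. The choice of `voting="soft"` (averaging predicted class probabilities) and the names `voting_model`, `y_pred_voting`, and `voting_accuracy` are assumptions for this sketch; `voting="hard"` (majority vote on predicted labels) is an equally valid option to try.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Combine the two ensembles; soft voting averages their class probabilities.
voting_model = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",
)
voting_model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred_voting = voting_model.predict(X_test)
voting_accuracy = accuracy_score(y_test, y_pred_voting)
print("Voting Classifier Accuracy:", voting_accuracy)
print(classification_report(y_test, y_pred_voting, target_names=iris.target_names))
```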