# Ensemble Learning

Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

- Ensemble learning is a machine learning technique in which multiple individual models, often called base estimators or weak learners, are trained and their predictions combined to solve a single prediction problem.

- The key idea behind ensemble learning is to leverage the "wisdom of the crowd" principle. By aggregating the predictions of several diverse models, an ensemble often achieves better performance and generalization than any single constituent model. The collective decision tends to be more robust and accurate because the errors of individual models partly cancel each other out, provided those errors are not strongly correlated.
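
The cancelling-out argument can be made concrete with a small calculation. The sketch below assumes an idealized setting that real ensembles only approximate: every base model is correct with the same probability `p`, and the models' errors are fully independent. The function name `majority_vote_accuracy` is illustrative.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes an odd n_models (no ties) and that each model is right with
    probability p, independently of the others.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    majority = n_models // 2 + 1
    # Binomial tail: sum over all outcomes where at least `majority` are right.
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


for n in (1, 5, 25, 101):
    acc = majority_vote_accuracy(n, 0.6)
    print(f"{n:>3} models at p=0.6 -> ensemble accuracy {acc:.3f}")
```

With `p` below 0.5 the same formula shows the ensemble getting worse as models are added, which is why base models need to be better than chance and reasonably decorrelated.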

Question 2: What is the difference between Bagging and Boosting?

| Aspect | Bagging | Boosting |
| --- | --- | --- |
| Training | Models are trained independently and in parallel. | Models are trained sequentially and adaptively. |
| Data Sampling | Uses bootstrap samples (random sampling with replacement) to create different training sets for each model. | Uses the full dataset, with subsequent models focusing on the samples that were misclassified or poorly predicted by previous models. |
| Model Weights | All individual models are typically given equal weight in the final prediction. | Models are typically weighted, with more accurate models having a greater influence on the final prediction. |
| Primary Goal | Decrease variance and prevent overfitting, especially for complex or high-variance base models like Decision Trees. | Decrease bias by converting a set of weak learners into a strong learner. |
| Examples | Random Forest, Bagging with Decision Trees, Bagging with k-Nearest Neighbors. | AdaBoost, Gradient Boosting Machines (GBM), XGBoost, LightGBM. |
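
The sequential reweighting described above for Boosting can be illustrated with a single round of discrete AdaBoost. This is a minimal sketch, assuming labels in {-1, +1} and sample weights that sum to 1. Other boosting methods (for example gradient boosting) fit residuals or gradients instead of reweighting samples, and the helper name `adaboost_reweight` is illustrative.

```python
from math import exp, log


def adaboost_reweight(weights, y_true, y_pred):
    """One discrete-AdaBoost round: return (alpha, new_weights).

    weights: current sample weights, summing to 1
    y_true, y_pred: labels in {-1, +1}
    """
    # Weighted error of the current weak learner.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0 < error < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    # Model weight: a lower error gives the learner more say in the final vote.
    alpha = 0.5 * log((1 - error) / error)
    # Up-weight mistakes, down-weight correct predictions, then renormalize.
    raw = [w * exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]


weights = [0.25, 0.25, 0.25, 0.25]
y_true = [+1, +1, -1, -1]
y_pred = [+1, +1, -1, +1]  # the last sample is misclassified
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(f"alpha = {alpha:.4f}")
print("new weights:", [round(w, 4) for w in new_weights])
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no counterpart to this step: every model sees an independently drawn bootstrap sample and votes with equal weight.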

Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
- Bootstrap sampling is a resampling technique in which data points are drawn at random from the original dataset with replacement. Any one data point can therefore be selected multiple times in a single bootstrap sample, or not at all. The resulting bootstrap sample has the same size as the original dataset but is a distinct variation of it.
- Role in Bagging (e.g., Random Forest): Bootstrap sampling is fundamental to Bagging (Bootstrap Aggregating). Its role is to:
- Introduce Diversity: By creating $N$ distinct training datasets (where $N$ is the number of base estimators), it ensures that the individual models (e.g., Decision Trees in a Random Forest) are trained on slightly different views of the data.
- Decorrelate Models: Training each model on its own randomly drawn bootstrap sample introduces randomness and helps decorrelate the individual models' errors.
- Reduce Variance: When these diverse models are aggregated (e.g., by averaging their predictions or taking a majority vote), the high variance of individual models (like deep Decision Trees) is significantly reduced, leading to a more robust and generalized final ensemble.
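
A bootstrap sample is easy to draw by hand. The following is a minimal standard-library sketch, in which a list of integers stands in for the row indices of a training set and `bootstrap_sample` is an illustrative helper rather than a library function.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(0)    # fixed seed so the sketch is repeatable
data = list(range(1000))  # stand-in for row indices of a training set

sample = bootstrap_sample(data, rng)
unique_fraction = len(set(sample)) / len(data)

print("sample size:", len(sample))
print(f"fraction of distinct original rows in the sample: {unique_fraction:.3f}")
```

The sample has the same length as the original data, but only part of the original rows appear in it. For large datasets that fraction settles near $1 - 1/e \approx 0.632$, and the rows left out are what later serve as out-of-bag samples.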

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
- Out-of-Bag (OOB) samples are the data points from the original training set that were not included in the bootstrap sample used to train a particular base estimator (e.g., a single Decision Tree in a Random Forest). Because bootstrap sampling is "with replacement," each bootstrap sample typically contains about 63.2% of the distinct original data points (for large $n$, $1 - (1 - 1/n)^n \to 1 - 1/e \approx 0.632$), leaving the remaining $\approx$36.8% as OOB samples for that specific estimator.
- Use of OOB Score for Evaluation:
- The OOB score is a method for evaluating the generalization performance of an ensemble model without the need for a separate validation set or explicit cross-validation.
- Prediction: For each data point in the original training set, only the base estimators that did not use that data point in their training (i.e., the OOB models for that point) are used to make a prediction.
- Aggregation: The OOB predictions from all relevant base estimators for a given data point are aggregated (e.g., majority vote for classification, averaging for regression).
- Score Calculation: The OOB score (e.g., accuracy, $R^2$) is calculated by comparing these aggregated OOB predictions with the true target values across the entire training set.
- The OOB score is considered an approximately unbiased (often slightly pessimistic) estimate of the model's test set performance, effectively acting as an internal cross-validation.
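
The prediction, aggregation, and scoring steps above can be sketched without any library. The example below is an illustration under simplifying assumptions: the data is one-dimensional with noise-free labels, and the base learner is a hypothetical decision stump (`fit_stump`) rather than a full Decision Tree.

```python
import random


def fit_stump(sample):
    """Return the threshold t (taken from the sample) with the fewest errors
    for the rule: predict 1 if x > t, else 0."""
    candidates = [x for x, _ in sample]
    return min(candidates, key=lambda t: sum((x > t) != bool(y) for x, y in sample))


def oob_accuracy(data, n_estimators, rng):
    """Bagging with stumps, scored only on out-of-bag votes."""
    n = len(data)
    votes = [[] for _ in range(n)]  # OOB predictions collected per training point
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        threshold = fit_stump([data[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # this stump never saw point i
                votes[i].append(int(data[i][0] > threshold))
    scored = correct = 0
    for (x, y), v in zip(data, votes):
        if not v:  # point was in every bootstrap sample, so it has no OOB vote
            continue
        majority = int(sum(v) * 2 > len(v))
        scored += 1
        correct += majority == y
    return correct / scored


rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
data = [(x, int(x > 0.5)) for x in xs]  # toy, noise-free labels
print(f"OOB accuracy: {oob_accuracy(data, n_estimators=25, rng=rng):.3f}")
```

Each point is judged only by the stumps that never trained on it, which is why no separate validation split is needed.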

Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
- Single Decision Tree: Feature importance is calculated from how much a feature reduces impurity (such as Gini impurity or entropy) when it is used for splits within that specific tree. This estimate is unstable and susceptible to noise, because a minor change in the training data can drastically alter the tree structure and, consequently, the feature ranking. A deep single tree can also overfit, so its importance scores may favor features that happened to split well on its particular training sample.
- Random Forest: The importance score for a feature is the average of its importance scores across all the individual Decision Trees in the ensemble, which gives a more stable and reliable estimate. Bootstrap sampling and random feature subsetting decorrelate the individual trees, and averaging smooths out their variance, offering a more global and robust view of feature relevance across the entire dataset.
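
To make the impurity-based calculation concrete, here is a minimal sketch of the Gini impurity decrease for one split, followed by averaging per-tree importances. The feature names and per-tree numbers are hypothetical. Library implementations additionally weight each split by the share of samples reaching the node and normalize the totals.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Weighted impurity reduction from splitting parent into left and right."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)


def average_importances(per_tree):
    """Average importance vectors across trees (one dict per tree)."""
    features = sorted({f for tree in per_tree for f in tree})
    return {f: sum(tree.get(f, 0.0) for tree in per_tree) / len(per_tree) for f in features}


parent = [0, 0, 0, 1, 1, 1]
print("perfect split:", impurity_decrease(parent, [0, 0, 0], [1, 1, 1]))
print("useless split:", impurity_decrease(parent, [0, 1], [0, 0, 1, 1]))

# Hypothetical per-tree importances: single trees disagree, the mean is steadier.
trees = [
    {"age": 0.7, "income": 0.3},
    {"age": 0.2, "income": 0.8},
    {"age": 0.6, "income": 0.4},
]
print(average_importances(trees))
```

Any single tree in the hypothetical list would give a different ranking from its neighbors, while the averaged vector changes little when one tree is swapped out.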

''' Question 6: Write a Python program to:

- Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()

- Train a Random Forest Classifier

- Print the top 5 most important features based on feature importance scores. '''

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# 1. Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 2. Train a Random Forest Classifier
# Using a fixed random state for reproducibility
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# 3. Get feature importance scores
feature_importances = pd.Series(model.feature_importances_, index=X.columns)

# Sort the features and get the top 5
top_5_features = feature_importances.nlargest(5)

# Print the top 5 most important features
print("Top 5 Most Important Features:")
print(top_5_features)

Top 5 Most Important Features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


''' Question 7: Write a Python program to:

- Train a Bagging Classifier using Decision Trees on the Iris dataset

- Evaluate its accuracy and compare with a single Decision Tree '''

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Single Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# 3. Bagging Classifier using Decision Trees
# Note: the `estimator` keyword needs scikit-learn >= 1.2 (older releases use `base_estimator`)
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    random_state=42,
    n_jobs=1,
)
bagging_model.fit(X_train, y_train)
bagging_pred = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

# --- 4. Compare Accuracies ---
print(f"Accuracy of Single Decision Tree: {dt_accuracy:.4f}")
print(f"Accuracy of Bagging Classifier: {bagging_accuracy:.4f}")
print("\nComparison: On this small, clean dataset both models reach the same test accuracy. Bagging mainly reduces variance, so its benefit tends to show up on noisier data or across repeated splits.")

Accuracy of Single Decision Tree: 1.0000
Accuracy of Bagging Classifier: 1.0000

Comparison: On this small, clean dataset both models reach the same test accuracy. Bagging mainly reduces variance, so its benefit tends to show up on noisier data or across repeated splits.


''' Question 8: Write a Python program to:

- Train a Random Forest Classifier

- Tune hyperparameters max_depth and n_estimators using GridSearchCV

- Print the best parameters and final accuracy '''

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load the dataset (the question names none; Breast Cancer is used here)
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Define the model and the parameter grid
rf = RandomForestClassifier(random_state=42)

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 200], # Number of trees in the forest
    'max_depth': [5, 10, None]       # Maximum depth of the tree (None = full depth)
}

# 3. Tune hyperparameters using GridSearchCV
# Using 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)
grid_search.fit(X_train, y_train)

# 4. Print the best parameters and final accuracy
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate on the test set
final_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, final_pred)

print(f"Best Parameters found by GridSearchCV: {best_params}")
print(f"Test Accuracy with Best Model: {final_accuracy:.4f}")

Best Parameters found by GridSearchCV: {'max_depth': 10, 'n_estimators': 200}
Test Accuracy with Best Model: 0.9708


''' Question 9: Write a Python program to:

- Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset

- Compare their Mean Squared Errors (MSE) '''

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# --- 2. Train Bagging Regressor ---
# Base estimator is a Decision Tree Regressor
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging_reg.fit(X_train, y_train)
bagging_pred = bagging_reg.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_pred)

# --- 3. Train Random Forest Regressor ---
# Random Forest is essentially a specialized Bagging of Decision Trees
rf_reg = RandomForestRegressor(
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# --- 4. Compare MSEs ---
print(f"Mean Squared Error (MSE) for Bagging Regressor: {bagging_mse:.4f}")
print(f"Mean Squared Error (MSE) for Random Forest Regressor: {rf_mse:.4f}")
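print("\nComparison: The two MSEs are nearly identical because RandomForestRegressor defaults to max_features=1.0 (every feature is considered at each split), which makes it equivalent to bagged trees. Setting max_features below 1.0 adds the feature randomness that further reduces correlation between trees.")

Mean Squared Error (MSE) for Bagging Regressor: 0.2579
Mean Squared Error (MSE) for Random Forest Regressor: 0.2577

Comparison: The two MSEs are nearly identical because RandomForestRegressor defaults to max_features=1.0 (every feature is considered at each split), which makes it equivalent to bagged trees. Setting max_features below 1.0 adds the feature randomness that further reduces correlation between trees.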


Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world context.

- Step 1: Choose Ensemble Technique
- Choice: Boosting (e.g., XGBoost, LightGBM).
- Justification: Loan default is driven by complex, non-linear interactions between demographic and transaction features, so underfitting (high bias) is a real risk, and missing a defaulter (False Negative) is costly. Boosting reduces bias by training models sequentially, each one focusing on the errors of its predecessors, and on tabular data of this kind it often yields higher predictive accuracy than Bagging.

- Step 2: Select Base Models
- Selection: Shallow Decision Trees.
- Justification: Decision Trees are the standard base learners for Boosting. Using shallow trees (e.g., max_depth $\in [3, 7]$) ensures they remain "weak learners," which are essential for the sequential, additive process of boosting.

- Step 3: Handle Overfitting (Regularization)
- Techniques:
- Shrinkage (Learning Rate): Use a small learning rate (e.g., $\eta \le 0.1$) so that each tree contributes only a small correction; learning gradually in this way acts as regularization.
- Subsampling: Apply row subsampling (stochastic gradient boosting) and column subsampling (feature randomness) to inject noise and decorrelate the trees, similar to a Random Forest.
- Early Stopping: Monitor the model's performance on a held-out validation set and stop training if the metric (e.g., AUC-ROC) doesn't improve after a set number of rounds.
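
The early-stopping rule in the last bullet can be sketched independently of any boosting library. The validation scores below are hypothetical, and `early_stopping_round` is an illustrative helper that assumes a higher-is-better metric such as AUC-ROC.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, stop_round) for a higher-is-better validation metric.

    Training stops once `patience` consecutive rounds fail to beat the best
    score so far; if that never happens, stop_round is the last round.
    """
    best_round, best_score = 0, float("-inf")
    for rnd, score in enumerate(val_scores):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(val_scores) - 1


# Hypothetical validation AUC after each boosting round.
val_auc = [0.80, 0.84, 0.86, 0.87, 0.869, 0.868, 0.871, 0.870, 0.869, 0.868, 0.867]
best, stopped = early_stopping_round(val_auc, patience=3)
print(f"best round: {best}, training stopped at round: {stopped}")
```

The model kept for prediction is the one from the best round, not the round at which training stopped.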

- Step 4: Evaluate Performance using Cross-Validation
- Metric: Area Under the ROC Curve (AUC-ROC). This metric is threshold-independent and, unlike plain accuracy, is not dominated by the majority class under the high class imbalance (few defaulters) common in financial data.
- Method: Stratified k-Fold Cross-Validation. This ensures that the percentage of defaulters is maintained across every fold, providing a reliable and stable estimate of the model's true generalization performance.
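
Both ideas in this step fit in a few lines of standard-library Python. This is a sketch for intuition only: `stratified_folds` and `roc_auc` are illustrative helpers, the labels are toy data with a 10% positive rate, and a real project would normally rely on a tested library implementation.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, rng):
    """Assign each index to a fold so class proportions stay roughly equal."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):  # deal indices out round-robin
            folds[pos % n_splits].append(i)
    return folds


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative
    (ties count as half). O(n_pos * n_neg), fine for a small illustration."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1] * 10 + [0] * 90  # toy data with a 10% positive (default) rate
folds = stratified_folds(labels, n_splits=5, rng=random.Random(42))
print("positives per fold:", [sum(labels[i] for i in fold) for fold in folds])
print("AUC example:", roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Without stratification, a fold of 20 rows could easily contain no defaulters at all, in which case AUC-ROC is undefined for that fold.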

- Step 5: Justify Improvement in Decision-Making
- Ensemble learning, through Boosting, improves decision-making by:
- Reducing Financial Risk: Better ranking of applicants by risk (higher AUC-ROC) means fewer missed defaulters (False Negatives) at a given approval rate, which directly reduces institutional loss.
- Providing Risk Insights: Feature importance scores identify the strongest risk factors (e.g., specific debt ratios), enabling the institution to create targeted risk mitigation and policy adjustments.
- Optimizing Resource Allocation: The reliable high-performance model ensures manual review and intervention resources are focused only on applicants with the highest predicted risk.

''' Question 10. Python Code '''

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# --- 1. Create a Synthetic Financial Dataset (Simulating loan default) ---
# Highly imbalanced and complex, which is typical for default prediction
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # Simulating 10% default rate (imbalanced)
    flip_y=0.02,
    random_state=42
)

# --- 2. Train and Tune Boosting Classifier (XGBoost) ---
xgb_model = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    random_state=42
)

# Define a simplified grid for demonstration of Hyperparameter Tuning
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1]
}

# Use Stratified K-Fold for imbalanced data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# GridSearchCV for tuning and evaluation
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='roc_auc',  # Using AUC-ROC, the preferred metric
    cv=cv,
    n_jobs=-1,
    verbose=0
)
grid_search.fit(X, y)

# --- 3. Print Best Parameters and Performance ---
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Get Cross-Validation Score (mean AUC-ROC of the best parameter combination)
cv_score = grid_search.best_score_

print("--- Boosting (XGBoost) Model Performance ---")
print(f"Best Parameters found: {best_params}")
print(f"Mean 5-Fold Stratified CV AUC-ROC Score: {cv_score:.4f}")

# --- 4. Feature Importance Analysis (for decision justification) ---
# Note: Feature numbering starts from 0 for the synthetic data
feature_importances = pd.Series(
    best_model.feature_importances_,
    index=[f'Feature_{i}' for i in range(X.shape[1])]
)
top_5_features = feature_importances.nlargest(5)
print("\nTop 5 Most Important Features (Risk Factors):")
print(top_5_features)

--- Boosting (XGBoost) Model Performance ---
Best Parameters found: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}
Mean 5-Fold Stratified CV AUC-ROC Score: 0.9504

Top 5 Most Important Features (Risk Factors):
Feature_1     0.209698
Feature_7     0.156602
Feature_17    0.067861
Feature_12    0.067256
Feature_2     0.056079
dtype: float32
