# Ensemble Learning | Assignment

Q-1 : What is Ensemble Learning in machine learning? Explain the key idea
behind it.

A-1 : What is Ensemble Learning in Machine Learning?
Ensemble Learning is a technique in machine learning where multiple models (learners) are combined to solve a problem and improve overall performance. Instead of relying on a single model, ensemble methods merge predictions from several models to produce a more accurate and robust outcome.

Key Idea Behind Ensemble Learning:
The core idea is that a group of weak learners (models that perform slightly better than random guessing) can come together to form a strong learner. By combining diverse models, ensemble learning helps to:

Reduce variance (e.g., Bagging methods like Random Forest)

Reduce bias (e.g., Boosting methods like AdaBoost, Gradient Boosting)

Improve accuracy and generalization

Common Ensemble Methods:
Bagging: Builds multiple models on different random subsets of data (e.g., Random Forest).

Boosting: Sequentially builds models, focusing on correcting previous errors (e.g., AdaBoost, XGBoost).

Voting: Combines predictions from multiple models using majority or average vote.
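
As a rough illustration of why voting helps, the sketch below simulates a set of weak classifiers and compares one of them against their majority vote. The per-model accuracy of 0.6, the number of models, and the assumption that the models err independently are all illustrative choices; real models are correlated, so the gain is usually smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 10_000   # number of simulated test points (illustrative)
n_models = 25        # odd number of voters, so ties cannot occur
p_correct = 0.6      # assumed accuracy of each weak learner

# correct[i, j] is True when model j classifies sample i correctly.
# Independence between models is an idealising assumption.
correct = rng.random((n_samples, n_models)) < p_correct

single_accuracy = correct[:, 0].mean()
# The majority vote is right when more than half of the models are right.
majority_accuracy = (correct.sum(axis=1) > n_models / 2).mean()

print(f"Single weak learner accuracy: {single_accuracy:.3f}")
print(f"Majority vote of {n_models} learners: {majority_accuracy:.3f}")
```

Increasing n_models or p_correct raises the majority-vote accuracy further, while setting p_correct below 0.5 makes voting worse than a single model.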

Q-2 : What is the difference between Bagging and Boosting?


A-2 : What is the difference between Bagging and Boosting?
Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques that combine multiple models to improve prediction performance, but they differ in how models are trained and combined.

Bagging:
Trains multiple models independently on random subsets of the training data (with replacement).

Goal: Reduce variance by averaging or voting.

All models have equal weight in the final prediction.

Example: Random Forest.

Boosting:
Trains models sequentially, where each new model focuses on correcting the errors of the previous ones.

Goal: Reduce bias and build a strong learner from weak ones.

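Models contribute unequally to the final result: AdaBoost weights each model by its accuracy, and Gradient Boosting scales each model's contribution by a learning rate.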

Example: AdaBoost, Gradient Boosting.

Key Differences Summary:
Bagging = Parallel, Boosting = Sequential

Bagging reduces variance, Boosting reduces bias

Bagging treats all models equally, Boosting gives weights based on performance

Q-3 : What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

A-3 : What is Bootstrap Sampling and its Role in Bagging Methods like Random Forest?
Bootstrap sampling is a statistical technique that involves randomly selecting samples from a dataset with replacement. This means the same data point can appear multiple times in a single sample.

In the context of Bagging (Bootstrap Aggregating) methods like Random Forest, bootstrap sampling plays a key role in creating diverse training subsets.

Role in Bagging / Random Forest:
Diversity: Each model (e.g., decision tree) is trained on a different random bootstrap sample, increasing model diversity.

Overfitting Reduction: Combining predictions from multiple diverse models helps to reduce overfitting and improve generalization.

Stability: Models trained on slightly different datasets help in stabilizing predictions, especially in noisy datasets.
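
Sampling with replacement can be seen on a toy example. The sketch below draws three bootstrap samples from ten points labelled 0 to 9 (the dataset size and the seed are arbitrary choices) and reports which points each sample never drew.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # a toy dataset of 10 points, labelled 0..9

samples = []
for i in range(3):
    # Draw len(data) values with replacement: one bootstrap sample.
    sample = rng.choice(data, size=len(data), replace=True)
    samples.append(sample)
    # Points that were never drawn are left out of this sample.
    left_out = np.setdiff1d(data, sample)
    print(f"Sample {i + 1}: {np.sort(sample)}  never drawn: {left_out}")
```

Each sample has the same length as the original data but usually contains repeats, so every model trained on such a sample sees a slightly different dataset.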



Q-4 : What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

A-4 : What are Out-of-Bag (OOB) Samples and the OOB Score?
In Bagging methods like Random Forest, each model (e.g., decision tree) is trained on a bootstrap sample: a random sample of the training data drawn with replacement.

As a result, each bootstrap sample (when it is the same size as the training set) contains about 63% of the distinct data points on average, and the remaining ~37% are not selected for that model. These unused data points are that model's Out-of-Bag (OOB) samples.
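
The 63% / 37% split follows from a short calculation, assuming each bootstrap sample draws n points with replacement from a training set of size n:

```latex
P(\text{point } i \text{ is never drawn})
  = \left(1 - \frac{1}{n}\right)^{n}
  \xrightarrow{\; n \to \infty \;} e^{-1} \approx 0.368,
\qquad
P(\text{point } i \text{ is drawn at least once}) \approx 1 - 0.368 = 0.632 .
```

The limit is already a good approximation for moderate n; for n = 100 the left-out probability is about 0.366.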

How OOB Samples Are Used:
OOB samples serve as a kind of internal validation set.

Each data point is predicted using only the models that did not see it during training.

The model’s predictions on OOB samples are compared to their actual labels to calculate the OOB score.

OOB Score (Evaluation):
The OOB score is the chosen metric computed on these OOB predictions; in scikit-learn it is accuracy for classifiers and R^2 for regressors.

It provides a reliable estimate of model performance without needing a separate validation set.

It saves data and gives a quick, approximately unbiased (often slightly pessimistic) estimate of generalization performance.
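
A minimal sketch of the OOB score in scikit-learn follows, using RandomForestClassifier with oob_score=True. The dataset, the number of trees, and the extra held-out split (kept only to compare against the OOB estimate) are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# oob_score=True asks the forest to score each training point using only
# the trees whose bootstrap sample did not contain that point.
forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_              # accuracy on OOB predictions
test_accuracy = forest.score(X_test, y_test)  # accuracy on the held-out split
print(f"OOB accuracy:      {oob_accuracy:.3f}")
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

The two numbers are usually close, which is why the OOB score is often used when data is too scarce to set aside a validation set.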



Q-5 : Compare feature importance analysis in a single Decision Tree vs. a
Random Forest

A-5 : 1. In a Single Decision Tree:
Feature importance is calculated based on how much each feature reduces impurity (e.g., Gini or Entropy) when used for splitting.

It reflects the importance of features in that one specific tree.

Can be unstable — small changes in data may lead to very different trees and importance rankings.

Prone to overfitting if the tree is deep or not pruned.

2. In a Random Forest:
Feature importance is averaged across all trees in the ensemble.

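More stable and reliable, because averaging reduces the variance of any single tree's importance ranking.

Each tree sees a different bootstrap sample and a random subset of features at each split, so the averaged score is less tied to one particular training set and gives a more generalized measure.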

Often gives better insight into which features consistently contribute to model performance.
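
One way to probe the stability claim is sketched below: refit each model on several bootstrap resamples of the data and measure how much the impurity-based importances move. The helper importance_spread, the number of resamples, and the use of the Breast Cancer dataset are hypothetical choices for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target
rng = np.random.default_rng(0)


def importance_spread(make_model, n_repeats=10):
    """Fit a fresh model on several bootstrap resamples and return the
    importances plus their average per-feature standard deviation."""
    importances = []
    for _ in range(n_repeats):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
        model = make_model().fit(X[idx], y[idx])
        importances.append(model.feature_importances_)
    importances = np.array(importances)
    # Spread: standard deviation across resamples, averaged over features.
    return importances, importances.std(axis=0).mean()


tree_imps, tree_spread = importance_spread(
    lambda: DecisionTreeClassifier(random_state=0)
)
forest_imps, forest_spread = importance_spread(
    lambda: RandomForestClassifier(n_estimators=50, random_state=0)
)

print(f"Average importance std, single tree:   {tree_spread:.4f}")
print(f"Average importance std, random forest: {forest_spread:.4f}")
```

A smaller spread means the importance scores change less when the training data changes. Impurity-based importances remain a heuristic in both cases, so the numbers are best read as a relative comparison.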

Q-6 : Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
(Include your Python code and output in the code box below.)

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train the Random Forest Classifier
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Get feature importances
importances = model.feature_importances_

# Create a DataFrame for feature importance
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort and display the top 5 important features
top_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)
print("Top 5 Most Important Features:")
print(top_features)


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


Q-7 : Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree
(Include your Python code and output in the code box below.)

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a single Decision Tree
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)
y_pred_tree = dtree.predict(X_test)
accuracy_tree = accuracy_score(y_test, y_pred_tree)

# Train a Bagging Classifier using Decision Trees
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bag)

# Print results
print(f"Single Decision Tree Accuracy: {accuracy_tree:.2f}")
print(f"Bagging Classifier Accuracy: {accuracy_bagging:.2f}")

Single Decision Tree Accuracy: 1.00
Bagging Classifier Accuracy: 1.00


Q-8 : Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy
(Include your Python code and output in the code box below.)

In [5]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the model
rf = RandomForestClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 3, 5, 10]
}

# Perform GridSearchCV
grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get best model and evaluate
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print(f"Final Accuracy: {accuracy:.2f}")


Best Parameters: {'max_depth': None, 'n_estimators': 10}
Final Accuracy: 1.00


Q-9 : Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)
(Include your Python code and output in the code box below.)

In [7]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Regressor with Decision Trees
bagging_model = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=42)
bagging_model.fit(X_train, y_train)
y_pred_bagging = bagging_model.predict(X_test)

# Train Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=50, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Calculate Mean Squared Errors
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print Results
print(f"Bagging Regressor MSE: {mse_bagging:.2f}")
print(f"Random Forest Regressor MSE: {mse_rf:.2f}")

Bagging Regressor MSE: 0.26
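Random Forest Regressor MSE: 0.26

The two scores match to two decimals because, with scikit-learn's default max_features=1.0, RandomForestRegressor considers every feature at each split and therefore behaves essentially like bagged decision trees. Lowering max_features (for example to 'sqrt') would add the per-split feature randomness that normally distinguishes a Random Forest from plain Bagging.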


Q-10 : You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.

A-10 : 1. Choose Between Bagging or Boosting
We'll use Boosting (XGBoost) because:

It works well with tabular and imbalanced data.

Learns sequentially, focusing on difficult cases.

Usually gives better performance for structured datasets like financial records.

2. Handle Overfitting
We’ll control overfitting using:

Regularization (max_depth, gamma, min_child_weight)

Early stopping

Cross-validation

3. Base Model: XGBoost (Decision Trees internally)
4. Evaluation: Stratified K-Fold Cross-Validation with AUC, Precision, Recall
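
The early stopping and multi-metric evaluation steps can be sketched as follows. Note that this sketch swaps in scikit-learn's GradientBoostingClassifier instead of XGBoost, because its early-stopping parameters (validation_fraction, n_iter_no_change) are stable across versions; the synthetic dataset and all hyperparameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for an imbalanced loan default dataset (~15% defaults).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=6, n_redundant=2,
    weights=[0.85, 0.15], flip_y=0.01, random_state=42,
)

model = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting rounds
    learning_rate=0.1,
    max_depth=3,               # shallow trees act as regularization
    validation_fraction=0.2,   # internal hold-out used for early stopping
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=42,
)

# Stratified folds keep the default rate roughly equal in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["roc_auc", "precision", "recall"]
cv_results = cross_validate(model, X, y, cv=skf, scoring=metrics)

for metric in metrics:
    scores = cv_results[f"test_{metric}"]
    print(f"{metric}: mean={scores.mean():.3f}, std={scores.std():.3f}")

# Fit once on all data to see how many rounds early stopping actually used.
model.fit(X, y)
print("Boosting rounds used:", model.n_estimators_)
```

Reporting precision and recall alongside AUC matters here because a missed default (low recall) and a wrongly rejected applicant (low precision) carry different costs for the lender.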


In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report, roc_auc_score
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

# Simulate an imbalanced loan default dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6, n_redundant=2,
                           n_classes=2, weights=[0.85, 0.15], flip_y=0.01, random_state=42)

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Define the XGBoost classifier with regularization
model = XGBClassifier(
    max_depth=4,
    learning_rate=0.1,
    n_estimators=100,
    scale_pos_weight=5,  # Up-weight the minority class (negatives/positives is about 0.85/0.15, roughly 5.7)
    eval_metric='auc',
    verbosity=0,
    random_state=42
)

# Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, scoring='roc_auc', cv=skf)

print("Cross-Validation AUC Scores:", np.round(cv_scores, 3))
print("Mean AUC Score: {:.3f}".format(np.mean(cv_scores)))

# Train on the entire training set
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Classification metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("Test ROC-AUC Score: {:.3f}".format(roc_auc_score(y_test, y_prob)))


Cross-Validation AUC Scores: [0.932 0.944 0.953 0.956 0.954]
Mean AUC Score: 0.948

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       169
           1       0.86      0.77      0.81        31

    accuracy                           0.94       200
   macro avg       0.91      0.88      0.89       200
weighted avg       0.94      0.94      0.94       200

Test ROC-AUC Score: 0.947
