#Assignment Code: DA-AG-014
#Ensemble Learning | Assignment

Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

Answer:
Ensemble learning is a machine learning technique in which multiple models are combined to solve the same problem and improve overall performance. Individual models (called base learners, or weak learners when each is only slightly better than chance) make errors, but when those errors are not strongly correlated, combining the predictions cancels many of them out. The key idea is therefore to exploit the diversity of the models, with their different strengths and weaknesses, so that the collective prediction is typically more accurate and more robust than that of any single model.
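The benefit of combining models can be illustrated with a small calculation. The sketch below assumes binary classification and fully independent models that are each correct with the same probability; real base learners are correlated, so the gain in practice is smaller. The function name and the 0.6 accuracy are illustrative choices only.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    needed = n_models // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


single_model_accuracy = 0.6  # assumed accuracy of every individual model
for n in (1, 5, 25, 101):    # odd sizes avoid tied votes
    print(n, round(majority_vote_accuracy(n, single_model_accuracy), 3))
```

Under these assumptions the accuracy of the vote rises with the number of models, which is the effect ensembles rely on.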

Question 2: What is the difference between Bagging and Boosting?

Answer:  
Bagging and Boosting are both ensemble learning techniques, but they work differently. Bagging stands for Bootstrap Aggregating. In bagging, multiple models are trained independently on bootstrap samples, that is, random samples drawn from the training data with replacement. Their outputs are then combined, usually by averaging (for regression) or majority voting (for classification). This mainly reduces variance and so helps limit overfitting.

Boosting, on the other hand, builds models sequentially. Each new model is trained to correct the errors made by the models before it. In AdaBoost-style boosting, more weight is given to the data points that were misclassified earlier; gradient boosting instead fits each new model to the remaining errors (residuals) of the current ensemble. Boosting mainly reduces bias and improves accuracy by concentrating on difficult examples. However, it can be more prone to overfitting if not properly regularized.
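The reweighting step can be made concrete with one round of an AdaBoost-style update. This is a minimal sketch for binary classification, not a full boosting implementation; the function name and the five-point toy data are hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, is_correct):
    """One AdaBoost-style round: return (alpha, new normalized sample weights)."""
    # Weighted error of the current weak learner (assumed to be in (0, 1)).
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Vote strength of this learner: larger when its error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights of correctly classified points, grow the misclassified ones.
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, is_correct)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5                            # start with uniform weights
is_correct = [True, True, True, False, True]   # the fourth point was misclassified
alpha, new_weights = adaboost_reweight(weights, is_correct)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update the single misclassified point carries half of the total weight, so the next weak learner is pushed to get it right.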

Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

Answer:  
Bootstrap sampling is a method of creating random subsets of data by sampling from the original dataset with replacement. This means that some data points may appear more than once in a sample, while others may not appear at all.

In Bagging methods like Random Forest, bootstrap sampling is used to train each decision tree on a different subset of the data. This introduces diversity among the trees because each one sees a slightly different view of the data. The final prediction is made by aggregating the predictions of all the trees. This process reduces variance and increases the model’s ability to generalize to unseen data.
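A bootstrap sample is easy to draw with the standard library. The sketch below uses a toy dataset of ten integers (an assumption for illustration) and shows which points are repeated and which are left out.

```python
import random
from collections import Counter

rng = random.Random(0)      # fixed seed so the sample is reproducible
data = list(range(10))      # toy dataset: the integers 0..9

# Draw as many points as the dataset holds, with replacement.
sample = rng.choices(data, k=len(data))

counts = Counter(sample)
repeated = sorted(x for x, c in counts.items() if c > 1)
left_out = sorted(set(data) - set(sample))

print("bootstrap sample:", sample)
print("drawn more than once:", repeated)
print("never drawn:", left_out)
```

Each tree in a bagged ensemble would be trained on a different sample of this kind.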

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

Answer:  
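Out-of-Bag (OOB) samples are the data points that are not included in a given bootstrap sample. When a model (such as a tree in a Random Forest) is trained on a bootstrap sample of the same size n as the original dataset, each data point is left out with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. On average, about 37% of the original data is therefore left out of each sample; these points are that model's OOB samples.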
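The 37% figure can be checked numerically. The sketch below assumes the bootstrap sample has the same size n as the dataset, so a given point is missed by all n draws with probability (1 - 1/n)^n; the function name is illustrative.

```python
import math


def oob_fraction(n: int) -> float:
    """Probability that a given point is absent from a size-n bootstrap sample."""
    return (1 - 1 / n) ** n


for n in (10, 100, 1000, 100_000):
    print(n, round(oob_fraction(n), 4))

# The limit as n grows is 1/e.
print("1/e =", round(math.exp(-1), 4))
```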

The OOB score is an internal validation technique used to evaluate the model's performance. Each data point is predicted using only the trees that did not train on it, and these predictions are compared with the true labels to compute a score (accuracy for classifiers in scikit-learn). Because no tree is scored on data it was trained on, the OOB score gives a nearly unbiased, often slightly pessimistic, estimate of generalization performance without needing a separate validation set.
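In scikit-learn the OOB estimate is available directly from a Random Forest. The following is a small sketch on the Breast Cancer dataset; the choice of 200 trees and the seed are arbitrary assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores every sample using only the trees whose bootstrap
# sample did not contain it (bootstrap=True is the default for forests).
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

print("OOB accuracy:", round(forest.oob_score_, 3))
```

No train/test split is needed here, because each prediction used for the score comes from trees that never saw that sample.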

Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

Answer:  
In a single Decision Tree, feature importance is calculated from how much each feature reduces impurity (such as Gini impurity or entropy) across the splits where it is used. However, a decision tree is sensitive to small changes in the training data and can overfit, so its feature importances may not be reliable, especially on small or noisy datasets or when features are strongly correlated.

In contrast, a Random Forest consists of many trees trained on different bootstrap samples and random feature subsets. Its feature importance is the impurity-based importance averaged across all trees, which makes it more stable and robust. Averaging reduces the influence of random fluctuations in the data and gives a more reliable estimate of which features matter for prediction.
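The difference in stability can be measured by refitting both models on resampled data and looking at how much the importances move. This is a rough sketch on the Breast Cancer dataset; the helper name, the 8 repeats, and the seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)


def importance_spread(make_model, n_repeats=8, seed=0):
    """Mean per-feature standard deviation of importances over resampled fits."""
    rng = np.random.default_rng(seed)  # same resamples for every model type
    runs = []
    for _ in range(n_repeats):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample of rows
        model = make_model().fit(X[idx], y[idx])
        runs.append(model.feature_importances_)
    return float(np.std(runs, axis=0).mean())


tree_spread = importance_spread(lambda: DecisionTreeClassifier(random_state=0))
forest_spread = importance_spread(
    lambda: RandomForestClassifier(n_estimators=100, random_state=0)
)

print("single tree spread:  ", round(tree_spread, 4))
print("random forest spread:", round(forest_spread, 4))
```

A smaller spread means the ranking of features changes less when the training data changes.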


Question 6: Write a Python program to:   
● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier   
● Print the top 5 most important features based on feature importance scores.   
(Include your Python code and output in the code box below.)   
Answer:  

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train a Random Forest Classifier
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Get feature importances
importances = pd.Series(model.feature_importances_, index=X.columns)

# Print the top 5 most important features
top_5 = importances.sort_values(ascending=False).head(5)
print("Top 5 Most Important Features:")
print(top_5)


Top 5 Most Important Features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


Question 7: Write a Python program to:   
● Train a Bagging Classifier using Decision Trees on the Iris dataset   
● Evaluate its accuracy and compare with a single Decision Tree   
(Include your Python code and output in the code box below.)   
Answer:  

In [3]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_preds = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_preds)

# Train a Bagging Classifier with Decision Tree as base estimator
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bag_model.fit(X_train, y_train)
bag_preds = bag_model.predict(X_test)
bag_accuracy = accuracy_score(y_test, bag_preds)

# Print accuracies
print("Single Decision Tree Accuracy:", round(dt_accuracy, 2))
print("Bagging Classifier Accuracy:", round(bag_accuracy, 2))


Single Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


Question 8: Write a Python program to:   
● Train a Random Forest Classifier   
● Tune hyperparameters max_depth and n_estimators using GridSearchCV   
● Print the best parameters and final accuracy   
(Include your Python code and output in the code box below.)  
Answer:

In [4]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10]
}

# Initialize the model
rf = RandomForestClassifier(random_state=42)

# Apply GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_

# Evaluate on test data
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Test Accuracy:", round(accuracy, 2))


Best Parameters: {'max_depth': None, 'n_estimators': 150}
Test Accuracy: 0.96


Question 9: Write a Python program to:   
● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
● Compare their Mean Squared Errors (MSE)   
(Include your Python code and output in the code box below.)   
Answer:

In [5]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
X, y = fetch_california_housing(return_X_y=True)

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

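# Train Bagging Regressor
# Note: BaggingRegressor defaults to 10 decision trees, while
# RandomForestRegressor defaults to 100, so this is not a like-for-like comparison.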
bagging_reg = BaggingRegressor(random_state=42)
bagging_reg.fit(X_train, y_train)
y_pred_bagging = bagging_reg.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print the Mean Squared Errors
print(f"Bagging Regressor MSE: {mse_bagging:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")


Bagging Regressor MSE: 0.2824
Random Forest Regressor MSE: 0.2554


Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.   
Explain your step-by-step approach to:  
● Choose between Bagging or Boosting  
● Handle overfitting  
● Select base models  
● Evaluate performance using cross-validation  
● Justify how ensemble learning improves decision-making in this real-world context.

(Include your Python code and output in the code box below.)  
Answer:

Scenario:  
As a data scientist at a financial institution, your goal is to predict loan default using customer demographic and transaction data. To improve model performance, you decide to apply ensemble learning techniques.

Below is the step-by-step approach covering all the points requested:

1. Choose Between Bagging and Boosting

Approach:

Start with Bagging (e.g., Random Forest) as a baseline model. It is stable, handles noise well, and performs well even with little tuning.

Move to Boosting (e.g., XGBoost or LightGBM) if the baseline underfits and more accuracy is needed. For imbalanced classes (default vs non-default), combine it with class weighting, such as scale_pos_weight in XGBoost.

Boosting often gives the best accuracy on tabular data such as this because it concentrates on hard-to-classify instances, but the final choice should rest on cross-validated results rather than on a default preference.

2. Handle Overfitting

Techniques Used:

Use regularization parameters like max_depth, min_child_weight, and learning_rate (in boosting).

Use early stopping to stop training when the validation loss stops improving.

Apply cross-validation to monitor model generalization.

Apply feature selection to remove highly correlated or irrelevant features.
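Early stopping can be sketched with scikit-learn's GradientBoostingClassifier, used here in place of XGBoost so the example needs only scikit-learn. The simulated data and all hyperparameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Simulated, imbalanced data standing in for a real loan dataset.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=5,
    weights=[0.7, 0.3], random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting rounds
    learning_rate=0.1,
    max_depth=3,               # shallow trees act as regularization
    validation_fraction=0.2,   # held out internally to monitor the loss
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=42
)
gb.fit(X, y)

print("Boosting rounds actually used:", gb.n_estimators_)
```

Training stops once the held-out loss stops improving, so the fitted model may use far fewer rounds than the upper bound.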

3. Select Base Models

Approach:

For Bagging: Use Decision Trees as base learners.

For Boosting: Also use shallow Decision Trees (max_depth = 3–5) as weak learners.

You may also experiment with logistic regression as a baseline non-ensemble model for comparison.
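The three kinds of model mentioned in this step can be compared side by side. The sketch below uses simulated data and scikit-learn's GradientBoostingClassifier as the boosting model (an assumption; XGBoost or LightGBM could be substituted), with ROC-AUC as the comparison metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Simulated, imbalanced data standing in for a real loan dataset.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=5,
    weights=[0.7, 0.3], random_state=42
)

candidates = {
    "logistic regression (baseline)": LogisticRegression(max_iter=1000),
    # Bagging: full-depth trees as base learners (passed positionally).
    "bagging, full-depth trees": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=42
    ),
    # Boosting: shallow trees as weak learners.
    "boosting, depth-3 trees": GradientBoostingClassifier(
        max_depth=3, n_estimators=100, random_state=42
    ),
}

scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: ROC-AUC = {score:.3f}")
```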

4. Evaluate Performance Using Cross-Validation

Approach:

Use Stratified K-Fold Cross-Validation to ensure class distribution is preserved in all folds.

Use metrics such as precision, recall, F1-score, and ROC-AUC alongside accuracy. Class imbalance is common in loan default datasets, and accuracy alone can look high for a model that rarely flags a defaulter.
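Stratified cross-validation with several metrics can be run in a single call. This is a minimal sketch on simulated data with a Random Forest as the model; the fold count and seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Simulated, imbalanced data standing in for a real loan dataset.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=5,
    weights=[0.7, 0.3], random_state=42
)

# Stratification keeps the default / non-default ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

results = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=cv, scoring=scoring
)

for name in scoring:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean shows how much each metric depends on the particular split.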

5. Justify Ensemble Learning for Loan Default Prediction

Justification:

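Bagging mainly reduces variance, while Boosting mainly reduces bias; a well-tuned ensemble can improve on both compared with a single tree.

Bagged trees are fairly robust to noise and outliers, which are common in financial data. Boosting is more sensitive to noisy labels, so it needs the regularization described above. Some implementations (such as XGBoost and LightGBM) also handle missing values natively.

Boosting methods adaptively focus on hard cases, which can improve recall on defaulters when combined with class weighting or a tuned decision threshold. Recall is critical for identifying risky customers.

More reliable risk estimates improve lending decisions, reduce exposure to loan defaults, and support financial stability.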

Python Code Example (using XGBoost as Boosting model)

In [6]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

# Simulated dataset (replace with your real loan dataset)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=5,
    random_state=42, weights=[0.7, 0.3]
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train XGBoost classifier (boosting)
model = XGBClassifier(
    max_depth=4,
    learning_rate=0.1,
    n_estimators=100,
    eval_metric='logloss',
    random_state=42
)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
acc = accuracy_score(y_test, y_pred)
print("Test Accuracy:", round(acc, 2))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

# Cross-validation score
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("\nCross-Validation Accuracy:", round(cv_scores.mean(), 2))


Test Accuracy: 0.94

Classification Report:

              precision    recall  f1-score   support

           0       0.94      0.98      0.96       139
           1       0.95      0.85      0.90        61

    accuracy                           0.94       200
   macro avg       0.94      0.92      0.93       200
weighted avg       0.94      0.94      0.94       200



Cross-Validation Accuracy: 0.94
