1. What is Ensemble Learning in machine learning? Explain the key idea behind it.
- Ensemble learning is a machine learning technique that combines multiple models to achieve better predictive performance than a single model typically achieves on its own.

- Key Idea Behind Ensemble Learning
  - Diversity of models: Different models capture different aspects of the data. By combining them, weaknesses of one model are compensated by strengths of others.
  - Error reduction: Individual models may be biased or have high variance. Ensemble methods balance these errors, leading to more reliable predictions.
  - Improved generalization: Ensembles are usually less prone to overfitting than a single complex model, which makes them more robust on unseen data.
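
To make the error-reduction idea concrete, here is a minimal sketch of how majority voting behaves under an idealized assumption: every model is correct with the same probability `p`, and the models' errors are fully independent. Real models are correlated, so actual gains are smaller. The function name `majority_vote_accuracy` is illustrative and not part of any library.

```python
from math import comb


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a majority of n_models independent voters is correct.

    Assumes each model is correct with probability p, independently of the
    others, and that n_models is odd so a tie cannot occur.
    """
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Accuracy of the vote grows with the number of models when p > 0.5
for n in (1, 3, 11, 25):
    print(n, round(majority_vote_accuracy(0.7, n), 4))
```

With `p = 0.5` the vote is no better than a coin flip for any number of models, which is why each member needs to be at least slightly better than chance.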


2. What is the difference between Bagging and Boosting?

- Bagging (Bootstrap Aggregating):
  - Builds models independently on random subsets of the data.
  - Uses bootstrapped samples (random sampling with replacement)
  - Predictions are combined by majority voting (classification) or averaging (regression).
  - Goal is to reduce variance (helps prevent overfitting).
  - Parallel training – models don’t depend on each other.
  - Example: Random Forest.
- Boosting:
  - Builds models sequentially; each new model focuses on correcting the errors of the previous ones.
  - Typically trains on the full dataset, either reweighting misclassified points (AdaBoost) or fitting the remaining errors of the current ensemble (gradient boosting).
  - Predictions are combined by a weighted vote or weighted sum, giving more influence to more accurate models.
  - Goal is to reduce bias (improves accuracy by focusing on hard cases).
  - Sequential training – each model depends on the previous one.
  - Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM.
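
The reweighting step in boosting can be sketched in a few lines. The example below follows the classic AdaBoost update for labels in {-1, +1}; it is a simplified illustration rather than the code of any particular library, and the helper name `adaboost_round` is hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost-style reweighting step for labels in {-1, +1}.

    Returns the weak learner's vote weight (alpha) and the renormalized
    sample weights. Assumes the weighted error is strictly between 0 and 1.
    """
    total = sum(weights)
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    alpha = 0.5 * math.log((1 - error) / error)
    # Correct predictions (t * p = +1) shrink, mistakes (t * p = -1) grow
    new_weights = [
        w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)
    ]
    norm = sum(new_weights)
    return alpha, [w / norm for w in new_weights]


y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, -1, +1]  # the weak learner gets the second sample wrong
weights = [0.2] * 5            # start from uniform weights

alpha, weights = adaboost_round(weights, y_true, y_pred)
print("alpha:", round(alpha, 4))
print("weights:", [round(w, 3) for w in weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no such step: every model sees an independent resample with equal weights.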






3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
- Bootstrap sampling means drawing random samples with replacement from the original dataset.
- With replacement: Each time you pick a data point, you put it back before drawing the next one. This means some data points may appear multiple times in a sample, while others may not appear at all.
- Sample size: Typically, the bootstrap sample is the same size as the original dataset.
- Bagging (Bootstrap Aggregating) relies on bootstrap sampling to ensure model diversity:
  - Random subsets for each model: Each decision tree in a Random Forest is trained on a different bootstrap sample.
  - Variance reduction: Since trees are trained on slightly different data, they don’t all make the same mistakes. Averaging their predictions reduces variance and stabilizes results.
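  - Decorrelation of models: Because each tree sees a different sample, the trees' errors are less correlated (they are not fully independent, since all samples come from the same dataset). Random Forest decorrelates the trees further by considering only a random subset of features at each split.
  - Improved generalization: By combining many diverse trees, Random Forests are less prone to overfitting than a single tree and usually perform better on unseen data.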
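
A minimal sketch of bootstrap sampling using only the standard library. The ten-row toy dataset and the helper name `bootstrap_sample` are assumptions made for illustration.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(42)  # fixed seed only so the run is repeatable
data = list(range(10))   # toy "dataset" of 10 row indices

sample = bootstrap_sample(data, rng)
left_out = sorted(set(data) - set(sample))  # rows that were never drawn

print("bootstrap sample:", sorted(sample))
print("rows left out:   ", left_out)
```

The sample has the same size as the dataset, some rows appear more than once, and the rows that were never drawn are exactly the ones a tree trained on this sample has not seen.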


4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

- In bootstrap sampling, each model (e.g., each tree in a Random Forest) is trained on a random sample of the dataset with replacement.
- Because sampling is with replacement, each bootstrap sample contains about 63% of the distinct original data points (for reasonably large datasets), while the remaining ~37% are left out.
- These left-out data points are called Out-of-Bag (OOB) samples.
- Each tree can be tested on its own OOB samples, since those points were not used in training that tree.
- For each data point, predictions are aggregated only from the trees that did not train on it, and the result is compared with the true label.
- Doing this for every point gives the OOB score, an estimate of how the ensemble performs on unseen data.
- Benefits of OOB Score
  - No need for a separate validation set: Saves data, especially useful when datasets are small.
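  - Nearly unbiased performance estimate: Since each point is scored only by trees that never trained on it, the OOB score reflects how well the model generalizes. It can be slightly pessimistic, because each point is scored by only part of the forest.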
  - Efficiency: Evaluation happens during training, so it’s computationally cheaper than cross-validation.
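
The 63% / 37% split follows from a short calculation: a given row is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below checks this numerically; the function name `expected_in_bag_fraction` is illustrative.

```python
import math


def expected_in_bag_fraction(n: int) -> float:
    """Expected share of distinct rows in a bootstrap sample of size n drawn from n rows."""
    return 1 - (1 - 1 / n) ** n


limit = 1 - 1 / math.e  # large-n limit of the in-bag fraction

for n in (10, 100, 1000, 100000):
    print(n, round(expected_in_bag_fraction(n), 4))
print("limit:", round(limit, 4))
```

For small datasets the in-bag share is a little above the limit, so slightly fewer than 37% of rows are out-of-bag.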



5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
- Feature importance analysis is a key way to understand how models make decisions.
- Decision Tree Feature Importance
  - Calculation: Importance is measured by the reduction in impurity (e.g., Gini or entropy) when a feature is used for splitting.
  - Interpretability: Very clear—you can trace the exact path of splits.
  - Limitations:
    - Can be unstable (small changes in data may change the tree structure).
    - Biased toward high-cardinality features (many categories or many distinct continuous values).
    - Reflects only one model’s perspective.
- Random Forest
  - Aggregates importance across all trees in the forest.
  - Two common methods:
    - Mean Decrease in Impurity (MDI): Average impurity reduction across trees.
    - Permutation Importance (MDA): Measures the drop in accuracy when a feature's values are shuffled.
  - Interpretability: Less direct (many trees), but provides a global ranking of features.
  - Advantages:
    - More stable due to averaging.
    - Less prone to overfitting compared to a single tree.
    - More reliable in high-dimensional datasets.
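
The impurity reduction behind tree-based importance can be illustrated on a toy split. This is a simplified sketch with hypothetical helper names (`gini`, `impurity_decrease`); scikit-learn's implementation additionally weights each node by the share of samples reaching it and normalizes the totals, which is not shown here.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - children


parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes perfectly removes all impurity
good_split = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])
# A split that leaves both children as mixed as the parent removes none
weak_split = impurity_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print("good split:", good_split)
print("weak split:", weak_split)
```

A single tree sums these decreases per feature over its own splits; a Random Forest averages the per-tree totals, which is what smooths out the instability of any one tree.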



In [5]:
'''
6. Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
'''

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd


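# Load the dataset: feature matrix X, target y, and the feature names
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Fit on the full dataset: the goal here is to rank features, not to estimate accuracy
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Impurity-based (MDI) importances, normalized to sum to 1
importances = clf.feature_importances_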

feature_importances = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

top_features = feature_importances.sort_values(by='Importance', ascending=False).head(5)

print("Top 5 Important Features:")
print(top_features)

Top 5 Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [8]:
'''
7. Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree
'''
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score


iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


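# Baseline: a single decision tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)

# Bagging: 50 decision trees, each trained on a bootstrap sample of the training set
# (`estimator` requires scikit-learn >= 1.2; older versions call it `base_estimator`)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bagging)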

# Print results
print("Accuracy of Single Decision Tree:", dt_accuracy)
print("Accuracy of Bagging Classifier (Decision Trees):", bagging_accuracy)

Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier (Decision Trees): 1.0


In [9]:
'''
8. Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy'''



from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20]
}

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)

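# Run 5-fold cross-validation for every parameter combination (training set only)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_

# best_estimator_ is already refit on the full training set with the best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)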

print("Best Parameters:", best_params)
print("Final Accuracy on Test Set:", final_accuracy)

Best Parameters: {'max_depth': None, 'n_estimators': 200}
Final Accuracy on Test Set: 0.9707602339181286


In [10]:
'''
9. Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)
'''

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

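# Bagging of full decision trees: every split considers all features
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
bagging_mse = mean_squared_error(y_test, y_pred_bagging)

# Random Forest. Note: in recent scikit-learn versions the regressor defaults to
# max_features=1.0 (all features at each split), so this model differs from the
# bagging model mainly in the number of trees (100 vs. 50); expect similar MSEs.
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, y_pred_rf)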


print("Mean Squared Error (Bagging Regressor):", bagging_mse)
print("Mean Squared Error (Random Forest Regressor):", rf_mse)

Mean Squared Error (Bagging Regressor): 0.25787382250585034
Mean Squared Error (Random Forest Regressor): 0.25650512920799395


In [18]:
'''
10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context
'''

# Loan Default Prediction with Ensemble Methods

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import warnings
warnings.filterwarnings("ignore")

# For Boosting
import lightgbm as lgb

X, y = make_classification(
    n_samples=5000,        # number of customers
    n_features=20,         # demographic + transaction features
    n_informative=10,      # features that actually matter
    n_redundant=5,         # correlated features
    n_classes=2,           # default vs non-default
    weights=[0.8, 0.2],    # imbalanced: 20% defaults
    random_state=42
)



X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

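# Bagging candidate: Random Forest.
# max_depth and min_samples_leaf limit tree complexity to curb overfitting.
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)
rf_pred = rf.predict_proba(X_test)[:, 1]
rf_auc = roc_auc_score(y_test, rf_pred)


# Boosting candidate: LightGBM.
# A small learning rate, shallow trees, and row/column subsampling act as regularization.
lgb_model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    subsample_freq=1,      # row subsampling is only active when subsample_freq > 0
    colsample_bytree=0.8,
    random_state=42
)

# eval_set is used for monitoring only: no early stopping is configured, so the
# test set does not influence training. With early stopping, use a separate
# validation split instead of the test set.
lgb_model.fit(X_train, y_train,
              eval_set=[(X_test, y_test)],
              eval_metric="auc")

lgb_pred = lgb_model.predict_proba(X_test)[:, 1]
lgb_auc = roc_auc_score(y_test, lgb_pred)


# Stratified folds keep the ~20% default rate in every fold; AUC suits the class imbalance
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score clones each model and refits it on every fold
rf_cv_auc = cross_val_score(rf, X, y, cv=cv, scoring="roc_auc").mean()
lgb_cv_auc = cross_val_score(lgb_model, X, y, cv=cv, scoring="roc_auc").mean()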


print("Random Forest Test AUC:", rf_auc)
print("LightGBM Test AUC:", lgb_auc)
print("Random Forest CV AUC:", rf_cv_auc)
print("LightGBM CV AUC:", lgb_cv_auc)

[LightGBM] [Info] Number of positive: 711, number of negative: 2789
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001383 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5100
[LightGBM] [Info] Number of data points in the train set: 3500, number of used features: 20
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.203143 -> initscore=-1.366766
[LightGBM] [Info] Start training from score -1.366766
[LightGBM] [Info] Number of positive: 812, number of negative: 3188
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001028 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5100
[LightGBM] [Info] Number of data points in the train set: 4000, number of used features: 20
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.203000 -> initscore=-1.367649
[LightGBM] [Info] Start training from score -1.367649
[LightGBM] [Info] Nu