Q1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.
>- Ensemble Learning is a machine learning technique where multiple models (learners) are combined to make a single, stronger prediction.

Key idea:
Instead of relying on one model, ensemble learning aggregates the predictions of several models, for example by voting or averaging. Because different models tend to make different mistakes, their errors partly cancel out, which reduces overall error and makes the prediction more robust.

Common examples: Bagging (e.g., Random Forest) and Boosting (e.g., AdaBoost, Gradient Boosting).
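
To make the key idea concrete, here is a minimal sketch of why voting helps. It assumes, unrealistically, that every model is correct with the same probability and that the models' errors are fully independent. The function name `majority_vote_accuracy` and the value 0.6 are illustrative choices, not part of any library.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Assumption: each model is right 60% of the time, independently of the others.
for n in (1, 5, 25, 101):
    acc = majority_vote_accuracy(n, 0.6)
    print(f"{n:>3} models -> majority vote correct with p = {acc:.3f}")
```

Under these assumptions, five models already raise the chance of a correct vote from 60% to about 68%. Real models are trained on overlapping data, so their errors are correlated and the gain is smaller, which is why ensemble methods put effort into making the models diverse.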

Q2. What is the difference between Bagging and Boosting?
>- Difference between Bagging and Boosting (short):

Bagging (Bootstrap Aggregating):
Trains multiple models independently on different random samples of the data and combines their predictions to reduce variance.
Example: Random Forest

Boosting:
Trains models sequentially, where each new model focuses more on the previous model’s errors, aiming to reduce bias.
Example: AdaBoost, Gradient Boosting
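
The phrase "focuses more on the previous model's errors" can be illustrated with one round of the classic discrete AdaBoost weight update for binary labels. This is a sketch only: the sample count and the set of misclassified indices are hypothetical, and library implementations may use variants of this update.

```python
import math

# Hypothetical round: 10 samples with equal weights; the weak learner
# misclassifies samples 2, 5 and 7.
n_samples = 10
weights = [1.0 / n_samples] * n_samples
misclassified = {2, 5, 7}

# Weighted error of the weak learner and its vote strength (alpha).
error = sum(weights[i] for i in misclassified)
alpha = 0.5 * math.log((1.0 - error) / error)

# Increase the weight of the mistakes, decrease the rest, then renormalize.
updated = [
    w * math.exp(alpha if i in misclassified else -alpha)
    for i, w in enumerate(weights)
]
total = sum(updated)
updated = [w / total for w in updated]

weight_on_mistakes = sum(updated[i] for i in misclassified)
print(f"error = {error:.2f}, alpha = {alpha:.3f}")
print(f"total weight on misclassified samples after update: {weight_on_mistakes:.3f}")
```

After the update the three misclassified samples carry half of the total weight, so the next weak learner cannot do well without getting them right. Bagging has no such step: every model is trained independently of the others.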

Q3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?
>- Bootstrap sampling is a technique where multiple training datasets are created by randomly sampling with replacement from the original dataset.

Role in Bagging / Random Forest:
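It lets each model (tree) train on a different bootstrap sample, which has the same size as the original dataset but contains only about 63% of its distinct points on average. This increases diversity among the models, so combining their predictions reduces variance and overfitting. Random Forest additionally considers only a random subset of features at each split, which decorrelates the trees further.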
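
A minimal sketch of bootstrap sampling using only the standard library. The helper name `bootstrap_indices`, the seed, and the dataset size are assumptions made for illustration.

```python
import random


def bootstrap_indices(n: int, rng: random.Random) -> list:
    """Draw n row indices uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(42)  # fixed seed, only for reproducibility
n = 10_000               # hypothetical dataset size
sample = bootstrap_indices(n, rng)

# Duplicates are expected, so fewer than n distinct rows appear in the sample.
unique_fraction = len(set(sample)) / n
print(f"sample size: {len(sample)}")
print(f"fraction of distinct rows in the sample: {unique_fraction:.3f}")
```

For large datasets the fraction of distinct rows approaches 1 - 1/e, roughly 0.632. The rows left out of a given sample are what later serve as that model's out-of-bag data.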

Q4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
>- Out-of-Bag (OOB) samples are the data points not included in a bootstrap sample for a particular model in an ensemble.

OOB score:
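Each training point is predicted using only the models for which it was out-of-bag, that is, the models that never saw it during training. The accuracy of these predictions is the OOB score, which serves as an internal estimate of model performance without needing a separate test set.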
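
The bookkeeping behind an OOB score can be sketched with a toy ensemble. Everything here is an assumption for illustration: the 1-D synthetic data, the threshold-based base learner `fit_threshold`, and the ensemble size of 50. It is not how any particular library implements the score.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed, only for reproducibility

# Hypothetical 1-D toy data: class 0 centred at 0.0, class 1 centred at 3.0.
X = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
y = [0] * 100 + [1] * 100
n = len(X)


def fit_threshold(rows):
    """Toy base learner: a cut-off halfway between the two class means."""
    class0 = [X[i] for i in rows if y[i] == 0]
    class1 = [X[i] for i in rows if y[i] == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2


oob_votes = [Counter() for _ in range(n)]
for _ in range(50):  # 50 models in the ensemble
    in_bag = [rng.randrange(n) for _ in range(n)]
    threshold = fit_threshold(in_bag)
    # Only rows this model never saw are allowed to receive its vote.
    for i in set(range(n)) - set(in_bag):
        oob_votes[i][int(X[i] > threshold)] += 1

# Score each row by the majority of its out-of-bag votes.
scored = [i for i in range(n) if oob_votes[i]]
oob_accuracy = sum(oob_votes[i].most_common(1)[0][0] == y[i] for i in scored) / len(scored)
print(f"rows with at least one OOB vote: {len(scored)} of {n}")
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

In scikit-learn the same idea is available by passing `oob_score=True` to `RandomForestClassifier` and reading the `oob_score_` attribute after fitting.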

Q5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
>- Feature Importance in a Single Decision Tree:

Calculated based on how much a feature reduces impurity (like Gini or Entropy) in that tree.

Can be unstable—small changes in data may change importance.

Feature Importance in Random Forest:

Averaged over all trees in the forest.

More robust and reliable, because averaging over many trees trained on different bootstrap samples smooths out the quirks of any single tree.

Q6. Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.



In [1]:
# Import libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print top 5 features
print("Top 5 Important Features:")
print(feature_importance_df.head(5))


Top 5 Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


Q7. Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree.


In [5]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

# Train a Bagging Classifier using Decision Trees
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # this parameter was named base_estimator before scikit-learn 1.2
    n_estimators=50,
    random_state=42
)
bag.fit(X_train, y_train)
y_pred_bag = bag.predict(X_test)
acc_bag = accuracy_score(y_test, y_pred_bag)

# Print the accuracies
print(f"Decision Tree Accuracy: {acc_dt:.3f}")
print(f"Bagging Classifier Accuracy: {acc_bag:.3f}")

Decision Tree Accuracy: 1.000
Bagging Classifier Accuracy: 1.000


Q8. Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy


In [6]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5, 7]
}

# Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_

# Evaluate final model on test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Hyperparameters:", best_params)
print(f"Final Test Accuracy: {final_accuracy:.3f}")


Best Hyperparameters: {'max_depth': None, 'n_estimators': 100}
Final Test Accuracy: 1.000


Q9. Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)


In [7]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
california = fetch_california_housing()
X = california.data
y = california.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

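# Train Bagging Regressor (the default base estimator is a decision tree)
# Note: 50 estimators here vs. 100 for the Random Forest below, so the comparison is not like-for-like
bag_reg = BaggingRegressor(n_estimators=50, random_state=42)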
bag_reg.fit(X_train, y_train)
y_pred_bag = bag_reg.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print Mean Squared Errors
print(f"Bagging Regressor MSE: {mse_bag:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")


Bagging Regressor MSE: 0.2579
Random Forest Regressor MSE: 0.2565


Q10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.
>- Step 1: Choose Between Bagging or Boosting

Bagging: Reduces variance, useful if individual models overfit (e.g., decision trees).

Boosting: Reduces bias, sequentially improves weak learners, useful if underfitting is an issue.

Approach: Start with Random Forest (Bagging) as a stable baseline, then try Gradient Boosting (e.g., XGBoost) and keep it only if it improves cross-validated performance.

>- Step 2: Handle Overfitting

Limit tree depth (max_depth) and minimum samples per leaf (min_samples_leaf).

Regularize boosting models: use a smaller learning_rate, shallower trees, row subsampling (subsample), and early stopping.

Apply feature selection to remove noisy features.

>- Step 3: Select Base Models

Decision Trees are common base models for both Bagging and Boosting.

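For Bagging: Deep trees work well, since each has low bias and averaging many of them reduces their high variance.

For Boosting: Shallow trees (weak learners) work best, since each round adds a small correction that reduces bias.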

>- Step 4: Evaluate Performance Using Cross-Validation

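Use stratified k-fold cross-validation so that every fold keeps roughly the same default rate and the performance estimates are robust.

Since class imbalance is likely (defaults are usually a small minority), accuracy alone can be misleading. Evaluate with precision, recall, F1-score, and ROC-AUC alongside accuracy.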
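
The point about class imbalance can be shown with a small worked example. The portfolio below (1,000 loans with a 5% default rate) and the helper `classification_metrics` are hypothetical and used only to illustrate the arithmetic.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = default)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical portfolio: 50 defaults among 1,000 loans.
y_true = [1] * 50 + [0] * 950

# A useless "model" that predicts that nobody ever defaults.
always_repay = [0] * 1000

baseline = classification_metrics(y_true, always_repay)
print(baseline)
```

This baseline reaches 95% accuracy while catching none of the defaults (recall and F1 are both 0), which is why recall, F1, and ROC-AUC matter more than accuracy for this problem.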

>- Step 5: Justify Ensemble Learning

Ensemble learning combines multiple models to reduce errors from individual models.

In finance, this makes loan default predictions more accurate and more stable, which reduces risk: the combined model can pick up subtle patterns in customer behavior and transaction history that a single model might miss.

In [8]:
# Import libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Simulated loan dataset (for demonstration)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging: Random Forest
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
cv_rf = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')

# Boosting: Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
cv_gb = cross_val_score(gb, X_train, y_train, cv=5, scoring='accuracy')

# Print results
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Random Forest CV Accuracy:", cv_rf.mean())
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))
print("Gradient Boosting CV Accuracy:", cv_gb.mean())


Random Forest Accuracy: 0.89
Random Forest CV Accuracy: 0.9014285714285715
Gradient Boosting Accuracy: 0.9066666666666666
Gradient Boosting CV Accuracy: 0.9199999999999999
