1. What is Ensemble Learning in machine learning? Explain the key idea behind it.
- Ensemble learning is a technique in machine learning where multiple models (often called base learners or weak learners) are combined to form a single, more powerful model (a strong learner).
- The key idea:
 Instead of relying on one model, we combine predictions from multiple models to:
  - Reduce variance (Bagging)

  - Reduce bias (Boosting)

  - Improve generalization and stability
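
Why combining helps can be illustrated with a small calculation. The sketch below assumes an idealized setting that the text above does not state: every model is correct with the same probability `p`, and the models' errors are independent. Real models are correlated, so the actual gain is smaller. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models is correct.

    Assumption: each model is right with probability p, independently
    of the others (an idealization; real ensembles have correlated errors).
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each individual model is only 60% accurate in this toy setting.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the accuracy of the vote rises as models are added, which is the intuition behind the variance reduction in Bagging.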

2. What is the difference between Bagging and Boosting?
- Bagging (Bootstrap Aggregating)

  - Trains multiple models independently on different bootstrap samples.

  - Predictions are combined using majority vote (classification) or average (regression).

  - Main goal: Reduce variance and improve stability.

  - Example: Random Forest.

- Boosting

  - Trains models sequentially, where each new model focuses on the errors of the previous one.

  - Predictions are combined in a weighted manner.

  - Main goal: Reduce bias and make the model more accurate.

  - Examples: AdaBoost, Gradient Boosting, XGBoost.
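
The phrase "focuses on the errors of the previous one" can be made concrete with one reweighting round of discrete (binary) AdaBoost. This is a minimal sketch, not a full implementation: the helper name `adaboost_round` and the toy weights are assumptions for illustration, and the weak learner itself is left out.

```python
import math


def adaboost_round(weights, is_correct):
    """One reweighting round of discrete AdaBoost (binary labels).

    weights    : current sample weights (need not be normalized)
    is_correct : True where the current weak learner predicted correctly
    Returns the learner's vote weight alpha and the normalized new weights.
    """
    total = sum(weights)
    # Weighted error rate of the current weak learner.
    eps = sum(w for w, ok in zip(weights, is_correct) if not ok) / total
    if not 0 < eps < 1:
        raise ValueError("weighted error must be strictly between 0 and 1")
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Misclassified samples are scaled up, correct ones are scaled down.
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, is_correct)
    ]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


# Toy example: four samples, the weak learner gets the last one wrong.
weights = [0.25, 0.25, 0.25, 0.25]
is_correct = [True, True, True, False]
alpha, new_weights = adaboost_round(weights, is_correct)
print("alpha:", round(alpha, 4))
print("new weights:", [round(w, 4) for w in new_weights])
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. The value `alpha` is the weight this learner receives in the final weighted vote.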

3. What is bootstrap sampling and what role does it play in Bagging methods?
- Bootstrap Sampling = Random sampling with replacement from the training data to create multiple subsets.

  Each subset has the same size as the original dataset, but due to replacement, some samples appear multiple times and some not at all.

- Role in Bagging:

  - Ensures diversity in training subsets.

  - Each base learner trains on different subsets, reducing variance.

  - Used in Random Forest for training multiple decision trees.
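
A minimal sketch of a bootstrap draw using only the standard library. The helper name `bootstrap_indices`, the dataset size, and the seed are assumptions chosen for illustration.

```python
import random


def bootstrap_indices(n: int, rng: random.Random) -> list[int]:
    """Draw n row indices with replacement from range(n)."""
    return rng.choices(range(n), k=n)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
sample = bootstrap_indices(n, rng)

# Same size as the original data, but with repeats and omissions.
unique_fraction = len(set(sample)) / n
print("sample size:", len(sample))
print("fraction of distinct original rows:", round(unique_fraction, 3))
```

For large `n` the distinct fraction is expected to be close to 1 - 1/e (about 63%), so each base learner sees a noticeably different view of the data.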

4. What are Out-of-Bag (OOB) samples and how is OOB score used to
- Each bootstrap sample leaves out part of the original data (about 36.8% of the rows on average for large datasets). The rows a given base learner never saw are its Out-of-Bag (OOB) samples.

- OOB samples act like a validation set to estimate performance without cross-validation.

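- OOB Score: Each training row is predicted using only the trees that did not see it during training; the accuracy of those predictions (for classifiers) is the OOB score → useful for model evaluation in Bagging/Random Forest.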
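
In scikit-learn the OOB estimate can be requested with `oob_score=True` and read from the fitted model's `oob_score_` attribute. The sketch below reuses the Breast Cancer dataset from this notebook; the choice of 200 trees and the variable name `rf_oob` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# bootstrap=True is the default and is required for an OOB estimate.
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_oob.fit(X, y)

# Accuracy measured only on rows each tree did not see during training.
print("OOB accuracy estimate:", rf_oob.oob_score_)
```

No separate validation split is used here, which is the main convenience of the OOB score.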

5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
- Single Decision Tree: Importance is the total impurity reduction (Gini or entropy) contributed by the splits on each feature.

  - Can be biased toward features with many categories.

- Random Forest: Aggregates feature importance across many trees.

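  - More stable and robust than a single tree's ranking, although impurity-based importances can still favor features with many distinct values.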

  - Provides a better global ranking of features.
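
The aggregation can be checked directly. To my understanding of current scikit-learn versions, a forest's `feature_importances_` is the average of its trees' normalized importances; the sketch below tests that on the Breast Cancer dataset rather than taking it for granted. Variable names and seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One row of importances per tree in the forest.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
mean_importance = per_tree.mean(axis=0)

print("mean of tree importances matches forest:",
      np.allclose(mean_importance, forest.feature_importances_))
print("features used by the single tree:",
      int((tree.feature_importances_ > 0).sum()))
print("features used by the forest:",
      int((forest.feature_importances_ > 0).sum()))
```

A single tree typically assigns zero importance to every feature it never splits on, while the forest spreads importance over more features because each tree sees different rows and candidate features.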


In [1]:
#6. Write a Python program to:● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer() ● Train a Random Forest Classifier
#● Print the top 5 most important features based on feature importance scores.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importance
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

print("Top 5 Important Features:")
print(feature_importance_df.head(5))


Top 5 Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [3]:
# 7. Write a Python program to: ● Train a Bagging Classifier using Decision Trees on the Iris dataset ● Evaluate its accuracy and compare with a single Decision Tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))

# Bagging Classifier
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
print("Bagging Classifier Accuracy:", accuracy_score(y_test, y_pred_bag))

Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0
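
Both models reach 1.0 on this single 45-row test split, so the split cannot separate them. A hedged follow-up is to compare them with stratified 5-fold cross-validation; the fold count and seeds below are assumptions, and no particular outcome is implied.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    # Base estimator passed positionally so the call does not depend on
    # the keyword name (estimator vs. base_estimator in older releases).
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The mean and spread across folds give a steadier comparison than one train/test split on a dataset this small.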


In [4]:
#8. Write a Python program to: ● Train a Random Forest Classifier ● Tune hyperparameters max_depth and n_estimators using GridSearchCV ● Print the best parameters and final accuracy
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Random Forest
rf = RandomForestClassifier(random_state=42)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20]
}

grid = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)

# Evaluate
best_rf = grid.best_estimator_
y_pred = best_rf.predict(X_test)
print("Final Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'max_depth': None, 'n_estimators': 200}
Final Accuracy: 0.9707602339181286


In [7]:
#9. Write a Python program to: ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset ● Compare their Mean Squared Errors (MSE)
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging Regressor
bagging = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print("Bagging Regressor MSE:", mse_bag)
print("Random Forest Regressor MSE:", mse_rf)

Bagging Regressor MSE: 0.25787382250585034
Random Forest Regressor MSE: 0.25650512920799395


10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.
  
    You decide to use ensemble techniques to increase model performance.
  
    Explain your step-by-step approach to: ● Choose between Bagging or Boosting ● Handle overfitting ● Select base models ● Evaluate performance using cross-validation ● Justify how ensemble learning improves decision-making in this real-world context.

- Step-by-Step Approach:

1. Choose between Bagging or Boosting

   - If data is noisy → Bagging (Random Forest)

   - If we need higher accuracy and bias reduction → Boosting (XGBoost/LightGBM)

   - Defaults are usually the minority class, and Boosting's focus on hard-to-classify cases helps catch them → Boosting is often preferred.

2. Handle Overfitting

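   - Use cross-validation to detect overfitting (a large gap between training and validation scores)

   - Limit tree depth; for Boosting, also lower the learning rate and stop adding trees once the validation score stops improving

   - Use regularization (e.g. L1/L2 penalties in XGBoost, subsampling of rows and features)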

3. Select Base Models

   - Decision Trees (commonly used as weak learners)

   - Can also try Logistic Regression or SVM with Bagging

4. Evaluate Performance

   - Use cross-validation with metrics like AUC-ROC, Precision, Recall, F1-score (since data may be imbalanced).

   - OOB error for Random Forest.

5. Justification for Ensemble Learning

   - Captures complex non-linear patterns.

   - Reduces risk of missing potential defaults.

   - Provides more stable predictions than a single model.

   - Improves decision-making → lower loan default risk → better financial outcomes.
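
The steps above can be sketched end to end. No real loan data is available here, so the example uses a synthetic imbalanced dataset from `make_classification` as a stand-in (an assumption: roughly 10% positives play the role of defaults). The hyperparameters are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data (assumption): class 1 = "default", ~10% of rows.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

candidates = {
    # Bagging-style model: depth limit and class weighting for imbalance.
    "Bagging (Random Forest)": RandomForestClassifier(
        n_estimators=100, max_depth=8, class_weight="balanced", random_state=42
    ),
    # Boosting model: shallow trees, small learning rate, row subsampling.
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.05, max_depth=3,
        subsample=0.8, random_state=42,
    ),
}

# Stratified folds keep the default rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["roc_auc", "precision", "recall", "f1"]

summary = {}
for name, model in candidates.items():
    cv_results = cross_validate(model, X, y, cv=cv, scoring=scoring)
    summary[name] = {m: cv_results[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in summary[name].items()})
```

Comparing AUC-ROC together with recall on the default class shows whether an ensemble catches more defaults, not just whether overall accuracy is high. On real data the choice between the two candidates should follow these cross-validated metrics.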