# Ensemble Learning Assignment

**Question 1. What is Ensemble Learning in machine learning? Explain the key idea behind it.**
- Ensemble learning is a technique where multiple machine learning models are combined to make a single, stronger prediction.

- The key idea is that a group of weak or average models, when combined, can perform better than any single one of them, provided their errors are not strongly correlated.

- Ensemble learning improves:

  - accuracy
  - robustness
  - generalization

- Common ensemble methods include Bagging (for example, Random Forest), Boosting, and Stacking.
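
The key idea can be illustrated with a small simulation. This is a minimal sketch, not a real model: it assumes 25 hypothetical classifiers that are each correct 60% of the time and, as an idealization, make their errors independently of one another. The names `n_models` and `p_correct` are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 10_000   # number of simulated test points
n_models = 25        # odd, so a majority vote never ties
p_correct = 0.6      # assumed accuracy of each individual model

# correct[i, j] is True when model j classifies sample i correctly.
# Independence between models is an idealizing assumption.
correct = rng.random((n_samples, n_models)) < p_correct

single_accuracy = correct[:, 0].mean()
# The majority vote is right when more than half of the models are right.
ensemble_accuracy = (correct.sum(axis=1) > n_models / 2).mean()

print(f"Single model accuracy:  {single_accuracy:.3f}")
print(f"Majority-vote accuracy: {ensemble_accuracy:.3f}")
```

Under these assumptions the majority vote scores well above any single model. Real models share training data and features, so their errors are correlated and the gain is usually smaller.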

**Question 2. What is the difference between Bagging and Boosting?**
- Bagging (Bootstrap Aggregating):
  - Models are trained independently.
  - Each model is trained on a random bootstrap sample of the dataset.
  - All models have equal importance.
  - Mainly reduces variance.
  - Example: Random Forest.

- Boosting:
  - Models are trained sequentially.
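  - Each new model focuses on the errors of the previous ones: AdaBoost increases the weights of misclassified samples, while Gradient Boosting fits the remaining residual errors.
  - Models are combined with different weights, typically based on how well each one performed.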
  - Mainly reduces bias.
  - Examples: AdaBoost, Gradient Boosting, XGBoost.
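
To make the "focuses more on previously misclassified samples" step concrete, here is a sketch of a single reweighting round in the classic binary AdaBoost formulation. The labels and predictions are made-up toy values, and scikit-learn's `AdaBoostClassifier` uses a multi-class variant, so treat this as an illustration of the idea rather than the library's exact procedure.

```python
import numpy as np

# Toy labels and one weak learner's predictions (hypothetical values).
y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])
y_pred = np.array([1, 1, -1, -1, -1, -1, 1, 1])   # two mistakes

weights = np.full(len(y_true), 1 / len(y_true))    # start uniform

miss = y_pred != y_true
error = weights[miss].sum()                        # weighted error rate
alpha = 0.5 * np.log((1 - error) / error)          # learner's vote weight

# Increase weights of misclassified samples, decrease the rest, renormalize.
weights = weights * np.exp(np.where(miss, alpha, -alpha))
weights = weights / weights.sum()

print("Weighted error:", error)
print("Learner weight (alpha):", round(alpha, 3))
print("New sample weights:", np.round(weights, 3))
```

After the update, the two misclassified samples together carry half of the total weight, so the next learner is pushed toward getting them right. Bagging has no such step: every model sees an independent bootstrap sample with uniform weights.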

**Question 3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**
- Bootstrap sampling is a technique where:
  - multiple datasets are created by random sampling with replacement from the original dataset.
  - each bootstrap sample has the same size as the original dataset.

- Role in Bagging and Random Forest:
  - Each model (tree) is trained on a different bootstrap sample.
  - This introduces diversity among models.
  - Reduces overfitting and variance.
  - Improves overall model stability and accuracy.
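
A bootstrap sample is easy to draw by hand. The sketch below uses NumPy on a stand-in array of 1,000 row indices (an arbitrary size chosen for illustration) and checks how many distinct rows end up in the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
data = np.arange(n)   # stand-in for the row indices of a dataset

# Draw n indices with replacement: one bootstrap sample.
boot_idx = rng.integers(0, n, size=n)
bootstrap_sample = data[boot_idx]

unique_fraction = np.unique(boot_idx).size / n
left_out_fraction = 1 - unique_fraction   # rows never drawn

print("Sample size:", bootstrap_sample.size)
print(f"Unique rows in sample: {unique_fraction:.1%}")
print(f"Rows left out:         {left_out_fraction:.1%}")
```

The sample has the same size as the original data but contains duplicates, so roughly a third of the rows are missing from any one sample. Each tree therefore sees a slightly different dataset, which is the source of diversity in Bagging.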

**Question 4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**
- Out-of-Bag samples are the data points not selected in a particular bootstrap sample.

- Key points:
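  - On average, about 36.8% of the data is left out of each bootstrap sample, because the probability that a given point is never picked in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.368.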
  - These unused samples are called OOB samples.

- OOB score:
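  - Each sample is predicted only by the models that did not see it during training.
  - These predictions are combined (by voting or averaging) into one OOB prediction per sample.
  - The accuracy of these OOB predictions (or R² for regression) gives the OOB score.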

- Benefits:
  - Acts as an internal validation set.
  - Reduces the need for a separate validation set, although a held-out test set is still advisable for the final evaluation.
  - Commonly used in Random Forest models.
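
In scikit-learn, the OOB score is available by passing `oob_score=True` to the forest and reading `oob_score_` after fitting. The sketch below compares it with accuracy on a held-out split; the dataset, split size, and `n_estimators=200` are illustrative choices, not part of the question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# oob_score=True requires bootstrap=True, which is the default.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_
test_accuracy = rf.score(X_test, y_test)

print("OOB accuracy:     ", oob_accuracy)
print("Held-out accuracy:", test_accuracy)
```

The two numbers are usually close, which is why the OOB score is a convenient validation estimate that costs no extra data.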

**Question 5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**
- Feature importance in a Decision Tree is calculated from how much each feature reduces impurity at the splits of that single tree.
  Because it depends on one tree, it can be unstable: small changes in the data can move the importance from one feature to a correlated one.

- Feature importance in a Random Forest is calculated by averaging the impurity-based importance across many trees.
  This makes it more reliable and stable, since averaging over many trees reduces the variance caused by individual splits.

- In summary:
  - Decision Tree importance → fast but unstable
  - Random Forest importance → stable and more trustworthy
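
One way to see the stability difference is to refit both models on several bootstrap resamples of the same data and measure how much each feature's importance moves. This is a rough sketch: the 10 refits, the 50 trees per forest, and the "mean standard deviation" summary are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

tree_imps, forest_imps = [], []
for seed in range(10):
    # Refit both models on a fresh bootstrap resample of the data.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
    forest = RandomForestClassifier(n_estimators=50, random_state=seed)
    forest.fit(X[idx], y[idx])
    tree_imps.append(tree.feature_importances_)
    forest_imps.append(forest.feature_importances_)

# Standard deviation of each feature's importance across the 10 refits,
# averaged over features: lower means more stable.
tree_spread = np.std(tree_imps, axis=0).mean()
forest_spread = np.std(forest_imps, axis=0).mean()

print(f"Mean importance spread, single tree:   {tree_spread:.4f}")
print(f"Mean importance spread, random forest: {forest_spread:.4f}")
```

A smaller spread for the forest indicates that its ranking changes less from one resample to the next. Both scores are impurity-based, so they share the same limitations with correlated features.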

**Question 6. Write a Python program to:**
- Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores.

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Feature importance
importances = pd.Series(rf.feature_importances_, index=X.columns)

# Top 5 features
top_5 = importances.sort_values(ascending=False).head(5)
print("Top 5 Important Features:")
print(top_5)

Top 5 Important Features:
worst perimeter         0.173987
worst radius            0.121983
worst concave points    0.118589
mean concave points     0.088527
worst area              0.074981
dtype: float64


**Question 7. Write a Python program to:**
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

# Bagging Classifier
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=0
)
bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)

Decision Tree Accuracy: 0.9777777777777777
Bagging Classifier Accuracy: 0.9777777777777777


**Question 8. Write a Python program to:**
- Train a Random Forest Classifier
- Tune hyperparameters max_depth and n_estimators using GridSearchCV
- Print the best parameters and final accuracy

In [4]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Model
rf = RandomForestClassifier(random_state=0)

# Grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5, 10]
}

grid = GridSearchCV(rf, param_grid, cv=5)
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Final Accuracy:", accuracy_score(y_test, y_pred))

Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 0.9590643274853801


**Question 9. Write a Python program to:**
- Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
- Compare their Mean Squared Errors (MSE)

In [6]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Bagging Regressor
bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # `base_estimator` was renamed to `estimator` in scikit-learn 1.2
    n_estimators=50,
    random_state=0
)
bag.fit(X_train, y_train)
bag_mse = mean_squared_error(y_test, bag.predict(X_test))

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)
rf_mse = mean_squared_error(y_test, rf.predict(X_test))

print("Bagging Regressor MSE:", bag_mse)
print("Random Forest Regressor MSE:", rf_mse)



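HTTPError: HTTP Error 403: Forbidden

Note: this run failed while `fetch_california_housing()` was downloading the dataset, so no MSE values were recorded for this question.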

**Question 10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.**

Explain your step-by-step approach to:
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world context.

In [8]:
# Ensemble Learning for Loan Default Prediction (Simulated Example)

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Create synthetic loan default dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    random_state=0
)

# Base model: Decision Tree
dt = DecisionTreeClassifier(random_state=0)

# Bagging model
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # fully grown trees as base learners
    n_estimators=100,
    random_state=0
)

# Boosting model (AdaBoost)
boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps as weak learners
    n_estimators=100,
    random_state=0
)

# Cross-validation scores
dt_score = cross_val_score(dt, X, y, cv=5, scoring='accuracy').mean()
bag_score = cross_val_score(bagging, X, y, cv=5, scoring='accuracy').mean()
boost_score = cross_val_score(boosting, X, y, cv=5, scoring='accuracy').mean()

# Print results
print("Decision Tree Accuracy:", dt_score)
print("Bagging Accuracy:", bag_score)
print("Boosting Accuracy:", boost_score)

Decision Tree Accuracy: 0.9339999999999999
Bagging Accuracy: 0.9570000000000001
Boosting Accuracy: 0.9279999999999999


**Step-by-step approach:**
- Choosing Bagging vs Boosting
  - Bagging improves performance by reducing variance.
  - Boosting improves performance by reducing bias.
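  - In the simulated run above, Bagging scored highest (0.957, against 0.934 for a single tree and 0.928 for AdaBoost), so the choice should follow cross-validated results on the real data rather than a fixed rule.
  - Boosting is worth trying when the base model underfits, because it concentrates on hard-to-classify cases.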

- Handling Overfitting
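  - Cross-validation exposes overfitting by scoring each model only on data it was not trained on.
  - Bagging reduces overfitting by averaging many high-variance trees.
  - Boosting limits overfitting by using weak learners (shallow trees), a small learning rate, and a capped number of estimators.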

- Base Models
  - Decision Trees are used as base learners.
  - Shallow trees are used in Boosting to avoid overfitting.

- Evaluation Using Cross-Validation
  - 5-fold cross-validation is used.
  - Accuracy scores are averaged for reliable comparison.

- Why Ensemble Learning Improves Decisions
  - Combines multiple models for better predictions.
  - Can reduce false loan approvals (defaulters predicted as safe) when the ensemble generalizes better than a single model.
  - Improves risk management and financial stability.
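
Because defaults are usually much rarer than repayments, accuracy alone can look good while most defaulters are missed. The sketch below extends the evaluation step under that assumption: it uses a synthetic dataset with roughly 10% positives (a made-up class ratio), stratified folds, and three metrics via `cross_validate`. The model and its settings are illustrative, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% positives ("defaults").
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
# Stratified folds keep the default rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

metrics = ["accuracy", "roc_auc", "recall"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

for metric in metrics:
    print(f"{metric:>8}: {scores['test_' + metric].mean():.3f}")
```

Recall on the default class shows how many actual defaulters the model catches, which is often the figure a lender cares about most. The same loop can be repeated for Bagging and Boosting models to compare them on equal terms.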