ENSEMBLE LEARNING ASSIGNMENT

1. What is Ensemble Learning in machine learning? Explain the key idea behind it.
- Ensemble learning combines multiple machine learning models (base learners) into a single prediction that is typically more accurate and more robust than that of any individual model.
- The key idea is rooted in the "wisdom of crowds" principle: when diverse models are combined strategically, their individual errors tend to cancel out, provided those errors are not strongly correlated.
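
The error-cancelling effect can be illustrated with a small calculation. The sketch below is a simplification, not a model of any real ensemble: it assumes every base model is correct with the same probability `p` and that the models err independently. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models is correct.

    Assumes each model is right with probability p and that errors are
    independent, which real ensembles only approximate.
    """
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Accuracy of the majority vote as more 60%-accurate models are added.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote improves as models are added. When errors are correlated, as they are in practice, the gain is smaller, which is why diversity among base learners matters.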

2. What is the difference between Bagging and Boosting?
- Bagging (Bootstrap Aggregating): parallel training
  - Bagging trains multiple base models (e.g., decision trees) in parallel, each on its own bootstrap sample of the training data.
  - Base models are independent of each other; their predictions are combined by voting or averaging, which mainly reduces variance.
- Boosting: sequential training
  - Boosting trains multiple base models sequentially, with each subsequent model learning from the errors of its predecessors.
  - Each new model is influenced by the performance of the previously built models, which mainly reduces bias.
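
To make "learning from the errors of its predecessors" concrete, here is a sketch of one reweighting round in discrete AdaBoost. AdaBoost is only one boosting algorithm and is chosen here as an illustration; the answer above does not name a specific one. The helper name `adaboost_reweight` and the toy labels are made up.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One discrete AdaBoost round for labels in {-1, +1}.

    Expects weights that sum to 1. Returns the weak learner's vote
    weight (alpha) and the renormalised sample weights.
    """
    # Weighted error rate of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is -1 for a mistake and +1 for a hit, so mistakes are
    # scaled up and correct predictions are scaled down.
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]  # the weak learner gets the second sample wrong
weights = [0.2] * 5
alpha, weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in weights])
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no such step: every base model is trained without knowledge of the others.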

3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
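- Bootstrap sampling is a resampling technique in which a sample, usually the same size as the original dataset, is drawn at random with replacement. Some rows therefore appear more than once and others not at all.
- In Bagging (Bootstrap Aggregating) methods such as Random Forest, bootstrap sampling serves three purposes:
  - Creating diverse training sets: each base learner sees a different sample of the data.
  - Reducing variance and overfitting: averaging models trained on different samples smooths out their individual errors.
  - Enabling parallel training: the samples are drawn independently, so the base learners can be trained at the same time.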
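
A minimal sketch of drawing one bootstrap sample, using only the standard library. The helper name `bootstrap_indices` and the dataset size of 10,000 rows are arbitrary choices for the illustration.

```python
import random


def bootstrap_indices(n_rows: int, rng: random.Random) -> list[int]:
    """Draw n_rows row indices uniformly at random, with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(42)
n_rows = 10_000
sample = bootstrap_indices(n_rows, rng)

# Because sampling is with replacement, many rows repeat and some never appear.
unique_fraction = len(set(sample)) / n_rows
print(f"unique rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"rows left out of the sample:         {1 - unique_fraction:.3f}")
```

For large datasets the expected share of distinct rows approaches 1 - 1/e (about 63%), so each base learner trains on a noticeably different view of the data.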

4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
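- Out-of-bag (OOB) samples are the training rows that were not drawn into the bootstrap sample used to train a given base model. On average, roughly one third of the rows are out-of-bag for each model.
- The OOB score is a performance estimate for bagging-based ensembles such as Random Forest. Each row is predicted using only the models that did not see it during training, and the aggregated predictions are scored against the true labels. This gives a validation-style estimate without a separate hold-out set.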
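
In scikit-learn the OOB estimate is available by passing `oob_score=True`. The sketch below uses the Breast Cancer dataset and 200 trees; both are arbitrary choices for the illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each training row using only the trees whose
# bootstrap sample did not contain that row.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

No train/test split is needed here, because every row is scored only by trees that never saw it.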

5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
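- Single Decision Tree
  - Impurity-based importance (Gini importance, or Mean Decrease in Impurity): a feature's score is the total impurity reduction from the splits that use it.
  - Instability and bias: the scores can change considerably with small changes in the training data, and impurity-based scores tend to favour high-cardinality features.
- Random Forest
  - Ensemble averaging: impurity-based importances are averaged over all trees, which makes them more stable.
  - Permutation importance: an alternative that measures the drop in score when a feature's values are shuffled. It is model-agnostic, so it is not exclusive to Random Forest.
  - Robustness and generalisation: the averaged scores are less sensitive to the quirks of any single tree.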
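
The two kinds of importance can be put side by side as follows. This is a sketch under assumed settings: the 70/30 split, 100 trees and `n_repeats=10` are illustrative choices, not part of the answer above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importances: derived from the training data, averaged over trees.
mdi = rf.feature_importances_

# Permutation importances: mean drop in held-out accuracy when a column is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# Show both scores for the five features ranked highest by impurity.
top = mdi.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<22} MDI={mdi[i]:.3f}  "
          f"permutation={perm.importances_mean[i]:.3f}")
```

The two rankings need not agree. With strongly correlated features, shuffling one column often changes the score very little because the model can fall back on its correlated partners.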

6. Write a Python program to:
- Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores.


In [3]:
#load dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
x = data.data
y = data.target
feature_names = data.feature_names

In [4]:
#train random forest classifier
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(x, y)

In [6]:
#get feature importance
feature = rf_classifier.feature_importances_

In [7]:
import pandas as pd
feature_importances = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature
})
# Sort by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

# Print top 5 features
print("Top 5 most important features:")
print(feature_importances.head(5))

Top 5 most important features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


7. Write a Python program to:
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree

In [18]:
#load iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
x, y = iris.data, iris.target

In [19]:
#train test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [20]:
#train a single decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(random_state=42)
dt.fit(x_train, y_train)
y_pred_dt = dt.predict(x_test)

accuracy_dt = accuracy_score(y_test, y_pred_dt)

In [27]:
#train bagging classifier with decision tree
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Reuses the x_train/x_test split from the cell above so that both
# models are evaluated on the same test set
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,        # number of trees
    random_state=42
)
bagging.fit(x_train, y_train)
y_pred_bagging = bagging.predict(x_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

In [22]:
#To get accuracy
print("Accuracy of single Decision Tree: {:.2f}".format(accuracy_dt))
print("Accuracy of Bagging Classifier: {:.2f}".format(accuracy_bagging))

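Accuracy of single Decision Tree: 1.00
Accuracy of Bagging Classifier: 1.00

Both models classify all 30 test samples correctly, so this split cannot separate them; a larger test set or cross-validation would be needed to show a difference.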


8. Write a Python program to:
- Train a Random Forest Classifier
- Tune hyperparameters max_depth and n_estimators using GridSearchCV
- Print the best parameters and final accuracy

In [28]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Predict on test data
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Final Test Accuracy: {:.2f}".format(accuracy))

Best Parameters: {'max_depth': None, 'n_estimators': 200}
Final Test Accuracy: 0.97


9. Write a Python program to:
- Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
- Compare their Mean Squared Errors (MSE)

In [30]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagging Regressor with Decision Trees
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging_reg.fit(X_train, y_train)
y_pred_bagging = bagging_reg.predict(X_test)

# Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)

# Compute MSE
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print results
print("Mean Squared Error (Bagging Regressor): {:.4f}".format(mse_bagging))
print("Mean Squared Error (Random Forest Regressor): {:.4f}".format(mse_rf))

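Mean Squared Error (Bagging Regressor): 0.2573
Mean Squared Error (Random Forest Regressor): 0.2573

The two errors match to four decimal places. This is plausible rather than suspicious: RandomForestRegressor considers all features at every split by default (max_features=1.0), so with default settings it is effectively bagging of decision trees. Setting max_features below 1.0 would make the two models differ.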


10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world context.


Step 1: Choose between Bagging or Boosting
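- Bagging works best when the base models have high variance and tend to overfit; Random Forest is the usual choice. It is a robust baseline that needs little tuning.
- Boosting focuses on difficult-to-classify cases by giving more weight to misclassified samples. It often reaches higher accuracy on tabular data, but it is more sensitive to noisy labels and needs more careful tuning.
- Approach: start with Random Forest as a baseline, then try a boosting model (e.g., gradient boosting) and keep whichever scores better under cross-validation.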

Step 2: Handle Overfitting

Regularization techniques:
- Limit max_depth of trees.
- Set min_samples_split (scikit-learn) or min_child_weight (XGBoost).
- Use row and column subsampling (subsample and colsample_bytree in XGBoost).
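
The sketch below applies these controls with scikit-learn's `GradientBoostingClassifier`. Several things in it are assumptions for illustration only: the data is synthetic (about 10% positives standing in for defaults), `max_features` is used as the closest scikit-learn analogue of column subsampling, and early stopping via `n_iter_no_change` is an extra safeguard that the list above does not mention.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: about 10% of rows are "defaults".
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,
    max_depth=3,              # shallow trees
    min_samples_split=20,     # do not split very small nodes
    subsample=0.8,            # row subsampling for each tree
    max_features=0.8,         # share of columns considered at each split
    validation_fraction=0.1,  # held out internally for early stopping
    n_iter_no_change=10,      # stop once the validation score stalls
    random_state=42,
)
gb.fit(X_train, y_train)

print("trees actually fitted:", gb.n_estimators_)
print(f"train accuracy: {gb.score(X_train, y_train):.3f}")
print(f"test accuracy:  {gb.score(X_test, y_test):.3f}")
```

A large gap between train and test accuracy is the signal to tighten these settings further.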

Step 3: Select Base Models

Decision Trees: the natural choice, since both Bagging (Random Forest) and Boosting use them as base learners.

Logistic regression or SVM can also be tried as base learners in Bagging if the classes are suspected to be linearly separable.

Step 4: Evaluate Performance Using Cross-Validation
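Use stratified k-fold cross-validation so that every fold keeps the same default rate, and report:
- AUC-ROC
- F1-score
- Accuracy (can be misleading, because defaults are usually the minority class)
- Precision and recall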
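
A sketch of how these metrics could be collected in one cross-validation run. The dataset is synthetic (roughly 10% positives standing in for defaults), and `class_weight="balanced"`, 100 trees and 5 folds are illustrative assumptions rather than recommendations from the answer above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the loan data.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)

# Stratification keeps the share of positives the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["roc_auc", "f1", "accuracy", "precision", "recall"]

results = cross_validate(model, X, y, cv=cv, scoring=scoring)
for name in scoring:
    scores = results[f"test_{name}"]
    print(f"{name:<10} mean={scores.mean():.3f}  std={scores.std():.3f}")
```

Reporting the spread across folds alongside the mean shows how stable each metric is, which matters when two candidate models score close together.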

Step 5: Justify Ensemble Learning in Real-World Context
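- Improved prediction accuracy: combining models usually leads to fewer wrong approve/reject decisions than a single model.
- Better risk assessment: predicted default probabilities can be used to rank applicants and set approval thresholds.
- Stability and interpretability: predictions are less sensitive to noise in the training data, and feature importance scores help explain which factors drive default risk.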