# **Ensemble Techniques Assignment**


1. **What is Ensemble Learning in machine learning? Explain the key idea
behind it.**
- Ensemble learning is a machine learning technique that combines predictions from multiple individual models, often called "base learners," to create a single, more robust, and accurate predictive model. The goal is to improve overall performance and reduce common issues like overfitting and bias, which are often found in single-model approaches.

  **The key idea: The wisdom of crowds**

  The central concept behind ensemble learning is based on the "wisdom of crowds" principle, which states that the combined judgment of a diverse group is often more accurate than that of any single expert. In machine learning, this translates to:
  - By combining predictions from multiple models, you can mitigate the weaknesses of individual models and amplify their strengths.
  - The diverse perspectives of different models, trained on different subsets of data or using different algorithms, mean their errors are less likely to be correlated.
  - When a prediction is aggregated through voting or averaging, the uncorrelated errors tend to cancel each other out, leading to a lower overall error and a more stable, reliable prediction.
  
  For example, a single decision tree model might be prone to high variance and overfitting. An ensemble method like a Random Forest, which combines hundreds of decision trees trained on different data subsets, averages out the individual variances and produces a more accurate and stable result.
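The error-cancelling effect can be made concrete with a small calculation. The sketch below is illustrative only: it assumes binary classification, that every model is correct with the same probability, and that the models' errors are fully independent, which real ensembles only approximate. The function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    needed = n_models // 2 + 1  # smallest number of votes that forms a majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

# Each model is only 60% accurate, yet the vote improves as models are added.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

The gain shrinks as the models' errors become more correlated, which is why ensemble methods work to keep their base learners diverse.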


2. **What is the difference between Bagging and Boosting?**
- Bagging and Boosting are two families of ensemble methods. Both combine several models into a single predictor that is usually more stable than any one of them, but they target different sources of error: bagging mainly reduces variance, while boosting mainly reduces bias.

  **Bagging (Bootstrap aggregating)**

- Objective: Primarily aims to reduce variance in the model, helping to prevent overfitting.
- Model Training: Trains multiple base models (often the same type, like decision trees) in parallel on different subsets of the data.
- Data Sampling: Creates these subsets using bootstrap sampling – random sampling with replacement from the original dataset.
- Model Weighting: Each base model in the ensemble is given equal weight in the final prediction.
- Error Handling: Doesn't directly address errors made by previous models. Each model learns independently.
- Sensitivity to Noise: Less sensitive to outliers and noisy data due to the averaging or voting mechanism across multiple models.
- Computational Efficiency: Can be more computationally efficient due to parallel processing of model training.
- Examples: Random Forest is a prominent example of a bagging algorithm.

  **Boosting**

- Objective: Primarily aims to reduce bias in the model, often leading to higher accuracy.
- Model Training: Trains multiple weak models sequentially. Each new model focuses on correcting the errors made by the previous ones.
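- Data Sampling: Reweights the training data rather than resampling it uniformly. In AdaBoost, points misclassified by earlier models receive higher weight in later rounds, while gradient boosting fits each new model to the remaining errors (residuals) of the current ensemble.
- Model Weighting: Models contribute unequally to the final prediction. AdaBoost weights each model by its accuracy, so better-performing models have a stronger influence, and gradient boosting scales each model's contribution by a learning rate.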
- Error Handling: Actively corrects errors made by previous models in a sequential manner.
- Sensitivity to Noise: Can be more sensitive to outliers and noisy data, as misclassified points are given more weight, potentially leading to overfitting if not properly managed.
- Computational Efficiency: Generally more computationally intensive due to the sequential training process, making parallelization more challenging.
- Examples: Popular boosting algorithms include AdaBoost, Gradient Boosting, XGBoost, and LightGBM.
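To make the reweighting idea concrete, here is a minimal sketch of one round of the classic discrete AdaBoost update for binary classification. It is an illustration, not how any particular library implements boosting (gradient boosting libraries such as XGBoost fit gradients instead), and the function name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(weights, is_misclassified):
    """One AdaBoost-style reweighting step.

    weights: current sample weights (summing to 1).
    is_misclassified: booleans, True where the current weak learner was wrong.
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    """
    error = sum(w for w, wrong in zip(weights, is_misclassified) if wrong)
    if not 0 < error < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase the weight of mistakes, decrease the weight of correct points.
    raw = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, is_misclassified)
    ]
    total = sum(raw)
    return alpha, [w / total for w in raw]

weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, [False, False, True, False, False])
print(alpha, new_weights)
```

After the update, the single misclassified point carries half of the total weight, so the next weak learner is pushed to get it right. This is also why noisy or mislabeled points can dominate later rounds.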

3. **What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?**
- Bootstrap sampling is a resampling technique used in statistics and machine learning to create multiple datasets by randomly sampling with replacement from the original dataset.
- In Bagging methods like Random Forest, this creates diverse subsets of data, which are then used to train multiple independent models. By aggregating the predictions from these varied models, Bagging reduces variance and the risk of overfitting, leading to more stable and accurate results.

  **What is bootstrap sampling?**

  Bootstrap sampling, or bootstrapping, is a method of generating new, synthetic datasets from a single, original dataset.
- The Process: To create a new bootstrap sample, you randomly select data points from the original dataset one by one. The key is that after each selection, the data point is "replaced," meaning it is put back into the pool of available data and can be selected again.
- The Result: A typical bootstrap sample is the same size as the original dataset. Because of the "sampling with replacement" technique, a single bootstrap sample will likely have some duplicate data points and omit some other data points that were in the original set.
- The Purpose: By repeatedly creating these varied datasets, you can simulate multiple draws from the true population distribution, allowing you to estimate the variability of a statistic or model performance without needing extra data.
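The process above can be sketched in a few lines of standard-library Python. The helper name `bootstrap_sample` and the toy dataset of 1,000 integers are assumptions made for illustration.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = list(range(1000))
sample = bootstrap_sample(data, rng)

# Some rows appear several times, others not at all.
unique_fraction = len(set(sample)) / len(data)
print(f"sample size: {len(sample)}, unique fraction: {unique_fraction:.3f}")
```

The sample has the same size as the original dataset, but only a fraction of the original rows appear in it; the rest are left out of that particular sample.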

  **Role of bootstrap sampling in Bagging methods**

  Bagging, short for Bootstrap aggregating, is an ensemble learning method that uses bootstrap sampling to enhance model performance and stability.
  
  Here is the step-by-step process of how bootstrap sampling works within a Bagging method:
- Generate diverse training sets: The core of Bagging is to train multiple models, and bootstrap sampling is the mechanism for generating the necessary diverse training data. A random subset of the original training data is selected with replacement to create each bootstrap sample.
- Train base models: A base learning algorithm, such as a decision tree, is then trained independently on each of these newly created bootstrap samples. Because each sample is different, each resulting model will also be slightly different.
- Parallel training: A key benefit of this process is that the training of the base models can be done in parallel, which is computationally efficient.
- Aggregate predictions: Once all base models are trained, they are used to make predictions. These predictions are then aggregated to produce a single final output. For classification problems, this is typically done by majority voting. For regression problems, the predictions are averaged.
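The steps above can be sketched without any library. This toy example bags a deliberately simple base learner, a one-dimensional threshold "stump", rather than a full decision tree; the function names and the toy data are hypothetical and chosen only to keep the mechanics visible.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D threshold classifier: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(points, n_models, rng):
    """Train one stump per bootstrap sample of the training points."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in range(len(points))]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
points = [(x, int(x > 5)) for x in range(11)]  # toy data: label is 1 when x > 5
models = bagging_fit(points, n_models=25, rng=rng)
print(bagging_predict(models, 2), bagging_predict(models, 9))
```

Because each stump is fitted independently of the others, the loop in `bagging_fit` could be run in parallel, which is the property that makes bagging easy to scale.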

4. **What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?**
- Out-of-Bag (OOB) samples are the data points that are not included in the bootstrap sample used to train a particular model in an ensemble. The OOB score uses these held-out samples to estimate the ensemble's performance at almost no extra cost, without needing a separate validation set.

  **What are Out-of-Bag (OOB) samples?**
  
  To create an ensemble model using bagging, the following steps are taken:
  - Bootstrapping: From the original training data, multiple bootstrap samples are created by randomly selecting data points with replacement.
  - Model training: Each of these bootstrap samples is used to train a separate base model (e.g., a decision tree in a Random Forest).
  - Out-of-Bag (OOB) data: Because sampling is done with replacement, each bootstrap sample contains only about 63% of the distinct original data points (some of them more than once), leaving the remaining 37% or so as "out-of-bag" (OOB) samples for that model.
  - No data leakage: The crucial aspect is that for any given base model, its OOB samples are data points it has never seen during training, ensuring a fresh, unbiased validation.
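The 63% and 37% figures follow from a short calculation, sketched here under the assumption that each of the $n$ draws picks every row with equal probability $1/n$. A given row is out-of-bag only if all $n$ draws miss it:

$$
P(\text{row } i \text{ is out-of-bag}) = \left(1 - \frac{1}{n}\right)^{n} \;\xrightarrow{\,n \to \infty\,}\; e^{-1} \approx 0.368
$$

So, for reasonably large datasets, roughly 36.8% of rows are out-of-bag for a given model and about 63.2% appear at least once in its bootstrap sample.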

  **How the OOB score is used for evaluation**
  
  The OOB score evaluates the ensemble's performance by making predictions on the OOB samples and aggregating the results. This acts as a form of internal, built-in cross-validation.

- OOB prediction per data point: For each data point in the original dataset, collect predictions only from the trees for which that data point was an OOB sample.
- Aggregate predictions: Combine the OOB predictions for each data point through an aggregation method:
  - Classification: A majority vote is typically used to determine the final predicted class.
  - Regression: The predictions are averaged to get the final result.
- Compute the score: The final OOB score is calculated by comparing the aggregated OOB predictions against the true labels for all data points.
  - For classification, the OOB score is the accuracy: The proportion of correctly predicted OOB samples.
  - For regression, a regression metric such as the mean squared error (MSE) or R² is used. In scikit-learn, the `oob_score_` attribute of a regressor reports R² by default.
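The three steps above can be sketched for the classification case as follows. This is a simplified illustration, not scikit-learn's implementation: the function name `oob_accuracy`, the data layout, and the toy ensemble of three trees are all assumptions made for the example.

```python
from collections import Counter

def oob_accuracy(in_bag_indices, tree_predictions, y_true):
    """Accuracy of majority-vote OOB predictions.

    in_bag_indices[t]: set of row indices tree t was trained on.
    tree_predictions[t][i]: class predicted by tree t for row i.
    Rows that were in-bag for every tree get no OOB prediction and are skipped.
    """
    correct = scored = 0
    for i, label in enumerate(y_true):
        # Only trees that never saw row i are allowed to vote on it.
        votes = Counter(
            preds[i]
            for in_bag, preds in zip(in_bag_indices, tree_predictions)
            if i not in in_bag
        )
        if not votes:
            continue
        scored += 1
        correct += votes.most_common(1)[0][0] == label
    return correct / scored if scored else float("nan")

# Hypothetical toy ensemble: 3 trees, 4 rows.
in_bag = [{0, 1}, {1, 2}, {0, 3}]
preds = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
y = [0, 0, 1, 1]
score = oob_accuracy(in_bag, preds, y)
print(score)
```

Row 1 is out-of-bag only for the third tree, which predicts it incorrectly, so three of the four rows are scored as correct.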

5. **Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.**
- Both Decision Trees and Random Forests offer methods for analyzing feature importance, allowing us to understand which features contribute most significantly to the model's predictions. However, their approaches and the reliability of their feature importance scores differ.

  **Decision Trees**
- Mechanism: A single Decision Tree determines feature importance based on how much each feature reduces impurity (e.g., Gini impurity or entropy) when splitting data into branches. Features that consistently lead to greater impurity reduction are assigned higher importance scores.
- Calculation: The importance score for each feature is the total reduction of the chosen criterion (like Gini impurity) brought about by that feature across the entire tree.
- Strengths: Simple and easy to understand, directly reflecting the split decisions within the tree.
- Weakness: A single Decision Tree is inherently unstable. Due to its greedy nature, it might find a good split with a feature by chance, making the importance scores susceptible to variations in the data.
- Limitations: The impurity-based feature importance can be biased towards features with high cardinality (many unique values).
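As a rough illustration of the impurity-based calculation, the sketch below computes the weighted Gini decrease contributed by a single split. A feature's importance in a tree is then the sum of these contributions over every split that uses the feature. The function names and the toy labels are assumptions for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right, n_total):
    """Weighted Gini decrease contributed by one split.

    n_total is the number of training samples at the root, so splits near
    the top of the tree (which touch more samples) count for more.
    """
    n = len(parent)
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - weighted_children)

parent = [0, 0, 0, 1, 1, 1]
perfect = impurity_decrease(parent, [0, 0, 0], [1, 1, 1], n_total=6)
weak = impurity_decrease(parent, [0, 0, 1], [0, 1, 1], n_total=6)
print(perfect, weak)
```

A split that separates the classes cleanly earns a much larger decrease than one that leaves both children mixed, which is what pushes informative features to the top of the ranking.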
  
  **Random Forests**

- Mechanism: Random Forests, as an ensemble of Decision Trees, calculate feature importance by averaging the importance scores of each feature across all the trees in the forest. Each tree is trained on a different bootstrap sample of the data and considers only a random subset of features at each split, which introduces diversity and reduces the risk of individual trees overfitting.
- Calculation: The total importance of a feature is the sum of improvements from all splits across all trees where that feature was used, then normalized so that all scores sum to 1.
- Strength: Averaging feature importance scores across multiple diverse trees results in a more stable and trustworthy estimate of a feature's true predictive power within the model. This ensemble approach helps overcome the instability of feature importance in individual Decision Trees.
- Weaknesses: Less interpretable compared to a single Decision Tree due to its ensemble nature. It doesn't offer the clear, visual pathway of a single Decision Tree. While more robust, Random Forests can still be biased towards features with high cardinality (many unique values).
- Caveats: Random forests might split importance between highly correlated and predictive features, making both appear less important than they truly are.
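The averaging step can be sketched as follows. The per-tree importance vectors here are made-up numbers for three hypothetical trees and three features, and `forest_importances` is an illustrative helper rather than a library function.

```python
def forest_importances(per_tree_importances):
    """Average per-tree importance vectors, then normalize to sum to 1."""
    n_trees = len(per_tree_importances)
    n_features = len(per_tree_importances[0])
    mean = [
        sum(tree[j] for tree in per_tree_importances) / n_trees
        for j in range(n_features)
    ]
    total = sum(mean)
    return [m / total for m in mean] if total else mean

# Individual trees disagree noticeably about features 1 and 2.
per_tree = [
    [0.7, 0.3, 0.0],
    [0.5, 0.1, 0.4],
    [0.6, 0.2, 0.2],
]
averaged = forest_importances(per_tree)
print(averaged)
```

Any single tree gives a different ranking for the second and third features, while the averaged scores smooth out those tree-to-tree swings.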

6. **Write a Python program to:**
  - **Load the Breast Cancer dataset using**
  `sklearn.datasets.load_breast_cancer()`
  - **Train a Random Forest Classifier**
  - **Print the top 5 most important features based on feature importance scores.**

```python
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()

# Convert the dataset into a DataFrame for easier handling
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split the data into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf.fit(X_train, y_train)

# Get the feature importance scores
feature_importances = rf.feature_importances_

# Create a DataFrame to display feature importance
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
})

# Sort the features by importance and get the top 5
top_5_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)

# Print the top 5 important features
print("Top 5 most important features:")
print(top_5_features)
```

```text
Top 5 most important features:
                 Feature  Importance
7    mean concave points    0.141934
27  worst concave points    0.127136
23            worst area    0.118217
6         mean concavity    0.080557
20          worst radius    0.077975
```


7. **Write a Python program to:**
  - **Train a Bagging Classifier using Decision Trees on the Iris dataset**
  - **Evaluate its accuracy and compare with a single Decision Tree**

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize a Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Initialize a Bagging Classifier with Decision Tree as base estimator
bagging_classifier = BaggingClassifier(estimator=dt_classifier, n_estimators=50, random_state=42)

# Train the Decision Tree classifier
dt_classifier.fit(X_train, y_train)

# Train the Bagging Classifier
bagging_classifier.fit(X_train, y_train)

# Predict using both classifiers
dt_predictions = dt_classifier.predict(X_test)
bagging_predictions = bagging_classifier.predict(X_test)

# Calculate the accuracy for both models
dt_accuracy = accuracy_score(y_test, dt_predictions)
bagging_accuracy = accuracy_score(y_test, bagging_predictions)

# Print the accuracy comparison
print(f"Accuracy of Decision Tree: {dt_accuracy:.4f}")
print(f"Accuracy of Bagging Classifier: {bagging_accuracy:.4f}")
```

```text
Accuracy of Decision Tree: 1.0000
Accuracy of Bagging Classifier: 1.0000
```


8. **Write a Python program to:**
  - **Train a Random Forest Classifier**
  - **Tune hyperparameters `max_depth` and `n_estimators` using GridSearchCV**
  - **Print the best parameters and final accuracy**

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)

# Define the hyperparameters to tune using GridSearchCV
param_grid = {
    'max_depth': [None, 10, 20, 30, 40],
    'n_estimators': [50, 100, 200]
}

# Set up the GridSearchCV (5-fold cross-validation over all 15 combinations)
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)

# Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters from GridSearchCV
best_params = grid_search.best_params_

# Retrieve the model that GridSearchCV already refit on the full training set
# using the best parameters (no extra training step is needed)
best_rf_classifier = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_rf_classifier.predict(X_test)

# Calculate accuracy
final_accuracy = accuracy_score(y_test, y_pred)

# Print the best parameters and final accuracy
print(f"Best Hyperparameters: {best_params}")
print(f"Final Accuracy: {final_accuracy:.4f}")
```

```text
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Best Hyperparameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 1.0000
```


9. **Write a Python program to:**
  - **Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset**
  - **Compare their Mean Squared Errors (MSE)**

```python
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Bagging Regressor
bagging_regressor = BaggingRegressor(n_estimators=50, random_state=42)

# Initialize the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the Bagging Regressor
bagging_regressor.fit(X_train, y_train)

# Train the Random Forest Regressor
rf_regressor.fit(X_train, y_train)

# Make predictions using both regressors
y_pred_bagging = bagging_regressor.predict(X_test)
y_pred_rf = rf_regressor.predict(X_test)

# Calculate Mean Squared Error for both models
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print the MSE comparison
print(f"Mean Squared Error of Bagging Regressor: {mse_bagging:.4f}")
print(f"Mean Squared Error of Random Forest Regressor: {mse_rf:.4f}")
```

```text
Mean Squared Error of Bagging Regressor: 0.2579
Mean Squared Error of Random Forest Regressor: 0.2565
```


10. **You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.**
  
    **You decide to use ensemble techniques to increase model performance.**

    **Explain your step-by-step approach to:**

    ● **Choose between Bagging or Boosting**

    ● **Handle overfitting**

    ● **Select base models**

    ● **Evaluate performance using cross-validation**

    ● **Justify how ensemble learning improves decision-making in this real-world context.**

- In predicting loan default at a financial institution, a data scientist must leverage ensemble techniques to boost model performance and ensure accurate, reliable decision-making. A structured approach is required for choosing the right ensemble method, handling overfitting, selecting base models, and evaluating the final model.

  **Choose between bagging or boosting**

  The choice between bagging and boosting depends on the initial performance of your base models and the primary source of error (bias or variance).
- Initial model assessment: First, train a simple, interpretable model like a decision tree. Analyze its performance on a held-out validation set.
  - High variance (overfitting): If the single decision tree has high accuracy on the training data but performs poorly on the validation set, its issue is high variance. Bagging is the better choice for this scenario, as it focuses on reducing variance by averaging the predictions of multiple models trained on different subsets of the data.
  - High bias (underfitting): If the single decision tree performs poorly on both the training and validation sets, its issue is high bias. Boosting is the better choice, as it trains models sequentially, with each new model correcting the errors (bias) of its predecessors.
- Loan default prediction context: For loan default prediction, the risk of a simple model underfitting is high due to the complex, non-linear relationships in customer data. Thus, boosting techniques like XGBoost, LightGBM, and CatBoost are generally preferred due to their power in reducing bias. However, if initial models are shown to be unstable (high variance), bagging with a Random Forest can be a more stable alternative.
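The diagnosis described above can be written down as a simple rule of thumb. This is only a sketch: the function name `suggest_ensemble` and both thresholds (`good_enough`, `max_gap`) are illustrative assumptions, and sensible values depend on the metric and the business context.

```python
def suggest_ensemble(train_score, val_score, good_enough=0.80, max_gap=0.05):
    """Rule-of-thumb diagnosis from a single baseline model's scores.

    Both thresholds are illustrative assumptions, not established cut-offs.
    """
    gap = train_score - val_score
    if gap > max_gap:
        return "bagging"   # large train/validation gap: variance dominates
    if train_score < good_enough:
        return "boosting"  # weak even on training data: bias dominates
    return "either"        # baseline already fits and generalizes reasonably

print(suggest_ensemble(0.99, 0.80))  # overfitting baseline
print(suggest_ensemble(0.70, 0.69))  # underfitting baseline
```

In practice both families would usually be tried and compared with cross-validation; the rule only indicates where to start.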

  **Handle overfitting**
  
  Ensemble methods, especially boosting, can still overfit if not tuned correctly. Here is how to handle it:
- Cross-validation (CV): The primary defense against overfitting is cross-validation, which ensures your model is not overly tailored to a single training set.
- Hyperparameter tuning: Carefully tune hyperparameters related to model complexity.
  - For boosting: Use a smaller `learning_rate` together with a larger `n_estimators`. You can also constrain the size of individual trees with `max_depth` and, in XGBoost, require a minimum loss reduction for each split with `gamma`.
  - For bagging (Random Forest): Limit the `max_depth` of individual trees and use `max_features` to restrict the number of features each tree can consider for splitting, which increases diversity.
- Early stopping: For boosting algorithms, use early stopping to prevent the model from adding more base learners once the validation error ceases to improve.
- Regularization: Boosting libraries such as XGBoost and LightGBM include built-in L1 and L2 regularization terms that penalize model complexity.
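Early stopping in particular is easy to picture in code. The sketch below is library-independent: it takes a list of per-round validation losses (made-up numbers here) and returns the round to keep. The function name `early_stopping_round` and the `patience` default are assumptions for the example; boosting libraries expose their own options for this.

```python
def early_stopping_round(val_losses, patience=3):
    """Return the 0-based index of the best round, stopping once the
    validation loss has failed to improve for `patience` consecutive rounds."""
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds: stop adding learners
    return best_round

# Hypothetical validation losses after each boosting round.
losses = [0.60, 0.52, 0.47, 0.45, 0.46, 0.46, 0.47, 0.44]
print(early_stopping_round(losses, patience=3))
```

With `patience=3`, training stops before the final round is ever evaluated, which shows the trade-off: a small patience saves computation and limits overfitting but can miss a late improvement.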

  **Select base models**

  The base models, or "weak learners," should be chosen for their ability to be diverse and computationally efficient.
- Decision trees: For both bagging and boosting, decision trees are a popular and effective choice.
  - Bagging: For a Random Forest, you would use deep, complex decision trees (high variance, low bias) as base learners.
  - Boosting: Boosting algorithms typically use simple, shallow decision trees (low variance, high bias) that learn sequentially.
- Diverse models for stacking: If employing a stacking ensemble, you would choose diverse base learners like logistic regression, Random Forest, and a Gradient Boosting machine. A meta-model is then trained to learn the optimal way to combine their predictions.

  **Evaluate performance using cross-validation**
  
  Instead of a single train-test split, cross-validation provides a more reliable estimate of model performance on unseen data.
- K-fold CV: Split the training data into k equal parts (e.g., k=5 or k=10). Train on k-1 parts, validate on the remaining part, and rotate so that every part serves as the validation set exactly once.
- Stratified K-fold CV: For loan default prediction, the number of defaults is often much lower than non-defaults, creating a class imbalance. Stratified K-fold CV is essential here, as it ensures each fold has the same proportion of default cases as the original dataset, providing more stable and accurate performance metrics.
- Evaluation metrics: Choose metrics relevant to the business problem, not just accuracy.
  - Area Under the ROC Curve (AUC): Measures the model's ability to distinguish between default and non-default cases across all possible classification thresholds. It is a reliable metric for imbalanced datasets.
  - Recall: Measures the percentage of actual defaults that were correctly identified. In the financial sector, minimizing missed defaults is often critical.
  - Precision: Measures the percentage of predicted defaults that were actually defaults. While high recall is important, maintaining decent precision prevents a high false positive rate, which could harm profitability by incorrectly rejecting good loan applicants.
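The stratification and the two threshold-based metrics above can be sketched in plain Python. Everything here is illustrative: the helper names, the 10% default rate, and the toy predictions are assumptions, and in practice scikit-learn's `StratifiedKFold` and its metric functions would normally be used instead.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each row index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal rows out round-robin
    return folds

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1 = default)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [1] * 10 + [0] * 90  # hypothetical portfolio with a 10% default rate
folds = stratified_folds(labels, k=5)
defaults_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(defaults_per_fold)

# Toy predictions: one missed default and one good applicant flagged.
precision, recall = precision_recall([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(precision, recall)
```

Each fold ends up with the same number of defaults, so no fold is scored on a validation set that happens to contain almost no positive cases.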

  **Justify how ensemble learning improves decision-making**

  Ensemble learning provides several advantages that directly benefit decision-making at a financial institution:

- Improved accuracy: By combining multiple models, an ensemble can correct for individual model errors, leading to a more accurate prediction of which applicants are likely to default. This directly reduces credit risk and minimizes financial losses.
- Increased robustness and stability: An ensemble model is less sensitive to noise and outliers in the data than a single, individual model. This leads to more reliable and stable predictions that are less susceptible to the quirks of a specific training dataset.
- Enhanced generalization: The combination of different models helps the ensemble to better capture the underlying, complex patterns in the data, leading to superior performance on unseen data. This ensures the model remains reliable even as market conditions and customer behavior evolve.
- Model interpretability (where applicable): While some ensembles are "black boxes," others, like Random Forest, can provide feature importance scores. This helps risk managers understand which factors—such as credit history length or debt-to-income ratio—are most influential in a default prediction, aiding in regulatory compliance and business insight.