## Ensemble Learning
## ASSIGNMENT

Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

Answer 1 : **Ensemble Learning** is a machine learning technique where multiple individual models (often called "base learners" or "weak learners") are trained to solve the same problem and then combined to produce a single, more accurate, and robust prediction.

**The key ideas behind why this works include:**

* Error Correction: Different models make different types of mistakes. When those mistakes are not strongly correlated, combining the models lets the errors of one be outvoted or averaged away by the correct predictions of the others.

* Improving Robustness: A single model might be overly sensitive to noise or outliers in the data. An ensemble is more stable because it relies on a consensus rather than a single viewpoint.

* Diverse Perspectives: You can use different algorithms (e.g., combining a Decision Tree with a Support Vector Machine) or train the same algorithm on different subsets of data. This diversity ensures the ensemble captures a wider range of patterns.
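
The error-correction idea can be illustrated with a small simulation. This is a minimal sketch under an idealized assumption: every model is correct independently with the same probability, which real models only approximate. The function name `majority_vote_accuracy` and the numbers used are hypothetical.

```python
import random


def majority_vote_accuracy(n_models, p_correct, n_trials=20_000, seed=0):
    """Estimate the accuracy of a majority vote over independent binary classifiers."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Assumption: each model is correct independently with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials


single = majority_vote_accuracy(n_models=1, p_correct=0.7)
ensemble = majority_vote_accuracy(n_models=11, p_correct=0.7)
print(f"single model: {single:.3f}, 11-model majority vote: {ensemble:.3f}")
```

The vote should come out noticeably more accurate than any single model. The gain shrinks as the models' errors become more correlated, which is why diversity matters.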

Question 2: What is the difference between Bagging and Boosting?

Answer 2 : **1. How they handle the data**

* Bagging: Imagine you have a large deck of cards. You deal out 10 different hands (subsets), allowing for some cards to repeat. You give each hand to a different person to analyze. No one’s work depends on anyone else’s.

* Boosting: You give the whole deck to one person. After they finish, you look at the cards they struggled to identify. You then give a new hand to the next person, but this hand is heavily weighted with the cards the first person got wrong.
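
The reweighting described in the Boosting bullet can be sketched with the classic AdaBoost update. This is an illustration under assumptions: the helper name `adaboost_reweight` and the toy data are hypothetical, and gradient-boosting libraries fit each new model to residual errors rather than reweighting rows.

```python
import math


def adaboost_reweight(weights, is_wrong):
    """Apply one AdaBoost-style weight update and return (new_weights, alpha).

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current learner.
    eps = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    # Vote strength of the current learner: larger when the error is smaller.
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Increase weights of misclassified rows, decrease weights of correct rows.
    raw = [w * math.exp(alpha if wrong else -alpha) for w, wrong in zip(weights, is_wrong)]
    total = sum(raw)
    return [r / total for r in raw], alpha


weights = [0.1] * 10                    # ten rows, equal weight to start
is_wrong = [False] * 8 + [True] * 2     # the first learner misses the last two rows
new_weights, alpha = adaboost_reweight(weights, is_wrong)
print([round(w, 4) for w in new_weights])
```

After the update, the two misclassified rows together carry half of the total weight, so the next learner cannot do well without handling them.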

**2. Bias vs. Variance**

* Bagging is your go-to when you have a model that is too complex and "overfits" the training data (high variance). By averaging many such models, the random fluctuations cancel out, creating a smoother, more stable result.

* Boosting is best when your model is too simple and "underfits" (high bias). It keeps adding complexity step-by-step until the model can capture the underlying patterns of the data accurately.
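
The variance argument for Bagging can be made concrete with a standard textbook identity. It is stated here under the assumption that the $B$ base models $f_1, \dots, f_B$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as more models are averaged, while the first does not. Averaging therefore helps most when the base models are only weakly correlated.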

**3. Execution Speed**

Because Bagging models are independent, you can train them all at once if you have multiple processors (parallelization). Boosting is sequential: each model depends on the errors of the one before it, so the rounds cannot be trained in parallel, which usually makes training slower for the same number of models.

Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

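Answer 3 : Bootstrap sampling means building a new training set, usually the same size as the original, by drawing rows at random with replacement. "Bagging" is a portmanteau of Bootstrap Aggregating, and bootstrap sampling is the "engine" that powers this method. Its primary roles are: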

* Creating Diversity: Since you sample with replacement, each bootstrap sample is slightly different. Some data points will appear multiple times, while others won't appear at all. This ensures that each base model (like a Decision Tree) sees a unique version of the truth.

* Reducing Correlation: In a Random Forest, if every tree was trained on the exact same data, they would all make the same predictions. Bootstrapping forces the trees to be different from one another, which is critical for the "ensemble" effect to work.

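* The "63% Rule": When the bootstrap sample has the same size $n$ as the original dataset, each row is left out with probability $(1 - 1/n)^n \approx 1/e$. A bootstrap sample therefore contains about 63.2% of the unique original data points on average, and the remaining 36.8% are left out.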
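
A quick numerical check of the 63.2% figure, as a minimal sketch using only the standard library (the helper name `unique_fraction` and the dataset size are hypothetical):

```python
import math
import random


def unique_fraction(n, seed=0):
    """Draw one bootstrap sample of size n and return the fraction of distinct rows in it."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
    return len(set(sample)) / n


n = 10_000
expected = 1 - (1 - 1 / n) ** n   # expected fraction of distinct rows for this n
limit = 1 - 1 / math.e            # large-n limit
observed = unique_fraction(n)

print(f"expected: {expected:.4f}, limit: {limit:.4f}, observed: {observed:.4f}")
```

All three values should agree to within about a percentage point for a dataset of this size.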

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

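Answer 4 : Out-of-Bag (OOB) samples are the rows that a given tree never saw during training because its bootstrap sample left them out (roughly 36.8% of the rows for each tree). The OOB score uses them as a built-in validation set: it estimates the model's performance on "unseen" data without needing a separate test set or performing expensive Cross-Validation. Here is the step-by-step logic: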

1) Individual Predictions: For every row in your original dataset, the model identifies which specific trees did not use that row during their training.

2) Consensus: Only those "ignorant" trees are allowed to vote on or predict the outcome for that row.

3) Aggregated Error: This process is repeated for every single row in the dataset. The model then compares these "OOB predictions" to the actual true values.

4) The Score: The final OOB score is the accuracy (for classification) or $R^2$ (for regression) of these aggregated predictions.
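
In scikit-learn the steps above are performed internally when `oob_score=True` is passed. A minimal sketch, assuming the Breast Cancer dataset and 200 trees purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not contain that row (requires bootstrap=True,
# which is the default).
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

oob_accuracy = rf.oob_score_  # accuracy of the aggregated OOB predictions
print(f"OOB accuracy: {oob_accuracy:.4f}")
```

Note that no train/test split is used here: the whole dataset is passed to `fit`, and the OOB rows play the role of the held-out data.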

Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

Answer 5 : 1. The Stability Factor

* In a single Decision Tree, the importance of a feature is highly dependent on the root node and early splits. If a specific feature is chosen for the first split, it "steals" the importance from other features that might have been just as good. If you remove even a few rows of data, the tree might choose a different root, causing the feature importance list to look completely different.

* In a Random Forest, we use Bootstrap Sampling and Feature Randomness. This forces different trees to start with different features. By the time you average 500 trees, you get a much more "democratic" and stable view of which features actually matter across the entire dataset.

2. Handling Correlated Features

* Decision Tree: If you have two features, "Temperature in Celsius" and "Temperature in Fahrenheit," a single tree will pick whichever one it happens to evaluate as the best first split. The other feature may then appear to have little or no importance, even though it carries exactly the same information.

* Random Forest: Because each split only considers a random subset of features, some trees will be forced to use Celsius and others to use Fahrenheit. When you average them, the importance is shared between the two features, so both show up as relevant instead of one masking the other.

3. Common Bias: High Cardinality

Both models share a common bias when importance is computed from impurity reduction (the default `feature_importances_` in scikit-learn): they tend to favor high-cardinality features (features with many unique values, like IDs or zip codes). These features provide more potential "split points," making it easier for the algorithm to reduce impurity by chance, which can artificially inflate their importance score.
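
The stability difference can be checked directly by counting how many features each model ignores completely. A minimal sketch, assuming the Breast Cancer dataset and default hyperparameters; the variable names are hypothetical:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Features that a model never split on receive an importance of exactly 0.
tree_zero = int(np.sum(tree.feature_importances_ == 0))
forest_zero = int(np.sum(forest.feature_importances_ == 0))

print(f"Features with zero importance in the single tree: {tree_zero} of {X.shape[1]}")
print(f"Features with zero importance in the forest:      {forest_zero} of {X.shape[1]}")
```

A single tree typically leaves many of the 30 features unused, while the forest spreads importance across almost all of them. Rerunning with a different `random_state` is a simple way to see how much the single tree's ranking moves.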

Question 6: Write a Python program to:

● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()

● Train a Random Forest Classifier

● Print the top 5 most important features based on feature importance scores.



In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 2. Train a Random Forest Classifier
# We set random_state for reproducibility
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# 3. Get feature importance scores
importances = rf_model.feature_importances_

# Create a Series to map names to their importance scores
feature_series = pd.Series(importances, index=data.feature_names)

# Sort and print the top 5
top_5_features = feature_series.sort_values(ascending=False).head(5)

print("Top 5 Most Important Features in Breast Cancer Dataset:")
print("-" * 55)
print(top_5_features)

Top 5 Most Important Features in Breast Cancer Dataset:
-------------------------------------------------------
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


Question 7: Write a Python program to:

● Train a Bagging Classifier using Decision Trees on the Iris dataset

● Evaluate its accuracy and compare with a single Decision Tree

Answer 7 : To compare a single Decision Tree against a Bagging ensemble, we use the BaggingClassifier from Scikit-Learn. In this example, the Bagging ensemble will use 100 separate Decision Trees, each trained on a different bootstrap sample of the Iris data.

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a Single Decision Tree
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)
tree_pred = tree_model.predict(X_test)
tree_acc = accuracy_score(y_test, tree_pred)

# 4. Train a Bagging Classifier using Decision Trees
# We use 100 trees (n_estimators=100)
bagging_model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                  n_estimators=100,
                                  random_state=42)
bagging_model.fit(X_train, y_train)
bagging_pred = bagging_model.predict(X_test)
bagging_acc = accuracy_score(y_test, bagging_pred)

# 5. Compare Results
print(f"Single Decision Tree Accuracy: {tree_acc:.4f}")
print(f"Bagging Classifier Accuracy:   {bagging_acc:.4f}")

Single Decision Tree Accuracy: 1.0000
Bagging Classifier Accuracy:   1.0000


Question 8: Write a Python program to:

● Train a Random Forest Classifier

● Tune hyperparameters max_depth and n_estimators using GridSearchCV

● Print the best parameters and final accuracy

Answer 8 : To tune a Random Forest, we use GridSearchCV, which performs an "exhaustive search." It creates a grid of all possible combinations of the hyperparameters you provide, trains a model for each, and uses Cross-Validation to find the winner.

In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load data and split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# 2. Define the parameter grid
# We will test 3 values for n_estimators and 3 for max_depth (Total 9 combinations)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20]
}

# 3. Initialize GridSearchCV
# cv=5 means 5-fold cross-validation
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)

# 4. Run the search
grid_search.fit(X_train, y_train)

# 5. Extract best parameters and evaluate
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

print(f"Best Parameters found: {best_params}")
print(f"Final Accuracy on Test Set: {final_accuracy:.4f}")

Best Parameters found: {'max_depth': None, 'n_estimators': 200}
Final Accuracy on Test Set: 0.9649
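
To make the cost of the exhaustive search explicit, the grid can be enumerated by hand. This is a minimal sketch using only the standard library. It mirrors the `param_grid` and `cv=5` used above and assumes the `GridSearchCV` default `refit=True`, which adds one final fit on the full training set.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
}
cv_folds = 5

# Cartesian product of all hyperparameter values: one dict per candidate model.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per fold, plus one refit of the winner.
total_fits = len(combos) * cv_folds + 1

print(f"{len(combos)} candidates x {cv_folds} folds + 1 refit = {total_fits} fits")
for combo in combos[:3]:
    print(combo)
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, which is the main reason to keep grids small or to switch to a randomized search.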


Question 9: Write a Python program to:

● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset

● Compare their Mean Squared Errors (MSE)

Answer 9 : To compare these two models, we will use the California Housing dataset, which is a standard regression task where the goal is to predict the median house value.

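While both models use bagging, the Random Forest Regressor is designed to be more robust because it can add "feature randomness" (selecting a random subset of features at each split) to the standard bagging process. Note that in recent scikit-learn versions `RandomForestRegressor` defaults to `max_features=1.0`, which considers every feature at each split, so with default settings the two models below behave almost identically; a smaller `max_features` is needed to get the extra randomness.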

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# 2. Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a Bagging Regressor
# Using 100 Decision Trees as base learners
bagging_model = BaggingRegressor(estimator=DecisionTreeRegressor(),
                                 n_estimators=100,
                                 random_state=42)
bagging_model.fit(X_train, y_train)
bagging_pred = bagging_model.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_pred)

# 4. Train a Random Forest Regressor
# Using 100 trees
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# 5. Compare Mean Squared Errors
print(f"Bagging Regressor MSE:    {bagging_mse:.4f}")
print(f"Random Forest Regressor MSE: {rf_mse:.4f}")

Bagging Regressor MSE:    0.2559
Random Forest Regressor MSE: 0.2554
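
To see feature randomness take effect, `max_features` can be set explicitly. A minimal sketch, assuming synthetic data from `make_friedman1` rather than California Housing (chosen only to avoid a download), so the numbers are not comparable with the MSE values above:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: synthetic regression data stands in for the housing dataset.
X, y = make_friedman1(n_samples=1000, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for max_features in (1.0, 0.5, "sqrt"):
    # 1.0 considers every feature at each split; smaller values add feature randomness.
    model = RandomForestRegressor(n_estimators=100, max_features=max_features, random_state=0)
    model.fit(X_train, y_train)
    results[max_features] = mean_squared_error(y_test, model.predict(X_test))

for setting, mse in results.items():
    print(f"max_features={setting!r}: MSE = {mse:.4f}")
```

Which setting gives the lowest error depends on the dataset, so `max_features` is usually tuned alongside `n_estimators` and `max_depth`.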


Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.

You decide to use ensemble techniques to increase model performance.

Explain your step-by-step approach to:

● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world context.

Answer 10 : 1. Choosing Between Bagging and Boosting

For loan defaults, I would likely choose Boosting (specifically algorithms like XGBoost or LightGBM).

* Why Boosting? Financial datasets are often "imbalanced" (most people don't default). Boosting is often a strong fit here because it focuses on the "hard" cases: each sequential model learns from the mistakes of the previous ones, which can help it capture the subtle patterns of the minority class (defaulters). This should still be confirmed by comparing it against a Random Forest baseline under cross-validation.

* Why not Bagging? While Random Forest (Bagging) is very robust, it might not capture the complex, non-linear relationships in transaction history as precisely as a gradient-boosted ensemble.

2. Handling Overfitting

Ensemble models can easily "memorize" noise in transaction history. To prevent this, I would:

* Limit Tree Depth: Use max_depth to prevent trees from growing too deep and complex.

* Subsampling: Use subsample (row sampling) and colsample_bytree (feature sampling) to ensure no single transaction type or customer segment dominates the model.

* Early Stopping: Monitor the model's performance on a validation set and stop training as soon as the validation error starts to rise, even if the training error is still falling.

* Regularization: Apply $L1$ (Lasso) or $L2$ (Ridge) penalties to the weights of the leaves to keep them small.
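
Several of these controls can be sketched with scikit-learn's `GradientBoostingClassifier`, used here as a stand-in for XGBoost or LightGBM. The assumptions are loud ones: the data is synthetic and imbalanced (not real loan data), `max_features` is only loosely analogous to `colsample_bytree`, and the $L1$/$L2$ leaf penalties are omitted because this estimator does not expose them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic data with roughly 10% positives stands in for loan defaults.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,
    max_depth=3,              # limit tree depth
    subsample=0.8,            # row subsampling per boosting round
    max_features=0.8,         # feature subsampling at each split
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=0,
)
gb.fit(X_train, y_train)

rounds_used = gb.n_estimators_
test_accuracy = gb.score(X_test, y_test)
print(f"Boosting rounds actually used: {rounds_used} of 500")
print(f"Test accuracy: {test_accuracy:.4f}")
```

Accuracy is printed only as a sanity check; with imbalanced classes it is not a suitable headline metric.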

3. Selecting Base Models


* Weak Learners: I would use Decision Trees as the base learners. They handle a mix of categorical data (demographics like "Marital Status") and numerical data (transaction amounts) without requiring extensive scaling.

* Diversity: If using Stacking, I might combine a Gradient Boosted Tree with a Logistic Regression model. The tree captures non-linearities, while the Logistic Regression provides a stable baseline for linear relationships between income and debt.

4. Evaluating Performance using Cross-Validation

I would implement Stratified K-Fold Cross-Validation ($K=5$ or $10$).

* Stratification: This is crucial because loan defaults are rare. Stratification ensures that each "fold" (subset) of data has the same percentage of defaulters as the original dataset.
* Metric: Instead of simple Accuracy, I would evaluate based on the F1-Score or Precision-Recall AUC, ensuring the model is actually effective at identifying defaults rather than just guessing "no default" for everyone.
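
The evaluation plan above can be sketched as follows. This is a minimal example under assumptions: the data is synthetic and imbalanced rather than real loan data, the model is a small scikit-learn gradient-boosting classifier, and `average_precision` is used as a common summary of the Precision-Recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Assumption: synthetic data with roughly 10% positives stands in for loan defaults.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the share of defaulters roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=["f1", "average_precision"])

f1_scores = scores["test_f1"]
ap_scores = scores["test_average_precision"]
print(f"F1 per fold:                {f1_scores.round(3)}")
print(f"Average precision per fold: {ap_scores.round(3)}")
print(f"Mean F1: {f1_scores.mean():.3f}, mean AP: {ap_scores.mean():.3f}")
```

Looking at the spread across folds, not just the mean, gives a sense of how stable the model is likely to be on new customers.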

5. Justifying Ensemble Learning in this Context

Ensemble learning improves financial decision-making in three ways:

* Risk Mitigation: By combining multiple models, we reduce the "idiosyncratic error" of any single algorithm. A single tree might find a weird correlation in one zip code; an ensemble requires that pattern to be verified across many perspectives.

* Handling Non-Linearity: Loan default is rarely caused by one factor. It’s the interaction of demographics (age) and sudden shifts in transaction history (large withdrawals). Ensembles are naturally gifted at capturing these multi-factor interactions.

* Stability: Financial institutions require "stable" models. Bagging and Boosting provide a "consensus" prediction that is less likely to fluctuate wildly when new, slightly different customer data is introduced next month.
