# **ENSEMBLE ASSIGNMENT**

## Q1. What is Ensemble Learning in machine learning? Explain the key idea behind it.

Answer :  
Ensemble Learning is a machine learning technique where I combine multiple models (called base learners, and often deliberately simple "weak learners") into a single, stronger and more accurate model. The central idea is that a group of models working together can perform better than any single model, provided the models do not all make the same mistakes.  

- It can reduce variance, bias, or both, which improves overall prediction accuracy.  
- Different models capture different patterns in the data, so when they are combined their individual errors partly cancel out.  
- Common techniques include Bagging, Boosting, and Stacking.  
- Real-world example: Random Forest (an ensemble of many decision trees).  
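
To make the key idea concrete, here is a minimal sketch that computes how often a majority vote is correct. It assumes an idealized setting that real models never fully satisfy: every model is correct with the same probability `p`, and the models' errors are independent. The function name `majority_vote_accuracy` is my own for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a majority of n independent models is correct,
    when each model is correct with probability p (n_models should be odd)."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# Each individual model is only 60% accurate (an assumed value).
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the vote becomes more accurate as models are added. In practice the gain is smaller because real models are correlated, which is why ensemble methods work to make their base learners diverse.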


## Q2. What is the difference between Bagging and Boosting?

Answer :  
Bagging and Boosting are both ensemble learning methods, but they work in different ways:  

- **Bagging (Bootstrap Aggregating):**  
  - Multiple models are trained in parallel on different bootstrap samples (sampling with replacement).  
  - The final prediction is made by combining the results (majority vote for classification or average for regression).  
  - Its main goal is to reduce variance and avoid overfitting.  
  - Example: Random Forest.  

- **Boosting:**  
  - Models are trained sequentially, where each new model focuses on correcting the errors made by the previous ones.  
  - The models are combined with weights, so learners with lower error get more say in the final prediction.  
  - Its main goal is to reduce bias; it often reduces variance as well.  
  - Example: AdaBoost, Gradient Boosting, XGBoost.  

In short, Bagging builds models independently in parallel to reduce variance, while Boosting builds models sequentially to reduce bias and improve accuracy.
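
The "focus on previous errors" step in Boosting can be sketched with one reweighting round of discrete AdaBoost. This is a minimal illustration in plain Python, assuming labels in {-1, +1}, sample weights that sum to 1, and a weighted error strictly between 0 and 1. The helper name `adaboost_round` and the toy predictions are hypothetical. Other boosting methods, such as Gradient Boosting, instead fit each new model to the residual errors rather than reweighting samples.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete-AdaBoost reweighting step for labels in {-1, +1}."""
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake, so
    # misclassified points are scaled up and correct ones scaled down.
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # hypothetical weak learner: one mistake (index 1)
weights = [0.2] * 5           # start with uniform weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print("learner weight alpha:", alpha)
print("updated sample weights:", new_weights)
```

After the update, the single misclassified point carries half of the total weight, so the next learner is pushed to get it right.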


## Q3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

Answer :  
Bootstrap sampling is a statistical technique where I create new datasets by randomly selecting samples **with replacement** from the original training data. Each bootstrap sample is usually the same size as the original dataset, but because of replacement, some records appear multiple times while some may not appear at all.  

In Bagging methods like Random Forest:  
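- Each decision tree is trained on a different bootstrap sample, which ensures diversity among the trees.  
- Random Forest adds a second source of randomness by considering only a random subset of features at each split. Together, these reduce the correlation between the trees and make the overall model more stable.  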
- When predictions of all trees are combined (through voting or averaging), the final result becomes more accurate and less prone to overfitting compared to a single decision tree.  
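
As a small illustration of bootstrap sampling, the sketch below draws one bootstrap sample from a toy dataset of ten record ids using only the standard library. The seed and the dataset are arbitrary choices for the example.

```python
import random
from collections import Counter

rng = random.Random(42)    # fixed seed so the run is repeatable
data = list(range(10))     # toy "dataset": record ids 0..9

# Draw len(data) records WITH replacement: one bootstrap sample.
sample = rng.choices(data, k=len(data))
counts = Counter(sample)

repeated = sorted(r for r, c in counts.items() if c > 1)
missing = sorted(set(data) - set(sample))

print("bootstrap sample:", sample)
print("repeated records:", repeated)
print("left-out records:", missing)
```

The sample has the same size as the original data, but it typically contains some records more than once and omits others. Each tree in a bagged ensemble would get its own sample like this.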


## Q4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

Answer :  
In bootstrap sampling, since data is selected with replacement, roughly 37% of the records (about 1/e, a little over one-third) are left out of each bootstrap sample. These left-out records are called **Out-of-Bag (OOB) samples**.  

In ensemble models like Random Forest:  
- Each tree is trained on its bootstrap sample, and the corresponding OOB samples for that tree can be used as test data.  
- This allows me to estimate the model’s performance without needing a separate validation set.  
- The **OOB score** is the accuracy (or error) obtained when each record is predicted using only the trees that did not see it during training.  
- It is very useful because it gives a reasonable, often slightly conservative, estimate of model performance and lets me keep all the data for training instead of setting aside a validation set.  
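
The size of the OOB set follows from a short calculation: a given record is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below checks this both exactly and by simulating one bootstrap sample. The function names are my own for this illustration.

```python
import math
import random

def oob_fraction_exact(n: int) -> float:
    """Probability that a given record is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

def oob_fraction_simulated(n: int, seed: int = 0) -> float:
    """Fraction of records left out of one simulated bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # indices picked at least once
    return 1 - len(drawn) / n

n = 10_000
print("exact    :", oob_fraction_exact(n))
print("simulated:", oob_fraction_simulated(n))
print("limit 1/e:", math.exp(-1))
```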


## Q5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

Answer :  
In a **single Decision Tree**:  
- Feature importance is calculated based on how much a feature reduces impurity (like Gini Index or Entropy) across all its splits.  
- However, the importance can be unstable: a single tree may favor certain features because of noise or the particular sample it was trained on, so a small change in the data can change the ranking.  

In a **Random Forest**:  
- Importance is averaged across many trees trained on different bootstrap samples and random feature subsets.  
- This averaging reduces the variance of the estimate and gives a more reliable and stable measure of which features are truly important.  
- Hence, feature importance from Random Forest is generally considered more trustworthy than that from a single Decision Tree.  
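
To show what "reducing impurity" means, here is a minimal sketch of the Gini impurity decrease for a single split. A tree's impurity-based importance for a feature adds up such decreases over every split that uses the feature, and a forest then averages the result across its trees. The helper names and the two toy splits are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = [0, 0, 0, 1, 1, 1]

# Hypothetical split on feature A separates the classes perfectly.
print("feature A:", impurity_decrease(parent, [0, 0, 0], [1, 1, 1]))  # 0.5

# Hypothetical split on feature B leaves both children mixed.
print("feature B:", impurity_decrease(parent, [0, 0, 1], [0, 1, 1]))
```

Feature A removes all of the parent's impurity while feature B removes only a little, so A would receive the higher importance.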


## Q6. Write a Python program to:
- Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores.


In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load the dataset into a DataFrame so the feature names are kept
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train the Random Forest (fixed seed for reproducibility)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(rf.feature_importances_, index=X.columns)
top5 = importances.sort_values(ascending=False).head(5)

print("Top 5 important features:\n")
print(top5)


Top 5 important features:

worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


## Q7. Write a Python program to:
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree


In [5]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Stratified 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline: a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

# Bagging: 50 Decision Trees, each trained on its own bootstrap sample
# (`estimator=` needs scikit-learn >= 1.2; older versions use `base_estimator=`)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
y_pred_bag = bag.predict(X_test)
acc_bag = accuracy_score(y_test, y_pred_bag)

print("Accuracy of Single Decision Tree :", acc_dt)
print("Accuracy of Bagging Classifier   :", acc_bag)


Accuracy of Single Decision Tree : 0.9333333333333333
Accuracy of Bagging Classifier   : 0.9333333333333333

Both models reach the same test accuracy (about 0.933) on this split. Iris is a small and easily separable dataset, so a single tree likely has little variance left for Bagging to remove.


## Q8. Write a Python program to:
- Train a Random Forest Classifier
- Tune hyperparameters max_depth and n_estimators using GridSearchCV
- Print the best parameters and final accuracy


In [6]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Hold out 30% as a test set that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(random_state=42)

# Candidate values for the two hyperparameters being tuned
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 5, 10]
}

# 5-fold cross-validated grid search on the training data only
grid = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)

best_params = grid.best_params_

# Final accuracy of the best model on the held-out test set
y_pred = grid.best_estimator_.predict(X_test)
final_acc = accuracy_score(y_test, y_pred)

print("Best Parameters :", best_params)
print("Final Accuracy  :", final_acc)


Best Parameters : {'max_depth': None, 'n_estimators': 100}
Final Accuracy  : 0.935672514619883


## Q9. Write a Python program to:
- Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
- Compare their Mean Squared Errors (MSE)


In [7]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the California Housing dataset (downloaded on first use)
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Bagging Regressor: 50 Decision Trees on bootstrap samples
bag = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
y_pred_bag = bag.predict(X_test)

# Random Forest Regressor: note it uses 100 trees, not 50
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Compare test-set Mean Squared Errors
mse_bag = mean_squared_error(y_test, y_pred_bag)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print("Mean Squared Error of Bagging Regressor      :", mse_bag)
print("Mean Squared Error of Random Forest Regressor:", mse_rf)


Mean Squared Error of Bagging Regressor      : 0.25787382250585034
Mean Squared Error of Random Forest Regressor: 0.25650512920799395

The Random Forest's MSE is only marginally lower (about 0.2565 vs 0.2579). The comparison is not strictly like-for-like, since the Random Forest uses 100 trees and the Bagging Regressor 50. In recent scikit-learn versions, RandomForestRegressor also considers all features at each split by default, which makes it behave much like bagged trees and would explain the near-identical results.


## Q10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world context.

Answer :  
If I have to build a loan default prediction model, my approach will be as follows:

- **Choice between Bagging and Boosting**: I will first try Boosting (like XGBoost or AdaBoost) because it often works well on tabular data with complex patterns, and the class imbalance that is common in loan default data can be handled with class or sample weights. I will also test Bagging (like Random Forest) as a stable baseline. Boosting often gives higher accuracy when tuned carefully, but I will let cross-validation decide.
- **Handling Overfitting**: I will limit the max_depth of trees, add regularization (a smaller learning rate, subsampling, a larger min_samples_split), and use early stopping while training Boosting models.
- **Selecting Base Models**: Decision Trees will be my base learners since they are flexible and work well in both frameworks: shallow trees as weak learners for Boosting, deeper trees for Bagging.
- **Evaluating with Cross-Validation**: I will apply stratified k-fold cross-validation to make sure my model performance is stable and not just the result of one lucky split. I will track precision, recall, and ROC-AUC in addition to accuracy, since accuracy alone can look good on imbalanced data and false negatives (missed defaults) are very costly.
- **Justification**: Ensemble learning mainly reduces variance (in Bagging) or bias (in Boosting). This makes the predictions more robust and reliable, which helps the financial institution reduce risk, improve loan approval decisions, and minimize financial losses.
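
To illustrate why I would not rely on accuracy alone, here is a minimal sketch with a hypothetical portfolio that has the same 70/30 class balance as the simulated data used below. The "model" here is deliberately useless: it never predicts a default. All numbers and the helper `precision_recall` are assumptions made for this example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class, from raw label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical portfolio: 70 repaid loans (0) and 30 defaults (1).
y_true = [0] * 70 + [1] * 30
# A useless "model" that never predicts default.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)

print("accuracy          :", accuracy)
print("recall on defaults:", recall)
```

The accuracy looks acceptable at 0.7 even though every default is missed (recall 0.0), which is exactly the failure a lender cannot afford.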



In [8]:
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# For demonstration, simulate a loan default dataset with make_classification
# (class 1 = default, about 30% of the records)
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=10, n_redundant=5,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

# Stratified split so both sets keep the same default rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# GridSearchCV to tune hyperparameters
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1]
}

# 5-fold cross-validation, scored by ROC-AUC rather than accuracy
grid = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1
)
grid.fit(X_train, y_train)

# Evaluate the best model on the held-out test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters :", grid.best_params_)
print("Accuracy on Test Set :", accuracy_score(y_test, y_pred))
print("ROC-AUC Score      :", roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Best Parameters : {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
Accuracy on Test Set : 0.9373333333333334
ROC-AUC Score      : 0.9687976368938057

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.98      0.96      1047
           1       0.94      0.85      0.89       453

    accuracy                           0.94      1500
   macro avg       0.94      0.91      0.92      1500
weighted avg       0.94      0.94      0.94      1500

