##Ensemble Techniques

                                 SUBMITTED BY: MD FAHAM NAUSHAD

#***************************************************
##Theoretical Questions:

#***************************************************
##1. What is Ensemble Learning in machine learning? Explain the key idea behind it.
- Answer:

  Ensemble learning is a machine learning technique in which multiple models (called base learners) are combined to produce a single, more accurate prediction. The key idea is that different learners make different errors, so combining them lets those errors partly cancel out. The combined model usually generalizes better and is more robust than any single base learner, provided the base learners are reasonably accurate and diverse.
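
  The error-cancelling idea can be illustrated with a small calculation. This is only a sketch under a strong assumption that real models rarely satisfy: every base learner is correct with the same probability and their errors are fully independent. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n_models votes correctly.

    Assumes (hypothetically) that each model is right with probability
    p_correct and that the models' errors are independent.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Accuracy of the vote for ensembles of increasing (odd) size
for n in (1, 5, 25, 101):
    print(n, "models ->", round(majority_vote_accuracy(n, 0.6), 4))
```

  Under these assumptions the accuracy of the vote grows with the number of models. In practice base learners are correlated, so the gain is smaller.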

##2. What is the difference between Bagging and Boosting?
- Answer:

  Bagging trains multiple models independently on different bootstrapped samples and averages (or votes on) their predictions, which mainly reduces variance. Boosting trains models sequentially, with each new model focusing on the errors of the previous ones, which mainly reduces bias. Bagging therefore suits high-variance learners that tend to overfit, such as deep decision trees, while boosting suits weak, high-bias learners such as shallow trees.

##3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
- Answer:

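  Bootstrap sampling is a technique in which random samples are drawn with replacement from a dataset to form multiple training subsets. In Bagging models like Random Forest, each tree is trained on a different bootstrapped dataset. A bootstrap sample of the same size as the original dataset contains, on average, about 63% of the distinct data points, with the rest being repeats. This increases diversity among the trees, which helps the ensemble reduce variance (overfitting) and improve accuracy.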
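
  Sampling with replacement is easy to check with the standard library alone. The snippet below is a minimal sketch; the helper name `bootstrap_indices` and the dataset size of 10,000 are assumptions made for this illustration.

```python
import random


def bootstrap_indices(n_samples: int, rng: random.Random) -> list[int]:
    """Draw n_samples row indices with replacement from range(n_samples)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)
n_samples = 10_000

sample = bootstrap_indices(n_samples, rng)

# Fraction of distinct rows that made it into the bootstrap sample;
# in expectation this approaches 1 - 1/e (about 0.632) for large n_samples
unique_fraction = len(set(sample)) / n_samples

print("Rows drawn:", len(sample))
print("Distinct rows in sample:", round(unique_fraction, 3))
```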

##4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
- Answer:

  OOB samples are the data points that are not selected in a given tree's bootstrap sample. Because each tree never sees its own OOB samples during training, they act as a built-in validation set. The OOB score is computed by predicting each data point using only the trees for which it was out-of-bag and comparing those predictions with the true labels. It gives an estimate of performance on unseen data without needing a separate validation split.
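
  In scikit-learn the OOB estimate is available by passing `oob_score=True` and reading the fitted `oob_score_` attribute. A minimal sketch follows; the dataset and `n_estimators=200` are choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each sample using only
# the trees that did not see that sample during training
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 4))
```

  With too few trees some samples may never be out-of-bag, so the estimate is more dependable with a larger forest.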

##5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
- Answer:

  A single Decision Tree computes importance from how much each feature decreases impurity at its splits. The result can be unstable, because small changes in the training data can change which features the tree splits on. A Random Forest averages these importances over many trees trained on different bootstrap samples and feature subsets, which makes the ranking more stable and less sensitive to the quirks of any individual tree.
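
  One way to look at this stability is to refit both models on several resampled versions of the same data and see how much the importances move. The sketch below assumes five bootstrap resamples of the Breast Cancer dataset and uses the average standard deviation across features as a rough "spread" measure; both are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

tree_imps, forest_imps = [], []
for seed in range(5):
    # Resample the rows with replacement to mimic a slightly different dataset
    idx = rng.integers(0, len(X), size=len(X))
    Xb, yb = X[idx], y[idx]

    tree = DecisionTreeClassifier(random_state=seed).fit(Xb, yb)
    forest = RandomForestClassifier(n_estimators=100, random_state=seed).fit(Xb, yb)

    tree_imps.append(tree.feature_importances_)
    forest_imps.append(forest.feature_importances_)

tree_imps = np.array(tree_imps)      # shape: (n_resamples, n_features)
forest_imps = np.array(forest_imps)

# Average per-feature standard deviation across the resamples
tree_spread = tree_imps.std(axis=0).mean()
forest_spread = forest_imps.std(axis=0).mean()

print("Importance spread - single tree:  ", round(tree_spread, 4))
print("Importance spread - random forest:", round(forest_spread, 4))
```

  A smaller spread means the importance scores change less from one resample to the next.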

#***************************************************
##Practical Questions:

#***************************************************

##6. Write a Python program to:


*   Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
*   Train a Random Forest Classifier
*   Print the top 5 most important features based on feature importance scores.


###✅Python Code:

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import numpy as np

data = load_breast_cancer()
X = data.data
y = data.target

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

importances = model.feature_importances_
indices = np.argsort(importances)[::-1][:5]

print("Top 5 Most Important Features:")
for i in indices:
    print(data.feature_names[i], ":", round(importances[i], 4))


Top 5 Most Important Features:
worst area : 0.1394
worst concave points : 0.1322
mean concave points : 0.107
worst radius : 0.0828
worst perimeter : 0.0808


##7. Write a Python program to:

*  Train a Bagging Classifier using Decision Trees on the Iris dataset
*  Evaluate its accuracy and compare with a single Decision Tree

###✅Python Code:


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
pred_dt = dt.predict(X_test)
acc_dt = accuracy_score(y_test, pred_dt)

# Bagging Classifier with Decision Tree as base model
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bag.fit(X_train, y_train)
pred_bag = bag.predict(X_test)
acc_bag = accuracy_score(y_test, pred_bag)

print("Accuracy - Decision Tree:", round(acc_dt, 4))
print("Accuracy - Bagging Classifier:", round(acc_bag, 4))



Accuracy - Decision Tree: 0.9333
Accuracy - Bagging Classifier: 0.9333


##8. Write a Python program to:


*   Train a Random Forest Classifier
*   Tune hyperparameters max_depth and n_estimators using GridSearchCV
*   Print the best parameters and final accuracy


###✅Python Code:

In [None]:
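from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Reuses the Iris train/test split (X_train, X_test, y_train, y_test) from Question 7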

params = {
    'max_depth': [2, 3, 4, 5, None],
    'n_estimators': [50, 100, 150]
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    params, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

best = grid.best_estimator_
final_acc = accuracy_score(y_test, best.predict(X_test))

print("Best Parameters:", grid.best_params_)
print("Final Accuracy:", round(final_acc, 4))


Best Parameters: {'max_depth': 2, 'n_estimators': 150}
Final Accuracy: 1.0


##9. Write a Python program to:


*   Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
*   Compare their Mean Squared Errors (MSE)


###✅Python Code:

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

bag = BaggingRegressor(n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
pred_bag = bag.predict(X_test)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

print("MSE - Bagging Regressor:", round(mean_squared_error(y_test, pred_bag), 4))
print("MSE - Random Forest Regressor:", round(mean_squared_error(y_test, pred_rf), 4))


MSE - Bagging Regressor: 0.2579
MSE - Random Forest Regressor: 0.2565


##10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:


*   Choose between Bagging or Boosting
*   Handle overfitting
*   Select base models
*   Evaluate performance using cross-validation
*   Justify how ensemble learning improves decision-making

- Answer:

    To build a loan default prediction model, I would choose Boosting (such as Gradient Boosting or XGBoost) because each new model corrects the mistakes of the previous ones, which helps capture the complex patterns commonly found in financial data. To control overfitting, I would tune the tree depth, learning rate, and number of estimators, and apply early stopping. I would use shallow Decision Trees as base learners because they can capture non-linear behaviour in customer spending patterns and demographics. I would evaluate the model with k-fold cross-validation using the AUC-ROC score, which measures how well the model ranks defaulters above non-defaulters across all decision thresholds. A well-validated ensemble gives more reliable predictions than a single model, which reduces false approvals and false declines and so lowers financial risk.


###✅Python Code:    

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import GradientBoostingClassifier

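# Sample dummy dataset (simulating bank customer data).
# Note: loan_default is drawn independently of the features, so there is no
# real signal to learn and AUC scores near 0.5 (chance level) are expected.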
np.random.seed(42)
data = pd.DataFrame({
    'age': np.random.randint(21, 65, 500),
    'income': np.random.randint(20000, 150000, 500),
    'credit_score': np.random.randint(300, 850, 500),
    'transactions_per_month': np.random.randint(10, 120, 500),
    'loan_default': np.random.choice([0, 1], 500, p=[0.7, 0.3])  # Target variable
})

X = data.drop('loan_default', axis=1)
y = data['loan_default']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Gradient Boosting Classifier (Boosting approach)
model = GradientBoostingClassifier(
    learning_rate=0.05,
    n_estimators=150,
    max_depth=3
)

model.fit(X_train, y_train)

# Cross-validation scoring
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')

# Predict probability and compute ROC-AUC
y_pred_proba = model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_pred_proba)

print("Cross-Validation AUC Scores:", np.round(cv_scores, 4))
print("Mean CV AUC:", round(cv_scores.mean(), 4))
print("Test ROC-AUC Score:", round(test_auc, 4))


Cross-Validation AUC Scores: [0.4823 0.4523 0.43   0.5645 0.4797]
Mean CV AUC: 0.4818
Test ROC-AUC Score: 0.4437
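
The target in the dummy data above is drawn at random, independently of the features, which is why the AUC stays around 0.5. The sketch below repeats the same workflow on synthetic data that does carry signal and adds the early stopping and stratified cross-validation mentioned in the answer. The data from `make_classification`, the 70/30 class weights, and all hyperparameter values are assumptions made for this illustration, not real loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for loan data: 8 features, about 30% positives ("defaults")
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Early stopping: hold out 10% of the training data internally and stop
# adding trees once the validation score has not improved for 10 rounds
model = GradientBoostingClassifier(
    learning_rate=0.05,
    n_estimators=500,       # upper bound; early stopping may use fewer
    max_depth=3,
    n_iter_no_change=10,
    validation_fraction=0.1,
    random_state=42,
)

# Stratified folds keep the default rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print("Trees actually fitted:", model.n_estimators_)
print("Mean CV AUC:", round(cv_auc.mean(), 4))
print("Test ROC-AUC:", round(test_auc, 4))
```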


###************** END  **************