1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.
  - Ensemble learning combines the predictions of multiple machine learning models into a single predictor that is more robust and accurate than its members. The key idea is the "wisdom of the crowd": if the base models (often "weak learners") are each better than chance and make different, largely uncorrelated errors, then combining them by voting or averaging cancels many of those errors, so the ensemble can outperform any single model.
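
The error-cancelling effect can be illustrated with a small simulation. This is only a sketch under idealized assumptions: the number of models, the 60% accuracy, and the fully independent errors are illustrative choices, not properties of any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 25        # number of simulated base classifiers (illustrative)
n_samples = 10_000   # number of simulated test cases
p_correct = 0.60     # each base model is right 60% of the time (assumed)

# correct[i, j] is True when model i classifies sample j correctly.
# Errors are drawn independently, which is the idealized "diverse models" case.
correct = rng.random((n_models, n_samples)) < p_correct

individual_accuracy = correct.mean(axis=1).mean()
# The majority vote is right whenever more than half of the models are right.
ensemble_accuracy = (correct.sum(axis=0) > n_models / 2).mean()

print(f"Average single-model accuracy: {individual_accuracy:.3f}")
print(f"Majority-vote accuracy:        {ensemble_accuracy:.3f}")
```

In practice base models are trained on overlapping data and their errors are correlated, so the real gain is usually smaller than in this simulation.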

2. What is the difference between Bagging and Boosting?
  - Bagging trains models independently (and therefore in parallel) on bootstrap samples of the data, drawn at random with replacement, and combines them with equal weight; its main effect is to reduce variance. Boosting trains models sequentially, with each new model concentrating on the examples the previous ones got wrong (by re-weighting those examples or by fitting the remaining errors); its main effect is to reduce bias. In weighted-vote boosting such as AdaBoost, the more accurate models receive a larger say in the final prediction.
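
The re-weighting step that distinguishes boosting can be sketched with one round of the classic (discrete) AdaBoost update. The labels and predictions below are made up for illustration; other boosting methods, such as gradient boosting, fit residual errors instead of re-weighting samples.

```python
import numpy as np

# Toy binary labels in {-1, +1} and the predictions of one weak learner.
# Both arrays are made up for illustration.
y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])
y_pred = np.array([1, 1, -1, -1, -1, -1, 1, 1])   # positions 4 and 7 are wrong

weights = np.full(len(y_true), 1 / len(y_true))   # boosting starts with uniform weights

# Weighted error of this learner and its vote strength (classic AdaBoost formulas).
miss = y_pred != y_true
error = weights[miss].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Misclassified samples are scaled up, correct ones down, then renormalized.
new_weights = weights * np.exp(-alpha * y_true * y_pred)
new_weights /= new_weights.sum()

print(f"weighted error = {error:.3f}, learner weight alpha = {alpha:.3f}")
print("updated sample weights:", np.round(new_weights, 3))
```

After the update the two misclassified samples carry half of the total weight, so the next learner is pushed to get them right. Bagging has no such step: every bootstrap sample is drawn without regard to earlier models.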

3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?
  - Bootstrap sampling draws random samples from a dataset with replacement, typically each the same size as the original, so every sample repeats some rows and omits others. In Bagging (Bootstrap Aggregating) methods like Random Forest, each base model (such as a decision tree) is trained independently on its own bootstrap sample, which makes the models differ from one another. Bagging then aggregates their predictions, by averaging for regression or by voting for classification, which reduces variance and improves overall stability and accuracy.
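
A single bootstrap draw can be sketched with NumPy. The dataset here is just a range of hypothetical row indices; the point is that sampling with replacement leaves roughly a third of the rows out of each sample.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000                      # size of a hypothetical training set
row_ids = np.arange(n)

# One bootstrap sample: n draws with replacement from the original row indices.
bootstrap_ids = rng.choice(row_ids, size=n, replace=True)

unique_fraction = np.unique(bootstrap_ids).size / n
print(f"Distinct rows in the bootstrap sample: {unique_fraction:.1%}")
print(f"Rows never drawn:                      {1 - unique_fraction:.1%}")
```

For large n the expected share of distinct rows approaches 1 - 1/e, about 63%, which is why each base model sees a noticeably different view of the data.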

4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
  - Out-of-bag (OOB) samples are the training points that were not drawn into the bootstrap sample used to train a particular base model, such as one decision tree in a random forest; on average about 37% of the training set is out-of-bag for each tree. To compute the OOB score, each training point is predicted using only the models that did not see it during training, and those predictions are scored against the true labels (accuracy for classifiers, R² for regressors in scikit-learn). This provides an internal validation estimate, similar to a cross-validation score, without needing a separate validation set.
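
In scikit-learn the OOB estimate is available by passing `oob_score=True` to a bagging-style estimator. The sketch below uses the Breast Cancer dataset with an arbitrary choice of 200 trees and seed; the exact score will vary with those settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks scikit-learn to score each training row using only
# the trees whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=0
)
forest.fit(X, y)

print(f"OOB accuracy estimate: {forest.oob_score_:.4f}")
```

No train/test split is needed here because every row is scored only by trees that never trained on it.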

5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
  - Both analyses aim to identify the most influential features in a dataset, but they differ in methodology and robustness.
      - Single Decision Tree:
          - Feature importance in a single decision tree is typically the total reduction in impurity (e.g., Gini impurity or entropy) achieved by the splits on that feature, weighted by the number of samples reaching each split. Features that produce larger reductions are considered more important. Because the scores come from one tree, the ranking is sensitive to small changes in the training data, and a feature can score low simply because a correlated feature was chosen for the split instead.

      - Random Forest:
          - Random Forests average feature importance over all the individual decision trees in the forest. The Mean Decrease in Impurity (MDI) computes each feature's impurity reduction within every tree and averages the results across trees, which makes the ranking more stable than a single tree's. An alternative is permutation importance, which shuffles one feature's values and measures the resulting decrease in model performance.
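
The three measures discussed above can be compared side by side. This is a minimal sketch on the Breast Cancer dataset; the split, seeds, and `n_repeats=5` are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based (MDI) importances; each vector is normalized to sum to 1.
tree_mdi = tree.feature_importances_
forest_mdi = forest.feature_importances_

# Permutation importance: mean drop in test accuracy when one feature is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=0)

print(f"Features with zero importance, single tree: {np.sum(tree_mdi == 0)}")
print(f"Features with zero importance, forest:      {np.sum(forest_mdi == 0)}")
top = np.argsort(perm.importances_mean)[::-1][:3]
print("Top 3 by permutation importance:", list(data.feature_names[top]))
```

A single tree only assigns importance to the features it happened to split on, while the forest spreads importance across many more features because each tree sees a different sample and feature subset.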


6. Write a Python program to:
  -  Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
  -  Train a Random Forest Classifier
  -  Print the top 5 most important features based on feature importance scores.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

print("Loading Breast Cancer dataset...")
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

X_df = pd.DataFrame(X, columns=feature_names)

print("Training RandomForestClassifier...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_df, y)

importances = rf_model.feature_importances_

feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

feature_importance_df = feature_importance_df.sort_values(
    by='Importance', ascending=False
)

top_n = 5
print(f"\nTop {top_n} most important features:")

# Number the rows by rank, not by the feature's original column position
for rank, (_, row) in enumerate(feature_importance_df.head(top_n).iterrows(), start=1):
    print(f"{rank}. {row['Feature']:<30} (Importance: {row['Importance']:.4f})")

Loading Breast Cancer dataset...
Training RandomForestClassifier...

Top 5 most important features:
1. worst area                     (Importance: 0.1394)
2. worst concave points           (Importance: 0.1322)
3. mean concave points            (Importance: 0.1070)
4. worst radius                   (Importance: 0.0828)
5. worst perimeter                (Importance: 0.0808)


7. Write a Python program to:
  -  Train a Bagging Classifier using Decision Trees on the Iris dataset
  -  Evaluate its accuracy and compare with a single Decision Tree

In [2]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

print("Loading Iris dataset...")
data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(f"Dataset split: {len(X_train)} training samples, {len(X_test)} testing samples.")

print("\n--- Training Single Decision Tree ---")

single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)

y_pred_single = single_tree.predict(X_test)
accuracy_single = accuracy_score(y_test, y_pred_single)
print(f"Single Decision Tree Accuracy: {accuracy_single:.4f}")

print("\n--- Training Bagging Classifier ---")

bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    random_state=42,
    bootstrap=True,
    n_jobs=-1
)
bagging_model.fit(X_train, y_train)

y_pred_bagging = bagging_model.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print(f"Bagging Classifier Accuracy: {accuracy_bagging:.4f}")

print("\n--- Model Comparison ---")
print(f"Single Tree Accuracy:   {accuracy_single:.4f}")
print(f"Bagging Model Accuracy: {accuracy_bagging:.4f}")

if accuracy_bagging > accuracy_single:
    print("\nResult: The Bagging Classifier achieved a higher accuracy than the single tree on this test split.")
elif accuracy_bagging == accuracy_single:
    print("\nResult: Both models achieved the same accuracy on this dataset split.")
else:
    print("\nResult: The Single Decision Tree performed better in this specific test split.")


Loading Iris dataset...
Dataset split: 105 training samples, 45 testing samples.

--- Training Single Decision Tree ---
Single Decision Tree Accuracy: 0.9333

--- Training Bagging Classifier ---
Bagging Classifier Accuracy: 0.9333

--- Model Comparison ---
Single Tree Accuracy:   0.9333
Bagging Model Accuracy: 0.9333

Result: Both models achieved the same accuracy on this dataset split.


8. Write a Python program to:
  - Train a Random Forest Classifier
  - Tune hyperparameters max_depth and n_estimators using GridSearchCV
  - Print the best parameters and final accuracy

In [3]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

print("Loading Iris dataset...")
data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(f"Dataset split: {len(X_train)} training samples, {len(X_test)} testing samples.")

rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 5, 10, None],
    'min_samples_split': [2, 5]
}

total_models = np.prod([len(v) for v in param_grid.values()]) * 5  # 5 = number of CV folds (cv=5 below)
print(f"\nSearching for best parameters using GridSearchCV (evaluating {total_models} fits)...")

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print("\n" + "="*40)
print("GridSearchCV Results")
print("="*40)

print(f"Best Parameters Found: {grid_search.best_params_}")

best_rf = grid_search.best_estimator_

y_pred_best = best_rf.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred_best)

print(f"Accuracy of the Tuned Model on Test Set: {final_accuracy:.4f}")
print("="*40)

Loading Iris dataset...
Dataset split: 105 training samples, 45 testing samples.

Searching for best parameters using GridSearchCV (evaluating 120 fits)...
Fitting 5 folds for each of 24 candidates, totalling 120 fits

GridSearchCV Results
Best Parameters Found: {'max_depth': 2, 'min_samples_split': 2, 'n_estimators': 200}
Accuracy of the Tuned Model on Test Set: 0.9333


9. Write a Python program to:
  -  Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
  -  Compare their Mean Squared Errors (MSE)

In [4]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

print("Loading California Housing dataset...")
try:
    # The dataset is downloaded and cached on first use
    housing = fetch_california_housing(as_frame=True)
except Exception as e:
    print(f"Error loading dataset: {e}")
    print("Please ensure you have an internet connection if loading for the first time.")
    raise SystemExit(1)

X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Dataset split: {len(X_train)} training samples, {len(X_test)} testing samples.")

print("\n--- Training Bagging Regressor ---")

bagging_model = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42),
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
bagging_model.fit(X_train, y_train)

y_pred_bagging = bagging_model.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
print(f"Bagging Regressor MSE: {mse_bagging:.4f}")

print("\n--- Training Random Forest Regressor ---")

rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")

print("\n" + "="*40)
print("Regression Model MSE Comparison")
print("="*40)
print(f"Bagging Regressor MSE:    {mse_bagging:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")
print("="*40)

if mse_rf < mse_bagging:
    print("\nConclusion: The Random Forest Regressor achieved a lower Mean Squared Error (MSE) on this data split.")
elif mse_bagging < mse_rf:
    print("\nConclusion: The Bagging Regressor achieved a lower Mean Squared Error (MSE) on this data split.")
else:
    print("\nConclusion: Both models performed equally well.")


Loading California Housing dataset...
Dataset split: 16512 training samples, 4128 testing samples.

--- Training Bagging Regressor ---
Bagging Regressor MSE: 0.2559

--- Training Random Forest Regressor ---
Random Forest Regressor MSE: 0.2554

Regression Model MSE Comparison
Bagging Regressor MSE:    0.2559
Random Forest Regressor MSE: 0.2554

Conclusion: The Random Forest Regressor achieved a lower Mean Squared Error (MSE) on this data split.


10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
  - Choose between Bagging or Boosting
  - Handle overfitting
  - Select base models
  - Evaluate performance using cross-validation
  - Justify how ensemble learning improves decision-making in this real-world
context.


In [5]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
import xgboost as xgb

print("Generating simulated, imbalanced loan default data...")
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.98, 0.02],
    flip_y=0,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scale_pos_weight = (len(y_train) - np.sum(y_train)) / np.sum(y_train)
print(f"Minority class proportion (Default): {np.mean(y) * 100:.2f}%")
print(f"Calculated scale_pos_weight: {scale_pos_weight:.2f}")

xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    random_state=42,
    # Up-weight the rare 'Default' class to counter the class imbalance
    scale_pos_weight=scale_pos_weight
)

print("\nStarting GridSearchCV with Stratified K-Fold for Hyperparameter Tuning...")

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.01],
}

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='roc_auc',
    cv=skf,
    verbose=1,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print("\n" + "="*50)
print("Tuning and Evaluation Results (XGBoost)")
print("="*50)

print(f"Best Parameters Found: {grid_search.best_params_}")

best_xgb = grid_search.best_estimator_
y_pred_proba = best_xgb.predict_proba(X_test)[:, 1]

auc_roc = roc_auc_score(y_test, y_pred_proba)
print(f"Test Set AUC-ROC Score: {auc_roc:.4f}")

y_pred = (y_pred_proba > 0.5).astype(int)
print("\nConfusion Matrix (Threshold 0.5):")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))

print("\nJustification: Ensemble learning (Boosting) provides a stable, highly predictive model (high AUC), which is critical for minimizing costly False Negatives (missed defaults, which lower Recall on the 'Default' class) in financial risk assessment.")
print("="*50)


Generating simulated, imbalanced loan default data...
Minority class proportion (Default): 2.00%
Calculated scale_pos_weight: 49.00

Starting GridSearchCV with Stratified K-Fold for Hyperparameter Tuning...
Fitting 3 folds for each of 8 candidates, totalling 24 fits

Tuning and Evaluation Results (XGBoost)
Best Parameters Found: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}
Test Set AUC-ROC Score: 0.9929

Confusion Matrix (Threshold 0.5):
[[1952    8]
 [   4   36]]

Classification Report:
              precision    recall  f1-score   support

  No Default       1.00      1.00      1.00      1960
     Default       0.82      0.90      0.86        40

    accuracy                           0.99      2000
   macro avg       0.91      0.95      0.93      2000
weighted avg       0.99      0.99      0.99      2000


Justification: Ensemble learning (Boosting) provides a stable, highly predictive model (high AUC), which is critical for minimizing costly False Negatives (missed defaults, which lower Recall on the 'Default' class) in financial risk assessment.
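
One way to ground the choice between Bagging and Boosting, rather than assuming Boosting up front, is to score one candidate of each kind on the same stratified folds. The sketch below is an assumption-laden illustration: it uses a smaller synthetic dataset, scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, and untuned hyperparameters, so its numbers are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small synthetic stand-in for loan data; about 5% positives ("defaults").
X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=10, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)

candidates = {
    "Bagging (Random Forest)": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1, random_state=0
    ),
}

# The same stratified folds are reused so both models see identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = scores
    print(f"{name}: mean AUC = {scores.mean():.4f} (std {scores.std():.4f})")
```

Stratified folds keep the default rate roughly constant in every split, and the spread across folds gives a rough sense of how stable each model is. Shallow trees and a modest learning rate are the main overfitting controls for the boosted model; for the forest, averaging many trees serves that role.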
