#Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.
Ensemble learning combines multiple base models (often called learners) to produce a single aggregated prediction that is usually more accurate and robust than any individual model.
Key idea: different models make different errors; by combining them intelligently (averaging, voting, weighted sums, sequential corrections), the ensemble reduces variance, bias, or both, which improves generalization. Common paradigms: bagging (reduces variance), boosting (reduces bias), stacking (learns how to combine models).
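
As a rough illustration of why combining helps, the sketch below computes the accuracy of a majority vote over n classifiers. It assumes every classifier is correct with the same probability p and that their errors are fully independent, an idealization that real models rarely meet; the function name majority_vote_accuracy is made up for this example.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models independent
    classifiers (each correct with probability p) is correct.
    Use an odd n_models so ties cannot occur."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# Assumption: each model is only slightly better than chance (p = 0.6).
for n in (1, 5, 25, 101):
    acc = majority_vote_accuracy(n, 0.6)
    print(f"{n:>3} models -> majority-vote accuracy {acc:.3f}")
```

The more correlated the models' errors are, the smaller this gain becomes, which is why bagging and Random Forest work to make their base learners differ.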

#Question 2: What is the difference between Bagging and Boosting?
Bagging (Bootstrap AGGregatING)
Idea: train many independent base learners on different bootstrap samples (sampling with replacement) of the training set, then aggregate by averaging (regression) or majority vote (classification).
Effect: reduces variance (stabilizes high-variance models such as decision trees).
Base learners trained in parallel (independent).
Examples: Random Forest (bagging + feature subsampling).

Boosting
Idea: train base learners sequentially; each new learner focuses on examples the previous ones handled poorly (via reweighting or residual fitting). The ensemble aggregates learners (often weighted).
Effect: reduces bias and can also reduce variance; produces a strong learner from many weak ones.
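More prone to overfitting than bagging, especially on noisy labels, unless regularized (learning rate, tree depth, number of rounds, early stopping), but often very accurate.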
Examples: AdaBoost, Gradient Boosting Machines (GBM), XGBoost, LightGBM, CatBoost.
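
The "residual fitting" form of boosting can be sketched in plain Python. This is a toy version for squared-error regression on one feature, using decision stumps as weak learners. It is not how GBM, XGBoost, or the other libraries are implemented, and the helper names (fit_stump, stump_predict, boost) and the sine-curve data are made up for illustration.

```python
import math

def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    sorted_xs = sorted(set(xs))
    for lo, hi in zip(sorted_xs, sorted_xs[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]  # (threshold, left_value, right_value)

def stump_predict(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Sequentially fit stumps to the current residuals and add them to the ensemble."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

# Toy data (assumption): a noiseless sine curve.
xs = [i / 10 for i in range(60)]
ys = [math.sin(x) for x in xs]
base, stumps, mse_history = boost(xs, ys)
print(f"training MSE after 1 round:   {mse_history[0]:.4f}")
print(f"training MSE after 20 rounds: {mse_history[-1]:.4f}")
```

Each round depends on the predictions of all previous rounds, so the learners cannot be trained in parallel. In bagging, by contrast, every learner sees only its own bootstrap sample.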

#Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
Bootstrap sampling draws examples with replacement from the original dataset to create multiple different training sets (each the same size as the original). On average, each bootstrap sample contains about 63.2% of the distinct original examples (the remaining draws are duplicates), leaving about 36.8% not selected.
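
The 63.2% figure follows from the chance that a given example is never drawn in n draws, (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A small simulation (a sketch; the dataset size, repeat count, and the helper name unique_fraction are arbitrary choices) can be used to check it:

```python
import random

def unique_fraction(n: int, rng: random.Random) -> float:
    """Draw one bootstrap sample of size n and return the share of distinct indices in it."""
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

n = 1000                       # assumed dataset size
rng = random.Random(42)
n_repeats = 200
simulated = sum(unique_fraction(n, rng) for _ in range(n_repeats)) / n_repeats
expected = 1 - (1 - 1 / n) ** n   # exact expectation for this n

print(f"simulated share of unique examples: {simulated:.4f}")
print(f"theoretical share 1 - (1 - 1/n)^n:  {expected:.4f}")
```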

Role in Bagging / Random Forest:
Creates diverse training sets so base learners (e.g., trees) differ from one another.
Diversity reduces correlation among base learners; averaging reduces variance.
Random Forest adds an extra randomness layer by subsampling features at each split (feature bagging), further decorrelating trees.

#Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
Out-of-Bag (OOB) samples: For a given bootstrap sample, the examples not included (roughly 36.8% of original data) are OOB for that particular base learner.
OOB score (evaluation): For ensemble methods that use bootstrapping (like Random Forest), you can compute predictions for each training sample using only those trees for which the sample was OOB. Aggregating those predictions (vote/average) gives an OOB estimate of performance (accuracy, MSE, etc.) without needing an external validation set or cross-validation.
Practical: OOB score is a fast built-in estimate of generalization performance and is often close to cross-validated performance for Random Forest.
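
In scikit-learn the OOB estimate is enabled with oob_score=True and read from the fitted model's oob_score_ attribute. The sketch below compares it with 5-fold cross-validation on the Breast Cancer dataset; the choice of 200 trees and the seeds are arbitrary, and the two numbers will vary somewhat with them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not contain it.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

# Reference estimate: ordinary 5-fold cross-validation of the same configuration.
cv_acc = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42), X, y, cv=5
).mean()

print(f"OOB accuracy:       {rf.oob_score_:.4f}")
print(f"5-fold CV accuracy: {cv_acc:.4f}")
```

With very few trees, some samples may never be out-of-bag, so the estimate is only dependable once the forest is reasonably large.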

#Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
Single Decision Tree
Feature importance often computed as the total decrease in impurity (Gini/Entropy/MSE) brought by splits on that feature, summed over the tree.
Can be unstable: small changes in data or hyperparameters may change which feature is chosen for splits, so importances can vary a lot.

Random Forest
Feature importance is averaged across all trees (mean decrease in impurity), which makes it much more stable than a single tree's ranking.
Permutation importance (the drop in performance when a feature's values are shuffled) can also be computed for a fitted Random Forest; it is model-agnostic and, when measured on held-out data, is often more reliable for ranking than impurity-based scores.
Because Random Forest averages many trees (each built on a different bootstrap sample and different feature subsets), the importances are more robust and less sensitive to noise. However, correlated features can still split importance among themselves, reducing each feature's individual score.
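
The two kinds of importance can be compared side by side. The sketch below uses scikit-learn's sklearn.inspection.permutation_importance on a held-out split of the Breast Cancer dataset; the split, seeds, and n_repeats=10 are arbitrary choices, and the resulting rankings depend on them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importance: computed from the training data while the trees are built.
mdi_rank = rf.feature_importances_.argsort()[::-1][:5]

# Permutation importance: drop in test accuracy when one feature column is shuffled.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
perm_rank = result.importances_mean.argsort()[::-1][:5]

print("Top 5 by mean decrease in impurity:")
for i in mdi_rank:
    print(f"  {data.feature_names[i]}: {rf.feature_importances_[i]:.4f}")

print("Top 5 by permutation importance (test set):")
for i in perm_rank:
    print(f"  {data.feature_names[i]}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Because many features in this dataset are strongly correlated (radius, perimeter, area), shuffling one of them often costs little accuracy, so the permutation scores can look small even for features that rank high by impurity.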

In [1]:
'''#Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
(Include your Python code and output in the code box below.)'''

# Q6: Random Forest on Breast Cancer dataset, print top 5 features
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

print("Top 5 features:")
for i in indices[:5]:
    print(f"{feature_names[i]}: {importances[i]:.6f}")


Top 5 features:
worst area: 0.149674
worst concave points: 0.127189
mean concave points: 0.104650
worst radius: 0.086963
worst perimeter: 0.080299


In [7]:
# Q7: Bagging Classifier vs single Decision Tree on Iris
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

# Bagging with Decision Trees ('estimator' replaces the deprecated 'base_estimator' argument in scikit-learn >= 1.2)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=25,
                        random_state=42,
                        n_jobs=-1)
bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

print("=== Q7: BaggingClassifier (Decision Trees) on Iris dataset ===")
print(f"Single Decision Tree accuracy: {dt_acc:.4f}")
print(f"Bagging (25 trees) accuracy: {bag_acc:.4f}")



=== Q7: BaggingClassifier (Decision Trees) on Iris dataset ===
Single Decision Tree accuracy: 0.8947
Bagging (25 trees) accuracy: 0.9474


In [5]:
# Q8: Random Forest + GridSearchCV (tune max_depth and n_estimators) on Breast Cancer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7, stratify=y)

rf = RandomForestClassifier(random_state=42)
param_grid = {
    "n_estimators": [50, 100],    # small grid for speed; expand if needed
    "max_depth": [None, 5, 10]
}

grid = GridSearchCV(rf, param_grid, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)

best_params = grid.best_params_
best_cv_score = grid.best_score_
test_acc = accuracy_score(y_test, grid.best_estimator_.predict(X_test))

print("Best parameters found:", best_params)
print(f"Best CV accuracy: {best_cv_score:.4f}")
print(f"Test set accuracy with best params: {test_acc:.4f}")


Best parameters found: {'max_depth': None, 'n_estimators': 100}
Best CV accuracy: 0.9601
Test set accuracy with best params: 0.9510


In [8]:
# Q9: Bagging Regressor and Random Forest Regressor on California housing
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
cal = fetch_california_housing()
X, y = cal.data, cal.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 'estimator' replaces the deprecated 'base_estimator' argument (scikit-learn >= 1.2)
bag_reg = BaggingRegressor(estimator=DecisionTreeRegressor(),
                           n_estimators=20,
                           random_state=42,
                           n_jobs=-1)

rf_reg = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Train both models
bag_reg.fit(X_train, y_train)
rf_reg.fit(X_train, y_train)

# Predictions
bag_pred = bag_reg.predict(X_test)
rf_pred = rf_reg.predict(X_test)

# Evaluate with Mean Squared Error (MSE)
mse_bag = mean_squared_error(y_test, bag_pred)
mse_rf = mean_squared_error(y_test, rf_pred)

print("=== Q9: Bagging Regressor vs Random Forest Regressor on California Housing ===")
print(f"Bagging Regressor MSE: {mse_bag:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")



=== Q9: Bagging Regressor vs Random Forest Regressor on California Housing ===
Bagging Regressor MSE: 0.2648
Random Forest Regressor MSE: 0.2542


#Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.

Step 1: Choosing Between Bagging and Boosting

Bagging (e.g., Random Forest): Reduces variance; good for unstable models like Decision Trees.

Boosting (e.g., XGBoost, LightGBM): Reduces bias; trains models sequentially to fix previous errors.
✅ For loan default prediction, Boosting is usually the preferred starting point: it captures complex, non-linear patterns in tabular data, and class imbalance can be addressed with class or sample weights.

Step 2: Handling Overfitting

Use cross-validation for model validation.

Apply hyperparameter tuning (e.g., max_depth, learning_rate).

Use regularization and early stopping.

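Handle class imbalance with class weights or SMOTE, applying any resampling inside the training folds only so that validation scores stay honest.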
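
The regularization and early-stopping points above could look like the following with scikit-learn's GradientBoostingClassifier. This is a sketch on synthetic imbalanced data from make_classification standing in for real loan records, and every hyperparameter value here is an untuned assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for loan data (assumption): about 10% positive ("default") class.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

model = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting rounds
    learning_rate=0.05,        # shrinkage: smaller steps, less overfitting
    max_depth=3,               # shallow trees as weak learners
    subsample=0.8,             # row subsampling (stochastic gradient boosting)
    validation_fraction=0.2,   # held out internally to monitor the loss
    n_iter_no_change=10,       # stop once 10 rounds bring no improvement
    random_state=0,
)
model.fit(X, y)

print(f"Boosting rounds actually used: {model.n_estimators_} of 500")
```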

Step 3: Selecting Base Models

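Try shallow Decision Trees (the usual base learner inside boosting) and Logistic Regression as a simple baseline, and compare them with a boosted ensemble such as XGBoost.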

Choose the model with the best cross-validated ROC-AUC or F1-score.

Step 4: Model Evaluation

Use k-fold cross-validation and metrics like:

Accuracy (can be misleading when defaults are rare)

Precision/Recall

ROC-AUC
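
These metrics can be collected in one pass with stratified k-fold cross-validation. The sketch below again uses synthetic imbalanced data from make_classification as a stand-in for loan records; the fold count, seeds, and default model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data (assumption): about 10% positive ("default") class.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the default rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "roc_auc"]

scores = cross_validate(
    GradientBoostingClassifier(random_state=0), X, y, cv=cv, scoring=metrics
)

for metric in metrics:
    vals = scores[f"test_{metric}"]
    print(f"{metric:>9}: {vals.mean():.3f} +/- {vals.std():.3f}")
```

On imbalanced data like this, accuracy is typically high even when recall on the minority class is modest, which is why precision, recall, and ROC-AUC are reported alongside it.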

Step 5: Why Ensemble Learning Helps

Combines multiple weak learners for better accuracy.

Reduces bias & variance → more stable predictions.

Improves risk prediction → helps financial institutions make safer loan approvals.

In [10]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Load classification dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train model and evaluate using ROC-AUC
model = GradientBoostingClassifier()
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print("Average ROC-AUC:", scores.mean())


Average ROC-AUC: 0.9909759807769962
