# Assignment Code: DA-AG-014


Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.
* Ensemble Learning is a machine learning technique in which multiple models are combined to improve prediction accuracy and reliability. **The key idea is that a group of diverse models, each of which may be only a weak learner, can together outperform any single one of them**, because their individual errors partly cancel out. It works by training several models and merging their outputs (through voting, averaging, or stacking); depending on the method, this reduces variance, bias, or both.
* In short: Ensemble learning makes predictions more accurate, stable, and robust.
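
The "errors partly cancel out" argument can be checked with a small calculation. The sketch below is illustrative only and rests on strong assumptions that real models rarely meet: a binary task, an odd number of models, every model correct with the same probability `p`, and fully independent errors. The function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is right with probability p and that the models'
    errors are independent (an idealization).
    """
    needed = n_models // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Accuracy of the vote as more 60%-accurate models are added
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the vote improves as models are added only when `p` is above 0.5; with `p` below 0.5 the same formula shows the ensemble getting worse, which is why base models must be better than chance and reasonably diverse.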

Question 2: What is the difference between Bagging and Boosting?
* Bagging (Bootstrap Aggregating): Trains models independently (so they can run in parallel) on bootstrap samples, i.e. random samples of the training data drawn with replacement. The final prediction is made by averaging (regression) or voting (classification). It mainly reduces variance. Example: Random Forest.
* Boosting: Trains models sequentially, where each new model focuses on the errors of the previous ones. The final prediction is a weighted combination of the models. It mainly reduces bias and can also reduce variance. Examples: AdaBoost, XGBoost.
* Bagging = parallel, reduces variance; Boosting = sequential, mainly reduces bias.
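
To make "each new model focuses on the errors of the previous ones" concrete, here is a minimal sketch of one round of the sample-weight update used by discrete AdaBoost for binary labels. It is a simplified illustration, not a full boosting implementation: the helper name `adaboost_reweight` and the toy weights are assumptions made for this example, and the weak learner itself is left out.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One round of the discrete AdaBoost sample-weight update.

    weights:        current sample weights (should sum to 1)
    misclassified:  True where the current weak learner was wrong
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner
    err = sum(w for w, miss in zip(weights, misclassified) if miss)
    # alpha is the weight this learner gets in the final combined vote
    alpha = 0.5 * math.log((1 - err) / err)
    # Increase weights of misclassified samples, decrease the rest
    updated = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5                      # five samples, equal weight
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_reweight(weights, misclassified)
print("alpha:", alpha)
print("new weights:", new_weights)
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no such step: its models never see each other's mistakes.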

Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
* Bootstrap sampling is a technique where new datasets are created by randomly selecting samples from the original data with replacement. Some records may repeat while others may be left out.
* Role in Bagging (e.g., Random Forest):
     * Each model is trained on a different bootstrap sample, creating diversity among models. When their outputs are combined (by voting or averaging), the overall prediction becomes more accurate, more stable, and less prone to overfitting.
* In short: Bootstrap sampling provides varied training sets, which makes Bagging methods like Random Forest effective.
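
A minimal sketch of bootstrap sampling using only the standard library. The helper name `bootstrap_sample` and the row count are assumptions for illustration; rows are represented by their indices. For large datasets, a bootstrap sample is expected to contain roughly 63.2% (that is, 1 - 1/e) of the distinct rows, with the rest left out.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement from range(n_rows)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(1)          # fixed seed so the run is repeatable
n_rows = 10_000
sample = bootstrap_sample(n_rows, rng)

in_bag = set(sample)            # distinct rows that were drawn at least once
unique_fraction = len(in_bag) / n_rows

print(f"distinct rows in the sample: {unique_fraction:.3f}")
print(f"rows left out of the sample: {1 - unique_fraction:.3f}")
```

The sample has the same size as the original data, yet repeated draws mean about a third of the rows never appear in it. Each model in a bagged ensemble gets its own such sample.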

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
* In bootstrap sampling, some data points are left out of the sample used to train a model. These unused data points are called Out-of-Bag (OOB) samples.
* OOB Score in Ensemble Models:
      * OOB samples act like a built-in validation set.
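      * After training, each data point is predicted using only the models whose bootstrap sample did not include it.
      * The OOB score is the accuracy (or error) of these aggregated OOB predictions over the whole training set, so no separate validation split is needed.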
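
In scikit-learn, the OOB score is available by passing `oob_score=True` to a bagging-style estimator. The sketch below uses the Breast Cancer dataset only as a convenient example; the choice of 200 trees is an arbitrary assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
forest.fit(X, y)

oob_accuracy = forest.oob_score_
print("OOB accuracy:", oob_accuracy)
```

Because the whole dataset is used for fitting, the OOB score gives a validation-style estimate without holding any rows back, which is useful when data is limited.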


Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
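* Decision Tree: Feature importance = total impurity reduction from the splits on each feature in one tree. It depends heavily on which splits that particular tree happened to choose, so it can change noticeably with small changes in the data, and features the tree never splits on get zero importance.
* Random Forest: Importance = impurity reduction averaged across many trees trained on different bootstrap samples and feature subsets; more stable and robust, although impurity-based scores can still favor features with many distinct values.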
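
The difference can be seen by fitting both models on the same data and comparing how the importance is spread across features. This is a rough sketch on the Breast Cancer dataset; the variable names and the 100-tree setting are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=1).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

tree_imp = tree.feature_importances_      # from one tree's splits
forest_imp = forest.feature_importances_  # averaged over all trees

# How many features receive any importance at all
tree_used = int(np.count_nonzero(tree_imp))
forest_used = int(np.count_nonzero(forest_imp))

print("features with non-zero importance (tree):  ", tree_used)
print("features with non-zero importance (forest):", forest_used)
print("largest single importance (tree):  ", tree_imp.max())
print("largest single importance (forest):", forest_imp.max())
```

A single tree typically concentrates its importance on the few features it happened to split on, while the forest spreads importance over more features because each tree sees a different sample and a different subset of candidate features at each split.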

Question 6: Write a Python program to:
* Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
* Train a Random Forest Classifier.
* Print the top 5 most important features based on feature importance scores.
* (Include your Python code and output in the code box below.)


In [1]:
# Answer:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd


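# Load the data into a DataFrame so feature names stay attached to the columns
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Fit a Random Forest with a fixed seed for reproducible results
model = RandomForestClassifier(random_state=1)
model.fit(X, y)

# Impurity-based importances: one value per feature, summing to 1
importances = model.feature_importances_
feature_importance = pd.Series(importances, index=data.feature_names)

# Show the five largest importances
print("Top 5 Important Features:")
print(feature_importance.sort_values(ascending=False).head(5))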

Top 5 Important Features:
worst concave points    0.123350
worst perimeter         0.115661
worst area              0.105248
worst radius            0.102798
mean concave points     0.100735
dtype: float64


Question 7: Write a Python program to:
* Train a Bagging Classifier using Decision Trees on the Iris dataset
* Evaluate its accuracy and compare with a single Decision Tree
* (Include your Python code and output in the code box below.)

In [2]:
# Answer:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
import pandas as pd


# Load the Iris data and separate features from the target
data = load_iris()

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.drop('target', axis=1)
y = df['target']

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Baseline: a single Decision Tree
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)

# Bagging: 50 Decision Trees, each trained on its own bootstrap sample
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=1)
bagging.fit(X_train, y_train)
bag_pred = bagging.predict(X_test)
bag_acc = accuracy_score(y_test, bag_pred)

# Compare test accuracy of the two models
print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)

Decision Tree Accuracy: 0.9555555555555556
Bagging Classifier Accuracy: 0.9555555555555556


Question 8: Write a Python program to:
* Train a Random Forest Classifier
* Tune hyperparameters max_depth and n_estimators using GridSearchCV
* Print the best parameters and final accuracy
* (Include your Python code and output in the code box below.)


In [4]:
# Answer:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

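# Load the Breast Cancer data and separate features from the target
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.drop('target', axis=1)
y = df['target']

# Hold out 30% of the rows; the grid search only sees the training part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestClassifier(random_state=1)

# Candidate values to try for each hyperparameter
param_grid = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10]
}

# 5-fold cross-validated search over all 9 combinations
grid = GridSearchCV(rf, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)

# Evaluate the best model (refit on the full training set) on the held-out test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Final Accuracy:", accuracy_score(y_test, y_pred))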

Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 0.9473684210526315


Question 9: Write a Python program to:
* Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
* Compare their Mean Squared Errors (MSE)
* (Include your Python code and output in the code box below.)

In [5]:
# Answer:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# Load the California Housing data and separate features from the target
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.drop('target', axis=1)
y = df['target']

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Bagging: 50 regression trees, each trained on its own bootstrap sample
bagging = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=1)
bagging.fit(X_train, y_train)
bag_pred = bagging.predict(X_test)
bag_mse = mean_squared_error(y_test, bag_pred)

# Random Forest with the same number of trees for a like-for-like comparison
rf = RandomForestRegressor(n_estimators=50, random_state=1)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# Lower MSE is better
print("Bagging Regressor MSE:", bag_mse)
print("Random Forest Regressor MSE:", rf_mse)

Bagging Regressor MSE: 0.26483002536121963
Random Forest Regressor MSE: 0.26582715600876


Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
* Choose between Bagging or Boosting
* Handle overfitting
* Select base models
* Evaluate performance using cross-validation
* Justify how ensemble learning improves decision-making in this real-world
context.
* (Include your Python code and output in the code box below.)

**Answer**-
* Step 1: Choose between Bagging or Boosting. We first check whether a baseline model suffers from high variance or high bias. If it tends to overfit (high variance), we prefer Bagging, e.g. Random Forest. If it underfits and misses patterns (high bias), we use Boosting, e.g. XGBoost or AdaBoost, to focus on hard-to-predict cases.
* Step 2: Handle overfitting. We limit tree depth and, in boosting, use a small learning rate, regularization, and a cap on the number of boosting rounds (for example via early stopping). Adding more trees to a bagged ensemble such as Random Forest does not by itself cause overfitting, so there the main controls are tree depth and leaf size. We also use cross-validation to tune parameters and ensure the model performs well on unseen data.
* Step 3: Select base models. We usually start with Decision Trees because they are simple and work well in ensembles: deep trees for Bagging (low bias, high variance) and shallow trees for Boosting (high bias, low variance). Other base learners, such as logistic regression, can also be tried depending on the problem.
* Step 4: Evaluate performance using cross-validation. We split the dataset into multiple folds and train/test on different parts. This gives us an average performance score that is more reliable than a single train-test split. Because defaults are usually much rarer than repayments, the folds should preserve the class ratio and the metric should not be plain accuracy alone. It helps us choose the best ensemble setup.
* Step 5: Justify ensemble learning in a real-world context. In loan default prediction, wrong decisions can be very costly. Ensemble methods combine multiple models, making predictions more stable and accurate. This can reduce both false approvals and false rejections, helping the financial institution make safer and smarter lending decisions.
* By carefully choosing the right ensemble method, tuning parameters, and validating results, we build a strong predictive model. Ensemble learning reduces errors and provides more reliable insights, which is crucial for making better financial decisions like predicting loan defaults.
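
The steps above can be sketched in code. No loan dataset is part of this assignment, so the example below uses `make_classification` to generate a synthetic stand-in with roughly 10% "defaults"; the feature counts, hyperparameter values, and the choice of ROC AUC as the metric are all assumptions for illustration, not results on real customer data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for loan data: about 10% of rows are "defaults" (class 1)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=1,
)

# One bagging-style and one boosting-style candidate, both with
# limited tree depth to keep overfitting in check
models = {
    "Bagging (Random Forest)": RandomForestClassifier(
        n_estimators=100, max_depth=8, class_weight="balanced", random_state=1
    ),
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1, random_state=1
    ),
}

# Stratified folds keep the default rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = scores
    print(f"{name}: mean ROC AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

ROC AUC is used here instead of accuracy because a model that predicts "no default" for everyone would already be about 90% accurate on data like this. On a real portfolio, the comparison would be repeated on the actual customer features, and the decision threshold would be chosen from the relative cost of a missed default versus a wrongly rejected applicant.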