**Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**

Answer:- Ensemble Learning is a machine learning technique in which multiple individual models (called base learners, or weak learners when each is only slightly better than chance) are trained and combined to solve the same prediction problem. The key idea is that a group of diverse models can produce more accurate and more stable predictions than any single model. Each model makes some errors, but when the predictions are aggregated by voting (classification) or averaging (regression), errors that are not strongly correlated across models tend to cancel out. Depending on the method, this mainly reduces variance (as in bagging, which curbs overfitting) or bias (as in boosting, which curbs underfitting). Ensemble learning relies on different models capturing different patterns in the data, so its benefit depends on the models being both reasonably accurate and diverse. By combining their strengths, ensemble methods typically achieve higher accuracy and better generalization on unseen data, which is why they are widely used in practical machine learning applications.
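
The error-cancelling effect of voting can be illustrated with a small simulation. This is a minimal sketch under a strong assumption: every model is an independent binary classifier that is correct with the same probability (0.6 here). The function name `majority_vote_accuracy` and all the numbers are hypothetical choices for illustration; real models are correlated, so the gain in practice is smaller.

```python
import random

def majority_vote_accuracy(n_models, p_correct, n_trials=20000, seed=0):
    """Estimate how often a majority vote of independent models is correct."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each model is independently correct with probability p_correct
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(1, 0.6)     # one model on its own
ensemble = majority_vote_accuracy(25, 0.6)  # majority vote of 25 models

print(f"single model:  {single:.3f}")
print(f"25-model vote: {ensemble:.3f}")
```

An odd number of models is used so that a binary vote can never tie.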


**Question 2: What is the difference between Bagging and Boosting?**

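Answer:- Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques, but they differ in how models are trained and combined. In Bagging, multiple models are trained independently (and can therefore be trained in parallel) on different random samples of the training data, created using bootstrapping. The main goal of bagging is to reduce variance and limit overfitting, and the final prediction is obtained by averaging (regression) or majority voting (classification), with every model given equal weight. In contrast, Boosting trains models sequentially, where each new model focuses on the errors of the previous ones: AdaBoost increases the weights of misclassified data points, while gradient boosting fits each new model to the remaining residual errors. Boosting mainly aims to reduce bias by combining many weak learners into a strong learner, and the final prediction is a weighted combination of the models. Because it keeps concentrating on hard examples, boosting is generally more sensitive to noisy data and outliers than bagging.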
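
The two strategies can be tried side by side with scikit-learn. The sketch below is illustrative only: the dataset (Breast Cancer), the number of estimators, and the choice of AdaBoost as the boosting method are assumptions, not part of the question. `BaggingClassifier` and `AdaBoostClassifier` are left on their default base learners (a full decision tree and a depth-1 decision stump, respectively).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: independent full-depth trees on bootstrap samples, equal-weight vote
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: decision stumps trained one after another, each reweighting
# the samples the previous stumps misclassified
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

bagging_cv = cross_val_score(bagging, X, y, cv=5).mean()
boosting_cv = cross_val_score(boosting, X, y, cv=5).mean()

print("Bagging  mean CV accuracy:", round(bagging_cv, 4))
print("Boosting mean CV accuracy:", round(boosting_cv, 4))
```

Which of the two scores higher depends on the dataset and settings, so the comparison should be read as a demonstration of the APIs rather than a general ranking.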

**Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

Answer:- Bootstrap sampling is a statistical resampling technique in which multiple new training datasets are created by randomly sampling data points from the original dataset with replacement. As a result, some observations may appear multiple times in a bootstrap sample, while others may not appear at all. In Bagging methods like Random Forest, bootstrap sampling plays a crucial role by providing different subsets of data to train each decision tree. This introduces diversity among the trees, ensuring that they do not all learn the same patterns. By training each tree on a different bootstrap sample and then combining their predictions through majority voting or averaging, Random Forest reduces variance, improves model stability, and enhances overall prediction accuracy.
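
A bootstrap sample is easy to draw by hand. The sketch below uses only the standard library and 20 hypothetical observation indices; the seed and the dataset size are arbitrary choices for illustration.

```python
import random
from collections import Counter

rng = random.Random(42)
data = list(range(20))  # 20 hypothetical observation indices

# Bootstrap sample: same size as the data, drawn with replacement
sample = rng.choices(data, k=len(data))

counts = Counter(sample)
repeated = sorted(i for i, c in counts.items() if c > 1)
left_out = sorted(set(data) - set(sample))

print("bootstrap sample:", sample)
print("drawn more than once:", repeated)
print("never drawn:", left_out)
```

Repeating the draw with a different seed gives a different sample, which is the source of diversity among the trees in a Random Forest.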

**Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

Answer:- Out-of-Bag (OOB) samples are the data points that are not selected during bootstrap sampling when training an individual model in bagging-based ensemble methods such as Random Forest. Because a bootstrap sample of size n is drawn with replacement, each data point is missed with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n. On average, therefore, about 63% of the distinct original observations appear in a given model's training sample, while the remaining 37% or so form the OOB samples for that model. The OOB score is calculated by using these unused samples to test the model’s performance. Each data point is predicted using only the models for which it was an OOB sample, the predictions are aggregated, and the aggregated predictions are compared with the true labels (accuracy for classification, R² for regression in scikit-learn). This gives a nearly unbiased, often slightly conservative, estimate of generalization performance without setting aside a separate validation set, making OOB scoring an efficient evaluation method for bagging-based ensembles.
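
Both the expected in-bag fraction and the OOB score can be checked directly. This is a minimal sketch that assumes the Breast Cancer dataset and 200 trees purely for illustration; `oob_score=True` is the scikit-learn option that makes `RandomForestClassifier` compute `oob_score_` during fitting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
n = len(X)

# Probability that a given point is drawn at least once in n draws with replacement
in_bag_fraction = 1 - (1 - 1 / n) ** n
print(f"expected in-bag fraction: {in_bag_fraction:.3f}")
print(f"expected OOB fraction:    {1 - in_bag_fraction:.3f}")

# oob_score=True evaluates each point using only the trees that did not train on it
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.4f}")
```

Using a few hundred trees makes it very likely that every observation is out-of-bag for at least some trees, so each one receives an OOB prediction.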

**Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

Answer:- Feature importance analysis differs significantly between a single Decision Tree and a Random Forest. In a single Decision Tree, feature importance is calculated from how much each feature reduces impurity (such as Gini index or entropy) at the splits within that one tree, weighted by the number of samples reaching each split. Since the model relies on a single structure, the importance values can be unstable and highly sensitive to small changes in the data: when two features are correlated, the tree may split on only one of them and assign the other an importance near zero. In contrast, a Random Forest computes feature importance by averaging the impurity-based importance of each feature across many decision trees, each trained on a different bootstrap sample and considering a random feature subset at each split. This averaging makes the importance scores more stable and spreads credit across correlated features, giving a better picture of each feature's influence. Note that impurity-based importance is biased toward features with many distinct values in both models, because averaging reduces variance but not this bias.
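
The stability difference can be measured by refitting both models on several bootstrap resamples of the same data and looking at how much the importance scores move. The sketch below is one possible way to do this, assuming the Breast Cancer dataset, 5 resamples, and the standard deviation across resamples as the (hypothetical) measure of spread.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

tree_runs, forest_runs = [], []
for seed in range(5):
    # Perturb the training data slightly by drawing a bootstrap resample
    Xb, yb = resample(X, y, random_state=seed)
    tree = DecisionTreeClassifier(random_state=0).fit(Xb, yb)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xb, yb)
    tree_runs.append(tree.feature_importances_)
    forest_runs.append(forest.feature_importances_)

# Standard deviation of each feature's importance across resamples, averaged over features
tree_spread = np.std(tree_runs, axis=0).mean()
forest_spread = np.std(forest_runs, axis=0).mean()

print("single tree   mean std of importances:", round(float(tree_spread), 4))
print("random forest mean std of importances:", round(float(forest_spread), 4))
```

A smaller spread means the importance ranking changes less when the training data changes.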


**Question 6: Write a Python program to: ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer() ● Train a Random Forest Classifier ● Print the top 5 most important features based on feature importance scores. Include Python code and output**

In [1]:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

importances = rf.feature_importances_

feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

feature_importance_df = feature_importance_df.sort_values(
    by='Importance', ascending=False
)

print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


**Question 7: Write a Python program to: ● Train a Bagging Classifier using Decision Trees on the Iris dataset ● Evaluate its accuracy and compare with a single Decision Tree. Include Python code and output.**


In [4]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Decision Tree
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

# Bagging Classifier
# scikit-learn 1.2+ names this parameter `estimator`; older releases call it `base_estimator`
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=10
)
bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

# Output
print("Decision Tree Accuracy:", dt_acc)
print("Bagging Accuracy:", bag_acc)

**Question 8: Write a Python program to: ● Train a Random Forest Classifier ● Tune hyperparameters max_depth and n_estimators using GridSearchCV ● Print the best parameters and final accuracy. Include your Python code and output.**

In [5]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

# Train model
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_

# Make predictions
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy:", accuracy)


Best Parameters: {'max_depth': None, 'n_estimators': 150}
Final Accuracy: 1.0


**Question 9: Write a Python program to: ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset ● Compare their Mean Squared Errors (MSE). Include your Python code and output.**

In [6]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -------------------------
# Bagging Regressor
# -------------------------
# scikit-learn 1.2+ names this parameter `estimator`; older releases call it `base_estimator`
bagging_model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)

bagging_model.fit(X_train, y_train)
bagging_predictions = bagging_model.predict(X_test)

bagging_mse = mean_squared_error(y_test, bagging_predictions)

# -------------------------
# Random Forest Regressor
# -------------------------
rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)

rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

rf_mse = mean_squared_error(y_test, rf_predictions)

# Print results
print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)

**Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to: ● Choose between Bagging or Boosting ● Handle overfitting ● Select base models ● Evaluate performance using cross-validation ● Justify how ensemble learning improves decision-making in this real-world context. Include Python code and output**

In [8]:
# Import libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Create synthetic loan default dataset
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    random_state=42
)

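# Train-test split (test_size=0.2 and random_state=42 are assumed values)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Boosting model (these hyperparameters are illustrative assumptions)
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)

# 5-fold cross-validation on the training set
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print("Mean CV Accuracy:", cv_scores.mean())

# Fit on the full training set and evaluate on the held-out test set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))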