Theory

1. Can we use Bagging for regression problems?
   Yes. Bagging applies to regression as well as classification: each base regressor (for example, DecisionTreeRegressor) is trained on its own bootstrap sample, and the ensemble (for example, scikit-learn's BaggingRegressor) averages their predictions.

2. What is the difference between multiple model training and single model training?
   Single model training involves training one model on the dataset, while multiple model training (ensemble) combines the predictions of several models to improve accuracy and reduce overfitting.

3. Explain the concept of feature randomness in Random Forest.
   Random Forest introduces feature randomness by considering only a random subset of features as split candidates at each node. This makes the trees differ from one another and reduces the correlation between their predictions.

4. What is OOB (Out-of-Bag) Score?
   The OOB score is an internal validation estimate for Bagging and Random Forest. Each training sample is scored only by the estimators whose bootstrap sample did not contain it, so performance can be estimated without a separate validation set.

5. How can you measure the importance of features in a Random Forest model?
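   Feature importance can be measured by how much each feature decreases impurity across the forest (exposed in scikit-learn as feature_importances_) or by permutation importance, which measures how much the model's score drops when a feature's values are shuffled.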

6. Explain the working principle of a Bagging Classifier.
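   A Bagging Classifier trains multiple base estimators (usually Decision Trees) on different bootstrap samples and aggregates their predictions via majority vote. In scikit-learn, BaggingClassifier averages the predicted class probabilities when the base estimator provides them and falls back to voting otherwise.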

7. How do you evaluate a Bagging Classifier’s performance?
   You can evaluate it using metrics like accuracy, precision, recall, F1-score, or AUC on a test set.

8. How does a Bagging Regressor work?
   It trains multiple regressors on different bootstrap samples and averages their predictions to get the final output.

9. What is the main advantage of ensemble techniques?
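   Ensemble techniques improve generalization by combining multiple models: averaging independent models reduces variance (as in Bagging), while sequentially correcting errors reduces bias (as in Boosting).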

10. What is the main challenge of ensemble methods?
    They can be computationally expensive and may lose interpretability.

11. Explain the key idea behind ensemble techniques.
    The key idea is to combine multiple weak learners to form a strong learner with improved accuracy and robustness.

12. What is a Random Forest Classifier?
    A Random Forest Classifier is an ensemble of Decision Trees using bagging and feature randomness to classify data.

13. What are the main types of ensemble techniques?
    Bagging, Boosting, and Stacking are the main types of ensemble techniques.

14. What is ensemble learning in machine learning?
    Ensemble learning combines predictions from multiple models to improve performance.

15. When should we avoid using ensemble methods?
    Avoid them when interpretability is a key requirement or when computational resources are limited.

16. How does Bagging help in reducing overfitting?
    Bagging reduces overfitting by averaging predictions from multiple models, thus smoothing out noise.

17. Why is Random Forest better than a single Decision Tree?
    Random Forest reduces overfitting and increases accuracy by averaging the predictions of multiple trees.

18. What is the role of bootstrap sampling in Bagging?
    Bootstrap sampling creates diverse training datasets by sampling with replacement, promoting model diversity.

19. What are some real-world applications of ensemble techniques?
    Applications include fraud detection, recommendation systems, medical diagnosis, and spam filtering.

20. What is the difference between Bagging and Boosting?
    Bagging builds models independently in parallel and reduces variance; Boosting builds models sequentially and reduces bias.
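
Answers 4 and 18 both rest on bootstrap sampling. The following is a minimal sketch, using only the Python standard library, of how sampling with replacement leaves some samples "out of bag". The function names and the sample count of 569 (the size of the Breast Cancer dataset used later) are chosen for illustration only.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def oob_fraction(n_samples, n_rounds=200, seed=42):
    """Average fraction of samples that a bootstrap sample leaves out."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        in_bag = set(bootstrap_indices(n_samples, rng))
        total += (n_samples - len(in_bag)) / n_samples
    return total / n_rounds


if __name__ == "__main__":
    n = 569
    print(f"Simulated out-of-bag fraction: {oob_fraction(n):.3f}")
    # Probability that a given sample is never drawn in n draws.
    print(f"Theoretical (1 - 1/n)^n:       {(1 - 1 / n) ** n:.3f}")
    print(f"Large-n limit 1/e:             {math.exp(-1):.3f}")
```

Roughly a third of the training samples are left out of each bootstrap sample, which is what makes the OOB score possible without a held-out set.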

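Answers 6 and 8 describe how the base estimators' outputs are combined. A minimal sketch of the two aggregation rules, assuming the per-model predictions are already available as plain lists (the helper names are hypothetical and this is hard voting, not scikit-learn's internal implementation):

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most common label among the base estimators' predictions."""
    return Counter(labels).most_common(1)[0][0]


def aggregate_classification(per_model_predictions):
    """per_model_predictions[m][i] is the label model m predicts for sample i."""
    return [majority_vote(column) for column in zip(*per_model_predictions)]


def aggregate_regression(per_model_predictions):
    """Average the models' numeric predictions sample by sample."""
    return [mean(column) for column in zip(*per_model_predictions)]


if __name__ == "__main__":
    # Three models, four samples (made-up predictions for illustration).
    clf_preds = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
    print(aggregate_classification(clf_preds))  # [0, 1, 0, 0]

    # Three models, two samples.
    reg_preds = [[10.0, 20.0], [14.0, 22.0], [12.0, 24.0]]
    print(aggregate_regression(reg_preds))  # [12.0, 22.0]
```

Using an odd number of models avoids ties in binary majority voting.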

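Answer 5 mentions two ways to measure feature importance. The sketch below compares them side by side, assuming scikit-learn's permutation_importance from sklearn.inspection and the same Breast Cancer dataset used in the code cell. The function name, n_repeats=5, and the choice to permute on the test split are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def top_features(n_top=5, seed=42):
    """Return (name, impurity importance, permutation importance) tuples."""
    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, random_state=seed
    )
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_train, y_train)

    # Impurity-based importance, computed from the training data.
    mdi = rf.feature_importances_
    # Permutation importance: score drop when a feature is shuffled.
    perm = permutation_importance(
        rf, X_test, y_test, n_repeats=5, random_state=seed
    )

    order = perm.importances_mean.argsort()[::-1][:n_top]
    return [
        (str(data.feature_names[i]), float(mdi[i]), float(perm.importances_mean[i]))
        for i in order
    ]


if __name__ == "__main__":
    for name, mdi_value, perm_value in top_features():
        print(f"{name:<25} impurity={mdi_value:.4f} permutation={perm_value:.4f}")
```

The two rankings need not agree: impurity-based importance is derived from the training data, while permutation importance here reflects the effect on held-out accuracy.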


import numpy as np
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score


# Q1: Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy
def q1_bagging_classifier_decision_tree():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    bag_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
    bag_clf.fit(X_train, y_train)
    y_pred = bag_clf.predict(X_test)
    print("Q1 - Bagging Classifier Accuracy:", accuracy_score(y_test, y_pred))


# Q2: Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)
def q2_bagging_regressor_decision_tree():
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    bag_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=42)
    bag_reg.fit(X_train, y_train)
    y_pred = bag_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print("Q2 - Bagging Regressor MSE:", mse)


# Q3: Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores
def q3_random_forest_classifier_feature_importance():
    X, y = load_breast_cancer(return_X_y=True)
    rf_clf = RandomForestClassifier(random_state=42)
    rf_clf.fit(X, y)
    importances = rf_clf.feature_importances_
    print("Q3 - Random Forest Feature Importances:")
    for i, imp in enumerate(importances):
        print(f" Feature {i}: {imp:.4f}")


# Q4: Train a Random Forest Regressor and compare its performance with a single Decision Tree
def q4_random_forest_regressor_vs_decision_tree():
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    dt_reg = DecisionTreeRegressor(random_state=42)
    dt_reg.fit(X_train, y_train)
    dt_pred = dt_reg.predict(X_test)
    dt_mse = mean_squared_error(y_test, dt_pred)

    rf_reg = RandomForestRegressor(random_state=42)
    rf_reg.fit(X_train, y_train)
    rf_pred = rf_reg.predict(X_test)
    rf_mse = mean_squared_error(y_test, rf_pred)

    print(f"Q4 - Decision Tree Regressor MSE: {dt_mse:.4f}")
    print(f"Q4 - Random Forest Regressor MSE: {rf_mse:.4f}")


# Q5: Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier
def q5_random_forest_oob_score():
    X, y = load_breast_cancer(return_X_y=True)
    rf_clf = RandomForestClassifier(oob_score=True, random_state=42, n_estimators=100)
    rf_clf.fit(X, y)
    print("Q5 - Random Forest OOB Score:", rf_clf.oob_score_)


# Q6: Train a Bagging Classifier using SVM as a base estimator and print accuracy
def q6_bagging_classifier_svm():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    bag_svm = BaggingClassifier(estimator=SVC(probability=True), n_estimators=10, random_state=42)
    bag_svm.fit(X_train, y_train)
    y_pred = bag_svm.predict(X_test)
    print("Q6 - Bagging Classifier with SVM Accuracy:", accuracy_score(y_test, y_pred))


# Q7: Train a Random Forest Classifier with different numbers of trees and compare accuracy
def q7_random_forest_varying_trees():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    for n in [10, 50, 100, 200]:
        rf = RandomForestClassifier(n_estimators=n, random_state=42)
        rf.fit(X_train, y_train)
        acc = accuracy_score(y_test, rf.predict(X_test))
        print(f"Q7 - Random Forest with {n} trees Accuracy: {acc:.4f}")


# Q8: Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score
def q8_bagging_classifier_logistic_regression_auc():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    bag_lr = BaggingClassifier(estimator=LogisticRegression(max_iter=1000), n_estimators=20, random_state=42)
    bag_lr.fit(X_train, y_train)
    y_prob = bag_lr.predict_proba(X_test)[:, 1]
    auc_score = roc_auc_score(y_test, y_prob)
    print("Q8 - Bagging Classifier with Logistic Regression AUC:", auc_score)


# Q9: Train a Random Forest Regressor and analyze feature importance scores
def q9_random_forest_regressor_feature_importance():
    X, y = load_diabetes(return_X_y=True)
    rf_reg = RandomForestRegressor(random_state=42)
    rf_reg.fit(X, y)
    print("Q9 - Random Forest Regressor Feature Importances:")
    for i, imp in enumerate(rf_reg.feature_importances_):
        print(f" Feature {i}: {imp:.4f}")


# Q10: Train an ensemble model using both Bagging and Random Forest and compare accuracy
def q10_ensemble_bagging_random_forest_comparison():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
    bag.fit(X_train, y_train)
    rf = RandomForestClassifier(n_estimators=50, random_state=42)
    rf.fit(X_train, y_train)
    bag_acc = accuracy_score(y_test, bag.predict(X_test))
    rf_acc = accuracy_score(y_test, rf.predict(X_test))
    print(f"Q10 - Bagging Classifier Accuracy: {bag_acc:.4f}")
    print(f"Q10 - Random Forest Classifier Accuracy: {rf_acc:.4f}")


if __name__ == "__main__":
    q1_bagging_classifier_decision_tree()
    print()
    q2_bagging_regressor_decision_tree()
    print()
    q3_random_forest_classifier_feature_importance()
    print()
    q4_random_forest_regressor_vs_decision_tree()
    print()
    q5_random_forest_oob_score()
    print()
    q6_bagging_classifier_svm()
    print()
    q7_random_forest_varying_trees()
    print()
    q8_bagging_classifier_logistic_regression_auc()
    print()
    q9_random_forest_regressor_feature_importance()
    print()
    q10_ensemble_bagging_random_forest_comparison()


Q1 - Bagging Classifier Accuracy: 0.958041958041958

Q2 - Bagging Regressor MSE: 3096.3144324324335

Q3 - Random Forest Feature Importances:
 Feature 0: 0.0348
 Feature 1: 0.0152
 Feature 2: 0.0680
 Feature 3: 0.0605
 Feature 4: 0.0080
 Feature 5: 0.0116
 Feature 6: 0.0669
 Feature 7: 0.1070
 Feature 8: 0.0034
 Feature 9: 0.0026
 Feature 10: 0.0143
 Feature 11: 0.0037
 Feature 12: 0.0101
 Feature 13: 0.0296
 Feature 14: 0.0047
 Feature 15: 0.0056
 Feature 16: 0.0058
 Feature 17: 0.0038
 Feature 18: 0.0035
 Feature 19: 0.0059
 Feature 20: 0.0828
 Feature 21: 0.0175
 Feature 22: 0.0808
 Feature 23: 0.1394
 Feature 24: 0.0122
 Feature 25: 0.0199
 Feature 26: 0.0373
 Feature 27: 0.1322
 Feature 28: 0.0082
 Feature 29: 0.0045

Q4 - Decision Tree Regressor MSE: 5941.7027
Q4 - Random Forest Regressor MSE: 3010.1126

Q5 - Random Forest OOB Score: 0.961335676625659

Q6 - Bagging Classifier with SVM Accuracy: 0.951048951048951

Q7 - Random Forest with 10 trees Accuracy: 0.9510
Q7 - Random Forest

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Q8 - Bagging Classifier with Logistic Regression AUC: 0.997919267582189

Q9 - Random Forest Regressor Feature Importances:
 Feature 0: 0.0575
 Feature 1: 0.0119
 Feature 2: 0.2762
 Feature 3: 0.0871
 Feature 4: 0.0473
 Feature 5: 0.0554
 Feature 6: 0.0512
 Feature 7: 0.0271
 Feature 8: 0.3156
 Feature 9: 0.0708

Q10 - Bagging Classifier Accuracy: 0.9580
Q10 - Random Forest Classifier Accuracy: 0.9720
