<a href="https://colab.research.google.com/github/aditya301cs/Daily-Data-Science-ML/blob/main/Bagging_Ensemble_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## üìå What is Bagging?

Bagging (Bootstrap Aggregating) is an ensemble learning technique that improves
model performance by training multiple models independently on different random
subsets of the training data and aggregating their predictions.

The main goal of bagging is to **reduce variance** and **prevent overfitting**,
especially for high-variance models like decision trees.


## üîç How Bagging Works

1. Multiple bootstrap samples are created from the original dataset  
2. Each sample is generated **with replacement**
3. A base model is trained independently on each sample
4. Predictions from all models are aggregated:
   - Classification ‚Üí Majority voting
   - Regression ‚Üí Averaging

This aggregation reduces the effect of noise and outliers.


## üéØ Why Bootstrap Sampling?

Bootstrap sampling allows the same data point to appear multiple times in a subset.
As a result:
- Models see **slightly different data**
- Errors made by individual models are less correlated
- Overall model variance decreases

This makes the ensemble more robust and stable.


## ‚öñÔ∏è Bagging vs Boosting

| Feature | Bagging | Boosting |
|------|--------|---------|
| Training | Parallel | Sequential |
| Focus | Reduces variance | Reduces bias |
| Data Sampling | Random with replacement | Weighted samples |
| Model Dependency | Independent | Dependent |
| Example | Random Forest | AdaBoost |

Bagging works well with **unstable models** (e.g., decision trees),
while boosting is more effective for **stable models**.


## ‚úÖ Advantages of Bagging

- Reduces overfitting
- Decreases model variance
- Improves prediction stability
- Handles noisy data effectively
- Works well with high-variance models
- Supports parallel computation
- Simple and easy to implement
- Improves performance on imbalanced datasets


## üìÅ Dataset: Telecom Customer Churn

This dataset contains customer activity data from an Iranian telecom company.
Each row represents customer behavior over one year.

### Target Variable:
- `Churn` ‚Üí Whether the customer left the service

### Features:
- Call failures
- Subscription length
- Usage behavior
- Customer service interactions


In [14]:
# ===============================
# Core Python & Utilities
# ===============================
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# ===============================
# Data Splitting & Evaluation
# ===============================
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

# ===============================
# Preprocessing & Pipelines
# ===============================
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# ===============================
# Machine Learning Models
# ===============================
from sklearn.tree import DecisionTreeClassifier

# ===============================
# Ensemble Methods
# ===============================
from sklearn.ensemble import BaggingClassifier

# ===============================
# Visualization (Optional)
# ===============================
import matplotlib.pyplot as plt


In [2]:
import pandas as pd

customer = pd.read_csv("/content/Customer Churn.csv")
customer.head()

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,FN,FP,Churn
0,8,0,38,0,4370,71,5,17,3,1,1,30,197.64,177.876,69.764,0
1,0,0,39,0,318,5,7,4,2,1,2,25,46.035,41.4315,60.0,0
2,10,0,37,0,2453,60,359,24,3,1,1,30,1536.52,1382.868,203.652,0
3,10,0,38,0,4198,66,1,35,1,1,1,15,240.02,216.018,74.002,0
4,3,0,38,0,2393,58,2,33,1,1,1,15,145.805,131.2245,64.5805,0


## üßÆ Feature Engineering


In [17]:
# Independent variables
X = customer.drop("Churn", axis=1)

# Target variable
y = customer["Churn"]


## üîÄ Train-Test Split


In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)


## üå≤ Baseline Model: Decision Tree

We start with a single decision tree model to compare
its performance with the bagging ensemble later.


In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)

## üìä Evaluation of Single Decision Tree


In [5]:
from sklearn.metrics import classification_report

# Make prediction on the testing data
y_pred = pipeline.predict(X_test)

# Classification Report
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.96      0.97      0.96       807
           1       0.79      0.78      0.78       138

    accuracy                           0.94       945
   macro avg       0.88      0.87      0.87       945
weighted avg       0.94      0.94      0.94       945



### Observation:
- Strong performance on majority class
- Poor recall and precision on minority class
- Indicates **overfitting and high variance**


#Cross-Validation (Single Model)

In [19]:
# Evaluate the classifier using cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5)

print("Cross-validation scores:", cv_scores)
print("Mean CV accuracy:", np.mean(cv_scores))

Cross-validation scores: [0.95079365 0.93650794 0.93809524 0.94444444 0.92539683]
Mean CV accuracy: 0.9390476190476191


### Interpretation:
- High variance across folds
- Accuracy ranges from ~92% to ~95%
- Confirms instability of a single decision tree


## üì¶ Bagging Classifier

We now apply Bagging using the same pipeline
as the base estimator.


In [21]:
from sklearn.ensemble import BaggingClassifier

# Create a bagging classifier with the decision tree pipeline
bagging_classifier = BaggingClassifier(estimator=pipeline, n_estimators=50, random_state=42)

# Train the bagging classifier on the training data
bagging_classifier.fit(X_train, y_train)

## üìà Evaluation of Bagging Classifier


In [12]:
# Make prediction on the testing data
y_pred = bagging_classifier.predict(X_test)

# Classification Report
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.97      0.96      0.97       817
           1       0.78      0.82      0.80       128

    accuracy                           0.94       945
   macro avg       0.87      0.89      0.88       945
weighted avg       0.95      0.94      0.94       945



### Observation:
- Improved recall and precision for minority class
- More balanced predictions
- Reduced overfitting


# Evaluate the classifier using cross-validation


In [13]:
# Evaluate the classifier using cross-validation
cv_scores = cross_val_score(bagging_classifier, X, y, cv=5)

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {np.mean(cv_scores):.2f}")

Cross-validation scores: [0.95396825 0.95238095 0.94285714 0.96349206 0.95873016]
Mean CV accuracy: 0.95


### Interpretation:
- Lower variance across folds
- Accuracy improves from 94% ‚Üí 95%
- Bagging clearly enhances model stability


## üß† Best Practices & Tips

- Use bagging with high-variance models
- Increase `n_estimators` (100‚Äì200) for better performance
- Enable parallel processing using `n_jobs`
- Optimize base models before bagging
- Use Random Forest as a ready-made bagging approach


## üèÅ Conclusion

Bagging is a powerful ensemble technique that improves accuracy,
reduces variance, and enhances model stability.

In this notebook:
- We implemented a baseline decision tree
- Identified overfitting issues
- Applied bagging to improve performance
- Achieved higher accuracy and better minority class prediction

Bagging is widely used in real-world machine learning systems
and forms the foundation of Random Forest models.
