
### Bagging in Classification Explained

Bagging, short for Bootstrap Aggregating, is an ensemble learning technique used in classification (and regression) to improve the stability and accuracy of machine learning algorithms. It reduces variance and thereby helps limit overfitting.

#### How Does Bagging Work?

1.  **Bootstrapping**: This is the 'B' in Bagging. Instead of training a single model on the entire dataset, Bagging creates multiple new training datasets by sampling *with replacement* from the original training dataset. Each new dataset, called a "bootstrap sample," will have the same number of instances as the original dataset, but some instances will be repeated, and some will be left out (these are called out-of-bag samples).
    *   **Example**: If your original dataset has 100 samples, you might create 10 bootstrap samples, each containing 100 samples drawn randomly with replacement from the original 100. Each bootstrap sample will likely be unique.

2.  **Parallel Training**: A separate base model (often called an "estimator" or "base learner") is trained independently on each of these bootstrap samples. These base models are usually of the same type (e.g., decision trees, neural networks).
    *   **Key Point**: The base models can be trained in parallel because no model's training depends on another's. This is a key difference from boosting methods, which train their models sequentially.

3.  **Aggregation (Voting for Classification)**: Once all base models are trained, they each make a prediction for a new, unseen data point. For classification tasks, Bagging typically uses a majority voting scheme:
    *   Each base model casts a vote for the class it predicts.
    *   The class that receives the most votes is chosen as the final prediction.
    *   For probabilistic predictions, the probabilities can be averaged across all models.
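
The three steps above can be sketched by hand in a few lines. This is a minimal illustration, not how `BaggingClassifier` is implemented internally; the toy dataset, the seeds, and the choice of 25 trees are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy binary dataset; sizes and seeds are arbitrary choices for illustration.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a binary majority vote cannot tie
n_train = len(X_train)

models = []
for _ in range(n_models):
    # Step 1: bootstrap sample = row indices drawn with replacement, same size as the training set.
    idx = rng.integers(0, n_train, size=n_train)
    # Step 2: train an independent base model on that sample.
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    models.append(tree)

# Step 3a: hard voting. Labels are 0/1, so summing predictions counts the votes for class 1.
all_preds = np.stack([m.predict(X_test) for m in models])  # shape (n_models, n_test)
hard_vote = (all_preds.sum(axis=0) > n_models / 2).astype(int)

# Step 3b: soft voting. Average the predicted class probabilities, then take the argmax.
avg_proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
soft_vote = avg_proba.argmax(axis=1)

print("hard-vote accuracy:", np.mean(hard_vote == y_test))
print("soft-vote accuracy:", np.mean(soft_vote == y_test))
```

With fully grown trees the leaf probabilities are usually 0 or 1, so the two aggregation rules tend to agree; they differ more when the base models output graded probabilities.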

#### Why Does Bagging Work? (Intuition)

*   **Variance Reduction**: The primary benefit of Bagging is variance reduction. Individual models, especially complex ones like deep decision trees, can be sensitive to the specific training data they see. By training on different bootstrap samples, these individual models will make different errors. When their predictions are averaged (or voted on), these errors tend to cancel each other out, leading to a more stable and robust overall prediction.
*   **Reduced Overfitting**: By reducing variance, Bagging inherently helps to prevent overfitting. While individual base models might overfit their specific bootstrap samples, the aggregated prediction is less likely to overfit the original training data.
*   **Stability**: The ensemble prediction is more stable than any single base model's prediction because it's not overly reliant on any particular subset of the data.
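
The variance-reduction argument can be made concrete with a standard result for the variance of an average. The notation is an assumption introduced here, not taken from the text above: $f_b(x)$ is the prediction of the $b$-th of $B$ base models, each with variance $\sigma^2$, and $\rho$ is the pairwise correlation between any two of them.

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more models shrinks only the second term, so the ensemble's variance cannot fall below $\rho\,\sigma^2$. This is why it matters that the base models make different errors: the less correlated they are, the more the averaging helps. The formula applies directly to averaged outputs such as predicted probabilities; for hard majority voting it is best read as intuition rather than an exact statement.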

#### Characteristics of Bagging:

*   **Base Estimators**: Bagging is most effective with "strong" and "unstable" learners, meaning models that have low bias but high variance (e.g., unpruned decision trees). If base learners are too simple (high bias), Bagging might not offer significant improvements.
*   **Parallelization**: Training of base models can be parallelized across CPU cores, which shortens wall-clock training time.
*   **Out-of-Bag (OOB) Estimation**: When each bootstrap sample is as large as the original training set, it leaves out about 37% of the training instances on average (the fraction approaches 1/e ≈ 0.368); smaller samples leave out more. These "out-of-bag" instances can be used to estimate the model's performance without setting aside a separate validation set.
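
The 37% figure for out-of-bag instances can be checked numerically. The sketch below assumes the bootstrap sample has the same size `n` as the dataset; the value of `n` and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # illustrative dataset size

# Probability that one given instance is never picked in n draws with replacement.
theoretical = (1 - 1 / n) ** n

# Empirical check: draw one bootstrap sample and count the instances that never appear.
idx = rng.integers(0, n, size=n)
empirical = 1 - len(np.unique(idx)) / n

print(f"theoretical OOB fraction: {theoretical:.4f}")
print(f"empirical OOB fraction:   {empirical:.4f}")
print(f"limit 1/e:                {np.exp(-1):.4f}")
```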

#### Example in Classification: Random Forest

The most prominent example of Bagging in classification is the **Random Forest** algorithm. Random Forest extends Bagging by adding another layer of randomness:

1.  **Bootstrapping**: As in standard Bagging, multiple bootstrap samples are created.
2.  **Feature Randomness**: When growing each decision tree in the forest, at each split point, only a random subset of features is considered. This further decorrelates the individual trees, making them less prone to making the same errors.
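3.  **Aggregation**: Predictions are combined via majority voting. (scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting hard votes.)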

This additional randomness in feature selection makes Random Forests even more robust against overfitting and generally improves performance compared to simple Bagging with decision trees.

#### Advantages of Bagging:

*   Reduces variance and helps limit overfitting.
*   Improves model stability and accuracy.
*   Can be parallelized, leading to faster training times.
*   Provides out-of-bag error estimation.

#### Disadvantages of Bagging:

*   Can be computationally expensive if the base models are complex and many models are used.
*   Loss of interpretability compared to a single model.

In summary, Bagging is a powerful and widely used ensemble technique that leverages the principle of diversity (through bootstrapping) to create a more robust and accurate classifier by combining the predictions of multiple base models.

In [1]:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

In [2]:
# No random_state is set here, so the data (and every accuracy below) changes from run to run.
X, y = make_classification(n_samples=10000, n_features=10, n_informative=3)

In [3]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [4]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)

print("Decision Tree accuracy",accuracy_score(y_test,y_pred))

Decision Tree accuracy 0.8965


# Bagging

In [6]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.5,
    bootstrap=True,
    random_state=42
)

In [7]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)

print("Bagging Classifier accuracy",accuracy_score(y_test,y_pred))

Bagging Classifier accuracy 0.926


In [8]:
bag.estimators_samples_[0].shape

(4000,)

In [9]:
bag.estimators_features_[0].shape

(10,)

# Bagging using SVM

In [11]:
bag = BaggingClassifier(
    estimator=SVC(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    random_state=42
)

In [12]:
bag.fit(X_train, y_train)
y_pred = bag.predict(X_test)

print("Bagging with SVM accuracy", accuracy_score(y_test, y_pred))

Bagging with SVM accuracy 0.8985


# Pasting

In [14]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=False,
    random_state=42,
    verbose = 1,
    n_jobs=-1
)

In [15]:
bag.fit(X_train, y_train)
y_pred = bag.predict(X_test)

print("Pasting Classifier accuracy", accuracy_score(y_test, y_pred))

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:   15.7s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


Pasting Classifier accuracy 0.924


[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.5s finished


# Random Subspaces

In [16]:
bag=BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=1.0,
    bootstrap=False,
    max_features=0.5,
    bootstrap_features=True,
    random_state=42
)

In [17]:
bag.fit(X_train, y_train)
y_pred = bag.predict(X_test)

print("Random Subspaces Classifier accuracy", accuracy_score(y_test, y_pred))

Random Subspaces Classifier accuracy 0.9125


In [18]:
bag.estimators_samples_[0].shape

(8000,)

In [19]:
bag.estimators_features_[0].shape

(5,)

# Random Patches

In [20]:
bag=BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    max_features=0.5,
    bootstrap_features=True,
    random_state=42
)

In [21]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Random Patches classifier",accuracy_score(y_test,y_pred))

Random Patches classifier 0.909


# OOB Score

In [22]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    oob_score=True,
    random_state=42
)

In [23]:
bag.fit(X_train, y_train)

In [24]:
bag.oob_score_

0.92625

In [25]:
y_pred = bag.predict(X_test)
print("Accuracy",accuracy_score(y_test,y_pred))

Accuracy 0.922


# Applying GridSearchCV

In [26]:
from sklearn.model_selection import GridSearchCV

In [27]:
parameters = {
    'n_estimators': [50,100,500],
    'max_samples': [0.1,0.4,0.7,1.0],
    'bootstrap' : [True,False],
    'max_features' : [0.1,0.4,0.7,1.0]
    }

In [28]:
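# 96 parameter combinations x 5 folds = 480 fits (plus a final refit), so this search is slow.
# n_jobs=-1 spreads the fits across all available CPU cores.
search = GridSearchCV(BaggingClassifier(), parameters, cv=5, n_jobs=-1)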

In [30]:
search.fit(X_train,y_train)

KeyboardInterrupt: 

In [None]:
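# A notebook cell displays only its last expression, so print both values explicitly.
print(search.best_params_)
print(search.best_score_)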
