# Bagging Exercise

In this exercise, you will explore bagging (bootstrap aggregating) and apply it with a random forest and with scikit-learn's `BaggingClassifier`. Bagging is an ensemble technique used mainly to reduce the variance of a predictive model and limit overfitting. The idea is to train multiple learners on random resamples of the training data and combine their predictions, so that the ensemble generalizes better than a single model typically would.
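
To make the idea concrete, here is a minimal sketch of bagging done by hand. It is an illustration only, not how scikit-learn implements it internally. The helper name `bagging_predict`, the choice of decision trees as base learners, and the values `n_estimators=25` and `seed=0` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def bagging_predict(X_train, y_train, X_test, n_estimators=25, seed=0):
    """Train trees on bootstrap samples and combine them by majority vote."""
    rng = np.random.default_rng(seed)
    n_samples = len(X_train)
    n_classes = int(y_train.max()) + 1  # assumes labels are 0..k-1, as in Iris
    votes = np.empty((n_estimators, len(X_test)), dtype=int)

    for i in range(n_estimators):
        # Bootstrap sample: draw n_samples row indices with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        tree = DecisionTreeClassifier(random_state=i)
        tree.fit(X_train[idx], y_train[idx])
        votes[i] = tree.predict(X_test)

    # Aggregate: for each test row, pick the most frequently predicted class
    return np.array([np.bincount(col, minlength=n_classes).argmax() for col in votes.T])


# Separate variable names so the notebook's own X and y are not overwritten
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
y_hat = bagging_predict(X_tr, y_tr, X_te)
print(f"Manual bagging accuracy: {(y_hat == y_te).mean():.2f}")
```

Each tree sees a slightly different resample of the training set, which is what lets the vote average out the variance of the individual trees.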

## Dataset
We will use the Iris dataset for this exercise. It is a classic machine learning dataset containing 150 samples, each with four measurements of an iris flower from one of three species. **Feel free to use another dataset!**

## Task
Your task is to:
1. Load the dataset.
2. Preprocess the data (if necessary).
3. Implement Bagging models.
4. Evaluate the models' performance.

Please fill in the following code blocks to complete the exercise.


In [49]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Load the dataset


In [8]:
data = load_iris()
X = data.data
y = data.target

# Preprocess the data (if necessary)

- After inspecting the Iris dataset, no preprocessing is needed:

    * No null values
    * No encoding is needed (all features are numeric)

- Preprocessing might be required if a different dataset is used.
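
The two checks listed above could be run as follows. This is a small sketch that assumes the dataset is loaded as a pandas DataFrame through `load_iris(as_frame=True)`; the variable names `iris_df` and `non_numeric` are illustrative.

```python
from sklearn.datasets import load_iris

iris_df = load_iris(as_frame=True).frame  # four feature columns plus "target"

# Count missing values per column; all zeros means nothing to impute
print(iris_df.isnull().sum())

# List non-numeric columns; an empty list means no encoding is required
non_numeric = iris_df.select_dtypes(exclude="number").columns.tolist()
print("Non-numeric columns:", non_numeric)
```

For another dataset, a non-zero null count or a non-empty `non_numeric` list would indicate that imputation or encoding is needed before training.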

In [11]:
data

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [10]:
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [15]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [16]:
X = pd.DataFrame(X, columns=data.feature_names)
y = pd.DataFrame(y, columns=["target"])

In [14]:
X.head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2


In [21]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


In [17]:
y.head()

   target
0       0
1       0
2       0
3       0
4       0


In [19]:
y.value_counts()

target  count
0          50
1          50
2          50


# Split the Dataset

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and Train the Classifiers

## Random Forest
Initialize and train a Random Forest classifier.

In [23]:
# Number of estimators
n_est = 50

rfc = RandomForestClassifier(n_estimators=n_est)

# Fit the model; pass the target as a 1-D Series rather than a single-column
# DataFrame to avoid scikit-learn's DataConversionWarning
rfc.fit(X_train, y_train["target"])


### Evaluate the model performance

In [51]:
# Get test prediction
y_pred = rfc.predict(X_test)

# Calculate the accuracy, confusion matrix, and classification report
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cls_rep = classification_report(y_test, y_pred)
print(f'Random Forest Model Accuracy with {n_est} estimators: {accuracy * 100:.2f}%')

print(f'Random Forest Model Confusion matrix with {n_est} estimators:')
print(cm)

print(f'Random Forest Model Classification report with {n_est} estimators:')
print(cls_rep)




Random Forest Model Accuracy with 50 estimators: 100.00%
Random Forest Model Confusion matrix with 50 estimators:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
Random Forest Model Classification report with 50 estimators:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



## Bagging Meta-estimator
Initialize a K-Nearest Neighbors classifier and use it as the base estimator for the Bagging classifier.

In [37]:
knn = KNeighborsClassifier()

# `base_estimator` was renamed to `estimator` in scikit-learn 1.2
bg = BaggingClassifier(estimator=knn, n_estimators=n_est)

# Fit the model on a 1-D target to avoid the column-vector warning
bg.fit(X_train, y_train["target"])


### Evaluate the model performance

In [53]:
# Get test prediction
y_pred = bg.predict(X_test)


# Calculate the accuracy, confusion matrix, and classification report
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cls_rep = classification_report(y_test, y_pred)
print(f'Bagging Classifier Model Accuracy: {accuracy * 100:.2f}%')

print('Bagging Classifier Model Confusion matrix:')
print(cm)

print('Bagging Classifier Model Classification report:')
print(cls_rep)



Bagging Classifier Model Accuracy: 100.00%
Bagging Classifier Model Confusion matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
Bagging Classifier Model Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
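
With only 30 test samples, a perfect score says little about how the models differ. One additional check that bagging offers is the out-of-bag (OOB) estimate, which scores each training sample using only the estimators whose bootstrap sample did not include it. The sketch below assumes a fresh split with `random_state=42`; the names `bag_oob`, `X_demo`, and `y_demo` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The base estimator is passed positionally, so this works whether the
# parameter is named `estimator` (newer releases) or `base_estimator` (older ones)
bag_oob = BaggingClassifier(KNeighborsClassifier(), n_estimators=50,
                            oob_score=True, random_state=42)
bag_oob.fit(X_tr, y_tr)

print(f"Out-of-bag accuracy: {bag_oob.oob_score_:.3f}")
print(f"Test accuracy:       {bag_oob.score(X_te, y_te):.3f}")
```

The OOB estimate requires bootstrap sampling (`bootstrap=True`, the default), so it is not available for pasting.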



## Pasting
Initialize a K-Nearest Neighbors classifier and use it as the base estimator for a Bagging classifier with pasting, where each estimator is trained on a subset drawn without replacement (`bootstrap=False`).
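
The difference between the two sampling schemes can be seen by drawing row indices directly. This is a NumPy-only sketch; the training-set size of 120 and the 70% subset mirror the settings used in this notebook, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 120                    # rows in the training set
n_draw = round(0.7 * n_train)    # subset size per estimator

# Bagging: indices drawn with replacement, so some rows repeat
bootstrap_idx = rng.choice(n_train, size=n_draw, replace=True)

# Pasting: indices drawn without replacement, so every row is distinct
pasting_idx = rng.choice(n_train, size=n_draw, replace=False)

print("Unique rows with bootstrap:", len(np.unique(bootstrap_idx)))
print("Unique rows with pasting:  ", len(np.unique(pasting_idx)))
```

Pasting always yields `n_draw` distinct rows, while a bootstrap sample of the same size usually contains fewer because of repeats.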

In [39]:
knn = KNeighborsClassifier()

# Pasting: each estimator sees 70% of the training set, drawn without replacement
bg_passting = BaggingClassifier(estimator=knn, n_estimators=n_est, max_samples=0.7, bootstrap=False)

# Fit the model on a 1-D target to avoid the column-vector warning
bg_passting.fit(X_train, y_train["target"])


### Evaluate the model performance

In [54]:
# Get test prediction
y_pred = bg_passting.predict(X_test)

# Calculate the accuracy, confusion matrix, and classification report
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cls_rep = classification_report(y_test, y_pred)
print(f'Pasting Classifier Model Accuracy: {accuracy * 100:.2f}%')

print('Pasting Classifier Model Confusion matrix:')
print(cm)

print('Pasting Classifier Model Classification report:')
print(cls_rep)




Pasting Classifier Model Accuracy: 100.00%
Pasting Classifier Model Confusion matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
Pasting Classifier Model Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



## Roughly Balanced Bagging (RBB)
Roughly Balanced Bagging trains each base classifier on a bootstrap sample in which the classes are approximately balanced, then aggregates the classifiers' predictions. Start by checking the class distribution of the training set.

* The counts below show that the training data is already nearly balanced, so no rebalancing is needed.

In [42]:
y_train.value_counts()

target  count
1          41
0          40
2          39


### But if the data is imbalanced, you can do the following:

In [41]:
! pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes)
Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Installing collected packages: imblearn
Successfully installed imblearn-0.0


In [43]:
from imblearn.over_sampling import SMOTE

sm = SMOTE()

X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

In [44]:
y_train_res.value_counts()

target  count
0          41
1          41
2          41
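
SMOTE rebalances the training set once, before any model is fit. A different route is to balance each bootstrap sample separately and train one decision tree per sample. The sketch below is a simplified take on that idea: it draws exactly the same number of rows from every class, whereas the published Roughly Balanced Bagging method lets the class sizes vary slightly between samples. All function names, the artificially imbalanced subset, and the default `n_estimators=25` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def balanced_bootstrap_indices(y, rng):
    """Draw the same number of rows, with replacement, from every class."""
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()  # size of the rarest class
    parts = [rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
             for c in classes]
    return np.concatenate(parts)


def fit_balanced_bagging(X, y, n_estimators=25, seed=0):
    """Fit one decision tree per balanced bootstrap sample."""
    X, y = np.asarray(X), np.asarray(y).ravel()
    rng = np.random.default_rng(seed)
    models = []
    for i in range(n_estimators):
        idx = balanced_bootstrap_indices(y, rng)
        models.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))
    return models


def predict_balanced_bagging(models, X):
    """Average the trees' class probabilities and return the top class."""
    X = np.asarray(X)
    proba = np.mean([m.predict_proba(X) for m in models], axis=0)
    # Every balanced sample contains all classes, so classes_ matches across trees
    return models[0].classes_[proba.argmax(axis=1)]


# Hypothetical imbalanced data: all of classes 0 and 1, but only 10 rows of class 2
X_all, y_all = load_iris(return_X_y=True)
keep = np.arange(110)
X_imb, y_imb = X_all[keep], y_all[keep]

rbb_models = fit_balanced_bagging(X_imb, y_imb)
rbb_pred = predict_balanced_bagging(rbb_models, X_all)

one_draw = balanced_bootstrap_indices(y_imb, np.random.default_rng(0))
print("Class counts in one balanced draw:", np.bincount(y_imb[one_draw]))
print(f"Accuracy on the full dataset: {(rbb_pred == y_all).mean():.2f}")
```

With this notebook's variables, the equivalent call would be `fit_balanced_bagging(X_train, y_train)`, since the helper converts DataFrames to arrays.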


## Roughly Balanced Bagging (RBB): Fit and Predict
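- Since the training data is already balanced, we can immediately fit a bagging classifier (with a K-Nearest Neighbors base estimator) on `X_train` and `y_train`. With imbalanced data, fit on the resampled `X_train_res` and `y_train_res` instead.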

In [46]:
# Base estimator
knn = KNeighborsClassifier()

rbb_bg = BaggingClassifier(estimator=knn, n_estimators=n_est)

# Fit on the original training set, which is already balanced
rbb_bg.fit(X_train, y_train["target"])


### Evaluate the model performance

In [55]:
y_pred = rbb_bg.predict(X_test)

# Calculate the accuracy, confusion matrix, and classification report
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cls_rep = classification_report(y_test, y_pred)
print(f'RBB model Accuracy: {accuracy * 100:.2f}%')

print('RBB model Confusion matrix:')
print(cm)

print('RBB model Classification report:')
print(cls_rep)


RBB model Accuracy: 100.00%
RBB model Confusion matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
RBB model Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

