<div style="width:100%; height:60px; background-color:aqua; display:flex">
<div style="display:flex; flex-direction:row; justify-content:center">
<h2 style="color:white"><strong>Ensemble Learning - Bagging</strong></h2>
</div>
</div>

<h4 style="color:white"><p>Bagging, also known as bootstrap aggregation, is an ensemble learning method commonly used to reduce variance within a noisy data set.</p> 

<p>In bagging, random samples of the training set are drawn with replacement, meaning that an individual data point can be chosen more than once. After several such samples are generated, a weak model is trained independently on each one. Depending on the type of task (regression or classification, for example), the average or the majority vote of those models' predictions yields a more accurate estimate.</p>

<p>As a note, the random forest algorithm is considered an extension of the bagging method, using both bagging and feature randomness to create an uncorrelated forest of decision trees.</p></h4>
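
The two ingredients described above, sampling with replacement and aggregating predictions, can be sketched in a few lines of plain Python. This is only an illustration: the helper names (`bootstrap_sample`, `aggregate_classification`, `aggregate_regression`) and the toy data are assumptions made for this sketch, not part of any library.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows, with replacement."""
    return [rng.choice(rows) for _ in rows]


def aggregate_classification(predictions):
    """Majority vote over the labels predicted by each model."""
    return Counter(predictions).most_common(1)[0][0]


def aggregate_regression(predictions):
    """Average of the values predicted by each model."""
    return mean(predictions)


rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for ten training rows
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

for sample in samples:
    # Repeated rows are expected because sampling is with replacement.
    print(sorted(sample), "unique rows:", len(set(sample)))

print(aggregate_classification([1, 0, 1]))    # → 1
print(aggregate_regression([2.0, 3.0, 4.0]))  # → 3.0
```

In a real bagging run, one model would be fit on each bootstrap sample and the aggregation functions would receive those models' predictions for a new data point.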

<h2 style="color:aqua;font-style:italic;font-weight:bold">Ensemble learning
<h5 style="color:white">
<p>Ensemble learning gives credence to the idea of the “wisdom of crowds,” which suggests that the decision-making of a larger group of people is typically better than that of an individual expert. Similarly, ensemble learning refers to a group (or ensemble) of base learners, or models, which work collectively to achieve a better final prediction.</p>

<p>A single model, also known as a base or weak learner, may not perform well individually due to high variance or high bias. However, when weak learners are aggregated, they can form a strong learner, as their combination reduces bias or variance, yielding better model performance.</p>

<p>Decision trees are frequently used to illustrate ensemble methods. An unpruned decision tree can be prone to overfitting, showing high variance and low bias. Conversely, a very small tree can underfit, with low variance and high bias; a decision stump, which is a decision tree with one level, is the extreme case.</p>

<p>Remember, when an algorithm overfits or underfits its training set, it cannot generalize well to new data sets, so ensemble methods are used to counteract this behavior and allow the model to generalize to new data. While decision trees can exhibit high variance or high bias, they are not the only modeling technique that leverages ensemble learning to find the “sweet spot” within the bias-variance tradeoff.</p>

</h5>
</h2>
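
The "wisdom of crowds" idea can be made concrete with a small calculation. The sketch below assumes an idealized setting that real ensembles only approximate: every base learner is correct with the same probability `p`, and their errors are fully independent. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent voters are correct.

    Each voter is assumed to be correct with probability p; n should be odd
    so that ties cannot occur.
    """
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the majority vote improves as learners are added, provided each one is better than chance. Models trained on overlapping bootstrap samples are correlated, so the gain seen in practice is smaller than this calculation suggests.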

<h2 style="color:aqua;font-style:italic;font-weight:bold">Bagging versus boosting
<h5 style="color:white">

<p>Bagging and boosting are two main types of ensemble learning methods. The main difference between them is how they are trained.</p>

<p>In bagging, weak learners are trained in parallel, but in boosting, they learn sequentially. A series of models is constructed, and with each new model, the weights of the data points misclassified by the previous model are increased.</p>

<p>This redistribution of weights helps the algorithm identify the examples that it needs to focus on to improve its performance. AdaBoost, which stands for “adaptive boosting,” is one of the most popular boosting algorithms, as it was one of the first of its kind. Other types of boosting algorithms include XGBoost, GradientBoost and BrownBoost.</p>

<p>Bagging and boosting also differ in the scenarios in which they are used. For example, bagging methods are typically used on weak learners that exhibit high variance and low bias, whereas boosting methods are used when low variance and high bias are observed.</p>

</h5>
</h2>
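
The reweighting step described above can be sketched with the classic discrete AdaBoost update. This is a minimal illustration of one boosting round, not a full implementation: the function name `reweight` and the five-point toy example are assumptions made for this sketch.

```python
import math


def reweight(weights, is_misclassified):
    """One AdaBoost-style weight update.

    weights must sum to 1, and the weighted error must lie strictly
    between 0 and 1 for the logarithm to be defined.
    """
    # Weighted error of the current weak learner.
    error = sum(w for w, miss in zip(weights, is_misclassified) if miss)
    # Learner weight: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase the weight of misclassified points, decrease the rest.
    updated = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, is_misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.2] * 5                          # five points, equal weight
missed = [False, False, True, False, False]  # the third point was misclassified
new_weights, alpha = reweight(weights, missed)
print(new_weights)
```

After the update, the misclassified point carries half of the total weight, so the next learner in the sequence is pushed to get it right.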

<div style="width:100%; height:auto; display:flex; flex-direction:row; justify-content:space-around;">
<img style="width:45%;height:auto; background-color:white" src="/home/ahmedunix/data_Science_Work/Machine_Learning/18_Ensemble_Learning_Bagging/bagging_1.png">
<img style="width:45%;height:auto; background-color:white" src="/home/ahmedunix/data_Science_Work/Machine_Learning/18_Ensemble_Learning_Bagging/bagging_2.png">
</div>

In [22]:
import math
import kaggle as kaggle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import tensorflow as tf, keras
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,BaggingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

In [2]:
!kaggle datasets download -d uciml/pima-indians-diabetes-database

Dataset URL: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
License(s): CC0-1.0
Downloading pima-indians-diabetes-database.zip to /home/ahmedunix/data_Science_Work/Machine_Learning/18_Ensemble_Learning_Bagging
  0%|                                               | 0.00/8.91k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 8.91k/8.91k [00:00<00:00, 12.5MB/s]


In [3]:
!unzip pima-indians-diabetes-database.zip -d pima_indians_diabetes_dataset

Archive:  pima-indians-diabetes-database.zip
  inflating: pima_indians_diabetes_dataset/diabetes.csv  


In [39]:
df = pd.read_csv("/home/ahmedunix/data_Science_Work/Machine_Learning/18_Ensemble_Learning_Bagging/pima_indians_diabetes_dataset/diabetes.csv")

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [41]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [42]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [43]:
df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [6]:
df.head(10)

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1
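
The null checks above report no missing values, yet `df.describe()` shows a minimum of 0 for columns such as Glucose, BloodPressure and BMI, where a true zero is implausible. It may be that zeros stand in for missing measurements; that is an assumption about this dataset, not something the notebook establishes. A quick way to count them, sketched here on the ten rows printed by `df.head(10)`:

```python
import pandas as pd

# First ten rows of five columns, copied from the df.head(10) output above.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
    "Insulin": [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
})

# Count zeros per column; under the assumption above these are candidate
# missing values rather than real measurements.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the full frame, the same expression applied to `df[list(sample.columns)]` would give the counts for all 768 rows.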


In [9]:
x = df.drop(columns=['Outcome'])

In [11]:
y = df.Outcome

In [14]:
scalar = StandardScaler().set_output(transform='pandas')

In [15]:
x_scaled = scalar.fit_transform(x)

In [16]:
x_scaled

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.639947,0.848324,0.149641,0.907270,-0.692891,0.204013,0.468492,1.425995
1,-0.844885,-1.123396,-0.160546,0.530902,-0.692891,-0.684422,-0.365061,-0.190672
2,1.233880,1.943724,-0.263941,-1.288212,-0.692891,-1.103255,0.604397,-0.105584
3,-0.844885,-0.998208,-0.160546,0.154533,0.123302,-0.494043,-0.920763,-1.041549
4,-1.141852,0.504055,-1.504687,0.907270,0.765836,1.409746,5.484909,-0.020496
...,...,...,...,...,...,...,...,...
763,1.827813,-0.622642,0.356432,1.722735,0.870031,0.115169,-0.908682,2.532136
764,-0.547919,0.034598,0.046245,0.405445,-0.692891,0.610154,-0.398282,-0.531023
765,0.342981,0.003301,0.149641,0.154533,0.279594,-0.735190,-0.685193,-0.275760
766,-0.844885,0.159787,-0.470732,-1.288212,-0.692891,-0.240205,-0.371101,1.170732


In [18]:
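# stratify=y keeps the class ratio the same in both splits;
# with no test_size given, 25% of the rows go to the test set.
x_train,x_test,y_train,y_test = train_test_split(x_scaled,y,stratify=y,random_state=10)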

In [19]:
x_train.shape

(576, 8)

In [20]:
x_test.shape

(192, 8)

In [21]:
y_train.value_counts()

Outcome
0    375
1    201
Name: count, dtype: int64
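
In the cells above, the scaler is fit on all 768 rows before the split, so statistics from the test rows influence the transform. For tree-based models this has little practical effect, since trees are insensitive to feature scaling, but for scale-sensitive models a common alternative is to fit the scaler on the training split only. The sketch below shows that pattern with a pipeline on synthetic data; the data from `make_classification` and the choice of `LogisticRegression` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=10
)

# The pipeline fits the scaler on X_train only, then reuses those
# training statistics when transforming X_test inside score().
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

Passing the pipeline itself to `cross_val_score` applies the same rule within every fold.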

In [24]:
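# 5-fold cross-validation on the unscaled features (tree-based models
# are insensitive to feature scaling). No random_state is set, so the
# scores vary slightly between runs.
scores = cross_val_score(RandomForestClassifier(),x,y,cv=5)
scores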

array([0.75324675, 0.73376623, 0.78571429, 0.83660131, 0.77124183])

In [25]:
scores.mean()

0.776114081996435

In [29]:
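# 100 decision trees, each fit on a sample drawn with replacement whose
# size is 80% of the training set. oob_score=True evaluates every training
# row using only the trees that did not see it during fitting.
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    oob_score=True,
    random_state=0
)

bag_model.fit(x_train,y_train)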

In [30]:
bag_model.oob_score_

0.7534722222222222
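
An out-of-bag score is available because each tree is fit on a sample drawn with replacement, so some training rows are never seen by that tree and can serve as validation data for it. The sketch below estimates how large that held-out share is for one tree. It assumes the number of draws is `int(0.8 * 576)`, which is one reading of how a float `max_samples` is interpreted; the function name `expected_oob_fraction` is made up for this illustration.

```python
import math


def expected_oob_fraction(n_rows, n_draws):
    """Chance that a given row is never picked in n_draws draws with replacement."""
    return (1 - 1 / n_rows) ** n_draws


n_rows = 576                 # size of x_train above
n_draws = int(0.8 * n_rows)  # assumed number of draws for max_samples=0.8
fraction = expected_oob_fraction(n_rows, n_draws)

print(round(fraction, 3))
print(round(math.exp(-0.8), 3))  # large-sample approximation exp(-n_draws / n_rows)
```

Under these assumptions, a bit under half of the training rows are out-of-bag for any single tree, so with 100 trees every row is very likely to have some trees that can score it.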

In [31]:
bag_model.score(x_test,y_test)

0.7760416666666666

In [32]:
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    oob_score=True,
    random_state=0
)

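# Same bagging configuration, now evaluated with 5-fold cross-validation
# on the full, unscaled feature matrix.
scores_bag = cross_val_score(bag_model,x,y,cv=5)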

In [33]:
scores_bag

array([0.75324675, 0.72727273, 0.74675325, 0.82352941, 0.73856209])

In [34]:
scores_bag.mean()

0.7578728461081402