### Trying out Bagging

We'll use the Pima Indians Diabetes dataset to predict whether a person has diabetes from features such as blood pressure, insulin, BMI, and age. First we'll try a standalone model, then use bagging (an ensemble technique) to see whether it improves the model's performance.

In [10]:
import pandas as pd

In [11]:
df = pd.read_csv("pima_indians_diabetes.csv")
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [13]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [15]:
df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


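Looking at the descriptive statistics, we can see that the features are on very different scales: Insulin ranges from 0 to 846, for example, while DiabetesPedigreeFunction ranges from 0.078 to 2.42. We'll use StandardScaler to fix this.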

In [50]:
df.Outcome.value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64
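
The classes are imbalanced, so it helps to know what accuracy a trivial model would reach before reading the accuracy scores reported later. The short calculation below is a sketch using only the two counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
n_negative = 500  # Outcome == 0
n_positive = 268  # Outcome == 1

# A model that always predicts the majority class is correct exactly
# as often as that class occurs in the data.
baseline_accuracy = max(n_negative, n_positive) / (n_negative + n_positive)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")  # prints 0.651
```

Any model worth keeping should score clearly above this value.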

#### Creating our feature matrix

In [51]:
X = df.iloc[:, :-1]
X

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47


#### The label vector

In [18]:
y = df.Outcome
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

#### Feature Scaling

In [23]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")
X_scaled = scaler.fit_transform(X)
X_scaled.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.639947,0.848324,0.149641,0.90727,-0.692891,0.204013,0.468492,1.425995
1,-0.844885,-1.123396,-0.160546,0.530902,-0.692891,-0.684422,-0.365061,-0.190672
2,1.23388,1.943724,-0.263941,-1.288212,-0.692891,-1.103255,0.604397,-0.105584
3,-0.844885,-0.998208,-0.160546,0.154533,0.123302,-0.494043,-0.920763,-1.041549
4,-1.141852,0.504055,-1.504687,0.90727,0.765836,1.409746,5.484909,-0.020496


In [24]:
X_scaled.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,-6.476301e-17,-9.251859e-18,1.503427e-17,1.00614e-16,-3.006854e-17,2.59052e-16,2.451743e-16,1.931325e-16
std,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652
min,-1.141852,-3.783654,-3.572597,-1.288212,-0.6928906,-4.060474,-1.189553,-1.041549
25%,-0.8448851,-0.6852363,-0.3673367,-1.288212,-0.6928906,-0.5955785,-0.6889685,-0.7862862
50%,-0.2509521,-0.1218877,0.1496408,0.1545332,-0.4280622,0.0009419788,-0.3001282,-0.3608474
75%,0.6399473,0.6057709,0.5632228,0.7190857,0.4120079,0.5847705,0.4662269,0.6602056
max,3.906578,2.444478,2.734528,4.921866,6.652839,4.455807,5.883565,4.063716


The features are now on a comparable scale: each has a mean of approximately 0 and a standard deviation of approximately 1.
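
The `std` row above shows 1.000652 rather than exactly 1. This is expected: StandardScaler divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces the effect with NumPy on a synthetic column of 768 values; the data is made up for illustration and is not taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 768  # same number of rows as the dataset
x = rng.normal(loc=120.0, scale=30.0, size=n)  # synthetic stand-in for one feature

# Standardize the way StandardScaler does: subtract the mean and divide
# by the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

std_population = z.std(ddof=0)  # convention used by StandardScaler
std_sample = z.std(ddof=1)      # convention used by pandas' describe()

print(std_population)           # 1.0 up to floating-point error
print(round(std_sample, 6))     # 1.000652
print(np.sqrt(n / (n - 1)))     # the ratio between the two conventions
```

The two conventions differ by a factor of sqrt(n / (n - 1)), which for n = 768 is about 1.000652.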

In [33]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

In [34]:
X_train.shape, y_train.shape

((614, 8), (614,))

In [35]:
X_test.shape, y_test.shape

((154, 8), (154,))

In [36]:
y_train.value_counts()

Outcome
0    401
1    213
Name: count, dtype: int64
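
The split above happens to keep the class ratio close to that of the full dataset, but a plain random split does not guarantee this. One option is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes features, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in (an assumption): 768 rows, 8 features, roughly 65/35 classes.
X_demo, y_demo = make_classification(
    n_samples=768,
    n_features=8,
    n_informative=5,
    n_redundant=0,
    weights=[0.65, 0.35],
    random_state=42,
)

# stratify=y_demo keeps the class proportions nearly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Shapes:", X_tr.shape, X_te.shape)
print("Positive fraction in train:", y_tr.mean())
print("Positive fraction in test: ", y_te.mean())
```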

#### Training a simple decision tree model

Let's fit a simple decision tree on our data and see how it performs.

In [53]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = cross_val_score(DecisionTreeClassifier(), X_scaled, y, cv=5)
print("The scores on the different folds are: ", scores)
print("The mean score is: ", scores.mean())

The scores on the different folds are:  [0.68181818 0.68831169 0.69480519 0.77124183 0.73856209]
The mean score is:  0.7149477973007384


#### Trying out BaggingClassifier

Let's try a bagging classifier that trains 50 decision trees, each on a bootstrap sample of the training data, and combines their predictions.

In [59]:
from sklearn.ensemble import BaggingClassifier

model1 = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)

scores = cross_val_score(model1, X_scaled, y, cv=5)
print("The scores on the different folds are: ", scores)
print("The mean score is: ", scores.mean())

The scores on the different folds are:  [0.74675325 0.71428571 0.76623377 0.83006536 0.76470588]
The mean score is:  0.7644087938205586
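
To make the mechanism concrete, here is a minimal hand-rolled sketch of bagging: draw bootstrap samples, fit one tree per sample, and average the predicted class probabilities. It is an illustration under simplifying assumptions, not a reproduction of BaggingClassifier's internals, and it uses synthetic data from `make_classification` rather than the diabetes dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_estimators = 50
trees = []
for i in range(n_estimators):
    # Bootstrap: draw as many row indices as there are training rows,
    # with replacement, so some rows repeat and others are left out.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx]))

# Aggregate: average the per-tree class probabilities, then pick the
# most likely class. The column index equals the label because the
# classes are 0 and 1.
avg_proba = np.mean([tree.predict_proba(X_te) for tree in trees], axis=0)
bagged_pred = avg_proba.argmax(axis=1)

# Compare against a single tree trained on the full training split.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
single_acc = (single_tree.predict(X_te) == y_te).mean()
bagged_acc = (bagged_pred == y_te).mean()
print("Single tree accuracy: ", single_acc)
print("Bagged trees accuracy:", bagged_acc)
```

Averaging many trees trained on slightly different samples tends to reduce variance, which is the usual explanation for the improvement over a single deep tree.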


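The mean cross-validated accuracy rose from about 0.715 for the standalone decision tree to about 0.764 for the bagging classifier with 50 trees. Note that the standalone tree was created without a random_state, so its exact scores can vary slightly between runs.
Using a bagging classifier with a decision tree as the estimator is similar to using a RandomForestClassifier; the main difference is that a random forest also considers only a random subset of the features at each split. We can see that the two perform very similarly: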

In [66]:
from sklearn.ensemble import RandomForestClassifier

scores = cross_val_score(
    RandomForestClassifier(n_estimators=50),
    X=X_scaled,
    y=y,
    cv=5
)

print("The scores on the different folds are: ", scores)
print("The mean score is: ", scores.mean())

The scores on the different folds are:  [0.75324675 0.72077922 0.75974026 0.83006536 0.76470588]
The mean score is:  0.7657074951192598


We can also use estimators other than decision trees in our bagging classifier:

In [63]:
from sklearn.linear_model import LogisticRegression

model2 = BaggingClassifier(
    estimator=LogisticRegression(),
    n_estimators=50,
    random_state=42
)

scores = cross_val_score(model2, X_scaled, y, cv=5)
print("The scores on the different folds are: ", scores)
print("The mean score is: ", scores.mean())

The scores on the different folds are:  [0.77272727 0.74675325 0.74675325 0.81699346 0.75816993]
The mean score is:  0.7682794329853152


In [64]:
from sklearn.neighbors import KNeighborsClassifier

model3 = BaggingClassifier(
    estimator=KNeighborsClassifier(),
    n_estimators=50,
    random_state=42
)

scores = cross_val_score(model3, X_scaled, y, cv=5)
print("The scores on the different folds are: ", scores)
print("The mean score is: ", scores.mean())

The scores on the different folds are:  [0.72727273 0.73376623 0.75974026 0.78431373 0.7124183 ]
The mean score is:  0.7435022493846024


In [65]:
from sklearn.svm import SVC

model4 = BaggingClassifier(
    estimator=SVC(),
    n_estimators=50,
    random_state=42
)

scores = cross_val_score(model4, X_scaled, y, cv=5)
print("The scores on the different folds are: ", scores)
print("The mean score is: ", scores.mean())

The scores on the different folds are:  [0.75974026 0.75974026 0.75974026 0.82352941 0.77124183]
The mean score is:  0.7747984042101688
