Blending is very similar to stacking. It also uses base models to produce predictions that serve as new features, and a meta-model is trained on those features to give the final prediction. The only difference is that the meta-model is trained on a separate holdout set (e.g. 10% of train_data) rather than on out-of-fold predictions for the full training set.
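As a rough illustration of the procedure described above, here is a minimal sketch on synthetic data from `make_classification` rather than the diabetes dataset. The choice of base models, the `LogisticRegression` meta-model, the 10% holdout, the `base_features` helper, and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# 1) Set aside a test set, then carve a small holdout out of the training data.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)
X_base, X_hold, y_base, y_hold = train_test_split(
    X_tr, y_tr, test_size=0.1, stratify=y_tr, random_state=0)

# 2) Fit the base models on the base part only, never on the holdout.
base_models = [RandomForestClassifier(random_state=0), GaussianNB()]
for model in base_models:
    model.fit(X_base, y_base)


def base_features(models, X):
    """Stack each base model's positive-class probability as one column."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])


# 3) Train the meta-model on the base predictions for the holdout set.
meta_model = LogisticRegression()
meta_model.fit(base_features(base_models, X_hold), y_hold)

# 4) Final prediction: base models first, then the meta-model on their outputs.
blend_pred = meta_model.predict(base_features(base_models, X_te))
```

Because the meta-model only ever sees predictions for rows the base models were not trained on, it learns from honest base predictions, at the cost of having few rows to learn from.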

In [1]:
import warnings
warnings.filterwarnings('ignore')

## Loading data

In [2]:
import pandas as pd
import numpy as np

In [3]:
input_path = "../data/"

In [4]:
dataset = pd.read_csv('diabetes.csv')
dataset.head()

,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [5]:
dataset.shape

(768, 9)

In [6]:
dataset.isna().sum()

pregnancies    0
glucose        0
diastolic      0
triceps        0
insulin        0
bmi            0
dpf            0
age            0
diabetes       0
dtype: int64

In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [8]:
dataset.describe()

,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [9]:
dataset[(dataset.insulin != 0) & (dataset.glucose != 0)].shape

(393, 9)

## Feature selection

Dropping rows where insulin or glucose is 0, since a zero in these columns stands for a missing measurement.

In [10]:
dataset = dataset[(dataset.insulin != 0) & (dataset.glucose != 0)]
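
To make the effect of this filter concrete, the sketch below applies it to the first five rows shown by `dataset.head()` (three columns only). The `toy`, `zero_as_missing`, and `marked` names are hypothetical, and the NaN-marking alternative is an assumption that this notebook does not use.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset as printed by dataset.head() above.
toy = pd.DataFrame({
    "glucose": [148, 85, 183, 89, 137],
    "insulin": [0, 0, 0, 94, 168],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# A zero in these columns is treated as "not measured".
zero_as_missing = ["glucose", "insulin"]
zeros_per_column = (toy[zero_as_missing] == 0).sum()

# Same filter as in the cell above: keep rows where both values are present.
kept = toy[(toy.insulin != 0) & (toy.glucose != 0)]

# Alternative (assumption, not used in this notebook): mark zeros as NaN
# so they can be imputed later instead of dropping the rows.
marked = toy.copy()
marked[zero_as_missing] = marked[zero_as_missing].replace(0, np.nan)

print(zeros_per_column)
print(len(toy), "->", len(kept), "rows after filtering")
```

Counting the zeros first shows how much data the filter costs; on the full dataset it reduces 768 rows to 393.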

## Exploratory Data Analysis

In [11]:
# %%capture
# !pip install -U dataprep

In [12]:
# from dataprep.eda import create_report
# report = create_report(dataset, title='My Report')
# report

In [13]:
dataset.diabetes.value_counts()

0    263
1    130
Name: diabetes, dtype: int64

We treat the data as clean enough for training so that we can focus on the new technique and skip further pre-processing (strictly speaking, this is not true).

## Splitting data

In [14]:
X, y = dataset.drop("diabetes", axis=1), dataset["diabetes"]

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)
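
A small sketch of what `stratify` does here, using labels with the same class counts as `dataset.diabetes.value_counts()` above. The placeholder feature, the `_demo` names, and `random_state=42` are assumptions; the notebook cell itself does not fix a seed, so its results vary between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as dataset.diabetes.value_counts() above.
y_demo = np.array([0] * 263 + [1] * 130)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature (assumption)

# random_state is an assumption added here; the notebook cell does not set one.
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42)

print(len(y_a), len(y_b))       # sizes of the train and test parts
print(y_a.mean(), y_b.mean())   # share of positives in each part
```

With stratification the share of positive cases is nearly identical in both parts, which matters for a dataset this small and imbalanced.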

### Splitting data even further (for blending)

In [16]:
X_train

,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
450,1,82,64,13,95,21.2,0.415,23
421,2,94,68,18,76,26.0,0.561,21
189,5,139,80,35,160,31.6,0.361,25
432,1,80,74,11,60,30.0,0.527,22
273,1,71,78,50,45,33.2,0.422,21
...,...,...,...,...,...,...,...,...
458,10,148,84,48,237,37.6,1.001,51
672,10,68,106,23,49,35.5,0.285,47
216,5,109,62,41,129,35.8,0.514,25
153,1,153,82,42,485,40.6,0.687,23


In [17]:
SPLIT_THRESHOLD = 275
X_train_base_data = X_train.iloc[:SPLIT_THRESHOLD]
X_train_holdout = X_train.iloc[SPLIT_THRESHOLD:] 

In [18]:
X_train_base_data.shape

(275, 8)

In [19]:
X_train_holdout.shape

(39, 8)
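
The meta-model also needs the targets for the holdout rows, so the labels have to be cut at the same position as the features. A minimal sketch on toy data follows; the frame, the invented labels, and names such as `split_at` are hypothetical stand-ins for `X_train`, `y_train`, and `SPLIT_THRESHOLD`.

```python
import pandas as pd

# Toy training data (assumption: stand-in for X_train / y_train, labels invented).
X_toy = pd.DataFrame(
    {"glucose": [82, 94, 139, 80, 71, 148, 68, 109]},
    index=[450, 421, 189, 432, 273, 458, 672, 216],
)
y_toy = pd.Series([0, 0, 1, 0, 0, 1, 0, 1], index=X_toy.index, name="diabetes")

split_at = 6  # hypothetical threshold, plays the role of SPLIT_THRESHOLD

# Slice features and labels at the same position.
X_base, X_hold = X_toy.iloc[:split_at], X_toy.iloc[split_at:]
y_base, y_hold = y_toy.iloc[:split_at], y_toy.iloc[split_at:]

# The index labels must line up so that each row keeps its own target.
assert X_base.index.equals(y_base.index)
assert X_hold.index.equals(y_hold.index)
```

A positional slice is not stratified, so a small holdout can end up with a skewed class balance. A second call to `train_test_split` with `stratify` would be one way to avoid that.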

## Training singular classifiers

### Random Forest Classifier training

In [20]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)



### Evaluation

In [21]:
from sklearn.metrics import auc, classification_report, roc_auc_score

In [22]:
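# roc_auc_score takes (y_true, y_score); the value recorded below was produced with the arguments swapped
print(roc_auc_score(y_test, y_pred))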

0.7759170653907497
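
`roc_auc_score` takes the true labels first and the scores second, and it is usually more informative when given predicted probabilities rather than hard 0/1 labels. The sketch below compares both on synthetic data; the data, the `clf` model, and the variable names are assumptions, not part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_a, y_a)

# Hard labels collapse the ROC curve to a single operating point.
auc_from_labels = roc_auc_score(y_b, clf.predict(X_b))

# Positive-class probabilities let the metric sweep over all thresholds.
auc_from_scores = roc_auc_score(y_b, clf.predict_proba(X_b)[:, 1])

print(auc_from_labels, auc_from_scores)
```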


In [23]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.89      0.85        53
           1       0.73      0.62      0.67        26

    accuracy                           0.80        79
   macro avg       0.78      0.75      0.76        79
weighted avg       0.79      0.80      0.79        79
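
The report can be used to cross-check the ROC AUC figure by hand. The confusion-matrix counts below are inferred from the rounded precision, recall, and support values above, so treat them as an assumption. With hard 0/1 predictions, `roc_auc_score(y_true, y_pred)` equals the mean of the two recalls, while calling it as `roc_auc_score(y_pred, y_test)` yields the mean of the two precisions, which is what the recorded 0.7759 corresponds to.

```python
# Confusion-matrix counts inferred from the rounded report above (assumption).
tn, fp = 47, 6    # true class 0: support 53
fn, tp = 10, 16   # true class 1: support 26

recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)

# With hard 0/1 predictions, ROC AUC reduces to the mean of the two recalls.
auc_true_first = (recall_0 + recall_1) / 2

# With the arguments swapped, the predictions play the role of the truth,
# so the result is the mean of the two precisions instead.
auc_pred_first = (precision_0 + precision_1) / 2

print(round(auc_true_first, 4), round(auc_pred_first, 4))
```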



### Cat Boost Classifier

In [24]:
%%capture
!pip install catboost

In [25]:
from catboost import CatBoostClassifier

cat = CatBoostClassifier()
cat.fit(X_train, y_train)
y_pred = cat.predict(X_test)

Learning rate set to 0.006282
0:	learn: 0.6893800	total: 53.6ms	remaining: 53.5s
1:	learn: 0.6859170	total: 56.4ms	remaining: 28.1s
2:	learn: 0.6828350	total: 60.3ms	remaining: 20s
3:	learn: 0.6792385	total: 66.8ms	remaining: 16.6s
4:	learn: 0.6755937	total: 71ms	remaining: 14.1s
5:	learn: 0.6710152	total: 75.6ms	remaining: 12.5s
6:	learn: 0.6667982	total: 77.5ms	remaining: 11s
7:	learn: 0.6633989	total: 79.3ms	remaining: 9.83s
8:	learn: 0.6602123	total: 81.7ms	remaining: 9s
9:	learn: 0.6565891	total: 88.7ms	remaining: 8.78s
10:	learn: 0.6528450	total: 90.7ms	remaining: 8.15s
11:	learn: 0.6485893	total: 92ms	remaining: 7.58s
12:	learn: 0.6453496	total: 93.5ms	remaining: 7.1s
13:	learn: 0.6422618	total: 95.1ms	remaining: 6.7s
14:	learn: 0.6384780	total: 96.6ms	remaining: 6.34s
15:	learn: 0.6349154	total: 98.1ms	remaining: 6.03s
16:	learn: 0.6319625	total: 115ms	remaining: 6.67s
17:	learn: 0.6286011	total: 119ms	remaining: 6.49s
18:	learn: 0.6252886	total: 122ms	remaining: 6.32s
...

### Evaluation

In [26]:
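# roc_auc_score takes (y_true, y_score); the value recorded below was produced with the arguments swapped
print(roc_auc_score(y_test, y_pred))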

0.7859259259259259


In [27]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.87      0.86        53
           1       0.72      0.69      0.71        26

    accuracy                           0.81        79
   macro avg       0.79      0.78      0.78        79
weighted avg       0.81      0.81      0.81        79



### Naive Bayes Classifier

In [28]:
from sklearn.naive_bayes import GaussianNB

gauss = GaussianNB()
gauss.fit(X_train, y_train)
y_pred = gauss.predict(X_test)

### Evaluation

In [29]:
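# roc_auc_score takes (y_true, y_score); the value recorded below was produced with the arguments swapped
print(roc_auc_score(y_test, y_pred))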

0.7282763532763533


In [30]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.81      0.82        53
           1       0.63      0.65      0.64        26

    accuracy                           0.76        79
   macro avg       0.73      0.73      0.73        79
weighted avg       0.76      0.76      0.76        79

