# Classification models

Dataset: PIMA
By: Sam
Updated: 14/10/2022

===

**Summary:**
- Import unsupervised discretised datasets (categorical attributes already encoded)
- Split dataset: 70% training, 30% testing (`test_size = 0.3`), seed = 30
- SMOTE oversampling for imbalanced data
- Fit 3 classification models: ID3, Categorical Naive Bayes, Knn-VDM (slow, so it is run last)
- Bias and variance decomposition for each model (slow for Knn-VDM)
- Cross validation (accuracy): 10 folds, 3 repeats

**NOTE**
For Categorical Naive Bayes, `min_categories` must be passed to avoid an index-out-of-bounds error

===

**Result:**
Errors in the bias-variance decomposition of the Knn-VDM models (EWD, k = 7 and k = 10) and the CNB model (EWD, k = 10)

### About Dataset
The dataset has one target (dependent) variable and 8 attributes (TYNECKI, 2018):
- pregnancies
- OGTT (Oral Glucose Tolerance Test)
- blood pressure
- skin thickness
- insulin
- BMI (Body Mass Index)
- age
- diabetes pedigree function

In [1]:
import pandas as pd
from pandas import read_csv
from pandas import set_option
import numpy as np
from numpy import arange
## EDA
from collections import Counter

In [2]:
# Pre-processing
from sklearn.preprocessing import OrdinalEncoder
# Cross validation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score # 1 metric
from sklearn.model_selection import cross_validate # more than 1 metric
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [3]:
# # RIPPER (https://pypi.org/project/wittgenstein/) Only for binary
# import wittgenstein as lw 

In [4]:
# For Naive Bayes
from sklearn.naive_bayes import CategoricalNB # Categorical Naive Bayes
from sklearn.naive_bayes import MultinomialNB # Multinomial Naive Bayes (suitable for NLP)
#from mixed_naive_bayes import MixedNB # Mixed Naive Bayes for combination of both discrete & continuous feature

In [5]:
#!pip install mlrose

In [6]:
# For decision tree ID3 
# https://stackoverflow.com/questions/61867945/python-import-error-cannot-import-name-six-from-sklearn-externals
import six
import sys
sys.modules['sklearn.externals.six'] = six
import mlrose
from id3 import Id3Estimator # ID3 Decision Tree (https://pypi.org/project/decision-tree-id3/)
from id3 import export_graphviz

In [7]:
# Knn-VDM 3
from vdm3 import ValueDifferenceMetric
from sklearn.neighbors import KNeighborsClassifier

In [8]:
# For model evaluation
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix

In [9]:
import seaborn as sns
import matplotlib.pyplot as plt

# 1. EWD Datasets (k = 4, 7, 10)

## 1.1 EWD, k = 4

In [133]:
# Read data
df_ewd1 = pd.read_csv('pima_ewd1.csv')
disc = 'EWD'
k = 4

In [134]:
df_ewd1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    int64
 6   DiabetesPedigreeFunction  768 non-null    int64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: int64(9)
memory usage: 54.1 KB


In [135]:
cols = df_ewd1.columns.to_list()

In [136]:
for col in cols:
    print(col, Counter(df_ewd1[col]))

Pregnancies Counter({0: 492, 1: 190, 2: 72, 3: 14})
Glucose Counter({2: 428, 1: 191, 3: 143, 0: 6})
BloodPressure Counter({2: 571, 1: 121, 0: 38, 3: 38})
SkinThickness Counter({0: 411, 1: 345, 2: 11, 3: 1})
Insulin Counter({0: 693, 1: 57, 2: 15, 3: 3})
BMI Counter({1: 439, 2: 310, 0: 11, 3: 8})
DiabetesPedigreeFunction Counter({0: 598, 1: 145, 2: 20, 3: 5})
Age Counter({0: 514, 1: 181, 2: 64, 3: 9})
Outcome Counter({0: 500, 1: 268})
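
The counts above are heavily skewed: `SkinThickness` has a single row in bin 3 and `Insulin` only three. That is typical of equal-width discretisation (EWD) on skewed attributes, since a few extreme values stretch the range and most rows fall into the lowest bins. A minimal sketch, assuming EWD simply cuts the range [min, max] into k equal intervals; the helper `equal_width_codes` and the sample values are hypothetical and are not the code that produced `pima_ewd1.csv`:

```python
from collections import Counter

def equal_width_codes(values, k):
    """Map each value to an integer bin code 0..k-1 using k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

# Hypothetical skewed attribute: one outlier stretches the range.
sample = [0, 1, 1, 2, 2, 3, 3, 4, 5, 40]
codes = equal_width_codes(sample, k=4)
print(Counter(codes))  # Counter({0: 9, 3: 1})
```

Bins 1 and 2 stay empty in this toy column, which mirrors the sparse bins in the real counts.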


#### Train test split

In [137]:
# Split dataset
X = df_ewd1.drop(['Outcome'], axis = 1)
Y = df_ewd1['Outcome']

In [138]:
# Split train test, test size 30%, random state 30
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 30, stratify = Y)

In [139]:
# Check representation of class
print('Class representation - original: ', Counter(df_ewd1['Outcome'])) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 

Class representation - original:  Counter({0: 500, 1: 268})
Class representation - training data:  Counter({0: 350, 1: 187})
Class representation - testing data:  Counter({0: 150, 1: 81})


In [140]:
print(x_train.shape, x_test.shape)

(537, 8) (231, 8)


In [141]:
# SMOTE
#! pip install imblearn --user
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 22)
x_train, y_train = smote.fit_resample(x_train, y_train)
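
SMOTE builds each synthetic minority row by interpolating between a minority sample and one of its nearest minority neighbours, which treats the integer bin codes as continuous quantities. A minimal sketch of that interpolation step, with hypothetical rows and a hypothetical helper `smote_point` (this is not imblearn's implementation):

```python
import random

def smote_point(x, neighbour, rng):
    """Interpolate one synthetic row between x and a minority-class neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbour)]

rng = random.Random(22)
x = [0, 1, 3]          # hypothetical bin codes of a minority row
neighbour = [2, 1, 0]  # hypothetical minority neighbour
synthetic = smote_point(x, neighbour, rng)
print(synthetic)

# Each synthetic value lies between the two parent values, so it can be a
# non-integer or a bin code that neither parent had.
```

If the resampled codes must stay valid categories, imblearn also provides `SMOTEN` and `SMOTENC` for nominal features; whether they suit this dataset is not tested here.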

In [142]:
# Check number of categories for features
n_categories = df_ewd1.drop(['Outcome'], axis = 1).nunique()

In [143]:
n_categories

Pregnancies                 4
Glucose                     4
BloodPressure               4
SkinThickness               4
Insulin                     4
BMI                         4
DiabetesPedigreeFunction    4
Age                         4
dtype: int64

### Models - EWD, k=4

In [22]:
# ID3 - Default
import time
start = time.time() # For measuring time execution

model_id3 = Id3Estimator()
model_id3.fit(x_train, y_train)
# Testing
y_pred_id3 = model_id3.predict(x_test)
print(classification_report(y_test, y_pred_id3))

end = time.time()
print(f'Time for training model ID3 - default, {disc}, k = {k} is: {end - start}.') # Total time execution


              precision    recall  f1-score   support

           0       0.80      0.59      0.68       150
           1       0.49      0.73      0.59        81

    accuracy                           0.64       231
   macro avg       0.65      0.66      0.63       231
weighted avg       0.69      0.64      0.65       231

Time for training model ID3 - default, EWD, k = 4 is: 0.03110790252685547.


In [26]:
# Naive Bayes - Min-categories
import time
start = time.time() # For measuring time execution
model_nb = CategoricalNB(min_categories = n_categories)
model_nb.fit(x_train, y_train)
# Testing
y_pred_nb = model_nb.predict(x_test)
model_nb.classes_
print(classification_report(y_test, y_pred_nb))
end = time.time()
print(f'Time for training model Naive Bayes - default, {disc}, k = {k} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.85      0.66      0.74       150
           1       0.56      0.79      0.65        81

    accuracy                           0.71       231
   macro avg       0.70      0.73      0.70       231
weighted avg       0.75      0.71      0.71       231

Time for training model Naive Bayes - default, EWD, k = 4 is: 0.006515026092529297.


In [144]:
# WARNING: LONG TIME
# Knn-VDM complete code
# DONE
# Time for training model Knn-VDM, EWD, k = 4 is: 42.19520902633667.

import time
start = time.time() # For measuring time execution

# Specify the continuous column indices, if any
vdm = ValueDifferenceMetric(x_train, y_train, continuous = None)
vdm.fit()
# Knn model, n_neighbors = 3, metric = vdm
knn_vdm = KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')
## Fit model
knn_vdm.fit(x_train, y_train)
# Testing
y_pred_knn = knn_vdm.predict(x_test)
knn_vdm.classes_
print(classification_report(y_test, y_pred_knn))

end = time.time()
print(f'Time for training model Knn-VDM, {disc}, k = {k} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.76      0.75      0.75       150
           1       0.54      0.56      0.55        81

    accuracy                           0.68       231
   macro avg       0.65      0.65      0.65       231
weighted avg       0.68      0.68      0.68       231

Time for training model Knn-VDM, EWD, k = 4 is: 42.19520902633667.


In [30]:
# CROSS VALIDATION
import warnings
warnings.filterwarnings('ignore')

# param
num_folds = 10
num_repeat = 3
seed = 7
scores = 'accuracy'

print(f'Cross validation result, {scores}, {disc}, k = {k}.')

# Create list of algorithms
models = []
models.append(('ID3', Id3Estimator()))
#models.append(('RIPPER', lw.RIPPER()))
models.append(('CNB', CategoricalNB()))
models.append(('Knn-VDM', KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
  #kfold = KFold(n_splits=num_folds, shuffle = True, random_state=10)
    kfold = RepeatedKFold(n_splits=num_folds, n_repeats=num_repeat, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scores)
    results.append(cv_results)
    names.append(name)
    msg = '%s: - Mean: %f, Standard deviation: %f' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross validation result, accuracy, EWD, k = 4.
ID3: - Mean: 0.711341, Standard deviation: 0.030314
CNB: - Mean: nan, Standard deviation: nan
Knn-VDM: - Mean: 0.706619, Standard deviation: 0.039965
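
`cross_val_score` reports `nan` for a fold that raises an error (its `error_score` defaults to `nan`), and the warnings are silenced in the cell above. A plausible cause for the CNB result, not verified in this notebook, is that `CategoricalNB()` is created in the `models` list without `min_categories`, so a fold whose training part lacks a rare category fails on the held-out rows. The sketch below uses only the `SkinThickness` counts printed earlier and a hypothetical plain-Python fold splitter (not scikit-learn's `RepeatedKFold`) to count how often a test fold contains a code absent from its training part:

```python
import random

# SkinThickness bin counts for EWD, k = 4, as printed by Counter earlier.
counts = {0: 411, 1: 345, 2: 11, 3: 1}
column = [code for code, n in counts.items() for _ in range(n)]

def folds_with_unseen_code(column, n_splits, n_repeats, seed):
    """Count test folds holding a code that never occurs in the training part."""
    rng = random.Random(seed)
    failing = 0
    for _ in range(n_repeats):
        idx = list(range(len(column)))
        rng.shuffle(idx)
        for f in range(n_splits):
            test_idx = set(idx[f::n_splits])  # every n_splits-th shuffled row
            train_codes = {column[i] for i in idx if i not in test_idx}
            test_codes = {column[i] for i in test_idx}
            if test_codes - train_codes:
                failing += 1
    return failing

failing = folds_with_unseen_code(column, n_splits=10, n_repeats=3, seed=7)
print(f'{failing} of 30 folds contain an unseen code')
```

Because code 3 occurs in a single row, it sits in exactly one test fold per repeat, so at least three of the 30 folds are affected whatever the seed. Passing `min_categories` to the estimator inside the `models` list, as the NOTE at the top recommends, would be the natural change to try.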


### Evaluation, EWD, k = 4

In [31]:
from sklearn.metrics import zero_one_loss
#This library is used to decompose bias and variance in our models
from mlxtend.evaluate import bias_variance_decomp
import warnings
warnings.filterwarnings('ignore')

In [34]:
# ID3
# Convert all dataframe to array
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values

# Evaluation
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_id3, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_id3))

Average expected loss: 0.360
Average bias: 0.359
Average variance: 0.187
Sklearn 0-1 loss: 0.359
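
For the 0-1 loss, `bias_variance_decomp` refits the model on bootstrap samples of the training set and compares each round's predictions with the "main" (most frequent) prediction per test row. This is why the expected loss printed above is not the sum of bias and variance, and why the decomposition is slow for Knn-VDM. A minimal sketch of the bookkeeping, assuming the definitions mlxtend documents for the 0-1 loss and using hypothetical predictions from three rounds rather than real model output:

```python
from collections import Counter

def decompose_01(all_pred, y_true):
    """Return (expected loss, bias, variance) for the 0-1 loss."""
    n_rounds, n = len(all_pred), len(y_true)
    # Main prediction: the most frequent predicted label for each test row.
    main = [Counter(col).most_common(1)[0][0] for col in zip(*all_pred)]
    # Expected loss: misclassification rate averaged over all rounds.
    loss = sum(p != t for pred in all_pred for p, t in zip(pred, y_true)) / (n_rounds * n)
    # Bias: how often the main prediction is wrong.
    bias = sum(m != t for m, t in zip(main, y_true)) / n
    # Variance: how often a round disagrees with the main prediction.
    var = sum(p != m for pred in all_pred for p, m in zip(pred, main)) / (n_rounds * n)
    return loss, bias, var

y_true = [0, 1, 1, 1]       # hypothetical test labels
all_pred = [                # hypothetical predictions, one row per bootstrap round
    [0, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
loss, bias, var = decompose_01(all_pred, y_true)
print(f'loss={loss:.3f} bias={bias:.3f} variance={var:.3f}')
# loss=0.333 bias=0.250 variance=0.250
```

In this toy case bias plus variance is 0.5 while the expected loss is 0.333, the same kind of gap as in the ID3 figures above.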


In [35]:
# Naive Bayes - min_categories update
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_nb, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_nb))

Average expected loss: 0.288
Average bias: 0.294
Average variance: 0.054
Sklearn 0-1 loss: 0.294


In [146]:
# WARNING - LONG TIME
# Knn-VDM
# Convert all dataframe to array
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values

import time
start = time.time() # For measuring time execution
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
knn_vdm, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_knn))
end = time.time()
print(f'Computing time: {end - start}.') # Total time execution

Average expected loss: 0.340
Average bias: 0.316
Average variance: 0.171
Sklearn 0-1 loss: 0.320
Computing time: 8650.40454006195.


## 1.2 EWD, k = 7

In [147]:
# Read data
df_ewd2 = pd.read_csv('pima_ewd2.csv')
df_ewd2.info()
disc = "EWD"
k = 7

## EDA
from collections import Counter

#Check class of control
Counter(df_ewd2['Outcome'])

# Split dataset
X = df_ewd2.drop(['Outcome'], axis = 1)
Y = df_ewd2['Outcome']

# Split train test, test size 30%, random state 30
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 30, stratify = Y)

# Check representation of class
print('Class representation - original: ', Counter(df_ewd2['Outcome'])) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ',Counter(y_test)) 
print(x_train.shape, x_test.shape)

# SMOTE
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 22)
x_train, y_train = smote.fit_resample(x_train, y_train)

print('='*25)
print('Distribution after SMOTE')
print('Class representation - training data: ', Counter(y_train))
print(x_train.shape, x_test.shape)

# Check number of categories for features
n_categories = df_ewd2.drop(['Outcome'], axis = 1).nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    int64
 6   DiabetesPedigreeFunction  768 non-null    int64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: int64(9)
memory usage: 54.1 KB
Class representation - original:  Counter({0: 500, 1: 268})
Class representation - training data:  Counter({0: 350, 1: 187})
Class representation - testing data:  Counter({0: 150, 1: 81})
(537, 8) (231, 8)
Distribution after SMOTE
Class representation - training data:  Coun

### Models, EWD, k = 7

In [48]:
# ID3 - Default
import time
start = time.time() # For measuring time execution

model_id3 = Id3Estimator()
model_id3.fit(x_train, y_train)
# Testing
y_pred_id3 = model_id3.predict(x_test)
print(classification_report(y_test, y_pred_id3))

end = time.time()
print(f'Time for training model ID3 - default, {disc}, k = {k} is: {end - start}.') # Total time execution


              precision    recall  f1-score   support

           0       0.80      0.67      0.73       150
           1       0.53      0.69      0.60        81

    accuracy                           0.68       231
   macro avg       0.67      0.68      0.67       231
weighted avg       0.71      0.68      0.69       231

Time for training model ID3 - default, EWD, k = 7 is: 0.04917502403259277.


In [49]:
# Naive Bayes - Min-categories
import time
start = time.time() # For measuring time execution
model_nb = CategoricalNB(min_categories = n_categories)
model_nb.fit(x_train, y_train)
# Testing
y_pred_nb = model_nb.predict(x_test)
model_nb.classes_
print(classification_report(y_test, y_pred_nb))
end = time.time()
print(f'Time for training model Naive Bayes - min_categories, {disc}, k = {k} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.85      0.65      0.74       150
           1       0.55      0.79      0.65        81

    accuracy                           0.70       231
   macro avg       0.70      0.72      0.69       231
weighted avg       0.75      0.70      0.71       231

Time for training model Naive Bayes - min_categories, EWD, k = 7 is: 0.008002758026123047.


In [40]:
# WARNING: LONG TIME
# Knn-VDM complete code
import time
start = time.time() # For measuring time execution

# Specify the continuous column indices, if any
vdm = ValueDifferenceMetric(x_train, y_train, continuous = None)
vdm.fit()
# Knn model, n_neighbors = 3, metric = vdm
knn_vdm = KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')
## Fit model
knn_vdm.fit(x_train, y_train)
# Testing
y_pred_knn = knn_vdm.predict(x_test)
knn_vdm.classes_
print(classification_report(y_test, y_pred_knn))

end = time.time()
print(f'Time for training model Knn-VDM, {disc}, k = {k} is: {end - start}.') # Total time execution

KeyError: 6.0
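
The value difference metric compares two category values through the class distributions observed for them in the training data, so it needs a conditional-probability entry for every value it is asked about. One plausible reading of the `KeyError: 6.0` above, not confirmed here, is that a bin code (6) is looked up that has no entry in the table built during `fit`; sparse top bins are likely under EWD with k = 7. The sketch below is a plain-Python illustration of the metric for a single attribute with hypothetical data; it is not the `vdm3` implementation:

```python
from collections import Counter, defaultdict

def fit_vdm_table(column, labels):
    """Estimate P(class | value) for one categorical attribute."""
    counts = defaultdict(Counter)
    for value, label in zip(column, labels):
        counts[value][label] += 1
    classes = sorted(set(labels))
    return {
        value: {c: cnt[c] / sum(cnt.values()) for c in classes}
        for value, cnt in counts.items()
    }

def vdm_distance(table, a, b, q=2):
    """Sum over classes of |P(c|a) - P(c|b)|^q; raises KeyError for unseen values."""
    return sum(abs(table[a][c] - table[b][c]) ** q for c in table[a])

# Hypothetical training column (bin codes) and class labels.
column = [0, 0, 1, 1, 1, 2]
labels = [0, 0, 0, 1, 1, 1]
table = fit_vdm_table(column, labels)
print(vdm_distance(table, 0, 2))  # 2.0: value 0 is all class 0, value 2 all class 1

try:
    vdm_distance(table, 0, 6.0)   # code 6 never occurred in the training column
except KeyError as err:
    print('KeyError:', err)
```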

In [50]:
# CROSS VALIDATION
import warnings
warnings.filterwarnings('ignore')

# param
num_folds = 10
num_repeat = 3
seed = 7
scores = 'accuracy'

print(f'Cross validation result, {scores}, {disc}, k = {k}.')

# Create list of algorithms
models = []
models.append(('ID3', Id3Estimator()))
#models.append(('RIPPER', lw.RIPPER()))
models.append(('CNB', CategoricalNB()))
models.append(('Knn-VDM', KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
  #kfold = KFold(n_splits=num_folds, shuffle = True, random_state=10)
    kfold = RepeatedKFold(n_splits=num_folds, n_repeats=num_repeat, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scores)
    results.append(cv_results)
    names.append(name)
    msg = '%s: - Mean: %f, Standard deviation: %f' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross validation result, accuracy, EWD, k = 7.
ID3: - Mean: 0.720466, Standard deviation: 0.046353
CNB: - Mean: nan, Standard deviation: nan
Knn-VDM: - Mean: nan, Standard deviation: nan


### Evaluation, EWD, k = 7

In [51]:
from sklearn.metrics import zero_one_loss
#This library is used to decompose bias and variance in our models
from mlxtend.evaluate import bias_variance_decomp
import warnings
warnings.filterwarnings('ignore')

In [54]:
# ID3
# Convert all dataframe to array
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values

# Evaluation
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_id3, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_id3))

Average expected loss: 0.343
Average bias: 0.303
Average variance: 0.200
Sklearn 0-1 loss: 0.320


In [55]:
# Naive Bayes - min_categories update
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_nb, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_nb))

Average expected loss: 0.302
Average bias: 0.303
Average variance: 0.063
Sklearn 0-1 loss: 0.299


In [56]:
# # WARNING - LONG TIME
# # Knn-VDM
# import time
# start = time.time() # For measuring time execution
# avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
# knn_vdm, x_train, y_train, x_test, y_test,
# loss='0-1_loss',
# random_seed=123)
# #---
# print('Average expected loss: %.3f' % avg_expected_loss)
# print('Average bias: %.3f' % avg_bias)
# print('Average variance: %.3f' % avg_var)
# print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_knn))
# end = time.time()
# print(f'Computing time: {end - start}.') # Total time execution

## 1.3 EWD, k = 10

In [148]:
# Read data
df_ewd3 = pd.read_csv('pima_ewd3.csv')
df_ewd3.info()
disc = "EWD"
k = 10

## EDA
from collections import Counter

# Check class of control
Counter(df_ewd3['Outcome'])

# Split dataset
X = df_ewd3.drop(['Outcome'], axis = 1)
Y = df_ewd3['Outcome']

# Split train test, test size 30%, random state 30
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 30, stratify = Y)

# Check representation of class
print('Class representation - original: ', Counter(df_ewd3['Outcome'])) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ',Counter(y_test)) 
print(x_train.shape, x_test.shape)

# SMOTE
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 22)
x_train, y_train = smote.fit_resample(x_train, y_train)

print('='*25)
print('Distribution after SMOTE')
print('Class representation - training data: ', Counter(y_train))
print(x_train.shape, x_test.shape)

# Check number of categories for features
n_categories = df_ewd3.drop(['Outcome'], axis = 1).nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    int64
 6   DiabetesPedigreeFunction  768 non-null    int64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: int64(9)
memory usage: 54.1 KB
Class representation - original:  Counter({0: 500, 1: 268})
Class representation - training data:  Counter({0: 350, 1: 187})
Class representation - testing data:  Counter({0: 150, 1: 81})
(537, 8) (231, 8)
Distribution after SMOTE
Class representation - training data:  Coun

### Models, EWD, k = 10

In [58]:
# ID3 - Default
import time
start = time.time() # For measuring time execution

model_id3 = Id3Estimator()
model_id3.fit(x_train, y_train)
# Testing
y_pred_id3 = model_id3.predict(x_test)
print(classification_report(y_test, y_pred_id3))

end = time.time()
print(f'Time for training model ID3 - default, {disc}, k = {k} is: {end - start}.') # Total time execution


              precision    recall  f1-score   support

           0       0.85      0.63      0.72       150
           1       0.53      0.79      0.64        81

    accuracy                           0.68       231
   macro avg       0.69      0.71      0.68       231
weighted avg       0.74      0.68      0.69       231

Time for training model ID3 - default, EWD, k = 10 is: 0.050779104232788086.


In [60]:
n_categories

Pregnancies                 10
Glucose                      9
BloodPressure               10
SkinThickness                8
Insulin                     10
BMI                          9
DiabetesPedigreeFunction    10
Age                         10
dtype: int64

In [61]:
# Naive Bayes - Min categories
import time
start = time.time() # For measuring time execution
model_nb = CategoricalNB(min_categories = n_categories)
model_nb.fit(x_train, y_train)
# Testing
y_pred_nb = model_nb.predict(x_test)
model_nb.classes_
print(classification_report(y_test, y_pred_nb))
end = time.time()
print(f'Time for training model Naive Bayes - min_categories, {disc}, k = {k} is: {end - start}.') # Total time execution

IndexError: index 9 is out of bounds for axis 1 with size 9
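
`n_categories` is computed with `nunique()`, which counts the distinct codes that occur, whereas `CategoricalNB` indexes its per-feature tables by the code itself and therefore needs room for the largest code plus one. With equal-width bins an interior bin can be empty, so a feature such as `Glucose` can have 9 distinct codes that still run up to 9. This would explain the message above, although the notebook does not confirm it. A minimal sketch with a hypothetical column:

```python
# Hypothetical bin codes for one feature under k = 10: bin 7 is empty.
codes = [0, 1, 2, 3, 4, 5, 6, 8, 9]

n_unique = len(set(codes))   # what nunique() reports
n_needed = max(codes) + 1    # slots needed to index by code
print(n_unique, n_needed)    # 9 10

table = [0.0] * n_unique     # a per-category table sized by the distinct count
try:
    table[9]
except IndexError as err:
    print('IndexError:', err)  # the largest code does not fit
```

Under that assumption, sizing `min_categories` from the largest code, for example `X.max() + 1` on the full feature frame, would be the change to try; it is untested here.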

In [62]:
# WARNING: LONG TIME
# Knn-VDM complete code
import time
start = time.time() # For measuring time execution

# Specify the continuous column indices, if any
vdm = ValueDifferenceMetric(x_train, y_train, continuous = None)
vdm.fit()
# Knn model, n_neighbors = 3, metric = vdm
knn_vdm = KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')
## Fit model
knn_vdm.fit(x_train, y_train)
# Testing
y_pred_knn = knn_vdm.predict(x_test)
knn_vdm.classes_
print(classification_report(y_test, y_pred_knn))

end = time.time()
print(f'Time for training model Knn-VDM, {disc}, k = {k} is: {end - start}.') # Total time execution

KeyError: 9.0

In [63]:
# CROSS VALIDATION
import warnings
warnings.filterwarnings('ignore')

# param
num_folds = 10
num_repeat = 3
seed = 7
scores = 'accuracy'

print(f'Cross validation result, {scores}, {disc}, k = {k}.')

# Create list of algorithms
models = []
models.append(('ID3', Id3Estimator()))
#models.append(('RIPPER', lw.RIPPER()))
models.append(('CNB', CategoricalNB()))
models.append(('Knn-VDM', KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
  #kfold = KFold(n_splits=num_folds, shuffle = True, random_state=10)
    kfold = RepeatedKFold(n_splits=num_folds, n_repeats=num_repeat, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scores)
    results.append(cv_results)
    names.append(name)
    msg = '%s: - Mean: %f, Standard deviation: %f' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross validation result, accuracy, EWD, k = 10.
ID3: - Mean: 0.728788, Standard deviation: 0.039028
CNB: - Mean: nan, Standard deviation: nan
Knn-VDM: - Mean: nan, Standard deviation: nan


### Evaluation, EWD, k = 10

In [64]:
from sklearn.metrics import zero_one_loss
#This library is used to decompose bias and variance in our models
from mlxtend.evaluate import bias_variance_decomp
import warnings
warnings.filterwarnings('ignore')

In [65]:
# ID3
# Convert all dataframe to array
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values

# Evaluation
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_id3, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_id3))

Average expected loss: 0.337
Average bias: 0.312
Average variance: 0.176
Sklearn 0-1 loss: 0.316


In [66]:
# Naive Bayes - min_categories update
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_nb, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_nb))

IndexError: index 9 is out of bounds for axis 1 with size 9

In [67]:
# # WARNING - LONG TIME
# # Knn-VDM
# import time
# start = time.time() # For measuring time execution
# avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
# knn_vdm, x_train, y_train, x_test, y_test,
# loss='0-1_loss',
# random_seed=123)
# #---
# print('Average expected loss: %.3f' % avg_expected_loss)
# print('Average bias: %.3f' % avg_bias)
# print('Average variance: %.3f' % avg_var)
# print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_knn))
# end = time.time()
# print(f'Computing time: {end - start}.') # Total time execution

# 2. EFD
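
Assuming EFD stands for equal-frequency discretisation, each attribute is cut so that every bin holds roughly the same number of rows, which tends to avoid the nearly empty bins seen in the EWD datasets. A minimal rank-based sketch; the helper `equal_frequency_codes` and the sample values are hypothetical and are not the code that produced `pima_efd1.csv`:

```python
from collections import Counter

def equal_frequency_codes(values, k):
    """Assign bin codes 0..k-1 by rank so bins hold about len(values)/k rows each.

    Ties are broken by position here; a real discretiser would cut at
    quantile edges so that equal values always share a bin.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    codes = [0] * len(values)
    for rank, i in enumerate(order):
        codes[i] = rank * k // len(values)
    return codes

# Hypothetical skewed attribute with one outlier.
sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 40]
codes = equal_frequency_codes(sample, k=4)
print(Counter(codes))  # three rows in each of the four bins
```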

## 2.1 EFD, k = 4

In [149]:
# Read data
df_efd1 = pd.read_csv('pima_efd1.csv')
df_efd1.info()
disc = "EFD"
k = 4

## EDA
from collections import Counter

# Check class of control
Counter(df_efd1['Outcome'])

# Split dataset
X = df_efd1.drop(['Outcome'], axis = 1)
Y = df_efd1['Outcome']

# Split train test, test size 30%, random state 30
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 30, stratify = Y)

# Check representation of class
print('Class representation - original: ', Counter(df_efd1['Outcome'])) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ',Counter(y_test)) 
print(x_train.shape, x_test.shape)

# SMOTE
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 22)
x_train, y_train = smote.fit_resample(x_train, y_train)

print('='*25)
print('Distribution after SMOTE')
print('Class representation - training data: ', Counter(y_train))
print(x_train.shape, x_test.shape)

# Check number of categories for features
n_categories = df_efd1.drop(['Outcome'], axis = 1).nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    int64
 6   DiabetesPedigreeFunction  768 non-null    int64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: int64(9)
memory usage: 54.1 KB
Class representation - original:  Counter({0: 500, 1: 268})
Class representation - training data:  Counter({0: 350, 1: 187})
Class representation - testing data:  Counter({0: 150, 1: 81})
(537, 8) (231, 8)
Distribution after SMOTE
Class representation - training data:  Coun

### Models, EFD, k = 4

In [69]:
# ID3 - Default
import time
start = time.time() # For measuring time execution

model_id3 = Id3Estimator()
model_id3.fit(x_train, y_train)
# Testing
y_pred_id3 = model_id3.predict(x_test)
print(classification_report(y_test, y_pred_id3))

end = time.time()
print(f'Time for training model ID3 - default, {disc}, k = {k} is: {end - start}.') # Total time execution


              precision    recall  f1-score   support

           0       0.77      0.67      0.71       150
           1       0.50      0.63      0.56        81

    accuracy                           0.65       231
   macro avg       0.64      0.65      0.64       231
weighted avg       0.68      0.65      0.66       231

Time for training model ID3 - default, EFD, k = 4 is: 0.04825234413146973.


In [70]:
# Naive Bayes - min_categories
import time
start = time.time() # For measuring time execution
model_nb = CategoricalNB(min_categories = n_categories)
model_nb.fit(x_train, y_train)
# Testing
y_pred_nb = model_nb.predict(x_test)
model_nb.classes_
print(classification_report(y_test, y_pred_nb))
end = time.time()
print(f'Time for training model Naive Bayes - min_categories, {disc}, k = {k} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.85      0.63      0.73       150
           1       0.54      0.79      0.64        81

    accuracy                           0.69       231
   macro avg       0.69      0.71      0.68       231
weighted avg       0.74      0.69      0.70       231

Time for training model Naive Bayes - min_categories, EFD, k = 4 is: 0.00800013542175293.


In [150]:
# WARNING: LONG TIME
# Knn-VDM complete code
# Time for training model Knn-VDM, EFD, k = 4 is: 43.78280520439148.
# Acc: 0.68

import time
start = time.time() # For measuring time execution

# Specify the continuous column indices, if any
vdm = ValueDifferenceMetric(x_train, y_train, continuous = None)
vdm.fit()
# Knn model, n_neighbors = 3, metric = vdm
knn_vdm = KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')
## Fit model
knn_vdm.fit(x_train, y_train)
# Testing
y_pred_knn = knn_vdm.predict(x_test)
knn_vdm.classes_
print(classification_report(y_test, y_pred_knn))

end = time.time()
print(f'Time for training model Knn-VDM, {disc}, k = {k} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.80      0.69      0.74       150
           1       0.54      0.68      0.60        81

    accuracy                           0.68       231
   macro avg       0.67      0.68      0.67       231
weighted avg       0.71      0.68      0.69       231

Time for training model Knn-VDM, EFD, k = 4 is: 42.62577247619629.


In [72]:
# CROSS VALIDATION
import warnings
warnings.filterwarnings('ignore')

# param
num_folds = 10
num_repeat = 3
seed = 7
scores = 'accuracy'

print(f'Cross validation result, {scores}, {disc}, k = {k}.')

# Create list of algorithms
models = []
models.append(('ID3', Id3Estimator()))
#models.append(('RIPPER', lw.RIPPER()))
models.append(('CNB', CategoricalNB()))
models.append(('Knn-VDM', KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
  #kfold = KFold(n_splits=num_folds, shuffle = True, random_state=10)
    kfold = RepeatedKFold(n_splits=num_folds, n_repeats=num_repeat, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scores)
    results.append(cv_results)
    names.append(name)
    msg = '%s: - Mean: %f, Standard deviation: %f' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross validation result, accuracy, EFD, k = 4.
ID3: - Mean: 0.701771, Standard deviation: 0.040627
CNB: - Mean: 0.753890, Standard deviation: 0.050901
Knn-VDM: - Mean: 0.720044, Standard deviation: 0.049140
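Each mean and standard deviation above is taken over 10 folds x 3 repeats = 30 accuracy scores. The sketch below shows this on synthetic data (the arrays `X_demo` and `y_demo` are made up and stand in for a discretised dataset); it uses only sklearn calls already imported in this notebook.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
# Synthetic stand-in for a discretised dataset: 200 rows, 4 features with codes 0..3.
X_demo = rng.integers(0, 4, size=(200, 4))
y_demo = (X_demo[:, 0] + rng.integers(0, 2, size=200) > 2).astype(int)

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=7)
# min_categories guards against folds whose training part misses a code.
model = CategoricalNB(min_categories=4)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')

print(len(scores))                  # one score per fold per repeat
print(scores.mean(), scores.std())  # spread across folds, not a standard error
```

The reported standard deviation is the spread of the fold scores; it is not a standard error of the mean.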


### Evaluation, EFD, k = 4

In [73]:
from sklearn.metrics import zero_one_loss
#This library is used to decompose bias and variance in our models
from mlxtend.evaluate import bias_variance_decomp
import warnings
warnings.filterwarnings('ignore')

In [74]:
# ID3
# Convert DataFrames to arrays (np.asarray leaves existing arrays unchanged, so the cell can be rerun)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)

# Evaluation
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_id3, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_id3))

Average expected loss: 0.347
Average bias: 0.307
Average variance: 0.185
Sklearn 0-1 loss: 0.346
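Note that the bias and variance above do not add up to the expected loss. A minimal sketch of why, assuming the 0-1 loss definitions described in the mlxtend documentation (main prediction = most frequent prediction across bootstrap rounds); the function name `decompose_01` and the toy predictions are made up for the illustration.

```python
import numpy as np

def decompose_01(all_pred, y_true):
    """all_pred: (n_rounds, n_test) integer predictions, one row per bootstrap round."""
    # Main prediction: the most frequent predicted label per test point.
    main_pred = np.array([np.bincount(col).argmax() for col in all_pred.T])
    # Expected loss: error rate averaged over all rounds and test points.
    expected_loss = (all_pred != y_true).mean()
    # Bias: error rate of the main prediction.
    bias = (main_pred != y_true).mean()
    # Variance: how often a round disagrees with the main prediction.
    variance = (all_pred != main_pred).mean()
    return expected_loss, bias, variance

y_true = np.array([0, 0, 1, 1])
all_pred = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
])
loss, bias, var = decompose_01(all_pred, y_true)
print(f"expected loss: {loss:.3f}, bias: {bias:.3f}, variance: {var:.3f}")
```

In this toy case the bias alone exceeds the expected loss: on a point where the main prediction is wrong, a round that disagrees with it is actually correct, so variance can lower the loss there.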


In [75]:
# Naive Bayes - min_categories update
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_nb, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_nb))

Average expected loss: 0.311
Average bias: 0.316
Average variance: 0.064
Sklearn 0-1 loss: 0.312


In [152]:
# WARNING - LONG TIME
# Knn-VDM
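# Convert DataFrames to arrays (np.asarray leaves existing arrays unchanged,
# so this cell also works after the ID3 cell has already converted them)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)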

import time
start = time.time() # For measuring time execution
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
knn_vdm, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_knn))
end = time.time()
print(f'Computing time: {end - start}.') # Total time execution

Average expected loss: 0.328
Average bias: 0.307
Average variance: 0.155
Sklearn 0-1 loss: 0.316
Computing time: 8415.077355861664.
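The computing time is roughly what one would expect if every bootstrap round costs about as much as the single fit and predict timed earlier. The arithmetic below assumes `bias_variance_decomp` keeps its default of 200 rounds, since `num_rounds` is not passed in the call above.

```python
single_fit_predict_s = 42.62577247619629  # Knn-VDM fit + predict time reported earlier
num_rounds = 200                          # assumed default of bias_variance_decomp
observed_s = 8415.077355861664            # computing time reported above

estimate_s = single_fit_predict_s * num_rounds
print(f"estimated: {estimate_s:.0f} s, observed: {observed_s:.0f} s "
      f"({observed_s / 3600:.2f} h)")
```

If this assumption holds, passing a smaller `num_rounds` would shorten the run proportionally, at the cost of noisier estimates.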


## 2.2 EFD, k = 7

In [159]:
# Read data
df_efd2 = pd.read_csv('pima_efd2.csv')
df_efd2.info()
disc = "EFD"
k = 7

## EDA
from collections import Counter

# Check class distribution of the target
Counter(df_efd2['Outcome'])

# Split dataset
X = df_efd2.drop(['Outcome'], axis = 1)
Y = df_efd2['Outcome']

# Split train/test: test size 30%, random state 30, stratified on the target
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 30, stratify = Y)

# Check representation of class
print('Class representation - original: ', Counter(df_efd2['Outcome'])) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ',Counter(y_test)) 
print(x_train.shape, x_test.shape)

# SMOTE
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 22)
x_train, y_train = smote.fit_resample(x_train, y_train)

print('='*25)
print('Distribution after SMOTE')
print('Class representation - training data: ', Counter(y_train))
print(x_train.shape, x_test.shape)

# Check number of categories for features
n_categories = df_efd2.drop(['Outcome'], axis = 1).nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    int64
 6   DiabetesPedigreeFunction  768 non-null    int64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: int64(9)
memory usage: 54.1 KB
Class representation - original:  Counter({0: 500, 1: 268})
Class representation - training data:  Counter({0: 350, 1: 187})
Class representation - testing data:  Counter({0: 150, 1: 81})
(537, 8) (231, 8)
Distribution after SMOTE
Class representation - training data:  Coun
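The shapes printed above follow from `test_size = 0.3`: with 768 rows this gives 231 test rows (a 25% split would give 192). The sketch below reproduces the class counts on synthetic labels with the same 500 / 268 distribution; `X_demo` and `y_demo` are placeholders, not the PIMA data.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the original data (500 / 268).
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=30, stratify=y_demo)

print(x_tr.shape, x_te.shape)
print(Counter(y_tr.tolist()), Counter(y_te.tolist()))
```

Stratification keeps the class ratio of roughly 65% / 35% in both parts.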

### Models, EFD, k = 7

In [78]:
# ID3 - Default
import time
start = time.time() # For measuring time execution

model_id3 = Id3Estimator()
model_id3.fit(x_train, y_train)
# Testing
y_pred_id3 = model_id3.predict(x_test)
print(classification_report(y_test, y_pred_id3))

end = time.time()
print(f'Time for training model ID3 - default, {disc}, k = {k} is: {end - start}.') # Total time execution


              precision    recall  f1-score   support

           0       0.77      0.67      0.72       150
           1       0.51      0.62      0.56        81

    accuracy                           0.65       231
   macro avg       0.64      0.65      0.64       231
weighted avg       0.67      0.65      0.66       231

Time for training model ID3 - default, EFD, k = 7 is: 0.07412123680114746.


In [79]:
# Naive Bayes - min_categories
import time
start = time.time() # For measuring time execution
model_nb = CategoricalNB(min_categories = n_categories)
model_nb.fit(x_train, y_train)
# Testing
y_pred_nb = model_nb.predict(x_test)
model_nb.classes_
print(classification_report(y_test, y_pred_nb))
end = time.time()
print(f'Time for training model Naive Bayes - min_categories, {disc}, k = {k} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.89      0.65      0.75       150
           1       0.57      0.85      0.68        81

    accuracy                           0.72       231
   macro avg       0.73      0.75      0.72       231
weighted avg       0.78      0.72      0.73       231

Time for training model Naive Bayes - min_categories, EFD, k = 7 is: 0.008295536041259766.
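The reason for passing `min_categories` can be shown on toy data: if a bin code appears in the test rows but not in the training rows, the model needs to know about it at fit time. The arrays below are made up for the illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy training data: feature 0 only shows codes 0..2, feature 1 codes 0..1.
X_train_demo = np.array([[0, 0], [1, 1], [2, 0], [1, 1], [0, 1], [2, 0]])
y_train_demo = np.array([0, 1, 0, 1, 1, 0])
# A test row containing code 3 for feature 0, never seen during fit.
X_test_demo = np.array([[3, 1]])

# Declare the full number of categories per feature up front.
model = CategoricalNB(min_categories=[4, 2])
model.fit(X_train_demo, y_train_demo)

print(model.n_categories_)         # categories the model allocated per feature
print(model.predict(X_test_demo))  # the unseen code is handled by smoothing
```

`CategoricalNB` assumes codes run from 0 to n - 1, so `nunique()` gives the right `min_categories` only if the bin codes are contiguous and start at 0; otherwise the maximum code plus one would be the safer value.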


In [160]:
# WARNING: LONG TIME
# Knn-VDM complete code
# DONE
# Acc: 0.70
# Time for training model Knn-VDM, EFD, k = 7 is: 47.48977565765381.
import time
start = time.time() # For measuring time execution

# Specify the indices of continuous columns, if any
vdm = ValueDifferenceMetric(x_train, y_train, continuous = None)
vdm.fit()
# KNN model: n_neighbors = 3, metric = VDM
knn_vdm = KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')
## Fit model
knn_vdm.fit(x_train, y_train)
# Testing
y_pred_knn = knn_vdm.predict(x_test)
knn_vdm.classes_
print(classification_report(y_test, y_pred_knn))

end = time.time()
print(f'Time for training model Knn-VDM, {disc}, k = {k} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.82      0.69      0.75       150
           1       0.56      0.73      0.63        81

    accuracy                           0.70       231
   macro avg       0.69      0.71      0.69       231
weighted avg       0.73      0.70      0.71       231

Time for training model Knn-VDM, EFD, k = 7 is: 40.350417375564575.


In [81]:
# CROSS VALIDATION
import warnings
warnings.filterwarnings('ignore')

# param
num_folds = 10
num_repeat = 3
seed = 7
scores = 'accuracy'

print(f'Cross validation result, {scores}, {disc}, k = {k}.')

# Create list of algorithms
models = []
models.append(('ID3', Id3Estimator()))
#models.append(('RIPPER', lw.RIPPER()))
models.append(('CNB', CategoricalNB()))
models.append(('Knn-VDM', KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
  #kfold = KFold(n_splits=num_folds, shuffle = True, random_state=10)
    kfold = RepeatedKFold(n_splits=num_folds, n_repeats=num_repeat, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scores)
    results.append(cv_results)
    names.append(name)
    msg = '%s: - Mean: %f, Standard deviation: %f' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross validation result, accuracy, EFD, k = 7.
ID3: - Mean: 0.708276, Standard deviation: 0.042404
CNB: - Mean: 0.756072, Standard deviation: 0.050855
Knn-VDM: - Mean: 0.739907, Standard deviation: 0.049102


### Evaluation, EFD, k = 7

In [82]:
from sklearn.metrics import zero_one_loss
#This library is used to decompose bias and variance in our models
from mlxtend.evaluate import bias_variance_decomp
import warnings
warnings.filterwarnings('ignore')

In [83]:
# ID3
# Convert DataFrames to arrays (np.asarray leaves existing arrays unchanged, so the cell can be rerun)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)

# Evaluation
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_id3, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_id3))

Average expected loss: 0.341
Average bias: 0.307
Average variance: 0.194
Sklearn 0-1 loss: 0.346


In [84]:
# Naive Bayes - min_categories update
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_nb, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_nb))

Average expected loss: 0.293
Average bias: 0.277
Average variance: 0.074
Sklearn 0-1 loss: 0.277


In [161]:
# WARNING - LONG TIME
# Knn-VDM
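# Convert DataFrames to arrays (np.asarray leaves existing arrays unchanged,
# so this cell also works after the ID3 cell has already converted them)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)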

import time
start = time.time() # For measuring time execution
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
knn_vdm, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_knn))
end = time.time()
print(f'Computing time: {end - start}.') # Total time execution

Average expected loss: 0.294
Average bias: 0.286
Average variance: 0.134
Sklearn 0-1 loss: 0.299
Computing time: 8662.829064846039.


## 2.3 EFD, k = 10

In [162]:
# Read data
df_efd3 = pd.read_csv('pima_efd3.csv')
df_efd3.info()
disc = "EFD"
k = 10

## EDA
from collections import Counter

# Check class distribution of the target
Counter(df_efd3['Outcome'])

# Split dataset
X = df_efd3.drop(['Outcome'], axis = 1)
Y = df_efd3['Outcome']

# Split train/test: test size 30%, random state 30, stratified on the target
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 30, stratify = Y)

# Check representation of class
print('Class representation - original: ', Counter(df_efd3['Outcome'])) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ',Counter(y_test)) 
print(x_train.shape, x_test.shape)

# SMOTE
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 22)
x_train, y_train = smote.fit_resample(x_train, y_train)

print('='*25)
print('Distribution after SMOTE')
print('Class representation - training data: ', Counter(y_train))
print(x_train.shape, x_test.shape)

# Check number of categories for features
n_categories = df_efd3.drop(['Outcome'], axis = 1).nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    int64
 6   DiabetesPedigreeFunction  768 non-null    int64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: int64(9)
memory usage: 54.1 KB
Class representation - original:  Counter({0: 500, 1: 268})
Class representation - training data:  Counter({0: 350, 1: 187})
Class representation - testing data:  Counter({0: 150, 1: 81})
(537, 8) (231, 8)
Distribution after SMOTE
Class representation - training data:  Coun
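SMOTE creates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. A minimal NumPy sketch of that single step follows; the function name `smote_like_sample` and the toy rows are made up, and this is not the imblearn implementation.

```python
import numpy as np

rng = np.random.default_rng(22)

def smote_like_sample(x, neighbour, rng):
    """Return a point on the segment between x and one of its minority neighbours."""
    lam = rng.uniform(0.0, 1.0)
    return x + lam * (neighbour - x)

x = np.array([1.0, 4.0, 0.0])          # a minority-class row (bin codes)
neighbour = np.array([3.0, 4.0, 1.0])  # one of its nearest minority neighbours
synthetic = smote_like_sample(x, neighbour, rng)
print(synthetic)
```

On discretised data the interpolated values are generally not valid bin codes, and how the library casts them back to integers is not checked here. imblearn also provides `SMOTEN` and `SMOTENC` for categorical features, which may be worth comparing against.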

### Models, EFD, k = 10

In [87]:
# ID3 - Default
import time
start = time.time() # For measuring time execution

model_id3 = Id3Estimator()
model_id3.fit(x_train, y_train)
# Testing
y_pred_id3 = model_id3.predict(x_test)
print(classification_report(y_test, y_pred_id3))

end = time.time()
print(f'Time for training model ID3 - default, {disc}, k = {k} is: {end - start}.') # Total time execution


              precision    recall  f1-score   support

           0       0.79      0.63      0.70       150
           1       0.50      0.69      0.58        81

    accuracy                           0.65       231
   macro avg       0.64      0.66      0.64       231
weighted avg       0.69      0.65      0.66       231

Time for training model ID3 - default, EFD, k = 10 is: 0.08085155487060547.


In [88]:
# Naive Bayes - min_categories
import time
start = time.time() # For measuring time execution
model_nb = CategoricalNB(min_categories = n_categories)
model_nb.fit(x_train, y_train)
# Testing
y_pred_nb = model_nb.predict(x_test)
model_nb.classes_
print(classification_report(y_test, y_pred_nb))
end = time.time()
print(f'Time for training model Naive Bayes - min_categories, {disc}, k = {k} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.86      0.65      0.74       150
           1       0.55      0.80      0.65        81

    accuracy                           0.70       231
   macro avg       0.70      0.72      0.70       231
weighted avg       0.75      0.70      0.71       231

Time for training model Naive Bayes - min_categories, EFD, k = 10 is: 0.010376214981079102.


In [163]:
# WARNING: LONG TIME
# Knn-VDM complete code
# Time for training model Knn-VDM, EFD, k = 10 is: 49.08341908454895.
# Acc: 0.71

import time
start = time.time() # For measuring time execution

# Specify the indices of continuous columns, if any
vdm = ValueDifferenceMetric(x_train, y_train, continuous = None)
vdm.fit()
# KNN model: n_neighbors = 3, metric = VDM
knn_vdm = KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')
## Fit model
knn_vdm.fit(x_train, y_train)
# Testing
y_pred_knn = knn_vdm.predict(x_test)
knn_vdm.classes_
print(classification_report(y_test, y_pred_knn))

end = time.time()
print(f'Time for training model Knn-VDM, {disc}, k = {k} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.85      0.67      0.75       150
           1       0.56      0.79      0.66        81

    accuracy                           0.71       231
   macro avg       0.71      0.73      0.70       231
weighted avg       0.75      0.71      0.72       231

Time for training model Knn-VDM, EFD, k = 10 is: 45.19051241874695.


In [90]:
# CROSS VALIDATION
import warnings
warnings.filterwarnings('ignore')

# param
num_folds = 10
num_repeat = 3
seed = 7
scores = 'accuracy'

print(f'Cross validation result, {scores}, {disc}, k = {k}.')

# Create list of algorithms
models = []
models.append(('ID3', Id3Estimator()))
#models.append(('RIPPER', lw.RIPPER()))
models.append(('CNB', CategoricalNB()))
models.append(('Knn-VDM', KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
  #kfold = KFold(n_splits=num_folds, shuffle = True, random_state=10)
    kfold = RepeatedKFold(n_splits=num_folds, n_repeats=num_repeat, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scores)
    results.append(cv_results)
    names.append(name)
    msg = '%s: - Mean: %f, Standard deviation: %f' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross validation result, accuracy, EFD, k = 10.
ID3: - Mean: 0.694025, Standard deviation: 0.057013
CNB: - Mean: 0.743467, Standard deviation: 0.044930
Knn-VDM: - Mean: 0.747357, Standard deviation: 0.038183


### Evaluation, EFD, k = 10

In [91]:
from sklearn.metrics import zero_one_loss
#This library is used to decompose bias and variance in our models
from mlxtend.evaluate import bias_variance_decomp
import warnings
warnings.filterwarnings('ignore')

In [92]:
# ID3
# Convert DataFrames to arrays (np.asarray leaves existing arrays unchanged, so the cell can be rerun)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)

# Evaluation
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_id3, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_id3))

Average expected loss: 0.348
Average bias: 0.320
Average variance: 0.204
Sklearn 0-1 loss: 0.351


In [93]:
# Naive Bayes - min_categories update
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_nb, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_nb))

Average expected loss: 0.309
Average bias: 0.294
Average variance: 0.083
Sklearn 0-1 loss: 0.299


In [164]:
# WARNING - LONG TIME
# Knn-VDM
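# Convert DataFrames to arrays (np.asarray leaves existing arrays unchanged,
# so this cell also works after the ID3 cell has already converted them)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)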

import time
start = time.time() # For measuring time execution
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
knn_vdm, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_knn))
end = time.time()
print(f'Computing time: {end - start}.') # Total time execution

Average expected loss: 0.328
Average bias: 0.290
Average variance: 0.128
Sklearn 0-1 loss: 0.290
Computing time: 8546.075377464294.


# 3. FFD

## 3.1 FFD, m = 10

In [165]:
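# Read data
# FIXME: this loads the EFD k = 10 file, so every result in section 3.1 duplicates section 2.3.
# The FFD m = 10 file (presumably 'pima_ffd1.csv', by analogy with 'pima_ffd2.csv') should be read here.
df_ffd1 = pd.read_csv('pima_efd3.csv')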
df_ffd1.info()
disc = "FFD"
m = 10

## EDA
from collections import Counter

# Check class distribution of the target
Counter(df_ffd1['Outcome'])

# Split dataset
X = df_ffd1.drop(['Outcome'], axis = 1)
Y = df_ffd1['Outcome']

# Split train/test: test size 30%, random state 30, stratified on the target
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 30, stratify = Y)

# Check representation of class
print('Class representation - original: ', Counter(df_ffd1['Outcome'])) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 
print(x_train.shape, x_test.shape)

# SMOTE
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 22)
x_train, y_train = smote.fit_resample(x_train, y_train)

print('='*25)
print('Distribution after SMOTE')
print('Class representation - training data: ', Counter(y_train))
print(x_train.shape, x_test.shape)

# Check number of categories for features
n_categories = df_ffd1.drop(['Outcome'], axis = 1).nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    int64
 6   DiabetesPedigreeFunction  768 non-null    int64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: int64(9)
memory usage: 54.1 KB
Class representation - original:  Counter({0: 500, 1: 268})
Class representation - training data:  Counter({0: 350, 1: 187})
Class representation - testing data:  Counter({0: 150, 1: 81})
(537, 8) (231, 8)
Distribution after SMOTE
Class representation - training data:  Coun

### Models, FFD, m = 10

In [96]:
# ID3 - Default
import time
start = time.time() # For measuring time execution

model_id3 = Id3Estimator()
model_id3.fit(x_train, y_train)
# Testing
y_pred_id3 = model_id3.predict(x_test)
print(classification_report(y_test, y_pred_id3))

end = time.time()
print(f'Time for training model ID3 - default, {disc}, m = {m} is: {end - start}.') # Total time execution


              precision    recall  f1-score   support

           0       0.79      0.63      0.70       150
           1       0.50      0.69      0.58        81

    accuracy                           0.65       231
   macro avg       0.64      0.66      0.64       231
weighted avg       0.69      0.65      0.66       231

Time for training model ID3 - default, FFD, m = 10 is: 0.07000517845153809.


In [97]:
# Naive Bayes - min_categories
import time
start = time.time() # For measuring time execution
model_nb = CategoricalNB(min_categories = n_categories)
model_nb.fit(x_train, y_train)
# Testing
y_pred_nb = model_nb.predict(x_test)
model_nb.classes_
print(classification_report(y_test, y_pred_nb))
end = time.time()
print(f'Time for training model Naive Bayes - min_categories, {disc}, m = {m} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.86      0.65      0.74       150
           1       0.55      0.80      0.65        81

    accuracy                           0.70       231
   macro avg       0.70      0.72      0.70       231
weighted avg       0.75      0.70      0.71       231

Time for training model Naive Bayes - min_categories, FFD, m = 10 is: 0.007325649261474609.


In [166]:
# WARNING: LONG TIME
# Knn-VDM complete code
# DONE
# Time for training model Knn-VDM, FFD, m = 10 is: 43.479066610336304.
# Accuracy: 0.71

import time
start = time.time() # For measuring time execution

# Specify the indices of continuous columns, if any
vdm = ValueDifferenceMetric(x_train, y_train, continuous = None)
vdm.fit()
# KNN model: n_neighbors = 3, metric = VDM
knn_vdm = KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')
## Fit model
knn_vdm.fit(x_train, y_train)
# Testing
y_pred_knn = knn_vdm.predict(x_test)
knn_vdm.classes_
print(classification_report(y_test, y_pred_knn))

end = time.time()
print(f'Time for training model Knn-VDM, {disc}, m = {m} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.85      0.67      0.75       150
           1       0.56      0.79      0.66        81

    accuracy                           0.71       231
   macro avg       0.71      0.73      0.70       231
weighted avg       0.75      0.71      0.72       231

Time for training model Knn-VDM, FFD, m = 10 is: 39.45732259750366.


In [99]:
# CROSS VALIDATION
import warnings
warnings.filterwarnings('ignore')

# param
num_folds = 10
num_repeat = 3
seed = 7
scores = 'accuracy'

print(f'Cross validation result, {scores}, {disc}, m = {m}.')

# Create list of algorithms
models = []
models.append(('ID3', Id3Estimator()))
#models.append(('RIPPER', lw.RIPPER()))
models.append(('CNB', CategoricalNB()))
models.append(('Knn-VDM', KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
    #kfold = KFold(n_splits=num_folds, shuffle = True, random_state=10)
    kfold = RepeatedKFold(n_splits=num_folds, n_repeats=num_repeat, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scores)
    results.append(cv_results)
    names.append(name)
    msg = '%s: - Mean: %f, Standard deviation: %f' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross validation result, accuracy, FFD, m = 10.
ID3: - Mean: 0.694025, Standard deviation: 0.057013
CNB: - Mean: 0.743467, Standard deviation: 0.044930
Knn-VDM: - Mean: 0.747357, Standard deviation: 0.038183


### Evaluation, FFD, m = 10

In [100]:
from sklearn.metrics import zero_one_loss
#This library is used to decompose bias and variance in our models
from mlxtend.evaluate import bias_variance_decomp
import warnings
warnings.filterwarnings('ignore')

In [101]:
# ID3
# Convert DataFrames to arrays (np.asarray leaves existing arrays unchanged, so the cell can be rerun)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)

# Evaluation
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_id3, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_id3))

Average expected loss: 0.348
Average bias: 0.320
Average variance: 0.204
Sklearn 0-1 loss: 0.351


In [102]:
# Naive Bayes - min_categories update
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_nb, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_nb))

Average expected loss: 0.309
Average bias: 0.294
Average variance: 0.083
Sklearn 0-1 loss: 0.299


In [167]:
# WARNING - LONG TIME
# Knn-VDM
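# Convert DataFrames to arrays (np.asarray leaves existing arrays unchanged,
# so this cell also works after the ID3 cell has already converted them)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)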

import time
start = time.time() # For measuring time execution
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
knn_vdm, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_knn))
end = time.time()
print(f'Computing time: {end - start}.') # Total time execution

Average expected loss: 0.328
Average bias: 0.290
Average variance: 0.128
Sklearn 0-1 loss: 0.290
Computing time: 8650.548038959503.


## 3.2 FFD, m = 30

In [168]:
# Read data
df_ffd2 = pd.read_csv('pima_ffd2.csv')
df_ffd2.info()
disc = "FFD"
m = 30

## EDA
from collections import Counter

# Check class distribution of the target
Counter(df_ffd2['Outcome'])

# Split dataset
X = df_ffd2.drop(['Outcome'], axis = 1)
Y = df_ffd2['Outcome']

# Split train/test: test size 30%, random state 30, stratified on the target
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 30, stratify = Y)

# Check representation of class
print('Class representation - original: ', Counter(df_ffd2['Outcome'])) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 
print(x_train.shape, x_test.shape)

# SMOTE
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 22)
x_train, y_train = smote.fit_resample(x_train, y_train)

print('='*25)
print('Distribution after SMOTE')
print('Class representation - training data: ', Counter(y_train))
print(x_train.shape, x_test.shape)

# Check number of categories for features
n_categories = df_ffd2.drop(['Outcome'], axis = 1).nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    int64
 6   DiabetesPedigreeFunction  768 non-null    int64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: int64(9)
memory usage: 54.1 KB
Class representation - original:  Counter({0: 500, 1: 268})
Class representation - training data:  Counter({0: 350, 1: 187})
Class representation - testing data:  Counter({0: 150, 1: 81})
(537, 8) (231, 8)
Distribution after SMOTE
Class representation - training data:  Coun

### Models, FFD, m = 30

In [105]:
# ID3 - Default
import time
start = time.time() # For measuring time execution

model_id3 = Id3Estimator()
model_id3.fit(x_train, y_train)
# Testing
y_pred_id3 = model_id3.predict(x_test)
print(classification_report(y_test, y_pred_id3))

end = time.time()
print(f'Time for training model ID3 - default, {disc}, m = {m} is: {end - start}.') # Total time execution


              precision    recall  f1-score   support

           0       0.79      0.65      0.72       150
           1       0.51      0.68      0.59        81

    accuracy                           0.66       231
   macro avg       0.65      0.67      0.65       231
weighted avg       0.69      0.66      0.67       231

Time for training model ID3 - default, FFD, m = 30 is: 0.09880256652832031.


In [106]:
# Naive Bayes - min_categories
import time
start = time.time() # For measuring time execution
model_nb = CategoricalNB(min_categories = n_categories)
model_nb.fit(x_train, y_train)
# Testing
y_pred_nb = model_nb.predict(x_test)
model_nb.classes_
print(classification_report(y_test, y_pred_nb))
end = time.time()
print(f'Time for training model Naive Bayes - min_categories, {disc}, m = {m} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.84      0.65      0.73       150
           1       0.54      0.77      0.64        81

    accuracy                           0.69       231
   macro avg       0.69      0.71      0.68       231
weighted avg       0.73      0.69      0.70       231

Time for training model Naive Bayes - min_categories, FFD, m = 30 is: 0.0060002803802490234.


In [169]:
# WARNING: LONG TIME
# Knn-VDM complete code
# DONE
# Time for training model Knn-VDM, FFD, m = 30 is: 41.11604309082031.
# Accuracy: 0.71

import time
start = time.time() # For measuring time execution

# Specify the indices of continuous columns, if any
vdm = ValueDifferenceMetric(x_train, y_train, continuous = None)
vdm.fit()
# KNN model: n_neighbors = 3, metric = VDM
knn_vdm = KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')
## Fit model
knn_vdm.fit(x_train, y_train)
# Testing
y_pred_knn = knn_vdm.predict(x_test)
knn_vdm.classes_
print(classification_report(y_test, y_pred_knn))

end = time.time()
print(f'Time for training model Knn-VDM, {disc}, m = {m} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.85      0.68      0.76       150
           1       0.57      0.78      0.66        81

    accuracy                           0.71       231
   macro avg       0.71      0.73      0.71       231
weighted avg       0.75      0.71      0.72       231

Time for training model Knn-VDM, FFD, m = 30 is: 40.838098764419556.
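
The distance passed to KNeighborsClassifier above is a value difference metric: two category values count as close when they lead to similar class distributions. The sketch below is written from that textbook definition and is not the code of the ValueDifferenceMetric class used in this notebook. The helper names `vdm_tables` and `vdm_distance`, the exponent q = 2, and the toy data are all assumptions for illustration.

```python
import numpy as np

def vdm_tables(X, y):
    """For each feature, map every observed value to the vector P(class | feature = value)."""
    classes = np.unique(y)
    tables = []
    for j in range(X.shape[1]):
        table = {}
        for v in np.unique(X[:, j]):
            mask = X[:, j] == v
            table[v] = np.array([(y[mask] == c).mean() for c in classes])
        tables.append(table)
    return tables

def vdm_distance(a, b, tables, q=2):
    """Sum over features and classes of |P(c | a_j) - P(c | b_j)| ** q."""
    return sum(np.sum(np.abs(t[va] - t[vb]) ** q) for t, va, vb in zip(tables, a, b))

# Toy data: feature 0 determines the class, feature 1 is uninformative
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_toy = np.array([0, 0, 1, 1])
tables = vdm_tables(X_toy, y_toy)

print(vdm_distance(X_toy[0], X_toy[1], tables))  # rows differ only in the uninformative feature
print(vdm_distance(X_toy[0], X_toy[3], tables))  # rows differ in the informative feature
```

Because the metric is supplied as a Python callable, it is evaluated pair by pair, which is a plausible reason for the long run time noted in the cell.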


In [108]:
# CROSS VALIDATION
import warnings
warnings.filterwarnings('ignore')

# param
num_folds = 10
num_repeat = 3
seed = 7
scores = 'accuracy'

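# NOTE: `k` is not set in the FFD sections; it keeps a value from an earlier section, so the header prints k = 10 rather than m
print(f'Cross validation result, {scores}, {disc}, k = {k}.')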

# Create list of algorithms
models = []
models.append(('ID3', Id3Estimator()))
#models.append(('RIPPER', lw.RIPPER()))
models.append(('CNB', CategoricalNB()))
models.append(('Knn-VDM', KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
    # kfold = KFold(n_splits=num_folds, shuffle=True, random_state=10)
    kfold = RepeatedKFold(n_splits=num_folds, n_repeats=num_repeat, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scores)
    results.append(cv_results)
    names.append(name)
    msg = '%s: - Mean: %f, Standard deviation: %f' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross validation result, accuracy, FFD, k = 10.
ID3: - Mean: 0.718313, Standard deviation: 0.053633
CNB: - Mean: 0.737805, Standard deviation: 0.040628
Knn-VDM: - Mean: 0.724761, Standard deviation: 0.048285
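
With 10 folds and 3 repeats, RepeatedKFold yields 30 accuracy scores per model, and the mean and standard deviation above are taken over those 30 values. A minimal sketch on synthetic data (`X_demo` and `y_demo` are random stand-ins, not the PIMA data). It also passes `min_categories`, so that a training fold missing some category code should not trigger the index error mentioned in the summary note. Here the category count is taken as the largest code plus one, which is an assumption about how the bins are encoded.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(200, 4))   # synthetic bin codes 0..4
y_demo = rng.integers(0, 2, size=200)        # synthetic binary target

# One category count per feature (assumes codes run from 0 to max)
n_cat_demo = X_demo.max(axis=0) + 1

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=7)
scores_demo = cross_val_score(
    CategoricalNB(min_categories=n_cat_demo), X_demo, y_demo, cv=cv, scoring='accuracy'
)
print(len(scores_demo), scores_demo.mean(), scores_demo.std())
```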


### Evaluation, FFD, m = 30

In [109]:
from sklearn.metrics import zero_one_loss
#This library is used to decompose bias and variance in our models
from mlxtend.evaluate import bias_variance_decomp
import warnings
warnings.filterwarnings('ignore')

In [110]:
# ID3
# Convert all dataframe to array
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values

# Evaluation
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_id3, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_id3))

Average expected loss: 0.342
Average bias: 0.303
Average variance: 0.211
Sklearn 0-1 loss: 0.338
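
For 0-1 loss the three numbers are not expected to add up (0.303 + 0.211 is not 0.342). A minimal sketch of the usual 0-1 loss decomposition, assuming the main prediction is the majority vote over bootstrap rounds, which is how bias_variance_decomp is understood to work here. The prediction matrix is hand-made, not model output.

```python
import numpy as np

# Rows = bootstrap rounds, columns = test points (toy values, not model output)
all_pred = np.array([[0, 1, 1, 0],
                     [0, 1, 0, 0],
                     [1, 1, 0, 0]])
y_true = np.array([0, 1, 1, 1])

# Main prediction: most frequent predicted label for each test point
main_pred = np.array([np.bincount(col).argmax() for col in all_pred.T])

exp_loss = (all_pred != y_true).mean()      # error averaged over rounds and test points
bias = (main_pred != y_true).mean()         # error of the main prediction
variance = (all_pred != main_pred).mean()   # disagreement with the main prediction

print(f'loss={exp_loss:.3f}, bias={bias:.3f}, variance={variance:.3f}')
```

The "Sklearn 0-1 loss" line comes from the single model fitted on the full training set, so it can differ from the bootstrap-averaged expected loss.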


In [111]:
# Naive Bayes - min_categories update
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_nb, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_nb))

Average expected loss: 0.321
Average bias: 0.312
Average variance: 0.104
Sklearn 0-1 loss: 0.307


In [170]:
# WARNING - LONG TIME
# Knn-VDM
# Convert to NumPy arrays (np.asarray is a no-op if an earlier cell already converted them)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)

import time
start = time.time() # For measuring time execution
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
knn_vdm, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_knn))
end = time.time()
print(f'Computing time: {end - start}.') # Total time execution

Average expected loss: 0.322
Average bias: 0.299
Average variance: 0.138
Sklearn 0-1 loss: 0.286
Computing time: 8000.970769405365.
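
The long run time is roughly what one would expect if bias_variance_decomp refits and re-predicts once per bootstrap round. A back-of-the-envelope check, assuming its default of `num_rounds = 200` (the cell above does not set it) and taking the single fit-and-predict time reported earlier for Knn-VDM:

```python
single_run_s = 40.838098764419556   # Knn-VDM fit + predict time printed earlier
num_rounds = 200                    # assumed default of bias_variance_decomp

estimate_s = single_run_s * num_rounds
print(f'Estimated: {estimate_s:.0f} s (~{estimate_s / 3600:.1f} h)')
```

The estimate is close to the measured computing time, so passing a smaller `num_rounds` would be one way to shorten the run, at the cost of noisier bias and variance estimates.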


## 3.3 FFD, m = 60

In [171]:
# Read data
df_ffd3 = pd.read_csv('pima_ffd3.csv')
df_ffd3.info()
disc = "FFD"
m = 60

## EDA
from collections import Counter

# Check class distribution
Counter(df_ffd3['Outcome'])

# Split dataset
X = df_ffd3.drop(['Outcome'], axis = 1)
Y = df_ffd3['Outcome']

# Split train/test: test size 30% (test_size = 0.3), random state 30, stratified on the target
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 30, stratify = Y)

# Check representation of class
print('Class representation - original: ', Counter(df_ffd3['Outcome'])) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 
print(x_train.shape, x_test.shape)

# SMOTE
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 22)
x_train, y_train = smote.fit_resample(x_train, y_train)

print('='*25)
print('Distribution after SMOTE')
print('Class representation - training data: ', Counter(y_train))
print(x_train.shape, x_test.shape)

# Check number of categories for features
n_categories = df_ffd3.drop(['Outcome'], axis = 1).nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    int64
 6   DiabetesPedigreeFunction  768 non-null    int64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: int64(9)
memory usage: 54.1 KB
Class representation - original:  Counter({0: 500, 1: 268})
Class representation - training data:  Counter({0: 350, 1: 187})
Class representation - testing data:  Counter({0: 150, 1: 81})
(537, 8) (231, 8)
Distribution after SMOTE
Class representation - training data:  Coun
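
The shapes (537, 8) and (231, 8) correspond to a 30% test split of the 768 rows, matching `test_size = 0.3` in the cell. A quick sketch with placeholder features and labels that reproduce only the 500/268 class counts (no PIMA values are used; `X_fake` and `y_fake` are hypothetical names):

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder data: only the class counts are taken from the output above
y_fake = np.array([0] * 500 + [1] * 268)
X_fake = np.zeros((768, 8), dtype=int)

xtr, xte, ytr, yte = train_test_split(
    X_fake, y_fake, test_size=0.3, random_state=30, stratify=y_fake
)
print(xtr.shape, xte.shape)
print(Counter(ytr), Counter(yte))
```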

### Models, FFD, m = 60

In [114]:
# ID3 - Default
import time
start = time.time() # For measuring time execution

model_id3 = Id3Estimator()
model_id3.fit(x_train, y_train)
# Testing
y_pred_id3 = model_id3.predict(x_test)
print(classification_report(y_test, y_pred_id3))

end = time.time()
print(f'Time for training model ID3 - default, {disc}, m = {m} is: {end - start}.') # Total time execution


              precision    recall  f1-score   support

           0       0.79      0.68      0.73       150
           1       0.53      0.67      0.59        81

    accuracy                           0.68       231
   macro avg       0.66      0.67      0.66       231
weighted avg       0.70      0.68      0.68       231

Time for training model ID3 - default, FFD, m = 60 is: 0.06999921798706055.


In [115]:
# Naive Bayes - min_categories
import time
start = time.time() # For measuring time execution
model_nb = CategoricalNB(min_categories = n_categories)
model_nb.fit(x_train, y_train)
# Testing
y_pred_nb = model_nb.predict(x_test)
model_nb.classes_
print(classification_report(y_test, y_pred_nb))
end = time.time()
print(f'Time for training model Naive Bayes - min_categories, {disc}, m = {m} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.84      0.65      0.73       150
           1       0.54      0.77      0.63        81

    accuracy                           0.69       231
   macro avg       0.69      0.71      0.68       231
weighted avg       0.73      0.69      0.70       231

Time for training model Naive Bayes - min_categories, FFD, m = 60 is: 0.009998798370361328.


In [172]:
# WARNING: LONG TIME
# Knn-VDM complete code
# DONE, Accuracy: 0.69
# Time for training model Knn-VDM, FFD, m = 60 is: 44.30944085121155.
import time
start = time.time() # For measuring time execution

# Specify the indices of the continuous columns, if any
vdm = ValueDifferenceMetric(x_train, y_train, continuous = None)
vdm.fit()
# Knn model: n_neighbors = 3, metric = VDM distance
knn_vdm = KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')
## Fit model
knn_vdm.fit(x_train, y_train)
# Testing
y_pred_knn = knn_vdm.predict(x_test)
knn_vdm.classes_
print(classification_report(y_test, y_pred_knn))

end = time.time()
print(f'Time for training model Knn-VDM, {disc}, m = {m} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.82      0.67      0.74       150
           1       0.55      0.73      0.62        81

    accuracy                           0.69       231
   macro avg       0.68      0.70      0.68       231
weighted avg       0.72      0.69      0.70       231

Time for training model Knn-VDM, FFD, m = 60 is: 40.311824560165405.


In [117]:
# CROSS VALIDATION
import warnings
warnings.filterwarnings('ignore')

# param
num_folds = 10
num_repeat = 3
seed = 7
scores = 'accuracy'

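# NOTE: `k` is not set in the FFD sections; it keeps a value from an earlier section, so the header prints k = 10 rather than m
print(f'Cross validation result, {scores}, {disc}, k = {k}.')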

# Create list of algorithms
models = []
models.append(('ID3', Id3Estimator()))
#models.append(('RIPPER', lw.RIPPER()))
models.append(('CNB', CategoricalNB()))
models.append(('Knn-VDM', KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
    # kfold = KFold(n_splits=num_folds, shuffle=True, random_state=10)
    kfold = RepeatedKFold(n_splits=num_folds, n_repeats=num_repeat, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scores)
    results.append(cv_results)
    names.append(name)
    msg = '%s: - Mean: %f, Standard deviation: %f' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross validation result, accuracy, FFD, k = 10.
ID3: - Mean: 0.709211, Standard deviation: 0.048297
CNB: - Mean: 0.757331, Standard deviation: 0.036820
Knn-VDM: - Mean: 0.744304, Standard deviation: 0.037072


### Evaluation, FFD, m = 60

In [118]:
from sklearn.metrics import zero_one_loss
#This library is used to decompose bias and variance in our models
from mlxtend.evaluate import bias_variance_decomp
import warnings
warnings.filterwarnings('ignore')

In [119]:
# ID3
# Convert all dataframe to array
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values

# Evaluation
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_id3, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_id3))

Average expected loss: 0.348
Average bias: 0.312
Average variance: 0.192
Sklearn 0-1 loss: 0.325


In [120]:
# Naive Bayes - min_categories update
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_nb, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_nb))

Average expected loss: 0.318
Average bias: 0.312
Average variance: 0.088
Sklearn 0-1 loss: 0.312


In [174]:
# WARNING - LONG TIME
# Knn-VDM
# Convert to NumPy arrays (np.asarray is a no-op if an earlier cell already converted them)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_test = np.asarray(y_test)

import time
start = time.time() # For measuring time execution
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
knn_vdm, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_knn))
end = time.time()
print(f'Computing time: {end - start}.') # Total time execution

Average expected loss: 0.329
Average bias: 0.307
Average variance: 0.121
Sklearn 0-1 loss: 0.307
Computing time: 54188.53180384636.


## 3.4 FFD, m = 100

In [None]:
# Read data
df_ffd4 = pd.read_csv('pima_ffd4.csv')
df_ffd4.info()
disc = "FFD"
m = 100

## EDA
from collections import Counter

# Check class distribution
Counter(df_ffd4['Outcome'])

# Split dataset
X = df_ffd4.drop(['Outcome'], axis = 1)
Y = df_ffd4['Outcome']

# Split train/test: test size 30% (test_size = 0.3), random state 30, stratified on the target
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 30, stratify = Y)

# Check representation of class
print('Class representation - original: ', Counter(df_ffd4['Outcome'])) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 
print(x_train.shape, x_test.shape)

# SMOTE
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 22)
x_train, y_train = smote.fit_resample(x_train, y_train)

print('='*25)
print('Distribution after SMOTE')
print('Class representation - training data: ', Counter(y_train))
print(x_train.shape, x_test.shape)

# Check number of categories for features
n_categories = df_ffd4.drop(['Outcome'], axis = 1).nunique()
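
SMOTE builds each synthetic minority row by interpolating between a minority row and one of its nearest minority neighbours. The sketch below shows only that core step in NumPy with made-up bin codes. It is not imblearn's implementation, and how a library casts the result back to integer codes is outside its scope. On discretised features the raw interpolated value can fall between two bin codes, so it may be worth inspecting the resampled `x_train` before fitting a categorical model.

```python
import numpy as np

rng = np.random.default_rng(22)

x_i = np.array([2, 5, 1])    # a minority row (made-up bin codes)
x_nn = np.array([4, 5, 3])   # one of its minority-class neighbours (made-up)

lam = rng.uniform(0, 1)            # random position along the segment between the two rows
x_new = x_i + lam * (x_nn - x_i)   # synthetic row, generally not integer-valued
print(x_new)
```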

### Models, FFD, m = 100

In [123]:
# ID3 - Default
import time
start = time.time() # For measuring time execution

model_id3 = Id3Estimator()
model_id3.fit(x_train, y_train)
# Testing
y_pred_id3 = model_id3.predict(x_test)
print(classification_report(y_test, y_pred_id3))

end = time.time()
print(f'Time for training model ID3 - default, {disc}, m = {m} is: {end - start}.') # Total time execution


              precision    recall  f1-score   support

           0       0.78      0.68      0.73       150
           1       0.52      0.65      0.58        81

    accuracy                           0.67       231
   macro avg       0.65      0.67      0.66       231
weighted avg       0.69      0.67      0.68       231

Time for training model ID3 - default, FFD, m = 60 is: 0.05915355682373047.


In [124]:
# Naive Bayes - min_categories
import time
start = time.time() # For measuring time execution
model_nb = CategoricalNB(min_categories = n_categories)
model_nb.fit(x_train, y_train)
# Testing
y_pred_nb = model_nb.predict(x_test)
model_nb.classes_
print(classification_report(y_test, y_pred_nb))
end = time.time()
print(f'Time for training model Naive Bayes - min_categories, {disc}, m = {m} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.86      0.64      0.74       150
           1       0.55      0.81      0.66        81

    accuracy                           0.70       231
   macro avg       0.71      0.73      0.70       231
weighted avg       0.75      0.70      0.71       231

Time for training model Naive Bayes - min_categories, FFD, m = 60 is: 0.0.
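
The reported time of 0.0 suggests the fit finished within the resolution of `time.time()` on this machine. A hedged alternative is `time.perf_counter()`, which is intended for measuring short intervals. The `timed` helper below is a hypothetical name, not part of the notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f'{label}: {time.perf_counter() - start:.6f} s')

# Usage: wrap any block, for example a model fit
with timed('toy workload'):
    total = sum(i * i for i in range(100_000))
```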


In [125]:
# WARNING: LONG TIME
# Knn-VDM complete code
import time
start = time.time() # For measuring time execution

# Specify the indices of the continuous columns, if any
vdm = ValueDifferenceMetric(x_train, y_train, continuous = None)
vdm.fit()
# Knn model: n_neighbors = 3, metric = VDM distance
knn_vdm = KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')
## Fit model
knn_vdm.fit(x_train, y_train)
# Testing
y_pred_knn = knn_vdm.predict(x_test)
knn_vdm.classes_
print(classification_report(y_test, y_pred_knn))

end = time.time()
print(f'Time for training model Knn-VDM, {disc}, m = {m} is: {end - start}.') # Total time execution

              precision    recall  f1-score   support

           0       0.79      0.69      0.73       150
           1       0.53      0.65      0.59        81

    accuracy                           0.68       231
   macro avg       0.66      0.67      0.66       231
weighted avg       0.70      0.68      0.68       231

Time for training model Knn-VDM, FFD, m = 60 is: 42.83824014663696.


In [126]:
# CROSS VALIDATION
import warnings
warnings.filterwarnings('ignore')

# param
num_folds = 10
num_repeat = 3
seed = 7
scores = 'accuracy'

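# NOTE: `k` is not set in the FFD sections; it keeps a value from an earlier section, so the header prints k = 10 rather than m
print(f'Cross validation result, {scores}, {disc}, k = {k}.')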

# Create list of algorithms
models = []
models.append(('ID3', Id3Estimator()))
#models.append(('RIPPER', lw.RIPPER()))
models.append(('CNB', CategoricalNB()))
models.append(('Knn-VDM', KNeighborsClassifier(n_neighbors=3, metric=vdm.get_distance, algorithm='brute')))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
    # kfold = KFold(n_splits=num_folds, shuffle=True, random_state=10)
    kfold = RepeatedKFold(n_splits=num_folds, n_repeats=num_repeat, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scores)
    results.append(cv_results)
    names.append(name)
    msg = '%s: - Mean: %f, Standard deviation: %f' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Cross validation result, accuracy, FFD, k = 10.
ID3: - Mean: 0.703554, Standard deviation: 0.053018
CNB: - Mean: 0.749089, Standard deviation: 0.040728
Knn-VDM: - Mean: 0.737828, Standard deviation: 0.040830
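
To compare the three FFD settings side by side, the cross-validation means printed for m = 30, 60 and 100 can be collected into one structure. A small sketch using only the figures reported in the outputs above:

```python
# Mean CV accuracy copied from the cross-validation outputs, keyed by m
cv_mean = {
    30:  {'ID3': 0.718313, 'CNB': 0.737805, 'Knn-VDM': 0.724761},
    60:  {'ID3': 0.709211, 'CNB': 0.757331, 'Knn-VDM': 0.744304},
    100: {'ID3': 0.703554, 'CNB': 0.749089, 'Knn-VDM': 0.737828},
}

for m_value, row in cv_mean.items():
    best = max(row, key=row.get)   # model with the highest mean accuracy for this m
    summary = ', '.join(f'{name}={acc:.3f}' for name, acc in row.items())
    print(f'm = {m_value:>3}: {summary} | best: {best}')
```

CNB has the highest mean for every m, although the gaps are mostly within about one reported standard deviation and should not be over-read.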


### Evaluation, FFD, m = 100

In [127]:
from sklearn.metrics import zero_one_loss
#This library is used to decompose bias and variance in our models
from mlxtend.evaluate import bias_variance_decomp
import warnings
warnings.filterwarnings('ignore')

In [128]:
# ID3
# Convert all dataframe to array
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values

# Evaluation
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_id3, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_id3))

Average expected loss: 0.344
Average bias: 0.286
Average variance: 0.205
Sklearn 0-1 loss: 0.329


In [129]:
# Naive Bayes - min_categories update
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
model_nb, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_nb))

Average expected loss: 0.304
Average bias: 0.299
Average variance: 0.072
Sklearn 0-1 loss: 0.299


In [132]:
# WARNING - LONG TIME
# Knn-VDM
import time
start = time.time() # For measuring time execution
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
knn_vdm, x_train, y_train, x_test, y_test,
loss='0-1_loss',
random_seed=123)
#---
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
print('Sklearn 0-1 loss: %.3f' % zero_one_loss(y_test,y_pred_knn))
end = time.time()
print(f'Computing time: {end - start}.') # Total time execution

Average expected loss: 0.323
Average bias: 0.316
Average variance: 0.147
Sklearn 0-1 loss: 0.325
Computing time: 9363.133692979813.
