We will create a model that predicts whether or not a person has diabetes. The target column in the diabetes.csv dataset is "Outcome". We assume that no features leak information about the target.

To develop the model, we'll do the following:

1. Feature engineering
2. Model fitting and performance evaluation
3. A function that takes as arguments: a model, train data, test data, and returns the model's predictions on the test data
4. A function that takes a set of predictions and true values and that validates the predictions using appropriate metrics
5. Anything else that is necessary for modelling or improving the performance of the model


In [1]:
# First we load the necessary libraries. For this problem, we are going to use the KNN classifier, which is a standard
#  and popular classifier.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from sklearn import datasets


In [3]:
# Data: the features (columns) are shown below. The same steps can be adapted to any other dataframe
# (read in from a csv or other file).


diabetes = pd.read_csv("test_diabetes.csv", sep=";")
diabetes      # Let's have a look at the data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,,148.0,72.0,35.0,0,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,0,26.6,0.351,31.0,0
2,8.0,183.0,64.0,0.0,0,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94,28.1,0.167,21.0,0
4,0.0,,40.0,35.0,168,43.1,2.288,,1
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,,180,32.9,0.171,63.0,0
764,2.0,122.0,70.0,27.0,Zero,36.8,0.340,27.0,0
765,5.0,121.0,72.0,23.0,112,26.2,0.245,30.0,N
766,1.0,126.0,60.0,0.0,Zero,30.1,0.349,47.0,1


In [4]:
# Count how many NaN values exist in each column

diabetes.isna().sum()

# Unless the client recommends otherwise:
# 1- NaN for Pregnancies may mean sex = male, so we can change these values to 0.
# 2- NaN or 0 for Age is not acceptable. We will thus remove the corresponding rows from the data.
# 3- NaN and 0 (or 'Zero') in the other columns are not acceptable and will be replaced with the median value
# of the same column.

Pregnancies                 37
Glucose                     38
BloodPressure               34
SkinThickness               34
Insulin                     51
BMI                         35
DiabetesPedigreeFunction    40
Age                         51
Outcome                      0
dtype: int64
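`isna()` only counts true NaN entries, while the plan above also treats 0 and 'Zero' as missing. The following is a minimal sketch of a per-column report that counts both; the helper names `zero_like_mask` and `missing_report` and the toy frame are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the diabetes data (values are illustrative only).
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 0.0, 89.0],
    "Insulin": ["0", "Zero", "94", "168"],
})


def zero_like_mask(col):
    """Flag entries that spell zero as a number or as the word 'Zero'."""
    text = col.astype(str).str.strip().str.lower()
    return text.isin(["zero", "0", "0.0"])


def missing_report(df):
    """Count NaN entries and zero-like entries per column."""
    return pd.DataFrame({
        "nan": df.isna().sum(),
        "zero_like": df.apply(zero_like_mask).sum(),
    })


report = missing_report(toy)
print(report)
```

Run on the real frame, such a report would show how many values the median replacement in the later cells is going to overwrite.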

In [5]:
# First replace the letters N and Y in the Outcome column with 0 and 1

negative = ['N','No','n','no']
positive = ['Y', 'Yes', 'y', 'yes']
diabetes_filtered = diabetes.copy()

diabetes_filtered['Outcome'] = diabetes_filtered['Outcome'].replace(negative, 0)
diabetes_filtered['Outcome'] = diabetes_filtered['Outcome'].replace(positive, 1)
# acceptable_outcome = [0 , 1]
# diabetes_filtered = diabetes_filtered[diabetes_filtered.Outcome.isin(acceptable_outcome)]

diabetes_filtered

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,,148.0,72.0,35.0,0,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,0,26.6,0.351,31.0,0
2,8.0,183.0,64.0,0.0,0,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94,28.1,0.167,21.0,0
4,0.0,,40.0,35.0,168,43.1,2.288,,1
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,,180,32.9,0.171,63.0,0
764,2.0,122.0,70.0,27.0,Zero,36.8,0.340,27.0,0
765,5.0,121.0,72.0,23.0,112,26.2,0.245,30.0,0
766,1.0,126.0,60.0,0.0,Zero,30.1,0.349,47.0,1


In [6]:
# Then, let's replace NaN and 'Zero' in Pregnancies with 0

zeros = ['Zero', 'zero', '0']

diabetes_filtered['Pregnancies'] = diabetes_filtered['Pregnancies'].replace(np.nan, 0)  # np.nan: the np.NaN alias was removed in NumPy 2.0
diabetes_filtered['Pregnancies'] = diabetes_filtered['Pregnancies'].replace(zeros, 0)

In [7]:
# And, let's convert 'Zero' to 0 in the Age column and remove all rows that have NaN in this column.
# We'll also remove all implausible age values (0, negative and >100)

diabetes_filtered = diabetes_filtered.dropna(subset=['Age'])

diabetes_filtered['Age'] = diabetes_filtered['Age'].replace(zeros, 0)
accepted_values = diabetes_filtered['Age'].between(1,100) 
diabetes_filtered = diabetes_filtered[accepted_values] # we may instead use df[(x <= df['columnX']) & (df['columnX'] <= y)]


diabetes_filtered

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  diabetes_filtered['Age'] = diabetes_filtered['Age'].replace(zeros, 0)


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,0.0,148.0,72.0,35.0,0,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,0,26.6,0.351,31.0,0
2,8.0,183.0,64.0,0.0,0,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94,28.1,0.167,21.0,0
5,5.0,116.0,74.0,,Zero,25.6,0.201,30.0,0
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,,180,32.9,0.171,63.0,0
764,2.0,122.0,70.0,27.0,Zero,36.8,0.340,27.0,0
765,5.0,121.0,72.0,23.0,112,26.2,0.245,30.0,0
766,1.0,126.0,60.0,0.0,Zero,30.1,0.349,47.0,1
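The SettingWithCopyWarning above appears because a column is assigned on a frame that pandas considers a possible slice of another frame. One common way to avoid it is to take an explicit `.copy()` after each filtering step. This is a sketch on a toy frame with made-up values, not the notebook's own data.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing, one zero and one implausibly high age.
toy = pd.DataFrame({
    "Age": [50.0, np.nan, 0.0, 31.0, 130.0],
    "Outcome": [1, 0, 1, 0, 1],
})

# .copy() makes the filtered frame independent of the original,
# so the column assignment below is unambiguous.
cleaned = toy.dropna(subset=["Age"]).copy()
cleaned["Age"] = cleaned["Age"].astype(float)

# Keep only ages between 1 and 100 (inclusive), again as an explicit copy.
cleaned = cleaned.loc[cleaned["Age"].between(1, 100)].copy()
print(cleaned)
```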


In [8]:
# Examining the data shows that it needs more preprocessing. Some important values are
# missing or are set to zero (like skin thickness!). They can't be 0 or contain strings.
# So, first we replace zeros with an appropriate value, such as the median of the population (from the same column).


columns_nonzero = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction']
                  
zeros = ['Zero', 'zero', '0', 0] 

for c in columns_nonzero:

    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    diabetes_filtered[c] = diabetes_filtered[c].replace(zeros, np.nan)
    median = diabetes_filtered[c].median(skipna = True)
    diabetes_filtered[c] = diabetes_filtered[c].replace(np.nan, median)

# Recheck that the data are clean
diabetes_filtered

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,0.0,148.0,72.0,35.0,126,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,126,26.6,0.351,31.0,0
2,8.0,183.0,64.0,29.0,126,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94,28.1,0.167,21.0,0
5,5.0,116.0,74.0,29.0,126,25.6,0.201,30.0,0
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,29.0,180,32.9,0.171,63.0,0
764,2.0,122.0,70.0,27.0,126,36.8,0.340,27.0,0
765,5.0,121.0,72.0,23.0,112,26.2,0.245,30.0,0
766,1.0,126.0,60.0,29.0,126,30.1,0.349,47.0,1
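The cell above computes each median over all rows, before the train/test split. A possible alternative, sketched here under the assumption that scikit-learn's `SimpleImputer` is acceptable for this task, is to learn the medians from the training rows only and reuse them on the test rows. The data below is synthetic and the variable names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the numeric feature matrix, with about 10% missing cells.
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=15.0, size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Medians are learned from the training rows only ...
imputer = SimpleImputer(strategy="median")
X_tr_filled = imputer.fit_transform(X_tr)
# ... and then applied unchanged to the test rows.
X_te_filled = imputer.transform(X_te)

print(imputer.statistics_)  # one training median per column
```

Whether this changes the scores noticeably on a dataset of this size is not something the notebook measures.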


In [9]:
# Convert every column to a numeric dtype; values left as strings would otherwise crash the training process
diabetes_filtered = diabetes_filtered.apply(pd.to_numeric) 
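By default `pd.to_numeric` raises an error when it meets a string it cannot parse. If unexpected tokens are possible, a hedged variant is to coerce them to NaN and inspect which cells were affected. The toy frame and the `bad_cells` name are illustrative only.

```python
import pandas as pd

# Toy frame: one cell holds a token that is not a number.
toy = pd.DataFrame({
    "Insulin": ["94", "168", "n/a"],
    "BMI": [33.6, 26.6, 23.3],
})

# errors="coerce" turns unparsable entries into NaN instead of raising.
converted = toy.apply(pd.to_numeric, errors="coerce")

# Cells that were present before conversion but are NaN afterwards.
bad_cells = converted.isna() & toy.notna()
print(converted.dtypes)
print(bad_cells.sum())
```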


In [10]:
# *** We may further investigate and then tweak our data if necessary using these commands

# diabetes_filtered.info()
# np.array(diabetes_filtered.Outcome) 
# diabetes_filtered = diabetes_filtered.astype({'Pregnancies':int, 'Age':int})

In [11]:
#  Now we divide the variables (columns) into input and output, and split the dataset into train and test sets

X = diabetes_filtered.iloc[:, 0:8]
y = diabetes_filtered.iloc[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.2)
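The classes are imbalanced (the test set reported further down holds 102 negatives and 41 positives), so a plain random split can shift the class ratio between train and test. A sketch of a stratified split, which the original notebook does not use, on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/30 class ratio.
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(y_tr.mean(), y_te.mean())
```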

In [12]:
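# It is recommended to scale all the variables: KNN and SVM rely on distances between samples,
# so features on larger scales would otherwise dominate. The scaler is fit on the training set only.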

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [13]:
# Numbers of neighbors from 9 to 15 were tried, and 13 gave the best classification accuracy.
# Other classifiers like SVM may get a better result.

classifier = KNeighborsClassifier(n_neighbors = 13, p = 2, metric = 'euclidean')
classifier.fit(X_train, y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=13)
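The notebook does not show how the values of k from 9 to 15 were compared. One way to run such a comparison without touching the test set is cross-validation on the training data. This is a sketch on synthetic data; the `scores` and `best_k` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (8 features, binary target).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = {}
for k in range(9, 16):
    # Scaling sits inside the pipeline so each fold is scaled on its own training part.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```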

In [14]:
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm, f1_score(y_test, y_pred), accuracy_score(y_test, y_pred)

(array([[80, 22],
        [18, 23]]),
 0.5348837209302325,
 0.7202797202797203)
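Steps 3 and 4 of the plan ask for a prediction function and a validation function, whereas the cells here do both inline. A minimal sketch of how the two could look follows. The names `fit_and_predict` and `validate_predictions` are hypothetical, the data is synthetic, and passing the training labels as a separate argument is an assumption about what "train data" means.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def fit_and_predict(model, X_train, y_train, X_test):
    """Fit a fresh copy of `model` on the train data and predict the test data."""
    fitted = clone(model).fit(X_train, y_train)
    return fitted.predict(X_test)


def validate_predictions(y_true, y_pred):
    """Return common binary-classification metrics as a dictionary."""
    return {
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Usage on synthetic data standing in for the scaled diabetes features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

preds = fit_and_predict(KNeighborsClassifier(n_neighbors=13), X_tr, y_tr, X_te)
report = validate_predictions(y_te, preds)
print(report)
```

Because the model is cloned before fitting, the same two functions could be reused for the SVM without any state carried over from an earlier fit.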

In [15]:
# Now let's see how the SVM works on this problem

from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train,y_train)

SVC()

In [16]:
# Let's predict on the test data and then evaluate the model

predictions = svc_model.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

[[85 17]
 [18 23]]
              precision    recall  f1-score   support

           0       0.83      0.83      0.83       102
           1       0.57      0.56      0.57        41

    accuracy                           0.76       143
   macro avg       0.70      0.70      0.70       143
weighted avg       0.75      0.76      0.75       143



In [None]:
# Well, SVM does better than KNN, especially in terms of F1-score (0.57 vs. 0.53 for the positive class).

Now let's see if we can tune the parameters to get even better results (unlikely, and you would probably be
satisfied with these results in real life because the data set is quite small, but we just want to
practice using GridSearch).

In [17]:
#  First import GridSearchCV from scikit-learn

from sklearn.model_selection import GridSearchCV

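# Then we create a dictionary called param_grid and fill in some values for C and gamma.
# Note: 000.1 evaluates to 0.1, so gamma=0.1 appears twice in the grid (0.0001 may have been intended).

param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'gamma': [1,0.1,0.01,0.001, 000.1]} 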

# And we also create a GridSearchCV object and fit it to the training data.

grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=2)
grid.fit(X_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV] C=0.01, gamma=1 .................................................
[CV] .................................. C=0.01, gamma=1, total=   0.0s
[CV] C=0.01, gamma=1 .................................................
[CV] .................................. C=0.01, gamma=1, total=   0.0s
[CV] C=0.01, gamma=1 .................................................
[CV] .................................. C=0.01, gamma=1, total=   0.0s
[CV] C=0.01, gamma=1 .................................................
[CV] .................................. C=0.01, gamma=1, total=   0.0s
[CV] C=0.01, gamma=1 .................................................
[CV] .................................. C=0.01, gamma=1, total=   0.0s
[CV] C=0.01, gamma=0.1 ...............................................
[CV] ................................ C=0.01, gamma=0.1, total=   0.0s
[CV] C=0.01, gamma=0.1 ...............................................
[CV] ..........

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV] ................................... C=0.1, gamma=1, total=   0.0s
[CV] C=0.1, gamma=1 ..................................................
[CV] ................................... C=0.1, gamma=1, total=   0.0s
[CV] C=0.1, gamma=1 ..................................................
[CV] ................................... C=0.1, gamma=1, total=   0.0s
[CV] C=0.1, gamma=1 ..................................................
[CV] ................................... C=0.1, gamma=1, total=   0.0s
[CV] C=0.1, gamma=0.1 ................................................
[CV] ................................. C=0.1, gamma=0.1, total=   0.0s
[CV] C=0.1, gamma=0.1 ................................................
[CV] ................................. C=0.1, gamma=0.1, total=   0.0s
[CV] C=0.1, gamma=0.1 ................................................
[CV] ................................. C=0.1, gamma=0.1, total=   0.0s
[CV] C=0.1, gamma=0.1 ................................................
[CV] .

[CV] ................................... C=100, gamma=1, total=   0.0s
[CV] C=100, gamma=1 ..................................................
[CV] ................................... C=100, gamma=1, total=   0.0s
[CV] C=100, gamma=1 ..................................................
[CV] ................................... C=100, gamma=1, total=   0.0s
[CV] C=100, gamma=0.1 ................................................
[CV] ................................. C=100, gamma=0.1, total=   0.0s
[CV] C=100, gamma=0.1 ................................................
[CV] ................................. C=100, gamma=0.1, total=   0.0s
[CV] C=100, gamma=0.1 ................................................
[CV] ................................. C=100, gamma=0.1, total=   0.0s
[CV] C=100, gamma=0.1 ................................................
[CV] ................................. C=100, gamma=0.1, total=   0.0s
[CV] C=100, gamma=0.1 ................................................
[CV] .

[Parallel(n_jobs=1)]: Done 125 out of 125 | elapsed:    1.2s finished


GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.01, 0.1, 1, 10, 100],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.1]},
             verbose=2)
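After fitting, the search object exposes which combination won and how it scored in cross-validation. The sketch below uses synthetic data and a smaller grid than the one above, and it selects on F1 via `scoring="f1"`, which is an assumption (the notebook's search uses the default scoring).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A reduced, illustrative grid: 3 values of C times 2 values of gamma.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.01]}
grid = GridSearchCV(SVC(), param_grid, scoring="f1", cv=5)
grid.fit(X, y)

print(grid.best_params_)   # winning combination of C and gamma
print(grid.best_score_)    # its mean cross-validated F1
```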

In [18]:
# Now it is time to take the fitted grid model, create predictions on the test set, and print the
# confusion matrix and classification report for them.

grid_predictions = grid.predict(X_test)

print(confusion_matrix(y_test,grid_predictions))
print(classification_report(y_test,grid_predictions))

[[85 17]
 [17 24]]
              precision    recall  f1-score   support

           0       0.83      0.83      0.83       102
           1       0.59      0.59      0.59        41

    accuracy                           0.76       143
   macro avg       0.71      0.71      0.71       143
weighted avg       0.76      0.76      0.76       143



In [None]:
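# Not a big improvement: one more positive case is classified correctly (24 vs. 23), so the class-1 F1-score
# rises from 0.57 to 0.59 while accuracy stays at 0.76. Still, this performance is acceptable for such a small training dataset.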