<b>Introduction and Motivation</b>

Blood glucose level is a decisive factor in determining whether a patient has diabetes, however, other than blood glucose level, the patient’s age, physical conditions, and certain symptoms can potentially help with the diagnosis of diabetes. Early diagnosis of diabetes can increase the likelihood of treatment and can greatly reduce the health complications of the patients. 
Furthermore, identifying key predictors of diabetes can help doctors prioritize the examinations and diagnose diabetes faster and easier. 
Our main goal is to make a classifier that can predict diabetes solely based on known symptoms and physical condition of the patients with at least 80% accuracy. Finally, we would also like to produce a model that identifies the key predictors of diabetes. 

<b>About our Dataset</b>

Our chosen data set is ‘The early-stage diabetes risk prediction’. It contains 17 attributes all of which except age are binary. This includes one class attribute which accepts two values: Negative or Positive. The class attribute represents whether or not the patient has diabetes. Our data has 520 instances.

<b>Previous Work</b>

In the last stage, we focused on data pre-processing, EDA, and reporting preliminary results. 

<b>Current Problem</b>

Our problem is best framed as a binary classification that predicts whether a patient is or soon will be diabetic, based on their symptoms as well as other information like their age and gender.

In this stage of the project, we mainly focus on implementing three different algorithms and techniques to classify our diabetes data. Since python has useful libraries for data analysis, we chose it for our project.



In [16]:
import numpy as np # mathematical operations and algebra
import pandas as pd # data processing, CSV file I/O
#import matplotlib.pyplot as plt # visualization library
from tree import *
import scipy.stats as stats

After importing the required libraries, we import the dataset from the CSV file into a padas data frame for further processing.

<b>IMPORTING DATA</b>

Objective:
<ul>
<li> Import data from CSV file into a padas DataFrame.</li>
</ul>

In [17]:
data = pd.read_csv (r'diabetes_data_upload.csv')
labels = ['Age','Gender','Polyuria','Polydipsia','sudden weight loss','weakness','Polyphagia',
'Genital thrush','visual blurring','Itching','Irritability','delayed healing','partial paresis',
'muscle stiffness','Alopecia','Obesity','class']
df = pd.DataFrame(data, columns= labels)

After all the data is imported, we proceed to data cleaning. In this process, we will make sure our dataset is free of all missing/ incomplete values, duplicate data, noise, outliers and wrong data as much as possible.

<b>DATA CLEANING</b>

Objective:
<ul>
<li> Check if the data contains any null, missing, duplicate.</li>
<li> If yes take appopriate action.</li>
</ul>

In [18]:
print(f'No missing values in data set: {not df.isnull().values.any()}')

No missing values in data set: True


Since we have no missing values, no action is required. Our data is clean and we can proceed with the next step.

In [19]:
# duplicate data

print(f'The total number of dumplicated instances (counting the original) is {sum(df.duplicated(keep=False))}')
print(f'The total number of dumplicated instances  (not counting the original) is: {sum(df.duplicated())}')

The total number of dumplicated instances (counting the original) is 376
The total number of dumplicated instances  (not counting the original) is: 269


At first it seems like 269 out of 520 is a lot of duplicated values.
Lets drop age as it is an obvious tie breaker between two rows. Then recalculate the duplicate values.

In [20]:
df_without_age = df.drop(['Age'], axis=1)
print(f'Number of duplicate instances counting the original is {sum(df_without_age.duplicated(keep=False))}')
print(f'Number of duplicate instances NOT counting the original is {sum(df_without_age.duplicated())}')

Number of duplicate instances counting the original is 407
Number of duplicate instances NOT counting the original is 305


It's a lot more now (as expected) so let's take a look at some of these duplicate values:

In [21]:
duplicated_data = df[df.duplicated()].sort_values(by='Age')
print(duplicated_data[1:5])

     Age Gender Polyuria Polydipsia sudden weight loss weakness Polyphagia  \
374   27   Male       No         No                 No       No         No   
286   27   Male       No         No                 No       No         No   
465   27   Male       No         No                 No       No         No   
474   27   Male       No         No                 No       No         No   

    Genital thrush visual blurring Itching Irritability delayed healing  \
374             No              No      No           No              No   
286             No              No      No           No              No   
465             No              No      No           No              No   
474             No              No      No           No              No   

    partial paresis muscle stiffness Alopecia Obesity     class  
374              No               No       No      No  Negative  
286              No               No       No      No  Negative  
465              No               N

So most of these instances are simply copies of a few 'common' cases. This can be illustrated if we take a look at some of these instances.

We concluded that our data is clean and free of outliers, and those so-called duplicates are not actual duplicates but just a naturally higher rate of occurrence of likely cases. For example, people age 27 with no health conditions or symptoms are common, and this is to be expected.

After all the preprocessing work is done, we can implement 3 classification algorithms one by one to our dataset. 

<br>
<b>DECISION TREE METHOD</b>

Objectives:
<ul>
<li> Create a decision tree and display it as a graph.</li> 
<li>  Evaluate its accuracy of this tree by testing it with a random subset of the data that wasn't a part of training data.</li> 
</ul>

In [23]:
# discretize the age into a few categories
df_discretize = df.copy(deep=True)
minAge = df_discretize.Age.min()
maxAge = df_discretize.Age.max()
range = maxAge - minAge
df_discretize.Age = pd.cut(df["Age"],
       bins=[minAge, minAge + range/3, minAge + 2*range/3, maxAge], 
       labels=["Young", "Adult", "Old"])

We discretize age since it was a continuous attribute and a decision tree can be built using discrete attributes only.

We built our model based on many different approaches to discretization. Practically, we found the equal-width binning approach to yield the most accurate result.

We tested different bin sizes but 3 seemed to work best with our data.

In [24]:
# divide the discretized data into training and test 
df_discretize_test = df_discretize.sample(n = 50, replace = False) # change test number from here
df_discretize_training = df_discretize.copy(deep=True)
df_discretize_training = df_discretize_training.drop(df_discretize_test.index)

We separated the instances into training and test. Out of 520 instances, 50 are randomly selected for test and the rest are our training data.

In [25]:
# decision Tree
root = buildDecisionTree(data=df_discretize_training, classAttribute='class')
# draw it as a graph
buildGraph(root).view()
# calculate its accuracy (evaluation using test data)
print("Accuracy: ",evaluateTree(root,df_discretize_test,'class'))

Accuracy:  0.98


<b>KNN METHOD</b>

Objectives:
<ul>
<li> Classify test data using KNN method</li> 
<li>  Evaluate the accuracy of this method by testing it with a random subset of the data that wasn't a part of training data.</li> 
</ul>

In [26]:
# normalize age (max min)
normalized_age = (df.Age - df.Age.min()) / (df.Age.max() - df.Age.min())
df_normalized = df.copy(deep=True)
df_normalized.Age = normalized_age 

We normalized the age since we don't want it to dominate the distance. Other attributes are binary and no normalization is needed.

Then just like before, we separated the instances into training and test. Out of 520 instances, 50 are randomly selected for test and the rest are our training data.

In [27]:
# divide the normalized data into training and test 
df_normalized_test = df_normalized.sample(n = 50, replace = False) # change test number from here
df_normalized_training = df_normalized # no copy is needed because df_normalized is never used again
df_normalized_training = df_normalized_training.drop(df_normalized_test.index)

To compute the distance between two data points, the euclidean distance of their Age is calculated. For other attributes, the binary distance is calculated. We then sum the two distances and divide them by two to get the total distance.

In [28]:
attr_labels = ['Gender', 'Polyuria','Polydipsia','sudden weight loss','weakness','Polyphagia',
'Genital thrush','visual blurring','Itching','Irritability','delayed healing','partial paresis',
'muscle stiffness','Alopecia','Obesity']
success = 0
for index, data in df_normalized_test.iterrows():
    distance = 0
    for label in attr_labels:
        distance += df_normalized_training[label] != data[label]
    distance = distance / len(attr_labels) 
    distance += ((df_normalized_training.Age - data.Age) ** 2)**(1/2)
    distance = distance / 2
    df_normalized_training['distance'] = distance
    knn = df_normalized_training.sort_values(by=['distance']).head(10) # change k value here
    # print(knn) #uncomment to see the k nearest neighbours
    if knn['class'].value_counts().idxmax() == data['class']:
        success += 1

print("Accuracy: ", success / df_normalized_test.shape[0])

Accuracy:  0.92


<b>BAYES METHOD</b>

Objectives:
<ul>
<li> Classify test data using BAYES method</li> 
<li>  Evaluate the accuracy of this method by testing it with a random subset of the data that wasn't a part of training data.</li> 
</ul>
<br>
We can use the same training and test data that we used for decision tree method here because we need discrete attributes for this method as well.

In [29]:
def prob(data, value, attr, givenValue = None, givenAttr = None): # fuction for probability calculation 
    if (givenValue is not None and givenAttr is not None):
        data = data[data[givenAttr] == givenValue]
    value_count = data[attr].value_counts()
    return value_count[value]/sum(value_count)

labels = ['Age', 'Gender', 'Polyuria','Polydipsia','sudden weight loss','weakness','Polyphagia',
'Genital thrush','visual blurring','Itching','Irritability','delayed healing','partial paresis',
'muscle stiffness','Alopecia','Obesity']
success = 0
p_positive = prob(df_discretize_training, 'Positive', 'class') 
p_negative = 1 - p_positive
for index, data in df_discretize_test.iterrows():
    for label in labels:
        value = data[label]
        p_positive *= prob(df_discretize_training, value, label, 'Positive', 'class')
        p_negative *= prob(df_discretize_training, value, label, 'Negative', 'class')
    classification = "Positive" if p_positive > p_negative else "Negative"
    if data['class'] == classification:
        success += 1

print("Accuracy: ", success / df_discretize_test.shape[0])
 

Accuracy:  0.72


The accuracy of our algorithm is generated dynamically, it depends on which samples are fetched in this run time. We are showing the best result returned by python and there is a possibility that the accuracy can be higher.
Until now, we have compared 3 different algorithms in terms of accuracy to classify our data set. Out of curiosity, we also use Weka as a reference to compare with our algorithms. Based on the property of Weka, the accuracy output can be considered a standard criterion. Weka is setting default with 10 folds cross-validation.

<br><center><b>Decision tree – J48 algorithm (95% accuracy)</b></center><br>
<center><img src="images/dteval.jpg" style="height: 700px; width:900px;"></center>

<center><img src="images/tree.jpg" style="height: 700px; width:900px;"></center>

<br><center><b>KNN algorithm – Lazy IBK (90% accuracy)</b></center><br>
<center><img src="images/knn.jpg" style="height: 700px; width:900px;"></center>

<br><center><b>Bayes – NaiveBayes (87% accuracy)</b></center><br>
<center><img src="images/bayes.jpg" style="height: 700px; width:900px;"></center>

By comparing our algorithms with Weka, we can tell that our performance is almost the same in terms of decision tree and KNN algorithms. In Naïve Bayes, the Weka has slightly higher accuracy due to the more advanced algorithms, we can probably do better with that the next time. 
In terms of accuracy evaluation, we also compare the time complexity of our different algorithms. Bayes runs fastest because it only involves simple calculation, while the decision tree runs slower due to it has to build models for each attribute. KNN is the lowest one because it takes time to let the code decide how to group instances and decide which area it wants to choose. For boosting, since we’ve already done data cleaning and preprocessing, no missing values lead to the same process time in terms of boosting algorithm. 

In conclusion, we cleaned data and import it to python to process, and implement decision tree, KNN, and Bayes algorithms to classify the diabetes dataset with an accuracy rate of 96%, 90%, and 72% respectively.

