# Early stage diabetes risk prediction 

This dataset contains sign and symptom data for newly diagnosed diabetic patients and for patients at risk of developing diabetes. The data were collected through direct questionnaires given to patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh, and approved by a doctor.

Attribute Information:

- Age: 20-65
- Sex: Male, Female
- Polyuria: Yes, No
- Polydipsia: Yes, No
- Sudden weight loss: Yes, No
- Weakness: Yes, No
- Polyphagia: Yes, No
- Genital thrush: Yes, No
- Visual blurring: Yes, No
- Itching: Yes, No
- Irritability: Yes, No
- Delayed healing: Yes, No
- Partial paresis: Yes, No
- Muscle stiffness: Yes, No
- Alopecia: Yes, No
- Obesity: Yes, No
- Class: Positive, Negative

This dataset was downloaded from [UCI's Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset.).

Islam, MM Faniqul, et al. 'Likelihood prediction of diabetes at early stage using data mining techniques.' Computer Vision and Machine Intelligence in Medical Image Analysis. Springer, Singapore, 2020. 113-125.

## Importing Libraries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Loading and analyzing the dataset 

In [2]:
df = pd.read_csv("diabetes_data_upload.csv")
df.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,Male,No,Yes,No,Yes,No,No,No,Yes,No,Yes,No,Yes,Yes,Yes,Positive
1,58,Male,No,No,No,Yes,No,No,Yes,No,No,No,Yes,No,Yes,No,Positive
2,41,Male,Yes,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,Yes,No,Positive
3,45,Male,No,No,Yes,Yes,Yes,Yes,No,Yes,No,Yes,No,No,No,No,Positive
4,60,Male,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Positive


In [3]:
df.shape

(520, 17)

The dataset has 520 rows and 17 columns.

In [4]:
df["class"].value_counts()

Positive    320
Negative    200
Name: class, dtype: int64

Out of the 520 patients, 320 were diagnosed as diabetic and 200 were not.
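The class counts also give a useful reference point: a model that always predicts the majority class sets a floor for accuracy. A minimal sketch of that arithmetic, with the counts copied from the output above (the variable names are illustrative only):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"Positive": 320, "Negative": 200}, name="class")

# Share of each class among the 520 rows.
proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = proportions.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any classifier trained later should beat this baseline by a clear margin to be considered useful.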

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 17 columns):
Age                   520 non-null int64
Gender                520 non-null object
Polyuria              520 non-null object
Polydipsia            520 non-null object
sudden weight loss    520 non-null object
weakness              520 non-null object
Polyphagia            520 non-null object
Genital thrush        520 non-null object
visual blurring       520 non-null object
Itching               520 non-null object
Irritability          520 non-null object
delayed healing       520 non-null object
partial paresis       520 non-null object
muscle stiffness      520 non-null object
Alopecia              520 non-null object
Obesity               520 non-null object
class                 520 non-null object
dtypes: int64(1), object(16)
memory usage: 69.2+ KB


## Exploratory Data Analysis [EDA] 

All 17 columns are kept: 16 serve as features and `class` is the prediction target.

### Checking for null values 

In [6]:
df.isnull().sum()

Age                   0
Gender                0
Polyuria              0
Polydipsia            0
sudden weight loss    0
weakness              0
Polyphagia            0
Genital thrush        0
visual blurring       0
Itching               0
Irritability          0
delayed healing       0
partial paresis       0
muscle stiffness      0
Alopecia              0
Obesity               0
class                 0
dtype: int64

### Changing categorical column values to 0 and 1. 

In [7]:
df = df.replace("No", 0)
df.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,Male,0,Yes,0,Yes,0,0,0,Yes,0,Yes,0,Yes,Yes,Yes,Positive
1,58,Male,0,0,0,Yes,0,0,Yes,0,0,0,Yes,0,Yes,0,Positive
2,41,Male,Yes,0,0,Yes,Yes,0,0,Yes,0,Yes,0,Yes,Yes,0,Positive
3,45,Male,0,0,Yes,Yes,Yes,Yes,0,Yes,0,Yes,0,0,0,0,Positive
4,60,Male,Yes,Yes,Yes,Yes,Yes,0,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Positive


In [8]:
df = df.replace("Yes", 1)
df.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,Male,0,1,0,1,0,0,0,1,0,1,0,1,1,1,Positive
1,58,Male,0,0,0,1,0,0,1,0,0,0,1,0,1,0,Positive
2,41,Male,1,0,0,1,1,0,0,1,0,1,0,1,1,0,Positive
3,45,Male,0,0,1,1,1,1,0,1,0,1,0,0,0,0,Positive
4,60,Male,1,1,1,1,1,0,1,1,1,1,1,1,1,1,Positive
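The two `replace` calls can also be expressed as a single mapping. The sketch below uses a small hypothetical frame (`toy`) with the same Yes/No encoding rather than the real data, and makes the resulting integer dtype explicit:

```python
import pandas as pd

# Toy frame with the same Yes/No encoding as the symptom columns.
toy = pd.DataFrame({
    "Polyuria": ["No", "No", "Yes"],
    "Polydipsia": ["Yes", "No", "No"],
})

# Map both labels in one pass and cast the result to integers.
yes_no = {"No": 0, "Yes": 1}
encoded = toy.apply(lambda col: col.map(yes_no)).astype(int)
print(encoded)
```

Using `map` with an explicit dictionary turns any unexpected label into a missing value, so the `astype(int)` cast fails loudly instead of leaving a stray string in the column.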


### Dummy coding Gender and class columns

In [9]:
male = pd.get_dummies(df["Gender"], drop_first=True)
male.head()

Unnamed: 0,Male
0,1
1,1
2,1
3,1
4,1


In [10]:
positive = pd.get_dummies(df["class"], drop_first=True)
positive.head()

Unnamed: 0,Positive
0,1
1,1
2,1
3,1
4,1


In [11]:
df = pd.concat([df, male, positive], axis=1)
df.tail()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class,Male,Positive
515,39,Female,1,1,1,0,1,0,0,1,0,1,1,0,0,0,Positive,0,1
516,48,Female,1,1,1,1,1,0,0,1,1,1,1,0,0,0,Positive,0,1
517,58,Female,1,1,1,1,1,0,1,0,0,0,1,1,0,1,Positive,0,1
518,32,Female,0,0,0,1,0,0,1,1,0,1,0,0,1,0,Negative,0,0
519,42,Male,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Negative,1,0


In [12]:
df = df.drop(["Gender", "class"], axis=1)
df.head()

Unnamed: 0,Age,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,Male,Positive
0,40,0,1,0,1,0,0,0,1,0,1,0,1,1,1,1,1
1,58,0,0,0,1,0,0,1,0,0,0,1,0,1,0,1,1
2,41,1,0,0,1,1,0,0,1,0,1,0,1,1,0,1,1
3,45,0,0,1,1,1,1,0,1,0,1,0,0,0,0,1,1
4,60,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1


## Data Split 

### Importing Libraries

In [13]:
from sklearn.model_selection import train_test_split

### Performing data split 

In [14]:
X = df.drop("Positive", axis=1)
y = df["Positive"]

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, y_train.shape)

(416, 16) (416,)


In [16]:
print(X_test.shape, y_test.shape)

(104, 16) (104,)


### Performing feature scaling 

In [17]:
from sklearn.preprocessing import StandardScaler

In [18]:
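sc = StandardScaler()
# Learn the mean and standard deviation from the training set only
X_train = sc.fit_transform(X_train)
# Reuse the training statistics on the test set instead of refitting
X_test = sc.transform(X_test)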
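The usual convention is to fit the scaler on the training split only and reuse its statistics for the test split, so that both splits go through the same transformation. A minimal sketch on synthetic data (the `_demo` arrays are stand-ins, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the train and test feature matrices.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(80, 2))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test_demo)        # apply the same statistics to test data

# Training columns are centred exactly; test columns only approximately.
print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

Calling `fit_transform` on the test split would instead rescale it with its own mean and standard deviation, so the same raw value could map to different scaled values in the two splits.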

## Logistic Regression Model 

In [19]:
from sklearn.linear_model import LogisticRegression

In [20]:
model = LogisticRegression()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

### Analyzing the model 

In [21]:
from sklearn.metrics import classification_report, confusion_matrix

In [22]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.87      0.88        38
           1       0.93      0.94      0.93        66

    accuracy                           0.91       104
   macro avg       0.91      0.90      0.91       104
weighted avg       0.91      0.91      0.91       104



In [23]:
print(confusion_matrix(y_test, y_pred))

[[33  5]
 [ 4 62]]


The model has an accuracy of 91% and correctly classifies 95 of the 104 samples in the test set.
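The figures in the classification report can be recovered from the confusion matrix by hand. The sketch below copies the matrix printed above and assumes scikit-learn's layout, where rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[33, 5],
               [4, 62]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # correct predictions over all predictions
precision_pos = tp / (tp + fp)       # how many predicted positives are truly positive
recall_pos = tp / (tp + fn)          # how many true positives were found

print(f"accuracy  = {accuracy:.2f}")
print(f"precision = {precision_pos:.2f}")
print(f"recall    = {recall_pos:.2f}")
```

The diagonal entries (33 and 62) are the correct predictions, which is where the count of 95 out of 104 comes from.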

## Decision Tree Classifier 

A decision tree is a flowchart-like structure in which each internal node tests a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome.
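To make the flowchart idea concrete, scikit-learn's `export_text` can print the rules a fitted tree has learned. This is a sketch on hypothetical toy data (two binary symptoms, with a label that simply equals the first one), not on the diabetes dataset:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: two binary symptoms; the label equals the first symptom.
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_toy = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Print the learned rules as an indented text flowchart.
print(export_text(tree, feature_names=["Polyuria", "Polydipsia"]))
```

Each indented line in the printout is a branch (a threshold test on one feature), and each `class:` line is a leaf.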

In [24]:
from sklearn.tree import DecisionTreeClassifier

In [25]:
model = DecisionTreeClassifier()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [26]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.92      0.96        38
           1       0.96      1.00      0.98        66

    accuracy                           0.97       104
   macro avg       0.98      0.96      0.97       104
weighted avg       0.97      0.97      0.97       104



In [27]:
print(confusion_matrix(y_test, y_pred))

[[35  3]
 [ 0 66]]


## Random Forest Classifier 

Random forest is a tree-based method that ensembles multiple individual decision trees.
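The ensemble can be inspected directly: a fitted `RandomForestClassifier` exposes its member trees in `estimators_`, and its predicted probabilities are the average of the per-tree probabilities. A sketch on synthetic data (the dataset and `n_estimators=25` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted member is an ordinary decision tree.
print(len(forest.estimators_))

# Average the per-tree class probabilities and compare with the forest's own output.
per_tree = np.stack([t.predict_proba(X_demo[:5]) for t in forest.estimators_])
avg_proba = per_tree.mean(axis=0)
print(np.allclose(avg_proba, forest.predict_proba(X_demo[:5])))
```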

In [28]:
from sklearn.ensemble import RandomForestClassifier

In [29]:
model = RandomForestClassifier()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [30]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.97        38
           1       0.97      1.00      0.99        66

    accuracy                           0.98       104
   macro avg       0.99      0.97      0.98       104
weighted avg       0.98      0.98      0.98       104



In [31]:
print(confusion_matrix(y_test, y_pred))

[[36  2]
 [ 0 66]]


## Support Vector Machines (SVM) Classifier 

SVMs construct a hyperplane, or a set of hyperplanes, in a high-dimensional feature space, which can be used for classification and regression problems.
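For intuition, the separating hyperplane is easiest to see with a linear kernel, where it can be read off as `w . x + b = 0`. The sketch below deliberately uses `kernel="linear"` on hypothetical toy points; `SVC()` with default settings uses an RBF kernel, for which `coef_` is not available:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy points.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 1, 1])

svm = SVC(kernel="linear").fit(X_toy, y_toy)

# For a linear kernel the separating hyperplane is w . x + b = 0.
w, b = svm.coef_[0], svm.intercept_[0]
scores = X_toy @ w + b

# The sign of the score decides the predicted class.
print(scores.round(2))
print(svm.predict(X_toy))
```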

In [32]:
from sklearn.svm import SVC

In [33]:
model = SVC()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [34]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.92      0.92        38
           1       0.95      0.95      0.95        66

    accuracy                           0.94       104
   macro avg       0.94      0.94      0.94       104
weighted avg       0.94      0.94      0.94       104



In [35]:
print(confusion_matrix(y_test, y_pred))

[[35  3]
 [ 3 63]]


## K-Nearest Neighbors (KNN) Classifier 

K-nearest neighbors finds the k training samples closest to a new sample (by Euclidean distance with scikit-learn's default settings) and, for classification, predicts the majority class among them.
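For classification, the neighbor lookup and vote can be sketched in a few lines of NumPy. The function `knn_predict` and the toy points below are hypothetical and only illustrate the idea; `KNeighborsClassifier` is the implementation actually used in this notebook:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3))
```

Because the prediction depends on raw distances, features on larger scales would dominate the vote, which is one reason the features are standardized before fitting.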

In [36]:
from sklearn.neighbors import KNeighborsClassifier

In [37]:
model = KNeighborsClassifier()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [38]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.95      0.86        38
           1       0.97      0.85      0.90        66

    accuracy                           0.88       104
   macro avg       0.87      0.90      0.88       104
weighted avg       0.90      0.88      0.89       104



In [39]:
print(confusion_matrix(y_test, y_pred))

[[36  2]
 [10 56]]


All classifiers above were trained with their default parameters. Changing those parameters can improve or worsen a model's performance.
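One common way to explore parameter choices is a cross-validated grid search. The sketch below runs on synthetic data, and the grid values are illustrative assumptions rather than tuned recommendations for this dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=16, random_state=1)

# Hypothetical grid; the values are illustrative only.
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}

# Evaluate every combination with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The search should be fit on the training split only, so that the held-out test set remains an unbiased check of the chosen parameters.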

From the results obtained, the Decision Tree and Random Forest classifiers have the best accuracy (97% and 98%). The KNN classifier has the lowest accuracy, 88%, which is still good. These figures come from a single 104-sample test split, so the differences between models may partly reflect the small size of the dataset.
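With a dataset this small, a single train/test split can give a noisy picture, and k-fold cross-validation is a common way to see how much the accuracy varies. A minimal sketch on synthetic data of similar size and class balance (the data, seed, and fold count are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with 520 rows, 16 features, and roughly a 38/62 class split.
X_demo, y_demo = make_classification(
    n_samples=520, n_features=16, weights=[0.38, 0.62], random_state=1
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(RandomForestClassifier(random_state=1), X_demo, y_demo, cv=cv)

print(scores.round(3))
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Comparing the mean and spread of the fold scores across models gives a steadier basis for ranking them than one test split.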