# Early Stage Diabetes Risk Prediction Dataset

Data Set Information:
The data were collected through direct questionnaires given to patients of Sylhet Diabetes
Hospital in Sylhet, Bangladesh, and approved by a doctor.

### Attribute Information:
- Age: 20-65

- Sex: 1. Male, 2. Female

All of the following features take the values 1: Present, 2: Not present
- Polyuria (Excessive excretion of urine resulting in profuse and frequent micturition) 
- Polydipsia (excessive thirst and fluid intake)
- Sudden weight loss 
- Weakness 
- Polyphagia (excessive eating)
- Genital thrush (candidiasis infection)
- Visual blurring 
- Itching 
- Irritability 
- Delayed healing of wounds
- partial paresis (weakening of muscles)
- muscle stiffness
- Alopecia (hair loss)
- Obesity
- class

Class (target variable): 1. Positive, 2. Negative.

### Citation
Islam, MM Faniqul, et al. 'Likelihood prediction of diabetes at early stage using data mining techniques.' Computer Vision and Machine Intelligence in Medical Image Analysis. Springer, Singapore, 2020. 113-125.

https://www.kaggle.com/ishandutta/early-stage-diabetes-risk-prediction-dataset

### Research Questions
1. What conditions are most frequent for diabetes patients?
2. What is the age distribution of patients?
3. What is the average age for each condition present?
4. How are conditions related to gender?
5. Build a model for the prediction of diabetes.

##  Import libraries

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
# to divide train and test set
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC 
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read dataset
dataset = pd.read_csv('diabetes_data_upload.csv')

In [3]:
# View dataset contents
dataset.head()

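| | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | Male | No | Yes | No | Yes | No | No | No | Yes | No | Yes | No | Yes | Yes | Yes | Positive |
| 1 | 58 | Male | No | No | No | Yes | No | No | Yes | No | No | No | Yes | No | Yes | No | Positive |
| 2 | 41 | Male | Yes | No | No | Yes | Yes | No | No | Yes | No | Yes | No | Yes | Yes | No | Positive |
| 3 | 45 | Male | No | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | Positive |
| 4 | 60 | Male | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Positive |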


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Age                 520 non-null    int64 
 1   Gender              520 non-null    object
 2   Polyuria            520 non-null    object
 3   Polydipsia          520 non-null    object
 4   sudden weight loss  520 non-null    object
 5   weakness            520 non-null    object
 6   Polyphagia          520 non-null    object
 7   Genital thrush      520 non-null    object
 8   visual blurring     520 non-null    object
 9   Itching             520 non-null    object
 10  Irritability        520 non-null    object
 11  delayed healing     520 non-null    object
 12  partial paresis     520 non-null    object
 13  muscle stiffness    520 non-null    object
 14  Alopecia            520 non-null    object
 15  Obesity             520 non-null    object
 16  class               520 no

## Wrangling and Feature Engineering

The dataset is quite clean. Nevertheless, we need to replace the Yes/No and Positive/Negative labels with 1/0, which is easier to use for ML and visualization. We will also create a new feature that groups patients by age.

In [5]:
#Replace labels
dataset.replace({'Yes':'1','No':'0','Positive':'1','Negative':'0'},inplace=True)

In [6]:
# Cast the recoded columns to integer
binary_cols = ['Polyuria', 'Polydipsia', 'sudden weight loss',
       'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring',
       'Itching', 'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity', 'class']
dataset[binary_cols] = dataset[binary_cols].astype(int)

In [7]:
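# Categorize age into groups
# With right=False each bin includes its lower edge and excludes its upper edge:
# [-1, 0), [0, 18), [18, 40), [40, 65), [65, 110)
bins = [-1, 0, 18, 40, 65, 110]
labels = ['unknown','Pediatric 0-18','Young Adult 19-39','Adult 40-64', 'Ederly > 65']
dataset['Age Group'] = pd.cut(dataset['Age'], bins=bins, labels=labels, right=False)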
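
Because `right=False` makes every bin closed on the left and open on the right, boundary ages fall into the higher group, so the label text and the actual intervals differ slightly. A minimal sketch on a few hand-picked ages, reusing the bins and labels from the cell above; the `ages` series is illustrative only and not taken from the dataset:

```python
import pandas as pd

# Same bins and labels as in the notebook cell above
bins = [-1, 0, 18, 40, 65, 110]
labels = ['unknown', 'Pediatric 0-18', 'Young Adult 19-39', 'Adult 40-64', 'Ederly > 65']

# Hypothetical boundary ages, chosen to show where each edge lands
ages = pd.Series([17, 18, 39, 40, 64, 65])
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

for age, group in zip(ages, groups):
    print(age, '->', group)
```

Here 18 is assigned to 'Young Adult 19-39' and 65 to 'Ederly > 65', which is worth keeping in mind when reading the group counts.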

## 1. What conditions are most frequent among patients diagnosed with diabetes?

In [25]:
# Filter dataset with patients that are diabetes positive
positive_series = dataset.loc[dataset['class'] == 1][['Polyuria', 'Polydipsia', 'sudden weight loss',
       'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring',
       'Itching', 'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity','Gender']]

In [27]:
# Create a dataframe with the number of positive patients reporting each condition
positive_q = len(positive_series)
conditions_df = positive_series.drop(columns=['Gender']).sum().to_frame('Positive')

In [28]:
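# Calculate the percentage of diabetes-positive patients with each condition
# Note: 'No Diabetes' holds the positive patients WITHOUT the condition, not non-diabetic patients
conditions_df['No Diabetes'] = positive_q - conditions_df.Positive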

conditions_df['Percentage'] = conditions_df.Positive / positive_q
conditions_df = conditions_df.sort_values(by='Positive')
conditions_df.index.rename('Condition',inplace=True)
conditions_df.reset_index(inplace=True)

In [29]:
conditions_df

| | Condition | Positive | No Diabetes | Percentage |
|---|---|---|---|---|
| 0 | Obesity | 61 | 259 | 0.190625 |
| 1 | Alopecia | 78 | 242 | 0.24375 |
| 2 | Genital thrush | 83 | 237 | 0.259375 |
| 3 | Irritability | 110 | 210 | 0.34375 |
| 4 | muscle stiffness | 135 | 185 | 0.421875 |
| 5 | delayed healing | 153 | 167 | 0.478125 |
| 6 | Itching | 154 | 166 | 0.48125 |
| 7 | visual blurring | 175 | 145 | 0.546875 |
| 8 | sudden weight loss | 188 | 132 | 0.5875 |
| 9 | Polyphagia | 189 | 131 | 0.590625 |


There are 14 distinct conditions in the dataset, and seven of them are present in more than 50% of the diabetes-positive patients. Contrary to conventional wisdom, obesity is the least frequent condition among these patients, present in about 19% of them.
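
Since each condition column is coded 0/1, the share of positive patients with a condition is simply the column mean. A minimal sketch on a made-up four-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data: 1 = condition present, 0 = absent; class 1 = diabetes positive
toy = pd.DataFrame({
    'Polyuria': [1, 1, 0, 0],
    'Obesity':  [0, 1, 0, 1],
    'class':    [1, 1, 1, 0],
})

# Keep positive patients only, then average each 0/1 column
positive = toy.loc[toy['class'] == 1].drop(columns='class')
share = positive.mean().sort_values()
print(share)
```

This gives the same quantity as the `Percentage` column in one step, without the intermediate counts.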

In [103]:
# Plot conditions present in diabetes patients
fig = px.bar(conditions_df, x='Percentage', y='Condition', text='Percentage', width=1000,height=600,title="Conditions present in positive diabetes patients")
fig.update_traces(texttemplate='%{text:.2%}', textposition='inside')
fig.update_layout(
    uniformtext_minsize=12, 
    uniformtext_mode='hide',
    autosize=False,
    width=1000,
    height=800,
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4,
    ),
    yaxis=dict(
        title_text='Condition',
        title_font=dict(size=16)
    ),
    xaxis=dict(
        title_text="Percentage of diabetes-positive patients with condition",
        title_font=dict(size=16)
    ),
    paper_bgcolor="white"
)
fig.show()

## 2. What is the age distribution of diabetic patients?

In [31]:
# Set up dataframe with age groups
age_group_df = pd.DataFrame(dataset.groupby('Age Group')['class'].agg(['sum', 'count']))
age_group_df.rename(columns={'sum':'Positive for diabetes','count':'persons'},inplace=True)
age_group_df

| Age Group | Positive for diabetes | persons |
|---|---|---|
| unknown | 0 | 0 |
| Pediatric 0-18 | 1 | 1 |
| Young Adult 19-39 | 84 | 143 |
| Adult 40-64 | 197 | 319 |
| Ederly > 65 | 38 | 57 |


In [32]:
age_group_df.drop(['unknown','Pediatric 0-18'],inplace=True)
age_group_df.reset_index(inplace=True)
age_group_df

| | Age Group | Positive for diabetes | persons |
|---|---|---|---|
| 0 | Young Adult 19-39 | 84 | 143 |
| 1 | Adult 40-64 | 197 | 319 |
| 2 | Ederly > 65 | 38 | 57 |
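
Raw counts hide how the positive rate varies by age group. A minimal sketch that divides positives by group size, using the three rows shown above typed in by hand; the `Percentage` column name is an assumption:

```python
import pandas as pd

# Counts copied from the table above
age_group_df = pd.DataFrame({
    'Age Group': ['Young Adult 19-39', 'Adult 40-64', 'Ederly > 65'],
    'Positive for diabetes': [84, 197, 38],
    'persons': [143, 319, 57],
})

# Share of each age group that tested positive
age_group_df['Percentage'] = (
    age_group_df['Positive for diabetes'] / age_group_df['persons']
)
print(age_group_df.round(3))
```

On these counts the positive rate rises with age group, from about 59% to about 67%.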


In [33]:
age_positive = pd.DataFrame(dataset.loc[dataset['class'] == 1]['Age'])

In [34]:
# Plot cases by age
fig = px.histogram(age_positive, x='Age', width=1000,height=600,title='Total distribution by Age',nbins=20)
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

In [100]:
fig = px.bar(age_group_df, x='Age Group', y='persons', text='persons', width=1000,height=600,title="Patients by Age Group")
fig.update_traces(textposition='inside')
fig.update_layout(
    uniformtext_minsize=12, 
    uniformtext_mode='hide',
    autosize=False,
    width=1000,
    height=800,
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4,
    ),
    yaxis=dict(
        title_text='Number of patients',
        title_font=dict(size=16)
    ),
    xaxis=dict(
        title_text="Age group",
        title_font=dict(size=16)
    ),
    paper_bgcolor="white"
)
fig.show()
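
The average age per condition, listed as research question 3 at the top, has no cell of its own. One possible sketch on toy data, with made-up values and a hypothetical `avg_age` result name:

```python
import pandas as pd

# Toy data with made-up values; 1 = condition present
toy = pd.DataFrame({
    'Age':      [30, 50, 70, 40],
    'Polyuria': [1, 1, 0, 0],
    'Obesity':  [0, 1, 1, 0],
})

conditions = ['Polyuria', 'Obesity']
# For each condition, average the age of the rows where it is present
avg_age = pd.Series(
    {c: toy.loc[toy[c] == 1, 'Age'].mean() for c in conditions}
)
print(avg_age)
```

On the real data the same comprehension could run over the 14 condition columns of `dataset`.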

## 3. How are symptoms related to gender?

In [36]:
gender_cond_df = pd.melt(positive_series, value_vars=['Polyuria', 'Polydipsia', 'sudden weight loss',
       'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring',
       'Itching', 'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity'], id_vars=['Gender'],var_name='Condition')

In [37]:
gender_cond_group = pd.DataFrame(gender_cond_df.groupby(['Gender','Condition'])['value'].sum()).sort_values(by='value',ascending=False).reset_index()

In [38]:
gender_cond_group.head()

| | Gender | Condition | value |
|---|---|---|---|
| 0 | Female | Polyuria | 129 |
| 1 | Female | Polydipsia | 125 |
| 2 | Female | partial paresis | 122 |
| 3 | Female | weakness | 117 |
| 4 | Female | sudden weight loss | 114 |
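
These counts depend on how many women and men are in the positive group, so they do not show by themselves which gender is more likely to report a condition. A sketch of a per-gender rate on toy data (values are made up for illustration):

```python
import pandas as pd

# Toy positive-patient frame; 1 = condition present
toy = pd.DataFrame({
    'Gender':   ['Female', 'Female', 'Female', 'Male'],
    'Polyuria': [1, 1, 0, 1],
    'Alopecia': [0, 0, 0, 1],
})

# Mean of a 0/1 column within each gender = share of that gender with the condition
rates = toy.groupby('Gender')[['Polyuria', 'Alopecia']].mean()
print(rates)
```

Plotting rates instead of counts would make the two bars per condition directly comparable.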


In [98]:
# Plot conditions by Gender
fig = px.bar(gender_cond_group, x='Condition', y='value', color='Gender', text='value', width=1000,height=600,title="Conditions by Gender",barmode='group',
            color_discrete_map={
        'Male': 'cornflowerblue',
        'Female': 'coral'
    })
fig.update_traces(textposition='inside')
fig.update_layout(
    uniformtext_minsize=12, 
    uniformtext_mode='hide',
    autosize=False,
    width=1000,
    height=800,
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4,
    ),
    yaxis=dict(
        title_text='Patients diagnosed with diabetes',
        title_font=dict(size=16)
    ),
    xaxis=dict(
        title_text="Condition",
        title_font=dict(size=16)
    ),
    paper_bgcolor="white"
)
fig.show()

# Machine learning model

In [40]:
# Create features and target variable sets
X = dataset.drop(columns=['class','Age Group'])
X = pd.get_dummies(X,drop_first=True)
y = dataset['class']

In [41]:
# Features
X.columns

Index(['Age', 'Polyuria', 'Polydipsia', 'sudden weight loss', 'weakness',
       'Polyphagia', 'Genital thrush', 'visual blurring', 'Itching',
       'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity', 'Gender_Male'],
      dtype='object')

In [42]:
#Feature shape
X.shape

(520, 16)

In [43]:
# Target variable
y.shape

(520,)

In [44]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=5)
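
With only 520 rows, a plain random split can shift the class balance between the train and test sets. A hedged sketch of a stratified split on synthetic labels; `stratify` is an existing option of `train_test_split`, but using it here is a suggestion rather than what the notebook does:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 60% positive
X_demo = pd.DataFrame({'feature': np.arange(100)})
y_demo = pd.Series([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=5, stratify=y_demo
)

# Both parts keep the 60/40 class ratio
print(y_tr.mean(), y_te.mean())
```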

## Support Vector Classifier - 1st attempt

In [45]:
# Instantiate a Support Vector Classifier
svc_model = SVC()
# Fit the SVC instance
svc_model.fit(X_train, y_train)

SVC()

In [46]:
# Predict values
y_predict = svc_model.predict(X_test)
# Create a confusion matrix
cm = confusion_matrix(y_test, y_predict)
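
`confusion_matrix` orders rows and columns by sorted label value, so with 0/1 targets the first row holds the actual negatives. A tiny check on hand-written labels (not notebook data):

```python
from sklearn.metrics import confusion_matrix

# Hand-written example: two actual negatives, three actual positives
y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 1]

cm_demo = confusion_matrix(y_true, y_hat)
# Rows are actual classes, columns are predicted classes, both ordered [0, 1]
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(tn, fp, fn, tp)
```

Any axis labels attached to the matrix therefore need to list the negative class first.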

In [82]:
# Plot the confusion matrix
# confusion_matrix orders classes as [0, 1], so the negative class comes first
labels = ['Non Diabetes','Diabetes']
fig = ff.create_annotated_heatmap(cm, x = labels, y = labels,colorscale='blues')
fig.update_layout(
    autosize=False,
    width=500,
    height=500,
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4,
    ),
    yaxis=dict(
        title_text="Actual",
        title_font=dict(size=18)
    ),
    xaxis=dict(
        title_text="Predicted",
        title_font=dict(size=18)
    ),
    paper_bgcolor="white"
)
fig.show()

In [48]:
# Get the classification report for the SVC model
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        42
           1       0.60      1.00      0.75        62

    accuracy                           0.60       104
   macro avg       0.30      0.50      0.37       104
weighted avg       0.36      0.60      0.45       104
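
The report shows that the default SVC predicts class 1 for every test row. One plausible cause is that `Age` spans a far larger range than the 0/1 columns and so dominates the RBF kernel distance; this is an assumption, not something the notebook verifies. A sketch of standardizing features before the SVC, on synthetic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in: one wide-range column (like Age) and one 0/1 column
age = rng.integers(20, 90, size=200)
symptom = rng.integers(0, 2, size=200)
X_demo = np.column_stack([age, symptom])
y_demo = symptom  # by construction the target depends only on the 0/1 column

# Standardize each column to zero mean and unit variance, then fit the SVC
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_demo, y_demo)
pred = scaled_svc.predict(X_demo)
print((pred == y_demo).mean())
```

Wrapping the scaler in a pipeline means it is fitted on the training data only, also inside any later cross-validation.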



## Support Vector Classifier with Grid Search

In [178]:
# Set up parameters
# Note: gamma is ignored by the linear kernel, so candidates that differ only in gamma are identical
param_grid = {'C': [1,3,10,30,100,300,1000], 'gamma': [1,0.3,0.1,0.03,0.01,0.003,0.001], 'kernel': ['linear']} 

# Wrap an SVC in a grid search
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=4)
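
`gamma` has no effect with `kernel='linear'`, so the 49 candidates in this grid contain only 7 distinct models. If both kernels are to be searched, `GridSearchCV` also accepts a list of dicts, so that `gamma` is varied only for RBF. A sketch of that layout, offered as an alternative rather than the grid the notebook used; the `_demo` names are hypothetical:

```python
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

C_values = [1, 3, 10, 30, 100, 300, 1000]
gamma_values = [1, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001]

# One sub-grid per kernel: gamma is searched only where it matters
param_grid_demo = [
    {'kernel': ['linear'], 'C': C_values},
    {'kernel': ['rbf'], 'C': C_values, 'gamma': gamma_values},
]

# 7 linear candidates + 49 RBF candidates
n_candidates = len(ParameterGrid(param_grid_demo))
print(n_candidates)

grid_demo = GridSearchCV(SVC(), param_grid_demo, refit=True)
```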

In [179]:
# Fit the model
grid.fit(X_train,y_train)

Fitting 5 folds for each of 49 candidates, totalling 245 fits
[CV 1/5] END .......C=1, gamma=1, kernel=linear;, score=0.917 total time=   0.0s
[CV 2/5] END .......C=1, gamma=1, kernel=linear;, score=0.916 total time=   0.0s
[CV 3/5] END .......C=1, gamma=1, kernel=linear;, score=0.916 total time=   0.0s
[CV 4/5] END .......C=1, gamma=1, kernel=linear;, score=0.928 total time=   0.0s
[CV 5/5] END .......C=1, gamma=1, kernel=linear;, score=0.940 total time=   0.0s
[CV 1/5] END .....C=1, gamma=0.3, kernel=linear;, score=0.917 total time=   0.0s
[CV 2/5] END .....C=1, gamma=0.3, kernel=linear;, score=0.916 total time=   0.0s
[CV 3/5] END .....C=1, gamma=0.3, kernel=linear;, score=0.916 total time=   0.0s
[CV 4/5] END .....C=1, gamma=0.3, kernel=linear;, score=0.928 total time=   0.0s
[CV 5/5] END .....C=1, gamma=0.3, kernel=linear;, score=0.940 total time=   0.0s
[CV 1/5] END .....C=1, gamma=0.1, kernel=linear;, score=0.917 total time=   0.0s
[CV 2/5] END .....C=1, gamma=0.1, kernel=linear

In [172]:
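# Retrieve the best parameters
# Note: the output below reports kernel='rbf', which is not in the grid defined above;
# judging by the cell numbers (In [172] vs In [178]), it comes from an earlier run with a different grid
grid.best_params_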

{'C': 1000, 'gamma': 0.003, 'kernel': 'rbf'}

In [173]:
# retrieve estimator
grid.best_estimator_

SVC(C=1000, gamma=0.003)

In [174]:
# make predictions
grid_predictions = grid.predict(X_test)

In [175]:
# Create confusion matrix
cm = confusion_matrix(y_test, grid_predictions)

In [176]:
# Print report
print(classification_report(y_test,grid_predictions))

              precision    recall  f1-score   support

           0       0.97      0.88      0.93        42
           1       0.92      0.98      0.95        62

    accuracy                           0.94       104
   macro avg       0.95      0.93      0.94       104
weighted avg       0.94      0.94      0.94       104
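
The figures in the report follow directly from the confusion matrix counts. The counts below are inferred from the report's supports and rounded recalls (37 of 42 negatives and 61 of 62 positives classified correctly); they are not printed anywhere in the notebook, so treat them as an assumption:

```python
import numpy as np

# Assumed confusion matrix, rows = actual [0, 1], columns = predicted [0, 1]
cm_assumed = np.array([[37, 5],
                       [1, 61]])
tn, fp, fn, tp = cm_assumed.ravel()

precision_1 = tp / (tp + fp)   # of predicted positives, how many are right
recall_1 = tp / (tp + fn)      # of actual positives, how many are found
precision_0 = tn / (tn + fn)   # of predicted negatives, how many are right
recall_0 = tn / (tn + fp)      # of actual negatives, how many are found
accuracy = (tp + tn) / cm_assumed.sum()

print(round(precision_0, 2), round(recall_0, 2))
print(round(precision_1, 2), round(recall_1, 2))
print(round(accuracy, 2))
```

Under this assumption the single false negative is one diabetic patient predicted as non-diabetic, which is the costlier error in a screening setting.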



In [177]:
# Plot the confusion matrix
# confusion_matrix orders classes as [0, 1], so the negative class comes first
labels = ['Non Diabetes','Diabetes']
fig = ff.create_annotated_heatmap(cm, x = labels, y = labels,colorscale='blues')
fig.update_layout(
    autosize=False,
    width=500,
    height=500,
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4,
    ),
    yaxis=dict(
        title_text="Actual",
        title_font=dict(size=18)
    ),
    xaxis=dict(
        title_text="Predicted",
        title_font=dict(size=18)
    ),
    paper_bgcolor="white"
)
fig.show()