# Early Stage Diabetes Risk Prediction Dataset

### Data Set Information:
The data were collected through direct questionnaires completed by patients of Sylhet Diabetes
Hospital in Sylhet, Bangladesh, and were approved by a doctor.

### Attribute Information:
- Age: 20-65

- Sex: 1. Male, 2. Female

All of the following features are coded 1: Present, 2: Not present (stored as Yes/No in the CSV file loaded below)
- Polyuria (excessive excretion of urine resulting in profuse and frequent micturition)
- Polydipsia (excessive thirst and fluid intake)
- Sudden weight loss
- Weakness
- Polyphagia (excessive eating)
- Genital thrush (candidiasis infection)
- Visual blurring
- Itching
- Irritability
- Delayed healing of wounds
- Partial paresis (weakening of muscles)
- Muscle stiffness
- Alopecia (hair loss)
- Obesity

- Class: 1. Positive, 2. Negative

### Citation
Islam, MM Faniqul, et al. 'Likelihood prediction of diabetes at early stage using data mining techniques.' Computer Vision and Machine Intelligence in Medical Image Analysis. Springer, Singapore, 2020. 113-125.

https://www.kaggle.com/ishandutta/early-stage-diabetes-risk-prediction-dataset

### Research Questions
1. What symptoms are most frequent for diabetes patients?
2. How is the distribution of the patient's age? 
3. What is the average age for each symptom present?
4. How are symptoms related to gender?
5. Build a model for the prediction of diabetes.

## Import libraries

In [50]:
import warnings

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
from sklearn.metrics import classification_report, confusion_matrix
# GridSearchCV tunes hyperparameters; train_test_split divides train and test sets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Show every column when a DataFrame is displayed
pd.set_option('display.max_columns', None)

# Silence library warnings to keep the cell output readable
warnings.filterwarnings('ignore')

In [7]:
dataset = pd.read_csv('diabetes_data_upload.csv')

In [8]:
dataset.head()

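|  | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | Male | No | Yes | No | Yes | No | No | No | Yes | No | Yes | No | Yes | Yes | Yes | Positive |
| 1 | 58 | Male | No | No | No | Yes | No | No | Yes | No | No | No | Yes | No | Yes | No | Positive |
| 2 | 41 | Male | Yes | No | No | Yes | Yes | No | No | Yes | No | Yes | No | Yes | Yes | No | Positive |
| 3 | 45 | Male | No | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | Positive |
| 4 | 60 | Male | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Positive |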


In [9]:
# Encode Yes/No and Positive/Negative as '1'/'0' strings
dataset.replace({'Yes': '1', 'No': '0', 'Positive': '1', 'Negative': '0'}, inplace=True)

In [10]:
# Columns that hold '0'/'1' strings: the 14 symptoms plus the target
binary_cols = ['Polyuria', 'Polydipsia', 'sudden weight loss',
       'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring',
       'Itching', 'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity', 'class']
# Cast the string flags to integers
dataset[binary_cols] = dataset[binary_cols].astype(int)
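
The two cells above first encode the answers as strings and then cast them to integers. As a minimal sketch, the same encoding could be done in one step by mapping straight to integers. The frame `toy` and the name `mapping` are hypothetical and stand in for the real data:

```python
import pandas as pd

# Hypothetical toy frame that mimics the raw column layout
toy = pd.DataFrame({
    'Polyuria': ['No', 'Yes', 'Yes'],
    'Obesity': ['Yes', 'No', 'No'],
    'class': ['Positive', 'Positive', 'Negative'],
})

# Map every answer directly to an integer flag, column by column
mapping = {'Yes': 1, 'No': 0, 'Positive': 1, 'Negative': 0}
encoded = toy.apply(lambda col: col.map(mapping)).astype(int)

print(encoded)
```

Mapping column by column leaves a missing value wherever an answer falls outside the dictionary, so an unexpected answer would make the final `astype(int)` fail loudly instead of passing through unnoticed.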

In [11]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Age                 520 non-null    int64 
 1   Gender              520 non-null    object
 2   Polyuria            520 non-null    int32 
 3   Polydipsia          520 non-null    int32 
 4   sudden weight loss  520 non-null    int32 
 5   weakness            520 non-null    int32 
 6   Polyphagia          520 non-null    int32 
 7   Genital thrush      520 non-null    int32 
 8   visual blurring     520 non-null    int32 
 9   Itching             520 non-null    int32 
 10  Irritability        520 non-null    int32 
 11  delayed healing     520 non-null    int32 
 12  partial paresis     520 non-null    int32 
 13  muscle stiffness    520 non-null    int32 
 14  Alopecia            520 non-null    int32 
 15  Obesity             520 non-null    int32 
 16  class               520 no

### 1. What symptoms are most frequent for diabetes patients?

In [12]:
present_series = dataset.loc[dataset['class'] == 1][['Polyuria', 'Polydipsia', 'sudden weight loss',
       'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring',
       'Itching', 'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity']]

In [13]:
# Number of positive patients
present_q = len(present_series)
# Count, per symptom, the positive patients who report it
present_df = present_series.sum().to_frame('Present')

In [14]:
present_df['Non Present'] = present_q - present_df.Present
present_df['Percentage'] = present_df.Present / present_q
present_df = present_df.sort_values(by='Present')
present_df.index.rename('Condition',inplace=True)
present_df.reset_index(inplace=True)

In [15]:
present_df

|  | Condition | Present | Non Present | Percentage |
|---|---|---|---|---|
| 0 | Obesity | 61 | 259 | 0.190625 |
| 1 | Alopecia | 78 | 242 | 0.24375 |
| 2 | Genital thrush | 83 | 237 | 0.259375 |
| 3 | Irritability | 110 | 210 | 0.34375 |
| 4 | muscle stiffness | 135 | 185 | 0.421875 |
| 5 | delayed healing | 153 | 167 | 0.478125 |
| 6 | Itching | 154 | 166 | 0.48125 |
| 7 | visual blurring | 175 | 145 | 0.546875 |
| 8 | sudden weight loss | 188 | 132 | 0.5875 |
| 9 | Polyphagia | 189 | 131 | 0.590625 |


In [16]:
fig = px.bar(present_df, x='Percentage', y='Condition', text='Percentage', width=1000,height=600,title="Most common conditions present in diabetic patients")
fig.update_traces(texttemplate='%{text:.2%}', textposition='inside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()
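
Because every symptom is stored as a 0/1 flag, the share of positive patients who report it equals the column mean. A minimal sketch on a hypothetical toy frame (`toy`, `positives` and `share` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy data: two symptom flags and the target
toy = pd.DataFrame({
    'Polyuria': [1, 1, 0, 1],
    'Obesity':  [0, 1, 0, 0],
    'class':    [1, 1, 1, 0],
})

# Keep positive patients only, then drop the target column
positives = toy.loc[toy['class'] == 1].drop(columns='class')

# The mean of a 0/1 column is the fraction of rows where the flag is set
share = positives.mean().sort_values()

print(share)
```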

### 2. How is the distribution of the patient's age?

In [17]:
# Left-closed bins: [-1, 0), [0, 18), [18, 40), [40, 65), [65, 110)
bins = [-1, 0, 18, 40, 65, 110]
labels = ['unknown', 'Pediatric', 'Young Adult', 'Adult', 'Elderly']
dataset['Age Group'] = pd.cut(dataset['Age'], bins=bins, labels=labels, right=False)
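
With `right=False` each bin includes its left edge and excludes its right edge, so an age of exactly 40 or 65 falls into the older group. A short sketch with a few hand-picked boundary ages (the values in `ages` are illustrative):

```python
import pandas as pd

bins = [-1, 0, 18, 40, 65, 110]
labels = ['unknown', 'Pediatric', 'Young Adult', 'Adult', 'Elderly']

# Ages chosen to sit on or next to the bin edges
ages = pd.Series([16, 39, 40, 64, 65])
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

result = groups.astype(str).tolist()
print(result)
# ['Pediatric', 'Young Adult', 'Adult', 'Adult', 'Elderly']
```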

In [18]:
dataset

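|  | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity | class | Age Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | Male | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | Adult |
| 1 | 58 | Male | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | Adult |
| 2 | 41 | Male | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | Adult |
| 3 | 45 | Male | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | Adult |
| 4 | 60 | Male | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Adult |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 515 | 39 | Female | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | Young Adult |
| 516 | 48 | Female | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | Adult |
| 517 | 58 | Female | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | Adult |
| 518 | 32 | Female | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | Young Adult |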


In [19]:
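# Positive cases ('patients') and total respondents ('persons') per age group;
# observed=False keeps empty categories such as 'unknown' in the result
age_df = dataset.groupby('Age Group', observed=False)['class'].agg(['sum', 'count'])
age_df.rename(columns={'sum': 'patients', 'count': 'persons'}, inplace=True)
age_df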

| Age Group | patients | persons |
|---|---|---|
| unknown | 0 | 0 |
| Pediatric | 1 | 1 |
| Young Adult | 84 | 143 |
| Adult | 197 | 319 |
| Elderly | 38 | 57 |


In [20]:
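# Drop the empty 'unknown' group and the single-person 'Pediatric' group
age_df = age_df.drop(index=['unknown', 'Pediatric'], errors='ignore').copy()
# Share of positive cases within each age group
age_df['percentage'] = age_df.patients / age_df.persons
age_df.reset_index(inplace=True)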

In [21]:
age_df

|  | Age Group | patients | persons | percentage |
|---|---|---|---|---|
| 0 | Young Adult | 84 | 143 | 0.587413 |
| 1 | Adult | 197 | 319 | 0.617555 |
| 2 | Elderly | 38 | 57 | 0.666667 |


In [22]:
fig = px.bar(age_df, x='Age Group', y='patients', text='patients', width=1000,height=600,title="Diabetes Patients per Age")
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

In [23]:
fig = px.bar(age_df, x='Age Group', y='percentage', text='percentage', width=1000,height=600,title="Share of Diabetes Patients per Age Group")
fig.update_traces(textposition='inside',texttemplate='%{text:.2%}')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()
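
Research question 3, the average age for each symptom present, has no cell of its own in this notebook. One way it could be approached is sketched here on a hypothetical toy frame, not on the real data: reshape to long format, keep the rows where the symptom is present, and average the age per condition. The names `toy`, `long_df` and `avg_age` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data: age plus two 0/1 symptom flags
toy = pd.DataFrame({
    'Age':      [30, 40, 50, 60],
    'Polyuria': [1, 0, 1, 1],
    'Obesity':  [0, 1, 1, 0],
})

# One row per (patient, condition) pair
long_df = toy.melt(id_vars='Age', var_name='Condition', value_name='present')

# Average age over the patients who report each condition
avg_age = (long_df[long_df['present'] == 1]
           .groupby('Condition')['Age']
           .mean())

print(avg_age)
```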

### 4. How are symptoms related to gender?

In [24]:
symptom_cols = ['Polyuria', 'Polydipsia', 'sudden weight loss',
       'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring',
       'Itching', 'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity']

# Symptom flags of the positive patients, together with their gender
present_series = dataset.loc[dataset['class'] == 1][['Gender'] + symptom_cols]

# Reshape to long format: one row per (patient, condition) pair
gender_cond_df = pd.melt(present_series, value_vars=symptom_cols, id_vars=['Gender'],var_name='Condition')

In [25]:
# Count the positive patients reporting each condition, split by gender
gender_cond_group = pd.DataFrame(gender_cond_df.groupby(['Gender','Condition'])['value'].sum()).sort_values(by='value',ascending=False).reset_index()

In [26]:
gender_cond_group

|  | Gender | Condition | value |
|---|---|---|---|
| 0 | Female | Polyuria | 129 |
| 1 | Female | Polydipsia | 125 |
| 2 | Female | partial paresis | 122 |
| 3 | Female | weakness | 117 |
| 4 | Female | sudden weight loss | 114 |
| 5 | Male | Polyuria | 114 |
| 6 | Female | Polyphagia | 114 |
| 7 | Female | visual blurring | 104 |
| 8 | Male | weakness | 101 |
| 9 | Male | Polydipsia | 100 |


In [27]:
fig = px.bar(gender_cond_group, x='Condition', y='value', color='Gender', text='value', width=1000,height=600,title="Conditions per Gender",barmode='group',
            color_discrete_map={
        'Male': 'cornflowerblue',
        'Female': 'coral'
    })
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

### 5. Build a model for the prediction of diabetes.

In [42]:

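# Features: everything except the target and the derived age group
X = dataset.drop(columns=['class', 'Age Group'])
# One-hot encode Gender; drop_first leaves a single Gender_Male column
X = pd.get_dummies(X, drop_first=True)
# Target: 1 = Positive, 0 = Negative
y = dataset['class']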

In [43]:
X.columns

Index(['Age', 'Polyuria', 'Polydipsia', 'sudden weight loss', 'weakness',
       'Polyphagia', 'Genital thrush', 'visual blurring', 'Itching',
       'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity', 'Gender_Male'],
      dtype='object')

In [44]:
X.shape

(520, 16)

In [45]:
y.shape

(520,)

In [46]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=5)
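
The split above is purely random, so the class balance of the test set can drift from that of the full data. If an exact balance matters, `train_test_split` accepts a `stratify` argument. This is a sketch on synthetic labels and an assumption about a possible variation, not what the notebook does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 60 positive and 40 negative
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 60 + [0] * 40)

# stratify=y_toy keeps the 60/40 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=5, stratify=y_toy
)

n_pos_test = int(y_te.sum())
print(len(y_te), n_pos_test)
```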

## Support Vector Classifier - 1st attempt

In [83]:
# Instantiate a Support Vector Classifier
svc_model = SVC()
# Fit the SVC instance
svc_model.fit(X_train, y_train)

SVC()

In [84]:
# Predict values
y_predict = svc_model.predict(X_test)
# Create a confusion matrix
cm = confusion_matrix(y_test, y_predict)

In [85]:
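# Plot the confusion matrix; rows are actual classes and columns are predictions,
# both in the order confusion_matrix uses: 0 (Non Diabetes), then 1 (Diabetes)
labels = ['Non Diabetes', 'Diabetes']
fig = ff.create_annotated_heatmap(cm, x=labels, y=labels, colorscale='blues')
fig.update_layout(xaxis_title='Predicted', yaxis_title='Actual')
fig.show()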

In [86]:
# Get the classification report for the SVC model
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        42
           1       0.60      1.00      0.75        62

    accuracy                           0.60       104
   macro avg       0.30      0.50      0.37       104
weighted avg       0.36      0.60      0.45       104
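
The report shows that the default `SVC` labels every test sample as positive: recall is 1.00 for class 1 and 0.00 for class 0. A plausible cause, not verified in this notebook, is that `Age` spans a much wider numeric range than the 0/1 symptom flags and therefore dominates the RBF kernel distance. A minimal sketch, on synthetic data, of placing a `StandardScaler` in front of the classifier (`X_toy`, `y_toy` and `scaled_svc` are hypothetical names):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: one wide-range feature (age) and one 0/1 flag
rng = np.random.default_rng(0)
n = 200
age = rng.integers(20, 66, size=n)
flag = rng.integers(0, 2, size=n)
X_toy = np.column_stack([age, flag])
y_toy = flag  # in this toy setup the label depends only on the flag

# Scale every feature to zero mean and unit variance before the SVC
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_toy, y_toy)

preds = scaled_svc.predict(X_toy)
train_accuracy = (preds == y_toy).mean()
print(train_accuracy)
```

Wrapping both steps in one pipeline means the scaler is fitted on the training data only and then reused unchanged at prediction time.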



## Support Vector Classifier with Grid Search

In [57]:
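# Set up the parameter grid
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf']}

# Wrap an SVC in a grid search (5-fold cross-validation by default);
# refit=True retrains the best configuration on the full training set
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=4)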

In [59]:
# Fit the model
grid.fit(X_train,y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.655 total time=   0.0s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.627 total time=   0.0s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.663 total time=   0.0s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.639 total time=   0.0s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.614 total time=   0.0s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.679 total time=   0.0s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.651 total time=   0.0s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.663 total time=   0.0s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.639 total time=   0.0s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.663 total time=   0.0s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.619 total time=   0.0s
[CV 2/5] END .....C=0.1, gamma=0.01, kernel=rbf;

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],
                         'kernel': ['rbf']},
             verbose=4)

In [60]:
# Retrieve the best parameters
grid.best_params_

{'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}

In [61]:
# retrieve estimator
grid.best_estimator_

SVC(C=100, gamma=0.01)
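
The search above tunes `C` and `gamma` on unscaled features. If feature scaling were added, one common arrangement is to put the scaler and the classifier in a single pipeline and search over that, so the scaler is refitted inside every cross-validation fold. The sketch below uses synthetic data and a deliberately small grid; `pipe`, `toy_grid` and `search` are hypothetical names, and this is not the configuration used in the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (not the real dataset)
rng = np.random.default_rng(0)
age = rng.integers(20, 66, size=120)
flag = rng.integers(0, 2, size=120)
X_toy = np.column_stack([age, flag])
y_toy = flag

# make_pipeline names the SVC step 'svc', hence the 'svc__' prefix
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
toy_grid = {'svc__C': [1, 10], 'svc__gamma': [0.1, 0.01]}

# 3-fold search over the four parameter combinations
search = GridSearchCV(pipe, toy_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
```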

In [63]:
# make predictions
grid_predictions = grid.predict(X_test)

In [81]:
# Create confusion matrix
cm = confusion_matrix(y_test, grid_predictions)

In [65]:
# Print report
print(classification_report(y_test,grid_predictions))

              precision    recall  f1-score   support

           0       0.97      0.88      0.93        42
           1       0.92      0.98      0.95        62

    accuracy                           0.94       104
   macro avg       0.95      0.93      0.94       104
weighted avg       0.94      0.94      0.94       104
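
The precision and recall figures in a classification report can be recomputed from the confusion matrix. A short sketch with hand-made labels (`y_true` and `y_pred` are illustrative values, not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels: 3 negatives and 5 positives
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])

# Rows are actual classes, columns are predicted classes, ordered 0 then 1
cm_toy = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm_toy.ravel()

# Precision: how many predicted positives are correct
precision_pos = tp / (tp + fp)
# Recall: how many actual positives were found
recall_pos = tp / (tp + fn)

print(cm_toy)
print(precision_pos, recall_pos)
```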



In [80]:
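# Plot the confusion matrix; rows are actual classes and columns are predictions,
# both in the order confusion_matrix uses: 0 (Non Diabetes), then 1 (Diabetes)
labels = ['Non Diabetes', 'Diabetes']
fig = ff.create_annotated_heatmap(cm, x=labels, y=labels, colorscale='blues')
fig.update_layout(xaxis_title='Predicted', yaxis_title='Actual')
fig.show()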