# Business problem
Advances in biotechnology and public healthcare infrastructure have led to the production of large volumes of critical and sensitive healthcare data. Applying intelligent data analysis techniques to this data reveals patterns that support the early detection and prevention of several fatal diseases. Diabetes mellitus is an extremely life-threatening disease because it contributes to other serious conditions, such as heart disease, kidney disease, and nerve damage. In this paper, a machine learning based approach is proposed for the classification, early-stage identification, and prediction of diabetes. The paper also presents a hypothetical IoT-based diabetes monitoring system that lets both healthy and affected people monitor their blood glucose (BG) level. For diabetes classification, three classifiers are employed: random forest (RF), multilayer perceptron (MLP), and logistic regression (LR). For predictive analysis, long short-term memory (LSTM), moving averages (MA), and linear regression are employed. For experimental evaluation, the benchmark PIMA Indian Diabetes dataset is used. In the analysis, MLP outperforms the other classifiers with 86.08% accuracy, and LSTM gives the best diabetes prediction with 87.26% accuracy. A comparative analysis against existing state-of-the-art techniques is also performed, demonstrating the adaptability of the proposed approach to many public healthcare applications.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


In [2]:
df=pd.read_csv('diabetes.csv')
df

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [3]:
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
df.tail()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


In [5]:
df.shape

(768, 9)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [7]:
df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [8]:
df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64
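
The class counts above give a simple reference point for the accuracy figures reported later. The short sketch below works out the accuracy that a trivial model would reach by always predicting the more frequent class; the counts are copied from the output above, and the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 500, 1: 268}

total = sum(counts.values())
# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(counts.values()) / total
# Share of samples labelled 1 (diabetic).
positive_rate = counts[1] / total

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"share of diabetic samples: {positive_rate:.3f}")
```

Any classifier trained on this data should be judged against that baseline rather than against 50%.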

In [9]:
# 0 = non-diabetic, 1 = diabetic

In [10]:
df.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


In [11]:
# separating features and labels
X = df.drop('Outcome', axis=1)
y = df['Outcome']

In [12]:
X

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47


# Data preprocessing
- standardization

In [13]:
scaler=StandardScaler()
scaler

In [14]:
scaler.fit(X)

In [15]:
standardized=scaler.transform(X)

In [16]:
import joblib

# Save the fitted scaler so the same transformation can be reused at prediction time
sc=joblib.dump(scaler, 'scaler.joblib')
sc

['scaler.joblib']

In [19]:
# alternatively, fit and transform in one step: standardized = scaler.fit_transform(X)

In [17]:
standardized

array([[ 0.63994726,  0.84832379,  0.14964075, ...,  0.20401277,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575, ..., -0.68442195,
        -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, ..., -1.10325546,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 ,  0.00330087,  0.14964075, ..., -0.73518964,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.1597866 , -0.47073225, ..., -0.24020459,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.8730192 ,  0.04624525, ..., -0.20212881,
        -0.47378505, -0.87137393]])
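
To make the standardization step concrete, here is a minimal sketch on a toy matrix (the values are made up, not taken from the dataset). It checks that `StandardScaler` matches a manual z-score, where each column has its mean subtracted and is divided by its population standard deviation (`ddof=0`). Note that `df.describe()` reports the sample standard deviation (`ddof=1`), so plugging its `std` row into the formula gives slightly different numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values, two columns).
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

scaler_toy = StandardScaler().fit(X_toy)
scaled = scaler_toy.transform(X_toy)

# Manual z-score: subtract the column mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so features measured on very different scales (for example Insulin and DiabetesPedigreeFunction) contribute comparably to the classifier.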

In [18]:
X=standardized
y=df['Outcome']

# Split into X_train, y_train and X_test, y_test

In [19]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,stratify=y,random_state=2)
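
The `stratify=y` argument in the split above keeps the class proportions roughly equal in the training and test sets. The sketch below illustrates this on synthetic labels with the same 500/268 class counts as the dataset; the feature column is only a placeholder, and the names `X_demo`, `y_demo` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same 500/268 class counts as the dataset.
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Split sizes, then the share of class 1 overall, in train, and in test.
print(len(y_tr), len(y_te))
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Without stratification, a random 20% split of an imbalanced dataset can end up with a noticeably different share of positive cases, which makes the test accuracy harder to interpret.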

In [20]:
X_train

array([[-1.14185152, -0.05929342, -3.57259724, ...,  0.05170968,
        -0.9992857 , -0.78628618],
       [ 0.63994726, -0.49745345,  0.04624525, ..., -0.15136112,
        -1.05666795,  0.31985461],
       [-0.84488505,  2.13150675, -0.47073225, ..., -0.24020459,
        -0.2231152 ,  2.19178518],
       ...,
       [ 2.12477957, -1.12339636,  0.25303625, ..., -0.24020459,
        -0.51908683,  0.14967911],
       [ 0.04601433, -0.27837344,  0.45982725, ...,  0.94014439,
        -0.71237443,  0.40494237],
       [-1.14185152, -1.09209922, -0.05715025, ...,  0.48323511,
        -0.70633419, -0.70119842]])

In [30]:
clasifier=svm.SVC(kernel='linear')
clasifier

In [31]:
# training the support vector machine classifier

In [32]:
clasifier.fit(X_train,y_train)

# Model evaluation

In [33]:
X_train_pred=clasifier.predict(X_train)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_train,X_train_pred)

0.7866449511400652

In [34]:
X_test_pred=clasifier.predict(X_test)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test,X_test_pred)

0.7727272727272727
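
Accuracy alone does not show which class the errors fall on, which matters when the classes are imbalanced. As a hedged sketch, the block below trains the same kind of linear SVM on synthetic stand-in data (generated with `make_classification`, not the diabetes dataset) and prints a confusion matrix and the recall for class 1. One deliberate difference from the notebook: the scaler here is fit on the training split only, so no test-set statistics leak into preprocessing.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (8 features, imbalanced classes).
X_syn, y_syn = make_classification(
    n_samples=768, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=2
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler_syn = StandardScaler().fit(X_tr)
clf = svm.SVC(kernel="linear").fit(scaler_syn.transform(X_tr), y_tr)
y_pred = clf.predict(scaler_syn.transform(X_te))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("accuracy:", accuracy_score(y_te, y_pred))
print("recall for class 1:", tp / (tp + fn))
```

The same `confusion_matrix(y_test, X_test_pred)` call could be applied to the notebook's own predictions to see how many diabetic cases are missed.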

In [35]:
# Save model
model=joblib.dump(clasifier, 'model.joblib')
model

['model.joblib']

# Making a prediction

In [36]:
input_data=(5,166,72,19,175,25.8,0.587,51)
# change the input data to a numpy array
input_data_as_a_numpy_array=np.asarray(input_data)
# reshape the array because we are predicting for a single instance
input_data_reshaped=input_data_as_a_numpy_array.reshape(1,-1)

In [37]:
input_data_reshaped

array([[  5.   , 166.   ,  72.   ,  19.   , 175.   ,  25.8  ,   0.587,
         51.   ]])

In [38]:
# standardize the input data
stnd_data=scaler.transform(input_data_reshaped)



In [39]:
stnd_data

array([[ 0.3429808 ,  1.41167241,  0.14964075, -0.09637905,  0.82661621,
        -0.78595734,  0.34768723,  1.51108316]])

In [41]:
prediction=clasifier.predict(stnd_data)
print(prediction)

[1]


In [34]:
if prediction[0]==0:
    print('The person is not diabetic')
else:
    print('The person is diabetic')

The person is diabetic


# Loading the saved model

In [42]:
#loading the saved model

In [43]:
model = joblib.load('model.joblib')
scaler =  joblib.load('scaler.joblib')

In [45]:
input_data=(5,166,72,19,175,25.8,0.587,51)
# change the input data to a numpy array
input_data_as_a_numpy_array=np.asarray(input_data)
# reshape the array because we are predicting for a single instance
input_data_reshaped=input_data_as_a_numpy_array.reshape(1,-1)
# standardize with the loaded scaler, then predict with the loaded model
stnd_data=scaler.transform(input_data_reshaped)
prediction=model.predict(stnd_data)
print(prediction)
if prediction[0]==0:
    print('The person is not diabetic')
else:
    print('The person is diabetic')

[1]
The person is diabetic
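
The cells above save and load the scaler and the classifier as two separate files, which must always be used together. One possible alternative, shown here only as a sketch and not as what the notebook does, is to wrap both steps in a scikit-learn `Pipeline` and persist a single object. The data is synthetic and the file name `pipeline.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 8 features (not the diabetes dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

# One object that standardizes and then classifies.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pipe.fit(X_syn, y_syn)

# Save and reload a single file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipe, path)
loaded = joblib.load(path)

sample = X_syn[:1]  # raw, unscaled input with shape (1, 8)
print(loaded.predict(sample))
```

With this layout the caller passes raw feature values and cannot forget the scaling step or pair the model with the wrong scaler.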




# Deployment

In [34]:
!pip install --upgrade scikit-learn


Collecting scikit-learn
  Downloading scikit_learn-1.3.2-cp39-cp39-win_amd64.whl (9.3 MB)
     ---------------------------------------- 9.3/9.3 MB 3.4 MB/s eta 0:00:00
Collecting joblib>=1.1.1
  Using cached joblib-1.3.2-py3-none-any.whl (302 kB)
Installing collected packages: joblib, scikit-learn
  Attempting uninstall: joblib
    Found existing installation: joblib 1.1.0
    Uninstalling joblib-1.1.0:
      Successfully uninstalled joblib-1.1.0
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
Successfully installed joblib-1.3.2 scikit-learn-1.3.2


In [64]:
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")


Scikit-learn version: 1.0.2


In [65]:
! pip install --upgrade scikit-learn




In [66]:
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")

Scikit-learn version: 1.0.2


In [67]:
! pip install --upgrade pip


Collecting pip
  Using cached pip-23.3.2-py3-none-any.whl (2.1 MB)


ERROR: To modify pip, please run the following command:
C:\Users\adepu bharath kumar\anaconda3\python.exe -m pip install --upgrade pip


In [68]:
! python.exe -m pip install --upgrade pip

Collecting pip
  Using cached pip-23.3.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.2.2
    Uninstalling pip-22.2.2:
      Successfully uninstalled pip-22.2.2
Successfully installed pip-23.3.2


In [69]:
! pip install --upgrade scikit-learn==1.3.2




In [70]:
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")

Scikit-learn version: 1.0.2


In [2]:
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")

Scikit-learn version: 1.3.2
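
The outputs above show the reported version staying at 1.0.2 until the kernel is restarted (the cell counter resets to `In [2]`), because an already imported module is not reloaded by `pip install --upgrade`. Since a model saved with joblib is generally meant to be loaded with the same scikit-learn version that created it, one hedged option is to store the version next to the model and compare it at load time. The bundle layout and file name below are assumptions made for illustration, and the tiny model is a placeholder.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.svm import SVC

# Tiny placeholder model so the example runs on its own.
clf = SVC(kernel="linear").fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Store the model together with the version that produced it.
bundle = {"model": clf, "sklearn_version": sklearn.__version__}
path = os.path.join(tempfile.mkdtemp(), "model_bundle.joblib")
joblib.dump(bundle, path)

# At load time, compare the stored version with the running one.
loaded = joblib.load(path)
if loaded["sklearn_version"] != sklearn.__version__:
    print("Warning: model was saved with scikit-learn", loaded["sklearn_version"])
else:
    print("scikit-learn version matches:", sklearn.__version__)
```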
