# Diabetes Prediction using Machine Learning

Diabetes is a group of metabolic disorders characterized by high blood sugar levels over a prolonged period. Symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger. If left untreated, diabetes can cause many complications. Acute complications can include diabetic ketoacidosis, hyperosmolar hyperglycemic state, or death. Serious long-term complications include cardiovascular disease, stroke, chronic kidney disease, foot ulcers, and damage to the eyes.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

## Objective
We will try to build a machine learning model that accurately predicts whether or not the patients in the dataset have diabetes.

## Details about the dataset

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

- **Pregnancies**: Number of times pregnant
- **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- **BloodPressure**: Diastolic blood pressure (mm Hg)
- **SkinThickness**: Triceps skin fold thickness (mm)
- **Insulin**: 2-Hour serum insulin (mu U/ml)
- **BMI**: Body mass index (weight in kg/(height in m)^2)
- **DiabetesPedigreeFunction**: Diabetes pedigree function
- **Age**: Age (years)
- **Outcome**: Class variable (0 or 1)

**Number of Observation Units: 768**

**Variable Number: 9**

**Result: The model created as a result of XGBoost hyperparameter optimization became the model with the lowest Cross Validation Score value (0.90).**

# 1) Exploratory Data Analysis

In [173]:
# Import the required libraries
import numpy as np
import pandas as pd 
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import KFold
import warnings
warnings.simplefilter(action = "ignore") 

In [174]:
#Reading the dataset
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")

In [175]:
# Display the first 5 observation units of the data set.
df.head()

In [176]:
# The size of the data set was examined. It consists of 768 observation units and 9 variables.
df.shape

In [177]:
#Feature information
df.info()

In [178]:
# Descriptive statistics of the data set, including selected percentiles.
df.describe([0.10,0.25,0.50,0.75,0.90,0.95,0.99]).T

In [179]:
# The distribution of the Outcome variable was examined.
df["Outcome"].value_counts()*100/len(df)

In [180]:
# The classes of the outcome variable were examined.
df.Outcome.value_counts()

In [181]:
# Histogram of the Age variable.
df["Age"].hist(edgecolor = "black");

In [182]:
print("Max Age: " + str(df["Age"].max()) + " Min Age: " + str(df["Age"].min()))

In [183]:
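# Histograms with density curves for all variables.
# sns.distplot is deprecated; histplot with kde=True is its replacement.
fig, ax = plt.subplots(4,2, figsize=(16,16))
sns.histplot(df.Age, bins = 20, kde = True, stat = "density", ax=ax[0,0])
sns.histplot(df.Pregnancies, bins = 20, kde = True, stat = "density", ax=ax[0,1])
sns.histplot(df.Glucose, bins = 20, kde = True, stat = "density", ax=ax[1,0])
sns.histplot(df.BloodPressure, bins = 20, kde = True, stat = "density", ax=ax[1,1])
sns.histplot(df.SkinThickness, bins = 20, kde = True, stat = "density", ax=ax[2,0])
sns.histplot(df.Insulin, bins = 20, kde = True, stat = "density", ax=ax[2,1])
sns.histplot(df.DiabetesPedigreeFunction, bins = 20, kde = True, stat = "density", ax=ax[3,0])
sns.histplot(df.BMI, bins = 20, kde = True, stat = "density", ax=ax[3,1])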

In [184]:
df.groupby("Outcome").agg({"Pregnancies":"mean"})

In [185]:
df.groupby("Outcome").agg({"Age":"mean"})

In [186]:
df.groupby("Outcome").agg({"Age":"max"})

In [187]:
df.groupby("Outcome").agg({"Insulin": "mean"})

In [188]:
df.groupby("Outcome").agg({"Insulin": "max"})

In [189]:
df.groupby("Outcome").agg({"Glucose": "mean"})

In [190]:
df.groupby("Outcome").agg({"Glucose": "max"})

In [191]:
df.groupby("Outcome").agg({"BMI": "mean"})

In [192]:
# The distribution of the outcome variable in the data was examined and visualized.
f,ax=plt.subplots(1,2,figsize=(18,8))
df['Outcome'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('target')
ax[0].set_ylabel('')
sns.countplot(x='Outcome',data=df,ax=ax[1])  # recent seaborn versions require x= as a keyword
ax[1].set_title('Outcome')
plt.show()

In [193]:
# The correlation matrix of the data set shows what kind of relationship exists between the variables.
# If the correlation value is > 0, there is a positive correlation: as one variable increases, the other also increases.
# Correlation = 0 means no linear correlation.
# If the correlation is < 0, there is a negative correlation: as one variable increases, the other decreases.
# When the correlations are examined, Glucose stands out as positively correlated with the Outcome dependent variable.
# As Glucose increases, the Outcome variable tends to increase.
df.corr()
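To make the sign convention in the comments above concrete, here is a minimal sketch on a toy frame. The column names and values are invented for illustration and are not taken from the diabetes data.

```python
import pandas as pd

# Toy values invented for illustration; not taken from the diabetes dataset.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "rises_with_x": [2, 4, 6, 8, 10],
    "falls_with_x": [10, 8, 6, 4, 2],
})

corr = toy.corr()  # Pearson correlation is the pandas default
print(corr.loc["x", "rises_with_x"])  # perfectly positive relationship
print(corr.loc["x", "falls_with_x"])  # perfectly negative relationship
```

On the real frame, `df.corr()["Outcome"].sort_values(ascending=False)` would rank the predictors by their correlation with the target.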

In [194]:
# Correlation matrix graph of the data set
f, ax = plt.subplots(figsize= [20,15])
sns.heatmap(df.corr(), annot=True, fmt=".2f", ax=ax, cmap = "magma" )
ax.set_title("Correlation Matrix", fontsize=20)
plt.show()

# 2) Data Preprocessing

## 2.1) Missing Observation Analysis

We saw in df.head() that some features contain 0. A value of 0 does not make sense for these measurements, so it indicates a missing value.

Below we replace the 0 values with NaN:

In [195]:
df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0,np.nan)  # np.NaN was removed in NumPy 2.0

In [196]:
df.head()

In [197]:
# Now we can count the missing values in each column
df.isnull().sum()
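The zero-to-NaN step can be checked on a tiny frame. The sketch below uses invented rows and only two of the affected columns. Note that `Pregnancies` is left alone because 0 is a legitimate value there.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 1],   # 0 is a legitimate value here
    "Glucose": [148, 0, 85],    # 0 is physiologically implausible
    "BMI": [33.6, 26.6, 0.0],   # 0 is physiologically implausible
})

cols_with_fake_zeros = ["Glucose", "BMI"]
toy[cols_with_fake_zeros] = toy[cols_with_fake_zeros].replace(0, np.nan)

missing_counts = toy.isnull().sum()
print(missing_counts)
```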

In [198]:
# Missing observations are visualized with the missingno library.
import missingno as msno
msno.bar(df);

In [199]:
# The missing values will be filled with the median values of each variable.
def median_target(var):   
    temp = df[df[var].notnull()]
    temp = temp[[var, 'Outcome']].groupby(['Outcome'])[[var]].median().reset_index()
    return temp

In [200]:
# Missing values are filled per class: the median of non-diabetic patients for Outcome 0 and the median of diabetic patients for Outcome 1.
columns = df.columns
columns = columns.drop("Outcome")
for i in columns:
    medians = median_target(i)  # computed once per column; row 0 is Outcome 0, row 1 is Outcome 1
    df.loc[(df['Outcome'] == 0 ) & (df[i].isnull()), i] = medians[i][0]
    df.loc[(df['Outcome'] == 1 ) & (df[i].isnull()), i] = medians[i][1]
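The same per-class median fill can be sketched more compactly with `groupby(...).transform`. This is an alternative formulation on invented toy values, not the code used above.

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration.
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Insulin": [100.0, np.nan, 120.0, 200.0, 220.0, np.nan],
})

# Median of the non-missing values within each Outcome class,
# broadcast back so every row carries its own class median.
class_median = toy.groupby("Outcome")["Insulin"].transform("median")
toy["Insulin"] = toy["Insulin"].fillna(class_median)
print(toy)
```

Because the fill value depends on `Outcome`, this kind of imputation cannot be applied to new patients whose outcome is not yet known.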

In [201]:
df.head()

In [202]:
# Missing values were filled.
df.isnull().sum()

## 2.2) Outlier Observation Analysis

In [203]:
# Outlier detection: check each attribute for values above the upper IQR bound
for feature in df:
    
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3-Q1
    lower = Q1- 1.5*IQR
    upper = Q3 + 1.5*IQR
    
    # Only the upper bound is tested; lower is computed for reference.
    if (df[feature] > upper).any():
        print(feature,"yes")
    else:
        print(feature, "no")

In [204]:
# Visualizing the Insulin variable with boxplot method 
import seaborn as sns
sns.boxplot(x = df["Insulin"]);

In [205]:
# Cap Insulin values above the upper IQR bound at that bound to suppress the outliers.
Q1 = df.Insulin.quantile(0.25)
Q3 = df.Insulin.quantile(0.75)
IQR = Q3-Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR
df.loc[df["Insulin"] > upper,"Insulin"] = upper
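As a worked example of the IQR rule, the sketch below caps a single extreme value in an invented series. `Series.clip` is used here as a shorthand for the `.loc` assignment above.

```python
import pandas as pd

# Toy values invented for illustration; 500 is the obvious outlier.
values = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 500])

q1 = values.quantile(0.25)   # 14.0
q3 = values.quantile(0.75)   # 22.0
iqr = q3 - q1                # 8.0
upper = q3 + 1.5 * iqr       # 34.0

capped = values.clip(upper=upper)  # values above the bound become the bound
print(capped.tolist())
```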

In [206]:
import seaborn as sns
sns.boxplot(x = df["Insulin"]);
print("IQR is:",IQR)
print("Lower limit is:",lower)
print("Upper limit is:",upper)

## 2.3)  Local Outlier Factor (LOF)

In [207]:
# We determine outliers between all variables with the LOF method
from sklearn.neighbors import LocalOutlierFactor
lof =LocalOutlierFactor(n_neighbors= 10)
lof.fit_predict(df)

In [208]:
df_scores = lof.negative_outlier_factor_
np.sort(df_scores)[0:30]

In [209]:
#We choose the threshold value according to lof scores
threshold = np.sort(df_scores)[7]
threshold

In [210]:
# We keep the observations whose LOF score is above the threshold and drop the rest (the most anomalous ones)
inlier_mask = df_scores > threshold
df = df[inlier_mask]
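The direction of the comparison is easy to get backwards, so here is a small sketch on synthetic points. It assumes a single hand-placed outlier and a threshold set at the lowest score. In `negative_outlier_factor_`, more negative means more anomalous, so keeping `scores > threshold` removes the most anomalous rows.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in data: 50 points near the origin plus one far-away point.
rng = np.random.default_rng(0)
points = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
points = np.vstack([points, [[15.0, 15.0]]])  # outlier added by hand at index 50

lof = LocalOutlierFactor(n_neighbors=10)
lof.fit_predict(points)
scores = lof.negative_outlier_factor_  # more negative = more anomalous

threshold = np.sort(scores)[0]  # the single lowest score
keep = scores > threshold       # strict comparison drops that observation
cleaned = points[keep]
print(points.shape, "->", cleaned.shape)
```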

In [211]:
# The size of the data set was examined.
df.shape

# 3) Feature Engineering

Creating new variables is important for models, but each new variable must be logically meaningful. For this data set, some new variables were created from the BMI, Insulin, and Glucose variables.

In [212]:
# BMI ranges were determined and a categorical variable was assigned for each range.
NewBMI = pd.Series(["Underweight", "Normal", "Overweight", "Obesity 1", "Obesity 2", "Obesity 3"], dtype = "category")
df["NewBMI"] = NewBMI
df.loc[df["BMI"] < 18.5, "NewBMI"] = NewBMI[0]
# >= so that a BMI of exactly 18.5 is not left without a category
df.loc[(df["BMI"] >= 18.5) & (df["BMI"] <= 24.9), "NewBMI"] = NewBMI[1]
df.loc[(df["BMI"] > 24.9) & (df["BMI"] <= 29.9), "NewBMI"] = NewBMI[2]
df.loc[(df["BMI"] > 29.9) & (df["BMI"] <= 34.9), "NewBMI"] = NewBMI[3]
df.loc[(df["BMI"] > 34.9) & (df["BMI"] <= 39.9), "NewBMI"] = NewBMI[4]
df.loc[df["BMI"] > 39.9 ,"NewBMI"] = NewBMI[5]
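Range-based binning like this can also be expressed with `pd.cut`. The sketch below is one possible formulation, not the code used above. It assumes half-open intervals `[left, right)` with round-number edges, which lines up with the one-decimal thresholds above only when BMI is recorded to one decimal place.

```python
import pandas as pd

# Toy BMI values invented for illustration.
bmi = pd.Series([17.0, 18.5, 24.9, 27.3, 31.0, 36.5, 42.0])

# Assumed edges: each bin includes its left edge and excludes its right edge.
edges = [0, 18.5, 25, 30, 35, 40, float("inf")]
labels = ["Underweight", "Normal", "Overweight", "Obesity 1", "Obesity 2", "Obesity 3"]

binned = pd.cut(bmi, bins=edges, labels=labels, right=False)
print(binned.tolist())
```

A single `pd.cut` call makes it harder to leave a gap between adjacent ranges, since every value between the first and last edge falls into exactly one bin.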

In [213]:
df.head()

In [214]:
# A categorical variable creation process is performed according to the insulin value.
def set_insulin(row):
    if row["Insulin"] >= 16 and row["Insulin"] <= 166:
        return "Normal"
    else:
        return "Abnormal"

In [215]:
# The operation performed was added to the dataframe.
df = df.assign(NewInsulinScore=df.apply(set_insulin, axis=1))

df.head()

In [216]:
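# Intervals were determined for the glucose variable and a categorical label was assigned to each.
# Note: the "High" label is defined but never assigned below. The label names are kept as-is because later cells reference the resulting column names.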
NewGlucose = pd.Series(["Low", "Normal", "Overweight", "Secret", "High"], dtype = "category")
df["NewGlucose"] = NewGlucose
df.loc[df["Glucose"] <= 70, "NewGlucose"] = NewGlucose[0]
df.loc[(df["Glucose"] > 70) & (df["Glucose"] <= 99), "NewGlucose"] = NewGlucose[1]
df.loc[(df["Glucose"] > 99) & (df["Glucose"] <= 126), "NewGlucose"] = NewGlucose[2]
df.loc[df["Glucose"] > 126 ,"NewGlucose"] = NewGlucose[3]

In [217]:
df.head(5)

# 4) One Hot Encoding

Categorical variables in the data set should be converted into numerical values. For this reason, they are transformed with the One Hot Encoding method.

In [218]:
# One Hot Encoding converts the categorical variables into numerical values. drop_first = True drops one level per variable, which avoids the dummy variable trap.
df = pd.get_dummies(df, columns =["NewBMI","NewInsulinScore", "NewGlucose"], drop_first = True)
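It helps to know which level `drop_first=True` removes. For a categorical column built from strings, pandas orders the categories alphabetically, so the dropped level is the alphabetically first one. The sketch below shows this on an invented three-level column.

```python
import pandas as pd

# Toy column invented for illustration.
toy = pd.DataFrame({
    "NewBMI": pd.Series(["Normal", "Overweight", "Underweight", "Normal"], dtype="category")
})

encoded = pd.get_dummies(toy, columns=["NewBMI"], drop_first=True)
print(list(encoded.columns))
```

Rows that belong to the dropped level ("Normal" here) are encoded as all zeros and serve as the baseline.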

In [219]:
df.head()

In [220]:
categorical_df = df[['NewBMI_Obesity 1','NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight','NewBMI_Underweight',
                     'NewInsulinScore_Normal','NewGlucose_Low','NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret']]

In [221]:
categorical_df.head()

In [222]:
y = df["Outcome"]
X = df.drop(["Outcome",'NewBMI_Obesity 1','NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight','NewBMI_Underweight',
                     'NewInsulinScore_Normal','NewGlucose_Low','NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret'], axis = 1)
cols = X.columns
index = X.index

In [223]:
X.head()

In [224]:
# Using RobustScaler for standardization
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(X)
X = transformer.transform(X)
X = pd.DataFrame(X, columns = cols, index = index)
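With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range, which is why a single extreme value does not distort the scale of the remaining points. A minimal check on invented numbers:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column invented for illustration; 100.0 is an extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# The same transformation written out by hand.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
```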

In [225]:
X.head()

In [226]:
X = pd.concat([X,categorical_df], axis = 1)

In [227]:
X.head()

In [228]:
y.head()

In [229]:
X.count()
y.count()

In [230]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=1/3,random_state=2022)
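One optional refinement, not used in the cell above, is to pass `stratify` so that both splits keep the same class ratio as the full data. The sketch below uses invented stand-in arrays (`X_demo`, `y_demo`) with a 2:1 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data invented for illustration: 60 negatives and 30 positives.
X_demo = np.arange(90).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=2022, stratify=y_demo
)

# With stratify, the share of positives is preserved in both parts.
print(y_tr.mean(), y_te.mean())
```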

In [231]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [232]:
X_train.head()

In [233]:
X_test.head()

In [234]:
y_train.head()

In [235]:
y_test.head()

# 5) Models

## KNN Model

In [236]:
from sklearn.neighbors import KNeighborsClassifier

In [237]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)

In [238]:
y_pred = knn.predict(X_test)

In [239]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,ConfusionMatrixDisplay

from sklearn.model_selection import cross_val_score

In [240]:
print(confusion_matrix(y_test,y_pred))

In [241]:
print(classification_report(y_test,y_pred))

In [242]:
accuracy_rate = []


for i in range(1,30):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    # Score on the feature matrix X only; passing df would leak Outcome in as a feature
    score=cross_val_score(knn,X,y,cv=10)
    accuracy_rate.append(score.mean())
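A variant of this search wraps the scaler and the classifier in a pipeline, so the scaler is refit inside every fold and the held-out rows never influence it. The sketch below runs on synthetic data from `make_classification`. The names `X_demo`, `y_demo`, and `best_k` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data; the notebook would use its own features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

k_values = list(range(1, 30))
mean_scores = []
for k in k_values:
    # Scaling happens inside each cross-validation fold.
    pipe = make_pipeline(RobustScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores.append(cross_val_score(pipe, X_demo, y_demo, cv=10).mean())

best_k = k_values[int(np.argmax(mean_scores))]
print("best k:", best_k)
```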

In [243]:
plt.figure(figsize=(10,6))
plt.plot(range(1,30),accuracy_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Accuracy Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Accuracy Rate')
plt.show()

In [244]:
knn = KNeighborsClassifier(n_neighbors=23)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

cf=confusion_matrix(y_test,y_pred)
cmd = ConfusionMatrixDisplay(cf, display_labels=['Non-Diabetic','Diabetic'])  # rows/columns follow sorted labels: 0, then 1

print(classification_report(y_test,y_pred))
cmd.plot()
print("\n")
plt.show()
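When reading the matrix, keep in mind that `confusion_matrix` orders rows (true labels) and columns (predicted labels) by sorted label value, so class 0 comes first. Any `display_labels` must be listed in that same order. A small check on invented labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels invented for illustration: 0 = negative class, 1 = positive class.
y_true = [0, 0, 0, 1, 1]
y_hat = [0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()  # row 0 is true class 0, row 1 is true class 1
print(cm)
```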

In [245]:
knn.fit(X_train,y_train)
knn.score(X_test,y_test)

## Random Forest Classifier

In [246]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 2022)
classifier.fit(X_train, y_train)

In [247]:
pred = classifier.predict(X_test)
cf1=confusion_matrix(y_test,pred)

print(classification_report(y_test,pred))
print('\n')
cmd1 = ConfusionMatrixDisplay(cf1, display_labels=['Non-Diabetic','Diabetic'])  # rows/columns follow sorted labels: 0, then 1
cmd1.plot()
plt.show()

In [248]:
classifier.fit(X_train,y_train)
classifier.score(X_test,y_test)

## ANN

In [249]:
import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dropout

In [250]:
tf.random.set_seed(2022)

In [251]:
model = tf.keras.models.Sequential()

In [252]:
# Dropout and BatchNormalization layers only take effect when they are added to the model.
model.add(tf.keras.layers.Dense(1024,activation='relu',input_dim=18,kernel_regularizer=regularizers.l2(0.003)))
model.add(Dropout(0.3))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(512,activation='relu'))
model.add(Dropout(0.3))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(256,activation='relu'))
model.add(Dropout(0.3))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128,activation='relu'))
model.add(Dropout(0.3))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(64,activation='relu'))
model.add(Dropout(0.3))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(32,activation='relu'))
model.add(Dropout(0.3))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

In [253]:
model.compile(optimizer='SGD',loss='binary_crossentropy',metrics=['accuracy'])

In [254]:
history=model.fit(X_train, y_train,validation_data=(X_test,y_test), epochs=100)

In [255]:
# plotting the loss


plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.legend()
plt.savefig('LossVal_loss')  # save before show(); afterwards the figure is empty
plt.show()

# plotting the accuracy


plt.plot(history.history['accuracy'], label='train acc')
plt.plot(history.history['val_accuracy'], label='val acc')
plt.legend()
plt.show()


In [256]:
model.evaluate(X_test,y_test)

In [257]:
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5)
cm2 = confusion_matrix(y_test, y_pred)
cmd3 = ConfusionMatrixDisplay(cm2, display_labels=['Non-Diabetic','Diabetic'])  # rows/columns follow sorted labels: 0, then 1

print(accuracy_score(y_test, y_pred))
print("\n")
cmd3.plot()
plt.show()
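The thresholding step turns the sigmoid outputs into class labels. As a minimal sketch with invented probabilities shaped like the `(n, 1)` array that `model.predict` returns for a single output unit:

```python
import numpy as np

# Invented probabilities for illustration, one column as from a single sigmoid unit.
probs = np.array([[0.10], [0.50], [0.51], [0.90]])

# Strictly greater than 0.5 counts as the positive class; ravel() flattens to 1-D.
labels = (probs > 0.5).astype(int).ravel()
print(labels)
```

Because the comparison is strict, a probability of exactly 0.5 is assigned to class 0.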

In [258]:
print(classification_report(y_test,y_pred))