# Diabetes prediction using ML

We are going to predict whether a person is likely to have diabetes from their insulin level, using a Logistic Regression model:

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
df = pd.read_csv("diabetes2.csv")

In [None]:
df

In [None]:
df_insulin = df[["Insulin", "Outcome"]]

In [None]:
df_insulin

Dropping the rows with insulin = 0:

In [None]:
# Keep only the rows with a non-zero insulin reading
df_insulin = df_insulin[df_insulin["Insulin"] != 0]
df_insulin

An outcome of 1 means the person is likely to have diabetes, and an outcome of 0 means the person is not. Let's train the Logistic Regression model:

Our data points:

In [None]:
X = np.array(df_insulin["Insulin"]).reshape(-1,1)
X

In [None]:
Y = np.array(df_insulin["Outcome"])
Y

In [None]:
from sklearn import linear_model

In [None]:
log_reg_diabetes = linear_model.LogisticRegression()
log_reg_diabetes.fit(X,Y)

Let's make a prediction:

In [None]:
prediction = log_reg_diabetes.predict(np.array([200]).reshape(-1,1))
if prediction[0] == 0:
    print("Not likely to have diabetes!")
else:
    print("Likely to have diabetes!")

Let's talk about the odds of having diabetes:

In [None]:
log_reg_diabetes_log_odds = log_reg_diabetes.coef_
odds = np.exp(log_reg_diabetes_log_odds)
odds

For each one-unit increase in insulin, the odds that the person is likely to have diabetes are multiplied by 1.00571746, an increase of about 0.57%.
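The reason the exponential of the coefficient can be read as a multiplier on the odds follows from the form of the model. A short derivation, writing $\beta_0$ for `intercept_`, $\beta_1$ for `coef_` and $p(x)$ for the predicted probability at insulin level $x$ (these symbols are our own notation, not part of the notebook):

```latex
\log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x
\quad\Longrightarrow\quad
\frac{\mathrm{odds}(x + 1)}{\mathrm{odds}(x)}
= \frac{e^{\beta_0 + \beta_1 (x + 1)}}{e^{\beta_0 + \beta_1 x}}
= e^{\beta_1}
```

The ratio does not depend on $x$, so the same multiplier applies at every insulin level.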

Let's calculate the probability that the person is likely to have diabetes:

In [None]:
def probability(log_reg, X):
    log_reg_log_odds = log_reg.coef_ * X + log_reg.intercept_
    odds = np.exp(log_reg_log_odds)
    prob = odds / (1 + odds)
    return prob

In [None]:
for i in X:
    print(probability(log_reg_diabetes, np.array([i]).reshape(-1,1)))
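As a cross-check, the manual formula can be compared with scikit-learn's built-in `predict_proba`. The sketch below uses synthetic data as a stand-in for the insulin column (the names `x_demo`, `y_demo` and the generating parameters are assumptions for illustration only), so it runs without the CSV file:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the insulin data (illustrative only, not the real dataset)
rng = np.random.default_rng(0)
x_demo = rng.uniform(20, 400, size=200).reshape(-1, 1)
true_p = 1 / (1 + np.exp(-(0.01 * x_demo.ravel() - 2)))
y_demo = (rng.uniform(size=200) < true_p).astype(int)

model = LogisticRegression().fit(x_demo, y_demo)

# Manual computation: log-odds -> odds -> probability
log_odds = model.coef_ * x_demo + model.intercept_
manual_prob = (np.exp(log_odds) / (1 + np.exp(log_odds))).ravel()

# Built-in equivalent: column 1 holds the probability of class 1
builtin_prob = model.predict_proba(x_demo)[:, 1]

print(np.allclose(manual_prob, builtin_prob))
```

In the notebook itself, the same comparison would presumably be `log_reg_diabetes.predict_proba(X)[:, 1]` against the output of `probability`.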

However, if we plot the insulin data points, we can see that Logistic Regression is not a good model for predicting this kind of disease, because the two classes are not separated by a linear boundary. The small number of data points also contributes to the weak model.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
sns.relplot(data = df_insulin.reset_index(), x = "index",  y = "Insulin", hue = "Outcome")

Let's calculate the accuracy of the model on a held-out test set.

As a rule of thumb, test_size should be around 0.1 for a small dataset, as we have here, and around 0.2 for a larger one.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.10, random_state = 1)
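To make the effect of `test_size` concrete, here is a minimal sketch on toy arrays (the helper `split_summary` and the 70/30 class mix are assumptions for illustration). It also passes `stratify`, an optional argument that is not used in the notebook but keeps the class proportions similar in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for X and Y (illustrative only)
X_demo = np.arange(100).reshape(-1, 1)
Y_demo = np.array([0] * 70 + [1] * 30)

def split_summary(test_size):
    """Return (train rows, test rows, positives in test) for a stratified split."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X_demo, Y_demo, test_size=test_size, random_state=1, stratify=Y_demo
    )
    return len(X_tr), len(X_te), int(Y_te.sum())

for size in (0.1, 0.2):
    print(size, split_summary(size))
```

With a small test set, each test row moves the accuracy by a large step, so scores from a 10% split of a few hundred rows should be read with some caution.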

In [None]:
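from sklearn.metrics import accuracy_score

# Refit on the training split only, so the test rows are unseen during training
log_reg_diabetes.fit(X_train, Y_train)
accuracy_score(Y_test, log_reg_diabetes.predict(X_test))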

Low accuracy!
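One way to judge whether an accuracy value is low is to compare it with a trivial baseline that always predicts the most frequent class. The sketch below does this on synthetic data (the data, `x_demo`, `y_demo` and the variable names are assumptions; `DummyClassifier` is a real scikit-learn class but is not used in the notebook):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the insulin data (illustrative only)
rng = np.random.default_rng(0)
x_demo = rng.uniform(20, 400, size=300).reshape(-1, 1)
true_p = 1 / (1 + np.exp(-(0.01 * x_demo.ravel() - 2)))
y_demo = (rng.uniform(size=300) < true_p).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.1, random_state=1)

# Baseline: always predict the majority class seen in the training split
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, model.predict(X_te))
print(baseline_acc, model_acc)
```

If the model's accuracy is close to the baseline's, the feature is adding little beyond the class frequencies.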

A neural network such as a Multilayer Perceptron (MLP) is a better option here, as it can learn non-linear boundaries. So, let's use it!

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
mlp_diabetes = MLPClassifier(hidden_layer_sizes = (768, 384), random_state = 6, verbose = True, activation = "logistic", learning_rate_init = 0.01)

In [None]:
mlp_diabetes.fit(X_train, Y_train)

In [None]:
mlp_prediction = mlp_diabetes.predict(X_test)

In [None]:
accuracy_score(Y_test, mlp_prediction)

The accuracy has improved!
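To illustrate why a non-linear model can help with a single feature, here is a small sketch on toy data where class 1 occupies only a middle band of values, so no single threshold separates the classes. Everything in it is an assumption for illustration: the toy data, the small `(16, 16)` hidden layers, and the `StandardScaler` step, which is added because MLPs tend to train more reliably on scaled inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy 1-D data: class 1 only in a middle band (illustrative, not the insulin data)
x_toy = np.linspace(0, 300, 600).reshape(-1, 1)
y_toy = ((x_toy.ravel() > 100) & (x_toy.ravel() < 200)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.25, random_state=1)

# Both models get the same scaled input so the comparison is like for like
log_reg = make_pipeline(StandardScaler(), LogisticRegression())
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0),
)

log_acc = accuracy_score(y_te, log_reg.fit(X_tr, y_tr).predict(X_te))
mlp_acc = accuracy_score(y_te, mlp.fit(X_tr, y_tr).predict(X_te))
print(log_acc, mlp_acc)
```

Logistic regression on one feature can only place a single cut point, so it cannot label both sides of the band correctly, while the MLP is free to learn two.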

To build an even better MLP classifier, we are going to fill the zeros in the Insulin column using the following procedure:

1. Find the mean insulin value (1) of the people without diabetes (outcome 0), ignoring the zeros.
2. Do the same for the people with diabetes (outcome 1) to get mean value (2).
3. Fill the zeros using (1) when the outcome is zero, and using (2) when the outcome is one.

With this, we will have almost twice as much data as before, which will probably increase the accuracy of our model.
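The three steps above can be sketched compactly with a pandas `groupby`. The toy frame and the `Insulin_filled` column name are made up for illustration; the notebook itself works on `df` and overwrites the values in place.

```python
import numpy as np
import pandas as pd

# Toy frame with the same two columns as the notebook (values are made up)
toy = pd.DataFrame({
    "Insulin": [0, 100, 200, 0, 50, 70],
    "Outcome": [1, 1, 1, 0, 0, 0],
})

# Treat zeros as missing so they are excluded from the group means
insulin = toy["Insulin"].replace(0, np.nan)

# Mean of the non-missing readings within each outcome group, aligned row by row
group_means = insulin.groupby(toy["Outcome"]).transform("mean")

# Fill each missing reading with the mean of its own outcome group
toy["Insulin_filled"] = insulin.fillna(group_means)
print(toy)
```

Because the fill value depends on Outcome, this approach only applies to rows whose label is already known.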

In [None]:
df

In [None]:
# Copy so that later edits do not touch the original df
df_insulin_new = df[["Insulin", "Outcome"]].copy()

In [None]:
df_insulin_new

In [None]:
df_insulin_new_yes = df_insulin_new[df_insulin_new["Outcome"] == 1]
df_insulin_new_yes_ = df_insulin_new_yes[df_insulin_new_yes["Insulin"] != 0]
mean_insulin_yes = df_insulin_new_yes_["Insulin"].mean()
mean_insulin_yes

In [None]:
df_insulin_new_no = df_insulin_new[df_insulin_new["Outcome"] == 0]
df_insulin_new_no_ = df_insulin_new_no[df_insulin_new_no["Insulin"] != 0]
mean_insulin_no = df_insulin_new_no_["Insulin"].mean()
mean_insulin_no

In [None]:
insulin_series = df_insulin_new["Insulin"]
insulin_series

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
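# Fill zero insulin readings with the mean of the matching outcome group
is_zero = insulin_series == 0
has_diabetes = df_insulin_new["Outcome"] == 1
insulin_series = insulin_series.astype(float)  # the means are floats
insulin_series[is_zero & has_diabetes] = mean_insulin_yes
insulin_series[is_zero & ~has_diabetes] = mean_insulin_no
df_insulin_new = df_insulin_new.assign(Insulin=insulin_series)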

In [None]:
df_insulin_new

Now the dataframe is ready to be used for training:

In [None]:
X = np.array(insulin_series).reshape(-1,1)
X

In [None]:
Y = np.array(df_insulin_new["Outcome"])
Y

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.10, random_state = 1)

In [None]:
mlp_diabetes_new = MLPClassifier(hidden_layer_sizes = (768, 384), random_state = 6, verbose = True, activation = "logistic", learning_rate_init = 0.01)

In [None]:
mlp_diabetes_new.fit(X_train, Y_train)

In [None]:
accuracy_score(Y_test, mlp_diabetes_new.predict(X_test))

So, we have an even better result, a great improvement!