#**Diabetes Prediction Using Logistic Regression**

###**Objective**

#####**Use logistic regression to solve a binary classification problem: predicting whether a person is diabetic.**

###**Data Source**

#####**[Link to dataset](https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Diabetes.csv)**

#####**The data is loaded from the YBI Foundation's GitHub repository.**

###**Import Library**

In [None]:
import pandas as pd
import numpy as np

###**Import Data**

In [None]:
df=pd.read_csv("https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Diabetes.csv")

In [None]:
df.head()

,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [None]:
df.shape

(768, 9)

###**Describe Data**

In [None]:
df.describe()

,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
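
The summary above reports a minimum of 0 for `glucose`, `diastolic`, `triceps`, `insulin` and `bmi`. Zeros in these columns are not physiologically plausible and may stand in for missing measurements. A minimal sketch for counting such zeros per column, using only the five rows printed by `df.head()` as stand-in data (the list `suspect_cols` is an assumption made for illustration, not part of the original notebook):

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head() above.
sample = pd.DataFrame({
    "glucose":   [148, 85, 183, 89, 137],
    "diastolic": [72, 66, 64, 66, 40],
    "triceps":   [35, 29, 0, 23, 35],
    "insulin":   [0, 0, 0, 94, 168],
    "bmi":       [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Hypothetical list of columns where 0 is unlikely to be a real measurement.
suspect_cols = ["glucose", "diastolic", "triceps", "insulin", "bmi"]

# Count zeros per column; on the full data this would be df[suspect_cols].eq(0).sum().
zero_counts = sample[suspect_cols].eq(0).sum()
print(zero_counts)
```

Whether to impute or drop such rows is a separate modeling decision; the notebook as written keeps them unchanged.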


###**Data Preprocessing**

In [None]:
df['diabetes'].value_counts()

0    500
1    268
Name: diabetes, dtype: int64
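
The counts show a moderately imbalanced target: about 65% of rows are non-diabetic, so a model that always predicts 0 would already reach roughly 65% accuracy. A small sketch of how the proportions can be obtained, rebuilding the label column from the counts above rather than downloading the data (`labels` is a stand-in name):

```python
import pandas as pd

# Rebuild a label series with the class counts reported above (500 zeros, 268 ones).
labels = pd.Series([0] * 500 + [1] * 268, name="diabetes")

# normalize=True returns fractions instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```

On the real data the equivalent call would be `df['diabetes'].value_counts(normalize=True)`.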

In [None]:
df.groupby('diabetes').mean()

diabetes,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


###**Define Target Variable (y) and Feature Variables (X)**

In [None]:
y = df['diabetes']

In [None]:
y.shape

(768,)

In [None]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: diabetes, Length: 768, dtype: int64

In [None]:
X=df.drop('diabetes', axis=1)

In [None]:
X.shape

(768, 8)

In [None]:
X

,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47


In [None]:
# Scale the X variables to the [0, 1] range (min-max normalization, not standardization)

from sklearn.preprocessing import MinMaxScaler

In [None]:
m = MinMaxScaler()

In [None]:
X=m.fit_transform(X)

In [None]:
X

array([[0.35294118, 0.74371859, 0.59016393, ..., 0.50074516, 0.23441503,
        0.48333333],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.39642325, 0.11656704,
        0.16666667],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.34724292, 0.25362938,
        0.18333333],
       ...,
       [0.29411765, 0.6080402 , 0.59016393, ..., 0.390462  , 0.07130658,
        0.15      ],
       [0.05882353, 0.63316583, 0.49180328, ..., 0.4485842 , 0.11571307,
        0.43333333],
       [0.05882353, 0.46733668, 0.57377049, ..., 0.45305514, 0.10119556,
        0.03333333]])
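
`MinMaxScaler` maps each column to the range [0, 1] using `(x - min) / (max - min)`. As a worked check, the sketch below applies that formula by hand to three features of the first row, with the minima and maxima copied from the `df.describe()` output. The variable names are illustrative only.

```python
import numpy as np

# Minimum and maximum of pregnancies, glucose and age, taken from df.describe() above.
col_min = np.array([0.0, 0.0, 21.0])
col_max = np.array([17.0, 199.0, 81.0])

# First row of df.head(): pregnancies=6, glucose=148, age=50.
row = np.array([6.0, 148.0, 50.0])

# Min-max scaling: (x - min) / (max - min), mapping each column to [0, 1].
scaled = (row - col_min) / (col_max - col_min)
print(scaled.round(8))
```

The three results correspond to the first, second and last entries of the first row in the scaled array printed above.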

###**Train Test Split**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=2529)


In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((537, 8), (231, 8), (537,), (231,))
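
The split above is purely random, so the class ratio in the test set can drift from the overall ratio (the evaluation later reports 86 diabetic cases among 231 test rows, about 37%, against about 35% overall). One common option, shown here only as a sketch and not as the notebook's method, is to pass `stratify` to `train_test_split`. The arrays `X_demo` and `y_demo` are stand-ins built from the class counts, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the dataset (500 zeros, 268 ones).
y_demo = np.array([0] * 500 + [1] * 268)
# Placeholder features; only the labels matter for this illustration.
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio nearly identical in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2529, stratify=y_demo
)

# Share of positive labels overall, in the training part and in the test part.
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```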

###**Modeling**

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
l = LogisticRegression()

In [None]:
l.fit(X_train, y_train)
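
In this notebook the scaler is fitted on all rows before the split, so the test rows influence the scaling. A possible alternative, sketched below on synthetic data, is to chain the scaler and the classifier in a scikit-learn pipeline so that scaling is learned from the training rows only. The data from `make_classification` is a stand-in for the diabetes features, and the names `X_demo`, `y_demo` and `model` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the diabetes features (768 rows, 8 columns).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=0
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2529
)

# The pipeline fits the scaler on the training rows only, then the classifier.
model = make_pipeline(MinMaxScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

# At prediction time the same fitted scaler is applied automatically.
print(model.score(X_te, y_te))
```

With a pipeline, new samples can be passed to `model.predict` unscaled, which avoids having to remember a separate scaling step.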

###**Prediction**

In [None]:
y_pred = l.predict(X_test)

In [None]:
y_pred.shape

(231,)

In [None]:
y_pred

array([0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1])

In [None]:
# Probability of Each Predicted Class
l.predict_proba(X_test)

array([[0.87946998, 0.12053002],
       [0.59958843, 0.40041157],
       [0.33869015, 0.66130985],
       [0.70625215, 0.29374785],
       [0.55893242, 0.44106758],
       [0.36724463, 0.63275537],
       [0.16360933, 0.83639067],
       [0.30066834, 0.69933166],
       [0.91142377, 0.08857623],
       [0.49981736, 0.50018264],
       [0.92067479, 0.07932521],
       [0.27488402, 0.72511598],
       [0.75356548, 0.24643452],
       [0.92015134, 0.07984866],
       [0.81541256, 0.18458744],
       [0.47341943, 0.52658057],
       [0.37791835, 0.62208165],
       [0.55553959, 0.44446041],
       [0.89719093, 0.10280907],
       [0.81016432, 0.18983568],
       [0.46241262, 0.53758738],
       [0.46405765, 0.53594235],
       [0.72861372, 0.27138628],
       [0.93939119, 0.06060881],
       [0.79376047, 0.20623953],
       [0.80186616, 0.19813384],
       [0.73978527, 0.26021473],
       [0.51847931, 0.48152069],
       [0.88234404, 0.11765596],
       [0.69813424, 0.30186576],
       [0.
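
Each row of `predict_proba` holds the estimated probability of class 0 and class 1. For a binary model, `predict` returns the more probable class, which is the same as thresholding the class-1 probability at 0.5. A small sketch using the first five probability rows printed above (variable names are illustrative):

```python
import numpy as np

# First five rows of the predict_proba output above: columns are P(class 0), P(class 1).
proba = np.array([
    [0.87946998, 0.12053002],
    [0.59958843, 0.40041157],
    [0.33869015, 0.66130985],
    [0.70625215, 0.29374785],
    [0.55893242, 0.44106758],
])

# Picking the more probable class is equivalent to thresholding P(class 1) at 0.5.
labels_argmax = proba.argmax(axis=1)
labels_threshold = (proba[:, 1] > 0.5).astype(int)

print(labels_argmax)
```

The result agrees with the first five entries of `y_pred`. Lowering the threshold below 0.5 would flag more patients as diabetic, trading precision for recall.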

###**Model Evaluation**

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
print(confusion_matrix(y_test, y_pred))

[[133  12]
 [ 46  40]]


In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.92      0.82       145
           1       0.77      0.47      0.58        86

    accuracy                           0.75       231
   macro avg       0.76      0.69      0.70       231
weighted avg       0.75      0.75      0.73       231
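
The figures in the report follow directly from the confusion matrix. The sketch below recomputes accuracy and the class-1 precision, recall and F1 score from the four counts. Note that recall for the diabetic class is below 0.5, meaning 46 of the 86 diabetic test cases were missed.

```python
import numpy as np

# Confusion matrix printed above: rows are actual classes, columns are predicted classes.
cm = np.array([[133, 12],
               [46, 40]])

# Unpack as true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of predicted diabetics, how many are diabetic
recall = tp / (tp + fn)           # of actual diabetics, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```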



###**Explanation**
#####**Get future predictions**

In [None]:
# A random row is sampled from the data to stand in for a new patient

X_new = df.sample(1)

In [None]:
X_new

,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
122,2,107,74,30,100,33.6,0.404,23,0


In [None]:
X_new.shape

(1, 9)

In [None]:
X_new = X_new.drop('diabetes', axis = 1)

In [None]:
X_new

,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
122,2,107,74,30,100,33.6,0.404,23


In [None]:
X_new.shape

(1, 8)

In [None]:
# Reuse the scaler fitted on X; refitting on a single row would map every feature to 0
X_new = m.transform(X_new)

In [None]:
ypred_new = l.predict(X_new)

In [None]:
ypred_new

array([0])
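
When a new sample is scaled, the scaler that was fitted on the original features should be reused through `transform`. Calling `fit_transform` on a single row makes the minimum and maximum of every column equal, so all features collapse to 0. The sketch below illustrates the difference on a few stand-in rows taken from `df.head()` (three columns only, for brevity; `X_fit` and `x_new` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in "training" features: pregnancies, glucose, age from the df.head() rows.
X_fit = np.array([
    [6, 148, 50],
    [1, 85, 31],
    [8, 183, 32],
    [1, 89, 21],
    [0, 137, 33],
], dtype=float)

# The sampled patient (row 122): pregnancies=2, glucose=107, age=23.
x_new = np.array([[2, 107, 23]], dtype=float)

scaler = MinMaxScaler().fit(X_fit)

# Reusing the fitted scaler keeps the new row on the training scale.
reused = scaler.transform(x_new)

# Refitting on one row makes min == max per column, so every value maps to 0.
refit = MinMaxScaler().fit_transform(x_new)

print(reused)
print(refit)
```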

###**The predicted and actual values are both 0, i.e. non-diabetic**