# Diabetes Likelihood Prediction

## Dataset Overview

The dataset contains eight health-related features per individual and a binary class variable (0 or 1) that records the absence or presence of diabetes. The columns are:

- Number of times pregnant: The count of pregnancies a person has experienced.
- Plasma glucose concentration: The blood glucose level measured 2 hours after an oral glucose tolerance test, providing insights into the body's ability to handle glucose.
- Diastolic blood pressure: The pressure in the arteries when the heart is at rest, measured in millimeters of mercury (mm Hg).
- Triceps skin fold thickness: The thickness of a fold of skin on the back of the arm, measured in millimeters, which can be an indicator of body fat.
- 2-Hour serum insulin: The insulin level in the blood measured 2 hours after consuming glucose, reflecting the body's insulin response.
- Body mass index (BMI): Weight in kilograms divided by the square of height in meters, used as a rough proxy for body fat.
- Diabetes pedigree function: A function that scores the likelihood of diabetes based on family history.
- Age: The age of the individual in years.
- Class variable (0 or 1): The target variable indicating the presence (1) or absence (0) of diabetes.

The dataset suits supervised learning: a model is trained on the eight features to predict the class variable, and the fitted model can then be used to study which features are associated with diabetes.


## Data Preprocessing

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

### Importing Dataset

In [5]:
df = pd.read_csv('diabetes_dataset.csv')
print(df.head(10))

   Number of times pregnant  \
0                         6   
1                         1   
2                         8   
3                         1   
4                         0   
5                         5   
6                         3   
7                        10   
8                         2   
9                         8   

   Plasma glucose concentration a 2 hours in an oral glucose tolerance test  \
0                                                148                          
1                                                 85                          
2                                                183                          
3                                                 89                          
4                                                137                          
5                                                116                          
6                                                 78                          
7                         

In [4]:
print(df.describe())

       Number of times pregnant  \
count                768.000000   
mean                   3.845052   
std                    3.369578   
min                    0.000000   
25%                    1.000000   
50%                    3.000000   
75%                    6.000000   
max                   17.000000   

       Plasma glucose concentration a 2 hours in an oral glucose tolerance test  \
count                                         768.000000                          
mean                                          120.894531                          
std                                            31.972618                          
min                                             0.000000                          
25%                                            99.000000                          
50%                                           117.000000                          
75%                                           140.250000                          
max                 
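The summary reports a minimum plasma glucose of 0, which is not physiologically plausible and probably marks a missing measurement rather than a real reading. The sketch below shows one way to count such zeros and flag them as missing. It uses a small toy frame with hypothetical short column names (`glucose`, `bmi`, `pregnancies`), not the real CSV headers, and the choice of which columns treat zero as missing is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical short column names; the real CSV uses longer headers.
toy = pd.DataFrame({
    "glucose": [148, 85, 0, 89],
    "bmi": [33.6, 0.0, 23.3, 28.1],
    "pregnancies": [6, 1, 8, 0],  # a zero here is a valid value
})

# Columns where a zero cannot be a real measurement (an assumption for this sketch).
zero_means_missing = ["glucose", "bmi"]

# Count suspicious zeros per column.
zero_counts = (toy[zero_means_missing] == 0).sum()
print(zero_counts)

# Replace those zeros with NaN so later statistics ignore them.
cleaned = toy.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)
print(cleaned.isna().sum())
```

Whether to impute or drop the flagged rows afterwards is a separate decision that this sketch does not make.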

### Splitting the Data into Training and Test Sets

In [3]:
from sklearn.model_selection import train_test_split

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
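The split above is purely random, so the share of positive cases can differ between the training and test sets. A possible variant, shown here on synthetic stand-in data rather than on `df`, passes `stratify` so both parts keep roughly the same class ratio. The names `X_demo` and `y_demo` and the 35% positive rate are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 8 features, 35% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = np.array([1] * 35 + [0] * 65)

# stratify=y_demo asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Both means should be close to 0.35.
print(y_tr.mean(), y_te.mean())
```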

### Feature Scaling
- Standardization centers each feature at zero and scales it to a standard deviation of 1.
- The scaler is fit on X_train only and then used to transform both X_train and X_test, so no test-set statistics leak into training.

In [4]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
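To make the fit-on-train-only rule concrete, here is a minimal NumPy sketch of the same arithmetic on made-up numbers. It assumes the scaler uses the population standard deviation (`ddof=0`), and the arrays `X_train_demo` and `X_test_demo` are hypothetical.

```python
import numpy as np

# Made-up data: two features on very different scales.
X_train_demo = np.array([[1.0, 100.0], [3.0, 140.0], [5.0, 180.0]])
X_test_demo = np.array([[3.0, 120.0]])

# Statistics come from the training rows only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0)

# The same training statistics are applied to both sets.
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 per feature
print(X_train_scaled.std(axis=0))   # approximately 1 per feature
print(X_test_scaled)                # not necessarily centered: it reuses train statistics
```

The test set generally does not end up with exactly zero mean and unit variance, which is expected.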

## Building the Artificial Neural Network

### Architecture

In [5]:
# Four hidden ReLU layers of increasing width, then one sigmoid unit for the binary output
ann = tf.keras.models.Sequential()                            #type: ignore
ann.add(tf.keras.layers.Dense(units=8, activation='relu'))    #type: ignore
ann.add(tf.keras.layers.Dense(units=16, activation='relu'))   #type: ignore
ann.add(tf.keras.layers.Dense(units=32, activation='relu'))   #type: ignore
ann.add(tf.keras.layers.Dense(units=64, activation='relu'))   #type: ignore
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid')) #type: ignore
# Binary cross-entropy pairs with the sigmoid output; accuracy is reported during training
ann.compile(optimizer='adamax', loss='binary_crossentropy', metrics=['accuracy'])
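As a rough size check for this architecture, the number of trainable parameters can be worked out by hand: a dense layer with `fan_in` inputs and `units` outputs has `fan_in * units` weights plus `units` biases. The sketch below assumes the network receives the 8 feature columns as input. The helper `dense_param_counts` is a hypothetical name, not part of Keras.

```python
input_dim = 8                      # eight feature columns (assumed input width)
layer_units = [8, 16, 32, 64, 1]   # units per Dense layer, in order

def dense_param_counts(input_dim, layer_units):
    """Return the weight-plus-bias count of each fully connected layer."""
    counts = []
    fan_in = input_dim
    for units in layer_units:
        counts.append(fan_in * units + units)  # weights + biases
        fan_in = units                         # this layer feeds the next
    return counts

counts = dense_param_counts(input_dim, layer_units)
print(counts)       # [72, 144, 544, 2112, 65]
print(sum(counts))  # 2937
```

With only a few hundred training rows, a model of roughly 2,900 parameters can overfit, so comparing training and test accuracy is worthwhile.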

### Training the Neural Network

In [6]:
ann.fit(X_train, y_train, batch_size=32, epochs=100)

Epoch 1/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.4942 - loss: 0.6948
Epoch 2/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7079 - loss: 0.6587
Epoch 3/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6608 - loss: 0.6422
Epoch 4/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6444 - loss: 0.6289
Epoch 5/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6779 - loss: 0.6013
Epoch 6/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6904 - loss: 0.5896
Epoch 7/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7079 - loss: 0.5754
Epoch 8/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6931 - loss: 0.5823
Epoch 9/100
20/20 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x155bab72b10>
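The `20/20` in each log line is the number of mini-batches per epoch. The arithmetic below reproduces it from the 768 rows reported by `describe()`, the 0.2 test fraction, and the batch size of 32. It assumes the test-set size is rounded up and the training set takes the remainder, and that the last, smaller batch still counts as a step.

```python
import math

n_rows = 768       # row count reported by describe()
test_size = 0.2    # fraction passed to train_test_split
batch_size = 32    # batch size passed to fit

n_test = math.ceil(n_rows * test_size)           # 154 test rows (rounded up, assumed)
n_train = n_rows - n_test                        # 614 training rows
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_test)   # 614 154
print(steps_per_epoch)   # 20
```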

## Evaluating the Model

### Predicting Test Set Results

In [7]:
y_pred = ann.predict(X_test)
y_pred = (y_pred > 0.5)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
[[1 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [0 1]
 [0 1]
 [0 0]
 [0 0]
 [1 0]
 [0 1]
 [0 1]
 [1 1]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [0 1]
 [0 0]
 [0 1]
 [0 1]
 [1 0]
 [1 1]
 [1 1]
 [1 0]
 [1 0]
 [0 0]
 [1 1]
 [0 1]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [0 1]
 [0 0]
 [0 1]
 [0 1]
 [0 0]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 1]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [1 1]
 [1 0]
 [1 1]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [1 
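Each row of the output above pairs a predicted label (left) with the true label (right). The thresholding step that produces the predicted labels can be sketched on made-up probabilities as follows. The array `probs` is a hypothetical stand-in for the sigmoid outputs of `ann.predict`, which have shape `(n, 1)`.

```python
import numpy as np

# Made-up sigmoid outputs with shape (n, 1), as a predict call would return them.
probs = np.array([[0.91], [0.12], [0.50], [0.67]])
y_true_demo = np.array([1, 0, 1, 0])

# Strictly greater than 0.5 counts as class 1, so exactly 0.50 maps to class 0.
labels = (probs > 0.5).astype(int).ravel()

# Predicted label in the first column, true label in the second.
pairs = np.column_stack((labels, y_true_demo))
print(pairs)
```

Lowering the threshold below 0.5 would trade more false positives for fewer missed cases, which can matter in a screening setting.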

### Confusion Matrix

In [8]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(cm)
accuracy_score(y_true=y_test, y_pred=y_pred)

[[86 13]
 [23 32]]


0.7662337662337663
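Accuracy alone hides how the errors are split between the two classes. Reading the matrix above in scikit-learn's layout (rows are true labels, columns are predicted labels), precision, recall, and F1 for the diabetic class can be derived directly. The variable name `cm_demo` is used only to avoid overwriting `cm`.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm_demo = np.array([[86, 13],
                    [23, 32]])

tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()   # 118 / 154
precision = tp / (tp + fp)             # 32 / 45
recall = tp / (tp + fn)                # 32 / 55
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.766 precision=0.711 recall=0.582 f1=0.640
```

A recall of about 0.58 means that 23 of the 55 diabetic cases in the test set were missed, which the headline accuracy of 0.77 does not show.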