# About the project

This project builds a logistic regression model that predicts a patient's diabetes status (positive or negative) from medical history and demographic information: age, gender, body mass index (BMI), hypertension, heart disease, HbA1c level, and blood glucose level. Such a model can help explore the relationships between these factors and the likelihood of developing diabetes. It can also assist healthcare professionals in identifying patients who may be at risk and in developing personalized treatment plans.

The dataset used in this project was downloaded from <a href="https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset?resource=download">Kaggle</a> and consists of 100,000 observations, each representing a patient.

# Import Statements

In [1]:
import pandas as pd
import numpy as np

# Read the Data

In [2]:
dataset = pd.read_csv('diabetes_prediction_dataset.csv')

# Data Exploration & Cleaning

In [3]:
print(f"The dataset has the shape of {dataset.shape}.")

The dataset has the shape of (100000, 8).


In [4]:
dataset.head()

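| | gender | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | 20.14 | 4.8 | 155 | 0 |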


### Check missing values

In [5]:
print(f"Missing values: {dataset.isna().values.any()}")


Missing values: False


In [6]:
print(f"Check if there is any missing value in a column:\n{dataset.isna().any()}")

Check if there is any missing value in a column:
gender                 False
age                    False
hypertension           False
heart_disease          False
bmi                    False
HbA1c_level            False
blood_glucose_level    False
diabetes               False
dtype: bool
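
Had any column reported `True`, a per-column count would show how much data is affected. The sketch below illustrates this with `isna().sum()` on a small made-up frame; `toy` and its values are hypothetical and are not taken from the project's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to illustrate the check
toy = pd.DataFrame({
    "age": [80.0, np.nan, 28.0],
    "bmi": [25.19, 27.32, np.nan],
    "diabetes": [0, 0, 1],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Names of the columns that contain at least one missing value
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_missing)
```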


### Check duplicated values

In [7]:
print(f"Duplicated values: {dataset.duplicated().values.any()}")

Duplicated values: True


#### Drop duplicated values

In [8]:
dataset = dataset.drop_duplicates()

In [9]:
dataset.shape

(91313, 8)
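
The two shapes reported above imply that 100,000 - 91,313 = 8,687 duplicated rows were removed. A minimal sketch of how that count could be obtained directly, shown on a hypothetical toy frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame in which the third row repeats the first
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "age": [54.0, 28.0, 54.0, 76.0],
    "diabetes": [0, 0, 0, 1],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(len(toy) - len(deduplicated))
```

Both printed numbers should agree, since each row flagged by `duplicated()` is exactly one row that `drop_duplicates()` removes.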

# Build the Logistic Regression Model

## Create the feature matrix and the dependent variable vector

In [10]:
# Features: every column except the last one
X = dataset.iloc[:, :-1].values
# Dependent variable: the last column (diabetes)
y = dataset.iloc[:, -1].values

In [11]:
print(X)

[['Female' 80.0 0 ... 25.19 6.6 140]
 ['Female' 54.0 0 ... 27.32 6.6 80]
 ['Male' 28.0 0 ... 27.32 5.7 158]
 ...
 ['Male' 66.0 0 ... 27.83 5.7 155]
 ['Female' 24.0 0 ... 35.42 4.0 100]
 ['Female' 57.0 0 ... 22.43 6.6 90]]


In [12]:
print(y)

[0 0 0 ... 0 0 0]


## Encoding the Independent Variable

Since the `gender` column holds categorical data, I handle it with one-hot encoding. This replaces the column with one binary column per distinct value, each indicating whether that value is present in a given row. In this dataset `gender` takes three distinct values (`Female`, `Male`, and one more), which is why the encoded output starts with three indicator columns. One-hot encoding lets machine learning models work with categorical data without implying an ordering between categories that does not exist.

In [13]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (gender) and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

#### Check the set of features (X) once again after encoding

In [14]:
print(X)

[[1.0 0.0 0.0 ... 25.19 6.6 140]
 [1.0 0.0 0.0 ... 27.32 6.6 80]
 [0.0 1.0 0.0 ... 27.32 5.7 158]
 ...
 [0.0 1.0 0.0 ... 27.83 5.7 155]
 [1.0 0.0 0.0 ... 35.42 4.0 100]
 [1.0 0.0 0.0 ... 22.43 6.6 90]]
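
To make the encoding step easier to follow, here is a minimal sketch that applies the same `ColumnTransformer` setup to a tiny hand-made array. `X_toy` and `ct_toy` are hypothetical names, and the toy data has only two gender values, so it yields two indicator columns instead of three.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy features: [gender, age, bmi]
X_toy = np.array([
    ["Female", 80.0, 25.19],
    ["Male", 28.0, 27.32],
    ["Female", 36.0, 23.45],
], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = np.array(ct_toy.fit_transform(X_toy))

# The encoder sorts categories, so the first indicator column stands for 'Female'
categories = ct_toy.named_transformers_["encoder"].categories_[0]
print(categories)
print(X_encoded)
```

The encoded columns come first and the passed-through columns follow, which matches the layout of `X` printed above.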


## Splitting the dataset into the Training set and Test set

A train/test split is a model validation procedure that simulates how a model would perform on new, unseen data. Here, 20% of the data is held out as the test set and the remaining 80% forms the training set.

The model is then trained on the training set and its performance is evaluated on the test set.

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
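
As a quick illustration of what `test_size` and `random_state` do, the sketch below splits a hypothetical 10-sample array (`X_toy`, `y_toy`) with the same arguments and checks the resulting sizes and reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# 80% of the rows go to training and 20% to testing
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te_again))
```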

## Training the Logistic Regression model on the Training set

In [16]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0, max_iter=10000)
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=10000, random_state=0)

## Predicting the Test set results

In [17]:
y_pred = classifier.predict(X_test)
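
For intuition about what `predict` returns, the sketch below shows on synthetic stand-in data that a binary `LogisticRegression` passes a linear score through the logistic (sigmoid) function to get a probability, and predicts class 1 when that probability exceeds 0.5. The names `X_demo`, `y_demo`, and `model` are assumptions for illustration; this is not the diabetes dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(random_state=0, max_iter=10000)
model.fit(X_demo, y_demo)

# Linear score (intercept + weighted sum of features) for each sample
scores = model.decision_function(X_demo)

# The sigmoid maps each score to an estimated probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-scores))
sklearn_proba = model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))

# predict() returns class 1 when the score is positive, i.e. P(class 1) > 0.5
manual_labels = (scores > 0).astype(int)
print(np.array_equal(manual_labels, model.predict(X_demo)))
```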

## Making the Confusion Matrix

A confusion matrix is a performance evaluation tool for classification models. It tabulates the number of true positives, true negatives, false positives, and false negatives, from which accuracy and other metrics can be derived.

In [18]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[16470   160]
 [  606  1027]]


In scikit-learn's layout, rows correspond to true labels and columns to predicted labels. With diabetes (label 1) as the positive class, the confusion matrix shows that the logistic regression model produces 16,470 true negatives, 1,027 true positives, 160 false positives, and 606 false negatives on the test set.

Based on the resulting confusion matrix, let's calculate the accuracy of this model.

In [19]:
# Accuracy = correct predictions / all predictions
(16470 + 1027) / (16470 + 1027 + 606 + 160)

0.9580572742703828
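
The same matrix also yields per-class metrics. The sketch below re-enters the reported values by hand (`cm_reported` is a hypothetical name) and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, so for labels 0/1 the flattened order is TN, FP, FN, TP.

```python
import numpy as np

# Confusion matrix reported above; rows are true labels, columns are predictions
cm_reported = np.array([[16470, 160],
                        [606, 1027]])

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm_reported.ravel()

accuracy = (tn + tp) / cm_reported.sum()
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
```

Only 1,633 of the 18,263 test patients are diabetic, so accuracy is dominated by the negative class. Recall is worth reading alongside it, since 606 of those 1,633 positive cases are missed.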

## Computing the accuracy with k-Fold Cross Validation

So far I have evaluated the model only once, so I cannot tell whether the good result is due to a lucky split. Evaluating the model several times gives more confidence in the model design, so I employ k-fold cross-validation. This approach divides the training data into k groups of samples of roughly equal size, called folds. In each of the k rounds, the model is trained on k-1 folds and the remaining fold is used as the test set.

The result of a k-fold cross-validation run is summarized by the mean of the accuracy scores. I also report their standard deviation as a measure of how much the scores vary across folds.

In [20]:
from sklearn.model_selection import cross_val_score
# 10-fold cross-validation on the training set only; the test set stays untouched
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 95.68 %
Standard Deviation: 0.21 %
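
To show what `cross_val_score` does internally, here is a minimal sketch that runs the fold loop by hand on synthetic stand-in data and compares the result with the library call. It assumes scikit-learn's default behaviour for a classifier with an integer `cv`, namely stratified folds without shuffling; `X_demo`, `y_demo`, and `model` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression(random_state=0, max_iter=10000)

# Stratified, unshuffled folds: the default for a classifier with cv=10
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = LogisticRegression(random_state=0, max_iter=10000)
    # Train on k-1 folds
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score accuracy on the single held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(estimator=model, X=X_demo, y=y_demo, cv=10)
print(np.allclose(manual_scores, library_scores))
print(f"{manual_scores.mean() * 100:.2f} % +/- {manual_scores.std() * 100:.2f} %")
```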


The model performs consistently well across folds: the mean accuracy from 10-fold cross-validation is 95.68%, with a standard deviation of only 0.21%.