# Diabetes Prediction - ML Project Pipeline

## Project Summary

Diabetes mellitus, a chronic metabolic disorder, is characterized by the body's impaired ability to utilize blood sugar (glucose) effectively. The American Diabetes Association categorizes diabetes into two primary types.

### Type 1 diabetes
This form often manifests in childhood. It results from an autoimmune response where the body's immune system mistakenly attacks and destroys insulin-producing beta cells in the pancreas. This destruction leads to a deficiency in insulin production. The etiology of this autoimmune response is likely multifactorial, potentially involving a combination of genetic predisposition, environmental factors, and viral infections.

### Type 2 diabetes
This is the more prevalent type, typically diagnosed in adulthood. It arises due to either insufficient insulin secretion or the development of insulin resistance within the body's cells. Risk factors associated with type 2 diabetes include a positive family history, obesity, and physical inactivity.

Beyond these primary types, less common forms of diabetes can occur due to genetic defects, pancreatic dysfunction, or exposure to medications or chemicals.

### Gestational diabetes mellitus (GDM)
This is a temporary type of diabetes that can develop during pregnancy. Hormonal and metabolic changes during gestation can lead to insulin resistance, causing the body to utilize blood sugar less efficiently. While GDM typically resolves after childbirth, it increases the mother's risk of developing type 2 diabetes later in life.

### Maternal inheritance of diabetes and its impact on offspring
- Gestational diabetes itself is unlikely to directly cause diabetes in the baby.
- If the mother has pre-existing type 2 diabetes, the child has an elevated risk of developing type 2 diabetes later in life due to genetic predisposition.
- Children of mothers with type 1 diabetes have a slightly increased risk of developing type 1 diabetes themselves, though this risk remains relatively low.

Diabetes is a multifactorial disease. Many machine learning models have been built to help doctors diagnose diabetes in new patients from clinical features, and many of these models were trained on the well-known Pima Indians Diabetes dataset.

In this project, we build such a model using data from a recent study [Chou et al., J. Pers. Med. 2023] of 15,000 women aged 20 to 80 who were patients at the Taipei Municipal medical center. These women visited the medical center during 2018–2020 and 2021–2022, with or without a diagnosis of diabetes.

### The dataset – TAIPEI_diabetes.csv
The dataset records 8 features for each of the 15,000 women:
- **Pregnancies**: Number of times pregnant
- **PlasmaGlucose**: Plasma glucose concentration after 2 hours in an oral glucose tolerance test
- **DiastolicBloodPressure**: Diastolic blood pressure (mm Hg)
- **TricepsThickness**: Triceps skin fold thickness (mm)
- **SerumInsulin**: 2-hour serum insulin (μU/mL)
- **BMI**: Body mass index (weight in kg/(height in m)^2)
- **DiabetesPedigree**: A function that scores the probability of diabetes based on family history
- **Age**: Age in years

The target variable is the last column of the table:
- **Diabetic**: 1 = diabetes diagnosed, 0 = no diabetes diagnosed
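A quick check of this layout can be sketched as follows. The frame is built from the five rows shown in the `head()` preview of this notebook, purely for illustration; on the real file the same checks would be run on the DataFrame returned by `pd.read_csv`. The names `FEATURES`, `TARGET`, and `check_schema` are hypothetical helpers, not part of the original notebook.

```python
import pandas as pd

# Expected column names, taken from the dataset description above
FEATURES = ["Pregnancies", "PlasmaGlucose", "DiastolicBloodPressure",
            "TricepsThickness", "SerumInsulin", "BMI", "DiabetesPedigree", "Age"]
TARGET = "Diabetic"


def check_schema(df):
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in FEATURES + [TARGET] if col not in df.columns]


# Five preview rows copied from the head() output of this notebook
sample = pd.DataFrame({
    "PatientID": [1354778, 1147438, 1640031, 1883350, 1424119],
    "Pregnancies": [0, 8, 7, 9, 1],
    "PlasmaGlucose": [171, 92, 115, 103, 85],
    "DiastolicBloodPressure": [80, 93, 47, 78, 59],
    "TricepsThickness": [34, 47, 52, 25, 27],
    "SerumInsulin": [23, 36, 35, 304, 35],
    "BMI": [43.509726, 21.240576, 41.511523, 29.582192, 42.604536],
    "DiabetesPedigree": [1.213191, 0.158365, 0.079019, 1.28287, 0.549542],
    "Age": [21, 23, 23, 43, 22],
    "Diabetic": [0, 0, 0, 1, 0],
})

missing = check_schema(sample)
class_counts = sample[TARGET].value_counts().to_dict()
print("Missing columns:", missing)
print("Class counts:", class_counts)
```

Running the same `value_counts()` call on the full dataset is a simple way to see how imbalanced the two classes are before deciding on a resampling strategy.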

## 1. Importing Libraries and Loading Data

In [3]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE


In [5]:
# Load the dataset
diabetes_csv = pd.read_csv(r'../TAIPEI_diabetes.csv')
diabetes_csv.head()

| | PatientID | Pregnancies | PlasmaGlucose | DiastolicBloodPressure | TricepsThickness | SerumInsulin | BMI | DiabetesPedigree | Age | Diabetic |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1354778 | 0 | 171 | 80 | 34 | 23 | 43.509726 | 1.213191 | 21 | 0 |
| 1 | 1147438 | 8 | 92 | 93 | 47 | 36 | 21.240576 | 0.158365 | 23 | 0 |
| 2 | 1640031 | 7 | 115 | 47 | 52 | 35 | 41.511523 | 0.079019 | 23 | 0 |
| 3 | 1883350 | 9 | 103 | 78 | 25 | 304 | 29.582192 | 1.28287 | 43 | 1 |
| 4 | 1424119 | 1 | 85 | 59 | 27 | 35 | 42.604536 | 0.549542 | 22 | 0 |


## 2. Feature Engineering

In [6]:
# Create age groups (include_lowest=True keeps Age == 20 in the first bin instead of NaN)
diabetes_csv['AgeGroup'] = pd.cut(diabetes_csv['Age'], bins=[20, 30, 40, 50, 60, 70, 80], labels=['20-30', '31-40', '41-50', '51-60', '61-70', '71-80'], include_lowest=True)

# Create BMI categories
diabetes_csv['BMICategory'] = pd.cut(diabetes_csv['BMI'], bins=[0, 18.5, 24.9, 29.9, 100], labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

In [7]:
# One-hot encode the categorical variables
diabetes_csv = pd.get_dummies(diabetes_csv, columns=['AgeGroup', 'BMICategory'], drop_first=True)
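The binning and encoding steps can be illustrated on a handful of boundary values. This is a minimal sketch on a made-up `ages` series, not the real data: `pd.cut` uses right-closed intervals by default, so a value equal to the lowest edge gets no label unless `include_lowest=True` is passed, and `drop_first=True` removes the first category so that it acts as the reference level.

```python
import pandas as pd

# Made-up boundary ages, chosen to show how the bin edges behave
ages = pd.Series([20, 21, 30, 31, 80])
edges = [20, 30, 40, 50, 60, 70, 80]
labels = ["20-30", "31-40", "41-50", "51-60", "61-70", "71-80"]

# Default: intervals are (20, 30], (30, 40], ... so 20 itself is not covered
default_bins = pd.cut(ages, bins=edges, labels=labels)
# include_lowest=True closes the first interval on the left: [20, 30]
closed_bins = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)

n_missing_default = int(default_bins.isna().sum())
n_missing_closed = int(closed_bins.isna().sum())

# One-hot encode; the first category ("20-30") is dropped as the reference level
dummies = pd.get_dummies(closed_bins, prefix="AgeGroup", drop_first=True)

print("NaN labels without include_lowest:", n_missing_default)
print("NaN labels with include_lowest:", n_missing_closed)
print(list(dummies.columns))
```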

## 3. Preprocessing Data

In [18]:
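# Drop the patient identifier and the one-hot columns created in the feature engineering step;
# errors='ignore' skips any listed column that is not present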
diabetes_csv = diabetes_csv.drop(columns=['PatientID','AgeGroup_31-40', 'AgeGroup_41-50', 'AgeGroup_51-60',
       'AgeGroup_61-70', 'AgeGroup_71-80', 'BMICategory_Normal',
       'BMICategory_Overweight', 'BMICategory_Obese'], errors='ignore')

# Separate features and target variable
X = diabetes_csv.drop(columns=['Diabetic'])
y = diabetes_csv['Diabetic']

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
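What `StandardScaler` computes can be checked by hand: each column is centred on its mean and divided by its population standard deviation (`ddof=0`). The sketch below uses a small made-up array `X_demo` rather than the real feature matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up two-column array standing in for the feature matrix
X_demo = np.array([[171.0, 43.5],
                   [92.0, 21.2],
                   [115.0, 41.5],
                   [103.0, 29.6]])

scaler_demo = StandardScaler().fit(X_demo)
scaled = scaler_demo.transform(X_demo)

# Same computation by hand: z = (x - mean) / std, column by column
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Because the fitted scaler stores the column means and standard deviations, the same object has to be reused to transform any new patient record before prediction.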

## 4. Handling Imbalanced Data

In [9]:
# Apply SMOTE to balance the classes
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_scaled, y)
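The idea behind SMOTE can be sketched in plain NumPy: a synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function `smote_like_sample` below is a simplified, hypothetical illustration of that interpolation step only; it is not the `imblearn` implementation and omits its handling of class ratios and edge cases.

```python
import numpy as np


def smote_like_sample(X_minority, k=2, n_new=5, seed=42):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances within the minority class
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))      # pick a minority sample
        nn = neighbours[base, rng.integers(k)]    # pick one of its neighbours
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nn] - X_minority[base])
    return synthetic


# Made-up minority samples at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_sample(X_min)
print(new_points)
```

Since every synthetic point lies between two existing minority samples, the new points stay inside the region already occupied by the minority class.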

## 5. Splitting Data

In [10]:
# Split the resampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
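In the cells above, scaling and SMOTE are applied to the full dataset before the split, so the test set shares scaling statistics with the training set and can contain synthetic samples. A commonly used alternative ordering is to split first and fit the preprocessing on the training portion only. The sketch below shows that ordering on made-up data (`X_demo`, `y_demo`); it is a variation under stated assumptions, not the pipeline that produced the results reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up imbalanced data: 200 negatives, 100 positives, 8 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = np.array([0] * 200 + [1] * 100)

# 1. Split first; stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2. Fit the scaler on the training rows only, then apply it to both parts
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# 3. Any resampling (for example SMOTE) would then be applied to
#    (X_tr_s, y_tr) only, leaving the test set with real samples
print(len(y_tr), len(y_te), int(y_te.sum()))
```

With this ordering the test metrics describe performance on real, unseen records at the original class ratio.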

## 6. Training Model

In [11]:
# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

## 7. Evaluating Model

In [12]:
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Accuracy: 0.8335
Confusion Matrix:
[[1679  322]
 [ 344 1655]]
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.84      0.83      2001
           1       0.84      0.83      0.83      1999

    accuracy                           0.83      4000
   macro avg       0.83      0.83      0.83      4000
weighted avg       0.83      0.83      0.83      4000
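The figures in the classification report follow directly from the confusion matrix printed above. As a worked check, assuming scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[1679, 322],
               [344, 1655]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # correct predictions / all predictions
precision_1 = tp / (tp + fp)             # predicted diabetic that really are
recall_1 = tp / (tp + fn)                # actual diabetic that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy    = {accuracy:.4f}")
print(f"precision_1 = {precision_1:.2f}")
print(f"recall_1    = {recall_1:.2f}")
print(f"f1_1        = {f1_1:.2f}")
```

The 344 false negatives are diabetic cases predicted as non-diabetic, which is usually the more costly error in a screening setting.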



In [13]:
# Saving the model
with open("log_model.pkl", "wb") as file:
    pickle.dump(model, file)

In [20]:
# Saving the scaler
with open("log_scaler.pkl", "wb") as file:
    pickle.dump(scaler, file)
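Both pickled objects are needed at prediction time: the scaler to transform a new record, and the model to score it. The sketch below shows the round trip with toy stand-ins (`scaler_toy`, `model_toy`) fitted on made-up data and written to a temporary directory, so that it runs without the real files; only the file names are taken from the cells above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the fitted scaler and model (made-up data, 8 features)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 8))
y_toy = (X_toy[:, 1] > 0).astype(int)
scaler_toy = StandardScaler().fit(X_toy)
model_toy = LogisticRegression().fit(scaler_toy.transform(X_toy), y_toy)

with tempfile.TemporaryDirectory() as tmp:
    # Save both objects, as in the cells above
    for name, obj in [("log_model.pkl", model_toy), ("log_scaler.pkl", scaler_toy)]:
        with open(os.path.join(tmp, name), "wb") as fh:
            pickle.dump(obj, fh)

    # Load them back, as a separate application would
    with open(os.path.join(tmp, "log_model.pkl"), "rb") as fh:
        loaded_model = pickle.load(fh)
    with open(os.path.join(tmp, "log_scaler.pkl"), "rb") as fh:
        loaded_scaler = pickle.load(fh)

# Score one record: scale with the loaded scaler, then predict
new_record = X_toy[:1]
proba = loaded_model.predict_proba(loaded_scaler.transform(new_record))[0, 1]
print(f"Predicted probability of class 1: {proba:.3f}")
```

Pickle files should only be loaded from trusted sources, and with the same scikit-learn version that created them where possible.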

In this project, we built a logistic regression model to predict diabetes status from the TAIPEI_diabetes dataset. We performed exploratory data analysis, feature engineering, and data preprocessing, handled class imbalance using SMOTE, and trained and evaluated the model. The model achieved an accuracy of 83.4% on the test split (0.8335 in the output above), with balanced precision, recall, and F1-scores of about 0.83 for both classes. We also analyzed the impact of each feature on the probability of being diabetic, identifying the most influential features.
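One way to read feature influence from a logistic regression fitted on standardized inputs is to exponentiate its coefficients into odds ratios. The sketch below does this for a toy model trained on made-up data in which the second column is deliberately the strongest driver; the resulting ranking illustrates the technique only and says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["Pregnancies", "PlasmaGlucose", "DiastolicBloodPressure",
                 "TricepsThickness", "SerumInsulin", "BMI", "DiabetesPedigree", "Age"]

# Made-up standardized features; labels depend mostly on column 1, weakly on column 5
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 8))
logits = 2.0 * X_toy[:, 1] + 0.5 * X_toy[:, 5]
y_toy = (logits + rng.normal(size=500) > 0).astype(int)

model_toy = LogisticRegression().fit(X_toy, y_toy)

# exp(coefficient) = multiplicative change in the odds per one standard deviation
odds_ratios = np.exp(model_toy.coef_[0])

# Rank features by the magnitude of their coefficient
ranking = sorted(zip(feature_names, odds_ratios),
                 key=lambda pair: abs(np.log(pair[1])), reverse=True)
for name, ratio in ranking:
    print(f"{name:<24} odds ratio = {ratio:.2f}")
```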