## Logistic Regression Model for Diabetes Prediction

Objective: Build a model that predicts the likelihood that a patient has diabetes from a set of medical features.

Dataset: The dataset "diabetes" contains information about the medical history of patients, including features like Glucose level, Blood Pressure, BMI, etc., and a target variable indicating whether the patient has diabetes (1) or not (0).

GitHub Repository: https://github.com/davhig/diabetes_prediction

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:

# Load the dataset
df = pd.read_csv("datasets_228_482_diabetes.csv")

# Display the first few rows of the dataset
print(df.head())

# Check the structure and summary statistics of the dataset
print(df.info())
print(df.describe())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768

From the statistics, we can observe that several features have a minimum value of 0: Glucose, BloodPressure, SkinThickness, Insulin, and BMI. These zeros most likely represent missing or invalid data, since such physiological measurements cannot be zero in a living patient.

To handle this, we replace the zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI with NaN so that they are explicitly marked as missing. We then fill those NaN values with the column median. Physiological measurements can contain outliers, and the median is more robust to them than the mean, so the imputed values stay at a typical level for each feature. Note that filling many rows with the same value does shrink that feature's variance.
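The imputation in this notebook computes the medians over the full dataset before the train/test split, so test rows influence the fill values. A minimal sketch of a variant that learns the medians from the training rows only is shown here. The toy frame, its values, and the `zero_as_missing` list are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the diabetes data (values are made up)
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1, 25.6],
    "Pregnancies": [6, 1, 8, 1, 0, 5],  # zero is a valid value here
})

# Hypothetical list of columns where zero cannot be a real measurement
zero_as_missing = ["Glucose", "BMI"]

# Count the suspicious zeros before touching them
zero_counts = (toy[zero_as_missing] == 0).sum()

# Mark zeros as missing only in those columns
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

# Split first, then learn the medians from the training rows only
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()
train_medians = train[zero_as_missing].median()
train[zero_as_missing] = train[zero_as_missing].fillna(train_medians)
test[zero_as_missing] = test[zero_as_missing].fillna(train_medians)

print(zero_counts)
print(train_medians)
print(test)
```

The same idea applies to the real data: split with `train_test_split` first, then fill both parts with medians taken from the training part.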

In [5]:
# Replace zero values with NaN
df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.nan)

# Impute NaN values with median
df.fillna(df.median(), inplace=True)

# Check if there are any missing values remaining
print(df.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


In [6]:
# Split the dataset into features and target variable
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [8]:
# Build a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

In [9]:
# Predict on the testing set
y_pred = model.predict(X_test_scaled)

In [10]:

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
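# Note: y_pred holds hard 0/1 labels, so this value equals balanced accuracy;
# passing model.predict_proba(X_test_scaled)[:, 1] would give a threshold-independent ROC AUC
roc_auc = roc_auc_score(y_test, y_pred)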

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC Score:", roc_auc)


Accuracy: 0.7532467532467533
Precision: 0.6666666666666666
Recall: 0.6181818181818182
F1 Score: 0.6415094339622642
ROC AUC Score: 0.7232323232323232


The logistic regression model achieved an accuracy of approximately 75.32%, meaning it correctly classifies about three out of four patients in the test set. Precision is approximately 66.67%: when the model predicts that a patient has diabetes, it is right about two times out of three. Recall is approximately 61.82%, so the model identifies about 62% of the patients who actually have diabetes.

Furthermore, the F1 score, which is the harmonic mean of precision and recall, is approximately 64.15%. Lastly, the ROC AUC score, which measures the model's ability to discriminate between positive and negative instances, is approximately 72.32%. It is computed here from the hard 0/1 predictions rather than from predicted probabilities. A higher ROC AUC score indicates better discrimination.
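The reported figures can be reproduced by hand from a confusion matrix. The counts below are not printed by the notebook; they are inferred from the reported metrics and a 154-patient test set (20% of 768 rows), so treat them as an assumption. The sketch shows how each metric follows from the four counts.

```python
# Confusion-matrix counts inferred from the reported metrics
# (an inference, not an output of the notebook)
tp, fp, fn, tn = 34, 17, 21, 82

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single interior point,
# so the area under it reduces to balanced accuracy
auc_from_labels = (recall + specificity) / 2

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} auc_from_labels={auc_from_labels:.4f}")
```

The last line is why the ROC AUC above sits close to the accuracy: when `roc_auc_score` receives class labels instead of scores, it measures the average of recall and specificity at a single threshold.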

In conclusion, the logistic regression model performs reasonably well overall, but there is room for improvement, especially in recall: roughly 38% of the test-set patients who actually have diabetes are missed. Inaccurate predictions in healthcare can have serious consequences for patients, so the next steps focus on refining the model to reduce them.
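One option for catching more true cases, not used in this notebook, is to work with predicted probabilities instead of the default 0.5 cut-off. The following is a minimal sketch on synthetic stand-in data from `make_classification` (not the diabetes dataset). The threshold of 0.3 is an arbitrary illustration, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class for each test row
proba = clf.predict_proba(X_te)[:, 1]

# A threshold-independent ROC AUC uses the scores, not the 0/1 labels
auc_from_scores = roc_auc_score(y_te, proba)

# Lowering the cut-off flags more rows as positive, so recall cannot
# decrease (precision usually pays for it)
recall_default = recall_score(y_te, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_te, (proba >= 0.3).astype(int))

print("ROC AUC from probabilities:", auc_from_scores)
print("Recall at 0.5:", recall_default)
print("Recall at 0.3:", recall_lowered)
```

Whether the extra false positives are acceptable depends on the cost of a missed diagnosis relative to a false alarm, which is a domain decision rather than a modelling one.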

To improve the performance of the logistic regression model, we can apply feature engineering, exploring the existing features and creating new ones that may provide more predictive power. We can also apply feature selection, keeping only the features that have the greatest impact on the target variable, with techniques such as L1 regularization (Lasso) or recursive feature elimination (RFE). Another option is hyperparameter tuning with techniques such as grid search or randomized search.
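As an illustration of the feature selection option mentioned above, here is a minimal sketch of recursive feature elimination with scikit-learn's `RFE`. It runs on synthetic data from `make_classification`, and keeping four features is an arbitrary assumption. It is not a step the notebook itself performs.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# RFE repeatedly fits the model and drops the feature with the
# smallest absolute coefficient until 4 features remain
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X_demo, y_demo)

kept = [i for i, keep in enumerate(selector.support_) if keep]
print("Kept feature indices:", kept)
print("Elimination ranking (1 = kept):", selector.ranking_)
```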

To define the next steps to follow, we will identify the logistic regression coefficients to quantify the impact of the independent variables on the target variable.

In [11]:
# Access the coefficients
coefficients = model.coef_[0]

# Create a DataFrame to display feature names and coefficients
coef_df = pd.DataFrame({'Feature': X.columns, 'Coefficient': coefficients})
coef_df.sort_values(by='Coefficient', ascending=False, inplace=True)

# Display the DataFrame
print(coef_df)


                    Feature  Coefficient
1                   Glucose     1.102551
5                       BMI     0.688767
7                       Age     0.392364
0               Pregnancies     0.222844
6  DiabetesPedigreeFunction     0.203586
3             SkinThickness     0.068664
4                   Insulin    -0.138304
2             BloodPressure    -0.151521


The logistic regression coefficients show that Glucose, BMI, and Age are the most influential features, followed by the number of Pregnancies and family history of diabetes (DiabetesPedigreeFunction), whose coefficients are both close to 0.2. All five are positive, so higher values increase the predicted likelihood of diabetes. SkinThickness has a coefficient close to zero, and Insulin and BloodPressure have small negative coefficients, so these three contribute little to the model's predictions. Because the features were standardized, the coefficient magnitudes are directly comparable.
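A coefficient is easier to read once it is converted to an odds ratio. The sketch below exponentiates a few of the coefficients reported above. Because the model was fitted on standardized features, each ratio describes a one-standard-deviation increase in that feature rather than a one-unit increase.

```python
import math

# Coefficients copied from the table above (fitted on standardized features)
coefficients = {
    "Glucose": 1.102551,
    "BMI": 0.688767,
    "Age": 0.392364,
    "BloodPressure": -0.151521,
}

# exp(coefficient) is the multiplicative change in the odds of diabetes
# for a one-standard-deviation increase in the feature, others held fixed
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio per +1 SD = {ratio:.2f}")
```

A ratio above 1 means the odds rise with the feature, and a ratio below 1 means they fall.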

In [12]:
# Identify important features based on coefficients
important_features = coef_df[coef_df['Coefficient'].abs() > 0.2]['Feature'].tolist()
print("Important Features:", important_features)

# Create new features based on important features and their interactions
# Note: the nested loop visits every pair in both orders, so each product is
# stored twice (e.g. Glucose_BMI and BMI_Glucose hold identical values)
for feature in important_features:
    # Example: Creating polynomial features
    #df[feature + '_squared'] = df[feature] ** 2
    # Example: Creating interaction terms with other important features
    for other_feature in important_features:
        if other_feature != feature:
            df[feature + '_' + other_feature] = df[feature] * df[other_feature]

# Check if new features have been successfully added
print(df.head())

Important Features: ['Glucose', 'BMI', 'Age', 'Pregnancies', 'DiabetesPedigreeFunction']
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6    148.0           72.0           35.0    125.0  33.6   
1            1     85.0           66.0           29.0    125.0  26.6   
2            8    183.0           64.0           29.0    125.0  23.3   
3            1     89.0           66.0           23.0     94.0  28.1   
4            0    137.0           40.0           35.0    168.0  43.1   

   DiabetesPedigreeFunction  Age  Outcome  Glucose_BMI  ...  Age_Pregnancies  \
0                     0.627   50        1       4972.8  ...              300   
1                     0.351   31        0       2261.0  ...               31   
2                     0.672   32        1       4263.9  ...              256   
3                     0.167   21        0       2500.9  ...               21   
4                     2.288   33        1       5904.7  ...                0  
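The nested loop in the cell above builds each product in both orders, which yields 20 interaction columns for five features, half of them duplicates. A minimal sketch of a variant that creates each pair once uses `itertools.combinations`. The three-row frame below is made up for illustration, with column names borrowed from the dataset.

```python
from itertools import combinations

import pandas as pd

# Small made-up frame; column names mirror three of the important features
demo = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0],
    "BMI": [33.6, 26.6, 23.3],
    "Age": [50, 31, 32],
})
important = ["Glucose", "BMI", "Age"]

# combinations() yields each unordered pair once, so Glucose_BMI is
# created but its mirror image BMI_Glucose is not
for left, right in combinations(important, 2):
    demo[f"{left}_{right}"] = demo[left] * demo[right]

print(demo.columns.tolist())
```

Dropping the mirrored columns would change the fitted model slightly, since regularization spreads weight across duplicated columns, so the metrics reported in this notebook apply to the original 20-column version.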

In [13]:
# Split the dataset into features and target variable
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [15]:
# Build a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

In [16]:
# Predict on the testing set
y_pred = model.predict(X_test_scaled)

In [17]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC Score:", roc_auc)


Accuracy: 0.7532467532467533
Precision: 0.660377358490566
Recall: 0.6363636363636364
F1 Score: 0.6481481481481481
ROC AUC Score: 0.7272727272727272


After implementing feature engineering, the performance metrics changed only marginally compared to the initial model. Accuracy stayed at 75.32% and precision dipped slightly from 66.67% to 66.04%, while recall rose from 61.82% to 63.64%, the F1 score from 64.15% to 64.81%, and the ROC AUC score from 72.32% to 72.73%. This suggests that the interaction features did not significantly enhance the model's ability to predict diabetes. Alternative approaches may be necessary to achieve larger improvements.

In [22]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define the logistic regression model
logreg = LogisticRegression(max_iter=1000)

# Define the hyperparameter grid
param_grid = {
    'penalty': ['l2'],  # Regularization penalty
    'C': [0.001, 0.01, 0.1, 1, 10, 100]  # Inverse regularization strength
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(X_train_scaled, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Get the best cross-validation score
best_score = grid_search.best_score_
print("Best Cross-validation Score:", best_score)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Print evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC Score:", roc_auc)


Best Hyperparameters: {'C': 0.1, 'penalty': 'l2'}
Best Cross-validation Score: 0.7671331467413035
Accuracy: 0.7597402597402597
Precision: 0.6730769230769231
Recall: 0.6363636363636364
F1 Score: 0.6542056074766355
ROC AUC Score: 0.7323232323232323


After implementing grid search to tune hyperparameters, there was a slight improvement in the performance metrics compared to the previous model. The accuracy increased from 75.32% to 75.97%, and precision improved from 66.04% to 67.31%. However, recall remained the same at 63.64%, and the F1 score increased from 64.81% to 65.42%. The ROC AUC score also saw a slight improvement, rising from 72.73% to 73.23%.

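Overall, while grid search helped optimize the logistic regression model's hyperparameters, the improvements in performance metrics were modest: on the 154-patient test set, the accuracy gain corresponds to a single additional correct prediction.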
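The grid search above runs on data that was scaled once using the whole training set, so each cross-validation fold sees scaling statistics computed partly from its own validation rows. A minimal sketch of an alternative wraps the scaler and the model in a `Pipeline`, so scaling is refit inside every fold. The synthetic data and the choice of `scoring="recall"` are assumptions made for illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Step-name prefix routes the parameter to the logistic regression step
param_grid = {"logreg__C": [0.001, 0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="recall")
search.fit(X_tr, y_tr)

print("Best C:", search.best_params_["logreg__C"])
print("Cross-validated recall:", search.best_score_)
print("Test recall:", search.score(X_te, y_te))
```

Scoring on recall rather than accuracy steers the search toward the metric the earlier discussion identified as the weak point.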

In [23]:
import pickle

# Save the trained model to a pickle file
with open('diabetes_lr_model.pkl', 'wb') as file:
    pickle.dump(best_model, file)

# Save the scaler to a pickle file
with open('diabetes_scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)
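To use the saved artifacts later, both files have to be loaded and applied in the same order as during training: scale first, then predict. The sketch below shows the round trip on a tiny made-up model. The file names, the three-feature data, and all variable names are hypothetical stand-ins for the real model and scaler.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set standing in for the real features
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(50, 3))
y_fit = (X_fit[:, 0] > 0).astype(int)

demo_scaler = StandardScaler().fit(X_fit)
demo_model = LogisticRegression(max_iter=1000).fit(demo_scaler.transform(X_fit), y_fit)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "demo_model.pkl"    # hypothetical file names
    scaler_path = Path(tmp) / "demo_scaler.pkl"
    model_path.write_bytes(pickle.dumps(demo_model))
    scaler_path.write_bytes(pickle.dumps(demo_scaler))

    # Load both artifacts back from disk
    loaded_model = pickle.loads(model_path.read_bytes())
    loaded_scaler = pickle.loads(scaler_path.read_bytes())

# New rows must have the same columns, in the same order, as at training time
new_rows = np.array([[1.5, 0.0, -0.2], [-1.5, 0.3, 0.1]])
predictions = loaded_model.predict(loaded_scaler.transform(new_rows))
print(predictions)
```

For the model saved in this notebook, that means new patient data needs the zero-handling and the interaction columns applied before scaling, because the scaler was fitted on the engineered feature set. Pickle files should only be loaded from trusted sources.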