# Predicting the Risk of Diabetes in Pregnant Women

Authors: 
- Angela Yang
- Tina Chen
- Tom Cui 
- Yiyang Zhang

### Introduction: ###

Diabetes is a chronic health condition in which blood glucose cannot enter cells because there is too little insulin to facilitate this exchange (NIDDK, 2016). Because diabetes is one of the leading causes of death in North America (CDC, 2022), identifying the most informative health indicators is increasingly important for diagnosing it correctly. This study uses a dataset from Kaggle, originally from the National Institute of Diabetes and Digestive and Kidney Diseases, that records a variety of health indicators for each patient (e.g. blood pressure, Body Mass Index/BMI, glucose, insulin) and whether that patient was diagnosed with diabetes.

For our project, we will try to answer this predictive question: given the health profile of a patient, how accurately can our classification model predict potential diabetes cases based on the selected explanatory variables?

In [2]:
library(tidyverse)
library(repr)
library(tidymodels)
library(gridExtra)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

In [6]:
diabetes_data <- read_csv("data/diabetes.csv")
head(diabetes_data)

Rows: 768 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, D...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0


In [5]:
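# Treat Outcome as a categorical label and keep only women with at least one pregnancy.
diabetes_data <- diabetes_data |>
    mutate(Outcome = as_factor(Outcome)) |>
    filter(Pregnancies != 0)

# Recode 0 as NA in the measurement columns, where a zero most likely marks a missing value.
# Note: the cells below still use diabetes_data, not diabetes_data_processed.
diabetes_data_processed <- diabetes_data |>
    mutate(
        Glucose = ifelse(Glucose == 0, NA, Glucose),
        BloodPressure = ifelse(BloodPressure == 0, NA, BloodPressure),
        SkinThickness = ifelse(SkinThickness == 0, NA, SkinThickness),
        Insulin = ifelse(Insulin == 0, NA, Insulin),
        BMI = ifelse(BMI == 0, NA, BMI),
        DiabetesPedigreeFunction = ifelse(DiabetesPedigreeFunction == 0, NA, DiabetesPedigreeFunction)
    )
head(diabetes_data_processed)

# Number of rows remaining after the pregnancy filter.
nrow(diabetes_data)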

Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
6,148,72,35.0,,33.6,0.627,50,1
1,85,66,29.0,,26.6,0.351,31,0
8,183,64,,,23.3,0.672,32,1
1,89,66,23.0,94.0,28.1,0.167,21,0
5,116,74,,,25.6,0.201,30,0
3,78,50,32.0,88.0,31.0,0.248,26,1


In [None]:
set.seed(1400)

diabetes_split <- initial_split(diabetes_data, prop = 0.76, strata = Outcome)
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)

In [None]:
diabetes_train |>
    group_by(Outcome) |>
    summarize(count = n())

In [None]:
options(repr.plot.width = 15, repr.plot.height = 8)

BMIGlucose_plot <- ggplot(diabetes_train, aes(x = BMI, y = Glucose, color = Outcome)) +
    geom_point(alpha = 0.6) +
    labs(x = "BMI (kg/m^2)",
         y = "Glucose Levels (mg/dL)",
         color = "Diagnosis") +
    ggtitle("BMI vs. Glucose plot") +
    theme(text = element_text(size = 15))

BMIBloodpressure_plot <- ggplot(diabetes_train, aes(x = BMI, y = BloodPressure, color = Outcome)) +
    geom_point(alpha = 0.6) +
    labs(x = "BMI (kg/m^2)",
         y = "Blood Pressure (mmHg)",
         color = "Diagnosis") +
    ggtitle("BMI vs. blood pressure plot") +
    theme(text = element_text(size = 15))

BloodpressureGlucose_plot <- ggplot(diabetes_train, aes(x = BloodPressure, y = Glucose, color = Outcome)) +
    geom_point(alpha = 0.6) +
    labs(x = "Blood Pressure (mmHg)",
         y = "Glucose (mg/dL)",
         color = "Diagnosis") +
    ggtitle("Blood pressure vs. Glucose plot") +
    theme(text = element_text(size = 15))

grid.arrange(BMIGlucose_plot, BMIBloodpressure_plot, BloodpressureGlucose_plot,
             layout_matrix = rbind(c(1, 2, 3),
                                   c(1, 2, 3)))

In [None]:
diabetes_recipe <- recipe(Outcome ~ BMI + Glucose, data = diabetes_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
diabetes_recipe
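
The recipe applies `step_scale()` before `step_center()`, which is the reverse of the usual "center, then scale" order. The following base-R sketch, using hypothetical toy glucose values rather than the dataset, suggests why the order should not matter here: dividing by the standard deviation and then subtracting the mean of the scaled values gives the same result as a conventional z-score.

```r
# Hypothetical toy values, not taken from the dataset.
glucose <- c(85, 89, 116, 148, 183)

# What step_scale() does: divide by the standard deviation.
scaled <- glucose / sd(glucose)
# What step_center() then does: subtract the mean of the scaled values.
scaled_then_centered <- scaled - mean(scaled)

# Conventional z-score: center first, then scale.
z_score <- (glucose - mean(glucose)) / sd(glucose)

# Both orders give the same standardized values.
all.equal(scaled_then_centered, z_score)
```

Standardizing matters for K-nearest neighbours because BMI and glucose are on different numeric scales, and an unscaled distance would be dominated by the variable with the larger spread.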

In [None]:
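# KNN specification with the number of neighbours left open for tuning.
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Candidate values of K: 1 to 50.
k_vals <- tibble(neighbors = seq(from = 1, to = 50, by = 1))

# 5-fold cross-validation on the training set, stratified by Outcome.
diabetes_vfold <- vfold_cv(data = diabetes_train, v = 5, strata = Outcome)

# Fit the workflow for every K on every fold and collect the metrics.
knn_results <- workflow() |>
    add_recipe(diabetes_recipe) |>
    add_model(knn_tune) |>
    tune_grid(resamples = diabetes_vfold, grid = k_vals) |>
    collect_metrics()

# Keep only the accuracy estimates.
accuracies <- knn_results |>
    filter(.metric == "accuracy")

# Plot the accuracy estimate against K.
accuracies_vs_k_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbours", y = "Accuracy Estimate")
accuracies_vs_k_plot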

As the plot shows, we use K = 25 because it gives the highest cross-validation accuracy estimate. This K is used to build our final classification model.
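
Reading the best K off a plot can be error-prone when several values are close, so it may help to select it programmatically as well. The sketch below uses a hypothetical `accuracies_toy` data frame with made-up accuracy values; the real values would come from the `accuracies` tibble above.

```r
# Hypothetical accuracy estimates; the real ones come from collect_metrics().
accuracies_toy <- data.frame(
  neighbors = c(1, 5, 15, 25, 35),
  mean      = c(0.66, 0.71, 0.73, 0.75, 0.74)
)

# Row with the highest cross-validated accuracy estimate
# (which.max() returns the first row if there is a tie).
best_row <- accuracies_toy[which.max(accuracies_toy$mean), ]
best_k <- best_row$neighbors
best_k
```

With the tidyverse loaded, `slice_max(accuracies, mean, n = 1)` should give the equivalent row, although by default it keeps all tied rows rather than only the first.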

In [None]:
diabetes_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 25) |>
    set_engine("kknn") |>
    set_mode("classification")
diabetes_spec

diabetes_fit <- workflow() |>
    add_recipe(diabetes_recipe) |>
    add_model(diabetes_spec) |>
    fit(data = diabetes_train)
diabetes_fit

In [None]:
diabetes_test_predictions <- predict(diabetes_fit, diabetes_test) |>
    bind_cols(diabetes_test)
diabetes_test_predictions

In [None]:
diabetes_test_predictions |>
    metrics(truth = Outcome, estimate = .pred_class) |>
    filter(.metric == "accuracy")

The accuracy of our classification model on the testing dataset is 72.3%. Future ways this classification model can be improved include...
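
Accuracy alone does not show which kind of mistake the classifier makes, and in a diagnostic setting a missed diabetes case is usually more costly than a false alarm. The following base-R sketch uses hypothetical labels for ten observations (not the actual test predictions) to show how a confusion matrix, accuracy, and recall relate to one another.

```r
# Hypothetical labels for ten test observations (1 = diabetes, 0 = no diabetes).
truth     <- factor(c(1, 0, 0, 1, 1, 0, 0, 1, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are true labels.
confusion <- table(Predicted = predicted, Truth = truth)
confusion

# Accuracy: share of all observations classified correctly.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Recall: share of true diabetes cases that the classifier caught.
recall <- confusion["1", "1"] / sum(confusion[, "1"])

accuracy
recall
```

For the real predictions, yardstick's `conf_mat(truth = Outcome, estimate = .pred_class)` should produce the corresponding table from `diabetes_test_predictions`.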

Next, we apply the classifier to a hypothetical patient with a BMI of 32.2 and a glucose level of 150 mg/dL. The following code predicts whether this person has diabetes.

In [None]:
new_obs <- tibble(BMI = 32.2, Glucose = 150)
new_obs_prediction <- predict(diabetes_fit, new_obs)
new_obs_prediction

The classification model returns "1", meaning that this person is predicted to have diabetes. This is in line with our expectations, as a person with a high BMI (32.2 is above the obesity threshold of 30) and high glucose levels would be at high risk of diabetes...
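
To make the prediction step less of a black box, here is a minimal base-R sketch of what an unweighted K-nearest neighbours classifier does for a single new observation. The training points are hypothetical, `k = 3` is chosen only to suit the toy data (the notebook uses 25), and Euclidean distance is assumed; this is an illustration of the idea, not the kknn implementation.

```r
# Hypothetical training points, not taken from the dataset.
train <- data.frame(
  BMI     = c(22.0, 25.5, 28.0, 33.0, 35.5, 31.0, 24.0),
  Glucose = c(90, 100, 110, 160, 170, 145, 95),
  Outcome = factor(c(0, 0, 0, 1, 1, 1, 0))
)
new_point <- c(BMI = 32.2, Glucose = 150)
k <- 3  # small k for the toy data; the notebook uses 25

# Standardize with the training means and standard deviations.
predictors <- c("BMI", "Glucose")
centers <- colMeans(train[, predictors])
scales <- apply(train[, predictors], 2, sd)
train_std <- scale(train[, predictors], center = centers, scale = scales)
new_std <- (new_point - centers) / scales

# Euclidean distance from the new point to every training point.
distances <- sqrt(rowSums(sweep(train_std, 2, new_std)^2))

# Unweighted majority vote among the k nearest neighbours.
nearest <- order(distances)[seq_len(k)]
votes <- table(train$Outcome[nearest])
predicted_class <- names(votes)[which.max(votes)]
predicted_class
```

The new point is standardized with the training statistics rather than its own, which mirrors how a fitted recipe is applied to new data inside a workflow.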

### Methods: ###