**Predicting Presence of Diabetes with Age, BMI, and Blood Sugar Level**

**Introduction:**


Diabetes is a chronic disease that affects the body's ability to turn food into energy. Normally, after a meal, glucose is released into the bloodstream, prompting the pancreas to release insulin. Insulin signals the body's cells to take up the newly available blood sugar and use it for energy. People with diabetes, however, either cannot produce insulin or cannot use it effectively, which leads to major health problems. Unfortunately, many people live with this condition.

The risk of developing diabetes increases with age (DECODE-DECODA Study Group & International Diabetes Epidemiology Group, 2003; Mordarska & Godziejewska-Zawada, 2017) due to changes in the human body such as altered metabolism and insulin sensitivity. Higher BMI is often linked with an increased risk of type 2 diabetes (Abdullah et al., 2010; DECODE-DECODA Study Group & International Diabetes Epidemiology Group, 2003), as excess body fat can increase insulin resistance. Blood glucose level is also a direct indicator of diabetes: elevated levels may suggest problems with insulin use or production in the body. Each of these three variables therefore has a plausible role in determining whether an individual has diabetes. This project aims to answer the question: can diabetes in an individual be predicted accurately using their age, body mass index (BMI), and blood glucose level?

The diabetes prediction data set (Mustafa, 2023) contains medical information about patients' diabetes status (positive or negative) and other pertinent health information. There are 9 columns (containing categorical and numerical data): age, gender, BMI, hypertension, heart disease, smoking history, HbA1c level, blood glucose level, and diabetes status. In total, there are 100,000 participants and thus 100,000 rows available.

**Methods & Results**

In [1]:
#installing themis package
install.packages("themis")


In [2]:
#loading packages
library(tidyverse)
library(tidymodels)
library(dplyr)
library(themis)


The random seed is set to ensure that the analysis is reproducible. The dataset is then loaded directly from an online source. Because this analysis is a classification problem, the 'diabetes' variable must be converted to a factor. This variable originally contains the values 0 and 1, which are recoded to 'non-diabetic' and 'diabetic' respectively for better readability.

After preparing the data, it is split into two subsets, stratified by diabetes status so that both subsets keep the same class proportions: 75% of the data is used for training the model, and the remaining 25% is used to test the model's prediction accuracy.

In [None]:
#set seed, loading dataset, and splitting into train and test sets

set.seed(1)

url <- "https://raw.githubusercontent.com/danialtaj0/Dataset/main/diabetes_prediction_dataset.csv"
diabetes_data <- read_csv(url) |>
    mutate(diabetes = as_factor(diabetes)) |>
    mutate(diabetes = fct_recode(diabetes, "non-diabetic" = "0", "diabetic" = "1"))
diabetes_split <- initial_split(diabetes_data, prop = 0.75, strata = diabetes)
diabetes_train <- training(diabetes_split) 
diabetes_test <- testing(diabetes_split)
cat("Table 1: Diabetes Data Set")
head(diabetes_train)

Table 2 displays the number of participants categorized as diabetic versus non-diabetic in the training set, for the purpose of evaluating class imbalance. An imbalanced dataset could result in a predictive model that classifies most patients as non-diabetic, leading to poor performance when identifying the minority class (diabetic). As this dataset contains a highly disproportionate number of non-diabetic individuals, it will be necessary to downsample this class before training the model.

The mean values of the selected predictors (BMI, age, and blood glucose level) in the training data are tabulated in Table 3. These averages summarize the central tendency of the numerical variables and give a sense of the scale of each predictor.

The training data is also checked for missing values in the selected columns (Table 4), since missing values would otherwise distort the summary statistics and the distance calculations used later in the analysis.

In [None]:
class_distribution <- diabetes_train |>
  group_by(diabetes) |>
  summarise(Count = n())
cat("Table 2: Distribution of Diabetic versus Non Diabetic Individuals")
class_distribution

predictor_means <- diabetes_train |>
  summarise(
    mean_bmi = mean(bmi, na.rm = TRUE),
    mean_age = mean(age, na.rm = TRUE),
    mean_blood_glucose_level = mean(blood_glucose_level, na.rm = TRUE)
  )
cat("Table 3: Means of Predictor Variables")
predictor_means

# Filtering rows where there is missing data in any of the specified columns
rows_with_missing_data <- diabetes_train |>
  filter(is.na(bmi) | is.na(age) | is.na(blood_glucose_level) | is.na(diabetes))

# Count the number of rows with missing data
num_rows_with_missing_data <- nrow(rows_with_missing_data)

# Print the count
missing_data <- tibble(n_rows_missing_data = num_rows_with_missing_data)
cat("Table 4: Number of Rows with Missing Data")
missing_data

The histograms showing the distributions of the three predictor variables (Figures 1 to 3) demonstrate that they are not normally distributed and that their scales differ greatly from one another. Because the classifier relies on distances between data points, it will be necessary to scale and centre the predictors so that no single variable dominates the predictions. Figures 1 to 3 also show that diabetes is more common in older people and those with high blood glucose levels, but appears to show little to no relationship with BMI. This suggests that some of our variables may predict diabetes more effectively than others.
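For reference, scaling and centring a predictor amounts to the usual standardization. In the sketch below, $\bar{x}$ and $s_x$ denote the training-set mean and standard deviation of a predictor $x$; this notation is introduced only for illustration and does not appear elsewhere in the analysis.

$$
z = \frac{x - \bar{x}}{s_x}
$$

After this transform each predictor has mean 0 and standard deviation 1 on the training data, so whichever predictor has the largest raw numeric range no longer dominates the Euclidean distances that a nearest-neighbour classifier relies on.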

The scatter plot (Figure 4) suggests a region for which the model is expected to produce "diabetic" predictions, as there is a cluster of diabetic data points with high age and high BMI. It also demonstrates that there are far more non-diabetic than diabetic data points, which means that the non-diabetic data points will need to be downsampled for the algorithm to identify diabetic cases reliably.




In [None]:
#making histograms and scatterplot
options(repr.plot.height = 5, repr.plot.width = 10)

bmi_dist <- ggplot(diabetes_train, aes(x = bmi, fill = diabetes)) + 
    geom_histogram() + 
    theme(text = element_text(size = 12)) + 
    xlab("BMI") + 
    ggtitle("Figure 1: Distribution of BMI") 

blood_glucose_dist <- ggplot(diabetes_train, aes(x = blood_glucose_level, fill = diabetes)) + 
    geom_histogram() + 
    theme(text = element_text(size = 12)) + 
    xlab("Blood Glucose Level") +
    ggtitle("Figure 2: Distribution of Blood Glucose Level")

age_dist <- ggplot(diabetes_train, aes(x = age, fill = diabetes)) + 
    geom_histogram() + 
    theme(text = element_text(size = 12)) + 
    xlab("Age") +
    ggtitle("Figure 3: Distribution of Age")

age_bmi <- ggplot(diabetes_train, aes(x = age, y = bmi, color = diabetes, shape = diabetes)) + 
    geom_point(alpha = 0.4) + 
    theme(text = element_text(size = 12)) + 
    xlab("Age") + 
    ylab("BMI") + 
    ggtitle("Figure 4: Presence of Diabetes in Relation to Age and BMI")

bmi_dist
blood_glucose_dist
age_dist
age_bmi

The training data is first downsampled to balance the class distribution, so that the model is not biased towards the more frequently occurring non-diabetic class. A K-nearest neighbors (KNN) model is then configured with a 'rectangular' weight function and tuned using 5-fold cross-validation on the balanced dataset. This involves dividing the dataset into five parts and using each part in turn as a validation set while the other four are used for training, which helps assess the model's performance across different data subsets. Twenty k values are tested, from 1 to 96 in steps of 5 (the sequence starts at 1, so it stops short of 100). The optimal k is the value that yields the highest average validation accuracy across all folds, which balances bias and variance and tunes the model for generalizable predictions.
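The fold rotation described above can be sketched in base R on a toy example. This is only an illustration under stated assumptions: `n_rows`, `fold_id`, and `fold_sizes` are hypothetical names, the toy size of 20 rows is arbitrary, and the folds here are not stratified by class as they are in the actual analysis.

```r
# Toy illustration of 5-fold cross-validation bookkeeping (base R only).
# n_rows is a hypothetical size, not the size of the real training set.
set.seed(1)
n_rows <- 20
n_folds <- 5

# Randomly assign each row to one of the five folds (4 rows per fold here).
fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))

# Each fold serves once as the validation set; the other four are used for training.
fold_sizes <- sapply(seq_len(n_folds), function(f) {
  validation_rows <- which(fold_id == f)
  training_rows <- which(fold_id != f)
  c(training = length(training_rows), validation = length(validation_rows))
})
fold_sizes

# Candidate k values, built the same way as the grid used for tuning.
k_grid <- seq(from = 1, to = 100, by = 5)
k_grid
```

Every row appears in exactly one validation set, so each candidate k is scored on all of the data once, and the five scores are averaged.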

In [None]:
#cross validation to find the optimal k value
set.seed(1)

#downsampling the training set
downsample_recipe <- recipe(diabetes ~ bmi + age + blood_glucose_level, data = diabetes_train) |>
    step_downsample(diabetes, under_ratio = 1, skip = FALSE) |> 
    prep() 
downsampled_diabetes_train <- bake(downsample_recipe, diabetes_train) 

#finding optimal k value 
diabetes_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
    set_engine("kknn") |> 
    set_mode("classification") 

diabetes_recipe <- recipe(diabetes ~ bmi + age + blood_glucose_level, data = downsampled_diabetes_train) |>
    step_scale(all_predictors()) |> 
    step_center(all_predictors()) 

diabetes_vfold <- vfold_cv(downsampled_diabetes_train, v = 5, strata = diabetes)
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

cv_results <- workflow() |> 
    add_model(diabetes_tune) |> 
    add_recipe(diabetes_recipe) |> 
    tune_grid(resamples = diabetes_vfold, grid = k_vals) |> 
    collect_metrics() |> 
    filter(.metric == "accuracy") |> 
    select(neighbors, mean) 

#optimal k value

best_k <- cv_results |> 
    arrange(desc(mean)) |> 
    head(1) |> 
    pull(neighbors)
best_k

After identifying the best k value that maximizes accuracy, this k value is used in the `nearest_neighbor` function to set up the KNN classifier. Using the previously specified configuration, the model is assembled within a workflow, combining both the model specifications and the data preprocessing recipe. The model is then trained on the downsampled training dataset.  

Next, the model's performance is evaluated on the test dataset. Predictions are made for the `diabetes_test` dataset, and these predictions are combined with the actual test data to facilitate further analysis. Accuracy, precision, and recall of the model are calculated and tabulated in Table 5 to assess the effectiveness of the model. These metrics provide a comprehensive understanding of how well the model predicts diabetes cases.  

Additionally, a confusion matrix is generated from the predictions (Table 6). This matrix is useful because it tabulates the true positives, true negatives, false positives, and false negatives, showing which kinds of errors the model makes. This helps in further evaluating the model's diagnostic ability in differentiating between diabetic and non-diabetic cases.
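For reference, the three metrics follow from the four cells of the confusion matrix. The formulas below are the standard definitions, with "diabetic" taken as the positive class and $TP$, $TN$, $FP$, $FN$ denoting the counts of true positives, true negatives, false positives, and false negatives.

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}
$$

Precision answers "of those predicted diabetic, how many really are?", while recall answers "of those who really are diabetic, how many were found?".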

In [None]:
set.seed(1)
#classifier 
diabetes_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |> 
    set_engine("kknn") |> 
    set_mode("classification") 
diabetes_classifier <- workflow() |> 
    add_recipe(diabetes_recipe) |> 
    add_model(diabetes_spec) |> 
    fit(data = downsampled_diabetes_train) 
#applying classifier to test set
predictions <- predict(diabetes_classifier, diabetes_test) |> 
    bind_cols(diabetes_test) 

#accuracy (result names end in _tbl so they do not mask yardstick's metric functions)
accuracy_tbl <- metrics(predictions, truth = diabetes, estimate = .pred_class) |> 
    filter(.metric == "accuracy") |> 
    select(.metric, .estimate) |> 
    pivot_wider(names_from = .metric, values_from = .estimate) 

#recall ("diabetic" is the second factor level, so it is treated as the positive class)
recall_tbl <- predictions |> 
    recall(truth = diabetes, estimate = .pred_class, event_level = "second") |> 
    select(.metric, .estimate) |> 
    pivot_wider(names_from = .metric, values_from = .estimate) 

#precision
precision_tbl <- predictions |> 
    precision(truth = diabetes, estimate = .pred_class, event_level = "second") |> 
    select(.metric, .estimate) |> 
    pivot_wider(names_from = .metric, values_from = .estimate) 

metrics_summary <- bind_cols(accuracy_tbl, recall_tbl, precision_tbl) 
cat("Table 5: Classifier Metrics")
metrics_summary
#confusion matrix (named so that it does not mask the conf_mat() function)
diabetes_conf_mat <- conf_mat(predictions, truth = diabetes, estimate = .pred_class)
cat("Table 6: Classifier Confusion Matrix")
diabetes_conf_mat
#bmi alone: accuracy 0.69, recall 0.62
#age alone: accuracy 0.88, recall 0.13
#blood glucose level alone: accuracy 0.95, recall 0.39

The classifier's predictions, as well as the actual class of each data point (diabetic or non-diabetic), are plotted in Figures 5 to 10. Each pair of predictor variables is plotted to best visualize the clusters of diabetic and non-diabetic data points.

In [None]:
#analysis visualization 
age_bmi_real <- ggplot(predictions, aes(x = age, y = bmi, color = diabetes, shape = diabetes)) + 
    geom_point(alpha = 0.4) + 
    theme(text = element_text(size = 12)) + 
    xlab("Age") + 
    ylab("BMI") + 
    ggtitle("Figure 5: Actual Presence of Diabetes in Relation to Age and BMI")
age_bmi_preds <- ggplot(predictions, aes(x = age, y = bmi, color = .pred_class, shape = .pred_class)) + 
    geom_point(alpha = 0.4) + 
    theme(text = element_text(size = 12)) + 
    xlab("Age") + 
    ylab("BMI") + 
    ggtitle("Figure 6: Predicted Presence of Diabetes in Relation to Age and BMI")
age_bmi_real
age_bmi_preds
bgl_bmi_real <- ggplot(predictions, aes(x = blood_glucose_level, y = bmi, color = diabetes, shape = diabetes)) + 
    geom_point(alpha = 0.4) + 
    theme(text = element_text(size = 12)) + 
    xlab("Blood Glucose Level") + 
    ylab("BMI") + 
    ggtitle("Figure 7: Actual Presence of Diabetes in Relation to Blood Glucose Level and BMI")
bgl_bmi_preds <- ggplot(predictions, aes(x = blood_glucose_level, y = bmi, color = .pred_class, shape = .pred_class)) + 
    geom_point(alpha = 0.4) + 
    theme(text = element_text(size = 12)) + 
    xlab("Blood Glucose Level") + 
    ylab("BMI") + 
    ggtitle("Figure 8: Predicted Presence of Diabetes in Relation to Blood Glucose Level and BMI")
bgl_bmi_real
bgl_bmi_preds
bgl_age_real <- ggplot(predictions, aes(x = age, y = blood_glucose_level, color = diabetes, shape = diabetes)) + 
    geom_point(alpha = 0.4) + 
    theme(text = element_text(size = 12)) + 
    xlab("Age") + 
    ylab("Blood Glucose Level") + 
    ggtitle("Figure 9: Actual Presence of Diabetes in Relation to Age and Blood Glucose Level")
bgl_age_preds <- ggplot(predictions, aes(x = age, y = blood_glucose_level, color = .pred_class, shape = .pred_class)) + 
    geom_point(alpha = 0.4) + 
    theme(text = element_text(size = 12)) + 
    xlab("Age") + 
    ylab("Blood Glucose Level") + 
    ggtitle("Figure 10: Predicted Presence of Diabetes in Relation to Age and Blood Glucose Level")
bgl_age_real
bgl_age_preds

**Discussion**

**Summary of Findings**

A KNN classifier was created to classify patients as diabetic or non-diabetic using BMI, age, and blood glucose level as predictors. After tuning with cross-validation over k values between 1 and 100, the optimal k value (the one that maximized accuracy) was found to be 46. Next, a workflow using this k value was created and trained on the downsampled training dataset. To assess how well the model predicts diabetes cases, accuracy, precision, and recall were computed on the test set and a confusion matrix was generated. The following was found:

Accuracy - 79%, Precision - 28%, Recall - 86%.  

Because diabetes status in this dataset is ‘non-diabetic’ about 91% of the time, a classifier that always predicted ‘non-diabetic’ would already reach roughly 91% accuracy. An accuracy of 79% is therefore below this baseline and is not sufficient on its own. However, it is also important to consider precision and recall.
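As a rough consistency check, the three reported metrics can be related through the class prevalence. The sketch below assumes a diabetic prevalence of $\pi \approx 0.09$ in the test set (taken from the 91% non-diabetic figure above) and uses the rounded precision $P = 0.28$ and recall $R = 0.86$, so the result is approximate. The symbols $\pi$, $P$, and $R$ are introduced only for this illustration, and all quantities are fractions of the test set.

$$
\begin{aligned}
TP &= R\,\pi = 0.86 \times 0.09 \approx 0.0774 \\
FP &= TP \cdot \frac{1 - P}{P} = 0.0774 \times \frac{0.72}{0.28} \approx 0.199 \\
TN &= (1 - \pi) - FP \approx 0.91 - 0.199 = 0.711 \\
\text{accuracy} &= TP + TN \approx 0.0774 + 0.711 \approx 0.79
\end{aligned}
$$

Under these assumptions the reported metrics agree with one another, and false positives (about 0.199 of all test rows) account for nearly all of the model's errors.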

The low precision of the model (28%) can be observed in Figures 5 to 10. As demonstrated by these figures, the model tends to make a large number of false positive predictions, classifying data points near the existing clusters of diabetic data points as diabetic when they are in fact non-diabetic. In effect, it overestimates the size of the regions which should be classified as diabetic. This is likely due to the fact that the non-diabetic data points in the training data were downsampled such that there was an equal proportion of both outcomes, resulting in diabetic data points further from the cluster having a disproportionate impact on the classifier’s predictions. A possible method of mitigating this may be to retain more of the non-diabetic data points in the training set such that the ratio of non-diabetic to diabetic points is greater than 1:1. This may decrease the number of false positives by reducing the effect of outlier diabetic data points on predictions. 
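The idea of keeping more of the majority class can be sketched with the `under_ratio` argument of `step_downsample()`. This is a minimal illustration on made-up data, not a result from this analysis: `toy_train`, `toy_recipe`, and `toy_balanced` are hypothetical names, the single predictor is random, and the ratio of 3 is an arbitrary choice.

```r
library(recipes)
library(themis)
library(tibble)

set.seed(1)

# Hypothetical toy data: 90 non-diabetic and 10 diabetic rows with one made-up predictor.
toy_train <- tibble(
  diabetes = factor(rep(c("non-diabetic", "diabetic"), times = c(90, 10)),
                    levels = c("non-diabetic", "diabetic")),
  age = runif(100, min = 20, max = 80)
)

# under_ratio = 3 asks for at most about three majority rows per minority row.
toy_recipe <- recipe(diabetes ~ age, data = toy_train) |>
  step_downsample(diabetes, under_ratio = 3) |>
  prep()

# bake(new_data = NULL) returns the processed training data, including the downsampling step.
toy_balanced <- bake(toy_recipe, new_data = NULL)
table(toy_balanced$diabetes)
```

All diabetic rows are kept and only the non-diabetic rows are thinned. Whether a ratio above 1 actually improves precision without costing too much recall would need to be checked on the real data.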

Although a low precision is not ideal, being falsely flagged as diabetic likely will not have significantly negative effects on a person's health. In contrast, a high recall is much more important, as those with diabetes need to be identified and treated as soon as possible. A recall of 86% seems relatively good, but for a problem of this kind it is not very impressive: roughly 14% of diabetic individuals are still classified as non-diabetic (false negatives). This has serious consequences, as diabetics must have their illness identified in a timely manner to receive proper treatment.

In accordance with the histograms and scatterplots generated, BMI appeared to be a weak predictor of diabetes (little to no visible relationship), while age and blood glucose level showed a positive association with diabetes.



**Comparison with Expectations**

Although we were able to predict patients' diabetes status with some certainty, we expected to have a much higher accuracy. However, our results make sense after reflection on our methods. We could only run 5-fold cross-validation due to computational limitations. With 10 folds, the accuracy estimates may have had a lower standard error, which could have led to a different choice of k. Additionally, BMI was not a good predictor of diabetes, leaving us with only two good predictors: age and blood glucose level. Having three strong predictors would have been ideal and could have led to more accurate results.


**Potential Impact of the Findings**

The findings from this study have the potential to significantly influence various aspects of healthcare and individual wellness. Firstly, they can improve screening and early diagnosis. By accurately identifying individuals at a higher risk of diabetes based on age, BMI, and blood glucose levels, healthcare systems can implement targeted screening strategies. Early diagnosis enables timely interventions, which may slow the progression of diabetes and reduce associated complications. Additionally, these insights allow physicians to offer personalized treatment approaches. Understanding how these key variables correlate with diabetes risk aids in the development of tailored treatment plans, enhancing their effectiveness and improving patient outcomes. Furthermore, the research informs preventive healthcare measures. High-risk individuals, for instance, can receive counseling on lifestyle modifications such as diet and exercise to mitigate their risk or delay the onset of diabetes.  

Overall, this research underscores the critical role of personalized medicine and preventive care in managing and combating diabetes, with profound implications for both individual patient care and broader public health strategies.



**Future Questions**

Building on this study's use of age, BMI, and blood glucose level as diabetes predictors, further investigations might examine the interactions between genetic predispositions and these variables, evaluating how they jointly affect the likelihood of developing diabetes. Longitudinal research tracking changes in these metrics over time could give a deeper understanding of their dynamic link with the course of diabetes. It is also important to investigate how demographic factors, such as socioeconomic status and ethnicity, affect the predictive power of these variables. Furthermore, the integration of technologies such as artificial intelligence (AI) and continuous glucose monitoring has the potential to transform diabetes management and early diagnosis, resulting in customized treatment regimens and preventive measures. This could both improve individual patient care and inform national health policies that aim to lower the worldwide burden of diabetes.


**References**

Abdullah, A., Peeters, A., de Courten, M., & Stoelwinder, J. (2010). The magnitude of association between overweight and obesity and the risk of diabetes: A meta-analysis of prospective cohort studies. *Diabetes Research and Clinical Practice, 89*(3), 309-319. https://doi.org/10.1016/j.diabres.2010.04.012  

Mordarska, K., & Godziejewska-Zawada, M. (2017). Diabetes in the elderly. *Przegląd Menopauzalny, 16*(2), 38-43. https://doi.org/10.5114/pm.2017.68589  

Mustafa, M. (2023). *Diabetes prediction dataset*. Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

The DECODE-DECODA Study Group, on behalf of the European Diabetes Epidemiology Group, & the International Diabetes Epidemiology Group. (2003). Age, body mass index and Type 2 diabetes: Associations modified by ethnicity. *Diabetologia, 46*(8), 1063-1070. https://doi.org/10.1007/s00125-003-1158-9