# Predicting Diabetes Status

## Introduction

Diabetes is a chronic metabolic disease in which the body cannot produce enough insulin, or cannot use it effectively. Insulin is the hormone that allows sugar to be used for energy. Untreated diabetes can damage various organs, and approximately [422 million](https://www.who.int/health-topics/diabetes#tab=tab_1) people globally have diabetes. It is therefore imperative that diabetes be diagnosed as early as possible to minimise potential complications.

This project aims to create a feasible model for predicting diabetes. The literature reports that [HbA1c levels](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4933534/), [age](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9843502/#:~:text=Advanced%20age%20is%20a%20major%20risk%20factor%20for%20diabetes%20and%20prediabetes.&text=Therefore%2C%20the%20elderly%20has%20a,%2C%20retinal%2C%20and%20renal%20systems.), and [hypertension and heart disease statuses](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5953551/) can be related to diabetes, so the question to be answered is: can diabetes status be predicted from these four variables?

To answer this question, the project uses the [Diabetes prediction dataset](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset) created by Mohammed Mustafa on Kaggle. It contains patient records organised into 8 variables (age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level) and the outcome to be predicted (diabetes status).

## Methods and Results

### Methods
1. Data Cleaning: Before any analysis is conducted, the dataset will be cleaned to handle missing values, outliers, and any inconsistencies.
2. Variable Selection: Not all variables/columns in a dataset are necessarily relevant or useful for prediction. We will primarily focus on HbA1c level, age, hypertension status, and heart disease status, which various studies have shown to be influential factors in the onset of diabetes. However, other variables may be included or excluded based on their correlation with the outcome (diabetes status) and their importance in the model.
3. Data Splitting: The dataset will be divided into training and testing sets. The training set will be used to train our predictive model, while the testing set will be used to evaluate the model's performance.
4. Model Building and Evaluation: Different models will be considered, including logistic regression, decision trees, and random forests. Each model's performance will be assessed using appropriate metrics, such as accuracy, recall, and precision.
5. Results Visualization: The ggplot2 package will be central for this. Histograms (geom_histogram()) and scatter plots (geom_point()) will be plotted to visually inspect variable distributions and relationships. Feature importance, when relevant, can be visualised using a bar graph (geom_bar()).
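
As a rough illustration of step 5, the sketch below draws one histogram and one scatter plot with ggplot2. The `toy` data frame and all of its values are simulated placeholders (an assumption made purely for illustration), not the Kaggle dataset; only the column names mirror it. The plot object names are likewise hypothetical.

```r
library(ggplot2)

set.seed(1234)

# Simulated stand-in data: values are made up, only the column names
# follow the real dataset.
toy <- data.frame(
  age = runif(200, min = 6, max = 80),
  HbA1c_level = rnorm(200, mean = 5.5, sd = 1),
  diabetes = factor(sample(c("False", "True"), 200,
                           replace = TRUE, prob = c(0.9, 0.1)))
)

# Histogram: distribution of HbA1c level, coloured by diabetes status.
hba1c_hist <- ggplot(toy, aes(x = HbA1c_level, fill = diabetes)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  labs(x = "HbA1c level", y = "Number of patients", fill = "Diabetes status")

# Scatter plot: relationship between age and HbA1c level.
age_hba1c_scatter <- ggplot(toy, aes(x = age, y = HbA1c_level, colour = diabetes)) +
  geom_point(alpha = 0.5) +
  labs(x = "Age (years)", y = "HbA1c level", colour = "Diabetes status")
```

Printing `hba1c_hist` or `age_hba1c_scatter` in a notebook cell renders the corresponding figure.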

*Proposed results visualization (fs):*
- Compare the majority classifier with our K-nearest neighbors classifier.
- Use metrics() and a confusion matrix on both to do this (present both as tables first); review chapter 6.
    <br>Note: draw the comparison only after the K-nearest neighbors classifier has been tuned and cross-validated. Also add a line graph of the accuracy estimate vs. K (try K from 1 to 100? K must be at least 1).
- Put the values obtained in the previous point into a bar graph as a visualization(?).
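
The comparison proposed above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the final analysis: `toy` is simulated data (not the Kaggle dataset), the grid of K values, the number of folds, and all object names are hypothetical choices, and the kknn package is assumed to be installed for the `"kknn"` engine.

```r
library(tidyverse)
library(tidymodels)

set.seed(1234)

# Simulated stand-in data (values are made up for illustration only).
n <- 400
toy <- tibble(
  age = runif(n, 6, 80),
  HbA1c_level = rnorm(n, 5.5, 1)
) |>
  mutate(diabetes = factor(if_else(
    HbA1c_level + age / 40 + rnorm(n, 0, 0.5) > 7.5, "True", "False"
  )))

toy_split <- initial_split(toy, prop = 0.75, strata = diabetes)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardise predictors, since KNN is distance-based.
knn_recipe <- recipe(diabetes ~ age + HbA1c_level, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validate over a hypothetical grid of K values (K starts at 1).
folds <- vfold_cv(toy_train, v = 5, strata = diabetes)
k_grid <- tibble(neighbors = seq(1, 41, by = 5))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Line graph of the cross-validated accuracy estimate vs. K.
accuracy_vs_k <- ggplot(knn_results, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of neighbors (K)", y = "Accuracy estimate")

best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit with the chosen K and evaluate on the test set.
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(
    nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
      set_engine("kknn") |>
      set_mode("classification")
  ) |>
  fit(data = toy_train)

knn_predictions <- predict(knn_fit, toy_test) |> bind_cols(toy_test)
knn_metrics <- knn_predictions |> metrics(truth = diabetes, estimate = .pred_class)
knn_conf_mat <- knn_predictions |> conf_mat(truth = diabetes, estimate = .pred_class)
knn_accuracy <- knn_metrics |> filter(.metric == "accuracy") |> pull(.estimate)

# Majority classifier: always predict the most common training label.
majority_class <- toy_train |>
  count(diabetes) |>
  slice_max(n, n = 1, with_ties = FALSE) |>
  pull(diabetes)
majority_accuracy <- mean(as.character(toy_test$diabetes) == as.character(majority_class))

# Bar graph comparing the two classifiers.
comparison <- tibble(
  classifier = c("Majority", "K-nearest neighbors"),
  accuracy = c(majority_accuracy, knn_accuracy)
)
comparison_plot <- ggplot(comparison, aes(x = classifier, y = accuracy)) +
  geom_col() +
  labs(x = "Classifier", y = "Test set accuracy")
```

Because the data are imbalanced, the majority classifier's accuracy is the baseline the tuned classifier has to beat; the confusion matrix in `knn_conf_mat` shows whether any gain comes from correctly identified diabetic cases.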

### Results

In [1]:
install.packages("corrplot")
install.packages("Hmisc")
install.packages("gridExtra")
library(tidyverse)
library(tidymodels)
library(corrplot)
library(Hmisc)
library(gridExtra)
options(repr.matrix.max.rows = 15)
options(repr.plot.width = 10, repr.plot.height = 8)
set.seed(1234)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tid

In [2]:
diabetes <- read_csv("https://raw.githubusercontent.com/florencesanjaya/DSCI-100-2023w1-group-36/main/diabetes_prediction_dataset.csv")

Rows: 100000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, smoking_history
dbl (7): age, hypertension, heart_disease, bmi, HbA1c_level, blood_glucose_l...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [4]:
# Diabetes is rare among children under age 5 (Pregnancy, Birth and Baby, n.d.), so only people above 5 years old are considered.
diabetes <- diabetes |>
  filter(age > 5)

*(insert upsampling step here)*
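
One possible shape for the upsampling step is sketched below, using `step_upsample()` from the themis package (assumed to be installed). The `toy` tibble is simulated and deliberately imbalanced; it is not the Kaggle dataset, and the object names and the `over_ratio = 1` setting are assumptions made for illustration.

```r
library(tidyverse)
library(tidymodels)
library(themis)  # provides step_upsample(); assumed to be installed

set.seed(1234)

# Simulated, deliberately imbalanced stand-in data: 90 "False", 10 "True".
toy <- tibble(
  HbA1c_level = c(rnorm(90, 5.4, 0.5), rnorm(10, 7.0, 0.5)),
  diabetes = factor(c(rep("False", 90), rep("True", 10)))
)

# over_ratio = 1 resamples the minority class (with replacement) until it
# is as frequent as the majority class.
upsample_recipe <- recipe(diabetes ~ ., data = toy) |>
  step_upsample(diabetes, over_ratio = 1) |>
  prep()

# new_data = NULL returns the data the recipe was prepped on, after all steps.
toy_upsampled <- bake(upsample_recipe, new_data = NULL)

toy_upsampled |> count(diabetes)
```

If upsampling is applied before the training/testing split, copies of the same record can land in both sets, so one option is to upsample the training set only.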

In [20]:
diabetes_split <- initial_split(diabetes, prop = 0.75, strata = diabetes)
diabetes_training <- training(diabetes_split)
diabetes_testing <- testing(diabetes_split)
glimpse(diabetes_training)

Rows: 67,497
Columns: 9
$ gender              <chr> "Female", "Female", "Male", "Female", "Female", "F…
$ age                 <dbl> 80, 54, 28, 20, 44, 32, 53, 67, 78, 15, 42, 42, 40…
$ hypertension        <fct> False, False, False, False, False, False, False, F…
$ heart_disease       <fct> True, False, False, False, False, False, False, Fa…
$ smoking_history     <chr> "never", "No Info", "never", "never", "never", "ne…
$ bmi                 <dbl> 25.19, 27.32, 27.32, 27.32, 19.31, 27.32, 27.32, 2…
$ HbA1c_level         <dbl> 6.6, 6.6, 5.7, 6.6, 6.5, 5.0, 6.1, 5.8, 6.6, 6.1, …
$ blood_glucose_level <dbl> 140, 80, 158, 85, 200, 100, 85, 200, 126, 200, 158…
$ diabetes            <fct> False, False, False, False, True, False, False, Fa…
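
The glimpse() output above lists hypertension, heart_disease, and diabetes as factors with False/True labels, whereas read_csv() loaded them as numeric columns; the cell that performs this conversion is not shown. A minimal sketch of one way to do it, assuming the raw columns are coded 0/1 (an assumption, as is the `toy` tibble used here in place of the real data):

```r
library(tidyverse)

# Tiny stand-in tibble with the assumed 0/1 coding.
toy <- tibble(
  hypertension = c(0, 1, 0),
  heart_disease = c(1, 0, 0),
  diabetes = c(0, 0, 1)
)

# Convert each binary column to a factor with readable labels.
toy_factored <- toy |>
  mutate(across(
    c(hypertension, heart_disease, diabetes),
    \(x) factor(x, levels = c(0, 1), labels = c("False", "True"))
  ))

glimpse(toy_factored)
```

Storing the outcome as a factor is what lets tidymodels treat the task as classification rather than regression.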


## Discussion

*(insert summary of findings)*

*(insert discussion of whether findings conform to expectations)*

Using machine learning to comb through large volumes of patient data can accelerate crucial decision-making, especially in clinical settings where healthcare providers must take a patient's risk of diabetes into account when planning treatment. We believe that our findings can bring this a step closer by confirming the correlational relationship between the four variables and diabetes.

Future questions to consider:
1. Can we differentiate between type 1 and type 2 diabetes with our predictors?
2. Would a bigger patient dataset alter the accuracy of our predictors?

## References

**(currently using APA 7th ed.)**

Mustafa, M. (2023). *Diabetes prediction dataset* [Data set]. Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

Petrie, J. R., Guzik, T. J., & Touyz, R. M. (2018). Diabetes, hypertension, and cardiovascular disease: Clinical insights and vascular mechanisms. *Canadian Journal of Cardiology, 34*(5), 575–584. https://doi.org/10.1016/j.cjca.2017.12.005

Pregnancy, Birth and Baby. (n.d.). *Diabetes in young children*. https://www.pregnancybirthbaby.org.au/diabetes-in-young-children#:~:text=Diabetes%20is%20rare%20in%20children,diabetes%20and%20manage%20the%20condition

Sherwani, S. I., Khan, H. A., Ekhzaimy, A., Masood, A., & Sakharkar, M. K. (2016). Significance of HbA1c test in diagnosis and prognosis of diabetic patients. *Biomarker Insights, 11*, 95–104. https://doi.org/10.4137/BMI.S38440

Yan, Z., Cai, M., Han, X., Chen, Q., & Lu, H. (2023). The interaction between age and risk factors for diabetes and prediabetes: A community-based cross-sectional study. *Diabetes, Metabolic Syndrome and Obesity: Targets and Therapy, 16*, 85–93. https://doi.org/10.2147/DMSO.S390857