# Predicting Diabetes Status

## Introduction

Diabetes is a chronic metabolic disease in which the body cannot produce enough insulin or cannot use it effectively. Insulin is the hormone that allows sugar to be used for energy. Untreated diabetes can damage multiple organs, and approximately [422 million](https://www.who.int/health-topics/diabetes#tab=tab_1) people worldwide have the disease. It is therefore imperative that diabetes be diagnosed as early as possible to minimise complications.

This project aims to build a feasible model for predicting diabetes. Previous studies have found that [HbA1c levels](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4933534/), [age](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9843502/#:~:text=Advanced%20age%20is%20a%20major%20risk%20factor%20for%20diabetes%20and%20prediabetes.&text=Therefore%2C%20the%20elderly%20has%20a,%2C%20retinal%2C%20and%20renal%20systems.), and [hypertension and heart disease statuses](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5953551/) are related to diabetes. The question to be answered is therefore: can diabetes status be predicted from these four variables?

To answer this question, the project will use the [Diabetes prediction dataset](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset) created by Mohammed Mustafa on Kaggle. It contains patient records organised into 8 predictor variables (age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level) and an outcome variable (diabetes status).

## Methods and Results

### Methods
1. Data Cleaning: Before any analysis is conducted, the dataset will be cleaned to handle missing values, outliers, or any inconsistencies.
2. Variable Selection: Not all variables/columns in a dataset are relevant or useful for prediction. We will focus primarily on HbA1c level, age, hypertension status, and heart disease status, which various studies have shown to be influential factors in the onset of diabetes. Other variables may be included or excluded based on their correlation with the outcome (diabetes status) and their importance in the model.
3. Data Splitting: The dataset will be divided into training and testing sets. The training set will be used to train our predictive model, while the testing set will be used to evaluate the model's performance.
4. Model Building and Evaluation: Several models will be considered, including logistic regression, decision trees, and random forests. Each model's performance will be assessed on the testing set using appropriate metrics such as accuracy, precision, and recall.
5. Results Visualization: The ggplot2 package will be central to this step. Histograms (geom_histogram()) and scatter plots (geom_point()) will be used to inspect variable distributions and relationships. Feature importance, when relevant, can be visualised using a bar graph (geom_col(), or geom_bar(stat = "identity"), since the importance values are precomputed rather than counted).
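
The splitting and evaluation steps above can be sketched in a few lines. The following is a minimal, language-neutral illustration written in Python with only the standard library; it is not the project's implementation, which is planned around R and ggplot2. The function names `train_test_split` and `classification_metrics`, the 25% test fraction, and the fixed seed are all assumptions made for the example, and the outcome is assumed to be coded as 1 (diabetes) or 0 (no diabetes).

```python
import random


def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of `records` and return (train, test).

    `test_fraction` and `seed` are illustrative defaults, not values
    taken from the project.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = diabetes)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        # share of all predictions that are correct
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        # share of predicted-positive cases that are truly positive
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # share of truly positive cases that the model finds
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

In use, a model would be fitted on the first list returned by `train_test_split`, and its predictions for the second list would be passed to `classification_metrics` together with the true labels. Reporting precision and recall alongside accuracy matters when one class is much rarer than the other, because a model that always predicts the majority class can still score a high accuracy.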

### Results

## Discussion

*(insert summary of findings)*

*(insert discussion of whether findings conform to expectations)*

Using machine learning to analyse large volumes of patient data can accelerate crucial decision-making, especially in clinical settings where healthcare providers must account for a patient's risk of diabetes when planning treatment. We believe that our findings can bring this a step closer by clarifying the relationship between the four variables and diabetes status.

Future questions to consider:
1. Can we differentiate between type 1 and type 2 diabetes with our predictors?
2. Would a larger patient dataset change the accuracy of a model built on our predictors?

## References

**(currently using APA 7th ed.)**

Mustafa, M. (2023). *Diabetes prediction dataset* [Data set]. Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

Petrie, J. R., Guzik, T. J., & Touyz, R. M. (2018). Diabetes, hypertension, and cardiovascular disease: Clinical insights and vascular mechanisms. *Canadian Journal of Cardiology, 34*(5), 575–584. https://doi.org/10.1016/j.cjca.2017.12.005

Sherwani, S. I., Khan, H. A., Ekhzaimy, A., Masood, A., & Sakharkar, M. K. (2016). Significance of HbA1c test in diagnosis and prognosis of diabetic patients. *Biomarker Insights, 11*, 95–104. https://doi.org/10.4137/BMI.S38440

Yan, Z., Cai, M., Han, X., Chen, Q., & Lu, H. (2023). The interaction between age and risk factors for diabetes and prediabetes: A community-based cross-sectional study. *Diabetes, Metabolic Syndrome and Obesity: Targets and Therapy, 16*, 85–93. https://doi.org/10.2147/DMSO.S390857