## Group Project Proposal
# An Exploratory Analysis Into Diabetes
#### Group 21: Aiko Sumarno, Yoona Wang, Jerry Jin, Daniel Chou

## Introduction:

Millions of people worldwide suffer from diabetes, a common and potentially fatal illness. Early identification and management are essential to manage the disease and avoid its consequences. In this project, we want to build machine learning models that use patients' medical history and demographic data to predict their risk of developing diabetes.


We use a dataset of patient personal and medical data, including age, gender, blood glucose level, body mass index (BMI), smoking history, hypertension, heart disease, and HbA1c level. The diabetes status of each patient is also recorded, classified as either positive or negative.


Using this information, we can create prediction models that use these traits to estimate a person's probability of developing diabetes. Such models may help healthcare professionals identify high-risk patients and carry out early treatment or preventative measures.


### Preliminary Exploratory Data Analysis:

In [1]:
library(repr)
library(tidyverse)
library(tidymodels)
library(ggplot2)
options(repr.matrix.max.rows = 10)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ rsample

In [2]:
set.seed(123)

### 1. Read and Tidy Data

The data were taken from [Kaggle](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset) and consist of medical and demographic data of people who may or may not have diabetes. The primary source of this dataset is electronic health records from healthcare providers. The dataset contains the following variables:

* gender = Male or Female
* age = How old is the person?
* hypertension = Does the person have hypertension? (1 - yes, 0 - no)
* heart_disease = Does the person have heart disease? (1 - yes, 0 - no)
* smoking_history = Is the person a smoker? (never, No Info, current, former, ever, or not current)
* bmi = body mass index
* HbA1c_level = Hemoglobin A1C, average blood sugar level over the past two to three months
* blood_glucose_level = amount of glucose in the person's blood
* diabetes = Does the person have diabetes or not? (1 - yes, 0 - no)

For our analysis, we have decided to use the person's **age**, **bmi**, **HbA1c_level**, **blood_glucose_level** and **diabetes** variables only. 

In [7]:
#diabetes <- read_csv("data/diabetes.csv")
diabetes <- read_csv("https://raw.githubusercontent.com/aikosumarno/dsci-100-2023w2-group-21/main/diabetes.csv")
head(diabetes)

Rows: 100000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, smoking_history
dbl (7): age, hypertension, heart_disease, bmi, HbA1c_level, blood_glucose_l...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


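| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Female | 80 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| Female | 54 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| Male | 28 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| Female | 36 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| Male | 76 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |
| Female | 20 | 0 | 0 | never | 27.32 | 6.6 | 85 | 0 |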


In [None]:
diabetes <- diabetes |>
              mutate(diabetes = as_factor(diabetes)) |>
              mutate(diabetes = fct_recode(diabetes, "diabetic" = "1", "non-diabetic" = "0")) |> 
              mutate(hypertension = as_factor(hypertension)) |> 
              mutate(hypertension = fct_recode(hypertension, "yes" = "1", "no" = "0")) |> 
              mutate(heart_disease = as_factor(heart_disease)) |> 
              mutate(heart_disease = fct_recode(heart_disease, "yes" = "1", "no" = "0"))
head(diabetes)
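
As a quick check on the recoding step, the sketch below applies the same `as_factor()` and `fct_recode()` calls to a small made-up tibble (the `toy` values are hypothetical, not taken from the dataset) and inspects the resulting factor levels. The level order matters later because it decides which panel or colour each class gets in the plots.

```r
library(tidyverse)

# Toy stand-in for the diabetes column (hypothetical values, not from the dataset)
toy <- tibble(diabetes = c(0, 1, 0, 0, 1))

# Same recoding pattern as above: numeric -> factor -> readable labels
toy_recoded <- toy |>
  mutate(diabetes = as_factor(diabetes),
         diabetes = fct_recode(diabetes, "diabetic" = "1", "non-diabetic" = "0"))

# Levels keep the numeric order of the original codes (0 first, then 1)
levels(toy_recoded$diabetes)

# Number of rows per class after recoding
count(toy_recoded, diabetes)
```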

In [None]:
diabetes_tidy <- diabetes |> 
                    select(-gender, -hypertension, -heart_disease, -smoking_history)
head(diabetes_tidy)

### 2. Summarize Data 

We decided to split **75%** of the data **for training** and **25% for testing**. 

To summarize the training data, we counted the number and percentage of patients who were diagnosed with diabetes and of those who were not. We also calculated the average value of each predictor variable and compared the averages between diabetic and non-diabetic patients.

In [None]:
diabetes_split <- initial_split(diabetes_tidy, prop = 0.75, strata = diabetes) 
diabetes_training <- training(diabetes_split)
diabetes_testing <- testing(diabetes_split)
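
The `strata = diabetes` argument is meant to keep the share of diabetic patients roughly equal in the training and testing sets. The sketch below illustrates this on synthetic data; the `toy` tibble, its 90/10 class balance, and the helper `class_share()` are assumptions made for illustration only.

```r
library(tidymodels)

set.seed(123)
# Synthetic, imbalanced stand-in for diabetes_tidy (hypothetical values)
toy <- tibble(
  age = runif(1000, 1, 80),
  diabetes = factor(rep(c("non-diabetic", "diabetic"), times = c(900, 100)),
                    levels = c("non-diabetic", "diabetic"))
)

# Same 75/25 stratified split as above
toy_split <- initial_split(toy, prop = 0.75, strata = diabetes)
toy_training <- training(toy_split)
toy_testing <- testing(toy_split)

# Hypothetical helper: percentage of each class in a data frame
class_share <- function(data) {
  data |>
    count(diabetes) |>
    mutate(percent = 100 * n / sum(n))
}

# With stratification, the two tables should show similar percentages
class_share(toy_training)
class_share(toy_testing)
```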

In [None]:
glimpse(diabetes_training)

In [None]:
glimpse(diabetes_testing)

In [None]:
cat("Table 1: Number and Percentage of Patients Diagnosed with Diabetes\n")
diabetes_proportions <- diabetes_training |>
                          group_by(diabetes) |>
                          summarize(count = n()) |>
                          mutate(percent = 100*count/nrow(diabetes_training))

diabetes_proportions

In [None]:
cat("Table 2: Average Predictor Values\n")
diabetes_mean <- diabetes_training |>
                    select(-diabetes) |>
                    map_df(mean) 
diabetes_mean

In [None]:
cat("Table 3: Average Predictor Values for Diabetic and Non-Diabetic Patients\n")
comparison <- diabetes_training |>
                group_by(diabetes) |>
                summarize(avg_age = mean(age),
                          avg_bmi = mean(bmi), 
                          avg_HbA1c_level = mean(HbA1c_level), 
                          avg_blood_glucose_level = mean(blood_glucose_level))
comparison
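
The per-group averages can also be computed without naming each column by hand. The sketch below uses `across()` on a made-up tibble (the `toy` values are hypothetical); it is one possible alternative to the cell above, not a replacement for it.

```r
library(tidyverse)

# Toy stand-in for diabetes_training (hypothetical values, not from the dataset)
toy <- tibble(
  diabetes = factor(c("non-diabetic", "non-diabetic", "diabetic", "diabetic")),
  age = c(30, 40, 60, 70),
  bmi = c(22, 24, 30, 32)
)

# Average every numeric column within each diagnosis group,
# naming the results avg_<column> to match the table above
toy_comparison <- toy |>
  group_by(diabetes) |>
  summarize(across(where(is.numeric), mean, .names = "avg_{.col}"))

toy_comparison
```

This keeps the summary in step with the data if a predictor is added or dropped later.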

### 3. Exploratory Data Visualization

Histograms are used to visualize the distribution of each predictor variable we plan to use in our analysis, compared between diabetic and non-diabetic patients.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)
age_plot <- diabetes_training |>
                ggplot(aes(x = age, fill = diabetes)) +
                geom_histogram(binwidth = 8) +
                facet_grid(rows = vars(diabetes)) +
                labs(x = "Age (in years)", y = "Number of Patients", fill = "Diagnosis") +
                ggtitle("Figure 1: Distribution of Age between Diabetic and Non-Diabetic Patients") +
                theme(text = element_text(size = 12))
age_plot

In Figure 1, non-diabetic patients (upper panel) appear to have a reasonably consistent distribution over the age range, with a slight decrease in the oldest age group. For diabetic patients (lower panel), the counts begin low in the youngest age groups, rise and peak in the middle age ranges, and then fall in the older age groups.

In [None]:
bmi_plot <- diabetes_training |>
                ggplot(aes(x = bmi, fill = diabetes)) +
                geom_histogram(binwidth = 8) +
                facet_grid(rows = vars(diabetes)) +
                labs(x = "Body Mass Index", y = "Number of Patients", fill = "Diagnosis") +
                ggtitle("Figure 2: Distribution of Body Mass Index between Diabetic and Non-Diabetic Patients") +
                theme(text = element_text(size = 12))
bmi_plot

In Figure 2, non-diabetic patients show a pronounced peak in the lower BMI range, indicating a large concentration of non-diabetic patients at these BMI values.
In contrast, an individual with a BMI under 20 kg/m^2 has very little chance of being diabetic.
Based on the trend of the graph, the higher the BMI, the greater the chance of being diabetic.

In [None]:
HbA1c_level_plot <- diabetes_training |>
                        ggplot(aes(x = HbA1c_level, fill = diabetes)) +
                        geom_histogram(binwidth = 0.5) + # narrower bins (assumed value): HbA1c spans only a few units
                        facet_grid(rows = vars(diabetes)) +
                        labs(x = "Hemoglobin A1C Level", y = "Number of Patients", fill = "Diagnosis") +
                        ggtitle("Figure 3: Distribution of Hemoglobin A1C Level between Diabetic and Non-Diabetic Patients") +
                        theme(text = element_text(size = 12))
HbA1c_level_plot

The x-axis shows hemoglobin A1C levels, an essential measure in diabetes care because they indicate average blood glucose levels over the previous two to three months. According to the graph, diabetic patients appear only at hemoglobin A1C levels of about 4% or higher.

In [None]:
blood_glucose_level_plot <- diabetes_training |>
                            ggplot(aes(x = blood_glucose_level, fill = diabetes)) +
                            geom_histogram(binwidth = 8) +
                            facet_grid(rows = vars(diabetes)) +
                            labs(x = "Blood Glucose Level", y = "Number of Patients", fill = "Diagnosis") +
                            ggtitle("Figure 4: Distribution of Blood Glucose Level between Diabetic and Non-Diabetic Patients") +
                            theme(text = element_text(size = 12))
blood_glucose_level_plot

Non-diabetic patients' blood glucose levels are concentrated at the lower end of the scale, which is consistent with the medical understanding that non-diabetics tend to have lower glucose levels.
Diabetic patients show a wider distribution of glucose levels with several spikes, implying that they experience a wider range of blood glucose levels, including highly elevated ones.
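
The histograms above share the same structure, so they could also be drawn as a single facetted figure. The sketch below shows one way to do this on synthetic data; the `toy_training` values, the choice of three predictors, and `bins = 20` are assumptions made for illustration, not the settings used in the figures above.

```r
library(tidyverse)

set.seed(123)
# Synthetic stand-in for diabetes_training (hypothetical values, not the real data)
toy_training <- tibble(
  diabetes = factor(rep(c("non-diabetic", "diabetic"), each = 100),
                    levels = c("non-diabetic", "diabetic")),
  age = c(runif(100, 1, 80), runif(100, 30, 80)),
  HbA1c_level = c(rnorm(100, 5.4, 0.6), rnorm(100, 6.9, 0.8)),
  blood_glucose_level = c(rnorm(100, 130, 25), rnorm(100, 190, 40))
)

# Reshape to one row per (patient, predictor) pair so predictors can be facetted
toy_long <- toy_training |>
  pivot_longer(cols = -diabetes, names_to = "predictor", values_to = "value")

# One column of panels per predictor, one row per diagnosis;
# free x scales because the predictors are measured in different units
predictor_plot <- toy_long |>
  ggplot(aes(x = value, fill = diabetes)) +
  geom_histogram(bins = 20) +
  facet_grid(rows = vars(diabetes), cols = vars(predictor), scales = "free_x") +
  labs(x = "Predictor value", y = "Number of Patients", fill = "Diagnosis") +
  theme(text = element_text(size = 12))
predictor_plot
```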

In [None]:
blood_vs_HbA1c_plot <- diabetes_training |>
                            ggplot(aes(x = blood_glucose_level, y = HbA1c_level, color = diabetes)) +
                            geom_point() +
                            labs(x = "Blood Glucose Level", y = "Hemoglobin A1C Level", color = "Diagnosis") +
                            ggtitle("Figure 5: Blood Glucose Level vs HbA1c_level") +
                            theme(text = element_text(size = 12))
blood_vs_HbA1c_plot

Figure 5 shows the relationship between blood glucose level and HbA1c level. The graph shows a clear distinction between non-diabetic and diabetic patients: an individual with a high blood glucose level, a high hemoglobin A1C level, or both has a higher chance of being diabetic.

## Methods:

We will use KNN classification because we are predicting a categorical value (diagnosis) from our predictors. We will create a classifier, tune it and visualize the results. The variables used in the analysis are age, bmi, blood glucose level and HbA1c level as predictors, and the diagnosis of diabetes as the response.
From the plots above (Figures 1 to 5), we can see a relatively strong relationship between the likelihood of having diabetes and the 4 predictors (age, bmi, blood_glucose_level, HbA1c_level). Thus, we will use age, bmi, blood glucose level and HbA1c level as predictors for diagnosis, with the column averages computed above (Tables 2 and 3) summarizing each of them.
The results will be visualized in 4 plots, each with one variable on the x-axis and another on the y-axis, and with the points coloured to identify the diagnosis. We will also make a plot of predicted and true diagnosis values, with a best-fit line through the true values, of "a variable" vs "diagnosis".
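
A minimal sketch of how the classifier could be created and tuned with tidymodels is shown below. It runs on synthetic data, so every value in `toy_training` is hypothetical. The `kknn` engine (which requires the kknn package), the rectangular weight function, the 5 folds, and the grid of K values are all assumptions for illustration, not choices the proposal has fixed.

```r
library(tidymodels)

set.seed(123)
# Synthetic stand-in for diabetes_training (hypothetical values, not the real data)
n <- 300
toy_training <- tibble(
  diabetes = factor(rep(c("non-diabetic", "diabetic"), each = n / 2),
                    levels = c("non-diabetic", "diabetic")),
  age = c(rnorm(n / 2, 40, 15), rnorm(n / 2, 60, 12)),
  bmi = c(rnorm(n / 2, 27, 5), rnorm(n / 2, 32, 6)),
  HbA1c_level = c(rnorm(n / 2, 5.4, 0.6), rnorm(n / 2, 6.9, 0.8)),
  blood_glucose_level = c(rnorm(n / 2, 130, 25), rnorm(n / 2, 190, 40))
)

# Standardize predictors so that no single variable dominates the distance
knn_recipe <- recipe(diabetes ~ age + bmi + HbA1c_level + blood_glucose_level,
                     data = toy_training) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-nearest neighbours classifier with the number of neighbours left to tune
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation over a small, assumed grid of K values
toy_vfold <- vfold_cv(toy_training, v = 5, strata = diabetes)
k_grid <- tibble(neighbors = seq(1, 21, by = 4))

# Fit every K on every fold and keep the cross-validated accuracy
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = toy_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

knn_results
```

The K with the highest mean accuracy in `knn_results` would then be used to refit the classifier on the full training set before evaluating it on the testing set.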

## Expected Outcomes and Significance:

## Reference List: 
* Mustafa, M. (2023). Diabetes prediction dataset. Kaggle.com. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
