In [None]:
Summary Tables

In [1]:
# importing libraries
library(tidyverse)
library(repr)
library(janitor)
library(ggplot2)
library(tidymodels)
library(RColorBrewer)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test


── Attaching packages ────────────────────

In [2]:
# Cleaning names and specifying categorical variable
diabetes <- read_csv("../data/diabetes.csv") |>
            clean_names() |>
            mutate(outcome = as_factor(outcome)) |>
            mutate(diabetes = fct_recode(outcome, "Yes" = "1", "No" = "0")) |>
            select(-outcome)

# Displaying the data
head(diabetes)

Rows: 768 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, D...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


pregnancies,glucose,blood_pressure,skin_thickness,insulin,bmi,diabetes_pedigree_function,age,diabetes
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
6,148,72,35,0,33.6,0.627,50,Yes
1,85,66,29,0,26.6,0.351,31,No
8,183,64,0,0,23.3,0.672,32,Yes
1,89,66,23,94,28.1,0.167,21,No
0,137,40,35,168,43.1,2.288,33,Yes
5,116,74,0,0,25.6,0.201,30,No


In [3]:
# splitting the diabetes data into training (75%) and testing (25%) sets, stratified by diagnosis
diabetes_split <- initial_split(diabetes, prop = 0.75, strata = diabetes)

# training data
diabetes_training <- training(diabetes_split)

# testing data
diabetes_testing <- testing(diabetes_split)

# the data to be used
head(diabetes_training)

pregnancies,glucose,blood_pressure,skin_thickness,insulin,bmi,diabetes_pedigree_function,age,diabetes
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
1,85,66,29,0,26.6,0.351,31,No
1,89,66,23,94,28.1,0.167,21,No
10,115,0,0,0,35.3,0.134,29,No
4,110,92,0,0,37.6,0.191,30,No
10,139,80,0,0,27.1,1.441,57,No
1,103,30,38,83,43.3,0.183,33,No
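
The split above is random, so the rows shown will differ between runs unless a seed is set first. The following is a minimal sketch of a repeatable split. The seed value 2023 is an arbitrary assumption (it is not from the original notebook), and `toy` is a small hypothetical data frame standing in for `diabetes`.

```r
library(rsample)  # part of tidymodels; provides initial_split() and training()

# Hypothetical stand-in for the diabetes data (illustrative values only)
toy <- data.frame(
  glucose  = seq(80, 197, by = 3),
  diabetes = factor(rep(c("No", "Yes"), times = c(24, 16)))
)

# Setting the same seed before each split makes the split repeatable
set.seed(2023)  # assumed seed; any fixed integer works
split_a <- initial_split(toy, prop = 0.75, strata = diabetes)

set.seed(2023)
split_b <- initial_split(toy, prop = 0.75, strata = diabetes)

# TRUE when both splits selected the same training rows
same_training_rows <- identical(training(split_a), training(split_b))
same_training_rows
```

In the notebook itself, a single `set.seed()` call placed before `initial_split()` would be enough to keep the training and testing sets stable across runs.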


In [16]:
# Determining the average glucose, BMI, and blood pressure for those with and without diabetes
average_summary <- diabetes_training |>
    group_by(diabetes) |>
    summarize(
        average_glucose = mean(glucose),
        average_bmi = mean(bmi),
        average_blood_pressure = mean(blood_pressure))
average_summary

diabetes,average_glucose,average_bmi,average_blood_pressure
<fct>,<dbl>,<dbl>,<dbl>
No,108.8853,30.22213,68.336
Yes,141.1592,34.89851,72.10945
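
Several rows in the outputs above record a blood pressure, skin thickness, or insulin value of 0, which is not physiologically plausible and probably stands for a missing measurement. If that assumption holds, those zeros pull the group means down. The sketch below recodes zeros as `NA` before averaging. It uses nine rows copied from the `head()` outputs above rather than the full data set, and the name `zero_as_na_summary` is illustrative.

```r
library(dplyr)

# Nine rows copied from the head() outputs above (not the full data set)
sample_rows <- tibble(
  glucose        = c(148, 183, 137, 85, 89, 115, 110, 139, 103),
  blood_pressure = c(72, 64, 40, 66, 66, 0, 92, 80, 30),
  bmi            = c(33.6, 23.3, 43.1, 26.6, 28.1, 35.3, 37.6, 27.1, 43.3),
  diabetes       = factor(rep(c("Yes", "No"), times = c(3, 6)))
)

# Assumption: a recorded 0 means "not measured", so recode it to NA
# and drop it from the mean with na.rm = TRUE
zero_as_na_summary <- sample_rows |>
  mutate(across(c(glucose, bmi, blood_pressure), ~ na_if(.x, 0))) |>
  group_by(diabetes) |>
  summarize(across(c(glucose, bmi, blood_pressure),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "average_{.col}"))
zero_as_na_summary
```

On these nine rows, the "No" group's mean blood pressure is 66.8 once the zero is excluded, compared with about 55.7 when it is counted as a real reading.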


In [22]:
# Counting the observations for each number of pregnancies
count_preg <- diabetes_training |>
    count(pregnancies)
count_preg

pregnancies,n
<dbl>,<int>
0,83
1,105
2,74
3,53
4,54
5,41
6,44
7,31
8,26
9,22


In [17]:
# Average glucose, BMI, and blood pressure for each number of pregnancies (treated as a factor)
diabetes_training_pregnant_stage <- diabetes_training |>
    mutate(pregnancies = as_factor(pregnancies)) |>
    group_by(pregnancies) |>
    summarize(
        average_glucose = mean(glucose),
        average_bmi = mean(bmi),
        average_blood_pressure = mean(blood_pressure))
diabetes_training_pregnant_stage

pregnancies,average_glucose,average_bmi,average_blood_pressure
<fct>,<dbl>,<dbl>,<dbl>
0,122.3976,34.33133,69.09639
1,113.0286,31.11238,67.88571
2,110.7973,30.75405,64.74324
3,123.7925,30.4283,68.22642
4,123.3333,31.95556,69.03704
5,117.0488,32.67561,75.78049
6,119.8636,30.1,68.15909
7,139.7097,33.87097,77.58065
8,128.7308,30.33462,73.65385
9,126.0909,30.95909,75.59091


### Expected Outcomes and Significance

What we would like to find:
- How does blood pressure impact the likelihood of diabetes?
- What is the relationship between BMI and diabetes status?
- How are glucose and age related, and how does that relationship differ by diabetes status?
  

These findings could have a broader impact in the future, such as:

1. Finding which variables have the most significant impact on predicting diabetes in this dataset. (One caveat: we will be standardizing our data, so we are not sure whether we will be able to determine the most impactful variable.)
2. Determining which diabetes-related variables medical workers should focus on to prevent or decrease the risk of future patients being diagnosed with Type 1 or Type 2 diabetes.
3. Determining the critical age range in which individuals are most susceptible to developing diabetes, based on glucose levels.
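
To make the standardization caveat in item 1 concrete, the sketch below standardizes two predictors with base R's `scale()`, using six rows copied from the training output above. It only shows what standardizing does to the values: each column is shifted to mean 0 and rescaled to standard deviation 1, and the ordering of observations within a column is unchanged. Whether variable importance can still be read off afterwards depends on the model chosen, which this sketch does not address. In a tidymodels workflow the same rescaling would typically be done with recipe steps such as `step_center()` and `step_scale()`.

```r
# Six rows copied from the head(diabetes_training) output above
predictors <- data.frame(
  glucose = c(85, 89, 115, 110, 139, 103),
  bmi     = c(26.6, 28.1, 35.3, 37.6, 27.1, 43.3)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
standardized <- as.data.frame(scale(predictors))

# After standardizing, both columns are on the same unitless scale
colMeans(standardized)
sapply(standardized, sd)

# The ordering of observations within a column is preserved
order_preserved <- identical(rank(predictors$glucose), rank(standardized$glucose))
order_preserved
```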

Other potential findings:
1. Predicting whether an individual is susceptible to Type 1 or Type 2 diabetes based on current data.
2. Determining whether a higher number of pregnancies is associated with a higher risk of Type 1 or Type 2 diabetes.
3. Determining whether individuals with high blood pressure are at higher risk of being diagnosed with Type 1 or Type 2 diabetes.
4. Visualizing whether BMI influences the diagnosis of Type 1 or Type 2 diabetes.
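
One possible way to approach the BMI visualization in item 4 is a boxplot of BMI by diagnosis. This is a sketch only: it uses nine rows copied from the `head()` outputs above instead of `diabetes_training`, and the object name `bmi_plot` and the axis labels are illustrative choices.

```r
library(ggplot2)

# Nine rows copied from the head() outputs above (not the full data set)
sample_rows <- data.frame(
  bmi      = c(33.6, 23.3, 43.1, 26.6, 28.1, 35.3, 37.6, 27.1, 43.3),
  diabetes = factor(rep(c("Yes", "No"), times = c(3, 6)))
)

# One box per diagnosis group, showing the spread of BMI within each group
bmi_plot <- ggplot(sample_rows, aes(x = diabetes, y = bmi, fill = diabetes)) +
  geom_boxplot() +
  labs(x = "Diabetes diagnosis", y = "BMI", fill = "Diabetes diagnosis")
bmi_plot
```

Swapping `sample_rows` for `diabetes_training` would produce the same plot on the training data.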