## Methods & Results
- First, we read, wrangle, and clean the data to prepare it for analysis.
We keep only the variables we are going to use and convert the target variable, sleep_disorder, to a categorical (factor) variable.
This step uses the functions "read_csv", "colnames", "select", "mutate", and "as_factor".

- Next, we split the data set into a training set and a testing set for later classification.
We then summarize the average value of four predictors (age, physical activity, sleep duration, and heart rate) within each of the three groups in the training set (people with no sleep disorder, with insomnia, and with sleep apnea), which makes the differences between groups easier to compare.
This step uses the functions "initial_split", "group_by", and "summarize".

In [2]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

“package ‘ggplot2’ was built under R version 4.3.2”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom    

In [8]:
options(repr.plot.width = 10, repr.plot.height = 8)

# Read in the data and replace the original column names with concise snake_case names.
sleep_data <- read_csv("Sleep_health_and_lifestyle_dataset.csv")
colnames(sleep_data) <- c("person_ID", "gender", "age", "occupation", "sleep_duration", "sleep_quality", "physical_activity", "stress_level", "BMI", "blood_pressure", "heart_rate", "daily_steps", "sleep_disorder")
sleep_data

Rows: 374 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Gender, Occupation, BMI Category, Blood Pressure, Sleep Disorder
dbl (8): Person ID, Age, Sleep Duration, Quality of Sleep, Physical Activity...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


person_ID,gender,age,occupation,sleep_duration,sleep_quality,physical_activity,stress_level,BMI,blood_pressure,heart_rate,daily_steps,sleep_disorder
<dbl>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<chr>
1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
372,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
373,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
374,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
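
Assigning `colnames()` by position works only as long as the file keeps its current column order. A minimal sketch of a name-based alternative is shown here, assuming the original headers "Person ID", "Sleep Duration", and "Sleep Disorder" reported in the column specification; the toy tibble `raw` and its values are made up for illustration and are not part of the analysis.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the raw file (hypothetical values, illustration only).
raw <- tibble(
  `Person ID` = c(1, 2),
  `Sleep Duration` = c(6.1, 6.2),
  `Sleep Disorder` = c(NA, "Insomnia")
)

# Map each new name to its old name explicitly, so a reordered
# column in the source file cannot end up with the wrong label.
renamed <- raw |>
  rename(
    person_ID = `Person ID`,
    sleep_duration = `Sleep Duration`,
    sleep_disorder = `Sleep Disorder`
  )

colnames(renamed)
```

With `rename()`, a missing or misspelled original column raises an error instead of silently mislabelling the data.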


In [10]:
# Tidy the data: keep only the columns used in the analysis and convert the target to a factor.
sleep_data <- sleep_data |>
    select(person_ID, age, sleep_duration, physical_activity, blood_pressure, heart_rate, sleep_disorder) |>
    mutate(sleep_disorder = as_factor(sleep_disorder))
sleep_data

person_ID,age,sleep_duration,physical_activity,blood_pressure,heart_rate,sleep_disorder
<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<fct>
1,27,6.1,42,126/83,77,
2,28,6.2,60,125/80,75,
3,28,6.2,60,125/80,75,
⋮,⋮,⋮,⋮,⋮,⋮,⋮
372,59,8.1,75,140/95,68,Sleep Apnea
373,59,8.1,75,140/95,68,Sleep Apnea
374,59,8.1,75,140/95,68,Sleep Apnea
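
In the output above, sleep_disorder is blank (missing) for some rows, and blood_pressure is still a single character column such as "126/83". The sketch below shows one possible way to handle both, assuming that a missing sleep_disorder means "no sleep disorder" (the label "None" and the column names `systolic` and `diastolic` are assumptions introduced here, and the toy tibble is made up for illustration).

```r
library(dplyr)
library(tidyr)
library(forcats)
library(tibble)

# Toy rows shaped like the tidied data (hypothetical values).
toy <- tibble(
  person_ID = c(1, 2, 3),
  blood_pressure = c("126/83", "125/80", "140/95"),
  sleep_disorder = c(NA, "Insomnia", "Sleep Apnea")
)

tidied <- toy |>
  # Assumption: a missing value means the person has no sleep disorder.
  mutate(sleep_disorder = as_factor(replace_na(sleep_disorder, "None"))) |>
  # Split "systolic/diastolic" into two columns; convert = TRUE makes them integers.
  separate(blood_pressure, into = c("systolic", "diastolic"), sep = "/", convert = TRUE)

tidied
```

Giving the missing group an explicit level keeps it visible in grouped summaries and avoids a class labelled `NA` during classification.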


In [18]:
# Split the data into training (75%) and testing (25%) sets, stratified by sleep_disorder.
sleep_split <- initial_split(sleep_data, prop = 0.75, strata = sleep_disorder)
sleep_train <- training(sleep_split)
sleep_test <- testing(sleep_split)

# Summarize the training set: mean of each predictor within each sleep_disorder group.
sleep_train_means <- sleep_train |>
    group_by(sleep_disorder) |>
    summarize(mean_age = mean(age, na.rm = TRUE),
              mean_sleep_duration = mean(sleep_duration, na.rm = TRUE),
              mean_PA = mean(physical_activity, na.rm = TRUE),
              mean_HR = mean(heart_rate, na.rm = TRUE))
sleep_train_means

sleep_disorder,mean_age,mean_sleep_duration,mean_PA,mean_HR
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
,39.53049,7.370122,57.89024,68.96341
Sleep Apnea,49.37931,6.951724,74.63793,73.22414
Insomnia,43.10526,6.603509,48.24561,70.91228
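
The split above is random, so the group means will differ slightly each time the cell is run. As a rough sketch, setting a seed before `initial_split` makes the split repeatable, and `across()` computes the same per-group means more compactly. The seed value, the toy tibble, and the `mean_{.col}` naming pattern are assumptions chosen for illustration.

```r
library(dplyr)
library(tibble)
library(rsample)

set.seed(2024)  # arbitrary seed, chosen only so the split is repeatable

# Toy data with three equally sized groups (hypothetical values).
toy <- tibble(
  age = rep(c(30, 45, 50), each = 20),
  heart_rate = rep(c(68, 71, 73), each = 20),
  sleep_disorder = factor(rep(c("None", "Insomnia", "Sleep Apnea"), each = 20))
)

toy_split <- initial_split(toy, prop = 0.75, strata = sleep_disorder)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# One across() call replaces a separate mean() line per predictor.
toy_train_means <- toy_train |>
  group_by(sleep_disorder) |>
  summarize(across(c(age, heart_rate), \(x) mean(x, na.rm = TRUE),
                   .names = "mean_{.col}"))

toy_train_means
```

Summarizing only the training set, as the notebook does, keeps the testing set untouched until the classifier is evaluated.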

