## STAT 306 Finding Relationships in Data

### Group Project Report

Group A2 - Tran Anh Thu Phung, Ena Gupta, Yihang He, Haiyue Lu

### Import

In [None]:
library(tidyverse)
library(repr)
library(broom)
library(leaps)
library(moderndive)
library(gridExtra)
library(grid)

### 1. Data:
**1.1 Introduction:**

Sleep is a fundamental aspect of human life and plays a critical role in overall health and well-being. In today’s fast-paced world, many individuals struggle with sleep-related issues such as insomnia, sleep disturbances, and reduced sleep efficiency. In this project, we use a data set from Kaggle to study how lifestyle factors relate to sleep efficiency, with the aim of helping individuals make informed choices to improve their sleep quality and overall health.

**1.2 Loading the Sleep Efficiency data**

We read the dataset directly from the web into R.

In [None]:
url <- "https://raw.githubusercontent.com/Wendy1907/Sleep_Efficiency_Project/main/Sleep_Efficiency.csv"
df <- read_csv(url)
head(df)

**1.3 Data Summary**

The dataset contains **452 observations** in total with **15 columns** including ID.

| Variable Name      | Type of Variable | Explanation |
| ----------- | :-----------: | ----------- |
| `ID`     | int       | A unique identifier for each test subject. |
| `Age`   | int        | The age of the test subjects with a minimum value of 9 years old, a maximum value of 69 years old, and a mean of about 40 years old.|
| `Gender`   | chr       | The gender of the test subject, e.g., Male or Female. |
| `Bedtime`   | dttm        | When each subject goes to bed, as a date and time.|
| `Wakeup time`   | dttm        | When each subject wakes up, as a date and time.|
| `Sleep duration`   | dbl        | Records the total amount of time each subject slept in hours, with a minimum value of 5 hours, a maximum duration of 10 hours, and an average of 7.47 hours.|
| `Sleep efficiency`   | dbl       | A measure of the proportion of time spent in bed that is spent asleep, with a minimum of 50%, a maximum of 99%, and an average of 78.89% (stored as a proportion, e.g., 0.88 for 88%).|
| `REM sleep percentage`   | int        | Amount of time (in percentage) each subject spent in the stage of REM sleep. Minimum of 15%, maximum of 30% and mean of 22.61%.|
| `Deep sleep percentage`   | int        | Amount of time (in percentage) each subject spent in the stage of deep sleep. Minimum of 18%, maximum of 75% and mean of 52.82%.|
| `Light sleep percentage`   | int        | Amount of time (in percentage) each subject spent in the stage of light sleep. Minimum of 7%, maximum of 63% and mean of 24.56%.|
| `Awakenings`   | dbl        | The number of times each subject wakes up during the night, with a minimum of 0, a maximum of 4, and a mean of 1.64. There are 20 missing values.|
| `Caffeine consumption`   | dbl        | The amount of caffeine consumed in the 24 hours before bedtime (in mg), with 25 missing values. |
| `Alcohol consumption`   | dbl        | The amount of alcohol consumed in the 24 hours before bedtime (in oz), with 14 missing values. |
| `Smoking status`   | chr        | The smoking status of the test subject, i.e., whether they smoke or not. |
| `Exercise frequency`   | dbl        | The number of times the test subject exercises each week, with a minimum of 0, a maximum of 5, and a mean of 1.79. There are 6 missing values. |
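
The minimum, maximum, mean, and missing-value counts quoted in the table can be reproduced with a small helper. The sketch below is illustrative only: `toy` is a hypothetical stand-in built from three columns of the first six rows printed later in this report, and `summarise_column` is a helper name we introduce here, not a function from any package.

```r
# Toy stand-in for the full data: three columns from the first six rows.
toy <- data.frame(
  Age = c(65, 69, 40, 40, 57, 36),
  Awakenings = c(0, 3, 1, 3, 3, 0),
  Caffeine.consumption = c(0, 0, 0, 50, 0, NA)
)

# Hypothetical helper: min, max and mean (ignoring NAs) plus the NA count.
summarise_column <- function(x) {
  c(min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    n_missing = sum(is.na(x)))
}

# One row per variable, one column per statistic.
toy_summary <- t(sapply(toy, summarise_column))
print(toy_summary)
```

Applied to the numeric columns of the full data frame, the same helper should give the figures reported in the table, assuming the data are unchanged at the source.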

### 2. Research Question:

Can sleep efficiency be improved through lifestyle interventions (caffeine consumption, alcohol consumption, smoking status, exercise frequency), and what are the most effective strategies for enhancing sleep quality?

### 3. Cleaning and Wrangling the dataset:

#### 3.1 Cleaning the dataset:

**3.1.1 Rename columns**

Because several column names contain spaces, we rename them, using dots in place of the spaces.

In [3]:
df <- df %>% 
        rename("Wakeup.time" = "Wakeup time",
               "Sleep.duration" = "Sleep duration",
               "Sleep.efficiency" = "Sleep efficiency",
               "REM.sleep.percentage" = "REM sleep percentage",
               "Deep.sleep.percentage" = "Deep sleep percentage",
               "Light.sleep.percentage" = "Light sleep percentage",
               "Caffeine.consumption" = "Caffeine consumption",
               "Alcohol.consumption" = "Alcohol consumption",
               "Smoking.status" = "Smoking status",
               "Exercise.frequency" = "Exercise frequency"
              )
head(df)

| ID | Age | Gender | Bedtime | Wakeup.time | Sleep.duration | Sleep.efficiency | REM.sleep.percentage | Deep.sleep.percentage | Light.sleep.percentage | Awakenings | Caffeine.consumption | Alcohol.consumption | Smoking.status | Exercise.frequency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<chr>` | `<dttm>` | `<dttm>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` |
| 1 | 65 | Female | 2021-03-06 01:00:00 | 2021-03-06 07:00:00 | 6.0 | 0.88 | 18 | 70 | 12 | 0 | 0.0 | 0 | Yes | 3 |
| 2 | 69 | Male | 2021-12-05 02:00:00 | 2021-12-05 09:00:00 | 7.0 | 0.66 | 19 | 28 | 53 | 3 | 0.0 | 3 | Yes | 3 |
| 3 | 40 | Female | 2021-05-25 21:30:00 | 2021-05-25 05:30:00 | 8.0 | 0.89 | 20 | 70 | 10 | 1 | 0.0 | 0 | No | 3 |
| 4 | 40 | Female | 2021-11-03 02:30:00 | 2021-11-03 08:30:00 | 6.0 | 0.51 | 23 | 25 | 52 | 3 | 50.0 | 5 | Yes | 1 |
| 5 | 57 | Male | 2021-03-13 01:00:00 | 2021-03-13 09:00:00 | 8.0 | 0.76 | 27 | 55 | 18 | 3 | 0.0 | 3 | No | 3 |
| 6 | 36 | Female | 2021-07-01 21:00:00 | 2021-07-01 04:30:00 | 7.5 | 0.9 | 23 | 60 | 17 | 0 | NA | 0 | No | 1 |
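
The renaming above lists every column by hand. As a hedged alternative, the same dotted names can be produced programmatically by replacing spaces with dots. The sketch below uses only base R on a hypothetical subset of the raw column names; `raw_names` and `dotted_names` are names introduced here for illustration.

```r
# A subset of the raw column names, for illustration only.
raw_names <- c("ID", "Wakeup time", "Sleep duration", "REM sleep percentage")

# Replace every space with a dot; names without spaces are left unchanged.
dotted_names <- gsub(" ", ".", raw_names, fixed = TRUE)
print(dotted_names)
```

On the full data frame this would amount to `names(df) <- gsub(" ", ".", names(df), fixed = TRUE)`. We have not rerun the notebook with this variant, so treat it as a sketch rather than a drop-in replacement.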


**3.1.2 Finding missing values**

First, we check whether the data contain any missing (`NA`) values.

In [4]:
sum(is.na(df))

The `is.na` function flags missing values, and summing its result shows that the dataset contains 65 missing values in total. Next, we look at which columns contain them.

In [5]:
colSums(is.na(df))

Based on the column counts above, there are 20 missing values in `Awakenings`, 25 in `Caffeine.consumption`, 14 in `Alcohol.consumption`, and 6 in `Exercise.frequency`. We remove every row that has a missing value in any of these columns.

In [6]:
df_clean <- df %>% drop_na(Awakenings, Caffeine.consumption, Alcohol.consumption, Exercise.frequency)
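
The number of missing cells and the number of dropped rows need not agree, because one row can have more than one missing value. That is consistent with the figures in this report (65 missing values, while 452 rows shrink to 388, a drop of 64 rows). The sketch below shows the effect on a hypothetical toy data frame; the values in `toy` are made up, and `complete.cases` is used as a base R analogue of dropping rows with `drop_na`.

```r
# Made-up data: row 2 has two missing values, rows 4 and 5 have one each.
toy <- data.frame(
  Awakenings = c(0, NA, 1, 3, NA),
  Caffeine.consumption = c(0, NA, 50, NA, 25),
  Exercise.frequency = c(3, 1, 3, 1, 0)
)

n_missing_cells <- sum(is.na(toy))       # total number of NA cells
per_column <- colSums(is.na(toy))        # NA count per column
toy_clean <- toy[complete.cases(toy), ]  # keep rows with no NA at all
n_dropped_rows <- nrow(toy) - nrow(toy_clean)

print(per_column)
print(c(missing_cells = n_missing_cells, dropped_rows = n_dropped_rows))
```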

**3.1.3 Finding duplicate values**

Next, we check whether the filtered data contain any duplicate rows.

In [7]:
sum(duplicated(df_clean))

In this dataset, there are no duplicate rows.

We now have a clean dataset with a total of **388 observations** and **15 columns**.

In [8]:
head(df_clean)

| ID | Age | Gender | Bedtime | Wakeup.time | Sleep.duration | Sleep.efficiency | REM.sleep.percentage | Deep.sleep.percentage | Light.sleep.percentage | Awakenings | Caffeine.consumption | Alcohol.consumption | Smoking.status | Exercise.frequency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<chr>` | `<dttm>` | `<dttm>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` |
| 1 | 65 | Female | 2021-03-06 01:00:00 | 2021-03-06 07:00:00 | 6 | 0.88 | 18 | 70 | 12 | 0 | 0 | 0 | Yes | 3 |
| 2 | 69 | Male | 2021-12-05 02:00:00 | 2021-12-05 09:00:00 | 7 | 0.66 | 19 | 28 | 53 | 3 | 0 | 3 | Yes | 3 |
| 3 | 40 | Female | 2021-05-25 21:30:00 | 2021-05-25 05:30:00 | 8 | 0.89 | 20 | 70 | 10 | 1 | 0 | 0 | No | 3 |
| 4 | 40 | Female | 2021-11-03 02:30:00 | 2021-11-03 08:30:00 | 6 | 0.51 | 23 | 25 | 52 | 3 | 50 | 5 | Yes | 1 |
| 5 | 57 | Male | 2021-03-13 01:00:00 | 2021-03-13 09:00:00 | 8 | 0.76 | 27 | 55 | 18 | 3 | 0 | 3 | No | 3 |
| 7 | 27 | Female | 2021-07-21 21:00:00 | 2021-07-21 03:00:00 | 6 | 0.54 | 28 | 25 | 47 | 2 | 50 | 0 | Yes | 1 |


#### 3.2 Wrangling the Sleep Efficiency data

Since our research question focuses on `Caffeine.consumption`, `Alcohol.consumption`, `Smoking.status`, and `Exercise.frequency`, with `Sleep.efficiency` as the response, these are the columns of primary interest for finding the most effective strategies for enhancing sleep quality. First, however, we fit a linear regression model that also includes the other columns, so that the lifestyle effects are adjusted for potential confounding variables.

In [9]:
data <- df_clean %>%
    select(Caffeine.consumption, Alcohol.consumption, Smoking.status,
           Exercise.frequency, Sleep.efficiency)
head(data)

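| Caffeine.consumption | Alcohol.consumption | Smoking.status | Exercise.frequency | Sleep.efficiency |
| --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` |
| 0 | 0 | Yes | 3 | 0.88 |
| 0 | 3 | Yes | 3 | 0.66 |
| 0 | 0 | No | 3 | 0.89 |
| 50 | 5 | Yes | 1 | 0.51 |
| 0 | 3 | No | 3 | 0.76 |
| 50 | 0 | Yes | 1 | 0.54 |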


We discard `ID` because it carries no information on its own. We also discard the `Bedtime` and `Wakeup.time` columns because their coefficients are not significant for predicting sleep efficiency. We then produce a heatmap showing the correlations between the numeric columns.

In [None]:
library(reshape2)  # melt() comes from reshape2

df_clean$Gender <- as.factor(df_clean$Gender)
df_clean$Smoking.status <- as.factor(df_clean$Smoking.status)

# delete the ID column because it doesn't carry any information
transformed_df <- df_clean %>% select(-ID)
# discard Bedtime and Wakeup.time
transformed_df <- select(transformed_df, -Bedtime, -Wakeup.time)

# extract numeric columns only
numeric_df <- transformed_df[, sapply(transformed_df, is.numeric)]
# calculate their pairwise correlations
cor_matrix <- cor(numeric_df, use = "complete.obs")
# reshape to long format (one row per pair of variables) and plot the heatmap
melted_cor_matrix <- melt(cor_matrix)
ggplot(melted_cor_matrix, aes(Var1, Var2, fill = value)) +
    geom_tile() +
    scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                        midpoint = 0, limits = c(-1, 1), space = "Lab",
                        name = "Pearson\nCorrelation") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          axis.title = element_blank()) +
    coord_fixed()
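
The `melt` step reshapes the square correlation matrix into a long table with one row per pair of variables, which is the layout `geom_tile` expects. As a minimal sketch of that reshaping, the example below does the same thing in base R with `as.table` and `as.data.frame`. It uses the built-in `mtcars` data purely as a stand-in, not the sleep data, and `long_cor` is a name introduced here.

```r
# Stand-in numeric data: mtcars ships with R and is NOT the sleep data.
cor_matrix <- cor(mtcars[, c("mpg", "hp", "wt")])

# as.table() followed by as.data.frame() gives one row per (row, column) pair,
# with columns Var1, Var2 and Freq; Freq holds the correlation here.
long_cor <- as.data.frame(as.table(cor_matrix))
names(long_cor)[3] <- "value"
print(long_cor)
```

A 3 x 3 matrix becomes 9 rows, and the rows where `Var1` equals `Var2` carry the diagonal correlations of 1.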


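From the heatmap, we can see that `Deep.sleep.percentage` is negatively correlated with `Light.sleep.percentage`. In the rows printed above, the REM, deep, and light sleep percentages sum to 100, so including all three alongside an intercept would make the design matrix rank-deficient. We therefore discard one of these columns.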

In [None]:
transformed_df <- select(transformed_df, -Light.sleep.percentage)
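
To illustrate the rank-deficiency problem, the sketch below simulates three percentages that sum to 100 by construction and fits a model with all three. The data are simulated and hypothetical (`rem`, `deep`, `light`, and `y` are not the sleep data), so this shows the mechanism only.

```r
set.seed(306)
n <- 40

# Hypothetical stage percentages that sum to 100 by construction.
rem <- runif(n, 15, 30)
deep <- runif(n, 20, 60)
light <- 100 - rem - deep
y <- 0.5 + 0.004 * deep + rnorm(n, sd = 0.05)

# `light` is an exact linear combination of the intercept, rem and deep,
# so lm() cannot estimate a separate coefficient for it.
fit_all <- lm(y ~ rem + deep + light)
fit_two <- lm(y ~ rem + deep)

print(coef(fit_all))  # the coefficient reported for `light` is NA
```

Dropping the redundant column leaves the fitted values unchanged, which is why removing `Light.sleep.percentage` costs no predictive information in this setting.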

Next, we plot the distributions of the feature columns. If the residuals later turn out not to follow a normal distribution, we can apply transformations to these columns to fine-tune the model.

In [None]:
features <- c('Sleep.efficiency', 'Age', 'REM.sleep.percentage',
              'Deep.sleep.percentage',
              'Awakenings', 'Caffeine.consumption', 'Alcohol.consumption',
              'Exercise.frequency')

plots <- list()

for (feature in features) {
  # density plot of the feature
  p1 <- ggplot(transformed_df, aes(x = .data[[feature]])) +
        geom_density(fill = "blue", alpha = 0.5) +
        labs(title = paste('Distribution of', feature), x = feature)

  # horizontal box plot of the same feature
  # (the `+` before coord_flip() is required, otherwise the flip is not applied)
  p2 <- ggplot(transformed_df, aes(y = .data[[feature]])) +
        geom_boxplot() +
        labs(title = paste('Box Plot of', feature), y = feature) +
        coord_flip()

  plots[[length(plots) + 1]] <- p1
  plots[[length(plots) + 1]] <- p2
}

# index into the plots list to check the distribution of each variable
do.call(grid.arrange, c(plots[1:2]))

We can see Sleep.efficiency is left-skewed. Caffeine consumption, alcohol consumption and exercise frequency are right-skewed. We also check the distribution of smoking status below.
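
The skewness read off the plots can also be quantified. A minimal sketch, assuming the simple moment-based definition of sample skewness (third central moment divided by the cubed standard deviation, both with divisor n), is given below. `sample_skewness` is a helper introduced here and the two vectors are made-up examples, not columns from the data.

```r
# Moment-based sample skewness: positive for a long right tail,
# negative for a long left tail, zero for a symmetric sample.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

right_skewed <- c(0, 0, 0, 0, 25, 50, 50, 200)  # made-up values, long right tail
left_skewed <- -right_skewed                    # mirror image, long left tail

skew_right <- sample_skewness(right_skewed)
skew_left <- sample_skewness(left_skewed)
print(c(right = skew_right, left = skew_left))
```

Calling `sample_skewness(transformed_df$Sleep.efficiency)` would be expected to return a negative value if the variable is left-skewed as described above.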

In [None]:
table(transformed_df$Smoking.status)

We then fit a multiple linear regression model using the remaining features.

In [None]:
FullModel = lm(Sleep.efficiency~.,data = transformed_df)
summary(FullModel)

The full model (`FullModel`) has a multiple R-squared of 0.8051 and an adjusted R-squared of 0.7999. The coefficients of `Gender` and `Sleep.duration` are not significant, so it may be better to remove these two features from the model. We confirm this choice of features by conducting an exhaustive search using `regsubsets`.
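
For reference, the adjusted R-squared penalises the multiple R-squared for the number of predictors. The sketch below computes both by hand and compares them with what `summary` reports. It uses the built-in `mtcars` data as a stand-in, not the sleep data, and `k` denotes the number of predictor coefficients excluding the intercept.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model, not the sleep model

n <- nrow(mtcars)
k <- length(coef(fit)) - 1               # predictors, excluding the intercept
rss <- sum(resid(fit)^2)                 # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```

As a rough check, plugging the reported R-squared of 0.8051 into the same formula with n = 388 and an assumed 10 predictor coefficients gives about 0.7999, in line with the adjusted value quoted above.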

In [None]:
reg = regsubsets(Sleep.efficiency~., data = transformed_df, nvmax = 10)
s_reg = summary(reg)
cp_values = s_reg$cp
num_predictors = seq(along = cp_values)
plot(num_predictors, cp_values, type = "b", xlab = "Number of Predictors",
     ylab = "Cp", main = "Cp vs p")
abline(a=0,b=1)

In [None]:
print(s_reg$which)

From the Cp vs. p plot, p = 8 is a good choice because that is where Cp falls slightly below the Cp = p line.
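
To make the Cp criterion concrete, the sketch below computes Mallows' Cp by hand for a candidate model, using the residual variance of the full model. It assumes the common textbook convention in which p counts all estimated coefficients including the intercept; we have not checked that `regsubsets` reports Cp under exactly the same convention. `mallows_cp` is a helper introduced here, and `mtcars` is a stand-in for the sleep data.

```r
# Mallows' Cp = RSS_p / sigma2_full - n + 2 * p,
# where p counts every coefficient of the candidate model (intercept included).
mallows_cp <- function(candidate, full) {
  n <- length(resid(full))
  sigma2_full <- sum(resid(full)^2) / df.residual(full)
  p <- length(coef(candidate))
  sum(resid(candidate)^2) / sigma2_full - n + 2 * p
}

full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)  # stand-in full model
small <- lm(mpg ~ wt + hp, data = mtcars)               # stand-in candidate

cp_small <- mallows_cp(small, full)
cp_full <- mallows_cp(full, full)  # equals the number of coefficients in `full`
print(c(cp_small = cp_small, cp_full = cp_full))
```

Under this convention the full model always has Cp equal to its own p, so a smaller model with Cp close to or below its p is the one to prefer.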


In [None]:
transformed_df = transformed_df %>% select(-Gender, -Sleep.duration)
bestModel = lm(Sleep.efficiency~., data = transformed_df)
summary(bestModel)
kappa(bestModel)
plot(bestModel)

From the summary, we see an increase in adjusted R-squared relative to the full model. We observe no obvious pattern in the Residuals vs. Fitted plot, and the Q-Q plot shows that the residuals roughly follow a normal distribution. A kappa of 306.3695 suggests that we do not have severe problems with collinearity.
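
The kappa value is a condition number: the ratio of the largest to the smallest singular value of the design matrix. A minimal sketch, using `mtcars` as a stand-in rather than the sleep data, is shown below. It also illustrates that the value depends on the scale of the predictors, so a large kappa can partly reflect differing units rather than collinearity alone.

```r
# Design matrix of a stand-in model (intercept, wt, hp).
X <- model.matrix(mpg ~ wt + hp, data = mtcars)

# Condition number from the singular values.
sv <- svd(X)$d
cond_raw <- max(sv) / min(sv)

# Same model with standardised predictors: only the column scales change.
X_scaled <- cbind(1, scale(X[, -1]))
sv_scaled <- svd(X_scaled)$d
cond_scaled <- max(sv_scaled) / min(sv_scaled)

print(c(raw = cond_raw, scaled = cond_scaled))
```

Here `cond_raw` matches `kappa(X, exact = TRUE)`. Note that `kappa` applied to a fitted model returns an estimate by default, so its value may differ somewhat from the exact ratio.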

In [None]:
which(abs(rstudent(bestModel))>3) ## No outliers
which(cooks.distance(bestModel)>0.5) ## No influential points
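
To show what these two checks flag, the sketch below plants a single outlier in simulated data and applies the same thresholds. The data are simulated and hypothetical; the cutoffs of 3 for studentized residuals and 0.5 for Cook's distance are the ones used above.

```r
set.seed(306)
x <- 1:50
y <- 2 * x + rnorm(50)
y[25] <- y[25] + 15  # planted outlier (hypothetical)

fit <- lm(y ~ x)

# Observations with large externally studentized residuals.
flagged_outliers <- which(abs(rstudent(fit)) > 3)
# Observations with large influence on the fitted coefficients.
flagged_influential <- which(cooks.distance(fit) > 0.5)

print(flagged_outliers)
print(flagged_influential)
```

An observation can be an outlier without being influential: a point near the middle of the x range has low leverage, so its Cook's distance may stay below the cutoff even when its residual is large.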

There are no outliers among the residuals and no data points with strong influence. This suggests our model is a good fit for predicting `Sleep.efficiency`. Let's check the coefficients of the features related to lifestyle.