# SereniGuide: a sleep wellness tracking app

<div style="font-style: italic;">
    <br>
    <p style="margin-top: 0; margin-bottom: 0;"> EMPATHY S11 </p>
    <p style="margin-top: 0; margin-bottom: 0;"> Group 2: Bawa, Gomez, Lejano, Nadela, Roxas</p>
</div>

## I. Dataset Description

Sleep is crucial to a person's overall health and emotional well-being. It allows the body to recharge so that people can function properly and perform their daily tasks. Sleep efficiency is the percentage of time in bed that is actually spent asleep. It reflects how effective sleep is relative to the time spent attempting to sleep.
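
As a minimal sketch of this definition, sleep efficiency can be computed as the time asleep divided by the time in bed. The function name and its arguments are illustrative assumptions and are not part of the dataset or of the project's code.

```python
def sleep_efficiency(hours_asleep: float, hours_in_bed: float) -> float:
    """Return the fraction of time in bed that was spent asleep."""
    if hours_in_bed <= 0:
        raise ValueError("hours_in_bed must be positive")
    if not 0 <= hours_asleep <= hours_in_bed:
        raise ValueError("hours_asleep must be between 0 and hours_in_bed")
    return hours_asleep / hours_in_bed


# Hypothetical example: 7 hours asleep out of 8 hours in bed.
print(sleep_efficiency(7.0, 8.0))  # 0.875
```

The result is a fraction between 0 and 1, which matches the scale of the `Sleep efficiency` column in the dataset.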

The dataset used in this project is the "Sleep Efficiency Dataset" from Kaggle. It provides insights into the sleep patterns of a group of test subjects, including details on their sleep duration, efficiency, stages, and nocturnal awakenings. It also includes information on factors such as caffeine and alcohol consumption, smoking habits, exercise frequency, age, and gender.

### Collection Methodology

The dataset originates from a research study conducted in Morocco by a team of artificial intelligence engineering students from ENSIAS. Over a period of several months, the team enlisted participants from the local community and gathered data through a blend of self-reported surveys, actigraphy, and polysomnography, a method for monitoring sleep patterns.

### Structure of the Dataset

To inspect the structure of the dataset and to manipulate and analyze the data, the Python libraries NumPy and pandas are imported. The dataset downloaded from [Kaggle](https://www.kaggle.com/datasets/equilibriumm/sleep-efficiency) is then loaded with pandas.

In [1]:
import numpy as np
import pandas as pd

sleep_df = pd.read_csv("Sleep_Efficiency.csv")

In [2]:
sleep_df.head()

,ID,Age,Gender,Bedtime,Wakeup time,Sleep duration,Sleep efficiency,REM sleep percentage,Deep sleep percentage,Light sleep percentage,Awakenings,Caffeine consumption,Alcohol consumption,Smoking status,Exercise frequency
0,1,65,Female,2021-03-06 01:00:00,2021-03-06 07:00:00,6.0,0.88,18,70,12,0.0,0.0,0.0,Yes,3.0
1,2,69,Male,2021-12-05 02:00:00,2021-12-05 09:00:00,7.0,0.66,19,28,53,3.0,0.0,3.0,Yes,3.0
2,3,40,Female,2021-05-25 21:30:00,2021-05-25 05:30:00,8.0,0.89,20,70,10,1.0,0.0,0.0,No,3.0
3,4,40,Female,2021-11-03 02:30:00,2021-11-03 08:30:00,6.0,0.51,23,25,52,3.0,50.0,5.0,Yes,1.0
4,5,57,Male,2021-03-13 01:00:00,2021-03-13 09:00:00,8.0,0.76,27,55,18,3.0,0.0,3.0,No,3.0


By using the `head` function, the researchers get a glimpse of the first few records in the dataset. Each row represents one test subject, while each column describes one attribute of that subject.

To understand the characteristics of the dataset (its distribution and central tendency) and to identify potential outliers, the researchers call the `describe` function, which displays summary statistics for the numeric columns.

In [3]:
sleep_df.describe()

,ID,Age,Sleep duration,Sleep efficiency,REM sleep percentage,Deep sleep percentage,Light sleep percentage,Awakenings,Caffeine consumption,Alcohol consumption,Exercise frequency
count,452.0,452.0,452.0,452.0,452.0,452.0,452.0,432.0,427.0,438.0,446.0
mean,226.5,40.285398,7.465708,0.788916,22.615044,52.823009,24.561947,1.641204,23.653396,1.173516,1.79148
std,130.625419,13.17225,0.866625,0.135237,3.525963,15.654235,15.313665,1.356762,30.202785,1.621377,1.428134
min,1.0,9.0,5.0,0.5,15.0,18.0,7.0,0.0,0.0,0.0,0.0
25%,113.75,29.0,7.0,0.6975,20.0,48.25,15.0,1.0,0.0,0.0,0.0
50%,226.5,40.0,7.5,0.82,22.0,58.0,18.0,1.0,25.0,0.0,2.0
75%,339.25,52.0,8.0,0.9,25.0,63.0,32.5,3.0,50.0,2.0,3.0
max,452.0,69.0,10.0,0.99,30.0,75.0,63.0,4.0,200.0,5.0,5.0
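
One common way to read potential outliers off these summary statistics is the 1.5 × IQR rule. The sketch below applies it to three columns using the quartiles, minimum, and maximum from the output above. The choice of rule and the helper name `iqr_fences` are assumptions made for illustration, not a method stated by the dataset authors.

```python
# Quartiles, minimum, and maximum copied from the describe() output above.
summary = {
    "Age": {"min": 9.0, "q1": 29.0, "q3": 52.0, "max": 69.0},
    "Sleep duration": {"min": 5.0, "q1": 7.0, "q3": 8.0, "max": 10.0},
    "Caffeine consumption": {"min": 0.0, "q1": 0.0, "q3": 50.0, "max": 200.0},
}


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


flags = {}
for column, stats in summary.items():
    lower, upper = iqr_fences(stats["q1"], stats["q3"])
    # A column has candidate outliers if its extremes fall outside the fences.
    flags[column] = {"below": stats["min"] < lower, "above": stats["max"] > upper}
    print(column, (lower, upper), flags[column])
```

Under this rule, `Age` stays within its fences, while `Sleep duration` (both extremes) and `Caffeine consumption` (the maximum of 200) contain values worth a closer look. A flagged value is only a candidate for inspection and is not necessarily an error.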


The `info` function is then used to obtain the number of observations in the dataset and the data type of each column.

In [4]:
sleep_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ID                      452 non-null    int64  
 1   Age                     452 non-null    int64  
 2   Gender                  452 non-null    object 
 3   Bedtime                 452 non-null    object 
 4   Wakeup time             452 non-null    object 
 5   Sleep duration          452 non-null    float64
 6   Sleep efficiency        452 non-null    float64
 7   REM sleep percentage    452 non-null    int64  
 8   Deep sleep percentage   452 non-null    int64  
 9   Light sleep percentage  452 non-null    int64  
 10  Awakenings              432 non-null    float64
 11  Caffeine consumption    427 non-null    float64
 12  Alcohol consumption     438 non-null    float64
 13  Smoking status          452 non-null    object 
 14  Exercise frequency      446 non-null    float64

From the results of the `info` function above, the researchers conclude that the dataset has 452 entries and 15 columns (variables). A brief description of each variable is given below.

### Variables Description

<br>

**ID:** a unique identifier for each test subject

**Age:** age of the test subject

**Gender:** male or female

**Bedtime:** the time the test subject goes to bed each night

**Wakeup time:** the time the test subject wakes up each morning 

**Sleep duration:** the total amount of time the test subject slept (in hours) 

**Sleep efficiency:** a measure of the proportion of time in bed spent asleep 

**REM sleep percentage:** the percentage of total sleep time spent in REM sleep 

**Deep sleep percentage:** the percentage of total sleep time spent in deep sleep 

**Light sleep percentage:** the percentage of total sleep time spent in light sleep 

**Awakenings:** the number of times the test subject wakes up during the night 

**Caffeine consumption:** the amount of caffeine consumed in the 24 hours prior to bedtime (in mg) 

**Alcohol consumption:** the amount of alcohol consumed in the 24 hours prior to bedtime (in oz) 

**Smoking status:** whether or not the test subject smokes 

**Exercise frequency:** the number of times the test subject exercises each week
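
Because the three sleep stage variables each describe a share of total sleep time, one would expect them to add up to 100 for every test subject. The following is a minimal sketch of that sanity check on the five rows shown by `sleep_df.head()` earlier. The `sample` frame is a stand-in built for illustration, and the check says nothing about rows that are not included in it.

```python
import pandas as pd

# The five rows shown by sleep_df.head() earlier in this notebook.
sample = pd.DataFrame({
    "REM sleep percentage": [18, 19, 20, 23, 27],
    "Deep sleep percentage": [70, 28, 70, 25, 55],
    "Light sleep percentage": [12, 53, 10, 52, 18],
})

# Sum the three stages per row and keep the rows that do not total 100.
stage_total = sample.sum(axis=1)
inconsistent = sample[stage_total != 100]

print(stage_total.tolist())  # [100, 100, 100, 100, 100]
print(len(inconsistent))     # 0
```

Running the same two lines on the full `sleep_df` would list any subject whose stage percentages are inconsistent.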

## II. Data Cleaning

### Renaming Column Names

For more convenient access to the variables of the dataset, the researchers decided to rename some columns to shorter names.

In [5]:
sleep_df.rename(columns={ "REM sleep percentage" : "RSP",
                          "Deep sleep percentage" : "DSP",
                          "Light sleep percentage" : "LSP",
                          "Caffeine consumption" : "Caffeine",
                          "Alcohol consumption" : "Alcohol",
                          "Smoking status" : "Smokes",
                          "Exercise frequency" : "Exercises"}, inplace=True)
sleep_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ID                452 non-null    int64  
 1   Age               452 non-null    int64  
 2   Gender            452 non-null    object 
 3   Bedtime           452 non-null    object 
 4   Wakeup time       452 non-null    object 
 5   Sleep duration    452 non-null    float64
 6   Sleep efficiency  452 non-null    float64
 7   RSP               452 non-null    int64  
 8   DSP               452 non-null    int64  
 9   LSP               452 non-null    int64  
 10  Awakenings        432 non-null    float64
 11  Caffeine          427 non-null    float64
 12  Alcohol           438 non-null    float64
 13  Smokes            452 non-null    object 
 14  Exercises         446 non-null    float64
dtypes: float64(6), int64(5), object(4)
memory usage: 53.1+ KB


### Addressing Null Values

The first thing the researchers noticed when they called the `info` function above is that some columns have fewer than the expected 452 non-null observations. To address the null values of each variable, the researchers perform the following steps.

To get the number of null values per column, a `null_counts` variable is assigned the sum of null values in each column.

In [6]:
null_counts = sleep_df.isna().sum()
print(null_counts)

ID                   0
Age                  0
Gender               0
Bedtime              0
Wakeup time          0
Sleep duration       0
Sleep efficiency     0
RSP                  0
DSP                  0
LSP                  0
Awakenings          20
Caffeine            25
Alcohol             14
Smokes               0
Exercises            6
dtype: int64
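
To put these counts in proportion, the sketch below converts them into percentages of the 452 rows. The numbers are copied from the output above, and the dictionary name is chosen only for this illustration.

```python
# Null counts copied from the output above; 452 is the number of rows
# reported by sleep_df.info().
total_rows = 452
counts_from_output = {"Awakenings": 20, "Caffeine": 25, "Alcohol": 14, "Exercises": 6}

# Share of missing values per column, as a percentage rounded to two decimals.
missing_pct = {
    column: round(100 * count / total_rows, 2)
    for column, count in counts_from_output.items()
}
print(missing_pct)
```

Each affected column is missing roughly 1% to 6% of its values. On the DataFrame itself, `sleep_df.isna().mean() * 100` computes the same percentages directly.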


After careful consideration of each column and what it represents, the researchers decided to replace the null values with `0.0` instead of dropping the rows that contain missing data. Each of the affected variables describes either the number of times the test subject does something or the amount of something consumed. Thus, a value of `0.0` can be read as the test subject not performing that action.

To replace the missing values with `0.0`, the following code is executed:

In [7]:
sleep_df = sleep_df.fillna(0.0)

null_counts = sleep_df.isnull().sum()
print(null_counts)

ID                  0
Age                 0
Gender              0
Bedtime             0
Wakeup time         0
Sleep duration      0
Sleep efficiency    0
RSP                 0
DSP                 0
LSP                 0
Awakenings          0
Caffeine            0
Alcohol             0
Smokes              0
Exercises           0
dtype: int64
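
Calling `fillna(0.0)` on the whole DataFrame fills every column, which is safe here only because the other columns have no missing values. The following is a minimal sketch of a more targeted variant that passes a dictionary so that only selected columns are filled. The `toy` frame and its values are hypothetical and stand in for `sleep_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for sleep_df.
toy = pd.DataFrame({
    "Gender": ["Female", None, "Male"],
    "Caffeine": [50.0, np.nan, 0.0],
    "Awakenings": [np.nan, 1.0, 3.0],
})

# Fill only the columns where zero is a meaningful value.
zero_fill = {"Caffeine": 0.0, "Awakenings": 0.0}
filled = toy.fillna(value=zero_fill)

print(filled["Caffeine"].tolist())    # [50.0, 0.0, 0.0]
print(filled["Gender"].isna().sum())  # 1

# Zero-filling changes summary statistics: the mean now counts the filled rows.
print(toy["Caffeine"].mean(), filled["Caffeine"].mean())
```

The missing `Gender` value is left untouched, so it would still be visible in a later null check. The last line also shows that the column mean drops after filling, which is worth keeping in mind when comparing statistics before and after cleaning.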


### Multiple Representation

The researchers also checked whether object-type variables such as `Gender` and `Smokes` (originally `Smoking status`) contain multiple representations of the same value. To do this, the distinct values of these variables and their counts are obtained with the code below:

In [8]:
gender_counts = sleep_df['Gender'].value_counts()
print(gender_counts)

print("\n")
ss_counts = sleep_df['Smokes'].value_counts()
print(ss_counts)

Male      228
Female    224
Name: Gender, dtype: int64


No     298
Yes    154
Name: Smokes, dtype: int64


The researchers concluded that there are no multiple representations of data.
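
Had the check revealed several spellings of the same category, one possible fix would be to normalize the labels before counting. The sketch below uses a hypothetical Series with inconsistent casing and whitespace. None of these values occur in the actual dataset.

```python
import pandas as pd

# Hypothetical labels with inconsistent casing and stray whitespace.
raw = pd.Series(["Male", "male ", "FEMALE", "Female", " Male"])

# Strip whitespace and unify casing so equivalent labels collapse together.
normalized = raw.str.strip().str.capitalize()

print(raw.nunique())         # 5
print(normalized.nunique())  # 2
print(normalized.value_counts())
```

After normalization, the five raw spellings collapse into the two expected categories.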

### Duplicate Data

The researchers determined that duplicate values are allowed in every column except `ID`, which identifies an individual test subject. Thus, the following code checks whether any ID values are duplicated by mistake.

In [9]:
id_duplicates = sleep_df['ID'].duplicated()

id_duplicates_count = id_duplicates.value_counts()
print(id_duplicates_count)

False    452
Name: ID, dtype: int64


The researchers confirm that there are no duplicate values in the variable `ID`.
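
For comparison, the sketch below shows what the same kind of check reports when duplicates do exist. It uses a hypothetical frame with a repeated ID and one fully repeated row. The `is_unique` attribute and the `keep=False` option are shown as alternatives to counting the boolean mask.

```python
import pandas as pd

# Hypothetical frame in which IDs 2 and 3 repeat and the last row is a full copy.
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3, 3],
    "Age": [30, 41, 52, 27, 27],
})

print(toy["ID"].is_unique)     # False
print(toy.duplicated().sum())  # 1 fully repeated row

# keep=False marks every row that shares its ID with another row.
shared_id_rows = toy[toy["ID"].duplicated(keep=False)]
print(shared_id_rows)
```

Listing all rows that share an ID makes it easier to decide whether they are true duplicates or separate records that were assigned the same identifier.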

### Incorrect Datatype

While inspecting the data types of each variable, the researchers noticed that REM sleep percentage, Deep sleep percentage, and Light sleep percentage are stored as `int64`. Although this would not cause errors if handled properly, the researchers decided to express them as decimal fractions and change the data type to `float64`.

In [10]:
sleep_df['RSP'] = sleep_df['RSP'].astype('float64')
sleep_df['DSP'] = sleep_df['DSP'].astype('float64')
sleep_df['LSP'] = sleep_df['LSP'].astype('float64')

sleep_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ID                452 non-null    int64  
 1   Age               452 non-null    int64  
 2   Gender            452 non-null    object 
 3   Bedtime           452 non-null    object 
 4   Wakeup time       452 non-null    object 
 5   Sleep duration    452 non-null    float64
 6   Sleep efficiency  452 non-null    float64
 7   RSP               452 non-null    float64
 8   DSP               452 non-null    float64
 9   LSP               452 non-null    float64
 10  Awakenings        452 non-null    float64
 11  Caffeine          452 non-null    float64
 12  Alcohol           452 non-null    float64
 13  Smokes            452 non-null    object 
 14  Exercises         452 non-null    float64
dtypes: float64(9), int64(2), object(4)
memory usage: 53.1+ KB


Afterwards, each value in all three columns is divided by `100.0` to convert the percentage into its equivalent decimal fraction.

In [11]:
sleep_df['RSP'] = sleep_df['RSP'] / 100.0
sleep_df['DSP'] = sleep_df['DSP'] / 100.0
sleep_df['LSP'] = sleep_df['LSP'] / 100.0

sleep_df[['RSP', 'DSP', 'LSP']].head()

,RSP,DSP,LSP
0,0.18,0.7,0.12
1,0.19,0.28,0.53
2,0.2,0.7,0.1
3,0.23,0.25,0.52
4,0.27,0.55,0.18
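
The cast and the division can also be written as a single step over all three columns, since dividing an `int64` column by `100.0` already produces `float64`. The sketch below demonstrates this on the first five rows of the dataset. The `stages` frame is built for illustration and is not the project's `sleep_df`.

```python
import numpy as np
import pandas as pd

# Integer percentages from the first five rows of the dataset.
stages = pd.DataFrame({
    "RSP": [18, 19, 20, 23, 27],
    "DSP": [70, 28, 70, 25, 55],
    "LSP": [12, 53, 10, 52, 18],
})

# True division converts the int64 columns to float64 in one step.
fractions = stages / 100.0
print(fractions.dtypes.tolist())

# The three fractions of each row should add up to 1 (within float tolerance).
rows_sum_to_one = bool(np.isclose(fractions.sum(axis=1), 1.0).all())
print(rows_sum_to_one)  # True
```

The comparison uses `np.isclose` instead of `==` because sums of decimal fractions are subject to floating-point rounding.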


### Inconsistent Format

In [12]:
# Not sure how to address Bedtime and Wakeup time because the values include a date and not just the time.
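
One possible way to handle these two columns, offered only as a sketch and not as the researchers' final decision, is to parse the strings with `pd.to_datetime` and keep just the clock time as a decimal hour. The example uses three of the rows shown by `sleep_df.head()`. The new column names are assumptions made for illustration.

```python
import pandas as pd

# Bedtime and wake-up strings from three of the rows shown by sleep_df.head().
times = pd.DataFrame({
    "Bedtime": ["2021-03-06 01:00:00", "2021-12-05 02:00:00", "2021-05-25 21:30:00"],
    "Wakeup time": ["2021-03-06 07:00:00", "2021-12-05 09:00:00", "2021-05-25 05:30:00"],
})

for column in ["Bedtime", "Wakeup time"]:
    # Parse the strings, then keep only the clock time as a decimal hour.
    parsed = pd.to_datetime(times[column], format="%Y-%m-%d %H:%M:%S")
    times[column + " (hour)"] = parsed.dt.hour + parsed.dt.minute / 60.0

# Hours between going to bed and waking up, wrapping past midnight.
times["Elapsed hours"] = (times["Wakeup time (hour)"] - times["Bedtime (hour)"]) % 24

print(times[["Bedtime (hour)", "Wakeup time (hour)", "Elapsed hours"]])
```

In the third row, the wake-up timestamp carries the same calendar date as the bedtime even though it falls on the following morning, so the date part cannot be relied on. Working with the clock time and wrapping at 24 hours avoids that problem, and for these three rows the elapsed hours (6.0, 7.0, 8.0) match the `Sleep duration` column.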