## Sleep Quality Prediction - Supervised Machine Learning

---

by: Cody Hill

date: 1/18/2023

### Data Source Information

### [ ADD LICENSE INFORMATION AND FAIR USE INFO]

In [2]:
import os
import numpy as np
import pandas as pd
import sklearn
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

### Exploratory Data Analysis (EDA)

In [3]:
# Import data
# TODO: Switch this to the github urls for the data so others can use.
data_folder = '/Users/chill/GitHub/Supervised-Sleep/Data'

# Iterate through each file in .data/ and add it to a dataframe.
file_path = [f'{data_folder}/{file}' for file in os.listdir(data_folder)]
biometric_df = pd.concat(map(pd.read_csv, file_path))


In [4]:
display(biometric_df)

Unnamed: 0,average_breath,average_heart_rate,average_hrv,awake_time,bedtime_end,bedtime_start,day,deep_sleep_duration,efficiency,latency,...,temperature_deviation,temperature_trend_deviation,contributors_activity_balance,contributors_hrv_balance,contributors_previous_day_activity,contributors_previous_night,contributors_recovery_index,contributors_resting_heart_rate,contributors_sleep_balance,contributors_body_temperature
0,13.625,63.25,77.0,4440.0,2023-02-04T07:08:22.000-06:00,2023-02-03T22:40:22.000-06:00,2023-02-04,4650.0,85.0,990.0,...,,,,,,,,,,
1,15.125,92.49,19.0,6210.0,2023-02-05T09:37:28.000-06:00,2023-02-04T23:54:28.000-06:00,2023-02-05,4590.0,82.0,180.0,...,,,,,,,,,,
2,15.125,87.67,,1920.0,2023-02-05T20:10:57.000-06:00,2023-02-05T19:36:57.000-06:00,2023-02-06,30.0,6.0,0.0,...,,,,,,,,,,
3,15.125,82.50,37.0,720.0,2023-02-05T20:36:02.000-06:00,2023-02-05T20:20:02.000-06:00,2023-02-06,0.0,25.0,0.0,...,,,,,,,,,,
4,13.625,68.76,59.0,3600.0,2023-02-06T08:29:25.000-06:00,2023-02-05T22:56:25.000-06:00,2023-02-06,8550.0,90.0,1290.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
331,,,,,,,2024-01-14,,,,...,-0.12,0.10,78.0,91.0,78.0,78.0,100.0,69.0,95.0,100.0
332,,,,,,,2024-01-15,,,,...,0.05,0.05,82.0,92.0,93.0,98.0,28.0,100.0,92.0,100.0
333,,,,,,,2024-01-16,,,,...,-0.33,-0.08,78.0,96.0,77.0,45.0,100.0,86.0,84.0,94.0
334,,,,,,,2024-01-17,,,,...,-0.46,-0.26,83.0,98.0,92.0,78.0,89.0,79.0,89.0,90.0


In [5]:
biometric_df.columns

Index(['average_breath', 'average_heart_rate', 'average_hrv', 'awake_time',
       'bedtime_end', 'bedtime_start', 'day', 'deep_sleep_duration',
       'efficiency', 'latency', 'light_sleep_duration', 'lowest_heart_rate',
       'movement_30_sec', 'period', 'rem_sleep_duration', 'restless_periods',
       'score', 'segment_state', 'sleep_midpoint', 'time_in_bed',
       'total_sleep_duration', 'type', 'sleep_phase_5_min', 'timezone',
       'bedtime_start_delta', 'bedtime_end_delta', 'midpoint_at_delta',
       'heart_rate_5_min', 'hrv_5_min', 'contributors_total_sleep',
       'contributors_deep_sleep', 'contributors_rem_sleep',
       'contributors_efficiency', 'contributors_latency',
       'contributors_restfulness', 'contributors_timing',
       'readiness_contributors_activity_balance',
       'readiness_contributors_body_temperature',
       'readiness_contributors_hrv_balance',
       'readiness_contributors_previous_day_activity',
       'readiness_contributors_previous_night'

As we can see, the Oura Ring tracks and records many biometrics, and with that raw biometric data they use different equations and feature engineering to assign a daily score to categories such as sleep, recovery, readiness, activity, etc. totalling 100 features. Since the purpose of this is to create our own sleep score predictive model, we can use Oura's sleep score as our ground truth label (y_train) in training and validation. Furthermore, we must remove the features with labels that include "contributors" as these columns contain normalized scores which Oura's models have output that are then used to average out into the final score.

In [9]:
# Remove contributor columns.
biometric_df = biometric_df.loc[:, ~biometric_df.columns.str.contains('contributors')]
biometric_df.shape

(127200, 71)

We've removed 29 feature columns containing the word contributors.

Now we will remove all columns that won't be used in feature engineering or the final model.
Then complete data munging by formatting all the features (float, int, dates, dummy/indicator encoding).

In [12]:
biometric_df.columns

Index(['average_breath', 'average_heart_rate', 'average_hrv', 'awake_time',
       'bedtime_end', 'bedtime_start', 'day', 'deep_sleep_duration',
       'efficiency', 'latency', 'light_sleep_duration', 'lowest_heart_rate',
       'movement_30_sec', 'period', 'rem_sleep_duration', 'restless_periods',
       'score', 'segment_state', 'sleep_midpoint', 'time_in_bed',
       'total_sleep_duration', 'type', 'sleep_phase_5_min', 'timezone',
       'bedtime_start_delta', 'bedtime_end_delta', 'midpoint_at_delta',
       'heart_rate_5_min', 'hrv_5_min', 'readiness_score',
       'readiness_temperature_deviation',
       'readiness_temperature_trend_deviation', 'spo2_percentage', 'timestamp',
       'latitude', 'longitude', 'altitude', 'course', 'course_accuracy',
       'horizontal_accuracy', 'vertical_accuracy', 'speed', 'speed_accuracy',
       'active_calories', 'average_met_minutes', 'equivalent_walking_distance',
       'high_activity_met_minutes', 'high_activity_time', 'inactivity_alerts',

In [None]:
# # TODO: oura_sleep_2024-01.csv
# - Nap on day encoding
#   - list(where [type] != long_sleep && between 10 AM - 7 PM)
#   - sum(types of sleep duration)
# - restless_periods vs sum(movement_30_sec)??
# - Only one day per entry
#   - Sum each day sleep durations, restless_periods, awake_time, time_in_bed, total_sleep_duration
#       - awake_time = time_in_bed - total_sleep_duration ??
#   - Save only the [type] == long_sleep, average_breath, average_heart_rate, average_hrv, latency, 
#       lowest_heart_rate, betime_start_delta
# - Remove: efficiency, period, score, segment_state, sleep_midpoint, sleep_phase_5_min, movement_30_sec, timezone, 
#       betime_end_delta, midpoint_at_delta, heart_rate_5_min, hrv_5_min

# # TODO: oura_daily-activity_2024-01.csv
# - 
# -
# -
# -
# -

# # TODO: General
# - Collinearity between features checks in model selection
# - 
# - 