# An exploration of Fitbit data - mood and stress parameters - Data cleaning

This notebook was prepared by Esther Guiu Hernandez in March 2024.

It uses the LifeSnaps Fitbit dataset.

## About the data files: 
- <b>Scored Surveys</b>: CSV files containing scored versions of the PANAS and STAI surveys
- <b>Personality trait</b>: CSV files containing personality trait data
- <b>  Fitbit & EMA Data (daily granularity) </b>: csv_rais_anonymized/daily_fitbit_sema_df_unprocessed.csv
- <b>  Fitbit & EMA Data (hourly granularity) </b>: csv_rais_anonymized/hourly_fitbit_sema_df_unprocessed.csv

 

In [13]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno
from tabulate import tabulate

In [14]:
# Load in the dataframe from fitbit data

#fitbit and sema
daily_fitbit_sema_df = pd.read_csv('data/original/daily_fitbit_sema_df_unprocessed.csv', delimiter=',')
hourly_fitbit_sema_df = pd.read_csv('data/original/hourly_fitbit_sema_df_unprocessed.csv', delimiter=',')

#survey data
stai_survey_df = pd.read_csv('data/original/stai.csv', delimiter=',')
personality_survey_df = pd.read_csv('data/original/personality.csv', delimiter=',')
panas_survey_df = pd.read_csv('data/original/panas.csv', delimiter=',')



# 1. Data exploration

## 1.1 Fitbit data variable descriptions

What is the relationship between the different stress-associated physiological measures? To explore this, we select the following variables:

- <b>Age</b>
- <b>Gender</b>
- <b>BMI</b>
- <b>nightly_temperature</b>: temperature measured at night during sleep
- <b>rmssd</b>: the Root Mean Square of Successive Differences (RMSSD) between heart beats. It measures short-term variability in the user’s heart rate while in deep sleep, in milliseconds (ms)
    - Heart Rate Variability (HRV): if your nervous system is balanced, your heart is constantly being told to beat slower by your parasympathetic system and to beat faster by your sympathetic system. This causes a fluctuation in your heart rate: HRV. A higher HRV is seen as healthier, while a drop in HRV could indicate that you are experiencing stress or showing potential signs of illness.
- <b>spo2</b>: the percentage of oxygen saturation in the blood. A low score may indicate important changes in your fitness and wellness. Fitbit measures SpO2 above 80%, and a healthy score would be above 95%.
- <b>Full_sleep_breathing_rate</b>: the respiratory rate (RR) of healthy adults in a relaxed state is about 12–20 breaths per minute.
- <b>Stress score</b>: calculated by adding three components:
    - <b>responsiveness_points_percentage:</b> responsiveness, out of a possible 30 points. Responsiveness monitors your sympathetic nervous system (your fight-or-flight response) through your heart rate and heart rate variability.
    - <b>exertion_points_percentage:</b> exertion balance, out of a possible 40 points. Exertion takes into account your recent physical activity, such as steps, and accounts for both overexertion and lack of exercise.
    - <b>sleep_points_percentage:</b> sleep patterns, out of a possible 30 points. This includes measurements of deep sleep from the previous night and whether your sleep was fitful or fragmented. It also tracks your “sleep reservoir”, based on the amount and quality of sleep you’ve managed over the previous week.

- <b>Sleep score</b>: made up of time asleep (50 percent of the score), the amount of time you spent in deep and REM sleep (25 percent of the score), and restoration, which shows how much of your sleep time is below your resting heart rate (also 25 percent).
- <b>daily_temperature_variation:</b> the top-end Fitbit smartwatches feature a sensor that tracks your skin temperature each night to show how it varies from your personal baseline (set over three nights when you first set up the watch), so you can be aware of your trends over time. Variations in temperature can be caused by the menstrual cycle in women, by illness, and by periods of stress.

- <b>FilteredVO2 Max (Cardio Fitness Score):</b> a measurement of your cardiovascular fitness, or how well your body uses oxygen when you are working out at your hardest. The higher your score, the fitter you are. The more sedentary your lifestyle, the lower your score will be, and the higher your risk of developing high blood pressure and coronary heart disease.

- <b> lightly_active_minutes</b>
- <b> moderately_active_minutes</b>
- <b> very_active_minutes</b>
- <b> sedentary_minutes</b>
- <b> mindfulness_session</b>
- <b>scl_avg (skin conductance level)</b>
- <b>resting_hr (Resting Heart Rate):</b> number of times your heart beats per minute (bpm) when you are still and rested. RHR usually ranges from 60 to 100 bpm, but varies according to your age and fitness level. Generally, the lower the better.
- <b> sleep_duration</b>
- <b> sleep_efficiency</b>
- <b>SEMA Values</b>: ALERT, HAPPY, NEUTRAL, RESTED/RELAXED, TENSE/ANXIOUS, TIRED
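
Two of the definitions above can be made concrete with a short sketch. The first function computes RMSSD from a sequence of beat-to-beat (RR) intervals. The second rebuilds a stress score from its three components. The function names and toy values are hypothetical, and the sketch assumes that the `*_points_percentage` columns hold fractions between 0 and 1, which should be checked against the data before use. The 30/40/30 maxima are the ones listed above.

```python
import numpy as np

def rmssd(rr_intervals_ms):
    """Root mean square of successive differences between RR intervals (ms)."""
    rr = np.asarray(rr_intervals_ms, dtype=float)
    successive_diffs = np.diff(rr)
    return float(np.sqrt(np.mean(successive_diffs ** 2)))

# Maximum points per component, as listed in the variable descriptions
MAX_POINTS = {"responsiveness": 30, "exertion": 40, "sleep": 30}

def stress_score_from_fractions(responsiveness, exertion, sleep):
    """Sum the three components; each argument is ASSUMED to be a fraction in [0, 1]."""
    return (responsiveness * MAX_POINTS["responsiveness"]
            + exertion * MAX_POINTS["exertion"]
            + sleep * MAX_POINTS["sleep"])

# Hypothetical RR intervals in milliseconds
example_rmssd = rmssd([800, 810, 790, 805])

# Hypothetical component fractions: 15 + 30 + 30 points
example_stress = stress_score_from_fractions(0.5, 0.75, 1.0)
```

If the percentage columns turn out to be on a 0 to 100 scale, each argument would need dividing by 100 first.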



# 2. Data cleaning

## 2.1 Data Cleaning Steps
<b> Daily_fitbit_sema_df </b>
1. Calculate the day of the week for each date
2. Order by date within each participant and create a new variable, Days_in_study (named Days_from_beggining in the code), giving the day number in the study, with day 1 being the participant's first recorded day
3. Delete all the features that we don't need
4. Save data in a new csv file

<b> Hourly_fitbit_sema_df </b>
1. Calculate the day of the week for each date
2. Classify each hour as night (0-5), morning (6-12), afternoon (13-18), or evening (19-23)
3. Calculate the number of days each participant has been in the study, so that participants can be aligned across this timeline
4. Delete all the features that we don't need
5. Save data in a new csv file
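
Steps 1 and 2 for the hourly data can be sketched as follows. This is a minimal illustration on a toy dataframe, not the notebook's own implementation. It assumes the hour of day is available as an integer column named `hour` (0 to 23), and the `time_of_day` column name is hypothetical.

```python
import pandas as pd

# Toy frame; an integer 'hour' column (0-23) is an assumption about the hourly file
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-05-24", "2021-05-24", "2021-05-25", "2021-05-25"]),
    "hour": [3, 9, 15, 22],
})

# Step 1: day of the week for each date
toy["weekday"] = toy["date"].dt.day_name()

# Step 2: bucket hours; bins are right-inclusive: (-1, 5], (5, 12], (12, 18], (18, 23]
toy["time_of_day"] = pd.cut(
    toy["hour"],
    bins=[-1, 5, 12, 18, 23],
    labels=["night", "morning", "afternoon", "evening"],
)
```

`pd.cut` returns a categorical column, which keeps the four periods in chronological order when plotting or grouping.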

<b>  Scored Surveys </b>
- PANAS
1. Add a column with each participant's first day in the study, taken as the minimum date in the hourly or daily Fitbit dataset
2. Calculate the number of days in the study and store it in a new column, so that participants can be compared on a common timeline

- STAI
1. Add a column with each participant's first day in the study, taken as the minimum date in the hourly or daily Fitbit dataset
2. Calculate the number of days in the study and store it in a new column, so that participants can be compared on a common timeline


<b> Personality trait  </b>
- Rename user_id to id and keep the categorical IPIP trait columns (ipip_*_category) under the plain trait names




## 2.2 Cleaning the databases

### Hourly fitbit sema df

In [15]:
columns_to_drop = ['minutes_below_default_zone_1', 'minutes_in_default_zone_2',
       'minutes_in_default_zone_3', 'step_goal', 'min_goal', 'max_goal', 'bpm', 'temperature', 'badgeType', 'scl_avg']

hourly_fitbit_sema_df.drop(columns=columns_to_drop, inplace = True)


# Create a dictionary mapping each participant id to a consecutive integer (in sorted order), so that plotting is easier later
unique_participants = hourly_fitbit_sema_df['id'].unique()
participant_mapping = {participant_id: i for i, participant_id in enumerate(sorted(unique_participants), 1)}

# Map Participant_ID to new integer numbers
hourly_fitbit_sema_df['Mapped_ID'] = hourly_fitbit_sema_df['id'].map(participant_mapping)


# Create a column with the number of days since the participant's start of the study
# Calculate the minimum date for each participant
hourly_fitbit_sema_df['date'] = pd.to_datetime(hourly_fitbit_sema_df['date'])
hourly_fitbit_sema_df['Min_date'] = hourly_fitbit_sema_df.groupby('id')['date'].transform('min')
# Subtract the participant's start date from the current date to get the study day each entry represents
hourly_fitbit_sema_df['Days_from_beggining'] = (hourly_fitbit_sema_df['date'] - hourly_fitbit_sema_df['Min_date']).dt.days + 1  # Adding 1 to start counting from day 1


# Save DataFrame to CSV file
hourly_fitbit_sema_df.to_csv('data/cleaned/hourly_fitbit_sema_cleaned.csv', index=False)  # Set index=False to exclude row numbers in the output CSV file



### PANAS

In [16]:
# PANAS

# Create a dataframe with one row per user holding their start date and demographics
users_and_start_date = hourly_fitbit_sema_df.drop_duplicates(subset=['id'], keep='first')[["id", "age", "gender","Min_date", "Mapped_ID"]]
#users_and_start_date.head(40)

# rename id column so that both dataframes can be merged
panas_survey_df.rename(columns={'user_id': 'id'}, inplace=True)

# Merge the two dataframes on id (left join: every survey row is kept)
panas_survey_df = pd.merge(panas_survey_df, users_and_start_date, on='id', how='left')

# Convert the 'submitdate' column to datetime
panas_survey_df['submitdate'] = pd.to_datetime(panas_survey_df['submitdate'])

# Subtract the participant's start date from the submission date to get the study day each entry represents
panas_survey_df['Days_from_beggining'] = (panas_survey_df['submitdate'] - panas_survey_df['Min_date']).dt.days + 1  # Adding 1 to start counting from day 1

# Week index from integer division of the 1-based day count (days 1-6 -> week 0, days 7-13 -> week 1)
panas_survey_df['Weeks_from_beggining'] = panas_survey_df['Days_from_beggining'] // 7  

# create a score that represents positive over negative
panas_survey_df['Positive_over_negative'] = (panas_survey_df['positive_affect_score'] / panas_survey_df['negative_affect_score']).round(2)

# Drop the 'Min_date' column, which is no longer needed
panas_survey_df.drop(columns=['Min_date'], inplace=True)

# Mapped_ID is already present from the merge above
#panas_survey_df.head(10)

# Save DataFrame to CSV file
panas_survey_df.to_csv('data/cleaned/panas_survey_cleaned.csv', index=False)  # Set index=False to exclude row numbers in the output CSV file
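
Because the merge above is a left join, a survey row whose `id` has no match in `users_and_start_date` keeps a missing `Min_date` and therefore a missing day count. The sketch below shows one way to detect such rows with the `indicator` option of `pd.merge`. The toy dataframes and their values are hypothetical stand-ins for the survey table and the start-date table.

```python
import pandas as pd

# Toy stand-ins for the survey table and the per-participant start dates
survey = pd.DataFrame({
    "id": ["a", "b", "c"],
    "submitdate": pd.to_datetime(["2021-06-01", "2021-06-02", "2021-06-03"]),
})
starts = pd.DataFrame({
    "id": ["a", "b"],
    "Min_date": pd.to_datetime(["2021-05-24", "2021-05-25"]),
})

# indicator=True adds a '_merge' column recording where each row was found
merged = pd.merge(survey, starts, on="id", how="left", indicator=True)

# Rows tagged 'left_only' have no start date in the Fitbit data
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()

# The day count is NaN wherever Min_date is missing
merged["Days_from_beggining"] = (merged["submitdate"] - merged["Min_date"]).dt.days + 1
```

Checking `unmatched_ids` before saving makes it explicit which participants cannot be placed on the study timeline.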


### STAI


In [17]:
# rename id column so that both dataframes can be merged
stai_survey_df.rename(columns={'user_id': 'id'}, inplace=True)

# Merge the two dataframes on id (left join: every survey row is kept)
stai_survey_df = pd.merge(stai_survey_df, users_and_start_date, on='id', how='left')

# Convert the 'submitdate' column to datetime
stai_survey_df['submitdate'] = pd.to_datetime(stai_survey_df['submitdate'])

# Subtract the participant's start date from the submission date to get the study day each entry represents
stai_survey_df['Days_from_beggining'] = (stai_survey_df['submitdate'] - stai_survey_df['Min_date']).dt.days + 1  # Adding 1 to start counting from day 1

# Week index from integer division of the 1-based day count (days 1-6 -> week 0, days 7-13 -> week 1)
stai_survey_df['Weeks_from_beggining'] = stai_survey_df['Days_from_beggining'] // 7  

# Drop the 'Min_date' column, which is no longer needed
stai_survey_df.drop(columns=['Min_date'], inplace=True)

# Mapped_ID is already present from the merge above
#stai_survey_df.head(10)

# Save DataFrame to CSV file
stai_survey_df.to_csv('data/cleaned/stai_survey_df_cleaned.csv', index=False)  # Set index=False to exclude row numbers in the output CSV file


### Daily fitbit sema

In [18]:
columns_to_drop = ['nremhr', 'badgeType',
       'bpm', 'step_goal', 'min_goal', 'max_goal', 'minutesAwake', 'minutesAfterWakeup']

daily_fitbit_sema_df.drop(columns=columns_to_drop, inplace = True)


# Map participant ids to integers (the mapping was built from the hourly dataframe above)
daily_fitbit_sema_df['Mapped_ID'] = daily_fitbit_sema_df['id'].map(participant_mapping)

# Create a column with the number of days since the participant's start of the study
# Calculate the minimum date for each participant
daily_fitbit_sema_df['date'] = pd.to_datetime(daily_fitbit_sema_df['date'])
daily_fitbit_sema_df['Min_date'] = daily_fitbit_sema_df.groupby('id')['date'].transform('min')
# Subtract the participant's start date from the current date to get the study day each entry represents
daily_fitbit_sema_df['Days_from_beggining'] = (daily_fitbit_sema_df['date'] - daily_fitbit_sema_df['Min_date']).dt.days + 1  # Adding 1 to start counting from day 1
# Week index from integer division of the 1-based day count (days 1-6 -> week 0, days 7-13 -> week 1)
daily_fitbit_sema_df['Weeks_from_beggining'] = daily_fitbit_sema_df['Days_from_beggining'] // 7  

# Extract weekday name from the 'date' column
daily_fitbit_sema_df['weekday'] = daily_fitbit_sema_df['date'].dt.strftime('%A')
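
Because the day count starts at 1, the integer division by 7 used in this notebook gives a first week of only six days. The sketch below shows that mapping for the first 14 days. It also shows an alternative, `(days - 1) // 7`, that would give full seven-day blocks. The alternative is only an illustration and is not what the notebook computes. Both variable names are hypothetical.

```python
import pandas as pd

# Study days 1 to 14, counted from 1 as in the notebook
days = pd.Series(range(1, 15), name="Days_from_beggining")

# As computed in the notebook: days 1-6 -> 0, days 7-13 -> 1, day 14 -> 2
weeks_as_in_notebook = days // 7

# Alternative with full seven-day blocks: days 1-7 -> 0, days 8-14 -> 1
weeks_full_blocks = (days - 1) // 7
```

Which convention is preferable depends on whether later weekly aggregates need equal-length weeks.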




In [19]:
# Create a column holding each participant's next recorded stress score
# Define a function that shifts the stress score back by one row within a participant
def shift_stress_score(group):
    # Sort the participant's rows by date
    group = group.sort_values(by='date')
    # Each row receives the score of the next row (the next day when no dates are missing)
    group['stress_score_shifted'] = group['stress_score'].shift(-1)
    # Forward-fill the gaps left by the shift, including the last row, with the most recent shifted value
    group['stress_score_shifted'] = group['stress_score_shifted'].ffill()
    return group

# Group the data by participant
grouped_data = daily_fitbit_sema_df.groupby('id')

daily_fitbit_sema_with_shifted_score = pd.concat([shift_stress_score(group) for _, group in grouped_data]).reset_index(drop=True)
daily_fitbit_sema_df = daily_fitbit_sema_with_shifted_score
# Save DataFrame to CSV file
daily_fitbit_sema_df.to_csv('data/cleaned/daily_fitbit_sema_cleaned.csv', index=False)  # Set index=False to exclude row numbers in the output CSV file
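
The shift-and-fill logic of `shift_stress_score` can also be written with grouped operations, without concatenating per-participant frames. The sketch below uses a toy dataframe with hypothetical values and keeps the unfilled shift in a separate, hypothetical `next_score` column so that the effect of the forward fill is visible.

```python
import numpy as np
import pandas as pd

# Toy data: participant 'a' has a missing score on the second day
toy = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(["2021-05-24", "2021-05-25", "2021-05-26",
                            "2021-05-24", "2021-05-25"]),
    "stress_score": [70.0, np.nan, 80.0, 60.0, 65.0],
})

toy = toy.sort_values(["id", "date"]).reset_index(drop=True)

# Score of the next recorded row within each participant, without any filling
toy["next_score"] = toy.groupby("id")["stress_score"].shift(-1)

# Per-participant forward fill, matching shift followed by ffill in shift_stress_score
toy["stress_score_shifted"] = toy.groupby("id")["next_score"].ffill()
```

In this toy example, the last row of each participant has no next-day value, so the forward fill gives it the previous row's shifted value, which is that last row's own score. The first row of participant `a` stays missing because there is nothing earlier to fill from.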


### Personality

In [20]:
personality_survey_df.rename(columns={'user_id': 'id'}, inplace=True)
personality_survey_df.drop(columns=['extraversion',
       'agreeableness', 'conscientiousness', 'stability', 'intellect'], inplace=True)

personality_survey_df.rename(columns={'ipip_extraversion_category': 'extraversion', 'ipip_agreeableness_category': 'agreeableness', 
                                     'ipip_conscientiousness_category': 'conscientiousness', 'ipip_stability_category': 'stability', 
                                     'ipip_intellect_category': 'intellect'}, inplace=True)

In [21]:
# Save DataFrame to CSV file
personality_survey_df.to_csv('data/cleaned/personality_survey_cleaned.csv', index=False)  # Set index=False to exclude row numbers in the output CSV file
