# Pre-processing and Feature Extraction

This notebook contains all necessary code to replicate the feature extraction and final csv files that were used in the statistical analyses.

The K-Emophone dataset should be present in the same folder as this notebook. <br>
Because the relevant csv files are large and take 20-30 minutes to process, the final Dataframes used for analysis are also included in the zip-file.

Fixed Effects Regression Model: total_df.csv

Multiple Linear Regression Model: total_df_weekly.csv

### Sidenote
Some of the markdown cells explaining the features can also be found in the final paper. We repeat those explanations here to clarify the code.

In [None]:
import pandas as pd
import numpy as np
import os
from datetime import datetime
from functools import reduce

### Convert timestamps
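The data contains Unix timestamps in milliseconds. We convert these to timezone-aware datetimes using pd.to_datetime and shift them from UTC to local time (the code uses the 'Asia/Tokyo' zone, UTC+9). For the ESM responses, the converted values are later stored in column 'responseTime'. Out-of-bounds values are caught and returned as NaT, so that they can be dropped afterwards.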

In [None]:
# safe convert for timestamps
def safe_convert(timestamp):
    try:
        # timestamp unit in milliseconds
        dt = pd.to_datetime(timestamp, unit='ms', utc=True)
        # convert from UTC to Asia/Tokyo timezone
        return dt.tz_convert('Asia/Tokyo')
    except pd.errors.OutOfBoundsDatetime:
        return pd.NaT  # return Not a Time for out-of-bound values
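
As a quick sanity check of the conversion, the sketch below applies the same pd.to_datetime and tz_convert calls to two timestamps. The helper name to_local and both input values are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

def to_local(timestamp_ms):
    """Convert a Unix timestamp in milliseconds to Asia/Tokyo time (UTC+9)."""
    return pd.to_datetime(timestamp_ms, unit='ms', utc=True).tz_convert('Asia/Tokyo')

# hypothetical inputs: the Unix epoch and midnight UTC on 2019-04-30
epoch_local = to_local(0)
sample_local = to_local(1_556_582_400_000)

# both values are shifted nine hours ahead of UTC
print(epoch_local)
print(sample_local)
```

Because the offset is applied before the date is extracted, an event shortly before midnight UTC is assigned to the next local day.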

## Loading data

The initial dataset is structured to contain measurement data as subfolders per person. Because we want to access the data per topic for all individuals instead of the other way around, we traverse through the directory structure and its subfolders. We then initialise a list that collects the dataframes we are interested in, with each dataframe corresponding to each participant.

This is not needed for ESMResponse and userInfo, since these files already contain all participants.

In [None]:
# access dataset
base_dir = 'k-emophone'

# lists to hold the dataframes
calories = []
screenEvent = []
activity = []

for root, dirs, files in os.walk(base_dir):
        if root[-3] == 'P':
                p_val = root[-3:]
                for file in files:
                        # access relevant files
                        if file in ['Calorie.csv', 'ScreenEvent.csv', 'ActivityEvent.csv']:
                                file_path = os.path.join(root, file)
                                df = pd.read_csv(file_path)
                                df['timestamp'] = df['timestamp'].apply(safe_convert)
                                df['pcode'] = p_val # track pcode of each participant and add to all dataframes
                                if file == 'Calorie.csv':
                                        calories.append(df)
                                elif file == 'ScreenEvent.csv':
                                        screenEvent.append(df)
                                elif file == 'ActivityEvent.csv':
                                        activity.append(df)

ESMResponse = pd.read_csv(os.path.join(base_dir, 'SubjData', 'EsmResponse.csv'))
userInfo = pd.read_csv(os.path.join(base_dir, 'SubjData', 'UserInfo.csv'))

## Feature Extraction


### Stress feature

First, the date is extracted from responseTime so that the data can be grouped by day. Next, the stress score (on a 7-point scale) is retrieved from ESMResponse and averaged per participant per day. A new dataframe stores the daily mean stress score for each participant.

In [None]:
# initialise new dataframe
Stress = pd.DataFrame()

ESMResponse['responseTime'] = ESMResponse['responseTime'].apply(safe_convert) # apply safe conversion from timestamps to datetime
ESMResponse = ESMResponse.dropna(subset=['responseTime']).copy() # only continue with rows that have a valid 'responseTime'; copy so new columns are not written to a view

ESMResponse['date'] = ESMResponse['responseTime'].dt.date # extract date from responseTime
grouped_df = ESMResponse.groupby(['pcode', 'date'])['stress'].mean().reset_index() # group by person and day, take aggregate stress values
Stress = pd.concat([Stress, grouped_df], ignore_index=True) # append to the corresponding dataframe

display(Stress)

### Screen time feature
Using the screenEvent dataframes, we group the data by day and participant, extracting the date from the converted time.

The ScreenEvent.csv files contain the screen status per participant. Three possible screen events have been recorded, namely 'ON', 'UNLOCK' and 'OFF'. ON refers to the screen being turned on whilst still being locked. Therefore, we decided to calculate screen time from the moment when the screen is actually unlocked (UNLOCK event) until the screen is turned off again (OFF event).  
Screen time is calculated by summing the time between each UNLOCK and the next OFF event in column 'type'. The durations are expressed in seconds and stored in a new Dataframe per participant.

All data is stored in dataframe 'Screen_time', which holds each participant's daily screen time.

In [None]:
# initialize dataframe
Screen_time = pd.DataFrame()

merged_df = pd.concat(screenEvent)
merged_df['timestamp'] = merged_df['timestamp'].apply(safe_convert)
merged_df['date'] = merged_df['timestamp'].dt.date

grouped_df = merged_df.groupby('pcode') # group by participant

# calculate screen time between an 'UNLOCK' and the following 'OFF' event
def calculate_screentime(person, df):

    dates = []
    screen_times = []
    day_group = df.groupby('date')

    # loop over days
    for date, day_df in day_group:
        screen_time = 0

        unlock_time = None
        for _, row in day_df.iterrows():
            if row['type'] == 'UNLOCK':
                unlock_time = row['timestamp'] # converted datetime column
            elif row['type'] == 'OFF' and unlock_time is not None:
                off_time = row['timestamp']
                screen_duration = (off_time - unlock_time).total_seconds() # calculate screen time in seconds
                screen_time += screen_duration # add to the total screen time for that day
                unlock_time = None  # reset unlock_time

        dates.append(date) # keep the date yielded by groupby so dates and totals stay aligned
        screen_times.append(screen_time)

    screentime_df = pd.DataFrame({'date': dates, 'screentime': screen_times, 'pcode': person}) # create dataframe

    return screentime_df


screen_times = []

for person, group in grouped_df:
    screen_times.append(calculate_screentime(person, group))

Screen_time = pd.concat(screen_times)
display(Screen_time)
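
To illustrate the pairing rule on a small scale, the sketch below runs the same UNLOCK/OFF logic on a hand-made event log for a single day. The timestamps are invented for illustration. An ON event and an OFF event without a preceding UNLOCK do not contribute to the total.

```python
import pandas as pd

# hypothetical event log for one participant on one day
events = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2019-05-01 08:00:00',  # ON: screen lit but still locked
        '2019-05-01 08:00:05',  # UNLOCK: session starts
        '2019-05-01 08:10:05',  # OFF: session ends after 600 seconds
        '2019-05-01 09:00:00',  # OFF without a preceding UNLOCK: ignored
        '2019-05-01 12:00:00',  # UNLOCK: second session starts
        '2019-05-01 12:01:30',  # OFF: second session ends after 90 seconds
    ]),
    'type': ['ON', 'UNLOCK', 'OFF', 'OFF', 'UNLOCK', 'OFF'],
})

total_seconds = 0.0
unlock_time = None
for row in events.itertuples(index=False):
    if row.type == 'UNLOCK':
        unlock_time = row.timestamp
    elif row.type == 'OFF' and unlock_time is not None:
        total_seconds += (row.timestamp - unlock_time).total_seconds()
        unlock_time = None  # wait for the next UNLOCK

print(total_seconds)
```

Note that, because events are grouped by day first, a session that starts before midnight and ends after it is not counted under this rule.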

### Activity Event feature
Below, we construct another feature for determining physical activity. It uses ActivityEvent.csv, which records confidence levels (0-1) about the activity state of someone's mobile, ideally every 15 milliseconds.

First, we create a new dataframe to store the information that we are interested in. For each row (a mobile entry) in each dataframe (corresponding to a participant), we record whether someone is active or not. For this we assume that when the mobile is more than 80% confident that someone is walking, running or biking, they likely are. We assume for missing entries that someone is not active. Additionally, this feature only allows us to record activity when someone carries their phone (which is also not guaranteed).

Total minutes spent in a period of likely activity are stored under 'active_min', reflecting the daily minutes spent either walking, running or biking. We then categorise a day as 'inactive' when someone's total daily minutes are < 21.42, as 'moderately active' when they are >= 21.42 and < 42.8, and as 'highly active' when they are >= 42.8, according to WHO guidelines. (https://www.who.int/news-room/fact-sheets/detail/physical-activity)
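
The daily cut-offs appear to follow from dividing the WHO weekly recommendation of 150 to 300 minutes of moderate-intensity activity by seven days. The sketch below shows that arithmetic and the resulting classification. The function name classify_minutes is a hypothetical helper, and the labels follow the ones used by the combined physical activity feature.

```python
# WHO weekly recommendation for adults, in minutes of moderate-intensity activity
WEEKLY_LOWER_MIN = 150
WEEKLY_UPPER_MIN = 300

daily_lower = WEEKLY_LOWER_MIN / 7  # roughly 21.43 minutes per day
daily_upper = WEEKLY_UPPER_MIN / 7  # roughly 42.86 minutes per day

def classify_minutes(minutes, lower=21.42, upper=42.8):
    """Classify daily active minutes using the cut-offs stated in the text."""
    if minutes < lower:
        return 'inactive'
    if minutes < upper:
        return 'moderately active'
    return 'highly active'

print(round(daily_lower, 2), round(daily_upper, 2))
print(classify_minutes(10), classify_minutes(30), classify_minutes(60))
```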



In [None]:
# detect changes in active status and return amount of active minutes
def calculate_active_minutes(df):
    active_columns = ['confidenceRunning', 'confidenceOnBicycle', 'confidenceOnFoot', 'confidenceWalking']
    df['isActive'] = df[active_columns].max(axis=1) > 0.8

    df['activityChange'] = df['isActive'].diff().ne(0).cumsum()

    # filter only active periods and calculate the duration
    active_periods = df[df['isActive']].groupby('activityChange').agg(start_time=('timestamp', 'min'), end_time=('timestamp', 'max'))
    active_periods['duration_minutes'] = (active_periods['end_time'] - active_periods['start_time']).dt.total_seconds() / 60 # convert seconds to minutes

    # group by day
    active_periods['date'] = active_periods['start_time'].dt.floor('D').dt.date
    daily_active_minutes = active_periods.groupby('date')['duration_minutes'].sum().reset_index()

    # classify activity levels based on WHO guidelines
    daily_active_minutes['activity_level'] = daily_active_minutes['duration_minutes'].apply(
        lambda x: 'inactive' if x < 21.42 else ('moderately active' if x < 42.8 else 'highly active') # labels must match those in combine_activity_levels
    )

    daily_active_minutes.insert(0, 'pcode', df['pcode'].iloc[0]) # make sure pcode is added (positional lookup, since index label 0 may have been dropped)

    # rename columns and order them
    daily_active_minutes.columns = ['pcode', 'date', 'active_min', 'activity_level']

    return daily_active_minutes

# process each DataFrame in the list 'activity'
dfs = []
total_minutes_list = []
for df in activity:
    df = df.dropna(subset=['timestamp']).copy() # copy so new columns are not written to a view

    daily_active_minutes = calculate_active_minutes(df)
    total_minutes_list.append(daily_active_minutes)

    dfs.append(daily_active_minutes)

AE = pd.concat(dfs)
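
The run detection used above can be hard to follow, so here is a minimal sketch on invented samples. It labels consecutive rows with the same active status as one run by comparing each value with its predecessor (an equivalent formulation of the diff-based expression in the function), then measures each active run from its first to its last timestamp.

```python
import pandas as pd

# hypothetical samples: two active runs separated by one inactive sample
samples = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2019-05-01 10:00', '2019-05-01 10:05', '2019-05-01 10:10',  # active run 1
        '2019-05-01 10:15',                                           # not active
        '2019-05-01 11:00', '2019-05-01 11:20',                       # active run 2
    ]),
    'isActive': [True, True, True, False, True, True],
})

# a new run starts whenever the status differs from the previous row
samples['run_id'] = samples['isActive'].ne(samples['isActive'].shift()).cumsum()

runs = (samples[samples['isActive']]
        .groupby('run_id')
        .agg(start=('timestamp', 'min'), end=('timestamp', 'max')))
runs['duration_minutes'] = (runs['end'] - runs['start']).dt.total_seconds() / 60

total_active_minutes = runs['duration_minutes'].sum()
print(runs)
print(total_active_minutes)
```

One consequence of this definition is that an isolated active sample has identical start and end times and therefore contributes zero minutes.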

### Calories feature
The Calorie Dataframe contains information on how many calories a participant has burned on that day and in total since the beginning of the experiment.

We create a new Dataframe where each row corresponds to a specific day per participant. Firstly, we retrieve the amount of calories burned on a specific day by identifying the last entry of that day. Since the original CSV files record calories cumulatively for each day, the value in the last entry provides the total number of calories burned for that day.
For total burned calories (totalCalories) we can directly use the column from the original Dataframe.

To determine whether a participant has had an active day or not, we estimate their average resting calorie expenditure (REE), also known as BMR (Basal Metabolic Rate). The equations for men and women respectively are as follows.

BMR (male) = ( 13.7516 × weight in kg ) + ( 5.0033 × height in cm ) – ( 6.755 × age in years ) + 66.473 <br>
BMR (female) = ( 9.5634 × weight in kg ) + ( 1.8496 × height in cm ) – ( 4.6756 × age in years ) + 655.0955

Since we don't know the weight and height of the participants, we estimate the BMRs for men and women based on the average weight and height of Korean young adults (specification can be found in the final paper). We do, however, have information on participants' gender and age, allowing us to insert this information directly into the right formula.

BMR (male) = ( 13.7516 × 76.5 ) + ( 5.0033 × 174 ) – ( 6.755 × age in years ) + 66.473 <br>
BMR (female) = ( 9.5634 × 57 ) + ( 1.8496 × 161 ) – ( 4.6756 × age in years ) + 655.0955

See our final paper for more information on the equations above.

To classify participants' days, we apply the following rules. If participants burn less than 1.2 times their BMR, the day is classified as 'inactive'. If they burn at least 1.2 but less than 1.55 times their BMR, the day is classified as 'moderately active'. If they burn 1.55 times their BMR or more, the day is classified as 'highly active'.

#### Missing values:
A problem we ran into is that participants don't always seem to wear the smartband when they're expected to. We therefore record the number of data entries per day (datapointsToday) and ignore days with fewer than 10,000 entries, as this data seems unreliable.


In [None]:
# calculate BMR for each participant
def calculate_bmr(row):
    if row['gender'] == 'M':
        return round((13.7516 * 76.5) + (5.0033 * 174) - (6.755 * row['age']) + 66.473, 2)
    else: # all other gender values use the female equation
        return round((9.5634 * 57) + (1.8496 * 161) - (4.6756 * row['age']) + 655.0955, 2)

# function for determining PAlevel based on the given conditions
def determine_palevel(row):
    if row['caloriesToday'] < row['BMR'] * 1.2:
        return 'inactive'
    elif row['BMR'] * 1.2 <= row['caloriesToday'] < row['BMR'] * 1.55:
        return 'moderately active'
    else:
        return 'highly active'
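
As a worked example of the equations and thresholds, the sketch below computes the estimated BMR and the two cut-offs for a hypothetical 22-year-old male participant. The helper names estimate_bmr and classify_day, the age and the calorie values are assumptions made for illustration only.

```python
def estimate_bmr(gender, age):
    """Estimate BMR with the fixed average weight and height used in the text."""
    if gender == 'M':
        return 13.7516 * 76.5 + 5.0033 * 174 - 6.755 * age + 66.473
    return 9.5634 * 57 + 1.8496 * 161 - 4.6756 * age + 655.0955

def classify_day(calories_today, bmr):
    """Apply the 1.2x and 1.55x BMR thresholds."""
    if calories_today < 1.2 * bmr:
        return 'inactive'
    if calories_today < 1.55 * bmr:
        return 'moderately active'
    return 'highly active'

bmr_male = estimate_bmr('M', 22)   # hypothetical participant
moderate_cutoff = 1.2 * bmr_male
high_cutoff = 1.55 * bmr_male

print(round(bmr_male, 2), round(moderate_cutoff, 2), round(high_cutoff, 2))
for kcal in (2000, 2500, 3000):    # hypothetical daily totals
    print(kcal, classify_day(kcal, bmr_male))
```

Since weight and height are fixed, the estimate varies only with gender and age, which is why the combined feature later cross-checks it against the phone-based activity minutes.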


In [None]:
# initialize new dataframe
Calories = pd.DataFrame()

# loop over all participants
for df in calories:
    df['timestamp'] = df['timestamp'].apply(safe_convert)
    df = df.dropna(subset=['timestamp']).copy() # copy so new columns are not written to a view
    df['date'] = df['timestamp'].dt.date

    # track number of datapoints for each day
    datapoints_per_day = df.groupby('date').size().reset_index(name='datapointsToday')

    # identify the last entry of each day
    last_entries = df.groupby('date').tail(1).copy()
    last_entries['pcode'] = df['pcode'].iloc[0] # add pcode to the last entries (positional lookup, since index label 0 may have been dropped)

    last_entries = last_entries.merge(datapoints_per_day, on='date') # merge dataframes
    Calories = pd.concat([Calories, last_entries[['pcode', 'date', 'datapointsToday', 'caloriesToday', 'totalCalories']]], ignore_index=True)  # append to the final DataFrame


# sort the dataframe by pcode and date
Calories = Calories.sort_values(by=['pcode', 'date']).reset_index(drop=True)

In [None]:
# calculate BMR
Calories = Calories.merge(userInfo[['pcode', 'gender', 'age']], on='pcode') # merge with UserInfo to retrieve gender and age

Calories = Calories[Calories['datapointsToday'] >= 10000] # filter out rows where datapointsToday is under 10000

Calories['BMR'] = Calories.apply(calculate_bmr, axis=1) # call calculate BMR function

Calories['caloriesActive'] = Calories['caloriesToday'] - Calories['BMR'] # add column for difference between caloriesToday and BMR

Calories['PAlevel'] = Calories.apply(determine_palevel, axis=1) # apply the function to determine the PAlevel for each row

### Combined physical activity feature

This feature is a combination of Calories and AE. Both DataFrames include an activity-level column ('PAlevel' in Calories, 'activity_level' in AE). This column takes the value 'inactive', 'moderately active', or 'highly active', depending on the amount of calories burned (Calories) or the number of minutes a participant has been active (AE) on a given day. <br>
When a participant's activity levels on a given day differ by only one level, we choose the lower activity level in our combined physical activity feature. For example, if Calories classifies a participant's day as 'highly active', but AE classifies the same day as 'moderately active', we use 'moderately active' as the activity level for the combined physical activity feature. When the activity levels differ by more than one level (i.e. the activity level of one feature is 'inactive' and the other one is 'highly active'), we use the middle ground: 'moderately active'.
This way we aim to create a more accurate and powerful feature by using the second feature as a control.

If the activity levels from the two features don't correspond, there is a possibility that the data is inaccurate or incorrectly interpreted. A high level of burned calories could, for example, be due to a participant being larger. Since we don't have information on participants' height and weight and simply used the average for Korean young adults in our calculations, we need to ensure that we interpret the data the right way. Similarly, if a participant has not worn their Smartband enough for us to draw conclusions from their Calories data, we still have an opportunity to include their physical activity in the feature (if their AE activity level is high enough). The same goes for participants who might not take their phone with them when exercising: their AE data might be incomplete, while their Calories data might indicate that they have been active.  

In [None]:
# determine combined activity level
def combine_activity_levels(row):
    levels = ['inactive', 'moderately active', 'highly active']
    level_calories = row['activity_level_calories']
    level_ae = row['activity_level_ae']

    # retrieve the index of each activity level
    index_calories = levels.index(level_calories)
    index_ae = levels.index(level_ae)

    if abs(index_calories - index_ae) == 1:
        return levels[min(index_calories, index_ae)] # choose the lower activity level
    elif abs(index_calories - index_ae) >= 2:
        return 'moderately active'  # use the middle ground: 'moderately active'
    else:
        return level_calories # if both are the same level, return that level
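
To make the combination rule concrete, the sketch below enumerates all nine pairs of activity levels and prints the combined result. The standalone function combine is a hypothetical restatement of the rule that takes two labels directly instead of a DataFrame row.

```python
from itertools import product

LEVELS = ['inactive', 'moderately active', 'highly active']

def combine(level_calories, level_ae):
    """Combine two activity levels following the rule described in the text."""
    i, j = LEVELS.index(level_calories), LEVELS.index(level_ae)
    if abs(i - j) == 1:
        return LEVELS[min(i, j)]    # one level apart: take the lower level
    if abs(i - j) >= 2:
        return 'moderately active'  # two levels apart: take the middle ground
    return level_calories           # identical levels

combined = {(a, b): combine(a, b) for a, b in product(LEVELS, repeat=2)}
for (a, b), result in combined.items():
    print(f'{a:>17} + {b:<17} -> {result}')
```

The rule is symmetric, so swapping the Calories and AE labels never changes the outcome.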

In [None]:
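# merge DataFrames on participant and date
# rename PAlevel so both frames share the column name 'activity_level' and receive the suffixes below
PA = pd.merge(Calories.rename(columns={'PAlevel': 'activity_level'}), AE, on=['pcode', 'date'], suffixes=('_calories', '_ae'))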

# apply combining function to the merged DataFrame
PA['activity_level'] = PA.apply(combine_activity_levels, axis=1)
PA.rename(columns={'date_calories': 'date'}, inplace=True)

PA = PA[['pcode', 'date', 'activity_level']]  # only keep relevant columns

## Missing values
Not every participant has data for the full 7 days of the experiment. To make sure we only include participants with sufficient data, we decided to leave out any participant with fewer than 3 days of relevant data.

From the displayed dataframes below, we can see that P61 and P22 both have one type of data that only contains 2 days in total. Therefore, we will be removing P61 and P22 from our dataset as a whole.

In [None]:
# stress levels
p_missing_SL = Stress.groupby('pcode').filter(lambda x: len(x) < 3)
display(p_missing_SL)

# activity event
p_missing_AE = AE.groupby('pcode').filter(lambda x: len(x) < 3) # use AE (all participants), not the last loop variable
display(p_missing_AE)

# screen time
p_missing_ST = Screen_time.groupby('pcode').filter(lambda x: len(x) < 3)
display(p_missing_ST)

# calories
p_missing_CA = Calories.groupby('pcode').filter(lambda x: len(x) < 3)
display(p_missing_CA)
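
An alternative way to list the affected participants, rather than displaying their rows, is to count the distinct days per participant. The sketch below does this on invented data. The participant codes 'PA' and 'PB' and all values are hypothetical.

```python
import pandas as pd

# hypothetical daily feature table: PA has three days of data, PB only two
toy = pd.DataFrame({
    'pcode': ['PA', 'PA', 'PA', 'PB', 'PB'],
    'date': ['2019-05-01', '2019-05-02', '2019-05-03', '2019-05-01', '2019-05-02'],
    'stress': [3.0, 4.0, 2.5, 5.0, 1.0],
})

days_per_participant = toy.groupby('pcode')['date'].nunique()
too_sparse = days_per_participant[days_per_participant < 3].index.tolist()

# drop every row belonging to a participant with fewer than 3 days
kept = toy[~toy['pcode'].isin(too_sparse)]

print(too_sparse)
print(kept)
```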

## Fixed Effects Regression Model: Final Dataframe

Here the output of all features is merged into one dataframe, matched on pcode and date.

In [None]:
# merge dataframes
dataframes = [Stress, AE, Screen_time, Calories, PA]
total_df = reduce(lambda left, right: pd.merge(left, right, on=['pcode', 'date'], how='outer'), dataframes)

total_df = total_df[~total_df['pcode'].isin(['P22', 'P61'])] # delete rows for participant P22 and P61
total_df = total_df.sort_values(by=['pcode', 'date'], ascending=[True, True])


# lengths, listed in the order of the first four dataframes merged above
# length stress: 535
# length AE: 534
# length screen time: 556
# length calories: 480

# total nans: 113
# total nans after deleting P22 and P61: 103
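
The missing values counted above arise because the outer merge keeps every participant-day that appears in at least one feature table. The sketch below reproduces that effect on three small invented tables. All codes, dates and values are hypothetical.

```python
from functools import reduce

import pandas as pd

# hypothetical feature tables that cover different participant-days
stress = pd.DataFrame({'pcode': ['P01', 'P01'],
                       'date': ['2019-05-01', '2019-05-02'],
                       'stress': [3.0, 4.0]})
screen = pd.DataFrame({'pcode': ['P01', 'P02'],
                       'date': ['2019-05-01', '2019-05-01'],
                       'screentime': [690.0, 120.0]})
active = pd.DataFrame({'pcode': ['P02'],
                       'date': ['2019-05-01'],
                       'active_min': [30.0]})

frames = [stress, screen, active]
merged = reduce(
    lambda left, right: pd.merge(left, right, on=['pcode', 'date'], how='outer'),
    frames,
)
merged = merged.sort_values(['pcode', 'date']).reset_index(drop=True)

# every feature missing for a participant-day shows up as NaN
missing_cells = int(merged.isna().sum().sum())
print(merged)
print(missing_cells)
```

With an inner merge instead, only participant-days present in all tables would survive, which in this toy case leaves no rows at all.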

In [None]:
# save df as csv file
total_df.to_csv('total_df.csv', index=False)

## Multiple Linear Regression Model


To merge the daily variables into single, weekly variables for each participant, the median is taken for all features.


In [None]:
# Depression scores
Depression_scores_ = userInfo[['pcode', 'PHQ']]

# Stress levels
Stress_ = pd.DataFrame(Stress.groupby('pcode')['stress'].median())

# Screen time
Screen_time['screentime'] = pd.to_numeric(Screen_time['screentime'], errors='coerce') # Ensure screentime is numeric
Screen_time_ = pd.DataFrame(Screen_time.groupby('pcode')['screentime'].median())

# Activity event
AE_ = pd.DataFrame(AE.groupby('pcode')['active_min'].median())
# classify activity levels based on WHO guidelines
AE_['activity_level'] = AE_['active_min'].apply(
        lambda x: 'inactive' if x < 21.42 else ('moderately active' if x < 42.8 else 'highly active')
    )

# Calories
Calories_ = pd.DataFrame(Calories.groupby('pcode')[['caloriesToday', 'BMR']].median())
Calories_['activity_level'] = Calories_.apply(determine_palevel, axis=1) # apply the function to determine the PAlevel

# Physical activity
PA_ = pd.merge(Calories_, AE_, on='pcode', suffixes=('_calories', '_ae'))
PA_['activity_level'] = PA_.apply(combine_activity_levels, axis=1) # classify activity levels
PA_ = PA_[['activity_level']]

In [None]:
# merge dataframes
dataframes_ = [Stress_, PA_, Calories_, Screen_time_, Depression_scores_]
total_df_ = reduce(lambda left, right: pd.merge(left, right, on=['pcode'], how='outer'), dataframes_)


total_df_ = total_df_[~total_df_['pcode'].isin(['P22', 'P61'])] # Delete rows for participant P22 and P61
total_df_ = total_df_.sort_values(by=['pcode'])

In [None]:
# save df as csv file
total_df_.to_csv('total_df_weekly.csv', index=False)