The purpose of this notebook is to investigate an alternative method for counting instances of moderate/vigorous physical activity (MVPA).

We'll start by segmenting each day into 5-minute "bouts".

But rather than compute the percent of bouts that are MVPA, we'll do the following:
1. Identify each day on which the accelerometer was on long enough to believe that it was on for most/all waking hours
2. Count the number of such days, which we'll call "active" days
3. Count the number of MVPA bouts on these days
4. Compute the average number of MVPA bouts per active day
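
The four steps above can be sketched on synthetic data. Everything here is an assumption made for illustration: the random ENMO values, the `MVPA_ENMO_THRESHOLD` and `MIN_VALID_BOUTS` cutoffs, and the variable names are hypothetical, not values chosen in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical cutoffs, for illustration only
MVPA_ENMO_THRESHOLD = 0.1   # mean ENMO above which a bout counts as MVPA
MIN_VALID_BOUTS = 150       # minimum valid 5-minute bouts for an "active" day

# Synthetic 5-minute bouts covering 3 days (288 bouts per day)
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=3 * 288, freq="5min")
bouts = pd.DataFrame({"enmo": rng.exponential(0.05, size=len(idx))}, index=idx)

# Simulate a day with poor wear: blank out most of day 2
bouts.loc["2000-01-02 02:00":"2000-01-02 23:55", "enmo"] = np.nan

day = bouts.index.normalize()

# Step 1: flag days with enough valid (non-NaN) bouts
valid_bouts_per_day = bouts["enmo"].notna().groupby(day).sum()
active_days = valid_bouts_per_day[valid_bouts_per_day >= MIN_VALID_BOUTS].index

# Step 2: count those days
n_active_days = len(active_days)

# Step 3: count MVPA bouts on the active days only (NaN bouts compare as False)
is_mvpa = bouts["enmo"] > MVPA_ENMO_THRESHOLD
mvpa_per_day = is_mvpa.groupby(day).sum().loc[active_days]

# Step 4: average MVPA bouts per active day
mean_mvpa_bouts = mvpa_per_day.sum() / n_active_days
print(n_active_days, mean_mvpa_bouts)
```

In this toy setup the second day keeps only 24 valid bouts, so it falls below the cutoff and is excluded from the average.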

In [2]:
# Start by importing packages we'll need
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [26]:
# Load the parquet data file series_train.parquet/id=0a418b57/part-0.parquet
data0a418b57 = pd.read_parquet('series_train.parquet/id=0a418b57/part-0.parquet')

# Add a new column that converts time_of_day into datetime
data0a418b57['dt'] = pd.to_datetime(data0a418b57['time_of_day'])

# Change the day in the dt variable to be equal to the relative_date_PCIAT value
data0a418b57['dt_mod'] = data0a418b57['dt'] + pd.to_timedelta(data0a418b57['relative_date_PCIAT'], unit='D')

In [28]:
# Create a new data frame by splitting data0a418b57 into 5-minute intervals and computing the mean of each variable within the interval
data0a418b57_resampled_5min = data0a418b57.set_index('dt_mod').resample('5min').mean()

In [29]:
data0a418b57_resampled_5min.head()

dt_mod,step,X,Y,Z,enmo,anglez,non-wear_flag,light,battery_voltage,time_of_day,weekday,quarter,relative_date_PCIAT,dt
1969-12-23 14:10:00,4.5,0.00436,-0.145778,-0.918847,0.024701,-78.099045,0.0,10.135992,4187.600098,51272500000000.0,2.0,4.0,-9.0,1970-01-01 14:14:32.500
1969-12-23 14:15:00,10.0,0.030891,-0.019828,-0.995308,0.000162,-87.887527,0.0,1.053176,4186.833496,51300000000000.0,2.0,4.0,-9.0,1970-01-01 14:15:00.000
1969-12-23 14:20:00,,,,,,,,,,,,,,NaT
1969-12-23 14:25:00,,,,,,,,,,,,,,NaT
1969-12-23 14:30:00,,,,,,,,,,,,,,NaT


Next, we need to choose a cutoff: the minimum number of valid 5-minute bouts in a day that suggests the participant had the accelerometer turned on for most of their waking hours.

We'll count the number of valid bouts on each day, then look at the distribution of those counts and a time-series graph.

In [39]:
# Count the number of non-NaN values of the 'enmo' variable for each value of relative_date_PCIAT
boutcount = data0a418b57_resampled_5min.groupby(by=["relative_date_PCIAT"]).count()['enmo']

# Make a histogram of the values of boutcount
fig = px.histogram(boutcount, x=boutcount, nbins=100)
fig.show()

In [40]:
# Make a time-series graph of boutcount
fig = go.Figure()
fig.add_trace(go.Scatter(x=boutcount.index, y=boutcount))
fig.show()

If we assume someone would sleep for 8 hours a night, that leaves 192 possible 5-minute bouts in which there might be valid data.
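
The arithmetic behind that figure, as a quick check. The variable names are made up for this sketch; the 8 hours of sleep is the assumption stated above.

```python
BOUT_MINUTES = 5
bouts_per_hour = 60 // BOUT_MINUTES      # 12 bouts in an hour
bouts_per_day = 24 * bouts_per_hour      # 288 bouts in a full day

sleep_hours = 8                          # assumed sleep duration
waking_bouts = (24 - sleep_hours) * bouts_per_hour

print(bouts_per_day, waking_bouts)  # 288 192
```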

Certainly the days where there were over 200 5-minute bouts were "valid" days. 

When the number of valid bouts dropped below 100, it seems likely that the accelerometer was not worn for most of the day. However, particularly where the accelerometer had the "stop recording during sedentary periods" feature turned on, it also seems possible that the participant was simply sedentary during some of those periods.

When the number of valid bouts dropped to around 150 it seems possible that the accelerometer was on for most/all of the day. Very hard to say....

Maybe it would (still) be beneficial to try to "fill in" gaps that likely correspond to sedentary behavior where the accelerometer just turned itself off.

(Note that I should probably find a participant who had some non-wear_flag=1 to see what their data look like)

Maybe we can find strings of NaN that are 30 minutes or less (so 6 or fewer in a row) and fill those in with the averages?

We'll do that by re-resampling and using the ffill function
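
One caveat: `ffill(limit=6)` on its own would also fill the first six bouts of a longer gap. A minimal sketch of one way to fill only the short gaps, on a toy series rather than the real data: measure the length of each NaN run first, then keep the forward-filled value only inside runs of 6 or fewer. The names `MAX_GAP`, `run_id`, `run_length` and `short_gap` are invented for this sketch.

```python
import numpy as np
import pandas as pd

MAX_GAP = 6  # 6 bouts x 5 minutes = 30 minutes

# Toy series: a 2-bout gap (short) and a 7-bout gap (too long to fill)
enmo = pd.Series([0.1, np.nan, np.nan, 0.3] + [np.nan] * 7 + [0.5])

is_nan = enmo.isna()

# Give each run of consecutive NaN (or consecutive non-NaN) rows its own id
run_id = (is_nan != is_nan.shift()).cumsum()

# Length of the run that each row belongs to
run_length = is_nan.groupby(run_id).transform("size")

# Rows that are NaN and sit in a run of MAX_GAP or fewer
short_gap = is_nan & (run_length <= MAX_GAP)

# Keep the original values, except inside short gaps, where the forward-filled value is used
filled = enmo.where(~short_gap, enmo.ffill())
print(filled.tolist())
```

Forward-fill repeats the last valid value. If values between the two neighbouring bouts are preferred, `enmo.interpolate()` could be swapped in for `enmo.ffill()` in the last step.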

In [49]:
# The following method is suggested at https://stackoverflow.com/questions/32890124/pandas-dataframe-running-sum-with-reset

# Other potential methods:
#https://stackoverflow.com/questions/45964740/python-pandas-cumsum-with-reset-everytime-there-is-a-0

# Create a new variable that flags when the cumulative sum should reset (when 'step' is not NaN)
data0a418b57_resampled_5min['cumreset'] = data0a418b57_resampled_5min['step'].notna()

# Create a new variable that counts consecutive NaN values in the step variable and resets to 0 at each non-NaN value
# 'cumsum' labels each run: it increases by 1 at every non-NaN step, so a valid bout and the NaNs after it share a label
data0a418b57_resampled_5min['cumsum'] = data0a418b57_resampled_5min['cumreset'].cumsum()
data0a418b57_resampled_5min['nan_count'] = data0a418b57_resampled_5min['step'].isna().astype(int).groupby(data0a418b57_resampled_5min['cumsum']).cumsum()


In [53]:
# Create a new variable called 'enmogroup' that increases by 1 each time the value of enmo is numerical
data0a418b57_resampled_5min['enmogroup'] = data0a418b57_resampled_5min['enmo'].notna().cumsum()

In [61]:
# Create a new variable called enmogrouplength that is the size of enmogroup
data0a418b57_resampled_5min['enmogrouplength'] = data0a418b57_resampled_5min.groupby(['enmogroup']).count()

ValueError: Cannot set a DataFrame with multiple columns to the single column enmogrouplength

Not working so well so far...
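
The ValueError most likely arises because `groupby(...).count()` returns a whole DataFrame (one column per variable, indexed by group rather than by time), which can't be assigned to a single column. A sketch of one alternative on a toy frame: `transform("size")` returns one value per original row. The `gap_after` column is a name invented here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"enmo": [0.1, np.nan, np.nan, 0.3, np.nan, 0.5]})

# Group id increases by 1 at each non-NaN enmo value, as in the cell above
df["enmogroup"] = df["enmo"].notna().cumsum()

# One value per original row: the number of rows in that row's group
df["enmogrouplength"] = df.groupby("enmogroup")["enmo"].transform("size")

# Each group is one valid bout plus the NaN bouts after it, so the gap is size - 1
df["gap_after"] = df["enmogrouplength"] - 1
print(df)
```

Note that if the series begins with NaN values, those rows form a group 0 that contains no valid bout, so the "size - 1" reading does not apply to it.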

Maybe it would be better to manually create groups and then count each groupsize?

Could set a counter at 1 and then increment it when it encounters a NaN and reset it to 1 when it encounters a non-NaN?
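
That counter doesn't need an explicit loop. A minimal sketch on a toy series, assuming the groups are defined the same way as `enmogroup` above: `cumcount()` numbers the rows within each group starting from 0, so adding 1 gives a counter that is 1 at each non-NaN value and increments through the NaNs that follow.

```python
import numpy as np
import pandas as pd

enmo = pd.Series([0.1, np.nan, np.nan, 0.3, np.nan, 0.5])

# A new group starts at every non-NaN value
group = enmo.notna().cumsum()

# Position within the group, shifted to start at 1
counter = enmo.groupby(group).cumcount() + 1
print(counter.tolist())  # [1, 2, 3, 1, 2, 1]
```

The last counter value reached before each reset, minus 1, is the length of the NaN gap that follows a valid bout.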