In [1]:
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.2f' %x)
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as datetime
import glob

# Sleep Data

For this project, I want to look into my quality of sleep according to my Fitbit. I was diagnosed with moderate sleep apnea and occasionally wake up with migraines, which affect the rest of my day and my productivity. I want to investigate my quality of sleep, including how often I enter each stage of sleep and how much time I spend in it.

In [2]:
# Explore the structure of JSON sleep files
with open('../Data/CurtisHiga/user-site-export/sleep-2019-01-23.json', 'r') as json_file:
    json_data = json.load(json_file)

In [3]:
json_data

[{'logId': 21320930034,
  'dateOfSleep': '2019-02-22',
  'startTime': '2019-02-22T02:08:30.000',
  'endTime': '2019-02-22T11:52:00.000',
  'duration': 34980000,
  'minutesToFallAsleep': 0,
  'minutesAsleep': 501,
  'minutesAwake': 82,
  'minutesAfterWakeup': 0,
  'timeInBed': 583,
  'efficiency': 87,
  'type': 'stages',
  'infoCode': 0,
  'levels': {'summary': {'deep': {'count': 6,
     'minutes': 79,
     'thirtyDayAvgMinutes': 66},
    'wake': {'count': 31, 'minutes': 82, 'thirtyDayAvgMinutes': 57},
    'light': {'count': 37, 'minutes': 353, 'thirtyDayAvgMinutes': 219},
    'rem': {'count': 7, 'minutes': 69, 'thirtyDayAvgMinutes': 63}},
   'data': [{'dateTime': '2019-02-22T02:08:30.000',
     'level': 'wake',
     'seconds': 480},
    {'dateTime': '2019-02-22T02:16:30.000', 'level': 'light', 'seconds': 900},
    {'dateTime': '2019-02-22T02:31:30.000', 'level': 'deep', 'seconds': 960},
    {'dateTime': '2019-02-22T02:47:30.000', 'level': 'wake', 'seconds': 510},
    {'dateTime': '2019

It appears each entry is logged with a specific ``logId``, which can be used as an index for my data frame. The items I want as columns also appear to be nested in the first level of each entry in the list of dictionaries. That's important to remember when applying the *read_json* function of the Pandas library to each of the sleep JSON files.

In [4]:
# Use read_json to read in json file as a data frame
sleep_jan19 = pd.read_json('../Data/CurtisHiga/user-site-export/sleep-2019-01-23.json',
                          orient = 'columns',
                          convert_dates = ['dateOfSleep', 'endTime', 'startTime'])

As stated above, the ``logId`` value seems to be unique and could be used as the index for the data frame. I want to take a quick look at the structure of the data frame to make sure it imported as I expected. The *transpose* method is applied here only to make all the columns easier to view.

In [5]:
sleep_jan19.set_index('logId', inplace = True)
sleep_jan19.head().transpose()

logId,21320930034,21295605623,21294114499,21281516129,21269375088
dateOfSleep,2019-02-22 00:00:00,2019-02-20 00:00:00,2019-02-20 00:00:00,2019-02-19 00:00:00,2019-02-18 00:00:00
duration,34980000,9780000,12900000,32400000,27120000
efficiency,87,95,91,93,95
endTime,2019-02-22 11:52:00,2019-02-20 12:21:30,2019-02-20 07:29:00,2019-02-19 09:16:00,2019-02-18 10:29:00
infoCode,0,2,0,0,0
levels,"{'summary': {'deep': {'count': 6, 'minutes': 7...","{'summary': {'restless': {'count': 4, 'minutes...","{'summary': {'deep': {'count': 2, 'minutes': 5...","{'summary': {'deep': {'count': 3, 'minutes': 6...","{'summary': {'deep': {'count': 4, 'minutes': 6..."
minutesAfterWakeup,0,0,0,0,0
minutesAsleep,501,155,180,453,401
minutesAwake,82,8,35,87,51
minutesToFallAsleep,0,0,0,0,0


The JSON file seems to have imported successfully and I'm satisfied with the result for the time being. Below is a list of things I want to do in terms of cleaning up this data before moving on.
+ Investigate ``levels``
+ Ensure ``dateOfSleep``, ``startTime``, and ``endTime`` are in *datetime* formats
    + Consider splitting dates into columns
+ Sort the data by ``dateOfSleep``
+ Determine the difference between ``duration`` and ``timeInBed`` plus how they correlate to the different ``levels`` of sleep
+ Handle naps
    + Naps could be indicative of a day where I had a migraine
    + Possibly remove them after deciding what to do with them
+ Determine what ``infoCode`` and ``type`` represent

## ``levels``

Before things get too complicated, I want to take a look at the ``levels`` column and what data lies in each observation.

In [6]:
# Review structure of a 'level' observation
sleep_jan19['levels'][21320930034]

{'summary': {'deep': {'count': 6, 'minutes': 79, 'thirtyDayAvgMinutes': 66},
  'wake': {'count': 31, 'minutes': 82, 'thirtyDayAvgMinutes': 57},
  'light': {'count': 37, 'minutes': 353, 'thirtyDayAvgMinutes': 219},
  'rem': {'count': 7, 'minutes': 69, 'thirtyDayAvgMinutes': 63}},
 'data': [{'dateTime': '2019-02-22T02:08:30.000',
   'level': 'wake',
   'seconds': 480},
  {'dateTime': '2019-02-22T02:16:30.000', 'level': 'light', 'seconds': 900},
  {'dateTime': '2019-02-22T02:31:30.000', 'level': 'deep', 'seconds': 960},
  {'dateTime': '2019-02-22T02:47:30.000', 'level': 'wake', 'seconds': 510},
  {'dateTime': '2019-02-22T02:56:00.000', 'level': 'light', 'seconds': 1800},
  {'dateTime': '2019-02-22T03:26:00.000', 'level': 'deep', 'seconds': 480},
  {'dateTime': '2019-02-22T03:34:00.000', 'level': 'light', 'seconds': 660},
  {'dateTime': '2019-02-22T03:45:00.000', 'level': 'rem', 'seconds': 810},
  {'dateTime': '2019-02-22T03:58:30.000', 'level': 'light', 'seconds': 990},
  {'dateTime': '20

The data contained within the ``levels`` column seems to be a detailed summary of the count and duration of each phase of sleep. Right now, I don't need to know when during the night each phase occurred or how long each individual episode lasted. The overall totals for each phase will suffice. Instead of handling the nested dictionaries after importing them from the JSON file, it'll be easier to normalize the JSON data so that the nested dictionaries are imported as separate columns.
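
To make the effect of normalizing concrete, here is a minimal sketch on a single trimmed-down record. The record keeps only a few fields from the entry shown above, and it assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`.

```python
import pandas as pd

# Trimmed-down copy of one sleep entry (most fields omitted for brevity)
record = [{
    'logId': 21320930034,
    'type': 'stages',
    'levels': {'summary': {
        'deep': {'count': 6, 'minutes': 79, 'thirtyDayAvgMinutes': 66},
        'rem': {'count': 7, 'minutes': 69, 'thirtyDayAvgMinutes': 63},
    }},
}]

# Each nested key becomes its own column, with the path joined by dots
flat = pd.json_normalize(record)
print(sorted(flat.columns))
```

Each leaf of the nested dictionary ends up as a column such as `levels.summary.deep.minutes`, which is the naming pattern the normalized data frame uses.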

In [7]:
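# Normalize JSON data and import as a DataFrame
# (pd.json_normalize is the top-level name since pandas 1.0;
# older versions expose it as pd.io.json.json_normalize)
json_df = pd.json_normalize(json_data)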

In [8]:
json_df.head().transpose()

,0,1,2,3,4
dateOfSleep,2019-02-22,2019-02-20,2019-02-20,2019-02-19,2019-02-18
duration,34980000,9780000,12900000,32400000,27120000
efficiency,87,95,91,93,95
endTime,2019-02-22T11:52:00.000,2019-02-20T12:21:30.000,2019-02-20T07:29:00.000,2019-02-19T09:16:00.000,2019-02-18T10:29:00.000
infoCode,0,2,0,0,0
levels.data,"[{'dateTime': '2019-02-22T02:08:30.000', 'leve...","[{'dateTime': '2019-02-20T09:38:30.000', 'leve...","[{'dateTime': '2019-02-20T03:54:00.000', 'leve...","[{'dateTime': '2019-02-19T00:15:30.000', 'leve...","[{'dateTime': '2019-02-18T02:57:00.000', 'leve..."
levels.shortData,"[{'dateTime': '2019-02-22T02:22:30.000', 'leve...",,"[{'dateTime': '2019-02-20T04:40:30.000', 'leve...","[{'dateTime': '2019-02-19T01:09:30.000', 'leve...","[{'dateTime': '2019-02-18T02:57:00.000', 'leve..."
levels.summary.asleep.count,,0.00,,,
levels.summary.asleep.minutes,,155.00,,,
levels.summary.awake.count,,0.00,,,


In [9]:
# Set index of json_df to 'logId'
json_df.set_index('logId', inplace = True)

Since I'm only interested in the ``summary`` of ``levels``, the columns ``levels.data`` and ``levels.shortData`` aren't needed and can be removed.

In [10]:
# Drop 'levels.data' and 'levels.shortData'
json_df.drop(['levels.data', 'levels.shortData'], axis = 1, inplace = True)

## datetime Objects

Next, the columns ``dateOfSleep``, ``startTime``, and ``endTime`` need to be converted into *datetime* objects.

In [11]:
json_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31 entries, 21320930034 to 20977707058
Data columns (total 30 columns):
dateOfSleep                                 31 non-null object
duration                                    31 non-null int64
efficiency                                  31 non-null int64
endTime                                     31 non-null object
infoCode                                    31 non-null int64
levels.summary.asleep.count                 11 non-null float64
levels.summary.asleep.minutes               11 non-null float64
levels.summary.awake.count                  11 non-null float64
levels.summary.awake.minutes                11 non-null float64
levels.summary.deep.count                   20 non-null float64
levels.summary.deep.minutes                 20 non-null float64
levels.summary.deep.thirtyDayAvgMinutes     20 non-null float64
levels.summary.light.count                  20 non-null float64
levels.summary.light.minutes                20 non-nul

The date columns come through as *object* columns (strings) rather than *datetime* objects, as the ``dateOfSleep`` and ``endTime`` rows above show, so they still need to be converted. It may be necessary to split each of these columns into separate datetime columns later, but for now that's fine. There are also missing values in a number of the *levels* columns. It should be safe to fill these *null* values with $0$, assuming that a value is missing because there were no occurrences.
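
Before filling, it can help to see where the missing values sit. The sketch below uses a hypothetical two-row frame built from values shown earlier (the second log is assumed to be a *classic* one) and fills only the ``levels.summary`` columns, a slightly narrower variant of filling the whole data frame.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: one 'stages' log and one 'classic' log
sample = pd.DataFrame({
    'type': ['stages', 'classic'],
    'efficiency': [87, 95],
    'levels.summary.deep.minutes': [79.0, np.nan],
    'levels.summary.asleep.minutes': [np.nan, 155.0],
}, index=pd.Index([21320930034, 21295605623], name='logId'))

# Count missing values per sleep type before filling anything
missing_by_type = sample.drop(columns='type').isna().groupby(sample['type']).sum()
print(missing_by_type)

# Fill only the summary columns, leaving every other column untouched
summary_cols = [col for col in sample.columns if col.startswith('levels.summary.')]
sample[summary_cols] = sample[summary_cols].fillna(0)
```

If the missing values line up with the sleep type like this, they reflect fields that one type of log never records rather than gaps in the measurements.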

In [12]:
# Fill NaN with 0
json_df.fillna(0, inplace = True)

In [13]:
# Create a function to convert columns to datetime
def convert_dt_column(df, column_names):
    '''Convert list of columns to datetime objects'''
    for column in column_names:
        df[column] = pd.to_datetime(df[column])

In [14]:
# Apply convert_dt_column to json_df columns
convert_dt_column(json_df, ['dateOfSleep', 'startTime', 'endTime'])

In [15]:
# Verify datetime columns are present
json_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31 entries, 21320930034 to 20977707058
Data columns (total 30 columns):
dateOfSleep                                 31 non-null datetime64[ns]
duration                                    31 non-null int64
efficiency                                  31 non-null int64
endTime                                     31 non-null datetime64[ns]
infoCode                                    31 non-null int64
levels.summary.asleep.count                 31 non-null float64
levels.summary.asleep.minutes               31 non-null float64
levels.summary.awake.count                  31 non-null float64
levels.summary.awake.minutes                31 non-null float64
levels.summary.deep.count                   31 non-null float64
levels.summary.deep.minutes                 31 non-null float64
levels.summary.deep.thirtyDayAvgMinutes     31 non-null float64
levels.summary.light.count                  31 non-null float64
levels.summary.light.minutes          

In [16]:
# Sort json_df
json_df.sort_index(inplace = True)

## ``duration``

I have a hunch that ``duration`` and ``timeInBed`` are the same measurement in different units, and that ``timeInBed`` is the difference in minutes between ``startTime`` and ``endTime``. Some simple math can verify this.
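
As a quick worked check on a single log, the arithmetic can be done by hand with the values from the first entry shown earlier (``logId`` 21320930034). This is only a sketch for one row; the variable names are my own.

```python
import pandas as pd

# Values copied from the first sleep entry
start = pd.Timestamp('2019-02-22T02:08:30.000')
end = pd.Timestamp('2019-02-22T11:52:00.000')
duration = 34980000
time_in_bed = 583

# Elapsed time between start and end, in minutes
elapsed_minutes = (end - start).total_seconds() / 60

# If duration is in milliseconds, dividing by 60000 gives minutes
duration_minutes = duration / 60000

print(elapsed_minutes, duration_minutes, time_in_bed)
```

For this row the elapsed time is half a minute longer than ``timeInBed``, so the comparison only holds after truncating to whole minutes, which is what casting to an integer does.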

In [17]:
# Select the time-related columns from json_df; .copy() gives an independent
# frame, so adding columns below does not raise a SettingWithCopyWarning
timed = json_df[['duration', 'startTime', 'endTime', 'timeInBed']].copy()

In [18]:
# Compute the difference between 'endTime' and 'startTime' in whole minutes
timed['start_end_diff'] = ((timed['endTime'] - timed['startTime']).dt.total_seconds()/60).astype('int')

In [19]:
timed.head()

logId,duration,startTime,endTime,timeInBed,start_end_diff
20977707058,31680000,2019-01-26 02:09:00,2019-01-26 10:57:30,528,528
20992920247,13320000,2019-01-27 03:33:00,2019-01-27 07:15:30,222,222
20995444650,8400000,2019-01-27 09:52:30,2019-01-27 12:12:30,140,140
20997357151,4200000,2019-01-27 15:36:00,2019-01-27 16:46:00,70,70
21011938251,4320000,2019-01-28 16:52:30,2019-01-28 18:05:00,72,72


In [20]:
# Create a boolean column to verify that 'timeInBed' == 'start_end_diff'
timed['diff_bool'] = (timed['timeInBed'] == timed['start_end_diff'])

In [21]:
# Divide 'duration' by 'timeInBed'
timed['timeInBed_factor'] = timed['duration']/timed['timeInBed']


If my hunch is true, ``diff_bool`` and ``timeInBed_factor`` should each hold a single value across all observations.

In [22]:
timed.groupby(['diff_bool', 'timeInBed_factor']).nunique()

diff_bool,timeInBed_factor,duration,startTime,endTime,timeInBed,start_end_diff,diff_bool,timeInBed_factor
True,60000.0,29,31,31,29,29,1,1


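My hunch seems to be true, seeing as there's only one unique combination of ``diff_bool`` and ``timeInBed_factor``. Every row matches, and the constant factor of $60000$ means ``duration`` is ``timeInBed`` expressed in milliseconds. The ``duration`` column is therefore redundant and can be removed.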

In [23]:
# Remove 'duration' column
json_df.drop('duration', axis = 1, inplace = True)

## ``infoCode`` & ``type``

I'm curious as to what the different values under ``type`` and ``infoCode`` represent. Grouping by each column and aggregating by the mean might provide a better understanding.

In [24]:
# Group data by type and aggregate by mean
json_df.groupby('type').mean().transpose()

type,classic,stages
efficiency,93.64,91.9
infoCode,2.0,0.0
levels.summary.asleep.count,0.0,0.0
levels.summary.asleep.minutes,82.82,0.0
levels.summary.awake.count,0.18,0.0
levels.summary.awake.minutes,0.27,0.0
levels.summary.deep.count,0.0,3.7
levels.summary.deep.minutes,0.0,66.35
levels.summary.deep.thirtyDayAvgMinutes,0.0,68.45
levels.summary.light.count,0.0,25.55


The first thing to notice is that the ``infoCode`` for a *classic* ``type`` averaged out to $2$. That could simply be a coincidence, but my initial assumption would be that ``infoCode`` is a numeric representation of ``type``. It'll become clearer once the ``infoCode`` column is investigated further.

The next question is what distinguishes the different ``type`` values. Based on the mean of the values, especially the ``minutesAsleep`` column, my guess would be that a *classic* type represents a *nap* whereas *stages* represents an extended period of sleep.

In [25]:
# Group data by infoCode and aggregate by mean
json_df.groupby('infoCode').mean().transpose()

infoCode,0,2
efficiency,91.9,93.64
levels.summary.asleep.count,0.0,0.0
levels.summary.asleep.minutes,0.0,82.82
levels.summary.awake.count,0.0,0.18
levels.summary.awake.minutes,0.0,0.27
levels.summary.deep.count,3.7,0.0
levels.summary.deep.minutes,66.35,0.0
levels.summary.deep.thirtyDayAvgMinutes,68.45,0.0
levels.summary.light.count,25.55,0.0
levels.summary.light.minutes,225.4,0.0


After grouping the data by ``infoCode`` and aggregating by the mean, the same values are calculated as when the data was grouped by ``type``. So my assumption that ``infoCode`` is a numeric representation of ``type`` seems to be correct. More data will be used to further validate this assumption but for now, both columns will be kept.
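
Matching group means are suggestive, but a cross-tabulation checks the pairing directly. The sketch below uses the ``type`` and ``infoCode`` values from five logs as sample data; the variable names are my own, and the same two calls could be pointed at the full data frame.

```python
import pandas as pd

# type / infoCode pairs taken from five sleep logs
sample = pd.DataFrame({
    'type': ['stages', 'classic', 'stages', 'stages', 'classic'],
    'infoCode': [0, 2, 0, 0, 2],
})

# Count how often each (type, infoCode) combination occurs
counts = pd.crosstab(sample['type'], sample['infoCode'])
print(counts)

# The mapping is one-to-one if each type has exactly one infoCode and vice versa
one_to_one = bool(
    sample.groupby('type')['infoCode'].nunique().eq(1).all()
    and sample.groupby('infoCode')['type'].nunique().eq(1).all()
)
print(one_to_one)
```

If any combination off the expected pairing had a non-zero count, ``infoCode`` would be carrying information beyond ``type``.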

# Aggregating Sleep Data

Now that I know what data I need from the sleep files, I can create a function to aggregate all the sleep files into one data frame. This function should:
+ Read in and normalize each JSON sleep file
+ Remove ``levels.data``, ``levels.shortData``, and ``duration`` columns
+ Set ``logId`` as the index
+ Set missing values to $0$
+ Set ``dateOfSleep``, ``startTime``, and ``endTime`` to datetime objects
+ Order the data by ``logId``

In [26]:
def aggregate_json(file_string, index_col = None):
    '''Search the data folder for file names matching file_string, read each
    matching JSON file, and aggregate all of the data into one data frame'''
    
    # Collect one normalized data frame per matching file
    frames = []
    
    # Search data folder for files matching file_string
    for file in glob.glob('../Data/CurtisHiga/user-site-export/' + file_string):
        with open(file, 'r') as json_f:
            json_dict = json.load(json_f)
    
        # Normalize JSON data (pd.json_normalize needs pandas 1.0 or later;
        # older versions expose it as pd.io.json.json_normalize)
        frames.append(pd.json_normalize(json_dict))
    
    # Combine all frames in one step; DataFrame.append is deprecated in
    # newer pandas and copies the data on every call
    aggregate_df = pd.concat(frames) if frames else pd.DataFrame()
    
    # Return the data frame as-is if no index column was requested
    if index_col is None:
        return aggregate_df
    
    elif isinstance(index_col, str):
        # Set column as index if column exists
        try:
            return aggregate_df.set_index(index_col)
        except KeyError:
            raise ValueError("%s does not exist" % index_col)  
            
    else:
        raise TypeError("index_col must be type 'str'")

In [27]:
sleep_df = aggregate_json('sleep*.json', index_col = 'logId')

In [28]:
sleep_df.head()

logId,dateOfSleep,duration,efficiency,endTime,infoCode,levels.data,levels.shortData,levels.summary.asleep.count,levels.summary.asleep.minutes,levels.summary.awake.count,...,levels.summary.wake.count,levels.summary.wake.minutes,levels.summary.wake.thirtyDayAvgMinutes,minutesAfterWakeup,minutesAsleep,minutesAwake,minutesToFallAsleep,startTime,timeInBed,type
20919781966,2019-01-21,31680000,93,2019-01-21T11:14:30.000,0,"[{'dateTime': '2019-01-21T02:26:00.000', 'leve...","[{'dateTime': '2019-01-21T04:07:00.000', 'leve...",,,,...,30.0,64.0,55.0,0,464,64,0,2019-01-21T02:26:00.000,528,stages
20883311218,2019-01-18,30180000,95,2019-01-18T10:51:30.000,0,"[{'dateTime': '2019-01-18T02:28:00.000', 'leve...","[{'dateTime': '2019-01-18T02:28:00.000', 'leve...",,,,...,45.0,60.0,54.0,0,443,60,0,2019-01-18T02:28:00.000,503,stages
20869802728,2019-01-17,29280000,94,2019-01-17T10:31:00.000,0,"[{'dateTime': '2019-01-17T02:22:30.000', 'leve...","[{'dateTime': '2019-01-17T02:22:30.000', 'leve...",,,,...,42.0,58.0,54.0,0,430,58,0,2019-01-17T02:22:30.000,488,stages
20844157660,2019-01-15,29100000,95,2019-01-15T14:34:30.000,0,"[{'dateTime': '2019-01-15T06:29:00.000', 'leve...","[{'dateTime': '2019-01-15T06:35:30.000', 'leve...",,,,...,39.0,51.0,54.0,0,434,51,0,2019-01-15T06:29:00.000,485,stages
20812478867,2019-01-13,27240000,90,2019-01-13T03:50:30.000,0,"[{'dateTime': '2019-01-12T20:16:00.000', 'leve...","[{'dateTime': '2019-01-12T20:16:00.000', 'leve...",,,,...,33.0,59.0,54.0,1,395,59,0,2019-01-12T20:16:00.000,454,stages


Now that a function has been created to read and consolidate data into a data frame, another function can be created to clean said data frame.

In [29]:
def clean_sleep(df):
    '''This function should wrangle the Fitbit sleep data in a format needed for
    my analysis'''
    
    # Drop unnecessary columns
    df.drop(['levels.data', 'levels.shortData', 'duration'], axis = 1, inplace = True)
    
    # Fill NaN values with 0
    df.fillna(0, inplace = True)
    
    # Convert date columns to datetime objects
    convert_dt_column(df, ['dateOfSleep', 'startTime', 'endTime'])
    
    # Order the data by logId (the index)
    df.sort_index(inplace = True)
    
    return df

In [30]:
sleep_cleaned = clean_sleep(sleep_df)

In [31]:
sleep_cleaned.head(10)

logId,dateOfSleep,efficiency,endTime,infoCode,levels.summary.asleep.count,levels.summary.asleep.minutes,levels.summary.awake.count,levels.summary.awake.minutes,levels.summary.deep.count,levels.summary.deep.minutes,...,levels.summary.wake.count,levels.summary.wake.minutes,levels.summary.wake.thirtyDayAvgMinutes,minutesAfterWakeup,minutesAsleep,minutesAwake,minutesToFallAsleep,startTime,timeInBed,type
20597811657,2018-12-26,96,2018-12-26 10:55:00,0,0.0,0.0,0.0,0.0,6.0,65.0,...,32.0,33.0,0.0,0,453,33,0,2018-12-26 02:49:00,486,stages
20598864707,2018-12-26,95,2018-12-26 14:09:30,2,0.0,95.0,1.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0,95,5,0,2018-12-26 12:29:00,100,classic
20608911100,2018-12-27,94,2018-12-27 10:23:30,0,0.0,0.0,0.0,0.0,3.0,63.0,...,33.0,66.0,33.0,0,393,66,0,2018-12-27 02:44:00,459,stages
20624450075,2018-12-28,94,2018-12-28 10:01:30,0,0.0,0.0,0.0,0.0,6.0,83.0,...,37.0,49.0,50.0,1,400,49,0,2018-12-28 02:32:00,449,stages
20626208454,2018-12-28,97,2018-12-28 15:58:00,2,0.0,83.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0,83,3,0,2018-12-28 14:32:00,86,classic
20635724038,2018-12-29,94,2018-12-29 10:37:30,0,0.0,0.0,0.0,0.0,3.0,84.0,...,21.0,40.0,49.0,0,346,40,0,2018-12-29 04:11:00,386,stages
20637499610,2018-12-29,94,2018-12-29 14:58:00,2,0.0,138.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0,138,9,0,2018-12-29 12:31:00,147,classic
20669631480,2019-01-01,90,2019-01-01 10:58:30,0,0.0,0.0,0.0,0.0,2.0,66.0,...,35.0,53.0,47.0,0,389,53,0,2019-01-01 03:36:30,442,stages
20681927031,2019-01-02,93,2019-01-02 11:27:00,0,0.0,0.0,0.0,0.0,5.0,104.0,...,34.0,50.0,48.0,0,450,50,0,2019-01-02 03:06:30,500,stages
20694317514,2019-01-03,93,2019-01-03 10:22:00,0,0.0,0.0,0.0,0.0,6.0,70.0,...,36.0,52.0,49.0,0,404,52,0,2019-01-03 02:46:00,456,stages


The data seems to be in a format suitable for my needs and can be exported to a CSV. Next I want to look at my heart rate data.

In [32]:
# Export sleep_cleaned to CSV
sleep_cleaned.to_csv('../Data/sleep_cleaned.csv', index = True)
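
One thing to keep in mind is that a CSV stores the datetime columns as plain text, so they need to be parsed again when the file is read back. The sketch below shows the round trip on a small sample with a few of the columns, using an in-memory buffer in place of a file on disk; for the real file, the path passed to `to_csv` would go where `buffer` is.

```python
import io
import pandas as pd

# Small sample with a few of the cleaned columns
sleep_sample = pd.DataFrame({
    'logId': [20597811657, 20598864707],
    'dateOfSleep': pd.to_datetime(['2018-12-26', '2018-12-26']),
    'startTime': pd.to_datetime(['2018-12-26 02:49:00', '2018-12-26 12:29:00']),
    'endTime': pd.to_datetime(['2018-12-26 10:55:00', '2018-12-26 14:09:30']),
    'type': ['stages', 'classic'],
}).set_index('logId')

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sleep_sample.to_csv(buffer, index=True)
buffer.seek(0)

# Restore the index and the datetime columns while reading
reloaded = pd.read_csv(buffer, index_col='logId',
                       parse_dates=['dateOfSleep', 'startTime', 'endTime'])
print(reloaded.dtypes)
```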

# Heart Rate Data

According to Fitbit, it uses your heart rate as an indicator for when you sleep and which stage of sleep you're in. It may be useful to look into the heart rate data to better understand the different stages of sleep. There are two types of heart rate JSON data, *heart_rate* and *time_in_heart_rate_zones*. I'll be looking at both types of files.

First, like with the sleep data, I want to take a look at the structure of the JSON files for the heart rate data.

In [33]:
# Import a heart_rate JSON file
with open('../Data/CurtisHiga/user-site-export/heart_rate-2018-12-27.json', 'r') as hr_json_f:
    hr_json_dict = json.load(hr_json_f)

hr_json_dict

[{'dateTime': '12/27/18 08:00:09', 'value': {'bpm': 80, 'confidence': 1}},
 {'dateTime': '12/27/18 08:00:14', 'value': {'bpm': 81, 'confidence': 1}},
 {'dateTime': '12/27/18 08:00:19', 'value': {'bpm': 81, 'confidence': 2}},
 {'dateTime': '12/27/18 08:00:34', 'value': {'bpm': 81, 'confidence': 1}},
 {'dateTime': '12/27/18 08:00:39', 'value': {'bpm': 82, 'confidence': 1}},
 {'dateTime': '12/27/18 08:00:49', 'value': {'bpm': 83, 'confidence': 1}},
 {'dateTime': '12/27/18 08:00:54', 'value': {'bpm': 82, 'confidence': 1}},
 {'dateTime': '12/27/18 08:01:09', 'value': {'bpm': 82, 'confidence': 1}},
 {'dateTime': '12/27/18 08:01:14', 'value': {'bpm': 81, 'confidence': 1}},
 {'dateTime': '12/27/18 08:01:19', 'value': {'bpm': 79, 'confidence': 1}},
 {'dateTime': '12/27/18 08:01:24', 'value': {'bpm': 78, 'confidence': 2}},
 {'dateTime': '12/27/18 08:01:29', 'value': {'bpm': 75, 'confidence': 2}},
 {'dateTime': '12/27/18 08:01:34', 'value': {'bpm': 73, 'confidence': 2}},
 {'dateTime': '12/27/18 0

In [34]:
# Import a time_in_heart_rate_zone JSON file
with open('../Data/CurtisHiga/user-site-export/time_in_heart_rate_zones-2018-12-27.json', 'r') as tihrz_json_f:
    tihrz_json_dict = json.load(tihrz_json_f)

tihrz_json_dict

[{'dateTime': '12/27/18 00:00:00',
  'value': {'valuesInZones': {'IN_DEFAULT_ZONE_2': 19.0,
    'BELOW_DEFAULT_ZONE_1': 1347.0,
    'IN_DEFAULT_ZONE_3': 6.0,
    'IN_DEFAULT_ZONE_1': 33.0}}}]

The data in the *time_in_heart_rate_zones* files doesn't seem to provide valuable insight for my purposes, so it can be ignored. As for the overall *heart_rate* data, it may be useful to extract the heart rate readings from the times when I'm asleep, even if they don't end up being used.

As with the sleep data, the *aggregate_json* function will be used to aggregate all the heart rate data into one data frame.

In [35]:
# Use aggregate_json to get heart rate data
heartrate_df = aggregate_json('heart_rate*.json', index_col = 'dateTime')

In [36]:
# Convert dateTime to datetime object
heartrate_df.index = pd.to_datetime(heartrate_df.index)

In [37]:
heartrate_df.head(10)

dateTime,value.bpm,value.confidence
2018-12-25 19:51:16,70,0
2018-12-25 19:51:26,61,1
2018-12-25 19:51:31,59,3
2018-12-25 19:51:36,59,2
2018-12-25 19:51:41,58,2
2018-12-25 19:51:56,58,2
2018-12-25 19:52:11,59,2
2018-12-25 19:52:16,60,2
2018-12-25 19:52:21,61,2
2018-12-25 19:52:26,60,3
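
The heart rate timestamps use a month/day/two-digit-year layout such as `12/27/18 08:00:09`. Letting pandas infer the format works here, but passing the format explicitly removes any ambiguity between month and day. A minimal sketch on two of the raw strings shown above:

```python
import pandas as pd

# Two raw timestamps in the heart rate export's format
raw = pd.Series(['12/27/18 08:00:09', '12/27/18 08:00:14'])

# Spell out month/day/two-digit-year so nothing has to be guessed
parsed = pd.to_datetime(raw, format='%m/%d/%y %H:%M:%S')
print(parsed)
```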


This is all the data that I'm interested in extracting from my Fitbit data.

## Combine Heart Rate & Sleep Data

The next challenge is to keep only the heart rate data recorded while I'm asleep. That should be done using the ``startTime`` and ``endTime`` columns in *sleep_cleaned* and removing any observation in *heartrate_df* that falls outside those ranges. Filtering the heart rate data is one challenge; deciding how to combine it with the sleep data is another. My idea is to add a ``logId`` column to *heartrate_df* holding the ``logId`` of the sleep observation whose ``startTime`` and ``endTime`` range contains the heart rate reading.
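
As an aside, the same labeling idea can be sketched without an explicit loop by using `pd.merge_asof`, which pairs each reading with the most recent sleep log that started at or before it. This is an alternative sketch, not the approach taken in this notebook. The two sleep ranges and the first reading come from the data, the other three readings are made up for illustration, and it assumes the sleep ranges don't overlap.

```python
import pandas as pd

# Two sleep logs (a night's sleep and a nap)
sleep = pd.DataFrame({
    'logId': [20597811657, 20598864707],
    'startTime': pd.to_datetime(['2018-12-26 02:49:00', '2018-12-26 12:29:00']),
    'endTime': pd.to_datetime(['2018-12-26 10:55:00', '2018-12-26 14:09:30']),
})

# Four readings; only the first is real, the rest are hypothetical
heart = pd.DataFrame({
    'dateTime': pd.to_datetime(['2018-12-26 02:49:05', '2018-12-26 11:30:00',
                                '2018-12-26 12:30:00', '2018-12-25 19:51:16']),
    'bpm': [70, 85, 62, 70],
})

# merge_asof requires both frames to be sorted on their merge keys
heart = heart.sort_values('dateTime')
sleep = sleep.sort_values('startTime')

# Attach the latest sleep log that started at or before each reading
labeled = pd.merge_asof(heart, sleep, left_on='dateTime',
                        right_on='startTime', direction='backward')

# Keep readings taken before that log ended; unmatched rows compare as False
asleep = labeled[labeled['dateTime'] <= labeled['endTime']]
asleep = asleep.astype({'logId': 'int64'})
print(asleep[['logId', 'dateTime', 'bpm']])
```

The reading before the first sleep log and the one between the two logs are both dropped, leaving one reading per log.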

In [38]:
# Create a data frame of the startTime and endTime of each sleep observation
sleep_time_ranges = sleep_cleaned[['startTime', 'endTime']]

In [39]:
sleep_time_ranges.head()

logId,startTime,endTime
20597811657,2018-12-26 02:49:00,2018-12-26 10:55:00
20598864707,2018-12-26 12:29:00,2018-12-26 14:09:30
20608911100,2018-12-27 02:44:00,2018-12-27 10:23:30
20624450075,2018-12-28 02:32:00,2018-12-28 10:01:30
20626208454,2018-12-28 14:32:00,2018-12-28 15:58:00


In [40]:
# Define a function to check if time is between two specific times
def is_time_between(starttime, endtime, check_time):
    '''Checks if check_time is between starttime and endtime'''
    return ((starttime <= check_time) & (check_time <= endtime))

In [41]:
# Reset the index of heartrate_df so that 'dateTime' becomes a regular column
heartrate_df.reset_index(inplace = True)

In [42]:
# Iterate over all observations in sleep_time_ranges
# Input the logId if the time in heartrate_df is between a period in sleep_time_ranges
# (columns are accessed by label; positional access like values[0] is deprecated in newer pandas)
for log, values in sleep_time_ranges.iterrows():
    heartrate_df.loc[is_time_between(values['startTime'], values['endTime'], heartrate_df['dateTime']), 'logId'] = log

In [43]:
# Drop NaN from heartrate_df
heartrate_df.dropna(inplace = True)

# Set index of heartrate_df to logId
heartrate_df.set_index('logId', inplace = True)

In [44]:
# Preview the heart rate observations, now indexed by the logId of the
# sleep period in which they were recorded
heartrate_df.head(10)

logId,dateTime,value.bpm,value.confidence
20597811657.0,2018-12-26 02:49:05,70,1
20597811657.0,2018-12-26 02:49:10,73,1
20597811657.0,2018-12-26 02:49:15,74,1
20597811657.0,2018-12-26 02:49:20,73,1
20597811657.0,2018-12-26 02:49:25,74,1
20597811657.0,2018-12-26 02:49:35,79,1
20597811657.0,2018-12-26 02:49:40,80,1
20597811657.0,2018-12-26 02:49:45,79,2
20597811657.0,2018-12-26 02:49:55,79,1
20597811657.0,2018-12-26 02:50:10,76,1


I'm satisfied with the format of the heart rate data frame and can export it to a CSV.

In [45]:
# Export heartrate_df to CSV
heartrate_df.to_csv('../Data/heartrate.csv', index = True)