# Apple Health Extractor

This code parses your Apple Health export data, writes one CSV file per record type, and runs some simple data checks and analysis.

Enjoy! 

--------

## Setup and Usage

* Export your data from the Apple Health app on your phone.
* Unzip export.zip into this directory and rename the extracted folder to data.
* The project directory should now contain the file data/export.xml.
* Run the parser from a notebook inside the project or from the command line.
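
To sanity-check the export before running the parser, the tag counts it prints can be approximated with the standard library. This is a minimal sketch, assuming the export parses as well-formed XML; `count_tags` is a hypothetical helper and not part of apple-health-data-parser, and the inline sample stands in for a real file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count the direct children of the root element by tag name.

    `source` is a file path or a binary file object. The file is streamed,
    so a large export does not have to fit in memory at once.
    """
    counts = Counter()
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 1:  # an element directly under the root just closed
                counts[elem.tag] += 1
                elem.clear()  # release the element's children and attributes
    return counts


# Tiny made-up document with the same top-level tags as a health export.
sample = io.BytesIO(
    b"<HealthData><ExportDate value='x'/><Me/>"
    b"<Record type='a'><MetadataEntry/></Record><Record type='b'/>"
    b"</HealthData>"
)
tag_counts = count_tags(sample)
print(tag_counts)
```

For a real export, the call would be `count_tags("data/export.xml")`, with the path taken from the setup steps above.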

In [1]:
%run -i 'apple-health-data-parser' 'export.xml' 

Reading data from export.xml . . . done
Unexpected node of type ExportDate.

Tags:
ActivitySummary: 686
ExportDate: 1
Me: 1
Record: 1142965
Workout: 106

Fields:
HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
activeEnergyBurned: 686
activeEnergyBurnedGoal: 686
activeEnergyBurnedUnit: 686
appleExerciseTime: 686
appleExerciseTimeGoal: 686
appleStandHours: 686
appleStandHoursGoal: 686
creationDate: 1143071
dateComponents: 686
device: 1125552
duration: 106
durationUnit: 106
endDate: 1143071
sourceName: 1143071
sourceVersion: 1138201
startDate: 1143071
totalDistance: 106
totalDistanceUnit: 106
totalEnergyBurned: 106
totalEnergyBurnedUnit: 106
type: 1142965
unit: 1133858
value: 1142954
workoutActivityType: 106

Record types:
ActiveEnergyBurned: 525528
AppleExerciseTime: 11599
AppleStandHour: 9073
AppleStandTime: 4813
BasalEnergyBurned: 100290
BodyFatPer

-----

# Apple Health Data Check and Simple Data Analysis

In [1]:
import numpy as np
import pandas as pd
import glob
from datetime import date, datetime, timedelta as td
import pytz

----

## Weight

In [2]:
weight = pd.read_csv("BodyMass.csv")

In [55]:
weight.tail()

Unnamed: 0,sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
176,Mi Fit,201907081918,,BodyMass,kg,2020-07-02 07:52:37 +0530,2020-07-02 07:52:31 +0530,2020-07-02 07:52:31 +0530,88.8
177,Mi Fit,201907081918,,BodyMass,kg,2020-07-04 09:09:36 +0530,2020-07-04 09:09:25 +0530,2020-07-04 09:09:25 +0530,90.9
178,Mi Fit,201907081918,,BodyMass,kg,2020-07-05 09:03:03 +0530,2020-07-04 09:10:52 +0530,2020-07-04 09:10:52 +0530,89.4
179,Mi Fit,201907081918,,BodyMass,kg,2020-07-05 09:03:03 +0530,2020-07-05 09:02:55 +0530,2020-07-05 09:02:55 +0530,88.9
180,Mi Fit,201907081918,,BodyMass,kg,2020-07-06 08:33:11 +0530,2020-07-06 08:33:05 +0530,2020-07-06 08:33:05 +0530,88.3


In [56]:
weight.describe()

Unnamed: 0,device,value
count,0.0,181.0
mean,,88.637569
std,,0.806861
min,,84.2
25%,,88.3
50%,,88.6
75%,,89.1
max,,90.9


----

## Steps

In [2]:
steps = pd.read_csv("StepCount.csv")

In [3]:
len(steps)

174943

In [4]:
steps.columns

Index(['sourceName', 'sourceVersion', 'device', 'type', 'unit', 'creationDate',
       'startDate', 'endDate', 'value'],
      dtype='object')

In [5]:
steps.describe()

Unnamed: 0,value
count,174943.0
mean,82.619207
std,214.041698
min,1.0
25%,17.0
50%,40.0
75%,90.0
max,43109.0


In [6]:
# TRIAL CODE TO GROUPBY
# a = len(steps.index)
# for i in range(0,a):
#     steps['creationDate'][i] =steps['creationDate'][i].replace('-',':')[0:19].replace(" ", ":")
#     steps['startDate'][i] =steps['startDate'][i].replace('-',':')[0:19].replace(" ", ":")
#     steps['endDate'][i] =steps['endDate'][i].replace('-',':')[0:19].replace(" ", ":")
#     print(i)
# print(steps['creationDate'])
# steps['creationDate'][0] =steps['creationDate'][0].replace('-',':')[0:19].replace(" ", ":") 
# print(steps['creationDate'][0])


In [7]:
# Helpers that extract date/time elements in the Asia/Kolkata time zone.
# Caveat: replace(tzinfo=pytz.utc) relabels a timestamp as UTC without shifting it,
# so timestamps that already carry a +0530 offset come out 5.5 hours later.
KOLKATA = pytz.timezone('Asia/Kolkata')
convert_tz = lambda x: x.to_pydatetime().replace(tzinfo=pytz.utc).astimezone(KOLKATA)
get_year = lambda x: convert_tz(x).year
get_month = lambda x: convert_tz(x).strftime('%Y-%m')  # single conversion per value
get_date = lambda x: convert_tz(x).strftime('%Y-%m-%d')  # single conversion per value
get_day = lambda x: convert_tz(x).day
get_hour = lambda x: convert_tz(x).hour
get_minute = lambda x: convert_tz(x).minute
get_day_of_week = lambda x: convert_tz(x).weekday()

In [8]:
# parse out date and time elements as Kolkata time
steps['startDate'] = pd.to_datetime(steps['startDate'])
steps['year'] = steps['startDate'].map(get_year)
steps['month'] = steps['startDate'].map(get_month)
steps['date'] = steps['startDate'].map(get_date)
steps['day'] = steps['startDate'].map(get_day)
steps['hour'] = steps['startDate'].map(get_hour)
steps['minute'] = steps['startDate'].map(get_minute)
steps['dow'] = steps['startDate'].map(get_day_of_week)
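
Mapping a Python function over roughly 175,000 rows is slow. The same columns can be derived with pandas' vectorized `.dt` accessor. This is a sketch on two made-up rows, not the notebook's own method: it assumes the `YYYY-MM-DD HH:MM:SS +ZZZZ` timestamp format seen in the CSV output, and it honors the offset in each timestamp, so its results can differ from helpers that treat the wall-clock time as UTC.

```python
import pandas as pd

# Two made-up rows in the same timestamp format as the export CSVs.
demo = pd.DataFrame({
    "startDate": ["2020-07-05 23:50:00 +0530", "2020-07-06 00:10:00 +0530"],
    "value": [10, 20],
})

# Parse the offset, normalize to UTC, then express the times in Kolkata time.
local = pd.to_datetime(
    demo["startDate"], format="%Y-%m-%d %H:%M:%S %z", utc=True
).dt.tz_convert("Asia/Kolkata")

demo["year"] = local.dt.year
demo["month"] = local.dt.strftime("%Y-%m")
demo["date"] = local.dt.strftime("%Y-%m-%d")
demo["day"] = local.dt.day
demo["hour"] = local.dt.hour
demo["minute"] = local.dt.minute
demo["dow"] = local.dt.weekday  # Monday is 0, Sunday is 6
print(demo)
```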

In [9]:
steps_by_date = steps.groupby(['date'])['value'].sum().reset_index(name='Steps')
steps_by_date.to_csv("steps_per_day_shashank.csv", index=False)
steps_by_date.head()

Unnamed: 0,date,Steps
0,2015-12-21,4355
1,2015-12-22,4389
2,2015-12-23,6566
3,2015-12-24,5180
4,2015-12-25,4498


In [13]:
steps_by_date['date'] = pd.to_datetime(steps_by_date['date'])
steps_by_date['dow'] = steps_by_date['date'].dt.weekday
# Number the weeks consecutively: a new week starts each time the weekday of the
# first row recurs. The cumulative sum gives the same numbering as a row-by-row
# loop, without the chained assignment that raises SettingWithCopyWarning.
ref_for_week = steps_by_date['dow'].iloc[0]
steps_by_date['weekNo'] = (steps_by_date['dow'] == ref_for_week).cumsum()


In [101]:
steps_by_date

Unnamed: 0,date,Steps,dow,weekNo
0,2015-12-21,4355,0,1
1,2015-12-22,4389,1,1
2,2015-12-23,6566,2,1
3,2015-12-24,5180,3,1
4,2015-12-25,4498,4,1
...,...,...,...,...
1648,2020-07-02,14843,3,237
1649,2020-07-03,25581,4,237
1650,2020-07-04,14891,5,237
1651,2020-07-05,11537,6,237


In [81]:
# grouping data by week and storing in table
steps_by_week = steps_by_date.groupby(['weekNo'])['Steps'].sum().reset_index(name='Steps')

In [86]:
steps_by_week['stdDev'] = steps_by_date.groupby(['weekNo'])['Steps'].std()
steps_by_week

Unnamed: 0,weekNo,Steps,stdDev
0,1,30513,
1,2,42378,1430.254290
2,3,42809,2658.213310
3,4,31654,2795.473595
4,5,37155,1848.365765
...,...,...,...
233,234,99240,4927.263497
234,235,129670,6948.434342
235,236,181365,8530.072034
236,237,148001,6071.716855
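
The NaN in the first row is an alignment effect. After `reset_index`, `steps_by_week` is indexed 0, 1, 2, ..., while the grouped standard deviation is indexed by `weekNo` (1, 2, 3, ...). Column assignment matches on index labels, so each value lands one row below the week it belongs to and the last week's value is dropped. One way to avoid this, sketched here on made-up data, is to compute both statistics in a single aggregation so they share an index:

```python
import pandas as pd

# Made-up daily totals for two short "weeks".
daily = pd.DataFrame({
    "weekNo": [1, 1, 1, 2, 2, 2],
    "Steps": [1000, 2000, 3000, 4000, 4000, 4000],
})

# Named aggregation: each keyword becomes an output column.
weekly = (
    daily.groupby("weekNo")["Steps"]
    .agg(Steps="sum", stdDev="std")
    .reset_index()
)
print(weekly)
```

Here `stdDev` is the sample standard deviation (pandas' default, `ddof=1`), the same statistic the notebook cell computes.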


In [102]:
steps_by_week.to_csv("steps_per_week_shashank.csv", index=False)
steps_week_np = steps_by_week.to_numpy()
steps_week_np

array([[1.00000000e+00, 3.05130000e+04,            nan],
       [2.00000000e+00, 4.23780000e+04, 1.43025429e+03],
       [3.00000000e+00, 4.28090000e+04, 2.65821331e+03],
       [4.00000000e+00, 3.16540000e+04, 2.79547359e+03],
       [5.00000000e+00, 3.71550000e+04, 1.84836576e+03],
       [6.00000000e+00, 4.78440000e+04, 1.14168493e+03],
       [7.00000000e+00, 3.98530000e+04, 3.19637625e+03],
       [8.00000000e+00, 3.37730000e+04, 2.61593608e+03],
       [9.00000000e+00, 4.41610000e+04, 1.50738733e+03],
       [1.00000000e+01, 3.06040000e+04, 2.65117205e+03],
       [1.10000000e+01, 4.58670000e+04, 1.49621311e+03],
       [1.20000000e+01, 3.49190000e+04, 3.68977727e+03],
       [1.30000000e+01, 4.70890000e+04, 1.88778299e+03],
       [1.40000000e+01, 8.50680000e+04, 2.50903846e+03],
       [1.50000000e+01, 3.31690000e+04, 4.29579488e+03],
       [1.60000000e+01, 4.66400000e+04, 1.23731818e+03],
       [1.70000000e+01, 4.75950000e+04, 1.67399287e+03],
       [1.80000000e+01, 3.33320

In [115]:
def setofInsightMonthly(steps_week, threeWeek=False, twoWeek=False):
    # Computes sliding-window means of the weekly step totals over a 4-week (28-day)
    # window, and optionally over 3-week and 2-week windows.
    # The 'stdDev' entries are allocated but never filled, so they stay at zero.
    steps_week_np = steps_week.to_numpy()
    n = len(steps_week_np)
    # Take the 11 rows before the final two and flip them so index 0 is the most recent
    steps_12week = np.flip(steps_week_np[n - 13:n - 2], axis=0)
    m = len(steps_12week)
    sliding_insight_four_week = {'mean': np.zeros(m - 4), 'stdDev': np.zeros(m - 4)}
    sliding_insight_three_week = {'mean': np.zeros(m - 3), 'stdDev': np.zeros(m - 3)}
    sliding_insight_two_week = {'mean': np.zeros(m - 2), 'stdDev': np.zeros(m - 2)}
    # Mean of the weekly totals (column 1) inside each window
    sliding_insight_four_week['mean'] = [np.mean(steps_12week[i:i + 4, 1]) for i in range(m - 4)]
    if threeWeek:
        sliding_insight_three_week['mean'] = [np.mean(steps_12week[i:i + 3, 1]) for i in range(m - 3)]
    if twoWeek:
        sliding_insight_two_week['mean'] = [np.mean(steps_12week[i:i + 2, 1]) for i in range(m - 2)]
    return sliding_insight_four_week, sliding_insight_three_week, sliding_insight_two_week

A,B,C = setofInsightMonthly(steps_by_week,True,True)
print(A)
print(B)
print(C)

{'mean': [134636.25, 126032.75, 131760.25, 146805.5, 151290.0, 147402.5, 131213.25], 'stdDev': array([0., 0., 0., 0., 0., 0., 0.])}
{'mean': [136758.33333333334, 119060.0, 124820.33333333333, 142600.33333333334, 152984.0, 152736.33333333334, 145676.66666666666, 121810.66666666667], 'stdDev': array([0., 0., 0., 0., 0., 0., 0., 0.])}
{'mean': [155517.5, 114455.0, 113755.0, 137610.5, 149765.5, 156000.5, 152814.5, 138804.5, 109612.0], 'stdDev': array([0., 0., 0., 0., 0., 0., 0., 0., 0.])}
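
The `stdDev` arrays above are all zero because the function only computes means. As a hedged alternative, pandas' `rolling` produces both the mean and the standard deviation of every full window. The sketch below uses made-up weekly totals and runs in chronological order, whereas the function above starts from the most recent week and works backwards.

```python
import pandas as pd

# Made-up weekly step totals, oldest first.
weekly_steps = pd.Series([30000, 42000, 42000, 31000, 37000, 47000])

window = 4  # four weeks, i.e. 28 days
rolling = weekly_steps.rolling(window)

# Rows before the first full window are NaN and are dropped.
insight = pd.DataFrame({
    "mean": rolling.mean(),
    "stdDev": rolling.std(),
}).dropna()
print(insight)
```

A series of length n yields n - window + 1 full windows, one more than the `range(m - 4)` loop above visits.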


## Stand Count

In [108]:
stand = pd.read_csv("AppleStandHour.csv")

In [109]:
len(stand)

9073

In [110]:
stand.columns

Index(['sourceName', 'sourceVersion', 'device', 'type', 'unit', 'creationDate',
       'startDate', 'endDate', 'value'],
      dtype='object')

In [15]:
stand.tail()

Unnamed: 0,sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
9068,Shashank’s Apple Watch,6.1.3,"<<HKDevice: 0x2826a4be0>, name:Apple Watch, ma...",AppleStandHour,,2020-07-05 19:01:23 +0530,2020-07-05 19:00:00 +0530,2020-07-05 20:00:00 +0530,HKCategoryValueAppleStandHourStood
9069,Shashank’s Apple Watch,6.1.3,"<<HKDevice: 0x2826a4cd0>, name:Apple Watch, ma...",AppleStandHour,,2020-07-05 20:11:10 +0530,2020-07-05 20:00:00 +0530,2020-07-05 21:00:00 +0530,HKCategoryValueAppleStandHourStood
9070,Shashank’s Apple Watch,6.1.3,"<<HKDevice: 0x2826a4dc0>, name:Apple Watch, ma...",AppleStandHour,,2020-07-05 21:04:22 +0530,2020-07-05 21:00:00 +0530,2020-07-05 22:00:00 +0530,HKCategoryValueAppleStandHourStood
9071,Shashank’s Apple Watch,6.1.3,"<<HKDevice: 0x2826a4eb0>, name:Apple Watch, ma...",AppleStandHour,,2020-07-05 22:01:13 +0530,2020-07-05 22:00:00 +0530,2020-07-05 23:00:00 +0530,HKCategoryValueAppleStandHourStood
9072,Shashank’s Apple Watch,6.1.3,"<<HKDevice: 0x2826a4fa0>, name:Apple Watch, ma...",AppleStandHour,,2020-07-05 23:35:54 +0530,2020-07-05 23:00:00 +0530,2020-07-06 00:00:00 +0530,HKCategoryValueAppleStandHourStood


In [17]:
# Duration of each stand record: endDate minus startDate.
# The first 19 characters hold 'YYYY-MM-DD HH:MM:SS'; the UTC offset is dropped.
fmt = '%Y-%m-%d %H:%M:%S'
durations = []
for i in range(len(stand)):
    t1 = datetime.strptime(stand['endDate'][i][0:19], fmt)
    t2 = datetime.strptime(stand['startDate'][i][0:19], fmt)
    durations.append(t1 - t2)
stand['value'] = durations
print(stand['value'].tail())
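
Each stand record covers one clock hour, so a daily stand count can be obtained by counting the records marked as stood, without any date arithmetic. This is a sketch on made-up rows; the `HKCategoryValueAppleStandHourIdle` label for hours without standing is an assumption, since only the `Stood` label appears in the output above.

```python
import pandas as pd

# Made-up rows in the same shape as AppleStandHour.csv.
demo = pd.DataFrame({
    "startDate": [
        "2020-07-05 21:00:00 +0530",
        "2020-07-05 22:00:00 +0530",
        "2020-07-05 23:00:00 +0530",
        "2020-07-06 00:00:00 +0530",
    ],
    "value": [
        "HKCategoryValueAppleStandHourStood",
        "HKCategoryValueAppleStandHourStood",
        "HKCategoryValueAppleStandHourStood",
        "HKCategoryValueAppleStandHourIdle",  # assumed label, see note above
    ],
})

stood = demo[demo["value"] == "HKCategoryValueAppleStandHourStood"]
# The first 10 characters of the timestamp are the local calendar date.
stand_hours_per_day = stood.groupby(stood["startDate"].str[:10]).size()
print(stand_hours_per_day)
```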

------

## Resting Heart Rate (HR)

In [18]:
restingHR = pd.read_csv("RestingHeartRate.csv")

In [19]:
len(restingHR)

645

In [20]:
restingHR.describe()

Unnamed: 0,device,value
count,0.0,645.0
mean,,69.809302
std,,5.422455
min,,50.0
25%,,67.0
50%,,69.0
75%,,72.0
max,,98.0


---

## Walking Heart Rate (HR) Average

In [21]:
walkingHR = pd.read_csv("WalkingHeartRateAverage.csv")

In [22]:
len(walkingHR)

539

In [23]:
walkingHR.describe()

Unnamed: 0,device,value
count,0.0,539.0
mean,,99.084416
std,,11.996546
min,,72.5
25%,,91.0
50%,,97.0
75%,,104.0
max,,143.0


---

## Heart Rate Variability (HRV)

In [24]:
hrv = pd.read_csv("HeartRateVariabilitySDNN.csv")

In [25]:
len(hrv)

1687

In [26]:
hrv.columns

Index(['sourceName', 'sourceVersion', 'device', 'type', 'unit', 'creationDate',
       'startDate', 'endDate', 'value'],
      dtype='object')

In [27]:
hrv.describe()

Unnamed: 0,value
count,1687.0
mean,33.308511
std,13.458962
min,7.32718
25%,23.76185
50%,31.0815
75%,40.1322
max,160.64


In [28]:
hrv.tail()

Unnamed: 0,sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
1682,Shashank’s Apple Watch,6.1.3,"<<HKDevice: 0x282780a00>, name:Apple Watch, ma...",HeartRateVariabilitySDNN,ms,2020-07-05 11:54:27 +0530,2020-07-05 11:53:26 +0530,2020-07-05 11:54:27 +0530,21.7083
1683,Shashank’s Apple Watch,6.1.3,"<<HKDevice: 0x282784af0>, name:Apple Watch, ma...",HeartRateVariabilitySDNN,ms,2020-07-05 13:55:48 +0530,2020-07-05 13:54:43 +0530,2020-07-05 13:55:48 +0530,38.0041
1684,Shashank’s Apple Watch,6.1.3,"<<HKDevice: 0x282785a90>, name:Apple Watch, ma...",HeartRateVariabilitySDNN,ms,2020-07-05 17:54:11 +0530,2020-07-05 17:53:06 +0530,2020-07-05 17:54:11 +0530,31.3568
1685,Shashank’s Apple Watch,6.1.3,"<<HKDevice: 0x282786260>, name:Apple Watch, ma...",HeartRateVariabilitySDNN,ms,2020-07-05 18:09:13 +0530,2020-07-05 18:08:10 +0530,2020-07-05 18:09:13 +0530,30.0479
1686,Shashank’s Apple Watch,6.1.3,"<<HKDevice: 0x2827863f0>, name:Apple Watch, ma...",HeartRateVariabilitySDNN,ms,2020-07-05 21:55:17 +0530,2020-07-05 21:54:11 +0530,2020-07-05 21:55:17 +0530,28.2694


-------

## VO2 Max

In [29]:
vo2max = pd.read_csv("VO2Max.csv")

In [30]:
len(vo2max)

57

In [31]:
vo2max.describe()

Unnamed: 0,sourceVersion,device,value
count,0.0,0.0,57.0
mean,,,33.181767
std,,,2.716489
min,,,29.8837
25%,,,31.7687
50%,,,31.8863
75%,,,33.9545
max,,,41.4593


----

## Blood Pressure

In [32]:
diastolic = pd.read_csv("BloodPressureDiastolic.csv")
systolic = pd.read_csv("BloodPressureSystolic.csv")

FileNotFoundError: File b'BloodPressureDiastolic.csv' does not exist

In [33]:
diastolic.describe()

NameError: name 'diastolic' is not defined

In [34]:
systolic.describe()

NameError: name 'systolic' is not defined
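
The errors above only mean that this export contains no blood pressure records, so the parser never wrote those CSV files. A guard like the following lets the notebook continue when a record type is missing. `read_optional_csv` is a hypothetical helper introduced for this sketch.

```python
from pathlib import Path

import pandas as pd


def read_optional_csv(path):
    """Return the CSV as a DataFrame, or None when the file does not exist."""
    path = Path(path)
    if not path.exists():
        print(f"{path.name} not found; skipping")
        return None
    return pd.read_csv(path)


diastolic = read_optional_csv("BloodPressureDiastolic.csv")
systolic = read_optional_csv("BloodPressureSystolic.csv")

# Describe the readings only when both files were present.
if diastolic is not None and systolic is not None:
    print(diastolic.describe())
    print(systolic.describe())
```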

------

## Sleep

In [41]:
sleep = pd.read_csv("SleepAnalysis.csv")
sleep['unit'] = 'hours'

In [42]:
sleep.tail()

Unnamed: 0,sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
17,Dozee,1,,SleepAnalysis,hours,2020-06-12 00:01:58 +0530,2019-04-07 00:55:05 +0530,2019-04-07 06:57:33 +0530,HKCategoryValueSleepAnalysisAsleep
18,Dozee,1,,SleepAnalysis,hours,2020-06-12 00:02:37 +0530,2019-04-07 00:53:05 +0530,2019-04-07 06:57:33 +0530,HKCategoryValueSleepAnalysisInBed
19,Dozee,1,,SleepAnalysis,hours,2020-06-12 00:02:37 +0530,2019-04-07 00:55:05 +0530,2019-04-07 06:57:33 +0530,HKCategoryValueSleepAnalysisAsleep
20,Dozee,1,,SleepAnalysis,hours,2020-06-12 00:03:58 +0530,2019-04-07 00:53:05 +0530,2019-04-07 06:57:33 +0530,HKCategoryValueSleepAnalysisInBed
21,Dozee,1,,SleepAnalysis,hours,2020-06-12 00:03:58 +0530,2019-04-07 00:55:05 +0530,2019-04-07 06:57:33 +0530,HKCategoryValueSleepAnalysisAsleep


In [37]:
sleep.describe()

Unnamed: 0,sourceVersion,unit
count,22.0,0.0
mean,18.818182,
std,24.125932,
min,1.0,
25%,1.0,
50%,1.0,
75%,50.0,
max,50.0,
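
`describe()` shows nothing useful here because `value` is a category label rather than a number; the sleep duration has to be derived from `startDate` and `endDate`. The sketch below uses three rows copied from the `tail()` output above, keeps only the `Asleep` records, drops the repeated entries, and computes the duration in hours. It assumes the timestamp format shown in that output.

```python
import pandas as pd

# Rows taken from the sleep.tail() output above (duplicates included).
demo = pd.DataFrame({
    "startDate": [
        "2019-04-07 00:53:05 +0530",
        "2019-04-07 00:55:05 +0530",
        "2019-04-07 00:55:05 +0530",
    ],
    "endDate": ["2019-04-07 06:57:33 +0530"] * 3,
    "value": [
        "HKCategoryValueSleepAnalysisInBed",
        "HKCategoryValueSleepAnalysisAsleep",
        "HKCategoryValueSleepAnalysisAsleep",
    ],
})

fmt = "%Y-%m-%d %H:%M:%S %z"
asleep = (
    demo[demo["value"] == "HKCategoryValueSleepAnalysisAsleep"]
    .drop_duplicates(subset=["startDate", "endDate"])
    .copy()
)
start = pd.to_datetime(asleep["startDate"], format=fmt, utc=True)
end = pd.to_datetime(asleep["endDate"], format=fmt, utc=True)
asleep["hours"] = (end - start).dt.total_seconds() / 3600
print(asleep[["startDate", "endDate", "hours"]])
```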
