# Outlining my process in extracting, storing, and organizing Apple Health data for analysis and visualization purposes

Related files: 
- `extractapplehealth.py`: Extracts Health data and formats them into tables in a database that is accessible with SQL.
- `healthdatabase.py`: Custom context manager for accessing and modifying the database with [APSW (Another Python SQL Wrapper)](https://rogerbinns.github.io/apsw/).
- Apple Health export file: `export.xml`. 

In [1]:
import pandas as pd
import xml.etree.ElementTree as ET
from pathlib import Path
import numpy as np

project_path = Path.cwd().parent
DATA_PATH = Path(project_path, 'apple_health_export', 'export.xml')

print("export.xml path: ", DATA_PATH)

export.xml path:  /Users/frootloops/Documents/pyprojects/runnersdash/apple_health_export/export.xml


In [227]:
def get_first_instance(treeroot, elem_tag, parent_tag=None):
    """ Returns a (parent_node, child_node) pair for the first node
    in the ElementTree with tag = elem_tag whose parent is a direct 
    child of treeroot with tag = parent_tag.

    If parent_tag is None, treeroot itself is returned as parent_node.
    """
    if parent_tag is None:
        parent_node = treeroot.find('.')
    else:
        parent_node = treeroot.find(f'./{parent_tag}/{elem_tag}/..')

    child_node = parent_node.find(f'./{elem_tag}')

    return parent_node, child_node
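
A quick way to sanity-check this helper is to run it on a tiny hand-written tree. The XML below is made up for illustration (it is not taken from `export.xml`), and the function body is repeated only so the sketch runs on its own.

```python
import xml.etree.ElementTree as ET

# Made-up miniature export, for illustration only
SAMPLE_XML = """
<HealthData locale="en_US">
  <Workout workoutActivityType="HKWorkoutActivityTypeRunning">
    <MetadataEntry key="HKIndoorWorkout" value="1"/>
  </Workout>
  <Workout workoutActivityType="HKWorkoutActivityTypeWalking">
    <WorkoutEvent type="HKWorkoutEventTypeSegment"/>
  </Workout>
</HealthData>
"""


def get_first_instance(treeroot, elem_tag, parent_tag=None):
    # Same logic as the cell above, repeated to keep the sketch self-contained
    if parent_tag is None:
        parent_node = treeroot.find('.')
    else:
        parent_node = treeroot.find(f'./{parent_tag}/{elem_tag}/..')
    child_node = parent_node.find(f'./{elem_tag}')
    return parent_node, child_node


sample_root = ET.fromstring(SAMPLE_XML)

# First 'Workout' that actually contains a 'WorkoutEvent' (the second one here)
parent, child = get_first_instance(sample_root, 'WorkoutEvent', 'Workout')
print(parent.attrib['workoutActivityType'])  # HKWorkoutActivityTypeWalking
print(child.attrib['type'])                  # HKWorkoutEventTypeSegment

# With no parent_tag, the root itself comes back as the parent
root_parent, first_workout = get_first_instance(sample_root, 'Workout')
print(root_parent is sample_root)            # True
```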

## 1.0 I get by with a little help from `xml.etree.ElementTree`

I use the ElementTree library to help parse this file format. Refer to the documentation for xml.etree.ElementTree: [The ElementTree XML API](https://docs.python.org/3/library/xml.etree.elementtree.html#)

Note: Other alternatives, like [lxml](https://lxml.de), might be worth looking into. 
See also: 
- [Benchmarks and speed - lxml](https://lxml.de/performance.html)
- [ElementTree compatibility of lxml.etree](https://lxml.de/compatibility.html)
- [Faster XML stream processing in Python](http://blog.behnel.de/posts/faster-xml-stream-processing-in-python.html)

In any case, ElementTree parses every entry in an XML file and stores them in a tree, with all entries nested under the "root" of the tree.

Each element (or "node") of the tree, including the root, has a `tag` and an `attrib`. Its `tag` is a string naming its element type, and its `attrib` is a dictionary of its attributes. 

In the example below, each node of the tree is denoted by a tuple (tag, attrib), where the first entry is the node's 'tag' and the second is its attribute ('attrib'). 
```
('MusicArtists',)   <-- root
|
|--- ('Genre', {type: 'pop'})
|       |
|       |--- ('Subgenre', {type: 'artpop'})
|       |         | --- ('Country', {name: 'Iceland'})
|       |         |        |--- ('Artist', {name: 'Bjork'})
|       |         |
|       |         | --- ('Country', {name: 'Britain'})
|       |                  |--- ('Artist', {name: 'FKA Twigs'})
|       |                  |--- ('Artist', {name: 'Kate Bush'})
|       |             
|       |--- ('Subgenre', {type: 'hyperpop'})
|                 | ....
|
|
|--- ('Genre', {type: 'hip hop'})
|       | 
|      ...
...
```

Note that some tags belong to nodes nested under another node. Example: the tags 'Subgenre', 'Country', and 'Artist'. 
The tag 'Genre', whose nodes derive directly from the root, is what we'll call a top-level node tag (this will be important later).  
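
To make the (tag, attrib) picture concrete, here is a small sketch that writes part of the diagram above out as XML and parses it. The XML string is my own transcription of the diagram (using `name` for both 'Country' nodes), not a real file.

```python
import xml.etree.ElementTree as ET

# Hand-written XML version of the diagram above, for illustration only
MUSIC_XML = """
<MusicArtists>
  <Genre type="pop">
    <Subgenre type="artpop">
      <Country name="Iceland">
        <Artist name="Bjork"/>
      </Country>
      <Country name="Britain">
        <Artist name="FKA Twigs"/>
        <Artist name="Kate Bush"/>
      </Country>
    </Subgenre>
  </Genre>
  <Genre type="hip hop"/>
</MusicArtists>
"""

music_root = ET.fromstring(MUSIC_XML)

# The root has a tag but no attributes
print(music_root.tag, music_root.attrib)

# Iterating over an element yields its direct children (the top-level nodes)
genres = [genre.attrib['type'] for genre in music_root]
print(genres)

# .iter(tag) walks the whole tree, however deeply the tag is nested
artists = [artist.attrib['name'] for artist in music_root.iter('Artist')]
print(artists)
```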

In [2]:
# Open and parse file and create an ElementTree object out of it
with open(DATA_PATH, 'r') as f:
    tree = ET.parse(f)

# Get root of ElementTree object
root = tree.getroot()

# Print 'tag' and 'attribute' of root
print('root.tag = {}'.format(root.tag))
print('root.attrib = {}'.format(root.attrib))

root.tag = HealthData
root.attrib = {'locale': 'en_US'}


Use `.iter()` to iterate through the entire tree and get a list of all the tags. Then filter out duplicates to get a unique list of tags.

In [229]:
# Get all tags and filter out duplicates
listoftags = [i.tag for i in root.iter()]
uniquetags = list(set(listoftags))
print(uniquetags)

['Record', 'HeartRateVariabilityMetadataList', 'Me', 'ActivitySummary', 'MetadataEntry', 'Audiogram', 'Workout', 'ExportDate', 'Correlation', 'WorkoutRoute', 'FileReference', 'InstantaneousBeatsPerMinute', 'WorkoutEvent', 'HealthData', 'SensitivityPoint']
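
For very large exports, a streaming pass might be worth trying instead of loading the whole tree first. The sketch below uses `ET.iterparse` from the same standard library module; the helper name `unique_tags_streaming` and the sample bytes are made up for illustration, and I have not benchmarked this against `export.xml`.

```python
import io
import xml.etree.ElementTree as ET


def unique_tags_streaming(source):
    """Collect unique tags from a file path or file object without
    keeping the full tree in memory (hypothetical helper)."""
    tags = set()
    for _event, elem in ET.iterparse(source, events=("end",)):
        tags.add(elem.tag)
        # Drop the element's children and attributes once it is complete
        elem.clear()
    return tags


# Made-up miniature document standing in for export.xml
sample = io.BytesIO(
    b"<HealthData><Workout><MetadataEntry/></Workout><Record/></HealthData>"
)
print(sorted(unique_tags_streaming(sample)))
```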


## 2.0 Traversing the Tree

`uniquetags` is a list of all unique tags. But it doesn't distinguish which tags are associated with top-level nodes and which are associated with nodes nested under (related to) top-level ones.

To preserve this relation when extracting from the ElementTree, we want to look at each top-level node, then extract its child nodes and store their information within the same framework as the parent node.

```
'HealthData' (root)
  |
  |-- 'Workout'
  |         |-- MetadataEntry
  |         |-- child-node-2
  |         |
  |        ...
  |-- 'Record'
  |         |-- MetadataEntry
  |         |-- child-node-3
  |         |
  |        ...
  ...
```

Refer to [XPath support for ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=doctype#xpath-support) for locating and accessing elements of the tree. 

**Example 1** The following shows that there are no `MetadataEntry` elements directly under the root. This indicates that the tag is associated with child nodes.

In [230]:
# A child node: MetadataEntry. There is no element with tag 'MetadataEntry' directly under the root.
assert root.find('./MetadataEntry') is None
assert root.find('./FileReference') is None  # Another child node

**Example 2** There are `Workout` element(s) directly under the root. This indicates that the tag is associated with top-level nodes.

In [231]:
# Outputs the first element found as a direct child of root, with tag = 'Workout'
root.find('./Workout')

<Element 'Workout' at 0x7f84c34d6cc0>

In [232]:
# Get top level nodes
top_level_nodes = [i for i in uniquetags if root.find(f'./{i}') is not None]
# for t in uniquetags:
#     if root.find(f'./{t}') is not None:
#         top_level_nodes.append(t)
print("root HealthData's top level node tags: ", top_level_nodes)
print("\n The following tags are associated with child nodes: ", list(set(uniquetags) - set(top_level_nodes)))

root HealthData's top level node tags:  ['Record', 'Me', 'ActivitySummary', 'Audiogram', 'Workout', 'ExportDate', 'Correlation']

 The following tags are associated with child nodes:  ['HeartRateVariabilityMetadataList', 'MetadataEntry', 'WorkoutRoute', 'FileReference', 'InstantaneousBeatsPerMinute', 'WorkoutEvent', 'HealthData', 'SensitivityPoint']
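
Note that the set difference above also lists 'HealthData' as a child tag, simply because the root is not a child of itself. An alternative sketch is to iterate over the root's direct children, which gives the top-level tags in one pass and lets us exclude the root tag explicitly. The sample tree is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up miniature tree; 'Record' appears both top-level and nested
sample_root = ET.fromstring(
    "<HealthData><ExportDate/><Workout><MetadataEntry/></Workout>"
    "<Correlation><Record/></Correlation><Record/></HealthData>"
)

# Iterating over an element yields only its direct children
top_level = {child.tag for child in sample_root}

# Tags that never appear directly under the root, excluding the root's own tag
nested_only = {e.tag for e in sample_root.iter()} - top_level - {sample_root.tag}

print(sorted(top_level))
print(sorted(nested_only))
```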


##### Useful to have: a dictionary mapping each node type to its known child node tags 
See the doctype in `export.xml`. It would be well worth exploring whether this library can parse the doctype.

In [12]:
ELEM_WITH_CHILD = {"Record": ["MetadataEntry", "HeartRateVariabilityMetadataList"],
    "Workout": ["MetadataEntry", "WorkoutEvent", "WorkoutRoute"],
    "Correlation": ["Record", "MetadataEntry"],
    "WorkoutEvent": ["MetadataEntry"],
    "WorkoutRoute": ["MetadataEntry", "FileReference"],  # per the WorkoutRoute declaration quoted in section 3.2.4
    "HeartRateVariabilityMetadataList": ["InstantaneousBeatsPerMinute"]}
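
As far as I can tell, ElementTree does not hand back the `<!ELEMENT ...>` declarations of the doctype, so one possible workaround is to read them as plain text. The sketch below runs a regular expression over the two declarations quoted later in this notebook; `children_from_dtd` is a hypothetical helper, and a real doctype may contain forms this simple pattern does not cover.

```python
import re

# The two element declarations quoted in sections 3.2 and 3.2.4
DTD_SNIPPET = """
<!ELEMENT Workout ((MetadataEntry|WorkoutEvent|WorkoutRoute)*)>
<!ELEMENT WorkoutRoute ((MetadataEntry|FileReference)*)>
"""

ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\w+)\s+([^>]*)>")
NOT_CHILD_NAMES = {"EMPTY", "ANY", "PCDATA"}  # DTD keywords, not element names


def children_from_dtd(dtd_text):
    """Map each declared element to the child tags named in its content model."""
    mapping = {}
    for name, content_model in ELEMENT_DECL.findall(dtd_text):
        children = [
            child for child in re.findall(r"[A-Za-z_]\w*", content_model)
            if child not in NOT_CHILD_NAMES
        ]
        if children:
            mapping[name] = children
    return mapping


dtd_children = children_from_dtd(DTD_SNIPPET)
print(dtd_children)
```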

#### Get total number of elements for each tag and total number of elements under root

In [233]:
nsum = 0
for t in uniquetags:
    # len(root.findall(f'.//{t}')) also works for non-root tags, just a lil slower?
    tnodes = sum(1 for _ in root.iter(t))
    print(f"There are {tnodes} {t} nodes.")
    nsum += tnodes

print(f'There are {nsum} total nodes.')

There are 1190848 Record nodes.
There are 2043 HeartRateVariabilityMetadataList nodes.
There are 1 Me nodes.
There are 438 ActivitySummary nodes.
There are 186108 MetadataEntry nodes.
There are 1 Audiogram nodes.
There are 889 Workout nodes.
There are 1 ExportDate nodes.
There are 27 Correlation nodes.
There are 304 WorkoutRoute nodes.
There are 304 FileReference nodes.
There are 75081 InstantaneousBeatsPerMinute nodes.
There are 5891 WorkoutEvent nodes.
There are 1 HealthData nodes.
There are 6 SensitivityPoint nodes.
There are 1461943 total nodes.
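
The loop above walks the tree once per tag. A single pass with `collections.Counter` should give the same counts; this is a sketch on a made-up miniature tree, not a timed comparison.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Made-up miniature tree, for illustration only
sample_root = ET.fromstring(
    "<HealthData><Workout><MetadataEntry/><MetadataEntry/></Workout>"
    "<Record/></HealthData>"
)

# One traversal of the whole tree, counting every tag as we go
tag_counts = Counter(elem.tag for elem in sample_root.iter())

for tag, count in tag_counts.most_common():
    print(f"There are {count} {tag} nodes.")

total_nodes = sum(tag_counts.values())
print(f"There are {total_nodes} total nodes.")
```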


## 3.0 Extracting nodes/elements with tag 'Workout', and their child nodes

In [14]:
workout_nodes = root.findall('./Workout')
print("There are {} Workout nodes".format(len(workout_nodes)))

There are 889 Workout nodes


##### Case n = 1
Let's look at the first node on the list.

In [15]:
for node in workout_nodes[:1]:
    # Get current node's attributes
    print(node, ': ', node.attrib, '\n')

    # Find all nodes nested directly under the current node
    child_nodes = node.findall('./')
    
    print(f'There are {len(child_nodes)} child nodes:')
    # Use a separate counter name so the outer loop variable isn't overwritten
    for n, c in enumerate(child_nodes, start=1):
        print(f"{n} {c.tag}: {c.attrib}")

<Element 'Workout' at 0x7f8529656bd0> :  {'workoutActivityType': 'HKWorkoutActivityTypeRunning', 'duration': '25.91666666666667', 'durationUnit': 'min', 'totalDistance': '3.669', 'totalDistanceUnit': 'km', 'totalEnergyBurned': '0', 'totalEnergyBurnedUnit': 'Cal', 'sourceName': 'RunGap', 'sourceVersion': '671', 'creationDate': '2022-03-06 00:09:32 -0800', 'startDate': '2018-03-17 14:42:58 -0800', 'endDate': '2018-03-17 15:08:53 -0800'} 

There are 3 child nodes:
1 MetadataEntry: {'key': 'HKIndoorWorkout', 'value': '1'}
2 MetadataEntry: {'key': 'HKAverageSpeed', 'value': '2.35968 m/s'}
3 MetadataEntry: {'key': 'HKMaximumSpeed', 'value': '2.35968 m/s'}


Compare this with the entry from `export.xml`
```
 <Workout workoutActivityType="HKWorkoutActivityTypeRunning" duration="25.91666666666667" durationUnit="min" totalDistance="3.669" totalDistanceUnit="km" totalEnergyBurned="0" totalEnergyBurnedUnit="Cal" sourceName="RunGap" sourceVersion="671" creationDate="2022-03-06 00:09:32 -0800" startDate="2018-03-17 14:42:58 -0800" endDate="2018-03-17 15:08:53 -0800">
  <MetadataEntry key="HKIndoorWorkout" value="1"/>
  <MetadataEntry key="HKAverageSpeed" value="2.35968 m/s"/>
  <MetadataEntry key="HKMaximumSpeed" value="2.35968 m/s"/>
</Workout>
```

Element attributes are stored as a dictionary, which is easy to turn into a DataFrame with `pd.DataFrame.from_dict`. The slightly more laborious task is turning each child node (each nested 'MetadataEntry' element in this case) into a column of that DataFrame.

### 3.1 'MetadataEntry' child nodes

In [16]:
first_workout = root.find('./Workout')
workouts = pd.DataFrame.from_dict([first_workout.attrib])

workouts

Unnamed: 0,workoutActivityType,duration,durationUnit,totalDistance,totalDistanceUnit,totalEnergyBurned,totalEnergyBurnedUnit,sourceName,sourceVersion,creationDate,startDate,endDate
0,HKWorkoutActivityTypeRunning,25.91666666666667,min,3.669,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,2018-03-17 14:42:58 -0800,2018-03-17 15:08:53 -0800


2. Use each unique child node key as a column name, with its value as the cell value. If more than one child node has the same key but a different value, then the cell value should be a list of those values.

    **Example**: if this node had another child 'MetadataEntry' with key = "HKMaximumSpeed" and value = "4 m/s", then 
    
    `workouts.loc[0, 'HKMaximumSpeed'] = ["2.35968 m/s", "4 m/s"]`

In [17]:
first_children = [j.attrib for j in first_workout.findall('./')]
print(first_children)

[{'key': 'HKIndoorWorkout', 'value': '1'}, {'key': 'HKAverageSpeed', 'value': '2.35968 m/s'}, {'key': 'HKMaximumSpeed', 'value': '2.35968 m/s'}]


We have a list of `dict`s, one per child node. Passing it directly to `pd.DataFrame.from_dict()` would not give us what we want: we would get one row per child with columns `key` and `value`, rather than one column per `key`. 

**Idea**: Iterate through each child node and add it to the `workouts` DataFrame, taking the 'MetadataEntry' attribute `key` as the column name.

**Note**: the following code only works because 

1. Each child node has the same tag `MetadataEntry` with the same attributes `key` and `value`. This needs to be modified to account for different child node types.
1. Each `MetadataEntry` has a unique `key`. This would not work with multiple `MetadataEntry` nodes that have the same `key` mapping to different `value`s.

In [18]:
for child in first_children:
    workouts.loc[0, child['key']] = child['value']

# # The following also works
# for child in first_children:
#     workouts[child['key']] = child['value']

In [19]:
workouts

Unnamed: 0,workoutActivityType,duration,durationUnit,totalDistance,totalDistanceUnit,totalEnergyBurned,totalEnergyBurnedUnit,sourceName,sourceVersion,creationDate,startDate,endDate,HKIndoorWorkout,HKAverageSpeed,HKMaximumSpeed
0,HKWorkoutActivityTypeRunning,25.91666666666667,min,3.669,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,2018-03-17 14:42:58 -0800,2018-03-17 15:08:53 -0800,1,2.35968 m/s,2.35968 m/s
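
The second caveat above (several 'MetadataEntry' nodes sharing a `key`) could be handled by grouping values per key before building the row. This is only a sketch: `metadata_to_columns` is a hypothetical helper, and the duplicate "4 m/s" entry is the made-up example from earlier, not real data.

```python
from collections import defaultdict

import pandas as pd


def metadata_to_columns(metadata_attribs):
    """Group MetadataEntry values by key; keys seen more than once
    map to a list of their values (hypothetical helper)."""
    grouped = defaultdict(list)
    for attrib in metadata_attribs:
        grouped[attrib["key"]].append(attrib["value"])
    return {
        key: values[0] if len(values) == 1 else values
        for key, values in grouped.items()
    }


children = [
    {"key": "HKIndoorWorkout", "value": "1"},
    {"key": "HKMaximumSpeed", "value": "2.35968 m/s"},
    {"key": "HKMaximumSpeed", "value": "4 m/s"},  # made-up duplicate key
]

# Build the full row as a dict first, then create the DataFrame once
row = {"workoutActivityType": "HKWorkoutActivityTypeRunning"}
row.update(metadata_to_columns(children))
workout_row = pd.DataFrame([row])

print(workout_row.loc[0, "HKMaximumSpeed"])
```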


### 3.2 'WorkoutEvent' and 'WorkoutRoute', the child nodes of 'Workout' elements

  To make this easier, we refer to the doctype of `export.xml`, which tells us the child nodes of each element type and the attributes of those child nodes. 

  For element type 'Workout':
  ```
  <!ELEMENT Workout ((MetadataEntry|WorkoutEvent|WorkoutRoute)*)>
  ```
  Its child nodes are of type 'MetadataEntry', 'WorkoutEvent', and 'WorkoutRoute'.

  The attribute list of each type:
  ```
  <!ATTLIST WorkoutEvent
    type         CDATA #REQUIRED
    date         CDATA #REQUIRED
    duration     CDATA #IMPLIED
    durationUnit CDATA #IMPLIED
  >

  <!ATTLIST WorkoutRoute
    sourceName    CDATA #REQUIRED
    sourceVersion CDATA #IMPLIED
    device        CDATA #IMPLIED
    creationDate  CDATA #IMPLIED
    startDate     CDATA #REQUIRED
    endDate       CDATA #REQUIRED
  >

  <!ATTLIST MetadataEntry
    key   CDATA #REQUIRED
    value CDATA #REQUIRED
  >
  ```

#### 3.2.1 Exploring tag 'WorkoutEvent'

In [20]:
# Find the first instance of a 'WorkoutEvent' element
parent_workoutevent, first_workoutevent = get_first_instance(root, 'WorkoutEvent', 'Workout')
print(first_workoutevent.tag, '\n', first_workoutevent.attrib)

WorkoutEvent 
 {'type': 'HKWorkoutEventTypeSegment', 'date': '2019-02-05 17:00:54 -0800', 'duration': '8.487282014468525', 'durationUnit': 'min'}


Let's look at its parent and its child nodes and check that it is nested there.

In [21]:
# Let's look at its parent and its child nodes
print(f"Parent node 'Workout' {parent_workoutevent}: \n", parent_workoutevent.attrib)
print("\nParent node children:")

# Assert
assert parent_workoutevent.find(f"./{first_workoutevent.tag}") is not None

# View
for m in parent_workoutevent.findall('./'):
    print(f"{m.tag}: {m.attrib}")

Parent node 'Workout' <Element 'Workout' at 0x7f852965ddb0>: 
 {'workoutActivityType': 'HKWorkoutActivityTypeRunning', 'duration': '9.083333333333334', 'durationUnit': 'min', 'totalDistance': '1.722', 'totalDistanceUnit': 'km', 'totalEnergyBurned': '0', 'totalEnergyBurnedUnit': 'Cal', 'sourceName': 'RunGap', 'sourceVersion': '671', 'creationDate': '2022-03-06 00:09:31 -0800', 'startDate': '2019-02-05 17:00:54 -0800', 'endDate': '2019-02-05 17:09:59 -0800'}

Parent node children:
MetadataEntry: {'key': 'HKIndoorWorkout', 'value': '1'}
MetadataEntry: {'key': 'HKAverageSpeed', 'value': '3.15963 m/s'}
MetadataEntry: {'key': 'HKMaximumSpeed', 'value': '3.15963 m/s'}
WorkoutEvent: {'type': 'HKWorkoutEventTypeSegment', 'date': '2019-02-05 17:00:54 -0800', 'duration': '8.487282014468525', 'durationUnit': 'min'}
WorkoutEvent: {'type': 'HKWorkoutEventTypeMarker', 'date': '2019-02-05 17:09:23 -0800'}


It's the 4th node in the list.

Let's create a length-1 DataFrame containing the attributes of that parent 'Workout' node. 

**Note: this is a different 'Workout' node to the one defined under the previous section dealing with 'MetadataEntry' nodes**.

In [22]:
# DataFrame of this specific Workout node
parent_node = pd.DataFrame([parent_workoutevent.attrib])
parent_node

Unnamed: 0,workoutActivityType,duration,durationUnit,totalDistance,totalDistanceUnit,totalEnergyBurned,totalEnergyBurnedUnit,sourceName,sourceVersion,creationDate,startDate,endDate
0,HKWorkoutActivityTypeRunning,9.083333333333334,min,1.722,km,0,Cal,RunGap,671,2022-03-06 00:09:31 -0800,2019-02-05 17:00:54 -0800,2019-02-05 17:09:59 -0800


Let's add its 'MetadataEntry' nodes as columns (refer to the previous section):

In [23]:
for m_entry in parent_workoutevent.findall('./MetadataEntry'):
    parent_node.loc[0, m_entry.attrib['key']] = m_entry.attrib['value']

assert "HKMaximumSpeed" in parent_node.columns
assert "HKAverageSpeed" in parent_node.columns 
assert "HKIndoorWorkout" in parent_node.columns 

This should create three new columns called 'HKIndoorWorkout', 'HKAverageSpeed', 'HKMaximumSpeed' inside parent_node.

In [24]:
parent_node  # Parent 'Workout' node with MetadataEntry columns

Unnamed: 0,workoutActivityType,duration,durationUnit,totalDistance,totalDistanceUnit,totalEnergyBurned,totalEnergyBurnedUnit,sourceName,sourceVersion,creationDate,startDate,endDate,HKIndoorWorkout,HKAverageSpeed,HKMaximumSpeed
0,HKWorkoutActivityTypeRunning,9.083333333333334,min,1.722,km,0,Cal,RunGap,671,2022-03-06 00:09:31 -0800,2019-02-05 17:00:54 -0800,2019-02-05 17:09:59 -0800,1,3.15963 m/s,3.15963 m/s


#### 3.2.2 Brainstorming what to do with child nodes 'WorkoutEvent' and 'WorkoutRoute'
 
And now for this node's children. Looking at the data, it's helpful to establish rules in working with the different types.

'**MetadataEntry**' types are easy: pass in its 'key' as the column name and its 'value' as the corresponding cell value.

For '**WorkoutEvent**' and similarly '**WorkoutRoute**', I have several ideas:
1. Create one column for each of its 'date', 'duration', and 'durationUnit' attributes, with each column name prefixed by its 'type'.

    **Example**: The columns 'HKWorkoutEventTypeSegment date', 'HKWorkoutEventTypeSegment duration', and 'HKWorkoutEventTypeSegment durationUnit' would be created, with the attribute values as cell values.

    Pro(s): All data associated with a 'Workout' resides in the same table. 

    Con(s): It would be a very big table with so many columns. And those columns would have long names, because some tags use the same generic names for their keys ('date', 'duration', etc.), so you have to prefix "HKWorkoutEventTypeSegment" to each one of them. Pass.

2. Store its 'date', 'duration', 'durationUnit' attributes as a dict inside a column named after its 'type' attribute.

    **Example**: One column 'HKWorkoutEventTypeSegment' would be created,
    `workouts['HKWorkoutEventTypeSegment'] = {'date': ..., 'duration': ..., 'durationUnit': ...}`

    Pro(s): One column per WorkoutEvent type. Everything in the same table. 

    Con(s): Storing a dict (or a list of dicts) inside a DataFrame. :/ Pass.

3. Store all child 'WorkoutEvent' nodes in another table or DataFrame, together with the 'WorkoutEvent' nodes of other 'Workout' nodes. We can link each one back to its specific Workout through its 'date' attribute: its 'date' should fall between the 'startDate' and 'endDate' of the 'Workout' node/`workouts` row. However, we should be careful to document this. 

    Pro(s): It's organized. 'Workout' table would be cleaner. This specific data point isn't really useful in what we're doing (running data analytics), except when we're analyzing individual runs. But that's for later. So it's good to set it aside in another table for future reference. 

    Con(s): Would have to query 'WorkoutEvent' table where 'date' is between this or that... which might be time consuming if we have decades worth of data. Which brings me to the next option.

4. Similarly to option 3, store the 'WorkoutEvent' attributes 'type', 'date', 'duration', and 'durationUnit' in another DataFrame shared with the 'WorkoutEvent' nodes of other workouts. In addition to its attributes, each row would have a column "Workout index" that holds the index of its Workout node within the Workouts table. 

    The Workouts table would have a column 'WorkoutEvent' of dtype boolean to indicate whether it has child nodes under the 'WorkoutEvent' table:

    **Example**:
    ```
    workouts.loc[1, 'WorkoutEvent'] = True

    WorkoutEvent table:
    _______type___________date_________Workout index___...___
    0 | ...Segment | '2019-02-05...' |       1      |  ...
    1 | ...Marker  | '2019-02-05...' |       1      |  ...
    ...
    ```

    But we just have to be careful to document this and preserve the indexing of the Workouts table.

    This one's my favorite and I'm gonna go with this, not just for 'WorkoutEvent' but for other similar child node tags with the **notable** exception of 'MetadataEntry' nodes.

5. The 3D-fication of DataFrames: [xarray](https://xarray.pydata.org/en/stable/). 

    Maybe worth looking into in the future? However I'm just gonna go with what I'm more familiar with, right now. Also, I'm trying to store this data both as a CSV and as a database accessible with SQL. So option 4 works best.
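
Since the goal is a database accessible with SQL, here is a rough sketch of how option 4 could look as two linked tables. It uses the standard library `sqlite3` module rather than APSW, and the table and column names (`workouts`, `workout_events`, `workout_index`, `has_workout_event`) are placeholders I made up, not the schema used in `extractapplehealth.py`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Hypothetical schema: each event row points back to its workout's index
conn.executescript("""
CREATE TABLE workouts (
    workout_index     INTEGER PRIMARY KEY,
    startDate         TEXT,
    has_workout_event INTEGER
);
CREATE TABLE workout_events (
    type          TEXT,
    date          TEXT,
    workout_index INTEGER REFERENCES workouts(workout_index)
);
""")

conn.execute(
    "INSERT INTO workouts VALUES (?, ?, ?)",
    (1, "2019-02-05 17:00:54 -0800", 1),
)
conn.executemany(
    "INSERT INTO workout_events VALUES (?, ?, ?)",
    [
        ("HKWorkoutEventTypeSegment", "2019-02-05 17:00:54 -0800", 1),
        ("HKWorkoutEventTypeMarker", "2019-02-05 17:09:23 -0800", 1),
    ],
)

# Looking up a workout's events is a lookup on the index, not a date-range scan
rows = conn.execute(
    "SELECT type FROM workout_events WHERE workout_index = ? ORDER BY date",
    (1,),
).fetchall()
print(rows)
conn.close()
```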

#### 3.2.3 Implementing Option 4 for child node tag = 'WorkoutEvent'

1. Set up `workout_events` table and remind myself of the variables I set up previously

1a. 'WorkoutEvent' table

In [25]:
# Create empty DataFrame of WorkoutEvents
workout_events = pd.DataFrame()
workout_events

1b. 'Workout' table

In [26]:
workouts # The DataFrame of compiled 'Workout' nodes

Unnamed: 0,workoutActivityType,duration,durationUnit,totalDistance,totalDistanceUnit,totalEnergyBurned,totalEnergyBurnedUnit,sourceName,sourceVersion,creationDate,startDate,endDate,HKIndoorWorkout,HKAverageSpeed,HKMaximumSpeed
0,HKWorkoutActivityTypeRunning,25.91666666666667,min,3.669,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,2018-03-17 14:42:58 -0800,2018-03-17 15:08:53 -0800,1,2.35968 m/s,2.35968 m/s


1c. DataFrame of the 'Workout' node to add to 'Workout' table, with its remaining child nodes of type 'WorkoutEvent' to deal with. (Its 'MetadataEntry' nodes have already been accounted for).

In [27]:
parent_node # The parent 'Workout' node, to add to workouts

Unnamed: 0,workoutActivityType,duration,durationUnit,totalDistance,totalDistanceUnit,totalEnergyBurned,totalEnergyBurnedUnit,sourceName,sourceVersion,creationDate,startDate,endDate,HKIndoorWorkout,HKAverageSpeed,HKMaximumSpeed
0,HKWorkoutActivityTypeRunning,9.083333333333334,min,1.722,km,0,Cal,RunGap,671,2022-03-06 00:09:31 -0800,2019-02-05 17:00:54 -0800,2019-02-05 17:09:59 -0800,1,3.15963 m/s,3.15963 m/s


1d. The parent node ElementTree object

In [28]:
parent_workoutevent # ElementTree object of parent node

<Element 'Workout' at 0x7f852965ddb0>

In [29]:
# Count number of WorkoutEvent nodes under this Workout element
nworkout_events = sum(1 for _ in parent_workoutevent.findall('./WorkoutEvent'))
assert nworkout_events == 2

2. In the event that there are 'WorkoutEvent' children found, do the following:
    - Add 'WorkoutEvent' column to `parent_node`
    - Concatenate `parent_node` to the greater Workouts table (`workouts`)
    - Store the parent node's index in the Workouts table (will be `len(workouts) - 1`) 
    - Loop through all 'WorkoutEvent' children
        - Create a length-1 DataFrame containing attributes of the child node
        - Add column "Workout index", set to parent node's index
        - Append to WorkoutEvent table (`workout_events`)

In [30]:
# Check if there are 'WorkoutEvent's under parent Workout node
workout_event_children = parent_workoutevent.findall('./WorkoutEvent')

if len(workout_event_children) > 0:
    # If there is, add column 'WorkoutEvent' to parent_node and set to True
    parent_node.loc[0, 'WorkoutEvent'] = True

    # Add parent 'Workout' node to Workouts table as the last element
    workouts = pd.concat([workouts, parent_node], ignore_index=True)
    workout_idx = len(workouts) - 1

    # Storing child WorkoutEvent nodes
    for wevent in workout_event_children: 
        # Create a DataFrame for this specific 'WorkoutEvent'
        event_df = pd.DataFrame([wevent.attrib])
        # Set column 'Workout index' = workout_idx
        event_df.loc[0, 'Workout index'] = workout_idx

        # Append WorkoutEvent node to greater WorkoutEvents table
        workout_events = pd.concat([workout_events, event_df], ignore_index=True)

**Resulting tables**:
1. Workouts table (`workouts`) should have 2 entries now.
1. WorkoutEvents table (`workout_events`) should also have 2 entries, with a column named "Workout index"

In [31]:
assert len(workout_events) == nworkout_events
assert "Workout index" in workout_events.columns
assert "WorkoutEvent" in workouts.columns
assert workouts["WorkoutEvent"].iloc[-1] == True  # '==' also passes if pandas hands back a NumPy bool

print(f"Latest Workout node has index {workout_idx} in Workouts table.")

Latest Workout node has index 1 in Workouts table.


In [32]:
workouts

Unnamed: 0,workoutActivityType,duration,durationUnit,totalDistance,totalDistanceUnit,totalEnergyBurned,totalEnergyBurnedUnit,sourceName,sourceVersion,creationDate,startDate,endDate,HKIndoorWorkout,HKAverageSpeed,HKMaximumSpeed,WorkoutEvent
0,HKWorkoutActivityTypeRunning,25.91666666666667,min,3.669,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,2018-03-17 14:42:58 -0800,2018-03-17 15:08:53 -0800,1,2.35968 m/s,2.35968 m/s,
1,HKWorkoutActivityTypeRunning,9.083333333333334,min,1.722,km,0,Cal,RunGap,671,2022-03-06 00:09:31 -0800,2019-02-05 17:00:54 -0800,2019-02-05 17:09:59 -0800,1,3.15963 m/s,3.15963 m/s,True


In [33]:
workout_events

Unnamed: 0,type,date,duration,durationUnit,Workout index
0,HKWorkoutEventTypeSegment,2019-02-05 17:00:54 -0800,8.487282014468525,min,1.0
1,HKWorkoutEventTypeMarker,2019-02-05 17:09:23 -0800,,,1.0
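
With the "Workout index" column in place, linking events back to their workout can be done with a merge on the Workouts table's index. A minimal sketch, using trimmed-down copies of the two tables above (only a few columns are kept, for brevity):

```python
import pandas as pd

# Trimmed-down copies of the tables above
workouts = pd.DataFrame({
    "startDate": ["2018-03-17 14:42:58 -0800", "2019-02-05 17:00:54 -0800"],
    "totalDistance": ["3.669", "1.722"],
})
workout_events = pd.DataFrame({
    "type": ["HKWorkoutEventTypeSegment", "HKWorkoutEventTypeMarker"],
    "date": ["2019-02-05 17:00:54 -0800", "2019-02-05 17:09:23 -0800"],
    "Workout index": [1.0, 1.0],
})

# The index column came out as float, so cast it before joining
workout_events["Workout index"] = workout_events["Workout index"].astype(int)

# Match each event's 'Workout index' against the row index of workouts
joined = workout_events.merge(
    workouts, left_on="Workout index", right_index=True, how="left"
)
print(joined[["type", "Workout index", "startDate", "totalDistance"]])
```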


**Things to note for later data cleanup**
1. 'Workout' table
    1. Have to specify the dtype of 'WorkoutEvent' and other similar columns as bool, or cast NaN values to False with `fillna()`.
    1. 'MetadataEntry' columns are stored as strings (e.g. '2.35968 m/s'), so numeric values still have to be separated from their units and cast. 
1. 'WorkoutEvent' table
    1. Have to specify 'Workout index' column (and similar) as dtype int.
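
A rough sketch of what that cleanup could look like, on small stand-in tables. The new column name `HKAverageSpeedUnit` is a placeholder of mine, and I use `.eq(True)` as one way to turn NaN into False; `fillna()` as noted above would be another.

```python
import numpy as np
import pandas as pd

# Stand-in tables with the same quirks as the ones above
workouts = pd.DataFrame({
    "HKAverageSpeed": ["2.35968 m/s", "3.15963 m/s"],
    "WorkoutEvent": [np.nan, True],
})
workout_events = pd.DataFrame({"Workout index": [1.0, 1.0]})

# NaN compares unequal to True, so this yields a clean boolean column
workouts["WorkoutEvent"] = workouts["WorkoutEvent"].eq(True)

# Split "2.35968 m/s" into a float value and a unit string
speed_parts = workouts["HKAverageSpeed"].str.split(" ", n=1, expand=True)
workouts["HKAverageSpeed"] = speed_parts[0].astype(float)
workouts["HKAverageSpeedUnit"] = speed_parts[1]  # placeholder column name

# The index column should be an integer
workout_events["Workout index"] = workout_events["Workout index"].astype(int)

print(workouts.dtypes)
print(workout_events.dtypes)
```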

#### 3.2.4 Implementing the same steps for child node tag = 'WorkoutRoute'

The element declaration and attribute list of 'WorkoutRoute' nodes:
```
<!ELEMENT WorkoutRoute ((MetadataEntry|FileReference)*)>
<!ATTLIST WorkoutRoute
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
```

**Task**: Wrap the process shown in sections [3.2.1](#321-exploring-tag-workoutevent)-[3.2.3](#323-implementing-option-4-for-child-node-tag--workoutevent) in a function called `add_workout_children()` that adds data from a Workout element's children (if found) with tags ['MetadataEntry', 'WorkoutEvent' and 'WorkoutRoute'], given the following inputs:
- A Workouts table (empty or otherwise)
- A WorkoutRoutes table (empty or otherwise)
- A Workout node (ElementTree obj)
- Its child node (ElementTree obj)

**Things to note:** 
- 'WorkoutRoute' nodes can have subelements of type 'MetadataEntry' and 'FileReference'.
  - Check how many nodes (all of them) are nested under a 'WorkoutRoute' node. 
  - Then iterate through the list until you've gotten all of them (`sum_nodes_traversed = len(list(route_node.iter()))`; note that `iter()` returns an iterator and also yields `route_node` itself).
    - Call `add_metadata_entry()` or `add_file_reference()` (to be created)
- When iterating through the entire tree for top-level nodes, we cannot use the `iter()` function, because it searches through the entire tree. This is a problem for tags like 'Record', because 'Record' also appears as a subelement of 'Correlation' nodes. We have to use `findall` and XPath instead.
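
The difference between `iter()` and `findall()` with XPath is easy to see on a small made-up tree in which 'Record' appears both at the top level and nested under 'Correlation':

```python
import xml.etree.ElementTree as ET

# Made-up miniature tree, for illustration only
sample_root = ET.fromstring(
    "<HealthData>"
    "<Record type='A'/>"
    "<Correlation><Record type='B'/><Record type='C'/></Correlation>"
    "</HealthData>"
)

# iter() searches the whole tree, so nested Records are included
all_records = list(sample_root.iter("Record"))

# findall('./Record') only matches direct children of the root
top_level_records = sample_root.findall("./Record")

# iter() returns an iterator (no len()), and it yields the start node too
correlation = sample_root.find("./Correlation")
n_in_subtree = sum(1 for _ in correlation.iter())

print(len(all_records), len(top_level_records), n_in_subtree)
```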

In [34]:
# Output:
# - modified Workouts table
# - modified WorkoutEvent table
# - modified WorkoutRoute table

# Given:
# - An empty Workout table
# - An empty WorkoutEvent table
# - An empty WorkoutRoute table


def add_child_attrib_as_column(target_dataframe, child_attrib, 
                               col_name_key, col_val_key, 
                               col_name_prefix=None):
    """ Adds an attribute value of a child node as a column of 
    target_dataframe. Assumes target_dataframe has length 1.

    For use with child nodes with tag = "MetadataEntry" or 
    "FileReference". 

    Args:
        target_dataframe (pd.DataFrame): Length-1 DataFrame to
                add the new column to, using child_attrib.
        child_attrib (dict): Attributes of a child node.
        col_name_key (str): A key from child_attrib used
                to name the columns.
        col_val_key (str): A key from child_attrib used
                to set the value of the new column.

    Kwargs:
        col_name_prefix (str): String to prefix to the column names.

    Returns:
         None. Modifies target_dataframe in place.
    """
    col_name = child_attrib[col_name_key]
    if col_name_prefix is not None:
        col_name = "{0} {1}".format(col_name_prefix, col_name)

    target_dataframe.loc[0, col_name] = child_attrib[col_val_key]



# for workoutnode in root.findall('./Workout')[:1]:

#     # Create table filled with its attributes
#     node_table = pd.DataFrame([workoutnode.attrib])

#     # Check for children and loop, 
#     # if no children, the loop won't run.
#     for child in workoutnode.findall('./'):

#         if child.tag not in ELEM_WITH_CHILD['Workout']:
#             raise ValueError(f'Have not implemented support for Workout subelement with tag = {child.tag}')
#         elif child.tag == "MetadataEntry":
#             add_child_attrib_as_column(node_table, child.attrib, 'key', 'value')
#         else:
#             continue  # 'return' is not valid outside a function
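
A short usage sketch for `add_child_attrib_as_column()`. The function body is repeated so the example runs on its own, and the "WorkoutRoute" prefix is just a made-up illustration of `col_name_prefix`.

```python
import pandas as pd


def add_child_attrib_as_column(target_dataframe, child_attrib,
                               col_name_key, col_val_key,
                               col_name_prefix=None):
    # Same logic as the cell above, repeated to keep the sketch self-contained
    col_name = child_attrib[col_name_key]
    if col_name_prefix is not None:
        col_name = "{0} {1}".format(col_name_prefix, col_name)
    target_dataframe.loc[0, col_name] = child_attrib[col_val_key]


# Length-1 table standing in for one Workout node
node_table = pd.DataFrame([{"workoutActivityType": "HKWorkoutActivityTypeRunning"}])

# Without a prefix, the column is named after the entry's 'key'
add_child_attrib_as_column(
    node_table, {"key": "HKIndoorWorkout", "value": "1"}, "key", "value"
)

# With a prefix (made-up use), the column name records where the entry came from
add_child_attrib_as_column(
    node_table, {"key": "HKAverageSpeed", "value": "2.35968 m/s"},
    "key", "value", col_name_prefix="WorkoutRoute",
)

print(list(node_table.columns))
```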


In [35]:
route_parent, route_node = get_first_instance(root, 'WorkoutRoute', 'Workout')

print(f"Parent 'Workout' node {route_parent}:  {route_parent.attrib}", end='\n\n')
print(f"'WorkoutRoute' node {route_node}: {route_node.attrib}")

Parent 'Workout' node <Element 'Workout' at 0x7f85297c4220>:  {'workoutActivityType': 'HKWorkoutActivityTypeWalking', 'duration': '50.01240504980088', 'durationUnit': 'min', 'totalDistance': '3.920867535269382', 'totalDistanceUnit': 'km', 'totalEnergyBurned': '174.1197457948795', 'totalEnergyBurnedUnit': 'Cal', 'sourceName': 'Nadine’s Apple\xa0Watch', 'sourceVersion': '7.1', 'device': '<<HKDevice: 0x2835ce030>, name:Apple Watch, manufacturer:Apple Inc., model:Watch, hardware:Watch3,3, software:7.1>', 'creationDate': '2020-12-29 16:16:00 -0800', 'startDate': '2020-12-29 15:23:51 -0800', 'endDate': '2020-12-29 16:15:53 -0800'}

'WorkoutRoute' node <Element 'WorkoutRoute' at 0x7f85297cd130>: {'sourceName': 'Nadine’s Apple\xa0Watch', 'sourceVersion': '7.1', 'creationDate': '2020-12-29 16:16:13 -0800', 'startDate': '2020-12-29 15:25:52 -0800', 'endDate': '2020-12-29 16:15:48 -0800'}


In [36]:
# Traverses the entire subtree of route_node
route_node_children = route_node.findall('.//')

# Traverses only the direct children nodes of route_parent
route_parent_children = route_parent.findall('./')

# Traverses entire subtree of route_parent
route_parent_entire = route_parent.findall('.//')

assert len(route_parent_entire) - len(route_node_children) == len(route_parent_children)
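
The assertion above relies on the difference between `'./'` (direct children only) and `'.//'` (all descendants). Here is the same check on a made-up 'Workout' node; the attribute values are invented. Note that the identity only holds when the 'WorkoutRoute' node is the only child that has descendants of its own.

```python
import xml.etree.ElementTree as ET

# Made-up Workout node with one WorkoutRoute that has two children
workout = ET.fromstring(
    "<Workout>"
    "<MetadataEntry key='HKIndoorWorkout' value='0'/>"
    "<WorkoutRoute>"
    "<MetadataEntry key='HKMetadataKeySyncVersion' value='2'/>"
    "<FileReference path='/workout-routes/route.gpx'/>"
    "</WorkoutRoute>"
    "</Workout>"
)
route = workout.find("./WorkoutRoute")

direct_children = workout.findall("./")    # MetadataEntry, WorkoutRoute
all_descendants = workout.findall(".//")   # the two above plus the route's children
route_descendants = route.findall(".//")   # MetadataEntry, FileReference

print(len(direct_children), len(all_descendants), len(route_descendants))
assert len(all_descendants) - len(route_descendants) == len(direct_children)
```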

### 3.3 Iterating through all 'Workout' nodes

We wrap the routines above in functions.

In [56]:
def get_subtree(rootnode):
    """ Post-order depth-first tree traversal: yields every
    descendant of rootnode before yielding rootnode itself.
    """
    for subelem in rootnode.findall('./'):
        yield from get_subtree(subelem)
    yield rootnode

def check_if_workout_route(metadata_node):
    """ Returns whether or not the input MetadataEntry node is a 
    child of a WorkoutRoute node, judging by its key: keys containing
    "HKMetadataKey" are assumed to belong to WorkoutRoute nodes.
    Assumes that the node passed in has tag == "MetadataEntry".
    """
    return "HKMetadataKey" in metadata_node.attrib['key']
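
To see the order in which `get_subtree()` yields nodes, here is a sketch on a made-up 'Workout' node. It also shows one possible alternative to the key-based check: building a child-to-parent map, so that a 'MetadataEntry' node's parent can be looked up directly. `parent_of` is a name I made up; I have not tried this on the full export.

```python
import xml.etree.ElementTree as ET


def get_subtree(rootnode):
    # Same generator as above, repeated to keep the sketch self-contained
    for subelem in rootnode.findall('./'):
        yield from get_subtree(subelem)
    yield rootnode


# Made-up Workout node, for illustration only
workout = ET.fromstring(
    "<Workout>"
    "<MetadataEntry key='HKIndoorWorkout' value='0'/>"
    "<WorkoutRoute>"
    "<MetadataEntry key='HKMetadataKeySyncVersion' value='2'/>"
    "<FileReference path='/workout-routes/route.gpx'/>"
    "</WorkoutRoute>"
    "</Workout>"
)

# Children come out before their parent; the starting node comes last
order = [elem.tag for elem in get_subtree(workout)]
print(order)

# Child -> parent lookup built in one pass over the subtree
parent_of = {child: parent for parent in workout.iter() for child in parent}
route_metadata = [
    entry.attrib["key"]
    for entry in workout.iter("MetadataEntry")
    if parent_of[entry].tag == "WorkoutRoute"
]
print(route_metadata)
```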

In [142]:
top_level_tag = "Workout"
workout_tables = {}
events = pd.DataFrame()  # Dataframe of WorkoutEvents
routes = pd.DataFrame()  # Dataframe of WorkoutRoutes

for num, node in enumerate(root.findall(f'./{top_level_tag}'), start=1):
    # Add data from this node to its respective table within workout_tables

    node_table = pd.DataFrame([node.attrib])
    activity_type = node.attrib["workoutActivityType"].removeprefix("HKWorkoutActivityType")

    if activity_type not in workout_tables.keys():
        workout_tables[activity_type] = pd.DataFrame()

    # Add current node to activity dataframe
    activity_table = workout_tables[activity_type]
    workout_tables[activity_type] = pd.concat([activity_table, node_table], ignore_index=True)

    # Get index of current activity within the workout activity table
    idx = len(workout_tables[activity_type]) - 1

    workout_route_queue = []
    # Iterate through the children of this current Workout node
    for child in get_subtree(node):

        # If statements for all the known children of a Workout node
        if child.tag == "MetadataEntry":
             # If its parent is a WorkoutRoute node
            if check_if_workout_route(child):
                # print(num, "MetadataEntry")
                workout_route_queue.append(child)
            else: # Its parent is a Workout node
                workout_tables[activity_type].loc[idx, child.attrib['key']] = child.attrib['value']
        elif child.tag == "WorkoutEvent":
            # Create a column in the node's table named "WorkoutEvent" and set to True
            workout_tables[activity_type].loc[idx, "WorkoutEvent"] = True
            # -- TODO Better yet, wrap this process up in a function 
            # for Python to do automatic garbage collection with child_node DataFrame
            child_node = pd.DataFrame([child.attrib])
            child_node.loc[0, ['workoutType', 'workoutIndex']] = [activity_type, idx]
            events = pd.concat([events, child_node], ignore_index=True)
            # -- #
        elif child.tag == "WorkoutRoute":
            # get_subtree() implements depth-first search, so when it gets to
            # a WorkoutRoute node, it would have already traversed through
            # the children (if any) of this current WorkoutRoute node 
            # and added the children nodes to workout_route_queue

            workout_tables[activity_type].loc[idx, "WorkoutRoute"] = True

            # TODO Wrap the following process in a function
            route_node = pd.DataFrame([child.attrib])
            route_node.loc[0, ["workoutType", "workoutIndex"]] = [activity_type, idx]

            while len(workout_route_queue) > 0:
                route_child = workout_route_queue[0]
                
                if route_child.tag == "FileReference":
                    col_name = 'Filepath'
                    value_key = 'path'
                else:  # WorkoutRoute child node tag is 'MetadataEntry'
                    col_name = route_child.attrib['key']
                    value_key = 'value'
                
                # Update WorkoutRoute node with child data
                route_node.loc[0, col_name] = route_child.attrib[value_key]
                workout_route_queue.pop(0) # Remove child from queue
            
            # Add WorkoutRoute node to table
            routes = pd.concat([routes, route_node], ignore_index=True)

        elif child.tag == "FileReference":
            # print(num, "FileReference")
            workout_route_queue.append(child)

        elif child.tag == "Workout":
            pass

        else:
            raise ValueError(f"Have not implemented extraction rules for child node of {top_level_tag} with tag '{child.tag}'")

print(f"Iterated through {num} Workout nodes")

# Check lengths of each Workout table
workouts_sum = 0
for key in workout_tables.keys():
    tbl_len = len(workout_tables[key])
    workouts_sum += tbl_len
    print(f"{key}: {tbl_len} elements")

# Test that sum of all workout tables lengths == number from enumeration
assert workouts_sum == num

Iterated through 889 Workout nodes
Running: 359 elements
Barre: 8 elements
HighIntensityIntervalTraining: 36 elements
CoreTraining: 21 elements
Pilates: 30 elements
FunctionalStrengthTraining: 62 elements
Yoga: 14 elements
CrossTraining: 42 elements
Walking: 157 elements
Flexibility: 127 elements
Cooldown: 25 elements
Other: 1 elements
CardioDance: 4 elements
Hiking: 3 elements


#### 3.3.1 Check that we have extracted all possible WorkoutEvent and WorkoutRoute nodes

In [143]:
# Check that we have extracted all possible WorkoutEvent and WorkoutRoute nodes
workoutevents_check = len(root.findall('.//WorkoutEvent'))
workoutroutes_check = len(root.findall('.//WorkoutRoute'))

assert len(events) == workoutevents_check
assert len(routes) == workoutroutes_check

In [128]:
workout_tables['Running']

Unnamed: 0,workoutActivityType,duration,durationUnit,totalDistance,totalDistanceUnit,totalEnergyBurned,totalEnergyBurnedUnit,sourceName,sourceVersion,creationDate,...,WorkoutEvent,HKElevationDescended,HKTimeZone,HKElevationAscended,HKWasUserEntered,device,HKAverageMETs,HKWeatherTemperature,HKWeatherHumidity,WorkoutRoute
0,HKWorkoutActivityTypeRunning,25.91666666666667,min,3.669,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,...,,,,,,,,,,
1,HKWorkoutActivityTypeRunning,37.73333333333333,min,5.327,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,...,,,,,,,,,,
2,HKWorkoutActivityTypeRunning,41.33333333333334,min,6.808,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,...,,,,,,,,,,
3,HKWorkoutActivityTypeRunning,42.06666666666667,min,6.63,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,...,,,,,,,,,,
4,HKWorkoutActivityTypeRunning,51.23333333333333,min,8.256,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354,HKWorkoutActivityTypeRunning,19.20116647283236,min,2.805828900913457,km,122.013844088736,Cal,Nadine’s Apple Watch,8.3,2022-02-23 18:03:17 -0800,...,True,,America/Los_Angeles,4144 cm,,"<<HKDevice: 0x283590820>, name:Apple Watch, ma...",7.63099 kcal/hr·kg,44.6 degF,3200 %,True
355,HKWorkoutActivityTypeRunning,34.72069991429647,min,5.021799267990767,km,210.0138024178834,Cal,Nadine’s Apple Watch,8.3,2022-02-25 18:21:44 -0800,...,True,,America/Los_Angeles,3068 cm,,"<<HKDevice: 0x283590820>, name:Apple Watch, ma...",8.95977 kcal/hr·kg,59 degF,2500 %,True
356,HKWorkoutActivityTypeRunning,17.44557295441627,min,2.468255764373055,km,104.7491072547596,Cal,Nadine’s Apple Watch,8.3,2022-02-26 17:47:36 -0800,...,True,,America/Los_Angeles,3971 cm,,"<<HKDevice: 0x283590820>, name:Apple Watch, ma...",8.17095 kcal/hr·kg,64.4 degF,1400 %,True
357,HKWorkoutActivityTypeRunning,32.08665848771731,min,4.243732956817863,km,214.2973264160422,Cal,Nadine’s Apple Watch,8.3,2022-03-02 18:08:39 -0800,...,True,,America/Los_Angeles,7317 cm,,"<<HKDevice: 0x283590820>, name:Apple Watch, ma...",10.0936 kcal/hr·kg,75.2 degF,1800 %,True


In [129]:
# Check that the 'totalDistance' and 'duration' columns are consistent in units
# E.g: all entries under column 'durationUnit' are 'min' and nothing else.
assert len(set(workout_tables['Running']['totalDistanceUnit'])) == 1
assert len(set(workout_tables['Running']['durationUnit'])) == 1

##### Things to note in later data cleanup

1. 'MetadataEntry' columns like 'HKElevationAscended' don't have consistent units across all entries. Assume the same may apply to every quantity column prefixed by 'HK'.
1. 'Running' entries that are marked True or 1 for 'HKIndoorWorkout' have the same values for average speed and maximum speed. When reading maximum speed, filter out Running entries where 'HKIndoorWorkout' = 1.

In [131]:
# Returns the distinct units found in a MetadataEntry column (columns prefixed with "HK")
def get_units_of_metadatacolumn(dframe, col):
    df = dframe[col].copy()
    filternan = df[df.notnull()]
    colunits = set(filternan.apply(lambda x: x.split(' ')[1]))
    return colunits

tbl = workout_tables['Running']

for c in ['HKAverageSpeed', 'HKMaximumSpeed', 'HKElevationDescended', 
         'HKElevationAscended', 'HKAverageMETs', 'HKWeatherTemperature', 'HKWeatherHumidity']:
    
    print(f"Column '{c}' has units {get_units_of_metadatacolumn(tbl, c)}")

Column 'HKAverageSpeed' has units {'m/s'}
Column 'HKMaximumSpeed' has units {'m/s'}
Column 'HKElevationDescended' has units {'m'}
Column 'HKElevationAscended' has units {'m', 'cm'}
Column 'HKAverageMETs' has units {'kcal/hr·kg'}
Column 'HKWeatherTemperature' has units {'degF'}
Column 'HKWeatherHumidity' has units {'%'}
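
Since 'HKElevationAscended' mixes 'm' and 'cm', a later cleanup step could convert such a column to a single unit. The following is only a sketch under assumptions: the helper name `quantity_to_metres` and the `TO_METRES` table are hypothetical, the values are assumed to always look like `"<number> <unit>"`, and only the two units observed above are handled.

```python
import pandas as pd

# Assumed conversion factors to metres; covers only the units seen so far
TO_METRES = {"m": 1.0, "cm": 0.01}

def quantity_to_metres(series):
    """Convert strings like '4144 cm' to floats in metres; missing values stay NaN."""
    parts = series.dropna().str.split(" ", n=1, expand=True)
    values = parts[0].astype(float)
    factors = parts[1].map(TO_METRES)
    if factors.isna().any():
        unknown = set(parts[1][factors.isna()])
        raise ValueError(f"No conversion factor for units {unknown}")
    # Put the converted values back on the original index
    return (values * factors).reindex(series.index)

sample = pd.Series(["4144 cm", None, "12.5 m"])
converted = quantity_to_metres(sample)
print(converted)
```

Raising on an unknown unit is a deliberate choice here: a silent NaN would be hard to tell apart from a genuinely missing entry.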


**Columns of table 'Running', 'WorkoutEvent', and 'WorkoutRoute'**

In [144]:
workout_tables['Running'].columns

Index(['workoutActivityType', 'duration', 'durationUnit', 'totalDistance',
       'totalDistanceUnit', 'totalEnergyBurned', 'totalEnergyBurnedUnit',
       'sourceName', 'sourceVersion', 'creationDate', 'startDate', 'endDate',
       'HKIndoorWorkout', 'HKAverageSpeed', 'HKMaximumSpeed', 'WorkoutEvent',
       'HKElevationDescended', 'HKTimeZone', 'HKElevationAscended',
       'HKWasUserEntered', 'device', 'HKAverageMETs', 'HKWeatherTemperature',
       'HKWeatherHumidity', 'WorkoutRoute'],
      dtype='object')

In [148]:
routes.columns

Index(['sourceName', 'sourceVersion', 'creationDate', 'startDate', 'endDate',
       'workoutType', 'workoutIndex', 'HKMetadataKeySyncVersion',
       'HKMetadataKeySyncIdentifier', 'Filepath'],
      dtype='object')

In [150]:
events.columns

Index(['type', 'date', 'duration', 'durationUnit', 'workoutType',
       'workoutIndex'],
      dtype='object')

#### 3.3.2 Checking workoutIndex values for WorkoutRoutes table

Let's look at one Running entry:

- Select a specific Running entry, with index = 357.
- Get 'startDate' from this entry and use string split to single out the date portion 'YYYY-mm-dd'.
- Filter the Routes table on 'workoutType' = 'Running' and 'workoutIndex' = 357 to single out the matching route.
- Get 'startDate' from this route and use string split to single out the date portion 'YYYY-mm-dd'.
- Compare the two strings and check that they're the same.

In [146]:
# let's look at Running node # 357 on workouts table
indx = 357 
run_node = workout_tables['Running'].loc[indx]

# Let's check that the route marked with Running #357 has the same start date
running_routes = routes[routes['workoutType'] == "Running"]
route_node = running_routes[running_routes["workoutIndex"] == float(indx)]

route_startdate = route_node.iloc[0]['startDate'].split(' ')[0]
run_startdate = run_node['startDate'].split(' ')[0]
assert route_startdate == run_startdate

print(f"Route start date: {route_startdate}, Run start date: {run_startdate}")

Route start date: 2022-03-02, Run start date: 2022-03-02


In [147]:
running_routes

Unnamed: 0,sourceName,sourceVersion,creationDate,startDate,endDate,workoutType,workoutIndex,HKMetadataKeySyncVersion,HKMetadataKeySyncIdentifier,Filepath
5,Nadine’s Apple Watch,7.2,2021-02-01 17:01:28 -0800,2021-02-01 16:17:55 -0800,2021-02-01 17:00:14 -0800,Running,202.0,2,EA6DE576-6E4A-4B60-A6B7-D9A332C1D136,/workout-routes/route_2021-02-01_5.00pm.gpx
6,Nadine’s Apple Watch,7.2,2021-02-02 17:06:33 -0800,2021-02-02 16:19:28 -0800,2021-02-02 17:06:14 -0800,Running,203.0,2,A97E5BD9-701E-4ABD-B3A6-06161DB02B2D,/workout-routes/route_2021-02-02_5.06pm.gpx
7,Nadine’s Apple Watch,7.3,2021-02-04 17:20:19 -0800,2021-02-04 16:30:43 -0800,2021-02-04 17:19:36 -0800,Running,204.0,2,D212C37D-1B01-47F8-8EAB-C93F600FA920,/workout-routes/route_2021-02-04_5.19pm.gpx
13,Nadine’s Apple Watch,7.3,2021-02-12 17:38:47 -0800,2021-02-12 16:59:18 -0800,2021-02-12 17:38:11 -0800,Running,206.0,2,BA162623-9510-46C4-BCCA-F286A35A6538,/workout-routes/route_2021-02-12_5.38pm.gpx
15,Nadine’s Apple Watch,7.3,2021-02-13 17:20:50 -0800,2021-02-13 16:38:31 -0800,2021-02-13 17:15:31 -0800,Running,207.0,2,FE552B8C-EE03-4250-A7EB-A1043E554AF4,/workout-routes/route_2021-02-13_5.15pm.gpx
...,...,...,...,...,...,...,...,...,...,...
294,Nadine’s Apple Watch,8.3,2022-02-23 18:03:27 -0800,2022-02-23 17:36:54 -0800,2022-02-23 18:03:02 -0800,Running,354.0,2,6C4D3DA4-D087-4A44-A2B7-9572B306CBE2,/workout-routes/route_2022-02-23_6.03pm.gpx
296,Nadine’s Apple Watch,8.3,2022-02-25 18:21:59 -0800,2022-02-25 17:42:31 -0800,2022-02-25 18:21:11 -0800,Running,355.0,2,F70BB11A-D951-4EED-87D7-5B47990733D8,/workout-routes/route_2022-02-25_6.21pm.gpx
298,Nadine’s Apple Watch,8.3,2022-02-26 17:47:55 -0800,2022-02-26 17:26:30 -0800,2022-02-26 17:47:23 -0800,Running,356.0,2,72D86576-9175-4D76-9A9A-3F74A3C290AC,/workout-routes/route_2022-02-26_5.47pm.gpx
301,Nadine’s Apple Watch,8.3,2022-03-02 18:08:53 -0800,2022-03-02 17:33:52 -0800,2022-03-02 18:08:26 -0800,Running,357.0,2,59F63D1C-6C1B-41D2-B181-0735E145B156,/workout-routes/route_2022-03-02_6.08pm.gpx


Use the same process, except iterate through all the Running entries with 'WorkoutRoute' column set to True:

In [149]:
# Let's extend this across all Running entries
runs_with_routes = workout_tables['Running'][workout_tables['Running']['WorkoutRoute'] == True]
running_routes = routes[routes['workoutType'] == "Running"]

for idx in runs_with_routes.index:
    # Get individual run
    run_node = workout_tables['Running'].loc[idx]

    # Get individual route
    route_node = running_routes[running_routes["workoutIndex"] == float(idx)]

    # Fail loudly if a run flagged with a route has no matching row in routes
    assert not route_node.empty, f"No route found for Running index {idx}"

    route_startdate = route_node.iloc[0]['startDate'].split(' ')[0]
    run_startdate = run_node['startDate'].split(' ')[0]
    
    assert route_startdate == run_startdate
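
The same consistency check could also be done without a Python loop by joining the two tables on the workout index. This is a sketch on toy data: the frames `toy_runs` and `toy_routes` and their dates are made up, and only the column names mirror the tables above.

```python
import pandas as pd

# Hypothetical stand-ins for workout_tables['Running'] and routes
toy_runs = pd.DataFrame({
    "startDate": ["2022-02-26 17:26:30 -0800",
                  "2022-03-01 07:00:00 -0800",
                  "2022-03-02 17:33:52 -0800"],
    "WorkoutRoute": [True, None, True],
})
toy_routes = pd.DataFrame({
    "workoutType": ["Running", "Running"],
    "workoutIndex": [0.0, 2.0],
    "startDate": ["2022-02-26 17:26:30 -0800", "2022-03-02 17:33:52 -0800"],
})

# Turn the run index into a 'workoutIndex' column so both sides share a key
runs_flagged = (
    toy_runs[toy_runs["WorkoutRoute"] == True]
    .rename_axis("workoutIndex")
    .reset_index()
)
running_only = toy_routes[toy_routes["workoutType"] == "Running"].astype({"workoutIndex": int})

joined = runs_flagged.merge(
    running_only, on="workoutIndex", how="left",
    suffixes=("_run", "_route"), indicator=True,
)

# Runs flagged with a route but with no matching row in the routes table
missing_routes = joined[joined["_merge"] == "left_only"]

# Compare only the 'YYYY-mm-dd' portion of each start date
run_dates = joined["startDate_run"].str.split(" ").str[0]
route_dates = joined["startDate_route"].str.split(" ").str[0]
mismatched = joined[run_dates != route_dates]

assert missing_routes.empty
assert mismatched.empty
```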


## 4.0 Expand routine to other node types

From section [3.0](#30-extracting-nodeselements-with-tag-workout-and-their-child-nodes) we have the routines for extracting data from all Workout nodes (including their children) and storing them in a set of DataFrames.

Specifically for the 'Workout' node, we have as outputs:
- One DataFrame per "workoutActivityType" (Workout node.attrib['workoutActivityType'])
    - Stored in a dictionary keyed by type (without the prefix "HKWorkoutActivityType").
    - How many DataFrames we have depends on the dataset. If a user only had 'Walking' workouts,
      the routine would output only 1 DataFrame. If another had 'Walking' and 'Cycling' entries, it would output 2
      DataFrames.
- 1 DataFrame for WorkoutEvent nodes
- 1 DataFrame for WorkoutRoute nodes

It's a bit lengthy though, and some blocks should be wrapped in functions so that the temporary DataFrames created on each iteration go out of scope and their memory can be reclaimed, rather than lingering as notebook-level variables.
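
One way to avoid creating and discarding a DataFrame on every iteration is to collect plain dictionaries first and build each DataFrame once at the end. This is a sketch of that idea rather than the routine used in this notebook; the toy XML and the names `rows_by_type` and `toy_tables` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

import pandas as pd

# Hypothetical miniature export with three Workout nodes
toy_root = ET.fromstring(
    "<HealthData>"
    "<Workout workoutActivityType='HKWorkoutActivityTypeRunning' duration='25.9'/>"
    "<Workout workoutActivityType='HKWorkoutActivityTypeWalking' duration='12.0'/>"
    "<Workout workoutActivityType='HKWorkoutActivityTypeRunning' duration='37.7'/>"
    "</HealthData>"
)

# Collect one dict per node, grouped by activity type
rows_by_type = defaultdict(list)
for node in toy_root.findall("./Workout"):
    activity = node.attrib["workoutActivityType"].removeprefix("HKWorkoutActivityType")
    rows_by_type[activity].append(dict(node.attrib))

# Build each table in a single step instead of concatenating row by row
toy_tables = {name: pd.DataFrame(rows) for name, rows in rows_by_type.items()}

for name, table in toy_tables.items():
    print(f"{name}: {len(table)} elements")
```

The row index of a node within its table is still available during the loop as `len(rows_by_type[activity]) - 1`, so child nodes could be linked back in the same way as with `workoutIndex`.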

We'll start by cleaning up the routine, wrapping repeated blocks in functions, and adapting it for the other node tags: ['ActivitySummary', 'Me', 'ExportDate', 'ClinicalRecord']. These top-level node tags don't have any known children, so they should be easier to tackle first.

The remaining tags ('Record', 'Correlation', and 'Audiogram') have children that need special handling. 

We define the helper routines:

### 4.1 ['Workout', 'ActivitySummary', 'Me', 'ExportDate', 'ClinicalRecord']

In [200]:
def get_subtree(rootnode):
    """ Depth-first (post-order) tree traversal: yields every descendant
    of rootnode before yielding rootnode itself.
    """
    for subelem in rootnode.findall('./'):
        yield from get_subtree(subelem)
    yield rootnode

def check_if_workout_route(metadata_node):
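    """ Returns whether or not input MetadataEntry node is a 
    child of a WorkoutRoute node.
    Assumes that node passed in has tag == "MetadataEntry".

    This is a heuristic based on this export: WorkoutRoute metadata keys
    contain "HKMetadataKey" (e.g. "HKMetadataKeySyncVersion"), while the
    Workout metadata keys seen so far do not.
    """
    return "HKMetadataKey" in metadata_node.attrib['key']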

def add_workout_property(workoutchild, workouttype, workoutidx, tables_dict):
    """ Adds data from a node 'workoutchild' to a table in 'tables_dict'
    (tables_dict[workouttype]). This function adds two columns: 'workoutType'
    and 'workoutIndex' mapping to values 'workouttype' and 'workoutidx' 
    respectively.

    Since dicts are mutable, this function will modify the dictionary passed
    in. 

    Args:
        workoutchild (Element)
        workouttype (str)
        workoutidx (int or float)
        tables_dict (dict)
    """
    df = pd.DataFrame([workoutchild.attrib])
    df.loc[0, ['workoutType', 'workoutIndex']] = [workouttype, workoutidx]
    
    if workoutchild.tag not in tables_dict.keys():
        tables_dict[workoutchild.tag] = df
    else:
        tableref = tables_dict[workoutchild.tag]
        tables_dict[workoutchild.tag] = pd.concat([tableref, df], ignore_index=True)


In [211]:
# Modified routine
def extract_top_level_nodes(t, treeroot, tabledict):

    for node in treeroot.findall(f'./{t}'):
        # Add data from this node to its respective table within tabledict
        node_table = pd.DataFrame([node.attrib])

        if t == "Workout":
            table_name = node.attrib["workoutActivityType"].removeprefix("HKWorkoutActivityType")
        else:
            table_name = t

        # Needs to be inside the for loop because, for nodes like 'Workout'
        # and 'Record', the table name is derived from an attribute of each node
        if table_name not in tabledict.keys(): 
            tabledict[table_name] = pd.DataFrame()
        
        # Add current node to table
        temp_table = tabledict[table_name]
        tabledict[table_name] = pd.concat([temp_table, node_table], ignore_index=True)

        # Get index of current activity within the tabledict[table_name]
        idx = len(tabledict[table_name]) - 1

        workout_route_queue = [] # If t = "Workout", this holds the children nodes of subtree
                                 # of a WorkoutRoute node.

        # Iterate through the children of this current node (if t in ['Workout', 'Record', 'Correlation'])
        for child in get_subtree(node):

            # If statements for all the known children of a Workout node
            if child.tag == "MetadataEntry":
                # If its parent is a WorkoutRoute node
                if check_if_workout_route(child):
                    workout_route_queue.append(child)
                else: # Its parent is a Workout, Record, Correlation node
                    tabledict[table_name].loc[idx, child.attrib['key']] = child.attrib['value']
            elif child.tag == "WorkoutEvent":
                # Create a column in the node's table named "WorkoutEvent" and set to True
                tabledict[table_name].loc[idx, "WorkoutEvent"] = True
                add_workout_property(child, table_name, idx, tabledict)

            elif child.tag == "WorkoutRoute":
                # get_subtree() implements depth-first search, so when it gets to
                # a WorkoutRoute node, it would have already traversed through
                # the children (if any) of this current WorkoutRoute node 
                # and added the children nodes to workout_route_queue
                tabledict[table_name].loc[idx, "WorkoutRoute"] = True
                add_workout_property(child, table_name, idx, tabledict)

                while len(workout_route_queue) > 0:
                    route_child = workout_route_queue[0]
                
                    if route_child.tag == "FileReference":
                        col_name = 'Filepath'
                        value_key = 'path'
                    else:  # WorkoutRoute child node tag is 'MetadataEntry'
                        col_name = route_child.attrib['key']
                        value_key = 'value'
                    
                    # Update WorkoutRoute node with child data
                    route_index = len(tabledict[child.tag]) - 1
                    tabledict[child.tag].loc[route_index, col_name] = route_child.attrib[value_key]
                    workout_route_queue.pop(0) # Remove child from queue

            elif child.tag == "FileReference":
                # Queue it until its parent WorkoutRoute node is reached
                workout_route_queue.append(child)

            elif child.tag == t:
                pass

            else:
                raise ValueError(f"Have not implemented extraction rules for child node of {t} with tag '{child.tag}'")

    # TESTING
    # Check lengths of each table
    func_sum = 0
    for key, val in tabledict.items():
        tbl_len = len(val)
        if t == "Workout":
            if key not in ['WorkoutRoute', 'WorkoutEvent'] + top_level_nodes:
                func_sum += tbl_len
                print(f"{key}: {tbl_len} elements")
        else:
            if key == t:
                func_sum += tbl_len
                print(f"{key}: {tbl_len} elements")

    # Test that the lengths of the relevant tables sum to the number of top-level nodes with tag t
    assert func_sum == len(treeroot.findall(f"./{t}"))


Let's test for tag = 'ExportDate'

In [212]:
all_tables = {}
extract_top_level_nodes('ExportDate', root, all_tables)

ExportDate: 1 elements


In [213]:
all_tables['ExportDate']

Unnamed: 0,value
0,2022-03-06 11:36:48 -0800


Now let's test for tag = 'Workout' and see if we get the same results.

In [214]:
extract_top_level_nodes('Workout', root, all_tables)

Running: 359 elements
Barre: 8 elements
HighIntensityIntervalTraining: 36 elements
CoreTraining: 21 elements
Pilates: 30 elements
FunctionalStrengthTraining: 62 elements
Yoga: 14 elements
CrossTraining: 42 elements
Walking: 157 elements
Flexibility: 127 elements
Cooldown: 25 elements
Other: 1 elements
CardioDance: 4 elements
Hiking: 3 elements


In [223]:
all_tables['Running']

Unnamed: 0,workoutActivityType,duration,durationUnit,totalDistance,totalDistanceUnit,totalEnergyBurned,totalEnergyBurnedUnit,sourceName,sourceVersion,creationDate,...,WorkoutEvent,HKElevationDescended,HKTimeZone,HKElevationAscended,HKWasUserEntered,device,HKAverageMETs,HKWeatherTemperature,HKWeatherHumidity,WorkoutRoute
0,HKWorkoutActivityTypeRunning,25.91666666666667,min,3.669,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,...,,,,,,,,,,
1,HKWorkoutActivityTypeRunning,37.73333333333333,min,5.327,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,...,,,,,,,,,,
2,HKWorkoutActivityTypeRunning,41.33333333333334,min,6.808,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,...,,,,,,,,,,
3,HKWorkoutActivityTypeRunning,42.06666666666667,min,6.63,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,...,,,,,,,,,,
4,HKWorkoutActivityTypeRunning,51.23333333333333,min,8.256,km,0,Cal,RunGap,671,2022-03-06 00:09:32 -0800,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354,HKWorkoutActivityTypeRunning,19.20116647283236,min,2.805828900913457,km,122.013844088736,Cal,Nadine’s Apple Watch,8.3,2022-02-23 18:03:17 -0800,...,True,,America/Los_Angeles,4144 cm,,"<<HKDevice: 0x283590820>, name:Apple Watch, ma...",7.63099 kcal/hr·kg,44.6 degF,3200 %,True
355,HKWorkoutActivityTypeRunning,34.72069991429647,min,5.021799267990767,km,210.0138024178834,Cal,Nadine’s Apple Watch,8.3,2022-02-25 18:21:44 -0800,...,True,,America/Los_Angeles,3068 cm,,"<<HKDevice: 0x283590820>, name:Apple Watch, ma...",8.95977 kcal/hr·kg,59 degF,2500 %,True
356,HKWorkoutActivityTypeRunning,17.44557295441627,min,2.468255764373055,km,104.7491072547596,Cal,Nadine’s Apple Watch,8.3,2022-02-26 17:47:36 -0800,...,True,,America/Los_Angeles,3971 cm,,"<<HKDevice: 0x283590820>, name:Apple Watch, ma...",8.17095 kcal/hr·kg,64.4 degF,1400 %,True
357,HKWorkoutActivityTypeRunning,32.08665848771731,min,4.243732956817863,km,214.2973264160422,Cal,Nadine’s Apple Watch,8.3,2022-03-02 18:08:39 -0800,...,True,,America/Los_Angeles,7317 cm,,"<<HKDevice: 0x283590820>, name:Apple Watch, ma...",10.0936 kcal/hr·kg,75.2 degF,1800 %,True


Now let's test it for the following list of tags:

In [215]:
list_of_tags = ['ActivitySummary', 'Me', 'ClinicalRecord']

In [216]:
for tag in list_of_tags:
    extract_top_level_nodes(tag, root, all_tables)

ActivitySummary: 438 elements
Me: 1 elements


In [217]:
all_tables.keys()

dict_keys(['ExportDate', 'Running', 'WorkoutEvent', 'Barre', 'HighIntensityIntervalTraining', 'CoreTraining', 'Pilates', 'FunctionalStrengthTraining', 'Yoga', 'CrossTraining', 'Walking', 'Flexibility', 'WorkoutRoute', 'Cooldown', 'Other', 'CardioDance', 'Hiking', 'ActivitySummary', 'Me'])

In [221]:
all_tables['ActivitySummary'].loc[320]

dateComponents            2021-11-09
activeEnergyBurned           346.305
activeEnergyBurnedGoal           320
activeEnergyBurnedUnit           Cal
appleMoveTime                      0
appleMoveTimeGoal                  0
appleExerciseTime                 66
appleExerciseTimeGoal             45
appleStandHours                    8
appleStandHoursGoal               10
Name: 320, dtype: object

In [225]:
# Write to csv
for k, v in all_tables.items():
    v.to_csv(f"20220306_{k}.csv")

### 4.2 Remaining node tags to work on: ['Record', 'Correlation', 'Audiogram']

#### 4.2.1 Record