Running data exploration
========================

In this notebook I explore my running data exported from the Runkeeper app. The data itself is very primitive: just XML files full of GPS checkpoints. This means I will have to calculate distances myself.

Prerequisites:
 - python3
 - ``data`` folder contains all the files exported from Runkeeper.
 - python packages: ``pip3 install python-dateutil geopy bokeh --user``

In [1]:
ls data/

2014-09-07-1616.gpx  2015-03-06-2142.gpx  2015-10-05-2208.gpx
2014-09-09-0906.gpx  2015-03-09-1753.gpx  2015-10-28-1207.gpx
2014-09-12-2043.gpx  2015-03-12-2113.gpx  2016-03-04-1838.gpx
2014-09-20-1017.gpx  2015-03-17-1658.gpx  2016-04-05-2253.gpx
2014-09-24-0900.gpx  2015-04-04-1355.gpx  2016-05-01-2045.gpx
2014-09-28-2045.gpx  2015-04-05-1327.gpx  2016-05-21-1723.gpx
2014-10-04-1951.gpx  2015-05-04-2003.gpx  2016-05-30-2106.gpx
2014-10-30-2039.gpx  2015-05-10-2130.gpx  2016-06-04-1909.gpx
2014-11-09-1830.gpx  2015-06-12-1957.gpx  2016-06-07-2137.gpx
2014-12-08-1733.gpx  2015-07-07-2242.gpx  2016-06-10-2134.gpx
2015-01-08-1930.gpx  2015-07-11-2132.gpx  2016-06-19-1714.gpx
2015-01-11-1406.gpx  2015-07-28-2009.gpx  2016-06-25-2054.gpx
2015-01-20-1855.gpx  2015-08-01-2128.gpx  cardioActivities.csv
2015-01-23-1949.gpx  2015-08-08-1956.gpx  measurements.csv
2015-01-28-1855.gpx  2015-08-28-2159.gpx
2015-02-19-1957.gpx  2015-09-25-1113.gpx


Runkeeper calls these files .gpx because GPX (GPS Exchange Format) is a standard XML format for GPS data. Let's see what is inside one of these:


```xml
<?xml version="1.0" encoding="UTF-8"?>
<gpx
    version="1.1"
    creator="Runkeeper - http://www.runkeeper.com"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://www.topografix.com/GPX/1/1"
    xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
    xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1">
    <trk>
        <name><![CDATA[Running 6/25/16 8:54 pm]]></name>
        <time>2016-06-25T18:54:53Z</time>
        <trkseg>
            <trkpt lat="50.599000" lon="14.205000">
               <ele>327.0</ele>
               <time>2016-06-25T18:54:53Z</time>
            </trkpt>
            <trkpt lat="50.599000" lon="14.205000">
               <ele>326.7</ele>
               <time>2016-06-25T18:54:55Z</time>
            </trkpt>
        </trkseg>
    </trk>
</gpx>
```

Note that I truncated the ``trkseg`` section because it had so many datapoints. Anyway, as you can see, the file contains a datapoint every few seconds (the two points shown are 2 seconds apart).


Parsing
=======

I am going to use Python's ``xml.etree`` library because I don't want to add additional dependencies. I will get a bit low-level at the beginning and refactor this parsing mess afterwards.

Note that most of the ugliness of this code comes from XML namespaces. Unfortunately, there is no option to turn them off in the default ``xml.etree`` library; the lxml project would probably be better, at least for this purpose.
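One way to tame the namespace noise without leaving the standard library is to pass a prefix map to ``find``. The sketch below is not what the cells in this notebook do; the prefix name ``gpx`` and the inline ``SAMPLE`` document are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the prefix "gpx" is an arbitrary name chosen for this sketch.
NS = {'gpx': 'http://www.topografix.com/GPX/1/1'}

# A tiny stand-in for one of the exported files (assumption: same structure).
SAMPLE = """<gpx version="1.1" xmlns="http://www.topografix.com/GPX/1/1">
  <trk><trkseg>
    <trkpt lat="50.599000" lon="14.205000">
      <ele>327.0</ele>
      <time>2016-06-25T18:54:53Z</time>
    </trkpt>
  </trkseg></trk>
</gpx>"""

root = ET.fromstring(SAMPLE)

# With the map, paths use the short prefix instead of the full {uri} form.
point = root.find('.//gpx:trkseg/gpx:trkpt', NS)
elevation = float(point.find('gpx:ele', NS).text)
timestamp = point.find('gpx:time', NS).text
```

The namespace is still there, it is just spelled once in ``NS`` instead of in every path.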



In [2]:
import xml.etree.ElementTree as ET
tree = ET.parse('data/2016-06-25-2054.gpx')
trkseg = tree.find('.//{http://www.topografix.com/GPX/1/1}trkseg')

In [3]:
records = list(trkseg)
len(records)

499

In [4]:
first = records[0]
first.attrib.keys()

dict_keys(['lon', 'lat'])

In [5]:
list(first)

[<Element '{http://www.topografix.com/GPX/1/1}ele' at 0x7fd1dc0534a8>,
 <Element '{http://www.topografix.com/GPX/1/1}time' at 0x7fd1dc0534f8>]

In [6]:
elevation = first.find('{http://www.topografix.com/GPX/1/1}ele')
elevation.text

'327.0'

In [7]:
first_time = first.find('{http://www.topografix.com/GPX/1/1}time')
first_time.text

'2016-06-25T18:54:53Z'

# Refactoring

Now that you have seen how to work with those files using the low-level XML API, let's hide this logic in a simple class that can be constructed from just a filename.

In [8]:
from dateutil.parser import parse

class Run:
    def __init__(self, filename):
        self.tree = ET.parse(filename)
        self.records = [
            {
                'lat': record.attrib['lat'],
                'lng': record.attrib['lon'],
                'time': parse(record.find('{http://www.topografix.com/GPX/1/1}time').text),
                'elevation': float(record.find('{http://www.topografix.com/GPX/1/1}ele').text),
            } for record in self.tree.find('.//{http://www.topografix.com/GPX/1/1}trkseg')
        ]

This is much shorter than the original version. It is very simple and does not have any exception handling; let's just assume everything goes all right. Now let's load all the runs.

In [9]:
import glob, os

def load_runs(data_dir='./data'):
    glob_joined = os.path.join(data_dir, '*.gpx')
    return [Run(f) for f in glob.glob(glob_joined)]
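The loader above assumes every file parses. If that assumption ever breaks, a more defensive variant could look roughly like the sketch below. The function name ``parse_gpx_files`` is hypothetical, and it returns plain ``ElementTree`` objects rather than ``Run`` instances so that it stays self-contained.

```python
import glob
import os
import xml.etree.ElementTree as ET

def parse_gpx_files(data_dir='./data'):
    """Return (parsed, failed): trees that parsed, and paths of files that did not."""
    parsed, failed = [], []
    # sorted() gives a stable order; glob itself makes no ordering promise.
    for path in sorted(glob.glob(os.path.join(data_dir, '*.gpx'))):
        try:
            parsed.append(ET.parse(path))
        except ET.ParseError:
            # Malformed XML: remember the file and keep going.
            failed.append(path)
    return parsed, failed
```

Collecting the failures instead of raising means one broken export does not stop the rest from loading.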

In [10]:
from geopy.distance import distance
from geopy import Point

Yeah I know, you probably wanted me to calculate the distance on my own and not just import a module and be done with it. But I am lazy and don't know shit about geography. Read the distance implementation here: https://github.com/geopy/geopy/blob/master/geopy/distance.py and here: https://en.wikipedia.org/wiki/Vincenty%27s_formulae
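For the curious, here is roughly what a hand-rolled version could look like. This is a sketch of the haversine formula, which treats the Earth as a sphere; it is a simpler technique than the ellipsoidal formulae linked above and is not what geopy's ``distance`` does. The radius constant and the function name ``haversine_m`` are assumptions for this example.

```python
from math import asin, cos, radians, sin, sqrt

# Assumed mean Earth radius in metres (spherical approximation).
EARTH_RADIUS_M = 6371000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))
```

Because it ignores the Earth's flattening, the result can differ slightly from an ellipsoidal calculation, so I will stick with the imported ``distance``.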

Anyway, let's modify the ``Run`` class to support distance calculation.

In [11]:
class Run:
    def __init__(self, filename):
        self.tree = ET.parse(filename)
        self.records = [
            {
                'lat': record.attrib['lat'],
                'lng': record.attrib['lon'],
                'time': parse(record.find('{http://www.topografix.com/GPX/1/1}time').text),
                'elevation': float(record.find('{http://www.topografix.com/GPX/1/1}ele').text),
            } for record in self.tree.find('.//{http://www.topografix.com/GPX/1/1}trkseg')
        ]
        
    def pluck_attribute(self, attribute):
        return [r[attribute] for r in self.records]
    
    @property
    def elevations(self):
        return self.pluck_attribute('elevation')
    
    @property
    def times(self):
        return self.pluck_attribute('time')
    
    @property
    def speed(self):
        'm/s'
        return self.distance_total / self.time_total_s
    
    @property
    def speed_kmph(self):
        'km/h'
        return self.speed * 3.6
    
    @property
    def pace(self):
        'Pace in min/km'
        return (self.time_total_s / 60) / (self.distance_total / 1000)
    
    
    @property
    def distance_total(self):
        """
        Returns distance in metres from the total run
        """
        distances = []
        for i in range(len(self.records)-1):
            this_point = Point(self.records[i]['lat'], self.records[i]['lng'])
            next_point = Point(self.records[i+1]['lat'], self.records[i+1]['lng'])
            distances.append(distance(this_point, next_point).meters)
        
        return sum(distances)
    
    @property
    def start(self):
        return min(self.times)
    
    @property
    def end(self):
        return max(self.times)
    
    @property
    def time_total(self):
        "Time total"
        return self.end - self.start
    
        
    @property
    def time_total_s(self):
        "Time total in seconds"
        # total_seconds() also counts whole days; .seconds alone would drop them
        return self.time_total.total_seconds()

last_night = Run('data/2016-06-25-2054.gpx')

In [12]:
last_night.distance_total

5641.157399618419

This is the same value that Runkeeper shows. Nice.

In [13]:
last_night.time_total_s / 60

50.8

In [14]:
last_night.speed_kmph

6.662784330257975

In [15]:
last_night.pace

9.00524420811875
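The pace above is in decimal minutes per kilometre (50.8 min / 5.641 km ≈ 9.005), which is not how runners usually read it. A small helper can turn it into minutes and seconds; the name ``format_pace`` is made up for this sketch and is not part of the ``Run`` class.

```python
def format_pace(pace_min_per_km):
    """Format a decimal min/km pace as 'M:SS min/km'."""
    # Round to whole seconds first so that e.g. 4.999 becomes 5:00, not 4:60.
    total_seconds = round(pace_min_per_km * 60)
    minutes, seconds = divmod(total_seconds, 60)
    return '{}:{:02d} min/km'.format(minutes, seconds)

print(format_pace(9.00524420811875))  # prints "9:00 min/km"
```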

# Plot it
![Plot cesky](http://www.saternus.sk/a/files/produkte/PLOT_DREVENY_Z_DOSIEK/PLOT_DREVENY_Z_DOSIEK_high.jpg)


In [16]:
from bokeh.plotting import figure, show, output_notebook
output_notebook()

In [17]:
elevation = figure(
    title="Elevation from last night's run",
    background_fill_color="#E8DDCB",
    y_axis_label='Elevation (m)', 
    x_axis_label='Time',
    x_axis_type="datetime"
)

elevation.line(last_night.times, last_night.elevations)
show(elevation)


A very nice graph that looks just like the one I saw in Runkeeper. Now let's plot the overall trends across all the runs.

In [18]:
all_runs = load_runs()

In [19]:
distances = [r.distance_total for r in all_runs]
times = [r.start for r in all_runs]

p = figure(
    title="Distances run over time", 
    background_fill_color="#E8DDCB",
    y_axis_label='Distance (m)', 
    x_axis_label='Time',
    x_axis_type="datetime"
)

p.circle(times, distances)
show(p)

In [26]:
p = figure(
    title="Speeds ", 
    background_fill_color="#E8DDCB",
    x_axis_label='Distance (km)', 
    y_axis_label='Time in hours',
)

p.circle(
    [r.distance_total / 1000 for r in all_runs],
    [r.time_total_s / 3600 for r in all_runs]
)
show(p)

On the graph above you can see that my speed is very stable overall, which is kind of disturbing: I am not improving at all. :(
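One possible way to put a number on "not improving" would be to fit a straight line to pace over time and look at its slope. The sketch below is a plain least-squares slope using only built-ins; the function name ``linear_slope`` is hypothetical and I have not run it on the real data.

```python
def linear_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance
```

Fed with days since the first run as ``xs`` and each run's ``pace`` as ``ys``, a slope near zero would mean no change, and a negative slope would mean the pace is getting faster.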