# 01 - Speed vs Distance

This started as a purely exploratory notebook but I ended up focusing on distance vs average speed of my runs.

In [1]:
import pandas as pd

I exported my MapMyFitness data using these instructions: https://support.mapmyfitness.com/hc/en-us/articles/200118594-Export-Workout-Data.

Load in the CSV.

In [2]:
data = pd.read_csv('data/user82388963_workout_history.csv')
data.head()

Unnamed: 0,Date Submitted,Workout Date,Activity Type,Calories Burned (kCal),Distance (mi),Workout Time (seconds),Avg Pace (min/mi),Max Pace (min/mi),Avg Speed (mi/h),Max Speed (mi/h),Avg Heart Rate,Steps,Notes,Source,Link
0,"March 26, 2023","March 26, 2023",Run,715,5.36776,2275,7.05833,1.081,8.50059,55.5044,166.0,6221.0,b'Shoes: Karhu Fusion 2021 2\n\nThis was the ...,Map My Fitness MapMyRun iPhone,http://www.mapmyfitness.com/workout/7175814409
1,"March 19, 2023","March 19, 2023",Run,886,6.2379,3329,8.89212,0.438779,6.74755,136.743,149.0,8647.0,b'Shoes: Karhu Fusion 2021',Map My Fitness MapMyRun iPhone,http://www.mapmyfitness.com/workout/7164286915
2,"March 16, 2023","March 16, 2023",Run,849,5.94303,3046,8.53909,0.362757,7.02651,165.4,158.0,8184.0,"b'Shoes: New Balance M1080K10\n\nLight rain, b...",Map My Fitness MapMyRun iPhone,http://www.mapmyfitness.com/workout/7158791356
3,"March 5, 2023","March 5, 2023",Run,892,6.00173,3245,9.00933,0.951973,6.65976,63.027,155.0,8714.0,b'Shoes: Karhu Fusion 2021 2\n\nHot and humid',Map My Fitness MapMyRun iPhone,http://www.mapmyfitness.com/workout/7140104104
4,"March 1, 2023","March 1, 2023",Run,720,4.93614,2451,8.27323,0.837613,7.25231,71.6321,154.0,6494.0,b'Shoes: Karhu Fusion 2021',Map My Fitness MapMyRun iPhone,http://www.mapmyfitness.com/workout/7132914841


What are the activity types?
I'm only interested in my running workouts.

In [3]:
data['Activity Type'].value_counts()

Run                 419
Indoor Run / Jog    188
Weight Workout       12
Bike Ride             5
Machine Workout       5
Gym Workout           1
Walk                  1
Name: Activity Type, dtype: int64

In [5]:
run_activity_types = ['Run', 'Indoor Run / Jog']
data = data[data['Activity Type'].isin(run_activity_types)]

Save that data in its own file.

In [6]:
data.to_csv('data/runs.csv', index=False)

Plot the mileage / speed interplay.

In [7]:
data.head()

Unnamed: 0,Date Submitted,Workout Date,Activity Type,Calories Burned (kCal),Distance (mi),Workout Time (seconds),Avg Pace (min/mi),Max Pace (min/mi),Avg Speed (mi/h),Max Speed (mi/h),Avg Heart Rate,Steps,Notes,Source,Link
0,"March 26, 2023","March 26, 2023",Run,715,5.36776,2275,7.05833,1.081,8.50059,55.5044,166.0,6221.0,b'Shoes: Karhu Fusion 2021 2\n\nThis was the ...,Map My Fitness MapMyRun iPhone,http://www.mapmyfitness.com/workout/7175814409
1,"March 19, 2023","March 19, 2023",Run,886,6.2379,3329,8.89212,0.438779,6.74755,136.743,149.0,8647.0,b'Shoes: Karhu Fusion 2021',Map My Fitness MapMyRun iPhone,http://www.mapmyfitness.com/workout/7164286915
2,"March 16, 2023","March 16, 2023",Run,849,5.94303,3046,8.53909,0.362757,7.02651,165.4,158.0,8184.0,"b'Shoes: New Balance M1080K10\n\nLight rain, b...",Map My Fitness MapMyRun iPhone,http://www.mapmyfitness.com/workout/7158791356
3,"March 5, 2023","March 5, 2023",Run,892,6.00173,3245,9.00933,0.951973,6.65976,63.027,155.0,8714.0,b'Shoes: Karhu Fusion 2021 2\n\nHot and humid',Map My Fitness MapMyRun iPhone,http://www.mapmyfitness.com/workout/7140104104
4,"March 1, 2023","March 1, 2023",Run,720,4.93614,2451,8.27323,0.837613,7.25231,71.6321,154.0,6494.0,b'Shoes: Karhu Fusion 2021',Map My Fitness MapMyRun iPhone,http://www.mapmyfitness.com/workout/7132914841


In [8]:
import altair as alt
alt.Chart(data).mark_point().encode(
    x='Avg Pace (min/mi)',
    y='Distance (mi)',
    color='Activity Type:N'
)

Some cool stuff already -- my outdoor runs are pretty randomly distributed in terms of distance (y axis) but my indoor runs are mostly clustered around a few round distances.
This makes a lot of sense: on a treadmill, I aim for a nice round number of miles.

One issue I see here is that there are quite a few runs of less than a mile and they have a huge variance in terms of pace: from 0 (?) to almost 18 minutes per mile.
I am not very interested in my runs of less than a mile, so will remove those records.

In [9]:
print('Record count of all runs:', len(data))
data = data[data['Distance (mi)'] >= 1]
print('Record count of runs of at least one mile:', len(data))

Record count of all runs: 607
Record count of runs of at least one mile: 583


Plot again but with the filtered data.

In [10]:
import altair as alt
run_chart = alt.Chart(data).mark_point().encode(
    x='Distance (mi)',
    y='Avg Pace (min/mi)',
    color='Activity Type:N'
).interactive()
run_chart

Is there a trend here?
One would expect my speed to decrease at higher mileage.
But in this case we're measuring pace in min/mi rather than speed, which means that lower numbers are actually faster (a 6 minute mile is faster than an 8 minute mile).
So these two variables should be positively correlated -- as distance increases, the number of minutes per mile should also increase.
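Since pace and speed are just reciprocals of each other, here is a quick sketch of the conversion. The helper names `speed_to_pace` and `pace_to_speed` are my own for illustration (they are not part of the export), and the check value is the first row of the table above.

```python
def speed_to_pace(speed_mph):
    """Convert a speed in mi/h to a pace in min/mi."""
    return 60.0 / speed_mph


def pace_to_speed(pace_min_per_mi):
    """Convert a pace in min/mi back to a speed in mi/h."""
    return 60.0 / pace_min_per_mi


# First row of the export: Avg Speed 8.50059 mi/h, Avg Pace 7.05833 min/mi.
pace = speed_to_pace(8.50059)
print(round(pace, 3))

# A faster speed always maps to a lower pace number.
print(speed_to_pace(8.0) < speed_to_pace(6.0))
```

So a positive slope in pace vs distance means the same thing as a negative slope in speed vs distance.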

We can run a quick regression.

In [11]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
X = data[['Distance (mi)']]
y = data['Avg Pace (min/mi)']
lr.fit(X=X, y=y)

In [12]:
lr.coef_, lr.intercept_

(array([0.11266824]), 7.551902861240663)

So this model is a line `y=0.11*x + 7.55`
That means that for each additional mile, my average pace increases by about .11 minutes.

In [13]:
round(.1127 * 60, 1)

6.8

... which is about 7 seconds.

Let's see what predictions that makes for common distances.

In [15]:
import numpy as np
distances = np.array([1, 2, 3, 4, 6, 8, 13.1, 26.2])
# Reshape to be a column.
X_pred = distances.reshape((len(distances), 1))
y_pred = lr.predict(X_pred)
pd.DataFrame({'Distance (mi)': distances, 'Avg Pace (min/mi)': y_pred})



Unnamed: 0,Distance (mi),Avg Pace (min/mi)
0,1.0,7.664571
1,2.0,7.777239
2,3.0,7.889908
3,4.0,8.002576
4,6.0,8.227912
5,8.0,8.453249
6,13.1,9.027857
7,26.2,10.503811


These predictions are pretty good for the 3-4 mile range, but my paces on long runs have been a lot faster than 9 min/mi.
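Decimal minutes are hard to read as a runner, so here is a small sketch that turns the predicted paces into m:ss and a projected finish time. The helpers `format_pace` and `format_duration` are hypothetical names of mine, and the two paces are copied from the half marathon and marathon rows of the prediction table.

```python
def format_pace(pace_min_per_mi):
    """Format a pace given in decimal minutes per mile as m:ss."""
    total_seconds = round(pace_min_per_mi * 60)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


def format_duration(total_minutes):
    """Format a duration given in decimal minutes as h:mm:ss."""
    total_seconds = round(total_minutes * 60)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Predicted paces copied from the table: {distance in miles: pace in min/mi}.
predicted = {13.1: 9.027857, 26.2: 10.503811}
for distance, pace in predicted.items():
    # Finish time is simply distance times pace.
    print(distance, format_pace(pace), format_duration(distance * pace))
```

By this reading the model has me running a half marathon at about 9:02 per mile, which is just under two hours.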

In [16]:
data.loc[data['Distance (mi)'] >= 10, ['Distance (mi)', 'Avg Pace (min/mi)']]

Unnamed: 0,Distance (mi),Avg Pace (min/mi)
23,10.3513,8.73628
89,13.0552,7.74282
97,10.0508,8.42828
99,10.277,8.94037
217,10.2975,8.39976
523,13.1995,7.38536
584,13.3893,7.16739
601,10.0,7.89167


Maybe this model is just not very well-suited to the problem -- the data may be nonlinear.

We can check the R-squared.

In [17]:
lr.score(X, y)

0.04891765474915255

That .... is terrible?
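To make sense of that number, here is a rough sketch of what R-squared measures: one minus the ratio of the squared error around the fitted line to the squared error around the mean. The `xs` and `ys` values are made up for illustration (they are not my runs), and `fit_line` and `r_squared` are hypothetical helpers written in plain Python rather than sklearn.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares fit for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """R^2 = 1 - SS_res / SS_tot for the line y = slope * x + intercept."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up distances (mi) and paces (min/mi), only to exercise the formula.
xs = [1.0, 3.0, 5.0, 6.0, 10.0, 13.0]
ys = [7.9, 8.6, 7.4, 9.1, 8.3, 7.8]

slope, intercept = fit_line(xs, ys)
print(r_squared(xs, ys, slope, intercept))

# A flat line at the mean pace scores exactly 0 by construction.
mean_y = sum(ys) / len(ys)
print(r_squared(xs, ys, 0.0, mean_y))

# With a single feature, R^2 is the squared correlation, so my
# 0.0489 corresponds to a correlation of only about 0.22.
print(round(sqrt(0.04891765474915255), 2))
```

So an R-squared near 0.05 says the line removes only about 5% of the squared error that a flat line at the mean would leave.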

We can overlay a line on our data to see how bad the fit really is visually.

In [18]:
# To draw a line, we need two coordinates to connect.
x_line = np.array([0, 20]).reshape((2, 1))
y_line = lr.predict(x_line)
line_df = pd.DataFrame({'x': x_line.flatten(), 'y': y_line})

line_chart = alt.Chart(line_df).mark_line().encode(
    x='x',
    y='y'
)
line_chart + run_chart



It looks okay, but probably not all that much better than just predicting every run will be the average pace (so a flat, horizontal line).

In [19]:
rule = alt.Chart(data).mark_rule(color='red').encode(
    y='mean(Avg Pace (min/mi)):Q'
)
rule + run_chart
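We can put a number on "not all that much better". Because R-squared is one minus the ratio of the line's squared error to the mean's squared error, the ratio of their RMSEs on this data should be `sqrt(1 - R^2)`. This is a back-of-the-envelope sketch using the score from above, not something I computed from the residuals directly.

```python
from math import sqrt

# The lr.score(X, y) value from earlier in the notebook.
r2 = 0.04891765474915255

# RMSE of the regression line divided by RMSE of the flat mean line.
rmse_ratio = sqrt(1 - r2)

# How much smaller the typical error gets when using the line.
reduction_pct = (1 - rmse_ratio) * 100

print(f"{rmse_ratio:.3f}")
print(f"{reduction_pct:.1f}%")
```

In other words, the regression line shrinks the typical prediction error by only about 2.5% compared with always guessing my average pace.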

What have we learned?
Distance and speed are surprisingly unrelated for me: a linear fit on distance explains only about 5% of the variance in my average pace.