In [1]:
import pandas as pd
data = pd.read_csv('data/runs.csv')

## Modeling speed by distance, indoor/outdoor, and month.
Kind of a weird idea, but I already saw that my speed is hard to predict based on distance alone.
Perhaps incorporating indoor/outdoor and month would help.

In [2]:
target = data['Avg Pace (min/mi)']
features = data.loc[:, ['Distance (mi)', 'Activity Type', 'Workout Date']]

Clean up the features.

In [3]:
features['Indoor'] = (features['Activity Type'] == 'Indoor Run / Jog').astype(int)
features = features.drop(columns='Activity Type')
features['month'] = pd.to_datetime(features['Workout Date']).dt.month_name()
features = features.drop(columns='Workout Date')
features = features.rename(columns={'Distance (mi)': 'distance'})
features.head()

Unnamed: 0,distance,Indoor,month
0,4.45226,0,January
1,4.0,1,January
2,3.0,1,January
3,4.0,1,January
4,4.0,1,January


Month is categorical (intentionally), so we need to one-hot encode.

In [4]:
X = pd.get_dummies(features, drop_first=True)
X.head()

Unnamed: 0,distance,Indoor,month_August,month_December,month_February,month_January,month_July,month_June,month_March,month_May,month_November,month_October,month_September
0,4.45226,0,0,0,0,1,0,0,0,0,0,0,0
1,4.0,1,0,0,0,1,0,0,0,0,0,0,0
2,3.0,1,0,0,0,1,0,0,0,0,0,0,0
3,4.0,1,0,0,0,1,0,0,0,0,0,0,0
4,4.0,1,0,0,0,1,0,0,0,0,0,0,0
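
As a small illustration of what `drop_first=True` does, here is a sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from `runs.csv`). With string categories, `pd.get_dummies` orders the levels alphabetically and drops the first one, which is why there is no `month_April` column above: April acts as the baseline that the other month coefficients are measured against.

```python
import pandas as pd

# Hypothetical toy data, only to show the resulting column layout.
toy = pd.DataFrame({
    'distance': [3.0, 4.0, 5.0],
    'month': ['January', 'April', 'August'],
})

encoded = pd.get_dummies(toy, drop_first=True)
# April sorts first alphabetically, so it is the dropped (baseline) level.
print(list(encoded.columns))
```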


In [5]:
y = target

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Split the training set further so we have a validation set.
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.33)
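
For reference, the two `test_size` values above imply roughly a 50/25/25 split. The arithmetic below is only a sketch of the fractions; the exact row counts depend on how `train_test_split` rounds.

```python
# Fractions of the full data set implied by the two test_size values above.
test_frac = 0.25
validation_frac = (1 - test_frac) * 0.33
train_frac = (1 - test_frac) * (1 - 0.33)

print(f"train={train_frac:.4f}, validation={validation_frac:.4f}, test={test_frac:.4f}")
```

Passing a fixed `random_state` to `train_test_split` would make the split reproducible. As written, each section below re-splits at random, so some of the difference between validation scores across sections may come from the split itself rather than from the features.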

#### Linear Regression

In [7]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Score on the validation data.

In [8]:
from sklearn import metrics
y_pred = lr.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.5474632620879237

So an MSE of about 0.55 on the validation data. The units are squared min/mi, since the target is pace.
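
Since MSE is in squared units, taking the square root puts the error back in min/mi, which is easier to read. A quick sketch using the number reported above:

```python
import math

mse = 0.5474632620879237  # validation MSE reported above, in (min/mi)^2
rmse = math.sqrt(mse)     # back in min/mi
print(f"RMSE: {rmse:.2f} min/mi, about {rmse * 60:.0f} seconds per mile")
```

That works out to roughly 0.74 min/mi, so a typical miss is on the order of three quarters of a minute per mile.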

#### Lasso Regression
Because I have a lot of features but not a lot of rows, I'm thinking lasso regression might help me get rid of columns that don't provide predictive value.

In [9]:
from sklearn.linear_model import LassoCV
# 4-fold cross validation to select a model.
reg = LassoCV(cv=4).fit(X_train, y_train)

In [10]:
# Score it on the validation data.
y_pred = reg.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.49469580377069905

Better!

#### Ridge Regression

In [11]:
from sklearn.linear_model import RidgeCV
# 4-fold CV
rr = RidgeCV(cv=4).fit(X_train, y_train)

In [12]:
# Score it on the validation data.
y_pred = rr.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.46465484201524854

Still a tad better.

Since I value explainability here, simpler is better, so I prefer the Lasso model, which keeps only a few nonzero coefficients.
Let's look deeper at that.

In [15]:
print('Intercept: ', reg.intercept_)
pd.DataFrame({'column': X_train.columns, 'coefficient': reg.coef_})

Intercept:  8.371326523610389


Unnamed: 0,column,coefficient
0,distance,0.000636
1,Indoor,-1.155005
2,month_August,0.126411
3,month_December,-0.0
4,month_February,-0.0
5,month_January,0.0
6,month_July,0.0
7,month_June,-0.0
8,month_March,0.0
9,month_May,-0.0


All the predictive power comes from the distance I ran, whether the run was indoor or outdoor, and whether it was in August.
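
To make the coefficients concrete, here is a sketch that plugs them into the linear formula by hand. `predict_pace` is a hypothetical helper, and it assumes the month coefficients not visible in the output above are also zero.

```python
# Intercept and nonzero coefficients copied from the Lasso output above.
# Assumption: months not shown in that output also have zero coefficients.
intercept = 8.371326523610389
coef = {'distance': 0.000636, 'Indoor': -1.155005, 'month_August': 0.126411}

def predict_pace(distance, indoor, august):
    """Predicted pace in min/mi for one run (hypothetical helper)."""
    return (intercept
            + coef['distance'] * distance
            + coef['Indoor'] * indoor
            + coef['month_August'] * august)

# A 4-mile indoor run outside August vs. a 4-mile outdoor run in August.
print(round(predict_pace(4.0, 1, 0), 2))
print(round(predict_pace(4.0, 0, 1), 2))
```

The distance coefficient is so small that, in this model, nearly all of the gap between the two predictions comes from the indoor flag.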

### Improve the Features
I have a suspicion that part of the problem is the way months are recorded -- the model has no way of knowing that January and February should be similar months, etc.

I can see two approaches.
1. Encode the months as seasons instead. Group Dec-Jan-Feb as Winter, etc.
2. Instead of using months at all, create a feature that describes how far the run was from Jan 1 (in the closer direction). So Dec 25 is 7 days away, Feb 1 is 31 away, etc.

### Seasons

In [16]:
# Map month number to a season index: Dec, Jan, Feb -> 0; Mar-May -> 1; Jun-Aug -> 2; Sep-Nov -> 3.
# Integer division by 3 handles months 1-11; December (12) is special-cased to 0.
features['season'] = pd.to_datetime(data['Workout Date']).dt.month.apply(lambda x: 0 if x == 12 else x // 3)
features['season'] = features.season.map({0: 'Winter', 1: 'Spring', 2: 'Summer', 3: 'Autumn'})
features.groupby(['month', 'season']).count()

month,season,distance,Indoor
April,Spring,11,11
August,Summer,27,27
December,Winter,16,16
February,Winter,34,34
January,Winter,35,35
July,Summer,21,21
June,Summer,15,15
March,Spring,22,22
May,Spring,16,16
November,Autumn,24,24


Seems to be working. Now we can drop the month column and one-hot encode the season column.
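
The month-to-season rule can also be checked for all twelve months without the data. This is a sketch; `to_season` and `expected` are names introduced here for illustration.

```python
def to_season(month):
    """Same rule as the lambda above: December wraps around to winter."""
    return 0 if month == 12 else month // 3

names = {0: 'Winter', 1: 'Spring', 2: 'Summer', 3: 'Autumn'}

expected = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Autumn', 10: 'Autumn', 11: 'Autumn',
}

for month in range(1, 13):
    assert names[to_season(month)] == expected[month]
    # (month % 12) // 3 is an equivalent one-liner without the special case.
    assert to_season(month) == (month % 12) // 3
print('all 12 months map to the expected season')
```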

In [17]:
features = features.drop(columns='month')
X = pd.get_dummies(features, drop_first=True)

In [18]:
X.head()

Unnamed: 0,distance,Indoor,season_Spring,season_Summer,season_Winter
0,4.45226,0,0,0,1
1,4.0,1,0,0,1
2,3.0,1,0,0,1
3,4.0,1,0,0,1
4,4.0,1,0,0,1


Train-test-validation split.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Split the training set further so we have a validation set.
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.33)

#### Linear Regression

In [20]:
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.5383032545608006

About the same.

So maybe my hypothesis that season was more meaningful was actually incorrect.

#### Lasso Regression

In [21]:
# 4-fold cross validation to select a model.
reg = LassoCV(cv=4).fit(X_train, y_train)
# Score it on the validation data.
y_pred = reg.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.5030405200801229

Also bad.

#### Ridge Regression

In [22]:
# 4-fold CV
rr = RidgeCV(cv=4).fit(X_train, y_train)
# Score it on the validation data.
y_pred = rr.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.5331402463705504

All the models are basically the same as before, maybe slightly worse.

In [23]:
print('Intercept: ', reg.intercept_)
pd.DataFrame({'column': X_train.columns, 'coefficient': reg.coef_})

Intercept:  7.971169336050804


Unnamed: 0,column,coefficient
0,distance,0.053724
1,Indoor,-0.930385
2,season_Spring,0.0
3,season_Summer,0.060035
4,season_Winter,-0.0


We can consider that a failure.
Moving on...

### Distance from Jan 1

In [24]:
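# Days to the nearest Jan 1: x-1 days since this year's, 366-x days until the next one (x = day of year, common year).
features['from_jan1'] = pd.to_datetime(data['Workout Date']).dt.dayofyear.apply(lambda x: min(x-1, 366-x))
# Sanity check: every run more than 150 days from Jan 1 should fall in the summer.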
features.loc[features.from_jan1 > 150, 'season'].value_counts()

Summer    37
Name: season, dtype: int64

In [25]:
features.head()

Unnamed: 0,distance,Indoor,season,from_jan1
0,4.45226,0,Winter,24
1,4.0,1,Winter,22
2,3.0,1,Winter,20
3,4.0,1,Winter,19
4,4.0,1,Winter,16


Looks good. We can drop season and get right to work -- no need for one-hot encoding since none of our variables are categorical.
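
The formula can be spot-checked against the examples given earlier (Dec 25 is 7 days away, Feb 1 is 31). This sketch uses the standard library rather than pandas; `from_jan1` here is a standalone helper applying the same arithmetic.

```python
from datetime import date

def from_jan1(d):
    """Same formula as above: distance to the nearer Jan 1, from day-of-year."""
    x = d.timetuple().tm_yday
    return min(x - 1, 366 - x)

print(from_jan1(date(2019, 12, 25)))  # common year
print(from_jan1(date(2019, 2, 1)))
print(from_jan1(date(2020, 12, 25)))  # leap year
```

In a leap year the late-December values come out one day short, because the year has 366 days. That is probably too small a difference to matter for the model.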

In [26]:
X = features.drop(columns='season')

Train-test-validation split.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Split the training set further so we have a validation set.
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.33)

#### Linear Regression

In [28]:
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.8612228811818462

#### Lasso Regression

In [29]:
# 4-fold cross validation to select a model.
reg = LassoCV(cv=4).fit(X_train, y_train)
# Score it on the validation data.
y_pred = reg.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.8625991013948958

#### Ridge Regression

In [30]:
# 4-fold CV
rr = RidgeCV(cv=4).fit(X_train, y_train)
# Score it on the validation data.
y_pred = rr.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.8597866131369687

All worse than before!

In [31]:
print('Intercept: ', reg.intercept_)
pd.DataFrame({'column': X_train.columns, 'coefficient': reg.coef_})

Intercept:  8.028208203826615


Unnamed: 0,column,coefficient
0,distance,0.050006
1,Indoor,-1.073303
2,from_jan1,0.000766


### Time Before April 1

One last idea... a lot of my running revolves around doing the Holy Half, which is usually around early April.
If I just counted how many days until the next April 1, would that be even more predictive?
I wonder if that's what the Jan 1 distance is proxying in the above model.

In [32]:
pd.to_datetime('2020-04-01').dayofyear

92

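So April 1 is the 92nd day of the year in a leap year like 2020 (in a common year it is the 91st).
We'll need that; using 92 throughout means the count can be off by a day for runs in non-leap years.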

In [33]:
features['until_apr1'] = pd.to_datetime(data['Workout Date']).dt.dayofyear.apply(lambda x: (92 - x) % 365)
features['date'] = data['Workout Date']
features.head(200)

Unnamed: 0,distance,Indoor,season,from_jan1,until_apr1,date
0,4.45226,0,Winter,24,67,"Jan. 25, 2020"
1,4.00000,1,Winter,22,69,"Jan. 23, 2020"
2,3.00000,1,Winter,20,71,"Jan. 21, 2020"
3,4.00000,1,Winter,19,72,"Jan. 20, 2020"
4,4.00000,1,Winter,16,75,"Jan. 17, 2020"
...,...,...,...,...,...,...
195,3.00000,1,Autumn,61,152,"Nov. 1, 2017"
196,4.00000,1,Autumn,62,153,"Oct. 31, 2017"
197,4.58005,0,Autumn,72,163,"Oct. 21, 2017"
198,4.03568,0,Autumn,86,177,"Oct. 7, 2017"


This looks right to me.
Drop date, season, and from_jan1.
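
As a cross-check, the day-of-year formula can be compared with an exact calendar count. This is a sketch only: `until_apr1_exact` is a hypothetical alternative, not what the notebook uses, and both helpers are standalone functions written for this comparison.

```python
from datetime import date

def until_apr1_formula(d):
    """The notebook's formula: day-of-year arithmetic with April 1 fixed at day 92."""
    return (92 - d.timetuple().tm_yday) % 365

def until_apr1_exact(d):
    """Hypothetical alternative: count calendar days to the next April 1."""
    target = date(d.year, 4, 1)
    if d > target:
        target = date(d.year + 1, 4, 1)
    return (target - d).days

for d in (date(2020, 1, 25), date(2017, 11, 1)):
    print(d, until_apr1_formula(d), until_apr1_exact(d))
```

The two agree for the 2020 run and differ by one day for the 2017 run, which is small next to the 0 to 364 range of the feature.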

In [34]:
X = features.drop(columns=['from_jan1', 'season', 'date'])
X

Unnamed: 0,distance,Indoor,until_apr1
0,4.45226,0,67
1,4.00000,1,69
2,3.00000,1,71
3,4.00000,1,72
4,4.00000,1,75
...,...,...,...
257,6.11891,0,3
258,3.80420,0,7
259,5.61139,0,25
260,6.35580,0,33


Train-test-validation split.

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Split the training set further so we have a validation set.
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.33)

#### Linear Regression

In [36]:
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.40337469046324526

Best yet!

#### Lasso Regression

In [37]:
# 4-fold cross validation to select a model.
reg = LassoCV(cv=4).fit(X_train, y_train)
# Score it on the validation data.
y_pred = reg.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.3941794447587229

#### Ridge Regression

In [38]:
# 4-fold CV
rr = RidgeCV(cv=4).fit(X_train, y_train)
# Score it on the validation data.
y_pred = rr.predict(X_validation)
metrics.mean_squared_error(y_true=y_validation, y_pred=y_pred)

0.402928223007173

#### Coefficients

In [39]:
print('Intercept: ', reg.intercept_)
pd.DataFrame({'column': X_train.columns, 'coefficient': reg.coef_})

Intercept:  8.205437195280142


Unnamed: 0,column,coefficient
0,distance,0.032985
1,Indoor,-1.217544
2,until_apr1,6.9e-05


#### Test Data

In [48]:
y_pred = reg.predict(X_test)
metrics.mean_squared_error(y_true=y_test, y_pred=y_pred)

0.30899467624489113

Surprisingly excellent score on the test data.
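
To compare the runs on a more readable scale, the Lasso MSEs reported above can be converted to RMSE in min/mi and seconds per mile. The numbers are copied from the outputs; each validation figure comes from a different random split, so small differences should be read loosely.

```python
import math

# Lasso MSEs copied from the outputs above, in (min/mi)^2.
lasso_mse = {
    'months (validation)': 0.49469580377069905,
    'seasons (validation)': 0.5030405200801229,
    'from_jan1 (validation)': 0.8625991013948958,
    'until_apr1 (validation)': 0.3941794447587229,
    'until_apr1 (test)': 0.30899467624489113,
}

for name, mse in lasso_mse.items():
    rmse = math.sqrt(mse)
    print(f"{name:<24} RMSE {rmse:.2f} min/mi ({rmse * 60:.0f} s/mi)")
```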

## Conclusion

The best model I got used distance, indoor/outdoor, and the days until the next April 1.
It said my "base" pace was 8.2 minutes/mile, but:
- Running indoors cut about 1.2 minutes per mile off my time.
- Running an additional mile added about 0.03 min/mile to my time (about 2 seconds).
- The more days until the next April 1, the slower I ran -- very slightly.
  At 90 days away that adds about 0.006 min/mile (0.37 seconds), at 180 days about 0.012 min/mile (0.75 seconds), and at 365 days about 0.025 min/mile (1.5 seconds).
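
The conversions to seconds are just the Lasso coefficients multiplied by 60. A short sketch of that arithmetic, with the values copied from the coefficient table above:

```python
# Lasso intercept and coefficients copied from the output above.
intercept = 8.205437195280142
coef = {'distance': 0.032985, 'Indoor': -1.217544, 'until_apr1': 6.9e-05}

# Effect of one extra mile, in seconds per mile.
print(f"extra mile: {coef['distance'] * 60:.1f} s/mi")

# Effect of being N days before the next April 1, in seconds per mile.
for days in (90, 180, 365):
    print(f"{days} days out: {coef['until_apr1'] * days * 60:.2f} s/mi")
```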

### Other Thoughts

I could probably have improved this model by removing outliers or doing more intense engineering on the features.
Regarding the outliers, looking back I somewhat regret not taking out my actual half marathons, which are both much longer and relatively much faster than my usual runs.

Doing more feature engineering would have helped make a more predictive model, but I think it would also have made the model less interpretable and might have led me to accidentally introduce leakage. Adding heart rate, for example, would help, but it is information I could not have until after the run.

Overall pretty interesting.