# Strava Moving Time estimator with scikit-learn

In [None]:
import numpy
import pandas
import matplotlib.pyplot as plt
import scipy.stats
from sklearn.linear_model import LinearRegression # Regression Model
from sklearn.model_selection import train_test_split # to split train and test sets
plt.style.use("bmh")
%config InlineBackend.figure_formats=["png"]

Let's assume we have already downloaded the activities.csv file from our Strava profile.

In [None]:
activities = pandas.read_csv("activities.csv")
print("dataset type is:", type(activities), "length:", len(activities), "shape:", activities.shape)

The dataset is rich: it includes 86 columns for each route.
Not all the columns contain usable data, because I only have a free Strava account rather than a subscription.

In [None]:
print("columns: ", len(list(activities.columns)))

In [None]:
activities.info()

In [None]:
list(activities.columns)

Not all the columns are helpful, so let's keep only those that are useful for training a model.

After a quick analysis of the available features, only the following will be used to train the model:
- Distance
- Elevation gain
- Max Grade
- Average Grade

The target label is the 'Moving Time'.

Our task is to train a model that predicts the expected 'Moving Time' given the Distance, the Elevation Gain, the Max Grade and the Average Grade.

This information can be retrieved from Rouvy, the well-known virtual cycling app.

<img src="rouvy.png" alt="Rouvy data for a route" />

or from Google Maps

<img src="google-maps.png" alt="Google Maps data for a route" />

## Dropping 'walk' activities

In [None]:
activities = activities.drop(activities[activities['Activity Type'] == 'Walk'].index)

In [None]:
activities.info()

## Dropping short routes

Sometimes very short routes get recorded, for example the 'Warm up' and 'Cool down' routes in Rouvy. They are not useful for our purposes, so let's remove all the routes whose 'Moving Time' is less than 3 minutes (180 seconds).

In [None]:
activities = activities.drop(activities[activities['Moving Time'] < 180].index)
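The two filtering steps above can also be written as a single boolean mask. The following is only a minimal sketch on a hypothetical toy dataframe (the values are invented; only the column names mirror the Strava export), so that the effect of each condition is easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the Strava export.
toy = pd.DataFrame({
    "Activity Type": ["Ride", "Walk", "Virtual Ride", "Ride"],
    "Moving Time": [3600, 1200, 120, 2400],  # seconds
})

# Keep rows that are not walks and that last at least 180 seconds.
keep = (toy["Activity Type"] != "Walk") & (toy["Moving Time"] >= 180)
filtered = toy[keep].reset_index(drop=True)
print(filtered)
```

Selecting with a mask keeps the rows we want, whereas `drop` removes the rows we don't want by index; both approaches give the same result here.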

In [None]:
activities.describe()

Let's keep only the columns we need from the original activities dataset: the four features, the 'Moving Time' target, and 'Max Speed'.

In [None]:
activities = activities[["Distance", "Elevation Gain", "Max Grade", "Average Grade", "Moving Time", "Max Speed"]]

In [None]:
miles_per_km = 0.621371192  # 1 km = 0.621371192 miles
activities['Max Speed'] = activities['Max Speed'] / miles_per_km  # convert mph to km/h

In [None]:
activities.info()

Let's take a look at the data.

In [None]:
activities.head()

In [None]:
activities.describe()

We can check for 'null' values and remove the affected rows if any are present.

In [None]:
if activities.isnull().values.any():
    print("removing null values ...")
    activities=activities.dropna()

In [None]:
activities.info()

Now we have removed the four rows containing 'null' or 'NaN' values from the dataframe.
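Before dropping rows, it can be useful to see which columns actually contain the missing values. Here is a small sketch on a hypothetical toy dataframe (invented values, not taken from activities.csv) that counts the nulls per column and then drops the incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values.
toy = pd.DataFrame({
    "Distance": [10.0, np.nan, 25.0],
    "Max Grade": [5.0, 7.0, np.nan],
    "Moving Time": [1800, 2400, 4000],
})

# Number of missing values in each column.
nulls_per_column = toy.isnull().sum()
print(nulls_per_column)

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned), "rows")
```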

## Visualization

We can obtain a first impression of the dependency between variables by examining a multidimensional scatterplot.

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(activities, diagonal="kde", figsize=(12,10));

As expected, we can see a linear relationship between the Moving Time and the Distance.

In [None]:
activities.plot(kind="scatter", x='Distance', y='Moving Time', grid=True)

There is also an approximately linear relationship between Elevation Gain and Distance: the more kilometres ridden, the larger the overall gain in altitude.

In [None]:
activities.plot(kind="scatter", x='Distance', y='Elevation Gain', grid=True)

We can also generate a 3D plot of the observations, which can sometimes help to interpret the data more easily. Here we plot 'Moving Time' as a function of 'Distance' and 'Elevation Gain'.

In [None]:
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(projection="3d")
ax.scatter(activities["Distance"], activities["Elevation Gain"], activities["Moving Time"])
ax.set_xlabel("Distance")
ax.set_ylabel("Elevation Gain")
ax.set_zlabel("Moving Time")
ax.set_facecolor("white")

## Looking for correlation

You can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the 'corr()' method.

In [None]:
corr_matrix= activities.corr()

In [None]:
corr_matrix["Distance"].sort_values(ascending=False)

As expected, 'Distance' is strongly correlated with 'Moving Time' (0.87) and somewhat less with 'Elevation Gain' (0.59).
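To see what the correlation coefficient measures, here is a minimal sketch that computes Pearson's r by hand and compares it with NumPy's `corrcoef`. The two arrays are hypothetical values chosen for illustration, not data from activities.csv.

```python
import numpy as np

# Hypothetical distances (km) and moving times (s), not taken from activities.csv.
distance = np.array([5.0, 10.0, 20.0, 30.0, 40.0])
moving_time = np.array([900.0, 1900.0, 3500.0, 5600.0, 7100.0])

# Pearson's r: sum of products of the deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
dx = distance - distance.mean()
dy = moving_time - moving_time.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns the full 2x2 correlation matrix; [0, 1] is the off-diagonal entry.
r_numpy = np.corrcoef(distance, moving_time)[0, 1]
print(r_manual, r_numpy)
```

A value close to 1 means the points lie almost on a rising straight line; Pearson's r only captures linear dependence, so a low value does not rule out a non-linear relationship.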

## Getting the 'labels' and 'features'

In [None]:
labels=activities.pop("Moving Time")

In [None]:
print("label shape:", labels.shape, "and type:", type(labels))

In [None]:
print("features shape:", activities.shape, "and type:", type(activities))

## Split training and test set

In [None]:
X = activities.values  # values converts it into a numpy array
Y = labels.values

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

In [None]:
X_train.shape

In [None]:
X_test.shape
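Called without extra arguments, `train_test_split` shuffles the data randomly and reserves 25% of the samples for the test set, so every run produces a different split. If reproducible results are wanted, one option is to fix `random_state`. The sketch below uses hypothetical arrays (`X_demo`, `y_demo`) and an arbitrary seed of 42.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with 2 features each.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# test_size=0.25 matches the default; random_state makes the shuffle repeatable.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(Xa_train.shape, Xa_test.shape)
print("same split:", np.array_equal(Xa_test, Xb_test))
```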

## Model Training

We will fit a linear model using the `LinearRegression` estimator of the scikit-learn library.

In [None]:
linear_model = LinearRegression()
linear_model.fit(X_train, Y_train) 

### View Parameters 
The $\mathbf{w}$ and $b$ parameters are referred to as 'coefficients' (`coef_`) and 'intercept' (`intercept_`) in scikit-learn. In other terms, the model function can be written as $f_{\mathbf{w},b}(\vec{x}) = \mathbf{w} \cdot \vec{x} + b$

In [None]:
b = linear_model.intercept_
w = linear_model.coef_
print(f"w = {w}, b = {b:0.2f}")
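A linear model predicts by taking the dot product of the coefficients with the features and adding the intercept. As a quick sanity check, the sketch below fits a model on hypothetical noise-free data generated from known parameters (2, 3 and an intercept of 5, all invented for this example) and verifies that computing $\mathbf{w} \cdot \vec{x} + b$ by hand gives the same values as `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from known parameters: y = 2*x1 + 3*x2 + 5
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(30, 2))
y_demo = X_demo @ np.array([2.0, 3.0]) + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: dot product with the coefficients plus the intercept.
manual = X_demo @ demo_model.coef_ + demo_model.intercept_

print("coefficients:", demo_model.coef_, "intercept:", demo_model.intercept_)
print("manual == predict:", np.allclose(manual, demo_model.predict(X_demo)))
```

Because the toy data contain no noise, the fitted coefficients recover the generating parameters up to floating-point error. On real rides they are only least-squares estimates.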

Let's try the model on a few samples from the test set.

In [None]:
some_data=X_test[5:10,:]
some_labels=Y_test[5:10]

In [None]:
some_labels_predicted = linear_model.predict(some_data)

In [None]:
print("Predictions (secs):", some_labels_predicted)

In [None]:
print("Labels (secs):", some_labels)

Let’s measure this regression model’s RMSE on the five test samples above using scikit-learn’s `mean_squared_error` function.

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
lin_mse = mean_squared_error(some_labels, some_labels_predicted)
lin_rmse = numpy.sqrt(lin_mse)
lin_rmse

The error is rather high, so something is not working as expected: the linear-model assumption does not seem to hold.
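The RMSE is the square root of the mean squared error, so it is expressed in the same unit as the target (seconds here) and is best judged against the typical moving time of a ride. The sketch below computes it by hand with NumPy on hypothetical labels and predictions invented for this example.

```python
import numpy as np

# Hypothetical true and predicted moving times in seconds.
y_true = np.array([3600.0, 1800.0, 5400.0, 2700.0])
y_pred = np.array([3500.0, 2000.0, 5000.0, 2800.0])

# Mean of the squared errors, then the square root.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# Relating the RMSE to the mean target gives a rough sense of scale.
print(f"RMSE = {rmse:.1f} s, mean moving time = {y_true.mean():.1f} s")
```

Note that an RMSE computed on only a handful of samples is a noisy estimate; using the whole test set gives a more stable figure.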

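## Calculate the $R^2$ score

You can evaluate this model by calling the `score` method. For a regressor, `score` returns the coefficient of determination $R^2$, not a classification accuracy; the best possible value is 1.0.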

In [None]:
print("R^2 on training set:", linear_model.score(X_train, Y_train))

The score on the training set (the $R^2$ value returned by `score`) is acceptable but not high, which suggests that the model is underfitting.

In [None]:
print("R^2 on test set:", linear_model.score(X_test, Y_test))
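For a scikit-learn regressor, `score` returns the coefficient of determination $R^2 = 1 - SS_{res} / SS_{tot}$, that is, one minus the ratio between the squared error of the model and the squared error of a baseline that always predicts the mean. The following sketch computes it by hand on hypothetical labels and predictions invented for this example.

```python
import numpy as np

# Hypothetical true and predicted moving times in seconds.
y_true = np.array([3600.0, 1800.0, 5400.0, 2700.0])
y_pred = np.array([3500.0, 2000.0, 5000.0, 2800.0])

# Residual sum of squares: error of the model.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot
print(f"R^2 = {r2:.4f}")
```

A value of 1.0 means perfect predictions, 0.0 means the model is no better than predicting the mean, and negative values are possible when it is worse than that baseline.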