# Strava, Rouvy and Machine Learning

### How to predict 'moving time' on a route with Scikit-Learn
<center><img src="strava.png" alt="Strava" /><img src="rouvy.png" alt="Rouvy" /></center>

I've had a lot of fun riding my bike over the last few years. Unfortunately, the Covid pandemic greatly limited the opportunities for outdoor rides, and the winters are harsh in my area. So I subscribed to a nice app called Rouvy (rouvy.com) and bought an indoor trainer to pedal at home.

It was a fantastic experience that continues to this day. Early every morning I can train by pedaling anywhere in the world, tackling the steepest and most legendary climbs.

I connected Rouvy to my (free) Strava account so that every route I ride under Rouvy is automatically saved to Strava. After 3 years the end result is that I have more than 500 routes (indoor and outdoor) saved in my Strava account. But how does Machine Learning come into play?

When I want to choose the next route to ride, I often have only a rough idea of how long it will take. A prediction of the 'Moving Time' would help me schedule my time better. So I decided to use the data available in Strava to train some Machine Learning models and predict the 'Moving Time' from a few parameters of the route (distance, elevation gain, max grade, average grade...). These data are available 'a priori' in the Rouvy profile of the route.
The notebook, the data and all the pictures are available in my GitHub repository: https://github.com/fabioantonini/strava-moving-time-regressor.

## Outline
Here are the topics we are going to cover.

- ### Retrieving data from Strava
- ### Data Exploration
- ### Data Cleaning
- ### Selecting Features and Labels
- ### Outliers
- ### Save the cleaned data
- ### Data Visualization
- ### Looking for correlation
- ### Avoiding sampling bias
- ### Splitting Training and Test sets
- ### Linear Regression Model
- ### Decision Tree Model
- ### Random Forest Model
- ### Challenge Gunsan-Saemangeum 2022 prediction
- ### Fine-Tune Your Model
- ### Conclusions

## Retrieving data from Strava

The route data can be exported from the 'My Account' page of the Strava website.

<center><img src="myaccount.png" alt="My Account from Strava web page" /></center>

Search for 'Download or Delete Your Account' and click on 'Get Started'.

<img src="export.png" alt="Strava archive export page" />

Click on the 'Request Your Archive' button. As explained on the page, Strava will email you a link to download a zip file containing the data of your activities. Be prepared to wait: Strava takes its time to build the archive, so the email might arrive after some hours.
Once you receive it, download the zip file (export_31174850.zip, for instance).
Let's take a look at its content.

<img src="zipfile.png" alt="Content of the exported zip file" />

For our purposes only the 'activities.csv' file is required. From its size we can tell that it contains a lot of data. My own 'activities.csv' file has been added to the repo and will be processed next. Let's import it.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats
from sklearn.linear_model import LinearRegression # Regression Model
from sklearn.model_selection import train_test_split # to split train and test sets
plt.style.use("bmh")
%config InlineBackend.figure_formats=["png"]
random_state = 42

In [None]:
activities = pd.read_csv("activities.csv")
print("dataset type is:", type(activities), "length:", len(activities), "shape:", activities.shape)

## Data exploration

The dataset is made of 855 activities (rows), but unfortunately not all of them are bike rides.

Each route (row) has 86 columns. Not all the columns contain usable data (many NaN or 'null' values are present) because I don't have a full Strava subscription, only a free account.

Let's take a closer look to understand which activities are really useful for our purpose.

In [None]:
print("columns: ", len(list(activities.columns)))

In [None]:
activities.head()

In [None]:
activities.describe()

We need to extract only the columns that are really useful to train a model.
The data appear to be a bit sparse. Some columns are not populated at all (because of my free account), and other columns have many 'null' values. We need to keep only the statistically helpful features that are available for each route before riding the route itself.

Let's clean the data.

## Data cleaning

In this section the data are cleaned and filtered to keep only the routes done by bike (outdoor and indoor).

### Getting only activities done by bike

We definitely need to keep only the activities done by bike. They are labeled as 'Ride' and 'Virtual Ride' in the Strava exported dataset, so we will drop the activities tagged as 'Walk' and 'Run'.

In [None]:
activities=activities.loc[activities['Activity Type'].isin(['Ride', 'Virtual Ride'])]

In [None]:
activities.describe()

You can notice that the number of rows decreased a bit. Let's go further.

In [None]:
# as an alternative you can remove the 'Walk' and 'Run' activities
# activities = activities.drop(activities[activities['Activity Type'] == 'Walk'].index)
# activities = activities.drop(activities[activities['Activity Type'] == 'Run'].index)

### Removing short routes

When you ride with Rouvy you can optionally add a 'Warm up' before and a 'Cool down' after the selected route. These short routes are also recorded in Strava, but they are not useful for our purposes. So let's remove all the routes whose 'Moving Time' is less than 3 minutes (180 secs).

In [None]:
activities = activities.drop(activities[activities['Moving Time'] < 180].index)

In [None]:
activities.describe()

The count number has decreased again.

### Handling fake data

Inspecting the original dataframe we can see that there are some bad data. For instance, it's hard to believe that the 'Max Grade' is 50%. And if the 'Distance' is 0 Km, the route is a 'fake' or the data are corrupted. So these routes can be removed.
Let's clean the data by setting an upper threshold of 25% for the 'Max Grade' and a lower threshold of 3 Km's for the 'Distance'.

In [None]:
activities = activities.drop(activities[activities['Max Grade'] > 25].index)

In [None]:
activities = activities.drop(activities[activities['Distance'] < 3].index)

In [None]:
activities.describe()

Now we have 533 bike routes (Indoor and Outdoor). Let's take a look.

In [None]:
activities.head()

Now the data looks better. It's time to get the features really helpful to train our models.
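
The filters applied so far can also be written as a single boolean mask. The following is only a sketch on a made-up dataframe: the helper name `keep_plausible_rides` and the constant names are hypothetical, while the column names and thresholds are the ones used above.

```python
import pandas as pd

# Thresholds taken from the cleaning steps above; all names here are hypothetical.
MIN_MOVING_TIME_S = 180   # shorter activities are warm-ups / cool-downs
MAX_GRADE_PCT = 25        # steeper values are treated as corrupted data
MIN_DISTANCE_KM = 3       # shorter routes are treated as fake


def keep_plausible_rides(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the bike rides that pass the three sanity thresholds."""
    is_ride = df["Activity Type"].isin(["Ride", "Virtual Ride"])
    # The masks are written as ~(bad condition) so that rows with a missing
    # value are kept, as the drop()-based cells do.
    long_enough = ~(df["Moving Time"] < MIN_MOVING_TIME_S)
    sane_grade = ~(df["Max Grade"] > MAX_GRADE_PCT)
    far_enough = ~(df["Distance"] < MIN_DISTANCE_KM)
    return df[is_ride & long_enough & sane_grade & far_enough]


# Made-up rows, one per rule
toy = pd.DataFrame({
    "Activity Type": ["Ride", "Run", "Virtual Ride", "Virtual Ride", "Ride"],
    "Moving Time":   [3600, 1800, 120, 2400, 3000],
    "Max Grade":     [8.0, 2.0, 1.0, 50.0, None],
    "Distance":      [25.0, 5.0, 1.0, 15.0, 20.0],
})
cleaned = keep_plausible_rides(toy)
print(cleaned)
```

Only the first and the last toy rows survive. The last one has a missing 'Max Grade', which is left for the null handling step.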

## Handling 'Null' values

We can check the 'null' values.

In [None]:
null_rows_idx = activities.isnull().any(axis=1)
activities.loc[null_rows_idx].head()

We have a couple of policies to handle the 'Null' values.

We can use 'imputation' to set each NaN to the median value of that feature. A cruder approach is to remove the rows with at least one NaN or null value, but in this case we would lose some data.

Let's use imputation in order to save the three rows with NaN values.

#### Imputation

In [None]:
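# Fill the missing values of each feature with that feature's median.
# Assigning the result back avoids the chained inplace fillna,
# which recent pandas versions warn about.
median = activities["Elevation Gain"].median()
activities["Elevation Gain"] = activities["Elevation Gain"].fillna(median)

median = activities["Max Grade"].median()
activities["Max Grade"] = activities["Max Grade"].fillna(median)

activities.loc[null_rows_idx].head()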
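
As an alternative sketch (not what the notebook does), Scikit-Learn's `SimpleImputer` can compute and apply the medians for several columns at once. The toy values below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values with one gap per column
toy = pd.DataFrame({
    "Elevation Gain": [100.0, np.nan, 300.0, 500.0],
    "Max Grade":      [4.0, 6.0, np.nan, 10.0],
})

imputer = SimpleImputer(strategy="median")
# fit_transform returns a NumPy array, so we rebuild the dataframe around it
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(imputer.statistics_)  # the per-column medians learned by the imputer
print(filled)
```

A possible advantage of this approach is that the fitted imputer remembers the medians, so the same values can be reused later on new routes.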

#### Dropping NaN

You can remove the rows with at least one 'null' value.

In [None]:
#if activities.isnull().values.any():
#    activities=activities.dropna()

In [None]:
activities.describe()

We have 533 records (rows) that can be used to train our models.

## Selecting features and labels

After a quick analysis of the available features, only the following features will be used to train the model:
- Distance
- Elevation Gain
- Max Grade
- Average Grade

The target label is the 'Moving Time'.

- We certainly expect some dependency between the 'Distance' and the 'Moving Time' (the longer the route, the longer it takes to complete it).
- The higher the 'Elevation Gain' of the route (even if the 'Distance' is small), the longer it takes to complete it.
- The 'Max Grade' is also related to the 'Moving Time': a short route with a small 'Elevation Gain' can take a long time to complete if there are a few Km's with a steep climb.
- The dataset includes outdoor routes (recorded live) and indoor routes (recorded by Rouvy). For the outdoor routes the 'Average Grade' is close to '0' because I ride back home every time. For indoor routes the 'Average Grade' can be greater than '0', as in the picture below.

All this information can be retrieved from Rouvy for every available route.
Please note that in the picture the Elevation Gain is mapped to the 'Ascended' item.

In the prediction examples later on we will get the route data from Rouvy and use them to predict the 'Moving Time'. These additional routes are not included in the original dataset.

<center><img src="rouvy-route.png" alt="Rouvy data for a route" /></center>

In [None]:
activities = activities[["Distance", "Elevation Gain", "Max Grade", "Average Grade", "Moving Time"]]
activities.describe()

## Outliers

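Now let's drop some outliers. Scikit-Learn's IsolationForest labels each row as an inlier (1) or an outlier (-1), and we keep only the inliers: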

In [None]:
from sklearn.ensemble import IsolationForest

isolation_forest = IsolationForest(random_state=42)
outlier_pred = isolation_forest.fit_predict(activities)

In [None]:
outlier_pred

In [None]:
activities = activities.iloc[outlier_pred == 1]
activities.describe()
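
To make the meaning of the `fit_predict` output more concrete, here is a small self-contained sketch on synthetic data (the numbers are invented, not taken from my Strava export). It only shows the mechanics: the returned array contains +1 for inliers and -1 for outliers, and a boolean mask keeps the inliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 synthetic "routes": distance (km) and moving time (s) at roughly 25 km/h
distance = rng.uniform(5, 60, size=200)
moving_time = distance / 25 * 3600 + rng.normal(0, 120, size=200)
X = np.column_stack([distance, moving_time])

forest = IsolationForest(random_state=42)
labels = forest.fit_predict(X)   # +1 = inlier, -1 = outlier

X_inliers = X[labels == 1]       # same idea as activities.iloc[outlier_pred == 1]
print("kept", len(X_inliers), "of", len(X), "rows")
```

How many rows are flagged depends on the `contamination` parameter, left at its default here.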

## Save the cleaned data

Now the dataset has been cleaned and filtered. We will develop some models using this data.
The dataframe can be stored to the filesystem.

In [None]:
activities.to_csv('cleaned_activities.csv')

Let's take a look at the dataframe after the last processing step.

In [None]:
activities.head()

## Visualization

We can obtain a first impression of the dependency between variables by examining a multidimensional scatterplot.

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(activities, diagonal="kde", figsize=(12,10));

As expected we can see a linear relationship between the Moving Time and the Distance.

In [None]:
activities.plot(kind="scatter", x='Distance', y='Moving Time', grid=True)

There is also an approximately linear relationship between the 'Elevation Gain' and the 'Distance': the more Km's, the larger the overall gain in altitude.

In [None]:
activities.plot(kind="scatter", x='Distance', y='Elevation Gain', grid=True)

We can also generate a 3D plot of the observations, which can sometimes help to interpret the data more easily. Here we plot 'Moving Time' as a function of 'Distance' and 'Elevation Gain'.

In [None]:
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(projection="3d")
ax.scatter(activities["Distance"], activities["Elevation Gain"], activities["Moving Time"])
ax.set_xlabel("Distance")
ax.set_ylabel("Elevation Gain")
ax.set_zlabel("Moving Time")
ax.set_facecolor("white")

In [None]:
%matplotlib inline
activities.hist(bins=50, figsize=(20,15))
plt.show()

## Looking for correlation

You can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the 'corr()' method.

In [None]:
corr_matrix= activities.corr()

In [None]:
corr_matrix["Moving Time"].sort_values(ascending=False)
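
For reference, Pearson's r between two variables is their covariance divided by the product of their standard deviations. The sketch below computes it by hand on four made-up (distance, moving time) pairs and compares it with NumPy's `corrcoef`. The function name `pearson_r` is just an illustrative helper.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


# Made-up sample: distances in km and moving times in seconds
distance_km = [10, 20, 30, 40]
moving_time_s = [1500, 3100, 4400, 6200]

r = pearson_r(distance_km, moving_time_s)
print(r, np.corrcoef(distance_km, moving_time_s)[0, 1])
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate no linear relationship (a nonlinear one may still exist).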

The 'Moving Time' is strongly correlated with the 'Distance' and also with the 'Elevation Gain'. This is absolutely expected.

We can notice that the 'Max Grade' is only weakly correlated with the 'Moving Time'.
The 'Average Grade' shows practically no correlation, so we decide to remove it.

In [None]:
activities.drop('Average Grade', inplace=True, axis=1)
activities.describe()

## Avoiding sampling bias

Before splitting the dataset into a Training and a Test set we need to face the problem of 'Sampling bias'. Usually we can use a random sampling approach. This is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing a significant sampling bias. In our case the dataset is quite small, so the risk of sampling bias is high. Stratified sampling is a way to reduce it.

From the previous histograms we can notice that most 'Distance' values are clustered around 10 to 15 Km's, but some go far beyond 70 Km's. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum’s importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The following code uses the pd.cut() function to create a distance category attribute with 4 categories (labeled from 1 to 4): category 1 ranges from 0 to 20 Km's, category 2 from 20 to 40 Km's, and so on:

In [None]:
activities["Distance_cat"] = pd.cut(activities["Distance"],
 bins=[0, 20, 40, 60, np.inf],
 labels=[1, 2, 3, 4])

In [None]:
activities["Distance_cat"].hist()
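
A quick way to check how `pd.cut` treats the bin edges is to run it on a few hand-picked distances. In this sketch (made-up values), note that by default the bins are closed on the right, so exactly 20 Km falls into category 1, and a value of 0 would fall outside the first bin.

```python
import numpy as np
import pandas as pd

bins = [0, 20, 40, 60, np.inf]
labels = [1, 2, 3, 4]

# Hand-picked distances, including one exactly on a bin edge
distances = pd.Series([5.0, 20.0, 20.1, 45.0, 75.0])
categories = pd.cut(distances, bins=bins, labels=labels)
print(categories.tolist())   # [1, 1, 2, 3, 4]

# The left edge of the first bin is excluded, so 0 gets no category (NaN)
zero_cat = pd.cut(pd.Series([0.0]), bins=bins, labels=labels)
print(zero_cat.isna().all())
```

In the notebook this edge case does not occur, because routes shorter than 3 Km's were already removed.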

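Now you are ready to do stratified sampling based on the distance category. For this
you can use Scikit-Learn’s StratifiedShuffleSplit class. First, let's reset the dataframe index, which has gaps after the rows removed above:

In [None]:
# drop=True discards the old index instead of keeping it as an 'index' column
activities = activities.reset_index(drop=True)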

In [None]:
activities.head()

In [None]:
activities.index

## Splitting Training and Test sets

Now it's time to get our Training and Test sets.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=random_state)
strat_splits = []
for train_index, test_index in splitter.split(activities, activities["Distance_cat"]):
    strat_train_set_n = activities.iloc[train_index]
    strat_test_set_n = activities.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])

In [None]:
print(len(strat_splits))
strat_train_set, strat_test_set = strat_splits[0]

In [None]:
strat_train_set.describe()

In [None]:
strat_test_set.describe()

It's much shorter to get a single stratified split:

In [None]:
strat_train_set, strat_test_set = train_test_split(
    activities, test_size=0.2, stratify=activities["Distance_cat"], random_state=random_state)
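
What `stratify` does can be seen on a tiny synthetic example: with 80 rows in category 1 and 20 rows in category 2, a 20% test split keeps the same 80/20 proportion. The dataframe below is invented for the purpose of the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 made-up routes: 80 short ones (category 1) and 20 longer ones (category 2)
toy = pd.DataFrame({
    "Distance": np.r_[np.linspace(5, 19, 80), np.linspace(21, 39, 20)],
    "Distance_cat": [1] * 80 + [2] * 20,
})

train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Distance_cat"], random_state=42)

counts = test["Distance_cat"].value_counts().sort_index()
print(counts)   # 16 rows of category 1 and 4 rows of category 2
```

A purely random split of the same data could easily end up with too few or too many long routes in the test set.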

Let's extract the labels for the Training and Test sets

In [None]:
strat_train_set_labels=strat_train_set.pop("Moving Time")
print(type(strat_train_set_labels))

In [None]:
strat_test_set_labels=strat_test_set.pop("Moving Time")

In [None]:
strat_train_set.info()

In [None]:
strat_test_set.info()

In [None]:
strat_train_set_labels.info()

In [None]:
strat_test_set_labels.info()

In [None]:
strat_test_set["Distance_cat"].value_counts() / len(strat_test_set)

In [None]:
def distance_cat_proportions(data):
    return data["Distance_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(activities, test_size=0.2, random_state=random_state)

compare_props = pd.DataFrame({
    "Overall %": distance_cat_proportions(activities),
    "Stratified %": distance_cat_proportions(strat_test_set),
    "Random %": distance_cat_proportions(test_set),
}).sort_index()
compare_props.index.name = "Distance Category"
compare_props["Strat. Error %"] = (compare_props["Stratified %"] /
                                   compare_props["Overall %"] - 1)
compare_props["Rand. Error %"] = (compare_props["Random %"] /
                                  compare_props["Overall %"] - 1)
(compare_props * 100).round(2)

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("Distance_cat", axis=1, inplace=True)

Now we have clean Training and Test sets. For the time being, let's put the Test set aside and work only on the Training set.

## Feature Scaling

One of the most important transformations you need to apply to your data is feature
scaling. With few exceptions, Machine Learning algorithms don’t perform well when
the input numerical attributes have very different scales.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
strat_train_set_norm = pd.DataFrame(scaler.fit_transform(strat_train_set), columns = strat_train_set.columns)
strat_train_set_norm.describe()
print("scaler mean {}".format(scaler.mean_))

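# Reuse the scaler fitted on the Training set: the Test set (and any new route)
# must be transformed with the same mean and standard deviation the models are trained on.
test_scaler = scaler
strat_test_set_norm = pd.DataFrame(test_scaler.transform(strat_test_set), columns = strat_test_set.columns)
strat_test_set_norm.describe()
print("test scaler mean {}".format(test_scaler.mean_))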
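
Standardization replaces each value x with z = (x - mean) / std. A common convention is to learn the mean and standard deviation on the training rows only and to reuse them for any other data. The sketch below checks this by hand on made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: distance (km) and elevation gain
X_train = np.array([[10.0, 100.0], [20.0, 300.0], [30.0, 500.0], [40.0, 900.0]])
X_test = np.array([[25.0, 450.0]])

scaler = StandardScaler().fit(X_train)     # statistics come from the training rows only
X_test_scaled = scaler.transform(X_test)   # the same mean/std are reused for new data

# Manual z-score with the training statistics (population std, as StandardScaler uses)
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(X_test_scaled)
print(manual)
```

If a second scaler were fitted on the test rows instead, the same route would be mapped to different scaled values than the ones the model saw during training.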


## Evaluation, prediction and printing functions

Before diving into model training, let's prepare a small toolbox of methods to run the prediction and the evaluation of a generic model. This will be helpful to collect the metrics of each model in a dictionary and compare them later. Note that the 'accuracy' stored here is the value returned by the model's score() method, which for Scikit-Learn regressors is the coefficient of determination R².

Let's define a class to hold the model's data (the model itself, cross validation scores, rmse...) and a dictionary to collect one instance per model.

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

evaluations = {}
models = {}
class model_instance:
    def __init__(self, model):
        self.model = model
        self.name = type(model).__name__
        self.scores = None
        # 'accuracy' holds the R^2 returned by model.score(), 'rmse' is in seconds
        self.training = {"accuracy": 0, "rmse": 0}
        self.test = {"accuracy": 0, "rmse": 0}
        self.cross_val_score = []
        self.cross_val_score_rmse = []
        
    def prediction(self, moving_time, data):
        predicted_moving_time = self.model.predict(data)
        error = abs(100*(predicted_moving_time-moving_time)/moving_time)
        return predicted_moving_time, error
    
    def accuracy(self, dataset_name, dataset, dataset_labels):
        # For regressors, score() returns the coefficient of determination R^2
        if dataset_name == "training":
            self.training['accuracy'] = self.model.score(dataset, dataset_labels)
        elif dataset_name == "test":
            self.test['accuracy'] = self.model.score(dataset, dataset_labels)

    def rmse(self, dataset_name, dataset, dataset_labels):
        dataset_predictions = self.model.predict(dataset)
        if dataset_name == "training":
            self.training['rmse'] = np.sqrt(mean_squared_error(dataset_labels, dataset_predictions))
        elif dataset_name == "test":
            self.test['rmse'] = np.sqrt(mean_squared_error(dataset_labels, dataset_predictions))

    def print_model_accuracy(self, dataset_name):
        if dataset_name == "training":
            print("Model {}: Accuracy on {} set:{}".format(self.name, dataset_name, self.training['accuracy']))
        elif dataset_name == "test":
            print("Model {}: Accuracy on {} set:{}".format(self.name, dataset_name, self.test['accuracy']))
            
    def print_model_rmse(self, dataset_name):
        if dataset_name == "training":
            print("Model {}: Rmse on {} set:{}".format(self.name, dataset_name, self.training['rmse']))
        elif dataset_name == "test":
            print("Model {}: Rmse on {} set:{}".format(self.name, dataset_name, self.test['rmse']))

    def cross_val_score_eval(self, dataset, dataset_labels):
        self.cross_val_score = cross_val_score(self.model, dataset, dataset_labels, scoring="neg_mean_squared_error", cv=10)
        self.cross_val_score_rmse = np.sqrt(-self.cross_val_score)
        
    def print_model_cross_val_score(self):
        print("Cross Val Score {}".format(self.cross_val_score))
        print("Cross Val Rmse {}".format(self.cross_val_score_rmse))
        print("Mean:", self.cross_val_score_rmse.mean())
        print("Standard deviation:", self.cross_val_score_rmse.std())
        
    def plot_learning_curves(self, X, y):
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=random_state)
        train_errors, val_errors = [], []
        for m in range(50, len(X_train)):
            self.model.fit(X_train[:m], y_train[:m])
            y_train_predict = self.model.predict(X_train[:m])
            y_val_predict = self.model.predict(X_val)
            train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
            val_errors.append(mean_squared_error(y_val, y_val_predict))
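        sizes = range(50, len(X_train))  # training set size used at each step
        plt.plot(sizes, np.sqrt(train_errors), "r-+", linewidth=2, label="train")
        plt.plot(sizes, np.sqrt(val_errors), "b-", linewidth=3, label="val")
        plt.xlabel("Training set size")
        plt.ylabel("RMSE")
        plt.legend()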
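
The two metrics collected by the class can be reproduced by hand. The sketch below uses synthetic data (not my rides) to show that the value returned by `score()` for a regressor is R² = 1 - SS_res / SS_tot, and that the RMSE is the square root of the mean squared residual.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(5, 60, size=(80, 1))                  # synthetic distances (km)
y = 140 * X[:, 0] + rng.normal(0, 200, size=80)       # synthetic moving times (s)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = ((y - pred) ** 2).sum()        # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()    # total sum of squares
r2_manual = 1 - ss_res / ss_tot
rmse_manual = np.sqrt(((y - pred) ** 2).mean())

print(r2_manual, model.score(X, y))
print(rmse_manual)
```

R² has no unit and is at most 1, while the RMSE is expressed in the unit of the label (seconds here), which makes it easier to relate to a real ride.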


### Better Evaluation Using Cross-Validation
One way to evaluate a model would be to use the train_test_split function to split the training set into a smaller training set and a validation set, then train your models against the smaller training set and evaluate them against the validation set. It’s a bit of work, but nothing too difficult and it would work fairly well.

A great alternative is to use Scikit-Learn’s K-fold cross-validation feature. The method 'cross_val_score_eval' splits the training set into 10 distinct subsets called folds, then it trains and evaluates the model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores. Since the scoring is 'neg_mean_squared_error', the scores are negative MSE values, which is why the method takes the square root of their negation to obtain the RMSE. This approach will be used to evaluate each model on the Training set.
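
Here is a minimal sketch of the same evaluation outside the class, on synthetic data. Passing an explicit `KFold` with `shuffle=True` is my own choice for this example. The notebook passes `cv=10`, which for a regressor splits the rows in order without shuffling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(5, 60, size=(100, 1))                 # synthetic distances (km)
y = 140 * X[:, 0] + rng.normal(0, 200, size=100)      # synthetic moving times (s)

cv = KFold(n_splits=10, shuffle=True, random_state=42)
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=cv)

rmse_per_fold = np.sqrt(-neg_mse)   # scores are negative MSE, one per fold
print("Mean:", rmse_per_fold.mean())
print("Standard deviation:", rmse_per_fold.std())
```

The standard deviation across folds gives a rough idea of how much the estimate depends on which rows end up in the validation fold.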

## Linear Regression Model

We will fit a linear model using the LinearRegression estimator of the scikit-learn library.

In [None]:
linear_reg = LinearRegression()
linear_reg_name = type(linear_reg).__name__
models[linear_reg_name] = model_instance(linear_reg)

Let's train the Linear Regressor model on the stratified features and labels.

In [None]:
models[linear_reg_name].model.fit(strat_train_set_norm, strat_train_set_labels) 

### Linear Regressor Parameters 
The $\mathbf{w}$ and $b$ parameters are referred to as 'coefficients' and 'intercept' in scikit-learn. In other terms, the model function can be written as $f_{\mathbf{w},b}(\vec{x}) = \mathbf{w} \cdot \vec{x} + b$

In [None]:
b = models[linear_reg_name].model.intercept_
w = models[linear_reg_name].model.coef_
print(f"w = {w:}, b = {b:0.2f}")
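
To see that the fitted parameters really are all there is to the model, the following sketch (synthetic data, invented coefficients) rebuilds the predictions as X·w + b and compares them with `predict()`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))    # three scaled features, made up
y = X @ np.array([2000.0, 500.0, 100.0]) + 3600.0 + rng.normal(0, 50, size=50)

model = LinearRegression().fit(X, y)
w, b = model.coef_, model.intercept_

manual = X @ w + b              # f(x) = w . x + b, applied to every row
print(np.allclose(manual, model.predict(X)))
```

With standardized features, the size of each coefficient also gives a rough hint of how much that feature moves the prediction.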

### Accuracy and Root Mean Squared Error of the LinearRegressor Model

Let’s measure the regression model’s accuracy and RMSE on the whole training set using the previously defined methods of 'model_instance'.

In [None]:
models[linear_reg_name].accuracy("training", strat_train_set_norm, strat_train_set_labels)
models[linear_reg_name].rmse("training", strat_train_set_norm, strat_train_set_labels)
# cross-validate on the same scaled features used for fitting
models[linear_reg_name].cross_val_score_eval(strat_train_set_norm, strat_train_set_labels)

In [None]:
models[linear_reg_name].print_model_accuracy('training')

In [None]:
models[linear_reg_name].print_model_rmse('training')

In [None]:
models[linear_reg_name].print_model_cross_val_score()

### Learning Curves of the Linear Regressor Model

In [None]:
models[linear_reg_name].plot_learning_curves(strat_train_set_norm, strat_train_set_labels.ravel())

## Ridge Regression Model

Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression: a regularization term is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.
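
For the curious, when no intercept is fitted the Ridge solution has a closed form, w = (XᵀX + αI)⁻¹ Xᵀy. The sketch below, on synthetic data, compares it with Scikit-Learn's `Ridge` and with plain least squares. Turning off the intercept is a simplification I make only for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=40)
alpha = 1.0

# Closed-form Ridge solution (no intercept): solve (X^T X + alpha * I) w = X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Same model with scikit-learn
w_sklearn = Ridge(alpha=alpha, fit_intercept=False, solver="cholesky").fit(X, y).coef_

# Ordinary least squares for comparison (alpha = 0)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(w_closed, w_sklearn)
print(np.linalg.norm(w_closed), np.linalg.norm(w_ols))
```

The Ridge weights have a smaller norm than the least-squares ones, which is the "keep the weights small" effect of the regularization term.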

In [None]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg_name = type(ridge_reg).__name__
models[ridge_reg_name] = model_instance(ridge_reg)

In [None]:
models[ridge_reg_name].model.fit(strat_train_set_norm, strat_train_set_labels) 

### Mean squared error and Accuracy of the Ridge Model

Let's evaluate the Accuracy and RMSE of the Ridge Model

In [None]:
models[ridge_reg_name].accuracy("training", strat_train_set_norm, strat_train_set_labels)
models[ridge_reg_name].rmse("training", strat_train_set_norm, strat_train_set_labels)
# cross-validate on the same scaled features used for fitting
models[ridge_reg_name].cross_val_score_eval(strat_train_set_norm, strat_train_set_labels)

In [None]:
models[ridge_reg_name].print_model_accuracy('training')

In [None]:
models[ridge_reg_name].print_model_rmse('training')

In [None]:
models[ridge_reg_name].print_model_cross_val_score()

### Learning Curves for the Ridge Regressor Model

In [None]:
models[ridge_reg_name].plot_learning_curves(strat_train_set_norm, strat_train_set_labels.ravel())

## Decision Tree Model

To try to improve the accuracy on the Training set, let's use a different model that is able to catch nonlinear patterns in the data.
Let’s train a DecisionTreeRegressor. This is a powerful model, capable of finding complex nonlinear relationships in the data.

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=random_state)  # fixed seed for reproducible results
tree_reg_name = type(tree_reg).__name__
models[tree_reg_name] = model_instance(tree_reg)

In [None]:
models[tree_reg_name].model.fit(strat_train_set_norm, strat_train_set_labels) 

### Mean squared error and Accuracy of the Decision Tree Model

Let's evaluate the Accuracy and RMSE of the Decision Tree Model

In [None]:
models[tree_reg_name].accuracy("training", strat_train_set_norm, strat_train_set_labels)
models[tree_reg_name].rmse("training", strat_train_set_norm, strat_train_set_labels)
# cross-validate on the same scaled features used for fitting
models[tree_reg_name].cross_val_score_eval(strat_train_set_norm, strat_train_set_labels)

In [None]:
models[tree_reg_name].print_model_accuracy('training')

In [None]:
models[tree_reg_name].print_model_rmse('training')

No error at all on the Training data? At first glance the model seems to be perfect.
Of course, it is much more likely that the model has badly overfit the data. How can you be sure?
Cross-validation helps here: it trains the model on part of the Training set and validates it on the rest.


In [None]:
models[tree_reg_name].print_model_cross_val_score()

Now the real nature of the Decision Tree has come to light. The cross-validated mean RMSE is about 1500 secs and the Standard Deviation is around 470 secs, while the error on the Training data was zero. This is not far from the Linear Regression Model.

The Decision Tree model is overfitting so badly that it ends up performing worse than the Linear Regression model.
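
The same symptom is easy to reproduce on synthetic data: an unconstrained tree memorizes noisy training points (training RMSE of zero) but does much worse on folds it has not seen. The numbers below are invented and only illustrate the mechanism.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(5, 60, size=(200, 1))                 # synthetic distances (km)
y = 140 * X[:, 0] + rng.normal(0, 300, size=200)      # noisy synthetic moving times (s)

# Error on the very same data the tree was trained on
tree = DecisionTreeRegressor(random_state=42).fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# Error on held-out folds
neg_mse = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                          scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-neg_mse).mean()

print("training RMSE:", train_rmse)
print("cross-validated RMSE:", cv_rmse)
```

Limiting the tree (for instance with `max_depth` or `min_samples_leaf`) is one usual way to narrow the gap between the two values.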

### Learning Curves for the Decision Tree Model

In [None]:
models[tree_reg_name].plot_learning_curves(strat_train_set_norm, strat_train_set_labels.ravel())

## Random Forest Model

Let’s try another model now: the RandomForestRegressor. Random Forests work by training many Decision Trees on random subsets of the training data (and, optionally, of the features), then averaging out their predictions. Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms even further.
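
The "averaging" part can be verified directly: a fitted `RandomForestRegressor` exposes its trees in `estimators_`, and the forest prediction is the mean of the single-tree predictions. The sketch uses synthetic data and a small forest of 20 trees to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.uniform(5, 60, size=(150, 2))
y = 140 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 200, size=150)

forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

sample = X[:5]
# One row of predictions per tree: shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(sample) for tree in forest.estimators_])

print(per_tree.mean(axis=0))
print(forest.predict(sample))
```

Because each tree sees a different bootstrap sample, their individual errors partly cancel out in the average.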

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state=random_state)  # fixed seed for reproducible results
forest_reg_name = type(forest_reg).__name__
models[forest_reg_name] = model_instance(forest_reg)

In [None]:
models[forest_reg_name].model.fit(strat_train_set_norm, strat_train_set_labels) 

### Mean squared error and Accuracy of the Random Forest Model

Let's evaluate the Accuracy and RMSE of the Random Forest Model

In [None]:
models[forest_reg_name].accuracy("training", strat_train_set_norm, strat_train_set_labels.ravel())
models[forest_reg_name].rmse("training", strat_train_set_norm, strat_train_set_labels.ravel())
# cross-validate on the same scaled features used for fitting
models[forest_reg_name].cross_val_score_eval(strat_train_set_norm, strat_train_set_labels.ravel())

In [None]:
models[forest_reg_name].print_model_accuracy('training')

In [None]:
models[forest_reg_name].print_model_rmse('training')

In [None]:
models[forest_reg_name].print_model_cross_val_score()

This looks better and more reasonable. The Standard Deviation decreased a bit compared to the Decision Tree Model.

### Learning Curves for the Random Forest Model

In [None]:
models[forest_reg_name].plot_learning_curves(strat_train_set_norm, strat_train_set_labels.ravel())

## Polynomial Model

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
degree=2
poly_reg=make_pipeline(PolynomialFeatures(degree),LinearRegression())

poly_reg_name = type(poly_reg).__name__
models[poly_reg_name] = model_instance(poly_reg)

In [None]:
models[poly_reg_name].model.fit(strat_train_set_norm, strat_train_set_labels) 

### Mean squared error and Accuracy of the Polynomial Regressor Model

In [None]:
models[poly_reg_name].accuracy("training", strat_train_set_norm, strat_train_set_labels)
models[poly_reg_name].rmse("training", strat_train_set_norm, strat_train_set_labels)
# cross-validate on the same scaled features used for fitting
models[poly_reg_name].cross_val_score_eval(strat_train_set_norm, strat_train_set_labels)

In [None]:
models[poly_reg_name].print_model_accuracy('training')

In [None]:
models[poly_reg_name].print_model_rmse('training')

In [None]:
models[poly_reg_name].print_model_cross_val_score()

### Learning Curves for the Polynomial Regression Model

In [None]:
models[poly_reg_name].plot_learning_curves(strat_train_set_norm, strat_train_set_labels.ravel())

## Stochastic Gradient Descent Regressor

In [None]:
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, penalty="l2")
sgdr_reg_name = type(sgd_reg).__name__
models[sgdr_reg_name] = model_instance(sgd_reg)

In [None]:
models[sgdr_reg_name].model.fit(strat_train_set_norm, strat_train_set_labels)
print(f"number of iterations completed: {models[sgdr_reg_name].model.n_iter_}, number of weight updates: {models[sgdr_reg_name].model.t_}")

### Mean squared error and Accuracy of the SGD Regressor Model

In [None]:
models[sgdr_reg_name].accuracy("training", strat_train_set_norm, strat_train_set_labels.ravel())
models[sgdr_reg_name].rmse("training", strat_train_set_norm, strat_train_set_labels.ravel())
models[sgdr_reg_name].cross_val_score_eval(strat_train_set_norm, strat_train_set_labels.ravel())

In [None]:
models[sgdr_reg_name].print_model_accuracy('training')

In [None]:
models[sgdr_reg_name].print_model_rmse('training')

In [None]:
models[sgdr_reg_name].print_model_cross_val_score()

### Learning Curves for the SGD Regressor Model

In [None]:
models[sgdr_reg_name].plot_learning_curves(strat_train_set_norm, strat_train_set_labels.ravel())

## SVM Regression

In [None]:
#from sklearn.svm import LinearSVR
from sklearn.svm import SVR
#svm_reg = LinearSVR(epsilon=1.5)
svm_reg = SVR(kernel="poly", degree=3, C=100, epsilon=0.01)
svm_reg_name = type(svm_reg).__name__
models[svm_reg_name] = model_instance(svm_reg)

In [None]:
models[svm_reg_name].model.fit(strat_train_set_norm, strat_train_set_labels)

### Mean squared error and Accuracy of the SVM Regressor Model

In [None]:
models[svm_reg_name].accuracy("training", strat_train_set_norm, strat_train_set_labels.ravel())
models[svm_reg_name].rmse("training", strat_train_set_norm, strat_train_set_labels.ravel())
models[svm_reg_name].cross_val_score_eval(strat_train_set_norm, strat_train_set_labels.ravel())

In [None]:
models[svm_reg_name].print_model_accuracy('training')

In [None]:
models[svm_reg_name].print_model_rmse('training')

In [None]:
models[svm_reg_name].print_model_cross_val_score()

### Learning Curves for the SVM Regressor Model

In [None]:
models[svm_reg_name].plot_learning_curves(strat_train_set_norm, strat_train_set_labels.ravel())

## Performance on the Test Set

In [None]:
for model in models.keys():
    models[model].accuracy("test", strat_test_set_norm, strat_test_set_labels.ravel())
    models[model].rmse("test", strat_test_set_norm, strat_test_set_labels.ravel())
    models[model].print_model_accuracy('test')
    models[model].print_model_rmse('test')
    print("")

In [None]:
training_accuracies = []
test_accuracies = []
model_names = []
for model in models.keys():
    training_accuracies.append(models[model].training['accuracy'])
    test_accuracies.append(models[model].test['accuracy'])
    model_names.append(models[model].name)

In [None]:
plt.plot(model_names, training_accuracies, "r+", linewidth=2, label="train")
plt.plot(model_names, test_accuracies, "b+", linewidth=2, label="test")
plt.xticks(rotation = 90)
plt.legend(framealpha=1, frameon=True);

## Challenge Gunsan-Saemangeum 2022 prediction

Let's try to predict the Moving Time of a new route I rode last week, the Challenge Gunsan-Saemangeum 2022.

The input data are:
- Distance: 30 Km's
- Elevation Gain: 26
- Max Grade: 3%

The real Moving Time is 53 minutes

<center><img src="gunsam.png" alt="Challenge Gunsan-Saemangeum 2022 - Rouvy" /><img src="test-route.png" alt="Challenge Gunsan-Saemangeum 2022 - Strava" /></center>

In [None]:
gunsan_real_moving_time=53*60 # secs

In [None]:
gunsan_route_data = pd.DataFrame([[29.99, 26, 3]], columns = strat_train_set.columns)
#print(scaler.mean_)
#gunsan_route_data = [[29.99, 26, 3]]
gunsan_route_data_norm = test_scaler.transform(gunsan_route_data) #gunsan_route_data.reshape(1,-1)
#print(type(gunsan_route_data_norm))
#print((gunsan_route_data_norm))
gunsan_route_data_norm_df = pd.DataFrame(gunsan_route_data_norm, columns = strat_train_set.columns)
#print(type(gunsan_route_data_norm_df))
#print((gunsan_route_data_norm_df))
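
The cell above reuses a scaler fitted earlier in the notebook. As a self-contained illustration of the same step, the sketch below fits a `StandardScaler` on a few made-up routes and applies it to a new one. The training values and the column names are hypothetical, and the use of `StandardScaler` is an assumption. A common convention is to fit the scaler on the training data only and reuse it unchanged for every new sample.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training routes; the column names are illustrative only
columns = ["Distance", "Elevation Gain", "Max Grade"]
train = pd.DataFrame(
    [[20.0, 150, 6], [35.0, 600, 12], [28.0, 40, 3], [42.0, 900, 15]],
    columns=columns,
)

scaler = StandardScaler().fit(train)  # learns the per-column mean and std

# A new route must be a 2D structure with the same columns, in the same order
new_route = pd.DataFrame([[29.99, 26, 3]], columns=columns)
new_route_norm = pd.DataFrame(scaler.transform(new_route), columns=columns)

print(new_route_norm)
```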

### Comparing the models' predictions on the new route

Let's print out the predictions of the models trained so far.

In [None]:
for model in models.keys():
    predicted_moving_time, error = models[model].prediction(gunsan_real_moving_time, gunsan_route_data_norm_df)
    print("Model {}: Moving time {} mins, error {} %".format(models[model].name, predicted_moving_time/60, error))
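
The `prediction` method is defined earlier in the notebook. As a small worked example of the kind of error it reports, the sketch below assumes the error is the absolute difference between predicted and real time, expressed as a percentage of the real time. The predicted value used here is made up for illustration.

```python
def percent_error(real_secs, predicted_secs):
    """Absolute relative error of a prediction, as a percentage of the real value."""
    return abs(predicted_secs - real_secs) / real_secs * 100


real = 53 * 60        # real moving time of the Gunsan route, in seconds
predicted = 50 * 60   # hypothetical prediction, for illustration only

print(f"{predicted / 60:.1f} mins, error {percent_error(real, predicted):.1f} %")
```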

## Stage 4 (Brunnen) - Tour de Suisse 2022

Let's try another route. The input data are:

- Distance: 27.7 km
- Elevation Gain: 467
- Max Grade: 13%

The real Moving Time is 1:03:52 = 3832 secs 

<center>
<img src="brunnen-rouvy.png" alt="Stage 4 (Brunnen) - Tour de Suisse 2022 - Rouvy" />
<img src="brunnen-strava.png" alt="Stage 4 (Brunnen) - Tour de Suisse 2022 - Strava" />
</center>

In [None]:
brunnen_real_moving_time=3600 + 3*60 + 52 # secs
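
The conversion above is done by hand. When many routes have to be checked, a small helper can do it from the time string shown by Strava. The function name `hms_to_seconds` is hypothetical and not part of the notebook.

```python
def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' or 'MM:SS' string to a number of seconds."""
    seconds = 0
    for part in hms.split(":"):
        # Each field is worth 60 times the one to its right
        seconds = seconds * 60 + int(part)
    return seconds


brunnen_secs = hms_to_seconds("1:03:52")
print(brunnen_secs)  # 3832, the same as 3600 + 3*60 + 52
```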

In [None]:
#print(scaler.mean_)
brunnen_route_data = pd.DataFrame([[27.87, 467, 13]], columns = strat_train_set.columns)
#brunnen_route_data = [[27.87, 467, 13]]
brunnen_route_data_norm = test_scaler.transform(brunnen_route_data)
#print(type(brunnen_route_data_norm))
#print((brunnen_route_data_norm))
brunnen_route_data_norm_df = pd.DataFrame(brunnen_route_data_norm, columns = strat_train_set.columns)
#print(type(brunnen_route_data_norm_df))
#print((brunnen_route_data_norm_df))

### Model's prediction comparison for the "Brunnen" route

Let's print out the predictions of the models trained so far.

In [None]:
predicted_moving_times = []
errors = []
for model in models.keys():
    predicted_moving_time, error = models[model].prediction(brunnen_real_moving_time, brunnen_route_data_norm_df)
    print("Model {}: Moving time {} mins, error {} %".format(models[model].name, predicted_moving_time/60, error))
    predicted_moving_times.append(predicted_moving_time)
    errors.append(error)

In [None]:
plt.plot(model_names, errors, "r+", linewidth=2, label="Prediction error (%)")
plt.xticks(rotation = 90)
plt.legend(framealpha=1, frameon=True);

## Conclusions

The purpose of this exercise was to develop a Machine Learning model that predicts the 'Moving Time' of a route from Rouvy data. Here is a recap of what has been done:

- The data include more than 500 'outdoor' and 'indoor' routes.
- The 'indoor' routes were exported from Rouvy to Strava.
- The 'outdoor' routes were recorded directly by Strava.
- The data were cleaned and prepared for training.
- The dataset was split into a 'Training' set and a 'Test' set.
- Several regression models were trained on the 'Training' set.
- Each model was validated with its cross-validation score and accuracy.
- For each model, the learning curves were plotted for the 'Training' and 'Test' sets.
- The Decision Tree model badly overfits the 'Training' set.
- The LinearRegression, Ridge, RandomForest, Pipeline and SGDRegressor models reach an accuracy of roughly 90%.
- The Polynomial Regressor (Pipeline) showed the highest accuracy on the 'Test' set.
- On the two sample routes, the best 'Moving Time' predictions came from the Polynomial, SVR and SGD models.
- More data are required to improve the performance of the models.
- As a possible follow-up, an Ensemble model combining the most promising models could be trained.
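
As a rough idea of what the ensemble follow-up could look like, the sketch below averages the predictions of three regressors with Scikit-Learn's `VotingRegressor`. It runs on synthetic data; the choice of base models and their hyperparameters is an assumption for illustration, not a result from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical): 3 standardized features, 1 target
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 3000 + 900 * X[:, 0] + 400 * X[:, 1] + rng.normal(scale=100, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# The ensemble prediction is the plain average of the base models' predictions
ensemble = VotingRegressor([
    ("ridge", Ridge()),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=1)),
    ("svr", SVR(kernel="poly", degree=3, C=100, epsilon=0.01)),
])
ensemble.fit(X_train, y_train)

print(f"Ensemble R^2 on held-out data: {ensemble.score(X_test, y_test):.3f}")
```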

## Fine-Tune Your Model

You now need to fine-tune the Random Forest. Let’s look at a few ways you can do that.

### Grid Search

One way to do that would be to fiddle with the hyperparameters manually until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations.

Instead, you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyperparameter values using cross-validation. For example, the following code searches for the best combination of hyperparameter values for the RandomForestRegressor:

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = [
 # the routes have only three features, so max_features cannot exceed 3
 {'n_estimators': [3, 10, 30], 'max_features': [1, 2, 3]},
 {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3]},
]

In [None]:
grid_search = GridSearchCV(models[forest_reg_name].model, param_grid, cv=10,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(strat_train_set_norm, strat_train_set_labels.ravel())

In [None]:
grid_search.best_params_

In [None]:
cvres = grid_search.cv_results_

In [None]:
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
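
The loop above prints the combinations in the order they were tried. A possible alternative, sketched below on synthetic data, is to load `cv_results_` into a pandas DataFrame and sort it by rank. The data and the small parameter grid are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical): 3 features, 1 target
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
y = 3000 + 900 * X[:, 0] + 400 * X[:, 1] + rng.normal(scale=100, size=150)

search = GridSearchCV(
    RandomForestRegressor(random_state=2),
    {"n_estimators": [3, 10], "max_features": [1, 2, 3]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# One row per parameter combination; convert the negated MSE into an RMSE
results = pd.DataFrame(search.cv_results_)
results["rmse"] = np.sqrt(-results["mean_test_score"])
ranked = results.sort_values("rank_test_score")[["params", "rmse", "rank_test_score"]]

print(ranked.to_string(index=False))
```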

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

In [None]:
best_random_forest = grid_search.best_estimator_
best_random_forest_name = "BestRandomForest"
models[best_random_forest_name] = model_instance(best_random_forest)

In [None]:
models[best_random_forest_name].accuracy("test", strat_test_set_norm, strat_test_set_labels.ravel())
models[best_random_forest_name].rmse("test", strat_test_set_norm, strat_test_set_labels.ravel())
models[best_random_forest_name].print_model_accuracy('test')
models[best_random_forest_name].print_model_rmse('test')

In [None]:
predicted_moving_time, error = models[best_random_forest_name].prediction(gunsan_real_moving_time, gunsan_route_data_norm_df)
print("Model {}: Moving time {} mins, error {} %".format(models[best_random_forest_name].name, predicted_moving_time/60, error))

### Learning curves of the best Random Forest Model

Let’s look at the learning curves of the best Random Forest found by the grid search.

In [None]:
models[best_random_forest_name].plot_learning_curves(strat_train_set_norm, strat_train_set_labels.ravel())