# Recurrent artificial neural networks
# Long Short-Term Memory - LSTM
--------

This notebook guides you through training and testing an LSTM network that predicts the speed of a runner on a given slope from their previous speeds during the same race.

Before running the cells in this notebook, you have to upload the race data files and the Python module that parses those files and prepares the dataset.
### Left panel -> Files -> Upload
Then select the compressed folder **strava.zip** and the Python module **strava.py**.

### OR
Uncomment and update the code in the following cell if your data is in your Google Drive.

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

#!cp 'drive/My Drive/Colab Notebooks/strava.zip' .
#!cp 'drive/My Drive/Colab Notebooks/strava.py' .

Let us start by loading some Python modules

In [None]:
import numpy as np
from matplotlib import pyplot as pl
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import os

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

import strava as st

Then, let us unzip the compressed file you uploaded

In [None]:
if os.path.exists('strava'):
  print('Data are already in folder')
else:
  if os.path.exists('strava.zip'):
    !unzip strava.zip
  else:
    print('You must upload the data first!')

Declare some constants

In [None]:
PATH_DATA = 'strava'

FEATURES = ['time', 'speed', 'slope']    # selected from ['time', 'elevation', 'distance', 'speed', 'slope']
SPEED_OUTLIER = 8.0                      # speed > 8 m/s (about 29 km/h)
SLOPE_OUTLIER = 80                       # |slope| > 80%
TIME_PERIOD = 1*60                       # period of time to average
SEGMENT_LENGTH = 100                     # length of the segment to average data
AVERAGE_SPEED_TH = 2.4                   # threshold to further clean the dataset

### Read the data
Parse the files in the **strava** folder

In [None]:
importer = st.RunImport(SPEED_OUTLIER, SLOPE_OUTLIER, TIME_PERIOD, SEGMENT_LENGTH, AVERAGE_SPEED_TH)
dataset = importer.import_path(PATH_DATA)

### Example of a race
The following cell shows an example of the data from a race

In [None]:
# pick one of the race ids that actually exist in the dataset
st.plot_race(dataset, np.random.choice(dataset['race'].unique()))

### Normalize the dataset
The following cell normalizes the features into the interval [0, 1]

In [None]:
# Copy the dataset before normalisation, feature selection, numpy conversion, etc.
original_dataset = dataset.copy(deep=True)
min_speed = original_dataset['speed'].min()
max_speed = original_dataset['speed'].max()

#normalize only the selected features
#transform to numpy
normalized_dataset = dataset.filter(items=FEATURES).values
scaler = MinMaxScaler()
scaler.fit(normalized_dataset)
normalized_dataset = scaler.transform(normalized_dataset)
#transform back to dataframe
normalized_dataset = pd.DataFrame(normalized_dataset, index=dataset.index, columns=FEATURES)

#update the dataset with the new values
dataset.update(normalized_dataset)

print("Min (per feature):", scaler.data_min_)
print("Max (per feature):", scaler.data_max_)
display(dataset.head())
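
As a minimal sketch of what the normalization does, the cell below applies the min-max formula by hand to a few made-up speeds (these values are an assumption for illustration and are not taken from the Strava data). With its default settings, `MinMaxScaler` applies the same formula to each feature. The last step undoes the scaling in the same way the predictions are unnormalized later in this notebook.

```python
import numpy as np

# Toy speeds in m/s (made-up values, not taken from the Strava data)
speeds = np.array([2.0, 3.0, 4.0, 6.0])

min_s, max_s = speeds.min(), speeds.max()

# Min-max scaling: the smallest value maps to 0, the largest to 1
scaled = (speeds - min_s) / (max_s - min_s)

# Inverse transform: recover the original units (m/s)
restored = scaled * (max_s - min_s) + min_s

print('scaled  :', scaled)
print('restored:', restored)
```

Keeping `min_speed` and `max_speed` from the original dataset is what makes this inverse step possible after the model has been trained on normalized values.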

### Create a training and a testing subset
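Split the dataset into a training and a testing subset. The split is done by race, so all the samples of a given race end up in the same subset.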


In [None]:
TIMESTEPS = 10                   # sequence length (number of past samples fed to the LSTM)
TEST_SIZE = 0.2                  # fraction of races held out for testing, in ]0;1[
TRAINING_SIZE = 1 - TEST_SIZE

In [None]:
all_races = np.unique(dataset['race'])
print('Number of races', len(all_races))
RACES_TRAINING = int(np.floor(TRAINING_SIZE * len(all_races)))
races_train = np.random.choice(all_races, RACES_TRAINING, replace=False)
#print(races_train)
races_test = list(set(all_races) - set(races_train))
#print(races_test)
print(len(races_train), 'races used for training --- Number of samples', np.sum(np.isin(dataset['race'], races_train)))
print(len(races_test), 'races used for testing\t --- Number of samples', np.sum(np.isin(dataset['race'], races_test)))
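
The cell below is a small sketch of the same idea on made-up race ids (5 hypothetical races with 4 samples each). It checks that a split by race leaves no sample in both subsets. The names `race_ids`, `train_races` and `test_races` are only used for this illustration, and the seed is there just to make the sketch reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)                # seeded only for reproducibility
race_ids = np.repeat(np.arange(5), 4)         # 5 hypothetical races, 4 samples each

unique_races = np.unique(race_ids)
n_train = int(np.floor(0.8 * len(unique_races)))

# Draw whole races (not individual samples) for training
train_races = rng.choice(unique_races, n_train, replace=False)
test_races = np.setdiff1d(unique_races, train_races)

train_mask = np.isin(race_ids, train_races)
test_mask = np.isin(race_ids, test_races)

print('training races:', sorted(train_races), '--- samples:', train_mask.sum())
print('testing races :', sorted(test_races), '--- samples:', test_mask.sum())
print('samples in both subsets:', np.sum(train_mask & test_mask))
```

Splitting by race rather than by sample keeps overlapping windows of the same race from appearing in both subsets.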

### Create inputs and outputs
The following cell contains the function that will be used to create the inputs and outputs for training the models

In [None]:
#take a dataframe and a list of races as input and return the inputs (x) and the targets (y) as numpy arrays
def create_x_y(data, races):
  speed_index = data[FEATURES].columns.get_loc('speed') #get speed index
  slope_index = data[FEATURES].columns.get_loc('slope') #get slope index
  time_index = data[FEATURES].columns.get_loc('time') #get time index
  x = None
  y = None

  #iterate over every race
  for r in races:
    #filter race
    race_df = data.loc[data['race'] == r]
    #filter features
    race_np = race_df[FEATURES].values
    #split into windows of TIMESTEPS+1 rows (TIMESTEPS inputs plus the row holding the target value)
    #a race with n rows yields n - TIMESTEPS complete windows
    race_np = [race_np[i:(i+TIMESTEPS+1)] for i in range(race_np.shape[0] - TIMESTEPS)]

    if len(race_np) == 0:
      print("Warning: not enough values in race", r)
      continue

    race_np = np.stack(race_np, axis=0)

    temp_x = np.dstack([race_np[:,1:,time_index],       # last TIMESTEPS-1 times and next time
                       race_np[:,1:,slope_index],       # last TIMESTEPS-1 slopes and next slope
                       race_np[:,:-1,speed_index]])     # last TIMESTEPS speeds
    temp_y = race_np[:,-1, speed_index]                 # next speed

    if x is None:
      x = temp_x
      y = temp_y
    else:
      x = np.append(x, temp_x, axis=0)
      y = np.append(y, temp_y, axis=0)

  return x, y

In [None]:
print('original shape:', dataset.shape)

X_train, y_train = create_x_y(dataset, races_train)
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

X_test, y_test = create_x_y(dataset, races_test)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
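
To make the windowing easier to picture, here is a minimal sketch on a toy one-dimensional signal. It uses a hypothetical sequence length `T = 3` instead of `TIMESTEPS` and a single feature instead of time, slope and speed, so it only illustrates the principle and is not a replacement for `create_x_y`.

```python
import numpy as np

T = 3                                    # hypothetical sequence length for this sketch
series = np.arange(8, dtype=float)       # toy "speed" signal: 0, 1, ..., 7

# Each window holds T past values followed by the value to predict
windows = np.stack([series[i:i + T + 1] for i in range(len(series) - T)])

x_demo = windows[:, :-1]                 # the T past values
y_demo = windows[:, -1]                  # the next value (target)

print('x shape:', x_demo.shape, '--- y shape:', y_demo.shape)
print(x_demo[0], '->', y_demo[0])
```

In the notebook, the third axis of `X_train` holds the three input features for each timestep, which is why its shape is `(samples, TIMESTEPS, 3)`.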


### Create a model and train it
The following cells create an LSTM network and train it with the training subset

In [None]:
BATCH_SIZE = 64          # Size of the batch for training
NB_EPOCHS = 2            # Number of times the training dataset is presented
NB_UNITS = 1             # Number of LSTM units

# Create and fit the LSTM network
model = Sequential()
model.add(LSTM(NB_UNITS, input_shape=(TIMESTEPS, len(FEATURES))))
#dense layer: connects all the LSTM units to a single output neuron -> output shape (*, 1)
model.add(Dense(1))

model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()
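
The number of trainable parameters reported by `model.summary()` can be checked by hand. The sketch below assumes a standard LSTM layer with biases (the Keras default) followed by a dense layer. The helper names `lstm_param_count` and `dense_param_count` are hypothetical and only serve this illustration.

```python
def lstm_param_count(units, n_features):
    # 4 gates, each with input weights, recurrent weights and one bias per unit
    return 4 * (units * (n_features + units) + units)

def dense_param_count(n_inputs, n_outputs):
    # one weight per input-output pair plus one bias per output
    return n_inputs * n_outputs + n_outputs

units, n_features = 1, 3      # NB_UNITS and len(FEATURES) as set in this notebook
lstm_params = lstm_param_count(units, n_features)
dense_params = dense_param_count(units, 1)

print('LSTM :', lstm_params)
print('Dense:', dense_params)
print('Total:', lstm_params + dense_params)
```

Because the recurrent weights grow with the square of the number of units, the parameter count rises quickly when `NB_UNITS` is increased.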

In [None]:
history = model.fit(X_train, y_train, epochs=NB_EPOCHS, batch_size=BATCH_SIZE, verbose=1, validation_data=(X_test, y_test))

In [None]:
# Plot the training and testing loss per epoch
pl.plot(history.history['loss'], label='Training')
pl.plot(history.history['val_loss'], label='Testing')
pl.xlabel('epochs')
pl.ylabel('mse')
pl.legend()
pl.grid()

### Evaluate the performance of the model
The following cell computes the correlation between the actual speed of the runner and the model's output

In [None]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print('Training correlation coefficient:', np.corrcoef(y_train.T, y_train_pred.T)[0,1])
print('Test correlation coefficient:', np.corrcoef(y_test.T, y_test_pred.T)[0,1])
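
Keep in mind that the correlation coefficient only measures how well the predictions follow the shape of the actual speeds. It is insensitive to a constant offset or scale error. The sketch below illustrates this with made-up values (not model outputs) and computes the root mean squared error as a complementary measure. Using RMSE here is a suggestion, not part of the original evaluation.

```python
import numpy as np

y_true = np.array([2.0, 2.5, 3.0, 3.5, 4.0])   # made-up speeds in m/s
y_biased = 0.5 * y_true + 3.0                  # right shape, wrong scale and offset

corr = np.corrcoef(y_true, y_biased)[0, 1]
rmse = np.sqrt(np.mean((y_true - y_biased) ** 2))

print('correlation:', corr)
print('RMSE [m/s] :', rmse)
```

A prediction can therefore be perfectly correlated with the target and still be far off in absolute terms, so it is worth looking at both numbers.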

### Visualize the results
The following cell visualizes the output of the LSTM for a single race in the testing subset and compares it with the actual speed of the runner

In [None]:
random_race = np.random.choice(races_test)
X, y = create_x_y(dataset, [random_race])
X_o, y_o = create_x_y(original_dataset, [random_race])            # select inputs and output from the unnormalized dataset also

y_pred_o = model.predict(X) * (max_speed - min_speed) + min_speed # unnormalize the prediction

pl.figure(figsize=(14,4))
pl.plot(X_o[:,-1,0], y_o, label='actual speed')
pl.plot(X_o[:,-1,0], y_pred_o, label='prediction')
pl.plot(X_o[:,-1,0], np.abs(y_o - y_pred_o[:,0]), label='abs. error')
pl.legend()
pl.title('race number: ' + str(random_race))
pl.xlabel('time [s]')
pl.ylabel('speed [m/s]');


# Exercise

1. Change the number of units and epochs of the LSTM network. Show the configuration that performed the best.
2. What is the largest speed prediction error you observed? Do most of the large errors show up at high speeds or at low speeds? Why?
3. Using the predicted speeds for a given race, compute the expected race time, then compute the difference in minutes between the real race time and the predicted race time. Provide the code of the cell that computes this prediction error.