# Predict hiking times based on GPX tracks

We attempt to predict the walking times and durations of hikes from GPX files using recurrent neural networks. Because the model is trained on personal GPX records, its predictions are personalized as well.

We use the `tf.keras` API; see [this guide](https://www.tensorflow.org/guide/keras) for details.

In [None]:
import pathlib

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
import numpy as np
import h5py

import gpx_stats
import utils

print(tf.__version__)

### Load data

Load the dataset that was prepared by running the following command in a shell:

`run prepareData.py '~/GPX-Tracks' 'Wandern'`

In [None]:
train_dataset_file = 'hiking_data_training.hdf5'
test_dataset_file = 'hiking_data_test.hdf5'

Convert data to a pandas dataframe:

In [None]:
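def read_hdf5_to_dict(file_name):
    """Read all segment statistics from an HDF5 file into a dict of columns."""
    hdf5_data_dict = {}
    # The context manager closes the file even if reading a column fails.
    with h5py.File(file_name, 'r') as hdf5_data:
        for name in gpx_stats.GpxSegmentStats.get_header():
            if name == 'Path':
                # Wrap each stored path array in a PathFeature object.
                hdf5_data_dict[name] = [gpx_stats.PathFeature(data) for data in hdf5_data[name]]
            else:
                # [...] reads the whole dataset into memory as a NumPy array.
                hdf5_data_dict[name] = hdf5_data[name][...]
    return hdf5_data_dict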

train_hdf5_data_dict = read_hdf5_to_dict(train_dataset_file)
test_hdf5_data_dict = read_hdf5_to_dict(test_dataset_file)

path_features_shape = train_hdf5_data_dict['Path'][0].shape
    
train_dataset = pd.DataFrame.from_dict(train_hdf5_data_dict)
test_dataset = pd.DataFrame.from_dict(test_hdf5_data_dict)

train_dataset.tail()

### Split features from labels

Separate the target value, or "label", from the features. This label is the value that we will train the model to predict.

In [None]:
data_columns = ["Path"]
label_columns = ['MovingTime']

train_data = train_dataset[data_columns]
train_labels = train_dataset[label_columns]

test_data = test_dataset[data_columns]
test_labels = test_dataset[label_columns]

train_paths_as_array = gpx_stats.convert_paths_to_array(train_data["Path"])
test_paths_as_array = gpx_stats.convert_paths_to_array(test_data["Path"])

Filter out paths that contain `NaN` entries:

In [None]:
train_nan_mask = np.any(np.isnan(train_paths_as_array), axis=(1, 2))
test_nan_mask = np.any(np.isnan(test_paths_as_array), axis=(1, 2))

train_paths_as_array = train_paths_as_array[~train_nan_mask]
train_labels = train_labels[~train_nan_mask]

test_paths_as_array = test_paths_as_array[~test_nan_mask]
test_labels = test_labels[~test_nan_mask]
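
The masking step can be illustrated on a toy array. This is a minimal sketch that assumes the path array has the layout `(n_paths, n_points, n_features)`, which is what `axis=(1, 2)` suggests; the shapes and values are made up.

```python
import numpy as np

# Toy batch: 3 paths, 4 points per path, 2 features per point (made-up shapes).
paths = np.zeros((3, 4, 2))
paths[1, 2, 0] = np.nan  # corrupt a single value in the second path

# Reduce over the point and feature axes: one boolean per path.
nan_mask = np.any(np.isnan(paths), axis=(1, 2))
print(nan_mask)  # [False  True False]

# Keep only the paths without any NaN entry.
clean_paths = paths[~nan_mask]
print(clean_paths.shape)  # (2, 4, 2)
```

A single `NaN` anywhere in a path removes the whole path, and the same mask has to be applied to the labels so that features and labels stay aligned.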

## The model

### Build the model

Let's build our model. It consists of four 1D convolutional layers, followed by two LSTM layers and two dense layers. The 1D convolutional layers preprocess the tracks and are numerically very cheap. As only a few tracks are available, keeping the number of parameters small helps to avoid overfitting. The last fully connected layer outputs one value per label column, which here is only the moving time. The model-building steps are wrapped in a function, `build_model`, for convenience.

In [None]:
learning_rate = 0.001
batch_size = 256

def build_model():
    path_inputs = layers.Input(shape=path_features_shape, name='PathInputs')
    conv1 = layers.Conv1D(filters=6, kernel_size=3, padding='same', activation=tf.nn.relu)(path_inputs)
    conv2 = layers.Conv1D(filters=12, kernel_size=3, padding='same', activation=tf.nn.relu)(conv1)
    bn1 = layers.BatchNormalization()(conv2)
    conv3 = layers.Conv1D(filters=24, kernel_size=3, padding='same', activation=tf.nn.relu)(bn1)
    conv4 = layers.Conv1D(filters=6, kernel_size=3, padding='same', activation=tf.nn.relu)(conv3)
    bn2 = layers.BatchNormalization()(conv4)
    
    lstm1 = layers.LSTM(16, return_sequences=True)(bn2)
    dropout = layers.Dropout(0.25)(lstm1)
    lstm2 = layers.LSTM(16, return_sequences=False)(dropout)
    dropout2 = layers.Dropout(0.25)(lstm2)
    dense1 = layers.Dense(32, activation=tf.nn.relu)(dropout2)

    outputs = layers.Dense(len(train_labels.keys()), activation=None)(dense1)
    
    model = models.Model(inputs=path_inputs, outputs=outputs)

    optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)

    model.compile(loss='mean_squared_error', 
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model

model = build_model()
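
To get a feeling for how small the model is, the trainable weights of the convolutional, LSTM and dense layers can be counted by hand. The sketch below uses the standard parameter formulas for these layer types with default biases. The number of features per path point (`n_path_features`) is a hypothetical value, since it depends on `path_features_shape`, and the batch normalization layers are left out.

```python
def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of shape (kernel_size, in_channels) plus one bias per filter.
    return (kernel_size * in_channels + 1) * filters


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (input_dim + units + 1) * units


def dense_params(input_dim, units):
    # One weight per input and unit, plus one bias per unit.
    return (input_dim + 1) * units


n_path_features = 2  # hypothetical: the real value is path_features_shape[-1]
n_labels = 1  # label_columns = ['MovingTime']

counts = {
    'conv1': conv1d_params(n_path_features, 6, 3),
    'conv2': conv1d_params(6, 12, 3),
    'conv3': conv1d_params(12, 24, 3),
    'conv4': conv1d_params(24, 6, 3),
    'lstm1': lstm_params(6, 16),
    'lstm2': lstm_params(16, 16),
    'dense1': dense_params(16, 32),
    'outputs': dense_params(32, n_labels),
}
for name, count in counts.items():
    print(f'{name}: {count}')
print('total (without batch normalization):', sum(counts.values()))
```

The authoritative numbers are the ones reported by `model.summary()`; this calculation only shows where they come from.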

### Inspect the model

Use the `.summary` method to print a simple description of the model.

In [None]:
model.summary()


Now try out the model. Take a batch of `4` examples from the training data and call `model.predict` on it.

In [None]:
example_batch = train_paths_as_array[:4]
example_result = model.predict(example_batch)
example_result

### Train the model

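Train the model for up to 500 epochs on the training set, holding out 20% of it for validation, and record the training and validation loss and error metrics in the `history` object. Early stopping ends training once the validation loss has not improved for 20 epochs.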

In [None]:
EPOCHS = 500

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                                 patience=10, min_lr=0.01*learning_rate)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20, verbose=1, 
                                                  restore_best_weights=True)

callbacks = [reduce_lr, early_stopping]


history = model.fit(train_paths_as_array, 
                    train_labels.values,
                    epochs=EPOCHS, validation_split=0.2, verbose=1,
                    batch_size=batch_size,
                    callbacks=callbacks)
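
The effect of `ReduceLROnPlateau` with the settings above can be sketched without TensorFlow. The snippet only lists the learning rates that are reachable when every reduction halves the rate and clips it at `min_lr`; it deliberately ignores when a reduction is triggered (the `patience` logic), so it is a simplified illustration rather than a reimplementation of the callback.

```python
learning_rate = 0.001
factor = 0.5
min_lr = 0.01 * learning_rate

# Each reduction multiplies the rate by `factor` but never goes below `min_lr`.
schedule = [learning_rate]
while schedule[-1] > min_lr:
    schedule.append(max(schedule[-1] * factor, min_lr))

print(schedule)
print('reductions until the floor is reached:', len(schedule) - 1)
```

Because early stopping uses a patience of 20 epochs and the learning rate is reduced after 10 epochs without improvement, a stalled run is typically stopped long before the floor is reached.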

Visualize the model's training progress using the stats stored in the `history` object.

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

In [None]:
utils.plot_history(history)

As our data set of real GPX tracks is very small, the optimal choice of the epoch for ending training depends on the distribution of tracks between training, validation and testing data. The above choice led to quite stable results. The model trains quite well and is slightly better than the simple model discussed in the other notebook.
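
One reason for this dependence is how `validation_split` selects its data: Keras takes the validation samples from the end of the arrays passed to `fit`, before any shuffling. The following sketch mimics this selection with NumPy on made-up track indices; the exact rounding at the split boundary is an assumption.

```python
import numpy as np

n_tracks = 10  # made-up number of training tracks
validation_split = 0.2
track_ids = np.arange(n_tracks)

# The last 20% of the samples, in their original order, form the validation set.
split_at = int(n_tracks * (1 - validation_split))
train_ids, val_ids = track_ids[:split_at], track_ids[split_at:]
print('training:', train_ids)
print('validation:', val_ids)

# Shuffling the tracks (with a fixed seed) before fit() changes the held-out set.
rng = np.random.default_rng(seed=0)
shuffled_ids = rng.permutation(track_ids)
print('validation after shuffling:', shuffled_ids[split_at:])
```

With so few tracks, which hikes end up in the held-out part can noticeably change the validation loss and therefore the epoch at which early stopping triggers.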


### Make predictions

Finally, predict moving times for the tracks in the test set:

In [None]:
test_predictions = model.predict(test_paths_as_array)

utils.scatter_plot(test_labels["MovingTime"].values, test_predictions[:, 0], "Moving time")

It looks like our model predicts moving times reasonably well; ideally, all points would lie on the diagonal. There are interesting gaps in the plots at low times. Let's take a look at the error distribution.

In [None]:
utils.plot_error_hist(test_labels["MovingTime"].values, test_predictions[:, 0], "Moving time")

It's not quite Gaussian, but that is to be expected given the very small number of samples.
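
A few summary numbers can complement the histogram. The sketch below uses made-up values in place of the real labels and predictions, and the unit (seconds) is an assumption; with the notebook's data, `true_times` would be `test_labels["MovingTime"].values` and `predicted_times` would be `test_predictions[:, 0]`.

```python
import numpy as np

# Made-up moving times in seconds; replace with the real labels and predictions.
true_times = np.array([3600.0, 5400.0, 7200.0, 9000.0])
predicted_times = np.array([3500.0, 5600.0, 7000.0, 9300.0])

errors = predicted_times - true_times
print('mean error (bias):', errors.mean())
print('mean absolute error:', np.abs(errors).mean())
print('root mean squared error:', np.sqrt(np.mean(errors ** 2)))
print('median relative error:', np.median(np.abs(errors) / true_times))
```

A mean error far from zero would indicate a systematic over- or underestimate, which a histogram with few samples can hide.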

Export model for use in inference:

In [None]:
model.save('model_hikingTimePrediction_recurrent.h5')

## Comparison of predictions with standard formula for predicting hiking durations

In the following, the predictions are compared with the estimates for hiking durations from a standard formula.

In [None]:
compute_standard_walking_time_vectorized = np.vectorize(utils.compute_standard_walking_time)

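# Drop the tracks that were filtered out above so the arrays line up with test_predictions.
filtered_test_dataset = test_dataset[~test_nan_mask]

standard_estimate_walking_time = compute_standard_walking_time_vectorized(filtered_test_dataset['Length2d'].values,
                                                                          filtered_test_dataset['TotalUphill'].values,
                                                                          filtered_test_dataset['TotalDownhill'].values)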

utils.scatter_plot(test_predictions[:, 0], standard_estimate_walking_time, 'Standard estimate for duration')

In [None]:
utils.plot_error_hist(test_predictions[:, 0], 
                      standard_estimate_walking_time, 
                      'Standard estimate for duration')
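
The notebook does not show which formula `utils.compute_standard_walking_time` implements. As an illustration of what such a standard estimate can look like, here is a sketch of the widely used DIN 33466 rule of thumb: 4 km per hour horizontally, 300 m of ascent per hour and 500 m of descent per hour, where the larger of the horizontal and vertical times is counted in full and the smaller one by half. The function name and the units (meters in, hours out) are assumptions made for this example.

```python
def din_hiking_time_hours(length_2d_m, total_uphill_m, total_downhill_m):
    """Estimate hiking time in hours with the DIN 33466 rule of thumb."""
    horizontal_hours = length_2d_m / 4000.0  # 4 km per hour on flat ground
    vertical_hours = total_uphill_m / 300.0 + total_downhill_m / 500.0
    # Count the larger share in full and add half of the smaller one.
    longer = max(horizontal_hours, vertical_hours)
    shorter = min(horizontal_hours, vertical_hours)
    return longer + shorter / 2.0


# Hypothetical hike: 12 km long, 600 m of ascent, 300 m of descent.
print(din_hiking_time_hours(12000.0, 600.0, 300.0))
```

Such a formula knows nothing about the individual hiker, which is exactly what the personalized model is meant to improve on.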

## Conclusion

This notebook presents the training of a recurrent model that predicts the moving time of hikes from GPX tracks. The same approach can also be applied to many other outdoor activities.

In [None]:
model.evaluate(test_paths_as_array, test_labels.values, return_dict=True)