# Predict hiking times based on statistics from GPX files

We attempt to predict the walking times and durations of hikes based on statistics extracted from GPX files. Because the model is trained on personal GPX records, its predictions are personalized as well.

We use the `tf.keras` API; see [this guide](https://www.tensorflow.org/guide/keras) for details.

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import pathlib

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import h5py
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models

# Enable memory growth on the first GPU; skip this step on CPU-only machines
physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

import gpx_stats
import utils

print(tf.__version__)

### Load data
Load the dataset that was prepared by running the following command in a terminal:

`python prepareData.py '~/GPX-Tracks' 'Wandern'`

In [None]:
train_dataset_file = 'hiking_data_training.hdf5'
test_dataset_file = 'hiking_data_test.hdf5'

Convert data to a pandas dataframe:

In [None]:
def read_hdf5_to_dict(file_name):
    """Read all GPX segment statistics from an HDF5 file into a dict of columns."""
    hdf5_data_dict = {}
    # The context manager closes the file even if reading fails
    with h5py.File(file_name, 'r') as hdf5_data:
        for name in gpx_stats.GpxSegmentStats.getHeader():
            if name == 'Path':
                hdf5_data_dict[name] = [gpx_stats.PathFeature(data) for data in hdf5_data[name]]
            else:
                hdf5_data_dict[name] = hdf5_data[name][...]

    return hdf5_data_dict

train_hdf5_data_dict = read_hdf5_to_dict(train_dataset_file)
test_hdf5_data_dict = read_hdf5_to_dict(test_dataset_file)

path_features_shape = train_hdf5_data_dict['Path'][0].shape
    
train_dataset = pd.DataFrame.from_dict(train_hdf5_data_dict)
test_dataset = pd.DataFrame.from_dict(test_hdf5_data_dict)

train_dataset.tail()

### Inspect the data

Have a quick look at the joint distribution of a few pairs of columns from the data set.

In [None]:
sns.pairplot(train_dataset[["Duration", "Length2d", "Length3d", "MovingTime", "StoppedTime", "TotalUphill", "TotalDownhill"]], diag_kind="kde")

These plots reveal that all features extracted from the GPX track segments show some correlation with the duration and moving time.

Also look at the overall statistics:

In [None]:
train_dataset_stats = train_dataset.describe()
train_dataset_stats.pop("MovingTime")
train_dataset_stats.pop("StoppedTime")
train_dataset_stats.pop("Duration")
train_dataset_stats = train_dataset_stats.transpose()
train_dataset_stats

### Split features from labels

Separate the target value, or "label", from the features. This label is the value that we will train the model to predict.

In [None]:
label_columns = ['MovingTime']
drop_columns = ["StoppedTime", "Duration"]
train_dataset.drop(columns=drop_columns, inplace=True)
test_dataset.drop(columns=drop_columns, inplace=True)

train_labels_data = []
test_labels_data = []

for col_name in label_columns:
    train_labels_data.append(train_dataset.pop(col_name))
    test_labels_data.append(test_dataset.pop(col_name))

train_labels = pd.concat(train_labels_data, axis=1)
train_labels.sort_index(inplace=True)
test_labels = pd.concat(test_labels_data, axis=1)
test_labels.sort_index(inplace=True)

### Normalize the data

The statistics in `train_dataset_stats` computed above show that the value ranges differ widely between features. Although a model *might* converge without feature normalization, normalizing usually improves convergence.

Note: Although we intentionally generate these statistics from only the training dataset, these statistics will also be used to normalize the test dataset. We need to do that to project the test dataset into the same distribution as the one the model has been trained on.

In [None]:
def norm(x):
    return (x - train_dataset_stats['mean']) / train_dataset_stats['std']

train_paths = train_dataset.pop('Path')
normed_train_data = norm(train_dataset)

test_paths = test_dataset.pop('Path')
normed_test_data = norm(test_dataset)

train_paths_as_array = gpx_stats.convert_paths_to_array(train_paths)
test_paths_as_array = gpx_stats.convert_paths_to_array(test_paths)

# TODO: Properly fix NaN issue in elevation!!!
# Quick and dirty fix: Set NaN values to 0
train_paths_as_array = np.nan_to_num(train_paths_as_array, copy=False)
test_paths_as_array = np.nan_to_num(test_paths_as_array, copy=False)

with open('train_dataset_stats.csv', 'w') as csvfile:
    train_dataset_stats.to_csv(csvfile, sep=' ')

`normed_train_data` is what we will use to train the model.

Caution: The statistics used to normalize the inputs here (mean and standard deviation) need to be applied to any other data that is fed to the model. This includes the test dataset as well as input during inference. Thus we need to save the normalization numbers together with the model.
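
As a minimal sketch of how the saved statistics could be reused at inference time, the example below writes a statistics table in the same space-separated layout as `train_dataset_stats.csv`, reads it back, and applies it to new data. The feature values and the helper name `norm_with` are hypothetical, and an in-memory buffer stands in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical feature values standing in for the real GPX statistics
train = pd.DataFrame({"Length2d": [1000.0, 2000.0, 3000.0],
                      "TotalUphill": [50.0, 150.0, 100.0]})

# Same layout as train_dataset_stats in the notebook: one row per feature
stats = train.describe().transpose()

# Round-trip through a space-separated CSV (in-memory stand-in for the file)
buffer = io.StringIO()
stats.to_csv(buffer, sep=" ")
buffer.seek(0)
loaded = pd.read_csv(buffer, sep=" ", index_col=0)


def norm_with(stats_frame, x):
    """Z-score x column-wise using the saved training mean and std."""
    return (x - stats_frame["mean"]) / stats_frame["std"]


# New data must be normalized with the training statistics, not its own
new_hike = pd.DataFrame({"Length2d": [3000.0], "TotalUphill": [50.0]})
normed_new_hike = norm_with(loaded, new_hike)
print(normed_new_hike)
```

The key design point is that `norm_with` never recomputes the mean or standard deviation from the new data; it only applies the values learned from the training set.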

## The model

### Build the model

Let's build our model. In this notebook, the architectures from the other two notebooks are combined: the statistical features are processed with fully connected layers, and the path features with 1D convolutional layers followed by LSTM layers. The two branch outputs are concatenated and processed with three more dense layers, the last of which outputs one value per label column (here only the moving time). The model-building steps are wrapped in a function, `build_model`, for convenience.

In [None]:
learning_rate = 0.001

def build_model():
    stat_inputs = layers.Input(shape=(normed_train_data.shape[1],), name='StatInput')
    stat_dense1 = layers.Dense(16, activation=tf.nn.relu)(stat_inputs)
    stat_dense2 = layers.Dense(16, activation=tf.nn.relu)(stat_dense1)
    stat_dense3 = layers.Dense(16, activation=tf.nn.relu)(stat_dense2)
    stat_dense4 = layers.Dense(16, activation=tf.nn.relu)(stat_dense3)
    
    path_inputs = layers.Input(shape=path_features_shape, name='PathInputs')
    conv1 = layers.Conv1D(filters=6, kernel_size=3, padding='same', activation=tf.nn.relu)(path_inputs)
    conv2 = layers.Conv1D(filters=12, kernel_size=3, padding='same', activation=tf.nn.relu)(conv1)
    bn1 = layers.BatchNormalization()(conv2)
    conv3 = layers.Conv1D(filters=24, kernel_size=3, padding='same', activation=tf.nn.relu)(bn1)
    conv4 = layers.Conv1D(filters=6, kernel_size=3, padding='same', activation=tf.nn.relu)(conv3)
    bn2 = layers.BatchNormalization()(conv4)
    
    lstm1 = layers.LSTM(16, return_sequences=True)(bn2)
    dropout = layers.Dropout(0.25)(lstm1)
    lstm2 = layers.LSTM(16, return_sequences=False)(dropout)
    dropout2 = layers.Dropout(0.25)(lstm2)
    lstm_dense1 = layers.Dense(16, activation=tf.nn.relu)(dropout2)
    
    concat = layers.Concatenate()([stat_dense4, lstm_dense1])
    dense1 = layers.Dense(32, activation=tf.nn.relu)(concat)
    dense2 = layers.Dense(32, activation=tf.nn.relu)(dense1)
    
    outputs = layers.Dense(len(train_labels.keys()), activation=None)(dense2)
    
    model = models.Model(inputs=[stat_inputs, path_inputs], outputs=outputs)

    optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)

    model.compile(loss='mean_squared_error', 
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model

model = build_model()

model.summary()
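
The parameter counts reported by `model.summary()` can be cross-checked by hand. The sketch below uses the standard parameter-count formulas for Keras `Dense`, `Conv1D` and `LSTM` layers with their default bias terms. The helper names are hypothetical and not part of this notebook, and the first convolution is left out because its input channel count depends on `path_features_shape`, which is not stated here.

```python
def dense_params(n_inputs, n_units):
    """One weight per input and unit, plus one bias per unit."""
    return (n_inputs + 1) * n_units


def conv1d_params(kernel_size, in_channels, filters):
    """One kernel of size kernel_size x in_channels per filter, plus a bias."""
    return (kernel_size * in_channels + 1) * filters


def lstm_params(input_dim, units):
    """Four weight blocks (three gates and the cell candidate), each with
    input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)


# Layers whose input size is fixed by the architecture above
print("conv2: ", conv1d_params(3, 6, 12))
print("conv3: ", conv1d_params(3, 12, 24))
print("conv4: ", conv1d_params(3, 24, 6))
print("lstm1: ", lstm_params(6, 16))
print("lstm2: ", lstm_params(16, 16))
print("dense1:", dense_params(16 + 16, 32))  # input is the concatenation of both branches
```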


Now try out the model. Take a batch of `4` examples from the training data and call `model.predict` on it.

In [None]:
example_batch = [normed_train_data[:4], train_paths_as_array[:4]]
example_result = model.predict(example_batch)
example_result

### Train the model

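Train the model for up to 500 epochs with the training set, and record the training and validation loss and error metrics in the `history` object. The learning rate is halved when the validation loss plateaus, and training stops early if the validation loss has not improved for 40 epochs.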

In [None]:
EPOCHS = 500

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                                 patience=20, min_lr=0.01*learning_rate)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=40, verbose=1, 
                                                  restore_best_weights=True)

callbacks = [reduce_lr, early_stopping]


history = model.fit([normed_train_data, train_paths_as_array], 
                    train_labels.values,
                    epochs=EPOCHS,
                    validation_data=([normed_test_data, test_paths_as_array], test_labels.values),
                    verbose=1,
                    callbacks=callbacks)
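
To get a feel for the learning-rate schedule configured above, the following sketch replays the update rule of `ReduceLROnPlateau` in plain Python: each reduction multiplies the current learning rate by `factor`, but never lets it drop below `min_lr`. The function name `lr_after_reductions` is hypothetical; the numeric settings are the ones used in this notebook.

```python
learning_rate = 0.001
factor = 0.5
min_lr = 0.01 * learning_rate


def lr_after_reductions(initial_lr, factor, min_lr, n_reductions):
    """Learning rate after n plateau-triggered reductions, clipped at min_lr."""
    lr = initial_lr
    for _ in range(n_reductions):
        lr = max(lr * factor, min_lr)
    return lr


schedule = [lr_after_reductions(learning_rate, factor, min_lr, n) for n in range(9)]
for n, lr in enumerate(schedule):
    print(f"after {n} reductions: {lr:.3e}")

# Number of reductions needed until the floor min_lr is reached
reductions_to_floor = next(n for n, lr in enumerate(schedule) if lr <= min_lr)
print("floor reached after", reductions_to_floor, "reductions")
```

With these settings the floor is reached on the seventh reduction, so later plateaus no longer change the learning rate.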

Visualize the model's training progress using the stats stored in the `history` object.

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

In [None]:
utils.plot_history(history)

### Make predictions

Finally, predict walking times for the test set. As a reference, first compute the standard walking-time estimate from length, ascent and descent:

In [None]:
compute_standard_walking_time_vectorized = np.vectorize(utils.compute_standard_walking_time)

standard_estimate_walking_time = compute_standard_walking_time_vectorized(test_dataset['Length2d'].values,
                                                                          test_dataset['TotalUphill'].values,
                                                                          test_dataset['TotalDownhill'].values)
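
The implementation of `utils.compute_standard_walking_time` is not shown in this notebook. As an illustration only, the sketch below implements one widely used rule of thumb for hiking times (commonly attributed to DIN 33466): 4 km/h horizontally, 300 m/h uphill and 500 m/h downhill, where the smaller of the horizontal and vertical times is halved and added to the larger one. The function name, the input units (metres) and the output unit (minutes) are assumptions, and the actual helper in `utils` may differ.

```python
def rule_of_thumb_walking_time_minutes(length_m, uphill_m, downhill_m):
    """Rule-of-thumb hiking time in minutes (assumed formula, not utils' code)."""
    horizontal_h = (length_m / 1000.0) / 4.0           # 4 km per hour
    vertical_h = uphill_m / 300.0 + downhill_m / 500.0  # 300 m up, 500 m down per hour
    # The larger component counts fully, the smaller one only by half
    total_h = max(horizontal_h, vertical_h) + 0.5 * min(horizontal_h, vertical_h)
    return 60.0 * total_h


# Hypothetical hike: 8 km long, 600 m of ascent, 500 m of descent
example_minutes = rule_of_thumb_walking_time_minutes(8000.0, 600.0, 500.0)
print(example_minutes)
```

For the hypothetical hike, the horizontal part takes 2 h and the vertical part 3 h, giving 3 h + 1 h = 4 h in total.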

In [None]:
test_predictions = model.predict([normed_test_data, test_paths_as_array])

In [None]:
for i, label_col_name in enumerate(label_columns):
    utils.scatter_plot(test_labels[label_col_name],
                       test_predictions[:, i],
                       label_col_name,
                       max_val=150)

The closer the points lie to the diagonal, the better the prediction. It looks like our model predicts moving times reasonably well. Let's take a look at the error distribution.

In [None]:
for i, label_col_name in enumerate(label_columns):
    utils.plot_error_hist(test_labels[label_col_name], test_predictions[:, i], label_col_name)

It's not quite Gaussian, but that is to be expected because the number of samples is very small.

In [None]:
# This notebook predicts a single label, so select column 0 and 'MovingTime' explicitly
plt.hist(test_predictions[:, 0], alpha=0.5, density=False, label="Pred", bins=50)
plt.hist(test_labels['MovingTime'], alpha=0.5, density=False, label="True", bins=50)
plt.xlabel("Moving time [s]")
plt.ylabel("Count")
plt.legend()
plt.show()

In [None]:
model.evaluate([normed_test_data, test_paths_as_array], test_labels)
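
One way to put the reported error into context is to compare the model's mean absolute error with that of the rule-based estimate stored in `standard_estimate_walking_time`. The sketch below shows the computation on hypothetical numbers and assumes that all three arrays use the same time unit; the helper name `mean_absolute_error` is defined here and is not part of the notebook.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two arrays of equal length."""
    # Flatten first: model.predict returns shape (n, 1), and subtracting
    # an (n,) array from it would silently broadcast to an (n, n) matrix
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean(np.abs(y_true - y_pred)))


# Hypothetical values standing in for test_labels, test_predictions
# and standard_estimate_walking_time
true_times = np.array([60.0, 90.0, 120.0])
model_times = np.array([[62.0], [85.0], [121.0]])
baseline_times = np.array([70.0, 100.0, 100.0])

model_mae = mean_absolute_error(true_times, model_times)
baseline_mae = mean_absolute_error(true_times, baseline_times)
print("model MAE:   ", model_mae)
print("baseline MAE:", baseline_mae)
```

In the notebook, the same function could be called with `test_labels.values`, `test_predictions` and `standard_estimate_walking_time`, provided their units match.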

Export model for use in inference:

In [None]:
model.save('model_hikingTimePrediction_mixed.h5')

## Conclusion

This notebook presents the training of a model that predicts the moving time of hikes (the same approach can also be applied to many other outdoor activities). The model, which takes both statistical features and path features as inputs, performs a bit better than the other two architectures.