# Predict hiking times based on statistics from GPX files

We attempt to predict the walking times and durations of hikes based on statistics extracted from GPX files. Using personal GPX records as input, the outputs are personalized as well.

The code is partially based on the `basic_regression.ipynb` notebook from the [TensorFlow tutorials](https://github.com/tensorflow/docs/blob/r1.13/site/en/tutorials/keras/basic_regression.ipynb).

We use the `tf.keras` API; see [this guide](https://www.tensorflow.org/guide/keras) for details.

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import pathlib

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
import numpy as np
import h5py

import gpx_stats
import utils

print(tf.__version__)

### Load data

Load the dataset that was prepared by running the following command in a shell:

`run prepareData.py '~/GPX-Tracks' 'Wandern'`

In [None]:
train_dataset_file = 'hiking_data_training.hdf5'
test_dataset_file = 'hiking_data_test.hdf5'

Convert data to a pandas dataframe:

In [None]:
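def read_hdf5_to_dict(file_name):
    """Read all columns named in the GpxSegmentStats header into a dict."""
    hdf5_data_dict = {}
    # The context manager closes the file even if reading raises an error.
    with h5py.File(file_name, 'r') as hdf5_data:
        for name in gpx_stats.GpxSegmentStats.get_header():
            if name == 'Path':
                # Wrap each stored path array in a PathFeature object.
                hdf5_data_dict[name] = [gpx_stats.PathFeature(data) for data in hdf5_data[name]]
            else:
                # [...] copies the whole dataset into memory as a NumPy array.
                hdf5_data_dict[name] = hdf5_data[name][...]

    return hdf5_data_dict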

train_hdf5_data_dict = read_hdf5_to_dict(train_dataset_file)
test_hdf5_data_dict = read_hdf5_to_dict(test_dataset_file)

train_path_features_shape = train_hdf5_data_dict['Path'][0].shape
test_path_features_shape = test_hdf5_data_dict['Path'][0].shape
    
train_dataset = pd.DataFrame.from_dict(train_hdf5_data_dict)
train_dataset.pop('Path')
test_dataset = pd.DataFrame.from_dict(test_hdf5_data_dict)
test_dataset.pop('Path')


train_dataset.tail()

### Inspect the data

Have a quick look at the joint distribution of a few pairs of columns from the data set.

In [None]:
sns.pairplot(train_dataset[["Duration", "Length2d", "Length3d", "MovingTime", "StoppedTime", "TotalUphill", "TotalDownhill"]], diag_kind="kde")

All features show some correlation with the moving time and duration.
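
The visual impression from the pair plot can be backed by a number. The following is a minimal sketch, assuming two made-up columns in arbitrary units (the arrays `length_2d` and `moving_time` and the helper `pearson` are hypothetical and not part of the notebook), that computes the Pearson correlation coefficient with NumPy:

```python
import numpy as np

# Made-up illustration values in arbitrary units, not taken from the real data set.
length_2d = np.array([500.0, 1200.0, 2000.0, 3100.0, 4500.0])
moving_time = np.array([420.0, 950.0, 1700.0, 2400.0, 3900.0])


def pearson(x, y):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the correlation between x and y.
    return float(np.corrcoef(x, y)[0, 1])


r = pearson(length_2d, moving_time)
print(f"Pearson r = {r:.3f}")
```

In the notebook itself, `train_dataset.corr()` should give the full correlation matrix for all columns at once.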

Also look at the overall feature statistics:

In [None]:
train_dataset_stats = train_dataset.describe()
train_dataset_stats.pop("MovingTime")
train_dataset_stats.pop("StoppedTime")
train_dataset_stats.pop("Duration")
train_dataset_stats = train_dataset_stats.transpose()
train_dataset_stats

### Split features from labels

Separate the target value, or "label", from the features. This label is the value that we will train the model to predict.

In [None]:
label_columns = ['MovingTime', 'StoppedTime', 'Duration']
train_labels_data = []
test_labels_data = []

for col_name in label_columns:
    train_labels_data.append(train_dataset.pop(col_name))
    test_labels_data.append(test_dataset.pop(col_name))

train_labels = pd.concat(train_labels_data, axis=1)
train_labels.sort_index(inplace=True)
test_labels = pd.concat(test_labels_data, axis=1)
test_labels.sort_index(inplace=True)

In [None]:
print(train_dataset.tail())
print(train_labels.tail())

### Normalize the data

The statistics in `train_dataset_stats` in the above block show that the ranges of the features vary widely. Although a model *might* converge without feature normalization, normalizing usually improves convergence.

Note: Although we intentionally generate these statistics from only the training dataset, these statistics will also be used to normalize the test dataset. We need to do that to project the test dataset into the same distribution as the one the model has been trained on.

In [None]:
def norm(x):
    return (x - train_dataset_stats['mean']) / train_dataset_stats['std']

normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)

with open('train_stats_simple.csv', 'w') as csvfile:
    train_dataset_stats.to_csv(csvfile, sep=' ')

`normed_train_data` is what we will use to train the model.

Caution: The statistics used to normalize the inputs here (mean and standard deviation) need to be applied to any other data that is fed to the model. This includes the test dataset as well as input during inference. Thus we need to save the normalization numbers together with the model.
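
To make the caution concrete, here is a minimal sketch with NumPy, assuming a small made-up feature matrix (the arrays and the helper `normalize` are hypothetical). It shows that the mean and standard deviation are computed once from the training data and then reused unchanged for any later input:

```python
import numpy as np

# Made-up training features: one row per track segment, two feature columns.
train_features = np.array([
    [1000.0, 50.0],
    [2500.0, 300.0],
    [4000.0, 120.0],
    [1500.0, 10.0],
])

# Statistics come from the training data only.
# ddof=1 gives the sample standard deviation, as pandas' describe() does.
train_mean = train_features.mean(axis=0)
train_std = train_features.std(axis=0, ddof=1)


def normalize(features, mean=train_mean, std=train_std):
    """Apply the stored training statistics to any feature matrix."""
    return (np.asarray(features, dtype=float) - mean) / std


normed_train = normalize(train_features)

# A new input at inference time is normalized with the same stored statistics,
# never with statistics computed from the new input itself.
new_segment = np.array([[3000.0, 200.0]])
normed_new = normalize(new_segment)
print(normed_new)
```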

## The model

### Build the model

Let's build our model. In this notebook, we use a very simple multi-layer perceptron with a few fully connected layers. The last layer has three units and no activation function, which suits a regression problem: one unit each for moving time, stopped time and duration. The model-building steps are wrapped in a function, `build_model`, for convenience.

In [None]:
learning_rate = 0.01
batch_size = 256

def build_model():
    statistics_inputs = layers.Input(shape=(train_dataset.shape[1],), name='StatisticsInput')
    dense1 = layers.Dense(32, activation=tf.nn.relu)(statistics_inputs)
    dense2 = layers.Dense(32, activation=tf.nn.relu)(dense1)
    dense3 = layers.Dense(32, activation=tf.nn.relu)(dense2)
    dense4 = layers.Dense(32, activation=tf.nn.relu)(dense3)
    outputs = layers.Dense(len(train_labels.keys()), activation=None)(dense4)
    
    model = models.Model(inputs=statistics_inputs, outputs=outputs)

    optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)

    model.compile(loss='mean_squared_error', 
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model

model = build_model()

How did we specify the architecture of the model?

The architecture is chosen such that the model can reproduce estimates for walking times from a standard formula, which is implemented in the function `compute_standard_walking_time` in `utils.py`.
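
The contents of `utils.py` are not shown here, so the following is only a sketch of one widely cited rule of thumb (often attributed to DIN 33466), not necessarily what `compute_standard_walking_time` does. The assumed rates are 4 km of distance, 300 m of ascent or 500 m of descent per hour, with the smaller of the horizontal and vertical times halved and added to the larger. The function names are hypothetical.

```python
def din_walking_time_hours(length_m, uphill_m, downhill_m):
    """Estimate walking time in hours with an assumed DIN-style rule of thumb."""
    horizontal_h = length_m / 4000.0                    # assumed 4 km per hour
    vertical_h = uphill_m / 300.0 + downhill_m / 500.0  # assumed 300 m up, 500 m down per hour
    # The larger part counts fully, the smaller part counts half.
    return max(horizontal_h, vertical_h) + 0.5 * min(horizontal_h, vertical_h)


def relu(x):
    """Rectified linear unit for a scalar."""
    return max(0.0, x)


def max_via_relu(a, b):
    """Express max(a, b) using only addition and a ReLU."""
    return a + relu(b - a)


print(din_walking_time_hours(8000.0, 600.0, 0.0))
```

One plausible reading of the architecture remark is that a formula of this shape needs only linear terms plus `max` and `min`, and `max(a, b) = a + relu(b - a)`, so a small network of ReLU layers can in principle represent it exactly.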

### Inspect the model

Use the `.summary` method to print a simple description of the model.

In [None]:
model.summary()


Now try out the model. Take a batch of `4` examples from the training data and call `model.predict` on it.

In [None]:
example_batch = normed_train_data[:4]
example_result = model.predict(example_batch)
example_result

### Train the model

Train the model for up to 500 epochs on the training set, and record the training and validation loss and error metrics in the `history` object.

In [None]:
EPOCHS = 500

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                                 patience=20, min_lr=0.01*learning_rate)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=40, verbose=1, 
                                                  restore_best_weights=True)

callbacks = [reduce_lr, early_stopping]


history = model.fit(normed_train_data.values, train_labels.values,
                    epochs=EPOCHS, validation_split=0.2, verbose=1,
                    batch_size=batch_size,
                    callbacks=callbacks)
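
The two callbacks in the cell above are why training runs for "up to" 500 epochs. The following pure-Python sketch is a deliberately simplified simulation of their interplay, not the Keras implementation: it ignores `min_delta` and `cooldown`, and the function `simulate_callbacks` and the loss sequence are made up for illustration.

```python
def simulate_callbacks(val_losses, lr=0.01, factor=0.5, lr_patience=20,
                       min_lr=1e-4, stop_patience=40):
    """Return (stopped_epoch, best_epoch, final_lr) for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    lr_wait = 0    # epochs without improvement since the last LR reduction
    stop_wait = 0  # epochs without improvement since the best epoch
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            lr_wait = 0
            stop_wait = 0
            continue
        # Learning-rate reduction: shrink the rate after a plateau, but not below min_lr.
        lr_wait += 1
        if lr_wait >= lr_patience:
            lr = max(lr * factor, min_lr)
            lr_wait = 0
        # Early stopping: give up after a longer plateau.
        stop_wait += 1
        if stop_wait >= stop_patience:
            return epoch, best_epoch, lr
    return len(val_losses) - 1, best_epoch, lr


# Made-up loss curve: ten epochs of improvement, then a long plateau.
losses = [1.0 - 0.01 * i for i in range(10)] + [0.95] * 100
stopped_epoch, best_epoch, final_lr = simulate_callbacks(losses)
print(stopped_epoch, best_epoch, final_lr)
```

With `restore_best_weights=True`, as used in the notebook, the weights from the best epoch rather than the last one are kept once training stops early.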

Visualize the model's training progress using the stats stored in the `history` object.

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

In [None]:
utils.plot_history(history)

As our data set of real GPX tracks is very small, the optimal choice of the epoch for ending training depends on the distribution of tracks between training, validation and testing data. The above choice led to quite stable results.


### Make predictions

Finally, predict walking time values using data in the testing set:

In [None]:
compute_standard_walking_time_vectorized = np.vectorize(utils.compute_standard_walking_time)

standard_estimate_walking_time = compute_standard_walking_time_vectorized(test_dataset['Length2d'].values,
                                                                          test_dataset['TotalUphill'].values,
                                                                          test_dataset['TotalDownhill'].values)

In [None]:
# Rows: MovingTime, StoppedTime, Duration; columns: test samples.
moving_stopped_duration_data = test_labels[['MovingTime', 'StoppedTime', 'Duration']].values.T


test_predictions = np.transpose(model.predict(normed_test_data))

In [None]:
label_texts = ['Moving Time', 'Stopped Time', 'Duration']
    
for i in range(3):
    utils.scatter_plot(moving_stopped_duration_data[i], test_predictions[i], label_texts[i])

The closer the points lie to the diagonal, the better the prediction, and by that measure our model predicts reasonably well. Let's take a look at the error distribution.

In [None]:
for i in range(3):
    utils.plot_error_hist(moving_stopped_duration_data[i], test_predictions[i], label_texts[i])

It's not quite Gaussian, but that is to be expected because the number of samples is very small.

Export model for use in inference:

In [None]:
model.save('model_hikingTimePrediction_simple.h5')

## Comparison of predictions with standard formula for predicting hiking durations

In the following, the predictions are compared with the estimates for hiking durations from a standard formula.

In [None]:
utils.scatter_plot(moving_stopped_duration_data[2], standard_estimate_walking_time, 'Standard estimate for duration')

In [None]:
utils.plot_error_hist(moving_stopped_duration_data[2], 
                      standard_estimate_walking_time, 
                      'Standard estimate for duration')
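
Besides the plots, the two estimators can be compared with a single number. This is a minimal sketch, assuming made-up values in arbitrary time units (the arrays and the helper `mean_absolute_error` are hypothetical), that computes the mean absolute error of each estimate against the recorded durations:

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    """Return the mean absolute difference between two 1-D sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))


# Made-up illustration values, not results from the notebook.
actual_duration = np.array([3600.0, 5400.0, 7200.0])
model_estimate = np.array([3500.0, 5600.0, 7100.0])
formula_estimate = np.array([3000.0, 6000.0, 8000.0])

mae_model = mean_absolute_error(actual_duration, model_estimate)
mae_formula = mean_absolute_error(actual_duration, formula_estimate)
print(f"model MAE: {mae_model:.1f}, formula MAE: {mae_formula:.1f}")
```

In the notebook, the same helper could be called with `moving_stopped_duration_data[2]` as the recorded durations and `test_predictions[2]` or `standard_estimate_walking_time` as the estimates.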

## Conclusion

This notebook presents the training of a model for predicting moving and stopped times as well as the total duration of hikes (it can also be applied to many other outdoor activities). We treated this as a regression problem and showed that the model predictions are significantly better than those of a standard formula for estimating hiking times. The other two notebooks present somewhat more sophisticated models that lead to better results.

In [None]:
model.evaluate(normed_test_data, test_labels, return_dict=True)