# DM - Mobility Task

Team members:
 * Frick Bernhard (a01505541@unet.univie.ac.at)
 * Postlmayr Billie Rosalie (a01307120@unet.univie.ac.at)

Former team members that opted out of the course:
 * Decsi István (a11834026@unet.univie.ac.at)
 * Krivanek Yvonne-Nadine (a01404589@unet.univie.ac.at)

Tokens:
 * Frick: dm19_byrzma (id: 35)
 * Postlmayr: dm19_postlmayr (id: 32)

In [None]:
%matplotlib inline

In [None]:
import os
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import mplleaflet
import re
import dateutil.parser
from matplotlib.dates import date2num
from IPython.display import display, HTML
from datetime import datetime

In [None]:
data_dir = "data"

In [None]:
all_files = os.listdir(data_dir)

## Removing non-whitelisted trips

To account for wrongly annotated trips, we used a whitelist where we listed all trips that are correctly annotated:

https://docs.google.com/spreadsheets/d/11Ta24L86uB1UiKwSa5of5MKfUIaoXxq53kGWQ8U-AyI/edit#gid=0

In [None]:
whitelist_ids = {
    5, 46, 74, 127, 128, 129, 131, 165, 99, 105,
    107, 108, 109, 152, 202, 203, 13, 28, 43, 52,
    145, 146, 22, 24, 26, 42, 102, 103, 147, 214,
    215, 19, 134, 137, 139, 142, 220, 222, 223, 224,
    26, 248, 246, 245, 241, 84, 93, 235, 236, 33,
    36, 37, 40, 49, 120, 113, 114, 115, 116, 117,
    218, 219, 226, 227, 240, 21, 32, 38, 70, 95,
    199, 78, 82, 83, 160, 161, 208, 209, 55, 9,
    35, 88, 210, 211, 217, 72, 47, 201, 204, 91,
    110, 228, 233, 234, 237
}

In [None]:
print("Number of whitelisted trips:", len(whitelist_ids))

In [None]:
def filter_trips(trip):
    # skip files that are not a trip
    if re.search(r"^\d+_\d+_\d{4}-\d{2}-\d{2}T\d{6}\.\d{1,3}$", trip) is None:
        return False

    # extract the trip id
    trip_id = int(trip.split('_')[1])
    
    # keep trips that are whitelisted
    if trip_id in whitelist_ids:
        return True
    
    # otherwise: skip the trip
    return False

In [None]:
whitelisted_trips = list(filter(filter_trips, all_files))

In [None]:
# dict: trip-id -> folder name
all_trips = dict(map(lambda trip: (int(trip.split('_')[1]), trip), whitelisted_trips))
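
As a small illustration of the folder-name filter above, the following sketch applies the same pattern to a few folder names. The names in `examples` and the helper `parse_trip_folder` are made up for this example; we assume the first number is the user id and the second the trip id, as in the preprocessing script.

```python
import re

# Same pattern as in filter_trips: <user id>_<trip id>_<date>T<HHMMSS>.<fraction>
TRIP_PATTERN = re.compile(r"^\d+_\d+_\d{4}-\d{2}-\d{2}T\d{6}\.\d{1,3}$")

def parse_trip_folder(name):
    """Return (user_id, trip_id) for a trip folder name, or None if it is not a trip."""
    if TRIP_PATTERN.search(name) is None:
        return None
    user_id, trip_id = name.split("_")[:2]
    return int(user_id), int(trip_id)

# Hypothetical folder names, invented for this example
examples = ["35_113_2019-05-02T081530.12", "README.md", "35_113_notes"]
parsed = [parse_trip_folder(name) for name in examples]
print(parsed)
```

Only the first name has the full `<user>_<trip>_<timestamp>` shape, so the other two are rejected before any id is extracted.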

## 1. Our Trips

For a given list of trips, the following script outputs:
 * The trip ID
 * The duration of the trip
 * A table with all markers
 * A plot with the acceleration data and markers
 * A plot of the location data of the trip

To plot the location data, we used https://github.com/jwass/mplleaflet, a wrapper for pyplot and openstreetmap.org.

In [None]:
def get_trip_folders(ids):
    return dict(filter(lambda elem: elem[0] in ids, all_trips.items()))

In [None]:
def trip_overview(trips):
    print("Number of trips: ", len(trips))
    
    trips = get_trip_folders(trips)

    for tid, trip in trips.items():

        display(HTML(f"<h1>Trip id {tid}</h1>"))

        # read markers file
        marker_col_names = ["time", "key", "value", "mode", "longitude", "latitude", "col7", "col8", "col9"]
        markers = pd.read_csv(os.path.join(data_dir, trip, "markers.csv"), sep=';', names=marker_col_names, skiprows=1)

        # total duration
        start = dateutil.parser.parse(markers.loc[markers.index[0], 'time'])
        stop = dateutil.parser.parse(markers.loc[markers.index[-1], 'time'])
        print("Duration:", stop-start)

        # print markers table
        mode_changes = markers[markers["key"] == "CGT_MODE_CHANGED"]
        display(HTML(mode_changes.to_html()))

        # prepare acceleration data for plot
        acceleration = pd.read_csv(os.path.join(data_dir, trip, "acceleration.csv"))
        acceleration['time'] = date2num(pd.to_datetime(acceleration['time']))
        acceleration['acc_norm'] = np.linalg.norm(acceleration[['x', 'y', 'z']].values, axis=1)
        acceleration = acceleration.drop(['x', 'y', 'z'], axis=1)

        # plot acceleration data
        plt.figure(figsize=(20,10))
        plt.grid(True)
        plt.plot_date(acceleration['time'], acceleration['acc_norm'], linewidth=1, color='black', linestyle='solid', marker='None')

        max_acc = np.max(acceleration['acc_norm'])

        # markers for acceleration data
        for index, row in mode_changes.iterrows():
            x = date2num(pd.to_datetime(row["time"]))
            # vertical line
            plt.axvline(x=x)
            # mode text
            plt.text(x=x, y=max_acc*.95, s=row["mode"], rotation=90, fontsize=16)
        plt.show()

        # plot map with location data
        positions = pd.read_csv(os.path.join(data_dir, trip, "positions.csv"))
        positions = positions[positions["location_source"] == 1]
        positions = positions.filter(items=['longitude', 'latitude'])
        plt.figure(figsize=(20,8))
        plt.plot(positions['longitude'], positions['latitude'], 'r.', markersize=4)
        display(mplleaflet.display())

### Bernhard Frick

In [None]:
frick_trip_ids = {113, 114, 115, 116, 117, 218, 219, 226, 227, 240}

In [None]:
trip_overview(frick_trip_ids)

### Billie Rosalie Postlmayr

In [None]:
postlmayr_trip_ids = {84, 93, 235, 236}

In [None]:
trip_overview(postlmayr_trip_ids)

## 2. Data Preprocessing

The following script loops over all whitelisted trips and performs the following preprocessing steps:

 * Calculating the norm of the x, y and z dimensions of the acceleration data
 * Downsampling the acceleration data
 * Removing bimodal segments
 * Combining all segments into one file (`export.csv`)

`export.csv` is then in turn used by the next script to train the model.
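
The first two steps can be sketched on synthetic data as follows. This is only an illustration: the 50 Hz input rate, the constant axis values and the 10 Hz target rate are assumptions for this example and are not taken from our trips.

```python
import numpy as np
import pandas as pd

# Synthetic 3-axis acceleration sampled every 20 ms (50 Hz) for 2 seconds; values are made up
time = pd.date_range("2019-05-02 08:15:30", periods=100, freq="20ms")
acc = pd.DataFrame({"x": 0.0, "y": 3.0, "z": 4.0}, index=time)

# Euclidean norm of the three axes: sqrt(x^2 + y^2 + z^2)
acc["acc_norm"] = np.linalg.norm(acc[["x", "y", "z"]].to_numpy(), axis=1)

# Downsample to 10 Hz by averaging all samples that fall into each 100 ms bin
downsampled = acc["acc_norm"].resample("100ms").mean()

print(len(acc), "->", len(downsampled))
```

Averaging within each bin keeps one value per 100 ms, which reduces the number of rows and smooths out sensor noise.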

---
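Split each trip into segments of 10 seconds. After the preprocessing steps, each segment should be a 1-dimensional time series of 10 * 10 = 100 observations. Skip segments that are:

 * shorter than 10 seconds (typically the last segment of a trip), or
 * bimodal (segments covering two transport modes, at the change of the transport mode).

Open question: it looks like our data is resampled to milliseconds. In the script below, the timestamps are truncated to whole epoch seconds but parsed with `unit='ms'`, so each 1 ms resampling bin actually covers one second of data.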

---
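
The two skip rules can be sketched in a vectorized way as shown here. This is an alternative illustration on synthetic data, not the loop used in our script; the segment length of 10 rows and the mode labels are assumptions for this example.

```python
import pandas as pd

SEGMENT_LEN = 10  # rows per segment; an assumption for this example

# Synthetic trip: 25 rows, mode changes from WALK to BUS at row 14 (labels are made up)
df = pd.DataFrame({
    "acc_norm": range(25),
    "mode": ["WALK"] * 14 + ["BUS"] * 11,
})

# Assign every row to a fixed-length segment
df["segment"] = df.index // SEGMENT_LEN

grouped = df.groupby("segment")["mode"]
is_full = grouped.transform("size") == SEGMENT_LEN   # False for the short last segment
is_unimodal = grouped.transform("nunique") == 1      # False for segments with a mode change

clean = df[is_full & is_unimodal]
print(clean["segment"].unique().tolist())
```

In this toy trip, segment 1 contains the mode change and segment 2 has only 5 rows, so only segment 0 remains.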

In [None]:
def replace_date(s):
    return s.group(0).replace('Z', '') + '.000Z'

In [None]:
def format_time(m):
    m = [re.sub(r':\d\d+Z', replace_date, sample) for sample in m]
    # epoch seconds as float; .timestamp() is portable, unlike the platform-specific strftime('%s')
    m = [datetime.strptime(sample, "%Y-%m-%dT%H:%M:%S.%fZ").timestamp() for sample in m]
    return m
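
To illustrate why the regular expression is needed: timestamps without fractional seconds would not match the `%f` part of the format string. The sketch below uses made-up timestamps and a hypothetical helper `add_missing_millis` that mirrors `replace_date`.

```python
import re
from datetime import datetime

def add_missing_millis(timestamp):
    """Append '.000' to timestamps that end in ':SSZ' without a fractional part."""
    return re.sub(r":\d\d+Z", lambda m: m.group(0).replace("Z", "") + ".000Z", timestamp)

# Made-up timestamps: one without and one with fractional seconds
samples = ["2019-05-02T08:15:30Z", "2019-05-02T08:15:30.250Z"]
normalized = [add_missing_millis(s) for s in samples]
print(normalized)

# After normalization both strings can be parsed with the same format
parsed_times = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%fZ") for s in normalized]
delta_seconds = (parsed_times[1] - parsed_times[0]).total_seconds()
print(delta_seconds)
```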

In [None]:
processed = None

processed_count = 0
total_trips = len(whitelisted_trips)

for trip in whitelisted_trips:
    user_id = re.search(r'\d+', trip).group(0)
    trip_id = re.search(r'_\d+_', trip).group(0)
    trip_id = re.search(r'\d+', trip_id).group(0)

    processed_count = processed_count + 1
    print("processing user: ", user_id, ", trip: ", trip_id, " --- (", processed_count, "/", total_trips, ")")
    
    # markers
    path_markers = os.path.join(data_dir, trip, 'markers.csv')
    col_names = ["value", "key", "time", "mode", "col5", "col6", "col7", "station", "col9"]
    markers = pd.read_csv(path_markers, sep=';', names=col_names, skiprows=4)
    markers = markers.drop(['value', 'key', 'col5', 'col6', 'col7', 'col9'], axis=1)
    markers.drop(markers.tail(1).index, inplace=True)

    # acceleration
    path_acc = os.path.join(data_dir, trip, 'acceleration.csv')
    acceleration = pd.read_csv(path_acc, sep=',')
    acceleration['acc_norm'] = np.linalg.norm(acceleration[['x', 'y', 'z']].values, axis=1)
    acceleration = acceleration.drop(['x', 'y', 'z'], axis=1)

    # activity
#    path_activity = os.path.join(data_dir, trip, 'activity_records.csv')
#    activity = pd.read_csv(path_activity, sep=',')

    # time formatting
    acceleration['time'] = format_time(acceleration['time'])
    acceleration['time'] = acceleration['time'].astype(int)
    markers['time'] = format_time(markers['time'])
    markers['time'] = markers['time'].astype(int)

    # downsample (epoch seconds are read with unit='ms', so a 1 ms bin spans one second)
    acceleration = acceleration.set_index(['time'])
    acceleration.index = pd.to_datetime(acceleration.index, unit='ms')
    acceleration = acceleration.resample('1ms').mean()
    markers = markers.set_index(['time'])
    markers.index = pd.to_datetime(markers.index, unit='ms')

    # combine acceleration and markers
    df_trip = acceleration.merge(markers, on="time", how='left')
    df_trip = df_trip.ffill()

    # eliminate bimodal segments
    df_trip = df_trip.reset_index()
    drop_multimodal = []
    for i in range(df_trip.shape[0] // 10):
        # number of distinct modes within this block of 10 rows
        n_modes = df_trip['mode'].iloc[i * 10:i * 10 + 10].nunique()
        if n_modes != 1:
            drop_multimodal.extend(range(i * 10, i * 10 + 10))
    df_trip.drop(drop_multimodal, inplace=True)

    # add ids and add to overall dataframe
    df_trip['user_id'] = user_id
    df_trip['trip_id'] = trip_id
    if processed is None:
        processed = df_trip.copy()
    else:
        # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
        processed = pd.concat([processed, df_trip], ignore_index=True, sort=False)

processed = processed.set_index(['time'])
processed.to_csv('export.csv', header=True)

## 3. Insights

The data set consists of 94 trips by 17 students.

In [None]:
export = pd.read_csv("export.csv")

The preprocessed data consists of the following entry counts per transportation mode:

In [None]:
counts = export.groupby('mode')['time'].count()
print(counts)

In [None]:
percent = counts/export.count()['time']
percent

It looks like the different transportation modes are nearly indistinguishable when only considering the mean, std and quartiles:

In [None]:
for mode in export['mode'].unique():
    print("Mode:", mode)
    mode_norm = export[export['mode'] == mode]['acc_norm']
    desc = mode_norm.describe().drop('count')
    desc['median'] = mode_norm.median()
    print(desc)
    plt.plot(desc, label=mode, marker='o', linestyle='none')
    print()

plt.legend()
plt.show()
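
The same per-mode statistics can also be collected into a single table, which makes the modes easier to compare side by side. This is a minimal sketch on a tiny synthetic stand-in for `export.csv`; the mode labels and values are made up.

```python
import pandas as pd

# Tiny synthetic stand-in for export.csv; mode labels and values are made up
export_demo = pd.DataFrame({
    "mode": ["WALK", "WALK", "WALK", "BUS", "BUS", "BUS"],
    "acc_norm": [9.0, 10.0, 11.0, 9.5, 10.0, 10.5],
})

# One row per mode with mean, std, min, quartiles and max
summary = export_demo.groupby("mode")["acc_norm"].describe().drop(columns="count")
print(summary)
```

Note that the `50%` column of `describe()` is already the median.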

## 4. Single Classifier model (SCM)

Report the following performance measures:
 * accuracy
 * precision (macro and weighted)
 * recall (macro and weighted)
 * F1-scores (macro and weighted)

In [1]:
from sklearn.metrics import precision_recall_fscore_support

In [None]:
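# y_true and y_pred are placeholders for the true labels and the model predictions
precision_recall_fscore_support(y_true, y_pred, labels=None)  # labels=None: per-label precision, recall, F1 and support for all labels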
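
A minimal sketch of how the measures listed above could be computed with scikit-learn. The labels in `y_true_demo` and `y_pred_demo` are made up for illustration and are not results from our models.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up labels for illustration only; not results from our models
y_true_demo = ["WALK", "WALK", "BUS", "BUS", "TRAM", "TRAM"]
y_pred_demo = ["WALK", "BUS", "BUS", "BUS", "TRAM", "WALK"]

accuracy = accuracy_score(y_true_demo, y_pred_demo)
print("accuracy:", accuracy)

# 'macro' averages the per-class scores equally, 'weighted' weights them by class support
scores = {}
for average in ("macro", "weighted"):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true_demo, y_pred_demo, average=average, zero_division=0
    )
    scores[average] = (precision, recall, f1)
    print(average, "precision:", precision, "recall:", recall, "f1:", f1)
```

Macro and weighted scores coincide here because every class has the same number of true samples; on imbalanced data such as ours they would generally differ.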

## 5. Ensemble Walk Classifier model (EWCM)

Report the following performance measures:
 * accuracy
 * precision (macro and weighted)
 * recall (macro and weighted)
 * F1-scores (macro and weighted)

## 6. Sources

 * https://machinelearningmastery.com/cnn-models-for-human-activity-recognition-time-series-classification/
 * https://machinelearningmastery.com/deep-learning-models-for-human-activity-recognition/