# RNN Primer
# Part 1: Basic test

### Advantages of RNNs over traditional ML models on time-series data

This notebook showcases the advantage of using RNN models (LSTMs, GRUs) over Decision Trees for a classification task on time series data.

### Task description
**General**: classify the transport mode of a user given device sensor data.

**Specific**: we will do a **binary classification** between "walk" and "train" modes, which correspond to a person walking or riding a train.

### Input data
For simplicity, we are going to use a single feature: **speed** at each timestep.

Each sample consists of many timesteps in which walk and train modes follow one another, so it forms a time series. We will model the speed data using common-sense assumptions.

In [1]:
# Load the TensorBoard notebook extension
%load_ext tensorboard
import plotly.graph_objects as go

import numpy as np
import pandas as pd
pd.options.plotting.backend = "plotly"

import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from rnnprimer.datagen import generate_sample, Dataset

## Sample without outliers
Our sample has speed generated as follows:
* For a walk segment, the speed is constant and equals 5 (km/h).
* For a train segment, the speed quickly reaches 100 (km/h), stays constant, and then decreases to 0 again. There can be more than one consecutive train segment, which models the stops of a train.

We randomly split the 5 train segments into two groups separated by a walk segment. This is not relevant for traditional models, where we feed one sample at a time.
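The rules above can be sketched with plain NumPy. This is a minimal illustration only, not the implementation of `generate_sample` from `rnnprimer.datagen`: the function names, the segment length, the ramp length, and the 0/1 label encoding are all assumptions made for the example.

```python
import numpy as np

WALK_SPEED = 5.0     # km/h, constant during walk segments
TRAIN_SPEED = 100.0  # km/h, cruising speed of a train


def walk_segment(n_steps):
    """Constant walking speed; label 0 marks 'walk' (assumed encoding)."""
    return np.full(n_steps, WALK_SPEED), np.zeros(n_steps, dtype=int)


def train_segment(n_steps, ramp=5):
    """Speed ramps 0 -> 100, cruises, then ramps back to 0; label 1 marks 'train'."""
    up = np.linspace(0.0, TRAIN_SPEED, ramp)
    cruise = np.full(n_steps - 2 * ramp, TRAIN_SPEED)
    down = np.linspace(TRAIN_SPEED, 0.0, ramp)
    return np.concatenate([up, cruise, down]), np.ones(n_steps, dtype=int)


def toy_sample(n_train_segments=5, seg_len=50, seed=0):
    """Walk, some train segments, walk, the remaining train segments, walk."""
    rng = np.random.default_rng(seed)
    split = rng.integers(1, n_train_segments)  # where the walk interrupts the ride
    parts = [walk_segment(seg_len)]
    parts += [train_segment(seg_len) for _ in range(split)]
    parts.append(walk_segment(seg_len))
    parts += [train_segment(seg_len) for _ in range(n_train_segments - split)]
    parts.append(walk_segment(seg_len))
    speed = np.concatenate([s for s, _ in parts])
    label = np.concatenate([l for _, l in parts])
    return speed, label


speed, label = toy_sample()
print(speed.shape, label.shape)
```

Each timestep carries one speed value and one label, which is the per-timestep layout the rest of the notebook relies on.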

In [4]:
sample = generate_sample()
fig = sample.get_figure()
fig.update_layout(title_text="Sample without outliers")
fig.show()

## Train sample with outliers

When the data contains no outliers, RNNs have no advantage over traditional ML models.
Real sensor data, however, always has outliers of some sort. In our case, we introduce outliers in train segments at a random point in time, with a given probability. The outlier speed is the same as the speed of a walk segment.
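One way to picture the outlier injection is the sketch below. It assumes that each train timestep independently becomes an outlier with probability `outlier_prob`; the actual behaviour of `generate_sample(outlier_prob=...)` may differ, and `add_outliers` is a hypothetical helper, not part of `rnnprimer`.

```python
import numpy as np

WALK_SPEED = 5.0  # km/h


def add_outliers(speed, label, outlier_prob, seed=0):
    """Replace the speed of randomly chosen 'train' timesteps with the walk speed.

    Labels are left untouched, so an outlier is a 'train' timestep that looks like 'walk'.
    Assumes label 1 means 'train' and label 0 means 'walk'.
    """
    rng = np.random.default_rng(seed)
    is_train = label == 1
    hit = is_train & (rng.random(speed.shape) < outlier_prob)
    noisy = speed.copy()
    noisy[hit] = WALK_SPEED
    return noisy


speed = np.array([5.0, 5.0, 100.0, 100.0, 100.0, 100.0, 5.0])
label = np.array([0, 0, 1, 1, 1, 1, 0])
noisy = add_outliers(speed, label, outlier_prob=0.5)
print(noisy)
```

Because only the speed changes and the label stays "train", a model that looks at a single timestep has no way to tell such an outlier from a genuine walk timestep.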

In [3]:
sample = generate_sample(outlier_prob=0.05)
fig = sample.get_figure()
fig.update_layout(title_text="Sample with outliers")
fig.show()

## Traditional ML models
We're going to try two traditional ML models:
* Decision Trees
* Feed-forward neural network

As those models have no memory and cannot accept sequences as input, we flatten our sequential samples and feed the features at each timestep as individual training samples.

We then introduce outliers with increasing probability in our train samples and measure what happens to the accuracy.
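The flattening step described above can be sketched as a simple reshape. The helper name `flatten_sequences` and the array layout `(n_samples, n_timesteps)` are assumptions for illustration; the notebook itself relies on `Dataset.get_flat_X_y()`, whose internals are not shown here.

```python
import numpy as np


def flatten_sequences(speeds, labels):
    """Turn (n_samples, n_timesteps) arrays into per-timestep rows.

    Returns X of shape (n_samples * n_timesteps, 1) and
    y of shape (n_samples * n_timesteps,).
    """
    X = speeds.reshape(-1, 1)  # one feature (speed) per row
    y = labels.reshape(-1)     # one label per row
    return X, y


speeds = np.array([[5.0, 100.0, 100.0], [5.0, 5.0, 100.0]])
labels = np.array([[0, 1, 1], [0, 0, 1]])
X, y = flatten_sequences(speeds, labels)
print(X.shape, y.shape)  # (6, 1) (6,)
```

After flattening, the order of timesteps no longer matters to the model: every row is treated as an independent sample.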

### Decision Trees

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# A random forest (an ensemble of decision trees) serves as the tree-based model.
data_trees = []
for outlier_prob in (0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    # Train a fresh classifier on flattened per-timestep samples
    X, y = Dataset.generate(train_outlier_prob=outlier_prob).get_flat_X_y()
    clf = RandomForestClassifier(n_estimators=50, class_weight="balanced")
    clf.fit(X, y)
    # Evaluate on a newly generated dataset with the same outlier probability
    X_test, y_test = Dataset.generate(train_outlier_prob=outlier_prob, n_samples=20).get_flat_X_y()
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    data_trees.append({'outlier_prob': outlier_prob, 'accuracy': acc})
df_trees = pd.DataFrame(data_trees)

In [6]:
df_trees.plot(kind='line', x='outlier_prob', y='accuracy', range_y=[0, 1])

As expected, since outlier train speeds are not distinguishable from walk speeds, we see a linear decrease in accuracy as the outlier probability increases.
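The linear trend can be reproduced with a back-of-the-envelope calculation. The sketch below assumes a memoryless classifier that always predicts "walk" at speed 5 and "train" otherwise, and that each train timestep is an outlier independently with probability `outlier_prob`. The value `train_fraction=0.6` is a made-up share of train timesteps, not a number taken from the dataset.

```python
def memoryless_accuracy(outlier_prob, train_fraction):
    """Accuracy of a classifier that labels speed 5 as 'walk' and everything else as 'train'.

    The only errors are outlier 'train' timesteps, which are predicted as 'walk'.
    """
    misclassified = outlier_prob * train_fraction
    return 1.0 - misclassified


for p in (0.0, 0.1, 0.5, 1.0):
    print(p, memoryless_accuracy(p, train_fraction=0.6))
```

Under these assumptions the accuracy is `1 - outlier_prob * train_fraction`, a straight line in `outlier_prob`. The measured curve can deviate from it, for example because of `class_weight="balanced"` or the ramp-up and ramp-down timesteps.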

## RNN models: GRU
Now we're going to try an RNN (with GRU cells) on our data.
The difference here is that we feed the network the whole sample at once, so that it can learn the patterns and hopefully perform better in the presence of outliers.
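To see why feeding the whole sequence helps, here is a sketch of a GRU forward pass in plain NumPy. It uses the common formulation with an update gate `z` and a reset gate `r`, blended as `h = z * h_prev + (1 - z) * h_candidate`. The weights are random and untrained, and the division by 100 is an illustrative scaling choice, so the block only shows how state is carried between timesteps, not how the Keras model below behaves.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_forward(x_seq, params, hidden_size):
    """Run a single-layer GRU over x_seq of shape (timesteps, features).

    Returns the hidden state at every timestep, shape (timesteps, hidden_size).
    """
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    h = np.zeros(hidden_size)
    states = []
    for x_t in x_seq:
        z = sigmoid(Wz @ x_t + Uz @ h + bz)             # update gate
        r = sigmoid(Wr @ x_t + Ur @ h + br)             # reset gate
        h_cand = np.tanh(Wh @ x_t + Uh @ (r * h) + bh)  # candidate state
        h = z * h + (1.0 - z) * h_cand                  # blend old state and candidate
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
hidden, features = 8, 1
params = []
for _ in range(3):  # one (W, U, b) triple per gate: update, reset, candidate
    params += [rng.normal(size=(hidden, features)) * 0.1,
               rng.normal(size=(hidden, hidden)) * 0.1,
               np.zeros(hidden)]

# Speeds scaled to [0, 1]; index 3 plays the role of an outlier inside a train ride
speed = np.array([5.0, 5.0, 100.0, 5.0, 100.0]).reshape(-1, 1) / 100.0
states = gru_forward(speed, params, hidden)
print(states.shape)  # (5, 8)
```

The input at index 3 equals the input at index 0, yet the hidden states differ because the state at index 3 still carries information from the preceding train timestep. This context is what a per-timestep model lacks.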

In [2]:
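import tensorflow as tf

rnn_model = tf.keras.Sequential(
    [
        # Reshape each flat sample of 1000 timesteps into (timesteps, features) = (1000, 1)
        tf.keras.layers.Reshape((1000, 1)),
        # return_sequences=True yields one output vector per timestep
        tf.keras.layers.GRU(8, return_sequences=True, recurrent_dropout=0.5, input_shape=(None, 1)),
        # Per-timestep probability of the positive class
        tf.keras.layers.Dense(1, activation="sigmoid", bias_initializer=None)
    ]
)
rnn_model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=[tf.keras.metrics.BinaryAccuracy()]
)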

In [None]:
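data_gru = []
# Note: rnn_model is created once above, so each fit() call continues from the
# weights learned at the previous outlier_prob.
for outlier_prob in (0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    dataset = Dataset.generate(train_outlier_prob=outlier_prob, n_samples=100)

    rnn_model.fit(
        x=dataset.to_tfds(),
        epochs=50,
        verbose=0
    )
    # Evaluate on a newly generated dataset with the same outlier probability
    dataset = Dataset.generate(train_outlier_prob=outlier_prob, n_samples=20)
    res = rnn_model.evaluate(dataset.to_tfds(), verbose=0)  # res = [loss, binary_accuracy]
    data_gru.append({'outlier_prob': outlier_prob, 'accuracy': res[1]})

df_gru = pd.DataFrame(data_gru)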

In [7]:
df_gru.plot(kind='line', x='outlier_prob', y="accuracy")

In [3]:
from datetime import datetime

# Clear any logs from previous runs
!rm -rf ./logs/
log_dir = "logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

dataset = Dataset.generate(train_outlier_prob=0.10, n_samples=200)

rnn_model.fit(
    x=dataset.to_tfds(),
    epochs=50,
    callbacks=[tensorboard_callback]
)

%tensorboard --logdir logs/fit

