# RNN Primer
### Advantages of sequential NNs over traditional ML models on time-series data

This notebook showcases the advantage of using RNN models (LSTMs, GRUs) over tree-based models (here, a random forest) for a classification task on time series data.

### Task description
**General**: classify the transport mode of a device (user) given some sensor data.

**Specific**: we will do a **binary classification** between "walk" and "train" modes, which correspond to a person walking or being in a train.

### Input data
Assume we have a sequence of device locations, from which we compute two features:
1. Speed at each timestep
2. Distance to a public transport stop at each timestep

Each train or walk sample consists of many timesteps and therefore forms a time series.

For demonstration purposes, the **features are generated synthetically** based on observations from real sensor data.

In [1]:
import numpy as np
import plotly.express as px
import pandas as pd

import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from rnnprimer.datagen import generate_train_sample, generate_walk_sample, Dataset

## Walk sample

A walk sample has distance and speed generated as follows:
* distance is a uniform random number between 0 and 10000 (meters)
* speed is normally distributed with a mean of 5 and a standard deviation of 2.5 (km/h)

In [17]:
sample = generate_walk_sample(seq_size=100)
# Build the figure first, then set its title.
fig = sample.get_figure()
fig.update_layout(title_text="Walk sample")
fig.show()
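
The generation rule described above can be sketched in plain NumPy. This is only an illustration of the two stated distributions, not the code inside `rnnprimer.datagen`; the helper name `sketch_walk_features`, the fixed seed, and the clipping of negative speeds at zero are all assumptions.

```python
import numpy as np

def sketch_walk_features(seq_size, seed=0):
    """Hypothetical helper: returns an array of shape (seq_size, 2)."""
    rng = np.random.default_rng(seed)
    # Distance to a public transport stop: uniform between 0 and 10000 meters.
    distance = rng.uniform(0.0, 10_000.0, size=seq_size)
    # Speed: normal with mean 5 km/h and standard deviation 2.5 km/h.
    # Clipping at zero is an assumption; the source does not say how
    # negative draws are handled.
    speed = np.clip(rng.normal(5.0, 2.5, size=seq_size), 0.0, None)
    # Column 0 is speed, column 1 is distance.
    return np.column_stack([speed, distance])

walk = sketch_walk_features(100)
print(walk.shape)
```

Because distance is drawn independently at every timestep, a walk sample has no temporal structure in that feature, which is what distinguishes it from the train profile.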

## Train sample with no outliers
A train sample has distance and speed generated as follows:
* distance increases linearly from 0 to 10000 (meters) and then decreases back to 0 along a segment
* speed quickly reaches 100 (km/h), stays constant, and then decreases to 0 again

A train sample can consist of one or more segments (5 in the figure below).

In [10]:
sample = generate_train_sample(seg_size=100)
fig = sample.get_figure()
fig.update_layout(title_text="Train sample without outliers")
fig.show()
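
A rough NumPy sketch of one such segment follows. It assumes a triangular distance profile and a trapezoidal speed profile; the helper name `sketch_train_segment` and the `ramp` length of 10 timesteps are made up for illustration and may differ from what `generate_train_sample` actually does.

```python
import numpy as np

def sketch_train_segment(seg_size, max_distance=10_000.0, cruise_speed=100.0, ramp=10):
    """Hypothetical helper: one train segment as an array of shape (seg_size, 2)."""
    t = np.linspace(0.0, 1.0, seg_size)
    # Distance: triangular profile, rising from 0 towards max_distance
    # at mid-segment and falling back to 0.
    distance = max_distance * (1.0 - np.abs(2.0 * t - 1.0))
    # Speed: ramp up over `ramp` steps, hold the cruise speed, ramp back down.
    idx = np.arange(seg_size)
    steps_from_edge = np.minimum(idx, idx[::-1])
    speed = cruise_speed * np.minimum(steps_from_edge / ramp, 1.0)
    # Column 0 is speed, column 1 is distance.
    return np.column_stack([speed, distance])

# A train sample with several segments is the segments laid end to end.
train = np.concatenate([sketch_train_segment(100) for _ in range(5)])
print(train.shape)
```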

## Train sample with outliers

Outliers are introduced in the speed feature at random points in time, with a given probability. An outlier has a speed drawn from the same distribution as that of a walk sample.

In [12]:
sample = generate_train_sample(seg_size=100, outlier_prob=0.01)
fig = sample.get_figure()
fig.update_layout(title_text="Train sample with outliers")
fig.show()
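
One possible reading of this rule, sketched in NumPy: each timestep is independently turned into an outlier with probability `outlier_prob`. That per-timestep interpretation, and the helper name `sketch_add_speed_outliers`, are assumptions; the real generator may place outliers differently.

```python
import numpy as np

def sketch_add_speed_outliers(speed, outlier_prob, seed=0):
    """Hypothetical helper: returns a noisy copy of `speed` and the outlier mask."""
    rng = np.random.default_rng(seed)
    # Assumption: every timestep becomes an outlier independently
    # with probability `outlier_prob`.
    mask = rng.random(speed.shape[0]) < outlier_prob
    noisy = speed.copy()
    # Outlier speeds follow the walk distribution: mean 5 km/h, std 2.5 km/h.
    noisy[mask] = rng.normal(5.0, 2.5, size=int(mask.sum()))
    return noisy, mask

cruise = np.full(1000, 100.0)  # a train cruising at 100 km/h
noisy, mask = sketch_add_speed_outliers(cruise, outlier_prob=0.1)
print(mask.mean())  # fraction of replaced timesteps, close to 0.1
```

Seen in isolation, an outlier timestep has a walk-like speed, so a model that looks at one timestep at a time has little to separate it from a real walk timestep.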

## Traditional ML models
As an instance of a traditional ML model we're going to use a random forest classifier.

For training and evaluation, we flatten the sequential samples and treat the features at each timestep as an individual training sample.
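
As a small illustration of what flattening means here, the sketch below turns a batch of sequences into per-timestep rows. The array layout `(n_samples, seq_len, n_features)` and the label encoding are assumptions; `get_flat_X_y()` may organize its output differently.

```python
import numpy as np

# Assumed layout: 3 sequences, 4 timesteps each, 2 features (speed, distance).
n_samples, seq_len, n_features = 3, 4, 2
X_seq = np.arange(n_samples * seq_len * n_features, dtype=float).reshape(
    n_samples, seq_len, n_features
)
y_seq = np.array([0, 1, 0])  # one label per sequence (encoding is illustrative)

# Flatten: every timestep becomes its own row ...
X_flat = X_seq.reshape(-1, n_features)
# ... and inherits the label of the sequence it came from.
y_flat = np.repeat(y_seq, seq_len)

print(X_flat.shape, y_flat.shape)  # (12, 2) (12,)
```

After this step the classifier no longer knows which rows belonged to the same sequence or in what order they occurred.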

We then introduce outliers into the train samples with increasing probability and measure how the **Precision** and **Recall** metrics respond.
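
For reference, precision and recall for the positive class can be computed from confusion counts as in the sketch below. The toy labels are invented for illustration, and treating label 1 as the positive class is an assumption about this dataset.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
```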

In [28]:
from sklearn.ensemble import RandomForestClassifier

In [29]:
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

data = []
for outlier_prob in (0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
    # Flatten sequences so that every timestep is an independent sample.
    X, y = Dataset.generate(train_outlier_prob=outlier_prob).get_flat_X_y()
    X = np.array(X)
    # Fixed random_state so precision and recall are scored on the same splits.
    cv = StratifiedShuffleSplit(n_splits=2, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=50, class_weight="balanced")
    precision = np.mean(cross_val_score(clf, X, y, cv=cv, scoring="precision", n_jobs=-1))
    recall = np.mean(cross_val_score(clf, X, y, cv=cv, scoring="recall", n_jobs=-1))
    data.append({'outlier_prob': outlier_prob, 'precision': precision, 'recall': recall})

In [30]:
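import plotly.graph_objects as go

# Plot precision and recall against the outlier probability.
outlier_probs = [d['outlier_prob'] for d in data]
fig = go.Figure()
fig.add_trace(go.Scatter(x=outlier_probs, y=[d['precision'] for d in data], name='precision'))
fig.add_trace(go.Scatter(x=outlier_probs, y=[d['recall'] for d in data], name='recall'))
fig.update_layout(xaxis_title="outlier probability", yaxis_title="score")

fig.show()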

## Testing flat features with a simple NN model

This currently doesn't work: the model seems to settle on predicting 1 for every sample. Why?

Observations so far:
1. It works better when I increase the number of neurons in the input layer. Why? I thought the input layer should have as many neurons as there are features.
2. The SGD optimizer doesn't work even when I increase the number of input neurons.
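
A note on the first question: in a Keras `Sequential` model the first `Dense` layer is already a hidden layer, and the input width is inferred from the data rather than set by that layer's unit count. So `Dense(2)` in the model in this section is a two-unit hidden layer that everything downstream has to pass through, and widening it adds capacity, which may be why training improves. The sketch below only counts trainable parameters for two layer stacks to make that concrete; the helper names are hypothetical and no claim is made about what this particular model ends up learning.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

n_features = 2  # speed and distance

def count_params(layer_units):
    """Total parameters for a stack of Dense layers fed with n_features inputs."""
    total, n_in = 0, n_features
    for units in layer_units:
        total += dense_params(n_in, units)
        n_in = units  # the next layer's input width is this layer's output width
    return total

print(count_params([2, 16, 1]))   # Dense(2), Dense(16), Dense(1): 6 + 48 + 17 = 71
print(count_params([16, 16, 1]))  # wider first layer: 48 + 272 + 17 = 337
```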

In [10]:
import tensorflow as tf
model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(2, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.Precision()]
)

In [12]:
X, y = Dataset.generate(train_outlier_prob=0, n_samples=100).get_flat_X_y()

model.fit(
    x=X,
    y=y,
    batch_size=100,
    epochs=50,
    validation_split=0.1
)

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x13d152250>

## RNN model showcase: GRU

This doesn't work yet: `model.fit` fails with an error about the layer's input rank (see the traceback below), and I'm not sure why.

In [13]:
import tensorflow as tf
model = tf.keras.Sequential(
    [
        tf.keras.layers.GRU(8, return_sequences=True, recurrent_dropout=0.5, input_shape=(None, 2)),
        tf.keras.layers.TimeDistributed(
            tf.keras.layers.Dense(4, activation="relu", input_shape=(8,))
        ),
        tf.keras.layers.TimeDistributed(
            tf.keras.layers.Dense(
                1, activation="sigmoid", input_shape=(4,), bias_initializer=None,
            )
        ),
    ]
)
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.Precision()],
    sample_weight_mode="temporal",
)

In [14]:
dataset = Dataset.generate(train_outlier_prob=0, n_samples=10)

model.fit(
    x=dataset.to_tfds(),
    epochs=50
)

Epoch 1/50


ValueError: in user code:

    /usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:571 train_function  *
        outputs = self.distribute_strategy.run(
    /usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:531 train_step  **
        y_pred = self(x, training=True)
    /usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:886 __call__
        self.name)
    /usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/input_spec.py:168 assert_input_compatibility
        layer_name + ' is incompatible with the layer: '

    ValueError: Input 0 of layer sequential_5 is incompatible with the layer: its rank is undefined, but the layer requires a defined rank.
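
The error message says the input tensor reaching the model has no known rank. A GRU declared with `input_shape=(None, 2)` expects rank-3 batches of shape `(batch, timesteps, 2)`, and the `TimeDistributed` sigmoid head emits one value per timestep, so the targets need shape `(batch, timesteps, 1)`. One plausible cause, stated as an assumption because `Dataset.to_tfds()` is not shown here, is that the dataset yields tensors without static shape information. The NumPy sketch below only illustrates arrays with the expected shapes, using made-up random data.

```python
import numpy as np

# Assumed toy data: 10 sequences, 100 timesteps, 2 features (speed, distance).
n_samples, seq_len, n_features = 10, 100, 2
rng = np.random.default_rng(0)

# Rank-3 input, matching input_shape=(None, 2) plus the batch axis.
X = rng.random((n_samples, seq_len, n_features)).astype("float32")

# One label per sequence, repeated for every timestep so that it lines up
# with a per-timestep output of shape (batch, timesteps, 1).
y_seq = rng.integers(0, 2, size=n_samples)
y = np.repeat(y_seq[:, None, None], seq_len, axis=1).astype("float32")

print(X.ndim, X.shape, y.shape)  # 3 (10, 100, 2) (10, 100, 1)
```

Arrays like these carry a fully defined rank. Whether `Dataset.to_tfds()` can be made to provide the same shape information depends on its implementation.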
