# RNN Primer
## Part 2: Padding and masking

In the real world, time series samples rarely have the same length. For example, one user might track their movement for 30 minutes, another for 1 hour, and another for 10 minutes.

In the previous notebook, we generated samples of the same length. To accommodate samples of different lengths, we need two techniques called **padding** and **masking**. Let's see how it's done.
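
Before turning to the dataset, here is a minimal sketch of the two ideas in plain NumPy. The helper `pad_batch`, the constant `PAD_VALUE`, and the toy sequences are illustrative assumptions, not part of `rnnprimer`. Padding fills shorter sequences with a placeholder value so they fit in one rectangular array, and the mask records which entries are real data.

```python
import numpy as np

PAD_VALUE = -1.0  # assumed placeholder; it must never occur in the real data


def pad_batch(sequences, pad_value=PAD_VALUE):
    """Right-pad 1-D sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.float32)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded


# Three toy sequences of different lengths (hypothetical values)
sequences = [[0.05, 0.05, 0.1], [0.2], [0.3, 0.4]]

padded = pad_batch(sequences)   # shape (3, 3)
mask = padded != PAD_VALUE      # True where the entry is real data

print(padded)
print(mask)
```

A model that respects the mask skips the padded entries, so the placeholder value does not influence the result.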

In [1]:
# Load the TensorBoard notebook extension
%load_ext tensorboard
import altair as alt

import numpy as np
import pandas as pd

import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from rnnprimer.datagen import generate_sample, Dataset

First, we adapt our sample generation function so that it can generate a variable number of train segments. For simplicity, we keep the train/walk split at 50%.

In [3]:
sample = generate_sample(total_train_seg_n=1)
fig = sample.get_figure()
fig.properties(title="Sample without outliers", width=800)

Now let's generate a dataset where the number of train segments in each sample is a random integer from 1 to 10.

In [4]:
# np.random.randint excludes the upper bound, so this draws an integer from 1 to 10
train_seg_n = lambda: np.random.randint(1, 11)
dataset = Dataset.generate(train_outlier_prob=0, n_samples=100, total_train_seg=train_seg_n)
sample_size_df = pd.DataFrame([len(s) for s in dataset.samples], columns=['# of timesteps'])
alt.Chart(sample_size_df).mark_bar().encode(
    alt.X("# of timesteps:Q", bin=True),
    y='count()',
)

One train segment is 100 timesteps, plus 100 timesteps of walking, so 200 timesteps in total. The histogram shows a more or less uniform distribution of sample lengths in our dataset.
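
The arithmetic can be spelled out as a short sketch. It assumes that the walk portion grows with the number of train segments, so that the 50% split holds for every sample. The names `TIMESTEPS_PER_TRAIN_SEGMENT`, `TRAIN_SHARE`, and `sample_length` are made up for this illustration.

```python
TIMESTEPS_PER_TRAIN_SEGMENT = 100  # taken from the text above
TRAIN_SHARE = 0.5                  # train/walk split kept at 50%


def sample_length(train_segments):
    """Total timesteps of a sample with the given number of train segments."""
    train_steps = train_segments * TIMESTEPS_PER_TRAIN_SEGMENT
    return int(train_steps / TRAIN_SHARE)


# Possible sample lengths when the segment count ranges from 1 to 10
lengths = [sample_length(n) for n in range(1, 11)]
print(lengths)
```

Under this assumption the sample lengths are multiples of 200, ranging from 200 to 2000 timesteps.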

In [5]:
for batch in dataset.to_tfds():
    features, labels = batch
    break

In [6]:
print(features[0])

tf.Tensor(
[[ 0.05]
 [ 0.05]
 [ 0.05]
 ...
 [-1.  ]
 [-1.  ]
 [-1.  ]], shape=(1600, 1), dtype=float32)


The last elements of the first feature tensor are set to -1, and its total length is 1600 timesteps. This means that the longest sample in this batch has 1600 timesteps, and shorter samples are padded with -1 up to that length.
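
To make the link between padding and masking concrete, here is a small NumPy sketch that derives a mask and the true sample lengths from a padded batch. It follows the rule documented for the Keras `Masking` layer: a timestep is masked when all of its features equal the mask value. The toy batch and the name `MASK_VALUE` are assumptions for illustration only.

```python
import numpy as np

MASK_VALUE = -1.0  # the padding value seen in the output above

# Toy padded batch with shape (batch, timesteps, features) = (2, 5, 1)
batch = np.array(
    [
        [[0.05], [0.05], [0.1], [-1.0], [-1.0]],  # 3 real timesteps, 2 padded
        [[0.2], [0.3], [0.3], [0.1], [0.05]],     # 5 real timesteps
    ],
    dtype=np.float32,
)

# A timestep is kept if at least one feature differs from the mask value
mask = np.any(batch != MASK_VALUE, axis=-1)  # shape (2, 5)
true_lengths = mask.sum(axis=1)              # real timesteps per sample

print(mask)
print(true_lengths)
```

This only works if -1 never appears as a genuine feature value; otherwise real timesteps would be masked as well.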

In [20]:
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        0.1,
        decay_steps=100,
        decay_rate=0.5)

rnn_model = tf.keras.Sequential(
    [
        # Skip timesteps whose features all equal the padding value -1
        tf.keras.layers.Masking(mask_value=-1.0),
        tf.keras.layers.GRU(8, return_sequences=True),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)
rnn_model.compile(
    loss="binary_crossentropy",
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr_schedule),
    metrics=[tf.keras.metrics.BinaryAccuracy()]
)
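
The learning-rate schedule used above can be reproduced by hand. This sketch assumes the default non-staircase behaviour of `ExponentialDecay`, where the rate is `initial_learning_rate * decay_rate ** (step / decay_steps)`. The function `exponential_decay` is a plain-Python stand-in written for this illustration, not a TensorFlow API.

```python
def exponential_decay(step, initial_lr=0.1, decay_steps=100, decay_rate=0.5):
    """Learning rate after `step` optimizer steps (continuous decay)."""
    return initial_lr * decay_rate ** (step / decay_steps)


# With decay_rate=0.5 the learning rate halves every 100 steps
for step in (0, 100, 200, 400):
    print(step, exponential_decay(step))
```

Note that the schedule counts optimizer steps (batches), not epochs, so how fast the rate drops depends on the number of batches per epoch.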

In [21]:
from datetime import datetime
!rm -rf ./logs/
log_dir = "logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

rnn_model.fit(
    x=dataset.to_tfds(),
    epochs=50,
    callbacks=[tensorboard_callback]
)


%tensorboard --logdir logs/fit

Epoch 1/50
...
Epoch 50/50


See https://www.tensorflow.org/guide/keras/masking_and_padding for more details.