### Generating sequences using encoder-decoder LSTM

In [63]:
import random
import numpy as np
import pandas as pd

The function `generate_sequence` generates a list of `length` random integers, each drawn from 0 to 99 (inclusive).

In [64]:
def generate_sequence(length=25):
    return [random.randint(0, 99) for _ in range(length)]
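
Because the sequence is random, each run produces different data. A minimal sketch of how a run could be made repeatable, assuming reproducibility is wanted; the seed value `42` is an arbitrary choice for illustration and is not part of the original notebook:

```python
import random

def generate_sequence(length=25):
    # Same definition as in the notebook: `length` random integers in [0, 99].
    return [random.randint(0, 99) for _ in range(length)]

random.seed(42)  # hypothetical seed, chosen only for illustration
first = generate_sequence(5)
random.seed(42)  # re-seeding replays the same random draws
second = generate_sequence(5)

print(len(first))       # 5
print(first == second)  # True
```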

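One-hot encoding turns each integer (or, more generally, each category such as a string label) into a binary vector with a single `1` at the index of that value. In this example the vectors have 100 entries: `0 = [1, 0, 0, ..., 0]`, ..., `99 = [0, 0, 0, ..., 1]`.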

In [65]:
def one_hot_encode(sequence, n_unique=100):
    encoding = []
    for x in sequence:
        # Start from an all-zero vector and set the position of the value to 1.
        vector = [0] * n_unique
        vector[x] = 1
        encoding.append(vector)
    return np.array(encoding)
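
The same encoding can also be written without an explicit loop. The sketch below is an alternative, not the notebook's implementation: `one_hot_encode_eye` is a hypothetical name, and it relies on the fact that row `i` of an identity matrix is the one-hot vector for `i`. A small `n_unique` is used so the result is easy to read.

```python
import numpy as np

def one_hot_encode_eye(sequence, n_unique=100):
    # Hypothetical vectorized variant: indexing the identity matrix with the
    # sequence picks one one-hot row per value.
    return np.eye(n_unique, dtype=int)[sequence]

encoded = one_hot_encode_eye([0, 2, 1], n_unique=3)
print(encoded)
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]]
```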

The function `one_hot_decode` reverses `one_hot_encode`. `np.argmax` returns the index of the largest value in each vector; because a one-hot vector contains a single `1`, that index is the encoded integer.

In [66]:
def one_hot_decode(sequence):
    return [np.argmax(x) for x in sequence]
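
A short round-trip check may help confirm that decoding recovers the original integers. In this sketch the one-hot rows are built with `np.eye` only to keep the example self-contained; the values in `original` are arbitrary.

```python
import numpy as np

def one_hot_decode(sequence):
    # Same definition as in the notebook.
    return [np.argmax(x) for x in sequence]

original = [7, 0, 42]
encoded = np.eye(100, dtype=int)[original]  # one-hot rows, one per value
decoded = one_hot_decode(encoded)
print([int(v) for v in decoded])  # [7, 0, 42]

# argmax also works on non-binary vectors, such as predicted probabilities.
print(int(np.argmax([0.1, 0.7, 0.2])))  # 1
```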

The next step is to turn the randomly generated sequence into a structured dataset that can be used to train an LSTM network in a supervised manner.

Each input `x` is a window of 5 consecutive items from the generated sequence, and the target `y` is the first 3 items of that window.

In [101]:
def to_dataset(encoded_sequence, size_x=5, size_y=3):
    df = pd.DataFrame(encoded_sequence)
    # Place size_x shifted copies side by side so that each row holds one
    # window of size_x consecutive steps, oldest first.
    df = pd.concat([df.shift(size_x - i - 1) for i in range(size_x)], axis=1)
    # Drop the leading rows with NaN (incomplete windows).
    df.dropna(inplace=True)
    # Convert the DataFrame to a NumPy array.
    values = df.values
    # Length of each one-hot vector.
    width = encoded_sequence.shape[1]
    x = values.reshape(len(values), size_x, width)
    # y is the first size_y steps of each window.
    y = values[:, 0:(size_y * width)].reshape(len(values), size_y, width)
    return x, y
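
To make the windowing easier to follow, the sketch below builds the same kind of windows with plain NumPy slicing and prints them as integers. `to_dataset_slices` is a hypothetical helper, not the notebook's function: it assumes the sequence has at least `size_x` items and returns integer arrays, whereas the pandas version returns floats because `shift` introduces `NaN`.

```python
import numpy as np

def to_dataset_slices(encoded_sequence, size_x=5, size_y=3):
    # Hypothetical slicing-based variant; assumes len(encoded_sequence) >= size_x.
    n_windows = len(encoded_sequence) - size_x + 1
    # Each window holds size_x consecutive steps, oldest first.
    x = np.stack([encoded_sequence[i:i + size_x] for i in range(n_windows)])
    # The target is the first size_y steps of each window.
    y = x[:, :size_y, :]
    return x, y

sequence = [3, 1, 4, 1, 5, 9, 2]
encoded = np.eye(10, dtype=int)[sequence]  # one-hot rows with 10 classes
x, y = to_dataset_slices(encoded)

print(x.shape, y.shape)           # (3, 5, 10) (3, 3, 10)
print(x.argmax(axis=2).tolist())  # [[3, 1, 4, 1, 5], [1, 4, 1, 5, 9], [4, 1, 5, 9, 2]]
print(y.argmax(axis=2).tolist())  # [[3, 1, 4], [1, 4, 1], [4, 1, 5]]
```

A sequence of 7 items yields 7 - 5 + 1 = 3 windows, and each consecutive window is shifted by one step.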

### Test

In [103]:
sequence = generate_sequence(5)
encoded_sequence = one_hot_encode(sequence)
x, y = to_dataset(encoded_sequence)
print(x)
print(y)

[[[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0