# Recurrent Neural Networks

We previously examined how convolutional neural networks let us be more flexible with input sizes and exploit the spatial proximity and invariance of features.

Recurrent Neural Networks can do this, too, but take a different perspective on the data: they process more loosely coupled sequences of inputs and outputs and carry an internal state from one step to the next (see slides).

With this, we once more tackle the Fashion MNIST data as an easy example and then try to teach the computer to generate character sequences.

## The usual imports

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.datasets import fashion_mnist
from keras.utils import np_utils
import sklearn.metrics as metrics
import numpy as np
from matplotlib import pyplot as plt

## Data

We'll use the same data as before, so we included this step already.

In [None]:
# load dataset
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

print ("y_train.shape = {}".format(y_train.shape))
print ("x_train.shape = {}".format(x_train.shape))

print(len(y_train), 'train samples')
print(len(y_test), 'test samples')

In [None]:
from keras.utils import np_utils

# 10 categories in our data
num_class = 10
img_rows, img_cols = 28, 28

# Normalizing color values
x_train = x_train / float(255)
x_test  = x_test / float(255)

# one-hot vectors of length 10 (one entry per class)
y_train_hot = np_utils.to_categorical(y_train, num_class)
y_test_hot  = np_utils.to_categorical(y_test, num_class)

# Shapes 
print ("y_train_hot.shape = {}".format(y_train_hot.shape))
print ("x_train.shape = {}".format(x_train.shape))

## Define Model

This time we want to process the images one row at a time, always feeding our internal state as an additional input into the next row. This way, when we reach the last row, we have built a state that combines all the information of the image.

We want to use an LSTM Layer with multiple nodes. The sequence or time steps would be the rows of the image, while each row has 28 pixels as features.
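To make the row-by-row idea concrete, here is a minimal NumPy sketch. It uses a plain tanh recurrence instead of a real LSTM cell, and the weights and the image are random stand-ins, so all names in it (`simple_rnn_last_state`, `W_in`, `W_state`) are assumptions for illustration only. It shows that after 28 time steps with 28 features each, one state vector per image remains.

```python
import numpy as np

def simple_rnn_last_state(image, W_in, W_state, bias):
    """Run a plain tanh RNN over the rows of one image and return the final state."""
    state = np.zeros(W_state.shape[0])
    for row in image:  # one time step per image row
        state = np.tanh(row @ W_in + state @ W_state + bias)
    return state

rng = np.random.default_rng(0)
n_rows, n_cols, n_units = 28, 28, 30      # 28 time steps, 28 features, 30 state units
image = rng.random((n_rows, n_cols))      # random stand-in for one normalized image
W_in = rng.normal(scale=0.1, size=(n_cols, n_units))
W_state = rng.normal(scale=0.1, size=(n_units, n_units))
bias = np.zeros(n_units)

final_state = simple_rnn_last_state(image, W_in, W_state, bias)
print(final_state.shape)  # (30,)
```

A dense softmax layer on top of this final state would then produce the class probabilities.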

***Hints:***
- the `LSTM` layer needs a 3D input of shape (#batchSize, #timeSteps, #features). In `input_shape` you only give (#timeSteps, #features), because the batch size is optional. Time steps means the number of inputs in one training example.
- End the model stack with a dense softmax layer as before

In [None]:
LSTM?

In [None]:
batch_size = 128
epochs = 10

# Parameters for LSTM network
lstmLayerWidth = 30
timeSteps = img_rows
dim_input_vector = img_cols

input_shape = (timeSteps, dim_input_vector)

In [None]:
# Build LSTM network
model = Sequential()
model.add(LSTM(lstmLayerWidth, input_shape=input_shape))
model.add(Dense(num_class, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
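The parameter counts that `model.summary()` reports can be checked by hand. The sketch below assumes a standard LSTM layer with biases (four weight blocks, each with input weights, recurrent weights and a bias) and a dense layer with bias; the helper names are made up for this example.

```python
def lstm_param_count(num_features, num_units):
    # four blocks (three gates plus the cell candidate), each with
    # input weights, recurrent weights and one bias per unit
    return 4 * (num_units * (num_features + num_units) + num_units)

def dense_param_count(num_inputs, num_outputs):
    # one weight per input/output pair plus one bias per output
    return num_inputs * num_outputs + num_outputs

lstm_params = lstm_param_count(num_features=28, num_units=30)
dense_params = dense_param_count(num_inputs=30, num_outputs=10)
print(lstm_params, dense_params, lstm_params + dense_params)  # 7080 310 7390
```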

## Training

Same as in earlier exercises ...

In [None]:
# Train
history = model.fit(x_train, y_train_hot, epochs=epochs, batch_size=batch_size, shuffle=True, verbose=1)

## Evaluate

The same procedure as last time ...

In [None]:
# Evaluate
evaluation = model.evaluate(x_test, y_test_hot, batch_size=batch_size, verbose=1)
print('Summary: Loss over the test dataset: %.2f, Accuracy: %.2f' % (evaluation[0], evaluation[1]))

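# predict_classes was removed in newer Keras versions;
# taking the argmax over the softmax output is equivalent
predictions = np.argmax(model.predict(x_test), axis=1)
cm = metrics.confusion_matrix(y_test, predictions)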

# Output
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm, cmap=plt.cm.gray)
fig.colorbar(cax)
plt.show()
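The grayscale plot gives a quick impression, but per-class numbers are often easier to read. The following sketch builds a confusion matrix for toy labels and derives the fraction of correctly recognized samples per class (recall). It follows the same convention as `sklearn.metrics.confusion_matrix` (rows are true labels, columns are predictions); the toy data and helper names are assumptions for illustration.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count how often true class i (row) was predicted as class j (column)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (true_labels, predicted_labels), 1)
    return cm

def per_class_recall(cm):
    """Fraction of samples of each true class that were predicted correctly."""
    return cm.diagonal() / cm.sum(axis=1)

toy_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 2, 0, 2])

toy_cm = confusion_matrix(toy_true, toy_pred, num_classes=3)
print(toy_cm)
print(per_class_recall(toy_cm))  # [0.5  1.   0.75]
```

Applied to the matrix from the cell above, `per_class_recall(cm)` would show which clothing categories the model confuses most.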

As the progression shows, we could probably improve this model with more training epochs.
However, we now want to try another kind of problem, one that is not solvable with non-recurrent networks. Let's see what this is about:

## Memorizing a sequence

This is a dummy use case, but it shows nicely how hard it is to memorize a sentence if you are only being told how wrong you are. The goal is to create a model that can recreate a complete sentence from only the first character. This sentence is

***Der Cottbuser Postkutscher putzt den Cottbuser Postkutschkasten***

The model will only get one character as input and has to produce the next character. What makes this special is that the same input ***t*** has several different correct outputs (for example *t, b, k, s, z*), and the correct one depends on what has been observed up to that point in the sequence. This information will reside in the internal states of the LSTM cells.
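The ambiguity can be made visible with a few lines of plain Python. This sketch collects, for every character of the sentence, the set of characters that follow it somewhere; the variable names are chosen for this example only.

```python
from collections import defaultdict

sentence = "Der Cottbuser Postkutscher putzt den Cottbuser Postkutschkasten"

# map each character to the set of characters that ever follow it
successors = defaultdict(set)
for current_char, next_char in zip(sentence, sentence[1:]):
    successors[current_char].add(next_char)

print(sorted(successors["t"]))  # [' ', 'b', 'e', 'k', 's', 't', 'z']

# characters for which a lookup table "input -> output" cannot work
ambiguous = sorted(c for c, following in successors.items() if len(following) > 1)
print(ambiguous)
```

Any model that sees only the current character cannot get all of these transitions right, which is why the internal state is needed.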

Let's import some necessary things:

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed, GRU, Input
from keras.utils import np_utils
import numpy as np
from matplotlib import pyplot as plt

## Define the training data

Now we define the learning goal, i.e. the string to repeat. The first character is special, as it will function as an initiator to start the repetition once the training is complete.

Similar to processing words, we have to create a *vocabulary* of possible characters from that string in order to represent them in the numerical input vectors that we feed into the network.

We also predefined the number of nodes per hidden layer.

In [None]:
LEARN_STRING = "*Der Cottbuser Postkutscher putzt den Cottbuser Postkutschkasten."
VOCAB = sorted(set(LEARN_STRING))  # sorted, so the character order is the same on every run

NUM_FEATURES = len(VOCAB) # dimensions of the input = one character as one hot vector
HIDDEN_WIDTH = 50         # how many nodes per hidden layer = arbitrary number, could be bigger or smaller.

Now, the actual input data is just the sequence of characters and the corresponding, desired output is the next character in line, i.e. the y vector is just the same as the x vector but shifted by one position.

It should look like this:

| row  | x       | y       |
|------|---------|---------|
| 1    | *       | D       |
| 2    | D       | e       |
| 3    | e       | r       |
| 4    | r       | (space) |
| 5    | (space) | C       |
| 6    | C       | o       |
| 7    | o       | t       |
| ...  | ...     | ...     |
| n    | .       | *       |

But with one-hot vectors instead of the characters. As we have seen in the Fashion MNIST example above, the RNN needs the data in the shape (batch_size, timesteps, features). We have only one training sequence, therefore the batch size is one.
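On a toy string the encoding and the shift by one position are easy to follow. The sketch below uses a made-up four-character string and `np.roll` for the shift, which is one possible way to build the target and not necessarily the one used in the exercise.

```python
import numpy as np

toy_string = "*ab."
toy_vocab = sorted(set(toy_string))  # ['*', '.', 'a', 'b']

# one-hot encode the input: shape (batch_size=1, timesteps, features)
toy_x = np.zeros((1, len(toy_string), len(toy_vocab)))
for position, char in enumerate(toy_string):
    toy_x[0, position, toy_vocab.index(char)] = 1

# the target is the input shifted one step to the left, wrapping around at the end
toy_y = np.roll(toy_x, shift=-1, axis=1)

def decode(one_hot_batch):
    """Turn the first sequence of a one-hot batch back into a string."""
    return "".join(toy_vocab[i] for i in one_hot_batch[0].argmax(axis=1))

print(toy_x.shape)                           # (1, 4, 4)
print(decode(toy_x), "->", decode(toy_y))    # *ab. -> ab.*
```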

In [None]:
print(len(LEARN_STRING))
print(len(VOCAB))

# create two numpy arrays of zeros directly in the desired input shape
# (batch_size, timesteps, features)
x = np.zeros((1, len(LEARN_STRING), len(VOCAB)))
y = np.zeros((1, len(LEARN_STRING), len(VOCAB)))

# fill them with training data: 
# one input character as one hot vector -> one output character as one hot vector
for i, char in enumerate(LEARN_STRING):
    x[0][i][VOCAB.index(char)] = 1
    # the target at position i-1 is the character at position i (wraps around at the start)
    y[0][(i-1)%len(LEARN_STRING)][VOCAB.index(char)] = 1

print(x.shape)


In [None]:
# Short check if the sequences match in the way we want to:
x_string = ""
y_string = ""
for i in range(len(LEARN_STRING)):
    x_string += (VOCAB[np.argmax( x[0,i,:])])
    y_string += (VOCAB[np.argmax( y[0,i,:])])
print(x_string)
print(y_string)

## Define the model

Sequence-to-sequence learning and predicting is a bit trickier: 
- We want to learn from the whole sequence, not individual character pairs. The model shall learn the influence of the internal state, too.
- During prediction we only feed it one character at a time and build up the sequence from the output of the model itself.

This means we have to define the same network in slightly different but compatible configurations. Therefore, we define a procedure to build the model and call it later with different parameters.

We provide the interface and the compilation of the model. You have to define the layers.

***Hints:***
- Take a look at the layers we imported at the beginning of this section.
- Use `batch_input_shape` for the first layer to define the input shape. This will be needed later on.
- The `LSTM` layers need `return_sequences=True`, otherwise they produce only one output at the end of the sequence.
- The "statefulness" of the layers needs to be set according to the provided parameter.
- The output layer needs to be wrapped in a `TimeDistributed` layer, which applies it to every time step, so that we get one output for each input throughout the whole sequence.
- Instead of LSTM, you can also use `GRU` layers, which have exactly the same interface but learn a bit better in this use case.

In [None]:
TimeDistributed?

In [None]:
GRU?

In [None]:
def create_model(num_timesteps, num_features, num_hidden_nodes, stateful=False, batch_size=None):
    
    model = Sequential()

    model.add(GRU(num_hidden_nodes,
                  batch_input_shape=(batch_size, num_timesteps, num_features),
                  return_sequences=True,  # produce an output for each time step, not just the last one
                  stateful=stateful))
    model.add(GRU(num_hidden_nodes,
                  return_sequences=True,  # produce an output for each time step, not just the last one
                  stateful=stateful))
    model.add(TimeDistributed(Dense(num_features, activation="softmax"),
                              name='output_layer'))

    #Specify loss function and optimization algorithm, compile model
    model.compile(loss="categorical_crossentropy",
                  optimizer='adam')
    
    return model

## Create and train the training model

We now create a training model with our method `create_model` using the constant values defined above. The length of the training string is the number of time steps in one sequence.

In [None]:
# create training model
train_model = create_model(len(LEARN_STRING), NUM_FEATURES, HIDDEN_WIDTH)

Because we only have one sample, we have to go through many epochs. You can set `verbose` to `False` in order to not get spammed with stats. Let's start with 100 epochs and see what we get. (Spoiler: 100 will not be enough)

In [None]:
# fit model to data for some epochs
train_model.fit(x, y, epochs=100, batch_size=1, verbose=False)
train_model.fit(x, y, epochs=1, batch_size=1) # one last quick epoch to get the current stats

As already said: the training model will not help us when we want to make predictions, because it only accepts a whole sequence as input and it resets its state after each prediction.

Therefore we create a structurally similar model, but we make it `stateful` which means we control when the states are being reset, and we set the batch size and sequence length to one.

Of course this new model is untrained, but we can `set_weights` to the values of the already trained one :)
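Why copying the weights is enough can be seen in a small NumPy sketch. It again uses a plain tanh recurrence with random weights as a stand-in for the GRU layers (an assumption for illustration): processing the whole sequence in one call gives the same outputs as processing it one step per call, as long as the state is carried over between the calls.

```python
import numpy as np

def rnn_steps(inputs, state, W_in, W_state):
    """Apply a plain tanh RNN to inputs of shape (timesteps, features), starting from state."""
    outputs = []
    for x_t in inputs:
        state = np.tanh(x_t @ W_in + state @ W_state)
        outputs.append(state)
    return np.array(outputs), state

rng = np.random.default_rng(1)
num_steps, num_features, num_units = 6, 4, 5
sequence = rng.random((num_steps, num_features))
W_in = rng.normal(size=(num_features, num_units))
W_state = rng.normal(size=(num_units, num_units))

# "training style": the whole sequence in one call
full_outputs, _ = rnn_steps(sequence, np.zeros(num_units), W_in, W_state)

# "prediction style": one time step per call, the state is kept between calls
state = np.zeros(num_units)
stepwise_outputs = []
for x_t in sequence:
    out, state = rnn_steps(x_t[np.newaxis, :], state, W_in, W_state)
    stepwise_outputs.append(out[0])
stepwise_outputs = np.array(stepwise_outputs)

print(np.allclose(full_outputs, stepwise_outputs))  # True
```

Resetting `state` to zeros in the middle of the loop would break the equivalence, which corresponds to what a non-stateful model does between calls.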

In [None]:
# create prediction model and take over weights
predict_model = create_model(1, NUM_FEATURES, HIDDEN_WIDTH, batch_size=1, stateful=True)
predict_model.set_weights(train_model.get_weights())

For making predictions we create an initial single input from the first character of the target string (Hint: shape(1,1,NUM_FEATURES)) and reset the state of the model.

Now you can use the output of the one prediction as input into the next prediction and collect all the characters which get created this way. 
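The output of a prediction is a vector of softmax probabilities, not a clean one-hot vector. One possible variation, which is an assumption and not required by the exercise, is to snap the output to a one-hot vector before feeding it back, so the model sees the same kind of input as during training. The helper name `to_one_hot` is made up for this sketch.

```python
import numpy as np

def to_one_hot(probabilities):
    """Return an array of the same shape with a 1 at the argmax of the last axis and 0 elsewhere."""
    one_hot = np.zeros_like(probabilities)
    winner = probabilities.argmax(axis=-1)
    np.put_along_axis(one_hot, winner[..., np.newaxis], 1, axis=-1)
    return one_hot

soft_output = np.array([[[0.1, 0.7, 0.2]]])  # shape (1, 1, 3), like one prediction step
print(to_one_hot(soft_output))               # [[[0. 1. 0.]]]
```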


In [None]:
# start token
start = np.zeros(len(VOCAB)).reshape(1,1,len(VOCAB))
start[0][0][VOCAB.index(LEARN_STRING[0])] = 1

# init
predict_model.reset_states()
input_char = start

result = ""
for i in range(len(LEARN_STRING)):
    input_char = predict_model.predict(input_char)
    result += VOCAB[np.argmax(input_char)]
print(result)

The first try will be a disaster. But if it produces more than just a single character repeated all over, you are on the right track! Just train for more epochs, copy the weights into the prediction model again, and retry.

A few dozen epochs at a time (30 in the cell below) should show a nice progression. About 500 epochs in total should solve the problem.

In [None]:
epochs_per_step = 30
for run in range(500//epochs_per_step):
    # make some epochs more
    history = train_model.fit(x, y, epochs=epochs_per_step, batch_size=1, verbose=False)
    # transfer weights
    predict_model.set_weights(train_model.get_weights())
    # create output
    predict_model.reset_states()
    input_char = start
    result = ""
    for j in range(len(LEARN_STRING)):
        input_char = predict_model.predict(input_char)
        result += VOCAB[np.argmax(input_char)]
    print("Result after", ((run+1)*epochs_per_step+100), "epochs:",result)
    # stop as soon as the generated text matches the target string exactly
    if "*" + result == LEARN_STRING + "*":
        break