# Tutorial: Recurrent Neural Networks in Keras 
## CS175 Discussion #4,  Jan. 31st, 2018
Author: [Eric Nalisnick](http://www.ics.uci.edu/~enalisni/)

### Goals of this Lesson
- Introduce
    - RNNs for classification
    - RNNs for language generation
- Implement... 
    - Elman RNN
    - Long Short-Term Memory (LSTM) RNN
    - Convolutional RNNs

### References 
- [Keras](https://keras.io/) 
- [*The Unreasonable Effectiveness of Recurrent Neural Networks* by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [*Understanding LSTM Networks* by Chris Olah](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

## 0.  Python Preliminaries
As usual, first we need to import NumPy and Matplotlib...

In [153]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

## 1.  IMDB Dataset

Let's start by loading the data we'll be working with.  We'll use the [IMDB dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification), which contains movie reviews labeled with their sentiment (positive vs. negative), split into 25,000 reviews for training and 25,000 for testing...  

In [154]:
from keras.datasets import imdb
from keras.preprocessing import sequence

# truncation variables
n_words_to_keep = 5000
max_review_length = 500

# load data
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=n_words_to_keep)

# pad / crop reviews so all are 500 'words'
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

In [155]:
print("Input:")
print(X_train)

print("\nLabels:")
print(y_train)

Input:
[[   0    0    0 ...,   19  178   32]
 [   0    0    0 ...,   16  145   95]
 [   0    0    0 ...,    7  129  113]
 ..., 
 [   0    0    0 ...,    4 3586    2]
 [   0    0    0 ...,   12    9   23]
 [   0    0    0 ...,  204  131    9]]

Labels:
[1 0 0 ..., 0 1 0]
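
The leading zeros in the printed input come from padding.  As a rough illustration, the sketch below mimics in plain Python what `pad_sequences` does with its default settings (pad and truncate at the front, fill value 0).  `pad_pre` and the toy reviews are hypothetical names made up for this example; they are not part of Keras.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad or left-truncate each sequence to exactly `maxlen` items."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                            # keep the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)   # zeros go in front
    return padded

# one review that is too short, one that is too long
toy_reviews = [[11, 12], [21, 22, 23, 24, 25, 26]]
toy_padded = pad_pre(toy_reviews, maxlen=4)
print(toy_padded)
```

Short reviews gain zeros on the left, and long reviews lose their earliest words, so the end of each review is always kept.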


## 2.  Simple (Elman) RNN for Sentence Classification

First, we define the most basic RNN, also known as an Elman RNN, for binary classification.  Recall that, given the sequence of words in a review, we classify the review as positive ($y_{i}=1$) or negative ($y_{i}=0$).  Let's write down the model's recurrent dynamics: $$ \mathbf{h}_{i, t} = f_{h}( x_{i,t} \mathbf{W} + \mathbf{h}_{i,t-1} \mathbf{U} + \mathbf{b}_{1} )$$ where $f_{h}$ is an activation function (Keras defaults to *tanh*), $x_{i,t}$ is the input for the $i$th example at time step $t$, $\mathbf{h}_{i,t-1}$ are the hidden units produced at the previous time step, $\mathbf{W}$ and $\mathbf{U}$ are the input-to-hidden and hidden-to-hidden weight matrices, and $\mathbf{b}_{1}$ is a bias vector.

After all words in the review have been fed into the RNN, it's time to generate a prediction: $$\mathbb{E}[y_{i}] = f_{out}(\mathbf{h}_{i, T} \boldsymbol{\beta} + \mathbf{b}_{2}) $$ where $\boldsymbol{\beta}$ holds the output weights, $\mathbf{b}_{2}$ is the output bias, and $f_{out}$ is the logistic function in the case of binary classification.  Note that the hidden units are those computed at the *last* time step $t=T$.

To train the model, we again use the *cross-entropy* error function: $$ \mathcal{L}_{CE} = -y_{i} \log \mathbb{E}[y_{i}] - (1-y_{i}) \log (1-\mathbb{E}[y_{i}]).$$  Learning, i.e. model optimization, is performed by taking gradients of this loss w.r.t. $\mathbf{W}$, $\mathbf{U}$, $\mathbf{b}_{1}$, $\mathbf{b}_{2}$, and $\boldsymbol{\beta}$.
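
To make the equations concrete, here is a minimal NumPy sketch of one forward pass and the resulting loss for a single example.  The sizes, the random weights, and the variable names are assumptions chosen for illustration (in the tutorial's model the inputs are embedding vectors and Keras handles all of this internally).

```python
import numpy as np

rng = np.random.RandomState(0)
T, d_in, d_h = 5, 3, 4             # toy sizes, not the ones used in the tutorial

x = rng.randn(T, d_in)             # one example: T time steps of d_in features
W = rng.randn(d_in, d_h) * 0.1     # input-to-hidden weights
U = rng.randn(d_h, d_h) * 0.1      # hidden-to-hidden weights
b1 = np.zeros(d_h)                 # hidden bias
beta = rng.randn(d_h) * 0.1        # hidden-to-output weights
b2 = 0.0                           # output bias

# recurrent dynamics: h_t = tanh(x_t W + h_{t-1} U + b_1)
h = np.zeros(d_h)
for t in range(T):
    h = np.tanh(x[t] @ W + h @ U + b1)

# prediction from the LAST hidden state, squashed by the logistic function
p = 1.0 / (1.0 + np.exp(-(h @ beta + b2)))

# cross-entropy for a positive review (y = 1)
y = 1
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(p, loss)
```

Only the final hidden state reaches the output layer, which is why `SimpleRNN` returns a single vector per review by default.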

In [156]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN

h1_size = 32
h_RNN_size = 100
output_size = 1

model = Sequential()
model.add(Embedding(n_words_to_keep, h1_size, input_length=max_review_length))
model.add(SimpleRNN(units=h_RNN_size))
model.add(Dense(output_size, activation='sigmoid'))

Define a loss function and an optimizer...

In [157]:
adam = keras.optimizers.Adam(lr=0.001)
model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])

Keras has a nice function for summarizing the model...

In [158]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 500, 32)           160000    
_________________________________________________________________
simple_rnn_7 (SimpleRNN)     (None, 100)               13300     
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 101       
Total params: 173,401
Trainable params: 173,401
Non-trainable params: 0
_________________________________________________________________
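
The parameter counts in the summary can be checked by hand.  The short sketch below redoes the arithmetic using the layer sizes defined above; the variable names are only for this example.

```python
vocab, embed_dim, rnn_units, n_out = 5000, 32, 100, 1

embedding_params = vocab * embed_dim          # one 32-dim vector per word
rnn_params = (embed_dim * rnn_units           # W: input-to-hidden
              + rnn_units * rnn_units         # U: hidden-to-hidden
              + rnn_units)                    # b_1
dense_params = rnn_units * n_out + n_out      # beta and b_2

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
```

Most of the parameters sit in the embedding table rather than in the recurrent layer.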


And we can now train the model...

In [159]:
train_log = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


Lastly, let's get the final test accuracy...

In [160]:
scores = model.evaluate(X_test, y_test, verbose=1)
print("Elman RNN Test Accuracy: %.3f%%" % (scores[1]*100))

Elman RNN Test Accuracy: 74.532%


## 3.  Long Short-Term Memory Network for Sentence Classification

Elman networks have difficulties learning long-term dependencies in the input.  Recognizing this, [Hochreiter and Schmidhuber (1997)](http://www.bioinf.jku.at/publications/older/2604.pdf) proposed the *long short-term memory* unit to solve the problem.  We won't go into the details behind the unit; see [Chris Olah's blog post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) for an intuitive overview.  Running an LSTM in Keras is just a straightforward change of the code above...

In [161]:
from keras.layers import LSTM

model = Sequential()
model.add(Embedding(n_words_to_keep, h1_size, input_length=max_review_length))
model.add(LSTM(units=h_RNN_size))
model.add(Dense(output_size, activation='sigmoid'))

Define a loss function and an optimizer...

In [162]:
adam = keras.optimizers.Adam(lr=0.001)
model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])

And summarize the model...

In [163]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_13 (Embedding)     (None, 500, 32)           160000    
_________________________________________________________________
lstm_6 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
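
The LSTM layer has exactly four times as many parameters as the Elman layer, because it keeps a separate set of input weights, recurrent weights, and biases for each of its three gates and for the candidate cell update.  A quick arithmetic check (variable names are only for this example):

```python
embed_dim, rnn_units = 32, 100

# parameters of one weight set: input weights + recurrent weights + bias
per_set = embed_dim * rnn_units + rnn_units * rnn_units + rnn_units

# input gate, forget gate, output gate, candidate cell state
lstm_params = 4 * per_set

lstm_total = 5000 * embed_dim + lstm_params + (rnn_units + 1)
print(per_set, lstm_params, lstm_total)
```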


In [164]:
train_log = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [165]:
scores = model.evaluate(X_test, y_test, verbose=1)
print("LSTM RNN Test Accuracy: %.3f%%" % (scores[1]*100))

LSTM RNN Test Accuracy: 87.032%


## 4.  Convolutional LSTM Networks for Sentence Classification

As we know from the success of the [bag-of-words (BOW) assumption](https://en.wikipedia.org/wiki/Bag-of-words_model), classifying text doesn't necessarily need to account for word ordering.  In fact, LSTMs can fail to generalize if they become too particular to word orderings in the training data.  One way to bridge the BOW assumption and the strict order dependence of RNNs is to add a one-dimensional [convolutional layer](https://en.wikipedia.org/wiki/Convolutional_neural_network#Convolutional_layer) before the recurrent units.  Intuitively, this means that the RNN's input is order invariant outside of some filter size.  For instance, the network would treat the sentences 'the cat ran around the house' and 'around the house, the cat ran' roughly the same.  

Let's define the model...

In [166]:
from keras.layers import Conv1D, MaxPooling1D

model = Sequential()
model.add(Embedding(n_words_to_keep, h1_size, input_length=max_review_length))

# add convolutional layer w/ max pooling
n_filters = 32 # number of output features
filter_size = 3 # length of window
model.add(Conv1D(filters=n_filters, kernel_size=filter_size, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))

model.add(LSTM(units=h_RNN_size))
model.add(Dense(output_size, activation='sigmoid'))

Define a loss function and an optimizer...

In [167]:
adam = keras.optimizers.Adam(lr=0.001)
model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])

Summarize the model...

In [None]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 500, 32)           160000    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 500, 32)           3104      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
lstm_7 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 101       
Total params: 216,405
Trainable params: 216,405
Non-trainable params: 0
_________________________________________________________________
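
The shapes and counts in this summary follow from the layer settings.  The sketch below reproduces them, assuming a stride of 1 for the convolution and a pooling stride equal to the pool size (the Keras defaults); the variable names are only for this example.

```python
seq_len, embed_dim = 500, 32
n_filters, filter_size, pool_size, rnn_units = 32, 3, 2, 100

# each filter spans `filter_size` time steps across all input channels, plus a bias
conv_params = filter_size * embed_dim * n_filters + n_filters

conv_len = seq_len                 # padding='same' preserves the sequence length
pooled_len = conv_len // pool_size # max pooling halves the number of time steps

# the LSTM now reads `n_filters` features per step instead of `embed_dim`
lstm_params = 4 * (n_filters * rnn_units + rnn_units * rnn_units + rnn_units)

cnn_lstm_total = 5000 * embed_dim + conv_params + lstm_params + (rnn_units + 1)
print(conv_params, pooled_len, lstm_params, cnn_lstm_total)
```

Pooling also means the LSTM only has to unroll over 250 steps instead of 500.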


Train the model...

In [None]:
train_log = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/3

In [None]:
scores = model.evaluate(X_test, y_test, verbose=1)
print("CNN-LSTM RNN Test Accuracy: %.3f%%" % (scores[1]*100))

## 5.  LSTM Network for Language Generation

RNNs differ from the classification models we've been studying so far in that they can be *generative*.  In other words, we can train them to generate language, not just classify it.  Yet training a generative model essentially reverts to classification: the model predicts what will come next.  It does this by factorizing the joint probability as $p(w_{1},\ldots,w_{T}) = p(w_{1})p(w_{2}|w_{1})p(w_{3}|w_{2},w_{1})\ldots p(w_{T}|w_{T-1},\ldots,w_{1})$.  Writing these probabilities as a maximum likelihood objective (i.e. minimizing the negative log-likelihood), we have: $$ \mathcal{L}_{gen} = -\sum_{t=1}^{T} \log p(w_{t} | w_{t-1},\ldots,w_{1}).$$  
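
As a small numerical check of the factorization, the sketch below takes made-up conditional probabilities for a four-word sentence and confirms that summing their negative logs gives the negative log of the joint probability.  The numbers are hypothetical and do not come from any trained model.

```python
import math

# hypothetical values of p(w_t | w_{t-1}, ..., w_1) for t = 1..4
cond_probs = [0.2, 0.5, 0.1, 0.4]

# joint probability: product of the conditionals
joint_prob = 1.0
for p in cond_probs:
    joint_prob *= p

# negative log-likelihood: sum of per-step classification losses
nll = -sum(math.log(p) for p in cond_probs)

print(joint_prob, nll, -math.log(joint_prob))
```

Each term in the sum is an ordinary cross-entropy loss for predicting one word, which is why training reduces to classification.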

Let's build a language model for Herman Melville's *Moby Dick*.  First, we read the text file and tokenize to extract the most common 10,000 words... 

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

with open('data/moby_dick.txt', 'r') as f:
    moby_dick_text = f.read()
    
# translate text into int tokens
vocab_size = 10000 + 1
tokenizer = Tokenizer(num_words=vocab_size-1)
tokenizer.fit_on_texts([moby_dick_text])
encoded_text = tokenizer.texts_to_sequences([moby_dick_text])[0]

print("TEXT: "+moby_dick_text[69:300])
print("\nENCODINGS: "+str(encoded_text[:100]))

Then we form sequences... 

In [None]:
# create word -> word sequences
# (named `pair` so we don't shadow the `sequence` module imported earlier)
sequences = list()
for i in range(1, len(encoded_text)):
    pair = encoded_text[i-1:i+1]
    sequences.append(pair)
print('Total Sequences: %d' % len(sequences))
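
To see what this loop produces, here is the same slicing applied to a tiny made-up token list (`toy_encoded` stands in for `encoded_text`).

```python
toy_encoded = [5, 9, 2, 7]   # hypothetical integer tokens

# each pair is [current word, next word]
toy_pairs = [toy_encoded[i - 1:i + 1] for i in range(1, len(toy_encoded))]
print(toy_pairs)
```

A text of N tokens yields N-1 overlapping pairs, so every word except the first serves as a prediction target once.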

Lastly, make training labels out of the subsequent words... 

In [None]:
# split into X and y elements
sequences = np.array(sequences)
X_train, y_train = sequences[:,0], sequences[:,1]
y_train = to_categorical(y_train, num_classes=vocab_size)
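
For reference, the one-hot encoding that `to_categorical` performs can be sketched in plain Python as follows.  `one_hot` is a hypothetical helper written for this example, and it returns nested lists rather than the NumPy array Keras produces.

```python
def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

toy_targets = one_hot([2, 0, 3], num_classes=4)
print(toy_targets)
```

Each row is the target distribution that the softmax output is compared against under the categorical cross-entropy loss.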

Then we define an LSTM just as before (but a somewhat bigger one this time)...

In [None]:
h1_size = 100
h_RNN_size = 500

# define model
model = Sequential()
model.add(Embedding(vocab_size, h1_size, input_length=1))
model.add(LSTM(h_RNN_size))
model.add(Dense(vocab_size, activation='softmax'))

model.summary()

Define optimizer and train...

In [None]:
adam = keras.optimizers.Adam(lr=0.005)
model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
from keras.models import load_model
# training is slow, so load a previously trained model instead
#train_log = model.fit(X_train, y_train, epochs=10, batch_size=64)
model = load_model('moby_dick_rnn.h5')

## 6.  Sampling from the Model

After training, we can simulate 'new' Moby Dick passages by sampling from the model.  Given a seed word, we run a forward pass through the model and sample from the resulting softmax over words: $$ w_{t+1} \sim p(w_{t+1} | w_{t},\ldots,w_{1}) = \text{Multinoulli}(p=f_{softmax}(w_{1},\ldots,w_{t})).$$  We then use the sampled word as input to the model for the next time step.  By repeating this process, we can generate whole passages.  Here's a function that performs the sampling...

In [None]:
# NOTE: there are more efficient ways to implement this
def generate_seq(model, tokenizer, seed_text, n_words, temperature=1.):
    in_text, result = seed_text, seed_text
    
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = np.array(encoded)
        
        # get the probabilities of each vocabulary word
        word_probs = model.predict(encoded, verbose=0)[0]
        
        # add a temperature to create more sampling variation
        word_probs = np.log(word_probs) / temperature 
        word_probs = np.exp(word_probs) / np.sum(np.exp(word_probs)) 
        yhat = np.random.choice(range(word_probs.shape[0]), p=word_probs)
        
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
                
        # append to input
        in_text, result = out_word, result + ' ' + out_word
        
    return result
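
One of the more efficient options hinted at in the note is to invert the word-to-index dictionary once, rather than scanning it for every generated word.  The sketch below shows the idea on a made-up dictionary; `toy_word_index` stands in for `tokenizer.word_index` and is an assumption for this example.

```python
# stand-in for `tokenizer.word_index` (word -> integer index, starting at 1)
toy_word_index = {'the': 1, 'whale': 2, 'ship': 3}

# invert once, outside the sampling loop
toy_index_word = {index: word for word, index in toy_word_index.items()}

print(toy_index_word.get(3, ''))   # a known index maps back to its word
print(toy_index_word.get(0, ''))   # index 0 is never assigned to a word
```

Using `.get` with an empty-string default mirrors the behavior of the loop above when the sampled index has no matching word.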

In [None]:
generate_seq(model, tokenizer, seed_text='ship', n_words=50, temperature=1.)

In [None]:
generate_seq(model, tokenizer, seed_text='ship', n_words=50, temperature=10.)

In [None]:
generate_seq(model, tokenizer, seed_text='ship', n_words=50, temperature=.1)
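
The three calls above differ only in `temperature`.  To see what that parameter does, the sketch below applies the same rescaling used in `generate_seq` to a small made-up distribution.  `apply_temperature` is a hypothetical helper for this example; it also subtracts the largest log-probability before exponentiating, a common numerical-stability step that `generate_seq` does not perform.

```python
import math

def apply_temperature(probs, temperature):
    """Raise each probability to the power 1/temperature and renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    biggest = max(logits)
    weights = [math.exp(l - biggest) for l in logits]   # shift for stability
    total = sum(weights)
    return [w / total for w in weights]

toy_probs = [0.7, 0.2, 0.1]
for temp in (1.0, 10.0, 0.1):
    print(temp, [round(p, 3) for p in apply_temperature(toy_probs, temp)])
```

A temperature of 1 leaves the distribution unchanged, a high temperature flattens it toward uniform (more varied, less coherent text), and a low temperature concentrates nearly all mass on the most likely word (more repetitive text).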