<a id = 'returnToTop'></a>

# Table of contents
  * A. [Introduction to RNN](#introToRnn) 
    * 1. [Model Structure](#modelStructure)
    * 2. [Multi-Layer Cells](#multiLayerCells)
    * 3. [Batching and Truncated Backpropagation Through Time (BPTT)](#backProp)
    * 4. [Implementing a very simple RNN](#implementRnn)  
  * B. [Implementing a simple RNNLM and training on a trivial corpus](#runOnToyInput)  
  * C. [Introducing the Dataset input utility](#tfDataset)  
  * D. [Run on larger data and use TensorBoard to visualize](#runOnLargerInput) 
  * E. [Refining results (using sampled softmax loss) -- go to part II](#goToPartII)  

<a id = 'introToRnn'></a>
# A. Introduction to Recurrent Neural Network Language Model

In this part, we'll learn about building a recurrent neural network language model (RNNLM) using TensorFlow Keras. This class of models represented the cutting edge in language modeling about 5 years ago.  Even though transformers (like BERT) now represent the state of the art for overall accuracy, LSTMs tend to take much less time to train, so with limited training time and compute resources they can produce surprisingly good results.  Analyzing and building them is also useful for understanding concepts fundamental to all neural network architectures (states, input / output dimensions, batching, setting up loss and metrics).

As a reference, you may want to review the following:

- [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) (Andrej Karpathy, 2015)
- [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) (Chris Olah, 2015)
- [A TensorFlow / Keras tutorial on using RNNs for text generation](https://www.tensorflow.org/tutorials/text/text_generation) (updated Oct. 2020)

The specific model we'll build is based on the following papers. You should skim these (particularly the first one), but you don't need to read them in detail:

- [Recurrent neural network based language model](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf) (Mikolov, et al. 2010)
- [Exploring the Limits of Language Modeling](http://arxiv.org/pdf/1602.02410.pdf) (Jozefowicz, et al. 2016)

We'll build our model entirely in TensorFlow Keras, so you may want to review the [TensorFlow section of assignment 1](../a1/tensorflow/tensorflow.ipynb).

Finally, you may want to consult the [TensorFlow Keras API reference](https://www.tensorflow.org/api_docs/python/tf/keras), paying special attention to the types, dimensions and order of arguments for each function.  As we suggested in Assignment 1, you'll want to **draw the shape of any matrices you work with on scrap paper**, or you may have trouble keeping track of your forward pass!

In [7]:
import tensorflow as tf
assert(tf.__version__.startswith("2."))



[Return to Top](#returnToTop)  
<a id = 'modelStructure'></a>
## RNNLM Model Structure

![RNNLM](images/rnnlm_layers.png)

Here's the basic spec for our model. We'll use the following notation:

- $w^{(i)}$ for the $i^{th}$ word of the sequence (as an integer index)
- $x^{(i)}$ for the vector representation of $w^{(i)}$
- $h^{(i)}$ for the $i^{th}$ hidden state, with indices as in Section 5.8 of the async
- $o^{(i)}$ for the $i^{th}$ output state, which may or may not be the same as the hidden state
- $y^{(i)}$ for the $i^{th}$ target word, which for a language model is always equal to $w^{(i+1)}$

Let $ h^{(-1)} = h^{init} $ be an initial state. For an input sequence of $n$ words and $i = 0, ..., n-1$, we have:

- **Embedding layer:** $ x^{(i)} = W_{in}[w^{(i)}] $
- **Recurrent layer:** $ (o^{(i)}, h^{(i)}) = \text{CellFunc}(x^{(i)}, h^{(i-1)}) $
- **Output layer:** $\hat{P}(y^{(i)}) = \hat{P}(w^{(i+1)}) = \text{softmax}(o^{(i)}W_{out} + b_{out}) $
 
$\text{CellFunc}$ can be an arbitrary function representing our recurrent cell - it can be a simple RNN cell, or something more complicated like an LSTM, or even a stacked multi-layer cell. *Note that the cell has its own internal, trainable parameters.*

It may be convenient to deal with the logits of the output layer, which are the un-normalized inputs to the softmax:

$$ \text{logits}^{(i)} = o^{(i)}W_{out} + b_{out} $$

We'll use these as shorthand for important dimensions:
- `V` : vocabulary size
- `H` : hidden state size = embedding size = per-cell output size
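
The three layers above can be traced in a few lines of NumPy. This is only a sketch under stated assumptions: `CellFunc` is taken to be a plain tanh RNN cell, and the parameter names `W_x`, `W_h` and `b`, the helper names, and the toy sizes are chosen for illustration and are not part of the assignment code.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def simple_cell(x, h_prev, W_x, W_h, b):
    """Assumed CellFunc: a tanh RNN cell whose output equals its new hidden state."""
    h = np.tanh(x @ W_x + h_prev @ W_h + b)
    return h, h

def rnnlm_forward(w, params, h_init):
    """Run the embedding, recurrent and output layers over the word ids in w."""
    h = h_init
    probs = []
    for w_i in w:
        x = params["W_in"][w_i]                          # embedding lookup, shape [H]
        o, h = simple_cell(x, h, params["W_x"], params["W_h"], params["b"])
        logits = o @ params["W_out"] + params["b_out"]   # shape [V]
        probs.append(softmax(logits))
    return np.stack(probs), h                            # shapes [n, V] and [H]

V, H = 15, 10   # toy sizes, chosen for illustration
rng = np.random.default_rng(0)
params = {
    "W_in": rng.uniform(-1, 1, (V, H)),
    "W_x": rng.uniform(-1, 1, (H, H)),
    "W_h": rng.uniform(-1, 1, (H, H)),
    "b": np.zeros(H),
    "W_out": rng.uniform(-1, 1, (H, V)),
    "b_out": np.zeros(V),
}
w = np.array([0, 3, 7, 2])   # a sequence of n = 4 word ids
P, h_final = rnnlm_forward(w, params, h_init=np.zeros(H))
print(P.shape, h_final.shape)
```

Each row of `P` is a distribution over the `V` vocabulary words for the next position, which is a useful shape to keep in mind when reading the Keras model later on.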

[Return to Top](#returnToTop)  
<a id = 'multiLayerCells'></a>
### Multi-Layer Cells

One popular technique for improving the performance of RNNs is to stack multiple layers. Conceptually, this is similar to an ordinary multi-layer network, such as those you implemented on Assignment 1.

![RNNLM - multicell](images/rnnlm_multicell.png)

**Recurrent layer 1** will take embeddings $ x^{(i)} $ as inputs and produce outputs $o^{(i)}_0$. We can feed these into **Recurrent layer 2**, and get another set of outputs $o^{(i)}_1$, and so on. Note that because the input dimension of an RNN cell is typically the same as the output, all of these layers will have the same shape.

In TensorFlow, a single RNN layer is built from an [LSTM cell](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTMCell), which represents one unit of time in our model (one word in our diagram above).  If you want to stack multiple layers at each time step, you can use [tf.keras.layers.StackedRNNCells](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StackedRNNCells). The `StackedRNNCells` object provides a vertically stacked cell, as shown by the dashed green lines above.

Time here is effectively an abstraction: the same cell, with the same parameters, is applied at every time position, so at the end of training each layer has a single set of weights shared across the whole sequence.
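
To make stacking and parameter sharing concrete, here is a minimal NumPy sketch. It assumes plain tanh RNN cells in place of LSTMs, and the helpers `make_cell`, `run_layer` and `run_stack` are hypothetical names used only for this illustration. Each layer owns one set of parameters, reused at every time position, and the outputs of one layer become the inputs of the next.

```python
import numpy as np

def make_cell(H, rng):
    """Create one set of parameters for a tanh RNN cell (input size = output size = H)."""
    return {"W_x": rng.uniform(-0.5, 0.5, (H, H)),
            "W_h": rng.uniform(-0.5, 0.5, (H, H)),
            "b": np.zeros(H)}

def run_layer(cell, inputs, h_init):
    """Apply the same cell parameters at every time position of inputs, shape [n, H]."""
    h = h_init
    outputs = []
    for x in inputs:
        h = np.tanh(x @ cell["W_x"] + h @ cell["W_h"] + cell["b"])
        outputs.append(h)
    return np.stack(outputs), h

def run_stack(cells, x, h_inits):
    """Feed the outputs of each layer into the next layer up."""
    finals = []
    for cell, h0 in zip(cells, h_inits):
        x, h_final = run_layer(cell, x, h0)
        finals.append(h_final)
    return x, finals

H, n, num_layers = 10, 6, 2   # toy sizes, chosen for illustration
rng = np.random.default_rng(0)
cells = [make_cell(H, rng) for _ in range(num_layers)]
x = rng.uniform(-1, 1, (n, H))   # stand-in for the embeddings x^(i)
top_outputs, final_states = run_stack(cells, x, [np.zeros(H)] * num_layers)
print(top_outputs.shape, len(final_states))
```

Because every layer maps `[n, H]` to `[n, H]`, layers can be stacked to any depth without reshaping, and each layer carries its own final hidden state.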

[Return to Top](#returnToTop)  
<a id = 'backProp'></a>
## Batching and Truncated Backpropagation Through Time (BPTT)

Batching for an RNN works the same as for any neural network: we'll run several copies of the RNN simultaneously, each with its own hidden state and outputs. Most TensorFlow functions are batch-aware, and expect `batch_size` as the first dimension.

With RNNs, however, we also need to consider the sequence length. In theory, we model our RNN as operating on sequences of arbitrary length, but in practice it's much more efficient to work with batches where all the sequences have the same (maximum) length. TensorFlow calls this dimension `max_time`.  _Note: since LSTMs model sequences, a lot of the nomenclature around them mentions "time".  Whenever you see a reference to "time" in documentation, just read it as "word sequence position(s)"._

Put together, it looks like this, where our inputs $w$ and targets $y$ will both be 2D arrays of shape `[batch_size, max_time]`.

![RNNLM - batching](images/rnnlm_batching.png)

Note that along the batch dimension, sequences are independent. Along the time dimension, the output of one timestep is fed into the next. 

In the common case of processing sequences longer than `max_time`, we can chop the input up into smaller chunks, and carry the final hidden state from one batch as the input to the next. For example, given the input `[a b c d e f g h]` and `max_time = 4`, we would run twice:
```
h_init    -> RNN on [a b c d] -> h_final_0
h_final_0 -> RNN on [e f g h] -> h_final_1
```
We can also do this with batches, taking care to construct our batches in such a way that each batch lines up with its predecessor. For example, with inputs `[a b c d e f g h]` and `[s t u v w x y z]`, we would do:
```
h_init    -> RNN on [a b c d] -> h_final_0
                    [s t u v]

h_final_0 -> RNN on [e f g h] -> h_final_1
                    [w x y z]
```
where our hidden states `h_init`, etc. have shape `[batch_size, state_size]`. (*Note that `state_size = H` for a simple RNN, but is larger for LSTMs or stacked cells.*)

Training in this setting is known as *truncated backpropagation through time*, or truncated BPTT. We can backpropagate errors within a batch for up to `max_time` timesteps, but not any further past the batch boundary. In practice with `max_time` greater than 20 or so, this doesn't significantly hurt the performance of our language model.
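
The chunking scheme can be sketched in NumPy as follows. This is an illustration only: `bptt_batches` and `toy_rnn` are hypothetical helpers, and the "RNN" here is a stand-in whose state just counts the tokens each row has seen, so that the carried state is easy to follow.

```python
import numpy as np

def bptt_batches(sequences, max_time):
    """Yield [batch_size, max_time] chunks; row k of each chunk continues row k of the previous one."""
    data = np.array(sequences)             # shape [batch_size, total_length]
    total_length = data.shape[1]
    for start in range(0, total_length - max_time + 1, max_time):
        yield data[:, start:start + max_time]

def toy_rnn(chunk, h):
    """Stand-in for the RNN: the state counts the tokens seen so far by each row."""
    return h + chunk.shape[1]

sequences = [list("abcdefgh"), list("stuvwxyz")]
h = np.zeros(len(sequences), dtype=int)    # h_init, one state per row of the batch
chunks = []
for chunk in bptt_batches(sequences, max_time=4):
    h = toy_rnn(chunk, h)                  # the final state of this chunk seeds the next one
    chunks.append(chunk)
    print(chunk.tolist(), "->", h.tolist())
```

In a real model the carried state would have shape `[batch_size, state_size]`, and gradients would flow only within each chunk, which is what makes the backpropagation "truncated".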

[Return to Top](#returnToTop)  
<a id = 'implementRnn'></a>
## Choosing an optimizer

For training steps, you can use any optimizer, but we recommend Adam (`tf.keras.optimizers.Adam` in TensorFlow 2) with gradient clipping (`tf.clip_by_global_norm`).  Adam adapts the learning rate on a per-variable basis, and also adds a "momentum" term that improves the speed of convergence. See [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/) for more.

When training with Adam, use `learning_rate = 0.01` as defined under "Training Parameters" (next to batch size, num epochs, etc.). If you use `learning_rate = 0.1` with Adam, training will likely be unstable or may diverge. (However, 0.1 works well with Adagrad and vanilla SGD.)
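
For intuition about what clipping by global norm does, here is a small NumPy sketch of the rescaling rule documented for `tf.clip_by_global_norm`: every gradient is multiplied by `clip_norm / max(global_norm, clip_norm)`. The function below is an illustrative re-implementation, not the TensorFlow one, and the gradient values are made up.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)   # equals 1.0 when no clipping is needed
    return [g * scale for g in grads], global_norm

# Two made-up gradient tensors whose joint norm is sqrt(3^2 + 4^2) = 5
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, clip_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Because all tensors share one scale factor, the direction of the overall update is preserved and only its length is capped, which helps with the occasional very large gradients that RNNs can produce.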


[Return to Top](#returnToTop)  
<a id = 'runOnToyInput'></a>
# B. Let's implement a simple RNNLM

In [9]:
# Imports
# NumPy and TensorFlow
import numpy as np
import tensorflow as tf
assert(tf.__version__.startswith("2."))

from tensorflow import keras
from tensorflow.keras import Input, Model, layers, backend
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# TensorBoard (imported from the public Keras API rather than tensorflow.python)
from tensorflow.keras.callbacks import TensorBoard
import datetime

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz

In [32]:
# Construct a very simple one-layer RNN
# We'll stick with using Keras functional format models as these give you more flexibility.

# Use a simple toy corpus
toy_corpus = "<s> Mary had a little lamb . <s> The lamb was white as snow . <s>"
sequence = np.array(toy_corpus.split())
labels = np.array("Mary had a little lamb . <s> The lamb was white as snow . <s> And".split())
vocab = vocabulary.Vocabulary(sequence)

# Model parameters
modelParams = dict(
    max_time = 8,   # number of words per sequence
    batch_size = 2, 
    learning_rate = 0.01,
    V = vocab.size,
    H = 10,
    num_layers = 1,
    dropout_rate = 0.1, 
    metrics = ['accuracy'],
    # metrics = ['categorical_accuracy']
)

# Define a function to create our LSTM model
def create_lstm_model(**kwargs):
    max_time = kwargs.get('max_time', 8)   # number of words per sequence
    batch_size = kwargs.get('batch_size', 2) 
    learning_rate = kwargs.get('learning_rate', 0.01)
    V = kwargs.get('V', vocab.size)
    H = kwargs.get('H', 10)
    num_layers = kwargs.get('num_layers', 1)
    dropout_rate = kwargs.get('dropout_rate', 0.1)
    metrics = kwargs.get('metrics', ['accuracy'])
    # Create an input layer for our model
    # input_ = Input(batch_shape = [batch_size, max_time], name="x")
    input_ = Input(shape = [max_time], name="x")
    # Create an embedding layer of dimension V x H
    embedding_layer = Embedding(input_dim=V, output_dim=H, 
                                embeddings_initializer=tf.keras.initializers.RandomUniform(minval=-1, maxval=1), 
                                trainable=True)
    x = embedding_layer(input_)
    
    # Create hidden layers
    for i in range(num_layers):
        # x, state, memory = keras.layers.LSTM(H, return_sequences = True, 
        #                   return_state=True, stateful=False)(x)
        x = keras.layers.LSTM(H, return_sequences = True, 
                          return_state=False, stateful=False)(x)
        x = keras.layers.Dropout(dropout_rate)(x)
    
    # Create an output layer with softmax output type
    outlayer = keras.layers.Dense(V, 
                                  activation = "softmax", 
                                  kernel_initializer = 
                                  tf.keras.initializers.RandomUniform(minval=-1.0, maxval=1.0), 
                                  bias_initializer = tf.keras.initializers.Zeros())
    
    # Predicted outputs from output layer
    yhat = outlayer(x)
    # Build / compile the model
    ## Use Adam optimizer
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, beta_1=0.9, 
                                         beta_2=0.999, epsilon=1e-07, amsgrad=False, 
                                         name='Adam')       
    ## Use accuracy metrics for evaluation
    # metrics = "accuracy"
    
    ## Use sparse categorical crossentropy since our labels are integer word ids 
    ## ...rather than one-hot vectors.
    loss_ = "sparse_categorical_crossentropy"
    
    ## Build / Compile model
    model = Model(inputs=input_, outputs=yhat, name="rnn_prediction_model")
    model.compile(optimizer = optimizer, loss = loss_,
                       metrics = metrics)
    return model

modelSimple = create_lstm_model(**modelParams)

In [33]:
# Let's inspect our model to get a feel for dimensions! 

# You can refer to model layers and weights through model.layers and each layer's .weights list
output_layer = modelSimple.layers[-1]
print('Output layer weights are dimension: {} and biases are dimension: {}.\n\n'.format(
    output_layer.weights[0].shape, output_layer.weights[1].shape))

hidden_layer = modelSimple.layers[-3]
print('Hidden layer weights are dimension:\n   input weights: {},\n   state-input weights: {},\n  and biases: {}'.format(
    hidden_layer.weights[0].shape, hidden_layer.weights[1].shape, 
    hidden_layer.weights[2].shape))
print('Note that input and state-input weights have H (10) each for:\n  --input,\n  --forget,\n  --memory, and\n  --output gates')

embedding_layer = modelSimple.layers[-4]
print('Shapes of embedding layer weights: {}'.format(
    embedding_layer.weights[0].shape))


Output layer weights are dimension: (10, 15) and biases are dimension: (15,).


Hidden layer weights are dimension:
   input weights: (10, 40),
   state-input weights: (10, 40),
  and biases: (40,)
Note that input and state-input weights have H (10) each for:
  --input,
  --forget,
  --memory, and
  --output gates
Shapes of embedding layer weights: (15, 10)
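
The `(10, 40)` shapes follow from packing four blocks of width `H` side by side, one per gate. The NumPy sketch below shows one LSTM time step with that layout. It is an illustration rather than Keras's implementation, and the block order (input, forget, memory, output) is an assumption that follows the order printed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, kernel, recurrent_kernel, bias):
    """One LSTM time step; the 4*H columns are split into four gate blocks of width H."""
    z = x @ kernel + h_prev @ recurrent_kernel + bias    # shape [4*H]
    z_i, z_f, z_c, z_o = np.split(z, 4)                  # assumed order: input, forget, memory, output
    i, f, o = sigmoid(z_i), sigmoid(z_f), sigmoid(z_o)
    c = f * c_prev + i * np.tanh(z_c)                    # new memory (cell) state
    h = o * np.tanh(c)                                   # new hidden state, also the output
    return h, c

H = 10
rng = np.random.default_rng(0)
kernel = rng.uniform(-0.1, 0.1, (H, 4 * H))             # "input weights", shape (10, 40)
recurrent_kernel = rng.uniform(-0.1, 0.1, (H, 4 * H))   # "state-input weights", shape (10, 40)
bias = np.zeros(4 * H)                                  # shape (40,)
h, c = lstm_step(rng.uniform(-1, 1, H), np.zeros(H), np.zeros(H),
                 kernel, recurrent_kernel, bias)
print(h.shape, c.shape)
```

The input kernel has `H` rows here only because the embedding size equals `H` in this model. With a different embedding size, only that kernel's first dimension would change.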


In [34]:
# Now train the model
ids = vocab.words_to_ids(sequence)
ids = np.vstack([ids]*100)
ids = ids.reshape([-1, 8])
y = vocab.words_to_ids(labels)
y = np.vstack([y]*100).reshape([-1, 8])
print(ids.shape, y.shape)
modelSimple.fit(ids, y, epochs = 20)

(200, 8) (200, 8)
Train on 200 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x149a10650>

<a id = 'tfDataset'></a>
[Return to Top](#returnToTop)
# C. Introducing the Dataset input utility

To make life easier, TensorFlow offers a [Dataset](https://www.tensorflow.org/guide/data) interface (`tf.data.Dataset`) that makes it easy to wrangle input and label data for training models (not just LSTMs, but any Keras model).

Let's use this to create an input dataset for a larger corpus for our next RNN model!
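
Before handing the work to `tf.data`, it may help to see the underlying bookkeeping in plain NumPy: truncate the id stream to whole rows, shift by one position to get the labels, then group rows into sequences and sequences into batches. The helper `make_xy_batches` is hypothetical, uses a slightly simpler truncation rule than the cell that follows, and runs on a made-up stream of ids.

```python
import numpy as np

def make_xy_batches(ids, max_time, batch_size):
    """Truncate, shift by one for the labels, then group into sequences and batches."""
    n = (len(ids) - 1) // max_time * max_time    # whole rows only, leaving one id for the last label
    x = ids[:n].reshape(-1, max_time)            # inputs,  shape [num_sequences, max_time]
    y = ids[1:n + 1].reshape(-1, max_time)       # targets, shape [num_sequences, max_time]
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

ids = np.arange(23)                              # stand-in for a corpus of word ids
batches = list(make_xy_batches(ids, max_time=5, batch_size=3))
for xb, yb in batches:
    print(xb.shape, yb.shape)
```

Each target row is its input row shifted one position ahead, matching $y^{(i)} = w^{(i+1)}$ from the model spec, and the last batch may be smaller than `batch_size`.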

In [115]:
# Load the dataset
V = 10000
vocab, train_ids, test_ids = utils.load_corpus("brown", split=0.8, V=V, shuffle=42)

max_time = 25   # number of words per sequence
batch_size = 100

# Look at the output shapes for train_ids, test_ids
print('Raw shapes:\n  train_ids: {},\n  test_ids: {}\n\n'.format(train_ids.shape, test_ids.shape))


# Truncate training_set to have complete rows only (considering our max_time parameter.)
trainlen_round = int((train_ids.shape[0] - max_time)/ max_time) * max_time
# trainlen_round = int((1000 - max_time)/ max_time) * max_time
trainroundx = train_ids[:trainlen_round]
trainroundy = train_ids[1:trainlen_round+1] # labels: the input words shifted one position ahead
# trainx_shaped = np.reshape(train_ids[:trainlen_round],[-1,max_time])
# trainy_shaped = np.reshape(train_ids[1:trainlen_round + 1],[-1,max_time])
print('Training input (x) shape: {},  training labels (y) shape: {}'.format(trainroundx.shape, trainroundy.shape))

# A Dataset object packages input numpy objects into tensorflow objects of the same dims


# Package trainx, trainy into a train input dataset
# trainds = tf.data.Dataset.from_tensor_slices((trainx_shaped, trainy_shaped)).shuffle(buffer_size = 3).batch(batch_size)
trainds = tf.data.Dataset.from_tensor_slices((trainroundx, trainroundy))

print(trainds)

# Dataset has a batch method.  Use this first to create sequences of max_time:
trainds = trainds.batch(max_time)

# Show that dataset batching did the right thing:
print('\n\nDemo-ing using batch to create sequences:')
for batch in trainds.take(4):
  print('Input array: {}\nLabel array: {}\n\n'.format(batch[0].numpy(), batch[1].numpy()))
 
# You can and should shuffle the rows of a dataset.  
# In fact, TF will warn you if you use an un-shuffled dataset.
# You can read about shuffle in the tf.data.Dataset.shuffle documentation.
trainds = trainds.shuffle(max_time * batch_size)  # Shuffle sequences within a buffer of max_time * batch_size rows

# Finally, create a batch dimension for input data
trainds = trainds.batch(batch_size)


# Repeat for test set
testlen_round = int((test_ids.shape[0] - max_time) / max_time) * max_time
testroundx = test_ids[:testlen_round]
testroundy = test_ids[1:testlen_round + 1]
# Bundle testx, testy into a validation input dataset
testds = tf.data.Dataset.from_tensor_slices((testroundx, testroundy)).batch(max_time).batch(batch_size)

[nltk_data] Downloading package brown to /Users/drewplant/nltk_data...
[nltk_data]   Package brown is already up-to-date!


Vocabulary: 10,000 types
Loaded 57,340 sentences (1.16119e+06 tokens)
Training set: 45,872 sentences (924,077 tokens)
Test set: 11,468 sentences (237,115 tokens)
Raw shapes:
  train_ids: (969950,),
  test_ids: (248584,)


Training input (x) shape: (969925,),  training labels (y) shape: (969925,)
<TensorSliceDataset shapes: ((), ()), types: (tf.int32, tf.int32)>


Demo-ing using batch to create sequences:
Input array: [   0  304  657  434    0    7   42  213   42   36  977  391    5    0
 5970 1097    3  250   34    3    2    8    3 1196    5]
Label array: [ 304  657  434    0    7   42  213   42   36  977  391    5    0 5970
 1097    3  250   34    3    2    8    3 1196    5    0]


Input array: [   0   28   58 1398   42    8  140    9 2684 2806    6  126  113    3
 1114   15    3 4526   58  422    5    0    3    2   16]
Label array: [  28   58 1398   42    8  140    9 2684 2806    6  126  113    3 1114
   15    3 4526   58  422    5    0    3    2   16 5080]


Input array: [5080 8564 

<a id = 'runOnLargerInput'></a> 
[Return to Top](#returnToTop)
# D. An industrial-strength RNN on larger data
Let's build a new RNN model and fit it to our larger Brown corpus dataset.

You can use [TensorBoard](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard) to visualize your model's structure and to view training and validation accuracy as training progresses.

To create TensorBoard data, simply create a callback and pass it to `fit()` as shown below.  Then run the `tensorboard` command in a terminal window.

In [8]:
# Define model parameters
modelParams = dict(
    max_time = 25,   # number of words per sequence
    batch_size = 100, 
    learning_rate = 0.01,
    V = vocab.size,
    H = 200,
    num_layers = 2,
    dropout_rate = 0.1,
    metrics = ['sparse_categorical_accuracy']
)

modelindustrial = create_lstm_model(**modelParams)

# Create a tensorboard callback
tb_logpath = './logs'
tb_callback = tf.keras.callbacks.TensorBoard(tb_logpath, update_freq=1)
print('To open tensorboard, execute the following command:\n  tensorboard --logdir={}\n\n'.format(tb_logpath))

# Train our model using training and validation datasets
modelindustrial.fit(trainds, validation_data = testds, epochs = 10, 
                    callbacks = [tb_callback])

To open tensorboard, execute the following command:
  tensorboard --logdir=./logs




NameError: name 'trainds' is not defined

<a id = 'goToPartII'></a> 
[Return to Top](#returnToTop)
# E. Intro to part II -- Advanced training using sampled softmax loss  

This notebook has shown how to build and train a basic LSTM model using Keras.  In part II (coming), you will learn to customize your model, loss and metric functions to gain greater flexibility in how training and inference are carried out with your LSTM.