<a id = 'top'></a>

# Notebook Contents
  * A. [Introduction to RNN](#introToRnn) 
    * 1. [Model Structure](#modelStructure)
    * 2. [Multi-Layer Cells](#multiLayerCells)
    * 3. [Batching and Truncated Backpropagation Through Time (BPTT)](#backProp)
  * B. [Introducing the StringLookup and Dataset utilities](#kerasUtilities)  
  * C. [Create model / setup Tensorboard / train](#createTensorboardTrain)

<a id = 'introToRnn'></a>
# A. Introduction to Recurrent Neural Network Language Model

In this part, we'll learn about building a recurrent neural network language model (RNNLM) using TensorFlow Keras. This class of models represented the cutting edge in language modeling about 5 years ago.  Even though transformers (like BERT) are now the state of the art for overall accuracy, LSTMs tend to take much less time to train, so with limited training time and compute resources they can produce surprisingly good results.  Analyzing and building them is also useful for understanding concepts fundamental to all neural network architectures (states, input / output dimensions, batching, setting up loss and metrics).

As a reference, you may want to review the following:

- [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) (Andrej Karpathy, 2015)
- [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) (Chris Olah, 2015)
- [A Tensorflow / Keras tutorial on using RNNs for text generation](https://www.tensorflow.org/tutorials/text/text_generation) (updated Oct. 2020)

The specific model we'll build is based on the following papers. You should skim these (particularly the first one), but you don't need to read them in detail:

- [Recurrent neural network based language model](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf) (Mikolov, et al. 2010)
- [Exploring the Limits of Language Modeling](http://arxiv.org/pdf/1602.02410.pdf) (Jozefowicz, et al. 2016)

We'll build our model entirely in TensorFlow Keras, so you may want to review the [TensorFlow section of assignment 1](../a1/tensorflow/tensorflow.ipynb).

Finally, you may want to consult the [TensorFlow Keras API reference](https://www.tensorflow.org/api_docs/python/tf/keras), paying special attention to the types, dimensions and order of arguments for each function.  As we suggested in Assignment 1, you'll want to **draw the shape of any matrices you work with on scrap paper**, or you may have trouble keeping track of your forward pass!

# Notebook Overview

Notebook I consists of 7 parts:



[Return to Top](#top)  
<a id = 'modelStructure'></a>
## RNNLM Model Structure

![RNNLM](images/rnnlm_layers.png)

Here's the basic spec for our model. We'll use the following notation:

- $w^{(i)}$ for the $i^{th}$ word of the sequence (as an integer index)
- $x^{(i)}$ for the vector representation of $w^{(i)}$
- $h^{(i)}$ for the $i^{th}$ hidden state, with indices as in Section 5.8 of the async
- $o^{(i)}$ for the $i^{th}$ output state, which may or may not be the same as the hidden state
- $y^{(i)}$ for the $i^{th}$ target word, which for a language model is always equal to $w^{(i+1)}$

Let $ h^{(-1)} = h^{init} $ be an initial state. For an input sequence of $n$ words and $i = 0, ..., n-1$, we have:

- **Embedding layer:** $ x^{(i)} = W_{in}[w^{(i)}] $
- **Recurrent layer:** $ (o^{(i)}, h^{(i)}) = \text{CellFunc}(x^{(i)}, h^{(i-1)}) $
- **Output layer:** $\hat{P}(y^{(i)}) = \hat{P}(w^{(i+1)}) = \text{softmax}(o^{(i)}W_{out} + b_{out}) $
 
$\text{CellFunc}$ can be an arbitrary function representing our recurrent cell - it can be a simple RNN cell, or something more complicated like an LSTM, or even a stacked multi-layer cell. *Note that the cell has its own internal, trainable parameters.*

It may be convenient to deal with the logits of the output layer, which are the un-normalized inputs to the softmax:

$$ \text{logits}^{(i)} = o^{(i)}W_{out} + b_{out} $$

We'll use these as shorthand for important dimensions:
- `V` : vocabulary size
- `H` : hidden state size = embedding size = per-cell output size
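
As a rough illustration of the spec above, here is a minimal NumPy sketch of one forward pass, with a simple tanh RNN cell standing in for $\text{CellFunc}$. The toy sizes, the random weights and the helper name `rnnlm_forward` are assumptions made for illustration; the model built later in this notebook uses Keras layers instead.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rnnlm_forward(w, W_in, W_x, W_h, b_h, W_out, b_out, h_init):
    """Run a simple-RNN language model over one sequence of word ids."""
    h = h_init
    probs = []
    for w_i in w:
        x = W_in[w_i]                          # embedding layer: row lookup, shape [H]
        h = np.tanh(x @ W_x + h @ W_h + b_h)   # recurrent layer (simple RNN, so o = h)
        logits = h @ W_out + b_out             # output layer logits, shape [V]
        probs.append(softmax(logits))
    return np.stack(probs), h

V, H = 7, 4                                    # toy sizes, chosen arbitrarily
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, H))
W_x, W_h = rng.normal(size=(H, H)), rng.normal(size=(H, H))
b_h, W_out, b_out = np.zeros(H), rng.normal(size=(H, V)), np.zeros(V)

probs, h_final = rnnlm_forward([1, 5, 2], W_in, W_x, W_h, b_h, W_out, b_out, np.zeros(H))
print(probs.shape)   # (3, 7): one distribution over V words per input position
```

Each row of `probs` is the predicted distribution $\hat{P}(w^{(i+1)})$ for one input position, and `h_final` is the state that would be carried into the next chunk of text.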

[Return to Top](#top)  
<a id = 'multiLayerCells'></a>
### Multi-Layer Cells

One popular technique for improving the performance of RNNs is to stack multiple layers. Conceptually, this is similar to an ordinary multi-layer network, such as those you implemented on Assignment 1.

![RNNLM - multicell](images/rnnlm_multicell.png)

**Recurrent layer 1** will take embeddings $ x^{(i)} $ as inputs and produce outputs $o^{(i)}_0$. We can feed these into **Recurrent layer 2**, and get another set of outputs $o^{(i)}_1$, and so on. Note that because the input dimension of an RNN cell is typically the same as the output, all of these layers will have the same shape.

In TensorFlow, a single RNN layer is composed of an [LSTM cell](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTMCell), which represents one unit of time in our model (one word in our diagram above).  If you want to stack multiple layers at each time-step you can use [tf.keras.layers.StackedRNNCells](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StackedRNNCells). The `StackedRNNCells` object provides a vertically-stacked cell, as shown by the dashed green lines above.

The concept of time here is effectively an abstraction: the same cell is applied at every time position, so its parameters are shared across positions, both during and after training.
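
To make the stacking concrete, the sketch below runs a few time steps through a stack of simple tanh cells in NumPy, feeding each layer's output to the layer above it. It is an illustration under assumed toy sizes, not how `StackedRNNCells` is implemented; `stacked_step` and the parameter layout are hypothetical names introduced here.

```python
import numpy as np

def stacked_step(x, states, params):
    """One time step through a stack of simple RNN cells.

    x:      input vector at this time step, shape [H]
    states: list with one hidden state per layer, each of shape [H]
    params: list of (W_x, W_h, b) tuples, one per layer
    """
    new_states = []
    layer_input = x
    for h_prev, (W_x, W_h, b) in zip(states, params):
        h = np.tanh(layer_input @ W_x + h_prev @ W_h + b)
        new_states.append(h)
        layer_input = h          # this layer's output is the next layer's input
    return layer_input, new_states

H, n_layers = 4, 2               # toy sizes, chosen arbitrarily
rng = np.random.default_rng(0)
params = [(rng.normal(size=(H, H)), rng.normal(size=(H, H)), np.zeros(H))
          for _ in range(n_layers)]
states = [np.zeros(H) for _ in range(n_layers)]

# The same parameters are reused at every time position.
for x in rng.normal(size=(3, H)):
    output, states = stacked_step(x, states, params)
```

Note that the stack keeps one state per layer, which is why the combined state of a stacked cell is larger than `H`.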

[Return to Top](#top)  
<a id = 'backProp'></a>
## Batching and Truncated Backpropagation Through Time (BPTT)

Batching for an RNN works the same as for any neural network: we'll run several copies of the RNN simultaneously, each with their own hidden state and outputs. Most TensorFlow functions are batch-aware, and expect `batch_size` as the first dimension.

With RNNs, however, we also need to consider the sequence length. In theory, we model our RNN as operating on sequences of arbitrary length, but in practice it's much more efficient to work with batches where all the sequences have the same (maximum) length. TensorFlow calls this dimension `max_time`.  _Note: since LSTMs model sequences, a lot of the nomenclature around them mentions "time".  Whenever you see a reference to "time" in documentation, just read it as "word sequence position(s)"._

Put together, it looks like this, where our inputs $w$ and targets $y$ will both be 2D arrays of shape `[batch_size, max_time]`.

![RNNLM - batching](images/rnnlm_batching.png)

Note that along the batch dimension, sequences are independent. Along the time dimension, the output of one timestep is fed into the next. 

In the common case of processing sequences longer than `max_time`, we can chop the input up into smaller chunks, and carry the final hidden state from one batch as the input to the next. For example, given the input `[a b c d e f g h]` and `max_time = 4`, we would run twice:
```
h_init    -> RNN on [a b c d] -> h_final_0
h_final_0 -> RNN on [e f g h] -> h_final_1
```
We can also do this with batches, taking care to construct our batches in such a way that each batch lines up with its predecessor. For example, with inputs `[a b c d e f g h]` and `[s t u v w x y z]`, we would do:
```
h_init    -> RNN on [a b c d] -> h_final_0
                    [s t u v]

h_final_0 -> RNN on [e f g h] -> h_final_1
                    [w x y z]
```
where our hidden states `h_init`, etc. have shape `[batch_size, state_size]`. (*Note that `state_size = H` for a simple RNN, but is larger for LSTMs or stacked cells.*)

Training in this setting is known as *truncated backpropagation through time*, or truncated BPTT. We can backpropagate errors within a batch for up to `max_time` timesteps, but not any further past the batch boundary. In practice with `max_time` greater than 20 or so, this doesn't significantly hurt the performance of our language model.
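
One way to build batches that line up with their predecessors is to lay the id stream out as `batch_size` long rows and then slice `max_time` columns at a time. The NumPy sketch below reproduces the two-row example from this section; `bptt_batches` is a hypothetical helper written for illustration, not part of TensorFlow.

```python
import numpy as np

def bptt_batches(ids, batch_size, max_time):
    """Yield [batch_size, max_time] blocks whose rows continue across blocks."""
    ids = np.asarray(ids)
    n_steps = len(ids) // (batch_size * max_time)      # whole blocks only
    usable = ids[:batch_size * n_steps * max_time]
    rows = usable.reshape(batch_size, -1)               # one long stream per row
    for k in range(n_steps):
        yield rows[:, k * max_time:(k + 1) * max_time]

stream = list("abcdefgh") + list("stuvwxyz")
blocks = [b.tolist() for b in bptt_batches(stream, batch_size=2, max_time=4)]
print(blocks[0])   # [['a', 'b', 'c', 'd'], ['s', 't', 'u', 'v']]
print(blocks[1])   # [['e', 'f', 'g', 'h'], ['w', 'x', 'y', 'z']]
```

Row `j` of block `k + 1` continues row `j` of block `k`, so the final hidden state of one block can be passed in as the initial state of the next, while gradients stop at the block boundary.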

[Return to Top](#top)  
<a id = 'implementRnn'></a>
## Choosing an optimizer

For training steps, you can use any optimizer, but we recommend Adam (`tf.keras.optimizers.Adam` in TensorFlow 2; `tf.train.AdamOptimizer` is the TensorFlow 1 name) with gradient clipping (`tf.clip_by_global_norm`).  Adam adjusts the learning rate on a per-variable basis, and also adds a "momentum" term that improves the speed of convergence. See [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/) for more.

When training with Adam, use `learning_rate = 0.01` as defined under "Training Parameters" (next to batch size, num epochs, etc.). If you use `learning_rate = 0.1` with Adam, the model will likely overfit or training may be unstable. (However, 0.1 works well with Adagrad and vanilla SGD.)
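
Global-norm clipping rescales every gradient by the same factor whenever their combined L2 norm exceeds a threshold, which preserves the direction of the overall update. The NumPy sketch below illustrates that rule as a simplified stand-in for `tf.clip_by_global_norm`; the function is a re-implementation for illustration only, and the threshold of 1.0 is an arbitrary assumption.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)   # 1.0 when already small enough
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
print(norm)      # 5.0
print(clipped)   # every entry scaled by 1/5
```

Because the scale factor is shared, large gradients in one variable shrink the update for all variables, unlike per-variable clipping.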


[Return to Top](#top)  
<a id = 'runOnToyInput'></a>
# (B) Let's implement a simple RNNLM

In [40]:
# Imports
import datetime
import os
import time

# NumPy and TensorFlow
import numpy as np
import tensorflow as tf
print(tf.__version__)
assert tf.__version__.startswith("2.")

from tensorflow import keras
from tensorflow.keras import Input, Model, backend, layers
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.layers.experimental.preprocessing import StringLookup

# Tensorboard (imported from the public Keras API rather than tensorflow.python)
from tensorflow.keras.callbacks import TensorBoard

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz

# From sklearn
from sklearn.model_selection import train_test_split

2.3.0


<a id = 'kerasUtilities'></a>
[Return to Top](#top)
# B. Keras utilities 
We'll introduce __StringLookup__ for vocabulary creation and __Dataset__ for training / label pre-processing.

In [2]:
# Load "brown" dataset
from importlib import reload
reload(utils)
name = "brown"
import nltk
assert(nltk.download(name))
corpus = getattr(nltk.corpus, name)
sentences = list(corpus.sents())

# Standardize words -- lower-case characters, convert numbers to a standard code, etc.
# Also insert a '<s>' token at the beginning and end of every sentence.
# Sentences differ in length, so keep them in a plain list rather than a ragged NumPy array.
canonsentences = [['<s>'] + [utils.canonicalize_word(word) for word in sentence] + ['<s>'] for sentence in sentences]
print('An example of pre-standardized sentence:\n  {}'.format(sentences[0]))
print('\n\nand after standardization:\n  {}'.format(canonsentences[0]))

[nltk_data] Downloading package brown to /Users/drewplant/nltk_data...
[nltk_data]   Package brown is already up-to-date!


An example of pre-standardized sentence:
  ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']


and after standardization:
  ['<s>', 'the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', "atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.', '<s>']


## Keras PreProcessing:  StringLookup

If you're using your own input dataset to train a custom embedding layer, you will need a way to convert words to ids for the embedding-layer lookup operation.  For this purpose, Keras provides a convenient utility, the StringLookup object, which creates an input vocabulary as well as a lookup dictionary.  Refer to the Keras documentation [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/StringLookup).

In [3]:
# Size of corpus
print('Length of brown corpus is {} sentences'.format(len(canonsentences)))

# Convert to single dimension of words
canonwords = [ word for sentence in canonsentences for word in sentence]
print('Length of words in brown corpus is {}'.format(len(canonwords)))

Length of brown corpus is 57340 sentences
Length of words in brown corpus is 1275872


In [4]:
# Create the string lookup object using the 10000 most frequent words
words_to_ids = StringLookup(max_tokens = 10000)

# Process the input corpus words, creating a vocabulary / id lookup:
words_to_ids.adapt(canonwords)

# Get vocabulary size
V = len(words_to_ids.get_vocabulary())
print('Extracted vocabulary length is {}'.format(V))

# Also create an object to convert from ids back to words from the same vocabulary:
ids_to_words = StringLookup(vocabulary=words_to_ids.get_vocabulary(), invert=True)

Extracted vocabulary length is 10000
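
To see what a vocabulary lookup does conceptually, here is a small pure-Python sketch that keeps the most frequent words and maps everything else to a shared out-of-vocabulary id. It is an assumption-laden stand-in, not the `StringLookup` implementation: the helper `build_vocab` and the choice of id 0 for `[UNK]` are made up for this example, and the real layer's index layout may differ (the generation code later in this notebook, for instance, also treats an empty-string token as part of the vocabulary).

```python
from collections import Counter

def build_vocab(words, max_tokens, oov_token="[UNK]"):
    """Keep the most frequent words; everything else shares the OOV id 0."""
    counts = Counter(words)
    kept = [w for w, _ in counts.most_common(max_tokens - 1)]
    vocab = [oov_token] + kept
    word_to_id = {w: i for i, w in enumerate(vocab)}
    return vocab, word_to_id

words = "the cat sat on the cat the end".split()
vocab, word_to_id = build_vocab(words, max_tokens=3)
ids = [word_to_id.get(w, 0) for w in words]
print(vocab)   # ['[UNK]', 'the', 'cat']
print(ids)     # [1, 2, 0, 0, 1, 2, 1, 0]
```

The inverse mapping is simply `vocab[i]`, which is the role `ids_to_words` plays above.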


## Keras Dataset input utility

To make life easier, TensorFlow offers the [Dataset](https://www.tensorflow.org/guide/data) interface (`tf.data.Dataset`) to wrangle input and label data for feeding and training models (not just an LSTM model, but any Keras model).

Let's use this to create an input dataset for training and evaluating our RNN model!

In [12]:
# Create training / test sets of word ids 
corpus_ids = words_to_ids(canonwords).numpy()

# Split into train (80%) dev (10%) test (10%)
train_ids, dev_test_ids = train_test_split(corpus_ids, train_size=0.8, random_state=42, shuffle=False)

dev_ids, test_ids = train_test_split(dev_test_ids, train_size=0.5, random_state=42, shuffle=False)

x_ids_train = train_ids[:-1]
y_ids_train = train_ids[1:]

# inputs of length max_time words
max_time = 25   # length of words per sequence
buffer_size = 100
batch_size = 100

ids_labels_dataset = tf.data.Dataset.from_tensor_slices((x_ids_train, y_ids_train))
# examples_per_epoch = len(corpus_ids)//(max_time+1)

# Create a train sequence dimension for words.  
sequences_train = ids_labels_dataset.batch(max_time, drop_remainder=True).shuffle(buffer_size).batch(
    batch_size, drop_remainder=True)

# Create a dataset for validating during fit
x_dev = dev_ids[:-1]
y_dev = dev_ids[1:]
ids_labels_validation = tf.data.Dataset.from_tensor_slices((x_dev, y_dev))
sequences_val = ids_labels_validation.batch(max_time, drop_remainder=True).shuffle(buffer_size).batch(
    batch_size, drop_remainder=True)
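
The pairing of inputs and labels in the cell above can be hard to picture, so here is a pure-Python sketch of the same idea on a tiny id stream: shift by one position to get the targets, then cut both streams into chunks of `max_time`, dropping the incomplete tail. `make_windows` is a hypothetical helper for illustration and does not use `tf.data`.

```python
def make_windows(ids, max_time):
    """Pair each input id with the next id, then cut both into max_time chunks."""
    x, y = ids[:-1], ids[1:]
    n = len(x) // max_time                      # drop the incomplete tail
    xs = [x[i * max_time:(i + 1) * max_time] for i in range(n)]
    ys = [y[i * max_time:(i + 1) * max_time] for i in range(n)]
    return xs, ys

xs, ys = make_windows(list(range(10)), max_time=4)
print(xs)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(ys)   # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Every target row is its input row shifted left by one word, which matches $y^{(i)} = w^{(i+1)}$ from the model spec.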





In [5]:
# inputs of length max_time words
max_time = 25   # length of words per sequence
batch_size = 100

# Create training / test sets of word ids without using dataset
corpus_ids = words_to_ids(canonwords).numpy()

# Truncate so the ids divide evenly into batches of batch_size sequences of max_time words
trunc_id = (( len(corpus_ids)-1 ) // (max_time * batch_size)) * (max_time * batch_size)
corpus_x_ids = corpus_ids[:trunc_id]
corpus_y_ids = corpus_ids[1:trunc_id+1]

# Add in word_length dimension
corpus_x_ids = corpus_x_ids.reshape([-1, max_time])
corpus_y_ids = corpus_y_ids.reshape([-1, max_time])

# Split into train (80%) and test (20%); this cell does not create a separate dev set
train_x, test_x, train_y, test_y = train_test_split(corpus_x_ids, corpus_y_ids,  train_size=0.8, random_state=42, shuffle=True)

# dev_ids, test_ids = train_test_split(dev_test_ids, train_size=0.5, random_state=42, shuffle=False)

<a id = 'createTensorboardTrain'></a> 
[Top](#top)
# C. Create the model.  Setup tensorboard.  Train the model.
Let's build a new RNN model and fit it to our larger Brown corpus dataset.

You can use [Tensorboard](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard) for visualizing your model's structure, as well as viewing training and validation accuracy as your training fit progresses.

To create TensorBoard data, create a `TensorBoard` callback and include it in the `callbacks` list passed to `fit()`.  Then run the `tensorboard` command in a terminal window.

In [114]:
# 1. Single-layer version (this one works)
# The LSTM layer takes two relevant arguments:
#   return_state (when True, the layer returns lstm_state, lstm_last_time_state, cell_state)
#   return_sequences (when True, the 'lstm_state' returned object is the vector output
#   for all time positions in the sequence.)
#
#   Note that for the case (return_sequences = False, return_state = True) lstm_state and lstm_last_time_state
#   are the same tensor.
#
# Here is a good article illustrating the two options:
#    https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/


# Let's build a model class to instantiate our model 
# ...and more closely control training / inference behavior.
class MyModel(keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units, hidden_activation, hidden_initializer,
                batchnorm = True):
        super().__init__()
        self.embedding = keras.layers.Embedding(vocab_size, embedding_dim)
        # self.rnn = keras.layers.GRU(rnn_units, return_sequences = True, return_state=True, 
        #                              activation = hidden_activation, kernel_initializer = hidden_initializer, 
        #                              stateful=False)
        
        self.rnn = tf.keras.layers.LSTM(rnn_units, return_sequences=True, return_state=True,
                                        activation = hidden_activation, 
                                        kernel_initializer = hidden_initializer)
        
        self.norm = tf.keras.layers.BatchNormalization()
        self.do_batchnorm = batchnorm
    
        # tf.keras.layers.GRU(rnn_units,
        #                                return_sequences=True, 
        #                                return_state=True)
        self.dense = keras.layers.Dense(vocab_size)
 
    # You must set return_sequences=True when stacking LSTM layers so that the second LSTM layer has a three-dimensional sequence input.

    def call(self, inputs, states=None, return_state=False, training=False):
        x = inputs
        x = self.embedding(x, training=training)
        if states is None:
            states = self.rnn.get_initial_state(x)
            
        # In the following expression the return values are:  
        #         x = the sequence of outputs from the layer, 
        #   state_h = the final hidden state vector (at the last time-step)
        #   state_c = the cell memory at the final step
        x, state_h, state_c = self.rnn(x, initial_state=states, training=training)
        if self.do_batchnorm:
            x = self.norm(x, training=training)
        
        # Output layer outputs logits rather than softmax as we didn't specify any activation
        x = self.dense(x, training=training)
        
        if return_state:
            return x, [state_h, state_c]
        else: 
            return x

In [153]:
# 2. Multi-layer version: stacks n_layers LSTM layers
# The LSTM layer takes two relevant arguments:
#   return_state (when True, the layer returns lstm_state, lstm_last_time_state, cell_state)
#   return_sequences (when True, the 'lstm_state' returned object is the vector output
#   for all time positions in the sequence.)
#
#   Note that for the case (return_sequences = False, return_state = True) lstm_state and lstm_last_time_state
#   are the same tensor.
#
# Here is a good article illustrating the two options:
#    https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/


# Let's build a model class to instantiate our model 
# ...and more closely control training / inference behavior.
class MyModel(keras.Model):
    def __init__(self, vocab_size, embedding_dim, n_layers, rnn_units, hidden_activation, dropout_rate,
                 hidden_initializer, batchnorm = True):
        super().__init__()
        self.embedding = keras.layers.Embedding(vocab_size, embedding_dim)
        # self.rnn = keras.layers.GRU(rnn_units, return_sequences = True, return_state=True, 
        #                              activation = hidden_activation, kernel_initializer = hidden_initializer, 
        #                              stateful=False)
        self.n_layers = n_layers
        self.rnn = []
        self.norm = []
        self.dropout = []
        for i in range(n_layers):
            self.rnn.append(tf.keras.layers.LSTM(rnn_units, return_sequences=True, return_state=True, 
                                                 activation = hidden_activation,
                                                 kernel_initializer = hidden_initializer))
            self.norm.append(tf.keras.layers.BatchNormalization())
            self.dropout.append(tf.keras.layers.Dropout(dropout_rate))
            
        # self.rnn = tf.keras.layers.LSTM(rnn_units, return_sequences=True, return_state=True,
        #                                 activation = hidden_activation, 
        #                                 kernel_initializer = hidden_initializer)
        
        self.do_batchnorm = batchnorm
    
        # tf.keras.layers.GRU(rnn_units,
        #                                return_sequences=True, 
        #                                return_state=True)
        self.dense = keras.layers.Dense(vocab_size)
 
    # You must set return_sequences=True when stacking LSTM layers so that the second LSTM layer has a three-dimensional sequence input.

    def call(self, inputs, passin_states=None, return_state=False, training=False):
        x = inputs
        x = self.embedding(x, training=training)
            
        # In the following expression the return values are:  
        #         x = the sequence of outputs from the layer, 
        #   state_h = the final state vector (at the last time-step)
        #   state_c = the cell memory at the final step
        states = []
        for i in range(self.n_layers):
            if passin_states is None:
                statesi = self.rnn[i].get_initial_state(x)
            else:
                statesi = passin_states[i]
            x, state_h, state_c = self.rnn[i](x, initial_state=statesi, training=training)
            # x, state_h, state_c = self.rnn(x, initial_state=states, training=training)
            x = self.dropout[i](x, training=training)
            if self.do_batchnorm:
                x = self.norm[i](x, training=training)
            states.append((state_h, state_c))
        
        # Output layer outputs logits rather than softmax as we didn't specify any activation
        x = self.dense(x, training=training)
        
        if return_state:
            return x, states
        else: 
            return x

In [162]:
# Size of the vocabulary in words (should equal V from the StringLookup layer)
vocab_size = 10000

# The embedding dimension
# embedding_dim = 256
embedding_dim = 50

# Number of hidden layers
n_layers = 2

# Number of RNN units
rnn_units = 100

hidden_activation = 'relu'

hidden_initializer = 'he_uniform'

# Dropout
dropout_rate = 0.1

# Create model instance
model = MyModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    n_layers = n_layers,
    rnn_units=rnn_units,
    hidden_activation = hidden_activation, 
    hidden_initializer = hidden_initializer,
    dropout_rate = dropout_rate,
    batchnorm = True)

In [163]:
# Get a feel for looking at training samples in our input Dataset
for input_example_batch, target_example_batch in sequences_train.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")
    print(input_example_batch.shape, target_example_batch.shape)

# Print out a model summary
model.summary()

(100, 25, 10000) # (batch_size, sequence_length, vocab_size)
(100, 25) (100, 25)
Model: "my_model_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     multiple                  500000    
_________________________________________________________________
lstm_24 (LSTM)               multiple                  60400     
_________________________________________________________________
lstm_25 (LSTM)               multiple                  80400     
_________________________________________________________________
batch_normalization_15 (Batc multiple                  400       
_________________________________________________________________
batch_normalization_16 (Batc multiple                  400       
_________________________________________________________________
dropout_6 (Dropout)          multiple                  0         
________________________________________
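
The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias, and four per-feature vectors for each batch-normalization layer; `lstm_params` is a helper name introduced here for illustration.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * ((input_dim + units) * units + units)

vocab_size, embedding_dim, rnn_units = 10000, 50, 100

print(vocab_size * embedding_dim)             # 500000 (embedding)
print(lstm_params(embedding_dim, rnn_units))  # 60400  (first LSTM, fed by the embeddings)
print(lstm_params(rnn_units, rnn_units))      # 80400  (second LSTM, fed by the first)
print(4 * rnn_units)                          # 400    (each batch-normalization layer)
```

The second LSTM is larger than the first only because its input is the 100-dimensional output of the first layer rather than the 50-dimensional embedding.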

In [164]:
# See the behavior of loss function, how to take mean loss over batch
# We will use "from_logits" = True since our outputs are logits rather than softmax (ie, [batch,seq_len,V])
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

# We can calculate an example loss using eager execution.
example_batch_loss = loss(target_example_batch, example_batch_predictions)
print('Shape of example batch loss: {}'.format(example_batch_loss.numpy().shape))
mean_loss = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss)

# Check that TensorFlow uses the natural log (base e) for the crossentropy calculation:
# with logits [0, 20, 0] and label 0, the loss is ln(1 + e^20 + 1) - 0, which is about 20 nats.
checkbase = loss(np.array([[[0]]]), np.array([[[0., 20, 0.]]]))
print(checkbase.numpy())
# Confirm that exp(mean loss) ~ V  (why?)
# If initialization is good, each predicted probability q ~ 1 / V, so the loss is -ln(1/V) = ln(V)
# and exp(ln(V)) = V.  To convert nats to bits: ln(x) = log2(x) * ln(2).
print(tf.exp(mean_loss).numpy())

Shape of example batch loss: ()
Prediction shape:  (100, 25, 10000)  # (batch_size, sequence_length, vocab_size)
Mean loss:         9.210451
20.0
10001.107
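
The numbers printed above can be sanity-checked without TensorFlow. The short sketch below, using only the standard library, computes the cross-entropy of a uniform prediction over `V = 10000` words and converts the measured mean loss (copied from the output above) from nats to bits.

```python
import math

V = 10000
uniform_loss = -math.log(1.0 / V)     # cross-entropy when every word gets probability 1/V
print(uniform_loss)                   # about 9.2103
print(math.exp(uniform_loss))         # about 10000, i.e. the perplexity equals V

# Converting the measured mean loss from nats to bits:
mean_loss = 9.210451
print(mean_loss / math.log(2))        # about 13.29 bits per word
```

An untrained model that starts close to this value is behaving as expected; a much larger initial loss would suggest a problem with initialization or with the loss setup.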


In [165]:
# Compile model
model.compile(optimizer='adam', loss=loss, metrics = ['sparse_categorical_accuracy'])

In [166]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [167]:
# Train
EPOCHS = 20
history = model.fit(sequences_train, 
                    validation_data = sequences_val, epochs=EPOCHS, 
                    callbacks=[checkpoint_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [168]:
#### Generate text
class OneStep(tf.keras.Model):
  def __init__(self, model, ids_to_words, words_to_ids, temperature=1.0):
    super().__init__()
    self.temperature=temperature
    self.model = model
    self.ids_to_words = ids_to_words
    self.words_to_ids = words_to_ids

    # Create a mask to prevent "" or "[UNK]" from being generated.
    skip_ids = self.words_to_ids(['','[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices = skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(words_to_ids.get_vocabulary())]) 
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  # @tf.function
  def generate_one_step(self, input_words, passin_states=None):
    # Convert strings to token IDs.
    # input_words = tf.strings.unicode_split(inputs, 'UTF-8')
    # input_ids = self.words_to_ids(input_words).to_tensor()
    input_words = tf.strings.split(input_words)
    input_ids = self.words_to_ids(input_words.to_tensor())

    # Run the model.
    # predicted_logits.shape is [batch, word, next_word_logits] 
    predicted_logits, states =  self.model(inputs=input_ids, passin_states=passin_states, 
                                          return_state=True)
    # Only use the prediction in the final time-position.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "" or "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to words
    predicted_words = self.ids_to_words(predicted_ids)

    # Return the words and model state.
    return predicted_words, states
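
The effect of the temperature and of the prediction mask can be seen in isolation with a small NumPy sketch. It mirrors the steps of `generate_one_step` (scale the logits, mask unwanted ids, sample), but `sample_next` and the toy logits are assumptions made for this example rather than part of the model above.

```python
import numpy as np

def sample_next(logits, temperature=1.0, banned_ids=(), rng=None):
    """Sample one id from softmax(logits / temperature), skipping banned ids."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    for i in banned_ids:
        scaled[i] = -np.inf                     # probability 0 after the softmax
    scaled = scaled - scaled.max()              # stabilise the exponentials
    probs = np.exp(scaled)
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

logits = [2.0, 1.0, 0.5, 3.0]
_, p_sharp = sample_next(logits, temperature=0.5, banned_ids=[0])
_, p_flat = sample_next(logits, temperature=2.0, banned_ids=[0])
print(p_sharp.round(3))   # low temperature: mass concentrates on the top logit
print(p_flat.round(3))    # high temperature: mass spreads out
```

Temperatures below 1 make the generated text more predictable, while temperatures above 1 make it more varied; in both cases the banned ids can never be sampled.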

In [169]:
# Generate an inference instance
one_step_model = OneStep(model, ids_to_words, words_to_ids)

# Time the generation of 100 words
start = time.time()
states = None
# next_word = tf.constant(['hello, my name is'])
next_word = np.array(['hello, my name is'])
# next_word = tf.constant([['hello, my name is'],['hello', 'my', 'name', 'is']])
result = [next_word]

for n in range(100):
    next_word, states = one_step_model.generate_one_step(next_word, passin_states=states)
    result.append(next_word)

result = tf.strings.join(result, separator=' ')
end = time.time()

print('Generated language:')
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)

print(f"\nRun time: {end - start}")

Generated language:
hello, my name is shown only and what you had imagine those '' . <s> <s> mr. brown was on the state owned , the rows of scanned out out on the porch , and laughed what he had been talking comparable some while he moved back back to the halt . <s> <s> he had better at the moment , to the playing one day show in daytime and they'll repeat him ) down the ball . <s> <s> `` don't have before his own enough '' ? ? <s> <s> another deputies and his women , colonel van miller at a relatively 

________________________________________________________________________________

Run time: 0.6813898086547852


<a id = 'endOfNotebook'></a>
_End of Notebook_  
[Back to top](#top)  