# Word-level RNN as a Language Model

*Prepared by Sebastian C. Ibañez*

--- 

<a href="https://colab.research.google.com/github/aim-msds/msds2022-ml3/blob/main/notebooks/05_RNN/02_word-rnn.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" style="float: left;"></a><br>

In this notebook, our goal is to create a language model by implementing a word-level RNN.

For reference, here's the architecture we want to build:

<img src="images/wordlevel-rnn.png" width="1000">

In [1]:
import numpy as np
import tensorflow as tf

Let's look at the data:

In [2]:
# Load data
path_to_file = 'data/shakespeare.txt'

text = open(path_to_file, 'rb').read().decode(encoding='utf-8', errors='ignore')

print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



Now we can tokenize our data. Let's do this by splitting the text into "words" on whitespace.

Note that this is a quick and dirty way to tokenize: punctuation stays attached to words (e.g. `further,` and `speak.`) and capitalization is preserved. In practice, more sophisticated methods are available.

In [3]:
# Naive tokenization
tokens = text.split()

print(f'Total number of tokens: {len(tokens)}\n')
print(tokens[:250])

Total number of tokens: 202651

['First', 'Citizen:', 'Before', 'we', 'proceed', 'any', 'further,', 'hear', 'me', 'speak.', 'All:', 'Speak,', 'speak.', 'First', 'Citizen:', 'You', 'are', 'all', 'resolved', 'rather', 'to', 'die', 'than', 'to', 'famish?', 'All:', 'Resolved.', 'resolved.', 'First', 'Citizen:', 'First,', 'you', 'know', 'Caius', 'Marcius', 'is', 'chief', 'enemy', 'to', 'the', 'people.', 'All:', 'We', "know't,", 'we', "know't.", 'First', 'Citizen:', 'Let', 'us', 'kill', 'him,', 'and', "we'll", 'have', 'corn', 'at', 'our', 'own', 'price.', "Is't", 'a', 'verdict?', 'All:', 'No', 'more', 'talking', "on't;", 'let', 'it', 'be', 'done:', 'away,', 'away!', 'Second', 'Citizen:', 'One', 'word,', 'good', 'citizens.', 'First', 'Citizen:', 'We', 'are', 'accounted', 'poor', 'citizens,', 'the', 'patricians', 'good.', 'What', 'authority', 'surfeits', 'on', 'would', 'relieve', 'us:', 'if', 'they', 'would', 'yield', 'us', 'but', 'the', 'superfluity,', 'while', 'it', 'were', 'wholesome,', 'we
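The whitespace split above leaves punctuation glued to words, so `speak.` and `Speak,` become separate vocabulary entries. As one possible refinement, here is a minimal sketch of a regex-based tokenizer that separates punctuation from words. The helper name `regex_tokenize` and the pattern are illustrative assumptions, not part of the original notebook.

```python
import re

def regex_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+(?:'\w+)* keeps contractions such as "know't" in one piece;
    # [^\w\s] matches any single character that is neither word nor space.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

sample = "Before we proceed any further, hear me speak."
print(sample.split())          # whitespace split: punctuation stays attached
print(regex_tokenize(sample))  # punctuation becomes its own token
```

Whether this helps depends on the task: it shrinks the vocabulary, but the generated text then needs post-processing to re-attach punctuation.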

Next, let's create a vocabulary to store all the unique tokens.

In [4]:
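# Create vocabulary
# Note: set ordering is not stable across Python sessions, so token ids can
# differ between runs; sorted(set(tokens)) would make them reproducible.
vocab = list(set(tokens))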
vocab_size = len(vocab)

print(f'Number of unique tokens: {vocab_size}')

Number of unique tokens: 25670


TensorFlow/Keras has some useful utilities for easy processing of text data.

Let's create a `tf.keras.layers.StringLookup` layer and convert our tokens to index values based on the vocabulary we just made.

In [5]:
# Use tf.keras to make a lookup table (token -> id)
ids_from_tokens = tf.keras.layers.StringLookup(vocabulary=vocab)

ids = ids_from_tokens(tokens)
ids

<tf.Tensor: shape=(202651,), dtype=int64, numpy=array([12663, 18387, 15266, ...,  1900, 23059,  7514], dtype=int64)>

To interpret our predictions at inference time, we need to create a reverse lookup table that converts an index back into an actual token.

In [6]:
# Reverse lookup
tokens_from_ids = tf.keras.layers.StringLookup(vocabulary=ids_from_tokens.get_vocabulary(), invert=True) 
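To make the two lookup tables concrete, here is a plain-Python sketch of what they do conceptually. With its default settings, `StringLookup` reserves index 0 for out-of-vocabulary tokens (shown as `[UNK]`), and the supplied vocabulary starts at index 1. The helpers `build_lookup`, `encode`, and `decode` are hypothetical names for illustration and do not reflect how the layer is implemented.

```python
def build_lookup(vocab, oov_token="[UNK]"):
    """Return (token -> id, id -> token) tables with id 0 reserved for unknowns."""
    id_to_token = [oov_token] + list(vocab)
    token_to_id = {tok: i for i, tok in enumerate(id_to_token)}
    return token_to_id, id_to_token

def encode(tokens, token_to_id):
    # Tokens missing from the vocabulary fall back to id 0.
    return [token_to_id.get(tok, 0) for tok in tokens]

def decode(ids, id_to_token):
    return [id_to_token[i] for i in ids]

toy_vocab = ["First", "Citizen:", "speak."]
token_to_id, id_to_token = build_lookup(toy_vocab)

toy_ids = encode(["First", "Citizen:", "Romeo"], token_to_id)
print(toy_ids)                       # "Romeo" is not in toy_vocab
print(decode(toy_ids, id_to_token))
```

The reserved index is why the number of distinct ids is one more than the vocabulary size.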

Next, we can use a `tf.data.Dataset` object that will allow us to conveniently batch and loop through our data.

In [7]:
# Create a tf dataset object for easy batching / looping over data
ids_dataset = tf.data.Dataset.from_tensor_slices(ids_from_tokens(tokens))

for token_id in ids_dataset.take(10):  # avoid shadowing the `ids` tensor above
    print(tokens_from_ids(token_id).numpy().decode('utf-8'))

First
Citizen:
Before
we
proceed
any
further,
hear
me
speak.


Next, we need to pre-define the length of our sequences for batching and training.

In [8]:
seq_length = 20
examples_per_epoch = len(tokens) // (seq_length + 1) # number of (seq_length+1)-token sequences

sequences = ids_dataset.batch(seq_length+1, drop_remainder=True) # +1 because we want to predict the next word

for seq in sequences.take(1):
    print(tokens_from_ids(seq))

tf.Tensor(
[b'First' b'Citizen:' b'Before' b'we' b'proceed' b'any' b'further,'
 b'hear' b'me' b'speak.' b'All:' b'Speak,' b'speak.' b'First' b'Citizen:'
 b'You' b'are' b'all' b'resolved' b'rather' b'to'], shape=(21,), dtype=string)
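The call to `batch(seq_length+1, drop_remainder=True)` cuts the id stream into consecutive, non-overlapping chunks of 21 tokens and discards any leftover tokens at the end. The sketch below mimics that behavior on a plain list; `chunk` is a hypothetical helper, and the token count is the one printed earlier in this notebook.

```python
def chunk(items, size):
    """Split items into consecutive, non-overlapping chunks of exactly `size`."""
    n_full = len(items) // size  # incomplete trailing chunk is dropped
    return [items[i * size:(i + 1) * size] for i in range(n_full)]

seq_length = 20
n_tokens = 202651  # total number of tokens reported above

# Toy stream of 50 ids: two full chunks of 21, the last 8 ids are dropped.
toy_chunks = chunk(list(range(50)), seq_length + 1)
print(len(toy_chunks), [len(c) for c in toy_chunks])

# Number of training sequences for the full corpus.
n_sequences = n_tokens // (seq_length + 1)
print(n_sequences)
```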


Let's create a function that can split a sequence into input and target.

In [9]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

split_input_target('Tensorflow is very cool.'.split())

(['Tensorflow', 'is', 'very'], ['is', 'very', 'cool.'])

Now let's apply the `split_input_target` function to all our sequences.

In [10]:
dataset = sequences.map(split_input_target)

for input_example, target_example in dataset.take(1):
    print("Input :", tokens_from_ids(input_example))
    print("Target:", tokens_from_ids(target_example))

Input : tf.Tensor(
[b'First' b'Citizen:' b'Before' b'we' b'proceed' b'any' b'further,'
 b'hear' b'me' b'speak.' b'All:' b'Speak,' b'speak.' b'First' b'Citizen:'
 b'You' b'are' b'all' b'resolved' b'rather'], shape=(20,), dtype=string)
Target: tf.Tensor(
[b'Citizen:' b'Before' b'we' b'proceed' b'any' b'further,' b'hear' b'me'
 b'speak.' b'All:' b'Speak,' b'speak.' b'First' b'Citizen:' b'You' b'are'
 b'all' b'resolved' b'rather' b'to'], shape=(20,), dtype=string)


Let's finalize the construction of our dataset.

In [11]:
# Batch size
batch_size = 256

# Buffer size to shuffle the dataset
buffer_size = 1000

dataset = sequences.map(split_input_target)
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True) # Shuffle and batch

dataset

<BatchDataset shapes: ((256, 20), (256, 20)), types: (tf.int64, tf.int64)>
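As a quick sanity check on the dataset shape, the arithmetic below works out how many batches one epoch contains, using only numbers that appear in this notebook. Note that the shuffle buffer (1000) is smaller than the number of sequences, so the shuffle is only approximate rather than a full permutation.

```python
n_tokens = 202651   # total number of tokens reported above
seq_length = 20
batch_size = 256

# Each training sequence consumes seq_length + 1 tokens.
n_sequences = n_tokens // (seq_length + 1)

# drop_remainder=True discards the final, incomplete batch.
steps_per_epoch = n_sequences // batch_size
dropped_sequences = n_sequences - steps_per_epoch * batch_size

print(f'sequences: {n_sequences}')
print(f'batches per epoch: {steps_per_epoch}')
print(f'sequences dropped per epoch: {dropped_sequences}')
```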

Now, let's build the model.

In [14]:
# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 512

# Word-level RNN
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size+1, embedding_dim)) # +1 for the unknown token
model.add(tf.keras.layers.SimpleRNN(rnn_units, return_sequences=True)) # return outputs at every time step
model.add(tf.keras.layers.Dense(vocab_size+1)) # no softmax here: the layer outputs logits, and the loss applies the softmax
          
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 256)         6571776   
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, None, 512)         393728    
_________________________________________________________________
dense_1 (Dense)              (None, None, 25671)       13169223  
Total params: 20,134,727
Trainable params: 20,134,727
Non-trainable params: 0
_________________________________________________________________
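The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below assumes the standard layouts: an embedding table with one row per id, a simple RNN with an input kernel, a recurrent kernel, and a bias, and a dense layer with a weight matrix and a bias.

```python
vocab_size = 25670
embedding_dim = 256
rnn_units = 512
n_ids = vocab_size + 1  # +1 for the unknown token

# Embedding: one embedding_dim-sized vector per id.
embedding_params = n_ids * embedding_dim

# SimpleRNN: input kernel + recurrent kernel + bias.
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)

# Dense: weights from every RNN unit to every id, plus one bias per id.
dense_params = (rnn_units + 1) * n_ids

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
```

Most of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size.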


Let's define the loss function and choose our optimizer.

We also add a checkpoint callback, which saves the model weights at the end of every epoch.

In [15]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer='adam', loss=loss)

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='./training_checkpoints/ckpt_{epoch}',
    save_weights_only=True)
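Because the last layer outputs raw logits, the loss is created with `from_logits=True` so that the softmax is applied inside the loss. For a single position, the quantity being computed is the negative log-probability of the target id. Here is a minimal plain-Python sketch of that formula; `sparse_ce_from_logits` is a hypothetical helper, not the Keras implementation, which also averages over all positions in the batch.

```python
import math

def sparse_ce_from_logits(logits, target_id):
    """Cross-entropy for one position: -log(softmax(logits)[target_id])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]

toy_logits = [2.0, 0.5, -1.0]
loss_likely = sparse_ce_from_logits(toy_logits, 0)    # target has the largest logit
loss_unlikely = sparse_ce_from_logits(toy_logits, 2)  # target has the smallest logit
print(loss_likely, loss_unlikely)

# A model that spreads probability evenly over all ids scores log(n_ids).
n_ids = 25671
uniform_loss = sparse_ce_from_logits([0.0] * n_ids, 0)
print(uniform_loss)
```

The uniform-prediction value is a rough reference point: a training loss well below it suggests the model has learned something about the token distribution.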

In [26]:
epochs = 20

history = model.fit(dataset, epochs=epochs, callbacks=[checkpoint_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


Before we generate text, let's briefly describe **temperature**, a common parameter that controls how peaked or flat the sampling distribution is. We divide the logits by the temperature before applying the softmax.

In [27]:
np.random.seed(1)
logits = np.random.normal(loc=0, scale=1, size=5)

print(f'  logits = {logits}')
print(f'temp 0.5 = {tf.nn.softmax(logits/0.5).numpy()}')
print(f'temp 1.0 = {tf.nn.softmax(logits/1.0).numpy()}')
print(f'temp 2.0 = {tf.nn.softmax(logits/2.0).numpy()}')

  logits = [ 1.62434536 -0.61175641 -0.52817175 -1.07296862  0.86540763]
temp 0.5 = [0.80087103 0.00914764 0.0108121  0.00363668 0.17553254]
temp 1.0 = [0.56862917 0.06077185 0.06606977 0.0383178  0.26621141]
temp 2.0 = [0.38290729 0.12517866 0.13052103 0.09939839 0.26199464]


Notice what happens to the probabilities as we divide the logits by the temperature. A temperature below 1.0 sharpens the distribution, concentrating probability on the largest logit, while a temperature above 1.0 flattens it toward uniform, so sampled tokens become more varied.

This will allow us to control how "wild" the generated text is when we sample it.
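One way to quantify the effect is the entropy of the resulting distribution, which grows as the temperature rises. The sketch below recomputes the softmax from the logits printed above (rounded to eight decimals) in plain Python; `softmax_with_temperature` and `entropy` are hypothetical helper names.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [1.62434536, -0.61175641, -0.52817175, -1.07296862, 0.86540763]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f'temp {t}: max prob = {max(probs):.4f}, entropy = {entropy(probs):.4f}')
```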

With the model trained, we can generate new text by passing a prompt into the model and randomly sampling the next token from the output at the final time step (a distribution over the vocabulary). We append the sampled token to the prompt and repeat.

In [39]:
temperature = 1.75 # Tuneable
prompt = 'ROMEO:'

gen_len = 20

for i in range(gen_len):

    output = model(tf.expand_dims(ids_from_tokens(prompt.split()), axis=0))
    output = output[:, -1, :]
    output = output/temperature
    output = tf.random.categorical(output, num_samples=1)
    output = tf.squeeze(output, axis=-1)

    output_text = tokens_from_ids(output)
    output_text = output_text.numpy()[0].decode('utf-8')
    
    prompt = prompt + ' ' + output_text

print(prompt)

ROMEO: Well, Here, why, nature is framed, and to be sufficiency, as your sea York And doubt thou choose custom. instruct,
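The generation loop above can be factored into a reusable function, which avoids copying it for every model. The sketch below shows the structure in plain Python with a stand-in model. Here `next_logits` plays the role of `model(...)[:, -1, :]`, and `random.Random.choices` plays the role of `tf.random.categorical`. All names (`sample_next`, `generate`, `toy_next_logits`) are hypothetical and for illustration only.

```python
import math
import random

def sample_next(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]  # unnormalized probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(next_logits, prompt_tokens, id_to_token, gen_len, temperature, seed=0):
    """Autoregressive loop: sample one token per step and append it."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(gen_len):
        logits = next_logits(tokens)  # scores for the next token only
        tokens.append(id_to_token[sample_next(logits, temperature, rng)])
    return tokens

# Stand-in "model": strongly prefers the next word in a fixed cycle.
toy_vocab = ["to", "be", "or", "not"]

def toy_next_logits(tokens):
    last = toy_vocab.index(tokens[-1])
    preferred = (last + 1) % len(toy_vocab)
    return [5.0 if i == preferred else 0.0 for i in range(len(toy_vocab))]

generated = generate(toy_next_logits, ["to"], toy_vocab, gen_len=8, temperature=0.1)
print(" ".join(generated))
```

At a low temperature such as 0.1 the stand-in model follows its preferred cycle almost surely, while a higher temperature makes it wander off more often.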


In [41]:
# Load a pre-trained model (its vocabulary must match the ids produced by ids_from_tokens)
new_model = tf.keras.models.load_model('models/word-rnn-shakespeare_v1.h5')

temperature = 1.75 # Tuneable
prompt = 'ROMEO:'

gen_len = 20

for i in range(gen_len):

    output = new_model(tf.expand_dims(ids_from_tokens(prompt.split()), axis=0))
    output = output[:, -1, :]
    output = output/temperature
    output = tf.random.categorical(output, num_samples=1)
    output = tf.squeeze(output, axis=-1)

    output_text = tokens_from_ids(output)
    output_text = output_text.numpy()[0].decode('utf-8')
    
    prompt = prompt + ' ' + output_text

print(prompt)

ROMEO: Amen, amen! that, then and tell me, do your cousin's death? HENRY BOLINGBROKE: Sir Dear Harry, take me Master much:


## Refinements and Extensions

- Keep increasing epochs.
- Refine pre-processing.
- Refine post-processing.
- Tune hyperparameters.
- Try different RNNs.
- Try a different dataset.

## References

---

[1] https://www.tensorflow.org/text/tutorials/text_generation

[2] https://www.tensorflow.org/tutorials/keras/text_classification