# 4. Language Models

A common approach for natural language tasks is to use neural networks.
We will therefore continue to explore the RNNs introduced in Section 3 by training a **character RNN**: a model trained to predict the next character in a sentence.

We follow the Char-RNN project by Andrej Karpathy (https://github.com/karpathy/char-rnn).
We will be using his Shakespeare data to create our own Char-RNN.

## 4.1 Creating the Training Dataset

Let's download the Shakespeare dataset and take a look:

In [1]:
import tensorflow as tf

# Download the Shakespeare dataset
shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

# Shows a short text sample
print(shakespeare_text[:420])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!


In [2]:
import urllib.request
import os

# Download the Shakespeare dataset
shakespeare_url = 'https://homl.info/shakespeare'
filepath = 'shakespeare.txt'

if not os.path.exists(filepath):
    urllib.request.urlretrieve(shakespeare_url, filepath)

with open(filepath) as f:
    shakespeare_text = f.read()

# Shows a short text sample
print(shakespeare_text[:420])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!


The input to our model will be the beginning of a passage of Shakespeare's text (i.e. a sequence of characters).
Given this sequence of characters, we want our model to predict the next character.

In [3]:
vocab = "".join(sorted(set(shakespeare_text.lower())))
vocab_size = len(vocab)  # vocab contains each distinct character exactly once
print("Our vocabulary: " + vocab)
print("Number of distinct characters: " + str(vocab_size))

Our vocabulary: 
 !$&',-.3:;?abcdefghijklmnopqrstuvwxyz
Number of distinct characters: 39


We must encode every character as an integer.
The easiest way to do this is to use a `tf.keras.layers.TextVectorization` layer.
We set `split="character"` to get character-level encoding rather than the default word-level encoding, and we use `standardize="lower"` to convert the text to lowercase (which will simplify the task):

In [4]:
text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]
encoded

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([21,  7, 10, ..., 22, 28, 12])>

Each character is now mapped to an integer, starting at 2. The `TextVectorization` layer reserved the value 0 for padding tokens, and it reserved 1 for unknown characters. We won’t need either of these tokens for now, so let’s subtract 2 from the character IDs and compute the number of distinct characters and the total number of characters:

In [5]:
encoded -= 2  # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394

print("Dataset size: ", dataset_size)
print("Number of tokens: ", n_tokens)

Dataset size:  1115394
Number of tokens:  39
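
To make the ID bookkeeping concrete, here is a minimal pure-Python sketch of the same idea. It does not use `TextVectorization`: the helper name `build_char_ids` and its tie-breaking rule are assumptions made for illustration only, so the IDs it produces will not necessarily match those of the Keras layer.

```python
from collections import Counter


def build_char_ids(text, n_reserved=2):
    """Map each lowercase character to an integer ID, skipping reserved IDs.

    Hypothetical helper: IDs 0 and 1 are left free to mimic the padding and
    unknown tokens, so real characters start at ID 2.
    """
    counts = Counter(text.lower())
    # Most frequent characters first; ties broken alphabetically (an assumption)
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + n_reserved for i, ch in enumerate(ordered)}


char_to_id = build_char_ids("to be or not to be")
raw_ids = [char_to_id[ch] for ch in "to be"]

# Subtracting 2 drops the two reserved IDs, so character IDs now start at 0
shifted_ids = [i - 2 for i in raw_ids]
print(raw_ids, shifted_ids)
```

After the shift, the smallest character ID is 0, which is what lets the number of distinct characters double as the size of the model's output layer.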


To train a sequence-to-sequence RNN, we can convert this long sequence into input/target pairs by dividing the data into windows of a fixed size. For instance, the window "to be or not to be" is turned into the input sequence "to be or not to b" and the target sequence "o be or not to be". The target is the input shifted one character ahead, so at the final time step the model should predict "e". The model can then be trained on these input/target pairs to learn the underlying patterns in the text and generate more text in a similar style.

The function `to_dataset` will convert our long sequence of character IDs (encoded text) into a dataset of input/target window pairs:

In [6]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):

    # Prepare dataset of character IDs to be processed by tensorflow
    ds = tf.data.Dataset.from_tensor_slices(sequence)

    # Create windows of size length + 1
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))

    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)

    # Batch the resulting dataset
    ds = ds.batch(batch_size)

    # Create input/output sequences by taking first *length* characters as input
    # and last *length* characters as output
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

This diagram illustrates what `to_dataset` is doing:

<img src="to_dataset.png" width="500" style="display: block; margin: 0 auto">

Batching is a technique used to divide large datasets into smaller subsets or batches.
Instead of feeding the entire dataset (of our input/output pairs of windows) to the model at once, we divide it into batches, which are fed to the model one-by-one during training.
Each batch is processed independently, and the model updates its weights after processing each batch.
Batching makes training more efficient. 
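
As a rough illustration of batching, the sketch below splits a plain Python list into consecutive batches. The helper `make_batches` is hypothetical and only mirrors the idea behind `ds.batch(batch_size)`; like that call with its default settings, it keeps a final batch that is smaller than the others.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


examples = list(range(7))  # stand-in for 7 input/target window pairs
batches = make_batches(examples, batch_size=3)
print(batches)
```

With 7 examples and a batch size of 3, the model would perform three weight updates per epoch, the last one on a single example.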

In [7]:
# example using to_dataset()
# This code creates a dataset with a single training example: an input/output pair
# The input represents "to b" and the output represents "o be", so the model 
# should learn to predict the next character, i.e., "e"
list(to_dataset(text_vec_layer(["To be"])[0], length=4))

[(<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 4,  5,  2, 23]])>,
  <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 5,  2, 23,  3]])>)]
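
The windowing logic can also be sketched without TensorFlow. The following pure-Python function is a simplified stand-in for `to_dataset` (no shuffling or batching), and the name `make_windows` is an assumption made for this example. It slides a window of `length + 1` items over the sequence one step at a time and splits each window into an input and a target shifted by one position.

```python
def make_windows(sequence, length):
    """Return (input, target) pairs built from sliding windows of length + 1."""
    pairs = []
    # A sequence of n items yields n - length complete windows
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))  # target = input shifted by one
    return pairs


print(make_windows("to be", length=4))
print(len(make_windows("to be or", length=4)))
```

Working on strings rather than character IDs makes the one-character shift between input and target easy to see.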

Since the entire dataset is 1,115,394 characters long and we have limited time, we will use only the first half of the dataset (`subset_proportion = 0.5`) to make sure we can finish training during this workshop.
We will split this subset so that roughly 90% is used for training, 5% for validation and the remaining 5% for testing.

We set the window length to 100 here, but it is worth experimenting with different window lengths.
Shorter windows make the RNN easier and quicker to train, but the RNN cannot learn any pattern longer than the window length, so it is important not to choose a window length that is too small.

In [8]:
length = 100
subset_proportion = 0.5
reduced_dataset_size = int(dataset_size * subset_proportion)

tf.random.set_seed(42)
train_set = to_dataset(encoded[:int(reduced_dataset_size * 0.9)], length=length, shuffle=True,
                       seed=42)
valid_set = to_dataset(encoded[int(reduced_dataset_size * 0.9):int(reduced_dataset_size * 0.95)], length=length)
test_set = to_dataset(encoded[int(reduced_dataset_size * 0.95):reduced_dataset_size], length=length)
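
As a sanity check on the split, the sketch below recomputes the slice boundaries in plain Python, using the dataset size reported earlier. The variable names `train_end`, `valid_end`, and the `n_*` counters are introduced here for illustration.

```python
dataset_size = 1_115_394  # total number of characters reported above
subset_proportion = 0.5
length = 100

reduced_dataset_size = int(dataset_size * subset_proportion)
train_end = int(reduced_dataset_size * 0.9)   # end of the training slice
valid_end = int(reduced_dataset_size * 0.95)  # end of the validation slice

n_train = train_end
n_valid = valid_end - train_end
n_test = reduced_dataset_size - valid_end

# With shift=1 and drop_remainder=True, n characters yield n - length windows
n_train_windows = n_train - length

print(n_train, n_valid, n_test, n_train_windows)
```

With these numbers the training slice holds 501,927 characters, and the validation and test slices hold 27,885 characters each.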

## Building and Training the Char-RNN Model

**Warning**: the following code may take one or two hours to run, depending on your GPU. Without a GPU, it may take over 24 hours. If you don't want to wait, just skip the next two code cells and run the code below to download a pretrained model.

On a Mac, GPU acceleration requires the `tensorflow-metal` plugin. To install it, run the following in a terminal: `python -m pip install tensorflow-metal`

**Note**: the `GRU` class will only use cuDNN acceleration (assuming you have a GPU) when using the default values for the following arguments: `activation`, `recurrent_activation`, `recurrent_dropout`, `unroll`, `use_bias` and `reset_after`.

Since our dataset is reasonably large, and modeling language is quite a difficult task, we need more than a simple RNN with a few recurrent neurons.
Let’s build and train a model with one GRU layer composed of 128 units (you can try tweaking the number of layers and units later, if needed).

Let’s go over the code in the next cell:

- We use an `Embedding` layer as the first layer, to encode the character IDs (embeddings were introduced in Chapter 13). The `Embedding` layer’s number of input dimensions is the number of distinct character IDs, and the number of output dimensions is a hyperparameter you can tune—we’ll set it to 16 for now. Whereas the inputs of the `Embedding` layer will be 2D tensors of shape *[batch size, window length]*, the output of the Embedding layer will be a 3D tensor of shape *[batch size, window length, embedding size]*.

- We use a `Dense` layer for the output layer: it must have 39 units (n_tokens) because there are 39 distinct characters in the text, and we want to output a probability for each possible character (at each time step). The 39 output probabilities should sum up to 1 at each time step, so we apply the softmax activation function to the outputs of the Dense layer.

- Lastly, we compile this model, using the `"sparse_categorical_crossentropy"` loss and a Nadam optimizer, and we train the model for several epochs, using a `ModelCheckpoint` callback to save the best model (in terms of validation accuracy) as training progresses.

In [9]:
# # TODO: this takes too long, can we get GPU in Github Workspaces?

# tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU
# model = tf.keras.Sequential([
#     tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
#     tf.keras.layers.GRU(128, return_sequences=True),
#     tf.keras.layers.Dense(n_tokens, activation="softmax")
# ])
# model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
#               metrics=["accuracy"])
# model_ckpt = tf.keras.callbacks.ModelCheckpoint(
#     "my_shakespeare_model", monitor="val_accuracy", save_best_only=True)
# history = model.fit(train_set, validation_data=valid_set, epochs=10,
#                     callbacks=[model_ckpt])
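
To give a feel for what the softmax output and the sparse categorical cross-entropy loss compute at a single time step, here is a minimal pure-Python sketch. The function names are chosen for this example and are not the Keras implementations.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(target_id, probas):
    """Loss for one time step: negative log-probability of the true class."""
    return -math.log(probas[target_id])


n_tokens = 39
uniform_probas = softmax([0.0] * n_tokens)  # a model with no preference at all
baseline_loss = sparse_categorical_crossentropy(0, uniform_probas)
print(sum(uniform_probas), baseline_loss)
```

A model that assigns equal probability to all 39 characters has a loss of ln(39), roughly 3.66, so a trained model should end up well below that value.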

This model does not handle text preprocessing, so let’s wrap it in a final model containing the `tf.keras.layers.TextVectorization` layer as the first layer, plus a `tf.keras.layers.Lambda` layer to subtract 2 from the character IDs (since we’re not using the padding and unknown tokens for now):

In [10]:
# shakespeare_model = tf.keras.Sequential([
#     text_vec_layer,
#     tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
#     model
# ])

If you don't want to wait for training to complete, we have a pretrained model for you.
The following cells first check which devices TensorFlow can see, then download the pretrained model and load it as `shakespeare_model`.

In [11]:
import tensorflow as tf
tf.config.list_physical_devices('GPU')

[]

In [12]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 5482547038441510248
xla_global_id: -1
]


In [13]:
from pathlib import Path

# extra code – downloads a pretrained model
with tf.device('/CPU:0'):
    url = "https://github.com/ageron/data/raw/main/shakespeare_model.tgz"
    path = tf.keras.utils.get_file("shakespeare_model.tgz", url, extract=True) 
    model_path = Path(path).with_name("shakespeare_model")
    shakespeare_model = tf.keras.models.load_model(model_path)

In [14]:
shakespeare_model.weights

[<keras.src.layers.preprocessing.index_lookup.VocabWeightHandler at 0x30fc39fc0>,
 <tf.Variable 'embedding/embeddings:0' shape=(39, 16) dtype=float32, numpy=
 array([[ 1.32050097e+00, -2.59510159e-01, -4.11443859e-02,
         -3.42633694e-01, -1.31173706e+00,  1.34531796e-01,
          6.91836357e-01,  7.36093968e-02, -7.89562762e-01,
         -1.52055562e-01, -8.17673504e-01,  7.70714700e-01,
         -1.44369400e+00, -5.65679260e-02,  1.52639374e-01,
          5.92843831e-01],
        [-7.78888166e-01,  6.68907881e-01, -6.08024187e-03,
         -7.67722666e-01, -3.88999522e-01, -8.06803167e-01,
          4.08726960e-01, -1.02228457e-02, -8.20410475e-02,
         -1.16070993e-02,  1.23549990e-01,  2.96948791e-01,
          1.32559168e+00,  2.00111166e-01,  1.09568262e+00,
         -9.38163251e-02],
        [ 1.02407813e+00, -9.66559649e-02,  1.31750107e-01,
         -5.60444713e-01,  3.55749339e-01,  2.71324843e-01,
         -9.60624337e-01, -6.04912732e-03, -2.26338282e-01,
        

Now let's use it to predict the next character in the sequence:

In [15]:
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]



'e'

## Generating Fake Shakespearean Text

**Note**: we ran into an issue here. We could not get TensorFlow to use the GPU for both training and running the model: the text generated on the GPU was of poor quality, whereas the text generated on the CPU was good.

To generate new text using the char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it to the end of the text, then give the extended text to the model to guess the next letter, and so on. This is called greedy decoding. But in practice this often leads to the same words being repeated over and over again. Instead, we can sample the next character randomly, with a probability equal to the estimated probability, using TensorFlow’s `tf.random.categorical()` function. This will generate more diverse and interesting text. The `categorical()` function samples random class indices, given the class log probabilities (logits). For example:

In [16]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 1, 0, 2, 1, 0, 0, 1]])>
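
The difference between greedy decoding and sampling can be sketched in plain Python with the same three probabilities. This example uses the standard library's `random.choices` as a stand-in for `tf.random.categorical()`, so the individual draws will differ from the TensorFlow output above.

```python
import random

probas = [0.5, 0.4, 0.1]  # same class probabilities as in the cell above

# Greedy decoding: always pick the single most probable class
greedy_id = max(range(len(probas)), key=lambda i: probas[i])

# Sampling: draw class IDs in proportion to their probabilities
rng = random.Random(42)  # fixed seed so the draws are repeatable
samples = rng.choices(range(len(probas)), weights=probas, k=1000)
counts = [samples.count(i) for i in range(len(probas))]

print(greedy_id, counts)
```

Greedy decoding returns class 0 every time, whereas sampling also picks classes 1 and 2 at rates close to their probabilities, which is what makes the generated text more varied.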

To have more control over the diversity of the generated text, we can divide the logits by a number called the temperature, which we can tweak as we wish. A temperature close to zero favors high-probability characters, while a very high temperature gives all characters a nearly equal probability. Lower temperatures are typically preferred when generating fairly rigid and precise text, such as mathematical equations, while higher temperatures are preferred when generating more diverse and creative text. The following `next_char()` custom helper function uses this approach to pick the next character to add to the input text:

In [17]:
def next_char(text, temperature=1):

    # Generate the predicted probabilities for the next character in the sequence
    # based on the current text
    # Select the final output vector from this prediction, i.e. the last character in the sequence
    y_proba = shakespeare_model.predict([text])[0, -1:]
    
    # Rescale the probability distribution using the temperature parameter
    rescaled_logits = tf.math.log(y_proba) / temperature

    # Sample the next character ID from this rescaled distribution
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]

    # Return the character corresponding to the sampled ID
    return text_vec_layer.get_vocabulary()[char_id + 2]
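
To see what dividing the logits by the temperature does to a distribution, here is a small pure-Python sketch. The helper `apply_temperature` is hypothetical: it takes log-probabilities, divides them by the temperature, and renormalizes with a softmax, which is the distribution that sampling from the rescaled logits effectively uses.

```python
import math


def apply_temperature(probas, temperature):
    """Return the distribution obtained by rescaling log-probabilities."""
    logits = [math.log(p) / temperature for p in probas]
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probas = [0.5, 0.4, 0.1]
cold = apply_temperature(probas, temperature=0.1)     # sharper distribution
neutral = apply_temperature(probas, temperature=1.0)  # unchanged
hot = apply_temperature(probas, temperature=100.0)    # close to uniform

for name, dist in [("cold", cold), ("neutral", neutral), ("hot", hot)]:
    print(name, [round(p, 3) for p in dist])
```

A temperature of 1 leaves the probabilities unchanged, a low temperature concentrates almost all the mass on the most likely class, and a high temperature flattens the distribution toward uniform.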

Next, we can write another small helper function that will repeatedly call `next_char()` to get the next character and append it to the given text:

In [18]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [19]:
tf.random.set_seed(42)
print(extend_text("To be or not to be", temperature=0.01))

To be or not to be the duke
as it is a proper strange death,
and the


In [20]:
print(extend_text("To be or not to be", temperature=1))

To be or not to behold?

second push:
gremio, lord all, a sistermen,


In [21]:
print(extend_text("To be or not to be", temperature=1))

To be or not to be edward knows
whose stew, am i bid them for you, g


In [22]:
print(extend_text("To be or not to be", temperature=0.9))

To be or not to bed tire
the strangeness pity. what is't your allay,
