# 4. Language Models

Neural networks are a common approach to natural language tasks.
We will therefore continue to explore the RNNs introduced in Section 3 by training a **character RNN**: a model that predicts the next character in a sentence.

We follow the Char-RNN project by Andrej Karpathy (https://github.com/karpathy/char-rnn).
We will be using his Shakespeare data to create our own Char-RNN.

## 4.1 Creating the Training Dataset

Let's download the Shakespeare dataset and take a look:

In [1]:
import tensorflow as tf

# Download the Shakespeare dataset
shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

# Shows a short text sample
print(shakespeare_text[:420])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!


The input to our model will be the beginning of a Shakespeare sonnet (i.e., a sequence of characters).
Given this sequence of characters, we want our model to predict the next character.
For simplicity, we will only use **lowercase** characters.

Given that `shakespeare_text` is our dataset, create the following variables:
- `vocab`, which is a string of all distinct characters that appear in the lowercased `shakespeare_text`, in sorted order
- `vocab_size`, which is the number of distinct characters in the lowercased `shakespeare_text`

In [2]:
# Your code here
vocab = "".join(sorted(set(shakespeare_text.lower())))
vocab_size = len(set(shakespeare_text.lower()))


# Code you don't need to worry about
print("Our vocabulary: " + vocab)
print("Number of distinct characters: " + str(vocab_size))

Our vocabulary: 
 !$&',-.3:;?abcdefghijklmnopqrstuvwxyz
Number of distinct characters: 39


We must encode every character as an integer because the input to a neural network must be numerical. 

The easiest way to do this is with a `tf.keras.layers.TextVectorization` layer, which encodes the text (i.e., converts it from characters to integer IDs).
This layer turns raw strings into an encoded representation that can be read by neural network layers.
We set `split="character"` to get character-level encoding rather than the default word-level encoding, and we use `standardize="lower"` to convert the text to lowercase.

In [3]:
# Use tf.keras.layers.TextVectorization to create a text_vec_layer object. 
# Set the split parameter to "character" and standardize parameter to "lower".
text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")

# Call the adapt() method on text_vec_layer passing in a list of one item, shakespeare_text.
text_vec_layer.adapt([shakespeare_text])

# Use text_vec_layer on shakespeare_text (as list of one item) to obtain encoded
# character ID sequences.
# Save the first list element of this result to a variable called encoded
encoded = text_vec_layer([shakespeare_text])[0]
encoded

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([21,  7, 10, ..., 22, 28, 12])>

Each character is now mapped to an integer, starting at 2. The `TextVectorization` layer reserves the value 0 for padding tokens and 1 for unknown characters.
We won’t need either of these tokens for now, because neither a padding token nor an unknown character ever occurs in our encoded text, so we won't be using them to write our sonnets either.
(When have you seen Shakespeare make up unknown characters? That's why.)
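
To make the ID layout concrete, here is a pure-Python sketch of a character-to-ID mapping that mimics the convention described above: 0 for padding, 1 for unknown characters, and real characters from 2 upward. It is an illustration only, not the layer's actual implementation. The helper names `build_char_ids` and `encode` are hypothetical, and assigning the smallest IDs to the most frequent characters is an assumption (tie-breaking in particular may differ from what the layer does).

```python
from collections import Counter

PAD_ID, UNK_ID = 0, 1  # reserved IDs, following the convention described above


def build_char_ids(text):
    """Map each lowercase character to an integer ID starting at 2."""
    counts = Counter(text.lower())
    # Assumption: the most frequent characters get the smallest IDs.
    ordered = [ch for ch, _ in counts.most_common()]
    return {ch: i + 2 for i, ch in enumerate(ordered)}


def encode(text, char_ids):
    """Encode text as a list of IDs, using UNK_ID for unseen characters."""
    return [char_ids.get(ch, UNK_ID) for ch in text.lower()]


sample = "To be or not to be"
char_ids = build_char_ids(sample)
ids = encode(sample, char_ids)
shifted = [i - 2 for i in ids]  # drop the two reserved slots

print(char_ids)
print(ids)      # smallest ID is 2
print(shifted)  # smallest ID is 0
```

After the shift, the IDs run from 0 to the number of distinct characters minus 1, which is the range an embedding layer with that many input dimensions expects.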

Let’s subtract 2 from the character IDs and compute the number of distinct characters and the total number of characters:

In [4]:
# Drop tokens 0 (pad) and 1 (unknown) by subtracting 2 from the character IDs
# stored in variable encoded
# (They were not in original vocabulary anyway)
encoded -= 2

# Compute the number of distinct characters n_tokens using vocabulary_size()
# method of text_vec_layer
# Remember: we are removing 2 
n_tokens = text_vec_layer.vocabulary_size() - 2

# Compute the total number of characters 
dataset_size = len(encoded)


# Code you don't need to worry about
print("Dataset size: ", dataset_size)
print("Number of tokens: ", n_tokens)

Dataset size:  1115394
Number of tokens:  39


To train a sequence-to-sequence RNN, we convert this long sequence into input/target pairs by dividing the data into windows of a fixed size. For instance, the window "to be or not to be" is turned into the input sequence "to be or not to b" and the target sequence "o be or not to be". The target is the input shifted by one character, so at every time step the target is the character that follows the input seen so far; after the full input, the next character should be "e". The model can then be trained on these input/target pairs to learn the underlying patterns in the text and generate more text in a similar style.
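
The windowing idea can be sketched in plain Python before involving `tf.data`. The helper `make_windows` below is a hypothetical name used only for illustration. It slides a window of `length + 1` characters over the sequence one position at a time, then splits each window into an input (all but the last character) and a target (all but the first).

```python
def make_windows(sequence, length):
    """Return (input, target) pairs from sliding windows of length + 1."""
    pairs = []
    # Shift by one position each time; windows that would run past the end
    # of the sequence are dropped.
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs


text = "to be or not to be"
pairs = make_windows(text, length=len(text) - 1)
print(pairs)  # a single pair, since only one full window fits

for inputs, targets in make_windows("to be", length=3):
    print(repr(inputs), "->", repr(targets))
```

A sequence of `n` items yields `n - length` windows, so consecutive windows overlap almost entirely.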

The function `to_dataset` will convert our long sequence of character IDs (encoded text) into a dataset of input/target window pairs:

In [9]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):

    # Prepare dataset of character IDs to be processed by tensorflow.
    # Use tf.data.Dataset.from_tensor_slices and pass in sequence as argument
    # and store as variable ds.
    ds = tf.data.Dataset.from_tensor_slices(sequence)

    # Create windows of size length + 1.
    # Call the window() method on ds, setting the size parameter to length+1,
    # the shift parameter to 1, drop_remainder parameter to True.
    # Save the result in variable ds.
    ds = ds.window(size=length + 1, shift=1, drop_remainder=True)
    
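    # window() returns a dataset of small datasets (one per window), so
    # flatten it into a dataset of tensors, each holding length + 1 IDs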
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))

    # If the shuffle is set to True, update variable ds by calling shuffle()
    # method on it with parameter buffer_size set to 100_000 and seed parameter
    # set to seed
    if shuffle:
        ds = ds.shuffle(buffer_size=100_000, seed=seed)

    # Batch the resulting dataset. 
    # Update the ds variable by calling the batch() method on ds with parameter 
    # batch_size=batch_size
    ds = ds.batch(batch_size=batch_size)

    # Create input/output sequences by taking first *length* characters as input
    # and last *length* characters as output.
    # Do this by using the map() method on ds, and use the lambda function to 
    # create a tuple with the first length characters as the first element
    # and the last length characters as the second element
    ds = ds.map(lambda window: (window[:, :-1], window[:, 1:]))
    
    # Prefetch one batch so that data preparation can overlap with training
    return ds.prefetch(1)

This diagram illustrates what `to_dataset` is doing:

<img src="to_dataset.png" width="500" style="display: block; margin: 0 auto">

Batching is a technique used to divide large datasets into smaller subsets or batches.
Instead of feeding the entire dataset (of our input/output pairs of windows) to the model at once, we divide it into batches, which are fed to the model one-by-one during training.
Each batch is processed independently, and the model updates its weights after processing each batch.
Batching makes training more efficient. 
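
As a rough illustration of what batching does to a list of windows, here is a stdlib-only sketch. The `batch` helper is hypothetical and only mimics grouping consecutive items; it is not how `tf.data` is implemented. As with the default behaviour of the dataset's `batch()` method, the final batch may be smaller than the others.

```python
def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


windows = list(range(70))  # stand-ins for 70 input/target window pairs
batches = batch(windows, batch_size=32)
print([len(b) for b in batches])
```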

In [10]:
# example using to_dataset()
# This code creates a dataset with a single training example: an input/output pair
# The input represents "to b" and the output represents "o be", so the model 
# should learn to predict the next character, i.e., "e"
list(to_dataset(text_vec_layer(["To be"])[0], length=4))

[(<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 4,  5,  2, 23]])>,
  <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 5,  2, 23,  3]])>)]

Since the entire dataset is 1,115,394 characters long and we have limited time, we will use only part of the dataset (half of it, as set by `subset_proportion` in the next cell) to make sure we can finish training during this workshop.
We will split this part so that roughly 90% is used for training, 5% for validation and the remaining 5% for testing.

We set the window length to 100, but it is worth experimenting with other values.
Shorter windows make the RNN easier and quicker to train, but the RNN cannot learn any pattern longer than the window, so it is important to avoid choosing a window length that is too small.

In [11]:
# Window length and the proportion of the full dataset to use
length = 100
subset_proportion = 0.5

# Create variable reduced_dataset_size that calculates the size of the dataset
# we will be using in training using dataset_size and subset_proportion.
# Your result should be an integer.
reduced_dataset_size = int(dataset_size * subset_proportion)

# Slice the encoded data into training, validation, and test sets by the following proportions:
# - 90% for training (train_encoded)
# - 5% for validation (validation_encoded)
# - 5% for testing (test_encoded)

# The following code slices encoded data as per the above specifications.

# Slice the first 90% of data for training using integer indexing
train_encoded = encoded[:int(reduced_dataset_size * 0.9)]

# Slice the next 5% of data for validation
validation_encoded = encoded[int(reduced_dataset_size * 0.9):int(reduced_dataset_size * 0.95)]

# Slice the last 5% of data for testing
test_encoded = encoded[int(reduced_dataset_size * 0.95):reduced_dataset_size]

# Set a random seed (42) for reproducibility
tf.random.set_seed(42)

# Use the function to_dataset() to create training, validation, and test sets
# by calling the function with the corresponding slicing of the encoded data

# Call the function to create a training set with the following specifications:
# - Pass in train_encoded as the sequence parameter
# - Set length of the window to 100 using length parameter
# - Set shuffle parameter to True to shuffle the data
# - Set seed to 42 for reproducibility
train_set = to_dataset(train_encoded, length=length, shuffle=True, seed=42)

# Call the function to create a validation set with the following specifications:
# - Pass in validation_encoded as the sequence parameter
# - Set length of the window to 100 using length parameter
valid_set = to_dataset(validation_encoded, length=length)

# Call the function to create a test set with the following specifications:
# - Pass in test_encoded as the sequence parameter
# - Set length of the window to 100 using length parameter
test_set = to_dataset(test_encoded, length=length)
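
To get a feel for the sizes involved, the arithmetic behind the split can be worked through without TensorFlow. This sketch assumes the values used above (a dataset of 1,115,394 characters, `subset_proportion = 0.5`, `length = 100`) and the default `batch_size=32` of `to_dataset()`. The variable names `reduced`, `train_end`, `valid_end` and `sizes` are introduced here only for illustration.

```python
import math

dataset_size = 1_115_394  # total characters, as printed earlier
subset_proportion = 0.5
length = 100
batch_size = 32           # default of to_dataset()

reduced = int(dataset_size * subset_proportion)
train_end = int(reduced * 0.9)
valid_end = int(reduced * 0.95)

sizes = {
    "train": train_end,
    "validation": valid_end - train_end,
    "test": reduced - valid_end,
}

for name, n_chars in sizes.items():
    n_windows = n_chars - length  # one window per valid start position
    n_batches = math.ceil(n_windows / batch_size)
    print(f"{name}: {n_chars} chars, {n_windows} windows, {n_batches} batches")
```

Under these assumptions the training set yields roughly 15,700 batches per epoch, which helps explain why training takes a while.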

## Building and Training the Char-RNN Model

**Warning**: the following code may take one or two hours to run, depending on your GPU. Without a GPU, it may take over 24 hours. If you don't want to wait, just skip the next two code cells and run the code below to download a pretrained model.

To enable GPU acceleration on a Mac, install the Metal plugin from a terminal: `python -m pip install tensorflow-metal`

**Note**: the `GRU` class will only use cuDNN acceleration (assuming you have a GPU) when using the default values for the following arguments: `activation`, `recurrent_activation`, `recurrent_dropout`, `unroll`, `use_bias` and `reset_after`.

Since our dataset is reasonably large, and modeling language is quite a difficult task, we need more than a simple RNN with a few recurrent neurons.
Let’s build and train a model with one GRU layer composed of 128 units (you can try tweaking the number of layers and units later, if needed).

Let’s go over the code in the next cell:

- We use an `Embedding` layer as the first layer, to encode the character IDs (embeddings were introduced in Chapter 13). The `Embedding` layer’s number of input dimensions is the number of distinct character IDs, and the number of output dimensions is a hyperparameter you can tune—we’ll set it to 16 for now. Whereas the inputs of the `Embedding` layer will be 2D tensors of shape *[batch size, window length]*, the output of the Embedding layer will be a 3D tensor of shape *[batch size, window length, embedding size]*.

- We use a `Dense` layer for the output layer: it must have 39 units (n_tokens) because there are 39 distinct characters in the text, and we want to output a probability for each possible character (at each time step). The 39 output probabilities should sum up to 1 at each time step, so we apply the softmax activation function to the outputs of the Dense layer.

- Lastly, we compile this model, using the `"sparse_categorical_crossentropy"` loss and a Nadam optimizer, and we train the model for several epochs, using a `ModelCheckpoint` callback to save the best model (in terms of validation accuracy) as training progresses.
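
The roles of the softmax activation and the sparse categorical cross-entropy loss can be illustrated with a few lines of plain Python. This is a sketch for a single time step with a toy three-character vocabulary, not Keras's implementation; the function names are chosen here for illustration only.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probas, target_id):
    """Negative log-probability assigned to the correct class ID."""
    return -math.log(probas[target_id])


logits = [2.0, 1.0, 0.1]  # toy scores for a 3-character vocabulary
probas = softmax(logits)
print(probas, sum(probas))

loss_likely = sparse_categorical_crossentropy(probas, target_id=0)
loss_unlikely = sparse_categorical_crossentropy(probas, target_id=2)
print(loss_likely, loss_unlikely)  # the unlikely target gives the larger loss
```

The loss is called "sparse" because the target is a single integer class ID rather than a one-hot vector, which matches the integer character IDs produced earlier.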

In [17]:
tf.random.set_seed(42)  
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

# Call the compile method on model, setting these parameters:
# - loss as "sparse_categorical_crossentropy"
# - optimizer as "nadam"
# - metrics as a single-element list containing only "accuracy"
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])

# A callback is a set of functions that can be applied during training to 
# perform various tasks, such as saving the best model weights, early stopping 
# if the validation loss stops improving, etc.
# Use tf.keras.callbacks.ModelCheckpoint to create a model_ckpt object. 
# Set the filepath parameter to "my_shakespeare_model", monitor parameter to
# "val_accuracy" and save_best_only parameter to True.
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    filepath="my_shakespeare_model",
    monitor="val_accuracy",
    save_best_only=True)

# Train the model using the fit() method. Pass train_set as the training data
# and valid_set as the validation_data parameter.
# Also, set the number of epochs to a number between 3 and 5 and the callbacks
# parameter to a list with only one element: model_ckpt
history = model.fit(train_set,
                    validation_data=valid_set,
                    epochs=5,
                    callbacks=[model_ckpt])

Epoch 1/5
    266/Unknown - 21s 46ms/step - loss: 2.8339 - accuracy: 0.2172

This model does not handle text preprocessing, so let’s wrap it in a final model containing the `tf.keras.layers.TextVectorization` layer as the first layer, plus a `tf.keras.layers.Lambda` layer to subtract 2 from the character IDs (since we’re not using the padding and unknown tokens for now):

In [33]:
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

Since model training takes a long time, we have a pretrained model for you.
The following code downloads it and loads it into `model`, replacing the model trained above.
Skip this cell if you would rather keep using your own model.

In [18]:
from pathlib import Path

# Downloads a pretrained model
with tf.device('/CPU:0'):
    url = "https://github.com/ageron/data/raw/main/shakespeare_model.tgz"
    path = tf.keras.utils.get_file("shakespeare_model.tgz", url, extract=True) 
    model_path = Path(path).with_name("shakespeare_model")
    model = tf.keras.models.load_model(model_path)

Now let's use it to predict the next character in the sequence:

In [27]:
# Call the predict method on ["To be or not to b"]
y_prob = model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_prob)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]



'e'

In [29]:
model.predict(["To be or not to b"])



array([[[1.81564450e-01, 6.64177760e-02, 6.29598182e-03, 1.06545657e-01,
         2.56400798e-02, 7.74683431e-02, 3.66694391e-01, 4.69605159e-03,
         3.51578929e-02, 1.17589252e-05, 9.23981052e-03, 1.84964873e-02,
         5.56145835e-07, 1.79435220e-02, 2.96746817e-04, 6.57378184e-03,
         1.29411556e-02, 2.60792151e-02, 1.12853933e-03, 9.53491326e-05,
         1.11382190e-06, 1.23063379e-04, 1.17037701e-03, 9.44408122e-03,
         2.98097063e-07, 3.17247171e-08, 1.23338075e-02, 5.36633609e-03,
         1.70396885e-03, 3.56346532e-03, 1.34389929e-03, 1.66126771e-03,
         5.96528821e-07, 1.22643662e-09, 1.27850730e-09, 1.44478562e-07,
         6.61337929e-10, 4.58128244e-19, 9.39734246e-09],
        [7.48782039e-01, 8.53534402e-06, 1.50150236e-05, 1.44324629e-02,
         8.39594213e-06, 7.65351113e-04, 2.05993318e-04, 6.82524813e-04,
         7.88733065e-02, 2.04552170e-02, 1.16905849e-02, 1.67817920e-02,
         1.00985453e-05, 1.16155930e-02, 1.32352859e-02, 9.4074131

Yay! Our model made the correct prediction!
It is now ready to write full sonnets!

## Generating Fake Shakespearean Text

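**Known issue**: we have not been able to get TensorFlow to use the GPU for both training and running the model. Output produced on the GPU is poor, whereas output produced on the CPU is good, so for now there is a trade-off between speed and quality.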

To generate new text using the char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it to the end of the text, then give the extended text to the model to guess the next letter, and so on. This is called *greedy decoding*. But in practice this often leads to the same words being repeated over and over again. Instead, we can sample the next character randomly, with a probability equal to the estimated probability, using TensorFlow’s `tf.random.categorical()` function. This will generate more diverse and interesting text. The `categorical()` function samples random class indices, given the class log probabilities (logits). For example:

In [20]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 1, 0, 2, 1, 0, 0, 1]])>
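
The difference between greedy decoding and sampling can also be seen without TensorFlow. The sketch below uses Python's `random.choices` as a stand-in for `tf.random.categorical()`. That substitution is an assumption made for illustration: the two functions take different inputs (probabilities as weights versus log probabilities) and will not produce the same random draws.

```python
import random

probas = [0.5, 0.4, 0.1]
class_ids = range(len(probas))

# Greedy decoding: always pick the most probable class.
greedy = max(class_ids, key=probas.__getitem__)

# Sampling: pick each class with its estimated probability.
rng = random.Random(42)  # seeded so the run is repeatable
samples = rng.choices(class_ids, weights=probas, k=10_000)
freqs = [samples.count(i) / len(samples) for i in class_ids]

print("greedy choice:", greedy)
print("sampled frequencies:", freqs)
```

Greedy decoding returns class 0 every time, while the sampled frequencies approach the estimated probabilities, so less likely classes still appear occasionally.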

To have more control over the diversity of the generated text, we can divide the logits by a number called the *temperature*, which we can tweak as we wish. A temperature close to zero favors high-probability characters, while a very high temperature gives all characters a nearly equal probability. Lower temperatures are typically preferred when generating fairly rigid and precise text, such as mathematical equations, while higher temperatures are preferred when generating more diverse and creative text. The following `next_char()` custom helper function uses this approach to pick the next character to add to the input text:

In [21]:
def next_char(text, temperature=1):

    # Generate the predicted probabilities for the next character in the sequence
    # based on the current text
    # Select the final output vector from this prediction, i.e. the last character in the sequence
    y_proba = model.predict([text])[0, -1:]
    
    # Rescale the probability distribution using the temperature parameter
    rescaled_logits = tf.math.log(y_proba) / temperature

    # Sample the next character ID from this rescaled distribution
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]

    # Return the character corresponding to the sampled ID
    return text_vec_layer.get_vocabulary()[char_id + 2]
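
To see what the temperature does to a distribution, here is a small worked example in plain Python. The helper `apply_temperature` is a hypothetical name. It divides the log probabilities by the temperature and renormalizes with a softmax, which is a sketch of the effective distribution that sampling from the rescaled logits would follow.

```python
import math


def apply_temperature(probas, temperature):
    """Rescale a distribution by dividing its log-probabilities by temperature."""
    scaled = [math.log(p) / temperature for p in probas]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


probas = [0.5, 0.4, 0.1]
for t in (0.01, 1, 100):
    print(t, [round(p, 3) for p in apply_temperature(probas, t)])
```

At a temperature of 0.01 almost all the probability mass moves to the most likely class, at 1 the distribution is unchanged, and at 100 the three classes become nearly equally likely.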

Next, we can write another small helper function that will repeatedly call `next_char()` to get the next character and append it to the given text:

In [22]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [23]:
tf.random.set_seed(42)
print(extend_text("To be or not to be", temperature=0.01))

To be or not to be the duke
as it is a proper strange death,
and the


In [25]:
print(extend_text("To be or not to be", temperature=1))

To be or not to be edward knows
whose stew, am i bid them for you, g


In [26]:
print(extend_text("To be or not to be", temperature=100))

To be or not to bed tire
the strangeness pity. what is't your allay,
