# Language Models

In this notebook, we explore how neural networks are used for language tasks, and show how to train a model to write Shakespearian sonnets!
In the last session, we explored the value of inductive biases in designing neural networks, so we start with a discussion.

# 1 From Application to Challenges: Why We Model Language

## 1.1 Applications of Language Models
Language models allow computers to understand and generate human-like text, which has various applications, such as:
- Machine Translation: converting texts from one language to another;
- Text Generation: generating natural language sentences or paragraphs;
- Sentiment Analysis: determining the attitude or emotion expressed in a piece of text, e.g. in product reviews;
- Question Answering;
- Text Summarization.

## 1.2 Challenges with Language Modelling
However, despite the progress made in language modelling, there are still several challenges that need to be addressed:
- Ambiguity: words or phrases with several possible meanings or interpretations;
- Context: the meaning of a word can depend on the context it appears in;
- Out-of-vocabulary (OOV) words: some words don't appear in the training data (e.g. new internet slang);
- Long-term dependencies: understanding the meaning of a sentence often requires keeping track of information from earlier in the text;
- Domain-specific knowledge: e.g. medical texts will contain a lot of specialized medical terminologies.

Despite the challenges, language models are vital to enable machines to understand and generate natural language, paving the way for a wide range of NLP applications.

### Task 1: In pairs, discuss how you would design a neural network architecture to be able to process language. What features would it have?  What aspects of language are you capturing with this architecture?

# 2 Getting the Data

Our goal is to create a model that generates Shakespearian sonnets. 
One of the easiest ways to do this is to give the model some Shakespearian text and get it to predict the next letter.
For example, if we give the model "to be or not to b", it can output "e" to complete the phrase - "to be or not to be".
Then, if we give that output to the model as input, the model can give us the next character, and so on.
We might not get "to be or not to be, that is the question." as the final output, but we can get something that sounds vaguely Shakespearian.

But first, we need to get the data!

In [1]:
import urllib.request
import os

# Download the Shakespeare dataset
shakespeare_url = "https://homl.info/shakespeare"
filepath = os.path.join(os.getcwd(), "shakespeare.txt")
urllib.request.urlretrieve(shakespeare_url, filepath)

with open(filepath) as f:
    shakespeare_text = f.read()

# Shows a short text sample
print(shakespeare_text[:420])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!


That's Shakespeare, alright! 

The input to our model will be the beginning of a Shakespeare sonnet (i.e. a sequence of characters).
Given this sequence of characters, we want our model to predict the next character.
For simplicity, we will only use **lowercase** characters.

In [2]:
vocab = "".join(sorted(set(shakespeare_text.lower())))
vocab_size = len(vocab)

print("Our vocabulary: " + vocab)
print("Number of distinct characters: " + str(vocab_size))

Our vocabulary: 
 !$&',-.3:;?abcdefghijklmnopqrstuvwxyz
Number of distinct characters: 39


# 3 Creating the Training Dataset

The inputs to a neural network must be numerical, so we must encode every character as an integer.

The easiest way to do this is with a `tf.keras.layers.TextVectorization` layer, which converts the text from characters to integer IDs.
This layer turns raw strings into an encoded representation that can be read by neural network layers.
We set `split="character"` to get character-level encoding rather than the default word-level encoding, and we use `standardize="lower"` to convert the text to lowercase.
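
To make the idea concrete, here is a minimal plain-Python sketch of what character-level encoding does conceptually. It is not the Keras implementation: the helper names `build_vocab` and `encode` are hypothetical, and the way ties between equally frequent characters are broken is an assumption that may differ from `TextVectorization`.

```python
from collections import Counter

def build_vocab(text):
    """Index 0 is reserved for padding, index 1 for unknown characters,
    then characters follow in order of decreasing frequency."""
    counts = Counter(text.lower())
    # Ties are broken alphabetically (an assumption made for this sketch)
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return ["", "[UNK]"] + chars

def encode(text, vocab):
    """Map each lowercased character to its integer ID (1 if unknown)."""
    char_to_id = {c: i for i, c in enumerate(vocab)}
    return [char_to_id.get(c, 1) for c in text.lower()]

toy_vocab = build_vocab("to be or not to be")
toy_ids = encode("To be", toy_vocab)
print(toy_vocab)
print(toy_ids)
```

Frequent characters receive small IDs, and any character missing from the vocabulary falls back to ID 1.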

In [3]:
import tensorflow as tf

# Create a TextVectorization layer
text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")

# Build a vocabulary of all characters in the Shakespeare text
text_vec_layer.adapt([shakespeare_text])

# Use text_vec_layer on shakespeare_text to obtain encoded character ID sequences
encoded = text_vec_layer([shakespeare_text])[0]

# Visualize the encoding
print("--- Original text:\n", shakespeare_text[:60])
print("\n--- Encoded sequence:\n", encoded[:60])
print("\n--- Mapping of letters to integers:")
for i, char in enumerate(text_vec_layer.get_vocabulary()[:20]):
    print(char, "-->", i)

--- Original text:
 First Citizen:
Before we proceed any further, hear me speak.

--- Encoded sequence:
 tf.Tensor(
[21  7 10  9  4  2 20  7  4  7 37  3 11 25 12 23  3 21  5 10  3  2 18  3
  2 24 10  5 20  3  3 14  2  6 11 17  2 21 15 10  4  8  3 10 19  2  8  3
  6 10  2 16  3  2  9 24  3  6 26 28], shape=(60,), dtype=int64)

--- Mapping of letters to integers:
 --> 0
[UNK] --> 1
  --> 2
e --> 3
t --> 4
o --> 5
a --> 6
i --> 7
h --> 8
s --> 9
r --> 10
n --> 11

 --> 12
l --> 13
d --> 14
u --> 15
m --> 16
y --> 17
w --> 18
, --> 19


Looking at the above output, we can see that each character is now mapped to an integer, starting at 2. 

The `TextVectorization` layer reserved the value 0 for padding tokens, and it reserved 1 for unknown characters.
We won’t need either of these tokens for now: our text needs no padding and contains no characters outside the vocabulary, so we won't be using them to write our sonnets either.
(When have you seen Shakespeare make up unknown characters? That's why.)

Let’s subtract 2 from the character IDs and compute the number of distinct characters and the total number of characters:

In [4]:
# Drop tokens 0 (pad) and 1 (unknown) by subtracting 2 from the character IDs
encoded -= 2

# Compute the number of distinct characters
n_tokens = text_vec_layer.vocabulary_size() - 2

# Compute the total number of characters 
dataset_size = len(encoded)

print("Dataset size: ", dataset_size)
print("Number of tokens: ", n_tokens)

Dataset size:  1115394
Number of tokens:  39


As we've said, our aim is to give the model a sequence of characters (e.g. "to be or not to b"), and get it to output the next letter "e".
Equivalently, we can frame this as a sequence-to-sequence task: the input is "to be or not to b" and the target is the same sequence shifted one character to the left, "o be or not to be".
At every position, the target holds the character that follows the corresponding input character, so the final target character, "e", is what should come after the full input.

To train such a sequence-to-sequence RNN, we can convert this long sequence into input/target pairs.
This dataset creation involves dividing the data into windows of a fixed size. 
The model can then be trained on these input/target pairs to learn the underlying patterns in the text and generate more text of a similar style. 
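
The windowing idea can be sketched in plain Python before turning to `tf.data`. This is only an illustration on a short string, and the helper name `make_windows` is hypothetical.

```python
def make_windows(sequence, length):
    """Slide a window of size length + 1 over the sequence (shift of 1)
    and split each window into an (input, target) pair."""
    pairs = []
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        # Input: first `length` items; target: last `length` items
        pairs.append((window[:-1], window[1:]))
    return pairs

toy_pairs = make_windows("to be or", length=4)
for inputs, targets in toy_pairs:
    print(repr(inputs), "->", repr(targets))
```

A sequence of `n` characters yields `n - length` windows, and each target is its input shifted by one character.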

The function `to_dataset` will convert our long sequence of character IDs (encoded text) into a dataset of input/target window pairs.

### Task 2: In the code below, create input/output sequences by taking first `length` characters as input and last `length` characters as output.

In [17]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):

    # Prepare dataset of character IDs to be processed by tensorflow.
    ds = tf.data.Dataset.from_tensor_slices(sequence)

    # Create windows of size length + 1.
    ds = ds.window(size=length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))

    if shuffle:
        ds = ds.shuffle(buffer_size=100_000, seed=seed)

    # Batch the resulting dataset
    ds = ds.batch(batch_size=batch_size)

    # TODO: Create input/output sequences by taking first *length* characters 
    # as input and last *length* characters as output.
    # Hint: using the map() method on ds, create a tuple with the first [length]
    # characters as the 1st element and the last [length] characters as the 2nd element; 
    # [window] has shape (batch_size, length + 1)
    ds = ds.map(lambda window: (window[:, :-1], window[:, 1:]))
    
    return ds.prefetch(1)

This diagram illustrates what `to_dataset` is doing:

<img src="to_dataset.png" width="500" style="display: block; margin: 0 auto">

Batching is a technique used to divide large datasets into smaller subsets or batches.
Instead of feeding the entire dataset (of our input/output pairs of windows) to the model at once, we divide it into batches, which are fed to the model one-by-one during training.
Each batch is processed independently, and the model updates its weights after processing each batch.
Batching makes training more efficient. 
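
As a small illustration of batching, the sketch below splits a list of ten toy examples into batches of four. The helper name `make_batches` is hypothetical and the list of integers only stands in for real input/output pairs.

```python
def make_batches(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

toy_examples = list(range(10))
toy_batches = list(make_batches(toy_examples, batch_size=4))
print(toy_batches)
```

Note that the final batch is smaller than the others when the number of examples is not a multiple of the batch size.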

Let's look at an example of `to_dataset()`. 
The code below creates a dataset with a single training example: an input/output pair.
The input represents "to b" and the output represents "o be", so the model should learn to predict the next character, i.e. "e".

In [18]:
list(to_dataset(text_vec_layer(["To be"])[0], length=4))


[(<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 4,  5,  2, 23]])>,
  <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 5,  2, 23,  3]])>)]

Since the entire dataset is 1,115,394 characters long and we have limited time, we will use a smaller portion of the dataset to make sure we can finish training during this workshop.
We will split it up so we use roughly 90% for training, 5% for validation and the remaining 5% for testing.

We set the window length to 100 here, but it is worth experimenting with different values.
Shorter windows make the RNN easier and quicker to train, but the RNN cannot learn any pattern that is longer than the window, so it is important not to choose a length that is too small.
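
As a rough sanity check on the sizes involved, the arithmetic for a 5% subset, a 90%/5%/5% split, a window length of 100 and a batch size of 32 can be worked through as below. The helper name `n_batches` and the `n_*` variable names are hypothetical; the dataset size is the one reported earlier.

```python
import math

dataset_size = 1_115_394   # total number of characters, as reported above
subset_proportion = 0.05
length = 100
batch_size = 32

reduced = int(dataset_size * subset_proportion)
train_end = int(reduced * 0.9)
valid_end = int(reduced * 0.95)

n_train = train_end                # characters in the training slice
n_valid = valid_end - train_end    # characters in the validation slice
n_test = reduced - valid_end       # characters in the test slice

def n_batches(n_chars, length, batch_size):
    """Windows of size length + 1 with shift 1, grouped into batches."""
    n_windows = n_chars - length
    return math.ceil(n_windows / batch_size)

print(reduced, n_train, n_valid, n_test)
print(n_batches(n_train, length, batch_size))
```

With these settings the training slice produces 1566 batches per epoch, which is the step count to expect in the training log.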

### Task 3: Slice the data into training, validation and test sets using the proportions specified.

In [19]:
length = 100
subset_proportion = 0.05
reduced_dataset_size = int(dataset_size * subset_proportion)

# TODO: Slice the encoded data into training, validation, and test sets using 
# the proportions 90%, 5%, 5%
train_encoded = encoded[:int(reduced_dataset_size * 0.9)]
validation_encoded = encoded[int(reduced_dataset_size * 0.9):int(reduced_dataset_size * 0.95)]
test_encoded = encoded[int(reduced_dataset_size * 0.95):reduced_dataset_size]

# Create datasets
tf.random.set_seed(42)
train_set = to_dataset(train_encoded, length=length, shuffle=True, seed=42)
valid_set = to_dataset(validation_encoded, length=length)
test_set = to_dataset(test_encoded, length=length)


# 4 Training Our Own Shakespeare

Since our dataset is reasonably large, and modeling language is quite a difficult task, we need more than a simple RNN with a few recurrent neurons.
Let’s build and train a model with one GRU layer (a type of RNN layer) composed of 128 units (you can try tweaking the number of layers and units later, if needed).

Let’s go over this code:

- We use an `Embedding` layer as the first layer, to encode the character IDs (embeddings were introduced in Chapter 13). The `Embedding` layer’s number of input dimensions is the number of distinct character IDs, and the number of output dimensions is a hyperparameter you can tune — we’ll set it to 16 for now. Whereas the inputs of the `Embedding` layer will be 2D tensors of shape *[batch size, window length]*, the output of the Embedding layer will be a 3D tensor of shape *[batch size, window length, embedding size]*.

- We use a `Dense` layer for the output layer: it must have 39 units (n_tokens) because there are 39 distinct characters in the text, and we want to output a probability for each possible character (at each time step). The 39 output probabilities should sum up to 1 at each time step, so we apply the softmax activation function to the outputs of the Dense layer.

- Lastly, we compile this model, using the `"sparse_categorical_crossentropy"` loss and a Nadam optimizer, and we train the model for several epochs, using a `ModelCheckpoint` callback to save the best model (in terms of validation accuracy) as training progresses.
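
To see what the softmax output and the `"sparse_categorical_crossentropy"` loss compute, here is a minimal plain-Python sketch for a single time step and a toy vocabulary of three characters. The function names are hypothetical; to our understanding Keras evaluates the same quantity at every time step and averages it, so treat this as an illustration rather than the library's implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(logits)  # subtracting the max improves numerical stability
    exps = [math.exp(z - biggest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_crossentropy(target_id, probs):
    """Negative log-probability given to the correct class,
    where the class is an integer ID rather than a one-hot vector."""
    return -math.log(probs[target_id])

toy_probs = softmax([2.0, 1.0, 0.1])
loss_if_likely = sparse_crossentropy(0, toy_probs)    # target is the top class
loss_if_unlikely = sparse_crossentropy(2, toy_probs)  # target is the least likely class
print(toy_probs, loss_if_likely, loss_if_unlikely)
```

The loss is small when the model puts high probability on the true next character and grows as that probability shrinks.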

### Task 4: Experiment with different values for epochs
In machine learning, an epoch is one complete pass of the entire training dataset through the neural network.
During training, the data is divided into batches, and each batch involves one forward pass and one backward pass (a weight update); an epoch ends once every batch has been processed.
Increasing the number of epochs gives the model more opportunities to refine its weights and improve its predictions on the training set.

In [20]:
# Create the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

# Compile the model, i.e. give it loss function, optimizer and metrics
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])

# A callback is a set of functions that can be applied during training to 
# perform various tasks, such as saving the best model weights, early stopping 
# if the validation loss stops improving, etc.
# Create a ModelCheckpoint callback that saves the best model weights to a file
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    filepath="my_shakespeare_model",
    monitor="val_accuracy",
    save_best_only=True)

# Train the model using the fit() method. Pass train_set as the training data 
# and valid_set as the validation_data argument.
# TODO: Choose the number of epochs to train the model for (e.g. 2-5)
history = model.fit(train_set,
                    validation_data=valid_set,
                    epochs=3,
                    callbacks=[model_ckpt])

Epoch 1/3
   1566/Unknown - 81s 49ms/step - loss: 2.0356 - accuracy: 0.4062
INFO:tensorflow:Assets written to: my_shakespeare_model/assets

Epoch 2/3
Epoch 3/3


This model does not handle text preprocessing, so let’s wrap it in a final model containing the `tf.keras.layers.TextVectorization` layer as the first layer, plus a `tf.keras.layers.Lambda` layer to subtract 2 from the character IDs (since we’re not using the padding and unknown tokens for now):

In [21]:
# Add text preprocessing to the model
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

Since model training takes a long time, we have a pretrained model for you.
The following code will download it.
Uncomment the last line if you want to use it instead of the model trained above.

In [22]:
from pathlib import Path

# Downloads a pretrained model
url = "https://github.com/ageron/data/raw/main/shakespeare_model.tgz"
path = tf.keras.utils.get_file("shakespeare_model.tgz", url, extract=True) 
model_path = Path(path).with_name("shakespeare_model")
# model = tf.keras.models.load_model(model_path)

Let's give it a spin!

In [23]:
# Call the predict method on ["To be or not to b"].
# The output has shape [batch size, sequence length, n_tokens], so [0, -1] 
# selects the first (only) text and its last time step: the probabilities 
# for the character that follows the input.
y_prob = model.predict(["To be or not to b"])[0, -1] 
y_pred = tf.argmax(y_prob)  # choose the most probable character ID

# Use the vocabulary of the text_vec_layer to get the character corresponding to y_pred
text_vec_layer.get_vocabulary()[y_pred + 2]



'e'

Yay! Our model made a prediction (hopefully a correct one)!
It is now ready to write full sonnets!

# 5 Making Inferences, i.e. writing sonnets

To generate new text using the char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it to the end of the text, then give the extended text to the model to guess the next letter, and so on.
This is called greedy decoding.
But in practice this often leads to the same words being repeated over and over again.
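
The repetition problem is easy to reproduce with a toy example. In the sketch below, `toy_model` is a hypothetical lookup table of next-character probabilities that depends only on the last character; it stands in for the trained network and its numbers are made up for illustration.

```python
# Hypothetical next-character probabilities, keyed by the last character
toy_model = {
    "t": {"h": 0.6, "o": 0.4},
    "h": {"e": 0.7, "a": 0.3},
    "e": {" ": 0.8, "r": 0.2},
    " ": {"t": 0.6, "a": 0.4},
}

def greedy_extend(text, n_chars):
    """Repeatedly append the single most probable next character."""
    for _ in range(n_chars):
        probs = toy_model[text[-1]]
        text += max(probs, key=probs.get)
    return text

greedy_text = greedy_extend("t", n_chars=11)
print(repr(greedy_text))
```

Because the most probable choice is always taken, the output falls into a loop and never explores the less likely continuations.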

Instead, we can sample the next character randomly, with a probability equal to the estimated probability, using TensorFlow’s `tf.random.categorical()` function.
This will generate more diverse and interesting text. The `categorical()` function samples random class indices, given the class log probabilities (logits). For example:

In [24]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 1, 0, 2, 1, 0, 0, 1]])>

To have more control over the diversity of the generated text, we can divide the logits by a number called the **temperature**, which we can tweak as we wish.
A temperature close to zero favors the highest-probability characters, while a very high temperature pushes all characters toward equal probability.
Lower temperatures are typically preferred when generating fairly rigid and precise text, such as mathematical equations, while higher temperatures are preferred when generating more diverse and creative text.
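
The effect of the temperature can be checked numerically. The plain-Python sketch below divides the log-probabilities by the temperature and renormalizes; the helper name `apply_temperature` is hypothetical, and the renormalization is written out only for illustration, since `tf.random.categorical()` accepts unnormalized log-probabilities.

```python
import math

def apply_temperature(probs, temperature):
    """Divide log-probabilities by the temperature, then renormalize."""
    scaled = [math.log(p) / temperature for p in probs]
    biggest = max(scaled)  # subtracting the max improves numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

base_probs = [0.5, 0.4, 0.1]
for temperature in (0.1, 1.0, 10.0):
    rescaled = apply_temperature(base_probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature of 1 leaves the distribution unchanged, a low temperature concentrates almost all the probability on the most likely character, and a high temperature flattens the distribution.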

The following `next_char()` helper function uses this approach to pick the next character to add to the input text:

In [25]:
def next_char(text, temperature=1):

    # Generate the predicted probabilities for the next character in the 
    # sequence based on the current text.
    # Select the output for the last time step; the slice -1: keeps a 2D 
    # shape [1, n_tokens], which tf.random.categorical() expects
    y_proba = model.predict([text])[0, -1:]
    
    # Rescale the probability distribution using the temperature parameter
    rescaled_logits = tf.math.log(y_proba) / temperature

    # Sample the next character ID from this rescaled distribution
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]

    # Return the character corresponding to the sampled ID
    return text_vec_layer.get_vocabulary()[char_id + 2]

Next, we can write another small helper function that will repeatedly call `next_char()` to get the next character and append it to the given text:

In [26]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

That's all we need!

### Task 5: Tune the temperature to see the impact on sonnet quality

In [27]:
# TODO: change the temperature parameter to see the effect on the generated text
print(extend_text("To be or not to be", temperature=0.01))

To be or not to bear at the people in many to him, and the hather be


# 6 More on Language Models

In this notebook, we explored the essential concepts of Language Models and the various applications they have in the field of NLP.
We also discussed the challenges that come with building an accurate language model, such as ambiguity, context, out-of-vocabulary words, long-term dependencies and domain-specific knowledge.

While we were able to build a simple language model that works at the character level, we must keep in mind that natural language is much more complex than this. 
Language models that can also understand the structure of words in sentences and comprehend their meaning require more sophisticated architectures and techniques such as Word Embeddings, Recurrent Neural Networks and Transformers.

We encourage you to take what you have learned in this workshop and experiment more with the Shakespeare model: increase the proportion of the data used, train for more epochs and add more layers. 
(The pre-trained model we loaded used 10 epochs and the full training set.)
Additionally, you can also experiment with different text preprocessing techniques, different architectures and hyperparameters to achieve better results. 
The possibilities are endless, and there is always more to learn!
