#**Generative Text Model Trained on a Corpus of Shakespeare**





Let's download some Shakespeare.

In [1]:
import keras

filename = keras.utils.get_file(
    origin="https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt",
)
shakespeare = open(filename, "r").read()

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1115394/1115394 ━━━━━━━━━━━━━━━━━━━━ 1s 1us/step


Let's see what's in the corpus...

In [2]:
# The first 250 characters of text... including line breaks.
print(shakespeare[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



Let's make some training observations...

In [10]:
import tensorflow as tf

# Each observation will be 100 characters long.
sequence_length = 100

# Split the original document into consecutive, non-overlapping chunks of 100 characters.
def split_input(text, sequence_length):

    # Step through the text one 100-character segment at a time.
    for i in range(0, len(text), sequence_length):

        # Yield each chunk in turn; the caller collects the chunks into a list.
        yield text[i:i + sequence_length]

# Features (the x's) are non-overlapping 100-character chunks of the original text.
features = list(split_input(shakespeare[:-1], sequence_length))

# Labels are also non-overlapping 100-character chunks, offset from the features by one character,
# so the label at each position is the character that follows the corresponding input character.
# The goal of this training model is to learn weights for an embedding layer and an RNN
# that we can reuse in a generative model that makes single-token predictions.
# The training model itself is a throw-away model that makes 100 next-token predictions at a time.
labels = list(split_input(shakespeare[1:], sequence_length))

# Build a TensorFlow Dataset from the (features, labels) pairs.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

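Here are the first 20 characters of the first two observations (features first, then labels):

In [11]:
# Use separate loop names so the `features` and `labels` lists defined above are not overwritten.
for x, y in dataset.take(2).as_numpy_iterator():
    print(x[:20])
    print(y[:20])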

b'First Citizen:\nBefor'
b'irst Citizen:\nBefore'
b' are all resolved ra'
b'are all resolved rat'
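
To make the one-character offset concrete, here is a toy version of the same split on a short string. The string and the chunk length of 4 are stand-ins chosen for illustration; only the splitting logic mirrors the cell above.

```python
def split_input(text, sequence_length):
    """Yield consecutive, non-overlapping chunks of `text`."""
    for i in range(0, len(text), sequence_length):
        yield text[i:i + sequence_length]

toy_text = "To be, or not"  # hypothetical stand-in for the corpus
toy_length = 4              # hypothetical chunk length (the notebook uses 100)

# Features drop the last character; labels drop the first one.
toy_features = list(split_input(toy_text[:-1], toy_length))
toy_labels = list(split_input(toy_text[1:], toy_length))

for x, y in zip(toy_features, toy_labels):
    # Each label chunk is its feature chunk shifted left by one character.
    print(repr(x), "->", repr(y))
```

At every position, the label character is the one that follows the feature character in the original text, which is exactly the next-token target the training model learns to predict.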


We now use a TextVectorization layer to split each sequence into individual characters and convert them into integer indices, and we apply that layer to our TensorFlow Dataset. We work with a character-level model because the softmax then has far fewer values to predict: 26 letters in upper and lower case, plus punctuation, spaces, line breaks and the like. Indeed, the tokenizer's vocabulary has just 67 entries for the entire document, a count that includes the two entries the layer reserves for padding and out-of-vocabulary tokens.

In [12]:
from keras import layers

# Define the layer
tokenizer = layers.TextVectorization(
    standardize=None,
    split="character",
    output_sequence_length=sequence_length,
)

# Adapt it to the text we pull from our dataset (dropping the labels, since the layer only needs the text to build its vocabulary)
tokenizer.adapt(dataset.map(lambda text, labels: text))

# Note: vocabulary_size() also counts the two entries the layer reserves for padding and out-of-vocabulary tokens.
vocabulary_size = tokenizer.vocabulary_size()

print(f'We have {vocabulary_size} unique characters in our document.')

We have 67 unique characters in our document.
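
As a rough mental model of what character-level vectorization does, the following pure-Python sketch builds a vocabulary and encodes a string. It is not how the Keras layer is implemented; the helper names are hypothetical, and the alphabetical tie-break between equally frequent characters is an assumption made only to keep the example deterministic.

```python
from collections import Counter

def build_char_vocabulary(text):
    """Return a list of tokens; a token's list position is its integer index."""
    counts = Counter(text)
    # Most frequent characters first (ties broken alphabetically, an assumption).
    by_frequency = sorted(counts, key=lambda c: (-counts[c], c))
    # Index 0 is reserved for padding and index 1 for unknown characters.
    return ["", "[UNK]"] + by_frequency

def encode(text, vocabulary):
    """Map each character to its index, falling back to 1 ([UNK])."""
    char_to_id = {c: i for i, c in enumerate(vocabulary)}
    return [char_to_id.get(c, 1) for c in text]

toy_vocabulary = build_char_vocabulary("hello world")
toy_ids = encode("hold!", toy_vocabulary)

print(len(toy_vocabulary), toy_vocabulary)
print(toy_ids)
```

The toy text has 8 distinct characters, so the vocabulary has 10 entries once the two reserved ones are counted. The `!` never appeared in the text the vocabulary was built from, so it maps to the unknown-token index.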


Now we can use the tokenizer to tokenize the text coming out of the Dataset.

In [13]:
dataset = dataset.map(
    lambda features, labels: (tokenizer(features), tokenizer(labels)),
    num_parallel_calls=8,
)

# We shuffle inside the Dataset pipeline because model.fit(shuffle=True) has no effect when the input is a
# TensorFlow Dataset: fit() only sees one batch at a time, so Keras ignores that argument.
# In contrast, when the data is all in memory as arrays, model.fit() shuffles observations before creating batches.
# Note that shuffle() comes after batch() here, so each epoch reshuffles the order of the batches,
# while the observations inside each batch stay the same.
training_data = dataset.batch(64).cache().shuffle(10_000)

print(f'We have created a dataset object that has {dataset.cardinality()} observations.')

We have created a dataset object that has 11154 observations.
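
The order of batching and shuffling matters. The pure-Python sketch below contrasts the two orders on twelve integers standing in for observations; the `batch` helper and the seed are hypothetical and only serve the illustration.

```python
import random

def batch(items, batch_size):
    """Group items into consecutive batches of `batch_size`."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

observations = list(range(12))  # stand-ins for the 100-character sequences
rng = random.Random(0)          # arbitrary seed, for repeatability only

# Batch first, then shuffle: only the order of the batches changes.
batch_then_shuffle = batch(observations, 4)
rng.shuffle(batch_then_shuffle)

# Shuffle first, then batch: observations are mixed across batches.
shuffled = observations[:]
rng.shuffle(shuffled)
shuffle_then_batch = batch(shuffled, 4)

print(batch_then_shuffle)
print(shuffle_then_batch)
```

In the first variant every batch always contains the same neighbouring observations, which corresponds to calling `shuffle()` after `batch()` on a Dataset. Calling `shuffle()` before `batch()` would instead mix individual observations.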


#*The Architecture for our Throw-away Language Model...*

In [14]:
# Input will be 100 integer values per sequence (x implicit batch size)
inputs = layers.Input(shape=(sequence_length,), dtype="int", name="integer_seq_input")

# We project each integer index into a 256 dimensional embedding space.
x = layers.Embedding(input_dim=vocabulary_size, output_dim=256)(inputs)

# We pass the sequence of embeddings into an RNN (a GRU in this case).
# We have 1,024 GRU units here, so we obtain a vector of 1,024 values at each step.
# The hidden state at each point in the sequence can be thought of as an embedding of the sequence up to that point.
x = layers.GRU(1024, return_sequences=True)(x)

# We then have a bit of dropout
x = layers.Dropout(0.1)(x)

# Finally, a softmax prediction over the 67 vocabulary entries.
# The output shape is 100 positions by 67 softmax values per input observation.
outputs = layers.Dense(vocabulary_size, activation="softmax")(x)

model = keras.Model(inputs, outputs)

# For all 100 characters that enter the model in a given observation...
# we make parallel predictions for the character that should come next using a softmax.
# Notice the shape of our output layer is 100, 67-dimensional predictions per observation.
model.summary()
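
As a cross-check on the summary, the parameter counts can be worked out by hand. This sketch assumes the Keras default GRU configuration (`reset_after=True`, which gives each of the three gates two bias vectors) and the vocabulary size of 67 printed earlier; the variable names are hypothetical.

```python
toy_vocabulary_size = 67   # vocabulary size printed earlier in the notebook
toy_embedding_dim = 256
toy_gru_units = 1024

# Embedding: one 256-dimensional vector per vocabulary entry.
embedding_params = toy_vocabulary_size * toy_embedding_dim

# GRU: three gates, each with an input kernel, a recurrent kernel
# and two bias vectors (assuming reset_after=True).
gru_params = 3 * toy_gru_units * (toy_embedding_dim + toy_gru_units + 2)

# Dense: a weight per (GRU unit, vocabulary entry) pair, plus one bias per entry.
dense_params = toy_gru_units * toy_vocabulary_size + toy_vocabulary_size

# Dropout has no trainable parameters.
total_params = embedding_params + gru_params + dense_params

print(f"{embedding_params:,} + {gru_params:,} + {dense_params:,} = {total_params:,}")
```

Nearly all of the parameters sit in the GRU, because its recurrent kernel alone connects 1,024 units to three sets of 1,024 gate inputs.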

In [15]:
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)

model.fit(training_data, epochs=20)

Epoch 1/20
[1m175/175[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 11ms/step - loss: 3.0730 - sparse_categorical_accuracy: 0.2397
Epoch 2/20
[1m175/175[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 11ms/step - loss: 1.9993 - sparse_categorical_accuracy: 0.4127
Epoch 3/20
[1m175/175[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 11ms/step - loss: 1.7402 - sparse_categorical_accuracy: 0.4824
Epoch 4/20
[1m175/175[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 11ms/step - loss: 1.5775 - sparse_categorical_accuracy: 0.5265
Epoch 5/20
[1m175/175[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 11ms/step - loss: 1.4782 - sparse_categorical_accuracy: 0.5519
Epoch 6/20
[1m175/175[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 11ms/step - loss: 1.4141 - sparse_categorical_accuracy: 0.5687
Epoch 7/20
[1m175/175[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 11ms/step - loss: 1.3631 - sparse_categorical_accuracy: 0.5811
Epoch 8/20
[1m175/1

<keras.src.callbacks.history.History at 0x7ac09390f310>
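
To put the loss values in context, here is a minimal pure-Python version of sparse categorical cross-entropy: the average negative log-probability assigned to the correct next character. The function is a sketch for illustration, not the Keras implementation, and the target indices are arbitrary.

```python
import math

def toy_sparse_categorical_crossentropy(probabilities, target_ids):
    """Mean of -log(p[target]) over all positions."""
    losses = [-math.log(p[t]) for p, t in zip(probabilities, target_ids)]
    return sum(losses) / len(losses)

toy_vocab = 67

# A model that knows nothing spreads its probability evenly over all entries.
uniform = [1.0 / toy_vocab] * toy_vocab
baseline_loss = toy_sparse_categorical_crossentropy([uniform, uniform], [3, 42])

# A model that puts 90% of its probability on the correct character.
confident = [0.1 / (toy_vocab - 1)] * toy_vocab
confident[3] = 0.9
confident_loss = toy_sparse_categorical_crossentropy([confident], [3])

print(round(baseline_loss, 4), round(confident_loss, 4))
```

The uniform baseline equals ln(67), about 4.2, so the first-epoch loss of 3.07 in the log above is already better than guessing uniformly, and the later epochs move well below it.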

Now let's see what this model produces as output. To make this work, we need to implement the feedback loop for autoregressive text generation: each predicted character is fed back into the model as the next input.

In [None]:
# We need two input layers for our generative model.
# One input layer receives a single character token (the last prediction).
# The second input layer receives the hidden state, i.e. the embedded representation of the sequence as of the last prediction.
# We are going to update the hidden state and carry it forward into each subsequent prediction.
# This is what an RNN does internally, but here we do it manually, outside of the model.

# Note that our generation model takes only 1 token as input.
inputs = keras.Input(shape=(1,), dtype="int", name="integer_seq_input")

# It also takes the 1,024-dimensional hidden state of the sequence (up to the point of the current prediction) as additional input.
input_state = keras.Input(shape=(1024,), name="state")

# The last token is passed through our embedding layer.
x = layers.Embedding(vocabulary_size, 256)(inputs)

# The 1,024-dimensional state is passed into the GRU as its initial state, along with the embedding of the last token we produced.
# Note that this GRU does NOT return sequences: it returns only the output at the final step of its input (a single token here),
# rather than an output for every input token as in our training model. With return_state=True it also returns the updated state.
x, output_state = layers.GRU(1024, return_state=True)(x, initial_state=input_state)

# We predict the next token.
outputs = layers.Dense(vocabulary_size, activation="softmax")(x)

# And our model will return i) the next token prediction, as well as ii) the current 1,024 dimensional representation of the sequence at the point of the new prediction.
generation_model = keras.Model(
    inputs=(inputs, input_state),
    outputs=(outputs, output_state),
)

# set_weights() takes the flat list of weight arrays returned by get_weights() and assigns them, in order,
# to the weights of the new model. There is no matching by name or by layer type: the number of arrays and
# the shape of each array must line up exactly, otherwise Keras raises an error.
# That works here because both models contain an Embedding, a GRU and a Dense layer in the same order
# and with the same sizes, and the Dropout layer in the training model has no weights.

# So the Embedding, GRU and Dense layers in our generative model now all carry the trained weights.
generation_model.set_weights(model.get_weights())
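
The positional pairing of weights can be pictured with the shapes alone. The helper below is hypothetical and only mimics the order-and-shape check; the shapes assume the layer sizes used in this notebook and the default GRU configuration (`reset_after=True`).

```python
def copy_weights_in_order(source_shapes, target_shapes):
    """Pair weights by position; fail if counts or shapes disagree."""
    if len(source_shapes) != len(target_shapes):
        raise ValueError("The two models have a different number of weight arrays.")
    for position, (src, dst) in enumerate(zip(source_shapes, target_shapes)):
        if src != dst:
            raise ValueError(f"Shape mismatch at position {position}: {src} vs {dst}")
    return list(zip(source_shapes, target_shapes))

training_shapes = [
    (67, 256),     # Embedding table
    (256, 3072),   # GRU input kernel (3 gates x 1,024 units)
    (1024, 3072),  # GRU recurrent kernel
    (2, 3072),     # GRU biases (assuming reset_after=True)
    # The Dropout layer contributes nothing to this list.
    (1024, 67),    # Dense kernel
    (67,),         # Dense bias
]

# The generation model has the same weighted layers, in the same order.
generation_shapes = list(training_shapes)

pairs = copy_weights_in_order(training_shapes, generation_shapes)
print(f"Copied {len(pairs)} weight arrays in order.")
```

If the generation model used a different embedding size or number of GRU units, the positional check would fail rather than silently skip the mismatched layer.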

Prepare data and dictionaries to work with our generative model...

In [None]:
tokens = tokenizer.get_vocabulary()
token_ids = range(vocabulary_size)

# We build dictionaries to go back and forth between integer indices and readable characters...
char_to_id = dict(zip(tokens, token_ids))
id_to_char = dict(zip(token_ids, tokens))

# Our starting prompt for the language model...
# We will pass these tokens through our generative model to 'burn in' a starting value for the hidden state, so the model can then produce something meaningful as output.
prompt = """
KING RICHARD III:
"""

Now we 'burn in' the generative model by feeding it the prompt, so that it builds up a meaningful 'state' value to start generating from...

In [None]:
import numpy as np

input_ids = [char_to_id[c] for c in prompt]

# We start with a state of all zeros and update it by passing the prompt through the model one character at a time.
state = keras.ops.zeros(shape=(1, 1024))

for token_id in input_ids:
    inputs = keras.ops.expand_dims([token_id], axis=0)
    predictions, state = generation_model((inputs, state))

    # At each token we print what the model predicts next; it makes mistakes during the burn-in period,
    # because the state is still adjusting to the prompt.
    print(id_to_char[np.argmax(predictions)])

    # Make sure the state has shape (1, 1024), batch dimension included, before it is fed back in.
    # reshape() is used so this works whether or not the returned state kept its batch dimension.
    state = keras.ops.reshape(state, (1, 1024))

T
I
N
G
 
H
I
C
H
A
R
D
 
I
I
:
:


S
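
The reason the burn-in works is that feeding tokens one at a time while carrying the state forward gives the same final state as running the whole sequence through the RNN at once. The sketch below shows this with a one-unit toy RNN in pure Python; the weights and inputs are arbitrary made-up values and the cell is a plain tanh RNN, not a GRU.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8, b=0.1):
    """One step of a toy RNN cell with a scalar hidden state."""
    return math.tanh(w_x * x + w_h * h + b)

def run_sequence(xs, h=0.0):
    """Run the cell over a whole sequence and return the final state."""
    for x in xs:
        h = rnn_step(x, h)
    return h

sequence = [1.0, -2.0, 0.5, 3.0]  # arbitrary inputs

# Option 1: process the whole sequence in a single call.
full_pass_state = run_sequence(sequence)

# Option 2: process one input per call, carrying the state between calls.
carried_state = 0.0
for x in sequence:
    carried_state = run_sequence([x], h=carried_state)

print(full_pass_state, carried_state)
```

Both options end in the same state, which is why the generation model can take a single token plus the previous state and still behave like the training model did on full sequences.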


Now we can use the 'calibrated' model to produce subsequent text.

In [None]:
import numpy as np

# We will append our autoregressive predictions into a list, one at a time.
generated_ids = []

# Let's produce 250 tokens of output.
max_length = 250

for i in range(max_length):
    # Greedy decoding: always pick the most probable next character.
    next_char = int(np.array(keras.ops.argmax(predictions, axis=-1)[0]))
    generated_ids.append(next_char)
    inputs = keras.ops.expand_dims([next_char], axis=0)

    # Reshape the state so it has the (1, 1024) shape that the state input layer expects.
    state = keras.ops.reshape(state, (1, 1024))
    predictions, state = generation_model((inputs, state))

# We will now join all the predicted characters together and add them to the original prompt, printing the result out.
output = "".join([id_to_char[token_id] for token_id in generated_ids])
print(prompt + output)


KING RICHARD III:
'll play the morning of this presence
Master of a man and husbandry.

AUFIDIUS:
I know you well.

LADY ANNE:
What thou art too heart?
O, the part is well as false as dangerous too:
Thou art too hate him here as leave it is and least
As they are but o
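
The feedback loop itself can be shown without any neural network. In the sketch below, a hypothetical lookup table of next-character probabilities stands in for the model, and each greedy prediction becomes the next input, as in the generation cell above.

```python
# Hypothetical next-character probabilities, keyed by the previous character.
toy_model = {
    "a": {"b": 0.7, "c": 0.3},
    "b": {"c": 0.6, "a": 0.4},
    "c": {"a": 0.9, "b": 0.1},
}

def generate_greedy(start_char, length):
    """Repeatedly pick the most probable next character and feed it back in."""
    generated = []
    current = start_char
    for _ in range(length):
        probabilities = toy_model[current]
        current = max(probabilities, key=probabilities.get)  # argmax
        generated.append(current)
    return "".join(generated)

toy_output = generate_greedy("a", 6)
print(toy_output)
```

Because greedy decoding is deterministic, the same prompt always yields the same continuation, and a small model like this toy table quickly falls into a repeating cycle.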
