## Generating Shakespeare Text with a Character-based RNN

Now we'll use an RNN to generate text in the style of Shakespeare.

First, we download the Shakespeare data:

In [30]:
import tensorflow as tf
from tensorflow import keras
import pathlib

In [3]:
shakespeare_url = "https://homl.info/shakespeare"
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()
print(len(shakespeare_text))

1115394


Then we need to encode this text as a sequence of integer IDs, using a character-level encoding:

In [6]:
# split based on characters
text_vec_layer = tf.keras.layers.TextVectorization(split="character", standardize="lower")
# fit the text vectorizer on our text
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]
print(f"Shakespeare characters:\n{shakespeare_text[:10]}")
print(f"Encoded text:\n{encoded[:10]}")
print(f"Note the 'i's, represented by 7's in the 2nd, 8th and 10th positions")


Shakespeare characters:
First Citi
Encoded text:
[21  7 10  9  4  2 20  7  4  7]
Note the 'i's, represented by 7's in the 2nd, 8th and 10th positions
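
To make the encoding step concrete, here is a minimal plain-Python sketch of a character-level vocabulary. It is not how `TextVectorization` is implemented; it only mimics the behaviour visible above (lowercasing, one ID per character, more frequent characters getting smaller IDs, with IDs 0 and 1 reserved for padding and unknown characters). The helper names `build_char_vocab`, `encode` and `decode` are hypothetical, and the alphabetical tie-break is an assumption made for determinism.

```python
from collections import Counter


def build_char_vocab(text):
    """Build a vocabulary list: index 0 is padding, index 1 is unknown."""
    counts = Counter(text.lower())
    # most frequent characters first; ties broken alphabetically (assumption)
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return ["", "[UNK]"] + chars


def encode(text, vocab):
    """Map each character to its ID, falling back to 1 for unknown characters."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index.get(ch, 1) for ch in text.lower()]


def decode(ids, vocab):
    """Map IDs back to characters."""
    return "".join(vocab[i] for i in ids)


vocab = build_char_vocab("First Citizen")
ids = encode("First", vocab)
print(decode(ids, vocab))  # prints "first"
```

A decoder like this is handy for checking that an encoded window still reads as the original (lowercased) text.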


Now we need a function that generates a `tf.data.Dataset` from the input sequence of integers.

We need to:

* slide a window of `length + 1` characters over the sequence
* flatten each window into a single tensor
* shuffle the windows (this is safe even though it's a sequence forecasting task, because each window already contains both its inputs and its targets, so the data is already in X-y form)
* group the shuffled windows into batches
* split each batch into (X, y) pairs, where y is X shifted one character ahead

In [23]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32, prefetch=True):
    """
    Given a sequence of tensors, returns a dataset of windows of that sequence.
    """
    # create a tf.data.Dataset from a Tensor sequence
    dataset = tf.data.Dataset.from_tensor_slices(sequence)
    # create a window iterator over the tensors - note that this is length + 1, so it'll include our label
    dataset = dataset.window(length + 1, shift=1, drop_remainder=True)
    # batch each sample 
    dataset = dataset.flat_map(lambda window: window.batch(length + 1))
    if shuffle:
        dataset = dataset.shuffle(buffer_size=100_000, seed=seed)
    dataset = dataset.batch(batch_size)
    # split each batch into X and y, then optionally activate prefetching
    dataset = dataset.map(lambda window: (window[:, :-1], window[:, 1:]))
    return dataset.prefetch(1) if prefetch else dataset

# set the window length
length = 100 
# set a random seed
seed = 42
tf.random.set_seed(seed)
# generate iterables from the encoded text
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True, seed=seed)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)
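
As a sanity check on the windowing logic, the same idea can be sketched in plain Python on a toy sequence. This is only an illustration of what `window(length + 1, shift=1, drop_remainder=True)` followed by the X/y split produces, without shuffling or batching; `sliding_windows` is a hypothetical helper, not part of the pipeline above.

```python
def sliding_windows(sequence, length):
    """Return (X, y) pairs where y is X shifted one step ahead."""
    pairs = []
    # a sequence of N items yields N - length full windows of length + 1 items
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs


for x, y in sliding_windows(list("hello!"), length=3):
    print(x, "->", y)
```

Each target sequence is the input shifted by one character, so the model is trained to predict the next character at every position of the window, not just at the end.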

Now we can build and train a character-based RNN model:

In [29]:
%%time

n_tokens = text_vec_layer.vocabulary_size()
path_to_saved_shakespeare_character_rnn = pathlib.Path("shakespeare_model")

if path_to_saved_shakespeare_character_rnn.exists():
    model = keras.models.load_model(path_to_saved_shakespeare_character_rnn)
else:
    model = tf.keras.Sequential([
        # encode integers to dense vectors
        tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
        tf.keras.layers.GRU(128, return_sequences=True),
        # we need n_tokens units - this has to match your vocab size!
        tf.keras.layers.Dense(n_tokens, activation="softmax")
    ])
    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer="nadam",
        metrics=["accuracy"]
    )
    model_ckpt = tf.keras.callbacks.ModelCheckpoint(
        "shakespeare_model",
        monitor="val_accuracy",
        save_best_only=True,
    )
    history = model.fit(
        train_set,
        validation_data=valid_set,
        epochs=2,
        callbacks=[model_ckpt]
    )
    model.save(path_to_saved_shakespeare_character_rnn)

Epoch 1/2



  31247/Unknown - 756s 24ms/step - loss: 1.4022 - accuracy: 0.5712

INFO:tensorflow:Assets written to: shakespeare_model/assets


Epoch 2/2


INFO:tensorflow:Assets written to: shakespeare_model/assets


CPU times: user 23min 52s, sys: 8min 15s, total: 32min 7s
Wall time: 25min 37s


Now we take that model and wrap it in another model that also performs the text vectorization (character encoding) step, so that we can pass raw strings when making predictions:

In [32]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    model
])

And finally, we can test the model by predicting the next **letter** of a sequence:

In [34]:
example_sentence = "To be or not to b"
prediction_probabilities = shakespeare_model.predict([example_sentence])[0, -1]
prediction = tf.argmax(prediction_probabilities)
print(f"Input sentence: {example_sentence}")
print(f"Predicted character: {text_vec_layer.get_vocabulary()[prediction]}")

Input sentence: To be or not to b
Predicted character: e


As you can see, we've correctly predicted the final `e` on "To be or not to be".

Now, let's try generating some Shakespearean text:

In [46]:
def predict_next_character(text, temperature=1):
    probabilities = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(probabilities) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id]

def generate_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += predict_next_character(text, temperature)
    return text

In [47]:
tf.random.set_seed(42)

print(generate_text("To be or not to be", temperature=0.01))




To be or not to be a strange,
and the duke is a man and the state,
a
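
The effect of `temperature` is easier to see on a small hand-made distribution. The sketch below is plain Python and does not use the model: `apply_temperature` is a hypothetical helper that divides the log-probabilities by the temperature and renormalizes them with a softmax, which is the distribution `tf.random.categorical` samples from when given `log(probabilities) / temperature`. The example probabilities are made up for illustration.

```python
import math


def apply_temperature(probabilities, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    logits = [math.log(p) / temperature for p in probabilities]
    # subtract the max before exponentiating for numerical stability
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = [0.5, 0.3, 0.2]  # made-up next-character probabilities
for t in (0.1, 1.0, 10.0):
    rescaled = apply_temperature(probs, t)
    print(t, [round(p, 3) for p in rescaled])
```

Temperatures close to 0 put almost all of the probability on the most likely character, so sampling becomes nearly deterministic, while high temperatures flatten the distribution toward uniform and make the generated text more random.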


## Text Classification with a Transformer

In this section, we'll implement a custom Transformer block in Keras, and use it to classify text.

In [38]:
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
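
The `MultiHeadAttention` layer inside `TransformerBlock` is built on scaled dot-product attention. As a rough illustration only, here is a single-head version in plain Python, without the learned projections or multiple heads that the Keras layer adds; the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # weighted average of the value vectors
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# self-attention: queries, keys and values all come from the same sequence
x = [[1.0, 0.0], [0.0, 1.0]]
print(scaled_dot_product_attention(x, x, x))
```

In `TransformerBlock.call`, `self.att(inputs, inputs)` plays the same role: every position attends to every other position of the same sequence.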

Now, we can download the dataset (IMDB movie reviews):

In [36]:
vocab_size = 20000  # Only consider the top 20k words
maxlen = 200  # Pad/truncate each review to 200 tokens (by default, the last 200 are kept)
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = keras.utils.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.utils.pad_sequences(x_val, maxlen=maxlen)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 Training sequences
25000 Validation sequences
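
By default, `pad_sequences` pads and truncates at the start of each sequence (`padding="pre"`, `truncating="pre"`), so reviews longer than `maxlen` keep their last `maxlen` tokens. A minimal plain-Python sketch of that default behaviour for a single sequence, using a hypothetical helper `pad_pre` and assuming `maxlen` is positive:

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic pad_sequences defaults for one sequence: pre-truncate, then pre-pad."""
    # keep only the last maxlen items
    truncated = sequence[-maxlen:]
    # left-pad with the padding value up to maxlen
    return [value] * (maxlen - len(truncated)) + truncated


print(pad_pre([1, 2, 3], maxlen=5))           # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

Passing `truncating="post"` to `pad_sequences` would keep the first `maxlen` tokens instead.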


In [39]:
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)

In [40]:
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val)
)

Epoch 1/2



Epoch 2/2


## Text Classification with Hugging Face and DistilBERT

In this section, we'll use the popular Hugging Face libraries to download a trained version of DistilBERT, and perform text classification.

In [41]:
from transformers import pipeline

With Hugging Face, downloading a pretrained model and running a task with it takes only a few lines:

In [43]:
model_name = "huggingface/distilbert-base-uncased-finetuned-mnli"
classifier = pipeline("sentiment-analysis", model=model_name)
result = classifier("She loves me. She loves me not.")
print(result)

[{'label': 'contradiction', 'score': 0.984832227230072}]


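In this example, the model labels the input as `contradiction` with a score of about 0.985 (98.48%). Note that this checkpoint is fine-tuned on MNLI, so its labels are natural language inference classes (entailment, neutral, contradiction) rather than sentiment classes, even though the pipeline task is named `"sentiment-analysis"`.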