# Generating Text with Neural Networks


This program is designed to train a model that can generate text in the style of Shakespeare.

# Getting the Data

In [1]:
import tensorflow as tf

shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()


Downloading data from https://homl.info/shakespeare


This cell imports the TensorFlow library for machine learning. The next line defines the URL of the text I want to work with. The third line downloads the file with `tf.keras.utils.get_file()`, which saves it locally and returns its path, and the last two lines open the file and read its contents into `shakespeare_text`.

In [2]:
print(shakespeare_text[:80]) # not relevant to machine learning but relevant to exploring the data

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


This prints the first 80 characters of the `shakespeare_text` string.

# Preparing the Data

In [3]:
text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]


In [4]:
print(text_vec_layer([shakespeare_text]))

tf.Tensor([[21  7 10 ... 22 28 12]], shape=(1, 1115394), dtype=int64)


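Above, the text is converted to numbers so that it can be fed to the ML model. With `split="character"` every character becomes one token, and `standardize="lower"` lowercases the text first. The result is a single sequence of 1,115,394 integer character IDs.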
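To make the character-to-number step more concrete, here is a minimal sketch in plain Python that mimics the idea rather than the actual `TextVectorization` implementation. The helper names `build_vocab` and `encode` are hypothetical, and the alphabetical tie-breaking between equally frequent characters is an assumption that may differ from what the layer does.

```python
from collections import Counter

def build_vocab(text):
    """Return two reserved slots followed by characters from most to least frequent."""
    counts = Counter(text.lower())
    # Assumption: ties are broken alphabetically so the result is deterministic.
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return ["", "[UNK]"] + chars  # ID 0 = padding, ID 1 = unknown

def encode(text, vocab):
    """Map each lowercased character to its ID; unseen characters get ID 1."""
    char_to_id = {c: i for i, c in enumerate(vocab)}
    return [char_to_id.get(c, 1) for c in text.lower()]

sample = "Speak, speak."
vocab = build_vocab(sample)
ids = encode(sample, vocab)
shifted = [i - 2 for i in ids]  # same shift as `encoded -= 2` in the notebook
print(ids)
print(shifted)
```

Because the two reserved IDs never occur for characters that were seen while building the vocabulary, subtracting 2 leaves IDs that start at 0.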

In [5]:
encoded -= 2  # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394

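This shifts every character ID down by 2. The `TextVectorization` layer reserves ID 0 for padding and ID 1 for unknown characters, and neither occurs in this data, so after the shift the IDs run from 0 to 38. `n_tokens` is the number of distinct characters (39) and `dataset_size` is the total number of characters (1,115,394).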

In [6]:
print(n_tokens, dataset_size)

39 1115394


In [7]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

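This function prepares the data for training. It slides a window of `length + 1` characters over the sequence one character at a time, optionally shuffles the windows, and groups them into batches. Each window is then split into an input (every character except the last) and a target (every character except the first), so the model learns to predict the next character at each position. The model's structure is outlined later, in the model building stage, where the layers are defined.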
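The windowing logic can be illustrated without TensorFlow. The following is a small sketch, not the `tf.data` pipeline itself; `make_windows` is a hypothetical helper, and shuffling and batching are left out.

```python
def make_windows(sequence, length):
    """Return (inputs, targets) pairs from every window of length + 1 items."""
    pairs = []
    # Shift of 1, incomplete windows dropped: len(sequence) - length windows in total.
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))  # targets are inputs shifted by one
    return pairs

for inputs, targets in make_windows(list(range(6)), length=3):
    print(inputs, "->", targets)
# [0, 1, 2] -> [1, 2, 3]
# [1, 2, 3] -> [2, 3, 4]
# [2, 3, 4] -> [3, 4, 5]
```

Each target sequence is the input sequence moved one step ahead, which is what `window[:, :-1]` and `window[:, 1:]` do for a whole batch in `to_dataset()`.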

In [8]:
length = 100
tf.random.set_seed(42)
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True,
                       seed=42)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)


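This cell splits the encoded text into three parts. `train_set` uses the first 1,000,000 characters (roughly 90% of the text) and is the data the model learns from; it is the only set that is shuffled. `valid_set` uses the next 60,000 characters and is used to monitor the model's effectiveness during training. `test_set` uses the remaining characters and is held back for the evaluation stage, to assess how well the model works.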
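As a rough check of the split sizes, the sketch below counts how many batches each set should contain. The helper name `n_batches` is hypothetical. The formula assumes what `to_dataset()` specifies: windows shifted by one character, incomplete windows dropped, and the final partial batch kept.

```python
import math

def n_batches(n_chars, length=100, batch_size=32):
    """Number of batches produced from n_chars characters."""
    n_windows = n_chars - length  # one window per start position that fits
    return math.ceil(n_windows / batch_size)  # the last partial batch is kept

dataset_size = 1_115_394
train_batches = n_batches(1_000_000)
valid_batches = n_batches(60_000)
test_batches = n_batches(dataset_size - 1_060_000)
print(train_batches, valid_batches, test_batches)  # 31247 1872 1728
```

The training figure of 31,247 matches the number of steps per epoch shown in the training progress output.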

# Building and Training the Model

In [None]:
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_shakespeare_model", monitor="val_accuracy", save_best_only=True)
history = model.fit(train_set, validation_data=valid_set, epochs=10,
                    callbacks=[model_ckpt])

Epoch 1/10
  31247/Unknown - 3300s 105ms/step - loss: 1.4025 - accuracy: 0.5709
INFO:tensorflow:Assets written to: my_shakespeare_model/assets

Epoch 2/10
    1/31247 [..............................] - ETA: 126:27:21 - loss: 1.7032 - accuracy: 0.4906
INFO:tensorflow:Assets written to: my_shakespeare_model/assets

Epoch 3/10
    1/31247 [..............................] - ETA: 183:28:19 - loss: 1.6285 - accuracy: 0.5078
INFO:tensorflow:Assets written to: my_shakespeare_model/assets

Epoch 4/10
    1/31247 [..............................] - ETA: 128:34:14 - loss: 1.6259 - accuracy: 0.5172
INFO:tensorflow:Assets written to: my_shakespeare_model/assets

Epoch 5/10
    1/31247 [..............................] - ETA: 226:41:18 - loss: 1.6166 - accuracy: 0.5013
  616/31247 [..............................] - ETA: 54:39 - loss: 1.3273 - accuracy: 0.5866

In [None]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

# Generating Text

In [None]:
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]

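This uses the already trained model to predict the character that follows "To be or not to b". `predict()` returns a probability for every character at every position in the input; `[0, -1]` selects the probabilities for the last position, `tf.argmax()` picks the ID of the most probable character, and adding 2 maps that ID back to the layer's vocabulary, which still includes the two reserved tokens.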

In [None]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples

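Here `tf.random.categorical()` draws random class IDs from a set of probabilities. The function expects log-probabilities (logits) rather than plain probabilities, which is why `tf.math.log()` is applied first. In this example it draws 8 samples from three classes with probabilities of 50%, 40%, and 10%, so class 0 should come up most often and class 2 least often. Sampling like this, instead of always taking the most probable character, makes the generated text more varied. This is a confusing step, especially for those without a mathematical background.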
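The sampling idea can be tried out with the standard library alone. This sketch uses `random.choices` as a stand-in for `tf.random.categorical`; it takes plain probabilities as weights instead of log-probabilities, and the sample size of 10,000 is an arbitrary choice for illustration.

```python
import random
from collections import Counter

probas = [0.5, 0.4, 0.1]
rng = random.Random(42)  # seeded so the draws are repeatable
samples = rng.choices(range(len(probas)), weights=probas, k=10_000)
counts = Counter(samples)
for class_id in sorted(counts):
    # The observed share of each class should be close to its probability.
    print(class_id, counts[class_id] / len(samples))
```

With enough samples the observed shares approach 50%, 40%, and 10%, whereas a draw of only 8 samples can deviate noticeably from them.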

In [None]:
def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]

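This function picks the next character for a given text by sampling rather than by always taking the most probable character. It takes the predicted probabilities for the last position, converts them to log-probabilities, divides them by `temperature`, and draws one character ID with `tf.random.categorical()`. A temperature close to 0 favors the most probable characters, while a high temperature makes all characters almost equally likely.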
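The effect of the temperature can be seen by computing the rescaled probabilities directly. The sketch below assumes that sampling from logits is equivalent to sampling from their softmax; `apply_temperature` is a hypothetical helper and is not part of the notebook.

```python
import math

def apply_temperature(probas, temperature):
    """Return softmax(log(p) / temperature) for a list of probabilities."""
    logits = [math.log(p) / temperature for p in probas]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.5, 0.4, 0.1]
for t in (0.01, 1, 100):
    print(t, [round(p, 3) for p in apply_temperature(probas, t)])
```

At a temperature of 1 the probabilities are unchanged, at 0.01 almost all of the probability moves to the most likely class, and at 100 the three classes become nearly equally likely.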

In [None]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

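This function produces longer outputs by calling `next_char()` repeatedly. Each generated character is appended to the text, so it becomes part of the input for the next prediction. By default it adds 50 characters.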
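The feedback loop can be shown in isolation with a toy stand-in for the model. Both `toy_next_char` and `extend_text_with` are hypothetical names; the toy function simply cycles through the letters a, b, and c, and is only there to show how each output feeds the next step.

```python
def toy_next_char(text):
    # Hypothetical stand-in for the trained model: looks only at the last character.
    cycle = {"a": "b", "b": "c", "c": "a"}
    return cycle.get(text[-1], "a")

def extend_text_with(next_char_fn, text, n_chars=5):
    """Append n_chars characters, feeding the growing text back in each time."""
    for _ in range(n_chars):
        text += next_char_fn(text)
    return text

result = extend_text_with(toy_next_char, "a", n_chars=5)
print(result)  # abcabc
```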

In [None]:
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU

Setting the random seed makes the sampling repeatable, so the same text is generated on every run. This makes it easier to evaluate the output and to fix any issues.

In [None]:
print(extend_text("To be or not to be", temperature=0.01))

In [None]:
print(extend_text("To be or not to be", temperature=1))

In [None]:
print(extend_text("To be or not to be", temperature=100))

It could be interesting to use this model for different languages and historical documents. This could serve educational purposes, such as learning about a different culture or time period. To alter the model, the `shakespeare_text` dataset would need to be replaced with another dataset in a similar plain-text format. Ethical concerns include misuse of the model to produce something harmful or culturally inappropriate. There are also copyright concerns around the use of different datasets.