# Lab | Text Generation from Shakespeare's Sonnets

This notebook explores text generation with a deep learning model trained on Shakespeare's sonnets.

The objective is to create a neural network capable of generating text sequences that mimic the style and language of Shakespeare.

By utilizing a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) layers, this project aims to demonstrate how a model can learn and replicate the complex patterns of early modern English. 

The dataset used consists of Shakespeare's sonnets, which are preprocessed and tokenized to serve as input for the model.

Throughout this notebook, you will see the steps taken to prepare the data, build and train the model, and evaluate its performance in generating text. 

This lab provides a hands-on approach to understanding the intricacies of natural language processing (NLP) and the potential of machine learning in creative text generation.

Let's import the necessary libraries.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers

import tensorflow.keras.utils as ku 
import numpy as np

Let's get the data!

In [None]:
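import os
import requests

# Download the sonnets once; skip the request if the file is already present
url = 'https://raw.githubusercontent.com/martin-gorner/tensorflow-rnn-shakespeare/master/shakespeare/sonnets.txt'
if not os.path.exists('sonnets.txt'):
    resp = requests.get(url)
    resp.raise_for_status()
    with open('sonnets.txt', 'wb') as f:
        f.write(resp.content)

# Read the whole file and close the handle afterwards
with open('sonnets.txt') as f:
    data = f.read()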

corpus = data.lower().split("\n")

Step 1: Initialise a tokenizer and fit it on the corpus variable using .fit_on_texts
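
To make the word-to-integer mapping less opaque, here is a rough pure-Python sketch of what fitting a tokenizer amounts to conceptually: count the words, then number them from 1 in order of decreasing frequency. The helper `build_word_index` and the toy corpus are hypothetical, and the punctuation handling is a simplification, so treat this as an illustration of the idea rather than the Keras implementation.

```python
from collections import Counter
import string

def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    # Simplification: turn ASCII punctuation (except apostrophes) into spaces
    punct = string.punctuation.replace("'", "")
    to_space = str.maketrans(punct, " " * len(punct))

    counts = Counter()
    for line in lines:
        counts.update(line.lower().translate(to_space).split())

    # Index 0 is deliberately left unused so it can act as the padding value
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_corpus = [
    "from fairest creatures we desire increase",
    "that thereby beauty's rose might never die",
    "we desire the rose",
]
word_index = build_word_index(toy_corpus)
print(word_index)
```

Words that occur most often receive the smallest indices, and no word is ever mapped to 0.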

In [None]:
# Your code here :

# Instantiate the Tokenizer
tokenizer = Tokenizer()

# Fit the tokenizer on the corpus
tokenizer.fit_on_texts(corpus)

# Print the word index to see how words are mapped to integers
print(tokenizer.word_index)

Step 2: Calculate the Vocabulary Size

Let's figure out how many unique words are in your corpus. This will be the size of your vocabulary.

Calculate the length of tokenizer.word_index, add 1 to it (word indices start at 1 because index 0 is reserved for padding) and store it in a variable called total_words.

In [None]:
# Your code here :
total_words = len(tokenizer.word_index) + 1

print(len(tokenizer.word_index))
print(total_words)


Create an empty list called input_sequences.

For each sentence in your corpus, convert the text into a sequence of integers using the tokenizer.
Then generate n-gram sequences from these tokens: every prefix of the line that contains at least two tokens.

Store the result in the list input_sequences.
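
As a small illustration of the prefix idea, the sketch below expands one tokenized line into its training sequences. The function name `prefix_sequences` is hypothetical, and the token ids are only sample values; in each resulting sequence the last token is the word to predict and the tokens before it are the context.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

tokens = [3, 2, 313, 1375, 4]  # sample token ids for one line
for seq in prefix_sequences(tokens):
    # Context on the left, word to predict on the right
    print(seq[:-1], "->", seq[-1])
```

A line with fewer than two tokens (for example an empty line) contributes no sequences at all.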

In [None]:
# Your code here :
"""input_sequences = []

# Define the n-gram size
n_gram_size = 3

for sentence in corpus:
    # Convert the sentence to a sequence of integers
    sequence = tokenizer.texts_to_sequences([sentence])[0]

    #print(sentence, sequence)

    # Generate n-gram sequences
    n_grams = []
    sequence_length = len(sequence)

    for i in range(2, sequence_length + 1):
        n_grams.append(sequence[0:i])
    
    # Store the n-gram sequences in the input_sequences list
    input_sequences.extend(n_grams)
"""



input_sequences = []

for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]

    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i + 1]
        input_sequences.append(n_gram_sequence)



# Print the first few sequences for verification (the full list is very long)
print(input_sequences[:5])

# The first sequences should look like this:
# [3, 2]
# [3, 2, 313]
# [3, 2, 313, 1375]
# [3, 2, 313, 1375, 4]

Calculate the length of the longest sequence in input_sequences. Assign the result to a variable called max_sequence_len.

Now pad the sequences using pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre').
Convert it to a numpy array and assign the result back to our variable called input_sequences.
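
To see what pre-padding does, here is a minimal pure-Python sketch. The helper `pad_pre` is hypothetical and only mimics the behaviour needed here (left-padding with zeros, keeping the last `maxlen` items of longer sequences, and assuming `maxlen` is at least 1); it is not the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to length `maxlen` (maxlen >= 1)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the most recent tokens if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

sequences = [[3, 2], [3, 2, 313], [3, 2, 313, 1375]]
for row in pad_pre(sequences, maxlen=4):
    print(row)
```

Padding on the left keeps the word to predict in the last column of every row, which is what makes the later split into predictors and labels a simple slice.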

In [None]:
# Your code here :

# Calculate the length of the longest sequence in input_sequences. Assign the result to a variable called max_sequence_len.
max_sequence_len = max([len(i) for i in input_sequences])
print(max_sequence_len)

# Now pad the sequences using pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre').
input_sequences = pad_sequences(input_sequences, maxlen = max_sequence_len, padding = 'pre') # Shape (15484, 11)

# Convert it to a numpy array and assign the result back to our variable called input_sequences
# (pad_sequences already returns a numpy array, so this mainly makes the type explicit).
input_sequences = np.array(input_sequences)
print(input_sequences, input_sequences.shape) # Shape (15484, 11)



Prepare Predictors and Labels

Split the sequences into two parts:

- Predictors: All elements from input_sequences except the last one.
- Labels: The last element of each sequence in input_sequences.

In [None]:
# Your code here :

predictors = input_sequences[:, :-1]
labels = input_sequences[:, -1]


print(input_sequences[:10])
print(predictors)
print(labels)

# Sanity check on the first padded sequence:
# input_sequences[0]       -> [0 0 0 0 0 0 0 0 0 3 2]
# input_sequences[0][0]    -> 0
# input_sequences[0][:-1]  -> [0 0 0 0 0 0 0 0 0 3]   (predictors)
# input_sequences[0][-1]   -> 2                       (label)


One-Hot Encode the Labels :

Convert the labels (which are integers) into one-hot encoded vectors. 

Ensure the length of these vectors equals total_words, that is, the number of unique words in your vocabulary plus one for the reserved index 0.

Use ku.to_categorical() on labels with num_classes = total_words

Assign the result back to our variable labels.
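
The following sketch shows one-hot encoding by hand and why the vector length has to be the largest word index plus one. The helper `one_hot` and the tiny three-word vocabulary are hypothetical and only serve as an illustration.

```python
def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

# Toy vocabulary with 3 words, indexed 1..3 (index 0 stays free for padding)
total_words_toy = 3 + 1
toy_labels = [2, 3, 1]
encoded = [one_hot(label, total_words_toy) for label in toy_labels]
for label, vec in zip(toy_labels, encoded):
    print(label, vec)
```

With only 3 classes, the label 3 would fall outside the vector, which is why the extra slot for index 0 is needed.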

In [None]:
# Your code here :
labels = ku.to_categorical(labels, num_classes = total_words)

print(labels, total_words)

# Initialize the Model

Start by creating a Sequential model.

Add Layers to the Model:

Embedding Layer: The first layer is an embedding layer. It converts word indices into dense vectors of fixed size (100 in this case). Set the input length to the maximum sequence length minus one, which corresponds to the number of previous words the model will consider when predicting the next word.

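Bidirectional LSTM Layer: Add a Bidirectional LSTM layer with 150 units. This layer reads the input sequence in both directions, so each position gets context from the words before and after it. Set return_sequences=True so that the layer outputs the full sequence, which the following LSTM layer needs as input.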

Dropout Layer: Add a dropout layer with a rate of 0.2 to prevent overfitting by randomly setting 20% of the input units to 0 during training.

LSTM Layer: Add a second LSTM layer with 100 units. This layer processes the whole sequence and passes only its final output vector to the next layer.

Dense Layer (Intermediate): Add a dense layer with half the total number of words as units, using ReLU activation. A regularization term (L2) is added to prevent overfitting.

Dense Layer (Output): The final dense layer has as many units as there are words in the vocabulary, with a softmax activation function to output a probability distribution over all words.

In [None]:
model = Sequential([

    # Your code here :

    # Embedding Layer
    Embedding(total_words, 100, input_length = max_sequence_len - 1),

    # Bidirectional LSTM Layer
    Bidirectional(LSTM(150, return_sequences=True)),

    # Dropout Layer
    Dropout(rate = 0.2),

    # LSTM Layer
    LSTM(100),

    # Dense Layer (Intermediate)
    Dense(units = (total_words // 2), activation = "relu", kernel_regularizer = regularizers.l2(0.01)),

    # Dense Layer (Output)
    Dense(units = total_words, activation = "softmax")

])

# Compile the Model:

Compile the model using categorical crossentropy as the loss function, the Adam optimizer for efficient training, and accuracy as the metric to evaluate during training.
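
To give a feel for what the loss measures, here is a small hand-written version of categorical crossentropy for a single example. It is a sketch only: the function below and its clipping constant `eps` are assumptions for illustration, not the Keras implementation. With a one-hot target, the loss reduces to the negative log of the probability assigned to the correct word.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    # Clip probabilities away from 0 so that log() is always defined
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]          # the correct word has index 1
confident = [0.05, 0.90, 0.05]    # most probability on the correct word
unsure = [0.40, 0.30, 0.30]       # probability spread over all words

print(categorical_crossentropy(y_true, confident))
print(categorical_crossentropy(y_true, unsure))
```

The confident prediction yields the smaller loss, which is the behaviour the optimizer pushes the model towards during training.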

In [None]:
# Your code here :
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


# Print Model Summary:

Use model.summary() to print a summary of the model, which shows the layers, their output shapes, and the number of parameters.
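
The parameter counts in the summary can be checked by hand. The sketch below does the arithmetic for the architecture described above, assuming standard LSTM layers with one bias vector per gate and dense layers with biases. The vocabulary size of 3000 is a made-up value, so the totals will differ from those of your own model; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dense vector of length `dim` per word index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

def count_params(total_words, embed_dim=100, bi_units=150, lstm_units=100):
    hidden = total_words // 2
    return {
        "embedding": embedding_params(total_words, embed_dim),
        # Two independent LSTMs, one per direction
        "bidirectional_lstm": 2 * lstm_params(embed_dim, bi_units),
        "dropout": 0,
        # Input is the concatenation of both directions
        "lstm": lstm_params(2 * bi_units, lstm_units),
        "dense_hidden": dense_params(lstm_units, hidden),
        "dense_output": dense_params(hidden, total_words),
    }

example = count_params(total_words=3000)  # 3000 is a made-up vocabulary size
for name, n in example.items():
    print(f"{name:>20}: {n:,}")
print(f"{'total':>20}: {sum(example.values()):,}")
```

In this example the output layer accounts for most of the parameters, because its size grows with the square of the vocabulary size.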

In [None]:
# Your code here :
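# Build the model with an explicit input shape so that summary() can report
# output shapes and parameter counts
model.build(input_shape=(None, max_sequence_len - 1))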

model.summary()

# Now train the model for 50 epochs and assign the result to a variable called history.

Training the model with 50 epochs should get you around 40% accuracy.

You can train the model for as many epochs as you like, depending on your time and computing constraints. Ideally, train it for more than 50 epochs.

That way you will get better text generation at the end.

However, don't waste your time.

In [None]:
# Your code here :
history = model.fit(predictors, labels, epochs = 50, verbose = 1)

# Use plt from matplotlib to plot the training accuracy over epochs and the loss over epochs

First you will have to get the accuracy and loss values per epoch. You can read them from the history.history dictionary of the object returned by model.fit.

In [None]:
# Your code here :
import matplotlib.pyplot as plt

# Plot accuracy
plt.plot(history.history['accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()

# Plot loss
plt.plot(history.history['loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()

# Generate text with the model based on a seed text

Now you will create two variables :

- seed_text = 'Write the text you want the model to use as a starting point to generate the next words'
- next_words = number_of_words_you_want_the_model_to_generate

Please replace number_of_words_you_want_the_model_to_generate with an actual integer.

In [None]:
# Your code here :
seed_text = 'What we did after we left was'
next_words = 10 # number_of_words_you_want_the_model_to_generate


Now create a loop that runs based on the next_words variable and generates new text based on your seed_text input string. Print the full text with the generated text at the end.

This time you don't get detailed instructions.

Have fun!

In [None]:
# Your code here :
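for _ in range(next_words):
    # Encode the current text and left-pad it to the model's input length
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen = max_sequence_len - 1, padding = 'pre')

    # Greedy decoding: pick the most probable next word
    predicted = model.predict(token_list, verbose = 0)
    predicted_word_index = int(np.argmax(predicted, axis = -1)[0])

    # Index 0 is the padding index and has no entry in index_word
    predicted_word = tokenizer.index_word.get(predicted_word_index, '')

    seed_text += " " + predicted_word

print(seed_text)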

Experiment with at least 3 different seed_text strings and see what happens!

In [None]:
# Your code here :
seed_text2 = 'Now I know, without a doubt'
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text2])[0]
    token_list = pad_sequences([token_list], maxlen = max_sequence_len - 1, padding = 'pre')

    predicted = model.predict(token_list, verbose = 0)
    predicted_word_index = np.argmax(predicted)

    predicted_word = tokenizer.index_word[predicted_word_index]

    seed_text2 += " " + predicted_word
print(seed_text2)

seed_text3 = 'That table right there is'
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text3])[0]
    token_list = pad_sequences([token_list], maxlen = max_sequence_len - 1, padding = 'pre')

    predicted = model.predict(token_list, verbose = 0)
    predicted_word_index = np.argmax(predicted)

    predicted_word = tokenizer.index_word[predicted_word_index]

    seed_text3 += " " + predicted_word
print(seed_text3)

seed_text4 = 'The Count of Mount'
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text4])[0]
    token_list = pad_sequences([token_list], maxlen = max_sequence_len - 1, padding = 'pre')

    predicted = model.predict(token_list, verbose = 0)
    predicted_word_index = np.argmax(predicted)

    predicted_word = tokenizer.index_word[predicted_word_index]

    seed_text4 += " " + predicted_word
print(seed_text4)
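
The three loops above differ only in the seed string, so the logic could be wrapped in one function. Below is a minimal sketch of that idea. To keep it runnable without TensorFlow, it uses a hypothetical `toy_predict` function as a stand-in for `model.predict`, a hand-made `toy_index` instead of the fitted tokenizer, and plain whitespace splitting instead of the tokenizer's own text handling. All of these names are assumptions made for illustration.

```python
def generate_text(seed_text, next_words, predict_fn, word_index, context_len):
    """Greedy generation: repeatedly append the most probable next word."""
    index_word = {i: w for w, i in word_index.items()}
    text = seed_text
    for _ in range(next_words):
        # Encode known words only, keep the last context_len tokens, left-pad with 0
        tokens = [word_index[w] for w in text.lower().split() if w in word_index]
        tokens = tokens[-context_len:]
        tokens = [0] * (context_len - len(tokens)) + tokens

        probs = predict_fn(tokens)
        best = max(range(len(probs)), key=probs.__getitem__)
        # Index 0 is padding and maps to no word
        text += " " + index_word.get(best, "")
    return text

toy_index = {"the": 1, "rose": 2, "is": 3, "red": 4}

def toy_predict(tokens):
    """Stand-in for model.predict: always favours the index after the last token."""
    probs = [0.0] * (len(toy_index) + 1)
    probs[tokens[-1] % len(toy_index) + 1] = 1.0
    return probs

for seed in ["the", "rose is"]:
    print(generate_text(seed, 3, toy_predict, toy_index, context_len=5))
```

With the real model, `predict_fn` would wrap the call to `model.predict` on the padded tokens, and each seed string could then be tried with a single function call.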