# Text Generation with LSTM

In this exercise, I will build a neural network text generation model using LSTM (Long Short-Term Memory), a type of Recurrent Neural Network (RNN) designed to capture long-range dependencies in sequential data. LSTMs are well suited to sequence tasks because they can retain information over time, handle varying sequence lengths, and use both short- and long-term context. These characteristics make them a common choice for applications such as text generation, speech recognition, and time series prediction.

For text generation, the ability of LSTMs to remember previous words and phrases is crucial for generating coherent and meaningful text, which is the primary goal of this exercise.

In [1]:
# Packages needed for this exercise

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import get_file
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
import string
from tensorflow.keras.callbacks import EarlyStopping
import requests
import re

In [2]:
url = 'https://www.gutenberg.org/files/84/84-0.txt'
response = requests.get(url)

if response.status_code == 200:
    with open('frankenstein.txt', 'w', encoding='utf-8') as f:
        f.write(response.text.lower())
    print("Downloaded 'Frankenstein' successfully!")
else:
    print(f"Failed to download. Status code: {response.status_code}")


Downloaded 'Frankenstein' successfully!


In [3]:
with open('frankenstein.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Print the first 1000 characters to check the content
print(text[:1000])

*** start of the project gutenberg ebook 84 ***



frankenstein;



or, the modern prometheus



by mary wollstonecraft (godwin) shelley





 contents



 letter 1

 letter 2

 letter 3

 letter 4

 chapter 1

 chapter 2

 chapter 3

 chapter 4

 chapter 5

 chapter 6

 chapter 7

 chapter 8

 chapter 9

 chapter 10

 chapter 11

 chapter 12

 chapter 13

 chapter 14

 chapter 15

 chapter 16

 chapter 17

 chapter 18

 chapter 19

 chapter 20

 chapter 21

 chapter 22

 chapter 23

 chapter 24









letter 1



_to mrs. saville, england._





st. petersburgh, dec. 11th, 17—.





you will rejoice to hear that no disaster has accompanied the

commencement of an enterprise which you have regarded with such evil

forebodings. i arrived here yesterday, and my first task is to assure

my dear sister of my welfare and increasing confidence in the success

of my undertaking.



i am already far north of london, and as i walk in the streets of

petersburgh, i feel a cold northern breeze 

In [4]:
# Clean the text: remove unwanted characters, keep only letters and basic punctuation
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters, keeping letters, numbers, spaces, and common punctuation
    text = re.sub(r"[^a-z0-9,.';:?!\s]", '', text)
    return text

# Apply preprocessing
cleaned_text = preprocess_text(text)

# Check the cleaned text
print(cleaned_text[:1000])

 start of the project gutenberg ebook 84 



frankenstein;



or, the modern prometheus



by mary wollstonecraft godwin shelley





 contents



 letter 1

 letter 2

 letter 3

 letter 4

 chapter 1

 chapter 2

 chapter 3

 chapter 4

 chapter 5

 chapter 6

 chapter 7

 chapter 8

 chapter 9

 chapter 10

 chapter 11

 chapter 12

 chapter 13

 chapter 14

 chapter 15

 chapter 16

 chapter 17

 chapter 18

 chapter 19

 chapter 20

 chapter 21

 chapter 22

 chapter 23

 chapter 24









letter 1



to mrs. saville, england.





st. petersburgh, dec. 11th, 17.





you will rejoice to hear that no disaster has accompanied the

commencement of an enterprise which you have regarded with such evil

forebodings. i arrived here yesterday, and my first task is to assure

my dear sister of my welfare and increasing confidence in the success

of my undertaking.



i am already far north of london, and as i walk in the streets of

petersburgh, i feel a cold northern breeze play upon m

After loading the text, we have to tokenize it. Tokens can be either words or characters. I believe word-level tokenization is often better suited to text generation tasks where meaning, context, and coherence between words are essential. These are the advantages I find decisive in choosing word-level tokenization for text generation:

1. Captures Meaning: Words carry semantic meaning, while characters do not. Tokenizing at the word level helps the model learn higher-level concepts directly.

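2. Shorter Sequences: Word-level tokenization produces far fewer tokens per sentence than character-level tokenization, so a fixed-length input window covers more context. The trade-off is a much larger vocabulary (thousands of words instead of a few dozen characters).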

3. Better Generalization: It allows the model to recognize common word patterns (e.g., "the cat") rather than learning character sequences from scratch.

4. Simpler Models: A word-level model does not have to learn spelling, so a relatively small architecture can already produce readable output.

5. Natural Language Structure: Text generation aligns more with human language, which is structured around words, not individual characters.

6. Handling Rare Words: A rare or complex word that appears in the training text is a single token, so the model never has to spell it out character by character. Words outside the vocabulary, however, cannot be generated at all.
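
To make the difference between the two tokenization schemes concrete, here is a minimal sketch in plain Python that tokenizes one line from the book both ways. The `sample` string and the naive `split()` are simplifications chosen for illustration (the Keras `Tokenizer` also strips punctuation), so the counts are only indicative.

```python
# Compare word-level and character-level tokenization on a short sample.
sample = "i am already far north of london, and as i walk in the streets of petersburgh"

word_tokens = sample.split()   # naive whitespace split (illustrative only)
char_tokens = list(sample)     # every character, including spaces, is a token

word_vocab = sorted(set(word_tokens))
char_vocab = sorted(set(char_tokens))

print(f"word-level: {len(word_tokens)} tokens, {len(word_vocab)} unique")
print(f"char-level: {len(char_tokens)} tokens, {len(char_vocab)} unique")
```

The word-level sequence is several times shorter than the character-level one. On a sample this small the word vocabulary is tiny, but over a whole book it grows to thousands of entries, while the character vocabulary stays at a few dozen.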



In [5]:
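# Tokenize the entire text once (using words as tokens, not characters).
# Note: the tokenizer is fit on the raw `text`, not `cleaned_text`;
# by default, Tokenizer lowercases the input and strips most punctuation itself.
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])  # Build the word-to-integer mapping from the corpus
total_words = len(tokenizer.word_index) + 1  # Unique words + 1, because index 0 is reserved for padding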

In [6]:
# Convert the entire text into a sequence of integers (mapping words to integer indices)
tokenized_text = tokenizer.texts_to_sequences([text])[0]

In [7]:
sequence_length = 5  # The length of the input sequence (how many previous words to consider for the prediction)

# Pre-allocate NumPy arrays for efficiency (storing input sequences and their corresponding target words)
num_sequences = len(tokenized_text) - sequence_length
X = np.zeros((num_sequences, sequence_length), dtype=np.int32)  # Input sequences (X)
y = np.zeros(num_sequences, dtype=np.int32)  # Target words (y)

# Generate input sequences (X) and their corresponding targets (y)
# (y is already an int32 NumPy array, so no further conversion is needed)
for i in range(num_sequences):
    X[i] = tokenized_text[i:i + sequence_length]  # The sequence of tokens as input
    y[i] = tokenized_text[i + sequence_length]  # The next word as the target

# Print shapes to confirm the data generation is correct
print(f"Generated {len(X)} sequences.")
print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

Generated 75516 sequences.
Shape of X: (75516, 5)
Shape of y: (75516,)
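
The same sliding-window construction can be checked on a toy sequence. The sketch below is an illustration, not the notebook's code: it uses NumPy's `sliding_window_view` instead of an explicit loop, and the names `tokens`, `window`, `X_toy`, and `y_toy` are hypothetical stand-ins for `tokenized_text`, `sequence_length`, `X`, and `y`.

```python
import numpy as np

# Toy stand-in for `tokenized_text`: ten integer token ids (hypothetical values).
tokens = np.arange(10, 20)
window = 3  # plays the role of `sequence_length`

# Each row of `windows` holds `window + 1` consecutive tokens.
windows = np.lib.stride_tricks.sliding_window_view(tokens, window + 1)
X_toy = windows[:, :-1]  # the first `window` tokens of each row are the input
y_toy = windows[:, -1]   # the last token of each row is the target

print(X_toy.shape, y_toy.shape)
print(X_toy[0], "->", y_toy[0])
```

With 10 tokens and a window of 3, this yields 10 - 3 = 7 input/target pairs, matching the `len(tokenized_text) - sequence_length` count used above.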


Now that we have tokenized the text and built X and y, we can start building the LSTM model.

**Model Architecture:**

Embedding Layer: This converts the input sequence of integer word indices into dense vector representations of size 128.

LSTM Layers: Two stacked layers with 256 hidden units each. They are the core of the RNN and learn the dependencies between words in the input sequence. The first returns its full output sequence so that the second can process it.

Dropout Layer: This is used to reduce overfitting by randomly setting 20% of its inputs to 0 during training.

Dense Layer: This fully connected layer with softmax activation outputs the probabilities for the next word in the sequence. The number of units is equal to the vocabulary size (the number of unique words in the corpus plus one for the reserved index 0).
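
To get a feel for where the model's capacity sits, the sketch below estimates the parameter count of each layer with the standard formulas, assuming the stack defined in this notebook (an embedding, two LSTM layers, and a dense output). The `vocab_size` value is a hypothetical placeholder; the real one comes from the tokenizer.

```python
# Rough parameter count per layer, using the standard formulas.
vocab_size = 7000   # hypothetical; the real value comes from the tokenizer
embed_dim = 128
units = 256

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

embedding = vocab_size * embed_dim
lstm_1 = lstm_params(embed_dim, units)   # input is the embedding vector
lstm_2 = lstm_params(units, units)       # input is the first LSTM's output
dense = units * vocab_size + vocab_size  # weights plus biases

for name, n in [("embedding", embedding), ("lstm_1", lstm_1),
                ("lstm_2", lstm_2), ("dense", dense)]:
    print(f"{name:>9}: {n:,}")
```

Under these assumptions the embedding and the output layer, which both scale with the vocabulary size, hold more parameters than the two LSTM layers combined.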

In [8]:
# Using the tokenizer's word index to determine the vocabulary size (based on words, not characters)
vocab_size = total_words  # Vocabulary size based on the tokenizer (number of unique words)

In [9]:
# Model definition
model = Sequential([
    # Embedding layer: maps each integer token to a dense vector of size 128 (adjustable)
    Embedding(input_dim=vocab_size, output_dim=128),

    # First LSTM layer: 256 units; returns the full sequence so the next LSTM can consume it
    LSTM(256, return_sequences=True),

    # Second LSTM layer: 256 units; returns only the last time step's output (final LSTM layer)
    LSTM(256, return_sequences=False),

    # Dropout layer: randomly zeroes 20% of the LSTM outputs during training to reduce overfitting
    Dropout(0.2),

    # Dense output layer: softmax over the vocabulary gives the next-word probabilities
    Dense(vocab_size, activation='softmax')
])

In [10]:
# Compile the model with sparse categorical cross-entropy loss (integer labels, no one-hot encoding needed)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
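
For intuition about the loss, here is a small NumPy sketch of what sparse categorical cross-entropy computes: the mean negative log-probability that the model assigns to the correct class. The probabilities and targets are made-up numbers, and this is an illustration of the formula rather than the Keras implementation.

```python
import numpy as np

# Softmax output for a batch of 2 examples over a 4-word vocabulary (made-up numbers).
probs = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.3, 0.3, 0.3, 0.1]])
targets = np.array([1, 3])  # integer class ids, in the same form as `y`

# Pick the probability assigned to the correct class of each example.
picked = probs[np.arange(len(targets)), targets]
loss = -np.log(picked).mean()
print(round(float(loss), 4))
```

Because the targets stay as integers, there is no need to expand `y` into one-hot vectors as wide as the vocabulary, which saves a lot of memory for a word-level model.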

In [12]:
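# Train the model on the data.
# Early stopping halts training once the training loss has not improved for 5 epochs.
# Because it monitors the training loss, it saves time but does not detect overfitting;
# for that, pass validation_split to fit() and monitor 'val_loss' instead.
# Batch size and epochs should be chosen with the available computational power in mind.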
early_stopping = EarlyStopping(monitor='loss', patience=5)
model.fit(X, y, batch_size=128, epochs=30, verbose=1, callbacks=[early_stopping])

Epoch 1/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 56s 95ms/step - accuracy: 0.1931 - loss: 4.3399
Epoch 2/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 60s 102ms/step - accuracy: 0.2095 - loss: 4.1354
Epoch 3/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 79s 96ms/step - accuracy: 0.2233 - loss: 3.9653
Epoch 4/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 78s 89ms/step - accuracy: 0.2445 - loss: 3.7802
Epoch 5/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 58s 98ms/step - accuracy: 0.2711 - loss: 3.5897
Epoch 6/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 58s 99ms/step - accuracy: 0.2939 - loss: 3.4224
Epoch 7/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 79s 93ms/step - accuracy: 0.3235 - loss: 3.2431
Epoch 8/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 81s 91ms/step - accuracy: 0.3529 - loss: 3.0710
Epoch 9/30
590/590

<keras.src.callbacks.history.History at 0x1d40d11d1c0>

Choose the batch size and number of epochs to suit your hardware. To improve the model further, you can keep training it. I also added early stopping, which cuts training time by halting once the loss stops improving; I highly recommend it! In my case, I first trained for 10 epochs (to test the code and gauge the training time) and then for 30 more. That gave a reasonably good training accuracy; I could have continued, but my machine was already struggling, and this was enough to test the model.
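
The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not Keras's `EarlyStopping` implementation; the function `stopping_epoch` and the loss values are hypothetical.

```python
def stopping_epoch(losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: improve for three epochs, then stall.
val_losses = [4.5, 4.2, 4.0, 4.1, 4.1, 4.2, 4.3, 4.4]
print(stopping_epoch(val_losses, patience=3))
```

Applied to a validation loss rather than the training loss, the same rule stops training when the model stops generalizing, which is what guards against overfitting.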

In [13]:
# To save the model
model.save("frakenstein_tg_model.keras")

In [18]:
# To load an already saved model
# model = tf.keras.models.load_model("frakenstein_tg_model.keras")

Now let's test the model by generating text at different temperatures.
In case you are not familiar with it, temperature controls how random the sampling is: a lower temperature (e.g., 0.5) produces more predictable and coherent text, while a higher temperature (e.g., 1.5) produces more varied but sometimes erratic text. A temperature of 1.0 samples from the model's distribution unchanged.
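
The effect of temperature is easy to see on a tiny distribution. The sketch below rescales a made-up probability vector the same way temperature sampling does, by dividing the log-probabilities by the temperature and renormalizing; `apply_temperature` is a hypothetical helper written for this illustration.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector; equivalent to softmax(log(p) / T)."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return scaled / scaled.sum()

probs = [0.5, 0.3, 0.2]  # made-up next-word probabilities
for t in (0.5, 1.0, 1.5):
    print(t, np.round(apply_temperature(probs, t), 3))
```

A temperature below 1.0 sharpens the distribution so the most likely word wins more often, while a temperature above 1.0 flattens it and gives unlikely words a better chance.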

In [14]:
def sample_with_temperature(predictions, temperature=1.0):

    predictions = np.asarray(predictions).astype('float64')
    predictions = np.log(predictions + 1e-8) / temperature  # Apply temperature scaling
    exp_preds = np.exp(predictions)
    probabilities = exp_preds / np.sum(exp_preds)           # Re-normalize to get a probability distribution

    return np.random.choice(len(probabilities), p=probabilities)

In [15]:
def generate_text(model, tokenizer, seed_text, sequence_length, num_words_to_generate=50, temperature=1.0):

    # Tokenize the seed text
    tokenized_seed = tokenizer.texts_to_sequences([seed_text])[0]
    
    # Ensure the seed is long enough (pad with zeros if needed)
    if len(tokenized_seed) < sequence_length:
        tokenized_seed = [0] * (sequence_length - len(tokenized_seed)) + tokenized_seed

    # Initialize the generated text with the seed text
    generated_text = seed_text

    for _ in range(num_words_to_generate):
        # Prepare the input sequence for the model
        input_sequence = np.array(tokenized_seed[-sequence_length:]).reshape(1, sequence_length)

        # Predict the next word probabilities
        predicted_probabilities = model.predict(input_sequence, verbose=0)[0]

        # Sample the next word index using temperature scaling
        predicted_word_index = sample_with_temperature(predicted_probabilities, temperature=temperature)

        # Convert the predicted index back to a word
        predicted_word = tokenizer.index_word.get(predicted_word_index, '')

        # Stop if the index maps to no word (index 0 is reserved for padding and has no entry)
        if predicted_word == '':
            break

        # Append the predicted word to the generated text
        generated_text += ' ' + predicted_word

        # Add the predicted word index to the sequence for the next prediction
        tokenized_seed.append(predicted_word_index)

    return generated_text
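
The rolling context window used during generation can be illustrated without a trained model. In the sketch below, `next_token` is a hypothetical stand-in for the predict-and-sample step; it only serves to show how a short seed is left-padded with 0 and how each new token shifts the window.

```python
def next_token(context):
    # Hypothetical stand-in for model.predict + sampling: sum of the context modulo 10
    return sum(context) % 10

sequence_length = 3
tokens = [1, 2]  # seed shorter than the window
tokens = [0] * (sequence_length - len(tokens)) + tokens  # left-pad with 0

for _ in range(4):
    context = tokens[-sequence_length:]  # only the most recent tokens are fed to the model
    tokens.append(next_token(context))

print(tokens)
```

However long the generated text grows, the predictor only ever sees the last `sequence_length` tokens, so anything earlier no longer influences the next word.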


In [17]:
# Example seed text
seed_text = "I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body"

# Generate text with different temperatures
print("Low Temperature (0.5):")
print(generate_text(model, tokenizer, seed_text, sequence_length=5, num_words_to_generate=20, temperature=0.5))

print("\nMedium Temperature (1.0):")
print(generate_text(model, tokenizer, seed_text, sequence_length=5, num_words_to_generate=20, temperature=1.0))

print("\nHigh Temperature (1.5):")
print(generate_text(model, tokenizer, seed_text, sequence_length=5, num_words_to_generate=20, temperature=1.5))


Low Temperature (0.5):
I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body for this i had deprived him of rest and health he had gone before i learned that they also would

Medium Temperature (1.0):
I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body for this will i have been rather a shattered and pleasant snatched around the mountains which had now the arbiters

High Temperature (1.5):
I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body with mine they had nursed on and a family lest my mind should you have spent unguarded and human will


In this notebook, we trained an LSTM model to generate text in the style of *Frankenstein* by Mary Shelley. The samples show how the temperature setting shapes the output. At a low temperature (0.5), the text was the most predictable and coherent and stayed close to the phrasing of the original. As the temperature increased, the output became more varied and creative, but also more erratic, while keeping some connection to the source material.

Even at higher temperatures, the model produced fragments that recognizably echo the novel, which suggests that a fairly simple model can be a useful tool for experimenting with creative writing based on classic works like *Frankenstein*.

Looking forward, the model could be tuned for better coherence, for example by training it longer, and more advanced architectures such as transformers could be explored for higher-quality text generation.
