# Long Short-Term Memory Model for Text Generation
### Cassi Mason
### 12/16/2024
### CS379

## Project Description
This project trains a long short-term memory (LSTM) model on the text of Alice in Wonderland to generate new text. An LSTM is a type of recurrent neural network that carries information from earlier steps of a sequence forward to help make new predictions (Rohrer, 2017). This is useful for teaching computers to read and write, because the model can learn patterns in spelling, syntax, and word frequency.

This project ran on a local computer with a single CPU, so training and fitting the model took a long time. One way to scale the project to accept more data, learn more features, and train faster is distributed computing on GPUs. A neural network applies the same computation to every piece of data across an entire layer, then sends the results to the next layer, which applies a different computation and passes its results on in turn. GPUs are designed to perform simple, repetitive computations in parallel, so they can quickly process the inputs and generate the outputs for a single layer. Within a layer, the neurons' computations at a given time step are independent of each other, so a single layer could be split across many GPUs for even more parallelism. An LSTM still has to process the time steps of a sequence in order, because each step depends on the state from the previous one. The GPUs need to communicate and consolidate their outputs before the next layer runs, so that each neuron in the next layer sees the outputs of the previous-layer neurons connected to it. A final consolidation step would combine the results from each GPU into the final output.
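The idea of splitting one layer across several processors and then consolidating the results can be sketched with NumPy on a CPU. This is only an illustration under stated assumptions: the "devices" are simulated by slicing a weight matrix, and the names `W_shards`, `partial_outputs`, and `combined` are hypothetical and not part of this project's code.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # a batch of 4 inputs with 8 features each
W = rng.normal(size=(8, 6))   # weights of one dense layer with 6 neurons

# Pretend each "device" owns a slice of the layer's neurons (columns of W)
n_devices = 3
W_shards = np.array_split(W, n_devices, axis=1)

# Each device computes the outputs for its own neurons independently
partial_outputs = [x @ shard for shard in W_shards]

# Consolidation step: stitch the partial outputs back into one layer output
combined = np.concatenate(partial_outputs, axis=1)

# The result matches computing the whole layer in one place
full = x @ W
print(np.allclose(combined, full))
```

The concatenation plays the role of the communication step between GPUs: the next layer needs the complete output, so the shards have to be gathered before it can run.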

I tested two models. The first model has 3 layers: an LSTM layer, a dropout (pruning) layer, and an output layer. It was trained on the entire dataset 2 times to begin improving the weights of each layer with backpropagation. Its loss value is 2.71, and it produces text that bears some resemblance to English words. The second model has 5 layers: 2 LSTM layers, 2 dropout (pruning) layers, and an output layer. It was trained on the entire dataset 10 times to further improve the model's weights with backpropagation. Epoch 5 showed the lowest loss value (1.99), so the weights saved at epoch 5 were used when the model generated text. The output used more real words and repeated a longer pattern than the first model, showing it is getting better at "reading" and "writing".

I followed the example from Brownlee (2016). My description of how the code works mostly comes from the Tutorialspoint Keras tutorial.

## References
Brownlee, J. (2016, August 3). Text Generation With LSTM Recurrent Neural Networks in Python with Keras. MachineLearningMastery.Com. https://www.machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

Keras - Overview of Deep learning. (n.d.). Retrieved December 15, 2024, from https://www.tutorialspoint.com/keras/keras_overview_of_deep_learning.htm

Rohrer, B. (Director). (2017, June 27). Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) [Video recording]. https://www.youtube.com/watch?v=WCUNPb-5EYI

## Import Libraries

In [1]:
import numpy as np # numerical array manipulation
import sys

In [2]:
import tensorflow as tf # deep learning library

# use keras on top of tensorflow for easier user implementation of deep learning
from tensorflow.keras.models import Sequential # linearly stack layers in a neural network
from tensorflow.keras.layers import Dense # A fully connected layer where every neuron in the dense layer is connected to every neuron in the previous layer
from tensorflow.keras.layers import Dropout # a pruning layer that sets the outputs of a random fraction of neurons to zero during training to prevent overfitting
from tensorflow.keras.layers import LSTM # a layer that has long short term memory capability
from tensorflow.keras.callbacks import ModelCheckpoint # While a model is being trained, it saves the state of the model when model performance improves
from tensorflow.keras.utils import to_categorical # turn integer class labels into one-hot encoded data


## Import Dataset

In [3]:
# data source
filepath = "alice_in_wonderland.txt"

In [4]:
# read the whole book into a string, decoding the file's bytes as UTF-8 text
# the with-block closes the file once it has been read
with open(filepath, 'r', encoding='utf-8') as f:
    raw_text = f.read()
# Make everything lowercase to simplify the vocabulary set for the computer
raw_text = raw_text.lower()

In [5]:
print(raw_text)

alice's adventures in wonderland

                alice's adventures in wonderland

                          lewis carroll

               the millennium fulcrum edition 3.0




                            chapter i

                      down the rabbit-hole


  alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought alice `without pictures or conversation?'

  so she was considering in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a white
rabbit with pink eyes ran close by her.

  there was nothing so very remarkable in that; nor did alice
think it so very much out of the way to hear the rabbit say to
itself, `oh d

We can see we have imported the entire book, Alice in Wonderland. We also converted everything to lowercase. None of the proper nouns or starts of sentences are capitalized. This is our training data. We will use it to teach our model to "speak". It will be given the set of possible vocabulary it can use (all the different characters), learn dependencies/probabilities in the sequencing of characters, then create its own story with prediction. A possible modification would be to use every unique word as the vocabulary of the language. That way, the model will predict real words, but it will be limited to the words used in this book. 

## Process Dataset

In [6]:
# create a list of the unique characters in the text to generate the vocabulary of this language. 
chars = sorted(list(set(raw_text)))
# create a dictionary, mapping each unique character to an integer for computer processing
char_to_int = dict((c, i) for i, c in enumerate(chars))

In [7]:
print(char_to_int)

{'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, '(': 5, ')': 6, '*': 7, ',': 8, '-': 9, '.': 10, '0': 11, '3': 12, ':': 13, ';': 14, '?': 15, '[': 16, ']': 17, '_': 18, '`': 19, 'a': 20, 'b': 21, 'c': 22, 'd': 23, 'e': 24, 'f': 25, 'g': 26, 'h': 27, 'i': 28, 'j': 29, 'k': 30, 'l': 31, 'm': 32, 'n': 33, 'o': 34, 'p': 35, 'q': 36, 'r': 37, 's': 38, 't': 39, 'u': 40, 'v': 41, 'w': 42, 'x': 43, 'y': 44, 'z': 45}


We get a dictionary mapping every type of character used in the book "Alice in Wonderland" to an integer. For example, a newline is represented as 0, the letter 'p' is represented as 35.
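As a small illustration of how such a mapping is used, the sketch below encodes a short string to integers and decodes it back. It builds its own toy vocabulary from the sample string, so the integer values differ from the book's dictionary, and the `toy_` names are hypothetical.

```python
sample_text = "alice was beginning"

# Build a toy vocabulary from the sample only (not the full book)
toy_chars = sorted(set(sample_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
# Reverse mapping, used to turn integers back into characters
toy_int_to_char = {i: c for c, i in toy_char_to_int.items()}

# Encode each character as its integer, then decode back to text
encoded = [toy_char_to_int[c] for c in sample_text]
decoded = "".join(toy_int_to_char[i] for i in encoded)

print(encoded)
print(decoded == sample_text)
```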

## Explore Dataset

In [8]:
# the total number of characters in the entire book
n_chars = len(raw_text)
# the vocabulary of the language: unique characters in the book
n_vocab = len(chars)
print("Total Characters in the book Alice in Wonderland: ", n_chars)
print("Total Vocabulary (unique characters): ", n_vocab)

Total Characters in the book Alice in Wonderland:  148574
Total Vocabulary (unique characters):  46


Another way to look at this dataset is by words, rather than characters. We could do this with something like term frequency inverse document frequency (tf-idf) which finds prevalent words in a document compared to a corpus of documents. We could also count all the words and all the unique words. The unique words would become the vocabulary of the language.

In [9]:
# determine most common terms with tf-idf
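The tf-idf cell above was left as a placeholder. Because tf-idf compares one document against a corpus and this project uses a single book, a simpler starting point is a plain word count. The sketch below is an assumption about how that could look, not the project's method: it counts words in a short sample with `collections.Counter`, and the regular expression used to split words is a simplification.

```python
import re
from collections import Counter

# A short lowercase sample standing in for raw_text
sample = ("alice was beginning to get very tired of sitting by her sister "
          "on the bank, and of having nothing to do")

# Treat runs of letters and apostrophes as words (a simplifying assumption)
words = re.findall(r"[a-z']+", sample)

# Count how often each word appears
word_counts = Counter(words)

# The unique words would become the vocabulary of a word-level model
word_vocab = sorted(word_counts)

print("Total words:", len(words))
print("Unique words:", len(word_vocab))
print("Most common:", word_counts.most_common(3))
```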

## Prepare Training Data

In [10]:
# create input-output pairs to train model to predict the next character
# set length of a pattern sequence. Used to split text segments into inputs/outputs
seq_length = 100
# empty array to hold input sequences
dataX = []
# empty array to hold output sequences
dataY = []

# loop over every character in raw text from 0 to one sequence length from end
for i in range(0, n_chars - seq_length, 1):
    # define input sequence: current character through the sequence length
    seq_in = raw_text[i:i + seq_length]
    # define output sequence: the single character immediately following the input sequence
    seq_out = raw_text[i + seq_length]
    # Add current input sequence, as an integer, to the input sequence array
    dataX.append([char_to_int[char] for char in seq_in])
    # Add current output character, as an integer, to the output sequence array
    dataY.append(char_to_int[seq_out])
    
# determine total number of sequence patterns by counting length of input sequence array    
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  148474


Format the data in a way an LSTM model can understand. 

The input is put into a table of sample patterns and the character at each step in the pattern. It is then scaled between 0 and 1 for easier interpretation by the LSTM model by dividing the entire table by the number of unique characters in the language.

The output is formatted so the model can easily calculate the loss function (how much error there is between the actual value and the predicted value) between the different possible output characters with one hot encoding. All the different characters (represented as integers from a conversion in the last cell) become their own column. A character is represented as a 1 for that column in a row of 0s for every other column. 
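To make the two transformations concrete, here is a minimal NumPy sketch of scaling and one-hot encoding on a toy set of labels. It uses `np.eye` as a stand-in to show the shape of a one-hot table; the toy class count and variable names are assumptions for illustration only.

```python
import numpy as np

n_classes = 5                  # toy vocabulary size (the book has 46)
labels = np.array([0, 3, 1])   # three characters, already mapped to integers

# Scale the integers into the range 0 to 1, as done for the inputs
scaled = labels / float(n_classes)

# One-hot encode: row i has a 1 in column labels[i] and 0 everywhere else
one_hot = np.eye(n_classes)[labels]

print(scaled)
print(one_hot)
```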

In [11]:
# define feature data. Reshape the input list into a 3D array: the first dimension is the input samples,
# the second is the time steps (sequence length), and the third is the number of features per step (1)
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize input data to floating point values between 0 and 1
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)

Now our data is in a format a recurrent neural network will understand: patterns of input characters of length seq_length that map to one-hot encoded output characters of length 1. One way to tune this model would be to change seq_length. We used 100 for this project, but trying different lengths could yield different results. It also makes sense to me to make each pattern a full sentence. I think that would teach the model better grammar structures, sentence lengths, punctuation, etc.
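To see how the sequence length changes the training data, the sketch below builds sliding-window patterns from a short string for two different lengths. The helper name `make_patterns` and the toy text are hypothetical; the windowing mirrors the loop used in this notebook.

```python
def make_patterns(text, seq_length):
    """Split text into overlapping input windows and next-character targets."""
    n_windows = len(text) - seq_length
    inputs = [text[i:i + seq_length] for i in range(n_windows)]
    targets = [text[i + seq_length] for i in range(n_windows)]
    return inputs, targets

toy_text = "down the rabbit-hole"

# Longer windows give the model more context but fewer training patterns
for length in (5, 10):
    inputs, targets = make_patterns(toy_text, length)
    print(length, len(inputs), repr(inputs[0]), "->", repr(targets[0]))
```

The number of patterns is always the text length minus the sequence length, which is why the book's 148574 characters yield 148474 patterns at a length of 100.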

## Train Model

Here we will define what our model will look like: how many layers of a neural network it will have, what those layers do, and how the model weights will be improved for better predictive ability. 

The LSTM layer uses short- and long-term memory to carry information about earlier characters forward through the sequence. The dropout layer reduces overfitting to the training data by randomly zeroing the outputs of a percentage of neurons during training, so the network cannot rely too heavily on any single neuron. This helps keep the model from simply recreating the exact text of Alice in Wonderland. The dense layer condenses its inputs into a probability for each of the 46 unique characters in the language being the next character.
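A rough picture of what a dropout layer does during training can be sketched in NumPy. This is a simplified illustration, not the Keras implementation: the function name and the toy activations are assumptions. The surviving values are scaled up by 1 / (1 - rate), which is how Keras keeps the expected total unchanged.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction of activations and rescale the survivors."""
    # True where a neuron is kept, False where it is dropped
    keep_mask = rng.random(activations.shape) >= rate
    # Scale kept values so the expected sum stays the same
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1, 10))   # toy outputs from 10 neurons
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

Dropout is only active while training. When the model generates text, every neuron is used.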

The activation function determines how a neuron's weighted inputs are turned into its output, which influences the output of the model. Softmax is an activation function suited to multi-class classification problems: it converts the dense layer's raw scores into probabilities that sum to 1.
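The softmax function itself is short enough to write out. The sketch below is a plain NumPy version for a single vector of scores; the example scores are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])   # toy raw outputs for three characters
probs = softmax(scores)
print(probs, probs.sum())
```

The largest score always receives the largest probability, so taking the most probable character is the same as taking the highest raw score.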

When the model is compiled, it is configured to measure loss (the difference between the prediction and the true value) with a function called categorical cross-entropy, which works well with one-hot encoded categorical targets. The optimizer adjusts the weights of the neural network to minimize the loss, using the gradients computed by backpropagation. Adam is one such optimizer.
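To give the loss values in this project some context, the sketch below computes categorical cross-entropy by hand for a single prediction. With a one-hot target, the loss is the negative log of the probability given to the correct character. The helper name and the toy predictions are assumptions for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return float(-np.sum(y_true * np.log(y_pred)))

vocab_size = 46                      # number of unique characters in the book
y_true = np.zeros(vocab_size)
y_true[20] = 1.0                     # the correct next character

# A model with no knowledge spreads probability evenly over all characters
uniform = np.full(vocab_size, 1.0 / vocab_size)
uniform_loss = categorical_crossentropy(y_true, uniform)

# A model that puts half its probability on the right character does better
confident = np.full(vocab_size, 0.5 / (vocab_size - 1))
confident[20] = 0.5
confident_loss = categorical_crossentropy(y_true, confident)

print(uniform_loss, confident_loss)
```

Uniform guessing gives a loss of ln(46), roughly 3.83, so losses of 2.71 and 1.99 are both better than chance. Because Keras reports the average loss over samples, exp(-loss) can be read as a typical probability assigned to the correct character: roughly 0.07 at a loss of 2.71 and roughly 0.14 at 1.99.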

In [12]:
# Create Long Short Term Memory Recurrent Neural Network

# type of model adds layers sequentially
model = Sequential()

# add an LSTM layer with 256 neurons (memory cells); the input shape is (number of time steps, number of features at each time step)
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))

# add a layer that randomly erases the memory of 20% of the neurons to prevent over-fitting
model.add(Dropout(0.2))

# add a dense layer. Every neuron is connected to every neuron in the previous layer. 
# Number of neurons is equal to the number of options in the output array
# use a softmax activation function to determine most probable class of a multi-classification problem
model.add(Dense(y.shape[1], activation='softmax'))

# Compile model measuring loss with categorical crossentropy
# optimize the weights of the neural network to minimize the loss function with the Adam optimizer
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [13]:
# Set up checkpointing so the best model weights are saved during training

# filename template: each saved file is named with the epoch number and the loss value
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# Save model weights to the file while monitoring the loss value
# save the model only if the loss value decreases
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')

# list of callbacks to run while the model is fitted; here, just the checkpoint
callbacks_list = [checkpoint]

In [14]:
# fit the model
# number of times the model goes through the entire dataset: 2
# Batches are subsets of the dataset. Weights are updated after each batch is processed. Number of samples in a batch = 128
# save the best model weights by putting them in the callback list
model.fit(X, y, epochs=2, batch_size=128, callbacks=callbacks_list)

Epoch 1/2
Epoch 00001: loss improved from inf to 2.95911, saving model to weights-improvement-01-2.9591.hdf5
Epoch 2/2
Epoch 00002: loss improved from 2.95911 to 2.71301, saving model to weights-improvement-02-2.7130.hdf5


<tensorflow.python.keras.callbacks.History at 0x1f1061c0610>

After fitting the model, we can see in our checkpoint files that the loss value decreased from the first epoch to the second. This probably means our model would benefit from more epochs to learn better weights. Therefore, I also created a bigger neural network to see how that affects the performance of generating text. The next model has more layers, more epochs to learn the data and update weights through backpropagation, and smaller batch sizes so the weights are updated more frequently.

In [29]:
# change hyperparameters to see difference in model 

# linear stack of layers in our model
model3 = Sequential()
# LSTM layer with 256 neurons, looks at time steps and features, and returns the full output sequence so the next LSTM layer receives one output per time step
model3.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
# pruning layer
model3.add(Dropout(0.2))
# second LSTM layer, also with 256 neurons
model3.add(LSTM(256))
# second pruning layer
model3.add(Dropout(0.2))
# dense output layer
model3.add(Dense(y.shape[1], activation='softmax'))

# compile the bigger model (model3), measuring categorical cross entropy loss and optimizing with Adam
model3.compile(loss='categorical_crossentropy', optimizer='adam')

# Save model weights for each epoch as long as the loss function improves
filepath3 = "weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
checkpoint3 = ModelCheckpoint(filepath3, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list3 = [checkpoint3]

# fit the model, using 10 full passthroughs of the dataset, 64 patterns per batch, and storing best model weights
model3.fit(X, y, epochs=10, batch_size=64, callbacks=callbacks_list3)

Epoch 1/10
Epoch 00001: loss improved from inf to 2.77221, saving model to weights-improvement-01-2.7722-bigger.hdf5
Epoch 2/10
Epoch 00002: loss improved from 2.77221 to 2.43082, saving model to weights-improvement-02-2.4308-bigger.hdf5
Epoch 3/10
Epoch 00003: loss improved from 2.43082 to 2.23383, saving model to weights-improvement-03-2.2338-bigger.hdf5
Epoch 4/10
Epoch 00004: loss improved from 2.23383 to 2.09806, saving model to weights-improvement-04-2.0981-bigger.hdf5
Epoch 5/10
Epoch 00005: loss improved from 2.09806 to 1.99331, saving model to weights-improvement-05-1.9933-bigger.hdf5
Epoch 6/10
Epoch 00006: loss did not improve from 1.99331
Epoch 7/10
Epoch 00007: loss did not improve from 1.99331
Epoch 8/10
Epoch 00008: loss did not improve from 1.99331
Epoch 9/10
Epoch 00009: loss did not improve from 1.99331
Epoch 10/10
Epoch 00010: loss did not improve from 1.99331


<tensorflow.python.keras.callbacks.History at 0x1f11c7040a0>
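Because each checkpoint filename records its loss, the best file can be picked programmatically instead of being read off the log. The helper below is a hypothetical convenience, not part of the original notebook. It assumes filenames follow the `weights-improvement-{epoch}-{loss}` template used in this project.

```python
import re

def best_checkpoint(filenames):
    """Return the filename with the lowest loss, or None if none match."""
    # Capture the epoch number and the loss value from each filename
    pattern = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)")
    scored = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    # min() compares the loss first, so the lowest-loss file wins
    return min(scored)[1] if scored else None

names = [
    "weights-improvement-01-2.7722-bigger.hdf5",
    "weights-improvement-05-1.9933-bigger.hdf5",
    "weights-improvement-03-2.2338-bigger.hdf5",
]
print(best_checkpoint(names))
```

In practice the list of names could come from listing the working directory, for example with `os.listdir`.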

## Test Model

Now that we have our models created, trained, and fitted to the dataset, let's see what they come up with when generating text. We will give each model one of the training patterns from our input set and then let it generate characters to produce an output.

In [18]:
# Transform numerical representations of characters used in dataset back to actual characters for human readability
int_to_char = dict((i, c) for i, c in enumerate(chars))

In [31]:
# create a function that takes an input pattern and generates an output string, essentially writing a story

# function parameters are the input array, the model used, and the file containing the best weights for the model based on minimizing the loss function
def print_prediction(input_array, lstm_model, weight_improvement_file):
    # load optimized weights
    lstm_model.load_weights(weight_improvement_file)

    # pick a random pattern to give the model from our input array
    start = np.random.randint(0, len(input_array))
    # copy the pattern so appending to it does not modify the training data
    pattern = list(input_array[start])
    print("Seed:")
    print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")

    # generate characters as output to write text
    # for loop to write for 1000 characters
    for i in range(1000):
        # format input pattern in a way LSTM understands
        x = np.reshape(pattern, (1, len(pattern), 1))
        # normalize
        x = x / float(n_vocab)

        # use model to predict the next character based on the input pattern.
        # Returns array of probability over all characters in language
        prediction = lstm_model.predict(x, verbose=0)
        # find the index of the character with the highest probability from the prediction array
        index = int(np.argmax(prediction))

        # convert the predicted integer back to its character
        result = int_to_char[index]

        # write results to screen
        sys.stdout.write(result)
        # add output to input pattern so it informs the next prediction
        pattern.append(index)
        # drop the first character so the pattern stays the same length and includes the newly added character
        pattern = pattern[1:]
    print("\nDone.")

In [32]:
# simple model, 3 layers and 2 epochs
print_prediction(dataX, model, "weights-improvement-02-2.7130.hdf5")

Seed:
"  with an m?' said alice.

  `why not?' said the march hare.

  alice was silent.

  the dormouse had "
t the ton toe toet the toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe toet toe to

We can see the seed pattern used to get the model started. The model started off with some words and some nonsense, then settled into a loop of the word "toe" followed by the nonsense word "toet". The function always picks the single most probable next character and feeds each prediction back in as input. Once the input window fills with "toe toet", the most likely continuation is more of the same, so the model gets stuck repeating the phrase.
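One common way to break this kind of loop is to sample the next character from the predicted probabilities instead of always taking the most probable one. This is not what the notebook does. The sketch below is a hypothetical alternative, shown on a made-up probability vector, with a `temperature` parameter that controls how adventurous the sampling is.

```python
import numpy as np

def sample_with_temperature(probabilities, temperature, rng):
    """Draw an index at random, with probabilities reshaped by temperature."""
    # Work in log space; the small constant avoids log(0)
    logits = np.log(np.asarray(probabilities, dtype=float) + 1e-12) / temperature
    # Subtract the maximum for numerical stability before exponentiating
    logits -= logits.max()
    rescaled = np.exp(logits)
    rescaled /= rescaled.sum()
    return int(rng.choice(len(rescaled), p=rescaled))

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.2])    # toy prediction over three characters

# Greedy decoding always returns the same index
greedy = int(np.argmax(probs))

# Sampling returns a mix of indices, weighted by their probabilities
samples = [sample_with_temperature(probs, 1.0, rng) for _ in range(200)]
print(greedy, sorted(set(samples)))
```

A temperature below 1 makes the choice closer to greedy decoding, while a temperature above 1 spreads the choices more evenly. In `print_prediction`, this would replace the `np.argmax` step, applied to `prediction[0]`.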

In [33]:
# more complex model, 5 layers and 10 epochs. 
# After the 5th epoch, the loss function did not get any smaller, so the weights determined in the 5th epoch are used to generate text
print_prediction(dataX, model3, "weights-improvement-05-1.9933-bigger.hdf5")

Seed:
" he next peeped
out the fish-footman was gone, and the other was sitting on the
ground near the door, "
 and the was good the dourouse the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to the was to 

The model with more layers and more epochs wrote almost entirely in English words; the one exception is "dourouse". It also fell into a repetitive pattern, repeating "to the was" over and over after the first few words. However, the words in the loop are all real English words, and the pattern is longer than the simpler model's. It is learning!