**Text generation with LSTM RNN**

Text Generation With LSTM Recurrent Neural Networks in Python with Keras
by Jason Brownlee on August 4, 2016 in Deep Learning for Natural Language Processing
Recurrent neural networks can also be used as generative models.

This means that, in addition to being used for predictive models (making predictions), they can learn the sequences of a problem and then generate entirely new plausible sequences for the problem domain.

Generative models like this are useful not only to study how well a model has learned a problem, but to learn more about the problem domain itself.

In this post you will discover how to create a generative model for text, character-by-character using LSTM recurrent neural networks in Python with Keras.


After reading this post you will know:


*   Where to download a free corpus of text that you can use to train text generative models.
*   How to frame the problem of text sequences to a recurrent neural network generative model.
*   How to develop an LSTM to generate plausible text sequences for a given problem.

**NOTE:** LSTM recurrent neural networks can be slow to train and it is highly recommended that you train them on GPU hardware.

**Problem Description**

In this tutorial we are going to use **Shakespeare's sonnets** as the dataset.

We are going to learn the dependencies between characters and the conditional probabilities of characters in sequences so that we can in turn generate wholly new and original sequences of characters.

This is a lot of fun and I recommend repeating these experiments with other books from Project Gutenberg; [here is a list of the most popular books on the site](https://www.gutenberg.org/ebooks/search/%3Fsort_order%3Ddownloads).


These experiments are not limited to text; you can also experiment with other ASCII data, such as computer source code and marked-up documents in LaTeX, HTML or Markdown.

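Project Gutenberg provides its books for free in plain-text (UTF-8) format. The code in this tutorial expects Shakespeare's sonnets saved as shakespeare.txt (loaded below from /content/shakespeare.txt). Any other book can be used in the same way; for example, you can [download the complete text of Alice's Adventures in Wonderland](http://www.gutenberg.org/cache/epub/11/pg11.txt), place it in your working directory with the filename wonderland.txt, and adjust the filename in the loading code to match.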

Now we need to prepare the dataset ready for modeling.

**Develop a Small LSTM Recurrent Neural Network**


In this section we will develop a simple LSTM network to learn sequences of characters from Shakespeare poems. In the next section we will use this model to generate new sequences of characters.

Let’s start off by importing the classes and functions we intend to use to train our model.

In [1]:
import sys
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import CuDNNLSTM
from keras.layers import BatchNormalization 
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


Next, we need to load the ASCII text of the poems into memory and convert all of the characters to lowercase to reduce the vocabulary that the network must learn.

In [0]:
# load ascii text and convert to lowercase
filename = "/content/shakespeare.txt"
# use a context manager so the file is closed after reading
with open(filename) as f:
	raw_text = f.read()
raw_text = raw_text.lower()


In [14]:
raw_text[:100]

'the sonnets\n\nby william shakespeare\n\n1\nfrom fairest creatures we desire increase,\nthat thereby beaut'

Now that the text is loaded, we must prepare the data for modeling by the neural network. We cannot model the characters directly; instead, we must convert the characters to integers.

We can do this easily by first creating a set of all of the distinct characters in the book, then creating a map of each character to a unique integer.

When preparing the mapping of unique characters to integers, we must also create a reverse mapping that we can use to convert the integers back to characters so that we can understand the predictions.
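As a toy illustration of the two lookup tables, the sketch below builds them for a short sample string and checks that encoding followed by decoding returns the original text. The sample string and the names `toy_chars`, `toy_char_to_int`, `toy_int_to_char`, `encode` and `decode` are assumptions made for this illustration and are not part of the tutorial code.

```python
# Toy illustration: forward and reverse character mappings on a short string.
sample = "the sonnets"  # assumed sample text, not the full corpus

toy_chars = sorted(set(sample))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}


def encode(text):
    """Convert a string to a list of integer codes."""
    return [toy_char_to_int[c] for c in text]


def decode(codes):
    """Convert a list of integer codes back to a string."""
    return "".join(toy_int_to_char[i] for i in codes)


codes = encode(sample)
print(toy_chars)      # the distinct characters, in sorted order
print(codes)          # the sample text as integers
print(decode(codes))  # round trip back to the original string
```

Sorting the characters before enumerating them keeps the mapping the same from run to run, which matters when weights trained in one session are reloaded in another.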

In [15]:

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

int_to_char = dict((i, c) for i, c in enumerate(chars))

Now that the book has been loaded and the mapping prepared, we can summarize the dataset.

In [11]:
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  94630
Total Vocab:  48


We can see that the text has 94,630 characters and that, when converted to lowercase, there are only 48 distinct characters in the vocabulary for the network to learn. This is considerably more than the 26 letters of the alphabet, because the vocabulary also includes characters such as digits, punctuation, spaces and newlines.
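To see where the extra vocabulary entries come from, one option is to group the distinct characters by kind. The sketch below does this for a short sample; the helper name `summarize_vocab`, the category labels and the sample string are assumptions for illustration. In the notebook the function could be called on `raw_text` instead.

```python
import string
from collections import Counter


def summarize_vocab(text):
    """Count the distinct characters of the lowercased text by broad category."""
    kinds = Counter()
    for ch in sorted(set(text.lower())):
        if ch in string.ascii_lowercase:
            kinds["letter"] += 1
        elif ch in string.digits:
            kinds["digit"] += 1
        elif ch.isspace():
            kinds["whitespace"] += 1
        else:
            kinds["punctuation/other"] += 1
    return dict(kinds)


# Assumed sample, modeled on the opening lines of the text.
sample_text = "1\nFrom fairest creatures we desire increase,\n"
print(summarize_vocab(sample_text))
```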

We now need to define the training data for the network. There is a lot of flexibility in how you choose to break up the text and expose it to the network during training.

In this tutorial we will split the book text up into subsequences with a fixed length of 100 characters, an arbitrary length. We could just as easily split the data up by sentences and pad the shorter sequences and truncate the longer ones.

Each training pattern of the network is comprised of 100 time steps of one character (X) followed by one character output (y). When creating these sequences, we slide this window along the whole book one character at a time, allowing each character a chance to be learned from the 100 characters that preceded it (except the first 100 characters of course).

For example, if the sequence length is 5 (for simplicity) then the first two training patterns would be as follows:

the s -> o

he so -> n
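The sliding window can be sketched on the first few characters of the lowercased text. The helper name `make_windows` and the short `opening` string are assumptions for illustration; the tutorial's own code uses a window of 100 characters over the whole text.

```python
def make_windows(text, seq_length):
    """Return (input, target) pairs from a window slid one character at a time."""
    pairs = []
    for i in range(len(text) - seq_length):
        pairs.append((text[i:i + seq_length], text[i + seq_length]))
    return pairs


opening = "the sonnets"  # first characters of the lowercased text
pairs = make_windows(opening, seq_length=5)
for seq_in, seq_out in pairs[:2]:
    print(repr(seq_in), "->", repr(seq_out))
print(len(pairs))  # number of patterns = len(text) - seq_length
```

The same relationship explains the pattern count for the full dataset: 94,630 characters with a window of 100 give 94,530 patterns.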


As we split up the book into these sequences, we convert the characters to integers using our lookup table we prepared earlier.

In [13]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)


Total Patterns:  94530


Running the code to this point shows that when we split the dataset into training data for the network, we have 94,530 training patterns. This makes sense: excluding the first 100 characters, we have one training pattern to predict each of the remaining characters.


First we must transform the list of input sequences into the form [samples, time steps, features] expected by an LSTM network.

Next we need to rescale the integers to the range 0-to-1 to make the patterns easier to learn by the LSTM network, whose gates use sigmoid-style activations by default.

Finally, we need to convert the output patterns (single characters converted to integers) into a one hot encoding. This is so that we can configure the network to predict the probability of each of the 48 different characters in the vocabulary (an easier representation) rather than trying to force it to predict precisely the next character. Each y value is converted into a sparse vector with a length of 48, full of zeros except with a 1 in the column for the character (integer) that the pattern represents.

We can implement these steps as below:

In [0]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
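To make the resulting shapes concrete, here is a small NumPy-only sketch of the same three steps on toy data. The toy vocabulary size and the names `data_x`, `data_y`, `X_toy` and `y_toy` are assumptions for illustration, and indexing `np.eye` is used here as a stand-in for `to_categorical`, not as the tutorial's method.

```python
import numpy as np

n_vocab_toy = 4                   # assumed toy vocabulary size
data_x = [[0, 1, 2], [1, 2, 3]]   # two patterns, three time steps each
data_y = [3, 0]                   # integer-encoded next characters

# reshape to [samples, time steps, features] and rescale to the range 0-to-1
X_toy = np.reshape(data_x, (len(data_x), 3, 1)) / float(n_vocab_toy)

# one hot encode: row i of the identity matrix is the vector for class i
y_toy = np.eye(n_vocab_toy)[data_y]

print(X_toy.shape)
print(y_toy)
```

Passing the number of classes explicitly, as `np.eye(n_vocab_toy)` does here, guarantees one column per vocabulary entry. When `to_categorical` is called without `num_classes`, it infers the width from the largest label in the data.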

We can now define our LSTM model. Here we stack four hidden LSTM layers, each with 256 memory units and each followed by dropout with a probability of 0.2 (20%). The output layer is a Dense layer using the softmax activation function to output a probability between 0 and 1 for each of the 48 characters.

The problem is really a single character classification problem with 48 classes and as such is defined as optimizing the log loss (cross entropy), here using the Adam optimization algorithm for speed.
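As a rough guide for reading the loss values, the log loss for a single example is the negative natural logarithm of the probability the model assigned to the correct character. The sketch below (the helper name `cross_entropy` and the example distributions are assumptions for illustration) computes the loss of a model that guesses uniformly over 48 classes, which is about 3.87. Losses reported during training can be compared against that baseline.

```python
import math


def cross_entropy(probs, true_index):
    """Log loss for one example: -log of the probability given to the true class."""
    return -math.log(probs[true_index])


n_classes = 48

# A model that knows nothing spreads probability evenly over all classes.
uniform = [1.0 / n_classes] * n_classes
baseline = cross_entropy(uniform, true_index=0)
print(round(baseline, 4))

# A model that puts most of its probability on the correct class scores lower.
confident = [0.01] * n_classes
confident[5] = 1.0 - 0.01 * (n_classes - 1)
print(round(cross_entropy(confident, true_index=5), 4))
```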

In [0]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(256))
model.add(Dropout(0.2))

model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_5 (LSTM)                (None, 100, 256)          264192    
_________________________________________________________________
dropout_5 (Dropout)          (None, 100, 256)          0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 100, 256)          525312    
_________________________________________________________________
dropout_6 (Dropout)          (None, 100, 256)          0         
_________________________________________________________________
lstm_7 (LSTM)                (None, 100, 256)          525312    
_________________________________________________________________
dropout_7 (Dropout)          (None, 100, 256)          0         
_________________________________________________________________
lstm_8 (LSTM)                (None, 256)               525312    
__________
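The parameter counts in the summary can be checked by hand. A standard LSTM layer has four sets of weights (three gates plus the candidate cell update), each with input weights, recurrent weights and a bias vector. The helper name `lstm_param_count` is an assumption for illustration.

```python
def lstm_param_count(input_dim, units):
    """Parameters in a standard LSTM layer: four weight sets, each with
    input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


# First layer sees 1 feature per time step; later layers see 256.
first_layer = lstm_param_count(input_dim=1, units=256)
stacked_layer = lstm_param_count(input_dim=256, units=256)
print(first_layer, stacked_layer)
```

These values match the 264192 and 525312 shown in the `Param #` column above.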

There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.

We are not interested in the most accurate (classification accuracy) model of the training dataset. This would be a model that predicts each character in the training dataset perfectly. Instead we are interested in a generalization of the dataset that minimizes the chosen loss function. We are seeking a balance between generalization and overfitting but short of memorization.

The network is slow to train (about 8 minutes per epoch on a Google Colab GPU). Because of the slowness and because of our optimization requirements, we will use model checkpointing to record all of the network weights to file each time an improvement in loss is observed at the end of the epoch. We will use the best set of weights (lowest loss) to instantiate our generative model in the next section.

In [0]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

We can now fit our model to the data. Here we use a modest 30 epochs and a batch size of 64 patterns.

In [0]:
model.fit(X, y, epochs=30, batch_size=64, callbacks=callbacks_list)

Epoch 1/30

Epoch 00001: loss improved from inf to 2.99053, saving model to weights-improvement-01-2.9905.hdf5
Epoch 2/30

Epoch 00002: loss improved from 2.99053 to 2.59595, saving model to weights-improvement-02-2.5960.hdf5
Epoch 3/30

Epoch 00003: loss improved from 2.59595 to 2.39341, saving model to weights-improvement-03-2.3934.hdf5
Epoch 4/30

Epoch 00004: loss improved from 2.39341 to 2.24109, saving model to weights-improvement-04-2.2411.hdf5
Epoch 5/30

Epoch 00005: loss improved from 2.24109 to 2.13826, saving model to weights-improvement-05-2.1383.hdf5
Epoch 6/30

Epoch 00006: loss improved from 2.13826 to 2.05642, saving model to weights-improvement-06-2.0564.hdf5
Epoch 7/30

Epoch 00007: loss improved from 2.05642 to 1.99439, saving model to weights-improvement-07-1.9944.hdf5
Epoch 8/30

Epoch 00008: loss improved from 1.99439 to 1.94409, saving model to weights-improvement-08-1.9441.hdf5
Epoch 9/30

Epoch 00009: loss improved from 1.94409 to 1.90171, saving model to weig

<keras.callbacks.History at 0x7f51846fd550>

You will see different results because of the stochastic nature of the model, and because it is hard to fix the random seed for LSTM models to get 100% reproducible results. This is not a concern for this generative model.

After running the example, you should have a number of weight checkpoint files in the local directory.

You can delete them all except the one with the smallest loss value. For example, when I ran this example, below was the checkpoint with the smallest loss that I achieved.

In my case, the checkpoint with the smallest loss was **weights-improvement-30-1.4482.hdf5**.
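Because the loss is encoded in each checkpoint filename, the best file can also be picked programmatically. The sketch below assumes filenames that follow the `weights-improvement-{epoch:02d}-{loss:.4f}.hdf5` template used for the checkpoint callback; the helper name `best_checkpoint` and the example file list are assumptions for illustration.

```python
import re

# Assumed pattern, matching the filepath template passed to ModelCheckpoint.
CHECKPOINT_RE = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")


def best_checkpoint(filenames):
    """Return the filename with the lowest loss encoded in its name, or None."""
    scored = []
    for name in filenames:
        match = CHECKPOINT_RE.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    return min(scored)[1] if scored else None


example_files = [
    "weights-improvement-01-2.9905.hdf5",
    "weights-improvement-02-2.5960.hdf5",
    "weights-improvement-03-2.3934.hdf5",
    "notes.txt",  # ignored: does not match the checkpoint pattern
]
print(best_checkpoint(example_files))
```

In practice the list could come from `os.listdir(".")` in the directory where the checkpoints were written.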

The network loss decreased almost every epoch and I expect the network could benefit from training for many more epochs.

In the next section we will look at using this model to generate new text sequences.

**Generating Text with an LSTM Network**


Generating text using the trained LSTM network is relatively straightforward.

Firstly, we load the data and define the network in exactly the same way, except the network weights are loaded from a checkpoint file and the network does not need to be trained.

In [0]:
# load the network weights (substitute the filename of your own lowest-loss checkpoint)
filename = "weights-improvement-28-1.4749.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

Finally, we need to actually make predictions.

The simplest way to use the Keras LSTM model to make predictions is to first start off with a seed sequence as input, generate the next character then update the seed sequence to add the generated character on the end and trim off the first character. This process is repeated for as long as we want to predict new characters (e.g. a sequence of 1,000 characters in length).

We can pick a random input pattern as our seed sequence, then print generated characters as we generate them.
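The seed-update procedure can be sketched without a trained network by using a stand-in predictor. In the sketch below, `generate` and `toy_predictor` are hypothetical names, and the toy rule (last value plus one, modulo 5) merely takes the place of the model's prediction. The point is that the window keeps a fixed length while each prediction is fed back in.

```python
def generate(seed, predict_next, n_steps):
    """Generate n_steps items, feeding each prediction back into a fixed-length window."""
    window = list(seed)  # copy so the caller's seed is not modified
    output = []
    for _ in range(n_steps):
        nxt = predict_next(window)
        output.append(nxt)
        window = window[1:] + [nxt]  # drop the oldest item, append the newest
    return output, window


# Hypothetical stand-in for the trained network: predicts (last value + 1) mod 5.
def toy_predictor(window):
    return (window[-1] + 1) % 5


seed = [0, 1, 2]
generated, final_window = generate(seed, toy_predictor, n_steps=4)
print(generated)
print(final_window)
```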

In [0]:

# pick a random seed
start = numpy.random.randint(0, len(dataX))
pattern = list(dataX[start])  # copy so the training data is not modified
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(200):
	# reshape and normalize the window exactly as the training inputs were
	x = numpy.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose=0)
	# take the most probable next character
	index = numpy.argmax(prediction)
	result = int_to_char[index]
	sys.stdout.write(result)
	# slide the window: append the new character and drop the oldest
	pattern.append(index)
	pattern = pattern[1:len(pattern)]
print("\nDone.")

Seed:
" ave added feathers to the learned's wing,
and given grace a double majesty.
yet be most proud of tha "
t which i that see thee all thee,
the stn of thee the stn of thee,
then should the stn of thee in thee in thee,
the sun of thee the stn of thee,
then should the stn of thee in thee in thee,
the sun of
Done.


Running this example first outputs the selected random seed, then each character as it is generated.

For example, above are the results from one run of this text generator, starting from the random seed printed at the top of the output.


We can note some observations about the generated text.

*   It generally conforms to the line format observed in the original text, with fewer than 80 characters before a new line.
*   The characters are separated into word-like groups and most groups are actual English words (e.g. “then” and “the”), but some are not (e.g. “stn” in “then should the stn of thee in thee in thee”).


The fact that this character-based model of the text produces output like this is very impressive. It gives you a sense of the learning capabilities of LSTM networks.
