<a href="https://colab.research.google.com/github/bosecodes/slytherin-slingshot/blob/master/Text_Generation_Project_Gutenberg.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Develop a Small LSTM Recurrent Neural Network

In this section we will develop a simple LSTM network to learn the sequences of characters from Alice in Wonderland. In the next section, we will use this model to generate new sequences of characters.

Let's start off by importing the classes and the functions we intend to use to train our model.

In [21]:
import numpy as np
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, Dropout, LSTM
from tensorflow.python.keras.callbacks import ModelCheckpoint
from tensorflow.python.keras.utils import np_utils

Num GPUs Available:  1


Next we need to load the ASCII text for the book into memory and convert all the characters to lowercase to reduce the vocabulary that the network must learn.

In [0]:
# load ascii text and convert to lowercase
location = '/content/wonderland.txt'
# use a context manager so the file handle is closed after reading
with open(location, 'r', encoding = 'utf-8') as f:
    raw_text = f.read()
raw_text = raw_text.lower()

In [23]:
raw_text[:100]

"project gutenberg's alice's adventures in wonderland, by lewis carroll\n\nthis ebook is for the use of"

Now that the book is loaded, we must prepare the data for modelling by the neural network. We can't model the characters directly; instead, we must convert the characters to integers.

We can do this easily by first creating a set of all the distinct characters in the book, then creating a map of each character to a unique integer.

In [0]:
# create a mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c,i) for i, c in enumerate(chars))

In [25]:
char_to_int

{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 '#': 4,
 '$': 5,
 '%': 6,
 "'": 7,
 '(': 8,
 ')': 9,
 '*': 10,
 ',': 11,
 '-': 12,
 '.': 13,
 '/': 14,
 '0': 15,
 '1': 16,
 '2': 17,
 '3': 18,
 '4': 19,
 '5': 20,
 '6': 21,
 '7': 22,
 '8': 23,
 '9': 24,
 ':': 25,
 ';': 26,
 '?': 27,
 '@': 28,
 '[': 29,
 ']': 30,
 '_': 31,
 'a': 32,
 'b': 33,
 'c': 34,
 'd': 35,
 'e': 36,
 'f': 37,
 'g': 38,
 'h': 39,
 'i': 40,
 'j': 41,
 'k': 42,
 'l': 43,
 'm': 44,
 'n': 45,
 'o': 46,
 'p': 47,
 'q': 48,
 'r': 49,
 's': 50,
 't': 51,
 'u': 52,
 'v': 53,
 'w': 54,
 'x': 55,
 'y': 56,
 'z': 57}

We could clean up the dataset further by removing some of these characters (for example, digits and rarely used symbols), which would reduce the vocabulary and may improve the modelling process.

Now that the book has been loaded and the mapping prepared, we can summarise the dataset.
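One possible way to do such a clean-up is to filter the text against an allowed set of characters before building the mapping. The sketch below is only an illustration: the names `allowed` and `clean_text`, and the particular characters kept, are assumptions and are not used elsewhere in this notebook.

```python
import string

# Hypothetical set of characters to keep; adjust to taste.
allowed = set(string.ascii_lowercase + " \n.,;:!?'-")

def clean_text(text):
    """Drop every character that is not in the allowed set."""
    return ''.join(ch for ch in text if ch in allowed)

sample = "chapter i. [illustration] down the rabbit-hole *** 1865"
cleaned = clean_text(sample)
vocab = sorted(set(cleaned))
print(cleaned)
print(len(vocab), 'distinct characters remain')
```

If something like this were applied to `raw_text`, it would have to happen before `chars` and `char_to_int` are built, so that the mapping only covers the characters that remain.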

In [26]:
n_chars = len(raw_text)
n_vocab = len(chars)
print('Total Characters: ', n_chars)
print('Total Vocab: ', n_vocab)

Total Characters:  163779
Total Vocab:  58


Thus the book has just under 164,000 characters and, once converted to lowercase, only 58 distinct characters in the vocabulary for the network to learn. That is still many more than the 26 letters of the alphabet.

We need to now define the training data for our network. There is a lot of flexibility in how you choose to break up the text and expose it to the network during training.

In this tutorial we will split the book text up into subsequences of 100 characters, an arbitrary length. We could just as easily split the data by sentences, padding the shorter sequences and truncating the longer ones.

Each training pattern consists of 100 time steps of one character (X) followed by one character of output (y). When creating these sequences, we slide this window along the whole book, one character at a time, allowing each character a chance to be learned from the 100 characters that preceded it (except the first 100 characters, of course).

For example, if the sequence length is 5 (for simplicity), then the first two training patterns would be as follows:

CHAPT -> E

HAPTE -> R
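The sliding window above can be reproduced with a few lines of plain Python. This is a minimal sketch on the toy string "CHAPTER"; `make_windows` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def make_windows(text, seq_length):
    """Return (input, target) pairs from a sliding window over text."""
    pairs = []
    for i in range(len(text) - seq_length):
        seq_in = text[i:i + seq_length]      # seq_length characters of input
        seq_out = text[i + seq_length]       # the single character that follows
        pairs.append((seq_in, seq_out))
    return pairs

pairs = make_windows("CHAPTER", 5)
for seq_in, seq_out in pairs:
    print(seq_in, '->', seq_out)
```

A 7-character string with a window of 5 yields 7 - 5 = 2 patterns, which is the same counting rule that applies to the full book.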

As we split up the book into sequences, we convert the characters into integers using the lookup table (dictionary) that we created earlier.

In [27]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []

for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i: i + seq_length]
    seq_out = raw_text[i+seq_length] # that is, the first character
    # after the sequence is complete
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print('Total Patterns: ', n_patterns)

Total Patterns:  163679


The number of patterns is 100 less than the number of characters (163,779 - 100 = 163,679), because the first 100 characters have no full 100-character window before them. We have one training pattern to predict each of the remaining characters.
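As a quick sanity check on that count, the loop bounds can be evaluated on their own. The two numbers below are copied from the outputs above; nothing else is assumed.

```python
n_chars = 163779      # total characters reported above
seq_length = 100      # window length used in this notebook

# One pattern per start index i in range(0, n_chars - seq_length)
n_patterns = len(range(0, n_chars - seq_length, 1))
print(n_patterns)
```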

Now that we have prepared the training data, we need to transform it so that it is suitable for use with Keras.
- First we must transform the input sequences to the form [samples, time steps, features] expected by the LSTM network.
- Next we need to rescale the integers to the range 0-1 to make the patterns easier to learn for the LSTM network, whose gates use the sigmoid activation function by default.
- Finally we need to convert the output patterns (single characters converted to integers) to a one hot encoding. This is so that we can configure the network to predict the probability of each of the 58 different characters in the vocabulary (an easier representation) rather than trying to force it to predict precisely the next character. Each y value is converted to a sparse vector with a length of 58, full of zeros except for a 1 in the column for the character that the pattern represents.

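For example, 'n' (integer value 45 in the mapping above) is one hot encoded as a vector of 58 values that are all zero except for a 1 at index 45. As a simplified illustration with a vocabulary of only the 26 letters, where 'n' has index 13, the encoding looks like: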

[ 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0
  0 0 0 0 0 0 ]
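As a small, self-contained illustration of the rescaling and one hot steps, the sketch below uses a toy vocabulary of the 26 lowercase letters (an assumption made for brevity; this notebook's vocabulary has 58 entries). `one_hot` and the `toy_` names are hypothetical; the notebook itself relies on `np_utils.to_categorical` for the real encoding.

```python
import string

toy_chars = list(string.ascii_lowercase)            # toy 26-letter vocabulary
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec

n_index = toy_char_to_int['n']
encoded = one_hot(n_index, len(toy_chars))
scaled = n_index / float(len(toy_chars))             # rescale an input into [0, 1)
print(n_index, encoded, scaled)
```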

We can implement these steps as follows:

In [28]:
# view the integer-encoded patterns as a 2-D array of shape (n_patterns, seq_length)
print(np.reshape(dataX, (n_patterns, seq_length)))
# the LSTM expects a third axis for features, so in the next cell we
# reshape to (n_patterns, seq_length, 1), turning the 2-D array into a 3-D array

[[47 49 46 ...  1 46 37]
 [49 46 41 ... 46 37  1]
 [46 41 36 ... 37  1 32]
 ...
 [ 1 51 46 ... 33 46 46]
 [51 46  1 ... 46 46 42]
 [46  1 39 ... 46 42 50]]
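To make the change from two to three dimensions concrete, here is a plain-Python sketch (no NumPy required) on a toy batch of 3 patterns with 4 time steps each, using the first few integers of the output above. `add_feature_axis` and `shape_of` are hypothetical helpers written for this illustration; the notebook itself uses `np.reshape`.

```python
def add_feature_axis(rows):
    """Wrap each value in its own list: [samples, steps] -> [samples, steps, 1]."""
    return [[[value] for value in row] for row in rows]

def shape_of(nested):
    """Return the shape of a regular nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)

toy_X = [[47, 49, 46, 41],
         [49, 46, 41, 36],
         [46, 41, 36, 34]]
toy_X3d = add_feature_axis(toy_X)
print(shape_of(toy_X), '->', shape_of(toy_X3d))
```

The values are unchanged; each time step simply becomes a one-element feature vector, which is the layout the LSTM layer expects.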


In [29]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length,1), )
# Normalize
print(X[:5])
X = X / float(n_vocab)

# One Hot Encode the output variable
y = np_utils.to_categorical(dataY)

[[[47]
  [49]
  [46]
  ...
  [ 1]
  [46]
  [37]]

 [[49]
  [46]
  [41]
  ...

We can now define our LSTM model. Here, we define a single hidden LSTM layer with 256 memory units. The network uses dropout with a probability of 0.2 (20%). The output layer is a Dense (fully connected) layer that uses the softmax activation function to output a probability between 0 and 1 for each of the 58 characters.

The problem is really a single-character classification problem with 58 classes, and as such it is defined as optimizing the log loss (cross entropy), here using the Adam optimization algorithm for speed.
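To give a feel for what the softmax output and the log loss mean numerically, here is a minimal pure-Python sketch. The helper names `softmax` and `cross_entropy` and the toy scores are assumptions made for illustration; Keras computes these internally and this is not its implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot_target, probs):
    """Log loss for one sample: -sum(target * log(prob))."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t)

probs = softmax([2.0, 1.0, 0.1])       # toy scores for a 3-class problem
target = [1, 0, 0]                      # true class is index 0
loss = cross_entropy(target, probs)

# Loss of a model that guesses uniformly over 58 classes
uniform_loss = cross_entropy([1] + [0] * 57, [1.0 / 58] * 58)
print(probs, loss, uniform_loss)
```

The uniform-guess value equals ln(58), roughly 4.06, which is a useful baseline when reading the per-epoch losses reported during training.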

In [0]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape = (X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam')

There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.

We are not interested in the most accurate (classification accuracy) model of the training dataset. This would be a model that predicts each character in the training dataset perfectly. Instead we are interested in the generalization of the dataset that minimizes the chosen loss function. We are seeking a balance between generalization and overfitting but short of memorization.

The network is slow to train. Because of this, and because of our optimization requirements, we will use model checkpointing to record all of the network weights to a file each time an improvement in the loss is observed at the end of an epoch. We will use the best set of weights (lowest loss) to instantiate our generative model in the next section.

In [0]:
# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor = 'loss', verbose = 1,
                            save_best_only = True, mode = 'min')
callbacks_list = [checkpoint]
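The effect of `save_best_only = True` with `mode = 'min'` can be pictured with a short simulation. This is a rough sketch of the behaviour, not the Keras implementation, and the loss values are toy numbers chosen for illustration (including one deliberate regression at epoch 3 so that a skipped save is visible).

```python
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# Toy per-epoch losses (illustrative values, not a real training run).
epoch_losses = [2.97587, 2.71407, 2.80000, 2.64663]

best = float('inf')
saved = []
for epoch, loss in enumerate(epoch_losses, start=1):
    if loss < best:                       # mode='min': lower is better
        best = loss
        saved.append(filepath.format(epoch=epoch, loss=loss))
print(saved)
```

Only epochs 1, 2 and 4 produce a file here, and the epoch number and loss are baked into each filename by the format string.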

We can now fit our model to the data. Here we use a modest number of 20 epochs and a large batch size of 128 patterns.

In [32]:
model.fit(X, y, epochs = 20, batch_size = 128, callbacks = callbacks_list)

Epoch 1/20
Epoch 00001: loss improved from inf to 2.97587, saving model to weights-improvement-01-2.9759.hdf5
Epoch 2/20
Epoch 00002: loss improved from 2.97587 to 2.79695, saving model to weights-improvement-02-2.7970.hdf5
Epoch 3/20
Epoch 00003: loss improved from 2.79695 to 2.71407, saving model to weights-improvement-03-2.7141.hdf5
Epoch 4/20
Epoch 00004: loss improved from 2.71407 to 2.64663, saving model to weights-improvement-04-2.6466.hdf5
Epoch 5/20
Epoch 00005: loss improved from 2.64663 to 2.58923, saving model to weights-improvement-05-2.5892.hdf5
Epoch 6/20
Epoch 00006: loss improved from 2.58923 to 2.53683, saving model to weights-improvement-06-2.5368.hdf5
Epoch 7/20
Epoch 00007: loss improved from 2.53683 to 2.48403, saving model to weights-improvement-07-2.4840.hdf5
Epoch 8/20
Epoch 00008: loss improved from 2.48403 to 2.43724, saving model to weights-improvement-08-2.4372.hdf5
Epoch 9/20
Epoch 00009: loss improved from 2.43724 to 2.39278, saving model to weights-impro

<tensorflow.python.keras.callbacks.History at 0x7f327c3ee2e8>

After running the network, we will have a number of checkpoint files in the working directory. We can delete all of them except for the one with the minimum loss.
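Because the loss is encoded in each filename, the file to keep can be chosen programmatically. The sketch below works on a hard-coded list of the first three names from the training log above and only computes which files would be deleted; `name_re` and `loss_of` are hypothetical names introduced for this example.

```python
import re

# Filenames as they appear in the training log above.
checkpoints = [
    "weights-improvement-01-2.9759.hdf5",
    "weights-improvement-02-2.7970.hdf5",
    "weights-improvement-03-2.7141.hdf5",
]

name_re = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def loss_of(name):
    """Extract the loss encoded in a checkpoint filename."""
    match = name_re.search(name)
    if match is None:
        raise ValueError("unexpected checkpoint name: " + name)
    return float(match.group(2))

best_file = min(checkpoints, key=loss_of)
to_delete = [name for name in checkpoints if name != best_file]
print(best_file, to_delete)
```

In a real working directory the list could be gathered with `glob.glob` and the unwanted files removed with `os.remove`, but it is worth printing `to_delete` and checking it before deleting anything.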

The network loss decreased almost every epoch and it can thereby be expected that the network would benefit from training for many more epochs.

In the next section, let's concentrate on how we can use this model to generate new text sequences.

# Generating Text with an LSTM Network

Generating text with this network is actually pretty straightforward now.
First, we load the data and define the network in exactly the same way, except that:
- the network weights are loaded from a checkpoint file, and
- the network does not need to be trained.

In [0]:
# load the network weights
filename = 'weights-improvement-20-2.0475.hdf5'
model.load_weights(filename)
model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam')

In addition to the mapping of unique characters to integers, we must create a reverse mapping that we can use to convert the integers back to characters so that we can understand the predictions.

In [0]:
int_to_char = dict((i, c) for i, c in enumerate(chars))
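A quick way to convince yourself that the two mappings are inverses is a round trip on a short string. This is a minimal sketch on toy data; the `toy_` names are assumptions chosen so that they do not overwrite the notebook's own `chars`, `char_to_int` and `int_to_char`.

```python
toy_text = "alice was beginning"
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}

encoded = [toy_char_to_int[c] for c in toy_text]          # characters -> integers
decoded = ''.join(toy_int_to_char[i] for i in encoded)    # integers -> characters
print(encoded)
print(decoded)
```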

Finally, we need to actually make predictions.

The simplest way to use the Keras LSTM model to make predictions is to start off with a seed sequence as input, generate the next character, then update the seed sequence by adding the generated character to the end and trimming off the first character. This process is repeated for as long as we want to predict new characters (e.g., a sequence of 1000 characters in length).

We can pick a random input pattern as our seed sequence, then print the generated characters as we generate them.
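The predict, append, trim cycle can be shown in isolation with a stand-in for the model. In this sketch, `generate` and `toy_predict` are hypothetical: `toy_predict` simply returns the last value plus one (modulo 5) and takes the place of `model.predict` followed by `np.argmax`.

```python
def generate(seed, predict_next, n_steps):
    """Greedy generation: predict, append, then drop the oldest item."""
    window = list(seed)              # copy so the seed itself is not modified
    output = []
    for _ in range(n_steps):
        nxt = predict_next(window)
        output.append(nxt)
        window.append(nxt)
        window = window[1:]          # keep the window length constant
    return output, window

# Hypothetical stand-in for model.predict + argmax: next = (last + 1) mod 5
def toy_predict(window):
    return (window[-1] + 1) % 5

seed = [0, 1, 2]
output, window = generate(seed, toy_predict, 4)
print(output, window, seed)
```

Copying the seed before the loop matters: appending to the original list would otherwise lengthen the stored training pattern it came from.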

In [38]:
# pick a random seed
import sys
start = np.random.randint(0, len(dataX)-1)
pattern = list(dataX[start])  # copy, so appending below does not modify dataX

print('Seed: ')
print( "\"", ''.join([int_to_char[value] for value in pattern]), "\"")

# generate characters
for i in range(1000):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose = 0)
    index = np.argmax(prediction)  # greedy choice: most probable character
    result = int_to_char[index]
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1: len(pattern)]  # drop the oldest character
print('\nDone!')

Seed: 
" she went round the court and got behind him, and
very soon found an opportunity of taking it away. s "
he was not in the tooee to tee thet sas the was sored ho the caree in the rabbet sore, and she whit hn was toe tirte to tee the was soenk at inrs an the was sotnd the tas of the couro, and tae toine to be a sore of the toeet an inss to the toeee tf the career, and see woine to betit all the war sot what it was toenking to the toeee tf the caree in the caree in the rabbet sore, and she whit hn was toe tirte to tee the was sor that she was sot thsh the tone, 
'the mase thing so be in toe to tee the gocster saseer ' said the mock turtle. 
'i dane tae thing toe to toen toe toieg to tee thet ' said the mock turtle.

'i dane the mabt thrn saeee'' said the maccit. 
anice was tor tas toe tirtee to tee thet sas thth the mooe, and the was tointing an the woond to tee the was sor that she was sot thsh the tone, 
'the mase thing so be in toe to tee the gocster saseer ' said the mock turt

We can now note some observations about the generated text.

- It generally conforms to the line format observed in the original text of less than 80 characters before a new line.
- The characters are separated into word-like groups and many groups are actual English words (e.g., 'the', 'was', 'said' and 'thing'), but many others are not (e.g., 'tooee', 'toeet' and 'gocster').

The fact that this character based model of the book produces an output like this is pretty impressive. It gives you a sense of the learning capabilities of LSTM networks.

The results are not perfect. Now, we'll look at the techniques to make some improvements by developing a much larger LSTM network.


# Larger LSTM Recurrent Neural Network

We got results, but not excellent results, in the last section. Now, we'll try to improve the quality of the generated text by using a much larger network.

We will keep the number of memory units the same, i.e., 256, but add a second LSTM layer.

In [0]:
model = Sequential()
model.add(LSTM(256, input_shape = (X.shape[1], X.shape[2]), return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
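To get a sense of how much larger this network is, we can estimate the number of trainable weights by hand. The sketch below assumes the usual LSTM parameterization (four sets of gate weights, each with input weights, recurrent weights and a bias) and an output layer of 58 units; `lstm_params` and `dense_params` are hypothetical helpers, and `model.summary()` remains the authoritative count.

```python
def lstm_params(units, input_dim):
    """Trainable weights in a standard LSTM layer with biases (4 gates)."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    """Weights plus biases in a fully connected layer."""
    return inputs * outputs + outputs

vocab_size = 58                                 # vocabulary size reported earlier
small = lstm_params(256, 1) + dense_params(256, vocab_size)
large = lstm_params(256, 1) + lstm_params(256, 256) + dense_params(256, vocab_size)
print(small, large)
```

Under these assumptions the second LSTM layer alone adds about twice as many weights as the whole single-layer model had, since its input is the 256-dimensional output of the first layer rather than a single feature. Dropout layers add no weights.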

In [40]:
# Define the checkpoint
filepath = 'weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor = 'loss', verbose = 1, save_best_only = True, mode = 'min')
callbacks_list = [checkpoint]

# Fit the model
model.fit(X, y, epochs = 50, batch_size = 64, callbacks = callbacks_list)

Epoch 1/50
Epoch 00001: loss improved from inf to 2.81906, saving model to weights-improvement-01-2.8191-bigger.hdf5
Epoch 2/50
Epoch 00002: loss improved from 2.81906 to 2.52433, saving model to weights-improvement-02-2.5243-bigger.hdf5
Epoch 3/50
Epoch 00003: loss improved from 2.52433 to 2.34200, saving model to weights-improvement-03-2.3420-bigger.hdf5
Epoch 4/50
Epoch 00004: loss improved from 2.34200 to 2.20771, saving model to weights-improvement-04-2.2077-bigger.hdf5
Epoch 5/50
Epoch 00005: loss improved from 2.20771 to 2.10220, saving model to weights-improvement-05-2.1022-bigger.hdf5
Epoch 6/50
Epoch 00006: loss improved from 2.10220 to 2.02232, saving model to weights-improvement-06-2.0223-bigger.hdf5
Epoch 7/50
Epoch 00007: loss improved from 2.02232 to 1.95139, saving model to weights-improvement-07-1.9514-bigger.hdf5
Epoch 8/50
Epoch 00008: loss improved from 1.95139 to 1.89716, saving model to weights-improvement-08-1.8972-bigger.hdf5
Epoch 9/50
Epoch 00009: loss improve

<tensorflow.python.keras.callbacks.History at 0x7f3264213470>

Finally, we choose the weights file with the minimum loss and use it to generate text from a random seed (just as in the previous section).

In [46]:
# pick a random seed
import sys
start = np.random.randint(0, len(dataX) - 1)
pattern = list(dataX[start])  # copy, so appending below does not modify dataX
print('Seed: ')
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")

# generate characters
for i in range(1000):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose = 0)
    index = np.argmax(prediction)  # greedy choice: most probable character
    result = int_to_char[index]
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1: len(pattern)]  # drop the oldest character

print('\nDone!')

Seed: 
" eaching it tricks very much, if--if i'd
only been the right size to do it! oh dear! i'd nearly forgo "
ttent the sempe it to and say "with the noce tore.

'what is the dart iist,' said the mock turtle. 
'i dan't she sea, she mock turtle in the dancer!' and the mock turtle said to the thing so a shmed off,

'the mouse sealln,' said the mock turtle. 
'i dan't she sea would be a mine the moose in the dartu,' said alice, ''the mors of menpoesce of the soot of the sea.' 
'i don't know all the boom,' she said to herself, 'it was a linute or two she was so the shme in the distance,

'the dar'' said the ming, 'i must be a better world be a monk all the tort of the sea. aut i must be she shaner to say it is in the mind, and the lors of the shme it was oo thme to oueer the pight with one of the court, and she was not a minute or two of the sooes of the garden with one of the coor, and was going to see it would be a minute or two she was a linute or two of the sooes of the garden with on