# Shakespair

We're going to train a character-level LSTM recurrent neural network to generate Shakespearean text. We'll need some libraries:

In [13]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM, Dropout
from keras.callbacks import LambdaCallback

Next, we'll need some data. I prepared it with a few simple command-line tools; there's a Makefile for it. If we were building this for production and had to deal with changing input files, I'd definitely want to bring the data cleaning into Python and test it. As it is, we can take the data file as given.

In [1]:
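# Read the prepared file, one sentence per line; the with-block closes the file afterwards.
with open('../data/sentences.txt') as f:
    sentences = f.read().split('\n')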

We need to turn this list of sentences into suitable training data for a character LSTM. Our input will be a sequence of `n_char` characters, and the corresponding output will be the next character. To handle inputs shorter than `n_char` characters, we'll front pad the sequences with a `$` symbol, since this isn't used in the text.

In [2]:
n_char = 40

In [3]:
padded_sentences = [n_char * '$' + s for s in sentences]

In [4]:
input_sentences = [s[i : i + n_char]
                   for s in padded_sentences
                   for i in range(n_char)]

# Slice rather than index: s[i + n_char] might not exist, in which case the slice yields ''.
output_characters = [s[i + n_char : i + n_char + 1]
                     for s in padded_sentences
                     for i in range(n_char)]
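
To make the windowing concrete, here is a small sketch that applies the same padding and slicing to two toy sentences with a much shorter window. The helper name `make_windows` and the toy sentences are made up for illustration; the slicing mirrors the cells above.

```python
def make_windows(sentences, n_char, pad='$'):
    """Return (inputs, targets) built the same way as the cells above."""
    padded = [n_char * pad + s for s in sentences]
    # First n_char windows of each padded sentence.
    inputs = [s[i : i + n_char]
              for s in padded
              for i in range(n_char)]
    # The character following each window; '' if the sentence has run out.
    targets = [s[i + n_char : i + n_char + 1]
               for s in padded
               for i in range(n_char)]
    return inputs, targets


inputs, targets = make_windows(["to be", "or"], n_char=3)
for window, target in zip(inputs, targets):
    print(repr(window), '->', repr(target))
# '$$$' -> 't'
# '$$t' -> 'o'
# '$to' -> ' '
# '$$$' -> 'o'
# '$$o' -> 'r'
# '$or' -> ''
```

Two things are visible here: each sentence contributes only its first `n_char` windows, and a sentence shorter than `n_char` produces an empty target for its last window, which is the case the slice is there to handle.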

Keras LSTM layers expect a 3-dimensional input array, with the shape:

`(number of training examples, number of steps in LSTM, length of feature vector)`

We need to vectorize and binarize our data and transform it to this shape.

In [5]:
character_set = sorted(set(''.join(padded_sentences)))

In [6]:
print(character_set)

[' ', '!', '$', "'", ',', '-', '.', ':', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [7]:
char_to_idx = {c:i for i,c in enumerate(character_set)}
idx_to_char = {i:c for i,c in enumerate(character_set)}

In [10]:
# One-hot containers. The np.bool alias was removed in NumPy 1.24, so use the builtin bool.
X = np.zeros((len(input_sentences),
              n_char,
              len(character_set)),
             dtype=bool)

y = np.zeros((len(input_sentences), len(character_set)),
             dtype=bool)
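
The arrays above start out as all zeros, so they still need to be filled in. A minimal sketch of one way to do that with a one-hot encoding follows, run on toy data so it stands alone. The toy `input_sentences` and `output_characters` are made up for illustration, and leaving a row of `y` at zero when the target is empty is an assumption, not something the notebook specifies (filtering such pairs out would be another option).

```python
import numpy as np

# Toy stand-ins for the lists built earlier (n_char = 3).
n_char = 3
input_sentences = ['$$$', '$$t', '$to', '$$$', '$$o', '$or']
output_characters = ['t', 'o', ' ', 'o', 'r', '']

character_set = sorted(set(''.join(input_sentences + output_characters)))
char_to_idx = {c: i for i, c in enumerate(character_set)}

X = np.zeros((len(input_sentences), n_char, len(character_set)), dtype=bool)
y = np.zeros((len(input_sentences), len(character_set)), dtype=bool)

for n, (window, target) in enumerate(zip(input_sentences, output_characters)):
    # One True per time step, at the index of that step's character.
    for t, char in enumerate(window):
        X[n, t, char_to_idx[char]] = True
    # Assumption: an empty target leaves its row of y all zeros.
    if target:
        y[n, char_to_idx[target]] = True

print(X.shape)  # (examples, steps, features)
print(y.shape)  # (examples, features)
```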

We define an LSTM-flavoured neural network. The parameters below were arrived at after some experimentation as giving reasonable results. In any case, the naturalness of a generative model's output is typically better judged by a human than by a loss function.

In [12]:
model = Sequential([
    LSTM(256, return_sequences=True,
         input_shape=(n_char, len(character_set))),
    Dropout(0.2),
    LSTM(256),
    Dropout(0.2),
    Dense(len(character_set)),
    Activation('softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')

Since the output of a generative model benefits from a human eye, it'd be nice to see how training is progressing while it runs. We can define a callback function to be run at the end of each epoch to show us some generated text. First, we'll have to write the sampling function to actually get the text.

By analogy with Boltzmann sampling, we introduce a "temperature" into the sampling step to control the diversity of the characters generated. See [this paper](https://arxiv.org/pdf/1503.02531.pdf) for some details.
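
A minimal sketch of what temperature sampling could look like, assuming the network's softmax output arrives as a 1-D array of probabilities. The names `apply_temperature` and `sample` are illustrative rather than taken from the notebook. Temperatures below 1 sharpen the distribution towards the most likely character, and temperatures above 1 flatten it towards uniform.

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector by a temperature and renormalise."""
    preds = np.asarray(preds, dtype=np.float64)
    # Work in log space; the small constant avoids log(0).
    logits = np.log(preds + 1e-12) / temperature
    logits -= logits.max()  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def sample(preds, temperature=1.0, rng=None):
    """Draw one character index from the temperature-adjusted distribution."""
    rng = np.random.default_rng() if rng is None else rng
    probs = apply_temperature(preds, temperature)
    return int(rng.choice(len(probs), p=probs))


preds = [0.1, 0.2, 0.7]
print(apply_temperature(preds, 0.1))    # almost all mass on index 2
print(apply_temperature(preds, 100.0))  # close to uniform
print(sample(preds, temperature=0.5))
```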

In [14]:
def on_epoch_end(epoch, _):
    """This function is invoked at the end of each epoch.
       For now it only reports which epoch has just finished;
       printing sample text and writing the model file to disk
       are not implemented in this cell."""
    print()
    print("Finished training epoch %d" % epoch)

In [None]:
model.fit(X, y,
          batch_size=128,
          epochs=200,
          # Keras expects a list of callbacks, even when there is only one.
          callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)])
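
Once a model is trained, generating text amounts to repeatedly predicting the next character and sliding the window along. The sketch below shows one way that loop could look. The function `generate_text` and its `predict_fn` argument are hypothetical: `predict_fn` stands in for something like `lambda x: model.predict(x, verbose=0)`, and the demo uses a dummy uniform predictor so it runs without a trained network.

```python
import numpy as np


def generate_text(predict_fn, character_set, n_char, length=100,
                  seed_text='', temperature=1.0, rng=None):
    """Generate `length` characters, one at a time, from `predict_fn`."""
    rng = np.random.default_rng() if rng is None else rng
    char_to_idx = {c: i for i, c in enumerate(character_set)}
    # Front-pad the seed with '$' so the window is always n_char long.
    window = (n_char * '$' + seed_text)[-n_char:]
    generated = seed_text
    for _ in range(length):
        # One-hot encode the current window as a batch of one example.
        x = np.zeros((1, n_char, len(character_set)), dtype=bool)
        for t, char in enumerate(window):
            x[0, t, char_to_idx[char]] = True
        preds = np.asarray(predict_fn(x)[0], dtype=np.float64)
        # Temperature-adjust the predicted distribution, then sample from it.
        logits = np.log(preds + 1e-12) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        next_char = character_set[rng.choice(len(probs), p=probs)]
        generated += next_char
        window = window[1:] + next_char
    return generated


# Dummy predictor: every character is equally likely (illustration only).
toy_chars = ['$', 'a', 'b']
uniform = lambda x: np.full((1, len(toy_chars)), 1 / len(toy_chars))
print(generate_text(uniform, toy_chars, n_char=4, length=20))
```

A function like this could be called from `on_epoch_end` to print a sample after each epoch. Nothing in the loop stops the network from emitting the `$` padding symbol, so a real version might want to mask or strip it.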