# Shakespair

Like many others, I was duly impressed by Andrej Karpathy's [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). Here, we create a character-level LSTM to model Shakespeare, and use it to build a sentence completer. Credit is due to the aforementioned blog post, and to Robin Sloan's awesome assisted sci-fi writing project [Writing With the Machine](https://www.robinsloan.com/notes/writing-with-the-machine/) for the idea. The implementation closely follows the Keras [LSTM text generation example](https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py), so this shouldn't really be considered novel code.

We're going to train a character LSTM recurrent neural network to generate Shakespearean text. We'll need some libraries:

In [4]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM, Dropout
from keras.callbacks import LambdaCallback

Using TensorFlow backend.


Next, we'll need some data. I prepared it with a few simple command line tools; there's a Makefile for it. If we were building this for production and had to deal with changing input files, I'd definitely want to bring the data cleaning into Python and test it. As it is, we can take the data file as given.

In [2]:
# Read one sentence per line; the with-block closes the file afterwards.
with open('../data/sentences.txt') as f:
    sentences = f.read().split('\n')

We need to turn this list of sentences into suitable training data for a character LSTM. Our input will be a sequence of `n_char` characters, and the corresponding output will be the next character. To handle inputs shorter than `n_char` characters, we'll front pad the sequences with a `$` symbol, since this isn't used in the text.

In [3]:
n_char = 40

In [4]:
padded_sentences = [n_char * '$' + s for s in sentences]

In [5]:
input_sentences = [s[i : i + n_char]
                   for s in padded_sentences
                   for i in range(len(s) - n_char)]

output_characters = [s[i + n_char : i + n_char + 1] # slicing (unlike indexing) never raises an IndexError
                     for s in padded_sentences
                     for i in range(len(s) - n_char)]

In [6]:
assert len(input_sentences) == len(output_characters)
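To make the windowing concrete, here is a small sketch of the same padding-and-slicing step on a toy sentence with a much shorter window. The names `toy_n_char`, `toy_sentence` and `pairs` are made up for this illustration and are not part of the notebook.

```python
# Toy illustration of the padding-and-windowing step (hypothetical values).
toy_n_char = 3
toy_sentence = "to be."

# Front pad so that the very first character also has a full-length window.
padded = toy_n_char * '$' + toy_sentence

# Pair each window of toy_n_char characters with the character that follows it.
pairs = [(padded[i : i + toy_n_char], padded[i + toy_n_char])
         for i in range(len(padded) - toy_n_char)]

for window, target in pairs:
    print(repr(window), '->', repr(target))
```

Every character of the sentence appears exactly once as a target, and the first window consists of padding only, which is what later lets us start generation from an all-`$` seed.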

Keras LSTM layers expect a 3-dimensional input array with the shape:

`(number of training examples, number of steps in LSTM, length of feature vector)`

We need to one-hot encode our characters and arrange the result in this shape.

In [7]:
character_set = sorted(set(''.join(padded_sentences)))

In [8]:
print(character_set)

[' ', '!', '$', "'", ',', '-', '.', ':', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [9]:
# These are for convenience.
# We'll need to convert characters to indices and vice versa later.
char_to_idx = {c:i for i,c in enumerate(character_set)}
idx_to_char = {i:c for i,c in enumerate(character_set)}

In [10]:
# Initialize zero arrays for vectorized input and output.
# The builtin bool is used as dtype: the np.bool alias was removed in NumPy 1.24.
X = np.zeros((len(input_sentences),
              n_char,
              len(character_set)),
             dtype=bool)

y = np.zeros((len(input_sentences), len(character_set)),
             dtype=bool)

# Fill the relevant indices with ones.
# This routine will take a few seconds to run.
for i, sentence in enumerate(input_sentences):
    for t, character in enumerate(sentence):
        X[i, t, char_to_idx[character]] = 1
    y[i, char_to_idx[output_characters[i]]] = 1

In [11]:
assert X.shape[0] == y.shape[0]
assert X.shape[2] == y.shape[1]
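As a sanity check on what the arrays contain, here is the same encoding applied to a made-up three-character alphabet. All of the `toy_*` names are hypothetical and exist only for this sketch.

```python
import numpy as np

# Hypothetical three-character alphabet, sorted the same way as character_set.
toy_set = sorted(set('$ab'))
toy_char_to_idx = {c: i for i, c in enumerate(toy_set)}

toy_inputs = ['$a', 'ab']   # two windows of length 2
toy_outputs = ['b', 'a']    # the character that follows each window

toy_X = np.zeros((len(toy_inputs), 2, len(toy_set)), dtype=bool)
toy_y = np.zeros((len(toy_inputs), len(toy_set)), dtype=bool)

for i, window in enumerate(toy_inputs):
    for t, character in enumerate(window):
        toy_X[i, t, toy_char_to_idx[character]] = True
    toy_y[i, toy_char_to_idx[toy_outputs[i]]] = True

print(toy_X.shape)           # (examples, steps, alphabet size)
print(toy_X[0].astype(int))  # one row per time step, one column per character
print(toy_y.astype(int))     # one row per example
```

Each time step of each example has exactly one `True` entry, in the column of the character at that position.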

We define an LSTM-flavoured neural network. The architecture is slightly simpler than the one in Andrej Karpathy's blog post, with some dropout added because it seemed reasonable after some experimentation. Keeping the network small was largely a way to reduce training time: we could certainly increase the capacity of the model, and would probably get better results for it. In any case, the naturalness of a generative model's output is typically better judged by a human than by a loss function.

In [23]:
model = Sequential([
    LSTM(256, return_sequences=True,
         input_shape=(n_char, len(character_set))),
    Dropout(0.2),
    LSTM(256),
    Dropout(0.2),
    Dense(len(character_set)),
    Activation('softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
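To put a rough number on the capacity of this network, the sketch below counts trainable parameters by hand. It assumes the standard LSTM parameterization with biases (four gates, each with input weights, recurrent weights and a bias vector) and the 35-symbol alphabet printed earlier; the helper names are made up for this illustration, and the result is worth comparing against `model.summary()`.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer with biases:
       four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units

n_symbols = 35   # assumed: the length of the character_set printed above
units = 256

# Dropout and Activation layers have no trainable parameters.
total = (lstm_params(n_symbols, units)     # first LSTM reads one-hot characters
         + lstm_params(units, units)       # second LSTM reads the first one's output
         + dense_params(units, n_symbols)) # projection back onto the alphabet
print(total)
```

Under these assumptions the count comes to a little over 800,000 parameters, most of them in the second LSTM layer.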

Since generated content benefits from a human eye, it'd be nice to see how training is progressing while it runs. We can define a callback function that runs at the end of each epoch and shows us some generated text. First, we'll have to write the sampling function that actually picks each character.

By analogy with Boltzmann sampling, we introduce a "temperature" to control the diversity of the characters generated: the network's log-probabilities are divided by the temperature before being renormalized. See [this paper](https://arxiv.org/pdf/1503.02531.pdf) for some details.

In [35]:
def sample(predictions, temperature=1.0):
    """Sample a character index from the output layer of the network.
       Generates more diverse output for higher values of temperature."""
    # clip tiny probabilities so that np.log never sees zero
    p = np.clip(np.asarray(predictions).astype('float64'), 1e-10, None)
    p = np.log(p) / temperature
    p = np.exp(p) / np.sum(np.exp(p))
    probs = np.random.multinomial(1, p)
    return np.argmax(probs)
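The effect of the temperature is easiest to see on a small, made-up distribution. The sketch below applies the same rescaling without the sampling step; `rescale` and `toy_predictions` are illustrative names, not part of the notebook.

```python
import numpy as np

def rescale(predictions, temperature):
    """Divide log-probabilities by the temperature and renormalize."""
    p = np.clip(np.asarray(predictions, dtype='float64'), 1e-10, None)
    logits = np.log(p) / temperature
    weights = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return weights / weights.sum()

toy_predictions = [0.7, 0.2, 0.1]  # made-up network output

for temperature in (0.5, 1.0, 2.0):
    print(temperature, np.round(rescale(toy_predictions, temperature), 3))
```

A temperature below 1 sharpens the distribution towards the most likely character, a temperature of 1 leaves it unchanged, and a temperature above 1 flattens it, so rarer characters get sampled more often.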

In [107]:
def on_epoch_end(epoch, _):
    """This function is invoked at the end of each epoch.
       It prints some sample text generated by the network at
       its current epoch, and writes the resulting model file
       to disk."""
    print()
    print("Finished training epoch %d" % epoch)
    
    generated = ''
    
    # seed the sentence with padding
    seed = n_char * '$'
    generated += seed
    last_char = generated[-1:]
    
    # define characters that end sentences
    stop_chars = ['.','?','!']
    while last_char not in stop_chars:
        x_test = np.zeros((1, n_char, len(character_set)))
        for t, character in enumerate(seed):
            x_test[0, t, char_to_idx[character]] = 1
        
        out = model.predict(x_test)[0]
        pred_idx = sample(out)
        pred_char = idx_to_char[pred_idx]
        
        generated += pred_char
        seed = generated[-n_char:]
        last_char = generated[-1:]
        
    # remove padding characters from generated text
    print(generated.replace('$',''))
    # checkpoint the model after every epoch
    model.save('../data/epoch_' + str(epoch) + '.h5')

Time to fit the model. This is costly in both time and, if you're using a paid-for cloud GPU, money. We've set a large batch size to speed up each epoch, though you would probably get better results with smaller batches, which may also need fewer epochs to reach good results. I have not experimented much here.

In [None]:
model.fit(X, y,
          batch_size=2048,
          epochs=60,
          callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)])
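Once a model has been trained, the sentence completer mentioned at the top can be sketched along the same lines as `on_epoch_end`, seeding the window with the user's text instead of pure padding. This is a minimal sketch, not the notebook's own implementation: `complete_sentence`, `predict_fn` and `max_chars` are hypothetical names, the prefix is assumed to contain only characters from the training alphabet once lowercased, and a stand-in predictor replaces the trained network so that the example runs on its own.

```python
import numpy as np

def complete_sentence(prefix, predict_fn, char_to_idx, idx_to_char,
                      n_char=40, temperature=1.0, max_chars=200):
    """Extend `prefix` one sampled character at a time until a sentence ends.

    `predict_fn` is an assumed wrapper mapping a one-hot array of shape
    (1, n_char, alphabet size) to a probability vector, for example
    `lambda x: model.predict(x)[0]` for the trained model."""
    stop_chars = {'.', '?', '!'}
    # Front pad exactly as in training; only the last n_char characters are used.
    generated = n_char * '$' + prefix.lower()
    for _ in range(max_chars):  # cap the length in case no stop character appears
        window = generated[-n_char:]
        x = np.zeros((1, n_char, len(char_to_idx)))
        for t, character in enumerate(window):
            x[0, t, char_to_idx[character]] = 1
        # Same temperature rescaling as in the sampling function.
        p = np.clip(np.asarray(predict_fn(x), dtype='float64'), 1e-10, None)
        p = np.exp(np.log(p) / temperature)
        p = p / p.sum()
        next_char = idx_to_char[np.random.choice(len(p), p=p)]
        generated += next_char
        if next_char in stop_chars:
            break
    return generated.replace('$', '')

# Stand-in alphabet and predictor so the sketch runs without a trained model.
toy_set = sorted(set('$ .abcdefghijklmnopqrstuvwxyz'))
toy_c2i = {c: i for i, c in enumerate(toy_set)}
toy_i2c = {i: c for i, c in enumerate(toy_set)}

def always_period(x):
    """Fake network output that puts all probability on the full stop."""
    p = np.zeros(len(toy_set))
    p[toy_c2i['.']] = 1.0
    return p

print(complete_sentence('To be', always_period, toy_c2i, toy_i2c))
```

With the real model, `predict_fn` would wrap `model.predict`, and `char_to_idx`, `idx_to_char` and `n_char` would be the ones defined earlier. The `max_chars` cap is a design choice of this sketch: it keeps an undertrained model that rarely emits a stop character from generating indefinitely.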