In [2]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Load the text and clean the data (capitalization, unnecessary characters). Here the text is only converted to lowercase.

In [5]:
filepath = "dataset/wows-script.txt"
with open(filepath, "r") as f:
    raw_text = f.read().lower()
print("Text has {} characters".format(len(raw_text)))

Text has 128743 characters


Characters must be converted to integers so that the neural network can work with the data.

Create a vocabulary: a mapping from each unique character to an integer.

In [12]:
characters = sorted(list(set(raw_text))) # sorted list of unique chars
char_to_int = {c: i for i, c in enumerate(characters)} # character -> index in the sorted list (starts at 0)
print("Number of unique mappings(characters): {}".format(len(char_to_int)))

Number of unique mappings(characters): 52
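To make the mapping concrete, here is a minimal sketch on a toy string. The string `"hello world"` and the reverse dictionary `int_to_char` are illustrative assumptions, not part of the original notebook; a reverse mapping of this kind is what would be needed to turn predicted integers back into text.

```python
# Toy text standing in for raw_text (assumption for illustration only)
text = "hello world"

characters = sorted(set(text))                           # unique characters, sorted
char_to_int = {c: i for i, c in enumerate(characters)}   # character -> integer
int_to_char = {i: c for c, i in char_to_int.items()}     # integer -> character (hypothetical helper)

encoded = [char_to_int[c] for c in text]                 # text as a list of integers
decoded = "".join(int_to_char[i] for i in encoded)       # and back to text

print(characters)
print(encoded)
print(decoded)
```

Encoding followed by decoding returns the original string, which is a quick way to check that the vocabulary is consistent.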


Break up the text into sequences of 100 characters.

Other options include splitting the text by sentence and padding shorter sentences or truncating longer ones.

During training, each sample consists of 100 time steps (one character per time step) and produces a single output: the next character. The window moves along the text one character at a time, so each character is learned from the 100 characters that precede it (except the first 100 characters, which have no full-length context).

E.g. with seq_length = 4:
Hell -> o (the input "Hell" is used to predict the next character of "Hello")
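The sliding window can be sketched on a toy string with a short window. The text `"hello world"` and the name `pairs` are assumptions chosen for illustration.

```python
# Toy example of the sliding window with a short sequence length
text = "hello world"   # stand-in for the real script (assumption)
seq_length = 4

pairs = []
for i in range(len(text) - seq_length):
    seq_in = text[i:i + seq_length]    # input window of seq_length characters
    seq_out = text[i + seq_length]     # the single character that follows it
    pairs.append((seq_in, seq_out))

for seq_in, seq_out in pairs:
    print(repr(seq_in), "->", repr(seq_out))
```

A text of N characters yields N - seq_length input/output pairs, which is consistent with 128743 characters producing 128643 patterns for a window of 100.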

Now convert the data set from characters to integer representations

In [13]:
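seq_length = 100
data_x = []  # input sequences, each a list of seq_length integers
data_y = []  # targets: the integer of the character following each sequence
for i in range(len(raw_text) - seq_length):
    seq_in = raw_text[i:i + seq_length]  # x: window of seq_length characters
    seq_out = raw_text[i + seq_length]   # y: the character right after the window
    data_x.append([char_to_int[char] for char in seq_in])
    data_y.append(char_to_int[seq_out])
n_patterns = len(data_x)
print("Total patterns {}".format(n_patterns))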

Total patterns 128643


First, we must transform the list of input sequences into the form [samples, time steps, features] expected by an LSTM network.

Next, we rescale the integers to the range 0 to 1, which makes the patterns easier to learn for the LSTM network, whose gates use sigmoid-type activations by default.

Finally, we convert the output patterns (single characters converted to integers) into a one-hot encoding. This lets us configure the network to predict the probability of each of the 52 different characters in the vocabulary (an easier representation) rather than forcing it to predict the next character's integer value directly.

In [14]:
X = np.reshape(data_x, (n_patterns, seq_length, 1)) # reshape to  [samples, time steps, features]
X = X / float(len(char_to_int))  # normalize values
y = np_utils.to_categorical(data_y)  # one hot encode the output variable
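The three steps can be checked on toy data using NumPy alone. This is a sketch under assumptions: the sequences, the vocabulary size `n_vocab = 8`, and the use of `np.eye` as a stand-in for the one-hot encoding are all illustrative choices, not the notebook's own code.

```python
import numpy as np

# Two toy input sequences of length 4 and their targets (assumed values)
data_x = [[3, 2, 4, 4], [2, 4, 4, 5]]
data_y = [5, 0]
n_vocab = 8            # assumed vocabulary size for this toy example
seq_length = 4

# 1. Reshape to [samples, time steps, features]
X = np.reshape(data_x, (len(data_x), seq_length, 1))

# 2. Normalize the integer codes into the range [0, 1)
X = X / float(n_vocab)

# 3. One-hot encode the targets: row k of the identity matrix encodes class k
y = np.eye(n_vocab)[data_y]

print(X.shape, y.shape)
print(y[0])
```

One difference worth knowing: `np.eye(n_vocab)` fixes the number of columns explicitly, whereas `to_categorical` without a class-count argument infers it from the largest label present in the data.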

### Define network architecture

- Single hidden LSTM layer with 256 units
- Dropout with 20% probability
- Output layer is Dense using softmax activation to output a probability prediction between 0 and 1 for each of the 52 characters

*The problem is really a single-character classification problem with 52 classes, and as such is defined as optimizing the log loss (cross entropy), here using the Adam optimization algorithm for speed.*
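As a rough reference point for reading cross-entropy values, the loss of a model that assigns equal probability to every character can be computed directly. This is a sketch assuming the 52-character vocabulary reported earlier and the natural logarithm; the names `n_classes` and `uniform_loss` are illustrative.

```python
import math

n_classes = 52  # vocabulary size reported when the mapping was built

# Cross entropy of a uniform prediction: -log(1 / n_classes) = log(n_classes)
uniform_loss = -math.log(1.0 / n_classes)

print("Loss of a uniform guess: {:.4f}".format(uniform_loss))
```

A training loss below this value indicates the network is doing better than guessing all characters with equal probability.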

In [16]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_2 (LSTM)                (None, 256)               264192    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 52)                13364     
=================================================================
Total params: 277,556
Trainable params: 277,556
Non-trainable params: 0
_________________________________________________________________
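The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four weight sets (three gates plus the cell candidate), each with input weights, recurrent weights, and a bias; the variable names are illustrative.

```python
units = 256      # LSTM hidden units
input_dim = 1    # one feature per time step
n_classes = 52   # size of the softmax output

# Each of the 4 weight sets has units * (input_dim + units) weights plus units biases
lstm_params = 4 * (units * (units + input_dim) + units)

# Dense layer: one weight per (hidden unit, class) pair plus one bias per class
dense_params = units * n_classes + n_classes

total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)
```

Dropout has no trainable weights, so it contributes 0 to the total.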


We are not interested in the most accurate model (classification accuracy). We are looking for a model that generalizes the dataset by minimizing the loss, seeking a balance between generalization and overfitting that stops short of memorization.

Training can be slow, so we use a model checkpoint to save the weights whenever the loss improves.

In [17]:
checkpoint_filepath = "checkpoints/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(checkpoint_filepath, monitor="loss", verbose=1, save_best_only=True)
callbacks_list = [checkpoint]
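The placeholders in the checkpoint path are ordinary Python format fields that get filled in with the epoch number and the logged loss. A minimal sketch of that substitution, using plain `str.format` and an example loss value assumed for illustration:

```python
template = "checkpoints/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# epoch is zero-padded to two digits, loss is rounded to four decimals
filename = template.format(epoch=1, loss=3.05693)

print(filename)
```

Because the loss is part of the file name, the checkpoint with the lowest loss can be identified from a directory listing alone.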

### Training

In [19]:
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20

Epoch 00001: loss improved from inf to 3.05693, saving model to checkpoints/weights-imporvement-01-3.0569.hdf5
Epoch 2/20

Epoch 00002: loss improved from 3.05693 to 2.86785, saving model to checkpoints/weights-imporvement-02-2.8678.hdf5
Epoch 3/20

Epoch 00003: loss improved from 2.86785 to 2.78746, saving model to checkpoints/weights-imporvement-03-2.7875.hdf5
Epoch 4/20

Epoch 00004: loss improved from 2.78746 to 2.72895, saving model to checkpoints/weights-imporvement-04-2.7289.hdf5
Epoch 5/20

Epoch 00005: loss improved from 2.72895 to 2.67554, saving model to checkpoints/weights-imporvement-05-2.6755.hdf5
Epoch 6/20

Epoch 00006: loss improved from 2.67554 to 2.62454, saving model to checkpoints/weights-imporvement-06-2.6245.hdf5
Epoch 7/20

Epoch 00007: loss improved from 2.62454 to 2.57822, saving model to checkpoints/weights-imporvement-07-2.5782.hdf5
Epoch 8/20

Epoch 00008: loss improved from 2.57822 to 2.53358, saving model to checkpoints/weights-imporvement-08-

KeyboardInterrupt: 