This notebook trains a couple of RNN-based character-level models that predict the next character from the previous ones.

In [1]:
import numpy as np

from keras import layers
from keras import models
from keras import optimizers
# from keras import applications
from keras.utils import data_utils
from keras.preprocessing import sequence

Using TensorFlow backend.


In [2]:
path = data_utils.get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read()

In [3]:
print('corpus length:', len(text))

corpus length: 600893


In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars) + 1
chars.insert(0, "\0")

In [5]:
''.join(chars[:])

'\x00\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyzÆäæéë'

To train the models, we first have to encode the characters as integer indices.

In [6]:
char2idx = dict((c, i) for i, c in enumerate(chars))
idx2char = dict((i, c) for i, c in enumerate(chars))

Encode the whole text as indices. This is the actual data we train on.

In [7]:
idx = [char2idx[c] for c in text]

In [8]:
idx[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [9]:
idx[10:20]

[43, 45, 40, 40, 39, 43, 33, 38, 31, 2]

In [10]:
def ids2text(ids):
    return ''.join(idx2char[i] for i in ids)

In [11]:
ids2text(idx[:80])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not ground\nfor su'
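As a quick sanity check, the same encoding can be reproduced on a tiny string and decoded back. This is only a sketch: `toy_text` and all `toy_*` names are made up for illustration and are not part of the notebook.

```python
# Toy version of the encoding above; `toy_text` and the `toy_*` names are illustrative only.
toy_text = "hello world"

toy_chars = sorted(set(toy_text))
toy_chars.insert(0, "\0")  # extra "\0" entry at index 0, mirroring the notebook

toy_char2idx = {c: i for i, c in enumerate(toy_chars)}
toy_idx2char = {i: c for i, c in enumerate(toy_chars)}

# Encode to indices, then decode back to text
toy_idx = [toy_char2idx[c] for c in toy_text]
decoded = ''.join(toy_idx2char[i] for i in toy_idx)

print(toy_idx)
print(decoded == toy_text)
```

If the round trip returns the original string, the two dictionaries are consistent inverses of each other.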

## LSTM

The training data is built from fixed-length windows: each contiguous string of 40 characters becomes one row of training data, and its target is the same window shifted one character ahead.

In [12]:
maxlen = 40
sentences = []
next_chars = []

In [13]:
for i in range(0, len(idx) - maxlen+1):
    sentences.append(idx[i: i + maxlen])
    next_chars.append(idx[i+1: i+maxlen+1])

In [14]:
# Drop the last two windows (the final target is one element short) and stack the rest into 2-D arrays
sentences = np.array(sentences[:-2])
next_chars = np.array(next_chars[:-2])

In [15]:
sentences.shape

(600852, 40)

In [16]:
sentences

array([[40, 42, 29, ..., 68, 66, 54],
       [42, 29, 30, ..., 66, 54, 67],
       [29, 30, 25, ..., 54, 67,  9],
       ..., 
       [73, 62, 54, ..., 74, 65, 67],
       [62, 54, 67, ..., 65, 67, 58],
       [54, 67,  2, ..., 67, 58, 72]])

In [17]:
ids2text(sentences[0]), ids2text(next_chars[0])

('PREFACE\n\n\nSUPPOSING that Truth is a woma',
 'REFACE\n\n\nSUPPOSING that Truth is a woman')

In [18]:
ids2text(sentences[1]), ids2text(next_chars[1])

('REFACE\n\n\nSUPPOSING that Truth is a woman',
 'EFACE\n\n\nSUPPOSING that Truth is a woman-')
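The relationship between an input row and its target row is easier to see on a toy sequence. The sketch below uses made-up values (`toy_idx`, `toy_maxlen`) and, unlike the notebook, only generates windows whose shifted target is full length instead of trimming rows afterwards.

```python
import numpy as np

# Illustrative values: a short index sequence and a small window length.
toy_idx = list(range(10))   # stands in for `idx`
toy_maxlen = 4              # stands in for `maxlen`

# Only keep windows whose shifted target is also full length.
n_windows = len(toy_idx) - toy_maxlen
toy_sentences = np.array([toy_idx[i: i + toy_maxlen] for i in range(n_windows)])
toy_next = np.array([toy_idx[i + 1: i + toy_maxlen + 1] for i in range(n_windows)])

print(toy_sentences.shape, toy_next.shape)
# Each target row is the input row shifted left by one position.
print(np.array_equal(toy_sentences[:, 1:], toy_next[:, :-1]))
```

Because consecutive windows overlap in all but one character, the data is highly redundant, but it gives the model a prediction target at every position of every window.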

In [19]:
n_fac = 24

In [20]:
inp = layers.Input((maxlen,))

In [21]:
x = layers.Embedding(vocab_size, n_fac)(inp)
x = layers.LSTM(512, input_shape=(n_fac,), return_sequences=True, dropout=0.2, recurrent_dropout=0.2, implementation=2)(x)
x = layers.Dropout(0.2)(x)
x = layers.LSTM(512, return_sequences=True, dropout=0.2, recurrent_dropout=0.2, implementation=2)(x)
x = layers.Dropout(0.2)(x)
x = layers.TimeDistributed(layers.Dense(vocab_size, activation='softmax'))(x)

In [22]:
model = models.Model(inp, x)

In [23]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 40)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 40, 24)            2040      
_________________________________________________________________
lstm_1 (LSTM)                (None, 40, 512)           1099776   
_________________________________________________________________
dropout_1 (Dropout)          (None, 40, 512)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 40, 512)           2099200   
_________________________________________________________________
dropout_2 (Dropout)          (None, 40, 512)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 40, 85)            43605     
Total para
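The numbers in the `Param #` column can be reproduced by hand from the layer sizes. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and takes `vocab_size = 85` from the output shape in the summary; `lstm_params` is a helper written for this illustration.

```python
# Reproduce the Param # column from the layer sizes used above.
vocab_size, n_fac, units = 85, 24, 512

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

embedding = vocab_size * n_fac            # one n_fac-dimensional vector per character
lstm_1 = lstm_params(n_fac, units)        # input is the embedding
lstm_2 = lstm_params(units, units)        # input is the first LSTM's output
dense = units * vocab_size + vocab_size   # weights plus biases, shared across timesteps

total = embedding + lstm_1 + lstm_2 + dense
print(embedding, lstm_1, lstm_2, dense)
print(total)
```

Almost all of the parameters sit in the two LSTM layers; the embedding and the output layer are small by comparison.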

In [24]:
model.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy')

In [None]:
model.fit(sentences, np.expand_dims(next_chars, -1), batch_size=64, epochs=1)

Training for one epoch takes a while (about 20 minutes), and the loss ended up at about 1.6.
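To put that loss in context, `sparse_categorical_crossentropy` is the mean negative log-probability the model assigns to the true next character. The sketch below recomputes it with plain NumPy on made-up data (`y_true`, `y_uniform` and the helper function are illustrative, not Keras code). A model that guesses uniformly over 85 characters would score ln(85), roughly 4.44, so 1.6 is a clear improvement. The `np.expand_dims(next_chars, -1)` in the `fit` call only adds a trailing axis of size 1 to the integer targets, presumably because that is the target shape this Keras version expects for sequence outputs.

```python
import numpy as np

vocab_size = 85  # matches the last dimension of the model output above

def sparse_categorical_crossentropy(y_true, y_pred):
    """Mean negative log-probability assigned to the true class.

    y_true: integer array of shape (batch, timesteps)
    y_pred: probabilities of shape (batch, timesteps, vocab_size)
    """
    picked = np.take_along_axis(y_pred, y_true[..., np.newaxis], axis=-1)
    return float(-np.log(picked).mean())

# Illustrative data: 2 sequences of 5 steps, uniform predictions.
y_true = np.random.randint(0, vocab_size, size=(2, 5))
y_uniform = np.full((2, 5, vocab_size), 1.0 / vocab_size)

baseline = sparse_categorical_crossentropy(y_true, y_uniform)
print(baseline, np.log(vocab_size))
```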

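At every position the model predicts the next character; for generation we only use the prediction at the last of the 40 positions. We can take a seed text of 40 characters, sample the next character, append it, and then repeat the prediction on the last 40 characters of the extended text.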

In [31]:
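def get_example(seed_string="ethics is a basic foundation of all that", n=320):
    """Extend seed_string by n sampled characters."""
    for _ in range(n):
        # The model input is always the last `maxlen` characters generated so far
        last_text = seed_string[-maxlen:]
        ids = np.array([char2idx[c] for c in last_text])[np.newaxis, :]
        # Keep only the predicted distribution for the last timestep
        preds = model.predict(ids, verbose=0)[0][-1]
        # Renormalize in float64 so np.random.choice accepts the probabilities
        preds = preds.astype('float64')
        preds = preds / np.sum(preds)
        next_char = np.random.choice(chars, p=preds)
        seed_string = seed_string + next_char
    return seed_string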

In [32]:
get_example()

'ethics is a basic foundation of all that seems to divine whom it falsified reaplation of questionable one must object as\nhe is a false form of word to one\'s not find the democratic sort of society. And consider\nthe forms of all their good spirits?. "What is stronger, and his standpoints of good and\nrepugnance? But there is not the LAHIERS, in all his counter'

This is kind of nice: the model learned to create actual words. They don't make much sense, and there are some `\n` where there shouldn't be, but wow.
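The sampling step inside `get_example` can also be looked at in isolation. The sketch below uses a made-up five-character vocabulary and a hand-written distribution `fake_preds` standing in for `model.predict(ids)[0][-1]`, and compares sampling with always picking the most likely character (greedy argmax, which the notebook does not use).

```python
import numpy as np

chars = list("abcde")  # illustrative 5-character vocabulary, not the notebook's
rng = np.random.RandomState(0)

# Stand-in for model.predict(ids)[0][-1]: float32 probabilities for the last timestep.
fake_preds = np.array([0.1, 0.5, 0.2, 0.15, 0.05], dtype="float32")

# Renormalize in float64 so the probabilities sum to 1 within NumPy's tolerance.
p = fake_preds.astype("float64")
p /= p.sum()

sampled = [rng.choice(chars, p=p) for _ in range(1000)]
greedy = chars[int(np.argmax(p))]

# Sampling follows the distribution; argmax always returns the same character.
counts = {c: sampled.count(c) for c in chars}
print(counts)
print(greedy)
```

Sampling keeps some randomness in the generated text, which is one reason the output above contains both plausible words and the occasional odd token.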