This notebook trains a couple of RNN-based character-level models that predict the next character from the preceding ones.

In [1]:
import numpy as np

from keras import layers
from keras import models
from keras import optimizers
# from keras import applications
from keras.utils import data_utils
from keras.preprocessing import sequence

Using TensorFlow backend.


In [2]:
path = data_utils.get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
# Read the corpus as UTF-8 and close the file when done
with open(path, encoding='utf-8') as f:
    text = f.read()

In [3]:
print('corpus length:', len(text))

corpus length: 600893


In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars) + 1
chars.insert(0, "\0")

In [5]:
''.join(chars[:])

'\x00\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyzÆäæéë'

To train the models, we first have to encode characters as integer indices.

In [6]:
char2idx = {c: i for i, c in enumerate(chars)}
idx2char = {i: c for i, c in enumerate(chars)}
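As a small illustration of the two lookup tables, the sketch below builds them for a toy string and checks that encoding followed by decoding returns the original text. The names `toy_text`, `toy_chars`, `toy_char2idx` and `toy_idx2char` are made up for this example and are not part of the notebook.

```python
# Toy corpus; the notebook uses the full Nietzsche text instead.
toy_text = "hello world"

# Same construction as in the notebook: sorted unique characters,
# with "\0" reserved at index 0.
toy_chars = sorted(set(toy_text))
toy_chars.insert(0, "\0")

toy_char2idx = {c: i for i, c in enumerate(toy_chars)}
toy_idx2char = {i: c for i, c in enumerate(toy_chars)}

# Encode to indices, then decode back to text.
encoded = [toy_char2idx[c] for c in toy_text]
decoded = "".join(toy_idx2char[i] for i in encoded)

print(encoded)
print(decoded)
```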

Encode the whole text as indices. This is the actual data we use for training.

In [7]:
idx = [char2idx[c] for c in text]

In [8]:
idx[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [9]:
idx[10:20]

[43, 45, 40, 40, 39, 43, 33, 38, 31, 2]

In [10]:
def ids2text(ids):
    return ''.join(idx2char[i] for i in ids)

In [11]:
ids2text(idx[:80])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not ground\nfor su'

## 10 char model

We want to train a model that predicts the 11th character from the preceding 10.

The model takes 10 inputs, one per character position. Each row below is one input array, and each column is a series of 10 consecutive characters from the text:

```
[char1 char11 char21 ...]
[char2 char12 char22 ...]
...
[char9 char19 char29 ...]
[char10 char20 char30 ...]
```

The output is the character that follows each series:

```
[char11 char21 char31 ...]
```
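The layout above can be sketched on a toy index sequence with a window of 3 instead of 10. This is only an illustration: `toy_idx`, `toy_cs`, `toy_in` and `toy_out` are hypothetical names, and the range bound is simplified compared with the cells that follow.

```python
import numpy as np

# Toy index sequence standing in for `idx`, with a window of 3 instead of 10.
toy_idx = list(range(100, 114))  # 14 "characters"
toy_cs = 3

# Start position of every non-overlapping window that still has a next element.
starts = range(0, len(toy_idx) - toy_cs, toy_cs)

# One array per input position n: every toy_cs-th element, offset by n.
toy_in = [np.array([toy_idx[i + n] for i in starts]) for n in range(toy_cs)]

# Target: the element right after each window.
toy_out = np.array([toy_idx[i + toy_cs] for i in starts])

# Print each window (one column of toy_in) next to its target.
for k in range(len(toy_out)):
    print([int(a[k]) for a in toy_in], '->', int(toy_out[k]))
```

Each target is also the first element of the next window, which is why the windows can be read back as continuous text.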

In [12]:
cs = 10

In [13]:
c_in = [[idx[i+n] for i in range(0, len(idx)-cs-1, cs)] for n in range(cs)]

In [14]:
xs = [np.stack(c[:-2]) for c in c_in]

In [44]:
len(xs), xs[0].shape

(10, (60087,))

In [15]:
xs

[array([40, 43, 73, ..., 73, 62, 54]),
 array([42, 45, 61, ..., 61, 54, 72]),
 array([29, 40, 54, ..., 58, 67,  2]),
 array([30, 40, 73, ...,  1,  2, 73]),
 array([25, 39,  2, ..., 56, 76, 61]),
 array([27, 43, 44, ..., 61, 68, 58]),
 array([29, 33, 71, ..., 71, 71,  2]),
 array([ 1, 38, 74, ..., 62, 65, 62]),
 array([ 1, 31, 73, ..., 72, 57, 67]),
 array([ 1,  2, 61, ..., 73,  2, 57])]

In [16]:
c_out = [idx[i+cs] for i in range(0, len(idx)-cs-1, cs)]

In [17]:
y = np.stack(c_out[:-2])

In [18]:
y

array([43, 73,  2, ..., 62, 54, 62])

For example, sequence 1 (the second column of the inputs) is:

In [22]:
seq = [arr[1] for arr in xs]
seq

[43, 45, 40, 40, 39, 43, 33, 38, 31, 2]

In [24]:
ids2text(seq)

'SUPPOSING '

The next character is:

In [25]:
ids2text([c_out[1]])

't'

This is also the first character of sequence 2:

In [27]:
seq = [arr[2] for arr in xs]
ids2text(seq)

'that Truth'

### Build the model from scratch

In [28]:
n_fac = 42
n_hidden = 256

In [29]:
def embedding_input(name, n_in, n_out):
    inp = layers.Input(shape=(1,), dtype='int64', name=name + '_in')
    emb = layers.Embedding(n_in, n_out, input_length=1, name=name + '_emb')(inp)
    emb = layers.Flatten()(emb)
    return inp, emb

In [30]:
c_ins = [embedding_input('char_' + str(n), vocab_size, n_fac) for n in range(cs)]

Create the layers. The first character of each sequence goes through `dense_in` to create our first hidden activations.

In [31]:
dense_in = layers.Dense(n_hidden, activation='relu')
dense_hidden = layers.Dense(n_hidden, activation='relu', kernel_initializer='identity')
dense_out = layers.Dense(vocab_size, activation='softmax')

In [32]:
hidden = dense_in(c_ins[0][1])

Then, for each successive character, we add the output of `dense_in` on that character to the output of `dense_hidden` on the current hidden state. The sum is the new hidden state.

In [33]:
for i in range(1, cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = layers.add([c_dense, hidden])
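To make the recurrence concrete, here is a minimal NumPy sketch of the same computation for a single sequence. It is an illustration under assumptions, not the Keras implementation: the sizes are small and hypothetical, the weights and embedded inputs are random stand-ins, and `W_in`, `W_hidden` and `relu` are names introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical small sizes for illustration (the notebook uses 42, 256 and 10).
n_emb, n_hid, n_steps = 4, 6, 5

# One set of weights per role, shared across all time steps.
W_in = rng.normal(scale=0.1, size=(n_emb, n_hid))
b_in = np.zeros(n_hid)
W_hidden = np.eye(n_hid)  # identity initialisation, as in dense_hidden
b_hidden = np.zeros(n_hid)

# Random stand-ins for the embedded characters of one sequence.
embedded = rng.normal(size=(n_steps, n_emb))

# First character: only the input layer is applied.
hidden = relu(embedded[0] @ W_in + b_in)

# Remaining characters: input layer on the new character,
# hidden layer on the current state, then add the two.
for t in range(1, n_steps):
    c_dense = relu(embedded[t] @ W_in + b_in)
    hidden = c_dense + relu(hidden @ W_hidden + b_hidden)

print(hidden.shape)
```

With the identity initialisation and zero bias used here, the hidden layer passes the non-negative state through unchanged, so before any training the state is simply the running sum of the per-character activations.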

In [34]:
c_out = dense_out(hidden)

In [35]:
model = models.Model([c[0] for c in c_ins], c_out)

In [36]:
model.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy')

In [None]:
model.fit(xs, y, batch_size=64, epochs=12)

The training loss was around 1.6453.

#### Test model

In [67]:
def get_next_char(model, inp):
    idxs = [np.array(char2idx[c])[np.newaxis] for c in inp]
    p = model.predict(idxs)
    return chars[np.argmax(p)]

In [64]:
get_next_char(model, '  for thos')

'e'

In [65]:
get_next_char(model, '  part of ')

't'

In [71]:
get_next_char(model, ' queens an')

'd'

### Same model using Keras

This is essentially the same model as before, rebuilt with the recurrent layer that Keras provides.

Since we use `layers.SimpleRNN`, instead of multiple inputs we have a single input array with one sequence per row:

```
[char1 char2 ... char9 char10]
[char11 char12 ... char19 char20]
...
```
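Going from the per-position arrays to this row-per-sequence layout is a single stacking step. The sketch below shows it on toy data; `toy_xs` and `toy_batch` are hypothetical names used only here.

```python
import numpy as np

# Toy stand-in for `xs`: 3 character positions, 4 sequences.
toy_xs = [
    np.array([1, 4, 7, 10]),   # first character of each sequence
    np.array([2, 5, 8, 11]),   # second character of each sequence
    np.array([3, 6, 9, 12]),   # third character of each sequence
]

# Stack the per-position arrays as columns, so that each row is one sequence.
toy_batch = np.stack(toy_xs, axis=1)

print(toy_batch.shape)
print(toy_batch[0])
```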

In [37]:
n_fac = 42
n_hidden = 256

In [80]:
xs

[array([40, 43, 73, ..., 73, 62, 54]),
 array([42, 45, 61, ..., 61, 54, 72]),
 array([29, 40, 54, ..., 58, 67,  2]),
 array([30, 40, 73, ...,  1,  2, 73]),
 array([25, 39,  2, ..., 56, 76, 61]),
 array([27, 43, 44, ..., 61, 68, 58]),
 array([29, 33, 71, ..., 71, 71,  2]),
 array([ 1, 38, 74, ..., 62, 65, 62]),
 array([ 1, 31, 73, ..., 72, 57, 67]),
 array([ 1,  2, 61, ..., 73,  2, 57])]

In [81]:
# Stack the 10 per-position arrays as columns: one row per sequence
xs_keras = np.stack(xs, axis=1)

In [82]:
xs_keras.shape

(60087, 10)

In [83]:
xs_keras

array([[40, 42, 29, ...,  1,  1,  1],
       [43, 45, 40, ..., 38, 31,  2],
       [73, 61, 54, ..., 74, 73, 61],
       ..., 
       [73, 61, 58, ..., 62, 72, 73],
       [62, 54, 67, ..., 65, 57,  2],
       [54, 72,  2, ..., 62, 67, 57]])

In [84]:
inp = layers.Input((cs,))

In [85]:
x = layers.Embedding(vocab_size, n_fac)(inp)
x = layers.SimpleRNN(n_hidden, activation='relu', recurrent_initializer='identity')(x)
x = layers.Dense(vocab_size, activation='softmax')(x)

In [87]:
model = models.Model(inp, x)

In [88]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 10)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 10, 42)            3570      
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 256)               76544     
_________________________________________________________________
dense_5 (Dense)              (None, 85)                21845     
Total params: 101,959
Trainable params: 101,959
Non-trainable params: 0
_________________________________________________________________
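The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual layout of these layers: an embedding table, a simple RNN with input weights, recurrent weights and a bias, and a dense layer with weights and a bias. The value 85 for `vocab_size` is read off the Dense output shape in the summary.

```python
vocab_size, n_fac, n_hidden = 85, 42, 256

# Embedding: one n_fac-dimensional vector per vocabulary entry.
embedding_params = vocab_size * n_fac

# SimpleRNN: input-to-hidden weights, hidden-to-hidden weights, and a bias.
rnn_params = n_fac * n_hidden + n_hidden * n_hidden + n_hidden

# Dense: hidden-to-vocabulary weights and a bias.
dense_params = n_hidden * vocab_size + vocab_size

total_params = embedding_params + rnn_params + dense_params

print(embedding_params, rnn_params, dense_params, total_params)
```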


In [89]:
model.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy')

In [None]:
model.fit(xs_keras, y, batch_size=64, epochs=8)

The training loss was around 1.6856.

In [91]:
def get_next_char(model, inp):
    idxs = [char2idx[c] for c in inp]
    arrs = np.array(idxs)[np.newaxis, :]
    p = model.predict(arrs)[0]
    return chars[np.argmax(p)]

In [92]:
get_next_char(model, ' queens an')

'd'

## Predict sequences

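Instead of predicting a single character, we can predict one at every position.

To do that, we change the output of the model to be a sequence of characters: the target for each input sequence is the same sequence shifted by one character.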
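A toy example of such input and target pairs, using plain strings and a window of 4 instead of 10. The names `toy_text`, `toy_cs` and `pairs` are hypothetical, and the range bound is simplified compared with the cells that follow.

```python
toy_text = "hello world!"
toy_cs = 4

# Start of every non-overlapping window that still has a shifted target.
starts = range(0, len(toy_text) - toy_cs, toy_cs)

# Each target is the input window shifted one character to the right.
pairs = [
    (toy_text[i:i + toy_cs], toy_text[i + 1:i + 1 + toy_cs])
    for i in starts
]

for src, tgt in pairs:
    print(repr(src), '->', repr(tgt))
```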

In [95]:
cs = 10

In [96]:
c_out = [[idx[i+n] for i in range(1, len(idx)-cs, cs)]for n in range(cs)]

In [98]:
ys = [np.stack(c[:-2]) for c in c_out]

In [102]:
len(ys), ys[0].shape

(10, (60087,))

In [103]:
ys

[array([42, 45, 61, ..., 61, 54, 72]),
 array([29, 40, 54, ..., 58, 67,  2]),
 array([30, 40, 73, ...,  1,  2, 73]),
 array([25, 39,  2, ..., 56, 76, 61]),
 array([27, 43, 44, ..., 61, 68, 58]),
 array([29, 33, 71, ..., 71, 71,  2]),
 array([ 1, 38, 74, ..., 62, 65, 62]),
 array([ 1, 31, 73, ..., 72, 57, 67]),
 array([ 1,  2, 61, ..., 73,  2, 57]),
 array([43, 73,  2, ..., 62, 54, 62])]

In [128]:
x_rnn = np.stack(np.squeeze(xs), axis=1)
y_rnn = np.atleast_3d(np.stack(ys, axis=1))

In [129]:
x_rnn.shape, y_rnn.shape

((60087, 10), (60087, 10, 1))

In [117]:
inp = layers.Input((cs,))

In [121]:
x = layers.Embedding(vocab_size, n_fac)(inp)
x = layers.SimpleRNN(n_hidden, activation='relu', recurrent_initializer='identity', return_sequences=True)(x)
x = layers.TimeDistributed(layers.Dense(vocab_size, activation='softmax'))(x)

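The summary below was produced before `model` was rebuilt in the following cell, so it still describes the previous single-output model.

In [122]:
model.summary()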

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 10)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 10, 42)            3570      
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 256)               76544     
_________________________________________________________________
dense_5 (Dense)              (None, 85)                21845     
Total params: 101,959
Trainable params: 101,959
Non-trainable params: 0
_________________________________________________________________


In [123]:
model = models.Model(inp, x)

In [133]:
model.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy')

In [None]:
model.fit(x_rnn, y_rnn, batch_size=64, epochs=8)

This gave a training loss of around 1.6759.

In [137]:
def get_nexts_seq(model, inp):
    idxs = [char2idx[c] for c in inp]
    arr = np.array(idxs)[np.newaxis,:]
    p = model.predict(arr)[0]
    return [chars[np.argmax(o)] for o in p]

In [146]:
get_nexts_seq(model, '   this is')

['t', ' ', ' ', 'h', 'e', 's', ' ', 'p', 's', ' ']
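When reading this output, keep in mind that the targets were the inputs shifted by one, so the prediction at position t is the model's guess for the character at position t+1. Only the last entry is a guess for a character beyond the input. The sketch below decodes a made-up probability array the same way; `toy_chars` and `probs` are hypothetical and do not come from the trained model.

```python
import numpy as np

toy_chars = ['a', 'b', 'c']

# Hypothetical model output for one input of 4 time steps:
# shape (time steps, vocabulary size), each row sums to 1.
probs = np.array([
    [0.10, 0.70, 0.20],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
    [0.05, 0.90, 0.05],
])

# Most likely character at every time step.
step_preds = [toy_chars[i] for i in probs.argmax(axis=-1)]

# The last step is the prediction for the character after the input.
next_char = step_preds[-1]

print(step_preds, next_char)
```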