# Lesson 6: Building RNNs

In this lesson, we will first review what we've learned about pseudo labeling and embeddings from [Lesson 4](https://github.com/fdaham/fastai/blob/master/lesson4.ipynb) and [Lesson 5](https://github.com/fdaham/fastai/blob/master/lesson5.ipynb). Then, we will cover a few RNN architectures before building one from scratch in Theano.

## Reviewing Pseudo Labeling and Embeddings

### Pseudo Labeling

Pseudo labeling lets us learn from unlabeled data, especially when there is a lot of it, in conjunction with labeled data. To do this, we first train a model on the labeled data and use it to predict labels for our test set. Then, another model is built using data from the training set and the pseudo-labeled test set in ratios of 2:1 or 3:1 (labeled:unlabeled). Keras doesn't have a built-in function to generate batches from different data sets, so we use *MixIterator()*, a class written by Jeremy, to do this.

```python
import numpy as np

class MixIterator(object):
    def __init__(self, iters):
        self.iters = iters
        # total number of samples across all iterators
        self.N = sum([it.N for it in self.iters])

    def reset(self):
        for it in self.iters: it.reset()

    def __iter__(self):
        return self

    def next(self, *args, **kwargs):
        # pull one batch from each iterator, then concatenate inputs and labels
        nexts = [next(it) for it in self.iters]
        n0 = np.concatenate([n[0] for n in nexts])
        n1 = np.concatenate([n[1] for n in nexts])
        return (n0, n1)
```
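
To make the 3:1 ratio concrete, here is a minimal NumPy sketch of how a single mixed batch could be assembled. The helper `mixed_batch`, the batch sizes, and the toy arrays are hypothetical and only illustrate the idea of concatenating labeled and pseudo-labeled samples; they are not part of Keras or of Jeremy's utilities.

```python
import numpy as np

def mixed_batch(x_lab, y_lab, x_pseudo, y_pseudo, n_lab=48, n_pseudo=16, rng=None):
    """Draw one batch mixing labeled and pseudo-labeled rows (3:1 by default)."""
    if rng is None:
        rng = np.random.default_rng(0)
    i = rng.choice(len(x_lab), size=n_lab, replace=False)
    j = rng.choice(len(x_pseudo), size=n_pseudo, replace=False)
    x = np.concatenate([x_lab[i], x_pseudo[j]])
    y = np.concatenate([y_lab[i], y_pseudo[j]])
    return x, y

# Toy data: 200 labeled rows and 100 pseudo-labeled rows, 5 features each.
# Labels are 0 for the labeled set and 1 for the pseudo-labeled set so the mix is easy to count.
x_lab, y_lab = np.ones((200, 5)), np.zeros(200)
x_pseudo, y_pseudo = np.zeros((100, 5)), np.ones(100)

xb, yb = mixed_batch(x_lab, y_lab, x_pseudo, y_pseudo)
print(xb.shape, yb.shape)
```

*MixIterator()* achieves the same effect by asking each underlying iterator for its own batch, so the ratio is set by the batch sizes of those iterators.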

### Embeddings

In Lessons 4 and 5, we used embeddings to represent user movie ratings and trained our model to optimize these parameters through gradient descent. In Keras, the embedding matrices were passed to functions for both users and movies to generate embeddings for user/movie IDs. The given user movie ratings (raw data) were used as our third input, that is, the target outputs of our model. Based on the calculated loss, the embeddings were updated. For that example, we laid everything out in matrices for learning purposes. However, representing the data this way is impractical: only a limited number of user/movie combinations exist, leaving the matrices mostly sparse.
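
As a quick refresher, the sketch below shows what an embedding lookup amounts to in plain NumPy. The sizes, the random values, and the dot-product rating model are assumptions chosen for illustration, not the exact model from Lessons 4 and 5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 4, 6, 3

# Embedding matrices: one row of latent factors per user / per movie
user_emb = rng.normal(size=(n_users, n_factors))
movie_emb = rng.normal(size=(n_movies, n_factors))

user_id, movie_id = 2, 5

# An embedding lookup is just row indexing...
u = user_emb[user_id]

# ...which gives the same result as multiplying a one-hot vector by the matrix
one_hot = np.zeros(n_users)
one_hot[user_id] = 1.0
u_via_one_hot = one_hot @ user_emb

# Predicted rating under a simple dot-product model (an assumption for this sketch)
pred = u @ movie_emb[movie_id]
print(pred)
```

Indexing avoids ever building the large, mostly empty one-hot vectors, which is why embedding layers are implemented as lookups.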

## RNNs

We know RNNs are best used to keep track of memory. Let's try to build an architecture that best reflects this use: keeping track of past states to predict future ones. The example we briefly discussed in Lesson 5 shows how an RNN uses state to guess the next word in a sentence. The network (pictured below) takes in an embedding for a word, passes it through two layers of transformations, then combines it with the transformed embedding of the next word. This is what gives the network a sense of state. At the final layer, a prediction for the last word in the sentence is made. Therefore, the final word is dependent on the information learned from the preceding words.

### Building a 4-Character Model

We'll now take advantage of Keras' functional API to construct arbitrary architectures! Let's consider a network that predicts the fourth char given the first three:

<img src="https://i.imgur.com/pM1QA77.png" alt="Drawing" style="width: 200px;"/>

Every green arrow (input to hidden layer) performs the same action. They are all essentially weight matrices with equal dimensionality. The orange arrows take the hidden state from the previous char and combine it with the hidden state of the next char. The blue arrow predicts the fourth char, given a hidden state. Therefore, it's fair to assume that the task of transforming and combining chars 1 and 2 is the same as for chars 3 and 4, making these tasks **time shift invariant** (time, *t*, being sequential in nature).

Let's start by importing all necessary libraries and completing a few configuration steps.

In [3]:
from theano.sandbox import cuda
cuda.use('gpu1')

In [4]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function

In [3]:
# downloading Nietzsche's collected works
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read()
print('corpus length:', len(text))

corpus length: 600901


In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

total chars: 86


In [5]:
# adding 0 for padding
chars.insert(0, "\0")

In [7]:
char_indices = dict((c, i) for i, c in enumerate(chars)) # map from chars to inds
indices_char = dict((i, c) for i, c in enumerate(chars)) # map from inds to chars

In [8]:
# create array of all indices for characters in corpus
idx = [char_indices[c] for c in text]

We are now ready to build our model. Ultimately, we want to predict the fourth character from a sequence of three. Let's start by creating four lists from *idx* (our numerical representation of the text), each taking every third character (*cs = 3*) at offsets 0, 1, 2, and 3. The first three lists are our inputs (*x1*-*x3*) and the fourth is our output (*y*); each is stacked into a separate numpy array.

In [11]:
cs = 3
c1_dat = [idx[i] for i in xrange(0, len(idx)-1-cs, cs)]
c2_dat = [idx[i+1] for i in xrange(0, len(idx)-1-cs, cs)]
c3_dat = [idx[i+2] for i in xrange(0, len(idx)-1-cs, cs)]
c4_dat = [idx[i+3] for i in xrange(0, len(idx)-1-cs, cs)]

In [12]:
x1 = np.stack(c1_dat[:-2])
x2 = np.stack(c2_dat[:-2])
x3 = np.stack(c3_dat[:-2])

In [13]:
y = np.stack(c4_dat[:-2])
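
To see what these arrays contain, here is the same windowing applied to a short toy string. The string and the Python 3 syntax (`range` instead of `xrange`) are choices made for this sketch, and the `[:-2]` trimming used above is left out.

```python
import numpy as np

text = "hello world!"
chars = sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}
idx = [char_indices[c] for c in text]

cs = 3
starts = range(0, len(idx) - 1 - cs, cs)   # window start positions: 0, 3, 6
x1 = np.array([idx[i] for i in starts])
x2 = np.array([idx[i + 1] for i in starts])
x3 = np.array([idx[i + 2] for i in starts])
y = np.array([idx[i + 3] for i in starts])

# Decode each (three inputs -> one target) training example
for a, b, c, d in zip(x1, x2, x3, y):
    print(repr(chars[a] + chars[b] + chars[c]), '->', repr(chars[d]))
```

Each window starts three characters after the previous one, so the target of one example is also the first input of the next.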

Next, we'll create an embedding for each input character, where each embedding contains 42 latent factors. 

In [17]:
n_fac = 42

In [18]:
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1)(inp)
    return inp, Flatten()(emb)

In [19]:
c1_in, c1 = embedding_input('c1', vocab_size, n_fac)
c2_in, c2 = embedding_input('c2', vocab_size, n_fac)
c3_in, c3 = embedding_input('c3', vocab_size, n_fac)

We will now focus on our input's path through the hidden state (the green arrow in the diagram above). *n_hidden* is the size of our hidden state, which we chose to be 256, and *dense_in* is the layer operation. Our first hidden activation (*c1_hidden*) is the output generated from the embedding of our first character. 

In [20]:
n_hidden = 256

In [47]:
dense_in = Dense(n_hidden, activation='relu')

In [48]:
c1_hidden = dense_in(c1)

Now, we can define the layer operation from hidden to hidden (the orange arrows in the diagram). This sets us up to build the merging portions of our network using these two layers. 

In [49]:
dense_hidden = Dense(n_hidden, activation='tanh')

In [50]:
c2_dense = dense_in(c2)
hidden_2 = dense_hidden(c1_hidden)
c2_hidden = merge([c2_dense, hidden_2])

In [51]:
c3_dense = dense_in(c3)
hidden_3 = dense_hidden(c2_hidden)
c3_hidden = merge([c3_dense, hidden_3])

Finally, we can define the layer operation from hidden to output (the blue arrow in the diagram). Here, our third hidden state is transformed into our final output prediction. We use this to build and train our *model*.

In [52]:
dense_out = Dense(vocab_size, activation='softmax')

In [53]:
c4_out = dense_out(c3_hidden)
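
The forward pass these layers define can be sketched in plain NumPy. This is a simplified illustration under several assumptions: weights are random rather than trained, biases are omitted, a single embedding table is shared by all three inputs, and the merge is taken to be a sum.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_fac, n_hidden = 86, 42, 256

emb = rng.normal(scale=0.1, size=(vocab_size, n_fac))        # embedding table
W_in = rng.normal(scale=0.1, size=(n_fac, n_hidden))         # green arrows (dense_in)
W_hid = rng.normal(scale=0.1, size=(n_hidden, n_hidden))     # orange arrows (dense_hidden)
W_out = rng.normal(scale=0.1, size=(n_hidden, vocab_size))   # blue arrow (dense_out)

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_fourth(c1, c2, c3):
    h = relu(emb[c1] @ W_in)                        # c1_hidden
    h = relu(emb[c2] @ W_in) + np.tanh(h @ W_hid)   # c2_hidden
    h = relu(emb[c3] @ W_in) + np.tanh(h @ W_hid)   # c3_hidden
    return softmax(h @ W_out)                       # distribution over the 4th char

p = predict_fourth(10, 20, 30)
print(p.shape, p.sum())
```

Note that `W_in` and `W_hid` are reused at every position; that reuse is what the shared *dense_in* and *dense_hidden* layers express.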

In [54]:
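model = Model([c1_in, c2_in, c3_in], c4_out)
# the model must be compiled before it can be trained
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())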

In [45]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7ff37ba08a50>

After training our model (output not entirely shown above), let's test it. *get_next()* takes in a string arg, looks up the index value of every character in the input (like we did earlier for the whole text), wraps each index in a numpy array, then returns the char with the highest predicted probability.

In [159]:
def get_next(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict(arrs)
    i = np.argmax(p)
    return chars[i]
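
The last two lines of *get_next()* can be illustrated on their own. The tiny vocabulary and the probabilities below are made up for this sketch; a real call to *model.predict()* returns one probability per character in *chars*.

```python
import numpy as np

chars = ['\0', ' ', 'a', 'd', 'e', 'l']

# Hypothetical model output: one row, one probability per character in `chars`
p = np.array([[0.01, 0.04, 0.05, 0.10, 0.20, 0.60]])

i = np.argmax(p)        # position of the largest probability
next_char = chars[i]    # map that position back to a character
print(next_char)
```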

To test our model, we used the following character sequences: 'phi', ' th', and ' an'. The model is limited because it only makes predictions based on the preceding three characters, but it does a pretty good job of completing the words 'phil', 'the', and 'and'.

In [160]:
get_next('phi')

'l'

In [161]:
get_next(' th')

'e'

In [162]:
get_next(' an')

'd'

### Building an *N*-Character Model

The diagram above shows the unrolled version of a network that takes in the first three characters as input to predict the fourth character. This time, we want to build a more general model that can predict the *n*th character given a sequence of *n-1* characters. To do this, we will stack the input/hidden layers from our previous model *n-1* times. In the diagram below, we can see this new recurrent network.

<img src="https://i.imgur.com/4kUfG8T.png" alt="Drawing" style="width: 300px;"/>

**Note:** when stacking Keras on TensorFlow, RNNs can only be implemented in their unrolled form. However, Theano can implement RNNs in their recurrent form (actions highlighted in the red box), which we will use for this build.

We will now build our *n*th character RNN (where *cs = n-1*). In this example, let's say we want to predict the ninth character from a sequence of 8 (*cs = 8*). The inputs, *c_in_dat*, and desired output, *c_out_dat*, are created and stacked in separate numpy arrays below.

In [73]:
cs=8

In [74]:
c_in_dat = [[idx[i+n] for i in xrange(0, len(idx)-1-cs, cs)]
            for n in range(cs)]

In [75]:
c_out_dat = [idx[i+cs] for i in xrange(0, len(idx)-1-cs, cs)]

In [76]:
xs = [np.stack(c[:-2]) for c in c_in_dat]

In [45]:
y = np.stack(c_out_dat[:-2])

In [58]:
n_fac = 42

Let's begin by creating the embeddings for each char and defining our layer tasks, similar to our 4-character model. This time, however, we are keeping track of more chars, so we should expect this model to yield better results than our last one. 

In [33]:
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name+'_in')
    emb = Embedding(n_in, n_out, input_length=1, name=name+'_emb')(inp)
    return inp, Flatten()(emb)

In [34]:
c_ins = [embedding_input('c'+str(n), vocab_size, n_fac) for n in range(cs)]

In [35]:
n_hidden = 256

In [36]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu', init='identity')
dense_out = Dense(vocab_size, activation='softmax')

In [37]:
hidden = dense_in(c_ins[0][1])

In [38]:
for i in range(1,cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden])

In [39]:
c_out = dense_out(hidden)

We are now ready to create and test our model, as we've done before.

In [179]:
model = Model([c[0] for c in c_ins], c_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [180]:
model.fit(xs, y, batch_size=64, nb_epoch=12)

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x7f25579a80d0>

In [181]:
def get_next(inp):
    idxs = [np.array(char_indices[c])[np.newaxis] for c in inp]
    p = model.predict(idxs)
    return chars[np.argmax(p)]

In [182]:
get_next('for thos')

'e'

In [432]:
get_next('part of ')

't'

In [433]:
get_next('queens a')

'n'

Our model has successfully predicted the next character given a list of the preceding characters. This type of RNN is best suited for tasks like sentiment analysis, which use sequences of chars/words as input. Keras has a built-in implementation of this RNN that we can use with its sequential API.

In [31]:
model=Sequential([
        Embedding(vocab_size, n_fac, input_length=cs),
        SimpleRNN(n_hidden, activation='relu', inner_init='identity'),
        Dense(vocab_size, activation='softmax')
    ])

In [32]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_5 (Embedding)          (None, 8, 42)         3612        embedding_input_2[0][0]          
____________________________________________________________________________________________________
simplernn_2 (SimpleRNN)          (None, 256)           76544       embedding_5[0][0]                
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 86)            22102       simplernn_2[0][0]                
Total params: 102258
____________________________________________________________________________________________________
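
The parameter counts in this summary can be checked by hand. The short calculation below assumes the usual layout of these layers: an embedding table, an RNN with input weights, recurrent weights, and a bias, and a dense layer with weights and a bias.

```python
vocab_size, n_fac, n_hidden = 86, 42, 256

embedding = vocab_size * n_fac                               # one row of factors per character
rnn = n_fac * n_hidden + n_hidden * n_hidden + n_hidden      # input weights + recurrent weights + bias
dense = n_hidden * vocab_size + vocab_size                   # output weights + bias
total = embedding + rnn + dense

print(embedding, rnn, dense, total)
```

None of these counts depend on the sequence length *cs*, because the same weights are reused at every time step.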


In [24]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [217]:
model.fit(np.concatenate(xs,axis=1), y, batch_size=64, nb_epoch=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7fa18f2c0890>

In [222]:
def get_next_keras(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = np.array(idxs)[np.newaxis,:]
    p = model.predict(arrs)[0]
    return chars[np.argmax(p)]

In [223]:
get_next_keras('this is ')

't'

In [224]:
get_next_keras('part of ')

't'

In [225]:
get_next_keras('queens a')

'n'

It's important to note that, so far, we've been initializing our hidden-to-hidden weight matrices as identity matrices. For keeping track of state, this makes sense. However, hidden-to-hidden layer transformations are meant to determine how information from the previous state should be transformed before being combined with the newly transformed input. The best way to do this is to start by passing the information from the previous state directly into the construction of our new one, and let SGD optimize the transformation from there.
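
A small NumPy check shows why an identity start preserves state. The hidden state values here are arbitrary; the only assumption is that they are non-negative, as they would be after a ReLU.

```python
import numpy as np

n_hidden = 4
W_h = np.eye(n_hidden)                       # identity initialization
h_prev = np.array([0.5, 0.0, 2.0, 1.5])      # a non-negative hidden state (post-ReLU)

# One hidden-to-hidden step with a ReLU activation
h_carried = np.maximum(h_prev @ W_h, 0)
print(h_carried)
```

Before any training, the previous state passes through unchanged, so early in training the network already remembers what it has seen.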

### Building a 2 to *N* Character Sequence Model

Let's now consider a model that returns sequences: instead of having our model predict the *n*th character from the preceding *n-1* characters, let's have our model predict chars 2 to *n* from chars 1 to *n-1*. Referencing the previous computational diagram, our output is now included in our red highlighted iteration box. This model increases the number of predictions we make on our training set: instead of making only one prediction per sequence, we make *n-1*. Because of this, our model is able to handle more long-term memory tasks. To build this model, we need to change our output, *c_out_dat*, to a sequence.

In [64]:
c_out_dat = [[idx[i+n] for i in xrange(1, len(idx)-cs, cs)]
            for n in range(cs)]

In [65]:
ys = [np.stack(c[:-2]) for c in c_out_dat]
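
The relationship between inputs and sequence targets is easiest to see on raw text. The toy string and the window length of 4 are choices made for this sketch.

```python
text = "hello world!!"
cs = 4

pairs = []
for i in range(0, len(text) - cs, cs):
    inputs = text[i:i + cs]             # characters i .. i+cs-1
    targets = text[i + 1:i + cs + 1]    # the same window shifted right by one
    pairs.append((inputs, targets))

for inputs, targets in pairs:
    print(repr(inputs), '->', repr(targets))
```

Every input character now has its own target: the character that follows it.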

We can create and train our model as we've done before. However, instead of starting our hidden state from the first character, we will start it from a vector of zeros. The hidden-to-hidden layer is still initialized as an identity matrix.

In [47]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu', init='identity')
dense_out = Dense(vocab_size, activation='softmax', name='output')

In [48]:
inp1 = Input(shape=(n_fac,), name='zeros')
hidden = dense_in(inp1)

In [66]:
outs = []

for i in range(cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden], mode='sum')
    # every layer now has an output
    outs.append(dense_out(hidden))

In [67]:
model = Model([inp1] + [c[0] for c in c_ins], outs)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [68]:
zeros = np.tile(np.zeros(n_fac), (len(xs[0]),1))
zeros.shape

(75110, 42)

In [394]:
model.fit([zeros]+xs, ys, batch_size=64, nb_epoch=12)


Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x7fa168d005d0>

When testing our model, we see that we've attained pretty good results. In the first example, when given ' this is', our model was able to provide a space after predicting the word 'this'. It also knew to start a word after the space. In the second example, when given ' part of', our model was able to give a space before and after accurately predicting the word 'of'.

In [395]:
def get_nexts(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict([np.zeros(n_fac)[np.newaxis,:]] + arrs)
    print(list(inp))
    return [chars[np.argmax(o)] for o in p]

In [396]:
get_nexts(' this is')

[' ', 't', 'h', 'i', 's', ' ', 'i', 's']


['t', 'h', 'e', 't', ' ', 'c', 's', ' ']

In [397]:
get_nexts(' part of')

[' ', 'p', 'a', 'r', 't', ' ', 'o', 'f']


['t', 'o', 'r', 't', ' ', 'o', 'f', ' ']

We can also create this model in Keras using its sequential API. The only differences are setting the parameter *return_sequences* to *True*, so that the RNN layer returns an output at every time step rather than only the last one, and changing our targets into the necessary sequences.

In [67]:
model=Sequential([
        Embedding(vocab_size, n_fac, input_length=cs),
        SimpleRNN(n_hidden, return_sequences=True, activation='relu', inner_init='identity'),
        TimeDistributed(Dense(vocab_size, activation='softmax')),
    ])

In [52]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_6 (Embedding)          (None, 8, 42)         3612        embedding_input_3[0][0]          
____________________________________________________________________________________________________
simplernn_3 (SimpleRNN)          (None, 8, 256)        76544       embedding_6[0][0]                
____________________________________________________________________________________________________
timedistributed_1 (TimeDistribut (None, 8, 86)         22102       simplernn_3[0][0]                
Total params: 102258
____________________________________________________________________________________________________


In [71]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [90]:
x_rnn = np.stack(np.squeeze(xs), axis=1)
y_rnn = np.atleast_3d(np.stack(ys, axis=1))
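
The reshaping above is easier to follow with small stand-in arrays. The values below are arbitrary; only the shapes matter.

```python
import numpy as np

n_samples, cs = 5, 8

# Stand-ins for `xs` / `ys`: a list of cs arrays, one per character position
xs = [np.arange(n_samples) + n for n in range(cs)]
ys = [np.arange(n_samples) + n + 1 for n in range(cs)]

x_rnn = np.stack(xs, axis=1)                    # (samples, time steps)
y_rnn = np.atleast_3d(np.stack(ys, axis=1))     # (samples, time steps, 1)

print(x_rnn.shape, y_rnn.shape)
```

The trailing dimension of size 1 on the targets gives one integer label per time step, which is the layout used with *sparse_categorical_crossentropy* here.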

In [92]:
model.fit(x_rnn, y_rnn, batch_size=64, nb_epoch=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f82761cc990>

### Building a Stateful Model

The models we've built so far (the 4-char, *n*-char, and 2 to *n*-char models) don't really incorporate state as much as we'd like them to. To fix this, we can't train on random batches of data. Instead, when training, we set *shuffle=False*. Building a stateful model is useful for tasks that require long-term memory. To handle long-term dependencies, our hidden states will be passed between sequences. We'll start with our initial zero vector input, then carry the hidden state from each sequence over to the next, so that our hidden state always reflects an arbitrarily long dependency.

Constructing this model in Keras is simple: just add *stateful=True* to the recurrent layer and fix the batch size with *batch_input_shape*. Then, we add batch normalization and use an LSTM layer (briefly introduced in Lesson 5). Remember, normalizing our data rather than directly feeding it into our model improves convergence. Once our model generates a prediction, the data is denormalized to get "real world" results.
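
The idea of carrying state between sequences can be sketched with a plain tanh RNN in NumPy. This is a simplified illustration with random weights and a single sequence; it is not how Keras implements *stateful=True* internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W_x = rng.normal(scale=0.5, size=(n_in, n_hidden))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))

def run_rnn(xs, h):
    """Run a plain tanh RNN over the rows of xs, starting from hidden state h."""
    for x in xs:
        h = np.tanh(x @ W_x + h @ W_h)
    return h

seq = rng.normal(size=(16, n_in))    # one long sequence of 16 steps
h0 = np.zeros(n_hidden)

h_full = run_rnn(seq, h0)                               # all 16 steps in one go
h_stateful = run_rnn(seq[8:], run_rnn(seq[:8], h0))     # two chunks, state carried over
h_stateless = run_rnn(seq[8:], h0)                      # second chunk with the state reset

print(np.allclose(h_full, h_stateful), np.allclose(h_full, h_stateless))
```

Carrying the state over makes two chunks of 8 behave like one sequence of 16, which only works if the chunks are fed in order. That is why shuffling has to be turned off.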

In [290]:
bs = 64

In [338]:
model=Sequential([
        Embedding(vocab_size, n_fac, input_length=cs, batch_input_shape=(bs,8)),
        BatchNormalization(),
        LSTM(n_hidden, return_sequences=True, stateful=True),
        TimeDistributed(Dense(vocab_size, activation='softmax')),
    ])

In [339]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [340]:
# inputs/outputs must be even multiple of fixed batch size
mx = len(x_rnn)//bs*bs

Before, our model's hidden-to-hidden layer operations were only applied *n-1* times. In this stateful model, they are being applied possibly hundreds of thousands of times. Therefore, our network is sensitive to exploding gradients: if the matrix is poorly scaled to even a small degree, a number that's slightly larger than the others will grow exponentially, sending the activations to infinity and destabilizing the network.
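
The effect of a slightly mis-scaled matrix is easy to reproduce. The sketch below applies the same matrix 1,000 times to a vector of ones; it is purely linear (no inputs, no activation function), which is an assumption made to isolate the scaling effect.

```python
import numpy as np

n_hidden, n_steps = 8, 1000
h0 = np.ones(n_hidden)

norms = []
for scale in (0.99, 1.0, 1.01):
    W = scale * np.eye(n_hidden)    # hidden-to-hidden matrix, scaled slightly off identity
    h = h0.copy()
    for _ in range(n_steps):
        h = h @ W                   # apply the same matrix at every step
    norms.append(np.linalg.norm(h))
    print(scale, norms[-1])
```

A 1% difference in scale is enough to make the state either vanish or blow up over a long sequence.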

This instability was corrected using the LSTM model; our network now controls how much state it needs through optimization. It is important to note here that these stateful models train at a slower rate. This is expected because each sequence is passed through iteratively, making our network harder to parallelize.

In [341]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=4, shuffle=False)


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fa16f1d2690>

In [342]:
model.optimizer.lr=1e-4

In [343]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=4, shuffle=False)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fa1773b8c10>

In [344]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=4, shuffle=False)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fa1773b8d50>

### Building an RNN in Theano

To really understand how we implemented these models in Keras, we will learn how to build an RNN in pure Theano (the backend Keras runs on top of). In the future, we want to build more advanced models, so it's important to build them from scratch to better understand and debug them in this lower-level framework. The same could be done using TensorFlow as the backend.

Let's first think of how to construct the operations we're going to need for our model. To build our layers (input to hidden, hidden to hidden, and hidden to output), we'll need to create our weight matrices and bias vectors from scratch. *shared()* is used to tell Theano that it can manage this data, copying it to and from the GPU when necessary. Here, the weights and biases are returned as tuples.

In [107]:
n_input = vocab_size
n_output = vocab_size

In [108]:
def init_wgts(rows, cols): 
    scale = math.sqrt(2/rows)
    return shared(normal(scale=scale, size=(rows, cols)).astype(np.float32))
def init_bias(rows): 
    return shared(np.zeros(rows, dtype=np.float32))

In [109]:
def wgts_and_bias(n_in, n_out): 
    return init_wgts(n_in, n_out), init_bias(n_out)
def id_and_bias(n): 
    return shared(np.eye(n, dtype=np.float32)), init_bias(n)

In Theano, our variables must be declared before use; no computations are done until our functions are compiled and evaluated. Below, we will declare our matrices, vectors, and scalars and group them in *all_args*. Next, we can use the functions above to manually initialize the weights and biases for the hidden (*W_h*), input (*W_x*), and output (*W_y*) layers before combining them in a single list (*w_all*).

In [110]:
t_inp = T.matrix('inp')
t_outp = T.matrix('outp')
t_h0 = T.vector('h0')
lr = T.scalar('lr')

all_args = [t_h0, t_inp, t_outp, lr]

In [73]:
W_h = id_and_bias(n_hidden)
W_x = wgts_and_bias(n_input, n_hidden)
W_y = wgts_and_bias(n_hidden, n_output)
w_all = list(chain.from_iterable([W_h, W_x, W_y]))

Now that we've initialized our inputs, we need to tell Theano what needs to happen in each step (a single forward pass for one char) of our RNN. Our step function, *step()*, calculates the hidden activations and the output. Then *theano.scan()* calls this function for every element of the input sequence, using the initial values of the outputs, the inputs, and all other arguments.

In [74]:
def step(x, h, W_h, b_h, W_x, b_x, W_y, b_y):
    # Calculate the hidden activations
    h = nnet.relu(T.dot(x, W_x) + b_x + T.dot(h, W_h) + b_h)
    # Calculate the output activations
    y = nnet.softmax(T.dot(h, W_y) + b_y)
    # Return both (the 'Flatten()' is to work around a theano bug)
    return h, T.flatten(y, 1)

In [75]:
[v_h, v_y], _ = theano.scan(step, sequences=t_inp, 
                            outputs_info=[t_h0, None], non_sequences=w_all)
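
What the scan does can be written as an ordinary Python loop. The NumPy sketch below mirrors *step()* with small, randomly initialized weights; the sizes and random inputs are assumptions for illustration, and the weights are closed over rather than passed as arguments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_input, n_hidden, n_output, seq_len = 6, 4, 6, 8

W_h, b_h = np.eye(n_hidden), np.zeros(n_hidden)
W_x = rng.normal(scale=np.sqrt(2 / n_input), size=(n_input, n_hidden))
b_x = np.zeros(n_hidden)
W_y = rng.normal(scale=np.sqrt(2 / n_hidden), size=(n_hidden, n_output))
b_y = np.zeros(n_output)

def step(x, h):
    # hidden activations (ReLU), then softmax output
    h = np.maximum(x @ W_x + b_x + h @ W_h + b_h, 0)
    z = h @ W_y + b_y
    e = np.exp(z - z.max())
    return h, e / e.sum()

# One-hot inputs, one row per time step
inp = np.eye(n_input)[rng.integers(0, n_input, size=seq_len)]

h = np.zeros(n_hidden)      # plays the role of t_h0
ys = []
for x in inp:               # the loop that scan performs: each row plus the previous h
    h, y = step(x, h)
    ys.append(y)
v_y = np.stack(ys)          # one output distribution per time step

print(v_y.shape)
```

The difference in Theano is that the loop is described symbolically, which lets it compute gradients through every step.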

After we've completed one forward pass, we must update our weights by first calculating our loss. Then, we can perform SGD by storing the update for each weight in a dictionary built by *upd_dict()*. We can use Theano functions to do this: categorical cross-entropy gives us the error of our step function outputs, and *T.grad()* is then used to find the gradient of that error with respect to each weight before updating our parameters with our learning rate *lr*.

In [76]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)

In [77]:
def upd_dict(wgts, grads, lr): 
    return OrderedDict({w: w-g*lr for (w,g) in zip(wgts,grads)})

upd = upd_dict(w_all, g_all, lr)
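
The same update rule can be tried by hand on a single softmax layer. In this sketch the gradient is written out analytically instead of using *T.grad()*, and the layer sizes, hidden activation, and target are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_output = 4, 6
W_y = rng.normal(scale=0.5, size=(n_hidden, n_output))
h = rng.normal(size=n_hidden)        # a fixed hidden activation
target = np.eye(n_output)[2]         # one-hot target

def loss_and_grad(W):
    z = h @ W
    p = np.exp(z - z.max())
    p /= p.sum()                             # softmax
    loss = -np.sum(target * np.log(p))       # categorical cross-entropy
    grad = np.outer(h, p - target)           # d(loss)/dW for softmax + cross-entropy
    return loss, grad

lr = 0.1
loss_before, g = loss_and_grad(W_y)
W_y = W_y - lr * g                   # same rule as upd_dict: w - g*lr
loss_after, _ = loss_and_grad(W_y)

print(loss_before, loss_after)
```

One step in the direction of the negative gradient lowers the loss; the Theano version applies this rule to every weight in *w_all* at once.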

We now have our loss, gradients, and updates and are ready to compile them using *theano.function()*, which gives us a function that returns the error. On each call, our weights are updated through this manual form of SGD. The function is then called iteratively for each sequence in our training data. With each pass, the loss is calculated and used to update the parameters. Here, Jeremy prints the average error every thousandth iteration to show how the network is improving.

In [78]:
fn = theano.function(all_args, error, updates=upd, allow_input_downcast=True)

In [123]:
X = oh_x_rnn
Y = oh_y_rnn
X.shape, Y.shape

((75110, 8, 86), (75110, 8, 86))
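
The arrays *oh_x_rnn* and *oh_y_rnn* hold one-hot encoded versions of the character indices, which is why their last dimension is 86. A minimal way to produce that encoding in NumPy is sketched below; the index values are arbitrary stand-ins.

```python
import numpy as np

vocab_size = 86

# Stand-in for integer character indices, shape (samples, time steps)
x_idx = np.array([[1, 5, 2, 0, 85, 3, 3, 7],
                  [4, 4, 9, 10, 2, 1, 0, 6]])

# Indexing an identity matrix turns every index into a one-hot row
oh_x = np.eye(vocab_size, dtype=np.float32)[x_idx]   # (samples, time steps, vocab_size)

print(oh_x.shape)
```

Taking `argmax` over the last axis recovers the original indices, which is how one-hot rows are decoded back into characters.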

In [86]:
err=0.0; l_rate=0.01
for i in range(len(X)): 
    err+=fn(np.zeros(n_hidden), X[i], Y[i], l_rate)
    if i % 1000 == 999: 
        print ("Error:{:.3f}".format(err/1000))
        err=0.0

Error:25.196
Error:21.489
Error:20.900
Error:19.913
Error:18.816
Error:19.202
Error:19.066
Error:18.473
Error:17.942
Error:18.251
Error:17.489
Error:17.570
Error:18.371
Error:17.331
Error:16.807
Error:17.681
Error:17.401
Error:17.136
Error:16.830
Error:16.651
Error:16.518
Error:16.430
Error:16.687
Error:16.161
Error:16.775
Error:16.566
Error:16.053
Error:16.296
Error:16.240
Error:16.454
Error:16.699
Error:16.396
Error:16.644
Error:16.328
Error:15.990
Error:16.644
Error:15.981
Error:16.359
Error:16.042
Error:16.326
Error:15.361
Error:15.690
Error:15.742
Error:16.048
Error:15.955
Error:15.866
Error:15.571
Error:16.069
Error:15.997
Error:16.030
Error:15.230
Error:15.612
Error:14.918
Error:14.821
Error:15.580
Error:15.380
Error:14.650
Error:15.499
Error:15.110
Error:14.972
Error:15.034
Error:15.427
Error:15.236
Error:15.037
Error:14.768
Error:14.781
Error:14.329
Error:14.726
Error:15.229
Error:14.809
Error:15.144
Error:14.755
Error:14.440
Error:14.431
Error:14.464


We've now successfully built an RNN from scratch in Theano! Let's use our model to make predictions. We'll define another Theano function that takes in the initial hidden state and an input character sequence and returns the output probabilities for each step.

In [87]:
f_y = theano.function([t_h0, t_inp], v_y, allow_input_downcast=True)

In [337]:
# decode the one-hot rows of sequence 6 back into character indices
act = np.argmax(X[6], axis=1)

In [338]:
[indices_char[o] for o in act]

['t', 'h', 'e', 'n', '?', ' ', 'I', 's']

The model does pretty well when you consider, for example, how it knows to follow a punctuation mark with a space, then begin a new word with a capitalized letter. 