# Lesson 6: Building RNNs

In this lesson, we will first review what we've learned about pseudo labeling and embeddings from [Lesson 4](https://github.com/fdaham/fastai/blob/master/lesson4.ipynb) and [Lesson 5](https://github.com/fdaham/fastai/blob/master/lesson5.ipynb). Then, we will cover a few RNN architectures before building one from scratch in Theano.

## Reviewing Pseudo Labeling and Embeddings

### Pseudo Labeling

Pseudo labeling allows us to learn from unlabeled data (especially when there are large amounts of it) in conjunction with labeled data. To do this, we first train a model on the labeled data and use it to predict labels for our test set. Then, another model is built using data from the training set and the pseudo-labeled test set in ratios of 2:1 or 3:1 (labeled:unlabeled data). Keras doesn't have a built-in function to generate batches from different data sets, so we use ```MixIterator```, a class written by Jeremy, to do this:

```python
import numpy as np

class MixIterator(object):
    def __init__(self, iters):
        # iters: batch iterators (e.g. labeled and pseudo-labeled data)
        self.iters = iters
        self.N = sum([it.N for it in self.iters])

    def reset(self):
        for it in self.iters: it.reset()

    def __iter__(self):
        return self

    def next(self, *args, **kwargs):
        # take one batch from every iterator and concatenate them
        nexts = [next(it) for it in self.iters]
        n0 = np.concatenate([n[0] for n in nexts])  # inputs
        n1 = np.concatenate([n[1] for n in nexts])  # labels
        return (n0, n1)
```
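
As a rough illustration of how the mixing ratio can be controlled, here is a minimal NumPy-only sketch written for Python 3. The helpers `cycle_batches` and `mix_batches` are hypothetical stand-ins for the Keras iterators and for ```MixIterator```, and choosing per-iterator batch sizes of 3 and 1 is just one assumed way to get a 3:1 labeled:pseudo-labeled mix.

```python
import numpy as np

def cycle_batches(x, y, batch_size):
    """Hypothetical stand-in for a Keras batch iterator: yields (x, y) batches forever."""
    start = 0
    while True:
        rows = [(start + k) % len(x) for k in range(batch_size)]
        yield x[rows], y[rows]
        start = (start + batch_size) % len(x)

def mix_batches(iters):
    """Concatenate one batch from each iterator into a single (x, y) batch."""
    while True:
        parts = [next(it) for it in iters]
        yield (np.concatenate([p[0] for p in parts]),
               np.concatenate([p[1] for p in parts]))

# Toy data: 6 labeled rows (label 0) and 4 pseudo-labeled rows (label 1).
labeled_x, labeled_y = np.arange(12).reshape(6, 2), np.zeros(6, dtype=int)
pseudo_x, pseudo_y = -np.arange(8).reshape(4, 2), np.ones(4, dtype=int)

# Batch sizes 3 and 1 give a 3:1 labeled:pseudo-labeled mix in every batch.
mixed = mix_batches([cycle_batches(labeled_x, labeled_y, 3),
                     cycle_batches(pseudo_x, pseudo_y, 1)])
xb, yb = next(mixed)
print(xb.shape, yb)
```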

### Embeddings

In Lessons 4 and 5, we used embeddings to represent user movie ratings and trained our model to optimize these parameters through gradient descent. In Keras, the embedding matrices were passed to functions for both users and movies to generate embeddings for user/movie IDs. The given user movie ratings (raw data) were used as our third input, or the target outputs of our model. Based on the calculated loss function, the embeddings were updated. For that example, we used matrices to list each element of our embeddings (for learning purposes). However, representing them this way is impractical, as only a limited number of user/movie combinations have ratings, leaving the matrices mostly sparse.
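
To make the idea of an embedding concrete, the following NumPy sketch shows that looking up an embedding is the same as multiplying a one-hot vector by the embedding matrix. The sizes, the random values, and the dot-product scoring at the end are assumptions for illustration, not the exact model from Lessons 4 and 5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 4, 5, 3   # toy sizes, chosen for illustration

user_emb = rng.normal(size=(n_users, n_factors))    # one row of latent factors per user
movie_emb = rng.normal(size=(n_movies, n_factors))  # one row of latent factors per movie

user_id, movie_id = 2, 4

# An embedding lookup is just row indexing ...
u = user_emb[user_id]
m = movie_emb[movie_id]

# ... which gives the same result as multiplying a one-hot vector by the matrix.
one_hot_user = np.eye(n_users)[user_id]
assert np.allclose(one_hot_user @ user_emb, u)

# A dot product of the two embeddings is one simple (assumed) way to score the pair.
predicted_rating = float(u @ m)
```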

Now, back to RNNs...

## RNNs

We know RNNs are best used to keep track of memory. Let's try to build an architecture that reflects this use: keeping track of past states to predict future ones. The example we briefly discussed in Lesson 5 shows how an RNN uses state to guess the next word in a sentence. The network (shown in the diagram below) takes in an embedding for a word, passes it through two layers of transformations, then combines it with the transformed embedding of the next word. This is what gives the network a sense of state. At the final layer, a prediction for the last word in the sentence is made. Therefore, the final word is dependent on the information learned from the preceding words.

### Building a 4-Character Model

We'll now be taking advantage of Keras' functional API to construct arbitrary architectures. Let's consider a network that predicts the fourth character from inputting the first three: 

![img](https://i.imgur.com/DaZbuXZ.png)

Every <font color='green'>green</font> arrow (input to hidden layer) is performing the same action. They are all essentially weight matrices with equal dimensionality. The <font color='orange'>orange</font> arrows take the hidden state from the previous char and combine them with the hidden state of the next char. The <font color='blue'>blue</font> arrow predicts the fourth char, given a hidden state. Therefore, it's fair to assume that the task for transforming and concatenating chars 1 and 2 is the same as for chars 3 and 4, making these tasks **time shift invariant** (of course time, $t$, being sequential in nature). 
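
The weight sharing described above can be sketched in plain NumPy. This is only an illustrative forward pass under assumed toy sizes and random weights, with biases omitted; the names `W_in`, `W_hh`, and `W_out` are hypothetical labels for the green, orange, and blue arrows.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_fac, n_hidden = 10, 4, 8   # toy sizes, not the lesson's values

emb = rng.normal(size=(vocab_size, n_fac))             # character embeddings
W_in = rng.normal(size=(n_fac, n_hidden)) * 0.1        # green arrows: input -> hidden
W_hh = rng.normal(size=(n_hidden, n_hidden)) * 0.1     # orange arrows: hidden -> hidden
W_out = rng.normal(size=(n_hidden, vocab_size)) * 0.1  # blue arrow: hidden -> output

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_fourth(c1, c2, c3):
    """Unrolled forward pass: the same W_in and W_hh are reused at every step."""
    h = relu(emb[c1] @ W_in)
    h = relu(emb[c2] @ W_in) + np.tanh(h @ W_hh)   # combine new input with previous state
    h = relu(emb[c3] @ W_in) + np.tanh(h @ W_hh)
    return softmax(h @ W_out)

probs = predict_fourth(3, 1, 7)   # probability of each character being the fourth
```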

Let's start by importing all necessary libraries and completing a few configuration steps: 

In [1]:
from theano.sandbox import cuda
cuda.use('gpu1')

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)


In [2]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function

Using Theano backend.


In [3]:
# downloading Nietzsche's collected works
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read()
print('corpus length:', len(text))

corpus length: 600901


In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

total chars: 86


In [5]:
# add 0 for padding
chars.insert(0, "\0")

In [6]:
char_indices = dict((c, i) for i, c in enumerate(chars)) # map from chars to inds
indices_char = dict((i, c) for i, c in enumerate(chars)) # map from inds to chars

In [7]:
# create array of all indices for characters in corpus
idx = [char_indices[c] for c in text]

We are now ready to build our model. Ultimately, we want to predict the fourth character from a sequence of three. Let's start by creating four lists from ```idx``` (our numerical character representation of the text), each stepping through it three characters at a time and starting at offsets 0, 1, 2, and 3. The first three lists are our inputs (```x1```, ```x2```, and ```x3```) and the fourth is our output (```y```); each is stacked into a separate numpy array:

In [8]:
cs = 3
c1_dat = [idx[i] for i in xrange(0, len(idx)-1-cs, cs)]
c2_dat = [idx[i+1] for i in xrange(0, len(idx)-1-cs, cs)]
c3_dat = [idx[i+2] for i in xrange(0, len(idx)-1-cs, cs)]
c4_dat = [idx[i+3] for i in xrange(0, len(idx)-1-cs, cs)]

In [9]:
x1 = np.stack(c1_dat[:-2])
x2 = np.stack(c2_dat[:-2])
x3 = np.stack(c3_dat[:-2])

In [10]:
y = np.stack(c4_dat[:-2])
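
To see what these slices produce, here is the same indexing applied to a tiny string in plain Python 3 (so `range` replaces `xrange`). The toy text is an assumption, and the padding character and the final `[:-2]` trimming are left out for brevity.

```python
text = "hello world"
chars = sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}
idx = [char_indices[c] for c in text]

cs = 3
starts = range(0, len(idx) - 1 - cs, cs)   # 0, 3, 6 for this toy text
c1 = [idx[i] for i in starts]
c2 = [idx[i + 1] for i in starts]
c3 = [idx[i + 2] for i in starts]
c4 = [idx[i + 3] for i in starts]

# Decode each (three inputs -> one target) example back into characters.
examples = [(chars[a] + chars[b] + chars[c], chars[d])
            for a, b, c, d in zip(c1, c2, c3, c4)]
print(examples)
# [('hel', 'l'), ('lo ', 'w'), ('wor', 'l')]
```

Note that the target of one example is also the first input of the next, because the window moves by three characters while each example spans four.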

Next, we'll create an embedding for each input character, where each embedding contains 42 latent factors:

In [11]:
n_fac = 42

In [12]:
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1)(inp)
    return inp, Flatten()(emb)

In [13]:
c1_in, c1 = embedding_input('c1', vocab_size, n_fac)
c2_in, c2 = embedding_input('c2', vocab_size, n_fac)
c3_in, c3 = embedding_input('c3', vocab_size, n_fac)

We'll now focus on our input's path through the hidden state (the <font color='green'>green</font> arrow in the diagram above). ```n_hidden``` is the size of our hidden state, which we chose to be 256, and ```dense_in``` is the layer operation. Our first hidden activation, ```c1_hidden```, is the output generated from the embedding of our first character. 

In [14]:
n_hidden = 256

In [15]:
dense_in = Dense(n_hidden, activation='relu')

In [16]:
c1_hidden = dense_in(c1)

Now, we can define the layer operation from hidden to hidden (the <font color='orange'>orange</font> arrows in the diagram above). This sets us up to build the merging portions of our network using these two layers: 

In [17]:
dense_hidden = Dense(n_hidden, activation='tanh')

In [18]:
c2_dense = dense_in(c2)
hidden_2 = dense_hidden(c1_hidden)
c2_hidden = merge([c2_dense, hidden_2])

In [19]:
c3_dense = dense_in(c3)
hidden_3 = dense_hidden(c2_hidden)
c3_hidden = merge([c3_dense, hidden_3])

Finally, we can define the layer operation from hidden to output (the <font color='blue'>blue</font> arrow in the diagram). Here, our third hidden state is transformed into our final output prediction. We use this to build and train our model:

In [20]:
dense_out = Dense(vocab_size, activation='softmax')

In [21]:
c4_out = dense_out(c3_hidden)

In [22]:
model = Model([c1_in, c2_in, c3_in], c4_out)

In [23]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [26]:
#model.optimizer.lr=0.000001

Use the default learning rate to train the model. After just four epochs, our loss is 2.1662. Originally, when the learning rate was set to a very small number (```model.optimizer.lr = 0.000001```) and the model was tested on generating the fourth character in the sequence, ```get_next``` would only return ' '. 

In [24]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f48071e7510>

Let's now test our model. We'll first need the function ```get_next```, which takes in a string argument, builds an index value for every character in the input (like we did earlier for the first three characters of the text), wraps each index in a numpy array, then returns the char with the highest predicted probability: 

In [25]:
def get_next(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict(arrs)
    i = np.argmax(p)
    return chars[i]
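
The last two lines of ```get_next``` can be illustrated on their own. In this small sketch the vocabulary and the probabilities are made up; in the lesson they come from ```chars``` and ```model.predict```.

```python
import numpy as np

chars = ['\0', ' ', 'a', 'e', 'l']               # toy vocabulary
p = np.array([[0.01, 0.04, 0.10, 0.15, 0.70]])   # made-up softmax output, shape (1, vocab)

i = np.argmax(p)       # position of the largest probability
next_char = chars[i]   # the character at that position
print(next_char)
# l
```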

To test our model, we'll use the following character sequences: 'phi', ' th', and ' an'. The model is limited because it only makes predictions based on the preceding three characters, but it does a pretty good job of completing the words 'phil', 'the', and 'and'.

In [26]:
get_next('phi')

'l'

In [27]:
get_next(' th')

'e'

In [28]:
get_next(' an')

'd'

### Building an *N*-Character Model

The diagram above shows the unrolled version of a network that takes in the first three characters as input to predict the fourth character output. This time, we want to build a more arbitrary model that can predict the *n*th character given a sequence of *n*-1 characters. To do this, we'll stack all the input/hidden layers from our previous model *n*-1 times. In the diagram below, we can see this new recurrent network:

![img](https://i.imgur.com/JnwzwIJ.png)

**Note:** when stacking Keras on top of Tensorflow, RNNs can only be implemented in their unrolled form. However, Theano can implement RNNs in their recurrent form (actions boxed in <font color='red'>red</font> in the above diagram), which we will use for this build.

We will now build our *n*th character RNN (where ```cs = n-1```). In this example, let's say we want to predict the 9th character in a sequence of 8 (```cs = 8```). The input matrices, ```c_in_dat```, and desired output, ```c_out_dat```, are created and stacked in separate numpy arrays:

In [29]:
cs=8

In [30]:
c_in_dat = [[idx[i+n] for i in xrange(0, len(idx)-1-cs, cs)]
            for n in range(cs)]

In [31]:
c_out_dat = [idx[i+cs] for i in xrange(0, len(idx)-1-cs, cs)]

In [32]:
xs = [np.stack(c[:-2]) for c in c_in_dat]

In [33]:
y = np.stack(c_out_dat[:-2])

In [34]:
n_fac = 42

Similar to our 4-character model, we need to create the embeddings for each character and define our layer tasks. This time, we're keeping track of more characters. Therefore, we should expect our model to yield better results. 

In [35]:
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name+'_in')
    emb = Embedding(n_in, n_out, input_length=1, name=name+'_emb')(inp)
    return inp, Flatten()(emb)

In [36]:
c_ins = [embedding_input('c'+str(n), vocab_size, n_fac) for n in range(cs)]

In [37]:
n_hidden = 256

In [38]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu', init='identity')
dense_out = Dense(vocab_size, activation='softmax')

In [39]:
hidden = dense_in(c_ins[0][1])

In [40]:
for i in range(1,cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden])

In [41]:
c_out = dense_out(hidden)

Let's now create and test our model, as we've done before:

In [42]:
model = Model([c[0] for c in c_ins], c_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [43]:
model.fit(xs, y, batch_size=64, nb_epoch=12)

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x7f47fd47b790>

In [44]:
def get_next(inp):
    idxs = [np.array(char_indices[c])[np.newaxis] for c in inp]
    p = model.predict(idxs)
    return chars[np.argmax(p)]

In [45]:
get_next('for thos')

'e'

In [46]:
get_next('part of ')

't'

In [47]:
get_next('queens a')

'n'

Our model has successfully predicted the last character given a list of the preceding characters. This type of RNN is best suited for tasks like sentiment analysis, which uses sequences of chars/words as input. Keras has a built-in implementation of this RNN that we can use for our sequential model:

In [48]:
n_hidden, n_fac, cs, vocab_size = (256, 42, 8, 86)

In [49]:
model=Sequential([
        Embedding(vocab_size, n_fac, input_length=cs),
        SimpleRNN(n_hidden, activation='relu', inner_init='identity'),
        Dense(vocab_size, activation='softmax')
    ])

In [50]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_4 (Embedding)          (None, 8, 42)         3612        embedding_input_1[0][0]          
____________________________________________________________________________________________________
simplernn_1 (SimpleRNN)          (None, 256)           76544       embedding_4[0][0]                
____________________________________________________________________________________________________
dense_7 (Dense)                  (None, 86)            22102       simplernn_1[0][0]                
Total params: 102258
____________________________________________________________________________________________________
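
The parameter counts in the summary can be checked by hand. The short calculation below uses only the sizes from the lesson; the breakdown into weights and biases is the standard one for an embedding, a simple RNN layer, and a dense layer.

```python
vocab_size, n_fac, n_hidden = 86, 42, 256

embedding_params = vocab_size * n_fac            # one n_fac vector per character
rnn_params = (n_fac * n_hidden                   # input-to-hidden weights
              + n_hidden * n_hidden              # hidden-to-hidden weights
              + n_hidden)                        # bias
dense_params = n_hidden * vocab_size + vocab_size  # weights + bias

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)
# 3612 76544 22102 102258
```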


In [51]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [52]:
model.fit(np.concatenate(xs,axis=1), y, batch_size=64, nb_epoch=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f47f73c2510>

In [53]:
def get_next_keras(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = np.array(idxs)[np.newaxis,:]
    p = model.predict(arrs)[0]
    return chars[np.argmax(p)]

In [54]:
get_next_keras('this is ')

't'

In [55]:
get_next_keras('part of ')

't'

In [56]:
get_next_keras('queens a')

'n'

It's important to note that, so far, we've been initializing our hidden-to-hidden weight matrices as identity matrices. This makes sense for keeping track of state: the hidden-to-hidden transformation determines how the information from the previous state is transformed before being combined with the newly transformed input, and a good starting point is to pass the previous state through unchanged and let SGD optimize the transformation from there.
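
A tiny NumPy check of this property, under the assumption of a ReLU activation and a zero bias: with an identity hidden-to-hidden matrix, a non-negative hidden state passes through the transformation unchanged.

```python
import numpy as np

n_hidden = 4
W_hh = np.eye(n_hidden)              # identity initialization
b_hh = np.zeros(n_hidden)            # zero bias

h = np.array([0.5, 0.0, 2.0, 1.5])   # a non-negative state, as produced by a ReLU

h_next = np.maximum(h @ W_hh + b_hh, 0)   # hidden-to-hidden step with ReLU
assert np.array_equal(h_next, h)          # the state is carried over unchanged
```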

### Building a 2 to *N* Character Sequence Model

Let's now consider a model that returns sequences: instead of having our model predict the *n*th character from the preceding *n*-1 characters, let's have our model predict chars 2 to *n* from the preceding *n*-1 characters. Referencing the previous computational diagram, our output is now included in our <font color='red'>red</font> highlighted iteration box (see diagram below). 

![img](https://i.imgur.com/kPFsJ0Q.png)

This model will increase the number of predictions we make on our training set. Instead of making only one prediction, we are making *n*-1 times that. Because of this, our model is able to handle more long-term memory tasks. To build this model, we need to change our output, ```c_out_dat```, to a sequence.

In [57]:
c_out_dat = [[idx[i+n] for i in xrange(1, len(idx)-cs, cs)]
            for n in range(cs)]

In [58]:
ys = [np.stack(c[:-2]) for c in c_out_dat]
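
The effect of starting the output range at 1 is that every target sequence is the input sequence shifted by one character. Here is a plain Python 3 sketch on a toy index list, with `cs = 3` instead of 8 to keep it readable:

```python
idx = list(range(10))   # stand-in for the corpus indices
cs = 3

c_in = [[idx[i + n] for i in range(0, len(idx) - 1 - cs, cs)] for n in range(cs)]
c_out = [[idx[i + n] for i in range(1, len(idx) - cs, cs)] for n in range(cs)]

inputs = list(zip(*c_in))     # one tuple per input sequence
targets = list(zip(*c_out))   # one tuple per target sequence
print(inputs)
# [(0, 1, 2), (3, 4, 5)]
print(targets)
# [(1, 2, 3), (4, 5, 6)]
```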

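We can create and train our model as we've done before. However, instead of computing the initial hidden state from the first character, we will first pass in a zero vector (the hidden-to-hidden weights are still initialized as an identity matrix):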

In [59]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu', init='identity')
dense_out = Dense(vocab_size, activation='softmax', name='output')

In [60]:
inp1 = Input(shape=(n_fac,), name='zeros')
hidden = dense_in(inp1)

In [61]:
outs = []

for i in range(cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden], mode='sum')
    # every layer now has an output
    outs.append(dense_out(hidden))

In [62]:
model = Model([inp1] + [c[0] for c in c_ins], outs)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [63]:
zeros = np.tile(np.zeros(n_fac), (len(xs[0]),1))

In [64]:
model.fit([zeros]+xs, ys, batch_size=64, nb_epoch=12)

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x7f47e9442110>

When testing our model, we see that we've attained pretty good results. In the first example, when given ' this is', our model predicted a space after seeing 'this' and knew to start a new word after the space. In the second example, when given ' part of', our model accurately predicted the word 'of', with a space before and after it:

In [65]:
def get_nexts(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict([np.zeros(n_fac)[np.newaxis,:]] + arrs)
    print(list(inp))
    return [chars[np.argmax(o)] for o in p]

In [66]:
get_nexts(' this is')

[' ', 't', 'h', 'i', 's', ' ', 'i', 's']


['t', 'h', 'e', 't', ' ', 's', 'n', ' ']

In [67]:
get_nexts(' part of')

[' ', 'p', 'a', 'r', 't', ' ', 'o', 'f']


['t', 'o', 'r', 't', ' ', 'o', 'f', ' ']

We can also create this model in Keras using its sequential API. To do this, we set the parameter ```return_sequences``` to ```True``` so that the RNN layer returns an output at every time step rather than only the last one, wrap the output layer in ```TimeDistributed```, and change our targets into the necessary sequences:

In [68]:
model=Sequential([
        Embedding(vocab_size, n_fac, input_length=cs),
        SimpleRNN(n_hidden, return_sequences=True, activation='relu', inner_init='identity'),
        TimeDistributed(Dense(vocab_size, activation='softmax')),
    ])

In [69]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_5 (Embedding)          (None, 8, 42)         3612        embedding_input_2[0][0]          
____________________________________________________________________________________________________
simplernn_2 (SimpleRNN)          (None, 8, 256)        76544       embedding_5[0][0]                
____________________________________________________________________________________________________
timedistributed_1 (TimeDistribute(None, 8, 86)         22102       simplernn_2[0][0]                
Total params: 102258
____________________________________________________________________________________________________
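
The shapes in the summary follow from what the two settings do. The sketch below is a NumPy approximation, not Keras itself: `simple_rnn` is a hypothetical function with assumed toy sizes, no biases, and no final softmax, meant only to show the difference between returning the last state and returning all of them.

```python
import numpy as np

rng = np.random.default_rng(0)
cs, n_fac, n_hidden, vocab_size = 8, 4, 6, 10   # toy sizes except cs

x = rng.normal(size=(cs, n_fac))                 # one embedded input sequence
W_in = rng.normal(size=(n_fac, n_hidden)) * 0.1
W_hh = np.eye(n_hidden)
W_out = rng.normal(size=(n_hidden, vocab_size)) * 0.1

def simple_rnn(x, return_sequences=False):
    h = np.zeros(n_hidden)
    states = []
    for x_t in x:                                # one time step per character
        h = np.maximum(x_t @ W_in + h @ W_hh, 0)
        states.append(h)
    return np.stack(states) if return_sequences else h

last = simple_rnn(x)                                # shape (n_hidden,)
all_states = simple_rnn(x, return_sequences=True)   # shape (cs, n_hidden)

# A time-distributed dense layer applies the same weights at every time step.
logits = all_states @ W_out                         # shape (cs, vocab_size)
```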


In [70]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [71]:
x_rnn = np.stack(np.squeeze(xs), axis=1)
y_rnn = np.atleast_3d(np.stack(ys, axis=1))
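
The reshaping above is easier to follow on toy data. This sketch assumes each element of ```xs``` and ```ys``` is a 1-D array with one entry per sequence; the values themselves are made up.

```python
import numpy as np

n, cs = 5, 8                                         # 5 toy sequences of 8 characters
xs = [np.arange(n) + 10 * t for t in range(cs)]      # cs arrays, each of shape (n,)
ys = [np.arange(n) + 10 * t + 1 for t in range(cs)]

x_rnn = np.stack(np.squeeze(xs), axis=1)             # (n, cs): one row per sequence
y_rnn = np.atleast_3d(np.stack(ys, axis=1))          # (n, cs, 1): one label per time step
print(x_rnn.shape, y_rnn.shape)
```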

In [72]:
model.fit(x_rnn, y_rnn, batch_size=64, nb_epoch=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f47e72cf090>

In [73]:
def get_nexts_keras(inp):
    idxs = [char_indices[c] for c in inp]
    arr = np.array(idxs)[np.newaxis,:]
    p = model.predict(arr)[0]
    print(list(inp))
    return [chars[np.argmax(o)] for o in p]

In [74]:
get_nexts_keras(' this is')

[' ', 't', 'h', 'i', 's', ' ', 'i', 's']


['t', 'h', 'e', 'n', ' ', 's', 's', ' ']

### Building a Stateful Model

The models we've built so far (the 4-char, *n*-char, and 2 to *n*-char models) don't really incorporate state as much as we'd like them to. To fix this, we can't train on random batches of data. Instead, when training, we set ```shuffle=False```. Building a stateful model is useful for tasks that require long-term memory. To handle long-term dependencies, our hidden states will be passed between sequences. We'll start with our initial zero vector input, then pass our hidden state along from each sequence to the next so that it always reflects an arbitrarily long dependency. 

Constructing this model in Keras is simple: just set ```stateful=True``` when creating the recurrent layer. Then, add batch normalization and use an LSTM layer (briefly introduced in Lesson 5). Remember, normalizing our data rather than directly feeding it into our model improves convergence. Once our model generates a prediction, the data is denormalized to get “real world” results.
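
The difference between resetting and carrying the hidden state can be sketched in NumPy. The function `run` and the toy sizes are hypothetical, and a plain ReLU recurrence stands in for the LSTM; the point is only that carrying the state across two chunks gives the same result as processing the whole sequence at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, cs = 3, 4, 8
W_in = rng.normal(size=(n_in, n_hidden)) * 0.1
W_hh = np.eye(n_hidden)

def run(seq, h):
    """Apply the recurrence to every step of seq, starting from state h."""
    for x_t in seq:
        h = np.maximum(x_t @ W_in + h @ W_hh, 0)
    return h

long_seq = rng.normal(size=(2 * cs, n_in))
first, second = long_seq[:cs], long_seq[cs:]

# Stateless: every chunk starts from a zero state.
h_stateless = run(second, np.zeros(n_hidden))

# Stateful: the second chunk starts from the state left by the first.
h_stateful = run(second, run(first, np.zeros(n_hidden)))

assert np.allclose(h_stateful, run(long_seq, np.zeros(n_hidden)))
```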

In [75]:
bs = 64

In [76]:
model=Sequential([
        Embedding(vocab_size, n_fac, input_length=cs, batch_input_shape=(bs,8)),
        BatchNormalization(),
        LSTM(n_hidden, return_sequences=True, stateful=True),
        TimeDistributed(Dense(vocab_size, activation='softmax')),
    ])

In [77]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [78]:
# inputs/outputs must be an exact multiple of the fixed batch size
mx = len(x_rnn)//bs*bs

Before, our model’s hidden-to-hidden layer operations were only applied *n*-1 times. In this stateful model, they are being applied possibly hundreds of thousands of times. Therefore, our network is sensitive to exploding gradients: if the matrix is poorly scaled to even a small degree, a value that's slightly larger than the others grows exponentially, sending the activations to infinity and destabilizing the network.
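
The sensitivity to scale is easy to demonstrate. This sketch repeatedly multiplies a vector by a scaled identity matrix, leaving out the nonlinearity and the inputs as a simplifying assumption: a 1% deviation from the identity is enough to blow up or wipe out the state after 1000 steps.

```python
import numpy as np

n_hidden, steps = 4, 1000
h0 = np.ones(n_hidden)

def repeat(scale):
    """Apply a slightly mis-scaled identity matrix `steps` times and return the norm."""
    W = scale * np.eye(n_hidden)
    h = h0.copy()
    for _ in range(steps):
        h = h @ W
    return np.linalg.norm(h)

grown = repeat(1.01)    # norm grows by a factor of 1.01**1000
shrunk = repeat(0.99)   # norm shrinks by a factor of 0.99**1000
print(grown, shrunk)
```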

This instability is addressed by using an LSTM; our network now controls how much state it needs through optimization. It is important to note here that these stateful models train at a slower rate. This is expected because each sequence is passed through iteratively, making our network harder to parallelize.

In [79]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=4, shuffle=False)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f47caead250>

In [80]:
model.optimizer.lr=1e-4

In [81]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=4, shuffle=False)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f47cb9200d0>

In [344]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=4, shuffle=False)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fa1773b8d50>

### Building an RNN in Theano

To really understand how we implemented these models in Keras, we will learn how to build an RNN in pure Theano. In the future, we want to build more advanced models, so it's important to build them from scratch to better understand and debug them in this lower-level framework. The same can be done using Tensorflow as the backend.

Let's first think of how to construct the task operations we're going to need for our model. To build our layers (input to hidden, hidden to hidden, and hidden to output), we'll need to create our weight matrices and bias vectors from scratch. ```shared``` is used to tell Theano that it should manage this data itself, copying it to and from the GPU when necessary. Here, each layer's weights and biases are returned as a tuple:

In [82]:
n_input = vocab_size
n_output = vocab_size

In [83]:
def init_wgts(rows, cols): 
    scale = math.sqrt(2/rows)
    return shared(normal(scale=scale, size=(rows, cols)).astype(np.float32))
def init_bias(rows): 
    return shared(np.zeros(rows, dtype=np.float32))

In [84]:
def wgts_and_bias(n_in, n_out): 
    return init_wgts(n_in, n_out), init_bias(n_out)
def id_and_bias(n): 
    return shared(np.eye(n, dtype=np.float32)), init_bias(n)
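
To get a feel for the initialization scale, here is a NumPy-only version of the same idea. The name `init_wgts_np` is hypothetical, and Theano's ```shared``` wrapper is left out. A standard deviation of `sqrt(2 / rows)` corresponds to what is commonly called He initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_wgts_np(rows, cols):
    """Random normal weights with standard deviation sqrt(2 / rows)."""
    scale = np.sqrt(2 / rows)
    return rng.normal(scale=scale, size=(rows, cols)).astype(np.float32)

W = init_wgts_np(256, 256)
b = np.zeros(256, dtype=np.float32)      # biases start at zero
W_id = np.eye(256, dtype=np.float32)     # identity matrix for hidden-to-hidden weights

print(W.std(), np.sqrt(2 / 256))         # the two values should be close
```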

In Theano, our variables must be declared before use; no computations are done before our functions are compiled and evaluated. Below, we will declare our matrices, vectors, and scalars and group them in ```all_args```. Next, we can use the functions above to manually initialize the weights and biases for the hidden (```W_h```), input (```W_x```), and output (```W_y```) layers before combining them in a single list (```w_all```). 

In [85]:
t_inp = T.matrix('inp')
t_outp = T.matrix('outp')
t_h0 = T.vector('h0')
lr = T.scalar('lr')

all_args = [t_h0, t_inp, t_outp, lr]

In [86]:
W_h = id_and_bias(n_hidden)
W_x = wgts_and_bias(n_input, n_hidden)
W_y = wgts_and_bias(n_hidden, n_output)
w_all = list(chain.from_iterable([W_h, W_x, W_y]))

Now that we've declared our inputs and initialized our weights, we need to tell Theano what needs to happen in each step (a single forward pass for one character) of our RNN. Our ```step``` function calculates the hidden activations and output. Then, ```theano.scan``` calls this function once for each element of the input sequence, given the initial values of the outputs (```outputs_info```) and all other arguments (```non_sequences```):

In [87]:
def step(x, h, W_h, b_h, W_x, b_x, W_y, b_y):
    # Calculate the hidden activations
    h = nnet.relu(T.dot(x, W_x) + b_x + T.dot(h, W_h) + b_h)
    # Calculate the output activations
    y = nnet.softmax(T.dot(h, W_y) + b_y)
    # Return both (the 'Flatten()' is to work around a theano bug)
    return h, T.flatten(y, 1)

In [88]:
[v_h, v_y], _ = theano.scan(step, sequences=t_inp, 
                            outputs_info=[t_h0, None], non_sequences=w_all)
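
For intuition, the loop that ```theano.scan``` performs here can be written out in NumPy. This is a sketch with assumed toy sizes and a hypothetical `scan` helper, not Theano's implementation: it computes values eagerly instead of building a symbolic graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n_input = n_output = 10     # toy vocabulary size
n_hidden, cs = 6, 8

W_h, b_h = np.eye(n_hidden), np.zeros(n_hidden)
W_x = rng.normal(scale=np.sqrt(2 / n_input), size=(n_input, n_hidden))
b_x = np.zeros(n_hidden)
W_y = rng.normal(scale=np.sqrt(2 / n_hidden), size=(n_hidden, n_output))
b_y = np.zeros(n_output)

def step(x, h):
    # Hidden activations from the current input and the previous state
    h = np.maximum(x @ W_x + b_x + h @ W_h + b_h, 0)
    # Softmax output activations
    z = h @ W_y + b_y
    e = np.exp(z - z.max())
    return h, e / e.sum()

def scan(seq, h0):
    """Call step once per row of seq, feeding each new state into the next call."""
    h, hs, ys = h0, [], []
    for x in seq:
        h, y = step(x, h)
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

seq = np.eye(n_input)[rng.integers(0, n_input, size=cs)]   # one-hot inputs, (cs, n_input)
v_h, v_y = scan(seq, np.zeros(n_hidden))
```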

After we've completed one forward pass, we must update our weights by first calculating our loss function. Then, we can perform SGD by storing our updates from each forward pass in a dictionary using ```upd_dict```. We can use Theano functions to do this; categorical cross-entropy will help us calculate error given our step function outputs. ```T.grad``` is then used to find the gradient of our error function before updating our parameters with our learning rate, ```lr```.

In [89]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)

In [90]:
def upd_dict(wgts, grads, lr): 
    return OrderedDict({w: w-g*lr for (w,g) in zip(wgts,grads)})

upd = upd_dict(w_all, g_all, lr)
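
The update rule `w - g * lr` can be tried out on a single softmax layer in NumPy. In this sketch the gradient is derived by hand rather than with ```T.grad```, and the sizes, the fixed hidden activation, and the target are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_output = 6, 10

W = rng.normal(scale=0.1, size=(n_hidden, n_output))
b = np.zeros(n_output)
h = rng.normal(size=n_hidden)        # a fixed hidden activation
target = np.eye(n_output)[3]         # one-hot target

def forward(W, b):
    z = h @ W + b
    p = np.exp(z - z.max())
    p /= p.sum()
    return p, -np.sum(target * np.log(p))   # softmax output, cross-entropy loss

p, loss_before = forward(W, b)

# For softmax followed by cross-entropy, the gradient w.r.t. the logits is (p - target).
g_z = p - target
g_W, g_b = np.outer(h, g_z), g_z

lr = 0.01
W, b = W - g_W * lr, b - g_b * lr    # same rule as upd_dict: w - g * lr

_, loss_after = forward(W, b)
print(loss_before, loss_after)       # the loss should go down slightly
```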

We now have our loss, gradient, and step update functions and are ready to compile them using ```function```, which gives us a function that returns the error and applies the updates. After each call, our weights are updated through this manual form of SGD. The function is then called iteratively for each sequence in our training data. With each pass, the loss function is calculated and used to update the parameters. Here, Jeremy prints the average error every thousandth iteration to show how the network is improving:

In [91]:
fn = theano.function(all_args, error, updates=upd, allow_input_downcast=True)

In [95]:
oh_ys = [to_categorical(o, vocab_size) for o in ys]
oh_y_rnn=np.stack(oh_ys, axis=1)

oh_xs = [to_categorical(o, vocab_size) for o in xs]
oh_x_rnn=np.stack(oh_xs, axis=1)

X = oh_x_rnn
Y = oh_y_rnn
X.shape, Y.shape

((75110, 8, 86), (75110, 8, 86))
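
The one-hot encoding step can be reproduced with NumPy alone. Here indexing `np.eye` is used as a stand-in for ```to_categorical```, on made-up toy indices:

```python
import numpy as np

vocab_size = 5
xs = [np.array([1, 4, 0]), np.array([2, 2, 3])]   # 2 time steps, 3 sequences

oh_xs = [np.eye(vocab_size)[x] for x in xs]        # each (3, vocab_size)
oh_x_rnn = np.stack(oh_xs, axis=1)                 # (3, 2, vocab_size)

# argmax along the last axis recovers the original indices
recovered = oh_x_rnn.argmax(axis=2)
print(oh_x_rnn.shape)
# (3, 2, 5)
```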

In [96]:
err=0.0; l_rate=0.01
for i in range(len(X)): 
    err+=fn(np.zeros(n_hidden), X[i], Y[i], l_rate)
    if i % 1000 == 999: 
        print ("Error:{:.3f}".format(err/1000))
        err=0.0

Error:25.158
Error:21.460
Error:20.904
Error:19.906
Error:18.812
Error:19.281
Error:19.061
Error:18.505
Error:17.926
Error:18.193
Error:17.430
Error:17.620
Error:18.409
Error:17.324
Error:16.794
Error:17.751
Error:17.322
Error:17.197
Error:16.828
Error:16.665
Error:16.520
Error:16.429
Error:16.676
Error:16.197
Error:16.835
Error:16.559
Error:16.144
Error:16.295
Error:16.285
Error:16.415
Error:16.727
Error:16.439
Error:16.713
Error:16.358
Error:15.976
Error:16.674
Error:15.975
Error:16.386
Error:16.063
Error:16.259
Error:15.349
Error:15.734
Error:15.723
Error:16.061
Error:15.973
Error:15.910
Error:15.658
Error:16.064
Error:15.978
Error:16.110
Error:15.217
Error:15.606
Error:15.022
Error:14.843
Error:15.621
Error:15.380
Error:14.701
Error:15.558
Error:15.103
Error:15.032
Error:14.977
Error:15.397
Error:15.325
Error:15.122
Error:14.672
Error:14.806
Error:14.297
Error:14.682
Error:15.249
Error:14.802
Error:15.153
Error:14.698
Error:14.446
Error:14.519
Error:14.449


We've now successfully built an RNN from scratch in Theano! Let's use our model to make predictions. We'll define another Theano function that takes in our initial hidden state and an input character sequence and returns the predicted output probabilities for each character.

In [97]:
f_y = theano.function([t_h0, t_inp], v_y, allow_input_downcast=True)

In [98]:
act = np.argmax(X[6], axis=1)

In [99]:
[indices_char[o] for o in act]

['t', 'h', 'e', 'n', '?', ' ', 'I', 's']

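Here, ```act``` holds the actual characters of the seventh training sequence; the model's predictions for the same sequence can be obtained with ```np.argmax(f_y(np.zeros(n_hidden), X[6]), axis=1)```. The model does pretty well when you consider, for example, how it knows to follow a punctuation mark with a space, then begin a new word with a capitalized letter. 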