# Lesson 7: CNN Architectures and RNNs

In this lesson, we will use what we’ve learned so far in the course to explore other CNN architectures and how they can be used to enter the [Nature Conservancy Fisheries Monitoring Kaggle Competition](https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring). The pre-trained convolutional layers will not be altered in the neural networks used for this lesson. So, the training data will simply be the output of the convolutional layers. Then, we’ll review our discussion of backpropagation by building a simple RNN in python and a more complex RNN known as the GRU.  

The first CNN architecture we’ll cover is ResNet50.

## ResNet50

We know CNNs are stacked layers of linear and non-linear operations that are trained to identify features from given inputs to generate predicted outputs. The deeper the network, the more difficult it is to train and the greater the issue of exploding gradients becomes. This is addressed through the addition of normalization layers, where networks with tens of thousands of layers can converge through SGD with backpropagation. Once the layers converge, the issue of **degradation** arises, where accuracy is easily oversaturated before it rapidly degrades. This degradation is not caused by overfitting and adding more layers to an already deep model leads to higher training errors.

**Residual learning** attempts to solve these issues by directly connecting input from the $n$th layer to some $n+i$th layer using the identity function, $id(x) = x$. These identity connections enable the layers to learn incremental, or residual, features. In the image below, taken from Microsoft Research's paper on [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385.pdf), $F(x)$ is the learned residual mapping of (possibly) multiple convolutional layers. The operation $F(x) + x$ is performed using a shortcut connection and element-wise addition that is formulated through the feedforward neural networks (skipping one or more layers). These shortcut connections simply perform identity mappings and don't add to the computational complexity of the model.   

![img](https://i.imgur.com/QPUe5lB.png[/img])

So, how does residual learning compare to VGG? The networks are similar in that they use 3x3 convolutions, activation layers, and max pooling. When using VGG, our output $y = c(c(c(x)))$ is a convolution of a convolution of a convolution of some input $x$. However, for ResNet, these convolutions of convolutions of convolutions are what make up an identity block, sequentially indexed in steps of time $t$. The hidden layer at time $t$ then becomes $h(t+1) = c(c(c(h(t)))) + h(t)$. 

All of the layers of the ResNet architecture are built so that they gradually improve by modeling how each layer differs from the next, a method also known as **boosting**. Dimensionality here remains constant. To build deeper networks, the weights applied to each layer are backpropagated through an indentity without having to worry about exploding gradients. The hidden layer $h(t)$ can then be subtracted to get the residual. This leaves us with the convolution of the convolution of the convolution of $h(t)$. 

We will now add **ResNet50** to our cats vs. dogs image classification model, which is a 50 layer residual network. We’ll be using the network for image classification the same way we’ve used VGG16 in previous lessons. There are other variants of the network, like ResNet101 and ResNet152, that we will not discuss in this lesson. 

Let's start by importing all necessary modules and creating our ResNet50 model object...

In [1]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function
import resnet50; reload(resnet50)
from resnet50 import Resnet50

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
Using Theano backend.


In [2]:
rn0 = Resnet50(include_top=False).model

The `include_top=False` parameter ensures that only the convolutional layers are included in the model, which gives us the freedom to attach our own classification layers afterwards. We can see that this is accounted for in a snippet from the `create` function below:

```python
if include_top:
    x = AveragePooling2d((7,7), name='avg_pool')(x)
    x = Flatten()(x)
    x = Dense(1000, activation='softmax', name='fc1000')(x)
    name = 'rn50.h5'
else:
    name = 'resnet_nt.h5'
```

We can now get our batches and precompute the convolutional features: 

In [3]:
path = 'data/dogscats/'
batch_size = 64
model_path = 'data/dogscats/models/'

In [4]:
batches = get_batches(path+'train', shuffle=False, batch_size=batch_size)
val_batches = get_batches(path+'valid', batch_size=batch_size*2, shuffle=False)
(val_classes, trn_classes, val_labels, trn_labels, val_filenames, filenames, test_filenames) = get_classes(path)

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.


In [5]:
val_features = rn0.predict_generator(val_batches, val_batches.nb_sample)
trn_features = rn0.predict_generator(batches, batches.nb_sample)

In [6]:
save_array(model_path + 'trn_rn0_conv.bc', trn_features)
save_array(model_path + 'val_rn0_conv.bc', val_features)

We'll now stick our layers on top of our ResNet (as we've done in the past with our VGG models) and finetune for cats and dogs: 

In [7]:
def get_fc_layers(p):
    return [
        BatchNormalization(axis=1, input_shape=rn0.output_shape[1:]),
        Flatten(),
        Dropout(p),
        Dense(1024, activation='relu'),
        BatchNormalization(),
        Dropout(p/2),
        Dense(1024, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(2, activation='softmax')
    ]

In [8]:
model = Sequential(get_fc_layers(0.5))
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
batchnormalization_1 (BatchNormal(None, 2048, 7, 7)    4096        batchnormalization_input_1[0][0] 
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 100352)        0           batchnormalization_1[0][0]       
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 100352)        0           flatten_1[0][0]                  
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1024)          102761472   dropout_1[0][0]                  
___________________________________________________________________________________________

In [9]:
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

In [10]:
model.fit(trn_features, trn_labels, nb_epoch=2, batch_size=batch_size, validation_data=(val_features, val_labels))

Train on 23000 samples, validate on 2000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fe435911550>

In just 92 seconds, our model has been able to achieve a validation accuracy of 98.4%, which would've ranked pretty high on the competition's leaderboard. As a comparison, my [kaggle submission](https://github.com/fdaham/fastai/blob/master/dogs_cats_kaggle.ipynb) using the VGG16 model achieved an accuracy of 98.5% and would've gotten 347/1314 on the leaderboard. 

## Other CNN Architectures

### The Nature Conservancy Fisheries Monitoring Kaggle Competition 

More CNN architectures are used and discussed in [my submission](https://github.com/fdaham/fastai/blob/master/fisheries_kaggle.ipynb) of [the Nature Conservancy Fisheries Monitoring Kaggle Competition](https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/). The notebook covers the following architectures:

* Data Leakage
* Bounding Boxes
* Multi-Output Models
* Fully Convolutional Networks
* Heatmaps
* Inception CNNs
* Psuedolabeling
* Global Average Pooling

This notebook is the last we'll discuss of CNNs for this part of the course. For the remainder of the lesson, we will review what we know about RNNs and build a more complicated RNN called the GRU.

## Reviewing RNNs

In [Lesson 6](https://github.com/fdaham/fastai/blob/master/lesson6.ipynb), we built an RNN in Theano, but hadn't yet explored the mechanics of how the gradients were being calculated as we allowed Theano to handle all of that for us. For educational purposes, we'll be covering the subject in this lesson as we build an RNN in pure python using NumPy. But, why are we doing this for an RNN? Well, RNNs handle the more difficult cases of backpropagation. If the mechanics are understood here, they'll be easier to understand in simpler cases elsewhere, like when dealing with CNNs.

### Building an RNN in Pure Python 

Let's begin by creating the functions needed to calculate the Sigmoid and ReLU activations as well as the functions used to calculate their gradients (all functions ending with `_d`). Here's a simple derivation of the sigmoid derivative function:

\begin{equation*}
sig(x) = \frac{1}{1 + e^{-x}} \to \frac{1}{sig(x)} = 1 + e^{-x}
\end{equation*}
\begin{equation*}
\frac{d}{dx}\frac{1}{sig(x)} = \frac{-\frac{d}{dx}sig(x)}{sig(x)^{2}} , \frac{d}{dx}(1+e^{-x}) = -e^{-x} = 1 - \frac{1}{sig(x)} = \frac{sig(x) - 1}{sig(x)}
\end{equation*}
\begin{equation*}
\frac{-\frac{d}{dx}sig(x)}{sig(x)^{2}} = \frac{sig(x) - 1}{sig(x)} \to \frac{d}{dx}sig(x) = sig(x)(1-sig(x))
\end{equation*}

In [11]:
def sigmoid(x): return 1/(1+np.exp(-x))
def sigmoid_d(x): 
    output = sigmoid(x)
    return output*(1-output)

For ReLU, the function is defined as $f(x) = max(0, x)$, or: 

$$
f(x) = \left\{
        \begin{array}{ll}
            x, & \quad x > 0 \\
            0, & \quad otherwise
        \end{array}
    \right.
$$

The piece-wise derivatives of the function are $f'(x > 0) = 1$ and $f'(x < 0) = 0$. The derivative $f'(x=0)$ is undefined and set to be 0.

$$
f'(x) = \left\{
        \begin{array}{ll}
            1, & \quad x > 0 \\
            0, & \quad otherwise
        \end{array}
    \right.
$$

In [12]:
def relu(x): return np.maximum(0., x)
def relu_d(x): return (x > 0.)*1.

The same is done for the Euclidean distance function $f(a,b) = (a-b)^{2}$, with the simple derivative $f'(a,b) = 2(a-b)$...

In [13]:
def dist(a,b): return pow(a-b,2)
def dist_d(a,b): return 2*(a-b)

...as well as for cross-entropy and softmax activation functions:

In [14]:
import pdb

In [15]:
eps = 1e-7
def x_entropy(pred, actual): 
    return -np.sum(actual * np.log(np.clip(pred, eps, 1-eps)))
def x_entropy_d(pred, actual): return -actual/pred

In [16]:
def softmax(x): return np.exp(x)/np.exp(x).sum()

In [17]:
def softmax_d(x):
    sm = softmax(x)
    res = np.expand_dims(-sm,-1)*sm
    res[np.diag_indices_from(res)] = sm*(1-sm)
    return res

We'll also need to define our own scan function, which will allow us to walk through and apply a function to each element of a sequence one step at a time. Since we're not worrying about parallelization, this should be very simple to implement. At each time step, we'll pass the next element of the sequence as the parameters to the function as well as the result from the previous run.  

In [18]:
def scan(fn, start, seq):
    res = []
    prev = start
    for s in seq:
        app = fn(prev, s)
        res.append(app)
        prev = app
    return res

We're now going to train our RNN on the Nietzsche corpus with one-hot encoding. To start, we'll need to download the collected works of Nietzsche.

In [19]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read()
print('corpus length:', len(text))

corpus length: 600901


In [20]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

total chars: 86


We'll also do what we've done before in [Lesson 6](https://github.com/fdaham/fastai/blob/master/lesson6.ipynb) to initialize the appropriate variables:

In [21]:
# map chars to indices then back to chars
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [22]:
# convert all chars to their index based on mapping above
idx = [char_indices[c] for c in text]

In [23]:
# create inputs
cs = 8
c_in_dat = [[idx[i+n] for i in xrange(0, len(idx)-1-cs, cs)]
            for n in range(cs)]
c_out_dat = [[idx[i+n] for i in xrange(1, len(idx)-cs, cs)]
            for n in range(cs)]

In [24]:
xs = [np.stack(c[:-2]) for c in c_in_dat]
ys = [np.stack(c[:-2]) for c in c_out_dat]

In [25]:
oh_xs = [to_categorical(o, vocab_size) for o in xs]
oh_x_rnn = np.stack(oh_xs, axis=1)
oh_ys = [to_categorical(o, vocab_size) for o in ys]
oh_y_rnn=np.stack(oh_ys, axis=1)

Now, let's define our data and shape:

In [26]:
inp = oh_x_rnn
outp = oh_y_rnn
n_input = vocab_size
n_output = vocab_size
# get shape
inp.shape, outp.shape

((75110, 8, 86), (75110, 8, 86))

Now, we can define our forward and backward pass functions. The forward pass will be pretty similar to our simple Theano RNN, but the backward pass will be a little different. Initially, we will do a forward pass of each character in the corpus by applying the function appropriately. The function that returns the hidden state after a single forward pass of the RNN for a single character is defined below. 

In [27]:
def one_char(prev, item):
    # previous state
    tot_loss, pre_hidden, pre_pred, hidden, ypred = prev
    # current inputs and output
    x, y = item
    pre_hidden = np.dot(x,w_x) + np.dot(hidden,w_h)
    hidden = act(pre_hidden)
    pre_pred = np.dot(hidden,w_y)
    ypred = softmax(pre_pred)
    return (tot_loss+loss(ypred, y), pre_hidden, pre_pred, hidden, ypred)

To get the hidden state, we'll need to apply the input weight matrix to $x$ and the hidden weight matrix to the previous hidden state, then add them element-wise before passing them through our activation. We'll then need to generate a prediction using the new hidden state with the output weight matrix and pass the resulting output through the softmax activation. 

Our function will keep track of the total loss by adding our current calculated loss to the previous loss for these predictions and labels. In addition, we'll also keep track of the weight matrices for backpropagation (`pre_hidden` and `pre_pred`), the new hidden state (`hidden`) for the next forward pass, and our predictions (`ypred`).  

Then, we'll apply the scan function to the above for a sequence of characters:

In [28]:
def get_chars(n): return zip(inp[n], outp[n])
def one_fwd(n): return scan(one_char, (0,0,0,np.zeros(n_hidden),0), get_chars(n))

For the backward pass, we'll need to find the partial derivative of the output with respect to the input. This will let us know at what rate changing our input effects our output. We can use the diagram below to visualize this: 

![img](https://i.imgur.com/ADkWm0p.png[/img])

Our final output, the last activation layer, is combined with our total loss, or the cumulative sum of each of the losses from the characters at each activation layer. If we want the derivative of the loss with respect to the derivative of the hidden activation, we would have to take the derivative of the loss with respect to the output activation and multiply it with the derivative of the hidden activation. 

We'll start with one of our inputs and one of our outputs, then go backwards through each of the characters in our sequence. The derivative of the loss will take us from the loss to the final output activation. The derivative of the softmax will get us from the output to the hidden layer through the other side of the activation function (the connecting blue arrow). 

To apply this process throughout the entire network, we'll use the chain rule to loop through every element of the sequence, iteratively applying the partial derivatives at each step using a learning rate (`alpha`) and accumulating the gradients across the sequence. The idea here is that we're trying to reverse all of the transformations and activation functions done in the forward pass. 

In [29]:
# "columnify" a vector
def col(x): return x[:,newaxis]

def one_bkwd(args, n):
    global w_x,w_y,w_h

    i=inp[n]  # 8x86
    o=outp[n] # 8x86
    d_pre_hidden = np.zeros(n_hidden) # 256
    for p in reversed(range(len(i))):
        totloss, pre_hidden, pre_pred, hidden, ypred = args[p]
        x=i[p] # 86
        y=o[p] # 86
        d_pre_pred = softmax_d(pre_pred).dot(loss_d(ypred,y))  # 86
        d_pre_hidden = (np.dot(d_pre_hidden, w_h.T) 
                        + np.dot(d_pre_pred,w_y.T)) * act_d(pre_hidden) # 256

        # outputs - d(loss)/d(w_y) = d(loss)/d(pre_pred) * d(pre_pred)/d(w_y)
        w_y -= col(hidden) * d_pre_pred * alpha
        # hiddens - d(loss)/d(w_h) = d(loss)/d(pre_hidden[p-1]) * d(pre_hidden[p-1])/d(w_h)
        if (p>0): w_h -= args[p-1][3].dot(d_pre_hidden) * alpha
        w_x -= col(x)*d_pre_hidden * alpha
    return d_pre_hidden

We'll initialize our weight matrices as we would normally. To keep this example as simple as possible, we won't include bias.

In [30]:
n_hidden, n_fac, cs, vocab_size = (256, 42, 8, 86)

In [31]:
scale = math.sqrt(2./n_input)
w_x = normal(scale=scale, size=(n_input, n_hidden))
w_y = normal(scale=scale, size=(n_hidden, n_output))
w_h = np.eye(n_hidden, dtype=np.float32)

Now that our forward and backward steps have been defined and our weight matrices have been initialized, we can loop through our dataset and train our network: 

In [32]:
X = oh_x_rnn
Y = oh_y_rnn
X.shape, Y.shape

((75110, 8, 86), (75110, 8, 86))

In [33]:
act = relu
act_d = relu_d
loss = x_entropy
loss_d = x_entropy_d

In [34]:
overallError = 0
alpha = 0.0001
for n in range(10000):
    res = one_fwd(n)
    overallError += res[-1][0]
    deriv = one_bkwd(res, n)
    if(n % 1000 == 999):
        print ("Error:{:.4f}; Gradient:{:.5f}".format(
                overallError/1000, np.linalg.norm(deriv)))
        overallError = 0

Error:36.0060; Gradient:2.91292
Error:35.2746; Gradient:4.00893
Error:33.3053; Gradient:4.00392
Error:30.9275; Gradient:3.23169
Error:29.5900; Gradient:3.95175
Error:29.3541; Gradient:3.53137
Error:28.5532; Gradient:3.95373
Error:28.0482; Gradient:3.43233
Error:27.7101; Gradient:3.87015
Error:27.6855; Gradient:2.81146


Success! We can see that our network is minimizing the loss function after each step.

Let's move on to more complex RNNs...


## Advanced RNNs

We briefly mentioned LSTMs in [Lesson 5](https://github.com/fdaham/fastai/blob/master/lesson5.ipynb). We won't have time to talk about LSTMs as we'll be finishing off this last lesson of part 1 of the course by discussing **GRUs**, or Gated Recurrent Units. Both techniques are used when building RNNs and are meant to prevent gradient explosions. However, GRUs are much simpler to implement than LSTMs.

![img](https://i.imgur.com/6IAMUhR.png[/img])

### The GRU

In the diagram of the GRU above, we see an input is fed through a hidden state $h$ to produce an output. This logic seems normal, but what happens within the hidden state is what makes the GRU unique to a simple RNN. The hidden state updates itself by going through a weight matrix and activation function after passing through a reset gate $r$, which is like a mini neural network. The gate outputs some number between 0 and 1 and multiplies it with the input. This allows the network to either forget or remember the hidden state. 

How should we know whether or not we want to remember or forget the hidden state? We don't. That's why we have the neural network, which will learn a set of weights to decide when to forget what it knows. For all nonzero entries, the reset gate will reach the new value of the hidden state $\tilde{h}$ after being reset. This then goes back to the top bit to meet the old hidden state with another gate, the update gate $z$ (another mini neural network). This gate decides how much of each state it will keep. If the value of the gate is 1, then the update will come purely from the previous hidden state. If it's 0, it will come purely from $\tilde{h}$. Otherwise, it'll be determined again by another neural network.

This is all already implemented in Theano. The Theano GRU looks just like the simple Theano RNN except for the use of the reset and update gates. We'll need separate weights for not just our input, hidden, and output, but also for our reset and update gates: 

In [35]:
def wgts_and_bias(n_in, n_out): 
    return init_wgts(n_in, n_out), init_bias(n_out)
def id_and_bias(n): 
    return shared(np.eye(n, dtype=np.float32)), init_bias(n)
def init_bias(rows): 
    return shared(np.zeros(rows, dtype=np.float32))
def init_wgts(rows, cols): 
    scale = math.sqrt(2/rows)
    return shared(normal(scale=scale, size=(rows, cols)).astype(np.float32))

In [36]:
W_h = id_and_bias(n_hidden)
W_x = init_wgts(n_input, n_hidden)
W_y = wgts_and_bias(n_hidden, n_output)
rW_h = init_wgts(n_hidden, n_hidden)
rW_x = wgts_and_bias(n_input, n_hidden)
uW_h = init_wgts(n_hidden, n_hidden)
uW_x = wgts_and_bias(n_input, n_hidden)
w_all = list(chain.from_iterable([W_h, W_y, uW_x, rW_x]))
w_all.extend([W_x, uW_h, rW_h])

Let's now define our gates, which will just be a sigmoid applied to the addition of the dot products of the input vectors:

In [37]:
def gate(x, h, W_h, W_x, b_x):
    return nnet.sigmoid(T.dot(x, W_x) + b_x + T.dot(h, W_h))

Our step function will be nearly identical to what we've used before except that we'll multiply our hidden state by our reset gate before updating our hidden state based on the update gate.

In [38]:
def step(x, h, W_h, b_h, W_y, b_y, uW_x, ub_x, rW_x, rb_x, W_x, uW_h, rW_h):
    reset = gate(x, h, rW_h, rW_x, rb_x)
    update = gate(x, h, uW_h, uW_x, ub_x)
    h_new = gate(x, h * reset, W_h, W_x, b_h)
    h = update*h + (1-update)*h_new
    y = nnet.softmax(T.dot(h, W_y) + b_y)
    return h, T.flatten(y, 1)

We've now intialized the weights for the respective gates and implemented the gates into our hidden loop. The great thing about GRUs is their ability to learn from the weights whether or not they should throw away or keep a hidden state. These extra degrees of freedom allow SGD to yield better results. 

Everything from here on is identical to our simple Theano RNN:

In [39]:
t_inp = T.matrix('inp')
t_outp = T.matrix('outp')
t_h0 = T.vector('h0')
lr = T.scalar('lr')
all_args = [t_h0, t_inp, t_outp, lr]

In [40]:
[v_h, v_y], _ = theano.scan(step, sequences = t_inp, outputs_info = [t_h0, None], non_sequences = w_all)

In [41]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)

In [42]:
def upd_dict(wgts, grads, lr): 
    return OrderedDict({w: w-g*lr for (w,g) in zip(wgts,grads)})

In [43]:
upd = upd_dict(w_all, g_all, lr)
fn = theano.function(all_args, error, updates=upd, allow_input_downcast=True)

In [44]:
err = 0.0; l_rate = 0.1
for i in range(len(X)): 
    err+=fn(np.zeros(n_hidden), X[i], Y[i], l_rate)
    if i % 10000 == 9999: 
        l_rate *= 0.95
        print ("Error:{:.2f}".format(err/10000))
        err=0.0

Error:21.27
Error:18.61
Error:17.59
Error:17.45
Error:16.88
Error:16.33
Error:16.02


We'll now combine the weights. To make things simpler and faster, we'll concatenate the hidden and input matrices together:

In [45]:
W = (shared(np.concatenate([np.eye(n_hidden), normal(size=(n_input, n_hidden))])
            .astype(np.float32)), init_bias(n_hidden))
rW = wgts_and_bias(n_input+n_hidden, n_hidden)
uW = wgts_and_bias(n_input+n_hidden, n_hidden)
W_y = wgts_and_bias(n_hidden, n_output)
w_all = list(chain.from_iterable([W, W_y, uW, rW]))

In [46]:
def gate(m, W, b): return nnet.sigmoid(T.dot(m, W) + b)

In [47]:
def step(x, h, W, b, W_y, b_y, uW, ub, rW, rb):
    m = T.concatenate([h, x])
    reset = gate(m, rW, rb)
    update = gate(m, uW, ub)
    m = T.concatenate([h*reset, x])
    h_new = gate(m, W, b)
    h = update*h + (1-update)*h_new
    y = nnet.softmax(T.dot(h, W_y) + b_y)
    return h, T.flatten(y, 1)

In [48]:
[v_h, v_y], _ = theano.scan(step, sequences=t_inp, 
                            outputs_info=[t_h0, None], non_sequences=w_all)

In [49]:
def upd_dict(wgts, grads, lr): 
    return OrderedDict({w: w-g*lr for (w,g) in zip(wgts,grads)})

In [50]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)

In [51]:
upd = upd_dict(w_all, g_all, lr)
fn = theano.function(all_args, error, updates=upd, allow_input_downcast=True)

In [52]:
err=0.0; l_rate=0.01
for i in range(len(X)): 
    err+=fn(np.zeros(n_hidden), X[i], Y[i], l_rate)
    if i % 10000 == 9999: 
        print ("Error:{:.2f}".format(err/10000))
        err=0.0

Error:21.29
Error:19.51
Error:18.51
Error:18.24
Error:17.72
Error:17.05
Error:16.66
