This is a demo of an implementation of a multilayer perceptron (aka artificial neural network) with no hidden layers, which makes the model a softmax regression classifier. Reverse-mode automatic differentiation ([1](http://cs231n.github.io/optimization-2/), [2](https://cs224d.stanford.edu/notebooks/vanishing_grad_example.html)) is used to compute the derivative of the loss function with respect to the parameters `W` and `B`.
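
To make the idea of reverse-mode differentiation concrete, here is a minimal scalar sketch. It is not the `autodiff` module used in this demo; the names `Var` and `backward` are hypothetical and exist only for illustration. Each node stores its inputs together with the local derivative with respect to each input, and the backward pass applies the chain rule from the output towards the leaves.

```python
class Var(object):
    """A scalar node in a computation graph (hypothetical, for illustration)."""

    def __init__(self, value, inputs=()):
        self.value = value
        # Each entry is (input_node, local derivative d(self)/d(input_node)).
        self.inputs = inputs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    # Topological order: a node is processed only after every node that
    # consumes it has already contributed to its gradient.
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for input_node, _ in node.inputs:
            visit(input_node)
        order.append(node)

    visit(output)
    output.grad = 1.0  # seed: d(output)/d(output) = 1
    for node in reversed(order):
        for input_node, local in node.inputs:
            input_node.grad += node.grad * local  # chain rule


x, w, b = Var(3.0), Var(2.0), Var(1.0)
s = x * w + b
backward(s)
# After the backward pass: x.grad holds ds/dx = w, w.grad holds ds/dw = x,
# and b.grad holds ds/db = 1.
```

A single backward pass yields the derivative with respect to every input at once, which is why reverse mode suits a scalar loss with many parameters.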

The neural network was trained and evaluated on the MNIST hand-written digit dataset.

The logits $S$ are defined as
$$S = X\cdot W + B,$$

where $X$ is $n$-by-$p$, $W$ is $p$-by-$k$, and $B$ is $1$-by-$k$ (the same bias row is added to every row of $X\cdot W$). In the case of MNIST, $p$ is 784 ($28 \times 28$ pixels), $k$ is 10 (10 digits), and $n$ is the number of training examples.
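
As a quick shape check, the following NumPy sketch computes the logits for a handful of random inputs. The values and the small `n` are made up for illustration; only `p` and `k` match MNIST. It also shows that adding the single bias row via broadcasting gives the same result as multiplying an $n$-by-$1$ column of ones by $B$, which is how the training code in this demo builds the sum (its placeholder `I`).

```python
import numpy as np

n, p, k = 4, 784, 10                      # small n for illustration; p, k as in MNIST
rng = np.random.RandomState(0)

X = rng.rand(n, p)                        # n-by-p inputs (random stand-ins for images)
W = rng.normal(scale=0.01, size=(p, k))   # p-by-k weights
B = rng.normal(scale=0.01, size=(1, k))   # 1-by-k bias

# Broadcasting adds the single row B to every row of X.W.
S = np.dot(X, W) + B

# The same sum with an explicit n-by-1 column of ones, so that both
# terms are matrix products with shape n-by-k.
ones = np.ones((n, 1))
S_explicit = np.dot(X, W) + np.dot(ones, B)
```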

So $S$ is $n$-by-$k$, and the predicted label of each example is the index of the maximum value in the corresponding row of $S$. The loss function is defined as the average cross-entropy between the true class labels and the predicted class probabilities (obtained by applying softmax to each row of $S$; see [1](https://www.tensorflow.org/get_started/mnist/beginners)).
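
The loss and its gradients can also be written directly in NumPy, which is a useful reference when checking an automatic differentiation result. This is a sketch under stated assumptions, not the code of the `autodiff` module: the function names and the toy sizes are hypothetical, and labels are taken to be integer class indices. It uses the standard result that the derivative of the average cross-entropy with respect to $S$ is $(P - Y)/n$, where $P$ is the row-wise softmax of $S$ and $Y$ holds the one-hot labels.

```python
import numpy as np

def softmax(S):
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def average_cross_entropy(S, labels):
    # `labels` holds one integer class index per row of S.
    P = softmax(S)
    n = S.shape[0]
    return -np.mean(np.log(P[np.arange(n), labels]))

def gradients(X, W, B, labels):
    # Gradients of the average cross-entropy with respect to W and B.
    n = X.shape[0]
    P = softmax(np.dot(X, W) + B)
    P[np.arange(n), labels] -= 1.0        # P - Y, with Y the one-hot labels
    dS = P / n
    return np.dot(X.T, dS), dS.sum(axis=0, keepdims=True)

rng = np.random.RandomState(0)
n, p, k = 5, 8, 3                         # toy sizes, not the MNIST ones
X = rng.rand(n, p)
W = rng.normal(scale=0.1, size=(p, k))
B = rng.normal(scale=0.1, size=(1, k))
labels = rng.randint(k, size=n)

S = np.dot(X, W) + B
loss = average_cross_entropy(S, labels)
predicted = np.argmax(S, axis=1)          # index of the largest logit in each row
dW, dB = gradients(X, W, B, labels)
```

Comparing `dW` against a finite-difference estimate of the loss is a simple way to validate either this code or an automatic differentiation implementation.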

In [1]:
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
from autodiff import *

In [2]:
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
sess = Session()

batch_size = 10000
learning_rate = 1e-1
reg = 1e-3 # regularization coefficient

W_val = np.random.normal(scale=0.01, size=(784, 10))
B_val = np.random.normal(scale=0.01, size=(1, 10))

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


Training stage:

In [3]:
print "Training stage:"
for step in xrange(1000):

    batch_xs, batch_ys = mnist.train.next_batch(batch_size)

    # Build the graph for S = X.W + I.B, where I is a column of ones
    # that repeats the 1-by-10 bias row for every example in the batch.
    X = PlaceholderOp([batch_size, 784], sess)
    W = PlaceholderOp([784, 10], sess)
    I = PlaceholderOp([batch_size, 1], sess)
    B = PlaceholderOp([1, 10], sess)
    S = AddOp(MulOp(X, W, sess), MulOp(I, B, sess), sess)

    # `np.where(batch_ys)[1]` converts the one-hot labels to class indices.
    H = SoftmaxCrossEntropyWithLogitsOp(S, np.where(batch_ys)[1], sess)

    F = AddOp(H, RegMatOp(W, reg, sess), sess) # add regularization term on `W`

    feed_dict = {X: batch_xs,
        W: W_val,
        I: np.ones((batch_size, 1)),
        B: B_val}

    if step % 100 == 0:
        loss = F.eval(feed_dict)
        S_val = S.eval(feed_dict)
        accuracy = np.mean(np.argmax(S_val, axis=1) == np.argmax(batch_ys, axis=1))
        print "iteration: %d, loss: %f, train accuracy: %f" % (step, loss, accuracy)

    # Backpropagation starts at H (not F), so the gradients below are those
    # of the cross-entropy term only.
    H.parent_total = 1
    H.deriv(feed_dict, 1.) # propagate derivative dH/dH = 1. backwards
    W_val += -learning_rate * sess.derivs[id(W)]
    B_val += -learning_rate * sess.derivs[id(B)]

    sess.reset()

Training stage:
iteration: 0, loss: 2.290462, train accuracy: 0.114600
iteration: 100, loss: 0.621227, train accuracy: 0.859100
iteration: 200, loss: 0.504227, train accuracy: 0.876500
iteration: 300, loss: 0.444214, train accuracy: 0.886800
iteration: 400, loss: 0.428894, train accuracy: 0.888000
iteration: 500, loss: 0.412143, train accuracy: 0.890400
iteration: 600, loss: 0.402179, train accuracy: 0.899600
iteration: 700, loss: 0.382540, train accuracy: 0.900700
iteration: 800, loss: 0.379738, train accuracy: 0.904400
iteration: 900, loss: 0.383835, train accuracy: 0.900900


Test stage:

In [4]:
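print "Test stage:"
# Reuses the graph from the last training iteration. The placeholders were
# built for `batch_size` rows, which matches the 10000 test examples here.
feed_dict = {X: mnist.test.images,
        W: W_val,
        I: np.ones((batch_size, 1)),
        B: B_val}

S_val = S.eval(feed_dict)
test_accuracy = np.mean(np.argmax(S_val, axis=1) == np.argmax(mnist.test.labels, axis=1))
print "test set size: %d, test accuracy: %f" % (mnist.test.images.shape[0], test_accuracy)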

Test stage:
test set size: 10000, test accuracy: 0.910900
