# Classifying MNIST digits with an RNN

Adapted from Aymeric Damien's [TensorFlow tutorials](https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/recurrent_network.ipynb).

## Setting up

In [None]:
import tensorflow as tf
import numpy as np

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

## Params

In [37]:
learning_rate = 0.001
training_iters = 10000
batch_size = 10

n_hidden=50

n_steps= 28
n_input= 28
n_classes= 10

## Building Model and data ingestion

In [36]:
tf.reset_default_graph()

x= tf.placeholder(tf.float32, (None, n_steps, n_input), 'input')
y= tf.placeholder(tf.float32, (None, n_classes))
W= tf.Variable(tf.random_uniform((n_hidden, n_classes),0,1))
b= tf.Variable(tf.random_uniform((n_classes,),0,1)) # one bias per class, broadcast across the batch
    
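# Split the (batch, n_steps, n_input) input into a list of n_steps tensors of shape (batch, n_input)
inps= tf.unstack(x, num=n_steps, axis=1)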
lstm_cell= tf.contrib.rnn.LSTMCell(n_hidden)
outputs, states= tf.contrib.rnn.static_rnn(lstm_cell, inps, dtype=tf.float32)
# Project the last time step's output to class logits: (batch, n_hidden) x (n_hidden, n_classes)
pred=tf.matmul(outputs[-1],W) + b

loss= tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels= y, logits= pred))
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)


correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

init_op = tf.global_variables_initializer()
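
The shapes flowing through this graph can be checked without TensorFlow. The following is a minimal NumPy sketch, not the notebook's actual computation: the batch size of 4, the random values, and the `last_output` array standing in for the LSTM's final output are all assumptions made for illustration. It shows how unstacking along axis 1 yields one slice per time step, and how a bias of shape `(n_classes,)` broadcasts as one offset per class.

```python
import numpy as np

# Shapes mirror the notebook's parameters; the values are random stand-ins (assumption).
batch, n_steps, n_input, n_hidden, n_classes = 4, 28, 28, 50, 10
rng = np.random.default_rng(0)

x = rng.random((batch, n_steps, n_input))

# NumPy analogue of tf.unstack(x, axis=1): one (batch, n_input) slice per time step.
inps = [x[:, t, :] for t in range(n_steps)]

# Hypothetical stand-in for the LSTM output at the final time step.
last_output = rng.random((batch, n_hidden))

W = rng.random((n_hidden, n_classes))
b = rng.random(n_classes)        # shape (n_classes,): one bias per class

# (batch, n_hidden) @ (n_hidden, n_classes) -> (batch, n_classes); b is added to every row.
logits = last_output @ W + b

print(len(inps), inps[0].shape)  # 28 (4, 28)
print(logits.shape)              # (4, 10)
```

A bias of shape `(n_classes, 1)` would instead broadcast along the batch axis, which only happens to work when the batch size equals the number of classes.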

## Training

In [31]:
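def data_loader(data, batch_size):
    # Fetch the next batch and reshape each flat 784-pixel image into n_steps rows of n_input pixels
    inp,target= data.next_batch(batch_size)
    inp=np.reshape(inp, (batch_size, n_steps, n_input))
    return inp, target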

with tf.Session() as sess:
    sess.run(init_op) # Without tf.InteractiveSession, variables are initialized by running the init op in the session
    for i in range(training_iters):
        inp, target= data_loader(mnist.train, batch_size)
        _, l, acc= sess.run([train_op,loss,accuracy], feed_dict= {x:inp, y:target})
        if i%1000==0:
            print(i, l, acc)

0 2.31749 0.1
1000 0.414572 0.8
2000 0.412182 0.9
3000 0.0647575 1.0
4000 0.0786366 1.0
5000 0.0489145 1.0
6000 0.13115 0.9
7000 0.292898 0.8
8000 0.0823695 1.0
9000 0.018918 1.0
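
The accuracies printed above are measured on single training batches of 10 images, so they are noisy. One possible way to score a whole held-out set is sketched below. The `evaluate` helper and its `run_accuracy` callable are hypothetical names introduced here, not part of the original notebook; the helper assumes the images arrive as flat 784-pixel rows, as in the training loader.

```python
import numpy as np

def evaluate(run_accuracy, images, labels, batch_size=100, n_steps=28, n_input=28):
    """Return the mean accuracy over a whole dataset, evaluated in batches.

    run_accuracy is a hypothetical callable (inputs, targets) -> batch accuracy.
    """
    total_correct = 0.0
    n = len(images)
    for start in range(0, n, batch_size):
        # Reshape flat images to (batch, n_steps, n_input), as the model expects.
        xb = np.reshape(images[start:start + batch_size], (-1, n_steps, n_input))
        yb = labels[start:start + batch_size]
        # Weight each batch by its size so a short final batch is counted correctly.
        total_correct += run_accuracy(xb, yb) * len(yb)
    return total_correct / n
```

In this notebook it could be called inside the `with tf.Session() as sess:` block, after the training loop, as `evaluate(lambda xb, yb: sess.run(accuracy, feed_dict={x: xb, y: yb}), mnist.test.images, mnist.test.labels)`. The call has to happen before the session closes, because the trained variable values live in the session.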


## Notes and Gotchas

- Variable scopes are DEFINITELY necessary, for a simple reason: you might build the training and testing parts of the graph separately, but both need to share the same weights and parameters.
    - Figure out a strategy for running the model on test data
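- tf.unstack
    - Commonly used function. It splits a tensor along a given axis into a list of tensors and removes that axis. For example, a (batch, 28, 28) tensor unstacked along axis 1 becomes a list of 28 tensors of shape (batch, 28).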
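- tf.global_variables_initializer is used differently with InteractiveSession and regular Sessions. Calling it only creates the init op; nothing is initialized until that op is run in the current Session with sess.run(init_op). An InteractiveSession installs itself as the default session, so the op's .run() method can be called directly.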
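- The APIs for static and dynamic RNNs are different and require different input formats. Static (static_rnn) needs a list of tensors of shape (batch, n_input), where the length of the list is the maximum sequence length. Dynamic (dynamic_rnn) takes a single tensor of shape (batch, max_time, n_input) by default.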
- An affine transform is a linear transformation of the input followed by a translation (Wx + b); it is strictly linear only when b = 0
- tf.nn.softmax_cross_entropy_with_logits ...
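- sigmoid vs softmax
    - sigmoid is used for binary classification, or for multiple labels that are not mutually exclusive (one independent sigmoid per label)
    - softmax is used for predicting one of k mutually exclusive classes; sigmoid is the special case k=2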
- logit: unnormalized log probability (wx+b for a distribution over binary variables). Although it can be interpreted that way, and the interpretation is convenient for the optimization objective, why is the output of an affine transform "considered" to be a log probability?
- sigmoid function different from sigmoid activation
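
The relationship between sigmoid, softmax, and logits in the notes above can be checked numerically. This is a small standard-library sketch; the helper names `sigmoid`, `softmax`, and `logit` and the value `z = 1.5` are chosen here for illustration and are not TensorFlow calls. It checks three things: a two-class softmax with the second logit pinned at 0 equals the sigmoid, the logit (log-odds) inverts the sigmoid, and adding a constant to every logit leaves the softmax unchanged, which is one sense in which logits are "unnormalized".

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit(p):
    return math.log(p / (1.0 - p))            # log-odds, the inverse of sigmoid

z = 1.5
p_sigmoid = sigmoid(z)
p_softmax = softmax([z, 0.0])[0]              # two classes, second logit pinned at 0
p_shifted = softmax([z + 3.0, 3.0])[0]        # same logits shifted by a constant

print(math.isclose(p_sigmoid, p_softmax))     # True
print(math.isclose(logit(p_sigmoid), z))      # True
print(math.isclose(p_softmax, p_shifted))     # True
```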