# TensorFlow: Neural Networks
*Rachel Buttry*

*16 April 2018*

We've seen that TensorFlow can perform the same basic tasks as plain Python (with somewhat longer syntax), but where it really shines is deep learning. TensorFlow provides tools, such as TensorBoard, that may not be particularly useful when building sklearn-style models from scratch, but are very helpful when designing and using artificial neural networks (ANNs).

This notebook will give overviews of deep learning concepts, but I highly recommend reading [this article](http://adventuresinmachinelearning.com/neural-networks-tutorial/) for a more in-depth introduction to the subject.

Let's load our data. For this example, we'll be using the [MNIST digits dataset](http://yann.lecun.com/exdb/mnist/). Code for this example was taken directly from: [Adventures in Machine Learning](http://adventuresinmachinelearning.com/python-tensorflow-tutorial/?sfw=pass)

<img src='https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png' width='50%'>

In [1]:
# load mnist digits
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [2]:
import tensorflow as tf
sess = tf.Session()

### What is an Artificial Neural Network?
An artificial neural network is a machine learning model whose structure is loosely based on the brain. The "neurons" (a.k.a. nodes) of a given layer are connected to the "neurons" of the next layer. Each "neuron" takes its inputs, weighs them, then uses an activation function to determine whether it should "fire" or just return zero. Artificial neural networks can be designed and trained for very specific problems.
Image from [Wikipedia](https://en.wikipedia.org/wiki/Artificial_neural_network):
<img src='https://upload.wikimedia.org/wikipedia/commons/e/e4/Artificial_neural_network.svg' width='25%'>

We're going to make a simple neural net with an input layer, one hidden layer, and an output layer to apply to our dataset. It will have the shape (784-300-10), meaning there are 784 nodes in the first (input) layer, 300 in the second (hidden) layer, and 10 in the third (output) layer.

### Input Layer
The MNIST images are 28x28 pixels each, where each pixel has a grayscale value from 0 to 1.

28 x 28 = 784

Thus our input layer will have 784 input nodes.
(Note that we're also creating a placeholder for the correct labels of the data, which isn't part of the input layer. This is so we can compare the neural net output to the correct label.)
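
To make the 28 x 28 = 784 arithmetic concrete, here is a minimal sketch in plain Python (no TensorFlow) that flattens a 28x28 grid of pixel values into one vector with an entry per input node. The `image` values are made up for illustration and are not an actual MNIST digit.

```python
# A made-up 28x28 "image": every pixel is a grayscale value in [0, 1].
rows, cols = 28, 28
image = [[(r * cols + c) / float(rows * cols - 1) for c in range(cols)]
         for r in range(rows)]

# Flatten row by row into one long vector, one entry per input node.
flat = [pixel for row in image for pixel in row]

print(len(flat))  # 784
```

Each flattened image becomes one row of the `[None, 784]` placeholder, where `None` leaves the number of images per batch open.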

In [3]:
# declare the training data placeholders
x = tf.placeholder(tf.float32, [None, 784])
# now declare the output data placeholder - 10 digits
y = tf.placeholder(tf.float32, [None, 10])

### Hidden Layer
There are 300 nodes in our hidden layer. It's a somewhat arbitrary number, but for this example, why not?

In reality, there are a bunch of conventions and "rules" for determining the ideal number of hidden layers you need as well as how many nodes should be in each hidden layer. For instance, you generally want the number of nodes in the hidden layer to be between the number of input nodes and the number of output nodes. We won't worry too much about all the conventions for now and just say that since 784 > 300 > 10, we should be good.

#### Weights and Bias
For each layer, there is a set of weights $W$ that the input values are multiplied by, and each node's weighted sum then has a bias value $b$ added to it. This new set of values becomes the input to our hidden layer activation function.

<center>
    $input \cdot W   + b = weighted \space input$
</center>


* $W$ is a matrix of shape $inputsize \times \# nodes$
* $b$ is a vector of length $\# nodes$.

So, our $W$ for this layer is of shape $784 \times 300$ and $b$ is a vector of length $300$. Notice that we multiply the input data by $W$ on the right, so that our $1 \times 784$ input vector ultimately becomes a $1 \times 300$ vector. This exact order of multiplication and the shapes of the weights/biases aren't really a fixed convention; the goal is to match the input data with the respective weights/biases and get back the shape we want.
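
The shape bookkeeping can be checked on a scaled-down example. The sketch below uses plain Python lists, with 4 input values standing in for 784 and 3 nodes standing in for 300. The `matmul` helper and all of the numbers are hypothetical and only illustrate $input \cdot W + b$.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# One example with 4 input values (stand-in for 784) ...
inputs = [[1.0, 2.0, 3.0, 4.0]]   # shape 1 x 4
# ... feeding a layer with 3 nodes (stand-in for 300).
W = [[0.1, 0.2, 0.3],
     [0.0, 0.1, 0.0],
     [0.2, 0.0, 0.1],
     [0.1, 0.1, 0.1]]             # shape 4 x 3
b = [0.5, -0.5, 0.0]              # length 3, one bias per node

# input . W + b: the bias is added to each node's weighted sum.
weighted = [[value + bias for value, bias in zip(row, b)]
            for row in matmul(inputs, W)]

print(len(weighted), len(weighted[0]))  # 1 3
```

A $1 \times 4$ input times a $4 \times 3$ weight matrix gives a $1 \times 3$ result, just as $1 \times 784$ times $784 \times 300$ gives $1 \times 300$.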

In [4]:
# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random_normal([784, 300], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random_normal([300]), name='b1')

#### Activation Functions
In a neuron, an activation function is a function to determine whether the neuron should fire (activate) or not. In a neural network, the activation function works in a similar way. However, rather than a binary fire vs. don't fire, the activation function will usually return a continuous value (for some functions, such as the sigmoid, between 0 and 1) that can be thought of as a partial firing.

The input for this function is the weighted/biased data from the input layer:
$$
\begin{aligned}
    out = f(input \cdot W + b)
\end{aligned}
$$

Every node in the layer uses the same activation function. We could hypothetically loop through the nodes of a given layer and use a different activation function on each, but that is computationally costly and there isn't any clear benefit to doing so.

In this layer we're using the [rectified linear unit (ReLU) activation function](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)):
$$
\begin{aligned}
    f(x) = max(0, x)
\end{aligned}
$$

The function returns the value of the input if it's positive, and zero otherwise.
It is called with ```tf.nn.relu()```.
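
As a quick illustration of what the function does element by element, here is a plain-Python sketch. The `relu` helper and the input values are made up for this example; the notebook itself relies on ```tf.nn.relu()```.

```python
def relu(values):
    """Apply f(x) = max(0, x) to every element of a list."""
    return [max(0.0, v) for v in values]

weighted_input = [-2.0, -0.5, 0.0, 0.5, 3.0]
print(relu(weighted_input))  # [0.0, 0.0, 0.0, 0.5, 3.0]
```

Negative values are zeroed out, while positive values pass through unchanged.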

In [5]:
# calculate the output of the hidden layer
hidden_out = tf.add(tf.matmul(x, W1), b1)
hidden_out = tf.nn.relu(hidden_out)

### Output Layer
The output of our hidden layer becomes the input for our last layer.

For this layer, we use the [softmax activation function](https://en.wikipedia.org/wiki/Softmax_function):
$$
\begin{aligned}
    f(x_k) = \frac{e^{x_k}}{\sum_{i=1}^{n} e^{x_i}}
\end{aligned}
$$
* $k$ is the number of the current output node
* $n$ is the number of output nodes

The function returns a value between 0 and 1 for each output node, and the values across all output nodes sum to 1, so each one can be read as the probability of that node's outcome. It is called using ```tf.nn.softmax()```.
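
The formula can be tried out on a few numbers with a plain-Python sketch. The `softmax` helper and the `scores` are made up for illustration. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; it is an addition here, not something taken from the notebook.

```python
import math

def softmax(values):
    """Exponentiate each value and divide by the sum of the exponentials."""
    shift = max(values)  # leaves the result unchanged but avoids overflow
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The largest score gets the largest probability, and the three outputs add up to 1.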

In [6]:
# weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random_normal([300, 10], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random_normal([10]), name='b2')

# now calculate the output of the output layer - in this case, let's use a
# softmax activated output layer
y_ = tf.nn.softmax(tf.add(tf.matmul(hidden_out, W2), b2))

### Cost Function
We want to minimize the error between our predicted and correct values. To do this, we represent the error using a cost (a.k.a loss) function and then minimize it. We're using the [cross entropy cost function](https://en.wikipedia.org/wiki/Cross_entropy).

We have 10 output nodes so that each will (ideally) return either a 1 or a zero. So if the input image is a 6, then we'd expect the correct label to look like \[0,0,0,0,0,0,1,0,0,0\]. This means that we'd only want the output node corresponding to the number 6 to return a 1 and the rest to return 0. (This type of representation is called [one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).)
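
A one-hot label is simple enough to build by hand. The `one_hot` helper below is a hypothetical illustration; the MNIST loader used in this notebook already returns labels in this form because of `one_hot=True`.

```python
def one_hot(digit, num_classes=10):
    """Return a list of zeros with a single 1 at position `digit`."""
    return [1 if i == digit else 0 for i in range(num_classes)]

print(one_hot(6))  # [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```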

Each node should return a binary value (0 or 1), so we use the binary calculation for cross entropy. For each output node $i$ of a prediction:
$$
\begin{aligned}
    L_i = -(y_{actual} \log(y_{pred}) + (1-y_{actual}) \log(1 - y_{pred}))
\end{aligned}
$$
* $y_{actual}$ is the actual label value for that node
* $y_{pred}$ is the value predicted for that node by the neural net

We want to find the total loss for a given example by summing over its output nodes, then calculate the average loss across all examples in a batch:
$$
\begin{aligned}
    L_{avg} = \frac{1}{m} \sum_{j=1}^{m}\sum_{i=1}^{n}L_{i,j}
\end{aligned}
$$
* $L_{i,j}$ is the loss for node $i$ on example $j$
* $m$ is the number of examples in the batch
* $n$ is the number of output nodes


This is the function we are going to minimize using gradient descent.

**Note:** We're going to clip the predicted values so as to avoid $NaN$s when taking $\log(0)$.
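
Here is a plain-Python sketch of the same calculation as the TensorFlow cell that follows: clip each prediction, sum the binary cross entropy over the output nodes of one example, then take the mean over the examples. The helper names and the tiny 3-node examples are made up for illustration.

```python
import math

def clip(value, low=1e-10, high=0.9999999):
    """Keep a prediction strictly inside (0, 1) so log() never sees 0."""
    return min(max(value, low), high)

def example_loss(actual, predicted):
    """Binary cross entropy summed over the output nodes of one example."""
    total = 0.0
    for y_a, y_p in zip(actual, predicted):
        y_p = clip(y_p)
        total += -(y_a * math.log(y_p) + (1 - y_a) * math.log(1 - y_p))
    return total

def batch_loss(actual_batch, predicted_batch):
    """Mean of the per-example losses over a batch."""
    losses = [example_loss(a, p) for a, p in zip(actual_batch, predicted_batch)]
    return sum(losses) / len(losses)

# Two made-up examples with 3 output nodes each (stand-in for 10).
labels = [[0, 1, 0], [1, 0, 0]]
confident = [[0.05, 0.90, 0.05], [0.80, 0.10, 0.10]]
unsure = [[0.30, 0.40, 0.30], [0.34, 0.33, 0.33]]

print(batch_loss(labels, confident) < batch_loss(labels, unsure))  # True

# Predictions of exactly 0 or 1 stay finite because of the clipping.
perfect = example_loss([1, 0], [1.0, 0.0])
```

Confident, correct predictions give a smaller loss than hesitant ones. Without the clipping step, `math.log(0)` would raise an error in plain Python, which plays the same role as the $NaN$ the note warns about.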

In [7]:
y_clipped = tf.clip_by_value(y_, 1e-10, 0.9999999)
cross_entropy = -tf.reduce_mean(tf.reduce_sum(y * tf.log(y_clipped)
                         + (1 - y) * tf.log(1 - y_clipped), axis=1))

### Training Terminology
* **forward pass** - computes values from inputs to output
* **backward pass** - backpropagation which starts at the end and recursively applies the chain rule to compute the gradients
* **pass** - one forward pass and one backward pass
* **batch size** - number of examples in one forward/backward pass
* **number of iterations** - number of passes, each pass using *batch size* number of examples
* **epoch** - one forward pass and backward pass of *all* training samples
* **learning rate** - how much the coefficients can change with each update

For instance, if you have 1000 training examples and a batch size of 500, then it will take 2 iterations to complete 1 epoch.
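
The same arithmetic as a small sketch. The `iterations_per_epoch` helper is hypothetical, and the 55000 figure assumes the size of the training split returned by the MNIST loader used above.

```python
def iterations_per_epoch(num_examples, batch_size):
    """Number of full batches needed to see every training example once."""
    return num_examples // batch_size

print(iterations_per_epoch(1000, 500))   # 2
print(iterations_per_epoch(55000, 100))  # 550
```

Integer division drops any leftover partial batch, which matches how `total_batch` is computed in the training loop further down.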

In [8]:
# Python optimisation variables
learning_rate = 0.5
epochs = 10
batch_size = 100

optimiser = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate).minimize(cross_entropy)

### Automatic Backpropagation
Ordinarily when applying gradient descent to a neural network, the derivatives are computed in the direction from output to input. The chain rule is applied starting at the final output, so the partial derivatives that make up the gradient need to be calculated "from end to start".

This is known as [backpropagation](https://brilliant.org/wiki/backpropagation/), short for "backward propagation of errors". TensorFlow has made it so that we don't need to write our own backpropagation algorithm. The ```GradientDescentOptimizer()``` will minimize the loss and do all the hard work for us.
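
To get a feel for what the optimizer automates, here is a hand-written sketch for the smallest possible "network": a single linear neuron. As a simplification it uses a squared-error loss rather than the cross entropy used in this notebook, and every name and number in it is made up for illustration.

```python
def forward(w, b, x):
    """Forward pass of a single linear neuron."""
    return w * x + b

def loss(w, b, x, target):
    """Squared error between the neuron's output and the target."""
    return (forward(w, b, x) - target) ** 2

def gradients(w, b, x, target):
    """Backward pass: chain rule from the loss back to w and b."""
    d_loss_d_out = 2 * (forward(w, b, x) - target)  # outermost derivative first
    d_out_d_w = x
    d_out_d_b = 1.0
    return d_loss_d_out * d_out_d_w, d_loss_d_out * d_out_d_b

w, b = 0.0, 0.0
x, target = 2.0, 1.0
learning_rate = 0.05

start_loss = loss(w, b, x, target)
for _ in range(20):
    grad_w, grad_b = gradients(w, b, x, target)
    # Gradient descent update: step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
end_loss = loss(w, b, x, target)

print(end_loss < start_loss)  # True
```

In a real network the same chain-rule bookkeeping has to be repeated through every layer, which is the part the optimizer takes care of.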

In [9]:
# finally setup the initialisation operator
init = tf.global_variables_initializer()

# initialise the variables
sess.run(init)
total_batch = int(len(mnist.train.labels) / batch_size)

# run the training
for epoch in range(epochs):
    avg_cost = 0
    for i in range(total_batch):
        batch_x, batch_y = mnist.train.next_batch(batch_size=batch_size)
        _, c = sess.run([optimiser, cross_entropy], 
                     feed_dict={x: batch_x, y: batch_y})
        avg_cost += c / total_batch
    print("Epoch:", (epoch + 1), ", cost =", "{:.3f}".format(avg_cost))

Epoch: 1 , cost = 0.603
Epoch: 2 , cost = 0.224
Epoch: 3 , cost = 0.164
Epoch: 4 , cost = 0.126
Epoch: 5 , cost = 0.108
Epoch: 6 , cost = 0.088
Epoch: 7 , cost = 0.073
Epoch: 8 , cost = 0.061
Epoch: 9 , cost = 0.051
Epoch: 10 , cost = 0.041


### Results

In [10]:
# define an accuracy assessment operation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

print(sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels}))

0.9772
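
The accuracy operation compares the position of the largest predicted value with the position of the 1 in the one-hot label. Here is a plain-Python sketch of the same idea on four made-up examples with 3 classes each; the `argmax` and `accuracy` helpers are hypothetical stand-ins for ```tf.argmax()``` and the mean over correct predictions.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(labels, predictions):
    """Fraction of examples whose predicted class matches the labelled class."""
    hits = [argmax(l) == argmax(p) for l, p in zip(labels, predictions)]
    return sum(hits) / float(len(hits))

labels = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
predictions = [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1], [0.5, 0.2, 0.3], [0.2, 0.7, 0.1]]
print(accuracy(labels, predictions))  # 0.75
```

Three of the four predictions put the highest value on the labelled class, so the accuracy is 0.75.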


### Final Notes
There's definitely room for improvement, but about 97.7% accuracy is good enough for our little neural net.

It's important to be aware that we won't be able to easily understand the meaning of what our model is doing. If we wanted to, we could extract the weights and biases, but those are just numbers and don't have any physical meaning. As neural networks become more and more complex, the individual weights/biases become more difficult to interpret.

So, if you're looking for a machine learning model where you can easily understand where a result came from, neural networks are *not* for you.

#### Additional Resources:
* [Activation Functions Intro](https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0)
* [Convolutional Neural Networks Intro](http://cs231n.github.io/optimization-2/)
* [Loss Functions Cheatsheet](http://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html)