# Hello World in Deep Learning

### Deep learning uses artificial neural networks with multiple hidden layers; the number of hidden layers is what makes a network deep. A general neural network may have only one or two hidden layers. Deep neural networks also differ from general neural networks in the types of layers they use, as we will see.

<BR>
<BR>
<img src="./images/nn.png" width="600px"> 

# The MNIST Dataset

### MNIST is a dataset of digital images representing handwritten digits. It also includes labels for the images. For example, the labels for the images below would be 5, 0, 4, and 1.
<BR>
<img src="./images/MNIST.png" width="300px"> 
<BR>
    
### The MNIST data is split into three parts: 

* 55,000 data points of training data (mnist.train)
* 10,000 data points of test data (mnist.test)
* 5,000 data points of validation data (mnist.validation)

### This split is very important! We test on data that the model never saw during training, so the test score tells us whether the model we learned generalizes well.


# Data Representation

### Each image is 28 pixels by 28 pixels, which gives 28 × 28 = 784 features.
<BR>
<img src="./images/MNIST-Matrix.png" width="700px"> 
<BR> 
    
### We will flatten each image into a vector of 784 values, which throws away the 2D structure of the image. We will talk about taking advantage of that structure later . . .
<BR>
<img src="./images/mnist-train-xs.png" width="400px"> 
<BR>
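
### As a minimal sketch of the flattening step (plain Python, with a made-up all-zero image and one bright pixel; the names `HEIGHT`, `WIDTH`, `image`, and `flat` are hypothetical), each pixel at (row, col) ends up at index row * 28 + col:

```python
# Hypothetical 28x28 image stored as a nested list of pixel intensities.
# Everything is 0.0 except one bright pixel, so we can see where it lands.
HEIGHT, WIDTH = 28, 28
image = [[0.0 for _ in range(WIDTH)] for _ in range(HEIGHT)]
image[2][5] = 1.0  # row 2, column 5

# Flatten row by row: pixel (row, col) moves to index row * WIDTH + col.
flat = [pixel for row in image for pixel in row]

print(len(flat))        # number of features per image
print(flat.index(1.0))  # position of the bright pixel in the flat vector
```

### The flat vector keeps every pixel value but no longer records which pixels were neighbors in the image.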

# Converting labels to "one-hot vectors"

### A one-hot vector is 0 everywhere except for a single 1 at the position that corresponds to the label. In our case, that position is the digit itself.

#### For example, a label of 3 becomes [0,0,0,1,0,0,0,0,0,0].
<BR>
<img src="./images/mnist-train-ys.png" width="400px"> 
<BR>    
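
### A minimal sketch of one-hot encoding in plain Python (the helper names `one_hot` and `decode` are hypothetical, not part of the MNIST loader):

```python
def one_hot(label, num_classes=10):
    """Return a list of zeros with a single 1 at position `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    vector = [0] * num_classes
    vector[label] = 1
    return vector


def decode(vector):
    """Recover the label: it is the position of the single 1."""
    return vector.index(1)


encoded = one_hot(3)
print(encoded)          # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(decode(encoded))  # 3
```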

# Softmax Regression

### A given image can be any one of 10 digits (0 through 9).

### Softmax provides a distribution over the possible outcomes. That is to say, for an image that is actually a nine, our model may determine that there is an 80% chance of it being a 9, a 10% chance of it being an 8, and some small probability of it being each of the other digits, because models are not perfect!

<BR>
<img src="./images/softmax-weights.png" width="400px"> 
    
#### Red = Negative weight, Blue = Positive weight

<center>$evidence_i = \displaystyle \sum_{j} W_{i,j} \space x_j \space + \space b_i $</center>

<center>$y = softmax(evidence)$</center>

<BR>
<img src="./images/softmax-regression-scalargraph.png" width="400px"> 
<BR>
<img src="./images/softmax-regression-vectorequation.png" width="400px">     
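
#### The two equations above can be sketched in plain Python on a toy problem with 3 classes and 2 features. The numbers are made up, the helper names are hypothetical, and `W` is indexed as [class][feature] to match the formula (the TensorFlow code in this notebook stores it as [784, 10] and multiplies `x` by `W` instead).

```python
import math


def evidence(W, x, b):
    """evidence_i = sum_j W[i][j] * x[j] + b[i], computed for every class i."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def softmax(values):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    largest = max(values)
    # Subtracting the max does not change the result but avoids overflow.
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Toy problem (illustrative numbers only): 3 classes, 2 input features.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.0, 0.0, -1.0]
x = [2.0, 1.0]

ev = evidence(W, x, b)  # [2.0, 1.0, 2.0]
y = softmax(ev)
print(ev)
print(y)
```

#### Classes 0 and 2 receive equal evidence here, so softmax gives them equal probability, and class 1 gets less.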

# Let's define our model

In [1]:
# This notebook uses the TensorFlow 1.x API (placeholders and sessions).
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

In [2]:
# A placeholder is not a value! We just need to define how we will hold onto our training cases. 
# "None" means that this dimension can be any length.
x = tf.placeholder(tf.float32, [None, 784])

In [3]:
# We need to define some variables for our weights and biases. A variable is a modifiable tensor element.
# We will want to initialize our variables to zero.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y_hat = tf.matmul(x, W) + b

# Now we need to train our model

#### First we have to understand what it means for a model to be good! In machine learning we actually define what it means for a model to be bad. We call this our cost or loss function. Our goal is to minimize our loss.

#### A commonly used loss function is "cross-entropy". Cross-entropy loss increases as the predicted probability diverges from the actual label. It looks something like this . . .

<BR>
<img src="./images/cross_entropy.png" width="400px"> 
<BR>

<center>$H_{y'}(y) = - \displaystyle \sum_{i} y'_i \space log(y_i) $</center>

#### Here <i>y</i> is our predicted probability distribution and <i>y'</i> is the true distribution (the one-hot vector of the digit label).
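
#### As a rough worked example of the cross-entropy formula (plain Python, made-up predictions, hypothetical helper name `cross_entropy`), a confident correct prediction gives a small loss and an unsure one gives a larger loss:

```python
import math


def cross_entropy(y_true, y_pred):
    """H_{y'}(y) = -sum_i y'_i * log(y_i).

    Terms where y'_i is 0 contribute nothing, so they are skipped.
    """
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


y_true = [0, 1, 0]            # one-hot: the true class is index 1
confident = [0.1, 0.8, 0.1]   # puts 80% on the true class
unsure = [0.4, 0.3, 0.3]      # puts only 30% on the true class

loss_confident = cross_entropy(y_true, confident)  # equals -log(0.8)
loss_unsure = cross_entropy(y_true, unsure)        # equals -log(0.3)
print(loss_confident, loss_unsure)
```

#### With a one-hot label, the sum collapses to the negative log of the probability assigned to the true class.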

#### When we calculate our evidence we end up with unnormalized log probabilities, such as below . . .

$evidence_i = \displaystyle \sum_{j} W_{i,j} \space x_j \space + \space b_i $

#### Let's assume a 3 class problem . . . 
* Training Case 1 Prediction: [ 0.5,  1.5,  0.1]
* Training Case 2 Prediction: [ 2.2,  1.3,  1.7]

#### These outputs do not sum to one; that is, they are unnormalized log probabilities.

#### Now softmax normalizes these into ordinary (linear-scale) probabilities that sum to one.

$y_{hat} = softmax(evidence)$

* Training Case 1 Softmax: [0.227863, 0.61939586, 0.15274114]
* Training Case 2 Softmax: [0.49674623,0.20196195,0.30129182]
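
#### The softmax values listed above can be reproduced with a few lines of plain Python (a sketch; the `softmax` helper here is written by hand and is not the TensorFlow op):

```python
import math


def softmax(logits):
    """Exponentiate each logit, then divide by the sum of the exponentials."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


case_1 = softmax([0.5, 1.5, 0.1])
case_2 = softmax([2.2, 1.3, 1.7])

print(case_1, sum(case_1))
print(case_2, sum(case_2))
```

#### Each result sums to one, and the largest logit always receives the largest probability.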




In [4]:
# We need a new placeholder for our actual label
y = tf.placeholder(tf.float32, [None, 10])
# Now we can define our loss function: cross entropy with logits, averaged across our batch.
# y_hat holds unnormalized log probabilities (aka logits); the op applies softmax to them internally.
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=y_hat))

#### Gradient Descent
<BR>
<img align="left" style="float: l;" src="./images/gd.png" width="400px">

<img style="float: l;" src="./images/gd-learning.png" width="350px">
<BR>
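
#### Gradient descent repeatedly moves each parameter a small step against the gradient of the loss. A minimal sketch on a one-dimensional toy loss, (w - 3)^2, chosen only for illustration (the function and variable names are hypothetical):

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly apply the update w <- w - learning_rate * grad(w)."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


def toy_grad(w):
    """Gradient of the toy loss (w - 3)^2, which is minimized at w = 3."""
    return 2 * (w - 3)


w_final = gradient_descent(toy_grad, start=0.0, learning_rate=0.2, steps=50)
print(w_final)  # very close to 3
```

#### The learning rate controls the step size: too small and progress is slow, too large and the updates can overshoot the minimum.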


In [5]:
# Define our gradient descent optimizer
# We want to minimize our loss function, that is, cross entropy. We will set a learning rate of 0.2
train_step = tf.train.GradientDescentOptimizer(0.2).minimize(cross_entropy)

In [6]:
# Let's create a TensorFlow session
sess = tf.InteractiveSession()
# We need to initialize the variables that we created.
tf.global_variables_initializer().run()

In [7]:
# Let's load the MNIST data with one-hot encoded labels
mnist = input_data.read_data_sets("data", one_hot=True)


Extracting data\train-images-idx3-ubyte.gz
Extracting data\train-labels-idx1-ubyte.gz
Extracting data\t10k-images-idx3-ubyte.gz
Extracting data\t10k-labels-idx1-ubyte.gz


In [8]:
# Let's train our model with 10,000 batches of 100
for _ in range(10000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y: batch_ys})

In [9]:
# Test trained model
correct_prediction = tf.equal(tf.argmax(y_hat, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels}))

0.9251
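
#### The accuracy computation above compares the index of the largest output with the index of the 1 in the one-hot label. A rough plain-Python sketch of the same idea on four made-up examples (the helpers `argmax` and `accuracy` are hypothetical stand-ins for `tf.argmax` and the mean of `correct_prediction`):

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(logits_batch, labels_batch):
    """Fraction of examples whose predicted class matches the true class."""
    hits = [argmax(logits) == argmax(label)
            for logits, label in zip(logits_batch, labels_batch)]
    return sum(hits) / len(hits)


# Made-up 3-class outputs and one-hot labels for illustration.
logits = [[0.5, 1.5, 0.1],
          [2.2, 1.3, 1.7],
          [0.1, 0.2, 0.9],
          [1.0, 0.3, 0.2]]
labels = [[0, 1, 0],
          [0, 0, 1],
          [0, 0, 1],
          [1, 0, 0]]

acc = accuracy(logits, labels)
print(acc)  # 0.75: the second example is misclassified
```

#### Taking the argmax of the logits gives the same class as taking the argmax of the softmax output, because softmax preserves the ordering of its inputs.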
