# Basic Neural Network in TensorFlow

In this notebook, we build a neural network with two hidden layers (a.k.a. a multilayer perceptron) using TensorFlow.


## MNIST Dataset Overview

This example uses the MNIST handwritten digits dataset. The dataset contains 60,000 examples for training and 10,000 examples for testing. The digits have been size-normalized and centered in a fixed-size image (28x28 pixels). In this notebook, the pixel values are rescaled from [0, 255] to [0, 1], and each image is flattened into a 1-D NumPy array of 784 features (28*28).

![MNIST Dataset](http://neuralnetworksanddeeplearning.com/images/mnist_100_digits.png)

More info: http://yann.lecun.com/exdb/mnist/

## Creating the Model

The task of our network is to look at a handwritten digit and determine which number (0-9) it is. We will formulate this as a 10-class classification problem, where the output is the probability that the handwritten image belongs to each class.

First, let's set up our environment and load the data. The original MNIST data is published on Yann LeCun's website. For your convenience, we've included some Python code to download and load the data automatically.

In [1]:
from __future__ import print_function

import tensorflow as tf
from tensorflow.keras import Model, layers
import numpy as np

# Import MNIST data
from tensorflow.keras.datasets.mnist import load_data
mnist = load_data()

In [4]:
# Data Parameters
num_features = 784 # MNIST data input (img shape: 28*28)
num_classes = 10 # MNIST total classes (0-9 digits)

# Parameters
learning_rate = 0.1    # alpha for gradient descent
num_steps = 2000       # iterations for gradient descent
batch_size = 256       # number of inputs to look at simultaneously (good for large data!)
display_step = 100     # when to print out some feedback

# Network Parameters
n_hidden_1 = 128 # 1st layer number of neurons
n_hidden_2 = 256 # 2nd layer number of neurons

# Divide the data into training and testing
(x_train, y_train), (x_test, y_test) = mnist

print(x_train.shape)
# Some data preprocessing to make this go more smoothly
# Convert to float32.
x_train, x_test = np.array(x_train, np.float32), np.array(x_test, np.float32)
print(x_train.shape)
# Flatten images to 1-D vector of 784 features (28*28 pixels).
x_train, x_test = x_train.reshape([-1, num_features]), x_test.reshape([-1, num_features])
print(x_train.shape)
print(y_train.shape)
# Normalize images value from [0, 255] to [0, 1].
x_train, x_test = x_train / 255., x_test / 255.
# Use tf.data API to shuffle and batch data (batches will make it faster to queue up 
# sets of images for training all at one time--a convenient way to split up large data sets)
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_data = train_data.repeat().shuffle(5000).batch(batch_size).prefetch(1)

(60000, 28, 28)
(60000, 28, 28)
(60000, 784)
(60000,)


### Defining the Model Structure
Now we can define our neural network model. To do this, we define a new class for our model that inherits from the built-in TensorFlow `Model` class. Then we just specify each layer of the network: we only need to indicate how many nodes are in each layer and which activation function to use, and we do not have to explicitly list any of the weights. We are going to use the rectified linear unit (ReLU) as the activation function for the perceptrons in layers 1 and 2.
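
To make the role of a fully-connected layer concrete, here is a minimal NumPy sketch of what one hidden layer with a rectified linear activation computes. It is an illustration only, not how TensorFlow implements `Dense`: the helper names `relu` and `dense_forward`, the random weights, and the fake input batch are all assumptions made up for this example.

```python
import numpy as np

def relu(z):
    """Rectified linear activation: negative values become 0."""
    return np.maximum(z, 0.0)

def dense_forward(x, W, b):
    """One fully-connected layer: weighted sum plus bias, then ReLU."""
    return relu(x @ W + b)

rng = np.random.default_rng(0)

# Hypothetical inputs: 4 fake flattened images with 784 features each.
x = rng.random((4, 784)).astype(np.float32)

# Hypothetical weights and biases for a layer with 128 neurons.
# In the real model, TensorFlow creates and trains these for us.
W = rng.normal(0.0, 0.05, size=(784, 128)).astype(np.float32)
b = np.zeros(128, dtype=np.float32)

h = dense_forward(x, W, b)
print(h.shape)  # one row per image, one column per neuron
```

This is why specifying only the number of nodes is enough: once the input size is known, the shape of the weight matrix follows from it.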

Because this is a multi-class classification problem (which class 0-9 does the handwritten digit fall into?), we are going to use the same softmax normalization we discussed in class (and which can be found in the `NeuralNetwork-demo.ipynb` example). Rather than just making a yes/no decision about each class and only choosing one, softmax lets us output a probability for each class.
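
As a rough sketch of the softmax normalization, the NumPy snippet below turns ten raw scores into probabilities. The scores in `example_logits` are made up for illustration and do not come from the model in this notebook.

```python
import numpy as np

def softmax(logits):
    """Convert a 1-D array of raw scores (logits) into probabilities."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    shifted = logits - np.max(logits)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Hypothetical raw scores for the 10 digit classes (0-9).
example_logits = np.array([1.0, 0.5, 0.1, 3.0, 0.2, 0.0, -1.0, 0.3, 2.0, 0.4])
probs = softmax(example_logits)

print(probs.round(3))
print("sum of probabilities:", probs.sum())
print("most likely digit:", int(np.argmax(probs)))
```

Every class gets a positive probability, the probabilities sum to 1, and the class with the largest raw score (index 3 here) is also the most probable one.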

In [None]:
# Create TF Model. Our NeuralNet inherits from the generic TF Model class
class NeuralNet(Model):
    # Set layers.
    def __init__(self):
        super(NeuralNet, self).__init__()
        # First fully-connected hidden layer. Activation function is rectified
        # linear (ReLU).
        self.h1 = layers.Dense(n_hidden_1, activation=tf.nn.relu)
        # Second fully-connected hidden layer. Activation function is also ReLU.
        self.h2 = layers.Dense(n_hidden_2, activation=tf.nn.relu)
        # Output layer. No activation here: it produces raw scores (logits), one per
        # class (digits 0-9). Softmax is applied in call() when not training.
        self.out = layers.Dense(num_classes)

    # Set forward pass: the input flows through h1, then h2, then the output layer (out).
    # Outside of training, the result is normalized to a probability distribution
    # over the classes using softmax.
    def call(self, x, is_training=False):
        x = self.h1(x)
        x = self.h2(x)
        x = self.out(x)
        if not is_training:
            # The TF cross-entropy loss expects logits (raw scores without softmax
            # normalization), so only apply softmax when not training.
            x = tf.nn.softmax(x)
        return x

# Build neural network model.
neural_net = NeuralNet()

### Specify the Loss Function

The next step is to define our loss function, which we will need for training. Again, because we are doing a probabilistic multi-class classification, we will use the cross-entropy loss function presented in class. 
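
As a small worked example of cross-entropy with integer class labels, the NumPy sketch below computes the loss by hand. The helper name `sparse_cross_entropy`, the three-class probabilities, and the labels are all made up for illustration; the TensorFlow function used in this notebook takes logits rather than probabilities.

```python
import numpy as np

def sparse_cross_entropy(probs, labels):
    """Mean cross-entropy given per-class probabilities and integer labels."""
    # Pick out the probability assigned to the true class of each example.
    true_class_probs = probs[np.arange(len(labels)), labels]
    # The loss is the negative log of that probability, averaged over the batch.
    return float(np.mean(-np.log(true_class_probs)))

# Hypothetical predictions for 3 examples in a 3-class problem.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 0])  # true class index for each example

loss = sparse_cross_entropy(probs, labels)
print("mean cross-entropy: %.4f" % loss)

# A confident wrong prediction is penalized much more than a confident right one.
print("right and confident: %.4f" % -np.log(0.9))
print("wrong and confident: %.4f" % -np.log(0.05))
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1, which is what training tries to achieve.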

In [None]:
# Cross-Entropy Loss.
# Note that this will apply 'softmax' to the logits as part of the function, so don't do
# it before calling.
def cross_entropy_loss(x, y):
    # Convert labels to int 64 for tf cross-entropy function.
    y = tf.cast(y, tf.int64)
    # Apply softmax to logits and compute cross-entropy.
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=x)
    # Average loss across the batch.
    return tf.reduce_mean(loss)

# Accuracy metric. This counts how many of our predictions we get right based on 
# choosing the prediction with the highest probability. 
def accuracy(y_pred, y_true):
    # Predicted class is the index of highest score in prediction vector (i.e. argmax).
    correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64))
    return tf.reduce_mean(tf.cast(correct_prediction, tf.float32), axis=-1)

### Train the Model
Next, we need to write our training loop. TensorFlow has many different optimization functions you can use to minimize loss with respect to the parameters. However, we are going to stick with gradient descent. Specifically, we will be using a variation of gradient descent called ***stochastic gradient descent***. 

Standard gradient descent (typically called batch gradient descent, because we process the data in one big batch) computes the gradient over every point in the training data before each update. This gives an accurate gradient, but it can be impossible with large datasets that do not fit in memory. Stochastic gradient descent (SGD), in the mini-batch form used here, divides the training data into manageably sized batches, computes the gradient on one batch at a time, and updates the weights after each batch. It still looks at all the data, but in more "bite-sized" pieces. Each mini-batch gradient is a noisy estimate of the full gradient, so the results of SGD are usually pretty close to those of true gradient descent, and it is much more tractable for large datasets (and often less susceptible to getting stuck in local minima).
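
To show the mechanics on a problem small enough to check by eye, here is a minimal NumPy sketch of mini-batch SGD fitting a straight line. The synthetic data (`y = 3x + 2` plus noise), the squared-error loss, and the hyperparameters are assumptions chosen for this illustration; they are not the settings used for the MNIST model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only: y = 3x + 2 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=1000)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, size=1000)

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.1
batch_size = 32

for epoch in range(20):
    # Shuffle once per pass so each batch is a random sample of the data.
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        # Gradient of the mean squared error on this batch only.
        error = (w * xb + b) - yb
        grad_w = 2.0 * np.mean(error * xb)
        grad_b = 2.0 * np.mean(error)
        # Update the parameters right away, before looking at the next batch.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print("w = %.2f, b = %.2f" % (w, b))
```

Each update uses only 32 points, yet the learned `w` and `b` should end up close to the values used to generate the data.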

In [None]:
# Stochastic gradient descent optimizer.
optimizer = tf.optimizers.SGD(learning_rate)

# Optimization process. 
def run_optimization(x, y):
    # Wrap computation inside a GradientTape for automatic differentiation 
    # (see Backprop.ipynb for an explanation of the GradientTape)
    with tf.GradientTape() as g:
        # Forward pass.
        pred = neural_net(x, is_training=True)
        # Compute loss.
        loss = cross_entropy_loss(pred, y)
        
    # Variables to update, i.e. trainable variables.
    trainable_variables = neural_net.trainable_variables

    # Compute gradients (backpropagation).
    gradients = g.gradient(loss, trainable_variables)
    
    # Update all of the weights W and biases (y-intercepts) b following gradients.
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    
# Run training for the given number of steps.
for step, (batch_x, batch_y) in enumerate(train_data.take(num_steps), 1):
    # Run the optimization to update W and b values.
    run_optimization(batch_x, batch_y)
    
    if step % display_step == 0:
        pred = neural_net(batch_x, is_training=True)
        loss = cross_entropy_loss(pred, batch_y)
        acc = accuracy(pred, batch_y)
        print("Pred:", pred)
        print("Actual:", batch_y)
        print("step: %i, loss: %f, accuracy: %f" % (step, loss, acc))

### Test the Model

We now have a trained neural network ready for testing! Let's run it on the testing data and see how it does!

In [None]:
# Test model on the test set.
pred = neural_net(x_test, is_training=False)
print("Test Accuracy: %f" % accuracy(pred, y_test))

To get a better idea of how the model is performing, we can display some of the handwritten digits in the test data along with the predictions made by our neural net.

In [None]:
# Visualize predictions.
import matplotlib.pyplot as plt

# Predict the first 20 images from the test set.
n_images = 20
test_images = x_test[:n_images]
predictions = neural_net(test_images)

# Display image and model prediction.
for i in range(n_images):
    plt.imshow(np.reshape(test_images[i], [28, 28]), cmap='gray')
    plt.show()
    print("Model prediction: %i" % np.argmax(predictions.numpy()[i]))