---
## Multilayer Perceptron.
---
__Introduction to MLPs__ [[1]](https://en.wikipedia.org/wiki/Multilayer_perceptron)
Multilayer Perceptrons (MLPs), or feed-forward networks, are a _deeper_ version of a logistic regression model (for classification tasks). In logistic regression, $w\bullet x + b$ is computed and then converted to a probability distribution (with softmax). With MLPs, we add __hidden layers__, i.e. multiple layers of $w\bullet x + b$, to learn __richer & more abstract__ representations of the images.

This approach comes with a problem, though. If we were to combine a series of linear functions, the output before softmax would still be a linear function of the input. Hence MLPs apply a non-linear function called an __activation function__ [[2]](https://en.wikipedia.org/wiki/Activation_function) after each hidden layer to create a __non-linear mapping__ between the input and the output.
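The collapse of stacked linear layers can be checked numerically. The following is a minimal NumPy sketch, not part of the TensorFlow model in this notebook; the layer sizes, the seed and the names `w_2`, `b_2`, `w_combined` and `b_combined` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 input features, 3 hidden units, 2 outputs.
x = rng.normal(size=(5, 4))                      # batch of 5 inputs
w_1, b_1 = rng.normal(size=(4, 3)), rng.normal(size=3)
w_2, b_2 = rng.normal(size=(3, 2)), rng.normal(size=2)

# Two stacked linear layers with no activation in between.
two_layers = (x @ w_1 + b_1) @ w_2 + b_2

# The same mapping written as ONE linear layer.
w_combined = w_1 @ w_2
b_combined = b_1 @ w_2 + b_2
one_layer = x @ w_combined + b_combined

# Both computations agree up to floating point error.
print(np.allclose(two_layers, one_layer))

# With a ReLU between the layers, the output is in general no longer
# reproduced by the single combined linear layer.
with_relu = np.maximum(x @ w_1 + b_1, 0.0) @ w_2 + b_2
print(np.allclose(with_relu, one_layer))
```

Algebraically, $(xW_1 + b_1)W_2 + b_2 = x(W_1W_2) + (b_1W_2 + b_2)$, which is why extra layers add nothing without a non-linearity.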

There are many activation functions to choose from, including __Rectified Linear Units (ReLU), sigmoid (old school), tanh, Leaky ReLU__ and so on.
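For reference, the activations listed above can be written in a few lines of NumPy. This is a sketch for illustration only; the helper names and the Leaky ReLU slope `alpha=0.01` are assumptions, not values taken from this notebook.

```python
import numpy as np

def relu(z):
    # max(z, 0), applied element-wise.
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    # alpha is an illustrative slope for negative inputs (an assumption).
    return np.where(z >= 0, z, alpha * z)

def sigmoid(z):
    # Squashes every value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))
print(leaky_relu(z))
print(sigmoid(z))
print(np.tanh(z))   # tanh squashes values into (-1, 1).
```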

In [1]:
import time
from IPython import display

# Import the libraries and load the datasets.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

import numpy as np
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

Using TensorFlow backend.


---
### Feed Forward Networks in Tensorflow
The following steps implement a simple MLP classifier in Tensorflow.
The model is implemented to classify digits from the MNIST dataset.

In [2]:
# import MNIST data.
from tensorflow.examples.tutorials.mnist import input_data

# Check previous section for details on MNIST dataset.
mnist = input_data.read_data_sets("data/", one_hot=True)

Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz


In [3]:
# Define some standard parameters.
img_h = 28
img_w = 28
n_classes = 10

# Training, validation, testing...
train_x = mnist[0].images
train_y = mnist[0].labels
print("Training Size: {}".format(len(train_x)))

val_x = mnist[1].images
val_y = mnist[1].labels
print("Validation Size: {}".format(len(val_x)))

test_x = mnist[2].images
test_y = mnist[2].labels
print("Test Size: {}".format(len(test_x)))

Training Size: 55000
Validation Size: 5000
Test Size: 10000


__Step 1__: As in linear regression, define the input $x$, output $y$, weight $w$ and bias $b$. Each MNIST image is of size $(28, 28)$. This image is _squashed_ into a vector of size $(1 \times 784)$. `n_classes` represents the total number of digits $(10)$.
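The _squashing_ step amounts to a reshape. The sketch below uses a stand-in batch of blank images (an assumption for illustration); the `mnist` object loaded earlier already returns images in flattened form, so this is only needed when starting from $(28, 28)$ arrays.

```python
import numpy as np

img_h, img_w = 28, 28

# Stand-in batch of 3 blank images; real MNIST pixels are not loaded here.
images = np.zeros((3, img_h, img_w), dtype=np.float32)

# Flatten each (28, 28) image into a single row of 784 values.
flat = images.reshape(len(images), img_h * img_w)
print(flat.shape)  # (3, 784)
```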

__NOTE: Common sizes for the hidden layers tend to be powers of 2 (16, 32, 64 .. 256)__. It is also common these days to keep the size of the hidden layers constant and go _deeper_ rather than _wider_. This is based on the idea that neural networks learn features in increasingly abstract form as the network grows deeper (this is largely an empirical observation, with some theory to back it).

While initializing the weights `w_1` & `w_o` and the biases `b_1` & `b_o`, avoid initializing everything with zero values. With zero (or any identical) weights, every neuron in a layer computes the same output and receives the same gradient, so the neurons never learn different features. Use random initialization, where the weight values are sampled from a Gaussian distribution.
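The effect of zero initialization can be seen on a toy problem. The following NumPy sketch mirrors the structure of the model in this notebook (sigmoid hidden layer, softmax output) but uses hand-written gradients and made-up sizes (8 samples, 5 features, 4 hidden units, 3 classes); the `gradients` helper is hypothetical and exists only for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gradients(x, y, w_1, b_1, w_o, b_o):
    """Gradients of the mean cross-entropy loss w.r.t. w_1 and w_o."""
    h_1 = sigmoid(x @ w_1 + b_1)
    y_pred = softmax(h_1 @ w_o + b_o)
    d_logits = (y_pred - y) / len(x)             # softmax + cross entropy
    grad_w_o = h_1.T @ d_logits
    d_h = d_logits @ w_o.T
    grad_w_1 = x.T @ (d_h * h_1 * (1.0 - h_1))   # sigmoid derivative
    return grad_w_1, grad_w_o

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))                      # toy inputs
y = np.eye(3)[rng.integers(0, 3, size=8)]        # toy one-hot labels

# Zero initialization.
g_w_1, g_w_o = gradients(x, y, np.zeros((5, 4)), np.zeros(4),
                         np.zeros((4, 3)), np.zeros(3))
print(np.allclose(g_w_1, 0.0))        # hidden weights receive no gradient
print(np.allclose(g_w_o, g_w_o[0]))   # all hidden units get the same update

# Gaussian initialization breaks the symmetry.
g_w_1_rand, _ = gradients(x, y, rng.normal(size=(5, 4)), rng.normal(size=4),
                          rng.normal(size=(4, 3)), rng.normal(size=3))
print(np.abs(g_w_1_rand).max() > 0.0)
```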

In [4]:
# Hidden layer size.
layer_size_1 = 128
layer_size_2 = 128

# NOTE: The name of the variable is optional.
x = tf.placeholder(tf.float32, shape=(None, 784), name="X")
y = tf.placeholder(tf.float32, shape=(None, 10), name="Y")
lr_rate = tf.placeholder(tf.float32, shape=(), name="lr")

# Weight & bias.
# Hidden layer.
w_1 = tf.get_variable(shape=[784, layer_size_1], name="w_1",
                      initializer=tf.random_normal_initializer())
b_1 = tf.get_variable(shape=[layer_size_1], name="b_1",
                      initializer=tf.random_normal_initializer())

# Output layer.
w_o = tf.get_variable(shape=[layer_size_1, 10], name="w_o",
                      initializer=tf.random_normal_initializer())
b_o = tf.get_variable(shape=[10], name="b_o",
                      initializer=tf.random_normal_initializer())

# NOTE: Initializations are important.
# Zero initialization: initializer=tf.zeros_initializer())

__Step 2__: Once the placeholders & variables have been created, compute the predicted $y$. The output layer uses the softmax function, which is a generalized form of the logistic function.

$P(y = j| x) = \frac{e^{x^Tw_j}}{\sum_{k=1}^{K}e^{x^Tw_k}}$
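A minimal NumPy sketch of the softmax formula above, treating each row of `logits` as the scores $x^Tw_j$. Subtracting the row maximum is a common numerical-stability trick that leaves the result unchanged. The `softmax` helper and the sample values are illustrative and not part of this notebook's TensorFlow code.

```python
import numpy as np

def softmax(logits):
    # Shift by the row maximum so that exp() cannot overflow.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])   # large values still work
probs = softmax(logits)
print(probs)
print(probs.sum(axis=1))   # each row sums to 1
```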

__Activation Function: One of the most popular activation functions (especially in computer vision) is the Rectified Linear Unit (ReLU).__

$R(x) = \begin{cases} 
          x & x\geq 0 \\
          0 & x\lt 0 
       \end{cases}$

In [5]:
# Compute predicted Y.
# h_1 = tf.nn.relu(tf.add(tf.matmul(x, w_1), b_1)) # <--- Add ReLU activation.
h_1 = tf.sigmoid(tf.add(tf.matmul(x, w_1), b_1)) # <--- Add Sigmoid activation.
y_pred = tf.nn.softmax(tf.add(tf.matmul(h_1, w_o), b_o))

__Step 3__: Once the predicted $y$ has been computed, define the loss between the predicted $y$ and the actual $y$.

As with logistic regression, the loss function is __categorical cross entropy__.


_Cross Entropy Loss_: $H(p, q) = -\sum_x p(x)\log(q(x))$

__Try__: Calculate $H(p, q)$ for a binary classification $(0, 1)$.

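Important note: [NaN Bug](https://stackoverflow.com/questions/33712178/tensorflow-nan-bug). `tf.log(y_pred)` evaluates to `-inf` when a predicted probability is exactly 0, which can turn the loss into NaN. Clipping `y_pred` to a small positive minimum avoids this.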
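The NaN issue is easy to reproduce outside TensorFlow. The NumPy sketch below is an illustration under assumed sample values; the `cross_entropy` helper and the `eps=1e-10` floor are illustrative choices, with the floor matching the clipping range used in the commented-out TensorFlow variant.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-10):
    # Clip predictions away from 0 so that log() stays finite.
    clipped = np.clip(y_pred, eps, 1.0)
    return np.mean(-np.sum(y_true * np.log(clipped), axis=1))

y_true = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
y_pred = np.array([[0.2, 0.8],
                   [1.0, 0.0]])   # second row contains an exact 0

# Without clipping: 0 * log(0) = 0 * -inf = nan.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = np.mean(-np.sum(y_true * np.log(y_pred), axis=1))

print(naive)
print(cross_entropy(y_true, y_pred))
```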

In [6]:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(tf.multiply(y, tf.log(y_pred)), axis=1))
# cross_entropy = tf.reduce_mean(-tf.reduce_sum(tf.multiply(y,
#                                                           tf.log(tf.clip_by_value(y_pred,
#                                                                                   1e-10,1.0))),
#                                                           axis=1))

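# The built-in TensorFlow function expects unnormalized logits, i.e. the
# output-layer values *before* softmax, not y_pred:
# logits = tf.add(tf.matmul(h_1, w_o), b_o)
# cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y,
#                                                                        logits=logits))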

__Step 4__: The loss shows how far we are from the actual $y$ value. Use the loss to update the weights by calculating its gradient w.r.t. $w$. We use a stochastic gradient descent optimizer for this purpose.
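The update rule behind gradient descent is $w \leftarrow w - \eta \nabla_w L$, where $\eta$ is the learning rate. The sketch below applies it to a toy quadratic loss with a hand-written gradient; the loss, the starting point and the learning rate are assumptions chosen for illustration and are unrelated to the MNIST model.

```python
import numpy as np

def loss(w):
    # Toy loss with its minimum at w = 3 (an illustrative choice).
    return np.sum((w - 3.0) ** 2)

def grad(w):
    # Analytical gradient of the toy loss.
    return 2.0 * (w - 3.0)

w = np.zeros(2)
lr = 0.1
for step in range(100):
    # Move against the gradient, scaled by the learning rate.
    w = w - lr * grad(w)

print(w, loss(w))
```

In the TensorFlow model the gradient is computed automatically and evaluated on one mini-batch at a time, which is what makes the procedure _stochastic_.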

In [7]:
# Create a gradient descent optimizer with the set learning rate
optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr_rate)

# Run the optimizer to minimize loss
# Tensorflow automatically computes the gradients for the loss function!!!
train = optimizer.minimize(cross_entropy)

# Gradient Clipping.
# gvs = optimizer.compute_gradients(cross_entropy)
# capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
# train = optimizer.apply_gradients(capped_gvs)

__Step 5__: Add summaries for the variables that are to be visualized.

In [8]:
# Helper function.
# https://www.tensorflow.org/get_started/summaries_and_tensorboard
def variable_summaries(var, name):
    """Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
    with tf.name_scope(name):
        with tf.name_scope('summaries'):
            mean = tf.reduce_mean(var)
            tf.summary.scalar('mean', mean)
            
            with tf.name_scope('stddev'):
                stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
            
            tf.summary.scalar('stddev', stddev)
            tf.summary.scalar('max', tf.reduce_max(var))
            tf.summary.scalar('min', tf.reduce_min(var))
            tf.summary.histogram('histogram', var)
    
# Define summaries.
variable_summaries(w_1, "weights")
variable_summaries(b_1, "bias")
variable_summaries(cross_entropy, "loss")

__Step 6__: Create the op that initializes all variables. It is run at the start of the session, before the model is trained.

In [9]:
# Initialize all variables
init = tf.global_variables_initializer()

__Step 7__: Compute the accuracy.

`tf.argmax` returns the index of the largest value along a specific axis of the tensor (in this case axis 1, the class axis).
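The same accuracy computation can be traced by hand with NumPy. The predictions and labels below are made-up values for illustration.

```python
import numpy as np

y_pred = np.array([[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1],
                   [0.3, 0.3, 0.4]])
y_true = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 1]])

predicted = np.argmax(y_pred, axis=1)   # index of the most probable class
actual = np.argmax(y_true, axis=1)      # index of the 1 in each one-hot row
correct = predicted == actual           # boolean vector, one entry per sample
accuracy = correct.astype(np.float32).mean()
print(predicted, actual, accuracy)
```

Two of the three samples are classified correctly here, so the accuracy is $2/3$.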

In [10]:
# First create the correct prediction by taking the maximum value from the prediction class
# and checking it with the actual class. The result is a boolean column vector
correct_predictions = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))

__Step 8__: With summaries defined for each variable, `merge_all` the summaries and train the model.
The logs are written to `logs/mlp/tf/`, a sub-directory of the current directory.

In [11]:
# Define some hyper-parameters.
lr = 0.01
epochs = 5
batch_size = 55
log_dir = 'logs/mlp/tf/' # Tensorboard log directory.
batch_limit = 100

# Train the model.
with tf.Session() as sess:
    # Initialize all variables
    sess.run(init)
    
    # Create the writer.
    # Merge all the summaries and write them.
    merged = tf.summary.merge_all()
    train_writer = tf.summary.FileWriter(log_dir, sess.graph)
    
    num_batches = int(len(train_x)/batch_size)
    for epoch in range(epochs):
        for batch_num in range(num_batches):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            y_p, curr_w, curr_b,\
            curr_loss, _, summary, cur_acc = sess.run([y_pred, w_1, b_1, cross_entropy,
                                                      train, merged, accuracy],
                                                      feed_dict = {x: batch_xs,
                                                                   y: batch_ys,
                                                                   lr_rate: lr})
            if batch_num % batch_limit == 0:
                # IMP: Add the summary, using the global batch count as the step.
                train_writer.add_summary(summary, epoch * num_batches + batch_num)
                display.clear_output(wait=True)
                time.sleep(0.1)
                
                # Print the loss
                print("Epoch: %d/%d. Batch #: %d/%d. Loss: %.2f. Train Accuracy: %.2f"
                      %(epoch+1, epochs, batch_num, num_batches, curr_loss, cur_acc))
    
    # Testing.
    test_accuracy = sess.run(accuracy, feed_dict={x: test_x,
                                                  y: test_y})
    print("Test Accuracy: %.2f"%test_accuracy)
    train_writer.close() # <-------Important!


Epoch: 5/5. Batch #: 900/1000. Loss: 0.79. Train Accuracy: 0.78
Test Accuracy: 0.74


__Try: 1. Make all the weights zero and test it.__

__Try: 2. Use a ReLU activation instead of sigmoid.__

__Try: 3. Add another hidden layer and check the output.__


---
## Keras Implementation.
As in the linear regression example, Keras makes it __easy__ to generate summaries so that they can be visualized in Tensorboard.

In [12]:
from keras.layers import Dense, Input
from keras.initializers import random_normal
from keras.models import Model
from keras import optimizers, metrics

For TensorBoard, import the callback from __Keras callbacks__: `keras.callbacks.TensorBoard`.

In [13]:
from keras.callbacks import TensorBoard

[Keras Activations](https://keras.io/activations/) - A list of all the activations that are available in Keras. The model here uses the Functional API rather than the Sequential model. The output of every layer is a __Keras Tensor__.

In [14]:
# Create a layer to take an input.
input_l = Input(shape=(784,))
# Compute sigmoid(Wx + b).
dense_1 = Dense(layer_size_1, activation='sigmoid')(input_l) # <-- That's it!
output = Dense(10, activation='softmax')(dense_1)

In [15]:
# Create a model and compile it.
model = Model(inputs=[input_l], outputs=[output])
model.summary() # Get the summary.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 784)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               100480    
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290      
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________
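The parameter counts in the summary can be verified by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The `dense_params` helper below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    # Weights (n_inputs * n_units) plus one bias per unit.
    return n_inputs * n_units + n_units

hidden_params = dense_params(784, 128)
output_params = dense_params(128, 10)
print(hidden_params, output_params, hidden_params + output_params)
# 100480 1290 101770
```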


In [16]:
sgd = optimizers.SGD(lr=lr)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])

# NOTE: Add Tensorboard after compiling.
tensorboard = TensorBoard(log_dir="logs/logistic/keras/")

__That's pretty much it!__
Add `callbacks=[tensorboard]` to the fit function.

In [17]:
# Train the model.
# Add a callback.
model.fit(x=train_x, y=train_y, batch_size=batch_size, 
          epochs=epochs, verbose=0, callbacks=[tensorboard])

<keras.callbacks.History at 0x3effa9b69f50>

In [18]:
# Predict the y's.
y_p = model.predict(test_x)
y_p_loss = model.evaluate(test_x, test_y)



In [19]:
# Print the evaluation metrics.
print("Evaluation Metrics: " + str(model.metrics_names))
print("Loss: {}, Accuracy: {}".format(y_p_loss[0], y_p_loss[1]))

Evaluation Metrics: ['loss', 'acc']
Loss: 0.481021496654, Accuracy: 0.8847


__That's an example with TensorBoard!__

Tensorboard command: `$> tensorboard --logdir <log directory>`