# Chapter 3
# Implementing Neural Networks in TensorFlow

## What is TensorFlow?

- TensorFlow is a Python library that allows users to express arbitrary
computation as a graph of data flows.
- Data in TensorFlow is represented as tensors, which are multidimensional arrays.


---

## Creating and Manipulating TensorFlow Variables

- We use variables to represent the parameters of the model.
- TensorFlow variables have the following three properties:
    - Variables must be explicitly initialized before a graph is used for the first time.
    - We can use gradient methods to modify variables after each iteration as we
search for a model’s optimal parameter settings.
    - We can save the values stored in variables to disk and restore them for later use.

Let’s start off by initializing a variable that
describes the weights connecting neurons between two layers of a feed-forward neural
network:

In [0]:
import tensorflow as tf

In [0]:
# weights = tf.Variable(tf.random_normal([300, 200], stddev=0.5),
#                       name='weights')

By default, a variable created with tf.Variable is trainable. If a variable is not
meant to be updated during training, we can pass the optional trainable=False flag:

In [0]:
# weights = tf.Variable(tf.random_normal([300, 200], stddev=0.5), 
#                       name='weights', trainable=False)
# here, the weights are not meant to be trainable

In [0]:
# Several other methods for initializing a TensorFlow variable:
# tf.zeros(shape, dtype=tf.float32, name=None)
# tf.ones(shape, dtype=tf.float32, name=None)
# tf.random_normal(shape, mean=0, stddev=1.0, dtype=tf.float32,
#                  seed=None, name=None)
# tf.truncated_normal(shape, mean=0, stddev=1, dtype=tf.float32,
#                     seed=None, name=None)
# tf.random_uniform(shape, minval=0, maxval=None, dtype=tf.float32, 
#                   seed=None, name=None)

When we call tf.Variable, three operations are added to the computation graph:
- The operation producing the tensor we use to initialize our variable.
- The tf.assign operation, which is responsible for filling the variable with the
initializing tensor prior to the variable’s use.
- The variable operation, which holds the current value of the variable.

As mentioned in the list above, before we use any TensorFlow variable, its tf.assign
operation must be run so that the variable is initialized with the desired value.
We can do this by running tf.initialize_all_variables(), which triggers all of the
tf.assign operations in our graph. We can also selectively initialize only certain
variables in our computational graph using tf.initialize_variables([var1, var2, ...]).

---

## TensorFlow Operations

- On a high level, TensorFlow operations represent abstract transformations
that are applied to tensors in the computation graph.


Element-wise mathematical operations: Add, Sub, Mul, Div, Exp, Log, ...

Array operations: Concat, Slice, Split, Constant, ...

Matrix operations: MatMul, MatrixInverse, MatrixDeterminant, ...

Stateful operations: Variable, Assign, AssignAdd, ...

Neural network building blocks: SoftMax, Sigmoid, ReLU, Convolutional2D, MaxPool, ...

Checkpointing operations: Save, Restore, ...

Queue and synchronization operations: Enqueue, Dequeue, MutexAcquire, MutexRelease, ...

Control flow operations: Merge, Switch, Enter, Leave, NextIteration, ...

---

## Placeholder Tensors

- The only missing piece is how we pass the input to our deep model (during
both train and test time). A variable is insufficient because it is only meant to be initialized
once. Instead, we need a component that we populate every single time the
computation graph is run.

A placeholder
is instantiated as follows and can be used in operations just like ordinary TensorFlow
variables and tensors:

In [0]:
# x = tf.placeholder(tf.float32, name='x', shape=[None, 784])
# W = tf.Variable(tf.random_uniform([784, 10], -1, 1), name='W')
# multiply = tf.matmul(x, W)


Here we define a placeholder where x represents a minibatch of data stored
as float32’s. We notice that x has 784 columns, which means that each data sample
has 784 dimensions. We also notice that x has an undefined number of rows. This
means that x can be initialized with an arbitrary number of data samples. While we
could instead multiply each data sample separately by W, expressing a full minibatch
as a tensor allows us to compute the results for all the data samples in parallel. The
result is that the ith row of the multiply tensor corresponds to W multiplied with
the ith data sample.
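
The row-by-row claim above can be checked with a small sketch. It does not use TensorFlow: `matmul` here is a hypothetical pure-Python stand-in for tf.matmul, and the shapes are shrunk from [None, 784] x [784, 10] to 3 x 4 and 4 x 2 so the numbers are easy to follow.

```python
# Pure-Python stand-in for tf.matmul on nested lists (illustrative only).
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(a_row[t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]

# A toy "minibatch" of 3 samples with 4 features each (instead of 784).
x = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 0.0, 3.0],
     [2.0, 2.0, 1.0, 0.0]]

# A toy weight matrix mapping 4 features to 2 outputs (instead of 10).
W = [[1.0, -1.0],
     [0.5, 0.0],
     [0.0, 2.0],
     [-1.0, 1.0]]

# One multiplication for the whole minibatch ...
batched = matmul(x, W)

# ... versus multiplying each sample (as a 1 x 4 matrix) separately.
one_at_a_time = [matmul([sample], W)[0] for sample in x]

print(batched)
print(batched == one_at_a_time)
```

Both approaches give the same rows; the batched form simply hands the whole computation to one operation, which is what lets a real backend parallelize it.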

---

## Sessions in TensorFlow

- A TensorFlow program interacts with a computation graph using a session.
- The TensorFlow
session is responsible for building the initial graph, and can be used to initialize
all variables appropriately and to run the computational graph.

In [0]:
# import tensorflow as tf
# from read_data import get_minibatch

# x = tf.placeholder(tf.float32, name='x', shape=[None, 784])
# W = tf.Variable(tf.random_uniform([784, 10], -1, 1), name='W')
# b = tf.Variable(tf.zeros([10]), name='biases')
# output = tf.matmul(x, W) + b

# init_op = tf.initialize_all_variables()

# sess = tf.Session()
# sess.run(init_op)
# feed_dict = {x: get_minibatch()}
# sess.run(output, feed_dict=feed_dict)

How exactly does a single line of code (sess.run) accomplish such a
wide variety of functions? The answer lies in the powerful expressivity of the underlying computational graph. All of these functionalities are represented as TensorFlow
operations that can be passed as arguments to sess.run. All sess.run needs to do is
traverse down the computational graph to identify all of the dependencies that compose
the relevant subgraph, ensure that all of the placeholder variables that belong to
the identified subgraph are filled using the feed_dict, and then traverse back up the
subgraph (executing all of the intermediate operations) to evaluate the original arguments.

In short, sess.run traces all of the dependencies of the requested operations through the graph, checks that every placeholder they need has been fed, and then executes that subgraph.
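
To make the traversal concrete, here is a toy evaluator written in plain Python. It is only a sketch of the idea, not how TensorFlow is implemented: the `Node` class and the `placeholder` and `run` functions are hypothetical names invented for this illustration.

```python
# Toy model of a dataflow graph; none of these names are TensorFlow APIs.
class Node:
    def __init__(self, name, op=None, inputs=()):
        self.name = name
        self.op = op          # callable combining input values, or None for a placeholder
        self.inputs = inputs  # upstream Node objects this node depends on


def placeholder(name):
    return Node(name)


def run(target, feed_dict):
    """Evaluate `target`, visiting only the nodes it depends on."""
    cache = {}

    def evaluate(node):
        if node in cache:
            return cache[node]
        if node.op is None:
            # Placeholders must be supplied by the caller.
            if node not in feed_dict:
                raise ValueError("placeholder %r was not fed" % node.name)
            value = feed_dict[node]
        else:
            # Resolve dependencies first, then apply this node's operation.
            value = node.op(*[evaluate(i) for i in node.inputs])
        cache[node] = value
        return value

    return evaluate(target)


x = placeholder("x")
unused = placeholder("unused")
w = Node("w", op=lambda: 3.0)
b = Node("b", op=lambda: 1.0)
product = Node("product", op=lambda a, c: a * c, inputs=(x, w))
output = Node("output", op=lambda a, c: a + c, inputs=(product, b))
other = Node("other", op=lambda a: a * 2, inputs=(unused,))

# Only x is needed to compute output, so "unused" does not have to be fed.
result = run(output, {x: 2.0})
print(result)
```

Asking for `other` with the same feed would fail, because its subgraph contains a placeholder that was not supplied. This mirrors the requirement that every placeholder in the identified subgraph be filled via the feed_dict.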

---

## Navigating Variable Scopes and Sharing Variables

Let's consider the following example:


In [0]:
import tensorflow as tf

In [0]:
def my_network(input):
    W_1 = tf.Variable(tf.random_uniform([784, 100], -1, 1), name="W_1")
    b_1 = tf.Variable(tf.zeros([100]), name="biases_1")
    output_1 = tf.matmul(input, W_1) + b_1

    W_2 = tf.Variable(tf.random_uniform([100, 50], -1, 1), name="W_2")
    b_2 = tf.Variable(tf.zeros([50]), name="biases_2")
    output_2 = tf.matmul(output_1, W_2) + b_2

    W_3 = tf.Variable(tf.random_uniform([50, 10], -1, 1), name="W_3")
    b_3 = tf.Variable(tf.zeros([10]), name="biases_3")
    output_3 = tf.matmul(output_2, W_3) + b_3

    # printing names
    print("Printing names of weight parameters")
    print(W_1.name, W_2.name, W_3.name)
    print("Printing names of bias parameters")
    print(b_1.name, b_2.name, b_3.name)

    return output_3

In [0]:
i_1 = tf.placeholder(tf.float32, [1000, 784], name='i_1')
my_network(i_1)

Printing names of weight parameters
W_1:0 W_2:0 W_3:0
Printing names of bias parameters
biases_1:0 biases_2:0 biases_3:0


<tf.Tensor 'add_2:0' shape=(1000, 10) dtype=float32>

In [0]:
i_2 = tf.placeholder(tf.float32, [1000, 784], name='i_2')
my_network(i_2)

Printing names of weight parameters
W_1_1:0 W_2_1:0 W_3_1:0
Printing names of bias parameters
biases_1_1:0 biases_2_1:0 biases_3_1:0


<tf.Tensor 'add_5:0' shape=(1000, 10) dtype=float32>

This network setup consists of six variables describing three layers. As a result, if we
wanted to use this network multiple times, we’d prefer to encapsulate it into a compact
function like my_network, which we can call multiple times.

If we observe closely, our second call to my_network doesn’t use the same variables as
the first call (in fact, the names are different!). Instead, we’ve created a second set of
variables! In many cases, we don’t want to create a copy, but rather reuse the model
and its variables. It turns out that, in this case, we shouldn’t be using tf.Variable.
Instead, we should be using a more advanced naming scheme that takes advantage of
TensorFlow’s variable scoping.

TensorFlow's variable scoping mechanisms are largely controlled by two functions:

In [0]:
# tf.get_variable(name, shape, initializer)
# tf.variable_scope(scope_name)

Let's try to rewrite my_network in a cleaner fashion:

In [0]:
def layer(input, weight_shape, bias_shape):
    weight_init = tf.random_uniform_initializer(minval=-1, maxval=1)
    bias_init = tf.constant_initializer(value=0)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)

    return tf.matmul(input, W) + b

def my_network(input):
    with tf.variable_scope("layer_1"):
        output_1 = layer(input, [784, 100], [100])
    with tf.variable_scope("layer_2"):
        output_2 = layer(output_1, [100, 50], [50])
    with tf.variable_scope("layer_3"):
        output_3 = layer(output_2, [50, 10], [10])

    return output_3

In [0]:
i_1 = tf.placeholder(tf.float32, [1000, 784], name='i_1')
my_network(i_1)

<tf.Tensor 'layer_3/add:0' shape=(1000, 10) dtype=float32>

In [0]:
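# A second call without enabling reuse raises a ValueError, because the
# variables layer_1/W, layer_1/b, ... already exist:
# i_2 = tf.placeholder(tf.float32, [1000, 784], name='i_2')
# my_network(i_2)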

Unlike tf.Variable, the tf.get_variable command checks that a variable of the
given name hasn’t already been instantiated. By default, sharing is not allowed (just to
be safe!), but if we want to enable sharing within a variable scope, we can say so
explicitly:

In [0]:
# with tf.variable_scope("shared_variables") as scope:
#     i_1 = tf.placeholder(tf.float32, [1000, 784], name='i_1')
#     my_network(i_1)
#     scope.reuse_variables()
#     i_2 = tf.placeholder(tf.float32, [1000, 784], name='i_2')
#     my_network(i_2)
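
The create-or-reuse behavior can be imitated with a small name registry. The sketch below is a toy model under stated assumptions, not TensorFlow's implementation: `VariableStore`, its methods, and `toy_network` are hypothetical names, and a one-element list stands in for a variable.

```python
import contextlib


class VariableStore:
    """Toy registry that maps scoped names to values (not a TensorFlow class)."""

    def __init__(self):
        self.variables = {}
        self.scope_stack = []
        self.reuse = False

    @contextlib.contextmanager
    def variable_scope(self, name):
        # Push the scope name on entry and pop it on exit.
        self.scope_stack.append(name)
        try:
            yield
        finally:
            self.scope_stack.pop()

    def get_variable(self, name, initial_value):
        full_name = "/".join(self.scope_stack + [name])
        exists = full_name in self.variables
        if exists and not self.reuse:
            raise ValueError("%s already exists; enable reuse to share it" % full_name)
        if not exists and self.reuse:
            raise ValueError("%s does not exist; disable reuse to create it" % full_name)
        if not exists:
            self.variables[full_name] = [initial_value]  # a list acts as a mutable box
        return self.variables[full_name]


def toy_network(store):
    """Create (or look up) one weight per layer, mirroring my_network's scopes."""
    weights = []
    for scope_name in ("layer_1", "layer_2", "layer_3"):
        with store.variable_scope(scope_name):
            weights.append(store.get_variable("W", 0.0))
    return weights


store = VariableStore()
first_call = toy_network(store)    # creates layer_1/W, layer_2/W, layer_3/W
store.reuse = True                 # plays the role of scope.reuse_variables()
second_call = toy_network(store)   # returns the very same objects

print(sorted(store.variables))
print(all(a is b for a, b in zip(first_call, second_call)))
```

Because lookup is by scoped name, the second call receives the objects created by the first call instead of fresh copies, which is the property we want when applying one model to several inputs.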

---

## Managing Models over the CPU and GPU

TensorFlow allows us to utilize multiple computing devices, if we so desire, to build
and train our models. Supported devices are represented by string IDs and normally
consist of the following:

- "/cpu:0"
The CPU of our machine.
- "/gpu:0"
The first GPU of our machine, if it has one.
- "/gpu:1"
The second GPU of our machine, if it has one.

When a TensorFlow operation has both CPU and GPU kernels, and GPU use is
enabled, TensorFlow will automatically opt to use the GPU implementation. To
inspect which devices are used by the computational graph, we can initialize our TensorFlow
session with the log_device_placement set to True:

In [0]:
# sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

If we want to use a specific device, we can do so by using with tf.device to
select the appropriate device. If the chosen device is not available, however, an error
will be thrown. If we would like TensorFlow to fall back to another available device when
the chosen device does not exist, we can pass the allow_soft_placement flag to the session
configuration as follows:

In [0]:
with tf.device('/gpu:2'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='a')
    b = tf.constant([1.0, 2.0], shape=[2, 1], name='b')
    c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                        log_device_placement=True))

sess.run(c)

array([[ 5.],
       [11.]], dtype=float32)

TensorFlow also allows us to build models that span multiple GPUs by building
models in a tower-like fashion as shown in Figure 3-3. The following code is an
example of multi-GPU code:

In [0]:
# c = []

# for d in ['/gpu:0', '/gpu:1']:
#     with tf.device(d):
#         a = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='a')
#         b = tf.constant([1.0, 2.0], shape=[2, 1], name='b')
#         c.append(tf.matmul(a, b))

# with tf.device('/cpu:0'):
#     sum = tf.add_n(c)

# sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

# sess.run(sum)

---

## Specifying the Logistic Regression Model in TensorFlow

Now that we’ve developed all of the basic concepts of TensorFlow, let’s build a simple
model to tackle the MNIST dataset. As you may recall, our goal is to identify handwritten
digits from 28 x 28 black-and-white images. The first network that we’ll build
implements a simple machine learning algorithm known as logistic regression.

On a high level, logistic regression is a method by which we can calculate the probability
that an input belongs to one of the target classes. In our case, we’ll compute the
probability that a given input image is a 0, 1, ..., or 9. Our model uses a
matrix W representing the weights of the connections in the network, as well as a
vector b corresponding to the biases to estimate whether an input x belongs to
class i using the softmax expression we talked about earlier.
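
As a reminder of what that computation looks like, here is a minimal pure-Python sketch of softmax applied to the affine scores of a single input. The function names and the tiny 2-feature, 3-class shapes are assumptions made for illustration; the real model uses 784 features and 10 classes.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift by the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def logistic_regression(x, W, b):
    """x: list of n features; W: n x k nested list; b: list of k biases."""
    logits = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
              for j in range(len(b))]
    return softmax(logits)


# With all-zero weights and biases, every class gets the same score,
# so the predicted distribution is uniform.
zero_W = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
zero_b = [0.0, 0.0, 0.0]
probs = logistic_regression([1.0, 2.0], zero_W, zero_b)

# A larger score always receives a larger probability.
ranked = softmax([1.0, 2.0, 3.0])

print(probs)
print(ranked)
```

The uniform output for zero parameters is worth noting, since a model whose weights and biases all start at zero begins training from exactly this state.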

You’ll notice that the network interpretation for logistic regression is rather primitive.
It doesn’t have any hidden layers, meaning that it is limited in its ability to learn complex
relationships! We have an output softmax of size 10 because we have 10 possible
outcomes for each input. Moreover, we have an input layer of size 784, one input neuron
for every pixel in the image! As we’ll see, the model makes decent headway
toward correctly classifying our dataset, but there’s lots of room for improvement.

We’ll build the logistic regression model in four phases:
1. inference: produces a probability distribution over the output classes given a
minibatch
2. loss: computes the value of the error function (in this case, the cross-entropy
loss)
3. training: responsible for computing the gradients of the model’s parameters and
updating the model
4. evaluate: will determine the effectiveness of a model

Given a minibatch, which consists of 784-dimensional vectors representing MNIST
images, we can represent logistic regression by taking the softmax of the input multiplied
with a matrix representing the weights connecting the input and output layer.
Each row of the output tensor represents the probability distribution over output
classes for each corresponding data sample in the minibatch:

In [0]:
import tensorflow as tf

In [0]:
def inference(x):
    init = tf.constant_initializer(value=0)
    W = tf.get_variable('W', [784, 10], initializer=init)
    b = tf.get_variable('b', [10], initializer=init)
    output = tf.nn.softmax(tf.matmul(x, W) + b)
    
    return output

Now, given the correct labels for a minibatch, we should be able to compute the average
error per data sample. We accomplish this using the following code snippet that
computes the cross-entropy loss over a minibatch:

In [0]:
def loss(output, y):
    dot_product = y * tf.log(output)

    # Reduction along axis 0 collapses each column into a
    # single value, whereas reduction along axis 1 collapses
    # each row into a single value. In general, reduction along
    # axis i collapses the ith dimension of a tensor to size 1.
    xentropy = -tf.reduce_sum(dot_product, reduction_indices=1)

    loss = tf.reduce_mean(xentropy)

    return loss
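
To see what the two reductions do, here is a worked example in plain Python on a minibatch of two samples and three classes. It is a sketch that mirrors the arithmetic only; `cross_entropy` is a hypothetical helper, and the probabilities are made-up numbers.

```python
import math


def cross_entropy(outputs, labels):
    """Mean over the minibatch of -sum(label * log(prob)) for each sample."""
    per_sample = []
    for probs, one_hot in zip(outputs, labels):
        # Equivalent of reducing along axis 1: one value per row.
        per_sample.append(-sum(t * math.log(p) for p, t in zip(probs, one_hot)))
    # Equivalent of reduce_mean: average over the rows.
    return sum(per_sample) / len(per_sample)


# Predicted distributions for two samples (each row sums to 1).
outputs = [[0.5, 0.25, 0.25],
           [0.1, 0.8, 0.1]]

# One-hot labels: sample 0 is class 0, sample 1 is class 1.
labels = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]

batch_loss = cross_entropy(outputs, labels)
expected = -(math.log(0.5) + math.log(0.8)) / 2
print(batch_loss, expected)
```

With one-hot labels only the log-probability assigned to the correct class survives the sum, so the loss shrinks as the model puts more probability on the right answer.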

Then, given the current cost incurred, we’ll want to compute the gradients and modify
the parameters of the model appropriately. TensorFlow makes this easy by giving
us access to built-in optimizers that produce a special train operation that we can run
via a TensorFlow session when we minimize them. Note that when we create the
training operation, we also pass in a variable that represents the number of minibatches
that have been processed. Each time the training operation is run, this step
variable is incremented so that we can keep track of progress:

In [0]:
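def training(cost, global_step):
    # learning_rate is a module-level hyperparameter defined elsewhere
    # (the full training script below sets it to 0.01).
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)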
    train_op = optimizer.minimize(cost, global_step=global_step)
    return train_op
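
Conceptually, each run of the training operation performs one gradient descent update and bumps the step counter. The sketch below shows that idea in plain Python on a one-parameter toy problem; the function name, the objective f(w) = (w - 3)^2, and the learning rate of 0.1 are assumptions chosen for illustration, and the gradient is written out by hand where TensorFlow would derive it automatically.

```python
def gradient_descent_step(params, grads, learning_rate):
    """Move every parameter a small step against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Toy objective: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
global_step = 0
learning_rate = 0.1

for _ in range(100):
    grads = [2.0 * (w[0] - 3.0)]
    w = gradient_descent_step(w, grads, learning_rate)
    global_step += 1  # one increment per training step

print(w[0], global_step)
```

After 100 steps the parameter sits very close to the minimizer at 3, and the counter records exactly how many updates were applied.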

Finally, we put together a simple computational subgraph to evaluate the model on
the validation or test set:

In [0]:
def evaluate(output, y):
    correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    return accuracy
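
The same accuracy computation can be followed by hand on a tiny example. This is a pure-Python sketch with hypothetical helper names and made-up predictions for four samples and three classes.

```python
def argmax(values):
    """Index of the largest entry (the first one, if there is a tie)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(outputs, labels):
    """Fraction of samples whose most probable class matches the label."""
    correct = [1.0 if argmax(o) == argmax(y) else 0.0
               for o, y in zip(outputs, labels)]
    return sum(correct) / len(correct)


outputs = [[0.7, 0.2, 0.1],   # predicts class 0
           [0.1, 0.3, 0.6],   # predicts class 2
           [0.4, 0.5, 0.1],   # predicts class 1
           [0.2, 0.2, 0.6]]   # predicts class 2

labels = [[1, 0, 0],          # true class 0
          [0, 0, 1],          # true class 2
          [1, 0, 0],          # true class 0 (the model is wrong here)
          [0, 0, 1]]          # true class 2

acc = accuracy(outputs, labels)
print(acc)
```

Three of the four predictions match their labels, so the accuracy is 0.75 and the corresponding error is 0.25.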

---

## Logging and Training the Logistic Regression Model

In [0]:
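import input_data
# Load MNIST with one-hot labels; the "data/" directory is an assumed location.
mnist = input_data.read_data_sets("data/", one_hot=True)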

import tensorflow as tf
import time


# Parameters
learning_rate = 0.01
training_epochs = 60
batch_size = 100
display_step = 1

def inference(x):
    init = tf.constant_initializer(value=0)
    W = tf.get_variable("W", [784, 10],
                         initializer=init)
    b = tf.get_variable("b", [10],
                         initializer=init)
    output = tf.nn.softmax(tf.matmul(x, W) + b)

    w_hist = tf.histogram_summary("weights", W)
    b_hist = tf.histogram_summary("biases", b)
    y_hist = tf.histogram_summary("output", output)

    return output

def loss(output, y):
    dot_product = y * tf.log(output)

    # Reduction along axis 0 collapses each column into a single
    # value, whereas reduction along axis 1 collapses each row 
    # into a single value. In general, reduction along axis i 
    # collapses the ith dimension of a tensor to size 1.
    xentropy = -tf.reduce_sum(dot_product, reduction_indices=1)
     
    loss = tf.reduce_mean(xentropy)

    return loss

def training(cost, global_step):

    tf.scalar_summary("cost", cost)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_op = optimizer.minimize(cost, global_step=global_step)

    return train_op


def evaluate(output, y):
    correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    tf.scalar_summary("validation error", (1.0 - accuracy))

    return accuracy

if __name__ == '__main__':

    with tf.Graph().as_default():

        x = tf.placeholder("float", [None, 784]) # mnist data image of shape 28*28=784
        y = tf.placeholder("float", [None, 10]) # 0-9 digits recognition => 10 classes


        output = inference(x)

        cost = loss(output, y)

        global_step = tf.Variable(0, name='global_step', trainable=False)

        train_op = training(cost, global_step)

        eval_op = evaluate(output, y)

        summary_op = tf.merge_all_summaries()

        saver = tf.train.Saver()

        sess = tf.Session()

        summary_writer = tf.train.SummaryWriter("logistic_logs/",
                                            graph_def=sess.graph_def)

        
        init_op = tf.initialize_all_variables()

        sess.run(init_op)


        # Training cycle
        for epoch in range(training_epochs):

            avg_cost = 0.
            total_batch = int(mnist.train.num_examples/batch_size)
            # Loop over all batches
            for i in range(total_batch):
                minibatch_x, minibatch_y = mnist.train.next_batch(batch_size)
                # Fit training using batch data
                sess.run(train_op, feed_dict={x: minibatch_x, y: minibatch_y})
                # Compute average loss
                avg_cost += sess.run(cost, feed_dict={x: minibatch_x, y: minibatch_y})/total_batch
            # Display logs per epoch step
            if epoch % display_step == 0:
                print("Epoch:", '%04d' % (epoch+1), "cost =", "{:.9f}".format(avg_cost))

                accuracy = sess.run(eval_op, feed_dict={x: mnist.validation.images, y: mnist.validation.labels})

                print("Validation Error:", (1 - accuracy))

                summary_str = sess.run(summary_op, feed_dict={x: minibatch_x, y: minibatch_y})
                summary_writer.add_summary(summary_str, sess.run(global_step))

                saver.save(sess, "logistic_logs/model-checkpoint", global_step=global_step)


        print("Optimization Finished!")


        accuracy = sess.run(eval_op, feed_dict={x: mnist.test.images, y: mnist.test.labels})

        print("Test Accuracy:", accuracy)