# Practical Session 4: Getting Started with Deep Learning Models in TensorFlow

*Notebook by [Marek Rei](https://github.com/marekrei/cl-datasci-pnp)*

In this practical, we will continue from where the lecture left off and learn more about using TensorFlow.

The practical will cover a few different network architectures and we will look at different components that are often used in neural networks.

To start off, let's import `TensorFlow` into our notebook. If you have just installed `TensorFlow`, you will have version 2.X, which is more user-friendly and packs many details into concise function calls. In contrast, `TensorFlow` 1.X is more verbose and lower-level. It is still useful for looking "under the hood", and you can in fact run `TensorFlow` 1.X code under `TensorFlow` 2.X using its compatibility mode, like so:

In [None]:
# if using TensorFlow 1.X
#import tensorflow as tf

# Otherwise, you can still run code from this notebook 
# with TensorFlow 2.X using its compatibility mode
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

In [None]:
tf.__version__

# Minimal TensorFlow Example

This is the first example from the lecture. 

We first create a network with two placeholders that adds these together and returns the result. Then, we execute this network with two input values, 4 and 5. This returns the result 9.

In [None]:
tf.reset_default_graph()

a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
y = a + b

with tf.Session() as sess:
    result = sess.run(y,
                      feed_dict={a:4, b:5})
    print("Result: ", result)

Occasionally throughout this notebook, the following function will be called:

In [None]:
tf.reset_default_graph()

This is necessary to reset the TensorFlow network. We have many different small networks in one notebook and we don't want them interfering with each other, so as a pre-emptive measure we will occasionally reset the computation graph.

# Training the Parameters

This is the second example from the lecture, showing how to optimize the parameters in your model.

We define a network that takes a vector `x` with two features as input, multiplies the features with corresponding parameters in `W`, and sums them together. We then train this network for 10 epochs over a single training point, optimizing the output towards value 20. Printing out the results, we can see that the output `y` gradually moves towards the target.

In [None]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [2], name="x")
target = tf.placeholder(tf.float32, name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

W = tf.get_variable("W", initializer=[0.2, 0.7])
y = tf.reduce_sum(x * W)

loss = tf.pow(target - y, 2.0)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        result, _ = sess.run(
            [y, train_op], 
            feed_dict={x: [1.0, 1.0], 
                       target: 20.0, 
                       learning_rate: 0.1}) 
        print("Result: ", result)
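To see where numbers like these come from, the update can be reproduced by hand. The following is a plain-Python sketch of the same gradient descent step, not TensorFlow's implementation, and the helper name `train_single_point` is made up for illustration. For the loss `(target - y)^2` with `y = sum(x_i * W_i)`, the gradient with respect to `W_i` is `-2 * (target - y) * x_i`.

```python
def train_single_point(x, target, weights, learning_rate, epochs):
    """Gradient descent on loss = (target - y)^2 with y = sum(x_i * w_i).

    Returns the value of y computed at the start of each epoch.
    """
    weights = list(weights)  # copy so the caller's list is not modified
    outputs = []
    for _ in range(epochs):
        y = sum(xi * wi for xi, wi in zip(x, weights))
        outputs.append(y)
        error = target - y
        # d(loss)/d(w_i) = -2 * error * x_i, so the descent step
        # adds 2 * learning_rate * error * x_i to each weight.
        weights = [wi + learning_rate * 2.0 * error * xi
                   for xi, wi in zip(x, weights)]
    return outputs


outputs = train_single_point(x=[1.0, 1.0], target=20.0,
                             weights=[0.2, 0.7], learning_rate=0.1, epochs=10)
for y in outputs:
    print("Result: ", round(y, 4))
```

With these settings the remaining error shrinks by a factor of 0.6 per epoch (19.1, 11.46, 6.876, ...), so the output approaches 20 steadily without overshooting.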

# Network Layers

In most cases, we don't actually need to create the trainable variables manually. Instead, the feedforward layer is available as a pre-defined module.

In [None]:
x = tf.placeholder(tf.float32, [None, 2], name="x")
y = tf.layers.dense(x, 1, activation=None)

This creates a layer that takes `x` as input, has 1 output neuron (we can also create bigger layers of course), and has no non-linear activation. The parameters that connect the two layers together are created automatically and are trained during optimization. By default, the weights are initialized randomly.
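As a rough picture of what such a layer computes, here is a plain-Python sketch of the formula `activation(x @ W + b)` applied to each row of a batch. The function `dense` and the fixed parameter values below are illustrative assumptions, not TensorFlow code.

```python
def dense(inputs, weights, bias, activation=None):
    """Apply y = activation(x @ W + b) to every row of `inputs`.

    inputs:  list of rows, each with n_in values
    weights: n_in x n_out nested list
    bias:    list of n_out values
    """
    outputs = []
    for row in inputs:
        out_row = []
        for j in range(len(bias)):
            total = bias[j] + sum(row[i] * weights[i][j] for i in range(len(row)))
            out_row.append(activation(total) if activation else total)
        outputs.append(out_row)
    return outputs


# Hypothetical fixed parameters; TensorFlow would initialize the weights randomly.
W = [[0.2], [0.7]]   # 2 inputs -> 1 output
b = [0.0]
result = dense([[1.0, 1.0], [2.0, 0.0]], W, b)
print(result)
```

The batch dimension (`None` in the placeholder shape) corresponds to the outer list here: each input row produces one output row.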

Let's replace the manually created variables with a TensorFlow dense layer.

In [None]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.float32, name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

y = tf.layers.dense(x, 1, activation=None)

loss = tf.pow(target - y, 2.0)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)

train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        result, _ = sess.run(
            [y, train_op], 
            feed_dict={x: [[1.0, 1.0]], 
                       target: 20.0, 
                       learning_rate: 0.1}) 
        print("Result: ", result[0][0])

This version typically gets close to the target a bit faster than before. That's because the dense layer also creates a bias parameter internally, which adds a bit more power to the model.
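One way to see the effect of the extra parameter, assuming plain gradient descent on the squared error of a single training point (the symbols $\eta$ for the learning rate, $t$ for the target and $b$ for the bias are notation introduced here):

$$
\begin{aligned}
y &= \mathbf{w} \cdot \mathbf{x} + b, \qquad L = (t - y)^2 \\
\mathbf{w}' &= \mathbf{w} + 2\eta\,(t - y)\,\mathbf{x}, \qquad b' = b + 2\eta\,(t - y) \\
t - y' &= (t - y)\left(1 - 2\eta\,(\lVert \mathbf{x} \rVert^2 + 1)\right)
\end{aligned}
$$

For $\mathbf{x} = [1, 1]$ and $\eta = 0.1$, the error shrinks by a factor of 0.4 per step. Without the bias the $+1$ term disappears and the factor is 0.6. The starting point also differs between runs, since the dense layer's weights are initialized randomly.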

In large networks, you would normally chain together many large layers with non-linear activation functions:

In [None]:
x = tf.placeholder(tf.float32, [None, 300], name="x")
hidden1 = tf.layers.dense(x, 100, activation=tf.tanh)
hidden2 = tf.layers.dense(hidden1, 50, activation=tf.tanh)
y = tf.layers.dense(hidden2, 1, activation=tf.sigmoid)

# Activation Functions

In the last example, we used non-linear activation functions. As we saw in the lectures, this is what gives neural networks their power to model non-linear patterns in the data. There are a number of different activation functions to choose from.

The [sigmoid function](https://en.wikipedia.org/wiki/Logistic_function), also known as the logistic function, is the most classic non-linear activation. It transforms the value to a range between 0 and 1.

In [None]:
hidden = tf.layers.dense(x, 100, activation=tf.sigmoid)

In modern networks, the [tanh function](https://en.wikipedia.org/wiki/Hyperbolic_function) is used more often. It has more flexibility, as it transforms the input value to a range between -1 and 1, and can therefore output negative values as well.

In [None]:
hidden = tf.layers.dense(x, 100, activation=tf.tanh)

Another popular one is the [Rectified Linear Unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) function, or the ReLU. This function acts as a linear function above zero, but restricts everything below zero to 0. By doing this it also introduces a non-linearity.

In [None]:
hidden = tf.layers.dense(x, 100, activation=tf.nn.relu)

The piecewise-linear shape of the ReLU can help it converge faster on some tasks, although in practice tanh may be a more robust option.
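To make the output ranges of these three functions concrete, here is a small plain-Python sketch that evaluates each of them at a few points. The helper functions `sigmoid` and `relu` are written out by hand for illustration and are not the TensorFlow ops.

```python
import math


def sigmoid(z):
    """Logistic function: squashes z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit: identity above zero, 0 otherwise."""
    return max(0.0, z)


# Columns: input, sigmoid, tanh, ReLU
for z in [-2.0, 0.0, 2.0]:
    print(z, round(sigmoid(z), 3), round(math.tanh(z), 3), relu(z))
```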

Finally, [softmax](https://en.wikipedia.org/wiki/Softmax_function) is a special type of activation function. It takes a whole layer as input and converts it into a probability distribution, such that all values are between 0 and 1, and together they sum up to 1. It is often used in the output layers of networks when performing classification, in order to predict a probability distribution over all the possible classes.

In [None]:
output = tf.layers.dense(hidden, 2, activation=None)
probabilities = tf.nn.softmax(output)
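For intuition, the softmax computation itself can be sketched in a few lines of plain Python. The function below is an illustrative stand-in rather than TensorFlow's implementation. It subtracts the maximum score first, a common trick to avoid overflow in the exponential.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow in exp.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 0.0])
print(probs, sum(probs))
```

The larger score receives the larger probability, and equal scores receive equal probabilities.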

# Operations and Useful Functions

TensorFlow provides its own versions of most common mathematical operations. This means you can add them to your computation graph and use them inside your neural network.

In [None]:
tf.abs # absolute value
tf.negative # computes the negative value
tf.sign # returns 1, 0 or -1 depending on the sign of the input
tf.reciprocal # reciprocal 1/x
tf.square # return input squared
tf.round # return rounded value
tf.sqrt # square root
tf.rsqrt # reciprocal of square root
tf.pow # power
tf.exp # exponential

These operations can be applied to scalar values, but also to vectors, matrices and higher-order tensors. In the latter case, they will be applied element-wise. For example:


In [None]:
tf.reset_default_graph()

a = tf.placeholder(tf.int32, name="a")
b = tf.placeholder(tf.float32, [3], name="b")

c = tf.negative(a)
d = tf.square(b)

with tf.Session() as sess:
    c_, d_ = sess.run([c, d], feed_dict={a:4, b:[3.0,2.0,1.0]})
    print(c_, d_)

Some useful operations are performed over a whole vector/matrix tensor and return a single value:


In [None]:
tf.reduce_sum # Add elements together
tf.reduce_mean # Average over elements
tf.reduce_min # Minimum value
tf.reduce_max # Maximum value
tf.argmax # Index of the largest value
tf.argmin # Index of the smallest value

In [None]:
tf.reset_default_graph()

b = tf.placeholder(tf.float32, [3,2], name="b")
c = tf.reduce_sum(b)

with tf.Session() as sess:
    c_ = sess.run(c, feed_dict={b:[[6.0, 5.0],[4.0,3.0],[2.0,1.0]]})
    print(c_)
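These reductions also accept an `axis` argument (as used later in this notebook with `axis=1`), which collapses only one dimension instead of the whole tensor. The following plain-Python sketch mirrors what `tf.reduce_sum(b)`, `tf.reduce_sum(b, axis=0)`, `tf.reduce_sum(b, axis=1)` and `tf.argmax(b, axis=1)` would be expected to return for the matrix above. It does not call TensorFlow itself.

```python
b = [[6.0, 5.0], [4.0, 3.0], [2.0, 1.0]]  # same 3x2 matrix as above

total = sum(sum(row) for row in b)                     # all elements
per_column = [sum(col) for col in zip(*b)]             # collapse the rows (axis 0)
per_row = [sum(row) for row in b]                      # collapse the columns (axis 1)
largest_per_row = [row.index(max(row)) for row in b]   # argmax within each row

print(total, per_column, per_row, largest_per_row)
```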

Different optimization strategies, including adaptive learning rate methods, are also implemented in TensorFlow as classes. The main ones to try are:


In [None]:
tf.train.GradientDescentOptimizer
tf.train.AdadeltaOptimizer
tf.train.AdamOptimizer

If you are interested in the differences between these strategies, [this blog post](http://ruder.io/optimizing-gradient-descent/) provides more details.
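To give a flavour of what "adaptive" means, here is a simplified single-parameter sketch of the Adam update rule in plain Python. The function name, the toy objective and the hyperparameter values are assumptions chosen for illustration, and `tf.train.AdamOptimizer` may differ in details such as how epsilon is applied.

```python
import math


def adam_minimize(grad_fn, theta, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=200):
    """Minimize a one-parameter function with the Adam update rule.

    grad_fn returns the gradient at theta. Returns the list of visited values.
    """
    m, v = 0.0, 0.0          # running averages of the gradient and its square
    history = [theta]
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        # The step is divided by the recent gradient magnitude, so early
        # steps have a size of roughly learning_rate whatever the gradient scale.
        theta -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
        history.append(theta)
    return history


# Toy objective (theta - 3)^2 with gradient 2 * (theta - 3).
history = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(history[1], history[-1])
```

Plain gradient descent would instead take a first step proportional to the raw gradient (here 6 times the learning rate), which is why its learning rate usually needs more careful tuning.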

# Training an XOR Function

[XOR](https://en.wikipedia.org/wiki/XOR_gate) is the function that takes two binary values and returns 1 only if one of them is 1 and the other 0, while returning 0 if both of them have the same value.

It can be a complicated function to optimize and cannot be modeled with a linear model. But let's try anyway.

Our dataset consists of all the possible different states that XOR can take:

In [None]:
data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]
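As a quick check of the definition, the same dataset can be generated with Python's built-in XOR operator `^`. The names `generated_x` and `generated_y` are introduced here only for the comparison.

```python
from itertools import product

# Enumerate every pair of binary inputs and label it with Python's XOR operator (^).
generated_x = [[float(a), float(b)] for a, b in product([0, 1], repeat=2)]
generated_y = [float(a ^ b) for a, b in product([0, 1], repeat=2)]

print(generated_x)
print(generated_y)
```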

Now we construct a linear network and optimize it on this dataset, printing out the predictions at each epoch:

In [None]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.float32, [None], name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

y = tf.reduce_sum(tf.layers.dense(x, 1, activation=None), axis=1)

loss = tf.reduce_sum(tf.pow(target - y, 2.0))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]
lr = 0.1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(50):
        result, _ = sess.run([y, train_op], feed_dict={x: data_x, target: data_y, learning_rate: lr})
        if epoch % 10 == 0:
            print("Epoch ", epoch, ": ", "\t".join([str(value) for value in result]))

As you can see, it's not doing very well. Ideally, the predictions should be [0, 1, 1, 0], but in this case they are hovering around 0.5 for every input case.
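The 0.5 plateau is not a training failure: for this dataset, the best linear fit under squared error has zero weights and a bias of 0.5. The plain-Python sketch below illustrates this with full-batch gradient descent on the same loss. It starts from zero parameters rather than TensorFlow's random initialization, and the helper `fit_linear` is hypothetical.

```python
data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]


def fit_linear(xs, ys, learning_rate=0.1, epochs=200):
    """Full-batch gradient descent on sum((y - (w . x + b))^2), starting from zeros."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = [y - (w[0] * x[0] + w[1] * x[1] + b) for x, y in zip(xs, ys)]
        # The gradient of the summed squared error is -2 * sum(error * input).
        w = [w[i] + learning_rate * 2.0 * sum(e * x[i] for e, x in zip(errors, xs))
             for i in range(2)]
        b = b + learning_rate * 2.0 * sum(errors)
    return w, b


w, b = fit_linear(data_x, data_y)
predictions = [w[0] * x[0] + w[1] * x[1] + b for x in data_x]
print([round(p, 3) for p in predictions])
```

No choice of weights can do better, because any straight-line decision surface treats `[0, 1]` and `[1, 0]` as lying between `[0, 0]` and `[1, 1]`.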

In order to improve this architecture, let's add some non-linear layers into our model.

In [None]:
tf.reset_default_graph()
# Set the graph-level seed before creating any ops, so that it affects weight initialization
tf.set_random_seed(20)

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.float32, [None], name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

hidden = tf.layers.dense(x, 5, activation=tf.tanh) # <- non-linear layer
y = tf.reduce_sum(tf.layers.dense(hidden, 1, activation=tf.sigmoid), axis=1) # <- non-linear layer

loss = tf.reduce_sum(tf.pow(target - y, 2.0))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]
lr = 1.0

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(50):
        result, _ = sess.run([y, train_op], feed_dict={x: data_x, target: data_y, learning_rate: lr})
        if epoch % 10 == 0:
            print("Epoch ", epoch, ": ", "\t".join([str(value) for value in result]))

This is much better. The values are much closer to [0, 1, 1, 0] than before, and they will continue improving if we train for longer.

We also had to increase the learning rate for this network. It was still learning with the smaller learning rate, but was converging very slowly. As we discussed in the lectures, the learning rate is a hyperparameter that can vary quite a bit depending on the network architecture and dataset.

# XOR Classification

We can also do classification with TensorFlow. For this, we often use the softmax activation function described above, which predicts the probability for each of the possible classes.

We also have to change the loss function, as squared error is not suitable for classification. The loss function that works best with softmax is [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy). When minimizing cross entropy, we are essentially minimizing the negative log likelihood of the correct class for each datapoint. That's exactly what we want, as the model learns to assign high values for the correct label.
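For a single datapoint, the quantity being minimized can be sketched in plain Python as the negative log of the softmax probability assigned to the correct class. The function below is an illustrative stand-in that follows the math, not the TensorFlow op of the similar name.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Negative log-probability that softmax(logits) assigns to class `label`."""
    # log(softmax(logits)[label]) = logits[label] - log(sum(exp(logits))),
    # computed with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


print(sparse_softmax_cross_entropy([0.0, 0.0], 0))   # undecided model
print(sparse_softmax_cross_entropy([3.0, -3.0], 0))  # confident and correct
print(sparse_softmax_cross_entropy([3.0, -3.0], 1))  # confident and wrong
```

The loss is small when the model is confidently right, moderate when it is undecided, and large when it is confidently wrong, which is what pushes probability mass towards the correct label.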

We can change the XOR example above to perform classification instead. In this case, we are constructing a binary classifier that chooses between the classes 0 and 1. When printing the output, we print the predicted classes, i.e. the ones assigned the highest probability by the network.

In [None]:
tf.reset_default_graph()
# Set the graph-level seed before creating any ops, so that it affects weight initialization
tf.set_random_seed(20)

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.int32, [None], name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

hidden = tf.layers.dense(x, 5, activation=tf.tanh)
output = tf.layers.dense(hidden, 2, activation=None)

probabilities = tf.nn.softmax(output)
predictions = tf.argmax(probabilities, axis=1)
# The loss takes the raw logits, not the softmax probabilities
loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output, labels=target)
loss = tf.reduce_mean(loss_)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_targets = [0, 1, 1, 0]
lr = 1.0

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(50):
        result, _ = sess.run([predictions, train_op], feed_dict={x: data_x, target: data_targets, learning_rate: lr})
        if epoch % 10 == 0:
            print("Epoch ", epoch, ": ", " ".join([str(value) for value in result]))

As you can see, the model starts off with incorrect predictions, but fairly soon learns to return the correct sequence of [0, 1, 1, 0].