# Lab 4 - How to improve your results?

In this lab session we will look at data augmentation and debugging strategies.

## Objectives:

1. Learn how to perform data augmentation in TensorFlow
2. Experiment with different types of data augmentation
3. Learn how to debug with TensorFlow

## Setup (REMINDER)

1. Log in to BC4:

    ```
    ssh <your_UoB_ID>@bc4login.acrc.bris.ac.uk
    ```
    
2. Clone the repository

    ```
    git clone "https://github.com/COMSM0018-Applied-Deep-Learning/labsheets.git" ~/labsheets
    ```

3. Change to the lab 4 directory:

    ```
    cd ~/labsheets/Lab_4_Augment/
    ```
    
4. Make the `go_interactive.sh` and `tensorboard_params.sh` files executable using `chmod`:

    ```
    chmod +x go_interactive.sh tensorboard_params.sh
    ```
   
5. Switch to interactive mode, and note that your prompt changes from the login node to a reserved GPU node:

    ```
    ./go_interactive.sh 
    ```
    
6. Run the following script. It will print two values: `ipnport=XXXXX` and `ipnip=XX.XXX.X.X`

    ```
    ./tensorboard_params.sh
    ```
    
    **Write them down; you will need them to connect to TensorBoard.**

7. Train the model using your new modified file (see Section 1 below):
    
    ```
    python cifar_augment.py
    ```
   
8. Open a **new terminal window**, log in using SSH as in step 1, then run:

    ```
    tensorboard --logdir=logs/ --port=<ipnport>
    ```
    
9. Open a **new terminal window** on your machine and type: 
    
    ```
    ssh -N <your_UoB_ID>@bc4login.acrc.bris.ac.uk -L 6006:<ipnip>:<ipnport>
    ```

10. Open your web browser (use Chrome; Firefox currently has issues with TensorBoard) and go to http://localhost:6006. This should open TensorBoard, where you can navigate through the summaries that we included.



## 1. From tf.nn to tf.layers

**NOW** copy your code from Lab 3, and rename it as `cifar_augment.py`. 

Until now, you have specified your network in full detail, building it from ops defined in the [`tf.nn`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/nn) module. We started with [`tf.nn`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/nn) to show you the nuts and bolts of neural networks, so nothing was hidden. Today we'll rewrite the code using ops from the [`tf.layers`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/layers) module, a higher-level interface that makes it easier to try new architectures. The [`tf.layers`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/layers) ops are defined in much the same way as the layers you wrote in the previous lab, except that they are parameterised so you don't have to repeat the same code over and over again.

We will show you how your previous convolutional layer can now be re-written using tf.layers.

*previously (in labs 1-3):*

```python
def deepnn(x):
    x_image = tf.reshape(x, [-1, FLAGS.img_width, FLAGS.img_height, FLAGS.img_channels])
    with tf.variable_scope('Conv_1'):
        W_conv1 = weight_variable([5, 5, FLAGS.img_channels, 32])
        b_conv1 = bias_variable([32])
        h_conv1 = tf.nn.relu(tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='SAME', name='convolution') + b_conv1)

        # Pooling layer - downsamples by 2X.
        h_pool1 = tf.nn.max_pool(h_conv1, ksize=[1, 2, 2, 1],
                                 strides=[1, 2, 2, 1], padding='SAME', name='pooling')
```

This was followed by your Lab 3 code for batch normalisation. We will replace all of it with the following:

*after*:

```python
xavier_initializer = tf.contrib.layers.xavier_initializer(uniform=True)
def deepnn(x):
    x_image = tf.reshape(x, [-1, FLAGS.img_width, FLAGS.img_height, FLAGS.img_channels])
    conv1 = tf.layers.conv2d(
        inputs=x_image,
        filters=32,
        kernel_size=[5, 5],
        padding='same',
        use_bias=False,
        kernel_initializer=xavier_initializer,
        name='conv1'
    )
    conv1_bn = tf.nn.relu(tf.layers.batch_normalization(conv1))
    pool1 = tf.layers.max_pooling2d(
        inputs=conv1_bn,
        pool_size=[2, 2],
        strides=2,
        name='pool1'
    )
```

**NOW** Change your full architecture to use [`tf.layers`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/layers).

The architecture of the network stays the same as in the last lab: two convolutional layers followed by two fully connected layers, with batch normalisation on the convolutional layers.

You can use [`tf.layers.dense`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/layers/dense) for your fully connected layers.

Debug and test. Your performance should not change.
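
When porting the layers, it can help to work out by hand the shape each layer should produce, in particular the flattened size that feeds the first dense layer. The following is a minimal, framework-free sketch. The helper names and the 64 filters in the second convolution are illustrative assumptions, not values taken from your Lab 3 code. It relies on the standard rules that `'same'` padding yields `ceil(n / stride)` outputs and `'valid'` padding (the default for `tf.layers.max_pooling2d`) yields `ceil((n - k + 1) / stride)`.

```python
import math

def conv_same(size, stride=1):
    """Spatial output size of a convolution with padding='same'."""
    return math.ceil(size / stride)

def pool_valid(size, pool=2, stride=2):
    """Spatial output size of a pooling layer with padding='valid'."""
    return math.ceil((size - pool + 1) / stride)

def trace_shapes(height=32, width=32, filters=(32, 64)):
    """Return (layer name, (height, width, channels)) after each conv/pool stage."""
    shapes = []
    h, w = height, width
    for i, f in enumerate(filters, start=1):
        h, w = conv_same(h), conv_same(w)
        shapes.append(("conv%d" % i, (h, w, f)))
        h, w = pool_valid(h), pool_valid(w)
        shapes.append(("pool%d" % i, (h, w, f)))
    return shapes

shapes = trace_shapes()
for name, shape in shapes:
    print(name, shape)

# Number of features after flattening the last pooling output
last_h, last_w, last_c = shapes[-1][1]
flat_size = last_h * last_w * last_c
print("flattened size:", flat_size)
```

If the shapes reported by your graph differ from a hand calculation like this one, the padding or stride arguments are a good first place to look.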

## 2. Flushing summaries periodically

Train and test summaries are flushed every 120 seconds by default. Decrease this so that new summaries show up in TensorBoard shortly after they are written, rather than up to two minutes later.

**NOW** Set the `flush_secs` kwarg to a reasonable value when constructing the [`tf.summary.FileWriter`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/summary/FileWriter) objects:

```python
summary_writer = tf.summary.FileWriter(run_log_dir + "_train", sess.graph, flush_secs=5)
summary_writer_validation = tf.summary.FileWriter(run_log_dir + "_validate", sess.graph, flush_secs=5)
```

Both improvements will prove useful for your project coursework.

## 3. Data Augmentation

Generally, the more data a CNN (or any deep learning model) has access to, the better the features it learns, and therefore the better it performs. Data augmentation refers to techniques that artificially increase the amount of training data by applying label-preserving transformations to the original training data, i.e. we want to transform CIFAR-10 images in such a way that the data becomes more varied but the object in the image remains the same.

It is typical to implement data augmentation *online*, where each training mini-batch is loaded and mutated stochastically in some way (e.g. rotation, translation, blurring). Recall that we don't stop training after having processed the full dataset once, but when some other stopping criterion is met, so it is possible, even probable, that we will process each training example more than once. If we mutate the inputs stochastically, then each time an example is part of a mini-batch it will be mutated in a different way (e.g. rotated by an angle sampled from a random distribution).
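
The online scheme can be sketched without any framework. The toy example below is an illustration only: images are small nested lists of pixel values, and `random_shift` and `augmented_batches` are hypothetical helpers, with the shift standing in for whichever mutation you choose. The stored dataset never changes, yet every pass over it yields freshly mutated copies.

```python
import random

def random_shift(image, rng, max_shift=1):
    """Translate an image (list of rows) horizontally by a random offset, padding with zeros."""
    shift = rng.randint(-max_shift, max_shift)
    width = len(image[0])
    shifted = []
    for row in image:
        if shift >= 0:
            new_row = [0] * shift + row[:width - shift]
        else:
            new_row = row[-shift:] + [0] * (-shift)
        shifted.append(new_row)
    return shifted

def augmented_batches(dataset, batch_size, rng):
    """Yield mini-batches whose examples are mutated afresh every time they are loaded."""
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        yield [random_shift(image, rng) for image in batch]

# Toy dataset: four 2x3 "images"
dataset = [[[1, 2, 3], [4, 5, 6]] for _ in range(4)]
rng = random.Random(0)

for epoch in range(2):
    for batch in augmented_batches(dataset, batch_size=2, rng=rng):
        print(epoch, batch)
```

Because the random offset is drawn each time an example is loaded, the same image can appear shifted differently in different epochs.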

### Practical 3.1: Data Augmentation with Random Flips

One example of data augmentation which can improve results on the CIFAR-10 dataset is horizontal flips. By randomly flipping the image you are able to add additional training data, without making the object in the image unrecognisable. 

* **Q. Think of problems that are invariant to horizontal flips... then think of problems that are not invariant (i.e. a horizontal flip will change the correct label of your sample).**

Add online data augmentation to your network and retrain to see whether your results improve. Be careful that you **don't** apply the data augmentation during **testing**.

* **Hint**: For the types of data augmentation implemented in TensorFlow, have a look at the [`tf.image`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/image) module.
* **Hint**: Because the stochastic [`tf.image`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/image) ops are defined for 3D tensors, you'll need to map the op over the mini-batch using [`tf.map_fn`](https://www.tensorflow.org/api_docs/python/tf/map_fn) so that it is applied to each training example.
* **Hint**: Inside your computational graph you'll need to do different things depending on whether you're training or testing. This is what [`tf.cond`](https://www.tensorflow.org/api_docs/python/tf/cond) allows you to do.
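
Before wiring this into the graph, it may help to pin down the intended behaviour in plain Python. The sketch below is not TensorFlow code: `flip_left_right`, `maybe_flip` and `preprocess` are hypothetical names, the list comprehension stands in for the per-example mapping that `tf.map_fn` performs, and the `is_training` branch stands in for the choice that `tf.cond` makes inside the graph.

```python
import random

def flip_left_right(image):
    """Mirror an image (list of rows of pixels) about its vertical axis."""
    return [row[::-1] for row in image]

def maybe_flip(image, rng):
    """Flip the image with probability 0.5, otherwise return it unchanged."""
    return flip_left_right(image) if rng.random() < 0.5 else image

def preprocess(batch, is_training, rng):
    """Augment every example while training; pass the batch through untouched when testing."""
    if is_training:
        return [maybe_flip(image, rng) for image in batch]
    return batch

batch = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
rng = random.Random(0)

print(preprocess(batch, is_training=True, rng=rng))   # some examples may be mirrored
print(preprocess(batch, is_training=False, rng=rng))  # always identical to the input
```

Note that the coin is tossed once per example rather than once per batch, which is why the op has to be mapped over the mini-batch.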

### Practical 3.2: Your own data augmentation

Think carefully about another form of data augmentation which you believe would improve the capability of your network to recognise images from the CIFAR-10 dataset. Some types of data augmentation may have a negative impact on your network. For instance, flipping the image vertically is not useful as the test data will not contain any upside down boats or cats etc.

Implement your chosen data augmentation method to see if it does improve your results.
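
As one possible starting point (an illustration, not the expected answer), the sketch below shows a brightness jitter in plain Python. It assumes pixel values scaled to the range [0, 1]; the function name and the `max_delta` default are arbitrary choices made for this example. The `tf.image` module offers a comparable op, `tf.image.random_brightness`.

```python
import random

def random_brightness(image, rng, max_delta=0.2):
    """Add one random offset to every pixel and clip the result to [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return [[min(1.0, max(0.0, pixel + delta)) for pixel in row] for row in image]

image = [[0.0, 0.5], [0.9, 1.0]]
rng = random.Random(0)
print(random_brightness(image, rng))
```

Whatever transformation you choose, check that it stays label-preserving: a jitter large enough to wash out the object would hurt rather than help.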

**Hint:** If you're stuck for ideas revisit the practical slides or talk to the TAs.

Train your model and save the relevant logs, as well as the code you've written.

## 4. Debugging Strategies

In Labs 1 and 2 you have seen how to use tensorflow to view how the accuracy and loss change over time. This is one of the most valuable tools for gaining insight into what your network is doing and debugging it.

One thing to bear in mind when debugging tensorflow is that you *can't* easily use a normal python `print` statement. This is because tensorflow works by first building a computational graph and then evaluating the graph. You cannot print a value until it has been evaluated in the graph. To add a print operation to a tensorflow graph you use `tf.Print`. For example:

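```python
import tensorflow as tf

x = tf.get_variable('x', [10, 5])
# tf.Print is an identity op that prints the listed tensors when evaluated
x = tf.Print(x, [x])

sess = tf.InteractiveSession()
# Variables must be initialised before they can be evaluated
sess.run(tf.global_variables_initializer())
sess.run(x)
```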
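
To see why an ordinary `print` does not help, it may be useful to mimic the build-then-evaluate model in plain Python. The toy classes below are hypothetical and are not part of TensorFlow; `PrintNode` plays the role of `tf.Print` by passing its input through unchanged and printing it only when the graph is evaluated.

```python
class Constant:
    """A graph node holding a fixed value."""
    def __init__(self, value):
        self.value = value

    def evaluate(self):
        return self.value

class Add:
    """A graph node that adds the results of two other nodes when evaluated."""
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self):
        return self.left.evaluate() + self.right.evaluate()

class PrintNode:
    """Identity node that prints its input's value as a side effect of evaluation."""
    def __init__(self, node, log):
        self.node, self.log = node, log

    def evaluate(self):
        value = self.node.evaluate()
        self.log.append(value)
        print("value:", value)
        return value

log = []
total = PrintNode(Add(Constant(2), Constant(3)), log)

# Building the graph computed nothing, so printing the node shows an object, not a number.
print(total)

# Only evaluation produces (and prints) the value.
result = total.evaluate()
```

The same split applies in TensorFlow: building the graph corresponds to the constructor calls, and `sess.run` corresponds to `evaluate()`.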

However, it is generally better practice, and faster, to use a debugger. We will go through how to use the TensorFlow debugger in the next exercise.


### Practical 4.1: Using the TensorFlow debugger

Another useful tool is the [TensorFlow debugger](https://www.tensorflow.org/programmers_guide/debugger), `tfdbg`.

To use it you'll first need to access an interactive node on bluecrystal. To do this run the command:

```bash
./go_interactive.sh
```

Try running the debug example that ships with TensorFlow:

```bash
module add libs/tensorflow/1.2
python -m tensorflow.python.debug.examples.debug_mnist
```

This code also trains a classifier for the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). You'll notice that, unlike the previous examples we have trained, the accuracy decreases after step 1 and does not increase again. This is most likely due to a bad numeric value such as `inf` or `nan` being generated in the training graph. `tfdbg` lets you apply filters to tensor values to track down such problems, and since `inf` and `nan` are so common, a filter for them already exists.

To debug the example with tfdbg run:

```bash
python -m tensorflow.python.debug.examples.debug_mnist --debug
```

You will see the following screen:

![tfdbg start](img/tfdbg_screenshot_run_start.png)


You can use **PageUp** / **PageDown** / **Home** / **End** to navigate. If you lack those keys, use **Fn + Up** / **Fn + Down** / **Fn + Right** / **Fn + Left**.

Run `help` to list the available commands, or refer to the [`tfdbg` cheatsheet](https://www.tensorflow.org/programmers_guide/debugger#tfdbg_cli_frequently-used_commands):

```bash
tfdbg> help
```

Run the filter `has_inf_or_nan` to determine which tensors contain either `Inf` or `NaN` values:

```bash
tfdbg> run -f has_inf_or_nan
```

You should now see this screen:

![has_inf_or_nan](img/tfdbg_screenshot_run_end_inf_nan.png)

The tensors that match the filter are displayed in chronological order. The tensor at the top, `cross_entropy/Log:0`, is the one in which `NaN` or `Inf` first appeared, so it is a good place to start debugging.
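
Conceptually, the filter checks every value in a tensor for non-finite entries. A rough plain-Python equivalent is sketched below. It is a simplification for illustration and not how `tfdbg` implements the filter internally: tensors are represented as flat lists, and apart from `cross_entropy/Log:0` the tensor names and all the values are made up.

```python
import math

def has_inf_or_nan(values):
    """Return True if any value in the flat list is +Inf, -Inf or NaN."""
    return any(math.isinf(v) or math.isnan(v) for v in values)

# Hypothetical tensor dumps, listed in the order they were produced
tensors = {
    "hidden/Relu:0": [0.0, 1.5, 3.2],
    "cross_entropy/Log:0": [-0.7, float("-inf"), -2.3],
    "cross_entropy/mul:0": [0.0, float("nan"), 0.0],
}

# Keep production order so the first offender appears at the top
flagged = [name for name, values in tensors.items() if has_inf_or_nan(values)]
print(flagged)
```

Reading the flagged list from the top mirrors the approach described above: the earliest tensor containing a bad value is usually closest to the cause.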

To view the value of a tensor click on the underlined tensor name e.g. `cross_entropy/Log:0` or enter the equivalent command:

```bash
tfdbg> pt cross_entropy/Log:0
```

To perform a regex search of tensor values run:

```bash
tfdbg> /(inf|nan)
```

OK, so there are `-Inf` values present in the tensor. How do we determine where they originate from? Let's find out how this tensor was constructed: use `node_info --traceback cross_entropy/Log` to determine which op output this tensor.

* **Q:  Which line in the stack trace corresponds to user's code that defines the cross entropy?**

## Preparing Lab 4 Portfolio

You should by now have the following files, which you can zip under the name `Lab_4_<username>.zip`.

Note that **for this lab** we are asking you to submit a copy of your modified code with data augmentation.

From your logs, include only the TensorBoard summaries and remove the checkpoints (`model.ckpt-*` files).

```
 Lab_4_<username>.zip
 |----------cifar_augment.py
 |----------logs/
```

Store this zip safely. You will be asked to upload your portfolio for all labs to SAFE after Week 10; check SAFE for deadline details.

## Further Resources

* [`tfdbg` tutorial](https://www.tensorflow.org/programmers_guide/debugger)
* [Using `tfdbg` for batch run scripts](https://www.tensorflow.org/programmers_guide/debugger#offline_debugging_of_remotely-running_sessions)