# Overview

In this notebook we will cover the basic concepts for training convolutional neural networks in TensorFlow. Specifically, we will build a network to detect the presence or absence of brain tumors from multimodal MR images. The data used in this tutorial comes from the MICCAI Brain Tumor Segmentation Challenge (BraTS). More information about the BraTS Challenge can be found here: http://braintumorsegmentation.org/

For the basics of TensorFlow operation and neural networks, consider reviewing the first part of this series, **01 - Introduction to Data, Tensorflow and Deep Learning**.

### Importing modules

To train our simple classifier implementation, we will require two open-source libraries (`tensorflow` and `numpy`), the standard-library modules `os` and `sys`, and the custom modules created for this tutorial (`net`, `data`). The additional `Markdown` class from `IPython.display` is used to generate dynamic links to Terminal and TensorBoard below.

In [1]:
import os, sys
import tensorflow as tf, numpy as np

sys.path.append('%s/dl_tutorial/code' % os.environ['HOME'])
import net, data

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
from IPython.display import Markdown as md
ip = open('../public-ipv4').read() if os.path.exists('../public-ipv4') else '0.0.0.0'

### Hyperparameter variables

Hyperparameters are parameters whose values are set before the learning process begins and which in turn influence and direct the training process. The three below (number of iterations, batch size and learning rate) are the most important hyperparameters to vary in this experiment. We will cover them in more detail as they are encountered in the code below.

In [2]:
iterations = 200
batch_size = 16
learning_rate = 1e-3
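
As a quick sanity check on what these values imply, the short sketch below computes how many image/label pairs are drawn over the whole run. It assumes only that each iteration consumes exactly one mini-batch; `samples_seen` is a name introduced here for illustration.

```python
iterations = 200
batch_size = 16

# Each iteration consumes one mini-batch, so the total number of
# image/label pairs drawn during training is the product of the two.
samples_seen = iterations * batch_size
print(samples_seen)  # 3200
```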

### Preparation

Here we perform some basic preparatory steps including: 

* setting the data root directory 

> If **AWS** or **Google**: `/data/brats/npy`

> If **local**: `../data` (or whichever local directory data is stored in)

* making the output directory for saving training checkpoints and logs
* defining an `ops` dictionary to save operations
* resetting any existing graph (`tf.reset_default_graph()`)

In [3]:
# root = '/data/brats/npy'
root = '../data'
output_dir = '%s/dl_tutorial/exp_classifier' % os.environ['HOME']
os.makedirs(output_dir, exist_ok=True)
ops = {}
tf.reset_default_graph()

### Data batch

A data **mini-batch** is the collection of image and label pairs used to perform one update of our network parameters. The more images and labels we use for each update, the more likely that update is to reflect the underlying population data. The trade-off is that each network update requires more computation time. A good starting batch size for the image matrices in our dataset is 16 or 32.
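
The effect of batch size on how well a batch reflects the population can be illustrated with a small NumPy simulation. This is only a sketch on made-up data: the `population` array of binary labels (roughly 30% positive) and the helper `batch_mean_spread()` are hypothetical and are not part of the tutorial's `net` or `data` modules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population": 10,000 binary labels, about 30% positive.
population = (rng.random(10_000) < 0.3).astype(np.float32)

def batch_mean_spread(batch_size, trials=1000):
    """Standard deviation of the positive fraction across random mini-batches."""
    means = [rng.choice(population, size=batch_size, replace=False).mean()
             for _ in range(trials)]
    return float(np.std(means))

spread_small = batch_mean_spread(4)
spread_large = batch_mean_spread(64)

# Larger batches give a less noisy estimate of the population statistic.
print(spread_small, spread_large)
```

The spread of the batch statistic shrinks as the batch grows, which is the same reason a larger batch tends to give a more representative parameter update.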

To implement batching, we will use a prepared template method `net.init_batch()` to load a number of slices simultaneously:

In [8]:
batch = net.init_batch(batch_size, root=root, one_hot=False)
print(batch['train']['X'])
print(batch['train']['y'])

Tensor("IteratorGetNext_6:0", shape=(?, 240, 240, 4), dtype=float32)
Tensor("IteratorGetNext_6:1", shape=(?, 240, 240), dtype=uint8)


### Placeholders

A TensorFlow **placeholder** is an entry point for us to feed actual data values into the model. We must define this **placeholder** and all subsequent downstream operations performed on it before ever passing data into the model.

The placeholder `X` will serve as the entry point for introducing image data into the graph. The placeholder `y` will serve as the entry point for the correct target label representing presence (1) or absence (0) of tumor. The placeholder `mode` indicates whether the graph is being executed for training or for validation.

In [14]:
X = tf.placeholder(tf.float32, shape=[None, 240, 240, 4], name='X')
y = tf.placeholder(tf.int32, shape=[None], name='y')
mode = tf.placeholder(tf.bool, name='mode')
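
The `None` in the first position of each shape means that the batch dimension is left open, so the same graph accepts batches of any size. The pure-Python sketch below mimics that shape contract; `matches_placeholder()` is a hypothetical helper written for illustration and is not a TensorFlow function.

```python
def matches_placeholder(shape, spec):
    """Return True if a concrete array shape is compatible with a placeholder spec.

    A `None` entry in `spec` accepts any size along that axis.
    """
    if len(shape) != len(spec):
        return False
    return all(s is None or s == d for d, s in zip(shape, spec))

X_spec = [None, 240, 240, 4]   # same shape as placeholder `X`
y_spec = [None]                # same shape as placeholder `y`

print(matches_placeholder((16, 240, 240, 4), X_spec))  # True
print(matches_placeholder((1, 240, 240, 4), X_spec))   # True
print(matches_placeholder((16, 240, 240), X_spec))     # False
print(matches_placeholder((16,), y_spec))              # True
```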

### Network

In this example we will be using the architecture defined by `net.create_classifier()`. This classifier alternates convolutions, ReLU non-linearities and max-pooling to collapse the `240 x 240 x 4` input into a single 2-channel output: one number representing the likelihood of no tumor and the other representing the likelihood of tumor.

A simplified diagram of the architecture is as follows:
```
DEFINITION
----------
BLOCK = [ CONV --> RELU --> CONV --> RELU --> POOL ]

LAYER NAME | OUTPUT SHAPE
------------------------------
BLOCK-01   | 120 x 120 x 8
BLOCK-02   | 060 x 060 x 16 
BLOCK-03   | 030 x 030 x 32
BLOCK-04   | 015 x 015 x 64
BLOCK-05   | 007 x 007 x 96
BLOCK-06   | 003 x 003 x 128
FLATTEN    | 001 x 001 x 1152
FC         | 001 x 001 x 2
```

Note that this diagram is simplified. This particular algorithm also implements minor additions such as batch normalization and L2 regularization, which are beyond the scope of this tutorial. See the source code for further information.
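
The output shapes listed in the diagram can be reproduced with a few lines of arithmetic. The sketch below assumes that the convolutions preserve spatial size and that each `POOL` step halves the height and width, rounding down (which is consistent with `015` becoming `007`). The actual layer settings live in `net.create_classifier()` and are not shown here.

```python
side = 240                           # input height and width
filters = [8, 16, 32, 64, 96, 128]   # channel counts taken from the diagram

shapes = []
for n in filters:
    side = side // 2                 # assumed: each POOL halves the size, rounding down
    shapes.append((side, side, n))

# Flattening multiplies the three dimensions of the last block together.
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

for i, s in enumerate(shapes, start=1):
    print('BLOCK-%02d | %03d x %03d x %d' % (i, s[0], s[1], s[2]))
print('FLATTEN  | %d' % flat)
```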

To implement this architecture, simply call the `net.create_classifier()` method:

In [15]:
pred = net.create_classifier(X, training=mode)

### Loss and error

Next, based on these prediction logits, we need to give the algorithm feedback on whether or not the network is correct. To do so, we will use the softmax function, which computes the exponential of each input value divided by the sum of the exponentials of all input values. For a classification model, the resulting values lie in the range (0, 1), sum to 1, and together represent a probability distribution over the label classes.
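
A minimal NumPy sketch of the softmax function on a few made-up two-class logits is shown below. The `softmax()` helper and the example values are illustrative only; subtracting the row maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([[2.0, 0.0],    # favors class 0 (no tumor)
                   [0.0, 0.0],    # undecided
                   [-1.0, 3.0]])  # favors class 1 (tumor)

probs = softmax(logits)
print(probs)               # one probability per class, per row
print(probs.sum(axis=-1))  # each row sums to 1
```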

To implement the softmax-based loss, we will use a prepared template function `net.loss_sce()` that serves as a wrapper around the underlying `tf.nn.sparse_softmax_cross_entropy_with_logits()` method described in the previous tutorial.

In addition to the loss function, we want to gauge how accurate (%) the predictions are in a human-interpretable way. To do so, we will keep track of the `top-k` accuracy of our model, which in our simple two-class prediction simplifies to a `top-1` score (i.e. `k=1`).

In [16]:
losses = {}
losses['sce'] = net.loss_sce(pred, y)
losses['topk'] = net.error_topk(pred, y)
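
To make the two quantities concrete, here is a NumPy sketch of a sparse softmax cross-entropy loss and a top-1 accuracy on made-up logits. It does not reproduce the internals of `net.loss_sce()` or `net.error_topk()`, which are not shown in this notebook; in particular, averaging the per-example loss over the batch is an assumption.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))
    # Pick the log-probability assigned to the correct class of each row.
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

def top1_accuracy(logits, labels):
    """Fraction of rows whose highest logit matches the label."""
    return float(np.mean(np.argmax(logits, axis=-1) == labels))

logits = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 3.0], [1.5, -0.5]])
labels = np.array([0, 1, 1, 1])   # the last example is misclassified

loss = sparse_softmax_cross_entropy(logits, labels)
accuracy = top1_accuracy(logits, labels)
print(loss, accuracy)
```

The loss is dominated by the confidently wrong last example, while the accuracy only records that three of four predictions are correct.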

### Optimizer

An optimizer is a strategy used to update the network parameters through backpropagation, guided by the quantitative loss function. We will be using the Adam optimizer for our tutorials, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. For further reading, see the following link for the original paper: https://arxiv.org/abs/1412.6980
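
For intuition, the Adam update rule can be written out for a single scalar parameter. The sketch below follows the update equations from the paper with its default `beta1`, `beta2` and `eps` values, applied to a toy quadratic; `adam_minimize()` is a hypothetical helper for illustration and is not how `tf.train.AdamOptimizer` is implemented internally.

```python
import math

def adam_minimize(grad_fn, x0, learning_rate=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a scalar function with Adam, given its gradient function."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x

def grad(x):
    """Gradient of the toy objective f(x) = (x - 3)^2."""
    return 2 * (x - 3)

first_step = adam_minimize(grad, x0=0.0, learning_rate=0.1, steps=1)
x_final = adam_minimize(grad, x0=0.0, learning_rate=0.1, steps=500)
print(first_step, x_final)
```

Because the update divides by the root of the squared-gradient average, the very first step has a size close to the learning rate itself, regardless of how large the raw gradient is.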

A key hyperparameter here is the optimizer **learning rate**. The learning rate scales the magnitude of the update applied to each parameter in one iteration. A higher learning rate results in a correspondingly larger, more aggressive "step" towards a minimum of the loss function. However, a learning rate that is too high may cause the network to overshoot the minimum or, even worse, may lead to network instability. A good initial learning rate to use in most experiments, without other guiding heuristics, is `1e-3`, which is what we will set our initial `learning_rate` hyperparameter to.
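
The overshoot behavior is easiest to see with plain gradient descent (not Adam) on the one-dimensional function `f(x) = x^2`. The sketch below is a toy illustration; the learning rates are chosen to show the regimes and are not recommendations for the network in this notebook.

```python
def gradient_descent(learning_rate, steps=20, x0=1.0):
    """Plain gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

# Small rates converge slowly, moderate rates converge quickly,
# and a rate above 1.0 overshoots further on every step and diverges.
for lr in (0.01, 0.1, 0.9, 1.1):
    print(lr, gradient_descent(lr))
```

With a rate of `0.9` the iterate jumps across the minimum on every step but still shrinks towards it, whereas at `1.1` each jump lands further away than the last.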

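Note that the `tf.control_dependencies()` method here ensures that the operations collected in `update_ops` (from `tf.GraphKeys.UPDATE_OPS`, which typically holds updates such as the batch normalization moving averages) are executed before the optimizer node.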

In [17]:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.control_dependencies(update_ops):
    global_step = tf.train.get_or_create_global_step()
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    ops['train'] = optimizer.minimize(losses['sce'], global_step=global_step)

### Collections

After creating the placeholders and predictions, we will add them to named Graph collections so that they can be retrieved easily during inference, after training is complete.

In [18]:
tf.add_to_collection("inputs", X)
tf.add_to_collection("inputs", mode)
tf.add_to_collection("outputs", pred)

### TensorBoard

TensorBoard is a useful utility that can be used to track various statistics during the network training process. Here we set up operations to create log files that can be loaded using the TensorBoard interface.

In [19]:
tf.summary.histogram('logits', pred)
tf.summary.scalar('sce', losses['sce'])
tf.summary.scalar('topk', losses['topk'])
ops['summary'] = tf.summary.merge_all()

# Network training

Now that the graph, loss function and optimizer have been configured, it is time to run the training algorithm. To begin, we create a new `tf.Session` object and initialize the basic objects that enable saving intermediate checkpoints and writing log data. In addition, we initialize `coord` and `threads` objects to handle asynchronous loading of input data into batches:
```
sess, saver, writer_train, writer_valid = net.init_session(sess, output_dir)
```

To perform the actual training, we will construct a loop to repeat parameter updates a total of `iterations` times. For each update, we will start by loading the data into batches `X_batch` and `y_batch`:
```
X_batch, y_batch = sess.run([batch['train']['X'], batch['train']['y']])
```

We will then collapse each label mask into either 0 or 1, based on whether or not any nonzero mask value is present anywhere in that label:
```
y_batch = np.max(y_batch > 0, axis=(1,2)).astype('float32')
```
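
To see what this line does, the sketch below applies it to two tiny made-up masks. The `4 x 4` size and the mask values are hypothetical; the real labels are `240 x 240`.

```python
import numpy as np

# Two hypothetical 4 x 4 label masks: the first has no tumor voxels,
# the second has a small labeled region (any value > 0 counts as tumor).
masks = np.zeros((2, 4, 4), dtype=np.uint8)
masks[1, 1:3, 1:3] = 2

# `masks > 0` is a boolean array; taking the max over the two spatial
# axes asks "is any voxel positive?" separately for each mask.
labels = np.max(masks > 0, axis=(1, 2)).astype('float32')
print(labels)  # [0. 1.]
```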

We then call `sess.run()` to run one iteration of the training process. Specifically, we will request the network to output `losses` (the `sce` loss and the `topk` accuracy), `summary` (used for creating logs) and `global_step` (reflecting the total number of iterations). Note that the `ops['train']` operation corresponding to the optimizer node is also run, but it produces no useful output, so its return value is discarded into `_`.
```
_, metric, summary, step = sess.run(
    [ops['train'], losses, ops['summary'], global_step],
    feed_dict={
        X: X_batch,
        y: y_batch,
        mode: True})
```

Finally, every 10 updates, we will ask the network to run against a separate validation cohort (i.e. completely separate from the training dataset) to track the overall generalization of the algorithm's learned representation:
```
if not i % 10:
    ...
```
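
The condition `not i % 10` is true whenever `i` is a multiple of 10, including `i = 0`. As a small sketch of which iterations trigger validation, assuming the `iterations = 200` setting used in this notebook:

```python
iterations = 200

# `i % 10` is 0 (falsy) for multiples of 10, so `not i % 10` selects them.
valid_steps = [i for i in range(iterations) if not i % 10]
print(len(valid_steps), valid_steps[:3], valid_steps[-1])  # 20 [0, 10, 20] 190
```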

This entire training process can be executed by running the following cell:

In [20]:
with tf.Session() as sess:
    
    sess, saver, writer_train, writer_valid = net.init_session(sess, output_dir)
    print('\n\nTraining Statistics:\n')

    try:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        metrics = net.init_metrics(losses)

        for i in range(iterations):
            
            # --- Run a single iteration of training
            X_batch, y_batch = sess.run([batch['train']['X'], batch['train']['y']])
            y_batch = np.max(y_batch > 0, axis=(1,2)).astype('float32')
            _, metric, summary, step  = sess.run(
                [ops['train'], losses, ops['summary'], global_step],
                feed_dict={
                    X: X_batch, 
                    y: y_batch, 
                    mode: True})

            writer_train.add_summary(summary, step)
            metrics = net.update_ema(metrics, metric, mode='train', iteration=i)
            net.print_status(metrics, step, metric_names=['sce', 'topk'])

            # --- Every 10th iteration run a single validation batch
            if not i % 10:

                X_batch, y_batch = sess.run([batch['valid']['X'], batch['valid']['y']])
                y_batch = np.max(y_batch > 0, axis=(1,2)).astype('float32')
                metric, summary = sess.run(
                    [losses, ops['summary']],
                    feed_dict={
                        X: X_batch, 
                        y: y_batch, 
                        mode: False})

                writer_valid.add_summary(summary, step)
                metrics = net.update_ema(metrics, metric, mode='valid', iteration=i)
                net.print_status(metrics, step, metric_names=['sce', 'topk'])

    finally:
        # Stop the input threads, then save the checkpoint once,
        # whether or not training ran to completion
        coord.request_stop()
        coord.join(threads)
        saver.save(sess, '%s/checkpoint/model.ckpy' % output_dir)



Training Statistics:

0000200 | TRAIN - sce : 0.1203 - topk : 0.9552  | VALID - sce : 1.7631 - topk : 0.5217 

In the output above you will see updates of the training status, including the number of iterations and the metrics on both the training and validation data.

# TensorBoard 

### Overview

TensorBoard is a suite of web applications for inspecting and understanding your TensorFlow runs and graphs. To use TensorBoard, you must embed specialized `tf.summary.*` operations into your graph, which produce serialized protobufs that track various training statistics over time. The supported summary ops include:

* tf.summary.scalar
* tf.summary.image
* tf.summary.audio
* tf.summary.text
* tf.summary.histogram

During the training process, a specialized `tf.summary.FileWriter` class is used to take the summary data created by `tf.summary.*` operations and write it to a specified directory, known as the `logdir`. This was implemented in the following line of code above:
```
writer_valid.add_summary(summary, step)
```

### Launching TensorBoard

To launch TensorBoard, run the following command:

In [None]:
!pkill -9 tensorboard
get_ipython().system_raw(
    'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'
    .format(output_dir)
)


Note, this is equivalent to the following Terminal command:
```
tensorboard --logdir={output_dir} --host 0.0.0.0 --port 6006
```

To launch the TensorBoard session, open up a new tab in your browser and type in the following address pattern:
```
[IP-address]:6006
```

Here `[IP-address]` is the IP address of your instance, of the form `xxx.xxx.xxx.xxx`. If you are on AWS (or a local copy), it should be the same prefix as your Jupyter notebook in the address bar currently at the top of your screen. For example, if the IP address is `34.215.158.68`, then the URL entered into the web browser is `34.215.158.68:6006`.

Alternatively, the following lines of code can be used for convenience:

For **AWS instance** use the following Markdown text:

In [None]:
md('**Tensorboard URL** (right-click > open in new tab): [http://%s:6006](http://%s:6006)' % (ip, ip))

For **Google Colaboratory** use the following ngrok commands:

In [None]:
!pkill -9 ngrok
get_ipython().system_raw('./ngrok http 6006 &')
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"


For more information about TensorBoard usage, see link: https://github.com/tensorflow/tensorboard.

### Final thoughts

Feel free to continue training the algorithm until it converges at a reasonable accuracy. Once complete, turn off the kernel (top menu > `Kernel` > `Shutdown`; you can keep this tab open in your browser to retrain later) so that its resources can be used in the next notebook. You are now ready to move on to **03 - Inference with a Classifier** to use the newly trained network on data.