# Batch Normalization – Practice

Batch normalization is most useful when building deep neural networks. To demonstrate this, we'll create a convolutional neural network with a deep stack of convolutional layers (the code below builds 19 of them with `range(1, 20)`), followed by a fully connected layer. We'll use it to classify handwritten digits in the MNIST dataset, which should be familiar to you by now.

This is **not** a good network for classifying MNIST digits. You could create a _much_ simpler network and get _better_ results. However, to give you hands-on experience with batch normalization, we had to make an example that was:
1. Complicated enough that training would benefit from batch normalization.
2. Simple enough that it would train quickly, since this is meant to be a short exercise just to give you some practice adding batch normalization.
3. Simple enough that the architecture would be easy to understand without additional resources.

This notebook includes two versions of the network that you can edit. The first uses higher level functions from the `tf.layers` package. The second is the same network, but uses only lower level functions in the `tf.nn` package.

1. [Batch Normalization with `tf.layers.batch_normalization`](#example_1)
2. [Batch Normalization with `tf.nn.batch_normalization`](#example_2)

The following cell loads TensorFlow, downloads the MNIST dataset if necessary, and loads it into an object named `mnist`. You'll need to run this cell before running anything else in the notebook.

In [1]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True, reshape=False)

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


# Batch Normalization using `tf.layers.batch_normalization`<a id="example_1"></a>

This version of the network uses `tf.layers` for almost everything, and expects you to implement batch normalization using [`tf.layers.batch_normalization`](https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization).

We'll use the following function to create fully connected layers in our network. We'll create them with the specified number of neurons and a ReLU activation function.

This version of the function does not include batch normalization.

In [2]:
"""
DO NOT MODIFY THIS CELL
"""
def fully_connected(prev_layer, num_units):
    """
    Create a fully connected layer with the given layer as input and the given number of neurons.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param num_units: int
        The size of the layer. That is, the number of units, nodes, or neurons.
    :returns Tensor
        A new fully connected layer
    """
    layer = tf.layers.dense(prev_layer, num_units, activation=tf.nn.relu)
    return layer

We'll use the following function to create convolutional layers in our network. They are very basic: we always use a 3x3 kernel and ReLU activation functions, with strides of 2x2 on layers whose depth is a multiple of 3 and strides of 1x1 on all other layers. The number of feature maps is four times the layer's depth. We aren't bothering with pooling layers at all in this network.

This version of the function does not include batch normalization.

In [3]:
"""
DO NOT MODIFY THIS CELL
"""
def conv_layer(prev_layer, layer_depth):
    """
    Create a convolutional layer with the given layer as input.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param layer_depth: int
        We'll set the strides and number of feature maps based on the layer's depth in the network.
        This is *not* a good way to make a CNN, but it helps us create this example with very little code.
    :returns Tensor
        A new convolutional layer
    """
    strides = 2 if layer_depth % 3 == 0 else 1
    conv_layer = tf.layers.conv2d(prev_layer, layer_depth*4, 3, strides, 'same', activation=tf.nn.relu)
    return conv_layer
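
To see what the stride and filter rules in `conv_layer` produce across the whole stack, here is a minimal pure-Python sketch that traces each layer's output shape. It assumes 28x28 inputs, the `range(1, 20)` loop used later in `train`, and the usual `'same'` padding rule that the output size is `ceil(input_size / stride)`. The helper name `conv_stack_shapes` is hypothetical and not part of the notebook.

```python
import math

def conv_stack_shapes(input_size=28, first_depth=1, last_depth=19):
    """Trace (layer_depth, strides, filters, spatial_size) for each layer in the stack."""
    rows = []
    size = input_size
    for layer_depth in range(first_depth, last_depth + 1):
        strides = 2 if layer_depth % 3 == 0 else 1   # same rule as conv_layer
        filters = layer_depth * 4                    # same rule as conv_layer
        size = math.ceil(size / strides)             # 'same' padding: ceil(in / stride)
        rows.append((layer_depth, strides, filters, size))
    return rows

shapes = conv_stack_shapes()
for layer_depth, strides, filters, size in shapes:
    print("layer {:>2}: strides={} filters={:>2} output={}x{}x{}".format(
        layer_depth, strides, filters, size, size, filters))
```

Under these assumptions the spatial size shrinks from 28 to 1 by the fifteenth layer, so the flattened input to the fully connected layer has only as many values as the last layer has feature maps.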

**Run the following cell**, along with the earlier cells (to load the dataset and define the necessary functions). 

This cell builds the network **without** batch normalization, then trains it on the MNIST dataset. It displays loss and accuracy data periodically while training.

In [4]:
"""
DO NOT MODIFY THIS CELL
"""
def train(num_batches, batch_size, learning_rate):
    # Build placeholders for the input samples and labels 
    inputs = tf.placeholder(tf.float32, [None, 28, 28, 1])
    labels = tf.placeholder(tf.float32, [None, 10])
    
    # Feed the inputs into a series of convolutional layers (range(1, 20) builds 19 of them)
    layer = inputs
    for layer_i in range(1, 20):
        layer = conv_layer(layer, layer_i)

    # Flatten the output from the convolutional layers 
    orig_shape = layer.get_shape().as_list()
    layer = tf.reshape(layer, shape=[-1, orig_shape[1] * orig_shape[2] * orig_shape[3]])

    # Add one fully connected layer
    layer = fully_connected(layer, 100)

    # Create the output layer with 1 node for each digit class
    logits = tf.layers.dense(layer, 10)
    
    # Define loss and training operations
    model_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=labels))
    train_opt = tf.train.AdamOptimizer(learning_rate).minimize(model_loss)
    
    # Create operations to test accuracy
    correct_prediction = tf.equal(tf.argmax(logits,1), tf.argmax(labels,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    # Train and test the network
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for batch_i in range(num_batches):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)

            # train this batch
            sess.run(train_opt, {inputs: batch_xs, labels: batch_ys})
            
            # Periodically check the validation or training loss and accuracy
            if batch_i % 100 == 0:
                loss, acc = sess.run([model_loss, accuracy], {inputs: mnist.validation.images,
                                                              labels: mnist.validation.labels})
                print('Batch: {:>2}: Validation loss: {:>3.5f}, Validation accuracy: {:>3.5f}'.format(batch_i, loss, acc))
            elif batch_i % 25 == 0:
                loss, acc = sess.run([model_loss, accuracy], {inputs: batch_xs, labels: batch_ys})
                print('Batch: {:>2}: Training loss: {:>3.5f}, Training accuracy: {:>3.5f}'.format(batch_i, loss, acc))

        # At the end, score the final accuracy for both the validation and test sets
        acc = sess.run(accuracy, {inputs: mnist.validation.images,
                                  labels: mnist.validation.labels})
        print('Final validation accuracy: {:>3.5f}'.format(acc))
        acc = sess.run(accuracy, {inputs: mnist.test.images,
                                  labels: mnist.test.labels})
        print('Final test accuracy: {:>3.5f}'.format(acc))
        
        # Score the first 100 test images individually. This won't work if batch normalization isn't implemented correctly.
        correct = 0
        for i in range(100):
            correct += sess.run(accuracy,feed_dict={inputs: [mnist.test.images[i]],
                                                    labels: [mnist.test.labels[i]]})

        print("Accuracy on 100 samples:", correct/100)


num_batches = 800
batch_size = 64
learning_rate = 0.002

tf.reset_default_graph()
with tf.Graph().as_default():
    train(num_batches, batch_size, learning_rate)

Batch:  0: Validation loss: 0.69060, Validation accuracy: 0.11000
Batch: 25: Training loss: 0.35825, Training accuracy: 0.09375
Batch: 50: Training loss: 0.32565, Training accuracy: 0.09375
Batch: 75: Training loss: 0.32712, Training accuracy: 0.06250
Batch: 100: Validation loss: 0.32510, Validation accuracy: 0.11260
Batch: 125: Training loss: 0.32646, Training accuracy: 0.03125
Batch: 150: Training loss: 0.32563, Training accuracy: 0.10938
Batch: 175: Training loss: 0.32313, Training accuracy: 0.09375
Batch: 200: Validation loss: 0.32612, Validation accuracy: 0.09860
Batch: 225: Training loss: 0.32526, Training accuracy: 0.09375
Batch: 250: Training loss: 0.32680, Training accuracy: 0.07812
Batch: 275: Training loss: 0.32330, Training accuracy: 0.12500
Batch: 300: Validation loss: 0.32580, Validation accuracy: 0.09240
Batch: 325: Training loss: 0.32323, Training accuracy: 0.15625
Batch: 350: Training loss: 0.32320, Training accuracy: 0.12500
Batch: 375: Training loss: 0.32707, Trainin

With this many layers, it's going to take a lot of iterations for this network to learn. By the time you're done training these 800 batches, your final test and validation accuracies probably won't be much better than 10%. (It will be different each time, but will most likely be less than 15%.)

Using batch normalization, you'll be able to train this same network to over 90% in that same number of batches.


# Add batch normalization

We've copied the previous three cells to get you started. **Edit these cells** to add batch normalization to the network. For this exercise, you should use [`tf.layers.batch_normalization`](https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization) to handle most of the math, but you'll need to make a few other changes to your network to integrate batch normalization. You may want to refer back to the lesson notebook to remind yourself of important things, like how your graph operations need to know whether you are performing training or inference.

If you get stuck, you can check out the `Batch_Normalization_Solutions` notebook to see how we did things.

**TODO:** Modify `fully_connected` to add batch normalization to the fully connected layers it creates. Feel free to change the function's parameters if it helps.

In [18]:
# AE: added parameter <is_training>, which should be a tf.bool placeholder
def fully_connected(prev_layer, num_units, is_training):
    """
    Create a fully connected layer with the given layer as input and the given number of neurons.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param num_units: int
        The size of the layer. That is, the number of units, nodes, or neurons.
    :param is_training: bool or Tensor
        Whether the network is training (use batch statistics) or inferring (use population statistics).
    :returns Tensor
        A new fully connected layer
    """
    #layer = tf.layers.dense(prev_layer, num_units, activation=tf.nn.relu)
    # AE: We start with an ordinary dense (fully connected) layer and apply batch normalisation BEFORE the
    # AE: activation function, so we create the dense layer with activation=None and apply ReLU at the end.
    # AE: The bias is dropped (use_bias=False) because batch normalisation's beta already provides an offset.
    layer = tf.layers.dense(prev_layer, num_units, activation=None, use_bias=False)
    layer = tf.layers.batch_normalization(inputs=layer, training=is_training)
    layer = tf.nn.relu(layer)
    return layer
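
The arithmetic that the batch normalization layer performs on each training batch can be sketched in plain Python. This is a minimal illustration under stated assumptions, not TensorFlow's implementation: the helper name `batch_norm` is hypothetical, statistics are computed per feature over the batch dimension, and the variance is the biased (divide by N) estimate.

```python
import math

def batch_norm(batch, gamma, beta, epsilon=0.001):
    """Normalize each feature (column) of `batch` over the batch dimension."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n  # biased estimate
        for i in range(n):
            x_hat = (batch[i][j] - mean) / math.sqrt(variance + epsilon)
            out[i][j] = gamma[j] * x_hat + beta[j]           # learned scale and shift
    return out

# Two features on very different scales end up on nearly the same scale.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print(row)
```

With `gamma` at one and `beta` at zero the layer only standardizes its input; the ReLU is then applied to the result, which is why the dense layer itself is created with `activation=None`.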

**TODO:** Modify `conv_layer` to add batch normalization to the convolutional layers it creates. Feel free to change the function's parameters if it helps.

In [19]:
# AE: added parameter <is_training>, which should be a tf.bool placeholder
def conv_layer(prev_layer, layer_depth, is_training):
    """
    Create a convolutional layer with the given layer as input.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param layer_depth: int
        We'll set the strides and number of feature maps based on the layer's depth in the network.
        This is *not* a good way to make a CNN, but it helps us create this example with very little code.
    :param is_training: bool or Tensor
        Whether the network is training (use batch statistics) or inferring (use population statistics).
    :returns Tensor
        A new convolutional layer
    """
    strides = 2 if layer_depth % 3 == 0 else 1
    #conv_layer = tf.layers.conv2d(prev_layer, layer_depth*4, 3, strides, 'same', activation=tf.nn.relu)
    # AE: Very similar to the above: we add batch normalisation to the convolutional layer BEFORE the
    # AE: activation, then apply the activation.
    conv_layer = tf.layers.conv2d(prev_layer, layer_depth*4, 3, strides, 'same', activation=None, use_bias=False)
    conv_layer = tf.layers.batch_normalization(inputs=conv_layer, training=is_training)
    conv_layer = tf.nn.relu(conv_layer)
    return conv_layer

**TODO:** Edit the `train` function to support batch normalization. You'll need to make sure the network knows whether or not it is training, and you'll need to make sure it updates and uses its population statistics correctly.

In [21]:
def train(num_batches, batch_size, learning_rate):
    # Build placeholders for the input samples and labels 
    inputs = tf.placeholder(tf.float32, [None, 28, 28, 1])
    labels = tf.placeholder(tf.float32, [None, 10])
    
    # AE: When we do inference (predictions), we want to use the population statistics that were
    # AE: calculated during training across all batches. For that to happen, the batch normalisation
    # AE: layers need to know whether we are inferring or training.
    # AE: That's why we pass this placeholder to the batch normalisation function; we can easily
    # AE: change its value between session runs through the feed_dict parameter.
    # AE: This placeholder is passed to the "fully_connected" and "conv_layer" functions, where
    # AE: it is used when constructing the batch normalisation layers.
    is_training = tf.placeholder(tf.bool, name='is_training')

    # Feed the inputs into a series of convolutional layers (range(1, 20) builds 19 of them)
    layer = inputs
    for layer_i in range(1, 20):
        layer = conv_layer(layer, layer_i, is_training=is_training) # AE: added the is_training=is_training parameter

    # Flatten the output from the convolutional layers 
    orig_shape = layer.get_shape().as_list()
    layer = tf.reshape(layer, shape=[-1, orig_shape[1] * orig_shape[2] * orig_shape[3]])

    # Add one fully connected layer
    layer = fully_connected(layer, 100, is_training=is_training) # AE: added the is_training=is_training parameter

    # Create the output layer with 1 node for each digit class
    logits = tf.layers.dense(layer, 10)
    
    # Define loss and training operations
    model_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=labels))
    #train_opt = tf.train.AdamOptimizer(learning_rate).minimize(model_loss)

    # AE: With batch normalisation, each training step also has to update the population mean and variance.
    # AE: The layers normalise with the statistics of the current batch, but keep moving averages of them so
    # AE: that predictions can use estimates for the whole population rather than just the last batch.
    # AE: tf.layers.batch_normalization adds those update ops to tf.GraphKeys.UPDATE_OPS but does not run them
    # AE: automatically, so we make the optimizer depend on them: every training step then runs the updates too.
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
        train_opt = tf.train.AdamOptimizer(learning_rate).minimize(model_loss)
    
    # Create operations to test accuracy
    correct_prediction = tf.equal(tf.argmax(logits,1), tf.argmax(labels,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    # Train and test the network
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for batch_i in range(num_batches):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)

            # train this batch
            # AE: added the is_training: True parameter
            sess.run(train_opt, {inputs: batch_xs, labels: batch_ys, is_training: True})
            
            # Periodically check the validation or training loss and accuracy
            if batch_i % 100 == 0:
                # AE: added the is_training: False parameter
                loss, acc = sess.run([model_loss, accuracy], {inputs: mnist.validation.images,
                                                              labels: mnist.validation.labels,
                                                              is_training: False})
                print('Batch: {:>2}: Validation loss: {:>3.5f}, Validation accuracy: {:>3.5f}'.format(batch_i, loss, acc))
            elif batch_i % 25 == 0:
                # AE: added the is_training: True parameter. Here we only check the accuracy on the current batch,
                # AE: so the population statistics are not needed, hence the value of True. This may give the
                # AE: training accuracy "an unfair" advantage: with population statistics we would probably get
                # AE: worse results early on and better results as training progresses. This way we measure
                # AE: exactly how well the network fits the batch it has just trained on.
                loss, acc = sess.run([model_loss, accuracy], {inputs: batch_xs, labels: batch_ys, is_training: True})
                print('Batch: {:>2}: Training loss: {:>3.5f}, Training accuracy: {:>3.5f}'.format(batch_i, loss, acc))

        # At the end, score the final accuracy for both the validation and test sets
        # AE: added the is_training: False parameter
        acc = sess.run(accuracy, {inputs: mnist.validation.images,
                                  labels: mnist.validation.labels,
                                  is_training: False})
        print('Final validation accuracy: {:>3.5f}'.format(acc))
        # AE: added the is_training: False parameter
        acc = sess.run(accuracy, {inputs: mnist.test.images,
                                  labels: mnist.test.labels,
                                  is_training: False})
        print('Final test accuracy: {:>3.5f}'.format(acc))
        
        # Score the first 100 test images individually. This won't work if batch normalization isn't implemented correctly.
        correct = 0
        for i in range(100):
            # AE: added the is_training: False parameter
            correct += sess.run(accuracy,feed_dict={inputs: [mnist.test.images[i]],
                                                    labels: [mnist.test.labels[i]],
                                                    is_training: False})

        print("Accuracy on 100 samples:", correct/100)


num_batches = 800
batch_size = 64
learning_rate = 0.002

tf.reset_default_graph()
with tf.Graph().as_default():
    train(num_batches, batch_size, learning_rate)

Batch:  0: Validation loss: 0.69097, Validation accuracy: 0.08680
Batch: 25: Training loss: 0.33733, Training accuracy: 0.28125
Batch: 50: Training loss: 0.26345, Training accuracy: 0.46875
Batch: 75: Training loss: 0.21008, Training accuracy: 0.53125
Batch: 100: Validation loss: 0.37897, Validation accuracy: 0.08680
Batch: 125: Training loss: 0.08809, Training accuracy: 0.89062
Batch: 150: Training loss: 0.06718, Training accuracy: 0.92188
Batch: 175: Training loss: 0.05523, Training accuracy: 0.95312
Batch: 200: Validation loss: 0.36617, Validation accuracy: 0.09220
Batch: 225: Training loss: 0.07558, Training accuracy: 0.90625
Batch: 250: Training loss: 0.03752, Training accuracy: 0.98438
Batch: 275: Training loss: 0.06072, Training accuracy: 0.92188
Batch: 300: Validation loss: 0.21706, Validation accuracy: 0.59800
Batch: 325: Training loss: 0.02993, Training accuracy: 0.98438
Batch: 350: Training loss: 0.01218, Training accuracy: 1.00000
Batch: 375: Training loss: 0.03881, Trainin

With batch normalization, you should now get an accuracy over 90%. Notice also the last line of the output: `Accuracy on 100 samples`. If this value is low while everything else looks good, that means you did not implement batch normalization correctly. Specifically, it means you either did not calculate the population mean and variance while training, or you are not using those values during inference.
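
To see why scoring samples one at a time exposes a broken implementation, consider what happens when a single value is normalized with its own batch statistics. The sketch below is a pure-Python illustration for one feature of a per-feature normalization (as in the fully connected layer); the population values and the helper names `normalize` and `batch_moments` are hypothetical.

```python
import math

def normalize(values, mean, variance, epsilon=0.001):
    """Standardize values with the given mean and variance."""
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]

def batch_moments(values):
    """Mean and biased variance of the values in one batch."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

# Hypothetical population statistics accumulated during training.
population_mean, population_variance = 5.0, 4.0

single_a = [9.0]   # one activation from one sample
single_b = [1.0]   # the same activation from a different sample

# Wrong at inference: a batch of one is its own mean, so everything becomes 0.
wrong_a = normalize(single_a, *batch_moments(single_a))
wrong_b = normalize(single_b, *batch_moments(single_b))

# Right at inference: population statistics keep the two samples distinguishable.
right_a = normalize(single_a, population_mean, population_variance)
right_b = normalize(single_b, population_mean, population_variance)

print(wrong_a, wrong_b)
print(right_a, right_b)
```

With batch statistics, both samples collapse to the same normalized value, so the layers after it cannot tell them apart. Large validation and test batches can hide this mistake, which is why the per-sample check is a useful signal.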

# Batch Normalization using `tf.nn.batch_normalization`<a id="example_2"></a>

Most of the time you will be able to use higher level functions exclusively, but sometimes you may want to work at a lower level. For example, if you ever want to implement a new feature – something new enough that TensorFlow does not already include a high-level implementation of it, like batch normalization in an LSTM – then you may need to know these sorts of things.

This version of the network uses `tf.nn` for almost everything, and expects you to implement batch normalization using [`tf.nn.batch_normalization`](https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization).

**Optional TODO:** You can run the next three cells before you edit them just to see how the network performs without batch normalization. However, the results should be pretty much the same as you saw with the previous example before you added batch normalization. 

**TODO:** Modify `fully_connected` to add batch normalization to the fully connected layers it creates. Feel free to change the function's parameters if it helps.

**Note:** For convenience, we continue to use `tf.layers.dense` for the `fully_connected` layer. By this point in the class, you should have no problem replacing that with matrix operations between the `prev_layer` and explicit weights and biases variables.

In [60]:
def fully_connected(prev_layer, num_units, is_training):
    """
    Create a fully connected layer with the given layer as input and the given number of neurons.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param num_units: int
        The size of the layer. That is, the number of units, nodes, or neurons.
    :param is_training: Tensor
        A tf.bool placeholder that says whether the network is training or inferring.
    :returns Tensor
        A new fully connected layer
    """
    #layer = tf.layers.dense(prev_layer, num_units, activation=tf.nn.relu)
    layer = tf.layers.dense(prev_layer, num_units, use_bias=False, activation=None)

    # AE: To do batch normalisation ourselves, we call tf.nn.batch_normalization and pass it the mean, variance,
    # AE: beta and gamma to use. All four are stored in tf.Variable objects: each gets an initial value here and
    # AE: is then modified during training. While training, the mean and variance come from tf.nn.moments.
    # AE: Each of them is a vector with one value per unit, so its size differs from layer to layer. That is fine,
    # AE: because every call to this function creates its own set of variables in the graph.
    # AE:
    # AE: The population mean and variance must not be trainable; we update them ourselves from the batch
    # AE: statistics according to the formulas given in the "lesson" jupyter notebook.
    mean = tf.Variable(tf.zeros(shape=[num_units], dtype=tf.float32), trainable=False)
    variance = tf.Variable(tf.ones(shape=[num_units], dtype=tf.float32), trainable=False)

    # AE: Beta and gamma have to be trainable. They start as zeroes and ones so that they have no effect at first,
    # AE: because the formula is: xi * gamma + beta (see the "Lesson" jupyter notebook).
    beta = tf.Variable(tf.zeros(shape=[num_units], dtype=tf.float32))
    gamma = tf.Variable(tf.ones(shape=[num_units], dtype=tf.float32))

    # AE: Epsilon is a small constant that avoids dividing by zero. It also inflates the batch variance slightly,
    # AE: which helps because the batch variance tends to underestimate the population variance, so a slightly
    # AE: larger value is usually closer to the population variance.
    epsilon = 0.001

    # AE: If we're training, we calculate the batch mean and variance and update the population mean
    # AE: and variance with those values.
    # AE:
    # AE: If we're inferring, we use the population mean and variance for the calculations and do not
    # AE: update them.

    def when_training():
        batch_mean, batch_variance = tf.nn.moments(x=layer, axes=[0], keep_dims=False)
        
        # AE: We calculate the updated population mean and variance here and assign them. We need to make sure
        # AE: that this calculation is done before normalisation, so we call tf.assign here and create an
        # AE: artificial dependency on the references returned from tf.assign (see below).
        # AE: 
        # AE: When updating, we use a decay value so that each newly calculated mean and variance only
        # AE: affects the population values gradually and doesn't make them "jump".
        decay = 0.99
        tmp_new_mean = tf.assign(mean, mean * decay + batch_mean * (1 - decay))
        tmp_new_variance = tf.assign(variance, variance * decay + batch_variance * (1 - decay))
        
        # AE: We need to make sure that the batch_normalisation is calculated AFTER we've updated population mean and
        # AE: variance, so we need to make a dependency here on the temporary variables that we created earlier.
        with tf.control_dependencies([tmp_new_mean, tmp_new_variance]):
            # AE: When training, we want to normalise batches with the current batch mean and variance,
            # AE: hence we use batch_variance and batch_mean
            layer_local = tf.nn.batch_normalization(layer, batch_mean, batch_variance, beta, gamma, epsilon)
            return layer_local

    def when_infering():
        # AE: When inferring, we want to use the population mean and variance,
        # AE: hence we use the mean and variance variables, which hold the population values.
        layer_local = tf.nn.batch_normalization(layer, mean, variance, beta, gamma, epsilon)
        return layer_local
    
    layer = tf.cond(is_training, when_training, when_infering)
    
    # AE: At the end of course we want the activation function on the layer that we will be using in any case.
    layer = tf.nn.relu(layer)
    
    return layer
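
The effect of the `decay = 0.99` update rule used above can be checked with a few lines of plain Python. This sketch makes a simplifying assumption that is not true in real training: every batch reports exactly the same mean (`true_mean`, a hypothetical value). The helper name `update_population` is also hypothetical.

```python
def update_population(population, batch_value, decay=0.99):
    """One exponential-moving-average step, as in the tf.assign calls above."""
    return population * decay + batch_value * (1 - decay)

population_mean = 0.0   # same initial value as the `mean` variable
true_mean = 10.0        # hypothetical: pretend every batch mean is exactly 10
history = []
for step in range(1, 801):
    population_mean = update_population(population_mean, true_mean)
    if step in (100, 200, 400, 800):
        history.append((step, population_mean))

for step, value in history:
    print("after {:>3} batches: population mean estimate = {:.4f}".format(step, value))
```

Under this assumption the estimate has covered only about 63% of the distance to the true value after 100 batches, because `0.99 ** 100` is roughly 0.37. A lag of this kind is one plausible reason why metrics computed with population statistics can trail the training metrics early in a run.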

**TODO:** Modify `conv_layer` to add batch normalization to the convolutional layers it creates. Feel free to change the function's parameters if it helps.

**Note:** Unlike in the previous example that used `tf.layers`, adding batch normalization to these convolutional layers _does_ require some slight differences to what you did in `fully_connected`. 

In [61]:
def conv_layer(prev_layer, layer_depth, is_training):
    """
    Create a convolutional layer with the given layer as input.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param layer_depth: int
        We'll set the strides and number of feature maps based on the layer's depth in the network.
        This is *not* a good way to make a CNN, but it helps us create this example with very little code.
    :returns Tensor
        A new convolutional layer
    """
    strides = 2 if layer_depth % 3 == 0 else 1

    in_channels = prev_layer.get_shape().as_list()[3]
    out_channels = layer_depth*4
    
    # AE: "weights" is the "filter" parameter of tf.nn.conv2d. Its shape is [3, 3, in_channels, out_channels]:
    # AE: 3 x 3 is the filter size, out_channels is the number of filters in this layer (it grows with the
    # AE: layer's depth), and in_channels is the last dimension of the 4D input tensor. For the first layer
    # AE: that is the number of colour channels in the picture (1 for the greyscale MNIST images, 3 for RGB);
    # AE: for every later layer it is the number of feature maps produced by the previous layer.
    # AE:
    # AE: Here is what's in the TF documentation:
    # (...) Given an input tensor of shape [batch, in_height, in_width, in_channels] and a filter / kernel tensor of shape 
    # [filter_height, filter_width, in_channels, out_channels], this op performs the following: (...)
    weights = tf.Variable(
        tf.truncated_normal([3, 3, in_channels, out_channels], stddev=0.05))

    bias = tf.Variable(tf.zeros(out_channels))

    #conv_layer = tf.nn.conv2d(prev_layer, weights, strides=[1,strides, strides, 1], padding='SAME')
    c_layer = tf.nn.conv2d(input=prev_layer, filter=weights, strides=[1,strides, strides, 1], padding='SAME')
    
    # AE: Same as in "fully_connected" function (see comments above) we will need several variables to implement batch
    # AE: normalisation:
    #image_height = prev_layer.get_shape().as_list()[1]
    #image_width = prev_layer.get_shape().as_list()[2]

    # AE: A significant difference here is that we don't calculate mean and variance per node in the layer, but
    # AE: per filter (feature map). We have <out_channels> filters, each with dimensions [3, 3, in_channels],
    # AE: so each vector holds only <out_channels> values.
    mean = tf.Variable(tf.zeros(shape=[out_channels], dtype=tf.float32), trainable=False)
    variance = tf.Variable(tf.ones(shape=[out_channels], dtype=tf.float32), trainable=False)
    beta = tf.Variable(tf.zeros(shape=[out_channels], dtype=tf.float32))
    gamma = tf.Variable(tf.ones(shape=[out_channels], dtype=tf.float32))
    epsilon = 0.001
    
    # AE: Just like in the "fully_connected" function above, we perform batch normalisation in two scenarios:
    # AE: when training and when inferring. Again we use a decay value and set up a dependency on the
    # AE: population mean and variance updates.

    def when_training():
        # AE: An important point here. Since we want to calculate the mean and variance (i.e. the moments)
        # AE: per feature map and not per node, we need to pass a special value for the axes
        # AE: parameter, as recommended in the tf.nn.moments documentation:
        # (...) for so-called "global normalization", used with convolutional filters with shape [batch, height, width, depth], 
        # pass axes=[0, 1, 2]. (...)
        # AE:
        # AE: The result no longer has the same dimensionality as the input. The input is a 4D tensor; with
        # AE: keep_dims=False the mean and variance are 1D tensors with one value per feature map.
        batch_mean, batch_variance = tf.nn.moments(x=c_layer, axes=[0, 1, 2], keep_dims=False)

        decay = 0.99
        tmp_new_mean = tf.assign(mean, mean * decay + batch_mean * (1 - decay))
        tmp_new_variance = tf.assign(variance, variance * decay + batch_variance * (1 - decay))
        
        with tf.control_dependencies([tmp_new_mean, tmp_new_variance]):
            c_layer_local = tf.nn.batch_normalization(c_layer, batch_mean, batch_variance, beta, gamma, epsilon)
            return c_layer_local

    def when_infering():
        c_layer_local = tf.nn.batch_normalization(c_layer, mean, variance, beta, gamma, epsilon)
        return c_layer_local

    c_layer = tf.cond(is_training, when_training, when_infering)
    
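    # AE: Note: beta already acts as a learned per-channel offset, so this extra bias is redundant
    # AE: (harmless, but it could be removed).
    c_layer = tf.nn.bias_add(c_layer, bias)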
    c_layer = tf.nn.relu(c_layer)

    return c_layer
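
What `axes=[0, 1, 2]` means for a `[batch, height, width, depth]` tensor can be shown with nested lists in plain Python: every value that belongs to the same channel is pooled, across all images and all spatial positions, into one mean and one variance. This is a minimal sketch with a hypothetical helper name (`channel_moments`) and made-up data, not the TensorFlow implementation.

```python
def channel_moments(tensor):
    """Mean and biased variance per channel of a [batch][height][width][channel] nested list."""
    num_channels = len(tensor[0][0][0])
    means, variances = [], []
    for c in range(num_channels):
        # Pool channel c over the batch, height and width axes (axes 0, 1 and 2).
        values = [pixel[c] for image in tensor for row in image for pixel in row]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        variances.append(variance)
    return means, variances

# Two 2x2 images with 2 channels: channel 0 is constant, channel 1 varies.
tensor = [
    [[[1.0, 0.0], [1.0, 2.0]],
     [[1.0, 4.0], [1.0, 6.0]]],
    [[[1.0, 8.0], [1.0, 10.0]],
     [[1.0, 12.0], [1.0, 14.0]]],
]
means, variances = channel_moments(tensor)
print(means, variances)
```

The result has one entry per channel regardless of the batch size or image size, which matches the `[out_channels]` shape of the `mean`, `variance`, `beta` and `gamma` variables in the cell above.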

**TODO:** Edit the `train` function to support batch normalization. You'll need to make sure the network knows whether or not it is training.

In [62]:
def train(num_batches, batch_size, learning_rate):
    # Build placeholders for the input samples and labels 
    inputs = tf.placeholder(tf.float32, [None, 28, 28, 1])
    labels = tf.placeholder(tf.float32, [None, 10])
    
    # AE: again same as with the previous train function, we need to know if we're training or not:
    is_training = tf.placeholder(tf.bool, name='is_training')
    
    # Feed the inputs into a series of convolutional layers (range(1, 20) builds 19 of them)
    layer = inputs
    for layer_i in range(1, 20):
        layer = conv_layer(layer, layer_i, is_training)

    # Flatten the output from the convolutional layers 
    orig_shape = layer.get_shape().as_list()
    layer = tf.reshape(layer, shape=[-1, orig_shape[1] * orig_shape[2] * orig_shape[3]])

    # Add one fully connected layer
    layer = fully_connected(layer, 100, is_training)

    # Create the output layer with 1 node for each digit class
    logits = tf.layers.dense(layer, 10)
    
    # Define loss and training operations
    model_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=labels))
    train_opt = tf.train.AdamOptimizer(learning_rate).minimize(model_loss)
    
    # Create operations to test accuracy
    correct_prediction = tf.equal(tf.argmax(logits,1), tf.argmax(labels,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    # Train and test the network
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for batch_i in range(num_batches):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)

            # train this batch
            #sess.run(train_opt, {inputs: batch_xs, labels: batch_ys})
            sess.run(train_opt, {inputs: batch_xs, labels: batch_ys, is_training: True})
            
            # Periodically check the validation or training loss and accuracy
            if batch_i % 100 == 0:
                #loss, acc = sess.run([model_loss, accuracy], {inputs: mnist.validation.images, labels: mnist.validation.labels})
                loss, acc = sess.run([model_loss, accuracy], {inputs: mnist.validation.images,
                                                              labels: mnist.validation.labels,
                                                              is_training: False}) # AE: added this line
                print('Batch: {:>2}: Validation loss: {:>3.5f}, Validation accuracy: {:>3.5f}'.format(batch_i, loss, acc))
            elif batch_i % 25 == 0:
                #loss, acc = sess.run([model_loss, accuracy], {inputs: batch_xs, labels: batch_ys})
                loss, acc = sess.run([model_loss, accuracy], {inputs: batch_xs, labels: batch_ys, is_training: False})
                print('Batch: {:>2}: Training loss: {:>3.5f}, Training accuracy: {:>3.5f}'.format(batch_i, loss, acc))

        # At the end, score the final accuracy for both the validation and test sets
        
        #acc = sess.run(accuracy, {inputs: mnist.validation.images, labels: mnist.validation.labels})
        acc = sess.run(accuracy, {inputs: mnist.validation.images,
                                  labels: mnist.validation.labels,
                                  is_training: False})
        print('Final validation accuracy: {:>3.5f}'.format(acc))
        #acc = sess.run(accuracy, {inputs: mnist.test.images, labels: mnist.test.labels})
        acc = sess.run(accuracy, {inputs: mnist.test.images,
                                  labels: mnist.test.labels,
                                  is_training: False})
        print('Final test accuracy: {:>3.5f}'.format(acc))
        
        # Score the first 100 test images individually. This won't work if batch normalization isn't implemented correctly.
        correct = 0
        for i in range(100):
            #correct += sess.run(accuracy,feed_dict={inputs: [mnist.test.images[i]], labels: [mnist.test.labels[i]]})
            correct += sess.run(accuracy,feed_dict={inputs: [mnist.test.images[i]],
                                                    labels: [mnist.test.labels[i]],
                                                    is_training: False})

        print("Accuracy on 100 samples:", correct/100)


num_batches = 800
batch_size = 64
learning_rate = 0.002

tf.reset_default_graph()
with tf.Graph().as_default():
    train(num_batches, batch_size, learning_rate)

Batch:  0: Validation loss: 0.69117, Validation accuracy: 0.10700
Batch: 25: Training loss: 0.59189, Training accuracy: 0.10938
Batch: 50: Training loss: 0.47506, Training accuracy: 0.06250
Batch: 75: Training loss: 0.40561, Training accuracy: 0.03125
Batch: 100: Validation loss: 0.36587, Validation accuracy: 0.08680
Batch: 125: Training loss: 0.35462, Training accuracy: 0.06250
Batch: 150: Training loss: 0.35443, Training accuracy: 0.04688
Batch: 175: Training loss: 0.36261, Training accuracy: 0.03125
Batch: 200: Validation loss: 0.36426, Validation accuracy: 0.09860
Batch: 225: Training loss: 0.37292, Training accuracy: 0.09375
Batch: 250: Training loss: 0.34525, Training accuracy: 0.06250
Batch: 275: Training loss: 0.40170, Training accuracy: 0.15625
Batch: 300: Validation loss: 0.39395, Validation accuracy: 0.13620
Batch: 325: Training loss: 0.41931, Training accuracy: 0.12500
Batch: 350: Training loss: 0.39284, Training accuracy: 0.21875
Batch: 375: Training loss: 0.40014, Trainin

Once again, the model with batch normalization should reach an accuracy over 90%. Many details can go wrong when implementing batch normalization at this low level, so if you got it working, great job! If not, don't worry: compare your code with the `Batch_Normalization_Solutions` notebook to see what went wrong.
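
One way to see why scoring test images one at a time exposes an incorrect implementation is the per-feature case, as in a fully connected layer. The NumPy sketch below is illustrative only: the `batch_norm` helper, the sample values, and the population statistics are all made up. With a batch of one, the batch mean equals the sample and the batch variance is zero, so normalizing with batch statistics erases the input.

```python
import numpy as np

def batch_norm(x, mean, variance, epsilon=1e-3):
    """Plain normalization without scale and shift (hypothetical helper)."""
    return (x - mean) / np.sqrt(variance + epsilon)

# A "batch" holding a single example with 3 features
sample = np.array([[0.5, 2.0, -1.0]])

# Wrong at inference: the sample's own statistics (mean == sample, variance == 0)
own = batch_norm(sample, sample.mean(axis=0), sample.var(axis=0))

# Intended at inference: population statistics collected during training
# (these values are invented for the example)
pop_mean = np.array([0.0, 1.0, 0.0])
pop_variance = np.array([1.0, 4.0, 1.0])
pop = batch_norm(sample, pop_mean, pop_variance)

print(own)  # every feature collapses to zero
print(pop)  # the features keep their relative information
```

For convolutional layers normalized over axes `[0, 1, 2]`, a single image still contributes many spatial positions per feature map, so the output does not collapse entirely, but the statistics then describe one image rather than the training population.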