Batch Normalization and Visualization in TensorFlow with TensorBoard
=============

Work in Progress. Planning to use the higher level code in tf.slim to implement the models
------------

Using the notMNIST dataset and starter code from Udacity ud730 as a starting point.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

First reload the data we generated in `1_notmnist.ipynb`.

In [2]:
pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)
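
The 1-hot encoding in `reformat` relies on NumPy broadcasting, which is easy to misread. Here is a minimal sketch on a toy label vector; the three labels and `num_classes = 4` are made up for illustration and are not part of the notebook's data.

```python
import numpy as np

num_classes = 4                      # hypothetical; the notebook uses num_labels = 10
labels = np.array([0, 2, 3])         # toy integer class labels

# labels[:, None] has shape (3, 1) and np.arange(num_classes) has shape (4,).
# Comparing them broadcasts to a (3, 4) boolean matrix with exactly one True per row.
one_hot = (np.arange(num_classes) == labels[:, None]).astype(np.float32)

print(one_hot)
```

Taking `argmax` along axis 1 recovers the original integer labels, which is what the accuracy computation later in the notebook does with `tf.argmax(y, 1)`.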


---
Define a model to experiment with: a simple 10-layer fully connected net with 500 neurons in each hidden layer and ReLU activations. This is probably not a good architecture for the notMNIST problem, but it was chosen because it should demonstrate the effectiveness of batch normalization. It also serves as a sanity check on the implementation of batch normalization. The reference material is the Stanford class cs231n on convolutional neural nets: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture6.pdf
In those lectures, they show that the gradients vanish in a deep net without proper initialization. They give examples of different initializations and show how the activations either saturate or go to zero.
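
The effect can be reproduced without TensorFlow. The sketch below is a NumPy approximation, not the notebook's model. It pushes one random batch through nine ReLU layers of 500 units and reports the standard deviation of the last hidden activations. The function name, the random stand-in inputs, the He-style scaling, and the simplified batch normalization (no learned scale or shift, no moving averages) are all assumptions made for illustration.

```python
import numpy as np

def final_activation_std(weight_std_fn, use_batch_norm, n_hidden=9, width=500,
                         n_inputs=784, batch=256, eps=1e-3, seed=0):
    """Push a random batch through a ReLU stack; return the std of the last hidden layer."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, n_inputs))        # stand-in for a batch of inputs
    fan_in = n_inputs
    for _ in range(n_hidden):
        w = rng.standard_normal((fan_in, width)) * weight_std_fn(fan_in)
        z = a @ w                                     # pre-activations (biases left at zero)
        if use_batch_norm:
            # Normalize each unit over the batch (no learned scale or shift)
            z = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
        a = np.maximum(z, 0.0)                        # ReLU
        fan_in = width
    return a.std()

bad_init = lambda fan_in: 0.01                    # same stddev as the notebook's "bad" initializer
he_init = lambda fan_in: np.sqrt(2.0 / fan_in)    # assumption: He-style scaling for ReLU

std_bad = final_activation_std(bad_init, use_batch_norm=False)
std_he = final_activation_std(he_init, use_batch_norm=False)
std_bad_bn = final_activation_std(bad_init, use_batch_norm=True)

print("stddev=0.01, no BN :", std_bad)
print("He init, no BN     :", std_he)
print("stddev=0.01, BN    :", std_bad_bn)
```

With the small fixed standard deviation and no normalization, the activations shrink by a roughly constant factor per layer and are effectively zero by the last hidden layer. Either a fan-in-aware initializer or per-layer normalization keeps them at a usable scale.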


---

### Background on how to use batch_norm()
#### Using `updates_collections`

Another post from @squada suggests using `updates_collections=None`:
```python
import tensorflow as tf

slim = tf.contrib.slim
def model(data, is_training=False, reuse=None, scope='my_model'):
  # Define a variable scope to contain all the variables of your model.
  # The post passed bare `data` as the third argument, which did not work for me;
  # `values` is expected to be a list of tensors.
  with tf.variable_scope(scope, 'model', [data], reuse=reuse):
    # Configure arguments of fully_connected layers
    with slim.arg_scope([slim.fully_connected],
                        activation_fn=tf.nn.relu,
                        normalizer_fn=slim.batch_norm):
      # Configure arguments of batch_norm layers
      with slim.arg_scope([slim.batch_norm],
                          decay=0.9,  # Adjust decay to the number of iterations
                          updates_collections=None,  # Make sure updates happen automatically
                          is_training=is_training):  # Switch behavior from training to non-training
        net = slim.fully_connected(data, 100, scope='fc1')
        net = slim.fully_connected(net, 200, scope='fc2')
        # ... (further layers elided in the original post)
        # Don't use activation_fn nor batch_norm in the last layer
        net = slim.fully_connected(net, 10, activation_fn=None, normalizer_fn=None, scope='fc10')
        return net
```
 
A working example of batch_norm() that uses `updates_collections=None` is here: [working examples](http://stackoverflow.com/documentation/tensorflow/7909/using-batch-normalization/25676/a-full-working-example-of-2-layer-neural-network-with-batch-normalization-mnist#t=20170508203945378031)

Another example, using `updates_collections=ops.GraphKeys.UPDATE_OPS`, is explained in [A Gentle...](http://ruishu.io/2016/12/27/batchnorm/), with code on GitHub at [RuiShu](https://github.com/RuiShu/micro-projects/blob/master/tf-batchnorm-guide/batchnorm_guide.ipynb)

There is a speed penalty for using `updates_collections=None` in deep networks. When it is set to `None`, the moving-average updates are computed in place, rather than being collected so they can be run together later.


Finally, the latest docs show how to use `updates_collections=ops.GraphKeys.UPDATE_OPS`:

Note: When `is_training` is True, the moving_mean and moving_variance need to be
updated. By default the update_ops are placed in `tf.GraphKeys.UPDATE_OPS`, so
they need to be added as a dependency to the `train_op`, for example:

```python
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) 
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
```

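
To make the role of `moving_mean` and `moving_variance` concrete, here is a NumPy sketch of what a batch norm layer does in the two modes. It is not slim's implementation. The class name `BatchNormSketch` is hypothetical, the learned scale and shift are omitted, and the synthetic data (mean 5, standard deviation 2) is made up for illustration.

```python
import numpy as np

class BatchNormSketch:
    """Per-feature batch normalization without the learned scale and shift."""

    def __init__(self, num_features, decay=0.9, eps=1e-3):
        self.decay = decay
        self.eps = eps
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)

    def __call__(self, x, is_training):
        if is_training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # This plays the role of the update op: without it the moving statistics never change
            self.moving_mean = self.decay * self.moving_mean + (1 - self.decay) * mean
            self.moving_var = self.decay * self.moving_var + (1 - self.decay) * var
        else:
            # Inference uses the accumulated statistics, not those of the current batch
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = BatchNormSketch(num_features=3)
for _ in range(200):                                   # "training" steps
    batch = rng.normal(loc=5.0, scale=2.0, size=(128, 3))
    bn(batch, is_training=True)

test_batch = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
out = bn(test_batch, is_training=False)
print("moving_mean:", bn.moving_mean)
print("moving_var :", bn.moving_var)
print("inference output mean/std:", out.mean(), out.std())
```

If the update step is skipped during training, which is what happens in TensorFlow when the ops in `tf.GraphKeys.UPDATE_OPS` are never run, the moving statistics stay at their initial values of 0 and 1, and inference normalizes with the wrong mean and variance.
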
#### Setting `decay`
The `decay` value is critical when a run performs relatively few updates, for example on a small dataset or for a few epochs. The default value of `0.999` can give poor results in that case.
From the same thread as above:
"We could change the default to 0.9 or document better its impact in smaller datasets or few updates.
@vincentvanhoucke in our distributed setting we usually do millions of updates so it is ok, however in other cases like the one here which does only a few hundreds of updates it makes a big difference:
For example using decay=0.999 has a 0.36 bias after 1000 updates, but that bias goes down to 0.000045 after 10000 updates and to 0.0 after 50000 updates."
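
The numbers in that quote can be checked with a few lines of plain Python. The sketch assumes the moving average starts at 0 and tracks a constant target of 1, so the remaining bias after `n` updates is `decay ** n`. The helper name `moving_average_after` is made up for illustration.

```python
decay = 0.999

def moving_average_after(n_updates, decay, value=1.0, initial=0.0):
    """Run n_updates of the exponential moving average toward a constant value."""
    moving = initial
    for _ in range(n_updates):
        moving = decay * moving + (1.0 - decay) * value
    return moving

for n in (1000, 10000, 50000):
    simulated_bias = 1.0 - moving_average_after(n, decay)
    closed_form = decay ** n          # share of the initial value still left after n updates
    print(n, simulated_bias, closed_form)
```

With `decay=0.9`, as used in the model below, the same bias is already negligible after a few hundred updates.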

For reference, the arguments and defaults of fully_connected() and batch_norm() are shown below.

def fully_connected(inputs,
                    num_outputs,
                    activation_fn=nn.relu,
                    normalizer_fn=None,
                    normalizer_params=None,
                    weights_initializer=initializers.xavier_initializer(),
                    weights_regularizer=None,
                    biases_initializer=init_ops.zeros_initializer(),
                    biases_regularizer=None,
                    reuse=None,
                    variables_collections=None,
                    outputs_collections=None,
                    trainable=True,
                    scope=None):
                    
def batch_norm(inputs,
               decay=0.999,
               center=True,
               scale=False,
               epsilon=0.001,
               activation_fn=None,
               param_initializers=None,
               param_regularizers=None,
               updates_collections=ops.GraphKeys.UPDATE_OPS,
               is_training=True,
               reuse=None,
               variables_collections=None,
               outputs_collections=None,
               trainable=True,
               batch_weights=None,
               fused=False,
               data_format=DATA_FORMAT_NHWC,
               zero_debias_moving_mean=False,
               scope=None,
               renorm=False,
               renorm_clipping=None,
               renorm_decay=0.99):

Note: the flow is modeled on my code for the Udacity Self-Driving Car project on German Traffic Sign classification ([my code](https://github.com/boulderZ/German-Traffic-Sign-Classification)).
Start by defining a 10-layer net. Below I use tf.contrib.slim to get access to higher-level features and to use arg_scope().


In [32]:
# define simple ten layer net with optional batch normalization (BN)
import tensorflow as tf

tf.reset_default_graph()  ### NOTE: This is needed in jupyter notebooks when using tf.get_variable(), otherwise 
                          ### when you run a second time, you will get errors that variables already exist.
slim = tf.contrib.slim
updates_collections=tf.GraphKeys.UPDATE_OPS
#updates_collections = None   # uncomment this to do the batch norm updates in place
normalizer_fn=slim.batch_norm
#normalizer_fn = None   # uncomment this to remove batch normalization
#weights_initializer = tf.contrib.layers.xavier_initializer() ## using this without BN will also work well
weights_initializer = tf.random_normal_initializer(stddev=.01) # this is a bad initializer on purpose to test BN
## try setting normalizer_fn = None and weights_initializer = tf.random_normal_initializer(stddev=.01)
## you should see that the training fails and resulting accuracy is random (0.1)
## Then same poor initializer but with normalizer_fn=slim.batch_norm and results are quite good (accuracy ~0.9)

def model(data, is_training=False, reuse=None, scope='model'):
    # Define a variable scope to contain all the variables of your model
    with tf.variable_scope(scope, values=[data], reuse=reuse):  # values must be passed as a list of tensors
        # Configure arguments of fully_connected layers
        with slim.arg_scope([slim.fully_connected],
                        activation_fn=tf.nn.relu,
                        weights_initializer=weights_initializer,
                        normalizer_fn=normalizer_fn):
            # Configure arguments of batch_norm layers
            with slim.arg_scope([slim.batch_norm],
                          decay=0.9,  # Adjust decay to the number of iterations
                          updates_collections=updates_collections, # None: Make sure updates happen automatically
                          is_training=is_training):  # Switch behavior from training to non-training
                net = slim.fully_connected(data, 500, scope='fc1')
                tf.summary.histogram('fc1_activations', net)       # add visualization of activations for TensorBoard
                net = slim.fully_connected(net, 500, scope='fc2')
                net = slim.fully_connected(net, 500, scope='fc3')
                net = slim.fully_connected(net, 500, scope='fc4')
                net = slim.fully_connected(net, 500, scope='fc5')
                tf.summary.histogram('fc5_activations', net)
                net = slim.fully_connected(net, 500, scope='fc6')
                net = slim.fully_connected(net, 500, scope='fc7')
                net = slim.fully_connected(net, 500, scope='fc8')
                net = slim.fully_connected(net, 500, scope='fc9')
                tf.summary.histogram('fc9_activations', net)

                # Don't use activation_fn nor batch_norm in the last layer        
                logits = slim.fully_connected(net, 10, activation_fn=None, normalizer_fn=None, scope='fc10')
                tf.summary.histogram('fc10_activations', logits)
                
    return logits

In [33]:
x = tf.placeholder(tf.float32, (None, image_size*image_size))
y = tf.placeholder(tf.float32, (None, num_labels))  # 1-hot float labels, as produced by reformat()
keep_prob = tf.placeholder(tf.float32) # probability to keep units
is_training = tf.placeholder(tf.bool, name='is_training')


In [34]:
### IMPORTS needed to use tf.contrib.layers.optimize_loss()

from tensorflow.python.ops import variable_scope
from tensorflow.python.framework import dtypes
from tensorflow.python.ops import init_ops
global_step = variable_scope.get_variable(  # this needs to be defined for tf.contrib.layers.optimize_loss()
      "global_step", [],
      trainable=False,
      dtype=dtypes.int64,
      initializer=init_ops.constant_initializer(0, dtype=dtypes.int64))

rate = 0.001 
logits = model(x,is_training)
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits,labels=y)
loss_operation = tf.reduce_mean(cross_entropy)
tf.summary.scalar('loss', loss_operation)
#optimizer = tf.train.AdamOptimizer(learning_rate = rate)
#optimizer = tf.train.GradientDescentOptimizer(0.5)
#training_operation = optimizer.minimize(loss_operation)

## Experiment with optimize_loss() which does summaries on gradients for TensorBoard. 
if updates_collections is not None: # then the moving_mean and moving_variance updates must be run explicitly
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        training_operation = tf.contrib.layers.optimize_loss(
                    loss_operation, global_step, learning_rate=rate, optimizer='Adam',summaries=["gradients"])
else:
    training_operation = tf.contrib.layers.optimize_loss(
              loss_operation, global_step, learning_rate=rate, optimizer='Adam',summaries=["gradients"])
    

<tf.Tensor 'loss:0' shape=() dtype=string>

In [35]:
BATCH_SIZE = 128

correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))  # returns tensor of dtype = boolean
accuracy_operation = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))  # convert from boolean to float32
saver = tf.train.Saver()
summary = tf.summary.merge_all()

def evaluate(X_data, y_data):
    num_examples = len(X_data)
    total_accuracy = 0
    sess = tf.get_default_session()
    for offset in range(0, num_examples, BATCH_SIZE):
        batch_x, batch_y = X_data[offset:offset+BATCH_SIZE], y_data[offset:offset+BATCH_SIZE]
        accuracy = sess.run(accuracy_operation, feed_dict={x: batch_x, y: batch_y,keep_prob: 1.0,
                                                          is_training: False})
        total_accuracy += (accuracy * len(batch_x))
    return total_accuracy / num_examples
#print(X_train.shape)

In [37]:
from sklearn.utils import shuffle
import time
from datetime import datetime
EPOCHS = 2
#logdir = FLAGS.train_dir + '/' + datetime.now().strftime('%Y%m%d-%H%M%S') + '/'
logdir='tf_logs/'  # directory to save summaries in for TensorBoard

# rename variables to match existing code
X_train,y_train = train_dataset, train_labels
X_validation,y_validation = valid_dataset, valid_labels
X_test,y_test = test_dataset, test_labels

# Uncomment below to see overfitting. This is also a sanity check on the model/code. You should see a training accuracy
# of 1.0 with enough EPOCHS (EPOCHS=10, train_dataset[:25]). Also, the first loss value should be about 2.3, which is log(10).
#X_train,y_train = train_dataset[:25], train_labels[:25]  # reduce training set (and increase EPOCHS) to see overfitting


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #summary_writer = tf.summary.FileWriter(logdir, sess.graph)  # Need for TensorBoard to save summaries 
    num_examples = len(X_train)
    
    
    
    print("Training...")
    print()
    for i in range(EPOCHS):
        X_train, y_train = shuffle(X_train, y_train)
        for offset in range(0, num_examples, BATCH_SIZE):
            end = offset + BATCH_SIZE
            batch_x, batch_y = X_train[offset:end], y_train[offset:end]
            #sess.run(training_operation, feed_dict={x: batch_x, y: batch_y,keep_prob: 1.0})
            _, loss = sess.run([training_operation,loss_operation], feed_dict={x: batch_x, y: batch_y,keep_prob: 1.0,
                                                                              is_training: True})
            
        validation_accuracy = evaluate(X_validation, y_validation)
        training_accuracy = evaluate(X_train,y_train)
        ## add code to save summaries for TensorBoard
        str_epoch = logdir + 'run_' + str(i) # new directory for each run for tensorboard, then you can view by run(epoch)
        summary_writer = tf.summary.FileWriter(str_epoch, sess.graph) 
        summary_str = sess.run(summary, feed_dict={x: batch_x, y: batch_y,keep_prob: 1.0,
                                                  is_training: False})
        summary_writer.add_summary(summary_str, i)
        summary_writer.flush()  # evidently this is needed sometimes or scalars will not show up on tensorboard.
        print("EPOCH {} ...".format(i+1))
        print("Validation Accuracy = {:.3f}".format(validation_accuracy))
        print('Training Loss = ',loss)
        print('Training Accuracy = ',training_accuracy)
        print()
        
    saver.save(sess, './lenet')
    print("Model saved")


Training...

EPOCH 1 ...
Validation Accuracy = 0.873
Training Loss =  0.689682
Training Accuracy =  0.8741

EPOCH 2 ...
Validation Accuracy = 0.889
Training Loss =  0.626996
Training Accuracy =  0.8932



'./lenet'

Model saved
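
The comments in the training cell mention that the first loss value should be about 2.3, which is log(10). A NumPy sketch of why: if an untrained net outputs near-equal logits for all 10 classes, the softmax assigns probability 1/10 to each class, and the cross-entropy is -log(1/10). The helper `softmax_cross_entropy` and the toy labels are illustrative, not the TensorFlow op used above.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_labels):
    """Mean softmax cross-entropy over a batch, computed with a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1).mean()

num_classes = 10
logits = np.zeros((5, num_classes))                 # stand-in for an untrained net: equal logits
labels = np.eye(num_classes)[[3, 1, 4, 1, 5]]       # arbitrary 1-hot labels
loss = softmax_cross_entropy(logits, labels)
print(loss, np.log(num_classes))
```

A first-batch loss far from this value usually points at a problem in the initialization, the labels, or the loss wiring.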


In [29]:
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))

    test_accuracy = evaluate(X_test, y_test)
    print("Test Accuracy = {:.3f}".format(test_accuracy))

INFO:tensorflow:Restoring parameters from ./lenet
Test Accuracy = 0.938
