## Evaluation Routines in TF-Slim

*by Marvin Bertin*
<img src="../images/tensorflow.png" width="400">

**Evaluating Deep Learning Models**

Evaluating a deep learning model measures its performance and tells us whether the neural network is learning information that is useful for solving the task.

Deep neural networks can be very large and take a long time to train. Therefore, it is recommended to evaluate the model's performance regularly while training.

It is important to monitor the 'health' of training because optimization can stop working properly.

Common causes include the following (typical remedies in parentheses):

- overfitting (use early stopping, regularization, more data)
- vanishing or exploding gradients (clip gradient norm, change activation function, residual skip connection)
- non-converging learning (bad initialization, large learning rate, tune optimizer, bug in network)
- reaching local minima (update learning rate, dropout)
- covariate shift in very deep network (batch normalization)
- low performance, high bias (modify model architecture, larger network)
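As an illustration of the first remedy in the list, the following is a minimal plain-Python sketch of early stopping driven by periodic evaluation. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation losses are all hypothetical and are not part of TF-Slim.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # evaluations to wait without improvement
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember the new best loss and reset the counter.
            self.best_loss = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Hypothetical validation losses: they improve, then drift upward (overfitting).
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.69, 0.75]

stopper = EarlyStopping(patience=3)
stopped_at = None
for step, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = step
        break

print("stopped at evaluation", stopped_at, "with best loss", stopper.best_loss)
```

In practice, the loss passed to `should_stop` would come from an evaluation run on held-out data, which is what the rest of this lesson covers.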

TF-Slim has an evaluation module that contains helper functions for evaluating TensorFlow
models using a variety of metrics and summarizing the results.

**Evaluation Loop**

The helper functions in `slim.evaluation` are used to write model evaluation scripts with metrics from the `slim.metrics` module.

The evaluation loop:
- runs evaluation periodically
- evaluates metrics over batches of data
- summarizes metric results.
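Evaluating a metric over batches relies on accumulating running totals rather than averaging per-batch scores. The following is a minimal plain-Python sketch of that idea. The `StreamingAccuracy` class and the sample batches are hypothetical and are not part of TF-Slim, whose streaming metrics keep comparable totals in local variables.

```python
class StreamingAccuracy:
    """Accumulate accuracy over batches with separate update and value steps."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predictions, labels):
        # Analogous to running an update op on one batch.
        self.correct += sum(int(p == y) for p, y in zip(predictions, labels))
        self.total += len(labels)

    def value(self):
        # Analogous to reading the value op after all batches were seen.
        return self.correct / self.total if self.total else 0.0


# Hypothetical (predictions, labels) batches of unequal size.
batches = [
    ([1, 0, 2], [1, 0, 1]),
    ([2, 2], [2, 2]),
    ([0, 1, 1], [0, 0, 1]),
]

metric = StreamingAccuracy()
for preds, labels in batches:
    metric.update(preds, labels)

accuracy = metric.value()
print("accuracy over all batches:", accuracy)
```

Keeping counts instead of per-batch averages means that batches of different sizes are weighted correctly in the final value.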


In [1]:
import sys  
sys.path.append("../") 

import tensorflow as tf
slim = tf.contrib.slim

%load_ext autoreload
%autoreload 2

## Evaluation for a Single Run

In the simplest use case, we use a model to create the predictions, specify the metrics, and finally call `slim.evaluation.evaluation()`, which performs a single evaluation run.

**A single evaluation consists of several steps**

1. an initialization op that initializes local and global variables.
2. an evaluation op which is executed `num_evals` times.
3. a finalization op which is executed at the end of the evaluation loop.
4. the execution of a summary op which is written out using a summary writer.
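The four steps above can be sketched with plain Python callables standing in for TensorFlow ops. This is only an illustration of the control flow under simplifying assumptions: `run_single_evaluation`, the op functions, the list used as a summary writer, and the batch losses are all hypothetical, and no session or graph is involved.

```python
def run_single_evaluation(initial_op, eval_op, final_op, summary_op,
                          summary_writer, num_evals):
    """Run the four steps of a single evaluation using plain callables."""
    initial_op()                          # 1. reset the metric state
    for _ in range(num_evals):            # 2. update the metrics once per batch
        eval_op()
    final_value = final_op()              # 3. read the aggregated result
    summary_writer.append(summary_op())   # 4. record a summary
    return final_value


# Hypothetical per-batch losses and a dictionary holding the metric state.
batch_losses = iter([0.5, 0.3, 0.4, 0.4])
state = {}
written = []  # stands in for a summary writer


def initial_op():
    state["total"] = 0.0
    state["count"] = 0


def eval_op():
    state["total"] += next(batch_losses)
    state["count"] += 1


def final_op():
    return state["total"] / state["count"]


def summary_op():
    return ("mean_loss", state["total"] / state["count"])


mean_loss = run_single_evaluation(
    initial_op, eval_op, final_op, summary_op, written, num_evals=4)
print("mean loss:", round(mean_loss, 4))
```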
    

In [None]:
# Create model and obtain the predictions:
images, labels = LoadData(...)
predictions = MyModel(images)

# Choose the metrics to compute. aggregate_metric_map expects
# (value_op, update_op) pairs, which the streaming_* metrics return:
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    "accuracy": slim.metrics.streaming_accuracy(predictions, labels),
    "mse": slim.metrics.streaming_mean_squared_error(predictions, labels),
})

# Initialize global variables and the local variables that hold the metric state
initial_op = tf.group(
    tf.global_variables_initializer(),
    tf.local_variables_initializer())

with tf.Session() as sess:
    # Run evaluation
    metric_values = slim.evaluation.evaluation(
        sess,
        num_evals=10,
        initial_op=initial_op,
        eval_op=list(names_to_updates.values()),
        final_op=list(names_to_values.values()))

    # Print final metric values
    for metric, value in zip(names_to_values.keys(), metric_values):
        tf.logging.info('Metric %s has value: %f', metric, value)

## Evaluating a Checkpointed Model with Metrics

Often, one wants to evaluate a model checkpoint saved on disk.

The evaluation can be performed periodically during training on a set schedule.

Instead of calling `evaluation()`, we now call `evaluation_loop()`. In addition, we provide the checkpoint directory, the logging directory, and an evaluation time interval.
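Conceptually, a periodic evaluation loop looks for the latest checkpoint, evaluates it, and waits before looking again. The following plain-Python sketch illustrates that pattern only. `periodic_evaluation` and its arguments are hypothetical, and skipping a checkpoint that was already evaluated is an assumption of this sketch, not a statement about how `evaluation_loop()` behaves in every TensorFlow version.

```python
import time


def periodic_evaluation(get_latest_checkpoint, evaluate, eval_interval_secs,
                        max_number_of_evaluations, sleep=time.sleep):
    """Evaluate each new checkpoint, waiting between polls."""
    results = []
    last_seen = None
    while len(results) < max_number_of_evaluations:
        checkpoint = get_latest_checkpoint()
        if checkpoint is not None and checkpoint != last_seen:
            # A new checkpoint appeared: evaluate it and remember it.
            results.append((checkpoint, evaluate(checkpoint)))
            last_seen = checkpoint
        else:
            # Nothing new yet: wait before polling again.
            sleep(eval_interval_secs)
    return results


# Hypothetical stand-ins so the sketch runs without TensorFlow or real waiting.
checkpoints = iter([None, "model.ckpt-100", "model.ckpt-100", "model.ckpt-200"])
naps = []  # records requested sleep durations instead of sleeping

results = periodic_evaluation(
    get_latest_checkpoint=lambda: next(checkpoints),
    evaluate=lambda ckpt: int(ckpt.rsplit("-", 1)[1]),  # returns the step number
    eval_interval_secs=600,
    max_number_of_evaluations=2,
    sleep=naps.append)

print(results)
```

Passing `sleep` as an argument keeps the loop testable; a real script would leave the default `time.sleep`.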

In [None]:
# Create model and obtain the predictions:
images, labels = LoadData(...)
predictions = MyModel(images)

# Choose the metrics to compute. aggregate_metric_map expects
# (value_op, update_op) pairs, which the streaming_* metrics return:
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    "accuracy": slim.metrics.streaming_accuracy(predictions, labels),
    "mse": slim.metrics.streaming_mean_squared_error(predictions, labels),
})

# Directory containing the model checkpoints
checkpoint_dir = '/tmp/my_model_dir/'

# Directory where the evaluation summaries are written
log_dir = '/tmp/my_model_eval/'

# Evaluate for 1000 batches:
num_evals = 1000

# Evaluate every 10 minutes:
slim.evaluation.evaluation_loop(
    master='',
    checkpoint_dir=checkpoint_dir,
    logdir=log_dir,
    num_evals=num_evals,  # number of batches to evaluate
    eval_op=list(names_to_updates.values()),
    eval_interval_secs=600)  # how often to run the evaluation, in seconds

## Evaluating a Checkpointed Model with Summaries

In addition to computing the metrics, the evaluation loop can also write scalar and histogram summaries of the metrics and the model to disk.

In [None]:
# Load the data
images, labels = load_data(...)

# Define the network
predictions = MyModel(images)

# Choose the metrics to compute (streaming_* metrics return
# the (value_op, update_op) pairs that aggregate_metric_map expects):
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    'accuracy': slim.metrics.streaming_accuracy(predictions, labels),
    'precision': slim.metrics.streaming_precision(predictions, labels),
    'recall': slim.metrics.streaming_recall(predictions, labels),
})

# Define the summaries to write, collecting them so they can be merged:
summary_ops = []
for metric_name, metric_value in names_to_values.items():
    summary_ops.append(tf.summary.scalar(metric_name, metric_value))

# Define other summaries to write (loss, activations, gradients)
summary_ops.append(tf.summary.scalar(...))
summary_ops.append(tf.summary.histogram(...))

checkpoint_dir = '/tmp/my_model_dir/'
log_dir = '/tmp/my_model_eval/'

# Evaluate for 1000 batches:
num_evals = 1000

# Set up the global step.
slim.get_or_create_global_step()

slim.evaluation.evaluation_loop(
    master='',
    checkpoint_dir=checkpoint_dir,
    logdir=log_dir,
    num_evals=num_evals,
    eval_op=list(names_to_updates.values()),
    summary_op=tf.summary.merge(summary_ops),  # merge the list of summary ops
    eval_interval_secs=600)  # how often to run the evaluation, in seconds

## Evaluating at a Given Checkpoint

When a model has already been trained and we only wish to evaluate it from its last checkpoint, TF-Slim provides a method called `evaluate_once()`, which evaluates the model only at the given checkpoint path.

In [None]:
from utils.slim_models import CNNClassifier

image_shape = (64,64,3)
num_class = 5

CNN_model = CNNClassifier("flowers", image_shape , num_class)

In [12]:
logits, nodes = CNN_model(inputs, dropout = 0.5, is_training=False)
predictions = tf.argmax(logits, 1)

# Define streaming metrics
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    'eval/Accuracy': slim.metrics.streaming_accuracy(predictions, targets),
    'eval/Recall@3': slim.metrics.streaming_sparse_recall_at_k(
            tf.to_float(logits), tf.expand_dims(targets,1), 3),
    'eval/Precision': slim.metrics.streaming_precision(predictions, targets),
    'eval/Recall': slim.metrics.streaming_recall(predictions, targets)
})


print('Running evaluation Loop...')
# Only load latest checkpoint
checkpoint_path = tf.train.latest_checkpoint(checkpoint_dir)

metric_values = slim.evaluation.evaluate_once(
    num_evals=num_evals,
    master='',
    checkpoint_path=checkpoint_path,
    logdir=checkpoint_dir,
    eval_op=list(names_to_updates.values()),
    final_op=list(names_to_values.values()))

# print final metric values
names_to_values = dict(zip(names_to_values.keys(), metric_values))
for name in names_to_values:
    print('%s: %f' % (name, names_to_values[name]))
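To make the `eval/Recall@3` metric above more concrete, here is a minimal plain-Python sketch of recall at k for the single-label case: the fraction of examples whose true class is among the k highest-scoring classes. The `recall_at_k` function and the sample logits and targets are hypothetical, and the sketch does not reproduce the streaming behavior of the TF-Slim metric.

```python
def recall_at_k(logits, targets, k):
    """Fraction of examples whose target class is among the top-k scores."""
    hits = 0
    for scores, target in zip(logits, targets):
        # Class indices sorted by score, highest first, truncated to k.
        top_k = sorted(range(len(scores)),
                       key=lambda c: scores[c], reverse=True)[:k]
        hits += int(target in top_k)
    return hits / len(targets)


# Hypothetical logits for 4 examples and 5 classes.
logits = [
    [0.1, 2.0, 0.3, 1.5, 0.2],  # top-3 classes: 1, 3, 2
    [1.2, 0.1, 0.4, 0.3, 2.2],  # top-3 classes: 4, 0, 2
    [0.5, 0.4, 0.3, 0.2, 0.1],  # top-3 classes: 0, 1, 2
    [0.0, 0.1, 0.2, 3.0, 0.3],  # top-3 classes: 3, 4, 2
]
targets = [3, 1, 2, 0]

print("recall@3:", recall_at_k(logits, targets, 3))
print("recall@1:", recall_at_k(logits, targets, 1))
```

Recall at k is more forgiving than accuracy: here two of the four targets fall in the top three scores even though none is the top-scoring class.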

## Evaluate CNN Flower Model

## Define Evaluator

In [3]:
from utils.slim_training_evaluation import ModelTrainerEvaluater
from utils.slim_data_provider import DatasetProvider

checkpoint_dir="../models/flowers/"
data_dir = "../data/flowers/"

CNN_trainer = ModelTrainerEvaluater(model = CNN_model,
                           dataset_provider = DatasetProvider(data_dir),
                           data_name="flowers",
                           checkpoint_dir=checkpoint_dir)

## Run Evaluation

In [10]:
CNN_trainer.evaluate(num_evals=20,
                     data_type='validation',
                     dropout = 0.5,
                     batch_size=32)

Running evaluation Loop...
Instructions for updating:
Please switch to tf.summary.FileWriter. The interface and behavior is the same; this is just a rename.
INFO:tensorflow:Starting evaluation at 2017-02-18-19:52:17
INFO:tensorflow:Executing eval ops
INFO:tensorflow:Executing eval_op 1/20
INFO:tensorflow:Executing eval_op 2/20
INFO:tensorflow:Executing eval_op 3/20
INFO:tensorflow:Executing eval_op 4/20
INFO:tensorflow:Executing eval_op 5/20
INFO:tensorflow:Executing eval_op 6/20
INFO:tensorflow:Executing eval_op 7/20
INFO:tensorflow:Executing eval_op 8/20
INFO:tensorflow:Executing eval_op 9/20
INFO:tensorflow:Executing eval_op 10/20
INFO:tensorflow:Executing eval_op 11/20
INFO:tensorflow:Executing eval_op 12/20
INFO:tensorflow:Executing eval_op 13/20
INFO:tensorflow:Executing eval_op 14/20
INFO:tensorflow:Executing eval_op 15/20
INFO:tensorflow:Executing eval_op 16/20
INFO:tensorflow:Executing eval_op 17/20
INFO:tensorflow:Executing eval_op 18/20
INFO:tensorflow:Executing eval_op 19/2

## Next Lesson
### Fine-Tuning and Transfer Learning in TF-Slim
-  Explore how to fine-tune pre-trained models and use transfer learning to train on a new task

<img src="../images/divider.png" width="100">