# Evaluation Metrics Module in TF-Slim
*by Marvin Bertin*
<img src="../images/tensorflow.png" width="400">

While training a deep learning model, inspecting the loss alone does not provide an interpretable measure of how well the neural network is training.

Instead, we compute evaluation metrics at regular intervals to score the model's performance on the task we care about.

**Computing model performance includes**

- loading the data (or a subset of it)
- performing inference
- comparing the results to the ground truth
- recording the evaluation scores
- repeating periodically
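The steps above can be sketched in plain Python. This is only an illustration of the loop structure: `evaluate`, `toy_model`, `eval_examples`, and `eval_every` are hypothetical names made up for this sketch, and the "model" is a fixed threshold rather than a trained network.

```python
def evaluate(model, examples):
    """Run inference on held-out examples and return the accuracy."""
    correct = 0
    for features, label in examples:         # load the (subset of) data
        prediction = model(features)          # perform inference
        correct += int(prediction == label)   # compare to the ground truth
    return correct / len(examples)


# Hypothetical stand-ins: a toy "model" and a tiny held-out set
toy_model = lambda x: int(x > 0.5)
eval_examples = [(0.9, 1), (0.2, 0), (0.7, 0), (0.1, 0)]

eval_every = 100  # assumed evaluation interval, in training steps
history = []      # recorded (step, score) pairs

for step in range(1, 301):
    # ... one training step would go here ...
    if step % eval_every == 0:                # repeat periodically
        history.append((step, evaluate(toy_model, eval_examples)))

print(history)
```

Because the toy model never changes, every recorded score is the same; with a real model the recorded history would show how performance evolves during training.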


**Evaluation Metrics**

- a metric is a performance measure
- a metric is not a loss function (losses are directly optimized during training)
- a metric is consistent with the task of the problem
- for example, we may want to minimize cross-entropy, but our metrics of interest might be accuracy or F1 score
- a metric is not necessarily differentiable, and therefore cannot always be used as a loss

<img src="../images/recall.png" width="200">

**F1 score is the harmonic mean of precision and recall**

$${\displaystyle F_{1}=2\cdot {\frac {1}{{\tfrac {1}{\mathrm {recall} }}+{\tfrac {1}{\mathrm {precision} }}}}=2\cdot {\frac {\mathrm {precision} \cdot \mathrm {recall} }{\mathrm {precision} +\mathrm {recall} }}}$$
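As a quick numerical check of the formula, here is a minimal sketch in plain Python. The counts of true positives, false positives, and false negatives are made-up values for illustration, and `precision_recall_f1` is a hypothetical helper, not a TF-Slim function.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts (assumed, not taken from a real model)
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print('precision=%.3f recall=%.3f F1=%.3f' % (precision, recall, f1))
```

With these counts, precision is 0.8 and recall is 0.5, so F1 is about 0.615, which sits closer to the lower of the two values, as expected from a harmonic mean.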

**TF-Slim Metrics Module**

TF-Slim provides a set of metric operations that make evaluating models easy.

Each metric function adds nodes to the graph that hold the state necessary to compute the value of the metric, as well as a set of operations that actually perform the computation. Every metric evaluation is composed of three steps:

1. **Initialization** - initializing the variables used to compute the metric.
2. **Aggregation** - updating the values of the metric state.
3. **Finalization** - computing the final metric value.
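To make the three steps concrete, here is a minimal plain-Python sketch of a streaming accuracy. `StreamingAccuracy` is a hypothetical toy class written for this illustration; it mimics the idea of metric state plus update and value operations, but it is not how TF-Slim implements its metrics.

```python
class StreamingAccuracy(object):
    """Toy stand-in for a streaming metric (not TF-Slim's implementation)."""

    def __init__(self):
        # 1. Initialization: the metric state starts at zero
        self.correct = 0
        self.count = 0

    def update(self, predictions, labels):
        # 2. Aggregation: fold one batch into the running state
        self.correct += sum(p == l for p, l in zip(predictions, labels))
        self.count += len(labels)

    def value(self):
        # 3. Finalization: turn the accumulated state into the metric
        return self.correct / self.count


metric = StreamingAccuracy()
metric.update([1, 0, 1], [1, 1, 1])  # 2 of 3 correct
metric.update([0, 0], [0, 1])        # 1 of 2 correct
print('Accuracy: %.2f' % metric.value())  # 3 of 5 correct overall
```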

In [1]:
import sys  
sys.path.append("../") 

import tensorflow as tf
slim = tf.contrib.slim

%load_ext autoreload
%autoreload 2

## Streaming Metrics

TF-Slim provides a number of streaming metrics. These metrics are computed on dynamically valued Tensors, as sample batches are evaluated.

Each metric declaration returns:

- a **value_tensor** - an operation that returns the current value of the metric.
- an **update_op** - an operation that accumulates information from the current batch of Tensors into the metric state.


## Streaming Mean Metric
A simple example of how a streaming mean is computed:

1. declare the metric.
2. call update_op repeatedly to accumulate data.
3. evaluate the value_tensor.

In [None]:
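values = ...
mean_value, update_op = slim.metrics.streaming_mean(values)

with tf.Session() as sess:
    # Metric state (total and count) is stored in local variables
    sess.run(tf.local_variables_initializer())

    for i in range(number_of_batches):
        # Each run of update_op folds the next batch into the running mean
        print('Mean after batch %d: %f' % (i, sess.run(update_op)))
    print('Final Mean: %f' % sess.run(mean_value))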

## Defining Multiple Metrics

In practice, we commonly want to evaluate multiple metrics at the same time. Below is how you would define three different metrics. Each metric generates its own update operation that accumulates the results across multiple batches.

For example, to compute mean_absolute_error, two variables, a count and a total, are initialized to zero. During aggregation, we observe a set of predictions and labels, compute their absolute differences, and add the sum to the total variable. Each time we observe another value, the count variable is incremented. Finally, during finalization, total is divided by count to obtain the mean.

In [None]:
# Load data
images, labels = LoadTestData(...)

# make predictions
predictions = MyModel(images)

# Evaluation metrics
mae_value_op, mae_update_op = slim.metrics.streaming_mean_absolute_error(predictions, labels)
mre_value_op, mre_update_op = slim.metrics.streaming_mean_relative_error(predictions, labels, labels)
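# Assumption: per-example relative errors, computed here for illustration (labels must be non-zero)
relative_errors = tf.abs(predictions - labels) / labels
pl_value_op, pl_update_op = slim.metrics.streaming_percentage_less(relative_errors, 0.3)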
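The count and total bookkeeping described above for mean_absolute_error can be sketched in plain Python. The batches are made-up numbers for illustration, and the loop only mirrors the idea of the aggregation; it is not TF-Slim's implementation.

```python
total = 0.0  # running sum of absolute errors
count = 0    # number of values observed so far

# Made-up (predictions, labels) batches for illustration
batches = [
    ([2.0, 4.0], [1.0, 6.0]),
    ([3.0], [3.5]),
]

for predictions, labels in batches:
    # Aggregation: add this batch's absolute errors and size to the state
    total += sum(abs(p - l) for p, l in zip(predictions, labels))
    count += len(labels)

# Finalization: divide the accumulated total by the count
mean_absolute_error = total / count
print('MAE: %.3f' % mean_absolute_error)
```

Note that the result is the mean over all three observed values, not the average of the two per-batch means, which is why the state has to be carried across batches.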

## Metric Aggregation

Each metric returns a `value_op` and `update_op`. Keeping track of each of these operations can become difficult when there are a lot of metrics.

**List Aggregation**

To deal with this, TF-Slim provides aggregate functions that combine them together.

In [None]:
# Aggregates the value and update ops in two lists:
value_ops, update_ops = slim.metrics.aggregate_metrics(
    slim.metrics.streaming_mean_absolute_error(predictions, labels),
    slim.metrics.streaming_mean_squared_error(predictions, labels))

**Dictionary Aggregation**

We can also aggregate metrics into a dictionary and give each of them a name.
In practice, we commonly want to evaluate multiple metrics across many batches.
This is done by running the aggregated update operations multiple times.

In [None]:
# Load the data
images, labels = load_data(...)

# Define a neural network
logits = MyModel(images)
predictions = tf.argmax(logits, 1)


# Aggregates the value and update ops in two dictionaries:
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    'eval/Accuracy': slim.metrics.streaming_accuracy(predictions, labels),
    'eval/Recall@3': slim.metrics.streaming_recall_at_k(tf.to_float(logits), labels, 3),
    'eval/Precision': slim.metrics.streaming_precision(predictions, labels),
    'eval/Recall': slim.metrics.streaming_recall(predictions, labels)
})


# Evaluate the model using 1000 batches of data:
num_batches = 1000

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())

    # Run the update ops once per batch to accumulate the metric state
    for batch_id in range(num_batches):
        sess.run(list(names_to_updates.values()))

    # Fetch the final value of each metric
    metric_values = sess.run(list(names_to_values.values()))
    for metric, value in zip(names_to_values.keys(), metric_values):
        print('Metric %s has value: %f' % (metric, value))
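To make the shape of the two dictionaries concrete, here is a plain-Python sketch of the regrouping involved: a map from each name to a (value, update) pair becomes one map of values and one map of updates. `split_metric_map` is a hypothetical helper written for this illustration, and the strings are placeholders standing in for the real ops.

```python
def split_metric_map(names_to_tuples):
    """Split {name: (value, update)} into {name: value} and {name: update}."""
    names_to_values = {}
    names_to_updates = {}
    for name, (value, update) in names_to_tuples.items():
        names_to_values[name] = value
        names_to_updates[name] = update
    return names_to_values, names_to_updates


# Placeholder strings stand in for the value and update ops
example_values, example_updates = split_metric_map({
    'eval/Accuracy': ('accuracy_value', 'accuracy_update'),
    'eval/Recall': ('recall_value', 'recall_update'),
})
print(sorted(example_values.items()))
print(sorted(example_updates.items()))
```

Keeping the two maps keyed by the same names is what lets the evaluation loop run every update op per batch and then report each final value under its label.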

## Other Functions Provided by Slim.metrics Module

TF-Slim provides other useful metric functions and distance metrics that I will let you explore on your own. Below are a few examples:

```
slim.metrics.streaming_recall_at_k
slim.metrics.confusion_matrix
slim.metrics.streaming_auc
slim.metrics.streaming_mean_cosine_distance
slim.metrics.streaming_root_mean_squared_error
slim.metrics.streaming_pearson_correlation
```

## Next Lesson
### Compact Evaluation Routines in TF-Slim
-  Construct evaluation routines and score the performance of your deep neural network.

<img src="../images/divider.png" width="100">