# Loss Function Module in TF-Slim
*by Marvin Bertin*
<img src="../images/tensorflow.png" width="400">

**Loss Functions**

TF-Slim provides an easy-to-use mechanism for defining and keeping track of loss functions via the losses module.

All of the loss functions take a pair of predictions and ground truth labels, from which the loss is computed.

In machine learning, a loss function is used to measure the "cost" or degree of fit of the model. Therefore it is important to use the appropriate cost function based on the predictive task at hand.

The choice of loss function should align with the metric we care about (such as accuracy or residual error).

Every deep learning model is an optimization problem that seeks to minimize a loss function. The loss score is what is used to backpropagate the error signal throughout the neural network and update the model weights. It is therefore a crucial component of the learning process. 


<img src="../images/loss_acc.png" width="900">

In [1]:
import sys  
sys.path.append("../") 

import tensorflow as tf
slim = tf.contrib.slim

%load_ext autoreload
%autoreload 2

## Import The Flower CNN Model

In [2]:
from utils.slim_models import CNNClassifier

image_shape = (64,64,3)
num_class = 5

CNN_model = CNNClassifier("flowers", image_shape , num_class)

## Multi-Class Classification

A common classifier choice for multi-class tasks is the **Softmax classifier**.
The softmax classifier is a generalization of a binary classifier to multiple classes.
The softmax classifier has an interpretable output of normalized class probabilities.

**Softmax function**
$$f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

The **softmax function** takes a vector of arbitrary real-valued scores and squashes it into a vector of values between zero and one that sum to one, so the class probabilities always sum to 1. This is why it is used for multi-class classification, where each sample is expected to belong to exactly one class at a time.
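
As a rough illustration of the formula above, here is a minimal pure-Python sketch of softmax. The helper name `softmax` and the example scores are assumptions made for this illustration; TF-Slim computes the same quantity on tensors.

```python
import math

def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to one."""
    # Subtracting the largest score does not change the result,
    # but it keeps exp() from overflowing on large inputs.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class scores (logits) for a 3-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs)       # each value lies strictly between 0 and 1
print(sum(probs))  # sums to 1 up to floating-point rounding
```

The largest score receives the largest probability, and the ordering of the scores is preserved.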

**Cross-Entropy Loss**
$$L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right)$$

The Softmax classifier minimizes the **cross-entropy** between the estimated class probabilities and the “true” distribution, which is the one-hot encoding of the target labels.

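Minimizing this loss is equivalent to minimizing the KL divergence between the two distributions, because the entropy of the one-hot "true" distribution is zero. The cross-entropy between a true distribution $p$ and an estimated distribution $q$ is defined as: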

$$H(p,q) = - \sum_x p(x) \log q(x)$$

Another interpretation is that the cross-entropy objective wants the predicted distribution to have all of its mass on the correct label.
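
The relationship between these quantities can be checked numerically. The sketch below assumes a one-hot target and uses illustrative helper names (`softmax`, `cross_entropy`, `kl_divergence`) that are not part of TF-Slim.

```python
import math

def softmax(scores):
    shift = max(scores)  # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); terms with p(x) == 0 contribute nothing."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

predicted = softmax([2.0, 1.0, 0.1])  # hypothetical logits
true_dist = [1.0, 0.0, 0.0]           # one-hot encoding of class 0

loss = cross_entropy(true_dist, predicted)
nll = -math.log(predicted[0])         # negative log-probability of the correct class
kl = kl_divergence(true_dist, predicted)
print(loss, nll, kl)                  # all three agree for a one-hot target
```

With a one-hot target, only the term for the correct class survives, which is why the cross-entropy reduces to $L_i$ above.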

**Probabilistic interpretation**
The softmax classifier gives the probability assigned to the correct label $y_i$ given the image $x_i$, parameterized by $W$.

$$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j} }$$

We are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing **Maximum Likelihood Estimation (MLE)**. With added L2 regularization (which equates to a Gaussian prior over the weight matrix $W$), we are instead performing the **Maximum a posteriori (MAP)** estimation.
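
To make the MLE versus MAP distinction concrete, here is a small sketch of the two objectives. The function names, the probabilities, and the weight values are all hypothetical, and the exact scaling of the L2 term (for example, an extra factor of 1/2) varies between libraries.

```python
import math

def negative_log_likelihood(probs, correct_class):
    """Data loss: -log of the probability assigned to the correct class."""
    return -math.log(probs[correct_class])

def l2_penalty(weights, weight_decay):
    """Regularization loss: weight_decay times the sum of squared weights."""
    return weight_decay * sum(w * w for row in weights for w in row)

probs = [0.7, 0.2, 0.1]            # hypothetical softmax output
W = [[0.5, -1.0], [2.0, 0.0]]      # hypothetical weight matrix

mle_objective = negative_log_likelihood(probs, correct_class=0)
reg_loss = l2_penalty(W, weight_decay=0.05)
map_objective = mle_objective + reg_loss  # data loss plus L2 penalty
print(mle_objective, reg_loss, map_objective)
```

The MAP objective is never smaller than the MLE objective, because the penalty term is non-negative.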

In [None]:
# input batch images
inputs = tf.placeholder(tf.float32, shape=(None,) + image_shape)

# target batch labels
targets = tf.placeholder(tf.int32, shape=(None,))

# Make the model.
logits, _ = CNN_model.graph(inputs, weight_decay=0.05, dropout=0.5)

# transform labels into one-hot-encoding
one_hot_labels = slim.one_hot_encoding(targets, num_class)

# Add the loss function to the graph.
loss = slim.losses.softmax_cross_entropy(logits, one_hot_labels)

# The total loss is the model's loss plus any regularization losses.
total_loss = slim.losses.get_total_loss()

## Binary Classification

The **sigmoid function**, or logistic function, outputs a single value between 0 and 1, independent of all other values. It is often used as an activation function, with saturation points at both extremes. It is also used for binary classification.

<img src="../images/sigmoid.jpg" width="200">

$$f_j(z)=\frac {1}{1+e^{-z_j}}$$

The **sigmoid function** can be thought of as a special case of softmax where the number of classes equals 2. It takes in a single output neuron, and the prediction is defined by an arbitrary threshold (often 0.5).
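
The claim that sigmoid is a two-class softmax can be verified with a short sketch: the sigmoid of a score $z$ equals the softmax probability of the first class when the two class scores are $z$ and $0$. The helper names and the example score are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    shift = max(scores)  # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.5                               # hypothetical single output logit
p_sigmoid = sigmoid(z)
p_softmax = softmax([z, 0.0])[0]      # two-class softmax with scores (z, 0)
print(p_sigmoid, p_softmax)           # the two values agree

# Turn the probability into a class decision with a 0.5 threshold.
prediction = 1 if p_sigmoid >= 0.5 else 0
```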

In [None]:
# input images
inputs = tf.placeholder(tf.float32, shape=(None,) + image_shape)

# target binary labels
binary_labels = tf.placeholder(tf.int32, shape=(None))

# Make the model (BinaryClassificationModel is a placeholder and is not implemented here)
logits, nodes = BinaryClassificationModel(inputs)


# Add the loss function to the graph.
loss = slim.losses.sigmoid_cross_entropy(logits, binary_labels)

# The total loss is the model's loss plus any regularization losses.
total_loss = slim.losses.get_total_loss()

## Multi-Label Classification
In multi-label classification, a sample can belong to multiple classes at the same time.

We can rephrase multi-label learning as the problem of finding a model that maps inputs x to binary vectors y, rather than scalar outputs as in the ordinary classification problem.
With this interpretation, the solution is to apply an independent sigmoid function for each label.

Sigmoid functions give a prediction for each class that is independent of all other classes. The outputs no longer form a single probability distribution over the classes, because their sum can range anywhere from 0 to K, where K is the number of classes.
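
A minimal sketch of this idea, assuming four hypothetical labels and made-up logits: each logit passes through its own sigmoid, each label is thresholded separately, and the loss is the average of the per-label binary cross-entropies.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(prob, label):
    """Log loss for a single label that is either 0 or 1."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

logits = [3.0, -2.0, 0.5, 4.0]   # hypothetical scores, one per label
true_labels = [1, 0, 0, 1]       # a sample can carry several labels at once

probs = [sigmoid(z) for z in logits]
predicted_labels = [1 if p >= 0.5 else 0 for p in probs]
loss = sum(binary_cross_entropy(p, y) for p, y in zip(probs, true_labels)) / len(logits)

print(predicted_labels)
print(sum(probs))  # not constrained to 1, unlike softmax
print(loss)
```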

In [None]:
# multi-label classification
loss = slim.losses.sigmoid_cross_entropy(logits, multi_class_labels) #[batch_size, num_classes]

## Compute Predictions
All the losses we've seen so far combine multiple steps into one to compute the loss score.
What if we are interested in the intermediate steps, such as the predicted probabilities and the actual label predictions?

In [None]:
# predicted class probabilities
predictions_probabilities = slim.softmax(logits)

# predicted class labels (index of the highest probability)
predictions = tf.argmax(predictions_probabilities, axis=1)

# log-loss computed element-wise on the probabilities; it also includes the
# (1 - label) term, so it is not identical to softmax cross-entropy
loss = slim.losses.log_loss(predictions_probabilities, one_hot_labels)

## Hinge Loss
**Hinge loss** is the loss used by the multiclass Support Vector Machine (SVM). The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin $\Delta$.

<img src="../images/hinge.png" width="400">

Given the score function $s_j = f(x_i, W)_j$, the multiclass SVM loss for the i-th example is formalized as follows:

$$L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$
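
The formula can be traced by hand with a small sketch. The function name `multiclass_svm_loss` and the scores are assumptions for illustration; this is a direct transcription of the equation above, not the TF-Slim implementation.

```python
def multiclass_svm_loss(scores, correct_class, delta=1.0):
    """Sum the margin violations of every incorrect class."""
    correct_score = scores[correct_class]
    return sum(
        max(0.0, s - correct_score + delta)
        for j, s in enumerate(scores)
        if j != correct_class
    )

# Hypothetical class scores where class 1 outscores the correct class 0.
scores = [3.2, 5.1, -1.7]
loss = multiclass_svm_loss(scores, correct_class=0)
print(loss)  # about 2.9: only class 1 violates the margin

# When the correct class wins by more than delta, the loss is zero.
no_violation = multiclass_svm_loss([10.0, 2.0, 3.0], correct_class=0)
print(no_violation)
```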


In [None]:
# Hinge loss; labels must have the same shape as logits, with values 0.0 or 1.0
hinge_loss = slim.losses.hinge_loss(logits, labels)

## Regression Loss
Not all predictive tasks involve classification with distinct labels. Sometimes we need to predict a continuous variable, for example the price of a house given a set of features, or how much snow will fall given weather data.

**Mean squared error (MSE)** measures the average of the squares of the errors or deviations. In the context of regression analysis, it measures the quality of an estimator—it is always non-negative, and values closer to zero are better.

$$\operatorname {MSE}={\frac  {1}{n}}\sum _{{i=1}}^{n}({\hat  {Y_{i}}}-Y_{i})^{2}$$
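
As a quick worked example of the formula, here is a pure-Python sketch. The function name and the house prices are made up for illustration.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

# Hypothetical house prices, in thousands.
predicted_prices = [250.0, 310.0, 180.0]
actual_prices = [240.0, 300.0, 200.0]

mse = mean_squared_error(predicted_prices, actual_prices)
print(mse)  # (100 + 100 + 400) / 3 = 200.0
```

Because the errors are squared, the single 20-unit miss contributes as much as four 10-unit misses would.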

In [None]:
# Mean squared error for regression tasks
loss = slim.losses.mean_squared_error(predictions, ground_truth)

## Multi-Losses for Multi-Task Model
More complex problems sometimes require minimizing multiple objectives, which can be both categorical and continuous. Models that do this are called **multi-task models**, and they produce multiple outputs.

TF-Slim makes it easy to combine multiple loss functions and optimize over all of them at once.

When you create a loss function via TF-Slim, TF-Slim adds the loss to a special TensorFlow collection of loss functions. This enables you to either manage the total loss manually, or allow TF-Slim to manage them for you.

In [None]:
# Define two loss functions
classification_loss = slim.losses.softmax_cross_entropy(categorical_predictions, categorical_labels)
regression_loss = slim.losses.mean_squared_error(continuous_predictions, continuous_labels)

# Compute the total loss of the model
total_loss = classification_loss + regression_loss

# Or, equivalently, use the built-in TF-Slim function (regularization losses excluded)
total_loss = slim.losses.get_total_loss(add_regularization_losses=False)

## Custom Loss Functions with Regularization Loss
TF-Slim also lets you define your own custom loss function.

This is useful when you want to tailor the objective to a specific task for which the common loss functions are not appropriate.

In [None]:
# Define two loss functions with a custom one
classification_loss = slim.losses.softmax_cross_entropy(predictions_1, labels_1)
custom_loss = MyCustomLossFunction(predictions_2, labels_2)

# Letting TF-Slim know about the additional loss.
slim.losses.add_loss(custom_loss) 

# Compute regularization loss
regularization_loss = tf.add_n(slim.losses.get_regularization_losses())

# get total model loss
total_loss = classification_loss + custom_loss + regularization_loss

# Or use the built-in TF-Slim function to compute the total loss with regularization.
total_loss = slim.losses.get_total_loss(add_regularization_losses=True)

## Other Functions Provided by the slim.losses Module

TF-Slim provides other useful loss functions and distance metrics that I will let you explore on your own. Below are a few examples:

```
slim.losses.sparse_softmax_cross_entropy()
slim.losses.mean_pairwise_squared_error()
slim.losses.cosine_distance()
slim.losses.absolute_difference()
slim.losses.compute_weighted_loss()
```

## Next Lesson
### Build Compact Training Routines in TF-Slim
-  Construct training routines and train your first deep neural network.

<img src="../images/divider.png" width="100">