<div><img style="float: right; width: 120px; vertical-align:middle" src="https://www.upm.es/sfs/Rectorado/Gabinete%20del%20Rector/Logos/EU_Informatica/ETSI%20SIST_INFORM_COLOR.png" alt="ETSISI logo" />


# Exploring Loss Functions in Deep Learning<a id="top"></a>

<i><small>Authors: Alberto Díaz Álvarez<br>Last update: 2023-04-15</small></i></div>
                                                  

***

## Introduction

One of the key components of training a deep learning model is defining a loss function that measures how well the model is performing on a particular task. The loss function is used to calculate the **error between the predicted output of the model and the true output**, and the **goal of training is** to **minimize this error over time**.

There are many different loss functions that can be used in deep learning, each with its own strengths and weaknesses depending on the type of problem being solved and the characteristics of the data. In this notebook, we will explore some of the most common loss functions used in deep learning and their properties.

## Goals

This notebook explores how loss functions are applied in neural networks, the main types of loss functions, how to create custom loss functions with TensorFlow, and practical use cases in computer vision (image and video training data), an area of particular focus and interest.

## Libraries and configuration

Next we will import the libraries that will be used throughout the notebook.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

***

## About the Loss and Cost functions

There is often confusion between the terms "loss function" and "cost function," which are used interchangeably. However, they are distinct concepts. The **loss** function computes the error for a **single training example**, while the **cost** function calculates the **average loss** over the entire **training dataset**.
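
As a minimal sketch of this distinction, the snippet below uses the squared error as the per-example loss and its average as the cost. The function names `squared_error` and `cost` and the sample values are illustrative assumptions, not part of any library.

```python
import numpy as np

def squared_error(y_true, y_pred):
    """Per-example loss: one value for each training example."""
    return (y_true - y_pred) ** 2

def cost(y_true, y_pred):
    """Cost: the average of the per-example losses over the dataset."""
    return np.mean(squared_error(y_true, y_pred))

# Illustrative targets and predictions (made up for this example)
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

per_example = squared_error(y_true, y_pred)  # array with one loss per example
total_cost = cost(y_true, y_pred)            # single scalar for the whole dataset

print(f'Loss per example = {per_example}')
print(f'Cost = {total_cost}')
```

The loss is an array with one entry per example, while the cost collapses it into the single number that an optimizer would minimize.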

A lower loss value indicates a model that fits the data better, whereas a higher value means that the model parameters must be adjusted to reduce the loss.
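
To illustrate how parameters are adjusted to reduce the loss, here is a small sketch of gradient descent on a one-parameter model $\hat{y} = w \cdot x$ with a squared-error cost. The data, the learning rate and the number of steps are assumptions chosen only for illustration.

```python
import numpy as np

# Toy data generated with a "true" weight of 2 (assumed for this example)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0                # initial parameter value
learning_rate = 0.05   # illustrative step size
history = []           # cost recorded at every step

for _ in range(50):
    y_pred = w * x
    loss = np.mean((y - y_pred) ** 2)        # cost for the current w
    history.append(loss)
    grad = np.mean(-2.0 * x * (y - y_pred))  # derivative of the cost w.r.t. w
    w -= learning_rate * grad                # move w against the gradient

print(f'Initial cost = {history[0]:.4f}')
print(f'Final cost   = {history[-1]:.4e}')
print(f'Learned w    = {w:.4f}')
```

Each update moves `w` in the direction that lowers the cost, so the recorded values shrink as `w` approaches the weight used to generate the data.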

Just as with metrics, the choice of loss function depends on the type of problem we are solving and on the characteristics of the dataset.

Commonly used loss functions include Mean Squared Error, Binary Cross-Entropy, and Categorical Cross-Entropy. As we have seen, the loss function is a measure of the model's performance and reflects how well the model fits the data. Therefore, optimizing the loss function plays a vital role in improving the model's accuracy.

We have already discussed some of these loss functions in the metrics section, as they are used to calculate the model's performance. By minimizing the loss, we can ensure that the model is learning and making accurate predictions.

## Regression tasks

In regression tasks, the loss function measures the **difference between the predicted values and the actual values of the target variable**. The goal is to minimize this difference, or the error, in order to improve the accuracy of the model.

The most commonly used loss functions in regression tasks are the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). In general, though, any error metric that tells us how large the prediction error is can be useful. The choice of loss function depends on the specific problem at hand and the characteristics of the data.
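
As a brief sketch, both losses can be written in a couple of lines of NumPy. The function names and the sample values below are illustrative assumptions.

```python
import numpy as np

def mean_absolute_error(y, y_pred):
    # Average of the absolute differences
    return np.mean(np.abs(y - y_pred))

def mean_squared_error(y, y_pred):
    # Average of the squared differences
    return np.mean((y - y_pred) ** 2)

# Illustrative values; the last prediction is far from its target
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([2.5, 4.0, 5.0, 12.0])

mae = mean_absolute_error(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)

print(f'MAE = {mae}')
print(f'MSE = {mse}')
```

Because MSE squares each error, the single large error on the last example dominates its value much more than it dominates the MAE.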

### Huber loss

Huber loss is a loss function that is commonly used in regression tasks, especially when there are outliers in the data. It is less sensitive to outliers than mean squared error (MSE) but more sensitive than mean absolute error (MAE).

| <img src="https://upload.wikimedia.org/wikipedia/commons/c/cc/Huber_loss.svg" alt="Huber loss" width="50%"> | 
|:--:| 
| *Huber loss (green, $\delta = 1$) vs. MSE (blue) as a function of $y-f(x)$. Source: Huber loss.svg, <https://commons.wikimedia.org/w/index.php?title=File:Huber_loss.svg&oldid=507900412> (last visited April 10).* |

The formula for Huber loss is:

$$
L_\delta(y, f(x)) =
    \begin{cases}
        \frac{1}{2} (y - f(x))^2 & \text{if } |y - f(x)| \leq \delta \\
        \delta \left(|y - f(x)| - \frac{\delta}{2}\right) & \text{if } |y - f(x)| > \delta
    \end{cases}
$$

Where $y$ is the true value, $f(x)$ is the predicted value, and $\delta$ is a hyperparameter that sets the error threshold at which the loss switches from quadratic to linear. The loss function can be implemented as follows:

In [None]:
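def huber_loss(y, ŷ, delta=1.0):
    abs_error = np.abs(y - ŷ)
    # Part of each error that falls in the quadratic region (capped at delta)
    quadratic = np.minimum(abs_error, delta)
    # Excess beyond delta, which is penalized linearly
    linear = (abs_error - quadratic) * delta
    return np.mean(0.5 * quadratic ** 2 + linear)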

y_test = np.array([1, 1, 0, 0, 1])
ŷ_test = np.array([0, 1, 0, 1, 1])
print(f'Huber loss = {huber_loss(y_test, ŷ_test)}')

Compared to MSE, Huber loss is less sensitive to outliers; compared to MAE, it is differentiable everywhere, including at zero error. Its disadvantage is that the hyperparameter $\delta$ must be tuned, which can be time-consuming, and a good value can be difficult to determine.

Additionally, Huber loss may not be the best choice for all regression tasks, and other loss functions may be more suitable depending on the problem and data.
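
The following sketch compares the per-example penalty of the three losses on a few error values, one of which plays the role of an outlier. The helper `huber`, the error values and the two values of `delta` are assumptions made for illustration; the squared error is halved so that it matches the quadratic branch of the Huber formula.

```python
import numpy as np

def huber(error, delta=1.0):
    """Per-example Huber loss for a given error y - f(x)."""
    abs_error = np.abs(error)
    return np.where(abs_error <= delta,
                    0.5 * error ** 2,                   # quadratic branch
                    delta * (abs_error - 0.5 * delta))  # linear branch

# Illustrative errors; the last one plays the role of an outlier
errors = np.array([0.5, 1.0, 10.0])

squared = 0.5 * errors ** 2          # halved squared error
absolute = np.abs(errors)            # absolute error
huber_d1 = huber(errors, delta=1.0)  # small threshold
huber_d5 = huber(errors, delta=5.0)  # larger threshold

for e, s, a, h1, h5 in zip(errors, squared, absolute, huber_d1, huber_d5):
    print(f'error = {e:5.1f} | squared = {s:6.3f} | absolute = {a:6.3f} '
          f'| Huber(1) = {h1:6.3f} | Huber(5) = {h5:6.3f}')
```

For small errors Huber matches the halved squared error, while for the outlier it grows linearly, like the absolute error. A larger `delta` widens the quadratic region and makes the loss behave more like MSE.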

## Classification tasks

Models for this class of tasks are probabilistic, i.e., they classify according to the predicted probability of belonging or not to a class. Our goal is therefore for the predicted probabilities to get as close as possible to the actual probabilities, which are 0 or 1 (in reality, a cat is 100% a cat).

### Cross-entropy

Without going into detail, cross-entropy measures how far the probability distribution of the prediction (e.g. $\hat{y} = (0.51, 0.18, 0.49, 0.72)$) is from that of the actual output (e.g. $y = (1, 0, 0, 1)$). Its advantages are that it is differentiable and that it gives information on how close to or far from the correct prediction we are.

| <img src="https://raw.githubusercontent.com/shruti-jadon/Data_Science_Images/main/cross_entropy.png" alt="Cross entropy loss" width="50%"> | 
|:--:| 
| *Cross entropy loss when predicting a positive (green) or negative (blue) value. Source: cross_entropy.png, <https://www.datasciencepreparation.com/blog/articles/what-is-cross-entropy-loss/> (last visited April 10).* |

This measure is usually used exclusively to calculate the _loss_ during training, since the information it gives us is not useful for measuring anything other than "I'm doing better" or "I'm doing worse".
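
As a small numeric sketch of the curve in the figure above, the penalty for a positive example is $-\log(p)$, where $p$ is the probability assigned to the true class. The probability values below are arbitrary and chosen only for illustration.

```python
import numpy as np

# Probability that the model assigns to the true class (illustrative values)
p_true_class = np.array([0.99, 0.9, 0.5, 0.1, 0.01])

# Cross-entropy penalty for a single positive example
penalty = -np.log(p_true_class)

for p, loss in zip(p_true_class, penalty):
    print(f'p = {p:.2f} -> loss = {loss:.3f}')
```

The penalty is close to zero when the model is confident and correct, and it grows without bound as the probability of the true class approaches zero.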

**Categorical cross-entropy (CCE)** is one of the cross-entropy implementations available in Keras. It works for both binary and multiclass classification and requires one output neuron per class. For binary classification it therefore needs two neurons, one for the class and one for the _non-class_. Its formula is as follows:

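$$
CCE = -\frac{1}{n} \sum_{i=1}^n\sum_{j=1}^m y_{ij} \cdot \log(p_{ij})
$$

Where $n$ is the number of examples, $m$ is the number of classes, $y_{ij}$ is 1 if example $i$ belongs to class $j$ (and 0 otherwise), and $p_{ij}$ is the predicted probability of that class.

A possible implementation in NumPy could be as follows: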

In [None]:
def categorical_crossentropy(y, ŷ):
    # Clip the predictions to avoid taking the log of 0
    𝜀 = 1e-7
    ŷ = np.clip(ŷ, 𝜀, 1 - 𝜀)

    # Sum over the classes of each example, then average over the examples
    loss = -np.sum(y * np.log(ŷ), axis=1)
    return np.mean(loss)

y_test = np.array([[0, 1],
                   [1, 0],])
ŷ_test = np.array([[0.1, 0.9],
                   [0.9, 0.1],])
print(f'Categorical crossentropy = {categorical_crossentropy(y_test, ŷ_test)}')
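
The sketch below illustrates why the predictions are clipped before taking the logarithm. The helper `cce` and its `eps` argument are assumptions made for this example: passing `eps=None` skips the clipping so that both behaviours can be compared.

```python
import numpy as np

def cce(y, y_pred, eps=None):
    """Categorical cross-entropy; clipping is applied only if eps is given."""
    if eps is not None:
        y_pred = np.clip(y_pred, eps, 1 - eps)
    return np.mean(-np.sum(y * np.log(y_pred), axis=1))

# One example whose true class receives a predicted probability of exactly 0
y_true = np.array([[0.0, 1.0]])
y_hat = np.array([[1.0, 0.0]])

# Without clipping, log(0) = -inf and the loss becomes infinite
with np.errstate(divide='ignore'):
    unclipped = cce(y_true, y_hat)

# With clipping, the loss is large but finite
clipped = cce(y_true, y_hat, eps=1e-7)

print(f'Without clipping = {unclipped}')
print(f'With clipping    = {clipped:.3f}')
```

An infinite loss would propagate through the mean and make the gradients unusable, which is why a small $\varepsilon$ is commonly used to keep the value finite.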

**Binary cross-entropy (BCE)** is another of the implementations available in Keras. In binary classification it behaves the same as CCE, with the advantage that it needs only one output neuron instead of two, so it is the preferred choice for this type of problem. Its formula is as follows:

$$
BCE = -\frac{1}{n} \sum_{i=1}^n \left( y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i}) \right)
$$

One possible implementation in NumPy could be the following:

In [None]:
def binary_crossentropy(y, ŷ):
    # Clip the predictions to avoid taking the log of 0 (at ŷ = 0 or ŷ = 1)
    𝜀 = 1e-7
    ŷ = np.clip(ŷ, 𝜀, 1 - 𝜀)

    loss = -(y * np.log(ŷ) + (1 - y) * np.log(1 - ŷ))
    return np.mean(loss)

y_test = np.array([1, 0])
ŷ_test = np.array([0.9, 0.1])
print(f'Binary crossentropy = {binary_crossentropy(y_test, ŷ_test)}')
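
As a quick check of the claim that both losses agree in the binary case, the sketch below expands single-output labels and predictions into the two-column form (_non-class_, class) and compares the results. The helpers `bce` and `cce` and the sample values are assumptions made for this self-contained example.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    # Binary cross-entropy with one output per example
    p = np.clip(p, eps, 1 - eps)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def cce(y_onehot, p, eps=1e-7):
    # Categorical cross-entropy with one output per class
    p = np.clip(p, eps, 1 - eps)
    return np.mean(-np.sum(y_onehot * np.log(p), axis=1))

# Single-output form (illustrative values)
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.3])

# Two-output form: column 0 is the non-class, column 1 is the class
y_two = np.stack([1 - y, y], axis=1)
p_two = np.stack([1 - p, p], axis=1)

bce_value = bce(y, p)
cce_value = cce(y_two, p_two)

print(f'BCE (one output)  = {bce_value:.6f}')
print(f'CCE (two outputs) = {cce_value:.6f}')
```

Both values coincide because, for each example, only the term of the true class survives in the CCE sum, which is exactly the term that BCE selects through $y_i$ and $1 - y_i$.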

## Conclusion

These are the loss functions we will use most often during this course. There are many more, of course, and we will name and discuss them as they come up. However, the objective of this notebook is to get to know these ones and to understand that different types of problems call for different types of loss.

***

<div><img style="float: right; width: 120px; vertical-align:top" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" alt="Creative Commons by-nc-sa logo" />

[Back to top](#top)

</div>