## Description
This notebook creates the visualizations for the [note on objective functions](https://www.notion.so/Objective-Functions-21cdb3d560ac4b2fb7755c37f441209b). Please refer to the note for the practical usage of objective functions in TensorFlow.

## Import packages, Environment Setting

In [1]:
import os
import numpy as np
import tensorflow as tf

from bokeh.io import output_notebook, export_png, reset_output
from bokeh.plotting import figure, show
from bokeh.models import Range1d

reset_output()
output_notebook()

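# Enable memory growth only when a GPU is present; indexing [0] fails on CPU-only machines.
physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)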

In [2]:
def visualize(name, X, y, ymin=None, ymax=None, fig=None, export=True, x_axis_label='y - ŷ', 
    classification=False, **kwargs):
    X = X.numpy()
    y = y.numpy()

    xmin, xmax = np.min(X), np.max(X)
    # Use the default axis limits only when the caller did not pass ymin/ymax.
    if ymin is None:
        ymin = 0
    if ymax is None:
        ymax = np.max(y) if classification else np.max(np.power(X, 2))
    if fig is None:
        fig = figure(title=name, tools=[], x_axis_label=x_axis_label, y_axis_label='Loss')
        fig.x_range=Range1d(xmin, xmax)
        fig.y_range=Range1d(ymin, ymax)

    fig.line(X, y, line_width=5, **kwargs)
    
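    if export:
        # Make sure the output directory exists before writing the PNG.
        os.makedirs('assets', exist_ok=True)
        file = os.path.join('assets', f'objective_functions_{name}.png'.lower().replace(' ', '_'))
        export_png(fig, filename=file)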
    
    return fig

In [3]:
X = tf.linspace(-10., 10., 100)

## Regression Objective Functions
Suppose $\mathbf{y} \in \mathbb{R}^m$ holds the actual values we want to predict, and $\mathbf{\hat{y}} \in \mathbb{R}^m$ is the prediction made by our model, i.e. $\mathbf{\hat{y}}=f(\mathbf{X}, \mathbf{w})$. We can use several objective functions to evaluate the model. This section covers the most commonly used ones: `MeanAbsoluteError`, `MeanSquaredError` and `Huber`.


### Mean Absolute Error (L1 Loss)
Equation

$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m|y_i - \hat{y}_i|$

Derivative

$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\mathbf{\hat{y}}}=[d_i]=\begin{cases}\frac{1}{m}\;&\text{if } \hat{y}_i > y_i\\-\frac{1}{m}&\text{otherwise}\end{cases}$

Properties
* Less sensitive to samples with a large residual between prediction and actual value, because the penalty grows only linearly with the residual.

In [4]:
y = tf.abs(X)
_ = visualize('Mean Absolute Error', X, y, **{'color': 'deepskyblue'})

![Mean Absolute Error](assets/objective_functions_mean_absolute_error.png)
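
The equation and derivative above can be checked numerically. The following is a minimal NumPy sketch rather than the TensorFlow implementation; `mae` and `mae_grad` are hypothetical helper names introduced here for illustration, and the sample values are made up.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, following the equation above."""
    return np.mean(np.abs(y_true - y_pred))

def mae_grad(y_true, y_pred):
    """Derivative w.r.t. y_pred: +1/m where y_pred > y_true, -1/m otherwise."""
    m = y_true.shape[0]
    return np.where(y_pred > y_true, 1.0, -1.0) / m

# Illustrative values (assumed, not taken from the notebook).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.0, 3.5, 0.0])

loss = mae(y_true, y_pred)
grad = mae_grad(y_true, y_pred)

# Central finite differences, one coordinate of y_pred at a time.
eps = 1e-6
numeric = np.array([
    (mae(y_true, y_pred + eps * e) - mae(y_true, y_pred - eps * e)) / (2 * eps)
    for e in np.eye(len(y_pred))
])

print(loss)     # mean of the absolute residuals 0.5, 1.0, 0.5, 4.0
print(grad)     # every entry has magnitude 1/m, regardless of the residual size
print(numeric)  # should agree with grad
```

Note that the gradient magnitude is the same for the sample with residual 4.0 as for the samples with residual 0.5, which is why large residuals do not dominate the update.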

### Mean Squared Error (L2 Loss)
Equation

$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m(y_i - \hat{y}_i)^2$

Derivative

$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\mathbf{\hat{y}}}=-\frac{2}{m}(\mathbf{y}-\mathbf{\hat{y}})$

Properties
* More sensitive to samples with a large residual between prediction and actual value, because the penalty grows quadratically with the residual.


In [5]:
y = tf.pow(X, 2)
_ = visualize('Mean Squared Error', X, y, **{'color': 'lightcoral'})

![Mean Squared Error](assets/objective_functions_mean_squared_error.png)
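
To make the difference in sensitivity concrete, the sketch below compares both losses on a small made-up data set with and without one large residual. The numbers are assumptions chosen for illustration only.

```python
import numpy as np

y_true = np.zeros(5)
y_pred_clean = np.array([1.0, -1.0, 1.0, -1.0, 1.0])     # all residuals have size 1
y_pred_outlier = np.array([1.0, -1.0, 1.0, -1.0, 10.0])  # one residual of size 10

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

mae_clean, mae_outlier = mae(y_true, y_pred_clean), mae(y_true, y_pred_outlier)
mse_clean, mse_outlier = mse(y_true, y_pred_clean), mse(y_true, y_pred_outlier)

# Factor by which a single outlier inflates each loss.
print(mae_outlier / mae_clean)
print(mse_outlier / mse_clean)
```

With these values the single outlier raises the MAE from 1.0 to 2.8 but the MSE from 1.0 to 20.8, so the MSE is dominated by that one sample.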

### Huber Loss
Equation

$\begin{align}
L(\mathbf{y}, \mathbf{\hat{y}})=\sum\limits_{i=1}^m\begin{cases}\frac{1}{2}(y_i-\hat{y}_i)^2\;\;\;&\text{if}\;|y_i-\hat{y}_i|\leq\delta\\\delta(|y_i-\hat{y}_i|-\frac{1}{2}\delta)&\text{otherwise}\end{cases}
\end{align}$

Derivative

$\begin{align}
\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial \mathbf{\hat{y}}}=[d_i]=\begin{cases}\hat{y}_i-y_i\;\;\;&\text{if}\;|y_i-\hat{y}_i|\leq\delta\\-\delta&\text{if}\;y_i-\hat{y}_i>\delta\\\delta&\text{otherwise}\end{cases}
\end{align}$

Properties
* The hyperparameter $\delta$ controls how strongly samples with a large residual are penalized: residuals up to $\delta$ are penalized quadratically (as in MSE), larger ones only linearly (as in MAE).

In [6]:
delta = 1
y = tf.where(tf.abs(X) <= delta, 0.5 * tf.pow(X, 2), delta * (tf.abs(X) - 0.5 * delta))
fig = visualize('Huber Loss', X, y, export=False, **{'color': 'thistle', 'legend_label': f'delta:{1:>3}'})
delta = 5
y = tf.where(tf.abs(X) <= delta, 0.5 * tf.pow(X, 2), delta * (tf.abs(X) - 0.5 * delta))
fig = visualize('Huber Loss', X, y, fig=fig, export=False, **{'color': 'plum', 'legend_label': f'delta:{5:>3}'})
delta = 10
y = tf.where(tf.abs(X) <= delta, 0.5 * tf.pow(X, 2), delta * (tf.abs(X) - 0.5 * delta))
fig = visualize('Huber Loss', X, y, fig=fig, **{'color': 'mediumorchid', 'legend_label': f'delta:{10:>3}'})

![Huber Loss](assets/objective_functions_huber_loss.png)
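
The piecewise definition and its derivative can be sketched in NumPy as follows. This is an illustrative sketch under the equations above, not the TensorFlow implementation; `huber` and `huber_grad` are hypothetical helper names, and the residuals are made-up values.

```python
import numpy as np

def huber(y_true, y_pred, delta):
    """Per-sample Huber loss: quadratic for |residual| <= delta, linear beyond."""
    r = y_true - y_pred
    abs_r = np.abs(r)
    return np.where(abs_r <= delta, 0.5 * r ** 2, delta * (abs_r - 0.5 * delta))

def huber_grad(y_true, y_pred, delta):
    """Per-sample derivative w.r.t. y_pred: the negated residual, clipped to [-delta, delta]."""
    return -np.clip(y_true - y_pred, -delta, delta)

# Illustrative residuals (y_pred = 0, so y_true equals the residual).
y_true = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 4.0])
y_pred = np.zeros_like(y_true)

losses = huber(y_true, y_pred, delta=1.0)
grads = huber_grad(y_true, y_pred, delta=1.0)

print(losses)
print(grads)
```

Writing the derivative as a clipped residual makes the role of $\delta$ visible: no single sample can contribute a gradient larger than $\delta$ in magnitude.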

## Classification Loss
### Cross Entropy
Let $y_{ij}=1$ denote that sample $i$ belongs to class $j$ and $y_{ij}=0$ that it does not belong to class $j$; $\hat{y}_{ij}$ denotes the predicted **probability** that sample $i$ belongs to class $j$.

Equation:

$H(\mathbf{y}, \mathbf{\hat{y}})=-\frac{1}{m}\sum\limits_{i=1}^m\sum\limits_{j=1}^Ky_{ij}\ln\hat{y}_{ij}$

Derivative:

$\frac{\partial H(\mathbf{y}, \mathbf{\hat{y}})}{\partial\mathbf{\hat{y}}}=[d_{ij}]=-\frac{1}{m}\frac{y_{ij}}{\hat{y}_{ij}}$

Properties:

- The inferred probability $\mathbf{\hat{y}}$ is more accurate compared to hinge loss, which does not produce probabilities directly.
- **Sigmoid** or **softmax** activation functions are preferred in the output layer.

Notes:
- [CategoricalCrossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/CategoricalCrossentropy): Used when `y_true` is a one-hot encoded vector.
- [SparseCategoricalCrossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy): Used when `y_true` holds integer class labels (label encoding).
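
The two label formats lead to the same loss value. The NumPy sketch below illustrates this with the equation above; it is not the Keras implementation, the function names are hypothetical, and the probabilities are made up.

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_prob):
    """Cross entropy with one-hot targets, following the equation above."""
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=1))

def sparse_categorical_cross_entropy(labels, y_prob):
    """Same loss with integer labels: pick the probability of the true class."""
    picked = y_prob[np.arange(labels.shape[0]), labels]
    return -np.mean(np.log(picked))

# Illustrative predicted probabilities for m = 3 samples and K = 3 classes.
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])
labels = np.array([0, 1, 2])   # label encoding
y_onehot = np.eye(3)[labels]   # one-hot encoding of the same labels

loss_onehot = categorical_cross_entropy(y_onehot, y_prob)
loss_sparse = sparse_categorical_cross_entropy(labels, y_prob)

print(loss_onehot, loss_sparse)
```

Because $y_{ij}$ is zero for every class except the true one, only the log-probability of the true class survives the inner sum, which is exactly what the integer-label version picks out.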

In [7]:
y_pred = tf.linspace(0.01, 0.99, 100) 
loss = -1 * tf.math.log(y_pred)
fig = visualize(
    'Cross Entropy', 
    y_pred, 
    loss,
    classification=True, 
    export=False,
    x_axis_label='ŷ (Class 0)', 
    **{'color': 'limegreen', 'legend_label': 'Truth (Class 0)'} 
)
loss = -1 * tf.math.log(1 - y_pred)
fig = visualize(
    'Cross Entropy', 
    y_pred, 
    loss, 
    fig=fig,
    classification=True,
    **{'color': 'palegreen', 'legend_label': 'Truth (Class 1)'} 
)

![Cross Entropy](assets/objective_functions_cross_entropy.png)

### Hinge Loss (Crammer and Singer)

Let $y_{ij}=1$ denote that sample $i$ belongs to class $j$ and $y_{ij}=0$ that it does not belong to class $j$; $\hat{y}_{ij}$ denotes the predicted **distance** (score) for assigning sample $i$ to class $j$. A positive value suggests higher confidence, while a negative value suggests lower confidence. The prediction can be an **unbounded continuous value**.

Equation:

$\text{neg}_i =\text{max}_{j}((1-y_{ij})\hat{y}_{ij})$

$\text{pos}_i=\sum\limits_{j=1}^Ky_{ij}\hat{y}_{ij}$

$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m\text{max}(0, 1 + \text{neg}_i - \text{pos}_i)$

Derivative:

$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial \mathbf{\hat{y}}}=[d_{ij}]=\frac{1}{m}\begin{cases}1-2y_{ij}\;&\text{if }{(1-y_{ij})\hat{y}_{ij}}=\text{neg}_i\text{ and neg}_i-\text{pos}_i>-1\\-y_{ij}\;&\text{if }{(1-y_{ij})\hat{y}_{ij}}\ne\text{neg}_i\text{ and neg}_i-\text{pos}_i>-1\\0&\text{otherwise}\end{cases}$

Properties:

- Data points that are far away from the decision boundary on the correct side do not contribute to the loss function.
- **Linear** or **hyperbolic tangent** activation functions are preferred in the output layer.
- Additional methods (e.g. **Platt scaling**) are needed to estimate the probability of each class.

In [8]:
y_pred = tf.linspace(-3., 3., 100) 
loss = tf.where(1. + y_pred > 0., 1. + y_pred, tf.zeros_like(y_pred))
fig = visualize(
    'Hinge Loss',
    y_pred,
    loss,
    x_axis_label='ŷ',
    classification=True,
    export=False,
    **{'color': 'blueviolet', 'legend_label': 'Truth (Class 0)'}
)
loss = tf.where(1. - y_pred > 0., 1. - y_pred, tf.zeros_like(y_pred))
fig = visualize(
    'Hinge Loss',
    y_pred,
    loss,
    fig=fig,
    classification=True,
    **{'color': 'plum', 'legend_label': 'Truth (Class 1)'}
)

![Hinge Loss](assets/objective_functions_hinge_loss.png)
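
The multi-class hinge loss defined above can be sketched in NumPy as follows. This is an illustrative sketch of the equations, not the Keras implementation; `categorical_hinge_per_sample` is a hypothetical helper name and the scores are made up.

```python
import numpy as np

def categorical_hinge_per_sample(y_onehot, y_score):
    """Per-sample hinge loss max(0, 1 + neg_i - pos_i), following the equations above."""
    neg = np.max((1.0 - y_onehot) * y_score, axis=1)  # largest masked score of a wrong class
    pos = np.sum(y_onehot * y_score, axis=1)          # score of the true class
    return np.maximum(0.0, 1.0 + neg - pos)

# Illustrative scores for m = 2 samples and K = 3 classes.
y_onehot = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])
y_score = np.array([
    [3.0, 0.5, -1.0],   # true class wins by a margin larger than 1
    [0.2, 0.4, -2.0],   # true class wins, but only by a margin of 0.2
])

per_sample = categorical_hinge_per_sample(y_onehot, y_score)
loss = np.mean(per_sample)

print(per_sample)
print(loss)
```

The first sample is classified correctly with a large margin and contributes nothing, while the second sample is also correct but too close to the boundary and is still penalized.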

## Why are MSE and MAE not suitable for classification
### Binary Classification with Probability

Let's first decompose the binary classification problem with **sigmoid** output:

$\mathbf{\hat{y}}=\sigma(\mathbf{\hat{x}}),\;\;\;\mathbf{\hat{x}}=\mathbf{X}\mathbf{w}+b$

Now, consider the gradient of mean squared error with respect to $\mathbf{\hat{y}}$:

$\nabla_{\mathbf{\hat{y}}}L(\mathbf{y}, \mathbf{\hat{y}})=\begin{bmatrix}\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_1} \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_2} \\\vdots \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_m}\end{bmatrix}=\begin{bmatrix}-\frac{2}{m}(y_1 - \hat{y}_1) \\-\frac{2}{m}(y_2 - \hat{y}_2) \\\vdots \\-\frac{2}{m}(y_m - \hat{y}_m)\end{bmatrix}=-\frac{2}{m}(\mathbf{y}-\mathbf{\hat{y}})$

And the Jacobian matrix of sigmoid function with respect to $\mathbf{\hat{x}}$ (check **Activation Functions** in [Neural Network Basics](https://www.notion.so/Neural-Network-Basics-6c71b218abc14bb89e2fa21f35066c54) ):

$\frac{\partial\mathbf{\hat{y}}}{\partial\mathbf{\mathbf{\hat{x}}}}=\begin{bmatrix}    \frac{\partial\hat{y}_1}{\partial\hat{x}_1} & \frac{\partial\hat{y}_1}{\partial\hat{x}_2} & \cdots & \frac{\partial\hat{y}_1}{\partial\hat{x}_m} \\    \frac{\partial\hat{y}_2}{\partial\hat{x}_1} & \frac{\partial\hat{y}_2}{\partial\hat{x}_2} & \cdots & \frac{\partial\hat{y}_2}{\partial\hat{x}_m} \\    \vdots & \vdots & \ddots & \vdots \\     \frac{\partial\hat{y}_m}{\partial\hat{x}_1} & \frac{\partial\hat{y}_m}{\partial\hat{x}_2} & \cdots & \frac{\partial\hat{y}_m}{\partial\hat{x}_m}\end{bmatrix}=\begin{bmatrix}\sigma(\hat{x}_1)(1-\sigma(\hat{x}_1)) & 0 & \cdots & 0 \\0 & \sigma(\hat{x}_2)(1-\sigma(\hat{x}_2))  & \cdots & 0 \\\vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma(\hat{x}_m)(1-\sigma(\hat{x}_m)) \\\end{bmatrix}=\operatorname{diag}\left(\sigma(\mathbf{\hat{x}})\odot(\mathbf{1}-\sigma(\mathbf{\hat{x}}))\right)$

And the Jacobian matrix of $\mathbf{\hat{x}}$ with respect to $\mathbf{w}$:

$\mathbf{\frac{\partial \hat{x}}{\partial w}}=\begin{bmatrix}    \frac{\partial\hat{x}_1}{\partial w_1} & \frac{\partial\hat{x}_1}{\partial w_2} & \cdots & \frac{\partial\hat{x}_1}{\partial w_n} \\    \frac{\partial\hat{x}_2}{\partial w_1} & \frac{\partial\hat{x}_2}{\partial w_2} & \cdots & \frac{\partial\hat{x}_2}{\partial w_n} \\    \vdots & \vdots & \ddots & \vdots \\     \frac{\partial\hat{x}_m}{\partial w_1} & \frac{\partial\hat{x}_m}{\partial w_2} & \cdots & \frac{\partial\hat{x}_m}{\partial w_n}\end{bmatrix}=\mathbf{X}$

We can then compute the gradient of $L(\mathbf{y}, \mathbf{\hat{y}})$ with respect to ${\mathbf{\hat{x}}}$:

$\nabla_{\mathbf{\hat{x}}}L(\mathbf{y}, \mathbf{\hat{y}})=\begin{bmatrix}\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_1} \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_2} \\\vdots \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_m} \\\end{bmatrix}=\begin{bmatrix}\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_1}\frac{\partial\hat{y}_1}{\partial\hat{x}_1} + \frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_2}\frac{\partial\hat{y}_2}{\partial\hat{x}_1}+\cdots+\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_m}\frac{\partial\hat{y}_m}{\partial\hat{x}_1}\\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_1}\frac{\partial\hat{y}_1}{\partial\hat{x}_2} + \frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_2}\frac{\partial\hat{y}_2}{\partial\hat{x}_2}+\cdots+\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_m}\frac{\partial\hat{y}_m}{\partial\hat{x}_2}\\\vdots \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_1}\frac{\partial\hat{y}_1}{\partial\hat{x}_m} + \frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_2}\frac{\partial\hat{y}_2}{\partial\hat{x}_m}+\cdots + \frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_m}\frac{\partial\hat{y}_m}{\partial\hat{x}_m}\\\end{bmatrix}=(\frac{\partial\mathbf{\hat{y}}}{\partial\mathbf{\mathbf{\hat{x}}}})^T\nabla_{\mathbf{\hat{y}}}L(\mathbf{y}, \mathbf{\hat{y}})$

And finally, the gradient of $L(\mathbf{y}, \mathbf{\hat{y}})$ with respect to ${\mathbf{w}}$:

$\nabla_{\mathbf{w}}L(\mathbf{y}, \mathbf{\hat{y}})=\begin{bmatrix}\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial w_1} \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial w_2} \\\vdots \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial w_n} \\\end{bmatrix}=\begin{bmatrix}\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_1}\frac{\partial\hat{x}_1}{\partial w_1} + \frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_2}\frac{\partial\hat{x}_2}{\partial w_1}+\cdots+\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_m}\frac{\partial\hat{x}_m}{\partial w_1}\\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_1}\frac{\partial\hat{x}_1}{\partial w_2} + \frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_2}\frac{\partial\hat{x}_2}{\partial w_2}+\cdots+\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_m}\frac{\partial\hat{x}_m}{\partial w_2}\\\vdots \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_1}\frac{\partial\hat{x}_1}{\partial w_n} + \frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_2}\frac{\partial\hat{x}_2}{\partial w_n}+\cdots + \frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{x}_m}\frac{\partial\hat{x}_m}{\partial w_n}\\\end{bmatrix}=(\mathbf{\frac{\partial \hat{x}}{\partial w}})^T(\frac{\partial\mathbf{\hat{y}}}{\partial\mathbf{\mathbf{\hat{x}}}})^T\nabla_{\mathbf{\hat{y}}}L(\mathbf{y}, \mathbf{\hat{y}})=-\frac{2}{m}\mathbf{X}^T\operatorname{diag}\left(\sigma(\mathbf{\hat{x}})\odot(\mathbf{1}-\sigma(\mathbf{\hat{x}}))\right)(\mathbf{y}-\mathbf{\hat{y}})$

We can see that $\nabla_{\mathbf{w}}L(\mathbf{y}, \mathbf{\hat{y}})\to0$ when $\sigma(\mathbf{\hat{x}})\to1$ or $\sigma(\mathbf{\hat{x}})\to0$, because every factor $\sigma(\hat{x}_i)(1-\sigma(\hat{x}_i))$ from the sigmoid Jacobian vanishes. In other words, whether or not the prediction is correct, the weights are barely updated once the model is confident about its prediction. Softmax and mean absolute error lead to similar results. Cross-entropy, on the other hand, does not have this problem: substituting its derivative for $\nabla_{\mathbf{\hat{y}}}L(\mathbf{y}, \mathbf{\hat{y}})$ in the final gradient cancels the factor $\sigma(\hat{x}_i)(1-\sigma(\hat{x}_i))$.
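
A small numerical sketch of this effect follows. It assumes the binary cross-entropy form $-\frac{1}{m}\sum_i[y_i\ln\hat{y}_i+(1-y_i)\ln(1-\hat{y}_i)]$, for which the gradient with respect to $\mathbf{w}$ simplifies to $\frac{1}{m}\mathbf{X}^T(\mathbf{\hat{y}}-\mathbf{y})$. The function names and the data are hypothetical and chosen only to produce a confidently wrong prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_grad_w(X, w, b, y):
    """Gradient of MSE w.r.t. w through a sigmoid output (chain rule as derived above)."""
    m = y.shape[0]
    y_hat = sigmoid(X @ w + b)
    return X.T @ (y_hat * (1.0 - y_hat) * (-2.0 / m) * (y - y_hat))

def bce_grad_w(X, w, b, y):
    """Gradient of binary cross-entropy w.r.t. w; the sigmoid derivative cancels."""
    m = y.shape[0]
    y_hat = sigmoid(X @ w + b)
    return X.T @ (y_hat - y) / m

# Two samples with true label 1, but the model confidently predicts ~0.
X = np.array([[1.0], [1.0]])
w = np.array([-10.0])
b = 0.0
y = np.array([1.0, 1.0])

g_mse = mse_grad_w(X, w, b, y)
g_bce = bce_grad_w(X, w, b, y)

print(g_mse)  # close to zero although the prediction is wrong
print(g_bce)  # close to -1, so the weight still receives a strong update
```

Even though both samples are misclassified, the MSE gradient is almost zero, so gradient descent would barely move `w`; the cross-entropy gradient keeps a magnitude close to 1.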

### Binary Classification with Distance

Suppose the activation function of the output layer is **linear** or **hyperbolic tangent**. In this case, $y_i=1$ for a sample labelled as class $0$ and $y_i=-1$ for a sample labelled as class $1$, and $\hat{y}_i=f(x_i)>0$ means the network classifies the sample as class $0$. The predictor makes a correct prediction if $y_i\hat{y}_i>0$. However, MSE and MAE increase again once $y_i\hat{y}_i>1$, so they penalize predictions that are both correct and confident. Therefore, MSE and MAE are not suitable in this setting either.

In [9]:
y_pred = tf.linspace(-3., 3., 500)
y_true = tf.ones(500)

loss = tf.abs(y_true - y_pred)
fig = visualize(
    'y·ŷ and loss',
    y_true * y_pred,
    loss,
    export=False,
    x_axis_label='y·ŷ',
    **{'color': 'deepskyblue', 'legend_label': 'Mean Absolute Error'}
)
loss = tf.pow(y_true - y_pred, 2)
fig = visualize(
    'Mean Squared Error',
    y_true * y_pred,
    loss,
    fig=fig,
    export=False,
    **{'color': 'lightcoral', 'legend_label': 'Mean Squared Error'}
)
# Sigmoid cross-entropy as a function of the margin y·ŷ, divided by ln(2) so that the loss equals 1 at y·ŷ = 0.
loss = tf.math.log(1. + tf.math.exp(-1. * y_pred * y_true)) / tf.math.log(2.)
fig = visualize(
    'Sigmoid with Cross Entropy',
    y_true * y_pred, loss,
    fig=fig,
    export=False,
    **{'color': 'limegreen', 'legend_label': 'Sigmoid with Cross Entropy', 'line_dash': 'dotted'}
)
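# The constant 0.025 shifts the hinge curve up slightly, presumably to keep it visible where it overlaps the other curves.
loss = tf.maximum(tf.zeros_like(y_pred * y_true), tf.ones_like(y_pred * y_true) - y_pred * y_true) + 0.025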
fig = visualize(
    'Hinge Loss',
    y_true * y_pred,
    loss,
    fig=fig,
    export=False,
    **{'color': 'blueviolet', 'legend_label': 'Hinge Loss', 'line_dash': 'dashed'}
)
loss = tf.where(y_true * y_pred < 0, tf.ones_like(y_true * y_pred), tf.zeros_like(y_true * y_pred)) 
fig = visualize(
    'Ideal Loss',
    y_true * y_pred,
    loss,
    fig=fig,
    **{'color': 'black', 'legend_label': 'Ideal Loss'}
)

![Ideal Loss](assets/objective_functions_ideal_loss.png)
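
The behaviour of the curves can also be checked at a few points. The sketch below evaluates the per-sample losses for a true label of $1$ at hand-picked predictions; the values are assumptions chosen for illustration.

```python
import numpy as np

y_true = 1.0
y_pred = np.array([-1.0, 0.5, 1.0, 3.0])
margin = y_true * y_pred                 # y·ŷ, positive means correctly classified

mse = (y_true - y_pred) ** 2
mae = np.abs(y_true - y_pred)
hinge = np.maximum(0.0, 1.0 - margin)

for row in zip(margin, mse, mae, hinge):
    print("margin=%5.1f  mse=%5.2f  mae=%5.2f  hinge=%5.2f" % row)
```

At a margin of 3 the prediction is correct and confident, yet MSE assigns it the same loss as the misclassified sample at a margin of -1, while the hinge loss is zero there.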