# Softmax function

In many cases, the output layer vector $z$ is transformed using the *softmax* function:

$$
\begin{aligned}
z &= (z_1, \ldots, z_k) \\
\operatorname{softmax}(z)_i &= \frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}}
\end{aligned}
$$

It takes a vector of arbitrary real-valued scores and squashes it into a vector of values between zero and one that sum to one. The result is a probability distribution over $k$ possible outcomes.
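As a minimal sketch of the formula above in plain Python: the function name `softmax` and the example scores are chosen here for illustration. Subtracting the maximum score before exponentiating is an addition not found in the formula; it leaves the result unchanged but avoids overflow for large scores.

```python
import math

def softmax(z):
    """Map a list of real-valued scores to probabilities that sum to one."""
    # Subtracting the maximum score does not change the result
    # (the common factor cancels in the ratio) but avoids overflow in exp.
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, -1.0]   # arbitrary example scores, including a negative one
probs = softmax(scores)
print(probs)
print(sum(probs))
```

Every output lies strictly between zero and one, and the ordering of the scores is preserved: the largest score receives the largest probability.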

The *softmax* output transformation is used when we are interested in modeling a probability
distribution over the possible output classes. To be effective, it should be used in
conjunction with a probabilistic training objective such as *cross-entropy*.

When using the softmax function, the network outputs $z$ (the scores fed into the softmax) are interpreted as the **unnormalized log probabilities** for each class, and the softmax output is the class probability:

$$
P(y=i|x) = \frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}}
$$

The softmax function first exponentiates each score, which gives unnormalized probabilities (all positive). It then divides each of them by their sum, so that the results add up to one.

The obvious question is why bother with the exponents. Why not use $P(y=i|x) = \frac{z_i}{\sum_{j=1}^k z_j}$ instead?

One reason is that the scores $z_i$ can be negative or can sum to zero, so this ratio is not guaranteed to be a valid probability, whereas $e^{z_i}$ is always positive. Another reason is that the softmax is normally used together with the cross-entropy loss. Intuitively, the log in the loss cancels out the exponent, which makes the gradient calculation much easier.
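To make the cancellation concrete, here is a short worked derivation. It assumes the cross-entropy loss $L = -\log P(y=t|x)$ for a single example whose correct class is $t$, and writes $y_i = 1$ if $i = t$ and $y_i = 0$ otherwise:

$$
\begin{aligned}
L &= -\log \frac{e^{z_t}}{\sum_{j=1}^k e^{z_j}} = -z_t + \log \sum_{j=1}^k e^{z_j} \\
\frac{\partial L}{\partial z_i} &= \frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}} - y_i = P(y=i|x) - y_i
\end{aligned}
$$

Under these assumptions, the gradient with respect to each score is simply the predicted probability minus the target.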



# Loss functions

The loss function $L(\hat y, y)$ assigns a numerical score (a scalar) to the network’s output $\hat y$ given the true expected output $y$, that is, the loss of predicting $\hat y$ when the true output is $y$. The training objective is then to minimize the loss across the different training examples. The loss is always non-negative, and should be zero only for cases where the network’s output is correct.

## Hinge (binary)

For binary classification problems, the network’s output is a single scalar $\hat y$ and the intended output $y$ is in $\{+1, -1\}$.

$$
L_{\text{hinge(binary)}}(\hat y, y) = \max(0, 1 - y \cdot \hat y)
$$

The loss is 0 when $y$ and $\hat y$ share the same sign and $\lvert \hat y \rvert \geq 1$. Otherwise, the loss grows linearly as $y \cdot \hat y$ decreases. In other words, the binary hinge loss attempts to achieve a correct classification, with a margin of at least 1.

In [25]:
def loss_hinge_binary(y_hat, y):
    return max(0.0, 1 - y_hat * y)

def print_loss_hinge_binary(y_hat, y):
    print('loss(%f, %f) = %f' % (y_hat, y, loss_hinge_binary(y_hat, y)))

print_loss_hinge_binary(2, 1)
print_loss_hinge_binary(1, 1)
print_loss_hinge_binary(0.8, 1)
print_loss_hinge_binary(0.5, 1)
print_loss_hinge_binary(0.0, 1)
print_loss_hinge_binary(-1, 1)

loss(2.000000, 1.000000) = 0.000000
loss(1.000000, 1.000000) = 0.000000
loss(0.800000, 1.000000) = 0.200000
loss(0.500000, 1.000000) = 0.500000
loss(0.000000, 1.000000) = 1.000000
loss(-1.000000, 1.000000) = 2.000000


## Hinge (multiclass)

This is the hinge loss extended to the multiclass setting.

Let $\hat y = (\hat y_1, \ldots, \hat y_n)$ be the network’s output vector, and $y$ be the one-hot vector for the correct output class.

Denote by $t = \arg\max_i y_i$ the correct class, and by $k = \arg\max_{i \ne t} \hat y_i$ the highest scoring class such that $k \ne t$.

The multiclass hinge loss is defined as:

$$
L_{\text{hinge(multiclass)}}(\hat y, y) = \max(0, 1 - (\hat y_t - \hat y_k))
$$

The multiclass hinge loss attempts to score the correct class above all other classes with a
margin of at least 1.

In [32]:
# Normally the loss function would take arguments y and y_hat.
# Since y is one-hot encoded, y_hat_t and y_hat_k are easy to look up,
# so this simplified version takes them directly.

def loss_hinge_multiclass(y_hat_t, y_hat_k):
    return max(0.0, 1.0 - (y_hat_t - y_hat_k))

def print_loss_hinge_multiclass(y_hat_t, y_hat_k):
    print('loss(y_hat_t=%f, y_hat_k=%f) = %f' % (y_hat_t, y_hat_k, loss_hinge_multiclass(y_hat_t, y_hat_k)))

print_loss_hinge_multiclass(3, 1)
print_loss_hinge_multiclass(3, 2)
print_loss_hinge_multiclass(3, 2.2)
print_loss_hinge_multiclass(3, 2.8)
print_loss_hinge_multiclass(3, 3)
print_loss_hinge_multiclass(3, 4)

loss(y_hat_t=3.000000, y_hat_k=1.000000) = 0.000000
loss(y_hat_t=3.000000, y_hat_k=2.000000) = 0.000000
loss(y_hat_t=3.000000, y_hat_k=2.200000) = 0.200000
loss(y_hat_t=3.000000, y_hat_k=2.800000) = 0.800000
loss(y_hat_t=3.000000, y_hat_k=3.000000) = 1.000000
loss(y_hat_t=3.000000, y_hat_k=4.000000) = 2.000000
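The comment in the cell above notes that the loss would normally take `y` and `y_hat`. A minimal sketch of that fuller version follows. The name `loss_hinge_multiclass_onehot` is hypothetical, and the sketch assumes `y` is a one-hot list and that there are at least two classes.

```python
def loss_hinge_multiclass_onehot(y_hat, y):
    """Multiclass hinge loss from a score vector y_hat and a one-hot vector y."""
    # t: index of the correct class (position of the 1 in y).
    t = max(range(len(y)), key=lambda i: y[i])
    # k: index of the highest-scoring class other than t.
    k = max((i for i in range(len(y_hat)) if i != t), key=lambda i: y_hat[i])
    return max(0.0, 1.0 - (y_hat[t] - y_hat[k]))

# The correct class is class 0 in all three examples.
print(loss_hinge_multiclass_onehot([3.0, 1.0, 0.5], [1, 0, 0]))  # margin of 2: no loss
print(loss_hinge_multiclass_onehot([3.0, 2.2, 1.0], [1, 0, 0]))  # margin of 0.8: small loss
print(loss_hinge_multiclass_onehot([3.0, 4.0, 1.0], [1, 0, 0]))  # wrong class wins: large loss
```

Only the runner-up class `k` enters the loss; the scores of the remaining classes do not matter as long as they stay below it.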


**Note 1 about hinge loss:** Both the binary and multiclass hinge losses are intended to be used with a linear output layer (as opposed to a softmax layer). The hinge losses are useful whenever we require a **hard decision rule**, and do not attempt to model class membership probability.

**Note 2 about hinge loss:** The hinge loss is sometimes called the *max-margin loss*.


## Categorical cross-entropy loss

The *categorical cross-entropy loss* (also referred to as *negative log likelihood*) is used when the network’s output is interpreted probabilistically.

Let $y = (y_1, \ldots, y_n)$ be a vector representing the true multinomial distribution over the labels, and let $\hat y = (\hat y_1, \ldots, \hat y_n)$ be the network’s output, which was transformed by the
*softmax* activation function and represents the class membership conditional distribution $\hat y_i = P(y=i \mid x)$.

The categorical cross-entropy loss measures the dissimilarity between the true label distribution $y$ and the predicted label distribution $\hat y$, and is defined as:

$$
L_{\text{cross-entropy}}(\hat y, y) = - \sum_i y_i \log(\hat y_i)
$$
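A minimal Python sketch of this definition follows. The function name `cross_entropy` and the example vectors are illustrative, and the sketch assumes `y_hat` is strictly positive wherever `y` has mass.

```python
import math

def cross_entropy(y_hat, y):
    """Categorical cross-entropy between a true distribution y and a prediction y_hat."""
    # Terms with y_i == 0 contribute nothing, so they are skipped;
    # this also avoids evaluating log(0) for classes with zero true mass.
    return -sum(y_i * math.log(y_hat_i)
                for y_i, y_hat_i in zip(y, y_hat) if y_i > 0)

y_hat = [0.7, 0.2, 0.1]                       # predicted distribution
print(cross_entropy(y_hat, [1.0, 0.0, 0.0]))  # all true mass on class 0
print(cross_entropy(y_hat, [0.5, 0.5, 0.0]))  # true mass split between classes 0 and 1
```

The loss is smaller when the prediction places more of its mass where the true distribution does.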

For hard classification problems in which each training example has a single correct class assignment, $y$ is a one-hot vector representing the true class. In such cases, the cross-entropy can be simplified to:

$$
L_{\text{cross-entropy}}(\hat y, y) = - \log(\hat y_t)
$$

where $t$ is the correct class assignment.
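The simplification can be checked numerically. In the sketch below, the name `cross_entropy_onehot` and the example values are illustrative; the full sum over all classes and the single-term shortcut give the same number when `y` is one-hot.

```python
import math

def cross_entropy_onehot(y_hat, t):
    """Cross-entropy when class t is the single correct class."""
    return -math.log(y_hat[t])

y_hat = [0.7, 0.2, 0.1]   # predicted distribution
y = [1, 0, 0]             # one-hot vector for the correct class
t = y.index(1)            # index of the correct class

# Full definition: sum over all classes (only the term for class t is non-zero).
full_sum = -sum(y_i * math.log(p) for y_i, p in zip(y, y_hat))
shortcut = cross_entropy_onehot(y_hat, t)
print(full_sum, shortcut)
```

Only the probability assigned to the correct class matters: the lower it is, the higher the loss.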

Because the scores $\hat y$ have been transformed using the *softmax* function and represent a conditional distribution, increasing the mass assigned to the correct class means decreasing the mass assigned to all the other classes. When using the cross-entropy loss, it is assumed that the network’s output is transformed using the *softmax* transformation.

**Note 1:** The input to a cross-entropy loss must be a probability distribution. It cannot be used with unnormalized scores: the log is not defined for negative numbers, and for scores > 1 the loss becomes negative.
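Both failure modes are easy to reproduce. The scores in this small illustration are made up and are deliberately not a probability distribution:

```python
import math

# Raw scores that are NOT a probability distribution.
scores = [3.0, -1.0, 0.5]

# A score greater than 1 gives a negative "loss".
print(-math.log(scores[0]))

# A negative score has no real logarithm; math.log raises ValueError.
try:
    math.log(scores[1])
except ValueError as err:
    print("ValueError:", err)
```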

**Note 2:** The combination of the softmax function and the cross-entropy loss is sometimes called *softmax loss*.
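As a sketch of how the two steps can be combined, the function below computes the loss directly from the raw scores $z$ and the index $t$ of the correct class. The name `softmax_cross_entropy` is hypothetical, and shifting by the maximum score is a common numerical-stability trick added here, not something stated in the text above.

```python
import math

def softmax_cross_entropy(z, t):
    """Cross-entropy of softmax(z) for correct class t, computed from raw scores."""
    # -log softmax(z)_t = -z_t + log(sum_j exp(z_j)).
    # Shifting by the maximum keeps exp from overflowing.
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(z_j - m) for z_j in z))
    return log_sum_exp - z[t]

z = [2.0, 1.0, -1.0]                  # raw scores, not yet normalized
print(softmax_cross_entropy(z, 0))    # correct class has the highest score
print(softmax_cross_entropy(z, 2))    # correct class has the lowest score
print(softmax_cross_entropy([1000.0, 0.0], 0))  # large scores do not overflow
```

Because the normalization happens inside the function, the input may be any real-valued scores, and the result is never negative.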
