The **loss function**, also called the **cost function**, is the algorithm that quantifies how wrong a model is. **Loss** is the measure of this metric. Since loss is the model's error, we ideally want it to be 0. We might wonder why we don't just calculate the error from the argmax accuracy. Remember our earlier example of confidences [0.22, 0.6, 0.18] vs [0.32, 0.36, 0.32]. If the correct class were indeed the middle one (index 1), the model's accuracy would be identical for both outputs. But are these two examples *really* as accurate as each other? No, they are not, because accuracy simply applies an argmax to the output to find the index of the largest value. The output of a neural network is actually a set of confidences, and more confidence in the correct answer is better. This is why we strive to increase confidence in the correct class and decrease misplaced confidence.
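
To make the difference concrete, here is a minimal sketch comparing the two confidence vectors above. It assumes the correct class is index 1 and uses the negative natural log of the correct-class confidence as the error measure, which is the loss introduced in the next section. The helper name `argmax` and the variable names are our own, not part of any library.

```python
import math

def argmax(values):
    """Return the index of the largest value (hypothetical helper)."""
    return max(range(len(values)), key=lambda i: values[i])

confident = [0.22, 0.6, 0.18]
hesitant = [0.32, 0.36, 0.32]
correct_class = 1  # assumption from the text: the middle class is correct

for output in (confident, hesitant):
    predicted = argmax(output)
    is_correct = predicted == correct_class
    # negative log of the confidence assigned to the correct class
    loss = -math.log(output[correct_class])
    print(output, "correct:", is_correct, "loss:", round(loss, 4))
```

Both outputs count as correct under argmax accuracy, yet the second one has a larger loss because it places less confidence in the correct class.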

#### Categorical Cross-Entropy Loss
In linear regression there is a loss function that is also applied in neural networks: squared error (or mean squared error with neural networks). But since we are not doing regression in our example for now, we need a different loss function. Our model has a softmax activation function on the output layer, which means it outputs a probability distribution. **Categorical cross-entropy** is explicitly used to compare a so-called "ground truth" probability distribution (y or "targets") with some predicted distribution (y-hat or "predictions"), so it makes sense to use cross-entropy in our case. It is also one of the most commonly used loss functions with a softmax activation on the output layer.  
  
The formula for calculating the categorical cross-entropy of y (actual/desired distribution) and y-hat (predicted distribution) is:  
$$L_i = -\sum_j y_{i,j} \log(\hat{y}_{i,j})$$  
where $L_i$ denotes the sample loss value, *i* is the i-th sample in the set, *j* is the label/output index, *y* denotes the target values and *y-hat* denotes the predicted values.
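
The sum over the output index *j* can be sketched for a single sample as follows. This is a minimal illustration, not a definitive implementation: the function name `categorical_cross_entropy` is our own, the input values are illustrative, and the code assumes every predicted value is strictly greater than 0.

```python
import math

def categorical_cross_entropy(targets, predictions):
    """Full per-sample loss: the negated sum over j of y_j * log(y_hat_j)."""
    return -sum(y * math.log(y_hat) for y, y_hat in zip(targets, predictions))

# illustrative target distribution and predicted distribution
loss = categorical_cross_entropy([1, 0, 0], [0.7, 0.1, 0.2])
print(loss)
```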

When we start coding, we simplify this further to *-log(correct_class_confidence)*; the formula for this is:  
$$L_i = -\log(\hat{y}_{i,k})$$  
where $L_i$ denotes the sample loss value, *i* is the i-th sample in the set, *k* is the index of the target (ground-truth) label and *y-hat* denotes the predicted values.  
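
As a rough sketch of the simplified form, the loss for each sample only needs the predicted confidence at the target index *k*. The predictions and class indices below are hypothetical values chosen for illustration, and the variable names are our own.

```python
import math

# hypothetical softmax outputs for three samples (each row sums to 1)
predictions = [
    [0.7, 0.1, 0.2],
    [0.1, 0.5, 0.4],
    [0.02, 0.9, 0.08],
]
# hypothetical ground-truth class index k for each sample
class_targets = [0, 1, 1]

# per-sample loss: negative log of the confidence at the target index k
sample_losses = [
    -math.log(sample[k]) for sample, k in zip(predictions, class_targets)
]
print(sample_losses)
```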
  
We may ask ourselves why we call this cross-entropy and not **log loss**, which is also a type of loss. In general, the log loss error function is what we apply to the output of a binary logistic regression model (ch 16): there are only two classes in the distribution, each of them applying to a single output (neuron) which is targeted as 0 or 1. In our case, we have a classification model that returns a probability distribution over all of the outputs. Cross-entropy compares two probability distributions. In our case, we have a softmax output of, let's say:

In [None]:
softmax_output = [0.7, 0.1, 0.2]

To which probability distribution do we want to compare this? We have 3 class confidences in the above output; let's assume that the desired prediction is the first class (index 0, which is currently 0.7). If that is the intended prediction, the desired probability distribution is [1, 0, 0]. The desired probabilities consist of a 1 in the desired class and a 0 in the remaining undesired classes. Arrays or vectors like this are called **one-hot**: one value is "hot" (on) with a value of 1, and the rest are "cold" (off) with values of 0. When comparing the model's results to a one-hot vector using cross-entropy, the other parts of the equation zero out and the target probability's log loss is multiplied by 1, making the cross-entropy calculation relatively simple. An example with:

In [None]:
softmax_output = [0.7, 0.1, 0.2]  
targets = [1, 0, 0]

We can do the following calculations:  
$$L_i = -(1 \cdot \log(0.7) + 0 \cdot \log(0.1) + 0 \cdot \log(0.2)) = -(-0.3566749 + 0 + 0) = 0.3566749$$  
  
Let's see the Python code version of this:

In [2]:
import math

# an example output of the output layer of the neural network
softmax_output = [0.7, 0.1, 0.2]  
# ground truth
target_output = [1, 0, 0]

loss = -(math.log(softmax_output[0])*target_output[0] +
         math.log(softmax_output[1])*target_output[1] +
         math.log(softmax_output[2])*target_output[2])

print(loss)

0.35667494393873245


That is the full categorical cross-entropy calculation; however, we can make a few assumptions given one-hot target vectors. The values of target_output[1] and target_output[2] are both 0, and anything multiplied by 0 is 0, so we don't need to calculate the terms for these indices. Next, what is target_output[0]? It is 1 in this case, so it can be omitted, as anything multiplied by 1 remains the same. The same output in this example can then be calculated with:

In [4]:
loss = -math.log(softmax_output[0])
print(loss)

0.35667494393873245


We can thus make some simple assumptions and use a more basic calculation, reducing it to the negative log of the target class confidence score.  
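
The following sketch checks that shortcut for every possible target class, not just index 0. It builds the one-hot target from a class index with a hypothetical helper, `one_hot`, and compares the full sum against the single negative log.

```python
import math

def one_hot(class_index, num_classes):
    """Build a one-hot list: 1 at class_index, 0 elsewhere (hypothetical helper)."""
    return [1 if j == class_index else 0 for j in range(num_classes)]

softmax_output = [0.7, 0.1, 0.2]

for k in range(len(softmax_output)):
    target = one_hot(k, len(softmax_output))
    # full sum over all classes
    full = -sum(y * math.log(p) for y, p in zip(target, softmax_output))
    # shortcut: only the target class term survives
    short = -math.log(softmax_output[k])
    print(target, full, short, math.isclose(full, short))
```

The shortcut relies on the target being one-hot; for a target distribution with several non-zero entries, the full sum is still needed.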
  
The **categorical cross-entropy loss** accounts for how confident the model is in the correct class and outputs a larger loss the lower that confidence is:

In [7]:
import math

print(math.log(1.))
print(math.log(.95))
print(math.log(.9))
print(math.log(.8))
print('...')
print(math.log(0.2))
print(math.log(.1))
print(math.log(.05))
print(math.log(.01))

0.0
-0.05129329438755058
-0.10536051565782628
-0.2231435513142097
...
-1.6094379124341003
-2.3025850929940455
-2.995732273553991
-4.605170185988091


From the log values printed above for a few example confidences, we can see that when the confidence level equals 1, meaning the model is 100% "sure" about its prediction, the loss value for this sample equals 0. The loss, which is the negative of these log values, rises as the confidence level approaches 0. You might also wonder why we did not print the result of log(0): it is undefined.
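
As a small sketch of both points, the cell below prints the loss itself (the negated log) for a few illustrative confidences and then shows what Python's `math.log` does with 0. The confidence values and variable names are our own choices for illustration.

```python
import math

# loss is the negated log, so it grows as confidence shrinks
confidences = (0.9, 0.5, 0.1, 0.01, 1e-7)
losses = [-math.log(confidence) for confidence in confidences]
for confidence, loss in zip(confidences, losses):
    print(confidence, loss)

# math.log raises ValueError for 0 instead of returning a number
try:
    math.log(0)
    log_zero_failed = False
except ValueError as error:
    log_zero_failed = True
    print("math.log(0) failed:", error)
```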

In [None]:
# continue from here