
## Log loss, cross-entropy loss or negative log probability

The cross-entropy loss for a single example:

$L = -y \cdot \log(\hat{y}) = -\sum_{k} y_k \log(\hat{y}_k)$

$y$: one-hot encoded true label

$\hat{y}$: predicted probability distribution, usually the output of a softmax layer

**Note:** Because $y$ is one-hot encoded, the loss depends only on the probability predicted for the true class. This is a key property of cross-entropy loss: it rewards or penalizes the probability of the correct class only, and its value is independent of how the remaining probability mass is split among the incorrect classes.

The loss is 0.0 when the predicted probability of the true class is 1.0, and it grows without bound as that probability approaches 0.0.

In [6]:
import numpy as np

def L(y, y_hat):
    '''
    Calculate cross entropy loss for a single sample
        y : one-hot encoded true label
        y_hat : predicted probability distribution
    '''
    return np.dot(-y, np.log(y_hat))

# one-hot encoded label
y = np.array([1, 0, 0, 0, 0])

# y_hat is a prediction result from a softmax layer

y_hat = np.array([0.96, 0.01, 0.01, 0.01, 0.01])
print(L(y, y_hat))

y_hat = np.array([0.01, 0.2475, 0.2475, 0.2475, 0.2475])
print(L(y, y_hat))


0.0408219945203
4.60517018599
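
The claim that only the true-class probability matters can be checked directly. The sketch below is illustrative only: the function name `cross_entropy` and the two example distributions are made up for this check, and both distributions put 0.6 on the true class while splitting the remaining 0.4 differently.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-entropy for one example (y one-hot, y_hat a probability vector)."""
    return -np.dot(y, np.log(y_hat))

y = np.array([1, 0, 0, 0, 0])

# Same probability (0.6) on the true class, different splits of the remaining 0.4.
even_split = np.array([0.6, 0.1, 0.1, 0.1, 0.1])
skewed_split = np.array([0.6, 0.37, 0.01, 0.01, 0.01])

loss_even = cross_entropy(y, even_split)
loss_skewed = cross_entropy(y, skewed_split)
print(loss_even, loss_skewed)  # both equal -log(0.6)

# The loss grows quickly as the true-class probability shrinks toward 0.
for p in [0.9, 0.5, 0.1, 0.001]:
    print(p, -np.log(p))
```

Note that the example avoids exact zeros in `y_hat`: with this formulation, `0 * log(0)` evaluates to `nan` in floating point.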



The cross-entropy loss for a batch is the average loss of all examples:

$J = -\frac{1}{N} \sum_{i=1}^N y_i \cdot \log(\hat{y}_i)$

In [7]:
y = np.array([
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],    
])

y_hat = np.array([
    [0.1, 0.5, 0.1, 0.1, 0.2],
    [0.05, 0.8, 0.05, 0.05, 0.05],
    [0.8, 0.05, 0.05, 0.05, 0.05],
    [0.9, 0.025, 0.025, 0.025, 0.025],
    [0.99, 0.0025, 0.0025, 0.0025, 0.0025]
])

np.mean(-(y * np.log(y_hat)).sum(axis=1))

# Note: this implementation is pretty inefficient. First the log of all values is calculated, then the multiplication
# with the one-hot encoded y acts like an indicator function that throws away most of the computation. Then all the
# sparse rows are reduced to a column vector by summing over mostly 0 values.


1.1273743538747145
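
One way to avoid the wasted work described in the comment is to store the labels as integer class indices and pick out only the true-class probabilities before taking the log. This is a minimal sketch, not a reference implementation: `batch_cross_entropy` and `labels` are names introduced here, and it assumes every row of `y_hat` is a valid probability distribution with a non-zero true-class entry.

```python
import numpy as np

def batch_cross_entropy(labels, y_hat):
    """Mean cross-entropy from integer labels, shape (N,), and probabilities, shape (N, C)."""
    n = y_hat.shape[0]
    # Integer array indexing selects one entry per row: the true-class probability.
    true_class_probs = y_hat[np.arange(n), labels]
    return -np.mean(np.log(true_class_probs))

# Class 0 is the true class for every example, as in the one-hot batch above.
labels = np.array([0, 0, 0, 0, 0])

y_hat = np.array([
    [0.1, 0.5, 0.1, 0.1, 0.2],
    [0.05, 0.8, 0.05, 0.05, 0.05],
    [0.8, 0.05, 0.05, 0.05, 0.05],
    [0.9, 0.025, 0.025, 0.025, 0.025],
    [0.99, 0.0025, 0.0025, 0.0025, 0.0025],
])

loss = batch_cross_entropy(labels, y_hat)
print(loss)
```

Only `N` logarithms are computed instead of `N * C`, and the result matches the one-hot version because the true-class probabilities are the same.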

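## Note about TensorFlow

TensorFlow has optimized and numerically stable functions that combine softmax or sigmoid with cross-entropy, such as `tf.nn.softmax_cross_entropy_with_logits` and `tf.nn.sigmoid_cross_entropy_with_logits`. A naive TF implementation of cross-entropy would look like this: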

    cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
    
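This example and further discussion come from [TensorFlow #2462](https://github.com/tensorflow/tensorflow/issues/2462). Here `y_` holds the one-hot labels and `y` the softmax output. There is a warning **not** to use this code because it is not numerically stable: if a predicted probability underflows to 0, `tf.log(y)` returns `-inf`, and the loss becomes infinite or `NaN`.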
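
To illustrate where the instability comes from, here is a NumPy sketch that computes the loss directly from the raw scores (logits) using the log-sum-exp trick. It is not TensorFlow's implementation; the function name `softmax_cross_entropy_from_logits` and the example logits are assumptions chosen to trigger the problem.

```python
import numpy as np

def softmax_cross_entropy_from_logits(labels, logits):
    """Mean cross-entropy from raw logits, shape (N, C), and integer labels, shape (N,)."""
    # Subtracting the row maximum leaves softmax unchanged and keeps exp() from overflowing.
    shifted = logits - logits.max(axis=1, keepdims=True)
    # log(softmax) = shifted - log(sum(exp(shifted))), so no log of a tiny probability is taken.
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    return -np.mean(log_probs[np.arange(n), labels])

# The first row has extreme logits, and its true class has the lowest score.
logits = np.array([[1000.0, 0.0, -1000.0],
                   [2.0, 1.0, 0.1]])
labels = np.array([2, 0])

# Naive route: softmax first, then log. exp(1000) overflows and exp(-1000) underflows to 0.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    naive = -np.mean(np.log(probs[np.arange(2), labels]))

stable = softmax_cross_entropy_from_logits(labels, logits)
print(naive)   # not finite
print(stable)  # finite
```

Working in log space is the reason the combined softmax and cross-entropy functions take logits rather than probabilities as input.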