<center>
<img src="https://i.ibb.co/b3T5hkz/logo.png" alt="logo" border="0" width=600>


---
## 02. Cross-Entropy Explained


Eduard Larrañaga (ealarranaga@unal.edu.co)

---

### Abstract

In this notebook we explain the cross-entropy function and its use as a loss function for a neural network.

---

## The Cross-Entropy

Cross-entropy is a function arising from the field of information theory. It is built upon the concept of entropy and is used to quantify the difference between two probability distributions for a given random variable or set of events.

To introduce this function, remember that *information* quantifies the number of bits required to encode and/or transmit an event. **Lower probability events have more information, higher probability events have less information**.

An important notion in information theory is the “surprise” of an event. An event is more surprising the less likely it is, meaning it contains more information. Hence:

- **Low probability Event (surprising): More information.**

- **Higher Probability Event (unsurprising): Less information.**

#### Quantifying Information

The information $h(x)$ associated with an event $x$ that occurs with probability $P(x)$ is defined as

\begin{equation}
h(x) = - \log [P(x)].
\end{equation}

From this definition it is easy to check that the information associated with an event $x_1$ with probability of occurrence $P(x_1) =1$ is $h(x_1) = -\log [1] = 0$, i.e. this event has no information associated.

On the other hand, the information associated with an event $x_0$ with a low probability of occurrence $P(x_0) \rightarrow 0$ is $h(x_0) = -\log [P(x_0)] \rightarrow \infty$, i.e. this event has a large amount of information associated.
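
As a quick numerical check of the two limits above, the following sketch evaluates $h(x)$ in bits for a few probabilities. The helper name `information` and the sample probabilities are chosen here only for illustration.

```python
import numpy as np

def information(p):
    """Information (in bits) of an event with probability p."""
    # log2(1/p) is the same quantity as -log2(p)
    return np.log2(1.0 / p)

# Illustrative probabilities, from a certain event to a rare one
for p in [1.0, 0.5, 0.1, 0.001]:
    print(f'P(x) = {p:<6} ->  h(x) = {information(p):.2f} bits')
```

The certain event carries no information, and the information grows without bound as the probability approaches zero.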

#### Entropy 

Entropy is defined as the average number of bits required to transmit a randomly selected event from a probability distribution.

A skewed distribution has a low entropy (low information associated), whereas a distribution where events have equal probability has a larger entropy (high information associated). This fact can be understood by noting that a skewed probability distribution has less “surprise”, and in turn a low entropy, because likely events dominate. On the other hand, a balanced distribution is more surprising, and in turn has a higher entropy, because events are equally likely.

- **Skewed Probability Distribution (unsurprising): Low entropy.**

- **Balanced Probability Distribution (surprising): High entropy.**

Mathematically, entropy can be calculated for a set $X$ of discrete states $x$, each with a probability of occurrence $P(x)$, as

\begin{equation}
S[P(X)] = -\sum_{x \in X} P(x) \log [P(x)].
\end{equation}

Here, $\log$ is the base-2 logarithm, so the result is in **bits**. If the natural (base-e) logarithm is used instead, the result is in units called **nats**.
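
The statements about skewed and balanced distributions, and about bits versus nats, can be checked with a short sketch. The helper `entropy_bits` and the two example distributions are assumptions made here for illustration; the helper expects strictly positive probabilities.

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a discrete distribution with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

skewed = [0.9, 0.05, 0.05]      # one event dominates
balanced = [1/3, 1/3, 1/3]      # all events equally likely

print(f'Skewed:   {entropy_bits(skewed):.2f} bits')
print(f'Balanced: {entropy_bits(balanced):.2f} bits')

# Changing the base of the logarithm only rescales the result:
# S [nats] = S [bits] * ln(2)
print(f'Balanced: {entropy_bits(balanced) * np.log(2):.2f} nats')
```

For three equally likely events the entropy is $\log_2 3$ bits, which is the largest value a three-event distribution can reach.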

---



In order to understand this definition, consider a set of 3 discrete events $X = [x_1, x_2, x_3]$ with probabilities of occurrence $P(X) = [P(x_1), P(x_2), P(x_3)] = [0, 1, 0]$, i.e. event $x_2$ has 100% probability of occurrence while $x_1$ and $x_3$ have 0% probability. Then, the entropy associated with this set is

\begin{align}
S[P(X)] = &-\sum_{x \in X} P(x) \log [P(x)] \\
S[P(X)]= &- P(x_1) \log [P(x_1)] - P(x_2) \log [P(x_2)] - P(x_3) \log [P(x_3)] \\
S[P(X)] = &-\log [P(x_2)] \\
S[P(X)] = &-\log [1] = 0,
\end{align}

i.e. this set has zero entropy (the result is completely determined). The terms with $P(x) = 0$ drop out under the usual convention $0 \log 0 = 0$.



```python
import numpy as np

def entropy(p):
    return -sum(p*np.log2(p))


# Probability Distribution
P = np.array([0., 1., 0.]) + 1.e-16 # we add this small quantity to avoid the divergence of the logarithm!

print(f'The entropy is {entropy(P):.2f}')
```


The entropy is 0.00
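
Adding a small constant works, but it slightly perturbs the distribution. An alternative, sketched below, is to skip the zero-probability events altogether, which implements the convention $0 \log 0 = 0$ directly. The name `entropy_safe` is a hypothetical helper introduced here, not part of the original notebook.

```python
import numpy as np

def entropy_safe(p):
    """Entropy in bits, using the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0                      # keep only events that can occur
    # log2(1/p) is the same as -log2(p) and avoids printing a negative zero
    return np.sum(p[nonzero] * np.log2(1.0 / p[nonzero]))

# Probability Distribution (no small constant needed)
P = np.array([0., 1., 0.])

print(f'The entropy is {entropy_safe(P):.2f}')
```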



Now consider a set of 3 discrete events $X = [x_1, x_2, x_3]$ with probabilities of occurrence $P(X) = [P(x_1), P(x_2), P(x_3)] = [0.5, 0.3, 0.2]$. The entropy associated with this probability distribution is


```python
# Probability Distribution
P = np.array([0.5, 0.3, 0.2])

print(f'The entropy is {entropy(P):.2f}')
```


The entropy is 1.49


Clearly, the entropy is not zero because the probability distribution does not determine any result completely.

#### The Cross-Entropy

The definition of entropy for a probability distribution given above can be generalized to the concept of **cross-entropy**, which measures the average number of bits required to represent or transmit an event from one distribution when the encoding is based on another distribution.

Consider a **target distribution** or underlying probability distribution $P$ and an **approximation of the target distribution** $Q$. 

**The cross-entropy of $Q$ relative to $P$ is the average number of bits needed to represent an event drawn from $P$ when the encoding is optimized for $Q$ instead of $P$.**

The cross-entropy is defined mathematically as

\begin{equation}
H(P, Q) = - \sum_{x \in X} P(x) \log [Q(x)],
\end{equation}

where $P(x)$ is the probability of the event $x$ in $P$, $Q(x)$ is the probability of event $x$ in $Q$ and $\log$ is the base-2 logarithm, meaning that the results are in bits  (If the base-e or natural logarithm is used instead, the result will have the units called nats).



```python
def crossentropy(p,q):
    return -sum(p*np.log2(q))
```

In order to understand this definition, consider a set of 3 discrete events $X = [x_1, x_2, x_3]$. Suppose that the target probability distribution is

\begin{equation}
P(X) = [P(x_1), P(x_2), P(x_3)] = [0, 1, 0]
\end{equation}

i.e. event $x_2$ has 100% probability of occurrence.
Now consider an approximated probability distribution of
\begin{equation}
Q(X) = [Q(x_1), Q(x_2), Q(x_3)] = [0.6, 0.2, 0.2].
\end{equation}

The cross-entropy for these two probability distributions is

```python
# Target Probability Distribution
P = np.array([0., 1., 0.])

# Approximate Probability Distribution
Q = np.array([0.6, 0.2, 0.2])

print(f'The cross-entropy is {crossentropy(P,Q):.2f}')
```


The cross-entropy is 2.32


Now consider a better approximation of the target distribution,

\begin{equation}
Q(X) = [Q(x_1), Q(x_2), Q(x_3)] = [0.3, 0.5, 0.2].
\end{equation}

In this case, the cross-entropy  is

```python
# Target Probability Distribution
P = np.array([0., 1., 0.])

# Approximate Probability Distribution
Q = np.array([0.3, 0.5, 0.2])

print(f'The cross-entropy is {crossentropy(P,Q):.2f}')
```


The cross-entropy is 1.00


Note that this lower value indicates that this approximate distribution is a better representation of the target distribution than the previous one.

Finally consider the approximate distribution

\begin{equation}
Q(X) = [Q(x_1), Q(x_2), Q(x_3)] = [0.1, 0.8, 0.1],
\end{equation}

which gives

```python
# Target Probability Distribution
P = np.array([0., 1., 0.])

# Approximate Probability Distribution
Q = np.array([0.1, 0.8, 0.1])

print(f'The cross-entropy is {crossentropy(P,Q):.2f}')
```


The cross-entropy is 0.32
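
The three examples show the cross-entropy decreasing as $Q$ approaches $P$. A standard result that is not derived in this notebook makes this precise: $H(P, Q) = S[P] + D_{KL}(P \| Q)$, where $D_{KL}$ is the Kullback-Leibler divergence, so the cross-entropy is smallest when $Q = P$ and then equals the entropy of $P$. The sketch below checks this numerically; the helper names and the example distributions are assumptions made here for illustration.

```python
import numpy as np

def cross_entropy_bits(p, q):
    """Cross-entropy H(P, Q) in bits; terms with P(x) = 0 are skipped."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_divergence_bits(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

P = np.array([0.5, 0.3, 0.2])   # target distribution
Q = np.array([0.3, 0.5, 0.2])   # approximate distribution

entropy_P = cross_entropy_bits(P, P)   # H(P, P) is the entropy of P
print(f'H(P, Q)           = {cross_entropy_bits(P, Q):.4f}')
print(f'S[P] + D_KL(P||Q) = {entropy_P + kl_divergence_bits(P, Q):.4f}')
print(f'H(P, P) = S[P]    = {entropy_P:.4f}')
```

The first two printed values agree, and both are larger than the third, since the divergence is positive whenever $Q$ differs from $P$.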


## The CategoricalCrossentropy and the SparseCategoricalCrossentropy loss functions in `Keras` 

The [CategoricalCrossentropy](https://keras.io/api/losses/probabilistic_losses/#categoricalcrossentropy-class) and the [SparseCategoricalCrossentropy](https://keras.io/api/losses/probabilistic_losses/#sparsecategoricalcrossentropy-class) loss functions are used to measure the  cost of a classification model.

In order to use these functions, the targets must be represented with a suitable encoding. For example, if one has some categorical targets, they are first represented as integer values:

- TargetA ---> 0 
- TargetB ---> 1
- TargetC ---> 2
- TargetD ---> 3
...

Under this encoding, we can use the **'sparse_categorical_crossentropy'** function which is defined as

\begin{equation}
S(w) = -\sum_{i=1}^N y_i\log (y_i^p) ,
\end{equation}

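where $N$ is the number of classes, $y_i$ is 1 if $i$ is the true class and 0 otherwise, $y_i^p$ is the probability predicted by the model for class $i$, and $w$ represents the parameters to be adjusted in the optimization procedure. With integer targets, the sum reduces to the single term $-\log (y_c^p)$, where $c$ is the integer label of the true class.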



Another representation is obtained by using the **one-hot encoding**, which is based on binary vectors. In this case, each integer assigned to a categorical target is represented as a binary vector whose entries are all zero except at the index given by the integer, which is marked with a 1. For example:

- TargetA ---> 0  ---> [1 0 0 0]
- TargetB ---> 1  ---> [0 1 0 0]
- TargetC ---> 2  ---> [0 0 1 0]
- TargetD ---> 3  ---> [0 0 0 1]
...
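
The mapping from integer labels to one-hot vectors can be sketched with NumPy alone. The variable names below are chosen for illustration; Keras also offers `keras.utils.to_categorical` for the same purpose, which is not used here in order to keep the example free of extra dependencies.

```python
import numpy as np

labels = np.array([0, 1, 2, 3])        # integer-encoded targets A, B, C, D
num_classes = 4

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype=int)[labels]
print(one_hot)

# The integer labels are recovered as the position of the 1 in each row
print(one_hot.argmax(axis=1))
```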

Under this encoding, we can use the **'categorical_crossentropy'** function, which is defined by the same expression as before,

\begin{equation}
S(w) = -\sum_{i=1}^N y_i \log (y_i^p) 
\end{equation}
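
To see that the two losses give the same value for the same data, the following NumPy sketch evaluates the expression above for both encodings. It is not the Keras implementation: the function names and the predicted probabilities are hypothetical, and the natural logarithm is assumed, so the values are in nats.

```python
import numpy as np

def categorical_ce(y_onehot, y_pred):
    """Cross-entropy per sample for one-hot targets (natural logarithm)."""
    return -np.sum(y_onehot * np.log(y_pred), axis=1)

def sparse_categorical_ce(y_int, y_pred):
    """Cross-entropy per sample for integer targets (natural logarithm)."""
    rows = np.arange(len(y_int))
    return -np.log(y_pred[rows, y_int])   # pick the true-class probability

# Hypothetical predicted probabilities for 2 samples and 3 classes
y_pred = np.array([[0.6, 0.2, 0.2],
                   [0.1, 0.8, 0.1]])
y_int = np.array([1, 1])                  # both samples belong to class 1
y_onehot = np.eye(3)[y_int]               # same targets, one-hot encoded

print(categorical_ce(y_onehot, y_pred))
print(sparse_categorical_ce(y_int, y_pred))
```

Both calls print the same two numbers, $-\ln 0.2$ and $-\ln 0.8$. The choice between the two losses therefore depends only on how the targets are stored, with integer targets avoiding the memory cost of one-hot vectors when there are many classes.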
