## Statistical Differences in Machine Learning

Information theory is used extensively in machine learning. Two of the most popular examples are cross-entropy and KL Divergence. This notebook introduces these two concepts and attempts to explain when each should be used.

In [1]:
%matplotlib inline

from math import log2

## KL Divergence

[Kullback-Leibler (KL) Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence), also known as relative entropy, is a statistical distance: a measure of how one probability distribution $Q$ (the model) differs from a second distribution $P$ (the data). Note: although KL Divergence is a statistical distance, it is not a metric.

The Kullback–Leibler divergence can be interpreted as the average number of extra bits required to encode samples from $P$ using a code optimized for $Q$ rather than one optimized for $P$. For more information, see this article from [machine learning mastery](https://machinelearningmastery.com/divergence-between-probability-distributions/), which this tutorial is based on.

KL Divergence follows the equation:

$$
\text{KL}(P || Q) = -\sum_{x \in X} P(x) \log \left(\frac{Q(x)}{P(x)}\right)
$$

Each term of the sum, $-P(x) \log \left(\frac{Q(x)}{P(x)}\right)$, is the contribution of the event $x$ to the divergence. Because $-\log(a/b) = \log(b/a)$, the leading negative sign can be removed by inverting the ratio inside the logarithm, which gives the more common form:

$$
\text{KL}(P || Q) = \sum_{x \in X} P(x) \log \left(\frac{P(x)}{Q(x)}\right)
$$

With a base-2 logarithm, as used in the code in this notebook, the result is measured in bits.

The intuition for the KL divergence score is that when the probability of an event under $P$ is large but the probability of the same event under $Q$ is small, there is a large divergence. When the probability under $P$ is small and the probability under $Q$ is large, there is also a divergence, but it counts for less than in the first case because each term is weighted by $P(x)$.

**IMPORTANT**: KL Divergence is not symmetric. In general, $\text{KL}(P || Q) \neq \text{KL}(Q || P)$, which is one reason it cannot be considered a metric.

In [2]:
# define distributions
events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

In [3]:
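# calculate the kl divergence KL(P || Q) in bits (base-2 logarithm)
# assumes p and q have the same length and every probability is strictly positive
def kl_divergence(p, q):
	return sum(p[i] * log2(p[i]/q[i]) for i in range(len(p)))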

# calculate (P || Q)
kl_pq = kl_divergence(p, q)
print('KL(P || Q): %.3f bits' % kl_pq)
# calculate (Q || P)
kl_qp = kl_divergence(q, p)
print('KL(Q || P): %.3f bits' % kl_qp)

KL(P || Q): 1.927 bits
KL(Q || P): 2.022 bits
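
To make the intuition above more concrete, the sum can be split into its per-event terms. The following is a minimal sketch that reuses the example distributions from this notebook; `kl_terms` is a helper name introduced here for illustration and is not part of the original tutorial.

```python
from math import log2

# Same example distributions as in the cells above.
events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

def kl_terms(p, q):
    """Return the per-event terms P(x) * log2(P(x) / Q(x)), in bits.

    Assumes every probability in p and q is strictly positive.
    """
    return [pi * log2(pi / qi) for pi, qi in zip(p, q)]

terms = kl_terms(p, q)
for event, term in zip(events, terms):
    print('%s: %+.3f bits' % (event, term))
print('total: %.3f bits' % sum(terms))
```

In this example, 'blue' (likely under P, unlikely under Q) contributes the most, while 'red' (unlikely under P, likely under Q) contributes a small negative term. Individual terms can be negative even though the total is never negative.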


## Cross-Entropy

Cross-entropy is a very common machine learning loss function, used extensively in classification problems. It is an extension of the concept of entropy and is closely related to KL Divergence. However, where KL Divergence measures the relative entropy between two distributions (the extra bits), cross-entropy measures the total number of bits needed to encode events from one distribution using a code optimized for the other.

Note: While cross-entropy is often used interchangeably with logistic loss in machine learning, the two are derived from different sources.

Cross entropy follows the equation:

$$
H(P,Q) = -\sum_{x \in X} P(x) \log(Q(x))
$$
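
Since cross-entropy is described above as a classification loss, a small sketch may help show how the equation behaves in that setting. The class probabilities below are hypothetical values chosen for illustration, and `cross_entropy_bits` is a name introduced here. The result is in bits (base-2 logarithm) to match the rest of this notebook, whereas machine learning libraries typically use the natural logarithm.

```python
from math import log2

def cross_entropy_bits(p, q):
    """Cross-entropy H(P, Q) in bits.

    Terms with P(x) = 0 are skipped because they contribute nothing to the sum.
    Assumes Q(x) > 0 wherever P(x) > 0.
    """
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 3-class problem: the true class is index 1 (one-hot target).
target = [0.0, 1.0, 0.0]

# Hypothetical model predictions of varying quality.
predictions = {
    'confident and right': [0.05, 0.90, 0.05],
    'unsure': [0.30, 0.40, 0.30],
    'confident and wrong': [0.90, 0.05, 0.05],
}

losses = {name: cross_entropy_bits(target, pred) for name, pred in predictions.items()}
for name, loss in losses.items():
    print('%s: %.3f bits' % (name, loss))
```

With a one-hot target the sum collapses to a single term, $-\log Q(x_{true})$, so the loss grows as the model assigns less probability to the correct class.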

## Cross-Entropy vs KL Divergence

While the two are very similar, the difference between cross-entropy and KL Divergence comes down to what is being counted:
- Cross-Entropy: Average total number of bits needed to represent an event from $P$ when using a code optimized for $Q$ instead of $P$.
- Relative Entropy (KL Divergence): Average number of extra bits needed to represent an event from $P$ when using a code optimized for $Q$ instead of $P$.

In other words, cross-entropy is the total number of bits, so it can be computed as the entropy of $P$, written $H(P)$, plus the extra bits given by the KL Divergence:
$$
H(P, Q) = H(P) + \text{KL}(P || Q)
$$

**IMPORTANT**: Like KL Divergence, cross-entropy is not symmetric. In general, $H(P,Q) \neq H(Q,P)$.

In [4]:
def cross_entropy(p, q):
	return -sum([p[i]*log2(q[i]) for i in range(len(p))])

# calculate cross entropy H(P, Q)
ce_pq = cross_entropy(p, q)
print('H(P, Q): %.3f bits' % ce_pq)
# calculate cross entropy H(Q, P)
ce_qp = cross_entropy(q, p)
print('H(Q, P): %.3f bits' % ce_qp)

H(P, Q): 3.288 bits
H(Q, P): 2.906 bits
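
The relationship $H(P, Q) = H(P) + \text{KL}(P || Q)$ stated earlier can be checked numerically. The following is a minimal, self-contained sketch using the same example distributions; the `entropy` helper is introduced here for illustration, and the other two functions are restated so the block runs on its own.

```python
from math import isclose, log2

# Same example distributions as in the cells above.
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

def entropy(p):
    """Entropy H(P) in bits; assumes every probability is strictly positive."""
    return -sum(pi * log2(pi) for pi in p)

def kl_divergence(p, q):
    """KL(P || Q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q))

def cross_entropy(p, q):
    """H(P, Q) in bits."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q))

h_p = entropy(p)
kl_pq = kl_divergence(p, q)
ce_pq = cross_entropy(p, q)

print('H(P): %.3f bits' % h_p)
print('H(P) + KL(P || Q): %.3f bits' % (h_p + kl_pq))
print('H(P, Q): %.3f bits' % ce_pq)

# The two sides agree up to floating-point rounding.
identity_holds = isclose(h_p + kl_pq, ce_pq)
print('identity holds:', identity_holds)
```

Because $H(P)$ does not depend on $Q$, the two quantities differ only by a constant when $P$ is fixed, so changing $Q$ to lower one also lowers the other by the same amount.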
