# Information Theory

Consider a random variable $X$ taking values in the set $\{a, b, c, d\}$, with the following probabilities:

- $P(X=a)=p_{a}$
- $P(X=b)=p_{b}, \ldots$

## Optimal number of bits to encode the possible values of $X$ 

### Equal Probability
If $p_{a}=p_{b}=p_{c}=p_{d}=\frac{1}{4}$, then we use 2 bits to transmit a message containing just the value of $X$, using the following fixed-length encoding:
- a: 00
- b: 01
- c: 10
- d: 11

### Uneven Probability
If $p_{a}=\frac{1}{2}, p_{b}=\frac{1}{4}, p_{c}=p_{d}=\frac{1}{8}$, we use a different encoding scheme based on the following idea: use fewer bits to encode the more
frequently occurring values, and more bits to encode the less
frequently occurring ones. For example:

- a: 0
- b: 10
- c: 110
- d: 111

Note that we cannot use shorter codes for b, c or d because
we need to be able to unambiguously parse a concatenation of
the codewords. Here no codeword is a prefix of another, so, e.g., 1110110 decodes uniquely into dac (111, 0, 110).
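The unambiguous parsing described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the names `CODE`, `encode` and `decode` are hypothetical, and the decoder assumes the code table is prefix-free.

```python
# Code table from the example above (symbol -> codeword).
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(message, code=CODE):
    """Concatenate the codeword of every symbol in the message."""
    return "".join(code[symbol] for symbol in message)

def decode(bits, code=CODE):
    """Decode a bit string left to right, assuming a prefix-free code."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        # No codeword is a prefix of another, so the first match is the only one.
        if buffer in inverse:
            decoded.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError(f"trailing bits do not form a codeword: {buffer!r}")
    return "".join(decoded)

print(decode("1110110"))  # dac
print(encode("dac"))      # 1110110
```

If b were shortened to `1`, the string `11` could no longer be told apart from the start of c or d, which is why the longer codewords are needed.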

With this encoding scheme, on average we use

$\left(\frac{1}{2} \times 1\right)+\left(\frac{1}{4} \times 2\right)+\left(\frac{1}{8} \times 3\right)+\left(\frac{1}{8} \times 3\right)=1.75$
bits
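The same average can be checked numerically. The sketch below uses exact fractions to compare the two encodings under the uneven distribution; the names `probs`, `fixed_code`, `variable_code` and `expected_length` are illustrative choices.

```python
from fractions import Fraction

# Uneven distribution from the example above.
probs = {"a": Fraction(1, 2), "b": Fraction(1, 4),
         "c": Fraction(1, 8), "d": Fraction(1, 8)}

fixed_code = {"a": "00", "b": "01", "c": "10", "d": "11"}
variable_code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def expected_length(probs, code):
    """Average number of bits per symbol: sum of p(symbol) * len(codeword)."""
    return sum(p * len(code[symbol]) for symbol, p in probs.items())

print(float(expected_length(probs, fixed_code)))     # 2.0
print(float(expected_length(probs, variable_code)))  # 1.75
```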

## Entropy
The entropy $H(X)$ of a discrete random variable $X$ with probabilities $p_{i}$ is given by
$$H(X)=-\sum_{i} p_{i} \log p_{i}$$
where we adopt the convention $0 \log 0 = 0$. With base-2 logarithms, $H(X)$ is measured in bits.

### Example
If $p_{a}=\frac{1}{2}, p_{b}=\frac{1}{4}, p_{c}=\frac{1}{8}=p_{d}$ then the entropy is as follows:

$$H(X)=-\frac{1}{2} \log _{2} \frac{1}{2}-\frac{1}{4} \log _{2} \frac{1}{4}-\frac{1}{8} \log_2 \frac{1}{8}-\frac{1}{8} \log_2 \frac{1}{8}=1.75 \text{ bits}$$

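This is the same as the average number of bits we
computed earlier with our encoding scheme. Since the encoding is binary, we take logarithms in base 2, and the match is exact because every probability is a power of $\frac{1}{2}$: each codeword has length exactly $-\log_2 p_i$ bits.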
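The example can be reproduced with a short helper. This is a minimal sketch assuming base-2 logarithms; the function name `entropy` is our own choice.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p == 0 are skipped (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75
```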

### Example 2: Uniform Distribution
Entropy $H(X)$ is maximized when $X$ is uniformly distributed. For $n$ classes, we need $-\log_{2} \frac{1}{n} = \log_{2} n$ bits on average to transmit $X$, and this is the most bandwidth
required among all possible distributions of $X$. For $n=4$:

$$H(X)=-\frac{1}{4} \log _{2} \frac{1}{4}-\frac{1}{4} \log _{2} \frac{1}{4}-\frac{1}{4} \log_2 \frac{1}{4}-\frac{1}{4} \log_2 \frac{1}{4}= - \log _{2} \frac{1}{4}=2$$
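As a quick numerical illustration (a spot check, not a proof), the sketch below compares the uniform distribution on four values with two other hand-picked distributions. The helper `entropy_bits` and the three example distributions are assumptions made for this example.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, using the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1/4, 1/4, 1/4, 1/4]
skewed = [1/2, 1/4, 1/8, 1/8]
certain = [1.0, 0.0, 0.0, 0.0]   # X always takes the same value

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(name, entropy_bits(dist))

# The uniform case reaches the upper bound log2(n) with n = 4.
print(entropy_bits(uniform) == math.log2(4))
```

The uniform distribution gives 2 bits, the skewed one 1.75 bits, and the deterministic one carries no information at all.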

## Cross Entropy
The cross entropy of two discrete distributions $p$ and $q$ is given by
$$H(p, q)=-\sum_{i} p_{i} \log q_{i}$$

Conditions:
- $H(p, q)$ is finite when $q_{i}=0 \Rightarrow p_{i}=0$ for every $i$ (terms with $p_{i}=0$ contribute 0).
- Otherwise, if $q_{i}=0$ for some $i$ but $p_{i}>0,$ then $H(p, q)=\infty$

We can also write $H(X , Y )$ instead when we have two random
variables $X$ and $Y$ with distributions $p$ and $q$ respectively.

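- $H(p, q) \geq H(p, p) = H(p)$ for all $q$, so the excess $H(p, q) - H(p)$ acts as a measure of dissimilarity between $p$ and $q$.
- Equality occurs when $q=p$. The cross entropy then equals the entropy $H(p)$, which is not 0 in general; only the excess over $H(p)$ vanishes.
- Not symmetric: i.e. $H(p, q) \neq H(q, p)$ in general.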
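These properties can be checked on the distributions used earlier. The following is a minimal sketch in base 2; `cross_entropy` is an illustrative helper that follows the conventions stated above.

```python
import math

def cross_entropy(p, q):
    """H(p, q) in bits for two distributions given as equal-length sequences."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue          # a term with p_i = 0 contributes nothing
        if q_i == 0:
            return math.inf   # p_i > 0 but q_i = 0
        total -= p_i * math.log2(q_i)
    return total

p = [1/2, 1/4, 1/8, 1/8]
q = [1/4, 1/4, 1/4, 1/4]

print(cross_entropy(p, p))  # 1.75
print(cross_entropy(p, q))  # 2.0
print(cross_entropy(q, p))  # 2.25
```

Here $H(p, q) = 2$ is exactly the average length of the fixed 2-bit code when the data follow $p$, and $H(p, q) \neq H(q, p)$.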

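```python
import numpy as np

def CrossEntropy(yHat, y):
    """Binary cross entropy (in nats) of one prediction yHat for a true label y in {0, 1}."""
    if y == 1:
        return -np.log(yHat)      # log-probability assigned to class 1
    else:
        return -np.log(1 - yHat)  # log-probability assigned to class 0

# np.log(0) is -inf; silence NumPy's divide-by-zero warning for this demonstration.
with np.errstate(divide="ignore"):
    print(CrossEntropy(0, 0), CrossEntropy(1, 1))  # prints: -0.0 -0.0 (confident and correct)
    print(CrossEntropy(0, 1), CrossEntropy(1, 0))  # prints: inf inf (confident and wrong)
```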

## Kullback-Leibler (KL) divergence (or relative entropy)

Definition:
The KL divergence of two discrete distributions $p$ and $q$ such that
$q_{i}=0 \Longrightarrow p_{i}=0$ is given by

$$ 
\begin{aligned} D_{KL}(p \| q) &=H(p, q)-H(p, p) \\ &=\sum_{i} p_{i} \log \frac{p_{i}}{q_{i}} \end{aligned}
$$

If $q_{i}=0$ for some $i$ but $p_{i}>0,$ then $H(p, q)=\infty$ and hence $D_{KL}(p \| q)=\infty$

KL divergence measures the number of extra bits required to
transmit X with distribution p, **as compared to the optimal
code**, when we use the sub-optimal coding scheme associated
with distribution q.
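The "extra bits" reading can be made concrete with the earlier example: data drawn from $p$ but encoded with the fixed 2-bit code, which is optimal for the uniform distribution. This is a minimal sketch in base 2; `kl_divergence` and the distribution `r` are hypothetical names introduced for the illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions given as equal-length sequences."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue          # a term with p_i = 0 contributes nothing
        if q_i == 0:
            return math.inf   # p_i > 0 but q_i = 0
        total += p_i * math.log2(p_i / q_i)
    return total

p = [1/2, 1/4, 1/8, 1/8]
u = [1/4, 1/4, 1/4, 1/4]
r = [1/2, 1/2, 0.0, 0.0]   # puts no mass on the last two values

print(kl_divergence(p, u))  # 0.25, i.e. 2 - 1.75 extra bits per symbol
print(kl_divergence(r, u))  # 1.0
print(kl_divergence(u, r))  # inf
```

The last two lines show that swapping the arguments can change the value drastically.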

- As with cross entropy, it is not symmetric.
- We can use the source coding theorem to infer that KL divergence is always non-negative, but there is a more direct proof using Jensen’s inequality.
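A sketch of the Jensen argument, assuming only that $\log$ is concave and that $p$ and $q$ each sum to 1. Restricting the sums to the indices with $p_{i}>0$:

$$
\begin{aligned}
-D_{KL}(p \| q) &= \sum_{i:\, p_{i}>0} p_{i} \log \frac{q_{i}}{p_{i}} \\
&\leq \log \sum_{i:\, p_{i}>0} p_{i} \frac{q_{i}}{p_{i}} \\
&= \log \sum_{i:\, p_{i}>0} q_{i} \leq \log 1 = 0
\end{aligned}
$$

Hence $D_{KL}(p \| q) \geq 0$, with equality exactly when $q = p$.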





