# Information theory

The primary goal of information theory is to quantify how much information is contained in data. We can measure the information conveyed by an observed random sample by comparing how often it occurs against prior knowledge of the distribution it was drawn from.

There are several ways to measure the information in a random variable, depending on the kind of information we are interested in. We will review three of the most widely used measures in information theory: entropy, KL divergence and mutual information.

We will examine simple cases where the random events are independent, discrete and mutually exclusive. As you might have guessed, we will be evaluating random events that come from Bernoulli trials.

Remember that the probabilities of the $L$ possible outcomes of a discrete random event sum to one:

$$ \sum_{k=1}^{L} P(k) = 1 $$

# Entropy

The most important metric in information theory is called entropy, typically denoted $H$. The entropy of a probability distribution is given by the Shannon entropy, which reduces to the Hartley function when all outcomes are equally likely [https://en.wikipedia.org/wiki/Hartley_function](https://en.wikipedia.org/wiki/Hartley_function):

$$ H = - \sum_{i=1}^{N} P(X_i) \log P(X_i) $$

If we use $\log_2$ for our calculation, we can interpret entropy as "the minimum number of bits it would take us to encode our information".

Entropy is usually described as a measure of disorder (ambiguity, uncertainty). Entropy is low when the outcomes are predictable, so that observing one conveys little new information about the sample. Low-probability events, on the other hand, carry more information when they occur, so entropy increases as the probability is spread more evenly across the outcomes.

![Entropy_A](Entropy_A.png)

Entropy is normally measured in bits. The unit depends on the base of the logarithm: base 2, the most common choice, gives bits.

For example, if we throw a fair coin, the entropy of the outcome is 1 bit, because both values are equally probable. If a value is certain to occur, the entropy is 0 (the random variable is deterministic), because observing the outcome does not provide any new information.


In [2]:
-(1/2 * log(1/2,2)+ 1/2 *log(1/2,2))
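
The same calculation can be wrapped in a small helper. The sketch below is only a minimal version: the function name `shannon_entropy` and its `base` argument are our own assumptions, not part of any library. It treats $0 \log 0$ as 0 so that the deterministic case works, and it lets us switch between bits and nats.

```r
# Hypothetical helper (name and arguments are assumptions for illustration):
# Shannon entropy of a probability vector p.
# base = 2 gives bits, base = exp(1) gives nats.
shannon_entropy <- function(p, base = 2) {
  p <- p[p > 0]                      # drop zero-probability outcomes (0 * log 0 is taken as 0)
  -sum(p * log(p, base = base))
}

h_fair_bits <- shannon_entropy(c(0.5, 0.5))                  # fair coin, in bits
h_fair_nats <- shannon_entropy(c(0.5, 0.5), base = exp(1))   # same coin, in nats
h_certain   <- shannon_entropy(c(1, 0))                      # deterministic outcome
h_biased    <- shannon_entropy(c(0.9, 0.1))                  # biased coin, between 0 and 1 bit
```

The fair coin gives 1 bit (or $\ln 2$ nats), the deterministic outcome gives 0, and the biased coin falls in between.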

In [4]:
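# Two small samples of discrete values
y <- c(4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1)
y1 <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 3, 3, 4, 4, 4)

# Count how often each value occurs in each sample
table(y)
table(y1)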

y
0 1 2 3 4 
3 2 3 1 2 

y1
0 1 2 3 4 
9 1 1 3 3 

In [8]:
# Entropy of y in bits, computed from its relative frequencies
freqs <- table(y)/length(y)
-sum(freqs * log2(freqs))
# Same calculation for y1
freqs <- table(y1)/length(y1)
-sum(freqs * log2(freqs))

In [6]:
library(DescTools)

# Entropy() takes the table of counts and uses base 2 by default
Entropy(table(y))
Entropy(table(y1))

We can measure the entropy per character of English text. If we assume that all characters (26 letters and a space) are equally likely, then:

$$ H = -\log_2 \frac{1}{27} = \log_2 27 \approx 4.75 $$

But this assumed distribution is not correct: in regular English some characters occur more frequently than others.

![English_Entropy](English_Entropy.png)

In this case the entropy is 4.219 bits per symbol. The value is smaller because some characters occur much more frequently than others, so on average each character carries less information.
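
To see the effect of a non-uniform distribution in code, we can compare the uniform 27-symbol case with a skewed one. This is only a sketch: the skewed weights below are made up (proportional to 1/rank) and are not real English letter frequencies, so the result will not match the 4.219 figure.

```r
n_symbols <- 27

# Uniform case: every symbol has probability 1/27
p_uniform <- rep(1 / n_symbols, n_symbols)
h_uniform <- -sum(p_uniform * log2(p_uniform))   # equals log2(27)

# Made-up skewed distribution (NOT real English frequencies):
# weights proportional to 1/rank, normalised so they sum to 1
w <- 1 / seq_len(n_symbols)
p_skewed <- w / sum(w)
h_skewed <- -sum(p_skewed * log2(p_skewed))

# A non-uniform distribution over the same symbols has lower entropy
h_skewed < h_uniform
```

The uniform distribution is the maximum-entropy case, so any unevenness in the symbol frequencies can only lower the bits per symbol.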

### Joint Entropy

When we want to measure the combined uncertainty of a set of random variables, we can calculate the joint entropy with the equation:

$$ H(X,Y) = - \sum_{x \in S_X}\sum_{y \in S_Y} p(x,y)\log p(x,y) $$

[http://www.cs.tau.ac.il/~iftachh/Courses/Info/Fall14/Printouts/Lesson2_h.pdf](http://www.cs.tau.ac.il/~iftachh/Courses/Info/Fall14/Printouts/Lesson2_h.pdf)

[https://stats.stackexchange.com/questions/72694/joint-entropy-of-two-random-variables](https://stats.stackexchange.com/questions/72694/joint-entropy-of-two-random-variables)
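
As a rough illustration of the joint entropy formula, we can estimate $p(x,y)$ from paired observations with a two-way `table()`. The vectors `a` and `b` below are toy data invented for this sketch.

```r
# Toy paired observations (made up for illustration)
a <- c(0, 0, 1, 1, 0, 1, 0, 1)
b <- c(0, 0, 1, 1, 1, 0, 0, 1)

# Empirical joint distribution p(x, y)
joint <- table(a, b) / length(a)

# Joint entropy in bits, skipping empty cells (0 * log 0 is taken as 0)
p_xy <- joint[joint > 0]
h_ab <- -sum(p_xy * log2(p_xy))

# Marginal entropies, for comparison
p_a <- rowSums(joint)
p_b <- colSums(joint)
h_a <- -sum(p_a * log2(p_a))
h_b <- -sum(p_b * log2(p_b))
```

The joint entropy is never smaller than either marginal entropy and never larger than their sum; it equals the sum only when the two variables are independent.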

# KL-Divergence

The Kullback-Leibler (KL) divergence, also known as relative entropy, is a non-symmetric measure of the difference between two probability distributions.

If $P(x)$ and $Q(x)$ are two discrete probability distributions over the same outcomes, then the Kullback-Leibler divergence of $Q$ from $P$ is defined as

$$ \mathrm{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} $$

For continuous probability density functions the sum is replaced by an integral. The KL divergence is related to mutual information and can be used to measure the association between two random variables. With a base 2 logarithm, it can be seen as the average number of bits that are wasted by encoding events from a distribution $P$ with a code based on a not-quite-right distribution $Q$.

![Entropy_KL](Entropy_KL.png)

Very often in probability and statistics we replace observed data or a complex distribution with a simpler, approximating distribution. KL divergence helps us measure how much information we lose when we choose an approximation.
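
A minimal sketch of this idea, assuming a hypothetical helper `kl_divergence` (our own name, not a library function): we take the observed frequencies of `y1` from the entropy section and approximate them with a uniform distribution over the same five values.

```r
# Hypothetical helper: KL divergence of q from p, for discrete distributions
# given as probability vectors over the same outcomes.
kl_divergence <- function(p, q, base = 2) {
  stopifnot(length(p) == length(q))
  keep <- p > 0                        # terms with p = 0 contribute nothing
  stopifnot(all(q[keep] > 0))          # otherwise the divergence is infinite
  sum(p[keep] * log(p[keep] / q[keep], base = base))
}

p_obs  <- c(9, 1, 1, 3, 3) / 17        # observed frequencies of y1 (values 0 to 4)
q_unif <- rep(1 / 5, 5)                # simple approximation: all values equally likely

kl_pq <- kl_divergence(p_obs, q_unif)  # bits lost by approximating p_obs with q_unif
kl_qp <- kl_divergence(q_unif, p_obs)  # a different value: the measure is not symmetric
```

When the approximation is uniform, `kl_pq` equals $\log_2 5$ minus the entropy of the observed distribution, which links the divergence back to the entropy calculations above.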

For a good example see [https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained)

# Mutual Information

If entropy measures how much uncertainty there is in a random variable, mutual information measures how much information is shared between two random variables, that is, how much knowing one variable tells us about the other. Mutual information is related to the correlation coefficient, but it is computed from the joint distribution of both variables, so it is not limited to linear relationships. The base 2 logarithm is often used, in which case the result is in bits; with the natural logarithm, as in the formula below, the result is in nats.

$$ I(x,y) = \sum_{x,y} P(x,y) \ln \frac{P(x,y)}{P(x) P(y)} $$

### Simple Estimator

Let $(X_i, Y_i)$ be a data set of $N$ data points. To compute the empirical mutual information, count how often each combination of values occurs, divide by $N$ to obtain the relative frequency $\mathrm{freq}(x,y)$, and then compute:

$\mathrm{freq}(x) = \sum_y \mathrm{freq}(x,y)$

$\mathrm{freq}(y) = \sum_x \mathrm{freq}(x,y)$

$I_N(x,y) = \sum_{x,y} \mathrm{freq}(x,y) \ln \frac{\mathrm{freq}(x,y)}{\mathrm{freq}(x)\, \mathrm{freq}(y)}$

When computing the sum, include only the terms where $\mathrm{freq}(x,y) > 0$.
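
The estimator can be sketched in a few lines of R. The function name `mutual_information` is our own choice for illustration, and the input vectors are toy data. The result is in nats because the formula uses the natural logarithm.

```r
# Hypothetical helper: plug-in estimate of mutual information (in nats)
# from two equally long vectors of discrete observations.
mutual_information <- function(x, y) {
  stopifnot(length(x) == length(y))
  freq_xy <- table(x, y) / length(x)     # relative frequency of each (x, y) pair
  freq_x  <- rowSums(freq_xy)            # freq(x) = sum over y of freq(x, y)
  freq_y  <- colSums(freq_xy)            # freq(y) = sum over x of freq(x, y)
  expected <- outer(freq_x, freq_y)      # freq(x) * freq(y) for every cell
  keep <- freq_xy > 0                    # only terms where freq(x, y) > 0
  sum(freq_xy[keep] * log(freq_xy[keep] / expected[keep]))
}

# Toy data (made up): u and v never co-vary, so they share no information
u <- c(0, 0, 1, 1)
v <- c(0, 1, 0, 1)
mi_independent <- mutual_information(u, v)

# A variable shares all of its information with itself
mi_self <- mutual_information(u, u)
```

For the independent pair the estimate is 0, and for a variable paired with itself it equals the entropy of that variable, here $\ln 2$ nats. Dividing the result by `log(2)` converts it to bits.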

[http://www.lumina.com/blog/estimation-of-mutual-information](http://www.lumina.com/blog/estimation-of-mutual-information)