## 1. Measuring information theory quantities for discrete time series

This notebook is the first in a series that aims to demonstrate how to use the "InfoPy" package to compute information theory quantities such as entropy, mutual information, and transfer entropy. We begin by using its methods to compute IT quantities for discrete variables (or time series).

The series of notebooks is divided as follows:

1. Application of "InfoPy" to discrete variables;
2. Application of "InfoPy" to continuous variables using kernel density estimation methods;
3. Application of "InfoPy" to continuous variables using the KSG estimator (not yet available).

In [11]:
##########################################################################
# Importing packages
##########################################################################
import numpy             as np 
import matplotlib.pyplot as plt

##########################################################################
# Plots configuration
##########################################################################
#plt.xkcd()

### A brief review on information theory

Before we get hands-on with "InfoPy", let's review the basic concepts of information theory (IT), namely entropy, mutual information, and transfer entropy.

#### Entropy

Put simply, the entropy $H(X)$ quantifies the amount of uncertainty associated with a given random variable $X$. It is the average of another quantity called the information content $h(x_i)$, which measures the uncertainty (or surprisal, as we will see below) associated with a single outcome $x_i$ of $X$.

But how can we properly define $h(x_i)$? Intuitively, one could associate the uncertainty of an outcome $x_i$ with the inverse of its probability, $1/p(x_i)$: the rarer the event, the more information one gets by observing it (or the more surprised the observer is, which is why it is also called surprisal). However, this definition carries a few problems. For example, a certain outcome with $p(x_i) = 1$ would still carry a surprisal of $1$ rather than $0$, and the surprisals of independent outcomes would multiply rather than add. We therefore require the entropy to satisfy the following properties [[4](http://www.mtm.ufsc.br/~taneja/book/node6.html)]:

1. $H(X)$ is continuous;
2. For equally probable outcomes, $p(x_1) = \dots = p(x_n) = 1/n$, $H(X)$ increases monotonically with the number of outcomes $n$;
3. $H(X)$ is additive for independent variables $X$ and $Y$: $H(X,Y) = H(X)+H(Y)$.

We define the surprisal as:

$h(x_i) = \log_{b}\left(\frac{1}{p(x_i)}\right)$.

Consequently, the entropy of $X$, being the average surprisal, is given by:

$H(X) = -\sum_{i} p(x_i)\log_{b} p(x_i) $.
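The definition above can be sketched numerically with plain NumPy. This is only an illustration of the formula, not the "InfoPy" API: the function name `entropy` and the sample arrays are hypothetical, and the probabilities are estimated by simple frequency counts.

```python
import numpy as np

def entropy(samples, base=2):
    """Frequency-count estimate of H(X) for a discrete sample (hypothetical helper)."""
    # Count how often each distinct outcome occurs.
    _, counts = np.unique(samples, return_counts=True)
    # Relative frequencies serve as the probability estimates p(x_i).
    p = counts / counts.sum()
    # Surprisal of each observed outcome: h(x_i) = log_b(1 / p(x_i)).
    h = -np.log(p) / np.log(base)
    # The entropy is the average surprisal.
    return float(np.sum(p * h))

coin = np.array([0, 1, 0, 1, 1, 0, 1, 0])    # both outcomes equally frequent
biased = np.array([0, 0, 0, 0, 0, 0, 0, 1])  # one outcome dominates

print(entropy(coin))    # 1 bit for two equally frequent outcomes
print(entropy(biased))  # lower: the sequence is more predictable
```

Outcomes that never occur in the sample do not appear in `counts`, so the convention $0 \log 0 = 0$ is respected implicitly.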

With this definition, the uncertainty of a random variable is maximal when all its outcomes are equally probable. Picture yourself trying to guess the outcome of rolling a fair die with, say, $1000$ faces. Hard, isn't it? But if the die is biased towards even numbers greater than $500$, guessing starts to get easier. For $n$ equally probable outcomes we have $p(x_1) = \dots = p(x_n) = 1/n$, therefore:

$H(X) = \log_{b}(n)$.

It is easy to see from the equation above that $H(X)$ satisfies conditions (1) and (2). If the base of the logarithm is $b=2$, the entropy is measured in __bits__ (for $b = e$, in __nats__). For $b=2$, raising $2$ to the power of both sides of the equation above gives:

$n = 2^{H(X)}$,

for $H(X) = 1$ bit, we get $n = 2$. Therefore, $1$ bit is the amount of information necessary to choose between two equally probable outcomes.
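The die example can be checked numerically. The sketch below is a minimal illustration under stated assumptions: `entropy_from_probs` is a hypothetical helper (not part of "InfoPy"), and the bias, where even faces above $500$ are ten times as likely as the others, is an arbitrary choice made only for this example.

```python
import numpy as np

def entropy_from_probs(p, base=2):
    """Entropy of a probability vector (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p = 0 contribute nothing to the sum
    return float(-np.sum(p * np.log(p)) / np.log(base))

n = 1000
faces = np.arange(1, n + 1)

# Fair die: every face has probability 1/n.
fair_die = np.full(n, 1.0 / n)

# Biased die (illustrative weights): even faces above 500 are 10x as likely.
weights = np.where((faces % 2 == 0) & (faces > 500), 10.0, 1.0)
biased_die = weights / weights.sum()

h_fair = entropy_from_probs(fair_die)
h_biased = entropy_from_probs(biased_die)

print(h_fair, np.log2(n))  # both equal log2(1000)
print(2 ** h_fair)         # recovers n, the number of equally likely outcomes
print(h_biased)            # smaller than h_fair
```

For a non-uniform distribution, $2^{H(X)}$ can be read as an effective number of equally likely outcomes, which is why it drops below $n$ for the biased die.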

#### Joint entropy



Entropy can also be computed for two or more random variables, using their joint probability distribution. For example, for two variables $X$ and $Y$, the joint entropy is:

$H(X,Y) = -\sum_{i,j} p(x_i, y_j)\log_{b} p(x_i, y_j)$.
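As a rough sketch, the joint entropy of two discrete time series can be estimated by counting how often each pair of simultaneous values occurs. The function `joint_entropy` and the arrays below are hypothetical and use only NumPy, not "InfoPy".

```python
import numpy as np

def joint_entropy(x, y, base=2):
    """Frequency-count estimate of H(X,Y) from paired samples (hypothetical helper)."""
    # Each row is one observed pair (x_t, y_t).
    pairs = np.column_stack((x, y))
    # Count every distinct pair; these counts estimate the joint distribution.
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)) / np.log(base))

x = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

print(joint_entropy(x, y))  # four equally frequent pairs: 2 bits
print(joint_entropy(x, x))  # the second series adds nothing new: 1 bit
```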

#### Conditional entropy

Another relevant entropy we should define is the conditional entropy:

$H(X|Y) = -\sum_{i,j} p(x_i, y_j)\log_{b} p(x_i|y_j)$,

where $p(x_i|y_j) = p(x_i, y_j)/p(y_j)$ denotes the conditional probability.

This quantity can be read as the uncertainty that remains about the variable $X$ after observing the variable $Y$. For instance, suppose I am given the task of receiving book donations for my city's local library and placing each book on the shelf corresponding to its category, but without knowing which book I am receiving (this totally could happen). It is very hard to say whether a book is categorized as "scifi", "fantasy", or "magical realism". But since I know everyone who regularly goes to the library and their reading tastes, I can try to guess which category the book belongs to. For instance, João is an avid reader of Tolkien, so the book he is donating might be a fantasy one; Pedro's favorite writer is Machado de Assis, so perhaps the book he is donating can be put on the "realism" shelf. In summary, knowing who is delivering the book ($Y$) has decreased the uncertainty about which category the book belongs to ($X$).

The single, joint and conditional entropies relate to each other via:

$H(X,Y) = H(Y) + H(X|Y) = H(X) + H(Y|X)$
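The chain rule $H(X,Y) = H(Y) + H(X|Y)$ can be verified numerically on a small joint probability table. The numbers below are made up for the library example (rows are book categories $X$, columns are donors $Y$), and the helper `H` is a hypothetical name, not an "InfoPy" function.

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability table of any shape (hypothetical helper)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Made-up joint table p(x_i, y_j): rows = category X, columns = donor Y.
p_xy = np.array([[0.40, 0.05],   # X = fantasy
                 [0.10, 0.45]])  # X = realism

p_x = p_xy.sum(axis=1)  # marginal over donors
p_y = p_xy.sum(axis=0)  # marginal over categories

# p(x_i | y_j) = p(x_i, y_j) / p(y_j); broadcasting divides column j by p_y[j].
p_x_given_y = p_xy / p_y

# H(X|Y) = -sum_{i,j} p(x_i, y_j) log2 p(x_i | y_j)
h_x_given_y = float(-np.sum(p_xy * np.log2(p_x_given_y)))

print(H(p_x), h_x_given_y)            # knowing the donor lowers the uncertainty
print(H(p_xy), H(p_y) + h_x_given_y)  # both sides of the chain rule agree
```

All entries of this table are positive, so the logarithm is well defined; a table containing zeros would need those terms masked out first.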

Knowing the definitions of the entropies discussed so far allows us to define the next relevant IT measure, called "mutual information".

#### Mutual information

The mutual information, as the name suggests, is the amount of information that is shared between two variables $X$, and $Y$, and is defined as follows:

$MI(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$

The expression for the $MI$ can be rewritten using the relation between entropies $H(X,Y) = H(Y) + H(X|Y)$:

$MI(X,Y) = H(X) + H(Y) - H(X,Y)$

The amount of information shared between the variables is then the total information provided by each of them alone minus the information obtained by observing them together.

The mutual information can also be interpreted as a generalized measure of correlation between $X$ and $Y$ that captures non-linear dependence as well: $MI(X,Y) \geq 0$, and $X$ and $Y$ are statistically independent if and only if $MI(X,Y)=0$.
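To illustrate the difference from linear correlation, the sketch below estimates $MI(X,Y) = H(X) + H(Y) - H(X,Y)$ from frequency counts for a pair of series related by $y = x^2$. The helpers `entropy` and `mutual_information` are hypothetical names written with plain NumPy; they are not the "InfoPy" API.

```python
import numpy as np

def entropy(*columns):
    """Frequency-count entropy (bits) of one or more discrete series taken jointly."""
    rows = np.column_stack(columns)  # one row per time step
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(x) + entropy(y) - entropy(x, y)

# Non-linear dependence: y is fully determined by x, yet linearly uncorrelated.
x = np.tile([-1, 0, 1], 100)
y = x ** 2
corr = np.corrcoef(x, y)[0, 1]
mi_dependent = mutual_information(x, y)

# Independent pair: every combination of values is equally frequent.
u = np.repeat([0, 1], 4)
v = np.tile([0, 1], 4)
mi_independent = mutual_information(u, v)

print(corr)            # essentially zero
print(mi_dependent)    # clearly positive
print(mi_independent)  # zero
```

On finite, noisy samples a frequency-count estimate of this kind is generally biased upwards, so small positive values do not by themselves indicate dependence.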

For many variables, the equation can be generalized as:

$MI(X_{1},\dots, X_{n}) = \sum_{i}H(X_{i})  - H(X_{1},\dots, X_{n})$
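The multivariate expression above (sometimes called total correlation or multi-information) can be sketched the same way. The function names are hypothetical and the three short series are a toy example in which the third variable is the XOR of the first two.

```python
import numpy as np

def entropy(*columns):
    """Frequency-count entropy (bits) of one or more discrete series taken jointly."""
    rows = np.column_stack(columns)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def multi_information(*variables):
    """Sum of the single entropies minus the joint entropy of all variables."""
    return sum(entropy(v) for v in variables) - entropy(*variables)

x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
x3 = x1 ^ x2  # XOR: determined by x1 and x2 together

print(multi_information(x1, x2))      # 0 bits: x1 and x2 are independent
print(multi_information(x1, x2, x3))  # 1 bit shared once x3 is included
```

With two arguments the function reduces to $MI(X,Y) = H(X) + H(Y) - H(X,Y)$.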

#### Conditional mutual information