# Information
----
Information represents the degree of surprise associated with an event: the less likely the event, the more information its occurrence carries.

According to Shannon, an informative message is one with a very low chance of occurring. In contrast, a predictable message carries little information: one is not surprised to receive it. Without randomness or uncertainty there is no information, because if we know with certainty that an event will happen, its occurrence conveys nothing new.

Shannon adopted the bit (a term he credited to John W. Tukey) as the unit of information.

# NOTE
Information can be measured in bits (base 2), natural units or nats (base $e$), or hartleys (base 10), depending on the base of the logarithm.
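
As a minimal sketch of how the choice of base changes only the unit, the snippet below evaluates the negative log-probability (introduced in the Self-information section) of one event in all three bases. The helper name `information` and the probability `1/8` are illustrative assumptions, not part of the original notebook.

```python
import math

def information(p, base=2):
    """Negative log-probability of an event, in the unit set by `base` (hypothetical helper)."""
    return -math.log(p) / math.log(base)

p = 1 / 8  # illustrative probability
bits = information(p, base=2)        # log base 2  -> bits
nats = information(p, base=math.e)   # log base e  -> nats
hartleys = information(p, base=10)   # log base 10 -> hartleys

print(bits, nats, hartleys)
# Changing the base only rescales the value: 1 bit = ln(2) nats.
print(math.isclose(nats, bits * math.log(2)))
```

The three numbers describe the same quantity; they differ only by the constant factor that converts one logarithm base into another.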

In [1]:
from mxnet import np,npx
from mxnet.metric import NegativeLogLikelihood
from mxnet.ndarray import nansum
import random
npx.set_np()

# Self-information

Self-information quantifies the level of information or surprise associated with one particular outcome or event of a random variable.

With binary encoding, any message is encoded as a series of 0s and 1s, so a series of binary digits of length $n$ carries $n$ bits of information.

Suppose that in any such series each 0 or 1 occurs independently with probability $\frac{1}{2}$. A particular series of length $n$ then occurs with probability $\frac{1}{2^{n}}$, and it contains $n$ bits of information.

# NOTE
The amount of information conveyed by an individual event is itself a random variable.

Let $X$ be a discrete random variable taking values in $\{ x_{1},x_{2}, \cdots ,x_{n} \}$ with probabilities $\{p(X=x_{1}),p(X=x_{2}), \cdots ,p(X=x_{n})\}$.
To measure the amount of information provided by the event $X=x_{i}$, Shannon defined self-information as

$$I(X=x_{i}) = \log_{2} \frac{1}{p(X=x_{i})} = -\log_{2} p(X=x_{i})$$
which is the number of bits of information received when the event $X=x_{i}$ occurs.

For a binary code of length $n$, where $p=\frac{1}{2^{n}}$, this gives $I = n$ bits.

## PROPERTIES
----
$$ I(X) \geqslant 0 $$
$$ I(X)=0 \ \text{if} \ p(X)=1 $$
$$ \text{if} \ p(X)< p(Y) \ \text{then} \ I(X) > I(Y) $$
$$ I(X)\rightarrow +\infty \ \text{as} \ p(X) \rightarrow 0 $$
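
The properties above can be checked numerically. The following is a small sketch using only the standard library; the helper name `self_info_bits` and the list of probabilities are assumptions chosen for illustration.

```python
import math

def self_info_bits(p):
    """Self-information log2(1/p) in bits (hypothetical helper)."""
    return math.log2(1.0 / p)

# Probabilities in decreasing order, starting with a certain event.
probs = [1.0, 0.5, 0.1, 1e-3, 1e-9]
infos = [self_info_bits(p) for p in probs]

for p, i in zip(probs, infos):
    print(f"p = {p:<8g} I = {i:.3f} bits")

# Non-negativity, zero information for a certain event,
# and strictly more information for less likely events.
print(all(i >= 0 for i in infos))
print(infos[0] == 0.0)
print(all(a < b for a, b in zip(infos, infos[1:])))
```

As the probability shrinks towards zero, the printed self-information keeps growing without bound, which is the last property in the list.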


----

For example, the code "0010" has self-information
$$I(\text{"0010"}) = -\log_{2} p(\text{"0010"}) = -\log_{2}\frac{1}{2^{4}} = 4 \ \text{bits}$$

and "100110" has self-information
$$I(\text{"100110"}) = -\log_{2} p(\text{"100110"}) = -\log_{2}\frac{1}{2^{6}} = 6 \ \text{bits}$$

In [2]:
-np.log2(1/2**4)

4.0

In [3]:
2**6

64

In [4]:
-np.log2(1/2**6)

6.0

In [5]:
def self_information(p):
    return -np.log2(p)

In [6]:
self_information(1/64)

6.0

In contrast to the self-information of a single outcome, entropy quantifies how "informative" or "surprising" the entire random variable is, averaged over all its possible outcomes.

# Entropy (Average Self-information)

Information entropy, often just entropy, is a basic quantity in information theory associated with any random variable; it measures the average uncertainty (randomness) in the random variable. It is the average number of bits required to describe the random variable and can be interpreted as the average level of "information", "surprise", or "uncertainty" inherent in the variable's possible outcomes.


Self-information deals only with a single outcome, whilst entropy quantifies the amount of uncertainty in an entire probability distribution.

# NOTE
$$\log=\log_{2}$$
$x \sim p$ is read as "the random variable $x$ has distribution $p$".

Entropy is defined as

$$H(X)=-E_{x\sim p}[\log p(x)]$$
If $X$ is discrete,
$$H(X)=\sum_{i}p(X=x_{i})\, I(X=x_{i})=-\sum_{i}p(X=x_{i}) \log p(X=x_{i})$$

If $X$ is continuous with probability density $p(x)$, the entropy is referred to as differential entropy and is defined as
$$H(X)=-\int_{x}p(x) \log p(x)\,dx$$
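
To make the expectation form $-E_{x\sim p}[\log p(x)]$ concrete, here is a rough sketch that compares the exact discrete sum with a sample average over draws $x \sim p$. The outcome labels, the seed, and the sample size are arbitrary assumptions; only the standard library is used.

```python
import math
import random

values = ["a", "b", "c", "d"]   # hypothetical outcome labels
probs = [0.1, 0.5, 0.1, 0.3]    # their probabilities
p = dict(zip(values, probs))

# Exact entropy: probability-weighted sum of self-information.
exact = -sum(q * math.log2(q) for q in probs)

# Sampled entropy: average self-information of outcomes drawn from p.
rng = random.Random(0)          # fixed seed so the run is repeatable
samples = rng.choices(values, weights=probs, k=100_000)
estimate = sum(-math.log2(p[x]) for x in samples) / len(samples)

print(f"exact   H(X) = {exact:.4f} bits")
print(f"sampled H(X) = {estimate:.4f} bits")
```

The two printed values should agree closely, since the sample average of $-\log p(x)$ approaches the expectation as the number of draws grows.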

In [7]:
def entropy(p):
    entropy=-p*np.log2(p)
    out=nansum(entropy.as_nd_ndarray())
    return out

In [8]:
entropy(np.array([0.1, 0.5, 0.1, 0.3]))


[1.6854753]
<NDArray 1 @cpu(0)>

Suppose that five students take part in a race, with probabilities of winning $\{ \frac{1}{4}, \frac{1}{2},\frac{1}{8},\frac{1}{16},\frac{1}{16} \}$. We can calculate the entropy of the race outcome as

$$ -\frac{1}{4}\log\frac{1}{4} -\frac{1}{2}\log\frac{1}{2}-\frac{1}{8}\log\frac{1}{8}-2\cdot\frac{1}{16}\log\frac{1}{16} = 1.875 \ \text{bits}$$

In [9]:
-1/4*np.log2(1/4) - 1/2*np.log2(1/2)-1/8*np.log2(1/8)-2*1/16*np.log2(1/16)

1.875

In [20]:
a=np.array([1/4,1/2,1/8,1/16,1/16])
entropy(a)


[1.875]
<NDArray 1 @cpu(0)>

In [21]:
def ent(p):
    entropy=-p*np.log2(p)
    return entropy.sum()
    

In [22]:
ent(a)

array(1.875)

# Binary Entropy 

The entropy of a binary source that emits one symbol with probability $p$ and the other with probability $1-p$ is defined as

$$ H(X)=-p\log p -(1-p) \log (1-p) $$
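
The binary entropy formula can be sketched directly in plain Python. The function name `binary_entropy` is an assumption for illustration, and the endpoints use the usual convention $0 \log 0 = 0$.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a source emitting one symbol with probability p (hypothetical helper)."""
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log 0 = 0, so a certain outcome has zero entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.4f} bits")
```

The value is symmetric around $p = 0.5$, where it reaches its maximum of 1 bit, and it drops to 0 when the outcome is certain.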

In [23]:
7/4

1.75

# Mutual Information

# Joint Entropy

The joint entropy $H(X, Y )$ of a pair of random variables $(X, Y )$ with a joint distribution $P_{X,Y}(x, y)$ is defined as

$$H(X,Y)=-E_{(x,y) \sim P_{X,Y}}[\log P_{X,Y}(x,y)]$$

For a pair $(X, Y)$ of discrete random variables,

$$H(X,Y)=-\sum_{x}\sum_{y}P_{X,Y}(x,y)\log P_{X,Y}(x,y)$$

For a pair $(X, Y)$ of continuous random variables, the differential joint entropy is defined as

$$H(X,Y)=-\int_{x,y}P_{X,Y}(x,y)\log P_{X,Y}(x,y) \,dx\,dy$$


In [24]:
def joint_entropy(p_xy):
    joint_ent = -p_xy * np.log2(p_xy)
    # nansum will sum up the non-nan number
    out = nansum(joint_ent.as_nd_ndarray())
    return out

In [25]:
joint_entropy(np.array([[0.1, 0.5], [0.1, 0.3]]))


[1.6854753]
<NDArray 1 @cpu(0)>