<a href="https://colab.research.google.com/github/deltorobarba/machinelearning/blob/master/entropy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Entropy (Information Theory)**

```python
# Numerical computing, tabular data, and plotting libraries used in this notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

https://en.m.wikipedia.org/wiki/Entropy_(information_theory)

https://en.m.wikipedia.org/wiki/Information_theory

## **Information Distance**

#### **Information Gain**

**Mutual Information**

* Mutual information is also known as information gain, particularly in the context of decision trees.

* In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. 

* More specifically, it quantifies the "amount of information" (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable. 

* The concept of mutual information is intricately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

* Unlike the correlation coefficient, MI is not limited to real-valued random variables and linear dependence. It is more general and measures how different the joint distribution of the pair (X, Y) is from the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI).

https://en.m.wikipedia.org/wiki/Mutual_information
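
As a rough illustration of the last point, the sketch below computes $I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}$ directly from a joint probability table. The function name `mutual_information` and the two example tables are assumptions made for this example, not part of the notes above.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a joint probability table.

    Rows index the values of X, columns index the values of Y.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal of X, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal of Y, shape (1, m)
    product = px @ py                      # product of the marginals, shape (n, m)
    mask = joint > 0                       # terms with p(x, y) = 0 contribute 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

# Hypothetical joint distributions over two binary variables
independent = np.array([[0.25, 0.25], [0.25, 0.25]])  # joint equals product of marginals
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])        # Y is fully determined by X

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

For the independent table the joint distribution equals the product of the marginals, so the MI is zero; for the fully dependent table, observing X reveals one full bit about Y.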

**Kullback–Leibler divergence**

* The Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution differs from a second, reference probability distribution.

* Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference. 

* In contrast to variation of information, **it is a distribution-wise asymmetric measure** and thus **does not qualify as a statistical metric of spread**. It also does not satisfy the triangle inequality.

* In the simple case, a Kullback–Leibler divergence of 0 indicates that the two distributions in question are identical. In simplified terms, it is a measure of surprise, with diverse applications such as applied statistics, fluid mechanics, neuroscience and machine learning.
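
The asymmetry and the zero-divergence case can be checked numerically. The following is a minimal sketch for discrete distributions, using $D_{\mathrm{KL}}(p \parallel q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$. The helper name `kl_divergence` and the example distributions are assumptions for illustration, and the code assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits for discrete distributions on the same support.

    Assumes q(x) > 0 wherever p(x) > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Hypothetical example distributions over two outcomes
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

kl_pq = kl_divergence(p, q)  # divergence of q from p
kl_qp = kl_divergence(q, p)  # arguments swapped: generally a different value
kl_pp = kl_divergence(p, p)  # identical distributions
print(kl_pq, kl_qp, kl_pp)
```

Swapping the arguments gives a different value, while comparing a distribution with itself gives exactly zero.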

#### **Variation of information**

* In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). 

* It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. 

* Unlike the mutual information, however, **the variation of information is a true metric**, in that it obeys the triangle inequality.

https://en.m.wikipedia.org/wiki/Variation_of_information
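
The linear relationship mentioned above is $VI(X;Y) = H(X) + H(Y) - 2\,I(X;Y) = 2\,H(X,Y) - H(X) - H(Y)$. Below is a small sketch that evaluates it for two clusterings given as label arrays; the function names and the toy labelings are assumptions chosen for this example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (any shape)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def variation_of_information(labels_a, labels_b):
    """VI in bits between two clusterings given as equal-length 1-D label arrays."""
    _, a_idx = np.unique(np.asarray(labels_a), return_inverse=True)
    _, b_idx = np.unique(np.asarray(labels_b), return_inverse=True)
    # Contingency table: joint distribution of (cluster in A, cluster in B)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1)
    joint /= joint.sum()
    h_a = entropy(joint.sum(axis=1))
    h_b = entropy(joint.sum(axis=0))
    h_ab = entropy(joint)
    # VI = H(A) + H(B) - 2 I(A;B) = 2 H(A,B) - H(A) - H(B)
    return 2 * h_ab - h_a - h_b

a = [0, 0, 1, 1]
b = [1, 1, 0, 0]  # same partition as a, with the cluster names swapped
c = [0, 1, 0, 1]  # partition that cuts across a

vi_same = variation_of_information(a, b)
vi_diff = variation_of_information(a, c)
print(vi_same, vi_diff)
```

Relabeling the clusters leaves the partition unchanged, so the distance is zero; the crossing partition is as far from `a` as two balanced two-cluster partitions can be.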

## **Quantities of information**

https://en.m.wikipedia.org/wiki/Quantities_of_information

#### **Information Content (Self Information)**

* In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

* The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.

* The **Shannon information is closely related to information (theoretic) entropy**, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average." This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.

* The information content can be expressed in various units of information, of which the most common is the "bit" (sometimes also called the "shannon"), as explained below.

https://en.m.wikipedia.org/wiki/Information_content
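
A brief sketch of these two quantities, with self-information taken as $I(x) = -\log_2 p(x)$ and entropy as its expected value. The function names and example probabilities are assumptions for illustration.

```python
import numpy as np

def self_information(p):
    """Surprisal -log2(p), in bits, of an outcome with probability p."""
    return -np.log2(p)

def entropy(probs):
    """Expected self-information, in bits, of a discrete distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # outcomes with probability 0 contribute 0
    return float(np.sum(probs * self_information(probs)))

fair_coin_bits = float(self_information(0.5))       # one fair coin flip
rare_event_bits = float(self_information(1 / 1024)) # a much less likely outcome
h_fair = entropy([0.5, 0.5])
h_biased = entropy([0.9, 0.1])
print(fair_coin_bits, rare_event_bits, h_fair, h_biased)
```

Less probable outcomes carry more surprisal, and the biased coin has lower entropy than the fair one because its outcome is less surprising on average.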

#### **Units of Information**

**shannon**

* The shannon (symbol: Sh), more commonly known as the bit, is a unit of information and of entropy defined by IEC 80000-13. One shannon is the information content of an event occurring when its probability is 1/2.

* It is also the entropy of a system with two equally probable states. If a message is made of a sequence of a given number of bits, with all possible bit strings being equally likely, the message's information content expressed in shannons is equal to the number of bits in the sequence.

* https://en.m.wikipedia.org/wiki/Shannon_(unit)

**nat**

* The natural unit of information (symbol: nat), sometimes also nit or nepit, is a unit of information or entropy, based on natural logarithms and powers of e, rather than the powers of 2 and base 2 logarithms, which define the bit. 

* https://en.m.wikipedia.org/wiki/Nat_(unit)

**Hartley**

* The hartley (symbol: Hart), also called a ban or a dit (short for decimal digit), is a logarithmic unit that measures information or entropy. It is based on base 10 logarithms and powers of 10, rather than the powers of 2 and base 2 logarithms that define the bit, or shannon.

* One ban or hartley is the information content of an event if the probability of that event occurring is 1/10. It is therefore equal to the information contained in one decimal digit (or dit), assuming a priori equiprobability of each possible value.

* https://en.m.wikipedia.org/wiki/Hartley_(unit)
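
The three units differ only in the base of the logarithm, so they convert into each other by constant factors. The sketch below uses a hypothetical helper `information_content` and an example probability of 1/10 to show this.

```python
import math

def information_content(p, base=2.0):
    """Information content -log_base(p) of an event with probability p."""
    return -math.log(p) / math.log(base)

p = 0.1  # example probability (assumption for this sketch)

bits = information_content(p, base=2)           # shannons (bits)
nats = information_content(p, base=math.e)      # nats
hartleys = information_content(p, base=10)      # hartleys (bans, dits)

# Conversion factors: 1 nat = 1/ln(2) bits, 1 hartley = log2(10) bits
bits_from_nats = nats / math.log(2)
bits_from_hartleys = hartleys * math.log2(10)
print(bits, nats, hartleys)
```

An event with probability 1/10 carries one hartley, which corresponds to roughly 3.32 bits or 2.30 nats.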



## **Measures**

#### **Cross Entropy**

* The cross entropy between two probability distributions p and q **over the same underlying set of events** measures the average number of bits needed to identify an event drawn from the set when the coding scheme is optimized for an estimated probability distribution q rather than for the true distribution p.

https://en.m.wikipedia.org/wiki/Cross_entropy

https://en.m.wikipedia.org/wiki/Cross-entropy_method
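
A minimal numerical sketch of this definition, using $H(p, q) = -\sum_x p(x) \log_2 q(x)$. The function name and the two example distributions are assumptions; the code also assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy H(p, q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # outcomes that never occur under p contribute 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

# Hypothetical true distribution p and estimated distribution q
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h_p = cross_entropy(p, p)    # code optimized for p itself: the entropy of p
h_pq = cross_entropy(p, q)   # code optimized for q, events drawn from p
extra_bits = h_pq - h_p      # coding overhead, equal to D_KL(p || q)
print(h_p, h_pq, extra_bits)
```

Coding with the mismatched distribution q costs 1.75 bits per event on average instead of 1.5, and the 0.25-bit overhead is the Kullback–Leibler divergence of q from p.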

#### **Conditional Entropy**

* The conditional entropy (or equivocation) quantifies the amount of information needed to describe the outcome of a random variable $Y$ given that the value of another random variable $X$ is known. Here, information is measured in shannons, nats, or hartleys. The entropy of $Y$ conditioned on $X$ is written as $\mathrm{H}(Y \mid X)$.

https://en.m.wikipedia.org/wiki/Conditional_entropy
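
One way to compute this quantity is through the chain rule $\mathrm{H}(Y \mid X) = \mathrm{H}(X, Y) - \mathrm{H}(X)$. The sketch below applies it to a small joint table; the helper names and the table itself are assumptions made for this example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (any shape)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(joint):
    """H(Y | X) in bits; rows of the joint table index X, columns index Y."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint) - entropy(joint.sum(axis=1))  # H(X, Y) - H(X)

# Hypothetical joint table: given X = 0, Y is a fair coin; given X = 1, Y is always 0
joint = np.array([[0.25, 0.25],
                  [0.50, 0.00]])

h_y_given_x = conditional_entropy(joint)
print(h_y_given_x)
```

Half of the time Y still needs one bit to describe and half of the time it needs none, so the average is 0.5 bits.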

#### **Joint Entropy**

* In information theory, joint entropy is a measure of the uncertainty associated with a set of variables.

https://en.m.wikipedia.org/wiki/Joint_entropy
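
For two discrete variables, the joint entropy is $\mathrm{H}(X, Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)$. The following sketch evaluates it for an example joint table (an assumption for illustration) and compares it with the marginal entropies.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (any shape)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint table of two correlated binary variables (rows: X, columns: Y)
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

h_xy = entropy(joint)             # joint entropy H(X, Y)
h_x = entropy(joint.sum(axis=1))  # marginal entropy H(X)
h_y = entropy(joint.sum(axis=0))  # marginal entropy H(Y)
print(h_x, h_y, h_xy)
```

The joint entropy is at least as large as either marginal entropy and at most their sum; it reaches the sum only when the variables are independent.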

#### **Sources**

https://en.m.wikipedia.org/wiki/Jensen%27s_inequality

https://en.m.wikipedia.org/wiki/Fisher_information#Matrix_form

https://en.m.wikipedia.org/wiki/Information_content

https://en.m.wikipedia.org/wiki/Probability_theory#Measure-theoretic_probability_theory