<a href="https://colab.research.google.com/github/deltorobarba/machinelearning/blob/master/entropy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Information Geometry (Entropy & Statistical Learning) NEW**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## **Information Entropy**

https://en.m.wikipedia.org/wiki/Entropy_(information_theory)

https://en.m.wikipedia.org/wiki/Information_theory

### **Information Distance**

#### **Information Gain**

**Mutual Information**

* Mutual Information is also known as information gain.

* In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. 

* More specifically, it quantifies the "amount of information" (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable. 

* The concept of mutual information is intricately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

* Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair (X, Y) is to the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI).

https://en.m.wikipedia.org/wiki/Mutual_information
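As a minimal sketch of the definition above, the following computes mutual information as the expected pointwise mutual information over a discrete joint table. The function name `mutual_information` and the two toy joint tables are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y (columns)
    product = px @ py                       # product of the marginals
    mask = joint > 0                        # terms with p(x, y) = 0 contribute 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

independent = np.outer([0.5, 0.5], [0.5, 0.5])   # X and Y independent
copy = np.array([[0.5, 0.0], [0.0, 0.5]])        # Y is an exact copy of X

mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
print(mi_independent, mi_copy)
```

For independent variables the joint equals the product of the marginals, so the result is zero; when one fair binary variable fully determines the other, observing it yields one bit.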

**Kullback–Leibler divergence**

* The Kullback–Leibler divergence (also called relative entropy) measures how one probability distribution differs from a second, reference probability distribution.

* Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference. 

* In contrast to variation of information, **it is a distribution-wise asymmetric measure** and thus **does not qualify as a statistical metric of spread**; it also does not satisfy the triangle inequality.

* In the simple case, a Kullback–Leibler divergence of 0 indicates that the two distributions in question are identical. In simplified terms, it is a measure of surprise, with diverse applications such as applied statistics, fluid mechanics, neuroscience and machine learning.
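The asymmetry can be seen numerically. This is a small sketch for discrete distributions; the helper `kl_divergence` and the example vectors `p` and `q` are assumptions chosen for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

kl_pq = kl_divergence(p, q)   # divergence of p from the reference q
kl_qp = kl_divergence(q, p)   # arguments swapped
print(kl_pq, kl_qp, kl_divergence(p, p))
```

Swapping the arguments changes the value, while comparing a distribution with itself gives zero.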

#### **Variation of information**

* In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). 

* It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. 

* Unlike the mutual information, however, **the variation of information is a true metric**, in that it obeys the triangle inequality.

https://en.m.wikipedia.org/wiki/Variation_of_information
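The "simple linear expression" is VI(X; Y) = H(X) + H(Y) - 2 I(X; Y). A hedged sketch for two clusterings given as label arrays follows; the function names and the toy label arrays are illustrative assumptions.

```python
import numpy as np

def entropy(prob):
    """Shannon entropy in nats of a probability vector."""
    prob = prob[prob > 0]
    return float(-np.sum(prob * np.log(prob)))

def variation_of_information(labels_a, labels_b):
    """VI between two clusterings given as equal-length 1-D label arrays."""
    _, a_idx = np.unique(np.asarray(labels_a), return_inverse=True)
    _, b_idx = np.unique(np.asarray(labels_b), return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)      # contingency table of co-occurrences
    joint /= joint.sum()
    h_a = entropy(joint.sum(axis=1))
    h_b = entropy(joint.sum(axis=0))
    mutual_info = h_a + h_b - entropy(joint.ravel())
    return h_a + h_b - 2.0 * mutual_info

clustering_a = [0, 0, 1, 1]
relabelled_a = [1, 1, 0, 0]    # same partition, different label names
clustering_c = [0, 1, 0, 1]    # a partition that cuts across clustering_a

vi_same = variation_of_information(clustering_a, relabelled_a)
vi_cross = variation_of_information(clustering_a, clustering_c)
print(vi_same, vi_cross)
```

Relabelling the clusters leaves the partition unchanged, so the distance is zero; the two crossing partitions are independent and sit at distance 2 ln 2.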

### **Quantities of information**

https://en.m.wikipedia.org/wiki/Quantities_of_information

#### **Information Content (Self Information)**

* In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

* The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.

* The **Shannon information is closely related to information (theoretic) entropy**, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average." This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.

* The information content can be expressed in various units of information, of which the most common is the "bit" (sometimes also called the "shannon"), as explained below.

https://en.m.wikipedia.org/wiki/Information_content
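A short sketch of the relationship described above: the self-information of an outcome with probability p is -log p, and entropy is its expected value. The function name and the example distribution are assumptions made for illustration.

```python
import numpy as np

def self_information(p, base=2.0):
    """Surprisal -log_base(p) of an outcome with probability p."""
    return -np.log(p) / np.log(base)

probs = np.array([0.5, 0.25, 0.125, 0.125])
surprisal_bits = self_information(probs)                # rarer outcomes are more surprising
entropy_bits = float(np.sum(probs * surprisal_bits))    # expected surprisal
print(surprisal_bits, entropy_bits)
```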

#### **Units of Information**

**shannon**

* The shannon (symbol: Sh), more commonly known as the bit, is a unit of information and of entropy defined by IEC 80000-13. One shannon is the information content of an event occurring when its probability is 1/2.

* It is also the entropy of a system with two equally probable states. If a message is made of a sequence of a given number of bits, with all possible bit strings being equally likely, the message's information content expressed in shannons is equal to the number of bits in the sequence.

* https://en.m.wikipedia.org/wiki/Shannon_(unit)

**nat**

* The natural unit of information (symbol: nat), sometimes also nit or nepit, is a unit of information or entropy, based on natural logarithms and powers of e, rather than the powers of 2 and base 2 logarithms, which define the bit. 

* https://en.m.wikipedia.org/wiki/Nat_(unit)

**Hartley**

* The hartley (symbol Hart), also called a ban, or a dit (short for decimal digit), is a logarithmic unit which measures information or entropy, based on base 10 logarithms and powers of 10, rather than the powers of 2 and base 2 logarithms which define the bit, or shannon. 

* One ban or hartley is the information content of an event if the probability of that event occurring is 1/10. It is therefore equal to the information contained in one decimal digit (or dit), assuming a priori equiprobability of each possible value.

* https://en.m.wikipedia.org/wiki/Hartley_(unit)
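The three units differ only in the base of the logarithm, so converting between them is a constant factor. The sketch below computes the entropy of a fair coin in each unit; the helper `entropy_in` and the constant names are illustrative assumptions.

```python
import math

BITS_PER_NAT = 1.0 / math.log(2.0)     # 1 nat expressed in bits
BITS_PER_HARTLEY = math.log2(10.0)     # 1 hartley expressed in bits

def entropy_in(probs, base):
    """Shannon entropy using logarithms of the given base."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
h_bits = entropy_in(fair_coin, 2)
h_nats = entropy_in(fair_coin, math.e)
h_hartleys = entropy_in(fair_coin, 10)
print(h_bits, h_nats * BITS_PER_NAT, h_hartleys * BITS_PER_HARTLEY)
```

All three printed values describe the same quantity, one bit, once converted.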



### **Measures**

#### **Cross Entropy**

* The cross entropy between two probability distributions p and q **over the same underlying set of events** measures the average number of bits needed to identify an event drawn from the set if the coding scheme used for the set is optimized for an estimated probability distribution q, rather than the true distribution p.

https://en.m.wikipedia.org/wiki/Cross_entropy

https://en.m.wikipedia.org/wiki/Cross-entropy_method
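A minimal sketch of cross entropy for discrete distributions, using H(p, q) = -Σ p(x) log2 q(x). The distributions `p` and `q` are made-up examples; the gap between H(p, q) and H(p) is the Kullback–Leibler divergence of p from q.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

p = np.array([0.5, 0.25, 0.25])        # "true" distribution
q = np.array([1 / 3, 1 / 3, 1 / 3])    # estimated distribution the code is optimized for

h_p = cross_entropy(p, p)              # entropy of p: cost of the optimal code
h_pq = cross_entropy(p, q)             # cost when coding with q instead
extra_bits = h_pq - h_p                # equals D_KL(p || q) in bits
print(h_p, h_pq, extra_bits)
```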

#### **Conditional Entropy**

* The conditional entropy (or equivocation) quantifies the amount of information needed to describe the outcome of a random variable $Y$ given that the value of another random variable $X$ is known. Here, information is measured in shannons, nats, or hartleys. The entropy of $Y$ conditioned on $X$ is written as $\mathrm{H}(Y \mid X)$.

https://en.m.wikipedia.org/wiki/Conditional_entropy
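As an illustration, conditional entropy can be computed from a joint table as H(Y | X) = -Σ p(x, y) log2 p(y | x). The function name and the joint table below are assumptions chosen so that Y is uncertain for one value of X and fully determined for the other.

```python
import numpy as np

def conditional_entropy(joint):
    """H(Y | X) in bits; rows of `joint` index X, columns index Y."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    # p(y | x) = p(x, y) / p(x); rows with p(x) = 0 are left at zero
    cond = np.divide(joint, px, out=np.zeros_like(joint), where=px > 0)
    mask = joint > 0
    return float(-np.sum(joint[mask] * np.log2(cond[mask])))

joint = np.array([[0.25, 0.25],    # X = 0: Y is a fair coin
                  [0.50, 0.00]])   # X = 1: Y is always 0
h_y_given_x = conditional_entropy(joint)
print(h_y_given_x)
```

Half of the time (X = 0) one bit is still needed to describe Y, and the other half none, giving 0.5 bits on average.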

#### **Joint Entropy**

* In information theory, joint entropy is a measure of the uncertainty associated with a set of variables.

https://en.m.wikipedia.org/wiki/Joint_entropy
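A small sketch of joint entropy for two discrete variables, together with two standard bounds: it is at least as large as either marginal entropy and at most their sum. The joint table is a made-up example.

```python
import numpy as np

def entropy_bits(prob):
    """Shannon entropy in bits of a probability table of any shape."""
    prob = np.asarray(prob, dtype=float).ravel()
    prob = prob[prob > 0]
    return float(-np.sum(prob * np.log2(prob)))

joint = np.array([[0.25, 0.25],
                  [0.50, 0.00]])
h_xy = entropy_bits(joint)                # joint entropy H(X, Y)
h_x = entropy_bits(joint.sum(axis=1))     # marginal entropy H(X)
h_y = entropy_bits(joint.sum(axis=0))     # marginal entropy H(Y)
print(h_xy, h_x, h_y)
```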

#### **Sources**

https://en.m.wikipedia.org/wiki/Jensen%27s_inequality

https://en.m.wikipedia.org/wiki/Fisher_information#Matrix_form

https://en.m.wikipedia.org/wiki/Information_content

https://en.m.wikipedia.org/wiki/Probability_theory#Measure-theoretic_probability_theory

## **Statistical Learning**

### **Properties**

1. $d(x, y) \geq 0 \quad$ (**non-negativity**)

2. $d(x, y)=0$ if and only if $x=y$ (**identity of indiscernibles**. Note that condition 1 and 2 together produce **positive definiteness**)

3. $d(x, y)=d(y, x)$ (**symmetry**)

4. $d(x, z) \leq d(x, y)+d(y, z)$ (**subadditivity / triangle inequality**).

* **Divergence** fulfills the property of positive definiteness (1 + 2)

* **Distance** fulfills the properties of positive definiteness and symmetry (1 + 2 + 3)

* **Metric** fulfills the properties of positive definiteness, symmetry and the triangle inequality (1 + 2 + 3 + 4)

* Metric Space: Together with the set, a metric makes up a metric space.

* (*Every norm induces a metric, but not every metric is induced by a norm*)
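The four properties can be tested empirically on a finite set of points. The sketch below is an assumption-laden illustration (the helper `check_axioms` is hypothetical); a finite check can refute a property but cannot prove it in general.

```python
import itertools
import numpy as np

def check_axioms(dist, points, tol=1e-12):
    """Report which of the four properties `dist` satisfies on a finite set of points."""
    pairs = list(itertools.product(points, repeat=2))
    triples = list(itertools.product(points, repeat=3))
    return {
        "non_negativity": all(dist(x, y) >= -tol for x, y in pairs),
        "identity": all((dist(x, y) <= tol) == bool(np.array_equal(x, y)) for x, y in pairs),
        "symmetry": all(abs(dist(x, y) - dist(y, x)) <= tol for x, y in pairs),
        "triangle": all(dist(x, z) <= dist(x, y) + dist(y, z) + tol for x, y, z in triples),
    }

points = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
euclidean = lambda x, y: float(np.linalg.norm(x - y))
squared_euclidean = lambda x, y: float(np.sum((x - y) ** 2))

euclidean_report = check_axioms(euclidean, points)
squared_report = check_axioms(squared_euclidean, points)
print(euclidean_report)
print(squared_report)
```

On these three collinear points the Euclidean distance passes all four checks, while the squared Euclidean distance fails the triangle inequality (4 > 1 + 1).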

### **Divergences**

#### **Divergence & 'Statistical Distance'**

In statistics and information geometry, divergence or a contrast function is a function which establishes the **"distance" of one probability distribution to the other** on a statistical manifold. In statistics, probability theory, and information theory, **a statistical distance** quantifies the distance between two statistical objects, which can be

* two random variables, or 
* two probability distributions or 
* two samples, or 
* the distance can be between an individual sample point and a population or 
* a wider sample of points.

A distance between populations can be interpreted as **measuring the distance between two probability distributions** and hence they are essentially measures of distances between probability measures. 

* Where statistical distance measures relate to the differences between random variables, these may have statistical dependence, and hence these distances are not directly related to measures of distances between probability measures. 

* Again, a measure of distance between random variables may **relate to the extent of dependence between them, rather than to their individual values.**

**<u>Many statistical distances are not metrics</u>** (and some types are referred to as divergences), because they lack one or more properties of proper metrics. For example, 

* [pseudometrics](https://en.m.wikipedia.org/wiki/Pseudometric_space) violate the "positive definiteness" (alternatively, "identity of indiscernibles") property (1 & 2 above); 

* [quasimetrics](https://en.m.wikipedia.org/wiki/Metric_(mathematics)#Quasimetrics) violate the symmetry property (3); and semimetrics violate the triangle inequality (4). 

* Statistical distances that satisfy (1) and (2) are referred to as divergences.

* In statistics and information geometry, there are many kinds of statistical distances, notably divergences, especially Bregman divergences and f-divergences. These include and generalize many of the notions of "difference between two probability distributions", and allow them to be studied geometrically, as statistical manifolds. 

* The **most elementary** is the **squared Euclidean distance**, which forms the basis of least squares; this is the most basic Bregman divergence (it is not a metric, because it violates the triangle inequality).

* The **most important** in information theory is the relative entropy (**Kullback–Leibler divergence**), which allows one to analogously study maximum likelihood estimation geometrically; this is the most basic f-divergence, and is also a Bregman divergence (and is the only divergence that is both). 

* Statistical manifolds corresponding to Bregman divergences are flat manifolds in the corresponding geometry, allowing an analog of the Pythagorean theorem (which is traditionally true for squared Euclidean distance) to be used for linear inverse problems in inference by optimization theory.

* Other important statistical distances include the Mahalanobis distance, the energy distance, and many others.

https://en.m.wikipedia.org/wiki/Information_geometry

https://en.m.wikipedia.org/wiki/Statistical_distance

https://en.m.wikipedia.org/wiki/Divergence_(statistics)

#### **Meaning of 'no symmetry' in divergences**

* The Kullback-Leibler divergence is not symmetric. Roughly speaking, it's because you should think of the two arguments of the KL divergence as different kinds of things: the first argument is empirical data, and the second argument is a model you're comparing the data to. 

* Take a bunch of independent random variables $X_{1}, \ldots, X_{n}$ whose possible values lie in a finite set. Say these variables are identically distributed, with $\operatorname{Pr}\left(X_{i}=x\right)=p_{x}$. Let $F_{n, x}$ be the number of variables whose values are equal to $x$. The list $F_{n}$ is a random variable, often called the "empirical frequency distribution" of the $X_{i}$. What does $F_{n}$ look like when $n$ is very large?

* More specifically, let's try to estimate the probabilities of the possible values of $F_{n}$. Since the set of possible values is different for different $n$, take a sequence of frequency distributions $f_{1}, f_{2}, f_{3}, \ldots$ approaching a fixed frequency distribution $f$. It turns out that

> $\lim _{n \rightarrow \infty} \frac{1}{n} \ln \operatorname{Pr}\left(F_{n}=f_{n}\right)=-\mathrm{KL}(f, p)$ 

* In other words, the Kullback-Leibler divergence of $f$ from $p$ lets you estimate the probability of getting an empirical frequency distribution close to $f$ from a large number of independent random variables with distribution $p$.
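The limit can be checked numerically with exact multinomial probabilities. This is a sketch under stated assumptions: the distributions `p` and `f` are arbitrary examples, and the counts `round(f_x * n)` stand in for the sequence $f_n$ approaching $f$.

```python
import math

def log_multinomial_pmf(counts, probs):
    """Natural log of the multinomial probability of observing `counts`."""
    n = sum(counts)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coeff + sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

def kl(f, p):
    """KL(f, p) in nats for discrete distributions."""
    return sum(fi * math.log(fi / pi) for fi, pi in zip(f, p) if fi > 0)

p = [0.5, 0.5]    # true distribution of each X_i
f = [0.3, 0.7]    # target empirical frequency distribution

rates = {}
for n in (100, 1_000, 10_000, 100_000):
    counts = [round(fi * n) for fi in f]
    rates[n] = log_multinomial_pmf(counts, p) / n    # (1/n) ln Pr(F_n = f_n)
    print(n, rates[n], -kl(f, p))
```

As n grows, the normalized log-probability approaches -KL(f, p); the remaining gap shrinks roughly like (ln n) / n.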


Excellent article "Information Theory, Relative Entropy and Statistics," by [François Bavaud](https://link.springer.com/chapter/10.1007/978-3-642-00659-3_3)

List of distance types:

'braycurtis': hdbscan.dist_metrics.BrayCurtisDistance

'canberra': hdbscan.dist_metrics.CanberraDistance

'chebyshev': hdbscan.dist_metrics.ChebyshevDistance

'cityblock': hdbscan.dist_metrics.ManhattanDistance

'dice': hdbscan.dist_metrics.DiceDistance

'euclidean': hdbscan.dist_metrics.EuclideanDistance

'hamming': hdbscan.dist_metrics.HammingDistance

'haversine': hdbscan.dist_metrics.HaversineDistance

'infinity': hdbscan.dist_metrics.ChebyshevDistance

'jaccard': hdbscan.dist_metrics.JaccardDistance

'kulsinski': hdbscan.dist_metrics.KulsinskiDistance

'l1': hdbscan.dist_metrics.ManhattanDistance

'l2': hdbscan.dist_metrics.EuclideanDistance

'mahalanobis': hdbscan.dist_metrics.MahalanobisDistance

'manhattan': hdbscan.dist_metrics.ManhattanDistance

'matching': hdbscan.dist_metrics.MatchingDistance

'minkowski': hdbscan.dist_metrics.MinkowskiDistance

'p': hdbscan.dist_metrics.MinkowskiDistance

'pyfunc': hdbscan.dist_metrics.PyFuncDistance

'rogerstanimoto': hdbscan.dist_metrics.RogersTanimotoDistance

'russellrao': hdbscan.dist_metrics.RussellRaoDistance

'seuclidean': hdbscan.dist_metrics.SEuclideanDistance

'sokalmichener': hdbscan.dist_metrics.SokalMichenerDistance

'sokalsneath': hdbscan.dist_metrics.SokalSneathDistance

'wminkowski': hdbscan.dist_metrics.WMinkowskiDistance

https://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html

https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html

#### **Types of Divergences**

##### **f-Divergence**

https://en.m.wikipedia.org/wiki/F-divergence

The Hellinger distance is a type of f-divergence

https://en.m.wikipedia.org/wiki/Hellinger_distance
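An f-divergence has the form D_f(P || Q) = Σ q(x) f(p(x) / q(x)) for a convex function f with f(1) = 0. The sketch below assumes discrete distributions with q(x) > 0 and p(x) > 0 everywhere; the generator names are illustrative. With the normalization H² = ½ Σ (√p - √q)², the generator (√t - 1)² yields twice the squared Hellinger distance.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x) / q(x)), assuming q(x) > 0 everywhere."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

kl_generator = lambda t: t * np.log(t)                   # Kullback-Leibler (needs t > 0)
hellinger_generator = lambda t: (np.sqrt(t) - 1) ** 2    # 2 * squared Hellinger distance
tv_generator = lambda t: 0.5 * np.abs(t - 1)             # total variation distance

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

d_kl = f_divergence(p, q, kl_generator)
d_hellinger = f_divergence(p, q, hellinger_generator)
d_tv = f_divergence(p, q, tv_generator)
print(d_kl, d_hellinger, d_tv)
```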

##### **Bregman Divergence**

https://en.m.wikipedia.org/wiki/Bregman_divergence

The squared Euclidean divergence is a Bregman divergence (corresponding to the function x<sup>2</sup>), but not an f-divergence.

https://en.m.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_distance
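A Bregman divergence is generated by a convex function F as D_F(p, q) = F(p) - F(q) - ⟨∇F(q), p - q⟩. The following sketch (helper names are assumptions) recovers the squared Euclidean distance from F(x) = ‖x‖² and the Kullback–Leibler divergence from the negative entropy on probability vectors.

```python
import numpy as np

def bregman_divergence(p, q, F, grad_F):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q> for a convex function F."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(F(p) - F(q) - np.dot(grad_F(q), p - q))

# F(x) = ||x||^2 generates the squared Euclidean distance.
sq_norm = lambda x: float(np.dot(x, x))
sq_norm_grad = lambda x: 2.0 * x

# F(x) = sum x log x (negative entropy) generates KL on probability vectors.
neg_entropy = lambda x: float(np.sum(x * np.log(x)))
neg_entropy_grad = lambda x: np.log(x) + 1.0

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

d_sq = bregman_divergence(p, q, sq_norm, sq_norm_grad)
d_kl = bregman_divergence(p, q, neg_entropy, neg_entropy_grad)
print(d_sq, d_kl)
```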

##### **Squared Euclidean Distance**

* Squared Euclidean distance is of central importance in estimating parameters of statistical models, where it is used in the method of least squares, a standard approach to regression analysis. 

* The corresponding loss function is the squared error loss (SEL), and places progressively greater weight on larger errors. The corresponding risk function (expected loss) is mean squared error (MSE).

* **Squared Euclidean distance is not a metric**, as it does not satisfy the triangle inequality. However, **it is a more general notion of distance, namely a divergence** (specifically a Bregman divergence), and can be used as a statistical distance. 

https://en.m.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_distance

##### **Kullback–Leibler divergence**

The only divergence that is both an f-divergence and a Bregman divergence is the Kullback–Leibler divergence

https://en.m.wikipedia.org/wiki/Kullback–Leibler_divergence

##### **Jensen–Shannon divergence**

https://en.m.wikipedia.org/wiki/Jensen–Shannon_divergence
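A minimal sketch of the Jensen–Shannon divergence, defined as the average KL divergence of each distribution from their mixture m = (p + q) / 2. The helper names and example distributions are assumptions. Unlike KL, it is symmetric and stays finite even when the supports are disjoint, where it reaches its maximum of ln 2 (in nats).

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)               # mixture; positive wherever p or q is positive
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])            # disjoint support
r = np.array([0.7, 0.3])

print(js_divergence(p, q), js_divergence(p, r), js_divergence(r, p))
```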

### **Similarity & Distances**

#### **Similarity Learning & Similarity Measures**

*Similarity measures are used for nominally or ordinally scaled variables.*

##### **Overview**

In statistics, particularly multivariate statistics, one is interested in measuring the similarity between different objects, and defines similarity and distance measures for this purpose. **These are not measures in the mathematical sense**; the term refers only to the measurement of a particular quantity.

https://de.m.wikipedia.org/wiki/Ähnlichkeitsanalyse

https://de.m.wikipedia.org/wiki/Distanzfunktion

https://en.m.wikipedia.org/wiki/Distance_(graph_theory)

https://en.m.wikipedia.org/wiki/Distance

**Similarity Learning**

Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are. It has applications in ranking, in recommendation systems, visual identity tracking, face verification, and speaker verification.

Related approaches learn a distance metric from examples.

https://en.m.wikipedia.org/wiki/Similarity_learning

Triplet Loss (a loss function for machine learning algorithms) is often used for learning similarity for the purpose of learning embeddings, like word embeddings and even thought vectors, and metric learning.

https://en.m.wikipedia.org/wiki/Triplet_loss
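A hedged sketch of the triplet loss on fixed embedding vectors: the loss is zero once the anchor is closer to the positive example than to the negative one by at least a margin. The Euclidean distance and the margin of 1.0 are assumptions; some formulations use squared distances instead.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(d_pos - d_neg + margin, 0.0))

anchor = np.array([0.0, 0.0])
near = np.array([0.0, 1.0])
far = np.array([3.0, 4.0])

loss_good = triplet_loss(anchor, positive=near, negative=far)   # already well separated
loss_bad = triplet_loss(anchor, positive=far, negative=near)    # embedding ranks them wrongly
print(loss_good, loss_bad)
```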

**Similarity Measure**

* In statistics and related fields, a similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects. 

* Although no single definition of a similarity measure exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects.

* Cosine similarity is a commonly used similarity measure for real-valued vectors, used in (among other fields) information retrieval to score the similarity of documents in the vector space model. In machine learning, common kernel functions such as the RBF kernel can be viewed as similarity functions.

https://en.m.wikipedia.org/wiki/Similarity_measure
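The two similarity functions mentioned above can be sketched as follows. The function names and the `gamma` parameter value are assumptions; the RBF kernel is written as exp(-gamma ‖u - v‖²).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel exp(-gamma * ||u - v||^2): 1 for identical points, near 0 for distant ones."""
    return float(np.exp(-gamma * np.sum((u - v) ** 2)))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])    # same direction as a, different length
c = np.array([0.0, 1.0])    # orthogonal to a

print(cosine_similarity(a, b), cosine_similarity(a, c))
print(rbf_kernel(a, a), rbf_kernel(a, 10 * c))
```

Both functions take large values for similar objects and values near zero for dissimilar ones, which is the inverse behaviour of a distance.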

**Ranking & Learning to Rank**

Ranking: triplet loss with similarity learning is used in ranking because it is ordinal, in contrast to distance learning.

https://en.m.wikipedia.org/wiki/Ranking_(information_retrieval)

https://en.m.wikipedia.org/wiki/Learning_to_rank

https://en.m.wikipedia.org/wiki/Ranking

https://en.m.wikipedia.org/wiki/Information_retrieval

##### **Jaccard Coefficient**

https://de.m.wikipedia.org/wiki/Jaccard-Koeffizient
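The Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the function name and the convention of returning 1.0 for two empty sets are assumptions.

```python
def jaccard_coefficient(a, b):
    """|A ∩ B| / |A ∪ B| for two sets; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

similarity = jaccard_coefficient({1, 2, 3}, {2, 3, 4})
jaccard_distance = 1.0 - similarity
print(similarity, jaccard_distance)
```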

##### **Yule's Index**

Yule's index is a statistical measure of the uniformity or diversity of a vocabulary. It was developed by the Scottish statistician George Udny Yule and measures the probability that two randomly selected words of a text are identical, largely independently of the length of the text. Herdan later took up this index and developed it further.

https://de.m.wikipedia.org/wiki/Yules_Index
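The probability described above can be sketched directly: draw two tokens without replacement and ask how often they are the same word. This illustrates the underlying quantity only and is not Yule's exact characteristic with its conventional scaling; the function name is hypothetical.

```python
from collections import Counter

def repeat_probability(tokens):
    """Probability that two tokens drawn without replacement are the same word."""
    n = len(tokens)
    if n < 2:
        return 0.0
    counts = Counter(tokens)
    same_pairs = sum(c * (c - 1) for c in counts.values())   # ordered identical pairs
    return same_pairs / (n * (n - 1))                         # over all ordered pairs

uniform_text = "a b a b".split()
diverse_text = "every word here is different".split()
print(repeat_probability(uniform_text), repeat_probability(diverse_text))
```

A higher value indicates a more uniform (less diverse) vocabulary.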

##### **Pearson Correlation Coefficient**

https://de.m.wikipedia.org/wiki/Korrelationskoeffizient

#### **Distance Metric Learning & Distance Measures**

*Distance measures are used for metrically scaled variables (i.e. for interval and ratio scales).*

##### **Overview**

**Distance Metric Learning**

Similarity learning is closely related to distance metric learning. Metric learning is the task of learning a distance function over objects. A metric or distance function has to obey four axioms: non-negativity, identity of indiscernibles, symmetry and subadditivity (or the triangle inequality). **In practice, metric learning algorithms ignore the condition of identity of indiscernibles and learn a pseudo-metric**.

https://en.m.wikipedia.org/wiki/Similarity_learning#Metric_learning

https://en.m.wikipedia.org/wiki/Distance

**Aside**: [Distance measures](https://en.m.wikipedia.org/wiki/Distance_measures_(cosmology)) in cosmology are complicated by the [expansion of the universe](https://en.m.wikipedia.org/wiki/Expansion_of_the_universe), and by effects described by the [theory of relativity](https://en.m.wikipedia.org/wiki/Theory_of_relativity) such as [length contraction](https://en.m.wikipedia.org/wiki/Length_contraction) of moving objects.

##### **Mahalanobis distance**

* The Mahalanobis distance is a measure of the distance between a point P and a distribution D.

* If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to standard Euclidean distance in the transformed space. The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.

* In statistics, the covariance matrix of the data is sometimes used to define a distance metric called Mahalanobis distance.

* Bregman divergence: **the squared Mahalanobis distance is an example of a Bregman divergence**.

* Related: the **Bhattacharyya distance measures the similarity between two data sets (rather than between a point and a data set)**. The Mahalanobis distance is a particular case of the Bhattacharyya distance when the standard deviations of the two classes are the same.

https://en.m.wikipedia.org/wiki/Mahalanobis_distance

* Mahalanobis distance is an effective multivariate distance metric that measures the distance between a point and a distribution. 

* It is an extremely useful metric with excellent applications **in multivariate anomaly detection, classification on highly imbalanced datasets and one-class classification**. 

![alternative text](https://raw.githubusercontent.com/deltorobarba/machinelearning/master/mahalanobis.jpg)

* If the dimensions (columns in your dataset) are correlated to one another, which is typically the case in real-world datasets, the Euclidean distance between a point and the center of the points (distribution) can give little or misleading information about how close a point really is to the cluster.

* The two points above are equally distant (in Euclidean terms) from the center, but only one of them (blue) is actually close to the cluster.

* This is because Euclidean distance is a distance between two points only. It does not consider how the rest of the points in the dataset vary, so it cannot be used to judge how close a point actually is to a distribution of points.

* **What we need here is a more robust distance metric that is an accurate representation of how distant a point is from a distribution.**

So computationally, how is Mahalanobis distance different from Euclidean distance?

1. Transform the columns into uncorrelated variables.
2. Scale the columns so that their variance equals 1.
3. Finally, calculate the Euclidean distance.

https://www.machinelearningplus.com/statistics/mahalanobis-distance/
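The three steps above can be sketched and compared against the closed-form expression sqrt((x - μ)ᵀ Σ⁻¹ (x - μ)). The synthetic correlated data, the seed, and the function names are assumptions for illustration; the decorrelation step uses an eigendecomposition of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[2.0, 1.5], [1.5, 2.0]], size=500)
mu = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

def mahalanobis(x, mu, cov):
    """Closed form: sqrt((x - mu)^T cov^{-1} (x - mu))."""
    diff = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def mahalanobis_by_whitening(x, mu, cov):
    eigvals, eigvecs = np.linalg.eigh(cov)
    decorrelated = eigvecs.T @ (np.asarray(x, dtype=float) - mu)  # step 1: uncorrelated axes
    scaled = decorrelated / np.sqrt(eigvals)                      # step 2: unit variance
    return float(np.linalg.norm(scaled))                          # step 3: Euclidean distance

along = mu + np.array([2.0, 2.0])     # lies along the direction of correlation
across = mu + np.array([2.0, -2.0])   # same Euclidean distance from mu, but across it

print(mahalanobis(along, mu, cov), mahalanobis(across, mu, cov))
print(mahalanobis_by_whitening(along, mu, cov), mahalanobis_by_whitening(across, mu, cov))
```

Both points are equally far from the center in Euclidean terms, yet the point lying across the correlation direction is much farther in Mahalanobis terms.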

The Mahalanobis distance has the following properties:

* It accounts for the fact that the variances in each direction are different.

* It accounts for the covariance between variables.

* It reduces to the familiar Euclidean distance for uncorrelated variables with unit variance.

Distance in standard units

In statistics, we sometimes measure "nearness" or "farness" in terms of the scale of the data. **Often "scale" means "standard deviation."** For univariate data, we say that an observation that is one standard deviation from the mean is closer to the mean than an observation that is three standard deviations away. (You can also specify the distance between two observations by specifying how many standard deviations apart they are.)

**For many distributions, such as the normal distribution, this choice of scale also makes a statement about probability**. Specifically, it is more likely to observe an observation that is about one standard deviation from the mean than it is to observe one that is several standard deviations away. Why? Because the probability density function is higher near the mean and nearly zero as you move many standard deviations away.

**For normally distributed data, you can specify the distance from the mean by computing the so-called z-score**. For a value x, the z-score of x is the quantity z = (x-μ)/σ, where μ is the population mean and σ is the population standard deviation. This is a dimensionless quantity that you can interpret as the number of standard deviations that x is from the mean.
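As a quick illustration of the z-score, the values below use a hypothetical population mean of 100 and standard deviation of 15. In one dimension, the Mahalanobis distance reduces to the absolute z-score.

```python
import numpy as np

mu, sigma = 100.0, 15.0                  # hypothetical population parameters
x = np.array([85.0, 100.0, 145.0])

z = (x - mu) / sigma                     # number of standard deviations from the mean
mahalanobis_1d = np.sqrt((x - mu) ** 2 / sigma ** 2)   # 1-D Mahalanobis distance
print(z, mahalanobis_1d)
```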



https://blogs.sas.com/content/iml/2012/02/15/what-is-mahalanobis-distance.html

##### **Bhattacharyya distance**

* In statistics, the Bhattacharyya distance measures the similarity of two probability distributions. It is closely related to the Bhattacharyya coefficient which is a measure of the amount of overlap between two statistical samples or populations. 

* The coefficient can be used to determine the relative closeness of the two samples being considered. It is used to measure the separability of classes in classification and it is considered to be more reliable than the Mahalanobis distance, as the **Mahalanobis distance is a particular case of the Bhattacharyya distance** when the standard deviations of the two classes are the same. 

* Consequently, when two classes have similar means but different standard deviations, the Mahalanobis distance would tend to zero, whereas the Bhattacharyya distance grows depending on the difference between the standard deviations.

* Under certain conditions, the Bhattacharyya distance does not obey the triangle inequality.

https://en.m.wikipedia.org/wiki/Bhattacharyya_distance
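A sketch of both quantities, under stated assumptions: for discrete distributions the Bhattacharyya coefficient is BC = Σ √(p q) and the distance is -ln BC; for two univariate normal distributions the standard closed form is used. Function names and example values are illustrative.

```python
import math
import numpy as np

def bhattacharyya_discrete(p, q):
    """Return (coefficient, distance) for two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    coefficient = float(np.sum(np.sqrt(p * q)))    # overlap between p and q
    return coefficient, -math.log(coefficient)

def bhattacharyya_normal(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya distance between two univariate normal distributions."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    mean_term = 0.25 * (mu1 - mu2) ** 2 / var_sum
    spread_term = 0.5 * math.log(var_sum / (2.0 * sigma1 * sigma2))
    return mean_term + spread_term

bc, d_b = bhattacharyya_discrete([0.2, 0.5, 0.3], [0.4, 0.4, 0.2])
same_mean_same_std = bhattacharyya_normal(0.0, 1.0, 0.0, 1.0)
same_mean_diff_std = bhattacharyya_normal(0.0, 1.0, 0.0, 3.0)
print(bc, d_b, same_mean_same_std, same_mean_diff_std)
```

With equal means the mean term vanishes, but the distance still grows with the difference between the standard deviations.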

##### **Hellinger Distance**

* The Hellinger distance (closely related to, although different from, the Bhattacharyya distance) is used to **quantify the similarity between two probability distributions**. 

* **It is a type of f-divergence.**

* It satisfies the triangle inequality and is therefore a true metric.

https://en.m.wikipedia.org/wiki/Hellinger_distance
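For discrete distributions the Hellinger distance can be written as H(p, q) = ‖√p - √q‖₂ / √2, which satisfies H² = 1 - BC with the Bhattacharyya coefficient BC. The sketch below checks this relation and the triangle inequality on three randomly drawn distributions; the seed and function name are assumptions.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (values in [0, 1])."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0))

rng = np.random.default_rng(1)
p, q, r = rng.dirichlet(np.ones(4), size=3)   # three random distributions on 4 outcomes

bc = float(np.sum(np.sqrt(p * q)))            # Bhattacharyya coefficient
print(hellinger(p, q) ** 2, 1.0 - bc)
print(hellinger(p, r), hellinger(p, q) + hellinger(q, r))
```

Because the distance is a scaled Euclidean norm of the square-root vectors, it inherits symmetry and the triangle inequality.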

##### **Euclidean (L2) and Manhattan (L1) Distance**

Since both fulfill all four properties listed under Properties, each is not only a distance but also a metric.
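Both distances are induced by norms: the Euclidean distance by the L2 norm and the Manhattan distance by the L1 norm of the difference vector. A minimal sketch with made-up points:

```python
import numpy as np

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

euclidean_l2 = float(np.linalg.norm(x - y, ord=2))   # straight-line distance
manhattan_l1 = float(np.linalg.norm(x - y, ord=1))   # sum of absolute coordinate differences
print(euclidean_l2, manhattan_l1)
```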

## **Additional Sources**

https://franknielsen.github.io


http://yosinski.com/mlss12/MLSS-2012-Amari-Information-Geometry/


https://www.frontiersin.org/articles/10.3389/fevo.2019.00447/full


https://numerics.mathdotnet.com/Distance.html


https://research.wmz.ninja/articles/2018/03/a-brief-list-of-statistical-divergences.html


https://link.springer.com/article/10.1007/s00362-018-01082-8?shared-article-renderer


https://gmarti.gitlab.io//qfin/2020/07/01/mutual-information-is-copula-entropy.html