<a href="https://colab.research.google.com/github/deltorobarba/machinelearning/blob/master/geometry_old.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Differential (Information) Geometry - OLD**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd

## **Cost (Loss) Function**

* The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function.

* In most cases, our parametric model defines a distribution […] and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model’s predictions as the cost function.

* It is important, therefore, that the function faithfully represent our design goals. If we choose a poor error function and obtain unsatisfactory results, the fault is ours for badly specifying the goal of the search.

**Maximum Likelihood Estimation**

* Maximum likelihood seeks to find the optimum values for the parameters by maximizing a likelihood function derived from the training data.

* Given input, the model is trying to make predictions that **match the data distribution of the target variable**. Under maximum likelihood, a loss function estimates how closely the distribution of predictions made by a model matches the distribution of target variables in the training data.

* One way to interpret maximum likelihood estimation is to view it as **minimizing the dissimilarity** between the empirical distribution […] defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. […] **Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions**.

* Under appropriate conditions, the maximum likelihood estimator has the **property of consistency** […], meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter.

* Under the framework of maximum likelihood, the error between two probability distributions is measured using cross-entropy. We seek a set of model weights that minimizes the difference between the model’s predicted probability distribution and the distribution of the training data. This difference is the cross-entropy.
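
The equivalence described above can be sketched numerically. This is a minimal illustration, assuming a Bernoulli model with a single parameter `theta` and a hypothetical toy dataset (neither is from the original notes): the average negative log-likelihood equals the cross-entropy between the empirical distribution and the model, and both are minimized at the sample mean.

```python
import numpy as np

# Hypothetical toy data: 10 binary outcomes (7 ones, 3 zeros)
data = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

def avg_neg_log_likelihood(theta, x):
    """Average negative log-likelihood of a Bernoulli(theta) model."""
    return -np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats for discrete distributions."""
    return -np.sum(p * np.log(q))

# Empirical distribution over the outcomes {0, 1}
p_hat = np.array([np.mean(data == 0), np.mean(data == 1)])

# Evaluate both objectives on a grid of candidate parameters
thetas = np.linspace(0.01, 0.99, 99)
nll = np.array([avg_neg_log_likelihood(t, data) for t in thetas])
ce = np.array([cross_entropy(p_hat, np.array([1 - t, t])) for t in thetas])

theta_mle = thetas[np.argmin(nll)]
print(theta_mle)             # grid point closest to the sample mean
print(np.allclose(nll, ce))  # the two objectives coincide on the grid
```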

When using the framework of maximum likelihood estimation, we will implement a cross-entropy loss function, which often in practice means:
* a **cross-entropy** loss function for classification problems and 
* a **mean squared error** loss function for regression problems.


* Under the framework of maximum likelihood estimation and assuming a **Gaussian distribution for the target variable**, mean squared error can be considered the cross-entropy between the distribution of the model predictions and the distribution of the target variable.

* Many authors use the term “cross-entropy” to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. 

* Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.

* For example, **mean squared error is the cross-entropy between the empirical distribution and a Gaussian model**.
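
A minimal numerical check of the last point, assuming a Gaussian model with fixed unit variance and hypothetical synthetic targets and predictions: the average negative log-likelihood differs from half the mean squared error only by a constant, so minimizing one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)                      # hypothetical targets
y_pred = y_true + rng.normal(scale=0.5, size=100)  # hypothetical predictions

def gaussian_nll(y, mu, sigma=1.0):
    """Average negative log-likelihood of y under N(mu, sigma^2)."""
    return np.mean(
        0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)
    )

mse = np.mean((y_true - y_pred) ** 2)
nll = gaussian_nll(y_true, y_pred)

# With sigma = 1: NLL = 0.5 * MSE + 0.5 * log(2 * pi)
print(np.isclose(nll, 0.5 * mse + 0.5 * np.log(2 * np.pi)))
```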

https://machinelearningmastery.com/cross-entropy-for-machine-learning/

## **Cross-Entropy & Information Theory**

* Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.

* You might recall that information quantifies the number of bits required to encode and transmit an event. Lower probability events have more information, higher probability events have less information.

* Entropy is the number of bits required to transmit a randomly selected event from a probability distribution. A skewed distribution has a low entropy, whereas a distribution where events have equal probability has a larger entropy.

* In information theory, we like to describe the “surprise” of an event. Low-probability events are more surprising and therefore carry more information. Likewise, probability distributions whose events are equally likely are more surprising on average and have larger entropy.

* Skewed Probability Distribution (unsurprising): Low entropy.

* Balanced Probability Distribution (surprising): High entropy.

* Entropy can be calculated for a random variable X with a set of discrete states x in X and their probabilities P(x) as follows:

```
H(X) = -sum_{x in X} P(x) * log(P(x))
```
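
As a small illustration of the entropy formula, here is a sketch using NumPy with the base-2 logarithm. The two example distributions are hypothetical; the helper `entropy_bits` is introduced only for this example.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits; terms with P(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log2(p[nonzero]))

skewed = [0.9, 0.05, 0.03, 0.02]     # hypothetical skewed distribution
balanced = [0.25, 0.25, 0.25, 0.25]  # uniform over four events

print(entropy_bits(skewed))    # lower entropy (unsurprising)
print(entropy_bits(balanced))  # 2.0 bits, the maximum for four events
```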
Cross-entropy builds upon the idea of entropy from information theory and calculates the number of bits required to represent or transmit an average event from one distribution compared to another distribution.


* The intuition for this definition comes from considering a target or underlying probability distribution P and an approximation Q of that target: the cross-entropy of Q from P is the average total number of bits needed to represent an event drawn from P when using a code optimized for Q instead of P. (The number of additional bits, compared with using P itself, is the KL divergence.)

* The cross-entropy between two probability distributions, such as Q from P, can be stated formally as:

H(P, Q)

* Where H() is the cross-entropy function, P may be the target distribution and Q is the approximation of the target distribution.

* Cross-entropy can be calculated using the probabilities of the events from P and Q, as follows:



```
H(P, Q) = -sum_{x in X} P(x) * log(Q(x))
```

* Where P(x) is the probability of the event x in P, Q(x) is the probability of event x in Q and log is the base-2 logarithm, meaning that the results are in bits. If the base-e or natural logarithm is used instead, the result will have the units called nats.

* This calculation is for discrete probability distributions, although a similar calculation can be used for continuous probability distributions using the integral across the events instead of the sum.

* The result will be a non-negative number measured in bits and will be equal to the entropy of P if the two probability distributions are identical.
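
A minimal sketch of the cross-entropy calculation, assuming two hypothetical discrete distributions over three events in which every Q(x) is strictly positive. It also shows that H(P, P) reduces to the entropy of P and that the measure is not symmetric.

```python
import numpy as np

def cross_entropy_bits(p, q):
    """Cross-entropy H(P, Q) in bits; assumes Q(x) > 0 for every event."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p = [0.10, 0.40, 0.50]  # hypothetical target distribution P
q = [0.80, 0.15, 0.05]  # hypothetical approximation Q

h_pq = cross_entropy_bits(p, q)
h_pp = cross_entropy_bits(p, p)  # equals the entropy H(P)
h_qp = cross_entropy_bits(q, p)

print(h_pq, h_pp, h_qp)  # H(P, Q) exceeds H(P), and H(P, Q) != H(Q, P)
```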

https://machinelearningmastery.com/what-is-information-entropy/

**Cross-Entropy vs KL Divergence vs Logloss**

* Cross-entropy is a measure from the field of information theory, building upon entropy and generally calculating the difference between two probability distributions. It is closely related to but is different from **KL divergence** that calculates the relative entropy between two probability distributions, whereas cross-entropy can be thought to calculate the total entropy between the distributions.

* Cross-entropy is also related to and often confused with **logistic loss, called log loss**. Although the two measures are derived from different sources, when used as loss functions for classification models, both measures calculate the same quantity and can be used interchangeably.
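
Both relationships can be checked numerically. The sketch below uses hypothetical distributions, labels, and predicted probabilities, and natural logarithms throughout: cross-entropy decomposes as H(P, Q) = H(P) + KL(P || Q), and the binary log loss equals the mean cross-entropy between one-hot targets and the predicted class probabilities.

```python
import numpy as np

p = np.array([0.10, 0.40, 0.50])  # hypothetical target distribution
q = np.array([0.80, 0.15, 0.05])  # hypothetical approximation

entropy_p = -np.sum(p * np.log(p))
cross_entropy_pq = -np.sum(p * np.log(q))
kl_pq = np.sum(p * np.log(p / q))

# H(P, Q) = H(P) + KL(P || Q)
print(np.isclose(cross_entropy_pq, entropy_p + kl_pq))

# Log loss for hypothetical binary labels and predicted probabilities
y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
log_loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Same quantity as the mean cross-entropy between the one-hot target
# [1 - y, y] and the predicted distribution [1 - y_hat, y_hat]
targets = np.stack([1 - y, y], axis=1)
preds = np.stack([1 - y_hat, y_hat], axis=1)
mean_ce = np.mean(-np.sum(targets * np.log(preds), axis=1))
print(np.isclose(log_loss, mean_ce))
```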



## **Metric (Similarity) Learning**

* Triplet loss is probably the most popular loss function in metric learning.
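
A minimal sketch of a triplet loss, assuming Euclidean distance and a margin of 1.0 (some formulations use squared distances instead). The embeddings are hypothetical; the loss is zero once the negative is at least `margin` farther from the anchor than the positive.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

# Hypothetical 2-D embeddings
anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
negative = np.array([3.0, 4.0])

loss_ok = triplet_loss(anchor, positive, negative)
loss_swapped = triplet_loss(anchor, negative, positive)

print(loss_ok)       # 0.0: the negative is already far enough away
print(loss_swapped)  # about 5.9: roles swapped, so the loss is large
```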

https://en.m.wikipedia.org/wiki/Similarity_learning

https://towardsdatascience.com/metric-learning-loss-functions-5b67b3da99a5

## **Distance & Divergence**

**Conditions**

1. d(x, y) ≥ 0     (non-negativity)
2. d(x, y) = 0   if and only if   x = y     (identity of indiscernibles; note that conditions 1 and 2 together produce positive definiteness)
3. d(x, y) = d(y, x)     (symmetry)
4. d(x, z) ≤ d(x, y) + d(y, z)     (subadditivity / triangle inequality).
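
The four conditions can be spot-checked for the Euclidean distance on a few random points. This is only a sanity check on hypothetical sample points, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))  # three hypothetical points in R^5

def d(a, b):
    """Euclidean distance between two vectors."""
    return np.linalg.norm(a - b)

non_negative = d(x, y) >= 0
identity = d(x, x) == 0 and d(x, y) > 0
symmetric = np.isclose(d(x, y), d(y, x))
triangle = d(x, z) <= d(x, y) + d(y, z)

print(non_negative, identity, symmetric, triangle)
```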

**Distances**

For continuous data:

* Euclidean Distance
* Manhattan Distance
* Canberra Distance
* Bray Curtis Distance
* Cosine Distance
* Correlation Distance
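
The distances listed above are available in `scipy.spatial.distance` (Manhattan distance is called `cityblock` there). A short sketch on two hypothetical vectors:

```python
import numpy as np
from scipy.spatial import distance

# Two hypothetical vectors
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 7.0])

results = {
    "euclidean": distance.euclidean(u, v),
    "manhattan": distance.cityblock(u, v),
    "canberra": distance.canberra(u, v),
    "bray_curtis": distance.braycurtis(u, v),
    "cosine": distance.cosine(u, v),
    "correlation": distance.correlation(u, v),
}

for name, value in results.items():
    print(f"{name}: {value:.4f}")
```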

**Divergences**

* A divergence is a (contrast) function which establishes the "distance" from one probability distribution to another on a statistical manifold.
* A divergence is a weaker notion than a distance: in particular, it need not be symmetric (that is, in general the divergence from p to q is not equal to the divergence from q to p), and it need not satisfy the triangle inequality.
* The two most important divergences are the relative entropy (Kullback–Leibler divergence, KL divergence) and the squared Euclidean distance.
* Minimizing these two divergences is the main way that linear inverse problems are solved, via the principle of maximum entropy and least squares, notably in logistic regression and linear regression.
* The two most important classes of divergences are the f-divergences and Bregman divergences; however, other types of divergence functions are also encountered in the literature. The only divergence that is both an f-divergence and a Bregman divergence is the Kullback–Leibler divergence; the squared Euclidean divergence is a Bregman divergence (corresponding to the function x<sup>2</sup>), but not an f-divergence.
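
To make the Bregman construction concrete, here is a minimal sketch using the standard definition D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>. The helper names and the example distributions are hypothetical. With F(x) = sum of x<sup>2</sup> it gives the squared Euclidean distance, and with the negative entropy F(x) = sum of x log x it gives the KL divergence for normalized distributions.

```python
import numpy as np

def bregman(F, grad_F, p, q):
    """Bregman divergence D_F(p, q) for a convex function F."""
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

def sq_norm(x):
    return np.sum(x ** 2)

def sq_norm_grad(x):
    return 2 * x

def neg_entropy(x):
    return np.sum(x * np.log(x))

def neg_entropy_grad(x):
    return np.log(x) + 1

p = np.array([0.10, 0.40, 0.50])  # hypothetical distribution
q = np.array([0.80, 0.15, 0.05])  # hypothetical distribution

d_sq = bregman(sq_norm, sq_norm_grad, p, q)
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)

print(np.isclose(d_sq, np.sum((p - q) ** 2)))        # squared Euclidean
print(np.isclose(d_kl, np.sum(p * np.log(p / q))))   # KL divergence
```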

## **Find the similarity between two probability distributions**

Using Jensen Shannon Divergence to build a tool to find the distance between probability distributions using Python.

I was on a mission to find a good measure of difference between two probability distributions. After doing a lot of research online, taking feedback from my colleagues, and validating various methods, I found one that does a really good job.

My problem statement could be solved by calculating the statistical distance between the two probability distributions. To do this, I found out that Jensen Shannon Distance can be used.

Jensen-Shannon Divergence (JSD) is a measure derived from another measure of statistical distance called the Kullback-Leibler Divergence (KLD). The reason why I couldn’t use the KLD is that it’s an asymmetrical function: D(P||Q) generally differs from D(Q||P). Since there might have been a lot of distance calculations required, it posed a risk.

JSD, on the other hand, is a symmetrical function, and the square root of JSD gives the Jensen-Shannon Distance, a measure that we can use to find the similarity between the two probability distributions. With the base-2 logarithm, 0 indicates that the two distributions are the same, and 1 indicates that they have no overlap at all. With the natural logarithm the maximum is sqrt(ln 2) ≈ 0.83 instead of 1.

The Jensen-Shannon Divergence is defined as

JSD(P || Q) = 1/2 * D(P || M) + 1/2 * D(Q || M)

where P and Q are the two probability distributions, M = (P + Q) / 2, and D(P || M) is the KLD between P and M. Similarly, D(Q || M) is the KLD between Q and M.

**Implementation in Python**

Now that we know the formula, it’s time to implement it. First of all, we need to calculate M and also the KLD between P and M and between Q and M.

SciPy is a phenomenal Python library for scientific computing and it has lots of statistical measures built in. It turns out that `scipy.stats.entropy` computes the KLD when it is given a second distribution as its `qk` argument. Just what we want.

(I found it to be quite simple to implement it with python and I got really good results when I tested it with a few distributions.)


In [None]:
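# Create test data: three raw samples each from a Rayleigh and a Weibull distribution.
# The values do not sum to 1; scipy.stats.entropy rescales its inputs to sum to 1.
p = np.random.rayleigh(3, 3)
q = np.random.weibull(3, 3)
p, q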

(array([3.84848271, 4.18906706, 5.61567569]),
 array([1.036487  , 0.9192782 , 0.63485667]))

In [None]:
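# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html
# Calculate the entropy of a distribution for given probability values.
# Stacking p and q gives a 2x3 array; with the default axis=0 each column
# [p_i, q_i] is normalized and its entropy returned, hence three values.
from scipy.stats import entropy
entropy([p, q], base=None)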

array([0.51682914, 0.471327  , 0.32851553])

In [None]:
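from scipy.stats import entropy

m = (p + q) / 2

# KL divergence D(p || m), one of the two terms of the Jensen-Shannon Divergence.
# For 1-D inputs this is a single scalar; entropy() normalizes p and m first.
divergence = entropy(p, m)

divergence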

In [None]:
# Create function to compute distance
import numpy as np
from scipy.stats import entropy

def jensen_shannon_distance(p, q):
    """
    Compute the Jensen-Shannon Distance
    between two probability distributions.
    """

    # convert the inputs into numpy arrays in case they aren't already
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)

    # normalize so that each input sums to 1 before averaging
    p = p / np.sum(p, axis=0)
    q = q / np.sum(q, axis=0)

    # calculate m
    m = (p + q) / 2

    # compute the Jensen-Shannon Divergence (natural log, so in nats)
    divergence = (entropy(p, m) + entropy(q, m)) / 2

    # compute the Jensen-Shannon Distance
    distance = np.sqrt(divergence)

    return distance

In [None]:
# Use the test data p and q created above
print(jensen_shannon_distance(p, q))
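
As a cross-check, SciPy also ships `scipy.spatial.distance.jensenshannon`, which returns the Jensen-Shannon Distance directly. The sketch below compares it with the manual formula on two hypothetical, already normalized distributions, using `base=2` so that the result lies between 0 and 1.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

p = np.array([0.10, 0.40, 0.50])  # hypothetical distribution P
q = np.array([0.80, 0.15, 0.05])  # hypothetical distribution Q
m = (p + q) / 2

# Manual formula: square root of the average of the two KL divergences
manual = np.sqrt((entropy(p, m, base=2) + entropy(q, m, base=2)) / 2)
builtin = jensenshannon(p, q, base=2)
print(np.isclose(manual, builtin))

same = jensenshannon(p, p, base=2)                    # identical distributions
disjoint = jensenshannon([1.0, 0.0], [0.0, 1.0], base=2)  # no overlap
print(same, disjoint)  # 0 for identical inputs, 1 for disjoint support
```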


https://en.wikipedia.org/wiki/Gromov%E2%80%93Hausdorff_convergence