# The Kullback-Leibler Divergence

The KL divergence is used as a measure of how close an approximation to a probability distribution is to the true probability distribution it approximates.  In this notebook, we try to gain some intuition about the magnitude of the KL divergence by computing its value between two Gaussian PDFs as a function of the "precision" and "tension" between them.

### Requirements

You'll need the `qp` package and its dependencies (notably `numpy`, `scipy` and `matplotlib`).

## Background

The Kullback-Leibler divergence between probability distributions $P$ and $Q$ is:

$D(P||Q) = \int_{-\infty}^{\infty} \log \left( \frac{P(x)}{Q(x)} \right) P(x) dx$

The Wikipedia page for the KL divergence gives the following useful interpretation of the KLD:

> KL divergence is a measure of the difference between two probability distributions $P$ and $Q$. It is not symmetric in $P$ and $Q$. In applications, $P$ typically represents ... a precisely calculated theoretical distribution, while $Q$ typically represents ... [an] approximation of $P$.
>
> Specifically, the Kullback–Leibler divergence from $Q$ to $P$, denoted $D_{KL}(P‖Q)$, is a measure of the information gained when one revises one's beliefs from ... $Q$ to ... $P$. In other words, it is the amount of information lost when $Q$ is used to approximate $P$.
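
The definition above can be checked directly by numerical integration. The following is a minimal sketch using only the Python standard library; the helper names `gaussian_pdf` and `kld_numeric`, the integration range and the grid size are illustrative assumptions, not part of `qp`. It evaluates the integral with the trapezoidal rule for two Gaussians of different width and shows that swapping the roles of $P$ and $Q$ changes the answer.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def kld_numeric(p, q, lo, hi, n=20001):
    """Trapezoidal estimate of D(P||Q), the integral of P log(P/Q) over [lo, hi]."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        px, qx = p(x), q(x)
        # Skip grid points where either density has underflowed to zero.
        if px > 0.0 and qx > 0.0:
            weight = 0.5 if i in (0, n - 1) else 1.0
            total += weight * px * math.log(px / qx)
    return total * dx

# Hypothetical test case: unit Gaussian versus a broader, offset Gaussian.
P = lambda x: gaussian_pdf(x, 0.0, 1.0)
Q = lambda x: gaussian_pdf(x, 1.0, 2.0)

d_pq = kld_numeric(P, Q, -20.0, 20.0)  # P treated as the truth
d_qp = kld_numeric(Q, P, -20.0, 20.0)  # Q treated as the truth
print(d_pq, d_qp)
```

For this pair, the closed-form Gaussian result derived later in this notebook gives $\log 2 - 1/4 \approx 0.443$ nats and $2 - \log 2 \approx 1.307$ nats respectively, which illustrates the asymmetry.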


## 1D Gaussian Illustration

"Information" is not a terribly familiar quantity to most of us, so let's compute the KLD between two Gaussians:

* The "True" 1D Gaussian PDF, $P(x)$, of unit width and central value 0

* An "approximating" 1D Gaussian PDF, $Q$, of width $\sigma$ and centroid $x$

How does the KLD between these PDFs vary with offset $x$ and width $\sigma$?

In [None]:
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
%matplotlib inline

import qp

import numpy as np
import scipy.stats as sps

In [None]:
P = qp.PDF(truth=sps.norm(loc=0.0, scale=1.0))

In [None]:
x, sigma = 2.0, 1.0
Q = qp.PDF(truth=sps.norm(loc=x, scale=sigma))

In [None]:
infinity = 100.0
D = qp.utils.calculate_kl_divergence(P, Q, limits=(-infinity,infinity), vb=False)
print(D)

That is, two equal-width Gaussians separated by 2 sigma, and hence overlapping at their 1-sigma points, have a KLD of 2 nats.

> The unit of information here is a "nat" rather than a "bit" because `qp` uses a natural logarithm in its KLD calculation. 1 nat = $1/\log{2} \approx 1.44$  bits. 
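
As a small illustration of the unit conversion, here is a sketch with two hypothetical helper functions (`nats_to_bits` and `bits_to_nats` are names chosen for this example, not `qp` functions).

```python
import math

def nats_to_bits(nats):
    """Convert information measured in nats to bits (1 nat = 1/log(2) bits)."""
    return nats / math.log(2.0)

def bits_to_nats(bits):
    """Convert information measured in bits to nats."""
    return bits * math.log(2.0)

# The 2-nat KLD from the offset example above, expressed in bits.
kld_bits = nats_to_bits(2.0)
print(kld_bits)
```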

What if the two Gaussians are perfectly aligned, but the approximation is broader than the truth?

In [None]:
x, sigma = 0.0, 4.37
Q = qp.PDF(truth=sps.norm(loc=x, scale=sigma))
D = qp.utils.calculate_kl_divergence(P, Q, limits=(-infinity,infinity), vb=False)
print(D)

That is, two concentric 1D Gaussian PDFs differing in width by a factor of 4.37 have a KLD of 1 nat.

## Analytic Formulae

It is illustrative to consider the KL divergence (in nats) from an approximating Gaussian $Q$ of mean $\mu$ and variance $\sigma^{2}$ to a true Gaussian $P$ of mean $\mu_{0}$ and variance $\sigma_{0}^{2}$.

\begin{align*}
D &= \int_{-\infty}^{\infty}\ P(x)\ \log\left[\frac{P(x)}{Q(x)}\right]\ dx\\
&= \int_{-\infty}^{\infty}\ \frac{1}{\sqrt{2\pi}\sigma_{0}}\exp\left[-\frac{(x-\mu_{0})^{2}}{2\sigma_{0}^{2}}\right]\ \log\left[\frac{\frac{1}{\sqrt{2\pi}\sigma_{0}}\exp\left[-\frac{(x-\mu_{0})^{2}}{2\sigma_{0}^{2}}\right]}{\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]}\right]\ dx\\
&= \frac{1}{\sqrt{2\pi}\sigma_{0}}\int_{-\infty}^{\infty}\ \exp\left[-\frac{(x-\mu_{0})^{2}}{2\sigma_{0}^{2}}\right]\ \left(\log\left[\frac{\sigma}{\sigma_{0}}\right]-\frac{(x-\mu_{0})^{2}}{2\sigma_{0}^{2}}+\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)\ dx\\
&= \frac{1}{\sqrt{2\pi}\sigma_{0}}\left(\log\left[\frac{\sigma}{\sigma_{0}}\right]\int_{-\infty}^{\infty}\ \exp\left[-\frac{(x-\mu_{0})^{2}}{2\sigma_{0}^{2}}\right]\ dx-\int_{-\infty}^{\infty}\ \frac{(x-\mu_{0})^{2}}{2\sigma_{0}^{2}}\ \exp\left[-\frac{(x-\mu_{0})^{2}}{2\sigma_{0}^{2}}\right]\ dx+\int_{-\infty}^{\infty}\ \frac{(x-\mu)^{2}}{2\sigma^{2}}\ \exp\left[-\frac{(x-\mu_{0})^{2}}{2\sigma_{0}^{2}}\right]\ dx\right)\\
&= \frac{1}{\sqrt{2\pi}\sigma_{0}}\left(\log\left[\frac{\sigma}{\sigma_{0}}\right]\left(-\sqrt{\frac{\pi}{2}}\sigma_{0}\ erf\left[\frac{\mu_{0}-x}{\sqrt{2}\sigma_{0}}\right]\right)|_{-\infty}^{\infty}-\left(\frac{\mu_{0}-x}{2}\exp\left[-\frac{(x-\mu_{0})^{2}}{2\sigma_{0}^{2}}\right]-\frac{1}{2}\sqrt{\frac{\pi}{2}}\sigma_{0}\ erf\left[\frac{\mu_{0}-x}{\sqrt{2}\sigma_{0}}\right]\right)|_{-\infty}^{\infty}+\left(-\frac{1}{2\sigma^{2}}\left(\sqrt{\frac{\pi}{2}}\sigma_{0}((\mu_{0}-\mu)^{2}+\sigma_{0}^{2})\ erf\left[\frac{\mu_{0}-x}{\sqrt{2}\sigma_{0}}\right]+\sigma_{0}^{2}(x-\mu+\mu_{0}-\mu)\exp\left[-\frac{(x-\mu_{0})^{2}}{2\sigma_{0}^{2}}\right]\right)\right)|_{-\infty}^{\infty}\right)\\
&= \frac{1}{\sqrt{2\pi}\sigma_{0}}\left(\log\left[\frac{\sigma}{\sigma_{0}}\right]\left(2\sqrt{\frac{\pi}{2}}\sigma_{0}\right)-\left(0+2\cdot\frac{1}{2}\sqrt{\frac{\pi}{2}}\sigma_{0}\right)-\frac{1}{2\sigma^{2}}\left(-2\sqrt{\frac{\pi}{2}}\sigma_{0}((\mu_{0}-\mu)^{2}+\sigma_{0}^{2})+0\right)\right)\\
&= \log\left[\frac{\sigma}{\sigma_{0}}\right]-\frac{1}{2}+\frac{1}{2}\left(\frac{(\mu_{0}-\mu)^{2}}{\sigma^{2}}+\frac{\sigma_{0}^{2}}{\sigma^{2}}\right)
\end{align*}

We note that when $\mu=\mu_{0}$ and $\sigma^{2}=\sigma_{0}^{2}$, we have $D=0$, as expected.
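
The final line of the derivation is easy to evaluate directly. The sketch below (the function name `gaussian_kld` and its argument order are assumptions for this example) implements the closed form and checks it against the two numerical examples computed earlier in this notebook.

```python
import math

def gaussian_kld(mu0, sigma0, mu, sigma):
    """Closed-form KLD in nats from the approximation N(mu, sigma^2)
    to the truth N(mu0, sigma0^2)."""
    return (math.log(sigma / sigma0) - 0.5
            + 0.5 * ((mu0 - mu) ** 2 + sigma0 ** 2) / sigma ** 2)

d_offset = gaussian_kld(0.0, 1.0, 2.0, 1.0)   # equal widths, offset of 2
d_broad = gaussian_kld(0.0, 1.0, 0.0, 4.37)   # aligned, 4.37 times broader
d_same = gaussian_kld(0.0, 1.0, 0.0, 1.0)     # identical distributions
print(d_offset, d_broad, d_same)
```

The first value is 2 nats, the second is close to 1 nat, and the third vanishes, in agreement with the statements above.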

### Precision

From the above, it is natural to define the "precision" as $\alpha\equiv\frac{\sigma}{\sigma_{0}}$, the ratio of the approximating width to the true width.  This corresponds to the increase in precision going from the approximation to the truth.  For aligned means, the KL divergence is $D=\log\alpha+\frac{1}{2}\alpha^{-2}-\frac{1}{2}$, so $D\sim\log\alpha+\alpha^{-2}$.  In the limit of $\sigma\gg\sigma_{0}$, the logarithm dominates and we will have $\alpha\sim\exp[D]$.  In the limit of $\sigma\ll\sigma_{0}$, the inverse-square term dominates and we will have $\alpha\sim D^{-1/2}$.  Precision lost when using the approximation is then proportional to the exponential of the KL divergence for an approximation less restrictive than the truth and the inverse square root of the KL divergence for an approximation more restrictive than the truth.  

### Tension

It is also natural to define the "tension" as $t\equiv\frac{|\mu_{0}-\mu|}{\sqrt{\sigma_0^2 + \sigma^2}}$.  The offset term of the KL divergence is then $\frac{1}{2}t^{2}\left(1+\frac{\sigma_{0}^{2}}{\sigma^{2}}\right)$, so at fixed widths $D\sim t^{2}$ and $t\sim\sqrt{D}$.  The KL divergence is the information lost when using the approximation, so the information loss rises in proportion to the tension squared.  Since $t$ has, in some sense, "units" of "$\sigma$", the KL divergence might provide a route to a generalized quantification of tension: the square root of the KL divergence between a true distribution and its approximation, in nats, gives an approximate sense of the tension between the two distributions, in "units" of "$\sigma$".
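
To make the relation between $D$ and $t$ concrete, the sketch below rewrites the closed-form Gaussian KLD in terms of the tension, using $(\mu_{0}-\mu)^{2}/\sigma^{2} = t^{2}(1+\sigma_{0}^{2}/\sigma^{2})$. The function names `tension` and `kld_from_tension` are assumptions made for this example.

```python
import math

def tension(mu0, sigma0, mu, sigma):
    """Offset between centroids in units of the quadrature sum of the widths."""
    return abs(mu0 - mu) / math.sqrt(sigma0 ** 2 + sigma ** 2)

def kld_from_tension(t, sigma0, sigma):
    """Gaussian KLD in nats written as a width-only term plus a term in t^2."""
    ratio2 = (sigma0 / sigma) ** 2
    width_term = math.log(sigma / sigma0) - 0.5 + 0.5 * ratio2
    return width_term + 0.5 * t ** 2 * (1.0 + ratio2)

# Equal-width example from earlier: unit Gaussians offset by 2.
t = tension(0.0, 1.0, 2.0, 1.0)
d = kld_from_tension(t, 1.0, 1.0)
print(t, d)
```

For equal widths the width-only term vanishes and the coefficient of $t^{2}$ is exactly 1, so $D = t^{2}$; for unequal widths the coefficient lies between $\frac{1}{2}$ and infinity depending on $\sigma_{0}/\sigma$.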

## Approximation Precision

Suppose our approximating PDF is broader than the true PDF, but the centroids are aligned. If we were to use our approximation, we'd be over-estimating the uncertainty in the inference. The approximating PDF represents a lower _precision_ measurement. Let's look at how the KLD quantifies this change in precision, as a function of the change in PDF width. 

In [None]:
widths = np.logspace(-3.0,3.0,13)
D = np.empty_like(widths)

x = 0.0
infinity = 1000.0

for k,sigma in enumerate(widths):
    Q = qp.PDF(truth=sps.norm(loc=x, scale=sigma))
    D[k] = qp.utils.calculate_kl_divergence(P, Q, limits=(-infinity,infinity), vb=False)
    
print(list(zip(widths, D)))

In [None]:
x = widths
y = np.log(widths)+0.5*widths**-2-0.5
plt.plot(x, y, color='gray', linestyle='-', lw=8.0, alpha=0.5, label=r'$\log[\sigma]+\frac{1}{2}\sigma^{-2}-\frac{1}{2}$')

plt.plot(widths, D, color='black', linestyle='-', lw=2.0, alpha=1.0, label='Offset=0.0')
plt.xscale('log')
plt.ylim(0.0,32.0)
plt.xlabel(r'Width of approximating Gaussian $\sigma$')
plt.ylabel('KL divergence (nats)')
l = plt.legend(loc='upper right')

It looks as though using an increasingly broad approximation distribution leads to logarithmically increasing information loss. When the approximating distribution gets _narrower_ than the truth, information is lost at a faster rate, which is interesting.  The mismatch at low $\sigma$ may be due to floating point precision.
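
The two regimes can be checked against the analytic curve without any numerical integration. The following sketch assumes aligned centroids and a unit-width truth; `aligned_kld` is a hypothetical helper name. It compares the exact closed form with its leading term on each side: $\log\sigma$ for broad approximations and $\frac{1}{2}\sigma^{-2}$ for narrow ones.

```python
import math

def aligned_kld(sigma, sigma0=1.0):
    """Closed-form KLD in nats for aligned Gaussians of widths sigma0 (truth)
    and sigma (approximation)."""
    r = sigma / sigma0
    return math.log(r) + 0.5 / r ** 2 - 0.5

for sigma in (1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3):
    d = aligned_kld(sigma)
    # Leading term: inverse square for narrow, logarithm for broad approximations.
    leading = 0.5 / sigma ** 2 if sigma < 1.0 else math.log(sigma)
    print("sigma=%g  D=%g  leading term=%g" % (sigma, d, leading))
```

Because these values come from the closed form, any departure of the numerically integrated curve from them at small widths reflects the integration rather than the KLD itself.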

## Tension between PDFs

Two measurements that disagree with each other will lead to parameter PDFs that have different centroids. Let's tabulate the KLD for a range of distribution offsets.

In [None]:
separations = np.linspace(0.0,15.0,16)
D = np.empty_like(separations)

sigma = 1.0
infinity = 100.0

for k,x0 in enumerate(separations):
    Q = qp.PDF(truth=sps.norm(loc=x0, scale=sigma))
    D[k] = qp.utils.calculate_kl_divergence(P, Q, limits=(-infinity,infinity), vb=False)
    
print(list(zip(separations, D)))

In [None]:
plt.plot(separations, D, color='k', linestyle='-', lw=2.0, alpha=1.0, label='Width=1.0')
plt.plot(separations, 0.5*separations**2, color='gray', linestyle='-', lw=8.0, alpha=0.5, label=r'$\frac{1}{2}\mu^{2}$')
plt.xlabel('Separation between Gaussians')
plt.ylabel('KL divergence (nats)')
l = plt.legend(loc='upper left')

> For separations greater than about 7 sigma, numerical precision starts to matter: the overlap integral out here is smaller than machine precision. `qp` uses a `safelog` function that replaces values smaller than the system threshold value with that threshold; the log of that threshold is:

In [None]:
import sys
print(np.log(sys.float_info.epsilon))

> Probably the precision analysis of the previous section suffered from the same type of numerical error, at very low approximation distribution widths.
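
To illustrate the kind of flooring described above, here is a minimal sketch of a thresholded logarithm. The name `safe_log`, its signature and its default threshold are assumptions for illustration only; `qp`'s own `safelog` may be implemented differently. The sketch also estimates how far from its mean a unit-width Gaussian density falls below the threshold, which is where such a floor would start to alter the integrand.

```python
import math
import sys

def safe_log(value, threshold=sys.float_info.epsilon):
    """Natural log with the input floored at threshold.
    Hypothetical helper, not qp's actual implementation."""
    return math.log(max(value, threshold))

# The smallest value this function can return.
floor = safe_log(0.0)

# Distance (in sigma) at which a unit Gaussian density drops below the threshold:
# solve -z^2/2 - log(sqrt(2 pi)) = floor for z.
z_floor = math.sqrt(2.0 * (-floor - 0.5 * math.log(2.0 * math.pi)))
print(floor, z_floor)
```

Under this assumed form, the logarithm of the approximating density can never be more negative than `floor`, so the computed KLD saturates once most of the true distribution's mass lies beyond `z_floor` from the approximation's centroid.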

Presumably what matters is the "tension" between the two distributions: the separation in units of the combined width, $\sqrt{\sigma_0^2 + \sigma^2}$. This quantity comes up often in discussions of dataset combination, where the difference in centroids of two posterior PDFs needs to be expressed in terms of their widths: the quadratic sum makes sense in this context, since it would appear in the product of the two likelihoods were they to be combined.

For a few different widths, let's plot KLD vs tension.

In [None]:
infinity = 100.0
widths = np.array([1.0,1.5,2.0,2.5,3.0,3.5,4.0]) 
separations = np.linspace(0.0,7.0,15)

D = np.zeros([len(widths),len(separations)])
tensions = np.empty_like(D)

for j,sigma in enumerate(widths):
    
    for k,x0 in enumerate(separations):
        Q = qp.PDF(truth=sps.norm(loc=x0, scale=sigma))
        D[j,k] = qp.utils.calculate_kl_divergence(P, Q, limits=(-infinity,infinity), vb=False)
        tensions[j,k] = x0 / np.sqrt(sigma*sigma + 1.0)
        

In [None]:
x = tensions[0,:]
y = x**2
plt.plot(x, y, color='gray', linestyle='-', lw=8.0, alpha=0.5, label='$t^2$')

plt.plot(tensions[0,:], D[0,:], color='black', linestyle='-', lw=2.0, alpha=1.0, label='Width=1.0')
plt.plot(tensions[1,:], D[1,:], color='violet', linestyle='-', lw=2.0, alpha=1.0, label='Width=1.5')
plt.plot(tensions[2,:], D[2,:], color='blue', linestyle='-', lw=2.0, alpha=1.0, label='Width=2.0')
plt.plot(tensions[3,:], D[3,:], color='green', linestyle='-', lw=2.0, alpha=1.0, label='Width=2.5')
plt.plot(tensions[4,:], D[4,:], color='yellow', linestyle='-', lw=2.0, alpha=1.0, label='Width=3.0')
plt.plot(tensions[5,:], D[5,:], color='orange', linestyle='-', lw=2.0, alpha=1.0, label='Width=3.5')
plt.plot(tensions[6,:], D[6,:], color='red', linestyle='-', lw=2.0, alpha=1.0, label='Width=4.0')
plt.xlabel('Tension between Gaussians, $t$ (sigma)')
plt.ylabel('KL divergence (nats)')
l = plt.legend(loc='upper left')

## Conclusions

To summarize, the KL divergence $D$ is an appropriate measure of the quality of an approximation to a probability distribution, expressing the information lost when the approximation is used in place of the true distribution.  The simple numerical experiments in this notebook suggest the following approximate extrapolations and hypotheses.  

Using a Gaussian example enables exploration of two quantities characterizing the approximate distribution: the "precision" $\alpha$ is a measure of the width of the approximating distribution relative to the truth, and the "tension" $t$ is a measure of the difference in centroids in units of the quadrature sum of the widths of the two distributions.  We have found that the KLD can be interpreted in terms of these quantities: for approximations broader than the truth the KLD grows as the log of the precision, and at fixed widths it grows as the square of the tension.

We can also think about when these approximations come up in photo-$z$ PDF expression.  If the true distribution contains small features that are not recoverable from the parametric representation of the distribution, it corresponds to a loss of precision.  This could happen if a binned parametrization has bins that are broader than the features.  If feature location information is lost when parametrizing the distribution, it can result in tension between the truth and the parametrization.  This could happen if a binned parametrization is based on samples from the truth with bins narrower than the features.
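
The loss of precision from coarse binning can be sketched for the Gaussian case used throughout this notebook. The example below makes several assumptions that are not from the text above: the truth is a unit Gaussian, the binned parametrization is piecewise constant with each bin holding the true probability mass of that bin, bin edges sit at integer multiples of the bin width, and `binned_kld` is a hypothetical helper name. It uses $D = \int P\log P\,dx - \sum_i p_i \log(p_i/w)$, where $p_i$ is the mass in bin $i$ and $w$ is the bin width.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the unit Gaussian."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binned_kld(bin_width, half_range=10.0):
    """KLD in nats from a piecewise-constant (binned) approximation
    to a unit Gaussian, with bins covering at least [-half_range, half_range]."""
    n_side = int(math.ceil(half_range / bin_width))
    # The integral of P log P for a unit Gaussian is minus its entropy.
    total = -0.5 * math.log(2.0 * math.pi * math.e)
    for k in range(-n_side, n_side):
        lo, hi = k * bin_width, (k + 1) * bin_width
        mass = normal_cdf(hi) - normal_cdf(lo)
        if mass > 0.0:
            # Inside the bin the approximation is constant at mass / bin_width.
            total -= mass * math.log(mass / bin_width)
    return total

for w in (0.1, 1.0, 4.0):
    print("bin width=%g  KLD=%g nats" % (w, binned_kld(w)))
```

In this sketch the KLD increases as the bins become broader than the unit-width feature, consistent with the loss of precision described above.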