# The Kullback-Leibler Divergence

In this notebook, we try to gain some intuition about the magnitude of the KL divergence by computing its value between two Gaussian PDFs as a function of the "tension" between them.

### Requirements

You'll need the `qp` package and its dependencies (notably `numpy`, `scipy` and `matplotlib`).

## Background

The Kullback-Leibler divergence between probability distributions $P$ and $Q$ is:

$D(P||Q) = \int_{-\infty}^{\infty} \log \left( \frac{P(x)}{Q(x)} \right) P(x) dx$

The Wikipedia page for the KL divergence gives the following useful interpretation of the KLD:

> KL divergence is a measure of the difference between two probability distributions P and Q. It is not symmetric in P and Q. In applications, P typically represents ... a precisely calculated theoretical distribution, while Q typically represents ... [an] approximation of P.
>
> Specifically, the Kullback–Leibler divergence from Q to P, denoted DKL(P‖Q), is a measure of the information gained when one revises one's beliefs from ... Q to ... P. In other words, it is the amount of information lost when Q is used to approximate P.
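
As a minimal sketch of the integral defined above, independent of `qp`, the KLD can be approximated with a Riemann sum on a regular grid. The helper names `gaussian_logpdf` and `kld_on_grid`, the grid limits and the number of grid points are illustrative assumptions, not part of `qp`. Working with log-densities avoids taking the log of values that underflow to zero in the tails.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Natural log of a 1D Gaussian PDF with mean mu and width sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def kld_on_grid(logp, logq, lo=-30.0, hi=30.0, n=200001):
    """Approximate D(P||Q) in nats with a Riemann sum on a regular grid."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    lp, lq = logp(x), logq(x)
    # Integrand is P(x) * log(P(x)/Q(x)), evaluated from the log-densities
    return np.sum(np.exp(lp) * (lp - lq)) * dx

# Unit-width Gaussian at 0 versus a unit-width Gaussian offset by 2
D_example = kld_on_grid(lambda x: gaussian_logpdf(x, 0.0, 1.0),
                        lambda x: gaussian_logpdf(x, 2.0, 1.0))
print(D_example)
```

For this pair of unit-width Gaussians, the printed value should be very close to 2 nats.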


## 1D Gaussian Illustration

"Information" is not a terribly familiar quantity to most of us, so let's compute the KLD between two Gaussians:

* The "True" 1D Gaussian PDF, $P(x)$, of unit width and central value 0

* An "approximating" 1D Gaussian PDF, $Q(x)$, of width $\sigma$ and centroid $x_0$

How does the KLD between these PDFs vary with offset $x_0$ and width $\sigma$?

In [None]:
# The inline magic selects the notebook backend, so no explicit matplotlib.use() call is needed
%matplotlib inline
import matplotlib.pyplot as plt

import qp

import numpy as np
import scipy.stats as sps

In [None]:
P = qp.PDF(truth=sps.norm(loc=0.0, scale=1.0))

In [None]:
x0, sigma = 2.0, 1.0
Q = qp.PDF(truth=sps.norm(loc=x0, scale=sigma))

In [None]:
infinity = 100.0
D = qp.utils.calculate_kl_divergence(P, Q, limits=(-infinity,infinity), vb=False)
print(np.round(D))

That is, two equal-width Gaussians separated by 2 sigma, so that they cross at their 1-sigma points, have a KLD of 2 nats.

> The unit of information here is a "nat" rather than a "bit" because `qp` uses a natural logarithm in its KLD calculation.
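
Converting between the two units only requires dividing by $\ln 2$. The short sketch below illustrates this; the function name `nats_to_bits` is a hypothetical helper, not something provided by `qp`.

```python
import math

def nats_to_bits(d_nats):
    """Convert an information quantity from nats to bits."""
    return d_nats / math.log(2.0)

# The 2 nats found above correspond to roughly 2.885 bits
print(nats_to_bits(2.0))
```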

What if the two Gaussians are perfectly aligned, but the approximation is broader than the truth?

In [None]:
x0, sigma = 0.0, 4.37
Q = qp.PDF(truth=sps.norm(loc=x0, scale=sigma))
D = qp.utils.calculate_kl_divergence(P, Q, limits=(-infinity,infinity), vb=False)
print(D)

That is, two concentric Gaussian PDFs differing in width by a factor of 4.37 have a KLD of about 1 nat.
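
Both numbers can be cross-checked against the standard closed-form KLD between two 1D Gaussians, $D = \ln(\sigma/\sigma_0) + (\sigma_0^2 + x_0^2)/(2\sigma^2) - 1/2$, where $\sigma_0$ is the width of the truth $P$ (1 here). The sketch below is independent of `qp`; the function name `gaussian_kld` is a hypothetical helper introduced for illustration.

```python
import math

def gaussian_kld(x0, sigma, sigma0=1.0):
    """Closed-form D(P||Q) in nats for P = N(0, sigma0^2), Q = N(x0, sigma^2)."""
    return (math.log(sigma / sigma0)
            + (sigma0**2 + x0**2) / (2.0 * sigma**2)
            - 0.5)

print(gaussian_kld(2.0, 1.0))    # offset Gaussians of equal width
print(gaussian_kld(0.0, 4.37))   # concentric Gaussians, approximation 4.37x broader
```

The first value is 2 nats and the second is close to 1 nat, in agreement with the numerical results above.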

## Tension between PDFs

Two measurements that disagree with each other will lead to parameter PDFs that have different centroids. Let's tabulate the KLD for a range of distribution offsets.

In [None]:
separations = np.linspace(0.0,15.0,16)
# Array to hold one KLD value per separation; it is filled in the loop below
D = np.empty_like(separations)

In [None]:
sigma = 1.0
infinity = 100.0

for k,x0 in enumerate(separations):
    Q = qp.PDF(truth=sps.norm(loc=x0, scale=sigma))
    D[k] = qp.utils.calculate_kl_divergence(P, Q, limits=(-infinity,infinity), vb=False)
    
print(list(zip(separations, D)))

In [None]:
plt.plot(separations, D, color='k', linestyle='-', lw=2.0, alpha=1.0, label='Width=1.0')
plt.xlabel('Separation between Gaussians')
plt.ylabel('KL divergence (nats)')
l = plt.legend(loc='upper left')

> For separations greater than about 7 sigma, numerical precision starts to matter: the overlap integral out here is smaller than machine precision. `qp` uses a `safelog` function that replaces values smaller than the system threshold value with that threshold; the log of that threshold is:

In [None]:
import sys
print(np.log(sys.float_info.epsilon))
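
The thresholding described above can be sketched as follows. This is a hypothetical stand-in written for illustration, not `qp`'s actual `safelog` implementation, and the name `safelog_sketch` is an assumption.

```python
import sys
import numpy as np

def safelog_sketch(values, threshold=sys.float_info.epsilon):
    """Natural log with inputs below `threshold` replaced by `threshold`."""
    values = np.asarray(values, dtype=float)
    # Clip from below so that log(0) never produces -inf
    return np.log(np.maximum(values, threshold))

print(safelog_sketch([1.0, 1e-300, 0.0]))
```

With this kind of clipping, any density value below the threshold contributes the same floor value of about -36 to the log, which is why the computed KLD stops tracking the true value at large separations.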

Presumably what matters is the "tension" between the two distributions: the separation in units of the combined width, $\sqrt{\sigma_0^2 + \sigma^2}$, where $\sigma_0$ is the width of the true PDF and $\sigma$ that of the approximation. This quantity comes up often in discussions of dataset combination, where the difference in centroids of two posterior PDFs needs to be expressed in terms of their widths. The quadratic sum makes sense in this context, since it would appear in the product of the two likelihoods were they to be combined.
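
As a small sketch, the tension can be computed as the separation divided by the square root of the quadratic sum of the widths, matching the `x0 / np.sqrt(sigma*sigma + 1.0)` expression used in the plotting loop that follows. The function name `tension` and the `sigma0` keyword (the width of the truth, 1 in this notebook) are illustrative choices.

```python
import math

def tension(x0, sigma, sigma0=1.0):
    """Separation x0 in units of the quadrature sum of the two widths."""
    return x0 / math.sqrt(sigma0**2 + sigma**2)

# Two unit-width Gaussians separated by 2: tension is 2 / sqrt(2), about 1.414
print(tension(2.0, 1.0))
```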

For a few different widths, let's plot KLD vs tension.

In [None]:
infinity = 100.0
widths = np.array([1.0,1.5,2.0,2.5,3.0,3.5,4.0]) 
separations = np.linspace(0.0,7.0,15)

D = np.zeros([len(widths),len(separations)])
tensions = np.empty_like(D)

for j,sigma in enumerate(widths):
    
    for k,x0 in enumerate(separations):
        Q = qp.PDF(truth=sps.norm(loc=x0, scale=sigma))
        D[j,k] = qp.utils.calculate_kl_divergence(P, Q, limits=(-infinity,infinity), vb=False)
        tensions[j,k] = x0 / np.sqrt(sigma*sigma + 1.0)
        

In [None]:
x = tensions[0,:]
y = x**2
plt.plot(x, y, color='gray', linestyle='-', lw=8.0, alpha=0.5, label='$t^2$')

plt.plot(tensions[0,:], D[0,:], color='black', linestyle='-', lw=2.0, alpha=1.0, label='Width=1.0')
plt.plot(tensions[1,:], D[1,:], color='violet', linestyle='-', lw=2.0, alpha=1.0, label='Width=1.5')
plt.plot(tensions[2,:], D[2,:], color='blue', linestyle='-', lw=2.0, alpha=1.0, label='Width=2.0')
plt.plot(tensions[3,:], D[3,:], color='green', linestyle='-', lw=2.0, alpha=1.0, label='Width=2.5')
plt.plot(tensions[4,:], D[4,:], color='yellow', linestyle='-', lw=2.0, alpha=1.0, label='Width=3.0')
plt.plot(tensions[5,:], D[5,:], color='orange', linestyle='-', lw=2.0, alpha=1.0, label='Width=3.5')
plt.plot(tensions[6,:], D[6,:], color='red', linestyle='-', lw=2.0, alpha=1.0, label='Width=4.0')
plt.xlabel('Tension between Gaussians, $t$ (sigma)')
plt.ylabel('KL divergence (nats)')
l = plt.legend(loc='upper left')

## Conclusions

The simple numerical experiment in this notebook suggests that:

The KL divergence, in nats, between an approximating Gaussian and a true Gaussian is _approximately_ equal to the square of the tension between the two distributions, where tension $t$ is defined as

$t = \frac{\Delta x}{\sqrt{\sigma_0^2 + \sigma^2}}$

and has, in some sense, "units" of "sigma". Here $\Delta x$ is the separation between the centroids, and $\sigma_0$ and $\sigma$ are the widths of the true and approximating PDFs. The KLD is the information lost when using the approximation: the information loss rises in proportion to the tension squared. An analytic derivation of this result would be welcome!

We can perhaps take the KL divergence to be a generalized quantification of tension: the square root of the KLD between a PDF and its approximation, in nats, gives an approximate sense of the tension between the two distributions, in "units" of "sigma".
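
For the Gaussian case, the analytic result invited above follows from a short calculation. The steps below are a sketch, with $\sigma_0$ the width of the truth $P$ (centred on 0), and $\sigma$ and $x_0$ the width and centroid of the approximation $Q$. The log-ratio of the two densities is

$\log \frac{P(x)}{Q(x)} = \log \frac{\sigma}{\sigma_0} - \frac{x^2}{2\sigma_0^2} + \frac{(x - x_0)^2}{2\sigma^2}$

Averaging over $P$, using $\langle x^2 \rangle_P = \sigma_0^2$ and $\langle (x - x_0)^2 \rangle_P = \sigma_0^2 + x_0^2$, gives

$D(P||Q) = \log \frac{\sigma}{\sigma_0} + \frac{\sigma_0^2 + x_0^2}{2\sigma^2} - \frac{1}{2}$

For equal widths, $\sigma = \sigma_0$, this reduces to $D = x_0^2 / (2\sigma_0^2) = t^2$ exactly. For unequal widths, the $x_0$-dependent term is $x_0^2/(2\sigma^2)$ rather than $t^2 = x_0^2/(\sigma_0^2 + \sigma^2)$, and there is an additional offset $\log(\sigma/\sigma_0) + \sigma_0^2/(2\sigma^2) - 1/2$ that does not depend on $x_0$, so $D \approx t^2$ holds only approximately.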