The Kullback-Leibler (KL) divergence measures how much one probability distribution differs from a second, reference distribution. It is nonnegative and equals zero only when the two distributions coincide, but it is not symmetric in its two arguments, so it is not a distance metric.

Given two probability distributions, $p(x)$ and $q(x)$, over the same random variable $X$, the KL divergence is defined as:

$$
D_{KL}(p||q) = \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)} \right] = \mathbb{E}_{x \sim p} [\log p(x) - \log q(x)]
$$
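
Because the definition is an expectation under $p$, it can be approximated by sampling from $p$ and averaging $\log p(x) - \log q(x)$. The sketch below illustrates this under assumed example distributions, $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(1, 1)$, which are not taken from the text above. The sample size and seed are likewise arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed example distributions (illustrative only): p = N(0, 1), q = N(1, 1).
p = norm(loc=0.0, scale=1.0)
q = norm(loc=1.0, scale=1.0)

# Draw samples from p, then average log p(x) - log q(x) over them.
x = rng.normal(loc=0.0, scale=1.0, size=200_000)
kl_estimate = np.mean(p.logpdf(x) - q.logpdf(x))

# For these two unit-variance Gaussians the exact value is 0.5 nats.
print(f"Monte Carlo estimate of D_KL(p||q): {kl_estimate:.3f}")
```

The estimate fluctuates with the random seed and tightens as the number of samples grows.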

**Discrete Distributions:**

$$D_{KL}(p || q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$

**Continuous Distributions:**

$$D_{KL}(p||q) = \int_{x \in X} p(x) \log \frac{p(x)}{q(x)} dx$$
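
As a small worked instance of the discrete formula, the following sketch evaluates the sum for two hand-picked distributions over two outcomes. The helper name `kl_discrete` and the example values are illustrative assumptions. The helper also assumes that $q(x) > 0$ wherever $p(x) > 0$.

```python
import numpy as np

def kl_discrete(p, q):
    """D_KL(p || q) in nats for two discrete distributions given as arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Illustrative distributions over two outcomes.
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

kl_pq = kl_discrete(p, q)
kl_qp = kl_discrete(q, p)

print(f"D_KL(p||q) = {kl_pq:.4f}")  # about 0.5108
print(f"D_KL(q||p) = {kl_qp:.4f}")  # about 0.3681
```

The two values differ, which shows that swapping the arguments changes the result.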



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

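def KL(P, Q):
    """Discrete KL divergence D_KL(P || Q) in nats for two probability vectors."""
    # A small epsilon keeps empty histogram bins from causing log(0) or division by zero.
    epsilon = 1e-5
    P = P + epsilon
    Q = Q + epsilon
    # Renormalize so the smoothed vectors still sum to 1.
    P = P / P.sum()
    Q = Q / Q.sum()
    return np.sum(P * np.log(P / Q))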

def generate_samples(mu, sigma):
    return np.random.normal(mu, sigma, 1000)

def calculate_distribution(s1, s2):
    # Bin both samples on a shared grid so the two histograms are comparable.
    bins = np.linspace(min(s1.min(), s2.min()), max(s1.max(), s2.max()), 100)
    p1, _ = np.histogram(s1, bins=bins)
    p2, _ = np.histogram(s2, bins=bins)
    # Convert bin counts to probabilities that sum to 1.
    p1 = p1 / p1.sum()
    p2 = p2 / p2.sum()
    return p1, p2, bins

def plot_distributions(s1, s2, bins, kl_div, dist1_label, dist2_label, title):
    plt.figure(figsize=(7, 4))
    plt.suptitle(f'KL Divergence: {kl_div:.2f}')

    density1 = gaussian_kde(s1)
    density2 = gaussian_kde(s2)
    x = np.linspace(min(s1.min(), s2.min()), max(s1.max(), s2.max()), 1000)

    plt.fill_between(x, density1(x), color='red', alpha=0.2, label=dist1_label)
    plt.fill_between(x, density2(x), color='green', alpha=0.2, label=dist2_label)
    plt.title(title)
    plt.legend()  # without a legend, the labels passed above are never displayed
    plt.show()

# Define two sets of observations
s1 = generate_samples(0.2, 0.1)
s2 = generate_samples(0, 0.1)

# Calculate distributions and KL divergence
p1, p2, bins = calculate_distribution(s1, s2)
kl_div = KL(p1, p2)

# Plot distributions
plot_distributions(s1, s2, bins, kl_div, 'Distribution 1', 'Distribution 2', 'Similar Distributions')

# Define two sets of observations
s3 = generate_samples(0.0, 0.1)
s4 = generate_samples(0.5, 0.1)

# Calculate distributions and KL divergence
p3, p4, bins = calculate_distribution(s3, s4)
kl_div = KL(p3, p4)

# Plot distributions
plot_distributions(s3, s4, bins, kl_div, 'Distribution 3', 'Distribution 4', 'Different Distributions')
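
Since both pairs of samples above are drawn from Gaussians, the histogram-based estimates can be sanity-checked against the closed-form KL divergence between two univariate normal distributions, $\log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$. The helper name `kl_gaussian` below is an illustrative assumption and does not appear in the code above.

```python
import numpy as np

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats."""
    return (
        np.log(sigma2 / sigma1)
        + (sigma1**2 + (mu1 - mu2) ** 2) / (2 * sigma2**2)
        - 0.5
    )

# Same parameters as the two sample pairs generated above.
kl_similar = kl_gaussian(0.2, 0.1, 0.0, 0.1)    # s1 vs s2
kl_different = kl_gaussian(0.0, 0.1, 0.5, 0.1)  # s3 vs s4

print(f"Similar pair:   {kl_similar:.2f}")    # 2.00
print(f"Different pair: {kl_different:.2f}")  # 12.50
```

The histogram-based values need not match these closely. They depend on the bin count, the finite sample size, and the epsilon smoothing, and the smoothing matters most when the two distributions barely overlap.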

**Cross-entropy** is closely related to the KL divergence: the cross-entropy $H(p, q)$ equals the entropy of $p$, written here as $H(p, p)$, plus $D_{KL}(p||q)$.

Because $H(p, p)$ does not depend on $q$, minimizing the cross-entropy with respect to $q$ is equivalent to minimizing the KL divergence with respect to $q$.

\begin{align*}
H(p, q) &= -\mathbb{E}_{x \sim p}[\log(q(x))] \\
&= -\mathbb{E}_{x \sim p}[\log(p(x))] + \mathbb{E}_{x \sim p}[\log(p(x))] -\mathbb{E}_{x \sim p}[\log(q(x))]\\
&= -\mathbb{E}_{x \sim p}[\log(p(x))] + \mathbb{E}_{x \sim p}\left[\log(p(x)) -  \log(q(x))\right] \\
&= H(p, p) + D_{KL}(p||q)
\end{align*}
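
The decomposition can be checked numerically on a small discrete example. In the sketch below, the distributions, the candidate list, and the helper names are illustrative assumptions. It verifies that $H(p, q) = H(p, p) + D_{KL}(p||q)$ and that the candidate $q$ minimizing the cross-entropy also minimizes the KL divergence.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return float(np.sum(p * np.log(p / q)))

# Illustrative distributions with strictly positive probabilities.
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

entropy_p = cross_entropy(p, p)  # H(p, p) is the entropy of p
lhs = cross_entropy(p, q)
rhs = entropy_p + kl_divergence(p, q)
print(np.isclose(lhs, rhs))  # True

# Both objectives rank the candidate q's identically, since they differ by H(p, p).
candidates = [
    np.array([0.3, 0.3, 0.4]),
    np.array([0.2, 0.4, 0.4]),
    np.array([0.1, 0.4, 0.5]),  # equal to p, so its KL divergence is 0
]
best_by_ce = int(np.argmin([cross_entropy(p, c) for c in candidates]))
best_by_kl = int(np.argmin([kl_divergence(p, c) for c in candidates]))
print(best_by_ce, best_by_kl)  # 2 2
```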
