
What is KL divergence (Kullback–Leibler)?                                                                 
  =========================================                                                                 
                                                                                                            
  KL divergence measures how one probability distribution $P$ differs from another $Q$ (taken as the        
  reference). Intuition: “How many extra nats/bits does it cost to encode data from $P$ if I use a code     
  optimized for $Q$ instead of $P$?”

  - **Zero iff identical:** $D_{\mathrm{KL}}(P \parallel Q) = 0$ if and only if $P = Q$ (almost everywhere).
  - **Asymmetric:** in general, $D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P)$
    ([video explanation](https://www.youtube.com/watch?v=C_dKimu42D8)).
  - **Non-negative:** $D_{\mathrm{KL}} \ge 0$.
  - **Units:** Natural log ⇒ **nats**; log base 2 ⇒ **bits**.
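
The coding intuition above can be checked by hand on a small example. The sketch below assumes two hypothetical distributions with power-of-two probabilities, so the ideal code lengths $-\log_2 p_i$ are whole numbers; the arrays `p` and `q` are illustrative and not taken from the rest of this notebook.

```python
import numpy as np

# Hypothetical source distribution P and reference distribution Q (assumed values).
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Ideal code lengths in bits for a code optimized for each distribution.
len_p = -np.log2(p)  # 1, 2, 3, 3 bits
len_q = -np.log2(q)  # 2 bits for every symbol

# Expected bits per symbol when the data really come from P.
cost_with_p_code = float(np.sum(p * len_p))  # entropy H(P)
cost_with_q_code = float(np.sum(p * len_q))  # cross-entropy of P under Q

extra_bits = cost_with_q_code - cost_with_p_code
kl_bits = float(np.sum(p * np.log2(p / q)))

print(f"extra bits per symbol: {extra_bits:.4f}")
print(f"KL(P || Q) in bits   : {kl_bits:.4f}")
```

Both printed numbers agree: the expected overhead of using the code built for $Q$ is exactly the KL divergence in bits.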

                                                                                                          
  Why/where it’s used                                                                                       
  -------------------                                                                                       
                                                                                                            
  - **Model fitting / cross-entropy loss:** Minimizing                                                      
    $$                                                                                                      
    \mathbb{E}_{x \sim P}\left[-\log Q(x)\right]                                                            
    $$
    over $Q$ is equivalent to minimizing $D_{\mathrm{KL}}(P \parallel Q)$, because the entropy $H(P)$ does not depend on $Q$ in
    $$
    CE(P,Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q).
    $$
                                                                                                            
  - **Variational Inference (ELBO):** Fit $q_{\phi}(z)$ to posterior $p(z \mid x)$ by minimizing
    $D_{\mathrm{KL}}\!\left(q_{\phi}(z) \parallel p(z \mid x)\right)$, which is equivalent to maximizing the ELBO.
                                                                                                            
  - **Distribution shift & calibration:** Compare predicted class distributions or priors between           
  environments.                                                                                             
                                                                                                            
  - **RL / Policy updates:** Trust-region methods limit how far the new policy moves from the old one: TRPO
  imposes an explicit KL constraint, while PPO uses a clipped objective or a KL penalty toward the same end.
                                                                                                            
  - **Information theory:** Measures the expected code-length regret incurred when using the wrong          
  distribution.                                                                                             
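
To make the variational-inference use case concrete, the sketch below checks the identity $\log p(x) = \mathrm{ELBO}(q) + D_{\mathrm{KL}}(q(z) \parallel p(z \mid x))$ on a toy model with a three-valued latent variable. The prior, the likelihood values and the candidate `q` are all hypothetical numbers chosen for illustration.

```python
import numpy as np

# Hypothetical toy model: latent z in {0, 1, 2} and one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])          # p(z), assumed
likelihood = np.array([0.10, 0.40, 0.70])  # p(x | z) at the observed x, assumed

joint = prior * likelihood    # p(x, z)
evidence = joint.sum()        # p(x)
posterior = joint / evidence  # p(z | x)

def elbo(q):
    """ELBO(q) = E_q[log p(x, z) - log q(z)]."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(a, b):
    """D_KL(a || b) in nats for strictly positive probability vectors."""
    return float(np.sum(a * np.log(a / b)))

q = np.array([0.2, 0.3, 0.5])  # an arbitrary candidate approximation
gap = float(np.log(evidence)) - elbo(q)

print(f"log p(x) - ELBO(q)  : {gap:.6f}")
print(f"KL(q || p(z | x))   : {kl(q, posterior):.6f}")
print(f"ELBO at q=posterior : {elbo(posterior):.6f} vs log p(x) = {np.log(evidence):.6f}")
```

The gap between $\log p(x)$ and the ELBO is the KL term, so raising the ELBO lowers the divergence to the posterior, and the gap closes when `q` equals the posterior.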
                                                                                                            
  The math                                                                                                  
  --------                                                                                                  
                                                                                                            
  ### Discrete case                                                                                         
                                                                                                            
  $$                                                                                                        
  D_{\mathrm{KL}}(P \parallel Q) = \sum_{i} p_i \log \frac{p_i}{q_i}                                        
  $$                                                                                                        
                                                                                                            
  Important: if $q_i = 0$ while $p_i > 0$, the KL divergence is $+\infty$. Terms with $p_i = 0$ contribute $0$, by the convention $0 \log 0 = 0$.
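
A minimal sketch of a discrete KL that follows this rule literally, returning `inf` on a support mismatch instead of clipping. The name `kl_strict` is hypothetical, and the function assumes both inputs are already normalized probability vectors.

```python
import numpy as np

def kl_strict(p, q):
    """D_KL(P || Q) in nats without clipping (hypothetical helper).

    Assumes p and q are normalized probability vectors of equal length.
    Terms with p_i == 0 are skipped; any q_i == 0 with p_i > 0 gives +inf.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("p and q must have the same shape.")
    support = p > 0
    if np.any(q[support] == 0):
        return float("inf")
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

mismatch = kl_strict([0.9, 0.1, 0.0], [0.0, 1.0, 0.0])  # Q misses mass of P
one_sided = kl_strict([0.0, 1.0], [0.5, 0.5])           # zeros in P are harmless
print(mismatch, one_sided)
```

The second call shows the asymmetry of the rule: zeros in $P$ are fine, zeros in $Q$ are not.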
                                                                                                            
  **Cross-entropy relation**                                                                                
                                                                                                            
  $$                                                                                                        
  CE(P, Q) \equiv -\sum_{i} p_i \log q_i = H(P) + D_{\mathrm{KL}}(P \parallel Q).                           
  $$                                                                                                        
                                                                                                            
  ### Continuous case                                                                                       
                                                                                                            
  $$                                                                                                        
  D_{\mathrm{KL}}(P \parallel Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx                                   
  $$                                                                                                        
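
The integral can be approximated on a grid when the densities are known. The sketch below uses two exponential densities with hypothetical rates `a` and `b`, for which the integral works out by hand to $\log(a/b) + b/a - 1$, and compares that value with a hand-written trapezoidal rule. The grid limits are an assumption chosen so that the truncated tail is negligible for these rates.

```python
import numpy as np

a, b = 2.0, 1.0  # assumed rates: P = Exponential(a), Q = Exponential(b)

# Grid on [0, 40]; the density of P beyond 40 is negligible for these rates.
x = np.linspace(0.0, 40.0, 400_001)
dx = x[1] - x[0]

log_p = np.log(a) - a * x  # log-density of P
log_q = np.log(b) - b * x  # log-density of Q
integrand = np.exp(log_p) * (log_p - log_q)

# Trapezoidal rule: interior points get weight dx, the two end points dx / 2.
kl_numeric = float(dx * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])))
kl_exact = float(np.log(a / b) + b / a - 1.0)

print(f"numeric: {kl_numeric:.6f}   exact: {kl_exact:.6f}")
```

Working with log-densities keeps the log ratio finite even where both densities underflow toward zero.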
                                                                                                            
  ### Closed form for univariate Gaussians                                                                  
                                                                                                            
  Let $P = \mathcal{N}(\mu_0, \sigma_0^2)$ and $Q = \mathcal{N}(\mu_1, \sigma_1^2)$:                        
                                                                                                            
  $$                                                                                                        
  D_{\mathrm{KL}}(P \parallel Q) =                                                                          
  \log \frac{\sigma_1}{\sigma_0}                                                                            
  + \frac{\sigma_0^{2} + (\mu_0 - \mu_1)^2}{2\sigma_1^{2}}                                                  
  - \frac{1}{2}.                                                                                            
  $$                                                                                                        
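
A short derivation sketch of this closed form, using only the Gaussian log-densities and the first two moments of $P$. The log ratio is

$$
\log \frac{p(x)}{q(x)} = \log \frac{\sigma_1}{\sigma_0} - \frac{(x - \mu_0)^2}{2\sigma_0^{2}} + \frac{(x - \mu_1)^2}{2\sigma_1^{2}}.
$$

Taking the expectation under $P$ with $\mathbb{E}_{P}\left[(x - \mu_0)^2\right] = \sigma_0^{2}$ and $\mathbb{E}_{P}\left[(x - \mu_1)^2\right] = \sigma_0^{2} + (\mu_0 - \mu_1)^2$ gives

$$
\begin{aligned}
D_{\mathrm{KL}}(P \parallel Q)
&= \log \frac{\sigma_1}{\sigma_0} - \frac{\sigma_0^{2}}{2\sigma_0^{2}} + \frac{\sigma_0^{2} + (\mu_0 - \mu_1)^2}{2\sigma_1^{2}} \\
&= \log \frac{\sigma_1}{\sigma_0} + \frac{\sigma_0^{2} + (\mu_0 - \mu_1)^2}{2\sigma_1^{2}} - \frac{1}{2}.
\end{aligned}
$$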
                                                                                                            
  ### Jensen–Shannon divergence                                                                             
                                                                                                            
  A symmetric, bounded alternative often used for evaluation:                                               
                                                                                                            
  $$                                                                                                        
  JS(P, Q) = \frac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \frac{1}{2} D_{\mathrm{KL}}(Q \parallel M),       
  \quad                                                                                                     
  M = \frac{P + Q}{2}.                                                                                      
  $$                                                                                                        

  $JS$ is symmetric and lies in $[0, \log 2]$ nats (or $[0, 1]$ bits).
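
The two ends of this range can be checked directly. The sketch below is a small, self-contained JS implementation that skips zero-probability terms rather than clipping them; the helper names `kl_on_support` and `js` are hypothetical, and inputs are assumed to be normalized.

```python
import numpy as np

def kl_on_support(p, m):
    """D_KL(p || m) in nats, summing only where p > 0 (0 log 0 = 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / m[mask])))

def js(p, q):
    """Jensen-Shannon divergence in nats for normalized vectors (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # m > 0 wherever p > 0 or q > 0, so each KL term is finite
    return 0.5 * kl_on_support(p, m) + 0.5 * kl_on_support(q, m)

disjoint = js([1.0, 0.0], [0.0, 1.0])   # no overlap: upper bound ln 2
identical = js([0.3, 0.7], [0.3, 0.7])  # same distribution: lower bound 0
print(f"disjoint: {disjoint:.6f}  ln 2: {np.log(2):.6f}  identical: {identical:.6f}")
```

Unlike KL, the disjoint case stays finite because each distribution is compared with the mixture $M$, which covers the support of both.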

In [1]:
# KL divergence: reusable functions + worked examples
import numpy as np

def _as_prob_vec(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Normalize to a probability vector and clip to avoid log(0)."""
    x = np.asarray(x, dtype=float)
    if x.ndim != 1:
        raise ValueError("x must be a 1D array of nonnegative counts or probabilities.")
    if np.any(x < 0):
        raise ValueError("x has negative entries.")
    s = x.sum()
    if s <= 0:
        raise ValueError("Sum must be positive.")
    p = x / s
    # clip to avoid zeros; preserves normalization approximately
    p = np.clip(p, eps, 1.0)
    p = p / p.sum()
    return p

In [2]:
def kl_divergence_discrete(p, q, base=None, eps: float = 1e-12):
    """
    D_KL(P || Q) = sum_i p_i * log(p_i / q_i)
    If base is None -> natural log (nats). For bits, set base=2.
    p, q can be probability vectors OR nonnegative counts.
    """
    p = _as_prob_vec(np.asarray(p), eps=eps)
    q = _as_prob_vec(np.asarray(q), eps=eps)
    if p.shape != q.shape:
        raise ValueError("p and q must have the same shape.")
    log_ratio = np.log(p) - np.log(q)
    kl = float(np.sum(p * log_ratio))
    if base is not None:
        kl = kl / np.log(base)
    return kl

In [4]:
def jensen_shannon(p, q, base=None, eps: float = 1e-12):
    """
    JS(P, Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), M=(P+Q)/2.
    Symmetric and bounded in [0, log(2)] for natural logs.
    """
    p = _as_prob_vec(np.asarray(p), eps=eps)
    q = _as_prob_vec(np.asarray(q), eps=eps)
    m = 0.5 * (p + q)
    js = 0.5 * kl_divergence_discrete(p, m, base=base, eps=eps) + \
         0.5 * kl_divergence_discrete(q, m, base=base, eps=eps)
    return float(js)

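def entropy(p, base=None, eps: float = 1e-12):
    """
    Shannon entropy H(P) = -sum_i p_i * log(p_i).
    If base is None -> natural log (nats). For bits, set base=2.
    """
    p = _as_prob_vec(np.asarray(p), eps=eps)
    H = -float(np.sum(p * np.log(p)))
    if base is not None:
        H = H / np.log(base)
    return H

def cross_entropy(p, q, base=None, eps: float = 1e-12):
    """
    Cross-entropy H(P, Q) = -sum_i p_i * log(q_i) = H(P) + D_KL(P || Q).
    If base is None -> natural log (nats). For bits, set base=2.
    """
    p = _as_prob_vec(np.asarray(p), eps=eps)
    q = _as_prob_vec(np.asarray(q), eps=eps)
    if p.shape != q.shape:
        raise ValueError("p and q must have the same shape.")
    Hpq = -float(np.sum(p * np.log(q)))
    if base is not None:
        Hpq = Hpq / np.log(base)
    return Hpq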


# Analytic KL for univariate Gaussians: P=N(mu0, s0^2), Q=N(mu1, s1^2)
def kl_normal_1d(mu0, s0, mu1, s1, base=None):
    if s0 <= 0 or s1 <= 0:
        raise ValueError("Standard deviations must be positive.")
    term = np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5
    if base is not None:
        term = term / np.log(base)
    return float(term)

# Monte Carlo estimate of KL(P||Q) from samples x~P and log-densities
def kl_mc_from_samples(logp, logq, x):
    """
    logp, logq: callables returning log-density at x.
    x: samples ~ P
    """
    lp = logp(x)
    lq = logq(x)
    return float(np.mean(lp - lq))

def logpdf_normal_1d(x, mu, s):
    x = np.asarray(x, dtype=float)
    return -0.5*np.log(2*np.pi) - np.log(s) - 0.5*((x - mu)/s)**2


In [5]:
results = {}

# 1) Bernoulli example (binary classification target shift)
p = np.array([0.7, 0.3])  # "true" positive/negative mix
q = np.array([0.5, 0.5])  # model prior or production mix
results['bernoulli_kl_nats'] = kl_divergence_discrete(p, q)
results['bernoulli_kl_bits'] = kl_divergence_discrete(p, q, base=2)
results['bernoulli_entropy_bits'] = entropy(p, base=2)
results['bernoulli_cross_entropy_bits'] = cross_entropy(p, q, base=2)  # should equal H(p)+KL in bits


In [6]:
# 2) Multiclass distributions (e.g., two models' predictions over 3 classes)
p3 = np.array([0.80, 0.15, 0.05])  # Model A
q3 = np.array([0.60, 0.30, 0.10])  # Model B
results['categorical_KL_P||Q'] = kl_divergence_discrete(p3, q3)
results['categorical_KL_Q||P'] = kl_divergence_discrete(q3, p3)  # asymmetry
results['categorical_JS'] = jensen_shannon(p3, q3)

# 3) Univariate Gaussians: analytic vs Monte Carlo
mu0, s0 = 0.0, 1.0
mu1, s1 = 1.0, 1.5
kl_analytic = kl_normal_1d(mu0, s0, mu1, s1)
# Monte Carlo
rng = np.random.default_rng(7)
x = rng.normal(mu0, s0, size=200000)
logp = lambda z: logpdf_normal_1d(z, mu0, s0)
logq = lambda z: logpdf_normal_1d(z, mu1, s1)
kl_mc = kl_mc_from_samples(logp, logq, x)
results['gaussian_analytic'] = kl_analytic
results['gaussian_mc_estimate'] = kl_mc
results['gaussian_abs_error_mc'] = abs(kl_mc - kl_analytic)

# 4) Practical caution: support mismatch (zero-probability under Q)
p_support = np.array([0.9, 0.1, 0.0])
# Q assigns zero probability where P has mass, so the true KL is +inf;
# the eps clipping in _as_prob_vec turns it into a large finite number.
q_zeros = np.array([0.0, 1.0, 0.0])
kl_with_clip = kl_divergence_discrete(p_support, q_zeros)  # with epsilon clipping
results['support_mismatch_demo'] = kl_with_clip

# Pretty-print
for k, v in results.items():
    print(f"{k:28s}: {v:.6f}")

print("\nInterpretation notes:")
print("* KL is in nats unless 'base=2' was used (then in bits).")
print("* Cross-entropy H(p,q) = H(p) + KL(p||q) (see Bernoulli bits above).")
print("* KL is asymmetric: KL(P||Q) != KL(Q||P) (see categorical example).")
print("* JS(P,Q) is symmetric and bounded in [0, ln 2] nats (or [0,1] bits).")
print("* Monte Carlo estimate closely matches analytic KL for Gaussians.")
print("* If Q assigns zero probability where P has mass, true KL is +infinity; clipping produces a large finite surrogate.")


bernoulli_kl_nats           : 0.082283
bernoulli_kl_bits           : 0.118709
bernoulli_entropy_bits      : 0.881291
bernoulli_cross_entropy_bits: 1.000000
categorical_KL_P||Q         : 0.091516
categorical_KL_Q||P         : 0.104650
categorical_JS              : 0.024157
gaussian_analytic           : 0.349910
gaussian_mc_estimate        : 0.350000
gaussian_abs_error_mc       : 0.000090
support_mismatch_demo       : 24.542836

Interpretation notes:
* KL is in nats unless 'base=2' was used (then in bits).
* Cross-entropy H(p,q) = H(p) + KL(p||q) (see Bernoulli bits above).
* KL is asymmetric: KL(P||Q) != KL(Q||P) (see categorical example).
* JS(P,Q) is symmetric and bounded in [0, ln 2] nats (or [0,1] bits).
* Monte Carlo estimate closely matches analytic KL for Gaussians.
* If Q assigns zero probability where P has mass, true KL is +infinity; clipping produces a large finite surrogate.


**KL Divergence Interpretation**

- Bernoulli KL $\approx 0.082$ nats ($0.119$ bits) → small difference between the true and model distributions.  
- Entropy $H(P) = 0.881$ bits → inherent uncertainty of the true distribution.  
- Cross-entropy $H(P, Q) = 1.000$ bits → expected encoding cost when using the model $Q$.  
- Relation: $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q)$, here $1.000 = 0.881 + 0.119$ bits.  
- Asymmetry: $D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P)$ ($0.092$ vs $0.105$ nats).  
- JS divergence $= 0.024$ nats → symmetric, and small relative to its maximum $\ln 2 \approx 0.693$, so the distributions are similar.  
- Gaussian KL $\approx 0.35$ nats → moderate shift in mean/variance; the Monte Carlo estimate matches the analytic value to within $10^{-4}$.  
- Support mismatch $\approx 24.5$ nats → a finite artifact of clipping at $\epsilon = 10^{-12}$; the true KL is $+\infty$ because $Q$ assigns zero probability where $P$ has mass.