<a href="https://colab.research.google.com/github/fernandodeeke/Bayesian-Statistics/blob/main/KL1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Kullback–Leibler Divergence: A Comprehensive Tutorial with Python Examples**

---

## **Table of Contents**

1. [Introduction](#introduction)
2. [Mathematical Definition](#mathematical-definition)
3. [Intuitive Interpretation](#intuitive-interpretation)
4. [Properties of KL Divergence](#properties-of-kl-divergence)
5. [Computing KL Divergence in Python](#computing-kl-divergence-in-python)
    - [5.1. Discrete Probability Distributions](#51-discrete-probability-distributions)
    - [5.2. Continuous Probability Distributions](#52-continuous-probability-distributions)
6. [Practical Examples](#practical-examples)
    - [6.1. Language Model Evaluation in NLP](#61-language-model-evaluation-in-nlp)
    - [6.2. Anomaly Detection in Network Traffic](#62-anomaly-detection-in-network-traffic)
    - [6.3. Divergence in Image Distributions](#63-divergence-in-image-distributions)
7. [Applications in Machine Learning](#applications-in-machine-learning)
    - [7.1. Variational Autoencoders (VAEs)](#71-variational-autoencoders-vaes)
8. [Conclusion](#conclusion)
9. [References](#references)

---


<a id='introduction'></a>
## **1. Introduction**

The **Kullback–Leibler (KL) divergence** is a fundamental concept in information theory and statistics, measuring how one probability distribution diverges from a second, reference probability distribution. It has widespread applications in fields like machine learning, data science, statistics, and information theory.

This tutorial provides a comprehensive understanding of KL divergence, along with practical examples implemented in Python. Whether you're a student, a data scientist, or a machine learning practitioner, this guide will help you grasp both the theory and practical implementation of KL divergence.

---


<a id='mathematical-definition'></a>
## **2. Mathematical Definition**

The KL divergence of a distribution $P$ relative to a reference distribution $Q$, written $D_{KL}(P \| Q)$, is defined as follows.

For **discrete probability distributions**:

$$
D_{KL}(P \| Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)
$$

For **continuous probability distributions**:

$$
D_{KL}(P \| Q) = \int_{-\infty}^{\infty} P(x) \log \left( \frac{P(x)}{Q(x)} \right) dx
$$

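- $P$ and $Q$ are probability distributions over the same variable or set of events. In the continuous case, $P(x)$ and $Q(x)$ denote their probability density functions.
- $\log$ denotes the natural logarithm unless specified otherwise, so results are expressed in nats. Using $\log_2$ instead gives results in bits.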
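
The discrete formula translates almost directly into NumPy. The following is a minimal sketch rather than a library routine: the function name `kl_divergence_discrete` and the example vectors are assumptions made for illustration, and the code adopts the usual convention that terms with $P(i) = 0$ contribute nothing to the sum.

```python
import numpy as np

def kl_divergence_discrete(p, q):
    """Sum of p_i * log(p_i / q_i) in nats (hypothetical helper for illustration)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0  # terms with p_i = 0 are treated as 0
    if np.any(q[support] == 0):
        return np.inf  # P puts mass on an event that Q rules out
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

p = [0.5, 0.5]
q = [0.25, 0.75]
print(f"D(P || Q) = {kl_divergence_discrete(p, q):.4f}")
```

For these two vectors the sum is $0.5 \ln 2 + 0.5 \ln(2/3) \approx 0.1438$ nats.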

---


<a id='intuitive-interpretation'></a>
## **3. Intuitive Interpretation**

- **Measure of Difference**: With base-2 logarithms, KL divergence is the expected number of extra bits required to code samples from $P$ when using a code optimized for $Q$ instead of $P$. With the natural logarithm, the same quantity is expressed in nats.
- **Asymmetry**: KL divergence is not symmetric; in general, $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$.
- **Non-Negativity**: $D_{KL}(P \| Q) \geq 0$, with equality if and only if $P = Q$ almost everywhere.

**Analogy**:

Imagine you are using a coding scheme optimized for distribution $Q$ to encode data that is actually distributed according to $P$. The KL divergence tells you how inefficient this coding scheme is compared to one optimized for $P$.
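
The coding analogy can be checked numerically: the extra cost is the cross-entropy of $P$ under $Q$ minus the entropy of $P$. The sketch below uses base-2 logarithms so that all quantities are in bits; the two distributions are made up for illustration and chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Illustrative distributions (assumed values, powers of 1/2 for easy arithmetic)
P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.25, 0.25, 0.5])

entropy_bits = -np.sum(P * np.log2(P))        # average code length with a code built for P
cross_entropy_bits = -np.sum(P * np.log2(Q))  # average code length with a code built for Q
kl_bits = np.sum(P * np.log2(P / Q))          # extra bits per symbol

print(f"H(P)      = {entropy_bits:.2f} bits")
print(f"H(P, Q)   = {cross_entropy_bits:.2f} bits")
print(f"D(P || Q) = {kl_bits:.2f} bits")
```

Here the code built for $P$ needs 1.5 bits per symbol on average, the code built for $Q$ needs 1.75 bits, and the difference of 0.25 bits is exactly $D_{KL}(P \| Q)$.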

---


<a id='properties-of-kl-divergence'></a>
## **4. Properties of KL Divergence**

1. **Non-Negativity**: $D_{KL}(P \| Q) \geq 0$.
2. **Zero Divergence**: $D_{KL}(P \| Q) = 0$ if and only if $P = Q$ almost everywhere.
3. **Asymmetry**: KL divergence is not symmetric, meaning that in general $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$.
4. **Not a True Metric**: Because it satisfies neither symmetry nor the triangle inequality, KL divergence is not a true distance metric.
5. **Infinite Value**: If there exists an event where $P(i) > 0$ but $Q(i) = 0$, then $D_{KL}(P \| Q) = \infty$.
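
Several of these properties can be observed directly with `scipy.stats.entropy`. The snippet below is a small sanity check, not a proof; the distributions `P`, `Q`, and `Q_zero` are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.stats import entropy

# Arbitrary example distributions (assumed for illustration)
P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.1, 0.3, 0.6])

d_pq = entropy(P, Q)  # D(P || Q)
d_qp = entropy(Q, P)  # D(Q || P)
d_pp = entropy(P, P)  # D(P || P)

# Q_zero assigns zero probability to an event that P considers possible
Q_zero = np.array([0.5, 0.5, 0.0])
d_inf = entropy(P, Q_zero)

print(f"D(P || Q)      = {d_pq:.4f}")
print(f"D(Q || P)      = {d_qp:.4f}")
print(f"D(P || P)      = {d_pp:.4f}")
print(f"D(P || Q_zero) = {d_inf}")
```

The two directions give different non-negative values, the divergence of a distribution from itself is zero, and the last case is infinite.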

---


<a id='computing-kl-divergence-in-python'></a>
## **5. Computing KL Divergence in Python**

We can compute KL divergence in Python using libraries like `NumPy`, `SciPy`, and `TensorFlow` (for more advanced applications). Below are examples for both discrete and continuous distributions.

<a id='51-discrete-probability-distributions'></a>
### **5.1. Discrete Probability Distributions**

Let's compute the KL divergence between two discrete probability distributions.

#### **Example:**

Suppose we have two discrete distributions $P$ and $Q$:

- $P$: The true distribution.
- $Q$: The approximate distribution.


```python
import numpy as np
from scipy.stats import entropy

# Define two discrete probability distributions
P = np.array([0.4, 0.35, 0.15, 0.1])
Q = np.array([0.3, 0.4, 0.2, 0.1])

# Ensure the distributions sum to 1
P = P / P.sum()
Q = Q / Q.sum()

# Compute KL divergence D(P || Q) in nats
kl_divergence = entropy(P, Q)

print(f"KL Divergence D(P || Q): {kl_divergence:.4f}")
```

**Output:**

```
KL Divergence D(P || Q): 0.0252
```

**Explanation:**

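- We use `scipy.stats.entropy`, which computes the KL divergence $D_{KL}(P \| Q)$ when two distributions are provided. It normalizes both inputs to sum to 1, so the explicit normalization in the code is a safeguard rather than a requirement.
- The result, in nats, indicates how much $Q$ diverges from $P$; values near zero mean that $Q$ is a close approximation of $P$.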
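
To see the individual terms of the sum, `scipy.special.rel_entr` returns the elementwise values $P(i) \log(P(i)/Q(i))$, and the `base` argument of `entropy` changes the unit. The following sketch reuses the same example values; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import entropy

P = np.array([0.4, 0.35, 0.15, 0.1])
Q = np.array([0.3, 0.4, 0.2, 0.1])

terms = rel_entr(P, Q)           # elementwise P_i * log(P_i / Q_i)
kl_from_terms = terms.sum()      # same value as entropy(P, Q), in nats
kl_bits = entropy(P, Q, base=2)  # same divergence expressed in bits

print("Per-event terms:", np.round(terms, 4))
print(f"Sum of terms: {kl_from_terms:.4f} nats")
print(f"In bits:      {kl_bits:.4f}")
```

Individual terms can be negative (wherever $Q(i) > P(i)$), but their sum never is.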

#### **Handling Zero Probabilities**

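If any probability in $Q$ is zero where $P$ is non-zero, the KL divergence becomes infinite. A common workaround is to add a small epsilon to $Q$ and renormalize so that it still sums to 1. This is a smoothing heuristic: when $Q$ actually contains zeros, the result can depend strongly on the chosen epsilon.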


```python
epsilon = 1e-10
Q_safe = Q + epsilon
Q_safe /= Q_safe.sum()

# Recompute KL divergence with Q_safe
kl_divergence_safe = entropy(P, Q_safe)
print(f"KL Divergence with epsilon D(P || Q_safe): {kl_divergence_safe:.4f}")
```

**Output:**

```
KL Divergence with epsilon D(P || Q_safe): 0.0252
```
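
In the example above $Q$ has no zero entries, so the smoothing leaves the result unchanged. The sketch below shows the case where it matters, using a made-up two-event distribution in which $Q$ assigns zero probability to an event that $P$ supports; the epsilon values are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import entropy

# Assumed toy distributions: Q rules out the second event entirely
P = np.array([0.5, 0.5])
Q = np.array([1.0, 0.0])

kl_unsmoothed = entropy(P, Q)  # infinite, because Q[1] == 0 while P[1] > 0
print(f"No smoothing: {kl_unsmoothed}")

kl_by_epsilon = {}
for epsilon in (1e-10, 1e-6, 1e-3):
    Q_safe = Q + epsilon
    Q_safe /= Q_safe.sum()  # renormalize after smoothing
    kl_by_epsilon[epsilon] = entropy(P, Q_safe)
    print(f"epsilon = {epsilon:g}: D(P || Q_safe) = {kl_by_epsilon[epsilon]:.4f}")
```

The smoothed values are finite but shrink as epsilon grows, so the reported divergence reflects the choice of epsilon as much as the data.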


<a id='52-continuous-probability-distributions'></a>
### **5.2. Continuous Probability Distributions**

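For continuous distributions, the integral can be evaluated in closed form when an analytical solution exists. Otherwise it has to be approximated, for example by discretizing the densities on a grid or by numerical integration.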

#### **Example:**

Compute the KL divergence between two normal distributions $P = N(\mu_p, \sigma_p^2)$ and $Q = N(\mu_q, \sigma_q^2)$.

The KL divergence between two normal distributions has a closed-form solution:

$$
D_{KL}(P \| Q) = \ln\left(\frac{\sigma_q}{\sigma_p}\right) + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2}
$$


```python
import numpy as np

# Parameters for P and Q
mu_p, sigma_p = 0, 1       # P ~ N(0, 1)
mu_q, sigma_q = 1, 1.5     # Q ~ N(1, 1.5^2)

# Compute KL divergence analytically
def kl_divergence_normal(mu_p, sigma_p, mu_q, sigma_q):
    return np.log(sigma_q / sigma_p) + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2) - 0.5

kl_div = kl_divergence_normal(mu_p, sigma_p, mu_q, sigma_q)
print(f"KL Divergence between N({mu_p}, {sigma_p}^2) and N({mu_q}, {sigma_q}^2): {kl_div:.4f}")
```

**Output:**

```
KL Divergence between N(0, 1^2) and N(1, 1.5^2): 0.3499
```
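
The closed-form value can be cross-checked by integrating the definition numerically. The sketch below uses `scipy.integrate.quad` with the same parameters; the integration range of $\pm 10$ standard deviations of $P$ is an assumption made here, chosen because the integrand is negligible beyond it.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu_p, sigma_p = 0, 1       # P ~ N(0, 1)
mu_q, sigma_q = 1, 1.5     # Q ~ N(1, 1.5^2)

def integrand(x):
    # p(x) * log(p(x) / q(x)), written with log-densities for numerical stability
    log_p = norm.logpdf(x, mu_p, sigma_p)
    log_q = norm.logpdf(x, mu_q, sigma_q)
    return np.exp(log_p) * (log_p - log_q)

# Integrate over a finite range; the tails of P beyond it contribute almost nothing
kl_numeric, abs_error = quad(integrand, -10, 10)

kl_closed_form = (np.log(sigma_q / sigma_p)
                  + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2) - 0.5)

print(f"Numerical integration: {kl_numeric:.4f}")
print(f"Closed form:           {kl_closed_form:.4f}")
```

The same numerical approach works for pairs of densities that have no closed-form divergence, provided the integration range covers the region where $P$ has appreciable mass.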

---


<a id='practical-examples'></a>
## **6. Practical Examples**

Let's delve into practical applications of KL divergence with detailed Python examples.

<a id='61-language-model-evaluation-in-nlp'></a>
### **6.1. Language Model Evaluation in NLP**

In Natural Language Processing, language models predict the probability distribution over words in a vocabulary. KL divergence can be used to compare the predicted distribution to the true distribution.

#### **Example:**

Suppose we have the true word distribution and a model's predicted distribution for the next word in a sentence.


```python
import numpy as np
from scipy.stats import entropy

# Vocabulary
vocab = ['the', 'cat', 'sat', 'on', 'mat']

# True distribution (from actual data)
P = np.array([0.4, 0.1, 0.2, 0.2, 0.1])

# Model's predicted distribution
Q = np.array([0.3, 0.15, 0.25, 0.2, 0.1])

# Compute KL divergence
kl_divergence = entropy(P, Q)

print(f"KL Divergence D(P || Q): {kl_divergence:.4f}")
```

**Output:**

```
KL Divergence D(P || Q): 0.0299
```

**Interpretation:**

- A lower KL divergence indicates the model's predictions are close to the true distribution.
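- This metric can guide model optimization during training: because the entropy of $P$ does not depend on the model, minimizing $D_{KL}(P \| Q)$ is equivalent to minimizing the cross-entropy between $P$ and $Q$.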
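
In practice the "true" distribution is usually estimated from observed data. The sketch below shows one way to do that from token counts; the ten-token list is a made-up toy corpus, chosen so that the empirical frequencies match the $P$ used in the example.

```python
from collections import Counter

import numpy as np
from scipy.stats import entropy

vocab = ['the', 'cat', 'sat', 'on', 'mat']

# Hypothetical toy corpus (assumed data for illustration)
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'sat', 'on', 'the']

# Empirical distribution: relative frequency of each vocabulary word
counts = Counter(tokens)
P = np.array([counts[word] for word in vocab], dtype=float)
P /= P.sum()

# Model's predicted distribution (same values as in the example)
Q = np.array([0.3, 0.15, 0.25, 0.2, 0.1])

kl_divergence = entropy(P, Q)
print("Empirical P:", P)
print(f"KL Divergence D(P || Q): {kl_divergence:.4f}")
```

With a real corpus, words that never occur get an empirical probability of zero, which is harmless in $P$ but would make the divergence infinite if it happened in $Q$.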


<a id='62-anomaly-detection-in-network-traffic'></a>
### **6.2. Anomaly Detection in Network Traffic**

KL divergence can detect anomalies in network traffic by comparing current traffic distributions to baseline normal distributions.

#### **Example:**

Assume we have baseline protocol usage and current protocol usage.


```python
import numpy as np
from scipy.stats import entropy

# Baseline protocol distribution (typical traffic)
P = np.array([0.6, 0.3, 0.1])  # [HTTP, HTTPS, FTP]

# Current protocol distribution (observed)
Q = np.array([0.5, 0.4, 0.1])

# Compute KL divergence of the observed distribution from the baseline
kl_divergence = entropy(Q, P)

print(f"KL Divergence D(Q || P): {kl_divergence:.4f}")
```

**Output:**

```
KL Divergence D(Q || P): 0.0239
```

**Note:**

- Here, we compute $D_{KL}(Q \| P)$ because we are assessing how the observed distribution diverges from the baseline.
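
To turn the score into a detector, one option is to compare each observation window against the baseline and flag windows whose divergence exceeds a threshold. The sketch below is only an illustration of that idea: the window data and the threshold of `0.05` are hypothetical, and in practice the threshold would be tuned on historical traffic.

```python
import numpy as np
from scipy.stats import entropy

baseline = np.array([0.6, 0.3, 0.1])  # [HTTP, HTTPS, FTP]

# Hypothetical observation windows (assumed values for illustration)
windows = {
    "window_1": np.array([0.58, 0.32, 0.10]),  # close to baseline
    "window_2": np.array([0.50, 0.40, 0.10]),  # mild shift
    "window_3": np.array([0.20, 0.20, 0.60]),  # FTP-heavy traffic
}

threshold = 0.05  # hypothetical cut-off, not a recommended value

# Score each window by D(observed || baseline) and flag the large ones
scores = {name: entropy(observed, baseline) for name, observed in windows.items()}
flagged = [name for name, score in scores.items() if score > threshold]

for name, score in scores.items():
    print(f"{name}: D = {score:.4f}")
print("Flagged:", flagged)
```

Only the FTP-heavy window exceeds the threshold here; the mild shift in `window_2` scores about 0.0239 and passes.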


<a id='63-divergence-in-image-distributions'></a>
### **6.3. Divergence in Image Distributions**

In image processing, KL divergence can compare histograms of images to detect changes or anomalies.

#### **Example:**

Suppose we have two grayscale images and we want to measure how similar they are.


```python
import numpy as np
import cv2
from scipy.stats import entropy

# Load two grayscale images (cv2.imread returns None if a file cannot be read)
image1 = cv2.imread('image1.png', cv2.IMREAD_GRAYSCALE)
image2 = cv2.imread('image3.png', cv2.IMREAD_GRAYSCALE)
if image1 is None or image2 is None:
    raise FileNotFoundError("Could not read one of the input images")

# Compute normalized intensity histograms (256 bins of width 1)
hist1, _ = np.histogram(image1.flatten(), bins=256, range=(0, 256), density=True)
hist2, _ = np.histogram(image2.flatten(), bins=256, range=(0, 256), density=True)

# Add epsilon to avoid zeros; entropy() renormalizes its inputs
epsilon = 1e-10
hist1 += epsilon
hist2 += epsilon

# Compute KL divergence
kl_divergence = entropy(hist1, hist2)

print(f"KL Divergence between images: {kl_divergence:.4f}")
```

**Output:**

```
KL Divergence between images: 1.0684
```


**Interpretation:**

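- A higher KL divergence indicates that the images have more dissimilar pixel intensity distributions; identical histograms give a value of 0.
- Because only the histograms are compared, the spatial arrangement of pixels is ignored.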
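
The same computation can be tried without image files by generating synthetic grayscale arrays. In the sketch below, the helper `intensity_histogram`, the image sizes, and the intensity parameters are all assumptions made for illustration; it compares two draws from the same "dark" intensity distribution with a draw from a "bright" one.

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)

def synthetic_image(mean, std=20, shape=(64, 64)):
    # Random grayscale image with roughly Gaussian pixel intensities
    return rng.normal(mean, std, size=shape).clip(0, 255).astype(np.uint8)

def intensity_histogram(image, epsilon=1e-10):
    # Normalized 256-bin histogram, smoothed to avoid zero bins
    hist, _ = np.histogram(image.ravel(), bins=256, range=(0, 256))
    hist = hist.astype(float) + epsilon
    return hist / hist.sum()

dark = synthetic_image(mean=80)
dark_again = synthetic_image(mean=80)
bright = synthetic_image(mean=170)

kl_same = entropy(intensity_histogram(dark), intensity_histogram(dark_again))
kl_different = entropy(intensity_histogram(dark), intensity_histogram(bright))

print(f"Dark vs. dark (new draw): {kl_same:.4f}")
print(f"Dark vs. bright:          {kl_different:.4f}")
```

The divergence between the two dark images is small but not zero, since sparse histogram bins differ between draws, while the dark and bright images give a much larger value.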


---

<a id='conclusion'></a>
## **8. Conclusion**

The Kullback–Leibler divergence is a powerful tool for measuring how one probability distribution diverges from another. It has numerous applications across various fields, including statistics, machine learning, natural language processing, and more. By understanding both the theoretical foundation and practical implementation of KL divergence, you can apply this concept to solve real-world problems effectively.

---


<a id='references'></a>
## **9. References**

- **Books**:
  - *Elements of Information Theory* by Thomas M. Cover and Joy A. Thomas.
- **Articles**:
  - [Wikipedia - Kullback–Leibler Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)
- **Libraries**:
  - [SciPy Documentation - entropy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html)
  - [TensorFlow Probability](https://www.tensorflow.org/probability)


