Kullback-Leibler (KL) divergence! The idea is to measure how different two probability distributions are. It is often described as a distance between them, although it is not symmetric.

In [57]:
import numpy as np
import torch

coin1 = np.array([0.5,0.5])
p = 0.4
q = 1 - p
coin2 = np.array([p,q])
# How close coin2 is to coin1 depends on p: with p = 0.55 the coins are close; with p = 0.95 they are not.
# A p = 0.95 coin can be told apart from the fair one after just a few tosses, because it lands heads far more often.
# In effect, we check whether the two distributions assign similar probabilities to the same sequences.
# If they do, we can say the distributions are similar.

In [58]:
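# KL(coin1 || coin2) by hand: sum over outcomes of coin1[x] * log(coin1[x] / coin2[x])
kl = coin1[0]*np.log(coin1[0]/coin2[0]) + coin1[1]*np.log(coin1[1]/coin2[1])
c2 = torch.tensor(coin2)
c1 = torch.tensor(coin1)

# kl_div takes log-probabilities of the approximating distribution first, then the target probabilities
print(torch.nn.functional.kl_div(torch.log(c2), c1, reduction="sum"))
print(kl)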

tensor(0.0204, dtype=torch.float64)
0.020410997260127586
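
The hand-written two-term formula extends to any pair of discrete distributions. Below is a minimal sketch, assuming both arrays sum to 1 and that `q` is non-zero wherever `p` is non-zero; the helper name `kl_divergence` is made up for illustration. It also lets us check the comment that a p = 0.55 coin is close to the fair one while a p = 0.95 coin is not.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # outcomes with p(x) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

fair = np.array([0.5, 0.5])
for p_heads in (0.4, 0.55, 0.95):
    biased = np.array([p_heads, 1 - p_heads])
    # The divergence should grow as p_heads moves away from 0.5
    print(p_heads, kl_divergence(fair, biased))
```

For `p_heads = 0.4` this should reproduce the value printed by the cell above.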


Here’s the **intuition and derivation** of KL divergence straight from the video, illustrated with our fair-vs-biased-coin example:

---

## 1. Imagine two coins

* **True coin** $P$: fair, $P(\mathrm{H})=0.5,\;P(\mathrm{T})=0.5$
* **Assumed coin** $Q$: biased toward heads, $Q(\mathrm{H})=0.9,\;Q(\mathrm{T})=0.1$

You flip one of them $N$ times and record your sequence of H/T.

---

## 2. Compare likelihoods of a sequence

For a particular sequence with $k$ heads and $N-k$ tails:

$$
\begin{aligned}
P(\text{seq}) &= 0.5^k\;0.5^{\,N-k} \;,\\
Q(\text{seq}) &= 0.9^k\;0.1^{\,N-k}\;.
\end{aligned}
$$

If $Q$ really matched $P$, those two numbers would be about equal.

Take the **ratio**:

$$
\frac{P(\text{seq})}{Q(\text{seq})}
= \Bigl(\tfrac{0.5}{0.9}\Bigr)^k
  \;\Bigl(\tfrac{0.5}{0.1}\Bigr)^{N-k}.
$$

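Then take the log (to turn products into sums and tame tiny numbers) and **normalize** by $N$ to get a per-flip average: $\tfrac{1}{N}\ln\tfrac{P(\text{seq})}{Q(\text{seq})} = \tfrac{k}{N}\ln\tfrac{0.5}{0.9} + \tfrac{N-k}{N}\ln\tfrac{0.5}{0.1}$. When the flips come from the true coin $P$, $k/N \to P(\mathrm{H}) = 0.5$ as $N$ grows, so this average converges to: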

$$
D_{KL}(P\|Q)
= \sum_{x\in\{H,T\}} P(x)\,\ln\frac{P(x)}{Q(x)}.
$$

This derivation is laid out step by step in the video 👇 ([YouTube][1], [공부방][2])
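
To see the "average log-ratio" argument numerically, here is a small simulation sketch. It assumes the flips are drawn from the fair coin $P$ and uses a fixed seed for repeatability; the function name `avg_log_ratio` is illustrative, not something from the video.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

p_heads, q_heads = 0.5, 0.9  # true coin P and assumed coin Q

def avg_log_ratio(n_flips):
    """(1/N) * ln(P(seq) / Q(seq)) for one sequence of n_flips drawn from P."""
    k = rng.binomial(n_flips, p_heads)  # number of heads in the sequence
    log_p = k * np.log(p_heads) + (n_flips - k) * np.log(1 - p_heads)
    log_q = k * np.log(q_heads) + (n_flips - k) * np.log(1 - q_heads)
    return (log_p - log_q) / n_flips

# Closed-form KL divergence for comparison
kl_exact = (p_heads * np.log(p_heads / q_heads)
            + (1 - p_heads) * np.log((1 - p_heads) / (1 - q_heads)))

for n in (10, 1_000, 100_000):
    print(n, avg_log_ratio(n), kl_exact)
```

For short sequences the average log-ratio fluctuates a lot; as `n` grows it should settle near the closed-form value.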

---

## 3. Plug in our numbers

$$
\begin{aligned}
D_{KL}(P\|Q)
&= 0.5\ln\frac{0.5}{0.9}
+ 0.5\ln\frac{0.5}{0.1}\\[4pt]
&\approx 0.5(-0.5878)+0.5(1.6094)
= 0.5108\text{ nats}
\quad\bigl(\approx0.737\text{ bits}\bigr).
\end{aligned}
$$

So on average you’d pay about **0.51 nats of extra “surprise”** per flip by believing the heavily biased coin when the true coin is fair. 😲
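
The arithmetic can be checked in a few lines. This is a minimal sketch using only the standard library; the dictionaries `P` and `Q` simply encode the two coins from section 1.

```python
import math

P = {"H": 0.5, "T": 0.5}  # true (fair) coin
Q = {"H": 0.9, "T": 0.1}  # assumed (biased) coin

# Sum of P(x) * ln(P(x) / Q(x)) over both outcomes, in nats
kl_nats = sum(P[x] * math.log(P[x] / Q[x]) for x in P)
# Convert to bits: 1 bit = ln(2) nats
kl_bits = kl_nats / math.log(2)

print(f"{kl_nats:.4f} nats, {kl_bits:.3f} bits")
```

The printed values should agree with the 0.5108 nats and 0.737 bits worked out by hand.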

---

## 4. Key take-aways

* **Asymmetry**: $D_{KL}(P\|Q)\neq D_{KL}(Q\|P)$. Swapping them gives a different “surprise.”
* **Never negative, and zero only if** $P=Q$. A model that matches reality incurs no extra surprise.
* **Uses**: model comparison, coding theory (extra bits wasted), variational inference (as a loss).
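
The first two take-aways are easy to verify on the same two coins. A minimal sketch, assuming both distributions are lists over the same outcomes in the same order; the helper `kl` is an illustrative name.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; p and q list probabilities over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]  # fair coin
Q = [0.9, 0.1]  # biased coin

print("D_KL(P||Q) =", kl(P, Q))  # fair coin is the truth, biased coin is the model
print("D_KL(Q||P) =", kl(Q, P))  # roles swapped: a different number
print("D_KL(P||P) =", kl(P, P))  # identical distributions: zero
```

Swapping the arguments changes the result because the log-ratios are weighted by whichever distribution comes first.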

That’s the heart of the video’s coin-flip story: KL divergence is a natural, average log-ratio of “how much more surprising” your data is under the wrong coin. 🎲✨

[1]: https://www.youtube.com/watch?pp=0gcJCdgAo7VqN5tD&v=SxGYPqCgJWM&utm_source=chatgpt.com "Intuitively Understanding the KL Divergence - YouTube"
[2]: https://mookstudy.tistory.com/12?utm_source=chatgpt.com "KL Divergence derivation explained really simply - 공부방 - Tistory"


KL divergence is the average “extra surprise” you pay per flip (in nats or bits) by assuming a biased coin when the true coin is fair. 