# Information Theory and Website Relationship  
### Shannon Entropy, Conditional Entropy, Mutual Information, KL Divergence  

This notebook explores how fundamental concepts of information theory can be applied to compare the summaries generated by an AI system (such as NotebookLM or Gemini) with my own notes, published on a simple website.  

The analysis draws on the course *Introduction to Artificial Intelligence* taught by **Professor Keita Tokuda** at **Juntendo University**, relating Shannon entropy, conditional entropy, mutual information, and KL divergence to the process of summarization.  

**Author:** Judith Urbina  
**Project:** Final Report – Website for Comparing Human Notes and AI Summaries  

## Token Vocabulary

Imagine a vocabulary of **6 tokens**. Both my notes and the AI-generated summary have been tokenized using this set of 6 tokens.  
This allows us to analyze the informational relationship between both texts using concepts from information theory, such as **Shannon entropy**, **conditional entropy**, **mutual information**, and **KL divergence**.

---

### What does each token represent?

Each token ($t_1$ to $t_6$) is a basic unit of information that appears in both texts.  
From the joint and marginal distributions of these tokens, we can calculate how much information my notes and the AI summary share, as well as how much uncertainty about one text remains once the other is known.

---

> **Example tokens:**  
> - $t_1$: "information"  
> - $t_2$: "theory"  
> - $t_3$: "web"  
> - $t_4$: "entropy"  
> - $t_5$: "mutual"  
> - $t_6$: "divergence"

---

Next, we explore how these tokens relate and what they reveal about the similarity and difference between my notes and the AI summary!

In [None]:
# tokens
[f"t{i}" for i in range(1, 7)]

['t1', 't2', 't3', 't4', 't5', 't6']

### Joint Distribution $p(x, y)$

The **joint distribution** $p(x, y)$ describes the probability of each token pair, where:
- $X$ represents my tokenized notes
- $Y$ represents the tokenized AI summary

This matrix captures how often each token from my notes appears together with each token from the AI summary.  
The sum of all entries in $p(x, y)$ must be **1**, ensuring it is a valid probability distribution.

---

| $X \backslash Y$ | $t_1$ | $t_2$ | $t_3$ | $t_4$ | $t_5$ | $t_6$ |
|--------|-------|-------|-------|-------|-------|-------|
| **$t_1$** | 0.28  | 0.04  | 0     | 0     | 0     | 0     |
| **$t_2$** | 0.02  | 0.19  | 0     | 0     | 0     | 0     |
| **$t_3$** | 0     | 0     | 0.15  | 0.03  | 0     | 0     |
| **$t_4$** | 0     | 0     | 0.02  | 0.10  | 0     | 0     |
| **$t_5$** | 0     | 0     | 0     | 0     | 0.08  | 0.02  |
| **$t_6$** | 0     | 0     | 0     | 0     | 0.01  | 0.06  |

---

This table visually summarizes the relationship between the tokens in both texts, forming the basis for further information-theoretic analysis.

In [47]:
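import numpy as np
# 6x6 joint distribution p(x, y): rows index X (my notes), columns index Y (AI summary)
pxy = np.zeros((6,6), dtype=float)
# Off-diagonal entries: a token in my notes paired with a different token in the AI summary
other_entries = {
    (0,1):0.04,
    (1,0):0.02,
    (2,3):0.03,
    (3,2):0.02,
    (4,5):0.02,
    (5,4):0.01
}
for (i,j), v in other_entries.items():
    pxy[i,j]=v
pxy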


array([[0.  , 0.04, 0.  , 0.  , 0.  , 0.  ],
       [0.02, 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.03, 0.  , 0.  ],
       [0.  , 0.  , 0.02, 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.02],
       [0.  , 0.  , 0.  , 0.  , 0.01, 0.  ]])

In [48]:
# Diagonal entries: both texts use the same token
np.fill_diagonal(pxy,[0.28, 0.19, 0.15, 0.10, 0.08, 0.06])
pxy

array([[0.28, 0.04, 0.  , 0.  , 0.  , 0.  ],
       [0.02, 0.19, 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.15, 0.03, 0.  , 0.  ],
       [0.  , 0.  , 0.02, 0.1 , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.08, 0.02],
       [0.  , 0.  , 0.  , 0.  , 0.01, 0.06]])

Check that the joint probabilities sum to 1 (up to floating-point rounding):

In [49]:
np.sum(pxy)

1.0000000000000002
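The check above can be wrapped in a small reusable function. The following is a minimal sketch, assuming the joint table is entered as a literal array; the helper name `is_valid_joint` and the tolerance `tol` are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Joint distribution p(x, y) copied from the table above:
# rows index X (my notes), columns index Y (AI summary).
pxy = np.array([
    [0.28, 0.04, 0.00, 0.00, 0.00, 0.00],
    [0.02, 0.19, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.15, 0.03, 0.00, 0.00],
    [0.00, 0.00, 0.02, 0.10, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.08, 0.02],
    [0.00, 0.00, 0.00, 0.00, 0.01, 0.06],
])

def is_valid_joint(p, tol=1e-9):
    """Hypothetical helper: True if p has no negative entries and sums to 1 within tol."""
    p = np.asarray(p, dtype=float)
    return bool(np.all(p >= 0) and abs(p.sum() - 1.0) <= tol)

print(is_valid_joint(pxy))  # True: the sum differs from 1 only by rounding error
```

Comparing against a tolerance rather than testing `== 1` avoids a false failure caused by floating-point rounding in the sum.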

## Marginal Probabilities of the Tokens

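The marginal probability distributions are obtained by summing the joint distribution over the other variable: $p_X(x) = \sum_y p(x, y)$ (row sums) and $p_Y(y) = \sum_x p(x, y)$ (column sums).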

- **For X (my notes):**  
    Each entry in `px` represents the probability of a token appearing in my notes.

- **For Y (AI summary):**  
    Each entry in `py` represents the probability of a token appearing in the AI summary.

| Token | $p_X(x)$ | $p_Y(y)$ |
|-------|----------|----------|
| $t_1$ | 0.32     | 0.30     |
| $t_2$ | 0.21     | 0.23     |
| $t_3$ | 0.18     | 0.17     |
| $t_4$ | 0.12     | 0.13     |
| $t_5$ | 0.10     | 0.09     |
| $t_6$ | 0.07     | 0.08     |

These probabilities summarize how frequently each token appears in the respective distributions.

In [23]:
px = pxy.sum(axis=1)
py = pxy.sum(axis=0)
px,py

(array([0.32, 0.21, 0.18, 0.12, 0.1 , 0.07]),
 array([0.3 , 0.23, 0.17, 0.13, 0.09, 0.08]))

### Entropies and Conditional Entropies

The **conditional entropy** $H(X|Y)$ quantifies the remaining uncertainty in my notes ($X$) after knowing the AI summary ($Y$).  
A lower value of $H(X|Y)$ means the summary captures more information from my notes, leaving less uncertainty.
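As a cross-check, the conditional entropy can be computed directly from its definition, $H(X|Y) = -\sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p_Y(y)}$, and compared with the chain rule $H(X|Y) = H(X,Y) - H(Y)$. The sketch below assumes the joint table from this notebook; the function name `conditional_entropy_x_given_y` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

# Joint distribution p(x, y): rows index X (my notes), columns index Y (AI summary).
pxy = np.array([
    [0.28, 0.04, 0.00, 0.00, 0.00, 0.00],
    [0.02, 0.19, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.15, 0.03, 0.00, 0.00],
    [0.00, 0.00, 0.02, 0.10, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.08, 0.02],
    [0.00, 0.00, 0.00, 0.00, 0.01, 0.06],
])

def conditional_entropy_x_given_y(p_joint):
    """H(row variable | column variable) in bits, straight from the definition."""
    p_joint = np.asarray(p_joint, dtype=float)
    # Column sums give p_Y(y); broadcast them to the shape of the joint table.
    p_y = np.broadcast_to(p_joint.sum(axis=0, keepdims=True), p_joint.shape)
    mask = p_joint > 0  # zero-probability pairs contribute nothing
    return float(-np.sum(p_joint[mask] * np.log2(p_joint[mask] / p_y[mask])))

def entropy_bits(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

direct = conditional_entropy_x_given_y(pxy)
chain_rule = entropy_bits(pxy) - entropy_bits(pxy.sum(axis=0))
print(direct, chain_rule)  # the two values agree up to rounding error

# H(Y|X) follows by transposing, which swaps the roles of X and Y.
print(conditional_entropy_x_given_y(pxy.T))
```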

#### Calculated Values

- **Entropy of my notes ($H(X)$):**  
    $H(X) = 2.41$ bits

- **Entropy of the AI summary ($H(Y)$):**  
    $H(Y) = 2.43$ bits

- **Joint entropy ($H(X, Y)$):**  
    $H(X, Y) = 2.99$ bits

- **Conditional entropy ($H(X|Y)$):**  
    $H(X|Y) = 0.56$ bits

- **Conditional entropy ($H(Y|X)$):**  
    $H(Y|X) = 0.58$ bits

---

These values show how much information is shared and how much uncertainty remains between my notes and the AI summary.  
The **conditional entropies** are much lower than the individual entropies, indicating a strong informational relationship between both texts.

The **maximum value of the conditional entropy** is $\log_2(6) \approx 2.58$ bits, reached only if all 6 tokens were equally likely and the two texts were independent.  
The entropies of my notes and the AI summary are close to this maximum. The joint entropy of $2.99$ bits is far below its own maximum of $\log_2(36) \approx 5.17$ bits, and the conditional entropies are low.  
This means that, although neither text fully determines the other (the conditional entropies are not zero), they share a significant amount of context: knowing one leaves little uncertainty about the other.

In [52]:
# Upper bound on the joint entropy H(X, Y): log2 of the number of (x, y) pairs
np.log2(len(pxy.flatten()))

5.169925001442312

In [55]:
# Upper bound on H(X), H(Y) and the conditional entropies: log2 of the vocabulary size
np.log2(len(px.flatten()))

2.584962500721156

In [40]:
from math import log2
# Shannon entropies in bits; zero-probability terms are skipped (0 * log 0 is taken as 0)
Hx = -sum(pxi*log2(pxi) for pxi in px if pxi > 0)
Hy = -sum(pyi*log2(pyi) for pyi in py if pyi > 0)
Hxy = -sum(pxyi*log2(pxyi) for pxyi in pxy.flatten() if pxyi > 0)
round(Hx, 3), round(Hy, 3), round(Hxy, 3)

(2.412, 2.43, 2.99)

In [44]:
# Chain rule: H(X|Y) = H(X,Y) - H(Y) and H(Y|X) = H(X,Y) - H(X)
Hx_given_y = Hxy - Hy
Hy_given_x = Hxy - Hx
round(Hx_given_y, 3), round(Hy_given_x, 3)

(0.56, 0.578)

### Mutual Information

**Mutual information** quantifies the degree of shared information between my notes and the AI summary.  
It measures how much knowing one text reduces uncertainty about the other.  
A higher mutual information value indicates greater overlap and similarity in the information content.  
The maximum possible mutual information is the minimum of the entropy of my notes and the entropy of the AI summary.

Mutual information is computed using the entropies of the individual distributions and their joint distribution:

$$
I(X; Y) = H(X) + H(Y) - H(X, Y)
$$

or equivalently,

$$
I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
$$
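Mutual information can also be computed directly as $I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p_X(x)\,p_Y(y)}$, which is the KL divergence between the joint distribution and the product of its marginals. The sketch below, assuming the joint table from this notebook, checks that this agrees with the entropy identity above; `mutual_information` and `entropy_bits` are hypothetical helper names.

```python
import numpy as np

# Joint distribution p(x, y): rows index X (my notes), columns index Y (AI summary).
pxy = np.array([
    [0.28, 0.04, 0.00, 0.00, 0.00, 0.00],
    [0.02, 0.19, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.15, 0.03, 0.00, 0.00],
    [0.00, 0.00, 0.02, 0.10, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.08, 0.02],
    [0.00, 0.00, 0.00, 0.00, 0.01, 0.06],
])

def mutual_information(p_joint):
    """I(X;Y) in bits, computed from the definition rather than from entropies."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_x = p_joint.sum(axis=1, keepdims=True)  # shape (n, 1)
    p_y = p_joint.sum(axis=0, keepdims=True)  # shape (1, m)
    p_indep = p_x * p_y                       # joint table under independence
    mask = p_joint > 0                        # zero cells contribute nothing
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / p_indep[mask])))

def entropy_bits(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

from_definition = mutual_information(pxy)
from_entropies = (entropy_bits(pxy.sum(axis=1)) + entropy_bits(pxy.sum(axis=0))
                  - entropy_bits(pxy))
print(from_definition, from_entropies)  # the two values agree up to rounding error
```

If the two texts were independent, so that $p(x,y) = p_X(x)\,p_Y(y)$ for every pair, every logarithm in the sum would be zero and the mutual information would vanish.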

### KL Divergence

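**KL divergence** measures how different the distribution of tokens in my notes is from the distribution in the AI summary.  
It gives the expected number of extra bits per token needed to encode tokens drawn from my notes ($P_X$) with a code optimized for the AI summary ($P_Y$). It compares only the two marginal distributions and is not symmetric: in general $D_{KL}(P_X \| P_Y) \neq D_{KL}(P_Y \| P_X)$.  
Unlike mutual information, KL divergence does not have a fixed maximum; the closer it is to zero, the more similar the distributions are.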

KL divergence is calculated using the probabilities of each token in both distributions:

$$
D_{KL}(P_X \| P_Y) = \sum_{i} p_X(i) \log_2 \frac{p_X(i)}{p_Y(i)}
$$
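The formula above translates into a few lines of NumPy. This is a minimal sketch, assuming the marginals are taken from the joint table in this notebook; `kl_divergence` is a hypothetical helper name, and returning infinity when $P_Y$ assigns zero probability to a token that $P_X$ uses is the standard convention rather than something stated in the original.

```python
import numpy as np

# Joint distribution p(x, y): rows index X (my notes), columns index Y (AI summary).
pxy = np.array([
    [0.28, 0.04, 0.00, 0.00, 0.00, 0.00],
    [0.02, 0.19, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.15, 0.03, 0.00, 0.00],
    [0.00, 0.00, 0.02, 0.10, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.08, 0.02],
    [0.00, 0.00, 0.00, 0.00, 0.01, 0.06],
])
px = pxy.sum(axis=1)  # marginal of my notes
py = pxy.sum(axis=0)  # marginal of the AI summary

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions over the same tokens."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(i) = 0 contribute nothing
    if np.any(q[mask] == 0):
        return float("inf")  # q rules out a token that p actually uses
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

forward = kl_divergence(px, py)  # D_KL(P_X || P_Y)
reverse = kl_divergence(py, px)  # D_KL(P_Y || P_X)
print(forward, reverse)          # both small, but not equal: KL is asymmetric
```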

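- **Mutual Information:**  
    $I(X; Y) = 1.85$ bits

- **KL divergence:**  
    $D_{KL}(P_X \| P_Y) = 0.0049$ bits

A higher mutual information means my notes and the AI summary share more content; a value near zero would mean they are nearly independent. Here it is a large fraction of its maximum, $\min(H(X), H(Y))$.  
A lower KL divergence means the token distributions are more similar; a higher value means they are more different. Here it is very close to zero, so both texts use the tokens with almost the same frequencies.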

In [57]:
# Upper bound on the mutual information: min(H(X), H(Y))
round(Hx, 3), round(Hy, 3), round(min(Hx,Hy), 3)

(2.412, 2.43, 2.412)

In [58]:
Ixy = Hx + Hy - Hxy
round(Ixy, 3)

1.852

In [59]:
# D_KL(P_X || P_Y) in bits; assumes every entry of px and py is positive
KL = np.sum(px * np.log2(px / py))
round(KL, 4)

0.0049