# Notebook 0.3: Computation and Information Theory Basics

### Objective:
To understand how information is defined and measured, and how computation relates to entropy, compression, and divergence between distributions.

### Section 1: What is Information?

- Information is a **reduction in uncertainty**.
- The more unexpected an event, the more information it carries.

**Bit**: The fundamental unit of information. One bit is the information gained from the answer to a Yes/No question whose two answers are equally likely.

If a random variable \( X \) takes values in a set with probabilities \( P(x) \), then:

- Information content of outcome \( x \):
\[ I(x) = -\log_2 P(x) \]
- More probable events carry **less** information.
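
As a quick check of the formula above, the sketch below evaluates \( I(x) \) for a few probabilities. The helper name `self_information` is introduced here for illustration only and is not part of the notebook's code.

```python
import math

def self_information(p):
    """Information content, in bits, of an outcome with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    # log2(1/p) is the same quantity as -log2(p)
    return math.log2(1 / p)

# Rarer outcomes carry more bits; a certain outcome carries none.
for p in (1.0, 0.5, 0.25, 0.01):
    print(f"P(x) = {p:<4} -> I(x) = {self_information(p):.3f} bits")
```

A fair coin flip (\( P(x) = 0.5 \)) carries exactly one bit, which matches the Yes/No picture of a bit.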

### Section 2: Shannon Entropy

Shannon entropy quantifies **average information** in a random variable:
\[ H(X) = -\sum_{x \in X} P(x) \log_2 P(x) \]

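- \( H(X) = 0 \) when the outcome is certain.
- \( H(X) \) is largest, \( \log_2 n \) bits, when all \( n \) outcomes are equally likely.
- Higher entropy = more uncertainty = more information needed, on average, to describe the outcome.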
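
The two extremes can be seen with a biased coin. The sketch below is an illustration using only the standard library; `binary_entropy` is a name introduced here, not part of the notebook's code.

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # the outcome is certain, so there is no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Entropy is 0 at the extremes and peaks at 1 bit for a fair coin.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"P(heads) = {p:.1f} -> H = {binary_entropy(p):.3f} bits")
```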

In [3]:
!pip3 install numpy


In [4]:
import numpy as np

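def shannon_entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes: they contribute 0, and log2(0) is undefined
    return -np.sum(p * np.log2(p))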

# Example distributions
uniform = [0.25, 0.25, 0.25, 0.25]
biased = [0.7, 0.1, 0.1, 0.1]

print("Entropy (uniform):", shannon_entropy(uniform))
print("Entropy (biased):", shannon_entropy(biased))

Entropy (uniform): 2.0
Entropy (biased): 1.3567796494470397


### Section 3: Compression and Redundancy

- Compression reduces size by **removing predictable patterns**.
- If data is highly predictable → high redundancy → high compression possible.
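
The link between predictability and compressibility can be sketched with Python's standard `zlib` module. This is an illustration, not a measurement of entropy: the two inputs below are made up for the demo, and `zlib` is a practical compressor that does not reach the theoretical limit.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Two inputs of the same length (1000 bytes each), chosen for illustration.
predictable = b"ab" * 500  # one short pattern repeated
unpredictable = bytes(rng.getrandbits(8) for _ in range(1000))  # pseudo-random bytes

for name, data in (("predictable", predictable), ("unpredictable", unpredictable)):
    compressed = zlib.compress(data, 9)  # 9 = highest compression level
    print(f"{name}: {len(data)} bytes -> {len(compressed)} bytes")
```

The repeated pattern shrinks to a small fraction of its size, while the pseudo-random bytes do not shrink at all.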

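**Entropy = Theoretical limit of lossless compression**: on average, no lossless code can use fewer than \( H(X) \) bits per symbol.

For example:
- A message of 1000 symbols from a source with entropy 2.5 bits/symbol → can be compressed to ~2500 bits at best.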
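
The same arithmetic can be sketched for a concrete distribution. The helper `coding_costs` below is hypothetical, and it assumes the symbols are drawn independently from the given distribution. It compares a naive fixed-length code with the entropy bound.

```python
import math

def coding_costs(probs, n_symbols):
    """Return (fixed_length_bits, entropy_bound_bits) for n_symbols symbols.

    Assumes symbols are independent draws from `probs`.
    """
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    # A fixed-length code spends the same whole number of bits on every symbol.
    fixed_bits_per_symbol = math.ceil(math.log2(len(probs)))
    return n_symbols * fixed_bits_per_symbol, n_symbols * entropy

fixed, bound = coding_costs([0.7, 0.1, 0.1, 0.1], 1000)
print(f"fixed-length code: {fixed} bits")
print(f"entropy bound:     {bound:.0f} bits")
print(f"redundancy:        {1 - bound / fixed:.1%}")
```

For a uniform distribution over four symbols the two numbers coincide, so there is no redundancy left to remove.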

### Section 4: KL Divergence

Kullback-Leibler (KL) divergence measures the difference between two probability distributions:
\[ D_{KL}(P || Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)} \]

- Not symmetric: in general, \( D_{KL}(P || Q) \neq D_{KL}(Q || P) \)
- Measures how inefficient it is to assume distribution Q when the true distribution is P: the expected number of extra bits per symbol paid for coding with Q instead of P.

In [None]:
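def kl_divergence(p, q):
    """KL divergence D(P || Q) in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute nothing
    if np.any(q[mask] == 0):
        return np.inf  # Q rules out an outcome that P allows
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))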

# P is true distribution, Q is approximation
P = [0.7, 0.1, 0.1, 0.1]
Q = [0.25, 0.25, 0.25, 0.25]
print("KL Divergence D(P || Q):", kl_divergence(P, Q))
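
The asymmetry and the "extra bits" reading can both be checked numerically. The sketch below uses only the standard library. The helper names (`entropy_bits`, `cross_entropy_bits`, `kl_bits`) are introduced here for illustration, and the code assumes Q(x) > 0 wherever P(x) > 0.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average bits per symbol when coding for Q while the data follows P.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """KL divergence D(P || Q) in bits, under the same assumption."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.1, 0.1, 0.1]
Q = [0.25, 0.25, 0.25, 0.25]

print(f"D(P || Q) = {kl_bits(P, Q):.4f} bits")
print(f"D(Q || P) = {kl_bits(Q, P):.4f} bits")  # differs from D(P || Q)

# The divergence equals the coding overhead: cross-entropy minus entropy.
overhead = cross_entropy_bits(P, Q) - entropy_bits(P)
print(f"H(P, Q) - H(P) = {overhead:.4f} bits")
```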

### Summary

| Concept       | Meaning                                      |
|---------------|----------------------------------------------|
| Bit           | Unit of information (answer to Yes/No)       |
| Entropy       | Average uncertainty in a distribution         |
| Compression   | Reducing data size by removing redundancy     |
| KL Divergence | Difference between two distributions          |

Information theory tells us **how many bits are needed, on average, to encode the outcomes of an uncertain source**. This is fundamental to compression, learning, and even language models.