# The Source Coding Theorem

## Information content of independent random variables

If $x$ and $y$ are independent, the following identities hold:
$$h\left(x,y\right) = h\left(x\right) + h\left(y\right)$$
$$H\left(X,Y\right) = H\left(X\right) + H\left(Y\right)$$
That is, the entropy and the Shannon information content are additive for independent variables.
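The additivity can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper `entropy_bits` and the two example distributions are assumptions made for illustration and are not part of `entropy_lab`.

```python
import math
from itertools import product


def entropy_bits(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


# Two illustrative marginal distributions (hypothetical values)
p_x = [0.5, 0.25, 0.25]
p_y = [0.9, 0.1]

# Independence means the joint probability is the product of the marginals
p_xy = [px * py for px, py in product(p_x, p_y)]

H_x = entropy_bits(p_x)
H_y = entropy_bits(p_y)
H_xy = entropy_bits(p_xy)

print(f"H(X)        = {H_x:.4f} bits")
print(f"H(Y)        = {H_y:.4f} bits")
print(f"H(X) + H(Y) = {H_x + H_y:.4f} bits")
print(f"H(X,Y)      = {H_xy:.4f} bits")

# The same holds for a single pair of outcomes:
# h(x,y) = log2(1 / (P(x) P(y))) = h(x) + h(y)
h_x = math.log2(1.0 / p_x[0])
h_y = math.log2(1.0 / p_y[1])
h_xy = math.log2(1.0 / (p_x[0] * p_y[1]))
print(f"h(x) + h(y) = {h_x + h_y:.4f} bits, h(x,y) = {h_xy:.4f} bits")
```

If the joint distribution were not the product of the marginals, the joint entropy would in general be smaller than the sum.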

## Designing informative experiments

One important property of the entropy is the following: the entropy of an ensemble $X$ is maximal when all outcomes have equal probability $p_i=1/|\mathcal{A}_X|$. In other words, the outcome of a random experiment is most informative on average when the probability distribution over outcomes is uniform.
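One way to see why the uniform distribution is the maximizer is sketched here. It uses the Kullback-Leibler divergence $D_{\mathrm{KL}}$, a quantity not introduced in this section, to compare an arbitrary distribution $p$ with the uniform distribution $u_i = 1/|\mathcal{A}_X|$:

$$\log_2|\mathcal{A}_X| - H(X) = \sum_i p_i \log_2|\mathcal{A}_X| + \sum_i p_i \log_2 p_i = \sum_i p_i \log_2\frac{p_i}{u_i} = D_{\mathrm{KL}}\left(p\,\|\,u\right) \geq 0$$

Equality holds only if $p_i = u_i$ for all $i$, so $H(X) \leq \log_2|\mathcal{A}_X|$, and the bound is attained by the uniform distribution. With four outcomes, for instance, the bound is $\log_2 4 = 2$ bits.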

### Example: Customer support ticket categories

We want to get a better intuition of what this property means. Let us assume that a support team receives tickets in 4 categories:
* Billing
* Technical issue
* Account access
* Cancellation

If one category happens almost all the time, then each new ticket tells you little (low entropy). If all 4 categories are equally likely, each new ticket is more surprising/informative (high entropy).

Let us start by comparing the entropy of several example probability distributions:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from entropy_lab.measures.entropy import compute_entropy

labels = ["Billing", "Tech", "Access", "Cancel"]

distributions = {
    "Very skewed": np.array([0.85, 0.10, 0.03, 0.02]),
    "Moderately skewed": np.array([0.55, 0.20, 0.15, 0.10]),
    "Almost balanced": np.array([0.30, 0.25, 0.25, 0.20]),
    "Uniform (max entropy)": np.array([0.25, 0.25, 0.25, 0.25])
}

for name, p in distributions.items():
    H = compute_entropy(p)
    print(f"{name:22s} -> H = {H:.4f} bits")

As expected, we can observe that:
1. Entropy increases as the distribution becomes more balanced.
2. The uniform case gives the highest entropy.
   
We can visualize this property by generating a plot that moves gradually from a peaked distribution to a uniform one:

In [None]:
import numpy as np 
import matplotlib.pyplot as plt 

from entropy_lab.measures.entropy import compute_entropy

p_skewed = np.array([0.85, 0.10, 0.03, 0.02])
p_uniform = np.array([0.25, 0.25, 0.25, 0.25])

alphas = np.linspace(0, 1, 101)
entropies = []

for a in alphas:
    # Interpolate between skewed and uniform
    p = (1 - a) * p_skewed + a * p_uniform
    H = compute_entropy(p, base=2.0)
    entropies.append(H)

plt.figure(figsize=(8, 4.5))
plt.plot(alphas, entropies, linewidth=2)
plt.axhline(np.log2(4), linestyle="--", linewidth=1, label="Max = log2(4) = 2 bits")
plt.xlabel("Balance level (0 = skewed, 1 = uniform)")
plt.ylabel("Entropy (bits)")
plt.title("Entropy is maximal when outcomes are equally likely")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### Is the Shannon information content actually a good measure of information? An example from telecom

Consider a simplified communication system that produces a stream of events observed by a receiver. We use three possible outcomes:
* **D** = normal data event (very common)
* **C** = control event (less common)
* **R** = retransmission/error-related event (rare)

Even though each observed event is just a *symbol*, not all symbols carry the same amount of information. As a recap, according to Shannon, the information content of a specific outcome $x$ is


$$h(x)=\log_2\left(\frac{1}{P(x)}\right)$$

So:
* common events carry little information,
* rare events carry more information.

This is also the key intuition behind data compression: frequent symbols should get short codes, while rare symbols can have longer codes. Let us simulate this.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

from entropy_lab.measures.shannon import shannon_information
from entropy_lab.measures.entropy import compute_entropy

# Probabilities of telecom events (example)
# D = data (common), C = control (less common), R = retransmission/error (rare)
symbols = np.array(["D", "C", "R"])
probs = np.array([0.90, 0.07, 0.03])

# Shannon information content for each symbol
info_contents = []
for prob in probs:
    info_contents.append(shannon_information(prob))

for s, p, h in zip(symbols, probs, info_contents):
    print(f"{s}: P={p:.2f}, h(x)={h:.3f} bits")

H = compute_entropy(probs)
print(f"\nEntropy H(X) = {H:.3f} bits")

With the probabilities above:  

* **D** is very common, so it carries **little** information.
* **C** is less common, so it carries **more** information.
* **R** is rare, so it carries **a lot** of information.

This is exactly what we expect in telecom logs:

* seeing another normal data event is not surprising,
* seeing a retransmission/error event is much more surprising and therefore more informative.

The entropy $H(X)$ is the average information per observed event.
It tells us the minimum average number of bits per event needed to encode this stream losslessly. Let us now simulate a stream of received events.

In [None]:
rng = np.random.default_rng(42)  # fixed seed for reproducibility

n = 300
stream = rng.choice(symbols, size=n, p=probs)

# map each symbol to its information content
h_map = {s: h for s, h in zip(symbols, info_contents)}
h_stream = np.array([h_map[s] for s in stream])

# running average information
cum_info = np.cumsum(h_stream)
running_avg = cum_info / np.arange(1, n+1)

print("First 20 events:", "".join(stream[:20]))
print("Final running average:", running_avg[-1])
print("Theoretical entropy H:", H)

Let us now visualize this simulation with a small animation:

In [None]:
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

fig, ax = plt.subplots(figsize=(12, 5))

x = np.arange(1, n + 1)

# Pre-create empty scatter-like artists using line objects for each symbol
(line_D,) = ax.plot([], [], marker='o', linestyle='', label='D events')
(line_C,) = ax.plot([], [], marker='o', linestyle='', label='C events')
(line_R,) = ax.plot([], [], marker='o', linestyle='', label='R events')

(line_avg,) = ax.plot([], [], linewidth=2, label='Running average information')
ax.axhline(H, linestyle='--', linewidth=1.5, label=f'Entropy H(X) = {H:.3f} bits')

ax.set_xlim(1, n)
ax.set_ylim(0, max(info_contents) + 0.7)
ax.set_xlabel("Observed event index")
ax.set_ylabel("Information (bits)")
ax.set_title("Animated telecom event stream: information content over time")
ax.grid(alpha=0.3)
ax.legend(loc="upper right")

text_box = ax.text(
    0.02, 0.98, "",
    transform=ax.transAxes,
    va="top",
    bbox=dict(boxstyle="round", facecolor="white", alpha=0.85)
)

# Boolean masks per symbol, so update() needs no mutable state
# (the same frame can be drawn more than once without duplicating points)
is_D = stream == "D"
is_C = stream == "C"
is_R = stream == "R"

def update(frame):
    i = frame  # 0-based
    s = stream[i]
    h = h_stream[i]
    k = i + 1  # number of events shown so far

    # Show every event of each symbol observed up to and including event i
    line_D.set_data(x[:k][is_D[:k]], h_stream[:k][is_D[:k]])
    line_C.set_data(x[:k][is_C[:k]], h_stream[:k][is_C[:k]])
    line_R.set_data(x[:k][is_R[:k]], h_stream[:k][is_R[:k]])

    # Update running average line
    line_avg.set_data(x[:i+1], running_avg[:i+1])

    # Update text
    text_box.set_text(
        f"Event {i+1}/{n}\n"
        f"Observed symbol: {s}\n"
        f"h(x) = {h:.3f} bits\n"
        f"Running avg = {running_avg[i]:.3f} bits\n"
        f"Entropy H = {H:.3f} bits"
    )

    return line_D, line_C, line_R, line_avg, text_box

anim = FuncAnimation(fig, update, frames=n, interval=80, blit=False)

plt.close(fig)  # avoid duplicate static display in notebooks
HTML(anim.to_jshtml())

Each point is one observed telecom event:

* many points are low (the common **D** events),
* fewer points are higher (**C** events),
* rare points are very high (**R** events).

The running average fluctuates strongly at first, then gradually converges toward the entropy $H(X)$, as the law of large numbers predicts. This is the central intuition:

- **Shannon information content** describes the surprise of a single outcome.
- **Entropy** is the long-run average surprise of the source.
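As a worked check, assuming the example probabilities $P(\mathrm{D})=0.90$, $P(\mathrm{C})=0.07$ and $P(\mathrm{R})=0.03$ used above, the entropy is the probability-weighted average of the individual information contents (values rounded to three decimals):

$$H(X)=\sum_x P(x)\,h(x) \approx 0.90\cdot 0.152 + 0.07\cdot 3.837 + 0.03\cdot 5.059 \approx 0.557\ \text{bits}$$

The rare **R** event carries about 5 bits, but because it occurs only 3% of the time it contributes little to the average.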



This example also explains why entropy is connected to file size. If we store the event stream in a file, a naive encoding might use the same number of bits for every event. But a better encoding uses the probabilities:

- frequent symbol **D** gets a short code,
- less frequent **C** gets a longer code,
- rare **R** gets the longest code.

This reduces the **average number of bits per event**. Shannon's entropy $H(X)$ gives the lower bound on the average number of bits per event needed for lossless compression.
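To make the comparison concrete, the following sketch builds a Huffman code for the three events and compares its average length with a fixed-length code and with the entropy. Huffman coding is one standard way to construct such a variable-length code; it is used here as an illustration and is not taken from `entropy_lab`. The function `huffman_code_lengths` is a hypothetical helper written for this example.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree)
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Every symbol below the merged node gets one more bit
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths


# Same example probabilities as in the telecom stream
event_probs = {"D": 0.90, "C": 0.07, "R": 0.03}

lengths = huffman_code_lengths(event_probs)
avg_len = sum(event_probs[s] * lengths[s] for s in event_probs)
entropy = sum(p * math.log2(1.0 / p) for p in event_probs.values())
fixed_len = math.ceil(math.log2(len(event_probs)))

print("Code lengths:", lengths)
print(f"Fixed-length code: {fixed_len} bits/event")
print(f"Huffman code:      {avg_len:.3f} bits/event")
print(f"Entropy H(X):      {entropy:.3f} bits/event")
```

The Huffman code improves on the fixed-length code but stays above $H(X)$, because a code that assigns a codeword to each single event must spend at least one bit per event. Encoding blocks of several events at once is one way to get closer to the entropy.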

So entropy is not just an abstract quantity: it directly predicts how compressible a random source is.