# Notebook 1: Information Theory Foundations

## Building the Mathematical Groundwork for Understanding Multivariate Information

Welcome to the first notebook in our series on multivariate information theory! In this notebook, we'll build the fundamental concepts from the ground up. Think of this as laying the foundation of a house: we need solid basics before we can construct more complex structures.

### What You'll Learn

By the end of this notebook, you'll understand:
1. What entropy really measures (and why it's not just "randomness")
2. How to compute entropy for discrete and continuous variables
3. Joint entropy and conditional entropy
4. Mutual information as a measure of dependence
5. Conditional mutual information

### Why Start From Scratch?

While we'll eventually use powerful packages like HOI, Frites, and XGI, implementing these concepts ourselves first provides crucial intuition. When the more advanced measures give surprising results later (like negative interaction information!), you'll understand why.

Let's begin!

In [None]:
# Import our foundational libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.special import digamma
import seaborn as sns
from itertools import product
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Make plots beautiful
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("Libraries loaded successfully!")
print(f"NumPy version: {np.__version__}")

## Part 1: Understanding Entropy

### The Intuition Behind Entropy

Entropy measures **uncertainty** or **surprise**. Think about these scenarios:

**Scenario A**: You flip a fair coin. Before observing, you're maximally uncertain about the outcome.

**Scenario B**: You flip a heavily biased coin that lands heads 99% of the time. You're pretty sure it'll be heads.

Scenario A has **higher entropy** because the outcome is more uncertain. This is the core idea: entropy quantifies how much we don't know.

### The Mathematical Definition

For a discrete random variable $X$ with possible values $x_1, x_2, ..., x_n$:

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$

The $\log_2$ means we measure entropy in **bits**. One bit is the information gained from one fair coin flip.

### Why the Negative Sign?

Probabilities lie between 0 and 1, so $\log_2 P(x_i) \leq 0$ and every term $P(x_i) \log_2 P(x_i)$ is non-positive. The leading negative sign makes entropy non-negative, which matches our intuition that "uncertainty" should never be a negative quantity.
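One way to read the formula is as an average of per-outcome "surprise". The sketch below illustrates that reading for the biased coin from Scenario B; the helper name `surprisal` is an illustrative choice, not something defined elsewhere in this notebook.

```python
import numpy as np

def surprisal(p):
    """Surprisal (self-information) of an outcome with probability p, in bits."""
    return -np.log2(p)

# Biased coin from Scenario B: heads 99% of the time
probs = np.array([0.99, 0.01])

per_outcome = surprisal(probs)                  # surprise of each outcome
expected = float(np.sum(probs * per_outcome))   # entropy = probability-weighted average

print(f"Surprisal of heads: {per_outcome[0]:.4f} bits")
print(f"Surprisal of tails: {per_outcome[1]:.4f} bits")
print(f"Expected surprisal (entropy): {expected:.4f} bits")
```

The rare outcome is very surprising, but it happens so seldom that the average surprise stays small.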

In [None]:
def entropy(probabilities):
    """
    Calculate entropy for a discrete probability distribution.
    
    Parameters:
    -----------
    probabilities : array-like
        Probability distribution (must sum to 1)
    
    Returns:
    --------
    H : float
        Entropy in bits
        
    Notes:
    ------
    We handle the case where P(x) = 0 by defining 0 * log(0) = 0,
    which is the correct limiting value.
    """
    probs = np.array(probabilities)
    
    # Verify this is a valid probability distribution
    assert np.abs(probs.sum() - 1.0) < 1e-10, "Probabilities must sum to 1"
    assert np.all(probs >= 0), "Probabilities must be non-negative"
    
    # Remove zero probabilities to avoid log(0)
    probs = probs[probs > 0]
    
    # Calculate entropy
    H = -np.sum(probs * np.log2(probs))
    
    return H

# Let's test our function with the coin examples
fair_coin = [0.5, 0.5]  # 50-50 chance
biased_coin = [0.99, 0.01]  # 99% heads
certain = [1.0, 0.0]  # Always heads

print("Entropy Examples:")
print(f"Fair coin (50-50): {entropy(fair_coin):.4f} bits")
print(f"Biased coin (99-1): {entropy(biased_coin):.4f} bits")
print(f"Certain outcome: {entropy(certain):.4f} bits")
print("\nInterpretation:")
print("- Fair coin: Maximum entropy (1 bit) - maximum uncertainty")
print("- Biased coin: Low entropy - we're pretty sure what will happen")
print("- Certain: Zero entropy - no uncertainty at all!")

### Visualizing Entropy as a Function of Probability

Let's see how entropy changes as we vary the probability of a binary event. This classic curve shows that entropy is **maximized** when outcomes are equally likely.

In [None]:
# Create a range of probabilities for heads
p_heads = np.linspace(0.01, 0.99, 100)
entropies = [entropy([p, 1-p]) for p in p_heads]

# Plot
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
ax.plot(p_heads, entropies, linewidth=3, color='#2E86AB')
ax.axhline(y=1.0, color='red', linestyle='--', alpha=0.5, 
           label='Maximum entropy (1 bit)')
ax.axvline(x=0.5, color='red', linestyle='--', alpha=0.5)
ax.scatter([0.5], [1.0], color='red', s=200, zorder=5, 
           label='Fair coin', marker='*')

# Annotations
ax.annotate('Maximum uncertainty\nat P=0.5', xy=(0.5, 1.0), 
            xytext=(0.65, 0.85),
            arrowprops=dict(arrowstyle='->', color='red', lw=2),
            fontsize=12, fontweight='bold')

ax.set_xlabel('P(Heads)', fontsize=14, fontweight='bold')
ax.set_ylabel('Entropy (bits)', fontsize=14, fontweight='bold')
ax.set_title('Binary Entropy Function: H(p)', fontsize=16, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Key Insight:")
print("Entropy is maximized when we're most uncertain (P=0.5 for binary events)")
print("As we become more certain (P→0 or P→1), entropy decreases to zero")

## Part 2: Joint and Conditional Entropy

### Joint Entropy: Uncertainty About Multiple Variables

When we have two random variables $X$ and $Y$, their **joint entropy** $H(X, Y)$ measures the total uncertainty about both variables together:

$$H(X,Y) = -\sum_{i,j} P(x_i, y_j) \log_2 P(x_i, y_j)$$

### Conditional Entropy: Uncertainty After Learning Something

**Conditional entropy** $H(X|Y)$ measures how much uncertainty remains about $X$ after we learn $Y$:

$$H(X|Y) = -\sum_{i,j} P(x_i, y_j) \log_2 P(x_i | y_j)$$

Or equivalently:
$$H(X|Y) = H(X,Y) - H(Y)$$

This is the **chain rule** of entropy!
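As a quick sanity check, the two expressions for $H(X|Y)$ can be compared numerically. This is a minimal sketch on a made-up 2x2 joint distribution (the numbers are assumptions chosen only so that every cell is non-zero).

```python
import numpy as np

# Hypothetical joint distribution P(X=i, Y=j); rows index X, columns index Y
joint = np.array([[0.4, 0.1],
                  [0.2, 0.3]])

p_y = joint.sum(axis=0)        # marginal P(Y)
p_x_given_y = joint / p_y      # P(X=i | Y=j); each column sums to 1

# Direct definition: H(X|Y) = -sum_ij P(x_i, y_j) log2 P(x_i | y_j)
h_direct = -np.sum(joint * np.log2(p_x_given_y))

# Chain rule: H(X|Y) = H(X,Y) - H(Y)
h_joint = -np.sum(joint * np.log2(joint))
h_y = -np.sum(p_y * np.log2(p_y))
h_chain = h_joint - h_y

print(f"Direct definition: {h_direct:.6f} bits")
print(f"Chain rule:        {h_chain:.6f} bits")
```

Both routes should agree up to floating-point error, which is why the chain-rule form is convenient in code: it only needs entropies of marginals.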

### A Concrete Example: The Beer Price Problem

Let's implement the example from your slides: predicting beer prices based on volume.

In [None]:
def joint_entropy(joint_probs):
    """
    Calculate joint entropy H(X,Y) from a joint probability distribution.
    
    Parameters:
    -----------
    joint_probs : 2D array
        Joint probability matrix P(X=i, Y=j)
        
    Returns:
    --------
    H : float
        Joint entropy in bits
    """
    jp = np.array(joint_probs)
    
    # Verify it's a valid joint distribution
    assert np.abs(jp.sum() - 1.0) < 1e-10, "Joint probabilities must sum to 1"
    
    # Remove zeros and calculate
    jp_nonzero = jp[jp > 0]
    return -np.sum(jp_nonzero * np.log2(jp_nonzero))


def conditional_entropy(joint_probs):
    """
    Calculate conditional entropy H(X|Y) from joint distribution.
    
    We use: H(X|Y) = H(X,Y) - H(Y)
    
    Parameters:
    -----------
    joint_probs : 2D array
        Joint probability matrix P(X=i, Y=j)
        Rows correspond to X, columns to Y
        
    Returns:
    --------
    H_X_given_Y : float
        Conditional entropy H(X|Y) in bits
    """
    jp = np.array(joint_probs)
    
    # Calculate H(X,Y)
    H_XY = joint_entropy(jp)
    
    # Calculate H(Y) from marginal
    marginal_Y = jp.sum(axis=0)  # Sum over rows to get P(Y)
    H_Y = entropy(marginal_Y)
    
    # Apply chain rule
    return H_XY - H_Y


# Example: Beer prices (X) and volumes (Y)
# Let's create a realistic joint distribution
# Rows = price (low, medium, high), Columns = volume (small, large)

beer_joint = np.array([
    # Small (0.25L)  Large (0.5L)
    [0.20,          0.05],  # Low price (3 euros)
    [0.35,          0.15],  # Medium price (5 euros)  
    [0.05,          0.20],  # High price (7 euros)
])

print("Beer Price Example")
print("=" * 50)
print("\nJoint Distribution P(Price, Volume):")
print("                Small(0.25L)  Large(0.5L)")
print(f"Low (3€):       {beer_joint[0,0]:.2f}         {beer_joint[0,1]:.2f}")
print(f"Medium (5€):    {beer_joint[1,0]:.2f}         {beer_joint[1,1]:.2f}")
print(f"High (7€):      {beer_joint[2,0]:.2f}         {beer_joint[2,1]:.2f}")

# Calculate entropies
marginal_price = beer_joint.sum(axis=1)  # P(Price)
marginal_volume = beer_joint.sum(axis=0)  # P(Volume)

H_price = entropy(marginal_price)
H_volume = entropy(marginal_volume)
H_joint = joint_entropy(beer_joint)
H_price_given_volume = conditional_entropy(beer_joint)

print("\n" + "=" * 50)
print("Entropy Calculations:")
print("=" * 50)
print(f"H(Price) = {H_price:.4f} bits")
print(f"H(Volume) = {H_volume:.4f} bits")
print(f"H(Price, Volume) = {H_joint:.4f} bits")
print(f"H(Price|Volume) = {H_price_given_volume:.4f} bits")
print("\nInterpretation:")
print(f"- Before knowing volume: {H_price:.4f} bits of uncertainty about price")
print(f"- After knowing volume: {H_price_given_volume:.4f} bits remain")
print(f"- Uncertainty reduced by {H_price - H_price_given_volume:.4f} bits")

## Part 3: Mutual Information

### The Key Measure of Dependence

**Mutual Information (MI)** quantifies how much knowing one variable reduces uncertainty about another:

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

The symmetry $(I(X;Y) = I(Y;X))$ is beautiful: the information $X$ provides about $Y$ equals the information $Y$ provides about $X$.

### Alternative Formulation: KL Divergence

MI can also be written as:

$$I(X;Y) = \sum_{i,j} P(x_i, y_j) \log_2 \frac{P(x_i, y_j)}{P(x_i)P(y_j)}$$

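This is the Kullback-Leibler (KL) divergence between the joint distribution $P(X,Y)$ and the product of the marginals $P(X)P(Y)$, so MI measures how far the joint distribution is from independence. If $X$ and $Y$ are independent, then $P(X,Y) = P(X)P(Y)$, every log term vanishes, and $I(X;Y) = 0$.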
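The sum over $\log_2$ ratios can be computed directly. The following is a small sketch, with `mi_kl` as an illustrative name and two toy joint distributions chosen as assumptions for the demonstration.

```python
import numpy as np

def mi_kl(joint):
    """I(X;Y) in bits, computed as sum P(x,y) log2[P(x,y) / (P(x)P(y))]."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)   # column vector P(X)
    p_y = joint.sum(axis=0, keepdims=True)   # row vector P(Y)
    indep = p_x * p_y                        # product of marginals
    mask = joint > 0                         # treat 0 * log(0) as 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
independent = np.outer([0.25, 0.75], [0.4, 0.6])

print(f"Dependent joint:   {mi_kl(dependent):.4f} bits")
print(f"Independent joint: {mi_kl(independent):.4f} bits")
```

For the independent joint every ratio inside the logarithm equals 1, so the sum collapses to zero up to rounding.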

### Key Properties

1. **Non-negative**: $I(X;Y) \geq 0$ always
2. **Zero iff independent**: $I(X;Y) = 0 \iff X \perp Y$
3. **Bounded**: $I(X;Y) \leq \min(H(X), H(Y))$
4. **Symmetric**: $I(X;Y) = I(Y;X)$
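These properties can be spot-checked numerically. The sketch below draws random 3x4 joint distributions (an arbitrary choice of shape and seed) and asserts non-negativity, symmetry, and the upper bound; the helper names `H` and `mi` are local to this example. A numerical check like this is not a proof, only a sanity test.

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mi(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y); rows index X, columns index Y."""
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint)

rng = np.random.default_rng(0)
tol = 1e-9   # allowance for floating-point rounding

for _ in range(1000):
    joint = rng.random((3, 4))
    joint /= joint.sum()                      # normalize to a valid distribution
    i_xy = mi(joint)
    h_x, h_y = H(joint.sum(axis=1)), H(joint.sum(axis=0))
    assert i_xy >= -tol                       # 1. non-negative
    assert i_xy <= min(h_x, h_y) + tol        # 3. bounded
    assert abs(i_xy - mi(joint.T)) < tol      # 4. symmetric

# 2. zero for an independent joint (outer product of two marginals)
indep = np.outer([0.2, 0.3, 0.5], [0.1, 0.2, 0.3, 0.4])
print(f"Independent joint: I(X;Y) = {mi(indep):.2e} bits")
print("All property checks passed on 1000 random joints.")
```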

In [None]:
def mutual_information(joint_probs):
    """
    Calculate mutual information I(X;Y) from joint distribution.
    
    Uses: I(X;Y) = H(X) + H(Y) - H(X,Y)
    
    Parameters:
    -----------
    joint_probs : 2D array
        Joint probability matrix P(X=i, Y=j)
        
    Returns:
    --------
    MI : float
        Mutual information in bits
    """
    jp = np.array(joint_probs)
    
    # Calculate marginals
    marginal_X = jp.sum(axis=1)
    marginal_Y = jp.sum(axis=0)
    
    # Calculate individual and joint entropies
    H_X = entropy(marginal_X)
    H_Y = entropy(marginal_Y)
    H_XY = joint_entropy(jp)
    
    # MI = H(X) + H(Y) - H(X,Y)
    return H_X + H_Y - H_XY


# Calculate MI for our beer example
MI_beer = mutual_information(beer_joint)

print("Mutual Information Analysis")
print("=" * 50)
print(f"\nI(Price; Volume) = {MI_beer:.4f} bits")
print("\nVerification (should match):")
print(f"H(Price) - H(Price|Volume) = {H_price - H_price_given_volume:.4f} bits ✓")
print("\nInterpretation:")
print(f"Knowing the volume reduces price uncertainty by {MI_beer:.4f} bits")
print(f"Equivalently, knowing price reduces volume uncertainty by {MI_beer:.4f} bits")

# Let's also create examples of different dependency strengths
print("\n" + "=" * 50)
print("Comparing Different Dependency Strengths")
print("=" * 50)

# Perfect independence
independent = np.outer([0.25, 0.75], [0.4, 0.6])
MI_indep = mutual_information(independent)

# Perfect dependence (deterministic)
deterministic = np.array([
    [0.5, 0.0],
    [0.0, 0.5]
])
MI_det = mutual_information(deterministic)

print(f"\n1. Independent variables: I(X;Y) = {MI_indep:.6f} bits ≈ 0")
print(f"2. Our beer example: I(X;Y) = {MI_beer:.4f} bits (moderate dependence)")
print(f"3. Deterministic relation: I(X;Y) = {MI_det:.4f} bit (perfect dependence)")

### Visualizing the Venn Diagram of Information

The classic "information diagram" helps visualize the relationship between entropy and mutual information. This works perfectly for two variables (but we'll see later why it breaks for three!).

In [None]:
from matplotlib.patches import Circle  # the only patch type used below

def plot_information_venn(H_X, H_Y, MI):
    """
    Create a Venn diagram representation of information measures.
    """
    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    
    # Calculate regions
    H_X_given_Y = H_X - MI
    H_Y_given_X = H_Y - MI
    
    # Draw circles
    circle1 = Circle((0.3, 0.5), 0.25, alpha=0.5, color='#E63946', label=f'H(X)={H_X:.3f}')
    circle2 = Circle((0.7, 0.5), 0.25, alpha=0.5, color='#457B9D', label=f'H(Y)={H_Y:.3f}')
    ax.add_patch(circle1)
    ax.add_patch(circle2)
    
    # Add text labels
    ax.text(0.15, 0.5, f'H(X|Y)\n{H_X_given_Y:.3f}', 
            fontsize=11, ha='center', va='center', fontweight='bold')
    ax.text(0.5, 0.5, f'I(X;Y)\n{MI:.3f}', 
            fontsize=12, ha='center', va='center', fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))
    ax.text(0.85, 0.5, f'H(Y|X)\n{H_Y_given_X:.3f}', 
            fontsize=11, ha='center', va='center', fontweight='bold')
    
    # Add title and labels
    ax.set_xlim(0, 1)
    ax.set_ylim(0.1, 0.9)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title('Information Venn Diagram', fontsize=16, fontweight='bold', pad=20)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), ncol=2, fontsize=12)
    
    # Add equations at bottom
    eq_text = (
        f"H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X)\n"
        f"       = {H_X_given_Y:.3f} + {MI:.3f} + {H_Y_given_X:.3f} = {H_X_given_Y + MI + H_Y_given_X:.3f} bits"
    )
    fig.text(0.5, 0.05, eq_text, ha='center', fontsize=11,
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    plt.tight_layout()
    return fig, ax

# Plot for our beer example
fig, ax = plot_information_venn(H_price, H_volume, MI_beer)
plt.show()

print("\nKey Insight:")
print("The overlapping region represents shared information (mutual information).")
print("The non-overlapping regions are information unique to each variable.")
print("\n⚠️  IMPORTANT: This diagram works for 2 variables, but breaks for 3+!")
print("    We'll see why in the next notebook when we encounter negative information.")

## Part 4: Conditional Mutual Information

### Beyond Pairwise Dependencies

**Conditional Mutual Information (CMI)** measures the information shared between $X$ and $Y$ after accounting for a third variable $Z$:

$$I(X;Y|Z) = H(X|Z) - H(X|Y,Z)$$

Equivalently:
$$I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z)$$
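Expanding each conditional entropy with the chain rule gives $I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)$, which only needs entropies of marginals. Here is a minimal sketch of that form; `cmi_from_entropies` and the toy distribution are assumptions made for illustration.

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def cmi_from_entropies(p_xyz):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z); axes ordered (X, Y, Z)."""
    p_xyz = np.asarray(p_xyz, dtype=float)
    h_xz = H(p_xyz.sum(axis=1))          # marginalize out Y
    h_yz = H(p_xyz.sum(axis=0))          # marginalize out X
    h_z = H(p_xyz.sum(axis=(0, 1)))      # marginalize out X and Y
    return h_xz + h_yz - H(p_xyz) - h_z

# Toy case: X is a fair bit, Y copies X, and Z is an independent fair bit
p = np.zeros((2, 2, 2))
for x in range(2):
    for z in range(2):
        p[x, x, z] = 0.25

cmi = cmi_from_entropies(p)
print(f"I(X;Y|Z) = {cmi:.4f} bits")
```

Because $Z$ is independent of the pair and $Y$ is a copy of $X$, conditioning on $Z$ changes nothing and the result equals $H(X)$.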

### Why This Matters

CMI is crucial for understanding **multivariate** dependencies. It answers: "After I know $Z$, how much does learning $Y$ still help me predict $X$?"

### A Surprising Property: CMI Can Increase!

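Conditioning can never increase entropy: $H(X|Z) \leq H(X)$. Mutual information behaves differently. Conditioning on a third variable can decrease it or actually **increase** it, so $I(X;Y|Z)$ may exceed $I(X;Y)$. One mechanism behind this is **explaining away**, which we'll explore in the next notebook.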

For now, let's implement CMI:

In [None]:
def conditional_mutual_information(joint_probs_XYZ):
    """
    Calculate conditional mutual information I(X;Y|Z).
    
    Parameters:
    -----------
    joint_probs_XYZ : 3D array
        Joint probability P(X, Y, Z)
        Shape: (n_X, n_Y, n_Z)
        
    Returns:
    --------
    CMI : float
        Conditional mutual information I(X;Y|Z) in bits
        
    Formula:
    --------
    I(X;Y|Z) = sum_z P(z) * I(X;Y|Z=z)
             = H(X|Z) + H(Y|Z) - H(X,Y|Z)
    """
    jp_XYZ = np.array(joint_probs_XYZ)
    n_X, n_Y, n_Z = jp_XYZ.shape
    
    # Calculate marginal P(Z)
    P_Z = jp_XYZ.sum(axis=(0, 1))  # Sum over X and Y
    
    # Initialize CMI
    CMI = 0.0
    
    # For each value of Z, calculate I(X;Y|Z=z) and weight by P(z)
    for z in range(n_Z):
        if P_Z[z] > 0:
            # Conditional joint distribution P(X,Y|Z=z)
            P_XY_given_z = jp_XYZ[:, :, z] / P_Z[z]
            
            # Calculate MI for this conditioning value
            MI_given_z = mutual_information(P_XY_given_z)
            
            # Weight by P(z)
            CMI += P_Z[z] * MI_given_z
    
    return CMI


# Example: Coffee shop scenario
# X = Weather (Rainy, Sunny)
# Y = Coffee sales (Low, High) 
# Z = Day type (Weekday, Weekend)

# Create a realistic 3D joint distribution
coffee_joint = np.zeros((2, 2, 2))  # (weather, sales, day_type)

# Weekday (Z=0): Weather affects sales more
coffee_joint[:, :, 0] = np.array([
    # Low sales  High sales
    [0.15,       0.05],  # Rainy
    [0.05,       0.15],  # Sunny
])

# Weekend (Z=1): Sales mostly high, independent of weather
coffee_joint[:, :, 1] = np.array([
    # Low sales  High sales
    [0.05,       0.25],  # Rainy
    [0.05,       0.25],  # Sunny
])

# Calculate various MIs
marginal_XY = coffee_joint.sum(axis=2)  # P(Weather, Sales)
MI_weather_sales = mutual_information(marginal_XY)
CMI_weather_sales_given_day = conditional_mutual_information(coffee_joint)

print("Coffee Shop Example: Weather → Sales")
print("=" * 50)
print(f"\nI(Weather; Sales) = {MI_weather_sales:.4f} bits")
print(f"I(Weather; Sales | DayType) = {CMI_weather_sales_given_day:.4f} bits")
print("\nInterpretation:")
print("- Overall: Weather and sales share some information")
print("- After knowing if it's a weekday/weekend:")
if CMI_weather_sales_given_day > MI_weather_sales:
    print("  The relationship is STRONGER (pooling across day types masked it)")
elif CMI_weather_sales_given_day < MI_weather_sales:
    print("  The relationship is WEAKER (day type explains some dependence)")
else:
    print("  The relationship is UNCHANGED (day type is irrelevant)")

## Part 5: Estimating Information from Data

### From Theory to Practice

So far we've worked with **known** probability distributions. In real research, we have **data samples** and need to **estimate** these distributions.

### Three Common Approaches

1. **Histogram/Binning**: Discretize continuous data into bins
2. **Kernel Density Estimation (KDE)**: Smooth density estimation
3. **k-Nearest Neighbors (KNN)**: Distance-based estimation

Each has trade-offs in bias, variance, and computational cost. Let's implement the simplest approach (binning) and understand its limitations.
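For comparison with binning, approach 2 (KDE) can be sketched with `scipy.stats.gaussian_kde`. This is only an illustration under stated assumptions: it uses the default bandwidth, a simple resubstitution average, and the function name `estimate_entropy_kde` is hypothetical. Note that it estimates *differential* entropy, which is not directly comparable to the discrete entropy of binned data.

```python
import numpy as np
from scipy.stats import gaussian_kde

def estimate_entropy_kde(samples):
    """Resubstitution estimate of differential entropy, in bits."""
    samples = np.asarray(samples, dtype=float)
    kde = gaussian_kde(samples)            # Gaussian kernels, default bandwidth
    log_density = kde.logpdf(samples)      # natural log of estimated density at each sample
    return float(-np.mean(log_density) / np.log(2))   # convert nats to bits

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)

h_est = estimate_entropy_kde(x)
h_true = 0.5 * np.log2(2 * np.pi * np.e)   # differential entropy of N(0, 1)

print(f"KDE estimate:  {h_est:.4f} bits")
print(f"Gaussian true: {h_true:.4f} bits")
```

The estimate depends on the bandwidth in much the same way the histogram estimate depends on the number of bins.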

In [None]:
def estimate_entropy_binned(data, n_bins=10):
    """
    Estimate entropy from continuous data using binning.
    
    Parameters:
    -----------
    data : array-like
        1D array of samples
    n_bins : int
        Number of bins for histogram
        
    Returns:
    --------
    H : float
        Estimated entropy in bits
        
    Warning:
    --------
    This is a biased estimator! The choice of n_bins matters.
    """
    # Create histogram
    counts, _ = np.histogram(data, bins=n_bins)
    
    # Convert to probabilities
    probs = counts / counts.sum()
    
    # Remove zero bins
    probs = probs[probs > 0]
    
    return entropy(probs)


def estimate_mi_binned(X, Y, n_bins=10):
    """
    Estimate mutual information from samples using 2D histogram.
    """
    # Create 2D histogram for joint distribution
    joint_counts, _, _ = np.histogram2d(X, Y, bins=n_bins)
    
    # Convert to probabilities
    joint_probs = joint_counts / joint_counts.sum()
    
    return mutual_information(joint_probs)


# Generate some sample data with known MI
n_samples = 5000

# Create correlated Gaussian variables
rho = 0.7  # Correlation coefficient
mean = [0, 0]
cov = [[1, rho], [rho, 1]]
data = np.random.multivariate_normal(mean, cov, n_samples)
X_data = data[:, 0]
Y_data = data[:, 1]

# For bivariate Gaussians, MI has an exact formula: I(X;Y) = -0.5 * log2(1 - rho²) bits
true_MI = -0.5 * np.log2(1 - rho**2)

print("Estimating MI from Data")
print("=" * 50)
print(f"\nTrue MI (Gaussian formula): {true_MI:.4f} bits")
print("\nEstimated MI with different bin sizes:")

for n_bins in [5, 10, 20, 30, 50]:
    est_MI = estimate_mi_binned(X_data, Y_data, n_bins)
    error = abs(est_MI - true_MI)
    print(f"  {n_bins:2d} bins: {est_MI:.4f} bits (error: {error:.4f})")

print("\n⚠️  Key Lesson:")
print("Bin count matters! Too few bins → underestimate; too many → overestimate from sparse bins")
print("This is why advanced packages (like Frites and HOI) use better methods")

### Visualizing the Bias-Variance Trade-off

Let's see how our MI estimate varies with bin size and sample size:

In [None]:
# Test different sample sizes and bin numbers
sample_sizes = [100, 500, 1000, 5000, 10000]
bin_numbers = np.arange(5, 51, 5)

# Store results
results = np.zeros((len(sample_sizes), len(bin_numbers)))

for i, n_samples in enumerate(sample_sizes):
    # Generate data
    data = np.random.multivariate_normal(mean, cov, n_samples)
    X = data[:, 0]
    Y = data[:, 1]
    
    for j, n_bins in enumerate(bin_numbers):
        results[i, j] = estimate_mi_binned(X, Y, n_bins)

# Plot
fig, ax = plt.subplots(1, 1, figsize=(12, 7))

for i, n_samples in enumerate(sample_sizes):
    ax.plot(bin_numbers, results[i, :], marker='o', linewidth=2,
            label=f'N = {n_samples}', alpha=0.7)

ax.axhline(y=true_MI, color='black', linestyle='--', linewidth=2.5,
           label=f'True MI = {true_MI:.4f} bits')

ax.set_xlabel('Number of Bins', fontsize=14, fontweight='bold')
ax.set_ylabel('Estimated MI (bits)', fontsize=14, fontweight='bold')
ax.set_title('MI Estimation: Effect of Sample Size and Binning', 
             fontsize=16, fontweight='bold')
ax.legend(fontsize=11, loc='best')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nObservations:")
print("1. Small sample sizes → high variance (estimates jump around)")
print("2. Too few bins → systematic underestimation (bias)")
print("3. Too many bins with small samples → overestimation (sparse bins)")
print("4. More data allows more bins and more accurate estimates")

## Part 6: Summary and Bridge to Advanced Methods

### What We've Built

Congratulations! You now understand:

1. **Entropy**: Measures uncertainty/surprise
2. **Joint and Conditional Entropy**: Uncertainty about multiple variables
3. **Mutual Information**: How much variables tell us about each other
4. **Conditional MI**: Information shared after accounting for a third variable
5. **Estimation Challenges**: Why we need sophisticated methods

### The Foundations Are Set

These concepts form the bedrock of multivariate information theory. In the next notebooks, we'll:

- **Notebook 2**: See why pairwise measures fail (the XOR problem)
- **Notebook 3**: Learn about synergy, redundancy, and interaction information using **HOI**
- **Notebook 4**: Apply these to real neural time series with **Frites**
- **Notebook 5**: Explore higher-order network structures with **XGI**
- **Notebook 6**: Integrate everything for advanced analyses

### Practice Exercises

Before moving on, try these to solidify your understanding:

In [None]:
print("PRACTICE EXERCISES")
print("=" * 70)
print("\n1. Calculate the entropy of a fair 6-sided die.")
print("   Hint: Each outcome has probability 1/6")
fair_die = np.ones(6) / 6
H_die = entropy(fair_die)
print(f"   Answer: {H_die:.4f} bits")
print("   Compare to a fair coin (1 bit): one die roll carries as much information")
print(f"   as {H_die:.2f} fair coin flips!\n")

print("2. Why is H(X,Y) ≤ H(X) + H(Y)?")
print("   This is called subadditivity. Can you explain why?")
print("   Hint: When are they equal? When are they most different?\n")

print("3. Create two variables X and Y where I(X;Y) = H(X).")
print("   What does this mean about the relationship between X and Y?")
# Example: Perfect dependence
perfect_dep = np.array([[0.3, 0.0], [0.0, 0.7]])
MI_perfect = mutual_information(perfect_dep)
H_X_perfect = entropy(perfect_dep.sum(axis=1))
print(f"   Example MI: {MI_perfect:.4f}, H(X): {H_X_perfect:.4f}")
print("   This means: Y completely determines X, i.e. H(X|Y) = 0!\n")

print("4. Generate two independent random variables and verify I(X;Y) ≈ 0.")
X_indep = np.random.randn(1000)
Y_indep = np.random.randn(1000)
MI_indep_est = estimate_mi_binned(X_indep, Y_indep, n_bins=20)
print(f"   Estimated MI: {MI_indep_est:.6f} bits (true value is 0)")
print("   The positive estimate is finite-sample bias: 1000 samples spread over 20x20 bins!")

print("\n" + "=" * 70)
print("Ready for Notebook 2: The XOR Problem!")
print("There we'll see why everything breaks with 3+ variables...")
print("=" * 70)

## Additional Resources

### Recommended Reading

1. **Cover & Thomas** - "Elements of Information Theory" (Chapters 2-3)
2. **MacKay** - "Information Theory, Inference, and Learning Algorithms" (Chapter 2)
3. **Ince et al. 2017** - "A statistical framework for neuroimaging data analysis based on mutual information estimated via a Gaussian copula"

### Going Deeper

- For continuous variables: differential entropy
- For high-dimensional data: dimensionality reduction before MI estimation  
- For time series: transfer entropy and Granger causality
- For neural data: Gaussian Copula MI (what Frites uses!)

### Coming Up

In the next notebook, we'll confront the **XOR problem**, the moment when your intuition about information will be completely challenged. You'll see that two variables can individually carry **zero** information about a target, yet together carry **complete** information. This is the gateway to understanding synergy, redundancy, and higher-order interactions.

See you there! 🚀