# Mutual Information

- StatQuest Video: https://youtu.be/eJIp_mgVLwE

Mutual Information (MI) is a measure from information theory that quantifies the dependency between two random variables. In data science, it's a powerful tool for **feature selection** because it tells us how much information a feature provides about the **target variable** we want to predict.

**Main Idea:** Mutual Information is a numeric value that gives us a sense of how closely related two variables are.
*   **A value of 0** means the variables are independent; knowing one tells you nothing about the other.
*   **A higher value** means the variables are more related; the changes in one variable correspond to predictable changes in the other.
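
To make the two bullet points concrete, here is a minimal sketch that scores two toy datasets: one where the variables are independent and one where they are identical. The helper `mi_from_pairs` and both toy datasets are hypothetical, introduced only for illustration, and the natural logarithm is assumed.

```python
from collections import Counter
from math import log

def mi_from_pairs(pairs):
    """Mutual information (natural log) of a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)                    # counts of each (x, y) combination
    count_x = Counter(x for x, _ in pairs)    # counts of each x value
    count_y = Counter(y for _, y in pairs)    # counts of each y value
    return sum(
        (c / n) * log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in joint.items()
    )

# Every combination is equally likely, so knowing x says nothing about y.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
# y always equals x, so knowing x determines y completely.
identical = [(0, 0), (0, 0), (1, 1), (1, 1)]

mi_independent = mi_from_pairs(independent)
mi_identical = mi_from_pairs(identical)
print(f"independent: {mi_independent:.4f}")
print(f"identical:   {mi_identical:.4f}")
```

The independent pair scores 0, while the identical pair scores ln 2 (about 0.6931), the largest value possible for two balanced binary variables.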

### The Mutual Information Formula

The mathematical formula for Mutual Information, as presented in the video, is:

$$
I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \left[ \frac{p(x, y)}{p(x)p(y)} \right]
$$

The Mutual Information formula is built on two fundamental concepts from probability:

1.  **Joint Probability:** This is the probability of two events happening at the same time.
    *   **Video Terminology:** `p(x, y)`
    *   **Example:** The probability that someone both **Likes Popcorn** (`Yes`) *and* **Loves Troll 2** (`Yes`).

2.  **Marginal Probability:** This is the probability of a single event occurring, regardless of the other events.
    *   **Video Terminology:** `p(x)` or `p(y)`
    *   **Example:** The overall probability that someone **Loves Troll 2** (`Yes`), without considering whether they like popcorn or not.
    *   **Why "Marginal"?** In a probability table, these values are the totals found in the *margins* (the sum of a row or column).

In [22]:
import pandas as pd

data = {
    'Likes Popcorn': ['Yes', 'Yes', 'Yes', 'No', 'No'],
    'Height (m)': [1.77, 1.32, 1.81, 1.56, 1.64],
    'Loves Troll 2': ['Yes', 'Yes', 'Yes', 'No', 'Yes'] # Target variable
}
data = pd.DataFrame(data)

data

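|     | Likes Popcorn | Height (m) | Loves Troll 2 |
| --: | :-----------: | :--------: | :-----------: |
| 0   |      Yes      |    1.77    |      Yes      |
| 1   |      Yes      |    1.32    |      Yes      |
| 2   |      Yes      |    1.81    |      Yes      |
| 3   |      No       |    1.56    |      No       |
| 4   |      No       |    1.64    |      Yes      |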


## Calculating Mutual Information

Before calculating the MI score, we first organize the probabilities from our dataset into a table. The four inner cells contain the **Joint Probabilities**, and the totals in the margins contain the **Marginal Probabilities**.

|                            | **Likes Popcorn: Yes** | **Likes Popcorn: No** | **Row Totals (Marginal P(Troll 2))** |
| :------------------------- | :--------------------: | :-------------------: | :--------------------------------: |
| **Loves Troll 2: Yes**     |          3/5           |          1/5          |                **4/5**                 |
| **Loves Troll 2: No**      |          0/5           |          1/5          |                **1/5**                 |
| **Column Totals (Marginal P(Popcorn))**|          **3/5**           |          **2/5**          |                  1                   |
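
The same table can be produced directly from the data rather than counted by hand. The sketch below uses `pd.crosstab` with `normalize='all'` and `margins=True`. The variable names `df` and `prob_table` are illustrative, and the DataFrame is rebuilt here so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Likes Popcorn': ['Yes', 'Yes', 'Yes', 'No', 'No'],
    'Loves Troll 2': ['Yes', 'Yes', 'Yes', 'No', 'Yes'],
})

# normalize='all' divides every count by the number of rows (joint probabilities);
# margins=True appends an 'All' row and column holding the marginal probabilities.
prob_table = pd.crosstab(
    index=df['Loves Troll 2'],
    columns=df['Likes Popcorn'],
    normalize='all',
    margins=True,
)
print(prob_table)
```

The inner cells of `prob_table` are the joint probabilities, and the `All` row and column are the marginals. pandas sorts the labels alphabetically, so `No` appears before `Yes`.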

The video walks through a clear example using the variables `Likes Popcorn` and `Loves Troll 2`.

1.  **Create a Probability Table:**
    *   Set up a table like the one above. The values in this table are all you need for the formula.

2.  **Plug into the Formula:**
    The Mutual Information formula sums a value for every possible combination of outcomes (i.e., for each of the four inner cells in the table). For each term in the summation:
    *   `p(x, y)` is the **joint probability** from an inner cell.
    *   `p(x)p(y)` is the product of the two corresponding **marginal probabilities** from the margins.
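    *   The ratio `p(x, y) / (p(x)p(y))` compares the observed joint probability with the value it would take if the two variables were independent. Under independence `p(x, y) = p(x)p(y)`, so the ratio is 1, its log is 0, and the cell contributes nothing to the sum.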
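
As a check on the steps above, the table values can be plugged in by hand. This sketch assumes the natural logarithm (matching the `np.log` call in the notebook's code) and the usual convention that a cell with zero joint probability contributes nothing to the sum. The terms are listed in the order (Popcorn Yes, Troll 2 Yes), (Popcorn No, Troll 2 Yes), (Popcorn No, Troll 2 No), (Popcorn Yes, Troll 2 No):

$$
\begin{aligned}
I &= \tfrac{3}{5}\ln\frac{3/5}{(3/5)(4/5)} + \tfrac{1}{5}\ln\frac{1/5}{(2/5)(4/5)} + \tfrac{1}{5}\ln\frac{1/5}{(2/5)(1/5)} + 0 \\
  &= \tfrac{3}{5}\ln 1.25 + \tfrac{1}{5}\ln 0.625 + \tfrac{1}{5}\ln 2.5 \\
  &= \tfrac{3}{5}\ln 1.25 + \tfrac{1}{5}\ln\left(1.25^2\right) = \ln 1.25 \approx 0.2231
\end{aligned}
$$

The second term is negative because that cell is less likely than independence would predict, but the total is never negative.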


In [23]:
import numpy as np

def mutual_information(x, y):
    """
    Calculates the Mutual Information between two discrete variables.

    This function follows the methodology from the StatQuest video:
    1. It identifies all unique value combinations.
    2. It calculates joint and marginal probabilities for each combination.
    3. It sums the terms of the MI formula: Σ p(x,y) * log( p(x,y) / (p(x)*p(y)) )

    Args:
        x (pd.Series): The first variable (e.g., a column from a DataFrame).
        y (pd.Series): The second variable.

    Returns:
        float: The Mutual Information score.
    """

    n = len(x)

    x_values = x.unique()
    y_values = y.unique()
    
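    mi_score = 0.0

    for y_val in y_values:
        for x_val in x_values:

            p_x = (x == x_val).sum() / n                      # marginal probability of x_val
            p_y = (y == y_val).sum() / n                      # marginal probability of y_val
            p_xy = ((x == x_val) & (y == y_val)).sum() / n    # joint probability of the pair

            # Skip empty cells: 0 * log(0) is treated as 0 by convention.
            if p_xy > 0:
                # np.log is the natural log, so the score is in nats.
                mi_score += p_xy * np.log(p_xy / (p_x * p_y))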

    return mi_score

In [24]:
mi_score_scratch = mutual_information(data['Likes Popcorn'], data['Loves Troll 2'])

# 0.22, same as what's shown in the video
print(f"The Mutual Information between 'Likes Popcorn' and 'Loves Troll 2' is: {mi_score_scratch:.4f}")

The Mutual Information between 'Likes Popcorn' and 'Loves Troll 2' is: 0.2231


### Handling Continuous Variables

Mutual Information isn't limited to discrete (categorical) variables.

*   **Problem:** A variable like `Height` is continuous and doesn't have simple categories like "Yes" or "No".
*   **Solution:** **Discretize the continuous variable.** This means converting it into a set of bins or categories. For example, you can create a histogram for `Height` and treat each bin as a discrete category (e.g., "Bin #1", "Bin #2", etc.).
    *   Once the continuous variable is binned, you can calculate the joint and marginal probabilities just as you would with two discrete variables.

In [25]:
n_bins = 3
data['Height Binned'] = pd.cut(
    x=data['Height (m)'],       # The data to bin
    bins=n_bins,                # The number of bins to create
    labels=False,               # IMPORTANT: This gives us integer labels (0, 1, 2)
    include_lowest=True         # Ensures the smallest value is included in the first bin
)

data['Height Binned']

0    2
1    0
2    2
3    1
4    1
Name: Height Binned, dtype: int64

In [26]:
mi_score_scratch = mutual_information(data['Height Binned'], data['Loves Troll 2'])

# 0.22, same as what's shown in the video
print(f"The Mutual Information between 'Height (m)' and 'Loves Troll 2' is: {mi_score_scratch:.4f}")

The Mutual Information between 'Height (m)' and 'Loves Troll 2' is: 0.2231
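
The binned score depends on how many bins are chosen. The sketch below repeats the calculation for 2, 3, and 5 equal-width bins. The helper `mi_discrete` is a hypothetical, vectorized restatement of the same formula (natural log assumed), and the data is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

def mi_discrete(x, y):
    """Mutual information (natural log) between two discrete pandas Series."""
    joint = pd.crosstab(x, y, normalize='all').to_numpy()   # joint probabilities
    p_x = joint.sum(axis=1, keepdims=True)                  # row marginals, shape (r, 1)
    p_y = joint.sum(axis=0, keepdims=True)                  # column marginals, shape (1, c)
    nonzero = joint > 0                                     # empty cells contribute 0
    return float((joint[nonzero] * np.log(joint[nonzero] / (p_x @ p_y)[nonzero])).sum())

height = pd.Series([1.77, 1.32, 1.81, 1.56, 1.64], name='Height (m)')
target = pd.Series(['Yes', 'Yes', 'Yes', 'No', 'Yes'], name='Loves Troll 2')

scores = {}
for n_bins in (2, 3, 5):
    binned = pd.cut(height, bins=n_bins, labels=False)      # integer bin index per row
    scores[n_bins] = mi_discrete(binned, target)
    print(f"{n_bins} bins: MI = {scores[n_bins]:.4f}")
```

With 3 bins this reproduces the 0.2231 above. With 5 bins the single `No` row lands in a bin of its own, so the score rises to about 0.50 even though the underlying data has not changed.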


## Relevant Scikit-Learn Components

Scikit-learn provides powerful tools to calculate Mutual Information for feature selection, saving you from doing the math by hand. The main module is `sklearn.feature_selection`.

1.  **`mutual_info_classif`**:
    *   **What it is:** A function that estimates the mutual information between each feature and a discrete (categorical) target variable.
    *   **Relevance:** This is the most direct counterpart to the concept shown in the video for a classification problem (like predicting `Loves Troll 2`). For continuous features it does not bin at all: it estimates mutual information with a nearest-neighbor method, so you don't have to discretize them manually. For discrete features, pass `discrete_features=True`.

2.  **`mutual_info_regression`**:
    *   **What it is:** A similar function that estimates the mutual information between each feature and a continuous target variable.
    *   **Relevance:** This is what you would use if your target variable was something like `Height` or `Price` instead of a "Yes/No" category.
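
A minimal sketch of `mutual_info_regression` follows. The five-row Troll 2 dataset is too small to show much, so this uses synthetic data, which is an assumption made purely for illustration: one feature that determines the target through a non-linear relationship, and one feature that is pure noise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_samples = 300

informative = rng.uniform(-1, 1, n_samples)      # drives the target
noise_feature = rng.uniform(-1, 1, n_samples)    # unrelated to the target
X = np.column_stack([informative, noise_feature])

# Continuous target with a non-linear (quadratic) dependence on the first feature.
y = informative ** 2 + 0.05 * rng.normal(size=n_samples)

# One score per column of X; both features are treated as continuous here.
scores = mutual_info_regression(X, y, random_state=0)
for name, score in zip(['informative', 'noise_feature'], scores):
    print(f"{name}: {score:.3f}")
```

The informative feature should score well above the noise feature, even though the quadratic relationship gives the two a linear correlation close to zero.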

In [31]:
from sklearn.feature_selection import mutual_info_classif

x = data['Likes Popcorn'].map({'Yes': 1, 'No': 0})
y = data['Loves Troll 2'].map({'Yes': 1, 'No': 0})

# Step: Reshape the feature data 'x' into a 2D array
# Scikit-learn expects the feature data (X) to be in the format [n_samples, n_features].
# Since we have only one feature, it should be a column vector.
X_reshaped = x.to_numpy().reshape(-1, 1)

# Step: Call the function with the `discrete_features=True` parameter
# This tells the function to treat our feature as discrete/categorical,
# which is the key to getting the correct result on this small dataset.
mi_score = mutual_info_classif(
    X_reshaped, 
    y, 
    discrete_features=True, 
    random_state=42
)

# The result is an array with one score, so we access it with [0]
final_score = mi_score[0]

print(f"The Mutual Information between 'Likes Popcorn' and 'Loves Troll 2' is: {final_score:.4f}")

The Mutual Information between 'Likes Popcorn' and 'Loves Troll 2' is: 0.2231


*   If the raw `Height (m)` column is passed to `mutual_info_classif` as a continuous feature, Scikit-learn's result is `0.0000`, not the `0.2231` obtained from the binned version.
*   **Why the difference?** Scikit-learn's `mutual_info_classif` does **not** use histogram binning for continuous data. It estimates mutual information with a k-nearest neighbors (k-NN) method, so no explicit bins are needed. This avoids depending on an arbitrary number of bins, but it is more complex than the intuitive binning approach taught in the StatQuest video, and with only five rows any estimate is rough.
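
The continuous-feature call described above can be sketched as follows. The data is rebuilt so the block runs on its own, the variable names are illustrative, and `n_neighbors=3` is written out only to make the library default visible.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.DataFrame({
    'Height (m)': [1.77, 1.32, 1.81, 1.56, 1.64],
    'Loves Troll 2': ['Yes', 'Yes', 'Yes', 'No', 'Yes'],
})

X_height = df[['Height (m)']].to_numpy()                 # shape (n_samples, 1)
y_target = df['Loves Troll 2'].map({'Yes': 1, 'No': 0})

# discrete_features=False selects the nearest-neighbor estimator, not binning.
height_score = mutual_info_classif(
    X_height,
    y_target,
    discrete_features=False,
    n_neighbors=3,
    random_state=42,
)[0]
print(f"MI between 'Height (m)' and 'Loves Troll 2': {height_score:.4f}")
```

Printing the score lets you check the `0.0000` figure against your installed scikit-learn version.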