# Multivariate Distributions

In [None]:
%%html
<link rel="stylesheet" type="text/css" href="../styles/styles.css">

## Learning Objectives

By the end of this lesson, you will be able to:
1. Define and work with **joint probability distributions**
2. Compute **marginal** and **conditional** distributions
3. Test for **independence** between random variables
4. Calculate and interpret **covariance** and **correlation**
5. Apply multivariate distributions to **ML problems**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from mpl_toolkits.mplot3d import Axes3D

# Set style for better-looking plots
plt.style.use('seaborn-v0_8-darkgrid')
#sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
from pprint import pprint

In [None]:
from IPython.display import display, HTML

In [None]:
import sys
from pathlib import Path

# Add the "resources" directory to the path
project_root = Path().resolve().parent
resources_path = project_root / 'resources'
sys.path.insert(0, str(resources_path))

In [None]:
from multivariate import(get_viz_movie_data, viz_joint_distr_cont, covariance_demo, demo_corr_dependence, get_xy_plane, get_all_pdf_cont, demo_correlation, generate_movie_data)

<div class="alert alert-info">
<h3>🎯 The Recommendation System Dilemma</h3>

A streaming service tracks two variables for each user-movie interaction:
- $X = \text{Watch Time}$ (in minutes)
- $Y = \text{User Rating}$ (1-5 stars)
    
**The Data Science Team's Problem:**
They try a simple prediction rule:

<blockquote style="background: #f0f0f0; padding: 10px; border-left: 4px solid #2196f3; margin: 10px 0;">
    "If watch_time > 50 minutes → predict 4 stars<br>
     Otherwise → predict 2 stars"
</blockquote>

**Results:**
- This simple approach is wrong 62% of the time!
- Why? The relationship isn't that simple:

    <ul>
        <li>❤️ Some users watch briefly but rate high ("loved it quickly!")</li>
        <li>😤 Some users watch long but rate low ("couldn't stop hate-watching!")</li>
        <li>📊 The distribution is more complex than a simple threshold</li>
    </ul>

    
**Questions:**
1. Are watch time and rating independent?
2. How do they vary together (joint distribution)?
3. Can we build a better prediction model using their *joint distribution*?
    
Can we build a probabilistic model that cuts prediction error in half?

**By the end:**  You'll solve this problem using multivariate distributions!
</div>

In [None]:
# visualisation of the problem
get_viz_movie_data()

Before diving into formulas, let's understand the concepts with a *small, concrete example*. 

Imagine we surveyed **only 20 users** about a specific movie. We recorded two things:

- $X = \text{Watch Time}$: Short (< 30 min) or Long (≥ 30 min)
- $Y = \text{Rating}$: Low (1-3 stars) or High (4-5 stars)

Note: we introduce these "categories" (short / long / low / high) for simplicity.

This creates a simple **2×2 frequency table**:

|Watch_Time\ Rating| Low (1-3 ★) | High (4-5 ★) | TOTAL | 
|---:|:----:|:----:|:----:|
|**Short (< 30 min)** |   3       |  1    |  1 + 3 = 4|
|**Long (≥ 30 min)** | 2 | 14 |   14 + 2 = 16|
|**TOTAL**|  2 + 3 = 5 | 14 + 1 = 15  |    15 + 5 = 20|


In [None]:
# TOY EXAMPLE
toy_data = [
    # (watch_time_category, rating_category)
    ('Short', 'Low'),   # User 1
    ('Short', 'Low'),   # User 2
    ('Short', 'Low'),   # User 3
    ('Short', 'High'),  # User 4
    ('Long', 'Low'),    # User 5
    ('Long', 'Low'),    # User 6
    ('Long', 'High'),   # User 7
    ('Long', 'High'),   # User 8
    ('Long', 'High'),   # User 9
    ('Long', 'High'),   # User 10
    ('Long', 'High'),   # User 11
    ('Long', 'High'),   # User 12
    ('Long', 'High'),   # User 13
    ('Long', 'High'),   # User 14
    ('Long', 'High'),   # User 15
    ('Long', 'High'),   # User 16
    ('Long', 'High'),   # User 17
    ('Long', 'High'),   # User 18
    ('Long', 'High'),   # User 19
    ('Long', 'High'),   # User 20
]

# Convert to DataFrame
df_toy = pd.DataFrame(toy_data, columns=['Watch_Time', 'Rating'])
df_toy['User_ID'] = range(1, 21)

In [None]:
# Visualize each user as a dot
fig, ax = plt.subplots(figsize=(10, 6))

# Map categories to numbers for plotting
watch_map = {'Short': 0, 'Long': 1}
rating_map = {'Low': 0, 'High': 1}

x_vals = df_toy['Watch_Time'].map(watch_map)
y_vals = df_toy['Rating'].map(rating_map)

# Add jitter so overlapping points are visible
jitter = 0.05
x_jittered = x_vals + np.random.uniform(-jitter, jitter, len(x_vals))
y_jittered = y_vals + np.random.uniform(-jitter, jitter, len(y_vals))

# Color by combination
colors = []
for _, row in df_toy.iterrows():
    if row['Watch_Time'] == 'Short' and row['Rating'] == 'Low':
        colors.append('red')
    elif row['Watch_Time'] == 'Short' and row['Rating'] == 'High':
        colors.append('green')
    elif row['Watch_Time'] == 'Long' and row['Rating'] == 'Low':
        colors.append('orange')
    else:  # Long and High
        colors.append('blue')

ax.scatter(x_jittered, y_jittered, s=200, alpha=0.7, c=colors, edgecolors='black', linewidth=2)

# Label each point with user ID
for i, row in df_toy.iterrows():
    ax.text(x_jittered.iloc[i], y_jittered.iloc[i], str(row['User_ID']), 
            ha='center', va='center', fontsize=9, fontweight='bold')

ax.set_xticks([0, 1])
ax.set_xticklabels(['Short\n(< 30 min)', 'Long\n(≥ 30 min)'], fontsize=12)
ax.set_yticks([0, 1])
ax.set_yticklabels(['Low\n(1-3 stars)', 'High\n(4-5 stars)'], fontsize=12)
ax.set_xlabel('Watch Time', fontsize=14, fontweight='bold')
ax.set_ylabel('Rating', fontsize=14, fontweight='bold')
ax.set_title('Our Population: 20 Users (each dot = 1 user)', fontsize=16, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.set_xlim(-0.3, 1.3)
ax.set_ylim(-0.3, 1.3)

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='red', edgecolor='black', label='Short + Low'),
    Patch(facecolor='green', edgecolor='black', label='Short + High'),
    Patch(facecolor='orange', edgecolor='black', label='Long + Low'),
    Patch(facecolor='blue', edgecolor='black', label='Long + High')
]
ax.legend(handles=legend_elements, loc='center left', fontsize=11)

plt.tight_layout()
plt.show()

In [None]:
# Create crosstab (contingency table)
counts_table = pd.crosstab(df_toy['Watch_Time'], df_toy['Rating'], margins=True, margins_name='Total')

print("\nFrequency Table (Counts):")
print(counts_table)

## Joint Probability Distributions

So far we have been working with frequencies (counts), **not probabilities**. To convert counts to probabilities, divide each count by the total number of users (20 in our case).

<table>
<thead>
  <tr>
    <th align="right">Watch_Time\ Rating</th>
    <th align="center">Low (1-3 ★)</th>
    <th align="center">High (4-5 ★)</th>    
    <th align="center">TOTAL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td align="right"><strong>Short (&lt; 30 min)</strong></td>
    <td align="center" style="background-color: #ffe6e6;">3/20</td>
    <td align="center" style="background-color: #ffe6e6;">1/20</td>    
    <td align="center">4/20 = 1/5</td>
  </tr>
  <tr>
    <td align="right"><strong>Long (≥ 30 min)</strong></td>
    <td align="center" style="background-color: #ffe6e6;">2/20 = 1/10</td>
    <td align="center" style="background-color: #ffe6e6;">14/20 = 7/10</td>    
    <td align="center">16/20 = 4/5</td>
  </tr>
  <tr>
    <td align="right"><strong>TOTAL</strong></td>
    <td align="center">5/20 = 1/4</td>
    <td align="center">15/20 = 3/4</td>    
    <td align="center">20/20 = 1</td>
  </tr>
</tbody>
</table>

In [None]:
# Divide by total to get probabilities
n_total = len(df_toy)
joint_prob_table = pd.crosstab(df_toy['Watch_Time'], df_toy['Rating']) / n_total

print("Joint Probability Table P(X, Y):")
# print(joint_prob_table.round(3))

# Highlight the joint-probability cells (the center of the table)
styled = joint_prob_table.style.apply(
    lambda row: ['background-color: #ffe6e6' for _ in row],
    axis=1, subset=['High', 'Low'])

display(styled)

print("\nInterpretation:")
print(f"  - P(Short, Low) = {joint_prob_table.loc['Short', 'Low']:.2f}")
print(f"    → {joint_prob_table.loc['Short', 'Low']*100:.0f}% of users watched short AND rated low")
print(f"  - P(Long, High) = {joint_prob_table.loc['Long', 'High']:.2f}")
print(f"    → {joint_prob_table.loc['Long', 'High']*100:.0f}% of users watched long AND rated high")

Note that all values in the middle of the table are non-negative and sum up to 1: 

$$7/10 + 1/10 + 1/20 + 3/20 = 8/10 + 4/20 = 8/10 + 2/10 = 1$$
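
The two requirements above (non-negative entries that sum to 1) can be checked programmatically. The sketch below is only an illustration: the helper name `is_valid_pmf` and the tolerance are assumptions, not part of the lesson's `multivariate` module.

```python
import math

# Joint PMF of the toy example, keyed by (watch_time, rating) labels
joint_pmf = {
    ("Short", "Low"): 3 / 20,
    ("Short", "High"): 1 / 20,
    ("Long", "Low"): 2 / 20,
    ("Long", "High"): 14 / 20,
}


def is_valid_pmf(pmf, tol=1e-9):
    """Return True if all probabilities are non-negative and sum to 1 (hypothetical helper)."""
    non_negative = all(p >= 0 for p in pmf.values())
    sums_to_one = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return non_negative and sums_to_one


print(is_valid_pmf(joint_pmf))
```

A tolerance is used because floating-point sums of fractions such as 3/20 are rarely exactly 1.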

In [None]:
# Visualize as heatmap
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Counts
sns.heatmap(counts_table.iloc[:-1, :-1], annot=True, fmt='d', cmap='YlOrRd', 
            cbar_kws={'label': 'Count'}, ax=axes[0],
            linewidths=3, linecolor='white', square=True, vmin=0)
axes[0].set_title('Frequency Table (Counts)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Rating', fontsize=12)
axes[0].set_ylabel('Watch Time', fontsize=12)

# Plot 2: Probabilities
sns.heatmap(joint_prob_table, annot=True, fmt='.2f', cmap='YlGnBu', 
            cbar_kws={'label': 'Probability'}, ax=axes[1],
            linewidths=3, linecolor='white', square=True, vmin=0, vmax=1)
axes[1].set_title('Joint Probability P(X, Y)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Rating', fontsize=12)
axes[1].set_ylabel('Watch Time', fontsize=12)

plt.tight_layout()
plt.show()

print("\nThe center of the table = JOINT DISTRIBUTION")
print("This tells us: P(X=x AND Y=y) for all combinations")

So far, for simplicity, we have used categorical labels ("Low", "High", "Short", "Long"). To work with real-valued random variables, we can assign numerical values, e.g.:
$$X \ (\text{Watch Time}): \quad \text{Short} = 0, \ \text{Long} = 1$$
$$Y \ (\text{Rating}): \quad \text{Low} = 0, \ \text{High} = 1$$

In [None]:
# Create numerical versions
df_toy['X'] = df_toy['Watch_Time'].map({'Short': 0, 'Long': 1})
df_toy['Y'] = df_toy['Rating'].map({'Low': 0, 'High': 1})

print(df_toy[['User_ID', 'Watch_Time', 'X', 'Rating', 'Y']].head(10))

In [None]:
# Create numerical joint probability table
X_values = [0, 1]  # Short=0, Long=1
Y_values = [0, 1]  # Low=0, High=1

# Joint probabilities
P_XY = {}
P_XY[(0, 0)] = 3/20  # P(Short, Low)
P_XY[(0, 1)] = 1/20  # P(Short, High)
P_XY[(1, 0)] = 2/20  # P(Long, Low)
P_XY[(1, 1)] = 14/20 # P(Long, High)

pprint(P_XY)

<div class="alert alert-success">
<h4>Definition: Joint Probability Distribution</h4>

When we have two random variables $X$ and $Y$, their **joint distribution** describes the probability of both variables taking specific values simultaneously.

<h5>Discrete Case (Joint PMF):</h5>

$$P_{XY}(x, y) = P(X = x, Y = y)$$

*Remark:* the following notations may also be used: $\mathbb{P}(X=x \text{ and } Y=y)$ and $\mathbb{P}(X=x \cap Y=y)$ (intersection)

**Properties:**

* $\sum_{(x_i,y_j)\in \mathbb{R}_{XY}} \mathbb{P}_{XY}(x_i,y_j) = 1$
* $\forall (x_i,y_j)\in \mathbb{R}_{XY} :\ \mathbb{P}_{XY}(x_i,y_j) \geq 0$

*Remark:* $\mathbb{R}_{XY}$ is often defined as $\mathbb{R}_{XY} = \mathbb{R}_X \times \mathbb{R}_Y$. Note that in this case, for certain pairs $(x_i, y_j)$, the probability $\mathbb{P}_{XY}(x_i,y_j)$ may equal 0.

*Remark:* when dealing with discrete random variables, we often consider $\mathbb{N}^2$ as the value space.


<h5>Continuous Case (Joint PDF):</h5>

Let $X$ and $Y$ be two continuous r.v. The random variables $X$ and $Y$ are called **jointly continuous random variables** if there exists a non-negative function $f_{XY} : \ \mathbb{R}^2 \to \mathbb{R}$ such that for every set $A \subseteq \mathbb{R}^2$ we have:

$$\mathbb{P}\left((X,Y) \in A\right) = \iint\limits_{(x,y)\in A} f_{XY}(x,y)dxdy$$

The function $f_{XY}(x,y)$ is called the **joint probability density function** or **joint PDF** of r.v. $X$ and $Y$.


**Properties:**

- $f_{XY}(x, y) \geq 0$ for all $(x, y)$ (*non-negativity*)
- $\int\limits_{-\infty}^{+\infty}\int\limits_{-\infty}^{+\infty}f_{XY}(x,y)dxdy = 1$ (integrate over entire space)

</div>
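
To make the continuous case concrete, here is a minimal numerical sketch. It assumes a hypothetical joint PDF $f_{XY}(x,y) = x + y$ on the unit square (this density is chosen only for illustration and does not come from the movie data), and approximates the double integral with a midpoint Riemann sum.

```python
import numpy as np


def f_xy(x, y):
    """Hypothetical joint PDF: f(x, y) = x + y on the unit square, 0 elsewhere."""
    inside = (x >= 0) & (x <= 1) & (y >= 0) & (y <= 1)
    return np.where(inside, x + y, 0.0)


def prob_rectangle(x_lo, x_hi, y_lo, y_hi, n=400):
    """Approximate P(x_lo <= X <= x_hi, y_lo <= Y <= y_hi) with a midpoint Riemann sum."""
    dx = (x_hi - x_lo) / n
    dy = (y_hi - y_lo) / n
    x_mid = x_lo + (np.arange(n) + 0.5) * dx
    y_mid = y_lo + (np.arange(n) + 0.5) * dy
    X, Y = np.meshgrid(x_mid, y_mid)
    return float(np.sum(f_xy(X, Y)) * dx * dy)


total = prob_rectangle(0.0, 1.0, 0.0, 1.0)    # whole support: should be close to 1
quarter = prob_rectangle(0.0, 0.5, 0.0, 0.5)  # closed-form value is 1/8

print(f"Total probability:      {total:.4f}")
print(f"P(X <= 0.5, Y <= 0.5):  {quarter:.4f}")
```

The first number checks the normalization property; the second is the probability of a set $A$, computed as the volume under the density over $A$.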

In [None]:
# continuous example
viz_joint_distr_cont()

<div class="alert alert-primary">
<h4>🤖 ML Application Spotlight: Spam Detection</h4>

<p><strong>Real-world use of joint distributions:</strong></p>

<p>In spam email classification:</p>
<ul>
    <li>X = Number of exclamation marks</li>
    <li>Y = Number of CAPITAL letters</li>
</ul>

<p>Understanding P(X, Y | Spam) vs P(X, Y | Ham) helps classify emails.</p>

<p><strong>Naive Bayes assumes independence</strong>, but joint modeling captures <strong>feature interactions</strong>!</p>
</div>

## Joint Cumulative Distribution Function

<div class="alert alert-success">
<h4>Definition: Joint CDF</h4>

The **joint cumulative distribution function** of the pair $(X,Y)$ is the mapping $F_{XY} : \mathbb{R}^2 \rightarrow [0,1]$ defined by

$$F_{XY}(x,y) = \mathbb{P}(X\leq x, Y\leq y), \ \forall (x,y) \in \mathbb{R}^2$$

Since $F_{XY}(x,y)$ is a probability, $0\leq F_{XY}(x,y) \leq 1$.

In particular:

* $\lim\limits_{\substack{x\rightarrow -\infty \\ y\rightarrow -\infty}}F_{XY}(x,y) = 0$
* $\lim\limits_{x\rightarrow -\infty}F_{XY}(x,y) = \lim\limits_{y\rightarrow -\infty}F_{XY}(x,y) = 0$
* $\lim\limits_{\substack{x\rightarrow +\infty \\ y\rightarrow +\infty}}F_{XY}(x,y) = 1$

The joint CDF also has the following properties:

* $F_X(x) = \lim\limits_{y\to \infty} F_{XY}(x,y)$ (*marginal CDF*)
* $F_Y(y) = \lim\limits_{x\to \infty} F_{XY}(x,y)$ (*marginal CDF*)
* $\mathbb{P}(x_1 < X \leq x_2, y_1 < Y \leq y_2) = F_{XY}(x_2, y_2) - F_{XY}(x_1, y_2) - F_{XY}(x_2, y_1) + F_{XY}(x_1, y_1)$
* if $X$ and $Y$ are independent (see below), then: $F_{XY}(x,y) = F_X(x)F_Y(y)$

<h5>Continuous case:</h5>

It is possible to define this CDF via the following integral representation:

$$F_{XY}(x,y) = \int\limits_{-\infty}^{x}\int\limits_{-\infty}^{y}f_{XY}(t_1,t_2)dt_2dt_1$$

</div>

From a graphical point of view, the joint CDF of the pair $(X,Y)$ corresponds to the probability that $(X,Y)$ belongs to a region bounded by $x$ and $y$:

In [None]:
get_xy_plane()

<div class="alert alert-success" style='background-color:white'>
<h4>Property: Link between Joint PDF and Joint CDF</h4>

The joint PDF $f_{XY}(x,y)$ of r.v. $X$ and $Y$ can be obtained from the joint CDF $F_{XY}(x,y)$ by taking its mixed second partial derivative (wherever this derivative exists):

$$f_{XY}(x,y) = \frac{\partial^2}{\partial x \partial y} F_{XY}(x,y)$$

**Remark:** as in the univariate case, the joint PDF $f_{XY}(x,y)$ can have values greater than 1, because it is a density and **not** a probability.

As a reminder, in the univariate case, we said that probability was reflected by the area under the curve of the density function $f_X(x)$. Hence, the probability at a specific point equals 0.

We will develop this idea for the case of a pair of continuous r.v.

In the case of two continuous r.v. $X$ and $Y$ with joint PDF $f_{XY}(x,y)$, probability is given by the volume under the surface, because by definition $\forall A \subseteq \mathbb{R}^2, \ \mathbb{P}\left((X,Y) \in A\right) = \iint\limits_{(x,y)\in A} f_{XY}(x,y)dxdy$.

The joint PDF can thus be viewed as a measure of probability per unit area.

Recall that in the univariate case, the density function can be written as a limit:
$f_X(x) = \lim\limits_{\Delta\to 0^+}\frac{\mathbb{P}(x < X \leq x + \Delta)}{\Delta}$

Similarly, for the pair $(X,Y)$, the product $f_{XY}(x,y)\delta_x\delta_y$ approximates the probability that $(X,Y)$ falls in a small rectangle of width $\delta_x$ and height $\delta_y$ whose lower-left corner is the point $(x,y)$, i.e.:
$\mathbb{P}(x < X\leq x + \delta_x,\ y < Y \leq y + \delta_y) \approx f_{XY}(x,y)\delta_x\delta_y$

<center>
<img src="img/density-region.png" alt="Small rectangle region to define the density" width="600px">
</center>

</div>
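
The small-rectangle interpretation can be checked numerically. The sketch below assumes, purely for illustration, that $X$ and $Y$ are independent standard normal variables, so that $F_{XY}(x,y) = \Phi(x)\Phi(y)$ and $f_{XY}(x,y) = \varphi(x)\varphi(y)$; the point $(0.3, -0.5)$ and the step size are arbitrary choices.

```python
import math


def phi(t):
    """Standard normal PDF."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def Phi(t):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def joint_cdf(x, y):
    """Joint CDF under the independence assumption: F_XY(x, y) = Phi(x) * Phi(y)."""
    return Phi(x) * Phi(y)


def joint_pdf(x, y):
    """Joint PDF under the independence assumption: f_XY(x, y) = phi(x) * phi(y)."""
    return phi(x) * phi(y)


def rectangle_prob(x, y, dx, dy):
    """P(x < X <= x + dx, y < Y <= y + dy) computed from the joint CDF."""
    return (joint_cdf(x + dx, y + dy) - joint_cdf(x, y + dy)
            - joint_cdf(x + dx, y) + joint_cdf(x, y))


x0, y0 = 0.3, -0.5   # arbitrary evaluation point
delta = 1e-3         # side length of the small rectangle

approx_density = rectangle_prob(x0, y0, delta, delta) / (delta * delta)
exact_density = joint_pdf(x0, y0)

print(f"Rectangle probability / area: {approx_density:.5f}")
print(f"Exact joint PDF value:        {exact_density:.5f}")
```

Dividing the rectangle probability by its area is a finite-difference version of the mixed partial derivative, which is why the two printed values should agree closely for small `delta`.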

Let's calculate the joint CDF $F_{XY}(x, y) = P(X \leq x, Y \leq y)$ for our toy example:

<table>
<thead>
  <tr>
    <th align="right">Watch_Time\ Rating</th>
    <th align="center">Low (1-3 ★)<br><center>0</center></th>
    <th align="center">High (4-5 ★)<br><center>1</center></th>    
    <th align="center">TOTAL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td align="right"><strong>Short (&lt; 30 min)</strong><br><center>0</center></td>
    <td align="center">3/20 = 0.15</td>
    <td align="center">1/20 = 0.05</td>    
    <td align="center">4/20 = 1/5 = 0.2</td>
  </tr>
  <tr>
    <td align="right"><strong>Long (≥ 30 min)</strong><br><center>1</center></td>
    <td align="center">1/10 = 0.1</td>
    <td align="center">7/10 = 0.7</td>    
    <td align="center">4/5 = 0.8</td>
  </tr>
  <tr>
    <td align="right"><strong>TOTAL</strong></td>
    <td align="center">1/4 = 0.25</td>
    <td align="center">3/4 = 0.75</td>    
    <td align="center">1</td>
  </tr>
</tbody>
</table>

1. $F_{XY}(0, 0) = P(X \leq 0, Y \leq 0) = P(X = 0, Y = 0) = 0.15$

2. $F_{XY}(0, 1) = P(X \leq 0, Y \leq 1) = P(X = 0, Y = 0) + P(X = 0, Y = 1) = 0.15 + 0.05 = 0.2$

3. $F_{XY}(1, 0) = P(X \leq 1, Y \leq 0) = P(X = 0, Y = 0) + P(X = 1, Y = 0) = 0.15 + 0.1 = 0.25$

4. $F_{XY}(1, 1) = P(X \leq 1, Y \leq 1) = P(X = 0, Y = 0) + P(X = 0, Y = 1) + P(X = 1, Y = 0) + P(X = 1, Y = 1) = 0.15 + 0.05 + 0.1 + 0.7 = 1$
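
The four values computed above can also be obtained in one step with cumulative sums. This is a minimal sketch that assumes the joint PMF is stored as a NumPy array with rows indexed by $x$ and columns by $y$; the variable names are illustrative.

```python
import numpy as np

# Joint PMF of the toy example
# Rows: X = 0 (Short), X = 1 (Long); columns: Y = 0 (Low), Y = 1 (High)
pmf = np.array([[3 / 20, 1 / 20],
                [2 / 20, 14 / 20]])

# Accumulate along x (rows), then along y (columns):
# entry [i, j] becomes P(X <= i, Y <= j)
joint_cdf = pmf.cumsum(axis=0).cumsum(axis=1)

print(joint_cdf)
```

The order of the two `cumsum` calls does not matter, since summation over $x$ and $y$ commutes.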

In [None]:
# Calculate joint CDF at key points
F_XY = {}

# Point (0, 0)
print("\n  F_XY(0, 0) = P(X ≤ 0, Y ≤ 0)")
print("             = P(X = 0, Y = 0)")
print(f"             = {P_XY[(0,0)]:.2f}")
F_XY[(0, 0)] = P_XY[(0,0)]

# Point (0, 1)
print("\n  F_XY(0, 1) = P(X ≤ 0, Y ≤ 1)")
print("             = P(X = 0, Y = 0) + P(X = 0, Y = 1)")
print(f"             = {P_XY[(0,0)]:.2f} + {P_XY[(0,1)]:.2f}")
print(f"             = {P_XY[(0,0)] + P_XY[(0,1)]:.2f}")
F_XY[(0, 1)] = P_XY[(0,0)] + P_XY[(0,1)]

# Point (1, 0)
print("\n  F_XY(1, 0) = P(X ≤ 1, Y ≤ 0)")
print("             = P(X = 0, Y = 0) + P(X = 1, Y = 0)")
print(f"             = {P_XY[(0,0)]:.2f} + {P_XY[(1,0)]:.2f}")
print(f"             = {P_XY[(0,0)] + P_XY[(1,0)]:.2f}")
F_XY[(1, 0)] = P_XY[(0,0)] + P_XY[(1,0)]

# Point (1, 1)
print("\n  F_XY(1, 1) = P(X ≤ 1, Y ≤ 1)")
print("             = P(X = 0, Y = 0) + P(X = 0, Y = 1) + P(X = 1, Y = 0) + P(X = 1, Y = 1)")
print(f"             = {P_XY[(0,0)]:.2f} + {P_XY[(0,1)]:.2f} + {P_XY[(1,0)]:.2f} + {P_XY[(1,1)]:.2f}")
print(f"             = {sum(P_XY.values()):.2f}")
F_XY[(1, 1)] = sum(P_XY.values())

print("\n" + "="*70)
print("Joint CDF Table:")
print("="*70)
cdf_table = pd.DataFrame({
    'F_XY(x, y)': ['(0, 0)', '(0, 1)', '(1, 0)', '(1, 1)'],
    'Value': [F_XY[(0,0)], F_XY[(0,1)], F_XY[(1,0)], F_XY[(1,1)]],
    'Interpretation': [
        'P(Short, Low)',
        'P(Short, Low or High)',
        'P(Short or Long, Low)',
        'P(Short or Long, Low or High) = 1'
    ]
})
print(cdf_table.to_string(index=False))

In [None]:
# Create 1x2 subplot layout
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Joint CDF as 3D surface
axes[0].remove()
ax3 = fig.add_subplot(1, 2, 1, projection='3d')

# Create grid
x_grid = np.array([-0.5, 0, 0.5, 1, 1.5])
y_grid = np.array([-0.5, 0, 0.5, 1, 1.5])
X_mesh, Y_mesh = np.meshgrid(x_grid, y_grid)

# Calculate F_XY for grid
F_mesh = np.zeros_like(X_mesh)
for i, x_val in enumerate(x_grid):
    for j, y_val in enumerate(y_grid):
        if x_val < 0 or y_val < 0:
            F_mesh[j, i] = 0
        elif x_val >= 1 and y_val >= 1:
            F_mesh[j, i] = 1.0
        elif x_val >= 1 and y_val < 1:
            F_mesh[j, i] = F_XY[(1, 0)]
        elif x_val < 1 and y_val >= 1:
            F_mesh[j, i] = F_XY[(0, 1)]
        else:  # x_val < 1 and y_val < 1
            if x_val >= 0 and y_val >= 0:
                F_mesh[j, i] = F_XY[(0, 0)]
            else:
                F_mesh[j, i] = 0

# Plot surface
surf = ax3.plot_surface(X_mesh, Y_mesh, F_mesh, cmap='viridis', alpha=0.7, edgecolor='black', linewidth=0.5)

# Mark key points
key_points = [(0, 0, F_XY[(0,0)]), (0, 1, F_XY[(0,1)]), (1, 0, F_XY[(1,0)]), (1, 1, F_XY[(1,1)])]
for (x, y, z) in key_points:
    ax3.scatter([x], [y], [z], color='red', s=100, edgecolors='black', linewidth=2)
    ax3.text(x, y, z, f'  {z:.2f}', fontsize=9, fontweight='bold')

ax3.set_xlabel('x (Watch Time)', fontsize=10, fontweight='bold')
ax3.set_ylabel('y (Rating)', fontsize=10, fontweight='bold')
ax3.set_zlabel('F_XY(x, y)', fontsize=10, fontweight='bold')
ax3.set_title('Joint CDF: F_XY(x, y)', fontsize=13, fontweight='bold')
ax3.set_zlim(0, 1.1)
fig.colorbar(surf, ax=ax3, shrink=0.5, aspect=5)


# Plot 2: Joint CDF as heatmap
ax4 = axes[1]

# Create detailed grid for heatmap
x_detailed = np.linspace(-0.5, 1.5, 50)
y_detailed = np.linspace(-0.5, 1.5, 50)
X_detail, Y_detail = np.meshgrid(x_detailed, y_detailed)

F_detail = np.zeros_like(X_detail)
for i in range(len(x_detailed)):
    for j in range(len(y_detailed)):
        x_val = x_detailed[i]
        y_val = y_detailed[j]
        
        if x_val < 0 or y_val < 0:
            F_detail[j, i] = 0
        elif x_val >= 1 and y_val >= 1:
            F_detail[j, i] = 1.0
        elif x_val >= 1:  # x >= 1, y < 1
            if y_val >= 0:
                F_detail[j, i] = F_XY[(1, 0)]
            else:
                F_detail[j, i] = 0
        elif y_val >= 1:  # y >= 1, x < 1
            if x_val >= 0:
                F_detail[j, i] = F_XY[(0, 1)]
            else:
                F_detail[j, i] = 0
        else:  # both < 1
            if x_val >= 0 and y_val >= 0:
                F_detail[j, i] = F_XY[(0, 0)]
            else:
                F_detail[j, i] = 0

contour = ax4.contourf(X_detail, Y_detail, F_detail, levels=10, cmap='viridis')
ax4.contour(X_detail, Y_detail, F_detail, levels=10, colors='black', alpha=0.3, linewidths=0.5)

# Mark key points
for (x, y, z) in key_points:
    ax4.plot(x, y, 'ro', markersize=12, markeredgecolor='black', markeredgewidth=2)
    ax4.text(x + 0.1, y + 0.1, f'{z:.2f}', fontsize=10, fontweight='bold', 
            bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

ax4.set_xlabel('x (Watch Time: 0=Short, 1=Long)', fontsize=12, fontweight='bold')
ax4.set_ylabel('y (Rating: 0=Low, 1=High)', fontsize=12, fontweight='bold')
ax4.set_title('Joint CDF: Contour Plot', fontsize=13, fontweight='bold')
ax4.grid(True, alpha=0.3)
plt.colorbar(contour, ax=ax4, label='F_XY(x, y)')

# Add lines at discrete values
ax4.axvline(0, color='white', linestyle='--', linewidth=1, alpha=0.5)
ax4.axvline(1, color='white', linestyle='--', linewidth=1, alpha=0.5)
ax4.axhline(0, color='white', linestyle='--', linewidth=1, alpha=0.5)
ax4.axhline(1, color='white', linestyle='--', linewidth=1, alpha=0.5)

plt.tight_layout()
plt.show()



## Marginal Distribution

Now, let's focus on the margins of our table.

<table>
<thead>
  <tr>
    <th align="right">Watch_Time\ Rating</th>
    <th align="center">Low (1-3 ★)</th>
    <th align="center">High (4-5 ★)</th>    
    <th align="center">TOTAL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td align="right"><strong>Short (&lt; 30 min)</strong></td>
    <td align="center">3/20</td>
    <td align="center">1/20</td>    
    <td align="center" style="background-color: #e6f3ff;">4/20 = 1/5</td>
  </tr>
  <tr>
    <td align="right"><strong>Long (≥ 30 min)</strong></td>
    <td align="center" >2/20 = 1/10</td>
    <td align="center">14/20 = 7/10</td>    
    <td align="center" style="background-color: #e6f3ff;">16/20 = 4/5</td>
  </tr>
  <tr>
    <td align="right"><strong>TOTAL</strong></td>
    <td align="center" style="background-color: #e6f3ff;">5/20 = 1/4</td>
    <td align="center" style="background-color: #e6f3ff;">15/20 = 3/4</td>    
    <td align="center">20/20 = 1</td>
  </tr>
</tbody>
</table>

<br>

> What is the probability that a user rates the movie HIGH, regardless of watch time?

**ANSWER**: 3/4

> What is the probability that a user watches the movie for a SHORT time, regardless of the rating given?

**ANSWER**: 1/5

These are called **marginal probabilities**.

In [None]:
marginal_X = joint_prob_table.sum(axis=1) # sum across columns
marginal_Y = joint_prob_table.sum(axis=0) # sum across rows

# Create table with margins
joint_with_margins = joint_prob_table.copy()
joint_with_margins['P(X)'] = marginal_X
joint_with_margins.loc['P(Y)'] = marginal_Y.tolist() + [1.0]

# Apply styling to margin cells
def highlight_margins(df):
    """Return a DataFrame of CSS strings (same shape as df) that shades the margin cells."""
    styles = pd.DataFrame('', index=df.index, columns=df.columns)
    
    # Highlight last row: P(Y)
    styles.iloc[-1, :] = 'background-color: #e6f3ff'
    
    # Highlight last column: P(X)
    styles.iloc[:, -1] = 'background-color: #e6f3ff'
    
    return styles


styled_margins = joint_with_margins.style.apply(highlight_margins, axis=None)

display(styled_margins)

<div class="alert alert-success">
<h4>Definition: Marginal Distribution</h4>

The **marginal distribution** gives the probability distribution of one variable *without considering* the other.

<h5>Discrete:</h5>

$$\forall x_i\in \mathbb{R}_X: \ \mathbb{P}_X(x_i) = \sum_{y_j\in \mathbb{R}_Y} \mathbb{P}_{XY}(x_i,y_j)$$
$$\forall y_j\in \mathbb{R}_Y: \ \mathbb{P}_Y(y_j) = \sum_{x_i\in \mathbb{R}_X} \mathbb{P}_{XY}(x_i,y_j)$$


<h5>Continuous:</h5>


$$f_X(x) = \int\limits_{-\infty}^{+\infty} f_{XY}(x, y) dy$$
$$f_Y(y) = \int\limits_{-\infty}^{+\infty} f_{XY}(x, y) dx$$


*Intuition*: "Sum (or integrate) across the row/column to get the margin"

</div>
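
For the continuous formula, a short numerical sketch may help. It assumes a hypothetical joint PDF $f_{XY}(x,y) = x + y$ on the unit square (chosen only for illustration), for which integrating out $y$ gives the closed form $f_X(x) = x + \tfrac{1}{2}$ on $[0, 1]$.

```python
import numpy as np


def f_xy(x, y):
    """Hypothetical joint PDF on the unit square: f(x, y) = x + y."""
    return x + y


def marginal_x(x, n=1000):
    """Approximate f_X(x) = integral of f_XY(x, y) over y in [0, 1] (midpoint rule)."""
    dy = 1.0 / n
    y_mid = (np.arange(n) + 0.5) * dy
    return float(np.sum(f_xy(x, y_mid)) * dy)


for x in (0.0, 0.25, 1.0):
    print(f"f_X({x}) ~ {marginal_x(x):.4f}   (closed form: {x + 0.5:.4f})")
```

The integration limits are $0$ and $1$ here because the assumed density is zero outside the unit square.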

In [None]:
# Visualize marginals as bar charts
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Marginal of Watch Time
axes[0].bar(marginal_X.index, marginal_X.values, color=['coral', 'steelblue'], 
            alpha=0.7, edgecolor='black', linewidth=2)
axes[0].set_ylabel('Probability', fontsize=12)
axes[0].set_xlabel('Watch Time', fontsize=12)
axes[0].set_title('Marginal Distribution: P(X)\n"What % watched short vs long?"', 
                  fontsize=13, fontweight='bold')
axes[0].set_ylim(0, 1)
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(marginal_X.values):
    axes[0].text(i, v + 0.03, f'{v:.2f}\n({v*100:.0f}%)', 
                ha='center', fontsize=11, fontweight='bold')

# Marginal of Rating
axes[1].bar(marginal_Y.index, marginal_Y.values, color=['tomato', 'lightgreen'], 
            alpha=0.7, edgecolor='black', linewidth=2)
axes[1].set_ylabel('Probability', fontsize=12)
axes[1].set_xlabel('Rating', fontsize=12)
axes[1].set_title('Marginal Distribution: P(Y)\n"What % rated low vs high?"', 
                  fontsize=13, fontweight='bold')
axes[1].set_ylim(0, 1)
axes[1].grid(axis='y', alpha=0.3)
for i, v in enumerate(marginal_Y.values):
    axes[1].text(i, v + 0.03, f'{v:.2f}\n({v*100:.0f}%)', 
                ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n📊 Interpretation:")
print(f"   P(Long) = {marginal_X['Long']:.3f}")
print(f"   → {marginal_X['Long']*100:.1f}% of users watch for long time (ignoring rating)")
print(f"\n   P(High) = {marginal_Y['High']:.3f}")
print(f"   → {marginal_Y['High']*100:.1f}% of ratings are high (ignoring watch time)")

In [None]:
# Marginal PMFs
P_X = {}
P_X[0] = P_XY[(0,0)] + P_XY[(0,1)]  # P(X=0) = P(Short)
P_X[1] = P_XY[(1,0)] + P_XY[(1,1)]  # P(X=1) = P(Long)
print(f"Marginal PMF P_X: {P_X}")

P_Y = {}
P_Y[0] = P_XY[(0,0)] + P_XY[(1,0)]  # P(Y=0) = P(Low)
P_Y[1] = P_XY[(0,1)] + P_XY[(1,1)]  # P(Y=1) = P(High)
print(f"Marginal PMF P_Y: {P_Y}")

In [None]:
# continuous case
get_all_pdf_cont(show_marginal=True, show_cond=False)

<div class="alert alert-success">
<h4>Definition: Marginal CDF</h4>

Let $(X,Y)$ be a pair of r.v. with CDF $F_{XY}(x,y)$.

We call the **marginal CDFs** of $X$ and $Y$ the functions defined as follows:

$$\forall x\in \mathbb{R}_X: F_X(x) = F_{XY}(x,\infty) = \lim\limits_{y\rightarrow \infty} F_{XY}(x,y)$$
$$\forall y\in \mathbb{R}_Y: F_Y(y) = F_{XY}(\infty,y) = \lim\limits_{x\rightarrow \infty} F_{XY}(x,y)$$

<h5>Continuous case:</h5>

We can define the **marginal CDFs** of $X$ and $Y$ as follows:
$$F_X(x) = \int\limits_{-\infty}^{x}\int\limits_{-\infty}^{+\infty}f_{XY}(t_1, t_2) dt_2 dt_1,  \forall x\in \mathbb{R}_X$$
$$F_Y(y) = \int\limits_{-\infty}^{y}\int\limits_{-\infty}^{+\infty}f_{XY}(t_1, t_2) dt_1 dt_2,  \forall y\in \mathbb{R}_Y$$

</div>
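
In the discrete toy example, letting $y \to \infty$ simply means taking the largest value of $y$. The sketch below illustrates this by reading the marginal CDFs off the last column and last row of the joint CDF array; the array layout (rows for $x$, columns for $y$) is an assumption of this example.

```python
import numpy as np

# Joint PMF of the toy example
# Rows: X = 0 (Short), X = 1 (Long); columns: Y = 0 (Low), Y = 1 (High)
pmf = np.array([[3 / 20, 1 / 20],
                [2 / 20, 14 / 20]])

# Joint CDF on the grid of observed values
joint_cdf = pmf.cumsum(axis=0).cumsum(axis=1)

# y -> infinity corresponds to the last column, x -> infinity to the last row
marginal_cdf_x = joint_cdf[:, -1]   # [F_X(0), F_X(1)]
marginal_cdf_y = joint_cdf[-1, :]   # [F_Y(0), F_Y(1)]

print("F_X:", marginal_cdf_x)
print("F_Y:", marginal_cdf_y)
```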

<div class="alert alert-success" style='background-color:white'>
<h4>Property of CDF</h4>

Let $(X,Y)$ be a pair of r.v. Let $x_1 \leq x_2$, $y_1 \leq y_2$, $x_1, x_2, y_1, y_2 \in \mathbb{R}$.

Then:
$\mathbb{P}(x_1 < X \leq x_2, y_1 < Y \leq y_2) = F_{XY}(x_2,y_2) - F_{XY}(x_1,y_2) - F_{XY}(x_2, y_1) + F_{XY}(x_1,y_1)$

</div>
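
As a quick check of this property on the toy example, the following sketch evaluates the joint CDF at arbitrary real points and applies the four-term formula. The function names are illustrative and not part of the lesson's helper module.

```python
# Joint PMF of the toy example: keys are (x, y) with Short=0, Long=1 and Low=0, High=1
pmf = {(0, 0): 3 / 20, (0, 1): 1 / 20, (1, 0): 2 / 20, (1, 1): 14 / 20}


def joint_cdf(x, y):
    """F_XY(x, y) = P(X <= x, Y <= y) for any real x and y."""
    return sum(p for (xi, yj), p in pmf.items() if xi <= x and yj <= y)


def rectangle_prob(x1, x2, y1, y2):
    """P(x1 < X <= x2, y1 < Y <= y2) via the four-term CDF formula."""
    return (joint_cdf(x2, y2) - joint_cdf(x1, y2)
            - joint_cdf(x2, y1) + joint_cdf(x1, y1))


# The rectangle (0, 1] x (0, 1] contains only the point (1, 1), i.e. (Long, High)
p_long_high = rectangle_prob(0, 1, 0, 1)
print(f"P(0 < X <= 1, 0 < Y <= 1) = {p_long_high:.2f}")
print(f"P(X = 1, Y = 1)           = {pmf[(1, 1)]:.2f}")
```

Both lines should print the same value, since $1 - 0.2 - 0.25 + 0.15 = 0.7$.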

In [None]:
# CDF visualisation for our toy example
fig, axes = plt.subplots(1, 2, figsize=(16, 6))


# Plot 1: CDF of X (Marginal)
ax1 = axes[0]

# Create step function for CDF
x_plot = np.array([-0.5, 0, 0, 1, 1, 1.5])
F_x_plot = np.array([0, 0, P_X[0], P_X[0], 1.0, 1.0])

ax1.plot(x_plot, F_x_plot, 'b-', linewidth=2, label='F_X(x)')

# Mark the jumps (discrete points)
ax1.plot([0, 1], [P_X[0], 1.0], 'bo', markersize=10, label='Jumps at discrete values')

# Open circles at left endpoints of jumps
ax1.plot([0, 1], [0, P_X[0]], 'bo', markersize=10, fillstyle='none')

# Horizontal lines showing constant regions
ax1.axhline(y=0, xmin=0, xmax=0.4, color='blue', linestyle='--', alpha=0.3)
ax1.axhline(y=P_X[0], xmin=0.4, xmax=0.7, color='blue', linestyle='--', alpha=0.3)
ax1.axhline(y=1.0, xmin=0.7, xmax=1, color='blue', linestyle='--', alpha=0.3)

ax1.set_xlabel('x (Watch Time: 0=Short, 1=Long)', fontsize=12, fontweight='bold')
ax1.set_ylabel('F_X(x) = P(X ≤ x)', fontsize=12, fontweight='bold')
ax1.set_title('Marginal CDF of X', fontsize=13, fontweight='bold')
ax1.set_xlim(-0.6, 1.6)
ax1.set_ylim(-0.1, 1.1)
ax1.grid(True, alpha=0.3)
ax1.legend(fontsize=10)

# Add annotations
ax1.text(0, P_X[0], f'  {P_X[0]:.2f}', fontsize=11, va='center', fontweight='bold')
ax1.text(1, 1.0, '  1.00', fontsize=11, va='center', fontweight='bold')


# Plot 2: CDF of Y (Marginal)
ax2 = axes[1]

y_plot = np.array([-0.5, 0, 0, 1, 1, 1.5])
F_y_plot = np.array([0, 0, P_Y[0], P_Y[0], 1.0, 1.0])

ax2.plot(y_plot, F_y_plot, 'r-', linewidth=2, label='F_Y(y)')
ax2.plot([0, 1], [P_Y[0], 1.0], 'ro', markersize=10, label='Jumps at discrete values')
ax2.plot([0, 1], [0, P_Y[0]], 'ro', markersize=10, fillstyle='none')

ax2.set_xlabel('y (Rating: 0=Low, 1=High)', fontsize=12, fontweight='bold')
ax2.set_ylabel('F_Y(y) = P(Y ≤ y)', fontsize=12, fontweight='bold')
ax2.set_title('Marginal CDF of Y', fontsize=13, fontweight='bold')
ax2.set_xlim(-0.6, 1.6)
ax2.set_ylim(-0.1, 1.1)
ax2.grid(True, alpha=0.3)
ax2.legend(fontsize=10)

# Add annotations
ax2.text(0, P_Y[0], f'  {P_Y[0]:.2f}', fontsize=11, va='center', fontweight='bold')
ax2.text(1, 1.0, '  1.00', fontsize=11, va='center', fontweight='bold')

plt.tight_layout()
plt.show()

## Conditional Probability

<table>
<thead>
  <tr>
    <th align="right">Watch_Time\ Rating</th>
    <th align="center">Low (1-3 ★)</th>
    <th align="center">High (4-5 ★)</th>    
    <th align="center">TOTAL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td align="right"><strong>Short (&lt; 30 min)</strong></td>
    <td align="center">3/20</td>
    <td align="center">1/20</td>    
    <td align="center">4/20 = 1/5</td>
  </tr>
  <tr>
    <td align="right"><strong>Long (≥ 30 min)</strong></td>
    <td align="center" >2/20 = 1/10</td>
    <td align="center" style="background-color: #e1f7b0ff;">14/20 = 7/10</td>    
    <td align="center" style="background-color: #fffd9cff;">16/20 = 4/5</td>
  </tr>
  <tr>
    <td align="right"><strong>TOTAL</strong></td>
    <td align="center">5/20 = 1/4</td>
    <td align="center">15/20 = 3/4</td>    
    <td align="center">20/20 = 1</td>
  </tr>
</tbody>
</table>

<br>

> If I know someone watched LONG, what's the probability they rated HIGH?

This is asking for: $P(Y = \text{High} \mid X = \text{Long})$

Recall: $P(Y \mid X) = \dfrac{P(X, Y)}{P(X)}$, provided $P(X) > 0$.
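
The formula can be worked through exactly with fractions, which avoids rounding. This is a minimal sketch using Python's standard `fractions` module; the function name `conditional_rating_given_watch` is a hypothetical helper introduced for this example.

```python
from fractions import Fraction

# Joint PMF of the toy example as exact fractions
joint = {
    ("Short", "Low"): Fraction(3, 20),
    ("Short", "High"): Fraction(1, 20),
    ("Long", "Low"): Fraction(2, 20),
    ("Long", "High"): Fraction(14, 20),
}


def conditional_rating_given_watch(watch):
    """Return P(Rating = r | Watch_Time = watch) for every rating r."""
    p_watch = sum(p for (w, _), p in joint.items() if w == watch)  # marginal P(X)
    if p_watch == 0:
        raise ValueError("Conditioning on an event with probability 0")
    return {r: p / p_watch for (w, r), p in joint.items() if w == watch}


cond_long = conditional_rating_given_watch("Long")
print(cond_long["High"])         # 7/8
print(float(cond_long["High"]))  # 0.875
```

Here $P(\text{High} \mid \text{Long}) = \frac{14/20}{16/20} = \frac{7}{8}$, and the conditional probabilities for a fixed watch time sum to 1.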

In [None]:
# conditional probability
p_long_high = joint_prob_table.loc['Long', 'High']
p_long = marginal_X['Long']
p_high_given_long = p_long_high / p_long

print(f"  P(Long, High) = {p_long_high:.2f}  (from joint table)")
print(f"  P(Long)       = {p_long:.2f}  (from marginal)")
print(f"  P(High | Long) = {p_long_high:.2f} / {p_long:.2f} = {p_high_given_long:.2f}")

print(f"\nInterpretation:")
print(f"  Among users who watched LONG, {p_high_given_long*100:.0f}% rated HIGH")


In [None]:
print("All Conditional Probabilities: P(Rating | Watch Time)")
print("-"*70)

cond_Y_given_X = joint_prob_table.div(marginal_X, axis=0)
print("\n", cond_Y_given_X.round(3))

print("\nHow to read this:")
print(f"  P(Low | Short)  = {cond_Y_given_X.loc['Short', 'Low']:.2f}")
print(f"    → Of those who watched SHORT, {cond_Y_given_X.loc['Short', 'Low']*100:.0f}% rated LOW")
print(f"  P(High | Long)  = {cond_Y_given_X.loc['Long', 'High']:.2f}")
print(f"    → Of those who watched LONG, {cond_Y_given_X.loc['Long', 'High']*100:.0f}% rated HIGH")

print("\nCheck: Each row sums to 1.0 (it's a probability distribution)")
print(cond_Y_given_X.sum(axis=1))

In [None]:
# Visualize conditionals
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Conditional distribution for each watch time category
for idx, watch_cat in enumerate(['Short', 'Long']):
    probs = cond_Y_given_X.loc[watch_cat]
    
    axes[idx].bar(probs.index, probs.values, color=['tomato', 'lightgreen'], 
                  alpha=0.7, edgecolor='black', linewidth=2)
    axes[idx].set_ylabel('Probability', fontsize=12)
    axes[idx].set_xlabel('Rating', fontsize=12)
    axes[idx].set_title(f'P(Rating | Watch Time = {watch_cat})\n"Given they watched {watch_cat}..."', 
                       fontsize=13, fontweight='bold')
    axes[idx].set_ylim(0, 1)
    axes[idx].grid(axis='y', alpha=0.3)
    
    for i, v in enumerate(probs.values):
        axes[idx].text(i, v + 0.03, f'{v:.2f}\n({v*100:.0f}%)', 
                      ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nKey Insight:")
print("  The conditional distributions are DIFFERENT!")
print(f"    - Short watchers: {cond_Y_given_X.loc['Short', 'High']*100:.0f}% rated high")
print(f"    - Long watchers:  {cond_Y_given_X.loc['Long', 'High']*100:.0f}% rated high")
print("  This means watch time and rating are DEPENDENT (not independent)")

<div class="alert alert-success">
<h4>Definition: Conditional Distribution</h4>

The **conditional distribution** gives the probability of one variable *given* we know the value of the other.


$$P_{X|Y}(x | y) = P(X = x | Y = y) = \frac{P_{XY}(x, y)}{P_Y(y)}$$

(if $P_Y(y) > 0$)

We call the **conditional CDF** of $X$ given $Y=y_j$ the mapping $F^{[Y=y_j]}_X$ from $\mathbb{R}$ to $[0,1]$ defined for all $x\in \mathbb{R}$ by:
$F^{[Y=y_j]}_X(x)=F_{X|Y=y_j}(x) = \mathbb{P}(X\leq x|Y=y_j) = \frac{\mathbb{P}(X\leq x,Y=y_j)}{\mathbb{P}(Y=y_j)}$

<h5>For continuous:</h5>

The **conditional PDF** of $X$ given $Y=y$, where $f_Y(y) > 0$:
$f_{X|Y}(x|y) = \frac{f_{XY}(x,y)}{f_Y(y)}$

The conditional probability of $X\in A$ given $Y=y$:
$\mathbb{P}(X\in A|Y=y) = \int_A f_{X|Y}(x|y)dx$

The conditional CDF can be expressed as:

$F_{X|Y}(x|y) = \mathbb{P}(X\leq x| Y = y) = \int\limits_{-\infty}^{x}f_{X|Y}(u|y)du$

*Remark:* it is possible to define more generally the conditional distribution of $X$ given any event $A$:
$\forall x_i\in \mathbb{R}_X, \ \mathbb{P}_{X|A}(x_i) = \mathbb{P}(X=x_i|A) = \frac{\mathbb{P}(X=x_i \text{ AND } A)}{\mathbb{P}(A)}$

The CDF of $X$ given $A$ is therefore given by:
$F_{X|A}(x) = \mathbb{P}(X\leq x|A)$

</div>

<div class="alert alert-success" style='background-color:white'>


Let $X$ be a continuous r.v. and let $A$ be the event $a < X < b$ (where $a = -\infty$ or $b = +\infty$ is allowed), then:

$F_{X|A}(x) = \left\{\begin{array}{ll} 1 & x \geq b \\ \frac{F_X(x) - F_X(a)}{F_X(b) - F_X(a)} & a \leq x < b \\ 0 & x < a \end{array}\right.$

and

$f_{X|A}(x) = \left\{\begin{array}{ll} \frac{f_X(x)}{\mathbb{P}(A)} & a \leq x < b \\ 0 & \text{otherwise} \end{array}\right.$

</div>
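
The piecewise formula for $F_{X|A}$ can be checked numerically. The sketch below is only an illustration: it assumes $X \sim \mathcal{N}(0,1)$ and the bounds $a=-1$, $b=2$ (both choices are ours, not part of the lesson's data) and compares the formula with SciPy's truncated normal distribution.

```python
import numpy as np
from scipy import stats

# Assumption for illustration: X ~ N(0, 1) and A = {a < X < b} with a = -1, b = 2
a, b = -1.0, 2.0
X = stats.norm(loc=0, scale=1)

def cond_cdf(x):
    """F_{X|A}(x) computed from the piecewise formula."""
    x = np.asarray(x, dtype=float)
    inside = (X.cdf(x) - X.cdf(a)) / (X.cdf(b) - X.cdf(a))
    return np.where(x < a, 0.0, np.where(x >= b, 1.0, inside))

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
formula_vals = cond_cdf(xs)

# Cross-check: for a standard normal, truncnorm takes the bounds directly
truncnorm_vals = stats.truncnorm(a, b).cdf(xs)

print(np.round(formula_vals, 4))
print(np.round(truncnorm_vals, 4))
```

Both lines should print the same values, which suggests that conditioning on $a < X < b$ is the same operation as truncating the distribution to that interval.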

In [None]:
# continuous case
get_all_pdf_cont(show_marginal=False, show_cond=True, py=False)

In [None]:
# PDF, marginal and conditional density functions
get_all_pdf_cont()

## Independence

**Independence Test:**

If $X$ and $Y$ are independent, then: $P(Y | X) = P(Y)$. In other words: knowing $X$ doesn't change our belief about $Y$.

Let's check:

In [None]:
# Compare conditional with marginal
print(f"\nP(High) = {marginal_Y['High']:.2f}  (marginal - unconditional)")
print(f"P(High | Short) = {cond_Y_given_X.loc['Short', 'High']:.2f}")
print(f"P(High | Long)  = {cond_Y_given_X.loc['Long', 'High']:.2f}")

print("\nComparison:")
if abs(marginal_Y['High'] - cond_Y_given_X.loc['Short', 'High']) < 0.01:
    print("  P(High) ≈ P(High | Short) → INDEPENDENT for Short")
else:
    print(f"  P(High) = {marginal_Y['High']:.2f} ≠ P(High | Short) = {cond_Y_given_X.loc['Short', 'High']:.2f} → DEPENDENT")

if abs(marginal_Y['High'] - cond_Y_given_X.loc['Long', 'High']) < 0.01:
    print("  P(High) ≈ P(High | Long) → INDEPENDENT for Long")
else:
    print(f"  P(High) = {marginal_Y['High']:.2f} ≠ P(High | Long) = {cond_Y_given_X.loc['Long', 'High']:.2f} → DEPENDENT")
    

print("CONCLUSION: Watch Time and Rating are DEPENDENT!")
print("Knowing how long someone watched DOES tell us something about their rating.")


<div class="alert alert-success">
<h4>Definition: Independent Variables</h4>

Two r.v. $X$ and $Y$ are said to be **independent** if
$\forall x \in \mathbb{R}_X, \forall y \in \mathbb{R}_Y, \ \mathbb{P}(X\leq x, Y\leq y) = \mathbb{P}(X\leq x)\times \mathbb{P}(Y\leq y)$

This is equivalent to the following condition:
$\forall x \in \mathbb{R}_X, \forall y \in \mathbb{R}_Y, \ F_{XY}(x, y) = F_X(x)\times F_Y(y)$

<h5>Continuous case:</h5>

$X$ and $Y$ are independent if and only if:
$$\forall s \in \mathbb{R}_X, \forall t\in \mathbb{R}_Y, \ f_{XY}(s,t) = f_X(s)\times f_Y(t)$$

</div>

<div class="alert alert-success" style='background-color:white'>


Let $X$ and $Y$ be two discrete r.v. $X$ and $Y$ are independent if and only if:
$\forall x \in \mathbb{R}_X, \forall y\in \mathbb{R}_Y, \ \mathbb{P}_{XY}(x,y) = \mathbb{P}(X=x, Y=y) = \mathbb{P}_X(x)\times \mathbb{P}_Y(y)$

</div>

<div class="alert alert-success" style='background-color:white'>


Let $X$ and $Y$ be two independent discrete r.v. Then:
$\mathbb{P}_{X|Y}(x_i|y_j) = \mathbb{P}(X=x_i|Y=y_j) = \frac{\mathbb{P}_{XY}(x_i,y_j)}{\mathbb{P}_Y(y_j)} = \frac{\mathbb{P}_X(x_i)\times \mathbb{P}_Y(y_j)}{\mathbb{P}_Y(y_j)} = \mathbb{P}_X(x_i)$

</div>
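
The factorization criterion gives a direct test on a joint table: compare each cell with the product of its marginals. The sketch below re-enters the probabilities from the Watch Time / Rating table so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Joint probabilities copied from the Watch Time / Rating table
joint = pd.DataFrame(
    [[0.15, 0.05],
     [0.10, 0.70]],
    index=["Short", "Long"], columns=["Low", "High"],
)

p_x = joint.sum(axis=1)   # marginal of watch time
p_y = joint.sum(axis=0)   # marginal of rating

# Under independence the joint table would equal the outer product of the marginals
expected = pd.DataFrame(np.outer(p_x, p_y), index=joint.index, columns=joint.columns)

max_gap = (joint - expected).abs().to_numpy().max()
is_independent = bool(np.isclose(max_gap, 0.0))

print(expected.round(3))
print(f"Largest |P(x, y) - P(x)P(y)| = {max_gap:.3f} -> independent: {is_independent}")
```

Every cell differs from the product of its marginals, so the table does not factorize and the two variables are dependent.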

<div class="alert alert-primary">
<h4>🤖 ML Application Spotlight: Decision Trees</h4>

<p>At each split in a decision tree, we compute <strong>conditional probabilities</strong>:</p>
<ul>
    <li>P(Class = Positive | Feature > threshold)</li>
    <li>P(Class = Positive | Feature ≤ threshold)</li>
</ul>

<p>The split whose conditional class distributions are purest, i.e. the one with the largest <strong>information gain</strong> (reduction in entropy or Gini impurity), is chosen.</p>

<p>Understanding conditional distributions is fundamental to tree-based models!</p>
</div>
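
As a rough illustration of how a tree would score a split, the sketch below computes the entropy-based information gain of splitting on watch time, using the toy joint table. Treating Short/Long as the two branches and Low/High as the classes is our assumption for the example. Real tree libraries work with counts and may use other impurity measures.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy joint table: rows = Short/Long watch time, columns = Low/High rating
joint = np.array([[0.15, 0.05],
                  [0.10, 0.70]])

p_branch = joint.sum(axis=1)        # P(Short), P(Long)
p_class = joint.sum(axis=0)         # P(Low), P(High)
cond = joint / p_branch[:, None]    # P(Rating | Watch Time), one row per branch

h_parent = entropy(p_class)
h_children = sum(w * entropy(row) for w, row in zip(p_branch, cond))
info_gain = h_parent - h_children

print(f"H(Rating)              = {h_parent:.4f} bits")
print(f"H(Rating | Watch Time) = {h_children:.4f} bits")
print(f"Information gain       = {info_gain:.4f} bits")
```

The gain is positive because the conditional distributions of the rating differ between the two branches. For independent variables it would be zero.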

## Expectation and Variance

<div class="alert alert-success">
<h4>Definition: Expectation of a pair of r.v.</h4>

We call the **expectation** of the pair of r.v. $(X,Y)$, denoted $\mathbb{E}(X,Y)$, the element of $\mathbb{R}^2$ defined as follows:

$\mathbb{E}(X,Y) = (\mathbb{E}(X), \mathbb{E}(Y))$

</div>

Let's calculate expected values for our toy examples.

$$E[X] = \sum_{i=1}^2 x_i\times P(X=x_i) = 0\times 0.2 + 1\times 0.8 = 0.8$$

In [None]:
E_X_terms = []
for x in X_values:
    term = x * P_X[x]
    E_X_terms.append(term)
    print(f"  x={x}: {x} × {P_X[x]:.2f} = {term:.2f}")

E_X = sum(E_X_terms)

print(f"\nE[X] = {' + '.join([f'{t:.2f}' for t in E_X_terms])} = {E_X:.2f}")
print(f"Average watch time category = {E_X:.2f}")

# verify with Python
# Method 1: Direct calculation from data
E_X_python = df_toy['X'].mean()
print(f"\nUsing pandas .mean():")
print(f"  E[X] = {E_X_python:.2f} ✓")
# Method 2: Using probability table
E_X_prob = sum(x * P_X[x] for x in X_values)
print(f"Using probability formula:")
print(f"  E[X] = {E_X_prob:.2f} ✓")

$$E[Y] = \sum_{i=1}^2 y_i\times P(Y=y_i) = 0\times 0.25 + 1\times 0.75 = 0.75$$

In [None]:
E_Y_terms = []
for y in Y_values:
    term = y * P_Y[y]
    E_Y_terms.append(term)
    print(f"  y={y}: {y} × {P_Y[y]:.2f} = {term:.2f}")

E_Y = sum(E_Y_terms)

print(f"\nE[Y] = {' + '.join([f'{t:.2f}' for t in E_Y_terms])} = {E_Y:.2f}")
print(f"Average rating category = {E_Y:.2f}")

# verify with Python
# Method 1: Direct calculation from data
E_Y_python = df_toy['Y'].mean()
print(f"\nUsing pandas .mean():")
print(f"  E[Y] = {E_Y_python:.2f} ✓")
# Method 2: Using probability table
E_Y_prob = sum(y * P_Y[y] for y in Y_values)
print(f"Using probability formula:")
print(f"  E[Y] = {E_Y_prob:.2f} ✓")

Let's now calculate the variances $Var(X)$ and $Var(Y)$. Recall that $Var(X) = E[(X - E[X])^2] = \sum_i (x_i - E[X])^2 \cdot P(X = x_i) = E[X^2] - (E[X])^2$.

In [None]:
# Method 1: Var(X) = E[(X - μ_X)²] = Σ (x - μ_X)² · P(X = x)
Var_X_terms = []
for x in X_values:
    deviation = x - E_X
    squared_dev = deviation**2
    term = squared_dev * P_X[x]
    Var_X_terms.append(term)
    print(f"  x={x}: ({x} - {E_X:.2f})² × {P_X[x]:.2f} = {squared_dev:.4f} × {P_X[x]:.2f} = {term:.4f}")

Var_X = sum(Var_X_terms)

print(f"\nVar(X) = {' + '.join([f'{t:.4f}' for t in Var_X_terms])} = {Var_X:.4f}")

# Method 2: Using E[X²] - (E[X])²
print("\n")
E_X2_terms = []
for x in X_values:
    term = (x**2) * P_X[x]
    E_X2_terms.append(term)
    print(f"  x={x}: {x}² × {P_X[x]:.2f} = {x**2} × {P_X[x]:.2f} = {term:.2f}")

E_X2 = sum(E_X2_terms)
Var_X_alt = E_X2 - (E_X**2)
print(f"\nVar(X) = E[X²] - (E[X])²")
print(f"       = {E_X2:.2f} - ({E_X:.2f})²")
print(f"       = {E_X2:.2f} - {E_X**2:.2f}")
print(f"       = {Var_X_alt:.4f}")

# standard deviation
std_X = np.sqrt(Var_X)
print(f"\nStandard Deviation: σ_X = √{Var_X:.4f} = {std_X:.4f}")

In [None]:
# Variance of Y
Var_Y_terms = []
for y in Y_values:
    deviation = y - E_Y
    squared_dev = deviation**2
    term = squared_dev * P_Y[y]
    Var_Y_terms.append(term)
    print(f"  y={y}: ({y} - {E_Y:.2f})² × {P_Y[y]:.2f} = {squared_dev:.4f} × {P_Y[y]:.2f} = {term:.4f}")

Var_Y = sum(Var_Y_terms)

print(f"\nVar(Y) = {' + '.join([f'{t:.4f}' for t in Var_Y_terms])} = {Var_Y:.4f}")

# standard deviation
std_Y = np.sqrt(Var_Y)
print(f"\nStandard Deviation: σ_Y = √{Var_Y:.4f} = {std_Y:.4f}")

In [None]:
# verify with Python
# Using pandas
Var_X_python = df_toy['X'].var(ddof=0)  # ddof=0 for population variance
Var_Y_python = df_toy['Y'].var(ddof=0)

print(f"\nUsing pandas .var(ddof=0):")
print(f"  Var(X) = {Var_X_python:.4f} ✓")
print(f"  Var(Y) = {Var_Y_python:.4f} ✓")

<div class="alert alert-success">
<h4>Definition: Expectation of a Function of two r.v.</h4>

The expectation can be calculated as follows.

<h5>Discrete case</h5>

Let $h : \mathbb{R}^2 \to \mathbb{R}$ be a bounded and piecewise continuous function. The expectation of the r.v. $Z = h(X,Y)$ is given by:

$$\mathbb{E}[Z] = \sum_{(i,j)\in \mathbb{R}_{XY}}h(i,j)\times \mathbb{P}(X = i, Y = j)$$

Note that $\mathbb{R}_{XY}$ is a finite or countable subset of $\mathbb{N}^2$ in which the values of $(X,Y)$ are found.

<h5>Continuous case</h5>

Let $(X,Y)$ have density function $f_{XY}(x,y)$. Let $h : \mathbb{R}^2 \to \mathbb{R}$ be a bounded and piecewise continuous function. The expectation of the r.v. $Z = h(X,Y)$ is given by:

$$\mathbb{E}[Z] = \int\limits_{\mathbb{R}^2} h(u,v)\times f_{XY}(u,v)dudv$$

when this integral exists.

The expectation of a function of two variables has the following properties:

* *linearity*: $\mathbb{E}[aX+bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$

</div>

## Conditional Expectation

We can also compute the *conditional expectation* of $X$ given that an event $A$ has occurred, or given that the random variable $Y$ takes the value $y_j$. It is defined like the ordinary expectation, except that the mass function or density is replaced by its conditional analogue.

<div class="alert alert-success">
<h4>Definition: Conditional Expectation</h4>

Let $(X,Y)$ be a pair of discrete r.v. Let $A$ be an event.

The **conditional expectation** of $X$:

1. given $A$ is defined by:
$\mathbb{E}[X|A] = \sum_{x_i\in \mathbb{R}_X} x_i\mathbb{P}_{X|A}(x_i)$

2. given $Y=y_j$ is defined by:
$\mathbb{E}[X|Y=y_j] = \sum_{x_i\in \mathbb{R}_X} x_i\mathbb{P}_{X|Y}(x_i|y_j)$

In the continuous case, the *conditional expectation* of $X$ given $Y=y$ is given by:

$\mathbb{E}[X|Y=y] = \int_{-\infty}^{+\infty}xf_{X|Y}(x|y)dx$

</div>

Note that the value of the conditional expectation changes as a function of the value $Y=y$.


<div class="alert .alert-exercise">
<h4>Calculated Example</h4>

The joint PDF is given by:
$$f_{XY}(x,y) = \left\{\begin{array}{l} 6xy \text{, if } 0 \leq x \leq 1\text{, } 0 \leq y \leq \sqrt{x} \\ 0 \text{, otherwise} \end{array}\right.$$

The conditional density is:
$$f_{X|Y}(x|y) = \left\{\begin{array}{ll} \frac{2x}{1 - y^4} & y^2 \leq x \leq 1 \\ 0 & \text{otherwise} \end{array}\right.$$

What is the expectation $\mathbb{E}[X|Y=y]$ for $0 < y < 1$?
</div>

<details>
<summary>Reveal solution</summary>

Note that $y \leq \sqrt{x}$ is equivalent to $y^2 \leq x$ for $x \geq 0$. That is, in our case we can consider $y^2 \leq x \leq 1$.

According to the formula, the conditional expectation $\mathbb{E}[X|Y=y]$ for $0 < y < 1$ can be calculated as follows:

$$\mathbb{E}[X|Y=y] = \int_{-\infty}^{+\infty}xf_{X|Y}(x|y)dx = \int_{y^2}^{1}x \frac{2x}{1 - y^4} dx = \int_{y^2}^{1} \frac{2x^2}{1 - y^4} dx =$$

$$= \frac{2}{1-y^4}\frac{x^3}{3}\Bigg\rvert_{y^2}^{1} = \frac{2}{3(1-y^4)}(1 - y^6) = \frac{2(1 - y^6)}{3(1-y^4)}$$
</details>
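
A closed-form result like this one is easy to sanity-check by numerical integration. The sketch below uses `scipy.integrate.quad` on the conditional density $\frac{2x}{1-y^4}$ for a few values of $y$ picked arbitrarily for illustration. It checks that the density integrates to 1 and that the numerical mean matches the formula.

```python
from scipy import integrate

def cond_pdf(x, y):
    """Conditional density f_{X|Y}(x|y) = 2x / (1 - y^4) on y^2 <= x <= 1."""
    return 2.0 * x / (1.0 - y**4)

def cond_mean_closed_form(y):
    """Closed-form E[X | Y = y] from the worked solution."""
    return 2.0 * (1.0 - y**6) / (3.0 * (1.0 - y**4))

results = {}
for y in (0.2, 0.5, 0.8):   # arbitrary test points in (0, 1)
    mass, _ = integrate.quad(cond_pdf, y**2, 1.0, args=(y,))
    numeric, _ = integrate.quad(lambda x, y=y: x * cond_pdf(x, y), y**2, 1.0)
    results[y] = (mass, numeric, cond_mean_closed_form(y))
    print(f"y={y}: mass={mass:.4f}, numeric mean={numeric:.6f}, "
          f"closed form={results[y][2]:.6f}")
```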

<div class="alert alert-success">
<h4>Definition: Law of Total Expectation</h4>

**Law of Total Expectation** (or *Law of Iterated Expectation*):

<h5>Discrete case</h5>

Let $(X,Y)$ be a pair of discrete r.v. The total expectation of $X$ can be calculated as follows:

$$\mathbb{E}X = \sum_{y_j\in \mathbb{R}_Y}\mathbb{E}[X|Y=y_j]\mathbb{P}_Y(y_j)$$

<h5>Continuous case</h5>

Let $(X,Y)$ be a pair of jointly continuous r.v. The total expectation of $X$ can be calculated as follows:

$$\mathbb{E}X = \int\limits_{-\infty}^{\infty} \mathbb{E}[X|Y=y]f_Y(y)dy = \mathbb{E}\left[\mathbb{E}[X|Y]\right]$$

</div>
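
For the toy example, the discrete form of the law can be verified directly. The sketch below re-enters the joint table with the 0/1 coding used in this lesson (X: 0 = Short, 1 = Long; Y: 0 = Low, 1 = High); the variable names are illustrative.

```python
import pandas as pd

# Toy joint table: rows are x in {0, 1}, columns are y in {0, 1}
joint = pd.DataFrame([[0.15, 0.05],
                      [0.10, 0.70]], index=[0, 1], columns=[0, 1])

p_y = joint.sum(axis=0)                   # P(Y = y)
cond_x_given_y = joint.div(p_y, axis=1)   # column y holds P(X = x | Y = y)

x_values = joint.index.to_numpy()
# E[X | Y = y] = sum_x x * P(X = x | Y = y), one value per column y
cond_mean = cond_x_given_y.mul(x_values, axis=0).sum(axis=0)

total = (cond_mean * p_y).sum()                  # sum_y E[X | Y = y] * P(Y = y)
direct = (joint.sum(axis=1) * x_values).sum()    # E[X] from the marginal of X

print(cond_mean.round(4).to_dict())
print(f"Via conditioning: {total:.4f}, direct: {direct:.4f}")
```

Both routes give the same value of $\mathbb{E}X$. Conditioning only reorganizes the sum.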

## Conditional Variance

<div class="alert alert-success">
<h4>Conditional Variance</h4>

<h5>Discrete case</h5>

Let $(X,Y)$ be a pair of discrete r.v. Let $\mu_{X|Y}(y) = \mathbb{E}[X|Y=y]$. The **conditional variance** of $X$ given $Y=y$, denoted $Var(X|Y=y)$, is defined by:

$$Var(X|Y=y) = \mathbb{E}\left[\left(X-\mu_{X|Y}(y)\right)^2|Y=y\right] = \sum_{x_i\in \mathbb{R}_X}\left(x_i-\mu_{X|Y}(y)\right)^2\mathbb{P}_{X|Y}(x_i|y) = \mathbb{E}[X^2|Y=y] - \mu_{X|Y}(y)^2$$

</div>

Note that as in the case of conditional expectation, conditional variance is a function of the r.v. $Y$ because its value depends on the value of $Y=y$.

<div class="alert alert-success">
<h4>Definition: Law of Total Variance</h4>

**Law of Total Variance**:

Let $(X,Y)$ be a pair of discrete r.v. The total variance of $X$ can be calculated as follows:

$$Var(X) = \mathbb{E}[Var(X|Y)] + Var\left(\mathbb{E}[X|Y]\right)$$

</div>
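
The decomposition can be checked on the toy example. The sketch below recomputes both terms from the joint table (re-entered here with the 0/1 coding so that the block is self-contained) and compares their sum with $Var(X)$.

```python
import numpy as np

joint = np.array([[0.15, 0.05],    # rows: x = 0, 1
                  [0.10, 0.70]])   # cols: y = 0, 1
x = np.array([0.0, 1.0])

p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)
cond = joint / p_y                 # cond[i, j] = P(X = x_i | Y = y_j)

cond_mean = x @ cond                       # E[X | Y = y_j]
cond_var = (x**2) @ cond - cond_mean**2    # Var(X | Y = y_j)

mean_x = (x * p_x).sum()
expected_cond_var = (cond_var * p_y).sum()                  # E[Var(X|Y)]
var_of_cond_mean = ((cond_mean - mean_x)**2 * p_y).sum()    # Var(E[X|Y])
var_x = ((x - mean_x)**2 * p_x).sum()

print(f"E[Var(X|Y)] = {expected_cond_var:.4f}")
print(f"Var(E[X|Y]) = {var_of_cond_mean:.4f}")
print(f"Sum         = {expected_cond_var + var_of_cond_mean:.4f}")
print(f"Var(X)      = {var_x:.4f}")
```

The first term is the variability left within each rating group. The second is the variability of the group means themselves.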


## Covariance

How do the values of two random variables $X$ and $Y$ vary together? Covariance gives a first, quantitative answer.

In [None]:
# demo
covariance_demo()

<div class="alert alert-success">
<h4>Definition: Covariance</h4>

**Covariance** measures how two variables change together.


$$Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$


**Interpretation:**

- $Cov(X, Y) > 0$: When $X$ increases, $Y$ tends to increase
- $Cov(X, Y) < 0$: When $X$ increases, $Y$ tends to decrease
- $Cov(X, Y) = 0$: No linear relationship

**Properties:**

* $Cov(X,X) = \mathbb{E}[(X-\mathbb{E}X)(X-\mathbb{E}X)] = \mathbb{E}[XX] - \mathbb{E}X\times\mathbb{E}X = \mathbb{E}[X^2] - (\mathbb{E}X)^2 = Var(X)$

Note that $Cov(X,X)\geq 0$. If $Cov(X,X) = 0$, then $X = const$ almost surely.

* $Cov(X,Y) = Cov(Y,X)$
* $Cov(aX_1 + bY_1, X_2) = a\ Cov(X_1,X_2) + b\ Cov(Y_1,X_2),\ \forall a,b \in \mathbb{R}$
* $Cov(X_1, aX_2 + bY_2) = a\ Cov(X_1,X_2) + b\ Cov(X_1,Y_2),\ \forall a,b \in \mathbb{R}$
* $Cov(X,a) = 0$
* $Cov(aX, bY) = ab\ Cov(X,Y)$
* $Cov(X + c, Y) = Cov(X,Y)$

</div>
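
These properties also hold exactly for the empirical (divide-by-$n$) covariance, so they can be checked on simulated data. The sketch below uses arbitrary random samples and constants chosen for illustration. The helper `cov` is our own, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, y1, x2 = rng.normal(size=(3, 1000))   # three arbitrary samples
a, b, c = 2.0, -3.0, 5.0                  # arbitrary constants

def cov(u, v):
    """Population covariance (divide by n), matching ddof=0 used in this lesson."""
    return float(np.mean((u - u.mean()) * (v - v.mean())))

checks = {
    "Cov(X, X) = Var(X)": np.isclose(cov(x1, x1), x1.var()),
    "symmetry": np.isclose(cov(x1, x2), cov(x2, x1)),
    "linearity in first argument": np.isclose(
        cov(a * x1 + b * y1, x2), a * cov(x1, x2) + b * cov(y1, x2)),
    "Cov(X, constant) = 0": np.isclose(cov(x1, np.full_like(x1, c)), 0.0),
    "Cov(aX, bY) = ab Cov(X, Y)": np.isclose(cov(a * x1, b * x2), a * b * cov(x1, x2)),
    "shift invariance": np.isclose(cov(x1 + c, x2), cov(x1, x2)),
}

for name, ok in checks.items():
    print(f"{name:<30} {'OK' if ok else 'FAILED'}")
```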

The covariance between $X$ and $Y$ describes how one r.v. behaves relative to the other: it captures their joint deviations from their respective expectations. Its magnitude, however, depends on the units of $X$ and $Y$, which makes it hard to interpret on its own.

Let's calculate the covariance for our toy example: 

1. Method 1: $Cov(X,Y) = E[(X - EX)(Y - EY)]$.

We know: $EX = 0.8$ and $EY = 0.75$.

<table>
<thead>
  <tr>
    <th align="right">Watch_Time\ Rating</th>
    <th align="center">Low (1-3 ★)</br><center>0</center></th>
    <th align="center">High (4-5 ★)</br><center>1</center></th>    
    <th align="center"><center>TOTAL</center></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td align="right"><strong>Short (&lt; 30 min)</strong></br><center>0</center></td>
    <td align="center">3/20 = 0.15</td>
    <td align="center">1/20 = 0.05</td>    
    <td align="center">4/20 = 1/5 = 0.2</td>
  </tr>
  <tr>
    <td align="right"><strong>Long (≥ 30 min)</strong></br><center>1</center></td>
    <td align="center" >1/10 = 0.1</td>
    <td align="center">7/10 = 0.7</td>    
    <td align="center">4/5 = 0.8</td>
  </tr>
  <tr>
    <td align="right"><strong>TOTAL</strong></td>
    <td align="center">1/4 = 0.25</td>
    <td align="center">3/4 = 0.75</td>    
    <td align="center">1</td>
  </tr>
</tbody>
</table>


| $X$ | $Y$ | $X - EX$| $Y - EY$ | $(X - EX)(Y - EY)$ | $P(X=x, Y=y)$| $(X - EX)(Y - EY)\times P(X=x, Y=y)$|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| $0$ | $0$ | $0 - 0.8 = -0.8$ | $0 - 0.75 = -0.75$ | $-0.8\times (-0.75) = 0.6$| $0.15$ | $0.6\times 0.15 = 0.09$ |
| $0$ | $1$ | $0 - 0.8 = -0.8$ | $1 - 0.75 = 0.25$ | $-0.8\times 0.25 = -0.2$| $0.05$ | $-0.2 \times 0.05 = -0.01$ |
| $1$ | $0$ | $1 - 0.8 = 0.2$ | $0 - 0.75 = -0.75$ | $0.2\times (-0.75) = -0.15$| $0.1$ | $-0.15\times 0.1 = -0.015$ |
| $1$ | $1$ | $1 - 0.8 = 0.2$ | $1 - 0.75 = 0.25$ | $0.2\times 0.25 = 0.05$| $0.7$ | $0.05\times 0.7 = 0.035$ |

$Cov(X, Y) = 0.09 - 0.01 - 0.015 + 0.035 = 0.1$

In [None]:
Cov_XY_terms = []
for x in X_values:
    for y in Y_values:
        prob = P_XY[(x, y)]
        dev_x = x - E_X
        dev_y = y - E_Y
        product = dev_x * dev_y
        term = product * prob
        Cov_XY_terms.append(term)
        
        print(f"{x:<5} {y:<5} {prob:<10.2f} {dev_x:<12.4f} {dev_y:<12.4f} {product:<12.4f} {term:<12.4f}")

Cov_XY = sum(Cov_XY_terms)
print(f"\nCov(X,Y) = {' + '.join([f'{t:.4f}' for t in Cov_XY_terms])}")
print(f"         = {Cov_XY:.4f}")

if Cov_XY > 0:
    print(f"\n✓ Positive covariance ({Cov_XY:.4f} > 0)")
    print("  → X and Y tend to move together")
    print("  → When X increases, Y tends to increase")
    print("\nIn our context:")
    print("  → Users who watch Long tend to rate High")
    print("  → Users who watch Short tend to rate Low")
elif Cov_XY < 0:
    print(f"\n✓ Negative covariance ({Cov_XY:.4f} < 0)")
    print("  → X and Y tend to move in opposite directions")
else:
    print(f"\n✓ Zero covariance (Cov = 0)")
    print("  → No linear relationship")

In [None]:
# verify with python
# Using numpy
Cov_matrix = np.cov(df_toy['X'], df_toy['Y'], ddof=0)
Cov_XY_numpy = Cov_matrix[0, 1]
print(f"numpy Cov(X, Y) = {Cov_XY_numpy:.4f} ✓")

# Using pandas
Cov_XY_pandas = df_toy[['X', 'Y']].cov(ddof=0).iloc[0, 1]
print(f"pandas Cov(X, Y) = {Cov_XY_pandas:.4f} ✓")

2. Method 2: $Cov(X,Y) = E[XY] - E[X]·E[Y]$

We know: $EX = 0.8$ and $EY = 0.75$. So $E[X]·E[Y] = 0.8\cdot 0.75 = 0.6$

| $X$ | $Y$ | $XY$| $P(X=x, Y=y)$| $XY \times P(X=x, Y=y)$|
|:--:|:--:|:--:|:--:|:--:|
| $0$ | $0$ | $0 \times 0 = 0$ | $0.15$ | $0\times 0.15 = 0$ |
| $0$ | $1$ | $0 \times 1 = 0$ | $0.05$ | $0 \times 0.05 = 0$ |
| $1$ | $0$ | $1 \times 0 = 0$ | $0.1$ | $0\times 0.1 = 0$ |
| $1$ | $1$ | $1 \times 1 = 1$ | $0.7$ | $1\times 0.7 = 0.7$ |

$E[XY] = 0 + 0 + 0 + 0.7 = 0.7$

$Cov(X,Y) = E[XY] - E[X]·E[Y] = 0.7 - 0.6 = 0.1$

In [None]:
E_XY_terms = []
for x in X_values:
    for y in Y_values:
        prob = P_XY[(x, y)]
        xy_product = x * y
        term = xy_product * prob
        E_XY_terms.append(term)
        print(f"{x:<5} {y:<5} {xy_product:<8} {prob:<10.2f} {term:<25.4f}")

E_XY = sum(E_XY_terms)
print(f"\nE[XY] = {' + '.join([f'{t:.4f}' for t in E_XY_terms])} = {E_XY:.4f}")

Cov_XY_alt = E_XY - E_X * E_Y
print(f"\nCov(X,Y) = E[XY] - E[X]·E[Y]")
print(f"         = {E_XY:.4f} - {E_X:.2f} × {E_Y:.2f}")
print(f"         = {E_XY:.4f} - {E_X * E_Y:.4f}")
print(f"         = {Cov_XY_alt:.4f}")



<div class="alert .alert-exercise">
<h4>Calculated Example</h4>

Let's take an example proposed by [@pishro-nik_introduction_2014].

Let $X$ be a continuous r.v. with uniform distribution on $[1,2]$, i.e., $X \sim \mathcal{U}([1,2])$. Let $Y$ be a r.v. which, conditionally on $X = x$, follows the exponential distribution with parameter $\lambda = x$, i.e., $Y \mid X = x \sim \mathcal{E}(x)$. Find the covariance $Cov(X,Y)$.

</div>

<details>
<summary>Reveal Solution</summary>
As a reminder:

| Distribution | PDF $f_X(x)$ on its support | $\mathbb{E}X$ | $Var(X)$ |
|------|---------|--------|------|
|uniform distribution, $\mathcal{U}([a,b]),\ a < b$ | $\frac{1}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
|exponential distribution, $\mathcal{E}(\lambda),\ \lambda > 0$| $\lambda e^{-\lambda x}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ |

To find the covariance, we can use the formula:

$Cov(X,Y) = \mathbb{E}[XY] - \mathbb{E}X\mathbb{E}Y$

Since $X \sim \mathcal{U}([1,2])$, its expectation is given by:

$\mathbb{E}X = \frac{a+b}{2} = \frac{1+2}{2} = \frac{3}{2}$

Regarding the r.v. $Y$, we know its conditional distribution. So, to find the expectation of $Y$, we can use the law of total expectation, according to which:

$\mathbb{E}Y = \int\limits_{-\infty}^{\infty} \mathbb{E}[Y|X = x]f_X(x)dx = \mathbb{E}[\mathbb{E}[Y|X]]$

Since $Y$ given $X = x$ follows the exponential distribution with parameter $\lambda = x$, in other words $Y|X \sim \mathcal{E}(X)$, we have:

$\mathbb{E}[Y|X] = \frac{1}{\lambda} = \frac{1}{X}$

Now, we just need to find:

$\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}\left[\frac{1}{X}\right]$

This is therefore the expectation of a function of $X$. As a reminder:

$\mathbb{E}[h(X)] = \int_{\mathbb{R}}h(x)f_X(x)dx$

Since $X\sim \mathcal{U}([1,2])$, the density $f_X$ is nonzero only on $[1,2]$, where it equals $\frac{1}{2-1} = 1$. Thus:

$\mathbb{E}\left[\frac{1}{X}\right] = \int\limits_{1}^{2}\frac{1}{x}\times \frac{1}{2-1} dx = \int\limits_{1}^{2}\frac{1}{x} dx = \ln x\Bigg\rvert_{1}^{2} = \ln 2 - \ln 1 = \ln2$

Now, let's find $\mathbb{E}[XY]$.

Based on the law of total expectation, we can rewrite the expression for $\mathbb{E}[XY]$ as follows:

$\mathbb{E}[XY] = \mathbb{E}\left[\mathbb{E}[XY|X]\right]$

Note that $\mathbb{E}[X|X=x] = x$: once we condition on $X$, the factor $X$ is known and can be taken out of the inner expectation. Then:

$\mathbb{E}[XY] = \mathbb{E}\left[\mathbb{E}[XY|X]\right] = \mathbb{E}\left[X\mathbb{E}[Y|X]\right]$

We know that $\mathbb{E}[Y|X] = \frac{1}{X}$. Therefore:

$\mathbb{E}[XY] = \mathbb{E}\left[\mathbb{E}[XY|X]\right] = \mathbb{E}\left[X\mathbb{E}[Y|X]\right] = \mathbb{E}\left[X\frac{1}{X}\right] = \mathbb{E}[1] = 1$

Now let's put everything together:

$Cov(X,Y) = \mathbb{E}[XY] - \mathbb{E}X\mathbb{E}Y = 1 - \frac{3}{2}\ln2 \approx -0.04$

</details>
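
A Monte Carlo simulation offers an independent check of this result. The sketch below draws $X$ uniformly on $[1,2]$ and then $Y$ from an exponential distribution with rate $X$. The sample size and seed are arbitrary choices. Note that NumPy parameterizes the exponential by its scale $1/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed
n = 500_000                       # arbitrary sample size

x = rng.uniform(1.0, 2.0, size=n)     # X ~ U([1, 2])
y = rng.exponential(scale=1.0 / x)    # Y | X = x ~ Exp(rate = x), i.e. scale = 1 / x

cov_mc = np.mean(x * y) - x.mean() * y.mean()
cov_exact = 1.0 - 1.5 * np.log(2.0)

print(f"E[Y]     simulated: {y.mean():.4f}   exact ln 2: {np.log(2.0):.4f}")
print(f"E[XY]    simulated: {np.mean(x * y):.4f}   exact: 1.0000")
print(f"Cov(X,Y) simulated: {cov_mc:.4f}   exact: {cov_exact:.4f}")
```

The covariance is slightly negative, as expected: a larger $X$ means a larger rate and therefore, on average, a smaller $Y$.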

<div class="alert alert-success" style='background-color:white'>
<h4>Link with Independence</h4>

Let $X$ and $Y$ be two independent r.v. Then, we have:

1. $\mathbb{E}[XY] = \mathbb{E}X\times \mathbb{E}Y$ and $\mathbb{E}[g(X)h(Y)] = \mathbb{E}[g(X)]\times \mathbb{E}[h(Y)]$
2. $\mathbb{E}[X|Y] = \mathbb{E}[X]$ and $\mathbb{E}[g(X)|Y] = \mathbb{E}[g(X)]$
3. $Var(X + Y) = Var(X) + Var(Y)$
4. $Cov(X,Y) = 0$ because $\mathbb{E}[XY] = \mathbb{E}X\times \mathbb{E}Y$


Let $X$ and $Y$ be two r.v., and $h$ and $g$ two functions from $\mathbb{R}$ to $\mathbb{R}$.

If $X$ and $Y$ are independent, then the two r.v. $g(X)$ and $h(Y)$ are independent and:

$\mathbb{E}\left(g(X)h(Y)\right) = \mathbb{E}(g(X))\times \mathbb{E}(h(Y))$

</div>

## Correlation

<div class="alert alert-success">
<h4>Definition: Correlation Coefficient</h4>

The **correlation coefficient** $\rho$ (rho) is normalized covariance:


$$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$


**Key Properties:**

- $-1 \leq \rho \leq 1$ (always bounded)
- $\rho = 1$: Perfect positive linear relationship
- $\rho = -1$: Perfect negative linear relationship
- $\rho = 0$: No linear relationship
- Scale-invariant: rescaling either variable by a positive factor or shifting it leaves $\rho$ unchanged, so units don't matter

</div>
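
The scale-invariance property is what separates correlation from covariance in practice. The sketch below uses hypothetical synthetic data (the variables, the noise model, and the seed are assumptions made for illustration, not the lesson's dataset) and expresses the same watch time in minutes and in hours.

```python
import numpy as np

rng = np.random.default_rng(1)
minutes = rng.uniform(5, 120, size=500)                         # hypothetical watch times
rating = 1 + 4 * minutes / 120 + rng.normal(0, 0.5, size=500)   # hypothetical noisy ratings

hours = minutes / 60.0   # same quantity, different unit

cov_minutes = np.cov(minutes, rating, ddof=0)[0, 1]
cov_hours = np.cov(hours, rating, ddof=0)[0, 1]
rho_minutes = np.corrcoef(minutes, rating)[0, 1]
rho_hours = np.corrcoef(hours, rating)[0, 1]

print(f"Cov in minutes: {cov_minutes:.4f}   Cov in hours: {cov_hours:.4f}")
print(f"rho in minutes: {rho_minutes:.4f}   rho in hours: {rho_hours:.4f}")
```

The covariance shrinks by a factor of 60 when the unit changes, while the correlation coefficient stays the same.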

In [None]:
# demo
demo_correlation()

Let's calculate the correlation coefficient for our toy example, $\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$.

We know: $Cov(X,Y) = 0.1$, $\sigma_X = 0.4000$, $\sigma_Y = 0.4330$.

Hence: $\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{0.1}{0.4 \times 0.433} = \frac{0.1}{0.1732} \approx 0.5774$


In [None]:
denominator = std_X * std_Y
rho_XY = Cov_XY / denominator

print(f"\nρ(X,Y) = Cov(X,Y) / (σ_X · σ_Y)")
print(f"       = {Cov_XY:.4f} / ({std_X:.4f} × {std_Y:.4f})")
print(f"       = {Cov_XY:.4f} / {denominator:.4f}")
print(f"       = {rho_XY:.4f}")

In [None]:
# verify with Python
# Using numpy
rho_numpy = np.corrcoef(df_toy['X'], df_toy['Y'])[0, 1]
print(f"numpy  ρ(X, Y) = {rho_numpy:.4f} ✓")

# Using pandas
rho_pandas = df_toy[['X', 'Y']].corr().iloc[0, 1]
print(f"pandas  ρ(X, Y) = {rho_pandas:.4f} ✓")

In [None]:
# Determine strength
if abs(rho_XY) > 0.9:
    strength = "VERY STRONG"
    color_desc = "nearly perfect"
elif abs(rho_XY) > 0.7:
    strength = "STRONG"
    color_desc = "strong"
elif abs(rho_XY) > 0.5:
    strength = "MODERATE"
    color_desc = "moderate"
elif abs(rho_XY) > 0.3:
    strength = "WEAK"
    color_desc = "weak"
else:
    strength = "VERY WEAK/NONE"
    color_desc = "very weak or no"

direction = "positive" if rho_XY > 0 else "negative" if rho_XY < 0 else "no"

print(f"\n✓ {strength} {direction} linear relationship")
print(f"  → Correlation of {rho_XY:.4f} indicates a {color_desc} {direction} association")

if rho_XY > 0:
    print(f"\n  Interpretation:")
    print(f"  • When watch time is Long, rating tends to be High")
    print(f"  • When watch time is Short, rating tends to be Low")
    print(f"  • The strength is {strength.lower()}")

print(f"\n  R² (coefficient of determination) = ρ² = ({rho_XY:.4f})² = {rho_XY**2:.4f}")
print(f"  → About {rho_XY**2*100:.1f}% of variance in Y can be 'explained' by X")

print(f"\n  Key advantage of ρ over Cov:")
print(f"  • ρ is between -1 and 1 (easy to interpret!)")
print(f"  • ρ is dimensionless (units don't matter)")
print(f"  • ρ = {rho_XY:.4f} clearly shows {strength.lower()} relationship")
print(f"  • Cov = {Cov_XY:.4f} was harder to interpret")

<div class="alert alert-danger">
<h4>⚠️ Common Mistake: Correlation ≠ Causation</h4>

<p><strong>High correlation does NOT imply causation!</strong></p>

<div style="background: white; padding: 15px; border-radius: 5px; margin-top: 10px;">
    <p><strong>Classic Example:</strong></p>
    <p>Ice cream sales and drowning deaths are highly correlated (ρ ≈ 0.9)</p>
    <p><strong>But:</strong> Ice cream doesn't cause drowning!</p>
    <p style="color: green; font-weight: bold;">Hidden variable: Summer temperature (affects both)</p>
</div>

<p style="margin-top: 15px;"><strong>In ML:</strong> Always consider confounding variables and conduct proper causal inference!</p>
</div>

<div class="alert alert-warning">
<h4>💡 Key Insight: Independence vs Uncorrelated</h4>

<p><strong>If X and Y are independent → ρ = 0</strong> (uncorrelated)</p>
<p><strong>But:</strong> ρ = 0 does NOT imply independence!</p>

<div style="background: white; padding: 10px; border-radius: 5px; margin-top: 10px;">
    <p><strong>Example:</strong> Y = X² where X is uniformly distributed on [-1, 1]</p>
    <p>ρ = 0 (the relationship is symmetric around 0), but X and Y are clearly dependent!</p>
    <p style="color: red; font-style: italic;">Correlation only measures <strong>linear</strong> relationships.</p>
</div>
</div>
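
The same phenomenon can be shown with exact arithmetic on a small discrete analogue. The sketch below assumes that $X$ is uniform on $\{-1, 0, 1\}$ (a simplification of the continuous example, chosen so that every probability is an exact fraction) and sets $Y = X^2$.

```python
from fractions import Fraction

# Assumption for illustration: X is uniform on {-1, 0, 1} and Y = X^2
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)            # E[X]
e_y = sum(p * x**2 for x in support)         # E[Y] = E[X^2]
e_xy = sum(p * x * x**2 for x in support)    # E[XY] = E[X^3]
cov_xy = e_xy - e_x * e_y

# Independence would require P(X=1, Y=1) = P(X=1) * P(Y=1)
p_joint = p                       # X = 1 forces Y = 1
p_product = p * Fraction(2, 3)    # P(X=1) = 1/3 and P(Y=1) = 2/3

print(f"Cov(X, Y) = {cov_xy}")
print(f"P(X=1, Y=1) = {p_joint}, P(X=1)P(Y=1) = {p_product}")
```

The covariance is exactly zero, yet the joint probability does not factorize, so $X$ and $Y$ are uncorrelated but dependent.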

In [None]:
# demo
demo_corr_dependence()

<div class="alert alert-primary">
<h4>🤖 ML Application Spotlight: Feature Engineering & EDA</h4>

<p><strong>Practical uses of correlation in ML:</strong></p>

<ol>
    <li><strong>Correlation heatmaps:</strong> Visualize feature relationships before modeling</li>
    <li><strong>Multicollinearity detection:</strong> High correlation between features (|ρ| > 0.9) can hurt linear models</li>
    <li><strong>Feature selection:</strong> Remove redundant highly-correlated features</li>
    <li><strong>PCA:</strong> Principal Component Analysis finds uncorrelated linear combinations</li>
</ol>

<div style="background: rgba(0,0,0,0.05); padding: 10px; border-radius: 5px; margin-top: 10px;">
    <p><strong>Pro Tip:</strong></p>
    <p>In practice:</p>
    <ul>
        <li><code>pandas.DataFrame.corr()</code></li>
        <li><code>numpy.corrcoef()</code></li>
        <li><code>seaborn.heatmap()</code> for visualization</li>
    </ul>
</div>
</div>
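
As a minimal sketch of the multicollinearity check, the code below builds three hypothetical features (the names `f1`, `f2`, `f3` and the data are invented for illustration) and lists the feature pairs whose absolute correlation exceeds the rule-of-thumb threshold of 0.9.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300

# Hypothetical features: f2 is almost a rescaled copy of f1, f3 is unrelated noise
f1 = rng.normal(size=n)
features = pd.DataFrame({
    "f1": f1,
    "f2": 2.0 * f1 + rng.normal(scale=0.1, size=n),
    "f3": rng.normal(size=n),
})

corr = features.corr()   # Pearson correlation matrix
threshold = 0.9          # rule-of-thumb cut-off

# Visit each unordered pair once (upper triangle of the matrix)
high_pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(3))
print("Highly correlated pairs:", [(a, b, round(r, 3)) for a, b, r in high_pairs])
```

The same `corr` matrix can be passed to `seaborn.heatmap()` for a visual overview. Which feature of a flagged pair to drop remains a modeling decision.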

<div class="alert alert-example">
<h4>Calculated Example: Calculate Correlation with Python step by step</h4>

<p>Given the following data about 5 students:</p>
<ul>
    <li>X = Hours of sleep before exam</li>
    <li>Y = Exam score (out of 100)</li>
</ul>

<p><strong>Data:</strong></p>
<table style="border-collapse: collapse;">
    <tr>
        <th style="border: 1px solid black; padding: 8px;">Student</th>
        <th style="border: 1px solid black; padding: 8px;">Sleep (hrs)</th>
        <th style="border: 1px solid black; padding: 8px;">Score</th>
    </tr>
    <tr>
        <td style="border: 1px solid black; padding: 8px;">1</td>
        <td style="border: 1px solid black; padding: 8px;">4</td>
        <td style="border: 1px solid black; padding: 8px;">60</td>
    </tr>
    <tr>
        <td style="border: 1px solid black; padding: 8px;">2</td>
        <td style="border: 1px solid black; padding: 8px;">5</td>
        <td style="border: 1px solid black; padding: 8px;">70</td>
    </tr>
    <tr>
        <td style="border: 1px solid black; padding: 8px;">3</td>
        <td style="border: 1px solid black; padding: 8px;">7</td>
        <td style="border: 1px solid black; padding: 8px;">80</td>
    </tr>
    <tr>
        <td style="border: 1px solid black; padding: 8px;">4</td>
        <td style="border: 1px solid black; padding: 8px;">8</td>
        <td style="border: 1px solid black; padding: 8px;">85</td>
    </tr>
    <tr>
        <td style="border: 1px solid black; padding: 8px;">5</td>
        <td style="border: 1px solid black; padding: 8px;">9</td>
        <td style="border: 1px solid black; padding: 8px;">95</td>
    </tr>
</table>

<p><strong>Calculate: ρ(Sleep, Score)</strong></p>
</div>

In [None]:
# Data
sleep_hours = np.array([4, 5, 7, 8, 9])
scores = np.array([60, 70, 80, 85, 95])
students = ['Student 1', 'Student 2', 'Student 3', 'Student 4', 'Student 5']

n = len(sleep_hours)
example_df = pd.DataFrame({
    'Student': students,
    'Sleep (X)': sleep_hours,
    'Score (Y)': scores
})
print(example_df.to_string(index=False))

In [None]:
# Step 1: Means
print("STEP 1: Calculate means")
mu_sleep = sleep_hours.mean()
mu_score = scores.mean()

print(f"μ_X = (4 + 5 + 7 + 8 + 9) / 5 = {sleep_hours.sum()} / 5 = {mu_sleep:.1f}")
print(f"μ_Y = (60 + 70 + 80 + 85 + 95) / 5 = {scores.sum()} / 5 = {mu_score:.1f}")

In [None]:
# Step 2: Create calculation table
print("STEP 2: Calculate deviations and products")

# center the data by subtracting the means
dev_sleep = sleep_hours - mu_sleep
dev_score = scores - mu_score
# element-wise product
products = dev_sleep * dev_score
# squared centered data
sq_dev_sleep = dev_sleep**2
sq_dev_score = dev_score**2

calc_df = pd.DataFrame({
    'Student': students,
    'X': sleep_hours,
    'Y': scores,
    'X - μ_X': dev_sleep,
    'Y - μ_Y': dev_score,
    '(X-μ_X)(Y-μ_Y)': products,
    '(X-μ_X)²': sq_dev_sleep,
    '(Y-μ_Y)²': sq_dev_score
})

print("Calculation Table:")
print(calc_df.to_string(index=False))


In [None]:
# Step 3: Covariance
print("STEP 3: Calculate covariance")
sum_products = products.sum()
cov = sum_products / n

print(f"Σ(X-μ_X)(Y-μ_Y) = {' + '.join([f'({p:.1f})' for p in products])} = {sum_products:.1f}")
print(f"Cov(X,Y) = {sum_products:.1f} / {n} = {cov:.2f}")

In [None]:
# Step 4: Standard deviations
print("STEP 4: Calculate standard deviations")

sum_sq_sleep = sq_dev_sleep.sum()
sum_sq_score = sq_dev_score.sum()

var_sleep = sum_sq_sleep / n
var_score = sum_sq_score / n

std_sleep = np.sqrt(var_sleep)
std_score = np.sqrt(var_score)

print(f"For X (Sleep):")
print(f"  Σ(X-μ_X)² = {' + '.join([f'{s:.2f}' for s in sq_dev_sleep])} = {sum_sq_sleep:.2f}")  # 2 decimals so the printed terms add up to the printed sum
print(f"  σ²_X = {sum_sq_sleep:.1f} / {n} = {var_sleep:.2f}")
print(f"  σ_X = √{var_sleep:.2f} = {std_sleep:.3f}")

print(f"\nFor Y (Score):")
print(f"  Σ(Y-μ_Y)² = {' + '.join([f'{s:.1f}' for s in sq_dev_score])} = {sum_sq_score:.1f}")
print(f"  σ²_Y = {sum_sq_score:.1f} / {n} = {var_score:.2f}")
print(f"  σ_Y = √{var_score:.2f} = {std_score:.3f}")

In [None]:
# Step 5: Correlation
print("STEP 5: Calculate correlation coefficient")
rho_manual = cov / (std_sleep * std_score)

print(f"ρ = Cov(X,Y) / (σ_X × σ_Y)")
print(f"ρ = {cov:.2f} / ({std_sleep:.3f} × {std_score:.3f})")
print(f"ρ = {cov:.2f} / {std_sleep * std_score:.3f}")
print(f"ρ = {rho_manual:.4f}")

In [None]:
# Verify
rho_numpy = np.corrcoef(sleep_hours, scores)[0, 1]
print(f"\nVerification: NumPy gives ρ = {rho_numpy:.4f} ✓")
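
The hand calculation above divides by $n$ (population formulas), while `np.cov` divides by $n - 1$ by default. The sketch below reuses the same five data points to illustrate why this choice changes the covariance but not ρ: the normalizing factor cancels between numerator and denominator. The helper name `pearson` is illustrative and not part of the notebook's `multivariate` module.

```python
import numpy as np

sleep_hours = np.array([4, 5, 7, 8, 9])
scores = np.array([60, 70, 80, 85, 95])

def pearson(x, y, ddof):
    """Correlation built from covariance and standard deviations with a given ddof."""
    cov_xy = np.cov(x, y, ddof=ddof)[0, 1]
    return cov_xy / (np.std(x, ddof=ddof) * np.std(y, ddof=ddof))

cov_population = np.cov(sleep_hours, scores, ddof=0)[0, 1]  # divide by n
cov_sample = np.cov(sleep_hours, scores, ddof=1)[0, 1]      # divide by n - 1

rho_population = pearson(sleep_hours, scores, ddof=0)
rho_sample = pearson(sleep_hours, scores, ddof=1)

print(f"Cov (ddof=0) = {cov_population:.2f}, Cov (ddof=1) = {cov_sample:.2f}")
print(f"ρ (ddof=0) = {rho_population:.4f}, ρ (ddof=1) = {rho_sample:.4f}")
```

The two covariances differ by the factor $n / (n - 1)$, yet both correlation values agree with the manual result.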

In [None]:
# Visualize the worked example
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Original data
axes[0].scatter(sleep_hours, scores, s=200, alpha=0.7, color='steelblue',
               edgecolors='black', linewidth=2)

for i, student in enumerate(students):
    axes[0].annotate(student, (sleep_hours[i], scores[i]), 
                    xytext=(5, 5), textcoords='offset points', fontsize=10)

# Add mean lines
axes[0].axhline(mu_score, color='red', linestyle='--', linewidth=2, alpha=0.7, label=f'Mean Score = {mu_score:.1f}')
axes[0].axvline(mu_sleep, color='green', linestyle='--', linewidth=2, alpha=0.7, label=f'Mean Sleep = {mu_sleep:.1f}')

# Add regression line
z = np.polyfit(sleep_hours, scores, 1)
p = np.poly1d(z)
axes[0].plot(sleep_hours, p(sleep_hours), "purple", linewidth=2, label='Best fit line')

axes[0].set_xlabel('Sleep Hours', fontsize=13, fontweight='bold')
axes[0].set_ylabel('Exam Score', fontsize=13, fontweight='bold')
axes[0].set_title(f'Sleep vs Score\nρ = {rho_manual:.4f}', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Plot 2: Contribution breakdown
axes[1].bar(range(1, n+1), products, color=['green' if p > 0 else 'red' for p in products],
           alpha=0.7, edgecolor='black', linewidth=2)
axes[1].axhline(0, color='black', linestyle='-', linewidth=1)
axes[1].set_xlabel('Student', fontsize=13, fontweight='bold')
axes[1].set_ylabel('(X - μ_X)(Y - μ_Y)', fontsize=13, fontweight='bold')
axes[1].set_title('Contribution to Covariance', fontsize=14, fontweight='bold')
axes[1].set_xticks(range(1, n+1))
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()



## Return to Opening Challenge

Let's revisit our opening challenge with the full understanding we've gained:

**Original Problem Recap:**
- Simple Rule: "If watch_time > 50 min → predict 4★, else predict 2★"
- Performance: Wrong 62% of the time!
- Question: Can we do better using multivariate distributions?

OUR SOLUTION: Applying Multivariate Distribution Theory

1. STEP 1: Understanding the Joint Distribution

We learned that the relationship between watch time and rating is captured by the JOINT DISTRIBUTION P(X, Y). 

From our toy example, we found:
- $E[X] = 0.8$ (average watch time category)
- $E[Y] = 0.75$ (average rating category)
- $Cov(X,Y) = 0.1$ (positive covariance)
- $\rho(X,Y) = 0.5774$ (moderate positive correlation)

Key finding: Watch time and rating are DEPENDENT!

→ $P(High | Long) \neq P(High)$

→ Knowing watch time changes our belief about rating

2. Answering Question 1: *Are watch time and rating independent?*

ANSWER: NO, they are DEPENDENT

Evidence:
- Test: If independent, P(Y|X) should equal P(Y)
- From toy example:
    * $P(High) = 0.75$ (marginal probability)
    * $P(High | Long) = 0.875$ (conditional probability)
    * $0.75 \neq 0.875 \rightarrow$ DEPENDENT!
    * Correlation: $\rho = 0.5774$ (moderate positive)

$\Rightarrow$ Conclusion: Watch time and rating are NOT independent
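
An equivalent check is the factorization test: under independence, every joint probability would equal the product of its marginals. The sketch below applies it to the toy joint table. The variable names (`joint`, `independent_joint`) are illustrative and do not refer to objects defined earlier in the notebook.

```python
import numpy as np
import pandas as pd

# Joint PMF from the toy example (rows: watch time, columns: rating)
joint = pd.DataFrame(
    [[0.70, 0.10],
     [0.05, 0.15]],
    index=["Long", "Short"],
    columns=["High", "Low"],
)

p_watch = joint.sum(axis=1)    # marginal P(X): sum over ratings
p_rating = joint.sum(axis=0)   # marginal P(Y): sum over watch times

# Under independence the joint would be the outer product of the marginals
independent_joint = pd.DataFrame(
    np.outer(p_watch, p_rating), index=joint.index, columns=joint.columns
)

is_independent = np.allclose(joint.values, independent_joint.values)

print("Joint if X and Y were independent:")
print(independent_joint.round(3))
print("Independent?", is_independent)
```

For instance, independence would require $P(Long, High) = 0.8 \times 0.75 = 0.60$, whereas the observed value is $0.70$.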

3. Answering Question 2: *How do they vary together (joint distribution)?*

ANSWER: Positive association - they tend to move together

Evidence:
- Joint probability table shows:

| Watch_Time\ Rating |    High |   Low |
|---|----|---|            
|Long | 0.70 | 0.10 |
|Short | 0.05 | 0.15 |

- $P(Long, High) = 0.70$ - largest probability → Most users who watch long also rate high
- $P(Short, Low) = 0.15$ → Users who watch short tend to rate low
- $Covariance = 0.1000 > 0$ → positive relationship

$\Rightarrow$ Conclusion: The joint distribution shows a positive association between watch time and rating
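
The covariance and correlation quoted above can be recomputed directly from the joint table. This is a minimal sketch that assumes the 0/1 coding Long = 1, Short = 0, High = 1, Low = 0; that coding is an assumption here, chosen because it reproduces $E[X] = 0.8$ and $E[Y] = 0.75$.

```python
import numpy as np

# joint[i, j] = P(X = x_vals[i], Y = y_vals[j])
x_vals = np.array([1, 0])   # assumed coding: Long = 1, Short = 0
y_vals = np.array([1, 0])   # assumed coding: High = 1, Low = 0
joint = np.array([[0.70, 0.10],
                  [0.05, 0.15]])

p_x = joint.sum(axis=1)     # marginal of X
p_y = joint.sum(axis=0)     # marginal of Y

e_x = (x_vals * p_x).sum()
e_y = (y_vals * p_y).sum()
e_xy = (np.outer(x_vals, y_vals) * joint).sum()   # E[XY] over all cells

cov_xy = e_xy - e_x * e_y                          # Cov(X,Y) = E[XY] - E[X]E[Y]
var_x = (x_vals**2 * p_x).sum() - e_x**2
var_y = (y_vals**2 * p_y).sum() - e_y**2
rho = cov_xy / np.sqrt(var_x * var_y)

print(f"E[X] = {e_x:.2f}, E[Y] = {e_y:.2f}")
print(f"Cov(X,Y) = {cov_xy:.4f}, ρ = {rho:.4f}")
```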

4. Answering Question 3: *Can we build a better model using joint distribution?*

ANSWER: YES! Using conditional distributions improves predictions

Instead of a simple threshold, we use:
- $P(Rating | Short watch)$ to predict for short watchers
- $P(Rating | Long watch)$ to predict for long watchers

From our conditional distribution table:

|Watch_Time \ Rating | High | Low |
|---|----|---|            
|Long | 0.875 | 0.125 |
|Short | 0.250 | 0.750 |

Prediction Strategy:
- IF watch time is Short:<br>
    → $P(Low | Short) = 0.75$<br>
    → $P(High | Short) = 0.25$<br>
    → Best prediction: Low (higher probability)

- IF watch time is Long:<br>
    → $P(Low | Long) = 0.125$<br>
    → $P(High | Long) = 0.875$<br>
    → Best prediction: High (higher probability)
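
The strategy above amounts to picking, for each watch-time category, the rating with the highest conditional probability. The sketch below derives that rule from the toy joint table and computes the accuracy it would achieve if the table were the true distribution. Variable names are illustrative and do not refer to the notebook's `df_toy` or `cond_Y_given_X`.

```python
import pandas as pd

joint = pd.DataFrame(
    [[0.70, 0.10],
     [0.05, 0.15]],
    index=["Long", "Short"],
    columns=["High", "Low"],
)

# P(Rating | Watch_Time): divide each row by its marginal P(Watch_Time)
conditional = joint.div(joint.sum(axis=1), axis=0)

# For each watch-time category, choose the most probable rating
prediction_rule = conditional.idxmax(axis=1)

# Probability of a correct prediction = joint mass on the predicted cells
expected_accuracy = sum(joint.loc[w, r] for w, r in prediction_rule.items())

# Baseline: always predict the most common rating, ignoring watch time
baseline_accuracy = joint.sum(axis=0).max()

print(conditional)
print(prediction_rule)
print(f"Conditional rule: {expected_accuracy:.2f}, baseline: {baseline_accuracy:.2f}")
```

Under these assumptions the conditional rule is right with probability $0.70 + 0.15 = 0.85$, compared with $0.75$ for always predicting High.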

In [None]:
# Make predictions using conditional probabilities
df_toy['predicted_rating'] = df_toy['Watch_Time'].map({
    'Short': 'Low',
    'Long': 'High'
})
print(df_toy)

# Calculate accuracy
accuracy = (df_toy['Rating'] == df_toy['predicted_rating']).mean()
error_rate = 1 - accuracy

print(f"\n📈 Performance of Conditional Probability Model:")
print(f"  • Accuracy: {accuracy*100:.1f}%")
print(f"  • Error rate: {error_rate*100:.1f}%")

In [None]:
print("Model Comparison")


# Model 2 in the summary: simple threshold (from hook problem)
# Predict 4 stars if Long, 2 stars if Short (in continuous case)
# In our toy discrete case: predict High if Long, Low if Short
simple_predictions = df_toy['Watch_Time'].map({'Short': 'Low', 'Long': 'High'})
simple_correct = (df_toy['Rating'] == simple_predictions).sum()
simple_accuracy = simple_correct / len(df_toy)
simple_error = 1 - simple_accuracy

# Model 1 in the summary: marginal only (always predict most common)
marginal_prediction = df_toy['Rating'].mode()[0]  # Most common rating High
marginal_correct = (df_toy['Rating'] == marginal_prediction).sum()
marginal_accuracy = marginal_correct / len(df_toy)
marginal_error = 1 - marginal_accuracy

# Model 3: Conditional probability model (same as simple in this case)
conditional_correct = (df_toy['Rating'] == df_toy['predicted_rating']).sum()
conditional_accuracy = conditional_correct / len(df_toy)
conditional_error = 1 - conditional_accuracy

print("\nModel Performance Summary:")
print("="*70)
print(f"{'Model':<40} {'Correct':<10} {'Accuracy':<12} {'Error Rate'}")
print("-"*70)
print(f"{'1. Marginal Only (ignore watch time)':<40} {marginal_correct:>7}/{len(df_toy)}   {marginal_accuracy*100:>6.1f}%     {marginal_error*100:>6.1f}%")
print(f"{'2. Simple Threshold (watch > 50)':<40} {simple_correct:>7}/{len(df_toy)}   {simple_accuracy*100:>6.1f}%     {simple_error*100:>6.1f}%")
print(f"{'3. Conditional Probability Model':<40} {conditional_correct:>7}/{len(df_toy)}   {conditional_accuracy*100:>6.1f}%     {conditional_error*100:>6.1f}%")
print("="*70)

improvement = (marginal_error - conditional_error) / marginal_error * 100
improvement_simple = (simple_error - conditional_error) / simple_error * 100

print(f"\n✓ Improvement over marginal: {improvement:.1f}%")
print(f"✓ Error reduced from {marginal_error*100:.1f}% to {conditional_error*100:.1f}%")
print(f"\n✓ Improvement over simple: {improvement_simple:.1f}%")
print(f"✓ Error reduced from {simple_error*100:.1f}% to {conditional_error*100:.1f}%")

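Note that in this toy example the conditional model does not outperform the simple rule: with only two watch-time categories, choosing the most probable rating in each category yields exactly the same predictions (Short → Low, Long → High). It does, however, improve on always predicting the most frequent rating (*High* in our case).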

In [None]:
# visualisation
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Plot 1: Joint distribution showing the pattern
ax1 = axes[0, 0]
joint_viz = joint_prob_table * 100  # Convert to percentages
sns.heatmap(joint_viz, annot=True, fmt='.0f', cmap='YlGnBu', 
            cbar_kws={'label': 'Percentage of Users'}, ax=ax1,
            linewidths=3, linecolor='white', square=True)
ax1.set_title('Joint Distribution: P(Watch Time, Rating)\n(Percentage of 20 users)', 
             fontsize=13, fontweight='bold')
ax1.set_xlabel('Rating', fontsize=11, fontweight='bold')
ax1.set_ylabel('Watch Time', fontsize=11, fontweight='bold')

# Highlight the dominant pattern
ax1.add_patch(plt.Rectangle((0, 0), 1, 1, fill=False, edgecolor='red', lw=4))
ax1.text(0.5, 0.75, 'Main\nPattern', ha='center', va='center', 
        fontsize=11, fontweight='bold', color='red',
        bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))

# Plot 2: Conditional distributions
ax2 = axes[0, 1]
x_pos = [0, 1, 3, 4]
heights = [
    cond_Y_given_X.loc['Short', 'Low'],
    cond_Y_given_X.loc['Short', 'High'],
    cond_Y_given_X.loc['Long', 'Low'],
    cond_Y_given_X.loc['Long', 'High']
]
colors = ['coral', 'lightcoral', 'lightblue', 'steelblue']
bars = ax2.bar(x_pos, heights, color=colors, edgecolor='black', linewidth=2)

ax2.set_xticks([0.5, 3.5])
ax2.set_xticklabels(['P(Rating | Short)', 'P(Rating | Long)'], fontsize=11, fontweight='bold')
ax2.set_ylabel('Probability', fontsize=11, fontweight='bold')
ax2.set_title('Conditional Distributions\n(Key to Better Predictions)', 
             fontsize=13, fontweight='bold')
ax2.set_ylim(0, 1)
ax2.grid(axis='y', alpha=0.3)

# Add value labels
for i, (pos, height) in enumerate(zip(x_pos, heights)):
    label = 'Low' if i % 2 == 0 else 'High'
    ax2.text(pos, height + 0.03, f'{label}\n{height:.2f}', 
            ha='center', fontsize=10, fontweight='bold')

# Add prediction arrows
ax2.annotate('Predict\nLow', xy=(0.5, 0.5), xytext=(0.5, 0.7),
            arrowprops=dict(arrowstyle='->', lw=3, color='red'),
            fontsize=11, fontweight='bold', ha='center',
            bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))
ax2.annotate('Predict\nHigh', xy=(3.5, 0.5), xytext=(3.5, 0.7),
            arrowprops=dict(arrowstyle='->', lw=3, color='green'),
            fontsize=11, fontweight='bold', ha='center',
            bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))

# Plot 3: Confusion matrix for marginal model
ax3 = axes[1, 0]
marginal_pred = np.array([marginal_prediction] * len(df_toy))
from sklearn.metrics import confusion_matrix
cm_marginal = confusion_matrix(df_toy['Rating'], marginal_pred, labels=['Low', 'High'])
sns.heatmap(cm_marginal, annot=True, fmt='d', cmap='Reds', ax=ax3,
           xticklabels=['Low', 'High'], yticklabels=['Low', 'High'],
           cbar_kws={'label': 'Count'})
ax3.set_title(f'Marginal Model (Always Predict High)\nAccuracy: {marginal_accuracy*100:.1f}%', 
             fontsize=13, fontweight='bold')
ax3.set_xlabel('Predicted', fontsize=11, fontweight='bold')
ax3.set_ylabel('Actual', fontsize=11, fontweight='bold')

# Plot 4: Confusion matrix for conditional model
ax4 = axes[1, 1]
cm_conditional = confusion_matrix(df_toy['Rating'], df_toy['predicted_rating'], 
                                  labels=['Low', 'High'])
sns.heatmap(cm_conditional, annot=True, fmt='d', cmap='Greens', ax=ax4,
           xticklabels=['Low', 'High'], yticklabels=['Low', 'High'],
           cbar_kws={'label': 'Count'})
ax4.set_title(f'Conditional Model (Use P(Y|X))\nAccuracy: {conditional_accuracy*100:.1f}%', 
             fontsize=13, fontweight='bold')
ax4.set_xlabel('Predicted', fontsize=11, fontweight='bold')
ax4.set_ylabel('Actual', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

Now let's explore a generated dataset.

In [None]:
# get a generated dataset 
df = generate_movie_data()
print(df.head(10))
print(f"Dataset Summary:\n{df.describe(include='all')}")


In [None]:
# Calculate E[X] (watch time)
E_X_real = df['watch_time'].mean()
print(f"E[X] (Expected Watch Time):  E[X] = {E_X_real:.2f} minutes")

# Calculate E[Y] (rating)
E_Y_real = df['rating'].mean()
print(f"E[Y] (Expected Rating):  E[Y] = {E_Y_real:.2f} stars")

print(f"These are the 'centers' of our distributions")

In [None]:
# Calculate Var(X)
Var_X_real = df['watch_time'].var(ddof=0)  # Population variance
std_X_real = df['watch_time'].std(ddof=0)

print("\nVar(X) (Variance in Watch Time):")
print(f"  Var(X) = {Var_X_real:.2f} min²")
print(f"  σ_X = √Var(X) = {std_X_real:.2f} minutes")

# Calculate Var(Y)
Var_Y_real = df['rating'].var(ddof=0)
std_Y_real = df['rating'].std(ddof=0)

print("\nVar(Y) (Variance in Rating):")
print(f"  Var(Y) = {Var_Y_real:.2f} stars²")
print(f"  σ_Y = √Var(Y) = {std_Y_real:.2f} stars")

print("\nInterpretation:")
print(f"  • Watch times spread out by ±{std_X_real:.1f} min around the mean")
print(f"  • Ratings spread out by ±{std_Y_real:.2f} stars around the mean")
print(f"  • Higher variance = more variability in behavior")

In [None]:
# Visualize the distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Watch time distribution
axes[0].hist(df['watch_time'], bins=30, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].axvline(E_X_real, color='red', linestyle='--', linewidth=2, label=f'Mean = {E_X_real:.1f}')
axes[0].axvline(E_X_real - std_X_real, color='orange', linestyle='--', linewidth=1.5, 
               label=f'±1 SD = [{E_X_real-std_X_real:.1f}, {E_X_real+std_X_real:.1f}]')
axes[0].axvline(E_X_real + std_X_real, color='orange', linestyle='--', linewidth=1.5)
axes[0].set_xlabel('Watch Time (minutes)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[0].set_title(f'Distribution of Watch Time\nMean = {E_X_real:.1f}, SD = {std_X_real:.1f}', 
                 fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(axis='y', alpha=0.3)

# Rating distribution
axes[1].hist(df['rating'], bins=np.arange(0.5, 6.5, 1), edgecolor='black', alpha=0.7, color='coral')
axes[1].axvline(E_Y_real, color='red', linestyle='--', linewidth=2, label=f'Mean = {E_Y_real:.2f}')
axes[1].axvline(E_Y_real - std_Y_real, color='orange', linestyle='--', linewidth=1.5,
               label=f'±1 SD = [{E_Y_real-std_Y_real:.2f}, {E_Y_real+std_Y_real:.2f}]')
axes[1].axvline(E_Y_real + std_Y_real, color='orange', linestyle='--', linewidth=1.5)
axes[1].set_xlabel('Rating (stars)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[1].set_title(f'Distribution of Rating\nMean = {E_Y_real:.2f}, SD = {std_Y_real:.2f}', 
                 fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(axis='y', alpha=0.3)
axes[1].set_xticks([1, 2, 3, 4, 5])

plt.tight_layout()
plt.show()

In [None]:
# Calculate covariance
Cov_XY_real = np.cov(df['watch_time'], df['rating'], ddof=0)[0, 1]
print(f"Cov(X, Y) = {Cov_XY_real:.4f}")

print("\nInterpretation:")
if Cov_XY_real > 0:
    print(f"  • Cov(X, Y) = {Cov_XY_real:.4f} > 0")
    print(f"  • Positive covariance: X and Y tend to increase together")
    print(f"  • Longer watch time associated with higher ratings")
elif Cov_XY_real < 0:
    print(f"  • Cov(X, Y) = {Cov_XY_real:.4f} < 0")
    print(f"  • Negative covariance: X and Y move in opposite directions")
    print(f"  • Longer watch time associated with lower ratings")
else:
    print(f"  • Cov(X, Y) ≈ 0")
    print(f"  • No linear relationship")
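
The size of a covariance depends on the units of the variables, which is why its magnitude is hard to interpret on its own. The sketch below uses synthetic data (not the output of `generate_movie_data()`, whose values are not assumed here) to show that converting watch time from minutes to hours rescales the covariance by 1/60 while leaving the correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: a noisy linear link between watch time and rating
watch_minutes = rng.uniform(5, 120, size=500)
rating = 1 + 4 * watch_minutes / 120 + rng.normal(0, 0.5, size=500)

watch_hours = watch_minutes / 60   # same information, different unit

cov_minutes = np.cov(watch_minutes, rating, ddof=0)[0, 1]
cov_hours = np.cov(watch_hours, rating, ddof=0)[0, 1]

rho_minutes = np.corrcoef(watch_minutes, rating)[0, 1]
rho_hours = np.corrcoef(watch_hours, rating)[0, 1]

print(f"Cov (minutes) = {cov_minutes:.4f}, Cov (hours) = {cov_hours:.4f}")
print(f"ρ (minutes)   = {rho_minutes:.4f}, ρ (hours)   = {rho_hours:.4f}")
```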


In [None]:
# Calculate correlation
rho_XY_real = np.corrcoef(df['watch_time'], df['rating'])[0, 1]
print(f"CORRELATION: ρ(X, Y) = {rho_XY_real:.4f}")

# Interpret strength
if abs(rho_XY_real) > 0.9:
    strength = "VERY STRONG"
elif abs(rho_XY_real) > 0.7:
    strength = "STRONG"
elif abs(rho_XY_real) > 0.5:
    strength = "MODERATE"
elif abs(rho_XY_real) > 0.3:
    strength = "WEAK"
else:
    strength = "VERY WEAK/NONE"

direction = "positive" if rho_XY_real > 0 else "negative" if rho_XY_real < 0 else "no"

print(f"\nInterpretation:")
print(f"  • Strength: {strength} {direction} linear relationship")
print(f"  • Range: -1 ≤ ρ ≤ 1 (normalized!)")
print(f"  • Our value: ρ = {rho_XY_real:.4f}")

if rho_XY_real > 0:
    print(f"\n  • Positive correlation → variables move together")
    print(f"  • As watch time increases, rating tends to increase")
    print(f"  • But relationship is {strength.lower()} (not perfect)")

print(f"\n  • R² = ρ² = {rho_XY_real**2:.4f}")
print(f"  • About {rho_XY_real**2*100:.1f}% of variance in rating explained by watch time")

print(f"\n💡 Why not ρ = 1.0?")
print(f"  • We have three distinct groups (Main, Quick Lovers, Hate Watchers)")
print(f"  • Quick Lovers: short watch + high rating (breaks positive pattern)")
print(f"  • Hate Watchers: long watch + low rating (breaks positive pattern)")
print(f"  • These exceptions weaken the overall correlation")
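
The identity $R^2 = \rho^2$ used above holds for a simple linear regression with an intercept. The sketch below checks it numerically on synthetic data; the coefficients and noise level are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with an arbitrary linear trend plus noise
x = rng.normal(60, 20, size=300)
y = 2.0 + 0.03 * x + rng.normal(0, 0.8, size=300)

rho = np.corrcoef(x, y)[0, 1]

# Least-squares line y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# R² = 1 - (unexplained variance / total variance)
r_squared = 1 - residuals.var() / y.var()

print(f"ρ² = {rho**2:.6f}")
print(f"R² = {r_squared:.6f}")
```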

In [None]:
df['group'].unique()

In [None]:
# Calculate correlation per group
df_main = df[df['group'] == 'Main']
rho_XY_real_main = np.corrcoef(df_main['watch_time'], df_main['rating'])[0, 1]
print(f"Main group CORRELATION: ρ(X, Y) = {rho_XY_real_main:.4f}")

df_quick = df[df['group'] == 'Quick Lover']
rho_XY_real_quick = np.corrcoef(df_quick['watch_time'], df_quick['rating'])[0, 1]
print(f"Quick Lover group CORRELATION: ρ(X, Y) = {rho_XY_real_quick:.4f}")

df_hate = df[df['group'] == 'Hate Watcher']
rho_XY_real_hate = np.corrcoef(df_hate['watch_time'], df_hate['rating'])[0, 1]
print(f"Hate Watcher group CORRELATION: ρ(X, Y) = {rho_XY_real_hate:.4f}")
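
The three near-identical blocks above can also be written as a single loop over `groupby`. The sketch below uses a tiny hand-made DataFrame (hypothetical values, only the column names `group`, `watch_time`, and `rating` are taken from the notebook) and also shows how pooling groups with different patterns can yield a much weaker overall correlation than the one inside the main group.

```python
import pandas as pd

# Hypothetical mini dataset: a perfectly linear main group plus a few "quick lovers"
toy = pd.DataFrame({
    "group": ["Main"] * 4 + ["Quick Lover"] * 3,
    "watch_time": [30, 50, 70, 90, 10, 15, 20],
    "rating": [2, 3, 4, 5, 5, 4, 5],
})

# Correlation within each group
per_group = pd.Series({
    name: g["watch_time"].corr(g["rating"])
    for name, g in toy.groupby("group")
})

# Correlation over the pooled data
overall = toy["watch_time"].corr(toy["rating"])

print(per_group.round(4))
print(f"Overall ρ = {overall:.4f}")
```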

In [None]:
# Visualize correlation
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Scatter plot with all groups
colors_map = {'Main': 'steelblue', 'Quick Lover': 'green', 'Hate Watcher': 'red'}
for group in ['Main', 'Quick Lover', 'Hate Watcher']:
    mask = df['group'] == group
    axes[0].scatter(df[mask]['watch_time'], df[mask]['rating'], 
                   alpha=0.6, s=30, c=colors_map[group], label=group)

# Add regression line
z = np.polyfit(df['watch_time'], df['rating'], 1)
p = np.poly1d(z)
x_line = np.linspace(df['watch_time'].min(), df['watch_time'].max(), 100)
axes[0].plot(x_line, p(x_line), 'k--', linewidth=2, label=f'Best fit line')

axes[0].set_xlabel('Watch Time (minutes)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Rating (stars)', fontsize=12, fontweight='bold')
axes[0].set_title(f'Scatter Plot: Watch Time vs Rating\nρ = {rho_XY_real:.4f} ({strength})', 
                 fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Correlation heatmap
corr_matrix = df[['watch_time', 'rating']].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.4f', cmap='coolwarm', 
            center=0, vmin=-1, vmax=1, square=True, linewidths=2,
            cbar_kws={'label': 'Correlation'}, ax=axes[1])
axes[1].set_title('Correlation Matrix', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# test for independence
print("Testing: Are Watch Time and Rating independent?")
print("\nDefinition: X and Y are independent if:")
print("  P(Y = y | X = x) = P(Y = y) for all x, y")
print("  OR equivalently: Knowing X gives no information about Y")

In [None]:
print("TEST 1: Correlation Test")
print(f"\nIf X and Y are independent → ρ(X,Y) = 0")
print(f"Our data: ρ(X,Y) = {rho_XY_real:.4f}")

if abs(rho_XY_real) < 0.05:
    print(f"\n✓ |ρ| ≈ 0 → Could be independent (but need more tests)")
else:
    print(f"\n✗ |ρ| = {abs(rho_XY_real):.4f} ≠ 0 → NOT independent")
    print(f"  Variables show {strength.lower()} linear relationship")

print("\n⚠️ Important: ρ = 0 does NOT prove independence!")
print("   (Could have non-linear dependence)")
print("   But a true (population) ρ ≠ 0 DOES imply dependence;")
print("   a sample ρ ≠ 0 is evidence of it, subject to sampling noise.")
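
A standard illustration of the warning above is $Y = X^2$ with $X$ symmetric around zero: $Y$ is completely determined by $X$, yet the linear correlation vanishes. A minimal sketch with a deterministic, evenly spaced grid of values standing in for a symmetric distribution:

```python
import numpy as np

x = np.linspace(-1, 1, 201)   # values symmetric around 0
y = x**2                       # y is a deterministic function of x

rho = np.corrcoef(x, y)[0, 1]

# Dependence shows up in conditional averages instead
mean_y_overall = y.mean()
mean_y_far_from_zero = y[np.abs(x) > 0.5].mean()

print(f"ρ(X, Y) = {rho:.6f}")
print(f"E[Y] = {mean_y_overall:.3f}, E[Y | |X| > 0.5] = {mean_y_far_from_zero:.3f}")
```

The correlation is zero up to floating-point error, while the conditional mean of $Y$ clearly changes with $X$.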

In [None]:
print("TEST 2: Conditional vs Marginal Test")

print("\nStrategy: Compare P(Y | X) with P(Y)")
print("  If independent: P(High | Long watch) should equal P(High)")

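# Create categories for easier analysis
# right=False with open-ended bins matches the masks below: Short is < 40, Long is >= 40
df['watch_category'] = pd.cut(df['watch_time'], 
                               bins=[-np.inf, 40, np.inf], 
                               right=False,
                               labels=['Short', 'Long'])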
df['rating_category'] = pd.cut(df['rating'], 
                                bins=[0, 3, 5], 
                                labels=['Low', 'High'])

# Calculate marginal P(High)
p_high_marginal = (df['rating'] >= 4).mean()

# Calculate conditional P(High | Short)
short_mask = df['watch_time'] < 40
p_high_given_short = (df[short_mask]['rating'] >= 4).mean()

# Calculate conditional P(High | Long)
long_mask = df['watch_time'] >= 40
p_high_given_long = (df[long_mask]['rating'] >= 4).mean()

print(f"\nMarginal probability:")
print(f"  P(High Rating) = {p_high_marginal:.4f}")

print(f"\nConditional probabilities:")
print(f"  P(High | Short watch) = {p_high_given_short:.4f}")
print(f"  P(High | Long watch)  = {p_high_given_long:.4f}")


print("\nComparison:")
print(f"  P(High) = {p_high_marginal:.4f}")
print(f"  P(High | Short) = {p_high_given_short:.4f}")
print(f"  P(High | Long)  = {p_high_given_long:.4f}")

# Check for independence
threshold = 0.05  # 5% tolerance
if abs(p_high_marginal - p_high_given_short) < threshold and \
   abs(p_high_marginal - p_high_given_long) < threshold:
    print(f"\n✓ All probabilities approximately equal → Could be INDEPENDENT")
else:
    print(f"\n✗ Probabilities differ → NOT INDEPENDENT")
    print(f"\n  Differences:")
    print(f"    |P(High) - P(High|Short)| = {abs(p_high_marginal - p_high_given_short):.4f}")
    print(f"    |P(High) - P(High|Long)|  = {abs(p_high_marginal - p_high_given_long):.4f}")
    print(f"\n  → Knowing watch time DOES change probability of high rating")
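
The fixed 0.05 tolerance above is a rule of thumb. A more formal alternative, sketched here as one possible option rather than the notebook's method, is a chi-square test of independence on the contingency table of counts via `scipy.stats.chi2_contingency`. The counts are hypothetical: the toy joint probabilities scaled to 200 users.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts (rows: Long, Short; columns: High, Low)
observed = np.array([[140, 20],
                     [10, 30]])

# Null hypothesis: watch time and rating are independent
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi² = {chi2:.2f}, dof = {dof}, p-value = {p_value:.2e}")
print("Counts expected under independence:")
print(expected)

if p_value < 0.05:
    print("Reject independence at the 5% level")
else:
    print("No evidence against independence at the 5% level")
```

The `expected` array holds the counts that independence would predict (row total × column total / grand total), which makes the gap to the observed counts easy to inspect.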


In [None]:
print(f"\n💡 Why they're dependent:")
print(f"  • Main group (70%): strong positive correlation")
print(f"  • Quick Lovers (15%): break the pattern")
print(f"  • Hate Watchers (15%): break the pattern")

## Common Mistakes

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls:</h4>
<ul>
    <li>❌ Assuming independence without testing</li>
    <li>❌ Confusing correlation with causation</li>
    <li>❌ Using correlation for non-linear relationships</li>
    <li>❌ Ignoring joint structure in predictions</li>
</ul>
</div>

## Applications in Machine Learning

<div class="alert alert-secondary">
<h4>🤖 ML Applications Summary</h4>

<ul>
    <li>Feature correlation analysis</li>
    <li>Multicollinearity detection</li>
    <li>Conditional probability in decision trees</li>
    <li>Gaussian processes and regression</li>
    <li>Uncertainty quantification</li>
</ul>

</div>
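
As an illustration of the first two items, the sketch below flags highly correlated feature pairs from a correlation matrix. Everything in it is an assumption made for the example: the synthetic features, the column names, and the 0.9 cutoff, which is a common rule of thumb rather than a fixed standard.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Synthetic features: watch_hours is almost a copy of watch_minutes
watch_minutes = rng.uniform(5, 120, n)
features = pd.DataFrame({
    "watch_minutes": watch_minutes,
    "watch_hours": watch_minutes / 60 + rng.normal(0, 0.01, n),
    "user_age": rng.integers(18, 70, n),
})

corr = features.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = upper.stack()
redundant_pairs = pairs[pairs > 0.9]

print(redundant_pairs)
```

Pairs above the cutoff carry nearly the same information, so one feature of each pair is a candidate for removal before fitting a linear model.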

## Key Takeaways

<div class="alert alert-summary">
<h4>🎓 Key Takeaways</h4>

<ol>
<li><strong>Joint Distributions</strong>
    <ul>
    <li>P<sub>XY</sub>(x, y) describes probability of both variables simultaneously</li>
    <li>Discrete (PMF) vs Continuous (PDF)</li>
    <li>Visualized with heatmaps and 3D surfaces</li>
    </ul>
</li>
    
<li><strong>Marginal Distributions</strong>
        <ul>
            <li>P<sub>X</sub>(x) = sum/integrate over Y</li>
            <li>Gives distribution of one variable ignoring the other</li>
            <li>"Collapse" the joint distribution</li>
        </ul>
</li>
    
<li><strong>Conditional Distributions</strong>
        <ul>
            <li>P(X | Y) = P(X, Y) / P(Y)</li>
            <li>Foundation of prediction and regression</li>
            <li>Key for testing independence</li>
        </ul>
</li>
    
<li><strong>Covariance & Correlation</strong>
        <ul>
            <li>Cov(X,Y) measures how variables change together</li>
            <li>ρ ∈ [-1, 1] is normalized, scale-invariant</li>
            <li><strong>Warning:</strong> Correlation ≠ Causation!</li>
            <li><strong>Warning:</strong> ρ = 0 ⇏ Independence</li>
        </ul>
    </li>
</ol>
</div>