## Probability vs Likelihood

## Speaker Notes

- Emphasize the distinction: probability asks "what data would we see?" while likelihood asks "what parameters explain the data we saw?"

- In the probability case, $p$ is fixed (we know the coin is fair) and we're asking about random outcomes

- In the likelihood case, the outcome is fixed (we observed 8/10 tails) and we're searching over possible $p$ values

- The likelihood function $\mathcal{L}(p)$ peaks at $p=0.8$ because that's the parameter value that maximizes the probability of observing exactly what we saw

- This is why MLE gives $\hat{p} = 8/10 = 0.8$ - it's simply the empirical frequency

- Note that while $p=0.8$ has the highest likelihood, a fair coin ($p=0.5$) could still produce this outcome, just less frequently (about 4.4% of the time, versus about 30% at $p=0.8$)
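A minimal sketch of the coin example in these notes, assuming the 8 tails in 10 flips are modeled as a binomial count and $p$ is the probability of tails. The function name `binomial_likelihood` and the grid of candidate values are illustrative choices, not part of the original slides.

```python
import numpy as np
from math import comb

def binomial_likelihood(p, k=8, n=10):
    """Probability of exactly k tails in n flips when P(tails) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The data (8 of 10) stay fixed; we scan over candidate parameter values.
p_grid = np.linspace(0.0, 1.0, 101)
likelihoods = binomial_likelihood(p_grid)
p_hat = p_grid[np.argmax(likelihoods)]

print(f"argmax over grid: p = {p_hat:.2f}")
print(f"L(0.8) = {binomial_likelihood(0.8):.4f}")
print(f"L(0.5) = {binomial_likelihood(0.5):.4f}")
```

The grid maximum lands on the empirical frequency 0.8, while the fair coin keeps a small but nonzero likelihood.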

## Maximum Likelihood Estimation (MLE)

**Which parameter maximizes the probability of the observed data?**

## Log-Likelihood

Why?

The likelihood is the product of densities:

$$L(\mu, \sigma) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma)$$

**SAY: "For 40 data points, you're multiplying 40 numbers, each typically less than 1. The product gets astronomically small, and floating-point arithmetic eventually rounds it to zero."**

### Underflow: The Technical Details

| Type | Bits | Smallest positive normal value |
|------|------|-----------------|
| float32 | 32 (1 sign, 8 exp, 23 mantissa) | ~10⁻³⁸ |
| float64 | 64 (1 sign, 11 exp, 52 mantissa) | ~10⁻³⁰⁸ |

**Why not just use float64?**
- 2× memory, and at least 2× slower on most GPUs (often much more)
- Deep learning uses float32 or even float16
- Still fails: n=1000 → likelihood can be 10⁻⁵⁰⁰

**SAY: "Likelihood shrinks exponentially with sample size. No floating point format saves you."**
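A small sketch of the underflow problem, assuming 1000 draws from a standard normal (a synthetic data set chosen only for illustration). Every density value is at most about 0.399, so the true product is below $10^{-399}$, which neither format can represent.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1000)

# Standard normal density at each observation (each value is at most ~0.399).
densities = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Smallest positive normal numbers for each format.
print(np.finfo(np.float32).tiny)  # about 1.18e-38
print(np.finfo(np.float64).tiny)  # about 2.23e-308

# The product of the first 100 densities is tiny but still representable in float64.
partial = np.prod(densities[:100])
print(partial)

# The full product underflows to exactly 0.0 in both formats.
likelihood64 = np.prod(densities)
likelihood32 = np.prod(densities.astype(np.float32))
print(likelihood64, likelihood32)
```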

**Solution:** Log turns products into sums:

$$\ell(\mu, \sigma) = \log L = \sum_{i=1}^{n} \log f(x_i \mid \mu, \sigma)$$

**SAY: "Log is strictly increasing, so maximizing log-likelihood gives us the same answer as maximizing likelihood."**
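A sketch of the claim that the two objectives share the same maximizer, assuming a small synthetic sample ($n = 20$, so the raw product is still representable) and $\sigma$ fixed at 1. The helper `log_density` and the grid over $\mu$ are illustrative assumptions.

```python
import numpy as np

def log_density(x, mu, sigma):
    """Log of the Normal(mu, sigma^2) density, evaluated elementwise."""
    return -np.log(sigma) - 0.5 * np.log(2.0 * np.pi) - 0.5 * ((x - mu) / sigma) ** 2

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=20)

mu_grid = np.linspace(0.0, 4.0, 401)
# Shape (401, 20): one row of log densities per candidate mu.
log_dens = log_density(x[None, :], mu_grid[:, None], 1.0)

log_lik = log_dens.sum(axis=1)       # sum of logs
lik = np.exp(log_dens).prod(axis=1)  # product of densities

best_lik = np.argmax(lik)
best_log_lik = np.argmax(log_lik)
print(mu_grid[best_lik], mu_grid[best_log_lik], x.mean())
```

Both searches pick the same grid point, and that point is the one closest to the sample mean.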


## MLE Derivation for Normal Distribution

## Speaker Notes

**MLE and Loss Functions:**
- The assumed error distribution determines the loss function: Gaussian → MSE (mean), Laplace → MAE (median)
- Both distributions are symmetric (mean = median = mode in population), but MLE mechanics differ due to their probability densities
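- The Gaussian log-likelihood penalizes deviations quadratically → the sample mean minimizes MSE
- The Laplace log-likelihood penalizes deviations linearly → the sample median minimizes MAE
- Using the median for Gaussian data is valid but statistically less efficient than the MLE (asymptotic relative efficiency $2/\pi \approx 64\%$)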
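The mean/median correspondence in the bullets above can be sketched with a brute-force search over candidate location estimates. The toy data set (with one deliberate outlier) and the grid are assumptions made for illustration.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one large outlier
candidates = np.linspace(0.0, 100.0, 10001)    # candidate estimates, step 0.01

# Deviation of every data point from every candidate; shape (10001, 5).
diffs = data[None, :] - candidates[:, None]
# Proportional to the Gaussian negative log-likelihood, up to an additive constant.
squared_loss = (diffs ** 2).sum(axis=1)
# Proportional to the Laplace negative log-likelihood, up to an additive constant.
absolute_loss = np.abs(diffs).sum(axis=1)

best_l2 = candidates[np.argmin(squared_loss)]
best_l1 = candidates[np.argmin(absolute_loss)]

print(best_l2, data.mean())      # both close to 22.0
print(best_l1, np.median(data))  # both close to 3.0
```

The outlier drags the squared-loss minimizer far from the bulk of the data, while the absolute-loss minimizer stays at the middle observation.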

**Sample Statistics:**
- Population parameter $\mu$ is unknown; we estimate it with sample statistic $\bar{x} = \frac{1}{n}\sum x_i$
- **LLN:** $\bar{x} \to \mu$ as $n \to \infty$ (consistency)
- **CLT:** Distribution of $\bar{x}$ approaches $\mathcal{N}(\mu, \sigma^2/n)$ for any original distribution with finite variance
- For finite samples from a Normal: $\bar{x} \approx \text{sample median} \approx \text{sample mode}$, approximately but not exactly; exact equality holds only in the population
- For skewed distributions: mean ≠ median even in population (e.g., exponential, log-normal)
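A simulation sketch of the LLN, CLT, and skewness bullets above, assuming an Exponential(1) population (mean 1, variance 1, median $\ln 2$). The sample size, repetition count, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 50, 20_000

# Exponential(1) is strongly skewed, so its mean and median differ.
samples = rng.exponential(scale=1.0, size=(reps, n))
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()
var_of_means = sample_means.var()
pooled_median = np.median(samples)

print(mean_of_means)   # close to mu = 1
print(var_of_means)    # close to sigma^2 / n = 0.02
print(pooled_median)   # close to ln 2, about 0.693, not the mean
```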

**Practice:**
1. Visualize data distribution (histograms, Q-Q plots)
2. Make distributional assumption based on data characteristics
3. Choose loss function matching assumption (MSE for Gaussian, MAE for Laplace/outliers)
4. Remember: MLE under Gaussian assumption gives $\hat{\mu}_{\text{MLE}} = \bar{x}$, which is exactly what linear regression with MSE returns in the intercept-only case
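As a quick check of the last point, here is a sketch that fits a least-squares model containing only an intercept and compares it with the sample mean. The five data values are made up for illustration.

```python
import numpy as np

y = np.array([2.0, 4.0, 4.0, 5.0, 10.0])
X = np.ones((len(y), 1))  # design matrix with only an intercept column

# Least-squares fit: the w that minimizes the MSE of y - X @ w.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w[0], y.mean())  # the fitted intercept matches the sample mean
```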

---

### Why Smaller Residuals → Bigger Likelihood

From the log-likelihood:

$$\ell = -n\log(\sigma) - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w^T x_i)^2$$

The sum of squared residuals appears with a **negative coefficient**: $-\frac{1}{2\sigma^2}$

When $\sum (y_i - w^T x_i)^2$ is **smaller**:
- The term $-\frac{1}{2\sigma^2}\sum (y_i - w^T x_i)^2$ becomes **less negative** (closer to 0)
- Therefore $\ell$ becomes **larger** (maximized)

**Example** (taking $\sigma = 1$, so the coefficient is $-\frac{1}{2}$):
- If $\sum (y_i - w^T x_i)^2 = 100$: $\ell = \text{const} - 50$
- If $\sum (y_i - w^T x_i)^2 = 10$: $\ell = \text{const} - 5$ ✓ (larger!)

Small errors → high probability of observing data → high likelihood!
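A minimal sketch of the log-likelihood formula above, comparing a weight vector with zero residuals against a worse one. The function name `gaussian_log_likelihood` and the toy data are assumptions for illustration, with $\sigma$ held fixed at 1.

```python
import numpy as np

def gaussian_log_likelihood(y, X, w, sigma):
    """Log-likelihood of y under y_i ~ Normal(w^T x_i, sigma^2)."""
    n = len(y)
    ssr = np.sum((y - X @ w) ** 2)  # sum of squared residuals
    return -n * np.log(sigma) - 0.5 * n * np.log(2.0 * np.pi) - ssr / (2.0 * sigma**2)

# Toy data generated exactly by y = 2x, so w = 2 has zero residuals.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

ll_good = gaussian_log_likelihood(y, X, np.array([2.0]), sigma=1.0)  # SSR = 0
ll_bad = gaussian_log_likelihood(y, X, np.array([1.0]), sigma=1.0)   # SSR = 30

print(ll_good, ll_bad)
print(ll_good - ll_bad)  # equals (SSR_bad - SSR_good) / (2 * sigma^2)
```

Only the residual term differs between the two calls, so the gap in log-likelihood is half the gap in squared error when $\sigma = 1$.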

---

### Note on Bias

The maximum likelihood estimator $\hat{\sigma}^2$ divides by $n$, not $n-1$.

- MLE is **biased**: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$
- The unbiased estimator uses $n-1$ (Bessel's correction)

For large $n$, the difference is negligible. MLE optimizes likelihood, not unbiasedness.

---

## Speaker Notes: MLE Bias and Bessel's Correction

**Why MLE is biased:**
- MLE gives $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum(x_i - \bar{x})^2$
- The true expected value is $E[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2$, which is slightly less than $\sigma^2$
- This happens because we use sample mean $\bar{x}$ instead of true mean $\mu$, which reduces variability

**Bessel's correction:**
- Unbiased estimator: $s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$
- We "lose one degree of freedom" by estimating $\mu$ from data
- Now $E[s^2] = \sigma^2$ (unbiased)

**In practice:**
- For $n=100$: the bias factor is $\frac{99}{100} = 0.99$ (the MLE underestimates $\sigma^2$ by 1% on average)
- For $n=1000$: the bias factor is $\frac{999}{1000} = 0.999$ (0.1% underestimate)
- Most ML uses large datasets, so the bias is negligible
- MLE maximizes likelihood (not unbiasedness), which is why it divides by $n$

**Key insight:** MLE doesn't care about unbiasedness - it only cares about what parameter values make the observed data most probable.
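The bias can be seen in a simulation sketch. It assumes Normal data with $\sigma^2 = 4$ and a deliberately tiny sample size ($n = 5$) so the gap is visible; the seed and repetition count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, sigma2 = 5, 200_000, 4.0

# reps independent samples of size n from Normal(0, sigma2).
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))

var_mle = samples.var(axis=1, ddof=0)       # divide by n
var_unbiased = samples.var(axis=1, ddof=1)  # divide by n - 1 (Bessel's correction)

print(var_mle.mean())       # close to (n - 1) / n * sigma2 = 3.2
print(var_unbiased.mean())  # close to sigma2 = 4.0
```

In NumPy the `ddof` argument selects the divisor $n - \text{ddof}$, so `ddof=0` corresponds to the MLE and `ddof=1` to the corrected estimator.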

---

## Speaker Notes: Where $E[s^2] = \sigma^2$ Comes From

**The math:**

Starting with $\sum(x_i - \bar{x})^2$, it can be shown that:

$$E\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = (n-1)\sigma^2$$

This is because:
- We have $n$ observations
- But $\bar{x}$ is calculated from the same data, creating a constraint
- Only $(n-1)$ values are "free" to vary (degrees of freedom)

Therefore:

$$E\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = E[s^2] = \sigma^2$$

**Why $n-1$?** When we estimate $\mu$ with $\bar{x}$, we've "used up" one piece of information from our data. The last observation is determined once you know the first $(n-1)$ and the mean.
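The "used up one piece of information" argument can be made concrete with a short sketch. The four data values are made up for illustration.

```python
import numpy as np

x = np.array([3.0, 7.0, 8.0, 12.0])
n = len(x)
x_bar = x.mean()

# Deviations from the sample mean always sum to zero (up to rounding).
deviations = x - x_bar
print(deviations.sum())

# Given the mean and the first n - 1 observations, the last one is forced.
last_reconstructed = n * x_bar - x[:-1].sum()
print(last_reconstructed, x[-1])
```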

**Contrast with MLE:**

$$E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\sigma^2 < \sigma^2 \text{ (biased)}$$

---

## Derivation: Why $E[\sum(x_i - \bar{x})^2] = (n-1)\sigma^2$

**Start with the sum of squared deviations from sample mean:**

$$\sum_{i=1}^{n}(x_i - \bar{x})^2$$

**Trick: Add and subtract the true mean $\mu$:**

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}[(x_i - \mu) - (\bar{x} - \mu)]^2$$

**Expand the square:**

$$= \sum_{i=1}^{n}[(x_i - \mu)^2 - 2(x_i - \mu)(\bar{x} - \mu) + (\bar{x} - \mu)^2]$$

**Split into three sums:**

$$= \sum_{i=1}^{n}(x_i - \mu)^2 - 2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \mu) + \sum_{i=1}^{n}(\bar{x} - \mu)^2$$

**Simplify middle term:**

$$\sum_{i=1}^{n}(x_i - \mu) = \sum_{i=1}^{n}x_i - n\mu = n\bar{x} - n\mu = n(\bar{x} - \mu)$$

So: $-2(\bar{x} - \mu) \cdot n(\bar{x} - \mu) = -2n(\bar{x} - \mu)^2$

**Simplify last term:**

$$\sum_{i=1}^{n}(\bar{x} - \mu)^2 = n(\bar{x} - \mu)^2$$

**Combine:**

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i - \mu)^2 - 2n(\bar{x} - \mu)^2 + n(\bar{x} - \mu)^2$$

$$= \sum_{i=1}^{n}(x_i - \mu)^2 - n(\bar{x} - \mu)^2$$
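The algebraic identity just derived holds for any single sample, not only in expectation. A quick numerical check, assuming arbitrary values for $\mu$, $\sigma$, and $n$:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 10.0, 2.0, 8
x = rng.normal(loc=mu, scale=sigma, size=n)
x_bar = x.mean()

lhs = np.sum((x - x_bar) ** 2)
rhs = np.sum((x - mu) ** 2) - n * (x_bar - mu) ** 2

print(lhs, rhs)  # equal up to floating-point rounding
```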

**Take expectation of both sides:**

$$E\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = E\left[\sum_{i=1}^{n}(x_i - \mu)^2\right] - E[n(\bar{x} - \mu)^2]$$

**First term:**

$$E\left[\sum_{i=1}^{n}(x_i - \mu)^2\right] = n \cdot E[(x_i - \mu)^2] = n\sigma^2$$

**Second term (using $E[\bar{x}] = \mu$ and, for i.i.d. samples, $\text{Var}(\bar{x}) = \frac{\sigma^2}{n}$):**

$$E[n(\bar{x} - \mu)^2] = n \cdot \text{Var}(\bar{x}) = n \cdot \frac{\sigma^2}{n} = \sigma^2$$

**Final result:**

$$E\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = n\sigma^2 - \sigma^2 = (n-1)\sigma^2$$

**Therefore:**

$$E\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \sigma^2 \quad \text{(unbiased)}$$

$$E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\sigma^2 \quad \text{(biased)}$$
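Each expectation in the derivation above can be estimated by Monte Carlo. The sketch below assumes i.i.d. Normal data with $\mu = 1$, $\sigma^2 = 9$, and $n = 6$; all of these values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(11)
mu, sigma2, n, reps = 1.0, 9.0, 6, 200_000
x = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=(reps, n))
x_bar = x.mean(axis=1)

# Average each quantity over many independent samples of size n.
first_term = ((x - mu) ** 2).sum(axis=1).mean()          # estimates n * sigma2 = 54
second_term = (n * (x_bar - mu) ** 2).mean()             # estimates sigma2 = 9
total = ((x - x_bar[:, None]) ** 2).sum(axis=1).mean()   # estimates (n - 1) * sigma2 = 45

print(first_term, second_term, total)
```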

**Why MSE uses $\frac{1}{n}$, not $\frac{1}{n-1}$:**

1. **MSE is for optimization, not estimation:**
   - We're minimizing $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ to find best $w$
   - The $\frac{1}{n}$ only averages the loss; any positive constant factor ($\frac{1}{n}$, $\frac{1}{n-1}$, or none at all) leaves the minimizing $w$ unchanged

2. **We're not estimating population variance:**
   - In statistics, we estimate $\sigma^2$ from a sample → use $n-1$ for unbiasedness
   - In ML, we're minimizing training error → just want average loss per sample

3. **Practical reason:**
   - $\frac{1}{n}$ is the true average loss per sample
   - Makes metrics comparable across different dataset sizes
   - $\text{MSE} = 0.5$ means "average squared error is 0.5 per example"

**Bottom line:** Bessel's correction matters when you're doing statistical inference (estimating $\sigma^2$). For loss functions, we just want the mean—bias doesn't matter because we're optimizing, not estimating population parameters.
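A sketch of the scaling argument, assuming a one-parameter model $y = w x$ and a brute-force grid search over $w$. The data and the grid are illustrative choices.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
n = len(y)

w_grid = np.linspace(0.0, 2.0, 2001)  # candidate slopes, step 0.001
# Sum of squared errors for every candidate slope.
sse = ((y[None, :] - w_grid[:, None] * x[None, :]) ** 2).sum(axis=1)

w_sum = w_grid[np.argmin(sse)]               # no scaling
w_mean = w_grid[np.argmin(sse / n)]          # MSE: divide by n
w_bessel = w_grid[np.argmin(sse / (n - 1))]  # divide by n - 1

print(w_sum, w_mean, w_bessel)  # all three pick the same slope
```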

---

## MSE as Empirical Variance of Residuals

While training with MSE, we're minimizing the **empirical variance of residuals** (on our training data), which under the Gaussian assumption corresponds to maximizing likelihood.

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - w^T x_i)^2$$

This is:
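- The sample variance of the residuals (when the mean residual is ≈ 0, as it is when the model includes an intercept)
- The MLE $\hat{\sigma}^2$ of the noise variance under the Gaussian assumption, evaluated at the fitted $w$
- NOT an unbiased estimate of the true noise variance $\sigma^2$ (that needs a degrees-of-freedom correction such as $n-1$)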

**What we're really doing:**
- Finding $w$ that makes residuals as small as possible
- Equivalently: minimizing the spread/variance of prediction errors
- Under Gaussian assumption: maximizing likelihood

**So yes, you can say:** "MSE training minimizes the variance of residuals on the training set."

But remember: it's the *empirical* variance (biased estimator), not an unbiased estimate of population variance.
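A closing sketch of this equivalence, assuming synthetic linear data and a least-squares fit that includes an intercept column (so the fitted residuals average to zero). The data-generating values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 3.0 * x + 0.5 + rng.normal(scale=0.3, size=n)  # synthetic data, illustrative only

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ w

mse = np.mean(residuals ** 2)
print(residuals.mean())            # approximately 0
print(mse, residuals.var(ddof=0))  # match: MSE is the divide-by-n variance of residuals
print(residuals.var(ddof=1))       # slightly larger: divide by n - 1
```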