## ML Pipeline
```
1. Problem Definition
   └─ Regression? Classification? Time series?

2. Data Collection
   └─ Sources, APIs, databases

3. Exploratory Data Analysis (EDA)
   ├─ Distributions
   ├─ Correlations
   ├─ Missing values
   └─ Outliers

4. Feature Engineering
   ├─ Transformations (log, sqrt)
   ├─ Interactions
   ├─ Encoding (one-hot, target)
   └─ Domain-specific features

5. Train/Val/Test Split
   └─ Respect time order if applicable

6. Preprocessing
   ├─ Scaling/Normalization (fit on train only!)
   ├─ Imputation (fill missing)
   └─ Encoding

7. Model Selection
   ├─ Start simple (linear, logistic)
   ├─ Try trees (Random Forest, XGBoost)
   └─ Neural networks if needed

8. Training
   ├─ Choose loss function
   ├─ Choose optimizer (e.g. gradient descent)
   └─ Monitor train vs. validation

9. Hyperparameter Tuning
   └─ Cross-validation grid search

10. Evaluation
    ├─ Multiple metrics
    ├─ Residual analysis
    └─ Confusion matrix (classification)

11. Diagnosis
    ├─ Bias (underfitting)? → Add complexity
    ├─ Variance (overfitting)? → Regularize
    └─ Both? → Revisit features/model first; more data mainly reduces variance

12. Final Test
    └─ Evaluate on held-out test set

13. Deployment
    ├─ Model serialization
    ├─ API/serving
    └─ Monitoring drift
```
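
Steps 5 and 6 are where data leakage most often creeps in. The following is a minimal sketch, assuming a synthetic time-ordered feature matrix `X` and a 60/20/20 split (both are illustrative choices, not part of the pipeline above): split chronologically first, then compute the scaling statistics from the training rows only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical time-ordered data: 100 rows, 3 features (row 0 is the oldest).
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# Step 5: chronological split with no shuffling, so validation and test
# rows always come after the training rows in time.
n_train, n_val = 60, 20
X_train = X[:n_train]
X_val = X[n_train:n_train + n_val]
X_test = X[n_train + n_val:]

# Step 6: fit the scaler on the training rows only ...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# ... then apply the same statistics to every split.
X_train_s = (X_train - mean) / std
X_val_s = (X_val - mean) / std
X_test_s = (X_test - mean) / std
```

Only the training split ends up with exactly zero mean and unit variance. The validation and test splits will be close but not exact, which is expected.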

# Maximum Likelihood Estimation (MLE)

In [1]:
import sys
sys.path.append('../')
from src.mle_widget import run_mle_widget
import numpy as np
np.random.seed(42)

In [2]:
# Song tempos (BPM) - your sample
tempos = np.random.normal(loc=120, scale=15, size=40)
print(tempos)

[127.4507123  117.92603548 129.71532807 142.84544785 116.48769938
 116.48794565 143.68819223 131.51152094 112.95788421 128.13840065
 113.04873461 113.0140537  123.62943407  91.30079633  94.12623251
 111.56568706 104.80753319 124.71370999 106.37963887  98.81544448
 141.98473153 116.61335549 121.01292307  98.62877721 111.83425913
 121.66383885 102.73509634 125.63547028 110.99041965 115.62459375
 110.97440082 147.78417277 119.79754163 104.13433607 132.33817368
 101.68734525 123.13295393  90.60494814 100.07720927 122.95291854]


In [3]:
run_mle_widget(data=tempos, feature_name="Song Tempo (BPM)", height=600, width=1500)

[Interactive MLE widget output]

# MLE Derivation for Normal Distribution

---

## Setup

We have observations $x_1, x_2, \ldots, x_n$ assumed to be i.i.d. from $N(\mu, \sigma^2)$.

The probability density function (PDF) of a single observation is:

$$f(x_i \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
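
As a rough sketch, the density above translates directly into NumPy. The function name `normal_pdf` and the example values are illustrative assumptions.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x (scalar or array)."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

# At x = mu the exponential equals 1, so the density is 1 / (sigma * sqrt(2*pi)).
peak = normal_pdf(120.0, mu=120.0, sigma=15.0)

# The density is symmetric around mu and falls off on both sides.
left = normal_pdf(105.0, mu=120.0, sigma=15.0)
right = normal_pdf(135.0, mu=120.0, sigma=15.0)
```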

---

## Step 1: Write the Likelihood Function

Since observations are independent, the joint density is the product:

$$L(\mu, \sigma) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
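
The product form is exact on paper but awkward numerically. A small sketch, assuming an illustrative sample of 2000 points (larger than the 40 tempos used earlier), shows the raw product underflowing in float64 while the sum of log-densities stays finite:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=120.0, scale=15.0, size=2000)  # illustrative sample
mu, sigma = 120.0, 15.0

# Density of each observation under N(mu, sigma^2).
densities = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

likelihood = np.prod(densities)             # underflows to 0.0 in float64
log_likelihood = np.sum(np.log(densities))  # stays finite
```

Each density here is at most about 0.027, so multiplying 2000 of them falls far below the smallest representable float64.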

---

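## Step 2: Take the Logarithm

The logarithm is strictly increasing, so maximizing $\ell = \log L$ yields the same $\mu$ and $\sigma$ as maximizing $L$, while turning the product into a sum.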

$$\ell(\mu, \sigma) = \log L(\mu, \sigma) = \sum_{i=1}^{n} \log\left[\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right]$$

Using $\log(ab) = \log a + \log b$:

$$\ell = \sum_{i=1}^{n} \left[\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right) + \log\exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right]$$

Since $\log(e^x) = x$ and $\log(1/a) = -\log(a)$:

$$\ell = \sum_{i=1}^{n} \left[-\log(\sigma) - \log(\sqrt{2\pi}) - \frac{(x_i - \mu)^2}{2\sigma^2}\right]$$

Since $\log(\sqrt{2\pi}) = \frac{1}{2}\log(2\pi)$:

$$\ell = \sum_{i=1}^{n} \left[-\log(\sigma) - \frac{1}{2}\log(2\pi) - \frac{(x_i - \mu)^2}{2\sigma^2}\right]$$

The first two terms don't depend on $i$, so summing them over $i = 1, \ldots, n$ simply multiplies each by $n$:

$$\boxed{\ell(\mu, \sigma) = -n\log(\sigma) - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2}$$
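
One way to sanity-check the boxed expression is to compare it with the term-by-term sum of log-densities. This is a sketch under assumed values: the sample and the trial parameters `mu=118`, `sigma=14` are arbitrary.

```python
import numpy as np

def log_likelihood(x, mu, sigma):
    """Closed form: -n log(sigma) - (n/2) log(2 pi) - sum((x - mu)^2) / (2 sigma^2)."""
    n = x.size
    return (-n * np.log(sigma)
            - 0.5 * n * np.log(2.0 * np.pi)
            - np.sum((x - mu) ** 2) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(loc=120.0, scale=15.0, size=40)
mu, sigma = 118.0, 14.0  # arbitrary trial parameters

# Direct version: log of each density, then sum.
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
direct = np.sum(np.log(pdf))

closed_form = log_likelihood(x, mu, sigma)
```

The two values should agree up to floating-point rounding.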

---

## Step 3: Find $\hat{\mu}$ — Derivative w.r.t. $\mu$

$$\frac{\partial \ell}{\partial \mu} = \frac{\partial}{\partial \mu}\left[-n\log(\sigma) - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right]$$

The first two terms are constants w.r.t. $\mu$, so their derivatives are 0:

$$\frac{\partial \ell}{\partial \mu} = -\frac{1}{2\sigma^2} \cdot \frac{\partial}{\partial \mu}\sum_{i=1}^{n}(x_i - \mu)^2$$

Apply chain rule to $(x_i - \mu)^2$:

$$\frac{\partial}{\partial \mu}(x_i - \mu)^2 = 2(x_i - \mu) \cdot \frac{\partial}{\partial \mu}(x_i - \mu) = 2(x_i - \mu) \cdot (-1) = -2(x_i - \mu)$$

Therefore:

$$\frac{\partial \ell}{\partial \mu} = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[-2(x_i - \mu)\right] = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)$$

---

## Step 4: Solve for $\hat{\mu}$

Set the derivative to zero:

$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0$$

Multiply both sides by $\sigma^2$ (assuming $\sigma^2 > 0$):

$$\sum_{i=1}^{n} (x_i - \mu) = 0$$

Expand the sum:

$$\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \mu = 0$$

$$\sum_{i=1}^{n} x_i - n\mu = 0$$

Solve for $\mu$:

$$\boxed{\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}}$$

**The MLE for $\mu$ is the sample mean!**
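
A brute-force check of this result, sketched on an illustrative sample: scan a grid of candidate $\mu$ values and confirm that the log-likelihood peaks at the sample mean. The grid resolution is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=120.0, scale=15.0, size=40)
sigma = 15.0  # any fixed sigma > 0 gives the same maximizer for mu

# Only the sum-of-squares term depends on mu, so it is enough to scan that term.
mu_grid = np.linspace(x.min(), x.max(), 2001)
loglik_mu = np.array([-np.sum((x - m) ** 2) / (2 * sigma**2) for m in mu_grid])

mu_best = mu_grid[np.argmax(loglik_mu)]  # grid point with the highest log-likelihood
mu_hat = x.mean()                        # closed-form MLE
grid_step = mu_grid[1] - mu_grid[0]
```

`mu_best` can differ from `mu_hat` by at most one grid step, since the grid does not generally contain the exact mean.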

---

## Step 5: Find $\hat{\sigma}$ — Derivative w.r.t. $\sigma$

Return to the log-likelihood, now treating $\sigma$ as the variable and holding $\mu$ fixed:

$$\ell = -n\log(\sigma) - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

Take the derivative w.r.t. $\sigma$:

$$\frac{\partial \ell}{\partial \sigma} = \frac{\partial}{\partial \sigma}\left[-n\log(\sigma)\right] + \frac{\partial}{\partial \sigma}\left[-\frac{n}{2}\log(2\pi)\right] + \frac{\partial}{\partial \sigma}\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right]$$

**First term:**
$$\frac{\partial}{\partial \sigma}\left[-n\log(\sigma)\right] = -n \cdot \frac{1}{\sigma} = -\frac{n}{\sigma}$$

**Second term:** (constant)
$$\frac{\partial}{\partial \sigma}\left[-\frac{n}{2}\log(2\pi)\right] = 0$$

**Third term:**

Let $S = \sum_{i=1}^{n}(x_i - \mu)^2$ (a constant w.r.t. $\sigma$). We need:

$$\frac{\partial}{\partial \sigma}\left[-\frac{S}{2\sigma^2}\right] = -\frac{S}{2} \cdot \frac{\partial}{\partial \sigma}\left(\sigma^{-2}\right)$$

Using the power rule, $\frac{d}{d\sigma}\sigma^{-2} = -2\sigma^{-3} = -\frac{2}{\sigma^3}$, so:

$$\frac{\partial}{\partial \sigma}\left[-\frac{S}{2\sigma^2}\right] = -\frac{S}{2} \cdot \left(-\frac{2}{\sigma^3}\right) = \frac{S}{\sigma^3}$$

**Combining all terms:**

$$\frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu)^2$$
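
Both partial derivatives can be checked against central finite differences of the log-likelihood. This is a sketch with assumed values: the evaluation point and the step size `h` are arbitrary.

```python
import numpy as np

def log_likelihood(x, mu, sigma):
    n = x.size
    return (-n * np.log(sigma)
            - 0.5 * n * np.log(2 * np.pi)
            - np.sum((x - mu) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(loc=120.0, scale=15.0, size=40)
mu, sigma, h = 118.0, 14.0, 1e-5

# Analytic partial derivatives from Steps 3 and 5.
d_mu = np.sum(x - mu) / sigma**2
d_sigma = -x.size / sigma + np.sum((x - mu) ** 2) / sigma**3

# Central finite differences of the log-likelihood.
d_mu_fd = (log_likelihood(x, mu + h, sigma) - log_likelihood(x, mu - h, sigma)) / (2 * h)
d_sigma_fd = (log_likelihood(x, mu, sigma + h) - log_likelihood(x, mu, sigma - h)) / (2 * h)
```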

---

## Step 6: Solve for $\hat{\sigma}$

Set the derivative to zero:

$$-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$$

Multiply both sides by $\sigma^3$:

$$-n\sigma^2 + \sum_{i=1}^{n}(x_i - \mu)^2 = 0$$

Solve for $\sigma^2$:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$

Substituting $\hat{\mu} = \bar{x}$:

$$\boxed{\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$\boxed{\hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

**The MLE for $\sigma$ is the (biased) sample standard deviation!**
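
Setting the derivatives to zero only locates a stationary point. A short second-derivative check confirms it is a maximum, assuming the observations are not all identical (so $\hat{\sigma} > 0$). Differentiating the expressions from Steps 3 and 5 once more:

$$\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{n}{\sigma^2}, \qquad \frac{\partial^2 \ell}{\partial \sigma^2} = \frac{n}{\sigma^2} - \frac{3}{\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad \frac{\partial^2 \ell}{\partial \mu\,\partial \sigma} = -\frac{2}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu)$$

At $(\hat{\mu}, \hat{\sigma})$ we have $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x})^2 = n\hat{\sigma}^2$, so:

$$\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{n}{\hat{\sigma}^2} < 0, \qquad \frac{\partial^2 \ell}{\partial \sigma^2} = \frac{n}{\hat{\sigma}^2} - \frac{3n\hat{\sigma}^2}{\hat{\sigma}^4} = -\frac{2n}{\hat{\sigma}^2} < 0, \qquad \frac{\partial^2 \ell}{\partial \mu\,\partial \sigma} = 0$$

The Hessian is diagonal with negative entries, so the stationary point is a local maximum. Because $\ell \to -\infty$ as $\sigma \to 0^+$, $\sigma \to \infty$, or $|\mu| \to \infty$, and there is no other stationary point, it is also the global maximum.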

---

## Note on Bias

The MLE $\hat{\sigma}^2$ divides by $n$, not $n-1$.

- The MLE of $\sigma^2$ is **biased**: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$
- The unbiased estimator uses $n-1$ (Bessel's correction)

For large $n$, the difference is negligible. MLE optimizes likelihood, not unbiasedness.
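
The bias can be seen in a quick simulation. This is a sketch under assumed settings: a deliberately small $n = 5$, a true $\sigma = 15$, and 200,000 repeated samples are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma_true, n_trials = 5, 15.0, 200_000  # small n makes the bias visible

# Each row is one independent sample of size n.
samples = rng.normal(loc=120.0, scale=sigma_true, size=(n_trials, n))

var_mle = samples.var(axis=1, ddof=0)       # divides by n (the MLE)
var_unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1 (Bessel's correction)

avg_mle = var_mle.mean()            # should be near (n - 1) / n * sigma_true**2 = 180
avg_unbiased = var_unbiased.mean()  # should be near sigma_true**2 = 225
```

The two averages differ by exactly the factor $(n-1)/n$, which approaches 1 as $n$ grows.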

---

## Summary

| Parameter | MLE |
|-----------|---------------|
| $\mu$ | $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ |
| $\sigma^2$ | $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$ |

These are exactly what you discovered with the sliders!
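
To compare against the slider values, the closed-form estimates can be computed directly. This sketch regenerates the sample the same way as `In [2]`; the names `mu_hat` and `sigma_hat` are illustrative.

```python
import numpy as np

np.random.seed(42)
tempos = np.random.normal(loc=120, scale=15, size=40)  # same sample as In [2]

mu_hat = tempos.mean()                                # (1/n) * sum(x_i)
sigma_hat = np.sqrt(np.mean((tempos - mu_hat) ** 2))  # divides by n, not n - 1

print(f"mu_hat = {mu_hat:.2f} BPM, sigma_hat = {sigma_hat:.2f} BPM")
```

`np.std` uses `ddof=0` by default, so `tempos.std()` returns the same value as `sigma_hat`. Pass `ddof=1` to get the Bessel-corrected version instead.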

---
# We always deal with distributions