## ML Pipeline
```
1. Problem Definition
   └─ Regression? Classification? Time series?

2. Data Collection
   └─ Sources, APIs, databases

3. Exploratory Data Analysis (EDA)
   ├─ Distributions
   ├─ Correlations
   ├─ Missing values
   └─ Outliers

4. Feature Engineering
   ├─ Transformations (log, sqrt)
   ├─ Interactions
   ├─ Encoding (one-hot, target)
   └─ Domain-specific features

5. Train/Val/Test Split
   └─ Respect time order if applicable

6. Preprocessing
   ├─ Scaling/Normalization (fit on train only!)
   ├─ Imputation (fill missing)
   └─ Encoding

7. Model Selection
   ├─ Start simple (linear, logistic)
   ├─ Try trees (Random Forest, XGBoost)
   └─ Neural networks if needed

8. Training
   ├─ Choose loss function
   ├─ Gradient descent
   └─ Monitor train vs. validation

9. Hyperparameter Tuning
   └─ Cross-validation grid search

10. Evaluation
    ├─ Multiple metrics
    ├─ Residual analysis
    └─ Confusion matrix (classification)

11. Diagnosis
    ├─ Bias (underfitting)? → Add complexity
    ├─ Variance (overfitting)? → Regularize
    └─ Both? → Get more data

12. Final Test
    └─ Evaluate on held-out test set

13. Deployment
    ├─ Model serialization
    ├─ API/serving
    └─ Monitoring drift
```
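Steps 5 and 6 are where leakage most often creeps in. The following is a minimal sketch, assuming a toy time-ordered feature array and a 60/20/20 split (both are illustrative choices, not part of the pipeline above): split chronologically first, then compute scaling statistics from the training slice only.

```python
import numpy as np

# Hypothetical time-ordered feature values (illustrative data only)
rng = np.random.default_rng(0)
X = rng.normal(loc=120, scale=15, size=100)

# Step 5: chronological split, no shuffling (60/20/20 is an assumed ratio)
n = len(X)
i_train, i_val = int(0.6 * n), int(0.8 * n)
train, val, test = X[:i_train], X[i_train:i_val], X[i_val:]

# Step 6: fit the scaler on the training slice only ...
mu, sigma = train.mean(), train.std()

# ... then apply the same statistics to every split
train_s, val_s, test_s = [(part - mu) / sigma for part in (train, val, test)]

print(f"train mean/std after scaling: {train_s.mean():.3f} / {train_s.std():.3f}")
print(f"val mean after scaling:       {val_s.mean():.3f}")
```

The validation and test slices will generally not have mean exactly 0 after scaling, which is expected: they are transformed with statistics they did not contribute to.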

# Two approaches to ML: Linear algebra and Statistical

The statistical approach seems more meaningful (see Aggarwal).

# MLE = Maximum Likelihood Estimation

In [1]:
import sys
sys.path.append('../')
from src.mle_widget import run_mle_widget
import numpy as np
np.random.seed(42)

In [2]:
# Song tempos (BPM) - your sample
tempos = np.random.normal(loc=120, scale=15, size=40)
print(tempos)

[127.4507123  117.92603548 129.71532807 142.84544785 116.48769938
 116.48794565 143.68819223 131.51152094 112.95788421 128.13840065
 113.04873461 113.0140537  123.62943407  91.30079633  94.12623251
 111.56568706 104.80753319 124.71370999 106.37963887  98.81544448
 141.98473153 116.61335549 121.01292307  98.62877721 111.83425913
 121.66383885 102.73509634 125.63547028 110.99041965 115.62459375
 110.97440082 147.78417277 119.79754163 104.13433607 132.33817368
 101.68734525 123.13295393  90.60494814 100.07720927 122.95291854]


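The maximum likelihood estimate is the parameter value under which the observed sample is most probable:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} f(x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log f(x_i \mid \theta)$$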

In [5]:
run_mle_widget(data=tempos, feature_name="Song Tempo (BPM)", height=600, width=1500)


**Key insight:** We're modeling the marginal distribution of $x$: just the tempos, with no inputs, no outputs, and no prediction. One variable follows one distribution. This is parameter estimation for a single density, not yet supervised ML.

# MLE Derivation for Normal Distribution

---

## Setup

We have observations $x_1, x_2, \ldots, x_n$ assumed to be i.i.d. from $N(\mu, \sigma^2)$.

The probability density function (PDF) of a single observation is:

$$f(x_i \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

---

## Step 1: Write the Likelihood Function

Since observations are independent, the joint density is the product:

$$L(\mu, \sigma) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

---

## Step 2: Take the Logarithm

$$\ell(\mu, \sigma) = \log L(\mu, \sigma) = \sum_{i=1}^{n} \log\left[\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right]$$

Using $\log(ab) = \log a + \log b$:

$$\ell = \sum_{i=1}^{n} \left[\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right) + \log\exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right]$$

Since $\log(e^x) = x$ and $\log(1/a) = -\log(a)$:

$$\ell = \sum_{i=1}^{n} \left[-\log(\sigma) - \log(\sqrt{2\pi}) - \frac{(x_i - \mu)^2}{2\sigma^2}\right]$$

Since $\log(\sqrt{2\pi}) = \frac{1}{2}\log(2\pi)$:

$$\ell = \sum_{i=1}^{n} \left[-\log(\sigma) - \frac{1}{2}\log(2\pi) - \frac{(x_i - \mu)^2}{2\sigma^2}\right]$$

The first two terms don't depend on $i$, so they sum to $n$ times themselves:

$$\boxed{\ell(\mu, \sigma) = -n\log(\sigma) - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2}$$
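As a quick sanity check on the algebra, the boxed closed form can be compared with a direct sum of log-densities. This is a sketch that regenerates the tempo sample from the cells above; the function names are hypothetical helpers, not part of `src.mle_widget`.

```python
import numpy as np

np.random.seed(42)
tempos = np.random.normal(loc=120, scale=15, size=40)  # same sample as in the cells above

def normal_log_likelihood(x, mu, sigma):
    """Closed-form log-likelihood from the boxed expression."""
    n = len(x)
    return (-n * np.log(sigma)
            - 0.5 * n * np.log(2 * np.pi)
            - np.sum((x - mu) ** 2) / (2 * sigma**2))

def normal_log_likelihood_direct(x, mu, sigma):
    """Sum of log-densities, term by term, for comparison."""
    pdf = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(np.log(pdf))

closed_form = normal_log_likelihood(tempos, 120, 15)
direct = normal_log_likelihood_direct(tempos, 120, 15)
print(closed_form, direct)
```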

---

## Step 3: Find $\hat{\mu}$ — Derivative w.r.t. $\mu$

$$\frac{\partial \ell}{\partial \mu} = \frac{\partial}{\partial \mu}\left[-n\log(\sigma) - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right]$$

The first two terms are constants w.r.t. $\mu$, so their derivatives are 0:

$$\frac{\partial \ell}{\partial \mu} = -\frac{1}{2\sigma^2} \cdot \frac{\partial}{\partial \mu}\sum_{i=1}^{n}(x_i - \mu)^2$$

Apply chain rule to $(x_i - \mu)^2$:

$$\frac{\partial}{\partial \mu}(x_i - \mu)^2 = 2(x_i - \mu) \cdot \frac{\partial}{\partial \mu}(x_i - \mu) = 2(x_i - \mu) \cdot (-1) = -2(x_i - \mu)$$

Therefore:

$$\frac{\partial \ell}{\partial \mu} = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[-2(x_i - \mu)\right] = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)$$

---

## Step 4: Solve for $\hat{\mu}$

Set the derivative to zero:

$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0$$

Multiply both sides by $\sigma^2$ (assuming $\sigma^2 > 0$):

$$\sum_{i=1}^{n} (x_i - \mu) = 0$$

Expand the sum:

$$\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \mu = 0$$

$$\sum_{i=1}^{n} x_i - n\mu = 0$$

Solve for $\mu$:

$$\boxed{\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}}$$

**The MLE for $\mu$ is the sample mean!**
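The first-order condition can be checked numerically. In this sketch (the helper name `score_mu` is an assumption for illustration), the derivative from Step 3 vanishes at the sample mean and changes sign on either side of it.

```python
import numpy as np

np.random.seed(42)
tempos = np.random.normal(loc=120, scale=15, size=40)

def score_mu(x, mu, sigma):
    """Derivative of the log-likelihood w.r.t. mu: (1/sigma^2) * sum(x_i - mu)."""
    return np.sum(x - mu) / sigma**2

mu_hat = tempos.mean()
sigma = 15.0  # any positive value works; the root does not depend on sigma

at_mle = score_mu(tempos, mu_hat, sigma)
left_of_mle = score_mu(tempos, mu_hat - 1.0, sigma)
right_of_mle = score_mu(tempos, mu_hat + 1.0, sigma)

print(at_mle)        # ~0 up to floating-point error
print(left_of_mle)   # positive: the likelihood still increases to the right
print(right_of_mle)  # negative: the likelihood increases to the left
```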

---

## Step 5: Find $\hat{\sigma}$ — Derivative w.r.t. $\sigma$

Let's rewrite the log-likelihood, treating $\sigma$ as the variable:

$$\ell = -n\log(\sigma) - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

Take the derivative w.r.t. $\sigma$:

$$\frac{\partial \ell}{\partial \sigma} = \frac{\partial}{\partial \sigma}\left[-n\log(\sigma)\right] + \frac{\partial}{\partial \sigma}\left[-\frac{n}{2}\log(2\pi)\right] + \frac{\partial}{\partial \sigma}\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right]$$

**First term:**
$$\frac{\partial}{\partial \sigma}\left[-n\log(\sigma)\right] = -n \cdot \frac{1}{\sigma} = -\frac{n}{\sigma}$$

**Second term:** (constant)
$$\frac{\partial}{\partial \sigma}\left[-\frac{n}{2}\log(2\pi)\right] = 0$$

**Third term:**

Let $S = \sum_{i=1}^{n}(x_i - \mu)^2$ (a constant w.r.t. $\sigma$). We need:

$$\frac{\partial}{\partial \sigma}\left[-\frac{S}{2\sigma^2}\right] = -\frac{S}{2} \cdot \frac{\partial}{\partial \sigma}\left(\sigma^{-2}\right)$$

Using power rule: $\frac{d}{d\sigma}\sigma^{-2} = -2\sigma^{-3} = -\frac{2}{\sigma^3}$

$$= -\frac{S}{2} \cdot \left(-\frac{2}{\sigma^3}\right) = \frac{S}{\sigma^3}$$

**Combining all terms:**

$$\frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu)^2$$

---

## Step 6: Solve for $\hat{\sigma}$

Set the derivative to zero:

$$-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$$

Multiply both sides by $\sigma^3$:

$$-n\sigma^2 + \sum_{i=1}^{n}(x_i - \mu)^2 = 0$$

Solve for $\sigma^2$:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$

Substituting $\hat{\mu} = \bar{x}$:

$$\boxed{\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$\boxed{\hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

**The MLE for $\sigma$ is the (biased) sample standard deviation!**
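The same kind of check works for $\hat{\sigma}$. A short sketch, assuming the tempo sample from the cells above: the boxed formula coincides with NumPy's default `np.std` (which divides by $n$), and the derivative from Step 5 is zero there.

```python
import numpy as np

np.random.seed(42)
tempos = np.random.normal(loc=120, scale=15, size=40)

n = len(tempos)
mu_hat = tempos.mean()
sigma_hat = np.sqrt(np.sum((tempos - mu_hat) ** 2) / n)  # boxed formula

def score_sigma(x, mu, sigma):
    """d(log-likelihood)/d(sigma) = -n/sigma + sum((x_i - mu)^2) / sigma^3."""
    return -len(x) / sigma + np.sum((x - mu) ** 2) / sigma**3

print(sigma_hat, np.std(tempos))               # np.std defaults to ddof=0 (divide by n)
print(score_sigma(tempos, mu_hat, sigma_hat))  # ~0 at the MLE
```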

---

## Note on Bias

The MLE estimator $\hat{\sigma}^2$ divides by $n$, not $n-1$.

- MLE is **biased**: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$
- The unbiased estimator uses $n-1$ (Bessel's correction)

For large $n$, the difference is negligible. MLE optimizes likelihood, not unbiasedness.
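The bias is easy to see in simulation. The sketch below uses assumed values ($n = 5$, $\sigma^2 = 4$) chosen only to make the effect visible; it averages both estimators over many samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 5, 4.0, 200_000  # small n makes the bias visible

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
var_mle = samples.var(axis=1, ddof=0).mean()       # divide by n
var_unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1

print(f"average MLE variance:      {var_mle:.3f}  (theory: {(n - 1) / n * sigma2:.3f})")
print(f"average unbiased variance: {var_unbiased:.3f}  (theory: {sigma2:.3f})")
```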

---

## Summary

| Parameter | MLE Estimator |
|-----------|---------------|
| $\mu$ | $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ |
| $\sigma^2$ | $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$ |

These are exactly what you discovered with the sliders!

# MLE Derivation for Laplace Distribution

---

## Setup

We have observations $x_1, x_2, \ldots, x_n$ assumed to be i.i.d. from a Laplace distribution with location $\mu$ and scale $b$.

The probability density function (PDF) of a single observation is:

$$f(x_i \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{\lvert x_i - \mu \rvert}{b}\right)$$

---

## Step 1: Write the Likelihood Function

Since observations are independent, the joint density is the product:

$$L(\mu, b) = \prod_{i=1}^{n} f(x_i \mid \mu, b) = \prod_{i=1}^{n} \frac{1}{2b} \exp\left(-\frac{\lvert x_i - \mu \rvert}{b}\right)$$

---

## Step 2: Take the Logarithm

$$\ell(\mu, b) = \log L(\mu, b) = \sum_{i=1}^{n} \log\left[\frac{1}{2b} \exp\left(-\frac{\lvert x_i - \mu \rvert}{b}\right)\right]$$

Using $\log(ab) = \log a + \log b$, $\log(e^x) = x$, and $\log(1/a) = -\log(a)$:

$$\ell = \sum_{i=1}^{n} \left[-\log(2b) - \frac{\lvert x_i - \mu \rvert}{b}\right]$$

The first term doesn't depend on $i$, so it sums to $n$ times itself:

$$\boxed{\ell(\mu, b) = -n\log(2b) - \frac{1}{b}\sum_{i=1}^{n}\lvert x_i - \mu \rvert}$$

**Key observation:** The log-likelihood contains $\sum_{i=1}^{n}\lvert x_i - \mu \rvert$. Maximizing the log-likelihood means minimizing this sum. This is exactly the **MAE loss**. Hence: MAE = MLE under Laplace assumption.
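To make the equivalence concrete, the following sketch scans candidate values of $\mu$ on a grid and compares the maximizer of the Laplace log-likelihood with the minimizer of MAE. The sample, the fixed scale `b`, and the grid are illustrative assumptions.

```python
import numpy as np

def laplace_log_likelihood(x, mu, b):
    """Boxed expression: -n*log(2b) - (1/b) * sum(|x_i - mu|)."""
    return -len(x) * np.log(2 * b) - np.sum(np.abs(x - mu)) / b

def mae(x, mu):
    """Mean absolute error of the constant prediction mu."""
    return np.mean(np.abs(x - mu))

x = np.array([2.0, 5.0, 7.0, 12.0, 100.0])  # illustrative sample
b = 10.0                                     # scale held fixed (assumed value)
candidates = np.linspace(0, 30, 301)         # grid with step 0.1

best_by_likelihood = candidates[np.argmax([laplace_log_likelihood(x, m, b) for m in candidates])]
best_by_mae = candidates[np.argmin([mae(x, m) for m in candidates])]
print(best_by_likelihood, best_by_mae)
```

Both searches pick the same grid point because, for fixed $b$, the log-likelihood is a constant minus $n/b$ times the MAE.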

---

## Step 3: Find $\hat{\mu}$ — Derivative w.r.t. $\mu$

$$\frac{\partial \ell}{\partial \mu} = \frac{\partial}{\partial \mu}\left[-n\log(2b) - \frac{1}{b}\sum_{i=1}^{n}\lvert x_i - \mu \rvert\right]$$

The first term is constant w.r.t. $\mu$, so its derivative is 0:

$$\frac{\partial \ell}{\partial \mu} = -\frac{1}{b} \sum_{i=1}^{n} \frac{\partial}{\partial \mu}\lvert x_i - \mu \rvert$$

---

## ⚠️ The Problem: Absolute Value is Not Differentiable at Zero

For the Normal distribution, we had $(x_i - \mu)^2$, which is smooth everywhere. The absolute value $\lvert x_i - \mu \rvert$ has a **kink** at $x_i = \mu$.

Using the limit definition of the derivative of $\lvert z \rvert$ at $z = 0$:

$$\lim_{h \to 0^+} \frac{\lvert h \rvert - 0}{h} = \lim_{h \to 0^+} \frac{h}{h} = +1 \qquad \text{vs} \qquad \lim_{h \to 0^-} \frac{\lvert h \rvert - 0}{h} = \lim_{h \to 0^-} \frac{-h}{h} = -1$$

The left and right limits disagree. Therefore, the derivative does not exist at zero.
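The disagreement shows up directly in one-sided difference quotients. A small sketch (the helper names are hypothetical): for the absolute value the two sides stay at $+1$ and $-1$ however small the step, while for the square both sides shrink toward 0.

```python
def difference_quotient(f, x0, h):
    """Slope of the secant line between x0 and x0 + h."""
    return (f(x0 + h) - f(x0)) / h

def square(z):
    return z * z

for h in (1e-1, 1e-3, 1e-6):
    abs_right = difference_quotient(abs, 0.0, h)
    abs_left = difference_quotient(abs, 0.0, -h)
    sq_right = difference_quotient(square, 0.0, h)
    sq_left = difference_quotient(square, 0.0, -h)
    print(f"h={h:g}: |z| right={abs_right:+.1f}, left={abs_left:+.1f}; "
          f"z^2 right={sq_right:+.1e}, left={sq_left:+.1e}")
```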

---

## Step 4: The Derivative Where It Exists

Away from the kink, we can compute:

$$\frac{\partial}{\partial \mu}\lvert x_i - \mu \rvert = \begin{cases} -1 & \text{if } x_i > \mu \\ +1 & \text{if } x_i < \mu \\ \text{undefined} & \text{if } x_i = \mu \end{cases}$$

For $x_i \neq \mu$, this can be written compactly using the **sign function**:

$$\frac{\partial}{\partial \mu}\lvert x_i - \mu \rvert = -\text{sign}(x_i - \mu)$$

where:

$$\text{sign}(z) = \begin{cases} +1 & \text{if } z > 0 \\ \ \ 0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases}$$

**Crucial difference from Normal:** The derivative is $\pm 1$ regardless of how far $x_i$ is from $\mu$. The derivative "forgets" the magnitude and only remembers the **direction** (above or below).

---

## Step 5: Setting the Derivative to Zero

Substituting back:

$$\frac{\partial \ell}{\partial \mu} = -\frac{1}{b} \sum_{i=1}^{n} \left[-\text{sign}(x_i - \mu)\right] = \frac{1}{b} \sum_{i=1}^{n} \text{sign}(x_i - \mu) = 0$$

Since $b > 0$:

$$\sum_{i=1}^{n} \text{sign}(x_i - \mu) = 0$$

---

## Step 6: Interpreting the Condition — The Counting Argument

Each term $\text{sign}(x_i - \mu)$ contributes $+1$ if $x_i > \mu$ and $-1$ if $x_i < \mu$.

For the sum to equal zero, the number of $+1$s must equal the number of $-1$s:

$$\#\{x_i > \mu\} = \#\{x_i < \mu\}$$

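**This is the defining property of the median!** For odd $n$, the middle observation satisfies it; for even $n$, any $\mu$ between the two middle observations does.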

---

## Concrete Example

Consider five observations: $x = \{2, 5, 7, 12, 100\}$

Let's evaluate $\sum_{i=1}^{5} \text{sign}(x_i - \mu)$ for different values of $\mu$:

| $\mu$ | Signs: $\text{sign}(x_i - \mu)$ for each $x_i$ | Sum | Interpretation |
|-------|------------------------------------------------|-----|----------------|
| 3 | $\text{sign}(2-3), \text{sign}(5-3), \text{sign}(7-3), \text{sign}(12-3), \text{sign}(100-3)$ = $(-1, +1, +1, +1, +1)$ | $+3$ | 1 below, 4 above → move right |
| 6 | $(-1, -1, +1, +1, +1)$ | $+1$ | 2 below, 3 above → move right |
| 7 | $(-1, -1, 0, +1, +1)$ | $0$ ✓ | 2 below, 2 above → **balanced!** |
| 10 | $(-1, -1, -1, +1, +1)$ | $-1$ | 3 below, 2 above → move left |

At $\mu = 7$ (the median), the sum equals zero. The counts are balanced.

**Notice:** The outlier $x_5 = 100$ contributes the same $+1$ as $x_4 = 12$. Outliers have no extra influence — this is why the median is robust!

Compare to Normal/MSE: the derivative $\sum(x_i - \mu)$ at $\mu = 7$ would be $(2-7) + (5-7) + (7-7) + (12-7) + (100-7) = -5 - 2 + 0 + 5 + 93 = 91 \neq 0$. The outlier pulls the mean toward 25.2.

$$\boxed{\hat{\mu} = \text{median}(x_1, x_2, \ldots, x_n)}$$
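The table in the concrete example can be reproduced in a few lines. This sketch uses the same five observations; `sign_sum` is a hypothetical helper name.

```python
import numpy as np

x = np.array([2, 5, 7, 12, 100])

def sign_sum(x, mu):
    """Sum of sign(x_i - mu): positive means more points above mu than below."""
    return int(np.sign(x - mu).sum())

for mu in (3, 6, 7, 10):
    signs = np.sign(x - mu).tolist()
    print(f"mu={mu:>2}: signs={signs}, sum={sign_sum(x, mu):+d}")

print("median:", np.median(x))  # the value that balances the signs
print("mean:  ", np.mean(x))    # dragged toward the outlier
```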

---

## Step 7: Find $\hat{b}$ — Derivative w.r.t. $b$

Starting from:

$$\ell(\mu, b) = -n\log(2b) - \frac{1}{b}\sum_{i=1}^{n}\lvert x_i - \mu \rvert$$

Let $S = \sum_{i=1}^{n}\lvert x_i - \mu \rvert$. Using $\log(2b) = \log 2 + \log b$:

$$\ell = -n\log 2 - n\log b - \frac{S}{b}$$

Take the derivative w.r.t. $b$:

**First term:** $\frac{\partial}{\partial b}(-n\log 2) = 0$

**Second term:** $\frac{\partial}{\partial b}(-n\log b) = -\frac{n}{b}$

**Third term:** $\frac{\partial}{\partial b}\left(-\frac{S}{b}\right) = -S \cdot (-b^{-2}) = \frac{S}{b^2}$

**Combining:**

$$\frac{\partial \ell}{\partial b} = -\frac{n}{b} + \frac{S}{b^2}$$

---

## Step 8: Solve for $\hat{b}$

Set the derivative to zero:

$$-\frac{n}{b} + \frac{S}{b^2} = 0$$

Multiply both sides by $b^2$:

$$-nb + S = 0 \implies b = \frac{S}{n}$$

Substituting $\hat{\mu} = \text{median}$:

$$\boxed{\hat{b} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - \text{median} \rvert}$$

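**The MLE for $b$ is the mean absolute deviation around the median.** (The abbreviation MAD is ambiguous: it is also widely used for the *median* absolute deviation, which is a different statistic.)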
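A simulation gives a feel for both Laplace estimators. In this sketch the true parameters (`true_mu`, `true_b`) and the sample size are assumptions chosen for illustration; the estimates should land close to them, and the derivative from Step 7 should vanish at $\hat{b}$.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_b = 120.0, 10.0  # assumed values for the simulation
x = rng.laplace(loc=true_mu, scale=true_b, size=10_001)  # odd n gives a unique median

mu_hat = np.median(x)
b_hat = np.mean(np.abs(x - mu_hat))

def score_b(x, mu, b):
    """d(log-likelihood)/db = -n/b + S/b^2 with S = sum(|x_i - mu|)."""
    return -len(x) / b + np.sum(np.abs(x - mu)) / b**2

print(f"mu_hat = {mu_hat:.2f}, b_hat = {b_hat:.2f}")
print(f"score at b_hat: {score_b(x, mu_hat, b_hat):.2e}")
```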

---

## Summary: Normal vs Laplace

| Aspect | Normal (MSE) | Laplace (MAE) |
|--------|--------------|---------------|
| Term in exponent | $(x_i - \mu)^2$ | $\lvert x_i - \mu \rvert$ |
| Derivative of the term w.r.t. $\mu$ | $-2(x_i - \mu)$ (proportional to distance) | $-\text{sign}(x_i - \mu)$ (only the sign) |
| First-order condition | $\sum(x_i - \mu) = 0$ | $\sum \text{sign}(x_i - \mu) = 0$ |
| Solution type | Arithmetic (add, divide) | Counting (balance above/below) |
| Location MLE | $\hat{\mu} = \bar{x}$ (mean) | $\hat{\mu} = \text{median}$ |
| Scale MLE | $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ | $\hat{b} = \frac{1}{n}\sum\lvert x_i - \text{median} \rvert$ |
| Robustness | Sensitive to outliers | Robust to outliers |

---

## Key Insight

The choice of loss function encodes a distributional assumption about errors:

| Loss | Distribution | Optimal Estimator |
|------|--------------|-------------------|
| MSE | Normal | Mean |
| MAE | Laplace | Median |

### So Why Does This Matter?

When we assume $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ and derive the MLE, we get MSE, which targets the mean.

When we assume $\varepsilon \sim \text{Laplace}(0, b)$ and derive the MLE, we get MAE, which targets the median.

For symmetric distributions, mean = median, so both estimators target the same value. But they behave differently with outliers:

- **MSE** heavily penalizes large errors (squared), so it's sensitive to outliers
- **MAE** penalizes all errors linearly, so it's more robust to outliers

Even though they target the same "center" for symmetric distributions, the choice matters for robustness and optimization behavior.

# Keep just one example and add others to reference (maybe create a separate notebook)
### Real Production Systems & Research Papers: MAE vs MSE Impact

Here are actual documented cases from research and industry where the choice between MAE and MSE had measurable impact.

---

### 1. **Kaggle: Allstate Claims Severity (2016)**

**Competition**: Predict insurance claim costs

**Dataset**: ~188K claims, most \$1K-\$10K, but some \$50K-\$100K+

**What happened**:
- Competition used **MAE as the evaluation metric** (Mean Absolute Error)
- Dataset had heavy-tailed distribution of claim amounts
- Top solutions used gradient boosting (XGBoost, LightGBM) with MAE-friendly objectives

**2nd-place approach** (Alexey Noskov):
- Used custom loss functions and extensive hyperparameter tuning
- Neural nets with careful target transformation to handle skewness
- Achieved CV score of 1130.29 / LB score of 1110.69

**Key insight**: Claim severity has a long tail. With MAE as the metric, models are rewarded for performing well on typical claims rather than being pulled toward extreme values.

**Sources**:
- Competition page: https://www.kaggle.com/c/allstate-claims-severity
- Winner interview: https://medium.com/kaggle-blog/allstate-claims-severity-competition-2nd-place-winners-interview-alexey-noskov-f4e4ce18fcfc
- Analysis: https://medium.com/nerd-for-tech/a-kaggle-competition-allstate-claims-severity-a32f4635c849

---

### 2. **Uber: ETA Prediction System (DeepETA)**

**Problem**: Predict arrival times for rides and food delivery

**Published Research**: DeepETA (2022)

**What they found**:
- **Primary metric is MAE** between predicted ETA and actual arrival time
- Previous XGBoost models needed to scale to billions of predictions
- Switched to deep learning (Transformer-based architecture)

**Key requirement from paper**:
> "Accuracy: Uber's primary metric is the **mean absolute error (MAE)** between predicted ETA and true ETA. The new ML model must improve over the XGBoost model."

**Why MAE**:
- Users care about absolute time differences (5 min late vs 5 min early = same impact)
- MSE would over-penalize traffic incident outliers
- Business impact: more accurate typical-case ETAs improve user trust

**Technical details**:
- Processes highest QPS (queries per second) at Uber
- Achieves median latency of 3.25ms
- Deployed globally for all mobility and delivery predictions

**Sources**:
- Main paper: https://arxiv.org/pdf/2206.02127 (DeeprETA: An ETA Post-processing System at Scale)
- Uber blog: https://www.uber.com/blog/deepeta-how-uber-predicts-arrival-times/
- Technical overview: https://codecompass00.substack.com/p/uber-billion-dollar-problem-predicting-eta

---

### 3. **Google/DeepMind: Datacenter Cooling (2016-2018)**

**Problem**: Optimize cooling systems in Google data centers

**Research**: DeepMind AI for data center optimization

**Key finding**:
- **40% reduction in cooling energy** using neural networks
- 15% reduction in overall PUE (Power Usage Effectiveness)

**Technical approach**:
- Neural networks with 5 hidden layers, 50 nodes each
- Trained on 2 years of monitoring data
- 19 normalized input variables → 1 output (PUE)

**Note on loss functions**:
While the public materials don't explicitly state MAE vs MSE was the deciding factor, the technical paper mentions robustness to sensor anomalies was critical:
> "Robust predictions in the presence of sensor failures and equipment anomalies"

This suggests MAE-like robustness properties were valued (sensor failures are outliers).

**Sources**:
- DeepMind blog: https://deepmind.google/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-by-40/
- Safety paper: https://deepmind.google/discover/blog/safety-first-ai-for-autonomous-data-centre-cooling-and-industrial-control/
- Technical paper: https://arxiv.org/abs/2211.07357 (Controlling Commercial Cooling Systems Using Reinforcement Learning)
- MIT Tech Review: https://www.technologyreview.com/2018/08/17/140987/google-just-gave-control-over-data-center-cooling-to-an-ai/

---

### 4. **Medical: ICU Length of Stay Prediction (MIMIC-III)**

**Problem**: Predict ICU stay duration for resource planning

**Dataset**: MIMIC-III database, ~40K+ ICU stays

**Key findings from multiple studies**:
- Most stays: 2-5 days (median ~2.64 days)
- Long-tail: Some patients stay 30-60+ days (complex complications)

**Empirical results** (from various papers):
- Support Vector Regressor achieved lowest **MAE of 2.81 days**
- RMSE values typically higher but less interpretable for operations
- Classification approach (short vs. long stay) often preferred over regression

**Why this matters**:
- Hospital staffing optimizes for typical cases
- Long-stay patients handled with adaptive protocols
- MAE predictions more clinically useful for day-to-day planning

**Sources**:
- MIMIC-III database paper: https://www.nature.com/articles/sdata201635 (Nature, 2016)
- ICU prediction study: https://pmc.ncbi.nlm.nih.gov/articles/PMC8135024/ (uses MAE as primary metric)
- Length of stay prediction: https://www.mdpi.com/2075-4418/11/12/2242 (MDPI, 2021)
- IEEE study: https://ieeexplore.ieee.org/document/10195011/ (R² 0.86, RMSE 1.2)

---

### 5. **Walmart: M5 Forecasting Competition (2020)**

**Competition**: Predict 28-day ahead sales for 30,490 time series

**Dataset**: 
- 3,049 products across 10 stores in 3 US states
- Hierarchical data with zero-inflation (intermittent sales)
- 1,941 days of history

**Evaluation metric**: **WRMSSE** (Weighted Root Mean Squared Scaled Error)
- Despite using a scaled version of RMSE, the metric addresses intermittency
- Scaling makes it more robust to extreme values than pure RMSE
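- Many winners used **Tweedie loss** for training (a likelihood for non-negative targets with many exact zeros; it sits between Poisson and Gamma, not between MSE and MAE)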

**Why standard MSE fails here**:
- Many zero-sales days (intermittent demand)
- Promotional spikes create outliers
- MSE over-forecasts slow-moving items to hedge against spikes

**Top solutions**:
- Used LightGBM with Tweedie objective (power = 1.1 or 1.2)
- Tweedie is a generalization: power=0 → Normal (MSE), power=1 → Poisson, power=2 → Gamma
- Winners achieved ~22% improvement over benchmarks

**Key quote from literature**:
> "MSE optimization led to systematic over-forecasting of slow-moving items to hedge against occasional spikes"

**Sources**:
- Competition page: https://www.kaggle.com/competitions/m5-forecasting-accuracy
- Academic paper: https://www.sciencedirect.com/science/article/pii/S0169207021001874 (International Journal of Forecasting, 2022)
- Results paper: https://statmodeling.stat.columbia.edu/wp-content/uploads/2021/10/M5_accuracy_competition.pdf
- Analysis: https://www.christophenicault.com/post/m5_forecasting_accuracy/

---

### 6. **Wind Power Forecasting (Multiple Studies)**

**Problem**: Predict wind farm power output for grid integration

**Research**: IEEE and various energy journals (2019-2024)

**Key challenges**:
- High variability in wind conditions
- Zero-inflation during calm periods or maintenance
- Extreme weather events create outliers

**Findings across multiple papers**:
- **MSE commonly used** but creates problems with outliers
- Papers exploring robust alternatives: Correntropy loss, Huber loss
- Recent work on entropy-based loss functions for extreme values

**Example study** (2021):
> "Various wind power forecasting methods have been developed... Most of these techniques are designed based on the **mean square error (MSE) loss**, which are very suitable for the assumption that the error distribution obeys the Gaussian distribution. However, there are **many outliers in real wind power data** due to many uncertain factors such as weather, temperature, and other random factors."

**Proposed solutions**:
- LSTM with Correntropy loss (more robust than MSE)
- Hybrid models combining MSE for normal conditions + robust loss for extremes

**Why it matters**:
- Grid operators need typical-case accuracy for daily operations
- Extreme events handled by reserve capacity
- Trade-off between accuracy metrics (MAE, RMSE) and operational costs

**Sources**:
- Correntropy loss paper: https://ideas.repec.org/a/eee/energy/v214y2021ics0360544220320879.html (Energy journal, 2021)
- Entropy-based loss: https://ieeexplore.ieee.org/document/10520483/ (IEEE, 2024)
- Survey paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC9823194/ (comprehensive review, 2023)
- ML/DL comparison: https://pmc.ncbi.nlm.nih.gov/articles/PMC12217728/ (Scientific Reports, 2024)

---

### 7. **Uber: Time Series Forecasting with Uncertainty (2017)**

**Problem**: Predict extreme events (demand spikes, traffic) with uncertainty quantification

**Published Research**: "Deep and Confident Prediction for Time Series at Uber"

**Key contribution**:
- Bayesian Neural Networks for uncertainty quantification
- Three types of uncertainty: model uncertainty, inherent noise, model misspecification

**Loss function considerations**:
- Used Gaussian likelihood (equivalent to MSE) as baseline
- But incorporated uncertainty estimates to handle outliers
- Prediction intervals more important than point estimates

**Business impact**:
- Better anomaly detection during holidays/events
- Reduced false alarm rates
- Improved resource allocation

**Sources**:
- Uber blog: https://www.uber.com/blog/neural-networks-uncertainty-estimation/ (2017)
- Technical paper: https://www.researchgate.net/publication/319525051_Deep_and_Confident_Prediction_for_Time_Series_at_Uber

---

## Summary: When Did MAE/Robust Losses Actually Win?

| Domain | Metric Choice | Measured Impact | Source |
|--------|---------------|-----------------|---------|
| Allstate Insurance | MAE (competition metric) | Better typical-case predictions | Kaggle |
| Uber ETA | MAE (primary metric) | More accurate ETAs, user trust | DeepETA paper |
| Google Datacenter | Implicit robustness | 40% cooling cost reduction | DeepMind blog |
| ICU Prediction | MAE preferred | Better resource allocation | Multiple papers |
| Walmart M5 | WRMSSE + Tweedie | 22% over benchmarks | Academic paper |
| Wind Power | Exploring robust losses | Ongoing research | IEEE papers |

---

## The Pattern

**MAE/robust losses win when**:
1. **Tail events handled separately** (safety margins, alerts, manual review)
2. **Typical-case performance matters most** (user experience, daily operations)
3. **Heavy-tailed distributions** (insurance claims, retail sales, power spikes)
4. **Zero-inflation present** (intermittent demand, equipment downtime)

**MSE still appropriate when**:
1. Errors genuinely Gaussian (rare in real data)
2. Large errors are disproportionately costly, so the squared penalty is what you want
3. Smooth gradients critical for optimization
4. Theory assumes Gaussian noise (classical statistics)

**Hybrid approaches emerging**:
1. Huber loss (MSE for small errors, MAE for large)
2. Quantile regression (predict specific percentiles)
3. Tweedie/Poisson losses (for count data with zeros)
4. Custom loss functions per business objective
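As an illustration of the first hybrid, here is a minimal sketch of the Huber loss. The threshold `delta=1.0` is an assumed default that would normally be tuned, and the residual values are made up.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it (delta is a tunable threshold)."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

residuals = np.array([0.1, 0.5, 1.0, 5.0, 50.0])
print("squared / 2:", 0.5 * residuals**2)
print("absolute:   ", np.abs(residuals))
print("huber:      ", huber_loss(residuals, delta=1.0))
```

For small residuals the Huber values track the halved squared error; for large ones they grow linearly like the absolute error, so a single extreme residual cannot dominate the total.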

---
## Signal-Noise Perspective

**Model:**

$$y = f(x; \theta) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Equivalently:

$$y | x \sim \mathcal{N}(f(x; \theta), \sigma^2)$$

### What Is Noise? Real Examples

Epsilon ($\varepsilon$) represents everything else that affects the outcome besides your input features.

For song popularity, noise could be:

- The artist's existing fanbase (a huge factor you didn't measure)
- Whether the song went viral on TikTok (random luck)
- The music video quality (not in your features)
- Current cultural trends (zeitgeist)
- Pure measurement error (how popularity is counted)
- Genuinely random human preferences (some people just randomly like or dislike things)

### The Complete Model

So the actual observed popularity is:

$$y = \underbrace{50 + 0.3 \cdot \text{tempo} - 0.001 \cdot \text{tempo}^2}_{f_{\text{true}}(\text{tempo})} + \underbrace{\varepsilon}_{\text{all the other stuff}}$$

Where $\varepsilon$ might be normally distributed with mean zero and variance one hundred:

$$\varepsilon \sim \mathcal{N}(0, 100)$$

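This means that even if two songs have **identical tempo**, each one's popularity can land up to about $\pm 20$ points away from $f_{\text{true}}(\text{tempo})$ purely because of these other factors: the standard deviation is $\sqrt{100} = 10$, and roughly **95%** of outcomes fall within $\pm 2$ standard deviations.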
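A short simulation of this model, as a sketch: many hypothetical songs share the same tempo, and only the noise term differs between them. The sample size and seed are arbitrary choices.

```python
import numpy as np

def f_true(tempo):
    """Signal part of the model above."""
    return 50 + 0.3 * tempo - 0.001 * tempo**2

rng = np.random.default_rng(0)
tempo = 120.0
noise = rng.normal(loc=0.0, scale=10.0, size=100_000)  # variance 100 -> std 10
popularity = f_true(tempo) + noise

within_20 = np.mean(np.abs(popularity - f_true(tempo)) <= 20)
print(f"signal at tempo 120: {f_true(tempo):.1f}")
print(f"share of songs within +/-20 of the signal: {within_20:.3f}")
```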

### Key Insight

**The noise is real-world randomness and unmeasured variables.** It's why two datapoints with the same $x$ can have different $y$ values.

### What Are We Learning?

Not the noise — we're learning $f(x; \theta)$, the **signal**.

#### The noise assumption determines the loss function, but we optimize to find the signal that best explains the data despite the noise.

Concretely: we find $f(x; \theta)$ such that residuals $y_i - f(x_i; \theta)$ look like samples from $\mathcal{N}(0, \sigma^2)$.
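The link between the Gaussian assumption and MSE can be checked numerically. This sketch assumes toy data from a line with known intercept and scans only the slope; the data, the fixed `sigma`, and the grid are illustrative. The Gaussian negative log-likelihood and the MSE pick the same slope because one is a constant plus a positive multiple of the other.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)  # assumed toy data

def mse(slope):
    """Mean squared error of the line slope * x + 1."""
    return np.mean((y - (slope * x + 1.0)) ** 2)

def gaussian_nll(slope, sigma=1.5):
    """Negative Gaussian log-likelihood of the same line."""
    resid = y - (slope * x + 1.0)
    n = y.size
    return n * np.log(sigma) + 0.5 * n * np.log(2 * np.pi) + np.sum(resid**2) / (2 * sigma**2)

slopes = np.linspace(1.0, 3.0, 201)
best_mse = slopes[np.argmin([mse(s) for s in slopes])]
best_nll = slopes[np.argmin([gaussian_nll(s) for s in slopes])]
print(best_mse, best_nll)  # same slope: NLL = constant + n * MSE / (2 * sigma^2)
```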

### Why Does the Noise Assumption Matter?

Different noise distributions → different optimal strategies:

| Assumption | Loss | Behavior | Learns |
|------------|------|----------|--------|
| Gaussian | MSE | Sensitive to outliers | Mean of $P(y \mid x)$ |
| Laplace | MAE | Robust to outliers | Median of $P(y \mid x)$ |

**Example:** For $x=1$ with observations $y = \{5, 5.1, 4.9, 5.2, 100\}$:
- MSE optimal: $f(1) \approx 24$ (pulled toward outlier)

- MAE optimal: $f(1) \approx 5.1$ (ignores outlier)

The noise assumption determines which "center" you're learning.
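The example can be verified by brute force. In this sketch a grid of candidate constants (step 0.01, an arbitrary choice) is scored under each loss; the winners coincide with the mean and the median of the five observations.

```python
import numpy as np

y = np.array([5.0, 5.1, 4.9, 5.2, 100.0])
candidates = np.linspace(0, 100, 10_001)  # grid with step 0.01

mse_curve = [np.mean((y - c) ** 2) for c in candidates]
mae_curve = [np.mean(np.abs(y - c)) for c in candidates]

mse_best = candidates[np.argmin(mse_curve)]
mae_best = candidates[np.argmin(mae_curve)]

print("MSE-optimal constant:", mse_best, "| mean:  ", y.mean())
print("MAE-optimal constant:", mae_best, "| median:", np.median(y))
```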

### Diagnostic Check

After training, verify residuals match the assumed distribution:

**For MSE/Gaussian:**

- Histogram of residuals → should look normal

- Q-Q plot → should be linear

- Residual vs. fitted plot → no patterns
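When a plot is inconvenient, the Q-Q check can be summarized as a single number. The sketch below is one possible approach, not a standard named test: it correlates sorted standardized residuals with theoretical normal quantiles from the standard library's `statistics.NormalDist`. The plotting positions $(i - 0.5)/n$ are one common convention, and both residual samples are simulated.

```python
import numpy as np
from statistics import NormalDist

def qq_correlation(residuals):
    """Correlation between sorted standardized residuals and normal quantiles.
    Values near 1 mean the Q-Q plot is close to a straight line."""
    r = np.sort((residuals - residuals.mean()) / residuals.std())
    n = r.size
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions (one common choice)
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(theoretical, r)[0, 1]

rng = np.random.default_rng(0)
gaussian_resid = rng.normal(size=2_000)
heavy_tailed_resid = rng.standard_t(df=2, size=2_000)  # simulated heavy tails

print(f"Gaussian residuals:     {qq_correlation(gaussian_resid):.4f}")
print(f"heavy-tailed residuals: {qq_correlation(heavy_tailed_resid):.4f}")
```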

#### If residuals don't match, the model is misspecified:

- **Wrong noise assumption** → try different loss (MAE, Huber)

- **Insufficient capacity** → model too weak to capture the signal → try stronger model

### Machine Learning Regression: Many Variables, Conditional Distribution

**Setup:** Now you have input features $x = (\text{tempo}, \text{duration}, \text{loudness})$ and an output $y = \text{popularity}$. You want to predict $y$ from $x$.

**The probabilistic assumption:**

Here's the crucial shift in perspective. You're not modeling the distribution of y alone. You're modeling the distribution of y GIVEN x. Mathematically:

$$y \mid x \sim \mathcal{N}(f(x; \theta), \sigma^2)$$

Read this carefully: "$y$, conditional on $x$, follows a normal distribution with mean $f(x; \theta)$ and variance $\sigma^2$."

What this says:
- For each specific value of $x$, there's a distribution of possible $y$ values
- That distribution is centered at $f(x; \theta)$ (your model's prediction)
- The spread of that distribution is $\sigma^2$ (the noise variance)

**Concrete example:**

Suppose $x = \text{tempo} = 120$ BPM. Your model might predict $f(x; \theta) = 65$ (a popularity score of 65). The full model says:

$$y \mid \text{tempo}=120 \sim \mathcal{N}(65, 100)$$

This means: for songs with tempo 120, the popularity scores are normally distributed around 65 with standard deviation 10. Some songs will be at 50, some at 80, most near 65.

**Different x, different distribution:**

For tempo $= 140$, your model might predict $f(x; \theta) = 70$. Then:

$$y \mid \text{tempo}=140 \sim \mathcal{N}(70, 100)$$

For this tempo, popularity is centered at 70.

**The key insight: Your model $f(x; \theta)$ defines the CENTER of the distribution for each $x$.**
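To see what "one distribution per $x$" means in numbers, the sketch below builds the two conditional distributions from the examples and asks how much probability each puts on popularity scores between 50 and 80. The lookup table standing in for the model is hypothetical; it only encodes the two predictions quoted above.

```python
from statistics import NormalDist

SIGMA = 10.0  # square root of the variance 100

def predicted_center(tempo):
    """Hypothetical model output matching the two examples above."""
    return {120: 65.0, 140: 70.0}[tempo]

def share_between(tempo, low, high):
    """P(low <= y <= high | tempo) under the conditional normal model."""
    dist = NormalDist(mu=predicted_center(tempo), sigma=SIGMA)
    return dist.cdf(high) - dist.cdf(low)

for tempo in (120, 140):
    print(f"tempo={tempo}: center={predicted_center(tempo)}, "
          f"P(50 <= y <= 80) = {share_between(tempo, 50, 80):.3f}")
```

The same interval carries different probability at the two tempos because the center moves while the spread stays fixed.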

### What About Noise in Classification?

A natural question: is there no noise in classification, the way there is in regression?

In regression, we have:

$$Y = f(\mathbf{x}) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

In classification, the "noise" is inherent in the Bernoulli sampling process itself. Even if we know $p(\mathbf{x})$ perfectly, the outcome $Y$ is still random — it's a coin flip with probability $p$. The randomness isn't additive noise; it's the fundamental stochasticity of the binary outcome.

Think of it this way: in regression, we predict the mean of $Y$, and the noise creates spread around that mean. In classification, we predict the probability of $Y=1$, and the "noise" is the irreducible uncertainty of a probabilistic binary event.
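A small simulation makes the point. In this sketch the true probability `p = 0.7` is an assumed value and is treated as known exactly; even so, the log loss of that perfect prediction does not go to zero but settles at the Bernoulli entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.7                      # assume the true probability is known exactly
y = rng.random(100_000) < p  # Bernoulli(p) outcomes

# Log loss of the perfect probabilistic prediction p
log_loss = -np.mean(np.where(y, np.log(p), np.log(1 - p)))
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # irreducible part

print(f"observed frequency of y=1: {y.mean():.3f}")
print(f"log loss of the true p:    {log_loss:.4f}")
print(f"Bernoulli entropy:         {entropy:.4f}")
```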

---
# We deal with distributions

| Object | Description | What the loss function assumes about it |
|-------------|-------------|-----------------------------|
| $P(X)$ | Input distribution | Nothing! |
| $f: X \to y$ | Learned function | Not a distribution! |
| $P(y - f(X))$ | **Residual distribution** | **THIS is what matters!** |

---

## The Chicken-and-Egg Problem: Does the Loss Function Bias Residuals?

**The Question:** We don't know residuals before training — we assume a distribution, choose a loss, train, then check residuals. But the loss function shaped those residuals! How do we know the "true" distribution if our choice influences what we get?

### The Paradox
```
Before training: We don't know residuals
              ↓
We ASSUME a distribution (choose loss function)
              ↓
We train (optimize parameters)
              ↓
After training: We CHECK if residuals match assumption

BUT: The loss function ITSELF shaped those residuals!
```

### The Deep Answer: There Is No "True" Distribution

Residuals are not a property of the data — they're a property of **the model**.

$$\text{Residuals} = \underbrace{(f_{\text{true}}(x) - f_{\theta}(x))}_{\text{model misspecification}} + \underbrace{\varepsilon}_{\text{true noise}}$$

We never observe true noise $\varepsilon$ separately — only its sum with model error.

### Yes, Loss Function Changes Residuals

**Same data, different losses → different residuals:**

| Loss | Behavior | Resulting Residuals |
|------|----------|---------------------|
| MSE | Spreads error across all points | Bulk shifted toward outliers, no single huge residual |
| MAE | Down-weights outliers, fits the majority | Most near zero, a few isolated large residuals |

Neither is "wrong" — they optimize for different objectives.
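
The simplest case makes this concrete: for a constant predictor, MSE is minimized by the mean and MAE by the median. The data below are invented for illustration (a tight cluster plus one outlier).

```python
import numpy as np

# Hypothetical targets: a tight cluster plus one large outlier.
y = np.array([9.0, 10.0, 10.0, 11.0, 10.0, 60.0])

# For a constant predictor c, MSE is minimized by the mean and MAE by the median.
c_mse = y.mean()
c_mae = np.median(y)

res_mse = y - c_mse
res_mae = y - c_mae

print("MSE fit:", round(c_mse, 2), "residuals:", np.round(res_mse, 2))
print("MAE fit:", c_mae, "residuals:", np.round(res_mae, 2))
```

Same data, but the MSE residuals are all moderately large, while the MAE residuals are near zero except for the single outlier.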

### So What Do We Actually Do?

**It's an iterative refinement process:**

1. Make initial assumption (e.g., Gaussian) → choose loss (MSE)
2. Train model
3. Check residuals — do they match assumption?
4. **YES** → assumption was reasonable
5. **NO** → revise assumption, try different loss, repeat

### What We Check For

We're not seeking "truth" — we're seeking **consistency**:

- Residuals centered at 0
- Constant variance across fitted values
- No correlation with predictors
- Match assumed distribution (Q-Q plot)
- No systematic patterns in residual vs. fitted plot
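
A rough sketch of these checks in NumPy, on synthetic data where the Gaussian assumption is true by design. The data-generating line and the informal thresholds are assumptions for illustration; a real analysis would add a Q-Q plot and a residual-vs-fitted plot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: linear signal with constant-variance Gaussian noise.
x = rng.uniform(0, 10, size=500)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=500)

# Ordinary least squares fit (MSE loss) with an intercept column.
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ theta
residuals = y - fitted

# 1. Centered at 0 (holds by construction for OLS with an intercept).
print(f"mean residual:            {residuals.mean():+.4f}")
# 2. No linear correlation with the predictor (also by construction for OLS).
print(f"corr(residual, x):        {np.corrcoef(residuals, x)[0, 1]:+.4f}")
# 3. Rough constant-variance check: |residual| should not trend with fitted values.
print(f"corr(|residual|, fitted): {np.corrcoef(np.abs(residuals), fitted)[0, 1]:+.4f}")
# 4. Rough normality check: about 68% of standardized residuals fall within 1 sd.
z = residuals / residuals.std()
print(f"share within 1 sd:        {np.mean(np.abs(z) < 1):.3f}")
```

Checks 1 and 2 pass automatically for any least-squares fit with an intercept, which is the circularity described above in miniature: only checks 3 and 4 carry real information here.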

### The Practical Answer

> "The choice of loss function implicitly assumes a noise distribution, which influences fitted parameters and resulting residuals. This creates a circular dependency — we can't know the 'true' noise distribution a priori.
>
> The goal isn't finding 'objective truth' but achieving **consistency** between assumptions, residuals, and predictive performance."

# Bias-Variance Tradeoff

---

## Setup

We have a true function $f_{\text{true}}(x)$ that we want to learn.

We observe noisy data: $y = f_{\text{true}}(x) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$.

We train a model $\hat{f}(x)$ on a finite training set.

Our goal is to understand the expected prediction error:

$$\mathbb{E}[(y - \hat{f}(x))^2]$$

---

## What Does $\mathbb{E}[\cdot]$ Mean Here?

The expectation is taken **over different possible training datasets** (and over the noise $\varepsilon$ in the new observation $y$).

Imagine sampling many training sets from the same distribution, training your model on each, and getting different fitted functions $\hat{f}_1(x), \hat{f}_2(x), \hat{f}_3(x), \ldots$

Then $\mathbb{E}[\hat{f}(x)]$ is the **average prediction** at point $x$ across all possible training sets.

The model itself is a random variable — it depends on which training data you happened to get.

---

## Step 1: Decompose the Residual

Start with the residual at point $x$:

$$y - \hat{f}(x)$$

Since $y = f_{\text{true}}(x) + \varepsilon$:

$$y - \hat{f}(x) = \underbrace{(y - f_{\text{true}}(x))}_{\varepsilon \text{ (noise)}} + \underbrace{(f_{\text{true}}(x) - \hat{f}(x))}_{\text{model error}}$$

---

## Step 2: Decompose the Model Error

Add and subtract $\mathbb{E}[\hat{f}(x)]$:

$$f_{\text{true}}(x) - \hat{f}(x) = \underbrace{(f_{\text{true}}(x) - \mathbb{E}[\hat{f}(x)])}_{\text{Bias}} + \underbrace{(\mathbb{E}[\hat{f}(x)] - \hat{f}(x))}_{\text{Variance term}}$$

**Bias:** How far is the average model prediction from the truth?

**Variance term:** How far is this particular model from the average model?

---

## Step 3: Compute $\mathbb{E}[(y - \hat{f}(x))^2]$

Substitute $y = f_{\text{true}}(x) + \varepsilon$:

$$\mathbb{E}[(y - \hat{f}(x))^2] = \mathbb{E}[(\varepsilon + f_{\text{true}}(x) - \hat{f}(x))^2]$$

Expand the square:

$$= \mathbb{E}[\varepsilon^2] + 2\mathbb{E}[\varepsilon(f_{\text{true}}(x) - \hat{f}(x))] + \mathbb{E}[(f_{\text{true}}(x) - \hat{f}(x))^2]$$

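The cross-term vanishes because $\mathbb{E}[\varepsilon] = 0$ and the noise in the new observation is independent of $\hat{f}$ (which depends only on the training data), so $\mathbb{E}[\varepsilon(f_{\text{true}}(x) - \hat{f}(x))] = \mathbb{E}[\varepsilon] \cdot \mathbb{E}[f_{\text{true}}(x) - \hat{f}(x)] = 0$: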

$$= \sigma^2 + \mathbb{E}[(f_{\text{true}}(x) - \hat{f}(x))^2]$$

---

## Step 4: Decompose the Model Error Term

We need to expand $\mathbb{E}[(f_{\text{true}}(x) - \hat{f}(x))^2]$.

Add and subtract $\mathbb{E}[\hat{f}(x)]$ inside:

$$\mathbb{E}[(f_{\text{true}}(x) - \hat{f}(x))^2] = \mathbb{E}[(f_{\text{true}}(x) - \mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)] - \hat{f}(x))^2]$$

Define:

$$B = f_{\text{true}}(x) - \mathbb{E}[\hat{f}(x)] \quad \text{(bias, a constant)}$$

$$V = \mathbb{E}[\hat{f}(x)] - \hat{f}(x) \quad \text{(variance term, random)}$$

Expand $(B + V)^2$:

$$\mathbb{E}[(B + V)^2] = B^2 + 2B \cdot \mathbb{E}[V] + \mathbb{E}[V^2]$$

Compute $\mathbb{E}[V]$:

$$\mathbb{E}[V] = \mathbb{E}[\mathbb{E}[\hat{f}(x)] - \hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - \mathbb{E}[\hat{f}(x)] = 0$$

The cross-term vanishes, leaving:

$$= \underbrace{(f_{\text{true}}(x) - \mathbb{E}[\hat{f}(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]}_{\text{Variance}}$$

---

## The Final Result

$$\boxed{\mathbb{E}[(y - \hat{f}(x))^2] = \sigma^2 + \text{Bias}^2 + \text{Variance}}$$

where:

$$\text{Bias} = f_{\text{true}}(x) - \mathbb{E}[\hat{f}(x)]$$

$$\text{Variance} = \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]$$
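
The decomposition can be checked by simulation. The sketch below is only an illustration under assumed choices: a sine curve as the true function, noise standard deviation 0.3, a straight-line model (deliberately too simple), and a single evaluation point `x0`.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    return np.sin(2 * np.pi * x)  # hypothetical true function

sigma = 0.3                              # assumed noise standard deviation
x0 = 0.25                                # point where we evaluate the decomposition
n_train, n_sets, degree = 30, 5000, 1    # degree 1: a line, deliberately too simple

preds = np.empty(n_sets)    # f_hat(x0) for each training set
sq_err = np.empty(n_sets)   # (y0 - f_hat(x0))^2 with a fresh noisy y0 each time
for i in range(n_sets):
    x = rng.uniform(0, 1, n_train)
    y = f_true(x) + rng.normal(0, sigma, n_train)
    coeffs = np.polyfit(x, y, degree)
    preds[i] = np.polyval(coeffs, x0)
    y0 = f_true(x0) + rng.normal(0, sigma)
    sq_err[i] = (y0 - preds[i]) ** 2

bias_sq = (f_true(x0) - preds.mean()) ** 2
variance = preds.var()
decomposed = sigma**2 + bias_sq + variance
simulated = sq_err.mean()
print(f"sigma^2 + Bias^2 + Variance = {decomposed:.4f}")
print(f"simulated E[(y - f_hat)^2]  = {simulated:.4f}")
```

The two numbers should agree up to Monte Carlo error. Raising `degree` is expected to shrink `bias_sq` and grow `variance`.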

---

## Interpretation

| Term | Meaning | Can We Reduce It? |
|------|---------|-------------------|
| $\sigma^2$ | Irreducible noise in the data | No |
| $\text{Bias}^2$ | Systematic error — wrong on average | Yes — use more flexible models |
| $\text{Variance}$ | Instability across training sets | Yes — use simpler models or more data |

The tradeoff: reducing bias typically increases variance, and vice versa.

---

## Connection to MLE and Regularization

**MLE** fits the training data as closely as the model class allows. With a flexible model this keeps bias low but lets variance grow (overfitting).

**Regularization** intentionally increases bias to reduce variance:

$$\hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \left[ \text{Loss} + \lambda \cdot \text{Penalty} \right]$$

| Regularization | Prior | Penalty Term |
|----------------|-------|--------------|
| Ridge (L2) | Gaussian: $\theta \sim \mathcal{N}(0, \tau^2)$ | $\lambda \lVert \theta \rVert_2^2$ |
| Lasso (L1) | Laplace: $\theta \sim \text{Laplace}(0, b)$ | $\lambda \lVert \theta \rVert_1$ |

Regularization is MAP estimation — MLE with a prior belief that parameters should be small.
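
To see the bias-for-variance exchange numerically, here is a sketch with closed-form ridge regression. Everything about the setup is an assumption made for illustration: a fixed design matrix `X`, true weights of all ones, a test input `x0` of all ones, and training sets that differ only in their noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed-design linear problem: 20 points, 10 features.
n, d, sigma, n_sets = 20, 10, 1.0, 4000
X = rng.normal(0, 1, (n, d))        # inputs are held fixed across training sets
theta_true = np.ones(d)
x0 = np.ones(d)                     # test input
target = x0 @ theta_true            # f_true(x0)

# Each row of Y is the target vector of one simulated training set.
Y = X @ theta_true + rng.normal(0, sigma, (n_sets, n))

results = {}
for lam in (0.0, 5.0, 50.0):
    # Closed-form ridge solution (X^T X + lam I)^{-1} X^T y, for all training sets at once.
    A = X.T @ X + lam * np.eye(d)
    thetas = np.linalg.solve(A, X.T @ Y.T)      # shape (d, n_sets)
    preds = x0 @ thetas                         # f_hat(x0) per training set
    results[lam] = ((target - preds.mean()) ** 2, preds.var())
    print(f"lambda={lam:5.1f}  bias^2={results[lam][0]:.4f}  variance={results[lam][1]:.4f}")
```

With `lam = 0` this is plain least squares (no penalty). As `lam` grows, the squared bias at `x0` should rise while the variance falls.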

---
# What about Classification?

Normally, we assume a Bernoulli distribution for a binary label.

### Cross-entropy (assumes Bernoulli/Categorical):
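### !! This optimizes for the predicted probability $P(y=1 \mid x)$ (the conditional mean of $y$), not directly for the mode. The mode (most likely outcome) follows by thresholding $\hat{p}$ at 0.5.

#### !! Add the derivative that proves the expected loss is minimized at $\hat{p} = P(y=1 \mid x)$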

**MLE objective:** Maximize the likelihood of observed data. For a single Bernoulli observation:

$$P(y|\hat{p}) = \hat{p}^y \cdot (1-\hat{p})^{1-y}$$

Log-likelihood (easier to optimize):

$$\log P(y|\hat{p}) = y \cdot \log(\hat{p}) + (1-y) \cdot \log(1-\hat{p})$$

Negative log-likelihood (because we minimize losses):

$$-\log P(y|\hat{p}) = -\left[y \cdot \log(\hat{p}) + (1-y) \cdot \log(1-\hat{p})\right]$$

That's exactly the **binary cross-entropy** formula.
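
A quick numerical check of both points, as a sketch: the helper names `bce` and `bernoulli_likelihood` and the value `p = 0.7` are made up for the example. The first part confirms that BCE equals the negative log of the Bernoulli likelihood; the second scans candidate predictions to see where the expected BCE is smallest.

```python
import numpy as np

def bce(y, p_hat):
    """Binary cross-entropy for labels y in {0, 1} and predictions p_hat in (0, 1)."""
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def bernoulli_likelihood(y, p_hat):
    return p_hat**y * (1 - p_hat) ** (1 - y)

# 1. BCE is the negative log of the Bernoulli likelihood.
y, p_hat = 1, 0.8
print(bce(y, p_hat), -np.log(bernoulli_likelihood(y, p_hat)))

# 2. If the true P(y=1) is p, the expected BCE over y is minimized at p_hat = p.
p = 0.7  # assumed true probability
grid = np.linspace(0.01, 0.99, 99)
expected_bce = p * bce(1, grid) + (1 - p) * bce(0, grid)
best = grid[np.argmin(expected_bce)]
print(f"minimizer of expected BCE: {best:.2f}")
```

The minimizer lands on the true probability rather than on 0 or 1, which is why a model trained with cross-entropy outputs probabilities that can then be thresholded into a class.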

Alternatives exist: you could treat classification as a deterministic mapping and use hinge loss (SVMs), which has no probabilistic interpretation at all, just geometric margin maximization.

# Regularization

You're doing **MLE** when you just minimize the loss (MSE, MAE, cross-entropy).

Here's the connection:

**MSE ↔ Gaussian residuals:**
$$P(y|x, \theta) = \mathcal{N}(f_\theta(x), \sigma^2)$$
$$-\log P(y|x,\theta) = \frac{(y - f_\theta(x))^2}{2\sigma^2} + \text{const}$$

**MAE ↔ Laplace residuals:**
$$P(y|x, \theta) = \text{Laplace}(f_\theta(x), b)$$
$$-\log P(y|x,\theta) = \frac{|y - f_\theta(x)|}{b} + \text{const}$$

You're finding $\theta$ that maximizes $P(\text{data}|\theta)$ — that's MLE.
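
The correspondence can be verified numerically. In this sketch the targets, predictions, and the fixed scales `sigma` and `b` are all invented. It writes the two densities out explicitly and compares the average negative log-likelihood with a rescaled and shifted MSE or MAE.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(5.0, 2.0, size=200)       # hypothetical targets
pred = rng.normal(5.0, 1.0, size=200)    # hypothetical predictions f_theta(x)
sigma, b = 2.0, 1.5                      # assumed, fixed scale parameters
r = y - pred

# Densities written out explicitly, then turned into negative log-likelihoods.
gauss_pdf = np.exp(-r**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
laplace_pdf = np.exp(-np.abs(r) / b) / (2 * b)
gauss_nll = -np.log(gauss_pdf).mean()
laplace_nll = -np.log(laplace_pdf).mean()

mse, mae = np.mean(r**2), np.mean(np.abs(r))

# Mean NLL = (positive scale) * loss + constant, so both share the same minimizer in theta.
gauss_from_mse = mse / (2 * sigma**2) + np.log(sigma * np.sqrt(2 * np.pi))
laplace_from_mae = mae / b + np.log(2 * b)
print(f"Gaussian NLL {gauss_nll:.6f}  vs  rescaled MSE {gauss_from_mse:.6f}")
print(f"Laplace  NLL {laplace_nll:.6f}  vs  rescaled MAE {laplace_from_mae:.6f}")
```

Because the scale factor is positive and the constant does not depend on $\theta$, minimizing the loss and maximizing the likelihood pick the same parameters.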

---

**MAP** comes in when you add a **prior** on the weights $P(\theta)$:

$$\theta_{\text{MAP}} = \arg\max P(\theta|\text{data}) = \arg\max P(\text{data}|\theta) \cdot P(\theta)$$

In practice:
- **L2 regularization** = Gaussian prior on weights → MAP
- **L1 regularization** = Laplace prior on weights → MAP
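
For the L2 case the link can be checked directly. The sketch below assumes a linear model with Gaussian noise (sd `sigma`) and an independent Gaussian prior on each weight (sd `tau`); the data are synthetic. Under those assumptions the ridge solution with $\lambda = \sigma^2 / \tau^2$ should be a stationary point of the negative log posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-Gaussian setup.
n, d = 50, 3
sigma, tau = 1.0, 2.0                  # assumed noise sd and prior sd
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, sigma, n)

def neg_log_posterior_grad(theta):
    """Gradient of -log P(y | X, theta) - log P(theta); constants drop out."""
    grad_nll = -X.T @ (y - X @ theta) / sigma**2     # Gaussian likelihood term
    grad_prior = theta / tau**2                       # Gaussian prior N(0, tau^2 I)
    return grad_nll + grad_prior

# Ridge solution with lambda = sigma^2 / tau^2, where the penalty
# lambda * ||theta||^2 is added to the SUM of squared errors.
lam = sigma**2 / tau**2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The gradient of the negative log posterior vanishes at the ridge solution,
# so ridge coincides with the MAP estimate under these assumptions.
max_grad = np.abs(neg_log_posterior_grad(theta_ridge)).max()
print(f"largest gradient component at the ridge solution: {max_grad:.2e}")
```

A tighter prior (smaller `tau`) gives a larger `lam`, which means stronger shrinkage.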

## __So: vanilla training = MLE, regularized training = MAP.__ !!!!!!!!!

---
# Why do we normally use MSE and cross-entropy losses?

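Central Limit Theorem: when noise is the sum of many small independent effects, its distribution is approximately Gaussian, which makes MSE a reasonable default for regression.

What about Bernoulli? A binary outcome can only follow a Bernoulli distribution, so cross-entropy needs no such approximation: it is the MLE loss for any 0/1 label.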

When the Wrong Loss REALLY Hurts:

- Regression with outliers: MSE terrible, MAE much better

- Imbalanced classification: Standard cross-entropy bad, __weighted/focal loss better__ !!

- Count data (non-negative integers): MSE can predict negatives, Poisson loss respects structure

- Financial forecasting: Care about worst-case (95th percentile), not mean → quantile loss
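
The last case can be sketched in a few lines. The data here are synthetic (a log-normal sample standing in for skewed losses) and the predictor is just a constant. The point is that the pinball loss at `q = 0.95` is minimized near the 95th percentile, while MSE is minimized at the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def pinball_loss(y, pred, q):
    """Quantile (pinball) loss: q*e if e >= 0 else (q-1)*e, with e = y - pred."""
    e = y - pred
    return np.mean(np.where(e >= 0, q * e, (q - 1) * e))

# Hypothetical skewed targets (e.g. daily losses).
y = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# Search over constant predictions for the one that minimizes each loss.
grid = np.linspace(0, 10, 2001)
best_mse = grid[np.argmin([np.mean((y - c) ** 2) for c in grid])]
best_q95 = grid[np.argmin([pinball_loss(y, c, 0.95) for c in grid])]

print(f"MSE minimizer     {best_mse:.2f}  vs sample mean            {y.mean():.2f}")
print(f"pinball minimizer {best_q95:.2f}  vs sample 95th percentile {np.quantile(y, 0.95):.2f}")
```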

---
## Everything Connects Now

### The Complete Picture

1. **Data** = points in multidimensional space $(x_i, y_i)$

2. **Model** = function mapping inputs to outputs
   - Architecture defines the function class
   - Parameters $\theta$ define the specific function $f(x; \theta)$

3. **Loss** = how we measure prediction quality (derived from assumed distribution)
   - MSE $\leftarrow$ Gaussian noise
   - MAE $\leftarrow$ Laplace noise
   - Cross-entropy $\leftarrow$ Bernoulli/Categorical

4. **Training** = finding $\theta$ that minimizes loss via gradient descent

5. **Residuals** = what's left after the model does its job
   $$\underbrace{y - f(x; \theta)}_{\text{residual}} = \underbrace{(f_{\text{true}}(x) - f(x; \theta))}_{\text{model error}} + \underbrace{\varepsilon}_{\text{irreducible noise}}$$
   
   Should look random if the model is good.

6. **Bias-Variance Tradeoff** = complexity vs. stability
   $$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{\sigma^2}_{\text{noise}} + \underbrace{\text{Bias}^2}_{\text{underfitting}} + \underbrace{\text{Variance}}_{\text{overfitting}}$$

   | Model | Bias | Variance | Problem |
   |-------|------|----------|---------|
   | Too simple | High | Low | Underfitting |
   | Too complex | Low | High | Overfitting |
   | Regularized | ↑ slightly | ↓ significantly | Better tradeoff |

7. **Generalization** = does the learned function work on new data?
   - Validation set checks this
   - Overfitting = memorizing training points (high variance)
   - Underfitting = too simple function (high bias)

---
# Summary

**1. Can we train our model in the best possible way with a wrong loss function?**

*No — but it depends on what "optimal" means.*

---
## Loss Functions Reference

### Regression

| Loss | Formula | Assumed Distribution | Optimizes | When to Use |
|------|---------|---------------------|-----------|-------------|
| MSE | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | Gaussian | Mean | Default choice, no outliers |
| MAE | $\frac{1}{n}\sum\|y_i - \hat{y}_i\|$ | Laplace | Median | Outliers present, robust predictions |
| Huber | $\frac{1}{2}e^2$ if $\|e\| \leq \delta$, $\delta(\|e\| - \frac{1}{2}\delta)$ otherwise, with $e = y_i - \hat{y}_i$ | Gaussian core, heavy tails | Mean (robust) | Best of both: smooth + robust |
| Log-Cosh | $\sum\log(\cosh(\hat{y}_i - y_i))$ | *task-driven* | Mean (robust) | Smooth alternative to Huber |
| Quantile | $q\|e\|$ if $e \geq 0$, $(1-q)\|e\|$ otherwise, with $e = y_i - \hat{y}_i$ | *task-driven* | $q$-th quantile | Prediction intervals, asymmetric costs |
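
As an illustration of how Huber blends the two regimes, here is a small sketch using the standard piecewise definition (quadratic inside $\delta$, linear outside). The function name `huber` and the sample errors are made up for the example.

```python
import numpy as np

def huber(e, delta=1.0):
    """Huber loss: quadratic for |e| <= delta, linear beyond it."""
    abs_e = np.abs(e)
    return np.where(abs_e <= delta, 0.5 * e**2, delta * (abs_e - 0.5 * delta))

errors = np.array([-5.0, -1.0, -0.5, 0.0, 0.5, 1.0, 5.0])
print("squared/2:", 0.5 * errors**2)
print("absolute: ", np.abs(errors))
print("huber:    ", huber(errors))
```

For small errors the Huber values match the halved squared error; for large errors they grow linearly, like the absolute error.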

### Binary Classification

| Loss | Formula | Assumed Distribution | Optimizes | When to Use |
|------|---------|---------------------|-----------|-------------|
| Binary Cross-Entropy | $-[y\log(\hat{p}) + (1-y)\log(1-\hat{p})]$ | Bernoulli | $P(y=1 \| x)$ | Default for binary classification |
| Hinge | $\max(0, 1 - y \cdot \hat{y})$, with $y \in \{-1, +1\}$ | *task-driven* | Margin | SVMs, max-margin classifiers |
| Focal | $-\alpha(1-p_t)^\gamma \log(p_t)$, with $p_t = \hat{p}$ if $y=1$, else $1-\hat{p}$ | Bernoulli | $P(y=1 \| x)$ | Imbalanced datasets, hard examples |
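
A short sketch of how focal loss relates to binary cross-entropy. It assumes the common formulation in terms of $p_t$, the probability assigned to the true class, and uses a single scalar `alpha` as a simplification; the labels and predictions are invented.

```python
import numpy as np

def focal_loss(y, p_hat, gamma=2.0, alpha=1.0):
    """Focal loss for labels y in {0, 1}; p_t is the probability given to the true class."""
    p_t = np.where(y == 1, p_hat, 1 - p_hat)
    return -alpha * (1 - p_t) ** gamma * np.log(p_t)

def bce(y, p_hat):
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 1, 0, 0])
p_hat = np.array([0.95, 0.30, 0.10, 0.80])  # easy, hard, easy, hard

print("BCE:         ", np.round(bce(y, p_hat), 3))
print("focal (g=2): ", np.round(focal_loss(y, p_hat), 3))
print("focal (g=0): ", np.round(focal_loss(y, p_hat, gamma=0.0), 3))
```

With `gamma = 0` and `alpha = 1` the focal loss reduces to BCE. With `gamma > 0` the easy examples (high $p_t$) are down-weighted far more than the hard ones, which is what shifts attention toward hard or minority-class examples.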

### Multiclass Classification

| Loss | Formula | Assumed Distribution | Optimizes | When to Use |
|------|---------|---------------------|-----------|-------------|
| Cross-Entropy | $-\sum_{c} y_c \log(\hat{p}_c)$ | Categorical | $P(y=c \| x)$ | Default for multiclass |
| Label Smoothing CE | $-\sum_{c} y_c' \log(\hat{p}_c)$, $y_c' = (1-\alpha)y_c + \alpha/K$ | Categorical | Calibrated $P(y=c \| x)$ | Reduce overconfidence, improve calibration |


**Note:** *task-driven* losses are not derived from MLE of a probability distribution — they're designed to optimize a desired property (robustness, quantiles, margins) rather than assuming how noise is distributed.

The confusion arises because in classical statistics, you're fitting a distribution to data. The parameters you estimate are properties of that distribution.

In machine learning, you're fitting a function that predicts one variable from another. But you still need probability to quantify uncertainty, so you assume a distribution for the outputs given the predictions. The parameters you estimate are properties of the prediction function, not directly of the distribution.

The loss function encodes your distributional assumption about the residuals (for regression) or the outputs (for classification). Mean squared error assumes normal residuals. Cross-entropy assumes Bernoulli or categorical outputs. They're all doing MLE, just with different distributional assumptions.