# 1. Classification Loss Functions

## 1.1. Motivation for Probabilistic Outputs

In a classification task, we want to assign an input \$x\$ to one of \$K\$ classes.

A naïve idea is to design a neural network with \$K\$ output neurons and set the correct one to **1** and all others to **0**.
But this “hard assignment” has two problems:

* It does not express **uncertainty** (e.g., the model might assign 70% to dog and 30% to cat).
* It prevents us from interpreting outputs as **probabilities**, which are essential for downstream decisions.

---

## 1.2. From Logits to Probabilities: Softmax

To produce probabilities, we apply the **softmax** function to logits \$z\_k\$:

$$
Q(k) = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}.
$$

* Each \$Q(k)\in\[0,1]\$
* \$\sum\_k Q(k)=1\$
* The highest logit → highest probability

Thus, softmax converts arbitrary real values into a probability distribution.
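
As a minimal sketch of the formula above (pure Python, with arbitrary example logits), the function below subtracts the largest logit before exponentiating. That shift is a common numerical-stability trick and does not change the result; the function name `softmax` and the example values are illustrative choices, not part of the original notes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of real-valued logits."""
    m = max(logits)                            # shift so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]                      # arbitrary example values
probs = softmax(logits)
print(probs)                                   # largest logit gets the largest probability
print(sum(probs))                              # 1.0 up to floating-point rounding
```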

---

## 1.3. Why Not Use MSE?

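If we apply softmax and compare the predicted \$Q\$ with one-hot targets \$P\$ using the MSE cost \$C=\tfrac{1}{2}\sum\_k (Q(k)-P(k))^2\$, the gradient with respect to each logit picks up a softmax-derivative factor. Keeping only the diagonal term of the softmax Jacobian (the expression is exact for independent sigmoid outputs):

$$
\frac{\partial C}{\partial z_k} = (Q(k)-P(k)) \cdot Q(k)(1-Q(k)).
$$

The extra factor \$Q(k)(1-Q(k))\$ (the derivative of the output with respect to its own logit) makes gradients **tiny** whenever the output saturates near 0 or 1, even when the prediction is wrong.
→ **Vanishing gradients** → slow learning.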

This motivates using **Cross-Entropy Loss** instead.

---

## 1.4. Cross-Entropy Loss

Cross-entropy directly measures how close predicted \$Q\$ is to true \$P\$:

$$
\mathcal{L} = -\sum_{k=1}^K P(k)\log Q(k).
$$

For one-hot targets (true class \$y\$):

$$
\mathcal{L} = -\log Q(y).
$$

* If \$Q(y)\$ is high → small loss.
* If \$Q(y)\to 0\$ → loss \$\to\infty\$ (strong penalty).

**Gradient simplification:**

$$
\frac{\partial \mathcal{L}}{\partial z_k} = Q(k) - P(k).
$$

No vanishing softmax derivative → stable training.
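
The gradient formula can be sanity-checked numerically. The sketch below (pure Python, arbitrary example logits) compares the analytic expression `Q(k) - P(k)` with a central finite-difference estimate of the cross-entropy loss; the helper names and the step size `h` are assumptions made for illustration.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(z, p):
    """H(P, Q) with Q = softmax(z); terms with P(k) = 0 are skipped."""
    q = softmax(z)
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

z = [0.5, -1.2, 2.0]      # example logits (arbitrary)
p = [0.0, 1.0, 0.0]       # one-hot target, true class y = 1
q = softmax(z)

analytic = [qk - pk for qk, pk in zip(q, p)]   # Q(k) - P(k)

h = 1e-6                  # finite-difference step
numeric = []
for k in range(len(z)):
    z_plus = z[:]
    z_plus[k] += h
    z_minus = z[:]
    z_minus[k] -= h
    numeric.append((cross_entropy(z_plus, p) - cross_entropy(z_minus, p)) / (2 * h))

for a, n in zip(analytic, numeric):
    print(f"analytic={a:+.6f}  numeric={n:+.6f}")
```

The two columns should agree to several decimal places, and the analytic gradient sums to zero across classes because both `Q` and `P` sum to one.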

---

## 1.5. Information-Theoretic Roots (KL Divergence)

Cross-Entropy connects to **KL divergence**:

$$
D_{\mathrm{KL}}(P\|Q) = \sum_k P(k)\log\frac{P(k)}{Q(k)}.
$$

Expands to:

$$
D_{\mathrm{KL}}(P\|Q) = H(P,Q) - H(P).
$$

Since \$H(P)\$ depends only on the labels and not on the model parameters, minimizing \$D\_{\mathrm{KL}}\$ ≡ minimizing **cross-entropy**.
Thus, for one-hot targets, **cross-entropy loss = maximum likelihood learning** for categorical data.
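
The decomposition can be checked on a small example. This is a minimal sketch in pure Python; the two distributions are arbitrary illustrative values, and terms with `P(k) = 0` are skipped following the usual convention that `0 * log 0 = 0`.

```python
import math

def entropy(p):
    return sum(-pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    return sum(-pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.05, 0.9, 0.05]     # a soft target distribution (example values)
q = [0.2, 0.7, 0.1]       # a predicted distribution (example values)

lhs = kl_divergence(p, q)
rhs = cross_entropy(p, q) - entropy(p)
print(lhs, rhs)           # equal up to rounding

# For a one-hot target the entropy term vanishes, so KL and cross-entropy coincide.
p_onehot = [0.0, 1.0, 0.0]
print(entropy(p_onehot), kl_divergence(p_onehot, q), cross_entropy(p_onehot, q))
```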

---

## 1.6. Label Distributions in Practice

1. **Predicted (\$Q\$):** from logits via softmax.
2. **True (\$P\$):**

   * Normally one-hot (e.g., label “dog” → \$P=\[0,1,0]\$).
   * Can be soft:

     * **Label smoothing**: e.g. \$\[0.05,0.9,0.05]\$
     * **Knowledge distillation**: soft targets from a teacher model
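
A minimal sketch of how such targets can be built, assuming the common smoothing rule that mixes the one-hot vector with a uniform distribution over all `K` classes. The function names and the smoothing strength `eps = 0.15` are illustrative assumptions; that value happens to reproduce the smoothed vector shown in the list above.

```python
def one_hot(y, num_classes):
    """Hard target: probability 1 on class y, 0 elsewhere."""
    return [1.0 if k == y else 0.0 for k in range(num_classes)]

def smooth_labels(y, num_classes, eps):
    """Assumed rule: (1 - eps) * one_hot + eps / K for every class."""
    uniform = eps / num_classes
    return [(1.0 - eps) * t + uniform for t in one_hot(y, num_classes)]

print(one_hot(1, 3))                     # [0.0, 1.0, 0.0]
print(smooth_labels(1, 3, eps=0.15))     # approximately [0.05, 0.9, 0.05]
```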

---

## 1.7. Worked Example

* 3 classes: cat, dog, rabbit
* True label = dog → \$P=\[0,1,0]\$
* Model predicts \$Q=\[0.2,0.7,0.1]\$

Loss:

$$
H(P,Q) = -\log 0.7.
$$

---

## 1.8. Cross-Entropy Derivation from KL (Explicit)

Setup:

* Logits \$z\_k\$, softmax \$Q(k|x)\$.
* One-hot target \$P(k|x)=\mathbf{1}\[k=y]\$.

KL divergence:

$$
D_{\mathrm{KL}}(P\|Q) = -\sum_k P(k)\log Q(k) + \text{const}.
$$

With one-hot \$P\$:

$$
\mathcal{L}(x,y) = -\log Q(y|x).
$$

For a batch:

$$
\mathcal{L}_{\text{batch}} = -\frac{1}{n}\sum_{i=1}^n \log Q(y_i|x_i).
$$

Gradient:

$$
\frac{\partial \mathcal{L}}{\partial z_k} = Q(k) - \mathbf{1}[k=y].
$$

---

## 1.9. Negative Log-Likelihood (NLL)

The **Negative Log-Likelihood Loss (NLL)** is another way to write maximum likelihood:

* Likelihood:

$$
L(\theta) = \prod_{i=1}^n Q_\theta(y_i|x_i).
$$

* Log-likelihood:

$$
\log L(\theta) = \sum_{i=1}^n \log Q_\theta(y_i|x_i).
$$

* Negative log-likelihood:

$$
\mathcal{L}_{\text{NLL}} = -\sum_{i=1}^n \log Q(y_i|x_i).
$$

Example: for output \$(0.1,0.85,0.05)\$ with the second class as the target (index 1 when counting from 0) →
Loss = \$-\log(0.85)\$.

The log-likelihood is never positive, because \$\log(p)\leq 0\$ for \$0\<p\leq 1\$; negating it gives a non-negative loss to minimize.
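
One practical reason to work with the log-likelihood rather than the likelihood itself is numerical: a product of many probabilities underflows, while a sum of logs stays finite. The sketch below uses hypothetical probabilities assigned to the observed labels; none of the numbers come from a real model.

```python
import math

# Hypothetical probabilities Q(y_i | x_i) assigned to the observed labels.
probs_of_true_class = [0.85, 0.60, 0.95, 0.30]

likelihood = math.prod(probs_of_true_class)
log_likelihood = sum(math.log(p) for p in probs_of_true_class)
nll = -log_likelihood

print(likelihood, math.exp(-nll))             # the same number, computed two ways
print(nll, nll / len(probs_of_true_class))    # summed NLL and per-sample mean

# With many samples the raw product underflows, the sum of logs does not.
many = [0.5] * 2000
underflowed = math.prod(many)
stable_nll = -sum(math.log(p) for p in many)
print(underflowed)                            # 0.0 (floating-point underflow)
print(stable_nll)                             # 2000 * ln(2), about 1386.29
```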

---

## 1.10. Cross-Entropy vs NLL

**Cross-Entropy (general):**

$$
H(P,Q) = -\sum_k P(k)\log Q(k).
$$

* Works for **any target distribution \$P\$**.
* Handles one-hot, smoothed labels, or soft teacher distributions.

**NLL (special case):**

$$
\mathcal{L}_{\text{NLL}} = -\log Q(y).
$$

* Tied to **single-class labels** (one-hot).
* Directly maximizes likelihood of observed labels.

**When they coincide:**

If \$P\$ is one-hot → CE and NLL are identical.
That’s why in PyTorch:

* `nn.CrossEntropyLoss` = `nn.LogSoftmax + nn.NLLLoss`.

**When they differ:**

* CE works with soft targets (label smoothing, distillation).
* NLL requires a single observed class.
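
The PyTorch identity mentioned above can be checked directly. This is a small sketch on random logits (seeded for reproducibility); the tensor shapes and target indices are arbitrary example choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)              # batch of 4 samples, 3 classes (random example data)
targets = torch.tensor([0, 2, 1, 2])    # class indices

# One step: cross-entropy on raw logits.
ce = nn.CrossEntropyLoss()(logits, targets)

# Two steps: log-softmax over the class dimension, then NLL on the log-probabilities.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(ce.item(), nll.item())
print(torch.allclose(ce, nll))
```

Note that `nn.NLLLoss` expects log-probabilities, so feeding it raw logits or plain probabilities would silently compute something else.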

Refs: [1](https://www.youtube.com/watch?v=Pwgpl9mKars), [2](https://www.youtube.com/watch?v=PpFTODTztsU)

---

## 1.11. Examples

### Numeric Example

* True class: **Dog** ($y=1$)
* Classes: Cat (0), Dog (1), Rabbit (2)
* Predicted softmax distribution:

  $$
  Q = [0.1, \; 0.7, \; 0.2]
  $$
* Target one-hot vector:

  $$
  P = [0, \; 1, \; 0]
  $$

---

**1. Negative Log-Likelihood (NLL)**

NLL looks only at the probability of the **true class** (Dog = index 1):

$$
\text{NLL} = -\log Q(y) = -\log(0.7) \approx 0.357
$$

---

**2. Cross-Entropy**

Cross-Entropy is:

$$
H(P,Q) = -\sum_k P(k)\log Q(k)
$$

Since $P$ is one-hot ($P(1)=1$):

$$
H(P,Q) = -\log Q(1) = -\log(0.7) \approx 0.357
$$

With one-hot labels, **CE = NLL**.

---

**3. Mean Squared Error (MSE)**

MSE compares probability vectors:

$$
\text{MSE} = \tfrac{1}{2}\sum_k (Q(k)-P(k))^2
$$

$$
= \tfrac{1}{2}\big[(0.1-0)^2 + (0.7-1)^2 + (0.2-0)^2\big]
$$

$$
= \tfrac{1}{2}(0.01 + 0.09 + 0.04) = \tfrac{1}{2}(0.14) = 0.07
$$

---

**Comparison Table**

| Loss Type         | Formula                            | Value in example | Notes                                            |
| ----------------- | ---------------------------------- | ---------------- | ------------------------------------------------ |
| **NLL**           | $-\log Q(y)$                       | **0.357**        | Looks only at true class probability.            |
| **Cross-Entropy** | $-\sum_k P(k)\log Q(k)$            | **0.357**        | Equals NLL if $P$ is one-hot.                    |
| **MSE**           | $\tfrac{1}{2}\sum_k (Q(k)-P(k))^2$ | **0.07**         | Much smaller, but gradients weaker (can vanish). |

---

**4. What if labels are not one-hot?**

Suppose we use **label smoothing**:

$$
P = [0.05, \; 0.9, \; 0.05]
$$

* **Cross-Entropy**:

  $$
  H(P,Q) = -(0.05\log 0.1 + 0.9\log 0.7 + 0.05\log 0.2)
  $$

  $$
  \approx -\big(0.05\cdot(-2.303) + 0.9\cdot(-0.357) + 0.05\cdot(-1.609)\big)
  \approx 0.517
  $$

* **NLL**: Not defined (since it expects a single class label).

* **MSE**: Still computable, but less meaningful.

This is where **Cross-Entropy ≠ NLL**, and CE is more general.

---

This small example shows:

* With one-hot labels → **CE = NLL**.
* MSE gives a much smaller number (and weaker gradients).
* With soft labels → only **CE** works properly.
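
The numbers in this example can be reproduced with a few lines of pure Python. This is a sketch using the distributions given above; the helper names are illustrative, and `mse` follows the one-half sum-of-squares convention used in the text.

```python
import math

Q = [0.1, 0.7, 0.2]            # predicted distribution from the example
P_onehot = [0.0, 1.0, 0.0]     # true class: Dog (index 1)
P_smooth = [0.05, 0.9, 0.05]   # label-smoothed target

def cross_entropy(p, q):
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def nll(q, y):
    return -math.log(q[y])

def mse(p, q):
    # 1/2 * sum of squared differences, as in the text
    return 0.5 * sum((qk - pk) ** 2 for pk, qk in zip(p, q))

print(f"NLL            = {nll(Q, 1):.3f}")
print(f"CE (one-hot)   = {cross_entropy(P_onehot, Q):.3f}")
print(f"MSE (one-hot)  = {mse(P_onehot, Q):.3f}")
print(f"CE (smoothed)  = {cross_entropy(P_smooth, Q):.3f}")
```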

---

### Binary Classification Example: **MSE vs Cross-Entropy**


* True class: $y=0$ (so target distribution is $P = [1,0]$)
* Model logits:

  $$
  z = [3, -2]
  $$
* Apply softmax to get predicted probabilities:

  $$
  Q(0) = \frac{e^3}{e^3 + e^{-2}}, \quad Q(1) = \frac{e^{-2}}{e^3 + e^{-2}}
  $$

---

**Compute probabilities**

$$
e^3 \approx 20.085, \quad e^{-2} \approx 0.135
$$

$$
Q(0) = \frac{20.085}{20.085 + 0.135} \approx 0.9933
$$

$$
Q(1) = \frac{0.135}{20.085 + 0.135} \approx 0.0067
$$

So the model predicts class 0 with **99.3% probability**, class 1 with **0.7% probability**.

---

**MSE Loss**

MSE compares probabilities $Q$ to the target one-hot vector $P = [1,0]$:

$$
\text{MSE} = \tfrac{1}{2}\big[(Q(0)-1)^2 + (Q(1)-0)^2\big]
$$

$$
= \tfrac{1}{2}\big[(0.9933-1)^2 + (0.0067-0)^2\big]
$$

$$
= \tfrac{1}{2}\big[( -0.0067)^2 + (0.0067)^2\big] \approx \tfrac{1}{2}(0.000045+0.000045)
$$

$$
= 0.000045
$$

Pretty small loss.

---

**Gradient with MSE**

General gradient with MSE and softmax:

$$
\frac{\partial C}{\partial z_k} = (Q(k)-P(k)) \cdot Q(k)(1-Q(k))
$$

For class 0:

$$
(Q(0)-P(0)) \cdot Q(0)(1-Q(0)) = (0.9933-1)\cdot 0.9933\cdot(0.0067)
$$

$$
= (-0.0067)\cdot(0.00665) \approx -0.000044
$$

For class 1:

$$
(Q(1)-P(1)) \cdot Q(1)(1-Q(1)) = (0.0067-0)\cdot(0.0067)(0.9933)
$$

$$
= (0.0067)\cdot(0.00665) \approx 0.000044
$$

Gradient is **tiny** (\~$10^{-5}$).
Even though the model is slightly wrong, updates will be **very slow**.

---

**Cross-Entropy Loss**

Cross-entropy loss for true class $y=0$:

$$
\text{CE} = -\log Q(0) = -\log(0.9933) \approx 0.0067
$$

---

**Gradient with Cross-Entropy**

General gradient with CE + softmax:

$$
\frac{\partial C}{\partial z_k} = Q(k) - P(k)
$$

For class 0:

$$
Q(0)-P(0) = 0.9933-1 = -0.0067
$$

For class 1:

$$
Q(1)-P(1) = 0.0067-0 = 0.0067
$$

Gradient is **\~150× larger** than with MSE: the ratio is $1/\big(Q(k)(1-Q(k))\big) \approx 1/0.00665$.
So updates will be meaningful, not negligible.

---

**Key Insight**

* With **MSE**, gradients are dampened by the softmax derivative $Q(1-Q)$.
  → Tiny updates → slow learning.
* With **Cross-Entropy**, gradients simplify to $(Q-P)$.
  → Strong updates proportional to the prediction error.
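
The two gradient formulas from this example can be evaluated side by side. The sketch below is pure Python and uses exactly the expressions given in this section; the function name `gradients` is an illustrative choice.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def gradients(z, p):
    """Return (MSE gradient, CE gradient) per logit, using the formulas from this section."""
    q = softmax(z)
    grad_mse = [(qk - pk) * qk * (1.0 - qk) for qk, pk in zip(q, p)]
    grad_ce = [qk - pk for qk, pk in zip(q, p)]
    return grad_mse, grad_ce

z = [3.0, -2.0]       # logits from the example
p = [1.0, 0.0]        # true class is 0

grad_mse, grad_ce = gradients(z, p)
ratio = grad_ce[0] / grad_mse[0]      # equals 1 / (Q(0) * (1 - Q(0)))

print("MSE gradient:", grad_mse)
print("CE gradient: ", grad_ce)
print("ratio CE / MSE:", ratio)
```

Changing `z` to other values shows that the ratio grows as the softmax output moves closer to 0 or 1, which is the saturation effect described above.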


---

### Example: Model Is Confident but Wrong

**Setup**

* True class: $y=0$ → target $P = [1,0]$.
* Model logits:

  $$
  z = [-2, \; 3]
  $$
* This time the network strongly favors class 1 instead of class 0.

---

**Compute probabilities (softmax)**

$$
e^{-2} \approx 0.135, \quad e^{3} \approx 20.085
$$

$$
Q(0) = \frac{0.135}{0.135+20.085} \approx 0.0067
$$

$$
Q(1) = \frac{20.085}{0.135+20.085} \approx 0.9933
$$

So:

* Model says class 0 (true class) → **0.67%**
* Model says class 1 (wrong class) → **99.3%**

---

**MSE Loss**

$$
\text{MSE} = \tfrac{1}{2}\big[(Q(0)-1)^2 + (Q(1)-0)^2\big]
$$

$$
= \tfrac{1}{2}\big[(0.0067-1)^2 + (0.9933-0)^2\big]
$$

$$
= \tfrac{1}{2}\big[(-0.9933)^2 + (0.9933)^2\big] 
= \tfrac{1}{2}(0.9866+0.9866) \approx 0.9866
$$

So the loss is **≈ 1**. Not small, but also not huge.

---

**Gradient with MSE**

$$
\frac{\partial C}{\partial z_k} = (Q(k)-P(k)) \cdot Q(k)(1-Q(k))
$$

For class 0:

$$
(Q(0)-P(0)) \cdot Q(0)(1-Q(0)) = (0.0067-1)\cdot 0.0067\cdot 0.9933
$$

$$
= (-0.9933)\cdot (0.00665) \approx -0.0066
$$

For class 1:

$$
(Q(1)-P(1)) \cdot Q(1)(1-Q(1)) = (0.9933-0)\cdot 0.9933\cdot 0.0067
$$

$$
= (0.9933)\cdot (0.00665) \approx 0.0066
$$

Gradient is still **tiny (\~0.006)**.
Even though the model is *very wrong*, MSE barely pushes it to update.

---

**Cross-Entropy Loss**

For true class $y=0$:

$$
\text{CE} = -\log Q(0) = -\log(0.0067) \approx 5.01
$$

This is **much larger** than MSE’s \~1.
The model is heavily penalized for being confidently wrong.

---

**Gradient with Cross-Entropy**

$$
\frac{\partial C}{\partial z_k} = Q(k) - P(k)
$$

For class 0:

$$
Q(0)-P(0) = 0.0067 - 1 = -0.9933
$$

For class 1:

$$
Q(1)-P(1) = 0.9933 - 0 = 0.9933
$$

Gradient is **\~150× larger** than with MSE. 
The network will be strongly corrected.

---



**Key Insight**

* **MSE**: even if the model is **confident but wrong**, gradients remain small → weak correction.
* **Cross-Entropy**: the loss **explodes** when the model is confident in the wrong class → strong correction.

This behavior is exactly what we want:

* Reward confident correct predictions (loss → 0).
* Severely punish confident wrong predictions (loss → ∞).
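
To see how the two losses behave as the model becomes more confidently wrong, the sketch below sweeps a hypothetical logit margin. The logits are set to `[-margin, +margin]` with true class 0, so `margin = 2.5` corresponds to the logit gap of 5 used in the example above; the function name and the list of margins are assumptions for illustration.

```python
import math

def losses_when_wrong(margin):
    """True class is 0, logits are [-margin, +margin], so the model favors class 1."""
    q0 = 1.0 / (1.0 + math.exp(2.0 * margin))   # softmax probability of the true class
    q1 = 1.0 - q0
    ce = -math.log(q0)
    mse = 0.5 * ((q0 - 1.0) ** 2 + (q1 - 0.0) ** 2)
    return ce, mse

for margin in [0.5, 1.0, 2.5, 5.0, 10.0]:
    ce, mse = losses_when_wrong(margin)
    print(f"margin={margin:5.1f}  CE={ce:8.4f}  MSE={mse:.4f}")
```

Cross-entropy keeps growing roughly linearly with the margin, while MSE levels off just below 1.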



## 2. Regression Loss Functions

* **MSELoss**
* **L1Loss**
* **SmoothL1Loss (Huber)**

Scoring function: $|\mathbf{W}\cdot\mathbf{X}_i+b|$

### 2.1. Mean Absolute Error Loss
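MAE treats an outlier like any other point: the penalty grows only linearly with the residual, so the fit does not go out of its way to accommodate outliers, at the cost of an occasional poor prediction. It is also less convenient for gradient descent, because the absolute value is not differentiable at zero and its gradient has the same magnitude for small and large errors. Support vector regression uses a closely related loss (the ε-insensitive absolute error).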

### 2.2. Mean Squared Error Loss
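The main advantage of MSE is that its gradient is smooth and easy to compute, which suits gradient-based learning. It works well for regression on clean data, but because errors are squared, a few outliers can dominate the loss and lead to a poor model.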

### 2.3. Pseudo-Huber Loss
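Mean Absolute Error largely ignores outliers, while Mean Squared Error overreacts to them. If, say, 70% of your data lies on one side and 30% on the other, both can result in a poor model. The pseudo-Huber loss is a smooth compromise: it behaves like MSE for small residuals and like MAE for large ones.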
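
A small comparison on made-up residuals illustrates the difference. The pseudo-Huber form used here, `delta^2 * (sqrt(1 + (r / delta)^2) - 1)`, is the standard definition; the residual values and `delta = 1.0` are arbitrary illustrative choices.

```python
import math

def mae(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)

def mse(residuals):
    return sum(r * r for r in residuals) / len(residuals)

def pseudo_huber(residuals, delta=1.0):
    """Mean pseudo-Huber loss: delta^2 * (sqrt(1 + (r / delta)^2) - 1)."""
    total = sum(delta ** 2 * (math.sqrt(1.0 + (r / delta) ** 2) - 1.0) for r in residuals)
    return total / len(residuals)

clean = [0.1, -0.2, 0.05, 0.15]      # small residuals (made-up values)
with_outlier = clean + [10.0]        # the same residuals plus one large outlier

for name, fn in [("MAE", mae), ("MSE", mse), ("pseudo-Huber", pseudo_huber)]:
    print(f"{name:13s} clean={fn(clean):.4f}  with outlier={fn(with_outlier):.4f}")
```

The single outlier inflates MSE far more than the other two, while pseudo-Huber stays quadratic (and therefore smooth) for the small residuals.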

### 2.4. Welsch (Leclerc)
### 2.5. Geman McClure
### 2.6. Cauchy

## 3. Geodesic & Geometry-Aware Loss Functions

Geodesic loss is commonly used in deep learning when predicting **rotations** (e.g., in camera pose estimation, object orientation prediction, or visual odometry).
It measures the **shortest distance** between two rotations on the rotation manifold **SO(3)**, ensuring predictions respect rotation geometry.

---

### 3.1 Why Not Use L2 Loss?

Rotations live on a **curved manifold** (SO(3)), not Euclidean space.
Naive L2 loss between matrices or quaternions can:

* Misrepresent distances,
* Violate rotation constraints,
* Hinder convergence.

Instead, we use **geodesic distance**, which correctly measures the angular distance between two rotations.

---

### 3.2 Geodesic Loss with Rotation Matrices

Given:

* Ground truth: \$R\_{\text{gt}} \in SO(3)\$
* Prediction: \$R\_{\text{pred}} \in SO(3)\$

The **relative rotation** is:

$$
R_{\text{rel}} = R_{\text{gt}}^\top R_{\text{pred}}
$$

The **geodesic angle** is:

$$
\theta = \cos^{-1} \left( \frac{\text{trace}(R_{\text{rel}}) - 1}{2} \right)
$$

* \$\text{trace}(R\_{\text{rel}}) = 3\$ → identical rotations (\$\theta = 0\$)
* \$\text{trace}(R\_{\text{rel}}) = -1\$ → 180° difference (\$\theta = \pi\$)

This comes directly from Rodrigues’ rotation formula:
\$\text{trace}(R) = 1 + 2 \cos(\theta)\$
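
As a quick sanity check of the trace formula, the sketch below builds two rotations about the z-axis in pure Python and recovers the angle between them. The helper names and the test angles (30° and 75°) are illustrative assumptions.

```python
import math

def rot_z(angle):
    """3x3 rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def geodesic_angle(r_gt, r_pred):
    r_rel = matmul(transpose(r_gt), r_pred)                 # relative rotation
    trace = r_rel[0][0] + r_rel[1][1] + r_rel[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))    # guard against rounding
    return math.acos(cos_theta)

r_gt = rot_z(math.radians(30))
r_pred = rot_z(math.radians(75))

print(math.degrees(geodesic_angle(r_gt, r_pred)))   # close to 45
print(geodesic_angle(r_gt, r_gt))                   # close to 0
```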

---

#### PyTorch Implementation

```python
import torch

def geodesic_loss(R_pred, R_gt, eps=1e-7):
    """
    Computes geodesic loss (radians) between two rotation matrices.
    R_pred, R_gt: (B, 3, 3)
    """
    # Relative rotation R_gt^T R_pred, matching the formula above
    R_rel = torch.matmul(R_gt.transpose(1, 2), R_pred)
    trace = R_rel[:, 0, 0] + R_rel[:, 1, 1] + R_rel[:, 2, 2]
    cos_theta = (trace - 1) / 2
    # Clamp slightly inside [-1, 1]: acos has an infinite gradient at the endpoints
    cos_theta = torch.clamp(cos_theta, -1.0 + eps, 1.0 - eps)
    theta = torch.acos(cos_theta)
    return theta.mean()
```

---

### 3.3 Geodesic Loss with Quaternions

If rotations are represented as **unit quaternions** \$q\_{\text{gt}}, q\_{\text{pred}} \in \mathbb{R}^4\$:

$$
\theta = 2 \cos^{-1}\!\Big(|q_{\text{gt}} \cdot q_{\text{pred}}|\Big)
$$

* The dot product measures similarity.
* The **absolute value** accounts for the double-cover property (\$q\$ and \$-q\$ represent the same rotation).

---

#### PyTorch Implementation

```python
import torch

def geodesic_loss_quaternion(q_pred, q_gt, eps=1e-7):
    """
    Compute geodesic loss (radians) between predicted & ground-truth quaternions.
    q_pred, q_gt: (B, 4), normalized
    """
    dot = torch.sum(q_pred * q_gt, dim=1)
    dot = torch.abs(dot)                    # handle double cover (q and -q)
    dot = torch.clamp(dot, max=1.0 - eps)   # stability: keep acos away from 1
    theta = 2 * torch.acos(dot)
    return theta.mean()
```

---

### 3.4 Intuition & Special Cases

#### Case 1: Perfect Match

\$q\_{\text{pred}} = q\_{\text{gt}} \;\;\Rightarrow\;\; \theta = 0\$

#### Case 2: Opposite Quaternions (same rotation)

\$q\_{\text{pred}} = -q\_{\text{gt}} \;\;\Rightarrow\;\; \theta = 0\$
(because quaternions double-cover SO(3))

#### Case 3: 90° Difference

\$|q\_{\text{gt}} \cdot q\_{\text{pred}}| = \cos(45^\circ) = \tfrac{\sqrt{2}}{2}\$
\$\Rightarrow \theta = 90^\circ = \tfrac{\pi}{2}\$

---

### 3.5 Clarification: Opposite Rotations vs Opposite Quaternions

* \$q\$ vs \$-q\$ → same rotation (\$\theta = 0\$)
* Rotations of +90° and –90° around the same axis → different rotations, 180° apart (\$\theta = \pi\$)

---

### 3.6 Summary Table

| Case                                   | Dot Product             | Geodesic Distance  |
| -------------------------------------- | ----------------------- | ------------------ |
| \$q\_{\text{pred}} = q\_{\text{gt}}\$  | \$1\$                   | \$0\$              |
| \$q\_{\text{pred}} = -q\_{\text{gt}}\$ | \$-1\$ (abs → \$1\$)    | \$0\$              |
| 90° apart                              | \$\tfrac{\sqrt{2}}{2}\$ | \$\tfrac{\pi}{2}\$ |
| 180° apart                             | \$0\$                   | \$\pi\$            |
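
The rows of the table can be reproduced with a few lines of pure Python. This sketch assumes quaternions in `(w, x, y, z)` order and uses rotations about the z-axis as test cases; the helper names and angles are illustrative.

```python
import math

def quat_about_z(angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about the z-axis."""
    return (math.cos(angle / 2.0), 0.0, 0.0, math.sin(angle / 2.0))

def geodesic_angle(q_gt, q_pred):
    dot = abs(sum(a * b for a, b in zip(q_gt, q_pred)))   # abs handles q vs -q
    return 2.0 * math.acos(min(1.0, dot))                 # clamp guards against rounding

q = quat_about_z(math.radians(40))
neg_q = tuple(-c for c in q)

print(geodesic_angle(q, q))                                               # ~0
print(geodesic_angle(q, neg_q))                                           # ~0, same rotation
print(math.degrees(geodesic_angle(q, quat_about_z(math.radians(130)))))   # ~90
print(math.degrees(geodesic_angle(q, quat_about_z(math.radians(220)))))   # ~180

# +90 degrees vs -90 degrees about the same axis: different rotations, 180 degrees apart
plus, minus = quat_about_z(math.radians(90)), quat_about_z(math.radians(-90))
print(math.degrees(geodesic_angle(plus, minus)))                          # ~180
```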

---




## **PyTorch Example** 


####  `nn.CrossEntropyLoss`

* **Expected input**: raw logits (no softmax).
* **Expected target**: class index (integer), tensor of indices for each item in the batch.
* Internally:

  * Applies `F.log_softmax(logits, dim=1)`
  * Then applies `nn.NLLLoss`.

So `CrossEntropyLoss = log_softmax + NLLLoss`.

Example:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Batch of size 2, 3 classes → logits
outputs = torch.tensor([[0.1, 0.3, 0.2], [0.4, 0.1, 0.2]])  # shape (2,3)

# True class index (not one-hot!)
label = torch.tensor([2, 0])  # shape (2,)

loss = criterion(outputs, label)
```


---

####  **label smoothing**

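* With integer targets, `nn.CrossEntropyLoss` expects class indices, not one-hot vectors. Since PyTorch 1.10 it also accepts class probabilities as the target (a float tensor with the same shape as the logits), which covers one-hot and smoothed vectors.

* To apply label smoothing to hard labels, use the `label_smoothing` argument: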

```python
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```

Under the hood, it converts the hard label into a smoothed distribution and computes generalized CE.
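
A rough check of that description, assuming PyTorch 1.10 or later and assuming the smoothing mixes the one-hot target with a uniform distribution over all classes (which is how the PyTorch documentation describes `label_smoothing`). The manual computation below is a sketch for comparison, not PyTorch's internal code.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.1, 0.3, 0.2], [0.4, 0.1, 0.2]])   # batch of 2, 3 classes
labels = torch.tensor([2, 0])                               # hard class indices
eps = 0.1
num_classes = logits.size(1)

# Built-in label smoothing applied to hard labels.
builtin = F.cross_entropy(logits, labels, label_smoothing=eps)

# Manual version: build the smoothed targets, then apply the general CE formula.
one_hot = F.one_hot(labels, num_classes).float()
smoothed = (1.0 - eps) * one_hot + eps / num_classes
manual = -(smoothed * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(builtin.item(), manual.item())
print(torch.allclose(builtin, manual))
```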

#### Loss tensor and reduction

By default, `nn.CrossEntropyLoss()` computes the **mean** of the per-sample losses:

```python
criterion = nn.CrossEntropyLoss()                   # default: reduction='mean'
criterion = nn.CrossEntropyLoss(reduction="mean")   # same as the default, stated explicitly
criterion = nn.CrossEntropyLoss(reduction="sum")    # sum of per-sample losses instead
```

That means:

```python
# loss.item() == (sum of per-sample losses) / batch_size
```


```python
criterion = nn.CrossEntropyLoss(reduction="sum")
loss = criterion(outputs, label)
loss.item()  # 2.041774034500122
```

```python
criterion = nn.CrossEntropyLoss(reduction="mean")
loss = criterion(outputs, label)
loss.item()  # 1.020887017250061
```

**If You Want Total Loss Over Epoch**

You can choose:

* **Use `reduction='sum'`** and accumulate directly
* **Stick with mean**, but multiply by batch size to compute total loss:

```python
loss = criterion(outputs_batch, labels_batch) # default: reduction='mean'
running_loss += loss.item() * inputs_batch.size(0)  # batch_size
```

This is useful when you want the total loss across all samples in the epoch.

---

**Best Practice for Epoch-Level Loss:**

```python
epoch_loss = 0.0
total_samples = 0

for inputs_batch, labels_batch in train_loader:
    outputs_batch = model(inputs_batch)
    loss = criterion(outputs_batch, labels_batch)
    
    batch_size = inputs_batch.size(0)
    epoch_loss += loss.item() * batch_size  # accumulate total loss
    total_samples += batch_size

# average loss for the epoch
epoch_loss = epoch_loss / total_samples
```
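
Weighting each batch mean by its batch size matters whenever batches are not all the same size, for example when the last batch is smaller. The sketch below uses made-up per-sample losses to show the difference between averaging the batch means and the properly weighted average.

```python
# Hypothetical per-sample losses for 5 samples, split into batches of 4 and 1.
per_sample = [0.2, 0.4, 0.6, 0.8, 3.0]
batches = [per_sample[:4], per_sample[4:]]

# What reduction='mean' would return for each batch.
batch_means = [sum(b) / len(b) for b in batches]

# Naive: average of the batch means (over-weights the small last batch).
naive = sum(batch_means) / len(batch_means)

# Weighted: multiply each batch mean by its batch size, divide by the sample count.
weighted = sum(m * len(b) for m, b in zip(batch_means, batches)) / len(per_sample)

true_mean = sum(per_sample) / len(per_sample)

print(naive, weighted, true_mean)   # naive differs; weighted matches the true mean
```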


###  `torch.max(outputs, 1)` or `outputs.max(1)`

```python
_, predicted = outputs.max(1)
# or 
_, predicted = torch.max(outputs, 1)
```
This is the **standard idiom for multi-class classification**, where `outputs` has shape `[B, C]` and holds **raw logits** per class.

In our case, a batch of size 2 with 3 classes:

```python
outputs = torch.tensor([[0.1, 0.3, 0.2], [0.4, 0.1, 0.2]])  # shape (2,3)
```

**What it does:**

* `outputs`: Tensor of shape `[B, C]` (batch size × number of classes). 

* `torch.max(outputs, 1)` returns a tuple:

  * `values` (that's the `_`) : max logit per sample → not used 
  * `indices` (that's the `predicted`): **predicted class index** for each sample.


```python
_, predicted = outputs.max(1)
print("predicted", predicted)  # predicted tensor([1, 0])
```


### `predicted.eq(label).sum().item()`

```python
correct = 0
correct += predicted.eq(label).sum().item()
# or
# correct += (predicted == label).sum().item()
print(correct)  # 1
```

This compares the predicted class indices with the true labels and counts the number of correct predictions.


In a training loop, the line

```python
total_loss += loss.item() * images.size(0)
```

**accumulates the total loss over the entire dataset**; it does not average yet.

---

To compute the **epoch average**, you divide at the end:

```python
epoch_loss = total_loss / dataset_size  # dataset_size = len(train_dataset)
```

**Summary**

| Variable         | Type   | Meaning                           |
| ---------------- | ------ | --------------------------------- |
| `loss`           | Tensor | Mean loss for current batch       |
| `loss.item()`    | float  | Scalar value of batch mean loss   |
| `images.size(0)` | int    | Number of samples in the batch    |
| `total_loss`     | float  | Accumulated (summed) batch losses |
| `epoch_loss`     | float  | Average loss over the whole epoch |

---