### Loss tensor

`nn.CrossEntropyLoss()` by default computes the **mean** of the per-sample losses:

```python
loss = torch.nn.CrossEntropyLoss()  # default: reduction='mean'
```

That means:

```python
# loss.item() == sum(per_sample_losses) / batch_size
```

So if your batch has 10 samples, and the individual losses are:

```python
[1.1, 0.8, 1.0, ..., 0.9]  # 10 values
```

Then the returned `loss` will be the **average**, e.g. `0.95`.
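
The relationship between the three `reduction` modes can be checked directly. The following is a minimal sketch on random data; the tensor names (`logits`, `targets`, `per_sample`) are illustrative and not part of the original notes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(10, 4)            # batch of 10 samples, 4 classes
targets = torch.randint(0, 4, (10,))   # class indices

per_sample = nn.CrossEntropyLoss(reduction="none")(logits, targets)  # shape [10]
mean_loss = nn.CrossEntropyLoss()(logits, targets)                   # scalar (default)
sum_loss = nn.CrossEntropyLoss(reduction="sum")(logits, targets)     # scalar

print(per_sample.shape, mean_loss.item(), sum_loss.item())
```

The default loss should match `per_sample.mean()`, and the `'sum'` variant should match `per_sample.sum()`.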

---

**If You Want Total Loss Over Epoch**

You can choose:

* **Use `reduction='sum'`** and accumulate directly
* **Stick with mean**, but multiply by batch size to compute total loss:

```python
loss = criterion(outputs_batch, labels_batch)
running_loss += loss.item() * inputs_batch.size(0)  # batch_size
```

This is useful when you want total loss across all samples in the epoch.

---

**Best Practice for Epoch-Level Loss:**

```python
epoch_loss = 0.0
total_samples = 0

for inputs_batch, labels_batch in train_loader:
    outputs_batch = model(inputs_batch)
    loss = criterion(outputs_batch, labels_batch)
    
    batch_size = inputs_batch.size(0)
    epoch_loss += loss.item() * batch_size  # accumulate total loss
    total_samples += batch_size

# average loss for the epoch
epoch_loss = epoch_loss / total_samples
```
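
Weighting by batch size matters when the last batch is smaller than the others. A small arithmetic sketch with made-up batch means and sizes (both lists are hypothetical) shows the difference between averaging batch means and averaging over samples:

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller).
batch_means = [1.0, 1.0, 4.0]
batch_sizes = [10, 10, 5]

# Naive: average of batch means, which over-weights the small last batch.
naive = sum(batch_means) / len(batch_means)

# Sample-weighted: total loss divided by total number of samples.
weighted = sum(m * n for m, n in zip(batch_means, batch_sizes)) / sum(batch_sizes)

print(naive, weighted)
```

The sample-weighted value is the true per-sample average for the epoch; the naive value drifts from it whenever batch sizes differ.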


```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, random_split


class MyCustomDataSet(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]


torch.manual_seed(42)
sample_size = 100
number_of_classes = 4
data_dim = 2
data = torch.randn(sample_size, data_dim)
labels = torch.randint(0, number_of_classes, (sample_size,))

dataset = MyCustomDataSet(data=data, labels=labels)

# 75% train, 15% validation, remainder test
train_size = int(0.75 * len(dataset))
val_size = int(0.15 * len(dataset))
test_size = len(dataset) - train_size - val_size

generator = torch.Generator().manual_seed(42)

train_dataset, val_dataset, test_dataset = random_split(
    dataset, [train_size, val_size, test_size], generator=generator)

train_loader = DataLoader(train_dataset, batch_size=10, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=10, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=10, shuffle=True)


class MyCustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 10)
        self.fc2 = nn.Linear(10, 4)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)  # raw logits, shape [B, 4]
        return x


criterion = nn.CrossEntropyLoss()
model = MyCustomModel()

optimizer = optim.AdamW(params=model.parameters(),
                        lr=1e-3, weight_decay=1e-2)


epochs = 10

for epoch in range(epochs):
    model.train()
    running_loss = 0.0

    for batch_idx, (inputs_batch, labels_batch) in enumerate(train_loader):
        print("inputs_batch:\n", inputs_batch)
        print("labels_batch:\n", labels_batch)

        batch_size = inputs_batch.size(0)

        print("------------------------------------------------------")

        outputs_batch = model(inputs_batch)
        optimizer.zero_grad()

        loss = criterion(outputs_batch, labels_batch)
        print("Loss:", loss.item())  # Python float
        print("Loss:", loss)         # 0-dim tensor that still carries grad_fn

        loss.backward()
        optimizer.step()

        print("------------------------------------------------------")
        print("outputs_batch.shape: ", outputs_batch.shape)

        print("outputs_batch:\n", outputs_batch)
        print("------------------------------------------------------")
        print("labels_batch:\n", labels_batch)
        _, predicted = torch.max(outputs_batch, 1)

        print("predicted:\n", predicted)

        print((predicted == labels_batch).sum().item())

        running_loss += loss.item() * batch_size
        break  # stop after the first batch (inspection only)
    break  # stop after the first epoch (inspection only)
```

```
inputs_batch:
 tensor([[ 0.0249, -0.3460],
        [ 2.2181,  0.5232],
        [ 0.3125, -0.0335],
        [ 0.1748, -1.0939],
        [ 0.0109, -0.3387],
        [-0.6867,  0.6368],
        [ 0.9007, -2.1055],
        [ 1.0608,  0.2083],
        [ 1.0950,  0.3399],
        [-0.7521,  1.6487]])
labels_batch:
 tensor([2, 2, 3, 3, 2, 0, 0, 3, 0, 2])
------------------------------------------------------
Loss: 1.4052388668060303
Loss: tensor(1.4052, grad_fn=<NllLossBackward0>)
------------------------------------------------------
outputs_batch.shape:  torch.Size([10, 4])
outputs_batch:
 tensor([[-0.2916, -0.1549,  0.1629, -0.0074],
        [ 0.2972,  0.5286,  0.1931, -0.1649],
        [-0.2981, -0.0999,  0.1152, -0.0366],
        [-0.1561, -0.2563,  0.3386,  0.0204],
        [-0.2915, -0.1556,  0.1632, -0.0069],
        [-0.3373, -0.1296,  0.0648,  0.0202],
        [-0.0698, -0.2705,  0.3611,  0.1934],
        [-0.0946,  0.1462,  0.1615, -0.1537],
        [-0.0589,  0.1862,  0.1649, -0.1
```

## Common Classification Loss Functions




| Task Type                  | PyTorch Loss Function             | Target Shape                          | Model Output                   |
| -------------------------- | --------------------------------- | ------------------------------------- | ------------------------------ |
| Binary classification      | `nn.BCEWithLogitsLoss`            | `[B]` or `[B, 1]` (float, same shape as output) | Raw logits, `[B]` or `[B, 1]`  |
| Multi-class (single label) | `nn.CrossEntropyLoss`             | `[B]` (class indices)                 | Raw logits, `[B, C]`           |
| Multi-label                | `nn.BCEWithLogitsLoss`            | `[B, C]` (float 0/1)                  | Raw logits, `[B, C]`           |
| Probabilistic models       | `nn.NLLLoss` (with `log_softmax`) | `[B]` (class indices)                 | Log-probabilities, `[B, C]`    |
| Custom metric learning     | `nn.TripletMarginLoss`, etc.      | depends                               | depends                        |

---
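
The shape and dtype conventions in the table can be exercised on random tensors. This is a minimal sketch; `B`, `C` and all tensors are illustrative values, not data from the notes.

```python
import torch
import torch.nn as nn

B, C = 8, 3  # hypothetical batch size and number of classes
torch.manual_seed(0)

# Binary: one logit per sample, float targets of the same shape.
binary_loss = nn.BCEWithLogitsLoss()(
    torch.randn(B), torch.randint(0, 2, (B,)).float())

# Multi-class: [B, C] logits, integer class indices of shape [B].
multiclass_loss = nn.CrossEntropyLoss()(
    torch.randn(B, C), torch.randint(0, C, (B,)))

# Multi-label: [B, C] logits, float 0/1 targets of shape [B, C].
multilabel_loss = nn.BCEWithLogitsLoss()(
    torch.randn(B, C), torch.randint(0, 2, (B, C)).float())

# NLLLoss expects log-probabilities, so apply log_softmax first.
log_probs = torch.log_softmax(torch.randn(B, C), dim=1)
nll_loss = nn.NLLLoss()(log_probs, torch.randint(0, C, (B,)))

print(binary_loss, multiclass_loss, multilabel_loss, nll_loss)
```

With the default `reduction='mean'`, each call returns a 0-dimensional tensor.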

### Inside the Training/Eval Loop


```python
# TRAINING
model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    outputs = model(images)         # shape [B, C] for multi-class
    loss = criterion(outputs, labels)  # criterion = e.g., nn.CrossEntropyLoss()
    loss.backward()
    optimizer.step()

# EVALUATION
model.eval()
with torch.no_grad():
    for images, labels in val_loader:
        outputs = model(images)
        loss = criterion(outputs, labels)
        # collect predictions for accuracy, precision, etc.
```

---

The **loss** is a scalar value computed by comparing the model’s output (predictions) with the ground truth (labels). The choice of loss function depends on the task type (binary, multi-class, multi-label).



* `images.size(0)` is the **batch size**, i.e., the number of samples in the current mini-batch.
* `loss.item()` gives the **scalar average loss per sample** in that batch (computed by PyTorch internally).
* So `loss.item() * images.size(0)` gives the **total loss for that batch** (i.e., sum over all samples in the batch).

**Therefore:**

```python
total_loss += loss.item() * images.size(0)
```

is **accumulating the total loss over the entire dataset**, not averaging yet.

---

**Final Average Loss for the Epoch**

To compute the **epoch average**, you divide at the end:

```python
epoch_loss = total_loss / dataset_size  # dataset_size = len(train_dataset)
```

###  Summary

| Variable         | Type   | Meaning                           |
| ---------------- | ------ | --------------------------------- |
| `loss`           | Tensor | Mean loss for current batch       |
| `loss.item()`    | float  | Scalar value of batch mean loss   |
| `images.size(0)` | int    | Number of samples in the batch    |
| `total_loss`     | float  | Accumulated (summed) batch losses |
| `epoch_loss`     | float  | Average loss over the whole epoch |



---

###  Best Practices for Loss Calculation

1. **Use correct output format:**

   * For `CrossEntropyLoss`: model must output *raw logits* (no softmax needed).
   * For `BCEWithLogitsLoss`: model must output *raw logits* (no sigmoid).

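2. **Call `.item()` only where you need a Python number** (logging or accumulating a running total). It returns a plain float detached from the graph and, on a GPU, forces a synchronization. Never accumulate the `loss` tensor itself, because that keeps every batch's computation graph alive.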

   ```python
   total_loss += loss.item() * images.size(0)  # accumulate loss for avg
   ```

3. **Track batch-wise loss properly** (for logging or early stopping):

   ```python
   running_loss += loss.item() * batch_size
   epoch_loss = running_loss / dataset_size
   ```

4. **Early stopping/monitoring:**

   * Use validation loss to monitor overfitting.
   * Save the best model using:

     ```python
     if val_loss < best_val_loss:
         best_val_loss = val_loss  # remember the new best value
         torch.save(model.state_dict(), "best_model.pth")
     ```
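
The best-model check above is often combined with a patience counter. The sketch below simulates this on a hypothetical list of validation losses; `patience` and the loss values are assumptions chosen for illustration, and the checkpoint call is only indicated in a comment.

```python
# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]

patience = 2                      # assumed: epochs to wait without improvement
best_val_loss = float("inf")
best_epoch = -1
epochs_without_improvement = 0
stopped_at = None

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_epoch = epoch
        epochs_without_improvement = 0
        # In a real loop: torch.save(model.state_dict(), "best_model.pth")
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            stopped_at = epoch
            break

print(best_epoch, best_val_loss, stopped_at)
```

Note the trade-off: with a small patience the loop stops before the later improvement at the last epoch is ever seen.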

---

### Custom Loss 

Example: Label smoothing for classification

```python
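import torch
import torch.nn as nn
import torch.nn.functional as F


class LabelSmoothingCrossEntropy(nn.Module):
    """Cross-entropy against a smoothed target distribution."""

    def __init__(self, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing

    def forward(self, pred, target):
        # pred: [B, C] raw logits; target: [B] class indices
        n_class = pred.size(1)
        log_probs = F.log_softmax(pred, dim=1)
        with torch.no_grad():
            # Spread `smoothing` evenly over the wrong classes,
            # and put the remaining mass on the true class.
            true_dist = torch.full_like(log_probs, self.smoothing / (n_class - 1))
            true_dist.scatter_(1, target.unsqueeze(1), 1.0 - self.smoothing)
        return torch.mean(torch.sum(-true_dist * log_probs, dim=1))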
```
---
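
Two properties of the label-smoothing loss above are easy to verify: each smoothed target row sums to 1, and with `smoothing = 0` the loss reduces to ordinary cross-entropy. The helper `smoothed_targets` below is a hypothetical standalone function written for this check, not part of the class.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(target, n_class, smoothing):
    """1 - smoothing on the true class, smoothing / (n_class - 1) elsewhere."""
    dist = torch.full((target.size(0), n_class), smoothing / (n_class - 1))
    dist.scatter_(1, target.unsqueeze(1), 1.0 - smoothing)
    return dist

torch.manual_seed(0)
logits = torch.randn(6, 4)
target = torch.randint(0, 4, (6,))

dist = smoothed_targets(target, n_class=4, smoothing=0.1)
row_sums = dist.sum(dim=1)  # every row should sum to 1

# With smoothing = 0 the target is one-hot and the loss is plain cross-entropy.
hard = smoothed_targets(target, n_class=4, smoothing=0.0)
loss_no_smoothing = torch.mean(torch.sum(-hard * F.log_softmax(logits, dim=1), dim=1))
reference = F.cross_entropy(logits, target)

print(row_sums, loss_no_smoothing.item(), reference.item())
```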

### Mini Working Example (PyTorch)

* Model definition
* Training & validation loop
* Calculation of:

  * Loss
  * Accuracy
  * Precision, Recall, F1-score (per epoch)

We'll use **multi-class classification** as it's more general (e.g., 3 classes).

---


```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, random_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Set seed for reproducibility
torch.manual_seed(42)

# 🔹 Create synthetic dataset: 100 samples, 5 features, 3 classes
num_samples = 100
num_features = 5
num_classes = 3

X = torch.randn(num_samples, num_features)
y = torch.randint(0, num_classes, (num_samples,))  # Labels: 0, 1, 2

dataset = TensorDataset(X, y)
train_set, val_set = random_split(dataset, [80, 20])

train_loader = DataLoader(train_set, batch_size=16)
val_loader = DataLoader(val_set, batch_size=16)

# 🔹 Define a simple model
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)
        
    def forward(self, x):
        return self.fc(x)

model = SimpleClassifier(num_features, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
```

---

###  Training + Validation with Metrics

```python
for epoch in range(5):
    model.train()
    train_loss = 0.0
    train_preds, train_labels = [], []

    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(x_batch)  # shape [B, C]
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()

        train_loss += loss.item() * y_batch.size(0)

        _, predicted = torch.max(outputs, 1)
        train_preds.extend(predicted.cpu().numpy())
        train_labels.extend(y_batch.cpu().numpy())

    # Calculate train metrics
    train_loss /= len(train_set)
    train_acc = sum(torch.tensor(train_preds) == torch.tensor(train_labels)) / len(train_labels)
    train_precision = precision_score(train_labels, train_preds, average='macro')
    train_recall = recall_score(train_labels, train_preds, average='macro')
    train_f1 = f1_score(train_labels, train_preds, average='macro')

    # 🔹 Validation
    model.eval()
    val_loss = 0.0
    val_preds, val_labels = [], []

    with torch.no_grad():
        for x_batch, y_batch in val_loader:
            outputs = model(x_batch)
            loss = criterion(outputs, y_batch)
            val_loss += loss.item() * y_batch.size(0)

            _, predicted = torch.max(outputs, 1)
            val_preds.extend(predicted.cpu().numpy())
            val_labels.extend(y_batch.cpu().numpy())

    val_loss /= len(val_set)
    val_acc = sum(torch.tensor(val_preds) == torch.tensor(val_labels)) / len(val_labels)
    val_precision = precision_score(val_labels, val_preds, average='macro')
    val_recall = recall_score(val_labels, val_preds, average='macro')
    val_f1 = f1_score(val_labels, val_preds, average='macro')

    # 🔸 Print results
    print(f"\nEpoch {epoch + 1}")
    print(f"Train Loss: {train_loss:.4f} | Acc: {train_acc:.2f} | Prec: {train_precision:.2f} | Rec: {train_recall:.2f} | F1: {train_f1:.2f}")
    print(f"Val   Loss: {val_loss:.4f} | Acc: {val_acc:.2f} | Prec: {val_precision:.2f} | Rec: {val_recall:.2f} | F1: {val_f1:.2f}")
```

---

This code will:

* Train a simple linear classifier
* Evaluate on both training and validation sets
* Compute and print:

  * Loss (mean)
  * Accuracy
  * Precision, Recall, F1-score (macro-averaged across classes)

---
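
To make the meaning of `average='macro'` concrete, the metrics can be computed by hand: per-class precision, recall and F1 are calculated first and then averaged with equal weight per class. The labels and predictions below are hypothetical.

```python
# Hypothetical labels and predictions for a 3-class problem.
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
classes = sorted(set(labels))

precisions, recalls, f1s = [], [], []
for c in classes:
    tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
    fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
    fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)

# Macro average: every class counts equally, regardless of its size.
macro_precision = sum(precisions) / len(classes)
macro_recall = sum(recalls) / len(classes)
macro_f1 = sum(f1s) / len(classes)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)

print(accuracy, macro_precision, macro_recall, macro_f1)
```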



The prediction step used in the loops above (with the model output named `preds` here instead of `outputs`):

```python
_, predicted = torch.max(preds, 1)
```

is **standard for multi-class classification** where `preds` is of shape `[B, C]`, with **raw logits** per class.

The sections below explain what it does and contrast it with binary classification.

---

###  `torch.max(preds, 1)` — what it does:

* `preds`: Tensor of shape `[B, C]` (batch size × number of classes).
* `torch.max(preds, 1)` returns a tuple:

  * `values`: max logit per sample → not used (that's the `_`)
  * `indices`: **predicted class index** for each sample.

So:

```python
_, predicted = torch.max(preds, 1)  # predicted shape: [B]
```

Then:

```python
correct += (predicted == y).sum().item()
```

compares predicted class indices to true labels and counts the number of correct predictions.

---
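
A small concrete case of the steps above, using made-up logits for three samples and four classes. `argmax` is shown alongside as an equivalent way to get the indices.

```python
import torch

# Hypothetical logits for a batch of 3 samples and 4 classes.
preds = torch.tensor([[0.1, 2.0, -1.0, 0.3],
                      [1.5, 0.2, 0.1, 0.0],
                      [-0.5, -0.2, 0.0, 3.0]])
y = torch.tensor([1, 0, 2])  # true class indices

values, predicted = torch.max(preds, 1)      # max logit and its index per row
same_as_argmax = torch.equal(predicted, preds.argmax(dim=1))
correct = (predicted == y).sum().item()      # number of correct predictions

print(predicted, correct)
```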

### 🔹 Does it change for **multi-class** vs **binary**?

| Case                      | Model Output Shape | Criterion                | Prediction logic        |
| ------------------------- | ------------------ | ------------------------ | ----------------------- |
| **Multi-class** (e.g. 3+) | `[B, C]`           | `nn.CrossEntropyLoss()`  | `torch.max(preds, 1)`   |
| **Binary** (2 classes)    | `[B]` or `[B, 1]`  | `nn.BCEWithLogitsLoss()` | `preds.sigmoid() > 0.5` |

---

### 🔸 For Binary Classification (with `BCEWithLogitsLoss`)

```python
# Model output is [B], raw logits
preds = model(x)               # shape [B]
probs = torch.sigmoid(preds)   # shape [B]
predicted = (probs > 0.5).long()
correct += (predicted == y).sum().item()
```

> Note: `BCEWithLogitsLoss` expects `y` as a float tensor (values 0.0 or 1.0) with the same shape as the model output, `[B]` or `[B, 1]`.

---
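
The binary case can be checked on a few hand-picked logits. The sketch below (all values hypothetical) also shows that thresholding the sigmoid at 0.5 is the same decision as checking whether the logit is positive, and that `BCEWithLogitsLoss` matches sigmoid followed by `BCELoss`.

```python
import torch
import torch.nn as nn

logits = torch.tensor([-2.0, -0.1, 0.3, 4.0])  # hypothetical raw outputs, shape [B]
y = torch.tensor([0.0, 1.0, 1.0, 1.0])         # float targets, same shape

loss = nn.BCEWithLogitsLoss()(logits, y)
# Equivalent in value, but less numerically stable for extreme logits.
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), y)

predicted = (torch.sigmoid(logits) > 0.5).long()
predicted_from_logits = (logits > 0).long()    # same decision without the sigmoid
correct = (predicted == y.long()).sum().item()

print(loss.item(), predicted, correct)
```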

###  Summary

| Type              | Output (from model) | Target `y`          | Prediction Rule                             |
| ----------------- | ------------------- | ------------------- | ------------------------------------------- |
| Multi-class       | `[B, C]` logits     | class indices `[B]` | `_, predicted = torch.max(preds, 1)`        |
| Binary (1 output) | `[B]` logits        | 0 or 1              | `predicted = (sigmoid(preds) > 0.5).long()` |

---



## Geodesic loss

**Geodesic loss** is a loss function used in deep learning when predicting **rotations**, especially **3D rotations**, like in camera pose estimation, object orientation prediction, or visual odometry. It measures the "shortest distance" between two rotations on the **manifold of 3D rotations**, i.e., **SO(3)** (the Special Orthogonal group of 3×3 rotation matrices).

---

##  What Is Geodesic Loss?

###  Problem:

Suppose you have:

* Ground truth rotation: $R_{\text{gt}} \in SO(3)$
* Predicted rotation: $R_{\text{pred}} \in SO(3)$

You want to measure how "far" these two are on the **rotation manifold**.

###  Geodesic Loss (in radians):

The geodesic distance $\theta$ between two rotation matrices is:

$$
\theta = \cos^{-1} \left( \frac{\text{trace}(R_{\text{gt}}^\top R_{\text{pred}}) - 1}{2} \right)
$$

This computes the **angle** between the two rotations.

---

##  Why Not Use L2 Loss?

Rotation matrices (or quaternions) lie on a curved manifold (non-Euclidean space). Using naive L2 loss between matrices or quaternions can:

* Be misleading,
* Fail to respect rotation constraints,
* Lead to worse convergence.

---

##  PyTorch Implementation (Rotation Matrices)

```python
import torch

def geodesic_loss(R_pred, R_gt):
    """
    Computes geodesic loss (in radians) between two batches of rotation matrices.
    R_pred, R_gt: shape (..., 3, 3)
    Returns: loss value (mean geodesic distance)
    """
    # Relative rotation; its trace equals trace(R_gt^T R_pred)
    R_diff = torch.matmul(R_pred.transpose(-1, -2), R_gt)
    trace = R_diff.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos_theta = (trace - 1) / 2
    # Clamp for numerical stability: rounding can push the value just outside [-1, 1].
    # The gradient of acos is unbounded at +/-1; if that causes NaNs during training,
    # clamp slightly inside the interval instead.
    cos_theta = torch.clamp(cos_theta, -1.0, 1.0)
    theta = torch.acos(cos_theta)
    return theta.mean()
```
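
A quick sanity check of the trace formula: for two rotations about the same axis, the geodesic angle should equal the difference of their rotation angles. The helpers `rot_z` and `geodesic_angle` below are illustrative names introduced for this sketch; they return per-sample angles rather than the batch mean.

```python
import math
import torch

def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s, 0.0],
                         [s, c, 0.0],
                         [0.0, 0.0, 1.0]], dtype=torch.float64)

def geodesic_angle(R_pred, R_gt):
    """Angle (radians) of the relative rotation R_pred^T R_gt, per batch item."""
    R_diff = R_pred.transpose(-1, -2) @ R_gt
    trace = R_diff.diagonal(dim1=-2, dim2=-1).sum(-1)
    return torch.acos(torch.clamp((trace - 1) / 2, -1.0, 1.0))

R_gt = torch.stack([rot_z(0.5), rot_z(1.0)])
R_pred = torch.stack([rot_z(0.2), rot_z(1.0)])  # first is off by 0.3 rad, second is exact
angles = geodesic_angle(R_pred, R_gt)

print(angles)
```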

---

##  If You Use Quaternions Instead

If rotations are represented as **unit quaternions** $q_1, q_2$, you can use:

$$
\theta = 2 \cos^{-1}(|q_1 \cdot q_2|)
$$

In PyTorch:

```python
def geodesic_loss_quaternion(q_pred, q_gt):
    """
    q_pred, q_gt: shape (B, 4), assumed normalized
    """
    dot = torch.sum(q_pred * q_gt, dim=1)
    dot = torch.abs(dot)  # account for double cover: q and -q are same
    dot = torch.clamp(dot, -1.0, 1.0)
    theta = 2 * torch.acos(dot)
    return theta.mean()
```

---

##  When to Use Geodesic Loss

Use geodesic loss when your network **outputs a rotation**, and:

* You care about **angular difference** between predicted and ground truth rotation.
* You use **rotation matrices**, **quaternions**, or **axis-angle** representations.
* Tasks like:

  * Visual odometry (rotation error),
  * Camera pose regression,
  * Orientation prediction in robotics.

---


###  **Rotations and Relative Rotation**

Given two rotation matrices:

* $R_{\text{gt}}$: ground truth
* $R_{\text{pred}}$: predicted

Then the **relative rotation** from prediction to ground truth is:

$$
R_{\text{rel}} = R_{\text{gt}}^\top R_{\text{pred}}
$$

If $R_{\text{pred}} = R_{\text{gt}}$, then:

$$
R_{\text{rel}} = R_{\text{gt}}^\top R_{\text{gt}} = I
$$

and:

$$
\text{trace}(R_{\text{rel}}) = \text{trace}(I) = 3
$$

So **when two rotations are close, the relative rotation is close to identity**, and the trace is close to 3.

---

###  **Geodesic Angle from Trace**

From that relative rotation, the **angle** $\theta$ between rotations is:

$$
\theta = \cos^{-1} \left( \frac{\text{trace}(R_{\text{rel}}) - 1}{2} \right)
$$

This follows from Rodrigues' rotation formula, which implies that a rotation by angle $\theta$ about any axis satisfies:

$$
\text{trace}(R) = 1 + 2 \cos(\theta)
\quad \Rightarrow \quad
\cos(\theta) = \frac{\text{trace}(R) - 1}{2}
$$

---

###  Example

If:

* $\text{trace}(R_{\text{gt}}^\top R_{\text{pred}}) = 2.95$
* then:

$$
\cos(\theta) = \frac{2.95 - 1}{2} = 0.975
\Rightarrow \theta = \cos^{-1}(0.975) \approx 12.8^\circ
$$

---

###  Key Intuition

* **Trace close to 3** → $R_{\text{pred}}$ ≈ $R_{\text{gt}}$
* **Trace = -1** → 180° rotation difference
* **Trace = 1** → 90° rotation difference
* The geodesic loss converts this trace into a **rotation angle**, which is a proper metric on **SO(3)**.




---
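
The trace-to-angle conversion from the example and the key cases above can be reproduced with the standard library. `angle_from_trace` is a hypothetical helper name used only for this sketch.

```python
import math

def angle_from_trace(trace):
    """Rotation angle in degrees from the trace of a 3x3 rotation matrix."""
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding
    return math.degrees(math.acos(cos_theta))

for trace in (3.0, 2.95, 1.0, -1.0):
    print(f"trace = {trace:5.2f} -> theta = {angle_from_trace(trace):7.2f} deg")
```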

###  Why is $\theta = 0$ when the trace is 3?

This follows directly from the formula.

Let’s walk through it **step by step**.

---

###  1. If the two rotations are exactly equal:

Then:

$$
R_{\text{gt}}^\top R_{\text{pred}} = R^\top R = I
$$

So:

$$
\text{trace}(I) = 3
$$

---

###  2. Plug into geodesic loss:

$$
\theta = \cos^{-1}\left(\frac{\text{trace}(R_{\text{gt}}^\top R_{\text{pred}}) - 1}{2}\right)
= \cos^{-1}\left(\frac{3 - 1}{2}\right) = \cos^{-1}(1) = 0
$$

**So when the matrices are equal, $\theta = 0$.**

---

###  Recap of the formula:

$$
\theta = \cos^{-1} \left( \frac{\text{trace}(R_{\text{gt}}^\top R_{\text{pred}}) - 1}{2} \right)
$$

* If rotations are **identical**, trace = 3 → $\theta = \cos^{-1}(1) = 0$
* If rotations differ by 180°, trace = –1 → $\theta = \cos^{-1}(-1) = \pi$

---




---

##  Geodesic Distance Between Quaternions

Given:

* $q_{\text{gt}} \in \mathbb{R}^4$: ground truth unit quaternion
* $q_{\text{pred}} \in \mathbb{R}^4$: predicted unit quaternion

###  The geodesic distance is defined as:

$$
\theta = 2 \cos^{-1}(|q_{\text{gt}} \cdot q_{\text{pred}}|)
$$

where:

* $q_{\text{gt}} \cdot q_{\text{pred}}$ is the **dot product**
* The **absolute value** accounts for the double-cover property (explained below)

---

##  Why Absolute Value?

Quaternions have a **double cover** of SO(3), meaning:

* $q$ and $-q$ represent the **same** rotation.

So to get the minimal rotation angle between two quaternions, we take:

$$
|q_{\text{gt}} \cdot q_{\text{pred}}|
$$

This ensures the angle is in $[0, \pi]$.

---

## 🔹 Intuition

Let’s examine some special cases:

###  Case 1: Perfect Match

If $q_{\text{pred}} = q_{\text{gt}}$, then:

$$
q_{\text{gt}} \cdot q_{\text{pred}} = 1 \Rightarrow \theta = 2 \cos^{-1}(1) = 0
$$

So **the angle is zero when the quaternions are identical**.

---

###  Case 2: Opposite Quaternions (still same rotation)

If $q_{\text{pred}} = -q_{\text{gt}}$, then:

$$
q_{\text{gt}} \cdot q_{\text{pred}} = -1 \Rightarrow \theta = 2 \cos^{-1}(|-1|) = 2 \cos^{-1}(1) = 0
$$

Still zero: $q$ and $-q$ are **physically the same rotation**.

---

###  Case 3: 90° Rotation Between

Suppose the relative rotation between them is 90°, so the quaternions are 45° apart as 4D unit vectors:

$$
|q_{\text{gt}} \cdot q_{\text{pred}}| = \cos(45^\circ) = \frac{\sqrt{2}}{2}
\Rightarrow \theta = 2 \cos^{-1}\left(\frac{\sqrt{2}}{2}\right) = 90^\circ = \frac{\pi}{2}
$$

---

##  PyTorch Code

```python
import torch

def geodesic_loss_quaternion(q_pred, q_gt):
    """
    Compute geodesic loss between predicted and ground-truth quaternions.
    q_pred, q_gt: (B, 4), normalized unit quaternions
    """
    dot = torch.sum(q_pred * q_gt, dim=1)
    dot = torch.abs(dot)  # account for double-cover
    dot = torch.clamp(dot, -1.0, 1.0)  # numerical stability
    theta = 2 * torch.acos(dot)
    return theta.mean()
```

---

##  Summary

| Case                               | Dot Product          | $\theta = 2 \cos^{-1}(\lvert q_{\text{gt}} \cdot q_{\text{pred}} \rvert)$ |
| ---------------------------------- | -------------------- | ----------------------------------- |
| $q_{\text{pred}} = q_{\text{gt}}$  | 1                    | $0$                                 |
| $q_{\text{pred}} = -q_{\text{gt}}$ | –1                   | $0$                                 |
| 90° apart                          | $\frac{\sqrt{2}}{2}$ | $\frac{\pi}{2}$                     |



---

##  Case 2 Revisited: Are Opposite Quaternions Opposite Rotations?

###  Short Answer:

**No**, they are *not* opposite rotations — they are **the same rotation**.

###  Why?

Unit quaternions represent 3D rotations via a double cover of the rotation group SO(3). That means:

$$
q \text{ and } -q \text{ represent the same rotation in 3D space}.
$$

---

## 🔹 Visual Intuition

A quaternion is often written as:

$$
q = \left[\cos\left(\frac{\theta}{2}\right), \, \sin\left(\frac{\theta}{2}\right) \hat{u} \right]
$$

Where:

* $\theta$ is the rotation angle
* $\hat{u}$ is the unit rotation axis

Then:

$$
-q = \left[-\cos\left(\frac{\theta}{2}\right), \, -\sin\left(\frac{\theta}{2}\right) \hat{u} \right]
$$

Which has the **same rotation effect**, because both correspond to the same 3D rotation matrix.

---

##  So When Are Quaternions "Opposite Rotations"?

They aren't. If you want to describe a rotation in the **opposite direction**, you **don’t negate the quaternion** — instead you:

* Keep the same axis $\hat{u}$
* Negate the angle: $\theta \to -\theta$

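This yields a **different quaternion**, the conjugate $\left[\cos\left(\frac{\theta}{2}\right), \, -\sin\left(\frac{\theta}{2}\right) \hat{u} \right]$, not simply the negation.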

---

##  What Would Be a 180° Opposite Rotation?

Say the ground truth is rotation by +90° around Z-axis. The "opposite rotation" would be –90° around Z. The quaternion for:

* +90°: $q = [\cos(45^\circ), 0, 0, \sin(45^\circ)]$
* –90°: $q' = [\cos(-45^\circ), 0, 0, \sin(-45^\circ)]$

These are **not negatives of each other**. Their dot product is $\cos^2(45^\circ) - \sin^2(45^\circ) = 0$, so their geodesic distance is:

$$
\theta = 2 \cos^{-1}(|q \cdot q'|) = 2 \cos^{-1}(0) = \pi \quad (180^\circ)
$$

---
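
Both cases can be checked numerically with plain Python: negating a quaternion leaves the rotation unchanged, while negating the angle does not. The helpers `quat_about_z` and `geodesic_angle_deg` are illustrative names, and the `[w, x, y, z]` component order is an assumption matching the formulas above.

```python
import math

def quat_about_z(angle_deg):
    """Unit quaternion [w, x, y, z] for a rotation of angle_deg about the z-axis."""
    half = math.radians(angle_deg) / 2.0
    return [math.cos(half), 0.0, 0.0, math.sin(half)]

def geodesic_angle_deg(q1, q2):
    dot = abs(sum(a * b for a, b in zip(q1, q2)))        # abs handles the double cover
    return math.degrees(2.0 * math.acos(min(1.0, dot)))  # min guards against rounding

q = quat_about_z(90.0)
q_neg = [-c for c in q]            # same rotation as q
q_opposite = quat_about_z(-90.0)   # rotation in the opposite direction

same = geodesic_angle_deg(q, q_neg)
opposite = geodesic_angle_deg(q, q_opposite)

print(same, opposite)
```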

##  Summary

| Quaternions                                     | Same Rotation? | Geodesic Distance |
| ----------------------------------------------- | -------------- | ----------------- |
| $q$ vs $-q$                                     | ✅ Yes          | $\theta = 0$      |
| $q_1$, $q_2$ with same axis but opposite angles | ❌ No           | $\theta > 0$      |

---



# Regression Loss Functions

Scoring function: $\textbf{W}\cdot\textbf{X}_i+b$

### 1) Mean Absolute Error Loss
Treats an outlier like any other point, so it does not go out of its way to fit outliers, at the price of an occasional poor prediction. It is also awkward for gradient descent because it is not differentiable at zero. Support vector regression uses this kind of loss (in its ε-insensitive form).

### 2) Mean Squared Error Loss
The advantage is that its gradient is easy to compute. It works well for regression, but outliers dominate the squared error and can make the model poor.

### 3) Pseudo-Huber Loss
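Mean Absolute Error largely ignores outliers, while Mean Squared Error overreacts to them. If, say, 70% of your data lies on one side and 30% on the other, both can result in a poor model. Pseudo-Huber loss is a compromise: roughly quadratic for small residuals and roughly linear for large ones.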
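
The three losses can be compared on a single residual $r$. The sketch below uses the common pseudo-Huber form $\delta^2\left(\sqrt{1 + (r/\delta)^2} - 1\right)$; the scale parameter `delta` and the sample residuals are assumptions chosen for illustration.

```python
import math

def mae(r):
    return abs(r)

def mse(r):
    return r * r

def pseudo_huber(r, delta=1.0):
    """delta^2 * (sqrt(1 + (r/delta)^2) - 1); delta is an assumed scale parameter."""
    return delta ** 2 * (math.sqrt(1.0 + (r / delta) ** 2) - 1.0)

for r in (0.1, 1.0, 10.0):  # 10.0 plays the role of an outlier
    print(f"r={r:5.1f}  MAE={mae(r):7.3f}  MSE={mse(r):8.3f}  "
          f"pseudo-Huber={pseudo_huber(r):7.3f}")
```

For the small residual, pseudo-Huber is close to $r^2/2$; for the outlier, it grows roughly like $|r|$ instead of $r^2$.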

### 4) Welsch (Leclerc)


### 5) Geman McClure


### 6) Cauchy


# Classification Loss Functions


### 1) Cross-Entropy Loss
When we use <b>sigmoid</b> as the activation function together with the <b>Mean Squared Error Loss</b>, learning is very slow whenever the neuron's output is badly wrong: $\sigma(z)$ is saturated and almost flat there ($z \to \infty$, $\sigma(z) \to 1$ and $z \to -\infty$, $\sigma(z) \to 0$), so $\sigma'(z)$ is small. For the quadratic cost $C = \frac{(y-a)^2}{2}$ with $a = \sigma(z)$ and $z = wx + b$, the gradients are as follows (the last equality in each line holds for the single training example $x = 1$, $y = 0$):

$\begin{eqnarray}\frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z)\end{eqnarray}$
  
$\begin{eqnarray}\frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z)\end{eqnarray}$

<b>Cross-Entropy Loss</b> is designed for binary classifiers and removes the $\sigma'(z)$ factor from these gradients. In a binary classifier, $y \in \{0,1\}$ is the class label and $a = \sigma(z) \in (0, 1)$ is the output of the activation function. With one output node and class labels of either 1 or 0, the cost over $n$ training inputs is:

$C=-\frac{1}{n} \sum_{x} \left[ y\ln(a)+ (1-y)\ln(1-a) \right]$


### 2) Multi-Class Cross-Entropy Loss
If the last layer $L$ has several output neurons indexed by $j$, the total cost over all inputs is:

$\begin{eqnarray}  C = -\frac{1}{n} \sum_x
  \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right].
\end{eqnarray} $

One way to remember that the order is $y_j \ln a^L_j$ and not the other way around: $y_j$ can be exactly zero, so it cannot be the argument of the logarithm, whereas $a^L_j$ only approaches zero ($a^L_j \to 0$), so it is the argument of $\ln$.

Now let's compute $\delta$ when the loss function is **Multi-Class Cross-Entropy Loss** and the activation function is **sigmoid**:


$\begin{eqnarray}
\delta_j = \frac{\partial C}{\partial z_j} & = & \frac{\partial C}{\partial a_j}\frac{\partial a_j}{\partial z_j} \\
& = & -\frac{1}{n} \sum_x \left( \frac{y_j}{\sigma(z_j)} - \frac{1-y_j}{1-\sigma(z_j)} \right)\sigma'(z_j) \\
& = & -\frac{1}{n} \sum_x \left( \frac{y_j - \sigma(z_j)}{\sigma(z_j)(1-\sigma(z_j))} \right) \sigma(z_j)(1-\sigma(z_j)) \\
& = & \frac{1}{n} \sum_x \left( \sigma(z_j) - y_j \right)
\end{eqnarray}$

Reminder: for sigmoid activation function:

$\sigma'(z_j)=\sigma(z_j)(1-\sigma(z_j))$


So we would have:


$\begin{eqnarray} 
    \delta^L = a^L-y.\tag 1
\end{eqnarray}$

Reminder: in quadratic loss function, $\delta$ was:

$\delta^L= 2(\textbf{a}^{(L)}-\textbf{y}) \odot   \sigma^{\prime}(\textbf{z}^{(L)})$

For the previous layers $L-1, L-2, \dots$, $\delta^{(L-1)}$ is the same as what we computed before for the network with the quadratic loss function. For the weights of the last layer we get:

$\begin{eqnarray} 
      \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x 
      a^{L-1}_k  (a^L_j-y_j) \tag 2
\end{eqnarray}$

This tells us that the rate at which the weight learns is controlled by $\sigma(z)-y$ (by the error in the output). The larger the error, the faster the neuron will learn. This avoids the learning slowdown caused by the $\sigma '(z)$ term in the analogous equation for the quadratic cost. Similarly, for the bias:

$\begin{eqnarray}
      \frac{\partial C}{\partial b^L_{j}} & = & \frac{1}{n} \sum_x 
      (a^L_j-y_j).
  \tag{3}\end{eqnarray}$
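
The claim that cross-entropy removes the $\sigma'(z)$ factor can be checked numerically for a single sigmoid neuron. The sketch below compares finite-difference gradients with respect to $z$ for both costs at a saturated, badly wrong output; all function names are illustrative, and the quadratic cost is taken as $\frac{(a-y)^2}{2}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(z, y):
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def quadratic(z, y):
    return 0.5 * (sigmoid(z) - y) ** 2

def numerical_grad(f, z, y, h=1e-6):
    """Central finite difference of f with respect to z."""
    return (f(z + h, y) - f(z - h, y)) / (2 * h)

z, y = 5.0, 0.0  # badly wrong, saturated neuron: sigmoid(5) is close to 1
a = sigmoid(z)

grad_ce = numerical_grad(cross_entropy, z, y)     # analytic value: a - y
grad_quadratic = numerical_grad(quadratic, z, y)  # analytic value: (a - y) * a * (1 - a)

print(grad_ce, grad_quadratic)
```

The cross-entropy gradient stays proportional to the error, while the quadratic gradient is shrunk by the small factor $\sigma'(z) = a(1-a)$.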


### 3) Negative log likelihood

The <b>Negative Log-Likelihood Loss function (NLL)</b> is applied only on models with the softmax function as an output activation layer. We can think a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.

The Softmax function is :




${\displaystyle \sigma (\mathbf {z} )_{i}={\frac {e^{z_{i}}}{\sum _{j=1}^{K}e^{z_{j}}}}}$


The softmax function takes an input vector of size $K$ and maps it to values that each fall between 0 and 1 and sum to 1, so they can be read as class probabilities.


$NLL$ carries a negative sign because probabilities (or likelihoods) lie between zero and one, and the logarithms of values in this range are negative; negating them makes the loss positive. The negative log-likelihood comes from maximum likelihood estimation ($MLE$): we try to maximize the model's log-likelihood, which is the same as minimizing the $NLL$.

Neural network estimate: 
$f(X)_c=p(y=c|X)$

This means the network estimates, for the given input vector $X$, the probability that the label $y$ is class $c$.
For instance, with 3 classes we have 3 nodes in the output layer, and for a given input we might get
$(0.1,0.85,0.05)$, which means the network believes with probability $0.85$ that this input belongs to the second class. Since the inputs are independent of each other, their joint probability is the product of the individual probabilities, just as in $MLE$. Since the output probabilities are between zero and one, their $\ln$ is negative, so we multiply by $-1$ to get a positive number, which turns the maximization problem into a minimization.

$l(\textbf{f}(X),y)=-\ln f(X)_y=-\sum_c 1_{y=c}\ln f(X)_c$
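
Using the example output $(0.1, 0.85, 0.05)$, the per-sample loss can be worked out with the standard library. The helper `nll` is a hypothetical name; class indices are zero-based, so the second class is index 1.

```python
import math

probs = [0.1, 0.85, 0.05]  # network output f(X) from the example

def nll(probs, y):
    """Negative log-likelihood of the true class index y."""
    return -math.log(probs[y])

loss_if_second_class = nll(probs, 1)  # the confident prediction is correct
loss_if_third_class = nll(probs, 2)   # the confident prediction is wrong

# Indicator form: only the true class (index 1 here) contributes to the sum.
indicator_form = -sum((1 if c == 1 else 0) * math.log(p) for c, p in enumerate(probs))

print(loss_if_second_class, loss_if_third_class, indicator_form)
```

The loss is small when the true class received high probability and large when it received low probability.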


Refs:

[1](https://www.youtube.com/watch?v=PpFTODTztsU), [2](https://neptune.ai/blog/pytorch-loss-functions),
[3](https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/)

```python
import torch
import torch.nn as nn

# The input has shape (N, C) = (3, 5).
# N
number_of_item_in_training_set = 3

# C
number_of_neuron_in_output_layer = 5

input = torch.randn(number_of_item_in_training_set, number_of_neuron_in_output_layer, requires_grad=True)

# The target has N = 3 elements, and every element must satisfy 0 <= value < C = 5
# because these are the target labels. Target shape: (N)
target = torch.tensor([4, 3, 2])

m = torch.nn.LogSoftmax(dim=1)

# NLLLoss expects the inputs to be log-probabilities
nll_loss = torch.nn.NLLLoss()
output = nll_loss(m(input), target)
output.backward()

print('input: \n', input)
print('target: \n', target)
print('output: \n', output)
print('m(input).exp()\n', m(input).exp())
```

```
input: 
 tensor([[-0.3319,  1.6391,  0.8178,  1.8935, -0.0741],
        [ 0.1611,  0.1806,  2.6024,  0.9387, -1.5747],
        [ 0.5691, -0.7668, -0.9221,  1.5614,  0.6988]], requires_grad=True)
target: 
 tensor([4, 3, 2])
output: 
 tensor(2.6592, grad_fn=<NllLossBackward0>)
m(input).exp()
 tensor([[0.0457, 0.3280, 0.1443, 0.4230, 0.0591],
        [0.0630, 0.0643, 0.7243, 0.1372, 0.0111],
        [0.1878, 0.0494, 0.0423, 0.5067, 0.2138]], grad_fn=<ExpBackward0>)
```


```python
loss = nn.NLLLoss()
a = torch.tensor(([0.88, 0.12], [0.51, 0.49]), dtype=torch.float)
target = torch.tensor([1, 0])
output = loss(torch.log(a), target)
print(output)

# Same value by hand: mean of -log(probability of the true class)
print((-torch.log(a[0, 1]) - torch.log(a[1, 0])) / 2)
```

```
tensor(1.3968)
tensor(1.3968)
```
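
The relationship between `NLLLoss` and cross-entropy can be confirmed on raw scores: applying `log_softmax` followed by the NLL loss gives the same value as the cross-entropy loss applied to the logits. This sketch uses the functional API on random data; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 5)         # (N, C) raw scores
target = torch.tensor([4, 3, 2])   # (N,) class indices

two_step = F.nll_loss(F.log_softmax(logits, dim=1), target)
one_step = F.cross_entropy(logits, target)

print(two_step.item(), one_step.item())
```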


### 4) Hinge Loss
Let's say we have a binary classifier with two labels, $-1,+1$. The misclassification error is:

$\frac{1}{n}\sum_{i=1}^{n}[y_i\neq \operatorname{sign}(f(x_i))] = \frac{1}{n}\sum_{i=1}^{n}[y_i f(x_i) < 0] \leq \frac{1}{n}\sum_{i=1}^{n}L(y_i f(x_i)) $

Minimizing this error directly is computationally hard, so we instead minimize a convex surrogate loss $L$ that upper-bounds it; the hinge loss is one such surrogate.

It is mainly used in SVMs, which place the decision boundary as far as possible from all data points (maximizing the minimum margin).
It penalizes points that fall inside the margin even when they are correctly classified.


Scoring function: $\textbf{W}\cdot\textbf{X}_i+b$

Training label: $y_i = ±1$

$L_i=\max(0,1-y_i(\textbf{W}\cdot\textbf{X}_i+b))$
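
A few hand-picked cases show how the hinge loss behaves relative to the margin and to the 0-1 misclassification error it bounds. The scores and labels below are hypothetical, with the score taken as the signed value $\textbf{W}\cdot\textbf{X}_i+b$.

```python
def hinge_loss(score, y):
    """max(0, 1 - y * score) for a label y in {-1, +1} and a signed score."""
    return max(0.0, 1.0 - y * score)

# Hypothetical (score, label) pairs.
examples = [(2.0, +1),   # correct, outside the margin
            (0.5, +1),   # correct, but inside the margin
            (-1.0, +1),  # misclassified
            (-3.0, -1)]  # correct, outside the margin

losses = [hinge_loss(s, y) for s, y in examples]
zero_one = [1 if y * s < 0 else 0 for s, y in examples]  # misclassification indicator

print(losses, zero_one)
```

Only the point inside the margin and the misclassified point contribute, and each hinge value is at least as large as the corresponding 0-1 error.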


Refs: [1](https://www.youtube.com/watch?v=r-vYJqcFxBI)

<img src='images/hinge.svg' />

## Adaptive Loss Functions
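The robust regression losses above can be generalized by a single function with a shape parameter $\alpha$:

$\rho(x,\alpha)=\frac{|2- \alpha|}{\alpha} \left( \left( \frac{x^2}{|2-\alpha|}+1 \right)^{\alpha/2} -1 \right)$

Special cases include $\alpha = 2$ (squared error, as a limit), $\alpha = 1$ (pseudo-Huber), $\alpha = 0$ (Cauchy, as a limit), $\alpha = -2$ (Geman-McClure) and $\alpha \to -\infty$ (Welsch).

Treating $\alpha$ as a learnable parameter alongside the model parameters $\theta$, the objective becomes a negative log-likelihood, where $x$ is the residual and $Z(\alpha)$ is the normalizing constant of the corresponding distribution:

$\min_{\theta,\alpha} NLL(\theta,\alpha)=\min_{\theta,\alpha} \left[ \rho(x,\alpha)+\log Z(\alpha) \right]$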
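
The general formula above can be evaluated directly to see that particular values of $\alpha$ recover familiar losses. This is a minimal sketch with the scale fixed to 1; `general_loss` is a hypothetical helper name, and the closed forms for pseudo-Huber and Geman-McClure are the ones obtained by substituting $\alpha = 1$ and $\alpha = -2$ into the formula.

```python
import math

def general_loss(x, alpha):
    """rho(x, alpha) from the formula above (scale fixed to 1).

    Not defined at alpha = 0 or alpha = 2, where it is taken as a limit.
    """
    b = abs(2.0 - alpha)
    return (b / alpha) * ((x * x / b + 1.0) ** (alpha / 2.0) - 1.0)

x = 1.5
pseudo_huber = math.sqrt(x * x + 1.0) - 1.0   # alpha = 1
geman_mcclure = 2.0 * x * x / (x * x + 4.0)   # alpha = -2
near_l2 = general_loss(x, 1.999)              # approaches x^2 / 2 as alpha -> 2

print(general_loss(x, 1.0), pseudo_huber)
print(general_loss(x, -2.0), geman_mcclure)
print(near_l2, x * x / 2)
```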



Refs: [1](https://arxiv.org/abs/1701.03077), [2](https://www.youtube.com/watch?v=QBbC3Cjsnjg), [3](https://medium.com/udacity-pytorch-challengers/a-brief-overview-of-loss-functions-in-pytorch-c0ddb78068f7), [4](https://rohanvarma.me/Loss-Functions/), [5](https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23), [6](https://neptune.ai/blog/pytorch-loss-functions), [7](https://www.youtube.com/watch?v=ErfnhcEV1O8)