# Loss Functions for Depth + Pose Estimation


| **Category**                          | **Loss name**                                                         | **Type**                    | **Mandatory?** | **Purpose / Description**                                                    |
| ------------------------------------- | --------------------------------------------------------------------- | --------------------------- | -------------- | ---------------------------------------------------------------------------- |
| **Photometric consistency**           | *Photometric loss* (SSIM + L1)                                        | Self-supervised             | ✅              | Image reconstruction via reprojection-based self-supervision                 |
| **Geometry regularization**           | *Edge-aware depth smoothness*                                         | Regularization              | ✅              | Enforces spatial coherence, sharp and stable depth maps                      |
| **Pose supervision / regularization** | *Geodesic loss*, *Quaternion loss*, *SE(3) transform loss*            | Pose supervision            | optional       | Penalize rotation and translation errors if GT/pseudo-GT poses are available |
| **Motion priors / dynamics**          | *Rotation magnitude loss*, *velocity smoothness*, *Motion prior (L2)* | Regularization              | optional       | Regularize PoseNet outputs, prevent large or inconsistent motion jumps       |
| **Additional (optional)**             | *Depth consistency*, *multi-scale weighting*, *explainability mask*   | Refinement / Regularization | optional       | Advanced refinement to improve robustness and interpretability               |

---


##  Self-Supervised Losses (always used)

These losses drive both DepthNet and PoseNet when you train from monocular videos **without ground truth**.

---

### (a) **Photometric Reprojection Loss** (a.k.a. View Synthesis Loss)

**Definition:**
$$
L_\text{photo} = \min_s \Big( \alpha \frac{1 - \text{SSIM}(I_t, I_s')}{2} + (1 - \alpha) \| I_t - I_s' \|_1 \Big)
$$

* $I_s'$: Source image warped into target frame using predicted depth + pose.
* $\alpha = 0.85$ is the weighting used in Monodepth2 and a common default.
* Take **minimum reprojection** across multiple source frames (to ignore occlusions).
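
A minimal PyTorch sketch of the loss above, assuming the source images have already been warped into the target frame (the warping step itself is not shown). The function names, the 3x3 average-pooling SSIM, and the constants `C1`, `C2` for images scaled to [0, 1] are illustrative assumptions, loosely following the style of the public Monodepth2 code rather than reproducing it.

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Per-pixel SSIM over 3x3 windows; x, y are (B, C, H, W) in [0, 1]."""
    x = F.pad(x, (1, 1, 1, 1), mode="reflect")
    y = F.pad(y, (1, 1, 1, 1), mode="reflect")
    mu_x = F.avg_pool2d(x, 3, 1)
    mu_y = F.avg_pool2d(y, 3, 1)
    var_x = F.avg_pool2d(x * x, 3, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, 3, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den

def photometric_error(target, warped, alpha=0.85):
    """Per-pixel SSIM + L1 error, shape (B, 1, H, W)."""
    l1 = (target - warped).abs().mean(1, keepdim=True)
    dssim = ((1 - ssim_map(target, warped)) / 2).clamp(0, 1).mean(1, keepdim=True)
    return alpha * dssim + (1 - alpha) * l1

def min_reprojection_loss(target, warped_sources, alpha=0.85):
    """warped_sources: list of source images already warped into the target frame."""
    errors = torch.cat([photometric_error(target, w, alpha) for w in warped_sources], dim=1)
    per_pixel_min, _ = errors.min(dim=1)   # best-matching source at each pixel
    return per_pixel_min.mean()
```

Taking the minimum per pixel, rather than averaging over sources, means a pixel that is occluded in one source frame can still be explained by another.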


---



## 1.4 Structural Similarity Index Measure (SSIM)


The **Structural Similarity Index Measure (SSIM)** is a widely used metric for measuring the similarity between two images. Unlike simple metrics such as Mean Squared Error (MSE) or Peak Signal-to-Noise Ratio (PSNR), SSIM is designed to model the way humans perceive image quality — focusing on structural information, contrast, and luminance rather than raw pixel differences.

---

### 1.4.1 The Idea Behind SSIM

SSIM tries to answer: *“How similar are two images in terms of structure, contrast, and brightness?”*

It decomposes similarity into **three components**:

1. **Luminance similarity** $l(x, y)$:

   Are the two images equally bright on average?

   $$
   l(x,y) = \frac{2 \mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}
   $$

   - If both images have similar brightness, this term is near 1.
   - If one image is much darker, it will drop below 1.

2. **Contrast similarity** $c(x, y)$:

   Do the two images have the same amount of contrast?

   $$
   c(x,y) = \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
   $$

   - If both images have similar contrast (variability), this is near 1.
   - If one is flat (low contrast) and the other is textured (high contrast), the similarity decreases.

3. **Structural similarity** $s(x, y)$:

   Do the two images have the same patterns and textures?

   $$
   s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
   $$

   - If $x$ and $y$ rise and fall together (high correlation), this is near $1$.
   - If they are uncorrelated or inverted (noise, wrong edges), this value becomes smaller or even negative.

Here:

* $\mu_x, \mu_y$ are the mean intensities,
* $\sigma_x, \sigma_y$ are standard deviations,
* $\sigma_{xy}$ is covariance between $x$ and $y$,
* $C_1, C_2, C_3$ are small constants to stabilize division.

The final SSIM is:

$$
SSIM(x, y) = [l(x, y)]^\alpha \cdot [c(x, y)]^\beta \cdot [s(x, y)]^\gamma
$$

Usually $\alpha = \beta = \gamma = 1$. These exponents are unrelated to the photometric weight $\alpha = 0.85$ used in $L_\text{photo}$.
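
As a short worked step, assume unit exponents and the common choice $C_3 = C_2 / 2$ (an assumption at this point in the text). The contrast and structure terms then merge, because the $\sigma_x \sigma_y$ factors cancel:

$$
c(x,y)\, s(x,y)
= \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \cdot \frac{2 \sigma_{xy} + C_2}{2 \sigma_x \sigma_y + C_2}
= \frac{2 \sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
$$

which gives the two-factor form found in most implementations:

$$
\text{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
$$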

---

### 1.4.2 Normalized Correlation
The **normalized correlation** part of SSIM is the most “structural” component, so it’s worth understanding carefully.

**Step 1: Represent the Images**

Suppose you have two image patches $x$ and $y$ of size $N$ pixels (can be grayscale or single channel).

$$
x = [x_1, x_2, \dots, x_N], \quad y = [y_1, y_2, \dots, y_N]
$$

---

**Step 2: Compute the Mean**

Compute the mean intensity (average brightness) of each patch:

$$
\mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad
\mu_y = \frac{1}{N} \sum_{i=1}^{N} y_i
$$

---

**Step 3: Compute the Standard Deviations**

Compute how much pixel values vary around the mean:

$$
\sigma_x = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_x)^2}
$$

$$
\sigma_y = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \mu_y)^2}
$$

These represent the **contrast** of each image.

---

**Step 4: Compute the Covariance**

Covariance measures how much the two patches vary *together*:

$$
\sigma_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)
$$

* If $x$ and $y$ increase/decrease together → covariance is **positive**.
* If one increases when the other decreases → covariance is **negative**.

---

**Step 5: Normalize → Get the Correlation**

The **Pearson correlation coefficient** is just the covariance normalized by the product of standard deviations:

$$
\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}
$$

* This value ranges between **-1 and 1**.

  * $1.0 \Rightarrow$ perfect positive linear correlation (structures align perfectly).
  * $0 \Rightarrow$ no linear correlation (structures unrelated).
  * $-1.0 \Rightarrow$ perfect negative correlation (inverted contrast).
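
Steps 2 to 5 can be sketched in plain Python. The helper name `pearson` and the test patches are illustrative assumptions. The example shows that a gain-and-offset change leaves the correlation at 1, while inverting the contrast flips it to -1.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (sample statistics, N-1)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

patch = [10, 20, 30, 20, 30, 40, 30, 40, 50]
brighter = [2 * p + 15 for p in patch]     # gain and offset change
inverted = [255 - p for p in patch]        # inverted contrast

print(round(pearson(patch, brighter), 6))  # 1.0
print(round(pearson(patch, inverted), 6))  # -1.0
```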

---

**Step 6: Add Stabilization for SSIM**

In SSIM, to avoid division by zero when contrast is very low, we use a small constant $C_3$:

$$
s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
$$

This keeps the measure well-defined even for very flat patches (e.g., almost uniform gray).
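
A small sketch of why the constant matters, using a perfectly flat patch. The helper names are hypothetical, and the value of `C3` assumes the usual 8-bit setting $C_3 = (0.03 \cdot 255)^2 / 2$.

```python
import math

def patch_stats(x, y):
    """Sample standard deviations and covariance of two equal-length patches."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / (n - 1))
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    return sd_x, sd_y, cov

def structure_term(x, y, c3):
    sd_x, sd_y, cov = patch_stats(x, y)
    return (cov + c3) / (sd_x * sd_y + c3)

flat = [128] * 9                                   # uniform gray, sigma = 0
textured = [10, 20, 30, 20, 30, 40, 30, 40, 50]
C3 = (0.03 * 255) ** 2 / 2                         # assumed 8-bit constant

try:
    structure_term(flat, textured, c3=0.0)         # 0 / 0 without stabilization
    unstabilized = "defined"
except ZeroDivisionError:
    unstabilized = "undefined"

print(unstabilized)                        # undefined
print(structure_term(flat, textured, C3))  # 1.0
```

With the constant in place, the structure term saturates at 1 when either patch is flat, so the mismatch between a flat and a textured patch has to be caught by the contrast term (about 0.28 for this pair).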

---

**Intuition**

* **Covariance** tells you whether pixel intensities move together.
* **Normalization** by $\sigma_x \sigma_y$ removes the effect of scale/contrast so you focus purely on **structure** (edges, textures, gradients).

This is why SSIM can still give a high similarity score if one image is slightly brighter/darker — because the **pattern** of variations is the same.

---

SSIM tries to mimic the **human visual system** by:

* **Normalizing for lighting** (so small brightness changes are ignored).
* **Normalizing for contrast** (so small contrast changes are less penalized).
* **Measuring structure** (so it cares about edges, patterns, textures).




### 1.4.3 Numerical SSIM Example
This section works through a **numerical SSIM example** with two $3\times3$ grayscale image patches, computing every piece: mean, standard deviation, covariance, normalized correlation, the three SSIM terms (luminance, contrast, structure), and the final SSIM.

**Images (grayscale, 8-bit scale assumed)**

$$
x=\begin{bmatrix}
10&20&30\\
20&30&40\\
30&40&50
\end{bmatrix},\quad
y=\begin{bmatrix}
12&22&32\\
21&31&41\\
29&39&49
\end{bmatrix}
$$

We’ll treat the whole $3\times3$ window as one patch (i.e., a single SSIM window).

---

#### 1) Basic statistics

Let $N=9$.

**Means**

$$
\mu_x=30.0000,\qquad \mu_y=30.6667
$$

**Sample standard deviations** (ddof=1)

$$
\sigma_x=\sqrt{150}=12.2474,\qquad \sigma_y=\sqrt{129.25}=11.3688
$$

**Sample covariance**

$$
\sigma_{xy}=\frac{1}{N-1}\sum (x_i-\mu_x)(y_i-\mu_y)=138.75
$$

**Normalized correlation (Pearson $\rho$)**

$$
\rho=\frac{\sigma_{xy}}{\sigma_x\sigma_y}=\frac{138.75}{12.2474\cdot 11.3688}\approx 0.9965
$$

> Intuition: the patches vary together almost perfectly (very strong structural agreement).

---

#### 2) SSIM components

Use the standard SSIM constants for 8-bit images:

$$
L=255,\quad C_1=(0.01L)^2=6.5025,\quad C_2=(0.03L)^2=58.5225,\quad C_3=\frac{C_2}{2}=29.26125
$$

#### (a) Luminance term $l(x,y)$

$$
l=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1}
=\frac{2\cdot 30\cdot 30.6667+6.5025}{30^2+30.6667^2+6.5025}
\approx 0.99976
$$

#### (b) Contrast term $c(x,y)$

$$
c=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2}
=\frac{2\cdot 12.2474\cdot 11.3688+58.5225}{12.2474^2+11.3688^2+58.5225}
\approx 0.99771
$$

#### (c) Structure term $s(x,y)$

$$
s=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3}
=\frac{138.75+29.26125}{12.2474\cdot 11.3688+29.26125}
\approx 0.99710
$$

---

#### 3) Final SSIM

$$
SSIM=l\cdot c\cdot s\approx 0.99976\cdot 0.99771\cdot 0.99710\approx \mathbf{0.99458}
$$

**Takeaway:** Despite small brightness/contrast differences, the structures match extremely well (high $\rho$ and high SSIM ≈ **0.995**). This is exactly the kind of case where SSIM (and the structure term) shines compared to plain MSE/PSNR.
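
The numbers above can be reproduced with a few lines of plain Python. The script also reports MSE and PSNR for the same pair as a point of comparison. Variable names are illustrative.

```python
import math

x = [10, 20, 30, 20, 30, 40, 30, 40, 50]
y = [12, 22, 32, 21, 31, 41, 29, 39, 49]

n = len(x)
mu_x, mu_y = sum(x) / n, sum(y) / n
var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)   # sample variance (ddof=1)
var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

L = 255
C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
C3 = C2 / 2

lum = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
con = (2 * sd_x * sd_y + C2) / (var_x + var_y + C2)
struct = (cov + C3) / (sd_x * sd_y + C3)
ssim = lum * con * struct

mse = sum((a - b) ** 2 for a, b in zip(x, y)) / n
psnr = 10 * math.log10(L ** 2 / mse)

print(f"l={lum:.5f} c={con:.5f} s={struct:.5f} SSIM={ssim:.5f}")
# l=0.99976 c=0.99771 s=0.99710 SSIM=0.99458
print(f"MSE={mse:.4f} PSNR={psnr:.2f} dB")
# MSE=2.0000 PSNR=45.12 dB
```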




## 1.5 LPIPS

So far, we talked about **SSIM** (hand-crafted metric). Now, **LPIPS (Learned Perceptual Image Patch Similarity)** goes a step further: instead of manually designing similarity measures, it uses **deep features** from pretrained networks (e.g., AlexNet, VGG, SqueezeNet) to capture perceptual similarity.


* Proposed in **"The Unreasonable Effectiveness of Deep Features as a Perceptual Metric"** (Zhang et al., 2018).
* Idea: Humans judge images by *perceptual similarity*, not pixel-wise equality.
* LPIPS measures distance in the **feature space** of a pretrained CNN rather than raw pixels.

---

### 1.5.1 How It Works

1. Take two images $x$ and $y$.
2. Pass both through a **pretrained network** (e.g., VGG).
3. Extract activations from multiple layers (feature maps).
4. Normalize features and compute **L2 distance** per spatial location.
5. Average distances across spatial positions and layers.
6. Optionally, train small linear weights to better align with human judgments.

$$
LPIPS(x,y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} w_l \; \| \hat{f}_l(x)_{h,w} - \hat{f}_l(y)_{h,w} \|_2^2
$$

* $f_l$: feature map from layer $l$.
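* $\hat{f}_l$: the same features, unit-normalized along the channel dimension at each spatial position.
* $w_l$: learned weights (one per channel in the original paper; written here as a single factor per layer).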
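
A hedged sketch of the distance computation, assuming the per-layer activations have already been extracted from some pretrained backbone (the extraction is not shown). The function name `lpips_from_features` is hypothetical, and the weights are treated as one non-negative value per channel applied to the squared differences; a single scalar per layer is the special case of a constant vector.

```python
import torch

def lpips_from_features(feats_x, feats_y, weights, eps=1e-10):
    """feats_*: list of (B, C_l, H_l, W_l) activations; weights: list of (C_l,) tensors."""
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        # Unit-normalize each spatial feature vector along the channel axis.
        fx = fx / (fx.norm(dim=1, keepdim=True) + eps)
        fy = fy / (fy.norm(dim=1, keepdim=True) + eps)
        diff2 = (fx - fy) ** 2                                # (B, C, H, W)
        weighted = (w.view(1, -1, 1, 1) * diff2).sum(dim=1)   # (B, H, W)
        total = total + weighted.mean(dim=(1, 2))             # spatial average
    return total                                              # shape (B,)
```

For real evaluations, a maintained implementation with trained weights is preferable to this sketch.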

---

### 1.5.2 Why LPIPS is Important

* **MSE/PSNR**: pixel-wise, not perceptual.
* **SSIM**: structural but still hand-crafted.
* **LPIPS**: learned perceptual similarity, matches human perception much better.

In practice, LPIPS is a widely used metric for evaluating perceptual image quality (GANs, super-resolution, style transfer, inpainting).

---

**LPIPS in Deep Learning Workflows**

* Used as an **evaluation metric** for generative models (GANs, diffusion, etc.).
* Sometimes used as a **loss function** (LPIPS loss) for training perceptual similarity.
* Often combined with pixel losses (L1/L2) or SSIM.


---

### 1.5.3 Comparison: MSE/PSNR vs SSIM vs LPIPS

| Metric       | Based on                           | Pros                                                         | Cons                                             |
| ------------ | ---------------------------------- | ------------------------------------------------------------ | ------------------------------------------------ |
| **MSE/PSNR** | Pixel differences                  | Simple, fast                                                 | Not perceptual, sensitive to shifts              |
| **SSIM**     | Luminance, contrast, structure     | Better perceptual alignment                                  | Hand-crafted, less robust to complex distortions |
| **LPIPS**    | Deep features (VGG, AlexNet, etc.) | Best matches human perception, widely used in GAN evaluation | Heavier, requires pretrained nets                |

---

### (b) **Edge-Aware Depth Smoothness Loss**

Encourages locally smooth depth while preserving depth discontinuities along image edges.

**Equation:**
$$
L_\text{smooth} = |\partial_x d_t^*| e^{-|\partial_x I_t|} + |\partial_y d_t^*| e^{-|\partial_y I_t|}
$$

where $d_t^* = d_t / \bar{d_t}$ (normalized disparity).

 This loss prevents noisy disparity, and the edge weighting keeps depth edges aligned with color edges.
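
A minimal PyTorch sketch of this term, using forward differences as the gradient approximation. The function name, tensor layout, and `eps` are assumptions for illustration.

```python
import torch

def edge_aware_smoothness(disp, img, eps=1e-7):
    """disp: (B, 1, H, W) disparity; img: (B, 3, H, W) target image."""
    # Mean-normalize so the loss cannot be reduced by simply shrinking the disparity.
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + eps)

    grad_disp_x = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    grad_disp_y = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()

    # Image gradients, averaged over the color channels.
    grad_img_x = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    grad_img_y = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)

    # Disparity gradients are penalized less where the image itself has an edge.
    loss_x = grad_disp_x * torch.exp(-grad_img_x)
    loss_y = grad_disp_y * torch.exp(-grad_img_y)
    return loss_x.mean() + loss_y.mean()
```

A disparity step that coincides with an image edge is penalized by a factor of $e^{-|\partial I|}$ less than the same step in a textureless region.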

---

**Combined core loss:**
$$
L_\text{unsup} = L_\text{photo} + \lambda_\text{smooth} L_\text{smooth}
$$

Typical $\lambda_\text{smooth} \approx 0.001$ to $0.01$.

---

##  3. Optional pose-related losses (for better motion consistency)

These are **not mandatory** for self-supervised training,
but useful if you have **pseudo ground truth poses** (e.g., from KITTI odometry or IMU).

---

### (a) **Geodesic Rotation Loss**

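Encourages PoseNet’s predicted rotation $R_\text{pred}$ to be close to the ground truth $R_\text{gt}$ on SO(3):

$$
L_R = \| \log(R_\text{gt}^T R_\text{pred})^\vee \|_2
$$

Here `log()` is the matrix logarithm mapping to so(3), and $^\vee$ extracts the 3-vector from the resulting skew-symmetric matrix.
Its norm is the **rotation angle error** in radians, which can also be computed directly as $\theta = \arccos\big((\mathrm{tr}(R_\text{gt}^T R_\text{pred}) - 1)/2\big)$.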

```python
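import torch

def geodesic_loss(R_pred, R_gt, eps=1e-7):
    # Relative rotation R_gt^T R_pred, shape (B, 3, 3)
    R_rel = R_gt.transpose(-1, -2) @ R_pred
    # Rotation angle from the trace: theta = arccos((tr(R) - 1) / 2)
    cos_theta = (R_rel.diagonal(dim1=-2, dim2=-1).sum(-1) - 1) / 2
    return torch.acos(cos_theta.clamp(-1 + eps, 1 - eps)).mean()  # clamp keeps acos finite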
```

Use if you have ground-truth rotations from KITTI or IMU fusion.

---

### (b) **Quaternion Loss**

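If you represent rotations as unit quaternions $q$:

$$
L_q = 1 - \langle q_\text{pred}, q_\text{gt} \rangle^2
$$

It penalizes quaternion misalignment, and squaring the dot product makes it insensitive to the sign ambiguity ($q$ and $-q$ encode the same rotation).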

```python
def quaternion_loss(q_pred, q_gt):
    # Assumes unit quaternions of shape (B, 4); the squared dot product treats q and -q alike.
    return 1 - torch.sum(q_pred * q_gt, dim=-1).pow(2).mean()
```

---

### (c) **Full SE(3) Transformation Loss**

Combines rotation and translation errors in one expression:

$$
L_{SE3} = \| \log(T_\text{gt}^{-1} T_\text{pred})^\vee \|_2
$$

This is the norm of the 6D twist vector $\xi$ that takes the ground-truth transform to the predicted one.

It is useful for fine-tuning PoseNet if you have ground-truth or pseudo ground-truth trajectories.
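
For reference, the twist behind this loss can be written out with the standard closed-form SE(3) logarithm. The symbols $\rho$, $\phi$, $\theta$, and $V$ are notation introduced here, not taken from the text above:

$$
T_\text{rel} = T_\text{gt}^{-1} T_\text{pred} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix},
\qquad
\xi = \log(T_\text{rel})^\vee = \begin{bmatrix} \rho \\ \phi \end{bmatrix} \in \mathbb{R}^6
$$

$$
\phi = \log(R)^\vee, \qquad \theta = \|\phi\|_2, \qquad \rho = V^{-1} t,
\qquad
V = I + \frac{1 - \cos\theta}{\theta^2} [\phi]_\times + \frac{\theta - \sin\theta}{\theta^3} [\phi]_\times^2
$$

$$
L_{SE3} = \|\xi\|_2 = \sqrt{\|\rho\|_2^2 + \theta^2}
$$

Since $\rho$ carries length units and $\theta$ is in radians, the two parts are not directly comparable; weighting them separately is one possible design choice, though nothing in this section prescribes it.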

---

### (d) **Rotation Magnitude / Motion Prior Loss**

Encourages PoseNet outputs to have small, realistic motion per frame:

$$
L_\text{motion} = \|r_\text{pred}\|_2 + \|t_\text{pred}\|_2
$$

This acts like a regularizer and avoids large jumps in estimated pose.
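
A minimal sketch of this regularizer, assuming PoseNet outputs the rotation as an axis-angle vector $r_\text{pred}$ (so its norm is the rotation angle) and the translation as a 3-vector. The function name and tensor shapes are illustrative.

```python
import torch

def motion_prior_loss(axisangle, translation):
    """axisangle, translation: (B, 3) PoseNet outputs for one frame pair."""
    rot_mag = axisangle.norm(dim=-1)       # rotation angle in radians
    trans_mag = translation.norm(dim=-1)   # translation length
    return (rot_mag + trans_mag).mean()

r = torch.tensor([[0.0, 0.0, 0.1]])        # 0.1 rad about the z axis
t = torch.tensor([[3.0, 4.0, 0.0]])        # length 5
loss = motion_prior_loss(r, t)             # 0.1 + 5.0
```

In practice this term usually needs a small weight; otherwise it pulls the predicted motion toward zero.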

---

##  4. Optional advanced terms (for refinement or stability)

| Loss                           | Purpose                                                          |
| ------------------------------ | ---------------------------------------------------------------- |
| **Explainability mask loss**   | Downweights moving objects and occlusions.                       |
| **Depth consistency loss**     | Enforces consistency between multi-scale depth predictions.      |
| **Temporal smoothness loss**   | Penalizes acceleration/jerk between consecutive predicted poses. |
| **Scale-invariant depth loss** | Used when GT depths are available but relative scale is unknown. |

---

## 5. Recommended combination for your project

Since you’re currently using **KITTI** and your setup is **self-supervised** (DepthNet + PoseNet trained together):

###  **Use these always**

| Loss                                      | Weight | Purpose              |
| ----------------------------------------- | ------ | -------------------- |
| Photometric (SSIM + L1, min reprojection) | 1.0    | Core supervision     |
| Edge-aware depth smoothness               | 0.001  | Depth regularization |

###  **Add these if GT poses available (optional fine-tuning)**

| Loss                   | Weight | Purpose               |
| ---------------------- | ------ | --------------------- |
| Geodesic rotation loss | 1.0    | Rotation accuracy     |
| Translation (L1) loss  | 0.1    | Motion scale accuracy |
| SE(3) transform loss   | 0.5    | Joint refinement      |

---

##  6. Example final total loss

```python
λ_smooth = 0.001
λ_se3 = 0.5
λ_geo = 1.0

total_loss = photo_loss + λ_smooth * smooth_loss

if use_pose_supervision:
    total_loss += λ_geo * geo_loss + λ_se3 * se3_loss
```

---




When you train a network to predict **rotations**, you want a loss function that measures “how far apart” two rotations are.
Rotations live on the **special orthogonal group** SO(3), which is not a flat Euclidean space — so we need to be careful.

## 1.1 Rotation Loss $SO(3)$

###  1.1.1 Rotation Representation 

* **Quaternions** (unit 4D vectors, $q \in \mathbb{R}^4, \|q\|=1$)
* **Rotation matrices** ($R \in SO(3)$, orthogonal 3×3 with det=+1)
* **Axis-angle** ($\theta, \mathbf{u}$)

For deep learning, **quaternions** are often used because:

* They avoid the gimbal-lock singularities of Euler angles (note that $q$ and $-q$ represent the same rotation).
* They are compact (4 parameters).
* Easy to normalize to unit norm after network output.
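
A short sketch of the normalization step followed by conversion to a rotation matrix. The `(w, x, y, z)` component order and the function name are assumptions; check the convention used by your own pose code.

```python
import torch
import torch.nn.functional as F

def quat_to_rotmat(q_raw):
    """q_raw: (B, 4) unnormalized network output in (w, x, y, z) order."""
    # Project the raw output onto the unit sphere before converting.
    w, x, y, z = F.normalize(q_raw, dim=-1).unbind(-1)
    R = torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1)
    return R.reshape(-1, 3, 3)
```

Because every term is quadratic in the quaternion components, `quat_to_rotmat(q)` and `quat_to_rotmat(-q)` return the same matrix.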

---