# Overview of Approaches and Deep Learning Architecture For Visual Odometry

###  Deep Learning VO: main families

**A. Pose regression (end-to-end)**

* **Idea:** CNN (or CNN+RNN) regresses $(\mathbf{R},\mathbf{t}) \in SE(3)$ directly from stacked frames.
* **Architectures:** PoseNet-style CNNs → ConvLSTM/GRU for temporal context.
* **Losses:** supervised $\ell_1/\ell_2$ on translation, geodesic loss on rotation; or **self-supervised photometric** (see "Core training losses" below).
* **Pros:** simple inference; can be fast.
* **Cons:** scale drift (monocular), weak geometry priors, generalization risk.

**B. Depth+Pose joint learning (self-supervised SfM-style)**

* **Idea:** one net predicts **depth** $D_t$; another predicts **relative pose** $T_{t\rightarrow s}$. Reproject source $\mathbf{I}_s$ into target with $D_t$ and $T$; train by minimizing **photometric/SSIM** reconstruction.
* **Architectures:** U-Net depth backbones; small PoseNet; sometimes **cost volumes** (stereo) or **transformers**.
* **Extras:** auto-masking for non-rigid pixels, **explainability masks**, multi-scale supervision, **edge-aware smoothness** on depth.
* **Pros:** no GT poses needed; geometry-aware; scales well with data.
* **Cons:** moving objects/occlusions need handling; absolute scale ambiguous (mono).

**C. Geometry-aware networks (differentiable optimization inside)**

* **Idea:** embed **PnP/BA/ICP** as differentiable layers (Gauss-Newton blocks, differentiable bundle adjustment, learned Jacobians/weights).
* **Representative examples:** DeepV2D-like depth-pose iterative refinement, BA-Net-style layers, **DROID-SLAM-like** dense matching + iterative pose/structure updates.
* **Pros:** better inductive bias; stronger generalization; better consistency.
* **Cons:** more complex; heavier training; careful stability engineering.

**D. Flow- or correspondence-driven VO**

* **Idea:** learn dense optical flow or correspondences; then recover pose via differentiable epipolar geometry, or train end-to-end.
* **Pros:** good on dynamic scenes with robust matchers; integrates with cost volumes/transformers.
* **Cons:** scale ambiguity (mono); need rigidity masks or scene flow.


---

### Core training losses (self-supervised mono/stereo)

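Let $I_t$ be the target image and $I_s$ a source image; the networks predict depth $D_t$ and relative pose $T_{t\rightarrow s}$. For each pixel $p$ in $I_t$ (with homogeneous coordinates $\tilde{p}$ and camera intrinsics $K$):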

1. **Back-project:** $\mathbf{X} = D_t(p)\,K^{-1}\tilde{p}$
2. **Transform:** $\mathbf{X}_s = T_{t\rightarrow s}\,\mathbf{X}$
3. **Project:** $p' \sim K\,\mathbf{X}_s$ → sample $\hat{I}_t(p) = I_s(p')$

**Photometric loss:**
$\mathcal{L}_{pho} = \alpha \frac{1 - \mathrm{SSIM}(I_t,\hat{I}_t)}{2} + (1-\alpha)\|I_t-\hat{I}_t\|_1$

**Depth smoothness (edge-aware):**
$\mathcal{L}_{sm} = \sum_{p} \left( |\partial_x D_t(p)|\, e^{-|\partial_x I_t(p)|} + |\partial_y D_t(p)|\, e^{-|\partial_y I_t(p)|} \right)$

**Geometry consistency:**

* Epipolar loss: $\tilde{p}'^\top F \tilde{p} \approx 0$ on learned correspondences.
* Cycle/reprojection min over multiple sources to handle occlusions.
* **Scale constraints:** stereo baseline, IMU priors, or learned scale head.

**Rotation loss (geodesic) for supervised/regularized pose:**
$\ell_R(R,\hat{R}) = \|\log(R^\top\hat{R})\|_2$

---

###  Popular architectural building blocks

* **Backbones:** ResNet/EfficientNet/MobileNet, or ViT/ConvNeXt-style hybrids.
* **Temporal:** ConvLSTM/GRU; 1D temporal convs; attention over clips.
* **Transformers:** for long-range associations, global cost volumes, memory (e.g., recurrent matching + pose/BA heads).
* **Cost volumes:** stereo/monocular multi-hypothesis depth refinement.
* **Differentiable solvers:** Gauss-Newton layers, differentiable PnP/ICP, learned robust weights (M-estimation).
* **Uncertainty heads:** aleatoric/epistemic to weight residuals and poses.

---

###  Practical training & engineering tips

* **Data curation:** varied motion/illumination; ensure exposure consistency or use brightness augmentation & learned photometric invariance.
* **Non-rigidity handling:** auto-mask moving objects; min-reprojection over multiple sources; per-pixel uncertainty weighting.
* **Scale:** prefer stereo or occasional depth supervision/IMU to anchor scale; else learn a scale head or post-scale with a known height/velocity prior.
* **Drift control:** keyframes + temporal windows; small **differentiable BA** blocks every N frames; pose-graph fine-tuning at segment ends.
* **Initialization:** identity or gyro-seeded initial pose; pyramids and coarse-to-fine warping reduce bad local minima.
* **Numerics:** SE(3) parametrization via **Lie algebra** $\boldsymbol{\xi} \in \mathbb{R}^6$; compose poses with $\exp(\cdot)$ / $\log(\cdot)$; use geodesic rotation losses.
* **Speed:** share encoders for depth/pose; mixed precision; tile-based warping; keep cost volumes shallow.

---

# 1. Loss Functions

When you train a network to predict **rotations**, you want a loss function that measures “how far apart” two rotations are.
Rotations live on the **special orthogonal group** SO(3), which is not a flat Euclidean space — so we need to be careful.

## 1.1 Rotation Loss $SO(3)$

###  1.1.1 Rotation Representation 

* **Quaternions** (unit 4D vectors, $q \in \mathbb{R}^4, \|q\|=1$)
* **Rotation matrices** ($R \in SO(3)$, orthogonal 3×3 with det=+1)
* **Axis-angle** ($\theta, \mathbf{u}$)

For deep learning, **quaternions** are often used because:

* They have no gimbal-lock singularities (unlike Euler angles).
* They are compact (4 parameters).
* Easy to normalize to unit norm after network output.

---

###  1.1.2 Quaternion Loss (Naïve Euclidean Loss)

The simplest approach is to minimize the **L2 distance between quaternions**:

$$
\mathcal{L}_\text{quat} = \| q_\text{pred} - q_\text{gt} \|_2
$$

But there are **two problems**:

1. **Double cover**: $q$ and $-q$ represent the same rotation.
   → If the network predicts $-q_\text{gt}$, the loss will be **large**, even though the rotation is exactly correct.
2. **Euclidean mismatch**: The Euclidean distance between quaternions does not exactly correspond to the **geodesic distance** (shortest path on SO(3)).

**Fix for double cover:**

Take the shorter path:

$$
\mathcal{L}_\text{quat} = 1 - |\langle q_\text{pred}, q_\text{gt} \rangle|
$$

where $\langle \cdot, \cdot \rangle$ is the quaternion dot product.
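Since $|\langle q_\text{pred}, q_\text{gt} \rangle| = \cos(\theta/2)$, where $\theta$ is the angle of the relative rotation, this loss equals $1 - \cos(\theta/2)$, which is approximately $\theta^2/8$ for small errors.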
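
The double-cover problem is easy to see numerically. The following is a minimal NumPy sketch (the function names `quat_l2_loss` and `quat_dot_loss` are illustrative, not from any library) that compares both losses when the prediction is $-q_\text{gt}$, assuming quaternions are stored as `[w, x, y, z]`:

```python
import numpy as np

def quat_l2_loss(q_pred, q_gt):
    """Naive Euclidean distance between two unit quaternions."""
    return float(np.linalg.norm(q_pred - q_gt))

def quat_dot_loss(q_pred, q_gt):
    """Sign-invariant loss 1 - |<q_pred, q_gt>|."""
    return float(1.0 - abs(np.dot(q_pred, q_gt)))

# 90 degrees about z, stored as [w, x, y, z] (assumed ordering)
q_gt = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
q_flipped = -q_gt  # same physical rotation, opposite sign

l2_flipped = quat_l2_loss(q_flipped, q_gt)    # large, although the rotation is identical
dot_flipped = quat_dot_loss(q_flipped, q_gt)  # essentially zero
print(l2_flipped, dot_flipped)
```

The L2 loss reports the maximum possible distance between unit quaternions (2.0) for a perfect prediction, while the dot-product loss is zero up to rounding.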

---


### 1.1.3 Geodesic Loss (Rotation-Angle Loss)

The **geodesic distance** between two rotations $R_1, R_2 \in SO(3)$ is the smallest rotation angle that aligns them.
If $R = R_1^\top R_2$, then:

$$
\theta = \cos^{-1}\!\left(\frac{\text{trace}(R) - 1}{2}\right)
$$



If $R_1$ and $R_2$ are nearly equal, then $R = R_1^\top R_2 \approx R_1^\top R_1 = I$ (a rotation matrix is orthogonal, so its transpose is its inverse). Hence $\text{trace}(R) \approx 3$, the argument $\frac{\text{trace}(R) - 1}{2}$ is close to $1$, and $\theta \approx \cos^{-1}(1) = 0$.


This is the **geodesic (shortest-path) distance on SO(3)** induced by its bi-invariant Riemannian metric.
Loss is typically defined as:

$$
\mathcal{L}_\text{geo} = \theta = \cos^{-1}\!\left(\frac{\text{trace}(R_1^\top R_2) - 1}{2}\right)
$$

If using quaternions, you can avoid matrices:

$$
\mathcal{L}_\text{geo} = 2 \cos^{-1} \!\big( |\langle q_\text{pred}, q_\text{gt} \rangle| \big)
$$

where again we take the absolute value to fix the double-cover issue.

---


### 1.1.4 Geodesic Loss Numerical Example

**Ground truth rotation**: 90° around z-axis

  $$
  q_\text{gt} = \left[\cos(45°), 0, 0, \sin(45°)\right] = [0.7071, 0, 0, 0.7071]
  $$

**Predicted rotation**: 60° around z-axis

  $$
  q_\text{pred} = \left[\cos(30°), 0, 0, \sin(30°)\right] = [0.8660, 0, 0, 0.5]
  $$

Both are already normalized.

---

**Compute Dot Product**

$$
\langle q_\text{pred}, q_\text{gt} \rangle
= (0.8660)(0.7071) + (0)(0) + (0)(0) + (0.5)(0.7071)
$$

$$
= 0.6124 + 0.3536 = 0.9660
$$

Take absolute value (to handle sign ambiguity):

$$
|\langle q_\text{pred}, q_\text{gt} \rangle| = 0.9660
$$

---

**Compute Geodesic Loss (Angle)**

Geodesic loss (in radians):

$$
\mathcal{L}_\text{geo} = 2 \cos^{-1}(0.9660)
$$

Compute step-by-step:

* $\cos^{-1}(0.9660) ≈ 0.2618 \, \text{rad}$
* Multiply by 2 → $\mathcal{L}_\text{geo} ≈ 0.5236 \, \text{rad}$

Convert to degrees:

$$
0.5236 \, \text{rad} × \frac{180°}{\pi} ≈ 30°
$$
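
The same numbers can be reproduced in a few lines. This is a minimal NumPy sketch, not a reference implementation; the helper names (`rot_z`, `quat_z`, `geodesic_from_matrices`, `geodesic_from_quats`) are made up for illustration, and quaternions are assumed to be ordered `[w, x, y, z]`. The `np.clip` call guards against arguments that drift slightly outside $[-1, 1]$ through rounding, which would otherwise make `arccos` return NaN.

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z-axis."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def quat_z(deg):
    """Unit quaternion [w, x, y, z] for a rotation of `deg` degrees about z."""
    h = np.deg2rad(deg) / 2.0
    return np.array([np.cos(h), 0.0, 0.0, np.sin(h)])

def geodesic_from_matrices(R1, R2):
    """Rotation angle (radians) of R1^T R2, using the trace formula."""
    cos_theta = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def geodesic_from_quats(q1, q2):
    """Rotation angle (radians) between two unit quaternions, sign-invariant."""
    d = abs(np.dot(q1, q2))
    return float(2.0 * np.arccos(np.clip(d, -1.0, 1.0)))

theta_mat = geodesic_from_matrices(rot_z(90), rot_z(60))
theta_quat = geodesic_from_quats(quat_z(60), quat_z(90))
print(np.rad2deg(theta_mat), np.rad2deg(theta_quat))  # both close to 30 degrees
```

Both forms agree on the 30° error. In an autograd framework the derivative of `arccos` is unbounded at $\pm 1$, so it is common to clamp slightly inside the interval; that detail is omitted here.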

---



###  1.1.5 Practical Advice

* **If you only care about small orientation errors** (e.g., fine-tuning a network near correct pose):
  Quaternion dot-product loss or L2 loss is usually fine (cheaper, smooth gradients).

* **If you care about accurate global orientation** (e.g., SLAM, pose estimation, camera relocalization):
  **Geodesic loss is strongly preferred** because it reflects the real physical difference between two orientations.

## 1.2 Full Transformation Loss $SE(3)$


To make an **SE(3) loss** you typically combine:

1. a **rotation term** that measures distance on SO(3) (your geodesic loss), and
2. a **translation term** that measures distance in $\mathbb{R}^3$.

There are two common (and solid) ways to do this.

---

### 1.2.1 Option A — Simple & Effective (weighted sum)

Use the geodesic **rotation angle** (in radians) plus a norm on translation:

$$
\mathcal{L}_{\text{SE(3)}} \;=\; \lambda_R \,\underbrace{\big(2\arccos(|\langle q_{\text{pred}}, q_{\text{gt}}\rangle|)\big)}_{\text{SO(3) geodesic angle}}
\;+\; \lambda_t \,\underbrace{\|\,t_{\text{pred}}-t_{\text{gt}}\,\|_2}_{\text{meters}}
$$

* $\lambda_R$ and $\lambda_t$ balance **units** (radians vs meters).
  Rules of thumb:

  * If typical translation errors are \~0.05–0.2 m and angular errors are \~2–10°, try $\lambda_R\in[0.5,2.0]$, $\lambda_t\in[1,10]$.
  * Tune so both terms contribute similar magnitude early in training.
* Often use **robust norms** (e.g., Huber) on translation.

This is the **go-to baseline**: simple, stable, and strong in practice.
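
As a rough sketch of Option A, assuming `[w, x, y, z]` unit quaternions and translations in meters (the function names, default weights, and the Huber threshold `delta` are illustrative choices, not prescribed values):

```python
import numpy as np

def huber(r, delta):
    """Huber penalty applied to a non-negative residual magnitude r."""
    return 0.5 * r**2 if r <= delta else delta * (r - 0.5 * delta)

def se3_weighted_loss(q_pred, t_pred, q_gt, t_gt, lam_R=1.0, lam_t=1.0, delta=None):
    """Weighted sum of geodesic rotation angle (rad) and translation error (m).

    If `delta` is given, the translation error goes through a Huber penalty.
    Returns (loss, rotation error in rad, translation error in m).
    """
    d = np.clip(abs(np.dot(q_pred, q_gt)), 0.0, 1.0)
    rot_err = 2.0 * np.arccos(d)
    trans_err = np.linalg.norm(t_pred - t_gt)
    trans_term = trans_err if delta is None else huber(trans_err, delta)
    return float(lam_R * rot_err + lam_t * trans_term), float(rot_err), float(trans_err)

# Illustrative poses: 60 deg vs 90 deg about z, translation off by (0.2, 0.1, 0) m
q_pred = np.array([np.cos(np.pi / 6), 0.0, 0.0, np.sin(np.pi / 6)])
q_gt = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
t_pred = np.array([1.2, 0.1, 0.0])
t_gt = np.array([1.0, 0.0, 0.0])

loss, rot_err, trans_err = se3_weighted_loss(q_pred, t_pred, q_gt, t_gt)
print(loss, np.rad2deg(rot_err), trans_err)
```

Returning the two error terms alongside the loss makes it easy to log angle and translation separately while tuning $\lambda_R$ and $\lambda_t$.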

### 1.2.2 Option B — True SE(3) geodesic via Lie Log (advanced)

Compute the **relative transform** $\Delta T = T_{\text{gt}}^{-1}T_{\text{pred}}$, take the **matrix logarithm** to get a 6-vector $\xi = [\omega, v]\in \mathbb{R}^6$ (rotation/translation in the tangent space), then penalize it:

$$
\xi \;=\; \log(\Delta T) \;=\; 
\begin{bmatrix} \omega \\ v \end{bmatrix},\quad
\mathcal{L} \;=\; \|\; W\,\xi \;\|_2
\quad\text{or}\quad
\mathcal{L} \;=\; \|W_\omega \omega\|_2 + \|W_v v\|_2.
$$

* Here $W$ (or $W_\omega, W_v$) sets the relative weighting/units.
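* This treats rotation and translation **on the same manifold footing**. With $\Delta T = T_{\text{gt}}^{-1}T_{\text{pred}}$ the error is unchanged when both poses are left-multiplied by the same transform (a change of world frame); using $T_{\text{pred}}T_{\text{gt}}^{-1}$ instead gives invariance under right-multiplication. Choose one convention and use it consistently.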
* Slightly more math and careful numerics (small-angle handling).


This version gives you a **true SE(3) tangent-space error**. Use if you want strict group-theoretic consistency (e.g., in pose-graph optimization or when composition/invariance properties matter).

---

### 1.2.3 Numerical Example SE(3) Option-B


We’ll compute
$\Delta T = T_{\text{gt}}^{-1}T_{\text{pred}}$, then $\xi=\log(\Delta T)=[\omega,\,v]\in\mathbb{R}^6$, and a loss $\|\omega\|_2+\|v\|_2$.

---

**Poses**

* Ground truth: rotation **+90° about z**, translation $t_\text{gt}=[1,\,0,\,0]$
* Prediction: rotation **+60° about z**, translation $t_\text{pred}=[1.2,\,0.1,\,0]$

Rotation matrices:

$$
R_z(\phi)=\begin{bmatrix}\cos\phi & -\sin\phi & 0\\ \sin\phi & \cos\phi & 0\\ 0&0&1\end{bmatrix}
$$

$$
R_\text{gt}=R_z(90^\circ),\quad
R_\text{pred}=R_z(60^\circ)
$$

---

#### Relative transform $\Delta T$

$$
R_{\text{rel}} = R_\text{gt}^\top R_\text{pred}
= \begin{bmatrix}
0.8660254 & 0.5 & 0\\
-0.5 & 0.8660254 & 0\\
0 & 0 & 1
\end{bmatrix}
\quad(\text{a }-30^\circ\text{ rotation about }z)
$$

$$
t_{\text{rel}} = R_\text{gt}^\top (t_\text{pred}-t_\text{gt})
= \begin{bmatrix}0.1\\ -0.2\\ 0\end{bmatrix}
$$

---

#### $\log(\Delta T)\Rightarrow [\omega,\,v]$

**SO(3) log (rotation):**

* $\theta=\arccos\big((\mathrm{tr}(R_{\text{rel}})-1)/2\big)=\arccos(0.8660254)=\;0.5235988$ rad $=30^\circ$
* Axis $=\ -\hat z$, so

$$
\omega = \theta\cdot(-\hat z) = \begin{bmatrix}0\\0\\-0.5235988\end{bmatrix}
$$

**SE(3) translation log:**

$$
V = I + \frac{1-\cos\theta}{\theta^2}[\omega]_\times
      + \frac{\theta-\sin\theta}{\theta^3}[\omega]_\times^2,
\qquad v = V^{-1} t_{\text{rel}}
$$

Numerically (for $\theta=0.5236$ rad):

$$
V \approx
\begin{bmatrix}
0.95492966 & 0.25587263 & 0\\
-0.25587263 & 0.95492966 & 0\\
0 & 0 & 1
\end{bmatrix},
\qquad
v \approx
\begin{bmatrix}
0.1500647\\
-0.1692298\\
0
\end{bmatrix}
$$

So

$$
\xi=\log(\Delta T)=
\big[\,\omega;\,v\,\big]
=
\begin{bmatrix}
0\\
0\\
-0.5235988\\
0.1500647\\
-0.1692298\\
0
\end{bmatrix}.
$$

---

#### Example loss

With $\mathcal L = \|\omega\|_2 + \|v\|_2$:

* $\|\omega\|_2 = 0.5235988$ (30° in radians)
* $\|v\|_2 \approx \sqrt{0.1500647^2 + (-0.1692298)^2} \approx 0.2261817$

$$
\boxed{\mathcal L \approx 0.5236 + 0.2262 = 0.7498}
$$
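
The worked example can be checked with a short NumPy sketch of the SE(3) logarithm. The helper names (`hat`, `so3_log`, `se3_log`) are illustrative; the rotation log below is only valid away from $\theta = \pi$, and the small-angle branch uses a first-order approximation of $V$.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix [w]_x such that hat(w) @ p == np.cross(w, p)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_log(R):
    """Axis-angle vector of R (not valid near theta = pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-8:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * axis / (2.0 * np.sin(theta))

def se3_log(R, t):
    """Return (omega, v) with v = V^{-1} t, using the V matrix from the text."""
    w = so3_log(R)
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        V = np.eye(3) + 0.5 * W  # first-order approximation for tiny angles
    else:
        V = (np.eye(3)
             + (1.0 - np.cos(theta)) / theta**2 * W
             + (theta - np.sin(theta)) / theta**3 * (W @ W))
    return w, np.linalg.solve(V, t)

def rot_z(deg):
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

R_gt, t_gt = rot_z(90), np.array([1.0, 0.0, 0.0])
R_pred, t_pred = rot_z(60), np.array([1.2, 0.1, 0.0])

# Delta T = T_gt^{-1} T_pred, written out in rotation/translation parts
R_rel = R_gt.T @ R_pred
t_rel = R_gt.T @ (t_pred - t_gt)

omega, v = se3_log(R_rel, t_rel)
loss = np.linalg.norm(omega) + np.linalg.norm(v)
print(omega, v, loss)
```

Solving $V v = t_\text{rel}$ with `np.linalg.solve` avoids forming $V^{-1}$ explicitly.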

---


### 1.2.4 How to pick weights (very important)

* Units differ: **radians vs meters**. You must balance them.
* Three common strategies:

  1. **Manual tuning** (start with $\lambda_R=1, \lambda_t\in[1,10]$).
  2. **Normalize by dataset scale** (e.g., divide translation by scene extent).
  3. **Learned homoscedastic uncertainty** (Kendall & Cipolla):

     $$
     \mathcal{L} = \frac{1}{2\sigma_R^2} \, \mathcal{L}_R + \frac{1}{2\sigma_t^2}\, \mathcal{L}_t + \log \sigma_R + \log \sigma_t
     $$

     with $\log \sigma_R, \log \sigma_t$ as learnable scalars.

---

### 1.2.5 Quick recommendations

* Start with **Option A** (weighted sum, Huber on translation). It’s robust and easy to tune.
* If you need **group-consistent** errors (e.g., enforcing trajectory smoothness with relative poses), use **Option B** (Lie log).
* Always monitor **angle (deg)** and **translation (m)** separately as metrics, even if your loss is a combination.




## 1.3 Photometric Loss

Photometric loss is commonly used in **unsupervised visual odometry** or **monocular depth estimation**.

* You have a **target image** $I_t$ and a **reference image** $I_r$.
* Using predicted **depth** $D_t$ and **camera motion** ($R, \mathbf{t}$), you warp $I_r$ into the target frame, producing a **reconstructed image** $\hat{I}_t$.
* The **photometric loss** compares $I_t$ and $\hat{I}_t$: if they are visually similar, the loss is small — meaning your depth and motion predictions are consistent.

The most common version is **pixel-wise L1 loss**:

$$
\mathcal{L}_{\text{photo}} = \frac{1}{N} \sum_{i=1}^N \big| I_t(i) - \hat{I}_t(i) \big|
$$

where $N$ is the number of valid pixels.

---

### 1.3.1 Numerical Example (Pixel-wise Photometric Loss)

**Target image $I_t$:**

$$
I_t =
\begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.2 & 0.4 & 0.6 \\
0.3 & 0.5 & 0.7
\end{bmatrix}
$$

**Reconstructed image $\hat{I}_t$:**

$$
\hat{I}_t =
\begin{bmatrix}
0.0 & 0.1 & 0.4 \\
0.3 & 0.3 & 0.5 \\
0.4 & 0.4 & 0.8
\end{bmatrix}
$$

Pixel-wise absolute differences:

$$
|I_t - \hat{I}_t| =
\begin{bmatrix}
0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1
\end{bmatrix}
$$

Mean over all $N=9$ pixels:

$$
\mathcal{L}_{\text{photo}} = \frac{0.9}{9} = 0.1
$$

A lower value means the reconstruction is closer to the target.
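
A minimal NumPy sketch of this computation, with an optional validity mask (the function name `photometric_l1` and the mask argument are illustrative):

```python
import numpy as np

def photometric_l1(I_t, I_hat, valid=None):
    """Mean absolute difference over valid pixels (all pixels if no mask)."""
    diff = np.abs(I_t - I_hat)
    if valid is None:
        return float(diff.mean())
    n = int(valid.sum())
    return float(diff[valid].sum() / max(n, 1))  # guard against an empty mask

I_t = np.array([[0.1, 0.2, 0.3],
                [0.2, 0.4, 0.6],
                [0.3, 0.5, 0.7]])
I_hat = np.array([[0.0, 0.1, 0.4],
                  [0.3, 0.3, 0.5],
                  [0.4, 0.4, 0.8]])

loss_all = photometric_l1(I_t, I_hat)

# Example mask: pretend the last column reprojected out of bounds
valid = np.ones_like(I_t, dtype=bool)
valid[:, -1] = False
loss_masked = photometric_l1(I_t, I_hat, valid)
print(loss_all, loss_masked)
```

Because every pixel in this toy example differs by 0.1, masking out a column does not change the mean; on real images the mask matters a great deal.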

---

### 1.3.2 Pose Representation (Rigid-Body Motion)

In unsupervised VO, the network predicts the **rigid-body motion** between two frames as an SE(3) transform:

$$
T_{t \rightarrow r} =
\begin{bmatrix}
R & \mathbf{t} \\
0 & 1
\end{bmatrix}
$$

* $R$: $3 \times 3$ rotation matrix, usually parameterized by a **quaternion**
* $\mathbf{t}$: $3 \times 1$ translation vector

---

**Example Transformation**

* **Quaternion:** $q = (w=0.9239, x=0, y=0, z=0.3827)$ → 45° rotation about Z-axis.
* **Translation:** $\mathbf{t} = [1, 0, 0]^T$

Rotation matrix from quaternion:

$$
R =
\begin{bmatrix}
\cos 45° & -\sin 45° & 0 \\
\sin 45° & \cos 45° & 0 \\
0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.707 & -0.707 & 0 \\
0.707 & 0.707 & 0 \\
0 & 0 & 1
\end{bmatrix}
$$

Full transform:

$$
T_{t \rightarrow r} =
\begin{bmatrix}
0.707 & -0.707 & 0 & 1 \\
0.707 & 0.707 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$
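
The quaternion-to-matrix step can be sketched as follows, assuming the `[w, x, y, z]` ordering used in the example (the helper names `quat_to_rotmat` and `make_transform` are illustrative):

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion [w, x, y, z]; q is normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def make_transform(R, t):
    """Assemble a 4x4 homogeneous transform from R (3x3) and t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

q = np.array([0.9239, 0.0, 0.0, 0.3827])  # about 45 degrees around z
t = np.array([1.0, 0.0, 0.0])
T = make_transform(quat_to_rotmat(q), t)
print(np.round(T, 3))
```

Normalizing inside `quat_to_rotmat` keeps the result a valid rotation even when the network output (or a rounded value like the one above) is not exactly unit length.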

---

### 1.3.3 Back-Project Pixel to 3D (Target Frame)

Given intrinsics

$$
K =
\begin{bmatrix}
100 & 0 & 50 \\
0 & 100 & 50 \\
0 & 0 & 1
\end{bmatrix}
,\quad
K^{-1} =
\begin{bmatrix}
0.01 & 0 & -0.5 \\
0 & 0.01 & -0.5 \\
0 & 0 & 1
\end{bmatrix}
$$

Take target pixel $(u_t,v_t)=(60,40)$ with predicted depth $D_t=2.0$:

$$
K^{-1} \begin{bmatrix} 60 \\ 40 \\ 1 \end{bmatrix}
=
\begin{bmatrix}
0.1 \\ -0.1 \\ 1
\end{bmatrix}
\quad\Rightarrow\quad
P_t = D_t \cdot
\begin{bmatrix}
0.1 \\ -0.1 \\ 1
\end{bmatrix}
=
\begin{bmatrix}
0.2 \\ -0.2 \\ 2.0
\end{bmatrix}
$$

This is the **3D point in target frame**.

---

### 1.3.4 Transform Point to Reference Frame

Apply $P_r = R P_t + \mathbf{t}$:

1. Rotate:

$$
R P_t =
\begin{bmatrix}
0.28284 \\ 0 \\ 2.0
\end{bmatrix}
$$

2. Translate:

$$
P_r =
\begin{bmatrix}
1.28284 \\ 0 \\ 2.0
\end{bmatrix}
$$

---

### 1.3.5 Project Back to Reference Image

Project using intrinsics:

$$
u_r = 100 \cdot \frac{1.28284}{2.0} + 50 = 114.14,
\quad
v_r = 100 \cdot 0 + 50 = 50
$$

Resulting pixel in $I_r$: $(u_r,v_r)=(114.14, 50.0)$

If the image is $100 \times 100$, this lies **out of bounds** → we **mask it out** (does not contribute to loss).
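
Sections 1.3.3 to 1.3.5 can be condensed into one small function. This is a sketch for a single pixel under the example's intrinsics and pose; `reproject` and `in_bounds` are illustrative names, and a real pipeline would also discard points with non-positive depth in the reference frame.

```python
import numpy as np

def reproject(u, v, depth, K, R, t):
    """Back-project (u, v) with `depth`, move it by (R, t), project with K."""
    P_t = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))  # 3D point, target frame
    P_r = R @ P_t + t                                         # 3D point, reference frame
    uvw = K @ P_r
    return uvw[:2] / uvw[2], P_r

def in_bounds(uv, width, height):
    """True if the pixel coordinate lies inside a width x height image."""
    return bool(0 <= uv[0] <= width - 1 and 0 <= uv[1] <= height - 1)

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
c = np.cos(np.pi / 4)
R = np.array([[c, -c, 0.0], [c, c, 0.0], [0.0, 0.0, 1.0]])  # 45 deg about z
t = np.array([1.0, 0.0, 0.0])

uv, P_r = reproject(60, 40, 2.0, K, R, t)
uv_no_t, _ = reproject(60, 40, 2.0, K, R, np.zeros(3))
print(uv, in_bounds(uv, 100, 100))            # outside a 100 x 100 image
print(uv_no_t, in_bounds(uv_no_t, 100, 100))  # inside when t = 0
```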

---

### 1.3.6 Sanity Check (No Translation)

If $\mathbf{t}=0$:

$$
P_r=(0.28284, 0, 2.0) \quad\Rightarrow\quad
u_r=64.14, \; v_r=50
$$

Now the pixel is **inside the image** → it would contribute to the loss.

---

### 1.3.7 Scale Ambiguity in Monocular VO

If we **halve depth and translation**:

* $P_t'=(0.1,-0.1,1.0)$
* $\mathbf{t}'=[0.5,0,0]^T$

Then

$$
P_r'=(0.64142, 0, 1.0)
\quad\Rightarrow\quad
u_r'=114.14, v_r'=50
$$

The projected pixel is **identical** → photometric loss cannot recover absolute scale, only **relative geometry**.
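
A quick numerical check of this ambiguity, as a sketch using the same intrinsics and pose as the example (`project_pixel` is an illustrative helper): scaling depth and translation by any common factor `s` leaves the projected pixel unchanged.

```python
import numpy as np

def project_pixel(u, v, depth, K, R, t):
    """Reproject target pixel (u, v) into the reference image."""
    P_r = R @ (depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))) + t
    uvw = K @ P_r
    return uvw[:2] / uvw[2]

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
c = np.cos(np.pi / 4)
R = np.array([[c, -c, 0.0], [c, c, 0.0], [0.0, 0.0, 1.0]])
t = np.array([1.0, 0.0, 0.0])

base = project_pixel(60, 40, 2.0, K, R, t)
scaled = {s: project_pixel(60, 40, s * 2.0, K, R, s * t) for s in (0.5, 2.0, 10.0)}
for s, uv in scaled.items():
    print(s, uv, np.allclose(uv, base))
```

The common factor cancels in the perspective division, which is why an extra cue (stereo baseline, IMU, known height) is needed to fix metric scale.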

---

### 1.3.8 Training Loop Summary

1. Predict depth $D_t$ and pose $(R,\mathbf{t})$.
2. Back-project pixels → 3D points $P_t$.
3. Transform → reference frame $P_r$.
4. Project back to image plane → $(u_r,v_r)$.
5. Sample $I_r$ at these coordinates → reconstruct $\hat{I}_t$.
6. Compute photometric loss (L1 or L1+SSIM) over valid pixels.
7. Backprop through projection, sampling, and networks.

---



##  Convention Used in Most Papers (e.g. SfMLearner, Monodepth2)

| **Frame**                    | **Role**                                                          |
| ---------------------------- | ----------------------------------------------------------------- |
| $I_i$                        | **Target frame** → you predict depth $D_i$ for this frame         |
| $I_{i+1}$ (and/or $I_{i-1}$) | **Reference frame(s)** → you warp them into frame $i$’s viewpoint |

---

##  What Happens Step by Step

1. **Depth Prediction:**
   $D_i = \text{DepthNet}(I_i)$
   → per-pixel depth map **for frame $i$**.

2. **Pose Prediction:**
   $(R, \mathbf{t}) = \text{PoseNet}(I_i, I_{i+1})$
   → relative transform $T_{i \rightarrow i+1}$ (from frame $i$ to frame $i+1$).

3. **Back-Project:**
   Use $D_i$ and camera intrinsics $K$ to get 3D points $P_i$ in frame $i$.

4. **Transform:**
   Move points to frame $i+1$:
   $P_{i+1} = T_{i \rightarrow i+1} \cdot P_i$.

5. **Project:**
   Project $P_{i+1}$ to 2D using intrinsics $K$ → get pixel coords $(u_{i+1}, v_{i+1})$.

6. **Sample:**
   Bilinear sample $I_{i+1}$ at these coords → reconstructed image $\hat{I}_i$.

7. **Loss:**
   Compare $I_i$ (target) and $\hat{I}_i$:

   $$
   \mathcal{L}_{\text{photo}} = \frac{1}{N}\sum |I_i - \hat{I}_i|
   $$
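
The seven steps above can be sketched for a whole image in plain NumPy. This is a simplified, non-differentiable illustration for a grayscale image; the names `inverse_warp`, `I_ref`, and `D_tgt` are made up here, and a training pipeline would use a differentiable sampler from its deep learning framework instead.

```python
import numpy as np

def inverse_warp(I_ref, D_tgt, K, R, t):
    """Reconstruct the target view by bilinearly sampling I_ref.

    I_ref: (H, W) reference image, D_tgt: (H, W) target depth,
    (R, t): transform from target frame to reference frame.
    Returns (I_hat, valid), both of shape (H, W).
    """
    H, W = D_tgt.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)

    P_t = (np.linalg.inv(K) @ pix) * D_tgt.reshape(1, -1)  # back-project
    P_r = R @ P_t + t.reshape(3, 1)                        # transform
    z = P_r[2]
    safe_z = np.where(z > 1e-6, z, 1.0)                    # avoid division by ~0
    proj = K @ P_r                                         # project
    u_r, v_r = proj[0] / safe_z, proj[1] / safe_z

    valid = (z > 1e-6) & (u_r >= 0) & (u_r <= W - 1) & (v_r >= 0) & (v_r <= H - 1)

    # Bilinear sampling; indices are clipped so invalid pixels stay addressable
    u0 = np.clip(np.floor(u_r).astype(int), 0, W - 2)
    v0 = np.clip(np.floor(v_r).astype(int), 0, H - 2)
    du = np.clip(u_r - u0, 0.0, 1.0)
    dv = np.clip(v_r - v0, 0.0, 1.0)
    I_hat = ((1 - du) * (1 - dv) * I_ref[v0, u0]
             + du * (1 - dv) * I_ref[v0, u0 + 1]
             + (1 - du) * dv * I_ref[v0 + 1, u0]
             + du * dv * I_ref[v0 + 1, u0 + 1])
    I_hat = np.where(valid, I_hat, 0.0)
    return I_hat.reshape(H, W), valid.reshape(H, W)

# Synthetic check: constant depth 2 and t_x = 0.2 with f = 10 shifts pixels by 1
H = W = 8
K = np.array([[10.0, 0.0, 3.5], [0.0, 10.0, 3.5], [0.0, 0.0, 1.0]])
I_ref = np.random.default_rng(0).random((H, W))
I_tgt = np.zeros_like(I_ref)
I_tgt[:, :-1] = I_ref[:, 1:]  # target sees the reference shifted by one column

I_hat, valid = inverse_warp(I_ref, np.full((H, W), 2.0), K, np.eye(3),
                            np.array([0.2, 0.0, 0.0]))
loss = np.abs(I_tgt - I_hat)[valid].sum() / max(int(valid.sum()), 1)
print(loss, valid.sum())
```

With the correct depth and pose the masked photometric loss is zero up to rounding, and the last column is masked out because it reprojects outside the reference image.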

---

##  Intuition

* **Target frame:** $I_i$ — you are trying to reproduce this image.
* **Reference frame:** $I_{i+1}$ — you are "borrowing" its pixels, warping them into $i$’s viewpoint.
* If reconstruction is good, $I_i \approx \hat{I}_i$.

---

##  You Can Also Swap

You can just as well make $I_{i+1}$ the target and $I_i$ the reference — as long as you're consistent.
But by convention:

* **Depth is always predicted for the target frame.**
* **Reference frames are the ones you warp into the target’s viewpoint.**

---

### TL;DR

**With two consecutive frames $I_i, I_{i+1}$:**

* **Target frame:** $I_i$ (depth $D_i$ is predicted for this one)
* **Reference frame:** $I_{i+1}$ (warped into frame $i$’s viewpoint to create $\hat{I}_i$)
* **Loss computed between:** $I_i$ and $\hat{I}_i$

---



## 1.4 Structural Similarity Index Measure (SSIM)


The **Structural Similarity Index Measure (SSIM)** is a widely used metric for measuring the similarity between two images. Unlike simple metrics such as Mean Squared Error (MSE) or Peak Signal-to-Noise Ratio (PSNR), SSIM is designed to model the way humans perceive image quality — focusing on structural information, contrast, and luminance rather than raw pixel differences.

---

### 1.4.1 The Idea Behind SSIM

SSIM tries to answer: *“How similar are two images in terms of structure, contrast, and brightness?”*

It decomposes similarity into **three components**:

1. **Luminance similarity** $l(x, y)$:

   Are the two images equally bright on average?

   $$
   l(x,y) = \frac{2 \mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}
   $$

- If both images have similar brightness, this term is near 1.
- If one image is much darker, it will drop below 1.   

2. **Contrast similarity** $c(x, y)$:

   Do the two images have the same amount of contrast?

   $$
   c(x,y) = \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
   $$

- If both images have similar contrast (variability), this is near 1.
- If one is flat (low contrast) and the other is textured (high contrast), the similarity decreases.

3. **Structural similarity** $s(x, y)$:

   Do the two images have the same patterns and textures?

   $$
   s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
   $$
- If $x$ and $y$ rise and fall together (high correlation), this is near $1$.
- If they are uncorrelated or inverted (noise, wrong edges), this value becomes smaller or even negative.   

Here:

* $\mu_x, \mu_y$ are the mean intensities,
* $\sigma_x, \sigma_y$ are standard deviations,
* $\sigma_{xy}$ is covariance between $x$ and $y$,
* $C_1, C_2, C_3$ are small constants to stabilize division.

The final SSIM is:

$$
SSIM(x, y) = [l(x, y)]^\alpha \cdot [c(x, y)]^\beta \cdot [s(x, y)]^\gamma
$$

Usually $\alpha = \beta = \gamma = 1$.

---

### 1.4.2 Normalized Correlation
The **normalized correlation** part of SSIM is the most “structural” component, so it’s worth understanding carefully.

**Step 1: Represent the Images**

Suppose you have two image patches $x$ and $y$ of size $N$ pixels (can be grayscale or single channel).

$$
x = [x_1, x_2, \dots, x_N], \quad y = [y_1, y_2, \dots, y_N]
$$

---

**Step 2: Compute the Mean**

Compute the mean intensity (average brightness) of each patch:

$$
\mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad
\mu_y = \frac{1}{N} \sum_{i=1}^{N} y_i
$$

---

**Step 3: Compute the Standard Deviations**

Compute how much pixel values vary around the mean:

$$
\sigma_x = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_x)^2}
$$

$$
\sigma_y = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \mu_y)^2}
$$

These represent the **contrast** of each image.

---

**Step 4: Compute the Covariance**

Covariance measures how much the two patches vary *together*:

$$
\sigma_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)
$$

* If $x$ and $y$ increase/decrease together → covariance is **positive**.
* If one increases when the other decreases → covariance is **negative**.

---

**Step 5: Normalize → Get the Correlation**

The **Pearson correlation coefficient** is just the covariance normalized by the product of standard deviations:

$$
\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}
$$

* This value ranges between **-1 and 1**.

  * $1.0 \Rightarrow$ perfect positive linear correlation (structures align perfectly).
  * $0 \Rightarrow$ no linear correlation (structures unrelated).
  * $-1.0 \Rightarrow$ perfect negative correlation (inverted contrast).

---

**Step 6: Add Stabilization for SSIM**

In SSIM, to avoid division by zero when contrast is very low, we use a small constant $C_3$:

$$
s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
$$

This keeps the measure well-defined even for very flat patches (e.g., almost uniform gray).

---

**Intuition**

* **Covariance** tells you whether pixel intensities move together.
* **Normalization** by $\sigma_x \sigma_y$ removes the effect of scale/contrast so you focus purely on **structure** (edges, textures, gradients).

This is why SSIM can still give a high similarity score if one image is slightly brighter/darker — because the **pattern** of variations is the same.

---

SSIM tries to mimic the **human visual system** by:

* **Normalizing for lighting** (so small brightness changes are ignored).
* **Normalizing for contrast** (so small contrast changes are less penalized).
* **Measuring structure** (so it cares about edges, patterns, textures).




### 1.4.3 Numerical SSIM Example
This section works through a **numerical SSIM example** with two $3\times3$ grayscale image patches, computing every piece: mean, standard deviation, covariance, normalized correlation, the three SSIM terms (luminance/contrast/structure), and the final SSIM.

**Images (grayscale, 8-bit scale assumed)**

$$
x=\begin{bmatrix}
10&20&30\\
20&30&40\\
30&40&50
\end{bmatrix},\quad
y=\begin{bmatrix}
12&22&32\\
21&31&41\\
29&39&49
\end{bmatrix}
$$

We’ll treat the whole $3\times3$ window as one patch (i.e., a single SSIM window).

---

#### 1) Basic statistics

Let $N=9$.

**Means**

$$
\mu_x=30.0000,\qquad \mu_y=30.6667
$$

**Sample standard deviations** (ddof=1)

$$
\sigma_x=\sqrt{150}=12.2474,\qquad \sigma_y=\sqrt{129.25}=11.3688
$$

**Sample covariance**

$$
\sigma_{xy}=\frac{1}{N-1}\sum (x_i-\mu_x)(y_i-\mu_y)=138.75
$$

**Normalized correlation (Pearson $\rho$)**

$$
\rho=\frac{\sigma_{xy}}{\sigma_x\sigma_y}=\frac{138.75}{12.2474\cdot 11.3688}\approx 0.9965
$$

> Intuition: the patches vary together almost perfectly (very strong structural agreement).

---

#### 2) SSIM components

Use the standard SSIM constants for 8-bit images:

$$
L=255,\quad C_1=(0.01L)^2=6.5025,\quad C_2=(0.03L)^2=58.5225,\quad C_3=\frac{C_2}{2}=29.26125
$$

#### (a) Luminance term $l(x,y)$

$$
l=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1}
=\frac{2\cdot 30\cdot 30.6667+6.5025}{30^2+30.6667^2+6.5025}
\approx 0.99976
$$

#### (b) Contrast term $c(x,y)$

$$
c=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2}
=\frac{2\cdot 12.2474\cdot 11.3688+58.5225}{12.2474^2+11.3688^2+58.5225}
\approx 0.99771
$$

#### (c) Structure term $s(x,y)$

$$
s=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3}
=\frac{138.75+29.26125}{12.2474\cdot 11.3688+29.26125}
\approx 0.99710
$$

---

#### 3) Final SSIM

$$
SSIM=l\cdot c\cdot s\approx 0.99976\cdot 0.99771\cdot 0.99710\approx \mathbf{0.99458}
$$

**Takeaway:** Despite small brightness/contrast differences, the structures match extremely well (high $\rho$ and high SSIM ≈ **0.995**). This is exactly the kind of case where SSIM (and the structure term) shines compared to plain MSE/PSNR.
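
The example can be reproduced with a short NumPy sketch that treats the whole patch as a single SSIM window and uses sample statistics (`ddof=1`), matching the hand calculation. The function name `ssim_single_window` is illustrative; practical SSIM implementations slide a local (often Gaussian-weighted) window over the image and average the resulting map, which this sketch does not do.

```python
import numpy as np

def ssim_single_window(x, y, L=255.0, k1=0.01, k2=0.03):
    """SSIM of two patches treated as one window; returns (ssim, (l, c, s))."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    C1, C2 = (k1 * L) ** 2, (k2 * L) ** 2
    C3 = C2 / 2.0

    mu_x, mu_y = x.mean(), y.mean()
    sd_x, sd_y = x.std(ddof=1), y.std(ddof=1)
    cov = ((x - mu_x) * (y - mu_y)).sum() / (x.size - 1)

    l = (2 * mu_x * mu_y + C1) / (mu_x**2 + mu_y**2 + C1)  # luminance
    c = (2 * sd_x * sd_y + C2) / (sd_x**2 + sd_y**2 + C2)  # contrast
    s = (cov + C3) / (sd_x * sd_y + C3)                    # structure
    return float(l * c * s), (float(l), float(c), float(s))

x = [[10, 20, 30], [20, 30, 40], [30, 40, 50]]
y = [[12, 22, 32], [21, 31, 41], [29, 39, 49]]

ssim, (l, c, s) = ssim_single_window(x, y)
print(round(ssim, 5), round(l, 5), round(c, 5), round(s, 5))
```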




## 1.5 LPIPS

So far, we talked about **SSIM** (hand-crafted metric). Now, **LPIPS (Learned Perceptual Image Patch Similarity)** goes a step further: instead of manually designing similarity measures, it uses **deep features** from pretrained networks (e.g., AlexNet, VGG, SqueezeNet) to capture perceptual similarity.


* Proposed in **"The Unreasonable Effectiveness of Deep Features as a Perceptual Metric"** (Zhang et al., 2018).
* Idea: Humans judge images by *perceptual similarity*, not pixel-wise equality.
* LPIPS measures distance in the **feature space** of a pretrained CNN rather than raw pixels.

---

### 1.5.1 How It Works

1. Take two images $x$ and $y$.
2. Pass both through a **pretrained network** (e.g., VGG).
3. Extract activations from multiple layers (feature maps).
4. Normalize features and compute **L2 distance** per spatial location.
5. Average distances across spatial positions and layers.
6. Optionally, train small linear weights to better align with human judgments.

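$$
LPIPS(x,y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \big\| w_l \odot \big( \hat{f}_l(x)_{h,w} - \hat{f}_l(y)_{h,w} \big) \big\|_2^2
$$

* $f_l$: feature map from layer $l$, with spatial size $H_l \times W_l$.
* $\hat{f}_l$: $f_l$ unit-normalized along the channel dimension at each spatial location.
* $w_l$: learned per-channel weight vector for layer $l$ (applied element-wise, $\odot$).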
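
To make the aggregation concrete, here is a sketch of the distance computation on already-extracted feature maps. It is not the LPIPS library: the feature maps below are random arrays standing in for CNN activations, the weights are set to ones rather than learned, and `lpips_like` is a hypothetical name. It assumes per-channel weights applied to the normalized feature difference.

```python
import numpy as np

def unit_normalize(f, eps=1e-10):
    """Normalize a (C, H, W) feature map to unit length along channels."""
    norm = np.sqrt((f**2).sum(axis=0, keepdims=True))
    return f / (norm + eps)

def lpips_like(feats_x, feats_y, weights):
    """Sum over layers of the spatially averaged weighted squared difference.

    feats_x, feats_y: lists of (C_l, H_l, W_l) arrays, one per layer.
    weights: list of (C_l,) per-channel weight vectors (assumed non-negative).
    """
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        diff = unit_normalize(fx) - unit_normalize(fy)
        per_location = ((w[:, None, None] * diff) ** 2).sum(axis=0)  # (H_l, W_l)
        total += per_location.mean()
    return float(total)

# Stand-in "activations" for two layers (random, for illustration only)
rng = np.random.default_rng(0)
shapes = [(4, 8, 8), (6, 4, 4)]
feats_a = [rng.random(s) for s in shapes]
feats_b = [rng.random(s) for s in shapes]
weights = [np.ones(s[0]) for s in shapes]

d_same = lpips_like(feats_a, feats_a, weights)
d_diff = lpips_like(feats_a, feats_b, weights)
print(d_same, d_diff)
```

With real features from a pretrained network and learned weights, lower values indicate images that are more similar in feature space.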

---

### 1.5.2 Why LPIPS is Important

* **MSE/PSNR**: pixel-wise, not perceptual.
* **SSIM**: structural but still hand-crafted.
* **LPIPS**: learned perceptual similarity, matches human perception much better.

In practice, LPIPS is considered **state-of-the-art** for evaluating perceptual image quality (GANs, super-resolution, style transfer, inpainting).

---

**LPIPS in Deep Learning Workflows**

* Used as an **evaluation metric** for generative models (GANs, diffusion, etc.).
* Sometimes used as a **loss function** (LPIPS loss) for training perceptual similarity.
* Often combined with pixel losses (L1/L2) or SSIM.


---

### 1.5.3 Comparison: SSIM vs LPIPS

| Metric       | Based on                           | Pros                                                         | Cons                                             |
| ------------ | ---------------------------------- | ------------------------------------------------------------ | ------------------------------------------------ |
| **MSE/PSNR** | Pixel differences                  | Simple, fast                                                 | Not perceptual, sensitive to shifts              |
| **SSIM**     | Luminance, contrast, structure     | Better perceptual alignment                                  | Hand-crafted, less robust to complex distortions |
| **LPIPS**    | Deep features (VGG, AlexNet, etc.) | Best matches human perception, widely used in GAN evaluation | Heavier, requires pretrained nets                |

---

## 1.6 Smoothness Loss



### 1.6.1 What it is (and why)

Photometric+SSIM losses make the network match views, but monocular depth has many plausible solutions (esp. in textureless regions). **Smoothness loss** is a regularizer that encourages **piecewise-smooth disparity/depth** while **preserving edges** aligned with image gradients.

### 1.6.2 Common formulations

Let $I\in\mathbb{R}^{H\times W\times 3}$ be the target image, and $d$ the predicted **disparity** (often smoother than raw depth; $d=1/z$). Finite differences:
$\partial_x f_{i,j}=f_{i,j}-f_{i,j-1}$, $\partial_y f_{i,j}=f_{i,j}-f_{i-1,j}$.

### 1.6.3 First-order, edge-aware (most used)

$$
\mathcal{L}_{\text{sm}}^{(1)}=
\frac{1}{HW}\sum_{i,j}\Big(
\left|\partial_x d_{i,j}\right|\,e^{-\alpha\,\|\partial_x I_{i,j}\|}
+
\left|\partial_y d_{i,j}\right|\,e^{-\alpha\,\|\partial_y I_{i,j}\|}
\Big)
$$

* The exponential **down-weights** the penalty at strong image edges so you **don’t over-smooth boundaries**.
* Typical $\alpha\in[5,10]$. Use grayscale $I$ or per-channel gradient norm.
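
A minimal NumPy sketch of the first-order term, assuming an `(H, W)` disparity map and an `(H, W, 3)` image in `[0, 1]`. The function name and the choice to average absolute gradients over channels are illustrative, and the mean is taken over the available finite differences rather than exactly $HW$ terms. NumPy has no autograd, so nothing needs detaching here; in a training framework the image-gradient weights would be treated as constants.

```python
import numpy as np

def edge_aware_smoothness(disp, image, alpha=10.0, normalize=True, eps=1e-7):
    """First-order edge-aware smoothness on a disparity map."""
    if normalize:
        disp = disp / (disp.mean() + eps)  # mean-normalized disparity

    dx_d = np.abs(disp[:, 1:] - disp[:, :-1])
    dy_d = np.abs(disp[1:, :] - disp[:-1, :])

    # Image gradient magnitude, averaged over color channels
    dx_i = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    dy_i = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)

    # Strong image edges shrink the weight on the disparity gradient
    return float((dx_d * np.exp(-alpha * dx_i)).mean()
                 + (dy_d * np.exp(-alpha * dy_i)).mean())

H = W = 8
disp_step = np.ones((H, W))
disp_step[:, 4:] = 2.0            # disparity jump between columns 3 and 4

img_flat = np.full((H, W, 3), 0.5)
img_edge = np.zeros((H, W, 3))
img_edge[:, 4:] = 1.0             # image edge at the same location

loss_flat = edge_aware_smoothness(disp_step, img_flat)
loss_edge = edge_aware_smoothness(disp_step, img_edge)
loss_const = edge_aware_smoothness(np.ones((H, W)), img_flat)
print(loss_flat, loss_edge, loss_const)
```

The same disparity jump is penalized heavily when the image is flat and almost not at all when it coincides with an image edge.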

### 1.6.4 Second-order (curvature) variant

$$
\mathcal{L}_{\text{sm}}^{(2)}=
\frac{1}{HW}\sum_{i,j}\Big(
\left|\partial_{xx} d_{i,j}\right| e^{-\alpha \|\partial_x I_{i,j}\|}
+
\left|\partial_{yy} d_{i,j}\right| e^{-\alpha \|\partial_y I_{i,j}\|}
\Big)
$$

* Penalizes **changes of slope**; good for avoiding “staircasing”.

### 1.6.5 Robust penalty / Charbonnier

Replace $|x|$ with $\rho(x)=\sqrt{x^2+\epsilon^2}$ (e.g., $\epsilon=10^{-3}$) for stability.

### 1.6.6 Scale-invariant normalization

Disparity amplitude can drift. Common tricks:

* Use disparity $d$ instead of depth $z$.
* Or divide by mean disparity per image: $\tilde d = d / (\bar d + \varepsilon)$ before taking gradients.

### 1.6.7 Multi-scale

Compute $\mathcal{L}_{\text{sm}}$ at pyramid levels $s=0..S-1$ (coarsest $\to$ finest). Weight by $w_s$ (e.g., $w_s=1/2^s$) and sum.

### 1.6.8 How it plugs into monocular VO (with a ViT)

Even if your depth/pose networks are **ViT-based**, the smoothness term is unchanged—just compute it on the **full-resolution** disparity map (after your ViT’s upsampling head).

**Total loss** (typical self-supervised monocular pipeline):

$$
\mathcal{L} = 
\lambda_{\text{photo}} \, \mathcal{L}_{\text{photo}}
+ \lambda_{\text{ssim}} \, \mathcal{L}_{\text{ssim}}
+ \lambda_{\text{sm}} \, \mathcal{L}_{\text{sm}}
\,(+ \text{other terms: automask, occlusion, geometry})
$$

Reasonable starting weights (tune per dataset):

* $\lambda_{\text{photo}}=1.0$
* $\lambda_{\text{ssim}}=0.15$ (if photo is L1)
* $\lambda_{\text{sm}} \in [0.001, 0.1]$ (start small; increase if depth is noisy)

**ViT-specific tips**

* Upsample tokens to image space (conv+pixelshuffle or interpolation) **before** smoothness.
* If you see block boundaries, add a tiny second-order term or anti-blocking conv in the upsampling head.
* Detach image gradients (no backprop through $I$).


> Notes

* Use a **grayscale** image, or compute the gradient magnitude per channel and average across channels.
* For a Charbonnier penalty, replace the absolute value with $\sqrt{x^2+\epsilon^2}$ (in PyTorch: `torch.sqrt(x*x + eps*eps)` instead of `.abs()`).

### 1.6.9 Practical tuning & pitfalls

* **Start small** $\lambda_{\text{sm}}$: too large $\Rightarrow$ over-smoothed, “melted” geometry; too small $\Rightarrow$ noisy depth.
* Use **disparity** rather than depth; it naturally stabilizes scale.
* **Detach** image gradients so the edge weights act as constants and no gradient flows back into $I$.
* Consider **second-order** term if you see “staircase” artifacts.
* Compute on **multiple scales**; strongest impact at coarse scales.
* Dynamic objects/occlusions: combine with **auto-masking / per-pixel min reprojection** to avoid penalizing impossible warps.
* For ViT heads, ensure good **anti-aliasing upsampling**; otherwise smoothness fights token blocking.

### 1.6.10 Minimal recipe (drop-in)

1. Predict multi-scale disparities $d^{(s)}$ from your ViT depth head.
2. For each scale, compute $\mathcal{L}_{\text{sm}}^{(1)}$ with $\alpha=10$, normalize by mean disparity, weight by $w_s=1/2^s$.
3. Set $\lambda_{\text{sm}}=0.01$ as a starting point; tune against validation photometric error and scale drift.
4. Keep your usual $\mathcal{L}_{\text{photo}} + \mathcal{L}_{\text{SSIM}}$ and occlusion handling.



##  Evaluation Metrics

| Metric  | Tool      | Meaning                                  |
| ------- | --------- | ---------------------------------------- |
| **ATE** | `evo_ape` | Absolute Trajectory Error — global drift |
| **RPE** | `evo_rpe` | Relative Pose Error — local accuracy     |
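
To show what the two metrics measure, here is a simplified NumPy sketch. It is not how `evo` computes them: trajectories are assumed to be already expressed in the same frame (no alignment step), only the translation part is evaluated, and the function names are illustrative.

```python
import numpy as np

def make_traj(positions):
    """Stack of 4x4 poses with identity rotation at the given positions."""
    T = np.tile(np.eye(4), (len(positions), 1, 1))
    T[:, :3, 3] = positions
    return T

def ate_rmse(T_gt, T_est):
    """RMSE of absolute position differences (trajectories assumed aligned)."""
    err = T_gt[:, :3, 3] - T_est[:, :3, 3]
    return float(np.sqrt((err**2).sum(axis=1).mean()))

def rpe_translation_rmse(T_gt, T_est, delta=1):
    """RMSE of the translation part of the relative-pose error over `delta` frames."""
    errs = []
    for i in range(len(T_gt) - delta):
        rel_gt = np.linalg.inv(T_gt[i]) @ T_gt[i + delta]
        rel_est = np.linalg.inv(T_est[i]) @ T_est[i + delta]
        E = np.linalg.inv(rel_gt) @ rel_est
        errs.append(np.linalg.norm(E[:3, 3]))
    return float(np.sqrt(np.mean(np.square(errs))))

# Estimate is offset by 0.1 m sideways but has perfect frame-to-frame motion
gt = make_traj(np.array([[float(i), 0.0, 0.0] for i in range(5)]))
est = make_traj(np.array([[float(i), 0.1, 0.0] for i in range(5)]))

ate = ate_rmse(gt, est)
rpe = rpe_translation_rmse(gt, est)
print(ate, rpe)
```

The constant offset shows up in ATE but not in RPE, which is why the two are reported together.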


## Popular Monocular Depth Datasets

| Dataset          | Description                        | Notes                              |
| ---------------- | ---------------------------------- | ---------------------------------- |
| **NYU Depth V2** | Indoor scenes, Kinect RGB-D images | ✔️ Standard for indoor depth       |
| **KITTI**        | Outdoor driving scenes (LiDAR)     | ✔️ Standard for autonomous driving |
| **Make3D**       | Outdoor stills (Stanford)          | Older, smaller                     |
| **DIML/CVL**     | Outdoor depth from stereo          | Large and high-resolution          |
| **TUM RGB-D**    | Indoor SLAM dataset                | ✔️ Camera + depth                  |