# Overview of Approaches and Deep Learning Architecture For Visual Odometry

###  Deep Learning VO: main families

**A. Pose regression (end-to-end)**

* **Idea:** CNN (or CNN+RNN) regresses $(\mathbf{R},\mathbf{t}) \in SE(3)$ directly from stacked frames.
* **Architectures:** PoseNet-style CNNs → ConvLSTM/GRU for temporal context.
* **Losses:** supervised $\ell_1/\ell_2$ on translation, geodesic loss on rotation; or **self-supervised photometric** (see "Core training losses" below).
* **Pros:** simple inference; can be fast.
* **Cons:** scale drift (monocular), weak geometry priors, generalization risk.

**B. Depth+Pose joint learning (self-supervised SfM-style)**

* **Idea:** one net predicts **depth** $D_t$; another predicts **relative pose** $T_{t\rightarrow s}$. Reproject source $\mathbf{I}_s$ into target with $D_t$ and $T$; train by minimizing **photometric/SSIM** reconstruction.
* **Architectures:** U-Net depth backbones; small PoseNet; sometimes **cost volumes** (stereo) or **transformers**.
* **Extras:** auto-masking for non-rigid pixels, **explainability masks**, multi-scale supervision, **edge-aware smoothness** on depth.
* **Pros:** no GT poses needed; geometry-aware; scales well with data.
* **Cons:** moving objects/occlusions need handling; absolute scale ambiguous (mono).

**C. Geometry-aware networks (differentiable optimization inside)**

* **Idea:** embed **PnP/BA/ICP** as differentiable layers (Gauss-Newton blocks, differentiable bundle adjustment, learned Jacobians/weights).
* **Representative examples:** DeepV2D-like depth-pose iterative refinement, BA-Net-style layers, **DROID-SLAM-like** dense matching + iterative pose/structure updates.
* **Pros:** better inductive bias; stronger generalization; better consistency.
* **Cons:** more complex; heavier training; careful stability engineering.

**D. Flow- or correspondence-driven VO**

* **Idea:** learn dense optical flow or correspondences; then recover pose via differentiable epipolar geometry, or train end-to-end.
* **Pros:** good on dynamic scenes with robust matchers; integrates with cost volumes/transformers.
* **Cons:** scale ambiguity (mono); need rigidity masks or scene flow.


---

### Core training losses (self-supervised mono/stereo)

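Let $I_t$ be the target image and $I_s$ a source image; the networks predict depth $D_t$ and relative pose $T_{t\rightarrow s}$. For each pixel $p$ in $I_t$, with homogeneous coordinates $\tilde{p}$ and camera intrinsics $K$: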

1. **Back-project:** $\mathbf{X} = D_t(p)\,K^{-1}\tilde{p}$
2. **Transform:** $\mathbf{X}_s = T_{t\rightarrow s}\,\mathbf{X}$
3. **Project:** $p' \sim K\,\mathbf{X}_s$ → sample $\hat{I}_t(p) = I_s(p')$
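
The three steps above can be sketched for a single pixel as follows. This is a minimal illustration, not a reference implementation: the function name `reproject_pixel`, the intrinsics tuple `(fx, fy, cx, cy)`, and the example values are assumptions, and the bilinear sampling of $I_s$ at $p'$ is left out.

```python
def reproject_pixel(u, v, depth, K, R, t):
    """Map pixel (u, v) of the target image to its location in the source image.

    K = (fx, fy, cx, cy) are pinhole intrinsics; (R, t) is T_{t->s}, with R a
    3x3 nested list and t a 3-vector. All names here are illustrative.
    """
    fx, fy, cx, cy = K
    # 1. Back-project: X = D_t(p) * K^{-1} * p~
    X = [depth * (u - cx) / fx, depth * (v - cy) / fy, depth]
    # 2. Transform into the source camera frame: X_s = R X + t
    Xs = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xs[2] <= 0:
        return None  # point lies behind the source camera
    # 3. Project: p' ~ K X_s
    return (fx * Xs[0] / Xs[2] + cx, fy * Xs[1] / Xs[2] + cy)


K = (500.0, 500.0, 320.0, 240.0)  # assumed intrinsics
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# The point at the principal ray, 2 m away, shifts 0.1 m left in the source frame.
print(reproject_pixel(320.0, 240.0, 2.0, K, I3, [-0.1, 0.0, 0.0]))
```

In a real pipeline the same computation runs densely over all pixels, and $\hat{I}_t$ is obtained by differentiable bilinear sampling at the projected coordinates.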

**Photometric loss:**
$\mathcal{L}_{pho} = \alpha \frac{1 - \mathrm{SSIM}(I_t,\hat{I}_t)}{2} + (1-\alpha)\|I_t-\hat{I}_t\|_1$
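
As a rough sketch of this loss on one image patch: the code below computes SSIM from global patch statistics (real implementations use sliding local windows) and assumes intensities in $[0, 1]$. The value `alpha=0.85` and the SSIM constants are commonly used defaults, taken here as assumptions.

```python
def ssim_patch(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM of two equally sized patches (flat lists, intensities in [0, 1])."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return num / den


def photometric_loss(target, warped, alpha=0.85):
    """alpha * (1 - SSIM) / 2 + (1 - alpha) * mean |I_t - I_hat_t| for one patch."""
    l1 = sum(abs(a - b) for a, b in zip(target, warped)) / len(target)
    return alpha * (1.0 - ssim_patch(target, warped)) / 2.0 + (1.0 - alpha) * l1


patch = [0.1, 0.4, 0.7, 0.9]
print(photometric_loss(patch, patch))                 # near zero: perfect reconstruction
print(photometric_loss(patch, [0.9, 0.7, 0.4, 0.1]))  # large: structure is reversed
```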

**Depth smoothness (edge-aware):**
$\mathcal{L}_{sm} = \sum_{p} \left( |\partial_x D_t(p)|\, e^{-|\partial_x I_t(p)|} + |\partial_y D_t(p)|\, e^{-|\partial_y I_t(p)|} \right)$
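
A small sketch of the edge-aware term, using forward differences on nested lists. The function name and the toy inputs are illustrative assumptions; practical implementations operate on tensors and often apply the term to normalized inverse depth.

```python
import math


def edge_aware_smoothness(depth, image):
    """Sum of |dD| * exp(-|dI|) over x and y forward differences.

    depth and image are equally sized 2D lists (image is grayscale).
    """
    h, w = len(depth), len(depth[0])
    loss = 0.0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:  # horizontal gradient
                d_grad = abs(depth[r][c + 1] - depth[r][c])
                i_grad = abs(image[r][c + 1] - image[r][c])
                loss += d_grad * math.exp(-i_grad)
            if r + 1 < h:  # vertical gradient
                d_grad = abs(depth[r + 1][c] - depth[r][c])
                i_grad = abs(image[r + 1][c] - image[r][c])
                loss += d_grad * math.exp(-i_grad)
    return loss


depth = [[1.0, 1.0, 3.0, 3.0]]  # depth step between columns 1 and 2
flat = [[0.5, 0.5, 0.5, 0.5]]   # no image edge: the step is fully penalized
edged = [[0.0, 0.0, 1.0, 1.0]]  # image edge at the same place: reduced penalty
print(edge_aware_smoothness(depth, flat), edge_aware_smoothness(depth, edged))
```

The depth discontinuity costs less when it coincides with an image edge, which is the behavior the exponential weight is meant to produce.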

**Geometry consistency:**

* Epipolar loss: $\tilde{p}'^\top F \tilde{p} \approx 0$ on learned correspondences.
* Per-pixel minimum reprojection error over multiple source frames (and optional forward-backward cycle checks) to handle occlusions.
* **Scale constraints:** stereo baseline, IMU priors, or learned scale head.
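
To illustrate the minimum-reprojection idea from the list above: given per-pixel photometric errors computed against several source frames, keep the smallest one per pixel, so a pixel occluded in one source can still be explained by another. The error values below are made up for illustration.

```python
def min_reprojection(errors_per_source):
    """Per-pixel minimum over sources; each entry is a flat list of photometric errors."""
    return [min(pixel_errors) for pixel_errors in zip(*errors_per_source)]


# Hypothetical errors for 4 pixels, warped from the previous and the next frame.
err_prev = [0.02, 0.60, 0.03, 0.05]  # pixel 1 is occluded in the previous frame
err_next = [0.03, 0.04, 0.50, 0.05]  # pixel 2 is occluded in the next frame
per_pixel = min_reprojection([err_prev, err_next])
loss = sum(per_pixel) / len(per_pixel)
print(per_pixel, loss)
```

Averaging the two sources instead would let the occluded pixels dominate the loss.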

**Rotation loss (geodesic) for supervised/regularized pose:**
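$\ell_R(R,\hat{R}) = \|\log(R^\top\hat{R})^\vee\|_2$, where $(\cdot)^\vee$ maps the skew-symmetric matrix logarithm to its rotation vector, so the norm equals the relative rotation angle.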

---

###  Popular architectural building blocks

* **Backbones:** ResNet/EfficientNet/MobileNet, or ViT/ConvNeXt-style hybrids.
* **Temporal:** ConvLSTM/GRU; 1D temporal convs; attention over clips.
* **Transformers:** for long-range associations, global cost volumes, memory (e.g., recurrent matching + pose/BA heads).
* **Cost volumes:** stereo/monocular multi-hypothesis depth refinement.
* **Differentiable solvers:** Gauss-Newton layers, differentiable PnP/ICP, learned robust weights (M-estimation).
* **Uncertainty heads:** aleatoric/epistemic to weight residuals and poses.

---

###  Practical training & engineering tips

* **Data curation:** varied motion/illumination; ensure exposure consistency or use brightness augmentation & learned photometric invariance.
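* **Non-rigidity handling:** auto-mask pixels that break the static-scene assumption (moving objects, or pixels that appear stationary because they move with the camera); min-reprojection over multiple sources; per-pixel uncertainty weighting.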
* **Scale:** prefer stereo or occasional depth supervision/IMU to anchor scale; else learn a scale head or post-scale with a known height/velocity prior.
* **Drift control:** keyframes + temporal windows; small **differentiable BA** blocks every N frames; pose-graph fine-tuning at segment ends.
* **Initialization:** identity or gyro-seeded initial pose; pyramids and coarse-to-fine warping reduce bad local minima.
* **Numerics:** SE(3) parametrization via **Lie algebra** $\boldsymbol{\xi} \in \mathbb{R}^6$; compose poses with $\exp(\cdot)$ / $\log(\cdot)$; use geodesic rotation losses.
* **Speed:** share encoders for depth/pose; mixed precision; tile-based warping; keep cost volumes shallow.

---

###  1.1.2 Quaternion Loss (Naïve Euclidean Loss)

The simplest approach is to minimize the **L2 distance between quaternions**:

$$
\mathcal{L}_\text{quat} = \| q_\text{pred} - q_\text{gt} \|_2
$$

But there are **two problems**:

1. **Double cover**: $q$ and $-q$ represent the same rotation.
   → If the network predicts $-q_\text{gt}$, the loss will be **large**, even though the rotation is exactly correct.
2. **Euclidean mismatch**: The Euclidean distance between quaternions does not exactly correspond to the **geodesic distance** (shortest path on SO(3)).

**Fix for double cover:**

Take the shorter path:

$$
\mathcal{L}_\text{quat} = 1 - |\langle q_\text{pred}, q_\text{gt} \rangle|
$$

where $\langle \cdot, \cdot \rangle$ is the quaternion dot product.
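For unit quaternions, $|\langle q_\text{pred}, q_\text{gt} \rangle| = |\cos(\theta/2)|$, where $\theta$ is the angle of the relative rotation, so the loss equals $1 - |\cos(\theta/2)|$. It is zero for identical rotations and grows like $\theta^2/8$ for small errors.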
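
A short sketch contrasting the naive L2 loss with the sign-invariant dot-product loss on the pair $q$ and $-q$. It assumes unit quaternions in $(w, x, y, z)$ order; the function names are illustrative.

```python
import math


def quat_l2(q1, q2):
    """Naive Euclidean distance between two quaternions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)))


def quat_dot_loss(q1, q2):
    """1 - |<q1, q2>| for unit quaternions; zero whenever q2 = +/- q1."""
    return 1.0 - abs(sum(a * b for a, b in zip(q1, q2)))


s = math.sqrt(0.5)
q = [s, 0.0, 0.0, s]     # 90 degrees about z, (w, x, y, z) order
neg_q = [-c for c in q]  # the same rotation with the opposite sign

print(quat_l2(q, neg_q))        # close to 2: the naive loss is at its maximum
print(quat_dot_loss(q, neg_q))  # close to 0: the sign ambiguity is removed
```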

---


### 1.1.3 Geodesic Loss (Rotation-Angle Loss)

The **geodesic distance** between two rotations $R_1, R_2 \in SO(3)$ is the smallest rotation angle that aligns them.
If $R = R_1^\top R_2$, then:

$$
\theta = \cos^{-1}\!\left(\frac{\text{trace}(R) - 1}{2}\right)
$$



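Sanity check: if $R_1 = R_2$, then $R = R_1^\top R_2 = I$ because rotation matrices are orthogonal ($R_1^\top R_1 = I$). Hence $\text{trace}(R) = 3$, so $\frac{\text{trace}(R) - 1}{2} = 1$ and $\theta = \cos^{-1}(1) = 0$. As the two rotations move apart, the trace decreases and $\theta$ grows, reaching $\pi$ when $\text{trace}(R) = -1$.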


This is the true **shortest path distance on SO(3)** (a proper Riemannian metric).
Loss is typically defined as:

$$
\mathcal{L}_\text{geo} = \theta = \cos^{-1}\!\left(\frac{\text{trace}(R_1^\top R_2) - 1}{2}\right)
$$

If using quaternions, you can avoid matrices:

$$
\mathcal{L}_\text{geo} = 2 \cos^{-1} \!\big( |\langle q_\text{pred}, q_\text{gt} \rangle| \big)
$$

where again we take the absolute value to fix the double-cover issue.
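
Both forms can be checked against each other with a few lines of plain Python. This is a sketch under stated assumptions: rotations about the z-axis only, quaternions in $(w, x, y, z)$ order, and hypothetical helper names. The clamp guards `acos` against round-off pushing its argument slightly outside $[-1, 1]$; in an autodiff framework one would typically clamp slightly inside that interval, since the gradient of `acos` is unbounded at $\pm 1$.

```python
import math


def rot_z(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def quat_z(deg):
    half = math.radians(deg) / 2.0
    return [math.cos(half), 0.0, 0.0, math.sin(half)]  # (w, x, y, z)


def geodesic_from_matrices(R1, R2):
    # trace(R1^T R2) equals the sum of elementwise products of R1 and R2.
    tr = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp against round-off
    return math.acos(cos_theta)


def geodesic_from_quats(q1, q2):
    dot = min(1.0, abs(sum(a * b for a, b in zip(q1, q2))))  # |.| handles q vs -q
    return 2.0 * math.acos(dot)


angle_m = geodesic_from_matrices(rot_z(90.0), rot_z(60.0))
angle_q = geodesic_from_quats(quat_z(90.0), quat_z(60.0))
print(math.degrees(angle_m), math.degrees(angle_q))  # both close to 30 degrees
```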

---


### 1.1.4 Geodesic Loss Numerical Example

**Ground truth rotation**: 90° around z-axis

  $$
  q_\text{gt} = \left[\cos(45°), 0, 0, \sin(45°)\right] = [0.7071, 0, 0, 0.7071]
  $$

**Predicted rotation**: 60° around z-axis

  $$
  q_\text{pred} = \left[\cos(30°), 0, 0, \sin(30°)\right] = [0.8660, 0, 0, 0.5]
  $$

Both are already normalized.

---

**Compute Dot Product**

$$
\langle q_\text{pred}, q_\text{gt} \rangle
= (0.8660)(0.7071) + (0)(0) + (0)(0) + (0.5)(0.7071)
$$

$$
= 0.6124 + 0.3536 = 0.9660
$$

Take absolute value (to handle sign ambiguity):

$$
|\langle q_\text{pred}, q_\text{gt} \rangle| = 0.9660
$$

---

**Compute Geodesic Loss (Angle)**

Geodesic loss (in radians):

$$
\mathcal{L}_\text{geo} = 2 \cos^{-1}(0.9660)
$$

Compute step-by-step:

* $\cos^{-1}(0.9660) \approx 0.2618 \, \text{rad}$
* Multiply by 2 → $\mathcal{L}_\text{geo} \approx 0.5236 \, \text{rad}$

Convert to degrees:

$$
0.5236 \, \text{rad} \times \frac{180^\circ}{\pi} \approx 30^\circ
$$

---



###  1.1.5 Practical Advice

* **If you only care about small orientation errors** (e.g., fine-tuning a network near correct pose):
  Quaternion dot-product loss or L2 loss is usually fine (cheaper, smooth gradients).

* **If you care about accurate global orientation** (e.g., SLAM, pose estimation, camera relocalization):
  **Geodesic loss is strongly preferred** because it reflects the real physical difference between two orientations.

## 1.2 Full Transformation Loss $SE(3)$


To make an **SE(3) loss** you typically combine:

1. a **rotation term** that measures distance on SO(3) (the geodesic loss from §1.1.3), and
2. a **translation term** that measures distance in $\mathbb{R}^3$.

There are two common (and solid) ways to do this.

---

### 1.2.1 Option A — Simple & Effective (weighted sum)

Use the geodesic **rotation angle** (in radians) plus a norm on translation:

$$
\mathcal{L}_{\text{SE(3)}} \;=\; \lambda_R \,\underbrace{\big(2\arccos(|\langle q_{\text{pred}}, q_{\text{gt}}\rangle|)\big)}_{\text{SO(3) geodesic angle}}
\;+\; \lambda_t \,\underbrace{\|\,t_{\text{pred}}-t_{\text{gt}}\,\|_2}_{\text{meters}}
$$

* $\lambda_R$ and $\lambda_t$ balance **units** (radians vs meters).
  Rules of thumb:

  * If typical translation errors are \~0.05–0.2 m and angular errors are \~2–10°, try $\lambda_R\in[0.5,2.0]$, $\lambda_t\in[1,10]$.
  * Tune so both terms contribute similar magnitude early in training.
* Often use **robust norms** (e.g., Huber) on translation.

This is the **go-to baseline**: simple, stable, and strong in practice.
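
A minimal sketch of Option A with a Huber penalty on the translation error. The weights `lam_r=1.0` and `lam_t=5.0` sit inside the rule-of-thumb ranges above, and the Huber threshold `delta=0.1` is an arbitrary assumption; all three would need tuning on real data.

```python
import math


def huber(r, delta=0.1):
    """Huber penalty on a non-negative residual r (delta in the same unit as r)."""
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def se3_weighted_loss(q_pred, q_gt, t_pred, t_gt, lam_r=1.0, lam_t=5.0):
    """Return (loss, angle in radians, translation error in meters)."""
    dot = min(1.0, abs(sum(a * b for a, b in zip(q_pred, q_gt))))
    angle = 2.0 * math.acos(dot)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(t_pred, t_gt)))
    return lam_r * angle + lam_t * huber(dist), angle, dist


def quat_z(deg):
    half = math.radians(deg) / 2.0
    return [math.cos(half), 0.0, 0.0, math.sin(half)]  # (w, x, y, z)


q_gt, q_pred = quat_z(90.0), quat_z(60.0)
loss, angle, dist = se3_weighted_loss(q_pred, q_gt, [1.2, 0.1, 0.0], [1.0, 0.0, 0.0])
print(loss, math.degrees(angle), dist)
```

Returning the angle and distance alongside the loss makes it easy to log the two terms separately and check that neither dominates.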

### 1.2.2 Option B — True SE(3) geodesic via Lie Log (advanced)

Compute the **relative transform** $\Delta T = T_{\text{gt}}^{-1}T_{\text{pred}}$, take the **matrix logarithm** to get a 6-vector $\xi = [\omega, v]\in \mathbb{R}^6$ (rotation/translation in the tangent space), then penalize it:

$$
\xi \;=\; \log(\Delta T) \;=\; 
\begin{bmatrix} \omega \\ v \end{bmatrix},\quad
\mathcal{L} \;=\; \|\; W\,\xi \;\|_2
\quad\text{or}\quad
\mathcal{L} \;=\; \|W_\omega \omega\|_2 + \|W_v v\|_2.
$$

* Here $W$ (or $W_\omega, W_v$) sets the relative weighting/units.
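* With $\Delta T = T_{\text{gt}}^{-1}T_{\text{pred}}$ the error is unchanged when both poses are left-multiplied by the same transform (a change of world frame); defining $\Delta T = T_{\text{pred}}T_{\text{gt}}^{-1}$ instead gives invariance to right-multiplication. SE(3) has no bi-invariant metric, so pick one convention and use it consistently. Either way, rotation and translation errors live in the same tangent space.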
* Slightly more math and careful numerics (small-angle handling).


This version gives you a **true SE(3) tangent-space error**. Use if you want strict group-theoretic consistency (e.g., in pose-graph optimization or when composition/invariance properties matter).

---

### 1.2.3 Numerical Example SE(3) Option-B


We’ll compute
$\Delta T = T_{\text{gt}}^{-1}T_{\text{pred}}$, then $\xi=\log(\Delta T)=[\omega,\,v]\in\mathbb{R}^6$, and a loss $\|\omega\|_2+\|v\|_2$.

---

**Poses**

* Ground truth: rotation **+90° about z**, translation $t_\text{gt}=[1,\,0,\,0]$
* Prediction: rotation **+60° about z**, translation $t_\text{pred}=[1.2,\,0.1,\,0]$

Rotation matrices:

$$
R_z(\phi)=\begin{bmatrix}\cos\phi & -\sin\phi & 0\\ \sin\phi & \cos\phi & 0\\ 0&0&1\end{bmatrix}
$$

$$
R_\text{gt}=R_z(90^\circ),\quad
R_\text{pred}=R_z(60^\circ)
$$

---

#### Relative transform $\Delta T$

$$
R_{\text{rel}} = R_\text{gt}^\top R_\text{pred}
= \begin{bmatrix}
0.8660254 & 0.5 & 0\\
-0.5 & 0.8660254 & 0\\
0 & 0 & 1
\end{bmatrix}
\quad(\text{a }-30^\circ\text{ rotation about }z)
$$

$$
t_{\text{rel}} = R_\text{gt}^\top (t_\text{pred}-t_\text{gt})
= \begin{bmatrix}0.1\\ -0.2\\ 0\end{bmatrix}
$$

---

#### $\log(\Delta T)\Rightarrow [\omega,\,v]$

**SO(3) log (rotation):**

* $\theta=\arccos\big((\mathrm{tr}(R_{\text{rel}})-1)/2\big)=\arccos(0.8660254)=\;0.5235988$ rad $=30^\circ$
* Axis $=\ -\hat z$, so

$$
\omega = \theta\cdot(-\hat z) = \begin{bmatrix}0\\0\\-0.5235988\end{bmatrix}
$$

**SE(3) translation log:**

$$
V = I + \frac{1-\cos\theta}{\theta^2}[\omega]_\times
      + \frac{\theta-\sin\theta}{\theta^3}[\omega]_\times^2,
\qquad v = V^{-1} t_{\text{rel}}
$$

Numerically (for $\theta=0.5236$ rad):

$$
V \approx
\begin{bmatrix}
0.95492966 & 0.25587263 & 0\\
-0.25587263 & 0.95492966 & 0\\
0 & 0 & 1
\end{bmatrix},
\qquad
v \approx
\begin{bmatrix}
0.1500647\\
-0.1692298\\
0
\end{bmatrix}
$$

So

$$
\xi=\log(\Delta T)=
\big[\,\omega;\,v\,\big]
=
\begin{bmatrix}
0\\
0\\
-0.5235988\\
0.1500647\\
-0.1692298\\
0
\end{bmatrix}.
$$

---

#### Example loss

With $\mathcal L = \|\omega\|_2 + \|v\|_2$:

* $\|\omega\|_2 = 0.5235988$ (30° in radians)
* $\|v\|_2 \approx \sqrt{0.1500647^2 + (-0.1692298)^2} \approx 0.2261817$

$$
\boxed{\mathcal L \approx 0.5236 + 0.2262 = 0.7498}
$$
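
The numbers in this example can be reproduced with a short pure-Python sketch. It assumes $\theta < \pi$, and instead of building $V$ and solving a linear system it uses the closed-form inverse $V^{-1} = I - \tfrac{1}{2}[\omega]_\times + c\,[\omega]_\times^2$ with $c = \frac{1}{\theta^2}\left(1 - \frac{\theta\sin\theta}{2(1-\cos\theta)}\right)$. The helper names and the small-angle threshold `eps` are illustrative choices.

```python
import math


def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(3)) for i in range(3)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(M):
    return [[M[j][i] for j in range(3)] for i in range(3)]


def hat(w):
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]


def rot_z(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def norm(x):
    return math.sqrt(sum(c * c for c in x))


def se3_log(R, t, eps=1e-3):
    """Return (omega, v) for rotation R and translation t; assumes theta < pi."""
    cos_theta = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    skew = [R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1]]
    if theta < eps:
        # Small angle: omega ~ 0.5 * vee(R - R^T), and c tends to 1/12.
        omega = [0.5 * s for s in skew]
        coeff = 1.0 / 12.0
    else:
        omega = [theta / (2.0 * math.sin(theta)) * s for s in skew]
        coeff = (1.0 - theta * math.sin(theta) / (2.0 * (1.0 - cos_theta))) / theta ** 2
    W = hat(omega)
    W2 = matmul(W, W)
    # Closed-form inverse of V: I - 0.5 [w]x + coeff [w]x^2
    V_inv = [[(1.0 if i == j else 0.0) - 0.5 * W[i][j] + coeff * W2[i][j]
              for j in range(3)] for i in range(3)]
    return omega, matvec(V_inv, t)


R_gt, R_pred = rot_z(90.0), rot_z(60.0)
t_gt, t_pred = [1.0, 0.0, 0.0], [1.2, 0.1, 0.0]

# Delta T = T_gt^{-1} T_pred
R_rel = matmul(transpose(R_gt), R_pred)
t_rel = matvec(transpose(R_gt), [p - g for p, g in zip(t_pred, t_gt)])

omega, v = se3_log(R_rel, t_rel)
loss = norm(omega) + norm(v)
print(omega, v, loss)
```

The printed values should agree with $\omega$, $v$, and the boxed loss above up to rounding.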

---


### 1.2.4 How to pick weights (very important)

* Units differ: **radians vs meters**. You must balance them.
* Three common strategies:

  1. **Manual tuning** (start with $\lambda_R=1, \lambda_t\in[1,10]$).
  2. **Normalize by dataset scale** (e.g., divide translation by scene extent).
  3. **Learned homoscedastic uncertainty** (Kendall & Cipolla):

     $$
     \mathcal{L} = \frac{1}{2\sigma_R^2} \, \mathcal{L}_R + \frac{1}{2\sigma_t^2}\, \mathcal{L}_t + \log \sigma_R + \log \sigma_t
     $$

     with $\log \sigma_R, \log \sigma_t$ as learnable scalars.

---

### 1.2.5 Quick recommendations

* Start with **Option A** (weighted sum, Huber on translation). It’s robust and easy to tune.
* If you need **group-consistent** errors (e.g., enforcing trajectory smoothness with relative poses), use **Option B** (Lie log).
* Always monitor **angle (deg)** and **translation (m)** separately as metrics, even if your loss is a combination.
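
For the learned-uncertainty weighting listed among the weight-selection strategies, a plain-Python sketch of the forward computation might look as follows. In training, `log_sigma_r` and `log_sigma_t` would be learnable parameters updated by the optimizer; here they are ordinary floats, and the function name is an assumption.

```python
import math


def uncertainty_weighted_loss(loss_r, loss_t, log_sigma_r, log_sigma_t):
    """L_R / (2 sigma_R^2) + L_t / (2 sigma_t^2) + log sigma_R + log sigma_t."""
    w_r = 0.5 * math.exp(-2.0 * log_sigma_r)  # 1 / (2 sigma_R^2)
    w_t = 0.5 * math.exp(-2.0 * log_sigma_t)  # 1 / (2 sigma_t^2)
    return w_r * loss_r + w_t * loss_t + log_sigma_r + log_sigma_t


# With sigma = 1 for both terms, the weights reduce to 1/2 each.
print(uncertainty_weighted_loss(0.5236, 0.2236, 0.0, 0.0))
# A larger sigma_R down-weights the rotation term but pays a log penalty.
print(uncertainty_weighted_loss(0.5236, 0.2236, math.log(2.0), 0.0))
```

Parameterizing by $\log\sigma$ rather than $\sigma$ keeps the weights positive without any explicit constraint.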




## Popular Monocular Depth Datasets

| Dataset          | Description                                | Notes                                     |
| ---------------- | ------------------------------------------ | ----------------------------------------- |
| **NYU Depth V2** | Indoor scenes, Kinect RGB-D images         | Standard benchmark for indoor depth       |
| **KITTI**        | Outdoor driving scenes, LiDAR ground truth | Standard benchmark for autonomous driving |
| **Make3D**       | Outdoor stills (Stanford)                  | Older, smaller                            |
| **DIML/CVL**     | Outdoor depth from stereo                  | Large and high-resolution                 |
| **TUM RGB-D**    | Indoor SLAM dataset                        | RGB + depth with ground-truth trajectories |


## DepthNet

DepthNet usually refers to a **deep learning model that predicts per-pixel depth from an image (or image pair)**. Several variants share the name, depending on the paper or implementation. The sections below cover what these models have in common.

---

### 1. **What DepthNet Does**

DepthNet takes as input:

* **Single image** (monocular depth estimation)
* or **stereo pair** (left & right image)
* or even **consecutive frames** (for self-supervised depth + ego-motion)

and outputs:

* A **dense depth map** – one depth value per pixel (usually inverse depth/disparity for numerical stability).

So, it is a CNN that learns to "understand" the scene geometry and infer how far each pixel is from the camera.

---

### 2. **Typical Architecture**

DepthNet usually follows an **encoder-decoder (U-Net-like) architecture**:

1. **Encoder (Feature Extractor):**

   * Often a backbone CNN like ResNet-18/34/50, EfficientNet, etc.
   * Extracts multi-scale features from the image.
   * Captures semantics and context (helps to know if a region is road, wall, sky).

2. **Decoder (Upsampling):**

   * Series of up-convolution (transpose convolution) or interpolation layers.
   * Skip connections from encoder layers help recover spatial details.
   * Produces a per-pixel prediction of depth or disparity.

3. **Output:**

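   * Final layer applies `sigmoid` to bound the output (typically disparity) to $(0, 1)$, which is then rescaled to a valid depth range; `ReLU` or softplus is an alternative that only enforces positivity.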
   * Sometimes outputs **multi-scale depth predictions** (coarse → fine).
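
One common way to turn a sigmoid output into metric-range depth is to treat it as a normalized disparity and invert it. The sketch below follows that idea; the depth limits `min_depth=0.1` and `max_depth=100.0` are assumed values, and the function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def disparity_to_depth(sig, min_depth=0.1, max_depth=100.0):
    """Map a sigmoid output in (0, 1) to a depth in (min_depth, max_depth)."""
    min_disp, max_disp = 1.0 / max_depth, 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * sig
    return 1.0 / scaled_disp


for logit in (-4.0, 0.0, 4.0):
    # Larger logits mean larger disparity, i.e. smaller depth.
    print(logit, disparity_to_depth(sigmoid(logit)))
```

Predicting disparity rather than depth spreads the network's output resolution toward nearby structure, where depth errors matter most for reprojection.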

---

### 3. **Training Approaches**

#### (A) **Supervised DepthNet**

* Trained with ground-truth depth maps (e.g., KITTI LiDAR scans).
* Loss function: L1/L2 loss between predicted depth and ground truth.

$$
\mathcal{L}_{depth} = \frac{1}{N} \sum_{i=1}^{N} \| D_{pred}(i) - D_{gt}(i) \|
$$
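
Because LiDAR ground truth is sparse, the sum is usually restricted to pixels that actually have a measurement. A minimal sketch, assuming that missing ground truth is encoded as `0.0` (a convention chosen here for illustration):

```python
def masked_l1(pred, gt):
    """Mean |pred - gt| over pixels with valid ground truth (gt > 0)."""
    diffs = [abs(p - g) for p, g in zip(pred, gt) if g > 0.0]
    return sum(diffs) / len(diffs) if diffs else 0.0


pred = [2.0, 5.0, 9.0]
gt = [1.0, 0.0, 10.0]  # the middle pixel has no LiDAR return
print(masked_l1(pred, gt))  # averages over the two valid pixels only
```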

---

#### (B) **Self-Supervised (Monocular) DepthNet**

If ground truth depth is not available, DepthNet is trained in a **self-supervised** way, using the images themselves as the training signal:

1. **DepthNet** predicts disparity (inverse depth).
2. **PoseNet** predicts camera motion between consecutive frames.
3. **View synthesis loss:** reconstruct one view from the other using predicted depth + pose via differentiable warping.

$$
\mathcal{L}_{photo} = \| I_{t} - \hat{I}_{t} \|
$$

Additional regularizers:

* **Smoothness loss** (encourages locally smooth depth)
* **Edge-aware weighting** of that smoothness term (relaxes the penalty at image edges, preserving object boundaries)

---

### 4. **Why DepthNet Works**

It learns geometric priors:

* Parallel lines converge at vanishing points → infers depth.
* Objects of known shape/size → learns perspective cues.
* Motion parallax (in self-supervised mode) → deduces scene structure.

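Because CNNs aggregate context over large receptive fields, DepthNet can estimate depth from a single image, a setting where traditional stereo matching does not apply. The learned priors are tied to the training data, so accuracy can degrade on scenes unlike those seen during training.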

---

### 5. **Example: Monodepth2 (Popular DepthNet)**

* Encoder: ResNet-18/50
* Decoder: U-Net-like
* Multi-scale disparity prediction
* Trained with photometric reprojection loss + smoothness loss

Result: **Real-time monocular depth estimation** with good generalization.

---
