## Self-Supervised Losses (always used)

These losses supervise both DepthNet and PoseNet when you train on monocular video **without ground truth**.

---

### (a) **Photometric Reprojection Loss** (a.k.a. View Synthesis Loss)

**Definition:**
$$
L_\text{photo} = \min_s \Big( \alpha \frac{1 - \text{SSIM}(I_t, I_s')}{2} + (1 - \alpha) \| I_t - I_s' \|_1 \Big)
$$

* $I_s'$: source image $I_s$ warped into the target frame using the predicted depth and pose.
* $\alpha = 0.85$ is the weighting used in Monodepth2.
* The **minimum reprojection** error is taken per pixel across the source frames, so a pixel that is occluded in one source frame can still be matched in another.
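
The per-pixel minimum can be sketched in plain Python as follows. This is a minimal illustration rather than Monodepth2's implementation: the per-pixel SSIM and L1 values are assumed to be precomputed, the numbers are made up, and the helper names `photometric_error` and `min_reprojection_loss` are hypothetical.

```python
def photometric_error(ssim, l1, alpha=0.85):
    """Per-pixel error: alpha * (1 - SSIM) / 2 + (1 - alpha) * L1."""
    return alpha * (1.0 - ssim) / 2.0 + (1.0 - alpha) * l1


def min_reprojection_loss(error_maps):
    """Average, over pixels, of the smallest error across source frames.

    error_maps: one flat list of per-pixel errors per source frame.
    """
    per_pixel_min = [min(errors) for errors in zip(*error_maps)]
    return sum(per_pixel_min) / len(per_pixel_min)


# Hypothetical (SSIM, L1) values for three pixels and two source frames.
# The middle pixel is occluded in the previous frame, so its error there is large.
prev_frame = [photometric_error(0.9, 0.05),
              photometric_error(0.2, 0.60),
              photometric_error(0.8, 0.10)]
next_frame = [photometric_error(0.9, 0.06),
              photometric_error(0.9, 0.04),
              photometric_error(0.7, 0.20)]

loss = min_reprojection_loss([prev_frame, next_frame])
```

Because the minimum is taken before averaging, the large error of the occluded pixel in `prev_frame` never reaches the loss; the well-matched value from `next_frame` is used instead.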


---



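**Intuition** (single source frame, written $I_r$ here; $\hat{I}_t$ plays the role of $I_s'$ above):

* You have a **target image** $I_t$ and a **reference image** $I_r$.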
* Using predicted **depth** $D_t$ and **camera motion** ($R, \mathbf{t}$), you warp $I_r$ into the target frame, producing a **reconstructed image** $\hat{I}_t$.
* The **photometric loss** compares $I_t$ and $\hat{I}_t$: if they are visually similar, the loss is small, which indicates that your depth and motion predictions are consistent with each other.

The simplest version is the **pixel-wise L1 loss**:

$$
\mathcal{L}_{\text{photo}} = \frac{1}{N} \sum_{i=1}^N \big| I_t(i) - \hat{I}_t(i) \big|
$$

where $N$ is the number of valid pixels.
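
To make "valid pixels" explicit, the same loss can be written with a binary validity mask. The symbol $M$ is introduced here only for illustration: $M(i) = 1$ when pixel $i$ projects inside the reference image and $0$ otherwise, the sum runs over all pixels, and $N = \sum_i M(i)$.

$$
\mathcal{L}_{\text{photo}} = \frac{\sum_i M(i) \, \big| I_t(i) - \hat{I}_t(i) \big|}{\sum_i M(i)}
$$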

---

### 1.3.1 Numerical Example (Pixel-wise Photometric Loss)

**Target image $I_t$:**

$$
I_t =
\begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.2 & 0.4 & 0.6 \\
0.3 & 0.5 & 0.7
\end{bmatrix}
$$

**Reconstructed image $\hat{I}_t$:**

$$
\hat{I}_t =
\begin{bmatrix}
0.0 & 0.1 & 0.4 \\
0.3 & 0.3 & 0.5 \\
0.4 & 0.4 & 0.8
\end{bmatrix}
$$

Pixel-wise absolute differences:

$$
|I_t - \hat{I}_t| =
\begin{bmatrix}
0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1
\end{bmatrix}
$$

Mean over all $N=9$ pixels:

$$
\mathcal{L}_{\text{photo}} = \frac{0.9}{9} = 0.1
$$

A lower value means the reconstruction is closer to the target.
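
The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch; the function name `l1_photometric_loss` and its optional `valid` mask argument are hypothetical and not part of any library.

```python
def l1_photometric_loss(target, recon, valid=None):
    """Mean absolute difference over valid pixels (all pixels if valid is None)."""
    total, count = 0.0, 0
    for r, (t_row, h_row) in enumerate(zip(target, recon)):
        for c, (t, h) in enumerate(zip(t_row, h_row)):
            if valid is None or valid[r][c]:
                total += abs(t - h)
                count += 1
    return total / count


I_t = [[0.1, 0.2, 0.3],
       [0.2, 0.4, 0.6],
       [0.3, 0.5, 0.7]]
I_hat = [[0.0, 0.1, 0.4],
         [0.3, 0.3, 0.5],
         [0.4, 0.4, 0.8]]

loss = l1_photometric_loss(I_t, I_hat)
print(round(loss, 6))  # 0.1
```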

---

### 1.3.2 Pose Representation (Rigid-Body Motion)

In unsupervised VO, the network predicts the **rigid-body motion** between two frames as an SE(3) transform:

$$
T_{t \rightarrow r} =
\begin{bmatrix}
R & \mathbf{t} \\
0 & 1
\end{bmatrix}
$$

* $R$: $3 \times 3$ rotation matrix, predicted in a compact parameterization such as a **quaternion** (used in the example below), an axis-angle vector, or Euler angles
* $\mathbf{t}$: $3 \times 1$ translation vector

---

**Example Transformation**

* **Quaternion:** $q = (w=0.9239, x=0, y=0, z=0.3827)$ → 45° rotation about Z-axis.
* **Translation:** $\mathbf{t} = [1, 0, 0]^T$

Rotation matrix from quaternion:

$$
R =
\begin{bmatrix}
\cos 45^\circ & -\sin 45^\circ & 0 \\
\sin 45^\circ & \cos 45^\circ & 0 \\
0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.707 & -0.707 & 0 \\
0.707 & 0.707 & 0 \\
0 & 0 & 1
\end{bmatrix}
$$

Full transform:

$$
T_{t \rightarrow r} =
\begin{bmatrix}
0.707 & -0.707 & 0 & 1 \\
0.707 & 0.707 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$
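
The conversion from the example quaternion to $R$ and then to the $4 \times 4$ transform can be sketched as follows. The standard unit-quaternion formula is used; the helper names `quat_to_rotmat` and `make_transform` are hypothetical, and the quaternion is normalized first because the rounded values $0.9239$ and $0.3827$ are not exactly unit length.

```python
import math


def quat_to_rotmat(w, x, y, z):
    """Rotation matrix of a quaternion (normalized first)."""
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def make_transform(R, t):
    """Stack R (3x3) and t (length 3) into a 4x4 homogeneous matrix."""
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


R = quat_to_rotmat(0.9239, 0.0, 0.0, 0.3827)
T = make_transform(R, [1.0, 0.0, 0.0])
```

Up to rounding, the first two rows of `R` match the $\pm 0.707$ entries above, and the last column of `T` holds the translation.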

---

### 1.3.3 Back-Project Pixel to 3D (Target Frame)

Given intrinsics

$$
K =
\begin{bmatrix}
100 & 0 & 50 \\
0 & 100 & 50 \\
0 & 0 & 1
\end{bmatrix}
,\quad
K^{-1} =
\begin{bmatrix}
0.01 & 0 & -0.5 \\
0 & 0.01 & -0.5 \\
0 & 0 & 1
\end{bmatrix}
$$

Take target pixel $(u_t,v_t)=(60,40)$ with predicted depth $D_t=2.0$:

$$
K^{-1} \begin{bmatrix} 60 \\ 40 \\ 1 \end{bmatrix}
=
\begin{bmatrix}
0.1 \\ -0.1 \\ 1
\end{bmatrix}
\quad\Rightarrow\quad
P_t = D_t \cdot
\begin{bmatrix}
0.1 \\ -0.1 \\ 1
\end{bmatrix}
=
\begin{bmatrix}
0.2 \\ -0.2 \\ 2.0
\end{bmatrix}
$$

This is the **3D point in the target frame**.
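
For a pinhole camera without skew, multiplying by $K^{-1}$ reduces to subtracting the principal point and dividing by the focal length. A minimal sketch of that step, where the function name `backproject` and the parameter names `fx, fy, cx, cy` for the entries of $K$ are assumptions made for illustration:

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth -> 3D point in the camera frame.

    Equivalent to depth * K^{-1} [u, v, 1]^T for a pinhole camera without skew.
    """
    x = (u - cx) / fx
    y = (v - cy) / fy
    return [depth * x, depth * y, depth]


P_t = backproject(60, 40, 2.0, fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

Here `P_t` reproduces the point $(0.2, -0.2, 2.0)$ computed above, up to floating-point rounding.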

---

### 1.3.4 Transform Point to Reference Frame

Apply $P_r = R P_t + \mathbf{t}$:

1. Rotate:

$$
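R P_t =
\begin{bmatrix}
0.70711 \cdot 0.2 - 0.70711 \cdot (-0.2) \\
0.70711 \cdot 0.2 + 0.70711 \cdot (-0.2) \\
2.0
\end{bmatrix}
=
\begin{bmatrix}
0.28284 \\ 0 \\ 2.0
\end{bmatrix}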
$$

2. Translate:

$$
P_r =
\begin{bmatrix}
1.28284 \\ 0 \\ 2.0
\end{bmatrix}
$$
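
The rotate-then-translate step is a single matrix-vector product plus an offset. A minimal sketch, with the hypothetical helper name `transform_point` and the 45° rotation about Z built directly from its cosine and sine:

```python
import math


def transform_point(R, t, p):
    """Apply the rigid-body motion p -> R p + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


c, s = math.cos(math.radians(45)), math.sin(math.radians(45))
R = [[c, -s, 0.0],
     [s, c, 0.0],
     [0.0, 0.0, 1.0]]

P_r = transform_point(R, [1.0, 0.0, 0.0], [0.2, -0.2, 2.0])
```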

---

### 1.3.5 Project Back to Reference Image

Project using intrinsics:

$$
u_r = 100 \cdot \frac{1.28284}{2.0} + 50 = 114.14,
\quad
v_r = 100 \cdot \frac{0}{2.0} + 50 = 50
$$

Resulting pixel in $I_r$: $(u_r,v_r)=(114.14, 50.0)$

If the image is $100 \times 100$, this lies **out of bounds** ($u_r > 99$), so we **mask it out** and it does not contribute to the loss.
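
Projection and the bounds test can be sketched as below. The names `project` and `in_bounds` are hypothetical, and the bounds convention is an assumption: pixel centers sit at integer coordinates, so the sampleable range is $[0, W-1] \times [0, H-1]$. Implementations differ on this detail, and points with $Z \le 0$ would also need to be masked.

```python
def project(p, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (X, Y, Z) with Z > 0 to pixel coordinates."""
    X, Y, Z = p
    return fx * X / Z + cx, fy * Y / Z + cy


def in_bounds(u, v, width, height):
    """True if (u, v) can be bilinearly sampled from a width x height image."""
    return 0.0 <= u <= width - 1 and 0.0 <= v <= height - 1


u_r, v_r = project([1.28284, 0.0, 2.0], 100.0, 100.0, 50.0, 50.0)
valid = in_bounds(u_r, v_r, width=100, height=100)  # False: u_r is about 114.14
```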

---

### 1.3.6 Sanity Check (No Translation)

If $\mathbf{t}=0$:

$$
P_r=(0.28284, 0, 2.0) \quad\Rightarrow\quad
u_r=64.14, \; v_r=50
$$

Now the pixel is **inside the image** → it would contribute to the loss.
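
Both cases can be compared by chaining the three steps into one function. This is a minimal sketch; `warp_pixel` is a hypothetical name, and the intrinsics are passed as the individual entries of $K$.

```python
import math


def warp_pixel(u, v, depth, R, t, fx, fy, cx, cy):
    """Target pixel -> reference pixel via back-project, transform, project."""
    p = [depth * (u - cx) / fx, depth * (v - cy) / fy, depth]
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    return fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy


c, s = math.cos(math.radians(45)), math.sin(math.radians(45))
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
intrinsics = dict(fx=100.0, fy=100.0, cx=50.0, cy=50.0)

with_t = warp_pixel(60, 40, 2.0, R, [1.0, 0.0, 0.0], **intrinsics)     # about (114.14, 50)
without_t = warp_pixel(60, 40, 2.0, R, [0.0, 0.0, 0.0], **intrinsics)  # about (64.14, 50)
```

The two results differ by $f_x \cdot t_x / Z = 100 \cdot 1 / 2 = 50$ pixels in $u$, which is exactly the shift that pushes the first one out of the image.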

---

### 1.3.7 Scale Ambiguity in Monocular VO

If we **halve depth and translation**:

* $P_t'=(0.1,-0.1,1.0)$
* $\mathbf{t}'=[0.5,0,0]^T$

Then

$$
P_r'=(0.64142, 0, 1.0)
\quad\Rightarrow\quad
u_r'=114.14, v_r'=50
$$

The projected pixel is **identical** → photometric loss cannot recover absolute scale, only **relative geometry**.
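
The same cancellation holds for any scale factor $s > 0$, not only $s = \tfrac{1}{2}$. A short derivation, writing $f_x, c_x$ for the entries of $K$ and $P_r = (X_r, Y_r, Z_r)$ (notation assumed here for illustration):

$$
P_r' = R \, (s P_t) + s \, \mathbf{t} = s \, (R P_t + \mathbf{t}) = s \, P_r
\quad\Rightarrow\quad
u_r' = f_x \frac{s X_r}{s Z_r} + c_x = f_x \frac{X_r}{Z_r} + c_x = u_r
$$

The same argument applies to $v_r'$, so every warped pixel, and therefore the reconstructed image, is unchanged by $s$.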

---

### 1.3.8 Training Loop Summary

1. Predict depth $D_t$ and pose $(R,\mathbf{t})$.
2. Back-project pixels → 3D points $P_t$.
3. Transform → reference frame $P_r$.
4. Project back to image plane → $(u_r,v_r)$.
5. Sample $I_r$ at these coordinates → reconstruct $\hat{I}_t$.
6. Compute photometric loss (L1 or L1+SSIM) over valid pixels.
7. Backprop through projection, sampling, and networks.
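
Steps 2 to 6 can be sketched end to end on a tiny image in plain Python. This is an illustration under stated assumptions, not a training implementation: depth and pose are set by hand instead of being predicted (step 1), there is no automatic differentiation (step 7), only the L1 term is used, and the names `bilinear` and `photometric_l1` are hypothetical.

```python
import math


def bilinear(img, u, v):
    """Bilinearly sample img (list of rows) at continuous pixel (u, v)."""
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, len(img[0]) - 1), min(v0 + 1, len(img) - 1)
    a, b = u - u0, v - v0
    top = (1 - a) * img[v0][u0] + a * img[v0][u1]
    bottom = (1 - a) * img[v1][u0] + a * img[v1][u1]
    return (1 - b) * top + b * bottom


def photometric_l1(I_t, I_r, depth, R, t, fx, fy, cx, cy):
    """Warp I_r into the target view and average |I_t - I_hat| over valid pixels."""
    h, w = len(I_t), len(I_t[0])
    total, count = 0.0, 0
    for v in range(h):
        for u in range(w):
            d = depth[v][u]
            p = [d * (u - cx) / fx, d * (v - cy) / fy, d]  # step 2: back-project
            q = [sum(R[i][j] * p[j] for j in range(3)) + t[i]
                 for i in range(3)]                        # step 3: transform
            if q[2] <= 0:
                continue                                   # behind the camera
            ur = fx * q[0] / q[2] + cx                     # step 4: project
            vr = fy * q[1] / q[2] + cy
            if not (0 <= ur <= w - 1 and 0 <= vr <= h - 1):
                continue                                   # out of bounds: masked
            total += abs(I_t[v][u] - bilinear(I_r, ur, vr))  # steps 5 and 6
            count += 1
    return total / count if count else 0.0


I_t = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
I_r = [[0.0, 0.1, 0.2], [0.3, 0.4, 0.5], [0.6, 0.7, 0.8]]  # I_t shifted right by one pixel
depth = [[2.0] * 3 for _ in range(3)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = dict(fx=2.0, fy=2.0, cx=1.0, cy=1.0)

loss_good = photometric_l1(I_t, I_r, depth, identity, [1.0, 0.0, 0.0], **K)  # consistent pose
loss_bad = photometric_l1(I_t, I_r, depth, identity, [0.0, 0.0, 0.0], **K)   # wrong pose
```

With the consistent translation, every valid target pixel lands on its match in `I_r` and the loss is zero; the last column projects out of bounds and is masked. With the wrong pose the loss rises to about 0.1, which is the signal that training minimizes.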

---
