## Self-Supervised Losses (always used)

These losses drive both DepthNet and PoseNet when you train from monocular videos **without ground truth**.

---

### (a) **Photometric Reprojection Loss** (a.k.a. View Synthesis Loss)

**Definition:**
$$
L_\text{photo} = \min_s \Big( \alpha \frac{1 - \text{SSIM}(I_t, I_s')}{2} + (1 - \alpha) \| I_t - I_s' \|_1 \Big)
$$

* $I_s'$: source image $I_s$ warped into the target frame using the predicted depth and pose.
* $\alpha = 0.85$ is the weighting used in Monodepth2.
* Take the per-pixel **minimum reprojection** error across multiple source frames, so that a pixel occluded in one source frame can still be matched in another.
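
The per-pixel minimum can be illustrated with a minimal sketch in plain Python. It assumes the per-source error maps have already been computed, uses toy L1 errors only (the SSIM term is left out for brevity), and the helper name `min_reprojection` is made up for this illustration.

```python
def min_reprojection(error_maps):
    """Per-pixel minimum over a list of equally sized 2D error maps."""
    height, width = len(error_maps[0]), len(error_maps[0][0])
    return [
        [min(err[r][c] for err in error_maps) for c in range(width)]
        for r in range(height)
    ]

# Toy 1x3 error maps of the target against two warped source frames.
err_prev = [[0.05, 0.40, 0.10]]  # middle pixel badly matched (e.g. occluded) in frame t-1
err_next = [[0.06, 0.07, 0.30]]  # right pixel badly matched (e.g. occluded) in frame t+1

per_pixel = min_reprojection([err_prev, err_next])
loss = sum(sum(row) for row in per_pixel) / 3
print(per_pixel)       # [[0.05, 0.07, 0.1]]
print(round(loss, 4))  # 0.0733
```

Averaging the two maps instead would give the middle pixel an error of 0.235, so a single bad match would dominate; the minimum keeps the better of the two matches.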


---



* You have a **target image** $I_t$ and a **reference image** $I_r$ (the source image $I_s$ in the definition above).
* Using predicted **depth** $D_t$ and **camera motion** ($R, \mathbf{t}$), you warp $I_r$ into the target frame, producing a **reconstructed image** $\hat{I}_t$ (written $I_s'$ above).
* The **photometric loss** compares $I_t$ and $\hat{I}_t$: if they are visually similar, the loss is small, which means the depth and motion predictions are consistent with each other.

The simplest version is a **pixel-wise L1 loss**:

$$
\mathcal{L}_{\text{photo}} = \frac{1}{N} \sum_{i=1}^N \big| I_t(i) - \hat{I}_t(i) \big|
$$

where $N$ is the number of valid pixels.

---

### 1.3.1 Numerical Example (Pixel-wise Photometric Loss)

**Target image $I_t$:**

$$
I_t =
\begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.2 & 0.4 & 0.6 \\
0.3 & 0.5 & 0.7
\end{bmatrix}
$$

**Reconstructed image $\hat{I}_t$:**

$$
\hat{I}_t =
\begin{bmatrix}
0.0 & 0.1 & 0.4 \\
0.3 & 0.3 & 0.5 \\
0.4 & 0.4 & 0.8
\end{bmatrix}
$$

Pixel-wise absolute differences:

$$
|I_t - \hat{I}_t| =
\begin{bmatrix}
0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1
\end{bmatrix}
$$

Mean over all $N=9$ pixels:

$$
\mathcal{L}_{\text{photo}} = \frac{0.9}{9} = 0.1
$$

A lower value means the reconstruction is closer to the target.
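
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the function name `l1_photometric` is illustrative, and every pixel is treated as valid.

```python
I_t = [[0.1, 0.2, 0.3],
       [0.2, 0.4, 0.6],
       [0.3, 0.5, 0.7]]
I_hat = [[0.0, 0.1, 0.4],
         [0.3, 0.3, 0.5],
         [0.4, 0.4, 0.8]]

def l1_photometric(target, recon):
    """Mean absolute difference over all pixels of two 2D images."""
    diffs = [abs(a - b)
             for row_t, row_r in zip(target, recon)
             for a, b in zip(row_t, row_r)]
    return sum(diffs) / len(diffs)

loss = l1_photometric(I_t, I_hat)
print(round(loss, 6))  # 0.1
```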

---

### 1.3.2 Pose Representation (Rigid-Body Motion)

In unsupervised VO, the network predicts the **rigid-body motion** between two frames as an SE(3) transform:

$$
T_{t \rightarrow r} =
\begin{bmatrix}
R & \mathbf{t} \\
0 & 1
\end{bmatrix}
$$

* $R$: $3 \times 3$ rotation matrix, parameterized here by a **quaternion** (axis-angle and Euler-angle parameterizations are also common)
* $\mathbf{t}$: $3 \times 1$ translation vector

---

**Example Transformation**

* **Quaternion:** $q = (w=0.9239, x=0, y=0, z=0.3827)$ → 45° rotation about Z-axis.
* **Translation:** $\mathbf{t} = [1, 0, 0]^T$

Rotation matrix from quaternion:

$$
R =
\begin{bmatrix}
\cos 45° & -\sin 45° & 0 \\
\sin 45° & \cos 45° & 0 \\
0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.707 & -0.707 & 0 \\
0.707 & 0.707 & 0 \\
0 & 0 & 1
\end{bmatrix}
$$

Full transform:

$$
T_{t \rightarrow r} =
\begin{bmatrix}
0.707 & -0.707 & 0 & 1 \\
0.707 & 0.707 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$
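
As a rough check of the example, the standard unit-quaternion to rotation-matrix formula can be coded directly. The helper names `quat_to_rot` and `make_transform` are illustrative and not taken from any particular library.

```python
import math

def quat_to_rot(w, x, y, z):
    """Rotation matrix of a quaternion (w, x, y, z); the input is normalized first."""
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ]

def make_transform(R, t):
    """Stack R (3x3) and t (3,) into a 4x4 homogeneous transform."""
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

R = quat_to_rot(0.9239, 0.0, 0.0, 0.3827)   # 45 degrees about Z
T = make_transform(R, [1.0, 0.0, 0.0])
for row in T:
    print([round(value, 3) for value in row])
```

The printed rotation block should agree with the $\cos 45°$ / $\sin 45°$ matrix above to three decimals.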

---

### 1.3.3 Back-Project Pixel to 3D (Target Frame)

Given intrinsics

$$
K =
\begin{bmatrix}
100 & 0 & 50 \\
0 & 100 & 50 \\
0 & 0 & 1
\end{bmatrix}
,\quad
K^{-1} =
\begin{bmatrix}
0.01 & 0 & -0.5 \\
0 & 0.01 & -0.5 \\
0 & 0 & 1
\end{bmatrix}
$$

Take target pixel $(u_t,v_t)=(60,40)$ with predicted depth $D_t=2.0$:

$$
K^{-1} \begin{bmatrix} 60 \\ 40 \\ 1 \end{bmatrix}
=
\begin{bmatrix}
0.1 \\ -0.1 \\ 1
\end{bmatrix}
\quad\Rightarrow\quad
P_t = D_t \cdot
\begin{bmatrix}
0.1 \\ -0.1 \\ 1
\end{bmatrix}
=
\begin{bmatrix}
0.2 \\ -0.2 \\ 2.0
\end{bmatrix}
$$

This is the **3D point in target frame**.
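
A minimal sketch of this step, writing $K^{-1}$ out as $(u - c_x)/f_x$ and $(v - c_y)/f_y$; the function name `back_project` is an illustrative choice.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in the camera frame."""
    x = depth * (u - cx) / fx
    y = depth * (v - cy) / fy
    return [x, y, depth]

P_t = back_project(60, 40, 2.0, fx=100.0, fy=100.0, cx=50.0, cy=50.0)
print(P_t)  # [0.2, -0.2, 2.0]
```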

---

### 1.3.4 Transform Point to Reference Frame

Apply $P_r = R P_t + \mathbf{t}$:

1. Rotate:

$$
R P_t =
\begin{bmatrix}
0.28284 \\ 0 \\ 2.0
\end{bmatrix}
$$

2. Translate:

$$
P_r =
\begin{bmatrix}
1.28284 \\ 0 \\ 2.0
\end{bmatrix}
$$

---

### 1.3.5 Project Back to Reference Image

Project using intrinsics:

$$
u_r = 100 \cdot \frac{1.28284}{2.0} + 50 = 114.14,
\quad
v_r = 100 \cdot \frac{0}{2.0} + 50 = 50
$$

Resulting pixel in $I_r$: $(u_r,v_r)=(114.14, 50.0)$

If the image is $100 \times 100$, this lies **out of bounds** → we **mask it out** (does not contribute to loss).

---

### 1.3.6 Sanity Check (No Translation)

If $\mathbf{t}=0$:

$$
P_r=(0.28284, 0, 2.0) \quad\Rightarrow\quad
u_r=64.14, \; v_r=50
$$

Now the pixel is **inside the image** → it would contribute to the loss.
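
Both cases (with and without translation) can be reproduced with a short sketch. The helper names are illustrative, and the validity test additionally requires positive depth, which is an assumption beyond the bounds check described above.

```python
import math

def transform(R, t, P):
    """Apply P_r = R P + t to a 3D point."""
    return [sum(R[i][j] * P[j] for j in range(3)) + t[i] for i in range(3)]

def project(P, fx, fy, cx, cy):
    """Pinhole projection of a 3D point to pixel coordinates."""
    return fx * P[0] / P[2] + cx, fy * P[1] / P[2] + cy

def is_valid(u, v, z, width, height):
    """Usable if the point is in front of the camera and lands inside the image."""
    return z > 0 and 0 <= u <= width - 1 and 0 <= v <= height - 1

c, s = math.cos(math.radians(45)), math.sin(math.radians(45))
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
P_t = [0.2, -0.2, 2.0]

outcomes = []
for t in ([1.0, 0.0, 0.0], [0.0, 0.0, 0.0]):
    P_r = transform(R, t, P_t)
    u_r, v_r = project(P_r, 100.0, 100.0, 50.0, 50.0)
    outcomes.append((u_r, v_r, is_valid(u_r, v_r, P_r[2], 100, 100)))
    print(t, round(u_r, 2), round(v_r, 2), outcomes[-1][2])
```

The first case is masked out (`False`) and the second contributes to the loss (`True`).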

---

### 1.3.7 Scale Ambiguity in Monocular VO

If we **halve depth and translation**:

* $P_t'=(0.1,-0.1,1.0)$
* $\mathbf{t}'=[0.5,0,0]^T$

Then

$$
P_r'=(0.64142, 0, 1.0)
\quad\Rightarrow\quad
u_r'=114.14, v_r'=50
$$

The projected pixel is **identical** → photometric loss cannot recover absolute scale, only **relative geometry**.
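
The same check can be run for several scale factors. This is a sketch with an illustrative helper `reproject` that chains back-projection, the rigid transform, and projection; depth and translation are multiplied by the same factor each time.

```python
import math

def reproject(u, v, depth, R, t, fx, fy, cx, cy):
    """Back-project pixel (u, v), apply P_r = R P_t + t, project into the reference image."""
    P_t = [depth * (u - cx) / fx, depth * (v - cy) / fy, depth]
    P_r = [sum(R[i][j] * P_t[j] for j in range(3)) + t[i] for i in range(3)]
    return fx * P_r[0] / P_r[2] + cx, fy * P_r[1] / P_r[2] + cy

c, s = math.cos(math.radians(45)), math.sin(math.radians(45))
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

results = {}
for scale in (0.5, 1.0, 2.0, 10.0):
    # Scale depth and translation together; rotation is unaffected by scale.
    results[scale] = reproject(60, 40, 2.0 * scale, R, [1.0 * scale, 0.0, 0.0],
                               100.0, 100.0, 50.0, 50.0)
    print(scale, [round(value, 2) for value in results[scale]])
```

Every scale factor yields the same pixel up to floating-point error, because the common factor cancels in the division by $Z$.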

---

### 1.3.8 Training Loop Summary

1. Predict depth $D_t$ and pose $(R,\mathbf{t})$.
2. Back-project pixels → 3D points $P_t$.
3. Transform → reference frame $P_r$.
4. Project back to image plane → $(u_r,v_r)$.
5. Sample $I_r$ at these coordinates → reconstruct $\hat{I}_t$.
6. Compute photometric loss (L1 or L1+SSIM) over valid pixels.
7. Backprop through projection, sampling, and networks.

---


### Photometric Loss: Mathematical Formulation

This section explains the mathematical equations used in the photometric loss calculation for self-supervised monocular depth estimation, following the approach of **Monodepth2** and similar works.

---

### 1. Dataset Structure (KITTI Odometry)

* **Images**: $N = 4541$ (sequence 00)
* **Poses**: $N = 4541$ (one per image)
* **Calibration**: Camera intrinsic matrix $K$

Each pose $T_{w_i}$ transforms from camera frame $i$ to world frame $w$.

---

### 2. Camera Intrinsic Matrix

$$
K =
\begin{bmatrix}
f_x & 0 & c_x \\
0 & f_y & c_y \\
0 & 0 & 1
\end{bmatrix}
$$

Example from KITTI:

$$
K =
\begin{bmatrix}
718.856 & 0 & 607.1928 \\
0 & 718.856 & 185.2157 \\
0 & 0 & 1
\end{bmatrix}
$$

Inverse:

$$
K^{-1} =
\begin{bmatrix}
1/f_x & 0 & -c_x/f_x \\
0 & 1/f_y & -c_y/f_y \\
0 & 0 & 1
\end{bmatrix}
$$
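
The closed-form inverse can be sanity-checked by multiplying it with $K$. The sketch below uses the KITTI values quoted above; the helper names are illustrative.

```python
def intrinsics(fx, fy, cx, cy):
    """3x3 pinhole intrinsic matrix K."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]

def intrinsics_inverse(fx, fy, cx, cy):
    """Closed-form inverse of K (no general matrix inversion needed)."""
    return [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]

def matmul3(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

params = dict(fx=718.856, fy=718.856, cx=607.1928, cy=185.2157)
K = intrinsics(**params)
K_inv = intrinsics_inverse(**params)
product = matmul3(K, K_inv)
for row in product:
    print([round(value, 9) for value in row])
```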

---

### 3. Camera Pose Representation

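Each line of `poses/00.txt` stores 12 values: the top three rows of $T_{w_i}$ in row-major order. Appending the row $[0, 0, 0, 1]$ gives the full homogeneous matrix:

$$
T_{w_i} =
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_x \\
r_{21} & r_{22} & r_{23} & t_y \\
r_{31} & r_{32} & r_{33} & t_z \\
0 & 0 & 0 & 1
\end{bmatrix}
$$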

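The rotation $R$ is $3\times3$, the translation $t$ is $3\times1$.

**Relative transformation** from target frame $t$ to source frame $s$:

$$
T_{t\rightarrow s} = T_{w_s}^{-1} T_{w_t}
$$

Reading right to left: $T_{w_t}$ takes a point from the target camera frame to the world frame, then $T_{w_s}^{-1}$ takes it from the world frame into the source camera frame. The same relation in superscript/subscript notation is

$$
T_{t}^{s} = (T_{s}^{w})^{-1} T_{t}^{w}
$$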
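
A minimal sketch of the relative-pose computation in plain Python follows. The two pose lines are toy values in the 12-number row format, not real KITTI data, and the helper names are illustrative. The inverse uses the rigid-body identity $T^{-1} = [R^T \mid -R^T t]$ instead of a general matrix inverse.

```python
def row_to_matrix(line):
    """Parse 12 whitespace-separated values (3x4, row-major) into a 4x4 matrix."""
    vals = [float(x) for x in line.split()]
    return [vals[0:4], vals[4:8], vals[8:12], [0.0, 0.0, 0.0, 1.0]]

def invert_rigid(T):
    """Inverse of a rigid transform: rotation R^T, translation -R^T t."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t_inv = [-sum(R_t[i][j] * T[j][3] for j in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul4(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def relative_pose(T_w_t, T_w_s):
    """T_{t->s} = T_{w_s}^{-1} T_{w_t}."""
    return matmul4(invert_rigid(T_w_s), T_w_t)

# Toy poses without rotation: target camera at z = 1 m, source camera at z = 2 m.
T_w_t = row_to_matrix("1 0 0 0  0 1 0 0  0 0 1 1")
T_w_s = row_to_matrix("1 0 0 0  0 1 0 0  0 0 1 2")

T_t2s = relative_pose(T_w_t, T_w_s)
print(T_t2s[2][3])  # -1.0
```

The target camera origin sits 1 m behind the source camera along $Z$, hence the translation of $-1$.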

---





### 4. Image Warping (Projective Transformation)

Goal: warp the source image $I_s$ into the target viewpoint using the predicted depth $D_t$ and the relative pose $T_{t\rightarrow s}$, so that the result can be compared with $I_t$.

Pipeline:
$$
(u, v) \rightarrow \text{3D point} \rightarrow \text{Transform} \rightarrow \text{Project} \rightarrow (u', v')
$$

#### Step 1: Pixel Coordinates

$$
p =
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \quad
P_{homo} \in \mathbb{R}^{3\times H\times W}
$$

#### Step 2: Back-projection

$$
X_{cam_t} = D_t(u,v) \, K^{-1} p
$$

Expanded:
$$
\begin{aligned}
X &= D_t(u,v) \frac{u - c_x}{f_x} \\
Y &= D_t(u,v) \frac{v - c_y}{f_y} \\
Z &= D_t(u,v)
\end{aligned}
$$

#### Step 3: Transform to Source Frame

$$
\begin{bmatrix} X_{cam_s} \\ 1 \end{bmatrix}
= T_{t\rightarrow s}
\begin{bmatrix} X_{cam_t} \\ 1 \end{bmatrix}
$$

Expanded:
$$
\begin{aligned}
X_s &= R_{11}X_t + R_{12}Y_t + R_{13}Z_t + t_x \\
Y_s &= R_{21}X_t + R_{22}Y_t + R_{23}Z_t + t_y \\
Z_s &= R_{31}X_t + R_{32}Y_t + R_{33}Z_t + t_z
\end{aligned}
$$

#### Step 4: Projection to Source Image Plane

$$
\begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix}
= K \begin{bmatrix} X_s/Z_s \\ Y_s/Z_s \\ 1 \end{bmatrix}
$$

Which gives:
$$
u' = f_x \frac{X_s}{Z_s} + c_x, \quad
v' = f_y \frac{Y_s}{Z_s} + c_y
$$



#### Step 5: Sampling

$$
I_{warped}(u,v) = I_s(u', v')
$$

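Complete form, with $R$ and $\mathbf{t}$ the rotation and translation blocks of $T_{t\rightarrow s}$:
$$
I_{warped}(u,v) = I_s\left( \pi\left( K \left( R \, D_t(u,v) \, K^{-1} [u,v,1]^T + \mathbf{t} \right) \right) \right)
$$

where the projection operator divides by the third component:
$$
\pi([X,Y,Z]^T) = [X/Z, Y/Z]^T
$$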
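
Because $(u', v')$ is generally not an integer location, the lookup $I_s(u', v')$ needs interpolation. The sketch below shows plain bilinear interpolation on a grayscale image stored as a list of rows. The function name `bilinear_sample` and the choice to return `None` outside the image are assumptions made for this illustration.

```python
import math

def bilinear_sample(image, u, v):
    """Sample a 2D grayscale image at real-valued (u, v); None if outside the image."""
    height, width = len(image), len(image[0])
    if not (0 <= u <= width - 1 and 0 <= v <= height - 1):
        return None
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, width - 1), min(v0 + 1, height - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * image[v0][u0] + du * image[v0][u1]
    bottom = (1 - du) * image[v1][u0] + du * image[v1][u1]
    return (1 - dv) * top + dv * bottom

I_s = [[0.0, 1.0],
       [2.0, 3.0]]
center = bilinear_sample(I_s, 0.5, 0.5)
outside = bilinear_sample(I_s, 2.5, 0.0)
print(center)   # 1.5
print(outside)  # None
```

Bilinear weights vary smoothly with $(u', v')$, which is what lets gradients flow from the sampled intensity back to the predicted depth and pose.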

---

### 5. Photometric Loss

#### SSIM (Structural Similarity)

$$
SSIM(I_t, I_w) =
\frac{ (2\mu_t \mu_w + C_1)(2\sigma_{tw} + C_2) }
{ (\mu_t^2 + \mu_w^2 + C_1)(\sigma_t^2 + \sigma_w^2 + C_2) }
$$

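Here $\mu$, $\sigma^2$ and $\sigma_{tw}$ are local means, variances and the covariance, computed over a small window around each pixel (Monodepth2 uses $3\times3$ windows).

Constants, for intensities scaled to $[0, 1]$:
$C_1 = 0.01^2$, $C_2 = 0.03^2$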

Loss:
$$
L_{SSIM} = \frac{1 - SSIM}{2}
$$

#### L1 Loss

$$
L_{L1} = |I_t - I_{warped}|
$$

#### Combined Photometric Loss

$$
L_{photo} = \alpha L_{SSIM} + (1 - \alpha) L_{L1}, \quad \alpha = 0.85
$$

Per-pixel:
$$
L(u,v) = 0.85 L_{SSIM}(u,v) + 0.15 L_{L1}(u,v)
$$

Total:
$$
L_{total} = \frac{1}{N_{valid}} \sum L(u,v)
$$
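
A minimal sketch of the combined loss follows. As a simplification, it treats an entire patch as a single SSIM window (global statistics) instead of sliding a local window over every pixel. The function names are illustrative, and the input values are flattened toy intensities in $[0, 1]$.

```python
def ssim_patch(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM of two equally sized flat patches, using one window for the whole patch."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

def photometric_loss(a, b, alpha=0.85):
    """alpha * (1 - SSIM) / 2 + (1 - alpha) * mean L1."""
    l_ssim = (1.0 - ssim_patch(a, b)) / 2.0
    l_l1 = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return alpha * l_ssim + (1.0 - alpha) * l_l1

target = [0.1, 0.2, 0.3, 0.2, 0.4, 0.6, 0.3, 0.5, 0.7]
warped = [0.0, 0.1, 0.4, 0.3, 0.3, 0.5, 0.4, 0.4, 0.8]

loss_same = photometric_loss(target, target)
loss_diff = photometric_loss(target, warped)
print(round(loss_same, 6), round(loss_diff, 6))
```

Identical patches give SSIM of 1 and a loss of 0 (up to rounding); any mismatch raises both terms.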

---

### 6. Pipeline Summary

**Input:** $I_t, I_s, D_t, K, T_{w_t}, T_{w_s}$ (world poses of the target and source cameras)

Steps:

1. $T_{t\rightarrow s} = T_{w_s}^{-1} T_{w_t}$
2. For each pixel:

   * $X_{cam_t} = D_t K^{-1} [u,v,1]^T$
   * $X_{cam_s} = T_{t\rightarrow s} [X_{cam_t}; 1]$
   * $[u',v'] = \pi(K X_{cam_s})$
   * $I_{warped}(u,v) = I_s(u',v')$
3. Compute:

   * $L = 0.85 L_{SSIM} + 0.15 L_{L1}$
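
The per-pixel steps can be strung together in a small end-to-end sketch. To stay short it makes several simplifying assumptions: grayscale images as nested lists, nearest-neighbour instead of bilinear sampling, the L1 term only, and a relative pose passed directly as $R$ and $\mathbf{t}$. The function name `warp_l1_loss` and the toy data are illustrative.

```python
def warp_l1_loss(I_t, I_s, depth, K, R, t):
    """Mean L1 error between I_t and I_s warped into the target view.

    Returns (loss, number_of_valid_pixels). Pixels that project behind the
    camera or outside the source image are skipped.
    """
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    height, width = len(I_t), len(I_t[0])
    total, n_valid = 0.0, 0
    for v in range(height):
        for u in range(width):
            d = depth[v][u]
            # Back-project the target pixel to 3D.
            X_t = [d * (u - cx) / fx, d * (v - cy) / fy, d]
            # Move the point into the source camera frame.
            X_s = [sum(R[i][j] * X_t[j] for j in range(3)) + t[i] for i in range(3)]
            if X_s[2] <= 0:
                continue
            # Project and round to the nearest source pixel.
            u2 = int(round(fx * X_s[0] / X_s[2] + cx))
            v2 = int(round(fy * X_s[1] / X_s[2] + cy))
            if 0 <= u2 < width and 0 <= v2 < height:
                total += abs(I_t[v][u] - I_s[v2][u2])
                n_valid += 1
    return (total / n_valid if n_valid else 0.0), n_valid

# Toy setup: unit focal length, unit depth, so t_x = 1 shifts the lookup by one pixel.
K = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
I_s = [[0.0, 0.1, 0.2, 0.3]]
I_t = [[0.1, 0.2, 0.3, 0.9]]   # I_t(u) equals I_s(u + 1) for the first three pixels
depth = [[1.0, 1.0, 1.0, 1.0]]

correct = warp_l1_loss(I_t, I_s, depth, K, R, [1.0, 0.0, 0.0])
wrong = warp_l1_loss(I_t, I_s, depth, K, R, [0.0, 0.0, 0.0])
print(correct)  # (0.0, 3)
print(wrong)
```

With the correct translation the three in-view pixels match exactly and the fourth is masked out; with zero translation the loss is larger, which is the signal that drives training.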

---

### 7. Implementation Details

Shapes:

```
I_s:     [B,3,H,W]
I_t:     [B,3,H,W]
depth_t: [B,1,H,W]
K:       [B,3,3]
T_t2s:   [B,4,4]
```

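Normalized coordinates for PyTorch `grid_sample` (this mapping matches `align_corners=True`, where $-1$ and $1$ refer to the centers of the corner pixels):
$$
u_{norm} = 2 \frac{u}{W-1} - 1, \quad v_{norm} = 2 \frac{v}{H-1} - 1
$$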
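
A small sketch of this mapping in plain Python; the function name `normalize_coords` and the $640 \times 192$ resolution are arbitrary choices for illustration.

```python
def normalize_coords(u, v, width, height):
    """Map pixel coordinates to [-1, 1], with corner pixel centers at -1 and 1."""
    return 2.0 * u / (width - 1) - 1.0, 2.0 * v / (height - 1) - 1.0

top_left = normalize_coords(0, 0, 640, 192)
bottom_right = normalize_coords(639, 191, 640, 192)
print(top_left)      # (-1.0, -1.0)
print(bottom_right)  # (1.0, 1.0)
```

Projected coordinates that map outside $[-1, 1]$ correspond to pixels outside the source image and are the ones to mask out.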

---

### 8. Coordinate Frames

* **World**: fixed reference
* **Camera**: optical center origin, $Z$ forward
* **Image**: origin at top-left corner

---

### 9. Assumptions

1. Static scene
2. Lambertian surfaces
3. Known camera intrinsics
4. Known or estimated poses

---

### 10. References

1. Godard et al., *Monodepth2*, ICCV 2019
2. KITTI Vision Benchmark Suite
3. Hartley & Zisserman, *Multiple View Geometry*

---

### Appendix: Matrix Dimensions

| Symbol                               | Dimension | Description           |
| ------------------------------------ | --------- | --------------------- |
| $I_t$                                | [B,3,H,W] | Target image          |
| $I_s$                                | [B,3,H,W] | Source image          |
| $I_{warped}$                         | [B,3,H,W] | Warped source         |
| $D_t$                                | [B,1,H,W] | Depth map             |
| $K, K^{-1}$                          | [B,3,3]   | Intrinsics / inverse  |
| $T_{w_i}, T_{w_j}, T_{t\rightarrow s}$ | [B,4,4]   | Transforms            |
| $p$                                  | [3,1]     | Pixel coordinates     |
| $X_{cam}$                            | [4,1]     | 3D homogeneous coords |
| $L_{photo}$                          | [B,1,H,W] | Photometric loss map  |

---