### (b) **Edge-Aware Depth Smoothness Loss**

Encourages locally smooth depth while preserving depth discontinuities along image edges.

**Equation:**
$$
L_\text{smooth} = |\partial_x d_t^*| e^{-|\partial_x I_t|} + |\partial_y d_t^*| e^{-|\partial_y I_t|}
$$

where $d_t^* = d_t / \bar{d}_t$ is the disparity normalized by its per-image mean.

This loss suppresses noisy disparity, and the edge weighting keeps depth edges aligned with color edges.

---

**Combined core loss:**
$$
L_\text{unsup} = L_\text{photo} + \lambda_\text{smooth} L_\text{smooth}
$$

Typical $\lambda_\text{smooth} \approx 0.001$ to $0.01$.

---


## 1.6 Smoothness Loss



### 1.6.1 What it is (and why)

Photometric and SSIM losses push the network to match views, but monocular depth admits many plausible solutions (especially in textureless regions, where the photometric error barely changes with depth). **Smoothness loss** is a regularizer that encourages **piecewise-smooth disparity/depth** while **preserving edges** aligned with image gradients.

### 1.6.2 Common formulations

Let $I\in\mathbb{R}^{H\times W\times 3}$ be the target image, and $d$ the predicted **disparity** (often smoother than raw depth; $d=1/z$). Finite differences:
$\partial_x f_{i,j}=f_{i,j}-f_{i,j-1}$, $\partial_y f_{i,j}=f_{i,j}-f_{i-1,j}$.

### 1.6.3 First-order, edge-aware (most used)

$$
\mathcal{L}_{\text{sm}}^{(1)}=
\frac{1}{HW}\sum_{i,j}\Big(
\left|\partial_x d_{i,j}\right|\,e^{-\alpha\,\|\partial_x I_{i,j}\|}
+
\left|\partial_y d_{i,j}\right|\,e^{-\alpha\,\|\partial_y I_{i,j}\|}
\Big)
$$

* The exponential **down-weights** the penalty at strong image edges so you **don’t over-smooth boundaries**.
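* $\alpha$ sets the edge sensitivity: $\alpha=1$ gives the plain $e^{-|\partial I|}$ weight, while larger values such as $\alpha\in[5,10]$ switch the penalty off more sharply at edges. Use grayscale $I$, or take the gradient norm over the color channels.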
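
The first-order term can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name, the default `alpha`, and the choice to average over the valid differences (rather than divide by exactly $HW$) are assumptions made for the example.

```python
import numpy as np

def edge_aware_smoothness(disp, image, alpha=1.0):
    """First-order edge-aware smoothness (illustrative sketch).

    disp:  (H, W) disparity map.
    image: (H, W, 3) RGB image with values in [0, 1].
    alpha: edge sensitivity (hypothetical default).
    """
    # Backward differences: f[i, j] - f[i, j-1] along x, f[i, j] - f[i-1, j] along y.
    dx_d = np.abs(disp[:, 1:] - disp[:, :-1])                      # (H, W-1)
    dy_d = np.abs(disp[1:, :] - disp[:-1, :])                      # (H-1, W)

    # Gradient norm over the color channels at each pixel.
    dx_i = np.linalg.norm(image[:, 1:] - image[:, :-1], axis=-1)   # (H, W-1)
    dy_i = np.linalg.norm(image[1:, :] - image[:-1, :], axis=-1)   # (H-1, W)

    # Mean over valid entries in each direction, then sum the two directions.
    return (dx_d * np.exp(-alpha * dx_i)).mean() + (dy_d * np.exp(-alpha * dy_i)).mean()

# Toy check: the same disparity step, with and without a matching image edge.
H, W = 4, 6
disp = np.zeros((H, W))
disp[:, 3:] = 1.0                       # disparity step between columns 2 and 3
flat = np.full((H, W, 3), 0.5)          # textureless image
edged = flat.copy()
edged[:, 3:] = 1.0                      # image edge at the same location

loss_flat = edge_aware_smoothness(disp, flat, alpha=10.0)
loss_edged = edge_aware_smoothness(disp, edged, alpha=10.0)
```

With the textureless image the step is penalized at full weight; when the image has an edge at the same place, the exponential suppresses almost all of the penalty.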

### 1.6.4 Second-order (curvature) variant

$$
\mathcal{L}_{\text{sm}}^{(2)}=
\frac{1}{HW}\sum_{i,j}\Big(
\left|\partial_{xx} d_{i,j}\right| e^{-\alpha \|\partial_x I_{i,j}\|}
+
\left|\partial_{yy} d_{i,j}\right| e^{-\alpha \|\partial_y I_{i,j}\|}
\Big)
$$

* Penalizes **changes of slope**; good for avoiding “staircasing”.
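
As a rough sketch of the second-order variant, the snippet below uses central second differences and reuses the backward image difference at the center pixel for the weight. The function name and the cropping choice are assumptions for illustration only.

```python
import numpy as np

def second_order_smoothness(disp, gray, alpha=1.0):
    """Edge-aware second-order smoothness on (H, W) arrays (illustrative sketch)."""
    # Central second differences: f[j+1] - 2 f[j] + f[j-1].
    dxx = np.abs(disp[:, 2:] - 2.0 * disp[:, 1:-1] + disp[:, :-2])   # (H, W-2)
    dyy = np.abs(disp[2:, :] - 2.0 * disp[1:-1, :] + disp[:-2, :])   # (H-2, W)
    # Backward image difference at the same center pixel, cropped to match.
    wx = np.exp(-alpha * np.abs(gray[:, 1:-1] - gray[:, :-2]))       # (H, W-2)
    wy = np.exp(-alpha * np.abs(gray[1:-1, :] - gray[:-2, :]))       # (H-2, W)
    return (dxx * wx).mean() + (dyy * wy).mean()

gray = np.zeros((5, 6))                          # textureless image: all weights are 1
ramp = 0.1 * np.tile(np.arange(6.0), (5, 1))     # constant slope along x
step = np.zeros((5, 6))
step[:, 3:] = 1.0                                # abrupt jump along x

ramp_first_order = np.abs(np.diff(ramp, axis=1)).mean()   # first-order term still charges the slope
ramp_second_order = second_order_smoothness(ramp, gray)   # essentially zero
step_second_order = second_order_smoothness(step, gray)
```

A slanted plane costs nothing under the second-order term but a constant amount under the first-order one, which is why the first-order term alone tends to flatten slopes into steps.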

### 1.6.5 Robust penalty / Charbonnier

Replace $|x|$ with $\rho(x)=\sqrt{x^2+\epsilon^2}$ (e.g., $\epsilon=10^{-3}$) for stability.
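
A small numerical check of the Charbonnier penalty, assuming the $\epsilon=10^{-3}$ value mentioned above; the function name is chosen for this example.

```python
import numpy as np

def charbonnier(x, eps=1e-3):
    """Smooth approximation of |x|: sqrt(x^2 + eps^2)."""
    return np.sqrt(x * x + eps * eps)

x = np.array([-1.0, 0.0, 1e-4, 1.0])
penalty = charbonnier(x)

# Derivative of rho is x / sqrt(x^2 + eps^2): defined everywhere and zero at x = 0,
# whereas the derivative of |x| is undefined at 0.
grad = x / charbonnier(x)
```

Away from zero the penalty is nearly identical to $|x|$ (the gap at $|x|=1$ is about $5\times10^{-7}$), so the change mostly affects behavior near zero gradient.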

### 1.6.6 Scale-invariant normalization

Disparity amplitude can drift. Common tricks:

* Use disparity $d$ instead of depth $z$.
* Or divide by mean disparity per image: $\tilde d = d / (\bar d + \varepsilon)$ before taking gradients.
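
The effect of mean normalization can be seen with a short experiment. The sketch below uses an unweighted first-order term (edge weights are left out for brevity) and hypothetical helper names; it shows that without normalization the loss shrinks simply by shrinking the disparity, while with normalization it does not.

```python
import numpy as np

def normalize_disparity(disp, eps=1e-7):
    """Divide by the per-image mean disparity."""
    return disp / (disp.mean() + eps)

def first_order_term(d):
    """Unweighted mean absolute difference along x and y."""
    return np.abs(np.diff(d, axis=1)).mean() + np.abs(np.diff(d, axis=0)).mean()

rng = np.random.default_rng(0)
disp = rng.uniform(0.1, 1.0, size=(8, 8))

raw_large = first_order_term(disp)
raw_small = first_order_term(0.01 * disp)                         # 100x smaller loss
norm_large = first_order_term(normalize_disparity(disp))
norm_small = first_order_term(normalize_disparity(0.01 * disp))   # essentially unchanged
```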

### 1.6.7 Multi-scale

Compute $\mathcal{L}_{\text{sm}}$ at pyramid levels $s=0..S-1$ (in Monodepth2-style code, $s=0$ is the finest, full-resolution level). Weight by $w_s$ (e.g., $w_s=1/2^s$) and sum.
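
A minimal sketch of the weighted multi-scale sum. It assumes the disparities arrive as a list indexed by pyramid level `s`, and that the image can be block-averaged down to each level's resolution by an integer factor; both helper functions are hypothetical names for this example.

```python
import numpy as np

def downsample_to(gray, shape):
    """Block-average gray (H, W) down to `shape`; assumes integer factors."""
    fh, fw = gray.shape[0] // shape[0], gray.shape[1] // shape[1]
    return gray.reshape(shape[0], fh, shape[1], fw).mean(axis=(1, 3))

def smoothness(disp, gray, alpha=1.0):
    """First-order edge-aware term on matching (H, W) arrays."""
    dx = np.abs(np.diff(disp, axis=1)) * np.exp(-alpha * np.abs(np.diff(gray, axis=1)))
    dy = np.abs(np.diff(disp, axis=0)) * np.exp(-alpha * np.abs(np.diff(gray, axis=0)))
    return dx.mean() + dy.mean()

def multiscale_smoothness(disps, gray, alpha=1.0):
    """Sum over levels of w_s * L_sm, with w_s = 1 / 2**s."""
    total = 0.0
    for s, disp in enumerate(disps):
        gray_s = downsample_to(gray, disp.shape)   # image at this level's resolution
        total += smoothness(disp, gray_s, alpha) / (2 ** s)
    return total

rng = np.random.default_rng(0)
gray = rng.uniform(size=(8, 8))
disps = [rng.uniform(size=(8 // 2**s, 8 // 2**s)) for s in range(3)]
loss = multiscale_smoothness(disps, gray)
```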

### 1.6.8 How it plugs into monocular VO (with a ViT)

Even if your depth/pose networks are **ViT-based**, the smoothness term is unchanged—just compute it on the **full-resolution** disparity map (after your ViT’s upsampling head).

**Total loss** (typical self-supervised monocular pipeline):

$$
\mathcal{L} = 
\lambda_{\text{photo}} \, \mathcal{L}_{\text{photo}}
+ \lambda_{\text{ssim}} \, \mathcal{L}_{\text{ssim}}
+ \lambda_{\text{sm}} \, \mathcal{L}_{\text{sm}}
\,(+ \text{other terms: automask, occlusion, geometry})
$$

Reasonable starting weights (tune per dataset):

* $\lambda_{\text{photo}}=1.0$
* $\lambda_{\text{ssim}}=0.15$ (if photo is L1)
* $\lambda_{\text{sm}} \in [0.001, 0.1]$ (start small; increase if depth is noisy)

**ViT-specific tips**

* Upsample tokens to image space (conv+pixelshuffle or interpolation) **before** smoothness.
* If you see block boundaries, add a tiny second-order term or anti-blocking conv in the upsampling head.
* Detach image gradients (no backprop through $I$).
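
To see why token blocking interacts with the smoothness term, here is a toy example. The 2×2 token grid, its values, and the patch size are made up for illustration; nearest-neighbor upsampling stands in for a head with no anti-aliasing.

```python
import numpy as np

patch = 4
tokens = np.array([[0.2, 0.5],
                   [0.3, 0.9]])          # hypothetical per-token disparities

# Nearest-neighbor upsampling: each token fills a patch x patch block.
up = np.repeat(np.repeat(tokens, patch, axis=0), patch, axis=1)   # (8, 8)

dx = np.abs(np.diff(up, axis=1))                 # (8, 7) horizontal differences
nonzero_cols = np.flatnonzero(dx.sum(axis=0))    # columns where any gradient exists
```

All of the horizontal gradient sits on the single seam between the two token columns (index 3 here), so the smoothness loss only ever sees patch borders. An interpolating or learned upsampling head spreads that change over several pixels.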


> Notes

* In a PyTorch implementation, pass `image` in **grayscale**, or compute the gradient norm channel-wise and average.
* If you want Charbonnier: replace `.abs()` on the disparity gradients with `torch.sqrt(x*x + eps*eps)`.

### 1.6.9 Practical tuning & pitfalls

* **Start small** $\lambda_{\text{sm}}$: too large $\Rightarrow$ over-smoothed, “melted” geometry; too small $\Rightarrow$ noisy depth.
* Use **disparity** rather than depth; it naturally stabilizes scale.
* **Detach** image gradients so that no gradient flows through $I$; the image should act only as a fixed weight.
* Consider **second-order** term if you see “staircase” artifacts.
* Compute on **multiple scales**; strongest impact at coarse scales.
* Dynamic objects/occlusions: combine with **auto-masking / per-pixel min reprojection** to avoid penalizing impossible warps.
* For ViT heads, ensure good **anti-aliasing upsampling**; otherwise smoothness fights token blocking.

### 1.6.10 Minimal recipe (drop-in)

1. Predict multi-scale disparities $d^{(s)}$ from your ViT depth head.
2. For each scale, compute $\mathcal{L}_{\text{sm}}^{(1)}$ with $\alpha=10$, normalize by mean disparity, weight by $w_s=1/2^s$.
3. Set $\lambda_{\text{sm}}=0.01$ as a starting point; tune against validation photometric error and scale drift.
4. Keep your usual $\mathcal{L}_{\text{photo}} + \mathcal{L}_{\text{SSIM}}$ and occlusion handling.



## 1.3 Edge-Aware Smoothness


The idea is to **weight the smoothness loss by image gradients**. Intuition: if the image has a strong edge (large intensity gradient), a depth discontinuity is likely there, so we relax the smoothness penalty at that pixel.

The edge-aware smoothness loss becomes:

$$
\mathcal{L}_{\text{edge-aware}} = 
\sum_{i,j} \Big[
|\partial_x D_{i,j}| \cdot e^{-|\partial_x I_{i,j}|}
+
|\partial_y D_{i,j}| \cdot e^{-|\partial_y I_{i,j}|}
\Big]
$$

where

* $D_{i,j}$: predicted depth/disparity
* $I_{i,j}$: input image (grayscale or per-channel average)
* $\partial_x, \partial_y$: gradients along x and y

####  Interpretation

* If $I$ has **low gradient** → weight ≈ 1 → enforce smoothness strongly
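* If $I$ has **high gradient** (edge) → weight shrinks toward 0 → let depth change abruptly. With intensities in $[0,1]$ and no scaling factor, $|\partial I|\le 1$, so the weight never drops below $e^{-1}\approx 0.37$; a factor $\alpha>1$ in the exponent is needed to push it close to 0.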

This preserves object boundaries and prevents depth bleeding across edges.
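
A quick way to get a feel for the weight is to tabulate it for a few gradient magnitudes. The values below assume a grayscale image scaled to $[0,1]$; the second row uses a scaling factor of 10 in the exponent purely for comparison.

```python
import numpy as np

grad = np.array([0.0, 0.1, 0.5, 1.0])    # |dI| for intensities in [0, 1]

w_plain = np.exp(-grad)                  # approx. 1.000, 0.905, 0.607, 0.368
w_scaled = np.exp(-10.0 * grad)          # approx. 1.000, 0.368, 0.007, 0.00005
```

Without scaling, even the strongest possible edge still leaves about 37% of the penalty in place; with the factor of 10, moderate edges already switch it off almost completely.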

---


### 1.3.1 Numerical Example

Let’s treat $D$ as **Z-depth in meters** and compute the **full edge-aware smoothness loss** step by step on tiny 3×3 grids.

Depth (meters)

$$
D=\begin{bmatrix}
1.0 & 1.1 & 1.2\\
1.0 & 1.1 & 1.3\\
1.0 & 1.0 & 1.2
\end{bmatrix}
$$

Grayscale image

$$
I=\begin{bmatrix}
0.2 & 0.2 & 0.8\\
0.2 & 0.2 & 0.9\\
0.2 & 0.2 & 0.9
\end{bmatrix}
$$

We’ll use **differences between adjacent pixels** and the common formulation:

$$
\mathcal L_{\text{edge}}=\underbrace{\big\langle |\partial_x D|\;e^{-|\partial_x I|}\big\rangle}_{\text{x-term mean}}
\;+\;
\underbrace{\big\langle |\partial_y D|\;e^{-|\partial_y I|}\big\rangle}_{\text{y-term mean}}.
$$

(Angle brackets $\langle\cdot\rangle$ = mean over all valid entries.)

---

#### **Gradients**

Adjacent-pixel differences (left minus right for $x$, top minus bottom for $y$):

$$
\partial_x D=
\begin{bmatrix}
-0.1 & -0.1\\
-0.1 & -0.2\\
0.0 & -0.2
\end{bmatrix},\quad
\partial_y D=
\begin{bmatrix}
0.0 & 0.0 & -0.1\\
0.0 & 0.1 & 0.1
\end{bmatrix}
$$

$$
\partial_x I=
\begin{bmatrix}
0.0 & -0.6\\
0.0 & -0.7\\
0.0 & -0.7
\end{bmatrix},\quad
\partial_y I=
\begin{bmatrix}
0.0 & 0.0 & -0.1\\
0.0 & 0.0 & 0.0
\end{bmatrix}
$$

We’ll use magnitudes $|\cdot|$ in the loss.

---

#### **Edge-aware weights**

Weights are $w=e^{-|\partial I|}$.

Useful constants (rounded):
$e^{-0.6}\approx \mathbf{0.548812}$, $e^{-0.7}\approx \mathbf{0.496585}$, $e^{-0.1}\approx \mathbf{0.904837}$, $e^{0}=1$.

**X-weights $w_x=e^{-|\partial_x I|}$:**

$$
w_x=
\begin{bmatrix}
1.000000 & 0.548812\\
1.000000 & 0.496585\\
1.000000 & 0.496585
\end{bmatrix}
$$

**Y-weights $w_y=e^{-|\partial_y I|}$:**

$$
w_y=
\begin{bmatrix}
1.000000 & 1.000000 & 0.904837\\
1.000000 & 1.000000 & 1.000000
\end{bmatrix}
$$

---

#### **Weighted gradients**

Take absolute value of the depth gradients, multiply by weights.

**X term** ($|\partial_x D|\cdot w_x$):

$$
\begin{bmatrix}
0.1\cdot1      & 0.1\cdot0.548812\\
0.1\cdot1      & 0.2\cdot0.496585\\
0.0\cdot1      & 0.2\cdot0.496585
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{0.100000} & \mathbf{0.054881}\\
\mathbf{0.100000} & \mathbf{0.099317}\\
\mathbf{0.000000} & \mathbf{0.099317}
\end{bmatrix}
$$

**Y term** ($|\partial_y D|\cdot w_y$):

$$
\begin{bmatrix}
0.0\cdot1 & 0.0\cdot1 & 0.1\cdot0.904837\\
0.0\cdot1 & 0.1\cdot1 & 0.1\cdot1
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{0.000000} & \mathbf{0.000000} & \mathbf{0.090484}\\
\mathbf{0.000000} & \mathbf{0.100000} & \mathbf{0.100000}
\end{bmatrix}
$$

---

#### **Means and final loss**

Each matrix above has 6 valid entries. Compute means:

* $\text{mean}_x = \frac{0.100000+0.054881+0.100000+0.099317+0.000000+0.099317}{6} = \mathbf{0.075586}$

* $\text{mean}_y = \frac{0.000000+0.000000+0.090484+0.000000+0.100000+0.100000}{6} = \mathbf{0.048414}$

**Edge-aware smoothness loss**

$$
\boxed{\mathcal L_{\text{edge}}=\text{mean}_x+\text{mean}_y
= \mathbf{0.075586}+\mathbf{0.048414}
= \mathbf{0.124000}\ (\approx 0.124)}
$$

---

**For comparison: naïve smoothness (no edge weights)**

$$
\langle|\partial_x D|\rangle + \langle|\partial_y D|\rangle
= \frac{0.1+0.1+0.1+0.2+0.0+0.2}{6} + \frac{0+0+0.1+0+0.1+0.1}{6}
= 0.116667+0.050000
= \mathbf{0.166667}
$$

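So edge-aware weighting **reduced** the penalty from **0.1667 → 0.124**. Almost all of the reduction comes from the x-term, where the depth changes between the second and third columns coincide with an image jump of 0.6 to 0.7 and are therefore down-weighted.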
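
The worked example can be checked numerically. The snippet below is a direct NumPy transcription of the steps above (same matrices, same left-minus-right and top-minus-bottom differences); the variable names are chosen for this sketch.

```python
import numpy as np

D = np.array([[1.0, 1.1, 1.2],
              [1.0, 1.1, 1.3],
              [1.0, 1.0, 1.2]])
I = np.array([[0.2, 0.2, 0.8],
              [0.2, 0.2, 0.9],
              [0.2, 0.2, 0.9]])

# Adjacent differences: left minus right (x), top minus bottom (y).
dx_D, dy_D = D[:, :-1] - D[:, 1:], D[:-1, :] - D[1:, :]
dx_I, dy_I = I[:, :-1] - I[:, 1:], I[:-1, :] - I[1:, :]

# Edge-aware terms: |dD| * exp(-|dI|), averaged over the 6 valid entries each.
mean_x = (np.abs(dx_D) * np.exp(-np.abs(dx_I))).mean()
mean_y = (np.abs(dy_D) * np.exp(-np.abs(dy_I))).mean()
edge_loss = mean_x + mean_y

# Unweighted comparison.
naive_loss = np.abs(dx_D).mean() + np.abs(dy_D).mean()

print(f"{mean_x:.6f} {mean_y:.6f} {edge_loss:.6f} {naive_loss:.6f}")
# 0.075586 0.048414 0.124000 0.166667
```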


### Shapes of $D$ and $I$ for a Real-Size Image Input
Let’s walk through this for a **224 × 224** image (a typical ResNet-18 input size) and see what the shapes look like for $D$, $I$, and their gradients.

---

#### Input Image $I$

If you feed a single RGB image to your network:

* $I$ shape (PyTorch): **\[B, C, H, W] = \[1, 3, 224, 224]**
* For smoothness loss, we usually convert to **grayscale or take per-channel mean** to get shape:
  **\[1, 1, 224, 224]**

---

#### Predicted Depth/Disparity Map $D$

Most monocular depth networks **output one value per pixel** (dense prediction):

* $D$ shape: **\[1, 1, 224, 224]**
* This means $D[i,j]$ is the predicted depth/disparity for pixel $(i,j)$.

> ⚠️ If you’re using a network like ResNet-18 as backbone, you often upsample the final feature map back to 224×224 so that $D$ has same resolution as $I$.

---

####  Computing Gradients

Gradients are local finite differences: each one shortens the differenced axis by exactly one element and leaves the other axes unchanged.

#### Horizontal gradient $\partial_x$:

* Compute `D[:, :, :, :-1] - D[:, :, :, 1:]`
* Resulting size: **\[1, 1, 224, 223]** (one fewer column)

#### Vertical gradient $\partial_y$:

* Compute `D[:, :, :-1, :] - D[:, :, 1:, :]`
* Resulting size: **\[1, 1, 223, 224]** (one fewer row)

The same operators applied to the image $I$ give $\partial_x I$ and $\partial_y I$, so their shapes match those of the depth gradients.

---

#### Element-wise Weighting

Once you have:

* $|\partial_x D|$  → shape \[1, 1, 224, 223]
* $|\partial_x I|$  → shape \[1, 1, 224, 223]

You compute:

$$
\text{weighted}_x = |\partial_x D| \cdot e^{-|\partial_x I|}
$$

Element-wise multiplication → stays **\[1, 1, 224, 223]**

Then you take mean over all elements.
Same for $y$-direction.

---


| Quantity        | Shape (for 224×224 input)      | Meaning                   |
| --------------- | ------------------------------ | ------------------------- |
| $I$ (grayscale) | \[B, 1, 224, 224]              | Input image intensity     |
| $D$             | \[B, 1, 224, 224]              | Predicted depth/disparity |
| $\partial_x D, \partial_x I$ | \[B, 1, 224, 223]              | Horizontal differences    |
| $\partial_y D, \partial_y I$ | \[B, 1, 223, 224]              | Vertical differences      |
| Weighted terms  | Same as corresponding gradient | Used in loss computation  |
| Final loss      | Scalar                         | Mean over all entries     |
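
The shapes in the table can be confirmed with random stand-in tensors. This sketch uses NumPy arrays in the same `[B, C, H, W]` layout; the slicing is identical for PyTorch tensors. The random inputs are placeholders, not real network outputs.

```python
import numpy as np

B, H, W = 1, 224, 224
rng = np.random.default_rng(0)
image = rng.uniform(size=(B, 3, H, W))     # stand-in RGB batch
D = rng.uniform(size=(B, 1, H, W))         # stand-in depth/disparity
I = image.mean(axis=1, keepdims=True)      # per-channel mean -> (B, 1, H, W)

dx_D = D[:, :, :, :-1] - D[:, :, :, 1:]    # (B, 1, 224, 223)
dy_D = D[:, :, :-1, :] - D[:, :, 1:, :]    # (B, 1, 223, 224)
dx_I = I[:, :, :, :-1] - I[:, :, :, 1:]
dy_I = I[:, :, :-1, :] - I[:, :, 1:, :]

weighted_x = np.abs(dx_D) * np.exp(-np.abs(dx_I))
weighted_y = np.abs(dy_D) * np.exp(-np.abs(dy_I))
loss = weighted_x.mean() + weighted_y.mean()   # scalar

print(I.shape, dx_D.shape, dy_D.shape, np.ndim(loss))
# (1, 1, 224, 224) (1, 1, 224, 223) (1, 1, 223, 224) 0
```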



#### Could We Use Better Edge Detection?

Yes — some papers do. Variants include:

* **Gradient magnitude (isotropic):**

  $$
  |\nabla I| = \sqrt{(\partial_x I)^2 + (\partial_y I)^2}
  $$

  This captures diagonal edges more naturally, instead of separating x and y.

* **Sobel / Scharr filters**
  Better gradient estimation (uses 3×3 kernels for smoothing + derivative).

* **Learned edge weights**
  Some works learn an "edge mask" jointly with depth to adaptively weight smoothness.

* **Perceptual / semantic edges**
  Use segmentation boundaries or learned feature maps instead of raw intensity.

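Because the weights depend only on the input image, which is typically detached from the graph, they only need to be differentiable when they are learned. Hard binary detectors like Canny are still generally avoided in training, since a soft, continuous weight is usually preferred over a thresholded on/off mask.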
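
As an illustration of the Sobel and isotropic-magnitude variants, here is a small NumPy sketch. The function name, the restriction to interior pixels, and the division by 8 (so that responses are on the same scale as a per-pixel difference) are choices made for this example.

```python
import numpy as np

def sobel_gradients(gray):
    """Sobel x/y responses on the interior of gray (H, W) -> two (H-2, W-2) arrays."""
    kx = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]]) / 8.0
    ky = kx.T
    H, W = gray.shape
    gx = np.zeros((H - 2, W - 2))
    gy = np.zeros((H - 2, W - 2))
    # Cross-correlate by accumulating shifted windows of the image.
    for a in range(3):
        for b in range(3):
            window = gray[a:a + H - 2, b:b + W - 2]
            gx += kx[a, b] * window
            gy += ky[a, b] * window
    return gx, gy

gray = np.zeros((5, 6))
gray[:, 3:] = 1.0                           # vertical step edge

gx, gy = sobel_gradients(gray)
magnitude = np.sqrt(gx ** 2 + gy ** 2)      # isotropic gradient magnitude
weight = np.exp(-magnitude)                 # could replace the separate x/y weights
```

On this step edge the horizontal response is 0.5 in the two columns straddling the edge and the vertical response is zero everywhere.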



##  Convention Used in Most Papers (e.g. SfMLearner, Monodepth2)

| **Frame**                    | **Role**                                                          |
| ---------------------------- | ----------------------------------------------------------------- |
| $I_i$                        | **Target frame** → you predict depth $D_i$ for this frame         |
| $I_{i+1}$ (and/or $I_{i-1}$) | **Reference frame(s)** → you warp them into frame $i$’s viewpoint |

---

##  What Happens Step by Step

1. **Depth Prediction:**
   $D_i = \text{DepthNet}(I_i)$
   → per-pixel depth map **for frame $i$**.

2. **Pose Prediction:**
   $(R, \mathbf{t}) = \text{PoseNet}(I_i, I_{i+1})$
   → relative transform $T_{i \rightarrow i+1}$ (from frame $i$ to frame $i+1$).

3. **Back-Project:**
   Use $D_i$ and camera intrinsics $K$ to get 3D points $P_i$ in frame $i$.

4. **Transform:**
   Move points to frame $i+1$:
   $P_{i+1} = T_{i \rightarrow i+1} \cdot P_i$.

5. **Project:**
   Project $P_{i+1}$ to 2D using intrinsics $K$ → get pixel coords $(u_{i+1}, v_{i+1})$.

6. **Sample:**
   Bilinear sample $I_{i+1}$ at these coords → reconstructed image $\hat{I}_i$.

7. **Loss:**
   Compare $I_i$ (target) and $\hat{I}_i$:

   $$
   \mathcal{L}_{\text{photo}} = \frac{1}{N}\sum |I_i - \hat{I}_i|
   $$
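
Steps 3 to 7 can be sketched in NumPy for a single grayscale image. This is an illustrative version only: the function name, the border clamping, and the toy intrinsics are assumptions, and a real pipeline would also mask pixels that project outside the image.

```python
import numpy as np

def inverse_warp(src, depth, K, T):
    """Reconstruct the target view by sampling src (H, W) at reprojected pixels.

    depth: (H, W) depth of the target frame; K: (3, 3) intrinsics;
    T: (4, 4) transform taking target-frame points into the source frame.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)

    P = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)   # back-project to 3D
    P_src = T[:3, :3] @ P + T[:3, 3:4]                    # move into the source frame
    proj = K @ P_src                                      # project to pixels
    us = np.clip(proj[0] / proj[2], 0, W - 1)             # clamp to the image border
    vs = np.clip(proj[1] / proj[2], 0, H - 1)

    # Bilinear sampling of the source image.
    u0, v0 = np.floor(us).astype(int), np.floor(vs).astype(int)
    u1, v1 = np.minimum(u0 + 1, W - 1), np.minimum(v0 + 1, H - 1)
    au, av = us - u0, vs - v0
    out = ((1 - au) * (1 - av) * src[v0, u0] + au * (1 - av) * src[v0, u1]
           + (1 - au) * av * src[v1, u0] + au * av * src[v1, u1])
    return out.reshape(H, W)

rng = np.random.default_rng(0)
H, W = 6, 8
src = rng.uniform(size=(H, W))                  # stands in for I_{i+1}
depth = np.full((H, W), 2.0)                    # constant depth for the toy case
K = np.array([[10.0, 0.0, 3.5],
              [0.0, 10.0, 2.5],
              [0.0, 0.0, 1.0]])                 # toy intrinsics

recon_id = inverse_warp(src, depth, K, np.eye(4))
photo_loss = np.abs(src - recon_id).mean()      # L1 photometric error

T_shift = np.eye(4)
T_shift[0, 3] = 0.2                             # fx * tx / Z = 1 pixel shift
recon_shift = inverse_warp(src, depth, K, T_shift)
```

With the identity pose the reconstruction reproduces the source image and the photometric error is zero. With the small sideways translation, every pixel samples its right-hand neighbor, which is the one-pixel shift predicted by $f_x t_x / Z$.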

---

##  Intuition

* **Target frame:** $I_i$ — you are trying to reproduce this image.
* **Reference frame:** $I_{i+1}$ — you are "borrowing" its pixels, warping them into $i$’s viewpoint.
* If reconstruction is good, $I_i \approx \hat{I}_i$.

---

##  You Can Also Swap

You can just as well make $I_{i+1}$ the target and $I_i$ the reference — as long as you're consistent.
But by convention:

* **Depth is always predicted for the target frame.**
* **Reference frames are the ones you warp into the target’s viewpoint.**

---

### TL;DR

**With two consecutive frames $I_i, I_{i+1}$:**

* **Target frame:** $I_i$ (depth $D_i$ is predicted for this one)
* **Reference frame:** $I_{i+1}$ (warped into frame $i$’s viewpoint to create $\hat{I}_i$)
* **Loss computed between:** $I_i$ and $\hat{I}_i$

