# What is a cost volume?

For each target pixel $(u,v)$ and a set of hypothesized depths $\{d_i\}_{i=1}^D$, you measure how well that pixel matches across nearby frames when you **warp** those frames as if the pixel were at depth $d_i$. Stack those per-depth match costs into a 3D tensor $C \in \mathbb{R}^{D\times H\times W}$. A network regularizes $C$ and reads out depth (and sometimes confidence).


<figure>
  <img
    src="images/Diagram-showing-how-the-cost-volume-is-constructed.-1024x570.jpg"
    alt="cost volume" />
  <figcaption>Cost volume construction (image courtesy of LearnOpenCV)</figcaption>
</figure>


# How you build it (plane-sweep, differentiable)

1. **Features**
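   Extract multi-scale features $F_t, F_s \in \mathbb{R}^{C\times H\times W}$ from target $I_t$ and source $I_s$ (ResNet, ConvNeXt, or a lightweight encoder). In this superscript $C$ is the feature channel count, not the cost volume $C$ defined above.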

2. **Depth (usually inverse-depth) hypotheses**
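   Choose $D$ planes $z_i$ between the near and far bounds. Spacing them uniformly in inverse depth $1/z$ (or in log depth) generally works better than uniform depth spacing, because it puts more planes close to the camera, where a given depth change moves the projected pixel the most.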
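
As a minimal sketch of the two spacings mentioned above (the function names and the 0.3–30 m range are illustrative assumptions, not part of any library):

```python
import math
import torch

def inverse_depth_planes(near: float, far: float, num_planes: int) -> torch.Tensor:
    """Metric depths spaced uniformly in inverse depth, ordered near -> far."""
    inv = torch.linspace(1.0 / near, 1.0 / far, num_planes)  # uniform in 1/z
    return 1.0 / inv

def log_depth_planes(near: float, far: float, num_planes: int) -> torch.Tensor:
    """Metric depths spaced uniformly in log depth, ordered near -> far."""
    return torch.exp(torch.linspace(math.log(near), math.log(far), num_planes))

planes = inverse_depth_planes(0.3, 30.0, 64)
gaps = planes[1:] - planes[:-1]  # gaps widen with distance: dense near, sparse far
```

Both functions return metric depths, so the result can be fed directly to the back-projection step; only the sampling is done in inverse or log space.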

3. **Warp source features per plane**
   For each depth $z_i$ and pixel $p=(u,v)$:

* Back-project: $\mathbf{X} = z_i K^{-1}\tilde{p}$
* Transform: $\mathbf{X}_s = T_{t\rightarrow s}\mathbf{X}$
* Project: $p' \sim K\,\mathbf{X}_s$
* Sample: $F_s^{(i)}(p) = \text{bilinear\_sample}(F_s, p')$
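
To make the three geometric steps concrete, here is a small single-pixel sketch. The intrinsics, the 0.1 m sideways baseline, and the helper name `warp_pixel` are made-up values for illustration. With no rotation and a pure x-translation, the pixel should shift horizontally by roughly $f\,b/z = 50/z$ pixels, which is the familiar stereo disparity relation.

```python
import torch

def warp_pixel(p, z, K, T_ts):
    """Map target pixel p=(u, v), hypothesized at depth z, into the source image."""
    p_h = torch.tensor([p[0], p[1], 1.0])      # homogeneous pixel
    X = z * (torch.linalg.inv(K) @ p_h)        # back-project into the target frame
    X_s = T_ts[:3, :3] @ X + T_ts[:3, 3]       # rigid transform into the source frame
    q = K @ X_s                                # project with the same intrinsics
    return q[:2] / q[2]

K = torch.tensor([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
T_ts = torch.eye(4)
T_ts[0, 3] = -0.1  # source camera sits 0.1 m to the right of the target: X_s = X - 0.1

for z in (1.0, 2.0, 10.0):
    print(z, warp_pixel((320.0, 240.0), z, K, T_ts))
```

Each depth hypothesis lands on a different source pixel along the same row; the plane sweep samples source features at exactly these locations.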

4. **Compute matching cost** (several good options)

* **L1/L2 distance:** $C_i(u,v) = \| F_t(u,v) - F_s^{(i)}(u,v)\|_1$ (or the squared $\ell_2$ norm)
* **Dot-product correlation:** $C_i(u,v) = -\langle F_t, F_s^{(i)}\rangle$
* **Groupwise correlation:** split channels into $G$ groups, correlate per group, then average (cheap & strong).
* **Variance/mean aggregation** across multiple sources $\{s\}$: per-plane mean/variance of warped features.

Stack over depths → **raw cost volume** $C$.
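
The matching options above can be sketched for a single depth plane as follows. This is an illustrative sketch with hypothetical helper names; `f_sw` stands for source features already warped to the plane. Note that correlation is a similarity (higher is better), so it needs to be negated, or left to the regularizer to interpret, before a softmax over negative cost.

```python
import torch

def l1_cost(f_t, f_sw):
    """L1 feature distance per pixel. f_t, f_sw: [B,C,H,W] -> [B,H,W]."""
    return (f_t - f_sw).abs().sum(dim=1)

def groupwise_correlation(f_t, f_sw, groups=8):
    """Per-group mean dot product. [B,C,H,W] -> [B,G,H,W] similarity."""
    B, C, H, W = f_t.shape
    assert C % groups == 0, "channel count must be divisible by the number of groups"
    prod = (f_t * f_sw).reshape(B, groups, C // groups, H, W)
    return prod.mean(dim=2)

def variance_cost(feats):
    """Variance across views. feats: list of [B,C,H,W] (target + warped sources)."""
    stack = torch.stack(feats, dim=0)          # [V,B,C,H,W]
    mean = stack.mean(dim=0, keepdim=True)
    return ((stack - mean) ** 2).mean(dim=0)   # [B,C,H,W], zero when all views agree
```

Averaging `groupwise_correlation(...)` over its group dimension gives one scalar per pixel and plane; keeping the `G` channels instead gives the regularizer a richer input at the price of a `G`-times larger volume.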

5. **Regularize (denoise, complete, enforce smooth structure)**

* **3D CNN encoder–decoder (hourglass)** over $(D,H,W)$ (classic in stereo/MVS).
* **2D CNN + guided aggregation/attention** (lighter memory).
* **Recurrent/transformer** updates with a **correlation pyramid** (RAFT/DROID-style).
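
For the 3D CNN option, a deliberately tiny encoder-decoder shows the tensor handling involved. `TinyHourglass3D` is a hypothetical stand-in, far smaller than the hourglass networks used in practice, and it assumes $D$, $H$, and $W$ are even so the upsampled tensor matches the skip connection.

```python
import torch
import torch.nn as nn

class TinyHourglass3D(nn.Module):
    """One down/up stage with a skip connection over a [B,D,H,W] cost volume."""

    def __init__(self, ch: int = 8):
        super().__init__()
        self.inp = nn.Sequential(nn.Conv3d(1, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.Sequential(
            nn.Conv3d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU(inplace=True)
        )
        self.up = nn.ConvTranspose3d(2 * ch, ch, 3, stride=2, padding=1, output_padding=1)
        self.out = nn.Conv3d(ch, 1, 3, padding=1)

    def forward(self, cost: torch.Tensor) -> torch.Tensor:
        x = self.inp(cost.unsqueeze(1))        # [B,ch,D,H,W]
        y = self.up(self.down(x))              # half resolution and back
        return self.out(torch.relu(x + y)).squeeze(1)  # [B,D,H,W]

net = TinyHourglass3D()
regularized = net(torch.randn(1, 8, 6, 10))    # same shape as the input volume
```

The memory cost is visible even here: every intermediate activation is a 5D tensor, which is why the lighter 2D and recurrent alternatives exist.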

6. **Depth readout**

* **Softmax along depth**: $P_i(u,v)=\text{softmax}(-C)_i$
* **Soft-argmin**: $\hat{z}(u,v)=\sum_i P_i(u,v)\,z_i$ (use inverse-depth to reduce bias).
* Output optional **confidence** (entropy of $P$ or learned head).
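
A minimal readout sketch, assuming the regularized volume is a cost (lower is better) and `depths` holds metric plane depths. The expectation is taken over inverse depth, as suggested above, and the confidence is one minus the normalized entropy; that normalization is one possible choice, not the only one.

```python
import math
import torch

def depth_readout(cost: torch.Tensor, depths: torch.Tensor, eps: float = 1e-8):
    """cost: [B,D,H,W]; depths: [D]. Returns depth [B,H,W] and confidence in [0,1]."""
    P = torch.softmax(-cost, dim=1)                               # probability over planes
    inv_depth = (P * (1.0 / depths).view(1, -1, 1, 1)).sum(dim=1)  # soft-argmin in 1/z
    z_hat = 1.0 / inv_depth
    entropy = -(P * torch.log(P + eps)).sum(dim=1)
    confidence = 1.0 - entropy / math.log(cost.shape[1])          # 1 = peaked, 0 = uniform
    return z_hat, confidence
```

A sharply peaked distribution returns the depth of the winning plane with confidence near 1, while a flat cost profile gives confidence near 0.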

# Where pose comes from in self-supervised Depth+Pose

* **Option 1 (common):** a small PoseNet predicts $T_{t\rightarrow s}$. You **use that pose** to build the cost volume for depth, then train end-to-end with photometric (L1 + SSIM) and smoothness losses computed from the predicted depth.
* **Option 2 (tighter geometry):** alternate/refine pose via a differentiable PnP/BA block using correspondences implied by the current depth/posterior $P$. (Heavier but more accurate.)

# Training losses that pair well

* **Photometric reconstruction:** $\alpha\frac{1-\text{SSIM}(I_t,\hat{I}_t)}{2} + (1-\alpha)\|I_t-\hat{I}_t\|_1$, where $\hat{I}_t$ is source warped with $\hat{z}$ and pose.
* **Edge-aware smoothness** on inverse depth.
* **Multi-view min-reprojection** to handle occlusions.
* **Pose/geodesic regularization** (optional).
* **Depth priors** if you have ToF/RGB-D: narrow-band volume around measured depth; penalize deviation within sensor confidence.
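
The first three losses can be sketched as below. This is a simplified illustration: the SSIM uses 3×3 average pooling with zero padding, `alpha=0.85` is an assumed (commonly used) weight, and `warped_sources` stands for source images already warped into the target view with the predicted depth and pose.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM. x, y: [B,3,H,W] in [0,1]."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def photometric_error(I_t, I_hat, alpha=0.85):
    """alpha * (1 - SSIM) / 2 + (1 - alpha) * L1, per pixel: [B,1,H,W]."""
    ssim_term = ((1 - ssim(I_t, I_hat)) / 2).clamp(0, 1).mean(dim=1, keepdim=True)
    l1_term = (I_t - I_hat).abs().mean(dim=1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1_term

def min_reprojection_loss(I_t, warped_sources):
    """Per-pixel minimum over sources, so an occluded view does not dominate."""
    errs = torch.cat([photometric_error(I_t, I_hat) for I_hat in warped_sources], dim=1)
    return errs.min(dim=1).values.mean()

def edge_aware_smoothness(inv_depth, image):
    """inv_depth: [B,1,H,W]; image: [B,3,H,W]. Gradients are down-weighted at image edges."""
    d = inv_depth / (inv_depth.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (d[..., :, 1:] - d[..., :, :-1]).abs()
    dy = (d[..., 1:, :] - d[..., :-1, :]).abs()
    ix = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(dim=1, keepdim=True)
    iy = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(dim=1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```

Taking the minimum before averaging is what lets a pixel that is occluded in one source still be supervised by another source where it is visible.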

# Variants & design choices

* **Stereo only:** for rectified pairs the cost volume is built from 1D horizontal shifts, which is very efficient (PSMNet/GA-Net style); use disparity bins in place of depth planes.
* **Temporal monocular (your case for self-sup):** plane-sweep with predicted pose between adjacent frames; inverse-depth bins.
* **Multi-view (MVS):** aggregate costs from many neighbors with per-plane mean/variance; strong for wide baselines.
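* **Correlation pyramids (RAFT-like):** store all-pairs correlations at multiple scales and iterate updates with a GRU. This avoids 3D convolutions and is very accurate, but the all-pairs volume grows quadratically with the number of pixels, so it is computed at reduced resolution.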
* **Groupwise correlation:** big win on speed/accuracy trade-off; keep it as your default matcher.
* **3D vs 2D regularization:** 3D CNNs are powerful but memory-heavy; 2D with smart aggregation or recurrent updates often suffices.
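
For the stereo variant, the "warp" reduces to a horizontal shift of the right feature map. The sketch below is a plain correlation volume under the assumption of rectified images with the left image as reference; it is not the PSMNet or GA-Net construction, and the function name is illustrative.

```python
import torch

def stereo_cost_volume(f_left, f_right, max_disp):
    """f_left, f_right: [B,C,H,W] rectified features -> cost [B,max_disp,H,W]."""
    B, C, H, W = f_left.shape
    cost = f_left.new_zeros(B, max_disp, H, W)   # columns with no valid match stay at 0
    for d in range(max_disp):
        # left pixel x is compared with right pixel x - d
        corr = (f_left[..., d:] * f_right[..., : W - d]).mean(dim=1)
        cost[:, d, :, d:] = -corr                # negate: similarity -> cost
    return cost
```

No intrinsics or pose are needed inside the loop, which is where the efficiency of the stereo case comes from.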

# Practical settings that work

* **Depth range:** pick near/far from your camera (e.g., 0.3–30 m); **inverse-depth** spacing.
* **Pyramids:** at 1/8 or 1/4 scale use $D=32\!-\!64$ planes; refine at higher res in a **coarse-to-fine** scheme with a **narrow-band** (±3–6 planes) around the upsampled estimate.
* **Occlusions:** per-plane z-buffer (keep nearest), or **min-reprojection** across sources; add auto-mask for moving objects.
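* **Numerics:** scale the intrinsics to the feature-map resolution; pick one `align_corners` setting for `torch.nn.functional.grid_sample` and normalize coordinates to match it everywhere (with `align_corners=True`, $-1$ and $1$ map to the centers of the first and last pixels); mask out-of-bounds warps; use mixed precision and gradient checkpointing to save memory.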
* **Confidence:** use entropy of $P$ to down-weight photometric residuals; improves robustness.
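
Two of these settings are easy to get wrong, so here is a hedged sketch of each. Both helper names are hypothetical. `scale_intrinsics` assumes pixel coordinates scale linearly with resolution and ignores any half-pixel offset convention; `narrow_band_hypotheses` assumes the coarse estimate is expressed in inverse depth with a fixed plane step.

```python
import torch

def scale_intrinsics(K: torch.Tensor, scale: float) -> torch.Tensor:
    """Scale a [3,3] pinhole matrix for a feature map at `scale` of image resolution."""
    K_s = K.clone()
    K_s[:2] *= scale        # fx, fy, cx, cy shrink together; the last row stays [0, 0, 1]
    return K_s

def narrow_band_hypotheses(inv_depth_up: torch.Tensor, step: float, radius: int = 4):
    """Per-pixel planes around an upsampled coarse estimate.

    inv_depth_up: [B,H,W] -> [B, 2*radius+1, H, W] inverse-depth hypotheses.
    """
    offsets = torch.arange(-radius, radius + 1, dtype=inv_depth_up.dtype,
                           device=inv_depth_up.device) * step
    hyp = inv_depth_up.unsqueeze(1) + offsets.view(1, -1, 1, 1)
    return hyp.clamp(min=1e-6)   # keep inverse depth positive
```

With `radius=4` the fine level evaluates 9 planes per pixel instead of 32 to 64, at the cost of the hypotheses now differing per pixel, so the warp can no longer share one plane across the whole image.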

# Minimal pseudo-PyTorch sketch (core ideas)

```python
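import torch
import torch.nn.functional as F

# F_t, F_s: [B,C,H,W] features; depths: [D] metric depths (sample them uniformly in inverse depth)
# K: [B,3,3] intrinsics scaled to (H,W); T_ts: [B,4,4] relative pose t->s
def plane_sweep_cost(F_t, F_s, depths, K, T_ts, G=8):
    B, C, H, W = F_t.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)]).reshape(1, 3, -1).to(F_t)  # [1,3,HW]
    rays = torch.linalg.inv(K) @ pix                                  # [B,3,HW]
    scale = torch.tensor([W - 1, H - 1]).to(F_t).view(1, 2, 1)
    planes = []
    for z in depths:  # vectorize in practice
        # backproject target pixels at depth z, transform to source, project
        X_s = T_ts[:, :3, :3] @ (z * rays) + T_ts[:, :3, 3:]          # [B,3,HW]
        p = K @ X_s
        uv = p[:, :2] / p[:, 2:].clamp(min=1e-6)                      # [B,2,HW] source pixels
        grid = (2 * uv / scale - 1).permute(0, 2, 1).reshape(B, H, W, 2)
        F_sw = F.grid_sample(F_s, grid, align_corners=True)           # [B,C,H,W], zeros out of bounds
        # groupwise correlation (G groups), averaged over groups and negated into a cost
        corr = (F_t * F_sw).reshape(B, G, C // G, H, W).mean(dim=2)   # [B,G,H,W]
        planes.append(-corr.mean(dim=1))                              # [B,H,W]
    return torch.stack(planes, dim=1)                                 # [B,D,H,W]

# 3D regularization; hourglass3d is a placeholder for a network mapping [B,1,D,H,W] -> [B,1,D,H,W]
cost = plane_sweep_cost(F_t, F_s, depths, K, T_ts)
R = hourglass3d(cost.unsqueeze(1)).squeeze(1)                         # [B,D,H,W]
P = torch.softmax(-R, dim=1)                                          # prob over depth
z_hat = (P * depths.view(1, -1, 1, 1)).sum(dim=1)                     # [B,H,W]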
```

# How this fits your plan (Depth+Pose self-sup)

1. Target & 2–3 neighbor frames → PoseNet for $T_{t\rightarrow s}$.
2. Build **inverse-depth** cost volume via plane sweep.
3. 3D (or 2D+recurrent) regularization → **probability volume** → depth.
4. Losses: photometric L1 + SSIM (min over sources) + smoothness (+ pose reg).
5. Optional: small pose refinement block using current depth/confidence.

# If you have ToF/RGB-D later

Use the sensor depth as:

* **Narrow-band anchor:** build the volume only around $z_\text{ToF} \pm \delta$.
* **Depth completion:** fuse ToF depth into the 3D regularizer (concat a confidence map).
* **Hard constraints:** penalize deviation where ToF is confident; ignore where invalid.
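
The last bullet can be sketched as a masked, confidence-weighted L1 term. Two assumptions are baked in and should be adapted to your sensor: invalid ToF pixels are encoded as 0 or non-finite values, and `conf` is a per-pixel weight in $[0,1]$. The function name is hypothetical.

```python
import torch

def tof_depth_penalty(z_hat: torch.Tensor, z_tof: torch.Tensor, conf: torch.Tensor):
    """z_hat, z_tof, conf: [B,H,W]. Weighted mean |z_hat - z_tof| over valid ToF pixels."""
    valid = (z_tof > 0) & torch.isfinite(z_tof)
    w = conf * valid                                   # zero weight where ToF is invalid
    target = torch.where(valid, z_tof, z_hat)          # avoid NaNs leaking into the loss
    err = (z_hat - target).abs()
    return (w * err).sum() / w.sum().clamp(min=1e-6)
```

Replacing invalid targets with the prediction itself, rather than only multiplying by a zero weight, matters because `0 * nan` is still `nan`.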

Ref: [1](https://www.youtube.com/watch?v=lBFgNyz5JpU)

---