# KITTI Dataset: Visual Odometry / SLAM Evaluation

1. [Download odometry data set (grayscale, 22 GB)](https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_gray.zip)
2. [Download odometry data set (color, 65 GB)](https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_color.zip)
3. [Download odometry data set (velodyne laser data, 80 GB)](https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_velodyne.zip)
4. [Download odometry data set (calibration files, 1 MB)](https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_calib.zip)
5. [Download odometry ground truth poses (4 MB)](https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_poses.zip)



## Sensor setup 
<img src="images/setup_top_view.png" />

<img src="images/passat_sensors_920.png" />




## Calibration Files and Projection Matrices

To get the calibration data, run:
```
python kitti_calibration.py
```



- $P0$: Reference camera (left of stereo pair 1), extrinsics are identity.
- $P1$: Right camera of stereo pair 1, extrinsics include baseline offset.
- $P2$: Left camera of stereo pair 2, extrinsics depend on setup.
- $P3$: Right camera of stereo pair 2, extrinsics depend on setup.


---

Camera: $P0$:

```
Projection Matrix:
[[707.0912   0.     601.8873   0.    ]
 [  0.     707.0912 183.1104   0.    ]
 [  0.       0.       1.       0.    ]]
Intrinsic Matrix:
[[707.0912   0.     601.8873]
 [  0.     707.0912 183.1104]
 [  0.       0.       1.    ]]
Rotation Matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Translation Vector:
[[0.]
 [0.]
 [0.]]
```
---

Camera: $P1$:
```
Projection Matrix:
[[ 707.0912    0.      601.8873 -379.8145]
 [   0.      707.0912  183.1104    0.    ]
 [   0.        0.        1.        0.    ]]
Intrinsic Matrix:
[[707.0912   0.     601.8873]
 [  0.     707.0912 183.1104]
 [  0.       0.       1.    ]]
Rotation Matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Translation Vector:
[[ 5.37150653e-01]
 [-1.34802944e-17]
 [ 0.00000000e+00]]
```

In the sensor setup image above, the two cameras of a stereo pair are `0.54` m apart along the $x$ axis, and the decomposition gives `5.37150653e-01`, which agrees with that value.
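The baseline can also be read directly from the projection matrix, without a full decomposition. For a rectified pair, the first entry of the fourth column of $P1$ equals $-f_x \cdot b$, so $b = -P1[0,3] / f_x$. A minimal sketch using the values printed above (the variable names are illustrative and not taken from `kitti_calibration.py`):

```python
import numpy as np

# P1 as printed above (right camera of stereo pair 1)
P1 = np.array([
    [707.0912, 0.0, 601.8873, -379.8145],
    [0.0, 707.0912, 183.1104, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

fx = P1[0, 0]
# For a rectified pair, P1[0, 3] = -fx * baseline
baseline = -P1[0, 3] / fx
print(f"baseline = {baseline:.4f} m")
```

The result should be close to both the `0.54` m from the setup image and the decomposed value.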

Refs: [1](https://www.cvlibs.net/datasets/kitti/setup.php), [2](https://stackoverflow.com/questions/29407474/how-to-understand-the-kitti-camera-calibration-files), [3](https://github.com/yanii/kitti-pcl/blob/master/KITTI_README.TXT), [4](https://www.cvlibs.net/datasets/kitti/eval_odometry.php), [5](https://github.com/avisingh599/mono-vo/), [6](https://github.com/alishobeiri/Monocular-Video-Odometery), [7](https://avisingh599.github.io/vision/monocular-vo/)


## Ground Truth Poses
Each row of a pose file has 12 values, obtained by flattening (row by row) the `3x4` transformation matrix `[R | t]` of the left camera:

```
r11 r12 r13 tx r21 r22 r23 ty r31 r32 r33 tz
```
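A small sketch of how one such row can be turned back into a `4x4` homogeneous matrix. The function name `parse_kitti_pose` and the example row are made up for illustration:

```python
import numpy as np

def parse_kitti_pose(line):
    """Turn one row of a KITTI pose file into a 4x4 homogeneous matrix."""
    values = np.array([float(v) for v in line.split()])
    if values.size != 12:
        raise ValueError(f"expected 12 values, got {values.size}")
    T = np.eye(4)
    T[:3, :4] = values.reshape(3, 4)  # row-major layout: [R | t]
    return T

# Hypothetical row: identity rotation, translation (0.5, 0.0, 2.0)
example_row = "1 0 0 0.5 0 1 0 0.0 0 0 1 2.0"
T_example = parse_kitti_pose(example_row)
print(T_example)
```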





## Display Ground Truth Poses in rerun 
just run: 

```
python kitti_gt_to_rerun.py
```


<img src="images/display_ground_truth_poses_rerun.png" />


## Stereo Vision
just run:
```
python kitti_stereo.py
```


## Reconstruct Sparse/Dense Model From Known Camera Poses with COLMAP

Your data should have the following structure: 

```
├── database.db
├── dense
│   ├── refined
│   │   └── model
│   │       └── 0
│   └── sparse
│       └── model
│           └── 0
├── images
│   ├── 000000.png
│   ├── 000001.png
│   ├── 000002.png
│   └── 000003.png
└── sparse
    └── model
        └── 0
            ├── cameras.txt
            ├── images.txt
            └── points3D.txt
```

1. `cameras.txt`: the format is:

```
CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]
```
For the KITTI dataset the camera model is `PINHOLE`, which takes four parameters: the focal lengths (`fx`, `fy`) and the principal point coordinates (`cx`, `cy`).

- `CAMERA_ID`: 1
- `MODEL`: PINHOLE
- `WIDTH`: 1226
- `HEIGHT`: 370
- `fx`: 707.0912
- `fy`: 707.0912
- `cx`: 601.8873
- `cy`: 183.1104

The resulting line should look like this:

```
1 PINHOLE 1226 370 707.0912 707.0912 601.8873 183.1104
```

2. `images.txt`: the format is
```
IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
```

Your data should look like this. Note the empty line after each image line: COLMAP expects a second line per image listing its 2D points, which is left empty here:

```
1 1.0 0.0 0.0 0.0 0.031831570484910754 -0.2020180259287443 -0.05988511865826446 1 000000.png

2 0.9999990698095921 -0.000486454947446343 0.0008155417501438222 -0.0009790981505847082 -0.026717887515950233 -0.09385561937368328 -0.38812196090339146 1 000001.png

3 0.9999976159395401 -0.0011567120445530273 0.0013793515824379724 -0.0012359294859380324 -0.23100950491953082 -0.05900910756124116 -0.9698261247623092 1 000002.png

4 0.9999950283825452 -0.0017604272641239351 0.0022926784138869423 -0.0012600522730534293 0.17578254454768152 -0.014474209460539546 -1.9112790713853196 1 000003.png
```
and finally:

3. `points3D.txt`: This file should be empty.

To compare against a reference, you can convert an existing COLMAP model to TXT with the following command:

```
colmap model_converter --input_path $DATASET_PATH/sparse/0 --output_path $DATASET_PATH/ --output_type TXT
```

KITTI format for ground truth poses (for instance, for the file `data/kitti/odometry/05/poses/05.txt`) is:

```
r11 r12 r13 tx r21 r22 r23 ty r31 r32 r33 tz
```
The colmap format for `images.txt` is: 

```
IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
```

Run the script [kitti_to_colmap.py](../scripts/kitti/kitti_to_colmap.py). It dumps the output into `images.txt` file. 
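The core of such a conversion is short. KITTI poses map camera coordinates into the frame of the first camera (camera-to-world), whereas COLMAP's `images.txt` stores the world-to-camera rotation (as a `QW QX QY QZ` quaternion) and translation, so each pose has to be inverted first. The sketch below only illustrates that step; it is not the contents of `kitti_to_colmap.py`, and all function names are made up for the example.

```python
import numpy as np

def rotation_to_quaternion(R):
    """Rotation matrix -> unit quaternion (w, x, y, z) with w >= 0."""
    trace = np.trace(R)
    q = [0.0, 0.0, 0.0, 0.0]
    if trace > 0.0:
        s = 2.0 * np.sqrt(trace + 1.0)  # s = 4 * w
        q[0] = 0.25 * s
        q[1] = (R[2, 1] - R[1, 2]) / s
        q[2] = (R[0, 2] - R[2, 0]) / s
        q[3] = (R[1, 0] - R[0, 1]) / s
    else:
        # Use the largest diagonal entry for numerical stability
        i = int(np.argmax(np.diag(R)))
        j, k = (i + 1) % 3, (i + 2) % 3
        s = 2.0 * np.sqrt(1.0 + R[i, i] - R[j, j] - R[k, k])
        q[0] = (R[k, j] - R[j, k]) / s
        q[1 + i] = 0.25 * s
        q[1 + j] = (R[j, i] + R[i, j]) / s
        q[1 + k] = (R[k, i] + R[i, k]) / s
    q = np.array(q)
    if q[0] < 0:
        q = -q
    return q / np.linalg.norm(q)

def kitti_pose_to_colmap(T_cam_to_world):
    """Invert a 4x4 camera-to-world pose into COLMAP's (quaternion, translation)."""
    R_c2w = T_cam_to_world[:3, :3]
    t_c2w = T_cam_to_world[:3, 3]
    R_w2c = R_c2w.T
    t_w2c = -R_w2c @ t_c2w
    return rotation_to_quaternion(R_w2c), t_w2c

def colmap_image_line(image_id, T_cam_to_world, name, camera_id=1):
    """Format one pose line of images.txt (the following line stays empty)."""
    q, t = kitti_pose_to_colmap(T_cam_to_world)
    numbers = [f"{v:.12g}" for v in (*q, *t)]
    return " ".join([str(image_id)] + numbers + [str(camera_id), name])

# Hypothetical pose: no rotation, camera located at (1, 2, 3) in the world
T_demo = np.eye(4)
T_demo[:3, 3] = [1.0, 2.0, 3.0]
print(colmap_image_line(1, T_demo, "000000.png"))
```

When writing the file, each returned line should be followed by an empty line, as described above.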


You can run the following script to add noise: [kitti_to_colmap_noise.py](../scripts/kitti/kitti_to_colmap_noise.py).


Inside `~/colmap_projects/kitti_noisy`, create a symbolic link named `images` that points to the KITTI image directory: `ln -s <path-to-kitti-odometry-image> images`.

For example:

```
 ln -s /home/$USER/workspace/OpenCVProjects/data/kitti/odometry/05/image_0/ images
```

## 1) Self-supervised monocular VO (Depth + Pose)


* **DepthNet (single image → depth)**

  * **Encoder**: `ResNet18` or `MobileNetV2` (tiny & fast). If you want to play with transformers, you *can* try `swin_tiny` as encoder, but start with ResNet18 first.
  * **Decoder**: 4–5 upsampling stages with skip connections from the encoder; bilinear upsample + 3×3 convs; edge-aware smoothness.
* **PoseNet (2 or 3 frames → 6-DoF)**

  * Tiny CNN (e.g., 5 conv layers with stride) on **frame pairs/triplets concatenated along channels**; global average pool → FC(6) for relative pose (axis-angle + translation).

**Losses (no GT depth/pose):**

* Photometric reprojection (L1 + SSIM) between target and source, using depth + predicted pose + known intrinsics.
* Min-reprojection across multiple sources.
* Auto-mask static pixels; optionally per-pixel explainability mask.
* Edge-aware smoothness on inverse depth.
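As an illustration of the last item, here is a minimal NumPy sketch of an edge-aware smoothness term in the style commonly used by this family of methods: disparity gradients are down-weighted where the image itself has strong gradients. The function name and the single-image `(H, W)` layout are assumptions made for the example; a training version would be written with batched PyTorch tensors.

```python
import numpy as np

def edge_aware_smoothness(disp, image):
    """disp: (H, W) disparity, image: (H, W, 3) with values in [0, 1]."""
    # Mean-normalise so the loss cannot be reduced by shrinking disparity globally
    disp = disp / (disp.mean() + 1e-7)
    d_dx = np.abs(disp[:, 1:] - disp[:, :-1])
    d_dy = np.abs(disp[1:, :] - disp[:-1, :])
    # Image gradients, averaged over colour channels
    i_dx = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    i_dy = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
    # Disparity edges are cheap where the image also has an edge
    return (d_dx * np.exp(-i_dx)).mean() + (d_dy * np.exp(-i_dy)).mean()

# Toy check: a disparity step costs less when it coincides with an image edge
disp_step = np.ones((4, 8))
disp_step[:, 4:] = 2.0
flat_image = np.zeros((4, 8, 3))
edge_image = flat_image.copy()
edge_image[:, 4:] = 1.0
loss_on_flat = edge_aware_smoothness(disp_step, flat_image)
loss_on_edge = edge_aware_smoothness(disp_step, edge_image)
```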

**Why this first?** It’s the classic “SfM-Learner / Monodepth2 family”—easy to fit, and you’ll learn view synthesis, warping, Jacobians, and all the VO basics.

## 2) Supervised depth (optional add-on)

If you want clean depth numbers without the VO loop, swap the photometric loss for **L1 + scale-invariant log loss** against GT depth on a dataset that has it (e.g., NYU-v2, KITTI depth). You can still keep PoseNet for pose-consistency regularization.

---


## Shapes, sizes, and VRAM knobs

* **Input res**: 128×416 or 192×640 (sweet spot).
* **Batch**: start with **2** (or **1** + gradient accumulation).
* **AMP**: `autocast` + `GradScaler` on.
* **Pose frames**: use **(t, t±1)** or a triplet (t−1, t, t+1); predict T_{t→s}.
* **Depth range**: predict **inverse depth** with sigmoid and scale to [d_min, d_max].
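The depth-range item can be sketched as follows: the sigmoid output in `[0, 1]` is mapped linearly to an inverse-depth interval and then inverted. The defaults `min_depth=0.1` and `max_depth=100.0` are assumptions for the example and should be chosen per dataset.

```python
import numpy as np

def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Map a sigmoid output in [0, 1] to depth in [min_depth, max_depth]."""
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return 1.0 / scaled_disp

sigmoid_out = np.array([0.0, 0.5, 1.0])
depths = disp_to_depth(sigmoid_out)
print(depths)  # far, intermediate, near
```

Note that the mapping is linear in inverse depth, so a sigmoid output of `0.5` corresponds to a depth much closer to `min_depth` than to `max_depth`.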

---

## Minimal architectures (proven + tiny)

### DepthNet (ResNet18 encoder + UNet-style decoder)

* Encoder: torchvision `resnet18` up to layer4.
* Decoder: for each scale, `upsample ×2 → concat skip → conv(3×3)×2`.
* Output: multi-scale disparity {1/8, 1/4, 1/2, 1}, supervise each scale.

This is essentially **Monodepth2-style** and fits easily on 4 GB at 192×640 with batch 2.

### PoseNet (very small CNN)

* Input: concat two RGB frames → 6 channels (or three frames → 9 channels).
* Conv(7×7, s=2) → Conv(5×5, s=2) → Conv(3×3, s=2) × 3 → GAP → FC(6).
* Last layer init near zero; multiply by 0.01 to keep poses small at start.
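A possible PyTorch sketch of the PoseNet described above. The class name `TinyPoseNet` and the channel widths are assumptions (the text does not specify them); the layer sequence, the near-zero initialisation, and the `0.01` output scaling follow the bullets.

```python
import torch
import torch.nn as nn

class TinyPoseNet(nn.Module):
    def __init__(self, num_frames=2, widths=(16, 32, 64, 128, 256)):
        super().__init__()
        kernels = (7, 5, 3, 3, 3)  # Conv7 -> Conv5 -> Conv3 x 3, all stride 2
        layers = []
        in_ch = 3 * num_frames     # RGB frames concatenated along channels
        for out_ch, k in zip(widths, kernels):
            layers += [nn.Conv2d(in_ch, out_ch, k, stride=2, padding=k // 2),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.fc = nn.Linear(in_ch, 6)         # axis-angle (3) + translation (3)
        # Start close to the identity transform
        nn.init.normal_(self.fc.weight, std=1e-3)
        nn.init.zeros_(self.fc.bias)

    def forward(self, frames):
        feat = self.pool(self.features(frames)).flatten(1)
        return 0.01 * self.fc(feat)

# Usage with two random frames of size 192x640
I_t = torch.rand(1, 3, 192, 640)
I_s = torch.rand(1, 3, 192, 640)
pose_net = TinyPoseNet(num_frames=2)
pose_vec = pose_net(torch.cat([I_t, I_s], dim=1))  # shape (1, 6)
```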

---

## Training recipe (self-sup monocular VO, KITTI)

1. **Preprocess**: resize to 192×640, keep fx, fy, cx, cy scaled accordingly.
2. **Batching**: sample snippets (length 3) → (t−1, t, t+1).
3. **Forward**:

   * Depth_t = DepthNet(I_t).
   * For each source s∈{t−1,t+1}: T_{t→s} = PoseNet(I_t, I_s).
   * Warp I_s→t using Depth_t, T_{t→s}, K (pinhole). Compute photometric loss with SSIM+L1; take **min** over sources per pixel.
4. **Regularize**: smoothness on Depth_t (edge-aware with image gradients).
5. **Masking**: auto-mask when photometric error of identity warp < reprojection error (static scenes).
6. **Optim**: AdamW, lr 1e-4 (Depth), 1e-4 (Pose); cosine decay; weight decay 1e-4.
7. **Tricks**: random brightness/contrast jitter, random flips (careful with flips + poses).
8. **Logging**: show sample depth maps, photometric error heatmaps, and train/val losses.
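For step 1, the intrinsics have to be scaled by the same factors as the image. A minimal sketch, assuming the `1226×370` source size and the `P0` intrinsics listed earlier; the function name `resize_intrinsics` is made up, and the simple scaling ignores half-pixel offset conventions:

```python
import numpy as np

def resize_intrinsics(K, src_size, dst_size):
    """Scale a 3x3 intrinsic matrix; sizes are given as (width, height)."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    K_scaled = K.astype(np.float64).copy()
    K_scaled[0, :] *= sx  # fx and cx
    K_scaled[1, :] *= sy  # fy and cy
    return K_scaled

K_full = np.array([
    [707.0912, 0.0, 601.8873],
    [0.0, 707.0912, 183.1104],
    [0.0, 0.0, 1.0],
])
K_small = resize_intrinsics(K_full, src_size=(1226, 370), dst_size=(640, 192))
print(K_small)
```

Because the aspect ratio changes slightly, `fx` and `fy` are no longer equal after resizing.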

---

## Evaluation

* **Pose**: ATE/RPE (TUM style) on TUM or EuRoC; KITTI Odometry sequence metrics (t_rel, r_rel).
* **Depth** (if you evaluate on GT): AbsRel, SqRel, RMSE, RMSE_log, δ<1.25^n.
* **Ablations**: with/without auto-mask; with/without min reprojection; ResNet18 vs Swin-Tiny encoder.
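The depth metrics listed above can be computed as in the sketch below. It assumes ground-truth pixels without a measurement are stored as `0`, and it applies per-image median scaling because monocular predictions are only defined up to scale; the function name is illustrative.

```python
import numpy as np

def depth_metrics(gt, pred, median_scale=True):
    """Standard depth metrics on pixels with valid ground truth (gt > 0)."""
    mask = gt > 0
    gt, pred = gt[mask], pred[mask]
    if median_scale:
        pred = pred * np.median(gt) / np.median(pred)
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),
        "sq_rel": np.mean((gt - pred) ** 2 / gt),
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "a1": np.mean(ratio < 1.25),
        "a2": np.mean(ratio < 1.25 ** 2),
        "a3": np.mean(ratio < 1.25 ** 3),
    }

# Toy example: the prediction is correct up to a global scale of 3
gt_demo = np.array([[0.0, 2.0], [4.0, 8.0]])
pred_demo = np.array([[1.0, 6.0], [12.0, 24.0]])
metrics = depth_metrics(gt_demo, pred_demo)
```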

---

## If you want a Swin-based DepthNet (optional)

You can swap the encoder:

* Use `swin_tiny_patch4_window7_224` as a **feature pyramid** by tapping outputs after each stage (patch merges simulate downsampling).
* Add simple lateral 1×1 convs to map Swin stage channels to decoder widths (e.g., 256→128→64→32).
* Keep the same UNet decoder.
  Expect **~+20–30% VRAM** vs ResNet18 at the same input size. Start with **128×416** and batch 1 if needed.

---

## Concrete baselines to run on 4 GB

### Baseline A (fastest to success)

* **DepthNet**: ResNet18 encoder + UNet decoder
* **PoseNet**: tiny 6-layer CNN
* **Data**: KITTI Odometry @ **192×640**
* **Batch**: 2 (or 1 + accum 8) with **AMP**
* **Expected**: training fits comfortably; you’ll see depth form in a few epochs

### Baseline B (transformer-curious)

* **DepthNet**: Swin-Tiny encoder + light decoder
* **PoseNet**: same tiny CNN
* **Data**: KITTI Odometry @ **128×416**
* **Batch**: 1 (accum 16), **AMP**, maybe gradient checkpointing in Swin
* **Expected**: slightly better edges; slower; tighter on VRAM

---

## Handy implementation tips

* **Warper**: implement differentiable pinhole projection with `grid_sample` (mind align_corners, padding mode).
* **Scale consistency**: monocular depth is scale-ambiguous—either use median-scaling for eval or add stereo pairs (if available) for scale.
* **Camera intrinsics per-sample**: store K in your dataset class and scale with resizing.
* **Stability**: start PoseNet outputs near zero; clamp inverse depth to [0.01, 10] (scene-dependent).
* **Speed**: compute SSIM on 1/2 scale to save memory, but backprop to full-res photometric term.
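The warper from the first tip could look like the sketch below. It assumes integer pixel coordinates at pixel centres, which pairs with `align_corners=True`, and uses `padding_mode="border"`; other conventions are possible as long as they are applied consistently. The function name and tensor layouts are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(I_s, depth_t, T_t_to_s, K):
    """I_s: (B,3,H,W), depth_t: (B,1,H,W), T_t_to_s: (B,4,4), K: (B,3,3)."""
    B, _, H, W = I_s.shape
    dtype, device = depth_t.dtype, depth_t.device
    v, u = torch.meshgrid(
        torch.arange(H, dtype=dtype, device=device),
        torch.arange(W, dtype=dtype, device=device),
        indexing="ij",
    )
    # Homogeneous pixel coordinates of the target frame, shape (B, 3, H*W)
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(1, 3, -1).expand(B, -1, -1)
    # Back-project to 3D points in the target camera
    cam_t = (torch.linalg.inv(K) @ pix) * depth_t.reshape(B, 1, -1)
    # Move the points into the source camera and project with the pinhole model
    cam_s = T_t_to_s[:, :3, :3] @ cam_t + T_t_to_s[:, :3, 3:4]
    proj = K @ cam_s
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-7)
    # Normalise to [-1, 1] as expected by grid_sample
    x = 2.0 * uv[:, 0] / (W - 1) - 1.0
    y = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([x, y], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(I_s, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Sanity check: with an identity pose the source image is returned unchanged
torch.manual_seed(0)
I_src = torch.rand(1, 3, 8, 10)
depth = torch.full((1, 1, 8, 10), 5.0)
T_identity = torch.eye(4).unsqueeze(0)
K_demo = torch.tensor([[[20.0, 0.0, 4.5], [0.0, 20.0, 3.5], [0.0, 0.0, 1.0]]])
warped = warp_source_to_target(I_src, depth, T_identity, K_demo)
```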

---




### Single-Scale Output

```python
def forward(self, x):
    disp = self.net(x)
    return {"disp_0": disp}
```

That’s just a **single-scale** version (full resolution) — meant to keep things simple for initial testing.


In **Monodepth2**, **SfMLearner**, and similar self-supervised VO methods,
we **predict disparity at multiple scales** (1/8, 1/4, 1/2, 1).

Those extra scales make the loss more stable and let the network learn coarse-to-fine depth structure.
So let’s extend your `ResNetUNet` to do that.

### Multi-Scale Output

Multi-scale output directly affects how **stable your training** will be:


| Scale             | Purpose                                             |
| ----------------- | --------------------------------------------------- |
| Coarse (1/8, 1/4) | Capture global structure; stabilize gradients early |
| Fine (1/2, 1)     | Recover local details; sharpen edges                |

During training, we compute **photometric + smoothness losses** at *each scale* and sum them, usually with decreasing weights:


$L = \sum_s \lambda_s \left( L_\text{photo}^s + \beta L_\text{smooth}^s \right)$

e.g., $\lambda = [1.0, 0.5, 0.25, 0.125]$

---


```python
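import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import ResNet18_Weights

# ResNet18Encoder, ConvBlock and DecoderBlock are assumed to be defined elsewhere in the project.


class ResNetUNet(nn.Module):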
    def __init__(self, num_classes=1, encoder_weights=ResNet18_Weights.IMAGENET1K_V1,
                 center_mult=1.0, up_mode="deconv"):
        super().__init__()
        self.encoder = ResNet18Encoder(weights=encoder_weights)
        enc_chs = self.encoder.out_channels  # [64, 64, 128, 256, 512]
        c0, c2, c3, c4, c5 = enc_chs

        center_out = int(c5 * center_mult)
        self.center = ConvBlock(c5, center_out) if center_mult != 1.0 else nn.Identity()
        bottom_ch = center_out if center_mult != 1.0 else c5

        # Decoder blocks
        self.dec4 = DecoderBlock(bottom_ch, c4, c4, up_mode=up_mode)
        self.dec3 = DecoderBlock(c4, c3, c3, up_mode=up_mode)
        self.dec2 = DecoderBlock(c3, c2, c2, up_mode=up_mode)
        self.dec1 = DecoderBlock(c2, c0, c0, up_mode=up_mode)

        # Full-resolution head (defined here, but not called in forward below)
        self.final = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(c0, num_classes, kernel_size=1),
            nn.Sigmoid(),
        )

        # Extra prediction heads for multi-scale outputs
        self.disp3 = nn.Conv2d(c4, num_classes, kernel_size=3, padding=1)
        self.disp2 = nn.Conv2d(c3, num_classes, kernel_size=3, padding=1)
        self.disp1 = nn.Conv2d(c2, num_classes, kernel_size=3, padding=1)
        self.disp0 = nn.Conv2d(c0, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        x0, x2, x3, x4, x5 = self.encoder(x)
        x5 = self.center(x5)

        d4 = self.dec4(x5, x4)  # -> stride 16
        d3 = self.dec3(d4, x3)  # -> stride 8
        d2 = self.dec2(d3, x2)  # -> stride 4
        d1 = self.dec1(d2, x0)  # -> stride 2

        # Multi-scale disparities at native decoder resolution (upsampled to input size below)
        disp_3 = torch.sigmoid(self.disp3(d4))
        disp_2 = torch.sigmoid(self.disp2(d3))
        disp_1 = torch.sigmoid(self.disp1(d2))
        disp_0 = torch.sigmoid(self.disp0(d1))

        disp_3 = F.interpolate(disp_3, size=x.shape[2:], mode="bilinear", align_corners=False)
        disp_2 = F.interpolate(disp_2, size=x.shape[2:], mode="bilinear", align_corners=False)
        disp_1 = F.interpolate(disp_1, size=x.shape[2:], mode="bilinear", align_corners=False)
        disp_0 = F.interpolate(disp_0, size=x.shape[2:], mode="bilinear", align_corners=False)

        return {
            "disp_0": disp_0,  # predicted at 1/2 resolution
            "disp_1": disp_1,  # predicted at 1/4 resolution
            "disp_2": disp_2,  # predicted at 1/8 resolution
            "disp_3": disp_3,  # predicted at 1/16 resolution
        }
```

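The network above follows the multi-scale pattern of **Monodepth2's DepthNet**: it returns a dictionary of disparities predicted at several decoder scales.
Here each map is upsampled to the input size inside `forward` for simplicity (Monodepth2 performs that upsampling later, when it computes the photometric loss).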

---

###  How to use Multi-Scale Output In Training

You can now iterate over the dictionary:

```python
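# Per-scale weights, keyed like the DepthNet outputs (values from the example above)
scale_weights = {"disp_0": 1.0, "disp_1": 0.5, "disp_2": 0.25, "disp_3": 0.125}

disp_outs = depth_net(I_t)
loss_total = 0.0
for scale, disp in disp_outs.items():
    depth = disp_to_depth(disp)
    # All maps are already at input resolution, so I_t, I_srcs and K are used unchanged.
    # compute_loss is a placeholder for min-reprojection + smoothness at this scale.
    loss_scale = compute_loss(I_t, I_srcs, depth, T_list, K)
    loss_total = loss_total + scale_weights[scale] * loss_scale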
```

---

###  Why we upsample all scales to full resolution

All four predicted disparity maps (`disp_0 … disp_3`) are **upsampled to full input resolution** before returning them.


In **Monodepth2** and other self-supervised depth methods, each decoder scale (1/8, 1/4, 1/2, 1) predicts its own disparity map.
However, when computing the **photometric reconstruction loss**, it’s often **simpler** to have all disparities (and warped images) at the *same resolution* (usually the original input size).

That’s why we do:

```python
disp_i = F.interpolate(disp_i, size=x.shape[2:], mode="bilinear", align_corners=False)
```

so that every predicted disparity map aligns with the original target frame resolution.
This makes it easy to:

* project pixels consistently using the same camera intrinsics, and
* visualize all disparities at the same size.

---

###  The “proper” alternative (multi-scale losses)

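Earlier methods such as SfMLearner and the original Monodepth **do not upsample** the predictions; Monodepth2 itself upsamples the low-resolution disparities before computing the photometric term.
In the per-scale variant, for each scale $s$: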

1. Downsample the input images $I_t, I_s$ and intrinsics $K$ by $2^s$.
2. Compute photometric + smoothness losses at that scale.

Then they sum losses across scales:

$
L_\text{total} = \sum_{s=0}^{3} \lambda_s \left( L_\text{photo}^s + \beta L_\text{smooth}^s \right)
$

This is a bit more precise, but also more code to manage.

---

### Upsampling vs multi-scale loss — trade-offs


| Approach                                 | Pros                                                    | Cons                                                      |
| ---------------------------------------- | ------------------------------------------------------- | --------------------------------------------------------- |
| **Upsample all to full res (your code)** | Simple, consistent intrinsics, easy visualization       | Slightly redundant compute; small interpolation smoothing |
| **Keep native scale for each output**    | Exact geometric consistency; matches each encoder level | More complex (need per-scale K & downsampled images)      |

In practice, **both work** — many modern implementations upsample all scales to simplify training.

---

### When to keep as-is

For your **first implementation of self-supervised DepthNet/PoseNet**,
keep the current design (upsampling to input size). It’s clean, stable, and good enough for learning.

Later, if you want more geometrical precision, you can move to the multi-scale loss variant:

```python
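# Example: compute the loss per scale (compute_loss, scale_intrinsics and weight are placeholders)
total_loss = 0.0
for i, (scale, disp) in enumerate(disp_outs.items()):
    down_factor = 2 ** i
    I_t_s = F.interpolate(I_t, scale_factor=1/down_factor, mode="area")
    I_src_s = [F.interpolate(I_s, scale_factor=1/down_factor, mode="area") for I_s in I_srcs]
    K_s = scale_intrinsics(K, 1/down_factor)
    # Bring the disparity to this scale's resolution (the network above returns full-res maps)
    disp_s = F.interpolate(disp, size=I_t_s.shape[2:], mode="bilinear", align_corners=False)
    loss_scale = compute_loss(I_t_s, I_src_s, disp_s, K_s)
    total_loss += weight[i] * loss_scale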
```

---

### TL;DR Summary

| Setting                    | Description                              | Use when                                       |
| -------------------------- | ---------------------------------------- | ---------------------------------------------- |
| **Upsample to full res**   | Simplifies training and loss computation | Recommended for your current DepthNet          |
| **Keep multi-scale**       | Physically precise, scale-consistent     | For advanced training later (per-scale loss)   |

---

Currently all outputs are upsampled, so you can apply your loss once at the original resolution.
That is a valid choice, especially for your first self-supervised VO setup.


### Training PoseNet alongside DepthNet


In **self-supervised monocular visual odometry**, the **DepthNet and PoseNet are trained jointly, end-to-end** from raw videos.
Neither has ground-truth supervision; the photometric loss ties them together.


* **DepthNet** alone predicts only *relative* depth up to scale.
  It needs the camera motion to know *how to warp one frame to another* for photometric consistency.
* **PoseNet** provides that motion — it learns to estimate the **relative pose** (T_{t→s}) between frames.

Together they minimize the **photometric reconstruction loss**:

$
L_\text{photo} = \min_s |I_t - W(I_s, D_t, T_{t\rightarrow s}, K)|
$

where $W(\cdot)$ warps the source $I_s$ into the target view using the predicted **depth** and **pose**.
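To make the per-pixel minimum and the auto-mask concrete, here is a simplified NumPy sketch. It uses an L1 error only (the SSIM term is omitted), works on single `(H, W, 3)` images, and takes the already-warped sources as input; all names are illustrative.

```python
import numpy as np

def l1_error(a, b):
    """Per-pixel L1 photometric error, averaged over colour channels."""
    return np.abs(a - b).mean(axis=-1)

def min_reprojection_loss(I_t, warped_srcs, raw_srcs):
    """Min over sources per pixel, with static pixels auto-masked."""
    reproj = np.stack([l1_error(I_t, w) for w in warped_srcs])   # (S, H, W)
    identity = np.stack([l1_error(I_t, s) for s in raw_srcs])    # (S, H, W)
    min_reproj = reproj.min(axis=0)
    min_identity = identity.min(axis=0)
    # Keep only pixels where warping explains the target better than no motion
    mask = min_reproj < min_identity
    if not mask.any():
        return 0.0, mask
    return float(min_reproj[mask].mean()), mask

rng = np.random.default_rng(0)
I_target = rng.random((4, 5, 3))
# One perfectly warped source and one poor one: the minimum picks the good one
loss_moving, mask_moving = min_reprojection_loss(
    I_target,
    warped_srcs=[I_target + 0.3, I_target.copy()],
    raw_srcs=[I_target + 0.5, I_target - 0.5],
)
# Static scene: the unwarped source already matches, so every pixel is masked
loss_static, mask_static = min_reprojection_loss(
    I_target, warped_srcs=[I_target + 0.3], raw_srcs=[I_target]
)
```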



---

### Typical training loop

```python
for batch in loader:
    I_t = batch["t"]           # target frame
    I_srcs = batch["srcs"]     # e.g. [I_{t-1}, I_{t+1}]
    K = batch["K"]

    # --- forward both networks ---
    disp_outs = depth_net(I_t)
    depth = disp_to_depth(disp_outs["disp_0"])
    T_list = [se3_to_SE3(pose_net(torch.cat([I_t, I_s], 1))) for I_s in I_srcs]

    # --- photometric loss (min reprojection) ---
    loss = compute_min_reproj_loss(I_t, I_srcs, depth, T_list, K)

    # --- optional smoothness/regularization ---
    loss += lambda_smooth * edge_aware_smoothness(disp_outs["disp_0"], I_t)

    # --- backward pass and parameter update ---
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
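The loop relies on a helper `se3_to_SE3` that is not defined in this document. One plausible reading, given that PoseNet outputs axis-angle plus translation, is Rodrigues' formula for the rotation with the translation copied as-is. The NumPy sketch below shows the math for a single 6-vector; a training version would apply the same formula to batched PyTorch tensors.

```python
import numpy as np

def se3_to_SE3(pose_vec):
    """pose_vec = [rx, ry, rz, tx, ty, tz] (axis-angle + translation) -> 4x4."""
    rvec = np.asarray(pose_vec[:3], dtype=float)
    t = np.asarray(pose_vec[3:], dtype=float)
    theta = np.linalg.norm(rvec)
    skew = np.array([
        [0.0, -rvec[2], rvec[1]],
        [rvec[2], 0.0, -rvec[0]],
        [-rvec[1], rvec[0], 0.0],
    ])
    if theta < 1e-8:
        R = np.eye(3) + skew  # first-order approximation near zero rotation
    else:
        # Rodrigues' formula written with the unnormalised skew matrix
        R = (np.eye(3)
             + np.sin(theta) / theta * skew
             + (1.0 - np.cos(theta)) / theta ** 2 * (skew @ skew))
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# 90 degree rotation about z with a small forward translation
T_example = se3_to_SE3([0.0, 0.0, np.pi / 2, 0.0, 0.0, 0.1])
```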

You optimize both nets’ parameters with a shared optimizer (or separate LRs):

```python
params = list(depth_net.parameters()) + list(pose_net.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=1e-4)
```

---

###  Common practice

| Stage         | Encoder  | Decoder | PoseNet | Notes                             |
| ------------- | -------- | ------- | ------- | --------------------------------- |
| 1️⃣ Warm-up   | frozen   | trained | trained | learn decoder & PoseNet first     |
| 2️⃣ Fine-tune | unfrozen | trained | trained | refine depth features jointly     |
| 3️⃣ Optional  | unfrozen | trained | frozen  | if you only want depth refinement |

---

###  Variations

* **Supervised depth:** train only DepthNet with GT depth → no PoseNet needed.
* **Stereo supervision:** PoseNet replaced by known baseline → simpler.
* **Multi-frame VO:** PoseNet takes 3 frames (t-1, t, t+1) to produce both poses jointly.

---

 **Summary**

| Network      | Input               | Output                 | Trained jointly? |
| ------------ | ------------------- | ---------------------- | ---------------- |
| **DepthNet** | 1 RGB frame         | per-pixel depth        | ✅                |
| **PoseNet**  | 2 (or 3) RGB frames | relative 6-DoF pose(s) | ✅                |

They cooperate through the photometric loss — that’s the essence of **self-supervised visual odometry / monocular depth learning**.


## Cityscapes

Refs: [1](https://github.com/mcordts/cityscapesScripts)
