# KITTI Dataset: Visual Odometry / SLAM Evaluation

1. [Download odometry data set (grayscale, 22 GB)](https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_gray.zip)
2. [Download odometry data set (color, 65 GB)](https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_color.zip)
3. [Download odometry data set (velodyne laser data, 80 GB)](https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_velodyne.zip)
4. [Download odometry data set (calibration files, 1 MB)](https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_calib.zip)
5. [Download odometry ground truth poses (4 MB)](https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_poses.zip)



## Sensor setup 
<img src="images/setup_top_view.png" />

<img src="images/passat_sensors_920.png" />




## Calibration Files and Projection Matrices

To get the calibration data, run:
```
python kitti_calibration.py
```



- $P0$: Reference camera (left of stereo pair 1), extrinsics are identity.
- $P1$: Right camera of stereo pair 1, extrinsics include baseline offset.
- $P2$: Left camera of stereo pair 2, extrinsics depend on setup.
- $P3$: Right camera of stereo pair 2, extrinsics depend on setup.


---

Camera: $P0$:

```
Projection Matrix:
[[707.0912   0.     601.8873   0.    ]
 [  0.     707.0912 183.1104   0.    ]
 [  0.       0.       1.       0.    ]]
Intrinsic Matrix:
[[707.0912   0.     601.8873]
 [  0.     707.0912 183.1104]
 [  0.       0.       1.    ]]
Rotation Matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Translation Vector:
[[0.]
 [0.]
 [0.]]
```
---

Camera: $P1$:
```
Projection Matrix:
[[ 707.0912    0.      601.8873 -379.8145]
 [   0.      707.0912  183.1104    0.    ]
 [   0.        0.        1.        0.    ]]
Intrinsic Matrix:
[[707.0912   0.     601.8873]
 [  0.     707.0912 183.1104]
 [  0.       0.       1.    ]]
Rotation Matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Translation Vector:
[[ 5.37150653e-01]
 [-1.34802944e-17]
 [ 0.00000000e+00]]
```

The sensor setup image above gives a nominal distance of `0.54` m between the two cameras of a stereo pair along the $x$ axis. The decomposition of $P1$ yields `5.37150653e-01` m, which agrees to within a few millimetres.
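
The same offset can be read directly from $P1$ without a full decomposition. The sketch below assumes a rectified pinhole camera, $P = K\,[I \mid t]$, so that the top-right entry of $P$ equals $f_x t_x$. The helper name `baseline_from_projection` is made up for this example and is not part of `kitti_calibration.py`.

```python
# P1 as printed above (3x4 projection matrix of the right camera of stereo pair 1).
P1 = [
    [707.0912, 0.0, 601.8873, -379.8145],
    [0.0, 707.0912, 183.1104, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]


def baseline_from_projection(P):
    """Return the x-offset of the camera centre relative to the reference camera.

    Assumes P = K [I | t] (rectified pinhole camera), so P[0][3] = fx * t_x.
    The camera centre is C = -t, hence the sign flip.
    """
    fx = P[0][0]
    return -P[0][3] / fx


baseline = baseline_from_projection(P1)
print(f"baseline along x: {baseline:.6f} m")
```

The result should match the first entry of the translation vector printed for $P1$ up to rounding.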

Refs: [1](https://www.cvlibs.net/datasets/kitti/setup.php), [2](https://stackoverflow.com/questions/29407474/how-to-understand-the-kitti-camera-calibration-files), [3](https://github.com/yanii/kitti-pcl/blob/master/KITTI_README.TXT), [4](https://www.cvlibs.net/datasets/kitti/eval_odometry.php), [5](https://github.com/avisingh599/mono-vo/), [6](https://github.com/alishobeiri/Monocular-Video-Odometery), [7](https://avisingh599.github.io/vision/monocular-vo/)


## Ground Truth Poses
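Each row of a poses file has 12 values, obtained by flattening (row by row) the `3x4` transformation matrix `[R | t]` of the left camera. According to the KITTI odometry devkit, this matrix maps a point from the left camera frame at that time step into the coordinate system of the first frame: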

```
r11 r12 r13 tx r21 r22 r23 ty r31 r32 r33 tz
```
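
As a rough illustration, one row can be reshaped back into a matrix as shown below. The helper name `parse_pose_row` is made up for this example, and the sample row is synthetic, not taken from the dataset.

```python
def parse_pose_row(line):
    """Convert one line of a KITTI poses file into a 4x4 homogeneous matrix."""
    values = [float(v) for v in line.split()]
    if len(values) != 12:
        raise ValueError(f"expected 12 values, got {len(values)}")
    # Rows of the 3x4 matrix [R | t], stored row by row.
    rows = [values[0:4], values[4:8], values[8:12]]
    # Append the homogeneous row so poses can be chained by matrix products.
    rows.append([0.0, 0.0, 0.0, 1.0])
    return rows


# Synthetic example: identity rotation, camera 1.5 m ahead along z.
T = parse_pose_row("1 0 0 0  0 1 0 0  0 0 1 1.5")
rotation = [row[:3] for row in T[:3]]
translation = [row[3] for row in T[:3]]
print(translation)
```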





## Display Ground Truth Poses in rerun 
Run:

```
python kitti_gt_to_rerun.py
```


<img src="images/display_ground_truth_poses_rerun.png" />


## Stereo Vision
Run:
```
python kitti_stereo.py
```


## Reconstruct Sparse/Dense Model From Known Camera Poses with Colmap

Your data should have the following structure: 

```
├── database.db
├── dense
│   ├── refined
│   │   └── model
│   │       └── 0
│   └── sparse
│       └── model
│           └── 0
├── images
│   ├── 000000.png
│   ├── 000001.png
│   ├── 000002.png
│   └── 000003.png
└── sparse
    └── model
        └── 0
            ├── cameras.txt
            ├── images.txt
            └── points3D.txt
```

1. `cameras.txt`: the format is:

```
CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]
```
For the KITTI dataset the camera model is `PINHOLE`, which takes four parameters: the focal lengths (`fx`, `fy`) and the principal point coordinates (`cx`, `cy`). The values below match the $P0$ calibration shown earlier:

- `CAMERA_ID`: 1
- `MODEL`: PINHOLE
- `WIDTH`: 1226
- `HEIGHT`: 370
- `fx`: 707.0912
- `fy`: 707.0912
- `cx`: 601.8873
- `cy`: 183.1104

The resulting line in `cameras.txt` is:

```
1 PINHOLE 1226 370 707.0912 707.0912 601.8873 183.1104
```

2. `images.txt`: the format is:
```
IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
```

Your data should look like this. Mind the empty line after each image line: COLMAP expects a second line per image that lists its 2D points, and it is left empty here:

```
1 1.0 0.0 0.0 0.0 0.031831570484910754 -0.2020180259287443 -0.05988511865826446 1 000000.png

2 0.9999990698095921 -0.000486454947446343 0.0008155417501438222 -0.0009790981505847082 -0.026717887515950233 -0.09385561937368328 -0.38812196090339146 1 000001.png

3 0.9999976159395401 -0.0011567120445530273 0.0013793515824379724 -0.0012359294859380324 -0.23100950491953082 -0.05900910756124116 -0.9698261247623092 1 000002.png

4 0.9999950283825452 -0.0017604272641239351 0.0022926784138869423 -0.0012600522730534293 0.17578254454768152 -0.014474209460539546 -1.9112790713853196 1 000003.png
```
and finally:

3. `points3D.txt`: This file should be empty.

To compare against your own files, you can convert an existing COLMAP model into TXT with the following command:

```
colmap model_converter --input_path $DATASET_PATH/sparse/0 --output_path $DATASET_PATH/ --output_type TXT
```

The KITTI format for ground truth poses (for instance, for the file `data/kitti/odometry/05/poses/05.txt`) is:

```
r11 r12 r13 tx r21 r22 r23 ty r31 r32 r33 tz
```
The COLMAP format for `images.txt` is:

```
IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
```

Run the script [kitti_to_colmap.py](../scripts/kitti/kitti_to_colmap.py). It writes the output to `images.txt`.
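
For reference, the conversion boils down to roughly the following. This is an independent sketch and not the contents of `kitti_to_colmap.py`. It assumes that KITTI rows are camera-to-world transforms and that COLMAP stores world-to-camera poses, so each pose is inverted before the rotation is turned into a quaternion. The function names and the `IMAGE_ID - 1` file naming are assumptions chosen to match the sample lines above.

```python
import math


def rotmat_to_quat(R):
    """Convert a 3x3 rotation matrix to a unit quaternion (qw, qx, qy, qz)."""
    trace = R[0][0] + R[1][1] + R[2][2]
    if trace > 0.0:
        s = 2.0 * math.sqrt(trace + 1.0)
        return (0.25 * s, (R[2][1] - R[1][2]) / s,
                (R[0][2] - R[2][0]) / s, (R[1][0] - R[0][1]) / s)
    if R[0][0] > R[1][1] and R[0][0] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[0][0] - R[1][1] - R[2][2])
        return ((R[2][1] - R[1][2]) / s, 0.25 * s,
                (R[0][1] + R[1][0]) / s, (R[0][2] + R[2][0]) / s)
    if R[1][1] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[1][1] - R[0][0] - R[2][2])
        return ((R[0][2] - R[2][0]) / s, (R[0][1] + R[1][0]) / s,
                0.25 * s, (R[1][2] + R[2][1]) / s)
    s = 2.0 * math.sqrt(1.0 + R[2][2] - R[0][0] - R[1][1])
    return ((R[1][0] - R[0][1]) / s, (R[0][2] + R[2][0]) / s,
            (R[1][2] + R[2][1]) / s, 0.25 * s)


def kitti_row_to_colmap_line(image_id, row, camera_id=1):
    """Build one images.txt line from one KITTI pose row (12 numbers)."""
    v = [float(x) for x in row.split()]
    R = [v[0:3], v[4:7], v[8:11]]  # camera-to-world rotation
    t = [v[3], v[7], v[11]]        # camera position in the world frame
    # Invert the pose: R_wc = R^T, t_wc = -R^T t.
    R_wc = [[R[j][i] for j in range(3)] for i in range(3)]
    t_wc = [-sum(R_wc[i][j] * t[j] for j in range(3)) for i in range(3)]
    q = rotmat_to_quat(R_wc)
    name = f"{image_id - 1:06d}.png"  # assumed naming: IMAGE_ID 1 -> 000000.png
    fields = [image_id, *q, *t_wc, camera_id, name]
    return " ".join(str(f) for f in fields)


def images_txt(rows):
    """Join all image lines, each followed by an empty 2D-points line."""
    lines = [kitti_row_to_colmap_line(i + 1, row) for i, row in enumerate(rows)]
    return "".join(line + "\n\n" for line in lines)


# Synthetic rows: identity pose, then a 90 degree turn about y at x = 1.
example = images_txt(["1 0 0 0 0 1 0 0 0 0 1 0", "0 0 1 1 0 1 0 0 -1 0 0 0"])
print(example)
```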


You can run the following script to add noise: [kitti_to_colmap_noise.py](../scripts/kitti/kitti_to_colmap_noise.py).


Inside `~/colmap_projects/kitti_noisy`, create a soft link pointing to the KITTI images with `ln -s <path-to-kitti-odometry-image> images`.

In my case:

```
ln -s /home/$USER/workspace/OpenCVProjects/data/kitti/odometry/05/image_0/ images
```

## 1) Self-supervised monocular VO (Depth + Pose)


* **DepthNet (single image → depth)**

  * **Encoder**: `ResNet18` or `MobileNetV2` (tiny & fast). If you want to play with transformers, you *can* try `swin_tiny` as encoder, but start with ResNet18 first.
  * **Decoder**: 4–5 upsampling stages with skip connections from the encoder; bilinear upsample + 3×3 convs; edge-aware smoothness.
* **PoseNet (2 or 3 frames → 6-DoF)**

  * Tiny CNN (e.g., 5 conv layers with stride) on **frame pairs/triplets concatenated along channels**; global average pool → FC(6) for relative pose (axis-angle + translation).

**Losses (no GT depth/pose):**

* Photometric reprojection (L1 + SSIM) between target and source, using depth + predicted pose + known intrinsics.
* Min-reprojection across multiple sources.
* Auto-mask static pixels; optionally per-pixel explainability mask.
* Edge-aware smoothness on inverse depth.
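
The edge-aware smoothness term is the easiest of these to show in isolation. The NumPy sketch below follows the form commonly used in the Monodepth2 family (mean-normalised inverse depth, gradients down-weighted by image gradients). The function name and array layout are assumptions for illustration; a training loop would use the equivalent tensor operations.

```python
import numpy as np


def edge_aware_smoothness(disp, image):
    """Smoothness penalty on inverse depth, relaxed at image edges.

    disp:  (H, W) inverse depth (disparity) map.
    image: (H, W, 3) colour image with values in [0, 1].
    """
    # Normalise by the mean so the loss cannot be lowered by shrinking disp.
    disp = disp / (disp.mean() + 1e-7)

    grad_disp_x = np.abs(disp[:, 1:] - disp[:, :-1])
    grad_disp_y = np.abs(disp[1:, :] - disp[:-1, :])

    # Image gradients, averaged over colour channels.
    grad_img_x = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    grad_img_y = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)

    # Depth discontinuities are cheap where the image also has an edge.
    weighted_x = grad_disp_x * np.exp(-grad_img_x)
    weighted_y = grad_disp_y * np.exp(-grad_img_y)
    return weighted_x.mean() + weighted_y.mean()


# Toy check: a depth step costs less when it coincides with an image edge.
disp = np.ones((4, 6))
disp[:, 3:] = 2.0
flat_image = np.full((4, 6, 3), 0.5)
edge_image = np.zeros((4, 6, 3))
edge_image[:, 3:] = 1.0
loss_flat = edge_aware_smoothness(disp, flat_image)
loss_edge = edge_aware_smoothness(disp, edge_image)
print(loss_flat, loss_edge)
```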

**Why this first?** It is the classic SfM-Learner / Monodepth2 family: easy to fit on a small GPU, and you'll learn view synthesis, warping, Jacobians, and all the VO basics.

## 2) Supervised depth (optional add-on)

If you want clean depth numbers without the VO loop, swap the photometric loss for **L1 + scale-invariant log loss** against GT depth on a dataset that has it (e.g., NYU-v2, KITTI depth). You can still keep PoseNet for pose-consistency regularization.

---


# Shapes, sizes, and VRAM knobs

* **Input res**: 128×416 or 192×640 (sweet spot).
* **Batch**: start with **2** (or **1** + gradient accumulation).
* **AMP**: `autocast` + `GradScaler` on.
* **Pose frames**: use **(t, t±1)** or a triplet (t−1, t, t+1); predict T_{t→s}.
* **Depth range**: predict **inverse depth** with a sigmoid and scale it to [1/d_max, 1/d_min], so that depth stays in [d_min, d_max].
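
The depth-range item can be made concrete with a small sketch. It maps a raw network output through a sigmoid to an inverse depth between `1/d_max` and `1/d_min`, then inverts it. The function name and the default bounds (0.1 and 100) are assumptions for illustration and should be tuned per dataset.

```python
import math


def sigmoid_to_depth(raw, d_min=0.1, d_max=100.0):
    """Map an unbounded network output to a depth in [d_min, d_max]."""
    s = 1.0 / (1.0 + math.exp(-raw))      # sigmoid, in (0, 1)
    inv_min = 1.0 / d_max                  # smallest inverse depth
    inv_max = 1.0 / d_min                  # largest inverse depth
    inv_depth = inv_min + (inv_max - inv_min) * s
    return 1.0 / inv_depth


# Larger raw outputs mean larger inverse depth, i.e. closer surfaces.
for raw in (-5.0, 0.0, 5.0):
    print(raw, sigmoid_to_depth(raw))
```

Predicting inverse depth this way keeps far-away regions numerically well behaved, since large depths map to small values near `1/d_max`.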

---

# Minimal architectures (proven + tiny)

### DepthNet (ResNet18 encoder + UNet-style decoder)

* Encoder: torchvision `resnet18` up to layer4.
* Decoder: for each scale, `upsample ×2 → concat skip → conv(3×3)×2`.
* Output: multi-scale disparity {1/8, 1/4, 1/2, 1}, supervise each scale.

This is essentially **Monodepth2-style** and fits easily on 4 GB at 192×640 with batch 2.

### PoseNet (very small CNN)

* Input: concat two RGB frames → 6 channels (or three frames → 9 channels).
* Conv(7×7, s=2) → Conv(5×5, s=2) → Conv(3×3, s=2) × 3 → GAP → FC(6).
* Initialize the last layer near zero and multiply its output by 0.01, so predicted poses stay small at the start of training.
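
The 6 numbers coming out of the FC layer still have to be turned into a transform. A minimal NumPy sketch is shown below, assuming the first three outputs are an axis-angle rotation and the last three a translation (the ordering is an assumption), with the 0.01 scaling applied first. The function name is hypothetical.

```python
import numpy as np


def pose_vec_to_matrix(vec, scale=0.01):
    """Convert a 6-vector (axis-angle, translation) into a 4x4 transform."""
    vec = scale * np.asarray(vec, dtype=np.float64)
    a, translation = vec[:3], vec[3:]
    theta = np.linalg.norm(a)  # rotation angle in radians

    # Skew-symmetric matrix of the (unnormalised) axis-angle vector.
    skew = np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])
    if theta < 1e-8:
        # First-order approximation for tiny rotations (avoids dividing by 0).
        R = np.eye(3) + skew
    else:
        k = skew / theta
        # Rodrigues' rotation formula.
        R = np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = translation
    return T


# With near-zero outputs and the 0.01 factor, the pose starts close to identity.
T_init = pose_vec_to_matrix([0.1, -0.2, 0.05, 0.3, 0.0, -0.1])
print(np.round(T_init, 4))
```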

---

# Training recipe (self-sup monocular VO, KITTI)

1. **Preprocess**: resize to 192×640, keep fx, fy, cx, cy scaled accordingly.
2. **Batching**: sample snippets (length 3) → (t−1, t, t+1).
3. **Forward**:

   * Depth_t = DepthNet(I_t).
   * For each source s∈{t−1,t+1}: T_{t→s} = PoseNet(I_t, I_s).
   * Warp I_s→t using Depth_t, T_{t→s}, K (pinhole). Compute photometric loss with SSIM+L1; take **min** over sources per pixel.
4. **Regularize**: smoothness on Depth_t (edge-aware with image gradients).
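5. **Masking**: auto-mask pixels whose photometric error under the identity warp (unwarped source vs. target) is lower than their reprojection error. These are pixels that look static relative to the camera, e.g. when the car is stopped or an object moves at the same speed.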
6. **Optim**: AdamW, lr 1e-4 (Depth), 1e-4 (Pose); cosine decay; weight decay 1e-4.
7. **Tricks**: random brightness/contrast jitter, random flips (careful with flips + poses).
8. **Logging**: show sample depth maps, photometric error heatmaps, and train/val losses.
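
Steps 3 and 5 above (per-pixel minimum over sources, plus auto-masking) can be sketched in NumPy as follows. For brevity the per-pixel error is plain L1 and the SSIM term is left out; the function names are illustrative and not from any library.

```python
import numpy as np


def l1_error(a, b):
    """Per-pixel L1 error between two (H, W, 3) images, averaged over channels."""
    return np.abs(a - b).mean(axis=-1)


def min_reprojection_with_automask(target, warped_sources, raw_sources):
    """Return (loss, mask) for one target image.

    warped_sources: source images warped into the target view.
    raw_sources:    the same source images, unwarped (identity warp).
    """
    reproj = np.stack([l1_error(target, w) for w in warped_sources])
    identity = np.stack([l1_error(target, s) for s in raw_sources])

    # Per pixel, keep the best-matching source (handles occlusions).
    min_reproj = reproj.min(axis=0)
    min_identity = identity.min(axis=0)

    # Keep only pixels where warping explains the image better than doing nothing.
    mask = min_reproj < min_identity
    loss = (min_reproj * mask).sum() / max(mask.sum(), 1)
    return loss, mask


# Toy example: the left half is static (raw source equals target), so it is masked.
target = np.zeros((2, 4, 3))
raw = np.zeros((2, 4, 3))
raw[:, 2:] = 1.0
warped_a = np.full((2, 4, 3), 0.1)
warped_b = np.full((2, 4, 3), 0.2)
loss, mask = min_reprojection_with_automask(target, [warped_a, warped_b], [raw, raw])
print(loss)
print(mask)
```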

---

# Evaluation

* **Pose**: ATE/RPE (TUM style) on TUM or EuRoC; KITTI Odometry sequence metrics (t_rel, r_rel).
* **Depth** (if you evaluate on GT): AbsRel, SqRel, RMSE, RMSE_log, δ<1.25^n.
* **Ablations**: with/without auto-mask; with/without min reprojection; ResNet18 vs Swin-Tiny encoder.
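
The depth metrics listed above are short enough to write out. The sketch below uses NumPy and applies per-image median scaling before computing them, which is the usual workaround for monocular scale ambiguity. The function name and the "ground truth of 0 means invalid" convention are assumptions for this example.

```python
import numpy as np


def depth_metrics(gt, pred, median_scale=True):
    """Compute standard depth metrics over pixels with valid ground truth."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    valid = gt > 0  # assumed convention: 0 marks missing ground truth
    gt, pred = gt[valid], pred[valid]

    if median_scale:
        # Align the unknown monocular scale to the ground truth.
        pred = pred * np.median(gt) / np.median(pred)

    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),
        "sq_rel": np.mean((gt - pred) ** 2 / gt),
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "a1": np.mean(ratio < 1.25),
        "a2": np.mean(ratio < 1.25 ** 2),
        "a3": np.mean(ratio < 1.25 ** 3),
    }


# A prediction that is correct up to scale is perfect after median scaling.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([0.5, 1.0, 2.0, 3.0])
scaled = depth_metrics(gt, pred, median_scale=True)
unscaled = depth_metrics(gt, pred, median_scale=False)
print(scaled["abs_rel"], unscaled["abs_rel"])
```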

---

# If you want a Swin-based DepthNet (optional)

You can swap the encoder:

* Use `swin_tiny_patch4_window7_224` as a **feature pyramid** by tapping outputs after each stage (patch merges simulate downsampling).
* Add simple lateral 1×1 convs to map Swin stage channels to decoder widths (e.g., 256→128→64→32).
* Keep the same UNet decoder.
  Expect **~+20–30% VRAM** vs ResNet18 at the same input size. Start with **128×416** and batch 1 if needed.

---

# Concrete baselines to run on 4 GB

### Baseline A (fastest to success)

* **DepthNet**: ResNet18 encoder + UNet decoder
* **PoseNet**: tiny 6-layer CNN
* **Data**: KITTI Odometry @ **192×640**
* **Batch**: 2 (or 1 + accum 8) with **AMP**
* **Expected**: training fits comfortably; you’ll see depth form in a few epochs

### Baseline B (transformer-curious)

* **DepthNet**: Swin-Tiny encoder + light decoder
* **PoseNet**: same tiny CNN
* **Data**: KITTI Odometry @ **128×416**
* **Batch**: 1 (accum 16), **AMP**, maybe gradient checkpointing in Swin
* **Expected**: slightly better edges; slower; tighter on VRAM

---

# Handy implementation tips

* **Warper**: implement differentiable pinhole projection with `grid_sample` (mind align_corners, padding mode).
* **Scale consistency**: monocular depth is scale-ambiguous—either use median-scaling for eval or add stereo pairs (if available) for scale.
* **Camera intrinsics per-sample**: store K in your dataset class and scale with resizing.
* **Stability**: start PoseNet outputs near zero; clamp inverse depth to [0.01, 10] (scene-dependent).
* **Speed**: compute SSIM on 1/2 scale to save memory, but backprop to full-res photometric term.
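
The warper and intrinsics tips can be illustrated without a deep-learning framework. The NumPy sketch below scales `K` for a resize and computes the normalised sampling grid that a function such as PyTorch's `grid_sample` expects (coordinates in [-1, 1]). It only produces the grid; the actual bilinear sampling is left to the framework. Function names are hypothetical, and `T` is assumed to map target-camera points into the source camera.

```python
import numpy as np


def scale_intrinsics(K, src_size, dst_size):
    """Scale a 3x3 intrinsic matrix for a resize; sizes are (width, height)."""
    K = np.array(K, dtype=np.float64)
    K[0, :] *= dst_size[0] / src_size[0]  # fx and cx scale with width
    K[1, :] *= dst_size[1] / src_size[1]  # fy and cy scale with height
    return K


def reprojection_grid(depth, K, T, align_corners=True):
    """Return an (H, W, 2) grid of normalised source coordinates.

    depth: (H, W) depth of the target image.
    T:     4x4 transform from the target camera to the source camera.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)

    # Back-project pixels to 3D points in the target camera frame.
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])

    # Move points into the source camera and project with the pinhole model.
    src = K @ (T @ cam_h)[:3]
    z = np.maximum(src[2], 1e-7)  # guard against division by zero
    us, vs = src[0] / z, src[1] / z

    # Normalise to [-1, 1]; the formula depends on the align_corners convention.
    if align_corners:
        gx, gy = 2.0 * us / (W - 1) - 1.0, 2.0 * vs / (H - 1) - 1.0
    else:
        gx, gy = (2.0 * us + 1.0) / W - 1.0, (2.0 * vs + 1.0) / H - 1.0
    return np.stack([gx, gy], axis=-1).reshape(H, W, 2)


K_full = [[707.0912, 0.0, 601.8873], [0.0, 707.0912, 183.1104], [0.0, 0.0, 1.0]]
K_small = scale_intrinsics(K_full, (1226, 370), (640, 192))
grid = reprojection_grid(np.full((192, 640), 10.0), K_small, np.eye(4))
print(K_small)
print(grid[0, 0], grid[-1, -1])
```

With an identity transform the grid maps every pixel onto itself, which is a handy sanity check before plugging in predicted poses.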

---




## Cityscapes

Refs: [1](https://github.com/mcordts/cityscapesScripts)
