<a href="https://colab.research.google.com/github/Willyzw/monodepth2/blob/master/monodepth2_handson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Tracking the pose of a moving camera while simultaneously inferring the **dense** structure of the environment is a long-standing problem, sometimes called **DenseSLAM**. Traditionally it is solved in two steps: first, a sparse set of feature points is estimated together with the camera poses; then multi-view stereo (MVS) reconstructs the dense scene structure. Although this toolchain is well studied, it consists of multiple elaborate hand-crafted stages and lacks robustness in cases such as low texture, thin structures, and dynamic objects. In addition, modern applications such as augmented reality and automated driving demand real-time dense scene perception, e.g. for interaction between physical and virtual objects or for obstacle avoidance.

With recent advances in deep learning, this field has made remarkable progress. **MonoDepth2**[1] is one of the most representative works. It consists of a depth network and a pose network, which estimate the depth map and the camera pose respectively. More specifically, the pose network takes a pair of consecutive images $I_{t-1}$ and $I_t$ and outputs the relative transform from $I_{t-1}$ to $I_t$, while the depth network maps an RGB image $I_{t}$ through an encoder-decoder network to its corresponding depth map. This process is illustrated in the figure below (Figure 1 of SfMLearner [3]).  
![](https://github.com/Willyzw/monodepth2/raw/master/assets/sfmlearner.png)

This notebook aims to convey the principles of MonoDepth2 through an example. First, the required development environment is set up. Then, a few example images from the KITTI dataset[4] are used to illustrate image warping, which is the core principle of the self-supervised learning. Finally, we apply the pre-trained network model to a short video clip from the Cityscapes dataset to check how the model generalizes to a different dataset.
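
To make the idea of image warping concrete before looking at real images, here is a minimal pure-Python sketch of how a single pixel of the target image can be mapped into a neighbouring source image once its depth and the relative camera motion are known. It is not MonoDepth2's implementation: it assumes a simple pinhole camera, and the function names, the intrinsics `(fx, fy, cx, cy)`, and the example pose `(R, t)` (taken here to map 3D points from the target camera frame into the source camera frame) are all hypothetical values chosen for illustration.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a known depth to a 3D point in the camera frame."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def transform(point, R, t):
    """Apply the rigid transform X' = R X + t to a 3D point."""
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
    )


def project(point, fx, fy, cx, cy):
    """Project a 3D point in the camera frame onto the image plane."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def reproject_pixel(u, v, depth, intrinsics, R, t):
    """Map a target pixel to its location in the source image."""
    point_target = backproject(u, v, depth, *intrinsics)
    point_source = transform(point_target, R, t)
    return project(point_source, *intrinsics)


# Hypothetical pinhole intrinsics (fx, fy, cx, cy); not calibration values
# of any real dataset.
intrinsics = (720.0, 720.0, 620.0, 187.0)

# Hypothetical relative pose: no rotation, and every point is 1 m farther
# away along the optical axis when seen from the source camera.
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_forward = (0.0, 0.0, 1.0)

# Target pixel (800, 250) with an assumed depth of 10 m.
u_src, v_src = reproject_pixel(800.0, 250.0, 10.0, intrinsics, R_identity, t_forward)
print(u_src, v_src)
```

With zero translation and identity rotation the pixel maps onto itself, and in this example the farther-away point moves toward the principal point `(cx, cy)`. Repeating this for every pixel and sampling the source image at the resulting coordinates yields a synthesized target view, which can then be compared with the real target image as a training signal.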


# Environment setup

In [1]:
!git clone https://github.com/Willyzw/monodepth2

Cloning into 'monodepth2'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 165 (delta 4), reused 7 (delta 1), pack-reused 148
Receiving objects: 100% (165/165), 13.30 MiB | 32.05 MiB/s, done.
Resolving deltas: 100% (75/75), done.


In [2]:
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.8.0+cu111
  Downloading https://download.pytorch.org/whl/cu111/torch-1.8.0%2Bcu111-cp37-cp37m-linux_x86_64.whl (1982.2MB)

# Network