# **Computer Vision - Tübingen (2021)**

These are my notes from Andreas Geiger's Computer Vision course on YouTube.

In [2]:
import torch
import numpy as np
from torch import nn
from torch.nn import functional as F

## **Lecture 2. Image Formation**

### **Photometric Image Formation**


Let $p \in \mathbb{R}^3$ be a 3D surface point and $v \in \mathbb{R}^3$ a viewing direction. Suppose that $r \in \mathbb{R}^3$ is the incoming light direction. Then the **rendering equation** describes how much of the light arriving at $p$ is reflected into viewing direction $v$ $$L_{out}(p,v ; \lambda) = L_{emit}(p,v ; \lambda) + \int_{\Omega} \verb|BRDF|(p,r,v ; \lambda) \cdot L_{in}(p, r ; \lambda) \cdot n^T r dr$$ where $\Omega$ is the unit hemisphere at normal $n$ and $\lambda$ is the wavelength. The $L_{emit}$ term is positive only when the point itself emits light. The bidirectional reflectance distribution function $\verb|BRDF|$ defines how light is reflected at an opaque surface.

Typical BRDFs have a *diffuse* component that scatters light uniformly in all directions and leads to shading (smooth variation of intensity w.r.t the surface normal) and a *specular* component that depends strongly on the outgoing direction.
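To make the diffuse component concrete, here is a minimal sketch under strong assumptions that are mine, not the lecture's: a Lambertian surface with constant BRDF $\rho / \pi$, no emission, and a single distant light so that the integral collapses to one term. The function name `lambertian_radiance` and its parameters are illustrative.

```python
import numpy as np

def lambertian_radiance(normal, light_dir, light_radiance, albedo):
    """Outgoing radiance for a Lambertian surface lit from one direction.

    Assumes a constant BRDF albedo / pi and no emitted light, so the
    result does not depend on the viewing direction.
    """
    n = normal / np.linalg.norm(normal)
    r = light_dir / np.linalg.norm(light_dir)
    brdf = albedo / np.pi
    # The n^T r factor: light arriving at a grazing angle contributes less,
    # and light from below the surface contributes nothing.
    cos_term = max(0.0, float(n @ r))
    return brdf * light_radiance * cos_term
```

The cosine factor is what produces shading: intensity varies smoothly as the surface normal turns away from the light.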

Modelling one light bounce is insufficient for rendering complex scenes. Global illumination techniques also take indirect illumination into account.

## **Lecture 3. Structure from Motion**

### **Preliminaries**

SIFT constructs a scale space by iteratively filtering the image with a Gaussian where adjacent scales are subtracted, yielding a *difference of Gaussians*. Interest points are detected as extrema in the resulting scale space. These are points whose value is higher (for a maximum) or lower (for a minimum) than that of all neighboring points, including neighbors at the adjacent scale levels.

SIFT rotates the descriptor to align with the dominant gradient orientation. Then gradient histograms are computed for local sub-regions of the descriptor, and all the histograms are concatenated and normalized to form a 128-dimensional feature vector.

Feature correspondences are retrieved with efficient nearest neighbor search, and ambiguous matches are filtered by computing the ratio of the distance to the closest neighbor to the distance to the second closest. A large ratio $(> 0.8)$ indicates that the found match might not be correct.
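The ratio test can be sketched in a few lines of NumPy. This is a brute-force illustration rather than the efficient nearest neighbor search mentioned above. The function name `ratio_test_matches` and the array layout (one descriptor per row, at least two descriptors in the second image) are my own assumptions.

```python
import numpy as np

def ratio_test_matches(desc1, desc2, max_ratio=0.8):
    """Return (i, j) index pairs that pass the nearest-neighbor ratio test.

    desc1: (N, D) descriptors of image 1; desc2: (M, D) descriptors of
    image 2 with M >= 2.
    """
    # Pairwise Euclidean distances between all descriptors (brute force).
    dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dists):
        nearest, second = np.argsort(row)[:2]
        # Keep the match only if the best candidate is clearly better
        # than the runner-up.
        if row[nearest] < max_ratio * row[second]:
            matches.append((i, int(nearest)))
    return matches
```

A descriptor with two almost equally close candidates is dropped, which is the point of the test: such matches are ambiguous.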

### **Two-frame Structure-from-Motion**



The relation between the camera poses and the 3D structure, given point correspondences between two images, is described by *epipolar geometry*.

The main crux of this is the epipolar equation $$x_2^T E x_1 = x_2^T [t]_\times R x_1 = 0$$ where $x_1,x_2$ are the observed coordinates in the two images after undoing the effect of the intrinsic camera matrix $K$ and $y \to Ry + t$ is the change of 3D coordinates from first camera center to the second. The matrix $E$ is an example of an *essential matrix*.



The epipolar line $l_2 = Ex_1$ in the second image passes through the second epipole $e_2$, and so we have $e_2^T E x_1 = 0$. This is true for any $x_1$, so we must have $e_2^T E = 0$. Similarly, $E e_1 = 0$ for the first epipole $e_1$.

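We can recover the essential matrix from $N \geq 8$ image correspondences: each correspondence gives one linear equation in the entries of $E$, and we solve the resulting homogeneous system with the SVD, which also fixes the otherwise arbitrary scale of $E$. It is important to whiten the image coordinates to have zero mean and unit variance before performing this algorithm. The recovered matrix is then transformed back to account for this normalization.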
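A minimal sketch of this procedure, assuming noise-free or mildly noisy correspondences that are already in calibrated coordinates (the effect of $K$ undone). The helper names `normalize_points` and `eight_point` are mine. Projecting the result onto singular values $(1, 1, 0)$ at the end is a common extra step that the notes above do not mention.

```python
import numpy as np

def normalize_points(x):
    """Whiten 2D points (N, 2): zero mean, unit variance (isotropic)."""
    mean = x.mean(axis=0)
    scale = 1.0 / np.sqrt(((x - mean) ** 2).mean())
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    x_h = np.column_stack([x, np.ones(len(x))])
    return x_h @ T.T, T

def eight_point(x1, x2):
    """Estimate E from N >= 8 correspondences x1[i] <-> x2[i], each (N, 2)."""
    x1n, T1 = normalize_points(x1)
    x2n, T2 = normalize_points(x2)
    # x2^T E x1 = 0 is linear in E: one row of A per correspondence.
    A = np.stack([np.kron(b, a) for a, b in zip(x1n, x2n)])
    # The right singular vector of the smallest singular value minimizes
    # ||A e|| subject to ||e|| = 1, which fixes the scale.
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Undo the whitening: x2n = T2 x2 and x1n = T1 x1.
    E = T2.T @ E @ T1
    # Project onto the set of essential matrices (singular values 1, 1, 0).
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```

With real, noisy matches this would normally be wrapped in a robust estimator such as RANSAC, which is outside the scope of this sketch.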

From $E$ we can recover the direction $\hat{t}$ of the translation vector $t$ since we have $t^T E = 0$. Essential matrices have 5 DoF.

If the camera calibration $K_i$ is unknown, we obtain the criterion $$x_2^T K_2^{-T} E K_1^{-1} x_1 = 0 = x_2^T F x_1$$ where $F = K_2^{-T} E K_1^{-1}$ is the fundamental matrix. However, the intrinsic parameters cannot be directly determined and we obtain only a projective reconstruction. If there is additional information like vanishing points, constancy of $K$ across time, zero skew, aspect ratio etc., this can be upgraded to a metric reconstruction.

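Now that we have recovered the camera poses from the point correspondences, we would like to recover the 3D coordinates of the points (triangulation). Write $x_i = P_i X$ for the projection of the 3D point $X$ into view $i$, so that $[x_i]_\times P_i X = 0$, where $$[x_i]_\times = \begin{bmatrix} 0 & -1 & y_i \\ 1 & 0 & -x_i \\ -y_i & x_i & 0  \end{bmatrix}$$ and $(x_i, y_i)$ are the image coordinates of the observation. Thus, if $P_i = [p_i^1|p_i^2|p_i^3]^T$, we have $$[x_i]_\times P_i =  \begin{bmatrix} -(p_i^2)^T + y_i (p_i^3)^T \\ (p_i^1)^T - x_i (p_i^3)^T \\ -y_i (p_i^1)^T + x_i (p_i^2)^T  \end{bmatrix} $$ The third row is a linear combination of the first two and therefore redundant, so we have $$A_i X = \begin{bmatrix} -(p_i^2)^T + y_i (p_i^3)^T \\ (p_i^1)^T - x_i (p_i^3)^T   \end{bmatrix} X = 0$$ Stacking the $A_i$ of all views and solving the resulting homogeneous system with the SVD is the *direct linear transformation* (DLT).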
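The linear triangulation above can be sketched as follows. I assume each camera is given as a $3 \times 4$ projection matrix and each observation as inhomogeneous image coordinates. The function name `triangulate_dlt` is illustrative.

```python
import numpy as np

def triangulate_dlt(points_2d, projections):
    """Triangulate one 3D point from its observations in >= 2 views.

    points_2d: sequence of (x, y) observations, one per view.
    projections: sequence of 3x4 projection matrices, one per view.
    """
    rows = []
    for (x, y), P in zip(points_2d, projections):
        # The two independent rows of [x]_x P for this view.
        rows.append(-P[1] + y * P[2])
        rows.append(P[0] - x * P[2])
    A = np.stack(rows)
    # The homogeneous solution is the right singular vector belonging to
    # the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize
```

The final division assumes the point is not at infinity, i.e. that the fourth homogeneous coordinate is nonzero.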

While DLT works well, it is not invariant to perspective transformations. The gold standard is to minimize the re-projection error $$X^* = \arg\min_{X} \sum_{i=1}^{N} \lvert\lvert P_i(X) - x_i \rvert\rvert^2$$ using techniques like Levenberg-Marquardt.

### **Bundle Adjustment**

Let $\Pi = \{\pi_i\}$ denote the $N$ camera matrices and $\mathcal{X}_w = \{X_p\}$ the set of $P$ 3D points in world coordinates. Let $\mathcal{X}_s = \{x_{i,p}\}$ be the image observations, where $x_{i,p}$ is the observation of point $p$ in camera $i$.

Bundle adjustment minimizes the re-projection error $$\Pi^*, \mathcal{X}_w^* = \arg\min_{\Pi, \mathcal{X}_w} \sum_{i=1}^{N} \sum_{p=1}^{P} W_{i,p} \lvert\lvert x_{i,p} - \pi_i(X_p) \rvert\rvert^2$$ Here $W_{i,p} = 1$ exactly when point $p$ is observed in image $i$ (and $0$ otherwise) and $\pi_i(X_p)$ is the de-homogenized projection of $X_p$ into image $i$ (we divide by the third coordinate).
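As a sanity check on the notation, the objective can be evaluated with a few array operations. This sketch only computes the error, it does not optimize anything. Modelling each $\pi_i$ as a $3 \times 4$ projection matrix is my simplifying assumption, and the function name `reprojection_error` is illustrative.

```python
import numpy as np

def reprojection_error(cameras, points_3d, observations, visibility):
    """Weighted sum of squared re-projection errors.

    cameras: (N, 3, 4) projection matrices.
    points_3d: (P, 3) world points.
    observations: (N, P, 2) observed image points x_{i,p}.
    visibility: (N, P) array with W_{i,p} in {0, 1}.
    """
    ones = np.ones((len(points_3d), 1))
    X_h = np.concatenate([points_3d, ones], axis=1)      # (P, 4)
    proj = np.einsum('nij,pj->npi', cameras, X_h)        # (N, P, 3)
    proj = proj[..., :2] / proj[..., 2:3]                # divide by third coordinate
    sq_err = np.sum((observations - proj) ** 2, axis=-1)  # (N, P)
    return float(np.sum(visibility * sq_err))
```

Entries of `observations` where `visibility` is zero are ignored, so they can hold arbitrary finite placeholder values.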

Modern structure-from-motion approaches perform *incremental bundle adjustment*, where a carefully selected two-view reconstruction is used for initialization and new images/cameras are iteratively added to the reconstruction.

The issues with bundle adjustment are:

1. It is a non-convex optimization problem and needs a good initialization to avoid getting trapped in bad local minima.
2. Initializing all 3D points and cameras jointly is difficult, and so incremental bundle adjustment iteratively adds new images or cameras.
3. Since there can be up to millions of features and thousands of cameras, large scale bundle adjustment is computationally demanding.

Luckily the problem is usually sparse in practice, so efficient sparse implementations can be used.

#### **Incremental Structure-from-Motion**



There are two stages in the COLMAP pipeline:

1. First, robust 2D features are found and matched across images.
2. Then cameras are added incrementally, starting from two views.

Let us look at the image registration step in a bit more detail. Suppose that we have a new image with 2D-3D correspondences $(x_i, X_i)$ to the current reconstruction. Once the new image is registered, new correspondences are triangulated and then a local bundle adjustment is performed on only the locally connected images (global bundle adjustment is performed only once in a while for efficiency).

Given a reconstruction, we can obtain a dense reconstruction using epipolar geometry.

## **Lecture 4: Stereo Reconstruction**

### **Preliminaries**

*Disparity* is the relative displacement between corresponding pixels in two images of the same scene, and it is inversely proportional to depth. We wish to recover the disparity for each pixel and from it obtain a dense 3D model from two images of a static scene.

The pipeline for two-view stereo matching is:

1. Calibrate cameras.
2. Rectify images given the calibration.
3. Compute the disparity map for the reference image.
4. Remove outliers using consistency/occlusion tests.
5. Obtain depth from disparity using camera calibration.
6. Construct a 3D model via techniques like volumetric fusion and meshing.

The pipeline for 3D reconstruction is:

1. Take input images and compute camera poses.
2. Use camera poses to compute dense correspondences.
3. Use the dense correspondences to form depth maps.
4. Use depth maps with a depth fusion method to obtain a 3D reconstruction.

If $x_1, x_2$ are the two images of a 3D point, we have $x_2^T E x_1 = 0$. The correspondence point $x_2$ of a pixel $x_1$ lies on the epipolar line $Ex_1$, and the correspondence of a pixel $x_2$ in the second image lies on $E^T x_2$. This means we only need to search for the correspondence point on the epipolar line.

If both cameras face in the same direction (no rotation) and the translation vector is parallel to the image planes, the epipoles will lie at infinity. The epipolar lines will be parallel to the translation vector between the two images and so if we adjust for this, we will only have to look for correspondences using a *horizontal line search*. This simplifies the implementation of this method.

Even if the images are not in the required setup, we can warp them through a rotation that maps both the image planes to a common plane parallel to the baseline. This is called *rectification* and is possible even without knowing a 3D reconstruction of the images.

Let us first assume $K_1 = K_2 = R = I$ and $\mathbf{t} = (t,0,0)$. In this case the essential matrix is given by $$E = [\mathbf{t}]_\times R = \begin{bmatrix} 0 & 0  & 0 \\ 0 & 0 & -t \\ 0 & t  & 0\end{bmatrix}$$ This means the epipolar constraint becomes $x_2^T E x_1 = 0 = -t y_2 + ty_1$. Thus, $y_1 = y_2$.

Now we want to find a rectifying rotation $R_{\text{rect}} = [r_1, r_2, r_3]^T$, where $r_1 = \frac{\mathbf{t}}{\lvert\lvert \mathbf{t} \rvert\rvert}$, $r_2 = [0,0,1]^T \times r_1$ (normalized to unit length) and $r_3 = r_1 \times r_2$. The slightly strange definition of $r_2$ is chosen so that the new optical axis $r_3$ stays as close as possible to the original $z$-direction.

Then we have $R_{\text{rect}} \frac{\mathbf{t}}{\lvert\lvert \mathbf{t}\rvert\rvert} = R_{\text{rect}} r_1 = [1,0,0]^T$, just as we require.
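A short sketch of this construction, assuming the translation is not parallel to the optical axis (otherwise the cross product vanishes). The function name `rectifying_rotation` is mine, and I normalize $r_2$ explicitly so that the result is a proper rotation.

```python
import numpy as np

def rectifying_rotation(t):
    """Build R_rect with rows r1, r2, r3 from the translation vector t."""
    r1 = t / np.linalg.norm(t)           # new x-axis: along the baseline
    r2 = np.cross([0.0, 0.0, 1.0], r1)   # orthogonal to baseline and old z-axis
    r2 /= np.linalg.norm(r2)             # undefined if t is parallel to z
    r3 = np.cross(r1, r2)                # completes the right-handed frame
    return np.stack([r1, r2, r3])
```

Because the rows are orthonormal and right-handed, the result has determinant $+1$ and maps the baseline direction onto the $x$-axis.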

So the rectification algorithm is:

1. Estimate $E$ and decompose it into $\mathbf{t}$ and $R$.
2. Construct $R_{\text{rect}}$.
3. Warp pixels in the first image as $x_1' = KR_{\text{rect}} K_1^{-1} x_1$, and
4. Warp pixels in the second image as $x_2' = K R R_{\text{rect}} K_2^{-1} x_2$,
   
where $K$ is a shared projection matrix that can be chosen arbitrarily (for example, $K = K_1$).

In practice the inverse of the above transformations is applied on destination pixels to query the pixel which is warped to it. This will not land exactly in an integer location and some sort of bilinear or cubic interpolation is used.
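The interpolation step can be sketched for a single query location. This is a plain bilinear lookup on a grayscale image, with coordinates clamped so that the four neighbors stay inside the image. The function name `bilinear_sample` and the clamping policy are my own choices.

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Sample a grayscale image (H, W), H, W >= 2, at real-valued (x, y)."""
    h, w = image.shape
    # Top-left corner of the 2x2 neighborhood used for interpolation.
    x0 = int(np.clip(np.floor(x), 0, w - 2))
    y0 = int(np.clip(np.floor(y), 0, h - 2))
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two rows, then along y between them.
    top = (1 - ax) * image[y0, x0] + ax * image[y0, x0 + 1]
    bottom = (1 - ax) * image[y0 + 1, x0] + ax * image[y0 + 1, x0 + 1]
    return float((1 - ay) * top + ay * bottom)
```

In a full warp, this lookup would be called once per destination pixel, at the source location given by the inverse transformation.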

The relative horizontal displacement of points of different depths is called **disparity** and it is inversely proportional to depth. We want to recover depth from estimated disparity.

Let $z$ be the true depth of a point, $b$ the baseline (the distance between camera centers after rectification), $d$ the disparity, and $f$ the focal length. By similar triangles we have $$\frac{z-f}{b - d} = \frac{z}{b}$$ which rearranges to $z = \frac{fb}{d}$.
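Applied to a whole disparity map, the conversion is a one-liner plus some care around zero disparity. In this sketch I assume the focal length and disparity are in pixels and the baseline in meters, so the depth comes out in meters. The function name `disparity_to_depth` is illustrative.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length, baseline):
    """Convert disparities to depths via z = f * b / d.

    Pixels with non-positive disparity are assigned infinite depth.
    """
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_length * baseline / disparity[valid]
    return depth
```

For example, with $f = 700$ px and $b = 0.5$ m, a disparity of 10 px corresponds to a depth of 35 m, and doubling the disparity halves the depth.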

### **Block Matching**

To identify the disparity between two images we will need to determine if two image points correspond and even what it means for this to be so. This is complicated by the fact that a single pixel doesn't reveal local structure, and even a small region around a point of interest can look very different due to changes in perspective or illumination. 

When performing block matching we query a point in the input image and sweep the horizontal direction and compare patches. We only need to do this to the left of the corresponding point in the right image because disparity has to be positive.

Say we have $K \times K$ windows of pixels flattened to $w_L, w_R \in \mathbb{R}^{K^2}$. We want to compare these patches, and a common method is zero-normalized cross-correlation (ZNCC) $$\verb|ZNCC|(x,y,d) = \frac{\left(w_L(x,y) - \overline{w_L}(x,y)\right)^T\left(w_R(x-d,y) - \overline{w_R}(x-d,y)\right)}{\lvert\lvert w_L(x,y) - \overline{w_L}(x,y) \rvert\rvert \,\cdot\, \lvert\lvert w_R(x-d,y) - \overline{w_R}(x-d,y) \rvert\rvert}$$ where $\overline{w}$ denotes the mean of the patch. Another common method is the sum of squared differences (SSD) $$\verb|SSD|(x,y,d) = \lvert\lvert w_L(x,y) - w_R(x-d,y) \rvert\rvert^2$$
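For a single pair of patches, ZNCC is easy to write down. The small `eps` in the denominator is my addition to avoid dividing by zero on constant patches, and the function name `zncc` is illustrative.

```python
import numpy as np

def zncc(patch_left, patch_right, eps=1e-8):
    """Zero-normalized cross-correlation of two equally sized patches."""
    a = patch_left.astype(float).ravel()
    b = patch_right.astype(float).ravel()
    # Subtracting the mean removes additive brightness changes ...
    a = a - a.mean()
    b = b - b.mean()
    # ... and dividing by the norms removes multiplicative ones.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

The score lies in $[-1, 1]$, and a patch compared with a brightened or contrast-scaled copy of itself scores close to 1, which is why ZNCC is more robust to illumination changes than SSD.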

So the block matching algorithm performs the following steps:

1. Choose a disparity range $[0,D]$.
2. For all pixels $x$, compute the best disparity (winner takes all).
3. Do this for both images and apply a left-right consistency check to remove outliers.
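The first two steps can be sketched as a brute-force loop. This version uses SSD as the matching cost, computes the disparity map for the left image only, and leaves border pixels at zero. It is written for clarity, not speed, and the function name `block_matching` is mine.

```python
import numpy as np

def block_matching(left, right, max_disp, window=5):
    """Winner-takes-all disparity map for the left image using SSD costs."""
    h, w = left.shape
    half = window // 2
    left = left.astype(float)
    right = right.astype(float)
    disparity = np.zeros((h, w), dtype=int)
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch_l = left[y - half:y + half + 1, x - half:x + half + 1]
            best_cost, best_d = np.inf, 0
            # Search only to the left in the right image (d >= 0) and
            # keep the window inside the image.
            for d in range(0, min(max_disp, x - half) + 1):
                patch_r = right[y - half:y + half + 1,
                                x - d - half:x - d + half + 1]
                cost = np.sum((patch_l - patch_r) ** 2)
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity
```

Swapping the SSD line for a negated ZNCC score would give the correlation-based variant with the same loop structure.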

When looking at the results of the block matching algorithm we see artifacts around the edges of the objects. These are caused by half occlusions where the patch is occluded in one of the images.

Block matching assumes that all the pixels inside a window are displaced by the same disparity $d$. This is called the *fronto-parallel assumption* and is often invalid. Slanted surfaces deform perspectively when the viewpoint changes. Also, the window content changes differently at discontinuities of disparity.

So there is a tradeoff between small window sizes that lead to matching ambiguities and noise in the disparity map and large window sizes that lead to smoother results but loss of details and border bleeding.

Outliers and half-occlusions can be detected via a left-right consistency test. For every pixel in the left image we compute the disparity and move along that disparity in the right image to find the corresponding pixel. Then we look up the disparity computed for that pixel in the right image's disparity map and check whether it leads back to our original pixel.
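A vectorized sketch of this test, assuming integer disparity maps for both images. My convention here is that both maps store non-negative values: a left pixel $(x, y)$ matches the right pixel $(x - d, y)$, and the right map stores the same $d$ for that match. The function name `left_right_consistency` and the `threshold` parameter are illustrative.

```python
import numpy as np

def left_right_consistency(disp_left, disp_right, threshold=1):
    """Boolean mask of left-image pixels that pass the consistency test."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    ys = np.arange(h)[:, None].repeat(w, axis=1)
    # Left pixel (x, y) corresponds to right pixel (x - d, y).
    x_right = xs - disp_left
    inside = x_right >= 0
    x_right = np.clip(x_right, 0, w - 1)
    # Disparity stored in the right map at the matched location.
    disp_back = disp_right[ys, x_right]
    # Consistent if both maps agree up to the threshold.
    return inside & (np.abs(disp_left - disp_back) <= threshold)
```

Pixels that fail the test are typically marked invalid and filled in later, for instance by the spatial regularization discussed further below in these notes.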

### **Siamese Networks**

We would like to learn similarity metrics from data because hand-crafted features do not take into consideration relevant geometric and radiometric invariances or occlusion patterns. Zbontar and LeCun (2016) showed that the computation of matching cost can be treated as an image classification problem.

The method assumes a large disparity dataset and trains a CNN patch-wise on images with ground truth disparity maps. Once we have trained a patch classifier, we can calculate features for both images, correlate the features using a dot product or an MLP, and take a "winner takes all" approach. Then a global optimization algorithm is run that incorporates some smoothness assumptions about the problem.

The original paper proposed two variants of this Siamese network architecture.

The *learned similarity* method applies the same CNN with ReLU nonlinearities (shared weights) to both the left and right input patches. The features from both halves are then passed into a fully connected network trained to output a similarity score. This is a potentially more expressive solution than the one below, but it is very slow due to the MLP at the end.

The *cosine similarity* method eliminates the fully connected network and applies a dot product to the normalized features to obtain the similarity score. This makes the algorithm much faster with no appreciable drop in performance.

The training set is composed of patch triplets $(w_L(x_L), w_R(x_R^{\verb|neg|}), w_R(x_R^{\verb|pos|}))$ where $w_L(x_L)$ is a patch from the left image centered at $x_L$, $w_R(x_R)$ is a patch from the right image centered at $x_R$, and $d$ is the ground truth disparity at $x_L$. The negative example is a patch centered at $(x_L - d + o_-, y_L)$ where $o_-$ is an offset drawn uniformly from $[-N_h, -N_l] \cup [N_l, N_h]$. The positive example is similarly a patch centered at $(x_L - d + o_+, y_L)$, where $o_+$ is drawn uniformly from $[-P_h, P_h]$.

Typically $P_h = 1$, $[N_l, N_h] = [3,6]$ so that negative examples are quite close to positive examples but still far enough away.

We use a hinge loss function $\ell = \max(0, m + s_- - s_+)$ where $s_+$ and $s_-$ are the scores for the positive and negative examples respectively. The loss is zero when the similarity of the positive example is greater than the similarity of the negative example by a margin of at least $m$. This ensures that further separation of well-separated examples is not pursued and allows the model to focus on the hard cases.

The margin $m$ is a tunable hyperparameter, but generally $m = 0.2$.
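The loss for the cosine similarity variant can be sketched with plain NumPy arrays standing in for the CNN features. The function names `cosine_similarity` and `hinge_loss` are mine, and the feature arrays are assumed to have shape (batch, feature_dim).

```python
import numpy as np

def cosine_similarity(f1, f2, eps=1e-8):
    """Dot product of L2-normalized feature vectors along the last axis."""
    f1 = f1 / (np.linalg.norm(f1, axis=-1, keepdims=True) + eps)
    f2 = f2 / (np.linalg.norm(f2, axis=-1, keepdims=True) + eps)
    return np.sum(f1 * f2, axis=-1)

def hinge_loss(feat_left, feat_pos, feat_neg, margin=0.2):
    """Mean hinge loss max(0, m + s_neg - s_pos) over a batch of triplets."""
    s_pos = cosine_similarity(feat_left, feat_pos)
    s_neg = cosine_similarity(feat_left, feat_neg)
    return float(np.maximum(0.0, margin + s_neg - s_pos).mean())
```

A triplet whose positive score already beats the negative score by more than the margin contributes zero loss, so only the hard triplets would drive the gradient during training.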

### **Spatial Regularization**

We would also like to incorporate global optimization in addition to locally picking a matching patch.

The underlying assumption in matching is that corresponding regions in both images should look similar and non-corresponding regions should look different. But this constraint can fail in some cases like:

* surfaces without texture,
* occlusions, and
* non-Lambertian surfaces.

Such local ambiguities cannot be overcome by only using local features. So we need to use some form of global information.

In the real world, for example the Brown range image database, we see that depth varies slowly except at discontinuities which are sparse.

We incorporate this constraint into the disparity estimation process by specifying a loopy Markov random field (MRF) on a grid and solving for the whole disparity map $D$ at once. The MAP solution is then a minimum energy solution since we have $$p(D) = k \exp \left(- \sum_{i} \psi_{data}(d_i)  - \lambda \sum_{i \sim j} \psi_{smooth}(d_i, d_j)\right)$$ where $k$ is some constant.

Here the terms mean:

 * $i \sim j$ means the pixels are neighboring on a 4-connected grid
 * $d_i$ is the unknown disparity hypothesis at pixel $i$
 * The unary terms are the matching costs $\psi_{data}(d_i)$
 * The pairwise terms encourage smoothness between neighboring pixels. For example, two options are the Potts model $\psi_{smooth}(d,d') = \mathbb{1}_{d\neq d'}$ and the truncated $\ell_1$ penalty $\psi_\text{smooth}(d,d') = \min(\lvert d-d'\rvert, \tau)$.

We can approximately solve this MRF using the belief propagation algorithm.
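To make the energy concrete, the sketch below evaluates it for a given disparity map with the truncated $\ell_1$ smoothness term. It only scores a candidate solution and does not perform belief propagation. The function name `mrf_energy` and the layout of `data_cost` as an (H, W, D) array of matching costs are my assumptions.

```python
import numpy as np

def mrf_energy(disparity, data_cost, lam=1.0, tau=2):
    """Energy of a disparity map under the grid MRF.

    disparity: (H, W) integer disparities d_i.
    data_cost: (H, W, D) matching cost for every pixel and disparity.
    """
    h, w = disparity.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Unary term: matching cost of the chosen disparity at every pixel.
    unary = data_cost[ys, xs, disparity].sum()
    # Pairwise term: truncated L1 over horizontal and vertical neighbors,
    # i.e. every edge of the 4-connected grid counted once.
    dx = np.abs(np.diff(disparity, axis=1))
    dy = np.abs(np.diff(disparity, axis=0))
    pairwise = np.minimum(dx, tau).sum() + np.minimum(dy, tau).sum()
    return float(unary + lam * pairwise)
```

The truncation at $\tau$ caps the penalty for large jumps, so genuine depth discontinuities are not punished more than moderate ones.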

We can extend this idea of using Markov random fields to model more global relationships, and not just pairwise relationships. For example, disparities and objects can be modelled jointly to give constraints that span larger distances.

### **End-to-End Learning**

These algorithms directly take images as input and output a full disparity map. This requires a lot of training data and became feasible only after the emergence of large datasets for this task.

#### **DispNet**

DispNet was the first end-to-end model for stereo. It used a U-Net-like architecture with skip connections to retain details.



The loss function was a multi-scale loss (disparity error in pixels) that was also applied at intermediate layers against downscaled versions of the ground truth. *Curriculum learning* was used: the network is first trained on easy examples before the difficulty of the data is increased.

Creating datasets of 3D imagery with ground truth is difficult, so the model of Mayer et al. (2016) was pretrained on large synthetic datasets with cheap annotations and then fine-tuned on a little real data.

#### **GC-Net**

A follow up work used a similar idea to DispNet but had improved performance.

The key idea was to calculate the disparity cost volume and apply *3D convolutions* to it. This is very memory intensive and so small batch sizes should be used.

It also converted the learned matching cost $c_\theta(d)$ to a disparity via the expectation $$d^* = \mathbb{E}[d] = \sum_{d=0}^{D} \verb|softmax|(-c_\theta(d)) d$$

In [9]:
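# Matching costs per disparity: low cost at d=0 and d=4, high cost at d=2.
c_of_d = torch.tensor([1.0,3.0,10.0,3.0,1.0])
d = torch.arange(*c_of_d.size())
# Soft argmin: softmax over the negated costs, as in the formula above.
torch.sum(d * F.softmax(-c_of_d, dim=0))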

tensor(2.)

In [12]:
# Low cost only at d=2, so the soft argmin agrees with the hard argmin.
c_of_d = torch.tensor([10.0,2.0,1.0,2.0,10.0])
d = torch.arange(*c_of_d.size())
torch.sum(d * F.softmax(-c_of_d, dim=0))

tensor(2.0000)

#### **Stereo Mixture Density Networks (SMD-Nets)**



Because of the intrinsic smoothness properties of neural networks we see bleeding artifacts at object boundaries. SMD-Nets predict sharper boundaries at higher resolution.

They do so by predicting a bimodal (Laplacian) *mixture distribution*, which allows accurate capture of uncertainty close to depth discontinuities. Another innovation was an MLP head that regresses the distribution parameters from interpolated features. This enables training and inference at arbitrary spatial resolution.

## **Lecture 5: Probabilistic Graphical Models**