# "The Role of Wide Baseline Stereo in the Deep Learning World"
> "Short history of wide baseline stereo in computer vision"
- toc: false
- image: images/doll_wbs_300.png
- branch: master
- badges: true
- comments: true
- hide: false
- search_exclude: true

## Rise of Wide Baseline Stereo

Wide baseline stereo (WBS) is the process of establishing correspondences between pixels and/or regions in
images depicting the same object or scene, and of estimating the geometric relationship between the cameras that produced those images.

![](00_intro_files/match_doll.png "Correspondences between two views found by wide baseline stereo algorithm. Photo and doll created by Olha Mishkina")


<!--- ![Wide baseline stereo model. "Baseline" is the distance between cameras. Image by Arne Nordmann (WikiMedia)](00_intro_files/Epipolar_geometry.svg) 
-->

One of the first successful solutions for the WBS problem was proposed by Schmid and Mohr \cite{Schmid1995} in 1995.
It was later extended by Beardsley, Torr and Zisserman\cite{Beardsley96}, who added RANSAC-based robust geometry estimation, and then refined by Pritchett and Zisserman \cite{Pritchett1998, Pritchett1998b} in 1998. The general pipeline has remained mostly the same ever since \cite{WBSTorr99, CsurkaReview2018}. The currently adopted version of the wide baseline stereo algorithm is shown below.

<!--- 
![image.png](00_intro_files/att_00002.png)
-->


![](00_intro_files/matching-filtering.png "Commonly used wide baseline stereo pipeline")


The algorithm can be summarized as follows:

1. Compute interest points/regions in all images independently.
2. For each interest point/region, compute a descriptor of its neighborhood (local patch).
3. Establish tentative correspondences between interest points based on their descriptors.
4. Robustly estimate the geometric relation between two images from the tentative correspondences with RANSAC.
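
Steps 3 and 4 above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the pipeline of any cited paper: descriptors are matched by plain nearest neighbour, the geometric model is a 2D translation (real WBS estimates a homography or epipolar geometry), and all function names and data are hypothetical.

```python
import math
import random

def match_nearest(desc_a, desc_b):
    """Step 3: pair every descriptor in A with its nearest descriptor in B."""
    matches = []
    for i, da in enumerate(desc_a):
        j = min(range(len(desc_b)), key=lambda k: math.dist(da, desc_b[k]))
        matches.append((i, j))
    return matches

def ransac_translation(pts_a, pts_b, matches, threshold=2.0, iterations=100, seed=0):
    """Step 4: RANSAC with a toy 2D translation model (one match per sample)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # Hypothesize a translation from one randomly chosen tentative match.
        i, j = rng.choice(matches)
        tx, ty = pts_b[j][0] - pts_a[i][0], pts_b[j][1] - pts_a[i][1]
        # Collect the matches that agree with this hypothesis.
        inliers = [
            (a, b) for a, b in matches
            if math.hypot(pts_a[a][0] + tx - pts_b[b][0],
                          pts_a[a][1] + ty - pts_b[b][1]) < threshold
        ]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (tx, ty), inliers
    return best_model, best_inliers

# Hypothetical toy data: five points of image A reappear in image B shifted by
# (10, -5); the last two points of A have look-alike descriptors at unrelated
# positions in B, so they produce wrong tentative matches.
pts_a = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5), (20, 20), (30, 5)]
desc_a = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
pts_b = [(3, 40), (50, 50), (15, 0), (20, 5), (10, 5), (20, -5), (10, -5)]
desc_b = list(reversed(desc_a))

matches = match_nearest(desc_a, desc_b)
model, inliers = ransac_translation(pts_a, pts_b, matches)
print("translation:", model, "inliers:", len(inliers), "of", len(matches))
```

Descriptor matching alone accepts all seven pairs, including the two wrong ones; the consensus step keeps only the matches that agree on a single geometric model.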

Steps 1 and 2 are done on each image separately because, in general, wide baseline stereo is not limited to pairs of images but is applied to collections of them. If all the steps are done pairwise, the computational complexity is $O(n^2)$ in the number of images $n$. The more steps are done separately per image, the more efficient the algorithm is.
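
To make the count concrete, here is a rough tally, assuming purely for illustration that running detection and description on one image costs one unit and that $n$ is the number of images:

$$
\underbrace{n}_{\text{steps 1 and 2 once per image}}
\qquad \text{vs.} \qquad
\underbrace{2 \cdot \frac{n(n-1)}{2} = n(n-1)}_{\text{steps 1 and 2 redone for every pair}}
$$

For $n = 100$ images that is $100$ runs instead of $9900$. If every pair of images is matched, steps 3 and 4 still have to be run for each of the $n(n-1)/2 = 4950$ pairs in either case.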


## Quick expansion 

This algorithm significantly changed the computer vision landscape for the next fourteen years.

Soon after the algorithm was introduced, it became clear that its quality depends significantly on the quality of each component, i.e. the local feature detector, the descriptor, and the geometry estimation. A plethora of new detectors and descriptors were proposed, most notably the SIFT local feature\cite{Lowe99}, one of the most cited computer vision papers ever.

It is worth noting that SIFT became popular only after Mikolajczyk's benchmark papers \cite{MikoDescEval2003, Mikolajczyk05} showed its superiority over the alternatives.

Robust geometry estimation was also a hot topic, and many improvements over vanilla RANSAC were proposed, e.g. LO-RANSAC\cite{LOransac2003}, DEGENSAC\cite{Degensac2005}, and MLESAC\cite{MLESAC00}.

The success of wide baseline stereo with SIFT features led to the application of its components to other computer vision tasks, which were reformulated through the wide baseline stereo lens:

-   **Scalable image search**. Sivic and Zisserman, in the famous "Video Google" paper\cite{VideoGoogle2003}, proposed to treat local features as "visual words" and to use ideas from text processing for searching in image collections. Later, even more WBS elements were re-introduced to image search, most notably **spatial verification**\cite{Philbin07}: a simplified RANSAC procedure that verifies whether visual word matches are spatially consistent.

![](00_intro_files/att_00004.png "Bag of words image search. Image credit: Filip Radenovic http://cmp.felk.cvut.cz/~radenfil/publications/Radenovic-CMPcolloq-2015.11.12.pdf")

- **Image classification** was performed by placing a classifier (SVM, random forest, etc.) on top of an encoding of SIFT-like descriptors, extracted sparsely\cite{Fergus03, CsurkaBoK2004} or densely\cite{Lazebnik06}.

![](00_intro_files/att_00005.png "Bag of local features representation for classification from Fergus03")

- **Object detection** was formulated as a relaxed wide baseline stereo problem\cite{Chum2007Exemplar} or as classification of SIFT-like features inside a sliding window \cite{HoG2005}.

![](00_intro_files/att_00003.png "Exemplar-representation of the classes using local features, from Chum2007Exemplar")

<!--- 
![HoG-based pedestrian detection algorithm](00_intro_files/att_00006.png)
![Histogram of gradient visualization](00_intro_files/att_00007.png)
-->

- **Semantic segmentation** was performed by classification of local region descriptors, typically SIFT and color features, followed by postprocessing\cite{Superparsing2010}.
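
The "visual words" idea behind the image search and classification approaches above can be sketched as follows. This is a minimal illustration, assuming a tiny hand-made vocabulary and 2D toy descriptors; the function names and data are hypothetical, and real systems learn the vocabulary by clustering a large number of descriptors.

```python
import math
from collections import Counter

def quantize(descriptors, vocabulary):
    """Map each local descriptor to the index of its nearest visual word."""
    return [
        min(range(len(vocabulary)), key=lambda k: math.dist(d, vocabulary[k]))
        for d in descriptors
    ]

def bow_histogram(descriptors, vocabulary):
    """Bag-of-words vector: how often each visual word occurs in the image."""
    counts = Counter(quantize(descriptors, vocabulary))
    return [counts.get(k, 0) for k in range(len(vocabulary))]

def cosine_similarity(u, v):
    """Cosine of the angle between two histograms (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy vocabulary and descriptors (assumption: 2D instead of
# high-dimensional SIFT-like vectors, three words instead of thousands).
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
query = [(0.1, 0.0), (0.9, 0.1), (1.0, 0.1)]
database = {
    "similar": [(0.0, 0.1), (1.1, 0.0), (0.9, 0.0)],
    "different": [(0.1, 0.9), (0.0, 1.1), (0.0, 1.0)],
}

q_hist = bow_histogram(query, vocabulary)
# Rank database images by similarity of their word histograms to the query.
ranking = sorted(
    database,
    key=lambda name: cosine_similarity(q_hist, bow_histogram(database[name], vocabulary)),
    reverse=True,
)
print(q_hist, ranking)
```

The same histogram can be fed to a classifier instead of a ranking function, which is roughly how the bag-of-features classification pipelines above reuse the representation.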


Of course, wide baseline stereo was also used for its direct applications:

 - **3D reconstruction** was based on camera poses and 3D points estimated with the help of SIFT features \cite{PhotoTourism2006, RomeInDay2009, COLMAP2016}.
 
![](00_intro_files/att_00008.png "SfM pipeline from COLMAP")
 
 - **SLAM (simultaneous localization and mapping)** systems \cite{Se02, PTAM2007, Mur15} were based on fast versions of local feature detectors and descriptors.
 <!--- 
![ORBSLAM pipeline](00_intro_files/att_00009.png)
-->
 
 - **Panorama stitching** \cite{Brown07} and, more generally, **feature-based image registration**\cite{DualBootstrap2003} were initialized with a geometry obtained by WBS and then further optimized.

## Deep Learning Invasion: retreat to the geometrical fortress


In 2012, the deep learning-based AlexNet\cite{AlexNet2012} approach beat all other methods in image classification. Soon after, Razavian et al.\cite{Astounding2014} showed that convolutional neural networks (CNNs) pre-trained on ImageNet outperform more complex traditional solutions in image and scene classification, object detection, and image search.
Deep learning solutions, be they pretrained or end-to-end learned networks, quickly became the default for most computer vision problems.

![](00_intro_files/att_00010.png "CNN representation beats complex traditional pipelines. Reds are CNN-based and greens are the handcrafted. From Astounding2014")


However, there was still one area where deep learned solutions failed, sometimes spectacularly: geometry-related tasks. Wide baseline stereo\cite{Melekhov2017relativePoseCnn}, visual localization\cite{PoseNet2015}, and SLAM are still areas where classical wide baseline stereo dominates\cite{sattler2019understanding, zhou2019learn}.

The full reasons why convolutional pipelines fail on geometric tasks are not yet understood, but the current hypotheses are the following:

- CNN-based pose predictions are roughly equivalent to retrieving the most similar image from the training set and outputting its pose\cite{sattler2019understanding}. This phenomenon is also observed in a related area: single-view 3D reconstruction\cite{Tatarchenko2019}.
- Geometric and arithmetic operations are hard to represent via vanilla neural networks (i.e. matrix multiplication with non-linearity) and they may require specialized building blocks, resembling operations of algorithmic or geometric methods, e.g. spatial transformers\cite{STN2015} and arithmetic units\cite{NALU2018,NAU2020}. Even with a special structure, such networks require "careful initialization, restricting parameter space, and regularizing for sparsity"\cite{NAU2020}.
- Vanilla CNNs are not covariant to even simple geometric transformations like translation \cite{MakeCNNShiftInvariant2019}, scaling, and especially rotation \cite{GroupEqCNN2016}. Unlike them, the WBS pipeline is grounded in scale-space theory \cite{lindeberg2013scale}, and local patches are geometrically normalized before description.
- Predictions of CNNs can be altered by a change in a small localized area \cite{AdvPatch2017} or even a single pixel \cite{OnePixelAttack2019}, while wide baseline stereo methods require the consensus of different independent regions.
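
The translation issue from the list above can be reproduced with a few lines of plain Python. The sketch assumes that strided downsampling (here a stride-2 max pooling on a 1D signal) is the culprit, which is one commonly cited cause rather than the full story; the helper names are hypothetical.

```python
def max_pool(signal, size=2, stride=2):
    """1D max pooling without padding."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, stride)]

def shift_right(signal, n):
    """Shift a signal n samples to the right, padding with zeros on the left."""
    return [0] * n + signal[:len(signal) - n]

signal = [0, 0, 1, 3, 1, 0, 0, 0]
pooled = max_pool(signal)                          # [0, 3, 1, 0]
pooled_shift1 = max_pool(shift_right(signal, 1))   # [0, 1, 3, 0]
pooled_shift2 = max_pool(shift_right(signal, 2))   # [0, 0, 3, 1]

# Shifting the input by the stride shifts the output by exactly one sample ...
print(pooled_shift2 == shift_right(pooled, 1))
# ... but after a one-sample shift the output is neither unchanged nor a
# shifted copy of the original output.
print(pooled_shift1 in (pooled, shift_right(pooled, 1)))
```

In this toy case a one-sample shift even swaps the order of the two strongest responses, which is the kind of instability that geometric normalization of patches is meant to avoid.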

## Today: assimilation and merging

### Wide baseline stereo as a task: formulate differentiably and learn modules
Today, wide baseline stereo as a task is typically solved by replacing specific blocks of the WBS algorithm with learned components\cite{jin2020image}, e.g. local descriptors like HardNet\cite{HardNet2017}, detectors like KeyNet\cite{KeyNet2019}, joint detector-descriptors like SuperPoint\cite{SuperPoint2017}, and matching and filtering modules like SuperGlue\cite{sarlin2019superglue}.
There are also attempts to formulate whole downstream task pipelines, such as SLAM\cite{gradslam2020}, in a differentiable way, combining the advantages of structured and learning-based approaches.

![](00_intro_files/att_00011.png "SuperGlue: separate matching module for handcrafted and learned features")

![](00_intro_files/gradslam.png "gradSLAM: differentiable formulation of SLAM pipeline")


### Wide baseline stereo as an idea: consensus of local independent predictions

On the other hand, as an algorithm, wide baseline stereo can be summarized in two main ideas:

1. An image should be represented as a set of local parts that are robust to occlusion and do not influence each other.
2. The decision should be based on the spatial consensus of local feature correspondences.


One modern revisit of wide baseline stereo ideas is Capsule Networks\cite{CapsNet2011,CapsNet2017}. Unlike CNNs, they encode not only the intensity of a feature response but also its location, and they require geometric agreement between object parts to output a confident prediction.

Similar ideas are now explored for ensuring adversarial robustness of CNNs\cite{li2020extreme}.

While wide baseline stereo is far from the mainstream now, it continues to play an important role in computer vision.

![](00_intro_files/capsules.png "Capsule networks: revisiting the WBS idea. Each feature response is accompanied by its pose. Poses should be in agreement, otherwise the object would not be recognized. Image by Aurélien Géron https://www.oreilly.com/content/introducing-capsule-networks/")

# References

[<a id="cit-Schmid1995" href="#call-Schmid1995">Schmid1995</a>] Schmid Cordelia and Mohr Roger, ``_Matching by local invariants_'', 1995.  [online](https://hal.inria.fr/file/index/docid/74046/filename/RR-2644.pdf)

