# "The Role of Wide Baseline Stereo in the Deep Learning World"
> "Short history of wide baseline stereo in computer vision"
- toc: false
- image: images/doll_wbs_300.png
- branch: master
- badges: true
- comments: true
- hide: false
- search_exclude: true

## Rise of Wide Baseline Stereo

Wide baseline stereo (WBS) is the process of establishing correspondences between pixels and/or regions in
images depicting the same object or scene, and of estimating the geometric relationship between the cameras that produced those images.

![](00_intro_files/match_doll.png "Correspondences between two views found by wide baseline stereo algorithm. Photo and doll created by Olha Mishkina")


<!--- ![Wide baseline stereo model. "Baseline" is the distance between cameras. Image by Arne Nordmann (WikiMedia)](00_intro_files/Epipolar_geometry.svg) 
-->

One of the first successful solutions to the WBS problem was proposed by Schmid and Mohr \cite{Schmid1995} in 1995.
It was later extended by Beardsley, Torr and Zisserman \cite{Beardsley96}, who added RANSAC-based robust geometry estimation, and further refined by Pritchett and Zisserman \cite{Pritchett1998, Pritchett1998b} in 1998. The general pipeline has remained mostly the same ever since \cite{WBSTorr99, CsurkaReview2018}. The currently adopted version of the wide baseline stereo algorithm is shown below.

<!--- 
![image.png](00_intro_files/att_00002.png)
-->


![](00_intro_files/matching-filtering.png "Commonly used wide baseline stereo pipeline")


The algorithm can be summarized as follows:

1. Compute interest points/regions in all images independently.
2. For each interest point/region, compute a descriptor of its neighborhood (local patch).
3. Establish tentative correspondences between interest points based on their descriptors.
4. Robustly estimate the geometric relation between two images from the tentative correspondences with RANSAC.
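To make steps 3 and 4 more concrete, here is a minimal Python sketch. It is a toy under loud assumptions, not the pipeline from any of the cited papers: the geometric model is a pure 2D translation (real WBS estimates a homography or epipolar geometry), the descriptors are tiny hand-made vectors, and the names `match_nearest` and `ransac_translation`, the threshold and the iteration count are all made up for illustration.

```python
import random

def match_nearest(desc_a, desc_b):
    """Step 3 (toy version): match each descriptor in A to its nearest one in B."""
    matches = []
    for i, da in enumerate(desc_a):
        dists = [sum((x - y) ** 2 for x, y in zip(da, db)) for db in desc_b]
        j = min(range(len(desc_b)), key=dists.__getitem__)
        matches.append((i, j))
    return matches

def ransac_translation(pts_a, pts_b, matches, thresh=2.0, iters=100, seed=0):
    """Step 4 (toy version): find the translation supported by most matches."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        # Minimal sample: one correspondence fully defines a translation.
        i, j = rng.choice(matches)
        dx, dy = pts_b[j][0] - pts_a[i][0], pts_b[j][1] - pts_a[i][1]
        # Count correspondences that agree with this hypothesis.
        inliers = [(p, q) for p, q in matches
                   if abs(pts_a[p][0] + dx - pts_b[q][0]) <= thresh
                   and abs(pts_a[p][1] + dy - pts_b[q][1]) <= thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (dx, dy), inliers
    return best_model, best_inliers

# Hypothetical keypoints and descriptors of image A.
pts_a = [(10, 10), (40, 15), (25, 60), (70, 30), (55, 80)]
desc_a = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5), (0.9, 0.9), (0.2, 0.7)]
# Image B: the first four points are shifted by (5, -3); the last one looks
# the same (identical descriptor) but sits elsewhere, i.e. it is a false match.
pts_b = [(15, 7), (45, 12), (30, 57), (75, 27), (0, 0)]
desc_b = list(desc_a)

matches = match_nearest(desc_a, desc_b)
model, inliers = ransac_translation(pts_a, pts_b, matches)
print(model, len(inliers))  # prints: (5, -3) 4
```

Four of the five tentative correspondences agree on the shift `(5, -3)`. The fifth one, which is deliberately wrong, is rejected because no other correspondence supports it.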

Steps 1 and 2 are done on each image separately because, in general, wide baseline stereo is not limited to pairs of images but is applied to collections of them. If all the steps were done pairwise, the computational cost of every step would grow as $O(n^2)$ in the number of images $n$. The more steps that are done separately per image, the more efficient the algorithm is.
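A quick back-of-the-envelope count shows the difference. The helper below is purely illustrative: the function name and the choice of "one detector+descriptor run" as the unit of cost are assumptions made here, and matching every image against every other one is assumed.

```python
def extraction_runs(n_images, per_image=True):
    """Count detector+descriptor runs needed to match all image pairs."""
    n_pairs = n_images * (n_images - 1) // 2
    if per_image:
        # Steps 1-2 run once per image; the results are reused for every pair.
        return n_images
    # Steps 1-2 are repeated for both images of every pair.
    return 2 * n_pairs

for n in (2, 10, 100):
    print(n, extraction_runs(n), extraction_runs(n, per_image=False))
# prints:
# 2 2 2
# 10 10 90
# 100 100 9900
```

Matching and RANSAC (steps 3 and 4) stay pairwise in either case, so the saving comes entirely from the per-image steps.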


## Quick expansion 

This algorithm significantly changed the computer vision landscape for the next fourteen years.

Soon after the algorithm was introduced, it became clear that its quality depends significantly on the quality of each component, i.e. the local feature detector, the descriptor, and the geometry estimation. A plethora of new detectors and descriptors were proposed, among them the SIFT local feature\cite{Lowe99}, the most cited computer vision paper ever.

It is worth noting that SIFT became popular only after the Mikolajczyk benchmark papers \cite{MikoDescEval2003, Mikolajczyk05} showed its superiority over the alternatives.

Robust geometry estimation was also a hot topic, and many improvements over vanilla RANSAC were proposed: LO-RANSAC\cite{LOransac2003}, DEGENSAC\cite{Degensac2005}, MLESAC\cite{MLESAC00}.

The success of wide baseline stereo with SIFT features led to the application of its components to other computer vision tasks, which were reformulated through the wide baseline stereo lens:

-   **Scalable image search**. Sivic and Zisserman, in their famous "Video Google" paper\cite{VideoGoogle2003}, proposed to treat local features as "visual words" and to use ideas from text processing for searching in image collections. Later, even more WBS elements were re-introduced to image search, most notably **spatial verification**\cite{Philbin07}: a simplified RANSAC procedure that verifies whether visual word matches are spatially consistent.

![](00_intro_files/att_00004.png "Bag of words image search. Image credit: Filip Radenovic http://cmp.felk.cvut.cz/~radenfil/publications/Radenovic-CMPcolloq-2015.11.12.pdf")
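The "visual words" idea can be sketched in a few lines of Python. This is a simplified illustration, not the Video Google implementation: the vocabulary is given by hand instead of being learned by clustering, descriptors are 2D toy vectors, there is no tf-idf weighting or inverted index, and the function names are hypothetical.

```python
import math
from collections import Counter

def quantize(descriptor, vocabulary):
    """Return the index of the nearest visual word (cluster centre)."""
    return min(range(len(vocabulary)),
               key=lambda k: sum((d - c) ** 2
                                 for d, c in zip(descriptor, vocabulary[k])))

def bow_histogram(descriptors, vocabulary):
    """Represent an image as counts of visual words, like words in a text."""
    counts = Counter(quantize(d, vocabulary) for d in descriptors)
    return [counts.get(k, 0) for k in range(len(vocabulary))]

def cosine(u, v):
    """Cosine similarity between two histograms (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 3-word vocabulary and local descriptors of three images.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
query = bow_histogram([(0.1, 0.0), (0.9, 0.1), (1.0, 0.2)], vocabulary)
image_a = bow_histogram([(0.0, 0.1), (1.1, 0.0), (0.8, 0.1)], vocabulary)
image_b = bow_histogram([(0.1, 0.9), (0.0, 1.1), (0.2, 0.8)], vocabulary)

print(query, image_a, image_b)  # prints: [1, 2, 0] [1, 2, 0] [0, 0, 3]
print(cosine(query, image_a) > cosine(query, image_b))  # prints: True
```

Image A shares the query's visual words and ranks first. A spatial verification step would then check that the matched words are also geometrically consistent.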

- **Image classification** was performed by placing a classifier (SVM, random forest, etc.) on top of an encoding of SIFT-like descriptors, extracted sparsely\cite{Fergus03, CsurkaBoK2004} or densely\cite{Lazebnik06}.

![](00_intro_files/att_00005.png "Bag of local features representation for classification from Fergus03")

- **Object detection** was formulated as a relaxed wide baseline stereo problem\cite{Chum2007Exemplar} or as classification of SIFT-like features inside a sliding window \cite{HoG2005}.

![](00_intro_files/att_00003.png "Exemplar representation of the classes using local features, from Chum2007Exemplar")

<!--- 
![HoG-based pedestrian detection algorithm](00_intro_files/att_00006.png)
![Histogram of gradient visualization](00_intro_files/att_00007.png)
-->

- **Semantic segmentation** was performed by classification of local region descriptors (typically SIFT and color features), followed by postprocessing\cite{Superparsing2010}.


Of course, wide baseline stereo was also used for its direct applications:

 - **3D reconstruction** was based on camera poses and 3D points estimated with the help of SIFT features \cite{PhotoTourism2006, RomeInDay2009, COLMAP2016}.
 
![](00_intro_files/att_00008.png "SfM pipeline from COLMAP")
 
 - **SLAM (simultaneous localization and mapping)** systems \cite{Se02, PTAM2007, Mur15} were based on fast versions of local feature detectors and descriptors.
 <!--- 
![ORBSLAM pipeline](00_intro_files/att_00009.png)
-->
 
 - **Panorama stitching** \cite{Brown07} and, more generally, **feature-based image registration**\cite{DualBootstrap2003} were initialized with a geometry obtained by WBS and then further optimized.

## Deep Learning Invasion: retreat to the geometrical fortress


In 2012 the deep learning-based AlexNet\cite{AlexNet2012} approach beat all other methods in image classification. Soon after, Razavian et al.\cite{Astounding2014} showed that convolutional neural networks (CNNs) pre-trained on ImageNet outperform more complex traditional solutions in image and scene classification, object detection and image search.
Deep learning solutions, be they pretrained or end-to-end learned networks, quickly became the default for most computer vision problems.

![](00_intro_files/att_00010.png "CNN representation beats complex traditional pipelines. Reds are CNN-based and greens are the handcrafted. From Astounding2014")


However, there was still one area where deep-learned solutions failed, sometimes spectacularly: geometry-related tasks. Wide baseline stereo\cite{Melekhov2017relativePoseCnn}, visual localization\cite{PoseNet2015} and SLAM are still areas where the classical wide baseline stereo pipeline dominates\cite{sattler2019understanding, zhou2019learn}.

The full reasons why convolutional pipelines fail on geometric tasks are yet to be understood, but the current hypotheses are the following:

- CNN-based pose predictions are roughly equivalent to retrieving the most similar image from the training set and outputting its pose\cite{sattler2019understanding}. This phenomenon is also observed in a related area: single-view 3D reconstruction\cite{Tatarchenko2019}.
- Geometric and arithmetic operations are hard to represent via vanilla neural networks (i.e. matrix multiplication with non-linearity) and they may require specialized building blocks, resembling operations of algorithmic or geometric methods, e.g. spatial transformers\cite{STN2015} and arithmetic units\cite{NALU2018,NAU2020}. Even with a special structure, such networks require "careful initialization, restricting parameter space, and regularizing for sparsity"\cite{NAU2020}.
- Vanilla CNNs are not covariant to even simple geometric transformations such as translation \cite{MakeCNNShiftInvariant2019}, scaling and especially rotation \cite{GroupEqCNN2016}. In contrast, the WBS pipeline is grounded in scale-space theory \cite{lindeberg2013scale}, and local patches are geometrically normalized before description.
- Predictions of CNNs can be altered by a change in a small localized area \cite{AdvPatch2017} or even in a single pixel \cite{OnePixelAttack2019}, while wide baseline stereo methods require the consensus of different independent regions.
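The lack of translation covariance mentioned above is easy to reproduce with a single strided pooling layer. The snippet below is a toy illustration only: a 1D list stands in for a feature map, the shift is circular to avoid border effects, and the helper names are made up here.

```python
def max_pool_1d(signal, kernel=2, stride=2):
    """Max pooling over a 1D list, no padding."""
    return [max(signal[i:i + kernel])
            for i in range(0, len(signal) - kernel + 1, stride)]

def circular_shift(signal, k=1):
    """Shift a list to the right by k samples, wrapping around."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)

x = [0, 0, 1, 1, 0, 0, 1, 1]
pooled = max_pool_1d(x)
pooled_of_shifted = max_pool_1d(circular_shift(x))

# Covariance would mean: shifting the input only shifts the output.
all_shifts_of_pooled = [circular_shift(pooled, k) for k in range(len(pooled))]
print(pooled, pooled_of_shifted, pooled_of_shifted in all_shifts_of_pooled)
# prints: [0, 1, 0, 1] [1, 1, 1, 1] False
```

A one-sample shift of the input changes the pooled output into something that is not a shifted copy of the original output, so the layer is not covariant to translation.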

## Today: assimilation and merging

### Wide baseline stereo as a task: formulate differentiably and learn modules
Wide baseline stereo as a task is typically solved today by using learned components as replacements for specific blocks of the WBS algorithm\cite{jin2020image}, e.g. local descriptors like HardNet\cite{HardNet2017}, detectors like KeyNet\cite{KeyNet2019}, joint detector-descriptors like SuperPoint\cite{SuperPoint2017}, matching and filtering like SuperGlue\cite{sarlin2019superglue}, etc.
There are also attempts to formulate a whole downstream pipeline, such as SLAM\cite{gradslam2020}, in a differentiable way, combining the advantages of structured and learning-based approaches.

![](00_intro_files/att_00011.png "SuperGlue: separate matching module for handcrafted and learned features")

![](00_intro_files/gradslam.png "gradSLAM: differentiable formulation of SLAM pipeline")


### Wide baseline stereo as an idea: consensus of local independent predictions

On the other hand, as an algorithm, wide baseline stereo can be summarized in two main ideas:

1. An image should be represented as a set of local parts that are robust to occlusion and do not influence each other.
2. The decision should be based on the spatial consensus of local feature correspondences.


One modern revisit of wide baseline stereo ideas is Capsule Networks\cite{CapsNet2011,CapsNet2017}. Unlike CNNs, they encode not only the intensity of a feature response but also its location, and they require geometric agreement between object parts before outputting a confident prediction.

Similar ideas are now explored for ensuring adversarial robustness of CNNs\cite{li2020extreme}.

While wide baseline stereo is far from the mainstream now, it continues to play an important role in computer vision.

![](00_intro_files/capsules.png "Capsule networks: revisiting the WBS idea. Each feature response is accompanied with its pose. Poses should be in agreement, otherwise object would not be recognized. Image by Aurélien Géron https://www.oreilly.com/content/introducing-capsule-networks/")

# References

[<a id="cit-Pritchett1998" href="#call-Pritchett1998">Pritchett1998</a>] P. Pritchett and A. Zisserman, ``_Wide baseline stereo matching_'', ICCV,  1998.

[<a id="cit-Pritchett1998b" href="#call-Pritchett1998b">Pritchett1998b</a>] P. Pritchett and A. Zisserman, ``_Matching and Reconstruction from Widely Separated Views_'', 3D Structure from Multiple Images of Large-Scale Environments,  1998.

[<a id="cit-WBSTorr99" href="#call-WBSTorr99">WBSTorr99</a>] P. Torr and A. Zisserman, ``_Feature Based Methods for Structure and Motion Estimation_'', Workshop on Vision Algorithms,  1999.

[<a id="cit-CsurkaReview2018" href="#call-CsurkaReview2018">CsurkaReview2018</a>] Csurka Gabriela, Dance Christopher R. and Humenberger Martin, ``_From handcrafted to deep local features_'', arXiv e-prints,  2018.

[<a id="cit-Lowe99" href="#call-Lowe99">Lowe99</a>] D. Lowe, ``_Object Recognition from Local Scale-Invariant Features_'', ICCV,  1999.

[<a id="cit-MikoDescEval2003" href="#call-MikoDescEval2003">MikoDescEval2003</a>] K. Mikolajczyk and C. Schmid, ``_A Performance Evaluation of Local Descriptors_'', CVPR, June 2003.

[<a id="cit-Mikolajczyk05" href="#call-Mikolajczyk05">Mikolajczyk05</a>] Mikolajczyk K., Tuytelaars T., Schmid C. <em>et al.</em>, ``_A Comparison of Affine Region Detectors_'', IJCV, vol. 65, number 1/2, pp. 43--72,  2005.

[<a id="cit-LOransac2003" href="#call-LOransac2003">LOransac2003</a>] O. Chum, J. Matas and J. Kittler, ``_Locally Optimized RANSAC_'', Pattern Recognition,  2003.

[<a id="cit-Degensac2005" href="#call-Degensac2005">Degensac2005</a>] O. Chum, T. Werner and J. Matas, ``_Two-View Geometry Estimation Unaffected by a Dominant Plane_'', CVPR,  2005.

[<a id="cit-MLESAC00" href="#call-MLESAC00">MLESAC00</a>] Torr P.H.S. and Zisserman A., ``_MLESAC: A New Robust Estimator with Application to Estimating Image Geometry_'', CVIU, vol. 78, pp. 138--156,  2000.

[<a id="cit-VideoGoogle2003" href="#call-VideoGoogle2003">VideoGoogle2003</a>] J. Sivic and A. Zisserman, ``_Video Google: A Text Retrieval Approach to Object Matching in Videos_'', ICCV,  2003.

[<a id="cit-Philbin07" href="#call-Philbin07">Philbin07</a>] J. Philbin, O. Chum, M. Isard <em>et al.</em>, ``_Object Retrieval with Large Vocabularies and Fast Spatial Matching_'', CVPR,  2007.

[<a id="cit-Fergus03" href="#call-Fergus03">Fergus03</a>] R. Fergus, P. Perona and A. Zisserman, ``_Object Class Recognition by Unsupervised Scale-Invariant Learning_'', CVPR,  2003.

[<a id="cit-CsurkaBoK2004" href="#call-CsurkaBoK2004">CsurkaBoK2004</a>] G. Csurka, C. Dance, L. Fan <em>et al.</em>, ``_Visual Categorization with Bags of Keypoints_'', ECCV,  2004.

[<a id="cit-Lazebnik06" href="#call-Lazebnik06">Lazebnik06</a>] S. Lazebnik, C. Schmid and J. Ponce, ``_Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories_'', CVPR,  2006.

[<a id="cit-Chum2007Exemplar" href="#call-Chum2007Exemplar">Chum2007Exemplar</a>] O. Chum and A. Zisserman, ``_An Exemplar Model for Learning Object Classes_'', CVPR,  2007.

[<a id="cit-HoG2005" href="#call-HoG2005">HoG2005</a>] N. Dalal and B. Triggs, ``_Histograms of oriented gradients for human detection_'', CVPR,  2005.

[<a id="cit-Superparsing2010" href="#call-Superparsing2010">Superparsing2010</a>] J. Tighe and S. Lazebnik, ``_SuperParsing: Scalable Nonparametric Image Parsing with Superpixels_'', ECCV,  2010.

[<a id="cit-PhotoTourism2006" href="#call-PhotoTourism2006">PhotoTourism2006</a>] Snavely Noah, Seitz Steven M. and Szeliski Richard, ``_Photo Tourism: Exploring Photo Collections in 3D_'', ToG, vol. 25, number 3, pp. 835–846,  2006.

[<a id="cit-RomeInDay2009" href="#call-RomeInDay2009">RomeInDay2009</a>] Agarwal Sameer, Furukawa Yasutaka, Snavely Noah <em>et al.</em>, ``_Building Rome in a day_'', Communications of the ACM, vol. 54, pp. 105--112,  2011.

[<a id="cit-COLMAP2016" href="#call-COLMAP2016">COLMAP2016</a>] J. Schönberger and J. Frahm, ``_Structure-From-Motion Revisited_'', CVPR,  2016.

[<a id="cit-Se02" href="#call-Se02">Se02</a>] Se S., Lowe D. G. and Little J., ``_Mobile Robot Localization and Mapping with Uncertainty Using Scale-Invariant Visual Landmarks_'', IJRR, vol. 22, number 8, pp. 735--758,  2002.

[<a id="cit-PTAM2007" href="#call-PTAM2007">PTAM2007</a>] G. Klein and D. Murray, ``_Parallel Tracking and Mapping for Small AR Workspaces_'', IEEE and ACM International Symposium on Mixed and Augmented Reality,  2007.

[<a id="cit-Mur15" href="#call-Mur15">Mur15</a>] Mur-Artal R., Montiel J. and Tardós J., ``_ORB-Slam: A Versatile and Accurate Monocular Slam System_'', IEEE Transactions on Robotics, vol. 31, number 5, pp. 1147--1163,  2015.

[<a id="cit-Brown07" href="#call-Brown07">Brown07</a>] Brown M. and Lowe D., ``_Automatic Panoramic Image Stitching Using Invariant Features_'', IJCV, vol. 74, pp. 59--73,  2007.

[<a id="cit-DualBootstrap2003" href="#call-DualBootstrap2003">DualBootstrap2003</a>] C.V. Stewart, C.-L. Tsai and B. Roysam, ``_The dual-bootstrap iterative closest point algorithm with application to retinal image registration_'', IEEE Transactions on Medical Imaging, vol. 22, number 11, pp. 1379--1394,  2003.

[<a id="cit-AlexNet2012" href="#call-AlexNet2012">AlexNet2012</a>] A. Krizhevsky, I. Sutskever and G.E. Hinton, ``_ImageNet Classification with Deep Convolutional Neural Networks_'', NeurIPS,  2012.

[<a id="cit-Astounding2014" href="#call-Astounding2014">Astounding2014</a>] A. S. Razavian, H. Azizpour, J. Sullivan <em>et al.</em>, ``_CNN Features Off-the-Shelf: An Astounding Baseline for Recognition_'', CVPRW,  2014.

[<a id="cit-Melekhov2017relativePoseCnn" href="#call-Melekhov2017relativePoseCnn">Melekhov2017relativePoseCnn</a>] I. Melekhov, J. Ylioinas, J. Kannala <em>et al.</em>, ``_Relative Camera Pose Estimation Using Convolutional Neural Networks_'',  2017.  [online](https://arxiv.org/abs/1702.01381)

[<a id="cit-PoseNet2015" href="#call-PoseNet2015">PoseNet2015</a>] A. Kendall, M. Grimes and R. Cipolla, ``_PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization_'', ICCV,  2015.

[<a id="cit-sattler2019understanding" href="#call-sattler2019understanding">sattler2019understanding</a>] T. Sattler, Q. Zhou, M. Pollefeys <em>et al.</em>, ``_Understanding the limitations of cnn-based absolute camera pose regression_'', CVPR,  2019.

[<a id="cit-zhou2019learn" href="#call-zhou2019learn">zhou2019learn</a>] Q. Zhou, T. Sattler, M. Pollefeys <em>et al.</em>, ``_To Learn or Not to Learn: Visual Localization from Essential Matrices_'', ICRA,  2020.

[<a id="cit-Tatarchenko2019" href="#call-Tatarchenko2019">Tatarchenko2019</a>] M. Tatarchenko, S.R. Richter, R. Ranftl <em>et al.</em>, ``_What Do Single-View 3D Reconstruction Networks Learn?_'', CVPR,  2019.

[<a id="cit-STN2015" href="#call-STN2015">STN2015</a>] M. Jaderberg, K. Simonyan and A. Zisserman, ``_Spatial transformer networks_'', NeurIPS,  2015.

[<a id="cit-NALU2018" href="#call-NALU2018">NALU2018</a>] A. Trask, F. Hill, S.E. Reed <em>et al.</em>, ``_Neural arithmetic logic units_'', NeurIPS,  2018.

[<a id="cit-NAU2020" href="#call-NAU2020">NAU2020</a>] A. Madsen and A. Rosenberg, ``_Neural Arithmetic Units_'', ICLR,  2020.

[<a id="cit-MakeCNNShiftInvariant2019" href="#call-MakeCNNShiftInvariant2019">MakeCNNShiftInvariant2019</a>] R. Zhang, ``_Making convolutional networks shift-invariant again_'', ICML,  2019.

[<a id="cit-GroupEqCNN2016" href="#call-GroupEqCNN2016">GroupEqCNN2016</a>] T. Cohen and M. Welling, ``_Group equivariant convolutional networks_'', ICML,  2016.

[<a id="cit-lindeberg2013scale" href="#call-lindeberg2013scale">lindeberg2013scale</a>] Lindeberg Tony, ``_Scale-space theory in computer vision_'', vol. 256,  2013.

[<a id="cit-AdvPatch2017" href="#call-AdvPatch2017">AdvPatch2017</a>] T. Brown, D. Mane, A. Roy <em>et al.</em>, ``_Adversarial patch_'', NeurIPSW,  2017.

[<a id="cit-OnePixelAttack2019" href="#call-OnePixelAttack2019">OnePixelAttack2019</a>] Su Jiawei, Vargas Danilo Vasconcellos and Sakurai Kouichi, ``_One pixel attack for fooling deep neural networks_'', IEEE Transactions on Evolutionary Computation, vol. 23, number 5, pp. 828--841,  2019.

[<a id="cit-jin2020image" href="#call-jin2020image">jin2020image</a>] Jin Yuhe, Mishkin Dmytro, Mishchuk Anastasiia <em>et al.</em>, ``_Image Matching across Wide Baselines: From Paper to Practice_'', arXiv preprint arXiv:2003.01587,  2020.

[<a id="cit-HardNet2017" href="#call-HardNet2017">HardNet2017</a>] A. Mishchuk, D. Mishkin, F. Radenovic <em>et al.</em>, ``_Working Hard to Know Your Neighbor's Margins: Local Descriptor Learning Loss_'', NeurIPS,  2017.

[<a id="cit-KeyNet2019" href="#call-KeyNet2019">KeyNet2019</a>] A. Barroso-Laguna, E. Riba, D. Ponsa <em>et al.</em>, ``_Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters_'', ICCV,  2019.

[<a id="cit-SuperPoint2017" href="#call-SuperPoint2017">SuperPoint2017</a>] DeTone D., Malisiewicz T. and Rabinovich A., ``_SuperPoint: Self-Supervised Interest Point Detection and Description_'', CVPRW Deep Learning for Visual SLAM,  2018.

[<a id="cit-sarlin2019superglue" href="#call-sarlin2019superglue">sarlin2019superglue</a>] P. Sarlin, D. DeTone, T. Malisiewicz <em>et al.</em>, ``_SuperGlue: Learning Feature Matching with Graph Neural Networks_'', CVPR,  2020.

[<a id="cit-gradslam2020" href="#call-gradslam2020">gradslam2020</a>] J. Krishna Murthy, G. Iyer and L. Paull, ``_gradSLAM: Dense SLAM meets Automatic Differentiation_'', ICRA,  2020.

[<a id="cit-CapsNet2011" href="#call-CapsNet2011">CapsNet2011</a>] G.E. Hinton, A. Krizhevsky and S.D. Wang, ``_Transforming auto-encoders_'', ICANN,  2011.

[<a id="cit-CapsNet2017" href="#call-CapsNet2017">CapsNet2017</a>] S. Sabour, N. Frosst and G.E. Hinton, ``_Dynamic routing between capsules_'', NeurIPS,  2017.

[<a id="cit-li2020extreme" href="#call-li2020extreme">li2020extreme</a>] Li Jianguo, Sun Mingjie and Zhang Changshui, ``_Extreme Values are Accurate and Robust in Deep Networks_'',  2020.  [online](https://openreview.net/forum?id=H1gHb1rFwr)

[<a id="cit-Schmid1995" href="#call-Schmid1995">Schmid1995</a>] Schmid Cordelia and Mohr Roger, ``_Matching by local invariants_'',  1995.  [online](https://hal.inria.fr/file/index/docid/74046/filename/RR-2644.pdf)

