## Rise of Wide Multiple Baseline Stereo

*Wide multiple baseline stereo (WxBS)* is the process of establishing a sufficient number of pixel or region correspondences between two or more images depicting the same scene, in order to estimate the geometric relationship between the cameras that produced these images. Typically, WxBS relies on scene rigidity -- the assumption that there is no motion in the scene except the motion of the camera itself. The stereo problem is called wide multiple baseline if the images differ significantly in more than one aspect: viewpoint, illumination, time of acquisition, and so on. Historically, research focused on the simpler problem with a single baseline, the geometric one, i.e., the viewpoint difference between cameras, and the area was known as wide baseline stereo. Nowadays, the field is mature and research focuses on the more challenging multi-baseline problems.

WxBS is a building block of many popular computer vision applications, where spatial localization or 3D world understanding is required -- panorama stitching, 3D reconstruction, image retrieval, SLAM, etc. 

If wide baseline stereo is a new concept for you, I recommend checking the [explanation in simple terms](https://ducha-aiki.github.io/wide-baseline-stereo-blog/2021/01/09/wxbs-in-simple-terms.html).

![](00_intro_files/match_doll.png "Correspondences between two views found by wide baseline stereo algorithm. Photo and doll created by Olha Mishkina")





**Where does wide baseline stereo come from?** 

As often happens, a new problem arises from the old -- narrow or short baseline stereo. In narrow baseline stereo, images are taken from nearby positions, often at exactly the same time. One could find the correspondence for the point $(x,y)$ from the image $I_1$ in the image $I_2$ by simply searching in some small window around $(x,y)$ \cite{Hannah1974ComputerMO, Moravec1980} or, assuming that the camera pair is calibrated and the images are rectified, by searching along the epipolar line \cite{Hartley2004}.
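To make the window search concrete, here is a minimal sketch in Python with NumPy. It is not the method of the cited works: the function name `match_in_window`, the patch and window sizes, and the sum-of-squared-differences cost are my assumptions for illustration, and the toy images are synthetic.

```python
import numpy as np

def match_in_window(img1, img2, x, y, patch=3, radius=4):
    """Find the pixel in img2 whose patch best matches the patch around (x, y) in img1.

    patch  -- half-size of the compared square patch (hypothetical default)
    radius -- half-size of the search window around (x, y) (hypothetical default)
    Assumes (x, y) is at least `patch` pixels away from the border of img1.
    """
    ref = img1[y - patch:y + patch + 1, x - patch:x + patch + 1].astype(np.float64)
    best_cost, best_xy = np.inf, (x, y)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            cx, cy = x + dx, y + dy
            cand = img2[cy - patch:cy + patch + 1, cx - patch:cx + patch + 1]
            if cand.shape != ref.shape:  # candidate patch falls outside the image
                continue
            cost = np.sum((cand - ref) ** 2)  # sum of squared differences
            if cost < best_cost:
                best_cost, best_xy = cost, (cx, cy)
    return best_xy

# Toy data: the second image is the first one shifted 2 px right and 1 px down.
rng = np.random.default_rng(0)
img1 = rng.random((40, 40))
img2 = np.roll(img1, shift=(1, 2), axis=(0, 1))

found = match_in_window(img1, img2, x=20, y=20)
print(found)
```

The search only works while the true displacement stays inside the window, which is exactly the narrow baseline assumption.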



![](2020-03-27-intro_files/att_00003.png "Correspondence search in narrow baseline stereo, from Moravec 1980 PhD thesis.")

<!--- ![Wide baseline stereo model. "Baseline" is the distance between cameras. Image by Arne Nordmann (WikiMedia)](00_intro_files/Epipolar_geometry.svg) 
-->

One of the first, if not the first, approaches to the wide baseline stereo problem was proposed by Schmid and Mohr \cite{Schmid1995} in 1995. Given the difficulty of the wide multiple baseline stereo task at the time, only a single -- geometrical -- baseline was considered, thus the name -- wide baseline stereo (WBS). The idea of Schmid and Mohr was to equip each keypoint with an invariant descriptor. This allowed establishing tentative correspondences between keypoints under viewpoint and illumination changes, as well as occlusions. One of the stepping stones was the corner detector by Harris and Stephens \cite{Harris88}, initially used for tracking. It is worth mentioning that there were other good choices for the local feature detector at the time, starting with the Förstner \cite{forstner1987fast}, Moravec \cite{Moravec1980} and Beaudet \cite{Hessian78} feature detectors.


The Schmid and Mohr approach was later extended by Beardsley, Torr and Zisserman \cite{Beardsley96}, who added RANSAC \cite{RANSAC1981} robust geometry estimation, and was further refined by Pritchett and Zisserman \cite{Pritchett1998, Pritchett1998b} in 1998. The general pipeline, shown in the figure below, has remained mostly the same until now \cite{WBSTorr99, CsurkaReview2018, IMW2020}.

<!--- 
![image.png](00_intro_files/att_00002.png)
-->


![](00_intro_files/matching-filtering.png "Commonly used wide baseline stereo pipeline")


Let's write down the WxBS algorithm:

1. Compute interest points/regions in all images independently.
2. For each interest point/region, compute a descriptor of its neighborhood (local patch).
3. Establish tentative correspondences between interest points based on their descriptors.
4. Robustly estimate the geometric relation between two images from the tentative correspondences with RANSAC.

The reasoning behind each step is described in [this separate post](https://ducha-aiki.github.io/wide-baseline-stereo-blog/2021/02/11/WxBS-step-by-step.html).
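Steps 3 and 4 of the algorithm above can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: steps 1 and 2 are replaced by random keypoints and descriptors, the geometric relation is a 2D affine map instead of a fundamental matrix or homography, and the names `fit_affine`, `apply_affine` and `ransac_affine` are hypothetical.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine map A with dst ~ src @ A[:, :2].T + A[:, 2]."""
    X = np.hstack([src, np.ones((len(src), 1))])
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return M.T

def apply_affine(A, pts):
    return pts @ A[:, :2].T + A[:, 2]

def ransac_affine(src, dst, rng, n_iters=200, thresh=0.5):
    """Plain RANSAC: sample 3 correspondences, fit, count inliers, keep the best."""
    best_inl = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)
        A = fit_affine(src[idx], dst[idx])
        inl = np.linalg.norm(apply_affine(A, src) - dst, axis=1) < thresh
        if inl.sum() > best_inl.sum():
            best_inl = inl
    # Final model: refit on all inliers of the best hypothesis.
    return fit_affine(src[best_inl], dst[best_inl]), best_inl

rng = np.random.default_rng(0)

# Steps 1-2 (stand-in): 60 keypoints with 32-D descriptors, seen in two images.
A_true = np.array([[0.9, -0.2, 5.0],
                   [0.2, 0.9, -3.0]])
kp1 = rng.uniform(0, 100, size=(60, 2))
kp2 = apply_affine(A_true, kp1)
desc1 = rng.normal(size=(60, 32))
desc2 = desc1 + 0.05 * rng.normal(size=(60, 32))
desc2[40:] = rng.normal(size=(20, 32))  # last 20 descriptors are unrelated -> wrong matches

# Step 3: tentative correspondences by nearest neighbor in descriptor space.
dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
nn = dists.argmin(axis=1)
src, dst = kp1, kp2[nn]

# Step 4: robust geometry estimation.
A_est, inliers = ransac_affine(src, dst, rng)
print(np.round(A_est, 3))
print(int(inliers.sum()), "inliers out of", len(src))
```

Even though a third of the tentative correspondences are wrong, the consensus among the correct ones is enough to recover the transformation.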


## Quick expansion 

This algorithm significantly changed the computer vision landscape for the next fourteen years.

Soon after the introduction of the WBS algorithm, it became clear that its quality significantly depends on the quality of each component, i.e., local feature detector, descriptor, and geometry estimation. Local feature detectors were designed to be as invariant as possible, backed up by the scale-space theory, most notably developed by Lindeberg \cite{Lindeberg1993, Lindeberg1998, lindeberg2013scale}. A plethora of new detectors and descriptors were proposed at that time. We refer the interested reader to these two surveys: by Tuytelaars and Mikolajczyk \cite{Tuytelaars2008} (2008) and by Csurka \etal \cite{CsurkaReview2018} (2018). Among the proposed local features is one of the most cited computer vision papers ever -- the SIFT local feature \cite{Lowe99, SIFT2004}.

Besides the SIFT descriptor itself, Lowe's paper incorporated several important steps, proposed earlier with his co-authors, into the matching pipeline. Specifically, these are quadratic fitting of the feature responses for precise keypoint localization \cite{QuadInterp2002}, using the Best-Bin-First kd-tree \cite{aknn1997} as an approximate nearest neighbor search engine to speed up the generation of tentative correspondences, and using the second-nearest neighbor (SNN) ratio to filter the tentative matches.
It is worth noting that the SIFT feature became popular only after the Mikolajczyk benchmark papers \cite{MikoDescEval2003, Mikolajczyk05}, which showed its superiority over the alternatives.
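The SNN ratio test is easy to sketch: a match is kept only if the nearest descriptor is clearly closer than the second nearest one. Below is a minimal brute-force version. The function name `snn_ratio_match` and the toy 2-D descriptors are my assumptions, and the threshold 0.8 is a commonly used value that may need tuning for a given descriptor.

```python
import numpy as np

def snn_ratio_match(desc1, desc2, max_ratio=0.8):
    """Return (i, j) index pairs that pass the second-nearest-neighbor ratio test."""
    dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    rows = np.arange(len(desc1))
    first, second = order[:, 0], order[:, 1]
    # Small ratio = the best match is much closer than the runner-up = distinctive.
    ratio = dists[rows, first] / (dists[rows, second] + 1e-12)
    keep = ratio < max_ratio
    return np.stack([rows[keep], first[keep]], axis=1)

# Toy descriptors: desc1[0] has one clear match in desc2,
# desc1[1] sits exactly between two nearly identical candidates.
desc2 = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 0.1]])
desc1 = np.array([[0.1, 0.0], [10.0, 0.05]])

matches = snn_ratio_match(desc1, desc2)
print(matches)
```

The ambiguous descriptor is dropped even though its nearest neighbor is very close in absolute terms, which is the point of using a ratio instead of a distance threshold.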
 
Robust geometry estimation was also a hot topic: many improvements over vanilla RANSAC were proposed. For example, LO-RANSAC \cite{LOransac2003} added a local optimization step to RANSAC, which significantly decreases the number of required iterations. PROSAC \cite{PROSAC2005} takes into account the matching score of the tentative correspondences during sampling to speed up the procedure. DEGENSAC \cite{Degensac2005} improved the quality of the geometry estimation in the presence of a dominant plane in the images, which is the typical case for urban images. We refer the interested reader to the survey by Choi \etal \cite{RANSACSurvey2009}.
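To see why the number of iterations matters so much, recall the standard stopping criterion of vanilla RANSAC: the number of samples needed so that, with a given confidence, at least one sample contains only inliers. The sketch below is the textbook formula and not taken from any of the papers above; the function name `ransac_iterations` is hypothetical.

```python
import math

def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Samples needed so that, with probability `confidence`,
    at least one minimal sample consists of inliers only."""
    p_good = inlier_ratio ** sample_size  # probability that one sample is all-inlier
    if p_good <= 0.0:
        return math.inf
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

# Minimal sample sizes: 7 points for a fundamental matrix, 4 for a homography.
for ratio in (0.7, 0.5, 0.3):
    print(ratio, ransac_iterations(ratio, 7), ransac_iterations(ratio, 4))
```

The count grows quickly as the inlier ratio drops or the sample size grows, which is why better sampling and better use of each good sample pay off in practice.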


The success of wide baseline stereo with SIFT features led to the application of its components to other computer vision tasks, which were reformulated through the wide baseline stereo lens:

-   **Scalable image search**. Sivic and Zisserman, in the famous "Video Google" paper \cite{VideoGoogle2003}, proposed to treat local features as "visual words" and to use ideas from text processing for searching in image collections. Later, even more WBS elements were re-introduced to image search, most notably **spatial verification** \cite{Philbin07}: a simplified RANSAC procedure that verifies whether visual word matches are spatially consistent.

![](00_intro_files/att_00004.png "Bag of words image search. Image credit: Filip Radenovic http://cmp.felk.cvut.cz/~radenfil/publications/Radenovic-CMPcolloq-2015.11.12.pdf")
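The "visual words" idea can be sketched in a few lines: each descriptor is replaced by the index of its nearest vocabulary entry, and an image becomes a histogram of these indices. This is a simplified illustration, not the Video Google system: the vocabulary here is random instead of being learned by clustering, the tf-idf weighting is omitted, and the names `bow_histogram` and `fake_image` are hypothetical.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize descriptors to their nearest visual word and
    return an L2-normalized histogram of word counts."""
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    return hist / (np.linalg.norm(hist) + 1e-12)

rng = np.random.default_rng(0)
# Stand-in vocabulary of 50 words; in practice it is learned by clustering descriptors.
vocabulary = 5.0 * rng.normal(size=(50, 16))

def fake_image(word_ids):
    """Synthetic 'image': noisy copies of the given visual words."""
    return vocabulary[word_ids] + 0.1 * rng.normal(size=(len(word_ids), 16))

query = bow_histogram(fake_image([1, 1, 7, 20, 33]), vocabulary)
database = np.stack([
    bow_histogram(fake_image([2, 5, 9, 40, 41]), vocabulary),
    bow_histogram(fake_image([1, 7, 7, 20, 33]), vocabulary),
    bow_histogram(fake_image([10, 11, 12, 13, 14]), vocabulary),
])

scores = database @ query  # cosine similarity, since histograms are L2-normalized
best = int(scores.argmax())
print(best, np.round(scores, 3))
```

Note that the histogram throws away where the features are located in the image, which is exactly the information that spatial verification brings back.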

- **Image classification** was performed by placing a classifier (SVM, random forest, etc.) on top of an encoding of SIFT-like descriptors, extracted sparsely \cite{Fergus03, CsurkaBoK2004} or densely \cite{Lazebnik06}.

![](00_intro_files/att_00005.png "Bag of local features representation for classification from Fergus03")

- **Object detection** was formulated as a relaxed wide baseline stereo problem \cite{Chum2007Exemplar} or as classification of SIFT-like features inside a sliding window \cite{HoG2005}.

![](00_intro_files/att_00003.png "Exemplar-representation of the classes using local features, from Chum2007Exemplar")

<!--- 
![HoG-based pedestrian detection algorithm](00_intro_files/att_00006.png)
![Histogram of gradient visualization](00_intro_files/att_00007.png)
-->

- **Semantic segmentation** was performed by classification of local region descriptors, typically SIFT and color features, followed by postprocessing \cite{Superparsing2010}.


Of course, wide baseline stereo was also used for its direct applications:

 - **3D reconstruction** was based on camera poses and 3D points, estimated with the help of SIFT features \cite{PhotoTourism2006, RomeInDay2009, COLMAP2016}.
 
![](00_intro_files/att_00008.png "SfM pipeline from COLMAP")
 
 - **SLAM (simultaneous localization and mapping)** systems \cite{Se02, PTAM2007, Mur15} were based on fast versions of local feature detectors and descriptors.
 <!--- 
![ORBSLAM pipeline](00_intro_files/att_00009.png)
-->
 
 - **Panorama stitching** \cite{Brown07} and, more generally, **feature-based image registration** \cite{DualBootstrap2003} were initialized with a geometry obtained by WBS and then further optimized.

## Deep Learning Invasion: retreat to the geometrical fortress


 In 2012 the deep learning-based AlexNet \cite{AlexNet2012} approach beat all other methods in image classification at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
 Soon after, Razavian et al. \cite{Astounding2014} showed that convolutional neural networks (CNNs) pre-trained on ImageNet outperform more complex traditional solutions in image and scene classification, object detection and image search, see the figure below. The performance gap between deep learning and "classical" solutions was large and quickly increasing. In addition, deep learning pipelines, be they off-the-shelf pretrained, fine-tuned or end-to-end learned networks, are simple from the engineering perspective. That is why deep learning algorithms quickly became the default option for many computer vision problems.

![](00_intro_files/att_00010.png "CNN representation beats complex traditional pipelines. Reds are CNN-based and greens are the handcrafted. From Astounding2014")

However, there was still a domain where deep learned solutions failed, sometimes spectacularly: geometry-related tasks. Wide baseline stereo \cite{Melekhov2017relativePoseCnn}, visual localization \cite{PoseNet2015} and SLAM are still areas where the classical wide baseline stereo dominates \cite{sattler2019understanding, zhou2019learn, pion2020benchmarking}.

The full reasons why convolutional neural network pipelines struggle with geometry-related tasks, and how to fix that, are yet to be understood. The observations from recent papers are the following:

- CNN-based pose predictions are roughly equivalent to retrieving the most similar image from the training set and outputting its pose \cite{sattler2019understanding}. This kind of behaviour is also observed in a related area: single-view 3D reconstruction performed by deep networks is essentially a retrieval of the most similar 3D model from the training set \cite{Tatarchenko2019}.
- Geometric and arithmetic operations are hard to represent via vanilla neural networks (i.e., matrix multiplication followed by non-linearity) and they may require specialized building blocks, approximating operations of algorithmic or geometric methods, e.g. spatial transformers \cite{STN2015} and arithmetic
  units \cite{NALU2018,NAU2020}. Even with such special-purpose components, the deep learning solutions require "careful initialization, restricting parameter space, and regularizing for sparsity" \cite{NAU2020}.
- Vanilla CNNs suffer from sensitivity to geometric transformations like scaling and rotation \cite{GroupEqCNN2016} or even translation \cite{MakeCNNShiftInvariant2019}. The sensitivity to translations might sound counter-intuitive, because the convolution operation is by definition translation-covariant. However, a typical CNN also contains zero-padding and downscaling operations, which break the covariance \cite{MakeCNNShiftInvariant2019, AbsPositionCNN2020}. Unlike them, classical local feature detectors are grounded in scale-space \cite{lindeberg2013scale} and image processing theories. Some of the classical methods deal with the issue by explicit geometric normalization of the patches before description.
- CNN predictions can be altered by a change in a small localized area \cite{AdvPatch2017} or even a single pixel \cite{OnePixelAttack2019}, while the wide baseline stereo methods require the consensus of different independent regions.
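The point about translation from the list above can be checked on a toy 1D signal. The sketch below is my own illustration, not code from the cited papers: it uses a circular convolution, so that borders play no role, followed by stride-2 subsampling as a stand-in for the downscaling layers of a CNN. The helper names `circ_conv` and `conv_then_stride2` are hypothetical.

```python
import numpy as np

def circ_conv(x, k):
    """Circular cross-correlation of signal x with kernel k (no border effects)."""
    return sum(w * np.roll(x, -i) for i, w in enumerate(k))

def conv_then_stride2(x, k):
    """Convolution followed by keeping every second sample."""
    return circ_conv(x, k)[::2]

rng = np.random.default_rng(0)
x = rng.random(32)
k = np.array([1.0, -2.0, 1.0])

# Convolution alone: shifting the input by 1 just shifts the output by 1.
covariant_before = np.allclose(circ_conv(np.roll(x, 1), k),
                               np.roll(circ_conv(x, k), 1))

# After stride-2 subsampling, a 1-sample input shift gives an output
# that is not any shifted version of the original output.
out, out_shift1 = conv_then_stride2(x, k), conv_then_stride2(np.roll(x, 1), k)
covariant_after_odd = any(np.allclose(np.roll(out, s), out_shift1)
                          for s in range(len(out)))

# A shift by the stride itself (2 samples) is still handled correctly.
covariant_after_even = np.allclose(conv_then_stride2(np.roll(x, 2), k),
                                   np.roll(out, 1))

print(covariant_before, covariant_after_odd, covariant_after_even)
```

In other words, after downscaling the covariance survives only for shifts that are multiples of the stride.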

## Today: assimilation and merging

### Wide baseline stereo as a task: formulate differentiably and learn modules

This leads us to the following question -- **is deep learning helping WxBS today?** The answer is yes. After a brief period of interest in black-box-style models, the current trend is to design deep learning solutions for wide baseline stereo in a modular fashion \cite{cv4action2019}, resembling the one in the figure below. Such modules are learned separately. For example, the HardNet \cite{HardNet2017} descriptor replaces the SIFT local descriptor. The Hessian detector can be replaced by deep learned detectors like KeyNet \cite{KeyNet2019} or by joint detector-descriptors \cite{SuperPoint2017, R2D22019, D2Net2019}. The matching and filtering are performed by the SuperGlue \cite{sarlin2019superglue} matching network, etc. There have also been attempts to formulate a full pipeline for a problem like SLAM \cite{gradslam2020} in a differentiable way, combining the advantages of structured and learning-based approaches.


![](00_intro_files/att_00011.png "SuperGlue: separate matching module for handcrafted and learned features")

![](00_intro_files/gradslam.png "gradSLAM: differentiable formulation of SLAM pipeline")


### Wide baseline stereo as an idea: consensus of local independent predictions

On the other hand, as an algorithm, wide baseline stereo can be summarized in two main ideas:

1. An image should be represented as a set of local parts that are robust to occlusion and do not influence each other.
2. The decision should be based on the spatial consensus of local feature correspondences.


One modern revisit of wide baseline stereo ideas is Capsule Networks \cite{CapsNet2011,CapsNet2017}. Unlike vanilla CNNs, capsule networks encode not only the intensity of a feature response, but also its location. Geometric agreement between "object parts" is a requirement for outputting a confident prediction.

Similar ideas are now explored for ensuring adversarial robustness of CNNs\cite{li2020extreme}.


The "consensus of local independent predictions" idea is also used in the [Cross-transformers](https://arxiv.org/abs/2007.11498) paper: spatial attention helps to select relevant features for few-shot learning, see the figure below.

While wide multiple baseline stereo is a mature field now and does not attract nearly as much attention as before, it continues to play an important role in computer vision.

![](2020-03-27-intro_files/att_00000.png "Cross-transformers: spatial attention helps to select relevant features for few-shot learning")


![](00_intro_files/capsules.png "Capsule networks: revisiting the WBS idea. Each feature response is accompanied with its pose. Poses should be in agreement, otherwise object would not be recognized. Image by Aurélien Géron https://www.oreilly.com/content/introducing-capsule-networks/")



# References

[<a id="cit-Hannah1974ComputerMO" href="#call-Hannah1974ComputerMO">Hannah1974ComputerMO</a>] M. J. Hannah, ``_Computer matching of areas in stereo images_'',  1974.

[<a id="cit-Moravec1980" href="#call-Moravec1980">Moravec1980</a>] Hans Peter Moravec, ``_Obstacle Avoidance and Navigation in the Real World by a Seeing Robot Rover_'',  1980.

[<a id="cit-Hartley2004" href="#call-Hartley2004">Hartley2004</a>] R. I. Hartley and A. Zisserman, ``_Multiple View Geometry in Computer Vision_'',  2004.

[<a id="cit-Schmid1995" href="#call-Schmid1995">Schmid1995</a>] Schmid Cordelia and Mohr Roger, ``_Matching by local invariants_'',  1995.  [online](https://hal.inria.fr/file/index/docid/74046/filename/RR-2644.pdf)

[<a id="cit-Harris88" href="#call-Harris88">Harris88</a>] C. Harris and M. Stephens, ``_A Combined Corner and Edge Detector_'', Fourth Alvey Vision Conference,  1988.

[<a id="cit-forstner1987fast" href="#call-forstner1987fast">forstner1987fast</a>] W. Förstner and E. Gülch, ``_A fast operator for detection and precise location of distinct points, corners and centres of circular features_'', Proc. ISPRS intercommission conference on fast processing of photogrammetric data,  1987.

[<a id="cit-Hessian78" href="#call-Hessian78">Hessian78</a>] P.R. Beaudet, ``_Rotationally invariant image operators_'', Proceedings of the 4th International Joint Conference on Pattern Recognition,  1978.

[<a id="cit-Beardsley96" href="#call-Beardsley96">Beardsley96</a>] P. Beardsley, P. Torr and A. Zisserman, ``_3D model acquisition from extended image sequences_'', ECCV,  1996.

[<a id="cit-RANSAC1981" href="#call-RANSAC1981">RANSAC1981</a>] Fischler Martin A. and Bolles Robert C., ``_Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography_'', Commun. ACM, vol. 24, number 6, pp. 381--395, jun 1981.

[<a id="cit-Pritchett1998" href="#call-Pritchett1998">Pritchett1998</a>] P. Pritchett and A. Zisserman, ``_Wide baseline stereo matching_'', ICCV,  1998.

[<a id="cit-Pritchett1998b" href="#call-Pritchett1998b">Pritchett1998b</a>] P. Pritchett and A. Zisserman, ``_Matching and Reconstruction from Widely Separated Views_'', 3D Structure from Multiple Images of Large-Scale Environments,  1998.

[<a id="cit-WBSTorr99" href="#call-WBSTorr99">WBSTorr99</a>] P. Torr and A. Zisserman, ``_Feature Based Methods for Structure and Motion Estimation_'', Workshop on Vision Algorithms,  1999.

[<a id="cit-CsurkaReview2018" href="#call-CsurkaReview2018">CsurkaReview2018</a>] Csurka Gabriela, Dance Christopher R. and Humenberger Martin, ``_From handcrafted to deep local features_'', arXiv e-prints,  2018.

[<a id="cit-IMW2020" href="#call-IMW2020">IMW2020</a>] Jin Yuhe, Mishkin Dmytro, Mishchuk Anastasiia <em>et al.</em>, ``_Image Matching across Wide Baselines: From Paper to Practice_'', arXiv preprint arXiv:2003.01587,  2020.

[<a id="cit-Lindeberg1993" href="#call-Lindeberg1993">Lindeberg1993</a>] Lindeberg Tony, ``_Detecting Salient Blob-like Image Structures and Their Scales with a Scale-space Primal Sketch: A Method for Focus-of-attention_'', Int. J. Comput. Vision, vol. 11, number 3, pp. 283--318, December 1993.

[<a id="cit-Lindeberg1998" href="#call-Lindeberg1998">Lindeberg1998</a>] Lindeberg Tony, ``_Feature Detection with Automatic Scale Selection_'', Int. J. Comput. Vision, vol. 30, number 2, pp. 79--116, November 1998.

[<a id="cit-lindeberg2013scale" href="#call-lindeberg2013scale">lindeberg2013scale</a>] Lindeberg Tony, ``_Scale-space theory in computer vision_'', vol. 256,  2013.

[<a id="cit-Tuytelaars2008" href="#call-Tuytelaars2008">Tuytelaars2008</a>] Tuytelaars Tinne and Mikolajczyk Krystian, ``_Local Invariant Feature Detectors: A Survey_'', Found. Trends. Comput. Graph. Vis., vol. 3, number 3, pp. 177--280, July 2008.

[<a id="cit-Lowe99" href="#call-Lowe99">Lowe99</a>] D. Lowe, ``_Object Recognition from Local Scale-Invariant Features_'', ICCV,  1999.

[<a id="cit-SIFT2004" href="#call-SIFT2004">SIFT2004</a>] Lowe David G., ``_Distinctive Image Features from Scale-Invariant Keypoints_'', International Journal of Computer Vision (IJCV), vol. 60, number 2, pp. 91--110,  2004.

[<a id="cit-QuadInterp2002" href="#call-QuadInterp2002">QuadInterp2002</a>] M. Brown and D. Lowe, ``_Invariant Features from Interest Point Groups_'', BMVC,  2002.

[<a id="cit-aknn1997" href="#call-aknn1997">aknn1997</a>] J.S. Beis and D.G. Lowe, ``_Shape Indexing Using Approximate Nearest-Neighbour Search in High-Dimensional Spaces_'', CVPR,  1997.

[<a id="cit-MikoDescEval2003" href="#call-MikoDescEval2003">MikoDescEval2003</a>] K. Mikolajczyk and C. Schmid, ``_A Performance Evaluation of Local Descriptors_'', CVPR, June 2003.

[<a id="cit-Mikolajczyk05" href="#call-Mikolajczyk05">Mikolajczyk05</a>] Mikolajczyk K., Tuytelaars T., Schmid C. <em>et al.</em>, ``_A Comparison of Affine Region Detectors_'', IJCV, vol. 65, number 1/2, pp. 43--72,  2005.

[<a id="cit-LOransac2003" href="#call-LOransac2003">LOransac2003</a>] O. Chum, J. Matas and J. Kittler, ``_Locally Optimized RANSAC_'', Pattern Recognition,  2003.

[<a id="cit-PROSAC2005" href="#call-PROSAC2005">PROSAC2005</a>] O. Chum and J. Matas, ``_Matching with PROSAC -- Progressive Sample Consensus_'', CVPR,  2005.

[<a id="cit-Degensac2005" href="#call-Degensac2005">Degensac2005</a>] O. Chum, T. Werner and J. Matas, ``_Two-View Geometry Estimation Unaffected by a Dominant Plane_'', CVPR,  2005.

[<a id="cit-RANSACSurvey2009" href="#call-RANSACSurvey2009">RANSACSurvey2009</a>] S. Choi, T. Kim and W. Yu, ``_Performance Evaluation of RANSAC Family_'', BMVC,  2009.

[<a id="cit-VideoGoogle2003" href="#call-VideoGoogle2003">VideoGoogle2003</a>] J. Sivic and A. Zisserman, ``_Video Google: A Text Retrieval Approach to Object Matching in Videos_'', ICCV,  2003.

[<a id="cit-Philbin07" href="#call-Philbin07">Philbin07</a>] J. Philbin, O. Chum, M. Isard <em>et al.</em>, ``_Object Retrieval with Large Vocabularies and Fast Spatial Matching_'', CVPR,  2007.

[<a id="cit-Fergus03" href="#call-Fergus03">Fergus03</a>] R. Fergus, P. Perona and A. Zisserman, ``_Object Class Recognition by Unsupervised Scale-Invariant Learning_'', CVPR,  2003.

[<a id="cit-CsurkaBoK2004" href="#call-CsurkaBoK2004">CsurkaBoK2004</a>] C.D. G. Csurka, J. Willamowski, L. Fan <em>et al.</em>, ``_Visual Categorization with Bags of Keypoints_'', ECCV,  2004.

[<a id="cit-Lazebnik06" href="#call-Lazebnik06">Lazebnik06</a>] S. Lazebnik, C. Schmid and J. Ponce, ``_Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories_'', CVPR,  2006.

[<a id="cit-Chum2007Exemplar" href="#call-Chum2007Exemplar">Chum2007Exemplar</a>] O. Chum and A. Zisserman, ``_An Exemplar Model for Learning Object Classes_'', CVPR,  2007.

[<a id="cit-HoG2005" href="#call-HoG2005">HoG2005</a>] N. Dalal and B. Triggs, ``_Histograms of oriented gradients for human detection_'', CVPR,  2005.

[<a id="cit-Superparsing2010" href="#call-Superparsing2010">Superparsing2010</a>] J. Tighe and S. Lazebnik, ``_SuperParsing: Scalable Nonparametric Image Parsing with Superpixels_'', ECCV,  2010.

[<a id="cit-PhotoTourism2006" href="#call-PhotoTourism2006">PhotoTourism2006</a>] Snavely Noah, Seitz Steven M. and Szeliski Richard, ``_Photo Tourism: Exploring Photo Collections in 3D_'', ToG, vol. 25, number 3, pp. 835–846,  2006.

[<a id="cit-RomeInDay2009" href="#call-RomeInDay2009">RomeInDay2009</a>] Agarwal Sameer, Furukawa Yasutaka, Snavely Noah <em>et al.</em>, ``_Building Rome in a day_'', Communications of the ACM, vol. 54, pp. 105--112,  2011.

[<a id="cit-COLMAP2016" href="#call-COLMAP2016">COLMAP2016</a>] J. Schönberger and J. Frahm, ``_Structure-From-Motion Revisited_'', CVPR,  2016.

[<a id="cit-Se02" href="#call-Se02">Se02</a>] Se S., Lowe D. G. and Little J., ``_Mobile Robot Localization and Mapping with Uncertainty Using Scale-Invariant Visual Landmarks_'', IJRR, vol. 22, number 8, pp. 735--758,  2002.

[<a id="cit-PTAM2007" href="#call-PTAM2007">PTAM2007</a>] G. Klein and D. Murray, ``_Parallel Tracking and Mapping for Small AR Workspaces_'', IEEE and ACM International Symposium on Mixed and Augmented Reality,  2007.

[<a id="cit-Mur15" href="#call-Mur15">Mur15</a>] Mur-Artal R., Montiel J. and Tardós J., ``_ORB-Slam: A Versatile and Accurate Monocular Slam System_'', IEEE Transactions on Robotics, vol. 31, number 5, pp. 1147--1163,  2015.

[<a id="cit-Brown07" href="#call-Brown07">Brown07</a>] Brown M. and Lowe D., ``_Automatic Panoramic Image Stitching Using Invariant Features_'', IJCV, vol. 74, pp. 59--73,  2007.

[<a id="cit-DualBootstrap2003" href="#call-DualBootstrap2003">DualBootstrap2003</a>] Stewart C. V., Tsai Chia-Ling and Roysam B., ``_The dual-bootstrap iterative closest point algorithm with application to retinal image registration_'', IEEE Transactions on Medical Imaging, vol. 22, number 11, pp. 1379--1394,  2003.

[<a id="cit-AlexNet2012" href="#call-AlexNet2012">AlexNet2012</a>] Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, ``_ImageNet Classification with Deep Convolutional Neural Networks_'',  2012.

[<a id="cit-Astounding2014" href="#call-Astounding2014">Astounding2014</a>] A. S. Razavian, H. Azizpour, J. Sullivan <em>et al.</em>, ``_CNN Features Off-the-Shelf: An Astounding Baseline for Recognition_'', CVPRW,  2014.

[<a id="cit-Melekhov2017relativePoseCnn" href="#call-Melekhov2017relativePoseCnn">Melekhov2017relativePoseCnn</a>] I. Melekhov, J. Ylioinas, J. Kannala <em>et al.</em>, ``_Relative Camera Pose Estimation Using Convolutional Neural Networks_'',  2017.  [online](https://arxiv.org/abs/1702.01381)

[<a id="cit-PoseNet2015" href="#call-PoseNet2015">PoseNet2015</a>] A. Kendall, M. Grimes and R. Cipolla, ``_PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization_'', ICCV,  2015.

[<a id="cit-sattler2019understanding" href="#call-sattler2019understanding">sattler2019understanding</a>] T. Sattler, Q. Zhou, M. Pollefeys <em>et al.</em>, ``_Understanding the limitations of cnn-based absolute camera pose regression_'', CVPR,  2019.

[<a id="cit-zhou2019learn" href="#call-zhou2019learn">zhou2019learn</a>] Q. Zhou, T. Sattler, M. Pollefeys <em>et al.</em>, ``_To Learn or Not to Learn: Visual Localization from Essential Matrices_'', ICRA,  2020.

[<a id="cit-pion2020benchmarking" href="#call-pion2020benchmarking">pion2020benchmarking</a>] !! _This reference was not found in biblio.bib _ !!

[<a id="cit-Tatarchenko2019" href="#call-Tatarchenko2019">Tatarchenko2019</a>] M. Tatarchenko, S.R. Richter, R. Ranftl <em>et al.</em>, ``_What Do Single-View 3D Reconstruction Networks Learn?_'', CVPR,  2019.

[<a id="cit-STN2015" href="#call-STN2015">STN2015</a>] M. Jaderberg, K. Simonyan and A. Zisserman, ``_Spatial transformer networks_'', NeurIPS,  2015.

[<a id="cit-NALU2018" href="#call-NALU2018">NALU2018</a>] A. Trask, F. Hill, S.E. Reed <em>et al.</em>, ``_Neural arithmetic logic units_'', NeurIPS,  2018.

[<a id="cit-NAU2020" href="#call-NAU2020">NAU2020</a>] A. Madsen and A. Rosenberg, ``_Neural Arithmetic Units_'', ICLR,  2020.

[<a id="cit-GroupEqCNN2016" href="#call-GroupEqCNN2016">GroupEqCNN2016</a>] T. Cohen and M. Welling, ``_Group equivariant convolutional networks_'', ICML,  2016.

[<a id="cit-MakeCNNShiftInvariant2019" href="#call-MakeCNNShiftInvariant2019">MakeCNNShiftInvariant2019</a>] R. Zhang, ``_Making convolutional networks shift-invariant again_'', ICML,  2019.

[<a id="cit-AbsPositionCNN2020" href="#call-AbsPositionCNN2020">AbsPositionCNN2020</a>] M. A. Islam, S. Jia and N. D. B. Bruce, ``_How Much Position Information Do Convolutional Neural Networks Encode?_'', ICLR,  2020.

[<a id="cit-AdvPatch2017" href="#call-AdvPatch2017">AdvPatch2017</a>] T. Brown, D. Mane, A. Roy <em>et al.</em>, ``_Adversarial patch_'', NeurIPSW,  2017.

[<a id="cit-OnePixelAttack2019" href="#call-OnePixelAttack2019">OnePixelAttack2019</a>] Su Jiawei, Vargas Danilo Vasconcellos and Sakurai Kouichi, ``_One pixel attack for fooling deep neural networks_'', IEEE Transactions on Evolutionary Computation, vol. 23, number 5, pp. 828--841,  2019.

[<a id="cit-cv4action2019" href="#call-cv4action2019">cv4action2019</a>] Zhou Brady, Krähenbühl Philipp and Koltun Vladlen, ``_Does computer vision matter for action?_'', Science Robotics, vol. 4, number 30,  2019.

[<a id="cit-HardNet2017" href="#call-HardNet2017">HardNet2017</a>] A. Mishchuk, D. Mishkin, F. Radenovic <em>et al.</em>, ``_Working Hard to Know Your Neighbor's Margins: Local Descriptor Learning Loss_'', NeurIPS,  2017.

[<a id="cit-KeyNet2019" href="#call-KeyNet2019">KeyNet2019</a>] A. Barroso-Laguna, E. Riba, D. Ponsa <em>et al.</em>, ``_Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters_'', ICCV,  2019.

[<a id="cit-SuperPoint2017" href="#call-SuperPoint2017">SuperPoint2017</a>] Detone D., Malisiewicz T. and Rabinovich A., ``_Superpoint: Self-Supervised Interest Point Detection and Description_'', CVPRW Deep Learning for Visual SLAM,  2018.

[<a id="cit-R2D22019" href="#call-R2D22019">R2D22019</a>] J. Revaud, ``_R2D2: Repeatable and Reliable Detector and Descriptor_'', NeurIPS,  2019.

[<a id="cit-D2Net2019" href="#call-D2Net2019">D2Net2019</a>] M. Dusmanu, I. Rocco, T. Pajdla <em>et al.</em>, ``_D2-Net: A Trainable CNN for Joint Detection and Description of Local Features_'', CVPR,  2019.

[<a id="cit-sarlin2019superglue" href="#call-sarlin2019superglue">sarlin2019superglue</a>] P. Sarlin, D. DeTone, T. Malisiewicz <em>et al.</em>, ``_SuperGlue: Learning Feature Matching with Graph Neural Networks_'', CVPR,  2020.

[<a id="cit-gradslam2020" href="#call-gradslam2020">gradslam2020</a>] J. Krishna Murthy, G. Iyer and L. Paull, ``_gradSLAM: Dense SLAM meets Automatic Differentiation_'', ICRA,  2020.

[<a id="cit-CapsNet2011" href="#call-CapsNet2011">CapsNet2011</a>] G.E. Hinton, A. Krizhevsky and S.D. Wang, ``_Transforming auto-encoders_'', ICANN,  2011.

[<a id="cit-CapsNet2017" href="#call-CapsNet2017">CapsNet2017</a>] S. Sabour, N. Frosst and G.E. Hinton, ``_Dynamic routing between capsules_'', NeurIPS,  2017.

[<a id="cit-li2020extreme" href="#call-li2020extreme">li2020extreme</a>] Li Jianguo, Sun Mingjie and Zhang Changshui, ``_Extreme Values are Accurate and Robust in Deep Networks_'',  2020.  [online](https://openreview.net/forum?id=H1gHb1rFwr)

