# Explanation

Early SLAM approaches depended on LiDAR and were relatively naive; for example, they didn't take advantage of loop closures (when the robot returns to a position it has visited before) to correct errors in the robot's trajectory and environment map.

ORB-SLAM instead relies on only a single camera, which is far cheaper than LiDAR. It also uses loop closures effectively and is more robust to tracking failure.

They accomplish this by detecting ORB image features in a series of frames, using a keyframe filtering system to save only the frames from a video feed that contain sufficient novel information, and implementing other algorithmic improvements.

With this design, ORB-SLAM-based approaches (most recently ORB-SLAM3) achieve SOTA performance among SLAM solutions. Notably, deep-learning-based approaches with weak priors still haven't fully surpassed classical SLAM approaches like ORB-SLAM, which carry far more inductive bias.

# Notes

> Visual SLAM has the goal of estimating the camera trajectory while reconstructing the environment.

Visual SLAM is challenging as it requires efficient usage of a subset of observations and keyframes to prevent redundancy as complexity grows, a strong network of observations to produce accurate results, sufficient loop-closure abilities, handling occlusions, etc.

ORB-SLAM is a new monocular SLAM algorithm that:

- Uses a single set of ORB features for all tasks: tracking, mapping, relocalization, and loop closing.
- Operates in real time in large environments using a covisibility graph.
- Performs real-time loop closing based on pose graph optimization.
- Provides real-time camera relocalization, which allows recovery from tracking failure.
- Introduces a new initialization procedure.
- Uses a survival-of-the-fittest approach to map point and keyframe selection.

> To the best of our knowledge, this is the most complete and reliable solution to monocular SLAM, and for the benefit of the community we make the source code public.

### Related Work

**1. Place Recognition**

> [Place recognition approaches] based on appearance, that is image to image matching, scale better in large environments than map to map or image to map methods.

> With appearance based methods, bag of words techniques, are to the fore because of their high efficiency.

**2. Map Initialization**

> Monocular SLAM requires a procedure to create an initial map because depth cannot be recovered from a single image.
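
To make the depth ambiguity concrete, here is a minimal sketch using a pinhole camera model. The focal length, principal point, and 3D points are made-up values for illustration; `project` is a hypothetical helper, not something from the paper.

```python
def project(point, f=500.0, cx=320.0, cy=240.0):
    """Project a 3D point in camera coordinates to pixel coordinates (pinhole model)."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


# Two points on the same viewing ray: the second is four times farther away.
near = (0.2, 0.1, 1.0)
far = tuple(4.0 * c for c in near)

pixel_near = project(near)
pixel_far = project(far)

# Both points land on the same pixel, so one image cannot tell them apart.
print(pixel_near, pixel_far)
```

Because every point along a ray maps to the same pixel, at least two views with some baseline between them are needed before map points can be triangulated.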

**3. Monocular SLAM**

> Monocular SLAM was initially solved by filtering, [where] every frame is processed by the filter to jointly estimate the map feature locations and the camera pose.

> It has the drawbacks of wasting computation in processing consecutive frames with little new information and the accumulation of linearization errors.

> On the other hand keyframe-based approaches, estimate the map using only selected frames (keyframes) allowing to perform more costly but accurate bundle adjustment optimizations.

> Keyframe-based techniques are more accurate than filtering for the same computational cost.

> [PTAM] was the first work to introduce the idea of splitting camera tracking and mapping in parallel threads, and demonstrated to be successful for real time augmented reality applications in small environments.

> In our system we take advantage of the excellent ideas of using a local map based on co-visibility, and building the pose graph from the co-visibility graph, but apply them in a totally redesigned front-end and back-end.

> Another difference is that, instead of using specific features for loop detection (SURF), we perform the place recognition on the same tracked and mapped features, obtaining robust frame-rate relocalization and loop detection.

> All visual SLAM works in the literature agree that running BA with all the points and all frames is not feasible.

> The most cost effective approach is to keep as much points as possible, while keeping only non-redundant keyframes.

> Our survival of the fittest strategy achieves unprecedented robustness in difficult scenarios by inserting keyframes as quickly as possible, and removing later the redundant ones, to avoid the extra cost.

### System Overview

**1. Feature Choice**

> One of the main design ideas in our system is that the same features used by the mapping and tracking are used for place recognition to perform frame-rate relocalization and loop detection.

They reuse the same features for every task instead of extracting a second feature set for place recognition, which makes the algorithm far more computationally efficient.

> [ORB features] are extremely fast to compute and match, while they have good invariance to viewpoint.

They chose ORB features because they are fast to extract and fast to match.
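
Part of why matching is fast is that ORB descriptors are binary strings, so they can be compared with the Hamming distance (a bit count of the XOR). Below is a minimal sketch of brute-force matching on toy 8-bit descriptors; real ORB descriptors have 256 bits. The function names and the `max_distance` threshold are assumptions for illustration, not the paper's matching procedure.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match(query, train, max_distance=2):
    """For each query descriptor, return the index of the closest train
    descriptor, or None when even the closest one is too far away."""
    matches = []
    for q in query:
        distances = [hamming(q, t) for t in train]
        best = min(range(len(train)), key=distances.__getitem__)
        matches.append(best if distances[best] <= max_distance else None)
    return matches


# Toy 8-bit descriptors from two frames.
frame_a = [0b10110010, 0b01001101, 0b11110000]
frame_b = [0b01001111, 0b10110011, 0b00001111]

matches = match(frame_a, frame_b)
print(matches)  # [1, 0, None]
```

The third descriptor has no counterpart within the threshold, so it stays unmatched rather than being forced onto a poor candidate.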

**2. Three Threads: Tracking, Local Mapping, and Loop Closing**

![Screenshot 2024-11-07 at 2.14.14 PM.png](../../images/Screenshot_2024-11-07_at_2.14.14_PM.png)

> Our system, see an overview in Fig. 1, incorporates three threads that run in parallel: tracking, local mapping and loop closing.

> The tracking is in charge of localizing the camera with every frame and deciding when to insert a new keyframe.

> The local mapping processes new keyframes and performs local BA to achieve an optimal reconstruction in the surroundings of the camera pose.

> The local mapping is also in charge of culling redundant keyframes.

> The loop closing searches for loops with every new keyframe. If a loop is detected, we compute a similarity transformation that informs about the drift accumulated in the loop. Then both sides of the loop are aligned and duplicated points are fused.

**3. Map Points, Key Frames, and their Selection**

Each map point $p_i$ stores:

- Its 3D position within world coordinates
- Its viewing direction
- A representative ORB descriptor for the point
- The max and min distances from which it can be observed

Each keyframe $K_i$ stores:

- The camera pose that transforms points from the world to camera coordinates
- The camera intrinsics like focal length and principal point
- All the ORB features extracted in the frame
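
The two records above could be sketched as plain data classes. The field names and types here are assumptions chosen for readability; they are not taken from the ORB-SLAM source code.

```python
from dataclasses import dataclass, field


@dataclass
class MapPoint:
    position: tuple            # 3D position in world coordinates
    viewing_direction: tuple   # unit vector describing the viewing direction
    descriptor: int            # representative ORB descriptor, bits packed in an int
    min_distance: float        # closest distance from which the point can be observed
    max_distance: float        # farthest distance from which the point can be observed


@dataclass
class KeyFrame:
    pose_world_to_camera: list   # 4x4 transform from world to camera coordinates
    focal_length: tuple          # (fx, fy) intrinsics
    principal_point: tuple       # (cx, cy) intrinsics
    descriptors: list = field(default_factory=list)  # all ORB features in the frame


identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

point = MapPoint((1.0, 2.0, 5.0), (0.0, 0.0, 1.0), 0b10110010, 0.5, 20.0)
keyframe = KeyFrame(identity, (500.0, 500.0), (320.0, 240.0), [0b10110010])
```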

> Map points and keyframes are created with a generous policy, while a later very exigent culling mechanism is in charge of detecting redundant keyframes and wrongly matched or not trackable map points.

> This permits a flexible map expansion during exploration, which promotes robustness under hard conditions.

**4. Covisibility Graph and Essential Graph**

> Covisibility information between keyframes is very useful in several tasks of our system, and is represented as an undirected weighted graph.

> Each node is a keyframe and an edge between two keyframes exists if they share observations of the same map points.

They use this covisibility graph for loop closure.
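
A minimal sketch of how such a graph could be built from observations, with the edge weight set to the number of shared map points. The `min_shared` parameter and the dictionary layout are assumptions for illustration.

```python
from itertools import combinations


def build_covisibility_graph(observations, min_shared=1):
    """observations maps a keyframe id to the set of map point ids it sees.
    Returns {(kf_a, kf_b): weight} with kf_a < kf_b, where weight is the
    number of map points observed by both keyframes."""
    edges = {}
    for a, b in combinations(sorted(observations), 2):
        shared = len(observations[a] & observations[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


observations = {
    0: {1, 2, 3, 4},
    1: {3, 4, 5},
    2: {5, 6},
    3: {9},
}

graph = build_covisibility_graph(observations)
print(graph)  # {(0, 1): 2, (1, 2): 1}
```

Keyframe 3 shares no map points with any other keyframe, so it has no edges. Raising `min_shared` keeps only the strongly connected pairs, which gives a sparser graph.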

**5. Bag of Words Place Recognition**

> The system has embedded a bags of words place recognition module, to perform loop detection and localization.

> Visual words are just a discretization of the descriptor space, which is known as the visual vocabulary. The vocabulary is created offline with the ORB descriptors extracted from a large set of images.

> If the images are general enough, the same vocabulary can be used for different environments getting a good performance.
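
Here is a toy sketch of the bag-of-words idea: each descriptor is assigned to its nearest visual word, and an image becomes a histogram of word counts that can be compared cheaply. This uses a flat three-word vocabulary and cosine similarity purely for illustration; the vocabulary structure and scoring used by the actual system may differ, and all names below are hypothetical.

```python
import math
from collections import Counter


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def to_bow(descriptors, vocabulary):
    """Histogram of visual words: each descriptor votes for its nearest word."""
    words = Counter()
    for d in descriptors:
        nearest = min(range(len(vocabulary)), key=lambda i: hamming(d, vocabulary[i]))
        words[nearest] += 1
    return words


def cosine_similarity(bow_a, bow_b):
    """Cosine similarity between two word histograms."""
    dot = sum(count * bow_b[word] for word, count in bow_a.items())
    norm_a = math.sqrt(sum(c * c for c in bow_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bow_b.values()))
    return dot / (norm_a * norm_b)


# Toy vocabulary of three 8-bit visual words (built offline in the real system).
vocabulary = [0b00000000, 0b11111111, 0b11110000]

bow_a = to_bow([0b00000001, 0b11111110, 0b11110001], vocabulary)
bow_b = to_bow([0b00000011, 0b11111101, 0b11100000], vocabulary)  # similar place
bow_c = to_bow([0b00000000, 0b00000010], vocabulary)              # different place

sim_ab = cosine_similarity(bow_a, bow_b)
sim_ac = cosine_similarity(bow_a, bow_c)
print(sim_ab > sim_ac)  # True
```

Images of the same place quantize to the same words even when individual descriptors differ by a few bits, which is what makes the comparison tolerant to small appearance changes.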

### Automatic Map Initialization

> The goal of the map initialization is to compute the relative pose between two frames to triangulate an initial set of map points.

![Screenshot 2024-11-07 at 2.26.35 PM.png](../../images/Screenshot_2024-11-07_at_2.26.35_PM.png)

### Tracking

The tracking thread steps are performed at every camera frame.

**1. ORB Extraction**

They first divide the image into a grid and extract FAST corners from each cell, so that features are spread evenly across the image. Then they compute orientations and ORB descriptors for the retained FAST corners.

**2. Initial Pose Estimation from Previous Frame**

If tracking in the last frame was successful, they use a constant-velocity motion model to predict the camera pose and search for previously observed map points that should be visible.
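
A constant-velocity prediction can be sketched as follows, assuming poses are stored as 4x4 world-to-camera matrices: the motion between the two most recent frames is applied once more to the latest pose. The helper names are hypothetical and the matrices are plain nested lists to keep the sketch dependency-free.

```python
def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[0] + [trans[0]],
            r_t[1] + [trans[1]],
            r_t[2] + [trans[2]],
            [0.0, 0.0, 0.0, 1.0]]


def predict_pose(previous, last):
    """Assume the motion from `previous` to `last` repeats for the next frame."""
    velocity = matmul(last, invert_rigid(previous))
    return matmul(velocity, last)


def translation(x, y, z):
    """Pure-translation transform, used here only to build a toy example."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


previous = translation(0.0, 0.0, 0.0)
last = translation(0.1, 0.0, 0.0)
predicted = predict_pose(previous, last)

# The 0.1 step along x is repeated, so the predicted x translation is about 0.2.
print(predicted[0][3])
```

The predicted pose is only a starting guess; it narrows the image region in which each map point is searched for before the pose is refined against the matches.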

**3. Initial Pose Estimation via Global Relocalization**

If they lose tracking, they convert the frame into a bag of words and query the recognition database for keyframes that they can use for relocalization.

**4. Track Local Map**

> Once we have an estimation of the camera pose and an initial set of feature matches, we can project the map into the frame and search more map point correspondences.

They only use a local map so that the cost of this search stays bounded in large environments.

> The camera pose is finally optimized with all the map points found in the frame.

**5. New Keyframe Decision**

Then the system has to decide whether to insert the current frame as a keyframe.

> As there is a mechanism in the local mapping to cull redundant keyframes, we will try to insert keyframes as fast as possible, because that makes the tracking more robust to challenging camera movements, typically rotations.

They want to be generous with storing keyframes because it helps with fast camera movements that are otherwise hard to recover from.

### Local Mapping

These steps are performed for every new keyframe.

**1. Keyframe Insertion**

They add the keyframe to the covisibility graph and store the bag of words representation of the keyframe.

**2. Recent Map Points Culling**

> Map points, in order to be retained in the map, must pass a restrictive test during the first three keyframes after creation.
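
A sketch of what such a test might look like. The two conditions and their default thresholds (a found ratio above 25%, and at least three observing keyframes once more than one keyframe has passed since creation) are recalled from the paper rather than stated in these notes, so treat them as assumptions to verify. All names are hypothetical.

```python
def keep_map_point(times_found, times_predicted_visible, observing_keyframes,
                   keyframes_since_creation,
                   min_found_ratio=0.25, min_observations=3):
    """Return True if a recently created map point passes the culling test."""
    # Tracking must actually find the point often enough when it should be visible.
    if times_predicted_visible > 0:
        if times_found / times_predicted_visible <= min_found_ratio:
            return False
    # After the first keyframe, the point needs enough independent observations.
    if keyframes_since_creation > 1 and observing_keyframes < min_observations:
        return False
    return True


print(keep_map_point(8, 10, 4, 2))  # True: found often, seen by 4 keyframes
print(keep_map_point(2, 10, 4, 2))  # False: found in only 20% of expected frames
print(keep_map_point(8, 10, 2, 2))  # False: seen by too few keyframes
```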

**3. New Map Point Creation**

> New map points are created by triangulating ORB from connected keyframes in the covisibility graph.

**4. Local Bundle Adjustment**

> The local BA optimizes the currently processed keyframe, all the keyframes connected to it in the covisibility graph, and all the map points seen by those keyframes.

**5. Local Keyframe Culling**

> In order to maintain a compact reconstruction, the local mapping tries to detect redundant keyframes and delete them.

> We discard all the keyframes in Kc whose 90% of the map points have been seen in at least other three keyframes in the same or finer scale.
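
The culling rule quoted above can be sketched with set operations. This simplified version ignores the "same or finer scale" condition and checks every candidate against the unmodified map, so it is an illustration of the 90% / three-keyframe criterion rather than the full procedure. The function and parameter names are hypothetical.

```python
def redundant_keyframes(observations, candidates, ratio=0.9, min_other_views=3):
    """observations maps a keyframe id to the set of map point ids it sees.
    A candidate is redundant when at least `ratio` of its map points are
    also seen by `min_other_views` or more other keyframes."""
    redundant = []
    for kf in candidates:
        points = observations[kf]
        if not points:
            continue
        well_covered = 0
        for p in points:
            others = sum(1 for other, seen in observations.items()
                         if other != kf and p in seen)
            if others >= min_other_views:
                well_covered += 1
        if well_covered / len(points) >= ratio:
            redundant.append(kf)
    return redundant


shared = set(range(1, 10))  # map points 1..9, seen by every keyframe
observations = {
    "kf0": shared | {10},
    "kf1": shared | {10, 11, 12},
    "kf2": shared | {13, 14},
    "kf3": shared | {15, 16},
}

print(redundant_keyframes(observations, list(observations)))  # ['kf0']
```

Only `kf0` is flagged: nine of its ten points are covered by three other keyframes, while the other keyframes each contribute enough unique points to fall below the 90% ratio.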

### Loop Closing

**1. Loop Candidates Detection**

> At first we compute the similarity between the bag of words vector of $K_i$ and all its neighbors in the covisibility graph.

They query the recognition database to find similar keyframes and discard candidates whose similarity score is lower than a threshold.

> To accept a loop candidate we must detect consecutively three loop candidates that are consistent.

**2. Compute the Similarity Transformation**

They compute a similarity transformation between the current keyframe and the loop keyframe, which captures the drift accumulated around the loop.

**3. Loop Fusion**

> The first step in the loop correction is to fuse duplicated map points and insert new edges in the covisibility graph.

**4. Essential Graph Optimization**

> To effectively close the loop, we perform a pose graph optimization over the Essential Graph that distributes the loop closing error along the graph.
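
To build intuition for how the error gets distributed, here is a 1D toy analogue of pose graph optimization. Poses are scalars on a line, odometry gives the step between consecutive poses, and a loop closure measures the offset between the first and last pose. With all constraints weighted equally, the least-squares solution spreads the drift evenly over every constraint. The real system optimizes similarity transformations rather than scalars, so this is only a sketch of the principle, and `close_loop_1d` is a hypothetical name.

```python
def close_loop_1d(odometry, loop_measurement):
    """Least-squares poses x_0..x_n for a 1D chain with x_0 fixed at 0.

    Constraints (all equally weighted):
      x_{i+1} - x_i = odometry[i]       for each odometry step
      x_n - x_0     = loop_measurement  for the loop closure
    """
    n = len(odometry)
    drift = sum(odometry) - loop_measurement
    # Closed-form minimizer: each of the n + 1 constraints absorbs an equal share.
    correction = drift / (n + 1)
    poses = [0.0]
    for step in odometry:
        poses.append(poses[-1] + step - correction)
    return poses


odometry = [1.0, 1.0, 1.0, 1.0]   # dead reckoning says the end is 4.0 from the start
loop_measurement = 3.5            # the loop closure says it is 3.5

poses = close_loop_1d(odometry, loop_measurement)
print(poses)  # roughly [0.0, 0.9, 1.8, 2.7, 3.6]
```

Each odometry step shrinks by 0.1 and the loop constraint is left with a residual of 0.1, instead of the last pose absorbing the whole 0.5 of drift.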

### Experiments

> Our system runs in real time and processes the images exactly at the frame rate they were acquired.

> ORB-SLAM has three main threads, that run in parallel with other tasks from ROS and the operating system.

**1. System Performance in the NewCollege Dataset**

![Screenshot 2024-11-07 at 2.51.05 PM.png](../../images/Screenshot_2024-11-07_at_2.51.05_PM.png)

> The NewCollege dataset contains a 2.2km sequence from a robot traversing a campus and adjacent parks.

Since ORB-SLAM depends purely on visual data, they can reconstruct the environment just from a video. Very cool.

> It contains several loops and fast rotations that makes the sequence quite challenging for monocular vision. To the best of our knowledge there is no other monocular system in the literature able to process this whole sequence.

Loop closures and fast turning make video data difficult to process.

**2. Localization Accuracy in the TUM RGB-D Benchmark**

> The TUM RGB-D benchmark is an excellent dataset to evaluate the accuracy of camera localization as it provides several sequences with accurate ground truth obtained with an external motion capture system.

This dataset is good for evaluating localization accuracy because it comes with ground-truth camera trajectories from a motion capture system.

> In terms of accuracy ORB-SLAM and PTAM are similar in open trajectories, while ORB-SLAM achieves higher accuracy when detecting large loops.

**3. Relocalization in the TUM RGB-D Benchmark**

> ORB-SLAM accurately relocalizes more than the double of frames than PTAM.

**4. Lifelong Experiment in the TUM RGB-D Benchmark**

> Previous relocalization experiments have shown that our system is able to localize in a map from very different viewpoints and robustly under moderate dynamic changes.

> This property in conjunction with our keyframe culling procedure allows to operate lifelong in the same environment under different viewpoints and some dynamic changes.

> While the lifelong operation in a static scenario should be a requirement of any SLAM system, more interesting is the case where dynamic changes occur.

**5. Large Scale and Large Loop Closing in the KITTI Dataset**

> 11 sequences from a car driven around a residential area with accurate ground truth from GPS and a Velodyne laser scanner.

> This is a very challenging dataset for monocular vision due to fast rotations, areas with lot of foliage, which make more difficult data association, and relatively high car speed.

![Screenshot 2024-11-07 at 2.56.51 PM.png](../../images/Screenshot_2024-11-07_at_2.56.51_PM.png)

### Conclusions and Discussion

**1. Conclusion**

> In this work we have presented a new monocular SLAM system with a detailed description of its building blocks and an exhaustive evaluation in public datasets.

> The accuracy of the system is typically below 1 cm.

> The main contribution of our work is to expand the versatility of PTAM to environments that are intractable for that system.

> To the best of our knowledge, no other system has demonstrated to work in as many different scenarios and with such accuracy. Therefore our system is currently the most reliable and complete solution for monocular SLAM.

> Finally we have also demonstrated that ORB features have enough recognition power to enable place recognition from severe viewpoint change.

**2. Sparse/Feature-based vs. Dense/Direct Methods**

This is a sparse/feature-based SLAM method. There are also dense/direct methods that perform a dense reconstruction of the environment and localize the camera by optimizing directly over image pixel intensities.

> In contrast, feature-based methods are able to match features with a wide baseline, thanks to their good invariance to viewpoint and illumination changes.

> We consider that the future of monocular SLAM should incorporate the best of both approaches.

**3. Future Work**

> The accuracy of our system can still be improved incorporating points at infinity in the tracking.

> Another open way is to upgrade the sparse map of our system to a denser and more useful reconstruction.
