# Notes

Prior to this paper, there have already been many approaches to SLAM.

Early SLAM used probabilistic and filtering approaches, and alternating optimization of the map and camera poses.

> More recently, modern SLAM systems have leveraged least-squares
> optimization. A key element for accuracy has been full Bundle Adjustment (BA), which jointly optimizes the camera poses and the 3D map in a single optimization problem.

> One advantage of the optimization-based formulation is that a SLAM system can be easily modified to leverage different sensors.

ORB-SLAM3 supports monocular, stereo, RGB-D, and IMU sensors.

> Despite significant progress, current SLAM systems lack the robustness demanded for many real-world applications.

At the time, SLAM was still not robust enough for many real-world applications. Common failure modes include lost feature tracks, divergence in optimization, and accumulated drift.

They introduce the deep learning based DROID-SLAM in this paper.

> It has state-of-the-art performance, outperforming existing SLAM systems, classical or learning-based, on challenging benchmarks with very large margins.

DROID-SLAM has high accuracy, high robustness to failure, and strong generalization.

The “Differentiable Recurrent Optimization-Inspired Design” (DROID)

> is an end-to-end differentiable architecture that combines the strengths of both classical approaches and deep networks.

Like RAFT, it relies on recurrent iterative updates. Unlike RAFT, which iteratively updates optical flow between just two frames, DROID-SLAM iteratively updates camera poses and depth and can operate on any number of frames.

DROID-SLAM also uses a Dense Bundle Adjustment (DBA) layer.

### Related Work

**1. Visual SLAM**

Visual SLAM uses observations from monocular, stereo, or RGB-D images. Indirect approaches process the image into intermediate representations with points of interest and feature descriptors, and then match features between images.

They are optimized by minimizing re-projection error: a feature's estimated 3D position in the map is projected into the camera's field of view using the estimated camera pose, and the error is the distance between that projection and where the feature is actually observed in the image.

Direct methods instead optimize over photometric error and skip the intermediate feature representation. This lets them use more of the information in the image but leads to more difficult optimization problems.
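To make the contrast concrete, the two objectives can be written roughly as follows. The notation is an assumption for illustration and is not taken from the paper: $T$ is the camera pose, $\pi$ the camera projection, $X_k$ a 3D map point observed at pixel $u_k$, and $z_p$ the depth of pixel $p$ in image $I_i$.

$$
E_{\textrm{reproj}} = \sum_k \left\| u_k - \pi(T X_k) \right\|^2
\qquad
E_{\textrm{photo}} = \sum_p \left\| I_j\left(\pi(T\, \pi^{-1}(p, z_p))\right) - I_i(p) \right\|^2
$$

The first compares pixel positions of matched features, while the second compares image intensities directly, so it depends on the image content at every step of the optimization.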

DROID-SLAM takes an in-between approach: like direct methods, it does not use intermediate representations and passes images directly to the neural network, but like indirect methods, it optimizes re-projection error. This way, it gets the rich features of direct methods and the more tractable optimization of indirect methods.

**2. Deep Learning**

Prior deep learning work on SLAM has often used learning for specific components of the SLAM problem rather than for the whole system.

There have been few end-to-end approaches and many have been incomplete.

DROID-SLAM optimizes the depth of each pixel with deep learning to make it a more flexible approach.

### Approach

> We take a video as input with two objectives: estimate the trajectory of the camera and build a 3D map of the environment.

The network operates on a collection of images $\{ I_t \}_{t=0}^N$, with each image having two state variables: a camera pose $G_t \in SE(3)$ and an inverse depth map $d_t \in \mathbb{R}_+^{H \times W}$. Inverse depth is used for numerical stability: as depth grows large, inverse depth approaches 0 instead of growing without bound.

They also use a frame graph $(\mathcal{V}, \mathcal{E})$ that’s updated with edges $(i, j) \in \mathcal{E}$ for any pair of images $I_i$ and $I_j$ that have overlapping fields of view.
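A minimal sketch of the frame graph as a data structure is below. The overlap test is an assumption for illustration: it takes a hypothetical pairwise matrix `flow_dist` (for example, mean optical flow magnitude between frames) and a hypothetical threshold `max_dist`, and treats a small distance as a sign of overlapping fields of view. The actual criterion used by DROID-SLAM may differ.

```python
import numpy as np

def build_frame_graph(flow_dist, max_dist):
    """Return the set of directed edges (i, j) for frame pairs judged to overlap.

    flow_dist: (N, N) array, hypothetical pairwise distance between frames.
    max_dist:  hypothetical threshold; pairs below it count as overlapping.
    """
    n = flow_dist.shape[0]
    return {
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and flow_dist[i, j] < max_dist
    }

# Three frames: 0 and 1 overlap, 1 and 2 overlap, 0 and 2 do not.
flow_dist = np.array([[0.0, 5.0, 50.0],
                      [5.0, 0.0, 8.0],
                      [50.0, 8.0, 0.0]])
edges = build_frame_graph(flow_dist, max_dist=10.0)
```

Because the poses and depths change during inference, a graph like this would be rebuilt or extended as the estimates are refined.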

**1. Feature Extraction and Correlation**

Features are extracted as images are added to the system.

Each image is processed by two networks: a feature network and a context network. The output of the feature network is used to build the correlation volumes, while the context features are fed into the update operator at each iteration.

For each edge of the frame graph, correlations between the two images are calculated by taking the dot products of the feature vectors.

The correlation volumes are then stored so that they can be indexed during later lookups.
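The correlation and indexing steps above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: feature maps are plain `(H, W, D)` arrays, the lookup uses nearest-neighbour sampling with clamping at the border, and the function names `correlation_volume` and `lookup` are made up for this example.

```python
import numpy as np

def correlation_volume(feat_i, feat_j):
    """All-pairs dot products between two (H, W, D) feature maps.

    Returns a 4D volume of shape (H, W, H, W) where entry [h, w, u, v]
    is the dot product of feat_i[h, w] and feat_j[u, v].
    """
    return np.einsum("hwd,uvd->hwuv", feat_i, feat_j)

def lookup(corr, coords, radius=1):
    """Gather a (2r+1) x (2r+1) window of correlation values per pixel.

    coords: (H, W, 2) array of predicted (x, y) positions in the second image.
    Uses nearest-neighbour sampling and clamps indices to the image border.
    """
    H, W, H2, W2 = corr.shape
    offsets = np.arange(-radius, radius + 1)
    dy, dx = np.meshgrid(offsets, offsets, indexing="ij")
    x = np.rint(coords[..., 0]).astype(int)[..., None, None] + dx
    y = np.rint(coords[..., 1]).astype(int)[..., None, None] + dy
    x = np.clip(x, 0, W2 - 1)
    y = np.clip(y, 0, H2 - 1)
    h, w = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    window = corr[h[..., None, None], w[..., None, None], y, x]
    return window.reshape(H, W, -1)
```

The volume is computed once per edge, while the lookup is repeated every iteration with updated coordinates, which is why the volumes are worth storing.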

**2. Update Operator**

![Screenshot 2024-11-07 at 5.56.17 PM.png](../images/Screenshot_2024-11-07_at_5.56.17_PM.png)

The core part of DROID-SLAM is an update operator, a convolutional GRU with hidden state $h$, that on every iteration updates the camera poses and inverse depth maps as well as the hidden state.

Specifically, the GRU uses hidden state $h$ to compute a pose update $\Delta \xi^{(k)}$ and a depth update $\Delta d^{(k)}$. Then it computes the new pose values and depth values as follows:

$$
\textrm{G}^{(k+1)} = \textrm{Exp}(\Delta\xi^{(k)}) \circ \textrm{G}^{(k)} \\
\textrm{d}^{(k+1)} = \Delta \textrm{d}^{(k)} + \textrm{d}^{(k)}
$$

This should eventually converge to a fixed point representing the true reconstruction: $\{ \textrm{G}^{(k)}\} \rightarrow \textrm{G}^*, \{ \textrm{d}^{(k)} \} \rightarrow \textrm{d}^*$.
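The update rule above can be sketched with poses stored as 4x4 homogeneous matrices. Several details here are assumptions for illustration: the twist $\Delta\xi$ is ordered as (translation, rotation), the exponential map uses the standard closed form for $SE(3)$, and the function names are made up for this example.

```python
import numpy as np

def hat(phi):
    """Skew-symmetric matrix of a 3-vector."""
    x, y, z = phi
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def se3_exp(xi):
    """Exponential map from a 6-vector twist (tau, phi) to a 4x4 pose matrix."""
    tau = np.asarray(xi[:3], dtype=float)
    phi = np.asarray(xi[3:], dtype=float)
    theta = np.linalg.norm(phi)
    K = hat(phi)
    if theta < 1e-8:
        # Leading Taylor terms for very small rotations.
        A, B, C = 1.0, 0.5, 1.0 / 6.0
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
    R = np.eye(3) + A * K + B * (K @ K)
    V = np.eye(3) + B * K + C * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ tau
    return T

def apply_update(G, d, delta_xi, delta_d):
    """One step: left-multiply the pose by Exp(delta_xi), add delta_d to depth."""
    return se3_exp(delta_xi) @ G, d + delta_d
```

The pose is updated on the manifold rather than by plain addition, so the result stays a valid rigid transform, while inverse depth lives in a vector space and can simply be added.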

At the start of each iteration, the current poses and depths are used to estimate correspondence: for each edge $(i, j)$, every pixel in frame $i$ is reprojected into frame $j$. This gives a dense map of where each pixel is predicted to land in the other image.
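A minimal sketch of this correspondence step is below, assuming a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy` and poses stored as 4x4 world-to-camera matrices (both are assumptions for illustration). Points are kept in homogeneous form with inverse depth as the fourth coordinate, which avoids dividing by the inverse depth.

```python
import numpy as np

def project_correspondence(G_i, G_j, inv_depth, fx, fy, cx, cy):
    """Predict where each pixel of frame i lands in frame j.

    G_i, G_j:  4x4 world-to-camera poses (assumed convention).
    inv_depth: (H, W) inverse depth map of frame i.
    Returns an (H, W, 2) array of (x, y) pixel coordinates in frame j.
    """
    H, W = inv_depth.shape
    y, x = np.meshgrid(np.arange(H, dtype=float),
                       np.arange(W, dtype=float), indexing="ij")
    # Back-project: homogeneous point (X/Z, Y/Z, 1, 1/Z) for every pixel.
    pts = np.stack([(x - cx) / fx, (y - cy) / fy,
                    np.ones_like(x), inv_depth], axis=-1)
    # Relative transform taking points from camera i to camera j.
    G_ij = G_j @ np.linalg.inv(G_i)
    pts_j = pts @ G_ij.T
    X, Y, Z = pts_j[..., 0], pts_j[..., 1], pts_j[..., 2]
    # Project back to pixel coordinates in frame j.
    return np.stack([fx * X / Z + cx, fy * Y / Z + cy], axis=-1)
```

With identical poses the function returns the original pixel grid; any camera motion shifts pixels by an amount that depends on their inverse depth, which is what ties the flow to both pose and depth.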

Then, this correspondence field is used to index the correlation volumes: correlation values are looked up in a neighborhood around the pixels predicted by the correspondence field.

This gives the network correlation features and flow features that allow it to learn to align visually similar image regions.

These correlation and flow features (which are dense across the image) pass through 2 convolutional layers and then enter the GRU together with the context features.

Instead of predicting depth or pose updates directly, the network predicts updates to the dense flow fields: a revision flow field $r_{ij} \in \mathbb{R}^{H\times W \times 2}$ and an associated confidence map $w_{ij} \in \mathbb{R}_+^{H \times W \times 2}$. The revision is a correction to the current correspondence $\textrm{p}_{ij}$, giving the revised correspondence $\textrm{p}_{ij}^* = r_{ij} + \textrm{p}_{ij}$, which is where the network thinks each pixel should actually land.

The dense bundle adjustment layer (DBA) then maps these flow revisions into pose and pixel-wise depth updates (basically a projection layer).

We want an updated pose $G'$ and depth $d'$ such that reprojected points match the revised correspondence $\textrm{p}_{ij}^*$. The DBA layer is part of the computation graph, and back-propagation goes through the layer during training.
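As a sketch of what the DBA layer minimizes, the objective can be written as a confidence-weighted re-projection error over all edges of the frame graph. Here $\Pi_c$ denotes the camera projection and $\Pi_c^{-1}$ the back-projection of pixel grid $\textrm{p}_i$ with inverse depth $\textrm{d}'_i$, $\textrm{G}'_{ij}$ is the relative pose between frames $i$ and $j$, and the predicted confidences $w_{ij}$ act as per-pixel weights. This follows my reading of the paper's formulation, so treat the exact notation as an approximation.

$$
\textrm{E}(\textrm{G}', \textrm{d}') = \sum_{(i,j) \in \mathcal{E}} \left\| \textrm{p}_{ij}^* - \Pi_c\left( \textrm{G}'_{ij} \circ \Pi_c^{-1}(\textrm{p}_i, \textrm{d}'_i) \right) \right\|^2_{\Sigma_{ij}}, \qquad \Sigma_{ij} = \textrm{diag}\, w_{ij}
$$

Low-confidence pixels contribute little to the cost, so the network can down-weight regions where its flow revision is unreliable.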

**3. Training**

Training examples are made of 7-frame video sequences.

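The loss has two parts: a pose loss comparing the predicted poses with the ground-truth poses, and a flow loss comparing the optical flow induced by the predicted depths and poses with the flow induced by the ground-truth depths and poses.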

**4. SLAM System**

The SLAM system takes in video and performs localization and mapping. It has two threads: a **frontend** thread which takes new frames, extracts features, selects keyframes, and performs local bundle adjustment, and a **backend** which performs global bundle adjustment over the history of keyframes.

They initialize the system by collecting 12 frames and building a frame graph from them. During this phase, a frame is only kept if there is sufficient optical flow between it and the previously kept frame.
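The initialization filter can be sketched as a simple loop. The flow estimator is passed in as a hypothetical callable `flow_magnitude(a, b)`, and the threshold `min_flow` is a parameter of this sketch, not a value taken from these notes.

```python
def collect_init_frames(frames, flow_magnitude, min_flow, num_init=12):
    """Keep frames with enough motion relative to the last kept frame.

    frames:         iterable of incoming video frames.
    flow_magnitude: hypothetical callable returning the mean optical flow
                    (in pixels) between two frames.
    min_flow:       minimum flow required to keep a frame.
    num_init:       number of frames to collect before stopping.
    """
    kept = []
    for frame in frames:
        # Always keep the first frame; afterwards require sufficient motion.
        if not kept or flow_magnitude(kept[-1], frame) > min_flow:
            kept.append(frame)
        if len(kept) == num_init:
            break
    return kept

# Toy example: "frames" are camera positions and flow is proportional
# to the distance moved between them.
toy_frames = list(range(100))
toy_flow = lambda a, b: 10.0 * abs(b - a)
init_frames = collect_init_frames(toy_frames, toy_flow, min_flow=16.0)
```

Skipping near-duplicate frames gives the initial bundle adjustment enough baseline between views for depth to be well constrained.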

The frontend operates directly on the incoming video frames by maintaining the keyframes and frame graph, updating pose and depth estimations, and removing redundant keyframes.

The backend performs full global bundle adjustment on all keyframes and updates the frame graph.

The system can also use stereo and RGB-D input. For RGB-D, the sensor depth is still treated as a variable rather than a fixed measurement, since it can contain errors, so the system can still correct those errors instead of trusting the sensor outright.

### Experiments

> We compare to both deep learning and established classical SLAM algorithms and put specific emphasis on cross-dataset generalization.

> Our network is trained entirely on monocular video from the synthetic TartanAir dataset. Training takes 1 week on 4 RTX-3090 GPUs.

This goes to show how much room for improvement there may still be from scaling up parameters and training with more compute.

**1. TartanAir**

![Screenshot 2024-11-07 at 6.28.12 PM.png](../images/Screenshot_2024-11-07_at_6.28.12_PM.png)

> On most sequences, we outperform existing methods by an order-of-magnitude and achieve 8x lower average error than TartanVO and 20x lower than DeepV2D.

**2. EuRoC**

![Screenshot 2024-11-07 at 6.29.04 PM.png](../images/Screenshot_2024-11-07_at_6.29.04_PM.png)

**3. TUM-RGBD**

![Screenshot 2024-11-07 at 6.30.23 PM.png](../images/Screenshot_2024-11-07_at_6.30.23_PM.png)

> The RGBD dataset consists of indoor scenes captured with handheld camera. This is a notoriously difficult dataset for monocular methods due to rolling shutter artifacts, motion blur, and heavy rotation.

> It successfully tracks all 9 sequences while achieving 83% lower ATE than DeepFactors which succeeds on all videos and 90% lower ATE than DeepV2D.

**4. ETH3D-SLAM**

![Screenshot 2024-11-07 at 6.30.35 PM.png](../images/Screenshot_2024-11-07_at_6.30.35_PM.png)

> Our system can run in real-time with 2 3090 GPUs. Tracking and local BA is run on the first GPU, while global BA and loop closure is run on the second.

### Conclusion

> We introduce DROID-SLAM, an end-to-end neural architecture for visual SLAM. DROID-SLAM is accurate, robust, and versatile and can be used on monocular, stereo, and RGB-D video. It outperforms prior work by large margins on challenging benchmarks.
