## Noah Snavely reprojection error


```cpp
#include "ceres/ceres.h"
#include "ceres/rotation.h"

struct SnavelyReprojectionError {
  SnavelyReprojectionError(double observed_x, double observed_y)
      : observed_x(observed_x), observed_y(observed_y) {}
  template <typename T>
  bool operator()(const T* const camera,
                  const T* const point,
                  T* residuals) const {
    // camera[0,1,2] are the angle-axis rotation.
    T p[3];
    ceres::AngleAxisRotatePoint(camera, point, p);
    // camera[3,4,5] are the translation.
    p[0] += camera[3];
    p[1] += camera[4];
    p[2] += camera[5];
    // Compute the center of distortion. The sign change comes from
    // the camera model that Noah Snavely's Bundler assumes, whereby
    // the camera coordinate system has a negative z axis.
    const T xp = - p[0] / p[2];
    const T yp = - p[1] / p[2];
    // Apply second and fourth order radial distortion.
    const T& l1 = camera[7];
    const T& l2 = camera[8];
    const T r2 = xp*xp + yp*yp;
    const T distortion = T(1.0) + r2  * (l1 + l2  * r2);
    // Compute final projected point position.
    const T& focal = camera[6];
    const T predicted_x = focal * distortion * xp;
    const T predicted_y = focal * distortion * yp;
    // The error is the difference between the predicted and observed position.
    residuals[0] = predicted_x - T(observed_x);
    residuals[1] = predicted_y - T(observed_y);
    return true;
  }
  // Factory to hide the construction of the CostFunction object from
  // the client code.
  static ceres::CostFunction* Create(const double observed_x,
                                     const double observed_y) {
    return (new ceres::AutoDiffCostFunction<SnavelyReprojectionError, 2, 9, 3>(
                new SnavelyReprojectionError(observed_x, observed_y)));
  }
  double observed_x;
  double observed_y;
};
```

`SnavelyReprojectionError` is the cost functor used in the Ceres Solver bundle adjustment examples. It models the reprojection error as the difference between the predicted (projected) and the observed pixel coordinates of a 3D point.



1. **Input Observations (`observed_x`, `observed_y`)**:
   - These are the measured pixel coordinates in the image, provided as inputs during initialization of the struct.

2. **Camera Parameters (`camera`)**:
   - The `camera` parameter is a 9-dimensional array containing:
     - **[0-2]**: Rotation in angle-axis representation.
     - **[3-5]**: Translation vector components.
     - **[6]**: Focal length.
     - **[7-8]**: Radial distortion coefficients (second-order `l1` and fourth-order `l2`).

3. **Point Parameters (`point`)**:
   - The `point` parameter is a 3-dimensional array representing the 3D world point in space.

4. **Steps to Project 3D Points to Pixel Coordinates**:
   - **Rotation**: The 3D point is rotated using the angle-axis representation provided in `camera[0-2]`.
   - **Translation**: The rotated point is translated using the values in `camera[3-5]`.
   - **Perspective Division**: The camera assumes a pinhole model, so the 3D point is converted to normalized image coordinates `(xp, yp)`:
     $
     xp = -p[0] / p[2], \quad yp = -p[1] / p[2]
     $
   - **Radial Distortion**: The normalized coordinates are adjusted using a radial distortion model:
     $
     r^2 = xp^2 + yp^2
     $
     $
     distortion = 1 + r^2 \cdot (l1 + l2 \cdot r^2)
     $
     The distorted coordinates are:
     $
     xp' = distortion \cdot xp, \quad yp' = distortion \cdot yp
     $
   - **Scaling to Pixels**: The distorted coordinates are scaled by the focal length to get the projected pixel coordinates:
     $
     predicted_x = focal \cdot xp', \quad predicted_y = focal \cdot yp'
     $

5. **Residuals (Error)**:
   - The reprojection error is calculated as the difference between the projected pixel coordinates and the observed pixel coordinates:
     $
     residuals[0] = predicted_x - observed_x
     $
     $
     residuals[1] = predicted_y - observed_y
     $
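
The projection steps above can be checked without Ceres by re-implementing them for plain `double` values. The sketch below is a simplified stand-in, not the Ceres implementation: `RotateAngleAxis` and `ProjectSnavely` are hypothetical names, and the rotation uses Rodrigues' formula with a first-order fallback for very small angles.

```cpp
#include <array>
#include <cmath>

// Rotate `pt` by the angle-axis vector `aa` (axis * angle, radians) using
// Rodrigues' formula. Hypothetical stand-in for ceres::AngleAxisRotatePoint.
std::array<double, 3> RotateAngleAxis(const std::array<double, 3>& aa,
                                      const std::array<double, 3>& pt) {
  const double theta =
      std::sqrt(aa[0] * aa[0] + aa[1] * aa[1] + aa[2] * aa[2]);
  if (theta < 1e-12) {
    // Near-zero rotation: first-order approximation pt + aa x pt.
    return {pt[0] + aa[1] * pt[2] - aa[2] * pt[1],
            pt[1] + aa[2] * pt[0] - aa[0] * pt[2],
            pt[2] + aa[0] * pt[1] - aa[1] * pt[0]};
  }
  const double k[3] = {aa[0] / theta, aa[1] / theta, aa[2] / theta};
  const double c = std::cos(theta);
  const double s = std::sin(theta);
  // k x pt and k . pt
  const double kxp[3] = {k[1] * pt[2] - k[2] * pt[1],
                         k[2] * pt[0] - k[0] * pt[2],
                         k[0] * pt[1] - k[1] * pt[0]};
  const double kdp = k[0] * pt[0] + k[1] * pt[1] + k[2] * pt[2];
  return {pt[0] * c + kxp[0] * s + k[0] * kdp * (1.0 - c),
          pt[1] * c + kxp[1] * s + k[1] * kdp * (1.0 - c),
          pt[2] * c + kxp[2] * s + k[2] * kdp * (1.0 - c)};
}

// Project a world point with the 9-parameter camera layout used above:
// [0-2] angle-axis, [3-5] translation, [6] focal, [7-8] radial distortion.
std::array<double, 2> ProjectSnavely(const std::array<double, 9>& camera,
                                     const std::array<double, 3>& point) {
  std::array<double, 3> p =
      RotateAngleAxis({camera[0], camera[1], camera[2]}, point);
  p[0] += camera[3];
  p[1] += camera[4];
  p[2] += camera[5];
  // Bundler convention: the camera looks down the negative z axis.
  const double xp = -p[0] / p[2];
  const double yp = -p[1] / p[2];
  const double r2 = xp * xp + yp * yp;
  const double distortion = 1.0 + r2 * (camera[7] + camera[8] * r2);
  return {camera[6] * distortion * xp, camera[6] * distortion * yp};
}
```

For a camera with zero rotation and translation, `focal = 500`, `l1 = -0.1`, and `l2 = 0.01`, the point `(1, 2, -10)` projects to roughly `(49.75, 99.50)`. Subtracting the observed coordinates from that prediction gives the two residuals.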


### Principal Point in the Bundler Model
The `SnavelyReprojectionError` implementation does not explicitly include the principal point offsets $ c_x $ and $ c_y $ in the computation of pixel coordinates. 

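A general pinhole model maps normalized coordinates to pixel coordinates through an intrinsic matrix:
$
\text{Pixel coordinates} = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix} \cdot \text{Normalized coordinates}
$
Where:
- $ f $: Focal length.
- $ c_x, c_y $: Principal point offsets in the image plane.

The camera model described in the [Bundler manual](https://www.cs.cornell.edu/~snavely/bundler/bundler-v0.4-manual.html) has no principal point term. It uses only a focal length and two radial distortion coefficients, and it defines pixel coordinates with the origin at the center of the image.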

### Why $ c_x $ and $ c_y $ Are Missing in This Implementation
In the implementation above:
1. The predicted pixel coordinates are computed as:
   $
   predicted_x = f \cdot \text{distorted } x
   $
   $
   predicted_y = f \cdot \text{distorted } y
   $
   There is no explicit addition of $ c_x $ and $ c_y $, which would look like:
   $
   predicted_x = f \cdot \text{distorted } x + c_x
   $
   $
   predicted_y = f \cdot \text{distorted } y + c_y
   $

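2. This means the implementation assumes that observations are measured relative to the principal point, i.e., $ c_x = c_y = 0 $ in the coordinates passed to the constructor. If your pixel coordinates have their origin at an image corner, subtract the principal point from `observed_x` and `observed_y` first; otherwise every residual carries a constant offset. Treating the image center as the principal point is itself a simplification and might not be valid for all camera setups.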
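
A minimal sketch of that preprocessing step follows. `CenterObservation` is a hypothetical helper, not part of Ceres or Bundler. The optional `flip_y` flag reflects Bundler's documented image convention (positive y pointing up); whether you need it depends on how your own poses and points are defined, so treat it as an assumption to verify against your data.

```cpp
#include <array>

// Shift a raw pixel measurement (u, v) so that the principal point (cx, cy)
// becomes the origin. If flip_y is true, the y axis is also negated so that
// it points up instead of down.
std::array<double, 2> CenterObservation(double u, double v,
                                        double cx, double cy,
                                        bool flip_y) {
  const double x = u - cx;
  const double y = v - cy;
  return {x, flip_y ? -y : y};
}
```

For an assumed 1280x720 image with the principal point at its center `(640, 360)`, a measurement at `(700, 300)` becomes `(60, -60)` without the flip and `(60, 60)` with it. The centered values are what you would pass to `SnavelyReprojectionError::Create`.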



The formula for projecting a 3D point $\mathbf{X} = [X, Y, Z]^T$ into a camera with parameters $R_w^c$, $t_w^c$ (the pose of the **world frame expressed in the camera frame**), $f$ (focal length), and radial distortion coefficients $k_1$ and $k_2$ is as follows:

1. **Transform the 3D point into the camera coordinate system**:
   $
   \mathbf{X}_c = R_w^c \mathbf{X} + t_w^c
   $
   Where:
   - $\mathbf{X}_c = [X_c, Y_c, Z_c]^T$ is the point in the camera coordinate system.
   - $R_w^c$ is the 3x3 rotation matrix.
   - $t_w^c = [t_x, t_y, t_z]^T$ is the translation vector.

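2. **Project the point into normalized image coordinates** (the minus sign follows the Bundler convention of a camera looking down the negative z axis, matching `xp` and `yp` in the code):
   $
   x = -\frac{X_c}{Z_c}, \quad y = -\frac{Y_c}{Z_c}
   $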

3. **Apply radial distortion**:
   Compute the radial distance $r^2$:
   $
   r^2 = x^2 + y^2
   $
   Apply the radial distortion model:
   $
   x' = x \cdot \left(1 + k_1 r^2 + k_2 r^4\right)
   $
   $
   y' = y \cdot \left(1 + k_1 r^2 + k_2 r^4\right)
   $
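   Where:
   - $k_1$ and $k_2$ are the first and second radial distortion coefficients (`l1` and `l2` in the code).

4. **Scale by the focal length to get pixel coordinates**:
   $
   u = f \cdot x', \quad v = f \cdot y'
   $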
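
As a worked example with assumed values (chosen for illustration, not taken from any dataset), take normalized coordinates $x = 0.1$, $y = 0.2$, coefficients $k_1 = -0.1$, $k_2 = 0.01$, and focal length $f = 500$:
$
r^2 = 0.1^2 + 0.2^2 = 0.05
$
$
1 + k_1 r^2 + k_2 r^4 = 1 - 0.005 + 0.000025 = 0.995025
$
$
x' = 0.1 \cdot 0.995025 = 0.0995025, \quad y' = 0.2 \cdot 0.995025 = 0.199005
$
Scaling by the focal length, as the code does, gives pixel coordinates $f \cdot x' \approx 49.75$ and $f \cdot y' \approx 99.50$, measured relative to the image center.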

In a standard incremental SfM pipeline that uses something like the Snavely/BAL (Bundle Adjustment in the Large) reprojection model, *all cameras* and *all 3D points* should typically be expressed in a common “world” coordinate system (often chosen to be the first camera).  Whenever you add a new camera (e.g., camera 3) and triangulate new points in that camera’s local frame, you should transform both the camera’s pose and the newly triangulated points into the common reference frame before adding them to the bundle adjuster.

Below is a conceptual explanation of **why** and **how** to do that.

---

## 1. The Snavely Reprojection Model Assumes a Common World Frame

In the snippet of `SnavelyReprojectionError` you showed, we see:
```cpp
AngleAxisRotatePoint(camera, point, p);  
p[0] += camera[3];
p[1] += camera[4];
p[2] += camera[5];
```
Here,
- `camera[0..2]` is an angle-axis rotation (which represents $R$).
- `camera[3..5]` is a translation vector (which represents $\mathbf{t}$).
- `camera[6]` is the focal length, and `camera[7..8]` are radial distortion coefficients.

The code implements:
$
\mathbf{p}' = R \,\mathbf{X} + \mathbf{t}
$
where:
- $\mathbf{X}$ is the 3D point (in *world* coordinates),
- $R$ and $\mathbf{t}$ transform that point *from world* into *the camera’s frame*,
- Then it projects to 2D (accounting for the Bundler sign convention on the z-axis and radial distortion).

Hence, the expectation is that `point` (the 3D point) lives in the *same consistent world coordinate system* that corresponds to each camera’s extrinsics $(R,\mathbf{t})$.  

If each newly triangulated set of 3D points is instead stored in the local coordinate system of whichever camera you used to triangulate them, then the above transform $R\mathbf{X} + \mathbf{t}$ will not be consistent, because $\mathbf{X}$ is no longer the same “world” $\mathbf{X}$.  Therefore you must transform everything into *one* frame—usually picking the first camera as the “world” frame.

---

## 2. Typical Incremental SfM Workflow

1. **Choose a reference** (often the first camera) to define the origin $(R=I, \mathbf{t}=\mathbf{0})$.  
2. **Add a second camera**:
   - You find $\mathbf{R}_{1\to2}$ and $\mathbf{t}_{1\to2}$ by `cv::recoverPose` (which gives the transformation from camera 1 to camera 2).  
   - You convert that to a pose $(R_{2}, t_{2})$ in the “world” frame, which is simply $\mathbf{R}_2 = \mathbf{R}_{1\to2}$ and $\mathbf{t}_2 = \mathbf{t}_{1\to2}$ if you treat camera 1 as the origin.  
   - Triangulate points in the reference frame (camera 1’s frame). In OpenCV, typically you use something like:
     ```cpp
     triangulatePoints(P1, P2, matched_points1, matched_points2, points4D);
     ```
     Here, `P1` could be `[I|0]`, and `P2` is `[R_{1->2} | t_{1->2}]`, so the resulting 3D points are effectively in camera 1’s coordinate system (depending on your usage of `triangulatePoints`).
   - Now you have 3D points (in world = camera 1’s frame) plus camera 2’s extrinsics in that frame.

3. **Add a third camera**:
   - You match keypoints between camera 2 and camera 3, and do another `recoverPose` to get $\mathbf{R}_{2\to3}$ and $\mathbf{t}_{2\to3}$.  
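   - **But** in order to maintain *everything* in camera 1’s world frame, you must convert this camera-3 pose to the same global reference frame. Substituting $\mathbf{X}_2 = \mathbf{R}_{1\to2}\,\mathbf{X}_1 + \mathbf{t}_{1\to2}$ into $\mathbf{X}_3 = \mathbf{R}_{2\to3}\,\mathbf{X}_2 + \mathbf{t}_{2\to3}$ gives:
     $
       \mathbf{R}_{1\to3} = \mathbf{R}_{2\to3} \,\mathbf{R}_{1\to2}, 
       \quad
       \mathbf{t}_{1\to3} = \mathbf{R}_{2\to3}\,\mathbf{t}_{1\to2} + \mathbf{t}_{2\to3}.
     $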
   - Once you have $\mathbf{R}_{1\to3}$ and $\mathbf{t}_{1\to3}$, those are the extrinsic parameters for camera 3 *in the same world frame*.  
   - If you triangulate new points with camera 2 and camera 3, those new points will initially end up in camera 2’s frame (depending on how you call `triangulatePoints`). So you must transform them from camera 2’s frame into camera 1’s frame by inverting the camera-1-to-camera-2 transform:
     $
       \mathbf{X}_{w} \;=\; \mathbf{R}_{1\to2}^{\,\top} \bigl(\mathbf{X}_{2} \;-\; \mathbf{t}_{1\to2}\bigr).
     $
     (Or whichever formula matches how your P-matrices are set up in OpenCV. The key is to bring $\mathbf{X}$ into the same “world” coordinate system.)

4. **Add all cameras and all points** (now in a common frame) to the bundle adjuster.  Each camera’s extrinsics are parameterized by an angle-axis and translation that map *world $\to$ that camera*, and each 3D point $\mathbf{X}_w$ is stored in the same world coordinate system.  
5. **Refine** all parameters via bundle adjustment.

---

## 3. Do You *Have* to Transform Points and Poses Every Time?

- **Yes, if** you want to keep a single global frame (which is by far the most common approach), you must express any newly triangulated points and newly computed camera extrinsics in that same frame.  
- You do *not* necessarily have to re-triangulate *old* points if they are already consistent.  You only have to transform:
  1. The newly found camera pose into the global frame,  
  2. The newly triangulated 3D points into the global frame,  
  3. Then add them to the optimization problem.

Once everything is in one consistent coordinate system, the standard Snavely reprojection model works directly.

---

## 4. Summary

- Pick camera 1 as *world* with $\mathbf{R}=I$, $\mathbf{t}=\mathbf{0}$.  
- For each new camera $k$:
  1. Compute $\mathbf{R}_{k}$ and $\mathbf{t}_{k}$ **in the world frame** (using the known pose from a previous camera that is already in the world frame).  
  2. Triangulate new points (they might initially come out in camera $k$’s frame or in the frame of an earlier camera $j$).  Convert them to *world* coordinates.  
  3. Add $\mathbf{R}_{k}, \mathbf{t}_{k}$ and the 3D points in *world* frame to the bundle adjuster.

This ensures that each residual evaluation inside `SnavelyReprojectionError` does exactly:
$
\mathbf{p}_{\text{cam}} 
= R_{k} \,\mathbf{X}_{w} \;+\; \mathbf{t}_{k}
$
for the 3D point $\mathbf{X}_w$ (in world coordinates) and camera $k$’s extrinsics $\{R_{k}, \mathbf{t}_{k}\}$.  Once everything is transformed consistently, no ambiguity about local frames remains.
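
To hand a world-frame pose to the bundle adjuster, the rotation matrix has to be turned into the angle-axis form stored in `camera[0..2]`. The sketch below shows one way to do that and to fill the 9-element parameter block. `RotationMatrixToAngleAxisSketch` and `PackCamera` are hypothetical names, the matrix is assumed row-major, and rotations close to 180 degrees are deliberately not handled. Ceres provides its own `ceres::RotationMatrixToAngleAxis` in `ceres/rotation.h`, which is the safer choice in real code (check its expected memory layout).

```cpp
#include <algorithm>
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;  // row-major

// Convert a rotation matrix to an angle-axis vector (unit axis * angle).
// Simplified: correct for angles near 0 and for general angles, but NOT for
// rotations close to 180 degrees, where sin(theta) vanishes.
Vec3 RotationMatrixToAngleAxisSketch(const Mat3& R) {
  const double trace = R[0][0] + R[1][1] + R[2][2];
  const double cos_theta = std::min(1.0, std::max(-1.0, (trace - 1.0) / 2.0));
  const double theta = std::acos(cos_theta);
  // The antisymmetric part of R equals 2 * sin(theta) * axis.
  const Vec3 v = {R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1]};
  const double sin_theta = std::sin(theta);
  if (sin_theta < 1e-9) {
    // Small angle: sin(theta) ~ theta, so v / 2 ~ theta * axis.
    return {v[0] / 2.0, v[1] / 2.0, v[2] / 2.0};
  }
  const double scale = theta / (2.0 * sin_theta);
  return {v[0] * scale, v[1] * scale, v[2] * scale};
}

// Fill the 9-parameter camera block: [0-2] angle-axis, [3-5] translation,
// [6] focal length, [7-8] radial distortion coefficients.
std::array<double, 9> PackCamera(const Mat3& R, const Vec3& t,
                                 double focal, double k1, double k2) {
  const Vec3 aa = RotationMatrixToAngleAxisSketch(R);
  return {aa[0], aa[1], aa[2], t[0], t[1], t[2], focal, k1, k2};
}
```

The resulting array can serve as the mutable `camera` parameter block for a residual, with the 3D point stored in a separate 3-element array in world coordinates.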

**Short Answer**  
Yes, if you have 3D points expressed in camera 2’s frame (after triangulating between camera 2 and camera 3), and you want those points in the **world** frame (which you’ve designated as camera 1’s frame), you need to apply the **inverse** of the transform that takes points from camera 1 to camera 2. In other words, if your `recoverPose` gave you $R_{1\to2}$ and $\mathbf{t}_{1\to2}$ (so that $\mathbf{X}_2 = R_{1\to2}\,\mathbf{X}_1 + \mathbf{t}_{1\to2}$), then to go from camera 2 coordinates back to camera 1 coordinates (“world”), you must invert that:
$
\mathbf{X}_1 
= 
R_{1\to2}^{\,\top}
\bigl(\mathbf{X}_2 - \mathbf{t}_{1\to2}\bigr).
$
Equivalently,
$
R_{2\to1} = R_{1\to2}^{\,\top}, 
\quad
\mathbf{t}_{2\to1} = -\,R_{1\to2}^{\,\top} \,\mathbf{t}_{1\to2}.
$

---

## Detailed Explanation

1. **What `recoverPose` Returns**

   When you call  
   ```cpp
   recoverPose(E, points1, points2, K, R, t);
   ```  
   with `points1` from camera 1 and `points2` from camera 2,
   - It returns $R$ and $t$ such that a 3D point $\mathbf{X}_1$ in camera 1’s coordinates is transformed to camera 2’s coordinates by:  
     $
       \mathbf{X}_2 \;=\; R\,\mathbf{X}_1 \;+\; t.
     $
   - We can denote that as $\,^{2}T_{1} = (R_{1\to2},\, t_{1\to2})$.

2. **Camera 1 as World**

   In many SfM pipelines, we choose camera 1’s coordinate system to be the “world” frame. Then camera 2’s extrinsic parameters $(R_2,\, t_2)$ are exactly $(R_{1\to2},\, t_{1\to2})$, meaning:
   $
     \mathbf{X}_2 
     \;=\;
     R_2\,\mathbf{X}_\mathrm{world} + t_2
     \;\;(\text{but here } \mathbf{X}_\mathrm{world}\equiv \mathbf{X}_1).
   $
   No inversion is needed to define camera 2’s pose with respect to the world.

3. **Triangulating Between Camera 2 and Camera 3**

   - Suppose you then add camera 3. You match features between camera 2 and camera 3 and do `recoverPose` again to get $R_{2\to3}$, $\mathbf{t}_{2\to3}$. You might use OpenCV’s `triangulatePoints` with projection matrices that place camera 2 at the origin of that triangulation process.
   - The resulting 3D points from that call will come out in **camera 2’s** local coordinate system (depending on how you formed your projection matrices `P2`, `P3`).

4. **Convert Those Points to the World Frame (camera 1)**

   Now you have some new 3D points $\mathbf{X}_2$ in camera 2’s frame, but you want everything in the same world (camera 1) coordinate system. The question is: “How do I go from camera 2 coords to camera 1 coords?”

   - If we know 
     $
       \mathbf{X}_2 
       \;=\;
       R_{1\to2}\,\mathbf{X}_1 
       + 
       \mathbf{t}_{1\to2},
     $
     then we invert it:
     $
       \mathbf{X}_1 
       \;=\;
       R_{1\to2}^{\,\top}
       \bigl(\mathbf{X}_2 - \mathbf{t}_{1\to2}\bigr).
     $
   - Or equivalently, define 
     $
       R_{2\to1} 
       = 
       R_{1\to2}^{\,\top}
       , 
       \quad
       \mathbf{t}_{2\to1}
       =
       -\,R_{1\to2}^{\,\top} \,\mathbf{t}_{1\to2},
     $
     so 
     $
       \mathbf{X}_1 
       =
       R_{2\to1}\,\mathbf{X}_2
       +
       \mathbf{t}_{2\to1}.
     $

5. **Bundle Adjustment**

   Once you have the newly triangulated points in the world (camera 1) coordinates, you can add them (and camera 3’s pose in world coords) into your bundle adjuster. During bundle adjustment:
   - Each 3D point is stored in world coords,
   - Each camera has extrinsics that transform a world point $\mathbf{X}_\mathrm{w}$ into that camera’s local frame,
   $
     \mathbf{X}_\mathrm{cam} = R_\mathrm{cam}\,\mathbf{X}_\mathrm{w} + t_\mathrm{cam}.
   $

Thus, **yes**—if your triangulation code yields points in camera 2’s reference frame, and you want them in camera 1’s (world) frame, you use the **inverse** of the “camera 1 to camera 2” transform.