# Problem: Absolute camera orientation given a set of relative camera pairs

Useful video: https://www.youtube.com/watch?v=MyrVDUnaqUs

Given an optical system of $N$ cameras with extrinsics $\{g_1, ..., g_N | g_i \in SE(3)\}$, and a set of relative camera positions $\{g_{ij} | g_{ij}\in SE(3)\}$ that map between coordinate frames of randomly selected pairs of cameras $(i, j)$, we search for the absolute extrinsic parameters $\{g_1, ..., g_N\}$ that are consistent with the relative camera motions.

*Extrinsic parameters define the absolute position and orientation of a camera in a global or world coordinate system. They tell you where the camera is located (translation) and how it is pointed (rotation) in relation to a common reference point in that world space*

This optimization process aims to adjust the estimated camera positions and orientations so that they align as closely as possible with their true positions. It minimizes the differences between observed relative positions of camera pairs and those calculated from these adjusted positions. The goal is to refine the estimates iteratively until they accurately reflect the real-world setup of the cameras.

More formally: $$ g_1, ..., g_N = {\arg \min}_{g_1, ..., g_N} \sum_{g_{ij}} d(g_{ij}, g_i^{-1} g_j) $$ where $d(g_i, g_j)$ is a suitable metric that compares the extrinsics of cameras $g_i$ and $g_j$.

Visually, the problem can be described as follows. The picture below depicts the situation at the beginning of our optimization. The ground truth cameras are plotted in purple while the randomly initialized estimated cameras are plotted in orange:

![problem start](https://github.com/facebookresearch/pytorch3d/blob/main/docs/tutorials/data/bundle_adjustment_initialization.png?raw=1)

**Our optimization seeks to align the estimated (orange) cameras with the ground truth (purple) cameras, by minimizing the discrepancies between pairs of relative cameras.**

![problem finish](https://github.com/facebookresearch/pytorch3d/blob/main/docs/tutorials/data/bundle_adjustment_final.png?raw=1)

In practice, the camera extrinsics $g_{ij}$ and $g_i$ are represented using objects from the SfMPerspectiveCameras class initialized with the corresponding rotation and translation matrices R_absolute and T_absolute that define the extrinsic parameters $g = (R, T); R \in SO(3); T \in \mathbb{R}^3$.
In order to ensure that R_absolute is a valid rotation matrix, we represent it using an exponential map (implemented with so3_exp_map) of the axis-angle representation of the rotation log_R_absolute.

Note that the solution to this problem can only be recovered up to an unknown global rigid transformation $g_{glob} \in SE(3)$. Thus, for simplicity, we assume knowledge of the absolute extrinsics of the first camera, denoted $g_0$ here to match index 0 in the code. We set $g_0$ as a trivial camera $g_0 = (I, \vec{0})$.
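The ambiguity can be checked numerically. The sketch below is only an illustration and not part of the original tutorial: it uses plain 4x4 homogeneous matrices in the column-vector convention, and the helper se3_matrix and all numeric values are hypothetical. Applying the same global transform g_glob to two absolute cameras leaves $g_i^{-1} g_j$ unchanged, so the relative cameras alone cannot pin down the global frame.

```python
import math
import torch

def se3_matrix(angle_z: float, translation) -> torch.Tensor:
    """Hypothetical helper: 4x4 rigid transform (rotation about z, then translation)."""
    c, s = math.cos(angle_z), math.sin(angle_z)
    g = torch.eye(4, dtype=torch.float64)
    g[:3, :3] = torch.tensor(
        [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]], dtype=torch.float64
    )
    g[:3, 3] = torch.tensor(translation, dtype=torch.float64)
    return g

# two absolute cameras and an arbitrary global rigid transform (made-up values)
g_i = se3_matrix(0.3, [1.0, 0.0, 2.0])
g_j = se3_matrix(-1.1, [0.5, -2.0, 0.0])
g_glob = se3_matrix(2.0, [10.0, -3.0, 7.0])

# relative transform before and after moving both cameras by g_glob
rel_original = torch.linalg.inv(g_i) @ g_j
rel_shifted = torch.linalg.inv(g_glob @ g_i) @ (g_glob @ g_j)

# the absolute cameras changed, the relative transform did not
print(torch.allclose(rel_original, rel_shifted))
```

Fixing one camera, as done here with the trivial first camera, removes this freedom.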

Visualization of bundle adjustments: https://drive.google.com/file/d/1jxER6Gqjw3-dcNx7s7PG8DXK6xTquAXk/view?usp=sharing


## Terminology

### Special Orthogonal Group

- SO(n): The Special Orthogonal group SO(n) is the group of n×n rotation matrices with determinant 1. These matrices represent rotations in n-dimensional space and are orthogonal, meaning their inverse is their transpose.
- SO(3): Specifically, SO(3) refers to the group of 3x3 rotation matrices that describe all possible rotations in 3-dimensional space. In computer vision and robotics, SO(3) is crucial for representing the orientation of objects or cameras in three dimensions.

### Special Euclidean Group
- SE(n): The Special Euclidean group SE(n) refers to the set of all transformations that can be described as a rotation followed by a translation. These transformations are represented using matrices that combine rotational and translational components. For n-dimensional space, SE(n) transformations are typically represented by (n+1)×(n+1) matrices.
- SE(3): Specifically, SE(3) involves transformations in 3-dimensional space. An SE(3) transformation matrix is a 4x4 matrix:

$$ g = \begin{bmatrix} R & T \\ \vec{0}^\top & 1 \end{bmatrix} $$

Here, R is a 3x3 matrix from SO(3) representing the rotation, and T is a 3-element column vector representing the translation. The last row is the fixed row [0,0,0,1], which makes the matrix homogeneous and suitable for operations in projective space.
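As a small illustration of the 4x4 homogeneous form described above, the following sketch builds an SE(3) matrix from a rotation R and a translation T, applies it to a point, and forms the inverse in closed form. The numeric values are made up. It follows the column-vector layout of the text; PyTorch3D's own transforms appear to use the transposed (row-vector) layout with the translation in the last row, as the `matrix_rel[:, 3, :3]` indexing later in this notebook suggests.

```python
import math
import torch

# rotation by 90 degrees about the z-axis and an arbitrary translation
angle = math.pi / 2
R = torch.tensor(
    [[math.cos(angle), -math.sin(angle), 0.0],
     [math.sin(angle),  math.cos(angle), 0.0],
     [0.0, 0.0, 1.0]],
    dtype=torch.float64,
)
T = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)

# assemble the homogeneous matrix [[R, T], [0, 1]]
g = torch.eye(4, dtype=torch.float64)
g[:3, :3] = R
g[:3, 3] = T

# closed-form inverse: [[R^T, -R^T T], [0, 1]]
g_inv = torch.eye(4, dtype=torch.float64)
g_inv[:3, :3] = R.T
g_inv[:3, 3] = -R.T @ T

# a point in homogeneous coordinates is rotated first, then translated
point = torch.tensor([1.0, 0.0, 0.0, 1.0], dtype=torch.float64)
moved = g @ point
print(moved)
```

The rotation sends (1, 0, 0) to (0, 1, 0), and the translation then shifts it to approximately (1, 3, 3).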

# Installation and Imports

In [None]:
!pip install torch


In [None]:
!pip install torchvision



In [None]:
import os
import sys
import torch
need_pytorch3d=False
try:
    import pytorch3d
except ModuleNotFoundError:
    need_pytorch3d=True
if need_pytorch3d:
    if torch.__version__.startswith("2.2.") and sys.platform.startswith("linux"):
        # We try to install PyTorch3D via a released wheel.
        pyt_version_str=torch.__version__.split("+")[0].replace(".", "")
        version_str="".join([
            f"py3{sys.version_info.minor}_cu",
            torch.version.cuda.replace(".",""),
            f"_pyt{pyt_version_str}"
        ])
        !pip install fvcore iopath
        !pip install --no-index --no-cache-dir pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/{version_str}/download.html
    else:
        # We try to install PyTorch3D from source.
        !pip install 'git+https://github.com/facebookresearch/pytorch3d.git@stable'


In [None]:
import torch
from pytorch3d.transforms.so3 import (
    so3_exp_map,
    so3_relative_angle,
)
from pytorch3d.renderer.cameras import (
    SfMPerspectiveCameras,
)

# add path for demo utils
import sys
import os
sys.path.append(os.path.abspath(''))

# set for reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")
    print("WARNING: CPU only, this will be slow!")

In [None]:
!wget https://raw.githubusercontent.com/facebookresearch/pytorch3d/main/docs/tutorials/utils/camera_visualization.py
from camera_visualization import plot_camera_scene

!mkdir data
!wget -P data https://raw.githubusercontent.com/facebookresearch/pytorch3d/main/docs/tutorials/data/camera_graph.pth


# Camera setup and ground truth

In [None]:
camera_graph_file = './data/camera_graph.pth'
(R_absolute_gt, T_absolute_gt), (R_relative, T_relative), relative_edges = torch.load(camera_graph_file)

- R stands for Rotation matrix. It is part of the camera's extrinsic parameters and describes how the camera is oriented in space. This is typically a 3x3 matrix.
- T stands for Translation vector. It also belongs to the camera's extrinsic parameters and tells you how the camera is positioned in space relative to some reference point. This is usually a 3-dimensional vector.

- Ground Truth (Absolute) Coordinates: These are the actual (real-world) positions and orientations of the cameras. These are known and fixed, used as a reference to assess the accuracy of estimated positions.

- Relative Coordinates: These describe the position and orientation of one camera relative to another. For example, if you know the position of Camera A and you have the relative position of Camera B to Camera A, you can calculate the position of Camera B.

- relative_edges: A list of pairs of indices indicating between which cameras the relative positions are known.

In [None]:
cameras_relative = SfMPerspectiveCameras(
    R = R_relative.to(device),
    T = T_relative.to(device),
    device = device,
)

In [None]:
cameras_absolute_gt = SfMPerspectiveCameras(
    R = R_absolute_gt.to(device),
    T = T_absolute_gt.to(device),
    device = device,
)

The SfMPerspectiveCameras class in PyTorch3D is used to model cameras within a 3D environment by setting their positions and orientations through rotation matrices and translation vectors. This class is particularly useful in computer vision and 3D reconstruction tasks, such as Structure from Motion (SfM), where the goal is to reconstruct a scene's 3D structure from multiple 2D images. It allows for the simulation of camera behavior under different poses, and it's crucial for algorithms that need to estimate or optimize camera positions to align virtual camera views with actual observed data.

In [None]:
# the number of absolute camera positions
N = R_absolute_gt.shape[0]

# Optimization Functions

We now define two functions crucial for the optimization.

calc_camera_distance compares a pair of cameras. This function is important as it defines the loss that we are minimizing. The method utilizes the so3_relative_angle function from the SO3 API.

get_relative_camera computes the parameters of a relative camera that maps between a pair of absolute cameras. Here we utilize the compose and inverse class methods from the PyTorch3D Transforms API.

In [None]:
def calc_camera_distance(cam_1, cam_2):
    """
    Calculates the divergence of a batch of pairs of cameras cam_1, cam_2.
    The distance is composed of the cosine of the relative angle between
    the rotation components of the camera extrinsics and the l2 distance
    between the translation vectors.
    """
    # rotation distance
    R_distance = (1.-so3_relative_angle(cam_1.R, cam_2.R, cos_angle=True)).mean()
    # translation distance
    T_distance = ((cam_1.T - cam_2.T)**2).sum(1).mean()
    # the final distance is the sum
    return R_distance + T_distance

The function calc_camera_distance(cam_1, cam_2) measures the difference between two cameras in terms of their orientation and position.

The rotation distance evaluates how much the orientation of one camera differs from another. Calling so3_relative_angle with cos_angle=True returns the cosine of the angle theta of the relative rotation between the two rotation matrices. The expression 1 - cos(theta) maps this cosine to a scale from 0 to 2, where 0 indicates that the cameras have identical orientations and 2 indicates that their orientations differ by 180 degrees. The result is averaged over the batch of camera pairs.

The translation distance measures how far apart the translation vectors of the two cameras are. The code uses the squared Euclidean (L2) distance: it sums the squares of the differences between corresponding components of the vectors, without taking a square root, and then averages over the batch.

After calculating these two distances, the function sums them to provide a single metric that quantifies the total difference in camera poses. This combined measure is particularly useful in scenarios like camera calibration or multi-camera system alignments where you need to match the orientation and position of one camera to another.

This combination of rotation and translation measurements (one based on angles, the other on distance) does not automatically balance the different units. In practice, you might need to apply weights or scaling factors.
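One possible way to add such weights is sketched below. This is an assumption-laden variant, not the tutorial's function: the name weighted_camera_distance and the parameters w_R and w_T are hypothetical, it takes raw tensors instead of camera objects, and it obtains cos(theta) from the trace of the relative rotation in plain PyTorch rather than calling so3_relative_angle.

```python
import torch

def weighted_camera_distance(R_1, T_1, R_2, T_2, w_R=1.0, w_T=1.0):
    """Weighted sum of a rotation term and a squared-L2 translation term.

    R_1, R_2: (B, 3, 3) rotation matrices; T_1, T_2: (B, 3) translations.
    w_R, w_T: hypothetical weights balancing the two terms.
    """
    # trace(R_1 @ R_2^T) equals the elementwise product summed over the matrix
    rel_trace = (R_1 * R_2).sum(dim=(1, 2))
    # for a rotation by theta, trace = 1 + 2 cos(theta)
    cos_theta = ((rel_trace - 1.0) / 2.0).clamp(-1.0, 1.0)
    R_distance = (1.0 - cos_theta).mean()
    # squared Euclidean distance, averaged over the batch
    T_distance = ((T_1 - T_2) ** 2).sum(dim=1).mean()
    return w_R * R_distance + w_T * T_distance

eye = torch.eye(3).unsqueeze(0)
flip = torch.diag(torch.tensor([-1.0, -1.0, 1.0])).unsqueeze(0)  # 180 degrees about z
zero = torch.zeros(1, 3)
shift = torch.tensor([[3.0, 4.0, 0.0]])

d_same = weighted_camera_distance(eye, zero, eye, zero)
d_diff = weighted_camera_distance(eye, zero, flip, shift, w_T=0.1)
print(d_same.item(), d_diff.item())
```

In the second call the rotation term reaches its maximum of 2, while the squared translation distance of 25 is scaled down to 2.5 by the weight. Without the weight, the translation term would dominate the loss.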

In [None]:
def get_relative_camera(cams, edges):
    """
    For each pair of indices (i,j) in "edges" generate a camera
    that maps from the coordinates of the camera cams[i] to
    the coordinates of the camera cams[j]
    """

    # first generate the world-to-view Transform3d objects of each
    # camera pair (i, j) according to the edges argument
    trans_i, trans_j = [
        SfMPerspectiveCameras(
            R = cams.R[edges[:, i]],
            T = cams.T[edges[:, i]],
            device = device,
        ).get_world_to_view_transform()
         for i in (0, 1)
    ]

    # compose the relative transformation as g_i^{-1} g_j
    trans_rel = trans_i.inverse().compose(trans_j)

    # generate a camera from the relative transform
    matrix_rel = trans_rel.get_matrix()
    cams_relative = SfMPerspectiveCameras(
                        R = matrix_rel[:, :3, :3],
                        T = matrix_rel[:, 3, :3],
                        device = device,
                    )
    return cams_relative

The get_relative_camera(cams, edges) function is designed to compute the relative transformation between pairs of cameras based on given indices. It essentially tells you how to adjust from the perspective of one camera to match another within the same system.

1. Extract Camera Transformations: For each camera pair specified by edges, the function first retrieves the world-to-view transformation of each camera. This transformation describes how each camera views the world, that is, how to rotate and translate world coordinates into the camera’s own coordinate system. It involves both a rotation (which way the camera is looking) and a translation (where the camera is positioned in space).

2. Invert and Compose Transformations: For each pair, the function calculates the transformation that leads from the first camera's coordinate frame to the second's. This is done by first inverting the transformation of the first camera, which maps coordinates from that camera's frame back to world coordinates, and then applying the transformation of the second camera, which maps world coordinates into the second camera's frame.

3. Create New Camera Models: From the combined transformations, the function then constructs a new camera model for each pair. These new models do not correspond to physical cameras but rather represent the relative orientations and positions between pairs of cameras as computed from their transformations.

The syntax for i in (0, 1) in the list comprehension is a Pythonic way to loop twice: first to process the transformation of the first camera in each pair (i=0) and then the second camera (i=1). This helps efficiently generate the transformations for both cameras in each pair using a concise code structure.

In summary, the function does not convert relative coordinates to absolute coordinates. Instead, it provides a way to understand and model how each camera in a specified pair is positioned relative to the other. This relative modeling is fundamental in multi-camera setups where understanding the spatial relationships and orientations between cameras directly impacts the accuracy and effectiveness of the overall system's operation.
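The invert-and-compose step can also be written out by hand. The sketch below is an illustration under an assumption: it uses the row-vector convention X_cam = X_world R + T, which is my reading of how PyTorch3D cameras apply their extrinsics. The helper rot_z and all numeric values are hypothetical. It checks that the relative transform takes a point from camera i's frame to camera j's frame.

```python
import math
import torch

def rot_z(angle: float) -> torch.Tensor:
    """Hypothetical helper: rotation matrix about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor(
        [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]], dtype=torch.float64
    )

# world-to-view extrinsics of two cameras (row-vector convention assumed)
R_i, T_i = rot_z(0.4), torch.tensor([0.0, 1.0, 2.0], dtype=torch.float64)
R_j, T_j = rot_z(-0.9), torch.tensor([3.0, 0.0, -1.0], dtype=torch.float64)

# "undo camera i, then apply camera j"
R_rel = R_i.T @ R_j
T_rel = T_j - T_i @ R_rel

# a world point seen from both cameras
X_world = torch.tensor([[0.5, -1.0, 4.0]], dtype=torch.float64)
X_i = X_world @ R_i + T_i
X_j = X_world @ R_j + T_j

# the relative transform maps camera-i coordinates to camera-j coordinates
X_j_from_i = X_i @ R_rel + T_rel
print(torch.allclose(X_j_from_i, X_j))
```

The check works without ever using X_world in the last step, which is the point of a relative camera: it relates two camera frames directly.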

# Optimization

Finally, we start the optimization of the absolute cameras.

We use SGD with momentum and optimize over log_R_absolute and T_absolute.

As mentioned earlier, log_R_absolute is the axis angle representation of the rotation part of our absolute cameras. We can obtain the 3x3 rotation matrix R_absolute that corresponds to log_R_absolute with:

R_absolute = so3_exp_map(log_R_absolute)
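For intuition, the exponential map can be written out with the Rodrigues formula. The following is a hand-written sketch in plain PyTorch, not PyTorch3D's so3_exp_map implementation; the function name axis_angle_to_matrix_sketch and the eps clamp are assumptions made for illustration.

```python
import math
import torch

def axis_angle_to_matrix_sketch(log_rot: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Map axis-angle vectors (N, 3) to rotation matrices (N, 3, 3) via Rodrigues."""
    # rotation angle = vector norm; the clamp avoids dividing by zero
    theta = (log_rot ** 2).sum(dim=1).clamp(min=eps).sqrt()
    x, y, z = log_rot.unbind(dim=1)
    zeros = torch.zeros_like(x)
    # skew-symmetric matrices K with K @ v = log_rot x v (cross product)
    K = torch.stack(
        [zeros, -z, y, z, zeros, -x, -y, x, zeros], dim=1
    ).reshape(-1, 3, 3)
    a = (torch.sin(theta) / theta)[:, None, None]
    b = ((1.0 - torch.cos(theta)) / theta ** 2)[:, None, None]
    eye = torch.eye(3, dtype=log_rot.dtype, device=log_rot.device)
    # R = I + (sin t / t) K + ((1 - cos t) / t^2) K^2
    return eye + a * K + b * (K @ K)

# a quarter turn about the z-axis, and the zero vector (no rotation)
log_R = torch.tensor([[0.0, 0.0, math.pi / 2], [0.0, 0.0, 0.0]])
R = axis_angle_to_matrix_sketch(log_R)
print(R)
```

Any vector in $\mathbb{R}^3$ maps to a valid rotation matrix, so the optimizer can update the three numbers freely without leaving SO(3). The zero vector maps to the identity, which is how the trivial first camera is represented.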


In [None]:
# initialize the absolute log-rotations/translations with random entries
log_R_absolute_init = torch.randn(N, 3, dtype=torch.float32, device=device)
T_absolute_init = torch.randn(N, 3, dtype=torch.float32, device=device)

# furthermore, we know that the first camera is a trivial one
#    (see the description above)
log_R_absolute_init[0, :] = 0.
T_absolute_init[0, :] = 0.

These variables store the initial guesses for the rotations and translations of the cameras in the scene. log_R_absolute_init holds the logarithmic representation of the rotations, and T_absolute_init holds the translations. Each camera is represented by three values in each of these tensors:
- Rotation (log_R_absolute_init): The rotation of each camera is stored in a compact, logarithmic form (specifically, the axis-angle representation). This form expresses a rotation in 3D space as a vector along the rotation axis, with a magnitude equal to the angle of rotation in radians.
- Translation (T_absolute_init): This stores the x, y, and z components of each camera's translation vector T.

torch.randn initializes these values with random entries drawn from a standard normal distribution. This randomness serves as an initial guess for the positions and orientations of the cameras, which will be refined through the optimization process.

The first camera serves as the reference (anchor) of the scene, so its rotation and translation are initialized to zero.

Using the logarithmic form for rotations, specifically the axis-angle representation, simplifies gradient calculations and allows for linear operations like interpolation and extrapolation. This approach avoids the constraints of orthogonality and unit norm required by rotation matrices and quaternions, making the optimization process more straightforward.

In [None]:
# instantiate a copy of the initialization of log_R / T
log_R_absolute = log_R_absolute_init.clone().detach()
log_R_absolute.requires_grad = True
T_absolute = T_absolute_init.clone().detach()
T_absolute.requires_grad = True

**Cloning**: The .clone() method creates a copy of the original tensor (log_R_absolute_init and T_absolute_init). This is important because you often want to keep the initial values unchanged for reference or reuse them later without affecting them during the optimization process. Cloning ensures that the original tensors remain intact.

**Detaching:** The .detach() method returns a tensor that is cut off from the current computation graph. In PyTorch, tensors that are part of a computation graph record the operations performed on them so that gradients can be computed. Detaching the clones makes them leaf tensors with no recorded history, so gradients computed later stop at these tensors and never flow back to the initial ones.

**requires_grad = True**: This setting is crucial for enabling automatic differentiation on these tensors. By setting requires_grad to True, PyTorch knows that it needs to compute gradients for these tensors when performing backpropagation. This is necessary because, in an optimization loop, you want to adjust these values (rotations and translations) to minimize the loss function, and gradients are required for the optimization algorithm (like SGD) to update the parameters.
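The effect of the three steps can be seen on a toy tensor. This is a minimal sketch with made-up names (init, init_backup, param) and a made-up quadratic loss; it only illustrates that the parameter copy receives gradients and can be updated while the initial values stay untouched.

```python
import torch

init = torch.randn(4, 3)
init_backup = init.clone()  # kept only to verify that init never changes

# same pattern as in the notebook: copy, cut from any graph, enable gradients
param = init.clone().detach()
param.requires_grad = True

# toy loss and one manual gradient step
loss = (param ** 2).sum()
loss.backward()
with torch.no_grad():
    param -= 0.1 * param.grad

print(param.is_leaf, param.requires_grad)
print(torch.equal(init, init_backup))
```

Because param is a leaf tensor that requires gradients, it can be handed directly to an optimizer such as torch.optim.SGD.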

In [None]:
# the mask that specifies which cameras are going to be optimized
#     (since we know the first camera is already correct,
#      we only optimize over the 2nd-to-last cameras)
camera_mask = torch.ones(N, 1, dtype=torch.float32, device=device)
camera_mask[0] = 0.

The concept of a "mask" in this context is used to control which elements in a dataset or tensor are affected by certain operations, typically during computation processes like optimization. In this notebook, a mask specifies which cameras in the multi-camera system should undergo optimization.

Creating the Mask: The mask is created as a tensor of ones (torch.ones(N, 1, dtype=torch.float32, device=device)), which initially suggests that all cameras are candidates for optimization. The size of the mask (N, 1) matches the number of cameras, and each entry in the mask corresponds to a camera.

Setting the First Camera to Zero: By setting the first entry of the mask to zero (camera_mask[0] = 0), the script explicitly excludes the first camera from the optimization process. The value 0 indicates that any operations controlled by the mask should not affect this camera.

First Camera as Reference: As mentioned previously, the first camera is often set to have zero rotation and zero translation, aligning it perfectly with the coordinate system's origin and orientation. This camera serves as a fixed reference or baseline for the system. Since its position and orientation are already defined as correct, there is no need to adjust or optimize its parameters.

Optimization of Other Cameras: The remaining cameras (from the second to the last) have a mask value of 1, so they are active in the optimization. Their parameters are adjusted so that the relative cameras composed from the current absolute estimates match the given relative cameras.
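Multiplying a parameter tensor by such a mask has two effects: the masked entries enter the loss as zeros, and they receive zero gradient. A minimal sketch with a made-up loss and hypothetical names (num_cams, T_param, mask) shows both:

```python
import torch

num_cams = 4
T_param = torch.ones(num_cams, 3, requires_grad=True)

# mask of ones with the first camera switched off
mask = torch.ones(num_cams, 1)
mask[0] = 0.0

# the (num_cams, 1) mask broadcasts over the 3 components of each row
T_used = T_param * mask
loss = ((T_used - 2.0) ** 2).sum()
loss.backward()

print(T_used[0])        # first camera is forced to zero
print(T_param.grad[0])  # and receives no gradient
print(T_param.grad[1])  # the other cameras do
```

Since the first row's gradient is always zero, a gradient-based optimizer leaves that row's stored values unchanged, and the value actually used for the first camera stays at zero.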

In [None]:
# init the optimizer
optimizer = torch.optim.SGD([log_R_absolute, T_absolute], lr=.1, momentum=0.9)

This initializes an optimizer using PyTorch's Stochastic Gradient Descent (SGD) method, which is designed to update the parameters (in this case, log_R_absolute and T_absolute) to minimize a loss function over iterations. The lr=.1 specifies the learning rate, which controls how much the parameters change in response to the calculated gradient during each update, and momentum=0.9 helps accelerate the optimizer in the right direction, thus improving the convergence. This setup does not define the criteria of optimality or loss function itself; it simply sets up the mechanism for updating parameters once the loss is computed during the optimization loop.
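To see the update mechanism in isolation, the sketch below runs the same optimizer settings on a toy one-dimensional problem. The quadratic loss, the target value 3.0, and the names toy_x and toy_optimizer are all made up for illustration; only the zero_grad / backward / step pattern carries over to the camera optimization.

```python
import torch

toy_x = torch.tensor([0.0], requires_grad=True)
toy_optimizer = torch.optim.SGD([toy_x], lr=0.1, momentum=0.9)

for _ in range(500):
    toy_optimizer.zero_grad()            # clear gradients from the last step
    loss = ((toy_x - 3.0) ** 2).sum()    # toy loss with its minimum at 3.0
    loss.backward()                      # compute d(loss)/d(toy_x)
    toy_optimizer.step()                 # momentum update of toy_x

print(toy_x.item())
```

With momentum, the iterate overshoots and oscillates around the minimum before settling near 3.0, which is typical behavior for this setting rather than a sign of a bug.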


In [None]:
# run the optimization
n_iter = 2000  # fix the number of iterations
for it in range(n_iter):
    # re-init the optimizer gradients
    optimizer.zero_grad()

    # compute the absolute camera rotations as
    # an exponential map of the logarithms (=axis-angles)
    # of the absolute rotations
    R_absolute = so3_exp_map(log_R_absolute * camera_mask)

    # get the current absolute cameras
    cameras_absolute = SfMPerspectiveCameras(
        R = R_absolute,
        T = T_absolute * camera_mask,
        device = device,
    )

    # compute the relative cameras as a composition of the absolute cameras
    cameras_relative_composed = \
        get_relative_camera(cameras_absolute, relative_edges)

    # compare the composed cameras with the ground truth relative cameras
    # camera_distance corresponds to $d$ from the description
    camera_distance = \
        calc_camera_distance(cameras_relative_composed, cameras_relative)

    # our loss function is the camera_distance
    camera_distance.backward()

    # apply the gradients
    optimizer.step()

    # plot and print status message
    if it % 200==0 or it==n_iter-1:
        status = 'iteration=%3d; camera_distance=%1.3e' % (it, camera_distance)
        plot_camera_scene(cameras_absolute, cameras_absolute_gt, status)

print('Optimization finished.')


The function so3_exp_map in PyTorch3D is used to convert rotations from their logarithmic form (specifically the axis-angle representation) to the corresponding rotation matrices.



```
cameras_absolute = SfMPerspectiveCameras(
    R = R_absolute,
    T = T_absolute * camera_mask,
    device = device,
)
```
This initializes a set of cameras with their absolute rotations (R_absolute) and translations (T_absolute). The camera_mask applied to T_absolute ensures that the translation of the first camera remains zero (as it is set as the reference camera and does not need optimization). The resulting cameras_absolute represent the current estimate of where each camera is positioned and how it is oriented in the scene, based on the optimizer's current state.




```
cameras_relative_composed = get_relative_camera(cameras_absolute, relative_edges)
```

This line calculates the relative transformations between pairs of cameras specified in relative_edges. The function get_relative_camera uses the absolute transformations (positions and orientations) of these cameras (provided in cameras_absolute) to compute how one camera is positioned relative to another.

For each pair of cameras (i, j) specified in relative_edges, this function:

- Takes the absolute transformation of camera i, calculates its inverse (essentially setting it as a new reference point).

- Applies the absolute transformation of camera j to this reference point, resulting in the transformation that describes how to move from camera i to camera j in space.
