
Pixel-Perfect Structure-from-Motion with Featuremetric Refinement (ICCV 2021, Best Student Paper Award)





Pixel-Perfect Structure-from-Motion

Best student paper award @ ICCV 2021

We introduce a framework that improves the accuracy of Structure-from-Motion (SfM) and visual localization by refining keypoints, camera poses, and 3D points using the direct alignment of deep features. It is presented in our paper, Pixel-Perfect Structure-from-Motion with Featuremetric Refinement (ICCV 2021).

Here we provide pixsfm, a Python package that can be readily used with COLMAP and our toolbox hloc. This makes it easy to refine an existing COLMAP model or reconstruct a new dataset with state-of-the-art image matching. Our framework also improves visual localization in challenging conditions.

The refinement is composed of 2 steps:

  1. Keypoint adjustment: before SfM, jointly refine all 2D keypoints that are matched together.
  2. Bundle adjustment: after SfM, refine 3D points and camera poses.

In each step, we optimize the consistency of dense deep features over multiple views by minimizing a featuremetric cost. These features are extracted beforehand from the images using a pre-trained CNN.
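As a rough illustration of what minimizing a featuremetric cost means, the toy sketch below refines a 1D keypoint so that its interpolated feature matches a reference feature from another view. Everything in it is an assumption made for illustration: the synthetic two-channel feature maps, the function names, and the brute-force search, which stands in for the Ceres-based optimization that pixsfm actually uses. The 4-pixel bound mirrors the default `bound` of the keypoint adjustment.

```python
import math


def make_feature_map(shift, length=32, omega=0.2):
    """Toy 2-channel 'dense feature map' sampled at integer pixel positions."""
    return [(math.cos(omega * (x - shift)), math.sin(omega * (x - shift)))
            for x in range(length)]


def interpolate(feature_map, x):
    """Linearly interpolate the feature map at a subpixel location x."""
    x0 = min(max(int(math.floor(x)), 0), len(feature_map) - 2)
    t = x - x0
    return tuple((1 - t) * a + t * b
                 for a, b in zip(feature_map[x0], feature_map[x0 + 1]))


def featuremetric_cost(feat_a, feat_b):
    """Squared L2 distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b))


def refine_keypoint(map_ref, x_ref, map_query, x_init, bound=4.0, step=0.05):
    """Search within +/- bound pixels of x_init for the best feature match."""
    target = interpolate(map_ref, x_ref)
    num_steps = int(round(2 * bound / step))
    candidates = [x_init - bound + i * step for i in range(num_steps + 1)]
    return min(candidates,
               key=lambda x: featuremetric_cost(interpolate(map_query, x), target))


# View B observes the same pattern as view A, shifted by 1.3 pixels.
map_a = make_feature_map(shift=0.0)
map_b = make_feature_map(shift=1.3)
# The detector found the keypoint at 10.0 in A and, noisily, at 12.0 in B.
refined = refine_keypoint(map_a, 10.0, map_b, x_init=12.0)
print(f"refined keypoint in view B: {refined:.2f}")
```

The refined location lands close to 11.3, the position in view B whose feature agrees with the reference, rather than at the noisy detection.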

With pixsfm, you can:

  • reconstruct and refine a scene using hloc, from scratch or with given camera poses
  • localize and refine new query images using hloc
  • run the keypoint or bundle adjustments on a COLMAP database or 3D model
  • evaluate the refinement with new dense or sparse features on the ETH3D dataset

Our implementation scales to large scenes by carefully managing the memory and leveraging parallelism and SIMD vectorization when possible.


pixsfm requires Python >=3.6, GCC >=6.1, and COLMAP 3.8 installed from source. The core optimization is implemented in C++ with Ceres >= 2.1 but we provide Python bindings with high granularity. The code is written for UNIX and has not been tested on Windows. The remaining dependencies are listed in requirements.txt and include PyTorch >=1.7 and pycolmap + pyceres built from source:

# install COLMAP from source first, tag 3.8
sudo apt-get install libhdf5-dev
git clone --recursive
cd pixel-perfect-sfm
pip install -r requirements.txt

To use other local features besides SIFT via COLMAP, we also require hloc:

git clone --recursive
cd Hierarchical-Localization/
pip install -e .

Finally build and install the pixsfm package:

pip install -e .  # install pixsfm in develop mode

We highly recommend using pixsfm with a working GPU for the dense feature extraction. All other steps run on the CPU only. Having issues with compilation errors or runtime crashes? Want to use the codebase as a C++ library? Check our FAQ.


The Jupyter notebook demo.ipynb demonstrates a minimal usage example. It shows how to run Structure-from-Motion and the refinement, how to align and compare different 3D models, and how to localize and refine additional query images.

Visualizing mapping and localization results in the demo.


End-to-end SfM with hloc

Given keypoints and matches computed with hloc and stored in HDF5 files, we can run Pixel-Perfect SfM from a Python script:

from pixsfm.refine_hloc import PixSfM
refiner = PixSfM()
model, debug_outputs = refiner.reconstruction(...)
# model is a pycolmap.Reconstruction 3D model

or from the command line:

python -m pixsfm.refine_hloc reconstructor \
    --sfm_dir path_to_working_directory \
    --image_dir path_to_image_dir \
    --pairs_path path_to_list_of_image_pairs \
    --features_path path_to_keypoints.h5 \
    --matches_path path_to_matches.h5

Note that:

  • The final refined 3D model is written to path_to_working_directory in either case.
  • Dense features are automatically extracted (on GPU when available) using a pre-trained CNN, S2DNet by default.
  • The result debug_outputs contains the dense features and optimization statistics.


We have fine-grained control over all hyperparameters via OmegaConf configurations, which have sensible default values defined in PixSfM.default_conf. See Detailed configuration for a description of the main configuration entries and their defaults.


For example, dense features are stored in memory by default. If we reconstruct a large scene or have limited RAM, we should instead write them to a cache file that is loaded on-demand. With the Python API, we can pass a configuration update:

refiner = PixSfM(conf={"dense_features": {"use_cache": True}})

or equivalently with the command line using a dotlist:

python -m pixsfm.refine_hloc reconstructor [...] dense_features.use_cache=true
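To make the equivalence between the dictionary update and the dotlist concrete, here is a minimal pure-Python sketch of how a dotlist entry can be expanded into a nested dictionary and merged into defaults. The helper names are hypothetical and the defaults are a two-entry excerpt; in practice OmegaConf handles this itself through `OmegaConf.from_dotlist` and `OmegaConf.merge`.

```python
import copy


def parse_value(text):
    """Convert a dotlist value string to bool, int, or float when possible."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def dotlist_to_dict(dotlist):
    """Turn ['a.b=1', 'a.c=true'] into {'a': {'b': 1, 'c': True}}."""
    result = {}
    for item in dotlist:
        key, value = item.split("=", 1)
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_value(value)
    return result


def merge(defaults, update):
    """Recursively merge `update` into a copy of `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Excerpt of the defaults, for illustration only.
defaults = {"dense_features": {"use_cache": False, "patch_size": 8}}
from_python = merge(defaults, {"dense_features": {"use_cache": True}})
from_cli = merge(defaults, dotlist_to_dict(["dense_features.use_cache=true"]))
print(from_python == from_cli)
```

Both routes produce the same configuration, and entries that are not overridden keep their default values.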

We also provide ready-to-use configuration templates in pixsfm/configs/ covering the main use cases. For example, pixsfm/configs/low_memory.yaml reduces the memory consumption to scale to large scenes and can be used as follows:

refiner = PixSfM(conf="low_memory")
# or
python -m pixsfm.refine_hloc reconstructor [...] --config low_memory

Triangulation from known camera poses


If camera poses are available, we can simply triangulate a 3D point cloud from an existing reference COLMAP model with:

model, _ = refiner.triangulation(..., path_to_reference_model, ...)


python -m pixsfm.refine_hloc triangulator [...] \
    --reference_sfm_model path_to_reference_model

By default, camera poses and intrinsics are optimized by the bundle adjustment. To keep them fixed, we can simply overwrite the corresponding options as:

conf = {"BA": {"optimizer": {
    "refine_focal_length": False,
    "refine_extra_params": False,  # distortion parameters
    "refine_extrinsics": False,    # camera poses
}}}
refiner = PixSfM(conf=conf)

or equivalently

python -m pixsfm.refine_hloc triangulator [...] \
  'BA.optimizer={refine_focal_length: false, refine_extra_params: false, refine_extrinsics: false}'

Keypoint adjustment

The first step of the refinement is the keypoint adjustment (KA). It refines the keypoints from tentative matches only, before SfM. Here we show how to run this step separately.


To refine keypoints stored in an hloc HDF5 feature file:

from pixsfm.refine_hloc import PixSfM
refiner = PixSfM()
keypoints, _, _ = refiner.refine_keypoints(...)

To refine keypoints stored in a COLMAP database:

from pixsfm.refine_colmap import PixSfM
refiner = PixSfM()
keypoints, _, _ = refiner.refine_keypoints_from_db(
    path_to_output_database,  # pass path_to_input_database for in-place refinement
    ...
)

In either case, there is an equivalent command line interface.

Bundle adjustment

The second step of the refinement is the bundle adjustment (BA). Here we show how to run it separately to refine an existing COLMAP 3D model.


To refine a 3D model stored on file:

from pixsfm.refine_colmap import PixSfM
refiner = PixSfM()
model, _, _ = refiner.refine_reconstruction(...)

Using the command line interface:

python -m pixsfm.refine_colmap bundle_adjuster \
    --input_path path_to_input_model \
    --output_path path_to_output_model \
    --image_dir path_to_image_dir

Visual localization

When estimating the camera pose of a single image, we can also run the keypoint and bundle adjustments before and after PnP+RANSAC. This requires reference features attached to each observation of the reference model. They can be computed in several ways.

  1. To recompute the references from scratch, pass the path to the reference images:
from pixsfm.localization import QueryLocalizer
localizer = QueryLocalizer(
    reference_model,  # pycolmap.Reconstruction 3D model
    dense_features=cache_path,  # optional: cache to file for later reuse
    # ...
)
pose_dict = localizer.localize(
    pnp_points2D,     # keypoints with valid 3D correspondence (N, 2)
    pnp_point3D_ids,  # IDs of corresponding 3D points in the reconstruction
    query_camera,     # pycolmap.Camera
    # ...
)
if pose_dict["success"]:
    # quaternion and translation of the query, from world to camera
    qvec, tvec = pose_dict["qvec"], pose_dict["tvec"]

The default localization configuration can be accessed with QueryLocalizer.default_conf.

  2. Alternatively, if dense reference features have already been computed during the pixel-perfect SfM, it is more efficient to reuse them:
refiner = PixSfM()
model, outputs = refiner.reconstruction(...)
features = outputs["feature_manager"]
# or load the features manually
features = pixsfm.extract.load_features_from_cache(...)
localizer = QueryLocalizer(
    reference_model,  # pycolmap.Reconstruction 3D model
    ...
)

We can also batch-localize multiple queries equivalently to hloc.localize_sfm:

    dense_features,  # FeatureManager or path to cache file
    reference_model,  # pycolmap.Reconstruction 3D model
    config=config,  # optional dict
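
The pose returned by `localizer.localize` is a world-to-camera quaternion and translation, so the position of the camera in the world frame is not `tvec` itself but C = -R^T t. The sketch below shows the conversion, assuming that `qvec` follows the COLMAP ordering (w, x, y, z). The helper names are hypothetical and are not part of pixsfm.

```python
def qvec_to_rotmat(qvec):
    """Rotation matrix from a unit quaternion given in (w, x, y, z) order."""
    w, x, y, z = qvec
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def camera_center(qvec, tvec):
    """Camera center in world coordinates for a world-to-camera pose: -R^T t."""
    R = qvec_to_rotmat(qvec)
    return [-sum(R[row][col] * tvec[row] for row in range(3))
            for col in range(3)]


# Identity rotation: the center is simply the negated translation.
center = camera_center([1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
print(center)
```

With a real result, this would be called as `camera_center(pose_dict["qvec"], pose_dict["tvec"])` after checking `pose_dict["success"]`.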

Example: mapping and localization

We now show how to run the featuremetric pipeline on the Aachen Day-Night v1.1 dataset. First, download the dataset by following the instructions described here. Then run python examples/, which will perform mapping and localization with SuperPoint+SuperGlue. As the scene is large, with over 7k images, we cache the dense feature patches and therefore require about 350GB of free disk space. Expect the sparse feature matching to take a few hours on a recent GPU. We also show in examples/ how to start from an existing COLMAP database.


We can evaluate the accuracy of the pixel-perfect SfM and of camera pose estimation on the ETH3D dataset. Refer to the paper for more details.

First, we download the dataset with python -m, by default to ./datasets/ETH3D/.

3D triangulation


We first need to install the ETH3D multi-view evaluation tool:

sudo apt install libpcl-dev  # linux only
git clone
cd multi-view-evaluation && mkdir build && cd build
cmake .. && make -j

We can then evaluate the accuracy of the sparse 3D point cloud triangulated with Pixel-Perfect SfM, for example on the courtyard scene with SuperPoint keypoints:

python -m pixsfm.eval.eth3d.triangulation \
    --scenes courtyard \
    --methods superpoint \
    --tag pixsfm
  • omit --scenes and --methods to run all scenes with all feature detectors.
  • the results are written to ./outputs/ETH3D/ by default
  • use --tag some_run_name to distinguish different runs
  • add --config norefine to turn off any refinement or use the dotlist KA.apply=false BA.apply=false
  • add --config photometric to run the photometric BA (no KA)

To aggregate the results and compare different runs, for example with and without refinement, we run:

python -m pixsfm.eval.eth3d.plot_triangulation \
    --scenes courtyard \
    --methods superpoint \
    --tags pixsfm raw

Running on all scenes and all detectors should yield the following results (±1%):

----scene---- -keypoints- -tag-- -accuracy @ X cm- completeness @ X cm
                                  1.0   2.0   5.0   1.0   2.0   5.0 
indoor        sift        raw    75.95 85.50 92.88  0.21  0.88  3.65
                          pixsfm 83.16 89.94 94.94  0.25  0.96  3.77
              superpoint  raw    78.96 87.77 94.55  0.64  2.36  9.39
                          pixsfm 89.93 94.09 97.04  0.76  2.62  9.85
              r2d2        raw    67.91 80.25 90.45  0.55  2.12  8.85
                          pixsfm 81.09 87.78 93.41  0.67  2.32  9.04
outdoor       sift        raw    57.70 72.90 86.41  0.06  0.34  2.46
                          pixsfm 68.10 80.57 91.59  0.08  0.42  2.75
              superpoint  raw    53.63 68.93 83.27  0.11  0.64  4.43
                          pixsfm 71.83 82.65 92.06  0.18  0.89  5.40
              r2d2        raw    49.33 66.21 83.37  0.11  0.55  3.62
                          pixsfm 67.94 81.02 91.68  0.16  0.71  3.99

The results of this evaluation can be different from the numbers reported in the paper. The trends are however similar and the conclusions of the paper still hold. This difference is due to improvements of the pixsfm code and to changes in the SuperPoint implementation: we initially used the setup of PatchFlow and later switched to hloc, which is strictly better and easier to install.
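
To help read the table, accuracy @ X cm is the percentage of triangulated points that lie within X cm of the ground truth. The sketch below is a simplified illustration that assumes hypothetical per-point errors already expressed in centimeters. The ETH3D evaluation tool derives these quantities from the ground-truth scans in its own way, so this is not a replacement for it.

```python
def accuracy_at_thresholds(errors_cm, thresholds_cm=(1.0, 2.0, 5.0)):
    """Percentage of points whose error is at most each threshold."""
    if not errors_cm:
        raise ValueError("at least one point is required")
    return {
        threshold: 100.0 * sum(e <= threshold for e in errors_cm) / len(errors_cm)
        for threshold in thresholds_cm
    }


# Hypothetical distances between triangulated points and the ground truth.
errors = [0.5, 1.5, 3.0, 10.0]
print(accuracy_at_thresholds(errors))
```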

Camera pose estimation


Similarly, we evaluate the accuracy of camera pose estimation given sparse 3D models triangulated from other views:

python -m pixsfm.eval.eth3d.localization --tag pixsfm

Again, we can also run on a subset of scenes or keypoint detectors. To aggregate the results and compare different runs, for example with and without KA and BA, we run:

python -m pixsfm.eval.eth3d.plot_localization --tags pixsfm raw

We should then obtain the following table and plot (±2%):

-keypoints- -tag-- -AUC @ X cm (%)--
                    0.1    1    10  
sift        raw    16.92 55.39 81.15
            pixsfm 23.08 60.47 84.01
superpoint  raw    15.38 63.41 87.24
            pixsfm 41.54 73.86 89.66
r2d2        raw     6.15 51.70 83.46
            pixsfm 23.85 62.41 86.89

SIFT (black), SuperPoint (red), R2D2 (green)

Results for the 0.1cm threshold can vary across setups and therefore differ from the numbers reported in the paper. This might be due to changes in the PyTorch and COLMAP dependencies. We are investigating this but any help is welcome!

Advanced usage

Detailed configuration

Here we explain the main configuration entries for mapping and localization along with their default values:

dense_features:  # refinement features
  model:  # the CNN that extracts the features
    name: s2dnet  # the name of one of the models defined in pixsfm/features/models/
    num_layers: 1  # the number of output layers (model-specific parameters)
  device: auto  # cpu, cuda, or auto-determined based on CUDA availability
  max_edge: 1600  # downscale the image such that the largest dimension has this value
  resize: LANCZOS  # interpolation algorithm for the image resizing
  pyr_scales: [1.0]   # concat features extracted at multiple scales
  fast_image_load: false  # approximate resizing for large images
  l2_normalize: true  # whether to normalize the features so they have unit norm
  sparse: true  # whether to store sparse patches of features instead of the full feature maps
  patch_size: 8  # the size of the feature patches if sparse
  dtype: half  # the data type of features when stored, half float or double
  use_cache: false  # whether to cache the features on file or keep them in memory
  overwrite_cache: false  # whether to overwrite the cache file if it already exists
  cache_format: chunked
interpolation:
  nodes: [[0.0, 0.0]]  # grid over which to compute the cost, by default a single point
  mode: BICUBIC  # the interpolation algorithm
  l2_normalize: true
  ncc_normalize: false  # only works if len(nodes)>1, mostly for photometric
mapping:  # pixsfm.refine_colmap.PixSfM
  dense_features: ${..dense_features}
  KA:  # keypoint adjustment
    apply: true  # whether to apply or instead skip
    strategy: featuremetric  # regular, or alternatively topological_reference (much faster)
    interpolation: ${...interpolation}  # we can use a different interpolation for KA
    level_indices: null  # we can optimize a subset of levels, by default all
    split_in_subproblems: true  # parallelize the optimization
    max_kps_per_problem: 50  # parallelization, a lower value saves memory, conservative if -1
    optimizer:  # optimization problem and solving
      loss:
        name: cauchy  # name of the loss function, among {huber, soft_l1, ...}
        params: [0.25]  # loss-specific parameters
      solver:
        function_tolerance: 0.0
        gradient_tolerance: 0.0
        parameter_tolerance: 1.0e-05
        minimizer_progress_to_stdout: false  # print a progress bar
        max_num_iterations: 100  # maximum number of optimization iterations
        max_linear_solver_iterations: 200
        max_num_consecutive_invalid_steps: 10
        max_consecutive_nonmonotonic_steps: 10
        use_inner_iterations: false
        use_nonmonotonic_steps: false
        num_threads: 1
      root_regularize_weight: -1  # prevent drift by adding edges to the root node, disabled if -1
      print_summary: false  # whether to print a detailed summary after completion
      bound: 4.0  # constraint on the distance (in pixels) w.r.t. the initial values
      num_threads: -1  # number of threads if parallelize in subproblems
  BA:  # bundle adjustment
    apply: true  # whether to apply or instead skip
    strategy: feature_reference  # regular, or alternatively {costmaps, patch_warp}
    interpolation: ${...interpolation}  # we can use a different interpolation for BA
    level_indices: null  # we can optimize a subset of levels, by default all
    max_tracks_per_problem: 10  # parallelization of references/costmaps, a lower value saves memory
    num_threads: -1
    optimizer:
      loss:  # same config as KA.optimizer.loss
      solver:  # same config as KA.optimizer.solver
      print_summary: false
      refine_focal_length: true  # whether to optimize the focal length
      refine_principal_point: false  # whether to optimize the principal points
      refine_extra_params: true  # whether to optimize distortion parameters
      refine_extrinsics: true  # whether to optimize the camera poses
    references:  # if strategy==feature_reference
      loss:  # what to minimize to compute the robust mean
        name: cauchy
        params: [0.25]
      iters: 100  # number of iterations to compute the robust mean
      num_threads: -1
    repeats: 1
localization:  # pixsfm.localization.main.QueryLocalizer
  dense_features: ${..dense_features}
  target_reference: nearest  # how to select references, in {nearest, robust_mean, all_observations}
  overwrite_features_sparse: null  # overwrite dense_features.sparse in query localization only
  references:  # how to compute references
    loss:  # what to minimize to compute the robust mean, same as BA.references.loss
    iters: 100
    keep_observations: true  # required for target_reference in {nearest, all_observations}
    num_threads: -1
  max_tracks_per_problem: 50  # parallelization of references, a lower value saves memory
  unique_inliers: min_error  # how we select unique matches for each 3D point
  QKA:  # query keypoint adjustment
    apply: true  # whether to apply or instead skip
    interpolation: ${...interpolation}
    level_indices: null
    feature_inlier_thresh: -1  # discard points with high feature error, disabled if -1
    stack_correspondences: False # Stack references for equal keypoints
    optimizer:
      loss:  # same config as KA.optimizer.loss
        name: trivial  # L2, no robust loss function
        params: []
      solver:  # same config as KA.optimizer.solver
      print_summary: false
      bound: 4.0  # constraint on the distance (in pixels) w.r.t. the initial values
    estimation:  # pycolmap.absolute_pose_estimation
        max_error: 12  # inlier threshold in pixel reprojection error
        estimate_focal_length: false  # if the focal length is unknown
    refinement:  # refinement in pycolmap.absolute_pose_estimation
        refine_focal_length: false
        refine_extra_params: false
  QBA:  # query bundle adjuster
    apply: true  # whether to apply or instead skip
    interpolation: ${...interpolation}
    level_indices: null
    optimizer:
      loss:  # same config as KA.optimizer.loss
      solver:  # same config as KA.optimizer.solver
      print_summary: false
      refine_focal_length: false
      refine_principal_point: false
      refine_extra_params: false

Note that the config supports variable interpolation through omegaconf.
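
The `loss` entries above select a robust loss, and `references.iters` sets the number of iterations used to compute a robust mean. The sketch below shows one common way to do both, assuming the Ceres definition of the Cauchy loss, rho(s) = a^2 log(1 + s / a^2) with s the squared residual, and an iteratively reweighted mean. It is not pixsfm's implementation: the function names and data are hypothetical, and whether `params: [0.25]` maps directly to the scale a is an assumption.

```python
import math


def cauchy_loss(squared_residual, scale=0.25):
    """Cauchy loss as defined in Ceres: a^2 * log(1 + s / a^2)."""
    b = scale * scale
    return b * math.log1p(squared_residual / b)


def cauchy_weight(squared_residual, scale=0.25):
    """Derivative of the Cauchy loss w.r.t. s, used as a reweighting factor."""
    return 1.0 / (1.0 + squared_residual / (scale * scale))


def robust_mean(vectors, scale=0.25, iters=100):
    """Iteratively reweighted mean that downweights outlier vectors."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    for _ in range(iters):
        weights = [
            cauchy_weight(sum((v[d] - mean[d]) ** 2 for d in range(dim)), scale)
            for v in vectors
        ]
        total = sum(weights)
        mean = [sum(w * v[d] for w, v in zip(weights, vectors)) / total
                for d in range(dim)]
    return mean


# Three consistent observations and one outlier (hypothetical 2D descriptors).
descriptors = [[0.0, 0.0], [0.1, 0.0], [-0.1, 0.0], [10.0, 10.0]]
plain = [sum(v[d] for v in descriptors) / len(descriptors) for d in range(2)]
robust = robust_mean(descriptors)
print("plain mean:", plain)
print("robust mean:", robust)
```

The plain mean is dragged toward the outlier, while the robust mean stays close to the three consistent observations.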

Large-scale refinement

When dealing with large scenes or with a large number of images, memory is often a bottleneck. The configuration low_memory shows how to decrease the memory consumption by trading off accuracy and speed.


The main improvements are:

  • dense_features
    • store as sparse patches: sparse=true
    • reduce the size of the patches: patch_size=8 (or smaller)
    • store in a cache file: use_cache=true
  • KA
    • chunk the optimization, loading only a subset of features at once: split_in_subproblems=true
    • optimize at most around 50 keypoints per chunk: max_kps_per_problem=50
  • BA
    • use the costmap approximation: strategy=costmaps (described in Section C of the paper)
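
The options listed above can be gathered into a single configuration update. The dictionary below is a sketch assembled from those entries only, so it is not necessarily identical to the contents of pixsfm/configs/low_memory.yaml. The `to_dotlist` helper is hypothetical and only shows how the same update would look as command-line overrides.

```python
# Assembled from the options listed above; not a copy of low_memory.yaml.
low_memory_like_conf = {
    "dense_features": {"sparse": True, "patch_size": 8, "use_cache": True},
    "KA": {"split_in_subproblems": True, "max_kps_per_problem": 50},
    "BA": {"strategy": "costmaps"},
}


def to_dotlist(conf, prefix=""):
    """Flatten a nested dict into 'a.b=value' strings for the command line."""
    items = []
    for key, value in conf.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            items.extend(to_dotlist(value, prefix=path + "."))
        else:
            text = str(value).lower() if isinstance(value, bool) else str(value)
            items.append(f"{path}={text}")
    return items


print(" ".join(to_dotlist(low_memory_like_conf)))
```

Following the pattern shown earlier, such a dictionary would be passed as `PixSfM(conf=low_memory_like_conf)`, and the flattened strings appended to the command line.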

When runtime is the bottleneck, KA can also be sped up by optimizing only the costs with respect to the topological center of each track, with KA.strategy=topological_reference.

Keypoints with large noise


Some keypoint detectors with low output resolution, like D2-Net, predict keypoints that are localized inaccurately. In this case, the refinement is highly beneficial but the default parameters are not optimal. It is necessary to increase the patch size and use multiple feature layers. An example configuration is given in pixsfm_eth3d_d2net to evaluate D2-Net on ETH3D.

Extending pixsfm

Still have questions about pixsfm? Is anything in the doc unclear? Are you unsure whether it fits your use case? Please let us know by opening an issue!


We welcome external contributions, especially to improve the following points:

  • make pixsfm work on Windows
  • train and integrate dense features that are more compact with fewer dimensions
  • build a conda package for pixsfm and pycolmap to not require installing COLMAP from source
  • add examples on how to build featuremetric problems with pyceres

BibTex citation

Please consider citing our work if you use any code from this repo or ideas presented in the paper:

  author    = {Philipp Lindenberger and
               Paul-Edouard Sarlin and
               Viktor Larsson and
               Marc Pollefeys},
  title     = {{Pixel-Perfect Structure-from-Motion with Featuremetric Refinement}},
  booktitle = {ICCV},
  year      = {2021},