Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time


LandscapeAR: Large Scale Outdoor Augmented Reality by Matching Photographs with Terrain Models Using Learned Descriptors

This repository contains official implementation of the ECCV 2020 LandscapeAR paper. If you use this code in a scientific work, please, cite it:

Brejcha, J., Lukáč, M., Hold-Geoffroy, Y., Wang, O., Čadík, M.:
LandscapeAR: Large Scale Outdoor Augmented Reality by Matching Photographs with
Terrain Models Using Learned Descriptors,
In: 16th European Conference on Computer Vision (ECCV), 2020.

The code for rendering photographs aligned with a textured Digital Elevation Model is coupled with the UIST 2018 Immersive Trip Reports paper. If you also use the rendering part of the project, please, cite it:

BREJCHA Jan, LUKÁČ Michal, CHEN Zhili, DIVERDI Stephen a ČADÍK Martin.
Immersive Trip Reports. In: Proceedings of the 31st ACM User Interface Software
and Technology Symposium. Berlín: Association for Computing Machinery,
2018, s. 389-401. ISBN 978-1-4503-5948-1.

Useful links


This work was supported by project no. LTAIZ19004 Deep-Learning Approach to Topographical Image Analysis; by the Ministry of Education, Youth and Sports of the Czech Republic within the activity INTER-EXCELENCE (LT), subactivity INTER-ACTION (LTA), ID: SMSM2019LTAIZ.


This project consists of several parts, which are aimed at:

  • Structure from Motion with Terrain Reference for fine grained localization of a batch of query photographs to the terrain model. This method was mainly used to create the training datasets. It implements custom matching stage, with the ability to use D2Net (for keypoint extraction and description), or our approach (SIFT for keypoint extraction and our trained network for description). It allows to reconstruct scenes with and also without the terrain reference. Our custom matches are imported to COLMAP, which then performs the 3D scene reconstruction. This module is easily extensible by novel keypoint extractors and descriptors.
  • Training cross-domain descriptor based on Convolutional Neural Network (CNN). The code allows to generate training datasets from the reconstructed scenes and train our descriptor for matching real photographs with rendered terrain (Digital Elevation Models - DEM) textured with satellite imagery.
  • Cross domain camera pose estimation. We leverage state-of-the-art descriptors, such as HardNet, D2Net, and our trained descriptor in order to match query photograph with the rendered DEM to obtain camera pose with respect to the virtual globe. For this, EPnP + RANSAC or P4Pf + RANSAC are used, both combined with Bundle Adjustment to refine the camera pose.


This project depends on Python3 and COLMAP for Structure-from-Motion (SfM) reconstruction. Further dependencies will be installed automatically. It has been tested on GNU/Linux (Fedora 31), and MacOSX 10.15. For following steps, we assume you are working with a UNIX shell.

Clone the repository including all the submodules. The dependencies, like HardNet or D2Net are included as submodules.

git clone --recurse-submodules

Enter LandscapeAR project, create Python3 virtual environment, activate it, and install the python dependencies. Optionally, you can skip creating the virtual environment, if you find your Python3 environment compatible with our requirements (can be found at LandscapeAR/python/requirements.txt).

cd landscapeAR
python3 -m venv lsar_venv
source lsar_venv/bin/activate
pip3 install -r python/requirements.txt

Download the pretrained models - HardNet, D2Net, NCNet, and our trained cross-domain model. We packaged them together so that they can be obtained easily. The downloaded models will be unpacked into the LandscapeAR/python/pretrained directory.

cd scripts

Finally, build the p4pf python module to be able to run P4Pf minimal solver. For more details, see the paper and the original code. Furthermore, the script will also patch the NCNet submodule so that it is compatible with PyTorch 1.3. The patch can be inspected at LandscapeAR/python/patches/ncnet_conv4d_padding_mode.patch.

cd ../python

In the last step, please add the following paths on your PYTHONPATH, preferably by adding following lines to your shell init script (~/.bashrc for bash, ~/.zshrc for zsh, etc.):

export PYTHONPATH="<project path>/LandscapeAR/python/thirdparty/ncnet:$PYTHONPATH"
export PYTHONPATH="<project path>/LandscapeAR/python/thirdparty/d2net:$PYTHONPATH"
export PYTHONPATH="<project path>/LandscapeAR/python/thirdparty/affnet:$PYTHONPATH"
export PYTHONPATH="<project path>/LandscapeAR/python:$PYTHONPATH"

Restart your shell afterwards, or run:

source ~/.bashrc (or your custom shell init script)

Optional dependencies

If you want to use also the Saddle keypoint detector for keypoint detection, you may optionally install it from and add the python module on your PYTHONPATH.

Installing the DEM renderer

In order to have complete visual localization pipeline, we need to be able to render the DEM and visualize the registered photographs. The DEM renderer itr is a separate project but it has been added as a submodule located at LandscapeAR/itr. LandscapeAR project expects the itr to be located at the system PATH. To install it, please follow the instructions at LandscapeAR/itr/, or use a build script, which will automatically build and install the itr together with all dependencies:

cd LandscapeAR/itr
./scripts/ build install Release 8

Feel free to change the number of processes (here 8) based on your PC configuration. If the process goes well, the itr will be installed into the install directory. Please, add this install directory to your PATH.

DEM and satellite data for rendering

The easiest way to render DEM model with textures is to get a Mapbox account with a free quota, and update your API key to the example earth file: LandscapeAR/itr-renderer/ For your convenience, we also prepared a single GeoTiff containing the DEM of the whole alps with 1 arcsecond (~30m) resolution, which we used in all our experiments. You can download it here. This DEM originally comes from, thanks!


Let's take a look how to Download Flickr imagery from area of interest, how to render grid of panorama images for reconstruction*, how to reconstruct a scene using Structure from Motion with Terrain Reference, how to generate a training dataset, how to train our descriptor, and finally, how to estimate and visualize camera pose of a single query image.

Downloading Flickr imagery from area of interest

In order to obtain training data, we downloaded imagery from several locations across the European Alps. If you want to download more images to create even larger dataset, you can use our tool as follows:

cd LandscapeAR/python/flickr
python3 <download_dir> <lat> <lon> <radius in km, at most 30>

The downloaded images will be stored into <download_dir>/images directory. The downloaded metadata will be stored into the <download_dir> and can be later queried with

Rendering a grid of panoramic images from DEM

In order to be able to use the Structure from Motion with Terrain Reference in the next step, we need to render a grid of synthetic panoramic images from the DEM, you can render them as follows:

itr --render-grid <center lat> <center lon> <grid width in meters> <grid height in meters> <offset x in meters> <offset y in meters> <image resolution in px> <output dir> --render-grid-mode perspective --render-grid-num-views 6 <path to .earth file>

For offscreen rendering on a headless machine, add --egl <GPU num> flag. Usual parameters are grid width and height = 10000 m, offset x and y is the distance between consecutive grid points, we use 1000 m, and image resolution 1024 px.

Structure from Motion with Terrain Reference

We use this technique to build training datasets from the downloaded photographs and terrain rendered in the previous step. This technique allows precise alignment (registration, pose estimation) of multiple photographs of of the same scene to the rendered terrain model. The input are the rendered images and photographs, for which we know the area where they were captured (the same area as where we rendered the images), but we don't know exact camera position and orientation of the photographs.

We start by creating a simple directory structure needed for the reconstruction:

<reconstruction root>

We place the downloaded photographs into the directory photo, and all rendered images to the directory render_uniform. We copy the file scene_info.txt from the directory with rendered images render_uniform to the <reconstruction root>. Then we create a list of all images, which will be used for the reconstruction:

cd <reconstruction root>
find . -name "*.png" >> image_list.txt
find . -name "*.jpg" >> image_list.txt

Each line of the list contains relative path from the to a single image. You can also create more image lists to divide the photgraphs into individual reconstruction batches. For this, use a suffix, such as image_list_0.txt, where 0 will be the suffix. The suffix needs to be a natural number. Splitting huge amount of photographs is a good idea no expedite the reconstruction process. We usually use batches around 1000 photographs in a single batch.

Once the directory structure is created, we can run the reconstruction process:

cd LandscapeAR/scripts
./ <reconstruction root> <suffix (if empty, use "")> <num threads> <gpu index> <'MATCHER OPTIONS'>

In order to use CUDA for reconstruction (strongly recommended), please, set the gpu index for zero or greater. If you don't want to use CUDA, use negative number. The MATCHER OPTIONS need to be encapsulated with apostrophes, as this is a single argument for the shell script. If you want to be sure to recompute all features and matches (if you run this script multiple times), you can add '--recompute-features --recompute-matches'. Also, you can add --d2net to force using D2Net for keypoint extraction and description. If you ommit --d2net, our trained network will be used.

This script will first run the feature extraction and a custom matching stage. After this, the matches will be imported to a COLMAP database and COLMAP will be used to create the reconstruction.

The reconstructed scenes can be easily exported for future rendering using itr.

./ <reconstruction root>

This will create a directory export inside <reconstruction root> with all the reconstructed scenes. Each reconstructed scene can be visualized with itr using:

cd <reconstruction root>/export
itr --directory <suffix> <path to an earth file>

or, they can be also rendered into a dataset using:

cd <reconstruction root>/export
itr --directory <suffix> --render-photoviews 1024 --output <output dir> <path to an earth file>

Generating metadata for training

Once the dataset has been reconstructed using SfM with Terrain Reference and rendered using --render-photoviews, we may now generate the metadata for training. Basically, all image pairs with some reasonable overlap are found and the corresponding patches are stored. It can be done by calling following tool:

python3 python/ <rendered dataset> real real-synth <output patches metadata>

Here <rendered dataset> is the <output dir> of the previous, rendering stage.


As the dataset generation is covered, let us now describe the training procedure.

Training data organization

We store all training data in following hierarchy:

<all datasets root>
    <dataset 1>
    <dataset 2>
    <dataset n>

We store all datasets in the single directory <all datasets root>. As we see, we can have as many datasets (separately reconstructed scenes) as we want. Each dataset directory contains the <rendered_dataset> (rendered by the itr tool in previous step), and the <patches_dataset> contains the patches metadata (corresponding patches definition generated by the python/ in the previous step). Finally, the dataset directories, which are used for training are listed in all_training_datasets.txt (name can be different), and all validation datasets are listed in all_validation_datasets.txt. Put each dataset directory on a separate line. Each training (validation) dataset needs to contain the <train.txt> (<val.txt>) file, which contains all training (validation) pairs, which are the names of the files contained in <patches_dataset>. The advantage of this hierarchy is that a singe scene can be divided to training and validation parts (which should be disjoint).

Training procedure

The training can be run easily:

python3 python/ -l <log_and_model_out_dir> --name <name_of_the_run> --num_epochs 50 --num_workers 1 --batch_size 300 --positive_threshold 1 --distance_type 3D --cuda --learning_rate 0.00001 --margin 0.2 --color_jitter --maxres 3000 --hardnet --hardnet_filter_dist --hardnet_filter_dist_values 50 1000000 --hardnet_orig --normalize_output --adaptive_mining --adaptive_mining_step 0.0 --adaptive_mining_coeff 0.23 --no_symmetric --uniform_negatives --architecture MultimodalPatchNet5lShared2l <all datasets root> <all datasets root>/all_training_datasets.txt <all datasets root>/all_validation_datasets.txt <path to temporary cache (ideally SSD)>

With these parameters we trained the CNN presented in our paper. For additional auxiliary loss (as presented in our paper), please add --los_aux. The --adaptive_mining allows to gradually increase the training difficulty by step defined in --adaptive_mining_step. The hardness starts at value defined with --adaptive_mining_coeff. With the current setting, we simply fixed the adaptive mining value to 0.23, since it worked the best for. However, it seems that it is heavily dependent on margin (--margin), and when --margin 0.1 is used, then the adaptive mining coeff could be set to 0.35 by --adaptive_mining_coeff 0.35. Higher values cause that the training collapses into a singular point, as described in our paper.

Pose estimation

We have now covered almost the complete pipeline, the only missing piece is the direct camera pose estimation of the query image with respect to the rendered terrain model. For this, we use:

python3 python/ --working-directory <working_directory> --matching-dir <matching_dir> -no --best-buddy-refine --maxres 3000 <query_image.jpg> Ours-aux --voting-cnt 3 --cuda -l <log_and_model_out_dir> [--gps latitude longitude] --earth-file <>

This tool will run the complete pose estimation algorithm as described in our paper. The <working_directory> will be used to store the rendered imagery, and the <matching_dir> will be a subdirectory of the <working_dir> to store the results of matching. If the corresponding rendered panorama does not exist on the <working_directory>, the itr tool will be called to render it (so it must be installed). Furthermore, if the photograph does not contain GPS in its EXIF, user needs to specify the position by adding --gps <lat> <lon> on the command line.

Since we wanted to be able also to experiment with new networks and architectures, we developed a simpler tool, which matches a single photo with a single rendered image. If you want to try it out, you can do as follows:

python3 python/ <query photograph> <rendered image.png> <network name (e.g, Ours)> matches -no -c --maxres 3000 -l python/pretrained/landscapeAR --sift-keypoints

This runs pose estimation by matching the photo to the render and then uses the rendered depth map and camera pose to estimate the pose of the photograph. We assume, that the rendered image has been rendered by the itr tool, as described in previous sections. The and allows to use different keypoint detectors -- you can experiment with them by adding different flags, such as --dense-uniform-keypoints (to extract) keypoints on a regular uniform grid, etc. For more, see the help of both tools. By default (if not specified on the command line), uses dense uniform keypoint detection, whereas uses SIFT keypoints.


We cannot provide dataset images in exact same form as we used for training and testing due to licensing issues. However, we provide at least images which have public copyright (Creative Commons), complete structure-from-motion reconstructions, and also instructions for downloading and rendering the remaining data, so that you can re-download, and re-render the data on your own.

Satellite Imagery for Rendering

For the Alps, Yosemite, and Nepal, we used textures from the Mapbox service. To use them, you just need to create your account and then update the file with your API key inside the itr repository. For Huascaran, we downloaded satellite imagery from ESA. You don't need to render the Huascaran dataset, since for this one we already provide the rendered imagery.

For Alps, we used DEM with resolution of 1 arc-second, freely available on You can download all the tiles you need from there, or you can download the merged tiles into one. Similar data are available also for the Nepal, but with lower resolution, 3 arc-seconds. Again, you can download the merged DEM: G45, and H45. For Yosemites, we used 1 arc-second data from USGS, and you can download merged DEM tiles here: Sierra. Please, update your file with the downloaded DEMs using the GDAL driver. More details, how to do it can be found in OSGEarth manual.


First, download and unpack the GeoPose3K dataset. The GeoPose3K dataset contains images, camera positions, field-of-view, and camera rotation, for details see the README file inside the dataset. For each image in the dataset, first generate the rotation matrix xml file as follows:

cd LandscapeAR
for i in $(ls $GP3K_path); do
    python3 python/ $(cat $GP3K_path/$i/info.txt | head -n 2 | tail -n 1) > $GP3K_path/$i/rotationC2G.xml
    echo $i

Now you can render the dataset, for each image you call the rendering as follows:

itr --single-photoparam <lat> <lon> <FOV [degrees]> rotationC2G.xml photo.jpg --render-photoviews 1024 --output <output_dir> --scene-center 46.86077 9.89454 <path/to/>

If you want to generate training images from GeoPose3K, please, also add --render-random flag and for each image generate at least 10 random renders looking at the same view from different viewpoints. You can do something like:

itr --single-photoparam <lat> <lon> <FOV [degrees]> rotationC2G.xml photo.jpg --render-photoviews 1024 --output <output_dir> --scene-center 46.86077 9.89454 --render-random 20 200 10 <path/to/>

The train/val/test splits are the same as for previous publication (Semantic Orientation), and can be downloaded here:

SfM based datasets

All remaining datasets are SfM-based, which means, that a Structure-from-Motion technique has been used to create a 3D model of the scene. Matterhorn, Yosemite, and Nepal datasets were screated by standard SfM and the reconstructed scene was then aligned with the terrain model using GPS and Iterative Closest Points. All remaining datasets were created using the SfM with terrain reference, as described in our ECCV 2020 paper (see the introduction). Each dataset contains publicly available photographs under Creative Commons license, 3D reconstructed scene in COLMAP and NVM formats, and list with links to download all the photographs from the original source (please see the license on Flickr). If you want to train your network, it is a good idea to download all photographs available (see the file photo_urls.txt contained in each dataset).

If you want to render synthetic images corresponding to the photographs, run following for each subdirectory:

cd <dataset_dir>/export
itr --directory <subdirectory> --render-photoviews 1024 --output rendered_dataset --scene-center <lat> <lon> <path/to/>

Scene center is the latitude and longitude included in the name of the dataset.


The dataset Alps100K has been collected from publicly available source for non-commercial research and academic purposes. The rights for each file are listed in the photo/licenses.txt file, along with the link to the original source.

Train datasets

Test datasets


Official repository for the LandscapeAR ECCV 2020 code.







No releases published


No packages published