LandscapeAR: Large Scale Outdoor Augmented Reality by Matching Photographs with Terrain Models Using Learned Descriptors
This repository contains the official implementation of the ECCV 2020 LandscapeAR paper. If you use this code in a scientific work, please cite it:
Brejcha, J., Lukáč, M., Hold-Geoffroy, Y., Wang, O., Čadík, M.: LandscapeAR: Large Scale Outdoor Augmented Reality by Matching Photographs with Terrain Models Using Learned Descriptors, In: 16th European Conference on Computer Vision (ECCV), 2020.
The code for rendering photographs aligned with a textured Digital Elevation Model is coupled with the UIST 2018 Immersive Trip Reports paper. If you also use the rendering part of the project, please cite it:
Brejcha, J., Lukáč, M., Chen, Z., DiVerdi, S., Čadík, M.: Immersive Trip Reports, In: Proceedings of the 31st ACM User Interface Software and Technology Symposium (UIST), Berlin: Association for Computing Machinery, 2018, pp. 389-401. ISBN 978-1-4503-5948-1.
- LandscapeAR Project webpage
- ImmersiveTripReports Project webpage
- Paper (pdf)
- Supplementary material (pdf)
This work was supported by project no. LTAIZ19004 Deep-Learning Approach to Topographical Image Analysis, funded by the Ministry of Education, Youth and Sports of the Czech Republic within the activity INTER-EXCELENCE (LT), subactivity INTER-ACTION (LTA), ID: SMSM2019LTAIZ.
This project consists of several parts, which are aimed at:
- Structure from Motion with Terrain Reference for fine-grained localization of a batch of query photographs to the terrain model. This method was mainly used to create the training datasets. It implements a custom matching stage with the ability to use D2Net (for keypoint extraction and description) or our approach (SIFT for keypoint extraction and our trained network for description). It allows reconstructing scenes both with and without the terrain reference. Our custom matches are imported into COLMAP, which then performs the 3D scene reconstruction. This module is easily extensible with novel keypoint extractors and descriptors.
- Training a cross-domain descriptor based on a Convolutional Neural Network (CNN). The code allows generating training datasets from the reconstructed scenes and training our descriptor for matching real photographs with rendered terrain (Digital Elevation Models, DEM) textured with satellite imagery.
- Cross-domain camera pose estimation. We leverage state-of-the-art descriptors, such as HardNet, D2Net, and our trained descriptor, to match the query photograph with the rendered DEM and obtain the camera pose with respect to the virtual globe. For this, EPnP + RANSAC or P4Pf + RANSAC are used, both combined with Bundle Adjustment to refine the camera pose.
This project depends on Python3 and COLMAP for Structure-from-Motion (SfM) reconstruction. Further dependencies will be installed automatically. It has been tested on GNU/Linux (Fedora 31) and macOS 10.15. For the following steps, we assume you are working with a UNIX shell.
Clone the repository including all the submodules. Dependencies such as HardNet and D2Net are included as submodules.
```bash
git clone --recurse-submodules https://github.com/brejchajan/LandscapeAR
```
Enter the LandscapeAR project, create a Python3 virtual environment, activate it, and install the Python dependencies. Optionally, you can skip creating the virtual environment if you find your Python3 environment compatible with our requirements (which can be found at python/requirements.txt):
```bash
cd LandscapeAR
python3 -m venv lsar_venv
source lsar_venv/bin/activate
pip3 install -r python/requirements.txt
```
Download the pretrained models - HardNet, D2Net, NCNet, and our trained cross-domain model. We packaged them together so that they can be obtained easily. The downloaded models will be unpacked by the download script:
```bash
cd scripts
./download_data.sh
```
Finally, build the p4pf Python module to be able to run the P4Pf minimal solver. For more details, see the paper and the original code. Furthermore, the script will also patch the NCNet submodule so that it is compatible with PyTorch 1.3. The applied patch can be inspected in the repository:
```bash
cd ../python
./build.sh
```
In the last step, please add the following paths to your PYTHONPATH, preferably by adding the following lines to your shell init script (~/.bashrc for bash, ~/.zshrc for zsh, etc.):
```bash
export PYTHONPATH="<project path>/LandscapeAR/python/thirdparty/ncnet:$PYTHONPATH"
export PYTHONPATH="<project path>/LandscapeAR/python/thirdparty/d2net:$PYTHONPATH"
export PYTHONPATH="<project path>/LandscapeAR/python/thirdparty/affnet:$PYTHONPATH"
export PYTHONPATH="<project path>/LandscapeAR/python:$PYTHONPATH"
```
Restart your shell afterwards, or run:
```bash
unset PYTHONPATH
source ~/.bashrc   # or your custom shell init script
```
If you also want to use the Saddle keypoint detector, you may optionally install it from https://github.com/brejchajan/saddle_detector.git and add its Python module to your PYTHONPATH.
Installing the DEM renderer
In order to have the complete visual localization pipeline, we need to be able to render the DEM and visualize the registered photographs. The DEM renderer itr is a separate project, but it has been added as a submodule located at LandscapeAR/itr. The LandscapeAR project expects itr to be located on the system PATH. To install it, please follow the instructions at LandscapeAR/itr/README.md, or use the build script, which will automatically build and install itr together with all its dependencies:
```bash
cd LandscapeAR/itr
./scripts/build-fedora.sh build install Release 8
```
Feel free to change the number of build processes (here 8) based on your PC configuration. If the process goes well, itr will be installed into the install directory. Please add this install directory to your PATH.
DEM and satellite data for rendering
The easiest way to render the DEM with textures is to get a Mapbox account with a free quota (https://www.mapbox.com/) and add your API key to the example earth file LandscapeAR/itr-renderer/example.earth. For your convenience, we also prepared a single GeoTiff containing the DEM of the whole Alps with 1 arcsecond (~30 m) resolution, which we used in all our experiments. You can download it here.
This DEM originally comes from viewfinderpanoramas.org, thanks!
Let's take a look at how to download Flickr imagery from an area of interest, how to render a grid of panoramic images for reconstruction, how to reconstruct a scene using Structure from Motion with Terrain Reference, how to generate a training dataset, how to train our descriptor, and finally, how to estimate and visualize the camera pose of a single query image.
Downloading Flickr imagery from an area of interest
In order to obtain training data, we downloaded imagery from several locations across the European Alps. If you want to download more images to create an even larger dataset, you can use our tool as follows:
```bash
cd LandscapeAR/python/flickr
python3 flickrAlbumDownload.py <download_dir> <lat> <lon> <radius in km, at most 30>
```
The downloaded images will be stored in the <download_dir>/images directory. The downloaded metadata will be stored in <download_dir> and can be used later.
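For illustration, downloading photographs within a 20 km radius around example coordinates in the Matterhorn area (the coordinates and radius are only example values) could look like:

```bash
cd LandscapeAR/python/flickr
python3 flickrAlbumDownload.py <download_dir> 45.9763 7.6586 20
```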
Rendering a grid of panoramic images from DEM
In order to use Structure from Motion with Terrain Reference in the next step, we need to render a grid of synthetic panoramic images from the DEM. You can render them as follows:
```bash
itr --render-grid <center lat> <center lon> <grid width in meters> <grid height in meters> <offset x in meters> <offset y in meters> <image resolution in px> <output dir> --render-grid-mode perspective --render-grid-num-views 6 <path to .earth file>
```
For offscreen rendering on a headless machine, add the --egl <GPU num> flag. Typical parameters are a grid width and height of 10000 m; the offset x and y is the distance between consecutive grid points, for which we use 1000 m; the image resolution is given in pixels.
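Putting the typical values together, a concrete invocation could look like the sketch below; the center coordinates and the 1024 px resolution are only example values, not prescribed by the tool:

```bash
# Example grid render around an illustrative scene center (10 x 10 km grid, 1 km spacing).
itr --render-grid 45.9763 7.6586 10000 10000 1000 1000 1024 <output dir> \
    --render-grid-mode perspective --render-grid-num-views 6 <path to .earth file>
```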
Structure from Motion with Terrain Reference
We use this technique to build training datasets from the downloaded photographs and the terrain rendered in the previous step. This technique allows precise alignment (registration, pose estimation) of multiple photographs of the same scene to the rendered terrain model. The inputs are the rendered images and the photographs, for which we know the area where they were captured (the same area where we rendered the images), but not the exact camera position and orientation.
We start by creating a simple directory structure needed for the reconstruction:
```
<reconstruction root>
    photo
    render_uniform
    image_list.txt
    scene_info.txt
```
We place the downloaded photographs into the photo directory and all rendered images into the render_uniform directory. We copy the scene_info.txt file from the directory with rendered images (render_uniform) to the <reconstruction root>. Then we create a list of all images which will be used for the reconstruction:
```bash
cd <reconstruction root>
find . -name "*.png" >> image_list.txt
find . -name "*.jpg" >> image_list.txt
```
Each line of the list contains a relative path from the <reconstruction root> to a single image. You can also create more image lists to divide the photographs into individual reconstruction batches. For this, use a suffix, such as 0, in the list file name; the suffix needs to be a natural number. Splitting a huge amount of photographs is a good idea to expedite the reconstruction process. We usually use around 1000 photographs in a single batch, as in the sketch below.
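The split can be done with standard tools. The following is only a sketch: it assumes the batch lists are named image_list_<suffix>.txt and that every batch list should also contain all rendered images; check the expectations of process_colmap_crossdomain.sh for your setup.

```bash
cd <reconstruction root>
# Split the photographs (jpg) into chunks of ~1000 entries each.
grep '\.jpg$' image_list.txt | split -l 1000 -d - photo_batch_
i=0
for f in photo_batch_*; do
    # Each batch list: all rendered images (png) plus one chunk of photographs.
    { grep '\.png$' image_list.txt; cat "$f"; } > "image_list_${i}.txt"
    i=$((i + 1))
    rm "$f"
done
```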
Once the directory structure is created, we can run the reconstruction process:
```bash
cd LandscapeAR/scripts
./process_colmap_crossdomain.sh <reconstruction root> <suffix (if empty, use "")> <num threads> <gpu index> <'MATCHER OPTIONS'>
```
In order to use CUDA for the reconstruction (strongly recommended), please set the gpu index to zero or greater. If you don't want to use CUDA, use a negative number. The MATCHER OPTIONS need to be encapsulated in apostrophes, as they form a single argument of the shell script. If you want to be sure to recompute all features and matches (when running this script multiple times), you can add '--recompute-features --recompute-matches'. You can also add --d2net to force using D2Net for keypoint extraction and description. If you omit --d2net, our trained network will be used.
This script will first run the feature extraction and a custom matching stage. After this, the matches will be imported into a COLMAP database and COLMAP will be used to create the reconstruction.
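For example, a run of batch 0 on the first GPU with 8 threads, forcing D2Net and recomputing all features and matches, could look like this (the thread count and reconstruction root are placeholders):

```bash
cd LandscapeAR/scripts
./process_colmap_crossdomain.sh <reconstruction root> 0 8 0 '--recompute-features --recompute-matches --d2net'
```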
The reconstructed scenes can be easily exported for future rendering:
```bash
./exportCrossdomainScenes.sh <reconstruction root>
```
This will create an export directory with all the reconstructed scenes. Each reconstructed scene can be viewed using:
```bash
cd <reconstruction root>/export
itr --directory <suffix> <path to an earth file>
```
or it can also be rendered into a dataset using:
```bash
cd <reconstruction root>/export
itr --directory <suffix> --render-photoviews 1024 --output <output dir> <path to an earth file>
```
Generating metadata for training
Once the dataset has been reconstructed using SfM with Terrain Reference and rendered using --render-photoviews, we can generate the metadata for training. Basically, all image pairs with some reasonable overlap are found and the corresponding patches are stored. This is done by calling the following tool:
```bash
python3 python/genPatchesDataset.py <rendered dataset> real real-synth <output patches metadata>
```
<rendered dataset> is the <output dir> of the previous rendering stage.
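As a hypothetical example, following the training data hierarchy described below with a dataset stored under <all datasets root>/matterhorn (the dataset name and output location are assumptions, not fixed by the tool):

```bash
python3 python/genPatchesDataset.py <all datasets root>/matterhorn/rendered_dataset real real-synth <all datasets root>/matterhorn/patches_dataset
```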
As the dataset generation is covered, let us now describe the training procedure.
Training data organization
We store all training data in the following hierarchy:
```
<all datasets root>
    <dataset 1>
        <rendered_dataset>
        <patches_dataset>
        <train.txt>
        <val.txt>
    <dataset 2>
        <rendered_dataset>
        <patches_dataset>
        <train.txt>
    ...
    <dataset n>
        <rendered_dataset>
        <patches_dataset>
        <train.txt>
    all_training_datasets.txt
    all_validation_datasets.txt
```
We store all datasets in a single directory, <all datasets root>. As we can see, we can have as many datasets (separately reconstructed scenes) as we want. Each dataset directory contains the <rendered_dataset> (rendered by itr in the previous step) and the <patches_dataset>, which contains the patches metadata (the corresponding patches definition generated by python/genPatchesDataset.py in the previous step). Finally, the dataset directories which are used for training are listed in all_training_datasets.txt (the name can be different), and all validation datasets are listed in all_validation_datasets.txt. Put each dataset directory on a separate line. Each training (validation) dataset needs to contain the <train.txt> (<val.txt>) file, which lists all training (validation) pairs, which are the names of the files contained in <patches_dataset>. The advantage of this hierarchy is that a single scene can be divided into training and validation parts (which should be disjoint).
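For illustration, with two hypothetical datasets named matterhorn and eiger, the top-level list files could be created like this (one dataset directory per line; the same dataset may appear in both lists because its train.txt and val.txt split it into disjoint parts):

```bash
# Example only - the dataset directory names are hypothetical.
cd <all datasets root>
printf 'matterhorn\neiger\n' > all_training_datasets.txt
printf 'matterhorn\n' > all_validation_datasets.txt
```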
The training can be run easily:
```bash
python3 python/trainPatchesDescriptors.py -l <log_and_model_out_dir> --name <name_of_the_run> \
    --num_epochs 50 --num_workers 1 --batch_size 300 --positive_threshold 1 --distance_type 3D \
    --cuda --learning_rate 0.00001 --margin 0.2 --color_jitter --maxres 3000 \
    --hardnet --hardnet_filter_dist --hardnet_filter_dist_values 50 1000000 --hardnet_orig \
    --normalize_output --adaptive_mining --adaptive_mining_step 0.0 --adaptive_mining_coeff 0.23 \
    --no_symmetric --uniform_negatives --architecture MultimodalPatchNet5lShared2l \
    <all datasets root> <all datasets root>/all_training_datasets.txt \
    <all datasets root>/all_validation_datasets.txt <path to temporary cache (ideally SSD)>
```
With these parameters, we trained the CNN presented in our paper. For the additional auxiliary loss (as presented in our paper), an extra flag needs to be added. --adaptive_mining gradually increases the training difficulty by the step defined in --adaptive_mining_step. The hardness starts at the value defined with --adaptive_mining_coeff. With the current setting, we simply fixed the adaptive mining value to 0.23, since it worked best for us. However, it seems to be heavily dependent on the margin (--margin): when --margin 0.1 is used, the adaptive mining coefficient can be set to 0.35 with --adaptive_mining_coeff 0.35. Higher values cause the training to collapse into a singular point, as described in our paper.
We have now covered almost the complete pipeline; the only missing piece is the direct camera pose estimation of the query image with respect to the rendered terrain model. For this, we use:
```bash
python3 python/findPose.py --working-directory <working_directory> --matching-dir <matching_dir> \
    -no --best-buddy-refine --maxres 3000 <query_image.jpg> Ours-aux --voting-cnt 3 --cuda \
    -l <log_and_model_out_dir> [--gps latitude longitude] --earth-file <path_to_your_earth_file.earth>
```
This tool will run the complete pose estimation algorithm as described in our paper. The <working_directory> will be used to store the rendered imagery; <matching_dir> will be a subdirectory of the <working_directory> used to store the results of matching. If the corresponding rendered panorama does not exist yet, the itr tool will be called to render it (so it must be installed). Furthermore, if the photograph does not contain GPS in its EXIF, the user needs to specify the position by adding --gps <lat> <lon> on the command line.
Since we also wanted to be able to experiment with new networks and architectures, we developed a simpler tool which matches a single photo with a single rendered image. If you want to try it out, you can do so as follows:
```bash
python3 python/analyzeMatches.py <query photograph> <rendered image.png> <network name (e.g., Ours)> matches \
    -no -c --maxres 3000 -l python/pretrained/landscapeAR --sift-keypoints
```
This runs pose estimation by matching the photo to the render and then uses the rendered depth map and camera pose to estimate the pose of the photograph. We assume that the rendered image has been rendered by the itr tool, as described in the previous sections. The tool allows using different keypoint detectors -- you can experiment with them by adding different flags, such as --dense-uniform-keypoints (to extract keypoints on a regular uniform grid), etc. For more, see the help of both tools. By default (if not specified on the command line), analyzeMatches.py uses dense uniform keypoint detection, whereas findPose.py uses SIFT keypoints.
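For instance, to try dense uniform keypoints instead of SIFT on the same pair (everything except the keypoint flag stays as in the command above, with Ours as the example network name):

```bash
python3 python/analyzeMatches.py <query photograph> <rendered image.png> Ours matches \
    -no -c --maxres 3000 -l python/pretrained/landscapeAR --dense-uniform-keypoints
```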
We cannot provide the dataset images in the exact same form as we used for training and testing due to licensing issues. However, we provide at least the images which are under a public license (Creative Commons), complete structure-from-motion reconstructions, and also instructions for downloading and rendering the remaining data, so that you can re-download and re-render the data on your own.
Satellite Imagery for Rendering
For the Alps, Yosemite, and Nepal, we used textures from the Mapbox service. To use them, you just need to create your account and then update the example.earth file inside the itr repository with your API key. For Huascaran,
we downloaded satellite imagery from ESA. You don't need to render the Huascaran
dataset, since for this one we already provide the rendered imagery.
For the Alps, we used a DEM with a resolution of 1 arc-second, freely available on viewfinderpanoramas.org. You can download all the tiles you need from there, or you can download the tiles merged into one file. Similar data are also available for Nepal, but with a lower resolution of 3 arc-seconds. Again, you can download the merged DEMs: G45 and H45. For Yosemite, we used 1 arc-second data from USGS, and you can download the merged DEM tiles here: Sierra. Please update your example.earth file with the downloaded DEMs using the GDAL driver. More details on how to do this can be found in the osgEarth manual.
First, download and unpack the GeoPose3K dataset. The GeoPose3K dataset contains images, camera positions, field-of-view, and camera rotation; for details, see the README file inside the dataset. For each image in the dataset, first generate the rotation matrix xml file as follows:
```bash
cd LandscapeAR
GP3K_path=path/to/geoPose3K
for i in $(ls $GP3K_path); do
    python3 python/gp3K_atm.py $(cat $GP3K_path/$i/info.txt | head -n 2 | tail -n 1) > $GP3K_path/$i/rotationC2G.xml
    echo $i
done
```
Now you can render the dataset; for each image, call the rendering as follows:
```bash
itr --single-photoparam <lat> <lon> <FOV [degrees]> rotationC2G.xml photo.jpg --render-photoviews 1024 --output <output_dir> --scene-center 46.86077 9.89454 <path/to/example.earth>
```
If you want to generate training images from GeoPose3K, please also add the --render-random flag and, for each image, generate at least 10 random renders looking at the same view from different viewpoints. You can do something like:
```bash
itr --single-photoparam <lat> <lon> <FOV [degrees]> rotationC2G.xml photo.jpg --render-photoviews 1024 --output <output_dir> --scene-center 46.86077 9.89454 --render-random 20 200 10 <path/to/example.earth>
```
The train/val/test splits are the same as for the previous publication (Semantic Orientation) and can be downloaded here: www.fit.vutbr.cz/~ibrejcha/landscapear/splits/geoPose3K.tar.gz.
SfM based datasets
All remaining datasets are SfM-based, which means that a Structure-from-Motion technique has been used to create a 3D model of the scene. The Matterhorn, Yosemite, and Nepal datasets were created by standard SfM, and the reconstructed scene was then aligned with the terrain model using GPS and Iterative Closest Point. All remaining datasets were created using the SfM with terrain reference, as described in our ECCV 2020 paper (see the introduction). Each dataset contains publicly available photographs under a Creative Commons license, the 3D reconstructed scene in COLMAP and NVM formats, and a list with links to download all the photographs from the original source (please see the license on Flickr).
If you want to train your own network, it is a good idea to download all available photographs (see the list file with download links included in each dataset).
If you want to render synthetic images corresponding to the photographs, run the following for each subdirectory:
```bash
cd <dataset_dir>/export
itr --directory <subdirectory> --render-photoviews 1024 --output rendered_dataset --scene-center <lat> <lon> <path/to/example.earth>
```
The scene center is the latitude and longitude included in the name of the dataset.
The Alps100K dataset has been collected from the publicly available source flickr.com for non-commercial research and academic purposes.
The rights for each file are listed in the
photo/licenses.txt file, along with the link to the original source.
- Alps Eiger, 9.4 GB
- Alps Grande Casse, 4.5 GB
- Alps Gran Paradiso, 1.3 GB
- Alps Chamonix, 5.1 GB
- Alps Matterhorn, 1.0 GB
- Alps Ortler, 4.6 GB
- Alps Wildspitze, 3.0 GB