This is the PyTorch implementation of our paper:
Multimodal Attention Networks for Low-Level Vision-and-Language Navigation (paper)
Federico Landi, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, Rita Cucchiara
Computer Vision and Image Understanding (CVIU), 2021
- Installation
- Training and Testing
- Visualizing Navigation Episodes
- Reproducibility Note
- Citing
- License
- Acknowledgments
Our repository is based on the Matterport3D simulator, which was originally proposed with the Room-to-Room dataset.
As a first step, clone the repository and create the environment with conda:
git clone --recursive https://github.com/aimagelab/perceive-transform-and-act
cd perceive-transform-and-act
conda env create -f environment.yml
source activate pta
If you didn't clone with the --recursive flag, then you'll need to manually clone the submodules from the top-level directory:
git submodule update --init --recursive
Please follow the instructions on the Matterport3DSimulator repository to install the simulator via Docker.
Matterport3D Simulator has several dependencies:
- Ubuntu >= 14.04
- Nvidia-driver with CUDA installed
- C++ compiler with C++11 support
- CMake >= 3.10
- OpenCV >= 2.4 including 3.x
- OpenGL
- GLM
- Numpy
Optional dependencies (depending on the cmake rendering options):
If all of the dependencies are installed, you can build the simulator from source by typing:
mkdir build
cd build
cmake -DOSMESA_RENDERING=ON -DPYTHON_EXECUTABLE:FILEPATH=path/to/your/python/bin ..
make
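Before building, it may help to confirm that a few of the dependencies listed earlier are visible from the active environment. The following is a minimal preflight sketch and is not part of the repository: the helper names `parse_cmake_version` and `preflight` are hypothetical, and it assumes that `cmake --version` prints a line of the form `cmake version X.Y.Z`. It only covers CMake, a C++ compiler on the PATH, and NumPy; the remaining libraries still need to be checked by hand. It also prints `sys.executable`, which is one candidate value for `PYTHON_EXECUTABLE` if the simulator should be built against the currently active interpreter.

```python
import importlib.util
import re
import shutil
import subprocess
import sys


def parse_cmake_version(text):
    """Extract (major, minor) from `cmake --version` output, or None."""
    match = re.search(r"cmake version (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def preflight():
    """Return a dict mapping a check name to True/False."""
    results = {}

    # CMake >= 3.10 is required by the dependency list.
    cmake = shutil.which("cmake")
    version = None
    if cmake:
        out = subprocess.run([cmake, "--version"], stdout=subprocess.PIPE,
                             universal_newlines=True).stdout
        version = parse_cmake_version(out)
    results["cmake >= 3.10"] = version is not None and version >= (3, 10)

    # Any common C++ compiler driver on PATH counts; C++11 support is not tested here.
    results["C++ compiler on PATH"] = any(
        shutil.which(name) for name in ("c++", "g++", "clang++"))

    # NumPy must be importable from the interpreter used for the build.
    results["numpy importable"] = importlib.util.find_spec("numpy") is not None
    return results


if __name__ == "__main__":
    for name, ok in preflight().items():
        print("{:<24} {}".format(name, "ok" if ok else "MISSING"))
    print("Python executable:", sys.executable)
```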
Download the precomputed ResNet-152 (ImageNet) features, and place the corresponding .tsv file into the img_features folder.
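To check that the downloaded file is readable, the sketch below peeks at its first rows. It rests on assumptions that should be verified against the actual file: that the .tsv has no header row, that its columns are `scanId`, `viewpointId`, `image_w`, `image_h`, `vfov`, `features` (the layout used by the feature precomputation scripts in the Matterport3DSimulator repository), and that `features` is a base64-encoded buffer of float32 values. The function name `peek_features` is hypothetical.

```python
import base64
import csv
from array import array

# Assumed column layout; check it against the file you downloaded.
TSV_FIELDNAMES = ["scanId", "viewpointId", "image_w", "image_h", "vfov", "features"]


def peek_features(tsv_path, max_rows=1):
    """Return (scanId, viewpointId, number of float32 values) for the first rows."""
    # The encoded feature column is far larger than the csv module's default limit.
    csv.field_size_limit(2 ** 31 - 1)
    rows = []
    with open(tsv_path, "r", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t", fieldnames=TSV_FIELDNAMES)
        for i, item in enumerate(reader):
            if i >= max_rows:
                break
            feats = array("f")  # 4-byte floats, native byte order
            feats.frombytes(base64.b64decode(item["features"]))
            rows.append((item["scanId"], item["viewpointId"], len(feats)))
    return rows
```

Calling `peek_features` on the .tsv placed in img_features should return one tuple per inspected row. If the assumed layout holds and each viewpoint stores 36 views of 2048-dimensional ResNet features, the last element would be 36 * 2048 = 73728; a different value suggests the assumptions above do not match the file.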
To train PTA from scratch, move to the root directory and run:
python tasks/R2R/main.py --name train_from_scratch \
--plateau_sched \
--lr 1e-4 \
--max_episode_len 30
We also provide weights obtained with the training described in the paper. If you wish to reproduce the results in our paper, run:
python tasks/R2R/main.py --name test_ll \
--max_episode_len 30 \
--eval_only \
--pretrained \
--load_from low_level
Our agent can also perform high-level Vision-and-Language Navigation. To reproduce the results obtained with the high-level setup, run:
python tasks/R2R/main.py --name test_hl \
--high_level \
--max_episode_len 10 \
--eval_only \
--pretrained \
--load_from high_level
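To run both evaluations in sequence, the two commands above can be assembled programmatically. The following is a small sketch, not a script shipped with the repository: `build_command`, `EVAL_RUNS`, and `run_all` are hypothetical names, and only the flags that appear in this README are used. It must be launched from the repository root so that `tasks/R2R/main.py` resolves.

```python
import shlex
import subprocess


def build_command(name, max_episode_len, high_level=False, eval_only=False,
                  pretrained=False, load_from=None):
    """Assemble an argv list using only the flags shown in this README."""
    cmd = ["python", "tasks/R2R/main.py", "--name", name]
    if high_level:
        cmd.append("--high_level")
    cmd += ["--max_episode_len", str(max_episode_len)]
    if eval_only:
        cmd.append("--eval_only")
    if pretrained:
        cmd.append("--pretrained")
    if load_from is not None:
        cmd += ["--load_from", load_from]
    return cmd


# The two evaluation setups described above.
EVAL_RUNS = {
    "low_level": build_command("test_ll", 30, eval_only=True,
                               pretrained=True, load_from="low_level"),
    "high_level": build_command("test_hl", 10, high_level=True, eval_only=True,
                                pretrained=True, load_from="high_level"),
}


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for setup, cmd in EVAL_RUNS.items():
        print("[{}] {}".format(setup, " ".join(shlex.quote(a) for a in cmd)))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

The dry-run default prints the commands without starting an evaluation, which makes it easy to compare them against the ones listed above before running anything.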
To make our qualitative results easier to inspect, we provide .gif files for some of the navigation episodes reported in our paper, together with the metrics used to evaluate them.
Our experiments were run on an Nvidia 1080Ti GPU with CUDA 10.0 and Python 3.6.8. Different hardware setups or software versions may affect the results.
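When comparing numbers against the paper, it can be useful to record the local setup next to the results. The sketch below is an illustration rather than part of the repository; `environment_report` is a hypothetical helper that collects the Python version and, if PyTorch is installed, the PyTorch and CUDA versions, so they can be compared with the reference setup (1080Ti, CUDA 10.0, Python 3.6.8).

```python
import platform


def environment_report():
    """Collect interpreter and, if installed, PyTorch/CUDA details."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print("{}: {}".format(key, value))
```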
If you find our code useful for your research, please cite our paper:
@article{landi2021multimodal,
title={Multimodal attention networks for low-level vision-and-language navigation},
author={Landi, Federico and Baraldi, Lorenzo and Cornia, Marcella and Corsini, Massimiliano and Cucchiara, Rita},
journal={Computer Vision and Image Understanding},
year={2021},
publisher={Elsevier}
}
PTA is MIT licensed. See the LICENSE file for details.
The trained models are considered data derived from the corresponding scene datasets.
This work has been supported by "Fondazione di Modena" and by the national project "IDEHA: Innovation for Data Elaboration in Heritage Areas" (PON ARS01_00421), co-funded by the Italian Ministry of University and Research.