Perceive, Transform, and Act

This is the PyTorch implementation of our paper:

Multimodal Attention Networks for Low-Level Vision-and-Language Navigation (paper)
Federico Landi, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, Rita Cucchiara
Computer Vision and Image Understanding (CVIU), 2021

Installation

Our repository is based on the Matterport3D simulator, which was originally proposed with the Room-to-Room dataset.

As a first step, clone the repository and create the environment with conda:

git clone --recursive https://github.com/aimagelab/perceive-transform-and-act
cd perceive-transform-and-act

conda env create -f environment.yml
source activate pta

If you didn't clone with the --recursive flag, then you'll need to manually clone the submodules from the top-level directory:

git submodule update --init --recursive

Building with Docker

Please follow the instructions on the Matterport3DSimulator repository to install the simulator via Docker.

Building without Docker

The Matterport3D Simulator has the following dependencies:

Ubuntu >= 14.04
Nvidia-driver with CUDA installed
C++ compiler with C++11 support
CMake >= 3.10
OpenCV >= 2.4 including 3.x
OpenGL
GLM
Numpy

Optional dependencies (depending on the cmake rendering options):

OSMesa for OSMesa backend support
epoxy for EGL backend support

If all of the dependencies are installed, you can build the simulator from source by typing:

mkdir build
cd build
cmake -DOSMESA_RENDERING=ON -DPYTHON_EXECUTABLE:FILEPATH=/path/to/your/python/bin ..
make
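
After the build finishes, it can be useful to confirm that Python is able to locate the compiled simulator module. The sketch below is only an illustration: it assumes the module is named MatterSim (as in the Matterport3D Simulator repository) and that it was built into the build directory. The helper name module_is_importable is hypothetical and not part of this repository.

```python
import importlib.util
import sys


def module_is_importable(name, extra_path=None):
    """Return True if `name` can be located, optionally searching `extra_path` first."""
    if extra_path is not None:
        # Temporarily put the extra directory at the front of the search path.
        sys.path.insert(0, extra_path)
    try:
        return importlib.util.find_spec(name) is not None
    finally:
        if extra_path is not None:
            # Restore the search path so the check has no side effects.
            sys.path.remove(extra_path)


if __name__ == "__main__":
    # Assumption: the compiled module is called MatterSim and lives in ./build.
    found = module_is_importable("MatterSim", extra_path="build")
    print("MatterSim importable:", found)
```

If the check reports False, the build directory is probably not on your PYTHONPATH, or the build targeted a different Python interpreter than the one running the check.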

Precomputed ResNet Image Features

Download the precomputed ResNet-152 (imagenet) features, and place the corresponding .tsv file into the img_features folder.
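
To sanity-check the downloaded file, you can iterate over its rows. The following is a minimal sketch, not the loader used by this repository. It assumes the .tsv file follows the layout of the Matterport3D Simulator feature files: no header row, six tab-separated columns (scanId, viewpointId, image_w, image_h, vfov, features), and a features column holding base64-encoded float32 values. The function name iter_image_features is hypothetical.

```python
import base64
import csv
from array import array

# Assumption: column layout of the Matterport3D Simulator feature files.
TSV_FIELDNAMES = ["scanId", "viewpointId", "image_w", "image_h", "vfov", "features"]


def iter_image_features(tsv_path):
    """Yield (scan_id, viewpoint_id, features) for every row of the .tsv file.

    `features` is a flat array of float32 values; reshape it as needed.
    """
    # The base64 feature column is far longer than the csv module's default limit.
    csv.field_size_limit(2**31 - 1)
    with open(tsv_path, "r", newline="") as tsv_file:
        reader = csv.DictReader(tsv_file, delimiter="\t", fieldnames=TSV_FIELDNAMES)
        for row in reader:
            features = array("f")
            features.frombytes(base64.b64decode(row["features"]))
            yield row["scanId"], row["viewpointId"], features
```

Printing the length of the first feature array is a quick way to check that the file decodes cleanly before starting a long training run.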

Training and Testing

To train PTA from scratch, move to the root directory and run:

python tasks/R2R/main.py --name train_from_scratch \
                         --plateau_sched \
                         --lr 1e-4 \
                         --max_episode_len 30

We also provide weights obtained with the training described in the paper. If you wish to reproduce the results in our paper, run:

python tasks/R2R/main.py --name test_ll \
                         --max_episode_len 30 \
                         --eval_only \
                         --pretrained \
                         --load_from low_level

Our agent can also perform high-level Vision-and-Language Navigation. To reproduce the results obtained with the high-level setup, run:

python tasks/R2R/main.py --name test_hl \
                         --high_level \
                         --max_episode_len 10 \
                         --eval_only \
                         --pretrained \
                         --load_from high_level
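
The two evaluation commands differ only in the run name, the --high_level flag, the episode length, and the checkpoint to load. If you want to launch them from a script, a small helper like the one below can assemble the argument list. This is a sketch that uses only the flags shown above; the function build_eval_command is hypothetical and not part of this repository.

```python
import sys


def build_eval_command(high_level=False, python=sys.executable):
    """Return the argv list for evaluating the pretrained low- or high-level agent."""
    command = [python, "tasks/R2R/main.py",
               "--name", "test_hl" if high_level else "test_ll"]
    if high_level:
        command.append("--high_level")
    command += [
        # Episode lengths as given in the commands above: 10 (high-level), 30 (low-level).
        "--max_episode_len", "10" if high_level else "30",
        "--eval_only",
        "--pretrained",
        "--load_from", "high_level" if high_level else "low_level",
    ]
    return command


if __name__ == "__main__":
    # Print both commands; run them from the repository root.
    print(" ".join(build_eval_command(high_level=False)))
    print(" ".join(build_eval_command(high_level=True)))
```

The returned list can be passed to subprocess.run(command, check=True) from the root directory of the repository.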

Visualizing Navigation Episodes

To make our qualitative results easier to inspect, we provide .gif files that display some of the navigation episodes reported in our paper, together with the metrics used to evaluate them.

Low-level VLN in R2R

Low-level VLN in R4R

Reproducibility Note

Our experiments were run on an Nvidia 1080Ti GPU with CUDA 10.0 and Python 3.6.8. Different hardware setups or software versions may affect the results.
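
To compare your setup with the one listed above, you can print the relevant versions. The sketch below is an illustration rather than part of the repository. The function describe_environment is hypothetical, and it reports PyTorch details only if torch is installed.

```python
import platform


def describe_environment():
    """Collect the version information mentioned in the reproducibility note."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    # CUDA version PyTorch was built with; None for CPU-only builds.
    info["cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(key, value, sep=": ")
```

Including this output when reporting a discrepancy makes it easier to tell whether the difference comes from the environment.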

Citing

If you find our code useful for your research, please cite our paper:

BibTeX:

@article{landi2021multimodal,
  title={Multimodal attention networks for low-level vision-and-language navigation},
  author={Landi, Federico and Baraldi, Lorenzo and Cornia, Marcella and Corsini, Massimiliano and Cucchiara, Rita},
  journal={Computer Vision and Image Understanding},
  year={2021},
  publisher={Elsevier}
}

License

PTA is MIT licensed. See the LICENSE file for details.

The trained models are considered data derived from the corresponding scene datasets.

Acknowledgments

This work has been supported by "Fondazione di Modena" and by the national project "IDEHA: Innovation for Data Elaboration in Heritage Areas" (PON ARS01_00421), cofunded by the Italian Ministry of University and Research.
