# VOLTA: Visiolinguistic Transformer Architectures

This is the implementation of the framework described in the paper:

> Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, and Desmond Elliott. Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs. *Transactions of the Association for Computational Linguistics*, 9:978–994, 2021.

We provide the code for reproducing our results, as well as preprocessed data and pretrained models.

## Repository Setup (IGLUE)

1. Create a fresh virtual environment and install all dependencies:

```bash
python -m venv /path/to/iglue/virtual/environment
source /path/to/iglue/virtual/environment/bin/activate
pip install -r requirements.txt
```

2. Install apex. If you work on a cluster, you may want to first load the appropriate CUDA and GCC modules, for example:

```bash
module load cuda/10.1.105
module load gcc/8.3.0-cuda
```

and then build and install apex:

```bash
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..
```

You can verify that the extensions built correctly with the sanity-check sketch after step 4.

3. Set up the refer submodule for Referring Expression Comprehension:

```bash
cd tools/refer; make
```

4. Install this codebase as a package in this environment:

```bash
python setup.py develop
```
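
After completing the steps above, a short Python check can confirm that the key pieces are importable. This is only a sketch: the `volta` package name is an assumption about what `python setup.py develop` registers, and `apex.amp` is just one common apex entry point; adapt the imports to your environment.

```python
# Sanity check for the environment set up above.
# NOTE: the `volta` package name is an assumption; `apex.amp` is one common apex import.
import torch

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    from apex import amp  # noqa: F401 -- requires the C++/CUDA extensions built in step 2
    print("apex imported successfully")
except ImportError as err:
    print("apex not importable:", err)

try:
    import volta  # assumes the editable install registered the package as `volta`
    print("volta imported from:", volta.__file__)
except ImportError as err:
    print("volta not importable:", err)
```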

## Data

Check out `data/README.md` for links to the preprocessed data and for the data preparation steps.

## Models

Check out `MODELS.md` for links to pretrained models and for how to define new ones in VOLTA.

Model configuration files are stored in `config/`.
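
For orientation, a model configuration can be inspected directly. The snippet below is a sketch that assumes the configurations are plain JSON and uses a hypothetical filename; substitute any file that actually exists in `config/`.

```python
import json
from pathlib import Path

# Hypothetical filename -- replace with any configuration file present in config/.
config_path = Path("config") / "some_model_config.json"

# Assumes model configurations are stored as JSON; adjust if your checkout differs.
with config_path.open() as f:
    config = json.load(f)

# Print the top-level fields that define the architecture.
for key, value in config.items():
    print(f"{key}: {value}")
```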

## Training and Evaluation

We provide sample scripts to train (i.e. pretrain or fine-tune) and evaluate models in `examples/`. These include ViLBERT, LXMERT and VL-BERT as detailed in the original papers, as well as ViLBERT, LXMERT, VL-BERT, VisualBERT and UNITER as specified in our controlled study.

Task configuration files are stored in `config_tasks/`.
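
Before launching anything, it can help to see which sample scripts and task configurations ship with the repository. The listing below is a minimal sketch that only assumes it is run from the repository root.

```python
from pathlib import Path

# List the sample training/evaluation scripts and the task configuration files.
# Assumes the current working directory is the repository root.
for directory in (Path("examples"), Path("config_tasks")):
    print(f"--- {directory}/ ---")
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            print(path)
```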

## License

This work is licensed under the MIT license. See LICENSE for details. Third-party software and data sets are subject to their respective licenses.

If you find our code, data, models, or ideas useful in your research, please consider citing the paper:

```bibtex
@article{bugliarello-etal-2021-multimodal,
    author = {Bugliarello, Emanuele and Cotterell, Ryan and Okazaki, Naoaki and Elliott, Desmond},
    title = "{Multimodal Pretraining Unmasked: {A} Meta-Analysis and a Unified Framework of Vision-and-Language {BERT}s}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {9},
    pages = {978--994},
    year = {2021},
    month = {09},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00408},
    url = {https://doi.org/10.1162/tacl\_a\_00408},
}
```

## Acknowledgement

Our codebase heavily relies on these excellent repositories: