X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers (EMNLP 2020)
- Authors: Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Ani Kembhavi
- Paper
- Blog
- Demo
- Slideslive Presentation
Recent multi-modal transformers have achieved state-of-the-art performance on a variety of multi-modal discriminative tasks like visual question answering and generative tasks like image captioning. This begs an interesting question: can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family - LXMERT - finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT's.
Try out the AI2 Computer Vision Explorer demo!
- Python packages

```bash
conda create -n xlxmert python=3.7
conda activate xlxmert
cd x-lxmert
pip install -r ./requirements.txt
```
- Mask-RCNN-benchmark (for feature extraction)
  - Please follow the original installation guide.
- Faiss (for K-means clustering)
  - Please follow the original installation guide.
```
./datasets                  <= Store images, features, and annotations
    COCO/
        images/
        features/
    VG/
        images/
        features/
    GQA/
        images/
        features/
    nlvr2/
        images/
        features/
    data/                   <= Store text annotations (*.json) for each split
        lxmert/
        vqa/
        gqa/
        nlvr2/

./feature_extraction        <= Run feature extraction and K-means clustering

./image_generator           <= Train image generator
    snap/                   <= Store image generator checkpoints
    scripts/                <= Bash scripts for training image generator

./x-lxmert                  <= Train X-LXMERT
    src/
        lxrt/               <= X-LXMERT model class implementation (inherits huggingface transformers' LXMERT class)
        pretrain/           <= X-LXMERT pretraining
        tasks/              <= Fine-tuning on downstream tasks (VQA, GQA, NLVR2, image generation)
    snap/                   <= Store X-LXMERT checkpoints
    scripts/                <= Bash scripts for pretraining, fine-tuning, and image generation
```
Please check out ./feature_extraction for instructions on downloading pre-extracted features and for more details.
```bash
cd ./feature_extraction

# For Pretraining / VQA
python coco_extract_grid_feature.py --split train
python coco_extract_grid_feature.py --split valid
python coco_extract_grid_feature.py --split test

# For Pretraining
python VG_extract_grid_feature.py

# For GQA
python GQA_extract_grid_feature.py

# For NLVR2
python nlvr2_extract_grid_feature.py --split train
python nlvr2_extract_grid_feature.py --split valid
python nlvr2_extract_grid_feature.py --split test
```
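The scripts above extract dense grid features rather than region-proposal features. As a purely illustrative sketch (the 8x8 grid and 2048-dimensional features here are assumptions for illustration; the real shapes and file format are determined by the extraction scripts), each image becomes a fixed-length sequence of visual embeddings, analogous to word tokens on the language side:

```python
import numpy as np

# Hypothetical grid feature map for one image.
# Grid size and feature dimension are assumptions for illustration.
grid_h, grid_w, feat_dim = 8, 8, 2048
features = np.random.rand(grid_h, grid_w, feat_dim).astype(np.float32)

# Flatten the grid into a sequence of 64 visual "tokens"
# that the transformer consumes alongside word tokens.
visual_tokens = features.reshape(grid_h * grid_w, feat_dim)
print(visual_tokens.shape)  # (64, 2048)
```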
```bash
# K-means clustering
python run_kmeans.py --src mscoco_train --tgt mscoco_train mscoco_valid vg
```
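run_kmeans.py clusters the extracted grid features so that each grid cell can later be represented by a discrete cluster id, which is what X-LXMERT predicts when it paints. Below is a pure-NumPy stand-in for that step (toy shapes and plain Lloyd iterations; the repository's actual clustering uses Faiss over real COCO/VG features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-extracted grid features: n vectors of dimension d,
# clustered into k centroids. All sizes here are toy values.
n, d, k = 1000, 64, 16
feats = rng.standard_normal((n, d)).astype(np.float32)

# Initialize centroids from random samples, then run Lloyd iterations.
centroids = feats[rng.choice(n, k, replace=False)].copy()
for _ in range(10):
    # Assign each feature to its nearest centroid.
    dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(1)
    # Update each centroid as the mean of its assigned features.
    for j in range(k):
        members = feats[assign == j]
        if len(members):
            centroids[j] = members.mean(0)

# Quantize: each grid feature becomes a discrete cluster id.
dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(1)
print(codes.shape)  # (1000,)
```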
```bash
# Pretrain X-LXMERT
cd ./x-lxmert/
bash scripts/pretrain.bash

# Or download the pretrained checkpoint
wget -O x-lxmert/snap/pretrained/x_lxmert/Epoch20_LXRT.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/x-lxmert/Epoch20_LXRT.pth
```
```bash
# Fine-tune and evaluate on VQA
cd ./x-lxmert/
bash scripts/finetune_vqa.bash
bash scripts/test_vqa.bash
```
```bash
# Fine-tune and evaluate on GQA
cd ./x-lxmert/
bash scripts/finetune_gqa.bash
bash scripts/test_gqa.bash
```
```bash
# Fine-tune and evaluate on NLVR2
cd ./x-lxmert/
bash scripts/finetune_nlvr2.bash
bash scripts/test_nlvr2.bash
```
```bash
# Train the image generator
cd ./image_generator/
bash scripts/train_generator.bash

# Or download the pretrained generator checkpoint
wget -O image_generator/snap/pretrained/G_60.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/image_generator/G_60.pth
```
```bash
# Sample images with X-LXMERT
cd ./x-lxmert/
bash scripts/sample_image.bash
```
```bibtex
@inproceedings{Cho2020XLXMERT,
  title     = {X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers},
  author    = {Cho, Jaemin and Lu, Jiasen and Schwenk, Dustin and Hajishirzi, Hannaneh and Kembhavi, Aniruddha},
  booktitle = {EMNLP},
  year      = {2020}
}
```