mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. (EMNLP 2022)

https://arxiv.org/abs/2205.12005

Introduction

We present mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from computational inefficiency and from the linguistic signal being overwhelmed by long visual sequences during cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections. mPLUG achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding, and visual question answering.
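
The sketch below illustrates the idea of a cross-modal skip-connected fusion block in PyTorch. It is only a reading aid, not the repository's implementation: the module names, layer counts, and dimensions are assumptions.

```python
# Illustrative sketch of a cross-modal skip-connected fusion block (NOT the repo's code).
# Idea: run several cheap text-only self-attention layers, attend to the visual sequence
# only once per block via cross-attention, and let the visual features bypass the fusion
# layers through a skip connection instead of being re-encoded at every layer.
import torch.nn as nn

class SkipConnectedFusionBlock(nn.Module):
    def __init__(self, dim=768, heads=12, num_self_layers=3):  # sizes are assumptions
        super().__init__()
        self.self_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(num_self_layers)]
        )
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, vision):
        # text: (B, L_t, dim), vision: (B, L_v, dim)
        for layer in self.self_layers:
            text = layer(text)                            # text-only self-attention (cheap)
        fused, _ = self.cross_attn(text, vision, vision)  # attend to visual tokens once per block
        text = self.norm(text + fused)                    # residual connection on the text stream
        return text, vision                               # vision skips the fusion layers unchanged
```

Stacking a few such blocks keeps the long visual sequence out of most attention layers, which is where the efficiency gain and the preserved linguistic signal come from.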

News

  • 2023.05.08: Moved from the AliceMind repo for further updates.
  • 2022.8.28: Released mPLUG downstream tasks!

Pre-trained models and datasets

  • Pre-trained models

For the VQA and image captioning tasks, we perform additional continued pre-training on 4M image-text pairs, starting from mplug.en.large, to obtain mplug.en.large.v2.

| Model | Visual Backbone | Text Enc Layers | Fusion Layers | Text Dec Layers | #params | Download |
|---|---|---|---|---|---|---|
| mplug.en.base | vit-b-16 | 6 | 6 | 12 | 350M | mplug.en.base |
| mplug.en.large | vit-l-14 | 6 | 6 | 12 | 600M | mplug.en.large |
| mplug.en.large.v2 | vit-l-14 | 6 | 6 | 12 | 600M | mplug.en.large.v2 |
| mplug.en.huge | vit-l-14 | 24 | 6 | 12 | 1.1B | coming soon |
  • Pre-train Datasets

|  | COCO | VG | SBU | CC3M | CC13M |
|---|---|---|---|---|---|
| image | 113K | 100K | 860K | 3M | 10M |
| text | 567K | 769K | 860K | 3M | 10M |
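
The checkpoints above are regular PyTorch checkpoint files; a quick way to inspect one after downloading is sketched below. The local file name and the exact key layout are assumptions.

```python
# Minimal checkpoint inspection sketch (file name and key layout are assumptions).
import torch

ckpt = torch.load("mplug_en_large.pth", map_location="cpu")  # hypothetical local file name
state = ckpt.get("model", ckpt)  # some checkpoints nest the weights under a "model" key
print(len(state), "tensors, e.g.:")
for name, tensor in list(state.items())[:5]:
    print(" ", name, tuple(tensor.shape))
```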

Results

  • Image-text
| Model | VQA v2 test-dev/test-std (Acc.) | COCO Caption Karpathy test, CE/CIDEr opt. (CIDEr) | MSCOCO 5k test TR/IR (R@1) | Flickr30K 1k test TR/IR (R@1) | RefCOCO val/test-a/test-b (Acc.) | RefCOCO+ val/test-a/test-b (Acc.) | RefCOCOg val-u/test-u (Acc.) | SNLI-VE val/test (Acc.) | NLVR2 dev/test-P (Acc.) |
|---|---|---|---|---|---|---|---|---|---|
| mPLUG-Base | 79.79/79.98 | 137.5/150.4 | -/- | -/- | -/- | -/- | -/- | -/- | -/- |
| mPLUG-Large | 81.27/81.26 | 141.0/155.1 | 82.8/65.8 | 97.6/88.4 | 92.40/94.51/88.42 | 86.02/90.17/78.17 | 85.88/86.42 | 89.45/89.29 | 84.58/84.95 |
| mPLUG-Huge | 82.27/82.41 | 142.3/158.7 | -/- | -/- | -/-/- | -/-/- | -/- | -/- | -/-/- |
  • Video-text
| Model | Video Retrieval: MSRVTT test (R@1) | Video QA: MSRVTT-QA test (Acc.) | Video QA: MSVD-QA test (Acc.) | Video Captioning: VATEX test, CE (CIDEr) |
|---|---|---|---|---|
| mPLUG | 38.1 | 21.1 | 37.2 | 42.0 |

Requirements

  • PyTorch version >= 1.11.0 (a quick environment check is sketched below)

  • Install other libraries via

pip install -r requirements.txt
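
Before launching the multi-GPU fine-tuning scripts below, a quick environment check can save a failed run; the sketch assumes a CUDA machine.

```python
# Quick environment sanity check (assumes a CUDA machine; the scripts expect 8 A100 GPUs).
import torch

print("torch:", torch.__version__)                # should be >= 1.11.0
print("cuda available:", torch.cuda.is_available())
print("gpu count:", torch.cuda.device_count())
```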

Pre-training

Coming soon.

Fine-tuning

Download the json files for the downstream tasks

Visual Question Answering

  1. Download the VQA v2 and Visual Genome datasets from the original websites (VQA 2.0).
  2. Download and extract the provided dataset json files.
  3. In configs/vqa_mplug_base.yaml, set the paths to the json files and the image directories (see the config sketch after this list).
  4. Fine-tune the pre-trained base or large model using 8 A100 GPUs:
sh scripts/vqa_mplug_base.sh
sh scripts/vqa_mplug_large.sh
  5. Evaluate the results using the official evaluation server.
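
A minimal sketch of step 3, filling in the config paths with PyYAML. The key names and json file names below are assumptions, so match them against the actual fields in configs/vqa_mplug_base.yaml.

```python
# Hypothetical sketch for step 3: point configs/vqa_mplug_base.yaml at the downloaded data.
# The keys and file names below are assumptions -- check the yaml file for the real field names.
import yaml

cfg_path = "configs/vqa_mplug_base.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["train_file"] = ["data/vqa_train.json", "data/vg_qa.json"]  # assumed keys and file names
cfg["test_file"] = ["data/vqa_test.json"]
cfg["vqa_root"] = "/path/to/coco/images/"                        # VQA v2 image directory
cfg["vg_root"] = "/path/to/vg/images/"                           # Visual Genome image directory

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)
```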

Image Captioning

  1. Download the COCO Caption dataset from the original website.
  2. Download and extract the provided dataset json files.
  3. Download the language evaluation tool (language_evaluation); a caption-scoring sketch follows this list.
  4. In configs/caption_mplug_base.yaml, set the paths for the json files and the image paths.
  5. Fine-tune the pre-trained base or large model using 8 A100 GPUs:
sh scripts/caption_mplug_base.sh
sh scripts/caption_mplug_large.sh
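
For a quick local sanity check of caption quality, CIDEr can also be computed with pycocoevalcap; note this is a different package from the language_evaluation tool linked above, and the toy data is only for illustration.

```python
# Illustrative CIDEr check with pycocoevalcap (not the repo's language_evaluation tool).
from pycocoevalcap.cider.cider import Cider

# toy example: reference and generated captions keyed by image id, already lowercased/tokenized
gts = {"img1": ["a dog runs on the grass", "a dog is running across a lawn"]}
res = {"img1": ["a dog is running on the grass"]}

score, per_image = Cider().compute_score(gts, res)
print("CIDEr:", score)
```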

Image-text Retrieval

  1. Download the MSCOCO or Flickr30k datasets from the original websites.
  2. Download and extract the provided dataset json files.
  3. In configs/retrieval_flickr30k_mplug_large.yaml or configs/retrieval_coco_mplug_large.yaml, set the paths to the json files and the image directory.
  4. Fine-tune the pre-trained checkpoint using 8 A100 GPUs (an R@1 computation sketch follows this list):
sh scripts/retrieval_flickr30k_mplug_large.sh
sh scripts/retrieval_coco_mplug_large.sh
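
The R@1 numbers reported above come from an image-text similarity matrix; the sketch below shows the standard computation on toy scores and is not tied to the repository's evaluation code (COCO and Flickr30k actually have several captions per image, which the toy 1:1 setup ignores).

```python
# Standard Recall@1 from a text-image similarity matrix (toy 1:1 example, not the repo's code).
import torch

sims = torch.randn(5, 5)                 # sims[i, j] = similarity of text i to image j
labels = torch.arange(5)                 # toy assumption: text i matches image i
ir_r1 = (sims.argmax(dim=1) == labels).float().mean()  # image retrieval (text -> image) R@1
tr_r1 = (sims.argmax(dim=0) == labels).float().mean()  # text retrieval (image -> text) R@1
print(f"TR@1={tr_r1:.3f}  IR@1={ir_r1:.3f}")
```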

Visual Grounding

  1. Download the RefCOCO datasets from the original websites.
  2. Download and extract the provided dataset json files.
  3. In configs/grounding_mplug_large.yaml, set the paths to the json files and the image directory. Data preparation can follow TransVG.
  4. Fine-tune the pre-trained checkpoint using 8 A100 GPUs (an IoU-accuracy sketch follows this list):
 sh scripts/grounding_mplug_base.sh 
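
Referring-expression comprehension counts a prediction as correct when its box overlaps the ground-truth box with IoU >= 0.5; a minimal, repo-independent sketch of that check:

```python
# Minimal IoU@0.5 check for visual grounding (standard metric, not the repo's evaluation code).
# Boxes are (x1, y1, x2, y2) in pixels.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

pred, gt = (48, 40, 210, 300), (50, 45, 200, 310)  # toy boxes
print(iou(pred, gt), iou(pred, gt) >= 0.5)         # the boolean is what accuracy counts
```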

Zero-shot Video-text Retrieval

  1. Download the MSRVTT dataset from the original website.
  2. In configs/retrieval_msrvtt_mplug_large.yaml, set the paths to the json files and the video directories (a frame-sampling sketch follows this list).
  3. To perform zero-shot evaluation, run:
sh scripts/retrieval_msrvtt_mplug_large.sh
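
Zero-shot video evaluation feeds the image backbone a handful of uniformly sampled frames per clip. The sketch below shows one common way to sample frames with OpenCV; the frame count and the choice of reader are assumptions, not the repository's exact pipeline.

```python
# Illustrative uniform frame sampling with OpenCV (frame count and reader are assumptions).
import cv2
import numpy as np

def sample_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames  # list of HxWx3 RGB arrays, ready for the image preprocessor

print(len(sample_frames("video0.mp4")))  # hypothetical MSRVTT clip path
```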

Zero-shot Video Question Answering

  1. Download the MSRVTT-QA dataset from the original website.
  2. In configs/videoqa_msrvtt_mplug_base.yaml, set the paths for the json files and the video paths.
  3. To perform zero-shot evaluation, run:
sh scripts/videoqa_msrvtt_mplug_base.sh

Zero-shot Video Captioning

  1. Download the VATEX dataset from the original website.
  2. In configs/videocap_vatex_mplug_large.yaml, set the paths for the json files and the video paths.
  3. To perform zero-shot evaluation, run:
sh scripts/videocap_vatex_mplug_large.sh

Citation

If you use our work, please cite:

@article{li2022mplug,
  title={mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections},
  author={Li, Chenliang and Xu, Haiyang and Tian, Junfeng and Wang, Wei and Yan, Ming and Bi, Bin and Ye, Jiabo and Chen, Hehong and Xu, Guohai and Cao, Zheng and others},
  journal={arXiv preprint arXiv:2205.12005},
  year={2022}
}

Acknowledgement

The implementation of mPLUG relies on resources from ALBEF, BLIP, and timm. We thank the original authors for open-sourcing their work.