Project Page | Paper | Demo | Replicate Demo | Checkpoints
UnIVAL is a 0.25B-parameter unified model that is multitask pretrained on image-text and video-text data and targets image-text, video-text and audio-text downstream tasks.
Check out our demo on Hugging Face Spaces.
General means the pretrained model before finetuning.
To make it easy to try the model, we also provide several notebooks: VG.ipynb, VQA.ipynb, Captioning.ipynb, Video_Captioning.ipynb, and Audio_Captioning.ipynb.
- [2023.8.12]: we provide the scripts to train UnIVAL for audio/video-text tasks.
- [2023.7.31]: we provide here more details to reproduce the results with UnIVAL on Visual Grounding used in our Rewarded soups work.
- [2023.7.31]: Release of UnIVAL code and model weights! We will release the scripts to train and evaluate audio/video tasks later.
- Quantitative Results
- Installation
- Datasets and Checkpoints
- Training and Inference
- Zero-shot Evaluation
- Parameter Efficient Finetuning (PEFT): Training only the linear layer
- Multimodal Model Merging/Weight Interpolation
- Qualitative results
- Citation
- Acknowledgment
Here are some results on several multimodal tasks.
Task | Visual Grounding | Visual Grounding | Visual Grounding | Image Captioning | VQA | Visual Entailment | VideoQA | Video Captioning | Audio Captioning |
---|---|---|---|---|---|---|---|---|---|
Dataset | RefCOCO | RefCOCO+ | RefCOCOg | COCO | VQA v2 | SNLI-VE | MSRVTT-QA | MSRVTT | AudioCaps |
Split | val/test-a/test-b | val/test-a/test-b | val-u/test-u | Karpathy test | test-dev/test-std | val/test | test | test | test |
Metric | Acc. | Acc. | Acc. | CIDEr | Acc. | Acc. | Acc. | CIDEr | CIDEr |
UnIVAL | 89.1 / 91.5 / 85.2 | 82.2 / 86.9 / 75.3 | 84.7 / 85.2 | 137.0 | 77.0 / 77.1 | 78.2 / 78.6 | 43.5 | 60.5 | 71.3 |
- python 3.7.4
- pytorch 1.13+
- torchvision 0.14.1+
- JAVA 1.8 (for COCO evaluation)
We recommend installing PyTorch before the other libraries:
git clone https://github.com/mshukor/UnIVAL.git
pip install -r requirements.txt
Download the following model for captioning evaluation:
python -c "from pycocoevalcap.spice.spice import Spice; tmp = Spice()"
See datasets.md and checkpoints.md.
The scripts to launch pretraining, finetuning and evaluation can be found in the run_scripts/ folder. Below we provide more details. The data are stored in .tsv files whose format depends on the training task.
To resume training, pass the last checkpoint, checkpoint_last.pt, to --restore-file, and add --reset-dataloader --reset-meters --reset-optimizer as arguments.
We use slurm to launch the training/evaluation.
In some datasets, the images are encoded to base64 strings. To do this transformation you can use the following code:
from PIL import Image
from io import BytesIO
import base64
img = Image.open(file_name) # path to file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data) # bytes
base64_str = base64_str.decode("utf-8") # str
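Most of the sample rows in this document show image strings that begin with /9j/. That prefix is what the base64 encoding of the JPEG header bytes FF D8 FF looks like, so it can serve as a cheap sanity check on an encoded column. The sketch below uses only the standard library; the helper name `looks_like_jpeg` is hypothetical and not part of the repository.

```python
import base64

JPEG_MAGIC = b"\xff\xd8\xff"  # first three bytes of a JPEG file


def looks_like_jpeg(base64_str: str) -> bool:
    """Return True if the string decodes to data starting with the JPEG magic bytes."""
    if len(base64_str) < 4:
        return False
    # Four base64 characters encode exactly three bytes, so the prefix is enough.
    return base64.b64decode(base64_str[:4]) == JPEG_MAGIC


# Tiny fake payload (header plus padding bytes, not a real image).
fake_jpeg = JPEG_MAGIC + b"\x00" * 9
encoded = base64.b64encode(fake_jpeg).decode("utf-8")
is_jpeg = looks_like_jpeg(encoded)
```

The check only inspects the header, so it will not catch a truncated or corrupted image body.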
1. Prepare the Dataset
The format of the pretraining TSV files is as follows:
- Each line contains uniq-id, image/video path, caption, question, answer, ground-truth objects (objects appearing in the caption or question), dataset name (source of the data) and task type (caption, qa or visual grounding). These files are prepared for the pretraining tasks of visual grounding, grounded captioning, image-text matching, image captioning and visual question answering. In addition, the folder negative_sample contains three files: all_captions.txt, object.txt and type2ans.json. The data in these files are used as negative samples for the image/video-text matching task.
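As a rough illustration of the row layout described above, the following sketch splits one pretraining line into named fields. The column order is assumed to match the order in which the fields are listed, the field names are hypothetical, and the example row is made up; check all of them against your own files.

```python
from typing import Dict, List

# Assumed column order, taken from the description of the pretraining files.
PRETRAIN_COLUMNS: List[str] = [
    "uniq_id", "path", "caption", "question", "answer",
    "gt_objects", "dataset_name", "task_type",
]


def parse_pretrain_line(line: str) -> Dict[str, str]:
    """Split one tab-separated pretraining row into named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(PRETRAIN_COLUMNS):
        raise ValueError(f"expected {len(PRETRAIN_COLUMNS)} columns, got {len(values)}")
    return dict(zip(PRETRAIN_COLUMNS, values))


# Made-up row for illustration only (not taken from the released data).
example = "\t".join(
    ["0", "images/example.jpg", "a dog on a beach", "", "", "dog&&beach", "example_set", "caption"]
)
row = parse_pretrain_line(example)
```

Failing loudly on an unexpected column count is a deliberate choice here: a stray tab inside a caption would otherwise shift every later field silently.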
2. Pretraining
There are three scripts to train UnIVAL: unival_s1.sh for stage 1 training, initialized from BART weights; unival_s2.sh for stage 2 training, initialized from the weights after stage 1; and unival_s2_hs.sh for high-resolution training for 1 epoch, initialized from the weights of stage 2. For example, to launch stage 1:
cd run_scripts/pretraining
bash unival_s1.sh
1. Prepare the Dataset & Checkpoints
Each image corresponds to only 1 caption in caption_stage1_train.tsv and to multiple captions in the other TSV files (about 5 captions per image). Each line of the dataset represents a caption sample with the following format. The fields uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used) and image base64 string are separated by tabs.
162365 12455 the sun sets over the trees beyond some docks. sky&&water&&dock&&pole /9j/4AAQSkZJ....UCP/2Q==
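A minimal sketch of reading such a row, assuming the five fields are tab-separated in the order given and that object labels are joined with `&&` as in the example. The `CaptionSample` class is a hypothetical container, not something the repository defines.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionSample:
    uniq_id: str
    image_id: str
    caption: str
    object_labels: List[str]
    image_base64: str


def parse_caption_line(line: str) -> CaptionSample:
    """Parse one tab-separated caption row; raises ValueError on a wrong column count."""
    uniq_id, image_id, caption, labels, image_b64 = line.rstrip("\n").split("\t")
    return CaptionSample(uniq_id, image_id, caption, labels.split("&&"), image_b64)


# The example row, with the base64 string left elided exactly as shown.
line = (
    "162365\t12455\tthe sun sets over the trees beyond some docks."
    "\tsky&&water&&dock&&pole\t/9j/4AAQSkZJ....UCP/2Q=="
)
sample = parse_caption_line(line)
```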
2. Finetuning
To finetune for image captioning:
cd run_scripts/caption
sh unival_caption_stage_1.sh > unival_caption_stage_1.out
3. Inference
You can use the following code for inference, after setting the right weights path:
cd run_scripts/caption/eval ; sh eval_caption.sh # inference & evaluate
1. Prepare the Dataset & Checkpoints
Following common practice, VG-QA samples are also included in the training data. To adapt to the seq2seq paradigm of OFA, we transform original VQA training questions with multiple golden answers into multiple training samples. For the original VQA validation set, we keep around 10k samples for our validation and use the other samples for training. Each line of the dataset represents a VQA sample with the following format. The fields question-id, image-id, question, answer (with confidence), predicted object labels (taken from VinVL; they bring an accuracy improvement of around +0.1) and image base64 string are separated by tabs.
79459 79459 is this person wearing shorts? 0.6|!+no house&&short&&...&&sky /9j/4AAQS...tigZ/9k=
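The answer field in the example, `0.6|!+no`, pairs a confidence with an answer string. A small sketch of decoding it follows; the `&&` separator for several answers in one field is an assumption (the example shows only one answer), and `parse_vqa_answers` is a hypothetical helper.

```python
from typing import Dict


def parse_vqa_answers(field: str) -> Dict[str, float]:
    """Map each answer to its confidence, e.g. '0.6|!+no' -> {'no': 0.6}."""
    answers: Dict[str, float] = {}
    # Assumption: if a field holds several answers, they are joined with '&&'.
    for item in field.split("&&"):
        confidence, answer = item.split("|!+", 1)
        answers[answer] = float(confidence)
    return answers


single = parse_vqa_answers("0.6|!+no")
```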
2. Shuffle the Training Data
(Optional, but achieves better finetuning accuracy): If there is enough disk space, we recommend preparing the shuffled training data for each epoch in advance.
cd dataset/vqa_data
ln vqa_train.tsv vqa_train_1.tsv
for idx in `seq 1 9`; do shuf vqa_train_${idx}.tsv > vqa_train_$((idx+1)).tsv; done # each file is used for an epoch
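For environments without `shuf`, the same per-epoch files can be produced with a short script. This is a sketch rather than the repository's tooling: `write_shuffled_epochs` is a hypothetical helper, it keeps the whole file in memory, and the demo at the bottom runs on a small temporary file instead of the real data.

```python
import random
import tempfile
from pathlib import Path
from typing import List


def write_shuffled_epochs(train_tsv: Path, num_epochs: int = 10, seed: int = 0) -> List[Path]:
    """Write <stem>_1.tsv ... <stem>_N.tsv next to train_tsv, each a reshuffle of the previous one."""
    rng = random.Random(seed)
    lines = train_tsv.read_text(encoding="utf-8").splitlines(keepends=True)
    # Make sure the last row ends with a newline so shuffling cannot glue two rows together.
    lines = [ln if ln.endswith("\n") else ln + "\n" for ln in lines]
    outputs: List[Path] = []
    for epoch in range(1, num_epochs + 1):
        if epoch > 1:
            rng.shuffle(lines)  # epoch 1 keeps the original order, like the hard link
        out_path = train_tsv.with_name(f"{train_tsv.stem}_{epoch}.tsv")
        out_path.write_text("".join(lines), encoding="utf-8")
        outputs.append(out_path)
    return outputs


# Demo on a throwaway file with 20 fake rows.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "vqa_train.tsv"
    src.write_text("".join(f"{i}\tquestion {i}\n" for i in range(20)), encoding="utf-8")
    paths = write_shuffled_epochs(src, num_epochs=3)
    epoch_names = [p.name for p in paths]
    epoch_rows = [sorted(p.read_text(encoding="utf-8").splitlines()) for p in paths]
```

Fixing the seed makes the epoch files reproducible across runs, which helps when comparing finetuning results.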
3. Finetuning
If you have shuffled the training data in the previous step, please correctly specify the training data path following the guide in the script comments.
cd run_scripts/vqa
bash unival_vqa.sh
4. Inference
We use beam-search during inference.
cd run_scripts/vqa/eval
bash evaluate_vqa.sh # specify 'val' or 'test' in the script
1. Prepare the Dataset & Checkpoints
We use the RefCOCO (split by UNC), RefCOCO+ (split by UNC) and RefCOCOg (split by UMD) datasets. See RefCOCO and Refer for more details. Note that in the original dataset, each region-coord (or bounding box) may correspond to multiple descriptive texts. We split these texts into multiple samples so that the region-coord in each sample corresponds to only one text. Each line of the processed dataset represents a sample with the following format. The fields uniq-id, image-id, text, region-coord (separated by commas) and image base64 string are separated by tabs.
79_1 237367 A woman in a white blouse holding a glass of wine. 230.79,121.75,423.66,463.06 9j/4AAQ...1pAz/9k=
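The sketch below illustrates the two ideas in this step: reading the comma-separated region-coord, and turning one region with several texts into one row per text. The corner ordering of the four numbers, the `<ref_id>_<index>` id pattern and the second description are assumptions made for illustration only.

```python
from typing import List, Tuple


def parse_region(region: str) -> Tuple[float, float, float, float]:
    """Parse '230.79,121.75,423.66,463.06' into four floats (assumed to be two box corners)."""
    x0, y0, x1, y1 = (float(v) for v in region.split(","))
    return x0, y0, x1, y1


def split_region_texts(ref_id: str, image_id: str, region: str, texts: List[str]) -> List[str]:
    """Emit one tab-separated row (image column omitted) per descriptive text."""
    # The '<ref_id>_<index>' id pattern is a guess based on the example id '79_1'.
    return ["\t".join([f"{ref_id}_{i}", image_id, text, region]) for i, text in enumerate(texts)]


box = parse_region("230.79,121.75,423.66,463.06")
rows = split_region_texts(
    "79",
    "237367",
    "230.79,121.75,423.66,463.06",
    [
        "A woman in a white blouse holding a glass of wine.",
        "woman with wine",  # made-up second description
    ],
)
```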
2. Finetuning
cd run_scripts/refcoco
sh unival_refcoco.sh > train_refcoco.out & # finetune for refcoco
sh unival_refcocoplus.sh > train_refcocoplus.out & # finetune for refcoco+
sh unival_refcocog.sh > train_refcocog.out & # finetune for refcocog
3. Inference
Run the following commands for the evaluation.
cd run_scripts/refcoco/eval ; sh eva_refcoco.sh # eva_refcocog.sh, eva_refcocoplus.sh
1. Prepare the Dataset & Checkpoints
Each line of the processed dataset represents a sample with the following format. The fields uniq-id, image-id, image base64 string, hypothesis, caption (or text premise) and label are separated by tabs.
252244149.jpg#1r1n 252244149 /9j/4AAQ...MD/2Q== a man in pink and gold is chewing on a wooden toothpick. a man in pink is chewing a toothpick on the subway. neutral
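A sketch of reading such a row while dropping the caption (text premise), which this README says is not used for the task. The function name and the returned tuple layout are hypothetical; how the finetuning script actually consumes the fields is not shown here.

```python
from typing import Tuple


def parse_snli_ve_line(line: str) -> Tuple[str, str, str]:
    """Return (image_base64, hypothesis, label); the caption/premise column is ignored."""
    uniq_id, image_id, image_b64, hypothesis, _premise, label = line.rstrip("\n").split("\t")
    return image_b64, hypothesis, label


# The example row, with the base64 string left elided exactly as shown.
line = "\t".join(
    [
        "252244149.jpg#1r1n",
        "252244149",
        "/9j/4AAQ...MD/2Q==",
        "a man in pink and gold is chewing on a wooden toothpick.",
        "a man in pink is chewing a toothpick on the subway.",
        "neutral",
    ]
)
image_b64, hypothesis, label = parse_snli_ve_line(line)
```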
2. Finetuning
Contrary to previous work (e.g. OFA) we do not use the text premise for this task.
cd run_scripts/snli_ve
nohup sh unival_snli_ve.sh > train_snli_ve.out & # finetune for snli_ve
3. Inference
Run the following command to obtain the results.
cd run_scripts/snli_ve/eval ; sh eval_snli_ve.sh # specify 'dev' or 'test' in the script
1. Prepare the Dataset & Checkpoints
The dataset zipfile coco_image_gen.zip contains coco_vqgan_train.tsv, coco_vqgan_dev.tsv and coco_vqgan_full_test.tsv. Each line of the dataset represents a sample with the following format. The fields uniq-id, image-code (produced by VQGAN, a list of integers separated by single spaces) and lowercased caption are separated by tabs.
1 6674 4336 4532 5334 3251 5461 3615 2469 ...4965 4190 1846 the people are posing for a group photo.
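A minimal sketch of reading one such row, assuming three tab-separated fields with the image code stored as space-separated integers. The code sequence in the demo is shortened and made up from the first few values of the example; real rows hold a much longer list.

```python
from typing import List, Tuple


def parse_image_gen_line(line: str) -> Tuple[str, List[int], str]:
    """Return (uniq_id, image_codes, caption) from one tab-separated row."""
    uniq_id, code_str, caption = line.rstrip("\n").split("\t")
    codes = [int(tok) for tok in code_str.split()]
    return uniq_id, codes, caption


uniq_id, codes, caption = parse_image_gen_line(
    "1\t6674 4336 4532 5334\tthe people are posing for a group photo."
)
```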
The checkpoint zipfile image_gen_large_best.zip contains image_gen_large_best.pt, vqgan/last.ckpt, vqgan/model.yaml and clip/Vit-B-16.pt.
2. Finetuning
We divide the finetuning process for image generation into two stages. In stage 1, we finetune OFA with cross-entropy loss. In stage 2, we select the last checkpoint of stage 1 and train with CLIP Score optimization. During validation, the generated images are dumped into _GEN_IMAGE_PATH_.
cd run_scripts/image_gen
nohup sh unival_image_gen_stage_1.sh # stage 1, train with cross-entropy loss
nohup sh unival_image_gen_stage_2.sh # stage 2, load the last ckpt of stage1 and train with CLIP Score optimization
3. Inference
Run the command below to generate your images.
cd run_scripts/image_gen/eval ; sh eval_image_gen.sh # inference & evaluate (FID, IS and CLIP Score)
Here we provide the scripts for zero-shot evaluation on image-text tasks. You need to specify the path to the pretrained model in each of these scripts:
- Image Caption on Nocaps: caption/eval/eval_nocaps.sh
- VQA on VizWiz: vqa/eval/eval_vizwiz.sh
- VQA on OKVQA: vqa/eval/eval_okvqa.sh
Following eP-ALM, we experiment with efficient finetuning by training only the linear connection between the modality-specific encoders and the language model, while keeping all other parameters frozen:
- Image Caption on COCO: caption/onlylinear/unival_caption_stage_s2_onlylinear.sh
- Video Caption on MSRVTT: caption/onlylinear/unival_video_caption_stage_s2_onlylinear.sh
- Audio Caption on Audiocaps: caption/onlylinear/unival_audio_caption_stage_s2_onlylinear.sh
- VQA on VQAv2: vqa/onlylinear/unival_vqa_s2_onlylinear.sh
- Video QA on MSRVTT: vqa/onlylinear/unival_video_vqa_s2_onlylinear.sh
To finetune the stage-1 pretrained model, you can use the scripts with s1.
In this section we provide the details to reproduce the experiments for weight interpolation and different weight averaging experiments. The objective is to leverage the synergy between models finetuned on different multimodal tasks.
To average several models, you can use preprocess/average_save_models.py. There are two options: either average many models with a uniform interpolation coefficient, or interpolate between 2 models with an interpolation coefficient from 0 to 1. You can also customise this script as you like.
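To make the two options concrete, here is a toy sketch of uniform averaging and two-model interpolation. It is not the code in preprocess/average_save_models.py: plain lists of floats stand in for checkpoint tensors, all function names are hypothetical, and the convention that a coefficient of 0 returns the first model is an assumption.

```python
from typing import Dict, List, Sequence

StateDict = Dict[str, List[float]]  # stand-in for a checkpoint's name -> tensor mapping


def weighted_average(states: Sequence[StateDict], coeffs: Sequence[float]) -> StateDict:
    """Combine checkpoints parameter by parameter: sum_i coeffs[i] * states[i]."""
    if not states or len(states) != len(coeffs):
        raise ValueError("need one coefficient per checkpoint")
    keys = states[0].keys()
    if any(s.keys() != keys for s in states):
        raise ValueError("checkpoints must share the same parameter names")
    return {
        k: [sum(c * s[k][j] for c, s in zip(coeffs, states)) for j in range(len(states[0][k]))]
        for k in keys
    }


def uniform_average(states: Sequence[StateDict]) -> StateDict:
    """Option 1: every model gets the same coefficient 1/N."""
    return weighted_average(states, [1.0 / len(states)] * len(states))


def interpolate(a: StateDict, b: StateDict, lam: float) -> StateDict:
    """Option 2: (1 - lam) * a + lam * b, so lam=0 gives a and lam=1 gives b (assumed convention)."""
    return weighted_average([a, b], [1.0 - lam, lam])


model_a = {"w": [0.0, 2.0]}
model_b = {"w": [4.0, 6.0]}
halfway = interpolate(model_a, model_b, 0.5)
mean = uniform_average([model_a, model_b])
```

Both options reduce to the same weighted sum, which is why a single helper covers them; averaging only makes sense when the checkpoints share an architecture and parameter names.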
Once you have saved the interpolated weights, you can use the following scripts to evaluate the model:
## image-text tasks
sh caption/eval/eval_caption_avg.sh
sh refcoco/eval/eval_refcocoplus_avg.sh
sh snli_ve/eval/eval_snli_ve_avg.sh
sh vqa/eval/eval_vqa_avg.sh
## video-text tasks
sh vqa/eval/video/eval_video_qa_avg.sh
sh caption/eval/video/eval_msrvtt_video_caption_avg.sh
For Ratatouille finetuning, each of the auxiliary models (e.g. models finetuned for captioning, VQA, visual grounding and visual entailment) is re-finetuned on the target task. At the end, all the resulting models are uniformly averaged.
The scripts to launch the finetuning and evaluation are in averaging/ratatouille/.
You also need to use the weight averaging script in preprocess/average_save_models.py.
For Fusing finetuning, the auxiliary models are first averaged, then finetuned on the target task.
The scripts to launch the finetuning and evaluation are in averaging/fusing/.
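The difference between Ratatouille and Fusing is only the order of the averaging and finetuning steps. The toy sketch below makes that order explicit; it is not the repository's training code, a single float stands in for each model's weights, and `toy_finetune` is a made-up nonlinear update chosen only so that the two orders give different results.

```python
from typing import Callable, Dict, List

Weights = Dict[str, float]  # toy stand-in for model weights
Finetune = Callable[[Weights], Weights]


def average(models: List[Weights]) -> Weights:
    """Uniform average of models that share the same parameter names."""
    return {k: sum(m[k] for m in models) / len(models) for k in models[0]}


def ratatouille(aux_models: List[Weights], finetune_on_target: Finetune) -> Weights:
    """Re-finetune every auxiliary model on the target task, then average the results."""
    return average([finetune_on_target(m) for m in aux_models])


def fusing(aux_models: List[Weights], finetune_on_target: Finetune) -> Weights:
    """Average the auxiliary models first, then finetune the average once."""
    return finetune_on_target(average(aux_models))


def toy_finetune(weights: Weights) -> Weights:
    # Placeholder for a real finetuning run: squares each weight.
    return {k: v * v for k, v in weights.items()}


aux = [{"w": 1.0}, {"w": 3.0}]
rata = ratatouille(aux, toy_finetune)  # average of 1.0 and 9.0
fused = fusing(aux, toy_finetune)  # square of 2.0
```

In this toy setting Ratatouille costs one finetuning run per auxiliary model, while Fusing needs a single run on the averaged weights.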
Below we provide qualitative results for some tasks.
If you find the work helpful, you can cite it using the following citation:
@article{shukor2023unified,
title={Unified Model for Image, Video, Audio and Language Tasks},
author={Shukor, Mustafa and Dancette, Corentin and Rame, Alexandre and Cord, Matthieu},
journal={arXiv preprint arXiv:2307.16184},
year={2023}
}
This code is based mainly on the following repos:
We thank the authors for releasing their code.