Official code for the paper "Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination".
We run our experiments using Python 3.9.12. You can install the required packages using:
`pip install -r requirements.txt`
- `/data` stores the processed data of all datasets, including AG News, ARC, CoarseWSD-20, QASC, SciQ, Situation, and ViComTe.
- `/model` stores the code to run the models.
- `/output` stores the models' predictions.
- `/scripts` contains some useful scripts.
First, you need to get the performance of all language models and CLIP. We have three categories of language models: prompt-based, latent-embedding-based, and NLI-based, and we evaluate on 8 datasets.
- Models:
- Prompt-based: gpt-neo-1.3B, gpt-neo-2.7B, gpt-j-6B, opt-30b
- Latent-based: simcse, sbert
- NLI-based: roberta, bart
- Datasets: ag_news, situation, arc_easy, arc_challenge, qasc, sciq, coarse_wsd, vicomte.
Navigate to the `/model` directory, then:
`/model/prompt.py` is the code for the prompt-based models. To get their predictions, run:
`python prompt.py dataset model`
replacing `dataset` and `model` with the exact names listed above.
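For intuition, here is a minimal sketch of how prompt-based zero-shot scoring typically works: each answer choice is scored by its language-model likelihood. The checkpoint, prompt format, and example question below are illustrative assumptions, not necessarily what `prompt.py` implements:

```python
# Minimal sketch (assumption): score answer choices by average token log-likelihood
# under a causal LM. prompt.py may use a different prompt format and scoring rule.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").eval()

def choice_score(question, choice):
    # Concatenate question and candidate answer, then measure how likely the LM finds it.
    text = f"Question: {question} Answer: {choice}"
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item()  # higher is better

# Illustrative example, not taken from the datasets.
question = "Which organ pumps blood through the body?"
choices = ["the heart", "the lungs", "the liver"]
print(max(choices, key=lambda c: choice_score(question, c)))
```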
`/model/latent.py` is the code for the latent-embedding-based models. To get their predictions, run:
`python latent.py dataset model`
replacing `dataset` and `model` with the exact names listed above.
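For intuition, a minimal sketch of latent-embedding-based scoring: rank the choices by embedding similarity to the input. The `sentence-transformers` checkpoint and example below are illustrative assumptions; `latent.py` may load SimCSE/SBERT and pool embeddings differently:

```python
# Minimal sketch (assumption): rank answer choices by cosine similarity between the
# question embedding and each choice embedding.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # illustrative SBERT checkpoint

question = "Which organ pumps blood through the body?"   # illustrative example
choices = ["the heart", "the lungs", "the liver"]

q_emb = model.encode(question, convert_to_tensor=True)
c_emb = model.encode(choices, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_emb)[0]  # cosine similarity per choice
print(choices[int(scores.argmax())])
```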
`/model/nli.py` is the code for the NLI-based models. To get their predictions, run:
`python nli.py dataset model`
replacing `dataset` and `model` with the exact names listed above.
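For intuition, a minimal sketch of NLI-based zero-shot scoring using the HuggingFace zero-shot-classification pipeline; the checkpoint and labels below are illustrative assumptions, and `nli.py` may build the premise-hypothesis pairs differently:

```python
# Minimal sketch (assumption): zero-shot classification via an NLI model, treating each
# candidate label/answer as a hypothesis to be entailed by the input text.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "Stocks rallied after the company reported record quarterly earnings."  # illustrative
labels = ["world", "sports", "business", "science and technology"]             # AG News-style labels
result = classifier(text, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # top label and its entailment-based score
```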
To run the CLIP model, you first need to download the image embeddings here. After unzipping them into the `/Z-LaVI` folder, you can run:
`python clip_zs.py dataset imagine_type`
We provide three `imagine_type` options: `synthesis` (generated images), `recall` (web images), and `combine` (both types of images).
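Conceptually, CLIP scores each answer choice by its similarity to the imagined images and aggregates over them. Below is a minimal sketch assuming the openai CLIP package; the image files, choices, and mean aggregation are illustrative assumptions, and `clip_zs.py` works from the downloaded precomputed image embeddings instead:

```python
# Minimal sketch (assumption): score each answer choice by its average CLIP similarity
# to the images "imagined" (retrieved or generated) for the input.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image_paths = ["apple_0.jpg", "apple_1.jpg"]   # illustrative imagined images for the input
choices = ["a fruit", "a technology company"]  # illustrative answer choices

images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
texts = clip.tokenize(choices).to(device)

with torch.no_grad():
    img_feat = model.encode_image(images).float()
    txt_feat = model.encode_text(texts).float()
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    sims = img_feat @ txt_feat.T   # [num_images, num_choices]
    scores = sims.mean(dim=0)      # average over the imagined images

print(choices[int(scores.argmax())])
```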
You can evaluate a single model's performance using `/model/evaluate.py`:
`python evaluate.py dataset model`
replacing `dataset` and `model` with the exact names. You will see the printed output in the terminal in the following format:
Dataset: dataset
Model: model
Performance:
Metric: number
You can ensemble CLIP with any of the language models using `model/ensemble.py`:
`python ensemble.py dataset clip_{imagine_type} language_model weight`
You will see the performance of CLIP, the language model, and their ensembled performance.
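For intuition, a minimal sketch of one common ensembling scheme, a weighted sum of the two models' softmax-normalized per-choice scores; which score the weight is applied to, and how `ensemble.py` actually combines the models, may differ:

```python
# Minimal sketch (assumption): combine CLIP and language-model scores by a weighted sum
# of their softmax-normalized per-choice scores.
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def ensemble(lm_scores, clip_scores, weight):
    # Here, weight controls the contribution of the CLIP ("imagination") scores.
    return (1 - weight) * softmax(lm_scores) + weight * softmax(clip_scores)

lm_scores = [2.1, 0.3, -0.5]   # illustrative per-choice scores from a language model
clip_scores = [0.8, 1.9, 0.2]  # illustrative per-choice scores from CLIP
print(ensemble(lm_scores, clip_scores, weight=0.4).argmax())
```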
`scripts/bing-vis-search.py` is the code to download images from Bing. You first need a paid Azure account and must replace the subscription key in line 26. Then prepare your queries in a txt file separated by `\n` and run:
`python bing-vis-search.py -f your_queries.txt --threads 200 --limit 300 -o output_dir`
`--limit` is the number of images you want to download; since some images are not downloadable, you may want to set this number higher than the number you actually need.
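The query file passed via `-f` is just a newline-separated list of search queries; for example (file name and queries are illustrative):

```python
# Write one image-search query per line, as the newline-separated format described above.
queries = ["apple", "orange", "a photo of a violin"]
with open("your_queries.txt", "w") as f:
    f.write("\n".join(queries))
```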
`scripts/dalle_generation` is the code that uses DALLE-mini to generate images. You first need to install the required packages following the official repo. Then you can use the main function in the code to generate images:
prompt = "A photo of apple."
number_of_images = 8
images = main(prompt, number_of_images)
for i, image in enumerate(images):
    image.save("apple_{}.jpg".format(i))
We have released all the images generated by DALLE-mini; you can download them from here (21.43 GB).
Please cite our work if you find it useful:
@inproceedings{yang-etal-2022-z,
title = "{Z}-{L}a{VI}: Zero-Shot Language Solver Fueled by Visual Imagination",
author = "Yang, Yue and
Yao, Wenlin and
Zhang, Hongming and
Wang, Xiaoyang and
Yu, Dong and
Chen, Jianshu",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.78",
pages = "1186--1203",
abstract = "Large-scale pretrained language models have made significant advances in solving downstream language understanding tasks. However, they generally suffer from reporting bias, the phenomenon describing the lack of explicit commonsense knowledge in written text, e.g., {''}an orange is orange{''}. To overcome this limitation, we develop a novel approach, Z-LaVI, to endow language models with visual imagination capabilities. Specifically, we leverage two complementary types of {''}imaginations{''}: (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation. Jointly exploiting the language inputs and the imagination, a pretrained vision-language model (e.g., CLIP) eventually composes a zero-shot solution to the original language tasks. Notably, fueling language models with imagination can effectively leverage visual knowledge to solve plain language tasks. In consequence, Z-LaVI consistently improves the zero-shot performance of existing language models across a diverse set of language tasks.",
}