Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li
[ Project Page | arXiv Paper ]
This is the official code for CLOVA, a Closed-Loop Visual Assistant that updates its tools via a closed-loop framework with inference, reflection, and learning phases.
conda env create -f environment.yml
conda activate clova
Our method requires one A100 GPU.
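To quickly confirm that PyTorch can see a GPU before running anything heavy, a small check like the following can help (it only assumes the PyTorch installed by the environment above):

```python
# Optional sanity check: verify that PyTorch detects a CUDA GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```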
- The config file for LLMs and tasks:
configs/LLM_config.yaml
- The config file for tools:
configs/all_updated_model_config.yaml
We use LLaMA2-7B for inference and reflection, and you can also try other LLMs, such as GPT-3.5-turbo.
We use GPT-3.5-turbo for the LIST tool and for collecting images from the Internet. Set your API key in the config file configs/LLM_config.yaml.
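If you want a quick sanity check that the config still parses after editing it, you can load it with PyYAML; the field that stores the API key depends on the file itself, so no key name is assumed here:

```python
# Load configs/LLM_config.yaml and list its top-level keys so you can spot
# the field that holds the OpenAI API key. Field names are repo-specific.
import yaml

with open("configs/LLM_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(list(cfg.keys()))
```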
Download the GQA dataset from GQA Link, and update its path in configs/LLM_config.yaml. The GQA dataset is used for the compositional VQA task.
Download the NLVRv2 dataset from NLVRv2 Link, and update its path in configs/LLM_config.yaml. The NLVRv2 dataset is used for the multiple-image reasoning task.
Download the LVIS dataset from LVIS Link, and update its path in configs/all_updated_model_config.yaml for the LOC and SEG tools. The LVIS dataset is used as an open-vocabulary dataset to update these two tools.
We provide four demos for the VQA, multiple-image reasoning, image editing, and knowledge tagging tasks.
For the VQA task
python -m torch.distributed.launch --nproc_per_node=1 gqa_demo.py
For the multiple-image reasoning task
python -m torch.distributed.launch --nproc_per_node=1 nlvr_demo.py
For the image editing task
python -m torch.distributed.launch --nproc_per_node=1 imgedit_demo.py
For the knowledge tagging task
python -m torch.distributed.launch --nproc_per_node=1 knowtag_demo.py
- If the tool is learnable (e.g., a neural network), add a separate Python file in the `tools` folder that defines a class for this tool with `execute` and `update` functions. The following is an example of the LOC tool.
class LocInterpreter():
    step_name = 'LOC'
    def __init__(self):
        ...
    def parse(self, prog_step):
        ...
    def execute(self, prog_step):
        ...
    def update(self, query):
        ...
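For intuition about the `execute`/`update` split, here is a minimal, hypothetical sketch that uses a toy torch model in place of the real localization model; the `prog_step` and `query` formats below are placeholders, not the actual CLOVA interfaces:

```python
# Hypothetical sketch of a learnable tool. The real LOC tool wraps an actual
# detection model; the interfaces below are simplified placeholders.
import torch
import torch.nn as nn

class ToyLearnableTool():
    step_name = 'TOY'

    def __init__(self):
        self.model = nn.Linear(8, 2)  # stand-in for the real model
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=1e-3)

    def parse(self, prog_step):
        ...  # parse the program step string into arguments

    def execute(self, prog_step):
        ...  # run the model on the parsed inputs and return/store the result

    def update(self, query):
        # `query` is assumed to be a list of (input, label) tensor pairs
        # collected during the reflection/learning phases.
        self.model.train()
        for x, y in query:
            loss = nn.functional.cross_entropy(self.model(x), y)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
```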
- If the tool is not learnable, add a class in `tools/unupdated_functions.py`, including `parse` and `execute` functions. The following is an example of the COUNT tool.
class CountInterpreter():
    step_name = 'COUNT'
    def __init__(self):
        ...
    def parse(self, prog_step):
        ...
    def execute(self, prog_step, inspect=False):
        ...
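For intuition, here is a minimal, self-contained sketch of what a non-learnable COUNT tool could do; the step-string format and the dictionary-based program state are assumptions for illustration, not the exact interfaces used in `tools/unupdated_functions.py`:

```python
# Hypothetical sketch of a COUNT tool over a dict-based program state.
import re

class ToyCountInterpreter():
    step_name = 'COUNT'

    def parse(self, prog_step):
        # e.g. "ANSWER0=COUNT(box=BOX0)" -> ("ANSWER0", "BOX0")
        return re.match(r'(\w+)=COUNT\(box=(\w+)\)', prog_step.strip()).groups()

    def execute(self, prog_step, state, inspect=False):
        output_var, box_var = self.parse(prog_step)
        state[output_var] = len(state[box_var])  # count boxes from an earlier step
        return state[output_var]

# Toy usage:
state = {'BOX0': [(10, 20, 50, 60), (30, 40, 80, 90)]}
print(ToyCountInterpreter().execute('ANSWER0=COUNT(box=BOX0)', state))  # 2
```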
- Add the tool instruction to the `prompt_engineering.py` file, and add several in-context examples to the corresponding `experience_pool.py` file in the `prompts` folder.
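To make this last step concrete, the snippet below shows the kind of content involved; the exact variable names and formats inside `prompt_engineering.py` and `experience_pool.py` are assumptions for illustration only:

```python
# Hypothetical tool instruction and in-context example; the real formats in
# prompts/prompt_engineering.py and prompts/experience_pool.py may differ.

# Tool instruction: tells the LLM planner what the tool does and how to call it.
COUNT_INSTRUCTION = (
    "COUNT: counts the boxes in a box list. Usage: ANSWER=COUNT(box=BOX)"
)

# In-context example: a question paired with a program that uses the tool.
COUNT_EXAMPLE = """Question: How many dogs are in the image?
Program:
BOX0=LOC(image=IMAGE, object='dog')
ANSWER0=COUNT(box=BOX0)
FINAL_RESULT=RESULT(var=ANSWER0)"""
```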
CLOVA can also update its tools based on human feedback.
zhi.gao@pku.edu.cn, gaozhi_2017@126.com
We thank the following codebases.
- VISPROG: The architecture of our code is based on VISPROG (CVPR 2023 best paper). Don't forget to check out this great open-source work if you haven't seen it before!
- BLIP: Our prompt tuning scheme was first developed and evaluated on the BLIP codebase.
- Textual Inversion: The prompt tuning scheme for Stable Diffusion is based on Textual Inversion.
Code: Apache license
If you find this code useful in your research, please consider citing:
@inproceedings{gao2024clova,
title={CLOVA: A closed-loop visual assistant with tool usage and update},
author={Gao, Zhi and Du, Yuntao and Zhang, Xintong and Ma, Xiaojian and Han, Wenjuan and Zhu, Song-Chun and Li, Qing},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)},
year={2024}
}