Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li
[ Project Page | arXiv Paper ]
This is the official code for CLOVA, a Closed-Loop Visual Assistant that updates its tools via a closed-loop framework with inference, reflection, and learning phases.
conda env create -f environment.yml
conda activate clova
Our method requires one A100 GPU.
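To quickly confirm that PyTorch can see a GPU before running anything heavy, a small check like the following can help (it only assumes the PyTorch installed by the environment above):

```python
# Optional sanity check: verify that PyTorch detects a CUDA GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```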
- The config file for LLMs and tasks:
configs/LLM_config.yaml
- The config file for tools:
configs/all_updated_model_config.yaml
We use LLaMA2-7B for inference and reflection, and you can also try other LLMs, such as GPT-3.5-turbo.
We use GPT-3.5-turbo for the LIST tool and for collecting images from the Internet. Set your API key in the config file configs/LLM_config.yaml.
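If you want a quick sanity check that the config still parses after editing it, you can load it with PyYAML; the field that stores the API key depends on the file itself, so no key name is assumed here:

```python
# Load configs/LLM_config.yaml and list its top-level keys so you can spot
# the field that holds the OpenAI API key. Field names are repo-specific.
import yaml

with open("configs/LLM_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(list(cfg.keys()))
```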
Download the GQA dataset from GQA Link, and update its path in configs/LLM_config.yaml. The GQA dataset is used for the compositional VQA task.
Download the NLVRv2 dataset from NLVRv2 Link, and update its path in configs/LLM_config.yaml. The NLVRv2 dataset is used for the multiple-image reasoning task.
Download the LVIS dataset from LVIS Link, and update its path in configs/all_updated_model_config.yaml for the LOC and SEG tools. The LVIS dataset is used as an open-vocabulary dataset to update these two tools.
We provide four demos for the VQA, multiple-image reasoning, image editing, and knowledge tagging tasks.
For the VQA task
python -m torch.distributed.launch --nproc_per_node=1 gqa_demo.py
For the multiple-image reasoning task
python -m torch.distributed.launch --nproc_per_node=1 nlvr_demo.py
For the image editing task
python -m torch.distributed.launch --nproc_per_node=1 imgedit_demo.py
For the knowledge tagging task
python -m torch.distributed.launch --nproc_per_node=1 knowtag_demo.py
- If the tool is learnable (e.g., a neural network), add a separate Python file in the `tools` folder that defines a class for this tool with `execute` and `update` functions. The following is an example of the LOC tool.
class LocInterpreter():
    step_name = 'LOC'
    def __init__(self):
        ...
    def parse(self, prog_step):
        ...
    def execute(self, prog_step):
        ...
    def update(self, query):
        ...
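For intuition about the `execute`/`update` split, here is a minimal, hypothetical sketch that uses a toy torch model in place of the real localization model; the `prog_step` and `query` formats below are placeholders, not the actual CLOVA interfaces:

```python
# Hypothetical sketch of a learnable tool. The real LOC tool wraps an actual
# detection model; the interfaces below are simplified placeholders.
import torch
import torch.nn as nn

class ToyLearnableTool():
    step_name = 'TOY'

    def __init__(self):
        self.model = nn.Linear(8, 2)  # stand-in for the real model
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=1e-3)

    def parse(self, prog_step):
        ...  # parse the program step string into arguments

    def execute(self, prog_step):
        ...  # run the model on the parsed inputs and return/store the result

    def update(self, query):
        # `query` is assumed to be a list of (input, label) tensor pairs
        # collected during the reflection/learning phases.
        self.model.train()
        for x, y in query:
            loss = nn.functional.cross_entropy(self.model(x), y)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
```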
- If the tool is not learnable, add a class in `tools/unupdated_functions.py`, including `parse` and `execute` functions. The following is an example of the COUNT tool.
class CountInterpreter():
    step_name = 'COUNT'
    def __init__(self):
        ...
    def parse(self, prog_step):
        ...
    def execute(self, prog_step, inspect=False):
        ...
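For intuition, here is a minimal, self-contained sketch of what a non-learnable COUNT tool could do; the step-string format and the dictionary-based program state are assumptions for illustration, not the exact interfaces used in `tools/unupdated_functions.py`:

```python
# Hypothetical sketch of a COUNT tool over a dict-based program state.
import re

class ToyCountInterpreter():
    step_name = 'COUNT'

    def parse(self, prog_step):
        # e.g. "ANSWER0=COUNT(box=BOX0)" -> ("ANSWER0", "BOX0")
        return re.match(r'(\w+)=COUNT\(box=(\w+)\)', prog_step.strip()).groups()

    def execute(self, prog_step, state, inspect=False):
        output_var, box_var = self.parse(prog_step)
        state[output_var] = len(state[box_var])  # count boxes from an earlier step
        return state[output_var]

# Toy usage:
state = {'BOX0': [(10, 20, 50, 60), (30, 40, 80, 90)]}
print(ToyCountInterpreter().execute('ANSWER0=COUNT(box=BOX0)', state))  # 2
```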
- Add the tool instruction to the `prompt_engineering.py` file, and add several in-context examples to the corresponding `experience_pool.py` file in the `prompts` folder.
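To make this last step concrete, the snippet below shows the kind of content involved; the exact variable names and formats inside `prompt_engineering.py` and `experience_pool.py` are assumptions for illustration only:

```python
# Hypothetical tool instruction and in-context example; the real formats in
# prompts/prompt_engineering.py and prompts/experience_pool.py may differ.

# Tool instruction: tells the LLM planner what the tool does and how to call it.
COUNT_INSTRUCTION = (
    "COUNT: counts the boxes in a box list. Usage: ANSWER=COUNT(box=BOX)"
)

# In-context example: a question paired with a program that uses the tool.
COUNT_EXAMPLE = """Question: How many dogs are in the image?
Program:
BOX0=LOC(image=IMAGE, object='dog')
ANSWER0=COUNT(box=BOX0)
FINAL_RESULT=RESULT(var=ANSWER0)"""
```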
CLOVA can also update its tools based on human feedback.
zhi.gao@pku.edu.cn, gaozhi_2017@126.com
We thank the following codebases.
- VISPROG: The architecture of our code is based on VISPROG (CVPR 2023 best paper). Don't forget to check out this great open-source work if you haven't seen it before!
- BLIP: Our prompt tuning scheme was first developed and evaluated on the BLIP codebase.
- Textual Inversion: The prompt tuning scheme for Stable Diffusion is based on Textual Inversion.
Code: Apache license
If you find this code useful in your research, please consider citing:
@inproceedings{gao2024clova,
title={CLOVA: A closed-loop visual assistant with tool usage and update},
author={Gao, Zhi and Du, Yuntao and Zhang, Xintong and Ma, Xiaojian and Han, Wenjuan and Zhu, Song-Chun and Li, Qing},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)},
year={2024}
}