LVIS-INSTRUCT4V

Introduction

We introduce a fine-grained visual instruction dataset, LVIS-INSTRUCT4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Please refer to the arXiv paper for more details.

Usage

Please follow the LLaVA repository to set up the codebase.

LVIS-INSTRUCT4V is available at LVIS-INSTRUCT4V. To achieve better results on the QA benchmarks, we follow LLaVA 1.5 and mix LVIS-INSTRUCT4V with academic-task-related data (see Tables 1 and 7 in the LLaVA 1.5 paper). The resulting mixtures can be found at LVIS-INSTRUCT4V-Nodetail-mix619k, LVIS-INSTRUCT4V-mix730k, and LVIS-Instruct4V-LLaVA-Instruct-mix880k.
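
If you want to build a custom mixture instead of using the pre-mixed files, the merge itself is straightforward. The sketch below is only an illustration: it assumes the data follows LLaVA's JSON conversation format, and the file names are placeholders, not the actual names on the release page.

```python
import json
import random

# Placeholder file names; the actual names in the release may differ.
LVIS_INSTRUCT4V = "lvis_instruct4v_220k.json"  # assumed LVIS-INSTRUCT4V file
ACADEMIC_DATA = "academic_task_data.json"      # assumed academic-task data file

def load_instructions(path):
    # LLaVA-style instruction files are JSON lists of samples, each with
    # "id", "image", and "conversations" keys.
    with open(path) as f:
        return json.load(f)

# Concatenate the two sources and shuffle to form a mixed training set.
mixed = load_instructions(LVIS_INSTRUCT4V) + load_instructions(ACADEMIC_DATA)
random.shuffle(mixed)

with open("lvis_instruct4v_mix.json", "w") as f:
    json.dump(mixed, f)

print(f"wrote {len(mixed)} samples")
```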

Model Zoo

| Version | Data | Size | Schedule | Checkpoint | VQAv2 | GQA | VizWiz | SQA | T-VQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5 | LVIS-Instruct4V-Nodetail-mix619k | 7B | full_ft-1e | LVIS-Instruct4V-Nodetail-mix619k-7b | 79.2 | 62.6 | 52.5 | 68.4 | 57.6 | 84.0 | 1472.9 | 67.1 | 60.0 | 60.8 | 70.4 | 34.6 |
| LLaVA-1.5 | LVIS-Instruct4V-mix730k | 7B | full_ft-1e | LVIS-Instruct4V-mix730k-7b | 79.4 | 62.6 | 52.6 | 68.9 | 58.4 | 85.1 | 1495.3 | 66.6 | 59.6 | 60.5 | 67.1 | 33.3 |
| LLaVA-1.5 | LVIS-Instruct4V-LLaVA-Instruct-mix880k | 7B | full_ft-1e | LVIS-Instruct4V-LLaVA-Instruct-mix880k-7b | 79.6 | 62.6 | 51.8 | 68.3 | 58.7 | 86.0 | 1528.2 | 66.2 | 60.4 | 60.6 | 67.0 | 31.5 |
| LLaVA-1.5¹ | LVIS-Instruct4V-Nodetail-mix619k | 13B | full_ft-1e | LVIS-Instruct4V-Nodetail-mix619k-13b | 80.1 | 63.8 | 51.4 | 69.0 | 62.1 | 85.3 | 1572.0 | 67.8 | 61.0 | 62.5 | 76.7 | 40.2 |
| LLaVA-1.5 | LVIS-Instruct4V-LLaVA-Instruct-mix880k | 13B | full_ft-1e | LVIS-Instruct4V-LLaVA-Instruct-mix880k-13b | 80.7 | 63.6 | 57.2 | 70.6 | 62.5 | 86.0 | 1574.9 | 68.0 | 61.1 | 61.6 | 71.3 | 37.4 |
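
Since these checkpoints are full fine-tunes of LLaVA-1.5, they should load like any other LLaVA-1.5 model. A minimal loading sketch, assuming the standard utilities from the LLaVA codebase; the model path below is an assumed Hugging Face repo id, not confirmed by this README:

```python
# Minimal loading sketch using the LLaVA codebase's builder utilities.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "X2FD/LVIS-Instruct4V-Nodetail-mix619k-7b"  # assumed repo id

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,  # full_ft-1e checkpoints need no separate base model
    model_name=get_model_name_from_path(model_path),
)
```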

Reference

If you find our work useful for your research or applications, please cite using this BibTeX:

@article{wang2023instruct4v,
  title={To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning},
  author={Wang, Junke and Meng, Lingchen and Weng, Zejia and He, Bo and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2311.07574},
  year={2023}
}

Acknowledgement

We thank the authors of LLaVA for their contribution to the open-source community.

Footnotes

  1. We find TextVQA is sensitive to the number of beams, and for the 13B models we use beam = 3 on TextVQA.
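
     In practice this is just the num_beams argument at generation time. Continuing the loading sketch from the Model Zoo section (so model and the preprocessed input_ids / image_tensor are assumed to already exist, as in LLaVA's evaluation scripts):

     ```python
     # Sketch only: `model`, `input_ids`, and `image_tensor` are assumed to come
     # from the loading and preprocessing steps of LLaVA's evaluation scripts.
     output_ids = model.generate(
         input_ids,
         images=image_tensor,
         num_beams=3,          # beam = 3 for the 13B models on TextVQA
         max_new_tokens=128,
     )
     ```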
