📃 Paper | 🎮 Project Website
Text-to-Image (T2I) models and multimodal large language models (MLLMs) have been widely adopted for computer vision and multimodal learning tasks. However, such vision-language models have been shown to lack the ability to reason correctly over spatial relationships. To address this shortcoming, we develop REVISION, a framework that improves spatial fidelity in vision-language models. REVISION is a 3D rendering-based pipeline that generates spatially accurate synthetic images from textual prompts. The framework is extendable and currently supports 100+ 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also design RevQA, a question-answering benchmark for evaluating the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results indicate that rendering-based frameworks are an effective approach for developing spatially aware generative models.
Please refer to the REVISION organization on Hugging Face 🤗.
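If it helps, the snippet below sketches one way to download the released files from the Hub with `huggingface_hub`. The repository id and type are placeholders, since the exact repositories under the REVISION organization may differ.

```python
# Minimal sketch: fetching REVISION assets from the Hugging Face Hub.
# The repo_id below is a hypothetical placeholder -- replace it with the
# actual repository listed under the REVISION organization.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="REVISION/revision-assets",  # hypothetical repo id
    repo_type="dataset",                 # adjust if the target repo is a model
)
print(f"REVISION assets downloaded to: {local_dir}")
```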
`inference.py` provides a simple script that can be adapted to a given input prompt and REVISION image.
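As a rough illustration of the training-free setup that `inference.py` implements, the sketch below conditions an off-the-shelf image-to-image diffusion pipeline on a REVISION rendering. The base model id, file names, and `strength` value are illustrative assumptions, not the repository's exact configuration.

```python
# A minimal sketch of the idea behind inference.py: use a spatially accurate
# REVISION rendering as the starting image for training-free image-to-image
# generation, so the spatial layout of the render guides the final output.
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed base T2I model
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a cat to the left of a dog"                # example spatial prompt
revision_image = load_image("revision_render.png")   # rendered guidance image

# Lower strength preserves more of the REVISION layout; higher strength
# follows the text prompt more freely.
result = pipe(prompt=prompt, image=revision_image, strength=0.7).images[0]
result.save("output.png")
```

Swapping in a different prompt and its matching REVISION render is all that should be needed to reproduce the training-free guidance setup for other spatial relationships.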
@misc{chatterjee2024revisionrenderingtoolsenable,
      title={REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models},
      author={Agneet Chatterjee and Yiran Luo and Tejas Gokhale and Yezhou Yang and Chitta Baral},
      year={2024},
      eprint={2408.02231},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.02231},
}
The authors acknowledge resources and support from the Research Computing facilities at Arizona State University. This work was supported by NSF RI grants #1750082 and #2132724. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the funding agencies and employers.