
Grounding-GINA-DL23

Visual Grounding pipeline for the Deep Learning project. University of Trento, 2023

Task

Visual grounding is the process of associating linguistic information with visual content, such as images or video. It is an important task in natural language processing and computer vision, as it enables machines to understand and interpret the world in a more human-like way.

Introduction

The task at hand is to develop a model capable of visual grounding: given an image and a natural-language description, the model must localize the region of the image that the description refers to. Visual grounding is a crucial task in the fields of natural language processing and computer vision, as it allows machines to better understand and interpret the world around them. This project explores the task by building and training a deep learning model on RefCOCOg, a dataset that pairs free-form referring expressions with annotated image regions.

The model is trained on these expression-region pairs and evaluated on how accurately it localizes the region described by an expression it has not seen before.
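As a concrete illustration, the references below include CLIP, and a natural zero-shot baseline for this task is to rank candidate region proposals by their CLIP similarity to the referring expression. The sketch below is an assumption about how such a baseline could look, not the pipeline actually implemented in this repository: it presumes pre-computed proposal boxes, uses the Hugging Face transformers CLIP wrappers, and the ground helper and model checkpoint name are illustrative.

# Hypothetical zero-shot grounding baseline: rank proposal crops with CLIP.
# Not the authors' pipeline; proposal boxes are assumed to be given.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ground(image: Image.Image, expression: str, boxes):
    """Return the proposal box whose crop best matches the expression."""
    # Boxes are assumed to be (x1, y1, x2, y2) pixel coordinates.
    crops = [image.crop(tuple(b)) for b in boxes]
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_crops, 1): one similarity per crop.
    scores = out.logits_per_image.squeeze(-1)
    return boxes[scores.argmax().item()]

Pipelines built on the other cited references (BLIP and BLIP-2 via LAVIS) could replace this crop-and-score step with an image-text matching head, but the underlying ranking idea is the same.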

Through this project, we hope to contribute to the advancement of visual grounding research and to inspire further exploration of the field. Our code and results are shared publicly in this repository so that others can reproduce our work and build upon it.
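Regarding the evaluation mentioned above: grounding on RefCOCOg is commonly scored as accuracy at an intersection-over-union (IoU) threshold of 0.5, i.e. a prediction counts as correct when the predicted and ground-truth boxes overlap by at least 50%. A minimal, self-contained sketch of that metric, assuming boxes in (x1, y1, x2, y2) format:

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def accuracy_at_05(predictions, ground_truths):
    """Fraction of predicted boxes with IoU >= 0.5 vs. the ground truth."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)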

Experiments


References

@inproceedings{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      booktitle={ICML},
}
@misc{li2023blip2,
      title={BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models}, 
      author={Junnan Li and Dongxu Li and Silvio Savarese and Steven Hoi},
      year={2023},
      eprint={2301.12597},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{li2022lavis,
      title={LAVIS: A Library for Language-Vision Intelligence}, 
      author={Dongxu Li and Junnan Li and Hung Le and Guangsen Wang and Silvio Savarese and Steven C. H. Hoi},
      year={2022},
      eprint={2209.09019},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{radford2021learning,
      title={Learning Transferable Visual Models From Natural Language Supervision}, 
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

To cite us

Please cite us if you use this work in your research:

@misc{CoppariTedoldi2023,
      title={VisualGrounding:...},
      author={Andrea Coppari and Riccardo Tedoldi},
      year={2023},
      url={https://github.com/andreacoppari/visual-grounding-DL23}
}
