
Grounding-GINA-DL23

Visual Grounding pipeline for the Deep Learning project. University of Trento, 2023

Task

Visual grounding is the process of associating linguistic information with visual content, such as images or video. It is an important task in natural language processing and computer vision, as it enables machines to understand and interpret the world in a more human-like way.

Introduction

The task at hand is to develop a model capable of visual grounding: given an image and a natural-language description, the model must localize the region of the image that the description refers to. Visual grounding is a crucial task in the fields of natural language processing and computer vision, as it allows machines to better understand and interpret the world around them. This project explores the task by building and training a deep learning model on RefCOCOg, a dataset that pairs free-form referring expressions with annotated image regions.

The model is trained on these expression-region pairs and evaluated on how accurately it localizes the region described by an expression it has not seen before.
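As a concrete illustration, the references below include CLIP, and a natural zero-shot baseline for this task is to rank candidate region proposals by their CLIP similarity to the referring expression. The sketch below is an assumption about how such a baseline could look, not the pipeline actually implemented in this repository: it presumes pre-computed proposal boxes, uses the Hugging Face transformers CLIP wrappers, and the ground helper and model checkpoint name are illustrative.

# Hypothetical zero-shot grounding baseline: rank proposal crops with CLIP.
# Not the authors' pipeline; proposal boxes are assumed to be given.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ground(image: Image.Image, expression: str, boxes):
    """Return the proposal box whose crop best matches the expression."""
    # Boxes are assumed to be (x1, y1, x2, y2) pixel coordinates.
    crops = [image.crop(tuple(b)) for b in boxes]
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_crops, 1): one similarity per crop.
    scores = out.logits_per_image.squeeze(-1)
    return boxes[scores.argmax().item()]

Pipelines built on the other cited references (BLIP and BLIP-2 via LAVIS) could replace this crop-and-score step with an image-text matching head, but the underlying ranking idea is the same.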

Through this project, we hope to contribute to the advancement of visual grounding research and to inspire further exploration of the field. Our code and results are shared publicly in this repository so that others can reproduce our work and build upon it.
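Regarding the evaluation mentioned above: grounding on RefCOCOg is commonly scored as accuracy at an intersection-over-union (IoU) threshold of 0.5, i.e. a prediction counts as correct when the predicted and ground-truth boxes overlap by at least 50%. A minimal, self-contained sketch of that metric, assuming boxes in (x1, y1, x2, y2) format:

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def accuracy_at_05(predictions, ground_truths):
    """Fraction of predicted boxes with IoU >= 0.5 vs. the ground truth."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)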

Experiments


References

@inproceedings{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      booktitle={ICML},
}
@misc{li2023blip2,
      title={BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models}, 
      author={Junnan Li and Dongxu Li and Silvio Savarese and Steven Hoi},
      year={2023},
      eprint={2301.12597},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{li2022lavis,
      title={LAVIS: A Library for Language-Vision Intelligence}, 
      author={Dongxu Li and Junnan Li and Hung Le and Guangsen Wang and Silvio Savarese and Steven C. H. Hoi},
      year={2022},
      eprint={2209.09019},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{radford2021learning,
      title={Learning Transferable Visual Models From Natural Language Supervision}, 
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

To cite us

Please cite us if you use this work in your research:

@misc{CoppariTedoldi2023,
      title={VisualGrounding:...},
      author={Andrea Coppari and Riccardo Tedoldi},
      year={2023},
      url={https://github.com/andreacoppari/visual-grounding-DL23}
}
