Promoting Coherence and Diversity in Image Captioning
This repository includes the reference code for conventional diverse image captioning models and CLIP-CVAE.
- Python 3.8
- PyTorch 1.9
- transformers 4.12
To run the code, the annotations and images of the COCO dataset are needed. Please download the zip files containing the images (train2014.zip, val2014.zip) and the zip file containing the annotations (annotations_trainval2014.zip), then extract them. The paths to the extracted files will be passed as command-line arguments later.
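Once extracted, the caption annotations are plain JSON in the standard COCO format (an `images` list keyed by `id` and an `annotations` list keyed by `image_id`). As a minimal sketch of how they can be grouped per image, the snippet below uses a tiny inline dict mimicking `annotations/captions_train2014.json`; in practice you would `json.load` the real file from the path you extracted to.

```python
import json
from collections import defaultdict

# Tiny inline stand-in for annotations/captions_train2014.json
# (same structure as the real COCO captions file; values are illustrative).
coco = {
    "images": [
        {"id": 1, "file_name": "COCO_train2014_000000000001.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "caption": "A dog running on the beach."},
        {"image_id": 1, "caption": "A brown dog plays in the sand."},
    ],
}
# With the real file you would instead do:
# with open("annotations/captions_train2014.json") as f:
#     coco = json.load(f)

# Map image_id -> list of reference captions for that image.
captions_by_image = defaultdict(list)
for ann in coco["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

for img in coco["images"]:
    print(img["file_name"], len(captions_by_image[img["id"]]), "captions")
```

Each COCO image typically has five reference captions; grouping them by `image_id` up front is the usual first step before building training pairs.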
This repository builds on code released on GitHub and Hugging Face. Thanks to the respective authors for releasing their code.