
Exploring Discrete Diffusion Models for Image Captioning.

Training prerequisites

You can use Docker, or create an environment and install the dependencies:

conda env create -f environment.yml

or

bash install_req.sh

or

pip install -r requirements.txt

COCO training

Download train_captions.

Download the training and validation images and unzip them (we use the Karpathy et al. split).

Download oscar_split_ViT-B_32_train_512.pkl and place it in ./data/coco/.

Microsoft COCO

│MSCOCO_Caption/
├──annotations/
│  ├── captions_train2014.json
│  ├── captions_val2014.json
├──train2014/
│  ├── COCO_train2014_000000000009.jpg
│  ├── ......
├──val2014/ 
│  ├── COCO_val2014_000000000042.jpg
│  ├── ......
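
As a quick sanity check before training, you can confirm that the annotations and image folders match the layout above. This is a minimal sketch, not part of the repository; adjust root to wherever you unzipped the data.

# Hypothetical sanity check (not part of this repository): verify the
# MSCOCO_Caption layout before training.
import json
import os

root = "MSCOCO_Caption"  # adjust to your data path

for split in ("train2014", "val2014"):
    ann_path = os.path.join(root, "annotations", f"captions_{split}.json")
    with open(ann_path) as f:
        ann = json.load(f)
    n_images = len(os.listdir(os.path.join(root, split)))
    print(f"{split}: {len(ann['annotations'])} captions, {n_images} image files")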

Prepare evaluation

Change to the working directory and set up the evaluation code:

cd ./captioneval/coco_caption
bash ./get_stanford_models.sh

Run

MKL_THREADING_LAYER=GNU python -m torch.distributed.launch --nproc_per_node 8 train.py --out_dir /results_diff --tag caption_diff_vitb16

If you want to train the model with a trainable CLIP image encoder, use the following command:

MKL_THREADING_LAYER=GNU python -m torch.distributed.launch --nproc_per_node 8 train_tclip.py --out_dir /results_diff --tag caption_diff_vitb16

Please note that we detach the gradients of the [CLS] token while training the CLIP model, because we observe that when the image encoder (CLIP) is trainable, backpropagating gradients through the [CLS] token harms the training of the image encoder.
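
A minimal sketch of this idea is shown below. It is not the repository's exact code; the tensor layout assumes a ViT-style CLIP encoder that returns the [CLS] token at index 0 followed by the patch tokens.

# Illustrative sketch only (assumed tensor layout, not the repository's exact code):
# stop gradients at the [CLS] token while leaving the patch tokens trainable.
import torch

def detach_cls_token(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, 1 + num_patches, dim) from a ViT-style CLIP image encoder,
    # with the [CLS] token at index 0.
    cls_token = tokens[:, :1, :].detach()   # no gradient flows back through [CLS]
    patch_tokens = tokens[:, 1:, :]         # patch tokens keep their gradients
    return torch.cat([cls_token, patch_tokens], dim=1)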

Citation

If you use this code for your research, please cite:

@article{zhu2022exploring,
  title={Exploring Discrete Diffusion Models for Image Captioning},
  author={Zhu, Zixin and Wei, Yixuan and Wang, Jianfeng and Gan, Zhe and Zhang, Zheng and Wang, Le and Hua, Gang and Wang, Lijuan and Liu, Zicheng and Hu, Han},
  journal={arXiv preprint arXiv:2211.11694},
  year={2022}
}

Acknowledgments

This repository is heavily based on the CLIP, CLIP_prefix_caption, and Hugging Face repositories. For training we use the COCO dataset.