This is the final project for EE5438 (Applied Deep Learning). The image captioning model is based on CLIP and GPT-2 and consists of four main modules. First, the image encoder module extracts image features and returns their embedding. Then, the mapping module maps the image embedding into the GPT-2 embedding space. Next, the text decoder module turns that embedding into a caption. Finally, the caption generation module ties the pieces together and generates a caption for a given image. The structure of the model is shown in the picture below.
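For reference, the mapping module is typically a small MLP that projects the CLIP image embedding into a short sequence of GPT-2-sized embeddings (a "prefix") that the text decoder consumes. The sketch below illustrates that wiring with the Hugging Face transformers GPT-2; the class name, dimensions, and prefix length are illustrative assumptions, not this project's exact code:

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class MappingNetwork(nn.Module):
    # Projects one CLIP image embedding to prefix_len GPT-2 token embeddings.
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_embed):            # (batch, clip_dim)
        prefix = self.mlp(clip_embed)         # (batch, gpt_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

gpt = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = MappingNetwork()
image_embed = torch.randn(1, 512)             # stand-in for a CLIP image embedding
prefix = mapper(image_embed)                  # (1, 10, 768) GPT-2 prefix
logits = gpt(inputs_embeds=prefix).logits     # (1, 10, 50257) next-token logits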
The Flickr30k dataset is downloaded from Kaggle; to get the original dataset, see:
https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset
Install requirements:
pip install -r requirements.txt
To run the main program (prediction only):
python main.py
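Internally, prediction follows roughly this pipeline: encode the image with CLIP, map the embedding to a GPT-2 prefix, then decode tokens autoregressively. The sketch below assumes the openai/clip-vit-base-patch32 and gpt2 checkpoints plus the MappingNetwork from the sketch above; the image path and decoding length are placeholders, and the real main.py may differ:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
gpt = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
mapper = MappingNetwork()                     # load trained weights in practice

image = Image.open("example.jpg")             # hypothetical input image
pixels = proc(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    clip_embed = clip.get_image_features(pixel_values=pixels)  # (1, 512)
    embeds = mapper(clip_embed)               # GPT-2 prefix, (1, 10, 768)
    tokens = []
    for _ in range(30):                       # greedy decoding, no KV cache
        logits = gpt(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)       # most likely next token
        if next_id.item() == tok.eos_token_id:
            break
        tokens.append(next_id.item())
        embeds = torch.cat([embeds, gpt.transformer.wte(next_id).unsqueeze(1)], dim=1)
print(tok.decode(tokens))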
To train your own model:
python train.py
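The training objective is standard next-token cross-entropy on the caption tokens, conditioned on the mapped image prefix. Below is a single illustrative step, assuming precomputed CLIP embeddings and the MappingNetwork sketch above; the batch contents and hyperparameters are dummies, not the project's settings:

import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
mapper = MappingNetwork()                     # from the architecture sketch
opt = torch.optim.AdamW(list(mapper.parameters()) + list(gpt.parameters()), lr=2e-5)

clip_embeds = torch.randn(4, 512)             # batch of precomputed CLIP embeddings
caps = tok(["a dog runs"] * 4, return_tensors="pt").input_ids  # (4, T) dummy captions

prefix = mapper(clip_embeds)                  # (4, prefix_len, 768)
cap_embeds = gpt.transformer.wte(caps)        # (4, T, 768)
inputs = torch.cat([prefix, cap_embeds], dim=1)
logits = gpt(inputs_embeds=inputs).logits
# Positions prefix_len-1 .. end-1 predict the caption tokens caps[0..T-1].
pred = logits[:, mapper.prefix_len - 1 : -1, :]
loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), caps.reshape(-1))
loss.backward()
opt.step()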
Note that the model is trained on the Flickr30k dataset, with the raw data downloaded from Kaggle (see the link above). To preprocess the raw data into a .pkl file, run:
python utils.py
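Conceptually, preprocessing pairs each image's CLIP embedding with its captions and pickles the result so training does not have to re-encode images. The sketch below assumes the Kaggle layout (a pipe-separated results.csv with image_name and comment columns next to a flickr30k_images/ folder); the actual utils.py may organize the output differently:

import pickle
import pandas as pd
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

df = pd.read_csv("results.csv", sep="|")      # Kaggle caption file (assumed layout)
df.columns = [c.strip() for c in df.columns]

records = []
for name, group in df.groupby("image_name"):
    image = Image.open(f"flickr30k_images/{name}")
    pixels = proc(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        embed = clip.get_image_features(pixel_values=pixels).squeeze(0)  # (512,)
    records.append({"image": name,
                    "embedding": embed,
                    "captions": [c.strip() for c in group["comment"]]})

with open("flickr30k_clip.pkl", "wb") as f:   # consumed later by train.py
    pickle.dump(records, f)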
For more details on using the program, see baojudezeze.