This is the final project for EE5438 (Applied Deep Learning). The image captioning model is based on CLIP and GPT-2 and consists of four main modules. First, the image encoder module extracts image features and returns their embedding. Then, the mapping module maps the image embedding into the GPT-2 embedding space. Next, the text decoder module turns that embedding into a caption. Finally, the caption generation module ties the pieces together and generates a caption for a given image. The structure of the model is shown in the picture below.
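For reference, the mapping module is typically a small MLP that projects the CLIP image embedding into a short sequence of GPT-2-sized embeddings (a "prefix") that the text decoder consumes. The sketch below illustrates that wiring with the Hugging Face transformers GPT-2; the class name, dimensions, and prefix length are illustrative assumptions, not this project's exact code:

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class MappingNetwork(nn.Module):
    # Projects one CLIP image embedding to prefix_len GPT-2 token embeddings.
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_embed):            # (batch, clip_dim)
        prefix = self.mlp(clip_embed)         # (batch, gpt_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

gpt = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = MappingNetwork()
image_embed = torch.randn(1, 512)             # stand-in for a CLIP image embedding
prefix = mapper(image_embed)                  # (1, 10, 768) GPT-2 prefix
logits = gpt(inputs_embeds=prefix).logits     # (1, 10, 50257) next-token logits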
The Flickr30k dataset is downloaded from Kaggle; to get the original dataset, see:
https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset
Install requirements:
pip install -r requirements.txt
To run the main program (prediction only):
python main.py
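Internally, prediction follows roughly this pipeline: encode the image with CLIP, map the embedding to a GPT-2 prefix, then decode tokens autoregressively. The sketch below assumes the openai/clip-vit-base-patch32 and gpt2 checkpoints plus the MappingNetwork from the sketch above; the image path and decoding length are placeholders, and the real main.py may differ:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
gpt = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
mapper = MappingNetwork()                     # load trained weights in practice

image = Image.open("example.jpg")             # hypothetical input image
pixels = proc(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    clip_embed = clip.get_image_features(pixel_values=pixels)  # (1, 512)
    embeds = mapper(clip_embed)               # GPT-2 prefix, (1, 10, 768)
    tokens = []
    for _ in range(30):                       # greedy decoding, no KV cache
        logits = gpt(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)       # most likely next token
        if next_id.item() == tok.eos_token_id:
            break
        tokens.append(next_id.item())
        embeds = torch.cat([embeds, gpt.transformer.wte(next_id).unsqueeze(1)], dim=1)
print(tok.decode(tokens))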
To train your own model:
python train.py
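The training objective is standard next-token cross-entropy on the caption tokens, conditioned on the mapped image prefix. Below is a single illustrative step, assuming precomputed CLIP embeddings and the MappingNetwork sketch above; the batch contents and hyperparameters are dummies, not the project's settings:

import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
mapper = MappingNetwork()                     # from the architecture sketch
opt = torch.optim.AdamW(list(mapper.parameters()) + list(gpt.parameters()), lr=2e-5)

clip_embeds = torch.randn(4, 512)             # batch of precomputed CLIP embeddings
caps = tok(["a dog runs"] * 4, return_tensors="pt").input_ids  # (4, T) dummy captions

prefix = mapper(clip_embeds)                  # (4, prefix_len, 768)
cap_embeds = gpt.transformer.wte(caps)        # (4, T, 768)
inputs = torch.cat([prefix, cap_embeds], dim=1)
logits = gpt(inputs_embeds=inputs).logits
# Positions prefix_len-1 .. end-1 predict the caption tokens caps[0..T-1].
pred = logits[:, mapper.prefix_len - 1 : -1, :]
loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), caps.reshape(-1))
loss.backward()
opt.step()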
Note that the model is trained on the Flickr30k dataset, with the raw data downloaded from Kaggle (see the link above). To preprocess the raw data into a .pkl file, run:
python utils.py
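Conceptually, preprocessing pairs each image's CLIP embedding with its captions and pickles the result so training does not have to re-encode images. The sketch below assumes the Kaggle layout (a pipe-separated results.csv with image_name and comment columns next to a flickr30k_images/ folder); the actual utils.py may organize the output differently:

import pickle
import pandas as pd
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

df = pd.read_csv("results.csv", sep="|")      # Kaggle caption file (assumed layout)
df.columns = [c.strip() for c in df.columns]

records = []
for name, group in df.groupby("image_name"):
    image = Image.open(f"flickr30k_images/{name}")
    pixels = proc(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        embed = clip.get_image_features(pixel_values=pixels).squeeze(0)  # (512,)
    records.append({"image": name,
                    "embedding": embed,
                    "captions": [c.strip() for c in group["comment"]]})

with open("flickr30k_clip.pkl", "wb") as f:   # consumed later by train.py
    pickle.dump(records, f)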
For more details on using the program, see baojudezeze.