
# EE5438 Project

Image caption generation based on deep learning

This is the final project for EE5438 (Applied Deep Learning). The image caption model is built on CLIP and GPT-2 and consists of four main modules. First, the image encoder extracts image features and returns their embedding. Second, the mapping module projects the image embedding into the GPT-2 embedding space. Third, the text decoder turns those embeddings into a caption. Finally, the caption generation module ties the pieces together and generates a caption for a given image. The structure of the model is shown in the picture below:

(Model architecture diagram)
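The four modules above can be sketched as a minimal pipeline. This is only an illustration using NumPy with assumed dimensions (512-d CLIP image embedding, 768-d GPT-2 token embedding, a prefix length of 10) and random projection matrices standing in for the learned encoder, mapping network, and decoder:

```python
import numpy as np

CLIP_DIM, GPT2_DIM, PREFIX_LEN = 512, 768, 10  # assumed dimensions

def image_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for CLIP: reduce an image to a CLIP_DIM embedding."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((image.size, CLIP_DIM)) / np.sqrt(image.size)
    return image.ravel() @ proj

def mapping_module(clip_emb: np.ndarray) -> np.ndarray:
    """Map one CLIP embedding to PREFIX_LEN GPT-2 prefix embeddings."""
    rng = np.random.default_rng(1)
    W = rng.standard_normal((CLIP_DIM, PREFIX_LEN * GPT2_DIM)) / np.sqrt(CLIP_DIM)
    return (clip_emb @ W).reshape(PREFIX_LEN, GPT2_DIM)

def text_decoder(prefix: np.ndarray) -> str:
    """Stand-in for GPT-2: would autoregressively decode from the prefix."""
    return f"<caption conditioned on {prefix.shape[0]} prefix embeddings>"

def generate_caption(image: np.ndarray) -> str:
    """Caption generation module: glue the other three modules together."""
    return text_decoder(mapping_module(image_encoder(image)))

caption = generate_caption(np.zeros((8, 8, 3)))
```

In the real model each random matrix is replaced by a trained network, and the decoder samples tokens from GPT-2 conditioned on the mapped prefix.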

## Dataset

The Flickr30k dataset is downloaded from Kaggle. To get the original dataset, see:

https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset

## Usage

Install requirements:

```
pip install -r requirements.txt
```

To run the main program (prediction only):

```
python main.py
```

To train your own model:

```
python train.py
```

Note that the project is trained on the Flickr30k dataset, and the raw data is downloaded from Kaggle. To preprocess the raw data into a `.pkl` file, run:

```
python utils.py
```
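The exact behavior of `utils.py` is not shown here, but the preprocessing step can be sketched as grouping each Flickr30k image with its captions and pickling the result. The file name `results.csv` and its `|`-delimited layout below are assumptions based on the Kaggle release of the dataset:

```python
import csv
import pickle
from collections import defaultdict

def preprocess(captions_csv: str, out_pkl: str) -> dict:
    """Group captions by image name and dump the mapping to a .pkl file."""
    data = defaultdict(list)
    with open(captions_csv, newline="", encoding="utf-8") as f:
        # Assumed layout: image_name| comment_number| comment
        for row in csv.reader(f, delimiter="|"):
            if len(row) >= 3 and row[0].strip() != "image_name":
                data[row[0].strip()].append(row[2].strip())
    with open(out_pkl, "wb") as f:
        pickle.dump(dict(data), f)
    return dict(data)
```

Running this on the downloaded caption file would produce the `.pkl` consumed by training.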

## Example results

(Three example images with generated captions)

To use our program, see baojudezeze.
