
EE5438 Project

Image caption generation based on deep learning

This is the final project for EE5438 (Applied Deep Learning). The image captioning model is built on CLIP and GPT-2 and consists of four main modules. First, the image encoder extracts image features and returns their embedding. Then, the mapping module maps the image embedding into the GPT-2 embedding space. Next, the text decoder processes the embedding into a caption. Finally, the caption generation module ties the pieces together and generates a caption for an input image. The structure of the model is shown in the picture below:

(Model architecture diagram)

Dataset

The Flickr30k dataset was downloaded from Kaggle. To get the original dataset, see:

https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset

Usage

Install requirements:

pip install -r requirements.txt

To run the main program (inference only):

python main.py

To train your own model:

python train.py

Note that the model is trained on the Flickr30k dataset, with the raw data downloaded from Kaggle. To preprocess the raw data into a pkl file, run:

python utils.py

Example results

(Example result images 1-3)

To use our program, see baojudezeze

