
VQA_Transformer

This repo contains the implementation of a Transformer-based model for Visual Question Answering (VQA): given an input image and a question about that image, the model generates an appropriate answer.

Architecture

The input questions are passed through an embedding layer and a Transformer encoder, as shown below.
[Figure: Transformer encoder architecture]
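A minimal sketch of the question-encoding path, assuming TensorFlow/Keras; the layer sizes (d_model, num_heads, dff, vocab_size, ques_seq_length) are illustrative, not the repo's exact values:

```python
import tensorflow as tf

d_model, num_heads, dff = 128, 4, 512       # illustrative hyperparameters
vocab_size, ques_seq_length = 10000, 20     # illustrative vocabulary/sequence sizes

class EncoderLayer(tf.keras.layers.Layer):
    """One Transformer encoder block: self-attention + feed-forward,
    each with a residual connection and layer normalization."""
    def __init__(self):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        x = self.norm1(x + self.mha(x, x))   # self-attention over question tokens
        return self.norm2(x + self.ffn(x))

# Embed the tokenized question, then encode it.
# (Positional encoding is omitted here for brevity.)
questions = tf.keras.Input(shape=(ques_seq_length,), dtype=tf.int32)
x = tf.keras.layers.Embedding(vocab_size, d_model)(questions)
x = EncoderLayer()(x)   # -> (batch_size, ques_seq_length, d_model)
```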

The encoder output of shape (batch_size, ques_seq_length, d_model) is average-pooled along the temporal (token) dimension. The decoder takes this pooled attention vector together with the input image features (extracted by VGG16 with its top classification layer removed), passes them through a Bahdanau attention mechanism followed by GRU layers, and produces an output sequence representing the answer to the question.
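A sketch of the pooling step and the VGG16 feature extraction, again assuming TensorFlow/Keras; the batch size and image resolution are illustrative:

```python
import tensorflow as tf

# Encoder output (batch_size, ques_seq_length, d_model) -> (batch_size, d_model):
# average-pool across the token (temporal) axis.
encoder_out = tf.random.normal((8, 20, 128))
question_vec = tf.reduce_mean(encoder_out, axis=1)

# VGG16 without its top classification layers yields a spatial feature map,
# flattened into a set of image "locations" for the decoder to attend over.
vgg = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
images = tf.random.normal((8, 224, 224, 3))
img_features = vgg(images)                             # (8, 7, 7, 512)
img_features = tf.reshape(img_features, (8, -1, 512))  # (8, 49, 512)
```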

The decoder is based on the Bahdanau-attention seq2seq decoder used in many text-generation tasks, as shown below:
[Figure: Bahdanau attention mechanism]
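A minimal sketch of Bahdanau (additive) attention and a single GRU decoder step, in the style of the classic seq2seq tutorials this decoder follows; `units` and the other dimensions are illustrative assumptions:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: scores each image location against the
    decoder's hidden state, then returns a weighted context vector."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects image features
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # scalar score per location

    def call(self, features, hidden):
        # features: (batch, num_locations, depth); hidden: (batch, units)
        hidden = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden)))
        weights = tf.nn.softmax(scores, axis=1)
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights

# One decoder step: attend over the image features, then feed the context
# vector concatenated with the previous answer token into a GRU.
gru = tf.keras.layers.GRU(256, return_state=True)
embed = tf.keras.layers.Embedding(10000, 256)

features = tf.random.normal((8, 49, 512))      # flattened VGG16 features
hidden = tf.zeros((8, 256))                    # initial decoder state
prev_token = tf.zeros((8,), dtype=tf.int32)    # previous answer token

context, _ = BahdanauAttention(256)(features, hidden)
x = tf.concat([context, embed(prev_token)], axis=-1)
out, hidden = gru(tf.expand_dims(x, 1), initial_state=hidden)
```

At inference time this step would be run in a loop, feeding each predicted token back in as `prev_token` until an end-of-sequence token is produced.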

The resulting model achieves accuracy comparable to recent VQA implementations while being more lightweight.

Usage

To train the model:

$ python3 main.py

The final sections of main.py contain the inference code, which can either be run selectively or imported into another Python file.

Working

Input Image:
[Figure: example input image, COCO_train2014_000000027511]

Input Question and Output Answer:
[Figure: example input question and the model's output answer]
