
VQA_Transformer

This repo contains the implementation of a Transformer-based model for Visual Question Answering (VQA): given an input image and a question about that image, the model generates an appropriate answer.

Architecture

The input questions are passed through an embedding layer and a Transformer encoder, as shown below.
[Figure: Transformer encoder architecture]
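A minimal sketch of the question-encoding path, assuming TensorFlow/Keras; the layer sizes (d_model, num_heads, dff, vocab_size, ques_seq_length) are illustrative, not the repo's exact values:

```python
import tensorflow as tf

d_model, num_heads, dff = 128, 4, 512       # illustrative hyperparameters
vocab_size, ques_seq_length = 10000, 20     # illustrative vocabulary/sequence sizes

class EncoderLayer(tf.keras.layers.Layer):
    """One Transformer encoder block: self-attention + feed-forward,
    each with a residual connection and layer normalization."""
    def __init__(self):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        x = self.norm1(x + self.mha(x, x))   # self-attention over question tokens
        return self.norm2(x + self.ffn(x))

# Embed the tokenized question, then encode it.
# (Positional encoding is omitted here for brevity.)
questions = tf.keras.Input(shape=(ques_seq_length,), dtype=tf.int32)
x = tf.keras.layers.Embedding(vocab_size, d_model)(questions)
x = EncoderLayer()(x)   # -> (batch_size, ques_seq_length, d_model)
```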

The encoder output of shape (batch_size, ques_seq_length, d_model) is average-pooled along the temporal (token) dimension. The decoder takes this pooled attention vector together with the input image features (extracted by VGG16 with its top classification layer removed), passes them through a Bahdanau attention mechanism followed by GRU layers, and produces an output sequence representing the answer to the question.
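A sketch of the pooling step and the VGG16 feature extraction, again assuming TensorFlow/Keras; the batch size and image resolution are illustrative:

```python
import tensorflow as tf

# Encoder output (batch_size, ques_seq_length, d_model) -> (batch_size, d_model):
# average-pool across the token (temporal) axis.
encoder_out = tf.random.normal((8, 20, 128))
question_vec = tf.reduce_mean(encoder_out, axis=1)

# VGG16 without its top classification layers yields a spatial feature map,
# flattened into a set of image "locations" for the decoder to attend over.
vgg = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
images = tf.random.normal((8, 224, 224, 3))
img_features = vgg(images)                             # (8, 7, 7, 512)
img_features = tf.reshape(img_features, (8, -1, 512))  # (8, 49, 512)
```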

The decoder is based on the Bahdanau-attention seq2seq decoder used in many text-generation tasks, as shown below:
[Figure: Bahdanau attention mechanism]
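A minimal sketch of Bahdanau (additive) attention and a single GRU decoder step, in the style of the classic seq2seq tutorials this decoder follows; `units` and the other dimensions are illustrative assumptions:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: scores each image location against the
    decoder's hidden state, then returns a weighted context vector."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects image features
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # scalar score per location

    def call(self, features, hidden):
        # features: (batch, num_locations, depth); hidden: (batch, units)
        hidden = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden)))
        weights = tf.nn.softmax(scores, axis=1)
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights

# One decoder step: attend over the image features, then feed the context
# vector concatenated with the previous answer token into a GRU.
gru = tf.keras.layers.GRU(256, return_state=True)
embed = tf.keras.layers.Embedding(10000, 256)

features = tf.random.normal((8, 49, 512))      # flattened VGG16 features
hidden = tf.zeros((8, 256))                    # initial decoder state
prev_token = tf.zeros((8,), dtype=tf.int32)    # previous answer token

context, _ = BahdanauAttention(256)(features, hidden)
x = tf.concat([context, embed(prev_token)], axis=-1)
out, hidden = gru(tf.expand_dims(x, 1), initial_state=hidden)
```

At inference time this step would be run in a loop, feeding each predicted token back in as `prev_token` until an end-of-sequence token is produced.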

The resulting model achieves accuracy comparable to recent VQA implementations while being more lightweight.

Usage

To train the model:

$ python3 main.py

The final sections of main.py contain the inference code, which can either be run selectively or imported into another Python file.

Working

Input Image:
[Figure: example input image, COCO_train2014_000000027511]

Input Question and Output Answer:
[Figure: example input question and the model's output answer]
