[Reimplementation Antol et al 2015] Keras-based LSTM/CNN models for Visual Question Answering

Deep Learning for Visual Question Answering

Click here to go to the accompanying blog post.

This project uses Keras to train a variety of Feedforward and Recurrent Neural Networks for the task of Visual Question Answering. It is designed to work with the VQA dataset.

Models Implemented:

  1. BOW+CNN Model
  2. LSTM+CNN Model


Requirements

  1. Keras 0.20
  2. spaCy 0.94
  3. scikit-learn 0.16
  4. progressbar
  5. Nvidia CUDA 7.5 (optional, for GPU acceleration)
  6. Caffe (Optional)

Tested with Python 2.7 on Ubuntu 14.04 and CentOS 7.1.


Notes:

  1. Keras needs the latest Theano, which in turn needs NumPy/SciPy.
  2. spaCy is currently used only for converting questions to a vector (or a sequence of vectors); this dependency can easily be removed if you want.
  3. spaCy uses Goldberg and Levy's word vectors by default, but I found performance to be much better with Stanford's GloVe word vectors.
  4. VQA Tools is not needed.
  5. Caffe (Optional) - For using the VQA with your own images.
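
Note 2 above mentions converting a question to a vector or to a sequence of vectors. The sketch below illustrates both encodings in plain Python, without spaCy. The `toy_vectors` table, its 3-dimensional vectors, the whitespace tokenizer, and the function names are all hypothetical stand-ins for illustration; summing the word vectors is assumed here as the bag-of-words pooling and may differ from what the project's scripts do.

```python
# Hypothetical toy embedding table; real GloVe vectors are 300-dimensional.
toy_vectors = {
    "what": [1.0, 0.0, 0.0],
    "color": [0.0, 1.0, 0.0],
    "is": [0.0, 0.0, 1.0],
}


def question_to_sequence(question, vectors, dim):
    """Return one vector per token (the input shape a recurrent model expects).

    Words missing from the table map to a zero vector.
    """
    # Naive tokenizer (assumption): lowercase, drop trailing "?", split on spaces.
    tokens = question.lower().rstrip("?").split()
    return [list(vectors.get(tok, [0.0] * dim)) for tok in tokens]


def question_to_bow(question, vectors, dim):
    """Return a single fixed-size vector: the element-wise sum of token vectors."""
    total = [0.0] * dim
    for vec in question_to_sequence(question, vectors, dim):
        total = [a + b for a, b in zip(total, vec)]
    return total


if __name__ == "__main__":
    q = "What color is it?"
    print(question_to_sequence(q, toy_vectors, 3))
    print(question_to_bow(q, toy_vectors, 3))
```

The sequence form keeps word order and grows with question length, while the summed form has a fixed size regardless of length, which is why a feedforward model can consume it directly.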

Installation Guide

This project has a large number of dependencies, and I have yet to write a comprehensive installation guide. In the meantime, you can use the following guide made by @gajumaru4444:

  1. Prepare for VQA in Ubuntu 14.04 x64 Part 1
  2. Prepare for VQA in Ubuntu 14.04 x64 Part 2

If you intend to use my pre-trained models, you will also need to replace spaCy's default word vectors with the GloVe word vectors from Stanford. You can find more details here on how to do this.

Using Pre-trained models

Take a look at scripts/ An LSTM-based pre-trained model has been released. It currently works only on images from the MS COCO dataset (which must be downloaded separately), since I have pre-computed the VGG features for them. I intend to add a pipeline for computing features for other images.

Caution: Use the pre-trained model with the 300D Common Crawl GloVe word embeddings. Do not use the default spaCy embeddings (Goldberg and Levy 2014). If you try to use these pre-trained models with any embeddings other than GloVe, your results will be garbage. You can find more details here on how to do this.

Using your own images

You can now use your own images with the scripts/ script. Use it like this:

python --caffe /path/to/caffe

For now, a Caffe installation is required. However, I'm working on a Keras-based VGG Net, which should be up soon. Download the VGG Caffe model weights from here and place them in the scripts folder.

The Numbers

Performance on the validation set and the test-dev set of the VQA Challenge:

Model                 val       test-dev
LSTM-Language only    44.17%    TODO
LSTM+CNN              51.63%    53.34%

Note: For the validation results, the model was trained on the training set only. For the test-dev results, it was trained on both the training and validation sets.

There is a lot of scope for hyperparameter tuning here. Experiments were done for 100 epochs.

Training Time on various hardware:

Model       GTX 760              Intel Core i7
BOW+CNN     140 seconds/epoch    900 seconds/epoch
LSTM+CNN    200 seconds/epoch    1900 seconds/epoch

The above numbers are valid when using a batch size of 128, and training on 215K examples in every epoch.
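
To put these timings in perspective, the short calculation below derives the number of batches per epoch and a rough total training time. It is a back-of-the-envelope sketch: the figures come from the text above, but combining the per-epoch timings with the 100-epoch experiment length is an assumption, and the function names are made up for this example.

```python
BATCH_SIZE = 128
EXAMPLES_PER_EPOCH = 215000  # "215K" in the text
EPOCHS = 100                 # "Experiments were done for 100 epochs"

# Seconds per epoch, copied from the training-time table.
SECONDS_PER_EPOCH = {
    "BOW+CNN": {"GTX 760": 140, "Intel Core i7": 900},
    "LSTM+CNN": {"GTX 760": 200, "Intel Core i7": 1900},
}


def batches_per_epoch(n_examples, batch_size):
    """Ceiling division: the last batch may be smaller than batch_size."""
    return (n_examples + batch_size - 1) // batch_size


def total_hours(sec_per_epoch, epochs):
    """Total wall-clock hours, assuming every epoch takes the same time."""
    return sec_per_epoch * epochs / 3600.0


if __name__ == "__main__":
    print("batches per epoch: %d" % batches_per_epoch(EXAMPLES_PER_EPOCH, BATCH_SIZE))
    for model, timings in sorted(SECONDS_PER_EPOCH.items()):
        for hardware, secs in sorted(timings.items()):
            print("%s on %s: %.1f hours for %d epochs"
                  % (model, hardware, total_hours(secs, EPOCHS), EPOCHS))
        speedup = timings["Intel Core i7"] / float(timings["GTX 760"])
        print("%s GPU speedup over CPU: %.1fx" % (model, speedup))
```

Under these assumptions, a full LSTM+CNN run takes several hours on the GTX 760 and more than two days on the CPU.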

Get Started

Have a look at the script in the scripts folder. Also, have a look at the readme present in each of the folders.


All kinds of feedback (code style, bugs, comments, etc.) are welcome. Please open an issue on this repo instead of emailing me, since that helps me keep track of things better.


