
This is an experimental Torch implementation of the VIS + LSTM visual question answering model from the paper Exploring Models and Data for Image Question Answering by Mengye Ren, Ryan Kiros & Richard Zemel.

Model architecture



Download the MSCOCO train+val images and VQA data using sh data/ Extract all the downloaded zip files inside the data folder.



If you had them downloaded already, copy over the train2014 and val2014 image folders and VQA JSON files to the data folder.

Download the VGG-19 Caffe model and prototxt using sh models/

Known issues

  • To avoid memory issues with LuaJIT, install Torch with Lua 5.1 (TORCH_LUA_VERSION=LUA51 ./ More instructions here.
  • If working with plain Lua, luaffifb may be needed for loadcaffe, unless using pre-extracted fc7 features.


Extract image features

th extract_fc7.lua -split train
th extract_fc7.lua -split val


  • batch_size: Batch size. Default is 10.
  • split: train/val. Default is train.
  • gpuid: 0-indexed id of GPU to use. Default is -1 = CPU.
  • proto_file: Path to the deploy.prototxt file for the VGG Caffe model. Default is models/VGG_ILSVRC_19_layers_deploy.prototxt.
  • model_file: Path to the .caffemodel file for the VGG Caffe model. Default is models/VGG_ILSVRC_19_layers.caffemodel.
  • data_dir: Data directory. Default is data.
  • feat_layer: Layer to extract features from. Default is fc7.
  • input_image_dir: Image directory. Default is data.
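The batch_size option controls how many images go through the network in one forward pass. As a rough illustration in Python (not the repository's Lua code), the grouping can be sketched as below. The helper name `iter_batches` and the file names are hypothetical.

```python
def iter_batches(items, batch_size=10):
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical image list; real file names come from the train2014/val2014 folders.
image_paths = ["image_%02d.jpg" % i for i in range(25)]

for batch in iter_batches(image_paths, batch_size=10):
    # In the real script, each batch would be preprocessed and passed
    # through VGG-19 to collect the fc7 activations.
    print(len(batch), batch[0], batch[-1])
```

The final batch is smaller whenever the number of images is not a multiple of batch_size.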


Training

th train.lua


  • rnn_size: Size of LSTM internal state. Default is 512.
  • num_layers: Number of layers in the LSTM.
  • embedding_size: Size of word embeddings. Default is 512.
  • learning_rate: Learning rate. Default is 4e-4.
  • learning_rate_decay: Learning rate decay factor. Default is 0.95.
  • learning_rate_decay_after: Number of epochs after which learning rate decay starts. Default is 15.
  • alpha: Alpha for Adam. Default is 0.8.
  • beta: Beta for Adam. Default is 0.999.
  • epsilon: Denominator term for smoothing. Default is 1e-8.
  • batch_size: Batch size. Default is 64.
  • max_epochs: Number of full passes through the training data. Default is 15.
  • dropout: Dropout for regularization. Probability of dropping input. Default is 0.5.
  • init_from: Initialize network parameters from checkpoint at this path.
  • save_every: Number of iterations between checkpoints. Default is 1000.
  • train_fc7_file: Path to fc7 features of training set. Default is data/train_fc7.t7.
  • fc7_image_id_file: Path to fc7 image ids of training set. Default is data/train_fc7_image_id.t7.
  • val_fc7_file: Path to fc7 features of validation set. Default is data/val_fc7.t7.
  • val_fc7_image_id_file: Path to fc7 image ids of validation set. Default is data/val_fc7_image_id.t7.
  • data_dir: Data directory. Default is data.
  • checkpoint_dir: Checkpoint directory. Default is checkpoints.
  • savefile: Filename to save checkpoint to. Default is vqa.
  • gpuid: 0-indexed id of GPU to use. Default is -1 = CPU.
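The three learning-rate options interact, so a small Python sketch may help. It is an illustration, not the repository's Lua code. It assumes the rate stays constant through epoch learning_rate_decay_after and is then multiplied by learning_rate_decay once per further epoch; the exact rule in train.lua may differ. The function name `learning_rate_at` is hypothetical.

```python
def learning_rate_at(epoch, base_lr=4e-4, decay=0.95, decay_after=15):
    """Learning rate for a 1-indexed epoch under the assumed schedule."""
    # Assumption: no decay through epoch `decay_after`; afterwards the
    # rate is multiplied by `decay` once for every additional epoch.
    extra_epochs = max(0, epoch - decay_after)
    return base_lr * decay ** extra_epochs


for epoch in (1, 15, 16, 20):
    print(epoch, learning_rate_at(epoch))
```

Under this assumption, the defaults (max_epochs of 15 and learning_rate_decay_after of 15) would never trigger the decay, so it would only matter when max_epochs is raised.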


Testing

th predict.lua -checkpoint_file checkpoints/vqa_epoch23.26_0.4610.t7 -input_image_path data/train2014/COCO_train2014_000000405541.jpg -question 'What is the cat on?'


  • checkpoint_file: Path to model checkpoint to initialize network parameters from
  • input_image_path: Path to input image
  • question: Question string

Sample predictions

Randomly sampled image-question pairs from the VQA test set, and answers predicted by the VIS+LSTM model.

Q: What animals are those? A: Sheep

Q: What color is the frisbee that's upside down? A: Red

Q: What is flying in the sky? A: Kite

Q: What color is court? A: Blue

Q: What is in the standing person's hands? A: Bat

Q: Are they riding horses both the same color? A: No

Q: What shape is the plate? A: Round

Q: Is the man wearing socks? A: Yes

Q: What is over the woman's left shoulder? A: Fork

Q: Where are the pink flowers? A: On wall

Implementation Details

  • Last hidden layer image features from VGG-19
  • Zero-padded question sequences for batched implementation
  • Training questions are filtered for top_n answers, top_n = 1000 by default (~87% coverage)
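The answer filtering and zero-padding steps above can be sketched in Python on toy data. This is an illustration under stated assumptions, not the repository's Lua code: the function names, the whitespace tokenizer, the use of index 0 for padding, and padding on the right are all assumptions.

```python
from collections import Counter

PAD = 0  # assumed padding index; word indices start at 1


def tokenize(question):
    """Naive tokenizer (assumption): lowercase, drop trailing '?', split on spaces."""
    return question.lower().rstrip("?").split()


def filter_top_answers(qa_pairs, top_n=1000):
    """Keep pairs whose answer is among the top_n most frequent; report coverage."""
    counts = Counter(answer for _, answer in qa_pairs)
    keep = {answer for answer, _ in counts.most_common(top_n)}
    kept = [(q, a) for q, a in qa_pairs if a in keep]
    coverage = len(kept) / len(qa_pairs) if qa_pairs else 0.0
    return kept, coverage


def pad_questions(questions):
    """Map words to indices and right-pad with PAD to the longest question."""
    vocab = {}
    encoded = []
    for question in questions:
        ids = [vocab.setdefault(w, len(vocab) + 1) for w in tokenize(question)]
        encoded.append(ids)
    max_len = max((len(ids) for ids in encoded), default=0)
    padded = [ids + [PAD] * (max_len - len(ids)) for ids in encoded]
    return padded, vocab


# Toy data; the real pipeline reads the VQA JSON files.
qa_pairs = [
    ("What color is the cat?", "black"),
    ("Is it raining?", "no"),
    ("What is the cat on?", "bed"),
    ("Is the man wearing socks?", "no"),
]
kept, coverage = filter_top_answers(qa_pairs, top_n=1)
padded, vocab = pad_questions([q for q, _ in kept])
print(coverage)
print(padded)
```

Padding every question to a common length lets a batch be stored as one rectangular tensor, which is what makes batched LSTM training straightforward.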

Pretrained model and data files

To reproduce results shown on this page or try your own image-question pairs, download the following and run predict.lua with the appropriate paths.




