Skip to content
master
Go to file
Code

README.md

neural-vqa

Join the chat at https://gitter.im/abhshkdz/neural-vqa

This is an experimental Torch implementation of the VIS + LSTM visual question answering model from the paper Exploring Models and Data for Image Question Answering by Mengye Ren, Ryan Kiros & Richard Zemel.

Model architecture

Setup

Requirements:

Download the MSCOCO train+val images and VQA data using sh data/download_data.sh. Extract all the downloaded zip files inside the data folder.

unzip Annotations_Train_mscoco.zip
unzip Questions_Train_mscoco.zip
unzip train2014.zip

unzip Annotations_Val_mscoco.zip
unzip Questions_Val_mscoco.zip
unzip val2014.zip

If you had them downloaded already, copy over the train2014 and val2014 image folders and VQA JSON files to the data folder.

Download the VGG-19 Caffe model and prototxt using sh models/download_models.sh.

Known issues

  • To avoid memory issues with LuaJIT, install Torch with Lua 5.1 (TORCH_LUA_VERSION=LUA51 ./install.sh). More instructions here.
  • If working with plain Lua, luaffifb may be needed for loadcaffe, unless using pre-extracted fc7 features.

Usage

Extract image features

th extract_fc7.lua -split train
th extract_fc7.lua -split val

Options

  • batch_size: Batch size. Default is 10.
  • split: train/val. Default is train.
  • gpuid: 0-indexed id of GPU to use. Default is -1 = CPU.
  • proto_file: Path to the deploy.prototxt file for the VGG Caffe model. Default is models/VGG_ILSVRC_19_layers_deploy.prototxt.
  • model_file: Path to the .caffemodel file for the VGG Caffe model. Default is models/VGG_ILSVRC_19_layers.caffemodel.
  • data_dir: Data directory. Default is data.
  • feat_layer: Layer to extract features from. Default is fc7.
  • input_image_dir: Image directory. Default is data.

Training

th train.lua

Options

  • rnn_size: Size of LSTM internal state. Default is 512.
  • num_layers: Number of layers in LSTM
  • embedding_size: Size of word embeddings. Default is 512.
  • learning_rate: Learning rate. Default is 4e-4.
  • learning_rate_decay: Learning rate decay factor. Default is 0.95.
  • learning_rate_decay_after: In number of epochs, when to start decaying the learning rate. Default is 15.
  • alpha: Alpha for adam. Default is 0.8
  • beta: Beta used for adam. Default is 0.999.
  • epsilon: Denominator term for smoothing. Default is 1e-8.
  • batch_size: Batch size. Default is 64.
  • max_epochs: Number of full passes through the training data. Default is 15.
  • dropout: Dropout for regularization. Probability of dropping input. Default is 0.5.
  • init_from: Initialize network parameters from checkpoint at this path.
  • save_every: No. of iterations after which to checkpoint. Default is 1000.
  • train_fc7_file: Path to fc7 features of training set. Default is data/train_fc7.t7.
  • fc7_image_id_file: Path to fc7 image ids of training set. Default is data/train_fc7_image_id.t7.
  • val_fc7_file: Path to fc7 features of validation set. Default is data/val_fc7.t7.
  • val_fc7_image_id_file: Path to fc7 image ids of validation set. Default is data/val_fc7_image_id.t7.
  • data_dir: Data directory. Default is data.
  • checkpoint_dir: Checkpoint directory. Default is checkpoints.
  • savefile: Filename to save checkpoint to. Default is vqa.
  • gpuid: 0-indexed id of GPU to use. Default is -1 = CPU.

Testing

th predict.lua -checkpoint_file checkpoints/vqa_epoch23.26_0.4610.t7 -input_image_path data/train2014/COCO_train2014_000000405541.jpg -question 'What is the cat on?'

Options

  • checkpoint_file: Path to model checkpoint to initialize network parameters from
  • input_image_path: Path to input image
  • question: Question string

Sample predictions

Randomly sampled image-question pairs from the VQA test set, and answers predicted by the VIS+LSTM model.

Q: What animals are those? A: Sheep

Q: What color is the frisbee that's upside down? A: Red

Q: What is flying in the sky? A: Kite

Q: What color is court? A: Blue

Q: What is in the standing person's hands? A: Bat

Q: Are they riding horses both the same color? A: No

Q: What shape is the plate? A: Round

Q: Is the man wearing socks? A: Yes

Q: What is over the woman's left shoulder? A: Fork

Q: Where are the pink flowers? A: On wall

Implementation Details

  • Last hidden layer image features from VGG-19
  • Zero-padded question sequences for batched implementation
  • Training questions are filtered for top_n answers, top_n = 1000 by default (~87% coverage)

Pretrained model and data files

To reproduce results shown on this page or try your own image-question pairs, download the following and run predict.lua with the appropriate paths.

References

License

MIT

Releases

No releases published

Packages

No packages published
You can’t perform that action at this time.