This model can select the correct one-word answer when asked a natural-language question about a picture.




It works by encoding the question into a vector, encoding the image into a vector, concatenating the two, and training a logistic regression on top over a vocabulary of potential answers.

In [1]:
import keras

Using TensorFlow backend.


In [7]:
from keras.layers import Dense, Input
from keras.layers import Embedding, LSTM
from keras.layers import Conv2D, MaxPooling2D, Flatten
from keras.models import Model, Sequential

We'll first create the image model:

In [8]:
vision_model = Sequential()

In [10]:
vision_model.add(Conv2D(64,(3,3), padding = 'same', activation = 'relu', input_shape = (224,224,3)))
vision_model.add(Conv2D(64, (3, 3), activation='relu'))
vision_model.add(MaxPooling2D((2, 2)))
vision_model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
vision_model.add(Conv2D(128, (3, 3), activation='relu'))
vision_model.add(MaxPooling2D((2, 2)))
vision_model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
vision_model.add(Conv2D(256, (3, 3), activation='relu'))
vision_model.add(Conv2D(256, (3, 3), activation='relu'))
vision_model.add(MaxPooling2D((2, 2)))
vision_model.add(Flatten())
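
To see how long the flattened image vector is, the layer stack above can be traced by hand. The following is a minimal sketch in plain Python, assuming the Keras defaults (stride 1 for `Conv2D`, `'valid'` padding unless `'same'` is given, and a pooling stride equal to the pool size). The helper names `conv_out` and `pool_out` are made up for this illustration.

```python
# Trace the spatial size of a 224x224 input through the layers above.
# Assumes Keras defaults: stride 1 for Conv2D, 'valid' padding unless
# stated, and a pooling stride equal to the pool size.

def conv_out(size, kernel=3, padding="valid"):
    """Output side length of a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool=2):
    """Output side length of a 'valid' max-pooling layer (floor division)."""
    return size // pool

# (layer kind, padding or None, filters or None), in the order they were added
layers = [
    ("conv", "same", 64), ("conv", "valid", 64), ("pool", None, None),
    ("conv", "same", 128), ("conv", "valid", 128), ("pool", None, None),
    ("conv", "same", 256), ("conv", "valid", 256), ("conv", "valid", 256),
    ("pool", None, None),
]

size, channels = 224, 3
for kind, padding, filters in layers:
    if kind == "conv":
        size, channels = conv_out(size, padding=padding), filters
    else:
        size = pool_out(size)

flattened_length = size * size * channels
print(size, channels, flattened_length)  # 25 256 160000
```

Under these assumptions each image is encoded as a vector of 160000 values, which is the size the later layers have to consume.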


Now, let's get a tensor with the output of our vision model.

In [12]:
image_input = Input(shape=(224, 224, 3))
encoded_image = vision_model(image_input)

Next, let's define a language model to encode the question into a vector. Each question will be at most 100 words long, and we'll index words as integers between 1 and 9999.

In [13]:
question_input = Input(shape=(100,), dtype='int32')
embedded_question = Embedding(input_dim=10000, output_dim=256, input_length=100)(question_input)
encoded_question = LSTM(256)(embedded_question)
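
The question input expects each question as exactly 100 integers. One way to produce such a sequence is sketched below in plain Python. The vocabulary `word_index`, the helper `encode_question`, the use of 0 as a padding value, and the choice to pad on the left are all assumptions made for this illustration; they are not part of the notebook.

```python
# Hypothetical vocabulary: a real one would hold up to 9999 words,
# indexed 1..9999, with 0 left free for padding.
word_index = {"what": 1, "sport": 2, "is": 3, "the": 4, "boy": 5, "playing": 6}

MAX_LEN = 100  # matches Input(shape=(100,))
PAD = 0

def encode_question(text, word_index, max_len=MAX_LEN):
    """Map a question to a fixed-length list of integer word indices."""
    tokens = text.lower().replace("?", "").split()
    # Dropping unknown words is a simplification made for this sketch
    ids = [word_index[t] for t in tokens if t in word_index]
    ids = ids[:max_len]                        # truncate long questions
    return [PAD] * (max_len - len(ids)) + ids  # left-pad short ones

encoded = encode_question("What sport is the boy playing?", word_index)
print(len(encoded), encoded[-6:])  # 100 [1, 2, 3, 4, 5, 6]
```

Note that the `Embedding` layer above does not set `mask_zero=True`, so with this scheme the padding index would be embedded and fed to the LSTM like any other token.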

Now, we'll concatenate the question vector and the image vector.

In [14]:
merged = keras.layers.concatenate([encoded_question, encoded_image])

In [15]:
output = Dense(1000, activation='softmax')(merged)

This adds a logistic regression (softmax) layer over a vocabulary of 1000 answer words on top of the merged vector.
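
What such a layer computes can be sketched in plain Python: one linear score per candidate answer, followed by a softmax that turns the scores into probabilities. The sizes below are toy values chosen for readability, and the function names are made up for this illustration; the random weights stand in for values that training would learn.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense_softmax(features, weights, biases):
    """One score per answer (weights[k] . features + biases[k]), then softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Toy sizes for illustration; the model above uses 1000 answers and a
# much longer concatenated feature vector.
random.seed(0)
n_features, n_answers = 8, 5
features = [random.uniform(-1, 1) for _ in range(n_features)]
weights = [[random.uniform(-1, 1) for _ in range(n_features)]
           for _ in range(n_answers)]
biases = [0.0] * n_answers

probabilities = dense_softmax(features, weights, biases)
predicted_answer = max(range(n_answers), key=probabilities.__getitem__)
print(round(sum(probabilities), 6), predicted_answer)
```

The predicted answer is the vocabulary entry with the highest probability.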


In [17]:
vqa_model = Model(inputs=[image_input, question_input], outputs=output)

Finally, we would train the model on actual data. The training step itself is not shown here.
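
Training data for this model would consist of (image, question, answer) triples, with each answer mapped to an index in the answer vocabulary. The sketch below shows that mapping and the cross-entropy loss for a single example in plain Python. The answer list and the two probability vectors are made up for illustration. In Keras, integer targets of this kind are typically paired with the `sparse_categorical_crossentropy` loss, though the notebook does not say which loss it uses.

```python
import math

# Hypothetical answer vocabulary; the model above assumes 1000 entries.
answer_vocab = ["football", "tennis", "red", "two", "yes", "no"]
answer_index = {word: i for i, word in enumerate(answer_vocab)}

def cross_entropy(probabilities, target_index):
    """Negative log-probability assigned to the correct answer."""
    return -math.log(probabilities[target_index])

target = answer_index["football"]

# Two made-up softmax outputs over the six answers
confident = [0.90, 0.02, 0.02, 0.02, 0.02, 0.02]
uniform = [1 / 6] * 6

loss_confident = cross_entropy(confident, target)
loss_uniform = cross_entropy(uniform, target)
print(loss_confident < loss_uniform)  # True
```

The loss shrinks as the model puts more probability on the correct answer, which is what training pushes toward.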

## Video question-answering model

Once the image QA model is trained, we can quickly turn it into a video QA model. With appropriate training, you will be able to show it a short video (e.g. a 100-frame clip of a human action) and ask a natural-language question about the video (e.g. "what sport is the boy playing?" -> "football").

In [18]:
from keras.layers import TimeDistributed

In [19]:
video_input = Input((100,224,224,3))

Now we'll use our previously trained vision model to encode the video.

In [20]:
encoded_frame_sequence = TimeDistributed(vision_model)(video_input)

The output is a sequence of vectors, one per frame.
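
Conceptually, `TimeDistributed` applies the same model, with the same weights, to every time step independently. The plain-Python sketch below mimics that behaviour on a toy example. `encode_frame` is a made-up stand-in for `vision_model`, and a "frame" here is just a short list of numbers.

```python
def encode_frame(frame):
    """Stand-in for vision_model: reduces one frame to a short vector.

    Here a 'frame' is a flat list of pixel values and the 'encoding' is
    just [mean, max]; the real model outputs a learned feature vector.
    """
    return [sum(frame) / len(frame), max(frame)]

def time_distributed(encoder, video):
    """Apply the same encoder independently to every frame of a video."""
    return [encoder(frame) for frame in video]

# A toy 'video' of 4 frames with 6 pixel values each
video = [
    [0, 1, 2, 3, 4, 5],
    [5, 5, 5, 5, 5, 5],
    [9, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

encoded_frames = time_distributed(encode_frame, video)
print(len(encoded_frames), len(encoded_frames[0]))  # 4 2
print(encoded_frames[0])  # [2.5, 5]
```

The number of time steps is preserved; only the per-frame representation changes.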

In [21]:
encoded_video = LSTM(256)(encoded_frame_sequence)

The output is a single vector.
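
It can be useful to estimate how large this LSTM is, since its input is the full flattened frame vector. The sketch below uses the standard parameter count for an LSTM layer with bias. The helper `lstm_param_count` is made up for this illustration, and the frame vector length assumes the `vision_model` layer shapes for 224x224 inputs.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of a standard LSTM layer with bias.

    Each of the 4 gates has an input kernel (input_dim x units), a
    recurrent kernel (units x units), and a bias vector (units).
    """
    return 4 * (units * (input_dim + units) + units)

# Assumes the flattened frame vector has 25 * 25 * 256 entries, which
# follows from the layer shapes of vision_model for 224x224 inputs.
frame_features = 25 * 25 * 256

video_lstm = lstm_param_count(frame_features, 256)
question_lstm = lstm_param_count(256, 256)  # input is the 256-d embedding

print(frame_features)  # 160000
print(video_lstm)      # 164103168
print(question_lstm)   # 525312
```

Under these assumptions almost all of the video LSTM's parameters sit in its input kernel, because each frame arrives as a 160000-dimensional vector.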

A model-level representation of the question encoder, which reuses the same weights as before:

In [22]:
question_encoder = Model(inputs=question_input, outputs=encoded_question)

Let's use it to encode the question:

In [24]:
video_question_input = Input(shape=(100,), dtype='int32')
encoded_video_question = question_encoder(video_question_input)

Finally, our video-question answering model:

In [25]:
merged = keras.layers.concatenate([encoded_video, encoded_video_question])
output = Dense(1000, activation='softmax')(merged)
video_qa_model = Model(inputs=[video_input, video_question_input], outputs=output)
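
Before feeding data to this model, a rough size check may help. The following back-of-the-envelope sketch computes how much memory one input video occupies and how many parameters the final classifier has. It assumes 32-bit floating-point inputs, which the notebook does not state.

```python
frames, height, width, channels = 100, 224, 224, 3
bytes_per_value = 4  # assuming float32 inputs

values_per_video = frames * height * width * channels
megabytes_per_video = values_per_video * bytes_per_value / 1e6

# Final classifier: 256-d video vector + 256-d question vector -> 1000 answers
merged_length = 256 + 256
dense_params = merged_length * 1000 + 1000

print(values_per_video)               # 15052800
print(round(megabytes_per_video, 1))  # 60.2
print(dense_params)                   # 513000
```

At roughly 60 MB per video under this assumption, batch sizes for training would likely need to be small.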
