This Jupyter notebook handles only the local pre-processing of the data. The processed files are then uploaded to our Google Colab notebooks for training.

## Part 1a: Sentence embedding

When we load our CSV file, we can see that the questions and answers are laid out awkwardly: everything sits in a single column, with each question followed by its answer on the next row. Ideally we would have one column with the question, a second with the image it refers to, and a third with the answer.

In [61]:
## First, let's take a sample look at our CSV file

# Load the Pandas libraries with alias 'pd' 
import pandas as pd 
# Read data from file 'filename.csv' 
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later) 
data = pd.read_csv("./data/raw_data/DAQUAR_train_raw.csv",header=None)
# Print the first 10 rows of the loaded data (5 question/answer pairs)
for i in range(10):
    print(data[0][i])

data.head()

what is on the right side of the black telephone and on the left side of the red chair in the image3 ?
desk
what is in front of the white door on the left side of the desk in the image3 ?
telephone
what is on the desk in the image3 ?
book  scissor  papers  tape_dispenser
what is the largest brown objects in this image3 ?
carton
what color is the chair in front of the white wall in the image3 ?
red


Unnamed: 0,0
0,what is on the right side of the black telepho...
1,desk
2,what is in front of the white door on the left...
3,telephone
4,what is on the desk in the image3 ?


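Questions are on the even rows and answers on the odd rows (counting from 0). We therefore go through the rows in order, pair each question with the answer that follows it, and write the pair to a new CSV with separate question, image and answer columns.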
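
The even/odd pairing can also be expressed with slicing. The sketch below is only an illustration on a hand-made list; `pair_rows` is a hypothetical helper and not part of this notebook.

```python
def pair_rows(rows):
    """Pair each even-indexed row (question) with the odd-indexed row after it (answer)."""
    questions = rows[0::2]
    answers = rows[1::2]
    # zip stops at the shorter list, so a trailing question without an answer is dropped
    return list(zip(questions, answers))


sample_rows = [
    "what is on the desk in the image3 ?",
    "book  scissor  papers  tape_dispenser",
    "what is the largest brown objects in this image3 ?",
    "carton",
]
pairs = pair_rows(sample_rows)
print(pairs[1])
```

The function below does the same pairing while streaming through the file, which avoids holding all rows in memory.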

In [73]:
import os 
import csv

def prepare_data(in_directory,out_directory, mode):
    #find the source and destination directory, mode is either train or test
    file_name_in=os.path.join(in_directory,'DAQUAR_{}_raw.csv'.format(str(mode)))
    file_name_out=os.path.join(out_directory,'DAQUAR_{}_processed.csv'.format(str(mode)))
    
    #open the files and read
    with open(file_name_in, 'r') as f, open(file_name_out, 'w', newline='') as f_out:
        reader = csv.reader(f)
        
        fieldnames=['question','image','answer']
        writer = csv.DictWriter(f_out, fieldnames=fieldnames)
        
        writer.writeheader()
        
        #question row, then answer row
        row_skip=2
        dico={'question':None,
              'image':None,
              'answer':None}

        for index, row in enumerate(reader):
            
            #even row index = question
            if index % row_skip ==0:
                #split the question at the 'image' keyword
                question_image_list=row[0].split('image')

                #note: this stores a one-element list, so the csv cell reads "['...']"
                dico['question']=[question_image_list[0]]
                
                #remove the question mark and restore the 'image' prefix -> useful for integrating visual features later
                dico['image']='image'+question_image_list[1].replace(' ?','')
            
            else:
                #odd row index = answer; row is the list of cells returned by csv.reader
                dico['answer']=row
                
                #write the completed question/image/answer row in the csv
                writer.writerow({'question': dico['question'], 'image':dico['image'], 'answer': dico['answer']})

                #reset the dictionary for the next question/answer pair
                dico={'question':None,
                      'image':None,
                      'answer':None}

In [75]:
prepare_data(in_directory='./data/raw_data', out_directory='./data/processed_data', mode='train')
prepare_data(in_directory='./data/raw_data', out_directory='./data/processed_data', mode='test')

In [76]:
import pandas as pd 
data = pd.read_csv("./data/processed_data/DAQUAR_train_processed.csv",header=None)
data.head()

Unnamed: 0,0,1,2
0,question,image,answer
1,['﻿what is on the right side of the black tele...,image3,['desk']
2,['what is in front of the white door on the le...,image3,['telephone']
3,['what is on the desk in the '],image3,['book scissor papers tape_dispenser']
4,['what is the largest brown objects in this '],image3,['carton']
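
Two details of the output above come from `prepare_data`: cells are written as Python lists (hence the `['...']` wrapping), and the image name is taken from whatever follows the first occurrence of `image` in the question. A minimal sketch of a stricter parse, assuming every raw question ends with `imageN ?`. The names `split_question` and `IMAGE_TAG` are hypothetical and not part of this notebook.

```python
import re

# Hypothetical pattern: the image tag is expected at the very end of the question
IMAGE_TAG = re.compile(r"\s*image(\d+)\s*\?\s*$")


def split_question(raw_question):
    """Return (question_text, image_name) for a raw question string."""
    # drop a leading byte-order mark if the file was saved with one
    text = raw_question.lstrip("\ufeff")
    match = IMAGE_TAG.search(text)
    if match is None:
        raise ValueError("no trailing image tag in: {!r}".format(raw_question))
    # everything before the tag is the question, the digits give the image name
    return text[:match.start()].strip(), "image" + match.group(1)


print(split_question("what is on the desk in the image3 ?"))
```

Anchoring the pattern at the end of the string means an earlier mention of `image` inside the question does not leak into the image name.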


## Part 1b: Create GloVe word embedding array

This step is only needed if you do not want to learn a custom embedding for your words. In autoencoder1 we use custom 300-dimensional word embeddings, which are learned while training the autoencoder. In autoencoder2 we use pre-trained GloVe embeddings, trained on a very large text corpus. Because we do not want to load the whole GloVe file on Google Colab, we first process it locally so that it only contains the words of our training set.
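
The filtering idea can be sketched on a toy example: read the GloVe text line by line and keep only the vectors whose word is in the training vocabulary. This is an illustration under assumed inputs. The helper `filter_glove`, the three-dimensional vectors and the tiny `toy_word_index` are made up for the example and stand in for the real file and the tokenizer's `word_index`.

```python
import io

import numpy as np


def filter_glove(lines, word_index, embed_dims):
    """Build an embedding matrix holding only the words found in word_index."""
    # row 0 stays all-zero because Keras-style word indices start at 1
    matrix = np.zeros((len(word_index) + 1, embed_dims), dtype="float32")
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        word, *coefs = line.split()
        row = word_index.get(word)
        if row is not None:
            matrix[row] = np.asarray(coefs, dtype="float32")
    return matrix


# toy stand-in for glove.6B.300d.txt (one word followed by its vector per line)
toy_glove = io.StringIO(
    "desk 0.1 0.2 0.3\n"
    "chair 0.4 0.5 0.6\n"
    "zebra 0.7 0.8 0.9\n"
)
toy_word_index = {"desk": 1, "chair": 2, "tape_dispenser": 3}
toy_matrix = filter_glove(toy_glove, toy_word_index, embed_dims=3)
print(toy_matrix.shape)
```

Words of the vocabulary that GloVe does not know (here `tape_dispenser`) keep an all-zero row, and GloVe words outside the vocabulary (here `zebra`) are never stored.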

In [2]:
import pandas as pd

## read the csv files with the data

train_dir='./data/processed_data/DAQUAR_train_processed.csv'
test_dir='./data/processed_data/DAQUAR_test_processed.csv'

data_train=pd.read_csv(train_dir)
data_test=pd.read_csv(test_dir)

In [4]:
# Make sure that our version of tensorflow is not 2.0
import tensorflow as tf
print(tf.__version__)
from keras.preprocessing.text import Tokenizer

#create an instance of class Tokenizer
MAX_WORDS = 3000
tokenizer = Tokenizer(num_words = MAX_WORDS, split=' ')

#has to fit on questions and answers of the train dataset only
tokenizer.fit_on_texts(data_train['question'])
tokenizer.fit_on_texts(data_train['answer'])

1.14.0


Using TensorFlow backend.


In [13]:
import numpy as np

## GloVe has about 400K words in its vocabulary, so the file is very large.
## We therefore load GloVe locally, keep only the words that are present in our training vocabulary,
## and save the result as a numpy file that we then load on Google Colab. We go from a 1GB file to a 3.6MB file

def create_embedding_matrix(tokenizer,directory,embed_dims):
    embeddings_index = {}
    #first download the glove data for 300 dimensions here : https://nlp.stanford.edu/projects/glove/
    #the file is glove.6B.zip and inside you have glove.6B.300d.txt
    with open(directory,encoding='utf8') as f:
        #processing the text
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    
    #get the list of words that we have in the training set
    word_index=tokenizer.word_index.items()
    #number of dimensions of the glove embedding (depends on which file you downloaded, here 300)
    EMBEDDING_DIM=embed_dims
    
    #create an empty matrix that we are going to fill
    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index:
        #get the glove embedding for the word
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
            
    return embedding_matrix

In [15]:
directory='./data/embedding/glove.6B.300d.txt'
embed_dims=300

embedding_matrix=create_embedding_matrix(tokenizer, directory,embed_dims)
#save it as an npy file so that we can load it in Google Colab
np.save('./data/embedding/glove_300d_embedding.npy', embedding_matrix)
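
Before uploading, it can be useful to check how many vocabulary words received no GloVe vector, since those rows stay all-zero. The sketch below runs on a made-up matrix; `zero_rows` is a hypothetical helper, and a temporary directory stands in for the real output path.

```python
import os
import tempfile

import numpy as np


def zero_rows(matrix):
    """Return the row indices (excluding the padding row 0) whose embedding is all zeros."""
    is_zero = ~matrix.any(axis=1)
    return [int(i) for i in np.flatnonzero(is_zero) if i != 0]


# made-up matrix: row 0 is padding, row 2 is a word without a pre-trained vector
toy_matrix = np.array([[0.0, 0.0], [0.1, 0.2], [0.0, 0.0], [0.3, 0.4]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embedding.npy")
    np.save(path, toy_matrix)
    reloaded = np.load(path)

missing = zero_rows(reloaded)
print(missing)
```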

## Part 2: Visual Features

We take the visual features extracted from one of the last layers of VGG19 and attach each of them to the matching question/answer pair as a tuple in a deque (a double-ended queue from `collections`, which behaves like a list but supports fast appends and pops at both ends). The pickled deque is then loaded into memory in our Colab notebook to train our question-answering model.
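
As a rough illustration of the container described here, the sketch below stores (question, visual features, answer) tuples in a `collections.deque` and round-trips it through `pickle`. The array sizes and values are invented for the example; the real feature vectors come from VGG19.

```python
import pickle
from collections import deque

import numpy as np

toy_deque = deque()
# each entry: (padded question token ids, visual feature vector, answer token id)
toy_deque.append((np.array([0, 0, 4, 7]), [0.12, 0.0, 0.98], np.array([5])))
toy_deque.append((np.array([0, 3, 9, 2]), [0.40, 0.33, 0.0], np.array([8])))

# serialise to bytes and back, as writing to and reading from a file would do
payload = pickle.dumps(toy_deque)
restored = pickle.loads(payload)

question, features, answer = restored[0]
print(len(restored), features)
```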

In [67]:
## REDO THE SAME STEPS AS PREVIOUSLY

import pandas as pd

## read the csv files with the data

train_dir='./data/processed_data/DAQUAR_train_processed.csv'
test_dir='./data/processed_data/DAQUAR_test_processed.csv'

data_train=pd.read_csv(train_dir)
data_test=pd.read_csv(test_dir)

In [29]:
# Make sure that our version of tensorflow is not 2.0
import tensorflow as tf
print(tf.__version__)
from keras.preprocessing.text import Tokenizer

#create an instance of class Tokenizer
MAX_WORDS = 3000
tokenizer = Tokenizer(num_words = MAX_WORDS, split=' ')

#has to fit on questions and answers of the train dataset only

tokenizer.fit_on_texts(data_train['question'])
tokenizer.fit_on_texts(data_train['answer'])

1.14.0


In [30]:
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences 
import numpy as np


#the tokenizer must already be fit on your training text
def tokenization(tokenizer, length_of_sequence, dataset, multiple_answer=True):
    MAX_LEN=length_of_sequence

    seqs_question = tokenizer.texts_to_sequences(dataset['question'])
    seqs_answer = tokenizer.texts_to_sequences(dataset['answer'])

    #truncating='post' drops the words at the end of a sequence that is longer than MAX_LEN, not the ones at the start.
    #padding keeps the Keras default ('pre'), so shorter sequences get their zeros at the start.
    pad_seqs_question = pad_sequences(seqs_question,MAX_LEN,truncating='post')
    pad_seqs_answer = pad_sequences(seqs_answer,MAX_LEN,truncating='post')

    #we choose if we want to keep one or multiple answers
    if multiple_answer is False:
        #with the zeros at the start, the last column holds the last kept word of each answer
        pad_seqs_answer_one_answer = pad_seqs_answer[:,[MAX_LEN-1]]
        return pad_seqs_question, dataset['image'], pad_seqs_answer_one_answer

    else:
        return pad_seqs_question, dataset['image'], pad_seqs_answer
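
To make the padding and truncation behaviour concrete, here is a NumPy-only re-implementation of the two settings used above (zeros added at the start, extra tokens dropped from the end). It mirrors, but does not call, Keras' `pad_sequences`; `pad_pre_truncate_post` is a hypothetical name and the token ids are invented.

```python
import numpy as np


def pad_pre_truncate_post(sequences, max_len):
    """Pad with leading zeros and cut overly long sequences at the end."""
    padded = np.zeros((len(sequences), max_len), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[:max_len]  # truncating='post': keep the first max_len tokens
        if kept:
            padded[row, -len(kept):] = kept  # padding='pre': zeros go in front
    return padded


toy_sequences = [[4, 9, 2], [1, 2, 3, 4, 5, 6, 7]]
toy_padded = pad_pre_truncate_post(toy_sequences, max_len=5)
print(toy_padded)

# same indexing as the single-answer branch above: keep only the last column
toy_last = toy_padded[:, [5 - 1]]
print(toy_last)
```

For a multi-word answer, keeping only the last column retains the final word of the answer and discards the others.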

In [31]:
#we set the max number of words in each question to 25, if less than 25 there will be padding (zeros)
MAX_LEN=25

### TRAINING AND TEST SETS ####
train_questions,train_images,train_answers = tokenization(tokenizer, MAX_LEN, data_train, multiple_answer=False)
test_questions,test_images,test_answers = tokenization(tokenizer, MAX_LEN, data_test, multiple_answer=False)

The JSON file loaded below holds the visual features we need for every image. It is read as a dictionary, and the features of an image are retrieved with its name, of the form 'imageX'.


ex: feat['image3']
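
A quick way to see which image names have no entry in the feature dictionary is a set difference on the keys. The sketch below uses a small inline JSON string in place of `img_features.json`; `missing_images` is a hypothetical helper and the feature values are invented.

```python
import json


def missing_images(image_names, features):
    """Return the sorted image names that are not keys of the feature dictionary."""
    return sorted(set(image_names) - set(features))


# inline stand-in for the real feature file
toy_feat = json.loads('{"image3": [0.1, 0.2], "image10": [0.3, 0.4]}')
toy_images = ["image3", "image10", "image10 behind the door frame in the "]

print(missing_images(toy_images, toy_feat))
```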

In [91]:
import json

#load the visual features into memory
with open('./data/img_features.json', 'r') as f:
    feat = json.load(f)

In [92]:
from collections import deque

#Because the image names are sometimes malformed, we use a try/except and count the number of errors we get
def fill_deque_with_data(visual_features,questions,images,answers,a_deque):
    
    #create error count
    error=0
    #append the indices of the errors
    index_error_images=[]

    for i in range(len(questions)):
        image_name=images[i]
        try:
            #append tuple to the deque list
            a_deque.append((questions[i],visual_features[image_name],answers[i]))
        except Exception as e:
            #print the exception that caused the visual_features to not work
            print(e)
            error+=1
            index_error_images.append(i)

    return error, index_error_images

In [99]:
import pickle

### TRAINING SET ####
train_deque=deque()
error, index = fill_deque_with_data(visual_features=feat,
                                    questions=train_questions,
                                    images=train_images,
                                    answers=train_answers,
                                    a_deque=train_deque)

#pickle the deque to disk (the file is binary, despite the .txt extension)
with open("./data/processed_data/questions-visual_features-train.txt", 'wb') as pickleFile:
    pickle.dump(train_deque, pickleFile)

'image10 behind the door frame in fornt of the cabinet in the '
'image912 close to the wall in the '
'image116 close to the shelf in the '
'image135 that is on the counter in the '
'image139 on the counter in the '
'image95 behind the clothes in the '
'image114 on the table in the '
'image929 in the '
'image1007 in the '
'image1008 in the '
'image1008 in the '
'image1035 in the '
'image1043 in the '


In [100]:
### TEST SET ####
test_deque=deque()
error, index = fill_deque_with_data(visual_features=feat,
                                    questions=test_questions,
                                    images=test_images,
                                    answers=test_answers,
                                    a_deque=test_deque)

#pickle the deque to disk (the file is binary, despite the .txt extension)
with open("./data/processed_data/questions-visual_features-test.txt", 'wb') as pickleFile:
    pickle.dump(test_deque, pickleFile)

'image1206 on the floor in the '
'image1285 on the floor in the '
'image1170 which contains some book in the '
'image1400 in the mirror reflection in the '
'image155 made of in the '
'image168 in the '
'image1011 in the '
'image1407 in the '
