## Question Embedding extraction through VQA without images

In this notebook we're going to use a Recurrent Neural Network to generate question embeddings based on the question text and its answer.

The goal is to obtain a vector representation for each question such that similar questions lie close together.
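
To make "close" concrete, one common choice is cosine similarity between embedding vectors. The sketch below is only an illustration: the 3-dimensional vectors are made up (they are not outputs of the model in this notebook) and `cosine_similarity` is a hypothetical helper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for question embeddings.
what_color_is_the_car = [0.9, 0.1, 0.0]
what_color_is_the_bus = [0.8, 0.2, 0.1]
how_many_people = [0.0, 0.1, 0.9]

sim_similar = cosine_similarity(what_color_is_the_car, what_color_is_the_bus)
sim_different = cosine_similarity(what_color_is_the_car, how_many_people)
print(sim_similar > sim_different)  # True
```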

### Load the data

In [1]:
import findspark
findspark.init(findspark.find())

ModuleNotFoundError: No module named 'findspark'

In [None]:
import tensorflow as tf
tf.test.is_gpu_available(
    cuda_only=False, min_cuda_compute_capability=None
)

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
from nltk.tokenize import word_tokenize
from collections import Counter
from keras.preprocessing.sequence import *
from keras.models import *
from keras.layers import *
from keras.utils import plot_model
from keras.callbacks import ModelCheckpoint
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

sns.set(style="ticks")

spark = SparkSession \
    .builder \
    .appName("QuestionRephrasing-AutoEncoder") \
    .config("spark.executor.memory", "14G")\
    .config("spark.driver.memory", "14G")\
    .config("spark.driver.maxResultSize", "14G")\
    .getOrCreate()

spark.sparkContext.setCheckpointDir('data/checkpoints')
questions = spark.read.parquet("data/processed/union/*")
questions.printSchema()
questions.count()

Using TensorFlow backend.


root
 |-- question: string (nullable = true)
 |-- answer: string (nullable = true)
 |-- image_id: string (nullable = true)
 |-- tokenized_question: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- question_len: double (nullable = true)
 |-- question_word_len: double (nullable = true)
 |-- first_word: string (nullable = true)



1895874

### Question filtering based on answer frequencies

We're dealing with a dataset that has a wide variety of answers, so we'll keep only the answers that appear often (more than 600 times) in order to reduce the number of classes to predict.

In [34]:
answers_counts = questions.select('answer')\
    .toPandas()['answer']\
    .value_counts()
answers_counts = pd.DataFrame(list(zip(answers_counts.keys(), answers_counts.values)))
most_common_answers = answers_counts[answers_counts[1] > 600][0]
f"We're going to target only {most_common_answers.count()} answers"

"We're going to target only 332 answers"

In [4]:
most_common_answers_df = spark.createDataFrame([[x] for x in most_common_answers.tolist()], ["answer"])

In [5]:
print(f"Original question size: {questions.count()}, after filtering {questions.join(most_common_answers_df, 'answer').count()}")
questions = questions.join(most_common_answers_df, 'answer')

Original question size: 1895874, after filtering 1038866
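
The same frequency filter can be sketched without Spark on a handful of rows. The data and the `min_count` threshold below are made up for illustration; the notebook itself uses a threshold of 600 on the full dataset.

```python
from collections import Counter

# Toy (question, answer) pairs; the data and the threshold are made up.
pairs = [
    ("is it sunny?", "yes"),
    ("is it raining?", "no"),
    ("is the door open?", "yes"),
    ("what is in the sky?", "plane"),
    ("is this a dog?", "yes"),
    ("is the light on?", "no"),
]
min_count = 1  # the notebook uses 600 on the full dataset

answer_counts = Counter(answer for _, answer in pairs)
# Keep answers that occur strictly more than `min_count` times.
common_answers = {a for a, n in answer_counts.items() if n > min_count}
# Equivalent of the inner join: drop questions whose answer is rare.
filtered = [(q, a) for q, a in pairs if a in common_answers]

print(sorted(common_answers))           # ['no', 'yes']
print(len(pairs), "->", len(filtered))  # 6 -> 5
```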


In [6]:
max_word_len = int(questions.agg({"question_word_len": "max"}).collect()[0]["max(question_word_len)"])

f"Maximum number of words in a question is {max_word_len}."

'Maximum number of words in a question is 28.'

### Build the vocabulary

In [7]:
# Tokens vocabulary and mappers
tokens = questions.select('tokenized_question')\
    .rdd\
    .flatMap(lambda x: x['tokenized_question'])\
    .collect()

word_mapping = {}
word_mapping_reversed = {}
word_counter = Counter(tokens)
for idx, value in enumerate(word_counter):
    word_mapping[value] = idx
    word_mapping_reversed[idx] = value
    
f"Word mapping example for 'is': {word_mapping['is']}."

"Word mapping example for 'is': 15."

### Extract the word representation of input using the mapping above

In [8]:
# Map each token to its vocabulary index, shifted by 1 so that 0 stays free for padding.
extract_word_embeddings = F.udf(lambda tokenized_question: [word_mapping[word] + 1 for word in tokenized_question], ArrayType(IntegerType()))

questions = questions.withColumn('question_word_embeddings', extract_word_embeddings(F.col('tokenized_question')))
questions.head(1)

[Row(answer='plane', question="what's in the sky?", image_id='21926', tokenized_question=['what', "'s", 'in', 'the', 'sky', '?'], question_len=18.0, question_word_len=6.0, first_word='what', question_word_embeddings=[1, 2, 3, 4, 5, 6])]
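
As a small stand-alone illustration of the mapping and the `+ 1` shift, the sketch below builds the two dictionaries on two toy questions. The names `word_to_idx`, `idx_to_word`, `encode` and `decode` are hypothetical and exist only for this example.

```python
from collections import Counter

tokenized_questions = [
    ["what", "'s", "in", "the", "sky", "?"],
    ["what", "is", "in", "the", "box", "?"],
]

# Counter keeps first-seen order, so enumerate() gives stable indices.
counts = Counter(tok for q in tokenized_questions for tok in q)
word_to_idx = {word: idx for idx, word in enumerate(counts)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

def encode(tokens):
    # Shift by one so that 0 stays free for padding.
    return [word_to_idx[tok] + 1 for tok in tokens]

def decode(ids):
    # Undo the shift and skip padding zeros.
    return [idx_to_word[i - 1] for i in ids if i != 0]

encoded = encode(tokenized_questions[1])
print(encoded)  # [1, 7, 3, 4, 8, 6]
```

Decoding with `decode(encoded + [0, 0])` returns the original tokens, since the zeros are treated as padding.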

In [9]:
word_embeddings = questions.select('question_word_embeddings')\
    .rdd\
    .map(lambda x: x['question_word_embeddings'])\
    .collect()
word_embeddings = pad_sequences(word_embeddings, maxlen=max_word_len, dtype='int32', padding='post', truncating='pre', value=0.0)
word_embeddings[:1]

array([[1, 2, 3, 4, 5, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]], dtype=int32)
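
To show what `padding='post'` and `truncating='pre'` do, here is a minimal NumPy sketch that mimics that behaviour on toy sequences. `pad_post_truncate_pre` is a hypothetical helper written for this illustration, not the Keras implementation.

```python
import numpy as np

def pad_post_truncate_pre(sequences, maxlen, value=0):
    """Right-pad short sequences; for long ones keep the last `maxlen` items."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]          # 'pre' truncation drops leading items
        out[row, :len(kept)] = kept   # 'post' padding leaves trailing zeros
    return out

padded = pad_post_truncate_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(padded.tolist())  # [[1, 2, 3, 0], [6, 7, 8, 9]]
```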

### Transform the answers targets to one-hot representation

In [10]:
answer_tokens = questions.select('answer')\
    .rdd\
    .map(lambda x: x['answer'])\
    .collect()

nr_answers = len(set(answer_tokens))

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(answer_tokens)
onehot_encoder = OneHotEncoder(sparse=True)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

y = onehot_encoder.fit_transform(integer_encoded)
y[:1]

<1x332 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>
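
The two encoding steps can be sketched on a toy list of answers. This version assumes sorted labels for the integer ids and builds a dense matrix for readability, whereas the cell above produces a sparse one.

```python
import numpy as np

answers = ["yes", "no", "plane", "yes"]

# Sorted unique labels give each answer a stable integer id.
classes = sorted(set(answers))                 # ['no', 'plane', 'yes']
class_to_id = {c: i for i, c in enumerate(classes)}
ids = np.array([class_to_id[a] for a in answers])

# One row per sample, a single 1 in the column of its class id.
one_hot = np.zeros((len(answers), len(classes)), dtype="float64")
one_hot[np.arange(len(answers)), ids] = 1.0

print(ids.tolist())      # [2, 0, 1, 2]
print(one_hot.tolist())
```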

### Model for extracting question embeddings

**Input:** questions encoded as padded integer sequences (each token is replaced by its number from the mapping above)

**Output:** answers as one-hot

To extract the question embeddings, we'll train the model to predict the answer and then drop the last layer, so that the output of the layer before it serves as the embedding.
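
The idea of dropping the last layer can be illustrated with a toy NumPy network. Everything here is an assumption made for the example: the weights are random rather than trained, the layer sizes are arbitrary, and `hidden`, `predict` and `embed` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained network: hidden layer (4 -> 3), softmax head (3 -> 2).
W_hidden, b_hidden = rng.normal(size=(4, 3)), np.zeros(3)
W_out, b_out = rng.normal(size=(3, 2)), np.zeros(2)

def hidden(x):
    return np.tanh(x @ W_hidden + b_hidden)

def predict(x):
    # Full model: hidden layer followed by the softmax classification head.
    logits = hidden(x) @ W_out + b_out
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def embed(x):
    # "Dropping the last layer": stop at the hidden activations.
    return hidden(x)

x = rng.normal(size=(5, 4))
print(predict(x).shape, embed(x).shape)  # (5, 2) (5, 3)
```

In Keras the same idea is commonly written as a second `Model` that reuses the trained model's input and takes the output of its second-to-last layer.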

In [30]:
from keras.layers import Dropout


model = Sequential()
model.add(Embedding(len(word_mapping) + 1, 50, input_length=max_word_len, mask_zero=True))
model.add(Dropout(0.1))
model.add(Dense(25))
model.add(Bidirectional(LSTM(250, activation='relu', dropout=0.1, recurrent_dropout=0.1)))
model.add(Dropout(0.1))
model.add(Dense(25))
model.add(Dense(nr_answers, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 28, 50)            840850    
_________________________________________________________________
dropout_13 (Dropout)         (None, 28, 50)            0         
_________________________________________________________________
dense_19 (Dense)             (None, 28, 25)            1275      
_________________________________________________________________
bidirectional_7 (Bidirection (None, 500)               552000    
_________________________________________________________________
dropout_14 (Dropout)         (None, 500)               0         
_________________________________________________________________
dense_20 (Dense)             (None, 25)                12525     
_________________________________________________________________
dense_21 (Dense)             (None, 332)              

### Compute answers class weights

Since there are many distinct answers, the dataset is highly imbalanced. To compensate during training, we compute class weights so that each answer can be weighted according to how often it appears.

In [31]:
from sklearn.utils import class_weight

classes = list(set(answer_tokens))

class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.array(classes), y=answer_tokens)

In [32]:
weights = list(zip(label_encoder.transform(classes), class_weights))
weights.sort(key=lambda x: x[0])
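
For reference, the `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * count)`. The sketch below reproduces that formula on a made-up list of answers, without scikit-learn.

```python
from collections import Counter

answers = ["yes", "yes", "yes", "no", "no", "plane"]

counts = Counter(answers)
n_samples, n_classes = len(answers), len(counts)

# weight = n_samples / (n_classes * count): rare answers get larger weights.
class_weights = {a: n_samples / (n_classes * c) for a, c in counts.items()}
print(class_weights)
```

Here "plane" appears once and gets weight 2.0, while "yes" appears three times and gets a weight below 1.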

### Model training and checkpointing

In [33]:
filepath="model-checkpoints/answer-embeddings/answer-embeddings-model-{epoch:02d}-{val_accuracy:.2f}.hdf5"

checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, mode='max')
callbacks_list = [checkpoint]
model.fit(word_embeddings, [y],
                epochs=10,
                batch_size=5000,
                shuffle=True,
                callbacks=callbacks_list,
                validation_split=0.3)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 727206 samples, validate on 311660 samples
Epoch 1/10
 75000/727206 [==>...........................] - ETA: 12:50 - loss: 5.7342 - accuracy: 0.1030

KeyboardInterrupt: 