# Question auto-encoder evaluation

This notebook evaluates the question auto-encoder by extracting the encoder's question embeddings and plotting them.

Everything up to the model loading mirrors the `auto_encoder_training` notebook, because predicting with the model requires the same operations and pre-processing as training it.

### Imports

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
from nltk.tokenize import word_tokenize
from collections import Counter
from keras.preprocessing.sequence import *
from keras.models import *
from keras.layers import *
from keras.utils import plot_model
from keras.callbacks import ModelCheckpoint
from sklearn.manifold import TSNE

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle

from pyspark.sql.window import Window

sns.set(style="ticks")

spark = SparkSession \
    .builder \
    .appName("QuestionRephrasing-AutoEncoder") \
    .config("spark.executor.memory", "5G")\
    .config("spark.driver.memory", "10G")\
    .config("spark.driver.maxResultSize", "5G")\
    .getOrCreate()

w = Window().orderBy(F.lit('A'))

spark.sparkContext.setCheckpointDir('data/checkpoints')
questions = spark.read.parquet("data/processed/union/*").withColumn("columnindex", F.row_number().over(w))
questions.printSchema()

Using TensorFlow backend.


root
 |-- question: string (nullable = true)
 |-- answer: string (nullable = true)
 |-- image_id: string (nullable = true)
 |-- tokenized_question: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- question_len: double (nullable = true)
 |-- question_word_len: double (nullable = true)
 |-- first_word: string (nullable = true)
 |-- columnindex: integer (nullable = true)



In [2]:
max_word_len = int(questions.agg({"question_word_len": "max"}).collect()[0]["max(question_word_len)"])

f"Maximum word length is {max_word_len}."

'Maximum word length is 28.'

## Vocabulary build

We need a numerical representation for *tokens*, so each distinct token is assigned an integer index.

In [3]:
# Tokens vocabulary and mappers
tokens = questions.select('tokenized_question')\
    .rdd\
    .flatMap(lambda x: x['tokenized_question'])\
    .collect()

word_mapping = {}
word_mapping_reversed = {}
word_counter = Counter(tokens)
for idx, value in enumerate(word_counter):
    word_mapping[value] = idx
    word_mapping_reversed[idx] = value
    
f"Word mapping example for 'is': {word_mapping['is']}."

"Word mapping example for 'is': 1."

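The mapping built above can be reproduced on a toy corpus without Spark. This is a minimal sketch, assuming an in-memory list of tokenized questions; `build_vocabulary` and `toy_questions` are hypothetical names that do not appear in the notebook.

```python
from collections import Counter


def build_vocabulary(tokenized_questions):
    """Assign each distinct token an integer index in first-seen order."""
    # Flatten the token lists, as flatMap does in the Spark version.
    tokens = [token for question in tokenized_questions for token in question]
    counter = Counter(tokens)
    # A Counter iterates in insertion order (Python 3.7+), so indices
    # follow the order in which tokens first appear.
    word_mapping = {token: idx for idx, token in enumerate(counter)}
    word_mapping_reversed = {idx: token for token, idx in word_mapping.items()}
    return word_mapping, word_mapping_reversed


# Hypothetical toy data, for illustration only.
toy_questions = [
    ["what", "is", "this", "?"],
    ["what", "color", "is", "the", "sky", "?"],
]
word_mapping, word_mapping_reversed = build_vocabulary(toy_questions)
print(word_mapping["is"])        # 1
print(word_mapping_reversed[3])  # ?
```

Because indices follow first appearance, `'is'` maps to 1 here for the same reason it does in the notebook: it is the second distinct token encountered.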
### Input pre-processing

Now let's pre-process the input to obtain the corresponding **mappings** for *words*. Each index is shifted by 1 so that 0 remains free as the padding value.

In [4]:
extract_word_embeddings = F.udf(lambda tokenized_question: [[word_mapping[word] + 1] for word in tokenized_question], ArrayType(ArrayType(IntegerType())))

questions = questions.withColumn('question_word_embeddings', extract_word_embeddings(F.col('tokenized_question')))
questions.head(1)

[Row(question='what is this photo taken looking through?', answer='net', image_id='458752', tokenized_question=['what', 'is', 'this', 'photo', 'taken', 'looking', 'through', '?'], question_len=41.0, question_word_len=8.0, first_word='what', columnindex=1, question_word_embeddings=[[1], [2], [3], [4], [5], [6], [7], [8]])]

In [5]:
word_embeddings = questions.select('question_word_embeddings')\
    .rdd\
    .map(lambda x: x['question_word_embeddings'])\
    .collect()
word_embeddings = pad_sequences(word_embeddings, maxlen=max_word_len, dtype='int32', padding='post', truncating='pre', value=0.0)
word_embeddings[:1]

array([[[1],
        [2],
        [3],
        [4],
        [5],
        [6],
        [7],
        [8],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0]]], dtype=int32)
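The encoding and padding steps can be illustrated in plain Python. This is a sketch that imitates the effect of `pad_sequences(..., padding='post', truncating='pre', value=0)` on a single question; `encode_and_pad` and the small vocabulary are hypothetical and only serve the example.

```python
def encode_and_pad(tokenized_question, word_mapping, maxlen):
    """Map tokens to 1-based indices and pad with [0] steps at the end."""
    # Shift by 1 so that index 0 stays free for padding.
    encoded = [[word_mapping[token] + 1] for token in tokenized_question]
    # truncating='pre': drop leading steps if the sequence is too long.
    if len(encoded) > maxlen:
        encoded = encoded[len(encoded) - maxlen:]
    # padding='post': append [0] steps until the sequence has maxlen steps.
    return encoded + [[0]] * (maxlen - len(encoded))


# Hypothetical vocabulary, for illustration only.
word_mapping = {"what": 0, "is": 1, "this": 2, "?": 3}
padded = encode_and_pad(["what", "is", "this", "?"], word_mapping, maxlen=6)
print(padded)  # [[1], [2], [3], [4], [0], [0]]
```

Each time step is a one-element list, which matches the `(max_word_len, 1)` input shape the model expects.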

### Load the model

Load the model, then drop its last 3 layers so that only the encoder LSTM remains. Its output is the 100-dimensional array representation of the input computed by the neural network.

In [6]:
encoding_dim = 100

model = Sequential()
model.add(LSTM(encoding_dim, activation='relu', input_shape=(max_word_len, 1)))
model.add(RepeatVector(max_word_len))
model.add(LSTM(max_word_len, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse', metrics=['mae', 'accuracy'])
model.summary()

model.load_weights("model-checkpoints/autoencoder-words/autoencoder-model-10-0.01.hdf5")

model.pop()
model.pop()
model.pop()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 100)               40800     
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 28, 100)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 28, 28)            14448     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 28, 1)             29        
Total params: 55,277
Trainable params: 55,277
Non-trainable params: 0
_________________________________________________________________
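The parameter counts in the summary can be checked by hand. An LSTM layer with input size d and u units has four gates, each with a (d + u) × u weight matrix and u biases, which gives 4·u·(d + u + 1) parameters. The sketch below reproduces the numbers from the summary; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Parameters of a standard LSTM layer: 4 gates, each with weights and biases."""
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    """Parameters of a dense layer: one weight per input plus a bias, per unit."""
    return units * (input_dim + 1)


encoder = lstm_params(input_dim=1, units=100)   # lstm_1
decoder = lstm_params(input_dim=100, units=28)  # lstm_2
output = dense_params(input_dim=28, units=1)    # time_distributed_1
print(encoder, decoder, output, encoder + decoder + output)
# 40800 14448 29 55277
```

`RepeatVector` has no weights, so it contributes nothing to the total.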


### Extract the question embeddings

Compute the embeddings and dump them to a pickle file (so we can play with them without re-running the cells above).

In [7]:
question_embeddings = model.predict(word_embeddings, verbose=1)
# Use a context manager so the file is flushed and closed after the dump.
with open("model-checkpoints/autoencoder-words/question_embeddings.pickle", "wb") as f:
    pickle.dump(question_embeddings, f)



### Original dataframe mapping

Let's now map the question embeddings back to the original dataframe so we can create some nice plots below.

In [8]:
with open("model-checkpoints/autoencoder-words/question_embeddings.pickle", "rb") as f:
    question_embeddings = pickle.load(f)
question_embeddings.shape, questions.count()

((1895874, 100), 1895874)

In [9]:
question_embeddings_arrays = [(embedding.tolist(), ) for embedding in question_embeddings]
question_embeddings_df = spark.createDataFrame(question_embeddings_arrays, ["question_embeddings"])\
    .withColumn("columnindex", F.row_number().over(w))
question_embeddings_df.head(10)

[Row(question_embeddings=[0.0004767561622429639, 0.02725340984761715, 0.015538804233074188, 0.0, 1.1279046248091618e-06, 0.1348886489868164, 0.45405861735343933, 0.4031505584716797, 0.0, 0.3173833191394806, 0.8812439441680908, 0.26736029982566833, 0.00025665387511253357, 0.14521507918834686, 0.05515861511230469, 0.00047478172928094864, 0.0011164809111505747, 0.5989505648612976, 0.6091572642326355, 0.002047825139015913, 0.28944143652915955, 0.08717868477106094, 0.006879337597638369, 0.007671233732253313, 0.0, 0.0, 0.40932971239089966, 2.3797378540039062, 0.00675981817767024, 0.0, 0.0, 0.009526976384222507, 0.032301124185323715, 0.09007634967565536, 0.0021635936573147774, 0.0, 0.0, 0.0017508039018139243, 0.08948977291584015, 0.44869595766067505, 0.14767955243587494, 0.006362385582178831, 0.00017617909179534763, 0.49680453538894653, 0.007712991908192635, 4.136678218841553, 0.17391712963581085, 0.5104081630706787, 2.8478558306233026e-05, 0.040004707872867584, 0.006689118687063456, 0.935730

In [10]:
questions_with_embeddings = questions.join(question_embeddings_df, "columnindex")
questions_with_embeddings.write.mode('overwrite').parquet("data/processed/auto-encoder-questions-with-embeddings")

In [11]:
questions_with_embeddings = spark.read.parquet("data/processed/auto-encoder-questions-with-embeddings/*")
questions_with_embeddings.head(5)

[Row(columnindex=46, question='what color is the sky?', answer='gray', image_id='393230', tokenized_question=['what', 'color', 'is', 'the', 'sky', '?'], question_len=22.0, question_word_len=6.0, first_word='what', question_word_embeddings=[[1], [12], [2], [13], [35], [8]], question_embeddings=[0.0, 0.0, 1.8078589840431203e-12, 0.0, 0.0, 0.10583753138780594, 86.96768951416016, 130.4357147216797, 0.0, 56.089805603027344, 212.8779296875, 0.0, 1.3465042880123972e-17, 6.0832203184470786e-21, 333.5337219238281, 0.0, 0.03435012325644493, 117.69542694091797, 559.9365234375, 0.4577213525772095, 0.672739565372467, 69.66572570800781, 2.197160939502078e-33, 0.0, 1.3345893989935576e-07, 0.0, 140.88681030273438, 15.757715225219727, 0.0, 0.0, 0.1006716638803482, 0.0, 1.3957998853796444e-25, 3.1913153009099915e-08, 0.0, 0.0, 0.0, 0.0, 0.0, 288.5118713378906, 0.0, 0.530575692653656, 0.0, 3.806386882784214e-36, 0.0, 1668.273193359375, 0.0, 410.58343505859375, 0.0, 8.028081831624007e-34, 0.0, 413.3502807

## Apply dimensionality reduction through TSNE and plot the question embeddings

Extract a random sample of 5,000 questions from the full dataset to keep the TSNE computation tractable.

In [24]:
limit = 5000
embeddings = questions_with_embeddings.select("question_embeddings")\
    .orderBy(F.rand())\
    .limit(limit)\
    .rdd\
    .map(lambda x: x["question_embeddings"])\
    .collect()

len(embeddings)

5000

### 2D TSNE representation

In [25]:
tsne = TSNE(n_components=2, perplexity=10)
X_embedded = tsne.fit_transform(embeddings)
with open("model-checkpoints/autoencoder-words/question_embeddings_tsne.pickle", "wb") as f:
    pickle.dump(X_embedded, f)
tsne.kl_divergence_

0.5567834377288818

In [27]:
question_embeddings_tsne_df = spark.createDataFrame([x.tolist() for x in X_embedded], ["X_tsne", "Y_tsne"])
question_embeddings_tsne_df = question_embeddings_tsne_df.withColumn("columnindex", F.row_number().over(w))

questions_with_tsne_embeddings = questions_with_embeddings\
    .join(question_embeddings_tsne_df, questions_with_embeddings.columnindex == question_embeddings_tsne_df.columnindex)\
    .select("question", "X_tsne", "Y_tsne")\
    .toPandas()
len(questions_with_tsne_embeddings)

5000
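One caveat about the join above: the sample is drawn with `orderBy(F.rand())`, while the TSNE output is numbered from 1 to `limit` by a fresh `row_number`. The join therefore pairs each point with the question whose `columnindex` equals that row number, not necessarily with the sampled question the point was computed from. The same pattern is used in the 3D variant. One way to keep labels and coordinates together is to collect the question text with each embedding and attach the coordinates by position. The sketch below illustrates this in plain Python under stated assumptions: `sample_rows`, `attach_coordinates`, and the toy `reduce_fn` (a stand-in for something like `TSNE(n_components=2).fit_transform`) are hypothetical names.

```python
import random


def sample_rows(rows, limit, seed=0):
    """Draw a random sample, keeping question and embedding together."""
    rng = random.Random(seed)
    return rng.sample(rows, limit)


def attach_coordinates(sampled, reduce_fn):
    """Reduce the sampled embeddings and re-attach the question text.

    reduce_fn must return one coordinate list per input embedding, in the
    same order as its input (as scikit-learn's fit_transform does).
    """
    coords = reduce_fn([row["question_embeddings"] for row in sampled])
    return [
        {"question": row["question"], "X_tsne": xy[0], "Y_tsne": xy[1]}
        for row, xy in zip(sampled, coords)
    ]


# Toy data: the embedding encodes the row id so alignment is easy to verify.
rows = [
    {"question": f"question {i}", "question_embeddings": [float(i), -float(i), 0.0]}
    for i in range(100)
]


def reduce_fn(embeddings):
    # Stand-in reducer that keeps the first two components of each embedding.
    return [embedding[:2] for embedding in embeddings]


points = attach_coordinates(sample_rows(rows, limit=10), reduce_fn)
print(all(p["question"] == f"question {int(p['X_tsne'])}" for p in points))  # True
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` and plotted with `px.scatter` exactly as the pandas dataframe is in the next cell.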

In [31]:
# Use column names of df for the different parameters x, y, color, ...
fig = px.scatter(questions_with_tsne_embeddings, x="X_tsne", y="Y_tsne",
                 hover_name="question",
                 title="Question embeddings", 
                 range_color=[0, 1],
                 opacity=0.3
                )

fig.show()

### 3D TSNE representation

In [32]:
tsne = TSNE(n_components=3, perplexity=10)
X_embedded = tsne.fit_transform(embeddings)
with open("model-checkpoints/autoencoder-words/question_embeddings_tsne_3d.pickle", "wb") as f:
    pickle.dump(X_embedded, f)
tsne.kl_divergence_

0.4217800796031952

In [33]:
question_embeddings_tsne_df = spark.createDataFrame([x.tolist() for x in X_embedded], ["X_tsne", "Y_tsne", "Z_tsne"])
question_embeddings_tsne_df = question_embeddings_tsne_df.withColumn("columnindex", F.row_number().over(w))

questions_with_tsne_embeddings = questions_with_embeddings\
    .join(question_embeddings_tsne_df, questions_with_embeddings.columnindex == question_embeddings_tsne_df.columnindex)\
    .select("question", "X_tsne", "Y_tsne", "Z_tsne")\
    .toPandas()
questions_with_tsne_embeddings[:10]

| | question | X_tsne | Y_tsne | Z_tsne |
|---|---|---|---|---|
| 0 | where are the trucks? | 30.490835 | 2.520457 | 10.517181 |
| 1 | how many sheep can you see? | -10.369836 | 10.554349 | -12.972257 |
| 2 | what is the color of the zipper? | -9.842864 | 34.16851 | 6.200254 |
| 3 | how many men are playing baseball? | -11.898112 | -17.116465 | 17.832705 |
| 4 | what is the white structure behind the ramp? | -15.733245 | 15.438624 | 18.645353 |
| 5 | are there waves? | -0.912281 | -4.252706 | 13.751265 |
| 6 | what sport are they playing? | -16.420198 | -0.070695 | 9.371421 |
| 7 | is it summer? | 5.750901 | -35.686661 | 1.345763 |
| 8 | what animals are this? | -8.854768 | -14.996067 | 7.397661 |
| 9 | is the sun shining? | 7.756484 | 15.37678 | -14.52162 |


In [34]:
# Use column names of df for the different parameters x, y, color, ...
fig = px.scatter_3d(questions_with_tsne_embeddings, x="X_tsne", y="Y_tsne", z="Z_tsne",
                 hover_name="question",
                 title="3D Question embeddings", opacity=0.3
                )

fig.show()