# Question auto-encoder 

In this notebook we train an LSTM auto-encoder: the questions are fed in as input and the same questions are used as the target.

The goal is to learn a representation of each question that can then be clustered to find questions sharing the same structure.

### How will the LSTM auto-encoder be trained?
* by feeding the question **tokens** as input and using the same token sequence as the target

### Imports

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
from nltk.tokenize import word_tokenize
from collections import Counter
from keras.preprocessing.sequence import *
from keras.models import *
from keras.layers import *
from keras.utils import plot_model
from keras.callbacks import ModelCheckpoint

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks")

spark = SparkSession \
    .builder \
    .appName("QuestionRephrasing-AutoEncoder") \
    .config("spark.executor.memory", "5G")\
    .config("spark.driver.memory", "10G")\
    .config("spark.driver.maxResultSize", "5G")\
    .getOrCreate()

spark.sparkContext.setCheckpointDir('data/checkpoints')
questions = spark.read.parquet("data/processed/union/*")
questions.printSchema()

Using TensorFlow backend.


root
 |-- question: string (nullable = true)
 |-- answer: string (nullable = true)
 |-- image_id: string (nullable = true)
 |-- tokenized_question: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- question_len: double (nullable = true)
 |-- question_word_len: double (nullable = true)
 |-- first_word: string (nullable = true)



In [2]:
# Extract the maximum question length, in tokens, across all questions.
# It is needed later for sequence padding.
max_word_len = int(questions.agg({"question_word_len": "max"}).collect()[0]["max(question_word_len)"])

f"Maximum question word length is {max_word_len}."

'Maximum question word length is 28.'

## Vocabulary build

We need a numerical representation of each *word*, so every distinct token is assigned an integer index.

In [4]:
# Tokens vocabulary and mappers
tokens = questions.select('tokenized_question')\
    .rdd\
    .flatMap(lambda x: x['tokenized_question'])\
    .collect()

word_mapping = {}
word_mapping_reversed = {}
word_counter = Counter(tokens)
for idx, value in enumerate(word_counter):
    word_mapping[value] = idx
    word_mapping_reversed[idx] = value
    
f"Word mapping example for 'is': {word_mapping['is']}."

"Word mapping example for 'is': 1."

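The mapping above assigns indices in the order in which `Counter` first encounters each token. A minimal sketch on a toy token list (the sample tokens are made up for illustration) shows that behaviour next to a frequency-ordered variant built with `most_common()`. The variant is an assumption added for comparison, not something this notebook does.

```python
from collections import Counter

# Toy token stream standing in for the collected `tokens` list (illustrative data).
toy_tokens = ["what", "is", "this", "?", "what", "color", "is", "this", "?", "is"]

toy_counter = Counter(toy_tokens)

# Same approach as the notebook: iterating a Counter yields keys
# in first-insertion order (Python 3.7+).
by_first_seen = {word: idx for idx, word in enumerate(toy_counter)}
reversed_mapping = {idx: word for word, idx in by_first_seen.items()}

# Hypothetical alternative: order indices by descending frequency.
by_frequency = {word: idx for idx, (word, _) in enumerate(toy_counter.most_common())}

print(by_first_seen)
print(by_frequency)
```

With first-seen ordering the index of a word depends on the order in which the questions were collected, so the two mappings generally differ.
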
### Input pre-processing

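Now let's pre-process the input by replacing each *word* with its integer index from the **mapping**. Despite the column name `question_word_embeddings`, these are plain indices, not learned embedding vectors.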

In [6]:
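# Map each token to a one-element list holding its index shifted by 1,
# so that 0 stays free to be used as the padding value.
extract_word_embeddings = F.udf(lambda tokenized_question: [[word_mapping[word] + 1] for word in tokenized_question], ArrayType(ArrayType(IntegerType())))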

questions = questions.withColumn('question_word_embeddings', extract_word_embeddings(F.col('tokenized_question')))
questions.head(1)

[Row(question='what is this photo taken looking through?', answer='net', image_id='458752', tokenized_question=['what', 'is', 'this', 'photo', 'taken', 'looking', 'through', '?'], question_len=41.0, question_word_len=8.0, first_word='what', question_char_embeddings=[[1], [2], [3], [4], [5], [6], [7], [5], [4], [2], [6], [7], [5], [8], [2], [9], [4], [9], [5], [4], [3], [10], [11], [12], [5], [13], [9], [9], [10], [6], [12], [14], [5], [4], [2], [15], [9], [16], [14], [2], [17]], question_word_embeddings=[[1], [2], [3], [4], [5], [6], [7], [8]])]
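
The `+ 1` in the UDF shifts every word index up by one, which leaves `0` free to act as the padding value later. Below is a small pure-Python sketch of that encoding and its inverse. The mapping is written by hand to be consistent with the output row above, and the helper names `encode` and `decode` are hypothetical.

```python
# Zero-based mapping for the example question; after the +1 shift this gives
# 'what' -> 1, 'is' -> 2, ... as in the row shown above.
word_mapping = {"what": 0, "is": 1, "this": 2, "photo": 3,
                "taken": 4, "looking": 5, "through": 6, "?": 7}
word_mapping_reversed = {idx: word for word, idx in word_mapping.items()}

PAD = 0  # index reserved for padding


def encode(tokens):
    """Map tokens to one-element lists of shifted indices, as the UDF does."""
    return [[word_mapping[token] + 1] for token in tokens]


def decode(encoded):
    """Invert `encode`, skipping padding positions."""
    return [word_mapping_reversed[idx - 1] for (idx,) in encoded if idx != PAD]


tokens = ["what", "is", "this", "photo", "taken", "looking", "through", "?"]
encoded = encode(tokens)
print(encoded)
print(decode(encoded + [[PAD], [PAD]]))
```

The round trip only works on exact integer indices; real-valued model outputs would need to be rounded first.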

In [8]:
word_embeddings = questions.select('question_word_embeddings')\
    .rdd\
    .map(lambda x: x['question_word_embeddings'])\
    .collect()
word_embeddings = pad_sequences(word_embeddings, maxlen=max_word_len, dtype='int32', padding='post', truncating='pre', value=0.0)
word_embeddings[:1]

array([[[1],
        [2],
        [3],
        [4],
        [5],
        [6],
        [7],
        [8],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0]]], dtype=int32)
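
`pad_sequences` with `padding='post'` appends the fill value after the real tokens, while `truncating='pre'` drops entries from the start of any sequence longer than `maxlen`. A dependency-free sketch of those two rules follows; the function `pad_post_truncate_pre` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
def pad_post_truncate_pre(sequences, maxlen, value=0):
    """Pad at the end and truncate at the start; each step is a one-element list."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # keep the last `maxlen` steps (assumes maxlen > 0)
        filler = [[value] for _ in range(maxlen - len(kept))]
        padded.append(kept + filler)
    return padded


short = [[1], [2], [3]]
long = [[1], [2], [3], [4], [5], [6]]

result = pad_post_truncate_pre([short, long], maxlen=4)
print(result)
# [[[1], [2], [3], [0]], [[3], [4], [5], [6]]]
```

Because `max_word_len` is the length of the longest question, no sequence in this notebook is actually truncated; only the padding rule takes effect.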

### Create the model

In [37]:
encoding_dim = 50

model = Sequential()
model.add(LSTM(encoding_dim, activation='relu', input_shape=(max_word_len, 1), dropout=0.25, recurrent_dropout=0.25))
model.add(Dropout(0.3))
model.add(Dense(100))
model.add(RepeatVector(max_word_len))
model.add(Dropout(0.3))
model.add(LSTM(max_word_len, activation='relu', return_sequences=True, dropout=0.25, recurrent_dropout=0.25))
model.add(TimeDistributed(Dense(1)))
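# NOTE: `opt` is not defined in the cells shown here; it must be an optimizer
# instance created in an earlier cell before this one is run.
model.compile(optimizer=opt, loss='mse', metrics=['mae', 'accuracy'])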

model.summary()

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_31 (LSTM)               (None, 50)                10400     
_________________________________________________________________
dropout_24 (Dropout)         (None, 50)                0         
_________________________________________________________________
dense_31 (Dense)             (None, 100)               5100      
_________________________________________________________________
repeat_vector_16 (RepeatVect (None, 28, 100)           0         
_________________________________________________________________
dropout_25 (Dropout)         (None, 28, 100)           0         
_________________________________________________________________
lstm_32 (LSTM)               (None, 28, 28)            14448     
_________________________________________________________________
time_distributed_16 (TimeDis (None, 28, 1)           
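
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias, so it holds `4 * ((input_dim + units) * units + units)` weights; a dense layer holds `input_dim * units + units`. A short sketch of that arithmetic (the helper names are hypothetical; the last line of the summary is cut off, so the final value is computed here, not read from the log):

```python
def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


encoder_lstm = lstm_params(input_dim=1, units=50)     # lstm_31
bottleneck = dense_params(input_dim=50, units=100)    # dense_31
decoder_lstm = lstm_params(input_dim=100, units=28)   # lstm_32
output_dense = dense_params(input_dim=28, units=1)    # time_distributed_16

print(encoder_lstm, bottleneck, decoder_lstm, output_dense)
# 10400 5100 14448 29
```

`TimeDistributed` applies the same `Dense(1)` weights at every time step, which is why the output layer's count does not scale with the sequence length.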

### Train the model

In [38]:
filepath="model-checkpoints/autoencoder-words/autoencoder-model-{epoch:02d}-{val_accuracy:.2f}.hdf5"

checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, mode='max')
callbacks_list = [checkpoint]

model.fit(word_embeddings, word_embeddings,
                epochs=10,
                batch_size=1000,
                shuffle=True,
                callbacks=callbacks_list,
                validation_split=0.3)

Train on 1327111 samples, validate on 568763 samples
Epoch 1/10

Epoch 00001: saving model to model-checkpoints/autoencoder-words/autoencoder-model-01-0.25.hdf5
Epoch 2/10


  'TensorFlow optimizers do not '



Epoch 00002: saving model to model-checkpoints/autoencoder-words/autoencoder-model-02-0.26.hdf5
Epoch 3/10

Epoch 00003: saving model to model-checkpoints/autoencoder-words/autoencoder-model-03-0.24.hdf5
Epoch 4/10

Epoch 00004: saving model to model-checkpoints/autoencoder-words/autoencoder-model-04-0.09.hdf5
Epoch 5/10

Epoch 00005: saving model to model-checkpoints/autoencoder-words/autoencoder-model-05-0.06.hdf5
Epoch 6/10

Epoch 00006: saving model to model-checkpoints/autoencoder-words/autoencoder-model-06-0.03.hdf5
Epoch 7/10

Epoch 00007: saving model to model-checkpoints/autoencoder-words/autoencoder-model-07-0.03.hdf5
Epoch 8/10

Epoch 00008: saving model to model-checkpoints/autoencoder-words/autoencoder-model-08-0.03.hdf5
Epoch 9/10

Epoch 00009: saving model to model-checkpoints/autoencoder-words/autoencoder-model-09-0.04.hdf5
Epoch 10/10

Epoch 00010: saving model to model-checkpoints/autoencoder-words/autoencoder-model-10-0.02.hdf5


<keras.callbacks.callbacks.History at 0x1a1966f90>
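
The checkpoint file names in the log follow from the `filepath` template, which is filled in with Python's `str.format` using the epoch number and the logged metrics. The sketch below reproduces the first file name and checks the train/validation sizes reported in the log against `validation_split=0.3`. The metric value passed in is simply read back from the log line, not recomputed.

```python
filepath = ("model-checkpoints/autoencoder-words/"
            "autoencoder-model-{epoch:02d}-{val_accuracy:.2f}.hdf5")

# Epoch 1 with val_accuracy 0.25, as reported in the training log.
first_checkpoint = filepath.format(epoch=1, val_accuracy=0.25)
print(first_checkpoint)

# Sample counts taken from the "Train on ... validate on ..." log line.
train_samples, val_samples = 1327111, 568763
total = train_samples + val_samples
val_fraction = val_samples / total
print(total, round(val_fraction, 4))
```

Since `ModelCheckpoint` is created without `save_best_only`, a file is written after every epoch regardless of whether `val_accuracy` improved, which matches the ten files listed in the log.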