## Data preparation and preprocessing

### Short introduction

We are the Conloquor team; the name means "dialogue" in Latin. This semester we are developing a chatbot as our project.

Members:

- Béres Bálint
- Drexler Konrád
- Drexler Kristóf

### Data source

We found a dataset on [Reddit](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/)
that includes all Reddit comments, grouped by month. A user uploaded the entire dataset to Google's
BigQuery platform; here is the [Reddit thread](https://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/).

### Downloading the data

We chose to use comments from May 2015 for our project. To get them, we ran the following SQL query on the BigQuery platform.

```SQL
select *
from `fh-bigquery.reddit_comments.2015_05`
where subreddit like 'science'
    or subreddit like 'politics'
    or subreddit like 'gaming'
    or subreddit like 'worldnews'
    or subreddit like 'CasualConversation'
    or subreddit like 'sports'
```

At first we downloaded all the comments made that month, but the resulting file was 5 GB compressed.
Therefore, we limited the source subreddits to **r/science**, **r/politics**, **r/gaming**,
**r/worldnews**, **r/CasualConversation** and **r/sports**. This query still yielded 1.45 million
comments to work with, which is a manageable size. We exported the resulting table to a JSON file,
`data_2015_05.json`, which is available on [Google Drive](https://drive.google.com/file/d/13n1ET0mppD6i-DjqyJIFjAiMQp6V7v6q/view?usp=sharing).
In the future, the project will download the data automatically.

## Formatting the data for preprocessing

The initial JSON file still had many unnecessary columns and unusable rows. Using further SQL queries, we
trimmed and transformed the data to fit our needs. At the end of the process we were left with just under
480,000 message-response pairs. Among other steps, we filtered out messages of 200 characters or more, `[deleted]` messages
and messages that consisted only of a hyperlink.
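
The filters described above can be illustrated in plain Python on a few made-up comments. This is only a sketch: the toy rows, the `is_usable` helper and the `LINK_ONLY` pattern are assumptions for illustration, while the notebook itself applies the filters with SQL and regular expressions in the cells below.

```python
import re

# Hypothetical toy rows; the real export has many more columns.
comments = [
    {"name": "t1_a", "body": "What a great match last night!"},
    {"name": "t1_b", "body": "[deleted]"},
    {"name": "t1_c", "body": "https://example.com/some/page"},
    {"name": "t1_d", "body": "x" * 250},
]

MAX_LEN = 200  # bodies must be shorter than this, as in the SQL filter
# Assumed definition of a "hyperlink only" message: a single URL and nothing else.
LINK_ONLY = re.compile(r"^\s*https?://\S+\s*$")


def is_usable(body):
    """Return True if a comment body passes all three filters."""
    if body == "[deleted]":
        return False
    if len(body) >= MAX_LEN:
        return False
    if LINK_ONLY.match(body):
        return False
    return True


usable = [c["name"] for c in comments if is_usable(c["body"])]
print(usable)
```

Only the first toy comment survives; the other three are dropped by one filter each.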

In [None]:
# pandasql is not preinstalled on Google Colab, so it has to be installed manually
!pip install pandasql

In [None]:
# import statements
import pandas as pd
import pandasql as ps

In [None]:
# Create dataframe from json file
raw_data_df = pd.read_json(r'data_2015_05.json', orient='records', lines=True)

In [None]:
# Show top ten rows
raw_data_df.head(10)

In [None]:
# Filter raw data:
# select only rows whose body is shorter than 200 characters and is not '[deleted]'
sql_query = " select body" \
            "       , name" \
            "       , link_id" \
            "       , parent_id" \
            "       , score" \
            " from raw_data_df" \
            " where length(body) < 200 and body <> '[deleted]'"
# Can only be saved as sdf since this is how pandas works.
sdf = ps.sqldf(sql_query)

In [None]:
# Delete the original Dataframe to save memory
del raw_data_df

In [None]:
# List of regular expressions to further filter the bodies of the comments;

# Remove all links from the comments
sdf.replace(r'https?://(www\.)?[-a-zA-Z0-9@:%.+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9(_)@:%+.~#?&//=]*)', '', regex=True, inplace=True)

# Strip the '/u/' and 'r/' prefixes from user and subreddit links, and remove backslashes
sdf.replace(r'(/u/)?(r/)?(^)?(\\)?', '', regex=True, inplace=True)

# Replace '&gt;' and '&lt;' with '>' and '<' respectively
sdf.replace(r'(&gt;)', '>', regex=True, inplace=True)
sdf.replace(r'(&lt;)', '<', regex=True, inplace=True)

# Replace '&amp;' with an ampersand
sdf.replace(r'(&amp;)', '&', regex=True, inplace=True)
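
As an alternative to separate replacements for each HTML entity, the Python standard library can decode them in a single pass. The snippet below is only a sketch of that option on a made-up string; it is not what the notebook runs.

```python
import html

# Hypothetical comment body containing escaped characters
raw = "5 &gt; 3 &amp;&amp; 2 &lt; 4"

# html.unescape decodes every entity in one pass
clean = html.unescape(raw)
print(clean)
```

Because the decoding happens in one pass, a doubly escaped sequence such as `&amp;lt;` becomes `&lt;` rather than `<`, which matches the effect of replacing `&amp;` last.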

In [None]:
# Rename body column to response
response_df = sdf.rename(columns={'body': 'response'})

In [None]:
# Show top ten rows
response_df.head(10)

In [None]:
# Create query-response pairs
# Join the two tables to make a single one
# Concatenate '<eos>' to the end, and '<sos>' to the start of the response and store each of them, in a different column
sql_query = " select inp.body" \
            "       , resp.response || ' <eos>'" \
            "       , '<sos> ' || resp.response" \
            " from response_df resp" \
            " left join sdf inp" \
            " on resp.parent_id = inp.name" \
            " where inp.body is not null and inp.body <> '' and resp.response <> ''"
# Can only be saved as sdf since this is how pandas works.
sdf = ps.sqldf(sql_query)
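
The join above pairs every comment with the comment it replies to. The same idea can be sketched without SQL using a dictionary lookup; the three toy comments are made up, and the variable names are chosen for illustration only.

```python
# Hypothetical toy comments: 'name' is the comment id, 'parent_id' the id it replies to.
comments = [
    {"name": "t1_a", "parent_id": "t3_post", "body": "Anyone watching the game?"},
    {"name": "t1_b", "parent_id": "t1_a", "body": "Yes, what a goal!"},
    {"name": "t1_c", "parent_id": "t1_zzz", "body": "Reply to a filtered-out comment"},
]

# Index every comment body by its id so that parents can be looked up directly.
body_by_name = {c["name"]: c["body"] for c in comments}

pairs = []
for c in comments:
    parent_body = body_by_name.get(c["parent_id"])
    # Keep only replies whose parent is present and where neither text is empty.
    if parent_body and c["body"]:
        pairs.append({
            "input": parent_body,
            "output": c["body"] + " <eos>",
            "output_input": "<sos> " + c["body"],
        })

print(pairs)
```

Only `t1_b` yields a pair here: `t1_a` replies to a post rather than a comment, and the parent of `t1_c` is not in the data.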

In [None]:
# Rename body to input, second column to output and third column to output_input
xy_df = sdf.rename(columns={'body': 'input', "resp.response || ' <eos>'": 'output', "'<sos> ' || resp.response": 'output_input'})

In [None]:
# Delete sdf to free up memory
del sdf

In [None]:
# show top ten rows of the new dataframe
xy_df.head(10)

In [None]:
# Export to a JSON file so that we don't have to run all previous cells again
xy_df.to_json('xy_data_2015_05.json', orient='records', lines=True)

In [None]:
# Delete all to free memory
del response_df
del xy_df
del sql_query

## Preprocessing

Now that we have a dataset of usable message-response pairs, let's preprocess the data. A tokenizer encodes words as integers;
separate tokenizers are used for the message (input) data and the response (output) data. Next, we pad the encoded messages so that they all have the same fixed length.
The `.json` file created in the previous section can be downloaded from this [Google Drive](https://drive.google.com/file/d/1J65cyCx6Zp1AgzTCGrB1Oqye9nkokmot/view?usp=sharing) link.
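
To make the encoding step concrete, here is a plain-Python sketch of roughly what a word-level tokenizer does: words are ranked by frequency, the most frequent word gets index 1, and index 0 stays free for padding. The helpers `fit_word_index` and `texts_to_sequences` are hypothetical stand-ins written for this example; they skip punctuation filtering and are not the Keras implementation.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to its frequency rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; index 0 is left free for padding
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words=None):
    """Encode texts as lists of indices, dropping words whose index is >= num_words."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            seq = [i for i in seq if i < num_words]
        sequences.append(seq)
    return sequences


texts = ["the cat sat", "the dog sat down", "the end"]
word_index = fit_word_index(texts)
print(word_index)
print(texts_to_sequences(texts, word_index))
# With a vocabulary limit of 3, only indices 1 and 2 are kept
print(texts_to_sequences(texts, word_index, num_words=3))
```

The `num_words` limit plays the role of `MAX_NUM_WORDS` in the cells below: rarer words are still counted when the index is built, but they are left out of the encoded sequences.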

In [None]:
# import statements
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# Load data from the saved json file
xy_data_df = pd.read_json(r'xy_data_2015_05.json', orient='records', lines=True)

In [None]:
# for testing purposes we reduced the dataframe to the first 10000 comments
xy_data_df = xy_data_df[0:10000]  

In [None]:
# Check a sample row (index 172) from the dataframe
print(xy_data_df['input'][172])
print(xy_data_df['output'][172])
print(xy_data_df['output_input'][172])

In [None]:
# set max number of words recognized by the model
MAX_NUM_WORDS = 5000

In [None]:
# Text from the input column is tokenized

input_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)

# Tokenizer is fitted
input_tokenizer.fit_on_texts(xy_data_df['input'])

# Sequences are generated from the text
input_integer_seq = input_tokenizer.texts_to_sequences(xy_data_df['input'])

# { word: index} dictionary of the input_tokenizer
word2idx_inputs = input_tokenizer.word_index
print('Total unique words in the input: %s' % len(word2idx_inputs))

# Word count and max input sentence length are stored
max_input_len = max(len(sen) for sen in input_integer_seq)
print("Length of longest sentence in input: %g" % max_input_len)

In [None]:
# Text from the output and output_input columns is tokenized.
# The filter string is the Keras default minus the '<' and '>' characters,
# so that the '<sos>' and '<eos>' markers are kept intact as tokens
output_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')

# Tokenizer is fitted
output_tokenizer.fit_on_texts(pd.concat([xy_data_df['output'], xy_data_df['output_input']]))

# Sequences are generated from the text
output_integer_seq = output_tokenizer.texts_to_sequences(xy_data_df['output'])
output_input_integer_seq = output_tokenizer.texts_to_sequences(xy_data_df['output_input'])

# { word: index} dictionary of the output_tokenizer
word2idx_outputs = output_tokenizer.word_index
print('Total unique words in the output: %s' % len(word2idx_outputs))

# Word count and max output sentence length are stored
num_words_output = len(word2idx_outputs) + 1
max_out_len = max(len(sen) for sen in output_integer_seq)
print("Length of longest sentence in the output: %g" % max_out_len)

In [None]:
# input_integer_seq is padded and will be fed into the encoder
# max_input_len stores the maximum input sentence length
encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=max_input_len)
print("encoder_input_sequences.shape:", encoder_input_sequences.shape)
print("encoder_input_sequences[172]:", encoder_input_sequences[172])

In [None]:
# Example word indices from input_tokenizer
print(word2idx_inputs["ill"])
print(word2idx_inputs["skins"])

In [None]:
# output_input_integer_seq is padded which will be fed into the decoder
# max_out_len stores the maximum output sentence length
decoder_input_sequences = pad_sequences(output_input_integer_seq, maxlen=max_out_len, padding='post')
print("decoder_input_sequences.shape:", decoder_input_sequences.shape)
print("decoder_input_sequences[172]:", decoder_input_sequences[172])
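
The encoder input is padded with the `pad_sequences` default (`padding='pre'`), so the zeros come first, while the decoder input uses `padding='post'`, so the zeros come last. The hypothetical `pad` helper below mimics both modes in plain Python for a single sequence; it is a sketch and assumes `maxlen` is at least 1.

```python
def pad(sequence, maxlen, padding="pre", value=0):
    """Pad a list of ints to exactly maxlen items, truncating from the front if it is too long."""
    sequence = sequence[-maxlen:]  # keep the last maxlen items
    fill = [value] * (maxlen - len(sequence))
    return fill + sequence if padding == "pre" else sequence + fill


print(pad([7, 8, 9], 5))                  # zeros in front, as for the encoder input
print(pad([7, 8, 9], 5, padding="post"))  # zeros at the end, as for the decoder input
```

With pre-padding the real words sit at the end of the sequence, right before the encoder produces its final state; with post-padding the decoder sequence starts directly with `<sos>`.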

In [None]:
# Example word indices from output_tokenizer
print(word2idx_outputs["<eos>"])
print(word2idx_outputs["not"])
print(word2idx_outputs["correctly"])
# print(word2idx_outputs["invisibility"])

The following two cells count how often each token occurs in the data and plot the most frequent ones.

In [None]:
subset_dict = {str(value): 0 for key, value in input_tokenizer.word_index.items()}
input_sequences = []

# The input and response sentences are tokenized
# and the token occurrences are counted in subset_dict
for _, row in xy_data_df.iterrows():

    # Input tokenization
    token_list = input_tokenizer.texts_to_sequences([row['input']])[0]

    for token in token_list:
        subset_dict[str(token)] += 1

    # print('input')
    # print(token_list)
    # print(input_tokenizer.sequences_to_texts([token_list]))
    # print()

    # Response tokenization (input_tokenizer is reused, so words unknown to it are skipped)
    token_list = input_tokenizer.texts_to_sequences([row['output']])[0]

    # print('response')
    # print(token_list)
    # print(input_tokenizer.sequences_to_texts([token_list]))
    # print()

    for token in token_list:
        subset_dict[str(token)] += 1

In [None]:
import matplotlib.pyplot as plt

# A sequence from 1 to 30 is created (index 0 is reserved for padding and maps to no word)
list_c = list(range(1, 31))

# The string values of the 30 most used tokens are retrieved
example_seq = input_tokenizer.sequences_to_texts([list_c])[0]
print(example_seq)

# Turns the example_seq string into a list of words
x = example_seq.split()

# The 30 most popular words are plotted based on their occurrence
plt.bar(x, list(subset_dict.values())[:len(x)], align='center')
plt.show()
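
For reference, the counting step can also be written with `collections.Counter` from the standard library. The sketch below uses made-up token sequences in place of the tokenizer output.

```python
from collections import Counter

# Hypothetical token sequences, standing in for the output of texts_to_sequences
sequences = [[1, 3, 2], [1, 4, 2, 5], [1, 6]]

# Count every token across all sequences
token_counts = Counter(token for seq in sequences for token in seq)

# The two most frequent tokens with their counts
top = token_counts.most_common(2)
print(top)
```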

### Word embeddings

This is where our work for the second milestone starts. We relied heavily on [this](https://stackabuse.com/python-for-nlp-neural-machine-translation-with-seq2seq-in-keras/) guide on Stack Abuse, although we made several modifications to get it to work with our dataset.

The following section embeds the words recognized by the model in a vector with 100 dimensions.

In [None]:
# import statements
from numpy import array
from numpy import asarray
from numpy import zeros

In [None]:
# we used the pretrained GloVe word vectors
!wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
# unzip the downloaded file
!unzip glove.6B.zip

In [None]:
#set embedding size
EMBEDDING_SIZE = 100

In [None]:
# the embedding dictionary is a dictionary with the key being a word,
# and the value being the corresponding 100d vector
embeddings_dictionary = dict()

# open the file containing the 100d vectors (closed automatically when the block ends)
with open(r'glove.6B.100d.txt', encoding="utf8") as glove_file:
    # iterate over the lines in the file
    for line in glove_file:
        records = line.split()  # split along whitespaces
        word = records[0]       # the word itself is the first element of the list
        # the vector representation is the rest of the elements
        vector_dimensions = asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions  # insert word: vector representation into dictionary

In [None]:
# create the embedding matrix

# limit the number of words understood by the model to MAX_NUM_WORDS
num_words = min(MAX_NUM_WORDS, len(word2idx_inputs) + 1)
# create embedding matrix filled with zeroes
embedding_matrix = zeros((num_words, EMBEDDING_SIZE))

# iterate over the num_words - 1 most frequent words collected by the tokenizer
for word, index in list(word2idx_inputs.items())[:num_words-1]:
    # get embedding vector corresponding to the given word
    embedding_vector = embeddings_dictionary.get(word)
    # if an embedding vector exists, insert it into the matching row of the embedding matrix;
    # otherwise the row stays a null vector
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
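
The two steps above, parsing the GloVe text format and filling the embedding matrix, can be followed on a toy example. Everything in this sketch is made up for illustration: the 3-dimensional vectors, the tiny word index and the variable names. Plain lists are used instead of NumPy arrays.

```python
# Hypothetical 3-dimensional vectors in the GloVe text format: "<word> <v1> <v2> <v3>"
glove_lines = [
    "the 0.1 0.2 0.3",
    "cat 0.4 0.5 0.6",
]
EMBEDDING_DIM = 3  # the notebook uses 100

# Parse each line into word -> vector
embeddings = {}
for line in glove_lines:
    word, *values = line.split()
    embeddings[word] = [float(v) for v in values]

# Toy tokenizer index; 'conloquor' has no pretrained vector
word_index = {"the": 1, "cat": 2, "conloquor": 3}
num_rows = len(word_index) + 1  # row 0 is reserved for padding

# Start from all zeros and copy in the vectors that exist
matrix = [[0.0] * EMBEDDING_DIM for _ in range(num_rows)]
for word, index in word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        matrix[index] = vector

for row in matrix:
    print(row)
```

Row `i` of the matrix holds the vector of the word with tokenizer index `i`; words without a pretrained vector, and the padding row 0, stay all zeros.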

In [None]:
# some testing
index = 4997

In [None]:
# print the word at position `index` in the tokenizer's word list, together with its index
print(list(word2idx_inputs.items())[index])

In [None]:
# print embedding of word from the embedding dictionary
print(embeddings_dictionary[list(word2idx_inputs.items())[index][0]])

In [None]:
# print embedding of word from the embedding matrix
print(embedding_matrix[list(word2idx_inputs.items())[index][1]])

### Model structure
This section builds the model and trains it on the data prepared in the previous sections.

In [None]:
# import statements
import numpy as np
from tensorflow.keras.layers import Embedding

In [None]:
# create embedding layer from embedding matrix
embedding_layer = Embedding(num_words, EMBEDDING_SIZE, weights=[embedding_matrix], input_length=max_input_len)

In [None]:
# get input sentences
input_sentences = xy_data_df['input']

# create a null 3D array with dimensions:
# number of input sentences
# maximum length (in words) of the output sentences
# number of words
decoder_targets_one_hot = np.zeros((
        len(input_sentences),
        max_out_len,
        num_words
    ),
    dtype='float32'
)

In [None]:
# check shape
decoder_targets_one_hot.shape

In [None]:
# pad output sequences to the same length,
# namely to the maximum length of the output sequences
decoder_output_sequences = pad_sequences(output_integer_seq, maxlen=max_out_len, padding='post')

In [None]:
# fill the previously created null 3D array with one-hot vectors in the following fashion:
# set decoder_targets_one_hot[m, c, r] to 1, where
# m is the index of the sentence in decoder_output_sequences: 1st sentence -> m = 0, n-th sentence -> m = n-1
# c is the position of the word in the sentence: 1st word in sentence -> c = 0, n-th word in sentence -> c = n-1
# r is the index given to the word by the output tokenizer: '<eos>' -> r = 1, 'not' -> r = 15

for m, sequence in enumerate(decoder_output_sequences):
    for c, r in enumerate(sequence):
        decoder_targets_one_hot[m, c, r] = 1
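
A small sketch of what the loop above produces for one sequence, followed by a rough estimate of how much memory the full target array needs. The `one_hot` helper and the value of `MAX_OUT_LEN` are assumptions for illustration; the real maximum length depends on the data.

```python
def one_hot(sequence, vocab_size):
    """Turn a list of token indices into a list of one-hot rows."""
    rows = []
    for index in sequence:
        row = [0.0] * vocab_size
        row[index] = 1.0
        rows.append(row)
    return rows


# Token 2 followed by one padding position (index 0)
print(one_hot([2, 0], vocab_size=4))

# Rough memory estimate for the float32 target array
NUM_SENTENCES = 10_000   # size of the reduced dataframe
MAX_OUT_LEN = 40         # hypothetical value; the notebook computes max_out_len from the data
VOCAB_SIZE = 5_000       # MAX_NUM_WORDS
BYTES_PER_FLOAT32 = 4

size_gb = NUM_SENTENCES * MAX_OUT_LEN * VOCAB_SIZE * BYTES_PER_FLOAT32 / 1e9
print(size_gb, "GB")
```

Under these assumptions the array takes about 8 GB, which is one reason to work with a subset of the data. Note that padding positions also receive a one, in slot 0. If memory becomes a problem, one option worth exploring is the Keras `sparse_categorical_crossentropy` loss, which accepts integer targets instead of one-hot vectors.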

In [None]:
# set number of LSTM nodes
LSTM_NODES = 256

In [None]:
# import statements
from tensorflow.keras.layers import LSTM, Input, Dense
from tensorflow.keras import Model

In [None]:
encoder_inputs_placeholder = Input(shape=(max_input_len,))
x = embedding_layer(encoder_inputs_placeholder)
encoder = LSTM(LSTM_NODES, return_state=True)

encoder_outputs, h, c = encoder(x)
encoder_states = [h, c]

In [None]:
decoder_inputs_placeholder = Input(shape=(max_out_len,))

decoder_embedding = Embedding(num_words, LSTM_NODES)
decoder_inputs_x = decoder_embedding(decoder_inputs_placeholder)

decoder_lstm = LSTM(LSTM_NODES, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states)

In [None]:
decoder_dense = Dense(num_words, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [None]:
model = Model([encoder_inputs_placeholder,
  decoder_inputs_placeholder], decoder_outputs)
model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
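
The model is trained with teacher forcing: at every step the decoder receives the previous ground-truth token and has to predict the next one. This is why each response is stored twice, once prefixed with `<sos>` (decoder input) and once suffixed with `<eos>` (decoder target). A minimal sketch on a made-up response:

```python
# Hypothetical response text
response = "yes what a goal"

decoder_input = ("<sos> " + response).split()    # what the decoder is fed
decoder_target = (response + " <eos>").split()   # what the decoder should predict

# At each step the decoder sees one token and is expected to output the next one
for step, (fed, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(step, fed, "->", expected)
```

The two sequences have the same length and are shifted against each other by exactly one position.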

In [None]:
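# import plot_model from tensorflow.keras, like the other Keras imports in this notebook
from tensorflow.keras.utils import plot_model
plot_model(model, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)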

In [None]:
BATCH_SIZE = 64
EPOCHS = 20

In [None]:
print(encoder_input_sequences.shape)
print(decoder_input_sequences.shape)
print(decoder_targets_one_hot.shape)

In [None]:
r = model.fit(
    [encoder_input_sequences, decoder_input_sequences],
    decoder_targets_one_hot,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=0.2,
)

### Inference models (encoder and decoder)

In [None]:
encoder_model = Model(encoder_inputs_placeholder, encoder_states)

In [None]:
decoder_state_input_h = Input(shape=(LSTM_NODES,))
decoder_state_input_c = Input(shape=(LSTM_NODES,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

In [None]:
decoder_inputs_single = Input(shape=(1,))
decoder_inputs_single_x = decoder_embedding(decoder_inputs_single)

In [None]:
decoder_outputs, h, c = decoder_lstm(decoder_inputs_single_x, initial_state=decoder_states_inputs)

In [None]:
decoder_states = [h, c]
decoder_outputs = decoder_dense(decoder_outputs)

In [None]:
decoder_model = Model(
    [decoder_inputs_single] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

In [None]:
from tensorflow.keras.utils import plot_model
plot_model(decoder_model, to_file='model_plot_dec.png', show_shapes=True, show_layer_names=True)

In [None]:
idx2word_input = {v:k for k, v in word2idx_inputs.items()}
idx2word_target = {v:k for k, v in word2idx_outputs.items()}

In [None]:
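def translate_sentence(input_seq):
    """Greedily decode a response for one padded input sequence."""
    # Encode the input; the encoder's final states initialize the decoder
    states_value = encoder_model.predict(input_seq)
    # Start decoding from the '<sos>' token
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = word2idx_outputs['<sos>']
    eos = word2idx_outputs['<eos>']
    output_sentence = []

    for _ in range(max_out_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        # Pick the most probable next token
        idx = np.argmax(output_tokens[0, 0, :])

        # Stop once the end-of-sentence token is produced
        if eos == idx:
            break

        # Index 0 is padding and has no word
        if idx > 0:
            word = idx2word_target[idx]
            output_sentence.append(word)

        # Feed the predicted token and the updated states into the next step
        target_seq[0, 0] = idx
        states_value = [h, c]

    return ' '.join(output_sentence)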
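
The control flow of the decoding loop can be tried out without a trained model. In the sketch below a lookup table stands in for the decoder, so `NEXT_TOKEN` and `greedy_decode` are hypothetical names used only to show the loop: start from `<sos>`, feed each prediction back in, and stop at `<eos>` or after a maximum number of steps.

```python
# Hypothetical stand-in for the decoder: maps the previous token to the next one.
NEXT_TOKEN = {"<sos>": "nice", "nice": "goal", "goal": "<eos>"}


def greedy_decode(step_fn, max_len):
    """Generate tokens one at a time until '<eos>' appears or max_len steps have run."""
    token = "<sos>"
    output = []
    for _ in range(max_len):
        token = step_fn(token)   # the prediction becomes the next input
        if token == "<eos>":
            break
        output.append(token)
    return " ".join(output)


print(greedy_decode(NEXT_TOKEN.get, max_len=10))
print(greedy_decode(NEXT_TOKEN.get, max_len=1))
```

The step limit matters because a real model is not guaranteed to ever emit `<eos>`; in the notebook this limit is `max_out_len`.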

### Evaluation
The cell below generates a response for a randomly chosen input sentence; rerun it to try a few more.

In [None]:
i = np.random.choice(len(input_sentences))
input_seq = encoder_input_sequences[i:i+1]
translation = translate_sentence(input_seq)
print('-')
print('Input:', input_sentences[i])
print('Response:', translation)

### Save model
This section saves and downloads the model. The saved model was too large for GitHub, so it is available via this [Google Drive](https://drive.google.com/file/d/1mpYifGZ_TLrer6ZgRtNKO25xSuMOPi1u/view?usp=sharing) link.

In [None]:
# save model
model.save('model_1')

In [None]:
# zip saved model
!zip -r /content/model_1.zip /content/model_1

In [None]:
# download saved model
from google.colab import files
files.download('model_1.zip')

### Conclusion

This model still has a way to go, but after a single training cycle the results were positive. Our future goal is a better bot, both through more training and through improvements to the model itself. We expect to arrive at a somewhat sensible model (as sensible as a model trained on Reddit comments can be).
The desired result is a model that reacts to inputs properly, that is, in a way that makes sense in the given context.

We will try to mitigate some of the problems we found in the answers the bot gave, such as repeated words and irrelevant answers. Possible remedies are using more data, running more training cycles, trying different models and filtering the training data better. The last one is hard: there are so many comments that the filtering has to be automated, and an algorithm that separates good data from bad data is difficult to create.

The end result cannot be quantified automatically by a computer; we need to label good and bad outcomes ourselves.
