
---
### **CSCK507 Natural Language Processing, March-May 2022: End-of-Module Assignment**
# **Generative Chatbot**
---
#### Team A
Muhammad Ali (Student ID )  
Benjamin Schlup (Student ID 200050007)  
Chinedu Abonyi (Student ID )  
Victor Armenta-Valdes (Student ID )

---
# **Solution 1: LSTM without Attention Layer**
---

Dataset being used: https://www.microsoft.com/en-us/download/details.aspx?id=52419  
Paper on dataset: https://aclanthology.org/D15-1237/  
Solution inspired by https://medium.com/swlh/how-to-design-seq2seq-chatbot-using-keras-framework-ae86d950e91d  

Additional interesting materials to review, and potentially reference:
Khin, N.N., Soe, K.M., 2020. Question Answering based University Chatbot using Sequence to Sequence Model. doi:10.1109/o-cocosda50338.2020.9295021



---
Backlog:
* Check if lemmatizing on question side improves performance
* Check if word embedding (e.g. using Word2Vec or GloVe) improves performance (beware of out-of-vocab)
---

## 1. Configuration

In [1]:
# The dataset includes invalid answers (Label 0), and some questions have no
# valid answer at all. These switches control whether such rows are kept.
# Note that the assignment requires the chatbot to provide answers;
# it does not require the answers to be correct.
train_with_invalid_answers = False
validate_with_invalid_answers = False
test_questions_without_valid_answers = False

# The dataset contains questions with multiple valid answers
train_with_duplicate_questions = True
validate_with_duplicate_questions = True
test_with_duplicate_questions = True

---

In [2]:
# Imports
import codecs
import io
import os
import re
import urllib.request
import yaml
import random
import zipfile

import numpy as np
import pandas as pd

#from gensim.models import Word2Vec

from tensorflow.keras.activations import softmax
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

from keras_preprocessing.text import Tokenizer

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

In [3]:
# Make sure the GPU is visible to our runtime
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [4]:
# Check what GPU we have in place
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue May 10 22:49:53 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0    30W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
# Download data: If link does not work any longer, access file manually from here: https://www.microsoft.com/en-us/download/details.aspx?id=52419
urllib.request.urlretrieve("https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip", "WikiQACorpus.zip")

('WikiQACorpus.zip', <http.client.HTTPMessage at 0x7f5080f1ae90>)

In [6]:
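# Extract files
# (the handle must not be named 'zipfile', or it would shadow the module
# and break this cell when it is run a second time)
with zipfile.ZipFile('WikiQACorpus.zip', 'r') as archive:
    archive.extractall()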

In [7]:
# Import questions and answers: training, validation and test datasets
train_df = pd.read_csv('./WikiQACorpus/WikiQA-train.tsv', sep='\t', encoding='ISO-8859-1')
val_df = pd.read_csv('./WikiQACorpus/WikiQA-dev.tsv', sep='\t', encoding='ISO-8859-1')
test_df = pd.read_csv('./WikiQACorpus/WikiQA-test.tsv', sep='\t', encoding='ISO-8859-1')

In [8]:
# Quality checks and exploratory data analysis removed: dataset has proven clean
# Print gross volumes:
print(f'Gross training dataset size: {len(train_df)}')
print(f'Gross validation dataset size: {len(val_df)}')
print(f'Gross test dataset size: {len(test_df)}')

Gross training dataset size: 20347
Gross validation dataset size: 2733
Gross test dataset size: 6116


In [9]:
# Remove q/a pairs depending on configuration of the notebook
# (.copy() yields independent DataFrames, so adding columns later does not
# trigger pandas' SettingWithCopyWarning)
if not train_with_invalid_answers:
    train_df = train_df[train_df['Label'] == 1].copy()
if not validate_with_invalid_answers:
    val_df = val_df[val_df['Label'] == 1].copy()
if not test_questions_without_valid_answers:
    test_df = test_df[test_df['Label'] == 1].copy()

In [10]:
# Remove duplicate questions in case configured to do so
# (the first q/a pair of each question is kept)
if not train_with_duplicate_questions:
    train_df = train_df.drop_duplicates(subset=['Question'])
if not validate_with_duplicate_questions:
    val_df = val_df.drop_duplicates(subset=['Question'])
if not test_with_duplicate_questions:
    test_df = test_df.drop_duplicates(subset=['Question'])

In [11]:
# Print net volumes
print(f'Net training dataset size: {len(train_df)}')
print(f'Net validation dataset size: {len(val_df)}')
print(f'Net test dataset size: {len(test_df)}')

Net training dataset size: 1039
Net validation dataset size: 140
Net test dataset size: 291
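
The effect of the configuration switches on the dataset size can be illustrated on a handful of rows. The following is a minimal plain-Python sketch, not the notebook's pandas code: the rows are made up, and `filter_rows` is a hypothetical helper that mimics the `Label == 1` filter and `drop_duplicates(subset=['Question'])`.

```python
# Hypothetical rows using the three WikiQA columns referenced in this notebook
rows = [
    {"Question": "how does a steam engine work", "Sentence": "answer a", "Label": 1},
    {"Question": "how does a steam engine work", "Sentence": "answer b", "Label": 1},
    {"Question": "how does a steam engine work", "Sentence": "answer c", "Label": 0},
    {"Question": "what is x", "Sentence": "answer d", "Label": 0},
]


def filter_rows(rows, keep_invalid_answers, keep_duplicate_questions):
    """Mimic the two kinds of configuration switches on a list of row dicts."""
    if not keep_invalid_answers:
        # Same idea as df[df['Label'] == 1]
        rows = [r for r in rows if r["Label"] == 1]
    if not keep_duplicate_questions:
        # Same idea as drop_duplicates(subset=['Question']): keep first occurrence
        seen, unique = set(), []
        for r in rows:
            if r["Question"] not in seen:
                seen.add(r["Question"])
                unique.append(r)
        rows = unique
    return rows


print(len(filter_rows(rows, True, True)))    # all rows
print(len(filter_rows(rows, False, True)))   # valid answers only
print(len(filter_rows(rows, False, False)))  # one valid answer per question
```

Note that the question "what is x" disappears entirely once invalid answers are excluded, which is the case of questions without any valid answer mentioned in the configuration cell.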


In [12]:
# Derive normalized questions and answers
for df in [train_df, val_df, test_df]:
    df['norm_question'] = [ re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", q).lower() for q in df['Question'] ]
    df['norm_answer'] = [ '_START_ '+re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s).lower()+' _STOP_' for s in df['Sentence']]
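
To see what the normalization does to a single string, the same regular expression can be applied in isolation. This is a small sketch: `normalize` and `wrap_answer` are hypothetical helper names, and the example sentences are made up.

```python
import re

# Same character class as in the normalization cell
PUNCTUATION = r"[-()\"#/@;:<>{}`+=~|.!?,]"


def normalize(text):
    """Delete the listed punctuation characters and lowercase the rest."""
    return re.sub(PUNCTUATION, "", text).lower()


def wrap_answer(text):
    """Add the start/stop markers that the decoder is trained on."""
    return "_START_ " + normalize(text) + " _STOP_"


print(normalize("How much is the U.S. debt-limit?"))
print(wrap_answer("It is a heat engine."))
```

Because matched characters are deleted rather than replaced by a space, hyphenated words are joined ("debt-limit" becomes "debtlimit"). Apostrophes are not part of the character class and survive this step.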



In [13]:
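# Data preparation:
# Tokenization:
# Reconsider adding digits to filter later, as encoding of numbers may create excessive vocabulary
# Also check reference on handling numbers in NLP: https://arxiv.org/abs/2103.13136
# Note that the tokenizer is not yet fitted on the validation and test datasets; this should be challenged.
# The filter is a plain character list, not a regex. It contains '_', so the
# markers _START_ and _STOP_ end up in the vocabulary as 'start' and 'stop'.
target_regex = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\''
tokenizer = Tokenizer(filters=target_regex, num_words=5000+1)
# Fit on questions and answers as separate texts (Series '+' would join them row by row)
tokenizer.fit_on_texts(pd.concat([train_df['norm_question'], train_df['norm_answer']]))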
#vocab_size = len(tokenizer.word_index) + 1
vocab_size = 5000+1
print(f'Vocabulary size based on training dataset: {vocab_size}')

for df in [train_df, val_df, test_df]:
    df['tokenized_question'] = tokenizer.texts_to_sequences(df['norm_question'])
    df['tokenized_answer'] = tokenizer.texts_to_sequences(df['norm_answer'])
 


Vocabulary size based on training dataset: 5001
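
The vocabulary cap and the missing out-of-vocabulary token both mean that words can be dropped silently when text is converted to sequences. The sketch below imitates that behaviour in plain Python. It is a simplified stand-in for the Keras `Tokenizer`, not its implementation: `fit_vocab` and `encode` are hypothetical helpers, and the sentences are made up.

```python
from collections import Counter


def fit_vocab(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def encode(text, word_index, num_words):
    """Keep only known words whose index is below num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]


train_texts = ["start the cat sat stop", "start the dog sat stop"]
word_index = fit_vocab(train_texts)

# With num_words=5 only indices 1 to 4 survive: the rare word 'cat' is dropped
print(encode("start the cat sat stop", word_index, num_words=5))
# 'bird' was never seen during fitting and is dropped as well
print(encode("the bird sat", word_index, num_words=5))
```

A question from the validation or test set can therefore lose exactly the words that distinguish it, which is one reason the backlog item on out-of-vocabulary handling matters.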




In [14]:
maxlen_questions = max(len(t) for t in train_df['tokenized_question'].to_list())
maxlen_answers = max(len(t) for t in train_df['tokenized_answer'].to_list())

In [15]:
encoder_input_data = pad_sequences(train_df['tokenized_question'], maxlen=maxlen_questions, padding='post')
print(f'Encoder input data shape: {encoder_input_data.shape}')

decoder_input_data = pad_sequences(train_df['tokenized_answer'], maxlen=maxlen_answers, padding='post')
print(f'Decoder input data shape: {decoder_input_data.shape}')

tokenized_answers = [ ta[1:] for ta in train_df['tokenized_answer'] ]
padded_answers = pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
decoder_output_data = to_categorical(padded_answers, vocab_size)
print(f'Decoder output data shape: {decoder_output_data.shape}')

enc_inputs = Input(shape=(None,))
enc_embedding = Embedding(vocab_size, 200, mask_zero=True)(enc_inputs)
_, state_h, state_c = LSTM(200, return_state=True)(enc_embedding)
enc_states = [state_h, state_c]

dec_inputs = Input(shape=(None,))
dec_embedding = Embedding(vocab_size, 200, mask_zero=True)(dec_inputs)
dec_lstm = LSTM(200, return_state=True, return_sequences=True)
dec_outputs, _, _ = dec_lstm(dec_embedding, initial_state=enc_states)
dec_dense = Dense(vocab_size, activation=softmax)
output = dec_dense(dec_outputs)

model = Model([enc_inputs, dec_inputs], output)
model.compile(optimizer=RMSprop(), loss='categorical_crossentropy')

model.summary()


Encoder input data shape: (1039, 19)
Decoder input data shape: (1039, 107)
Decoder output data shape: (1039, 107, 5001)
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding (Embedding)          (None, None, 200)    1000200     ['input_1[0][0]']                
                                                                                                  
 embedding_1 (Embedding)        (None, None, 200)    1000200     ['input_
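
The decoder is trained with teacher forcing: at each step it receives the correct previous token and must predict the next one, which is why the target sequence is the answer shifted left by one position. The sketch below shows the alignment on one made-up answer, and the size of the one-hot target tensor for the shapes printed above. `pad_post` is a hypothetical, simplified stand-in for `pad_sequences(..., padding='post')`, and the token ids are invented.

```python
def pad_post(seq, maxlen, value=0):
    """Pad a list with trailing zeros up to maxlen (no truncation handling)."""
    return (seq + [value] * maxlen)[:maxlen]


answer = [1, 7, 8, 2]  # hypothetical ids: 1 = 'start', 2 = 'stop'
maxlen = 6

decoder_input = pad_post(answer, maxlen)       # what the decoder reads
decoder_target = pad_post(answer[1:], maxlen)  # what it should predict

for step, (inp, tgt) in enumerate(zip(decoder_input, decoder_target)):
    print(f"step {step}: input {inp} -> target {tgt}")

# Size of the one-hot targets: to_categorical returns float32 by default
samples, timesteps, vocab = 1039, 107, 5001
one_hot_bytes = samples * timesteps * vocab * 4
print(f"One-hot targets: {one_hot_bytes / 2**30:.2f} GiB")
```

If memory becomes a constraint, integer targets combined with the `sparse_categorical_crossentropy` loss are a possible alternative that avoids materialising the one-hot tensor.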

In [22]:
# Model training

model.fit([encoder_input_data, decoder_input_data], decoder_output_data, batch_size=50, epochs=200, validation_split=0.05)
#model.save('/content/drive/MyDrive/CSCK507_Team_A/qa_model.h5')


Epoch 1/200
[Epoch 2/200 to Epoch 77/200 omitted: the captured log contains no per-epoch metrics]
Epoch 78

<keras.callbacks.History at 0x7f4ee7cb4350>

In [23]:
# Prepare models for inferencing (separate encoder, decoder)
#model.load_weights('/content/drive/MyDrive/CSCK507_Team_A/qa_model.h5')

def make_inference_models():
    dec_state_input_h = Input(shape=(200,))
    dec_state_input_c = Input(shape=(200,))
    dec_states_inputs = [dec_state_input_h, dec_state_input_c]
    dec_outputs, state_h, state_c = dec_lstm(dec_embedding,
                                             initial_state=dec_states_inputs)
    dec_states = [state_h, state_c]
    dec_outputs = dec_dense(dec_outputs)

    dec_model = Model(
        inputs=[dec_inputs] + dec_states_inputs,
        outputs=[dec_outputs] + dec_states)
    print('Inference decoder:')
    dec_model.summary()

    enc_model = Model(inputs=enc_inputs, outputs=enc_states)
    print('Inference encoder:')
    enc_model.summary()
    return enc_model, dec_model


# Also here: need to change to lemmas in case we do that on training data
# (see above)
def str_to_tokens(sentence):
    # Normalize exactly like the training questions, then let the tokenizer
    # map words to indices: unknown words and words beyond the num_words
    # limit are dropped, so every index stays within the embedding range
    normalized = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", sentence).lower()
    tokens_list = tokenizer.texts_to_sequences([normalized])[0]

    return pad_sequences([tokens_list],
                         maxlen=maxlen_questions,
                         padding='post')


enc_model, dec_model = make_inference_models()



Inference decoder:
Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_1 (Embedding)        (None, None, 200)    1000200     ['input_2[0][0]']                
                                                                                                  
 input_5 (InputLayer)           [(None, 200)]        0           []                               
                                                                                                  
 input_6 (InputLayer)           [(None, 200)]        0           []                               
                                                                         

In [24]:
# Draw 100 random training questions and calculate the BLEU score of each predicted answer
# Note that this code must be refactored: it was merged from examples and is
# inconsistent now
questions = train_df['Question'].to_list()
rand_integers = [random.randint(0, len(questions)-1) for _ in range(100)]
bleu_total = 0


for i in rand_integers:
    states_values = enc_model.predict(str_to_tokens(questions[i]))
    empty_target_seq = np.zeros((1, 1))
    empty_target_seq[0, 0] = tokenizer.word_index['start']

    decoded_translation = ''
    while True:
        dec_outputs, h, c = dec_model.predict([empty_target_seq]
                                              + states_values)
        sampled_word_index = np.argmax(dec_outputs[0, -1, :])
        # Map the index back to its word (None for the padding index 0)
        sampled_word = tokenizer.index_word.get(int(sampled_word_index))
        if sampled_word is not None and sampled_word != 'stop':
            decoded_translation += ' {}'.format(sampled_word)

        if sampled_word == 'stop' \
                or len(decoded_translation.split()) \
                > maxlen_answers:
            break

        empty_target_seq = np.zeros((1, 1))
        empty_target_seq[0, 0] = sampled_word_index
        states_values = [h, c]

    decoded_translation = decoded_translation[1:]

    print(f'Original question: {questions[i]}')
    print(f'Predicted answer: {decoded_translation}')

    reference_answers = train_df.loc[train_df['Question']==questions[i], 'norm_answer'].to_list()
    reference_answers = [answer[8:-7] for answer in reference_answers]


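    # The following should contain all possible answers, though...
    print(f'{reference_answers}')
    # sentence_bleu expects token lists: one list of words per reference and
    # one list of words for the hypothesis (raw strings are scored per character)
    bleu_score = sentence_bleu([answer.split() for answer in reference_answers],
                               decoded_translation.split(),
                               smoothing_function=SmoothingFunction().method0)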
    print(f'Bleu score: {bleu_score}\n')
    bleu_total += bleu_score

print(f'Bleu average = {bleu_total/len(rand_integers)}')
    

Original question: what is melissa and joey about
Predicted answer: the american federation of government employees afge is an american labor union created by author
['the series follows local politician mel burke melissa joan hart and joe longo joey lawrence whom mel hires to look after her niece and nephew after a ponzi scheme leaves him broke']
Bleu score: 0.06620873901049407

Original question: how does a steam engine work
Predicted answer: steam engines are external combustion engines where the working fluid is separate from the combustion products
['a steam engine is a heat engine that performs mechanical work using steam as its working fluid ', 'steam engines are external combustion engines  where the working fluid is separate from the combustion products']
Bleu score: 0.9862868634149528

Original question: how many stripes on the flag
Predicted answer: the 50 stars on the flag represent the 50 states of the united states of america and the 13 stripes represent the thirteen b
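
NLTK's `sentence_bleu` works on tokenised input: a list of reference token lists and one hypothesis token list. When plain strings are passed instead, Python iterates over them character by character, so the n-grams are built from characters rather than words, and unrelated sentences still share many of them. The sketch below illustrates the effect with a simplified clipped n-gram precision. It is a stand-in written for this illustration, not NLTK's BLEU (single reference, single n, no brevity penalty); `ngram_precision` is a hypothetical helper and the sentences are made up.

```python
from collections import Counter


def ngram_precision(reference, hypothesis, n):
    """Clipped n-gram precision of a hypothesis against one reference.

    Works on any sequence: a string yields character n-grams,
    a list of words yields word n-grams.
    """
    def ngrams(seq):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

    ref, hyp = ngrams(reference), ngrams(hypothesis)
    total = sum(hyp.values())
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total if total else 0.0


reference = "the cat sat on the mat"
hypothesis = "a dog ran to the park"

char_level = ngram_precision(reference, hypothesis, 1)
word_level = ngram_precision(reference.split(), hypothesis.split(), 1)

print(f"character-level unigram precision: {char_level:.3f}")
print(f"word-level unigram precision:      {word_level:.3f}")
```

The two sentences share only the word "the", yet most of their characters overlap, so character-level scores overstate answer quality considerably.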

In [25]:
while True:
    question = input('Ask me something, or enter \'end\' to stop: ')
    if question == 'end':
        break
    states_values = enc_model.predict(str_to_tokens(question))
    empty_target_seq = np.zeros((1, 1))
    empty_target_seq[0, 0] = tokenizer.word_index['start']

    decoded_translation = ''
    while True:
        dec_outputs, h, c = dec_model.predict([empty_target_seq]
                                              + states_values)
        sampled_word_index = np.argmax(dec_outputs[0, -1, :])
        # Map the index back to its word (None for the padding index 0)
        sampled_word = tokenizer.index_word.get(int(sampled_word_index))
        if sampled_word is not None and sampled_word != 'stop':
            decoded_translation += ' {}'.format(sampled_word)

        if sampled_word == 'stop' \
                or len(decoded_translation.split()) \
                > maxlen_answers:
            break

        empty_target_seq = np.zeros((1, 1))
        empty_target_seq[0, 0] = sampled_word_index
        states_values = [h, c]

    print(decoded_translation)

Ask me something, or enter 'end' to stop: how are you doing today?
 sedimentary rocks are types of rock that are formed by the deposition of material at the earth s surface and within bodies of water
Ask me something, or enter 'end' to stop: How are epithelial tissues joined together
 the term may be applied to someone who are actually a foreigner or it can denote a strong association or assimilation into foreign particularly us society and culture
Ask me something, or enter 'end' to stop: how big is bmc software in houston, tx
 in america or is a neighborhood in the state of ice hockey that it through a leader in which the confederacy through the first day
Ask me something, or enter 'end' to stop: How much is US National Debt limit?
 shem säm säm name s service is the world s last in world s largest producer of the coastal city and is in the of they also the third version of the murder of the united states
Ask me something, or enter 'end' to stop: how much is an adult film actor pai
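
The greedy decoding loop appears twice in this notebook, once for the BLEU evaluation and once for the interactive chat. One possible refactoring is to move it into a single function that receives the decoder as a callable. The sketch below is an illustration under assumptions, not the notebook's code: `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is a scripted stand-in for a call to `dec_model.predict`, so the example runs without a trained model.

```python
def greedy_decode(step_fn, initial_state, start_id, stop_id, max_len):
    """Feed the most probable token back in until stop_id or max_len tokens."""
    token, state, output = start_id, initial_state, []
    while len(output) < max_len:
        probs, state = step_fn(token, state)
        # argmax over the probability list
        token = max(range(len(probs)), key=probs.__getitem__)
        if token == stop_id:
            break
        output.append(token)
    return output


def toy_step(token, state):
    """Hypothetical decoder step: emits a fixed script, one token per call."""
    script = [3, 4, 2]  # two content tokens, then the stop token (id 2)
    probs = [0.0] * 5
    probs[script[state]] = 1.0
    return probs, state + 1


print(greedy_decode(toy_step, 0, start_id=1, stop_id=2, max_len=10))
print(greedy_decode(toy_step, 0, start_id=1, stop_id=2, max_len=1))
```

With the models from this notebook, `step_fn` would wrap one `dec_model.predict` call and return the last-step probabilities together with the new `[h, c]` states, and the encoder states would serve as `initial_state`. The length limit here is slightly stricter than in the loops above, which allow one more word than `maxlen_answers`.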