<a href="https://colab.research.google.com/github/benschlup/csck507_team_a/blob/main/CSCK507_Team_A_WikiQA_Chatbot_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
### **CSCK507 Natural Language Processing, March-May 2022: End-of-Module Assignment**
# **Generative Chatbot**
---
#### Team A
Muhammad Ali (Student ID )  
Benjamin Schlup (Student ID 200050007)  
Chinedu Abonyi (Student ID )  
Victor Armenta-Valdes (Student ID )

---
# **Solution 2: LSTM with Bahdanau Attention Layer**
---

Dataset being used: https://www.microsoft.com/en-us/download/details.aspx?id=52419  
Paper on dataset: https://aclanthology.org/D15-1237/  
Non-attention solution inspired by https://medium.com/swlh/how-to-design-seq2seq-chatbot-using-keras-framework-ae86d950e91d  
Bahdanau addition inspired by https://www.tensorflow.org/text/tutorials/nmt_with_attention  
Luong attention extension inspired by https://levelup.gitconnected.com/building-seq2seq-lstm-with-luong-attention-in-keras-for-time-series-forecasting-1ee00958decb

Important note:  
The dataset includes incorrect answers, labelled accordingly. Learning from these can be switched on or off (see the configuration below).

In a real setting, it would be sensible to exclude incorrect answers from training and to add a concept called "answer triggering": the bot first assesses whether it is likely to deliver a sensible answer to a question, and otherwise lets the person know that it does not know. Ref: https://ieeexplore.ieee.org/document/8079800

In this notebook, the default is to learn from invalid answers as well. This provides more training data and thus more exposure to how sentences are constructed. It also sometimes leads to funny conversations, as with a hard-of-hearing dialogue partner who provides 'perfectly valid answers - but to a different question'.
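
The answer-triggering idea can be illustrated with a deliberately simple gate. The sketch below is an assumption for illustration only and not the method of the referenced paper: `should_answer`, `reply`, `known_words` and `min_coverage` are hypothetical names, and the rule (answer only when enough question words occur in the training vocabulary) is a crude stand-in for a learned trigger.

```python
def should_answer(question, known_words, min_coverage=0.8):
    """Return True if enough of the question's words are known.

    Hypothetical heuristic: coverage is the share of question words
    that occur in the training vocabulary.
    """
    words = question.lower().split()
    if not words:
        return False
    coverage = sum(word in known_words for word in words) / len(words)
    return coverage >= min_coverage


def reply(question, known_words, answer_fn):
    """Answer only when triggered, otherwise admit ignorance."""
    if should_answer(question, known_words):
        return answer_fn(question)
    return "Sorry, I do not know."


known = {"when", "was", "apple", "founded"}
print(reply("when was apple founded", known, lambda q: "(model answer)"))
print(reply("who painted guernica", known, lambda q: "(model answer)"))
```

In this notebook, `answer_fn` would correspond to the prediction function defined in the validation section.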

---
## 1. Configuration and framework

In [None]:
# The dataset includes invalid answers (labelled 0) and some questions 
# even have no valid answer at all: Switches allow test runs excluding invalid
# answers.
# Note that the assignment says that answers must be provided by the chatbot: 
# there is no mention that answers must be correct!
train_with_invalid_answers = True
validate_with_invalid_answers = True
test_questions_without_valid_answers = True

# The dataset contains questions with multiple valid answers
train_with_duplicate_questions = True
validate_with_duplicate_questions = True
test_with_duplicate_questions = True

# Configure the tokenizer
vocab_size_limit = 6000 + 1 # set this to None if all tokens from training shall be included (add one to number of tokens)
vocab_include_val = False   # set this to True if tokens from validation set shall be included in vocabulary
vocab_include_test = False  # set this to True if tokens from test set shall be included in vocabulary
oov_token = 1               # set this to None if out-of-vocabulary tokens should be removed from sequences
remove_oov_sentences = True # set this to True if any sentences containing out-of-vocabulary tokens should be removed from training, validation, test dataset

# Limit sentence lengths (longer questions/answers are dropped during dataset preparation)
max_question_tokens = 20    # set this to None if no limit on question length
max_answer_tokens = 50      # set this to None if no limit on answer length

# Model parameters
lstm_units = 200
embedding_units = 200
encoder_lstm_dropout = 0.2
encoder_lstm_recurrent_dropout = 0.2


# Training parameters
batch_size = 50
number_of_epochs = 200

In [None]:
# Imports
import codecs
import io
import os
import re
import urllib.request
import yaml
import random
import zipfile

import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras.activations import softmax
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Layer, Embedding, LSTM, Dense, RepeatVector
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

from keras_preprocessing.text import Tokenizer

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

In [None]:
# Make sure the GPU is visible to our runtime
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [None]:
# Check what GPU we have in place
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Not connected to a GPU


---
## 2. Data acquisition and loading

In [None]:
# Download data: If link does not work any longer, access file manually from here: https://www.microsoft.com/en-us/download/details.aspx?id=52419
urllib.request.urlretrieve("https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip", "WikiQACorpus.zip")

('WikiQACorpus.zip', <http.client.HTTPMessage at 0x7f1c59f32190>)

In [None]:
# Extract files (the handle is not named `zipfile`, to avoid shadowing the module)
with zipfile.ZipFile('WikiQACorpus.zip', 'r') as zip_file:
    zip_file.extractall()

In [None]:
# Import questions and answers: training, validation and test datasets
train_df = pd.read_csv('./WikiQACorpus/WikiQA-train.tsv', sep='\t', encoding='ISO-8859-1')
val_df = pd.read_csv('./WikiQACorpus/WikiQA-dev.tsv', sep='\t', encoding='ISO-8859-1')
test_df = pd.read_csv('./WikiQACorpus/WikiQA-test.tsv', sep='\t', encoding='ISO-8859-1')

---
## 3. Dataset preparation (pre-processing, transformation)
Note that no cleansing as such is required, as prior analysis has shown.

In [None]:
# Quality checks and exploratory data analysis removed: dataset has proven clean
# Print gross volumes:
print(f'Gross training dataset size: {len(train_df)}')
print(f'Gross validation dataset size: {len(val_df)}')
print(f'Gross test dataset size: {len(test_df)}')

Gross training dataset size: 20347
Gross validation dataset size: 2733
Gross test dataset size: 6116


In [None]:
# Derive normalized questions and answers and count number of tokens
for df in [train_df, val_df, test_df]:
    df.loc[:,'norm_question'] = [ re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", q).lower().strip() for q in df['Question'] ]
    df.loc[:,'norm_answer'] = [ '_START_ '+re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s).lower().strip()+' _STOP_' for s in df['Sentence']]
    df['question_tokens'] = [ len(x.split()) for x in df['norm_question'] ]
    df['answer_tokens'] = [ len(x.split()) for x in df['norm_answer'] ]
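
To make the effect of the normalization concrete, the following sketch applies the same regular expression to a few strings. The helper names `normalize_question` and `normalize_answer` and the sample sentences are assumptions for illustration; the notebook itself applies the expressions inline.

```python
import re

# Same character class as used in the notebook's normalization step
PUNCTUATION = r"[-()\"#/@;:<>{}`+=~|.!?,]"

def normalize_question(text):
    """Remove punctuation, lower-case and trim the text."""
    return re.sub(PUNCTUATION, "", text).lower().strip()

def normalize_answer(text):
    """Normalize like a question and wrap in start/stop markers."""
    return "_START_ " + normalize_question(text) + " _STOP_"

print(normalize_question("When was Apple Computer founded?"))
print(normalize_answer("Apple was founded on April 1, 1976."))
# Hyphens are deleted rather than replaced by a space, so words merge
print(normalize_question("a well-known example"))
```

Note that the two markers add two tokens to every answer, which is why the length filter compares against `max_answer_tokens+2`.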

In [None]:
# Drop sentences which are too long
for df in [train_df, val_df, test_df]:
    if max_question_tokens is not None:
        df.drop(df[df['question_tokens']>max_question_tokens].index, inplace=True)
    if max_answer_tokens is not None:
        df.drop(df[df['answer_tokens']>max_answer_tokens+2].index, inplace=True)    

In [None]:
# Remove q/a pairs depending on configuration of the notebook
if not train_with_invalid_answers:
    train_df = train_df[train_df['Label'] == 1]
if not validate_with_invalid_answers:
    val_df = val_df[val_df['Label'] == 1]
if not test_questions_without_valid_answers:
    test_df = test_df[test_df['Label'] == 1]

In [None]:
# Remove duplicate questions in case configured to do so
if not train_with_duplicate_questions:
    train_df.drop_duplicates(subset=['Question'], inplace=True)
if not validate_with_duplicate_questions:
    val_df.drop_duplicates(subset=['Question'], inplace=True)
if not test_with_duplicate_questions:
    test_df.drop_duplicates(subset=['Question'], inplace=True)

In [None]:
# Data preparation: tokenization
# Consider adding digits to the filter later, as encoding numbers may create an excessive vocabulary
# See also this reference on handling numbers in NLP: https://arxiv.org/abs/2103.13136
# Note that the tokenizer is not yet fitted on the validation and test datasets by default - this should be challenged.
# May be added to Tokenizer: filters=target_regex = '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\''

if remove_oov_sentences:
    oov_token = None
tokenizer = Tokenizer(num_words=vocab_size_limit, oov_token=oov_token)

tokenizer.fit_on_texts(train_df['norm_question'] + train_df['norm_answer'])
if vocab_include_val:
    tokenizer.fit_on_texts(val_df['norm_question'] + val_df['norm_answer'])
if vocab_include_test:
    tokenizer.fit_on_texts(test_df['norm_question'] + test_df['norm_answer'])

vocab_size = len(tokenizer.word_index) + 1
if vocab_size_limit is not None:
    vocab_size = min([vocab_size, vocab_size_limit])
print(f'Vocabulary size based on training dataset: {vocab_size}')

for df in [train_df, val_df, test_df]:
    # Tokenize
    df['tokenized_question'] = tokenizer.texts_to_sequences(df['norm_question'])
    df['tokenized_answer'] = tokenizer.texts_to_sequences(df['norm_answer'])

    # Optionally remove sentences with out-of-vocabulary tokens
    if remove_oov_sentences:
        df.drop(df[df['question_tokens']!=df['tokenized_question'].str.len()].index, inplace=True)
        df.drop(df[df['answer_tokens']!=df['tokenized_answer'].str.len()].index, inplace=True)

Vocabulary size based on training dataset: 6001
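
The vocabulary limit and the out-of-vocabulary filter can be sketched without Keras. This is a simplified stand-in, not the Keras `Tokenizer` implementation: `build_word_index`, `to_sequence` and `has_oov` are hypothetical helpers. The sketch keeps only the most frequent words (indices below `num_words`, with 0 reserved for padding) and flags a sentence as containing unknown words when its token sequence is shorter than its word count, which is the same length comparison used in the cell above.

```python
from collections import Counter

def build_word_index(texts, num_words=None):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    index = {word: i + 1 for i, word in enumerate(ranked)}
    if num_words is not None:
        # Keep only indices below the limit
        index = {w: i for w, i in index.items() if i < num_words}
    return index

def to_sequence(text, word_index):
    # Unknown words are silently skipped, so the sequence gets shorter
    return [word_index[w] for w in text.split() if w in word_index]

def has_oov(text, word_index):
    return len(to_sequence(text, word_index)) != len(text.split())

texts = ["the cat sat", "the cat ran", "the dog sat"]
word_index = build_word_index(texts, num_words=4)  # keeps indices 1..3
print(word_index)
print(has_oov("the cat sat", word_index))
print(has_oov("the dog sat", word_index))
```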


In [None]:
# Print net volumes
print(f'Net training dataset size: {len(train_df)}')
print(f'Net validation dataset size: {len(val_df)}')
print(f'Net test dataset size: {len(test_df)}')

Net training dataset size: 2197
Net validation dataset size: 109
Net test dataset size: 245
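
The combined effect of the length limits and the out-of-vocabulary filter is substantial. The short calculation below uses only the gross and net sizes printed above to show the share of question/answer pairs that remains in each dataset.

```python
# Sizes as printed by the notebook before and after filtering
gross = {"training": 20347, "validation": 2733, "test": 6116}
net = {"training": 2197, "validation": 109, "test": 245}

retention = {name: net[name] / gross[name] for name in gross}
for name, share in retention.items():
    print(f"{name}: {net[name]} of {gross[name]} pairs kept ({share:.1%})")
```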


In [None]:
# Transform data for training and validation by aligning lengths (i.e. padding)
maxlen_questions = max(len(t) for t in train_df['tokenized_question'].to_list())
maxlen_answers = max(len(t) for t in train_df['tokenized_answer'].to_list())

train_encoder_input_data = pad_sequences(train_df['tokenized_question'], maxlen=maxlen_questions, padding='post')
val_encoder_input_data = pad_sequences(val_df['tokenized_question'], maxlen=maxlen_questions, padding='post')
print(f'Encoder input data shape: {train_encoder_input_data.shape}')

train_decoder_input_data = pad_sequences(train_df['tokenized_answer'], maxlen=maxlen_answers, padding='post')
val_decoder_input_data = pad_sequences(val_df['tokenized_answer'], maxlen=maxlen_answers, padding='post')
print(f'Decoder input data shape: {train_decoder_input_data.shape}')

tokenized_answers = [ ta[1:] for ta in train_df['tokenized_answer'] ]
padded_answers = pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
train_decoder_output_data = to_categorical(padded_answers, vocab_size)
tokenized_answers = [ ta[1:] for ta in val_df['tokenized_answer'] ]
padded_answers = pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
val_decoder_output_data = to_categorical(padded_answers, vocab_size)
print(f'Decoder output data shape: {train_decoder_output_data.shape}')

Encoder input data shape: (2197, 19)
Decoder input data shape: (2197, 52)
Decoder output data shape: (2197, 52, 6001)
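
The decoder output is the decoder input shifted by one position, so that at every step the model sees the previous token and must predict the next one. A minimal sketch of that alignment follows; `pad_post` is a hypothetical pure-Python simplification of post-padding, and the token ids are made up for illustration.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Right-pad (or cut) a token list to a fixed length."""
    return (sequence + [pad_value] * maxlen)[:maxlen]

# Hypothetical token ids: 1 = start, 2 = stop, others = words
tokenized_answer = [1, 7, 8, 9, 2]
maxlen = 6

decoder_input = pad_post(tokenized_answer, maxlen)
decoder_target = pad_post(tokenized_answer[1:], maxlen)  # drop the start token

for step, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(f"step {step}: input={seen} -> target={expected}")
```

In the notebook, the target sequence is additionally one-hot encoded with `to_categorical`, which explains the third dimension of 6001 in the printed output shape.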


---
## 4. Modelling and training

In [None]:
# Code from https://www.tensorflow.org/text/tutorials/nmt_with_attention

class ShapeChecker():
  def __init__(self):
    # Keep a cache of every axis-name seen
    self.shapes = {}

  def __call__(self, tensor, names, broadcast=False):
    if not tf.executing_eagerly():
      return

    if isinstance(names, str):
      names = (names,)

    shape = tf.shape(tensor)
    rank = tf.rank(tensor)

    if rank != len(names):
      raise ValueError(f'Rank mismatch:\n'
                       f'    found {rank}: {shape.numpy()}\n'
                       f'    expected {len(names)}: {names}\n')

    for i, name in enumerate(names):
      if isinstance(name, int):
        old_dim = name
      else:
        old_dim = self.shapes.get(name, None)
      new_dim = shape[i]

      if (broadcast and new_dim == 1):
        continue

      if old_dim is None:
        # If the axis name is new, add its length to the cache.
        self.shapes[name] = new_dim
        continue

      if new_dim != old_dim:
        raise ValueError(f"Shape mismatch for dimension: '{name}'\n"
                         f"    found: {new_dim}\n"
                         f"    expected: {old_dim}\n")
        
class BahdanauAttention(Layer):
  def __init__(self, units):
    super().__init__()
    # For Eqn. (4), the  Bahdanau attention
    self.W1 = tf.keras.layers.Dense(units, use_bias=False)
    self.W2 = tf.keras.layers.Dense(units, use_bias=False)

    self.attention = tf.keras.layers.AdditiveAttention()

  def call(self, query, value, mask):
    shape_checker = ShapeChecker()
    shape_checker(query, ('batch', 't', 'query_units'))
    shape_checker(value, ('batch', 's', 'value_units'))
    shape_checker(mask, ('batch', 's'))

    # From Eqn. (4), `W1@ht`.
    w1_query = self.W1(query)
    shape_checker(w1_query, ('batch', 't', 'attn_units'))

    # From Eqn. (4), `W2@hs`.
    w2_key = self.W2(value)
    shape_checker(w2_key, ('batch', 's', 'attn_units'))

    query_mask = tf.ones(tf.shape(query)[:-1], dtype=bool)
    value_mask = mask

    context_vector, attention_weights = self.attention(
        inputs = [w1_query, value, w2_key],
        mask=[query_mask, value_mask],
        return_attention_scores = True,
    )
    shape_checker(context_vector, ('batch', 't', 'value_units'))
    shape_checker(attention_weights, ('batch', 't', 's'))

    return context_vector, attention_weights
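
What the attention layer computes can be sketched in plain Python for a single decoder step. This is an illustration under simplifying assumptions, not the Keras implementation: `additive_attention` is a hypothetical helper, the inputs are taken to be already projected (the role of `W1` and `W2` above), and the learned scale vector is fixed to ones.

```python
import math

def additive_attention(query, keys, values, mask):
    """Additive (Bahdanau-style) attention for one query vector.

    query:  projected decoder state, list of floats
    keys:   projected encoder outputs, one list per source step
    values: encoder outputs to be averaged, one list per source step
    mask:   True for real tokens, False for padding
    """
    scores = []
    for key, keep in zip(keys, mask):
        score = sum(math.tanh(q + k) for q, k in zip(query, key))
        scores.append(score if keep else -math.inf)

    # Softmax over source positions; masked positions get weight 0
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the values
    dims = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dims)]
    return context, weights


query = [0.5, -0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask = [True, True, False]  # last source position is padding

context, weights = additive_attention(query, keys, values, mask)
print(weights)
print(context)
```

The masked position receives zero weight regardless of its key, which is the purpose of passing the encoder mask into the layer.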

In [None]:
# Build model

# Input layer for encoder
enc_inputs = Input(shape=(None,), name='Encoder_Input')

# Embedding layer for encoder
enc_embedding = Embedding(vocab_size, embedding_units, mask_zero=True, 
                          name='Encoder_Embedding')(enc_inputs)



# LSTM layer for encoder
stack_h, state_h, state_c = LSTM(lstm_units, return_state=True, 
                                 dropout=encoder_lstm_dropout,
                                 recurrent_dropout=encoder_lstm_recurrent_dropout,
                                 name='Encoder_LSTM')(enc_embedding)



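# Combine the final hidden and cell states from the encoder LSTM layer to
# initialise the decoder. Because return_sequences is not set, stack_h is only
# the last output (identical to state_h) and cannot serve as a third state.
enc_states = [state_h, state_c]


# -START ----------------------------------------------------
#query_value_attention_seq = tf.keras.layers.AdditiveAttention()(
#    [query_seq_encoding, value_seq_encoding])
# - END -----------------------------------------------------

# Input layer for decoder
dec_inputs = Input(shape=(None,), name='Decoder_Input')
# RepeatVector cannot be applied to state_h here: the decoder sequence length
# (dec_inputs.shape[1]) is None, and RepeatVector requires an integer.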

# Embedding layer for decoder
dec_embedding = Embedding(vocab_size, embedding_units, mask_zero=True, name='Decoder_Embedding')(dec_inputs)

# LSTM layer for decoder
dec_lstm = LSTM(lstm_units, return_state=True, return_sequences=True, name='Decoder_LSTM')
dec_outputs, _, _ = dec_lstm(dec_embedding, initial_state=enc_states)

# Dense layer for decoder
dec_dense = Dense(vocab_size, activation=softmax, name='Decoder_Dense')
output = dec_dense(dec_outputs)

# -START ----------------------------------------------------
# Query embeddings of shape [batch_size, Tq, dimension].
#query_embeddings = enc_embedding(enc_inputs)
# Value embeddings of shape [batch_size, Tv, dimension].
#value_embeddings = enc_embedding(dec_inputs)
# - END -----------------------------------------------------

# Compile the model
model = Model([enc_inputs, dec_inputs], output)
model.compile(optimizer=RMSprop(), loss='categorical_crossentropy')

# Summarised printout
model.summary()

TypeError: ignored

In [None]:
stack_h

<KerasTensor: shape=(None, 200) dtype=float32 (created by layer 'Encoder_LSTM')>

In [None]:
# Model training

model.fit([train_encoder_input_data, train_decoder_input_data], train_decoder_output_data,
          validation_data=([val_encoder_input_data, val_decoder_input_data], val_decoder_output_data),
          batch_size=batch_size, epochs=number_of_epochs)

Epoch 1/200
...
Epoch 77/200
Epoch 78

<keras.callbacks.History at 0x7fc978a96710>

In [None]:
# Optionally save the trained model (including its weights) to file:
#model.save('/content/drive/MyDrive/CSCK507_Team_A/qa_model.h5')

---
## 5. Validation

In [None]:
# Optionally load model weights from file if already trained:
# WARNING: Any notebook parameters and the learned vocabulary are not 
# saved/loaded - i.e. this only makes sense when all other cells of the notebook
# are run except for the model.fit
#model.load_weights('/content/drive/MyDrive/CSCK507_Team_A/qa_model.h5')

In [None]:
# Prepare models for inferencing (separate encoder, decoder)
#

# Build encoder model for inferencing
enc_model = Model(inputs=enc_inputs, outputs=enc_states, name='Inference_Encoder')
enc_model.summary()

# Build decoder model for inferencing
dec_state_input_h = Input(shape=(lstm_units,))
dec_state_input_c = Input(shape=(lstm_units,))
dec_states_inputs = [dec_state_input_h, dec_state_input_c]
dec_outputs, state_h, state_c = dec_lstm(dec_embedding, initial_state=dec_states_inputs)
dec_states = [state_h, state_c]
dec_outputs = dec_dense(dec_outputs)
dec_model = Model(inputs=[dec_inputs] + dec_states_inputs, outputs=[dec_outputs] + dec_states, name='Inference_Decoder')
dec_model.summary()

Model: "Inference_Encoder"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Encoder_Input (InputLayer)  [(None, None)]            0         
                                                                 
 Encoder_Embedding (Embeddin  (None, None, 200)        1200200   
 g)                                                              
                                                                 
 Encoder_LSTM (LSTM)         [(None, 200),             320800    
                              (None, 200),                       
                              (None, 200)]                       
                                                                 
Total params: 1,521,000
Trainable params: 1,521,000
Non-trainable params: 0
_________________________________________________________________
Model: "Inference_Decoder"
______________________________________________________________________________
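
The parameter counts in the encoder summary can be verified by hand. The sketch below uses the standard formulas for an embedding layer and an LSTM layer together with the notebook's settings (vocabulary size 6001, 200 embedding units, 200 LSTM units); the function names are hypothetical.

```python
def embedding_params(vocab_size, embedding_units):
    # One vector of embedding_units weights per vocabulary entry
    return vocab_size * embedding_units

def lstm_params(input_units, lstm_units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (input_units * lstm_units + lstm_units * lstm_units + lstm_units)

embedding = embedding_params(6001, 200)
lstm = lstm_params(200, 200)
print(embedding, lstm, embedding + lstm)
```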

In [None]:
# Prepare question for inferencing

def tokenize(question):
    words = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", question).lower().split()
    tokens_list = list()
    for current_word in words:
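        result = tokenizer.word_index.get(current_word)
        # word_index holds every word seen during fitting; only indices below
        # vocab_size belong to the limited vocabulary the model was built for
        if result is not None and result < vocab_size:
            tokens_list.append(result)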
        else:
            print(f'Warning: out-of-vocabulary token \'{current_word}\'')
            if oov_token is not None:
                tokens_list.append(oov_token)

    return pad_sequences([tokens_list],
                         maxlen=maxlen_questions,
                         padding='post')

In [None]:
# Predict answer and compare to ground truth options

def predict_answer(question, qa_df=None):
    states_values = enc_model.predict(tokenize(question))
    empty_target_seq = np.zeros((1, 1))
    empty_target_seq[0, 0] = tokenizer.word_index['start']

    decoded_words = []
    while True:
        dec_outputs, h, c = dec_model.predict([empty_target_seq] + states_values)
        # Greedy decoding: take the most probable index and map it to its word
        sampled_word_index = int(np.argmax(dec_outputs[0, -1, :]))
        sampled_word = tokenizer.index_word.get(sampled_word_index)
        if sampled_word is not None and sampled_word != 'stop':
            decoded_words.append(sampled_word)

        # Stop on the stop token, on the padding index (no word), or when the
        # answer becomes too long
        if sampled_word in (None, 'stop') or len(decoded_words) > maxlen_answers:
            break

        empty_target_seq = np.zeros((1, 1))
        empty_target_seq[0, 0] = sampled_word_index
        states_values = [h, c]

    decoded_answer = ' '.join(decoded_words)

    print(f'Original question: {question}')
    print(f'Predicted answer: {decoded_answer}')

    if qa_df is not None:
        # The following should contain all acceptable answers
        reference_answers = qa_df.loc[qa_df['Question']==question, 'norm_answer'].to_list()
        # Strip the leading '_START_ ' (8 characters) and trailing ' _STOP_' (7 characters)
        reference_answers = [answer[8:-7] for answer in reference_answers]
        print(f'{reference_answers}')

        # Calculate BLEU score on word tokens: sentence_bleu expects lists of
        # tokens, and plain strings would be compared character by character
        bleu_score = sentence_bleu([answer.split() for answer in reference_answers],
                                   decoded_answer.split(),
                                   smoothing_function=SmoothingFunction().method0)

        print(f'BLEU score: {bleu_score}\n')

    else:
        bleu_score = None

    return bleu_score
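
The decoding loop above is a greedy search: at every step the most probable token is chosen and fed back as the next input. The toy sketch below isolates that control flow. `greedy_decode`, `toy_step` and the probability table are assumptions for illustration, with a lookup table standing in for the trained decoder.

```python
def greedy_decode(step_fn, start_token, stop_token, max_steps):
    """Generate tokens one at a time, always taking the most probable one.

    step_fn(token, state) -> (probabilities: dict token -> prob, new_state)
    is a hypothetical stand-in for one decoder step.
    """
    token, state, output = start_token, None, []
    for _ in range(max_steps):
        probabilities, state = step_fn(token, state)
        token = max(probabilities, key=probabilities.get)
        if token == stop_token:
            break
        output.append(token)
    return output


# Toy "decoder": a fixed table of next-token probabilities
TABLE = {
    "start": {"the": 0.6, "a": 0.4},
    "the": {"league": 0.7, "stop": 0.3},
    "league": {"stop": 0.9, "the": 0.1},
}

def toy_step(token, state):
    return TABLE[token], state

print(greedy_decode(toy_step, "start", "stop", max_steps=10))
```

The `max_steps` bound plays the same role as the answer-length check in the notebook: it guarantees termination even if the stop token is never predicted.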

In [None]:
# Validate how the model predicts:
# Sample up to 20 random questions and calculate the BLEU score
# per predicted answer as well as the average

def validate_predictions(qa_df):
    bleu_total = 0
    number_of_samples = min(20, len(qa_df.index))

    for sample_question in qa_df['Question'].sample(number_of_samples):
        bleu_total += predict_answer(sample_question, qa_df)

    print(f'BLEU average for answers on sampled questions (n={number_of_samples}) = {bleu_total/number_of_samples}')
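
BLEU builds on clipped n-gram precision, and what counts as an n-gram depends on the input type: NLTK treats its inputs as sequences, so word lists are scored on word n-grams whereas raw strings are iterated character by character. The sketch below implements only the clipped precision, omitting the brevity penalty and the combination across n-gram orders; `ngrams` and `modified_precision` are hypothetical helpers and the hypothesis sentence is made up.

```python
from collections import Counter

def ngrams(sequence, n):
    """All contiguous n-grams of a sequence (word list or string)."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

def modified_precision(references, hypothesis, n):
    """Share of hypothesis n-grams found in a reference, with clipping."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


reference = "the league is a unit of length"
hypothesis = "a unit of time"

token_p1 = modified_precision([reference.split()], hypothesis.split(), 1)
token_p2 = modified_precision([reference.split()], hypothesis.split(), 2)
char_p1 = modified_precision([reference], hypothesis, 1)
print(token_p1, token_p2, char_p1)
```

Character-level precision is much more forgiving than word-level precision, because unrelated sentences still share most of their letters.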

In [None]:
# Validate how the model predicts from actually trained questions

print('Validating model against sample set from training questions\n')
validate_predictions(train_df)

Validating model against sample set from training questions

Original question: where is diana prince from
Predicted answer: diana prince is a fictional character appearing regularly in stories published by dc comics
['diana prince is a fictional character appearing regularly in stories published by dc comics']
BLEU score: 1.0

Original question: what is a league in the sea
Predicted answer: the league originally referred to the distance a person could walk in an hour
['a league is a unit of length or rarely area', 'it was long common in europe and latin america  but it is no longer an official unit in any nation', 'the league originally referred to the distance a person could walk in an hour', 'since the middle ages many values have been specified in several countries']
BLEU score: 1.0

Original question: when did the cold war start
Predicted answer: the us and ussr became involved in political and military conflicts in the third world countries of latin america africa the middle e

In [None]:
# Validate how the model predicts from test questions (i.e. unseen)

print('Validating model against sample set from test questions:')
print('''
  ! NOTE THAT ASKING FOR ANSWERS ON UNSEEN QUESTIONS IS BARELY HELPFUL WITH
  ! SMALL DATASETS AND LITTLE VARIANCE ON BOTH Q/A SIDES:
  ! ADDING "ANSWER TRIGGERING" CONCEPT MAY BE PRUDENT
  ''')
validate_predictions(test_df)

Validating model against sample set from test questions:

  ! NOTE THAT ASKING FOR ANSWERS ON UNSEEN QUESTIONS IS BARELY HELPFUL WITH
  ! SMALL DATASETS AND LITTLE VARIANCE ON BOTH Q/A SIDES:
  ! ADDING "ANSWER TRIGGERING" CONCEPT MAY BE PRUDENT
  
Original question: what was colonial government like
Predicted answer: in the 20th century memorial day had a permanent amount for economic activities and the same earlier in 1933
['by the time of the american revolution in 1775 most of these features applied to most of the colonies']
BLEU score: 0.2444507336791804

Original question: WHAT WAS THE WEATHER LIKE ON FEBRUARY 12, 1909
Predicted answer: the us office of the video games in the united states air forces of following america two 4 and the 2008 greatest in the united states
['january â\x80\x93 february â\x80\x93 march â\x80\x93 april â\x80\x93 may â\x80\x93 june â\x80\x93 july â\x80\x93 august â\x80\x93 september â\x80\x93 october â\x80\x93 november â\x80\x93 december', 'the followi

### Manual validation
Performed with three types of questions:
* Question from actual training set
* Question from test set (i.e. unseen) -> only to verify that an answer is provided at all
* Reworded questions from actual training set: demonstrate robustness

In [None]:
while True:
    test_case = input('Enter test case description, or enter \'end\' to stop: ')
    if test_case == 'end':
        break
    question = input('Ask me something: ')

    print(f'{predict_answer(question)}\n')

Enter test case description, or enter 'end' to stop: test case 1: accurate question from actual training
Ask me something: how much are the harry potter movies worth?
Original question: how much are the harry potter movies worth?
Predicted answer: harry potter is a series of seven fantasy novels written by the british author j k rowling
None

Enter test case description, or enter 'end' to stop: test case 2: accurate question from actual training
Ask me something: when was apple computer founded?
Original question: when was apple computer founded?
Predicted answer: the company was founded on april 1 1976 and incorporated as apple computer inc on january 3 1977
None

Enter test case description, or enter 'end' to stop: test case 3: varying the question from training
Ask me something: when was apple founded?
Original question: when was apple founded?
Predicted answer: the company was founded on april 1 1976 and incorporated as apple computer inc on january 3 1977
None

Enter test case 

---
# END OF NOTEBOOK
---