# BUILDING A QUESTION & ANSWER BOT

BACKGROUND: We will build a question-and-answer bot based on the bAbI dataset from Facebook Research.

Full Details: https://research.fb.com/downloads/babi/

Jason Weston, Antoine Bordes, Sumit Chopra, Tomas Mikolov, Alexander M. Rush, "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks", http://arxiv.org/abs/1502.05698



DATA: Stories, each made up of sentences, a question, and a yes/no answer.


Pipeline:

* Datasets
* ETL using pickle
* EDA
* Vectorize the data
* Model creation ("End-to-End Memory Networks" by Sukhbaatar et al.):
  Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus,
  "End-To-End Memory Networks",
  http://arxiv.org/abs/1503.08895
* Model evaluation
* Model deployment: create your own stories and evaluate

THREE STEPS:

* INTENT CLASSIFICATION OR DETECTION = INPUT MEMORY REPRESENTATION
* ENTITY EXTRACTION FOR THE MOST ACCURATE RESPONSE = OUTPUT MEMORY REPRESENTATION
* GENERATING AN ACTION = GENERATE THE FINAL PREDICTION

In [None]:
import pickle
import numpy as np

In [None]:
with open("train_qa.txt", "rb") as fp:   # "rb" = read binary; the file is a pickle despite the .txt name
    train_data = pickle.load(fp)

In [None]:
with open("test_qa.txt", "rb") as fp:   # unpickle the test set the same way
    test_data = pickle.load(fp)
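
For readers new to pickle, here is a minimal sketch of the save-and-load round trip. It assumes the files hold a pickled list of (story, question, answer) tuples, which is what the exploration of the data suggests. The sample record and the in-memory buffer are made up for illustration; the notebook itself reads real files from disk.

```python
import io
import pickle

# Hypothetical record in the (story, question, answer) layout
sample_data = [
    (["Mary", "moved", "to", "the", "bathroom", "."],
     ["Is", "Mary", "in", "the", "bathroom", "?"],
     "yes"),
]

# Serialize to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
pickle.dump(sample_data, buffer)

# Rewind and load, just as pickle.load(fp) does with a file opened in "rb" mode
buffer.seek(0)
restored = pickle.load(buffer)

print(restored == sample_data)
```

Pickle preserves the nested Python structure, so no parsing step is needed after loading.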

----

## Exploring the Format of the Data

In [None]:
type(test_data)
#list

In [None]:
type(train_data)

In [None]:
len(test_data)

In [None]:
len(train_data)

In [None]:
train_data[0]
# note that punctuation marks are separate tokens

In [None]:
' '.join(train_data[0][0])

In [None]:
' '.join(train_data[0][1])

In [None]:
train_data[0][2]

-----

## Setting up Vocabulary of All Words

In [None]:
# Create a set that holds the unique vocab words
# even though we have about 11,000 stories, we want each word only once, which is why we use set()
vocab = set()

In [None]:
all_data = test_data + train_data

In [None]:
type(all_data)

In [None]:
len(all_data)

In [None]:
all_data[0]

In [None]:
for story, question , answer in all_data:
    # In case you don't know what a union of sets is:
    # https://www.programiz.com/python-programming/methods/set/union
    vocab = vocab.union(set(story))
    vocab = vocab.union(set(question))

In [None]:
vocab.add('no')
vocab.add('yes')
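
The vocabulary-building loop can be tried on a tiny example. The sketch below uses two made-up records (not taken from bAbI) and `set.update`, an in-place alternative to reassigning the result of `union`; both give the same set.

```python
# Hypothetical toy records in (story, question, answer) form
toy_data = [
    (["Mary", "went", "home", "."], ["Is", "Mary", "home", "?"], "yes"),
    (["John", "went", "out", "."], ["Is", "John", "home", "?"], "no"),
]

toy_vocab = set()
for story, question, answer in toy_data:
    # update() adds the elements in place, equivalent to vocab = vocab.union(...)
    toy_vocab.update(story)
    toy_vocab.update(question)

# Make sure both possible answers are present
toy_vocab.update(["yes", "no"])

print(sorted(toy_vocab))
print(len(toy_vocab))
```

Repeated words such as "went" and "home" are stored once, which is the reason for using a set.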

In [None]:
vocab

In [None]:
#QA:
len(vocab)

In [None]:
vocab_len = len(vocab) + 1 #we add an extra space to hold a 0 for Keras's pad_sequences

In [None]:
vocab_len

In [None]:
max_story_len = max([len(data[0]) for data in all_data])
# for every sample in all_data (about 11K), calculate the length of its story
# and keep the maximum length
# remember that the story is located at position [0]

In [None]:
max_story_len
# used later by Keras preprocessing to pad the story sequences to equal length

In [None]:
max_question_len = max([len(data[1]) for data in all_data])
# for every sample in all_data (about 11K), calculate the length of its question
# and keep the maximum length
# remember that the question is located at position [1]

In [None]:
max_question_len
# used later by Keras preprocessing to pad the question sequences to equal length
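
To see what padding to a maximum length means, here is a small pure-Python sketch. It imitates the default behaviour of Keras's `pad_sequences` (zeros added at the front, over-long sequences cut from the front); the helper name `pad_pre` and the toy sequences are assumptions made for this illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) a list of ints to exactly maxlen items."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]  # keep only the last maxlen items
    return [value] * (maxlen - len(sequence)) + sequence

toy_sequences = [[5, 2, 9], [7], [1, 2, 3, 4, 5]]
toy_maxlen = 4

padded = [pad_pre(seq, toy_maxlen) for seq in toy_sequences]
print(padded)
```

Because the maximum lengths here are computed over all of the data, no real story or question gets truncated; only the padding branch is exercised in the notebook.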

## Vectorizing the Data

In [None]:
vocab

In [None]:
# Reserve 0 for pad_sequences
vocab_size = len(vocab) + 1

-----------

In [None]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

In [None]:
# integer-encode sequences of words
# we do not want any filtering, so pass an empty filter list: the data needs no further cleaning
tokenizer = Tokenizer(filters=[])
# now build the word-to-integer code from the vocab
tokenizer.fit_on_texts(vocab)

In [None]:
tokenizer.word_index

In [None]:
#do a QA by calling word index like below
word_index = tokenizer.word_index

_______________________________________________________________________________________________________________________________

In [None]:
train_story_text = []
train_question_text = []
train_answers = []

# unzip the data into story, question and answer
for story, question, answer in train_data:
    train_story_text.append(story)
    train_question_text.append(question)
    train_answers.append(answer)

In [None]:
train_story_seq = tokenizer.texts_to_sequences(train_story_text)
# as before, convert each word to its integer code:
# training "sentences" become training "numerical sequences"

In [None]:
len(train_story_text)
#get the stories only

In [None]:
len(train_story_seq)

In [None]:
#do for the others as a QA

In [None]:
type(train_question_text)

In [None]:
type(train_story_seq)

### Functionalize Vectorization

In [None]:
def vectorize_stories(data, word_index=tokenizer.word_index, max_story_len=max_story_len,max_question_len=max_question_len):
    '''
    INPUT: 
    
    data: consisting of Stories,Queries,and Answers
    word_index: word index dictionary from tokenizer
    max_story_len: the length of the longest story (used for pad_sequences function)
    max_question_len: length of the longest question (used for pad_sequences function)


    OUTPUT:
    
    Vectorizes the stories, questions, and answers into padded sequences. We first loop over every story, query, and
    answer in the data and convert the raw words to their word index values. We then append each result to its
    output list. Once the words have been converted to numbers, we pad the sequences so they are all of equal length.
    
    Returns a tuple (X, Xq, Y), with X and Xq padded to the max lengths.
    '''
    
    
    # X = STORIES
    X = []
    # Xq = QUERY/QUESTION
    Xq = []
    # Y = CORRECT ANSWER
    Y = []
    
    
    for story, query, answer in data:
        
        # Grab the word index (code) for every word in story
        #[23, 14, 5, 6]
        x = [word_index[word.lower()] for word in story]
        # Grab the word index for every word in query
        xq = [word_index[word.lower()] for word in query]
        
        # Grab the answer (either yes or no, so no list comprehension is needed here)
        # Index 0 is reserved for padding, so the vector needs len(word_index) + 1 slots
        # set up an empty vector with np.zeros
        y = np.zeros(len(word_index) + 1)
        
        # y is all zeros and the answer is just yes/no, so set a single 1 at the answer's index
        # the vector is 38 long, with the 1 at the position of 'yes' or 'no'
        y[word_index[answer]] = 1
        
        # Append each set of story,query, and answer to their respective holding lists
        X.append(x)
        Xq.append(xq)
        Y.append(y)
        
    # Finally, pad the sequences based on their max length so the RNN can be trained on uniformly long sequences.
        
    # RETURN TUPLE FOR UNPACKING
    return (pad_sequences(X, maxlen=max_story_len),pad_sequences(Xq, maxlen=max_question_len), np.array(Y))

VECTORIZE THE TRAINING DATA AND TEST DATA

In [None]:
inputs_train, queries_train, answers_train = vectorize_stories(train_data)

In [None]:
inputs_test, queries_test, answers_test = vectorize_stories(test_data)

QA

In [None]:
inputs_test
#test sentences with word index position
#note that last code  = period

In [None]:
queries_test
#questions

In [None]:
answers_test
#all zeroes until get to the answer

In [None]:
sum(answers_test)

In [None]:
tokenizer.word_index['yes']
#refer back to your vocab code key

In [None]:
tokenizer.word_index['no']
#refer back to your vocab code key
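
The check with `sum(answers_test)` works because each answer is a one-hot vector, so the column totals count how often each word is the answer. Below is a small sketch with a made-up word index and three made-up answers; the real index values for 'yes' and 'no' come from `tokenizer.word_index`.

```python
# Hypothetical word index; 0 is reserved for padding
toy_word_index = {".": 1, "home": 2, "mary": 3, "no": 4, "went": 5, "yes": 6}
toy_answers = ["yes", "no", "yes"]

vector_size = len(toy_word_index) + 1

one_hot = []
for ans in toy_answers:
    row = [0] * vector_size
    row[toy_word_index[ans]] = 1  # a single 1 at the answer's index
    one_hot.append(row)

# Column totals: how many times each index was the answer
column_totals = [sum(col) for col in zip(*one_hot)]
print(column_totals)
```

Only the positions of 'yes' and 'no' should be non-zero, and the two counts should add up to the number of samples.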

## Creating the Model

In [None]:
# creating the model (Sequential encoders combined into a functional Model)
# using dense layers, an LSTM layer to deal with sequences, and
# an embedding layer to handle the vocabulary
# an embedding converts word indices into dense vectors of a fixed size and must be the first layer of the model
# input dim = vocab size
# output dim = desired vector space = 64

from keras.models import Sequential, Model
from keras.layers import Embedding
from keras.layers import Input, Activation, Dense, Permute, Dropout
from keras.layers import add, dot, concatenate
from keras.layers import LSTM

### Placeholders for Inputs

Recall we technically have two inputs, stories and questions. So we need to use placeholders. `Input()` is used to instantiate a Keras tensor.



In [None]:
# shape is (max_story_len,), i.e. (156,); the batch dimension is left unspecified
# the batch size that makes this run best is determined much later, at training time
input_sequence = Input((max_story_len,))

# shape is (max_question_len,), i.e. (6,); again the batch size is left open
question = Input((max_question_len,))

### Building the Networks

To understand why we chose this setup, make sure to read the paper we are using:

* Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus,
  "End-To-End Memory Networks",
  http://arxiv.org/abs/1503.08895

## Encoders

### Input Encoder m:  input memory representation

In [None]:
# The input gets embedded into a sequence of vectors
# input_dim is the vocab size (38)
# the embedding is the first layer of this encoder
# output_dim=64 is the embedding size: each of the up to 156 word indices in a story is mapped to a 64-dimensional vector
# Dropout(0.3) turns off a random 30% of units during training to reduce overfitting; experiment with the rate
input_encoder_m = Sequential()
input_encoder_m.add(Embedding(input_dim=vocab_size,output_dim=64))
input_encoder_m.add(Dropout(0.3))

# This encoder will output:
# (samples, story_maxlen, embedding_dim)

### Input Encoder c:  output memory representation

In [None]:
# embed the input into a sequence of vectors of size query_maxlen
# the output dimension must match the question length so the result has the same shape as the match matrix it is combined with later
input_encoder_c = Sequential()
input_encoder_c.add(Embedding(input_dim=vocab_size,output_dim=max_question_len))
input_encoder_c.add(Dropout(0.3))
# output: (samples, story_maxlen, query_maxlen)

### Question Encoder:  question representation 

In [None]:
# embed the question into a sequence of vectors with the same embedding size as the story
# the size must match the memory vectors (64) so the dot product between them is defined
# input_length is set to the question length
question_encoder = Sequential()
question_encoder.add(Embedding(input_dim=vocab_size,
                               output_dim=64,
                               input_length=max_question_len))
question_encoder.add(Dropout(0.3))
# output: (samples, query_maxlen, embedding_dim)

### Encode the Sequences

In [None]:
# encode the input sequence and the question into sequences of dense vectors
# naming: passing a tensor through an encoder gives the "encoded" tensor
# encoder(input_sequence) --> encoded
input_encoded_m = input_encoder_m(input_sequence) # memory vectors for the sentences (mi)
input_encoded_c = input_encoder_c(input_sequence) # output vectors (ci)
question_encoded = question_encoder(question) # query vector (u)

##### INTENT DETECTION - by using dot product to compute the match between input_encoded_m and question_encoded followed by a softmax

In [None]:
# intent detection here means computing how well each memory vector matches the question,
# followed by a softmax that converts the scores to a probability distribution summing to 1 (normalization)
# we need to match the query to the right memory vectors (sentences) to understand what the user means
# match = intent detection (max prob)
match = dot([input_encoded_m, question_encoded], axes=(2, 2)) # match = inner product of u and m
match = Activation('softmax')(match) # convert to probability distribution
# the output holds probabilities expressing how strongly each input position matches the query
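
The dot-product-plus-softmax step can be reproduced for a single sample with plain NumPy. This is only a sketch: the sizes and the random arrays are made up stand-ins for `input_encoded_m` and `question_encoded`, and the `softmax` helper is written here by hand. It normalizes over the last axis, which is what `Activation('softmax')` does.

```python
import numpy as np

rng = np.random.default_rng(0)
story_len, query_len, dim = 5, 3, 4     # toy sizes, not the notebook's 156 / 6 / 64

m = rng.normal(size=(story_len, dim))   # stand-in for input_encoded_m (one sample)
u = rng.normal(size=(query_len, dim))   # stand-in for question_encoded (one sample)

# Contract over the embedding axis: (story_len, dim) x (dim, query_len)
scores = m @ u.T

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

match_toy = softmax(scores, axis=-1)

print(match_toy.shape)          # one row per story position, one column per question position
print(match_toy.sum(axis=-1))   # each row sums to 1
```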

In [None]:
type(match)

In [None]:
match

#### ENTITY EXTRACTION - generate a response vector or combined "intent:entity" vector -
(1) weight each output vector (ci) from the sentences by the probability vector from the input and  
(2) take the sum to generate the response vector that will be used to predict the answer.

In [None]:
# extract the "entity" or "response vector" that matches the detected intent
# so that it can be fed into the layers that predict the answer
# note: the paper forms a sum of the output vectors (ci) weighted by the match probabilities;
# this code instead adds the two (samples, story_maxlen, query_maxlen) tensors element-wise

response = add([match, input_encoded_c])  # (samples, story_maxlen, query_maxlen)

#change the shape so that predictive models can act on the vector

response = Permute((2, 1))(response)  # (samples, query_maxlen, story_maxlen)

#### Concatenate - join response vector to the query

In [None]:
# concatenate the response with the encoded question along the last axis
# to feed the combined tensor into the predictive layers
answer = concatenate([response, question_encoded])
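
The shape bookkeeping of the add, permute and concatenate steps is easier to follow on small arrays. The NumPy sketch below handles a single sample with made-up sizes and constant stand-in arrays; only the shapes matter here, not the values.

```python
import numpy as np

story_len, query_len, dim = 5, 3, 4     # toy sizes, not the notebook's 156 / 6 / 64

match_toy = np.full((story_len, query_len), 1.0 / query_len)  # stand-in for match
c_toy = np.ones((story_len, query_len))                       # stand-in for input_encoded_c
u_toy = np.zeros((query_len, dim))                            # stand-in for question_encoded

response_toy = match_toy + c_toy   # add(): (story_len, query_len)
response_toy = response_toy.T      # Permute((2, 1)): (query_len, story_len)

# concatenate joins along the last axis, so the first axes must agree
answer_toy = np.concatenate([response_toy, u_toy], axis=-1)

print(response_toy.shape)
print(answer_toy.shape)   # (query_len, story_len + dim)
```

The permute is what makes the concatenation possible: both tensors need `query_len` as their first per-sample axis.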

In [None]:
#QA
answer

In [None]:
#Complete processing of the answer vector

In [None]:
# Reduce with RNN (LSTM)
# the LSTM reads the sequence and returns a single 32-dimensional vector
# use 32 neurons vs. 64
# for more efficient processing
answer = LSTM(32)(answer)  # (samples, 32)

# Regularization with Dropout
answer = Dropout(0.3)(answer)
answer = Dense(vocab_size)(answer)  # (samples, vocab_size) #yes or no

# we output a probability distribution over the vocabulary
answer = Activation('softmax')(answer)


In [None]:
# build the final model
model = Model([input_sequence, question], answer)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
# the loss is not binary: it is computed across the whole vocabulary. The highest probability
# should land on yes or no, but the network could output a word that is neither.
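
For intuition about the categorical cross-entropy loss, here is a hand computation for one sample. The probability vector is made up; with a one-hot target the loss reduces to the negative log of the probability assigned to the correct word.

```python
import math

predicted = [0.05, 0.05, 0.10, 0.70, 0.10]  # made-up softmax output over a 5-word vocabulary
target = [0, 0, 0, 1, 0]                    # one-hot vector for the true answer

# Cross-entropy: only the term at the true answer's index contributes
loss = -sum(t * math.log(p) for t, p in zip(target, predicted))

print(loss)
```

A confident correct prediction (probability near 1) gives a loss near 0, while probability mass on any other vocabulary word raises the loss.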

In [None]:
#QA
model.summary()

In [None]:
# train the model - pass in the (inputs, queries) pair and the answers
# training uses 30 epochs here, although 100 to 400 is not unusual
# experiment with batch sizes
history = model.fit([inputs_train, queries_train], answers_train,
                    batch_size=32, epochs=30,
                    validation_data=([inputs_test, queries_test], answers_test))

## Evaluating the Model

### Plotting Out Training History

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
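print(history.history.keys())
# summarize history for accuracy
# older Keras versions name the metric 'acc'; newer ones use 'accuracy'
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])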
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

### Evaluating on Given Test Set - PREDICT THE ANSWER

In [None]:
# use the 30-epoch model to predict on the same test set used for validation
pred_results = model.predict(([inputs_test, queries_test]))

In [None]:
#discuss the results

In [None]:
test_data[0][0]

In [None]:
story =' '.join(word for word in test_data[0][0])
print(story)

In [None]:
query = ' '.join(word for word in test_data[0][1])
print(query)

In [None]:
print("True Test Answer from Data is:",test_data[0][2])

In [None]:
#Generate prediction from model
#which word has the max prob - either yes or no?
val_max = np.argmax(pred_results[0])

for key, val in tokenizer.word_index.items():
    if val == val_max:
        k = key

print("Predicted answer is: ", k)
print("Probability of certainty was: ", pred_results[0][val_max])
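
An alternative to looping over `word_index` is to invert the dictionary once and look the index up directly. The sketch below uses a made-up word index and a made-up probability vector; the `"<pad>"` fallback is an assumption for the case where the highest probability falls on index 0, which is reserved for padding and has no word.

```python
# Hypothetical word index and model output for one sample
toy_word_index = {".": 1, "home": 2, "mary": 3, "no": 4, "went": 5, "yes": 6}
toy_prediction = [0.01, 0.02, 0.02, 0.05, 0.20, 0.05, 0.65]

# Invert the mapping once: index -> word
index_to_word = {index: word for word, index in toy_word_index.items()}

# Position of the largest probability (what np.argmax returns)
best_index = max(range(len(toy_prediction)), key=toy_prediction.__getitem__)

# Index 0 is the padding slot and maps to no word, hence the fallback
predicted_word = index_to_word.get(best_index, "<pad>")

print("Predicted answer is:", predicted_word)
print("Probability of certainty was:", toy_prediction[best_index])
```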

### Saving the Model - save the model if you are pleased with it

In [None]:
filename = 'chatbot_30_epochs.h5'
model.save(filename)

## Writing Your Own Stories and Questions

Remember that you can only use words from the existing vocab.
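
Since `vectorize_stories` looks every word up in the word index, an unknown word raises a `KeyError`. A small pre-check can catch that early. The helper `find_unknown_words` and the toy vocabulary are hypothetical; in the notebook the known words would be the lowercase keys of `tokenizer.word_index`.

```python
def find_unknown_words(text, known_words):
    """Return the tokens of text whose lowercase form is not in known_words."""
    return [tok for tok in text.split() if tok.lower() not in known_words]

# Hypothetical lowercase vocabulary
toy_vocab = {"john", "left", "the", "kitchen", ".", "is", "in", "?"}

ok_story = "John left the kitchen ."
bad_story = "John left the spaceship."   # unknown word, and no space before the period

print(find_unknown_words(ok_story, toy_vocab))
print(find_unknown_words(bad_story, toy_vocab))
```

The second call also shows why punctuation needs surrounding whitespace: without it, `split()` keeps the period glued to the word and the token is not found.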

In [None]:
vocab

In [None]:
# Note the whitespace before the periods: punctuation must be its own token
my_story = "John left the kitchen . Sandra dropped the football in the garden ."
my_story.split()

In [None]:
my_question = "Is the football in the garden ?"

In [None]:
my_question.split()

In [None]:
mydata = [(my_story.split(),my_question.split(),'yes')]

In [None]:
my_story,my_ques,my_ans = vectorize_stories(mydata)

In [None]:
#use a pretrained model to evaluate your own story - make sure in the correct folder
#filename = 'chatbot_120_epochs.h5'
#model.load_weights(filename) 

In [None]:
#do not include the label or answers
pred_results = model.predict(([ my_story, my_ques]))

In [None]:
#Generate prediction from model
val_max = np.argmax(pred_results[0])

for key, val in tokenizer.word_index.items():
    if val == val_max:
        k = key

print("Predicted answer is: ", k)
print("Probability of certainty was: ", pred_results[0][val_max])

# THANK YOU!!!