# CoQa Chatbot using LSTM

I used the retrieval chatbot described in the blog post https://omarito.me/building-a-basic-fatwa-chat-bot/ (it uses an LSTM model for vectorization), with some modifications.

## Modifications:

   1. It uses the CoQa dataset instead of the askfm dataset.
   2. It prints only the top answer if the similarity of its question is at least 0.5.
   3. If the similarity of the top question is < 0.5, it prints the top five questions (with their answers) as suggestions.
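
The thresholding rule in modifications 2 and 3 can be sketched in a few lines of NumPy. This is only an illustration: the function name `retrieve`, its parameters, and the toy data are assumptions and do not appear in the notebook, which implements the same rule inline in the chatbot loop.

```python
import numpy as np

def retrieve(similarities, questions, answers, threshold=0.5, k=5):
    """Return ("answer", (answer, score)) or ("suggestions", [(question, answer), ...]).

    Hypothetical helper: mirrors the rule used by the chatbot loop,
    where a best similarity below the threshold triggers suggestions.
    """
    similarities = np.asarray(similarities, dtype=float)
    order = np.argsort(-similarities)  # indices sorted by descending similarity
    best = order[0]
    if similarities[best] < threshold:
        # Not confident enough: offer the k most similar stored questions
        return "suggestions", [(questions[i], answers[i]) for i in order[:k]]
    return "answer", (answers[best], float(similarities[best]))

# Toy data, made up for the example
questions = ["how old is he", "where is it", "who is she"]
answers = ["ten", "in the south", "a teacher"]
print(retrieve([0.2, 0.9, 0.4], questions, answers))
print(retrieve([0.2, 0.3, 0.4], questions, answers))
```

Returning a tagged tuple keeps the decision (answer or suggestions) separate from the printing, which makes the rule easy to test on its own.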

## 1. LSTM Model Training
I ran this part on Colab and saved the final trained model for use in the chatbot. I changed the dataset cleaning phase to suit English words instead of Arabic words. I also removed numbers and punctuation, and I did not use stemming as the blog does.
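
As a small standalone sketch of the cleaning step described above, the two regular-expression and whitespace operations can be wrapped in one function. The name `clean_question` is an assumption made for this example; the notebook applies the same two steps inline with `apply`.

```python
import re

def clean_question(text):
    """Keep letters only, lowercase, and collapse repeated spaces (hypothetical helper)."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)  # digits and punctuation become spaces
    return ' '.join(letters_only.lower().split())  # lowercase and normalise whitespace

print(clean_question("When was the Vat formally opened?"))
print(clean_question("Is the JPEG format supported by Adobe Flash Player 11.0?"))
```

Note that version numbers such as `11.0` disappear entirely, so questions that differ only in numbers become identical after cleaning.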

In [1]:
################################        Dataset cleaning and tokenization          ####################################
# Importing the required libraries
import pandas as pd
import numpy as np
import re
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, Embedding, LSTM, RepeatVector
from tensorflow.python.keras.utils import np_utils
from nltk.stem.porter import PorterStemmer
from six.moves import cPickle
from  tensorflow.python.keras.callbacks import ModelCheckpoint
# Defining the Global variables 
BATCH_SIZE = 32 # Batch size for GPU
NUM_WORDS = 10000 # Vocab length
MAX_LEN = 20 # Padding length (# of words)
LSTM_EMBED = 8 # Number of LSTM nodes

# Reading the dataset
train_data = pd.read_csv("Rearranged_data.csv")

################################                Cleaning                ####################################
# Dropping the answers and the context number
train_data.drop(['Answers','Context Number'], inplace=True, axis=1)
# Keeping only the questions that have fewer than MAX_LEN (20) words
train_data = train_data[train_data.Questions.apply(lambda x: len(x.split())) < MAX_LEN]
# Cleaning the questions
# removing all punc. and numbers
train_data.Questions = train_data.Questions.apply(lambda x: (re.sub('[^a-zA-Z]', ' ', x)).strip()) 
# lowercasing, removing repeated spaces
train_data.Questions = train_data.Questions.apply(lambda x: ' '.join(x.lower().split()))
train_data = train_data[train_data.Questions.apply(len) > 0]

################################                Tokenization                ####################################
# Fitting the tokenizer on the questions after cleaning
tokenizer = Tokenizer(num_words=NUM_WORDS, lower=False)
tokenizer.fit_on_texts(train_data["Questions"].values)
# Save the tokenizer for later use
cPickle.dump(tokenizer, open("lstm-autoencoder-tokenizer.pickle", "wb"))
# Tokenizing the questions
train_data = tokenizer.texts_to_sequences(train_data["Questions"].values)
# Padding sequences that are shorter than MAX_LEN
train_data = pad_sequences(train_data, padding='post', truncating='post', maxlen=MAX_LEN)

In [2]:
# Printing tokenizer indices of the words 
tokenizer.index_word

{1: 'what',
 2: 'the',
 3: 'did',
 4: 'was',
 5: 'he',
 6: 'who',
 7: 'is',
 8: 'to',
 9: 'how',
 10: 'it',
 11: 'of',
 12: 'in',
 13: 'they',
 14: 'where',
 15: 'a',
 16: 'does',
 17: 'she',
 18: 'when',
 19: 'do',
 20: 'his',
 21: 'for',
 22: 'many',
 23: 'that',
 24: 's',
 25: 'were',
 26: 'have',
 27: 'are',
 28: 'with',
 29: 'and',
 30: 'about',
 31: 'there',
 32: 'her',
 33: 'why',
 34: 'on',
 35: 'this',
 36: 'him',
 37: 'name',
 38: 'from',
 39: 'which',
 40: 'at',
 41: 'one',
 42: 'else',
 43: 'be',
 44: 'had',
 45: 'them',
 46: 'long',
 47: 'people',
 48: 'old',
 49: 'first',
 50: 'other',
 51: 'get',
 52: 'go',
 53: 'has',
 54: 'by',
 55: 'as',
 56: 'any',
 57: 'their',
 58: 'say',
 59: 'time',
 60: 'think',
 61: 'like',
 62: 'want',
 63: 'would',
 64: 'kind',
 65: 'an',
 66: 'been',
 67: 'happened',
 68: 'after',
 69: 'up',
 70: 'year',
 71: 'will',
 72: 'not',
 73: 'much',
 74: 'take',
 75: 'doing',
 76: 'or',
 77: 'out',
 78: 'day',
 79: 'can',
 80: 'then',
 81: 'make',
 

In [None]:
################################                Training                ####################################
## This part was run on Colab

# Defining LSTM model architecture
model = Sequential()
model.add(Embedding(NUM_WORDS, 100, input_length=MAX_LEN))
model.add(LSTM(LSTM_EMBED, dropout=0.2, recurrent_dropout=0.2, input_shape=(train_data.shape[1], NUM_WORDS)))
model.add(RepeatVector(train_data.shape[-1]))
model.add(LSTM(LSTM_EMBED, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(Dense(NUM_WORDS, activation='softmax'))
# Model compiling
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define the checkpoint (to save the model after each epoch)
filepath = "lstm-encoder-{epoch:02d}-{loss:.4f}.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# Fitting the model (training for 25 epochs with a batch size of 32)
model.fit(train_data, np.expand_dims(train_data, -1), epochs=25, batch_size= BATCH_SIZE, callbacks=callbacks_list)
# Saving the final model
model.save("lstm-encoder.h5")

In [5]:
# Evaluating the loss and the accuracy of the final saved model
import warnings
warnings.filterwarnings('ignore')
from tensorflow.python.keras.models import load_model
loaded_model = load_model("lstm-encoder.h5")
loaded_model.evaluate(train_data, np.expand_dims(train_data, -1), batch_size= BATCH_SIZE)



[1.1700287287651618, 0.81302434]
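
One thing to keep in mind when reading the accuracy above: the targets are the padded sequences themselves, so padding positions (token id 0) are counted too. The following sketch, with made-up toy arrays and a hypothetical helper `token_accuracy`, shows how overall accuracy and accuracy over real tokens only can differ. It is not a measurement of the trained model.

```python
import numpy as np

def token_accuracy(targets, predictions, pad_id=0):
    """Return (accuracy over all positions, accuracy over non-padding positions)."""
    targets = np.asarray(targets)
    predictions = np.asarray(predictions)
    overall = float((targets == predictions).mean())
    mask = targets != pad_id  # True where the target is a real word
    non_pad = float((targets[mask] == predictions[mask]).mean())
    return overall, non_pad

# Toy example: two padded "questions" and imagined predictions
targets = np.array([[1, 7, 2, 0, 0, 0],
                    [9, 22, 0, 0, 0, 0]])
predictions = np.array([[1, 2, 2, 0, 0, 0],
                        [9, 4, 0, 0, 0, 0]])
overall, non_pad = token_accuracy(targets, predictions)
print(overall, non_pad)
```

In this toy case the padding positions are all predicted correctly, so the overall figure is higher than the accuracy on real words.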

## 2. Building the chatbot

In [1]:
# Importing the required libraries
import pandas as pd
import re
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from six.moves import cPickle
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.utils import np_utils

# Defining the Global variables 
BATCH_SIZE = 32 # Batch size for GPU
NUM_WORDS = 10000 # Vocab length
MAX_LEN = 20 # Padding length (# of words)
LSTM_EMBED = 8 # Number of LSTM nodes
K.set_learning_phase(False)

# Reading the dataset
dataset = pd.read_csv("Rearranged_data.csv")
# Removing context number column
data = dataset[['Questions','Answers']].copy()  # copy so the column assignments below do not act on a view
dataset = []
# Cleaning the questions
data.Questions = data.Questions.apply(lambda x: (re.sub('[^a-zA-Z]', ' ', x)).strip())
data.Questions = data.Questions.apply(lambda x: ' '.join(x.lower().split()))

# Loading the tokenizer
tokenizer = cPickle.load(open("lstm-autoencoder-tokenizer.pickle", "rb"))
# Questions tokenization
Questions = tokenizer.texts_to_sequences(data.Questions)
# Padding sequences that are shorter than MAX_LEN
Questions = pad_sequences(Questions, padding='post', truncating='post', maxlen=MAX_LEN)

# Reading the encoder model
model = load_model("lstm-encoder.h5")
# Creating the encoding function
encode = K.function([model.input], [model.layers[1].output])
# Encoding the questions with the predefined function
Questions = np.squeeze(np.array(encode([Questions])))



##### Now I will try the same conversations that I tried with the TF-IDF chatbot

### First conversation 

In [6]:
print("Welcome to CoQa chatbot")
question = input('Please enter a question: ')

while True:
    """Cleaning input Question and Vectorizing it """
    # Cleaning 
    question = (re.sub('[^a-zA-Z]', ' ', question)).strip()
    question = ' '.join(question.lower().split())
    # Tokenization
    question = tokenizer.texts_to_sequences([question])
    # Padding to MAX_LEN
    question = pad_sequences(question, padding='post', truncating='post', maxlen=MAX_LEN)
    # Encoding
    question = np.squeeze(encode([question]))
    
    """Cosine similarity"""
    rank = cosine_similarity(question.reshape(1, -1), Questions)
    
    """Getting maximum cosine similarity rank"""
    print("\nSearching for the best answer..........\n")   
    array=np.asarray(rank)
    array=array.reshape(len(data),1)
    maximum = max(array)
    top_one =np.argsort(-1*rank, axis=-1).T[:1].tolist()
    
    """Checking if maximum cosine similarity rank < 0.5"""
    for item in top_one:
        
      if (maximum < 0.5):
        print("Sorry, I can't find the answer\n")
        
        """Printing the answers of the top five questions by cosine
         similarity"""
        print('Answers of the most similar five Questions\n')
        top_five = np.argsort(-1*rank, axis=-1).T[:5].tolist()
        for item in top_five:
          print(data['Questions'].iloc[item].values[0],' : ' , data['Answers'].iloc[item].values[0]) 

      else:
        print('The answer is "', 
              data['Answers'].iloc[item].values[0],'" with similarity: ', maximum[0])  
    print("\n ---------------------------------------------------------\n")
    
    flag = True
    flag_N= False
    while flag:
          do = [input('Do you have another question (Y/N)? ')]
          if do[0] == 'Y':
             question = input('Please enter another question: ')
             flag = False
          elif do[0] == 'N':
                print('\nGood bye')
                flag = False
                flag_N = True
          else:
             print("\n I can't understand\n")

    if flag_N == True:
        break
    
    

Welcome to CoQa chatbot
Please enter a question:  When was the Vat formally opened?

Searching for the best answer..........

The answer is " It was formally established in 1475 " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: for what subjects?

Searching for the best answer..........

The answer is " history, and law " with similarity:  0.99999994

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: what must be requested in person or by mail?

Searching for the best answer..........

The answer is " Photocopies " with similarity:  0.99999994

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: when were the Secret Archives moved from the rest of the library?

Searching for the best answer..........

The answer is " at the beginn

##### We can see that the similarity is not exactly one for some questions, whereas with the TF-IDF method almost all similarities were equal to 1. Values such as 0.99999994 differ from 1 only by float32 rounding. For larger gaps, I think this is logical because the LSTM compresses the vectors while keeping the important information as much as possible, so it may lose some information.
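
The printed similarities include values such as 0.99999994 and 1.0000001. A short NumPy check shows that these are the float32 numbers directly below and above 1, so they are consistent with rounding in the cosine computation rather than with a real mismatch. The helper `is_effectively_one` and its tolerance are assumptions made for this sketch.

```python
import numpy as np

one = np.float32(1.0)
below_one = np.nextafter(one, np.float32(0.0))  # largest float32 smaller than 1
above_one = np.nextafter(one, np.float32(2.0))  # smallest float32 larger than 1
print(below_one, above_one)

def is_effectively_one(similarity, tolerance=1e-6):
    """Treat a similarity as an exact match if it is within `tolerance` of 1.

    The tolerance value is an assumption chosen for this example.
    """
    return abs(float(similarity) - 1.0) <= tolerance

print(is_effectively_one(0.99999994), is_effectively_one(0.95))
```

Comparing with a tolerance like this avoids reading rounding noise as information lost by the encoder.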

### Second conversation

In [7]:
print("Welcome to CoQa chatbot")
question = input('Please enter a question: ')

while True:
    """Cleaning input Question and Vectorizing it """
    # Cleaning 
    question = (re.sub('[^a-zA-Z]', ' ', question)).strip()
    question = ' '.join(question.lower().split())
    # Tokenization
    question = tokenizer.texts_to_sequences([question])
    # Padding to MAX_LEN
    question = pad_sequences(question, padding='post', truncating='post', maxlen=MAX_LEN)
    # Encoding
    question = np.squeeze(encode([question]))
    
    """Cosine similarity"""
    rank = cosine_similarity(question.reshape(1, -1), Questions)
    
    """Getting maximum cosine similarity rank"""
    print("\nSearching for the best answer..........\n")   
    array=np.asarray(rank)
    array=array.reshape(len(data),1)
    maximum = max(array)
    top_one =np.argsort(-1*rank, axis=-1).T[:1].tolist()
    
    """Checking if maximum cosine similarity rank < 0.5"""
    for item in top_one:
        
      if (maximum < 0.5):
        print("Sorry, I can't find the answer\n")
        
        """Printing the answers of the top five questions by cosine
         similarity"""
        print('Answers of the most similar five Questions\n')
        top_five = np.argsort(-1*rank, axis=-1).T[:5].tolist()
        for item in top_five:
          print(data['Questions'].iloc[item].values[0],' : ' , data['Answers'].iloc[item].values[0]) 

      else:
        print('The answer is "', 
              data['Answers'].iloc[item].values[0],'" with similarity: ', maximum[0])  
    print("\n ---------------------------------------------------------\n")
    
    flag = True
    flag_N= False
    while flag:
          do = [input('Do you have another question (Y/N)? ')]
          if do[0] == 'Y':
             question = input('Please enter another question: ')
             flag = False
          elif do[0] == 'N':
                print('\nGood bye')
                flag = False
                flag_N = True
          else:
             print("\n I can't understand\n")

    if flag_N == True:
        break

Welcome to CoQa chatbot
Please enter a question: Is the JPEG format supported by Adobe Flash Player 11.0?

Searching for the best answer..........

The answer is " yes " with similarity:  1.0000001

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question:  what is JPEG XR short for?

Searching for the best answer..........

The answer is " JPEG extended range " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: when did microsoft put HD Photo up for consideration to be named JPEG XR?

Searching for the best answer..........

The answer is " July 2007 " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: what did they rename it to?

Searching for the best answer..........

The answer is " HD Photo " with similarity:  1

### Third conversation

In [9]:
print("Welcome to CoQa chatbot")
question = input('Please enter a question: ')

while True:
    """Cleaning input Question and Vectorizing it """
    # Cleaning 
    question = (re.sub('[^a-zA-Z]', ' ', question)).strip()
    question = ' '.join(question.lower().split())
    # Tokenization
    question = tokenizer.texts_to_sequences([question])
    # Padding to MAX_LEN
    question = pad_sequences(question, padding='post', truncating='post', maxlen=MAX_LEN)
    # Encoding
    question = np.squeeze(encode([question]))
    
    """Cosine similarity"""
    rank = cosine_similarity(question.reshape(1, -1), Questions)
    
    """Getting maximum cosine similarity rank"""
    print("\nSearching for the best answer..........\n")   
    array=np.asarray(rank)
    array=array.reshape(len(data),1)
    maximum = max(array)
    top_one =np.argsort(-1*rank, axis=-1).T[:1].tolist()
    
    """Checking if maximum cosine similarity rank < 0.5"""
    for item in top_one:
        
      if (maximum < 0.5):
        print("Sorry, I can't find the answer\n")
        
        """Printing the answers of the top five questions by cosine
         similarity"""
        print('Answers of the most similar five Questions\n')
        top_five = np.argsort(-1*rank, axis=-1).T[:5].tolist()
        for item in top_five:
          print(data['Questions'].iloc[item].values[0],' : ' , data['Answers'].iloc[item].values[0]) 

      else:
        print('The answer is "', 
              data['Answers'].iloc[item].values[0],'" with similarity: ', maximum[0])  
    print("\n ---------------------------------------------------------\n")
    
    flag = True
    flag_N= False
    while flag:
          do = [input('Do you have another question (Y/N)? ')]
          if do[0] == 'Y':
             question = input('Please enter another question: ')
             flag = False
          elif do[0] == 'N':
                print('\nGood bye')
                flag = False
                flag_N = True
          else:
             print("\n I can't understand\n")

    if flag_N == True:
        break

Welcome to CoQa chatbot
Please enter a question: what is the GoI?

Searching for the best answer..........

The answer is " a ship to travel to South America " with similarity:  1.0000001

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: where did India come from?

Searching for the best answer..........

The answer is " the Indus river " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: how many states are in India?

Searching for the best answer..........

The answer is " 29 " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: how was it created?

Searching for the best answer..........

The answer is " by the constitution of India " with similarity:  0.99999994

 ------------------------------------------

##### We can see that the answer to the first question is wrong, whereas TF-IDF gave the right answer. The answer printed here belongs to the question ' What is the Beagle? '
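
One possible explanation, which is an assumption and was not checked against the dataset: a Keras `Tokenizer` created with `num_words` and no `oov_token` skips words outside its vocabulary in `texts_to_sequences`. If both 'goi' and 'beagle' fall outside the top `NUM_WORDS` words, the two questions reduce to the same token sequence and therefore to the same encoding. The sketch below mimics that behaviour in plain Python; `to_sequence` is a hypothetical helper and the indices for 'goi' and 'beagle' are made up, while those for 'what', 'is' and 'the' are taken from the tokenizer output shown earlier.

```python
def to_sequence(text, word_index, num_words):
    """Mimic dropping of out-of-vocabulary words (hypothetical helper, not the Keras API)."""
    ids = []
    for word in text.split():
        idx = word_index.get(word)
        # Words that are unknown or ranked beyond num_words are silently skipped
        if idx is not None and idx < num_words:
            ids.append(idx)
    return ids

# 'what', 'the', 'is' use the indices printed above; the rare-word indices are invented
word_index = {"what": 1, "the": 2, "is": 7, "beagle": 10500, "goi": 12000}

seq_goi = to_sequence("what is the goi", word_index, num_words=10000)
seq_beagle = to_sequence("what is the beagle", word_index, num_words=10000)
print(seq_goi, seq_beagle, seq_goi == seq_beagle)
```

If this is what happens, the similarity of 1.0000001 reported above is expected, because the encoder receives identical inputs for both questions.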

### A conversation illustrating some mistakes caused by questions repeated in different contexts

In [10]:
print("Welcome to CoQa chatbot")
question = input('Please enter a question: ')

while True:
    """Cleaning input Question and Vectorizing it """
    # Cleaning 
    question = (re.sub('[^a-zA-Z]', ' ', question)).strip()
    question = ' '.join(question.lower().split())
    # Tokenization
    question = tokenizer.texts_to_sequences([question])
    # Padding to MAX_LEN
    question = pad_sequences(question, padding='post', truncating='post', maxlen=MAX_LEN)
    # Encoding
    question = np.squeeze(encode([question]))
    
    """Cosine similarity"""
    rank = cosine_similarity(question.reshape(1, -1), Questions)
    
    """Getting maximum cosine similarity rank"""
    print("\nSearching for the best answer..........\n")   
    array=np.asarray(rank)
    array=array.reshape(len(data),1)
    maximum = max(array)
    top_one =np.argsort(-1*rank, axis=-1).T[:1].tolist()
    
    """Checking if maximum cosine similarity rank < 0.5"""
    for item in top_one:
        
      if (maximum < 0.5):
        print("Sorry, I can't find the answer\n")
        
        """Printing the answers of the top five questions by cosine
         similarity"""
        print('Answers of the most similar five Questions\n')
        top_five = np.argsort(-1*rank, axis=-1).T[:5].tolist()
        for item in top_five:
          print(data['Questions'].iloc[item].values[0],' : ' , data['Answers'].iloc[item].values[0]) 

      else:
        print('The answer is "', 
              data['Answers'].iloc[item].values[0],'" with similarity: ', maximum[0])  
    print("\n ---------------------------------------------------------\n")
    
    flag = True
    flag_N= False
    while flag:
          do = [input('Do you have another question (Y/N)? ')]
          if do[0] == 'Y':
             question = input('Please enter another question: ')
             flag = False
          elif do[0] == 'N':
                print('\nGood bye')
                flag = False
                flag_N = True
          else:
             print("\n I can't understand\n")

    if flag_N == True:
        break

Welcome to CoQa chatbot
Please enter a question: What is the largest island?

Searching for the best answer..........

The answer is " Australia " with similarity:  1.0000001

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: Where is it?

Searching for the best answer..........

The answer is " southern Atlantic Ocean. " with similarity:  0.99999994

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: How many people live there?

Searching for the best answer..........

The answer is " 7.2 million " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? N

Good bye


#### In the previous conversation the right answers were:
1. ' Australia ': the printed answer is right, whereas TF-IDF gave a wrong answer.
2. ' south of earth ': the printed answer belongs to the same question, 'Where is it?', asked in another context about 'South Georgia and the South Sandwich Islands'.
3. ' The population is nearly as large as Shanghai's ': the printed answer belongs to the same question, 'How many people live there?', asked in another context about 'Hong Kong'.

##### The LSTM approach also has a problem when the same question is repeated in many contexts, and I think that changing the question slightly so that it differs from the dataset will produce further mistakes.
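
To see how often this happens, one could list the cleaned questions that map to more than one answer. The sketch below uses a hypothetical helper `ambiguous_questions` and a three-row sample built from the answers quoted above. In the notebook the pairs would come from the `Questions` and `Answers` columns of `data`.

```python
from collections import defaultdict

def ambiguous_questions(pairs):
    """Map each question that has several distinct answers to the sorted list of those answers."""
    answers_by_question = defaultdict(set)
    for question, answer in pairs:
        answers_by_question[question].add(answer)
    return {q: sorted(a) for q, a in answers_by_question.items() if len(a) > 1}

# Small sample using answers quoted in this notebook
pairs = [
    ("what is the largest island", "Australia"),
    ("where is it", "south of earth"),
    ("where is it", "southern Atlantic Ocean."),
]
print(ambiguous_questions(pairs))
```

Any question returned by this function cannot be answered reliably by question similarity alone, whichever vectorization is used, because the retrieval step has no notion of the conversation's context.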