# Seq2Seq Chatbot using Keras and Chatterbot Data

This notebook assembles a seq2seq LSTM model with the Keras Functional API to build a chatbot that answers the questions put to it.

## Importing packages

In [1]:
import numpy as np
import tensorflow as tf
import pickle
from tensorflow.keras import layers, activations, models, preprocessing, utils
import os
import yaml
import requests, zipfile, io
import re
import tensorflow_datasets as tfds

print("Tensorflow version {}".format(tf.__version__))

Tensorflow version 2.0.0


## Preprocessing the data
### Download the data

The dataset is chatterbot/english, published on Kaggle by kausr25. It contains question-answer pairs on a number of subjects such as food, history and AI.

The raw data can be found in the following repository: https://github.com/shubham0204/Dataset_Archives

Alternatively, the data can be downloaded from Kaggle through the following link: https://www.kaggle.com/kausr25/chatterbotenglish

In [2]:
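# Download the archive; stop on an HTTP error rather than unzipping an error page
r = requests.get('https://github.com/shubham0204/Dataset_Archives/blob/master/chatbot_nlp.zip?raw=true')
r.raise_for_status()
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()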

### Reading the data from the files

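Parse each of the .yaml files as follows:
+ Concatenate the sentences of an answer if it has two or more of them.
+ Remove entries whose answer was not parsed as a string (the YAML parser can produce other data types).
+ Add `<START>` at the beginning and `<END>` at the end of every answer.
+ Create a Tokenizer and load the whole vocabulary (questions + answers) into it. The Tokenizer's default filters strip `<` and `>` and lowercase the text, so the tags end up in the vocabulary as `start` and `end`.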
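
The first three steps can be sketched on a small in-memory example. This is a minimal illustration, not the notebook's own code: `build_pairs` and the `toy` conversations are hypothetical names and data, and the sketch joins replies with single spaces.

```python
def build_pairs(conversations):
    """Turn parsed conversations into (question, tagged answer) pairs."""
    questions, answers = [], []
    for con in conversations:
        if len(con) < 2:
            continue  # a question without any reply cannot be used
        question, replies = con[0], con[1:]
        # Skip entries where the YAML parser produced something other than text
        if not isinstance(question, str):
            continue
        if not all(isinstance(rep, str) for rep in replies):
            continue
        questions.append(question)
        answers.append('<START> ' + ' '.join(replies) + ' <END>')
    return questions, answers


# Hypothetical conversations in the same list-of-lists layout as the .yaml files
toy = [
    ['Hello', 'Hi'],
    ['What is AI?', 'Artificial Intelligence.', 'It is a branch of engineering.'],
    ['Orphan question'],
]
toy_questions, toy_answers = build_pairs(toy)
print(toy_questions)
print(toy_answers)
```

Filtering the question and its answer in the same step keeps the two lists aligned, which matters because they are later converted to arrays index by index.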


In [3]:
dir_path = 'chatbot_nlp/data'
files_list = os.listdir(dir_path + os.sep)

questions = list()
answers = list()

for filepath in files_list:
    # Close each file as soon as it has been parsed
    with open(dir_path + os.sep + filepath, 'rb') as stream:
        docs = yaml.safe_load(stream)
    conversations = docs['conversations']
    for con in conversations:
        if len(con) > 2:
            # Several replies: join them into a single answer
            questions.append(con[0])
            replies = con[1:]
            ans = ''
            for rep in replies:
                ans += ' ' + rep
            answers.append(ans)
        elif len(con) > 1:
            questions.append(con[0])
            answers.append(con[1])

# Keep only the pairs whose answer was parsed as a string. Filtering questions
# and answers together keeps them aligned; popping from `questions` while
# iterating by index would shift the remaining items.
pairs = [(q, a) for q, a in zip(questions, answers) if isinstance(a, str)]
questions = [q for q, _ in pairs]

# Wrap every answer in start and end tags
answers = ['<START> ' + a + ' <END>' for _, a in pairs]

tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(questions + answers)
VOCAB_SIZE = len(tokenizer.word_index)+1
print('VOCAB SIZE : {}'.format(VOCAB_SIZE))

VOCAB SIZE : 1894


### Preparing data for Seq2Seq model

The model requires three arrays: encoder_input_data, decoder_input_data and decoder_output_data.

+ For encoder_input_data, tokenize the questions and pad them to the length of the longest question.

+ For decoder_input_data, tokenize the answers and pad them to the length of the longest answer.

+ For decoder_output_data, tokenize the answers and remove the first element from every tokenized answer, which is the `<START>` element added earlier. The target is therefore the decoder input shifted one step ahead (teacher forcing), and it is one-hot encoded to match the softmax output of the model.
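
The relationship between the decoder input and the decoder target is easiest to see on a single answer. The sketch below uses made-up token ids and a hypothetical `pad_post` helper as a simplified stand-in for `pad_sequences(..., padding='post')`. It is only meant to show the one-step shift.

```python
def pad_post(seq, maxlen, value=0):
    """Right-pad a list of token ids with `value` up to `maxlen`."""
    return seq + [value] * (maxlen - len(seq))


# Hypothetical ids: 1 = start, 2 = end, 0 = padding
tokenized_answer = [1, 7, 9, 4, 2]
maxlen = 7

decoder_input = pad_post(tokenized_answer, maxlen)
decoder_target = pad_post(tokenized_answer[1:], maxlen)  # start token dropped

# At every step the decoder sees one token and must predict the next one
for step, (inp, tgt) in enumerate(zip(decoder_input, decoder_target)):
    print(step, inp, '->', tgt)
```

At step 0 the decoder receives the start token and is asked for the first real word; at the step where it receives the last word it is asked for the end token.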


In [4]:
# encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences(questions)
maxlen_questions = max([len(x) for x in tokenized_questions])
padded_questions = preprocessing.sequence.pad_sequences(tokenized_questions, maxlen=maxlen_questions, padding='post')
encoder_input_data = np.array(padded_questions)
print(encoder_input_data.shape , maxlen_questions)

# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences(answers)
maxlen_answers = max([len(x) for x in tokenized_answers])
padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
decoder_input_data = np.array(padded_answers)
print(decoder_input_data.shape, maxlen_answers)

# decoder_output_data
tokenized_answers = tokenizer.texts_to_sequences(answers)
for i in range(len(tokenized_answers)):
    tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
onehot_answers = utils.to_categorical(padded_answers, VOCAB_SIZE)
decoder_output_data = np.array(onehot_answers)
print(decoder_output_data.shape)

# Saving all the arrays to storage
np.save('enc_in_data.npy', encoder_input_data)
np.save('dec_in_data.npy', decoder_input_data)
np.save('dec_tar_data.npy', decoder_output_data)

(564, 22) 22
(564, 74) 74
(564, 74, 1894)
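
The printed shapes show that the one-hot target array is by far the largest of the three. A back-of-the-envelope estimate, assuming 4 bytes per element (float32 one-hot values and int32 token ids, both assumptions about the array dtypes):

```python
samples, maxlen_answers, vocab_size = 564, 74, 1894
bytes_per_element = 4  # assumption: float32 one-hot values, int32 token ids

onehot_bytes = samples * maxlen_answers * vocab_size * bytes_per_element
integer_bytes = samples * maxlen_answers * bytes_per_element

print('one-hot targets : {:.1f} MB'.format(onehot_bytes / 1e6))
print('integer targets : {:.3f} MB'.format(integer_bytes / 1e6))
```

The one-hot array is larger by a factor equal to the vocabulary size. This is manageable for this dataset; for larger vocabularies, a common alternative (not used in this notebook) is to keep integer targets and train with the `sparse_categorical_crossentropy` loss.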


## Defining Encoder-Decoder model

The model has Embedding, LSTM and Dense layers. The configuration is as follows:
+ 2 Input layers: one for encoder_input_data and one for decoder_input_data.
+ Embedding layer: converts integer tokens to fixed-size dense vectors (with the mask_zero=True argument, so that padding is ignored).
+ LSTM layer: provides access to Long Short-Term Memory cells.
+ Dense layer: maps each decoder LSTM output to a softmax distribution over the vocabulary.

Working:
+ encoder_input_data goes into the Embedding layer (encoder_embedding).
+ The output of the Embedding layer goes to the LSTM cell, which produces 2 state vectors (h and c, which together form encoder_states). These states are used as the initial state of the decoder's LSTM cell.
+ decoder_input_data comes in through its own Embedding layer.
+ The embeddings go into the decoder LSTM cell (initialised with the encoder states) to produce sequences, which the Dense layer turns into word probabilities.
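
To make the two state vectors concrete, here is a single LSTM step written in NumPy and run over a short random sequence. This is an illustrative sketch with untrained random weights and toy sizes. It ignores masking and is not how Keras implements the layer internally; `lstm_step` and all sizes are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; gates are stacked as input, forget, candidate, output."""
    units = h_prev.shape[-1]
    z = x @ W + h_prev @ U + b                # shape (batch, 4 * units)
    i = sigmoid(z[:, 0 * units:1 * units])    # input gate
    f = sigmoid(z[:, 1 * units:2 * units])    # forget gate
    g = np.tanh(z[:, 2 * units:3 * units])    # candidate cell values
    o = sigmoid(z[:, 3 * units:4 * units])    # output gate
    c = f * c_prev + i * g                    # new cell state
    h = o * np.tanh(c)                        # new hidden state
    return h, c


rng = np.random.default_rng(0)
embed_dim, units, steps = 8, 4, 5             # toy sizes; the notebook uses 200
W = rng.normal(scale=0.1, size=(embed_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

# "Encoder": run over a random embedded question and keep only the final states
h = np.zeros((1, units))
c = np.zeros((1, units))
for x_t in rng.normal(size=(steps, 1, embed_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

encoder_states = (h, c)                       # what the decoder would start from
print(h.shape, c.shape)
```

Only the final `h` and `c` leave the encoder; the per-step outputs are discarded, which is why the encoder LSTM is created without `return_sequences`.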

In [5]:
encoder_inputs = tf.keras.layers.Input(shape=(None,))
encoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True) (encoder_inputs)
encoder_outputs, state_h, state_c = tf.keras.layers.LSTM(200, return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]

decoder_inputs = tf.keras.layers.Input(shape=(None, ))
decoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True) (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( 200 , return_state=True , return_sequences=True)
decoder_outputs , _ , _ = decoder_lstm (decoder_embedding, initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(VOCAB_SIZE, activation=tf.keras.activations.softmax) 
output = decoder_dense (decoder_outputs)

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output)
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')

model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 200)    378800      input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 200)    378800      input_2[0][0]                    
______________________________________________________________________________________________
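
The 378800 parameters reported for each Embedding layer can be reproduced by hand from the vocabulary size and the embedding width. The LSTM and Dense counts below are derived from the standard formulas for those layers (with biases enabled) and are not read from the summary above.

```python
vocab_size, embed_dim, units = 1894, 200, 200

# Embedding: one vector of length embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (embed_dim * units + units * units + units)

# Dense: one weight per (unit, word) pair plus one bias per word
dense_params = units * vocab_size + vocab_size

print(embedding_params, lstm_params, dense_params)
```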

## Training the model

The model is trained for 75 epochs with the RMSprop optimiser and the categorical_crossentropy loss function.

In [6]:
model.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size=50, epochs=75) 
model.save('model.h5') 

Train on 564 samples
Epoch 1/75
...
Epoch 75/75


## Defining inference models

Two inference models are created to predict answers.

Encoder inference model: takes the question as input and outputs the LSTM states (h and c).

Decoder inference model: takes 2 inputs: the LSTM states (initially the output of the encoder model) and the answer input sequence, fed one token at a time and beginning with the `start` token. It outputs the predicted next word for the question fed to the encoder model, along with its updated state values.

In [7]:
def make_inference_models():
    # Encoder: question tokens -> final LSTM states (h, c)
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)

    # The decoder receives its previous states as explicit inputs at inference time
    decoder_state_input_h = tf.keras.layers.Input(shape=(200,))
    decoder_state_input_c = tf.keras.layers.Input(shape=(200,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

    # Reuse the trained decoder layers so the inference model shares their weights
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_embedding, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = tf.keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)

    return encoder_model, decoder_model

## Talking with the Chatbot

A function str_to_tokens is defined, which converts string questions to padded integer tokens.

In [8]:
def str_to_tokens(sentence: str):
    # Reuse the fitted tokenizer so that lowercasing and punctuation filtering
    # match training; words outside the vocabulary are dropped instead of
    # raising a KeyError.
    tokens_list = tokenizer.texts_to_sequences([sentence])
    return preprocessing.sequence.pad_sequences(tokens_list, maxlen=maxlen_questions, padding='post')
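
User input may contain punctuation or words that never appeared in the training data, so a plain dictionary lookup can fail. The sketch below shows one way to handle this with a toy vocabulary. `to_token_ids` and `toy_word_index` are hypothetical, and the helper is a simplified stand-in for the Keras tokenizer and padding utilities.

```python
import string


def to_token_ids(sentence, word_index, maxlen):
    """Map a sentence to padded token ids, dropping out-of-vocabulary words."""
    # Lowercase and strip surrounding punctuation from every word
    words = [w.strip(string.punctuation) for w in sentence.lower().split()]
    ids = [word_index[w] for w in words if w in word_index]
    ids = ids[:maxlen]
    return ids + [0] * (maxlen - len(ids))    # 0 is the padding id


toy_word_index = {'what': 1, 'is': 2, 'ai': 3}   # hypothetical vocabulary
print(to_token_ids('What is AI?', toy_word_index, maxlen=5))
print(to_token_ids('What is blockchain?', toy_word_index, maxlen=5))
```

Dropping unknown words keeps the chatbot responsive, at the cost of silently ignoring part of the question.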

1. A question is taken as input and the state values are predicted using enc_model.
2. The state values are passed to the decoder's LSTM as its initial state.
3. A target sequence of length 1 is created, containing only the `start` token.
4. This sequence is fed into dec_model together with the state values.
5. The `start` token is replaced with the token predicted by dec_model, and the state values are updated.
6. Steps 4 and 5 are repeated until the `end` token is predicted or the maximum answer length is reached.
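
The loop described in these steps is a greedy decoder: at every step it keeps only the most probable token. The sketch below isolates that control flow from Keras. `greedy_decode` and `toy_step` are hypothetical, with `toy_step` standing in for a call to the decoder model.

```python
def greedy_decode(step_fn, initial_state, start_id, end_id, max_len):
    """Feed each predicted token back in until `end_id` or `max_len` tokens."""
    token, state, decoded = start_id, initial_state, []
    while len(decoded) < max_len:
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == end_id:
            break
        decoded.append(token)
    return decoded


# Hypothetical stand-in for the decoder: always scores (token + 1) highest
def toy_step(token, state):
    vocab = 6
    scores = [0.0] * vocab
    scores[(token + 1) % vocab] = 1.0
    return scores, state + 1   # the "state" here just counts the steps taken


START, END = 1, 5
full_answer = greedy_decode(toy_step, 0, START, END, max_len=10)
cut_answer = greedy_decode(toy_step, 0, START, END, max_len=2)
print(full_answer, cut_answer)
```

The length limit matters as much as the end token: without it, a model that never predicts the end token would loop indefinitely.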

In [11]:
enc_model, dec_model = make_inference_models()

for _ in range(10):
    # Encode the question into the initial decoder states
    states_values = enc_model.predict(str_to_tokens(input('Enter question : ')))
    # Start decoding from the start token
    empty_target_seq = np.zeros((1, 1))
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition:
        dec_outputs, h, c = dec_model.predict([empty_target_seq] + states_values)
        # Greedy choice: take the most probable next word
        sampled_word_index = int(np.argmax(dec_outputs[0, -1, :]))
        # index_word is the reverse of word_index, so no linear search is needed
        sampled_word = tokenizer.index_word.get(sampled_word_index)
        if sampled_word is not None and sampled_word != 'end':
            decoded_translation += ' {}'.format(sampled_word)

        if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
            stop_condition = True

        # Feed the predicted word and the updated states back into the decoder
        empty_target_seq = np.zeros((1, 1))
        empty_target_seq[0, 0] = sampled_word_index
        states_values = [h, c]

    print(decoded_translation)

Enter question : hello
 hi
Enter question : what are you doing
 i am a human emotion i am not yet studied of the ability to get
Enter question : what human emotion are you
 i am a chat robot business
Enter question : what business
 i wouldn't recommend buying on the rings
Enter question : where are the rings
 my favorite subject is chemistry
Enter question : what do you do with chemistry
 i am interested in the chat robot
Enter question : are you a chat robot
 i have a lot of my social bot i am not yet way of you
Enter question : you are social then
 i could use a copy of my artificial states
Enter question : what are your artificial states
 i am interested in the computer i am so i am effectively deathless
Enter question : that is good for you
 a chat robot is a chat bot i am a original chat of a computer it is a computer to try to control a computer to control it
