# Chatbot using Seq2Seq LSTM models
In this notebook, we will assemble a seq2seq LSTM model using the Keras Functional API to create a working chatbot that answers the questions put to it.

Chatbots have become applications in their own right. You can pick a field or domain and gather question-and-answer data for it. For example, we could build a chatbot for an e-commerce website, or for a school website where parents can get information about the school.


Messaging platforms like Allo have implemented chatbot services to engage users. The famous [Google Assistant](https://assistant.google.com/), [Siri](https://www.apple.com/in/siri/), [Cortana](https://www.microsoft.com/en-in/windows/cortana) and [Alexa](https://www.alexa.com/) may have been built using similar models.

So, let's start building our Chatbot.


## 1) Importing the packages

We will import [TensorFlow](https://www.tensorflow.org) and our beloved [Keras](https://www.tensorflow.org/guide/keras). Also, we import other modules which help in defining model layers.






In [1]:
import numpy as np
import tensorflow as tf
import pickle
from tensorflow.keras import layers , activations , models , preprocessing
import pandas as pd

## 2) Preprocessing the data

### A) Download the data

The dataset comes from [chatterbot/english on Kaggle](https://www.kaggle.com/kausr25/chatterbotenglish) by [kausr25](https://www.kaggle.com/kausr25). It contains pairs of questions and answers on a number of subjects such as food, history and AI.

The raw data can be found in this repo: https://github.com/shubham0204/Dataset_Archives


In [None]:

# !wget https://github.com/shubham0204/Dataset_Archives/blob/master/chatbot_nlp.zip?raw=true -O chatbot_nlp.zip
# !unzip chatbot_nlp.zip


### B) Reading the data from the files

We parse each of the `.yaml` files.

*   Concatenate two or more sentences if the answer has two or more of them.
*   Remove unwanted data types which are produced while parsing the data.
*   Append `<START>` and `<END>` to all the `answers`.
*   Create a `Tokenizer` and load the whole vocabulary ( `questions` + `answers` ) into it.
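
One detail worth checking here: with its default settings, the Keras `Tokenizer` lowercases text and strips punctuation, including `<` and `>`, so the `<START>` and `<END>` markers end up in the vocabulary as `start` and `end`. The sketch below imitates that normalisation in plain Python. `normalize` is a hypothetical helper, and the filter string is assumed to match the Keras default.

```python
# Assumed to mirror the default `filters` argument of the Keras Tokenizer.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def normalize(text, filters=DEFAULT_FILTERS):
    """Lowercase the text, turn every filtered character into a space, then split."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

print(normalize('<START> Two <END>'))    # ['start', 'two', 'end']
print(normalize('It means orange (s)'))  # ['it', 'means', 'orange', 's']
```

This is why the inference code later looks up `tokenizer.word_index['start']` rather than `'<START>'`.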





In [2]:
from tensorflow.keras import preprocessing , utils
import os
import yaml


In [3]:
questions = list()
answers = list()

In [4]:
data_from_excel = pd.read_csv('English Dataset/ratings.csv' , encoding ='unicode_escape')#,skiprows = 2)
data_from_excel.head(20)

| Unnamed: 0 | Topic | Definition of the topic | Rate | Rating |
|---|---|---|---|---|
| 0 | cup | cup (s) | 0 | Zero |
| 1 | cup | bottle | 0 | Zero |
| 2 | cup | We calibrate with it | 1 | One |
| 3 | cup | It is placed with a cup of coffee | 1 | One |
| 4 | cup | We drink in it | 2 | Two |
| 5 | cup | A cup to drink from | 2 | Two |
| 6 | cup | We put water/juice/tea in it and drink it | 2 | Two |
| 7 | cup | Something made of glass that we drink from | 2 | Two |
| 8 | cup | Something to put water, juice, tea in | 2 | Two |
| 9 | Veil | Colorful | 0 | Zero |


In [7]:
for i in range(len(data_from_excel['Definition of the topic'])):
    print(data_from_excel['Definition of the topic'][i])
    questions.append(data_from_excel['Definition of the topic'][i].lower())
for i in range(len(data_from_excel['Rating'])):
    answers.append( '<START> ' + data_from_excel['Rating'][i] + ' <END>' )
print("Now : " , len(questions))

cup (s)
bottle
 We calibrate with it
 It is placed with a cup of coffee
We drink in it
A cup to drink from
We put water/juice/tea in it and drink it
Something made of glass that we drink from
Something to put water, juice, tea in
Colorful
short
At his death
My mother's
Made of silk and cotton
We wear it and pray/go out with it
We wrap it up and wear it (S)
Women wear it
wear (s)
We wear it and go out with it
Women wear it on their heads when they go out on the street
Something for a girl to cover her hair with
A covering that a girl wears on her head when praying and going out
 scarf that we wear on our heads
head cover
A Muslim girl puts it on her head and it is a head covering
Hair cover
 It means orange (s)
Round
juice
Tasty
Yellow and green
We eat it
Type of fruit (S)
We eat it and color it
orange
fruit (s)
We eat it (S)
We peel it and eat it
Round fruit and we eat it
Fruit (rounded winter/orange in color/pinkle)
A winter fruit that we eat and make juice from
He runs and drinks
Flo

In [8]:
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( questions + answers )
VOCAB_SIZE = len( tokenizer.word_index )+1
print( 'VOCAB SIZE : {}'.format( VOCAB_SIZE ))

VOCAB SIZE : 740



### C) Preparing data for Seq2Seq model

Our model requires three arrays: `encoder_input_data`, `decoder_input_data` and `decoder_output_data`.

For `encoder_input_data` :
* Tokenize the `questions`. Pad them to their maximum length.

For `decoder_input_data` :
* Tokenize the `answers`. Pad them to their maximum length.

For `decoder_output_data` :

* Tokenize the `answers`. Remove the first element from all the `tokenized_answers`. This is the `<START>` element which we added earlier.
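
To make the relationship between the two decoder arrays concrete, here is a small plain-Python sketch with made-up token ids (the ids and the `pad_post` helper are illustrative assumptions, not values from the real vocabulary). The target is the input shifted left by one step, so at each position the decoder learns to predict the next token.

```python
def pad_post(seq, maxlen, value=0):
    """Right-pad a list of token ids with `value` up to `maxlen` entries."""
    return seq + [value] * (maxlen - len(seq))

# Hypothetical ids for the answer '<START> two <END>'
START, TWO, END = 1, 7, 2
answer = [START, TWO, END]
maxlen_answers = len(answer)

decoder_input = pad_post(answer, maxlen_answers)       # [1, 7, 2]
decoder_target = pad_post(answer[1:], maxlen_answers)  # [7, 2, 0]

for step, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(f'step {step}: input token {seen} -> target token {expected}')
```

In the notebook the target is additionally one-hot encoded, which is where the last dimension of size `VOCAB_SIZE` comes from.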



In [9]:
# encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences( questions )
maxlen_questions = max( [ len(x) for x in tokenized_questions ] )
padded_questions = preprocessing.sequence.pad_sequences( tokenized_questions , maxlen=maxlen_questions , padding='post' )
encoder_input_data = np.array( padded_questions )
print( encoder_input_data.shape , maxlen_questions )

# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
maxlen_answers = max( [ len(x) for x in tokenized_answers ] )
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
decoder_input_data = np.array( padded_answers )
print( decoder_input_data.shape , maxlen_answers )

# decoder_output_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
for i in range(len(tokenized_answers)) :
    tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
onehot_answers = utils.to_categorical( padded_answers , VOCAB_SIZE )
decoder_output_data = np.array( onehot_answers )
print( decoder_output_data.shape )


(480, 29) 29
(480, 3) 3
(480, 3, 740)


## 3) Defining the Encoder-Decoder model
The model will have Embedding, LSTM and Dense layers. The basic configuration is as follows.


*   2 Input Layers : One for `encoder_input_data` and another for `decoder_input_data`.
*   Embedding layer : For converting token vectors to fixed-size dense vectors. **( Note :  Don't forget the `mask_zero=True` argument here )**
*   LSTM layer : Provides access to Long Short-Term Memory cells.

Working :

1.   The `encoder_input_data` goes into the Embedding layer (  `encoder_embedding` ).
2.   The output of the Embedding layer goes to the LSTM cell, which produces 2 state vectors ( `h` and `c`, which together are the `encoder_states` ).
3.   These states are set as the initial state of the decoder's LSTM cell.
4.   The `decoder_input_data` comes in through its own Embedding layer.
5.   The embeddings go into the decoder's LSTM cell ( which holds the encoder states ) to produce sequences.



<center><img style="float: center;" src="https://cdn-images-1.medium.com/max/1600/1*bnRvZDDapHF8Gk8soACtCQ.gif"></center>


Image credits to [Hackernoon](https://hackernoon.com/tutorial-3-what-is-seq2seq-for-text-summarization-and-why-68ebaa644db0).










In [10]:

encoder_inputs = tf.keras.layers.Input(shape=( maxlen_questions , ))
encoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200 , mask_zero=True ) (encoder_inputs)
encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( 200 , return_state=True )( encoder_embedding )
encoder_states = [ state_h , state_c ]

decoder_inputs = tf.keras.layers.Input(shape=( maxlen_answers ,  ))
decoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200 , mask_zero=True) (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( 200 , return_state=True , return_sequences=True )
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
decoder_dense = tf.keras.layers.Dense( VOCAB_SIZE , activation=tf.keras.activations.softmax )
output = decoder_dense ( decoder_outputs )
print(output)
model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')

model.summary()


KerasTensor(type_spec=TensorSpec(shape=(None, 3, 740), dtype=tf.float32, name=None), name='dense/Softmax:0', description="created by layer 'dense'")
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 29)]         0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 3)]          0           []                               
                                                                                                  
 embedding (Embedding)          (None, 29, 200)      148000      ['input_1[0][0]']                
                                                                                                  
 embedding_1 (Embedding)        (None, 3, 20
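
The one parameter count that survives in the truncated summary above, 148,000 for `embedding`, can be reproduced by hand. The sketch below applies the standard parameter formulas for embedding, LSTM and dense layers to the sizes used in this notebook. The LSTM and dense counts are computed here rather than read from the output, so treat them as an expectation to check against your own `model.summary()`.

```python
VOCAB_SIZE = 740   # from the tokenizer output above
EMBED_DIM = 200    # embedding width used in the model
UNITS = 200        # LSTM units used in the model

# Embedding: one EMBED_DIM-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBED_DIM

# LSTM: four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (UNITS * (EMBED_DIM + UNITS) + UNITS)

# Dense softmax layer: weights from every LSTM unit to every word, plus biases.
dense_params = UNITS * VOCAB_SIZE + VOCAB_SIZE

print(embedding_params)  # 148000
print(lstm_params)       # 320800
print(dense_params)      # 148740
```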

## 4) Training the model
We train the model for a number of epochs with `RMSprop` optimizer and `categorical_crossentropy` loss function.
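
For intuition about the loss: categorical cross-entropy for a single timestep is the negative log of the probability that the softmax assigned to the correct word. Below is a tiny plain-Python illustration with a made-up four-word distribution (the numbers are assumptions chosen for the example, not values from this model).

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs) if t > 0)

target = [0, 0, 1, 0]                  # the correct word is index 2
confident = [0.05, 0.05, 0.85, 0.05]   # most probability on the correct word
unsure = [0.25, 0.25, 0.25, 0.25]      # uniform guess

print(categorical_crossentropy(target, confident))  # about 0.16
print(categorical_crossentropy(target, unsure))     # about 1.39
```

The loss shrinks as the model puts more probability on the right word, which is what training pushes towards.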

In [11]:
model.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size=6, epochs=280 )
model.save( 'model.h5' )

Epoch 1/280
...
Epoch 270/280
...

## 5) Defining inference models
We create inference models which help in predicting answers.

**Encoder inference model** : Takes the question as input and outputs the LSTM states ( `h` and `c` ).

**Decoder inference model** : Takes two inputs: the LSTM states ( the output of the encoder model ) and the answer input sequences ( the ones not having the `<start>` tag ). It outputs the answer to the question we fed to the encoder model, along with its updated state values.

In [12]:
def make_inference_models():
    
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)

    decoder_state_input_h = tf.keras.layers.Input(shape=( 200 ,))
    decoder_state_input_c = tf.keras.layers.Input(shape=( 200 ,))

    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_embedding , initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = tf.keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)

    return encoder_model , decoder_model


## 6) Talking with our Chatbot

First, we define a function `str_to_tokens` which converts `str` questions to padded sequences of integer tokens.


In [13]:
def str_to_tokens( sentence : str ):
    words = sentence.lower().split()
    tokens_list = list()
    for word in words:
        tokens_list.append( tokenizer.word_index[ word ] )
    return preprocessing.sequence.pad_sequences( [tokens_list] , maxlen=maxlen_questions , padding='post')





1.   First, we take a question as input and predict the state values using `enc_model`.
2.   We set the state values in the decoder's LSTM.
3.   Then, we generate a sequence which contains the `<start>` element.
4.   We input this sequence in the `dec_model`.
5.   We replace the `<start>` element with the element which was predicted by the `dec_model` and update the state values.
6.   We carry out the above steps iteratively till we hit the `<end>` tag or the maximum answer length.
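
The loop described above is greedy decoding. The sketch below shows its control flow in plain Python with a stand-in for the decoder: `toy_decoder_step`, the three-word vocabulary and the score table are all hypothetical, chosen only so the example runs without a trained model. It also builds a reverse `index_word` mapping once, instead of scanning `word_index` on every step.

```python
word_index = {'start': 1, 'end': 2, 'two': 3}        # hypothetical vocabulary
index_word = {i: w for w, i in word_index.items()}   # reverse lookup, built once

def toy_decoder_step(token, state):
    """Stand-in for dec_model.predict: returns scores over ids 0..3 and a new state."""
    scores_after = {1: [0.0, 0.0, 0.1, 0.9],   # after 'start', prefer 'two'
                    3: [0.0, 0.0, 0.8, 0.2]}   # after 'two', prefer 'end'
    return scores_after[token], state + 1

def greedy_decode(initial_state, max_len=3):
    """Feed the previous prediction back in until 'end' or max_len is reached."""
    token, state, words = word_index['start'], initial_state, []
    while True:
        scores, state = toy_decoder_step(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        word = index_word[token]
        if word == 'end' or len(words) >= max_len:
            break
        words.append(word)
    return ' '.join(words)

print(greedy_decode(initial_state=0))  # two
```

Unlike the notebook loop, this version leaves the `end` marker out of the returned string.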







In [14]:
enc_model , dec_model = make_inference_models()

In [15]:

enc_model.save('enc_model.h5')
dec_model.save('dec_model.h5')



In [16]:
for _ in range(10):
    states_values = enc_model.predict( str_to_tokens(input( 'Enter question : ') ) )
    empty_target_seq = np.zeros( ( 1 , 1 ) )
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition :
        dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
        sampled_word_index = np.argmax( dec_outputs[0, -1, :] )
        sampled_word = None
        for word , index in tokenizer.word_index.items() :
            if sampled_word_index == index :
                decoded_translation += ' {}'.format( word )
                sampled_word = word

        if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
            stop_condition = True

        empty_target_seq = np.zeros( ( 1 , 1 ) )
        empty_target_seq[ 0 , 0 ] = sampled_word_index
        states_values = [ h , c ]

    print( decoded_translation )

Enter question : We calibrate with it
 one end
Enter question : A covering that a girl wears on her head when praying and going out
 two end
Enter question : end
 one end
Enter question :  It means orange (s)


KeyError: '(s)'
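
The `KeyError` comes from `str_to_tokens`, which looks up each whitespace-separated word directly in `tokenizer.word_index`. The tokenizer stripped punctuation when it built the vocabulary, so `(s)` was stored as `s`, and any word never seen in training is missing altogether. One option is to pass the sentence through `tokenizer.texts_to_sequences`, which applies the same filtering and, when no `oov_token` is set, drops unknown words. The plain-Python sketch below shows the same idea with a hypothetical `safe_tokens` helper and a made-up `word_index`.

```python
import string

def safe_tokens(sentence, word_index):
    """Map words to ids, stripping surrounding punctuation and skipping unknown words."""
    tokens = []
    for raw in sentence.lower().split():
        word = raw.strip(string.punctuation)
        if word in word_index:          # unknown words are ignored, not an error
            tokens.append(word_index[word])
    return tokens

word_index = {'it': 4, 'means': 9, 'orange': 12, 's': 5}   # hypothetical ids
print(safe_tokens(' It means orange (s)', word_index))      # [4, 9, 12, 5]
print(safe_tokens('It means purple', word_index))           # [4, 9]
```

Silently dropping unknown words is a design choice: the chatbot keeps running, but a question made up entirely of unseen words becomes an all-padding input.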