# **Chatbot using Seq2Seq LSTM models**

This project is to create conversational chatbot using Sequence to sequence LSTM models. 
Sequence to sequence learning is about training models to convert from one domain to sequences another domain. 

# Step 1: Import all the packages 

In [18]:
import numpy as np 
import tensorflow as tf
import pickle
from tensorflow.keras import layers, activations, models, preprocessing

# Step 2: Download all the data from kaggle

In [2]:
!pip install kaggle 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"anunarayanan","key":"49f269a7e831b68d0902c118acea274c"}'}

In [None]:
!mkdir -p ~/.kaggle

In [None]:
!cp kaggle.json ~/.kaggle/

In [None]:
!ls ~/.kaggle

kaggle.json


In [None]:
!chmod 600 /root/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d kausr25/chatterbotenglish

Downloading chatterbotenglish.zip to /content
  0% 0.00/23.2k [00:00<?, ?B/s]
100% 23.2k/23.2k [00:00<00:00, 17.7MB/s]


In [19]:
!unzip /content/chatterbotenglish.zip

unzip:  cannot find or open /content/chatterbotenglish.zip, /content/chatterbotenglish.zip.zip or /content/chatterbotenglish.zip.ZIP.


In [20]:
#!wget https://github.com/shubham0204/Dataset_Archives/blob/master/chatbot_nlp.zip?raw=true -O chatbot_nlp.zip
!unzip chatbot_nlp.zip

Archive:  chatbot_nlp.zip
   creating: chatbot_nlp/data/
  inflating: chatbot_nlp/data/ai.yml  


# Step 3: Preprocessing the data

### a) Reading the data from the files
We parse each of the .yaml files.

1. Concatenate two or more sentences if the answer has two or more of them.
2. Remove unwanted data types which are produced while parsing the data.
3. Append <START> and <END> to all the answers.
4. Create a Tokenizer and load the whole vocabulary ( questions + answers ) into it.

In [21]:
from tensorflow.keras import preprocessing, utils
import os
import yaml

The dataset contains .yml files which have pairs of different questions and their answers on varied subjects like history, bot profile, science etc.
We can easily read them as folows:

In [22]:
dir_path = '/content/chatbot_nlp/data'
files_list = os.listdir(dir_path + os.sep)

In [52]:
questions = list()
answers = list()

for filepath in files_list:
    stream = open( dir_path + os.sep + filepath , 'rb')
    docs = yaml.safe_load(stream)
    conversations = docs['conversations']
    for con in conversations:
        if len( con ) > 2 :
            questions.append(con[0])
            replies = con[ 1 : ]
            ans = ''
            for rep in replies:
                ans += ' ' + rep
            answers.append( ans )
        elif len( con )> 1:
            questions.append(con[0])
            answers.append(con[1])

answers_with_tags = list()
for i in range( len( answers ) ):
    if type( answers[i] ) == str:
        answers_with_tags.append( answers[i] )
    else:
        questions.pop( i )

answers = list()
for i in range( len( answers_with_tags ) ) :
    answers.append( '<START> ' + answers_with_tags[i] + ' <END>' )

tokenizer = preprocessing.text.Tokenizer(oov_token=1)
tokenizer.fit_on_texts( questions + answers )
VOCAB_SIZE = len( tokenizer.word_index )+1
print( 'VOCAB SIZE : {}'.format( VOCAB_SIZE ))

VOCAB SIZE : 495


### b) Preparing data for Seq2Seq model

This model requires 3 arrays encoder_input_data, decoder_input_data and decoder_output_data.

For encoder_input_data:
Tokenize the Questions and Pad them to their maximum Length.

For decoder_input_data:
Tokenize the Answers and Pad them to their maximum Length.

For decoder_output_data:
Tokenize the Answers and Remove the 1st element from all the tokenized_answers. This is the <START> element which was added earlier.

In [53]:
from gensim.models import Word2Vec
import re

In [54]:
vocab = []
for word in tokenizer.word_index:
  vocab.append(word)

def tokenize(sentences):
  tokens_list = []
  vocabulary = []
  for sentence in sentences:
    sentence = sentence.lower()
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    tokens = sentence.split()
    vocabulary += tokens
    tokens_list.append(tokens)
  return tokens_list, vocabulary

In [55]:
#encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences( questions )
maxlen_questions = max( [len(x) for x in tokenized_questions ] )
padded_questions = preprocessing.sequence.pad_sequences( tokenized_questions, maxlen = maxlen_questions, padding = 'post')
encoder_input_data = np.array(padded_questions)
print(encoder_input_data.shape, maxlen_questions)

(45, 7) 7


In [56]:
# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
maxlen_answers = max( [ len(x) for x in tokenized_answers ] )
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
decoder_input_data = np.array( padded_answers )
print( decoder_input_data.shape , maxlen_answers )

(45, 64) 64


In [57]:
# decoder_output_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
for i in range(len(tokenized_answers)) :
    tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
onehot_answers = utils.to_categorical( padded_answers , VOCAB_SIZE )
decoder_output_data = np.array( onehot_answers )
print( decoder_output_data.shape )

(45, 64, 495)


# Step 4: Defining Encoder Decoder Model





In [81]:
##encoder_inputs = tf.keras.layers.Input(shape=( maxlen_questions , ))
#encoder_inputs = tf.keras.layers.Input(shape=( None , ))
#encoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200 , mask_zero=True ) (encoder_inputs)
##encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( 200 , return_state=True )( encoder_embedding )
#_, state_h , state_c = tf.keras.layers.LSTM( 200 , return_state=True )( encoder_embedding )
#encoder_states = [ state_h , state_c ]

#decoder_inputs = tf.keras.layers.Input(shape=( None ,  ))
##decoder_inputs = tf.keras.layers.Input(shape=( maxlen_answers ,  ))
#decoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200 , mask_zero=True) (decoder_inputs)
#decoder_lstm = tf.keras.layers.LSTM( 200 , return_state=True , return_sequences=True )
#decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
#decoder_dense = tf.keras.layers.Dense( VOCAB_SIZE , activation=tf.keras.activations.softmax ) 
#output = decoder_dense ( decoder_outputs )

#model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )
#model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')
##model.compile(optimizer=tf.keras.optimizers.Adam(), loss='categorical_crossentropy')

#model.summary()

In [102]:
encoder_inputs = tf.keras.layers.Input(shape=( maxlen_questions , ))
encoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 128 , mask_zero=True ) (encoder_inputs)
encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( 128 , dropout=0.2, recurrent_dropout=0.2, return_state=True )( encoder_embedding )
encoder_states = [ state_h , state_c ]

decoder_inputs = tf.keras.layers.Input(shape=( maxlen_answers ,  ))
decoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 128 , mask_zero=True) (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( 128 , return_state=True , return_sequences=True )
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
decoder_dense = tf.keras.layers.Dense( VOCAB_SIZE , activation=tf.keras.activations.softmax ) 
output = decoder_dense ( decoder_outputs )

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )
#model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='categorical_crossentropy')

model.summary()

Model: "model_31"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_43 (InputLayer)          [(None, 7)]          0           []                               
                                                                                                  
 input_44 (InputLayer)          [(None, 64)]         0           []                               
                                                                                                  
 embedding_18 (Embedding)       (None, 7, 128)       63360       ['input_43[0][0]']               
                                                                                                  
 embedding_19 (Embedding)       (None, 64, 128)      63360       ['input_44[0][0]']               
                                                                                           

# Step 5: Training the Model

We train the model for a number of epochs with RMSprop optimizer and categorical_crossentropy loss function.

In [103]:
model.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size=100, epochs=300 ) 
model.save( 'model.h6' )

Epoch 1/300
Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300
Epoch 7/300
Epoch 8/300
Epoch 9/300
Epoch 10/300
Epoch 11/300
Epoch 12/300
Epoch 13/300
Epoch 14/300
Epoch 15/300
Epoch 16/300
Epoch 17/300
Epoch 18/300
Epoch 19/300
Epoch 20/300
Epoch 21/300
Epoch 22/300
Epoch 23/300
Epoch 24/300
Epoch 25/300
Epoch 26/300
Epoch 27/300
Epoch 28/300
Epoch 29/300
Epoch 30/300
Epoch 31/300
Epoch 32/300
Epoch 33/300
Epoch 34/300
Epoch 35/300
Epoch 36/300
Epoch 37/300
Epoch 38/300
Epoch 39/300
Epoch 40/300
Epoch 41/300
Epoch 42/300
Epoch 43/300
Epoch 44/300
Epoch 45/300
Epoch 46/300
Epoch 47/300
Epoch 48/300
Epoch 49/300
Epoch 50/300
Epoch 51/300
Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300
Epoch 62/300
Epoch 63/300
Epoch 64/300
Epoch 65/300
Epoch 66/300
Epoch 67/300
Epoch 68/300
Epoch 69/300
Epoch 70/300
Epoch 71/300
Epoch 72/300
Epoch 73/300
Epoch 74/300
Epoch 75/300
Epoch 76/300
Epoch 77/300
Epoch 78



# Step 6: Defining Inference Models

Encoder Inference Model: Takes questions as input and outputs LSTM states (h and c)

Decoder Inference Model: Takes in 2 inputs one are the LSTM states, second are the answer input sequences. it will o/p the answers for questions which fed to the encoder model and it's state values.

In [107]:
from keras import Input, Model
from keras.layers import Embedding, LSTM, Dense


In [108]:
def make_inference_models():
    decoder_state_input_h = Input(shape=(128,))
    decoder_state_input_c = Input(shape=(128,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedding,
                                             initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model(
        inputs=[decoder_inputs] + decoder_states_inputs,
        outputs=[decoder_outputs] + decoder_states)
    #print('Inference decoder:')
    #decoder_model.summary()
    #print('Inference encoder:')
    encoder_model = Model(inputs=encoder_inputs, outputs=encoder_states)
    #encoder_model.summary()
    return encoder_model, decoder_model

# Step 7: Talking with the Chatbot

define a method str_to_tokens which converts str questions to Integer tokens with padding.

1. First, we take a question as input and predict the state values using enc_model.
2. We set the state values in the decoder's LSTM.
3. Then, we generate a sequence which contains the <start> element.
4. We input this sequence in the dec_model.
5. We replace the <start> element with the element which was predicted by the dec_model and update the state values.
6. We carry out the above steps iteratively till we hit the <end> tag or the maximum answer length.



In [90]:
#def str_to_tokens( sentence : str ):

#    words = sentence.lower().split()
#    tokens_list = list()
  
#    for word in words:
#        tokens_list.append( tokenizer.word_index[ word ] ) 
#    return preprocessing.sequence.pad_sequences( [tokens_list] , maxlen=maxlen_questions , padding='post')


In [109]:
from keras_preprocessing.sequence import pad_sequences


In [110]:
def str_to_tokens(sentence: str):
    words = sentence.lower().split()
    tokens_list = list()
    for current_word in words:
        result = tokenizer.word_index.get(current_word, '')
        if result != '':
            tokens_list.append(result)
    return pad_sequences([tokens_list],
                         maxlen=maxlen_questions,
                         padding='post')

In [111]:
enc_model , dec_model = make_inference_models()

for _ in range(100):
    states_values = enc_model.predict( str_to_tokens( input( 'Enter question : ' ) ) )
    empty_target_seq = np.zeros( ( 1 , 1 ) )
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition :
        dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
        sampled_word_index = np.argmax( dec_outputs[0, -1, :] )
        sampled_word = None
        for word , index in tokenizer.word_index.items() :
            if sampled_word_index == index :
                decoded_translation += ' {}'.format( word )
                sampled_word = word
        
        if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
            stop_condition = True
            
        empty_target_seq = np.zeros( ( 1 , 1 ) )  
        empty_target_seq[ 0 , 0 ] = sampled_word_index
        states_values = [ h , c ] 

    print( decoded_translation )

Enter question : What is Machine Learning?




 a genetic algorithm ga is a search heuristic that mimics the process of pattern recognition and computational learning and inverse problems refers to the predictions on the distribution of the minimum number of substitutions required to change one string into the distribution of the minimum number of uncertainty associated with the distribution of the minimum number of uncertainty associated with the distribution of the minimum
Enter question : genetic algorithm?
 in classification and data end
Enter question : classification?
 support vector machines svm are similar according to learn a similarity function or data end
Enter question : SVM?
 support vector machines svm are similar according to learn a similarity function or data end
Enter question : Classification?
 support vector machines svm are similar according to learn a similarity function or data end
Enter question : What is regression?
 in mathematics the euclidean distance or euclidean metric is the “ordinary” i e straight li

KeyboardInterrupt: ignored