# Building a Chatbot with Deep Learning

![image](https://user-images.githubusercontent.com/35156624/126909072-47c9be9e-549c-420f-ac4b-f9bbd2a4de22.png)


In [25]:
import numpy as np
import tensorflow as tf
import re 
import time 

## We need to import the dataset for data preprocessing

In [26]:
with open('movie_lines.txt', encoding='utf-8', errors='ignore') as f:
    movie_lines = f.read().split('\n')
with open('movie_conversations.txt', encoding='utf-8', errors='ignore') as f:
    conversations = f.read().split('\n')

In [27]:
print()
print("Raw movie lines:")
print()
movie_lines[1:5]


Raw movie lines:



['L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!',
 'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.',
 'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?',
 "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go."]

In [28]:
print()
print("Raw Conversations:")
print()
conversations[1:5]


Raw Conversations:



["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']"]

## Create a dictionary that maps each line id to its text

In [29]:
id_2_movieline = {}
for line in movie_lines:
    _line = line.split(" +++$+++ ")
    if len(_line) == 5:
        id_2_movieline[_line[0]] = _line[4]

In [30]:
print()
print("Movie Lines of data set:")
print()
print(dict(list(id_2_movieline.items())[1:10]))


Movie Lines of data set:

{'L454353': 'Mary had been there, one night, and had left.', 'L334710': "It's Jennie. Just tell me if Telly is there.", 'L86783': 'I just know how you get. Good to know, them butterflies still in ya gut.', 'L103759': 'Was Future Man adopted?', 'L633072': "You're not talking sense.", 'L642272': 'She looks like a sick marrow!', 'L422114': "How's that, Sheriff?", 'L144387': "Daryll Lee Cullum?  I don't think so.  If he's escaped we'd have the National Guard, cops'd be crawling through sewers.  You'd have a guard on your front door.", 'L413490': 'John: take that look offen your face and act nice.'}
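
The dictionary step relies on every well-formed row having five fields separated by ` +++$+++ `. Here is a minimal sketch of that parsing on two of the raw sample lines shown earlier. The field labels (`line_id`, `user_id`, `movie_id`, `character`, `text`) are illustrative names inferred from the sample rows, not names defined by the dataset.

```python
# Field labels are illustrative assumptions inferred from the sample rows.
SEPARATOR = " +++$+++ "

sample_lines = [
    "L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!",
    "L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.",
    "",  # an empty string such as split('\n') can leave behind
]

id_to_text = {}
for raw in sample_lines:
    fields = raw.split(SEPARATOR)
    if len(fields) == 5:  # keep only rows with all five fields
        line_id, user_id, movie_id, character, text = fields
        id_to_text[line_id] = text

print(id_to_text)
```

The length check is what silently drops the empty row, so no separate handling of blank lines is needed.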


## Create a list of all the conversations

In [31]:
conversations_ids = []
for conversation in conversations[:-1]:
    _conversation = conversation.split(" +++$+++ ")[-1][1:-1].replace("'", "").replace(" ", "")
    conversations_ids.append(_conversation.split(","))
print()
print("List of conversations:")
print()
conversations_ids[:10]


List of conversations:



[['L194', 'L195', 'L196', 'L197'],
 ['L198', 'L199'],
 ['L200', 'L201', 'L202', 'L203'],
 ['L204', 'L205', 'L206'],
 ['L207', 'L208'],
 ['L271', 'L272', 'L273', 'L274', 'L275'],
 ['L276', 'L277'],
 ['L280', 'L281'],
 ['L363', 'L364'],
 ['L365', 'L366']]
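
The string-replace parsing above works because the last field of each row is written as a Python-style list literal. As a possible alternative (not what this notebook uses), the standard library's `ast.literal_eval` can parse that field directly. The sketch below compares both approaches on one sample row.

```python
import ast

raw = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']"
last_field = raw.split(" +++$+++ ")[-1]

# Approach used in the notebook: strip brackets, quotes and spaces, then split.
ids_by_replace = last_field[1:-1].replace("'", "").replace(" ", "").split(",")

# Alternative sketch: let Python parse the list literal.
ids_by_literal_eval = ast.literal_eval(last_field)

print(ids_by_replace)
print(ids_by_literal_eval)
```

Both give the same list for rows of this shape. `literal_eval` raises an exception on a malformed row, which may or may not be what you want for a noisy file.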

In [32]:
print("Split the questions and answers")
print()
questions = []
answers = []
for convs in conversations_ids:
    for i in range(len(convs) - 1):
        questions.append(id_2_movieline[convs[i]])
        answers.append(id_2_movieline[convs[i + 1]])
print("Questions:")
print()
print(questions[:10])
print()
print("Answers:")
print()
print(answers[:10])

Split the questions and answers

Questions:

['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.', "Well, I thought we'd start with pronunciation, if that's okay with you.", 'Not the hacking and gagging and spitting part.  Please.', "You're asking me out.  That's so cute. What's your name again?", "No, no, it's my fault -- we didn't have a proper introduction ---", 'Cameron.', "The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.", 'Why?', 'Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.', 'Gosh, if only we could find Kat a boyfriend...']

Answers:

["Well, I thought we'd start with pronunciation, if that's okay with you.", 'Not the hacking and gagging and spitting part.  Please.', "Okay... then how 'bout we try out some French cuisine.  Saturday?
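
The pairing logic can be seen on a toy conversation: each line is paired with the line that follows it, so a conversation of n lines yields n - 1 question/answer pairs, and every middle line appears once as an answer and once as a question. The helper name `make_pairs` is hypothetical and only used for this sketch.

```python
def make_pairs(utterances):
    """Pair each utterance with the one that follows it."""
    return list(zip(utterances[:-1], utterances[1:]))

toy_conversation = ["L194", "L195", "L196", "L197"]
pairs = make_pairs(toy_conversation)

for question_id, answer_id in pairs:
    print(question_id, "->", answer_id)
```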

## Now we need to clean the text

In [33]:
def clean(text):
    """
    function: clean
    params: String text
    does: lowercases the text, expands common contractions, and strips punctuation.
    returns: String clean text 
    """
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    # '-' comes first and '[' is escaped so that both are matched literally
    text = re.sub(r"[-()\"#/@;:<>{}+=\[.?,]", "", text)
    return text

In [34]:
clean_ques = []
clean_answ = []
for question in questions:
    clean_ques.append(clean(question))
for answer in answers:
    clean_answ.append(clean(answer))
print()
print("Cleaned Questions:")
print(clean_ques[:10])
print()
print("Cleaned Answers:")
print(clean_answ[:10])


Cleaned Questions:
['can we make this quick  roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad  again', 'well i thought we would start with pronunciation if that is okay with you', 'not the hacking and gagging and spitting part  please', 'you are asking me out  that is so cute what is your name again', "no no it's my fault  we didn't have a proper introduction ", 'cameron', 'the thing is cameron  i am at the mercy of a particularly hideous breed of loser  my sister  i cannot date until she does', 'why', 'unsolved mystery  she used to be really popular when she started high school then it was just like she got sick of it or something', 'gosh if only we could find kat a boyfriend']

Cleaned Answers:
['well i thought we would start with pronunciation if that is okay with you', 'not the hacking and gagging and spitting part  please', "okay then how 'bout we try out some french cuisine  saturday  night", 'forget it', 'cameron', 'the thing is
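
To see what the cleaning step does to a single utterance, here is a minimal sketch that applies a subset of the substitutions to one of the raw questions. `clean_sketch` and the `CONTRACTIONS` table are hypothetical names for this illustration, and the punctuation class is written with an escaped `[` so every character in it is literal.

```python
import re

# Subset of the contraction rules, applied in order (illustrative only).
CONTRACTIONS = [
    (r"i'm", "i am"),
    (r"that's", "that is"),
    (r"what's", "what is"),
    (r"'re", " are"),
    (r"'d", " would"),
    (r"can't", "cannot"),
]

def clean_sketch(text):
    """Lowercase, expand a few contractions, then strip punctuation."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return re.sub(r"[-()\"#/@;:<>{}+=\[.?,]", "", text)

sample = "You're asking me out.  That's so cute. What's your name again?"
print(clean_sketch(sample))
```

Punctuation is deleted rather than replaced by a space, so double spaces in the raw text survive and words joined only by punctuation are fused together.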

## Remove less frequent words

Find the number of occurrences of each word so that the least frequent words (the lowest 5%) can be removed. This speeds up training of the neural network and keeps the focus on the most impactful words in the corpus.

In [35]:
count_words = {}
for ques in clean_ques:
    for word in ques.split():
        if word in count_words:
            count_words[word] += 1
        else:
            count_words[word] = 1

for answ in clean_answ:
    for word in answ.split():
        if word in count_words:
            count_words[word] += 1
        else:
            count_words[word] = 1
print()
print("Word count hash table:")
print()
print(dict(list(count_words.items())[1:10]))


Word count hash table:

{'scumbag': 38, 'duty': 213, 'foundation!': 2, 'heaving': 4, 'doggonedest': 1, 'areenchanting': 1, 'onsite': 4, 'hotspur': 1, 'successteamwork': 1}
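
The same word count can be written more compactly with `collections.Counter` from the standard library. This is an equivalent sketch on toy sentences, not the notebook's own code.

```python
from collections import Counter

toy_sentences = ["why", "why not", "not now"]

# Count every whitespace-separated token across all sentences.
word_counts = Counter(word for sentence in toy_sentences for word in sentence.split())

print(word_counts["why"], word_counts["not"], word_counts["now"])
```

A `Counter` behaves like a dictionary, so it can be iterated with `.items()` in the same way as `count_words`.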


## Tokenize and create a threshold 

Tokenize to get all words and filter out the words that do not meet the threshold. The threshold is set at 20 occurrences (a word is kept only if it appears more than 20 times); this hyperparameter can be tuned to improve the model. Each remaining word is mapped to a unique integer.

In [36]:
threshold = 20
questions_mapping = {}
w_count = 0
for word, count in count_words.items():
    if count > threshold:
        questions_mapping[word] = w_count
        w_count += 1
        
answers_mapping = {}
w_count = 0  # restart numbering so that answer ids also begin at 0
for word, count in count_words.items():
    if count > threshold:
        answers_mapping[word] = w_count
        w_count += 1

print()
print("Questions Mapping:")
print()
print(dict(list(questions_mapping.items())[1:10]))
print()
print("Answers Mapping")
print()
print(dict(list(answers_mapping.items())[1:10]))


Questions Mapping:

{'publish': 4295, 'parasites': 2220, 'scumbag': 0, 'duty': 1, 'mallory': 6852, 'federation': 5436, 'moraes': 6508, 'kinky': 8466, 'brady': 6504}

Answers Mapping

{'publish': 4295, 'parasites': 2220, 'scumbag': 0, 'duty': 1, 'mallory': 6852, 'federation': 5436, 'moraes': 6508, 'kinky': 8466, 'brady': 6504}
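
The filter uses a strict comparison, so a word that occurs exactly `threshold` times is dropped. A small sketch with made-up counts (the numbers in `toy_counts` are invented for illustration) shows the boundary behaviour:

```python
# Invented counts, chosen so that one word sits exactly on the threshold.
toy_counts = {"the": 120, "cameron": 25, "doggonedest": 1, "duty": 20}
threshold = 20

word_to_id = {}
next_id = 0
for word, count in toy_counts.items():
    if count > threshold:  # strictly greater: a count of exactly 20 is excluded
        word_to_id[word] = next_id
        next_id += 1

print(word_to_id)
```

Raising the threshold shrinks the vocabulary and speeds up training, at the cost of sending more words to the out-of-vocabulary token.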


In [37]:
tokens = ['<PAD>', '<EOS>', '<OUT>','<SOS>']

for token in tokens:
    questions_mapping[token] = len(questions_mapping) + 1

for token in tokens:
    answers_mapping[token] = len(answers_mapping) + 1

In [38]:
inverse_answers = {w_i: w for w, w_i in answers_mapping.items()}
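
Here is a minimal sketch of how the special tokens and the inverse mapping fit together, using a two-word toy vocabulary. The encoding step with an `<OUT>` fallback is an assumption about how the mappings are used later; it is not shown in this part of the notebook.

```python
vocab = {"forget": 0, "it": 1}

# Same id scheme as above: len(vocab) + 1, which leaves one id (here 2) unused.
for token in ["<PAD>", "<EOS>", "<OUT>", "<SOS>"]:
    vocab[token] = len(vocab) + 1

inverse_vocab = {word_id: word for word, word_id in vocab.items()}

# Assumed usage: unknown words fall back to the <OUT> id.
sentence = "forget it cameron <EOS>"
encoded = [vocab.get(word, vocab["<OUT>"]) for word in sentence.split()]
decoded = [inverse_vocab[word_id] for word_id in encoded]

print(encoded)
print(decoded)
```

The inverse mapping only round-trips cleanly when every id belongs to exactly one word, so word ids and special-token ids must not overlap.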

Now we need to add the EOS token to the end of every answer.

In [39]:
for i in range(len(clean_answ)):
    clean_answ[i] += ' <EOS>'

In [40]:
print()
print("EOS token at the end of each answer, this is used for the decoding part of the seq2seq model:")
print()
clean_answ[:10]


EOS token at the end of each answer, this is used for the decoding part of the seq2seq model:



['well i thought we would start with pronunciation if that is okay with you <EOS>',
 'not the hacking and gagging and spitting part  please <EOS>',
 "okay then how 'bout we try out some french cuisine  saturday  night <EOS>",
 'forget it <EOS>',
 'cameron <EOS>',
 'the thing is cameron  i am at the mercy of a particularly hideous breed of loser  my sister  i cannot date until she does <EOS>',
 'seems like she could get a date easy enough <EOS>',
 'unsolved mystery  she used to be really popular when she started high school then it was just like she got sick of it or something <EOS>',
 'that is a shame <EOS>',
 'let me see what i can do <EOS>']