### Intro: 

In this tutorial we will implement a Deep NLP ChatBot using TensorFlow. So without further ado, let's get right into it.

We'll start by importing the libraries needed for this project.

In [178]:
import numpy as np 
import tensorflow as tf
import re #Helps with data preprocessing
import time
import datetime

The dataset used to train this ChatBot is taken from:

https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

This dataset is called the Cornell Movie--Dialogs Corpus. It contains conversations between characters from a large number of movies, so our ChatBot will be a friend-like ChatBot (able to hold casual conversations). For more field-specific ChatBots, other kinds of datasets can be used. For further information about the data, see the link above.

It's important to know that the dataset is composed of two text files: "movie_lines.txt" and "movie_conversations.txt". The first contains the lines from different movies in no particular order, each with its own ID. The second file uses these IDs to list the lines that belong to each conversation, so it tells us how to order the lines from the first file.
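
To make the relationship between the two files concrete, here is a minimal sketch that uses hand-written rows in the same ` +++$+++ `-separated layout. The sample rows and the variable names are illustrative assumptions; they are not taken from the corpus.

```python
SEP = " +++$+++ "

# Hand-written rows that mimic the layout of the two files (illustrative only).
sample_lines = [
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Fine, thanks.",
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ How are you?",
]
sample_conversations = [
    "u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2']",
]

# Map each line ID (first field) to its text (last field).
sample_id2line = {}
for row in sample_lines:
    fields = row.split(SEP)
    sample_id2line[fields[0]] = fields[-1]

# The last field of a conversation row lists the line IDs in spoken order.
for row in sample_conversations:
    raw_ids = row.split(SEP)[-1]
    line_ids = raw_ids.strip("[]").replace("'", "").replace(" ", "").split(",")
    ordered_text = [sample_id2line[i] for i in line_ids]

print(ordered_text)  # ['How are you?', 'Fine, thanks.']
```

Although the lines are stored out of order, the conversation row restores the order in which they were spoken.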

# I. Data preprocessing: 

Generally, this is the longest part of any project: here we make the data ready for input into the deep learning model. Luckily, the re library is here to carry some of the load in this phase. Let's begin:

In [11]:
# Loading data: We will load both the lines and conversations

with open("C:/Users/YsfEss/Desktop/data/movie_lines.txt",encoding='utf-8',errors='ignore') as f1:
    lines=f1.read().split('\n') #304714 lines
with open("C:/Users/YsfEss/Desktop/data/movie_conversations.txt",encoding='utf-8',errors='ignore') as f2:
    convos=f2.read().split('\n')
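
The absolute paths above are specific to one machine. One possible way to make the loading step reusable is a small helper that takes the path as a parameter. `read_rows` is a hypothetical name, and dropping empty rows is an added assumption; with it, the trailing empty row would no longer need to be skipped by hand.

```python
import tempfile
from pathlib import Path

def read_rows(path):
    """Return the non-empty rows of a text file (hypothetical helper)."""
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    # Dropping empty rows is an assumption; the cell above keeps them.
    return [row for row in text.split("\n") if row]

# Demonstration on a temporary file, so the sketch runs without the corpus.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.txt"
    demo_file.write_text("first row\nsecond row\n", encoding="utf-8")
    rows = read_rows(demo_file)

print(rows)  # ['first row', 'second row']
```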

In [47]:
# Now let's create a dictionary that maps each line ID to the text of that line.
id2line={}

for line in lines:
    spl=line.split(' +++$+++ ')
    if len(spl)==5:
        id2line[spl[0]]=spl[-1]

In [48]:
# We will now create a list of conversations. 

convoli= []

for conv in convos[:-1]: #The last row of this list is empty
    spl=conv.split(' +++$+++ ')[-1][1:-1].replace("'","").replace(' ','')
    convoli.append(spl.split(','))

In [54]:
# From the list of conversation IDs we will build two lists: one for 'questions' and the other for 'answers'.

questions=[]
answers=[]

for conv in convoli:
    k=len(conv)
    for i in range(k-1):
        questions.append(id2line[conv[i]])
        answers.append(id2line[conv[i+1]])
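
To see what this pairing produces, the sketch below runs the same idea on a toy three-line conversation. The IDs and sentences are made up, and `build_pairs` is a hypothetical helper. Skipping IDs that are missing from the dictionary is an added assumption, since the loop above would raise a `KeyError` in that case.

```python
def build_pairs(conversations, id2line):
    """Pair every line with the line that follows it in the same conversation."""
    questions, answers = [], []
    for conv in conversations:
        for current_id, next_id in zip(conv, conv[1:]):
            # Assumption: silently skip pairs whose IDs were not loaded.
            if current_id in id2line and next_id in id2line:
                questions.append(id2line[current_id])
                answers.append(id2line[next_id])
    return questions, answers

toy_id2line = {"L1": "hi", "L2": "hello", "L3": "how are you"}
toy_questions, toy_answers = build_pairs([["L1", "L2", "L3"]], toy_id2line)

print(toy_questions)  # ['hi', 'hello']
print(toy_answers)    # ['hello', 'how are you']
```

Every line except the last acts as a question, and every line except the first acts as an answer, so a conversation of k lines yields k-1 pairs.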

In [73]:
# Now for text cleaning

def cleanText(text):
    # Convert the text to lower case
    text=text.lower()
    # To make it easier for the ChatBot to learn, we use re to expand contractions such as "i'm" into "i am"
    text=re.sub(r"i'm","i am",text)
    text=re.sub(r"she's","she is",text)
    text=re.sub(r"he's","he is",text)
    text=re.sub(r"it's","it is",text)
    text=re.sub(r"that's","that is",text)
    text=re.sub(r"what's","what is",text)
    text=re.sub(r"where's","where is",text)
    text=re.sub(r"\'ve"," have",text)
    text=re.sub(r"\'ll"," will",text)
    text=re.sub(r"\'d"," would",text)
    text=re.sub(r"\'re"," are",text)
    # Handle the irregular negations before the generic "n't" rule
    text=re.sub(r"won't","will not",text)
    text=re.sub(r"can't","can not",text)
    text=re.sub(r"n't"," not",text)
    text=re.sub(r"cannot","can not",text)
    text=re.sub(r"gonna","going to",text)
    text=re.sub(r"wanna","want to",text)
    # Remove punctuation and other special characters
    text=re.sub(r"[-()/\"#$%^&*()_+@=?<>:;,.!{}|]","",text)
    # Extend these rules as needed: the better the cleaning, the better the result
    return text

clean_questions=[cleanText(line) for line in questions]
clean_answers=[cleanText(line) for line in answers]
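
As a quick sanity check of this kind of cleaning, here is a condensed, table-driven sketch. It is not the function above: `clean_text_sketch` and its shortened rule list are illustrative assumptions that cover only a few contractions.

```python
import re

# A shortened, illustrative rule list; order matters (specific rules first).
RULES = [
    (r"i'm", "i am"),
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'ll", " will"),
    (r"'re", " are"),
    # Assumption: drop everything except lowercase letters, digits, apostrophes and spaces.
    (r"[^a-z0-9' ]", ""),
]

def clean_text_sketch(text):
    """Lower-case the text, then apply each (pattern, replacement) rule in order."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(clean_text_sketch("I'm sure they'll come, won't they?"))
# i am sure they will come will not they
```

Keeping the rules in a list makes the ordering explicit: "won't" and "can't" have to be expanded before the generic "n't" rule, otherwise they would become "wo not" and "ca not".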

In [183]:
# To optimize our ChatBot's training we will remove infrequent words from both the questions and answers lists.
# The first step is to build a dictionary that maps each word to its number of occurrences in the dataset.

start=datetime.datetime.now()
wordOccur={}
for question in clean_questions:
    l=question.split()
    for i in range (len(l)) :
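        # Caution: this first attempt is slow (l.count rescans the line every time) and it
        # overwrites the running total whenever a word appears more than once in a line.
        # The next cell counts correctly.
        if (l[i] in wordOccur.keys() and l.count(l[i])==1):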
            wordOccur[l[i]]+=l.count(l[i])
        else:
            wordOccur[l[i]]=l.count(l[i])
end=datetime.datetime.now()
print(end-start)

0:00:04.046888


In [184]:
# Second attempt: add one to the count for every occurrence, which is both correct and faster.
start=datetime.datetime.now()
wordOccur={}
for question in clean_questions:
    l=question.split()
    for i in range (len(l)) :
        if l[i] in wordOccur.keys():
            wordOccur[l[i]]+=1
        else:
            wordOccur[l[i]]=1
end=datetime.datetime.now()
print(end-start)

0:00:01.299222
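
For comparison, the same counting can be sketched with `collections.Counter` from the standard library, followed by the filtering step this section is working toward. The toy sentences and the `min_count` threshold are assumptions for illustration; no cutoff value has been chosen at this point in the tutorial.

```python
from collections import Counter

toy_questions = ["how are you", "how is it going", "are you there"]

# Count every occurrence of every word across all sentences.
word_counts = Counter()
for sentence in toy_questions:
    word_counts.update(sentence.split())

# Keep only the words seen at least min_count times (the threshold is an assumption).
min_count = 2
frequent_words = {word for word, count in word_counts.items() if count >= min_count}

print(word_counts["how"])      # 2
print(sorted(frequent_words))  # ['are', 'how', 'you']
```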


In [115]:
l=[0,1,2,3]

In [117]:
# Quick check: slicing with [0:-1] drops the last element
l[0:-1]

[0, 1, 2]

In [165]:
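# Toy example: dict1 counts the occurrences of each word, dict2 stores the index of its first occurrence
a='je suis je suis tu es tu'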
e=a.split()
dict1={}
dict2={}
for i in range (len(e)):
    dict1[e[i]]=0
    for j in range (len(e)):
        if (e[j]==e[i]):
            dict1[e[i]]+=1
            if e[i] not in dict2.keys():
                dict2[e[i]]=min(i,j)

In [166]:
dict1

{'es': 1, 'je': 2, 'suis': 2, 'tu': 2}

In [167]:
dict2

{'es': 5, 'je': 0, 'suis': 1, 'tu': 4}

In [155]:
min(0,2)

0