### Intro: 

In this tutorial we will implement a deep NLP ChatBot using TensorFlow. So, without further ado, let's get right into it.

We'll start by importing the libraries needed for this project.

In [1]:
import numpy as np 
import tensorflow as tf
import re #Helps with data preprocessing
import time
import datetime

The dataset used to train this ChatBot is taken from:

https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

This dataset is called the Cornell Movie--Dialogs Corpus. It contains conversations between characters from a large number of movies, so our ChatBot will be a friend-like ChatBot (able to hold casual conversations). For more field-specific ChatBots, other kinds of datasets can be used. For further information about the data, see the link above.

It's important to know that the dataset is composed of two text files: "movie_lines.txt" and "movie_conversations.txt". The first contains the lines from the different movies in no particular order, each with an ID. The second uses these IDs to list the lines that belong to each conversation, so it tells us how to order the lines from the first file.
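
To make the file layout concrete, here is a minimal sketch of how one record from each file can be split on the ` +++$+++ ` separator. The two sample strings follow the corpus layout and are included for illustration only; `parse_line` and `parse_conversation` are hypothetical helper names, not part of the code used later in this tutorial.

```python
import ast

SEP = " +++$+++ "

# Illustrative records in the layout of the two corpus files.
sample_line = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
sample_convo = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

def parse_line(record):
    """Return (line_id, text) for one record of movie_lines.txt."""
    fields = record.split(SEP)
    return fields[0], fields[-1]

def parse_conversation(record):
    """Return the ordered list of line IDs for one record of movie_conversations.txt."""
    # The last field is written as a Python-style list literal, so literal_eval can read it.
    return ast.literal_eval(record.split(SEP)[-1])

line_id, text = parse_line(sample_line)
ids = parse_conversation(sample_convo)
print(line_id, "->", text)
print(ids)
```

The first field of a line record is its ID and the last field is the spoken text, which is all the preprocessing needs.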

# I. Data preprocessing: 

Generally, this is the longest part of any project: here we get the data ready for input into the deep learning model. Luckily, the re library is here to carry some of the load in this phase. Let's begin:

In [2]:
# Loading data: We will load both the lines and conversations

with open("C:/Users/YsfEss/Desktop/data/movie_lines.txt",encoding='utf-8',errors='ignore') as f1:
    lines=f1.read().split('\n') #304714 lines
with open("C:/Users/YsfEss/Desktop/data/movie_conversations.txt",encoding='utf-8',errors='ignore') as f2:
    convos=f2.read().split('\n')

In [3]:
# Now let's create a dictionary that maps each line ID to its text.
id2line={}

for line in lines:
    spl=line.split(' +++$+++ ')
    if len(spl)==5:
        id2line[spl[0]]=spl[-1]

In [4]:
# We will now create a list of conversations. 

convoli= []

for conv in convos[:-1]: #The last row of this list is empty
    spl=conv.split(' +++$+++ ')[-1][1:-1].replace("'","").replace(' ','')
    convoli.append(spl.split(','))

In [5]:
# From the list of conversation IDs we will build two lists: one for 'questions' and the other for 'answers'.

questions=[]
answers=[]

for conv in convoli:
    k=len(conv)
    for i in range(k-1):
        questions.append(id2line[conv[i]])
        answers.append(id2line[conv[i+1]])
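
The pairing rule used in this cell (each line serves as the "question" for the line that follows it) can be checked on a toy conversation. This is only a sketch: the IDs and sentences below are made up, and the `toy_` names are hypothetical.

```python
# Toy data, invented for illustration only.
toy_id2line = {"L1": "hi", "L2": "hello", "L3": "how are you", "L4": "fine"}
toy_conversation = ["L1", "L2", "L3", "L4"]

# Every line except the last is a question; every line except the first is an answer.
toy_pairs = [
    (toy_id2line[q_id], toy_id2line[a_id])
    for q_id, a_id in zip(toy_conversation, toy_conversation[1:])
]
print(toy_pairs)
```

A conversation of n lines therefore yields n-1 question/answer pairs.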

In [96]:
# Now for text cleaning

def cleanText(text):
    # text to lower case
    text=text.lower()
    # To make it easier for the ChatBot to learn, we use re to replace contractions like "i'm" with "i am"
    text=re.sub(r"i'm","i am",text)
    text=re.sub(r"she's","she is",text)
    text=re.sub(r"he's","he is",text)
    text=re.sub(r"it's","it is",text)
    text=re.sub(r"that's","that is",text)
    text=re.sub(r"what's","what is",text)
    text=re.sub(r"where's","where is",text)
    text=re.sub(r"\'ve"," have",text)
    text=re.sub(r"\'ll"," will",text)
    text=re.sub(r"\'d"," would",text)
    text=re.sub(r"\'re"," are",text)
    text=re.sub(r"won't","will not",text)
    text=re.sub(r"can't","can not",text)
    text=re.sub(r"wouldn't","would not",text)
    text=re.sub(r"couldn't","could not",text)
    text=re.sub(r"haven't","have not",text)
    text=re.sub(r"didn't","did not",text)
    text=re.sub(r"cannot","can not",text)
    text=re.sub(r"gonna","going to",text)
    text=re.sub(r"wanna","want to",text)
    text=re.sub(r"don't","do not",text)
    text=re.sub(r"[-()/\"#$%^&*()_+@=?<>:;,.!{}'|]","",text)
    # Add more rules here if you can: the better the cleaning, the better the result.
    return text

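# Clean each question together with its answer, and drop the whole pair if either side
# ends up empty, so that the two lists stay aligned.
clean_questions=[]
clean_answers=[]
for q,a in zip(questions,answers):
    cq,ca=cleanText(q),cleanText(a)
    if cq.split() and ca.split():
        clean_questions.append(cq)
        clean_answers.append(ca)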
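
As a small, self-contained illustration of the cleaning idea, the sketch below drives the substitutions from a table of (pattern, replacement) pairs. The table is a short sample rather than the full rule set, and `expand_contractions` is a hypothetical name used only here.

```python
import re

# A few (pattern, replacement) pairs; a sample, not the tutorial's full set.
# Whole-word contractions come first, generic suffixes such as 'll come last.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"i'm", "i am"),
    (r"'ll", " will"),
    (r"'re", " are"),
]

def expand_contractions(text):
    """Lower-case the text, expand the contractions above, and drop basic punctuation."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return re.sub(r"[?.!,']", "", text)

cleaned = expand_contractions("I'm sure they'll say we can't!")
print(cleaned)
```

Keeping the rules in a list makes it easy to add new ones and to control the order in which they are applied.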

In [97]:
# To optimize the ChatBot's training we will remove infrequent words from both the questions and answers lists.
# The first step is to build a dictionary that maps each word to its number of occurrences in the dataset.

wordOccur={}
for question in clean_questions:
    for word in question.split():
        wordOccur[word]=wordOccur.get(word,0)+1
for answer in clean_answers:
    for word in answer.split():
        wordOccur[word]=wordOccur.get(word,0)+1
            
# The second step is to set a threshold on the number of occurrences a word needs in order to be used in training.
# Then we create two dictionaries that map each word from the questions/answers to a unique identifier.

treshold=20 # A hyperparameter of the model for now; 20 seems reasonable, and we can decrease or increase it based on the results.

Qwords={word for q in clean_questions for word in q.split()} # Unique words in the questions (a set keeps the lookups below fast).
Awords={word for a in clean_answers for word in a.split()}   # Unique words in the answers.

questionwordsIDs={}

wordID=0
for word , count in wordOccur.items():
    if (count > treshold and word in Qwords):
        questionwordsIDs[word]=wordID
        wordID+=1
        
answerwordsIDs={}
        
wordID=0
for word , count in wordOccur.items():
    if (count > treshold and word in Awords):
        answerwordsIDs[word]=wordID
        wordID+=1
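
The counting and thresholding steps can also be sketched with `collections.Counter` from the standard library. The corpus and the cut-off below are toy values chosen so the effect is visible; all `toy_` names are hypothetical.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
toy_sentences = ["i am here", "i am not here", "where am i"]
toy_threshold = 1  # hypothetical cut-off; the tutorial uses 20 on the real corpus

# Count how often each word appears across all sentences.
counts = Counter(word for sentence in toy_sentences for word in sentence.split())

# Keep the words seen more than `toy_threshold` times and give each a consecutive ID.
kept = [word for word, count in counts.items() if count > toy_threshold]
toy_word2id = {word: idx for idx, word in enumerate(kept)}
print(toy_word2id)
```

Words that fall at or below the threshold ("not" and "where" here) get no ID and will later be replaced by the `<OUT>` token.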

In [101]:
# We will now add tokens necessary for the SEQ2SEQ model to the dictionary with their unique IDs.

tokens=['<PAD>','<EOS>','<OUT>','<SOS>']
for token in tokens:
    questionwordsIDs[token]=len(questionwordsIDs) # Next free ID, so the IDs stay contiguous
for token in tokens:
    answerwordsIDs[token]=len(answerwordsIDs)

In [102]:
# In the implementation of the SEQ2SEQ model we will need the inverse mapping ID --> word for the answer dictionary, so let's build it.

answerIDs2words={wordID:word for word,wordID in answerwordsIDs.items()}
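
Here is a toy version of the last two cells, using a made-up two-word vocabulary. In this sketch each special token simply takes the next free ID (the current size of the dictionary), so the IDs form a contiguous range; that numbering is a choice of the sketch, and the `toy_` names are hypothetical.

```python
# Toy vocabulary, invented for illustration only.
toy_vocab = {"hello": 0, "there": 1}

# Append the special tokens, each taking the next free ID.
for token in ["<PAD>", "<EOS>", "<OUT>", "<SOS>"]:
    toy_vocab[token] = len(toy_vocab)

# Invert the mapping so that model outputs (IDs) can be turned back into words.
toy_id2word = {idx: word for word, idx in toy_vocab.items()}
print(toy_vocab)
print(toy_id2word)
```

The inversion is only safe because every ID is unique; duplicate IDs would silently overwrite each other in `toy_id2word`.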

In [103]:
# Let's append <EOS> to the end of every answer in clean_answers.

for i in range (len(clean_answers)):
    clean_answers[i]+=' <EOS>'

In [120]:
# Now we will translate the questions and answers into lists of integers, using the IDs defined above.

codedQuestions=[]
for question in clean_questions:
    l=question.split()
    if len(l)>0:
        # Words that are not in the dictionary are replaced by the <OUT> token
        codedQuestions.append([questionwordsIDs.get(word,questionwordsIDs['<OUT>']) for word in l])

codedAnswers=[]
for answer in clean_answers:
    l=answer.split()
    if len(l)>0:
        codedAnswers.append([answerwordsIDs.get(word,answerwordsIDs['<OUT>']) for word in l])
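
The encoding step can be tried on a single sentence. This is a minimal sketch with a made-up vocabulary; `encode` and `toy_word2id` are hypothetical names.

```python
# Toy vocabulary, invented for illustration only.
toy_word2id = {"how": 0, "are": 1, "you": 2, "<PAD>": 3, "<EOS>": 4, "<OUT>": 5, "<SOS>": 6}

def encode(sentence, word2id):
    """Map each word to its ID, falling back to the <OUT> ID for unknown words."""
    out_id = word2id["<OUT>"]
    return [word2id.get(word, out_id) for word in sentence.split()]

encoded = encode("how are you doing <EOS>", toy_word2id)
print(encoded)
```

The word "doing" is not in the toy vocabulary, so it is encoded with the `<OUT>` ID, while `<EOS>` keeps its own ID.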

In [121]:
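# Final step before getting into modeling: sort the questions and answers by length,
# which speeds up the learning process. The pairs are sorted together, by question length,
# so that each question stays matched with its answer.

SortclPairs=sorted(zip(codedQuestions,codedAnswers),key=lambda pair:len(pair[0]))
SortclQues=[q for q,a in SortclPairs]
SortclAns=[a for q,a in SortclPairs]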
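
One way to see why sorting by length helps is to count the padding it saves. The sketch assumes that each batch is padded to the length of its longest sequence, which is a common approach but is not shown in this part of the tutorial; `padding_needed` and the toy data are hypothetical.

```python
def padding_needed(sequences, batch_size):
    """Count the <PAD> slots needed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        total += sum(longest - len(seq) for seq in batch)
    return total

# Toy encoded questions of lengths 5, 1, 4 and 2 (the values are arbitrary IDs).
toy_questions = [[1, 2, 3, 4, 5], [6], [7, 8, 9, 10], [11, 12]]

unsorted_padding = padding_needed(toy_questions, batch_size=2)
sorted_padding = padding_needed(sorted(toy_questions, key=len), batch_size=2)
print(unsorted_padding, sorted_padding)
```

With sequences of similar length grouped together, fewer positions in each batch are wasted on padding, so less computation is spent on tokens that carry no information.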

# II. Building the SEQ2SEQ model:

Now we will start using TensorFlow to build the architecture of the model that we will train in the next phase, so let's get into it.

It's important to note that in TensorFlow all variables are tensors. Without being mathematically rigorous, a tensor can be thought of as a multidimensional array: a vector is a rank 1 tensor and a matrix is a rank 2 tensor, for example. Tensors allow fast computation in deep neural networks. To feed data into the model we must first declare placeholders (`tf.placeholder`, part of the TensorFlow 1.x API), which are tensors whose values are supplied when the graph is run. So the first thing we will do is create placeholders for the inputs and targets. Let's go!
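
Before writing the placeholders, the notion of rank can be made concrete with plain Python lists. This sketch does not use TensorFlow at all; `shape_of` is a hypothetical helper that assumes regular (non-ragged) nested lists.

```python
def shape_of(nested):
    """Return the shape of a regular nested list, e.g. [[1, 2], [3, 4]] gives (2, 2)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)

scalar = 7                                # rank 0: a single number
vector = [1, 2, 3]                        # rank 1: one index needed
matrix = [[1, 2, 3], [4, 5, 6]]           # rank 2: row and column
batch = [[[0, 1], [2, 3]], [[4, 5], [6, 7]], [[8, 9], [10, 11]]]  # rank 3

for value in (scalar, vector, matrix, batch):
    shape = shape_of(value)
    print("rank", len(shape), "shape", shape)
```

The rank is simply the number of indices needed to reach a single number, which is the length of the shape tuple.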

In [None]:
def modelInputs():
    inputs=tf.placeholder()