# Chatbot using Deep Neural Network

*In this notebook, we explore a fun and interesting use-case of recurrent
sequence-to-sequence models. We will train a simple chatbot using movie
scripts from the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)*

## Downloading the data

In [1]:
!wget http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip

--2020-08-22 11:30:23--  http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.20
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9916637 (9.5M) [application/zip]
Saving to: ‘cornell_movie_dialogs_corpus.zip’


2020-08-22 11:30:24 (8.12 MB/s) - ‘cornell_movie_dialogs_corpus.zip’ saved [9916637/9916637]



In [2]:
!unzip /content/cornell_movie_dialogs_corpus.zip

Archive:  /content/cornell_movie_dialogs_corpus.zip
   creating: cornell movie-dialogs corpus/
  inflating: cornell movie-dialogs corpus/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/cornell movie-dialogs corpus/
  inflating: __MACOSX/cornell movie-dialogs corpus/._.DS_Store  
  inflating: cornell movie-dialogs corpus/chameleons.pdf  
  inflating: __MACOSX/cornell movie-dialogs corpus/._chameleons.pdf  
  inflating: cornell movie-dialogs corpus/movie_characters_metadata.txt  
  inflating: cornell movie-dialogs corpus/movie_conversations.txt  
  inflating: cornell movie-dialogs corpus/movie_lines.txt  
  inflating: cornell movie-dialogs corpus/movie_titles_metadata.txt  
  inflating: cornell movie-dialogs corpus/raw_script_urls.txt  
  inflating: cornell movie-dialogs corpus/README.txt  
  inflating: __MACOSX/cornell movie-dialogs corpus/._README.txt  


## Importing Dependencies

In [3]:
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools

In [4]:
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

In [5]:
device

device(type='cuda')

## Load & Preprocess Data

The next step is to reformat our data file and load the data into
structures that we can work with.

The [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
is a rich dataset of movie character dialog:

-  220,579 conversational exchanges between 10,292 pairs of movie
   characters
-  9,035 characters from 617 movies
-  304,713 total utterances

This dataset is large and diverse, with great variation in language
formality, time periods, sentiment, and so on. Our hope is that this
diversity makes our model robust to many forms of inputs and queries.

First, we’ll take a look at some lines of our data file to see the
original format.

In [6]:
lines_filepath = os.path.join("cornell movie-dialogs corpus", "movie_lines.txt")
conv_filepath = os.path.join("cornell movie-dialogs corpus", "movie_conversations.txt")

In [7]:
# visualise some lines
with open(lines_filepath, 'r',encoding='iso-8859-1') as file:
  lines= file.readlines()
for line in lines[:10]:
  print(line.strip())

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.
L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No
L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I'm kidding.  You know how sometimes you just become this "persona"?  And you don't know how to quit?
L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?


In [8]:
# Split each line of the file into a dictionary of fields (lineID, chracterID, movieID, character, text)
line_fields= ["lineID", "chracterID", "movieID", "character", "text"]
lines= {}

with open(lines_filepath, 'r', encoding='iso-8859-1') as f:
  for line in f:
    values= line.split(' +++$+++ ')
    lineObj= {}
    for i, field in enumerate(line_fields):
      lineObj[field] = values[i]
    lines[lineObj['lineID']] = lineObj

In [9]:
list(lines.items())[:5]

[('L1045',
  {'character': 'BIANCA',
   'chracterID': 'u0',
   'lineID': 'L1045',
   'movieID': 'm0',
   'text': 'They do not!\n'}),
 ('L1044',
  {'character': 'CAMERON',
   'chracterID': 'u2',
   'lineID': 'L1044',
   'movieID': 'm0',
   'text': 'They do to!\n'}),
 ('L985',
  {'character': 'BIANCA',
   'chracterID': 'u0',
   'lineID': 'L985',
   'movieID': 'm0',
   'text': 'I hope so.\n'}),
 ('L984',
  {'character': 'CAMERON',
   'chracterID': 'u2',
   'lineID': 'L984',
   'movieID': 'm0',
   'text': 'She okay?\n'}),
 ('L925',
  {'character': 'BIANCA',
   'chracterID': 'u0',
   'lineID': 'L925',
   'movieID': 'm0',
   'text': "Let's go.\n"})]
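
Each `text` value above still carries its trailing newline. A minimal sketch of a more compact parse follows, assuming a hypothetical helper `parse_line` (not part of the notebook) that pairs field names with values via `zip` and strips the line ending. The field names, including the `chracterID` spelling, are copied from the cell above.

```python
# Hypothetical helper, not part of the original notebook.
LINE_FIELDS = ["lineID", "chracterID", "movieID", "character", "text"]
SEPARATOR = " +++$+++ "

def parse_line(raw_line):
    """Split one raw corpus line into a field dictionary."""
    # maxsplit keeps any separator-like text inside the utterance intact
    values = raw_line.rstrip("\n").split(SEPARATOR, len(LINE_FIELDS) - 1)
    return dict(zip(LINE_FIELDS, values))

sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"
record = parse_line(sample)
print(record["lineID"], record["character"], repr(record["text"]))
```

The later cells call `.strip()` on each utterance anyway, so stripping here is a convenience rather than a requirement.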

In [10]:
import ast

# Group the fields of lines from the `lines` dictionary into conversations based on "movie_conversations.txt"
conv_fields = ['character1ID', 'character2ID', 'movieID', 'utteranceIDs']
conversations = []
with open(conv_filepath, 'r', encoding= 'iso-8859-1') as f:
  for line in f:
    values = line.split(" +++$+++ ")
    # Extract fields
    convObj= {}
    for i, field in enumerate(conv_fields):
      convObj[field] = values[i]
    # Convert the string from split into a list, since convObj["utteranceIDs"] == "['L194', 'L195', 'L196', ...]\n"
    # ast.literal_eval parses the list literal without executing arbitrary code, unlike eval
    lineIds = ast.literal_eval(convObj["utteranceIDs"])
    #Reassemble lines
    convObj["lines"] = []
    for lineId in lineIds:
      convObj["lines"].append(lines[lineId])
    conversations.append(convObj)

In [11]:
conversations[0]

{'character1ID': 'u0',
 'character2ID': 'u2',
 'lines': [{'character': 'BIANCA',
   'chracterID': 'u0',
   'lineID': 'L194',
   'movieID': 'm0',
   'text': 'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'},
  {'character': 'CAMERON',
   'chracterID': 'u2',
   'lineID': 'L195',
   'movieID': 'm0',
   'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"},
  {'character': 'BIANCA',
   'chracterID': 'u0',
   'lineID': 'L196',
   'movieID': 'm0',
   'text': 'Not the hacking and gagging and spitting part.  Please.\n'},
  {'character': 'CAMERON',
   'chracterID': 'u2',
   'lineID': 'L197',
   'movieID': 'm0',
   'text': "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"}],
 'movieID': 'm0',
 'utteranceIDs': "['L194', 'L195', 'L196', 'L197']\n"}

In [12]:
# Extract the pair of sentences from conversations
qa_pairs= []
for conversation in conversations:
  #Iterate over all the lines of the conversation
  for i in range(len(conversation["lines"]) - 1):
    inputLine= conversation["lines"][i]["text"].strip()
    targetLine= conversation["lines"][i+1]["text"].strip()
    # Filter wrong samples (if one of the lines is empty)
    if inputLine and targetLine:
      qa_pairs.append([inputLine, targetLine])
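
The loop above turns a conversation of n lines into at most n - 1 overlapping (input, target) pairs: every utterance except the last serves as an input, and the utterance that follows it is the target. A small sketch on made-up data illustrates this; the helper name `extract_pairs` and the sample utterances are illustrative assumptions, not part of the notebook.

```python
def extract_pairs(utterances):
    """Pair each utterance with the one that follows it, skipping empty lines."""
    pairs = []
    for current, following in zip(utterances, utterances[1:]):
        current, following = current.strip(), following.strip()
        if current and following:
            pairs.append([current, following])
    return pairs

# Made-up conversation; the third line is whitespace only
toy_conversation = ["Hi.\n", "Hello.\n", "  \n", "Bye.\n"]
toy_pairs = extract_pairs(toy_conversation)
print(toy_pairs)
```

Here the whitespace-only third line removes two candidate pairs, since it would be the target of one and the input of the other.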

In [13]:
qa_pairs[0]

['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.',
 "Well, I thought we'd start with pronunciation, if that's okay with you."]

In [14]:
# Define path to new file
datafile= os.path.join("cornell movie-dialogs corpus", "formated_movie_lines.txt")
delimiter= '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding= 'utf-8') as outputfile:
  writer = csv.writer(outputfile, delimiter= delimiter)
  for pair in qa_pairs:
    writer.writerow(pair)
print('Done Writing to file')


Writing newly formatted file...
Done Writing to file
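
`csv.writer` uses minimal quoting by default, so a field that contains the quote character is wrapped in quotes and its inner quotes are doubled. A naive split on the tab character therefore does not always recover the original text, while `csv.reader` does. A short sketch using an in-memory buffer shows the difference (the sample row is made up for illustration):

```python
import csv
import io

row = ['You just become this "persona"?', "And you don't know how to quit?"]

# Write one tab-delimited row into an in-memory buffer instead of a file
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerow(row)
written = buffer.getvalue()

# Naive parsing keeps the quoting that csv.writer added
naive = written.strip().split("\t")

# csv.reader undoes the quoting and returns the original fields
parsed = next(csv.reader(io.StringIO(written), delimiter="\t"))

print(naive[0])
print(parsed[0])
```

In this notebook the leftover quote characters are harmless, because `normalizeString` later replaces every character other than letters and `.!?` with a space.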


In [15]:
#define path to new file
datafile = os.path.join("cornell movie-dialogs corpus", "formated_movie_lines.txt")
with open(datafile, 'rb') as file:
  lines= file.readlines()
for line in lines[:8]:
  print(line)

b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\r\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\r\n"
b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\r\n"
b"You're asking me out.  That's so cute. What's your name again?\tForget it.\r\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\r\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\r\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\tSeems like she could get a date easy enough...\r\n"
b'Why?\tUnsolved myster

## Load and preprocess data

In [16]:
# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

class Vocabulary:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    # Remove words below a certain count threshold
    def trim(self, min_count):
        keep_words = []

        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)

        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
        ))

        # Reinitialize dictionaries
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3 # Count default tokens

        for word in keep_words:
            self.addWord(word)
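
The effect of `trim` can be illustrated without the class. The sketch below assumes a plain `collections.Counter` in place of `word2count` and uses made-up sentences; it keeps only words seen at least `min_count` times and then reassigns indices starting after the three reserved tokens.

```python
from collections import Counter

PAD_token, SOS_token, EOS_token = 0, 1, 2

# Made-up, already-normalized sentences
sentences = ["i am here .", "i am not here .", "where ?"]
min_count = 2

# Count every space-separated token across all sentences
word_counts = Counter(word for s in sentences for word in s.split(" "))

# Keep words that meet the threshold, preserving first-seen order
keep_words = [w for w, c in word_counts.items() if c >= min_count]

# Rebuild the index mapping; indices 0-2 stay reserved for PAD, SOS, EOS
word2index = {w: i for i, w in enumerate(keep_words, start=3)}

print(keep_words)
print(word2index)
```

Note that `trim` rebuilds its dictionaries through `addWord`, so each kept word's count in `word2count` restarts at 1 after trimming.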

In [17]:
# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

In [18]:
unicodeToAscii('François Chollet')

'Francois Chollet'

In [19]:
# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s
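
To see what the normalization does, the sketch below repeats the two helpers so that it runs on its own and applies them to a made-up sentence. Apostrophes, commas, and digits are replaced by spaces, and `.`, `!`, `?` become separate tokens.

```python
import re
import unicodedata

def unicodeToAscii(s):
    # Decompose accented characters and drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)       # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)   # replace other characters with a space
    s = re.sub(r"\s+", r" ", s).strip()     # collapse repeated whitespace
    return s

normalized = normalizeString("  You're SWEET, François!  ")
print(normalized)  # -> you re sweet francois !
```

This is why contractions such as "you're" appear as two tokens, `you re`, in the processed pairs.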

In [20]:
datafile = os.path.join('cornell movie-dialogs corpus', 'formated_movie_lines.txt')
# Read the files and split into lines
print("Reading and processing file......Please Wait")
with open(datafile, encoding='utf-8') as f:
  lines = f.read().strip().split('\n')
#Split every line into pairs and normalise
pairs= [[normalizeString(s) for s in pair.split('\t')] for pair in lines]
print("Done Reading")
voc= Vocabulary("cornell movie dialogs corpus")

Reading and processing file......Please Wait
Done Reading


In [21]:
MAX_LENGTH = 10
def filterPair(p):
  # Keep a pair only if both sentences have fewer than MAX_LENGTH words, leaving room for the EOS token
  return len(p[0].split()) < MAX_LENGTH and len(p[1].split()) < MAX_LENGTH

def filterPairs(pairs):
  return [pair for pair in pairs if filterPair(pair)]
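
Because the comparison is strict, a sentence passes only if it has at most `MAX_LENGTH - 1` words. A small check on made-up pairs, with the filter repeated so the snippet is self-contained:

```python
MAX_LENGTH = 10

def filterPair(p):
    # Both sentences must have fewer than MAX_LENGTH whitespace-separated tokens
    return len(p[0].split()) < MAX_LENGTH and len(p[1].split()) < MAX_LENGTH

nine_words = " ".join(["word"] * 9)
ten_words = " ".join(["word"] * 10)

toy_pairs = [
    [nine_words, "ok ."],   # kept: 9 and 2 tokens
    [ten_words, "ok ."],    # dropped: input has 10 tokens
    ["hi .", ten_words],    # dropped: reply has 10 tokens
]
kept = [p for p in toy_pairs if filterPair(p)]
print(len(kept))
```

Only the first pair survives, which shows that a single over-long side is enough to discard the whole pair.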

In [22]:
pairs = [pair for pair in pairs if len(pair) > 1]
print(f"There are {len(pairs)} pairs/conversations in the dataset")
pairs = filterPairs(pairs)
print(f"After filtering, there are {len(pairs)} pairs/conversations")

There are 221282 pairs/conversations in the dataset
After filtering, there are 64271 pairs/conversations


In [23]:
# Loop through each pair and add the question and reply sentences to the vocabulary
for pair in pairs:
  voc.addSentence(pair[0])
  voc.addSentence(pair[1])
print("Counted Words:", voc.num_words) 

Counted Words: 18008


In [24]:
for pair in pairs[:5]:
  print(pair)

['there .', 'where ?']
['you have my word . as a gentleman', 'you re sweet .']
['hi .', 'looks like things worked out tonight huh ?']
['you know chastity ?', 'i believe we share an art instructor']
['have fun tonight ?', 'tons']


In [25]:
MIN_COUNT = 3    # Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break

        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs


# Trim voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)

keep_words 7823 / 18005 = 0.4345
Trimmed from 64271 pairs to 53165, 0.8272 of total
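
The pair filter inside `trimRareWords` keeps a pair only when every word of both sentences survived trimming. A compact sketch of the same check using `all` follows, assuming a hypothetical set `kept_vocab` in place of `voc.word2index` and made-up pairs:

```python
# Hypothetical stand-in for the keys of voc.word2index after trimming
kept_vocab = {"hi", ".", "where", "?", "you", "know"}

toy_pairs = [
    ["hi .", "where ?"],
    ["you know chastity ?", "hi ."],   # "chastity" is not in the vocabulary
]

def all_words_known(sentence, vocab):
    """Return True when every space-separated token is in the vocabulary."""
    return all(word in vocab for word in sentence.split(' '))

keep_pairs = [
    pair for pair in toy_pairs
    if all_words_known(pair[0], kept_vocab) and all_words_known(pair[1], kept_vocab)
]
print(keep_pairs)
```

Dropping whole pairs rather than replacing rare words with an unknown-word token keeps the vocabulary closed, at the cost of discarding some training data (about 17% of the pairs in the run above).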


## Prepare Data for Models