<a href="https://colab.research.google.com/github/ecuadrafoy/toolbox/blob/master/Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Chatbot Using Attention-Based Neural Networks

We will expand on our sequence-to-sequence models from the previous chapter, adding something called attention to our models.

This improvement means that our model learns where in the input sentence to look to obtain the information it needs, rather than relying on the whole input sentence to make its decision.

## The theory of attention

* Previously, the encoder compressed the whole input sentence into a single hidden state and relied on the decoder to transform it into the output. Decoding from that one hidden state is not necessarily the most efficient approach, because it represents the entirety of the input sentence.
  * We do not need to consider the entire input sentence, just the parts that are relevant to the prediction we are trying to make.
  * By using attention within our sequence-to-sequence neural network, we can teach our model to look only at the relevant parts of the input when making its prediction, resulting in a more efficient and accurate model.
  * There are two main types of attention mechanisms that we can implement: local and global attention.

### Local Attention
In local attention, our model only looks at a few hidden states from the encoder.
  * For example, if we are performing a sentence translation task and we are calculating the second word in our translation, the model may wish to only look at the hidden states from the encoder related to the second word in the input sentence.

![](https://learning.oreilly.com/library/view/hands-on-natural-language/9781789802740/image/B12365_08_2.jpg)

  * We first start by calculating the aligned position, pt, from our final hidden state, hn. This tells us which hidden states we need to be looking at to make our prediction. 
  * We then calculate our local weights and apply them to our hidden states in order to determine our context vector
    * These weights may tell us to pay more attention to the most relevant hidden state (h2) but less attention to the preceding hidden state (h1)
  * We then take our context vector and pass it forward to our decoder in order to make its prediction
    * Here instead of passing the final hidden state hn, we only consider the relevant hidden states that our model deems necessary to make its prediction.
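
The local attention steps above can be sketched in plain Python (rather than PyTorch) so that the arithmetic stays visible. This is only an illustration under stated assumptions: the window half-width `D`, the dot-product scoring, and the helper name `local_attention` are hypothetical choices, and the aligned position `p_t` is simply passed in rather than predicted from the hidden state.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def local_attention(encoder_states, decoder_state, p_t, D=1):
    """Attend only to encoder states within D positions of p_t (hypothetical helper)."""
    lo = max(0, p_t - D)
    hi = min(len(encoder_states), p_t + D + 1)
    window = encoder_states[lo:hi]
    # Local weights: score each state in the window against the decoder state.
    weights = softmax([dot(h, decoder_state) for h in window])
    # Context vector: weighted sum of the hidden states inside the window.
    size = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, window)) for i in range(size)]
    return (lo, hi), weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
span, weights, context = local_attention(encoder_states, [0.0, 2.0], p_t=1, D=1)
print(span)     # (0, 3): only positions 0, 1 and 2 are considered
print(weights)  # the first state gets the smallest weight
print(context)
```

The fourth encoder state never contributes to the context vector, because it falls outside the window around `p_t`.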

### Global Attention

The global attention model works in a very similar way. However, instead of only looking at a few of the hidden states, we want to look at all of our model's hidden states.

![](https://learning.oreilly.com/library/view/hands-on-natural-language/9781789802740/image/B12365_08_3.jpg)

* Our model now looks at all the hidden states and calculates global weights across all of them.
* This allows our model to look at any part of the input sentence that it considers relevant, instead of being limited to a local area determined by the local attention methodology.
* The global attention framework is like learning a mask that only lets through the hidden states that are relevant to our prediction.
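
As a rough illustration of global attention, the following plain-Python sketch scores every encoder hidden state against the current decoder state, turns the scores into weights with a softmax, and builds the context vector as a weighted sum. The dot-product score matches the `dot_score` method used later in this notebook; the function name `global_attention` and the toy vectors are assumptions made for the example.

```python
import math

def global_attention(encoder_states, decoder_state):
    """Weight every encoder state by its dot product with the decoder state."""
    scores = [sum(h_i * d_i for h_i, d_i in zip(h, decoder_state))
              for h in encoder_states]
    # Softmax over all positions (shifted by the max for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum over all encoder states.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(len(decoder_state))]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
weights, context = global_attention(encoder_states, [0.0, 1.0])
print(weights)  # the third state receives the largest weight
print(context)
```

Because the weights are a softmax, irrelevant states are not removed outright; they simply receive weights close to zero, which is why the mask analogy is a soft one.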


## Training the Chatbot
* Our chatbot will take a line of human-entered input and respond to it with a generated sentence.
* The perfect dataset for a task such as this would be actual chat logs from conversations between two human users.
* Movie scripts consist of conversations between two or more characters. While this data is not naturally in the format we would like it to be in, we can easily transform it into the format that we need. 
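
One way to picture the transformation is to pair each line of a conversation with the line that follows it and join the two with a tab, which is the layout of `formatted_movie_lines.txt`. The sketch below is an assumption about how such a file could be produced, not the script that actually generated it; `lines_to_pairs` is a hypothetical helper and the sample lines are shortened.

```python
def lines_to_pairs(conversation):
    """Pair each utterance with the reply that follows it (hypothetical helper)."""
    return [(conversation[i], conversation[i + 1])
            for i in range(len(conversation) - 1)]

conversation = [
    "Can we make this quick?",
    "Well, I thought we'd start with pronunciation.",
    "Not the hacking and gagging and spitting part. Please.",
]

qa_pairs = lines_to_pairs(conversation)
# One tab-separated "input<TAB>response" line per pair.
formatted = ["\t".join(pair) for pair in qa_pairs]
for line in formatted:
    print(line)
```

Each response becomes the input of the next pair, so a conversation of n lines yields n - 1 training pairs.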

In [1]:
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import absolute_import
import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math

USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [9]:
corpus = "movie_corpus"
corpus_name = "movie_corpus"

In [3]:
path = "/content/drive/My Drive/NLP_PyTorch/Glove"
datafile = os.path.join(path, "formatted_movie_lines.txt")

In [4]:
with open(datafile, 'rb') as file:
    lines = file.readlines()
    
for line in lines[:3]:
    print(str(line) + '\n')

b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\n"

b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\n"

b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"



## Creating the Vocabulary
* In the past, our corpus has consisted of several dictionaries holding the unique words in our corpus and lookups between words and indices. However, we can do this in a far more elegant way by creating a vocabulary class that contains all of the elements required:
  * We start by creating our Vocabulary class.
  * We initialize this class with empty dictionaries—word2index and word2count.
  * We also initialize the index2word dictionary with placeholders for our padding tokens, as well as our Start-of-Sentence (SOS) and End-of-Sentence (EOS) tokens.
  * We keep a running count of the number of words in our vocabulary, too (which is 3 to start with, as our vocabulary already contains the three tokens mentioned).

In [5]:

PAD_token = 0 
SOS_token = 1
EOS_token = 2

class Vocabulary:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3

    def addWord(self, w): 
      # addWord takes a word as input. If this is a new word that is not already in our vocabulary, we add this word to our indices
      # set the count of this word to 1, and increment the total number of words in our vocabulary by 1. 
      #If the word in question is already in our vocabulary, we simply increment the count of this word by 1
        if w not in self.word2index:
            self.word2index[w] = self.num_words
            self.word2count[w] = 1
            self.index2word[self.num_words] = w
            self.num_words += 1
        else:
            self.word2count[w] += 1        
        
    def addSentence(self, sent):
      # We also use the addSentence function to apply the addWord function to all the words within a given sentence
        for word in sent.split(' '):
            self.addWord(word)
# One thing we can do to speed up the training of our model is reduce the size of our vocabulary
# An easy way to do this is to remove any low-frequency words from our vocabulary
# Any words occurring just once or twice in our dataset are unlikely to have huge predictive power

    def trim(self, min_cnt):
        if self.trimmed:
            return
        self.trimmed = True

        words_to_keep = []

        for k, v in self.word2count.items(): 
          # Loop through the word count dictionary and keep every word that occurs at least min_cnt times
            if v >= min_cnt:
                words_to_keep.append(k)

        print('Words to Keep: {} / {} = {:.2%}'.format(
            len(words_to_keep), len(self.word2index), len(words_to_keep) / len(self.word2index)
        ))
# Finally, our indices are rebuilt from the new words_to_keep list
# We set all the indices to their initial empty values and then repopulate them by looping through our kept words with the addWord function:

        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3

        for w in words_to_keep:
            self.addWord(w)
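
To make the bookkeeping concrete, here is a small stand-alone sketch that mirrors the logic of `addWord` and the counting step of `trim` using plain dictionaries. It is a simplified illustration rather than a replacement for the class above, and the sample sentence and the threshold of 2 are made up for the example.

```python
PAD, SOS, EOS = 0, 1, 2

word2index, word2count = {}, {}
index2word = {PAD: "PAD", SOS: "SOS", EOS: "EOS"}
num_words = 3  # the three reserved tokens occupy indices 0-2

for word in "yes i know yes".split(" "):
    if word not in word2index:
        # New word: give it the next free index and start its count at 1.
        word2index[word] = num_words
        index2word[num_words] = word
        word2count[word] = 1
        num_words += 1
    else:
        # Known word: only its count changes.
        word2count[word] += 1

print(word2index)  # {'yes': 3, 'i': 4, 'know': 5}
print(word2count)  # {'yes': 2, 'i': 1, 'know': 1}

# Trimming keeps only the words seen at least min_cnt times.
min_cnt = 2
kept = [w for w, c in word2count.items() if c >= min_cnt]
print(kept)        # ['yes']
```

Note that real words start at index 3, so index 0 can be used safely as the padding value later on.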

## Loading the Data


In [6]:
def unicodeToAscii(s):
  # converting it from Unicode into ASCII format
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def cleanString(s):
  # process our input strings so that they are all in lowercase and do not contain any trailing whitespace or punctuation, except the most basic characters
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# This function reads our data file into lines and then applies the cleanString function to every line
# It also creates an instance of the Vocabulary class
def readVocs(datafile, corpus_name):
    lines = open(datafile, encoding='utf-8').\
        read().strip().split('\n')
    pairs = [[cleanString(s) for s in l.split('\t')] for l in lines]
    voc = Vocabulary(corpus_name)
    return voc, pairs

# filter our input pairs by their maximum length
# done to reduce the potential dimensionality of our model.


def filterPair(p, max_length):
  # filterPair, returns a Boolean value based on whether the current line has an input and output length that is less than the maximum length.
    return len(p[0].split(' ')) < max_length and len(p[1].split(' ')) < max_length

def filterPairs(pairs, max_length):
  #  Our second function, filterPairs, simply applies this condition to all the pairs within our dataset, only keeping the ones that meet this condition
    return [pair for pair in pairs if filterPair(pair, max_length)]

# final function that applies all the previous functions we have put together and run it to create our vocabulary and data pairs
def loadData(corpus, corpus_name, datafile, max_length):
    voc, pairs = readVocs(datafile, corpus_name)
    print(str(len(pairs)) + " Sentence pairs")
    pairs = filterPairs(pairs,max_length)
    print(str(len(pairs))+ " Sentence pairs after trimming")
    for p in pairs:
        voc.addSentence(p[0])
        voc.addSentence(p[1])
    print(str(voc.num_words) + " Distinct words in vocabulary")
    return voc, pairs

In [10]:
max_length = 10 
voc, pairs = loadData(corpus, corpus_name, datafile, max_length)

221282 Sentence pairs
64271 Sentence pairs after trimming
18008 Distinct words in vocabulary


In [12]:
print("Example Pairs:")
for pair in pairs[-20:]:
    print(pair)

Example Pairs:
['yes i have mine .', 'and i have mine .']
['yes . . .yes you have yours .', 'why don t we talk inside ?']
['yes i know .', 'it wouldn t be fair to her .']
['it wouldn t be fair to her .', 'yes i know .']
['are you ready for me ?', 'mmmmmmmmm !']
['mmmmmmmmm !', 'ready for fuchsmachen ? ? ?']
['ready for fuchsmachen ? ? ?', 'mmmmmmmmmmmmmmm !']
['his what ? ?', 'his schwanzstucker .']
['his schwanzstucker .', 'whew ! a nineteen inch drill .']
['how long is it so far ?', 'four']
['four', 'three minutes to go !']
['three minutes to go !', 'yes .']
['another fifteen seconds to go .', 'do something ! stall them !']
['yes sir name please ?', 'food !']
['food !', 'do you have a reservation ?']
['do you have a reservation ?', 'food ! !']
['grrrhmmnnnjkjmmmnn !', 'franz ! help ! lunatic !']
['what o clock is it mr noggs ?', 'eleven o clock my lorj']
['stuart ?', 'yes .']
['yes .', 'how quickly can you move your artillery forward ?']
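
The apostrophe-free forms in the pairs above (for example `don t`) come from the normalization step. The stand-alone sketch below repeats the same regular expressions as `cleanString` in a single function, here called `normalize` purely for illustration, so the effect of each step can be tried on individual strings.

```python
import re
import unicodedata

def normalize(text):
    # Lowercase and trim, then strip accents by dropping combining marks.
    text = text.lower().strip()
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")
    # Put a space before sentence punctuation so it becomes its own token.
    text = re.sub(r"([.!?])", r" \1", text)
    # Replace every run of other non-letter characters with a single space.
    text = re.sub(r"[^a-zA-Z.!?]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Why don't we talk inside?"))  # why don t we talk inside ?
print(normalize("Café, s'il vous plaît!"))     # cafe s il vous plait !
```

Apostrophes, commas, and digits are all replaced by spaces, which is why contractions end up split into two tokens.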


## Removing Rare Words

In [13]:
def removeRareWords(voc, all_pairs, minimum):
# Create a function to remove these rare words and call the trim method from our vocabulary as our first step
    voc.trim(minimum) # first calculate the percentage of words that we will keep within our model
    
    pairs_to_keep = []
    
    for p in all_pairs:
      # we loop through all the words in the input and output sentences. 
      # If for a given pair either the input or output sentence has a word that isn't in our new trimmed corpus, we drop this pair from our dataset
        keep = True
        
        for word in p[0].split(' '):
            if word not in voc.word2index:
                keep = False
                break
        for word in p[1].split(' '):
            if word not in voc.word2index:
                keep = False
                break

        if keep:
            pairs_to_keep.append(p)

    print("Trimmed from {} pairs to {}, {:.2%} of total".format(len(all_pairs)\
        , len(pairs_to_keep), len(pairs_to_keep)/ len(all_pairs)))
    return pairs_to_keep


minimum_count = 3
pairs = removeRareWords(voc, pairs, minimum_count)

Words to Keep: 7823 / 18005 = 43.45%
Trimmed from 64271 pairs to 53165, 82.72% of total
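
The effect of trimming on the sentence pairs can be reproduced on a toy dataset. This sketch uses `collections.Counter` instead of the `Vocabulary` class, and the three pairs and the threshold of 2 are invented for the example; it only illustrates the rule that a pair is dropped as soon as either sentence contains a rare word.

```python
from collections import Counter

toy_pairs = [
    ["yes i know .", "it wouldn t be fair ."],
    ["yes .", "i know ."],
    ["stuart ?", "yes ."],
]

# Count every word across both sides of every pair.
counts = Counter(w for pair in toy_pairs for sent in pair for w in sent.split(" "))

min_count = 2
kept_words = {w for w, c in counts.items() if c >= min_count}

# Keep a pair only if every word on both sides survived the trim.
kept_pairs = [p for p in toy_pairs
              if all(w in kept_words for sent in p for w in sent.split(" "))]

print(sorted(kept_words))  # ['.', 'i', 'know', 'yes']
print(kept_pairs)          # [['yes .', 'i know .']]
```

A single rare word is enough to discard a whole pair, which is why the share of pairs kept is much higher than the share of words kept in the output above.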


## Transforming Data to Tensors
* Our model will not take raw text as input, but rather tensor representations of sentences.
* We will also not process our sentences one by one, but in small batches.
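
Before looking at the tensor code, it may help to see the padding and masking idea on plain lists. The sketch below uses `itertools.zip_longest`, the same call as `zeroPad` in the next cell; the token indices are made up. Note that `zip_longest` also transposes the batch, so each row of the result is one time step across all sentences.

```python
import itertools

PAD_token, EOS_token = 0, 2

# Three indexed sentences, longest first, each ending in EOS.
batch = [[5, 6, 7, EOS_token], [9, 4, EOS_token], [8, EOS_token]]

# Pad to the longest sentence; rows are time steps, columns are sentences.
padded = list(itertools.zip_longest(*batch, fillvalue=PAD_token))
# True where there is a real token, False where there is padding.
pad_mask = [[token != PAD_token for token in step] for step in padded]
seq_lengths = [len(seq) for seq in batch]

print(padded)       # [(5, 9, 8), (6, 4, 2), (7, 2, 0), (2, 0, 0)]
print(pad_mask[2])  # [True, True, False]
print(seq_lengths)  # [4, 3, 2]
```

This time-major layout is why the printed input and target tensors further down have one column per sentence rather than one row per sentence.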

In [14]:
def indexFromSentence(voc, sent):
  # grabs the index of each word in the sentence from the vocabulary and appends an EOS token to the end
    return [voc.word2index[w] for w in sent.split(' ')] + [EOS_token]

def zeroPad(l, fillvalue=PAD_token):
  # pads any tensors with zeroes so that all of the sentences within the tensor are effectively the same length
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

# to generate our input tensor, we apply both of these functions
def inputVar(l, voc):
    indexes_batch = [indexFromSentence(voc, sentence) for sentence in l] # First, we get the indices of our input sentence
    padList = zeroPad(indexes_batch) # then apply padding
    padTensor = torch.LongTensor(padList) # transform the output into LongTensor
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch]) # obtain the lengths of each of our input sentences and output them as a tensor
    return padTensor, lengths

def getMask(l, value=PAD_token): # create a Boolean mask to ignore padded tokens
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq: # returns 1 if the output consists of a word and 0 if it consists of a padding token
            if token == value:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

def outputVar(l, voc):
  # apply this to our outputVar function
  # This is identical to the inputVar function, except that along with the indexed output tensor and the tensor of lengths, we also return the Boolean mask of our output tensor
    indexes_batch = [indexFromSentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeroPad(indexes_batch)
  # This Boolean mask just returns True when there is a word within the output tensor and False when there is a padding token
    mask = torch.BoolTensor(getMask(padList))
    padTensor = torch.LongTensor(padList)
    return padTensor, mask, max_target_len

def batch2Train(voc, batch):
  #in order to create our input and output batches concurrently, we loop through the pairs in our batch and create input and output tensors 
  # for both pairs using the functions we created previously
    batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    
    input_batch = []
    output_batch = []
    
    for p in batch:
        input_batch.append(p[0])
        output_batch.append(p[1])
        
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    
    return inp, lengths, output, mask, max_target_len

In [15]:
test_batch_size = 5
batches = batch2Train(voc, [random.choice(pairs) for _ in range(test_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("Input:")
print(input_variable)

print("Target:")
print(target_variable)

print("Mask:")
print(mask)

Input:
tensor([[  25,   25,  331, 3197,  598],
        [ 988,  505,  117, 5580,    7],
        [ 117,   60,  101,    4,    2],
        [  84,    4, 2413,    2,    0],
        [ 219,   25,    6,    0,    0],
        [  25,  505,    2,    0,    0],
        [ 200,  746,    0,    0,    0],
        [3236, 4607,    0,    0,    0],
        [   4,    4,    0,    0,    0],
        [   2,    2,    0,    0,    0]])
Target:
tensor([[   7,  197,   34, 1014,   25],
        [  14,  117,    4,  124, 1962],
        [  64,   60,    2,  125,   86],
        [ 266, 1313,    0,   76,  144],
        [   4,    4,    0,  650,    7],
        [   2,  385,    0,   64,    4],
        [   0,  169,    0,  306,    2],
        [   0,   83,    0, 1014,    0],
        [   0,    4,    0,   66,    0],
        [   0,    2,    0,    2,    0]])
Mask:
tensor([[ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True, False,  True,  True

## Constructing the Model

In [16]:
#The Encoder
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding
# We now define our GRU, taking into account the size of our input, the number of layers, and whether we should implement dropout
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True) #Applying bidirectionality

    def forward(self, input_seq, input_lengths, hidden=None):
# We first embed our input sentences and then use the pack_padded_sequence function on our embeddings
        embedded = self.embedding(input_seq)
        # Packing lets the GRU skip the padded positions of the shorter sequences.
        # Recent PyTorch releases expect the lengths tensor to live on the CPU, hence .cpu()
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths.cpu())
        # We then pass the packed sequences through our GRU to perform a forward pass
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, : ,self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden
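
The slicing in the last step of `forward` can look cryptic. A bidirectional GRU returns, for every time step, the forward features followed by the backward features along the last dimension, so the output is `2 * hidden_size` wide. The following toy example, with made-up numbers and a single time step, shows what summing the two halves does.

```python
hidden_size = 2

# One time step of one sentence: forward features first, backward features second.
step_output = [0.1, 0.2, 0.3, 0.4]

forward_half = step_output[:hidden_size]
backward_half = step_output[hidden_size:]

# Element-wise sum brings the width back down to hidden_size.
summed = [f + b for f, b in zip(forward_half, backward_half)]
print(len(step_output), "->", len(summed))  # 4 -> 2
```

Summing rather than concatenating keeps the encoder outputs the same width as the decoder's hidden state, which the dot-product attention used later relies on.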

## Constructing the Attention Module
* We will apply this attention module to our encoder's outputs so that the decoder can learn from the relevant parts of them.

In [17]:
class Attn(nn.Module):
    def __init__(self, hidden_size):
        super(Attn, self).__init__()
        self.hidden_size = hidden_size

    def dot_score(self, hidden, encoder_output): 
      # This function calculates the dot product of the decoder's current hidden state with each of the encoder's outputs
        return torch.sum(hidden * encoder_output, dim=2)

    def forward(self, hidden, encoder_outputs):
      # First, calculate the attention weights/energies based on the dot_score method, then transpose the results, and return the softmax transformed probability scores
        attn_energies = self.dot_score(hidden, encoder_outputs)
        attn_energies = attn_energies.t()
        return F.softmax(attn_energies, dim=1).unsqueeze(1)

## Constructing the Decoder

In [18]:
class DecoderRNN(nn.Module):
    def __init__(self, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(DecoderRNN, self).__init__()

        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout
# We will create an embedding layer and a corresponding dropout layer
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
# We use GRUs again for our decoder; however, this time, we do not need to make our GRU layer bidirectional as we will be decoding the output from our encoder sequentially
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
# Create two linear layers: one that combines the concatenated GRU output and context vector
        self.concat = nn.Linear(2 * hidden_size, hidden_size)
# and one regular layer for calculating our output over the vocabulary
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(hidden_size)

    def forward(self, input_step, last_hidden, encoder_outputs):
      # the forward pass will be used one step (word) at a time

        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
      # We start by getting the embedding of the current input word and making a forward pass through the GRU layer to get our output and hidden states:
        rnn_output, hidden = self.gru(embedded, last_hidden)
      # we use the attention module to get the attention weights from the GRU output
      # These weights are then multiplied by the encoder outputs to effectively give us a weighted sum of our attention weights and our encoder output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
      # We then concatenate our weighted context vector with the output of our GRU and apply a tanh function to get our final concatenated output
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
      # we simply use this final concatenated output to predict the next word and apply a softmax function
      # The forward pass finally returns this output, along with the final hidden state
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        return output, hidden

In [20]:
# define a loss function that applies a Boolean mask over our outputs and only calculates the loss of the non-padded tokens
def NLLMaskLoss(inp, target, mask):
  # we calculate cross-entropy loss across the whole output tensors.
  # However, to get the total loss, we only average over the elements of the tensor that are selected by the Boolean mask
    TotalN = mask.sum()
    CELoss = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = CELoss.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, TotalN.item()
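
The masked loss can be checked by hand on a tiny example. The sketch below redoes the same calculation with plain lists: take the probability the model assigned to each target token, compute its negative log, and average only over the positions where the mask is true. The probabilities, targets, and the name `masked_nll` are invented for illustration.

```python
import math

def masked_nll(probs, targets, mask):
    """Average -log p(target) over the positions where mask is True."""
    losses = [-math.log(p[t]) for p, t, keep in zip(probs, targets, mask) if keep]
    return sum(losses) / len(losses), len(losses)

# One time step for a batch of three sentences, vocabulary of three tokens.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
targets = [0, 1, 0]
mask = [True, True, False]  # the third sentence has already ended (padding)

loss, n_tokens = masked_nll(probs, targets, mask)
print(round(loss, 4), n_tokens)  # 0.2899 2
```

The third sentence contributes nothing to the loss, so the model is never rewarded or penalized for what it predicts at padded positions.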

In [22]:
# For the majority of our training, we need two main functions—one function, train(), 
# which performs training on a single batch of our training data and another function, trainIters()
# which iterates through our whole dataset and calls train() on each of the individual batches. 
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=max_length):
#  defining train() in order to train on a single batch of data
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_variable = input_variable.to(device)
    lengths = lengths.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)

    loss = 0
    print_losses = []
    n_totals = 0
# perform a forward pass of the inputs and sequence lengths through the encoder to get the output and hidden states:

    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)
# we create our initial decoder input, starting with SOS tokens for each sentence. We then set the initial hidden state of our decoder to be equal to that of the encoder
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    decoder_hidden = encoder_hidden[:decoder.n_layers]

    use_TF = random.random() < teacher_forcing_ratio
# With teacher forcing, we pass each step through the decoder and then set the next input to the true output (target).
# We calculate and accumulate the loss at every step using our masked loss function
    if use_TF:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            decoder_input = target_variable[t].view(1, -1)
            mask_loss, nTotal = NLLMaskLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
      # Without teacher forcing on a given batch, the procedure is almost identical.
      # However, instead of using the true output as the next input into the sequence, we use the one generated by the model
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            mask_loss, nTotal = NLLMaskLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
#  as with all of our models, the final steps are to perform backpropagation, 
# implement gradient clipping, and step through both of our encoder and decoder optimizers to update the weights using gradient descent
    loss.backward()

    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals
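
The only difference between the two branches of `train()` is what gets fed back into the decoder at the next step. The toy sketch below isolates that choice. The ratio of 0.5, the token values, and the helper `next_decoder_input` are assumptions made for illustration; as in `train()`, the decision is drawn once per batch rather than once per step.

```python
import random

def next_decoder_input(target_token, predicted_token, use_teacher_forcing):
    """Feed back the true token with teacher forcing, else the model's own guess."""
    return target_token if use_teacher_forcing else predicted_token

teacher_forcing_ratio = 0.5  # assumed value for this example
use_tf = random.random() < teacher_forcing_ratio

targets = [7, 14, 2]      # true tokens for three steps
predictions = [7, 64, 2]  # what a hypothetical decoder predicted

fed_back = [next_decoder_input(t, p, use_tf) for t, p in zip(targets, predictions)]
print("teacher forcing:", use_tf, "-> next inputs:", fed_back)
```

With teacher forcing the wrong guess at the second step (64) is replaced by the true token (14), so one early mistake does not derail the rest of the sequence during training.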

In [23]:
def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer,\
               decoder_optimizer, embedding, encoder_n_layers, decoder_n_layers, \
               save_dir, n_iteration, batch_size, print_every, save_every, clip, corpus_name, loadFilename):
# repeatedly calls our training function on different batches of input data.
# Splitting our data into batches using the batch2Train function
    training_batches = [batch2Train(voc, [random.choice(pairs) for _ in range(batch_size)])
                      for _ in range(n_iteration)]
# create a few variables that will allow us to count iterations and keep track of the total loss over each epoch
    print('Starting ...')
    start_iteration = 1
    print_loss = 0
    if loadFilename:
        start_iteration = checkpoint['iteration'] + 1
# define our training loop. For each iteration, we get a training batch from our list of batches
    print("Beginning Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        # Extract relevant fields from our batch and run a single training iteration using these parameters
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss += loss
# On every iteration, we also make sure we print our progress so far, keeping track of how many iterations we have completed and what our loss was for each epoch
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print("Iteration: {}; Percent done: {:.1f}%; Mean loss: {:.4f}".format(iteration, iteration / n_iteration * 100, print_loss_avg))
            print_loss = 0
# we also need to save our model state after every few epochs
        if (iteration % save_every == 0):
            directory = os.path.join(save_dir, model_name, corpus_name, '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size))
            if not os.path.exists(directory):
                os.makedirs(directory)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))

## Defining a Greedy Decoder
* Defining a class that will allow us to decode the encoded input and produce text
* This simply means that at each step of the decoder, our model takes the word with the highest predicted probability as the output.
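
Greedy decoding can be illustrated without a neural network at all. In the sketch below, a lookup table stands in for the decoder: given the previous token, it returns a made-up probability distribution over a five-token vocabulary. The function `greedy_decode` and the table are hypothetical; like the class in the next cell, the loop always runs for `max_length` steps and leaves it to the caller to drop `EOS` tokens afterwards.

```python
SOS_token, EOS_token = 1, 2

def greedy_decode(step_fn, start_token, max_length):
    """At every step, pick the most probable token and feed it back in."""
    tokens, scores = [], []
    current = start_token
    for _ in range(max_length):
        probs = step_fn(current)
        best = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(best)
        scores.append(probs[best])
        current = best
    return tokens, scores

# Stand-in for the decoder: previous token -> next-token probabilities.
table = {
    1: [0.0, 0.0, 0.1, 0.6, 0.3],    # after SOS, token 3 is most likely
    3: [0.0, 0.0, 0.2, 0.1, 0.7],    # after token 3, token 4
    4: [0.0, 0.0, 0.9, 0.05, 0.05],  # after token 4, EOS
    2: [0.0, 0.0, 1.0, 0.0, 0.0],    # after EOS, EOS again
}

tokens, scores = greedy_decode(table.__getitem__, SOS_token, max_length=4)
print(tokens)  # [3, 4, 2, 2]
print(scores)  # [0.6, 0.7, 0.9, 1.0]
```

Greedy search never revisits an earlier choice, so it is fast but can miss a sequence whose first word is slightly less likely and whose overall probability is higher.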

In [24]:
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
      # initializing the GreedySearchDecoder class with our pretrained encoder and decoder
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
      # pass the input through our encoder to get our encoder's output and hidden state
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        decoder_hidden = encoder_hidden[:self.decoder.n_layers]
      # create the decoder input with SOS tokens and initialize empty tensors to append the decoded words and scores to
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
      # add a max function to obtain the highest-scoring predicted word and its score, which we then append to the all_tokens and all_scores variables
        for _ in range(max_length):
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            decoder_input = torch.unsqueeze(decoder_input, 0)
      # After the whole sequence has been iterated over, we return the complete predicted sentence:

        return all_tokens, all_scores

In [25]:
def evaluate(encoder, decoder, searcher, voc, sentence, max_length=max_length):
  # Takes our input sentence and returns the predicted output words
    indices = [indexFromSentence(voc, sentence)] # We start by transforming our input sentence into indices using our vocabulary
    lengths = torch.tensor([len(indexes) for indexes in indices]) # obtain a tensor of the lengths of each of these sentences
    input_batch = torch.LongTensor(indices).transpose(0, 1) # transpose the batch so that time steps come first
  # we assign our lengths and input tensors to the relevant devices.
    input_batch = input_batch.to(device)
    lengths = lengths.to(device)
  # Next, run the inputs through the searcher (GreedySearchDecoder) to obtain the word indices of the predicted output.
  # Finally, we transform these word indices back into word tokens before returning them as the function output
    tokens, scores = searcher(input_batch, lengths, max_length)
    decoded_words = [voc.index2word[token.item()] for token in tokens]
    return decoded_words


def runChatBot(encoder, decoder, searcher, voc):
    # runChatBot acts as the interface with our chatbot:
    # it takes human-typed input and prints the chatbot's response.
    # The loop continues until we interrupt the function or type quit as our input
    input_sentence = ''
    while True:
        try:
            input_sentence = input('> ')  # Take the user's input
            if input_sentence == 'quit':
                break
            input_sentence = cleanString(input_sentence)  # Normalize it
            output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)
            # we take these output words and format them, ignoring the EOS and padding tokens, before printing the chatbot's response
            output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')] 
            print('Response:', ' '.join(output_words))

        except KeyError:
            print("Error: Unknown Word")
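
The filter in `runChatBot` removes `EOS` and `PAD` tokens but keeps everything generated after the first `EOS`, because the greedy search always runs for `max_length` steps. One possible alternative, shown here only as a sketch, is to cut the response at the first `EOS`. The helper name `trim_response` is hypothetical and is not part of the notebook.

```python
from typing import List


def trim_response(words: List[str], eos: str = 'EOS', pad: str = 'PAD') -> List[str]:
    """Keep the words before the first EOS token, dropping any padding."""
    trimmed: List[str] = []
    for word in words:
        if word == eos:
            break  # everything after the first EOS is discarded
        if word != pad:
            trimmed.append(word)
    return trimmed


print(trim_response(['i', 'm', 'fine', '.', 'EOS', '.', '.', 'PAD']))
# ['i', 'm', 'fine', '.']
```

Whether this shortens the chatbot's replies depends on whether the trained model actually emits `EOS` before the repeated tokens, which this sketch does not check.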

## Training the Model



In [26]:
model_name = 'chatbot_model'
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.15
batch_size = 64

loadFilename = None
checkpoint_iter = 4000

if loadFilename:
    checkpoint = torch.load(loadFilename)
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']


print('Building encoder and decoder ...')
# We first load our embeddings from the vocabulary
embedding = nn.Embedding(voc.num_words, hidden_size)
# If we have already trained a model, we can load the trained embeddings layer
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# creating model instances using the defined hyperparameters
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = DecoderRNN(embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
# if we have already trained a model, we simply load the trained model states into our models
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)

encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')

Building encoder and decoder ...
Models built and ready to go!


In [27]:
save_dir = './'

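clip = 50.0                    # gradient clipping threshold
teacher_forcing_ratio = 1.0    # 1.0 means teacher forcing is always used
learning_rate = 0.0001         # encoder learning rate
decoder_learning_ratio = 5.0   # decoder learning rate = learning_rate * this ratio

epochs = 4000                  # passed to trainIters; the training log counts these as iterations

print_every = 1                # report the loss every iteration
save_every = 500               # checkpoint interval, in iterations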

encoder.train() # switch models to train mode
decoder.train()

print('Building optimizers ...')
#  create optimizers for both the encoder and decoder
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)

if loadFilename:
    encoder_optimizer.load_state_dict(encoder_optimizer_sd)
    decoder_optimizer.load_state_dict(decoder_optimizer_sd)
# The final step before running the training is to make sure any existing optimizer state
# (for example, state loaded from a checkpoint) lives on the training device.
# To do this, we loop through the optimizer states for both the encoder and decoder and
# move every tensor to `device`, so the same code works for both GPU and CPU training
for state in encoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)

for state in decoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)

print("Starting Training!")
trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, encoder_n_layers, decoder_n_layers, save_dir, epochs, batch_size,
           print_every, save_every, clip, corpus_name, loadFilename)

Building optimizers ...
Starting Training!
Starting ...
Beginning Training...
Iteration: 1; Percent done: 0.0%; Mean loss: 8.9635
Iteration: 2; Percent done: 0.1%; Mean loss: 8.8302
Iteration: 3; Percent done: 0.1%; Mean loss: 8.7005
Iteration: 4; Percent done: 0.1%; Mean loss: 8.4026
Iteration: 5; Percent done: 0.1%; Mean loss: 8.0496
Iteration: 6; Percent done: 0.1%; Mean loss: 7.4402
Iteration: 7; Percent done: 0.2%; Mean loss: 7.0134
Iteration: 8; Percent done: 0.2%; Mean loss: 6.8891
Iteration: 9; Percent done: 0.2%; Mean loss: 6.8356
Iteration: 10; Percent done: 0.2%; Mean loss: 6.5928
Iteration: 11; Percent done: 0.3%; Mean loss: 6.4639
Iteration: 12; Percent done: 0.3%; Mean loss: 5.9568
Iteration: 13; Percent done: 0.3%; Mean loss: 5.7199
Iteration: 14; Percent done: 0.4%; Mean loss: 5.6858
Iteration: 15; Percent done: 0.4%; Mean loss: 5.3308
Iteration: 16; Percent done: 0.4%; Mean loss: 5.3953
Iteration: 17; Percent done: 0.4%; Mean loss: 5.4125
Iteration: 18; Percent done: 0
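
Two of the hyperparameters above are easier to see with numbers. The following plain-Python sketch assumes that `clip` is used as a maximum gradient norm (the body of `trainIters` is not shown here, so this is an assumption) and that the decoder learning rate is `learning_rate * decoder_learning_ratio`, as in the optimizer setup. The function `clip_by_global_norm` is a hypothetical illustration of norm-based clipping, not a PyTorch API.

```python
import math
from typing import List, Tuple


def clip_by_global_norm(grads: List[float], max_norm: float) -> Tuple[List[float], float]:
    """Rescale gradients so that their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only shrink the gradients; never scale them up
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


learning_rate = 0.0001
decoder_learning_ratio = 5.0
decoder_lr = learning_rate * decoder_learning_ratio  # decoder steps are 5x larger

# Toy gradients with an L2 norm of 100, clipped to a maximum norm of 50
clipped, norm_before = clip_by_global_norm([60.0, 80.0], max_norm=50.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(decoder_lr, norm_before, norm_after)
```

Clipping keeps the direction of the gradient and only limits its length, which guards against the occasional very large update that recurrent models can produce.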

In [28]:
# switch our model into evaluation mode
encoder.eval()
decoder.eval()
# initialize an instance of GreedySearchDecoder in order to be able to perform the evaluation and return the predicted output as text
searcher = GreedySearchDecoder(encoder, decoder)

In [32]:
runChatBot(encoder, decoder, searcher, voc)

> hello
Response: hello . . . . .
> what is your name?
Response: my name is travis . . .
> How are you travis?
Response: i m fine . . . .
> Are you sure?
Response: i m sure . . . .
> Do you have any friends?
Response: no . . . . .
> that sucks
Response: yes . . . . .
> what is your job?
Response: i don t know . . .
> DO you have a job?
Response: yes . . . . .
> What is your job?
Response: i don t know . . .
> Do you feel bad?
Response: i don t know . . .
> Im sorry to hear that
Response: what ? ? ? . .
> Sorry
Response: it s okay . . . .
> It's a nice day today
Response: i m a cop . . .
> i dont like that
Error: Unknown Word
> I'm sorry you are a cop
Response: i am . . .
> that is unfortunate
Response: you re a loser . . .
> me? YOU
Response: yes . . . . .
> I hate you
Response: i m fine . . . .
> Bye
Response: bye . . . . .
> /end
Response: i m sorry . . . .


KeyboardInterrupt: ignored