# ChatBot References

Professor Smyth's website discussing projects

https://www.ics.uci.edu/~smyth/courses/cs175/project_reading.html

Chatbot resources on website

https://pytorch.org/tutorials/beginner/chatbot_tutorial.html

https://web.stanford.edu/~jurafsky/slp3/26.pdf

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/12/CortanaLUDialog-FromSLTproceedings.pdf

Dataset for tutorial

https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

# Start of Tutorial

### Import necessary libraries

In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math


USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

# Seed Python's random module so that batch sampling is repeatable between runs
random.seed(0)

### Preprocess Data

Looking at some files in the data to see how they are structured ...

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
corpus_name = "/content/drive/MyDrive/CS175_Datasets"
# corpus_name is an absolute path, so os.path.join drops "data" and corpus == corpus_name
corpus = os.path.join("data", corpus_name)

def printLines(file, n=10):
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

printLines(os.path.join(corpus, "qa_Video_Games.json"))

b'{\'questionType\': \'yes/no\', \'asin\': \'B00000K4MC\', \'answerTime\': \'Nov 4, 2014\', \'unixTime\': 1415088000, \'question\': \'Will it work with Windows 8?\', \'answerType\': \'Y\', \'answer\': "You should look here: http://steamcommunity.com/sharedfiles/filedetails/?id=282196403 After the game was rereleased on Steam, a Windows 8 user wrote a Steam guide on how to fix it to work on Windows 8. Note that this is done for the Steam edition. It\'s almost the same as the disc version, but instead of being under Steam&gt;Steamapps&gt;common it would be under its own little folder in Program Files, or under a Microprose folder."}\n'
b"{'questionType': 'yes/no', 'asin': 'B00000K4MC', 'answerTime': 'Nov 29, 2014', 'unixTime': 1417248000, 'question': 'Will it work on Windows 7 ?', 'answerType': 'Y', 'answer': 'Yes'}\n"
b"{'questionType': 'yes/no', 'asin': 'B00000K4MC', 'answerTime': 'Aug 16, 2014', 'unixTime': 1408172400, 'question': 'Is there a digital download of this game?', 'answerTy

The next step is to scan through the data file and reformat it into (sentence, response) pairs, one pair per line, separated by a tab. First we create helper functions that clean up the data a bit and make it more readable ...
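
The raw lines printed above are Python dict literals (single-quoted keys) rather than strict JSON, so `json.loads` would reject them. Below is a minimal sketch of turning one such line into a (question, answer) pair, assuming every record has `question` and `answer` keys as in the sample. The name `line_to_pair` and the shortened sample record are made up for illustration.

```python
import ast

# One raw line in the same shape as the sample output above (shortened for the example).
raw_line = "{'questionType': 'yes/no', 'asin': 'B00000K4MC', 'question': 'Will it work on Windows 7 ?', 'answerType': 'Y', 'answer': 'Yes'}\n"

def line_to_pair(line):
    """Parse one Python-literal record and return [question, answer]."""
    # literal_eval accepts only literals (dicts, strings, numbers, ...), never arbitrary code
    record = ast.literal_eval(line.strip())
    return [record["question"], record["answer"]]

pair = line_to_pair(raw_line)
print("\t".join(pair))  # the tab-separated form written to the formatted file
```

`ast.literal_eval` is used here instead of `eval` because the file contents are untrusted text and only plain literals are expected.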

In [18]:
import json
# # Splits each line of the file into a dictionary of fields
# def loadLines(fileName, fields):
#     lines = {}
#     with open(fileName, 'r', encoding='iso-8859-1') as f:
#         for line in f:
#             values = line.split(" +++$+++ ")
#             # Extract fields
#             lineObj = {}
#             for i, field in enumerate(fields):
#                 lineObj[field] = values[i]
#             lines[lineObj['lineID']] = lineObj
#     return lines


# # Groups fields of lines from `loadLines` into conversations based on *movie_conversations.txt*
# def loadConversations(fileName, lines, fields):
#     conversations = []
#     with open(fileName, 'r', encoding='iso-8859-1') as f:
#         for line in f:
#             values = line.split(" +++$+++ ")
#             # Extract fields
#             convObj = {}
#             for i, field in enumerate(fields):
#                 convObj[field] = values[i]
#             # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
#             utterance_id_pattern = re.compile('L[0-9]+')
#             lineIds = utterance_id_pattern.findall(convObj["utteranceIDs"])
#             # Reassemble lines
#             convObj["lines"] = []
#             for lineId in lineIds:
#                 convObj["lines"].append(lines[lineId])
#             conversations.append(convObj)
#     return conversations


# # Extracts pairs of sentences from conversations
# def extractSentencePairs(conversations):
#     qa_pairs = []
#     for conversation in conversations:
#         # Iterate over all the lines of the conversation
#         for i in range(len(conversation["lines"]) - 1):  # We ignore the last line (no answer for it)
#             inputLine = conversation["lines"][i]["text"].strip()
#             targetLine = conversation["lines"][i+1]["text"].strip()
#             # Filter wrong samples (if one of the lists is empty)
#             if inputLine and targetLine:
#                 qa_pairs.append([inputLine, targetLine])
#     return qa_pairs


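import ast

# Each line of the file is a Python dict literal (single-quoted keys), not strict JSON.
# ast.literal_eval parses literals only, so unlike eval it cannot run code found in the file.
def parse(file):
    for l in file:
        yield ast.literal_eval(l)

# Reads the data file and returns a list of [question, answer] pairs
def loadConversations(fileName):
    conversations = []
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for conversation in parse(f):
            conversations.append([conversation["question"], conversation["answer"]])
    return conversations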




def extractSentencePairs(conversations):
    return conversations

Using the functions above, we now parse the data into a form that is useful to us and look at a few lines of the result ...

In [19]:
# Define path to new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize lines dict, conversations list, and field ids
lines = {}
conversations = []
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]

# Load lines and process conversations
print("\nProcessing corpus...")
# lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
print("\nLoading conversations...")
conversations = loadConversations(os.path.join(corpus, "qa_Video_Games.json"))

# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)


Processing corpus...

Loading conversations...

Writing newly formatted file...

Sample lines from file:
b"Will it work with Windows 8?\tYou should look here: http://steamcommunity.com/sharedfiles/filedetails/?id=282196403 After the game was rereleased on Steam, a Windows 8 user wrote a Steam guide on how to fix it to work on Windows 8. Note that this is done for the Steam edition. It's almost the same as the disc version, but instead of being under Steam&gt;Steamapps&gt;common it would be under its own little folder in Program Files, or under a Microprose folder.\n"
b'Will it work on Windows 7 ?\tYes\n'
b'Is there a digital download of this game?\tYou can buy rollercoaster tycoon Deluxe which includes RollerCoaster Tycoon and its two expansions: Corkscrew Follies and Loopy Landscapes for 6 bucks as a digital download off of Steam.\n'
b"will this work on a mac\tSorry to tell you Terresa it won't play\n"
b'I have a new computer with the latest greatest everything...will this game still

We want to create a vocabulary of the words that we see. Each word is represented numerically by an index assigned the first time the word is added to the vocabulary, counting up from the three reserved tokens (PAD, SOS, EOS). See `addWord()` for specifics.
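
As a rough illustration of the indexing rule, the sketch below assigns indexes in order of first appearance. It is a simplified stand-in for the `Voc` class, not a replacement: `build_index` is a hypothetical helper, and it assumes indexes 0 to 2 are reserved for PAD, SOS and EOS.

```python
def build_index(sentences, first_free_index=3):
    """Map each word to an index in order of first appearance.

    Indexes below first_free_index are assumed to be reserved for PAD, SOS and EOS.
    """
    word2index = {}
    for sentence in sentences:
        for word in sentence.split(' '):
            if word not in word2index:
                # The next free index is the reserved count plus the words seen so far
                word2index[word] = first_free_index + len(word2index)
    return word2index

toy_index = build_index(["will it work", "it will not work"])
print(toy_index)
```

Repeated words keep the index they received the first time, which is why `it` and `will` in the second sentence do not get new entries.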

In [20]:
# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    # Remove words below a certain count threshold
    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True

        keep_words = []

        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)

        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
        ))

        # Reinitialize dictionaries
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3 # Count default tokens

        for word in keep_words:
            self.addWord(word)

Now we finally get to the part where we build the (sentence, response) pairs. Preprocessing begins by converting the Unicode strings to ASCII with `unicodeToAscii()`. We also make everything lower case and remove non-letter characters except for basic punctuation with `normalizeString()`. The last preprocessing step is to drop pairs in which either sentence reaches `MAX_LENGTH` words, using `filterPair()`, which helps training converge.
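
To make the effect of these steps concrete, here is a small self-contained sketch that applies the same regular expressions as `normalizeString()` to two made-up strings. The names `normalize` and `is_short_enough` are hypothetical and exist only for this example.

```python
import re
import unicodedata

def normalize(text):
    """Lowercase, strip accents, space out . ! ?, and drop other non-letters."""
    text = text.lower().strip()
    # Decompose accented characters and drop the combining marks
    text = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')
    text = re.sub(r"([.!?])", r" \1", text)       # put a space before . ! ?
    text = re.sub(r"[^a-zA-Z.!?]+", " ", text)    # digits, commas, quotes become spaces
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace

def is_short_enough(pair, max_length=10):
    """True if both sentences have fewer than max_length space-separated tokens."""
    return all(len(sentence.split(' ')) < max_length for sentence in pair)

question = normalize("Will it work with Windows 8?")
answer = normalize("Sorry, it won't play")
print([question, answer], is_short_enough([question, answer]))
```

Note that digits are removed along with other non-letters, which is why "Windows 8" and "Windows 7" questions collapse into identical pairs in the output further down.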

In [21]:
MAX_LENGTH = 10  # Maximum sentence length to consider

# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Read query/response pairs and return a voc object
def readVocs(datafile, corpus_name):
    print("Reading lines...")
    # Read the file and split into lines
    lines = open(datafile, encoding='utf-8').\
        read().strip().split('\n')
    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

# Returns True iff both sentences in a pair 'p' are under the MAX_LENGTH threshold
def filterPair(p):
    # Input sequences need to preserve the last word for EOS token
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

# Filter pairs using filterPair condition
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

# Using the functions defined above, return a populated voc object and pairs list
def loadPrepareData(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs


# Load/Assemble voc and pairs
save_dir = os.path.join("data", "save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)
# Print some pairs to validate
print("\npairs:")
for pair in pairs[:10]:
    print(pair)

Start preparing training data ...
Reading lines...
Read 13307 sentence pairs
Trimmed to 1924 sentence pairs
Counting words...
Counted words: 1638

pairs:
['will it work on windows ?', 'yes']
['will this work on a mac', 'sorry to tell you terresa it won t play']
['will this game work with windows vista ?', 'yes !']
['will it work on windows ?', 'yes']
['will this work on a mac', 'sorry to tell you terresa it won t play']
['will this game work with windows vista ?', 'yes !']
['how many different machines ?', 'there are tables .']
['will it work on windows xp ?', 'yes .']
['does it play on a windows os ?', 'yes']
['how many different machines ?', 'there are tables .']


Another way to speed up training is by removing words that are rarely used. We "trim" these words using `Voc.trim()` and as a result must also remove (sentence, response) pairs that include these words.
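
The idea can be sketched on a toy list of pairs. This is a simplified illustration using `collections.Counter`, not the `Voc`-based code in the next cell; `trim_pairs` and the toy data are invented for the example.

```python
from collections import Counter

def trim_pairs(pairs, min_count):
    """Keep words seen at least min_count times, then keep pairs made only of those words."""
    counts = Counter(word
                     for pair in pairs
                     for sentence in pair
                     for word in sentence.split(' '))
    kept_words = {word for word, count in counts.items() if count >= min_count}
    kept_pairs = [pair for pair in pairs
                  if all(word in kept_words
                         for sentence in pair
                         for word in sentence.split(' '))]
    return kept_words, kept_pairs

toy_pairs = [
    ["will it work ?", "yes"],
    ["will it work ?", "yes"],
    ["will it run ?", "maybe"],   # "run" and "maybe" each appear only once
]
kept_words, kept_pairs = trim_pairs(toy_pairs, min_count=2)
print(sorted(kept_words), len(kept_pairs))
```

A single rare word on either side is enough to discard the whole pair, which explains why trimming the vocabulary also shrinks the number of pairs.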

In [22]:
MIN_COUNT = 3    # Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break

        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs


# Trim voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)

keep_words 615 / 1635 = 0.3761
Trimmed from 1924 pairs to 1132, 0.5884 of total


### Using our Data on our Models

The data still needs more processing. The model computes on numerical values, not on actual strings, so we convert our sentences to tensors that the model can take as inputs. To do this, we replace every sentence with a vector holding the vocabulary index of each of its words. This is how we will use our data to train the model.

Training is usually done with mini-batches since it makes things faster. We build a matrix of shape (batch_size, max_sentence_length_in_batch) in which each row is one sentence, represented numerically as described in the previous paragraph. Each row (sentence) ends with `EOS_token` and is followed by 0 entries (`PAD_token`) until the end of the row.

This layout almost works. The problem is that each row is a sentence and each column is a step in time, while the model processes one time step at a time. It is more convenient for each row to be a step in time and each column to be a sentence, so that indexing the first dimension returns one time step across the whole batch. For this reason, we construct the matrix as described in the previous paragraph and then transpose it to shape (max_sentence_length_in_batch, batch_size).
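
The padding and the transpose can be done in one step with `itertools.zip_longest`, which is also what `zeroPadding()` relies on. The sketch below uses plain lists instead of tensors and a made-up batch of index sequences.

```python
import itertools

PAD, EOS = 0, 2  # same reserved indexes as PAD_token and EOS_token

# Three sentences of different lengths, already converted to indexes and ending in EOS
batch = [[5, 6, 7, EOS],
         [8, EOS],
         [9, 10, EOS]]

# zip_longest walks the sentences in parallel, so each output row is one time step
time_major = [list(step) for step in itertools.zip_longest(*batch, fillvalue=PAD)]

# Mask with the same shape: True for real tokens, False for padding
mask = [[token != PAD for token in step] for step in time_major]

for step in time_major:
    print(step)
```

Reading the result column by column recovers the original sentences followed by their padding.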

We define some functions that help us achieve this ...

In [23]:
def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]


def zeroPadding(l, fillvalue=PAD_token):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

# Returns a matrix shaped like l with 0 where the token equals `value` (padding) and 1 elsewhere
def binaryMatrix(l, value=PAD_token):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == value:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

# Returns padded input sequence tensor and lengths
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths

# Returns padded target sequence tensor, padding mask, and max target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.BoolTensor(mask)
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

# Returns all items for a given batch of pairs
def batch2TrainData(voc, pair_batch):
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len


# Example for validation
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)

input_variable: tensor([[ 60,  32,  32,  49,  49],
        [ 39,  10,   4,   4,  10],
        [143,   5,   5,   5,  20],
        [ 65,   6,  64,  64,  57],
        [ 20,  50,  65, 165, 508],
        [598, 589,  63, 223, 470],
        [ 65, 303,   8,   8,   8],
        [344,   8,   2,   2,   2],
        [  8,   2,   0,   0,   0],
        [  2,   0,   0,   0,   0]])
lengths: tensor([10,  9,  8,  8,  8])
target_variable: tensor([[ 44, 125,   9,   9,  44],
        [  2,  30,   4,   2,   2],
        [  0,  28,  32,   0,   0],
        [  0,  49,   2,   0,   0],
        [  0,  11,   0,   0,   0],
        [  0, 590,   0,   0,   0],
        [  0, 272,   0,   0,   0],
        [  0,  30,   0,   0,   0],
        [  0,   2,   0,   0,   0]])
mask: tensor([[ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [False,  True,  True, False, False],
        [False,  True,  True, False, False],
        [False,  True, False, False, False],
        [False,  True, False, F

Looking at the output from preparing the data above, `input_variable` is the transposed matrix described in the previous section: each column is one sentence and each row is one time step. `lengths` is just a tensor (vector) of how long each input sentence is, including its EOS token. The **encoder** uses it later in the program to pack the padded batch. `target_variable` is the response that our model is supposed to learn, in the same layout. `mask` is a mask over the responses, where True means there is a word at that position and False means there is only padding. Its purpose is not obvious yet, but it becomes clear when the loss is defined in the training section.

### Defining our Model

Our base model is a Sequence-to-Sequence model using two RNNs. The first RNN is called the **encoder**. It takes a variable-length input and converts it into a *fixed-length* "context" vector that is intended to hold onto some semantic meaning of the input. The **decoder** takes the context vector provided by the encoder, along with an input word, to guess the next word in a sequence. The (sentence, response) pairs are not learned as one long continuous sequence: the encoder reads the sentence, and the decoder is trained to produce the response one word at a time, starting from `SOS_token`.

Some discussion on the Encoder. The encoder uses a bidirectional GRU (Gated Recurrent Unit), meaning that two RNNs make up the encoder: one goes through the sequence of data in the forward direction while the other goes through the sequence in the backward direction. At each time step, both RNNs produce an output and a hidden state vector. The outputs of both RNNs are summed and kept for every time step (they form the `outputs` tensor returned by `forward()`), while the hidden state vectors are pushed along and used in the next step of the RNN. Summing the outputs means that the encoding of each time step reflects both past and future context.
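
A bidirectional GRU in PyTorch returns the forward and backward outputs concatenated along the feature dimension, so the summing step amounts to splitting each output vector in half and adding the halves. The sketch below shows this for a single time step using plain lists and made-up numbers.

```python
hidden_size = 2

# Output of one time step for one sentence: the first hidden_size values come from
# the forward direction, the remaining hidden_size values from the backward direction
step_output = [0.5, -1.0, 0.25, 2.0]

forward_half = step_output[:hidden_size]
backward_half = step_output[hidden_size:]

# Element-wise sum brings the vector back to hidden_size features
summed = [f + b for f, b in zip(forward_half, backward_half)]
print(summed)
```

Summing rather than concatenating keeps the encoder output at `hidden_size` features, which matches what the decoder and attention layers expect.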

In [24]:
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        #   because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, : ,self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden

Some discussion on the Decoder. The decoder uses the context vector that is produced by the encoder. Usually, this is all that a decoder uses to produce output, but this can result in a loss of information, especially when the input sentences are very long. To counter this, the decoder also uses its current hidden state to determine which parts of the input it should be "paying attention" to. The resulting values are referred to as attention weights. They are multiplied by the outputs from the encoder (the per-time-step outputs that the encoder returns) to rescale the values, making less important parts of the encoder output smaller and more important parts larger. Using only the encoder output of the current time step is one option, but using all of the encoder outputs gives a more comprehensive set of attention weights. This is the method implemented below, followed by the implementation of the decoder using this attention method.
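
The computation for the simplest scoring method ("dot") can be sketched without tensors: score every encoder output against the decoder state, normalize the scores with a softmax, and take the weighted sum. The function names and numbers below are illustrative only, and the sketch handles a single sentence rather than a batch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_outputs):
    """Return (attention weights, context vector) using dot-product scores."""
    # One score per encoder time step: dot product with the decoder state
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of all encoder outputs
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
               for i in range(len(decoder_state))]
    return weights, context

weights, context = dot_attention(
    decoder_state=[1.0, 0.0],
    encoder_outputs=[[2.0, 0.0], [0.0, 2.0], [2.0, 1.0]],
)
print(weights, context)
```

Encoder steps that point in the same direction as the decoder state receive the largest weights, so they dominate the context vector.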

In [25]:
# Luong attention layer
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(self.method, "is not an appropriate attention method.")
        self.hidden_size = hidden_size
        if self.method == 'general':
            self.attn = nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':
            self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = nn.Parameter(torch.FloatTensor(hidden_size))

    def dot_score(self, hidden, encoder_output):
        return torch.sum(hidden * encoder_output, dim=2)

    def general_score(self, hidden, encoder_output):
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    def concat_score(self, hidden, encoder_output):
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        # Calculate the attention weights (energies) based on the given method
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)

        # Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()

        # Return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)

In [26]:
class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()

        # Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Define layers
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)

    def forward(self, input_step, last_hidden, encoder_outputs):
        # Note: we run this one step (word) at a time
        # Get embedding of current input word
        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
        # Forward through unidirectional GRU
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # Multiply attention weights to encoder outputs to get new "weighted sum" context vector
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate weighted context vector and GRU output using Luong eq. 5
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        # Return output and final hidden state
        return output, hidden

### Defining the Training Procedure

Recall that `mask` was produced alongside `target_variable` for a purpose that was unclear at the time. It turns out that its purpose is to compute the loss of the model. The implementation below is a negative log-likelihood loss that uses `mask` to average only over the positions of `target_variable` that hold real tokens, so padding does not contribute to the loss.
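
A plain-Python sketch of the same idea for a single time step is shown below. `masked_nll` and the probability values are made up for illustration, and the real function works on tensors rather than lists.

```python
import math

def masked_nll(probabilities, targets, mask):
    """Average -log p(target) over the positions where mask is True.

    probabilities: one probability distribution per sentence in the batch
    targets:       the correct word index for each sentence at this time step
    mask:          False where the target is only padding
    """
    losses = [-math.log(probs[target])
              for probs, target, keep in zip(probabilities, targets, mask)
              if keep]
    return sum(losses) / len(losses), len(losses)

probabilities = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]
targets = [1, 0, 0]
mask = [True, True, False]   # the third sentence has already ended

loss, n_real = masked_nll(probabilities, targets, mask)
print(loss, n_real)
```

Without the mask, short responses would be rewarded or penalized for "predicting" padding, which says nothing about the quality of the response.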

In [27]:
def maskNLLLoss(inp, target, mask):
    nTotal = mask.sum()
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = crossEntropy.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, nTotal.item()

For training the model, two tricks are used to help convergence. The first trick is referred to as "teacher forcing": with some probability p (`teacher_forcing_ratio` in the code), the decoder's own prediction is overridden and the actual target word is fed in as the next input instead. Relying on it too much can leave the decoder unable to make predictions on its own, while leaving it out makes convergence slower. The second trick is gradient clipping: wherever the loss surface is steep and the gradient is therefore very large, the magnitude of the gradient is limited to some value (`clip`). This avoids drastically overshooting local minima on "cliffs". The training function is implemented below. It trains for only one iteration, that is, one batch.
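
Clipping by norm can be sketched in a few lines. This is a simplified illustration of the rescaling idea on a flat list of numbers, not the implementation of `nn.utils.clip_grad_norm_`; `clip_by_norm` and the small `eps` term are assumptions of the sketch.

```python
import math

def clip_by_norm(gradients, max_norm, eps=1e-6):
    """Rescale gradients so their L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm:
        return list(gradients), total_norm
    # Shrink every component by the same factor, so the direction is unchanged
    scale = max_norm / (total_norm + eps)
    return [g * scale for g in gradients], total_norm

clipped, norm_before = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Because all components are scaled by the same factor, clipping limits the size of the update step without changing its direction.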

In [28]:
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    # Lengths for rnn packing should always be on the cpu
    lengths = lengths.to("cpu")

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals


This version of the training function trains over multiple iterations and is built on the previous training function. It also periodically saves the current state (model and optimizer parameters, vocabulary, and latest loss) to a checkpoint file with a `.tar` extension, written by `torch.save`. This makes it possible to continue training at a later time or to use the training accumulated so far to make predictions.

In [29]:
def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, 
               encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every, 
               clip, corpus_name, loadFilename):

    # Load batches for each iteration
    training_batches = [batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])
                      for _ in range(n_iteration)]

    # Initializations
    print('Initializing ...')
    start_iteration = 1
    print_loss = 0
    if loadFilename:
        start_iteration = checkpoint['iteration'] + 1

    # Training loop
    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        # Extract fields from batch
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(
                iteration, iteration / n_iteration * 100, print_loss_avg
            ))
            print_loss = 0

        # Save checkpoint
        if (iteration % save_every == 0):
            directory = os.path.join(save_dir, model_name, corpus_name, '{}-{}_{}'.format(
                encoder_n_layers, decoder_n_layers, hidden_size
            ))
            if not os.path.exists(directory):
                os.makedirs(directory)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))
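
One detail of the checkpoint path is worth checking before a long run. `os.path.join` discards every component that precedes an absolute path, so if `corpus_name` still holds the absolute Drive path assigned near the top of the notebook, `save_dir` and `model_name` drop out of the result. The sketch below assumes a hypothetical `save_dir` of `data/save` (its real value is not shown in this section) and uses `posixpath`, which is what `os.path` resolves to on Linux machines such as Colab.

```python
import posixpath  # os.path is posixpath on Linux/Colab

save_dir = "data/save"      # hypothetical value, for illustration only
model_name = "cb_model"
encoder_n_layers, decoder_n_layers, hidden_size = 2, 2, 500
iteration = 500

def checkpoint_path(corpus_name):
    """Rebuild the checkpoint file name the same way trainIters does."""
    directory = posixpath.join(
        save_dir, model_name, corpus_name,
        "{}-{}_{}".format(encoder_n_layers, decoder_n_layers, hidden_size),
    )
    return posixpath.join(directory, "{}_{}.tar".format(iteration, "checkpoint"))

relative = checkpoint_path("CS175_Datasets")
absolute = checkpoint_path("/content/drive/MyDrive/CS175_Datasets")

print(relative)  # save_dir and model_name are part of the path
print(absolute)  # the absolute corpus_name replaces everything before it
```

Whichever form is used, `loadFilename` has to be built from the same components, otherwise the saved checkpoint will not be found.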


### Defining Evaluation

This is the decoding method our bot uses when we are not using the "teacher forcing" method, which is the case whenever there is no target sentence to feed in. At each step it simply chooses the decoder output with the highest softmax score, i.e. the best next word given its prior training, and feeds that word back in as the next input. In other words, this defines the decoder for when we are actually using our model rather than training it.

In [30]:
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        decoder_hidden = encoder_hidden[:self.decoder.n_layers]
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            # Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores
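
Stripped of the tensor bookkeeping, greedy decoding is a loop that feeds the highest-scoring token back in as the next input. The sketch below uses a hypothetical `toy_step` function in place of the decoder and a made-up five-word vocabulary, so it only illustrates the control flow, not the model.

```python
SOS, EOS = 0, 1
VOCAB = ["SOS", "EOS", "yes", "it", "works"]  # made-up vocabulary

def toy_step(prev_token):
    """Hypothetical stand-in for the decoder: scores over VOCAB given the previous token."""
    table = {
        SOS: [0.0, 0.1, 0.6, 0.2, 0.1],    # after SOS, "yes" scores highest
        2:   [0.0, 0.1, 0.1, 0.7, 0.1],    # after "yes", "it"
        3:   [0.0, 0.1, 0.1, 0.1, 0.7],    # after "it", "works"
        4:   [0.0, 0.8, 0.1, 0.05, 0.05],  # after "works", EOS
        EOS: [0.0, 1.0, 0.0, 0.0, 0.0],    # after EOS, EOS again
    }
    return table[prev_token]

def greedy_decode(step, max_length):
    tokens, scores = [], []
    current = SOS
    for _ in range(max_length):
        probs = step(current)
        # argmax: index of the highest score, the role torch.max plays in the class
        current = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(current)
        scores.append(probs[current])
    return tokens, scores

tokens, scores = greedy_decode(toy_step, max_length=5)
words = [VOCAB[t] for t in tokens]
print(words)
```

Like `GreedySearchDecoder`, this loop always runs for `max_length` steps, so the output can contain trailing `EOS` tokens that the caller has to filter out.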

The `evaluateInput` function takes input from the user, prepares it so that it can be fed to our model, and prints out the response from our bot. `evaluate` handles most of the actual computation, while `evaluateInput` handles the user interaction.

In [31]:
def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):
    ### Format input sentence as a batch
    # words -> indexes
    indexes_batch = [indexesFromSentence(voc, sentence)]
    # Create lengths tensor
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    # Transpose dimensions of batch to match models' expectations
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)
    # Use appropriate device
    input_batch = input_batch.to(device)
    # lengths = lengths.to(device)  # disabled because it caused a GPU/CPU tensor mismatch
    # Decode sentence with searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    # indexes -> words
    decoded_words = [voc.index2word[token.item()] for token in tokens]
    return decoded_words


def evaluateInput(encoder, decoder, searcher, voc):
    input_sentence = ''
    while True:
        try:
            # Get input sentence
            input_sentence = input('> ')
            # Check if it is quit case
            if input_sentence == 'q' or input_sentence == 'quit': break
            # Normalize sentence
            input_sentence = normalizeString(input_sentence)
            # Evaluate sentence
            output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)
            # Format and print response sentence
            output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]
            print('Bot:', ' '.join(output_words))

        except KeyError:
            print("Error: Encountered unknown word.")
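
A single out-of-vocabulary word is enough to send the whole sentence into this `KeyError` branch. One possible mitigation, which is not part of the original notebook, is to drop unknown words before the sentence is converted to indexes. The sketch below assumes the lookup is a plain dictionary access; the `word2index` dictionary and the `known_words_only` helper are hypothetical.

```python
def known_words_only(sentence, word2index):
    """Return the sentence with out-of-vocabulary words removed, plus the dropped words."""
    kept, dropped = [], []
    for word in sentence.split(" "):
        if word in word2index:
            kept.append(word)
        else:
            dropped.append(word)
    return " ".join(kept), dropped

# Hypothetical vocabulary; the real mapping is assumed to live on the voc object
word2index = {"how": 3, "is": 4, "the": 5, "card": 6, "working": 7}

cleaned, dropped = known_words_only("how is the graphic card working", word2index)
print(cleaned)   # the words that can be indexed
print(dropped)   # the words the model has never seen
```

Reporting the dropped words back to the user would keep the behavior transparent while still letting the bot answer.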


### Running the Model

This cell defines and builds our model from the parameters below. Notice that parts of the code are commented out, as these are other possibilities for initialization: `attn_model` can be set to one of three attention scoring methods (`dot`, `general`, or `concat`). We can also choose to load a model that we previously trained.
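
For reference, the three `attn_model` options correspond to the three score functions from Luong et al.'s attention paper, which compare the current decoder state $h_t$ with each encoder output $\bar{h}_s$. The symbols $W_a$ and $v_a$ denote learned parameters; this notation comes from that paper rather than from the notebook.

```latex
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^{\top} \bar{h}_s & \text{dot} \\
h_t^{\top} W_a \bar{h}_s & \text{general} \\
v_a^{\top} \tanh\!\left(W_a [h_t ; \bar{h}_s]\right) & \text{concat}
\end{cases}
```

The scores are then normalized with a softmax over the source positions to give the attention weights. `dot` adds no extra parameters, which makes it the cheapest of the three to try first.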

In [32]:
# Configure models
model_name = 'cb_model'
attn_model = 'dot'
#attn_model = 'general'
#attn_model = 'concat'
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 64

# Set checkpoint to load from; set to None if starting from scratch
loadFilename = None
checkpoint_iter = 4000
#loadFilename = os.path.join(save_dir, model_name, corpus_name,
#                            '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size),
#                            '{}_checkpoint.tar'.format(checkpoint_iter))


# Load model if a loadFilename is provided
if loadFilename:
    # If loading on same machine the model was trained on
    checkpoint = torch.load(loadFilename)
    # If loading a model trained on GPU to CPU
    #checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']


print('Building encoder and decoder ...')
# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
# Use appropriate device
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')

Building encoder and decoder ...
Models built and ready to go!


Now we train our model using `trainIters`. We first set some training parameters, initialize the optimizers, and put the model in training mode. Training is likely to take a very long time. Thankfully, there is the option to load a previously trained model, so this piece of code doesn't need to be run every time.
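
Two of the settings in the training cell are easier to read with the arithmetic spelled out. The decoder's learning rate is the base rate times `decoder_learning_ratio`, and a `teacher_forcing_ratio` of 1.0 means teacher forcing is used on every batch. The second point assumes `train` makes that choice with a `random.random() < teacher_forcing_ratio` test, as in the PyTorch tutorial this notebook follows; the helper `uses_teacher_forcing` is hypothetical.

```python
import random

learning_rate = 0.0001
decoder_learning_ratio = 5.0
# Rate passed to the decoder's Adam optimizer
decoder_lr = learning_rate * decoder_learning_ratio

def uses_teacher_forcing(ratio, rng):
    """Hypothetical per-batch decision: True means the ground-truth token is fed next."""
    return rng.random() < ratio

rng = random.Random(0)
always = [uses_teacher_forcing(1.0, rng) for _ in range(1000)]
never = [uses_teacher_forcing(0.0, rng) for _ in range(1000)]
half = sum(uses_teacher_forcing(0.5, rng) for _ in range(1000))

print(decoder_lr)                      # five times the encoder's rate
print(all(always), any(never), half)   # ratio 1.0 always, 0.0 never, 0.5 about half
```

Lowering the ratio below 1.0 would make some batches train on the decoder's own predictions, which is closer to how the model is used at chat time.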

In [33]:
# Safety precaution so that we don't accidentally begin training a model we didn't want to train
TRAIN_THE_MODEL = True

In [35]:
if TRAIN_THE_MODEL:
    
    # Configure training/optimization
    clip = 50.0
    teacher_forcing_ratio = 1.0
    learning_rate = 0.0001
    decoder_learning_ratio = 5.0
    n_iteration = 500
    print_every = 1
    save_every = 500

    # Ensure dropout layers are in train mode
    encoder.train()
    decoder.train()

    # Initialize optimizers
    print('Building optimizers ...')
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
    if loadFilename:
        encoder_optimizer.load_state_dict(encoder_optimizer_sd)
        decoder_optimizer.load_state_dict(decoder_optimizer_sd)

    # Move any loaded optimizer state to the active device (nothing changes on a CPU-only machine)
    for state in encoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(device)

    for state in decoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(device)
    
    # Run training iterations
    print("Starting Training!")
    trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
               embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
               print_every, save_every, clip, corpus_name, loadFilename)
    
    TRAIN_THE_MODEL = False

else:
    
    print("Make the variable 'TRAIN_THE_MODEL' in the code block above True and run the code block")
    print(f"Value of 'TRAIN_THE_MODEL' currently: {TRAIN_THE_MODEL}")


Building optimizers ...
Starting Training!
Initializing ...
Training...
Iteration: 1; Percent complete: 0.2%; Average loss: 0.6514
Iteration: 2; Percent complete: 0.4%; Average loss: 0.7685
Iteration: 3; Percent complete: 0.6%; Average loss: 0.6664
Iteration: 4; Percent complete: 0.8%; Average loss: 0.6841
Iteration: 5; Percent complete: 1.0%; Average loss: 0.6707
Iteration: 6; Percent complete: 1.2%; Average loss: 0.8138
Iteration: 7; Percent complete: 1.4%; Average loss: 0.7418
Iteration: 8; Percent complete: 1.6%; Average loss: 0.6432
Iteration: 9; Percent complete: 1.8%; Average loss: 0.8413
Iteration: 10; Percent complete: 2.0%; Average loss: 0.7081
Iteration: 11; Percent complete: 2.2%; Average loss: 0.6191
Iteration: 12; Percent complete: 2.4%; Average loss: 0.7519
Iteration: 13; Percent complete: 2.6%; Average loss: 0.6802
Iteration: 14; Percent complete: 2.8%; Average loss: 0.5932
Iteration: 15; Percent complete: 3.0%; Average loss: 0.7986
Iteration: 16; Percent complete: 3.2%

Run the model :)

In [36]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (type 'q' or 'quit' to stop)
evaluateInput(encoder, decoder, searcher, voc)

> Hello
Bot: nope .
> what is going on here
Error: Encountered unknown word.
> computer not working
Bot: unfortunately no . only .
> can you say something
Error: Encountered unknown word.
> are you a human
Error: Encountered unknown word.
> what can you say
Bot: nope . .
> what?
Bot: . .
> Will it work with MacBook
Bot: yes and should work with xp
> what type of computer do you got
Error: Encountered unknown word.
> what do you think about this game
Bot: yes two player mode .
> do you like online games
Bot: nope and
> how's the graphic card
Error: Encountered unknown word.
> how is the graphic card working
Error: Encountered unknown word.
> graphic card
Error: Encountered unknown word.
> how to install the game
Bot: about game .
> where is the power sorces?
Error: Encountered unknown word.
>  where is the power source
Error: Encountered unknown word.
> does it work on 3ds
Bot: yes . player
> do you like 3ds?
Bot: no the ds
> q


### Notes on the Model

Overall, the model is OK. It was not as intelligent as I thought it would be, but I suppose that is a good thing. One of the things that I noticed from playing with the bot was that it is pretty predictable: the same input always gives the same reply. In fact, it feels less like it is responding to you and more like it has memorized what to say based on an input string from the user.

I think the first thing we can do to improve the model is to add some randomness, so that asking "hello" won't keep generating the same output.

To improve this one a bit more, we can change some of the hyperparameters, like the learning rate and the depth (number of layers).

Eventually, to improve the overall feel of the chatbot, we can implement an entirely new model using BERT, GPT-3, or other transformer models.

Try to give words a more meaningful number than the index at which they first appear (ask the professor if he thinks this would even matter, or whether it would just be a change of domain).

# Adding randomness to the Model

In [37]:
class StochasticSearchDecoder(nn.Module):
    def __init__(self, encoder, decoder, k):
        super(StochasticSearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.k = k

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        decoder_hidden = encoder_hidden[:self.decoder.n_layers]
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            
            # Randomly select one of the top k words (uniformly, ignoring their relative scores)
            decoder_output_topk, ind = torch.topk(decoder_output, k=self.k, dim=1)
            i = random.randrange(self.k)
            # Keep both as 1-element tensors so they can be concatenated below
            decoder_input = ind[0, i].view(1)
            decoder_scores = decoder_output_topk[0, i].view(1)
            
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores
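
The selection step in `forward` boils down to three moves: sort the scores, keep the best `k`, and pick one of them uniformly at random. Here is a minimal pure-Python sketch of that step, with a hypothetical `pick_from_top_k` helper and made-up scores standing in for `decoder_output`.

```python
import random

def pick_from_top_k(scores, k, rng):
    """Return (token_index, score) chosen uniformly among the k highest scores."""
    # Indices sorted by descending score, the same order torch.topk returns by default
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    choice = top[rng.randrange(k)]
    return choice, scores[choice]

scores = [0.05, 0.50, 0.30, 0.10, 0.05]  # made-up softmax output over 5 tokens
rng = random.Random(0)
picks = [pick_from_top_k(scores, k=2, rng=rng)[0] for _ in range(1000)]
greedy_pick = pick_from_top_k(scores, k=1, rng=rng)[0]

print(sorted(set(picks)))               # only the two best tokens ever appear
print(picks.count(1), picks.count(2))   # roughly even split, regardless of score
print(greedy_pick)                      # k=1 reduces to greedy decoding
```

Since the pick ignores how far apart the scores are, a larger `k` gives low-scoring tokens the same chance as the best one.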

In [39]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = StochasticSearchDecoder(encoder, decoder, 2)

# Begin chatting (type 'q' or 'quit' to stop)
evaluateInput(encoder, decoder, searcher, voc)

> are you a bot
Error: Encountered unknown word.
> how are you
Bot: two two in black and no problems . . .
> what is going on here
Error: Encountered unknown word.
> wtf
Error: Encountered unknown word.


KeyboardInterrupt: ignored

I like this model a whole lot more than the previous version, since it feels a lot more like a person speaking. I think there is more we can do to explore this stochastic model instead of the greedy model. For example, instead of randomly choosing every word from the top k choices, we could randomly choose only the first word from the top k choices and then let the bot make up the rest of the sentence based on that random first word. This might be a better idea because when the k value is too large (even 5 is too large), the bot sort of rambles and makes weird sentences. I found that a k value of 2 worked best.
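
The first-word-only idea can be sketched without the model. The code below samples uniformly from the top `k` scores at the first position and decodes greedily afterwards. `toy_step`, the vocabulary, and `decode_random_first` are all hypothetical stand-ins, so this shows the control flow under those assumptions rather than a tested change to `StochasticSearchDecoder`.

```python
import random

SOS, EOS = 0, 1
VOCAB = ["SOS", "EOS", "yes", "no", "it", "works"]  # made-up vocabulary

def toy_step(prev_token):
    """Hypothetical decoder stand-in: scores over VOCAB given the previous token."""
    table = {
        SOS: [0.0, 0.0, 0.5, 0.4, 0.1, 0.0],  # "yes" and "no" are the top two openers
        2:   [0.0, 0.1, 0.0, 0.0, 0.8, 0.1],  # "yes" -> "it"
        3:   [0.0, 0.9, 0.0, 0.0, 0.1, 0.0],  # "no" -> EOS
        4:   [0.0, 0.1, 0.0, 0.0, 0.0, 0.9],  # "it" -> "works"
        5:   [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # "works" -> EOS
        EOS: [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    }
    return table[prev_token]

def decode_random_first(step, k, max_length, rng):
    tokens, current = [], SOS
    for position in range(max_length):
        scores = step(current)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        # Randomness only at the first position; best-ranked token everywhere else
        rank = rng.randrange(k) if position == 0 else 0
        current = ranked[rank]
        tokens.append(current)
        if current == EOS:
            break  # unlike the notebook's decoders, this sketch stops at EOS
    return [VOCAB[t] for t in tokens]

rng = random.Random(0)
replies = {
    " ".join(decode_random_first(toy_step, k=2, max_length=6, rng=rng))
    for _ in range(50)
}
print(sorted(replies))
```

With this toy table, every reply is one of two coherent sentences, because the only random choice is the opening word.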