# ChatBot References

Professor Smyth's website discussing projects

https://www.ics.uci.edu/~smyth/courses/cs175/project_reading.html

Chatbot resources on website

https://pytorch.org/tutorials/beginner/chatbot_tutorial.html

https://web.stanford.edu/~jurafsky/slp3/26.pdf

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/12/CortanaLUDialog-FromSLTproceedings.pdf

Dataset for tutorial

https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

# Start of Tutorial

### Import necessary libraries

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math
import numpy as np


USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

### Preprocess Data

Looking at some files in the data to see how they are structured ...

In [2]:
corpus_name = "cornell movie-dialogs corpus"
corpus = os.path.join("data", corpus_name)

def printLines(file, n=10):
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

printLines(os.path.join(corpus, "movie_lines.txt"))

b'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n'
b'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!\n'
b'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.\n'
b'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?\n'
b"L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.\n"
b'L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow\n'
b"L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.\n"
b'L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No\n'
b'L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?\n'
b'L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?\n'


Next, we scan through the data files and reformat them into tab-separated (sentence, response) pairs. First, we create helper functions that clean up the data and make it more readable ...
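
As a rough illustration of the two steps the helper functions perform, the sketch below splits one raw line (copied from the output above) on the ` +++$+++ ` separator and then pairs up consecutive utterances. The `utterances` list is a made-up toy conversation, and the variable names are illustrative rather than taken from the tutorial.

```python
# Field separator used by the corpus files (visible in the raw lines printed above)
SEP = " +++$+++ "
FIELDS = ["lineID", "characterID", "movieID", "character", "text"]

raw = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"
record = dict(zip(FIELDS, raw.split(SEP)))

# A toy conversation: consecutive utterances become (sentence, response) pairs
utterances = ["They do to!", "They do not!", ""]
toy_pairs = [
    [a.strip(), b.strip()]
    for a, b in zip(utterances, utterances[1:])
    if a.strip() and b.strip()  # skip pairs with an empty side
]

print(record["lineID"], repr(record["text"]))  # L1045 'They do not!\n'
print("\t".join(toy_pairs[0]))
```

The trailing newline stays in the `text` field at this stage, which is why the pair-extraction step strips each line before using it.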

In [3]:
# Splits each line of the file into a dictionary of fields
def loadLines(fileName, fields):
    lines = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            lineObj = {}
            for i, field in enumerate(fields):
                lineObj[field] = values[i]
            lines[lineObj['lineID']] = lineObj
    return lines


# Groups fields of lines from `loadLines` into conversations based on *movie_conversations.txt*
def loadConversations(fileName, lines, fields):
    conversations = []
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            convObj = {}
            for i, field in enumerate(fields):
                convObj[field] = values[i]
            # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
            utterance_id_pattern = re.compile('L[0-9]+')
            lineIds = utterance_id_pattern.findall(convObj["utteranceIDs"])
            # Reassemble lines
            convObj["lines"] = []
            for lineId in lineIds:
                convObj["lines"].append(lines[lineId])
            conversations.append(convObj)
    return conversations


# Extracts pairs of sentences from conversations
def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations:
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):  # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Filter wrong samples (if one of the lists is empty)
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs

Using the functions above, we now parse the data into a form that is useful to us and look at a few lines of the result ...

In [4]:
# Define path to new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize lines dict, conversations list, and field ids
lines = {}
conversations = []
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]

# Load lines and process conversations
print("\nProcessing corpus...")
lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
print("\nLoading conversations...")
conversations = loadConversations(os.path.join(corpus, "movie_conversations.txt"),
                                  lines, MOVIE_CONVERSATIONS_FIELDS)

# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)


Processing corpus...

Loading conversations...

Writing newly formatted file...

Sample lines from file:
b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\r\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\r\n"
b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\r\n"
b"You're asking me out.  That's so cute. What's your name again?\tForget it.\r\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\r\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\r\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister. 

We want to create a vocabulary of the words that we see. Each word is represented numerically by an index that is assigned the first time the word is added to the vocabulary, so indexes follow the order of first appearance. Indexes 0 to 2 are reserved for the `PAD`, `SOS`, and `EOS` tokens. See `addWord()` for specifics.
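
A minimal sketch of this indexing scheme, using plain dictionaries instead of the `Voc` class. The sentence is a made-up example, and the starting index of 3 assumes that 0, 1, and 2 are reserved for the padding, start, and end tokens.

```python
word2index = {}
word2count = {}
next_index = 3  # assumption: 0, 1, 2 are reserved for PAD, SOS, EOS

for word in "hi there hi you".split(" "):
    if word not in word2index:
        # First appearance: assign the next free index
        word2index[word] = next_index
        word2count[word] = 1
        next_index += 1
    else:
        # Repeat appearance: only the count changes
        word2count[word] += 1

print(word2index)  # {'hi': 3, 'there': 4, 'you': 5}
print(word2count)  # {'hi': 2, 'there': 1, 'you': 1}
```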

In [5]:
# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    # Remove words below a certain count threshold
    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True

        keep_words = []

        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)

        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
        ))

        # Reinitialize dictionaries
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3 # Count default tokens

        for word in keep_words:
            self.addWord(word)

Now we finally get to the part where we begin to make the (sentence, response) pairs. We begin preprocessing by converting the Unicode strings to ASCII with `unicodeToAscii()`. We also make everything lowercase and remove non-letter characters except for basic punctuation with `normalizeString()`. The last preprocessing step is to drop every pair in which either sentence has `MAX_LENGTH` or more tokens, which aids training (`filterPair()`).
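
To make the effect of these steps concrete, the sketch below traces one line from the raw corpus output above through the same normalization steps, one at a time. It is only an illustration; the step numbering is not part of the tutorial.

```python
import re
import unicodedata

MAX_LENGTH = 10  # same threshold as in the notebook

s = "Okay -- you're gonna need to learn how to lie."

# Step 1: lowercase, strip, and drop combining accent marks
s = s.lower().strip()
s = "".join(c for c in unicodedata.normalize("NFD", s)
            if unicodedata.category(c) != "Mn")
# Step 2: put a space in front of ., ! and ? so they become separate tokens
s = re.sub(r"([.!?])", r" \1", s)
# Step 3: replace every run of other non-letter characters with one space
s = re.sub(r"[^a-zA-Z.!?]+", " ", s)
s = re.sub(r"\s+", " ", s).strip()

tokens = s.split(" ")
print(s)  # okay you re gonna need to learn how to lie .
print(len(tokens), len(tokens) < MAX_LENGTH)  # 11 False
```

With 11 tokens, this sentence fails the length check, so any pair containing it would be filtered out.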

In [6]:
MAX_LENGTH = 10  # Maximum sentence length to consider

# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Read query/response pairs and return a voc object
def readVocs(datafile, corpus_name):
    print("Reading lines...")
    # Read the file and split into lines
    lines = open(datafile, encoding='utf-8').\
        read().strip().split('\n')
    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

# Returns True iff both sentences in a pair 'p' are under the MAX_LENGTH threshold
def filterPair(p):
    # Input sequences need to preserve the last word for EOS token
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

# Filter pairs using filterPair condition
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

# Using the functions defined above, return a populated voc object and pairs list
def loadPrepareData(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs


# Load/Assemble voc and pairs
save_dir = os.path.join("data", "save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)
# Print some pairs to validate
print("\npairs:")
for pair in pairs[:10]:
    print(pair)

Start preparing training data ...
Reading lines...
Read 221282 sentence pairs
Trimmed to 64271 sentence pairs
Counting words...
Counted words: 18008

pairs:
['there .', 'where ?']
['you have my word . as a gentleman', 'you re sweet .']
['hi .', 'looks like things worked out tonight huh ?']
['you know chastity ?', 'i believe we share an art instructor']
['have fun tonight ?', 'tons']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']
['but', 'you always been this selfish ?']
['do you listen to this crap ?', 'what crap ?']
['what good stuff ?', 'the real you .']


Another way to speed up training is by removing words that are rarely used. We "trim" these words using `Voc.trim()` and as a result must also remove (sentence, response) pairs that include these words.
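
A small sketch of the idea on toy data, using `collections.Counter` rather than the `Voc` class. The pairs and the threshold of 2 are made up for illustration.

```python
from collections import Counter

MIN_COUNT = 2  # toy threshold; the notebook uses 3
toy_pairs = [["hi there", "hi"], ["hi friend", "there"]]

# Count every word over both sides of every pair
counts = Counter(w for pair in toy_pairs for sent in pair for w in sent.split(" "))
kept_words = {w for w, c in counts.items() if c >= MIN_COUNT}

# Keep a pair only if all of its words survived the trim
kept_pairs = [
    pair for pair in toy_pairs
    if all(w in kept_words for sent in pair for w in sent.split(" "))
]

print(sorted(kept_words))  # ['hi', 'there']
print(kept_pairs)          # [['hi there', 'hi']]
```

The second pair is dropped because "friend" appears only once, even though its other words are common.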

In [7]:
MIN_COUNT = 3    # Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break

        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs


# Trim voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)

keep_words 7823 / 18005 = 0.4345
Trimmed from 64271 pairs to 53165, 0.8272 of total


### Using our Data on our Models

The data still needs more processing. The model computes on numerical values, not strings, so we convert our sentences to tensors that the model takes as input. To do this, we replace every word in a sentence with its index in the vocabulary, which gives one vector of indexes per sentence. This is how we will use our data to train the model.

We usually train with mini-batches, since that makes training faster. We make a matrix of shape (batch_size, max_sentence_length_in_batch) that holds the index representation from the previous paragraph. Each row (sentence) ends with `EOS_token` and is followed by `PAD_token` (0) entries until the end of the row.

This layout almost works. The problem is that each row is a sentence and each column is a time step, so indexing the first dimension returns a whole sentence. The model consumes a batch one time step at a time, so it is better for every row to be a time step and every column to be a sentence in the batch. For this reason, we construct the matrix as described in the previous paragraph and then transpose it, giving shape (max_sentence_length_in_batch, batch_size).
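
The padding and the transpose can be done in a single step with `itertools.zip_longest`. The sketch below shows this on two made-up, already-indexed sentences. The word indexes 5 to 8 are arbitrary.

```python
import itertools

PAD_token, EOS_token = 0, 2  # same values as in the notebook

# Two already-indexed sentences of different lengths, each ending in EOS
batch = [[5, 6, 7, EOS_token], [8, EOS_token]]

# zip_longest pads the short sentence and transposes in one step:
# each output row is one time step across the whole batch
padded = list(itertools.zip_longest(*batch, fillvalue=PAD_token))
mask = [[token != PAD_token for token in step] for step in padded]

print(padded)  # [(5, 8), (6, 2), (7, 0), (2, 0)]
print(mask)    # [[True, True], [True, True], [True, False], [True, False]]
```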

We define some functions that help us achieve this ...

In [8]:
def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]


def zeroPadding(l, fillvalue=PAD_token):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

def binaryMatrix(l, value=PAD_token):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == value:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

# Returns padded input sequence tensor and lengths
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths

# Returns padded target sequence tensor, padding mask, and max target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.BoolTensor(mask)
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

# Returns all items for a given batch of pairs
def batch2TrainData(voc, pair_batch):
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len


# Example for validation
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)

input_variable: tensor([[   9,    7,   25,  122,  659],
        [ 997,  331,  200,    4,    6],
        [  37,  117,  112,    4,    2],
        [  96,  746, 1975,    4,    0],
        [  53,   12,    9,    6,    0],
        [ 301,  488, 1347,    2,    0],
        [ 989,  469,    4,    0,    0],
        [   4,   66,    2,    0,    0],
        [   2,    2,    0,    0,    0]])
lengths: tensor([9, 9, 8, 6, 3])
target_variable: tensor([[  25,   25,  354,  122,   76],
        [ 132,  197, 1686,   50,   37],
        [ 273,  117,  254,    6,   98],
        [ 385,  222,    4,    2,  427],
        [  12,   40,    2,    0,    4],
        [ 657,    4,    0,    0,    2],
        [   4,    2,    0,    0,    0],
        [   2,    0,    0,    0,    0]])
mask: tensor([[ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True, False,  True],
        [ True,  True

Looking at the output from preparing the data above, `input_variable` is the matrix described above. `lengths` is a tensor (vector) of how long each input sentence is, including its `EOS_token`. It is used later by the **encoder** to pack the padded batch. `target_variable` is the response that our model is supposed to learn. `mask` looks like a mask over the responses, where `True` means there is a token at that position and `False` means the position is padding. Not sure why this variable is necessary, but maybe it will become clearer at some point in the tutorial.

### Defining our Model

Our base model is a sequence-to-sequence model using two RNNs. The first RNN is called the **encoder**: it takes a variable-length input and converts it into a *fixed-length* "context" vector that is intended to hold onto some semantic meaning of the input. The **decoder** takes the context vector provided by the encoder along with an input word and guesses the next word in the sequence. I suppose our (sentence, response) pairs are learned as one long continuous sequence where the sentence is the start of the sequence and the response is the remainder of the sequence, though this might be entirely incorrect.

Some discussion on the encoder. The tutorial's encoder uses a bidirectional GRU (Gated Recurrent Unit); the code below swaps in `nn.LSTM` but keeps the attribute name `gru`. Bidirectional means that the encoder is basically made of two RNNs: one goes through the input sequence in the forward direction while the other goes through it in the backward direction. At each time step, both RNNs produce an output and a hidden state vector. The outputs of the two RNNs are summed and kept for every time step (they are returned as `outputs` and used later by the attention layer), while the hidden state vectors are passed along to the next step of each RNN. Summing the outputs means that the encoding at each time step reflects both past and future context.
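
The summing step can be pictured with plain lists. A bidirectional PyTorch RNN concatenates the forward and backward outputs along the last dimension, so the sum is taken over the two halves of each output vector. The numbers below are made up, and the sketch covers a single time step of a single sentence.

```python
hidden_size = 2

# One time step of a bidirectional RNN output for one sentence:
# the first hidden_size values come from the forward pass,
# the last hidden_size values from the backward pass (made-up numbers)
step_output = [1.0, 2.0, 10.0, 20.0]

forward_part = step_output[:hidden_size]
backward_part = step_output[hidden_size:]
summed = [f + b for f, b in zip(forward_part, backward_part)]

print(summed)  # [11.0, 22.0]
```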

In [9]:
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # Initialize the recurrent layer (an nn.LSTM, stored as self.gru); input_size and hidden_size are
        #   both set to 'hidden_size' because our input is a word embedding with hidden_size features
        self.gru = nn.LSTM(hidden_size,hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, : ,self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden

Some discussion on the decoder. The decoder uses the context vector that is produced by the encoder. Usually, this is all that a decoder uses to produce output, but a single fixed-length vector can lose information, especially when the input sentences are very long. To counter this, the decoder also uses its current hidden state to determine which parts of the input it should be "paying attention" to. It scores the encoder outputs (the per-time-step outputs that the encoder returns) against its current state and normalizes the scores with a softmax. The results are referred to as attention weights. Multiplying them by the encoder outputs rescales the values, making less important parts of the encoder output smaller and more important parts larger, and the weighted outputs are summed into a new context vector. Luong-style attention computes these weights over all of the encoder outputs instead of only the one from the current time step, which gives a more comprehensive set of attention weights. This is the method implemented below, followed by the implementation of the decoder using this attention method.
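
As a rough sketch of the "dot" variant of this idea for a single sentence, in plain Python rather than PyTorch: the vectors are made up, and `softmax` here is a small helper written for this example, not the library function.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up encoder outputs (one vector per input time step) and decoder state
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.5]

# "dot" score: inner product of the decoder state with each encoder output
scores = [sum(h * e for h, e in zip(decoder_state, enc)) for enc in encoder_outputs]
weights = softmax(scores)

# Context vector: attention-weighted sum of the encoder outputs
context = [
    sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
    for d in range(len(decoder_state))
]

print(scores)                       # [2.0, 0.5, 2.5]
print(weights.index(max(weights)))  # 2
```

The third encoder output gets the largest weight because it has the largest inner product with the decoder state, and the weights always sum to 1.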

In [10]:
# Luong attention layer
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(self.method, "is not an appropriate attention method.")
        self.hidden_size = hidden_size
        if self.method == 'general':
            self.attn = nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':
            self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = nn.Parameter(torch.FloatTensor(hidden_size))

    def dot_score(self, hidden, encoder_output):
        return torch.sum(hidden * encoder_output, dim=2)

    def general_score(self, hidden, encoder_output):
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    def concat_score(self, hidden, encoder_output):
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        # Calculate the attention weights (energies) based on the given method
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)

        # Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()

        # Return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)

In [11]:
class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()

        # Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Define layers
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.LSTM(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)

    def forward(self, input_step, last_hidden, encoder_outputs):
        # Note: we run this one step (word) at a time
        # Get embedding of current input word
        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
        # Forward through unidirectional GRU
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # Multiply attention weights to encoder outputs to get new "weighted sum" context vector
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate weighted context vector and GRU output using Luong eq. 5
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        # Return output and final hidden state
        return output, hidden

### Defining the Training Procedure

Recall that the `mask` produced alongside `target_variable` earlier had an unknown purpose. It turns out that its purpose is computing the loss of the model: padded positions in the target should not contribute to the loss. The implementation below computes the negative log-likelihood of the target tokens in `target_variable` and averages it only over the positions where `mask` is `True`.
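
A hand-sized version of the same calculation for one decoder time step, written with plain lists instead of tensors. The probabilities, targets, and mask are made-up values.

```python
import math

# One decoder time step for a batch of 3 sentences over a 3-word vocabulary.
# Each row is a softmax output (made-up numbers).
probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
targets = [0, 1, 0]         # correct word index per sentence
mask = [True, True, False]  # the third sentence is already padding here

# Negative log-likelihood of the target word, kept only where mask is True
nll = [-math.log(p[t]) for p, t in zip(probs, targets)]
kept = [loss for loss, m in zip(nll, mask) if m]
step_loss = sum(kept) / len(kept)
n_total = sum(mask)

print(round(step_loss, 4), n_total)  # 0.4581 2
```

The third sentence contributes nothing to `step_loss`, no matter what the model predicts for its padded position.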

In [12]:
def maskNLLLoss(inp, target, mask):
    nTotal = mask.sum()
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = crossEntropy.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, nTotal.item()

For training the model, two tricks are used to help convergence. The first trick is referred to as "teacher forcing": instead of feeding the decoder its own prediction as the next input, we feed it the actual target word. This happens with probability `teacher_forcing_ratio`, decided once per training iteration. Doing it too much can leave the decoder unable to make predictions on its own, while not doing it at all makes convergence slower. The second trick is gradient clipping: where the loss surface is very steep and the gradient is therefore very large, the norm of the gradient is limited to some maximum value (`clip`). This keeps a single update from drastically overshooting a minimum when it hits a "cliff". The training function is implemented below. It trains on only one iteration (one batch).
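
Clipping by norm can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the PyTorch implementation: `clip_by_norm` is a hypothetical helper, and the gradient values are made up.

```python
import math

def clip_by_norm(grads, max_norm):
    # Rescale the whole gradient vector if its L2 norm exceeds max_norm;
    # the direction is preserved, only the length shrinks
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

steep = clip_by_norm([30.0, 40.0], max_norm=50.0)    # norm 50: left alone
cliff = clip_by_norm([300.0, 400.0], max_norm=50.0)  # norm 500: scaled down
print(steep)
print(cliff)
```

Both results end up with a norm of about 50, but the second gradient was ten times larger before clipping, which is the "cliff" case that clipping is meant to tame.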

In [13]:
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    # Lengths for rnn packing should always be on the cpu
    lengths = lengths.to("cpu")

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    # Get the encoder's final hidden (h) and cell (c) states
    encoder_h_hidden, encoder_c_hidden = encoder_hidden
    decoder_h_hidden = encoder_h_hidden[:decoder.n_layers]
    decoder_c_hidden = encoder_c_hidden[:decoder.n_layers]
    # Recombine the final hidden states as hidden tuple
    decoder_hidden = (decoder_h_hidden, decoder_c_hidden)

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals


This version of the training function trains over multiple iterations and is built on the previous training function. It also periodically saves a checkpoint (model, optimizer, and vocabulary state) to a `.tar` file with `torch.save`. This makes it possible to continue training later or to use the training accumulated so far to make predictions.

In [14]:
def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, 
               encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every, 
               clip, corpus_name, loadFilename):

    # Load batches for each iteration
    training_batches = [batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])
                      for _ in range(n_iteration)]

    # Initializations
    print('Initializing ...')
    start_iteration = 1
    print_loss = 0
    if loadFilename:
        start_iteration = checkpoint['iteration'] + 1

    # Training loop
    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        # Extract fields from batch
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(
                iteration, iteration / n_iteration * 100, print_loss_avg
            ))
            print_loss = 0

        # Save checkpoint
        if (iteration % save_every == 0):
            directory = os.path.join(save_dir, model_name, corpus_name, '{}-{}_{}'.format(
                encoder_n_layers, decoder_n_layers, hidden_size
            ))
            if not os.path.exists(directory):
                os.makedirs(directory)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))
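
As a rough illustration of where these checkpoints end up and how a resumed run picks its starting iteration, the sketch below rebuilds the path with the same format strings as `trainIters`. The helper names `checkpoint_path` and `resume_iteration` and the `save_dir` value are made up for this example; the other values mirror the configuration used later in this notebook.

```python
import os

def checkpoint_path(save_dir, model_name, corpus_name,
                    encoder_n_layers, decoder_n_layers, hidden_size, iteration):
    # Hypothetical helper: mirrors the directory and file naming in trainIters
    directory = os.path.join(save_dir, model_name, corpus_name,
                             '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size))
    return os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint'))

def resume_iteration(checkpoint):
    # Hypothetical helper: a resumed run continues one past the saved iteration
    return checkpoint['iteration'] + 1 if checkpoint else 1

# save_dir is an assumed value; it is not shown in this part of the notebook
path = checkpoint_path(os.path.join("data", "save"), "cb_model",
                       "cornell movie-dialogs corpus", 2, 2, 500, 4000)
print(path)
print(resume_iteration({'iteration': 4000}))  # 4001
print(resume_iteration(None))                 # 1
```

The directory name encodes the layer counts and hidden size, so checkpoints from differently shaped models do not overwrite each other.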


### Defining Evaluation

Greedy decoding is the decoding method our bot uses when we are not using "teacher forcing", that is, when we actually use the model rather than train it. At each time step it simply chooses the decoder output with the highest softmax score (the best response given its prior training) and feeds that word back in as the next decoder input. The cell below defines this decoder along with several stochastic alternatives.
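
To make the greedy rule concrete, here is a toy sketch in plain Python (no PyTorch). `toy_step` is a made-up stand-in for the decoder: it returns a probability list over a four-word vocabulary based only on the previous token. The vocabulary and the probabilities are invented for illustration.

```python
# Toy vocabulary; indices and words are invented for this example
VOCAB = ["SOS", "hello", "there", "EOS"]

# Hypothetical stand-in for the decoder: next-token probabilities given the previous token
TRANSITIONS = {
    0: [0.0, 0.7, 0.2, 0.1],   # after SOS
    1: [0.0, 0.1, 0.6, 0.3],   # after "hello"
    2: [0.0, 0.1, 0.1, 0.8],   # after "there"
    3: [0.0, 0.0, 0.0, 1.0],   # after EOS
}

def toy_step(prev_token):
    return TRANSITIONS[prev_token]

def greedy_decode(max_length, sos_token=0):
    tokens, scores = [], []
    prev = sos_token
    for _ in range(max_length):
        probs = toy_step(prev)
        # Greedy choice: the index with the highest probability
        best = max(range(len(probs)), key=lambda i: probs[i])
        tokens.append(best)
        scores.append(probs[best])
        prev = best  # feed the chosen token back in as the next input
    return tokens, scores

tokens, scores = greedy_decode(max_length=4)
print([VOCAB[t] for t in tokens])  # ['hello', 'there', 'EOS', 'EOS']
print(scores)                      # [0.7, 0.6, 0.8, 1.0]
```

Like the real decoder, this sketch always runs for `max_length` steps; trailing `EOS` tokens are filtered out afterwards when the response is printed.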

In [15]:
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        # Get the encoder's final hidden (h) and cell (c) states
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        # Use self.decoder rather than the global `decoder`
        decoder_h_hidden = encoder_h_hidden[:self.decoder.n_layers]
        decoder_c_hidden = encoder_c_hidden[:self.decoder.n_layers]
        # Recombine the final hidden states as hidden tuple
        decoder_hidden = (decoder_h_hidden, decoder_c_hidden)
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            # Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

# Alternatives to greedy search
class StochasticTopKUniformDecoder(nn.Module):
    def __init__(self, encoder, decoder, k):
        super(StochasticTopKUniformDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.k = k

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        # Get the encoder's final hidden (h) and cell (c) states
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        # Use self.decoder rather than the global `decoder`
        decoder_h_hidden = encoder_h_hidden[:self.decoder.n_layers]
        decoder_c_hidden = encoder_c_hidden[:self.decoder.n_layers]
        # Recombine the final hidden states as hidden tuple
        decoder_hidden = (decoder_h_hidden, decoder_c_hidden)
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            
            # select from the top k words at each word in the sentence
            # uniformly at random.
            decoder_output_topk, ind = torch.topk(decoder_output, k=self.k, dim=1)
            i = int(random.random()*self.k)
            decoder_input = torch.tensor([ind[0,i]]).to(device)
            decoder_scores = torch.tensor([decoder_output_topk[0,i]]).to(device)
            
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

class StochasticTopKSampleDecoder(nn.Module):
    def __init__(self, encoder, decoder, k):
        super(StochasticTopKSampleDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.k = k

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        # Get the encoder's final hidden (h) and cell (c) states
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        # Use self.decoder rather than the global `decoder`
        decoder_h_hidden = encoder_h_hidden[:self.decoder.n_layers]
        decoder_c_hidden = encoder_c_hidden[:self.decoder.n_layers]
        # Recombine the final hidden states as hidden tuple
        decoder_hidden = (decoder_h_hidden, decoder_c_hidden)
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            
            # sample choice of word from the top k words at each word
            # in the sentence
            decoder_output_topk, ind = torch.topk(decoder_output, k=self.k, dim=1)
            decoder_output_topk = decoder_output_topk.cpu().detach().numpy().flatten()
            ind = ind.cpu().detach().numpy().flatten()
            probabilities = decoder_output_topk/(np.sum(decoder_output_topk))
            value = np.random.choice(ind, 1, p=probabilities)
            i = np.where(ind == value[0])[0][0]
            decoder_input = torch.tensor([ind[i]]).to(device)
            decoder_scores = torch.tensor([decoder_output_topk[i]]).to(device)
            
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

class StochasticSentenceInitializerSamplingDecoder(nn.Module):
    def __init__(self, encoder, decoder, k, n):
        super(StochasticSentenceInitializerSamplingDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.k = k
        self.n = n

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        # Get the encoder's final hidden (h) and cell (c) states
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        # Use self.decoder rather than the global `decoder`
        decoder_h_hidden = encoder_h_hidden[:self.decoder.n_layers]
        decoder_c_hidden = encoder_c_hidden[:self.decoder.n_layers]
        # Recombine the final hidden states as hidden tuple
        decoder_hidden = (decoder_h_hidden, decoder_c_hidden)
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for i in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            
            if i >= self.n:
                # be greedy >:)
                decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            else:
                # sample choice of word from the top k words at each word
                # in the sentence
                decoder_output_topk, ind = torch.topk(decoder_output, k=self.k, dim=1)
                decoder_output_topk = decoder_output_topk.cpu().detach().numpy().flatten()
                ind = ind.cpu().detach().numpy().flatten()
                probabilities = decoder_output_topk/(np.sum(decoder_output_topk))
                value = np.random.choice(ind, 1, p=probabilities)
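                # Position of the sampled token within the top-k arrays
                # (named `choice` so the loop counter `i` is not shadowed)
                choice = np.where(ind == value[0])[0][0]
                decoder_input = torch.tensor([ind[choice]]).to(device)
                decoder_scores = torch.tensor([decoder_output_topk[choice]]).to(device)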
            
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

class StochasticSentenceInitializerUniformDecoder(nn.Module):
    def __init__(self, encoder, decoder, k, n):
        super(StochasticSentenceInitializerUniformDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.k = k
        self.n = n

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        # Get the encoder's final hidden (h) and cell (c) states
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        # Use self.decoder rather than the global `decoder`
        decoder_h_hidden = encoder_h_hidden[:self.decoder.n_layers]
        decoder_c_hidden = encoder_c_hidden[:self.decoder.n_layers]
        # Recombine the final hidden states as hidden tuple
        decoder_hidden = (decoder_h_hidden, decoder_c_hidden)
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for i in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            
            if i >= self.n:
                # be greedy >:)
                decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            else:
                # sample choice of word from the top k words at each word
                # in the sentence (uniformly at random)
                decoder_output_topk, ind = torch.topk(decoder_output, k=self.k, dim=1)
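                # Uniform random position within the top-k
                # (named `choice` so the loop counter `i` is not shadowed)
                choice = int(random.random() * self.k)
                decoder_input = torch.tensor([ind[0, choice]]).to(device)
                decoder_scores = torch.tensor([decoder_output_topk[0, choice]]).to(device)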
            
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores
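
The stochastic decoders above differ from greedy search only in how they pick a token from the decoder's output distribution. The following plain-Python sketch isolates those selection rules on an invented probability list. The function names are hypothetical, and `random.Random` with a fixed seed is used only to keep the example reproducible.

```python
import random

def top_k(probs, k):
    # Indices of the k largest probabilities, highest first
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def pick_uniform(probs, k, rng):
    # Uniform choice among the top-k candidates (ignores their relative scores)
    return rng.choice(top_k(probs, k))

def pick_weighted(probs, k, rng):
    # Choice among the top-k candidates, weighted by renormalized probability
    candidates = top_k(probs, k)
    total = sum(probs[i] for i in candidates)
    weights = [probs[i] / total for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def pick_with_greedy_tail(probs, k, n, step, rng):
    # Sample for the first n steps, then fall back to greedy (arg-max)
    if step >= n:
        return top_k(probs, 1)[0]
    return pick_weighted(probs, k, rng)

rng = random.Random(0)
probs = [0.05, 0.50, 0.30, 0.10, 0.05]  # invented distribution over 5 tokens

print(top_k(probs, 3))                                        # [1, 2, 3]
print(pick_uniform(probs, 3, rng))                            # one of 1, 2, 3
print(pick_weighted(probs, 3, rng))                           # one of 1, 2, 3
print(pick_with_greedy_tail(probs, 3, n=2, step=5, rng=rng))  # 1
```

Uniform selection gives every top-k word the same chance, so it produces more varied but less coherent replies than weighted sampling. Restricting the randomness to the first `n` tokens varies how a sentence starts while letting greedy search finish it.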

The `evaluateInput` function takes input from the user, prepares it so that it can be fed to our model, and prints out the response from our bot. `evaluate` handles most of the actual computation, while `evaluateInput` handles the user interaction.

In [29]:
def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):
    ### Format input sentence as a batch
    # words -> indexes
    indexes_batch = [indexesFromSentence(voc, sentence)]
    # Create lengths tensor
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    # Transpose dimensions of batch to match models' expectations
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)
    # Use appropriate device
    input_batch = input_batch.to(device)
    # lengths = lengths.to(device)  # removed: lengths stays on the CPU because of a GPU/CPU tensor mismatch
    # Decode sentence with searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    # indexes -> words
    decoded_words = [voc.index2word[token.item()] for token in tokens]
    return decoded_words


def evaluateInput(encoder, decoder, searcher, voc):
    input_sentence = ''
    while True:
        try:
            # Get input sentence
            input_sentence = input('> ')
            # Check if it is quit case
            if input_sentence == 'q' or input_sentence == 'quit': break
            # Normalize sentence
            input_sentence = normalizeString(input_sentence)
            # Evaluate sentence
            output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)
            # Format and print response sentence
            output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]
            print('Bot:', ' '.join(output_words))

        except KeyError:
            print("Error: Encountered unknown word.")
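
The data flow in `evaluate` and `evaluateInput` can be traced on a toy vocabulary. In this sketch `ToyVoc`, `toy_indexes_from_sentence`, and the token ids are invented stand-ins (the real vocabulary class and `indexesFromSentence` are defined earlier in the notebook, and appending `EOS` is an assumption based on the PyTorch tutorial). The decoder is skipped entirely: the "response" is just a hard-coded list of token ids.

```python
PAD_token, SOS_token, EOS_token = 0, 1, 2  # assumed ids, as in the PyTorch tutorial

class ToyVoc:
    # Hypothetical miniature vocabulary for illustration only
    def __init__(self, words):
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.word2index = {}
        for word in words:
            index = len(self.index2word)
            self.word2index[word] = index
            self.index2word[index] = word

def toy_indexes_from_sentence(voc, sentence):
    # words -> indexes, with EOS appended; raises KeyError on unknown words
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]

voc = ToyVoc(["hello", "how", "are", "you"])

# Batch of one sentence, shape (batch=1, seq_len)
indexes_batch = [toy_indexes_from_sentence(voc, "how are you")]
lengths = [len(indexes) for indexes in indexes_batch]
# Transpose to (seq_len, batch=1), mirroring the .transpose(0, 1) call in evaluate
input_batch = [list(column) for column in zip(*indexes_batch)]
print(indexes_batch)  # [[4, 5, 6, 2]]
print(lengths)        # [4]
print(input_batch)    # [[4], [5], [6], [2]]

# Pretend the searcher returned these token ids, then map back to words
tokens = [3, 2, 0]
decoded_words = [voc.index2word[token] for token in tokens]
output_words = [w for w in decoded_words if w not in ('EOS', 'PAD')]
print(' '.join(output_words))  # hello
```

A word missing from the vocabulary raises `KeyError` during the lookup, which is the error `evaluateInput` catches and reports as an unknown word.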


### Running the Model

This cell defines and builds our model from the parameters below. Notice that parts of the code are commented out, as these are other possibilities for initialization. `attn_model` can be set to one of three attention scoring methods: `dot`, `general`, or `concat`. We can also choose to load a model that we previously trained by setting `loadFilename`.

In [17]:
# Configure models
model_name = 'cb_model'
attn_model = 'dot'
# attn_model = 'general'
# attn_model = 'concat'
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 1000

# Set checkpoint to load from; set to None if starting from scratch
loadFilename = None
checkpoint_iter = 4000
#loadFilename = os.path.join(save_dir, model_name, corpus_name,
#                            '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size),
#                            '{}_checkpoint.tar'.format(checkpoint_iter))


# Load model if a loadFilename is provided
if loadFilename:
    # If loading on same machine the model was trained on
    checkpoint = torch.load(loadFilename)
    # If loading a model trained on GPU to CPU
    #checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']


print('Building encoder and decoder ...')
# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
# Use appropriate device
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')

Building encoder and decoder ...
Models built and ready to go!
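
For reference, `dot`, `general`, and `concat` are the names used in Luong-style attention for three ways of scoring the current decoder state against each encoder output. The sketch below implements only the `dot` variant in plain Python on invented vectors. It illustrates the idea and is not the notebook's `LuongAttnDecoderRNN` code, which is defined earlier.

```python
import math

def dot_score(decoder_state, encoder_output):
    # 'dot' scoring: plain inner product of the two vectors
    return sum(d * e for d, e in zip(decoder_state, encoder_output))

def softmax(values):
    # Subtract the max for numerical stability before exponentiating
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(decoder_state, encoder_outputs):
    # One weight per encoder time step; the weights sum to 1
    return softmax([dot_score(decoder_state, out) for out in encoder_outputs])

decoder_state = [1.0, 0.0]                               # invented 2-d hidden state
encoder_outputs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]   # invented, 3 time steps

weights = attention_weights(decoder_state, encoder_outputs)
print([round(w, 3) for w in weights])
```

The first encoder output points in the same direction as the decoder state, so it receives the largest weight. The `general` and `concat` variants add learned parameters to the score, which is why they are listed as alternatives in the configuration.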


Now we train our model using `trainIters`. We first set some training parameters, initialize the optimizers, and put the model in training mode. Training is likely to take a very long time. Thankfully, there is the option to load a previously saved model, so this piece of code does not need to be run every time.
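
A few quantities follow directly from the settings in the configuration cells of this section and can be checked by hand. The sketch below recomputes them in plain Python. The values are copied from those cells, while the derived names (`decoder_lr`, `saved_at`, `pairs_drawn`, `percent_complete`) are introduced here only for illustration.

```python
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 5000
save_every = 2000
batch_size = 1000

# The decoder optimizer uses a scaled learning rate
decoder_lr = learning_rate * decoder_learning_ratio

# Iterations at which trainIters writes a checkpoint
saved_at = [i for i in range(1, n_iteration + 1) if i % save_every == 0]

# Pairs drawn over the whole run (random.choice samples with replacement)
pairs_drawn = n_iteration * batch_size

def percent_complete(iteration):
    # Same expression and format as the progress line printed by trainIters
    return "{:.1f}%".format(iteration / n_iteration * 100)

print(decoder_lr)             # about 0.0005
print(saved_at)               # [2000, 4000]
print(pairs_drawn)            # 5000000
print(percent_complete(136))  # 2.7%
```

With these settings no checkpoint is written at the final iteration (5000 is not a multiple of 2000), so the last saved model is the one from iteration 4000, which matches `checkpoint_iter = 4000` in the configuration cell.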

In [18]:
# Safety precaution so that we don't accidentally begin training a model we didn't want to train
TRAIN_THE_MODEL = True

In [19]:
if TRAIN_THE_MODEL:
    
    # Configure training/optimization
    clip = 50
    teacher_forcing_ratio = 1
    learning_rate = 0.0001
    decoder_learning_ratio = 5.0
    n_iteration = 5000
    print_every = 1
    save_every = 2000

    # Ensure dropout layers are in train mode
    encoder.train()
    decoder.train()

    # Initialize optimizers
    print('Building optimizers ...')
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
    if loadFilename:
        encoder_optimizer.load_state_dict(encoder_optimizer_sd)
        decoder_optimizer.load_state_dict(decoder_optimizer_sd)

    # Move any loaded optimizer state to the active device (GPU if available, otherwise CPU)
    for state in encoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(device)
    
    for state in decoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(device)
    
    # Run training iterations
    print("Starting Training!")
    trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
               embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
               print_every, save_every, clip, corpus_name, loadFilename)
    
    TRAIN_THE_MODEL = False

else:
    
    print("Set the variable 'TRAIN_THE_MODEL' in the code block above to True, run that block, then rerun this one")
    print(f"Value of 'TRAIN_THE_MODEL' currently: {TRAIN_THE_MODEL}")


Building optimizers ...
Starting Training!
Initializing ...
Training...
Iteration: 1; Percent complete: 0.0%; Average loss: 8.9605
Iteration: 2; Percent complete: 0.0%; Average loss: 8.9131
Iteration: 3; Percent complete: 0.1%; Average loss: 8.8537
Iteration: 4; Percent complete: 0.1%; Average loss: 8.7638
Iteration: 5; Percent complete: 0.1%; Average loss: 8.6066
Iteration: 6; Percent complete: 0.1%; Average loss: 8.3379
Iteration: 7; Percent complete: 0.1%; Average loss: 7.8598
Iteration: 8; Percent complete: 0.2%; Average loss: 7.1995
Iteration: 9; Percent complete: 0.2%; Average loss: 6.7580
Iteration: 10; Percent complete: 0.2%; Average loss: 6.7284
Iteration: 11; Percent complete: 0.2%; Average loss: 6.7260
Iteration: 12; Percent complete: 0.2%; Average loss: 6.4629
Iteration: 13; Percent complete: 0.3%; Average loss: 6.1065
Iteration: 14; Percent complete: 0.3%; Average loss: 5.7675
Iteration: 15; Percent complete: 0.3%; Average loss: 5.4779
...

Iteration: 136; Percent complete: 2.7%; Average loss: 4.2486
Iteration: 137; Percent complete: 2.7%; Average loss: 4.2275
Iteration: 138; Percent complete: 2.8%; Average loss: 4.2169
Iteration: 139; Percent complete: 2.8%; Average loss: 4.1625
Iteration: 140; Percent complete: 2.8%; Average loss: 4.1690
Iteration: 141; Percent complete: 2.8%; Average loss: 4.1461
Iteration: 142; Percent complete: 2.8%; Average loss: 4.2142
Iteration: 143; Percent complete: 2.9%; Average loss: 4.1662
Iteration: 144; Percent complete: 2.9%; Average loss: 4.1931
Iteration: 145; Percent complete: 2.9%; Average loss: 4.2721
Iteration: 146; Percent complete: 2.9%; Average loss: 4.2080
Iteration: 147; Percent complete: 2.9%; Average loss: 4.2314
Iteration: 148; Percent complete: 3.0%; Average loss: 4.1732
Iteration: 149; Percent complete: 3.0%; Average loss: 4.2287
Iteration: 150; Percent complete: 3.0%; Average loss: 4.1466
Iteration: 151; Percent complete: 3.0%; Average loss: 4.1276
...

Iteration: 271; Percent complete: 5.4%; Average loss: 3.8466
Iteration: 272; Percent complete: 5.4%; Average loss: 3.7744
Iteration: 273; Percent complete: 5.5%; Average loss: 3.8543
Iteration: 274; Percent complete: 5.5%; Average loss: 3.8299
Iteration: 275; Percent complete: 5.5%; Average loss: 3.7319
Iteration: 276; Percent complete: 5.5%; Average loss: 3.9144
Iteration: 277; Percent complete: 5.5%; Average loss: 3.7662
Iteration: 278; Percent complete: 5.6%; Average loss: 3.8302
Iteration: 279; Percent complete: 5.6%; Average loss: 3.7969
Iteration: 280; Percent complete: 5.6%; Average loss: 3.8002
Iteration: 281; Percent complete: 5.6%; Average loss: 3.7715
Iteration: 282; Percent complete: 5.6%; Average loss: 3.7912
Iteration: 283; Percent complete: 5.7%; Average loss: 3.7770
Iteration: 284; Percent complete: 5.7%; Average loss: 3.7768
Iteration: 285; Percent complete: 5.7%; Average loss: 3.7901
Iteration: 286; Percent complete: 5.7%; Average loss: 3.7315
...

Iteration: 406; Percent complete: 8.1%; Average loss: 3.6182
Iteration: 407; Percent complete: 8.1%; Average loss: 3.5862
Iteration: 408; Percent complete: 8.2%; Average loss: 3.5561
Iteration: 409; Percent complete: 8.2%; Average loss: 3.5632
Iteration: 410; Percent complete: 8.2%; Average loss: 3.6035
Iteration: 411; Percent complete: 8.2%; Average loss: 3.6329
Iteration: 412; Percent complete: 8.2%; Average loss: 3.5735
Iteration: 413; Percent complete: 8.3%; Average loss: 3.6031
Iteration: 414; Percent complete: 8.3%; Average loss: 3.5539
Iteration: 415; Percent complete: 8.3%; Average loss: 3.6140
Iteration: 416; Percent complete: 8.3%; Average loss: 3.5671
Iteration: 417; Percent complete: 8.3%; Average loss: 3.5700
Iteration: 418; Percent complete: 8.4%; Average loss: 3.5913
Iteration: 419; Percent complete: 8.4%; Average loss: 3.5923
Iteration: 420; Percent complete: 8.4%; Average loss: 3.6020
Iteration: 421; Percent complete: 8.4%; Average loss: 3.5941
...

Iteration: 540; Percent complete: 10.8%; Average loss: 3.4434
Iteration: 541; Percent complete: 10.8%; Average loss: 3.4354
Iteration: 542; Percent complete: 10.8%; Average loss: 3.3685
Iteration: 543; Percent complete: 10.9%; Average loss: 3.4624
Iteration: 544; Percent complete: 10.9%; Average loss: 3.3960
Iteration: 545; Percent complete: 10.9%; Average loss: 3.3982
Iteration: 546; Percent complete: 10.9%; Average loss: 3.4397
Iteration: 547; Percent complete: 10.9%; Average loss: 3.4062
Iteration: 548; Percent complete: 11.0%; Average loss: 3.3895
Iteration: 549; Percent complete: 11.0%; Average loss: 3.4277
Iteration: 550; Percent complete: 11.0%; Average loss: 3.3974
Iteration: 551; Percent complete: 11.0%; Average loss: 3.4224
Iteration: 552; Percent complete: 11.0%; Average loss: 3.3679
Iteration: 553; Percent complete: 11.1%; Average loss: 3.4039
Iteration: 554; Percent complete: 11.1%; Average loss: 3.3870
Iteration: 555; Percent complete: 11.1%; Average loss: 3.4131
...

Iteration: 673; Percent complete: 13.5%; Average loss: 3.2728
Iteration: 674; Percent complete: 13.5%; Average loss: 3.2657
Iteration: 675; Percent complete: 13.5%; Average loss: 3.3077
Iteration: 676; Percent complete: 13.5%; Average loss: 3.2944
Iteration: 677; Percent complete: 13.5%; Average loss: 3.3255
Iteration: 678; Percent complete: 13.6%; Average loss: 3.2804
Iteration: 679; Percent complete: 13.6%; Average loss: 3.2630
Iteration: 680; Percent complete: 13.6%; Average loss: 3.2402
Iteration: 681; Percent complete: 13.6%; Average loss: 3.3096
Iteration: 682; Percent complete: 13.6%; Average loss: 3.2401
Iteration: 683; Percent complete: 13.7%; Average loss: 3.2329
Iteration: 684; Percent complete: 13.7%; Average loss: 3.2800
Iteration: 685; Percent complete: 13.7%; Average loss: 3.2787
Iteration: 686; Percent complete: 13.7%; Average loss: 3.2812
Iteration: 687; Percent complete: 13.7%; Average loss: 3.2681
Iteration: 688; Percent complete: 13.8%; Average loss: 3.2665
...

Iteration: 806; Percent complete: 16.1%; Average loss: 3.1461
Iteration: 807; Percent complete: 16.1%; Average loss: 3.0526
Iteration: 808; Percent complete: 16.2%; Average loss: 3.0792
Iteration: 809; Percent complete: 16.2%; Average loss: 3.1076
Iteration: 810; Percent complete: 16.2%; Average loss: 3.1556
Iteration: 811; Percent complete: 16.2%; Average loss: 3.0515
Iteration: 812; Percent complete: 16.2%; Average loss: 3.1314
Iteration: 813; Percent complete: 16.3%; Average loss: 3.0910
Iteration: 814; Percent complete: 16.3%; Average loss: 3.0819
Iteration: 815; Percent complete: 16.3%; Average loss: 3.1192
Iteration: 816; Percent complete: 16.3%; Average loss: 3.1477
Iteration: 817; Percent complete: 16.3%; Average loss: 3.1483
Iteration: 818; Percent complete: 16.4%; Average loss: 3.1309
Iteration: 819; Percent complete: 16.4%; Average loss: 3.0675
Iteration: 820; Percent complete: 16.4%; Average loss: 3.1234
Iteration: 821; Percent complete: 16.4%; Average loss: 3.1123
...

Iteration: 939; Percent complete: 18.8%; Average loss: 2.9733
Iteration: 940; Percent complete: 18.8%; Average loss: 2.9372
Iteration: 941; Percent complete: 18.8%; Average loss: 2.9131
Iteration: 942; Percent complete: 18.8%; Average loss: 2.9670
Iteration: 943; Percent complete: 18.9%; Average loss: 2.9925
Iteration: 944; Percent complete: 18.9%; Average loss: 2.9253
Iteration: 945; Percent complete: 18.9%; Average loss: 2.9916
Iteration: 946; Percent complete: 18.9%; Average loss: 2.9543
Iteration: 947; Percent complete: 18.9%; Average loss: 2.9392
Iteration: 948; Percent complete: 19.0%; Average loss: 2.9824
Iteration: 949; Percent complete: 19.0%; Average loss: 2.9334
Iteration: 950; Percent complete: 19.0%; Average loss: 2.9314
Iteration: 951; Percent complete: 19.0%; Average loss: 2.9638
Iteration: 952; Percent complete: 19.0%; Average loss: 2.9794
Iteration: 953; Percent complete: 19.1%; Average loss: 2.9127
Iteration: 954; Percent complete: 19.1%; Average loss: 2.9971
...

Iteration: 1071; Percent complete: 21.4%; Average loss: 2.8295
Iteration: 1072; Percent complete: 21.4%; Average loss: 2.8539
Iteration: 1073; Percent complete: 21.5%; Average loss: 2.7966
Iteration: 1074; Percent complete: 21.5%; Average loss: 2.8050
Iteration: 1075; Percent complete: 21.5%; Average loss: 2.8188
Iteration: 1076; Percent complete: 21.5%; Average loss: 2.8303
Iteration: 1077; Percent complete: 21.5%; Average loss: 2.8274
Iteration: 1078; Percent complete: 21.6%; Average loss: 2.7563
Iteration: 1079; Percent complete: 21.6%; Average loss: 2.8034
Iteration: 1080; Percent complete: 21.6%; Average loss: 2.8760
Iteration: 1081; Percent complete: 21.6%; Average loss: 2.8154
Iteration: 1082; Percent complete: 21.6%; Average loss: 2.7819
Iteration: 1083; Percent complete: 21.7%; Average loss: 2.8174
Iteration: 1084; Percent complete: 21.7%; Average loss: 2.8183
Iteration: 1085; Percent complete: 21.7%; Average loss: 2.8191
...

Iteration: 1202; Percent complete: 24.0%; Average loss: 2.7326
Iteration: 1203; Percent complete: 24.1%; Average loss: 2.6308
Iteration: 1204; Percent complete: 24.1%; Average loss: 2.6067
Iteration: 1205; Percent complete: 24.1%; Average loss: 2.6911
Iteration: 1206; Percent complete: 24.1%; Average loss: 2.5941
Iteration: 1207; Percent complete: 24.1%; Average loss: 2.6489
Iteration: 1208; Percent complete: 24.2%; Average loss: 2.6556
Iteration: 1209; Percent complete: 24.2%; Average loss: 2.6970
Iteration: 1210; Percent complete: 24.2%; Average loss: 2.6405
Iteration: 1211; Percent complete: 24.2%; Average loss: 2.6258
Iteration: 1212; Percent complete: 24.2%; Average loss: 2.6929
Iteration: 1213; Percent complete: 24.3%; Average loss: 2.6491
Iteration: 1214; Percent complete: 24.3%; Average loss: 2.6241
Iteration: 1215; Percent complete: 24.3%; Average loss: 2.7158
Iteration: 1216; Percent complete: 24.3%; Average loss: 2.6751
...

Iteration: 1333; Percent complete: 26.7%; Average loss: 2.5511
Iteration: 1334; Percent complete: 26.7%; Average loss: 2.5127
Iteration: 1335; Percent complete: 26.7%; Average loss: 2.5255
Iteration: 1336; Percent complete: 26.7%; Average loss: 2.5750
Iteration: 1337; Percent complete: 26.7%; Average loss: 2.5227
Iteration: 1338; Percent complete: 26.8%; Average loss: 2.5141
Iteration: 1339; Percent complete: 26.8%; Average loss: 2.4969
Iteration: 1340; Percent complete: 26.8%; Average loss: 2.5010
Iteration: 1341; Percent complete: 26.8%; Average loss: 2.5028
Iteration: 1342; Percent complete: 26.8%; Average loss: 2.4933
Iteration: 1343; Percent complete: 26.9%; Average loss: 2.4518
Iteration: 1344; Percent complete: 26.9%; Average loss: 2.5014
Iteration: 1345; Percent complete: 26.9%; Average loss: 2.4960
Iteration: 1346; Percent complete: 26.9%; Average loss: 2.4709
Iteration: 1347; Percent complete: 26.9%; Average loss: 2.4962
...

Iteration: 1464; Percent complete: 29.3%; Average loss: 2.3613
Iteration: 1465; Percent complete: 29.3%; Average loss: 2.3915
Iteration: 1466; Percent complete: 29.3%; Average loss: 2.3366
Iteration: 1467; Percent complete: 29.3%; Average loss: 2.3714
Iteration: 1468; Percent complete: 29.4%; Average loss: 2.3575
Iteration: 1469; Percent complete: 29.4%; Average loss: 2.3392
Iteration: 1470; Percent complete: 29.4%; Average loss: 2.3619
Iteration: 1471; Percent complete: 29.4%; Average loss: 2.2795
Iteration: 1472; Percent complete: 29.4%; Average loss: 2.3910
Iteration: 1473; Percent complete: 29.5%; Average loss: 2.3791
Iteration: 1474; Percent complete: 29.5%; Average loss: 2.2731
Iteration: 1475; Percent complete: 29.5%; Average loss: 2.3594
Iteration: 1476; Percent complete: 29.5%; Average loss: 2.3466
Iteration: 1477; Percent complete: 29.5%; Average loss: 2.3571
Iteration: 1478; Percent complete: 29.6%; Average loss: 2.3105
...

Iteration: 1595; Percent complete: 31.9%; Average loss: 2.2072
Iteration: 1596; Percent complete: 31.9%; Average loss: 2.1531
Iteration: 1597; Percent complete: 31.9%; Average loss: 2.1575
Iteration: 1598; Percent complete: 32.0%; Average loss: 2.1982
Iteration: 1599; Percent complete: 32.0%; Average loss: 2.1563
Iteration: 1600; Percent complete: 32.0%; Average loss: 2.1963
Iteration: 1601; Percent complete: 32.0%; Average loss: 2.1938
Iteration: 1602; Percent complete: 32.0%; Average loss: 2.1627
Iteration: 1603; Percent complete: 32.1%; Average loss: 2.2230
Iteration: 1604; Percent complete: 32.1%; Average loss: 2.1465
Iteration: 1605; Percent complete: 32.1%; Average loss: 2.1454
Iteration: 1606; Percent complete: 32.1%; Average loss: 2.1364
Iteration: 1607; Percent complete: 32.1%; Average loss: 2.1382
Iteration: 1608; Percent complete: 32.2%; Average loss: 2.1918
Iteration: 1609; Percent complete: 32.2%; Average loss: 2.1785
...

Iteration: 1726; Percent complete: 34.5%; Average loss: 2.0210
Iteration: 1727; Percent complete: 34.5%; Average loss: 1.9734
Iteration: 1728; Percent complete: 34.6%; Average loss: 2.0629
Iteration: 1729; Percent complete: 34.6%; Average loss: 2.0258
Iteration: 1730; Percent complete: 34.6%; Average loss: 2.0435
Iteration: 1731; Percent complete: 34.6%; Average loss: 1.9665
Iteration: 1732; Percent complete: 34.6%; Average loss: 1.9804
Iteration: 1733; Percent complete: 34.7%; Average loss: 2.0211
Iteration: 1734; Percent complete: 34.7%; Average loss: 2.0367
Iteration: 1735; Percent complete: 34.7%; Average loss: 2.0408
Iteration: 1736; Percent complete: 34.7%; Average loss: 1.9667
Iteration: 1737; Percent complete: 34.7%; Average loss: 2.0619
Iteration: 1738; Percent complete: 34.8%; Average loss: 2.0518
Iteration: 1739; Percent complete: 34.8%; Average loss: 2.0617
Iteration: 1740; Percent complete: 34.8%; Average loss: 1.9905
Iteration: 1741; Percent complete: 34.8%; Average loss:

Iteration: 1857; Percent complete: 37.1%; Average loss: 1.8827
Iteration: 1858; Percent complete: 37.2%; Average loss: 1.8326
Iteration: 1859; Percent complete: 37.2%; Average loss: 1.8688
Iteration: 1860; Percent complete: 37.2%; Average loss: 1.8834
Iteration: 1861; Percent complete: 37.2%; Average loss: 1.8780
Iteration: 1862; Percent complete: 37.2%; Average loss: 1.8910
Iteration: 1863; Percent complete: 37.3%; Average loss: 1.8928
Iteration: 1864; Percent complete: 37.3%; Average loss: 1.8491
Iteration: 1865; Percent complete: 37.3%; Average loss: 1.8662
Iteration: 1866; Percent complete: 37.3%; Average loss: 1.8952
Iteration: 1867; Percent complete: 37.3%; Average loss: 1.8363
Iteration: 1868; Percent complete: 37.4%; Average loss: 1.8591
Iteration: 1869; Percent complete: 37.4%; Average loss: 1.8518
Iteration: 1870; Percent complete: 37.4%; Average loss: 1.8682
Iteration: 1871; Percent complete: 37.4%; Average loss: 1.8441
Iteration: 1872; Percent complete: 37.4%; Average loss:

Iteration: 1988; Percent complete: 39.8%; Average loss: 1.6853
Iteration: 1989; Percent complete: 39.8%; Average loss: 1.7767
Iteration: 1990; Percent complete: 39.8%; Average loss: 1.7039
Iteration: 1991; Percent complete: 39.8%; Average loss: 1.7530
Iteration: 1992; Percent complete: 39.8%; Average loss: 1.7054
Iteration: 1993; Percent complete: 39.9%; Average loss: 1.7548
Iteration: 1994; Percent complete: 39.9%; Average loss: 1.6894
Iteration: 1995; Percent complete: 39.9%; Average loss: 1.7284
Iteration: 1996; Percent complete: 39.9%; Average loss: 1.7656
Iteration: 1997; Percent complete: 39.9%; Average loss: 1.7228
Iteration: 1998; Percent complete: 40.0%; Average loss: 1.7285
Iteration: 1999; Percent complete: 40.0%; Average loss: 1.6981
Iteration: 2000; Percent complete: 40.0%; Average loss: 1.7074
Iteration: 2001; Percent complete: 40.0%; Average loss: 1.7094
Iteration: 2002; Percent complete: 40.0%; Average loss: 1.7050
Iteration: 2003; Percent complete: 40.1%; Average loss:

Iteration: 2119; Percent complete: 42.4%; Average loss: 1.5713
Iteration: 2120; Percent complete: 42.4%; Average loss: 1.6359
Iteration: 2121; Percent complete: 42.4%; Average loss: 1.5766
Iteration: 2122; Percent complete: 42.4%; Average loss: 1.5601
Iteration: 2123; Percent complete: 42.5%; Average loss: 1.6061
Iteration: 2124; Percent complete: 42.5%; Average loss: 1.5609
Iteration: 2125; Percent complete: 42.5%; Average loss: 1.5701
Iteration: 2126; Percent complete: 42.5%; Average loss: 1.5582
Iteration: 2127; Percent complete: 42.5%; Average loss: 1.5815
Iteration: 2128; Percent complete: 42.6%; Average loss: 1.5190
Iteration: 2129; Percent complete: 42.6%; Average loss: 1.5956
Iteration: 2130; Percent complete: 42.6%; Average loss: 1.5848
Iteration: 2131; Percent complete: 42.6%; Average loss: 1.5277
Iteration: 2132; Percent complete: 42.6%; Average loss: 1.5306
Iteration: 2133; Percent complete: 42.7%; Average loss: 1.5836
Iteration: 2134; Percent complete: 42.7%; Average loss:

Iteration: 2250; Percent complete: 45.0%; Average loss: 1.4475
Iteration: 2251; Percent complete: 45.0%; Average loss: 1.4412
Iteration: 2252; Percent complete: 45.0%; Average loss: 1.4776
Iteration: 2253; Percent complete: 45.1%; Average loss: 1.4523
Iteration: 2254; Percent complete: 45.1%; Average loss: 1.4757
Iteration: 2255; Percent complete: 45.1%; Average loss: 1.4426
Iteration: 2256; Percent complete: 45.1%; Average loss: 1.4428
Iteration: 2257; Percent complete: 45.1%; Average loss: 1.4678
Iteration: 2258; Percent complete: 45.2%; Average loss: 1.4157
Iteration: 2259; Percent complete: 45.2%; Average loss: 1.4670
Iteration: 2260; Percent complete: 45.2%; Average loss: 1.4827
Iteration: 2261; Percent complete: 45.2%; Average loss: 1.4447
Iteration: 2262; Percent complete: 45.2%; Average loss: 1.4561
Iteration: 2263; Percent complete: 45.3%; Average loss: 1.4385
Iteration: 2264; Percent complete: 45.3%; Average loss: 1.4279
Iteration: 2265; Percent complete: 45.3%; Average loss:

Iteration: 2381; Percent complete: 47.6%; Average loss: 1.3532
Iteration: 2382; Percent complete: 47.6%; Average loss: 1.3090
Iteration: 2383; Percent complete: 47.7%; Average loss: 1.3239
Iteration: 2384; Percent complete: 47.7%; Average loss: 1.2599
Iteration: 2385; Percent complete: 47.7%; Average loss: 1.3437
Iteration: 2386; Percent complete: 47.7%; Average loss: 1.2696
Iteration: 2387; Percent complete: 47.7%; Average loss: 1.3420
Iteration: 2388; Percent complete: 47.8%; Average loss: 1.2815
Iteration: 2389; Percent complete: 47.8%; Average loss: 1.3197
Iteration: 2390; Percent complete: 47.8%; Average loss: 1.2708
Iteration: 2391; Percent complete: 47.8%; Average loss: 1.3170
Iteration: 2392; Percent complete: 47.8%; Average loss: 1.3144
Iteration: 2393; Percent complete: 47.9%; Average loss: 1.2932
Iteration: 2394; Percent complete: 47.9%; Average loss: 1.3471
Iteration: 2395; Percent complete: 47.9%; Average loss: 1.3275
Iteration: 2396; Percent complete: 47.9%; Average loss:

Iteration: 2512; Percent complete: 50.2%; Average loss: 1.1859
Iteration: 2513; Percent complete: 50.3%; Average loss: 1.2116
Iteration: 2514; Percent complete: 50.3%; Average loss: 1.2182
Iteration: 2515; Percent complete: 50.3%; Average loss: 1.1794
Iteration: 2516; Percent complete: 50.3%; Average loss: 1.1630
Iteration: 2517; Percent complete: 50.3%; Average loss: 1.1905
Iteration: 2518; Percent complete: 50.4%; Average loss: 1.2216
Iteration: 2519; Percent complete: 50.4%; Average loss: 1.2519
Iteration: 2520; Percent complete: 50.4%; Average loss: 1.1905
Iteration: 2521; Percent complete: 50.4%; Average loss: 1.2204
Iteration: 2522; Percent complete: 50.4%; Average loss: 1.2178
Iteration: 2523; Percent complete: 50.5%; Average loss: 1.2217
Iteration: 2524; Percent complete: 50.5%; Average loss: 1.2003
Iteration: 2525; Percent complete: 50.5%; Average loss: 1.2012
Iteration: 2526; Percent complete: 50.5%; Average loss: 1.1635
Iteration: 2527; Percent complete: 50.5%; Average loss:

Iteration: 2643; Percent complete: 52.9%; Average loss: 1.1051
Iteration: 2644; Percent complete: 52.9%; Average loss: 1.0771
Iteration: 2645; Percent complete: 52.9%; Average loss: 1.0612
Iteration: 2646; Percent complete: 52.9%; Average loss: 1.1145
Iteration: 2647; Percent complete: 52.9%; Average loss: 1.0843
Iteration: 2648; Percent complete: 53.0%; Average loss: 1.0831
Iteration: 2649; Percent complete: 53.0%; Average loss: 1.0829
Iteration: 2650; Percent complete: 53.0%; Average loss: 1.0960
Iteration: 2651; Percent complete: 53.0%; Average loss: 1.0995
Iteration: 2652; Percent complete: 53.0%; Average loss: 1.0850
Iteration: 2653; Percent complete: 53.1%; Average loss: 1.0670
Iteration: 2654; Percent complete: 53.1%; Average loss: 1.0985
Iteration: 2655; Percent complete: 53.1%; Average loss: 1.0926
Iteration: 2656; Percent complete: 53.1%; Average loss: 1.1094
Iteration: 2657; Percent complete: 53.1%; Average loss: 1.0510
Iteration: 2658; Percent complete: 53.2%; Average loss:

Iteration: 2774; Percent complete: 55.5%; Average loss: 0.9474
Iteration: 2775; Percent complete: 55.5%; Average loss: 0.9605
Iteration: 2776; Percent complete: 55.5%; Average loss: 0.9732
Iteration: 2777; Percent complete: 55.5%; Average loss: 0.9871
Iteration: 2778; Percent complete: 55.6%; Average loss: 1.0167
Iteration: 2779; Percent complete: 55.6%; Average loss: 0.9527
Iteration: 2780; Percent complete: 55.6%; Average loss: 0.9748
Iteration: 2781; Percent complete: 55.6%; Average loss: 0.9272
Iteration: 2782; Percent complete: 55.6%; Average loss: 0.9597
Iteration: 2783; Percent complete: 55.7%; Average loss: 1.0081
Iteration: 2784; Percent complete: 55.7%; Average loss: 0.9773
Iteration: 2785; Percent complete: 55.7%; Average loss: 0.9589
Iteration: 2786; Percent complete: 55.7%; Average loss: 0.9522
Iteration: 2787; Percent complete: 55.7%; Average loss: 0.9887
Iteration: 2788; Percent complete: 55.8%; Average loss: 0.9718
Iteration: 2789; Percent complete: 55.8%; Average loss:

Iteration: 2905; Percent complete: 58.1%; Average loss: 0.8953
Iteration: 2906; Percent complete: 58.1%; Average loss: 0.8776
Iteration: 2907; Percent complete: 58.1%; Average loss: 0.8592
Iteration: 2908; Percent complete: 58.2%; Average loss: 0.9107
Iteration: 2909; Percent complete: 58.2%; Average loss: 0.8984
Iteration: 2910; Percent complete: 58.2%; Average loss: 0.8841
Iteration: 2911; Percent complete: 58.2%; Average loss: 0.8773
Iteration: 2912; Percent complete: 58.2%; Average loss: 0.8566
Iteration: 2913; Percent complete: 58.3%; Average loss: 0.8523
Iteration: 2914; Percent complete: 58.3%; Average loss: 0.8919
Iteration: 2915; Percent complete: 58.3%; Average loss: 0.8443
Iteration: 2916; Percent complete: 58.3%; Average loss: 0.9154
Iteration: 2917; Percent complete: 58.3%; Average loss: 0.8769
Iteration: 2918; Percent complete: 58.4%; Average loss: 0.9036
Iteration: 2919; Percent complete: 58.4%; Average loss: 0.9009
Iteration: 2920; Percent complete: 58.4%; Average loss:

Iteration: 3036; Percent complete: 60.7%; Average loss: 0.8102
Iteration: 3037; Percent complete: 60.7%; Average loss: 0.8079
Iteration: 3038; Percent complete: 60.8%; Average loss: 0.7892
Iteration: 3039; Percent complete: 60.8%; Average loss: 0.8317
Iteration: 3040; Percent complete: 60.8%; Average loss: 0.8083
Iteration: 3041; Percent complete: 60.8%; Average loss: 0.8270
Iteration: 3042; Percent complete: 60.8%; Average loss: 0.7894
Iteration: 3043; Percent complete: 60.9%; Average loss: 0.8249
Iteration: 3044; Percent complete: 60.9%; Average loss: 0.8299
Iteration: 3045; Percent complete: 60.9%; Average loss: 0.7923
Iteration: 3046; Percent complete: 60.9%; Average loss: 0.8490
Iteration: 3047; Percent complete: 60.9%; Average loss: 0.7994
Iteration: 3048; Percent complete: 61.0%; Average loss: 0.8192
Iteration: 3049; Percent complete: 61.0%; Average loss: 0.7946
Iteration: 3050; Percent complete: 61.0%; Average loss: 0.8161
Iteration: 3051; Percent complete: 61.0%; Average loss:

Iteration: 3167; Percent complete: 63.3%; Average loss: 0.7073
Iteration: 3168; Percent complete: 63.4%; Average loss: 0.6893
Iteration: 3169; Percent complete: 63.4%; Average loss: 0.7456
Iteration: 3170; Percent complete: 63.4%; Average loss: 0.7249
Iteration: 3171; Percent complete: 63.4%; Average loss: 0.7001
Iteration: 3172; Percent complete: 63.4%; Average loss: 0.7204
Iteration: 3173; Percent complete: 63.5%; Average loss: 0.7228
Iteration: 3174; Percent complete: 63.5%; Average loss: 0.7494
Iteration: 3175; Percent complete: 63.5%; Average loss: 0.7094
Iteration: 3176; Percent complete: 63.5%; Average loss: 0.6853
Iteration: 3177; Percent complete: 63.5%; Average loss: 0.7126
Iteration: 3178; Percent complete: 63.6%; Average loss: 0.7040
Iteration: 3179; Percent complete: 63.6%; Average loss: 0.6957
Iteration: 3180; Percent complete: 63.6%; Average loss: 0.7039
Iteration: 3181; Percent complete: 63.6%; Average loss: 0.6947
Iteration: 3182; Percent complete: 63.6%; Average loss:

Iteration: 3298; Percent complete: 66.0%; Average loss: 0.6197
Iteration: 3299; Percent complete: 66.0%; Average loss: 0.6765
Iteration: 3300; Percent complete: 66.0%; Average loss: 0.6537
Iteration: 3301; Percent complete: 66.0%; Average loss: 0.6771
Iteration: 3302; Percent complete: 66.0%; Average loss: 0.6807
Iteration: 3303; Percent complete: 66.1%; Average loss: 0.6526
Iteration: 3304; Percent complete: 66.1%; Average loss: 0.6463
Iteration: 3305; Percent complete: 66.1%; Average loss: 0.6541
Iteration: 3306; Percent complete: 66.1%; Average loss: 0.6383
Iteration: 3307; Percent complete: 66.1%; Average loss: 0.5986
Iteration: 3308; Percent complete: 66.2%; Average loss: 0.6358
Iteration: 3309; Percent complete: 66.2%; Average loss: 0.6640
Iteration: 3310; Percent complete: 66.2%; Average loss: 0.6661
Iteration: 3311; Percent complete: 66.2%; Average loss: 0.6290
Iteration: 3312; Percent complete: 66.2%; Average loss: 0.6573
Iteration: 3313; Percent complete: 66.3%; Average loss:

Iteration: 3429; Percent complete: 68.6%; Average loss: 0.5917
Iteration: 3430; Percent complete: 68.6%; Average loss: 0.5872
Iteration: 3431; Percent complete: 68.6%; Average loss: 0.5813
Iteration: 3432; Percent complete: 68.6%; Average loss: 0.5850
Iteration: 3433; Percent complete: 68.7%; Average loss: 0.5459
Iteration: 3434; Percent complete: 68.7%; Average loss: 0.5899
Iteration: 3435; Percent complete: 68.7%; Average loss: 0.5710
Iteration: 3436; Percent complete: 68.7%; Average loss: 0.6221
Iteration: 3437; Percent complete: 68.7%; Average loss: 0.5809
Iteration: 3438; Percent complete: 68.8%; Average loss: 0.5707
Iteration: 3439; Percent complete: 68.8%; Average loss: 0.5694
Iteration: 3440; Percent complete: 68.8%; Average loss: 0.6033
Iteration: 3441; Percent complete: 68.8%; Average loss: 0.5888
Iteration: 3442; Percent complete: 68.8%; Average loss: 0.5599
Iteration: 3443; Percent complete: 68.9%; Average loss: 0.5829
Iteration: 3444; Percent complete: 68.9%; Average loss:

Iteration: 3560; Percent complete: 71.2%; Average loss: 0.5391
Iteration: 3561; Percent complete: 71.2%; Average loss: 0.5324
Iteration: 3562; Percent complete: 71.2%; Average loss: 0.5472
Iteration: 3563; Percent complete: 71.3%; Average loss: 0.5541
Iteration: 3564; Percent complete: 71.3%; Average loss: 0.5251
Iteration: 3565; Percent complete: 71.3%; Average loss: 0.5319
Iteration: 3566; Percent complete: 71.3%; Average loss: 0.5571
Iteration: 3567; Percent complete: 71.3%; Average loss: 0.4866
Iteration: 3568; Percent complete: 71.4%; Average loss: 0.5307
Iteration: 3569; Percent complete: 71.4%; Average loss: 0.5244
Iteration: 3570; Percent complete: 71.4%; Average loss: 0.5471
Iteration: 3571; Percent complete: 71.4%; Average loss: 0.5056
Iteration: 3572; Percent complete: 71.4%; Average loss: 0.4944
Iteration: 3573; Percent complete: 71.5%; Average loss: 0.5273
Iteration: 3574; Percent complete: 71.5%; Average loss: 0.5399
Iteration: 3575; Percent complete: 71.5%; Average loss:

Iteration: 3691; Percent complete: 73.8%; Average loss: 0.5021
Iteration: 3692; Percent complete: 73.8%; Average loss: 0.4889
Iteration: 3693; Percent complete: 73.9%; Average loss: 0.4829
Iteration: 3694; Percent complete: 73.9%; Average loss: 0.4810
Iteration: 3695; Percent complete: 73.9%; Average loss: 0.4772
Iteration: 3696; Percent complete: 73.9%; Average loss: 0.5135
Iteration: 3697; Percent complete: 73.9%; Average loss: 0.4633
Iteration: 3698; Percent complete: 74.0%; Average loss: 0.4517
Iteration: 3699; Percent complete: 74.0%; Average loss: 0.4939
Iteration: 3700; Percent complete: 74.0%; Average loss: 0.4910
Iteration: 3701; Percent complete: 74.0%; Average loss: 0.5050
Iteration: 3702; Percent complete: 74.0%; Average loss: 0.4834
Iteration: 3703; Percent complete: 74.1%; Average loss: 0.4800
Iteration: 3704; Percent complete: 74.1%; Average loss: 0.4892
Iteration: 3705; Percent complete: 74.1%; Average loss: 0.4972
Iteration: 3706; Percent complete: 74.1%; Average loss:

Iteration: 3822; Percent complete: 76.4%; Average loss: 0.4097
Iteration: 3823; Percent complete: 76.5%; Average loss: 0.4325
Iteration: 3824; Percent complete: 76.5%; Average loss: 0.4399
Iteration: 3825; Percent complete: 76.5%; Average loss: 0.4545
Iteration: 3826; Percent complete: 76.5%; Average loss: 0.4806
Iteration: 3827; Percent complete: 76.5%; Average loss: 0.4531
Iteration: 3828; Percent complete: 76.6%; Average loss: 0.4228
Iteration: 3829; Percent complete: 76.6%; Average loss: 0.4346
Iteration: 3830; Percent complete: 76.6%; Average loss: 0.4300
Iteration: 3831; Percent complete: 76.6%; Average loss: 0.4463
Iteration: 3832; Percent complete: 76.6%; Average loss: 0.4765
Iteration: 3833; Percent complete: 76.7%; Average loss: 0.4218
Iteration: 3834; Percent complete: 76.7%; Average loss: 0.4426
Iteration: 3835; Percent complete: 76.7%; Average loss: 0.4464
Iteration: 3836; Percent complete: 76.7%; Average loss: 0.4280
Iteration: 3837; Percent complete: 76.7%; Average loss:

Iteration: 3953; Percent complete: 79.1%; Average loss: 0.4061
Iteration: 3954; Percent complete: 79.1%; Average loss: 0.4224
Iteration: 3955; Percent complete: 79.1%; Average loss: 0.3867
Iteration: 3956; Percent complete: 79.1%; Average loss: 0.4077
Iteration: 3957; Percent complete: 79.1%; Average loss: 0.3896
Iteration: 3958; Percent complete: 79.2%; Average loss: 0.3846
Iteration: 3959; Percent complete: 79.2%; Average loss: 0.4267
Iteration: 3960; Percent complete: 79.2%; Average loss: 0.4062
Iteration: 3961; Percent complete: 79.2%; Average loss: 0.4448
Iteration: 3962; Percent complete: 79.2%; Average loss: 0.4112
Iteration: 3963; Percent complete: 79.3%; Average loss: 0.3972
Iteration: 3964; Percent complete: 79.3%; Average loss: 0.3993
Iteration: 3965; Percent complete: 79.3%; Average loss: 0.3801
Iteration: 3966; Percent complete: 79.3%; Average loss: 0.4267
Iteration: 3967; Percent complete: 79.3%; Average loss: 0.3987
Iteration: 3968; Percent complete: 79.4%; Average loss:

Iteration: 4084; Percent complete: 81.7%; Average loss: 0.3719
Iteration: 4085; Percent complete: 81.7%; Average loss: 0.3447
Iteration: 4086; Percent complete: 81.7%; Average loss: 0.3787
Iteration: 4087; Percent complete: 81.7%; Average loss: 0.3723
Iteration: 4088; Percent complete: 81.8%; Average loss: 0.3907
Iteration: 4089; Percent complete: 81.8%; Average loss: 0.3699
Iteration: 4090; Percent complete: 81.8%; Average loss: 0.3837
Iteration: 4091; Percent complete: 81.8%; Average loss: 0.3932
Iteration: 4092; Percent complete: 81.8%; Average loss: 0.3618
Iteration: 4093; Percent complete: 81.9%; Average loss: 0.3469
Iteration: 4094; Percent complete: 81.9%; Average loss: 0.3706
Iteration: 4095; Percent complete: 81.9%; Average loss: 0.3699
Iteration: 4096; Percent complete: 81.9%; Average loss: 0.3661
Iteration: 4097; Percent complete: 81.9%; Average loss: 0.3846
Iteration: 4098; Percent complete: 82.0%; Average loss: 0.3560
Iteration: 4099; Percent complete: 82.0%; Average loss:

Iteration: 4215; Percent complete: 84.3%; Average loss: 0.3558
Iteration: 4216; Percent complete: 84.3%; Average loss: 0.3477
Iteration: 4217; Percent complete: 84.3%; Average loss: 0.3434
Iteration: 4218; Percent complete: 84.4%; Average loss: 0.3416
Iteration: 4219; Percent complete: 84.4%; Average loss: 0.3534
Iteration: 4220; Percent complete: 84.4%; Average loss: 0.3500
Iteration: 4221; Percent complete: 84.4%; Average loss: 0.3496
Iteration: 4222; Percent complete: 84.4%; Average loss: 0.3490
Iteration: 4223; Percent complete: 84.5%; Average loss: 0.3471
Iteration: 4224; Percent complete: 84.5%; Average loss: 0.3351
Iteration: 4225; Percent complete: 84.5%; Average loss: 0.3634
Iteration: 4226; Percent complete: 84.5%; Average loss: 0.3516
Iteration: 4227; Percent complete: 84.5%; Average loss: 0.3340
Iteration: 4228; Percent complete: 84.6%; Average loss: 0.3404
Iteration: 4229; Percent complete: 84.6%; Average loss: 0.3460
Iteration: 4230; Percent complete: 84.6%; Average loss:

Iteration: 4346; Percent complete: 86.9%; Average loss: 0.3206
Iteration: 4347; Percent complete: 86.9%; Average loss: 0.3284
Iteration: 4348; Percent complete: 87.0%; Average loss: 0.3020
Iteration: 4349; Percent complete: 87.0%; Average loss: 0.3103
Iteration: 4350; Percent complete: 87.0%; Average loss: 0.3345
Iteration: 4351; Percent complete: 87.0%; Average loss: 0.3236
Iteration: 4352; Percent complete: 87.0%; Average loss: 0.3208
Iteration: 4353; Percent complete: 87.1%; Average loss: 0.3223
Iteration: 4354; Percent complete: 87.1%; Average loss: 0.3368
Iteration: 4355; Percent complete: 87.1%; Average loss: 0.2990
Iteration: 4356; Percent complete: 87.1%; Average loss: 0.3079
Iteration: 4357; Percent complete: 87.1%; Average loss: 0.3187
Iteration: 4358; Percent complete: 87.2%; Average loss: 0.3170
Iteration: 4359; Percent complete: 87.2%; Average loss: 0.3141
Iteration: 4360; Percent complete: 87.2%; Average loss: 0.3315
Iteration: 4361; Percent complete: 87.2%; Average loss:

Iteration: 4477; Percent complete: 89.5%; Average loss: 0.2707
Iteration: 4478; Percent complete: 89.6%; Average loss: 0.2911
Iteration: 4479; Percent complete: 89.6%; Average loss: 0.3004
Iteration: 4480; Percent complete: 89.6%; Average loss: 0.2906
Iteration: 4481; Percent complete: 89.6%; Average loss: 0.2937
Iteration: 4482; Percent complete: 89.6%; Average loss: 0.3009
Iteration: 4483; Percent complete: 89.7%; Average loss: 0.2874
Iteration: 4484; Percent complete: 89.7%; Average loss: 0.2993
Iteration: 4485; Percent complete: 89.7%; Average loss: 0.2908
Iteration: 4486; Percent complete: 89.7%; Average loss: 0.3293
Iteration: 4487; Percent complete: 89.7%; Average loss: 0.3002
Iteration: 4488; Percent complete: 89.8%; Average loss: 0.3198
Iteration: 4489; Percent complete: 89.8%; Average loss: 0.3079
Iteration: 4490; Percent complete: 89.8%; Average loss: 0.2827
Iteration: 4491; Percent complete: 89.8%; Average loss: 0.2904
Iteration: 4492; Percent complete: 89.8%; Average loss:

Iteration: 4608; Percent complete: 92.2%; Average loss: 0.2959
Iteration: 4609; Percent complete: 92.2%; Average loss: 0.2907
Iteration: 4610; Percent complete: 92.2%; Average loss: 0.3023
Iteration: 4611; Percent complete: 92.2%; Average loss: 0.2769
Iteration: 4612; Percent complete: 92.2%; Average loss: 0.2845
Iteration: 4613; Percent complete: 92.3%; Average loss: 0.2826
Iteration: 4614; Percent complete: 92.3%; Average loss: 0.2879
Iteration: 4615; Percent complete: 92.3%; Average loss: 0.2762
Iteration: 4616; Percent complete: 92.3%; Average loss: 0.2688
Iteration: 4617; Percent complete: 92.3%; Average loss: 0.2765
Iteration: 4618; Percent complete: 92.4%; Average loss: 0.2852
Iteration: 4619; Percent complete: 92.4%; Average loss: 0.3131
Iteration: 4620; Percent complete: 92.4%; Average loss: 0.2853
Iteration: 4621; Percent complete: 92.4%; Average loss: 0.2939
Iteration: 4622; Percent complete: 92.4%; Average loss: 0.3012
Iteration: 4623; Percent complete: 92.5%; Average loss:

Iteration: 4739; Percent complete: 94.8%; Average loss: 0.2857
Iteration: 4740; Percent complete: 94.8%; Average loss: 0.2733
Iteration: 4741; Percent complete: 94.8%; Average loss: 0.2808
Iteration: 4742; Percent complete: 94.8%; Average loss: 0.2608
Iteration: 4743; Percent complete: 94.9%; Average loss: 0.2728
Iteration: 4744; Percent complete: 94.9%; Average loss: 0.2633
Iteration: 4745; Percent complete: 94.9%; Average loss: 0.2710
Iteration: 4746; Percent complete: 94.9%; Average loss: 0.2916
Iteration: 4747; Percent complete: 94.9%; Average loss: 0.2803
Iteration: 4748; Percent complete: 95.0%; Average loss: 0.2659
Iteration: 4749; Percent complete: 95.0%; Average loss: 0.2596
Iteration: 4750; Percent complete: 95.0%; Average loss: 0.2695
Iteration: 4751; Percent complete: 95.0%; Average loss: 0.2595
Iteration: 4752; Percent complete: 95.0%; Average loss: 0.2377
Iteration: 4753; Percent complete: 95.1%; Average loss: 0.2637
Iteration: 4754; Percent complete: 95.1%; Average loss:

Iteration: 4870; Percent complete: 97.4%; Average loss: 0.2708
Iteration: 4871; Percent complete: 97.4%; Average loss: 0.2588
Iteration: 4872; Percent complete: 97.4%; Average loss: 0.2521
Iteration: 4873; Percent complete: 97.5%; Average loss: 0.2524
Iteration: 4874; Percent complete: 97.5%; Average loss: 0.2452
Iteration: 4875; Percent complete: 97.5%; Average loss: 0.2628
Iteration: 4876; Percent complete: 97.5%; Average loss: 0.2745
Iteration: 4877; Percent complete: 97.5%; Average loss: 0.2715
Iteration: 4878; Percent complete: 97.6%; Average loss: 0.2829
Iteration: 4879; Percent complete: 97.6%; Average loss: 0.2624
Iteration: 4880; Percent complete: 97.6%; Average loss: 0.2684
Iteration: 4881; Percent complete: 97.6%; Average loss: 0.2704
Iteration: 4882; Percent complete: 97.6%; Average loss: 0.2800
Iteration: 4883; Percent complete: 97.7%; Average loss: 0.2446
Iteration: 4884; Percent complete: 97.7%; Average loss: 0.2715
Iteration: 4885; Percent complete: 97.7%; Average loss:

### Run the Model

In [20]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (type 'quit' to end the session)
evaluateInput(encoder, decoder, searcher, voc)

> Hello
Bot: hello . is telly late . 6
> Maybe she is
Bot: what is it ? . 5
> I do not really know
Bot: fine . . ? 4
> How are you today
Bot: fine . . . 4
> That is good to hear
Bot: it s good . let the radio . 8
> What about the radio?
Bot: sorry . has . . 5
> Hmm
Bot: i ll call them . . 6
> Okay sure.
Bot: if she wasn t such done eddie else ? 9
> What
Bot: what is it ? just ? 6
> quit


In [21]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = StochasticTopKUniformDecoder(encoder, decoder, 2)

# Begin chatting (type 'quit' to end the session)
evaluateInput(encoder, decoder, searcher, voc)

> Hello
Bot: hello this is telly singer . go . ? 9
> Hi telly, how are you?
Bot: okay thank you guys . . ! 7
> You are welcome
Bot: i know i am ! damn right . now ! 10
> What are you doing today?
Bot: i m going back to your face . 8
> That is funny.
Bot: yeah i think so okay ? ? ! 8
> Haha
Error: Encountered unknown word.
> Let us go out for dinner
Bot: i m not asking you something to worry do 9
> Okay bye
Bot: i ll get you there tomorrow night . go 9
> Cool
Bot: does ? you need one guy ? go ! 9
> No Goodnight
Bot: night night tomorrow . now . go go 8
> quit


In [40]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = StochasticTopKSampleDecoder(encoder, decoder, 5)

# Begin chatting (type 'quit' to end the session)
evaluateInput(encoder, decoder, searcher, voc)

> What music do you listen to?
Bot: oh no i need to help . .
> Why do you need help?
Bot: i ve got any cream . .
> Cream?
Bot: i m thinking . . . . ?
> Ok
Bot: i ll see you there . .
> See me where?
Bot: i m not sure . .
> Loser
Bot: what s this ? salvatore ?
> I am not salvatore.
Bot: what are you going to do ? ?
> Beat you up!
Bot: what ? . .
> Exactly.
Bot: exactly what ? you hate him ?
> No, I hate you.
Bot: i don t want to hurt the questions .
> I will hurt you.
Bot: i ll get you through the day tonight .
> quit


In [33]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = StochasticSentenceInitializerSamplingDecoder(encoder, decoder, 5, 3)

# Begin chatting (type 'quit' to end the session)
evaluateInput(encoder, decoder, searcher, voc)

> quit


In [24]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = StochasticSentenceInitializerUniformDecoder(encoder, decoder, 10, 3)

# Begin chatting (type 'quit' to end the session)
evaluateInput(encoder, decoder, searcher, voc)

> Hello
Bot: hello it . i see you . 7
> What do you see?
Bot: somethin i forget it . . . . 8
> Hmm
Bot: may they mother and wrote we like it ? 9
> I do not know what you are saying
Bot: here and i ! . ! 6
> quit


### Notes on the Model

Overall, the model is OK. It was not as intelligent as I thought it would be, but I suppose that is a good thing. One thing I noticed from playing with the bot is that its replies are pretty predictable. In fact, it feels less like it is responding to you and more like it has memorized what to say for a given input string from the user. I think the first thing we can do to improve the model is to add some randomness to it, so that typing "hello" won't keep generating the same output. To improve it a bit more, we can tune some of the hyperparameters, such as the learning rate and the depth. Eventually, to improve the overall feel of the chatbot, we can implement an entirely new model. We could also try giving each word a more meaningful number than the index at which it first appears (ask the professor if he thinks this would even matter, or whether it would just be a change of domain).
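To make the "add some randomness" idea concrete, here is a minimal, framework-free sketch of top-k sampling. It is not the code behind the decoders used above. The function name `top_k_sample` and the toy `scores` dictionary are made up for illustration; a real decoder would take its scores from the network's output at each step.

```python
import math
import random


def top_k_sample(scores, k, rng=random):
    """Sample one token from the k highest-scoring candidates.

    scores: dict mapping token -> unnormalized score (hypothetical values).
    rng:    any object with a random.choices-style method.
    """
    # Keep only the k best candidates.
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    # Softmax over the survivors (subtract the max for numerical stability).
    max_score = top[0][1]
    tokens = [token for token, _ in top]
    weights = [math.exp(score - max_score) for _, score in top]
    # Draw one token in proportion to its renormalized probability.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical scores for the first word of a reply to "hello".
scores = {"hello": 2.0, "hi": 1.5, "what": 0.5, "radio": -1.0, "eddie": -3.0}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [top_k_sample(scores, k=3, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

With `k=1` the function always returns the single best token, which is the greedy behavior. A larger `k` lets the same prompt produce different replies while still excluding the least likely words.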