# ChatBot References

Professor Smyth's website discussing projects

https://www.ics.uci.edu/~smyth/courses/cs175/project_reading.html

Chatbot resources on website

https://pytorch.org/tutorials/beginner/chatbot_tutorial.html

https://web.stanford.edu/~jurafsky/slp3/26.pdf

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/12/CortanaLUDialog-FromSLTproceedings.pdf

Dataset for tutorial

https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

# Start of Tutorial

### Import necessary libraries

In [31]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math
import numpy as np

USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

### Preprocess Data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Looking at some files in the data to see how they are structured ...

In [4]:
data = "/content/drive/MyDrive/CS175_Datasets"
# corpus = os.path.join("data", corpus_name)

def printLines(file, n=10):
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

# printLines(os.path.join(data, "qa_Video_Games.json"))

Next we scan through the data files and reformat them into (sentence, response) pairs, which are eventually written out separated by a tab. First we create helper functions that clean up the data a bit and make it more readable ...

In [5]:
import json
import gzip

def parse(file):
  lines = [line.rstrip() for line in file]
  return lines

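def loadConversations(lines):
  # Pair each line with the one after it; stopping at len(lines) - 1 avoids an
  # IndexError when the file has an odd number of lines
  return [[lines[i], lines[i+1]] for i in range(0, len(lines) - 1, 2)]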


def extractSentencePairs(conversations):
    return conversations
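
To make the pairing concrete, here is a small sketch on made-up lines. The sample utterances and the helper name `pair_up` are illustrative and not part of the notebook. It assumes, as `loadConversations` does, that the file alternates between an utterance and its reply.

```python
def pair_up(lines):
    """Group consecutive lines into [utterance, reply] pairs.

    Hypothetical helper: it stops at len(lines) - 1, so a trailing
    unpaired line is dropped instead of raising IndexError.
    """
    return [[lines[i], lines[i + 1]] for i in range(0, len(lines) - 1, 2)]

sample_lines = ["How are you", "Fine thanks", "Been busy", "Yes very", "Oh"]
sample_pairs = pair_up(sample_lines)
for utterance, reply in sample_pairs:
    # Tab-separated, matching the delimiter used for the csv file
    print(utterance + "\t" + reply)
```

The fifth line has no reply, so it does not appear in any pair.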
    

Using the functions above, we now parse the data into a form that is useful to us and look at what a few lines of the result look like ...

In [6]:
# Define path to new file
corpus=data
datafile_path = os.path.join(corpus, "Dialogue_Datasets")
datasets = os.path.join(datafile_path,"BNCSplitWordsCorpus.txt")
outfile_path = os.path.join(datafile_path, "csv_files")
datafile = os.path.join(outfile_path, "BNC.csv")
corpus_name = "BNC"
delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))
# Load the data directly if the data file has already been processed into a csv file
load_data = True

conversations = []
if not load_data: 
  # Load lines and process conversations
  print("\nProcessing corpus...")
  # lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
  print("\nLoading conversations...")
  print(datasets)
  lines = parse(open(datasets))
  conversations = loadConversations(lines)


  # Write new csv file
  print("\nWriting newly formatted file...")
  with open(datafile, 'w', encoding='utf-8') as outputfile:
      writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
      for pair in extractSentencePairs(conversations):
        writer.writerow(pair)
# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)


Sample lines from file:
b'You enjoyed yourself in America\tEh\n'
b'did you\tOh I covered a nice trip yes\n'
b'Oh very good\tsaw Mary and Andrew and\n'
b"Yes you did\tin fact the whole family was together for Mary's wedding\n"
b"Oh very nice very nice yes\tIt's horrible\n"
b"It is horrible isn't\tHave you been busy\n"
b'Yes\tYes oh\n'
b"Jim's been for a this afternoon at the H art and Straw Club\toh not very well we er m we stopped going after Christmas because we had bad chest s both of us\n"
b"Oh\tboth cold and it's hard going that three hours in the morning you know\n"
b"Yes\tbut we'll go back again\n"


We want to create a vocabulary of the words that we see. Each word is represented numerically by an index assigned in order of first appearance: the first new word gets the first free index, the next new word gets the one after it, and so on. See `addWord()` for specifics.

In [7]:
# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    # Remove words below a certain count threshold
    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True

        keep_words = []

        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)

        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
        ))

        # Reinitialize dictionaries
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3 # Count default tokens

        for word in keep_words:
            self.addWord(word)
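
As a rough illustration of the indexing scheme, the sketch below builds the same kind of mapping on two short sentences. The function name `build_index` is hypothetical, and it is a stripped-down stand-in for `Voc`, not a replacement for it.

```python
PAD, SOS, EOS = 0, 1, 2  # same reserved indices as in the cell above

def build_index(sentences):
    """Map each new word to the next free index, in order of first appearance."""
    word2index, word2count = {}, {}
    next_index = 3  # indices 0-2 are taken by PAD, SOS, EOS
    for sentence in sentences:
        for word in sentence.split(' '):
            if word not in word2index:
                word2index[word] = next_index
                next_index += 1
            word2count[word] = word2count.get(word, 0) + 1
    return word2index, word2count

word2index, word2count = build_index(["yes you did", "did you"])
print(word2index)  # {'yes': 3, 'you': 4, 'did': 5}
print(word2count)  # {'yes': 1, 'you': 2, 'did': 2}
```

Repeated words keep their original index and only their count increases, which is what `trim()` later relies on.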

Now we finally get to the part where we begin to make the (sentence, response) pairs. We begin preprocessing by converting the Unicode strings to ASCII with `unicodeToAscii()`. We also make everything lowercase and remove non-letter characters except for basic punctuation with `normalizeString()`. The last preprocessing step is to drop pairs in which either sentence is `MAX_LENGTH` words or longer, using `filterPair()`, to aid in training.

In [8]:
MAX_LENGTH = 10  # Maximum sentence length to consider

# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Read query/response pairs and return a voc object
def readVocs(datafile, corpus_name):
    print("Reading lines...")
    # Read the file and split into lines
    lines = open(datafile, encoding='utf-8').\
        read().strip().split('\n')
    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

# Returns True iff both sentences in a pair 'p' are under the MAX_LENGTH threshold
def filterPair(p):
    # Input sequences need to preserve the last word for EOS token
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

# Filter pairs using filterPair condition
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

# Using the functions defined above, return a populated voc object and pairs list
def loadPrepareData(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print(pairs[:10])
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs


# Load/Assemble voc and pairs
save_dir = os.path.join(data, "LSRM_save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)
# Print some pairs to validate
print("\npairs:")
for pair in pairs[:10]:
    print(pair)

Start preparing training data ...
Reading lines...
[['you enjoyed yourself in america', 'eh'], ['did you', 'oh i covered a nice trip yes'], ['oh very good', 'saw mary and andrew and'], ['yes you did', 'in fact the whole family was together for mary s wedding'], ['oh very nice very nice yes', 'it s horrible'], ['it is horrible isn t', 'have you been busy'], ['yes', 'yes oh'], ['jim s been for a this afternoon at the h art and straw club', 'oh not very well we er m we stopped going after christmas because we had bad chest s both of us'], ['oh', 'both cold and it s hard going that three hours in the morning you know'], ['yes', 'but we ll go back again']]
Read 305507 sentence pairs
Trimmed to 185299 sentence pairs
Counting words...
Counted words: 14962

pairs:
['you enjoyed yourself in america', 'eh']
['did you', 'oh i covered a nice trip yes']
['oh very good', 'saw mary and andrew and']
['oh very nice very nice yes', 'it s horrible']
['it is horrible isn t', 'have you been busy']
['yes', 
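
The effect of the normalization is easiest to see on a couple of strings. The sketch below repeats the same three regular-expression steps under the hypothetical names `strip_accents` and `normalize`; the sample inputs are made up. It also shows why the output above contains fragments such as `mary s wedding`: apostrophes are not letters, so they are turned into spaces.

```python
import re
import unicodedata

def strip_accents(s):
    # Decompose characters (NFD) and drop the combining marks (category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def normalize(s):
    s = strip_accents(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # any other non-letter run becomes one space
    return re.sub(r"\s+", r" ", s).strip()

print(normalize("It's horrible, isn't it?"))  # it s horrible isn t it ?
print(normalize("Caf\u00e9 at 5pm!"))         # cafe at pm !
```

Digits are removed as well, since only letters and the three punctuation marks survive.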

Another way to speed up training is by removing words that are rarely used. We "trim" these words using `Voc.trim()` and as a result must also remove (sentence, response) pairs that include these words.

In [9]:
MIN_COUNT = 10    # Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break

        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs


# Trim voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)

keep_words 4676 / 14959 = 0.3126
Trimmed from 185299 pairs to 159552, 0.8611 of total
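
The two-stage trimming can be sketched on a toy data set as follows. The name `trim_pairs` and the toy pairs are illustrative only; the sketch uses `collections.Counter` instead of the `Voc` class to keep it short.

```python
from collections import Counter

def trim_pairs(pairs, min_count):
    """Drop rare words, then drop every pair that contains one of them."""
    counts = Counter(word for pair in pairs for sentence in pair
                     for word in sentence.split(' '))
    kept_words = {w for w, c in counts.items() if c >= min_count}
    kept_pairs = [pair for pair in pairs
                  if all(w in kept_words for s in pair for w in s.split(' '))]
    return kept_words, kept_pairs

toy_pairs = [["yes", "yes oh"], ["oh", "yes indeed"], ["oh yes", "oh"]]
kept_words, kept_pairs = trim_pairs(toy_pairs, min_count=2)
print(sorted(kept_words))  # ['oh', 'yes']
print(kept_pairs)          # [['yes', 'yes oh'], ['oh yes', 'oh']]
```

The word `indeed` occurs only once, so the whole pair containing it is removed, even though its other words are common.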


### Using our Data on our Models

The data still needs more processing. The model works on numerical values rather than strings, so we convert our sentences to tensors (vectors) that the model takes as input. To do this, we replace every word in a sentence with its index in the vocabulary. This is how we will use our data to train the model.

When training a model we usually use mini-batches, since that makes things faster. We build a matrix of shape (batch_size, max_sentence_length_in_batch), with each sentence represented numerically as described in the previous paragraph. Each row (sentence) of the matrix ends with `EOS_token` and is followed by `PAD_token` (0) entries until the end of the row.

This layout almost works. The problem is that each row is a sentence and each column is a step in time, so indexing the first dimension returns a whole sentence. It is more convenient for the first dimension to be time, so that indexing it returns one time step across every sentence in the batch, because the decoder is run one time step at a time. For this reason, we construct the matrix as described in the previous paragraph, except that it is transposed to shape (max_sentence_length_in_batch, batch_size).

We define some functions that help us achieve this ...

In [10]:
def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]


def zeroPadding(l, fillvalue=PAD_token):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

def binaryMatrix(l, value=PAD_token):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == value:  # compare against the 'value' parameter (defaults to PAD_token)
                m[i].append(0)
            else:
                m[i].append(1)
    return m

# Returns padded input sequence tensor and lengths
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths

# Returns padded target sequence tensor, padding mask, and max target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.BoolTensor(mask)
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

# Returns all items for a given batch of pairs
def batch2TrainData(voc, pair_batch):
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len


# Example for validation
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)

input_variable: tensor([[775,  16,  21, 458,  45],
        [ 21,  11, 486, 434,   2],
        [ 87,   9, 486,   2,   0],
        [  2,   2,   2,   0,   0]])
lengths: tensor([4, 4, 4, 3, 2])
target_variable: tensor([[ 10,   9, 444, 114,  59],
        [360,   3,  83,  33,  24],
        [  2, 424,   2,  29,  18],
        [  0,  91,   0,  91,   2],
        [  0, 891,   0, 260,   0],
        [  0,   2,   0, 498,   0],
        [  0,   0,   0,  37,   0],
        [  0,   0,   0, 258,   0],
        [  0,   0,   0,   2,   0]])
mask: tensor([[ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [False,  True, False,  True,  True],
        [False,  True, False,  True, False],
        [False,  True, False,  True, False],
        [False, False, False,  True, False],
        [False, False, False,  True, False],
        [False, False, False,  True, False]])
max_target_len: 9


Looking at the output from preparing the data above, `input_variable` is the transposed matrix described in the previous section: each column is one sentence and each row is one time step. `lengths` is a tensor (vector) holding the length of each input sentence, counting the `EOS_token`. It is used later by the **encoder** to pack the padded batch. `target_variable` is the response that our model is supposed to learn, in the same layout. `mask` has the same shape as `target_variable` and is true where there is a word and false where there is only padding. Its purpose is not obvious at this point; it is used when the loss is defined in the training section.
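
The padding, transposing and masking can be reproduced in a few lines of plain Python. This is a minimal sketch with made-up word indexes; it uses the same `itertools.zip_longest` trick as `zeroPadding`, but builds the mask with a list comprehension rather than `binaryMatrix`.

```python
import itertools

PAD_token, EOS_token = 0, 2  # same values as the notebook's default tokens

# Three already-indexed sentences of different lengths, each ending in EOS
batch = [[7, 8, 9, EOS_token], [5, 6, EOS_token], [4, EOS_token]]

# zip_longest pads the short sentences and transposes in one step:
# the result has one row per time step and one column per sentence
padded = list(itertools.zip_longest(*batch, fillvalue=PAD_token))
mask = [[token != PAD_token for token in step] for step in padded]

for step in padded:
    print(step)
# (7, 5, 4)
# (8, 6, 2)
# (9, 2, 0)
# (2, 0, 0)
```

Reading down a column gives one sentence followed by its padding, and `mask` is false exactly at the padded positions.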

### Defining our Model

Our base model is a sequence-to-sequence model built from two RNNs. The first RNN is the **encoder**: it takes a variable-length input and converts it into a *fixed-length* "context" vector that is intended to hold onto some of the semantic meaning of the input. The second is the **decoder**: it takes the context vector provided by the encoder, along with an input word, and guesses the next word in the sequence. For our (sentence, response) pairs, the sentence is what the encoder reads and the response is the sequence the decoder learns to produce, starting from `SOS_token`.

Some discussion on the encoder. The tutorial's encoder uses a bidirectional GRU (Gated Recurrent Unit); the code below substitutes an LSTM but keeps the attribute name `gru`. Bidirectional means that the encoder is essentially two RNNs: one goes through the sequence in the forward direction while the other goes through it in the backward direction. At each time step, both RNNs produce an output and a hidden state vector. The two outputs are summed and kept (they are returned as `outputs` and used later by the decoder's attention), while the hidden state vectors are passed along to the next step of each RNN. Summing the outputs means that the encoding at each time step reflects both past and future context.

In [11]:
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # Initialize the recurrent layer (an LSTM here, kept under the name 'gru');
        #   the input_size and hidden_size params are both set to 'hidden_size'
        #   because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.LSTM(hidden_size,hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, : ,self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden
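
The summing step in `forward` relies on the layout of a bidirectional layer's output: the last dimension holds the forward features followed by the backward features. The sketch below shows the same slicing on a single plain-Python vector with made-up numbers, as an illustration of the idea rather than the tensor code itself.

```python
hidden_size = 3

# One time step of one sentence: forward features followed by backward features
step_output = [0.1, 0.2, 0.3, 1.0, 2.0, 3.0]

forward_half = step_output[:hidden_size]
backward_half = step_output[hidden_size:]

# Element-wise sum brings the size back from 2 * hidden_size to hidden_size
summed = [f + b for f, b in zip(forward_half, backward_half)]
print(len(summed))  # 3
```

This is why the decoder can work with `hidden_size` features even though the encoder is bidirectional.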

Some discussion on the decoder. The decoder uses the context vector that is produced by the encoder. Often this is all a decoder uses to produce output, but relying on a single fixed-length vector can lose information, especially when the input sentences are very long. To counter this, the decoder also uses its current hidden state to determine which parts of the input it should be "paying attention" to. This produces attention weights, which are multiplied by the encoder outputs (the outputs recorded at each time step, as described for the encoder) to rescale them, making less important parts of the encoder output smaller and more important parts larger. Rather than weighting only the encoder output from the current time step, the weights can be computed over all of the encoder outputs, which gives a more comprehensive set of attention weights. This is the method implemented below, followed by the implementation of the decoder using this attention method.

In [12]:
# Luong attention layer
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError("{} is not an appropriate attention method.".format(self.method))
        self.hidden_size = hidden_size
        if self.method == 'general':
            self.attn = nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':
            self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = nn.Parameter(torch.FloatTensor(hidden_size))

    def dot_score(self, hidden, encoder_output):
        return torch.sum(hidden * encoder_output, dim=2)

    def general_score(self, hidden, encoder_output):
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    def concat_score(self, hidden, encoder_output):
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        # Calculate the attention weights (energies) based on the given method
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)

        # Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()

        # Return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)
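
To see what the `dot` method computes, here is a plain-Python sketch for a single decoder step and a single sentence. The function names and the numbers are illustrative assumptions; the real layer does the same arithmetic on batched tensors.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoder step and one sentence."""
    # Score each encoder output by its dot product with the decoder state
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    # Context = attention-weighted sum of the encoder outputs
    context = [sum(w * enc[j] for w, enc in zip(weights, encoder_outputs))
               for j in range(len(decoder_state))]
    return weights, context

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 input steps, hidden size 2
weights, context = dot_attention([2.0, 0.0], encoder_outputs)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The weights sum to 1, and the input steps whose encoder output points in the same direction as the decoder state receive most of the weight.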

In [13]:
class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()

        # Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Define layers
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.LSTM(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)

    def forward(self, input_step, last_hidden, encoder_outputs):
        # Note: we run this one step (word) at a time
        # Get embedding of current input word
        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
        # Forward through unidirectional GRU
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # Multiply attention weights to encoder outputs to get new "weighted sum" context vector
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate weighted context vector and GRU output using Luong eq. 5
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        # Return output and final hidden state
        return output, hidden

### Defining the Training Procedure

Recall that the `mask` produced alongside `target_variable` earlier had no obvious purpose. It turns out that it is used when computing the loss of the model. The implementation below is the negative log-likelihood loss, averaged only over the positions of `target_variable` where `mask` is true, so that padding does not contribute to the loss.

In [14]:
def maskNLLLoss(inp, target, mask):
    nTotal = mask.sum()
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = crossEntropy.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, nTotal.item()
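
A small worked example may help here. The sketch below computes the same masked average for one time step in plain Python; `masked_nll` and the probabilities are made up for illustration, and it assumes at least one position is unmasked.

```python
import math

def masked_nll(probs, targets, mask):
    """Mean of -log p(target) over the positions where mask is True.

    probs:   one probability distribution over the vocabulary per sentence
    targets: the correct word index per sentence at this time step
    mask:    False where the target is padding
    """
    losses = [-math.log(p[t]) for p, t, keep in zip(probs, targets, mask) if keep]
    return sum(losses) / len(losses), len(losses)

probs = [[0.1, 0.2, 0.7], [0.5, 0.25, 0.25], [0.9, 0.05, 0.05]]
targets = [2, 0, 0]          # the third target is padding
mask = [True, True, False]   # so it is masked out
loss, n_total = masked_nll(probs, targets, mask)
print(round(loss, 4), n_total)
```

Only the first two sentences contribute, so the loss is the average of `-log(0.7)` and `-log(0.5)`.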

For training the model, two tricks are used to help convergence. The first trick is referred to as "teacher forcing": instead of feeding the decoder's own prediction back in as its next input, the actual target word is used. In the code below this is decided once per batch, with probability `teacher_forcing_ratio`. Doing it too much can leave the decoder unable to make predictions on its own, while not doing it at all makes convergence slower. The second trick is gradient clipping: where the loss surface is steep and the gradient is therefore very large, the norm of the gradient is limited to some maximum value (`clip`). This keeps a single update from drastically overshooting a local minimum near a "cliff". The training function is implemented below. It trains on a single batch, that is, one iteration.

In [15]:
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    # Lengths for rnn packing should always be on the cpu
    lengths = lengths.to("cpu")

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    # Get the encoder's final (h, c) hidden state
    encoder_h_hidden, encoder_c_hidden = encoder_hidden
    decoder_h_hidden = encoder_h_hidden[:decoder.n_layers]
    decoder_c_hidden = encoder_c_hidden[:decoder.n_layers]
    # Recombine the final hidden states as hidden tuple
    decoder_hidden = (decoder_h_hidden, decoder_c_hidden)

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals
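
Gradient clipping by norm can be sketched in plain Python as follows. This illustrates the idea behind `clip_grad_norm_` and is not its exact implementation; `clip_by_norm` and the sample gradient are hypothetical.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale grads so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Small gradients pass through unchanged
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_norm([30.0, 40.0], max_norm=5.0)
print(norm_before)  # 50.0
print(clipped)
```

The direction of the gradient is preserved; only its length is reduced, which is what stops a single step from jumping far away on a steep part of the loss surface.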


This version of the training function runs multiple iterations and is built on the previous training function. It also periodically saves the current state (model and optimizer parameters, the loss, and the vocabulary) to a `.tar` checkpoint file using `torch.save`. This makes it possible to continue training at a later time or to use the training accumulated so far to make predictions.

In [16]:
def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, 
               encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every, 
               clip, corpus_name, loadFilename):

    # Load batches for each iteration
    training_batches = [batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])
                      for _ in range(n_iteration)]

    # Initializations
    print('Initializing ...')
    start_iteration = 1
    print_loss = 0
    if loadFilename:
        start_iteration = checkpoint['iteration'] + 1

    # Training loop
    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        # Extract fields from batch
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(
                iteration, iteration / n_iteration * 100, print_loss_avg
            ))
            print_loss = 0

        # Save checkpoint
        if (iteration % save_every == 0):
            directory = os.path.join(save_dir, model_name, corpus_name, '{}-{}_{}'.format(
                encoder_n_layers, decoder_n_layers, hidden_size
            ))
            if not os.path.exists(directory):
                os.makedirs(directory)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))


### Defining Evaluation

This is the decoding method used by our bot when there is no target sentence to "teacher force" with, that is, when we are actually using the model rather than training it. At each step it simply chooses the decoder output with the highest softmax score, i.e. the most likely next word given its prior training, and feeds that word back in as the next input.

In [28]:
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        # Get the encoder's final (h, c) hidden state
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        # Use self.decoder rather than a global 'decoder' variable
        decoder_h_hidden = encoder_h_hidden[:self.decoder.n_layers]
        decoder_c_hidden = encoder_c_hidden[:self.decoder.n_layers]
        # Recombine the final hidden states as hidden tuple
        decoder_hidden = (decoder_h_hidden, decoder_c_hidden)
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            # Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

# Alternative to greedy search: pick uniformly at random among the top k words
class StochasticTopKUniformDecoder(nn.Module):
    def __init__(self, encoder, decoder, k):
        super(StochasticTopKUniformDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.k = k

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Use the encoder's final hidden and cell states (LSTM) to initialize the decoder,
        # keeping only the first decoder.n_layers layers of each
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        decoder_h_hidden = encoder_h_hidden[:self.decoder.n_layers]
        decoder_c_hidden = encoder_c_hidden[:self.decoder.n_layers]
        # Recombine the final hidden states as hidden tuple
        decoder_hidden = (decoder_h_hidden, decoder_c_hidden)
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            
            # select from the top k words at each word in the sentence
            # uniformly at random.
            decoder_output_topk, ind = torch.topk(decoder_output, k=self.k, dim=1)
            i = int(random.random()*self.k)
            decoder_input = torch.tensor([ind[0,i]]).to(device)
            decoder_scores = torch.tensor([decoder_output_topk[0,i]]).to(device)
            
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

class StochasticTopKSampleDecoder(nn.Module):
    def __init__(self, encoder, decoder, k):
        super(StochasticTopKSampleDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.k = k

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Use the encoder's final hidden and cell states (LSTM) to initialize the decoder,
        # keeping only the first decoder.n_layers layers of each
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        decoder_h_hidden = encoder_h_hidden[:self.decoder.n_layers]
        decoder_c_hidden = encoder_c_hidden[:self.decoder.n_layers]
        # Recombine the final hidden states as hidden tuple
        decoder_hidden = (decoder_h_hidden, decoder_c_hidden)
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            
            # sample choice of word from the top k words at each word
            # in the sentence
            decoder_output_topk, ind = torch.topk(decoder_output, k=self.k, dim=1)
            decoder_output_topk = decoder_output_topk.cpu().detach().numpy().flatten()
            ind = ind.cpu().detach().numpy().flatten()
            probabilities = decoder_output_topk/(np.sum(decoder_output_topk))
            value = np.random.choice(ind, 1, p=probabilities)
            i = np.where(ind == value[0])[0][0]
            decoder_input = torch.tensor([ind[i]]).to(device)
            decoder_scores = torch.tensor([decoder_output_topk[i]]).to(device)
            
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

class StochasticSentenceInitializerSamplingDecoder(nn.Module):
    def __init__(self, encoder, decoder, k, n):
        super(StochasticSentenceInitializerSamplingDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.k = k
        self.n = n

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Use the encoder's final hidden and cell states (LSTM) to initialize the decoder,
        # keeping only the first decoder.n_layers layers of each
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        decoder_h_hidden = encoder_h_hidden[:self.decoder.n_layers]
        decoder_c_hidden = encoder_c_hidden[:self.decoder.n_layers]
        # Recombine the final hidden states as hidden tuple
        decoder_hidden = (decoder_h_hidden, decoder_c_hidden)
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for i in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            
            if i >= self.n:
                # be greedy >:)
                decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            else:
                # sample choice of word from the top k words at each word
                # in the sentence
                decoder_output_topk, ind = torch.topk(decoder_output, k=self.k, dim=1)
                decoder_output_topk = decoder_output_topk.cpu().detach().numpy().flatten()
                ind = ind.cpu().detach().numpy().flatten()
                probabilities = decoder_output_topk/(np.sum(decoder_output_topk))
                value = np.random.choice(ind, 1, p=probabilities)
                # Position of the sampled word within the top k (named j so the loop counter i is not overwritten)
                j = np.where(ind == value[0])[0][0]
                decoder_input = torch.tensor([ind[j]]).to(device)
                decoder_scores = torch.tensor([decoder_output_topk[j]]).to(device)
            
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

class StochasticSentenceInitializerUniformDecoder(nn.Module):
    def __init__(self, encoder, decoder, k, n):
        super(StochasticSentenceInitializerUniformDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.k = k
        self.n = n

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Use the encoder's final hidden and cell states (LSTM) to initialize the decoder,
        # keeping only the first decoder.n_layers layers of each
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        decoder_h_hidden = encoder_h_hidden[:self.decoder.n_layers]
        decoder_c_hidden = encoder_c_hidden[:self.decoder.n_layers]
        # Recombine the final hidden states as hidden tuple
        decoder_hidden = (decoder_h_hidden, decoder_c_hidden)
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for i in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            
            if i >= self.n:
                # be greedy >:)
                decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            else:
                # sample choice of word from the top k words at each word
                # in the sentence (uniformly at random)
                decoder_output_topk, ind = torch.topk(decoder_output, k=self.k, dim=1)
                # Random position within the top k (named j so the loop counter i is not overwritten)
                j = int(random.random()*self.k)
                decoder_input = torch.tensor([ind[0,j]]).to(device)
                decoder_scores = torch.tensor([decoder_output_topk[0,j]]).to(device)
            
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores
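
The decoders above differ only in how they choose the next token from the decoder's softmax output. The following is a minimal sketch of the three selection rules on a made-up six-word distribution. The `probs` values and the `pick_*` function names are illustrative and not part of the notebook, and `torch.multinomial` is used here in place of the NumPy sampling in the classes above.

```python
import torch

# Hypothetical softmax output for one decoding step: shape (1, vocab_size)
probs = torch.tensor([[0.05, 0.40, 0.10, 0.30, 0.05, 0.10]])

def pick_greedy(probs):
    # Highest-probability token, as in GreedySearchDecoder
    score, token = torch.max(probs, dim=1)
    return token.item(), score.item()

def pick_topk_uniform(probs, k):
    # Any of the k best tokens with equal chance
    top_scores, top_tokens = torch.topk(probs, k=k, dim=1)
    j = torch.randint(k, (1,)).item()
    return top_tokens[0, j].item(), top_scores[0, j].item()

def pick_topk_sample(probs, k):
    # One of the k best tokens, weighted by its renormalized probability
    top_scores, top_tokens = torch.topk(probs, k=k, dim=1)
    weights = top_scores / top_scores.sum()
    j = torch.multinomial(weights, num_samples=1)[0, 0].item()
    return top_tokens[0, j].item(), top_scores[0, j].item()

print(pick_greedy(probs))
print(pick_topk_uniform(probs, 2))
print(pick_topk_sample(probs, 2))
```

With `k=2`, both stochastic rules can only return token 1 or token 3 here, but the weighted rule favors token 1 while the uniform rule treats them equally.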

The `evaluateInput` function takes input from the user, prepares it so that it can be fed to our model, and prints out the response from our bot. `evaluate` handles most of the actual computation, while `evaluateInput` handles the user interaction.
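
As a rough illustration of the input preparation, the sketch below turns one sentence into a batch of shape `(seq_len, 1)`. The toy `word2index` dictionary, the `indexes_from_sentence` helper, and `EOS_token = 2` are assumptions made for this example (the value follows the PyTorch chatbot tutorial). The notebook's own `voc` and `indexesFromSentence` are defined elsewhere.

```python
import torch

# Hypothetical toy vocabulary; the notebook's voc.word2index plays this role
word2index = {"hello": 3, "how": 4, "are": 5, "you": 6}
EOS_token = 2  # assumed end-of-sentence index

def indexes_from_sentence(sentence):
    # Look up each word and append the end-of-sentence marker
    return [word2index[w] for w in sentence.split(" ")] + [EOS_token]

# A batch containing a single sentence
indexes_batch = [indexes_from_sentence("how are you")]
# One length per sentence in the batch (kept on the CPU)
lengths = torch.tensor([len(ix) for ix in indexes_batch])
# (batch, seq_len) -> (seq_len, batch), the layout the encoder is fed
input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)

print(input_batch.shape, lengths.tolist())
```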

In [18]:
def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):
    ### Format input sentence as a batch
    # words -> indexes
    indexes_batch = [indexesFromSentence(voc, sentence)]
    # Create lengths tensor
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    # Transpose dimensions of batch to match models' expectations
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)
    # Use appropriate device
    input_batch = input_batch.to(device)
    # Note: lengths intentionally stays on the CPU; moving it to the GPU caused a CPU/GPU tensor mismatch
    # Decode sentence with searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    # indexes -> words
    decoded_words = [voc.index2word[token.item()] for token in tokens]
    return decoded_words


def evaluateInput(encoder, decoder, searcher, voc):
    input_sentence = ''
    while True:
        try:
            # Get input sentence
            input_sentence = input('> ')
            # Check if it is quit case
            if input_sentence in ('q', 'quit'):
                break
            # Normalize sentence
            input_sentence = normalizeString(input_sentence)
            # Evaluate sentence
            output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)
            # Format and print response sentence
            output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]
            print('Bot:', ' '.join(output_words), len(output_words))

        except KeyError:
            print("Error: Encountered unknown word.")
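
The `KeyError` branch rejects the whole sentence as soon as one word is missing from the vocabulary. One possible alternative, sketched below, is to drop the unknown words and answer the rest. This is an assumption about how the behavior could be changed, not something the notebook does, and `drop_unknown_words` and the toy `word2index` are hypothetical names.

```python
# Hypothetical toy vocabulary standing in for the notebook's voc.word2index
word2index = {"so": 3, "are": 4, "you": 5, "a": 6, "real": 7, "human": 8}

def drop_unknown_words(sentence, word2index):
    # Keep only words present in the vocabulary and report the ones removed
    words = sentence.split(" ")
    known = [w for w in words if w in word2index]
    unknown = [w for w in words if w not in word2index]
    return " ".join(known), unknown

cleaned, skipped = drop_unknown_words("so are you a robot", word2index)
print(cleaned)   # so are you a
print(skipped)   # ['robot']
```

Dropping words changes the meaning of the input, so it may be worth telling the user which words were skipped.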


### Running the Model

This defines and builds our model based on the following parameters. Notice that parts of the code are commented out, as these are other possibilities for initialization. `attn_model` can be set to one of three attention models: `dot`, `general`, or `concat`. We can also choose to load a model that we previously trained.
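
For reference, the three attention options correspond to the Luong-style score functions used in the PyTorch chatbot tutorial linked at the top. The sketch below is a stand-alone approximation under that assumption. `LuongScoreSketch` is a hypothetical name, and the notebook's actual `LuongAttnDecoderRNN` is defined elsewhere and may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LuongScoreSketch(nn.Module):
    # Hypothetical stand-in for the attention layer used by the decoder
    def __init__(self, method, hidden_size):
        super().__init__()
        if method not in ("dot", "general", "concat"):
            raise ValueError(f"unknown attention method: {method}")
        self.method = method
        if method == "general":
            self.attn = nn.Linear(hidden_size, hidden_size)
        elif method == "concat":
            self.attn = nn.Linear(hidden_size * 2, hidden_size)
            self.v = nn.Parameter(torch.rand(hidden_size))

    def forward(self, hidden, encoder_outputs):
        # hidden: (1, batch, hidden); encoder_outputs: (seq_len, batch, hidden)
        if self.method == "dot":
            energies = torch.sum(hidden * encoder_outputs, dim=2)
        elif self.method == "general":
            energies = torch.sum(hidden * self.attn(encoder_outputs), dim=2)
        else:
            expanded = hidden.expand(encoder_outputs.size(0), -1, -1)
            combined = torch.cat((expanded, encoder_outputs), dim=2)
            energies = torch.sum(self.v * torch.tanh(self.attn(combined)), dim=2)
        # (seq_len, batch) -> (batch, 1, seq_len), normalized over the sequence
        return F.softmax(energies.t(), dim=1).unsqueeze(1)

hidden = torch.randn(1, 2, 8)
encoder_outputs = torch.randn(5, 2, 8)
weights = LuongScoreSketch("dot", 8)(hidden, encoder_outputs)
print(weights.shape)
```

`dot` has no extra parameters, while `general` and `concat` each add a learned linear layer, so they cost slightly more to train.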

In [21]:
# Configure models
model_name = 'LSTM_dialogue_2048_batch_15000_iter'
attn_model = 'dot'
# attn_model = 'general'
# attn_model = 'concat'
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 2048

# Set checkpoint to load from; set to None if starting from scratch
loadFilename = None
checkpoint_iter = 15000
# loadFilename = os.path.join(save_dir, model_name, corpus_name,
#                            '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size),
#                            '{}_checkpoint.tar'.format(checkpoint_iter))


# Load model if a loadFilename is provided
if loadFilename:
    # If loading on same machine the model was trained on
    checkpoint = torch.load(loadFilename)
    # If loading a model trained on GPU to CPU
    #checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']


print('Building encoder and decoder ...')
# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
# Use appropriate device
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')

Building encoder and decoder ...
Models built and ready to go!


Now we train our model using `trainIters`. We first set some training parameters, initialize the optimizers, and prepare our model for training. Training is likely to take a very long time. Thankfully, there is the option to load a previously trained model, so this piece of code doesn't need to be run every time.
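
To make the save-and-reload idea concrete, here is a minimal round trip on a tiny stand-in model. The `nn.Linear` model, the `"model"`/`"opt"` keys, and the file name are placeholders chosen for this sketch. The notebook's real checkpoints use the keys shown in the loading code above.

```python
import os
import tempfile

import torch
import torch.nn as nn
from torch import optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical tiny model standing in for the encoder/decoder
model = nn.Linear(4, 2).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# Take one step so the optimizer has state worth saving
model(torch.randn(3, 4, device=device)).sum().backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.tar")
torch.save({"model": model.state_dict(), "opt": optimizer.state_dict()}, path)

# map_location lets a checkpoint saved on a GPU be restored on a CPU-only machine
checkpoint = torch.load(path, map_location=device)
restored = nn.Linear(4, 2).to(device)
restored.load_state_dict(checkpoint["model"])
restored_optimizer = optim.Adam(restored.parameters(), lr=0.0001)
restored_optimizer.load_state_dict(checkpoint["opt"])
```

Restoring the optimizer state as well as the weights matters when training is resumed, since Adam keeps running statistics for every parameter.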

In [22]:
# Safety precaution so that we don't accidentally begin training a model we didn't want to train
TRAIN_THE_MODEL = True

In [None]:
if TRAIN_THE_MODEL:
    
    # Configure training/optimization
    clip = 50.0
    teacher_forcing_ratio = 1.0
    learning_rate = 0.0001
    decoder_learning_ratio = 5.0
    n_iteration = 15000
    print_every = 1
    save_every = 1000

    # Ensure dropout layers are in train mode
    encoder.train()
    decoder.train()

    # Initialize optimizers
    print('Building optimizers ...')
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
    if loadFilename:
        encoder_optimizer.load_state_dict(encoder_optimizer_sd)
        decoder_optimizer.load_state_dict(decoder_optimizer_sd)

    # Move any loaded optimizer state to the active device (GPU if available, otherwise CPU)
    for state in encoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(device)
    
    for state in decoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(device)
    
    # Run training iterations
    print("Starting Training!")
    trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
               embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
               print_every, save_every, clip, corpus_name, loadFilename)
    
    TRAIN_THE_MODEL = False

else:
    
    print("Set the variable 'TRAIN_THE_MODEL' in the code block above to True, then run this code block again")
    print(f"Value of 'TRAIN_THE_MODEL' currently: {TRAIN_THE_MODEL}")


Run the model :)

In [25]:
class StochasticTopKSampleDecoder(nn.Module):
    def __init__(self, encoder, decoder, k):
        super(StochasticTopKSampleDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.k = k

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
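        # Prepare encoder's final hidden layer to be first hidden input to the decoder.
        # The encoder is an LSTM, so its hidden state is an (h, c) tuple; slice each part
        encoder_h_hidden, encoder_c_hidden = encoder_hidden
        decoder_hidden = (encoder_h_hidden[:self.decoder.n_layers],
                          encoder_c_hidden[:self.decoder.n_layers])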
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            
            # sample choice of word from the top k words at each word
            # in the sentence
            decoder_output_topk, ind = torch.topk(decoder_output, k=self.k, dim=1)
            decoder_output_topk = decoder_output_topk.cpu().detach().numpy().flatten()
            ind = ind.cpu().detach().numpy().flatten()
            probabilities = decoder_output_topk/np.sum(decoder_output_topk)
            value = np.random.choice(ind, 1, p=probabilities)
            i = np.where(ind == value[0])[0][0]
            decoder_input = torch.tensor([ind[i]]).to(device)
            decoder_scores = torch.tensor([decoder_output_topk[i]]).to(device)
            
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

In [39]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (enter 'q' or 'quit' to stop)
evaluateInput(encoder, decoder, searcher, voc)

> Hello
Bot: hello 1
> who are you
Bot: linda 1
> hi linda
Bot: but she s got a real 6
> who is she
Bot: that that girl float float er m 7
> interesting
Bot: you re not allowed to make your own decisions 9
> What music do you listen to
Bot: ma 1
> is there any music called ma
Bot: yeah is it 3
> i dont know
Error: Encountered unknown word.
> i do not know actually
Bot: nobody did even even even done that 7
> anyway, i do not want to continue this conversation
Error: Encountered unknown word.
> can you talk like a human
Bot: yeah 1
> well, this is a good response
Error: Encountered unknown word.
> so are you a robot
Error: Encountered unknown word.
> so are you a real human
Bot: yeah 1
> cool, see you later
Bot: see you all over the place out 7
> q


In [40]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = StochasticTopKSampleDecoder(encoder, decoder, 5)

# Begin chatting (enter 'q' or 'quit' to stop)
evaluateInput(encoder, decoder, searcher, voc)

> hello
Bot: you realize i don t trust him any longer 9
> how are you
Bot: not too bad 3
> that is great
Bot: use bad of those things 5
> what is your name
Bot: sarah 1
> hey sarah
Bot: no 1
> what did you mean no
Bot: i m at home night 5
> so what
Bot: so what 2
> anyway, What music do you listen t
Bot: mm mm 2
> is mm a music title or an artist
Error: Encountered unknown word.
> what is that
Bot: f eight 2
> are you a real human
Bot: yeah b y s 4
> interesting
Bot: these are the ones that i saw on the 9
> anyway, see you later
Bot: see you later 3
> bye
Bot: bye 1
> q


In [33]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = StochasticTopKUniformDecoder(encoder, decoder, 2)

# Begin chatting (enter 'q' or 'quit' to stop)
evaluateInput(encoder, decoder, searcher, voc)

> Hello
Bot: hello doreen and see where we goes oh 8
> who are you
Bot: er oh 2
> what is your name
Bot: sarah name is and night 5
> how are you doing today
Bot: it s three early till day tomorrow night afternoon 9
> interesting
Bot: you ve got bar do you sound now night 9
> nah
Bot: don me lie up now time 6
> that does not make sense
Bot: i haven a one anyway i mean me 8
> q


### Notes on the Model

Overall, the model is OK. It was not as intelligent as I thought it would be, but I suppose that is a good thing. One thing I noticed from playing with the bot is that its replies are very predictable: it feels less like it is responding to you and more like it has memorized what to say for a given input string. I think the first thing we can do to improve the model is to add some randomness to decoding, so that asking "hello" won't keep generating the same output. Beyond that, we can tune some of the hyperparameters, such as the learning rate and the depth. Eventually, to improve the overall feel of the chatbot, we could implement an entirely new model. We could also try giving words a more meaningful numeric representation than the index at which they first appear (ask the professor whether this would even matter, or whether it would just be a change of domain).
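
On the last question, one possibly relevant detail is that the model built above passes word indexes through `nn.Embedding(voc.num_words, hidden_size)`, which looks up a learned vector for each index. The sketch below, using a made-up five-word vocabulary and a hypothetical permutation `perm`, suggests that the integer itself carries no meaning: relabeling the indexes and moving the rows to match yields the same vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical 5-word vocabulary with 3-dimensional embeddings
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

# Looking up an index returns that row of the weight matrix
index = torch.tensor([4])
vector = embedding(index)

# Relabel the vocabulary: permute the indexes and move the rows to match
perm = torch.tensor([2, 0, 4, 1, 3])  # old index -> new index
relabeled = nn.Embedding(5, 3)
with torch.no_grad():
    relabeled.weight[perm] = embedding.weight
same_vector = relabeled(perm[index])

print(torch.equal(vector, same_vector))
```

If that holds for the full model, changing which integer a word gets would not by itself change what the network can learn. Pretrained embedding vectors would be a different kind of change.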