## Cornell Chatbot Dialogue
The code below is inspired by:
1. https://pytorch.org/tutorials/beginner/chatbot_tutorial.html
2. https://github.com/Currie32/Chatbot-from-Movie-Dialogue

In [1]:
# standard library modules
import os
import sys
import re
import unicodedata
import random
import itertools

# third-party modules
import pandas as pd
import torch
from torch import nn
import torch.nn.functional as F

path_this = os.path.abspath (os.path.dirname ('.'))
path_lines = os.path.join (path_this, '..', 'data', 'cornell movie-dialogs corpus', 'movie_lines.txt')
path_conversation = os.path.join (path_this, '..', 'data', 'cornell movie-dialogs corpus', 'movie_conversations.txt')

In [2]:
# load lines
id2lines = {}
with open (path_lines, 'rb') as f_:
    lines = [l.decode ('utf8', 'ignore') for l in f_.readlines()]
    for l in lines:
        entry = [_.strip () for _ in l.split ('+++$+++')]
        id2lines[entry[0]] = entry[-1]
print ("Loaded {} keys".format (len (id2lines)))

Loaded 304713 keys
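To make the parsing step concrete, here is a minimal sketch of what the loop does to a single record. The sample line is made up for illustration and only assumes the `+++$+++` separated layout the cell relies on (line id first, utterance text last); `parse_line` is a hypothetical helper, not part of the notebook.

```python
SEPARATOR = '+++$+++'

def parse_line(raw_line):
    """Split one record on the separator and return (line_id, text)."""
    fields = [field.strip() for field in raw_line.split(SEPARATOR)]
    # the first field is the line id, the last field is the spoken text
    return fields[0], fields[-1]

# illustrative record, not copied from the corpus
sample = 'L100 +++$+++ u1 +++$+++ m0 +++$+++ ALICE +++$+++ Where are we going?\n'
line_id, text = parse_line(sample)
print(line_id, '->', text)
```

Only the first and last fields are kept, so the speaker and movie metadata in between are dropped at this stage.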


In [3]:
# load conversation
conversation = []
with open (path_conversation, 'rb') as f_:
    lines = [l.decode ('utf8', 'ignore') for l in f_.readlines ()]
    for l in lines:
        entry = [_.strip () for _ in l.split ("+++$+++")]
        conv = entry[-1][1:-1].replace (' ', '').replace ("'", "").split (',')
        for i in range (len (conv) - 1):
            conversation.append ((conv[i], conv[i+1]))
print ("We have {} conversations".format (len (conversation)))
print ("Sample conversation: ")

def sample_conversation (id=None):
    if id is None: id = random.randrange (len (conversation))
    print ("id : {}".format (id))
    sample = conversation[id]
    print ("Q: {}".format (id2lines[sample[0]]))
    print ("A: {}".format (id2lines[sample[1]]))
sample_conversation ()

We have 221616 conversations
Sample conversation: 
id : 193392
Q: I <u>can</u>?
A: You bet your life.  "The mill wheel goes around...some times it's even under water -- then it rises up, as high as it can go..."
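The pairing logic in the cell above can be sketched on one record as follows. The sample record is invented and only assumes that the last field is a bracketed, quoted list of line ids; `conversation_to_pairs` is a hypothetical name used for illustration.

```python
def conversation_to_pairs(raw_line, separator='+++$+++'):
    """Turn one conversation record into consecutive (question, answer) id pairs."""
    fields = [field.strip() for field in raw_line.split(separator)]
    # the last field looks like "['L10', 'L11', 'L12']":
    # drop the brackets, then the spaces and quotes, then split on commas
    line_ids = fields[-1][1:-1].replace(' ', '').replace("'", '').split(',')
    return [(line_ids[i], line_ids[i + 1]) for i in range(len(line_ids) - 1)]

# illustrative record, not copied from the corpus
sample = "u1 +++$+++ u2 +++$+++ m0 +++$+++ ['L10', 'L11', 'L12', 'L13']\n"
pairs = conversation_to_pairs(sample)
print(pairs)
```

A conversation with n lines yields n - 1 pairs, so the count printed above is a number of question/answer pairs rather than a number of whole conversations.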


In [4]:
# Create vocabulary class
PAD_token = 0
SOS_token = 1
EOS_token = 2

class Vocabulary:
    
    def __init__ (self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {"PAD" : 0, "SOS" : 1, "EOS" : 2}
        self.word2count = {}
        self.index2word = {0:"PAD", 1:"SOS", 2:"EOS"}
        self.num_words = 3
            
    def add_word (self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1
    
    def add_sentence (self, sentence):
        for word in sentence.split (" "):
            self.add_word (word)
    
    # remove words that appear fewer than min_count times
    def trim (self, min_count):
        if self.trimmed:
            return
                
        keep_words = []
        
        for k,v in self.word2count.items ():
            if v >= min_count:
                keep_words.append (k)
        print ("From {}, {} are kept".format (self.num_words, len (keep_words)))
        
        # redefine attribute
        self.__init__ (self.name)
        for w in keep_words:
            self.add_word (w)
        self.trimmed = True
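As a rough sketch of what `trim` achieves, the same idea can be written with `collections.Counter`. This is a simplified stand-in under my own assumptions, not the class above: `build_trimmed_index` and `SPECIAL_TOKENS` are hypothetical names, and only the word-to-index mapping is rebuilt.

```python
from collections import Counter

SPECIAL_TOKENS = ['PAD', 'SOS', 'EOS']  # same reserved indices 0, 1, 2 as above

def build_trimmed_index(sentences, min_count):
    """Count words, keep those seen at least min_count times, and index them."""
    counts = Counter(word for s in sentences for word in s.split(' '))
    kept = [word for word, count in counts.items() if count >= min_count]
    word2index = {token: i for i, token in enumerate(SPECIAL_TOKENS)}
    for word in kept:
        word2index[word] = len(word2index)  # next free index
    return word2index

sentences = ['hello there', 'hello you', 'you there', 'goodbye']
word2index = build_trimmed_index(sentences, min_count=2)
print(word2index)
```

Note that in the `Vocabulary` class, calling `trim` re-adds each kept word once, so `word2count` is reset to 1 for every surviving word.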

In [5]:
# Convert each line in id2lines to ASCII and normalize it, then drop every line
# (and every conversation pair containing such a line) that has more than max_length words.
def unicode2ascii (text):
    return ''.join(
        c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalize_string(text):
    s = unicode2ascii(text.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

def prepare_data (id2lines, conversation, max_length=10):
    for k,v in id2lines.items ():
        v = normalize_string (v)
        if len (v.split (" ")) > max_length:
            id2lines[k] = ""
        else:
            id2lines[k] = v
    
    clean_conversation = []
    for conv_idx, conv in enumerate (conversation):
        if not any ([id2lines[i] == "" for i in conv]):
            clean_conversation.append(conv)
    print ("From {}, trimmed to {}".format (len (conversation), len (clean_conversation)))
    conversation = clean_conversation
    clean_conversation = None # free the memory
    return id2lines, conversation
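To see what the normalization does to real text, here is a small self-contained check. It repeats the two helper functions from the cell above so that it runs on its own; `keep_line` is a hypothetical helper that mirrors the length test inside `prepare_data`.

```python
import re
import unicodedata

def unicode2ascii(text):
    # decompose accented characters and drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')

def normalize_string(text):
    s = unicode2ascii(text.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # everything else becomes a space
    return re.sub(r"\s+", r" ", s).strip()

def keep_line(text, max_length=10):
    # punctuation marks count as words here, exactly as in prepare_data
    return len(normalize_string(text).split(' ')) <= max_length

for raw in ["Hold on a second!", "It's so lean and clean.", "Caf\u00e9?"]:
    print(repr(raw), '->', repr(normalize_string(raw)))
```

Apostrophes are replaced by spaces, which is why contractions such as "it's" show up as "it s" in the sampled conversations.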

In [6]:
id2lines, conversation = prepare_data (id2lines, conversation)

From 221616, trimmed to 75000


In [8]:
sample_conversation ()

id : 48306
Q: hold on a second !
A: look at this it s so lean and clean .


In [9]:
# create vocabulary
vocab = Vocabulary ("cornell")
for c in conversation:
    vocab.add_sentence (id2lines[c[0]])
    vocab.add_sentence (id2lines[c[1]])
print ("Total words {}".format (vocab.num_words))
vocab.trim (3)

Total words 20164
From 20164, 9041 are kept


In [10]:
# Since trimming removed some words from the vocabulary, we also need to
# remove every conversation pair that contains a removed word.
def trim_conversation (id2lines, conversation, vocab):
    for _id,line in id2lines.items ():
        word = line.split (" ")
        is_removed = any ([w not in vocab.word2index for w in word])
        if is_removed:
            id2lines[_id] = None

    # actually removing the conversation
    clean_conversation = []
    for c in conversation:
        if all ([id2lines[_] is not None for _ in c]):
            clean_conversation.append (c)
    conversation = clean_conversation
    clean_conversation = None

    return id2lines, conversation

In [11]:
print ("Before trimming : {} conversations".format (len (conversation)))
id2lines, conversation = trim_conversation (id2lines, conversation, vocab)
print ("After trimming : {} conversations".format (len (conversation)))

Before trimming : 75000 conversations
After trimming : 62721 conversations


In [12]:
sample_conversation ()

id : 18425
Q: how d you know i d do it .
A: do what ?


In [30]:
"""
To train seq2seq on minibatches, we need to convert a batch of sentences
into a (max_length, batch_size) matrix.

Why (max_length, batch_size) rather than the more familiar (batch_size, max_length)?
An RNN processes a sequence one time step at a time: it takes the first item, runs it,
and then moves on to the next one. With (batch_size, max_length) the matrix would be
    [
        <tensor doc 1>
        <tensor doc 2>
        ...
        <tensor doc n>
    ]
and indexing its first dimension (i.e. matrix[0]) would return one whole document.
That is not what we want, because we want to go step by step.
So the matrix should be
    [
        <all index 0 of doc>
        <all index 1 of doc>
        ...
    ]
so that matrix[0] returns the first word of every document, matrix[1] the second
word, and so on.

In short: PyTorch RNNs iterate over the first dimension by default (batch_first=False),
and we want one operation per time step, not per document.
"""

def sentence2index (sentence, vocab):
    return [vocab.word2index[w] for w in sentence.split (" ")] + [vocab.word2index["EOS"]]

def zero_padding (m, vocab):
    # in here we transpose from (batch_size, max_length) into (max_length, batch_size)
    return list (itertools.zip_longest (*m, fillvalue=vocab.word2index['PAD']))

def binary_mask (m, vocab):
    mask = []
    for idx, l in enumerate (m):
        mask.append ([])
        for w in l:
            if w == vocab.word2index['PAD']:
                mask[-1].append (0)
            else:
                mask[-1].append (1)
    return mask

def transform_input (sentences, vocab):
    # sentences is a list of normalized sentence strings
    s_index= [sentence2index (s, vocab) for s in sentences] # convert to index
    lengths = torch.tensor ([len (s) for s in s_index]) # get length
    
    s_index_padded = torch.LongTensor (zero_padding (s_index, vocab)) # padding, convert to tensor
    return s_index_padded, lengths

def transform_output (sentences, vocab):
    # sentences is a list of normalized sentence strings
    s_index = [sentence2index (s, vocab) for s in sentences] # convert to index
    max_length = max ([len (s) for s in s_index]) # get maximum length of this batches
    s_index_padded = zero_padding (s_index, vocab)
    
    mask = torch.ByteTensor (binary_mask (s_index_padded, vocab)) # mask, convert to tensor
    s_index_padded = torch.LongTensor (s_index_padded) # convert to tensor
    
    return s_index_padded, max_length, mask    

def batch2tensor (pair_batch, vocab):
    pair_batch.sort (key=lambda x : len (x[0].split (" ")), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append (pair[0]) # question
        output_batch.append (pair[1]) # its answer
    
    input_tensor, lengths = transform_input (input_batch, vocab)
    output_tensor, max_length, mask = transform_output (output_batch, vocab)
    
    return input_tensor, output_tensor, lengths, max_length, mask
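The transpose-and-pad trick in `zero_padding` is easy to miss, so here is a plain-Python sketch of it on a toy batch of index lists. The token ids are arbitrary, with 2 standing in for EOS and 0 for PAD; `pad_and_transpose` is a hypothetical name for illustration.

```python
import itertools

PAD = 0

def pad_and_transpose(batch, pad_value=PAD):
    """(batch_size, variable length) -> (max_length, batch_size), padded."""
    # zip_longest groups the i-th token of every sentence, filling gaps with PAD
    return [list(step) for step in itertools.zip_longest(*batch, fillvalue=pad_value)]

def binary_mask(padded, pad_value=PAD):
    """1 where there is a real token, 0 where there is padding."""
    return [[0 if token == pad_value else 1 for token in step] for step in padded]

batch = [[5, 6, 7, 2], [8, 2], [9, 4, 2]]  # three sentences of length 4, 2 and 3
padded = pad_and_transpose(batch)
mask = binary_mask(padded)
print(padded)
print(mask)
```

Each row of `padded` is one time step across the whole batch, which is the layout the RNN consumes.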

In [14]:
# example validation
batch_size = 5
to_line = lambda conv: [id2lines[conv[0]], id2lines[conv[1]]]
batches_pair = [to_line (conversation[random.randrange (len (conversation))]) for i in range (batch_size)]

In [15]:
batches_pair

[['huh ?', 'got another song for us ?'],
 ['a dead body ?', 'it s amy kramer .'],
 ['access granted . male or female ?', 'male .'],
 ['bud . . .', 'yeah . . .'],
 ['do you have children ?', 'no .']]

In [16]:
input_tensor, output_tensor, lengths, max_length, mask = batch2tensor (batches_pair, vocab)

In [27]:
print (input_tensor.shape) # (max_length, batch_size)
input_tensor

torch.Size([8, 5])


tensor([[5803,   54,    8,  954,   32],
        [3129,   17,  215,   11,   16],
        [  11,   18, 1696,   11,    2],
        [2971, 3278,   16,   11,    0],
        [ 354,   16,    2,    2,    0],
        [2192,    2,    0,    0,    0],
        [  16,    0,    0,    0,    0],
        [   2,    0,    0,    0,    0]])

In [18]:
output_tensor

tensor([[2971,   43,    6,  196,  594],
        [  11,   11,    4,   11, 1032],
        [   2,    2, 1035,   11,  991],
        [   0,    0, 4426,   11,  119],
        [   0,    0,   11,    2,  424],
        [   0,    0,    2,    0,   16],
        [   0,    0,    0,    0,    2]])

In [20]:
mask, max_length

(tensor([[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 0, 1],
         [0, 0, 0, 0, 1]], dtype=torch.uint8), 7)

# Encoder Bidirectional RNN
A bidirectional RNN is two ordinary RNNs: one reads the sequence in its normal order and the other reads it in reverse. Each input is fed to both networks independently, and the outputs of the two RNNs are then summed.

Computational graph:
1. Convert input into word embedding
2. Pack the sequence
3. Pass to RNN
4. Unpack the result
5. Sum the outputs of the two RNNs
6. Return output and the final hidden state

In [21]:
class EncoderRNN (nn.Module):
    
    def __init__ (self, hidden_size, embedding, n_layers=1, dropout=0):
        super (EncoderRNN, self).__init__ ()
        self.hidden_size = hidden_size
        self.embedding = embedding
        self.n_layers = n_layers
        
        # the embedding output must be hidden_size too
        self.gru = nn.GRU (hidden_size, hidden_size, n_layers, 
                           dropout= (0 if n_layers == 1 else dropout), bidirectional=True)
    
    def forward (self, input_seq, input_lengths, hidden=None):
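        # input_seq is already padded, shape (max_length, batch_size);
        # input_lengths holds the true length of each sequence, sorted in decreasing order
        embed = self.embedding (input_seq) 
        # pack the padded batch so the GRU skips the PAD positions
        pack = nn.utils.rnn.pack_padded_sequence (embed, input_lengths) 
        outputs, hidden = self.gru (pack, hidden)
        # unpack: convert the packed output back to a padded tensor
        outputs, _ = nn.utils.rnn.pad_packed_sequence (outputs)
        # since bidirectional is true, we sum the forward and backward outputs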
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        return outputs, hidden
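To check the shapes flowing through the encoder, the following sketch runs the same pack, GRU, unpack, sum sequence on a tiny hand-made batch. The sizes and token ids are arbitrary assumptions chosen for illustration, and the layers are built inline instead of using `EncoderRNN`.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size, vocab_size = 4, 10
embedding = nn.Embedding(vocab_size, hidden_size, padding_idx=0)
gru = nn.GRU(hidden_size, hidden_size, bidirectional=True)

# (max_length=3, batch_size=2); the second sequence is padded with 0
input_seq = torch.tensor([[5, 7],
                          [6, 2],
                          [2, 0]])
lengths = torch.tensor([3, 2])  # true lengths, sorted in decreasing order

packed = nn.utils.rnn.pack_padded_sequence(embedding(input_seq), lengths)
outputs, hidden = gru(packed)
outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)  # (3, 2, 2 * hidden_size)

# sum the forward half and the backward half of the feature dimension
summed = outputs[:, :, :hidden_size] + outputs[:, :, hidden_size:]
print(summed.shape, hidden.shape)
```

The padded position of the shorter sequence comes back as zeros after unpacking, and `hidden` has one slice per direction, so its first dimension is `n_layers * 2`.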

In [22]:
# Luong attention layer
class Attn (nn.Module):
    
    def __init__ (self, method, hidden_size):
        super (Attn, self).__init__ ()
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError ("{} is not accepted method".format (self.method))
        
        self.hidden_size = hidden_size
        if self.method == 'general':
            self.attn = nn.Linear (self.hidden_size, self.hidden_size)
        elif self.method == 'concat':
            self.attn = nn.Linear (2*self.hidden_size, self.hidden_size)
            self.v = nn.Parameter (torch.FloatTensor (self.hidden_size))
    
    def dot_score (self, hidden, encoder_output):
        # multiply elementwise and sum over the hidden dimension -> (max_seq, batch_size)
        return torch.sum (hidden * encoder_output, dim=2)
    
    def general_score (self, hidden, encoder_output):
        energy = self.attn (encoder_output)
        return torch.sum (hidden * energy, dim=2)
    
    def concat_score (self, hidden, encoder_output):
        energy = self.attn (
            torch.cat (
                (hidden.expand (encoder_output.size (0), -1, -1), encoder_output),
                2
            )).tanh ()
        return torch.sum (self.v * energy, dim=2)
    
    def forward (self, hidden, encoder_outputs):
        # hidden -> (1, batch, hidden)
        # encoder_outputs -> (seq_length, batch, hidden)
        
        if self.method == 'general':
            attn_energies = self.general_score (hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score (hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score (hidden, encoder_outputs)
        
        # attn_energies -> (max_seq, batch_size); transpose to (batch_size, max_seq)
        attn_energies = attn_energies.t ()
        
        # normalize over the sequence dimension and return (batch_size, 1, max_seq)
        return F.softmax (attn_energies, dim=1).unsqueeze (1)
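For reference, the three `method` options correspond, as far as I understand them, to the score functions of Luong-style attention. The notation is my own assumption: $h_t$ is the current decoder output (`hidden`), $\bar{h}_s$ is the encoder output at source position $s$, and $W_a$, $v_a$ are the learned parameters (`self.attn`, `self.v`). The formulas omit the bias term that `nn.Linear` adds in the code.

$$
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{dot} \\
h_t^\top W_a \bar{h}_s & \text{general} \\
v_a^\top \tanh\left(W_a [h_t ; \bar{h}_s]\right) & \text{concat}
\end{cases}
\qquad
a_t(s) = \frac{\exp\left(\mathrm{score}(h_t, \bar{h}_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, \bar{h}_{s'})\right)}
$$

The softmax in `forward` computes $a_t(s)$ over all source positions for every element of the batch.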

# Decoder with Unidirectional GRU and Attention

Computation Graph:
1. Get embedding of current input word
2. Forward through unidirectional GRU
3. Calculate attention weights based on current GRU output (2) and encoder output
4. Multiply attention weights with encoder output to get new "weighted sum" context vector.
5. Concatenate weighted context and GRU output
6. Predict next word
7. Return output and final hidden state

Based on that, we have:

**inputs** :
- `input_step`, one time step (one word) of the input sequence batch, *shape=(1, batch_size)*
- `last_hidden`, the last hidden state of the GRU, *shape=(n_layers x num_directions, batch_size, hidden_size)*
- `encoder_outputs`, the output of the encoder, *shape=(max_length, batch_size, hidden_size)*

**outputs**:
- `output`, softmax-normalized tensor giving the probability of each word in the vocabulary, *shape=(batch_size, vocab.num_words)*
- `hidden`, the final hidden state of the GRU, *shape=(n_layers x num_directions, batch_size, hidden_size)*

In [23]:
class AttnDecoderRNN (nn.Module):
    
    def __init__ (self, attn_model, embedding,
                 hidden_size, output_size, 
                  n_layers=1, dropout=0.1):
        
        super (AttnDecoderRNN, self).__init__ ()
        
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout
        
        # define layers needed
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout (self.dropout)
        self.gru = nn.GRU (hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers==1 else dropout))
        
        # for joining context vector and decoder hidden state
        self.concat = nn.Linear (hidden_size*2, hidden_size)
        # from the concat into the word prob
        self.out = nn.Linear (hidden_size, output_size)
        
        self.attn = Attn (attn_model, hidden_size)
        
    def forward (self, input_step, last_hidden, encoder_outputs):
        # note: input_step holds one word index per batch element,
        # so the shape is (1, batch_size).
        embed = self.embedding (input_step)
        embed = self.embedding_dropout (embed)
        
        # forward through the RNN
        # RNN_output --> (seq_length, batch, num_direction*hidden_size) = (1, batch, hidden_size)
        rnn_output, hidden = self.gru (embed, last_hidden)
        
        # calculate attention
        # encoder_outputs --> (seq_length, batch, hidden_size), because the encoder
        # already summed the two directions of its bidirectional output
        attn_weights = self.attn (rnn_output, encoder_outputs)
        # attn_weights -> (batch, 1, seq_length)
        
        # multiply the weight with encoder output to get the context
        # encoder_outputs must be (batch, seq_length, hidden), so we transpose it from (seq_length, batch, hidden)
        # in other words, from index 0 to index 1
        context = attn_weights.bmm (encoder_outputs.transpose (0,1))
        # context will be (batch, 1, hidden)
        
        # concat the context vector with rnn output
        rnn_output = rnn_output.squeeze (0)
        context = context.squeeze (1)
        concat_input = torch.cat ((rnn_output, context), 1)
        
        # put into tanh of concat layer network
        concat_output = torch.tanh (self.concat (concat_input))
        
        # predict the output
        output = self.out (concat_output)
        output = F.softmax (output, dim=1)
        
        return output, hidden

Based on my understanding, the attention layer eventually produces a weight tensor with shape:

**(batch_size, 1, seq_length)**

These weights are then multiplied with `encoder_outputs` (transposed to (batch_size, seq_length, hidden)) to produce a context vector with shape **(batch_size, 1, hidden)**, which is squeezed to **(batch_size, hidden)**.

After that, the context vector is concatenated with the RNN output and fed into a linear layer of size (2 x hidden_size, hidden_size).
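The shape bookkeeping described here can be verified with random tensors. This is a sketch of the `dot` scoring path only, with arbitrary sizes chosen for illustration; it does not use the `Attn` or `AttnDecoderRNN` classes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_length, batch_size, hidden_size = 6, 3, 4
rnn_output = torch.randn(1, batch_size, hidden_size)                # decoder GRU output
encoder_outputs = torch.randn(seq_length, batch_size, hidden_size)  # summed encoder output

# dot score: broadcast over time steps, then sum over the hidden dimension
scores = torch.sum(rnn_output * encoder_outputs, dim=2)      # (seq_length, batch)
attn_weights = F.softmax(scores.t(), dim=1).unsqueeze(1)     # (batch, 1, seq_length)

# weighted sum of the encoder outputs, one context vector per batch element
context = attn_weights.bmm(encoder_outputs.transpose(0, 1))  # (batch, 1, hidden)

# join the decoder output and the context before the (2 * hidden, hidden) layer
concat_input = torch.cat((rnn_output.squeeze(0), context.squeeze(1)), 1)
print(attn_weights.shape, context.shape, concat_input.shape)
```

The weights of each batch element sum to 1 across the source positions, so the context is a convex combination of the encoder outputs.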

# Training Procedure

### Masked Loss
Since we are dealing with padded sequences, we can't simply consider all elements when calculating the loss. Only the positions that hold an actual word should count, so we need a masked loss.

In [24]:
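def maskNLLLoss(inp, target, mask):
    # inp: (batch_size, vocab_size) probabilities; target and mask: (batch_size,)
    nTotal = mask.sum()
    # negative log-probability of the target word for every batch element
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    # average over non-PAD positions only; recent PyTorch versions expect a boolean mask
    loss = crossEntropy.masked_select(mask.bool()).mean()
    loss = loss.to(device) # device is expected to be defined elsewhere in the notebook
    return loss, nTotal.item()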
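A tiny worked example may help to see what the mask changes. This sketch reimplements the same computation under the hypothetical name `masked_nll`, without the `device` handling, on a batch of two where the second position is padding.

```python
import math
import torch

def masked_nll(probs, target, mask):
    """Mean negative log-likelihood over the positions where mask is True."""
    picked = torch.gather(probs, 1, target.view(-1, 1)).squeeze(1)
    return (-torch.log(picked)).masked_select(mask).mean()

# one time step for a batch of two, with a vocabulary of three words
probs = torch.tensor([[0.50, 0.25, 0.25],
                      [0.10, 0.80, 0.10]])
target = torch.tensor([0, 1])
mask = torch.tensor([True, False])  # the second element is padding

loss = masked_nll(probs, target, mask)
print(loss.item())  # -log(0.5), about 0.6931
```

Only the first element contributes, so the confident prediction for the padded element does not pull the average down.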

### Single Training Iteration

For a single batch of input, we define a `train` function that takes the input, processes it, and computes its gradients.
Here we use a couple of techniques: **teacher forcing** and **gradient clipping**.

The algorithm of this function is:
1. **Forward** `input` batch through encoder.
2. **Initialize** decoder input as SOS_token and hidden_state as the encoder's final hidden state.
3. Forward input batch sequence **through decoder** one time step at a time
4. If **teacher forcing**: set the next decoder input to the current target; otherwise set it to the current decoder output
5. Calculate and accumulate **loss**
6. Perform **backpropagation**
7. **Clip** the gradients
8. **Update** encoder and decoder parameters
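Steps 4 and 7 are the least obvious ones, so here is a minimal sketch of each in isolation. It is not the `train` function itself: `next_decoder_input`, `teacher_forcing_ratio`, and `max_norm` are hypothetical names, and a small `nn.Linear` stands in for the encoder and decoder.

```python
import random
import torch
from torch import nn

def next_decoder_input(decoder_output, target_step, use_teacher_forcing):
    """Return the decoder input for the next time step, shape (1, batch_size)."""
    if use_teacher_forcing:
        return target_step.view(1, -1)     # feed the ground-truth word
    _, top_index = decoder_output.topk(1)  # (batch_size, 1), most likely word
    return top_index.t().detach()          # feed the model's own prediction

decoder_output = torch.tensor([[0.1, 0.7, 0.2],
                               [0.6, 0.3, 0.1]])
target_step = torch.tensor([2, 2])
teacher_forcing_ratio = 0.5  # assumed value; the decision is drawn at random
use_teacher_forcing = random.random() < teacher_forcing_ratio
print(next_decoder_input(decoder_output, target_step, use_teacher_forcing))

# gradient clipping: backpropagate, rescale the gradients, then update
torch.manual_seed(0)
model = nn.Linear(4, 2)  # stand-in for the encoder/decoder parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer.zero_grad()
loss = (model(torch.randn(8, 4)) * 100).pow(2).mean()  # deliberately large loss
loss.backward()
max_norm = 1.0
nn.utils.clip_grad_norm_(model.parameters(), max_norm)
optimizer.step()
```

Clipping rescales all gradients together so that their combined norm does not exceed `max_norm`, which keeps one bad batch from producing a huge parameter update.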