# Encoder/Decoder Dialogue Management

Here we use a simple Encoder/Decoder GRU network to predict answers from the Cornell Movie-Dialog Corpus. We use **PyTorch** as a deep learning framework.

Most of the code in this notebook comes from the following **tutorial** on English-French translation.

https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

We apply the Machine Translation framework to Dialogue Management by processing sentences in the corpus by pairs: we encode the sentence, and decode the answer.

In [1]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import numpy as np

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
from tqdm import tqdm_notebook as tqdm

In [3]:
DEVICE

device(type='cuda')

## Load dataset 

We start loading the corpus' dialogs as **Episodes** (class due.episode.Episode). We limit the number of episodes to load so we can test the code more easily.

In [4]:
from due.corpora import cornell
import itertools

N_DIALOGS = 100

episodes = list(itertools.islice(cornell.episode_generator(), N_DIALOGS))

# episodes = cornell.load()

  0%|          | 0/83097 [00:00<?, ?it/s]


In [5]:
episodes[95].events

[Event(type=<Type.Utterance: 'utterance'>, timestamp=datetime.datetime(2011, 6, 15, 12, 4, 49), agent='u9', payload="What's the worst?"),
 Event(type=<Type.Utterance: 'utterance'>, timestamp=datetime.datetime(2011, 6, 15, 12, 4, 50), agent='u2', payload='You get the girl.')]

## Text cleaning

Here we define functions for a simple text processing pipeline, where we just convert sentences to lowercase and tokenize them using SpaCy.

In [6]:
s = "Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again."

In [7]:
import re
import spacy
# Needed: pipenv run python -m spacy download en
spacy_nlp_en = spacy.load('en')

def tokenize_sentence(sentence):
    s_spacy = spacy_nlp_en(sentence)
    return [str(token) for token in s_spacy]

In [8]:
def normalize_sentence(sentence, return_tokens=False):
    result = sentence.lower()
    result = re.sub(r'\s+', ' ', result)
    result = tokenize_sentence(result)
    if not return_tokens:
        result = ' '.join(result)
    return result

s_normalized = normalize_sentence(s, False)
print(s_normalized)

can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break- up on the quad . again .


## Dataset generation

Here we generate a dataset of utterances and their responses. The **output** of this section is:

* A list of utterances (`str`) `X`    
* A list of responses (`str`) `y`, one per utterance in `X`.

Example:

* X: `["hi", "hello how are you?", "i'm fine thanks", ...]`
* y: `["hello how are you?", "i'm fine thanks", "good to hear", ...]`

Note that within an Episode `i`, `y_i` is just `X_i[1:]`. This is not true when `X` and `y` are obtained concatenating data from multiple episodes.

In [9]:
from due.event import Event

In [10]:
def _is_utterance(event):
    return event.type == Event.Type.Utterance

def extract_pairs(episode):
    """
    Process Events in an Episode, extracting all the Event pairs that can be interpreted as one
    dialogue turn (ie. an Agent's utterance, and another Agent's response)
    
    In particular, Event pairs are extracted so that:
    
    * Both Events are Utterances (currently, non-utterances will raise an exception)
    * The second Event immediately follows the first
    * The two Events are acted by two different Agents
    
    Two lists of the same length are returned, so that each utterance (`str`) in the first list
    has its response in the second.
    """
    alternate_sentences = [episode.events[0].payload]
    for e1, e2 in zip(episode.events, episode.events[1:]):
        if not _is_utterance(e1) or not _is_utterance(e2):
            raise NotImplementedError("Non-utterance Events are not supported yet")
            
        if e1.agent != e2.agent:
            alternate_sentences.append(e2.payload)

    normalized_alternate_sentences = [normalize_sentence(s) for s in alternate_sentences]

    result_X = normalized_alternate_sentences[0:-1]
    result_y = normalized_alternate_sentences[1:]
    
    return result_X, result_y
    
extract_pairs(episodes[0])

(['can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break- up on the quad . again .',
  "well , i thought we 'd start with pronunciation , if that 's okay with you .",
  'not the hacking and gagging and spitting part . please .'],
 ["well , i thought we 'd start with pronunciation , if that 's okay with you .",
  'not the hacking and gagging and spitting part . please .',
  "okay ... then how 'bout we try out some french cuisine . saturday ? night ?"])

In [11]:
from tqdm import tqdm_notebook as tqdm

X = []
y = []

for e in tqdm(episodes):
    try:
        episode_X, episode_y = extract_pairs(e)
    except AttributeError:
        print("Skipping episode with events: %s" % e.events)
    X.extend(episode_X)
    y.extend(episode_y)

HBox(children=(IntProgress(value=0), HTML(value='')))




# Vocabulary

Here we index all the words in the corpus so that we can associate each word with a numeric ID, and vice versa.

**TODO**: consider using torchtext instead

**TODO**: implement pruning of rare words

In [12]:
from collections import defaultdict

class Vocabulary():
    def __init__(self):
        self.word_to_index = {}
        self.index_to_word = {}
        self.index_to_count = defaultdict(int)
        self.current_index = 0
        
        self.add_word('<UNK>') # Unknown token
        self.add_word('<SOS>') # Start of String
        self.add_word('<EOS>') # End of String
        
    def add_word(self, word):
        """
        Add a new word to the dictionary.
        """
        if word in self.word_to_index:
            index = self.word_to_index[word]
        else:
            index = self.current_index
            self.current_index += 1
            self.word_to_index[word] = index
            self.index_to_word[index] = word
            
        self.index_to_count[index] += 1
        
    def index(self, word):
        """
        Retrieve a word's index in the Vocabulary. Return the index of the <UNK>
        token if not present.
        """
        if word in self.word_to_index:
            return self.word_to_index[word]
        return self.word_to_index['<UNK>']
    
    def word(self, index):
        """
        Return the word corresponding to the given index/
        """
        return self.index_to_word[index]
    
    def size(self):
        return len(self.word_to_index)

In [13]:
vocabulary_full = Vocabulary()
for sentence in set(X + y):
    for word in sentence.split():
        vocabulary_full.add_word(word)

In [14]:
vocabulary_full.size()

803

In [15]:
from six import iteritems

MIN_OCCURRENCES = 2

vocabulary = Vocabulary()
for index, count in iteritems(vocabulary_full.index_to_count):
    if count >= MIN_OCCURRENCES:
        vocabulary.add_word(vocabulary_full.word(index))

In [16]:
vocabulary.size()

287

# Embeddings

We could initialize the model's embedding layer with random weights, but we expect better results using pre-trained word embeddings instead. We chose **GloVe** 6B, 300d word vectors for this purpose.

To set these vectors as default embeddings for our network we need to prepare a matrix of `(vocabulary_size, embedding_dim)` elements where the *i*-th row is the embedding vector of the word of index *i* in our vocabulary.

In [17]:
import numpy as np

In [18]:
from due import resource_manager
rm = resource_manager

In [19]:
EMBEDDING_DIM = 300

In [20]:
def get_embedding_matrix(vocabulary, stub=False):
    if stub:
        return torch.tensor(np.random.rand(vocabulary.size(), EMBEDDING_DIM), device=DEVICE)

    with rm.open_resource_file('embeddings.glove6B', 'glove.6B.300d.txt') as f:
        unk_index = vocabulary.index('<UNK>')
        result = np.zeros((vocabulary.size(), 300))
        for line in tqdm(f):
            line_split = line.split()
            word = line_split[0]
            index = vocabulary.index(word)
            if index != unk_index:
                vector = [float(x) for x in line_split[1:]]
                result[index,:] = vector
        sos_index = vocabulary.index('<SOS>')
        result[sos_index,:] = np.ones(300)
        return torch.FloatTensor(result, device=DEVICE)

In [21]:
embedding_matrix = get_embedding_matrix(vocabulary)
# embedding_matrix = get_embedding_matrix(vocabulary, stub=True)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




In [22]:
embedding_matrix.size()

torch.Size([287, 300])

# 1-by-1 training

Here we define a simple model that can be trained one sentence pair at the time. To drastically improve training time, usually deep learning systems are trained in **batches**. Batch training is implemented later on in this Notebook.

## Encoding

Here we define a function to encode a sentence into a Torch tensor of indices

In [23]:
def sentence_to_tensor(sentence):
    sentence_indexes = [vocabulary.index(w) for w in sentence.split()]
    sentence_indexes.append(vocabulary.index('<EOS>'))
    return torch.tensor(sentence_indexes, dtype=torch.long, device=DEVICE).view(-1, 1)

In [24]:
sentence_to_tensor(X[0])

tensor([[ 195],
        [  17],
        [  50],
        [ 193],
        [   0],
        [  12],
        [   0],
        [   0],
        [  67],
        [   0],
        [   0],
        [  96],
        [ 233],
        [ 180],
        [   0],
        [   0],
        [   0],
        [   0],
        [   9],
        [  93],
        [  58],
        [   0],
        [   5],
        [ 252],
        [   5],
        [   2]], device='cuda:0')

## Model

The model we used is copied straight from the one presented in the reference tutorial (https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).

Note that attention is not implemented yet.

In [25]:
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding_matrix):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
#         self.embedding = nn.Embedding(vocabulary_size, embedding_size) # random init
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        embedding_dim = self.embedding.embedding_dim
    
        self.gru = nn.GRU(embedding_dim, hidden_size)
        
    def forward(self, input_data, hidden):
        embedded = self.embedding(input_data).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden
    
    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=DEVICE)

In [26]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding_matrix):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
#         self.embedding = nn.Embedding(vocabulary_size, embedding_size)
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        embedding_dim = self.embedding.embedding_dim
        vocabulary_size = self.embedding.num_embeddings
        
        self.gru = nn.GRU(embedding_dim, hidden_size)
        self.out = nn.Linear(hidden_size, vocabulary_size)
        self.softmax = nn.LogSoftmax(dim=1)
        
    def forward(self, input_data, hidden):
        output = self.embedding(input_data).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.out(output[0])
        output = self.softmax(output)
        return output, hidden
    
    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=DEVICE)

## Training

Here we define a function to process training for a single pair of sentences.

**TODO**: implement training with no teacher forcing

In [27]:
import random

In [28]:
TEACHER_FORCING_RATIO = 0.5
MAX_LENGTH = 500 # Will raise an error if a longer sentence is encountered

def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.init_hidden()
    
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)
    
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=DEVICE)
    
    loss = 0
    
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]
        
    decoder_input = torch.tensor([[vocabulary.index('<SOS>')]], device=DEVICE)
    decoder_hidden = encoder_hidden
    
#     use_teacher_forcing = True if random.random() < TEACHER_FORCING_RATIO else False
    use_teacher_forcing = True
    
    if use_teacher_forcing:
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]
        
    loss.backward()
    
    encoder_optimizer.step()
    decoder_optimizer.step()
    
    return loss.item() / target_length

### Model initialization

This instantiate a fresh model. You should run this cell **once** before running your training epochs.

In [29]:
from datetime import datetime

LEARNING_RATE = 0.01
VOCABULARY_SIZE = vocabulary.size()
EMBEDDING_SIZE = 300
HIDDEN_SIZE = 512

encoder = EncoderRNN(VOCABULARY_SIZE, embedding_matrix).to(DEVICE)
decoder = DecoderRNN(VOCABULARY_SIZE, embedding_matrix).to(DEVICE)

encoder_optimizer = optim.SGD(encoder.parameters(), lr=LEARNING_RATE)
decoder_optimizer = optim.SGD(decoder.parameters(), lr=LEARNING_RATE)
criterion = nn.NLLLoss()

epoch = 0

### Epoch
Here we run a training Epoch, that is, we run the whole dataset through the training procedure. This cell can be executed many times to run multiple Epochs (be careful not to re-initialize the model across Epochs: that would reset training to Epoch 1).

In [30]:
PRINT_EVERY = 50

i = 1
tick = datetime.now()
loss_sum = 0.0
for input_sentence, target_sentence in tqdm(zip(X, y)):
    input_tensor = sentence_to_tensor(input_sentence)
    target_tensor = sentence_to_tensor(target_sentence)

    loss = train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
    loss_sum += loss
    if i%PRINT_EVERY == 0:
        print(i, loss_sum/PRINT_EVERY)
        loss_sum = 0.0
    i += 1
tock = datetime.now()

epoch += 1

print(tock-tick)
print(i, loss_sum/PRINT_EVERY)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

50 4.740919136900429
100 4.4504865207016175
150 4.37611336638037
200 4.3735499388217685

0:00:03.594551
202 0.09222966194152832


## Model serialization 

In [31]:
MODEL_NAME = "encdec-cornell-100-batch-glove6B_300-vmin2-e1"
model_filename = "%s_MODEL.pt" % MODEL_NAME
dataset_filename = "%s_DATASET.pt" % MODEL_NAME

### Save 

In [32]:
model = {
    'encoder': encoder.state_dict(),
    'decoder': decoder.state_dict(),
    'epoch': epoch,
    'encoder_optimizer': encoder_optimizer,
    'decoder_optimizer': decoder_optimizer,
    'embedding_matrix': embedding_matrix
}

torch.save(model, model_filename)

In [33]:
dataset = {
    "X": X,
    "y": y,
    "vocabulary": {
        "word_to_index": vocabulary.word_to_index,
        "index_to_word": vocabulary.index_to_word,
        "index_to_count": vocabulary.index_to_count,
        "current_index": vocabulary.current_index
    }
}

torch.save(dataset, dataset_filename)

### Load

In [34]:
dataset_deserialized = torch.load(dataset_filename)

X_deserialized = dataset_deserialized["X"]
y_deserialized = dataset_deserialized["y"]
vocabulary_deserialized = Vocabulary()
vocabulary_deserialized.word_to_index = dataset_deserialized['vocabulary']['word_to_index']
vocabulary_deserialized.index_to_word = dataset_deserialized['vocabulary']['index_to_word']
vocabulary_deserialized.index_to_count = dataset_deserialized['vocabulary']['index_to_count']
vocabulary_deserialized.current_index = dataset_deserialized['vocabulary']['current_index']

In [35]:
model_deserialized = torch.load(model_filename)

embedding_matrix_deserialized = model_deserialized['embedding_matrix']

encoder_deserialized = EncoderRNN(vocabulary_deserialized.size(), embedding_matrix_deserialized).to(DEVICE)
encoder_deserialized.load_state_dict(model_deserialized['encoder'])

decoder_deserialized = DecoderRNN(vocabulary_deserialized.size(), embedding_matrix_deserialized).to(DEVICE)
decoder_deserialized.load_state_dict(model_deserialized['decoder'])

## Evaluation

In [36]:
# TODO

## Testing 

In [37]:
def predict_answer(input_sentence, vocabulary, encoder, decoder):
    result = []
    
    input_tensor = sentence_to_tensor(input_sentence)
    input_length = input_tensor.size(0)
    
    encoder_hidden = encoder.init_hidden()
    for ei in range(input_length):
        _, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)

    decoder_input = torch.tensor([[vocabulary.index('<SOS>')]], device=DEVICE)
    decoder_hidden = encoder_hidden
    
    for di in range(MAX_LENGTH):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        topv, topi = decoder_output.topk(1)
        decoder_input = topi.squeeze().detach()
        
        predicted_index = decoder_input.item()
        
        if predicted_index == vocabulary.index('<EOS>'):
            break
        result.append(vocabulary.word(predicted_index))
    
    return " ".join(result)

In [38]:
predict_answer("what's the meaning of life?'", vocabulary, encoder, decoder)

'i <UNK> <UNK>'

In [39]:
predict_answer("what's the meaning of life?'", vocabulary_deserialized, encoder_deserialized, decoder_deserialized)

'i <UNK> <UNK>'

# Batch training

Instead of feeding sentence pairs one by one, we want the training procedure to predict a number of samples before computing the loss and completing the optimization step. This is called batch training.

The code below is inspired to https://github.com/pengyuchen/PyTorch-Batch-Seq2seq/blob/master/seq2seq_translation_tutorial.py

## Exploration
Here we compare our model's output in the single-sentence case vs. batch.

In [41]:
# Fake embedding layer
embedding = nn.Embedding(5, 10).to(DEVICE)

In [42]:
# Single sentence tensor
sentence_indexes = [1, 2, 3]
sentence_tensor = torch.tensor(sentence_indexes, dtype=torch.long, device=DEVICE).view(-1, 1)
input_data = sentence_tensor[0]
input_data

tensor([ 1], device='cuda:0')

In [43]:
BATCH_SIZE = 2

# Batch tensor
input_batch = torch.tensor([1, 4], device=DEVICE).view(-1, 1)
input_batch

tensor([[ 1],
        [ 4]], device='cuda:0')

In [44]:
embedding(input_data)

tensor([[-0.5899, -0.4195,  0.4906,  0.1626,  0.0029,  0.6255, -1.0804,
         -0.0071,  1.5118, -0.9721]], device='cuda:0')

In [45]:
embedding(input_batch)

tensor([[[-0.5899, -0.4195,  0.4906,  0.1626,  0.0029,  0.6255, -1.0804,
          -0.0071,  1.5118, -0.9721]],

        [[ 0.8736,  0.3196, -0.1022,  0.2273, -0.6255, -0.9996,  0.0735,
           0.4913,  0.6050,  0.4745]]], device='cuda:0')

In [46]:
embedding(input_batch).view(1, BATCH_SIZE, -1)

tensor([[[-0.5899, -0.4195,  0.4906,  0.1626,  0.0029,  0.6255, -1.0804,
          -0.0071,  1.5118, -0.9721],
         [ 0.8736,  0.3196, -0.1022,  0.2273, -0.6255, -0.9996,  0.0735,
           0.4913,  0.6050,  0.4745]]], device='cuda:0')

## Model 
We still compare the mode's output with the previous one

In [47]:
class EncoderRNNBatch(nn.Module):
    def __init__(self, hidden_size, embedding_matrix):
        super(EncoderRNNBatch, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        embedding_dim = self.embedding.embedding_dim
    
        self.gru = nn.GRU(embedding_dim, hidden_size)
        
    def forward(self, input_data, batch_size, hidden):
        embedded = self.embedding(input_data).view(1, batch_size, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden
    
    def init_hidden(self, batch_size):
        return torch.zeros(1, batch_size, self.hidden_size, device=DEVICE)

In [48]:
encoder = EncoderRNN(32, embedding_matrix).to(DEVICE)
encoder_batch = EncoderRNNBatch(32, embedding_matrix).to(DEVICE)

In [49]:
# 1-by-1 model
encoder_hidden = encoder.init_hidden()
encoder(input_data, encoder_hidden)

(tensor([[[ 0.0228,  0.2981,  0.9194, -0.1953, -0.2722, -0.4606, -0.0468,
           -0.1577,  0.0971, -0.0423, -0.4126,  0.8335,  0.5273, -0.1992,
            0.0203,  0.7854, -0.1932,  0.0382,  0.0955, -0.0258, -0.2477,
            0.5920, -0.0355,  0.0702, -0.0024, -0.0518,  0.2959, -0.0327,
           -0.8324,  0.7245,  0.0966, -0.8010]]], device='cuda:0'),
 tensor([[[ 0.0228,  0.2981,  0.9194, -0.1953, -0.2722, -0.4606, -0.0468,
           -0.1577,  0.0971, -0.0423, -0.4126,  0.8335,  0.5273, -0.1992,
            0.0203,  0.7854, -0.1932,  0.0382,  0.0955, -0.0258, -0.2477,
            0.5920, -0.0355,  0.0702, -0.0024, -0.0518,  0.2959, -0.0327,
           -0.8324,  0.7245,  0.0966, -0.8010]]], device='cuda:0'))

In [50]:
# Batch model
encoder_hidden_batch = encoder_batch.init_hidden(BATCH_SIZE)
encoder_batch(input_batch, BATCH_SIZE, encoder_hidden_batch)

(tensor([[[-0.2550, -0.1818,  0.0316,  0.1641, -0.1318, -0.8157, -0.1721,
           -0.0246,  0.0438,  0.2603, -0.6466,  0.7489, -0.0013, -0.7385,
           -0.1891, -0.0044,  0.0814, -0.1645, -0.2884,  0.0751, -0.1295,
           -0.4015, -0.2971,  0.8273,  0.4685,  0.5808, -0.0511, -0.0893,
           -0.1929, -0.2152,  0.1525, -0.2222],
          [-0.3364,  0.0427, -0.0655, -0.0571,  0.0610,  0.2467, -0.3439,
           -0.3327, -0.2260,  0.3126,  0.0580, -0.2173,  0.3389,  0.1444,
            0.3097,  0.1450,  0.2398,  0.3871, -0.2337,  0.2830, -0.3261,
           -0.0741, -0.0842,  0.2484,  0.0201,  0.2271,  0.1544, -0.1194,
           -0.2921,  0.1589,  0.2274, -0.0239]]], device='cuda:0'),
 tensor([[[-0.2550, -0.1818,  0.0316,  0.1641, -0.1318, -0.8157, -0.1721,
           -0.0246,  0.0438,  0.2603, -0.6466,  0.7489, -0.0013, -0.7385,
           -0.1891, -0.0044,  0.0814, -0.1645, -0.2884,  0.0751, -0.1295,
           -0.4015, -0.2971,  0.8273,  0.4685,  0.5808, -0.0511, -0.08

In [51]:
class DecoderRNNBatch(nn.Module):
    def __init__(self, hidden_size, embedding_matrix):
        super(DecoderRNNBatch, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        embedding_dim = self.embedding.embedding_dim
        vocabulary_size = self.embedding.num_embeddings
        
        self.gru = nn.GRU(embedding_dim, hidden_size)
        self.out = nn.Linear(hidden_size, vocabulary_size)
        self.softmax = nn.LogSoftmax(dim=1)
        
    def forward(self, input_data, batch_size, hidden):
        output = self.embedding(input_data).view(1, batch_size, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.out(output[0])
        output = self.softmax(output)
        return output, hidden
    
    def init_hidden(self, batch_size):
        return torch.zeros(1, BATCH_SIZE, self.hidden_size, device=DEVICE)

In [59]:
# vocabulary_size=10, embedding_dim=5
toy_embedding_matrix = torch.FloatTensor(np.random.rand(10, 5), device=DEVICE)

In [62]:
decoder = DecoderRNN(32, toy_embedding_matrix).to(DEVICE)
decoder_batch = DecoderRNNBatch(32, toy_embedding_matrix).to(DEVICE)

In [63]:
# 1-by-1 model
decoder_input = torch.tensor([[vocabulary.index('<SOS>')]], device=DEVICE)
decoder_hidden = encoder_hidden
decoder(decoder_input, decoder_hidden)

(tensor([[-2.4036, -2.1507, -2.2598, -2.2110, -2.4558, -2.2650, -2.1703,
          -2.4602, -2.2721, -2.4422]], device='cuda:0'),
 tensor([[[ 0.1051, -0.0295,  0.0116,  0.0640,  0.1038, -0.0042, -0.0312,
           -0.0931, -0.0646, -0.1191,  0.0821, -0.1534, -0.0853, -0.0444,
           -0.0749, -0.0700, -0.1183, -0.0063,  0.0183, -0.0702,  0.0869,
           -0.0172, -0.0663,  0.0524, -0.1017,  0.0631, -0.0713, -0.0294,
           -0.0407, -0.0362,  0.0062, -0.0139]]], device='cuda:0'))

In [64]:
# Batch model
decoder_input_batch = torch.tensor([[vocabulary.index('<SOS>')]*BATCH_SIZE], device=DEVICE)
decoder_hidden_batch = encoder_hidden_batch
decoder_batch(decoder_input_batch, BATCH_SIZE, decoder_hidden_batch)

(tensor([[-2.3927, -2.2153, -2.5005, -2.2281, -2.2738, -2.3554, -2.2440,
          -2.3097, -2.1736, -2.3767],
         [-2.3927, -2.2153, -2.5005, -2.2281, -2.2738, -2.3554, -2.2440,
          -2.3097, -2.1736, -2.3767]], device='cuda:0'),
 tensor([[[-0.0384, -0.0389,  0.1550,  0.1176, -0.1043,  0.0947,  0.0626,
           -0.0606, -0.0417, -0.0600,  0.0924,  0.1496, -0.0804,  0.0299,
            0.0715, -0.0259, -0.0535, -0.1762,  0.1166, -0.1030,  0.0852,
            0.0671,  0.0429,  0.0611,  0.1685,  0.1322,  0.1156,  0.0138,
            0.0451,  0.0209,  0.0868, -0.0031],
          [-0.0384, -0.0389,  0.1550,  0.1176, -0.1043,  0.0947,  0.0626,
           -0.0606, -0.0417, -0.0600,  0.0924,  0.1496, -0.0804,  0.0299,
            0.0715, -0.0259, -0.0535, -0.1762,  0.1166, -0.1030,  0.0852,
            0.0671,  0.0429,  0.0611,  0.1685,  0.1322,  0.1156,  0.0138,
            0.0451,  0.0209,  0.0868, -0.0031]]], device='cuda:0'))

In [66]:
try:
    del encoder
    del decoder
    del decoder_batch
    del encoder_hidden
    del encoder_hidden_batch
    del decoder_input
    del decoder_hidden
    del decoder_input_batch
    del decoder_hidden_batch
except NameError:
    pass

## Batch iterator

In [67]:
def batches(X, y, batch_size):
    for i in range(int(np.ceil(len(X)/batch_size))):
        start_index = i*batch_size
        end_index = start_index + batch_size
        yield X[start_index:end_index], y[start_index:end_index]

In [68]:
list(batches([0, 1, 2, 3, 4], ['a', 'b', 'c', 'd', 'e'], 2))

[([0, 1], ['a', 'b']), ([2, 3], ['c', 'd']), ([4], ['e'])]

## Batch to tensor

In [69]:
sentence_to_tensor(X[0])[0] # Input of normal encoder

tensor([ 195], device='cuda:0')

In [70]:
input_batch

tensor([[ 1],
        [ 4]], device='cuda:0')

In [71]:
def pad_sequence(sequence, pad_value, final_length):
    """
    Trim the sequence if longer than final_length, pad it with pad_value if shorter.
    
    In any case at lest one pad element will be left at the end of the sequence (this is
    because we pad with the <EOS> token)
    """
    if len(sequence) >= final_length:
        result = sequence[:final_length]
        result[-1] = pad_value
        return result

    return sequence + [pad_value] * (final_length - len(sequence))

In [72]:
pad_sequence([1, 2, 3], 0, 5)

[1, 2, 3, 0, 0]

In [73]:
pad_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0, 5)

[1, 2, 3, 4, 0]

In [74]:
a = np.array([[1, 2, 3], [4, 5, 6]])
a = a.transpose()
np.expand_dims(a, 2)[0]

array([[1],
       [4]])

In [75]:
def batch_to_tensor(batch, max_words=None):
    """
    Receive a list of sentences (strings), return a batch
    """
    sentence_indexes = [[vocabulary.index(w) for w in sentence.split()] for sentence in batch]
    max_length = max([len(x) for x in sentence_indexes])
    if max_words:
        max_length = min(max_length, max_words)
    sentence_indexes = [pad_sequence(s, vocabulary.index('<EOS>'), max_length+1) for s in sentence_indexes]
    
    result = np.transpose(sentence_indexes)
    result = np.expand_dims(result, axis=2)
    return torch.tensor(result, dtype=torch.long, device=DEVICE)

In [76]:
batch = ['this is a sentence', 'this is another much longer sentence', 'short sentence']
batch_tensor = batch_to_tensor(batch)
n_words = batch_tensor.size(0)
batch_size = batch_tensor.size(1)
first_word = batch_tensor[0]

print(n_words)
print(batch_size)
print(first_word)

7
3
tensor([[ 193],
        [ 193],
        [   0]], device='cuda:0')


## Training

In [77]:
torch.cuda.empty_cache()

In [78]:
TEACHER_FORCING_RATIO = 0.5
MAX_LENGTH = 30

def train_batch(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    batch_size = input_tensor.size(1)
    
    encoder_hidden = encoder.init_hidden(batch_size)
    
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)
    
#     encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=DEVICE)
    
    loss = 0
    
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], batch_size, encoder_hidden)
#         encoder_outputs[ei] = encoder_output[0, 0]
        
    decoder_input = torch.tensor([[vocabulary.index('<SOS>')]*batch_size], device=DEVICE)
    decoder_hidden = encoder_hidden
    
#     use_teacher_forcing = True if random.random() < TEACHER_FORCING_RATIO else False
    use_teacher_forcing = True
    
    if use_teacher_forcing:
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(decoder_input, batch_size, decoder_hidden)
#             print("decoder_output:", decoder_output, decoder_output.size())
#             print("target_tensor[di]:", target_tensor[di], target_tensor[di].size())
            loss += criterion(decoder_output, target_tensor[di].view(batch_size))
#             loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]
        
    loss.backward()
    
    encoder_optimizer.step()
    decoder_optimizer.step()
    
    return loss.item() / target_length

In [85]:
from datetime import datetime

LEARNING_RATE = 0.01
VOCABULARY_SIZE = vocabulary.size()
EMBEDDING_SIZE = 300
HIDDEN_SIZE = 512
BATCH_SIZE = 64

encoder = EncoderRNNBatch(VOCABULARY_SIZE, embedding_matrix).to(DEVICE)
decoder = DecoderRNNBatch(VOCABULARY_SIZE, embedding_matrix).to(DEVICE)

encoder_optimizer = optim.SGD(encoder.parameters(), lr=LEARNING_RATE)
decoder_optimizer = optim.SGD(decoder.parameters(), lr=LEARNING_RATE)
criterion = nn.NLLLoss()

epoch = 0

In [86]:
PRINT_EVERY = 100
EPOCHS = 2

for _ in range(EPOCHS):
    i = 1
    tick = datetime.now()
    loss_sum = 0.0
    loss_sum_partial = 0.0
    for input_batch, target_batch in tqdm(batches(X, y, BATCH_SIZE)):
        input_tensor = batch_to_tensor(input_batch, MAX_LENGTH)
        target_tensor = batch_to_tensor(target_batch, MAX_LENGTH)

        loss = train_batch(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
        loss_sum += loss
        loss_sum_partial += loss
        if i%PRINT_EVERY == 0:
            print(i, loss_sum_partial/PRINT_EVERY)
            loss_sum_partial = 0.0
        i += 1
    tock = datetime.now()

    epoch += 1

    print(tock-tick)
    print(i, loss_sum/i)
    print()

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))


0:00:00.279829
5 4.192670440673828



HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))


0:00:00.264212
5 2.5834801909744103



In [87]:
epoch

2

## Testing

In [88]:
def predict_answer_batch(input_sentence, vocabulary, encoder, decoder):
    result = []
    
    input_tensor = batch_to_tensor([input_sentence])
    input_length = input_tensor.size(0)
    batch_size = input_tensor.size(1)
    
    encoder_hidden = encoder.init_hidden(batch_size)
    for ei in range(input_length):
        _, encoder_hidden = encoder(input_tensor[ei], batch_size, encoder_hidden)

    decoder_input = torch.tensor([[vocabulary.index('<SOS>')] * batch_size], device=DEVICE)
    decoder_hidden = encoder_hidden
    
    for di in range(MAX_LENGTH):
        decoder_output, decoder_hidden = decoder(decoder_input, batch_size, decoder_hidden)
        topv, topi = decoder_output.topk(1)
        decoder_input = topi.squeeze().detach()
        
#         print(decoder_output)
        
        predicted_index = decoder_input.item()
        
        if predicted_index == vocabulary.index('<EOS>'):
            break
        result.append(vocabulary.word(predicted_index))
    
    return " ".join(result)

In [89]:
print(predict_answer_batch("hi", vocabulary, encoder, decoder))
print(predict_answer_batch("hello", vocabulary, encoder, decoder))
print(predict_answer_batch("what's your name?", vocabulary, encoder, decoder))
print(predict_answer_batch("my name is Anna", vocabulary, encoder, decoder))
print(predict_answer_batch("good to see you", vocabulary, encoder, decoder))
print(predict_answer_batch("i like wine, and you?", vocabulary, encoder, decoder))







