In [1]:
# IPython candies...
from IPython.display import Image
from IPython.core.display import HTML 

In [2]:
# Imports we need.
from __future__ import unicode_literals, print_function, division
import random

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F

use_cuda = torch.cuda.is_available()

Seq2Seq with PyTorch
====

Sequence-to-Sequence (Seq2Seq) learning is a useful class of neural network models that map a sequential input to an output sequence. It has been shown to work well on various tasks, from machine translation to interpreting Python without an interpreter. {{citations-needed}}

This notebook is a hands-on session to write an encoder-decoder Seq2Seq network using PyTorch for [DataScience SG meetup](https://www.meetup.com/DataScience-SG-Singapore/events/246541733/). 


It would be great if you have at least worked through the ["Deep Learning in 60 minutes" PyTorch tutorial](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) before continuing the rest of the notebook.


Acknowledgements
----
The materials are largely based on the 

 - [intermediate PyTorch tutorials by Sean Robertson](http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) and 
 - [Luong et al. tutorial on neural machine translation in ACL16](https://sites.google.com/site/acl16nmt/home).

The dataset used in this exercise is hosted on https://www.kaggle.com/alvations/sg-kopi

Kopi Problems
====

In this hands-on session, we want to **train a neural network to translate Singlish Kopi orders into English**.


**"Singlish" -> English**

```
"Kopi" -> Coffee with condensed milk
"Kopi O" -> Coffee with sugar, no milk
"Kopi dinosaur gau siew dai peng" -> ???
```

(Image Source: http://www.straitstimes.com/lifestyle/food/get-your-kopi-kick)

In [3]:
Image(url="https://static.straitstimes.com.sg/sites/default/files/160522_kopi.jpg", width=700)

Seriously?
----

Yes, we'll be translating Singlish Kopi orders to English using the [sequence-to-sequence network](https://arxiv.org/abs/1409.3215) {{citations-needed}}. 

But first...
---

1. Data Munging
====

Before any machine/deep learning, we have to get some data and "hammer" it until we get it into the shape we want.

> *Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis.*

> (Source: [Gil Press](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#3e4dc0416f63) Forbes article)

**Step 1:** Take the data from somewhere, in this case: http://kaggle.com/alvations/sg-kopi.

**Step 2:** Import your favorite dataframe and text processing library.

**Step 3:** Munge the data till desired.

In [4]:
import pandas as pd
from gensim.corpora.dictionary import Dictionary
from nltk import word_tokenize

# Reads the tab-delimited data using Pandas.
kopitiam = pd.read_csv('coffee-culture-sg.tsv', sep='\t')
kopitiam.head()

Unnamed: 0,Local Terms,Meaning,Source
0,Kopi O,Black Coffee with Sugar,https://daneshd.com/2010/02/28/a-rough-guide-t...
1,Kopi,Black Coffee with Condensed Milk,https://daneshd.com/2010/02/28/a-rough-guide-t...
2,Kopi C,Black Coffee with Evaporated Milk,https://daneshd.com/2010/02/28/a-rough-guide-t...
3,Kopi Kosong,Black Coffee without sugar or milk,https://daneshd.com/2010/02/28/a-rough-guide-t...
4,Kopi Gah Dai,Black Coffee with extra condensed milk,https://daneshd.com/2010/02/28/a-rough-guide-t...


In [5]:
START, START_IDX = '<s>',  0
END, END_IDX = '</s>', 1

# We tokenize each sentence in a dataframe column, lower-casing it first,
# and wrap every token list with the START and END symbols.
# Note: the wrapping is done per row, because adding a plain list to a
#       whole Series is not guaranteed to work across pandas versions.
def tokenize_with_boundaries(text):
    return [START] + word_tokenize(text.lower()) + [END]

singlish_sents = kopitiam['Local Terms'].apply(tokenize_with_boundaries)
english_sents = kopitiam['Meaning'].apply(tokenize_with_boundaries)

# We're getting the data into the shape we want. 
# But for now it's still too human-readable and redundant.
## Cut-away: Computers like it to be simpler, more concise. -_-|||
print('First Singlish sentence:', singlish_sents[0])
print('First English sentence:', english_sents[0])

First Singlish sentence: ['<s>', 'kopi', 'o', '</s>']
First English sentence: ['<s>', 'black', 'coffee', 'with', 'sugar', '</s>']


In [6]:
# Let's convert the individual words into some sort of unique index 
# and use that index to represent the words. 
## Cut-away: A small integer id is cheaper to store and compare than a variable-length UTF-8 string. @_@

english_vocab = Dictionary([['<s>'], ['</s>'], ['UNK']])
english_vocab.add_documents(english_sents)

singlish_vocab = Dictionary([['<s>'], ['</s>'], ['UNK']])
singlish_vocab.add_documents(singlish_sents)

# First ten words in the vocabulary.
print('First 10 Singlish words:\n', sorted(singlish_vocab.items())[:10])
print()
print('First 10 English words:\n', sorted(english_vocab.items())[:10])

First 10 Singlish words:
 [(0, '<s>'), (1, '</s>'), (2, 'UNK'), (3, 'kopi'), (4, 'o'), (5, 'c'), (6, 'kosong'), (7, 'dai'), (8, 'gah'), (9, 'siew')]

First 10 English words:
 [(0, '<s>'), (1, '</s>'), (2, 'UNK'), (3, 'black'), (4, 'coffee'), (5, 'sugar'), (6, 'with'), (7, 'condensed'), (8, 'milk'), (9, 'evaporated')]
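
As a rough picture of what the vocabulary is doing here, the sketch below builds a token-to-index mapping with plain Python dicts. It is not gensim's implementation (gensim may assign ids in a different order); `build_vocab` and the toy sentences are made up for illustration.

```python
def build_vocab(sentences, specials=('<s>', '</s>', 'UNK')):
    """Give every distinct token an integer id, reserving the first ids for special symbols."""
    token2id = {}
    for token in specials:
        token2id[token] = len(token2id)
    for sent in sentences:
        for token in sent:
            # Only unseen tokens get a new id, so each word maps to exactly one index.
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

toy_sents = [['<s>', 'kopi', 'o', '</s>'], ['<s>', 'kopi', 'c', '</s>']]
toy_vocab = build_vocab(toy_sents)
print(toy_vocab)
```

Reserving the special symbols first is what keeps `<s>` at index 0 and `</s>` at index 1, which the `START_IDX` and `END_IDX` constants rely on.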


In [7]:
english_sents[0]

['<s>', 'black', 'coffee', 'with', 'sugar', '</s>']

In [8]:
# Now, convert the sentences into lists of indices.
print('First Singlish sentence:', singlish_vocab.doc2idx(singlish_sents[0]) )
print('First English sentence:', english_vocab.doc2idx(english_sents[0]) )

First Singlish sentence: [0, 3, 4, 1]
First English sentence: [0, 3, 4, 6, 5, 1]


In [9]:
# Let's create a function to convert new sentences into the indexed forms.
def vectorize_sent(sent, vocab):
    return vocab.doc2idx([START] + word_tokenize(sent.lower()) + [END])

new_kopi = "Kopi dinosaur gau siew dai peng"
vectorize_sent(new_kopi, singlish_vocab)

[0, 3, 22, 11, 9, 7, 12, 1]
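
One caveat worth keeping in mind: every word in the example above happens to be in the vocabulary. As far as we know, gensim's `doc2idx` returns `-1` for unseen words by default, and `-1` is not a usable row for an embedding lookup. The sketch below shows the fallback idea with a plain dict; `vectorize_with_unk` and the whitespace tokenizer are stand-ins for illustration, not the notebook's functions.

```python
START, END, UNK = '<s>', '</s>', 'UNK'

def vectorize_with_unk(sent, token2id):
    """Lower-case, split on whitespace, wrap in START/END, and map unseen words to UNK."""
    tokens = [START] + sent.lower().split() + [END]
    unk_idx = token2id[UNK]
    return [token2id.get(tok, unk_idx) for tok in tokens]

toy_vocab = {'<s>': 0, '</s>': 1, 'UNK': 2, 'kopi': 3, 'o': 4}
known = vectorize_with_unk("Kopi O", toy_vocab)
unseen = vectorize_with_unk("Kopi unicorn", toy_vocab)
print(known)   # [0, 3, 4, 1]
print(unseen)  # [0, 3, 2, 1]
```

With gensim itself, passing the `UNK` index as the `unknown_word_index` argument of `doc2idx` should have a similar effect.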

In [10]:
# For the last step of data hammering, we need to clobber 
# the vectorized sentence into PyTorch Variable. 
def variable_from_sent(sent, vocab):
    vsent = vectorize_sent(sent, vocab)
    result = Variable(torch.LongTensor(vsent).view(-1, 1))
    return result.cuda() if use_cuda else result

new_kopi = "Kopi dinosaur gau siew dai peng"
variable_from_sent(new_kopi, singlish_vocab)

Variable containing:
    0
    3
   22
   11
    9
    7
   12
    1
[torch.cuda.LongTensor of size 8x1 (GPU 0)]

In [11]:
# To get the sentence length.
variable_from_sent(new_kopi, singlish_vocab).size()[0] # Includes START and END symbol.

8

In [12]:
# Prepare the whole training corpus.
singlish_tensors = kopitiam['Local Terms'].apply(lambda s: variable_from_sent(s, singlish_vocab))
english_tensors = kopitiam['Meaning'].apply(lambda s: variable_from_sent(s, english_vocab))

sent_pairs = list(zip(singlish_tensors, english_tensors))

2. The Seq2Seq Model
====

A Recurrent Neural Network (RNN) is a network that operates on a sequence and uses its own output as input for subsequent steps.

> *The general idea is to make **two recurrent neural networks transform one sequence into another**. An encoder network condenses an input sequence into a vector and a decoder network unfolds the vector into a new sequence.*



2.1. The Encoder
====

The encoder of a seq2seq network is an RNN that outputs some value for every word from the input sentence. For every input word the encoder outputs a vector and a hidden state, and uses the hidden state for the next input word.
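
To make the recurrence concrete, here is a toy sketch with a single scalar hidden state. It is not a GRU; `toy_rnn_step`, `toy_encode`, and the weights `w_in` and `w_rec` are made up for illustration. The only point is that each step consumes one input plus the previous hidden state and hands a new hidden state to the next step.

```python
import math

def toy_rnn_step(x, h, w_in=0.5, w_rec=0.8):
    """One step of a scalar Elman-style RNN: mix the current input with the old hidden state."""
    h_new = math.tanh(w_in * x + w_rec * h)
    # For a single-layer, one-directional recurrent layer the per-step
    # output and the new hidden state are the same value.
    return h_new, h_new

def toy_encode(inputs):
    hidden = 0.0          # start from a zero hidden state
    outputs = []
    for x in inputs:
        output, hidden = toy_rnn_step(x, hidden)
        outputs.append(output)
    return outputs, hidden

outputs, last_hidden = toy_encode([1.0, -0.5, 2.0])
print(len(outputs), outputs[-1] == last_hidden)
```

The final hidden state is the single vector (here, a single number) that summarizes the whole input sequence.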


<img src="http://pytorch.org/tutorials/_images/encoder-network.png" align='left'>

In [13]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        # Set the no. of nodes for the hidden layer.
        self.hidden_size = hidden_size
        # Initialize the embedding layer with the 
        # - size of input (i.e. no. of words in input vocab)
        # - size of each embedding vector (here the same as hidden_size)
        self.embedding = nn.Embedding(input_size, hidden_size)
        # Initialize the GRU with the 
        # - size of each input vector (the embedding size)
        # - size of the hidden state
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # Feed the input into the embedding layer.
        embedded = self.embedding(input).view(1, 1, -1)
        # Feed the embedded layer with the hidden layer to the GRU.
        # Update the output and hidden layer.
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initialize_hidden_states(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        return result.cuda() if use_cuda else result


2.2. Simple Decoder
====

In the simplest seq2seq decoder we use only the last output of the encoder. This last output is sometimes called the context vector as it encodes context from the entire sequence. This context vector is used as the initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and hidden state. The initial input token is the start-of-string `<s>` token, and the first hidden state is the context vector (the encoder’s last hidden state).
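
The decoding loop described above can be sketched without any neural network at all. In the toy below, `TOY_SCORES` and `toy_decoder_step` are hypothetical stand-ins for a trained decoder, and the hidden state is just passed through; only the control flow (start from `<s>`, pick the best-scoring token, feed it back in, stop at `</s>`) mirrors the real thing.

```python
START_IDX, END_IDX = 0, 1

# Hypothetical scores over a 4-word vocabulary, keyed by the previous token id.
TOY_SCORES = {
    0: [0.0, 0.1, 0.2, 0.9],   # after <s>, token 3 scores highest
    3: [0.0, 0.2, 0.7, 0.1],   # after token 3, token 2 scores highest
    2: [0.0, 0.8, 0.1, 0.1],   # after token 2, </s> scores highest
}

def toy_decoder_step(prev_token, hidden):
    """Stand-in for decoder(decoder_input, decoder_hidden)."""
    return TOY_SCORES[prev_token], hidden

def greedy_decode(context, max_length=10):
    token, hidden = START_IDX, context   # first input is <s>, first hidden is the context
    decoded = []
    for _ in range(max_length):
        scores, hidden = toy_decoder_step(token, hidden)
        token = max(range(len(scores)), key=scores.__getitem__)  # greedy choice
        decoded.append(token)
        if token == END_IDX:
            break
    return decoded

result = greedy_decode(context=0.0)
print(result)  # [3, 2, 1]
```

The `max_length` bound matters at translation time, when there is no target sentence to tell the loop when to stop.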


<img src="http://pytorch.org/tutorials/_images/decoder-network.png" align='left'>


In [14]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        # Set the no. of nodes for the hidden layer.
        self.hidden_size = hidden_size
        # Initialize the embedding layer with the 
        # - size of output (i.e. no. of words in output vocab)
        # - size of each embedding vector (here the same as hidden_size)
        self.embedding = nn.Embedding(output_size, hidden_size)
        # Initialize the GRU with the 
        # - size of each input vector (the embedding size)
        # - size of the hidden state
        self.gru = nn.GRU(hidden_size, hidden_size)
        # Set the output layer to output a specific symbol 
        # from the output vocabulary
        self.softmax = nn.LogSoftmax(dim=1)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden):
        # Feed the input into the embedding layer.
        output = self.embedding(input).view(1, 1, -1)
        # Transform the embedded output with a relu function. 
        output = F.relu(output)
        # Feed the embedded layer with the hidden layer to the GRU.
        # Update the output and hidden layer.
        output, hidden = self.gru(output, hidden)
        # Take the updated output and find the most appropriate
        # output symbol. 
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initialize_hidden_states(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        return result.cuda() if use_cuda else result
        

2.3. Training the Model
====

To train, we run the input sentence through the encoder, and keep track of every output and the latest hidden state. Then the decoder is given the `<s>` token as its first input, and the last hidden state of the encoder as its first hidden state.

2.3.1 Set the Hyperparameters and Prepare Data (again...)
----

In [15]:
hidden_size = 10
learning_rate=0.01
batch_size = 2
epochs = 30 # With batch_size=2 and 30 batches, we only look at 2 * 30 = 60 data points.
criterion = nn.NLLLoss()
MAX_LENGTH=20

# Initialize the network for encoder and decoder.
input_vocab, output_vocab = singlish_vocab, english_vocab
encoder = EncoderRNN(len(input_vocab), hidden_size)
decoder = DecoderRNN(hidden_size, len(output_vocab))
if use_cuda:
    encoder = encoder.cuda()
    decoder = decoder.cuda()

# Initialize the optimizer for encoder and decoder.
encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)

# If batch_size == 1, choose one data point per batch:
##training_data = [[random.choice(sent_pairs)] for i in range(epochs)]
# If batch_size > 1, use random.sample() instead of random.choice:
training_data = [random.sample(sent_pairs, batch_size) for i in range(epochs)]
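
To see what the sampling line produces, here is a small stand-alone sketch. The pair names and the fixed seed are made up; the number of sentence pairs visited is simply the batch size times the number of batches.

```python
import random

random.seed(0)  # only to make this sketch repeatable
toy_pairs = ['pair_a', 'pair_b', 'pair_c', 'pair_d']
batch_size, n_batches = 2, 30

# random.sample draws without replacement, so a single batch never repeats a pair,
# although the same pair can show up again in a later batch.
batches = [random.sample(toy_pairs, batch_size) for _ in range(n_batches)]
pairs_seen = sum(len(batch) for batch in batches)
print(pairs_seen)  # 60
```

`random.choice`, by contrast, returns a single item, which is why the `batch_size == 1` variant wraps it in a list.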

2.3.2. Loop through the batches
---

In [16]:
#############################################
# 2.3.2. Loop through the batches.
#############################################
# Start the training.
for data_batch in training_data:
    # (Re-)Initialize the optimizers, clear all gradients after every batch.
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    # Reset the loss for every batch.
    loss = 0
    for input_variable, target_variable in data_batch:
        # Initialize the hidden_states for the encoder.
        encoder_hidden = encoder.initialize_hidden_states()
        # Initialize the length of the PyTorch variables.
        input_length = input_variable.size()[0]
        target_length = target_variable.size()[0]
        encoder_outputs = Variable(torch.zeros(MAX_LENGTH, encoder.hidden_size))
        encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

2.3.3. Iterating through each word in the encoder.
----

In [17]:
#############################################
# 2.3.2. Loop through the batches.
#############################################
# Start the training.
for data_batch in training_data:
    # (Re-)Initialize the optimizers, clear all gradients after every batch.
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    # Reset the loss for every batch.
    loss = 0
    for input_variable, target_variable in data_batch:
        # Initialize the hidden_states for the encoder.
        encoder_hidden = encoder.initialize_hidden_states()
        # Initialize the length of the PyTorch variables.
        input_length = input_variable.size()[0]
        target_length = target_variable.size()[0]
        encoder_outputs = Variable(torch.zeros(MAX_LENGTH, encoder.hidden_size))
        encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs
        #############################################
        # 2.3.3.  Iterating through each word in the encoder.
        #############################################
        # Iterating through each word in the input.
        for ei in range(input_length):
            # We move forward through each state.
            encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
            # And we save the encoder outputs. 
            # Note: We're retrieving [0][0] cos remember the weird .view(1,1,-1) for the GRU.
            encoder_outputs[ei] = encoder_output[0][0] 


2.3.3.1. Outputs of the Encoder
----

In [18]:
# Cut-away: The encoded output for the last sentence in our training_data

# The encoder has 68 unique words
print(encoder, '\n')
print(singlish_vocab)
print('\n########\n')

# The last input sentence, in PyTorch Tensor data structure.
print(data_batch[-1][0]) 
print('########\n')

# The last input sentence as list(int)
print(list(map(int, data_batch[-1][0])), '\n')
print('########\n')

# The last input sentence as words
print(' '.join([singlish_vocab[i] for i in map(int, data_batch[-1][0])]))
print('\n########\n')

# The encoded outputs of the last sentence 
# Note: We have a matrix of 20 (MAX_LENGTH) x 10 (hidden_size) and 
#       for this particular sentence, we only have 4 encoded outputs
print(encoder_outputs)

print('\n########\n')

# The last hidden state of the last input sentence. 
# Note: In this plain encoder-decoder setup, the last hidden state of the
#       encoder becomes the initial hidden state of the decoder.
print(encoder_hidden)

EncoderRNN(
  (embedding): Embedding(68, 10)
  (gru): GRU(10, 10)
) 

Dictionary(68 unique tokens: ['tut', 'jia', 'peng', 'michael', 'hao']...)

########

Variable containing:
  0
  3
 36
  1
[torch.cuda.LongTensor of size 4x1 (GPU 0)]

########

[0, 3, 36, 1] 

########

<s> kopi poh </s>

########

Variable containing:
-0.0972  0.2290  0.1250  0.4694  0.1017 -0.2808 -0.2127  0.2121 -0.3759  0.0586
 0.0574 -0.0617 -0.0349  0.0509 -0.0723 -0.3486 -0.1432  0.2439 -0.6072 -0.4710
 0.5366 -0.1668 -0.1907 -0.3316 -0.3769  0.1730 -0.3404  0.4280 -0.6774 -0.4104
 0.2478  0.0491  0.2956  0.0742 -0.0361  0.0736 -0.4051  0.2031 -0.4853 -0.3444
 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.00

2.3.4. Iterating through each word in the decoder.
----

In [19]:
#############################################
# 2.3.2. Loop through the batches.
#############################################
# Start the training.
for data_batch in training_data:
    # (Re-)Initialize the optimizers, clear all gradients after every batch.
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    # Reset the loss for every batch.
    loss = 0
    for input_variable, target_variable in data_batch:
        # Initialize the hidden_states for the encoder.
        encoder_hidden = encoder.initialize_hidden_states()
        # Initialize the length of the PyTorch variables.
        input_length = input_variable.size()[0]
        target_length = target_variable.size()[0]
        encoder_outputs = Variable(torch.zeros(MAX_LENGTH, encoder.hidden_size))
        encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs
        
        #############################################
        # 2.3.3.  Iterating through each word in the encoder.
        #############################################
        # Iterating through each word in the input.
        for ei in range(input_length):
            # We move forward through each state.
            encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
            # And we save the encoder outputs. 
            # Note: We're retrieving [0][0] cos remember the weird .view(1,1,-1) -_-|||
            encoder_outputs[ei] = encoder_output[0][0] 
            
        #############################################
        # 2.3.4.  Iterating through each word in the decoder.
        #############################################
        # Note: this block sits outside the encoder loop, so decoding starts
        #       only after the whole input sentence has been encoded.
        # Initialize the variable input with the index of the START.
        decoder_input = Variable(torch.LongTensor([[START_IDX]]))
        decoder_input = decoder_input.cuda() if use_cuda else decoder_input
        # As the first state of the decoder, we take the last step of the encoder.
        decoder_hidden = encoder_hidden
        # Iterate through each state in the decoder.
        # Note: during training we know the target length,
        #       so we can use it to bound the decoding loop.
        for di in range(target_length):
            # We move forward through each state.
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
            # For what this syntax does, refer to 2.3.4.1.
            topv, topi = decoder_output.data.topk(1)
            ni = topi[0][0]

            # Replace our decoder input for the next state with the
            # index of the decoder's top guess.
            decoder_input = Variable(torch.LongTensor([[ni]]))
            decoder_input = decoder_input.cuda() if use_cuda else decoder_input

            # Update our loss for this batch.
            loss += criterion(decoder_output, target_variable[di])

            # If we see the </s> symbol, stop decoding this sentence.
            if ni == END_IDX:
                break


2.3.4.1 Outputs of the Decoder
----

In [20]:
# Cut-away: The decoded output for the last sentence in our training_data

# The decoder has 117 unique words
print(decoder, '\n')
print(english_vocab)
print('\n########\n')

# The last input sentence.
print(' '.join([singlish_vocab[i] for i in map(int, data_batch[-1][0])]))
# The last target sentence.
print(' '.join([english_vocab[i] for i in map(int, data_batch[-1][1])]))

print('\n########\n')

DecoderRNN(
  (embedding): Embedding(117, 10)
  (gru): GRU(10, 10)
  (softmax): LogSoftmax()
  (out): Linear(in_features=10, out_features=117)
) 

Dictionary(117 unique tokens: ['stronger', 'syrup', 'horlicks', 'brew', 'ginger']...)

########

<s> kopi poh </s>
<s> thinner coffee more water is added to dilute the beverage . </s>

########



In [21]:
# The last word in the last sentence. 
print([english_vocab[i] for i in map(int, data_batch[-1][1])][-1])

print('\n########')

# The log-probabilities (LogSoftmax output) over the whole English vocabulary
# at the last decoding step, i.e. one score per candidate word.
print(decoder_output.data)

print('\n########')

# The highest log-probability and the index of the word that has it.
print(decoder_output.data.topk(1))
print('\n########')

# Take a look at the decoder's guess for the final word in the last sentence.
topv, topi = decoder_output.data.topk(1)
print(topv) # The log-probability of the decoder's guess.
print(topi) # The index of the word in the english_vocab.
print(english_vocab[int(topi)]) # Decoder's guess of the final word.


</s>

########


Columns 0 to 9 
-4.9207 -5.1512 -5.2086 -4.9887 -4.7247 -4.9769 -5.0857 -4.3333 -4.1256 -4.3008

Columns 10 to 19 
-4.4075 -4.9221 -5.0280 -4.8629 -5.2383 -5.0389 -4.5376 -4.4449 -5.0103 -4.7008

Columns 20 to 29 
-4.9679 -5.2594 -4.8193 -4.6872 -4.4805 -4.4097 -4.7942 -4.9296 -4.7486 -4.8726

Columns 30 to 39 
-4.4286 -4.8542 -5.1222 -4.6833 -5.0283 -4.7359 -4.5719 -4.8810 -4.9007 -4.5465

Columns 40 to 49 
-5.0500 -5.1666 -4.8831 -4.6379 -4.4574 -4.7411 -4.8628 -5.0775 -4.7726 -4.8609

Columns 50 to 59 
-4.9800 -4.9194 -5.0201 -4.7333 -4.4561 -4.6480 -4.8651 -4.6748 -4.8653 -4.7025

Columns 60 to 69 
-4.5608 -4.5032 -5.1159 -5.1541 -4.7839 -4.8625 -4.5954 -4.8888 -4.5934 -4.8091

Columns 70 to 79 
-4.7013 -4.6661 -4.7685 -4.6622 -4.7927 -4.9732 -5.2363 -4.3163 -5.2381 -4.8469

Columns 80 to 89 
-4.5284 -4.3267 -5.2866 -4.8497 -4.7198 -5.0783 -4.4740 -4.3546 -4.6759 -4.8425

Columns 90 to 99 
-4.2371 -5.0061 -4.6096 -5.2386 -4.5358 -4.4367 -4.7863 -4.9175 -4.4395 -4.8
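
The numbers printed above are log-probabilities, which is why they are all negative. The sketch below reproduces the two ingredients in plain Python: a log-softmax over raw scores, and the negative log-likelihood that `NLLLoss` charges for the correct word. `log_softmax` and `nll_loss` here are hand-written illustrations, not the PyTorch functions, and the three scores are invented.

```python
import math

def log_softmax(scores):
    """Turn raw scores into log-probabilities (shifted by the max for numerical stability)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def nll_loss(log_probs, target_idx):
    """Negative log-likelihood of the correct word: small when the model is confident and right."""
    return -log_probs[target_idx]

log_probs = log_softmax([2.0, 1.0, 0.1])
guess = max(range(len(log_probs)), key=log_probs.__getitem__)  # what topk(1) picks
loss = nll_loss(log_probs, target_idx=0)

# A decoder that knows nothing spreads probability evenly over all 117 words.
uniform_log_prob = math.log(1 / 117)
print(guess, round(loss, 4), round(uniform_log_prob, 2))
```

The uniform value comes out at roughly -4.76, which is close to the range of the scores printed above and suggests the decoder has barely learned anything at this point.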

2.3.5 Backpropagate the Loss and Let the Optimizers Take a Step.
----

In [22]:
#############################################
# 2.3.2. Loop through the batches.
#############################################
# Start the training.
for data_batch in training_data:
    # (Re-)Initialize the optimizers, clear all gradients after every batch.
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    # Reset the loss for every batch.
    loss = 0
    for input_variable, target_variable in data_batch:
        # Initialize the hidden_states for the encoder.
        encoder_hidden = encoder.initialize_hidden_states()
        # Initialize the length of the PyTorch variables.
        input_length = input_variable.size()[0]
        target_length = target_variable.size()[0]
        encoder_outputs = Variable(torch.zeros(MAX_LENGTH, encoder.hidden_size))
        encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs
        
        #############################################
        # 2.3.3.  Iterating through each word in the encoder.
        #############################################
        # Iterating through each word in the input.
        for ei in range(input_length):
            # We move forward through each state.
            encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
            # And we save the encoder outputs. 
            # Note: We're retrieving [0][0] cos remember the weird .view(1,1,-1) -_-|||
            encoder_outputs[ei] = encoder_output[0][0] 
            
        #############################################
        # 2.3.4.  Iterating through each word in the decoder.
        #############################################
        # Note: this block sits outside the encoder loop, so decoding starts
        #       only after the whole input sentence has been encoded.
        # Initialize the variable input with the index of the START.
        decoder_input = Variable(torch.LongTensor([[START_IDX]]))
        decoder_input = decoder_input.cuda() if use_cuda else decoder_input
        # As the first state of the decoder, we take the last step of the encoder.
        decoder_hidden = encoder_hidden
        # Iterate through each state in the decoder.
        # Note: during training we know the target length,
        #       so we can use it to bound the decoding loop.
        for di in range(target_length):
            # We move forward through each state.
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
            # For what this syntax does, refer to 2.3.4.1.
            topv, topi = decoder_output.data.topk(1)
            ni = topi[0][0]

            # Replace our decoder input for the next state with the
            # index of the decoder's top guess.
            decoder_input = Variable(torch.LongTensor([[ni]]))
            decoder_input = decoder_input.cuda() if use_cuda else decoder_input

            # Update our loss for this batch.
            loss += criterion(decoder_output, target_variable[di])

            # If we see the </s> symbol, stop decoding this sentence.
            if ni == END_IDX:
                break
    #####################################################
    # 2.3.5 Backpropagate the loss and let the optimizers take a step.
    #####################################################
    loss.backward() # Backpropagate.
    encoder_optimizer.step()
    decoder_optimizer.step()
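
What `loss.backward()` followed by `optimizer.step()` amounts to can be shown on a one-parameter problem. The sketch below is plain gradient descent on the made-up loss `(w - 3)^2`, with the gradient written out by hand instead of computed by autograd; `sgd_step` and `grad_fn` are illustrative names.

```python
def sgd_step(params, grads, lr=0.01):
    """The SGD update rule: move each parameter a small step against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

def loss_fn(w):
    return (w[0] - 3.0) ** 2

def grad_fn(w):
    # Hand-derived gradient of loss_fn; in PyTorch, backward() fills this in for us.
    return [2.0 * (w[0] - 3.0)]

w = [0.0]
initial_loss = loss_fn(w)
for _ in range(200):
    w = sgd_step(w, grad_fn(w), lr=0.1)
final_loss = loss_fn(w)
print(w[0], final_loss)
```

Because the gradient is recomputed from scratch at every iteration here, nothing needs clearing. In PyTorch, gradients accumulate across `backward()` calls, which is why the loop above calls `zero_grad()` at the start of each batch.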




2.4. Formalize the training per epoch as a function
====

Largely, the following code is from http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html



In [53]:
def train_one_epoch(input_variable, target_variable, encoder, decoder, 
                    encoder_optimizer, decoder_optimizer, criterion):
    """
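    Run one training step on a single sentence pair and return the loss
    averaged over the target length. Despite the name, this is one update,
    not a pass over the whole dataset.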
    """
    encoder_hidden = encoder.initialize_hidden_states()

    # (Re-)Initialize the optimizers, clear all gradients. 
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    
    # Initialize the length of the PyTorch variables.
    input_length = input_variable.size()[0]
    target_length = target_variable.size()[0]
    encoder_outputs = Variable(torch.zeros(MAX_LENGTH, encoder.hidden_size))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

    loss = 0
    
    # Iterating through each word in the input.
    for ei in range(input_length):
        # We move forward through each state.
        encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
        # And we save the encoder outputs. 
        encoder_outputs[ei] = encoder_output[0][0]

    # Initialize the variable input with the index of the START.
    decoder_input = Variable(torch.LongTensor([[START_IDX]]))
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input
    
    # As the first state of the decoder, we take the last step of the encoder.
    decoder_hidden = encoder_hidden
    
    # Without teacher forcing: use its own predictions as the next input
    for di in range(target_length):
        decoder_output, decoder_hidden = decoder(
            decoder_input, decoder_hidden)
        topv, topi = decoder_output.data.topk(1)
        ni = topi[0][0]

        decoder_input = Variable(torch.LongTensor([[ni]]))
        decoder_input = decoder_input.cuda() if use_cuda else decoder_input

        loss += criterion(decoder_output, target_variable[di])
        if ni == END_IDX:
            break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.data[0] / target_length
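
The function above always feeds the decoder its own guess as the next input. The alternative named in the comment, teacher forcing, feeds the correct target token instead, and a common compromise is to choose between the two at random. The sketch below shows only that choice; `next_decoder_input` and `teacher_forcing_ratio` are hypothetical names and are not used anywhere in this notebook.

```python
import random

def next_decoder_input(predicted_idx, target_idx, teacher_forcing_ratio=0.5, rng=random):
    """Pick the next decoder input: the gold token (teacher forcing) or the model's own guess."""
    use_teacher = rng.random() < teacher_forcing_ratio
    return target_idx if use_teacher else predicted_idx

# Ratio 0.0 reproduces the behaviour of the function above; ratio 1.0 always uses the gold token.
own_guess = next_decoder_input(predicted_idx=7, target_idx=4, teacher_forcing_ratio=0.0)
gold_token = next_decoder_input(predicted_idx=7, target_idx=4, teacher_forcing_ratio=1.0)
print(own_guess, gold_token)  # 7 4
```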

2.5. Some Logging and Plotting Candies to Monitor Training
====

In [52]:
import time
import math

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))


def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)
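
The `timeSince` helper estimates the remaining time by assuming progress is linear: if a fraction of the work took so many seconds, the whole job should take that many seconds divided by the fraction. A stand-alone version of the arithmetic, with `as_minutes` and `remaining_seconds` as illustrative re-implementations:

```python
import math

def as_minutes(seconds):
    """Format a duration in seconds as 'Xm Ys'."""
    minutes = math.floor(seconds / 60)
    return '%dm %ds' % (minutes, seconds - minutes * 60)

def remaining_seconds(elapsed, fraction_done):
    """Linear extrapolation: estimated total time minus the time already spent."""
    estimated_total = elapsed / fraction_done
    return estimated_total - elapsed

elapsed = 120                       # two minutes in
eta = remaining_seconds(elapsed, 0.25)
print(as_minutes(elapsed), as_minutes(eta))  # 2m 0s 6m 0s
```

The estimate is only as good as the linearity assumption, so it drifts whenever the time per iteration changes during training.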

2.6. Formalize the full training process as a function
====

In [54]:
def train(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    
    training_pairs = [random.choice(sent_pairs) for i in range(n_iters)]
    
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_variable = training_pair[0]
        target_variable = training_pair[1]

        loss = train_one_epoch(input_variable, target_variable, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

Let's Train
====

In [55]:
MAX_LENGTH = 20
batches = 100000 # In this case, the PyTorch code is using batch_size=1
hidden_size = 100

singlish_vocab
my_encoder = EncoderRNN(len(singlish_vocab), hidden_size)
my_decoder = DecoderRNN(hidden_size, len(english_vocab))

if use_cuda:
    my_encoder = my_encoder.cuda()
    my_decoder = my_decoder.cuda()

train(my_encoder, my_decoder, batches, print_every=100)

0m 2s (- 41m 15s) (100 0%) 2.5812
0m 4s (- 33m 58s) (200 0%) 2.0632
0m 5s (- 31m 7s) (300 0%) 2.0789
0m 7s (- 29m 37s) (400 0%) 2.0274
0m 8s (- 28m 53s) (500 0%) 2.0474
0m 10s (- 28m 14s) (600 0%) 1.9938
0m 11s (- 28m 3s) (700 0%) 2.0145
0m 13s (- 27m 34s) (800 0%) 1.9128
0m 15s (- 27m 40s) (900 0%) 1.9695
0m 17s (- 29m 21s) (1000 1%) 1.9863
0m 19s (- 29m 21s) (1100 1%) 1.7917
0m 21s (- 28m 56s) (1200 1%) 1.7981
0m 22s (- 28m 37s) (1300 1%) 1.7196
0m 24s (- 28m 18s) (1400 1%) 1.5752
0m 25s (- 28m 5s) (1500 1%) 1.6152
0m 27s (- 27m 58s) (1600 1%) 1.5605
0m 29s (- 28m 18s) (1700 1%) 1.5186
0m 31s (- 28m 39s) (1800 1%) 1.5110
0m 34s (- 29m 27s) (1900 1%) 1.5140
0m 36s (- 29m 39s) (2000 2%) 1.4677
0m 38s (- 29m 38s) (2100 2%) 1.4021
0m 39s (- 29m 35s) (2200 2%) 1.4298
0m 42s (- 29m 57s) (2300 2%) 1.5225
0m 46s (- 31m 16s) (2400 2%) 1.3753
0m 49s (- 32m 17s) (2500 2%) 1.3200
0m 53s (- 33m 18s) (2600 2%) 1.4220
0m 56s (- 34m 9s) (2700 2%) 1.2380
1m 0s (- 35m 7s) (2800 2%) 1.3245
1m 4s (- 35m

10m 58s (- 37m 59s) (22400 22%) 0.4403
11m 0s (- 37m 55s) (22500 22%) 0.4507
11m 3s (- 37m 52s) (22600 22%) 0.4173
11m 6s (- 37m 50s) (22700 22%) 0.3902
11m 9s (- 37m 47s) (22800 22%) 0.4085
11m 12s (- 37m 44s) (22900 22%) 0.3962
11m 15s (- 37m 41s) (23000 23%) 0.4142
11m 18s (- 37m 38s) (23100 23%) 0.2663
11m 21s (- 37m 36s) (23200 23%) 0.3845
11m 24s (- 37m 31s) (23300 23%) 0.3722
11m 26s (- 37m 25s) (23400 23%) 0.3629
11m 27s (- 37m 19s) (23500 23%) 0.3858
11m 29s (- 37m 12s) (23600 23%) 0.5159
11m 31s (- 37m 6s) (23700 23%) 0.2931
11m 35s (- 37m 6s) (23800 23%) 0.4519
11m 39s (- 37m 6s) (23900 23%) 0.3851
11m 43s (- 37m 7s) (24000 24%) 0.4735
11m 47s (- 37m 9s) (24100 24%) 0.4676
11m 51s (- 37m 9s) (24200 24%) 0.3817
11m 55s (- 37m 9s) (24300 24%) 0.4600
12m 0s (- 37m 11s) (24400 24%) 0.4147
12m 4s (- 37m 11s) (24500 24%) 0.4409
12m 8s (- 37m 11s) (24600 24%) 0.3900
12m 12s (- 37m 11s) (24700 24%) 0.4254
12m 16s (- 37m 13s) (24800 24%) 0.4305
12m 20s (- 37m 13s) (24900 24%) 0.3512


KeyboardInterrupt: 

In [57]:
import pickle 
# Before moving on, SAVE THE MODELS!!!
with open('encoder_vanilla_100_100K.pkl', 'wb') as fout:
    pickle.dump(my_encoder, fout)
    
with open('decoder_vanilla_100_100K.pkl', 'wb') as fout:
    pickle.dump(my_decoder, fout)
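
To pick the models back up in a later session, the same files can be read with `pickle.load`. The sketch below round-trips a stand-in dictionary instead of the real encoder so that it runs without PyTorch; the name `toy_model.pkl` and the dictionary contents are hypothetical. Unpickling the actual `EncoderRNN` and `DecoderRNN` objects additionally needs their class definitions to be in scope before loading.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model object.
toy_model = {'hidden_size': 100, 'weights': [0.1, -0.2, 0.3]}

# Work inside a temporary directory that is removed when the block exits.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')

    # Save, exactly as in the cell above.
    with open(path, 'wb') as fout:
        pickle.dump(toy_model, fout)

    # Later (for example in a fresh session): read the object back.
    with open(path, 'rb') as fin:
        restored_model = pickle.load(fin)

print(restored_model == toy_model)
```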

2.6. Getting the Model to Translate
====

In [None]:
def decode(encoder, decoder, input_variable, max_length=MAX_LENGTH):
    # The length of the input.
    input_length = input_variable.size()[0]
    # For each sentence, initialize the hidden states with zeros.
    encoder_hidden = encoder.initialize_hidden_states()
    # Initialize the encoder outputs. 
    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs
    # Iterate through the input words.
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]
    # Initialize the decoder with the start symbol <s>.
    decoder_input = Variable(torch.LongTensor([[START_IDX]])) 
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input
    # Use the last encoder hidden state as the first decoder's hidden state.
    decoder_hidden = encoder_hidden
    # Keep a list of the decoded words.
    decoded_words = []
    
    # Iterate through the decoder states.
    for di in range(max_length):
        # Very similar to how the training works.
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        topv, topi = decoder_output.data.topk(1)
        ni = topi[0][0]
        # Record the predicted index, and stop once the end symbol </s> is produced.
        decoded_words.append(ni)
        if ni == END_IDX:
            break
        # Replace the new decoder input for the next state 
        # with the top guess of this state.
        decoder_input = Variable(torch.LongTensor([[ni]]))
        decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    return decoded_words
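
The loop in `decode` is a greedy search: at every step it keeps only the single best guess and feeds it back in as the next input. The toy sketch below shows the same control flow without a network. The score table `TOY_SCORES` and the function `greedy_decode` are hypothetical, and unlike the real decoder this toy looks only at the previous word (it has no hidden state).

```python
START_TOKEN, END_TOKEN = '<s>', '</s>'

# Hypothetical next-word scores, standing in for the decoder's output layer.
TOY_SCORES = {
    '<s>': {'hot': 0.7, 'iced': 0.3},
    'hot': {'tea': 0.9, '</s>': 0.1},
    'tea': {'</s>': 0.8, 'hot': 0.2},
}

def greedy_decode(scores, max_length=20):
    """Pick the highest-scoring next word until </s> appears or max_length is hit."""
    current = START_TOKEN
    decoded = []
    for _ in range(max_length):
        candidates = scores[current]
        # Counterpart of topk(1): keep only the single best candidate.
        best = max(candidates, key=candidates.get)
        decoded.append(best)
        if best == END_TOKEN:
            break
        # The best guess becomes the input to the next step.
        current = best
    return decoded

print(greedy_decode(TOY_SCORES))
```

Because only one candidate survives each step, an early wrong guess cannot be undone later; `max_length` is the only safeguard against a sequence that never reaches the end symbol.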

In [None]:
sent = 'teh o'
variable_from_sent(sent, singlish_vocab)

In [None]:
output_words = decode(my_encoder, my_decoder, variable_from_sent(sent, singlish_vocab))
output_words

In [None]:
[english_vocab[i] for i in output_words[1:output_words.index(1)]]

In [None]:
def translate(kopi_order):
    output_words = decode(my_encoder, my_decoder, variable_from_sent(kopi_order, singlish_vocab))
    output_sentence = [english_vocab[i] for i in output_words[1:output_words.index(1)]]
    return ' '.join(output_sentence)
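
One caveat with `output_words.index(1)` in `translate`: `list.index` raises a `ValueError` when the value is missing, which happens if decoding stops at `max_length` before the end symbol is produced. A more forgiving conversion could look like the sketch below. The helper `indices_to_sentence` and `toy_vocab` are hypothetical, and the defaults assume that index 0 is `<s>` and index 1 is `</s>`, which may not match the real vocabulary.

```python
def indices_to_sentence(indices, vocab, start_idx=0, end_idx=1):
    """Join decoded indices into a sentence, skipping <s> and stopping at </s>."""
    words = []
    for idx in indices:
        if idx == end_idx:
            # Everything from the end symbol onwards is dropped.
            break
        if idx == start_idx:
            continue
        words.append(vocab[idx])
    return ' '.join(words)

# Hypothetical vocabulary mapping indices to words.
toy_vocab = {0: '<s>', 1: '</s>', 2: 'hot', 3: 'tea'}

print(indices_to_sentence([0, 2, 3, 1], toy_vocab))  # end symbol present
print(indices_to_sentence([0, 2, 3], toy_vocab))     # end symbol missing
```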

In [39]:
translate('teh o')

Training the Model (with teacher forcing)
====

To train, we run the input sentence through the encoder and keep track of every output and the latest hidden state. The decoder is then given the start symbol `<s>` as its first input and the encoder's last hidden state as its first hidden state.

“Teacher forcing” means feeding the real target outputs as each next input, instead of the decoder’s own guess. Teacher forcing makes training converge faster, but the trained network may be unstable when it later has to run on its own predictions.

You can observe outputs of teacher-forced networks that read with coherent grammar but wander far from the correct translation. Intuitively, the network has learned to represent the output grammar and can “pick up” the meaning once the teacher tells it the first few words, but it has not properly learned how to create the sentence from the input in the first place.

Because of the freedom PyTorch’s autograd gives us, we can randomly choose whether to use teacher forcing with a simple `if` statement. Turn `teacher_forcing_ratio` up to use more of it.
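
To make the difference concrete, the toy sketch below lists which words the decoder would be fed for a single training example in each mode. Nothing here touches PyTorch: `decoder_inputs` and `toy_predict` are hypothetical names, and the "model" is a stub that always guesses the same word.

```python
import random

def decoder_inputs(target, predict_next, teacher_forcing_ratio, rng):
    """Return the mode chosen and the inputs fed to the decoder for one example."""
    # One coin flip per example decides the mode for the whole sentence.
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    inputs = ['<s>']
    for gold in target[:-1]:
        if use_teacher_forcing:
            # Teacher forcing: the next input is the gold target word.
            inputs.append(gold)
        else:
            # Free running: the next input is the model's own guess.
            inputs.append(predict_next(inputs[-1]))
    return use_teacher_forcing, inputs

def toy_predict(previous_word):
    # Hypothetical "model" that always guesses 'tea', whatever it is given.
    return 'tea'

target = ['iced', 'milo', '</s>']
rng = random.Random(0)

forced, forced_inputs = decoder_inputs(target, toy_predict, 1.0, rng)
free, free_inputs = decoder_inputs(target, toy_predict, 0.0, rng)
print(forced, forced_inputs)
print(free, free_inputs)
```

With a ratio of 1.0 the decoder always sees the gold words, so one bad guess never contaminates the following steps; with 0.0 it has to live with its own mistakes, which is the situation it faces at translation time.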



In [58]:
teacher_forcing_ratio = 0.5
MAX_LENGTH = 20

def train_one_epoch(input_variable, target_variable, encoder, decoder, 
                    encoder_optimizer, decoder_optimizer, criterion):
    """
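    Run one training step on a single (input, target) sentence pair:
    encode the input, decode with or without teacher forcing, backpropagate,
    update both optimizers, and return the average loss per target word.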
    """
    encoder_hidden = encoder.initialize_hidden_states()

    # (Re-)Initialize the optimizers, clear all gradients. 
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    
    # Initialize the length of the PyTorch variables.
    input_length = input_variable.size()[0]
    target_length = target_variable.size()[0]
    encoder_outputs = Variable(torch.zeros(MAX_LENGTH, encoder.hidden_size))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

    loss = 0
    
    # Iterating through each word in the input.
    for ei in range(input_length):
        # We move forward through each state.
        encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
        # And we save the encoder outputs. 
        encoder_outputs[ei] = encoder_output[0][0]


    decoder_input = Variable(torch.LongTensor([[START_IDX]]))
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    decoder_hidden = encoder_hidden

    # Flip a biased coin once per example to decide whether to use teacher forcing.
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input.
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
            loss += criterion(decoder_output, target_variable[di])
            decoder_input = target_variable[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden)
            topv, topi = decoder_output.data.topk(1)
            ni = topi[0][0]

            decoder_input = Variable(torch.LongTensor([[ni]]))
            decoder_input = decoder_input.cuda() if use_cuda else decoder_input

            loss += criterion(decoder_output, target_variable[di])
            if ni == END_IDX:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.data[0] / target_length

In [None]:
MAX_LENGTH = 20
batches = 100000 # In this case, the PyTorch code is using batch_size=1
hidden_size = 100

singlish_vocab
my_encoder = EncoderRNN(len(singlish_vocab), hidden_size)
my_decoder = DecoderRNN(hidden_size, len(english_vocab))

if use_cuda:
    my_encoder = my_encoder.cuda()
    my_decoder = my_decoder.cuda()

train(my_encoder, my_decoder, batches, print_every=100)

0m 4s (- 71m 9s) (100 0%) 3.0145
0m 8s (- 69m 53s) (200 0%) 2.5526
0m 12s (- 69m 54s) (300 0%) 2.3090
0m 16s (- 69m 37s) (400 0%) 2.3130
0m 21s (- 69m 57s) (500 0%) 2.4043
0m 25s (- 69m 45s) (600 0%) 2.0384
0m 29s (- 70m 6s) (700 0%) 2.0895
0m 34s (- 70m 38s) (800 0%) 2.0777
0m 38s (- 70m 28s) (900 0%) 2.0000
0m 42s (- 70m 31s) (1000 1%) 2.0120
0m 47s (- 70m 26s) (1100 1%) 1.9481
0m 51s (- 70m 28s) (1200 1%) 1.7242
0m 55s (- 69m 57s) (1300 1%) 1.6484
0m 59s (- 69m 52s) (1400 1%) 1.7458
1m 3s (- 69m 30s) (1500 1%) 1.8199
1m 7s (- 69m 33s) (1600 1%) 1.5612
1m 12s (- 69m 50s) (1700 1%) 1.6924
1m 16s (- 69m 50s) (1800 1%) 1.5385
1m 20s (- 69m 41s) (1900 1%) 1.4304
1m 25s (- 69m 53s) (2000 2%) 1.4741
1m 30s (- 70m 6s) (2100 2%) 1.4806
1m 34s (- 70m 9s) (2200 2%) 1.3068
1m 39s (- 70m 9s) (2300 2%) 1.3519
1m 43s (- 70m 2s) (2400 2%) 1.2785
1m 47s (- 69m 55s) (2500 2%) 1.5085
1m 51s (- 69m 43s) (2600 2%) 1.1869
1m 55s (- 69m 34s) (2700 2%) 1.3052
2m 0s (- 69m 30s) (2800 2%) 1.1777
2m 4s (- 69m

In [None]:
# Before moving on, SAVE THE MODELS!!!
with open('encoder_vanilla_100_100K_teachforce.pkl', 'wb') as fout:
    pickle.dump(my_encoder, fout)
    
with open('decoder_vanilla_100_100K_teachforce.pkl', 'wb') as fout:
    pickle.dump(my_decoder, fout)

In [43]:
translate('teh c')

'hot tea with evaporated milk and sugar'

In [44]:
translate('teh ga dai')

'hot tea with condensed milk and more sugar'

In [45]:
translate('teh c ga dai')

'hot tea with evaporated milk and more sugar'

In [46]:
translate('teh c ga dai peng')

'hot tea with evaporated milk and sugar'

In [47]:
translate('teh o siew dai')

'hot tea with lesser sugar'

In [48]:
translate('teh tiloh')

'heaviest , purest version of tea with no water added at all to the initial brew'

In [49]:
translate('teh tiloh peng')

'iced version of tea with sugar'

In [50]:
translate('tak kiu peng')

'iced milo'

In [51]:
translate('michael jackson')

'soya bean milk mixed with grass jelly'

In [52]:
translate('michael jackson peng')

'ice milo'

In [53]:
translate('teh siew dai peng')

'iced tea with sugar'