# Part 1: seq2seq model with attention for language translation or chatbot?

## some resources
- [online tutorial](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb) and [code](https://github.com/spro/practical-pytorch/tree/master/seq2seq-translation) from practical pytorch
- MaximumEntropy [seq2seq-pytorch](https://github.com/MaximumEntropy/Seq2Seq-PyTorch)
- IBM [pytorch seq2seq](https://github.com/IBM/pytorch-seq2seq)
- [seq2seq.pytorch](https://github.com/eladhoffer/seq2seq.pytorch)
- [seq2seq with tensorflow tutorials](https://github.com/ematvey/tensorflow-seq2seq-tutorials)
- [seq2seq neural machine translation tutorial](https://github.com/tensorflow/nmt)
- [chatbot based on seq2seq antilm](https://github.com/Marsan-Ma/tf_chatbot_seq2seq_antilm)
- [practical seq2seq for chatbot](http://suriyadeepan.github.io/2016-12-31-practical-seq2seq/)

## datasets
- [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/)
- [chat corpus](https://github.com/Marsan-Ma/chat_corpus)

It might be too long to fit into one notebook, so split it into several.

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [2]:
import numpy as np
import random

import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils import data
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

## Vanila basic seq2seq model
### encoder: 
- a simple rnn (GRU/LSTM), some times with embedding layer before it
- don't care about the output, instead, just take the last hidden state (called thought vector or context?)
- input: batch of padded sequences (if of varying lengths), size=(batch_size, seq_len, input_dim) or (seq_len, batch_size, input_dim) depending on whether it is time_major or batch_major
- output: the hidden state at the last step, size=(batch_size, hidden_dim)

### decoder
- a simple rnn, with a projection layer (softmax) after it, to map rnn seq output to vocab classification
- the initial hidden state will be the thought vector, aka the last hidden state from encoder
- for each step, the input should be the output from last step. And the input of first step for decoder will be a special mark, e.g., SOS (start of sentence) or just EOS
- the sequence output will be projected by another one/sevearl layers to map them to class probablities

### example
- I will follow this [tutorial](https://github.com/ematvey/tensorflow-seq2seq-tutorials/blob/master/1-seq2seq.ipynb), trying to reverse the sequence by using a seq2seq model, in pytorch

In [3]:
## generate some data: 
## input - a sequence of integers(index), target: the reverse of it
## for vocabulary setup, reserving index 0 for padding and index 1 for EOS

## this corresponds to skipping the vocab building (word2inex, index2word) and
## use index directly
class ReverseSeqData(data.Dataset):
    def __init__(self, vocab_size=10, max_seq=10, n_data=1000):
        self.vocab_size = vocab_size
        self.max_seq = max_seq
        self.n_data = n_data
        self.seqs = []
        self.seq_lens = []
        for _ in range(n_data):
            seq_len = np.random.randint(2, max_seq)
            seq = np.zeros(max_seq).astype(np.int64)
            seq[:seq_len] = np.random.randint(2, 10, seq_len) # 0, 1 reserved for padding and EOS
            self.seqs.append(seq)
            self.seq_lens.append(seq_len)
    def __len__(self):
        return len(self.seqs)
    def __getitem__(self, i):
        seq = self.seqs[i]
        seq_len = self.seq_lens[i]
        target = np.zeros(self.max_seq + 1).astype(np.int64)
        target[:seq_len+1] = np.array([x for x in seq[:seq_len][::-1]] + [1])
        return (seq, target, seq_len)
    
toy_ds = ReverseSeqData(n_data=50000, max_seq=5)

print(len(toy_ds))
s, t, l = toy_ds[0]
print(s, t, l)

50000
[4 7 6 0 0] [6 7 4 1 0 0] 3


In [4]:
def sort_seqs_by_len(*seqs, lens):
    order = np.argsort(lens)[::-1]
    sorted_seqs = []
    for seq in seqs:
        sorted_seqs.append(np.asarray(seq)[order])
    return sorted_seqs + [np.asarray(lens)[order]]

In [4]:
## model

vector_dim = 8
vocab_size = toy_ds.vocab_size

class BasicSeq2Seq(nn.Module):
    
    def __init__(self):
        super(BasicSeq2Seq, self).__init__()
        self.embed = nn.Embedding(vocab_size, vector_dim, padding_idx=0)
        self.encode = nn.GRU(input_size=8, hidden_size=vector_dim, num_layers=1, batch_first=True)
        self.decode = nn.GRU(input_size=8, hidden_size=vector_dim, num_layers=1, batch_first=True)
        self.project = nn.Linear(vector_dim, vocab_size)
        
    def forward(self, seqs, seq_lens):
        batch_size = seqs.size(0)
        target_seq_len = seqs.size(1) + 1
        embeded = self.embed(seqs)
        
        padded = pack_padded_sequence(embeded, seq_lens, batch_first=True)
        h0 = Variable(torch.zeros([1, batch_size, vector_dim])).cuda()
        _, h = self.encode(padded, h0)
        
        ys = []
        # first input to decoder is EOS, which is 1 in index
        y = Variable(torch.ones([batch_size, 1])).long().cuda()
        y = self.embed(y)
        for i in range(target_seq_len):
            y, h = self.decode(y, h)
            ys.append(y)
        out = torch.cat(ys, dim=1)
        
        logits = self.project(out.view([-1, vector_dim]))
        return logits.view([batch_size, target_seq_len, vocab_size])

In [5]:
## test model
seqs, targets, lens = zip(*[toy_ds[i] for i in range(16)])
# seqs = np.array(seqs)
# targets = np.array(targets)
# i = np.argsort(lens)[::-1]
# seqs = seqs[i]
# targets = targets[i]
# lens = np.array(lens)[i]
seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)

x = Variable(torch.from_numpy(seqs)).cuda()
m = BasicSeq2Seq().cuda()
y = m(x, lens)
seqs.shape, targets.shape, y.size()

((16, 5), (16, 6), torch.Size([16, 6, 10]))

In [7]:
## training
batch_size = 128
n_batches = len(toy_ds) // batch_size
n_epochs = 45

objective = nn.CrossEntropyLoss()

model = BasicSeq2Seq().cuda()

In [8]:
%%time
model.train()

optimizer = optim.Adam(model.parameters())
index = np.arange(0, len(toy_ds))

for epoch in range(n_epochs):
    
    np.random.shuffle(index)
    for b, bi in enumerate(np.array_split(index, n_batches)):
        seqs, targets, lens = zip(*[toy_ds[i] for i in bi])
        seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)
        
        x = Variable(torch.from_numpy(seqs)).cuda()
        y = Variable(torch.from_numpy(targets)).cuda()
        logits = model(x, lens)
        
        loss = objective(logits.view([-1, vocab_size]), y.view([-1]))
        model.zero_grad()
        loss.backward()
        optimizer.step()
        
        if epoch % 15 == 0 and b % (n_batches//2) == 0:
            print(epoch, b, loss.data[0])

0 0 2.315696954727173
0 195 1.6249631643295288
15 0 0.23702576756477356
15 195 0.21347297728061676
30 0 0.04112502932548523
30 195 0.03904372826218605
45 0 0.00453186733648181
45 195 0.00397515669465065


KeyboardInterrupt: 

In [10]:
## evaluation
model.eval()
seqs, targets, lens = zip(*[toy_ds[i] for i in range(20)])
seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)

x = Variable(torch.from_numpy(seqs)).cuda()
y = model(x, lens)
_, label = torch.max(y, dim=-1)
print("accuracy:", np.mean(label.data.cpu().numpy() == targets))

accuracy: 1.0


## Seq2seq model with bidirectional encoder
- everything else is the same as the basic model, except that the encoder now is using a bidirectional rnn
- concat the hidden state from both directions and use it as the initial state for decoder

In [34]:
## model

vector_dim = 8
vocab_size = toy_ds.vocab_size

class BidiSeq2Seq(nn.Module):
    
    def __init__(self):
        super(BidiSeq2Seq, self).__init__()
        self.embed = nn.Embedding(vocab_size, vector_dim, padding_idx=0)
        self.encode = nn.GRU(input_size=8, hidden_size=vector_dim, num_layers=1,
                             batch_first=True, bidirectional=True)
        # decoder has double hidden dimension to accomodate bidirectional state from encoder
        self.decode = nn.GRU(input_size=8, hidden_size=vector_dim*2, num_layers=1, batch_first=True)
        # project layer to bring the output of decoder to dimension = its input
        self.project = nn.Linear(vector_dim*2, vector_dim)
        self.classify = nn.Linear(vector_dim, vocab_size)
        
    def forward(self, seqs, seq_lens):
        batch_size = seqs.size(0)
        target_seq_len = seqs.size(1) + 1
        embeded = self.embed(seqs)
        
        padded = pack_padded_sequence(embeded, seq_lens, batch_first=True)
        h0 = Variable(torch.zeros([2, batch_size, vector_dim])).cuda()
        _, h = self.encode(padded, h0, )
        h = torch.cat([h[0,...], h[1,...]], dim=1).unsqueeze(dim=0)
        
        ys = []
        # first input to decoder is EOS, which is 1 in index
        y = Variable(torch.ones([batch_size, 1])).long().cuda()
        y = self.embed(y)
        for i in range(target_seq_len):
            y, h = self.decode(y, h)
            y = self.project(y)
            y = F.elu(y)
            ys.append(y)
        out = torch.cat(ys, dim=1)
        
        logits = self.classify(out.view([-1, vector_dim]))
        return logits.view([batch_size, target_seq_len, vocab_size])

In [35]:
## test model
seqs, targets, lens = zip(*[toy_ds[i] for i in range(16)])
seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)

x = Variable(torch.from_numpy(seqs)).cuda()
m = BidiSeq2Seq().cuda()
y = m(x, lens)
seqs.shape, targets.shape, y.size()

((16, 5), (16, 6), torch.Size([16, 6, 10]))

In [40]:
## training

## make it a little more challenging by using a longer max_seq
toy_ds = ReverseSeqData(n_data=50000, max_seq=7)

batch_size = 128
n_batches = len(toy_ds) // batch_size
n_epochs = 30

objective = nn.CrossEntropyLoss()

model = BidiSeq2Seq().cuda()

In [41]:
%%time
model.train()

optimizer = optim.Adam(model.parameters())
index = np.arange(0, len(toy_ds))

for epoch in range(n_epochs):
    
    np.random.shuffle(index)
    for b, bi in enumerate(np.array_split(index, n_batches)):
        seqs, targets, lens = zip(*[toy_ds[i] for i in bi])
        seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)
        
        x = Variable(torch.from_numpy(seqs)).cuda()
        y = Variable(torch.from_numpy(targets)).cuda()
        logits = model(x, lens)
        
        loss = objective(logits.view([-1, vocab_size]), y.view([-1]))
        model.zero_grad()
        loss.backward()
        optimizer.step()
        
        if epoch % 15 == 0 and b % (n_batches//2) == 0:
            print(epoch, b, loss.data[0])

0 0 2.363103151321411
0 195 1.2488439083099365
15 0 0.074092335999012
15 195 0.09117448329925537
CPU times: user 14min 16s, sys: 2min 5s, total: 16min 22s
Wall time: 16min 25s


In [43]:
## evaluation
model.eval()
seqs, targets, lens = zip(*[toy_ds[i] for i in range(20)])
seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)

x = Variable(torch.from_numpy(seqs)).cuda()
y = model(x, lens)
_, label = torch.max(y, dim=-1)
print("accuracy:", np.mean(label.data.cpu().numpy() == targets))

accuracy: 1.0


## another popular way of doing decoder
- the direct output of the decoder is of dim (hidden_dim), to map it back to input_dim, and make it more theoritically correct, we do the following:
    - map the output to output space using softmax
    - get the most likely prediction (or sampling in general)
    - get the embedding of the prediction as the new input
- another advantage is that now embeddings for both encoder and decoder are the same

In [5]:
## model

vector_dim = 8
vocab_size = toy_ds.vocab_size

class BidiSeq2Seq(nn.Module):
    
    def __init__(self):
        super(BidiSeq2Seq, self).__init__()
        self.embed = nn.Embedding(vocab_size, vector_dim, padding_idx=0)
        self.encode = nn.GRU(input_size=8, hidden_size=vector_dim, num_layers=1,
                             batch_first=True, bidirectional=True)
        # decoder has double hidden dimension to accomodate bidirectional state from encoder
        self.decode = nn.GRU(input_size=8, hidden_size=vector_dim*2, num_layers=1, batch_first=True)
        # project layer to bring the output of decoder to dimension = its input
        self.project = nn.Linear(vector_dim*2, vocab_size)
        
    def forward(self, seqs, seq_lens):
        batch_size = seqs.size(0)
        target_seq_len = seqs.size(1) + 1
        embeded = self.embed(seqs)
        
        padded = pack_padded_sequence(embeded, seq_lens, batch_first=True)
        h0 = Variable(torch.zeros([2, batch_size, vector_dim])).cuda()
        _, h = self.encode(padded, h0, )
        h = torch.cat([h[0,...], h[1,...]], dim=1).unsqueeze(dim=0)
        
        ys = []
        # first input to decoder is EOS, which is 1 in index
        y = Variable(torch.ones([batch_size, 1])).long().cuda()
        y = self.embed(y)
        for i in range(target_seq_len):
            y, h = self.decode(y, h)
            ys.append(y)
            y = self.next_decoder_input(y)

            
        out = torch.cat(ys, dim=1)
        
        logits = self.project(out.view([-1, vector_dim*2]))
        return logits.view([batch_size, target_seq_len, vocab_size])
    
    def next_decoder_input(self, decoder_output):
        logits = self.project(decoder_output)
        _, label = torch.max(logits, dim=2)
        embed = self.embed(label)
        return embed

In [6]:
## test model
seqs, targets, lens = zip(*[toy_ds[i] for i in range(16)])
seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)

x = Variable(torch.from_numpy(seqs)).cuda()
m = BidiSeq2Seq().cuda()
y = m(x, lens)
seqs.shape, targets.shape, y.size()

((16, 5), (16, 6), torch.Size([16, 6, 10]))

In [14]:
## training

## make it a little more challenging by using a longer max_seq
toy_ds = ReverseSeqData(n_data=50000, max_seq=5)

batch_size = 128
n_batches = len(toy_ds) // batch_size
n_epochs = 50

objective = nn.CrossEntropyLoss()

model = BidiSeq2Seq().cuda()

In [15]:
%%time
model.train()

optimizer = optim.Adam(model.parameters())
index = np.arange(0, len(toy_ds))

for epoch in range(n_epochs):
    
    np.random.shuffle(index)
    for b, bi in enumerate(np.array_split(index, n_batches)):
        seqs, targets, lens = zip(*[toy_ds[i] for i in bi])
        seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)
        
        x = Variable(torch.from_numpy(seqs)).cuda()
        y = Variable(torch.from_numpy(targets)).cuda()
        logits = model(x, lens)
        
        loss = objective(logits.view([-1, vocab_size]), y.view([-1]))
        model.zero_grad()
        loss.backward()
        optimizer.step()
        
        if epoch % 5 == 0 and b % n_batches == 0:
            print(epoch, b, loss.data[0])

0 0 2.318190097808838
5 0 0.27784043550491333
10 0 0.10735460370779037
15 0 0.06630130857229233
20 0 0.023480141535401344
25 0 0.025113224983215332
30 0 0.006958003621548414
35 0 0.0022069981787353754
40 0 0.0009632504661567509
45 0 0.0015089333755895495
CPU times: user 18min 54s, sys: 2min 49s, total: 21min 43s
Wall time: 21min 48s


In [16]:
## evaluation
model.eval()
seqs, targets, lens = zip(*[toy_ds[i] for i in range(20)])
seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)

x = Variable(torch.from_numpy(seqs)).cuda()
y = model(x, lens)
_, label = torch.max(y, dim=-1)
print("accuracy:", np.mean(label.data.cpu().numpy() == targets))

accuracy: 1.0


## Appendix - Understand behavior of RNN (e.g. GRU) on padded sequence
- its output is from the last layer (if multiple layers involved) for each time step, and if the input is a padded sequence, the output will also be padded (meaning, zeros after certrain step)
- its hidden state is essentially the value of the last effective step of outputs along the time step, considering the padding of the sequence. It has similiar interface as tensorflow

In [9]:
x = Variable(torch.LongTensor([
    [1, 2, 3], 
    [1, 2, 0], 
    [1, 0, 0]])).cuda()
embed = nn.Embedding(num_embeddings=4, embedding_dim=2, padding_idx=0).cuda()
x = embed(x)
x.size()

torch.Size([3, 3, 2])

In [10]:
x.data.cpu().numpy()

array([[[ 0.56909215, -0.58901101],
        [ 1.51466238, -1.08357263],
        [ 1.0598985 , -1.89443576]],

       [[ 0.56909215, -0.58901101],
        [ 1.51466238, -1.08357263],
        [ 0.        ,  0.        ]],

       [[ 0.56909215, -0.58901101],
        [ 0.        ,  0.        ],
        [ 0.        ,  0.        ]]], dtype=float32)

In [11]:
padx = pack_padded_sequence(x, [3, 2, 1])

rnn = nn.GRU(input_size=2, hidden_size=1, batch_first=True, bidirectional=True).cuda()
h0 = Variable(torch.zeros(2, 3, 1)).cuda()
y, h = rnn(padx, h0)
y, lens = pad_packed_sequence(y, batch_first=True)
y.size(), h.size()

(torch.Size([3, 3, 2]), torch.Size([2, 3, 1]))

In [12]:
y

Variable containing:
(0 ,.,.) = 
 -0.1994 -0.9486
 -0.3485 -0.8866
 -0.4583 -0.7033

(1 ,.,.) = 
 -0.2398 -0.9419
 -0.4255 -0.7900
  0.0000  0.0000

(2 ,.,.) = 
 -0.1587 -0.6753
  0.0000  0.0000
  0.0000  0.0000
[torch.cuda.FloatTensor of size 3x3x2 (GPU 0)]

In [13]:
h

Variable containing:
(0 ,.,.) = 
 -0.4583
 -0.4255
 -0.1587

(1 ,.,.) = 
 -0.9486
 -0.9419
 -0.6753
[torch.cuda.FloatTensor of size 2x3x1 (GPU 0)]