# seq2seq model with attention for language translation or chatbot?

## some resources
- [online tutorial](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb) and [code](https://github.com/spro/practical-pytorch/tree/master/seq2seq-translation) from practical pytorch
- MaximumEntropy [seq2seq-pytorch](https://github.com/MaximumEntropy/Seq2Seq-PyTorch)
- IBM [pytorch seq2seq](https://github.com/IBM/pytorch-seq2seq)
- [seq2seq.pytorch](https://github.com/eladhoffer/seq2seq.pytorch)
- [seq2seq with tensorflow tutorials](https://github.com/ematvey/tensorflow-seq2seq-tutorials)
- [seq2seq neural machine translation tutorial](https://github.com/tensorflow/nmt)
- [chatbot based on seq2seq antilm](https://github.com/Marsan-Ma/tf_chatbot_seq2seq_antilm)
- [practical seq2seq for chatbot](http://suriyadeepan.github.io/2016-12-31-practical-seq2seq/)

## datasets
- [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/)
- [chat corpus](https://github.com/Marsan-Ma/chat_corpus)

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [2]:
import numpy as np
import random

import torch
from torch import nn, optim
from torch.utils import data
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

## Vanilla basic seq2seq model
### encoder
- a simple RNN (GRU/LSTM), sometimes with an embedding layer in front of it
- the per-step outputs are discarded; only the last hidden state is kept (often called the thought vector or context vector)
- input: a batch of sequences, padded if their lengths vary, of size (batch_size, seq_len, input_dim) if batch-major or (seq_len, batch_size, input_dim) if time-major
- output: the hidden state at the last step, of size (batch_size, hidden_dim)
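
The encoder described above can be sketched roughly as follows. This is a minimal illustration, not the model used later in this notebook: `ToyEncoder` and its default sizes are made-up names and values, and `enforce_sorted=False` assumes PyTorch 1.1 or newer. It runs on CPU.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

class ToyEncoder(nn.Module):
    """Hypothetical encoder: an embedding layer followed by a single-layer GRU."""
    def __init__(self, vocab_size=10, hidden_dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim, padding_idx=0)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, seqs, seq_lens):
        # seqs: (batch_size, seq_len) token indices, padded with 0 (batch-major)
        embedded = self.embed(seqs)
        # packing tells the GRU where each sequence really ends,
        # so padding steps do not change the final hidden state
        packed = pack_padded_sequence(embedded, seq_lens, batch_first=True,
                                      enforce_sorted=False)
        _, h = self.rnn(packed)   # h: (num_layers, batch_size, hidden_dim)
        return h[-1]              # thought vector: (batch_size, hidden_dim)

seqs = torch.tensor([[6, 7, 5, 9, 0],
                     [3, 4, 0, 0, 0]])
seq_lens = torch.tensor([4, 2])
thought = ToyEncoder()(seqs, seq_lens)
print(thought.shape)  # torch.Size([2, 8])
```

Because the sequences are packed, the thought vector of the second row should match what the encoder returns for the unpadded sequence `[3, 4]` alone.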

### decoder
- a simple RNN followed by a projection layer (linear + softmax) that maps each RNN output to a distribution over the vocabulary
- the initial hidden state is the thought vector, i.e. the last hidden state of the encoder
- at each step, the input is the output of the previous step; the input at the first step is a special marker, e.g. SOS (start of sentence) or simply EOS
- the output sequence is projected by one or several layers to class probabilities
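
The decoding loop described above can be sketched like this. It is one possible reading (greedy decoding, where the most likely token of each step is embedded and fed back), not necessarily what the model below does. `ToyDecoder`, the `EOS` constant and the sizes are assumptions for illustration.

```python
import torch
from torch import nn

EOS = 1  # assumed index of the EOS token, reused here as the start symbol

class ToyDecoder(nn.Module):
    """Hypothetical decoder: embedding, single-layer GRU, linear projection."""
    def __init__(self, vocab_size=10, hidden_dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim, padding_idx=0)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, vocab_size)

    def forward(self, thought, max_len):
        # thought: (batch_size, hidden_dim), the encoder's last hidden state
        batch_size = thought.size(0)
        h = thought.unsqueeze(0)  # (1, batch_size, hidden_dim)
        token = torch.full((batch_size, 1), EOS, dtype=torch.long,
                           device=thought.device)
        step_logits = []
        for _ in range(max_len):
            out, h = self.rnn(self.embed(token), h)  # out: (batch, 1, hidden)
            logits = self.project(out)               # (batch, 1, vocab)
            step_logits.append(logits)
            token = logits.argmax(dim=-1)            # greedy choice, (batch, 1)
        return torch.cat(step_logits, dim=1)         # (batch, max_len, vocab)

thought = torch.zeros(2, 8)                 # stand-in for an encoder output
logits = ToyDecoder()(thought, max_len=6)
probs = torch.softmax(logits, dim=-1)       # class probabilities per step
print(logits.shape)  # torch.Size([2, 6, 10])
```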

### example
- This follows the [tutorial](https://github.com/ematvey/tensorflow-seq2seq-tutorials/blob/master/1-seq2seq.ipynb): a seq2seq model is trained to reverse its input sequence, here in PyTorch

In [107]:
## generate some data: 
## input - a sequence of integers(index), target: the reverse of it
## for vocabulary setup, reserving index 0 for padding and index 1 for EOS

## this skips the vocab building (word2index, index2word) and
## uses indices directly
class ReverseSeqData(data.Dataset):
    def __init__(self, vocab_size=10, max_seq=10, n_data=1000):
        self.vocab_size = vocab_size
        self.max_seq = max_seq
        self.n_data = n_data
        self.seqs = []
        self.seq_lens = []
        for _ in range(n_data):
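            seq_len = np.random.randint(2, max_seq)  # length in [2, max_seq - 1], so at least one pad remains
            seq = np.zeros(max_seq).astype(np.int64)
            seq[:seq_len] = np.random.randint(2, vocab_size, seq_len)  # tokens in [2, vocab_size - 1]; 0 and 1 are reserved for padding and EOS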
            self.seqs.append(seq)
            self.seq_lens.append(seq_len)
    def __len__(self):
        return len(self.seqs)
    def __getitem__(self, i):
        seq = self.seqs[i]
        seq_len = self.seq_lens[i]
        target = np.zeros(self.max_seq + 1).astype(np.int64)
        target[:seq_len+1] = np.array([x for x in seq[:seq_len][::-1]] + [1])
        return (seq, target, seq_len)
    
toy_ds = ReverseSeqData(n_data=50000, max_seq=5)

print(len(toy_ds))
s, t, l = toy_ds[0]
print(s, t, l)

50000
[6 7 5 9 0] [9 5 7 6 1 0] 4


In [108]:
## model

vector_dim = 8
vocab_size = toy_ds.vocab_size

class BasicSeq2Seq(nn.Module):
    
    def __init__(self):
        super(BasicSeq2Seq, self).__init__()
        self.embed = nn.Embedding(vocab_size, vector_dim, padding_idx=0)
        self.encode = nn.GRU(input_size=vector_dim, hidden_size=vector_dim, num_layers=1, batch_first=True)
        self.decode = nn.GRU(input_size=vector_dim, hidden_size=vector_dim, num_layers=1, batch_first=True)
        self.project = nn.Linear(vector_dim, vocab_size)
        
    def forward(self, seqs, seq_lens):
        batch_size = seqs.size(0)
        target_seq_len = seqs.size(1) + 1
        embeded = self.embed(seqs)
        
        padded = pack_padded_sequence(embeded, seq_lens, batch_first=True)
        h0 = Variable(torch.zeros([1, batch_size, vector_dim])).cuda()
        _, h = self.encode(padded, h0)
        
        ys = []
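        # first input to decoder is the embedding of EOS, which is index 1
        y = Variable(torch.ones([batch_size, 1])).long().cuda()
        y = self.embed(y)
        for i in range(target_seq_len):
            # the raw GRU output (not an embedded predicted token) is fed back as the next input;
            # this works because the GRU output size equals its input size (vector_dim)
            y, h = self.decode(y, h)
            ys.append(y)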
        out = torch.cat(ys, dim=1)
        
        logits = self.project(out.view([-1, vector_dim]))
        return logits.view([batch_size, target_seq_len, vocab_size])
    
def sort_seqs_by_len(*seqs, lens):
    order = np.argsort(lens)[::-1]
    sorted_seqs = []
    for seq in seqs:
        sorted_seqs.append(np.asarray(seq)[order])
    return sorted_seqs + [np.asarray(lens)[order]]
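
A small usage sketch of `sort_seqs_by_len`. The helper is repeated in compact form so the snippet runs on its own, and the toy arrays are made up for illustration. Sorting by decreasing length matters because `pack_padded_sequence` expects that order by default.

```python
import numpy as np

def sort_seqs_by_len(*seqs, lens):
    # same logic as the helper above, repeated so this snippet is self-contained
    order = np.argsort(lens)[::-1]  # indices from longest to shortest
    return [np.asarray(s)[order] for s in seqs] + [np.asarray(lens)[order]]

# hypothetical toy batch: padded inputs, reversed targets ending in EOS (1)
seqs = [[3, 4, 0, 0], [6, 7, 5, 9], [8, 2, 5, 0]]
targets = [[4, 3, 1, 0, 0], [9, 5, 7, 6, 1], [5, 2, 8, 1, 0]]
lens = [2, 4, 3]

s, t, l = sort_seqs_by_len(seqs, targets, lens=lens)
print(l)           # [4 3 2]
print(s[0], t[0])  # [6 7 5 9] [9 5 7 6 1]
```

Inputs and targets are reordered with the same permutation, so each input row still lines up with its target.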

In [109]:
## test model
seqs, targets, lens = zip(*[toy_ds[i] for i in range(16)])
# seqs = np.array(seqs)
# targets = np.array(targets)
# i = np.argsort(lens)[::-1]
# seqs = seqs[i]
# targets = targets[i]
# lens = np.array(lens)[i]
seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)

x = Variable(torch.from_numpy(seqs)).cuda()
m = BasicSeq2Seq().cuda()
y = m(x, lens)
seqs.shape, targets.shape, y.size()

((16, 5), (16, 6), torch.Size([16, 6, 10]))

In [110]:
## training
batch_size = 128
n_batches = len(toy_ds) // batch_size
n_epochs = 100

objective = nn.CrossEntropyLoss()

model = BasicSeq2Seq().cuda()

In [111]:

model.train()

optimizer = optim.Adam(model.parameters())
index = np.arange(0, len(toy_ds))

for epoch in range(n_epochs):
    
    np.random.shuffle(index)
    for b, bi in enumerate(np.array_split(index, n_batches)):
        seqs, targets, lens = zip(*[toy_ds[i] for i in bi])
        seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)
        
        x = Variable(torch.from_numpy(seqs)).cuda()
        y = Variable(torch.from_numpy(targets)).cuda()
        logits = model(x, lens)
        
        loss = objective(logits.view([-1, vocab_size]), y.view([-1]))
        model.zero_grad()
        loss.backward()
        optimizer.step()
        
        if epoch % 5 == 0 and b % (n_batches//2) == 0:
            print(epoch, b, loss.item())  # loss.item() replaces the older loss.data[0] (PyTorch >= 0.4)

0 0 2.2610697746276855
0 195 1.464776635169983
5 0 0.5823482275009155
5 195 0.5377452969551086
10 0 0.4125298857688904
10 195 0.4034583568572998
15 0 0.3362981379032135
15 195 0.30488529801368713
20 0 0.2657814025878906
20 195 0.25782158970832825
25 0 0.2263297140598297
25 195 0.2644290626049042
30 0 0.20293505489826202
30 195 0.23257048428058624
35 0 0.1814369410276413
35 195 0.15772274136543274
40 0 0.181661918759346
40 195 0.13795605301856995
45 0 0.15113674104213715
45 195 0.14242248237133026
50 0 0.10945641994476318
50 195 0.11432357877492905
55 0 0.14799422025680542
55 195 0.1350439041852951
60 0 0.11765176057815552
60 195 0.11994931101799011
65 0 0.10504516959190369
65 195 0.12767396867275238
70 0 0.10786114633083344
70 195 0.08872746676206589
75 0 0.0758020207285881
75 195 0.09081286936998367
80 0 0.08712781220674515
80 195 0.07148434966802597
85 0 0.0523664727807045
85 195 0.06827341765165329
90 0 0.05967968702316284
90 195 0.05542568862438202
95 0 0.05697206035256386
95 195 0
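
The loss above is averaged over every target position, padding included. A possible variant, not used in this notebook, is to exclude padding with the `ignore_index` argument of `nn.CrossEntropyLoss`. The sketch below uses random logits and made-up targets, and only checks that the masked loss equals the mean over the non-padding positions.

```python
import torch
from torch import nn

PAD = 0  # padding index, as in the toy dataset

logits = torch.randn(2, 6, 10)  # (batch, target_len, vocab), random stand-in
targets = torch.tensor([[9, 5, 7, 6, 1, 0],
                        [4, 3, 1, 0, 0, 0]])

plain = nn.CrossEntropyLoss()
masked = nn.CrossEntropyLoss(ignore_index=PAD)

flat_logits = logits.view(-1, 10)
flat_targets = targets.view(-1)

loss_all = plain(flat_logits, flat_targets)      # padding positions count
loss_masked = masked(flat_logits, flat_targets)  # padding positions skipped

# same value computed by hand: keep only the non-padding rows
keep = flat_targets != PAD
manual = plain(flat_logits[keep], flat_targets[keep])
print(loss_all.item(), loss_masked.item(), manual.item())
```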

In [112]:
## evaluation
model.eval()
seqs, targets, lens = zip(*[toy_ds[i] for i in range(20)])
seqs, targets, lens = sort_seqs_by_len(seqs, targets, lens=lens)

x = Variable(torch.from_numpy(seqs)).cuda()
y = model(x, lens)
_, label = torch.max(y, dim=-1)

In [113]:
label.data.cpu().numpy()

array([[6, 6, 9, 9, 1, 0],
       [5, 3, 5, 4, 1, 0],
       [9, 3, 7, 4, 1, 0],
       [8, 8, 8, 7, 1, 0],
       [5, 9, 5, 6, 1, 0],
       [9, 5, 7, 6, 1, 0],
       [5, 8, 9, 1, 0, 0],
       [2, 2, 6, 1, 0, 0],
       [8, 6, 4, 1, 0, 0],
       [2, 6, 3, 1, 0, 0],
       [9, 7, 6, 1, 0, 0],
       [6, 2, 4, 1, 0, 0],
       [3, 7, 4, 1, 0, 0],
       [3, 4, 5, 1, 0, 0],
       [3, 4, 1, 0, 0, 0],
       [8, 3, 1, 0, 0, 0],
       [6, 9, 1, 0, 0, 0],
       [6, 7, 1, 0, 0, 0],
       [3, 4, 1, 0, 0, 0],
       [7, 6, 1, 0, 0, 0]])

In [114]:
targets

array([[6, 6, 9, 9, 1, 0],
       [5, 3, 5, 4, 1, 0],
       [9, 3, 7, 4, 1, 0],
       [8, 8, 8, 7, 1, 0],
       [5, 9, 8, 6, 1, 0],
       [9, 5, 7, 6, 1, 0],
       [5, 8, 9, 1, 0, 0],
       [2, 2, 6, 1, 0, 0],
       [8, 6, 4, 1, 0, 0],
       [2, 6, 3, 1, 0, 0],
       [9, 7, 6, 1, 0, 0],
       [6, 2, 4, 1, 0, 0],
       [3, 7, 4, 1, 0, 0],
       [3, 4, 5, 1, 0, 0],
       [3, 4, 1, 0, 0, 0],
       [8, 3, 1, 0, 0, 0],
       [6, 9, 1, 0, 0, 0],
       [6, 7, 1, 0, 0, 0],
       [3, 4, 1, 0, 0, 0],
       [7, 6, 1, 0, 0, 0]])
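
To compare predictions and targets numerically rather than by eye, token-level and sequence-level accuracy can be computed while ignoring padding. The sketch below copies three rows from the outputs above (the fourth, fifth and sixth). The variable names are illustrative.

```python
import numpy as np

PAD = 0

# three rows copied from the printed predictions and targets above
pred = np.array([[8, 8, 8, 7, 1, 0],
                 [5, 9, 5, 6, 1, 0],
                 [9, 5, 7, 6, 1, 0]])
tgt = np.array([[8, 8, 8, 7, 1, 0],
                [5, 9, 8, 6, 1, 0],
                [9, 5, 7, 6, 1, 0]])

mask = tgt != PAD      # positions that carry a real token or EOS
correct = pred == tgt

token_acc = correct[mask].mean()                # 14 of 15 non-padding tokens match
seq_acc = (correct | ~mask).all(axis=1).mean()  # 2 of 3 sequences fully match
print(token_acc, seq_acc)
```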