# Seq2Seq, Attention

褚则伟 zeweichu@gmail.com

在这份notebook当中，我们会(尽可能)复现Luong的attention模型

由于我们的数据集非常小，只有一万多个句子的训练数据，所以训练出来的模型效果并不好。如果大家想训练一个好一点的模型，可以参考下面的资料。

## 更多阅读

#### 课件
- [cs224d](http://cs224d.stanford.edu/lectures/CS224d-Lecture15.pdf)


#### 论文
- [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025?context=cs)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1406.1078)
- [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078)

#### PyTorch代码
- [seq2seq-tutorial](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb)
- [Tutorial from Ben Trevett](https://github.com/bentrevett/pytorch-seq2seq)
- [IBM seq2seq](https://github.com/IBM/pytorch-seq2seq)
- [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py)


#### 更多关于Machine Translation
- [Beam Search](https://www.coursera.org/lecture/nlp-sequence-models/beam-search-4EtHZ)


In [89]:
import os
import sys
import math
from collections import Counter
import numpy as np
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

import nltk

读入中英文数据
- 英文我们使用nltk的word tokenizer来分词，并且使用小写字母
- 中文我们直接使用单个汉字作为基本单元

In [90]:
def load_data(in_file):
    cn = []
    en = []
    num_examples = 0
    with open(in_file, 'r') as f:
        for line in f:
            line = line.strip().split("\t")
            
            en.append(["BOS"] + nltk.word_tokenize(line[0].lower()) + ["EOS"])
            # split chinese sentence into characters
            cn.append(["BOS"] + [c for c in line[1]] + ["EOS"])
    return en, cn

train_file = "nmt/en-cn/train.txt"
dev_file = "nmt/en-cn/dev.txt"
train_en, train_cn = load_data(train_file)
dev_en, dev_cn = load_data(dev_file)

构建单词表

In [91]:
UNK_IDX = 0
PAD_IDX = 1
def build_dict(sentences, max_words=50000):
    word_count = Counter()
    for sentence in sentences:
        for s in sentence:
            word_count[s] += 1
    ls = word_count.most_common(max_words)
    total_words = len(ls) + 2
    word_dict = {w[0]: index+2 for index, w in enumerate(ls)}
    word_dict["UNK"] = UNK_IDX
    word_dict["PAD"] = PAD_IDX
    return word_dict, total_words

en_dict, en_total_words = build_dict(train_en)
cn_dict, cn_total_words = build_dict(train_cn)
inv_en_dict = {v: k for k, v in en_dict.items()}
inv_cn_dict = {v: k for k, v in cn_dict.items()}

把单词全部转变成数字

In [92]:
def encode(en_sentences, cn_sentences, en_dict, cn_dict, sort_by_len=True):
    '''
        Encode the sequences. 
    '''
    length = len(en_sentences)
    out_en_sentences = [[en_dict.get(w, 0) for w in sent] for sent in en_sentences]
    out_cn_sentences = [[cn_dict.get(w, 0) for w in sent] for sent in cn_sentences]

    # sort sentences by english lengths
    def len_argsort(seq):
        return sorted(range(len(seq)), key=lambda x: len(seq[x]))
       
    if sort_by_len:
        sorted_index = len_argsort(out_en_sentences)
        out_en_sentences = [out_en_sentences[i] for i in sorted_index]
        out_cn_sentences = [out_cn_sentences[i] for i in sorted_index]
        
    return out_en_sentences, out_cn_sentences

train_en, train_cn = encode(train_en, train_cn, en_dict, cn_dict)
dev_en, dev_cn = encode(dev_en, dev_cn, en_dict, cn_dict)

把全部句子分成batch

In [93]:
def get_minibatches(n, minibatch_size, shuffle=True):
    idx_list = np.arange(0, n, minibatch_size)
    if shuffle:
        np.random.shuffle(idx_list)
    minibatches = []
    for idx in idx_list:
        minibatches.append(np.arange(idx, min(idx + minibatch_size, n)))
    return minibatches

def prepare_data(seqs):
    lengths = [len(seq) for seq in seqs]
    n_samples = len(seqs)
    max_len = np.max(lengths)

    x = np.zeros((n_samples, max_len)).astype('int32')
    x_lengths = np.array(lengths).astype("int32")
    for idx, seq in enumerate(seqs):
        x[idx, :lengths[idx]] = seq
    return x, x_lengths #x_mask

def gen_examples(en_sentences, cn_sentences, batch_size):
    minibatches = get_minibatches(len(en_sentences), batch_size)
    all_ex = []
    for minibatch in minibatches:
        mb_en_sentences = [en_sentences[t] for t in minibatch]
        mb_cn_sentences = [cn_sentences[t] for t in minibatch]
        mb_x, mb_x_len = prepare_data(mb_en_sentences)
        mb_y, mb_y_len = prepare_data(mb_cn_sentences)
        all_ex.append((mb_x, mb_x_len, mb_y, mb_y_len))
    return all_ex

batch_size = 64
train_data = gen_examples(train_en, train_cn, batch_size)
random.shuffle(train_data)
dev_data = gen_examples(dev_en, dev_cn, batch_size)

数据全部处理完成，现在我们开始构建seq2seq模型

#### Encoder
- Encoder模型的任务是把输入文字传入embedding层和GRU层，转换成一些hidden states作为后续的context vectors

In [94]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, enc_hidden_size, dec_hidden_size, dropout=0.2):
        super(Encoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, enc_hidden_size, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(enc_hidden_size * 2, dec_hidden_size)

    def forward(self, x, lengths):
        sorted_len, sorted_idx = lengths.sort(0, descending=True)
        x_sorted = x[sorted_idx.long()]
        embedded = self.dropout(self.embed(x_sorted))
        
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, sorted_len.long().cpu().data.numpy(), batch_first=True)
        packed_out, hid = self.rnn(packed_embedded)
        out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
        _, original_idx = sorted_idx.sort(0, descending=False)
        out = out[original_idx.long()].contiguous()
        hid = hid[:, original_idx.long()].contiguous()
        
        hid = torch.cat([hid[-2], hid[-1]], dim=1)
        hid = torch.tanh(self.fc(hid)).unsqueeze(0)

        return out, hid

#### Luong Attention
- 根据context vectors和当前的输出hidden states，计算输出

In [95]:
class Attention(nn.Module):
    def __init__(self, enc_hidden_size, dec_hidden_size):
        super(Attention, self).__init__()

        self.enc_hidden_size = enc_hidden_size
        self.dec_hidden_size = dec_hidden_size

        self.linear_in = nn.Linear(enc_hidden_size*2, dec_hidden_size, bias=False)
        self.linear_out = nn.Linear(enc_hidden_size*2 + dec_hidden_size, dec_hidden_size)
        
    def forward(self, output, context, mask):
        # output: batch_size, output_len, dec_hidden_size
        # context: batch_size, context_len, enc_hidden_size
    
        batch_size = output.size(0)
        output_len = output.size(1)
        input_len = context.size(1)
        
        context_in = self.linear_in(context.view(batch_size*input_len, -1)).view(                batch_size, input_len, -1) # batch_size, output_len, dec_hidden_size
        attn = torch.bmm(output, context_in.transpose(1,2)) # batch_size, output_len, context_len

        
        attn.data.masked_fill(mask, -1e6)

        attn = F.softmax(attn, dim=2) # batch_size, output_len, context_len

        context = torch.bmm(attn, context) # batch_size, output_len, enc_hidden_size
        
        output = torch.cat((context, output), dim=2) # batch_size, output_len, hidden_size*2

        
        output = output.view(batch_size*output_len, -1)
        output = torch.tanh(self.linear_out(output))
        output = output.view(batch_size, output_len, -1)
        return output, attn


#### Decoder
- decoder会根据已经翻译的句子内容，和context vectors，来决定下一个输出的单词

In [96]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, enc_hidden_size, dec_hidden_size, dropout=0.2):
        super(Decoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.attention = Attention(enc_hidden_size, dec_hidden_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(dec_hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def create_mask(self, x_len, y_len):
        device = x_len.device
        max_x_len = x_len.max()
        max_y_len = y_len.max()
        x_mask = torch.arange(max_x_len, device=x_len.device)[None, :] < x_len[:, None]
        y_mask = torch.arange(max_y_len, device=x_len.device)[None, :] < y_len[:, None]
        mask = (1 - x_mask[:, :, None] * y_mask[:, None, :]).byte()
        return mask
        
        
    def forward(self, ctx, ctx_lengths, y, y_lengths, hid):
        sorted_len, sorted_idx = y_lengths.sort(0, descending=True)
        y_sorted = y[sorted_idx.long()]
        hid = hid[:, sorted_idx.long()]
        
        y_sorted = self.dropout(self.embed(y_sorted)) # batch_size, output_length, embed_size

        packed_seq = nn.utils.rnn.pack_padded_sequence(y_sorted, sorted_len.long().cpu().data.numpy(), batch_first=True)
        out, hid = self.rnn(packed_seq, hid)
        unpacked, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
        _, original_idx = sorted_idx.sort(0, descending=False)
        output_seq = unpacked[original_idx.long()].contiguous()
        hid = hid[:, original_idx.long()].contiguous()

        mask = self.create_mask(y_lengths, ctx_lengths)

        # code.interact(local=locals())
        output, attn = self.attention(output_seq, ctx, mask)
        output = F.log_softmax(self.out(output), -1)
        
        return output, hid, attn

#### Seq2Seq
- 最后我们构建Seq2Seq模型把encoder, attention, decoder串到一起

In [97]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, x, x_lengths, y, y_lengths):
        encoder_out, hid = self.encoder(x, x_lengths)
        output, hid, attn = self.decoder(ctx=encoder_out, 
                    ctx_lengths=x_lengths,
                    y=y,
                    y_lengths=y_lengths,
                    hid=hid)
        return output, attn
    
    def translate(self, x, x_lengths, y, max_length=100):
        encoder_out, hid = self.encoder(x, x_lengths)
        preds = []
        batch_size = x.shape[0]
        attns = []
        for i in range(max_length):
            output, hid, attn = self.decoder(ctx=encoder_out, 
                    ctx_lengths=x_lengths,
                    y=y,
                    y_lengths=torch.ones(batch_size).long().to(y.device),
                    hid=hid)
            y = output.max(2)[1].view(batch_size, 1)
            preds.append(y)
            attns.append(attn)
        return torch.cat(preds, 1), torch.cat(attns, 1)

训练

In [98]:
class LanguageModelCriterion(nn.Module):
    def __init__(self):
        super(LanguageModelCriterion, self).__init__()

    def forward(self, input, target, mask):
        input = input.contiguous().view(-1, input.size(2))
        target = target.contiguous().view(-1, 1)
        mask = mask.contiguous().view(-1, 1)
        output = -input.gather(1, target) * mask
        output = torch.sum(output) / torch.sum(mask)

        return output

In [99]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

en_vocab_size = len(en_dict)
cn_vocab_size = len(cn_dict)
embed_size = hidden_size = 100
dropout = 0.2

encoder = Encoder(vocab_size=en_vocab_size, 
                  embed_size=embed_size, 
                  enc_hidden_size=hidden_size,
                  dec_hidden_size=hidden_size,
                  dropout=dropout)
decoder = Decoder(vocab_size=cn_vocab_size, 
                  embed_size=embed_size, 
                  enc_hidden_size=hidden_size,
                  dec_hidden_size=hidden_size,
                  dropout=dropout)
model = Seq2Seq(encoder, decoder)
model = model.to(device)
crit = LanguageModelCriterion().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)


In [100]:
def evaluate(model, data):
    model.eval()
    total_num_words = total_loss = 0.
    with torch.no_grad():
        for it, (mb_x, mb_x_lengths, mb_y, mb_y_lengths) in enumerate(data):
            mb_x = torch.from_numpy(mb_x).long().to(device)
            mb_x_lengths = torch.from_numpy(mb_x_lengths).long().to(device)
            mb_input = torch.from_numpy(mb_y[:,:-1]).long().to(device)
            mb_out = torch.from_numpy(mb_y[:, 1:]).long().to(device)
            mb_y_lengths = torch.from_numpy(mb_y_lengths-1).long().to(device)
            mb_y_lengths[mb_y_lengths <= 0] = 1

            mb_pred, attn = model(mb_x, mb_x_lengths, mb_input, mb_y_lengths)

            mb_out_mask = torch.arange(mb_y_lengths.max().item(), device=device)[None, :] < mb_y_lengths[:, None]
            mb_out_mask = mb_out_mask.float()
            loss = crit(mb_pred, mb_out, mb_out_mask)

            num_words = torch.sum(mb_y_lengths).item()
            total_loss += loss.item() * num_words
            total_num_words += num_words

    print("evaluation loss", total_loss/total_num_words)


In [101]:
def train(model, data, num_epochs=30):
    for epoch in range(num_epochs):
        total_num_words = total_loss = 0.
        model.train()
        for it, (mb_x, mb_x_lengths, mb_y, mb_y_lengths) in enumerate(data):
            mb_x = torch.from_numpy(mb_x).long().to(device)
            mb_x_lengths = torch.from_numpy(mb_x_lengths).long().to(device)
            mb_input = torch.from_numpy(mb_y[:,:-1]).long().to(device)
            mb_out = torch.from_numpy(mb_y[:, 1:]).long().to(device)
            mb_y_lengths = torch.from_numpy(mb_y_lengths-1).long().to(device)
            mb_y_lengths[mb_y_lengths <= 0] = 1

            mb_pred, attn = model(mb_x, mb_x_lengths, mb_input, mb_y_lengths)

            mb_out_mask = torch.arange(mb_y_lengths.max().item(), device=device)[None, :] < mb_y_lengths[:, None]
            mb_out_mask = mb_out_mask.float()
            loss = crit(mb_pred, mb_out, mb_out_mask)

            num_words = torch.sum(mb_y_lengths).item()
            total_loss += loss.item() * num_words
            total_num_words += num_words

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5.)
            optimizer.step()

            if it % 100 == 0:
                print("epoch", epoch, "iteration", it, "loss", loss.item())
        print("epoch", epoch, "training loss", total_loss/total_num_words)
        if epoch % 5 == 0:
            print("evaluating on dev...")
            evaluate(model, dev_data)

In [102]:
train(model, train_data, num_epochs=30)

epoch 0 iteration 0 loss 8.08132553100586
epoch 0 iteration 100 loss 5.079873561859131
epoch 0 iteration 200 loss 4.824731826782227
epoch 0 training loss 5.4923821157443315
evaluating on dev...
evaluation loss 4.984728863184165
epoch 1 iteration 0 loss 4.536130428314209
epoch 1 iteration 100 loss 4.477148532867432
epoch 1 iteration 200 loss 4.261579990386963
epoch 1 training loss 4.7648797368652245
epoch 2 iteration 0 loss 3.9508657455444336
epoch 2 iteration 100 loss 4.018619537353516
epoch 2 iteration 200 loss 3.868828535079956
epoch 2 training loss 4.342565599831299
epoch 3 iteration 0 loss 3.58060622215271
epoch 3 iteration 100 loss 3.68619441986084
epoch 3 iteration 200 loss 3.548811912536621
epoch 3 training loss 4.031472658481414
epoch 4 iteration 0 loss 3.256010055541992
epoch 4 iteration 100 loss 3.4553415775299072
epoch 4 iteration 200 loss 3.3572139739990234
epoch 4 training loss 3.7922291376638113
epoch 5 iteration 0 loss 3.0312695503234863
epoch 5 iteration 100 loss 3.2500

In [104]:
def translate_dev(i):
    model.eval()
    
    en_sent = " ".join([inv_en_dict[word] for word in dev_en[i]])
    print(en_sent)
    print(" ".join([inv_cn_dict[word] for word in dev_cn[i]]))

    sent = nltk.word_tokenize(en_sent.lower())
    bos = torch.Tensor([[cn_dict["BOS"]]]).long().to(device)
    mb_x = torch.Tensor([[en_dict.get(w, 0) for w in sent]]).long().to(device)
    mb_x_len = torch.Tensor([len(sent)]).long().to(device)
    
    translation, attention = model.translate(mb_x, mb_x_len, bos)
    translation = [inv_cn_dict[i] for i in translation.data.cpu().numpy().reshape(-1)]

    trans = []
    for word in translation:
        if word != "EOS":
            trans.append(word)
        else:
            break
    print(" ".join(trans))
for i in range(100,120):
    translate_dev(i)
    print()

BOS you have nice skin . EOS
BOS 你 的 皮 膚 真 好 。 EOS
有 很 好 好 好 运 动 看 你 们 好 好 运 。

BOS you 're UNK correct . EOS
BOS 你 部 分 正 确 。 EOS
你 相 信 著 一 個 正 確 的 事 情 。

BOS everyone admired his courage . EOS
BOS 每 個 人 都 佩 服 他 的 勇 氣 。 EOS
抱 怨 一 個 人 跟 我 父 亲 和 她 父 亲 。

BOS what time is it ? EOS
BOS 几 点 了 ？ EOS
谁 是 个 时 候 ， 这 些 事 情 是 我 們 什 麼 時 候 間 是 多 少 ？

BOS i 'm free tonight . EOS
BOS 我 今 晚 有 空 。 EOS
我 今 晚 有 空 话 看 见 我 。

BOS here is your book . EOS
BOS 這 是 你 的 書 。 EOS
這 本 書 對 你 這 個 工 作 的 書 閤 間 間 裡 得 厲 害 。

BOS they are at lunch . EOS
BOS 他 们 在 吃 午 饭 。 EOS
他 們 在 派 對 一 本 書 在 我 們 家 庭 作 弊 。

BOS this chair is UNK . EOS
BOS 這 把 椅 子 很 UNK 。 EOS
这 个 城 市 一 个 人 在 这 个 城 市 一 个 人 。

BOS it 's pretty heavy . EOS
BOS 它 真 重 。 EOS
相 同 了 一 个 人 明 白 。

BOS many attended his funeral . EOS
BOS 很 多 人 都 参 加 了 他 的 葬 礼 。 EOS
老 人 知 道 他 的 忠 告 讓 他 有 很 多 功 課 。

BOS training will be provided . EOS
BOS 会 有 训 练 。 EOS
一 個 人 會 議 一 個 人 會 議 一 點 會 來 。

BOS someone is watching you . EOS
BOS 有 人 在 看 著 你 。 EOS
有 人 在 醫 生 看 见 你 的 朋 友 。

BOS i

### 没有Attention的版本
下面是一个更简单的没有Attention的encoder decoder模型

In [105]:
class PlainEncoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, dropout=0.2):
        super(PlainEncoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, lengths):
        sorted_len, sorted_idx = lengths.sort(0, descending=True)
        x_sorted = x[sorted_idx.long()]
        embedded = self.dropout(self.embed(x_sorted))
        
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, sorted_len.long().cpu().data.numpy(), batch_first=True)
        packed_out, hid = self.rnn(packed_embedded)
        out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
        _, original_idx = sorted_idx.sort(0, descending=False)
        out = out[original_idx.long()].contiguous()
        hid = hid[:, original_idx.long()].contiguous()
        
        return out, hid[[-1]]

class PlainDecoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, dropout=0.2):
        super(PlainDecoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, y, y_lengths, hid):
        sorted_len, sorted_idx = y_lengths.sort(0, descending=True)
        y_sorted = y[sorted_idx.long()]
        hid = hid[:, sorted_idx.long()]

        y_sorted = self.dropout(self.embed(y_sorted)) # batch_size, output_length, embed_size

        packed_seq = nn.utils.rnn.pack_padded_sequence(y_sorted, sorted_len.long().cpu().data.numpy(), batch_first=True)
        out, hid = self.rnn(packed_seq, hid)
        unpacked, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
        _, original_idx = sorted_idx.sort(0, descending=False)
        output_seq = unpacked[original_idx.long()].contiguous()
#         print(output_seq.shape)
        hid = hid[:, original_idx.long()].contiguous()

        output = F.log_softmax(self.out(output_seq), -1)
        
        return output, hid
    
class PlainSeq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(PlainSeq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, x, x_lengths, y, y_lengths):
        encoder_out, hid = self.encoder(x, x_lengths)
        output, hid = self.decoder(y=y,
                    y_lengths=y_lengths,
                    hid=hid)
        return output, None
    
    def translate(self, x, x_lengths, y, max_length=10):
        encoder_out, hid = self.encoder(x, x_lengths)
        preds = []
        batch_size = x.shape[0]
        attns = []
        for i in range(max_length):
            output, hid = self.decoder(y=y,
                    y_lengths=torch.ones(batch_size).long().to(y.device),
                    hid=hid)
            y = output.max(2)[1].view(batch_size, 1)
            preds.append(y)
            
        return torch.cat(preds, 1), None

In [106]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dropout=0.1

en_vocab_size = len(en_dict)
cn_vocab_size = len(cn_dict)
embed_size = hidden_size = 100
encoder = PlainEncoder(vocab_size=en_vocab_size, 
                  hidden_size=hidden_size,
                dropout=dropout)
decoder = PlainDecoder(vocab_size=cn_vocab_size, 
                  hidden_size=hidden_size,
                      dropout=dropout)
model = PlainSeq2Seq(encoder, decoder)
model = model.to(device)
crit = LanguageModelCriterion().to(device)
optimizer = torch.optim.Adam(model.parameters())


In [107]:
train(model, train_data, 30)

epoch 0 iteration 0 loss 8.118623733520508
epoch 0 iteration 100 loss 4.908447265625
epoch 0 iteration 200 loss 4.654881000518799
epoch 0 training loss 5.458988490175124
evaluating on dev...
evaluation loss 4.82326412088977
epoch 1 iteration 0 loss 4.377485275268555
epoch 1 iteration 100 loss 4.250931262969971
epoch 1 iteration 200 loss 4.107049942016602
epoch 1 training loss 4.568508705921145
epoch 2 iteration 0 loss 3.728694438934326
epoch 2 iteration 100 loss 3.8160221576690674
epoch 2 iteration 200 loss 3.741051435470581
epoch 2 training loss 4.142142730123562
epoch 3 iteration 0 loss 3.3936266899108887
epoch 3 iteration 100 loss 3.5346450805664062
epoch 3 iteration 200 loss 3.49332332611084
epoch 3 training loss 3.86606609727717
epoch 4 iteration 0 loss 3.158982038497925
epoch 4 iteration 100 loss 3.33746337890625
epoch 4 iteration 200 loss 3.313246250152588
epoch 4 training loss 3.6600888084444545
epoch 5 iteration 0 loss 2.9967830181121826
epoch 5 iteration 100 loss 3.1830751895

In [110]:
for i in range(100,120):
    translate_dev(i)
    print()

BOS you have nice skin . EOS
BOS 你 的 皮 膚 真 好 。 EOS
你 必 須 再 來 一 些 工 作 。

BOS you 're UNK correct . EOS
BOS 你 部 分 正 确 。 EOS
你 在 哪 裡 ， 他 就 是 不 知

BOS everyone admired his courage . EOS
BOS 每 個 人 都 佩 服 他 的 勇 氣 。 EOS
他 們 的 父 親 對 我 的 父 親

BOS what time is it ? EOS
BOS 几 点 了 ？ EOS
什 麼 時 候 ， 不 是 什 么 意

BOS i 'm free tonight . EOS
BOS 我 今 晚 有 空 。 EOS
我 今 晚 有 很 多 次 。

BOS here is your book . EOS
BOS 這 是 你 的 書 。 EOS
你 的 姓 名 字 典 是 你 的 。

BOS they are at lunch . EOS
BOS 他 们 在 吃 午 饭 。 EOS
他 們 在 這 裡 沒 有 人 在 這

BOS this chair is UNK . EOS
BOS 這 把 椅 子 很 UNK 。 EOS
他 們 在 這 裡 有 一 些 錢 。

BOS it 's pretty heavy . EOS
BOS 它 真 重 。 EOS
這 是 什 麼 時 候 ， 我 不 知

BOS many attended his funeral . EOS
BOS 很 多 人 都 参 加 了 他 的 葬 礼 。 EOS
他 的 父 親 對 他 的 手 提 供

BOS training will be provided . EOS
BOS 会 有 训 练 。 EOS
很 多 人 都 知 道 有 什 麼 時

BOS someone is watching you . EOS
BOS 有 人 在 看 著 你 。 EOS
汤 姆 是 我 的 兄 弟 。

BOS i slapped his face . EOS
BOS 我 摑 了 他 的 臉 。 EOS
我 知 道 他 是 否 認 為 他 的

BOS i like UNK music . EOS
BOS 我 喜 歡 流 行 音 樂 。 EOS
