# 2 - Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

This notebook implements the core idea of the 2014 paper [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078).

## Introduction

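The encoder turns the embedded source sequence into a context vector. That context vector is then passed to the decoder and a linear layer to generate the target sentence.

Using multi-layer LSTMs for the encoder and decoder, as in the previous implementation, forces too much information to be compressed into the hidden state. Relieving some of that compression gives a better model.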

## Data Preparation

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

Fix the random seeds for reproducibility.

In [2]:
SEED = 42

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

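As before, we use the English and German spaCy models. Unlike the previous notebook, the German source sentence is not reversed, following the paper.

The `Field` objects are created so that the SOS and EOS tokens are added and all text is lowercased.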

In [4]:
import en_core_web_sm, de_core_news_sm

spacy_de = de_core_news_sm.load()
spacy_en = en_core_web_sm.load()

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(
    tokenize=tokenize_de,
    init_token='<sos>',
    eos_token='<eos>',
    lower=True
)

TRG = Field(
    tokenize=tokenize_en,
    init_token='<sos>',
    eos_token='<eos>',
    lower=True
)
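
As a rough illustration of what these `Field` options do to a single sentence, the sketch below tokenizes the text, lowercases it, and wraps it in `<sos>`/`<eos>`. The helper `preprocess` and the whitespace tokenizer are stand-ins assumed for illustration only; the notebook itself relies on spaCy and torchtext for these steps.

```python
def preprocess(text, tokenize=str.split, init_token='<sos>', eos_token='<eos>', lower=True):
    """Hypothetical helper mimicking the Field options used above."""
    tokens = tokenize(text)                       # stand-in for the spaCy tokenizer
    if lower:
        tokens = [tok.lower() for tok in tokens]  # lower=True
    return [init_token] + tokens + [eos_token]    # add <sos> and <eos>

print(preprocess('Zwei Männer stehen am Herd'))
# ['<sos>', 'zwei', 'männer', 'stehen', 'am', 'herd', '<eos>']
```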

Load the same `Multi30k dataset` as before and build the vocabularies that store the words.

In [5]:
train_data, valid_data, test_data = Multi30k.splits(
    exts=('.de', '.en'),
    fields=(SRC, TRG)
)

print(f'Number of training data : {len(train_data.examples)}')
print(f'Number of valid data : {len(valid_data.examples)}')
print(f'Number of test data : {len(test_data.examples)}')

SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

Number of training data : 29000
Number of valid data : 1014
Number of test data : 1000
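
The effect of `min_freq=2` can be sketched in plain Python: tokens seen fewer than two times in the training data are left out of the vocabulary and later map to `<unk>`. The helper `build_vocab`, the toy corpus, and the alphabetical ordering are assumptions made for illustration; torchtext's own ordering and internals may differ.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=2,
                specials=('<unk>', '<pad>', '<sos>', '<eos>')):
    """Hypothetical helper: keep tokens seen at least `min_freq` times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials) + sorted(tok for tok, c in counts.items() if c >= min_freq)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [['a', 'man', 'walks'], ['a', 'man', 'runs'], ['a', 'dog', 'runs']]
itos, stoi = build_vocab(corpus)
unk = stoi['<unk>']
encoded = [stoi.get(tok, unk) for tok in ['a', 'dog', 'walks']]

print(itos)     # ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'man', 'runs']
print(encoded)  # [4, 0, 0]  ('dog' and 'walks' appear once, so they become <unk>)
```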


Create the data iterators, which place each batch on `device`.

In [6]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

## Building the Seq2Seq Model

### Encoder

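Unlike the previous multi-layer LSTM, the encoder here is a single-layer GRU.

The GRU's `dropout` argument only applies between stacked layers, so with a single layer it is not passed to the GRU. Dropout is still applied to the embeddings.

Unlike an LSTM, a GRU has no cell state and keeps only a hidden state.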

$h_t = GRU(e(x_t), h_{t-1})$

$(h_t, c_t) = LSTM(e(x_t), h_{t-1}, c_{t-1})$

$h_t = RNN(e(x_t), h_{t-1})$

From the equations above, a GRU may look identical to a plain RNN, but the GRU uses a *gating mechanism* inside. That mechanism also differs from the LSTM's:

LSTM : forget gate, input gate, output gate + cell state, hidden state

GRU : reset gate, update gate + hidden state
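
To make the two GRU gates concrete, here is a minimal scalar sketch of one GRU step. It follows the gate arrangement documented for `nn.GRU`, with biases omitted; the function `gru_step` and the weight names and values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w):
    """One scalar GRU step; `w` is a dict of hypothetical scalar weights."""
    r = sigmoid(w['w_ir'] * x + w['w_hr'] * h_prev)          # reset gate
    z = sigmoid(w['w_iz'] * x + w['w_hz'] * h_prev)          # update gate
    n = math.tanh(w['w_in'] * x + r * (w['w_hn'] * h_prev))  # candidate state
    return (1.0 - z) * n + z * h_prev                        # new hidden state

weights = {k: 0.5 for k in ('w_ir', 'w_hr', 'w_iz', 'w_hz', 'w_in', 'w_hn')}
h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, weights)   # only the hidden state is carried forward
    print(round(h, 4))
```

The reset gate controls how much of the previous hidden state enters the candidate, and the update gate blends the candidate with the previous hidden state. No separate cell state is carried between steps.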

In [7]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, dropout):
        super().__init__()

        self.hid_dim = hid_dim
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        '''
        src - [src_len, batch_size] 

        embedded - [src_len, batch_size, emb_dim]

        outputs - [src_len, batch_size, hid_dim * n_directions]

        hidden - [n_layers * n_directions, batch_size, hid_dim]
        '''
        embedded = self.dropout(self.embedding(src))

        outputs, hidden = self.rnn(embedded)

        return hidden

### Decoder

<p align="center"><img src="../asset/2(1).png"></p>

The decoder takes the embedded target token $d(y_t)$, the previous hidden state $s_{t-1}$, and the context vector $z$ as input.

The context vector has no subscript $t$ because the same vector is reused at every decoding step.

Note that the initial hidden state $s_0$ is the context vector $z$.

Unlike an RNN decoder that only sees the previous hidden state, this decoder receives the context vector at every step, so less information is lost.

In [8]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, dropout):
        super().__init__()

        self.hid_dim = hid_dim
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim)
        self.fc_out = nn.Linear(emb_dim + hid_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, context):
        '''
        input - [batch_size]
        hidden - [n_layers * n_directions, batch_size, hid_dim]
        context - [same as hidden]

        Here n_layers and n_directions are both 1.

        embedded - [1, batch_size, emb_dim]
        emb_con = [1, batch_size, emb_dim + hid_dim]

        output = [seq_len, batch_size, hid_dim * n_directions]
        hidden = [n_layers * n_directions, batch_size, hid_dim]
        '''
        input = input.unsqueeze(0) # [batch_size] -> [1, batch_size]
        embedded = self.dropout(self.embedding(input))
        emb_con = torch.cat((embedded, context), dim=2)
        output, hidden = self.rnn(emb_con, hidden)

        output = torch.cat((embedded.squeeze(0), hidden.squeeze(0), context.squeeze(0)), dim=1)
        prediction = self.fc_out(output)

        return prediction, hidden
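
The two concatenations in `forward` determine the input sizes of `rnn` and `fc_out`. The sketch below uses plain Python lists in place of tensors, with toy sizes assumed purely for illustration, to show where `emb_dim + hid_dim` and `emb_dim + hid_dim * 2` come from.

```python
# Toy sizes, chosen only for illustration (the notebook uses 256 and 512).
emb_dim, hid_dim = 3, 4

embedded = [0.1] * emb_dim   # d(y_t)
hidden = [0.2] * hid_dim     # s_t returned by the GRU
context = [0.3] * hid_dim    # z, identical at every step

rnn_input = embedded + context           # what the GRU receives
fc_input = embedded + hidden + context   # what the linear layer receives

print(len(rnn_input))  # 7  == emb_dim + hid_dim
print(len(fc_input))   # 11 == emb_dim + hid_dim * 2
```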


### Seq2Seq Model

The final model is structured as shown in the figure below.

<p align="center"><img src="../asset/2(2).png"></p>

Decoding proceeds step by step as follows.

- The input token $y_t$, the previous hidden state $s_{t-1}$, and the context vector $z$ are fed into the decoder.

- The decoder returns the prediction $\hat{y}_{t+1}$ and the new hidden state $s_t$.

In [9]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device

        assert encoder.hid_dim == decoder.hid_dim, 'Hidden dimension of enc and dec should be the same'

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):

        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        context = self.encoder(src)

        hidden = context # the initial hidden state is the context vector

        input = trg[0,:] # the first decoder input is the <sos> token

        for t in range(1, trg_len):
            output, hidden = self.decoder(input, hidden, context)

            outputs[t] = output

            teacher_force = random.random() < teacher_forcing_ratio

            top1 = output.argmax(1)

            if teacher_force:
                input = trg[t]
            else:
                input = top1

        return outputs               
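
The teacher forcing decision inside the loop can be isolated in a small pure-Python sketch. The function `decode` and the stand-in predictor `always_wrong` are hypothetical; they only show which token becomes the next decoder input for a given `teacher_forcing_ratio`.

```python
import random

def decode(target, predict, teacher_forcing_ratio, rng):
    """Toy decoding loop; `predict` is a hypothetical stand-in for the decoder."""
    inp = target[0]                    # start from <sos>
    next_inputs = []
    for t in range(1, len(target)):
        guess = predict(inp)
        teacher_force = rng.random() < teacher_forcing_ratio
        inp = target[t] if teacher_force else guess   # ground truth or own guess
        next_inputs.append(inp)
    return next_inputs

target = ['<sos>', 'a', 'man', 'runs', '<eos>']
always_wrong = lambda token: '<unk>'
rng = random.Random(0)

forced = decode(target, always_wrong, 1.0, rng)
free = decode(target, always_wrong, 0.0, rng)
print(forced)  # ['a', 'man', 'runs', '<eos>']
print(free)    # ['<unk>', '<unk>', '<unk>', '<unk>']
```

With a ratio of 1.0 the decoder always sees the ground-truth token, and with 0.0 it always consumes its own prediction, because `rng.random()` returns a value in `[0, 1)`.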

## Training the Seq2Seq Model

In [10]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_DROPOUT)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Seq2Seq(enc, dec, device).to(device)

In [11]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean=0, std=0.01)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7854, 256)
    (rnn): GRU(256, 512)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): GRU(768, 512)
    (fc_out): Linear(in_features=1280, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [12]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 14,220,037 trainable parameters
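
The reported total can be checked by hand from the layer sizes in the printed model. The sketch below assumes the usual GRU parameter layout (three gates, each with input weights, hidden weights, and two bias vectors); the helper `gru_params` is introduced here for illustration.

```python
def gru_params(input_size, hidden_size):
    # Three gates (reset, update, new), each with input weights,
    # hidden weights, and two bias vectors.
    return 3 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

src_vocab, trg_vocab = 7854, 5893   # vocabulary sizes from the printed model
emb_dim, hid_dim = 256, 512

counts = {
    'encoder.embedding': src_vocab * emb_dim,
    'encoder.rnn': gru_params(emb_dim, hid_dim),
    'decoder.embedding': trg_vocab * emb_dim,
    'decoder.rnn': gru_params(emb_dim + hid_dim, hid_dim),
    'decoder.fc_out': (emb_dim + hid_dim * 2) * trg_vocab + trg_vocab,
}
for name, n in counts.items():
    print(f'{name:18s} {n:>10,}')
print(f'total {sum(counts.values()):,}')  # total 14,220,037
```

More than half of the parameters sit in `fc_out`, since its input is the concatenation of the embedding, the hidden state, and the context vector.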


In [13]:
optimizer = optim.Adam(model.parameters())

TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
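
The role of `ignore_index` can be sketched without PyTorch: positions whose target is the padding index are skipped, and the loss is averaged over the remaining positions. The function `masked_cross_entropy` and the toy logits are assumptions for illustration, not the library's implementation.

```python
import math

def masked_cross_entropy(logits, targets, pad_idx):
    """Mean negative log-likelihood over positions whose target is not `pad_idx`."""
    losses = []
    for row, target in zip(logits, targets):
        if target == pad_idx:
            continue                                  # padding contributes nothing
        log_norm = math.log(sum(math.exp(v) for v in row))
        losses.append(log_norm - row[target])         # -log softmax(row)[target]
    return sum(losses) / len(losses)

logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 0.0, 0.0]]
targets = [0, 1, 2]                                   # assume index 2 is <pad>

with_pad = masked_cross_entropy(logits, targets, pad_idx=2)
without_pad = masked_cross_entropy(logits[:2], targets[:2], pad_idx=2)
print(with_pad, without_pad)   # the two values are equal
```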

In [14]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
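
The call to `clip_grad_norm_` rescales all gradients together when their overall L2 norm exceeds `clip`. A minimal sketch of that idea on a flat list of numbers follows; the function `clip_by_global_norm` and the small `eps` term are assumptions for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradient values so their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)   # 5.0
print(clipped)       # approximately [0.6, 0.8]

small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)         # [0.3, 0.4], already within the limit
```

The direction of the gradient is preserved; only its length is reduced.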

Teacher forcing must be turned off during evaluation.

In [15]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [16]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [17]:
N_EPOCHS = 10
CLIP = 1 # maximum gradient norm for gradient clipping

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), '2-seq2seq-model.pt')
    
    print(f'Epoch : {epoch+1:02} | Time : {epoch_mins}m {epoch_secs}s | Train loss : {train_loss:.5f} | Valid loss : {valid_loss:.5f}')

Epoch : 01 | Time : 0m 19s | Train loss : 5.04071 | Valid loss : 5.08895
Epoch : 02 | Time : 0m 19s | Train loss : 4.39584 | Valid loss : 5.15434
Epoch : 03 | Time : 0m 19s | Train loss : 4.07433 | Valid loss : 4.68725
Epoch : 04 | Time : 0m 19s | Train loss : 3.75137 | Valid loss : 4.42824
Epoch : 05 | Time : 0m 19s | Train loss : 3.45564 | Valid loss : 4.15878
Epoch : 06 | Time : 0m 19s | Train loss : 3.15959 | Valid loss : 3.98140
Epoch : 07 | Time : 0m 18s | Train loss : 2.90866 | Valid loss : 3.89681
Epoch : 08 | Time : 0m 19s | Train loss : 2.68264 | Valid loss : 3.71527
Epoch : 09 | Time : 0m 19s | Train loss : 2.49487 | Valid loss : 3.68103
Epoch : 10 | Time : 0m 19s | Train loss : 2.30219 | Valid loss : 3.66900


In [19]:
model.load_state_dict(torch.load('2-seq2seq-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.583 | Test PPL:  35.986 |
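
Perplexity here is simply the exponential of the average per-token cross-entropy loss. The short check below recomputes it from the rounded loss printed above; the small difference from the reported 35.986 presumably comes from that rounding.

```python
import math

test_loss = 3.583            # rounded value printed above
ppl = math.exp(test_loss)    # perplexity = exp(mean cross-entropy)
print(f'{ppl:.2f}')          # 35.98
```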

