# Neural Machine Translation by Jointly Learning to Align and Translate

This notebook implements a sequence-to-sequence model with PyTorch and torchtext, following the paper named in the title.

## Introduction

As a reminder, this is the basic encoder-decoder model:

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq1.png?raw=1)

The previous model reduced information compression by passing the context vector, $z$, to the decoder at every time-step, and by making each prediction from the context vector, the embedded input word, $d(y_t)$, and the hidden state, $s_t$, together.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq7.png?raw=1)

Even though we have reduced some of this compression, our context vector still needs to contain all of the information about the source sentence. The model implemented in this notebook avoids this compression by allowing the decoder to look at the entire source sentence (via its hidden states) at each decoding step! How does it do this? It uses *attention*. 

Attention works by first calculating an attention vector, $a$, whose length equals that of the source sentence. Each element of the attention vector is between 0 and 1, and the entire vector sums to 1. We then calculate a weighted sum of our source sentence hidden states, $H$, to get a weighted source vector, $w$.

$$w = \sum_{i}a_ih_i$$

We calculate a new weighted source vector every time-step when decoding, using it as input to our decoder RNN as well as the linear layer to make a prediction. We'll explain how to do all of this during the tutorial.
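As a rough illustration of the two properties above (weights between 0 and 1 that sum to 1, then a weighted sum of hidden states), here is a minimal pure-Python sketch. The scores and hidden states are made-up numbers, and the helper names `softmax` and `weighted_source` are chosen for this example only; the tutorial's real implementation uses PyTorch tensors.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights between 0 and 1 that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_source(a, H):
    """w = sum_i a_i * h_i for a list of equally sized hidden-state vectors H."""
    dim = len(H[0])
    return [sum(a_i * h_i[d] for a_i, h_i in zip(a, H)) for d in range(dim)]

# three made-up encoder hidden states (one per source token), each of size 2
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# made-up unnormalized scores; the third token scores highest
a = softmax([0.0, 0.0, 2.0])
w = weighted_source(a, H)

print(a)  # attention weights over the three source tokens
print(w)  # weighted source vector, dominated by the third hidden state
```

Because the third token receives most of the weight, `w` ends up close to the third hidden state.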

## Preparing Data

Again, the preparation is similar to last time.

First we import all the required modules.

This is a PyTorch model that translates German into English.
- An encoder-decoder seq2seq model uses an RNN to encode the input into a feature vector, which we call the 'context vector' here.
- The 'context vector' is an abstract, compressed representation of the input (German) sentence.
- A second RNN decodes this context vector to generate the English sentence.

The required packages were installed in an Anaconda environment on macOS, using Python 3.6, with the following commands:

conda install pytorch==1.8.0 torchtext==0.9.0 cpuonly -c pytorch
conda install -c pytorch torchtext
conda install -c conda-forge spacy
pip install spacy==2.2.4
conda install -c anaconda numpy

In [1]:
#python=3.6
import torch # conda install pytorch==1.8.0 torchtext==0.9.0 cpuonly -c pytorch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.legacy.datasets import Multi30k #conda install -c pytorch torchtext
from torchtext.legacy.data import Field, BucketIterator

import spacy 
# conda install -c conda-forge spacy 
# pip install spacy==2.2.4
import numpy as np #conda install -c anaconda numpy

import random
import math
import time

Set the random seeds of every random number generator we use, for reproducibility.

In [2]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Load the German and English spaCy models.

The spaCy models provide the tokenizers that split a sentence into tokens. They can be downloaded with the following commands:

python3 -m spacy download en_core_web_sm
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

python3 -m spacy download de_core_news_sm
pip3 install https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.0/de_core_news_sm-2.2.0.tar.gz

In [3]:
#spacy_de = spacy.load('de_core_news_sm')
#spacy_en = spacy.load('en_core_web_sm')
import en_core_web_sm
import de_core_news_sm

# load the tokenizer for each language
spacy_de = de_core_news_sm.load()
spacy_en = en_core_web_sm.load()

We create the tokenizer functions.

In [4]:
# tokenize German text into a list of tokens (the tokens are not reversed here)
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]
# tokenize English text into a list of tokens
def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

The fields remain the same as before.

A torchtext `Field` defines how the data should be processed.

In [5]:
# SRC = source = input
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)
# TRG = target = output
TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

Load the data.

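We use the Multi30k dataset, which contains roughly 30,000 parallel English, German and French sentences. Each English sentence is paired with its German and French counterparts. The call below downloads the data and loads the train, validation and test splits.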

In [6]:
# exts: the language of each side of the pair (source language first)
# fields = (source, target)
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))

Inspect the data.

In [25]:
print(vars(train_data.examples[0])) # the German source tokens are in their original order

# check the size of each split
print(f'Number of training examples: {len(train_data.examples)}')
print(f'Number of validation examples: {len(valid_data.examples)}')
print(f'Number of testing examples: {len(test_data.examples)}')

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}
Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


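Build the vocabulary.
- Each unique token is mapped to an index.
- The German and English vocabularies are separate.
- `min_freq=2`: only tokens that appear at least twice are added to the vocabulary; tokens that appear only once are treated as the `<unk>` (unknown) token.
- The vocabulary must be built from the training set only, never from the validation or test sets.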

In [7]:
# min_freq=n keeps only tokens that appear at least n times; tokens that appear fewer than n times are replaced by <unk>
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
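To make the effect of `min_freq` concrete, the following is a simplified pure-Python sketch of frequency-thresholded vocabulary building. It is not torchtext's implementation: the helper names `build_vocab` and `numericalize`, the order of the special tokens, and the alphabetical ordering of the remaining tokens are all assumptions made for this example.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map every token seen at least min_freq times to an index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    itos = list(specials) + kept
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    """Replace each token by its index, falling back to <unk> for unseen tokens."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

# a tiny made-up training set: "zwei" and "." occur twice, the rest once
train = [["zwei", "männer", "."], ["zwei", "frauen", "."]]
stoi = build_vocab(train, min_freq=2)

# "kinder" never appeared in training, so it maps to <unk>
ids = numericalize(["zwei", "kinder", "."], stoi)
print(stoi)
print(ids)
```

Tokens that occur only once in the training data ("männer", "frauen") are left out of the vocabulary, so at lookup time they are handled exactly like a word that was never seen.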

Define the device.
This selects the device to run on: the GPU if CUDA is available, otherwise the CPU. The run shown here used the CPU of a Mac.

In [8]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Create the iterators.

In [9]:
# create the iterators
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

## Building the Seq2Seq Model

### Encoder

First, we'll build the encoder. Similar to the previous model, we only use a single layer GRU, however we now use a *bidirectional RNN*. With a bidirectional RNN, we have two RNNs in each layer. A *forward RNN* going over the embedded sentence from left to right (shown below in green), and a *backward RNN* going over the embedded sentence from right to left (teal). All we need to do in code is set `bidirectional = True` and then pass the embedded sentence to the RNN as before. 

![](assets/seq2seq8.png)

We now have:

$$\begin{align*}
h_t^\rightarrow &= \text{EncoderGRU}^\rightarrow(e(x_t^\rightarrow),h_{t-1}^\rightarrow)\\
h_t^\leftarrow &= \text{EncoderGRU}^\leftarrow(e(x_t^\leftarrow),h_{t-1}^\leftarrow)
\end{align*}$$

Where $x_0^\rightarrow = \text{<sos>}, x_1^\rightarrow = \text{guten}$ and $x_0^\leftarrow = \text{<eos>}, x_1^\leftarrow = \text{morgen}$.

As before, we only pass an input (`embedded`) to the RNN, which tells PyTorch to initialize both the forward and backward initial hidden states ($h_0^\rightarrow$ and $h_0^\leftarrow$, respectively) to a tensor of all zeros. We'll also get two context vectors, one from the forward RNN after it has seen the final word in the sentence, $z^\rightarrow=h_T^\rightarrow$, and one from the backward RNN after it has seen the first word in the sentence, $z^\leftarrow=h_T^\leftarrow$.

The RNN returns `outputs` and `hidden`. 

`outputs` is of size **[src len, batch size, hid dim * num directions]** where the first `hid_dim` elements in the third axis are the hidden states from the top layer forward RNN, and the last `hid_dim` elements are hidden states from the top layer backward RNN. We can think of the third axis as being the forward and backward hidden states concatenated together, i.e. $h_1 = [h_1^\rightarrow; h_{T}^\leftarrow]$, $h_2 = [h_2^\rightarrow; h_{T-1}^\leftarrow]$ and we can denote all encoder hidden states (forward and backwards concatenated together) as $H=\{ h_1, h_2, ..., h_T\}$.

`hidden` is of size **[n layers * num directions, batch size, hid dim]**, where **[-2, :, :]** gives the top layer forward RNN hidden state after the final time-step (i.e. after it has seen the last word in the sentence) and **[-1, :, :]** gives the top layer backward RNN hidden state after the final time-step (i.e. after it has seen the first word in the sentence).
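The shapes and slices described above can be checked on a small bidirectional GRU. This is a sketch with made-up sizes and random input standing in for the embedded sentence; it only assumes PyTorch is installed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

src_len, batch_size, emb_dim, hid_dim = 5, 3, 4, 6
rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True)

# random stand-in for the embedded source tokens
embedded = torch.randn(src_len, batch_size, emb_dim)
outputs, hidden = rnn(embedded)

print(outputs.shape)  # [src len, batch size, hid dim * 2]
print(hidden.shape)   # [n layers * num directions, batch size, hid dim]

# forward RNN: its final state is the first hid_dim features at the last time-step
fwd_matches = torch.allclose(outputs[-1, :, :hid_dim], hidden[-2])

# backward RNN: its final state is the last hid_dim features at the first time-step
bwd_matches = torch.allclose(outputs[0, :, hid_dim:], hidden[-1])

print(fwd_matches, bwd_matches)
```

The backward RNN finishes at the first source position, which is why its final hidden state lines up with `outputs[0]` rather than `outputs[-1]`.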

As the decoder is not bidirectional, it only needs a single context vector, $z$, to use as its initial hidden state, $s_0$, and we currently have two, a forward and a backward one ($z^\rightarrow=h_T^\rightarrow$ and $z^\leftarrow=h_T^\leftarrow$, respectively). We solve this by concatenating the two context vectors together, passing them through a linear layer, $g$, and applying the $\tanh$ activation function. 

$$z=\tanh(g(h_T^\rightarrow, h_T^\leftarrow)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$$

**Note**: this is actually a deviation from the paper. Instead, they feed only the first backward RNN hidden state through a linear layer to get the context vector/decoder initial hidden state. This doesn't seem to make sense to me, so we have changed it.

As we want our model to look back over the whole of the source sentence we return `outputs`, the stacked forward and backward hidden states for every token in the source sentence. We also return `hidden`, which acts as our initial hidden state in the decoder.

The encoder is an RNN that reads the input sequence and encodes it into a fixed-size context vector. Its initial hidden state is a zero tensor. In this project n_layers = 1.

- If the initial hidden state is not passed explicitly, it defaults to a tensor of zeros.
- outputs: the hidden states of the top layer at every time-step
- hidden: the final hidden state of each layer, h_T
- A GRU has no cell state; a final cell state, c_T, is only returned by an LSTM.

<center>self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)</center>

[Student question 1]

While reading the encoder explanation I had a question. The text says that the RNN returns outputs and hidden, and that hidden has size [n layers * num directions, batch size, hid dim]. It also says that outputs only contains the results of the last layer.

My question is about n layers. The encoder implementation never sets n layers anywhere, so how is this value determined?

[Professor's answer 1]

The GRU() constructor is called here without a num_layers argument, so we can infer that num_layers falls back to a default value of 1.
The PyTorch 1.12 documentation for nn.GRU confirms this:

Parameters
input_size – The number of expected features in the input x

hidden_size – The number of features in the hidden state h

num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two GRUs together to form a stacked GRU, with the second GRU taking in outputs of the first GRU and computing the final results. Default: 1
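The default can also be checked directly. The sketch below, with made-up sizes, compares a default GRU with a two-layer bidirectional one: the first dimension of `hidden` grows with the number of layers and directions, while the first dimension of `outputs` is always the sequence length.

```python
import torch
import torch.nn as nn

default_gru = nn.GRU(input_size=4, hidden_size=6)
stacked_gru = nn.GRU(input_size=4, hidden_size=6, num_layers=2, bidirectional=True)

x = torch.randn(7, 3, 4)  # [seq len = 7, batch size = 3, input size = 4]

out1, hid1 = default_gru(x)
out2, hid2 = stacked_gru(x)

print(default_gru.num_layers)  # number of stacked layers when none is given
print(out1.shape, hid1.shape)  # one layer, one direction
print(out2.shape, hid2.shape)  # two layers, two directions
```

A single call processes all 7 time-steps; the number of layers only controls how many GRUs are stacked on top of each other.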



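[Student question 2]

Should I understand n layers as the number of times the RNN computation is performed?


[Professor's answer 2]

That describes how many times the RNN cell is unrolled horizontally, and as explained above, that is not what n layers means.
An RNN object created with the constructor is callable and is used like a function; when called, it applies the cell once per token in the input sequence.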


In [10]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        # embedding: maps each token index to an emb_dim-dimensional vector
        # input_dim = vocabulary size of the input data = size of the one-hot vectors
        # emb_dim = dimension of the embedding layer
        # the embedding layer turns a one-hot vector into a dense vector of length emb_dim
        self.embedding = nn.Embedding(input_dim, emb_dim) 
        
        # takes the embeddings and produces hidden states of size enc_hid_dim
        # an RNN has one direction by default; bidirectional = True gives two directions
        
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        #self.rnn = nn.LSTM(emb_dim, enc_hid_dim, n_layers, dropout = dropout)
        
        # fully connected layer
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        
        # dropout = dropout probability (a regularization method that reduces overfitting)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        
        #outputs = [src len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        
        return outputs, hidden

### Attention

Next up is the attention layer. This will take in the previous hidden state of the decoder, $s_{t-1}$, and all of the stacked forward and backward hidden states from the encoder, $H$. The layer will output an attention vector, $a_t$, that is the length of the source sentence, each element is between 0 and 1 and the entire vector sums to 1.

Intuitively, this layer takes what we have decoded so far, $s_{t-1}$, and all of what we have encoded, $H$, to produce a vector, $a_t$, that represents which words in the source sentence we should pay the most attention to in order to correctly predict the next word to decode, $\hat{y}_{t+1}$. 

First, we calculate the *energy* between the previous decoder hidden state and the encoder hidden states. As our encoder hidden states are a sequence of $T$ tensors, and our previous decoder hidden state is a single tensor, the first thing we do is `repeat` the previous decoder hidden state $T$ times. We then calculate the energy, $E_t$, between them by concatenating them together and passing them through a linear layer (`attn`) and a $\tanh$ activation function. 

$$E_t = \tanh(\text{attn}(s_{t-1}, H))$$ 

This can be thought of as calculating how well each encoder hidden state "matches" the previous decoder hidden state.

We currently have a **[dec hid dim, src len]** tensor for each example in the batch. We want this to be **[src len]** for each example in the batch as the attention should be over the length of the source sentence. This is achieved by multiplying the `energy` by a **[1, dec hid dim]** tensor, $v$.

$$\hat{a}_t = v E_t$$

We can think of $v$ as the weights for a weighted sum of the energy across all encoder hidden states. These weights tell us how much we should attend to each token in the source sequence. The parameters of $v$ are initialized randomly, but learned with the rest of the model via backpropagation. Note how $v$ is not dependent on time, and the same $v$ is used for each time-step of the decoding. We implement $v$ as a linear layer without a bias.

Finally, we ensure the attention vector fits the constraints of having all elements between 0 and 1 and the vector summing to 1 by passing it through a $\text{softmax}$ layer.

$$a_t = \text{softmax}(\hat{a}_t)$$

This gives us the attention over the source sentence!
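The three equations above can be traced step by step on random tensors. This is a sketch with made-up sizes, written with the batch dimension first; the variable names `s_prev`, `H` and `a_hat` are chosen here to mirror the symbols $s_{t-1}$, $H$ and $\hat{a}_t$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, src_len, enc_hid_dim, dec_hid_dim = 2, 5, 3, 4

attn = nn.Linear(enc_hid_dim * 2 + dec_hid_dim, dec_hid_dim)
v = nn.Linear(dec_hid_dim, 1, bias=False)

s_prev = torch.randn(batch_size, dec_hid_dim)           # previous decoder hidden state
H = torch.randn(batch_size, src_len, enc_hid_dim * 2)   # all encoder hidden states

# repeat the decoder state once per source position
s_rep = s_prev.unsqueeze(1).repeat(1, src_len, 1)       # [batch, src len, dec hid dim]

# energy between the decoder state and every encoder state
energy = torch.tanh(attn(torch.cat((s_rep, H), dim=2))) # [batch, src len, dec hid dim]

# collapse the hidden dimension to one score per source position
a_hat = v(energy).squeeze(2)                            # [batch, src len]

# normalize the scores over the source sentence
a = F.softmax(a_hat, dim=1)

print(a.shape)
print(a.sum(dim=1))  # each row sums to 1
```

The softmax is taken over `dim=1`, the source-length axis, so each example in the batch gets its own distribution over source tokens.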

Graphically, this looks something like below. This is for calculating the very first attention vector, where $s_{t-1} = s_0 = z$. The green/teal blocks represent the hidden states from both the forward and backward RNNs, and the attention computation is all done within the pink block.

![](assets/seq2seq9.png)

In [11]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        
        # torch.cat concatenates tensors along the given dimension
        # torch.bmm is a batched matrix multiplication
        # unsqueeze inserts a new dimension of size 1 at the given position:
        #   a tensor of shape (3) becomes (1, 3) after unsqueeze(0)
        #   and (3, 1) after unsqueeze(1)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2)
        
        #attention= [batch size, src len]
        
        return F.softmax(attention, dim=1)

### Decoder

Next up is the decoder. 

The decoder contains the attention layer, `attention`, which takes the previous hidden state, $s_{t-1}$, all of the encoder hidden states, $H$, and returns the attention vector, $a_t$.

We then use this attention vector to create a weighted source vector, $w_t$, denoted by `weighted`, which is a weighted sum of the encoder hidden states, $H$, using $a_t$ as the weights.

$$w_t = a_t H$$
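In code, this product is a batched matrix multiplication. The sketch below, with made-up sizes and random values, checks that `torch.bmm` gives the same result as writing the weighted sum over source positions out explicitly.

```python
import torch

torch.manual_seed(0)
batch_size, src_len, enc_hid_dim = 2, 4, 3

a = torch.softmax(torch.randn(batch_size, src_len), dim=1)  # attention weights
H = torch.randn(batch_size, src_len, enc_hid_dim * 2)       # encoder hidden states

# [batch, 1, src len] x [batch, src len, enc hid dim * 2] -> [batch, 1, enc hid dim * 2]
weighted = torch.bmm(a.unsqueeze(1), H)

# the same quantity as an explicit sum over source positions
explicit = (a.unsqueeze(2) * H).sum(dim=1, keepdim=True)

print(weighted.shape)
print(torch.allclose(weighted, explicit, atol=1e-6))
```

The `unsqueeze(1)` turns each attention vector into a one-row matrix so that `bmm` can multiply it with that example's matrix of encoder states.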

The embedded input word, $d(y_t)$, the weighted source vector, $w_t$, and the previous decoder hidden state, $s_{t-1}$, are then all passed into the decoder RNN, with $d(y_t)$ and $w_t$ being concatenated together.

$$s_t = \text{DecoderGRU}(d(y_t), w_t, s_{t-1})$$

We then pass $d(y_t)$, $w_t$ and $s_t$ through the linear layer, $f$, to make a prediction of the next word in the target sentence, $\hat{y}_{t+1}$. This is done by concatenating them all together.

$$\hat{y}_{t+1} = f(d(y_t), w_t, s_t)$$

The image below shows decoding the first word in an example translation.

![](assets/seq2seq10.png)

The green/teal blocks show the forward/backward encoder RNNs which output $H$, the red block shows the context vector, $z = h_T = \tanh(g(h^\rightarrow_T,h^\leftarrow_T)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$, the blue block shows the decoder RNN which outputs $s_t$, the purple block shows the linear layer, $f$, which outputs $\hat{y}_{t+1}$ and the orange block shows the calculation of the weighted sum over $H$ by $a_t$ and outputs $w_t$. Not shown is the calculation of $a_t$.

The decoder takes the encoded context vector and decodes it, predicting one word at a time.

In [12]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        # output_dim = vocabulary size of the target data
        self.output_dim = output_dim
        self.attention = attention
        
        # maps each target token index to an emb_dim-dimensional vector
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        # takes the embedded token concatenated with the weighted source vector
        # and produces a hidden state of size dec_hid_dim
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
        # input = [batch size]
        # hidden = [batch size, dec hid dim]
        # encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        
        # the decoder is unidirectional with a single layer (n layers = n directions = 1),
        # so hidden is kept 2-D and only unsqueezed to [1, batch size, dec hid dim]
        # when it is passed to the GRU
        
        # input = [batch size] → unsqueeze → input = [1, batch size]
        # the first input is a batch of <sos> tokens
        input = input.unsqueeze(0) 
        
        
        
        # input = [1, batch size] → embedding → dropout → embedded = [1, batch size, emb dim]
        embedded = self.dropout(self.embedding(input)) 
        
        # a = [batch size, src len]
        a = self.attention(hidden, encoder_outputs)
        # a = [batch size, 1, src len]
        a = a.unsqueeze(1) 
        
        # encoder_outputs = [batch size, src len, enc hid dim * 2]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)   
                
        # weighted = [batch size, 1, enc hid dim * 2]
        weighted = torch.bmm(a, encoder_outputs)     
        
        # weighted = [1, batch size, enc hid dim * 2]
        weighted = weighted.permute(1, 0, 2)
        
        # rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
        rnn_input = torch.cat((embedded, weighted), dim = 2)
                    
            
        # the decoder handles one token at a time, so seq len = 1
        # hidden is unsqueezed to [1, batch size, dec hid dim] for the GRU
        # output = [1, batch size, dec hid dim]
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))  
        
        
        
        # output holds the top-layer hidden state at every time-step
        # hidden holds the final time-step hidden state of every layer
        # hidden = [n layers * n directions, batch size, dec hid dim]
        
        # seq len, n layers and n directions will always be 1 in this decoder, therefore:
        # output = [1, batch size, dec hid dim]
        # hidden = [1, batch size, dec hid dim]
        # this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        # prediction = [batch size, output dim]
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        
        return prediction, hidden.squeeze(0)

### Seq2Seq

This is the first model where the encoder RNN and decoder RNN do not need to have the same hidden dimensions; however, the encoder has to be bidirectional. This requirement can be removed by changing all occurrences of `enc_hid_dim * 2` to `enc_hid_dim * 2 if encoder_is_bidirectional else enc_hid_dim`.

This seq2seq encapsulator is similar to the last two. The only difference is that the `encoder` returns both the final hidden state (which is the final hidden state from both the forward and backward encoder RNNs passed through a linear layer) to be used as the initial hidden state for the decoder, as well as every hidden state (which are the forward and backward hidden states stacked on top of each other). We also need to ensure that `hidden` and `encoder_outputs` are passed to the decoder. 

Briefly going over all of the steps:
- the `outputs` tensor is created to hold all predictions, $\hat{Y}$
- the source sequence, $X$, is fed into the encoder to receive $z$ and $H$
- the initial decoder hidden state is set to be the `context` vector, $s_0 = z = h_T$
- we use a batch of `<sos>` tokens as the first `input`, $y_1$
- we then decode within a loop:
  - inserting the input token $y_t$, previous hidden state, $s_{t-1}$, and all encoder outputs, $H$, into the decoder
  - receiving a prediction, $\hat{y}_{t+1}$, and a new hidden state, $s_t$
  - we then decide if we are going to teacher force or not, setting the next input as appropriate

The `Seq2Seq` class combines the encoder and decoder classes.


teacher_force = random.random() < teacher_forcing_ratio 

top1 = output.argmax(1) 

input = trg[t] if teacher_force else top1
          
          
These lines implement teacher forcing in this class: with a given probability, the next decoder step receives the target token rather than the model's own prediction.
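The sampling rule can be tried out without a model. The sketch below uses made-up token lists and a hypothetical helper, `choose_next_inputs`, to show that a ratio of 1.0 always feeds the target, a ratio of 0.0 always feeds the prediction, and a ratio of 0.5 feeds the target about half of the time.

```python
import random

def choose_next_inputs(targets, predictions, teacher_forcing_ratio, rng):
    """At each step pick the target token with probability teacher_forcing_ratio,
    otherwise the model's own prediction."""
    chosen = []
    for target_tok, predicted_tok in zip(targets, predictions):
        teacher_force = rng.random() < teacher_forcing_ratio
        chosen.append(target_tok if teacher_force else predicted_tok)
    return chosen

rng = random.Random(1234)
targets = ["two", "young", "men"]
predictions = ["a", "young", "man"]  # made-up model guesses

always = choose_next_inputs(targets, predictions, 1.0, rng)
never = choose_next_inputs(targets, predictions, 0.0, rng)

# estimate how often teacher forcing is used with a ratio of 0.5
trials = 10000
forced = sum(rng.random() < 0.5 for _ in range(trials))

print(always)
print(never)
print(forced / trials)
```

Since `random.random()` returns a value in [0, 1), the comparison is never true for a ratio of 0.0, which is how the evaluation loop turns teacher forcing off.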

In [13]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        # unlike the earlier models, the encoder and decoder hidden dimensions
        # do not have to match here, so these checks stay disabled
        #assert encoder.hid_dim == decoder.hid_dim, 
        #assert encoder.n_layers == decoder.n_layers, 
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        
        
        batch_size = src.shape[1]

        # length of the target sequence
        trg_len = trg.shape[0]
        
        # size of the target vocabulary
        trg_vocab_size = self.decoder.output_dim
        
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        #  and used as the initial hidden state of the decoder
        encoder_outputs, hidden = self.encoder(src) # initial hidden state
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        
        # start at t = 1: position 0 of trg is always <sos>, so outputs[0] is never
        # predicted and stays all zeros; the loop runs trg_len - 1 times
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            # random.random() returns a float in [0, 1), so teacher_force is True
            # with probability teacher_forcing_ratio
            teacher_force = random.random() < teacher_forcing_ratio 
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs

## Training the Seq2Seq Model

The rest of this tutorial is very similar to the previous one.

We initialise our parameters, encoder, decoder and seq2seq model (placing it on the GPU if we have one). 

Set the hyperparameters.

In [14]:
# hyperparameters
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256 # embedding dimension
DEC_EMB_DIM = 256 # embedding dimension
ENC_HID_DIM = 512 # hidden state dimension
DEC_HID_DIM = 512 # hidden state dimension
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

# build the model
attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)

We use a simplified version of the weight initialization scheme used in the paper. Here, we initialize all biases to zero and draw all weights from a normal distribution with mean 0 and standard deviation 0.01, $\mathcal{N}(0, 0.01)$.

In [15]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
            
            # alternative: draw all weights from a uniform distribution on (-0.08, 0.08)
            # nn.init.uniform_(param.data, -0.08, 0.08)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

Calculate the number of trainable parameters. This is an increase of almost 50% in the number of parameters over the last model.

In [16]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 20,518,917 trainable parameters
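This total can be reproduced by hand from the layer sizes in the printed model summary. The sketch below is plain arithmetic; the helper names `gru_params` and `linear_params` are made up for this example, and the GRU formula assumes PyTorch's layout of three gates, each with input-to-hidden and hidden-to-hidden weights and two bias vectors.

```python
def gru_params(input_size, hidden_size, directions=1):
    """Parameters of a single-layer GRU: three gates, each with input-to-hidden
    and hidden-to-hidden weights, plus two bias vectors per gate."""
    per_direction = 3 * hidden_size * (input_size + hidden_size) + 2 * 3 * hidden_size
    return directions * per_direction

def linear_params(in_features, out_features, bias=True):
    """Parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

# sizes copied from the printed model summary
counts = {
    "encoder.embedding": 7855 * 256,
    "encoder.rnn": gru_params(256, 512, directions=2),
    "encoder.fc": linear_params(1024, 512),
    "attention.attn": linear_params(1536, 512),
    "attention.v": linear_params(512, 1, bias=False),
    "decoder.embedding": 5893 * 256,
    "decoder.rnn": gru_params(1280, 512),
    "decoder.fc_out": linear_params(1792, 5893),
}

for name, n in counts.items():
    print(f"{name:>18}: {n:>12,}")

total = sum(counts.values())
print(f"{'total':>18}: {total:>12,}")
```

More than half of the parameters sit in `decoder.fc_out`, the layer that maps onto the target vocabulary.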


We create an optimizer.

In [17]:
optimizer = optim.Adam(model.parameters())

We initialize the loss function.

In [18]:
# loss function
# positions whose target is the <pad> (padding) token are ignored when computing the loss
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
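The effect of `ignore_index` can be seen on a toy batch. In this sketch the logits are random and `PAD_IDX = 1` is a hypothetical padding index picked for the example; the loss with padding ignored matches the loss computed on the real tokens alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_IDX = 1     # hypothetical padding index for this sketch
vocab_size = 6

logits = torch.randn(4, vocab_size)               # scores for 4 target positions
targets = torch.tensor([3, 5, PAD_IDX, PAD_IDX])  # the last two positions are padding

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss_with_padding = criterion(logits, targets)

# the same loss computed only on the two real tokens
loss_real_only = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(loss_with_padding.item(), loss_real_only.item())
```

Padded positions contribute neither to the sum nor to the count used for averaging, so padding a batch does not dilute the loss.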

We then create the training loop...

The training function runs one pass over the iterator and returns the mean loss per batch.

In [19]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]        
        output_dim = output.shape[-1]        
        # drop the first position (<sos>) and flatten to 2-D for the loss
        output = output[1:].view(-1, output_dim)        
        # drop the first position (<sos>) and flatten to 1-D for the loss
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        # clip the gradient norm to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
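The gradient clipping step in the training function can be observed in isolation. This sketch uses a small stand-in linear model and a deliberately inflated loss, both made up for the example, and measures the total gradient norm before and after `clip_grad_norm_`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 10)  # small stand-in model

# scale the output up so the gradients start out very large
loss = (model(torch.randn(8, 10)) * 1000).pow(2).mean()
loss.backward()

CLIP = 1.0

# rescales the gradients in place and returns the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)

# recompute the total norm over all parameter gradients after clipping
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(float(norm_before), float(norm_after))
```

Clipping rescales all gradients by the same factor, so the direction of the update is preserved and only its size is limited.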

...and the evaluation loop, remembering to set the model to `eval` mode and turn off teacher forcing.

In [20]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg
            
            # teacher_forcing_ratio = 0: the model must not see the target tokens as inputs
            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Finally, define a function that reports how long an epoch takes.

In [21]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
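As a quick usage check, the sketch below repeats the function so that it runs on its own, calls it with a made-up duration, and compares it with an equivalent formulation based on `divmod`. The name `epoch_time_divmod` is introduced here for illustration only.

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

def epoch_time_divmod(start_time, end_time):
    """Same result for non-negative durations, using whole seconds and divmod."""
    return divmod(int(end_time - start_time), 60)

# 4122.7 seconds is a made-up duration chosen for illustration
mins, secs = epoch_time(0.0, 4122.7)
print(f"{mins}m {secs}s")
print(epoch_time_divmod(0.0, 4122.7))
```

Both versions truncate fractional seconds rather than rounding them.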

Then, we train our model, saving the parameters that give us the best validation loss.

In [23]:
N_EPOCHS = 2
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        # save the model with the best validation loss so far
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 68m 42s
	Train Loss: 4.301 | Train PPL:  73.788
	 Val. Loss: 4.516 |  Val. PPL:  91.463
Epoch: 02 | Time: 66m 45s
	Train Loss: 3.653 | Train PPL:  38.576
	 Val. Loss: 3.781 |  Val. PPL:  43.850
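The PPL column in the log is the exponential of the loss. The sketch below recomputes it from the rounded losses printed above, so the results differ slightly from the logged values, which were computed from the unrounded losses. The `perplexity` helper is defined here for illustration.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# rounded losses copied from the training log
for name, loss in [("train, epoch 1", 4.301), ("valid, epoch 2", 3.781)]:
    print(f"{name}: loss {loss:.3f} -> PPL {perplexity(loss):.3f}")

# reference point: a model guessing uniformly over the 5,893-word target
# vocabulary has a perplexity equal to the vocabulary size
uniform_ppl = perplexity(math.log(5893))
print(f"uniform guess over the target vocabulary: PPL {uniform_ppl:.1f}")
```

A perplexity of about 44 can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly 44 words at each step.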


Finally, we test the model on the test set using these "best" parameters.

We load the parameters that gave the best validation loss and measure the test loss and perplexity.

In [24]:
# load the weights from the epoch with the best validation loss
model.load_state_dict(torch.load('tut3-model.pt'))

# measure the test loss
test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.755 | Test PPL:  42.716 |


We've improved on the previous model, but this came at the cost of doubling the training time.

In the next notebook, we'll be using the same architecture but using a few tricks that are applicable to all RNN architectures - packed padded sequences and masking. We'll also implement code which will allow us to look at what words in the input the RNN is paying attention to when decoding the output.


