This notebook walks through a seq2seq model, following the paper and the https://github.com/bentrevett/pytorch-seq2seq repository.

In [62]:
import os
import re
from os.path import join as pjoin
import numpy as np
#from tqdm import tqdm
from tqdm.notebook import tqdm
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split

import random
import math
import time

from konlpy.tag import Okt
okt = Okt()

SEED = 111
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device("cpu")  # overrides the line above and forces CPU; remove this line to use CUDA when available

This model is a seq2seq model with attention, implemented with the PyTorch library.

In [96]:
def listDir(mypath):
    onlyfiles = [pjoin(mypath, f) for f in os.listdir(mypath)]
    return onlyfiles
file_list = listDir("korean_data")
file_list

['korean_data/1_구어체(1)_191213.xlsx',
 'korean_data/2_대화체_191213.xlsx',
 'korean_data/3_문어체_뉴스(1)_191213.xlsx',
 'korean_data/6_문어체_지자체웹사이트_191213.xlsx',
 'korean_data/3_문어체_뉴스(3)_191213.xlsx',
 'korean_data/5_문어체_조례_191213.xlsx',
 'korean_data/4_문어체_한국문화_191213.xlsx',
 'korean_data/3_문어체_뉴스(4)_191213.xlsx',
 'korean_data/1_구어체(2)_191213.xlsx',
 'korean_data/3_문어체_뉴스(2)_191213.xlsx']

In [39]:
df_kor_en = pd.read_excel(file_list[0],
                          index_col="SID")
df_kor_en

SID,원문,번역문
1,'Bible Coloring'은 성경의 아름다운 이야기를 체험 할 수 있는 컬러링 ...,Bible Coloring' is a coloring application that...
2,씨티은행에서 일하세요?,Do you work at a City bank?
3,푸리토의 베스트셀러는 해외에서 입소문만으로 4차 완판을 기록하였다.,"PURITO's bestseller, which recorded 4th rough ..."
4,11장에서는 예수님이 이번엔 나사로를 무덤에서 불러내어 죽은 자 가운데서 살리셨습니다.,In Chapter 11 Jesus called Lazarus from the to...
5,"6.5, 7, 8 사이즈가 몇 개나 더 재입고 될지 제게 알려주시면 감사하겠습니다.",I would feel grateful to know how many stocks ...
...,...,...
199996,나는 먼저 청소기로 바닥을 밀었어요.,"First of all, I vacuumed the floor."
199997,나는 먼저 팀 과제를 하고 놀러 갔어요.,I did the team assignment first and went out t...
199998,나는 비 같은 멋진 연예인을 좋아해요.,I like cool entertainer like Rain.
199999,나는 멋진 자연 경치를 보고 눈물을 흘렸어.,I cried seeing the amazing scenery.


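The Korean-English translation file above is one of the datasets managed by the National Information Society Agency (한국정보화진흥원). Extracting Korean words from its 200,000 sentence pairs yielded about 50,000 distinct words. With a vocabulary that large, the embedding layers become very large, so GPU memory ran out or training took too long for testing to be practical. Training was therefore done on a French-English dataset with a much smaller vocabulary.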

In [60]:
#ko_sequences = []
#en_sequences = []
#for idx, row in tqdm(df_kor_en.iterrows(), total=200000):
#    kor, en = row["원문"], row["번역문"]
#    cleaned_kor = re.sub('[\.\?\!\,]+','', kor)
#    cleaned_en = re.sub('[\.\?\!\,]+','', en)
#    ko_sequences.append(okt.morphs(cleaned_kor))
#    en_sequences.append(cleaned_en.split())
    
def read_file(file_name):
    print("START LOAD {}".format(file_name))
    with open(pjoin("data", file_name), "r", encoding="utf8") as fin:
        lines = fin.readlines()
        sentences = [line.split() for line in lines]
    return sentences

fr_sequences = read_file("small_vocab_fr")
en_sequences = read_file("small_vocab_en")

START LOAD small_vocab_fr
START LOAD small_vocab_en


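The training data has a comparatively simple vocabulary: roughly 350 French words and 250 English words (the tokenizer output further below reports dictionary sizes of 350 and 230). With 137,860 sentence pairs, there is enough data for training to work well.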

In [41]:
class TranslationData(Dataset):
    
    def __init__(self, from_sentences, to_sentences):
        self.init_token = "<sos>"
        self.end_token = "<eos>"
        self.end_token_pivot = 2
        self.from_sequences, self.from_word_dict = self.tokenize(from_sentences)
        self.to_sequences, self.to_word_dict = self.tokenize(to_sentences)
        
        self.source_dim = len(self.from_word_dict)
        self.target_dim = len(self.to_word_dict)
    
    def _dict_reverse(self, dictionary, value):
        for k, v in dictionary.items():
            if v == value:
                return k
        raise KeyError
        
    def tokenize(self, sentences):
        # Reserve ids 1 and 2 for the start and end tokens (0 is kept for rare words).
        word_dict = {
            self.init_token: 1,
            self.end_token: 2,
        }
        word_counter = {
            1: len(sentences),
            2: len(sentences)
        }
        pivot = 3
        sequences = []
        for words in sentences:
            for word in words:
                if word not in word_dict:
                    word_dict[word] = pivot
                    word_counter[pivot] = 1
                    pivot = pivot + 1
                else:
                    word_counter[word_dict[word]] += 1
        
        for pivot, count in word_counter.items():
            if count == 1:
                word = self._dict_reverse(word_dict, pivot)
                word_dict[word] = 0
        
        start_pivot = 3
        for key in word_dict.keys():
            if word_dict[key] > 2:
                word_dict[key] = start_pivot
                start_pivot += 1
        
        for words in sentences:
            words = [self.init_token] + words + [self.end_token]
            tokens = [word_dict[w] for w in words]
            sequences.append(tokens)
        
        max_len = max(map(len, sequences))
        
        word_dict = {v: k for k, v in word_dict.items() if v > 0}
        word_dict[0] = "<NONE>"
        print("MAX LEN : {}".format(max_len))
        print("TOTAL SEQ: {}".format(len(sequences)))
        print("WORD DICT LEN: {}".format(len(word_dict)))
        for sequence in sequences:
            seq_len = len(sequence)
            sequence.extend([self.end_token_pivot] * (max_len - seq_len))
        return torch.tensor(sequences), word_dict
    
    def __len__(self):
        return len(self.from_sequences)
    
    def __getitem__(self, index):
        return self.from_sequences[index], self.to_sequences[index]
    

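The Dataset used for training returns each source sentence together with its corresponding target sentence. For every sentence, the words are collected into a vocabulary and replaced by integer ids, and tokens marking the start and end of the sentence are added at both ends. All sentences are then padded to the length of the longest sentence (the padding reuses the `<eos>` id).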
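The steps above can be sketched on a toy corpus. This is a simplified stand-in for `TranslationData.tokenize`, not the class itself: the helper name `encode_and_pad` and the two example sentences are made up for illustration, and the rare-word filtering is left out.

```python
SOS, EOS = "<sos>", "<eos>"

def encode_and_pad(sentences):
    """Map words to ids, wrap each sentence in <sos>/<eos>, pad with the <eos> id."""
    word_to_id = {SOS: 1, EOS: 2}
    for words in sentences:
        for word in words:
            # setdefault only assigns the next free id the first time a word is seen
            word_to_id.setdefault(word, len(word_to_id) + 1)

    sequences = [[word_to_id[w] for w in [SOS] + words + [EOS]]
                 for words in sentences]
    max_len = max(map(len, sequences))
    # pad every sequence to the longest one, reusing the <eos> id as padding
    padded = [seq + [word_to_id[EOS]] * (max_len - len(seq)) for seq in sequences]
    return padded, word_to_id

toy = [["new", "jersey", "is", "cold"], ["it", "is", "cold"]]
padded, vocab = encode_and_pad(toy)
print(padded)  # [[1, 3, 4, 5, 6, 2], [1, 7, 5, 6, 2, 2]]
```

Because every row has the same length, the result can be handed to `torch.tensor` directly, which is what the class does.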

In [18]:
class Encoder(nn.Module):

    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        """
        :param input_dim: the size of the one-hot vectors that will be input
        :param emb_dim: the dimensionality of the embedding layer
        :param enc_hid_dim: the dimensionality of the encoder hidden states
        :param dec_hid_dim: the dimensionality of the decoder hidden states
        :param dropout: amount of dropout to use
        """
        super().__init__()

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)))
        return outputs, hidden

A Seq2Seq model consists of an encoder and a decoder. The encoder reads the input sentence and passes its hidden state on to the decoder. The inputs have already been padded to a common length, and the GRU layer here is bidirectional. Each sequence goes through the embedding layer, which turns every word into an embedding vector, and then through the GRU layer, which produces the hidden states.
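To keep track of what `Encoder.forward` produces, the tensor shapes at each step can be written down without running PyTorch. The sketch below is plain shape bookkeeping; the function name `encoder_shapes` is hypothetical, and the example values (source length 25, batch size 256) are taken from the output and `DataLoader` settings elsewhere in this notebook.

```python
def encoder_shapes(src_len, batch_size, emb_dim, enc_hid_dim, dec_hid_dim):
    """Return the tensor shape at each step of Encoder.forward."""
    return {
        "src": (src_len, batch_size),
        "embedded": (src_len, batch_size, emb_dim),
        # a bidirectional GRU concatenates forward and backward states per time step
        "outputs": (src_len, batch_size, enc_hid_dim * 2),
        # final hidden state of the GRU: one slice per direction
        "rnn_hidden": (2, batch_size, enc_hid_dim),
        # hidden[-2] and hidden[-1] are concatenated before the linear layer
        "fc_input": (batch_size, enc_hid_dim * 2),
        # tanh(fc(...)) becomes the decoder's initial hidden state
        "hidden": (batch_size, dec_hid_dim),
    }

shapes = encoder_shapes(src_len=25, batch_size=256,
                        emb_dim=128, enc_hid_dim=256, dec_hid_dim=256)
for name, shape in shapes.items():
    print(f"{name:>10}: {shape}")
```

The `fc_input` width of 512 is why the encoder's linear layer is declared as `nn.Linear(enc_hid_dim * 2, dec_hid_dim)`.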

In [19]:
class Attention(nn.Module):

    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear(enc_hid_dim * 2 + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)

    def forward(self, hidden, encoder_outputs):
        """
        :param hidden: [batch size, dec hid dim]
        :param encoder_outputs: [src len, batch size, enc hid dim*2]
        merge hidden states of decoder and bidrectional output of encoder
        """
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        attention = self.v(energy).squeeze(2)
        return F.softmax(attention, dim=1)

A plain seq2seq model conveys the encoder's state by feeding the encoder's final hidden state into each decoder step. An attention layer can supply additional information. Here the attention layer is built from linear layers and takes two inputs: the bidirectional encoder outputs and the current decoder hidden state. At each decoding step, attention scores every encoder output and decides how much weight to give each word of the input.
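The scoring rule in `Attention.forward` can be followed by hand on a single example. The sketch below re-implements it in plain Python for one sentence with tiny dimensions; the function name `additive_attention` and all weight values are made up for illustration, whereas the real layer learns `W`, `b`, and `v`.

```python
import math

def additive_attention(hidden, encoder_outputs, W, b, v):
    """Return one attention weight per source position for a single example."""
    scores = []
    for enc_out in encoder_outputs:
        x = hidden + enc_out  # list concatenation: [decoder hidden ; encoder output]
        # energy = tanh(W x + b), one value per row of W
        energy = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
                  for row, bi in zip(W, b)]
        # project the energy vector down to a single score with v
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))
    # softmax over source positions (shifted by the max for numerical stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [0.5, -0.5]                                   # decoder hidden state, size 2
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions, size 2
W = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.4, -0.1, 0.2]]      # maps size 4 to size 2
b = [0.0, 0.1]
v = [1.0, -1.0]

weights = additive_attention(hidden, encoder_outputs, W, b, v)
print(weights, sum(weights))
```

The weights are positive and sum to 1, so they can be read as a distribution over the words of the input sentence.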

In [20]:
class Decoder(nn.Module):

    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim,
                 dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention

        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim + enc_hid_dim * 2, dec_hid_dim)
        self.fc_out = nn.Linear(emb_dim + 2 * enc_hid_dim + dec_hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, ipt, hidden, encoder_outputs):
        ipt = ipt.unsqueeze(0)
        embedded = self.dropout(self.embedding(ipt))
        attn = self.attention(hidden, encoder_outputs)
        attn = attn.unsqueeze(1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        weighted = torch.bmm(attn, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        assert (output == hidden).all()

        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim=1))
        
        return prediction, hidden.squeeze(0)

The decoder resembles the encoder, except that it takes and emits one word at a time. At every step it uses the encoder outputs weighted by the attention weights, and feeds that context, together with the embedding of the current target word, into the GRU layer. Each decoder step sees one word and predicts the next one.
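The `torch.bmm(attn, encoder_outputs)` call in the decoder is a weighted sum of encoder outputs. A minimal plain-Python sketch for a single example follows; the helper name `context_vector` and the toy numbers are illustrative only. The last lines recompute the layer input widths from the default hyperparameters used in this notebook.

```python
def context_vector(attn_weights, encoder_outputs):
    """Weighted sum of encoder outputs for one example (bmm does this per batch item)."""
    dim = len(encoder_outputs[0])
    return [sum(w * out[d] for w, out in zip(attn_weights, encoder_outputs))
            for d in range(dim)]

attn_weights = [0.7, 0.2, 0.1]                          # one weight per source position
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = context_vector(attn_weights, encoder_outputs)
print(context)  # approximately [0.8, 0.3]

# The context is concatenated with the word embedding before the GRU,
# and with both the GRU output and the embedding before the output layer.
emb_dim, enc_hid_dim, dec_hid_dim = 128, 256, 256
rnn_input_size = emb_dim + enc_hid_dim * 2
fc_out_input_size = dec_hid_dim + enc_hid_dim * 2 + emb_dim
print(rnn_input_size, fc_out_input_size)  # 640 896
```

These two widths agree with `GRU(640, 256)` and `in_features=896` in the printed model summary.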

In [21]:
class Seq2Seq(nn.Module):

    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        """
        :param src: shape (src len, batch size)
        :param trg: shape (trg len, batch size)
        :param teacher_forcing_ratio: probability to use teacher forcing
        """
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        encoder_outputs, hidden = self.encoder(src)
        ipt = trg[0, :]

        for t in range(1, trg_len):

            output, hidden = self.decoder(ipt, hidden, encoder_outputs)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            ipt = trg[t] if teacher_force else top1

        return outputs

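The full Seq2Seq model combines the encoder and the decoder. The whole source sentence is run through the encoder once, and the resulting outputs are used each time the decoder predicts a word of the target sentence. Unlike the encoder, the decoder predicts word by word. The decoder's next input is not always its own previous prediction: with probability teacher_forcing_ratio the ground-truth word is used instead, which limits how errors accumulate as the sentence gets longer. In other words, each step randomly feeds in either the original data or the result of the previous decoding step.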
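The teacher-forcing choice can be isolated into a few lines. The sketch below mirrors the `random.random() < teacher_forcing_ratio` test in `Seq2Seq.forward`; the function name `next_decoder_input`, the placeholder strings, and the seeded generator are assumptions added for the demonstration.

```python
import random

def next_decoder_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Choose the next decoder input: ground truth with the given probability."""
    use_teacher = rng.random() < teacher_forcing_ratio
    return target_token if use_teacher else predicted_token

rng = random.Random(0)  # seeded only so the demonstration is repeatable
picks = [next_decoder_input("gold", "pred", 0.5, rng) for _ in range(10_000)]
print(picks.count("gold") / len(picks))  # close to 0.5

# ratio 0.0 never uses the ground truth, which is how evaluate() calls the model
always_pred = [next_decoder_input("gold", "pred", 0.0, rng) for _ in range(100)]
print(set(always_pred))  # {'pred'}
```

A ratio of 1.0 would train only on ground-truth inputs, while 0.0 makes the decoder rely entirely on its own predictions, as it must at inference time.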

In [93]:
class Train:

    def __init__(self, from_seq, to_seq,
                 enc_emb_dim=128, dec_emb_dim=128,
                 enc_hid_dim=256, dec_hid_dim=256,
                 enc_dropout=0.3, dec_dropout=0.3,
                 epochs=15):
        #self.data = Data()
        self.data = TranslationData(from_seq, to_seq)
        data_len = len(self.data)
        train_num = int(data_len * 0.8)
        valid_num = int(data_len * 0.1)
        test_num = data_len - train_num - valid_num
        train, valid, test = random_split(self.data, [train_num, valid_num, test_num])
        self.train_iter = DataLoader(train, batch_size = 256, shuffle=True)
        self.valid_iter = DataLoader(valid, batch_size = 256, shuffle=False)
        self.test_iter = DataLoader(test, batch_size = 256, shuffle=False)
        self.input_dim = self.data.source_dim
        self.output_dim = self.data.target_dim

        self.enc_emb_dim = enc_emb_dim
        self.dec_emb_dim = dec_emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.enc_dropout = enc_dropout
        self.dec_dropout = dec_dropout

        self.encoder = Encoder(self.input_dim,
                               self.enc_emb_dim,
                               self.enc_hid_dim,
                               self.dec_hid_dim,
                               self.enc_dropout)
        self.attention = Attention(self.enc_hid_dim, self.dec_hid_dim)
        self.decoder = Decoder(self.output_dim,
                               self.dec_emb_dim,
                               self.enc_hid_dim,
                               self.dec_hid_dim,
                               self.dec_dropout,
                               self.attention)
        self.model = Seq2Seq(self.encoder, self.decoder, device).to(device)

        self.epochs = epochs
        self.criterion = nn.CrossEntropyLoss(ignore_index = self.data.end_token_pivot)

    @staticmethod
    def init_weights(m):
        for name, param in m.named_parameters():
            if 'weight' in name:
                nn.init.normal_(param.data, mean=0, std=0.01)
            else:
                nn.init.constant_(param.data, 0)

    def count_parameters(self, model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    def train(self, epoch, iterator, optimizer, criterion, clip):
        self.model.train()
        epoch_loss = 0
        for i, batch in enumerate(iterator):
            src = batch[0].transpose_(0, 1).to(device)
            trg = batch[1].transpose_(0, 1).to(device)
            
            optimizer.zero_grad()
            output = self.model(src, trg)
            # trg = [trg len, batch size]
            # output = [trg len, batch size, output dim]
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].reshape(-1)
            #trg = [(trg len -1) * batch size]
            #output = [(trg len -1) * batch size, output dim]

            loss = criterion(output, trg)
            loss.backward()

            torch.nn.utils.clip_grad_norm_(self.model.parameters(), clip)
            optimizer.step()
            epoch_loss += loss.item()
        return epoch_loss / len(iterator)

    def evaluate(self, iterator, criterion):
        self.model.eval()
        epoch_loss = 0
        with torch.no_grad():
            for i, batch in enumerate(iterator):
                src = batch[0].transpose_(0, 1).to(device)
                trg = batch[1].transpose_(0, 1).to(device)
                #src = batch.src
                #trg = batch.trg
                output = self.model(src, trg, 0.0)

                #trg = [trg len, batch size]
                #output = [trg len, batch size, output dim]

                output_dim = output.shape[-1]
                output = output[1:].view(-1, output_dim)
                trg = trg[1:].reshape(-1)
                loss = criterion(output, trg)
                epoch_loss += loss.item()
        return epoch_loss / len(iterator)

    def _epoch_time(self, start_time, end_time):
        elapsed_time = end_time - start_time
        elapsed_mins = int(elapsed_time / 60)
        elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
        return elapsed_mins, elapsed_secs

    def test(self):
        self.model.load_state_dict(torch.load(pjoin('model', 'attention.pt')))
        test_loss = self.evaluate(self.test_iter, self.criterion)
        print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')
        print("TEST INFERENCE")
        with torch.no_grad():
            for i, batch in enumerate(self.test_iter):
                src = batch[0][:5,:].transpose_(0, 1).to(device)
                trg = batch[1][:5,:].transpose_(0, 1).to(device)
                #src = batch.src
                #trg = batch.trg
                output = self.model(src, trg, 0.0)

                #trg = [trg len, batch size]
                #output = [trg len, batch size, output dim]

                output_dim = output.shape[-1]
                output = output[1:].view(-1, output_dim)
                _, output = output.max(-1)
                output = output.reshape((-1, 5))
                trg = trg[1:].transpose(0,1)
                src = src.transpose(0,1)
                output = output.transpose(0,1)
                for s, t, o in zip(src, trg, output):
                    #print(s, t, o)
                    print("CASE")
                    print(" ".join([self.data.from_word_dict[int(_s)] for _s in s]))
                    print(" ".join([self.data.to_word_dict[int(_t)] for _t in t]))
                    print(" ".join([self.data.to_word_dict[int(_o)] for _o in o]))
                    print()

    def run(self):
        self.model.apply(self.init_weights)
        print(self.model)
        print("Model trainable parametes: {}".format(self.count_parameters(self.model)))

        optimizer = optim.Adam(self.model.parameters())

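        CLIP = 1
        # torch.save does not create missing directories, so make sure 'model' exists.
        os.makedirs('model', exist_ok=True)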
        best_valid_loss = float('inf')
        for epoch in range(self.epochs):
            start_time = time.time()
            train_loss = self.train(epoch, self.train_iter, optimizer, self.criterion, CLIP)
            valid_loss = self.evaluate(self.valid_iter, self.criterion)
            end_time = time.time()

            epoch_mins, epoch_secs = self._epoch_time(start_time, end_time)
            if valid_loss < best_valid_loss:
                best_valid_loss = valid_loss
                torch.save(self.model.state_dict(), pjoin('model', 'attention.pt'))
            print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
            print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
            print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')


The data was split into train, validation, and test sets at a 0.8 / 0.1 / 0.1 ratio. The model was trained with CrossEntropyLoss and evaluated with perplexity (PPL), which the code computes as exp(loss).
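Perplexity is the exponential of the mean cross-entropy, so it can be read as the effective number of words the model is choosing between. A short sketch follows, with the helper name `perplexity` introduced here for illustration; the loss values are the rounded ones printed in the training log below, so the results match the logged PPL only approximately.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(mean_cross_entropy)

# Losses as printed (rounded to 3 decimals) for epochs 1 and 9 of the log.
for loss in (1.925, 0.043):
    print(f"loss {loss:.3f} -> PPL {perplexity(loss):.3f}")

# A model guessing uniformly over a 230-word vocabulary has loss ln(230),
# which corresponds to a perplexity of about 230.
uniform_ppl = perplexity(math.log(230))
print(round(uniform_ppl, 3))
```

A perplexity close to 1 therefore means the model is nearly certain about each next word.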

In [95]:
train = Train(fr_sequences, en_sequences)
train.run()
train.test()

MAX LEN : 25
TOTAL SEQ: 137860
WORD DICT LEN: 350
MAX LEN : 19
TOTAL SEQ: 137860
WORD DICT LEN: 230
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(350, 128)
    (rnn): GRU(128, 256, bidirectional=True)
    (fc): Linear(in_features=512, out_features=256, bias=True)
    (dropout): Dropout(p=0.3, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=768, out_features=256, bias=True)
      (v): Linear(in_features=256, out_features=1, bias=False)
    )
    (embedding): Embedding(230, 128)
    (rnn): GRU(640, 256)
    (fc_out): Linear(in_features=896, out_features=230, bias=True)
    (dropout): Dropout(p=0.3, inplace=False)
  )
)
Model trainable parametes: 1891558




Epoch: 01 | Time: 8m 13s
	Train Loss: 1.925 | Train PPL:   6.853
	 Val. Loss: 1.128 |  Val. PPL:   3.088
Epoch: 02 | Time: 8m 16s
	Train Loss: 0.843 | Train PPL:   2.323
	 Val. Loss: 0.737 |  Val. PPL:   2.090
Epoch: 03 | Time: 8m 14s
	Train Loss: 0.439 | Train PPL:   1.551
	 Val. Loss: 0.172 |  Val. PPL:   1.188
Epoch: 04 | Time: 7m 49s
	Train Loss: 0.106 | Train PPL:   1.112
	 Val. Loss: 0.083 |  Val. PPL:   1.086
Epoch: 05 | Time: 5m 40s
	Train Loss: 0.065 | Train PPL:   1.067
	 Val. Loss: 0.067 |  Val. PPL:   1.070
Epoch: 06 | Time: 5m 39s
	Train Loss: 0.053 | Train PPL:   1.054
	 Val. Loss: 0.059 |  Val. PPL:   1.061
Epoch: 07 | Time: 5m 39s
	Train Loss: 0.047 | Train PPL:   1.048
	 Val. Loss: 0.051 |  Val. PPL:   1.052
Epoch: 08 | Time: 5m 39s
	Train Loss: 0.042 | Train PPL:   1.043
	 Val. Loss: 0.045 |  Val. PPL:   1.046
Epoch: 09 | Time: 5m 39s
	Train Loss: 0.040 | Train PPL:   1.041
	 Val. Loss: 0.043 |  Val. PPL:   1.044
Epoch: 10 | Time: 5m 40s
	Train Loss: 0.045 | Train PPL

the united states is freezing during summer , but it is never beautiful in july . . .

CASE
<sos> la france est humide au mois de mai , mais il est parfois beau en août . <eos> <eos> <eos> <eos> <eos> <eos> <eos>
france is wet during may , but it is sometimes beautiful in august . <eos> <eos> <eos> <eos>
france is wet during may , but it is sometimes beautiful in august . . . . .

CASE
<sos> californie est généralement pluvieux en avril , et il est calme à l' automne . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
california is usually rainy during april , and it is quiet in autumn . <eos> <eos> <eos> <eos>
california is usually rainy during april , and it is quiet in fall . autumn . autumn .

CASE
<sos> chine est occupé en octobre , mais il est parfois sec en mai . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
china is busy during october , but it is sometimes dry in may . <eos> <eos> <eos> <eos>
china is busy during october , but it is sometimes dry in may . . .

CASE
<sos> l' inde est parfois agréable en janvier , mais il gèle habituellement en hiver . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
india is sometimes pleasant during january , but it is usually freezing in winter . <eos> <eos> <eos>
india is sometimes pleasant during january , but it is usually freezing in winter . . winter .

CASE
<sos> californie est généralement agréable en janvier , et il est jamais humide en novembre . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
california is usually pleasant during january , and it is never wet in november . <eos> <eos> <eos>
california is usually nice during january , and it is never wet in november . . november .

CASE
<sos> france est chaud en mars , mais il est jamais merveilleux à l' automne . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
france is hot during march , but it is never wonderful in autumn . <eos> <eos> <eos> <eos>
france is warm during march , but it is never wonderful in fall . . . . .

CASE
<s

india is usually beautiful during december , and it is snowy in winter . . . . .

CASE
<sos> la france est généralement humide en mars , mais en été est généralement sec . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
france is usually wet during march , but it is usually dry in summer . <eos> <eos> <eos>
france is usually wet during march , but it is usually dry in summer . . . .

CASE
<sos> nos fruits le plus aimé est le pamplemousse , mais votre plus aimé est la poire . <eos> <eos> <eos> <eos> <eos> <eos> <eos>
our most loved fruit is the grapefruit , but your most loved is the pear . <eos> <eos>
our most loved fruit is the grapefruit , but your most loved is the pear . the pear

CASE
<sos> elle conduit ce camion rouge brillant . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
she is driving that shiny red truck . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
she drives that shiny red truck . . . . . . . . .

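Because the dataset itself has a small vocabulary, training went well. Checking inference on the test set shows that the results are also good: the only differences are near-synonyms such as hot and warm. The repeated tokens after the sentence-final period are expected, because the loss ignores the `<eos>` id (it doubles as padding), so the model is never trained to emit `<eos>`.
Each test CASE shows, in order:

- the input
- the actual output
- the predicted output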

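The training data has no proper nouns and a small vocabulary, which made training easy. As with the Korean data, however, once the vocabulary grows the model becomes very large and training becomes much harder. Translation models share this basic limitation, so even with a transformer model the embedding approach would likely need to change.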
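The effect of vocabulary size on model size can be estimated with simple arithmetic. The sketch below counts trainable parameters for the architecture printed above, assuming the standard `torch.nn.GRU` layout (three gates, two bias vectors per direction). The helper names are hypothetical, and the 50,000-word target vocabulary in the last call is an assumption for illustration, not a measured figure.

```python
def gru_params(input_dim, hidden_dim, bidirectional=False):
    """Parameters of a single-layer GRU: 3 gates with input weights, hidden weights, 2 biases."""
    per_direction = 3 * hidden_dim * (input_dim + hidden_dim + 2)
    return per_direction * (2 if bidirectional else 1)

def linear_params(in_features, out_features, bias=True):
    return in_features * out_features + (out_features if bias else 0)

def seq2seq_params(src_vocab, trg_vocab, emb_dim=128, enc_hid=256, dec_hid=256):
    encoder = (src_vocab * emb_dim                       # source embedding table
               + gru_params(emb_dim, enc_hid, bidirectional=True)
               + linear_params(enc_hid * 2, dec_hid))
    attention = (linear_params(enc_hid * 2 + dec_hid, dec_hid)
                 + linear_params(dec_hid, 1, bias=False))
    decoder = (trg_vocab * emb_dim                       # target embedding table
               + gru_params(emb_dim + enc_hid * 2, dec_hid)
               + linear_params(emb_dim + enc_hid * 2 + dec_hid, trg_vocab))
    return encoder + attention + decoder

print(seq2seq_params(350, 230))        # 1891558, the count reported in the log
print(seq2seq_params(50_000, 230))     # only the source vocabulary grows
print(seq2seq_params(50_000, 50_000))  # assumed target vocabulary of the same size
```

The embedding tables and the output layer are the only terms that scale with the vocabulary, and with large vocabularies they dominate the parameter count.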