NMT by Jointly Learning to Align and Translate

Summary

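This paper introduced the concept of attention to NLP for the first time, under the name alignment. It lifts the restriction of earlier RNN encoder-decoder models, which had to compress all the information in the source sentence into a single fixed-length vector. Translation quality improves substantially on long sentences.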

AS-IS

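The previous state of the art in machine translation was the traditional phrase-based translation system, which is built from many separate components. After Yoshua Bengio published the neural probabilistic language model in 2003, neural networks were widely used in translation as well. Even so, they only served as supporting modules inside an existing system, or were used to re-rank or rescore its final candidate translations.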

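The idea of Neural Machine Translation (NMT) first appeared around 2013 to 2014. In contrast to existing systems, NMT aims to solve the translation problem with a single model. However, it had not yet surpassed the performance of existing systems.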

Problem

Existing NMT models are RNN encoder-decoder models. The encoder embeds the source sentence into a fixed-length vector, and the decoder takes this embedding and generates the translated sentence.

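The problem is that the source sentence is embedded into a vector of fixed size no matter how long it is. The longer the sentence, the more information it carries, so some of that information will be lost when it is embedded. A model architecture that overcomes this constraint is needed.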

Propose

Architecture

Decoder : RNN with alignment model
- The hidden states of the encoder are called annotations.
- The alignment model combines each annotation with the decoder's previous hidden state to score how well the next word to be generated matches each position in the source sentence. It is a feedforward neural network, and its output is called the energy.
- Applying softmax to the energies gives a weight for each word in the source sentence.
- The weighted average of the annotations under these weights is the context vector.
- The next hidden state of the decoder is determined by (1) the context vector, (2) the decoder's previous hidden state, and (3) the previous word emitted by the decoder.
Encoder : Bidirectional gated RNN
- Why bidirectional: the annotation of a word is the concatenation of the hidden states of the forward and backward RNNs at that word. Because an RNN remembers recent inputs most strongly, each annotation mainly reflects the words around its own position. This is why it can be called an annotation.
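
The way annotations are built from a bidirectional encoder can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it swaps the gated RNN for a toy scalar tanh RNN, and the names `rnn_states` and `annotate` as well as all weights and inputs are made up for the example.

```python
import math

def rnn_states(inputs, w_in, w_rec):
    """Run a toy scalar tanh RNN and return the hidden state at every step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def annotate(inputs, fwd=(0.5, 0.8), bwd=(0.3, 0.6)):
    """Pair forward and backward states into one annotation per position.

    fwd and bwd are hypothetical (w_in, w_rec) weights; the two directions
    have separate parameters.
    """
    forward = rnn_states(inputs, *fwd)
    # The backward RNN reads the sentence right to left; its states are
    # reversed again so that backward[j] lines up with source position j.
    backward = rnn_states(inputs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))

source = [1.0, -2.0, 0.5, 3.0]  # stand-in for embedded source words
result = annotate(source)
for j, h_j in enumerate(result):
    print(j, h_j)
```

Each annotation holds a forward part that has seen the words up to position j and a backward part that has seen the words from position j onward, so together they summarize the whole sentence with a focus on the neighborhood of word j.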

Advantage

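The gradient of the cost function can be backpropagated through the soft alignment. This gradient trains the alignment model and the whole translation model jointly.
As hypothesized, the model performs far better than the RNN encoder-decoder model on longer sentences.
Moreover, while learning to translate, the model also learns an alignment model that (softly) matches words in the source sentence with words in the translation. Because the language pair is French and English, most alignments are monotonic, but the model also captures the non-monotonic cases well.
On a specific task, it even scored higher than Moses, a complex conventional phrase-based translation system that was the existing state of the art. This is a remarkable result considering that Moses uses a monolingual corpus (418M words) in addition to the parallel corpus.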

Math

Translation

Finding a target sentence y that maximizes the conditional probability of y given a source sentence x

x : source sentence
y : target sentence
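
In symbols, with $\hat{y}$ as an assumed name for the chosen translation, this search can be written as:

$\hat{y} = \arg\max_{y} p(y|x)$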

Encoder - Decoder

Encoder

$h_t = f(x_t, h_{t-1})$
$c = q(\{h_1, h_2, ..., h_{T_x}\})$
A typical choice uses an LSTM as f and $q(\{h_1, h_2, ..., h_{T_x}\}) = h_{T_x}$
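
The two encoder equations can be sketched as follows, assuming a toy scalar tanh RNN in place of the LSTM. The function name `encode`, the weights, and the inputs are hypothetical. The point is only that c has the same size no matter how long the source sentence is.

```python
import math

def encode(inputs, w_in=0.5, w_rec=0.8):
    """Return all hidden states h_1..h_Tx and the summary c = h_Tx."""
    h, states = 0.0, []
    for x_t in inputs:
        h = math.tanh(w_in * x_t + w_rec * h)  # h_t = f(x_t, h_{t-1})
        states.append(h)
    c = states[-1]  # q keeps only the last hidden state
    return states, c

short_states, short_c = encode([1.0, -2.0])
long_states, long_c = encode([1.0, -2.0, 0.5, 3.0, -1.0, 0.2])

# The number of hidden states grows with the sentence length...
print(len(short_states), len(long_states))
# ...but the summary handed to the decoder is a single value either way.
print(short_c, long_c)
```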

Decoder

$p(y) = \prod_{t=1}^Tp(y_t|\{y_1, ..., y_{t-1}\}, c)$

in RNN

$p(y_t|\{y_1, ..., y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$
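
As a small numeric illustration of the product above, the probability of a whole sentence is usually accumulated in log space. The per-step probabilities below are made-up numbers rather than results from the paper, and `sentence_log_prob` is a hypothetical helper name.

```python
import math

def sentence_log_prob(step_probs):
    """Sum log p(y_t | y_1..y_{t-1}, c) over time steps (log of the product)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities a decoder assigned to the words y_1..y_3.
step_probs = [0.5, 0.4, 0.25]
log_p = sentence_log_prob(step_probs)

# Exponentiating recovers the product 0.5 * 0.4 * 0.25, up to rounding.
print(math.exp(log_p))
```

Summing logs avoids the numerical underflow that multiplying many small probabilities would cause for long sentences.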

Learning to Align and Translate

Decoder

$p(y_i|\{y_1, ..., y_{i-1}\}, x) = g(y_{i-1}, s_i, c_i)$
$s_i = f(s_{i-1}, y_{i-1}, c_i)$
$c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j$
$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}{\exp(e_{ik})}}$
$e_{ij} =a(s_{i-1}, h_j)$
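
One decoder step of these equations can be sketched in plain Python. The sketch assumes the additive form $a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$ for the feedforward alignment model. The helper names and all numeric values are made up for illustration, and the max shift inside the softmax is a standard stability trick that does not change the result.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def attention(s_prev, annotations, W_a, U_a, v_a):
    """Return (alpha_i, c_i) for one decoder step i."""
    # e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j)
    ws = matvec(W_a, s_prev)
    energies = []
    for h_j in annotations:
        uh = matvec(U_a, h_j)
        energies.append(dot(v_a, [math.tanh(a + b) for a, b in zip(ws, uh)]))
    # alpha_ij = softmax of the energies over source positions j
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    total = sum(exps)
    alphas = [x / total for x in exps]
    # c_i = sum_j alpha_ij * h_j
    dim = len(annotations[0])
    context = [sum(a * h[k] for a, h in zip(alphas, annotations))
               for k in range(dim)]
    return alphas, context

# Toy values: 3 source positions, 2-dimensional annotations and decoder state.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]

alphas, context = attention(s_prev, annotations, W_a, U_a, v_a)
print(alphas)   # one weight per source position, summing to 1
print(context)  # weighted average of the annotations
```

Because every operation here is differentiable, the weights of the alignment model can be trained by the same gradient as the rest of the translation model.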
