The sequence to sequence (seq2seq) model[1][2] is a learning model that converts an input sequence into an output sequence. In this context, a sequence is a list of symbols, corresponding to the words in a sentence. The seq2seq model has achieved great success in fields such as machine translation, dialogue systems, question answering, and text summarization. All of these tasks can be regarded as learning a model that converts an input sequence into an output sequence.
The seq2seq model converts an input sequence into an output sequence. Let the input sequence and the output sequence be ${\bf X}$ and ${\bf Y}$:

$${\bf X} = ({\bf x}_1, {\bf x}_2, \dots, {\bf x}_I)$$

$${\bf Y} = ({\bf y}_1, {\bf y}_2, \dots, {\bf y}_J)$$

Let's think about the seq2seq model in the context of NLP. Let the vocabularies of the inputs and the outputs be $\mathcal{V}^{(s)}$ and $\mathcal{V}^{(t)}$; all the elements of ${\bf X}$ and ${\bf Y}$ satisfy ${\bf x}_i \in \mathcal{V}^{(s)}$ and ${\bf y}_j \in \mathcal{V}^{(t)}$. $I$ and $J$ are the lengths of the input sequence and the output sequence. Using the typical NLP notation, ${\bf y}_0$ is the one-hot vector of BOS, the virtual word representing the beginning of the sentence, and ${\bf y}_{J+1}$ is that of EOS, the virtual word representing the end of the sentence.

Next, let's think about the conditional probability $P({\bf Y}|{\bf X})$ of generating the output sequence ${\bf Y}$ when the input sequence ${\bf X}$ is given. The purpose of the seq2seq model is to model this probability. The model does not model $P({\bf Y}|{\bf X})$ directly; instead, it models the per-word probability $P({\bf y}_j \mid {\bf Y}_{<j}, {\bf X})$, where ${\bf Y}_{<j} = ({\bf y}_1, \dots, {\bf y}_{j-1})$ is the already generated part of the output. The probability $P({\bf Y}|{\bf X})$ is then written as the product:

$$P({\bf Y}|{\bf X}) = \prod_{j=1}^{J+1} P({\bf y}_j \mid {\bf Y}_{<j}, {\bf X})$$
Now, let's think about the processing steps in the seq2seq model. Its characteristic feature is that it consists of two processes:

- The process that generates the fixed-size vector ${\bf z}$ from the input sequence ${\bf X}$
- The process that generates the output sequence ${\bf Y}$ from ${\bf z}$

In other words, all the information of ${\bf X}$ must be conveyed to the decoding process through the single vector ${\bf z}$.
First, we represent the process that generates ${\bf z}$ from ${\bf X}$ by the function $\Lambda$:

$${\bf z} = \Lambda({\bf X})$$

The function $\Lambda$ may be a recurrent neural network such as an LSTM.
Second, we represent the process that generates ${\bf Y}$ from ${\bf z}$ by the following recurrence:

$${\bf h}_j^{(t)} = \Psi({\bf h}_{j-1}^{(t)}, {\bf y}_{j-1})$$

$$P({\bf y}_j \mid {\bf Y}_{<j}, {\bf X}) = \Upsilon({\bf h}_j^{(t)})$$

$\Psi$ is the function to generate the hidden vectors ${\bf h}_j^{(t)}$, and $\Upsilon$ is the function to calculate the generative probability of the one-hot vector ${\bf y}_j$ from the hidden vector. When $j = 1$, the previous hidden vector ${\bf h}_0^{(t)}$ is ${\bf z}$ generated by $\Lambda({\bf X})$, and ${\bf y}_0$ is the one-hot vector of BOS.
In this section, we describe the architecture of the seq2seq model. To simplify the explanation, we use the most basic architecture. It can be separated into the following five components:
- Encoder Embedding Layer
- Encoder Recurrent Layer
- Decoder Embedding Layer
- Decoder Recurrent Layer
- Decoder Output Layer
The encoder consists of two layers: the embedding layer and the recurrent layer, and the decoder consists of three layers: the embedding layer, the recurrent layer, and the output layer.
In the explanation, we use the following symbols:
Symbol | Definition |
---|---|
$H$ | the size of the hidden vector |
$D$ | the size of the embedding vector |
${\bf x}_i$ | the one-hot vector of the i-th word in the input sentence |
${\bf \bar x}_i$ | the embedding vector of the i-th word in the input sentence |
$E^{(s)}$ | the embedding matrix of the encoder |
${\bf h}_i^{(s)}$ | the i-th hidden vector of the encoder |
${\bf y}_j$ | the one-hot vector of the j-th word in the output sentence |
${\bf \bar y}_j$ | the embedding vector of the j-th word in the output sentence |
$E^{(t)}$ | the embedding matrix of the decoder |
${\bf h}_j^{(t)}$ | the j-th hidden vector of the decoder |
The first layer, the encoder embedding layer, converts each word in the input sentence to its embedding vector. When processing the i-th word in the input sentence, the input and the output of the layer are the following:

- The input is ${\bf x}_i$: the one-hot vector which represents the i-th word
- The output is ${\bf \bar x}_i$: the embedding vector which represents the i-th word

Each embedding vector is calculated by the following equation:

$${\bf \bar x}_i = E^{(s)} {\bf x}_i$$

where $E^{(s)}$ is the $D \times |\mathcal{V}^{(s)}|$ embedding matrix of the encoder.
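As a concrete illustration, this lookup can be written with `chainer.links.EmbedID`, which stores the embedding matrix and maps word IDs directly to embedding vectors. This is a minimal sketch with placeholder sizes, not the code of the example:

```python
import numpy as np
import chainer
import chainer.links as L

vocab_size, D = 1000, 100                        # placeholder sizes
embed_x = L.EmbedID(vocab_size, D)               # holds the embedding matrix E^(s)

word_ids = np.array([3, 41, 7], dtype=np.int32)  # an input sentence as word IDs
exs = embed_x(word_ids)                          # shape (3, D): one embedding per word
print(exs.shape)
```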
The encoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the i-th embedding vector, the input and the output of the layer are the following:
- The input is ${\bf \bar x}_i$: the embedding vector which represents the i-th word
- The output is ${\bf h}_i^{(s)}$: the hidden vector of the i-th position

For example, when using a uni-directional RNN of one layer, the process can be represented as the following function $\Psi^{(s)}$:

$${\bf h}_i^{(s)} = \Psi^{(s)}({\bf \bar x}_i, {\bf h}_{i-1}^{(s)})$$

In this case, the function $\Psi^{(s)}$ may be an LSTM, and the initial hidden vector ${\bf h}_0^{(s)}$ is typically the zero vector.
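In Chainer, such a recurrent layer over whole sequences can be expressed with `chainer.links.NStepLSTM`, which consumes a list of embedded sentences and returns the final states together with the per-position hidden vectors. A minimal sketch with placeholder sizes (passing `None` for the initial states starts from zero vectors):

```python
import numpy as np
import chainer
import chainer.links as L

D, H, n_layers = 100, 200, 1
encoder = L.NStepLSTM(n_layers, D, H, dropout=0.1)

# A mini-batch of two embedded sentences with different lengths.
exs = [np.random.randn(5, D).astype(np.float32),
       np.random.randn(3, D).astype(np.float32)]

# hx, cx: final hidden/cell states of shape (n_layers, batch, H)
# os: list of per-position hidden vectors h_i^(s), one array per sentence
hx, cx, os = encoder(None, None, exs)
print(hx.shape, os[0].shape)   # (1, 2, 200) (5, 200)
```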
The decoder embedding layer converts each word in the output sentence to its embedding vector. When processing the j-th word in the output sentence, the input and the output of the layer are the following:

- The input is ${\bf y}_{j-1}$: the one-hot vector which represents the (j − 1)-th word, generated by the decoder output layer
- The output is ${\bf \bar y}_j$: the embedding vector which represents the (j − 1)-th word

Each embedding vector is calculated by the following equation:

$${\bf \bar y}_j = E^{(t)} {\bf y}_{j-1}$$

where $E^{(t)}$ is the embedding matrix of the decoder.
The decoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the j-th embedding vector, the input and the output of the layer are the following:

- The input is ${\bf \bar y}_j$: the embedding vector
- The output is ${\bf h}_j^{(t)}$: the hidden vector of the j-th position

For example, when using a uni-directional RNN of one layer, the process can be represented as the following function $\Psi^{(t)}$:

$${\bf h}_j^{(t)} = \Psi^{(t)}({\bf \bar y}_j, {\bf h}_{j-1}^{(t)})$$

In this case, we use the final hidden vector of the encoder, ${\bf z}$, as the initial hidden vector ${\bf h}_0^{(t)}$.
The decoder output layer generates the probability of the j-th word of the output sentence from the hidden vector. When processing the j-th embedding vector, the input and the output of the layer are the following:

- The input is ${\bf h}_j^{(t)}$: the hidden vector of the j-th position
- The output is $p_j$: the probability of generating the one-hot vector ${\bf y}_j$ of the j-th word
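As an illustration, the output layer can be a linear projection from the hidden vector to vocabulary-size scores followed by a softmax; during training the scores are usually fed directly to a softmax cross-entropy loss. The sketch below uses placeholder sizes and is not the exact code of the example:

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

H, target_vocab_size = 200, 1000                 # placeholder sizes
W_out = L.Linear(H, target_vocab_size)           # projects h_j^(t) to vocabulary scores

h_t = np.random.randn(4, H).astype(np.float32)   # hidden vectors for 4 decoder positions
scores = W_out(h_t)                              # unnormalized scores
p = F.softmax(scores)                            # p_j: probability over the target vocabulary

# During training, the loss is computed directly from the scores:
targets = np.array([5, 12, 7, 3], dtype=np.int32)  # gold word IDs y_j
loss = F.softmax_cross_entropy(scores, targets)
```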
Note
There are many varieties of seq2seq models. We can use different RNN models in terms of: (1) directionality (unidirectional or bidirectional), (2) depth (single-layer or multi-layer), (3) type (a vanilla RNN, a Long Short-Term Memory (LSTM), or a gated recurrent unit (GRU)), and (4) additional functionality (such as an attention mechanism).
The official Chainer repository includes a neural machine translation example using the seq2seq model. We will now provide an overview of the example and explain its implementation in detail. chainer/examples/seq2seq
In this simple example, an input sequence is processed by a stacked LSTM-RNN (long short-term memory recurrent neural network) and encoded as a fixed-size vector. The output sequence is also processed by another stacked LSTM-RNN. At decoding time, each output word is chosen by argmax over the output probabilities.
First, let's import necessary packages.
../../../examples/seq2seq/seq2seq.py
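For reference, a typical set of imports for a Chainer seq2seq script looks like the following; this is a sketch, and the actual example may import additional helpers (e.g. progress bars or NLTK for BLEU):

```python
import argparse

import numpy as np

import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions
```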
Define all training settings here.
../../../examples/seq2seq/seq2seq.py
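The settings are collected with `argparse`. The sketch below reflects only the options visible in the run command later on this page (the source/target corpora, the vocabularies, `--validation-source`, `--validation-target`, and `--gpu`); the remaining hyper-parameter flags and their defaults are assumptions for illustration, not necessarily the example's exact names:

```python
parser = argparse.ArgumentParser(description='Chainer example: seq2seq')
parser.add_argument('SOURCE', help='source sentence list')
parser.add_argument('TARGET', help='target sentence list')
parser.add_argument('SOURCE_VOCAB', help='source vocabulary file')
parser.add_argument('TARGET_VOCAB', help='target vocabulary file')
parser.add_argument('--validation-source', help='source sentence list for validation')
parser.add_argument('--validation-target', help='target sentence list for validation')
parser.add_argument('--gpu', type=int, default=-1,
                    help='GPU ID (negative value indicates CPU)')
# The flags below are illustrative assumptions.
parser.add_argument('--batchsize', type=int, default=64)
parser.add_argument('--epoch', type=int, default=20)
parser.add_argument('--unit', type=int, default=1024, help='size of hidden/embedding vectors')
parser.add_argument('--layer', type=int, default=3, help='number of stacked LSTM layers')
args = parser.parse_args()
```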
The Chainer implementation of seq2seq is shown below. It implements the model described in the previous section.
../../../examples/seq2seq/seq2seq.py
- In `Seq2seq`, three functions are defined: the constructor `__init__`, the function call `forward`, and the function for translation `translate`.
../../../examples/seq2seq/seq2seq.py
- When we instantiate this class to make a model, we give the number of stacked LSTMs to `n_layers`, the vocabulary size of the source language to `n_source_vocab`, the vocabulary size of the target language to `n_target_vocab`, and the size of hidden vectors to `n_units`.
- This network uses `chainer.links.NStepLSTM`, `chainer.links.EmbedID`, and `chainer.links.Linear` as its building blocks. All the layers are registered and initialized inside the `self.init_scope()` context.
- You can access all the parameters in those layers by calling `self.params()`.
- In the constructor, all the parameters are initialized with values sampled from the uniform distribution $U(-1, 1)$.
../../../examples/seq2seq/seq2seq.py
- The `forward` method takes the sequences of source language word IDs `xs` and the sequences of target language word IDs `ys`. Each sequence represents a sentence, and the number of sequences in `xs` is the mini-batch size.
- Note that the sequences of word IDs `xs` and `ys` are converted to vocabulary-size one-hot vectors and then multiplied with the embedding matrix in `sequence_embed` to obtain the embedding vectors `exs` and `eys`.

../../../examples/seq2seq/seq2seq.py

- `self.encoder` and `self.decoder` are the encoder and the decoder of the seq2seq model. Each element of the decoder output `os` corresponds to the hidden vectors ${\bf h}_{1:J}^{(t)}$ described above.
- After calculating the recurrent layer output, the loss `loss` and the perplexity `perp` are calculated, and the values are logged by `chainer.report`.
Note
It is well known that the seq2seq model learns much better when the source sentences are reversed. The paper[1] says: "While the LSTM is capable of solving problems with long term dependencies, we discovered that the LSTM learns much better when the source sentences are reversed (the target sentences are not reversed). By doing so, the LSTM's test perplexity dropped from 5.8 to 4.7, and the test BLEU scores of its decoded translations increased from 25.9 to 30.6." So, in the first line of `forward`, the input sentences are reversed: `xs = [x[::-1] for x in xs]`.
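Putting the pieces above together, a sketch of `forward` could look like the following. It uses the attribute names from the constructor sketch; the helper `sequence_embed`, the `EOS` constant (used here as both the start and end marker), and the exact loss normalization are plausible assumptions rather than the example's verbatim code:

```python
EOS = 1  # assumed ID of the end-of-sentence token


def sequence_embed(embed, xs):
    """Apply an EmbedID link to a list of variable-length ID sequences at once."""
    x_len = [len(x) for x in xs]
    x_section = np.cumsum(x_len[:-1])
    ex = embed(F.concat(xs, axis=0))
    return F.split_axis(ex, x_section, axis=0)


def forward(self, xs, ys):  # a method of Seq2seq, shown unindented for readability
    xs = [x[::-1] for x in xs]                          # reverse the source sentences

    eos = self.xp.array([EOS], dtype=np.int32)
    ys_in = [F.concat([eos, y], axis=0) for y in ys]    # decoder inputs start with EOS
    ys_out = [F.concat([y, eos], axis=0) for y in ys]   # decoder targets end with EOS

    exs = sequence_embed(self.embed_x, xs)
    eys = sequence_embed(self.embed_y, ys_in)

    hx, cx, _ = self.encoder(None, None, exs)           # encode: final states become z
    _, _, os = self.decoder(hx, cx, eys)                # decode conditioned on z

    # Concatenate all positions, project to vocabulary scores, and compute the loss.
    concat_os = F.concat(os, axis=0)
    concat_ys_out = F.concat(ys_out, axis=0)
    batch = len(xs)
    loss = F.sum(F.softmax_cross_entropy(
        self.W(concat_os), concat_ys_out, reduce='no')) / batch

    chainer.report({'loss': loss.data}, self)
    perp = self.xp.exp(loss.data * batch / concat_ys_out.shape[0])
    chainer.report({'perp': perp}, self)
    return loss
```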
../../../examples/seq2seq/seq2seq.py
- After the model has learned its parameters, the function `translate` is called to generate the translated sentences `outs` from the source sentences `xs`.
- So as not to change the parameters, the code for translation is nested in the scopes `chainer.no_backprop_mode()` and `chainer.using_config('train', False)`.
In this tutorial, we use the French-English corpus from the WMT15 website (the so-called 10^9 corpus). We must prepare additional libraries, the dataset, and the parallel corpus; the required pre-processing is described in chainer/examples/seq2seq/README.md.
After pre-processing the dataset, let's make the dataset objects:
../../../examples/seq2seq/seq2seq.py
This code uses utility functions below:
../../../examples/seq2seq/seq2seq.py
../../../examples/seq2seq/seq2seq.py
../../../examples/seq2seq/seq2seq.py
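For illustration, the utility functions might look roughly like the following: one that reads a vocabulary file into a word-to-ID dictionary, and one that converts a tokenized corpus file into a list of ID arrays. The function names and the special-token handling are assumptions, not necessarily what the example uses; the file names in the usage part are the pre-processed files from the run command shown later:

```python
def load_vocabulary(path):
    """Map each word in the vocabulary file to an integer ID (one word per line)."""
    with open(path) as f:
        word_ids = {line.strip(): i + 2 for i, line in enumerate(f)}
    word_ids['<UNK>'] = 0     # unknown words
    word_ids['<EOS>'] = 1     # end-of-sentence marker
    return word_ids


def load_data(vocabulary, path):
    """Convert each tokenized sentence in the corpus file to an array of word IDs."""
    data = []
    with open(path) as f:
        for line in f:
            words = line.strip().split()
            ids = [vocabulary.get(w, vocabulary['<UNK>']) for w in words]
            data.append(np.array(ids, dtype=np.int32))
    return data


# Hypothetical usage:
source_ids = load_vocabulary('vocab.en')
target_ids = load_vocabulary('vocab.fr')
train_source = load_data(source_ids, 'giga-fren.preprocess.en')
train_target = load_data(target_ids, 'giga-fren.preprocess.fr')
train_data = list(zip(train_source, train_target))   # dataset of (source, target) pairs
```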
BLEU[3] (bilingual evaluation understudy) is an evaluation metric for the quality of text that has been machine-translated from one natural language to another.
../../../examples/seq2seq/seq2seq.py
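As a reference point, corpus-level BLEU can be computed with NLTK. The snippet below is a generic illustration of the metric, not the example's own evaluation code:

```python
from nltk.translate import bleu_score

# references: one list of acceptable translations per sentence; hypotheses: one candidate each.
references = [[['the', 'cat', 'is', 'on', 'the', 'mat']],
              [['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]]
hypotheses = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
              ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]

bleu = bleu_score.corpus_bleu(
    references, hypotheses,
    smoothing_function=bleu_score.SmoothingFunction().method1)
print('BLEU:', bleu)
```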
Here, the code below just creates iterator objects.
../../../examples/seq2seq/seq2seq.py
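Creating the iterators is essentially a one-liner per split with `chainer.iterators.SerialIterator`; a sketch using the `train_data` object and the assumed `--batchsize` flag from above (`test_data` is assumed to be built from the validation files in the same way as `train_data`):

```python
train_iter = chainer.iterators.SerialIterator(train_data, args.batchsize)
# The validation/test split gets a non-repeating, non-shuffling iterator.
test_iter = chainer.iterators.SerialIterator(test_data, args.batchsize,
                                             repeat=False, shuffle=False)
```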
Instantiate the `Seq2seq` model.
../../../examples/seq2seq/seq2seq.py
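Instantiation just wires the settings into the constructor described above. A sketch, using the assumed `--layer`/`--unit` flags and the vocabulary dictionaries from the dataset sketch:

```python
model = Seq2seq(args.layer, len(source_ids), len(target_ids), args.unit)
if args.gpu >= 0:
    chainer.backends.cuda.get_device_from_id(args.gpu).use()
    model.to_gpu(args.gpu)   # copy the parameters to the GPU
```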
Prepare an optimizer. We use `chainer.optimizers.Adam`.
../../../examples/seq2seq/seq2seq.py
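Setting up the optimizer is a minimal two-liner:

```python
optimizer = chainer.optimizers.Adam()
optimizer.setup(model)   # bind the optimizer to the model's parameters
```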
Let's make a trainer object.
../../../examples/seq2seq/seq2seq.py
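The trainer ties the iterator, optimizer, and model together. Because the model consumes lists of variable-length sequences, the updater needs a converter that keeps each mini-batch as lists of arrays rather than stacking them; the `convert` function below stands in for such a helper and is an assumption, as are the reporting intervals:

```python
def convert(batch, device):
    """Split a batch of (source, target) pairs into two lists of device arrays."""
    xs = [chainer.dataset.to_device(device, x) for x, _ in batch]
    ys = [chainer.dataset.to_device(device, y) for _, y in batch]
    return {'xs': xs, 'ys': ys}


updater = training.updaters.StandardUpdater(
    train_iter, optimizer, converter=convert, device=args.gpu)
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out='result')

# Log and print the reported values (loss and perplexity) periodically.
trainer.extend(extensions.LogReport(trigger=(200, 'iteration')))
trainer.extend(extensions.PrintReport(
    ['epoch', 'iteration', 'main/loss', 'main/perp', 'elapsed_time']),
    trigger=(200, 'iteration'))
```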
Set up the trainer's extension to see the BLEU score on the test data.
../../../examples/seq2seq/seq2seq.py
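One way to hook the BLEU computation into training is a custom trainer extension. The example defines its own extension class; the function-based sketch below only shows the idea, and the trigger interval and helper names are assumptions:

```python
@chainer.training.make_extension(trigger=(200, 'iteration'))
def calculate_bleu(trainer):
    with chainer.no_backprop_mode(), chainer.using_config('train', False):
        references, hypotheses = [], []
        for source, target in test_data:            # test_data: the validation pairs from above
            references.append([target.tolist()])
            ys = model.translate([model.xp.asarray(source)])[0]
            hypotheses.append([int(w) for w in ys])
    bleu = bleu_score.corpus_bleu(
        references, hypotheses,
        smoothing_function=bleu_score.SmoothingFunction().method1)
    print('validation BLEU:', bleu)


trainer.extend(calculate_bleu)
```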
Let's start the training!
../../../examples/seq2seq/seq2seq.py
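Kicking off the training loop is then a single call:

```python
trainer.run()
```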
Before running the example, you must prepare the additional libraries, the dataset, and the parallel corpus.
- See the detail description: chainer/examples/seq2seq/README.md
You can train the model with the script: chainer/examples/seq2seq/seq2seq.py
$ pwd
/root2chainer/chainer/examples/seq2seq
$ python seq2seq.py --gpu=0 giga-fren.preprocess.en giga-fren.preprocess.fr \
vocab.en vocab.fr \
--validation-source newstest2013.preprocess.en \
--validation-target newstest2013.preprocess.fr > log
100% (22520376 of 22520376) |#############| Elapsed Time: 0:09:20 Time: 0:09:20
100% (22520376 of 22520376) |#############| Elapsed Time: 0:10:36 Time: 0:10:36
100% (3000 of 3000) |#####################| Elapsed Time: 0:00:00 Time: 0:00:00
100% (3000 of 3000) |#####################| Elapsed Time: 0:00:00 Time: 0:00:00
epoch iteration main/loss validation/main/loss main/perp validation/main/perp validation/main/bleu elapsed_time
0 200 171.449 991.556 85.6739
0 400 143.918 183.594 172.473
0 600 133.48 126.945 260.315
0 800 128.734 104.127 348.062
0 1000 124.741 91.5988 436.536
...
Note
Before running the script, be careful about the locale and Python's encoding settings. Please set them up to use UTF-8.
While you are training the model, you can get the validation results:
...
# source : We knew the Government had tried many things , like launching <UNK> with <UNK> or organising speed dating evenings .
# result : Nous savions que le gouvernement avait <UNK> plusieurs fois , comme le <UNK> <UNK> , le <UNK> ou le <UNK> <UNK> .
# expect : Nous savions que le gouvernement avait tenté plusieurs choses comme lancer des parfums aux <UNK> ou organiser des soirées de <UNK>
...