The sequence to sequence (seq2seq) model[1][2] is a learning model that converts an input sequence into an output sequence. In this context, a sequence is a list of symbols, corresponding to the words in a sentence. The seq2seq model has achieved great success in fields such as machine translation, dialogue systems, question answering, and text summarization. All of these tasks can be regarded as learning a model that converts an input sequence into an output sequence.
The seq2seq model converts an input sequence into an output sequence. Let the input sequence and the output sequence be ${\bf X}$ and ${\bf Y}$:

$${\bf X} = ({\bf x}_1, {\bf x}_2, \dots, {\bf x}_I)$$

$${\bf Y} = ({\bf y}_1, {\bf y}_2, \dots, {\bf y}_J)$$

Let's think about the seq2seq model in the context of NLP. Let the vocabularies of the inputs and the outputs be $\mathcal{V}^{(s)}$ and $\mathcal{V}^{(t)}$; all the elements of ${\bf X}$ and ${\bf Y}$ satisfy ${\bf x}_i \in \mathcal{V}^{(s)}$ and ${\bf y}_j \in \mathcal{V}^{(t)}$. $I$ and $J$ are the lengths of the input sequence and the output sequence. Using the typical NLP notation, ${\bf y}_0$ is the one-hot vector of BOS, the virtual word representing the beginning of the sentence, and ${\bf y}_{J+1}$ is that of EOS, the virtual word representing the end of the sentence.

Next, let's think about the conditional probability $P({\bf Y}|{\bf X})$ of generating the output sequence ${\bf Y}$ when the input sequence ${\bf X}$ is given. The purpose of the seq2seq model is to model this probability. The model does not model $P({\bf Y}|{\bf X})$ directly; instead, it models the per-word probability $P({\bf y}_j \mid {\bf Y}_{<j}, {\bf X})$, where ${\bf Y}_{<j} = ({\bf y}_1, \dots, {\bf y}_{j-1})$ is the already generated part of the output. The probability $P({\bf Y}|{\bf X})$ is then written as the product:

$$P({\bf Y}|{\bf X}) = \prod_{j=1}^{J+1} P({\bf y}_j \mid {\bf Y}_{<j}, {\bf X})$$
Now, let's think about the processing steps in the seq2seq model. Its characteristic feature is that it consists of two processes:

- The process that generates the fixed-size vector ${\bf z}$ from the input sequence ${\bf X}$
- The process that generates the output sequence ${\bf Y}$ from ${\bf z}$

In other words, all the information of ${\bf X}$ must be conveyed to the decoding process through the single vector ${\bf z}$.
First, we represent the process that generates ${\bf z}$ from ${\bf X}$ by the function $\Lambda$:

$${\bf z} = \Lambda({\bf X})$$

The function $\Lambda$ may be a recurrent neural network such as an LSTM.
Second, we represent the process that generates ${\bf Y}$ from ${\bf z}$ by the following recurrence:

$${\bf h}_j^{(t)} = \Psi({\bf h}_{j-1}^{(t)}, {\bf y}_{j-1})$$

$$P({\bf y}_j \mid {\bf Y}_{<j}, {\bf X}) = \Upsilon({\bf h}_j^{(t)})$$

$\Psi$ is the function to generate the hidden vectors ${\bf h}_j^{(t)}$, and $\Upsilon$ is the function to calculate the generative probability of the one-hot vector ${\bf y}_j$ from the hidden vector. When $j = 1$, the previous hidden vector ${\bf h}_0^{(t)}$ is ${\bf z}$ generated by $\Lambda({\bf X})$, and ${\bf y}_0$ is the one-hot vector of BOS.
In this section, we describe the architecture of the seq2seq model. To simplify the explanation, we use the most basic architecture. It can be separated into the following five components:
- Encoder Embedding Layer
- Encoder Recurrent Layer
- Decoder Embedding Layer
- Decoder Recurrent Layer
- Decoder Output Layer
The encoder consists of two layers: the embedding layer and the recurrent layer, and the decoder consists of three layers: the embedding layer, the recurrent layer, and the output layer.
In the explanation, we use the following symbols:
Symbol | Definition |
---|---|
$H$ | the size of the hidden vector |
$D$ | the size of the embedding vector |
${\bf x}_i$ | the one-hot vector of the i-th word in the input sentence |
${\bf \bar x}_i$ | the embedding vector of the i-th word in the input sentence |
$E^{(s)}$ | the embedding matrix of the encoder |
${\bf h}_i^{(s)}$ | the i-th hidden vector of the encoder |
${\bf y}_j$ | the one-hot vector of the j-th word in the output sentence |
${\bf \bar y}_j$ | the embedding vector of the j-th word in the output sentence |
$E^{(t)}$ | the embedding matrix of the decoder |
${\bf h}_j^{(t)}$ | the j-th hidden vector of the decoder |
The first layer, the encoder embedding layer, converts each word in the input sentence to its embedding vector. When processing the i-th word in the input sentence, the input and the output of the layer are the following:

- The input is ${\bf x}_i$: the one-hot vector which represents the i-th word
- The output is ${\bf \bar x}_i$: the embedding vector which represents the i-th word

Each embedding vector is calculated by the following equation:

$${\bf \bar x}_i = E^{(s)} {\bf x}_i$$

where $E^{(s)}$ is the $D \times |\mathcal{V}^{(s)}|$ embedding matrix of the encoder.
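As a concrete illustration, this lookup can be written with `chainer.links.EmbedID`, which stores the embedding matrix and maps word IDs directly to embedding vectors. This is a minimal sketch with placeholder sizes, not the code of the example:

```python
import numpy as np
import chainer
import chainer.links as L

vocab_size, D = 1000, 100                        # placeholder sizes
embed_x = L.EmbedID(vocab_size, D)               # holds the embedding matrix E^(s)

word_ids = np.array([3, 41, 7], dtype=np.int32)  # an input sentence as word IDs
exs = embed_x(word_ids)                          # shape (3, D): one embedding per word
print(exs.shape)
```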
The encoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the i-th embedding vector, the input and the output of the layer are the following:
- The input is ${\bf \bar x}_i$: the embedding vector which represents the i-th word
- The output is ${\bf h}_i^{(s)}$: the hidden vector of the i-th position

For example, when using a uni-directional RNN of one layer, the process can be represented as the following function $\Psi^{(s)}$:

$${\bf h}_i^{(s)} = \Psi^{(s)}({\bf \bar x}_i, {\bf h}_{i-1}^{(s)})$$

In this case, the function $\Psi^{(s)}$ may be an LSTM, and the initial hidden vector ${\bf h}_0^{(s)}$ is typically the zero vector.
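In Chainer, such a recurrent layer over whole sequences can be expressed with `chainer.links.NStepLSTM`, which consumes a list of embedded sentences and returns the final states together with the per-position hidden vectors. A minimal sketch with placeholder sizes (passing `None` for the initial states starts from zero vectors):

```python
import numpy as np
import chainer
import chainer.links as L

D, H, n_layers = 100, 200, 1
encoder = L.NStepLSTM(n_layers, D, H, dropout=0.1)

# A mini-batch of two embedded sentences with different lengths.
exs = [np.random.randn(5, D).astype(np.float32),
       np.random.randn(3, D).astype(np.float32)]

# hx, cx: final hidden/cell states of shape (n_layers, batch, H)
# os: list of per-position hidden vectors h_i^(s), one array per sentence
hx, cx, os = encoder(None, None, exs)
print(hx.shape, os[0].shape)   # (1, 2, 200) (5, 200)
```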
The decoder embedding layer converts each word in the output sentence to its embedding vector. When processing the j-th word in the output sentence, the input and the output of the layer are the following:

- The input is ${\bf y}_{j-1}$: the one-hot vector which represents the (j − 1)-th word, generated by the decoder output layer
- The output is ${\bf \bar y}_j$: the embedding vector which represents the (j − 1)-th word

Each embedding vector is calculated by the following equation:

$${\bf \bar y}_j = E^{(t)} {\bf y}_{j-1}$$

where $E^{(t)}$ is the embedding matrix of the decoder.
The decoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the j-th embedding vector, the input and the output of the layer are the following:

- The input is ${\bf \bar y}_j$: the embedding vector
- The output is ${\bf h}_j^{(t)}$: the hidden vector of the j-th position

For example, when using a uni-directional RNN of one layer, the process can be represented as the following function $\Psi^{(t)}$:

$${\bf h}_j^{(t)} = \Psi^{(t)}({\bf \bar y}_j, {\bf h}_{j-1}^{(t)})$$

In this case, we use the final hidden vector of the encoder, ${\bf z}$, as the initial hidden vector ${\bf h}_0^{(t)}$.
The decoder output layer generates the probability of the j-th word of the output sentence from the hidden vector. When processing the j-th embedding vector, the input and the output of the layer are the following:

- The input is ${\bf h}_j^{(t)}$: the hidden vector of the j-th position
- The output is $p_j$: the probability of generating the one-hot vector ${\bf y}_j$ of the j-th word
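As an illustration, the output layer can be a linear projection from the hidden vector to vocabulary-size scores followed by a softmax; during training the scores are usually fed directly to a softmax cross-entropy loss. The sketch below uses placeholder sizes and is not the exact code of the example:

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

H, target_vocab_size = 200, 1000                 # placeholder sizes
W_out = L.Linear(H, target_vocab_size)           # projects h_j^(t) to vocabulary scores

h_t = np.random.randn(4, H).astype(np.float32)   # hidden vectors for 4 decoder positions
scores = W_out(h_t)                              # unnormalized scores
p = F.softmax(scores)                            # p_j: probability over the target vocabulary

# During training, the loss is computed directly from the scores:
targets = np.array([5, 12, 7, 3], dtype=np.int32)  # gold word IDs y_j
loss = F.softmax_cross_entropy(scores, targets)
```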
Note
There are many varieties of seq2seq models. We can use different RNN models in terms of: (1) directionality (unidirectional or bidirectional), (2) depth (single-layer or multi-layer), (3) type (a vanilla RNN, a Long Short-Term Memory (LSTM), or a gated recurrent unit (GRU)), and (4) additional functionality (such as an attention mechanism).
The official Chainer repository includes a neural machine translation example using the seq2seq model. We will now provide an overview of the example and explain its implementation in detail. chainer/examples/seq2seq
In this simple example, an input sequence is processed by a stacked LSTM-RNN (long short-term memory recurrent neural network) and encoded as a fixed-size vector. The output sequence is also processed by another stacked LSTM-RNN. At decoding time, each output word is chosen by argmax over the output probabilities.
First, let's import necessary packages.
../../../examples/seq2seq/seq2seq.py
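For reference, a typical set of imports for a Chainer seq2seq script looks like the following; this is a sketch, and the actual example may import additional helpers (e.g. progress bars or NLTK for BLEU):

```python
import argparse

import numpy as np

import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions
```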
Define all training settings here.
../../../examples/seq2seq/seq2seq.py
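The settings are collected with `argparse`. The sketch below reflects only the options visible in the run command later on this page (the source/target corpora, the vocabularies, `--validation-source`, `--validation-target`, and `--gpu`); the remaining hyper-parameter flags and their defaults are assumptions for illustration, not necessarily the example's exact names:

```python
parser = argparse.ArgumentParser(description='Chainer example: seq2seq')
parser.add_argument('SOURCE', help='source sentence list')
parser.add_argument('TARGET', help='target sentence list')
parser.add_argument('SOURCE_VOCAB', help='source vocabulary file')
parser.add_argument('TARGET_VOCAB', help='target vocabulary file')
parser.add_argument('--validation-source', help='source sentence list for validation')
parser.add_argument('--validation-target', help='target sentence list for validation')
parser.add_argument('--gpu', type=int, default=-1,
                    help='GPU ID (negative value indicates CPU)')
# The flags below are illustrative assumptions.
parser.add_argument('--batchsize', type=int, default=64)
parser.add_argument('--epoch', type=int, default=20)
parser.add_argument('--unit', type=int, default=1024, help='size of hidden/embedding vectors')
parser.add_argument('--layer', type=int, default=3, help='number of stacked LSTM layers')
args = parser.parse_args()
```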
The Chainer implementation of seq2seq is shown below. It implements the model described in the previous section.
../../../examples/seq2seq/seq2seq.py
- In `Seq2seq`, three functions are defined: the constructor `__init__`, the function call `forward`, and the function for translation `translate`.
../../../examples/seq2seq/seq2seq.py
- When we instantiate this class to make a model, we give the number of stacked LSTMs to `n_layers`, the vocabulary size of the source language to `n_source_vocab`, the vocabulary size of the target language to `n_target_vocab`, and the size of hidden vectors to `n_units`.
- This network uses `chainer.links.NStepLSTM`, `chainer.links.EmbedID`, and `chainer.links.Linear` as its building blocks. All the layers are registered and initialized inside the `self.init_scope()` context.
- You can access all the parameters in those layers by calling `self.params()`.
- In the constructor, all the parameters are initialized with values sampled from the uniform distribution $U(-1, 1)$.
../../../examples/seq2seq/seq2seq.py
- The `forward` method takes the sequences of source language word IDs `xs` and the sequences of target language word IDs `ys`. Each sequence represents a sentence, and the number of sequences in `xs` is the mini-batch size.
- Note that the sequences of word IDs `xs` and `ys` are converted to vocabulary-size one-hot vectors and then multiplied with the embedding matrix in `sequence_embed` to obtain the embedding vectors `exs` and `eys`.

../../../examples/seq2seq/seq2seq.py

- `self.encoder` and `self.decoder` are the encoder and the decoder of the seq2seq model. Each element of the decoder output `os` corresponds to the hidden vectors ${\bf h}_{1:J}^{(t)}$ described above.
- After calculating the recurrent layer output, the loss `loss` and the perplexity `perp` are calculated, and the values are logged by `chainer.report`.
Note
It is well known that the seq2seq model learns much better when the source sentences are reversed. The paper[1] says: "While the LSTM is capable of solving problems with long term dependencies, we discovered that the LSTM learns much better when the source sentences are reversed (the target sentences are not reversed). By doing so, the LSTM's test perplexity dropped from 5.8 to 4.7, and the test BLEU scores of its decoded translations increased from 25.9 to 30.6." So, in the first line of `forward`, the input sentences are reversed: `xs = [x[::-1] for x in xs]`.
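Putting the pieces above together, a sketch of `forward` could look like the following. It uses the attribute names from the constructor sketch; the helper `sequence_embed`, the `EOS` constant (used here as both the start and end marker), and the exact loss normalization are plausible assumptions rather than the example's verbatim code:

```python
EOS = 1  # assumed ID of the end-of-sentence token


def sequence_embed(embed, xs):
    """Apply an EmbedID link to a list of variable-length ID sequences at once."""
    x_len = [len(x) for x in xs]
    x_section = np.cumsum(x_len[:-1])
    ex = embed(F.concat(xs, axis=0))
    return F.split_axis(ex, x_section, axis=0)


def forward(self, xs, ys):  # a method of Seq2seq, shown unindented for readability
    xs = [x[::-1] for x in xs]                          # reverse the source sentences

    eos = self.xp.array([EOS], dtype=np.int32)
    ys_in = [F.concat([eos, y], axis=0) for y in ys]    # decoder inputs start with EOS
    ys_out = [F.concat([y, eos], axis=0) for y in ys]   # decoder targets end with EOS

    exs = sequence_embed(self.embed_x, xs)
    eys = sequence_embed(self.embed_y, ys_in)

    hx, cx, _ = self.encoder(None, None, exs)           # encode: final states become z
    _, _, os = self.decoder(hx, cx, eys)                # decode conditioned on z

    # Concatenate all positions, project to vocabulary scores, and compute the loss.
    concat_os = F.concat(os, axis=0)
    concat_ys_out = F.concat(ys_out, axis=0)
    batch = len(xs)
    loss = F.sum(F.softmax_cross_entropy(
        self.W(concat_os), concat_ys_out, reduce='no')) / batch

    chainer.report({'loss': loss.data}, self)
    perp = self.xp.exp(loss.data * batch / concat_ys_out.shape[0])
    chainer.report({'perp': perp}, self)
    return loss
```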
../../../examples/seq2seq/seq2seq.py
- After the model has learned its parameters, the function `translate` is called to generate the translated sentences `outs` from the source sentences `xs`.
- So as not to change the parameters, the code for translation is nested in the scopes `chainer.no_backprop_mode()` and `chainer.using_config('train', False)`.
In this tutorial, we use the French-English corpus from the WMT15 website (the so-called 10^9 corpus). We must prepare additional libraries, the dataset, and the parallel corpus; the required pre-processing is described in chainer/examples/seq2seq/README.md.
After pre-processing the dataset, let's make the dataset objects:
../../../examples/seq2seq/seq2seq.py
This code uses utility functions below:
../../../examples/seq2seq/seq2seq.py
../../../examples/seq2seq/seq2seq.py
../../../examples/seq2seq/seq2seq.py
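For illustration, the utility functions might look roughly like the following: one that reads a vocabulary file into a word-to-ID dictionary, and one that converts a tokenized corpus file into a list of ID arrays. The function names and the special-token handling are assumptions, not necessarily what the example uses; the file names in the usage part are the pre-processed files from the run command shown later:

```python
def load_vocabulary(path):
    """Map each word in the vocabulary file to an integer ID (one word per line)."""
    with open(path) as f:
        word_ids = {line.strip(): i + 2 for i, line in enumerate(f)}
    word_ids['<UNK>'] = 0     # unknown words
    word_ids['<EOS>'] = 1     # end-of-sentence marker
    return word_ids


def load_data(vocabulary, path):
    """Convert each tokenized sentence in the corpus file to an array of word IDs."""
    data = []
    with open(path) as f:
        for line in f:
            words = line.strip().split()
            ids = [vocabulary.get(w, vocabulary['<UNK>']) for w in words]
            data.append(np.array(ids, dtype=np.int32))
    return data


# Hypothetical usage:
source_ids = load_vocabulary('vocab.en')
target_ids = load_vocabulary('vocab.fr')
train_source = load_data(source_ids, 'giga-fren.preprocess.en')
train_target = load_data(target_ids, 'giga-fren.preprocess.fr')
train_data = list(zip(train_source, train_target))   # dataset of (source, target) pairs
```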
BLEU[3] (bilingual evaluation understudy) is an evaluation metric for the quality of text that has been machine-translated from one natural language to another.
../../../examples/seq2seq/seq2seq.py
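As a reference point, corpus-level BLEU can be computed with NLTK. The snippet below is a generic illustration of the metric, not the example's own evaluation code:

```python
from nltk.translate import bleu_score

# references: one list of acceptable translations per sentence; hypotheses: one candidate each.
references = [[['the', 'cat', 'is', 'on', 'the', 'mat']],
              [['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]]
hypotheses = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
              ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]

bleu = bleu_score.corpus_bleu(
    references, hypotheses,
    smoothing_function=bleu_score.SmoothingFunction().method1)
print('BLEU:', bleu)
```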
Here, the code below just creates iterator objects.
../../../examples/seq2seq/seq2seq.py
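Creating the iterators is essentially a one-liner per split with `chainer.iterators.SerialIterator`; a sketch using the `train_data` object and the assumed `--batchsize` flag from above (`test_data` is assumed to be built from the validation files in the same way as `train_data`):

```python
train_iter = chainer.iterators.SerialIterator(train_data, args.batchsize)
# The validation/test split gets a non-repeating, non-shuffling iterator.
test_iter = chainer.iterators.SerialIterator(test_data, args.batchsize,
                                             repeat=False, shuffle=False)
```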
Instantiate the `Seq2seq` model.
../../../examples/seq2seq/seq2seq.py
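Instantiation just wires the settings into the constructor described above. A sketch, using the assumed `--layer`/`--unit` flags and the vocabulary dictionaries from the dataset sketch:

```python
model = Seq2seq(args.layer, len(source_ids), len(target_ids), args.unit)
if args.gpu >= 0:
    chainer.backends.cuda.get_device_from_id(args.gpu).use()
    model.to_gpu(args.gpu)   # copy the parameters to the GPU
```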
Prepare an optimizer. We use `chainer.optimizers.Adam`.
../../../examples/seq2seq/seq2seq.py
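Setting up the optimizer is a minimal two-liner:

```python
optimizer = chainer.optimizers.Adam()
optimizer.setup(model)   # bind the optimizer to the model's parameters
```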
Let's make a trainer object.
../../../examples/seq2seq/seq2seq.py
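The trainer ties the iterator, optimizer, and model together. Because the model consumes lists of variable-length sequences, the updater needs a converter that keeps each mini-batch as lists of arrays rather than stacking them; the `convert` function below stands in for such a helper and is an assumption, as are the reporting intervals:

```python
def convert(batch, device):
    """Split a batch of (source, target) pairs into two lists of device arrays."""
    xs = [chainer.dataset.to_device(device, x) for x, _ in batch]
    ys = [chainer.dataset.to_device(device, y) for _, y in batch]
    return {'xs': xs, 'ys': ys}


updater = training.updaters.StandardUpdater(
    train_iter, optimizer, converter=convert, device=args.gpu)
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out='result')

# Log and print the reported values (loss and perplexity) periodically.
trainer.extend(extensions.LogReport(trigger=(200, 'iteration')))
trainer.extend(extensions.PrintReport(
    ['epoch', 'iteration', 'main/loss', 'main/perp', 'elapsed_time']),
    trigger=(200, 'iteration'))
```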
Set up the trainer's extension to see the BLEU score on the test data.
../../../examples/seq2seq/seq2seq.py
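One way to hook the BLEU computation into training is a custom trainer extension. The example defines its own extension class; the function-based sketch below only shows the idea, and the trigger interval and helper names are assumptions:

```python
@chainer.training.make_extension(trigger=(200, 'iteration'))
def calculate_bleu(trainer):
    with chainer.no_backprop_mode(), chainer.using_config('train', False):
        references, hypotheses = [], []
        for source, target in test_data:            # test_data: the validation pairs from above
            references.append([target.tolist()])
            ys = model.translate([model.xp.asarray(source)])[0]
            hypotheses.append([int(w) for w in ys])
    bleu = bleu_score.corpus_bleu(
        references, hypotheses,
        smoothing_function=bleu_score.SmoothingFunction().method1)
    print('validation BLEU:', bleu)


trainer.extend(calculate_bleu)
```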
Let's start the training!
../../../examples/seq2seq/seq2seq.py
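Kicking off the training loop is then a single call:

```python
trainer.run()
```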
Before running the example, you must prepare the additional libraries, the dataset, and the parallel corpus.
- See the detail description: chainer/examples/seq2seq/README.md
You can train the model with the script: chainer/examples/seq2seq/seq2seq.py
$ pwd
/root2chainer/chainer/examples/seq2seq
$ python seq2seq.py --gpu=0 giga-fren.preprocess.en giga-fren.preprocess.fr \
vocab.en vocab.fr \
--validation-source newstest2013.preprocess.en \
--validation-target newstest2013.preprocess.fr > log
100% (22520376 of 22520376) |#############| Elapsed Time: 0:09:20 Time: 0:09:20
100% (22520376 of 22520376) |#############| Elapsed Time: 0:10:36 Time: 0:10:36
100% (3000 of 3000) |#####################| Elapsed Time: 0:00:00 Time: 0:00:00
100% (3000 of 3000) |#####################| Elapsed Time: 0:00:00 Time: 0:00:00
epoch iteration main/loss validation/main/loss main/perp validation/main/perp validation/main/bleu elapsed_time
0 200 171.449 991.556 85.6739
0 400 143.918 183.594 172.473
0 600 133.48 126.945 260.315
0 800 128.734 104.127 348.062
0 1000 124.741 91.5988 436.536
...
Note
Before running the script, be careful about the locale and Python's encoding settings. Please set them up to use UTF-8.
While you are training the model, you can get the validation results:
...
# source : We knew the Government had tried many things , like launching <UNK> with <UNK> or organising speed dating evenings .
# result : Nous savions que le gouvernement avait <UNK> plusieurs fois , comme le <UNK> <UNK> , le <UNK> ou le <UNK> <UNK> .
# expect : Nous savions que le gouvernement avait tenté plusieurs choses comme lancer des parfums aux <UNK> ou organiser des soirées de <UNK>
...