This code mainly implements the Neural Machine Translation system described in (Bahdanau et al., 2015). Also known as "RNNSearch", or "Sequence-to-Sequence Modeling with Attention Mechanism", this is, as of 2016, the most commonly used approach in the recent field of Neural Machine Translation. This code is being developped at Kyoto University in Kurohashi-Kawahara lab (http://nlp.ist.i.kyoto-u.ac.jp/).
This implementation uses the Chainer Deep Learning Library.
This code was used for participation in several shared translation tasks and could obtain state-of-the-art results. In particular, we obtained:
- Best results for the ASPEC dataset (Japanese->English, English->Japanese, Chinese->Japanese and Japanese->Chinese) in the 3rd Workshop on Asian Translation (2016): system description paper, organizers report
- Best results for the French-English EDP dataset of the WMT2017 Biomedical Translation Task (2017):organizers report
- Best results in term of crowdsourced human evaluation for 3/4 of the language directions in the ASPEC dataset (English->Japanese, Chinese->Japanese and Japanese->Chinese) in the 4th Workshop on Asian Translation (2017): system description paper, organizers report
#Requirements and Installation:
The following is currently required for running Kyoto-NMT:
- Python >=3.6
- A recent version of
chainer
(>=7.0). - If using a GPU (recommended): a recent version of
cupy
(>=7.0) - The plotting libraries
plotly
andbokeh
are used in some visualisation scripts
There is now a PyPI repository, so you can install with:
pip install knmt
Dependencies will be automatically installed. However, if you plan to use a GPU (which you should probably do for any real-size experiments), you also need to manually install cupy
. Check this page for details.
pip install cupy
You can confirm that cuda (and optionnally cudnn) are enabled and recognized by chainer by running the version
subcommand of knmt
:
knmt version
In the displayed text, you can look for a section that looks like this:
*********** chainer version *********** version 1.20.0.1 cuda True cudnn True cuda_version 7050 cudnn_version 5005
Such output shows that CUDA 7.5 was installed and recognized by chainer. The library CUDNN 5 is also installed.
You can of course also install the dependencies separately:
pip install chainer
pip install plotly
pip install bokeh
To use the latest version of the code in github, you can either first clone the repository, then install it:
git clone https://github.com/fabiencro/knmt.git
cd knmt
pip install . -U
or directly install from github:
pip install git+https://github.com/fabiencro/knmt.git -U
If you plan on modifying the code, it is better to do a "editable" installation:
git clone https://github.com/fabiencro/knmt.git
cd knmt
pip install -e . -U
This way, modifications to the code in the knmt directory is automatically available without the need to re-install. (see https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs)
#Usage:
After installing, a small script called knmt
should be available in your path.
This script has several subcommands, which can be listed by entering, on the command line:
knmt --help
Mostly, the most important subcommands are knmt make_data
(for data preparation), knmt train
(for training) and knmt eval
(for translating). Each of these subcommands have several options which can be listed with the --help
option. eg:
knmt train --help
NB: in previous versions, one could instead call the three scripts make_data.py
, train.py
, eval.py
in the nmt_chainer sub-directory, but this is not recomended anymore.
The training data should be first preprocessed before training can begin. This is done by the command knmt make_data
. The important things that this processing will do is:
- Tokenize the training data
- Reduce the vocabulary: select the top X most frequent token types and replace others by a special symbol (eg.
UNK
) - Assign a unique id number from 0 to X to each remaining token type
- Convert the training sentences into sequence of id numbers
Step 2 is often necessary due to performance and memory issues when using a large target vocabulary. Step 4 is necessary because the implemented models work on sequence of integers, not sequence of strings.
The required training data is a sentence-aligned parallel corpus that is expected to be in two utf-8 text files: one for source language sentences and the other target language sentences. One sentence per line, words separated by whitespaces. Additionally, some validation data should be provided in a similar form (a source and a target file). This validation data will be used for early-stopping, as well as to visualize the progress of the training. One should also specify the maximum size of vocabulary for source and target sentences. For example:
knmt make_data train.src train.tgt data_prefix --dev_src valid.src --dev_tgt valid.tgt --src_voc_size 100000 --tgt_voc_size 30000
As a result of this call, two dictionaries indexing the 100000 and 30000 most common source and target words are created (with a special index for out-of-vocabulary words). The training and validation data are then converted to integer sequences according to these dictionaries and saved in a gzipped JSON file prefixed with data_prefix
.
Training is done by invoking the knmt train
command, passing as argument the data prefix used in the data preparation part.
knmt train data_prefix train_prefix
This simple call will train a network with size and features similar to those used in the original (Bahdanau et al., 2015) paper (except that LSTMs are used in place of GRUs). But there are many options to specify different aspects of the network: embedding layer size, hidden states size, number of lstm stacks, etc. The training settings can also be specified at this point: weight decay, learning rate, training algorithm, dropout values, minibatch size, etc.
knmt will create several files prefixed by train_prefix.
A JSON file train_prefix.config is created, containing all the parameters given to knmt train
(used for restarting an interrupted training session, or using a model for evaluation).
A file train_prefix.result.sqlite is also created, containing a sqlite database that will keep track of the training progress.
Furthermore, model files, containing optimized network parameters will be saved regularly. Every n minibatches (by default n = 200), an evaluation is performed on the validation set. Both perplexity and BLEU scores are computed. The BLEU score is computed by translating the validation set with a greedy search.
The models that have given the best BLEU and best perplexity so far are saved in files train_prefix.model.best.npz
and train_prefix.model.best_loss.npz
respectively. This allows to have early stopping based on two different criterions: validation BLEU and validation perplexity.
The sqlite database keep track of many information items during the training: validation BLEU and perplexity, training perplexity, time to process each minibatch, etc. An additional script graph_training.py
can use this database to generate a plotly (https://github.com/plotly) graph showing the evolution of BLEU, perplexity and training loss. This graph can be generated while training is still in progress and is very useful for monitoring ongoing experiments.
During training, knmt
will save snapshots of the model and trainer states regularly (how often is controled by the --save_ckpt_every
option. These snapshots are located in the train_prefix/
sub_directory. In addition, when an exception is encountered (such as an out-of-memory error, or a Ctrl+C user interruption), knmt will attempt to save a final_snapshot
file.
Training can be resumed from any of these snapshot files by adding the --load_trainer_snapshot
option to the the original command:
knmt train data_prefix train_prefix --option1 --option2 xxx --load_trainer_snapshot train_prefix/final_snapshot
Warning: The way snapshot saving and handling is being modified in the develop branch, so this method for resuming will be slightly changed shortly.
Evaluation is done by running the command knmt eval
. It allows, among other things, to translate sentences by doing a beam search with a trained model. The following command will translate input.txt
into translations.txt
using the parameters that gave the best validation BLEU during training:
knmt eval train_prefix.config train_prefix.model.best.npz input.txt translations.txt--mode beam_search --beam_width 30
We usually find that it is better to use the parameters that gave the best validation BLEU rather than the ones that gave the best validation loss. Although it can be even better to do an ensemble translation with the two. The eval.py
script has many options for tuning the beam search, ensembling several trained models, displaying the attention for each translation, etc.
In general, the translation process may generate some tags T_UNK_X
, where X
is an integer. The model learned to generate these tags if it was given some training data containing UNK
tags (see data preparation section). If the model generate a T_UNK_X
tag, it indicates it thinks that the correct word that should have been generated is outside of its vocabulary, but corresponds to the translation of the source word at position X
. One can then try to replace these tags with the correct translation using bilingual dictionnnaries (such an idea was originally proposed in (Luang et al., 2015) ).
knmt utils
is a subcommand of knmt
that contains sub-subcommands (yes, sub-subcommands are a bit unusual) giving access to some utility scripts. The command knmt utils replace_tgt_unk
is provided to do this UNK
tag replacement automatically. It expects a dictionnary in JSON format associating source language words to their translation. After having generated the translation translation.txt
of input file input.txt
using the knmt eval
command, one can generate a file translation.no_unk.txt
in which UNK
tags have been replaced with the following command:
knmt utils replace_tgt_unk translation.txt input.txt translation.no_unk.txt --dic dictionary.json
One might want to translate a sentence with "placeholders". E.g. Translate a sentence like "We bought tons of coal". In order to force the decoder to output one-to-one translations of such placeholders, one can use the "--force_placeholders" option. When using the option "--force_placeholders", tokens in the vocabulary of the form "", "", ... will be detected as placeholders. Then the decoder will try its best to output a translation in which each placeholder in the input has exactly one corresponding placeholder in the output. Note however that the model should still have been trained to translate these placeholders correctly. The "--force_placeholders" only put additional constraints on the decoder but will not work if the model assign very low probabilities to placeholders.
Neural Network training and evaluation can often be much faster on a GPU than on a CPU. Both knmt eval
and knmt train
accept a --gpu
argument to specify a GPU device to use. --gpu
should be followed by a number indicating the specific device to use. eg. --gpu 0
to use the first GPU device of the system; --gpu 2
to use the third GPU device (assuming the machine has at least 3 GPUs).
The evolution of the training (training loss, validation loss, validation BLEU, ...) is curently stored in a sqlite file training_prefix.result.sqlite
created during the training process.
At any time during and after the training, one can generate a graph showing the evolution of these values. The utils subcommand knmt utils graph
will take such a *.result.sqlite
file and generate an html file containing the graph. This uses the plotly
graphing library.
knmt utils graph training_prefix.result.sqlite graph.html
This will result in a graph similar to this one, where training loss appears as a green line, validation loss as a blue line, and validation BLEU appears as a scatter plot of small blue circles:
#! shell
knmt train /path_to_experiment_dir/prefix /path_to_experiment_dir/prefix_train --gpu 0
--optimizer adam
--weight_decay 1e-6
--use_goto
--l2_gradient_clipping 1
--mb_size 64
--max_src_tgt_length 90
// use smaller value if frequently running out of memory
Kyoto University Participation to WAT 2016. Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi. 2016. Proceedings of the 3rd Workshop on Asian Translation.URL
Kyoto-NMT: a Neural Machine Translation implementation in Chainer. Fabien Cromieres. 2016. Proceedings of COLING 2016: System Demonstrations. URL
The file nmt_chainer/translation/mail_handler.py implements a daemon that monitors an IMAP mail account for translation requests. Whenever a request is encountered, it is dispatched to the appropriate translation server and its response will be sent back by email to its emitter.
- python-daemon
Copy the sample configuration and edit it according to your environment:
cp conf/mail_handler.json.sample conf/mail_handler.json
To start the daemon:
python nmt_chainer/translation/mail_handler.py
Consult the log file to make sure that the daemon is running properly:
less mail_handler.log
To stop the daemon:
kill `pgrep -f mail_handler.py`
Kyoto University Participation to WAT 2017. Fabien Cromieres, Raj Dabre, Toshiaki Nakazawa and Sadao Kurohashi. 2017. Proceedings of the 4th Workshop on Asian Translation.URL