This is the first release of this NMT toolchain, which lets you train seq2seq models, specifically for translation.
It assumes that English is one of the two languages, either the source or the target.
This version uses torchtext 0.8.1.
Highlights
- Corpora are downloadable from https://opus.nlpl.eu/
- Data preprocessing with spaCy (models: English, German, Multi-Language) or with a simple tokenizer (see the tokenizer sketch after this list)
- Customizable training
- Translation with customizable beam size
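The simple tokenizer's exact behavior isn't specified in these notes; the sketch below contrasts a spaCy tokenizer with a plain whitespace split as one plausible reading (the German model name is just an example):

```python
import spacy

# A spaCy German pipeline; install it first with:
#   python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

def spacy_tokenize(text):
    # spaCy's rule-based tokenizer splits off punctuation, clitics, etc.
    return [tok.text for tok in nlp.tokenizer(text)]

def simple_tokenize(text):
    # One plausible "simple tokenizer": plain whitespace splitting.
    return text.split()

print(spacy_tokenize("Das ist ein Beispiel."))   # ['Das', 'ist', 'ein', 'Beispiel', '.']
print(simple_tokenize("Das ist ein Beispiel."))  # ['Das', 'ist', 'ein', 'Beispiel.']
```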
Corpora
Corpora are downloaded automatically from https://opus.nlpl.eu/ if the program does not find any file under data/raw/<corpus_name>.
To download corpora, run preprocess.py (e.g. python preprocess.py --lang_code de --corpus europarl). On opus.nlpl.eu, corpora are available either as plain text (txt) or in TMX format, which is the most common format in the translation industry. Therefore this repository includes the tmx2corpus implementation by Aaron Madlon-Kay.
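For illustration, a download step of the kind preprocess.py presumably performs might look like the sketch below; the function name, URL argument, and archive handling are assumptions rather than the toolchain's actual code (OPUS download URLs vary per corpus and release):

```python
import os
import urllib.request
import zipfile

def fetch_corpus(corpus_name, url, raw_dir="data/raw"):
    """Download and unpack an OPUS archive unless files already exist
    under data/raw/<corpus_name> (mirroring the behavior described above)."""
    target = os.path.join(raw_dir, corpus_name)
    if os.path.isdir(target) and os.listdir(target):
        return  # corpus already present; nothing to do
    os.makedirs(target, exist_ok=True)
    archive = os.path.join(target, corpus_name + ".zip")
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
```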
Currently you can specify the following corpora:
- "europarl", https://opus.nlpl.eu/Europarl.php
- "ted", https://opus.nlpl.eu/TED2020.php
- "wikipedia", https://opus.nlpl.eu/Wikipedia.php
- "tatoeba", https://opus.nlpl.eu/Tatoeba.php
Model Architecture
The NMT model follows the standard seq2seq structure sketched below.
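This sketch is a rough orientation only: a GRU-based encoder-decoder of the kind common with torchtext 0.8. Layer types, sizes, and the absence of attention here are assumptions about a typical setup, not confirmed details of this toolchain's model:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Embeds source tokens and encodes them with a GRU (sizes are assumptions)."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        _, hidden = self.rnn(self.embedding(src))
        return hidden                            # (1, batch, hid_dim)

class Decoder(nn.Module):
    """Predicts one target token at a time, conditioned on the previous state."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tok, hidden):              # tok: (batch, 1)
        output, hidden = self.rnn(self.embedding(tok), hidden)
        return self.out(output.squeeze(1)), hidden

class Seq2Seq(nn.Module):
    """Teacher-forced unrolling for training."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, src, trg):
        hidden = self.encoder(src)
        logits = []
        for t in range(trg.size(1) - 1):
            step_logits, hidden = self.decoder(trg[:, t:t+1], hidden)
            logits.append(step_logits)
        return torch.stack(logits, dim=1)        # (batch, trg_len-1, vocab)
```

At translation time the decoder would instead be unrolled with beam search, keeping the beam-size-many best partial hypotheses at each step, which is where the customizable beam size from the highlights comes in.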
Training is customizable and can be performed using the script train_model.py. Run python train_model.py --help for more info.
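As a rough idea of what train_model.py wraps, a minimal teacher-forcing update step (reusing the sketch classes above; batch contents, sizes, and the pad index are placeholders, not the script's actual values) could look like this:

```python
import torch
import torch.nn.functional as F
import torch.optim as optim

PAD = 0  # assumed pad index; real batches would come from torchtext iterators
model = Seq2Seq(Encoder(vocab_size=8000), Decoder(vocab_size=8000))
optimizer = optim.Adam(model.parameters(), lr=1e-3)

src = torch.randint(1, 8000, (32, 20))   # fake source batch (batch, src_len)
trg = torch.randint(1, 8000, (32, 22))   # fake target batch (batch, trg_len)

optimizer.zero_grad()
logits = model(src, trg)                 # teacher-forced forward pass
# Shift targets by one position: predict trg[t+1] from trg[t]
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       trg[:, 1:].reshape(-1), ignore_index=PAD)
loss.backward()
optimizer.step()
```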