
v1.0.0 - Neural Machine Translation with Seq2Seq models

@dilettacal released this 03 Sep 17:49

This is the first release of this NMT toolchain, which lets you train seq2seq models, specifically for translation purposes.
It assumes English is one of the two languages, either as source or as target.
This version uses torchtext 0.8.1.

Highlights

  • Corpora are downloadable from https://opus.nlpl.eu/
  • Data preprocessing with spaCy (models: English, German, multi-language) or with a simple tokenizer
  • Customizable training
  • Translation with customizable beam size (see the beam-search sketch after this list)
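
The sketch below illustrates what a customizable beam size controls at decode time: instead of greedily taking the single best token at each step, the decoder keeps the `beam_size` best partial translations and expands each of them. The `step_fn` callback, the BOS/EOS ids, and the default values are illustrative placeholders, not this toolchain's actual API.

```python
from typing import Callable, List, Tuple

def beam_search(step_fn: Callable[[List[int]], List[float]],
                bos_id: int, eos_id: int,
                beam_size: int = 5, max_len: int = 50) -> List[int]:
    """step_fn maps a partial hypothesis (token ids) to log-probabilities
    over the target vocabulary for the next token."""
    beams: List[Tuple[float, List[int]]] = [(0.0, [bos_id])]  # (score, tokens)
    finished: List[Tuple[float, List[int]]] = []
    for _ in range(max_len):
        candidates: List[Tuple[float, List[int]]] = []
        for score, seq in beams:
            log_probs = step_fn(seq)
            # Only the beam_size best continuations of each hypothesis matter
            best = sorted(range(len(log_probs)), key=log_probs.__getitem__,
                          reverse=True)[:beam_size]
            candidates += [(score + log_probs[t], seq + [t]) for t in best]
        # Prune the pooled candidates back down to beam_size hypotheses
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            (finished if seq[-1] == eos_id else beams).append((score, seq))
        if not beams:          # every surviving hypothesis has emitted EOS
            break
    finished.extend(beams)     # fall back to unfinished beams at max_len
    # Length-normalize so longer hypotheses are not unfairly penalized
    return max(finished, key=lambda c: c[0] / len(c[1]))[1]
```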

Corpora

Corpora are downloaded automatically from https://opus.nlpl.eu/ if the program does not find any file under data/raw/<corpus_name>.
To download corpora, run preprocess.py (e.g. python preprocess.py --lang_code de --corpus europarl). On opus.nlpl.eu, corpora are available either in txt or in tmx format, the latter being the most common format in the translation industry. For that reason, this code hosts the tmx2corpus implementation by Aaron Madlon-Kay.
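
For context, a TMX file stores aligned translation units (`tu` elements), each holding one `tuv` per language whose text sits in a `seg` element. The standalone sketch below mirrors what such a tmx-to-text conversion does; it is an illustration, not tmx2corpus's actual API.

```python
# Minimal illustration of a TMX -> parallel-corpus conversion.
import xml.etree.ElementTree as ET

def tmx_to_parallel(tmx_path: str, src_lang: str = "en", tgt_lang: str = "de"):
    """Yield (source, target) sentence pairs from a TMX file."""
    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
    for tu in ET.parse(tmx_path).iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # The language code may live in xml:lang or a plain lang attribute
            lang = (tuv.get(XML_LANG) or tuv.get("lang", "")).lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang.split("-")[0]] = seg.text.strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
```

Writing each pair to one line of a source file and one line of a target file then yields the plain-text form the rest of the pipeline expects.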

Currently you can specify the following corpora:

Model Architecture

The NMT model has the following structure:
(model_structure: architecture diagram, not reproduced here)
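
Since the diagram is not reproduced in this text, the following is only a generic encoder-decoder sketch of the kind of seq2seq model such a toolchain trains; the layer types (embedding + GRU) and sizes are assumptions, not this release's exact configuration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence and compresses it into a hidden state."""
    def __init__(self, vocab_size: int, emb_dim: int = 256, hid_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src: torch.Tensor):        # src: (batch, src_len)
        _, hidden = self.rnn(self.embedding(src))
        return hidden                            # (1, batch, hid_dim)

class Decoder(nn.Module):
    """Generates the target sentence one token at a time."""
    def __init__(self, vocab_size: int, emb_dim: int = 256, hid_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt_tok: torch.Tensor, hidden: torch.Tensor):
        # tgt_tok: (batch, 1) -- the previously emitted (or gold) token
        output, hidden = self.rnn(self.embedding(tgt_tok), hidden)
        return self.out(output.squeeze(1)), hidden   # logits: (batch, vocab)
```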

Training is customizable and can be performed using the script train_model.py. Run python train_model.py --help for more info.
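
As a rough illustration of what a training run does under the hood, here is a hedged sketch of one teacher-forced training step, reusing the Encoder and Decoder sketch above; the pad index, the loss, and the use of teacher forcing are typical seq2seq choices, not confirmed defaults of train_model.py.

```python
import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, src, tgt, pad_id: int = 1):
    """One update. src: (batch, src_len); tgt: (batch, tgt_len) incl. BOS/EOS."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)  # skip padding positions
    optimizer.zero_grad()
    hidden = encoder(src)                 # encode the whole source sentence
    loss = torch.tensor(0.0)
    for t in range(tgt.size(1) - 1):
        # Teacher forcing: feed the gold token, predict the next one
        logits, hidden = decoder(tgt[:, t:t + 1], hidden)
        loss = loss + criterion(logits, tgt[:, t + 1])
    loss.backward()
    optimizer.step()
    return loss.item() / (tgt.size(1) - 1)   # mean per-step loss
```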