This is the first release of this NMT toolchain, which lets you train seq2seq models, specifically for translation.
It assumes that English is one of the two languages, either the source or the target.
This version uses torchtext 0.8.1.
Highlights
- Corpora are downloadable from https://opus.nlpl.eu/
- Data preprocessing with spaCy (models: English, German, Multi-Language) or with a simple tokenizer (see the tokenizer sketch after this list)
- Customizable training
- Translation with customizable beam size
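The simple tokenizer's exact behavior isn't specified in these notes; the sketch below contrasts a spaCy tokenizer with a plain whitespace split as one plausible reading (the German model name is just an example):

```python
import spacy

# A spaCy German pipeline; install it first with:
#   python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

def spacy_tokenize(text):
    # spaCy's rule-based tokenizer splits off punctuation, clitics, etc.
    return [tok.text for tok in nlp.tokenizer(text)]

def simple_tokenize(text):
    # One plausible "simple tokenizer": plain whitespace splitting.
    return text.split()

print(spacy_tokenize("Das ist ein Beispiel."))   # ['Das', 'ist', 'ein', 'Beispiel', '.']
print(simple_tokenize("Das ist ein Beispiel."))  # ['Das', 'ist', 'ein', 'Beispiel.']
```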
Corpora
Corpora are downloaded automatically from https://opus.nlpl.eu/ if the program does not find any file under data/raw/<corpus_name>.
To download corpora, run preprocess.py (e.g. python preprocess.py --lang_code de --corpus europarl). On opus.nlpl.eu, corpora are available either as plain text (txt) or in TMX format, which is the most common format in the translation industry. Therefore this repository includes the tmx2corpus implementation by Aaron Madlon-Kay.
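For illustration, a download step of the kind preprocess.py presumably performs might look like the sketch below; the function name, URL argument, and archive handling are assumptions rather than the toolchain's actual code (OPUS download URLs vary per corpus and release):

```python
import os
import urllib.request
import zipfile

def fetch_corpus(corpus_name, url, raw_dir="data/raw"):
    """Download and unpack an OPUS archive unless files already exist
    under data/raw/<corpus_name> (mirroring the behavior described above)."""
    target = os.path.join(raw_dir, corpus_name)
    if os.path.isdir(target) and os.listdir(target):
        return  # corpus already present; nothing to do
    os.makedirs(target, exist_ok=True)
    archive = os.path.join(target, corpus_name + ".zip")
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
```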
Currently you can specify the following corpora:
- "europarl", https://opus.nlpl.eu/Europarl.php
- "ted", https://opus.nlpl.eu/TED2020.php
- "wikipedia", https://opus.nlpl.eu/Wikipedia.php
- "tatoeba", https://opus.nlpl.eu/Tatoeba.php
Model Architecture
The NMT model follows the standard seq2seq structure sketched below.
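This sketch is a rough orientation only: a GRU-based encoder-decoder of the kind common with torchtext 0.8. Layer types, sizes, and the absence of attention here are assumptions about a typical setup, not confirmed details of this toolchain's model:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Embeds source tokens and encodes them with a GRU (sizes are assumptions)."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        _, hidden = self.rnn(self.embedding(src))
        return hidden                            # (1, batch, hid_dim)

class Decoder(nn.Module):
    """Predicts one target token at a time, conditioned on the previous state."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tok, hidden):              # tok: (batch, 1)
        output, hidden = self.rnn(self.embedding(tok), hidden)
        return self.out(output.squeeze(1)), hidden

class Seq2Seq(nn.Module):
    """Teacher-forced unrolling for training."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, src, trg):
        hidden = self.encoder(src)
        logits = []
        for t in range(trg.size(1) - 1):
            step_logits, hidden = self.decoder(trg[:, t:t+1], hidden)
            logits.append(step_logits)
        return torch.stack(logits, dim=1)        # (batch, trg_len-1, vocab)
```

At translation time the decoder would instead be unrolled with beam search, keeping the beam-size-many best partial hypotheses at each step, which is where the customizable beam size from the highlights comes in.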
Training is customizable and can be performed using the script train_model.py. Run python train_model.py --help for more info.
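As a rough idea of what train_model.py wraps, a minimal teacher-forcing update step (reusing the sketch classes above; batch contents, sizes, and the pad index are placeholders, not the script's actual values) could look like this:

```python
import torch
import torch.nn.functional as F
import torch.optim as optim

PAD = 0  # assumed pad index; real batches would come from torchtext iterators
model = Seq2Seq(Encoder(vocab_size=8000), Decoder(vocab_size=8000))
optimizer = optim.Adam(model.parameters(), lr=1e-3)

src = torch.randint(1, 8000, (32, 20))   # fake source batch (batch, src_len)
trg = torch.randint(1, 8000, (32, 22))   # fake target batch (batch, trg_len)

optimizer.zero_grad()
logits = model(src, trg)                 # teacher-forced forward pass
# Shift targets by one position: predict trg[t+1] from trg[t]
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       trg[:, 1:].reshape(-1), ignore_index=PAD)
loss.backward()
optimizer.step()
```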