Tardis

Ensemble Seq2Seq neural machine translation model running on PySpark using Elephas

An ensemble of the neural machine translation model from "Sequence to Sequence Learning with Neural Networks" by Sutskever et al. [1], trained over PySpark using Elephas. We assess the effectiveness of our model on the EN-FR and EN-DE datasets from WMT-14.
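This README does not state how the ensemble members' predictions are combined. One common approach, shown here only as an assumption and not as this project's method, is to average the next-token probability distributions of all members and pick the most likely token. The function name `ensemble_next_token` and the toy distributions are hypothetical.

```python
def ensemble_next_token(member_probs):
    """Average next-token distributions from several models.

    member_probs: list of lists, one probability distribution over the
    target vocabulary per ensemble member (all of the same length).
    Returns (token_id, averaged_distribution).
    """
    if not member_probs:
        raise ValueError("need at least one ensemble member")
    vocab_size = len(member_probs[0])
    if any(len(p) != vocab_size for p in member_probs):
        raise ValueError("all members must share the same vocabulary size")

    n_models = len(member_probs)
    # Uniform-weight average of the members' distributions.
    avg = [sum(p[i] for p in member_probs) / n_models for i in range(vocab_size)]
    # Index of the most probable token under the averaged distribution.
    token_id = max(range(vocab_size), key=avg.__getitem__)
    return token_id, avg


# Toy example (made-up numbers): three models, vocabulary of four tokens.
member_probs = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.30, 0.55, 0.10],
    [0.10, 0.50, 0.30, 0.10],
]
token_id, avg = ensemble_next_token(member_probs)
```

In this toy case the second model prefers token 2 on its own, but the averaged distribution still favours token 1, which is the kind of disagreement an ensemble is meant to smooth out.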

Prerequisites

  • Keras >= 2.2.4
  • Elephas >= 0.4
  • Pandas >= 0.23.4
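
The minimum versions above can be checked before running anything. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`, which may be newer than the interpreter this project targets) and assuming the packages are installed under the distribution names `Keras`, `elephas`, and `pandas`. The helper names are hypothetical and not part of this repository.

```python
import re
from importlib import metadata  # standard library, Python 3.8+

# Minimum versions taken from the Prerequisites list.
# The distribution names are an assumption.
MINIMUM_VERSIONS = {"Keras": "2.2.4", "elephas": "0.4", "pandas": "0.23.4"}


def version_tuple(version):
    """Turn '2.2.4' into (2, 2, 4); a suffix such as 'rc1' is ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_prerequisites(minimums=MINIMUM_VERSIONS):
    """Return {package: status} for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = "ok" if ok else f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_prerequisites().items():
        print(f"{name}: {status}")
```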

Getting started

  • Download the en_de dataset under data/datasets/en_de:

  • Repeat the same process for the en_vi dataset under data/datasets/en_vi

  • Download the FastText WikiText embeddings for English, German and Vietnamese

  • To run the single node Seq2Seq model on a GPU, issue the following command from the project root directory:

    • python -m lib.model --gpu <gpu_no> --dataset <lang_pair> --batch-size <batch_size>
  • To run the single node TinySeq2Seq model on a CPU, issue the following command from the project root directory:

    • python -m lib.model --cpu [--ensemble] --dataset <lang_pair> --batch-size <batch_size>
  • To run the TinySeq2Seq ensemble on multiple nodes:

    • Generate the egg file (this must be rerun after every change to the code): python setup.py bdist_egg
    • Issue the following command from the project root directory: (WIP)
    • spark-submit --driver-memory 1G -m lib/model/__main__.py --cpu [--ensemble] --dataset <lang_pair> --batch-size <batch_size> --recurrent-unit gru

Note: Beam search is used by default during testing. Add the flag --beam-size 0 to use greedy search.
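
To illustrate the difference between the two decoding modes, here is a minimal sketch of beam search in which a beam size of 0 falls back to greedy search, mirroring the `--beam-size 0` convention. It is not this project's implementation: the function `beam_search`, the `toy_model` scoring table, and the `</s>` end-of-sequence token are all hypothetical, and the sketch omits refinements such as length normalization.

```python
import math


def beam_search(step_log_probs, beam_size, max_len, eos):
    """Decode with beam search; beam_size <= 0 means greedy search.

    step_log_probs(prefix) -> {token: log_prob} for the next token.
    Returns the token tuple with the highest summed log-probability.
    """
    width = 1 if beam_size <= 0 else beam_size  # greedy == beam of width 1
    beams = [((), 0.0)]                          # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the best `width` hypotheses; set aside those that ended.
        beams = []
        for prefix, score in candidates[:width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[1])[0]


def toy_model(prefix):
    """Made-up next-token probabilities, keyed by the prefix so far."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"c": 0.55, "d": 0.45},
        ("b",): {"c": 0.9, "d": 0.1},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


greedy = beam_search(toy_model, beam_size=0, max_len=3, eos="</s>")
beam = beam_search(toy_model, beam_size=2, max_len=3, eos="</s>")
```

In this toy table, greedy search commits to `a` (probability 0.6) and ends with a sequence of probability 0.33, while a beam of width 2 keeps `b` alive long enough to find the sequence starting with `b`, which has probability 0.36.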

References

[1] Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104-3112).

[2] Luong, M.-T., Pham, H. and Manning, C.D., 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP).
