TLDR; The authors train a single Neural Machine Translation model that can translate between N*M language pairs, with a parameter space that grows linearly with the number of languages. The model uses a single attention mechanism shared across all encoders and decoders. The authors demonstrate that the model performs particularly well for resource-constrained languages, outperforming single-pair models trained on the same data.
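
A quick back-of-the-envelope sketch (in Python, with made-up component sizes) of why this matters: with N source and M target languages, training one model per pair means the total parameter count grows with N*M, while the multi-way model only needs N encoders, M decoders, and one shared attention module.

```python
def pairwise_param_count(n_src, n_tgt, enc_params, dec_params, attn_params):
    # One dedicated encoder, decoder, and attention per language pair:
    # the total grows with n_src * n_tgt.
    return n_src * n_tgt * (enc_params + dec_params + attn_params)


def multiway_param_count(n_src, n_tgt, enc_params, dec_params, attn_params):
    # One encoder per source language, one decoder per target language,
    # and a single shared attention module: the total grows with n_src + n_tgt.
    return n_src * enc_params + n_tgt * dec_params + attn_params


# Hypothetical component sizes, purely to illustrate the scaling behavior.
print(pairwise_param_count(10, 10, 50e6, 60e6, 1e6))  # ~1.1e10 parameters
print(multiway_param_count(10, 10, 50e6, 60e6, 1e6))  # ~1.1e9 parameters
```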

Key Points

  • Attention mechanism: Both encoder and decoder output attention-specific vectors, which are then combined by a single shared attention module. Thus, adding a new source/target language does not result in a quadratic explosion of parameters (see the sketch after this list).
  • Architecture details: Bidirectional RNN encoders, 620-dimensional embeddings, GRUs with 1,000 units, and a 1,000-unit affine layer with tanh activation. Trained with Adam on minibatches of 60 examples, using only sentences up to length 50.
  • The model clearly outperforms single-pair models when the parallel corpora are constrained to be small; the advantage mostly disappears for large corpora.
  • The single multi-way model does not fit on a GPU.
  • In theory, the model can be used to translate between language pairs that had no bilingual training corpus, but the authors don't evaluate this in the paper.
  • Main difference from "Multi-task Sequence to Sequence Learning": this model uses an attention mechanism.
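
A minimal sketch of how one attention module can be shared across language-specific encoders and decoders; the names, shapes, and additive scoring form below are my assumptions for illustration, not the paper's exact equations:

```python
import numpy as np

def shared_attention(enc_states, dec_state, W_enc, W_dec, v_shared):
    """Compute a context vector with a single shared attention module.

    enc_states: (T, d_enc) hidden states from a language-specific encoder.
    dec_state:  (d_dec,)   previous hidden state of a language-specific decoder.
    W_enc, W_dec: per-language projections into a common attention space.
    v_shared:   scoring vector shared across all encoder/decoder combinations.
    """
    enc_keys = enc_states @ W_enc   # (T, d_attn) attention-specific encoder vectors
    dec_query = dec_state @ W_dec   # (d_attn,)   attention-specific decoder vector
    # A single additive scorer combines them, regardless of the language pair.
    scores = np.tanh(enc_keys + dec_query) @ v_shared   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ enc_states     # (d_enc,) context vector for the decoder

# Adding a new source language only adds a new encoder (and its W_enc);
# adding a new target language only adds a new decoder (and its W_dec).
```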

Notes / Questions

  • I don't see anything that would force the encoders to map sequences from different languages into the same representation space (as the authors briefly mention). Perhaps each encoder just produces language-specific information that the decoders can use to figure out which source language it came from?