TLDR; The authors add a reconstruction objective to the standard seq2seq model: a "Reconstructor" RNN is trained to re-generate the source sequence based on the hidden states of the decoder. A reconstruction cost is added to the cost function and the architecture is trained end-to-end. The authors find that the technique improves upon the baseline both (1) when used during training only and (2) when also used as a ranking objective during beam search decoding.
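
Below is a minimal sketch of the idea, not the authors' implementation: a GRU reconstructor with dot-product attention over the decoder's hidden states, trained with teacher forcing to regenerate the source, whose cross-entropy is added to the usual translation loss. The class/function names, the dimensions, and the `lam` weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reconstructor(nn.Module):
    """Re-generates the source sequence from the decoder's hidden states (sketch)."""
    def __init__(self, src_vocab_size, emb_dim=620, hid_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(src_vocab_size, emb_dim)
        self.attn = nn.Linear(hid_dim, hid_dim, bias=False)  # scores for "inverse" attention
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, src_vocab_size)

    def forward(self, src_tokens, dec_states):
        # src_tokens: (batch, src_len) gold source ids (teacher forcing)
        # dec_states: (batch, tgt_len, hid_dim) decoder hidden states of the translation
        emb = self.embed(src_tokens)
        h = dec_states.new_zeros(1, src_tokens.size(0), dec_states.size(-1))
        logits = []
        for t in range(src_tokens.size(1)):
            # attention over *decoder* states instead of encoder annotations
            query = self.attn(h[-1]).unsqueeze(1)                     # (B, 1, H)
            weights = F.softmax(query @ dec_states.transpose(1, 2), dim=-1)
            context = weights @ dec_states                            # (B, 1, H)
            step_in = torch.cat([emb[:, t:t + 1, :], context], dim=-1)
            step_out, h = self.rnn(step_in, h)
            logits.append(self.out(step_out))
        return torch.cat(logits, dim=1)                               # (B, src_len, V)

def joint_loss(dec_logits, tgt_tokens, rec_logits, src_tokens, lam=1.0):
    """Standard MLE translation loss plus a lambda-weighted reconstruction loss."""
    translation = F.cross_entropy(dec_logits.flatten(0, 1), tgt_tokens.flatten())
    reconstruction = F.cross_entropy(rec_logits.flatten(0, 1), src_tokens.flatten())
    return translation + lam * reconstruction
```

The only real difference from a standard attentive decoder is that attention is computed over the decoder states rather than the encoder annotations, which pushes those states to retain all of the source information.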

Key Points

  • Problem to solve:
    • Standard seq2seq models tend to under- and over-translate because they don't ensure that all of the source information is covered by the target side.
    • The MLE objective only captures information in the source -> target direction and favors short translations; as a result, increasing the beam size actually lowers translation quality
  • Basic Idea
    • Reconstruct source sentences from the latent representations of the decoder
    • Use attention over decoder hidden states
    • Add MLE reconstruction probability to the training objective
  • Beam decoding is now a two-phase scheme
    1. Generate candidates using the encoder-decoder
    2. For each candidate, compute a reconstruction score and use it to re-rank the candidates together with the likelihood (see the sketch after this list)
  • Training Procedure
    • Params Chinese-English: vocab=30k, maxlen=80, embedding_dim=620, hidden_dim=1000, batch=80.
    • 1.25M sentence pairs, trained for 15 epochs using Adadelta, then trained with the reconstructor for another 10 epochs.
  • Results:
    • Reconstruction used for training only (decoding unchanged) increases BLEU from 30.65 -> 31.17 (beam size 10)
    • BLEU increases from 31.17 -> 31.73 (beam size 10) when reconstruction is also used during decoding
    • Model successfully deals with large decoding spaces, i.e. BLEU now increases together with beam size
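
A hedged sketch of the two-phase decoding referenced above, assuming the `Reconstructor` from the earlier sketch and that beam search returns each candidate's tokens, summed log-probability, and decoder states (this candidate format and the `lam` interpolation weight are assumptions, not the paper's exact interface):

```python
import torch.nn.functional as F

def reconstruction_score(reconstructor, src_tokens, dec_states):
    """log P(source | decoder states of one candidate translation)."""
    logits = reconstructor(src_tokens, dec_states)               # (1, src_len, V)
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, src_tokens.unsqueeze(-1)).sum().item()

def rerank(candidates, reconstructor, src_tokens, lam=1.0):
    # candidates: list of (tgt_tokens, translation_log_prob, dec_states) from beam search
    def combined(candidate):
        tgt_tokens, log_prob, dec_states = candidate
        return log_prob + lam * reconstruction_score(reconstructor, src_tokens, dec_states)
    return max(candidates, key=combined)[0]                      # tokens of the best candidate
```

Intuitively, reconstruction rewards candidates whose decoder states still encode the full source, which is why BLEU can keep improving as the beam grows instead of degrading.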

Notes

  • See this issue for the author's comments
  • I feel like "adequacy" is a somewhat strange description of what the authors try to optimize. Wouldn't "coverage" be more appropriate?
  • In Table 1, why does the BLEU score still decrease when length normalization is applied? The authors don't go into detail on this.
  • The training curves are a bit confusing/missing. I would've liked to see a standard training curve that shows the MLE objective loss and the finetuning with reconstruction objective side-by-side.
  • The training procedure is somewhat confusing. They say "We further train the model for 10 epochs" with the reconstruction objective, but then "we use a trained model at iteration 110k". I'm assuming they do early stopping at iteration 110k, i.e. after 110k * 80 = 8.8M training examples. Again, I would've liked to see the loss curves for this, not just BLEU curves.
  • I would've liked to see model performance on more "standard" NMT datasets like EN-FR and EN-DE, etc.
  • Is there perhaps a smarter way to do reconstruction iteratively by looking at what's missing from the reconstructed output? Training the reconstructor with MLE has some of the same drawbacks as training a standard enc-dec with MLE and teacher forcing.