Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates, Kudo, 2018

Paper, Tags: #nlp, #embeddings

The question addressed in this paper is whether segmentation ambiguity can be harnessed as noise to improve the robustness of NMT.

We present a regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. We show that subword regularization achieves significant improvements over the method using a single subword sequence.

Also, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model.
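
A minimal sketch of what sampled segmentations look like, using the SentencePiece library (which implements this sampler); the model path and the parameter values below are placeholders, not from the paper:

```python
import sentencepiece as spm

# Load a trained unigram-LM SentencePiece model.
# "unigram.model" is a placeholder path, not from the paper.
sp = spm.SentencePieceProcessor(model_file="unigram.model")

sentence = "New York is a city"

# Deterministic (most probable) segmentation.
print(sp.encode(sentence, out_type=str))

# Subword regularization: each call samples a different segmentation.
# alpha is the smoothing hyperparameter; nbest_size=-1 samples from the
# full lattice rather than from a fixed n-best list.
for _ in range(3):
    print(sp.encode(sentence, out_type=str,
                    enable_sampling=True, nbest_size=-1, alpha=0.1))
```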

NMT training with on-the-fly subword sampling

To predict the target subwords, a common choice is an RNN architecture, although the method does not depend on it. NMT is trained using standard maximum likelihood estimation, maximizing the log-likelihood of a given parallel corpus.
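
Concretely, for a corpus of sentence pairs whose source and target sides are segmented into subword sequences $x = (x_1, \dots, x_M)$ and $y = (y_1, \dots, y_N)$, the objective is the usual conditional log-likelihood:

$$
\mathcal{L}(\theta) = \sum_{(x,\,y)} \log P(y \mid x;\, \theta),
\qquad
P(y \mid x;\, \theta) = \prod_{n=1}^{N} P(y_n \mid x,\, y_{<n};\, \theta)
$$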

We can't optimize the marginal log-likelihood exactly because the number of possible segmentations grows exponentially with the sentence length. We approximate it with a finite number k of sequences sampled from P(x|X) and P(y|Y) respectively; in practice k = 1, with a new segmentation sampled on the fly at each parameter update.
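
Written out, subword regularization optimizes the expected log-likelihood over sampled segmentations, approximated with k samples per sentence pair:

$$
\mathcal{L}_{\text{marginal}}(\theta)
= \sum_{(X,\,Y)} \mathbb{E}_{\substack{x \sim P(x \mid X) \\ y \sim P(y \mid Y)}}
\big[ \log P(y \mid x;\, \theta) \big]
\;\cong\;
\sum_{(X,\,Y)} \frac{1}{k} \sum_{i=1}^{k} \log P(y_i \mid x_i;\, \theta)
$$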

BPE vs. unigram LM

BPE is based on a greedy and deterministic symbol replacement, which can't provide multiple segmentations with probabilities.
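
A toy sketch (not from the paper) of what "greedy and deterministic" means here: at every step BPE merges the single most frequent adjacent symbol pair, so a word always ends up with one fixed segmentation and no probability is attached to alternatives.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Each word starts as a tuple of characters; merges are learned greedily.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # greedy, deterministic choice
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["hello", "hell", "help"], num_merges=3)
print(merges)  # e.g. [('h', 'e'), ('he', 'l'), ('hel', 'l')]
```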

The unigram LM assumes that each subword occurs independently, and therefore the probability of a subword sequence is the product of the subword occurrence probabilities.
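
Formally, for a segmentation x = (x_1, ..., x_M) of a sentence X:

$$
P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i),
\qquad \sum_{x \in \mathcal{V}} p(x) = 1,
\qquad
\mathbf{x}^{*} = \arg\max_{\mathbf{x} \in S(X)} P(\mathbf{x})
$$

where V is the subword vocabulary, S(X) is the set of candidate segmentations of X, and the most probable segmentation x* is found with the Viterbi algorithm.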

If the vocabulary is given, we can estimate the subword occurrence probabilities via the EM algorithm, treating the segmentation as a hidden variable; but in the real setting the vocabulary itself is unknown. We find it with an iterative algorithm that starts from a large seed vocabulary and progressively prunes it to the desired size (see paper section 3.2).
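
A heavily simplified sketch of that loop, assuming whitespace-separated words and using hard (Viterbi) counts instead of the paper's full EM; the names, the seed construction, and the frequency-based pruning criterion are illustrative only (the paper instead removes the pieces whose removal hurts the corpus likelihood least):

```python
import math
from collections import Counter

MAX_PIECE_LEN = 8  # illustrative cap on subword length, not from the paper

def viterbi_segment(word, logp):
    """Most probable segmentation of `word` under unigram log-probs `logp`."""
    best = [(-math.inf, [])] * (len(word) + 1)
    best[0] = (0.0, [])
    for i in range(1, len(word) + 1):
        for j in range(max(0, i - MAX_PIECE_LEN), i):
            piece = word[j:i]
            if piece in logp and best[j][0] > -math.inf:
                score = best[j][0] + logp[piece]
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[-1][1]

def learn_vocab(words, target_size, keep_ratio=0.8, rounds=10):
    # Seed vocabulary: every substring up to MAX_PIECE_LEN characters.
    counts = Counter(w[i:j] for w in words
                     for i in range(len(w))
                     for j in range(i + 1, min(i + MAX_PIECE_LEN, len(w)) + 1))
    chars = {c for w in words for c in w}  # single characters are never pruned
    vocab = set(counts)
    for _ in range(rounds):
        if len(vocab) <= target_size:
            break
        # "E-step" (hard): re-count pieces from each word's best segmentation.
        total = sum(counts[p] for p in vocab)
        logp = {p: math.log(counts[p] / total) for p in vocab}
        new_counts = Counter()
        for w in words:
            new_counts.update(viterbi_segment(w, logp))
        # Prune: keep the most-used pieces plus all single characters.
        keep = max(target_size, int(len(vocab) * keep_ratio))
        ranked = [p for p, _ in new_counts.most_common()]
        vocab = set(ranked[:keep]) | chars
        counts = Counter({p: new_counts[p] + 1 for p in vocab})  # +1 smoothing
    return vocab

# Tiny usage example: characters plus a few frequent pieces survive.
print(sorted(learn_vocab(["hello", "hell", "help", "helper"], target_size=8)))
```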

The unigram LM is more flexible than BPE because it is based on a probabilistic LM and can output multiple candidate segmentations together with their probabilities.
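
Sampling then draws a segmentation from the l-best candidates (obtained with a Forward-DP Backward-A* search), with the distribution sharpened or flattened by a hyperparameter α:

$$
P(\mathbf{x}_i \mid X) \cong \frac{P(\mathbf{x}_i)^{\alpha}}{\sum_{l'=1}^{l} P(\mathbf{x}_{l'})^{\alpha}}
$$

Smaller α makes the sampling closer to uniform, while larger α concentrates it on the Viterbi segmentation; with l = ∞, sampling from all segmentations is done with forward-filtering backward-sampling over the lattice.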

Conclusion

We virtually augment training data with on-the-fly subword sampling, which helps improve the accuracy and robustness of NMT models.