2022 NAACL On Synthetic Data for Back Translation #32

Open
IsaacJ60 opened this issue Jul 12, 2023 · 0 comments
Labels
literature-review Summary of the paper related to the work

Main Problem

This work addresses how to generate synthetic data for back translation in Neural Machine Translation (NMT), and which properties of that data determine back-translation performance.

Proposed Method

The authors propose two methods for improving the synthetic data used in back translation: Data Manipulation and Gamma Score. In Data Manipulation, they combine the synthetic corpora generated by beam search and by sampling to balance the trade-off between importance and quality, and they tune the combination ratio to optimize back-translation performance. In Gamma Score, they introduce a score that balances quality and importance when generating translations. The score interpolates between the importance weight and the probability of the translation given the source sentence. For each sentence, they either select the candidate translation with the highest score or sample a candidate from the distribution induced by the scores.
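The Gamma Score selection step can be sketched as follows. This is only an illustration: the log-linear interpolation with a weight `gamma`, the function names, and the `(translation, importance_weight, quality_prob)` candidate format are assumptions made for this sketch, not the paper's exact formulation.

```python
import math
import random


def gamma_scores(candidates, gamma):
    """Score each candidate by interpolating importance and quality.

    candidates: list of (translation, importance_weight, quality_prob),
    with both numbers strictly positive.
    ASSUMPTION: a log-linear interpolation controlled by gamma in [0, 1];
    the paper's exact scoring formula may differ.
    """
    return [
        gamma * math.log(weight) + (1.0 - gamma) * math.log(prob)
        for _, weight, prob in candidates
    ]


def select_best(candidates, gamma):
    """Return the translation with the highest interpolated score."""
    scores = gamma_scores(candidates, gamma)
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index][0]


def sample_by_score(candidates, gamma, rng):
    """Sample one translation with probability proportional to exp(score)."""
    scores = gamma_scores(candidates, gamma)
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(score - top) for score in scores]
    translations = [translation for translation, _, _ in candidates]
    return rng.choices(translations, weights=weights, k=1)[0]


# Hypothetical candidates for a single sentence.
candidates = [("hyp_a", 0.9, 0.1), ("hyp_b", 0.1, 0.9)]
print(select_best(candidates, gamma=1.0))  # importance only
print(select_best(candidates, gamma=0.0))  # quality only
print(sample_by_score(candidates, gamma=0.5, rng=random.Random(0)))
```

In this sketch, `gamma=1.0` ranks candidates purely by importance weight and `gamma=0.0` purely by translation probability, so intermediate values trade one factor against the other.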

Input/Output

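The input to the proposed methods is a monolingual corpus and a pretrained NMT model. In back translation, the monolingual text is in the target language of the final system, so it is the source side of the backward model that translates it. The output is a synthetic corpus generated through either Data Manipulation or the Gamma Score method.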
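As a rough illustration of how Data Manipulation could produce its output corpus, the sketch below mixes two already-generated synthetic corpora at a tunable ratio. The per-sentence random choice, the name `mix_synthetic`, and the `beam_ratio` parameter are assumptions for this example; the summary only states that the beam-search and sampling corpora are combined and that the ratio is tuned.

```python
import random


def mix_synthetic(beam_pairs, sampled_pairs, beam_ratio, seed=0):
    """Build one synthetic corpus from beam-search and sampling outputs.

    beam_pairs / sampled_pairs: aligned lists of (synthetic_source, target)
    pairs, one entry per monolingual sentence.
    beam_ratio: expected fraction of pairs taken from beam search.
    ASSUMPTION: the choice is made independently per sentence.
    """
    if not 0.0 <= beam_ratio <= 1.0:
        raise ValueError("beam_ratio must be in [0, 1]")
    if len(beam_pairs) != len(sampled_pairs):
        raise ValueError("both corpora must cover the same sentences")
    rng = random.Random(seed)
    return [
        beam if rng.random() < beam_ratio else sampled
        for beam, sampled in zip(beam_pairs, sampled_pairs)
    ]


# Hypothetical toy corpora for two target-side sentences.
beam = [("beam src 1", "tgt 1"), ("beam src 2", "tgt 2")]
sampled = [("sampled src 1", "tgt 1"), ("sampled src 2", "tgt 2")]
mixed = mix_synthetic(beam, sampled, beam_ratio=0.5)
print(len(mixed))
```

The ratio would then be tuned on held-out data, for example by training with several `beam_ratio` values and keeping the one that gives the best validation score.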

Example

In an experiment on the WMT14 DE-EN dataset, the authors compared their proposed methods against baseline methods. With Data Manipulation, they achieved BLEU scores similar to sampling-based back translation, even without using bitext, and improved over beam-search back translation. With the Gamma Score method, they achieved significantly better results than both sampling-based and beam-search back translation. Results were measured with the SacreBLEU and COMET metrics.

Related Works & Their Gaps

The related works discussed include the initial proposal of back translation by Bojar and Tamchyna, its extension to NMT by Sennrich et al., and the exploration of various back-translation generation methods by Imamura et al., Edunov et al., and others. Data augmentation methods for NMT, such as token frequency balancing and SwitchOut, are also mentioned, as are the use of monolingual data in semi-supervised machine translation and the improvement of translation quality through back translation. The gaps in the related works are the limited exploration of how to balance importance and quality in synthetic data, the inconsistent improvements of data augmentation across translation tasks, and the need for more efficient ways to leverage monolingual data in NMT.

@DelaramRajaei DelaramRajaei added the literature-review Summary of the paper related to the work label Jul 13, 2023