Code for EMNLP 2023 Findings paper "Addressing the Length Bias Problem in Document-Level Neural Machine Translation".

D2D Toolkits: A fast and versatile document-level machine translation toolkit

This repository contains the source code for D2D Toolkits, a fast and versatile document-level machine translation toolkit for fairseq.

We also provide our latest document-level machine translation methods.

Features

  • Fast and efficient document-level data indexing and saving.
  • Reliable decoding strategies for document-level machine translation.
  • Support for various document-level machine translation methods:
    • Document-to-document translation
    • Document-to-sentence translation
    • Contextualized translation
  • Support for various document-level data augmentation methods.

Installation

First, install the latest fairseq:

git clone https://github.com/facebookresearch/fairseq
cd fairseq
pip install ./

Then install the required packages for D2D Toolkits:

git clone https://github.com/salvation-z/LengthBiasDNMT
cd LengthBiasDNMT
pip install -r requirements.txt

Usage

Prepare dataset

We provide plenty of tools for preprocessing document-level datasets. Here is an example of preparing the Europarl10 dataset:

bash scripts/prepare_europarl10.sh

This script will automatically download the Europarl10 dataset and build the document index used for training. For more details about the document index, please refer to the script and the source code.

Training

D2D Toolkits can be used as a standard fairseq plugin. Add the following arguments to fairseq-train to use D2D Toolkits; an example invocation follows the argument list.

--user-dir D2D_PATH
                    path to the D2D Toolkits plugin directory, enabling the plugin for fairseq
--task document_translation
                    use the document-level translation task provided by D2D Toolkits
--data-type DATA_TYPE
                    specify the data type, e.g. seg2seg, sent2sent, doc2sent, context, hybrid, divide
--allow-mixup       enable document concatenation
--use-tags          add document tags to the input
--use-mask          add local/global masks to the input
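
For example, a training command might look like the following. This is a minimal sketch: the data directory, plugin path, architecture, and hyperparameters are illustrative placeholders, and beyond the plugin arguments above only standard fairseq-train options are assumed.

fairseq-train data-bin/europarl10 \
    --user-dir /path/to/D2DToolkits \
    --task document_translation \
    --data-type seg2seg \
    --use-tags \
    --arch transformer \
    --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096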

Generation

To enable document-level decoding, add the following arguments to fairseq-generate; an example command follows the argument list.

--user-dir D2D_PATH
                    path to the D2D Toolkits plugin directory, enabling the plugin for fairseq
--task document_translation
                    use the document-level translation task provided by D2D Toolkits
--data-type DATA_TYPE
                    specify the data type, e.g. seg2seg, sent2sent, doc2sent
--context-window CONTEXT_WINDOW
                    specify the context window size for Slide Decoding
--slide-decode      enable Slide Decoding
--use-tags          add document tags to the input
--use-mask          add local/global masks to the input
--force-decode      force the target document to have the same number of sentences as the source document
--allow-longer      do not raise an error when the test sequence is longer than the training sequence
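
For example, a document-level generation command with Slide Decoding might look like this. Again a sketch: the data directory, plugin path, checkpoint path, and beam settings are placeholders, and only standard fairseq-generate options are assumed beyond the plugin arguments above.

fairseq-generate data-bin/europarl10 \
    --user-dir /path/to/D2DToolkits \
    --task document_translation \
    --data-type seg2seg \
    --slide-decode \
    --context-window 2 \
    --use-tags \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 16 --beam 5 --remove-bpe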

Length Bias DNMT

We provide scripts to reproduce the methods proposed in Addressing the Length Bias Problem in Document-Level Neural Machine Translation; please follow those scripts.

License

D2D Toolkits is MIT-licensed.

Citation

@misc{zhang2023addressing,
      title={Addressing the Length Bias Problem in Document-Level Neural Machine Translation}, 
      author={Zhuocheng Zhang and Shuhao Gu and Min Zhang and Yang Feng},
      year={2023},
      eprint={2311.11601},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{ott2019fairseq,
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
  booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
  year = {2019},
}
