D2D Toolkits: A fast and versatile document-level machine translation toolkit

- Features
- Installation
- Usage
- License
- Citation

This repository contains the source code for D2D Toolkits, a fast and versatile document-level machine translation toolkit for fairseq.

We also provide our latest document-level machine translation methods:

Addressing the Length Bias Problem in Document-Level Neural Machine Translation

Features

Fast and efficient document-level data indexing and saving.
Reliable decoding strategies for document-level machine translation.
Supports various document-level machine translation methods:
- Document to document translation
- Document to sentence translation
- Contextualized translation
Supports various document-level data augmentation methods.

Installation

First install the latest fairseq.

git clone https://github.com/facebookresearch/fairseq
cd fairseq
pip install ./

Then install the required packages for D2D Toolkits.

git clone https://github.com/salvation-z/LengthBiasDNMT
cd LengthBiasDNMT
pip install -r requirements.txt

Usage

Prepare dataset

We provide plenty of tools for preprocessing document-level datasets. Here is an example that prepares the Europarl10 dataset.

bash scripts/prepare_europarl10.sh

This script automatically downloads the Europarl10 dataset and prepares the document index for training. For more details about the document index, please refer to the scripts and the source code.

Training

D2D Toolkits can be used as a standard plugin for fairseq. Add the following arguments to fairseq-train to use D2D Toolkits.

--user-dir D2D_PATH
                    path to the D2D Toolkits code; enables the d2d plugin for fairseq
--task document_translation
                    use the document translation task provided by the d2d plugin
--data-type DATA_TYPE
                    specify the data type, e.g. seg2seg, sent2sent, doc2sent, context, hybrid, divide
--allow-mixup       enable document concatenation
--use-tags          add document tags to the input
--use-mask          add local/global masks to the input
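
As a rough sketch of how these arguments might be combined with standard fairseq-train options, the snippet below assembles a command and prints it instead of running it. The paths ./d2d, data-bin/europarl10 and checkpoints/doc2sent are placeholders, the hyperparameters are illustrative rather than recommended, and the positional data directory is assumed to work as it does for fairseq's standard translation task.

```shell
# Placeholder paths (assumptions): point these at your checkout and binarized data.
D2D_PATH="${D2D_PATH:-./d2d}"
DATA_DIR="${DATA_DIR:-data-bin/europarl10}"
SAVE_DIR="${SAVE_DIR:-checkpoints/doc2sent}"

# D2D-specific options from the list above. --allow-mixup, --use-tags and
# --use-mask can be appended here if your setup needs them.
D2D_ARGS="--user-dir $D2D_PATH --task document_translation --data-type doc2sent"

# Standard fairseq-train options; the values are illustrative, not tuned.
FAIRSEQ_ARGS="--arch transformer --optimizer adam --lr 0.0005 --max-tokens 4096 --save-dir $SAVE_DIR"

TRAIN_CMD="fairseq-train $DATA_DIR $D2D_ARGS $FAIRSEQ_ARGS"

# Dry run: print the assembled command for inspection.
echo "$TRAIN_CMD"
# To launch training, run: eval "$TRAIN_CMD"
```

Keeping the D2D options in their own variable makes it easy to switch --data-type between runs without touching the generic fairseq settings.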

Generation

To enable document-level decoding, add the following arguments to fairseq-generate.

--user-dir D2D_PATH
                    path to the D2D Toolkits code; enables the d2d plugin for fairseq
--task document_translation
                    use the document translation task provided by the d2d plugin
--data-type DATA_TYPE
                    specify the data type, e.g. seg2seg, sent2sent, doc2sent
--context-window CONTEXT_WINDOW
                    specify the context window size for Slide Decoding
--slide-decode      enable Slide Decoding
--use-tags          add document tags to the input
--use-mask          add local/global masks to the input
--force-decode      force the generated target document to have the same number of sentences as the source document
--allow-longer      do not raise an error when the test sequence is longer than the training sequence
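
The following is a minimal sketch of a fairseq-generate call with Slide Decoding enabled. It only prints the assembled command. The paths, the checkpoint name, the context window of 3 and the choice of doc2sent are assumptions for illustration; use the data type your model was trained with.

```shell
# Placeholder paths (assumptions): adjust to your checkout, data and checkpoint.
D2D_PATH="${D2D_PATH:-./d2d}"
DATA_DIR="${DATA_DIR:-data-bin/europarl10}"
CKPT="${CKPT:-checkpoints/doc2sent/checkpoint_best.pt}"
DATA_TYPE="${DATA_TYPE:-doc2sent}"

# D2D-specific options from the list above; the window size 3 is illustrative.
D2D_ARGS="--user-dir $D2D_PATH --task document_translation --data-type $DATA_TYPE --slide-decode --context-window 3"

# Standard fairseq-generate options; the values are illustrative.
FAIRSEQ_ARGS="--path $CKPT --gen-subset test --beam 5 --batch-size 16"

GEN_CMD="fairseq-generate $DATA_DIR $D2D_ARGS $FAIRSEQ_ARGS"

# Dry run: print the assembled command for inspection.
echo "$GEN_CMD"
# To start decoding, run: eval "$GEN_CMD"
```

If the test documents are longer than anything seen in training, --allow-longer can be appended to D2D_ARGS, and --force-decode can be added when sentence-aligned output is required.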

Length Bias DNMT

We provide scripts to reproduce the method proposed in Addressing the Length Bias Problem in Document-Level Neural Machine Translation. Please follow our scripts.

License

D2D Toolkits is MIT-licensed.

Citation

@misc{zhang2023addressing,
      title={Addressing the Length Bias Problem in Document-Level Neural Machine Translation}, 
      author={Zhuocheng Zhang and Shuhao Gu and Min Zhang and Yang Feng},
      year={2023},
      eprint={2311.11601},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{ott2019fairseq,
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
  booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
  year = {2019},
}
