
Simple and Scalable Nearest Neighbor Machine Translation

Official Code for our paper "Simple and Scalable Nearest Neighbor Machine Translation" (ICLR 2023).

This project implements our SK-MT (short for Simple and Scalable kNN-MT) as well as vanilla kNN-MT. The implementation is built upon THUMT and heavily inspired by adaptive-knn-mt and KoK. Many thanks to the authors for making their code available.
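
For context, vanilla kNN-MT (also implemented here) interpolates the base NMT distribution with a kNN distribution computed over retrieved datastore entries. A standard formulation (following the original kNN-MT paper, where $\mathcal{N}$ is the retrieved neighbor set, $q_t$ the decoder query, $d$ a distance function, and $\tau$ the temperature) is:

$$p(y_t \mid x, y_{<t}) = \lambda\, p_{\text{kNN}}(y_t \mid x, y_{<t}) + (1-\lambda)\, p_{\text{NMT}}(y_t \mid x, y_{<t})$$

$$p_{\text{kNN}}(y_t \mid x, y_{<t}) \propto \sum_{(k_i, v_i) \in \mathcal{N}} \mathbb{1}[y_t = v_i] \exp\!\left(\frac{-d(k_i, q_t)}{\tau}\right)$$

SK-MT keeps this interpolation but replaces the full-corpus datastore with a small dynamic one, built on the fly from a handful of BM25-retrieved reference samples per input sentence.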

We also provide an implementation built upon fairseq, which can be found in the fairseq branch. The SK-MT performance reported in our paper is evaluated with the THUMT framework.

Requirements and Installation

  • PyTorch version >= 1.1.0
  • Python version >= 3.6

You need to install PyTorch according to your hardware. Taking CUDA 11.6 (cu116) as an example, you can set up the environment with:

pip install --upgrade pip

pip3 install torch==1.12.0+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

pip install torch-scatter -f https://data.pyg.org/whl/torch-1.12.0+cu116.html

pip install numpy==1.23

pip install tensorboardX cffi cython dataclasses hydra-core regex sacremoses sacrebleu tqdm nltk matplotlib absl-py scikit-learn tensorboard bitarray six

pip install -U git+https://github.com/pltrdy/pyrouge
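
After installation, a quick sanity check (a minimal one-liner, assuming the packages above installed cleanly) confirms that PyTorch sees your GPU and that torch-scatter imports:

python -c "import torch, torch_scatter; print(torch.__version__, torch.cuda.is_available())"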

Instructions

Pre-trained Model

The pre-trained translation model can be downloaded from this site. We use the De-En Single Model and follow adaptive-knn-mt to evaluate the performance of kNN-MT and adaptive kNN-MT. We provide the Transformer-based models we use in our experiments under the two frameworks: fairseq Model and THUMT Model.

Data

The raw data can be downloaded from this site, and you should preprocess it with the Moses toolkit and the BPE codes provided with the pre-trained model. To implement SK-MT, we recommend following copyisallyouneed to perform text retrieval with BM25. The retrieved textual data can be used directly in the THUMT framework. If you prefer fairseq, you need to follow its instructions to preprocess and binarize the textual data. For convenience, we also provide pre-processed textual data for THUMT and binarized data for fairseq.
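
As a rough sketch of this preprocessing (the paths and file names below are placeholders, not from this repo; it assumes the Moses scripts and subword-nmt are installed, so substitute your BPE tool if the pre-trained model ships fastBPE codes instead):

# placeholders, not paths from this repo
MOSES=/path/to/mosesdecoder
perl $MOSES/scripts/tokenizer/tokenizer.perl -l de < train.de > train.tok.de
perl $MOSES/scripts/tokenizer/tokenizer.perl -l en < train.en > train.tok.en
# apply the BPE codes shipped with the pre-trained model
subword-nmt apply-bpe -c bpecodes < train.tok.de > train.bpe.de
subword-nmt apply-bpe -c bpecodes < train.tok.en > train.bpe.en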

Update: We also provide the scripts to retrieve reference samples.

Domain Adaptation

This section provides instructions to perform SK-MT based on the THUMT framework. More information about the implementation on the fairseq framework can be found in the fairseq branch.

Retrieval and Preprocessing

bash scripts/domain_adaptation/preprocess.sh

Inference with SK-MT

bash scripts/domain_adaptation/run_sk_mt.sh

The batch size and other parameters should be adjusted according to your hardware. We recommend the following hyper-parameters to replicate good SK-MT results:

Model         tm counts   $k$   $\tau$
SK-MT$_{1}$   2           1     100
SK-MT$_{2}$   16          2     100

Here "tm counts" is the number of retrieved reference samples per sentence, $k$ the number of nearest neighbors, and $\tau$ the temperature.
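
The concrete values are set inside the run script. A hypothetical invocation for the SK-MT$_{2}$ setting (the variable names below are illustrative only; the real ones are defined in scripts/domain_adaptation/run_sk_mt.sh):

# hypothetical variable names; check run_sk_mt.sh for the actual ones
tm_count=16 k=2 tau=100 bash scripts/domain_adaptation/run_sk_mt.sh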

Inference with NMT

bash scripts/domain_adaptation/run_nmt.sh

Online Learning

Inference with SK-MT

bash scripts/online_learning/run_sk_mt.sh

The recommended hyper-parameters are the same as those used in Domain Adaptation.

Citation

If you find this repo helpful for your research, please cite the following paper:

@inproceedings{DBLP:conf/iclr/DaiZLCLD023,
  author       = {Yuhan Dai and
                  Zhirui Zhang and
                  Qiuzhi Liu and
                  Qu Cui and
                  Weihua Li and
                  Yichao Du and
                  Tong Xu},
  title        = {Simple and Scalable Nearest Neighbor Machine Translation},
  booktitle    = {The Eleventh International Conference on Learning Representations,
                  {ICLR} 2023, Kigali, Rwanda, May 1-5, 2023},
  publisher    = {OpenReview.net},
  year         = {2023},
  url          = {https://openreview.net/pdf?id=uu1GBD9SlLe},
  timestamp    = {Fri, 30 Jun 2023 14:55:53 +0200},
  biburl       = {https://dblp.org/rec/conf/iclr/DaiZLCLD023.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Contact

If you have questions, suggestions, or bug reports, please email dirkiedye@gmail.com or zrustc11@gmail.com.
