Yusser96/Exploring-Paracrawl-for-Document-level-Neural-Machine-Translation

Paracrawl-Paragraphs Dataset

This repository contains the code for the EACL 2023 paper Exploring Paracrawl for Document-level Neural Machine Translation.

Python version: Python 3.6

Package requirements: torch==1.4.0 tensorboardX numpy==1.19.0

Framework: Our model and experiments are built upon G-Transformer.

Before running the scripts, install the fairseq dependencies with:

    pip install --editable .

Please also follow the readme under the raw_data folder to download the raw data.

Data Extraction

We provide the final dataset used in the paper in the raw_data folder.

To re-extract the data, follow the instructions below:

  • Extract the data:
    cd data_scripts
    
    bash extract_data.sh
  • Clean the data:
    cd data_scripts
    
    pip install langid
    
    bash clean_data.sh
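Since the cleaning step installs langid, the scripts presumably filter out paragraph pairs whose detected language does not match the expected source or target language. A minimal sketch of such a filter, assuming a langid-style classifier (the function name `filter_parallel` and its pluggable `classify` argument are illustrative, not the actual interface of clean_data.sh):

```python
# Hypothetical sketch of language-ID filtering for parallel paragraph pairs.
# The pluggable `classify` argument mimics langid.classify, which maps
# text -> (language_code, score).

def filter_parallel(pairs, src_lang, tgt_lang, classify):
    """Keep only (src, tgt) pairs whose detected languages match
    src_lang and tgt_lang respectively."""
    kept = []
    for src, tgt in pairs:
        if classify(src)[0] == src_lang and classify(tgt)[0] == tgt_lang:
            kept.append((src, tgt))
    return kept
```

With langid installed, this would be called as `filter_parallel(pairs, "en", "de", langid.classify)` for an English-German corpus.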

Training Settings

G-Transformer fine-tuned on a sentence-level Transformer

  • Prepare data:
    mkdir exp_finetune
    bash exp_gtrans/run-all.sh prepare-finetune exp_finetune
  • Train model:
    CUDA_VISIBLE_DEVICES=0,1,2,3 bash exp_gtrans/run-all.sh run-finetune train exp_finetune
  • Evaluate model:
    bash exp_gtrans/run-all.sh run-finetune test exp_finetune
