Skip to content

amazon-science/idiom-mt

Idiom-MT

If you use this repository for your research, please cite:

Automatic Evaluation and Analysis of Idioms in Neural Machine Translation. Christos Baziotis, Prashant Mathur, Eva Hasler. EACL 2023.

The goal of the project is to address the problem of literal translations in machine translation systems. This problem is particularly pronounced during the translation of idiomatic expressions, such as “couch potato” or “once in a blue moon”, which tend to be translated word-for-word into the target language. This project contains the following:

  1. Methods for targeted and automatic evaluation of idioms in context.
  2. Models that are more robust to literal translations.
  3. Analysis of translation models, exploring how different models represent idiomatic expressions as the available context varies, and how these different representations are reflected in the systems' output (translations).

Project Structure

This is how the codebase is organized.

Idiom-MT
├── analysis                # Contains code and jupyter notebooks with exploratory analysis
├── checkpoints             # Contains the checkpoints of all pretrained and finetuned models
├── data                    # Contains the raw + preprocessed data used in our experiments
├── data-bin                # Contains the data in binarized form for fairseq training
├── experiments             # Contains the scripts for launching experiments + their logs
├── literal_translatability # The package for estimating the literal translatability in parallel data
├── metrics                 # The packages with the evaluation metrics developed as part of the project
├── phrase_extractor        # The package with the phrase-matching sentence extraction tool
├── prototype               # *ignore* random scripts for prototyping models and ideas
├── tools                   # Third-party tools, such as Moses, fast_align etc.
├── user                    # The fairseq's --user-dir, which contains all our custom fairseq code (plugins+extensions)
├── utils                   # Helper scripts, such as for collecting the results of filtering parallel data
├── tok                     # Contains the (sentencepiece) tokenized data (from ./data/)
└── vocab                   # Contains the sentencepiece models used for tokenization (./data/ --> ./tok/) 

Prerequisites

Install Requirements

1. Create Environment (Optional): Ideally, you should create a dedicated environment for the project. Use python=3.7 because, at the time of writing this document, there are issues that prevent remote development with PyCharm from macOS.

conda create -n idiom-mt python=3.7
conda activate idiom-mt

2. Install PyTorch (guide) with the desired CUDA version if you want to use the GPU:

pip install torch torchvision torchaudio

IMPORTANT: for A100 GPUs, do the following:

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

3. Install dependencies: Then, install the rest of the requirements:

pip install -r requirements.txt

Not required (you can skip this): If you want to install apex for faster fairseq training, you may encounter compilation issues. If the instructions in fairseq's repository don't work for you, try the following command:

# run from inside the cloned apex repository
CUDA_HOME=/usr/local/cuda-11.1 pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

First, check which CUDA version you have installed in /usr/local/ and use the one that matches your PyTorch installation. Oddly, for me (and others) it worked with cuda-11.1 even though I had PyTorch with cuda-11.0.

4. SpaCy models: Next, install the language support models for spaCy from https://spacy.io/usage/models:

for lang in en zh; do
  python -m spacy download ${lang}_core_web_sm
done

for lang in fr de el it pt es ru; do
  python -m spacy download ${lang}_core_news_sm
done

5. Third-party libraries: Finally, install the third-party libraries used for preprocessing, such as Moses:

bash install-tools.sh

(Optional) Test environment with mBART fine-tuning

First, download the EN-DE News Commentary v13 dataset:

bash download_data_prototype.sh

Then, download mBART:

bash download_mbart.sh

Then, preprocess the data for mBART finetuning:

./prepare_data_mbart_finetuning.sh \
  data/news_commentary_en_de_v13 \
  tok/news_commentary_en_de_v13_mbart \
  data-bin/news_commentary_en_de_v13_mbart \
  train dev test \
  en de \
  en_XX de_DE

Then, preprocess the data for randomly initialized NMT training:

./prepare_data_translation.sh \
  data/news_commentary_en_de_v13 \
  tok/news_commentary_en_de_v13_random \
  data-bin/news_commentary_en_de_v13_random \
  train dev test \
  en de

Reproduce the Experiments

(If you don't have time to read this, check prepare_data_TLDR.sh.)

Step 1. Download and prepare the training/dev/test splits

First, you need to download and prepare the data that will be used for the NMT experiments. Use prepare_data.sh and specify the language pair you want and the idiom list; the idiom list determines the idiom test data as well as the regular-vs-idiom training data splits.

SRC_LANG=en
TRG_LANG=fr
IDIOMS_LIST=./data/idioms_data/all_idioms.en
bash prepare_data.sh $SRC_LANG $TRG_LANG $IDIOMS_LIST
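To give a rough idea of what matching against the idiom list involves, here is a toy sketch. The actual phrase_extractor/phrase_extractor.py tool is more sophisticated (for instance, it also records the match spans that later steps rely on), so treat this only as an illustration:

```python
import re

def extract_idiom_pairs(src_sentences, trg_sentences, idioms):
    """Toy stand-in for the phrase-matching extraction: keep sentence
    pairs whose source side contains one of the listed idioms as a
    whole-word sequence (case-insensitive)."""
    patterns = [(idiom, re.compile(r"\b" + re.escape(idiom) + r"\b", re.IGNORECASE))
                for idiom in idioms]
    matched = []
    for src, trg in zip(src_sentences, trg_sentences):
        for idiom, pattern in patterns:
            if pattern.search(src):
                matched.append((src, trg, idiom))
                break  # record each pair once, for the first matching idiom
    return matched
```

Unlike this sketch, the real extractor also needs to cope with inflected forms and record where in the sentence the idiom occurs.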
Details of how the script works and how it organizes the data (you can skip this)

This script does the following:

  1. downloads the (WMT) testsets and the (WMT/Europarl) parallel data in data/parallel_en_fr/.

  2. filters the parallel data based on length and discards very uneven sentence pairs

    data/parallel_en_fr
    ├── dev.en
    ├── dev.fr
    ├── test.en
    ├── test.fr
    ├── train.en
    ├── train.en.clean
    ├── train.fr
    └── train.fr.clean
    
  3. Extracts sentence pairs from the parallel data that have an idiom on the source side, based on the given idiom list, and stores them in data/parallel_en_fr.idioms.en/. Check the phrase_extractor/phrase_extractor.py tool for details.

    data/parallel_en_fr.idioms.en
    ├── annotations.tsv
    ├── matches.log
    ├── samples
    ├── sentences.txt
    ├── spans.txt
    └── stats.txt
    
  4. Splits the parallel data (from 1.) into two groups in data/parallel_en_fr.idioms.en.splits/:

    • ordinary: these are pairs that don't have an idiom on the source side
    • special: these are pairs that DO have an idiom on the source side
    This step uses phrase_extractor/split_data_by_line_id.py.
    data/parallel_en_fr.idioms.en.splits
    ├── ordinary.en
    ├── ordinary.fr
    ├── special.en
    ├── special.fr
    └── special.spans.en
    
  5. Separates the special pairs into train and test splits. This step uses utils/train_test_split_idiom_pairs.py.

    data/parallel_en_fr.idioms.en.splits/splits
    ├── discarded.en
    ├── discarded.fr
    ├── discarded.spans.en
    ├── freqs.txt
    ├── test.en
    ├── test.fr
    ├── test.spans.en
    ├── train.en
    ├── train.fr
    └── train.spans.en
    
  6. Creates the idiom test set in data/parallel_en_fr.idiom_test, the idiom training split in data/parallel_en_fr.idiom_train, and the train/dev/test splits with the regular parallel data in data/parallel_en_fr.regular, excluding the pairs that went into the idiom train/test data.

    data/parallel_en_fr.idiom_test
    ├── test.en
    ├── test.fr
    └── test.spans.en
    data/parallel_en_fr.idiom_train   # this can also be used for upsampling
    ├── train.en
    ├── train.fr
    └── train.spans.en
    data/parallel_en_fr.regular
    ├── dev.en
    ├── dev.fr
    ├── test.en
    ├── test.fr
    ├── train.en
    └── train.fr    
    
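The length-based filtering in step 2 can be sketched as follows; this is a minimal stand-in with illustrative thresholds, not the script's actual values:

```python
def keep_pair(src, trg, max_len=250, max_ratio=1.5):
    """Toy length filter for a parallel sentence pair: drop empty or
    overlong sentences and pairs whose token counts are very uneven.
    The thresholds here are illustrative only."""
    n_src, n_trg = len(src.split()), len(trg.split())
    if n_src == 0 or n_trg == 0:
        return False
    if n_src > max_len or n_trg > max_len:
        return False
    # a source/target length ratio above max_ratio suggests a misalignment
    return max(n_src, n_trg) / min(n_src, n_trg) <= max_ratio
```

Applying such a filter to train.en/train.fr would yield the train.*.clean files shown above.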

Step 2. Binarize the parallel data

This step binarizes the train/dev/test data for training with Fairseq. We need to binarize/preprocess the data in a different way for each training process.

NMT from random initialization. This involves training a joint sentencepiece model on the source+target training data, segmenting the text with that sentencepiece model and finally binarizing the data. See the source code for the script usage instructions.

# First, train SPM and binarize the regular training data
bash prepare_data_translation.sh \
  ./data/parallel_en_fr.regular \
  ./tok/parallel_en_fr.regular.random \
  ./data-bin/parallel_en_fr.regular.random \
  train dev test en fr
  
# Next, reuse the pretrained SPM and binarize the idiom training data
bash prepare_data_translation.sh \
  ./data/parallel_en_fr.idiom_train \
  ./tok/parallel_en_fr.idiom_train.random \
  ./data-bin/parallel_en_fr.idiom_train.random \
  train '' '' en fr \
  ./vocab/parallel_en_fr.regular

Also, create a split that contains both the regular and the idiom data by symlinking them into data-bin/parallel_en_fr.regular+idiom.random.

bash combine_data.sh \
    data-bin/parallel_en_fr.idiom_train.random \
    data-bin/parallel_en_fr.regular.random \
    data-bin/parallel_en_fr.regular+idiom.random \
    en fr

NMT from mBART initialization. This involves segmenting the text with mBART's sentencepiece model and then binarizing the data. See the source code for the script usage instructions.

# tokenize and binarize the different splits
for split in regular idiom_train; do
    bash prepare_data_mbart_finetuning.sh \
      ./data/parallel_en_fr.${split} \
      ./tok/parallel_en_fr.${split}.mbart \
      ./data-bin/parallel_en_fr.${split}.mbart \
      train dev test en_XX fr_XX
done

# symlink the regular and idiom data to create the joint split
bash combine_data.sh \
    data-bin/parallel_en_fr.idiom_train.mbart \
    data-bin/parallel_en_fr.regular.mbart \
    data-bin/parallel_en_fr.regular+idiom.mbart \
    en_XX fr_XX

Step 3. Launch the experiments

For running experiments, read the documentation in ./experiments/README.md.

Checkpoints and results

The checkpoints of each model are saved in the ./checkpoints/ directory. Inside each model's folder, you will find the checkpoints (last.pt and best.pt) for that particular model, and all the model outputs and scores. Here is an example of the structure for the model enfr_joint.random:

  • ./checkpoints/enfr_joint.random/: besides the checkpoints you will find the model outputs and scores for the generic MT eval
  • ./checkpoints/enfr_joint.random/parallel_en_fr.idiom_test/: model outputs and scores for the idiom-specific eval
  • ./checkpoints/enfr_joint.random/analysis/: model outputs and scores for all the analysis methods. The outputs.json contains all the results together.

Results: To collect all the results, run:

bash collect_results.sh

It will save the results in .csv files, organized by language pair and experiments-vs-analysis:

  • enfr.results.analysis.csv
  • enfr.results.experiments.csv
