# Investigating the isometric behaviour of Neural Machine Translation models on binary semantic equivalence spaces
Isometry is defined mathematically as a distance-preserving transformation between two metric spaces. A simplified illustration of isometry in higher-dimensional function spaces can be seen below. In this research, we view Neural Machine Translation (NMT) models from the perspective of semantic isometry and assume that well-performing NMT models act approximately isometrically on semantic metric spaces. That is, if two sentences are semantically equivalent on the source side, they should remain semantically equivalent after translation on the target side, given a well-performing NMT model. We hypothesize that the frequency of such semantically isometric behaviour correlates positively with general model performance.
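To make the notion concrete, here is a minimal, purely illustrative Python sketch (not part of this repository) of checking whether a map between two metric spaces preserves pairwise distances on a finite sample. In our binary semantic equivalence setting, the "distance" would be 0 for paraphrase pairs and 1 otherwise, and the map would be the NMT model:

```python
# Illustrative sketch only: a map f between two metric spaces is an
# isometry if it preserves all pairwise distances.
def is_isometry(f, points, dist_src, dist_tgt, tol=1e-9):
    """Check distance preservation for f over a finite sample of points."""
    return all(
        abs(dist_src(a, b) - dist_tgt(f(a), f(b))) <= tol
        for a in points
        for b in points
    )

# Toy example: shifting in 1-D preserves absolute distance, so it is
# an isometry; squaring does not, so it is not.
shift = lambda x: x + 5.0
d = lambda a, b: abs(a - b)
print(is_isometry(shift, [0.0, 1.0, 2.5], d, d))  # True
```

Under this reading, a "semantically isometric" NMT model is simply an `f` for which the 0/1 paraphrase distance is preserved between source and target sides.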
We conduct our investigation by using two NMT models of varying performance to translate semantically equivalent German paraphrases, based on diverse WMT19 test-data references, into English. We use Facebook AI Research's (FAIR) winning WMT19 single model from Ng et al. (2019) as our SOTA model and abbreviate it as FAIR-WMT19. We train a large Transformer model on WMT16 data following the Scaling NMT methodology of Ott et al. (2018) and use it as our non-SOTA model, abbreviated STANDARD-WMT16.
We simplify the notion of semantic metric spaces into probabilistic binary semantic equivalence spaces and compute these using three Transformer language models fine-tuned on Google's PAWS-X paraphrase detection task. We adapt our workflow from Google's XTREME benchmark system.
By analyzing the paraphrase detection outputs, we show that the frequency of semantically isometric behaviour indeed correlates positively with general model performance. Our findings have interesting implications for automatic sequence evaluation metrics and for the vulnerability of NMT models to adversarial paraphrases.
A more detailed description of our methodologies and results can be found in this paper.
Note: `fairseq` and `XTREME` are used as third-party extensions in this repository, with licensing details found here.
- This repository's code was tested with Python versions `3.7.*`. To sync dependencies, we recommend creating a virtual environment and installing the relevant packages via `pip`:

  ```
  pip install -r requirements.txt
  ```
- In this repository, we use `R` versions `3.6.*` and `lualatex` for efficient `TikZ` visualizations. Execute the following within your `R` console to get the dependencies:

  ```r
  install.packages(c("ggplot2", "optparse", "tikzDevice", "rjson", "ggpointdensity",
                     "fields", "gridExtra", "devtools", "reshape2"))
  devtools::install_github("teunbrand/ggh4x")
  ```
- Initialize the `xtreme-pawsx` git submodule by running the following command:

  ```
  bash scripts/setup_xtreme_pawsx.sh
  ```
- Manually download preprocessed WMT'16 En-De data provided by Google and place the tarball in the `data` directory (~480 MB download size).
- Manually download the following four pre-trained models and place all of the tarballs in the `models` directory (~9 GB total download size):
  - STANDARD-WMT16 for non-SOTA `de-en` translation. Model achieved a `BLEU-4` score of `31.0` on the `newstest2014` test set.
  - mBERT Base for multilingual paraphrase detection. Model fine-tuned on the `en,de,es,fr,ja,ko,zh` languages with a macro-F1 score of `0.886`.
  - XLM-R Base for multilingual paraphrase detection. Model fine-tuned on the `en,de,es,fr,ja,ko,zh` languages with a macro-F1 score of `0.890`.
  - XLM-R Large for multilingual paraphrase detection. Model fine-tuned on the `en,de,es,fr,ja,ko,zh` languages with a macro-F1 score of `0.906`.
- Download `PAWS-X`, `WMT19` Legacy German paraphrases and `WMT19` AR German paraphrases, and prepare the previously downloaded `WMT16` data and pre-trained models by running the command below:

  ```
  bash scripts/prepare_data_models.sh
  ```
- Optional: We provide a secondary branch `slurm-s3it` for executing computationally heavy workflows (e.g. training, evaluating) on the `s3it` server with `slurm`. To use this branch, simply execute:

  ```
  git checkout slurm-s3it
  ```
- Optional: If you want to develop this repository further, you can auto-format shell/R scripts and synchronize Python dependencies, the development log and the `slurm-s3it` branch by initializing our pre-commit and pre-push `git` hooks:

  ```
  bash scripts/setup_git_hooks.sh
  ```
Since we already provide pre-trained models in this repository, we treat model training as an auxiliary procedure. If you would nevertheless like to train the non-SOTA STANDARD-WMT16 model and fine-tune the paraphrase detection models yourself, refer to the instructions in TRAINING.md.
In order to translate WMT19 Legacy and WMT19 AR German paraphrases to English, use our script `translate_wmt19_paraphrases_de_en.sh`:

```
Usage: translate_wmt19_paraphrases_de_en.sh [-h|--help] [glob]

Translate WMT19 paraphrases using both torch-hub and local models

Optional arguments:
  -h, --help   Show this help message and exit
  glob <glob>  Glob for finding local NMT model checkpoints, defaults to
               "./models/transformer_vaswani_wmt_en_de_big.wmt16.de-en.1594228573/checkpoint_best.pt"
```

This script will generate translations using the SOTA FAIR-WMT19 and the non-SOTA STANDARD-WMT16 models. Translation results will be saved as `json` files in the `predictions` directory. To run this script using our defaults, simply execute:

```
bash scripts/translate_wmt19_paraphrases_de_en.sh
```
After translating the WMT19 Legacy and WMT19 AR paraphrases, we can conduct a quick and dirty evaluation of source and target sentences using commutative variants of the `BLEU-4` and `chrF-2` automatic sequence evaluation metrics, which were initialized with the default settings from `sacrebleu`. For this, we provide `evaluate_bleu_chrf_wmt19_paraphrases_de_en.sh`:

```
Usage: evaluate_bleu_chrf_wmt19_paraphrases_de_en.sh [-h|--help] [glob]

Conduct shallow evaluation of WMT19 paraphrases with commutative
BLEU-4 and chrF-2 scores

Optional arguments:
  -h, --help   Show this help message and exit
  glob <glob>  Glob for finding input json translations, defaults to
               "./predictions/*/*.json"
```

This script will analyze source and target sentences in the aforementioned `json` files and will append commutative `BLEU-4` and `chrF-2` scores in-place. To run this script, simply execute:

```
bash scripts/evaluate_bleu_chrf_wmt19_paraphrases_de_en.sh
```
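As a hedged illustration of what a "commutative" variant of an asymmetric sequence metric can look like (the exact definition used by our scripts may differ), one can average the metric's score over both argument orders. The `token_precision` metric below is a toy stand-in, not `sacrebleu`'s actual `BLEU-4` or `chrF-2`:

```python
# Sketch: symmetrize an asymmetric metric by averaging both directions,
# so that the result no longer depends on argument order.
def commutative(metric):
    def symmetric(a, b):
        return 0.5 * (metric(a, b) + metric(b, a))
    return symmetric

# Toy asymmetric "precision-like" metric (illustration only): fraction of
# hypothesis tokens that also appear in the reference, order-insensitive.
def token_precision(hyp, ref):
    hyp_tokens, ref_tokens = hyp.split(), set(ref.split())
    return sum(t in ref_tokens for t in hyp_tokens) / len(hyp_tokens)

sym_prec = commutative(token_precision)
a, b = "the cat sat", "the cat sat down"
print(token_precision(a, b), token_precision(b, a))  # 1.0 0.75
print(sym_prec(a, b) == sym_prec(b, a))  # True
```

Symmetrization matters here because neither sentence in a paraphrase pair is privileged as the "reference", so an order-dependent score would be ill-defined.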
Next, we can run our fine-tuned paraphrase detection models on our source and target sentences. For this, we provide `evaluate_paraphrase_detection_wmt19_paraphrases_de_en.sh`:

```
Usage: evaluate_paraphrase_detection_wmt19_paraphrases_de_en.sh [-h|--help] [glob]

Conduct evaluation of WMT19 paraphrases using pre-trained paraphrase
detection models

Optional arguments:
  -h, --help   Show this help message and exit
  glob <glob>  Glob for finding input json translations, defaults to
               "./predictions/*/*.json"
```

This script will analyze source and target sentences in the aforementioned `json` files and will append the paraphrase detection models' `softmax` scores for the paraphrase (or positive) label in-place. To run this script, simply execute:

```
bash scripts/evaluate_paraphrase_detection_wmt19_paraphrases_de_en.sh
```
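For illustration, appending scores in-place to a predictions `json` file can be sketched as follows; the record layout and the field name `xlmr_large_score` are assumptions for this example, not necessarily the repository's actual schema:

```python
import json

def append_scores(path, scores, key="xlmr_large_score"):
    """Append one score per record to a json predictions file, in place.

    The key name is hypothetical; each score would be the model's softmax
    probability for the "paraphrase" (positive) label.
    """
    with open(path) as f:
        records = json.load(f)
    for record, score in zip(records, scores):
        record[key] = score
    with open(path, "w") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

Writing back to the same file mirrors the in-place behaviour described above, so repeated evaluation passes accumulate scores from different models in one record.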
In order to plot the evolution of model-related training parameters, we provide `visualize_model_evolutions.sh`:

```
Usage: visualize_model_evolutions.sh [-h|--help] [glob]

Visualize model evolutions for translation and paraphrase detection models

Optional arguments:
  -h, --help   Show this help message and exit
  glob <glob>  Glob for finding tensorboard log directories, which will
               be converted to csv's and then plotted. Defaults to
               "./models/*/{train,train_inner,valid}"
```

This script will aggregate tensorboard event logs into `csv` files and produce tikz-based plots of model evolutions as `pdf` files in the `img` directory. To run this script, simply execute:

```
bash scripts/visualize_model_evolutions.sh
```
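As a sketch of the aggregation step: the event-log parsing itself requires the `tensorboard` package (e.g. its `EventAccumulator`) and is omitted here, and the column layout below is an assumption for illustration, not necessarily the script's actual `csv` format:

```python
import csv

def scalars_to_csv(series, path):
    """Write extracted scalar series to csv.

    series: iterable of (step, tag, value) tuples, e.g. as extracted
    from tensorboard event logs upstream of this function.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["step", "tag", "value"])
        writer.writerows(series)
```

Flat `csv` output like this is convenient downstream, since the `R`/`TikZ` plotting scripts can read it without any tensorboard dependency.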
In order to visualize the previously processed commutative `chrF-2` scores, we provide `visualize_chrf_wmt19_paraphrases_de_en.sh`:

```
Usage: visualize_chrf_wmt19_paraphrases_de_en.sh [-h|--help] [glob]

Visualize commutative chrF-2 scores of WMT19 paraphrase translations

Optional arguments:
  -h, --help   Show this help message and exit
  glob <glob>  Glob for finding input json translations, defaults to
               "./predictions/*/*.json"
```

This script will produce a tikz-based plot of the commutative `chrF-2` scores and will save it as a `pdf` file in the `img` directory. To run this script, simply execute:

```
bash scripts/visualize_chrf_wmt19_paraphrases_de_en.sh
```
In order to visualize the previously processed paraphrase detection results, we provide `visualize_paraphrase_detection_wmt19_paraphrases_de_en.sh`:

```
Usage: visualize_paraphrase_detection_wmt19_paraphrases_de_en.sh [-h|--help] [glob]

Visualize paraphrase detection predictions of WMT19 paraphrase translations

Optional arguments:
  -h, --help   Show this help message and exit
  glob <glob>  Glob for finding input json translations, defaults to
               "./predictions/*/*.json"
```

This script will produce tikz-based plots of the respective paraphrase detection `softmax` scores and joint model decisions, and will save them as `pdf` files in the `img` directory. To run this script, simply execute:

```
bash scripts/visualize_paraphrase_detection_wmt19_paraphrases_de_en.sh
```
In order to visualize correlations between commutative `chrF-2` scores and paraphrase detection predictions, we provide `visualize_chrf_paraphrase_detection_wmt19_paraphrases_de_en.sh`:

```
Usage: visualize_chrf_paraphrase_detection_wmt19_paraphrases_de_en.sh [-h|--help] [glob]

Visualize commutative chrF-2 and paraphrase detection predictions of WMT19
paraphrase translations

Optional arguments:
  -h, --help   Show this help message and exit
  glob <glob>  Glob for finding input json translations, defaults to
               "./predictions/*/*.json"
```

This script will produce tikz-based plots of correlations between commutative `chrF-2` scores and paraphrase detection predictions and will save them as `pdf` files in the `img` directory. To run this script, simply execute:

```
bash scripts/visualize_chrf_paraphrase_detection_wmt19_paraphrases_de_en.sh
```