Version 1.0 is finally out 🥳! What's new?
- `comet-compare` command for statistical comparison between two models
- `comet-score` with multiple hypotheses/systems
- Embeddings caching for faster inference (thanks to @jsouza).
- Length Batching for faster inference (thanks to @CoderPat)
- Integration with SacreBLEU for dataset downloading (thanks to @mjpost)
- Monte Carlo dropout for uncertainty estimation (thanks to @glushkovato and @chryssa-zrv)
- Some code refactoring
Simple installation from PyPI
```bash
pip install --upgrade pip  # ensures that pip is current
pip install unbabel-comet
```

or

```bash
pip install unbabel-comet==1.0.1 --use-feature=2020-resolver
```

To develop locally, install Poetry and run the following commands:
```bash
git clone https://github.com/Unbabel/COMET
cd COMET
poetry install
```

Alternatively, for development, you can run the CLI tools directly, e.g.,

```bash
PYTHONPATH=. ./comet/cli/score.py
```

Test examples:
echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und Kindergärten wurden eröffnet." >> src.de
echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp1.en
echo -e "The fire could have been stopped\nSchools and pre-school were open" >> hyp2.en
echo -e "They were able to control the fire.\nSchools and kindergartens opened" >> ref.enBasic scoring command:
```bash
comet-score -s src.de -t hyp1.en -r ref.en
```

You can set `--gpus 0` to test on CPU.
Scoring multiple systems:
```bash
comet-score -s src.de -t hyp1.en hyp2.en -r ref.en
```

WMT test sets via SacreBLEU:
```bash
comet-score -d wmt20:en-de -t PATH/TO/TRANSLATIONS
```

The default setting of `comet-score` prints the score for each segment individually. If you are only interested in the score for the whole dataset (computed as the average of the segment scores), you can use the `--quiet` flag.
```bash
comet-score -s src.de -t hyp1.en -r ref.en --quiet
```
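For reference, the dataset-level score is simply the arithmetic mean of the segment scores. A minimal Python illustration, with made-up segment scores:

```python
# Illustration only: the system-level score is the average of the segment scores.
seg_scores = [0.41, 0.55, 0.38]  # made-up segment scores
sys_score = sum(seg_scores) / len(seg_scores)
print(f"System-level score: {sys_score:.4f}")  # 0.4467
```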
You can select another model/metric with the `--model` flag, and for reference-free (QE-as-a-metric) models you don't need to pass a reference.

```bash
comet-score -s src.de -t hyp1.en --model wmt20-comet-qe-da
```

Following the work on Uncertainty-Aware MT Evaluation, you can use the `--mc_dropout` flag to get a variance/uncertainty value for each segment score. If this value is high, it means that the metric is less confident in that prediction.
```bash
comet-score -s src.de -t hyp1.en -r ref.en --mc_dropout 30
```
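As a rough sketch of the Monte Carlo dropout idea (not COMET's actual implementation): the model is run several times with dropout kept active, and the mean and variance of the resulting scores give a point estimate plus an uncertainty value. The stochastic scorer below is a stand-in for a single forward pass with dropout enabled:

```python
import random
import statistics

def score_with_dropout(src: str, mt: str, ref: str) -> float:
    """Hypothetical stochastic scorer: stands in for one forward pass with
    dropout layers left active, so repeated calls yield different scores."""
    return 0.45 + random.gauss(0, 0.02)  # made-up numbers, illustration only

def mc_dropout_score(src: str, mt: str, ref: str, n_runs: int = 30):
    # Run the stochastic scorer n_runs times and summarise the distribution.
    runs = [score_with_dropout(src, mt, ref) for _ in range(n_runs)]
    return statistics.mean(runs), statistics.variance(runs)

mean_score, uncertainty = mc_dropout_score(
    "Dem Feuer konnte Einhalt geboten werden",
    "The fire could be stopped",
    "They were able to control the fire.",
)
print(mean_score, uncertainty)  # high variance => less confident prediction
```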
When comparing two MT systems, we encourage you to run the `comet-compare` command to get statistical significance with a paired t-test and bootstrap resampling (Koehn, 2004).

```bash
comet-compare -s src.de -x hyp1.en -y hyp2.en -r ref.en
```
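A minimal sketch of the bootstrap-resampling part of such a comparison (not the `comet-compare` implementation; the segment scores are assumed to come from two systems scored on the same test set):

```python
import random

def bootstrap_win_rate(scores_x, scores_y, n_samples=1000):
    """Estimate how often system X outscores system Y when segments are
    resampled with replacement (paired bootstrap resampling, Koehn 2004)."""
    assert len(scores_x) == len(scores_y)
    n = len(scores_x)
    wins = 0
    for _ in range(n_samples):
        idx = [random.randrange(n) for _ in range(n)]  # resample segment indices
        mean_x = sum(scores_x[i] for i in idx) / n
        mean_y = sum(scores_y[i] for i in idx) / n
        wins += mean_x > mean_y
    return wins / n_samples

# Made-up segment-level scores for two hypothetical systems:
hyp1_scores = [0.41, 0.55, 0.38, 0.62]
hyp2_scores = [0.44, 0.50, 0.40, 0.58]
print(bootstrap_win_rate(hyp1_scores, hyp2_scores))  # fraction of samples where hyp1 wins
```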
For even more detailed contrastive MT evaluation, please take a look at our new tool MT-Telescope.

COMET is optimized to run on a single GPU by taking advantage of length batching and embedding caching. With multiple GPUs the data is spread across devices, so we typically get fewer cache hits, and the length-batching sampler is replaced by a DistributedSampler. Because of that, according to our experiments, using 1 GPU is faster than using 2 GPUs, especially when scoring multiple systems for the same source and reference.
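As a minimal sketch of what length batching means here (not COMET's actual sampler): sorting segments by length before batching keeps sequences of similar length together, which reduces padding and speeds up inference.

```python
def length_batches(segments, batch_size):
    """Group segments into batches of similar length to minimise padding.
    Batches are returned in sorted-by-length order, so scores must be
    mapped back to the original order afterwards."""
    order = sorted(range(len(segments)), key=lambda i: len(segments[i].split()))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

sentences = ["The fire could be stopped",
             "Schools and kindergartens were open",
             "OK",
             "They were able to control the fire."]
print(length_batches(sentences, batch_size=2))  # indices grouped by similar length
```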
Nonetheless, if your data does not have repetitions and you have more than 1 GPU available, you can run multi-GPU inference with the following command:
```bash
comet-score -s src.de -t hyp1.en -r ref.en --gpus 2
```

MBR decoding can be performed with the `run_mbr.py` script. You need a plain-text file for the source sentences, one for the candidate translations, and one for the support hypotheses, with one sentence per line. The number of lines in the candidate and support files needs to be a multiple of the number of lines in the source file (line 1 = source sentence 1, lines 1-100 = candidates for source sentence 1 with 100 samples).
```bash
python run_mbr.py -s src.de -c candidates.en -t support.en -nc 100 -ns 100 -o mbr_out.txt
```
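As a rough sketch of the MBR selection logic (not the actual `run_mbr.py` implementation): each candidate is scored against all support hypotheses with some utility function, and the candidate with the highest average utility is selected. The `word_overlap` utility below is a toy stand-in for a learned metric such as COMET:

```python
def mbr_select(candidates, supports, utility):
    """Pick the candidate with the highest expected utility against the
    support hypotheses (Minimum Bayes Risk decoding)."""
    def expected_utility(cand):
        return sum(utility(cand, sup) for sup in supports) / len(supports)
    return max(candidates, key=expected_utility)

def word_overlap(a: str, b: str) -> float:
    # Toy utility: Jaccard overlap of word sets, standing in for a learned metric.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

candidates = ["The fire could be stopped", "They were able to control the fire."]
supports = ["The fire could be stopped", "The fire was stopped"]
print(mbr_select(candidates, supports, word_overlap))
```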
To get the individual MBR scores for the sensitivity analysis (with a potentially variable number of candidates), construct a JSON file in the following format, containing the source sentence and at least one candidate. The candidates can be arbitrarily named:

```json
{
  "0": {
    "src": "Dem Feuer konnte Einhalt geboten werden",
    "cand-1": "The fire could be stopped",
    "cand-2": "They were able to control the fire."
  },
  "1": {
    "src": "Schulen und Kindergärten sind geöffnet",
    "cand-1": "Schools and kindergartens were open",
    "cand-2": "Schools and kindergartens were open"
  },
  ...
}
```

Then you can run the following script; it returns a JSON file with the same structure, but each candidate entry is replaced by a list whose first element is the sentence and whose second element is the MBR score.
```bash
python run_mbr_for_sensitivity.py -j candidates.json -t support.en -ns 100 -o mbr_out.json
```
You can change the cache size of COMET using the following environment variable:

```bash
export COMET_EMBEDDINGS_CACHE="2048"
```

By default, the COMET cache size is 1024.
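A minimal sketch of how an environment-variable-controlled embedding cache can work (this is an illustration, not COMET's internal code; `expensive_encode` is a hypothetical placeholder for the real XLM-R encoding step):

```python
import os
from functools import lru_cache

# Cache size read from the same environment variable, defaulting to 1024.
CACHE_SIZE = int(os.environ.get("COMET_EMBEDDINGS_CACHE", "1024"))

def expensive_encode(sentence: str) -> tuple:
    # Placeholder for the real (slow) encoder forward pass.
    return tuple(hash(tok) for tok in sentence.split())

@lru_cache(maxsize=CACHE_SIZE)
def cached_encode(sentence: str) -> tuple:
    # Repeated sources/references (common when scoring several systems
    # against the same data) become cache hits instead of new forward passes.
    return expensive_encode(sentence)

cached_encode("Dem Feuer konnte Einhalt geboten werden")
cached_encode("Dem Feuer konnte Einhalt geboten werden")  # served from the cache
print(cached_encode.cache_info())  # hits=1, misses=1
```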
To score within Python:

```python
from comet import download_model, load_from_checkpoint

model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)
data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire."
    },
    {
        "src": "Schulen und Kindergärten wurden eröffnet.",
        "mt": "Schools and kindergartens were open",
        "ref": "Schools and kindergartens opened"
    }
]
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=1)
```

All the above-mentioned models are built on top of XLM-R, which covers the following languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
Thus, results for language pairs containing uncovered languages are unreliable!
We recommend the two following models to evaluate your translations:
- `wmt20-comet-da`: **DEFAULT** Reference-based regression model built on top of XLM-R (large) and trained on Direct Assessments from WMT17 to WMT19. Same as `wmt-large-da-estimator-1719` from previous versions.
- `wmt20-comet-qe-da`: Reference-free regression model built on top of XLM-R (large) and trained on Direct Assessments from WMT17 to WMT19. Same as `wmt-large-qe-estimator-1719` from previous versions.
These two models were developed to participate in the WMT20 Metrics shared task (Mathur et al. 2020) and were among the best metrics that year. Also, in a large-scale study performed by Microsoft Research, these two metrics are ranked 1st and 2nd in terms of system-level decision accuracy (Kocmi et al. 2020). At segment level, these systems also correlate well with expert evaluations based on MQM data (Freitag et al. 2020).
For more information about the available COMET models, read our metrics descriptions here.
Instead of using pretrained models, you can train your own model with the following command:
```bash
comet-train --cfg configs/models/{your_model_config}.yaml
```

You can then use your own metric to score:
```bash
comet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT
```

Note: Please contact ricardo.rei@unbabel.com if you wish to host your own metric within COMET's available metrics!
To run the toolkit tests, use the following commands:
```bash
coverage run --source=comet -m unittest discover
coverage report -m
```

If you use COMET, please cite our work! Also, don't forget to say which model you used to evaluate your systems.
