# WMT17

* To English
* Segment-level data
* Scores
* Pearson correlation




To reproduce our results for a given metric, import that metric's config from `configs.py` and run the cells below.

For example, for BERTScore (the run recorded below imports `bleurt_config` instead):

`from geneval.replication.configs import bertscore_config as config`
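
The cells below read only a handful of attributes from the config object. As a rough sketch, a config could look like the following. The attribute names are the ones this notebook accesses, but the values are assumptions made for illustration and may differ from what `configs.py` actually defines; `example_config` is a hypothetical name.

```python
from types import SimpleNamespace

# Hypothetical config: attribute names mirror what the cells below read;
# the values are illustrative assumptions, not copied from configs.py.
example_config = SimpleNamespace(
    metric_name="bertscore",      # used as the output sub-directory name
    metric_path="bertscore",      # passed to datasets.load_metric
    load_args={},                 # extra keyword arguments for load_metric
    compute_args={"lang": "en"},  # extra keyword arguments for scorer.compute
    uses_reference=True,          # whether references are passed to the metric
    uses_source=False,            # whether source sentences are passed
    score_name="f1",              # key of the score list in the metric output
)

# Build the compute arguments the same way the scoring cell does.
args = example_config.compute_args.copy()
if example_config.uses_reference:
    args["references"] = ["a reference sentence"]
if example_config.uses_source:
    args["sources"] = ["ein Quellsatz"]
print(sorted(args))
```

Keeping the flags `uses_reference` and `uses_source` in the config lets one scoring loop serve reference-based and reference-free metrics alike.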

In [1]:
# install dependencies

!pip install datasets
!pip install bert_score
!pip install git+https://github.com/google-research/bleurt.git
!pip install unbabel-comet
!pip install transformers
!pip install POT

In [2]:
!git clone https://github.com/drehero/geneval

Cloning into 'geneval'...
remote: Enumerating objects: 467, done.
remote: Counting objects: 100% (117/117), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 467 (delta 46), reused 104 (delta 37), pack-reused 350
Receiving objects: 100% (467/467), 44.23 MiB | 14.63 MiB/s, done.
Resolving deltas: 100% (190/190), done.
Checking out files: 100% (162/162), done.


In [3]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [4]:
import pathlib

import datasets
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

from geneval.geneval.data.wmt import WMT17

In [5]:
# import metric config

from geneval.replication.configs import bleurt_config as config

In [6]:
out_path = pathlib.Path("/content/drive/MyDrive/results/wmt17/")
lang_pairs = ["cs-en", "de-en", "fi-en", "lv-en", "ru-en", "tr-en", "zh-en"]

In [7]:
# Note: this run used datasets 2.4.0; load_metric was removed in datasets 3.0.
scorer = datasets.load_metric(config.metric_path, **config.load_args)

In [8]:
for lang_pair in lang_pairs:
    # load data
    wmt = WMT17(lang_pair)

    # compute score
    args = config.compute_args.copy()
    if config.uses_reference:
        args["references"] = wmt.references
    if config.uses_source:
        args["sources"] = wmt.sources
    
    scores = scorer.compute(
        predictions=wmt.translations,
        **args
    )

    # save
    df = pd.DataFrame({
        "translation": wmt.translations,
        "reference": wmt.references,
        "source": wmt.sources,
        "human_score": wmt.scores,
        "metric_score": scores[config.score_name] if config.score_name is not None else scores
    })
    if "model_type" in args:
        fn = f"{lang_pair}-{args['model_type'].split('/')[-1]}.csv"
    elif "config_name" in config.load_args:
        fn = f"{lang_pair}-{config.load_args['config_name'].split('/')[-1]}.csv"
    else:
        fn = f"{lang_pair}.csv"
    # to_csv does not create missing directories, so make sure the target exists
    metric_dir = out_path / config.metric_name
    metric_dir.mkdir(parents=True, exist_ok=True)
    df.to_csv(metric_dir / fn, index=False)

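
The filename selection in the scoring cell is repeated in the cell that reloads the scores for the correlation step. One way to keep the two in sync is a small helper. `result_filename` is a hypothetical name, not part of `geneval`, and the sketch assumes the same precedence as the notebook (`model_type` first, then `config_name`, then the bare language pair).

```python
def result_filename(lang_pair, compute_args, load_args):
    """Return the CSV filename for one language pair (hypothetical helper)."""
    # Check the model name first, then the metric configuration name,
    # keeping only the last path component of whichever is present.
    for key, mapping in (("model_type", compute_args), ("config_name", load_args)):
        if key in mapping:
            return f"{lang_pair}-{mapping[key].split('/')[-1]}.csv"
    return f"{lang_pair}.csv"


# "org/some-model" is a placeholder, not a real model identifier.
print(result_filename("de-en", {"model_type": "org/some-model"}, {}))
print(result_filename("de-en", {}, {}))
```

Both cells could then call `result_filename(lang_pair, config.compute_args, config.load_args)`, so a change to the naming scheme only has to be made once.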

In [9]:
# load the saved scores and compute the Pearson correlation per language pair
results = {}
for lang_pair in lang_pairs:
    if "model_type" in config.compute_args:
        fn = f"{lang_pair}-{config.compute_args['model_type'].split('/')[-1]}.csv"
    elif "config_name" in config.load_args:
        fn = f"{lang_pair}-{config.load_args['config_name'].split('/')[-1]}.csv"
    else:
        fn = f"{lang_pair}.csv"
    df = pd.read_csv(out_path / config.metric_name / fn)
    corr = pearsonr(df["metric_score"], df["human_score"])[0]
    results[lang_pair] = corr
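
`pearsonr` returns the correlation coefficient together with a p-value, which is why the cell above keeps only element `[0]`. The toy sketch below computes the same coefficient from its textbook definition (the sum of centred products divided by the product of the centred norms). The scores are made up for illustration and are not WMT17 data, and `pearson_r` is a hypothetical helper that uses only the standard library.

```python
import math


def pearson_r(x, y):
    """Pearson's r from its definition (sketch; no input validation)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return covariance / norm


# Made-up scores for four segments (illustration only, not WMT17 data).
human = [-1.0, 0.0, 0.5, 1.5]
linear_metric = [2 * h + 3 for h in human]  # perfectly linear in the human score
inverted_metric = [-h for h in human]       # perfectly anti-correlated

print(round(pearson_r(linear_metric, human), 6))    # 1.0
print(round(pearson_r(inverted_metric, human), 6))  # -1.0
```

Because Pearson's r is invariant to shifting and positive rescaling, a metric does not need to share the scale of the human scores to reach a high correlation.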

In [10]:
results

{'cs-en': 0.7575706298130365,
 'de-en': 0.7925074803716805,
 'fi-en': 0.8755595409028247,
 'lv-en': 0.834040096238825,
 'ru-en': 0.8194504123410535,
 'tr-en': 0.8393159587281375,
 'zh-en': 0.8240082422558146}
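
If a single summary number is wanted next to the per-language-pair values, one option is an unweighted mean. The sketch below uses the correlations printed above, rounded to four decimals. The choice of an unweighted mean is an assumption, since the notebook itself does not aggregate.

```python
from statistics import mean

# Correlations from the output above, rounded to four decimals.
results = {
    "cs-en": 0.7576, "de-en": 0.7925, "fi-en": 0.8756, "lv-en": 0.8340,
    "ru-en": 0.8195, "tr-en": 0.8393, "zh-en": 0.8240,
}

# One row per language pair, then the unweighted average over all pairs.
for lang_pair, corr in results.items():
    print(f"{lang_pair}  {corr:.3f}")
print(f"avg    {mean(results.values()):.3f}")
```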