# NLG Metricverse Demo

This notebook introduces **nlg-metricverse**. It contains simple examples showing how to apply Natural Language Generation (NLG) evaluation metrics, analyze them, and compute metric-metric and metric-human correlations.

If something is broken (it shouldn't be) or you have further questions, don't hesitate to send us an e-mail or report an issue.

Developed by
*   Giacomo Frisoni @ University of Bologna, Italy (giacomo.frisoni[at]unibo.it)
*   Andrea Zammarchi @ University of Bologna, Italy (andrea.zammarchi3[at]studio.unibo.it)
*   Marco Avagnano @ University of Bologna, Italy (marco.avagnano[at]studio.unibo.it)



## Installation

To start off, we have to install the nlg-metricverse package from PyPI or build the library from source. Select one:

In [None]:
# FROM PYPI
!pip install nlg-metricverse --quiet

or

In [None]:
# FROM GITHUB SOURCE
import os
!git clone https://github.com/disi-unibo-nlp/nlg-metricverse.git
os.chdir("/content/nlg-metricverse/")
!pip install -v . --quiet

## Imports

We start with required imports.

In [1]:
import json # Just for pretty printing the output metric dicts
from nlgmetricverse import NLGMetricverse, load_metric
base_path = "/content/nlg-metricverse/nlgmetricverse/metrics/"

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Usage

Using NLG Metricverse takes two simple steps:
1. define a scorer object (made up of one or more metrics);
2. apply it to the input texts (with a pipeline or parallel execution strategy when several metrics are used).

**NOTE:** several metrics (e.g., SacreBLEU, BERTScore, COMET, BLEURT) depend on external packages that you need to install separately. When you try to use a metric whose implementation relies on a missing package, NLG Metricverse throws an exception telling you which package to install.
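
If you prefer to check for those optional packages before building a scorer, a minimal sketch with the standard library could look like the following. The mapping from pip package name to importable module name is our assumption for illustration and is not taken from NLG Metricverse itself.

```python
import importlib.util

# Assumed mapping: pip package name -> importable top-level module name.
OPTIONAL_BACKENDS = {
    "sacrebleu": "sacrebleu",
    "bert-score": "bert_score",
    "unbabel-comet": "comet",
    "jiwer": "jiwer",
}

def missing_backends(backends=OPTIONAL_BACKENDS):
    """Return the pip package names whose module cannot be found."""
    return [
        package
        for package, module in backends.items()
        if importlib.util.find_spec(module) is None
    ]

print("Not installed:", missing_backends())
```

Anything listed can then be installed with `pip` before running the corresponding metric cell.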

In [2]:
# Feel free to play around with the samples below

# 1:1
predictions_1_1 = ["Peace in the dormitory, peace in the world.", "There is a cat on the mat."]
references_1_1 = ["Peace at home, peace in the world.", "The cat is playing on the mat."]

# 1:N
predictions_1_n = ["Evaluating artificial text has never been so simple", "the cat is on the mat"]
references_1_n = [
    ["Evaluating artificial text is not difficult", "Evaluating artificial text is simple"],
    ["The cat is playing on the mat.", "The cat plays on the mat."]
]

# N:M
predictions_m_n = [
    ["Evaluating artificial text has never been so simple", "The evaluation of automatically generated text is simple."],
    ["the cat is on the mat", "the cat likes playing on the mat"]
]
references_m_n = [
    ["Evaluating artificial text is not difficult", "Evaluating artificial text is simple"],
    ["The cat is playing on the mat.", "The cat plays on the mat."]
]

#@title Select the Hypothesis/Reference setting

N_ARITY = "N:M" #@param["1:1", "1:N", "N:M"]

REDUCTION_FUNCTION = "mean" #@param["mean", "max"]

predictions, references = None, None
if N_ARITY == "1:1":
    predictions, references = predictions_1_1, references_1_1
elif N_ARITY == "1:N":
    predictions, references = predictions_1_n, references_1_n
else:
    predictions, references = predictions_m_n, references_m_n
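
To make the role of `REDUCTION_FUNCTION` more concrete, here is a rough sketch of what reducing over several references means. It is not the library's implementation: `unigram_f1` is a toy score we made up for illustration, and `reduced_score` is a hypothetical helper.

```python
from statistics import mean

def unigram_f1(prediction: str, reference: str) -> float:
    """Toy score: F1 over the sets of lowercased whitespace tokens."""
    pred = set(prediction.lower().split())
    ref = set(reference.lower().split())
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def reduced_score(prediction: str, references: list, reduce_fn=max) -> float:
    """Score one prediction against each reference, then reduce to one number."""
    return reduce_fn([unigram_f1(prediction, ref) for ref in references])

refs = ["The cat is playing on the mat.", "The cat plays on the mat."]
pred = "the cat is on the mat"
print("mean:", reduced_score(pred, refs, reduce_fn=mean))
print("max: ", reduced_score(pred, refs, reduce_fn=max))
```

With `max`, a prediction is rewarded for matching its closest reference; with `mean`, it has to be reasonably close to all of them.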

### Basic features

##### Get metrics by applying filters

In [None]:
from nlgmetricverse import filter_metrics, Categories, ApplTasks, QualityDims

print(filter_metrics(category=Categories.Embedding))
print(filter_metrics(appl_task=ApplTasks.MachineTranslation))
print(filter_metrics(quality_dim=QualityDims.Fluency))

##### Load Hypothesis/References from files

In [None]:
from nlgmetricverse import DataLoaderStrategies

predictions_dir = "/content/data_loading/predictions"
references_dir = "/content/data_loading/references"

scorer = NLGMetricverse(metrics="meteor")
scores = scorer(predictions=predictions_dir, references=references_dir, 
                strategy=DataLoaderStrategies.OneRecordPerLine)
print(scores)


### Metrics



##### Abstractness

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "abstractness"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### Accuracy

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "accuracy"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### AUN

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "aun"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### BARTScore

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "bartscore"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

In [None]:
# Author samples
# https://github.com/neulab/BARTScore

auth_predictions = ["I'm super happy today.", "This is a good idea."]
auth_references = [
  ["I feel good today.", "I feel sad today."],
  ["Not bad.", "Sounds like a good idea."]
]
scorer = NLGMetricverse(metrics=load_metric(
    base_path + "bartscore",
    compute_kwargs={"segment_scores": True}))
scores = scorer(predictions=auth_predictions, references=auth_references) # max aggregation (default)
print(json.dumps(scores, indent=4))

##### BERTScore

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "bertscore"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

In [None]:
# HF samples
# https://github.com/huggingface/datasets/tree/master/metrics/bertscore

# Maximal values with the distilbert-base-uncased model:
hf_predictions = ["hello world", "general kenobi"]
hf_references = ["hello world", "general kenobi"]
scorer = NLGMetricverse(metrics=load_metric(
    base_path + "bertscore",
    compute_kwargs={"model_type": "distilbert-base-uncased"}))
scores = scorer(predictions=hf_predictions, references=hf_references)
print(json.dumps(scores, indent=4))

# Partial match with the bert-base-uncased model and 5 layers:
hf_predictions = ["hello world", "general kenobi"]
hf_references = ["goodnight moon", "the sun is shining"]
scorer = NLGMetricverse(metrics=load_metric(
    base_path + "bertscore",
    compute_kwargs={
        "model_type": "bert-base-uncased",
        "num_layers": 5
    }))
scores = scorer(predictions=hf_predictions, references=hf_references)
print(json.dumps(scores, indent=4))

##### BLEU

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "bleu"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

In [None]:
# HF samples
# https://raw.githubusercontent.com/huggingface/datasets/master/metrics/bleu

hf_predictions = ["hello there general kenobi", "foo bar foobar"]
hf_references = ["hello there general kenobi", "foo bar foobar"]
scorer = NLGMetricverse(metrics=load_metric(base_path + "bleu"))
scores = scorer(predictions=hf_predictions, references=hf_references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))
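
For intuition on why identical predictions and references reach the maximal BLEU score, the sketch below computes an unsmoothed corpus-level BLEU by hand (modified n-gram precisions up to order 4 plus a brevity penalty). It assumes whitespace tokenization and one reference per prediction, and it is not the implementation loaded by `load_metric`, so values on other inputs may differ.

```python
import math
from collections import Counter

def ngram_counts(tokens, max_order=4):
    """Count all n-grams of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def corpus_bleu(predictions, references, max_order=4):
    """Unsmoothed corpus BLEU with a single reference per prediction."""
    matches = [0] * max_order
    totals = [0] * max_order
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        pred_tok, ref_tok = pred.split(), ref.split()
        pred_len += len(pred_tok)
        ref_len += len(ref_tok)
        pred_counts = ngram_counts(pred_tok, max_order)
        # Counter intersection clips each n-gram count by its reference count.
        overlap = pred_counts & ngram_counts(ref_tok, max_order)
        for ngram, count in overlap.items():
            matches[len(ngram) - 1] += count
        for ngram, count in pred_counts.items():
            totals[len(ngram) - 1] += count
    if pred_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * math.exp(log_precision)

samples = ["hello there general kenobi", "foo bar foobar"]
print(corpus_bleu(samples, samples))
```

Note that the counts are pooled over the whole corpus: "foo bar foobar" has no 4-gram on its own, but the first sentence contributes one, so the 4-gram precision is still defined.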

##### BLEURT

In [None]:
!pip install --upgrade pip  # ensures that pip is current
!git clone https://github.com/google-research/bleurt.git
!pip install ./bleurt

In [None]:
scorer = NLGMetricverse(metrics=load_metric(
    base_path + "bleurt"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### CER

In [None]:
!pip install jiwer

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "cer"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### CharacTER

In [None]:
!pip install levenshtein

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "character"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### ChrF(++)

In [None]:
!pip install sacrebleu

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "chrf"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

In [None]:
# HF samples (do not match)
# https://github.com/huggingface/datasets/blob/master/metrics/chrf/README.md

hf_predictions = ["The relationship between cats and dogs is not exactly friendly.", "a good bookshop is just a genteel black hole that knows how to read."]
# One list of references per prediction
hf_references = [["The relationship between dogs and cats is not exactly friendly."], ["A good bookshop is just a genteel Black Hole that knows how to read."]]

# A simple example of calculating chrF
scorer = NLGMetricverse(metrics=load_metric(base_path + "chrf"))
scores = scorer(predictions=hf_predictions, references=hf_references)
print(json.dumps(scores, indent=4))

# The same example, but with the argument word_order=2, to calculate chrF++ instead of chrF
scorer = NLGMetricverse(metrics=load_metric(
    base_path + "chrf",
    compute_kwargs={"word_order": 2}))
scores = scorer(predictions=hf_predictions, references=hf_references)
print(json.dumps(scores, indent=4))

# The same chrF++ example as above, but with lowercase=True to normalize all case
scorer = NLGMetricverse(metrics=load_metric(
    base_path + "chrf",
    compute_kwargs={"word_order": 2, "lowercase": True}))
scores = scorer(predictions=hf_predictions, references=hf_references)
print(json.dumps(scores, indent=4))

##### CIDEr

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "cider"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### Coleman-Liau

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "coleman_liau"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### COMET

In [None]:
!pip install unbabel-comet

In [None]:
# HF samples
# Note: COMET is MT-specific and also requires source sentences

# Full match
# hf_sources = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
# hf_predictions = ["They were able to control the fire.", "Schools and kindergartens opened"]
# hf_references = ["They were able to control the fire.", "Schools and kindergartens opened"]

# Partial match
hf_sources = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
hf_predictions = ["The fire could be stopped", "Schools and kindergartens were open"]
hf_references = ["They were able to control the fire", "Schools and kindergartens opened"]

# No match
# hf_sources = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
# hf_predictions = ["The girl went for a walk", "The boy was sleeping"]
# hf_references = ["They were able to control the fire", "Schools and kindergartens opened"]

scorer = NLGMetricverse(metrics=load_metric(
    base_path + "comet",
    config_name="wmt21-cometinho-da", # smaller model than wmt20-comet-da (default)
    compute_kwargs={"gpus": 0, "num_workers": 0, "progress_bar": True, "batch_size": 2}))
scores = scorer(sources=hf_sources, predictions=hf_predictions, references=hf_references)
print(json.dumps(scores, indent=4))

##### EED

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "eed"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### F1

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "f1"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### Flesch-Kincaid

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "flesch_kincaid"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### Gunning-Fog

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "gunning_fog"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### MAUVE

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "mauve"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### METEOR

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "meteor"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

In [None]:
# Popular tutorial sample
tut_predictions = ["the cat sat on the mat"]
tut_references = ["on the mat sat the cat"]

#P = 1, R = 1, F_mean = 1.0000
#p = 0.5*(6/6)^3 = 0.5000
#M = 1.0000*(1-0.5000) = 0.5000
scorer = NLGMetricverse(metrics=load_metric(base_path + "meteor"))
scores = scorer(predictions=tut_predictions, references=tut_references)
print(json.dumps(scores, indent=4))

# HF sample
hf_predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
hf_references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
scorer = NLGMetricverse(metrics=load_metric(base_path + "meteor"))
scores = scorer(predictions=hf_predictions, references=hf_references)
print(json.dumps(scores, indent=4))
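
The arithmetic in the comments of the tutorial sample can be reproduced with a few lines of plain Python. This sketch applies the original METEOR formula to alignment counts that are given as inputs; the counts (6 matches, 6 chunks) are taken from the comment above, not computed by an aligner. An implementation that aligns the words differently, or uses other parameter values, will report a different score.

```python
def meteor_from_counts(matches, pred_len, ref_len, chunks):
    """Original METEOR formula applied to precomputed alignment counts."""
    if matches == 0:
        return 0.0
    precision = matches / pred_len
    recall = matches / ref_len
    # Harmonic mean that weights recall nine times more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty: more chunks per match means worse word order.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# Counts from the comment: P = R = 1, 6 chunks over 6 matches.
print(meteor_from_counts(matches=6, pred_len=6, ref_len=6, chunks=6))  # 0.5
```

If the same six matches were grouped into three chunks instead, the penalty would drop to 0.0625 and the score would rise to 0.9375, which shows how strongly the result depends on the alignment.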

##### MoverScore

In [None]:
!pip install git+https://github.com/AIPHES/emnlp19-moverscore.git
!pip install transformers

In [None]:
from moverscore_v2 import get_idf_dict, word_mover_score
from collections import defaultdict  # only needed for the uniform-IDF alternative below

refs = ["Peace at home, peace in the world.", "The cat is playing on the mat."]
hyps = ["Peace in the dormitory, peace in the world.", "There is a cat on the mat."]

# IDF weights estimated from the texts; for uniform weights use defaultdict(lambda: 1.) instead
idf_dict_hyp = get_idf_dict(hyps)
idf_dict_ref = get_idf_dict(refs)

scores = word_mover_score(refs, hyps, idf_dict_ref, idf_dict_hyp,
                          stop_words=[], n_gram=1, remove_subwords=True)
print(scores)

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "moverscore"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### NIST

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "nist"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### NUBIA

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "nubia"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### Perplexity

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "perplexity"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### Prism

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "prism"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### Repetitiveness

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "repetitiveness"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### ROUGE

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "rouge"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

In [None]:
# TO ROUGE OR NOT TO ROUGE samples
# https://towardsdatascience.com/to-rouge-or-not-to-rouge-6a5f3552ea45

# 1)
hf_predictions = ["The quick brown fox jumped over the lazy dog."]
hf_references = ["The fox jumped over the dog."]
scorer = NLGMetricverse(metrics=load_metric(base_path + "rouge"))
scores = scorer(predictions=hf_predictions, references=hf_references,
                rouge_types=["rougeL"],
                use_aggregator=False, use_stemmer=False,
                metric_to_select="fmeasure")
print(json.dumps(scores, indent=4))

# 2)
hf_predictions = ["The quick brown fox jumped over the lazy dog."]
hf_references = ["The quick brown dog jumped over the lazy fox."]
scorer = NLGMetricverse(metrics=load_metric(base_path + "rouge"))
scores = scorer(predictions=hf_predictions, references=hf_references,
                rouge_types=["rougeL"],
                use_aggregator=False, use_stemmer=False,
                metric_to_select="fmeasure")
print(json.dumps(scores, indent=4))

# 3)
hf_predictions = ["The quick brown fox jumped over the lazy dog."]
hf_references = ["The fast wood-coloured fox hopped over the lethargic dog."]
scorer = NLGMetricverse(metrics=load_metric(base_path + "rouge"))
scores = scorer(predictions=hf_predictions, references=hf_references,
                rouge_types=["rougeL"],
                use_aggregator=False, use_stemmer=False,
                metric_to_select="fmeasure")
print(json.dumps(scores, indent=4))

In [None]:
hf_predictions = ["The quick brown fox jumped over the lazy dog."]
hf_references = ["The quick brown dog jumped over the lazy fox."]
scorer = NLGMetricverse(metrics=load_metric(base_path + "rouge", compute_kwargs={"rouge_types": ["rougeL"]}))
scores = scorer(predictions=hf_predictions, references=hf_references,
                use_aggregator=False, use_stemmer=False,
                metric_to_select="fmeasure")
print(json.dumps(scores, indent=4))
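
To see where the ROUGE-L F-measures for these samples come from, the following sketch computes them from the longest common subsequence (LCS) of the two token lists. The tokenization (lowercasing, alphanumeric runs only, no stemming) is our assumption, and the helper names are hypothetical; it is meant as a cross-check, not as the library's implementation.

```python
import re

def tokenize(text):
    """Lowercase and keep alphanumeric runs only (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_fmeasure(prediction, reference):
    """F1 of LCS-based precision and recall."""
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

prediction = "The quick brown fox jumped over the lazy dog."
for reference in [
    "The fox jumped over the dog.",
    "The quick brown dog jumped over the lazy fox.",
    "The fast wood-coloured fox hopped over the lethargic dog.",
]:
    print(round(rouge_l_fmeasure(prediction, reference), 4), "<-", reference)
```

The second sample is the instructive one: swapping "fox" and "dog" changes the meaning, yet seven of the nine tokens still form a common subsequence, so the score stays high.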

##### SacreBLEU

In [None]:
!pip install sacrebleu

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "sacrebleu"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### TER

In [None]:
!pip install sacrebleu

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "ter"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

##### WER

In [None]:
!pip install jiwer

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "wer"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))
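
WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below is a plain-Python illustration of that definition with whitespace tokenization and no text normalization; it is not the `jiwer` implementation, and the function names are ours.

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance over two token sequences."""
    prev = list(range(len(hyp_tokens) + 1))
    for i, r in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, h in enumerate(hyp_tokens, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)

def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(word_error_rate("the cat is on the mat", "the cat sat on mat"))
print(char_error_rate("mat", "cat"))
```

Because the denominator is the reference length, both rates can exceed 1.0 when the hypothesis is much longer than the reference.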

##### WMD

In [None]:
scorer = NLGMetricverse(metrics=load_metric(base_path + "wmd"))
scores = scorer(predictions=predictions, references=references, reduce_fn=REDUCTION_FUNCTION)
print(json.dumps(scores, indent=4))

### Meta-Eval

##### Metric-Human Correlation

In [None]:
!pip install jiwer
!pip install sacrebleu

In [None]:
from matplotlib import pyplot as plt
from nlgmetricverse import metric_human_correlation

metrics = ["meteor", "ter", "wer"]
predictions_large = [
    ["Evaluating artificial text has never been so simple", "The evaluation of automatically generated text is simple.", "Evaluating artificial text is really easy."],
    ["the cat is on the mat", "the cat likes playing on the mat", "the cat is laying on the mat"],
    ["The weather outside is cold", "It's freezing today", "Look! It's not warm at all today"]
]
references_large = [
    ["Evaluating artificial text is not difficult", "Evaluating artificial text is simple", "The evaluation of artificial text is easy"],
    ["The cat is playing on the mat.", "The cat plays on the mat.", "Look! The cat plays on the mat"],
    ["Outside is cold today", "It's freezing today outside", "The temperature is low outside"]
]

metric_human_correlation(predictions=predictions_large, references=references_large, metrics=metrics, human_scores=[0.5, 0.6, 0.7])
plt.show()
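
As a reminder of what a metric-human correlation measures, here is a small sketch that computes a Pearson coefficient between per-sample metric scores and human judgments. The notebook does not state which coefficient `metric_human_correlation` uses, so Pearson is our assumption, and the metric scores below are hypothetical placeholder values, not outputs of the cell above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one sequence is constant
    return cov / math.sqrt(var_x * var_y)

human_scores = [0.5, 0.6, 0.7]       # same values as passed in the cell above
metric_scores = [0.42, 0.55, 0.61]   # hypothetical per-sample metric scores
print(pearson(metric_scores, human_scores))
```

A value close to 1 means the metric ranks and spaces the samples much like the annotators do; with only three samples, as here, the estimate is of course very unstable.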

##### Metric-Metric Correlation

In [None]:
!pip install jiwer
!pip install sacrebleu

In [None]:
from matplotlib import pyplot as plt
from nlgmetricverse import metrics_correlation

metrics = ["meteor", "ter", "wer"]
predictions_large = [
    ["Evaluating artificial text has never been so simple", "The evaluation of automatically generated text is simple.", "Evaluating artificial text is really easy."],
    ["the cat is on the mat", "the cat likes playing on the mat", "the cat is laying on the mat"],
    ["The weather outside is cold", "It's freezing today", "Look! It's not warm at all today"]
]
references_large = [
    ["Evaluating artificial text is not difficult", "Evaluating artificial text is simple", "The evaluation of artificial text is easy"],
    ["The cat is playing on the mat.", "The cat plays on the mat.", "Look! The cat plays on the mat"],
    ["Outside is cold today", "It's freezing today outside", "The temperature is low outside"]
]

metrics_correlation(predictions=predictions_large, references=references_large, metrics=metrics)
plt.show()

##### Performance comparison

In [None]:
from matplotlib import pyplot as plt
from nlgmetricverse import times_correlation

metrics = ["meteor", "ter", "wer"]
times_correlation(predictions=predictions, references=references, metrics=metrics)
plt.show()
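
If you want to time scoring functions yourself, a simple approach is to wrap each call with `time.perf_counter` and keep the best of a few repetitions. The sketch below shows that idea with two hypothetical stand-in metrics; it does not reproduce how `times_correlation` measures or plots its results.

```python
import time

def time_metrics(metric_fns, predictions, references, repeats=3):
    """Return the best wall-clock time (seconds) for each named scoring function."""
    timings = {}
    for name, fn in metric_fns.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(predictions, references)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings

# Hypothetical stand-in metrics, used only to have something to time.
def exact_match(preds, refs):
    return sum(p == r for p, r in zip(preds, refs)) / len(preds)

def length_ratio(preds, refs):
    return sum(len(p.split()) for p in preds) / sum(len(r.split()) for r in refs)

toy_predictions = ["the cat is on the mat", "There is a cat on the mat."]
toy_references = ["The cat is playing on the mat.", "The cat plays on the mat."]
timings = time_metrics(
    {"exact_match": exact_match, "length_ratio": length_ratio},
    toy_predictions,
    toy_references,
)
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.6f} s")
```

Taking the minimum over repetitions reduces noise from one-off costs such as model loading or caching, which matters for the heavier neural metrics.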

### Visualization


##### BERT neuron factors

In [None]:
from nlgmetricverse import bert_neuron_factors

text = ''' Now I ask you: what can be expected of man since he is a being endowed with strange qualities? Shower upon him every earthly blessing, drown him in a sea of happiness, so that nothing but bubbles of bliss can be seen on the surface; give him economic prosperity, such that he should have nothing else to do but sleep, eat cakes and busy himself with the continuation of his species, and even then out of sheer ingratitude, sheer spite, man would play you some nasty trick. He would even risk his cakes and would deliberately desire the most fatal rubbish, the most uneconomical absurdity, simply to introduce into all this positive good sense his fatal fantastic element. It is just his fantastic dreams, his vulgar folly that he will desire to retain, simply in order to prove to himself--as though that were so necessary-- that men still are men and not the keys of a piano, which the laws of nature threaten to control so completely that soon one will be able to desire nothing but by the calendar. And that is not all: even if man really were nothing but a piano-key, even if this were proved to him by natural science and mathematics, even then he would not become reasonable, but would purposely do something perverse out of simple ingratitude, simply to gain his point. And if he does not find means he will contrive destruction and chaos, will contrive sufferings of all sorts, only to gain his point! He will launch a curse upon the world, and as only man can curse (it is his privilege, the primary distinction between him and other animals), may be by his curse alone he will attain his object--that is, convince himself that he is a man and not a piano-key!
'''
bert_neuron_factors(text)

##### N-Gram distance

In [None]:
from nlgmetricverse import n_gram_distance_visualization

# The double indexing assumes the N:M setting (lists of lists of strings)
n_gram_distance_visualization(predictions[0][0], references[0][0])

##### Similarity Word Matching

In [None]:
from nlgmetricverse import similarity_word_matching

# The double indexing assumes the N:M setting (lists of lists of strings)
similarity_word_matching(predictions[1][0], references[1][0], lang="en")