# NLG Metricverse Demo

This notebook is an introduction to **nlg-metricverse**. It contains simple examples showing how to apply Natural Language Generation (NLG) evaluation metrics, analyze them, and compute metric-metric and metric-human correlations.

Don't hesitate to send us an e-mail or report an issue if something is broken (it shouldn't be) or if you have further questions.

Developed by
*   Giacomo Frisoni @ University of Bologna, Italy (giacomo.frisoni[at]unibo.it)
*   Andrea Zammarchi @ University of Bologna, Italy
*   Marco Avagnano @ University of Bologna, Italy



### Supported Metrics

NLG Metricverse supports X diverse evaluation metrics overall (last update: May X, 2022).<br>
Please refer to the GitHub page to add new metrics.

| Metric | Publication Year | Conference | Categories | Eval Task | Property | Main Tasks | Trained* | Unsupervised** | Eval Dimensions |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| BLEU | 2002 | ACL | n-gram overlap | matching | n-gram precision | MT, IC, DG, QG, RG | X | ✓ | ADE, FLU |
| NIST | 2002 | HLT | n-gram overlap | matching | n-gram precision w/ IDF-weighted n-grams | MT | X | ✓ |
| ORANGE (SentBLEU) | 2004 | COLING | n-gram overlap | matching | n-gram precision w/ smoothing | MT | X | ✓ |
| ROUGE | 2004 | ACL | n-gram overlap | matching | n-gram recall | SUM, DG | X | ✓ | COV |
| WER | 2004 | ICSLP | n-gram overlap | matching | % of insert, delete, replace | MT, SR | X | ✓ |
| METEOR | 2005 | ACL | n-gram overlap | matching | n-gram harmonic mean w/ paraphrase knowledge<br>(e.g., stemming, synonyms) and penalty factor for<br>fragmented matches | MT, IC, DG | X | ✓ |
| CIDEr (TODO) | 2015 | CVPR | n-gram overlap | matching | cosine similarity between TF-IDF weighted n-grams | IC | X | ✓ |
| TER | 2006 | AMTA | n-gram overlap | matching | translation edit rate<br>(i.e., WER + shift movement as extra editing step) | MT | X | ✓ |
| ChrF(++) | 2015 | WMT | n-gram overlap | matching | character-level precision and recall | MT, IC, SUM | X | ✓ |
| WMD (TODO) | 2015 | ICML | distance-based | | earth mover's distance (EMD) on words | IC, SUM | X | ✓ |
| SMD (TODO) | 2015 | ICML | distance-based | | earth mover's distance (EMD) on sentences | IC, SR, SUM | X | ✓ |
| CharacTER (TODO) | 2016 | WMT | n-gram overlap | matching | character-level TER | MT | X | ✓ |
| SacreBLEU | 2018 | WMT | n-gram overlap | matching | standardized BLEU | MT | X | ✓ |
| METEOR++ (TODO) | 2018 | WMT | n-gram overlap | matching | METEOR w/ copy knowledge and syntactic-level<br>paraphrases matching | MT | X | ✓ |
| MoverScore | 2019 | EMNLP | embedding-based | matching | IDF-weighted n-gram soft-alignment (WMD generalization)<br>via contextualized embeddings | MT, SUM, D2T, IC | ✓ (ELMo/BERT) | ✓ |
| COMET | 2020 | EMNLP | embedding-based | regression<br>ranking | multilingual-MT human judgment predictions through<br>pre-trained cross-lingual encoders (word embeddings) +<br> pooling layers (sentence embeddings) +<br>feed-forward regressor or triplet margin loss depending<br>on the judgement type (real-value or relative ranking) | MT | ✓ (XLM-RoBERTa)<br><i>end-to-end</i> | X |
| FactCC(X) (TODO) | 2020 | EMNLP | embedding-based | classification | weakly-supervised whole-document↔summary-sentence<br>factual consistency evaluation based on BERT's \[CLS\] | SUM | ✓ (BERT)<br><i>end-to-end</i> | X |
| BLEURT | 2020 | ACL | embedding-based | regression | robust human score prediction based on fine-tuning a BERT<br>model with an additional pre-training scheme characterized<br>by millions of synthetic reference-candidate pairs and lexical-/<br>semantic-level tasks combined through an aggregated loss | MT, D2T | ✓ (BERT)<br><i>end-to-end</i> | X |
| NUBIA (TODO) | 2020 | EvalNLGEval<br>NeurIPS talk | embedding-based | regression | human score prediction with three modules: neural feature<br>extractor on reference-hypothesis pairs (multiple pre-trained<br>transformers capturing semantic similarity, logic entailment,<br>sentence intelligibility) + aggregator (features→quality score<br>mapping) + calibrator | MT, IC | ✓ (RoBERTa, GPT-2)<br><i>end-to-end</i> | X |
| BERTScore | 2020 | ICLR | embedding-based | matching | IDF-weighted n-gram hard-alignment via contextualized<br>embeddings | ...| ✓ (BERT) | ✓ |
| BARTScore | 2021 | NeurIPS | embedding-based | generation | multi-perspective evaluation as text generation via a<br>pre-trained seq2seq model | MT, SUM, D2T | ✓ (BART) | ✓ | INFO, REL, FLU, COH,<br>FAC, COV, ADE |
| MAUVE (TODO) | 2021 | NeurIPS | | | |
| RoMe (TODO) | 2022 | ACL | | | |
| InfoLM (TODO) | 2022 | AAAI | | | |
| Perplexity (TODO) | / | / | | | |

\*contain learnable components<br>
\**are human judgment-free (i.e., do not require human judgments to train)

<i>Tasks</i>. <b>MT</b>: Machine Translation, <b>IC</b>: Image Captioning, <b>SR</b>: Speech Recognition, <b>SUM</b>: Document Summarization, <b>DG</b>: Document or Story Generation, <b>QG</b>: Question Generation, <b>RG</b>: Dialogue Response Generation, <b>D2T</b>: Data-to-Text.

<i>Eval Dimensions</i>. <b>INFO</b>: Informativeness, <b>REL</b>: Relevance, <b>FLU</b>: Fluency, <b>COH</b>: Coherence, <b>FAC</b>: Factuality, <b>COV</b>: Semantic Coverage, <b>ADE</b>: Adequacy.
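
To make the "Property" column more concrete, here is a minimal sketch of three quantities that the n-gram overlap metrics in the table build on: clipped n-gram precision (the core of BLEU), n-gram recall (the core of ROUGE-N), and word error rate (WER). This is not the code used by nlg-metricverse. The function names `ngrams`, `ngram_overlap`, and `word_error_rate` are hypothetical, tokenization is plain whitespace splitting, and the sketch assumes a single reference. Full metrics add further components, such as BLEU's brevity penalty and its averaging over several n-gram orders.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams found in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) based on clipped n-gram matches."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count per n-gram, so a
    # repeated candidate n-gram is credited at most as often as it
    # occurs in the reference (clipping).
    matches = sum((cand & ref).values())
    precision = matches / max(sum(cand.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    return precision, recall


def word_error_rate(candidate, reference):
    """Word-level edit distance divided by the number of reference words."""
    hyp, ref = candidate.split(), reference.split()
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Toy sentences chosen for illustration only.
candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(ngram_overlap(candidate, reference, n=1))
print(ngram_overlap(candidate, reference, n=2))
print(word_error_rate(candidate, reference))
```

Precision asks how much of the candidate is supported by the reference, while recall asks how much of the reference is covered by the candidate. This is why precision-oriented metrics are common in translation and recall-oriented ones in summarization.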

### Installation

To start off, we have to install the nlg-metricverse package from PyPI or build the library from source.

In [None]:
# Option 1: install the released package from PyPI.
# !pip install nlg-metricverse

# Option 2: build the library from source.
# The path below assumes a Google Colab runtime, where repositories are cloned under /content.
import os
!git clone https://github.com/disi-unibo-nlp/nlg-metricverse.git
os.chdir("/content/nlg-metricverse/")
!pip install -v .

Now we can import our library.

In [None]:
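from nlgmetricverse import Nlgmetricverse

# Create a scorer with the library's default metrics.
scorer = Nlgmetricverse()
# Alternatively, pass an explicit list of metrics (here, BERTScore only).
# This reassignment replaces the default scorer created above.
scorer = Nlgmetricverse(metrics=["bertscore"])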