# Dataset.metric

The library also provides a selection of metrics focusing in particular on:

* providing a common API across a range of NLP metrics,
* providing metrics associated with some of the benchmark datasets provided by the library, such as GLUE or SQuAD,
* providing access to recent and somewhat complex metrics such as BLEURT or BERTScore,
* allowing simple use of metrics in distributed and large-scale settings.

Metrics in the datasets library have a lot in common with how `datasets.Dataset` objects are loaded and provided using `datasets.load_dataset()`.

Like datasets, metrics are added to the library as small scripts wrapping them in a common API.

A `datasets.Metric` can be created from various sources:

* from a metric script provided on the [HuggingFace Hub](https://huggingface.co/metrics), or
* from a metric script provided at a local path in the filesystem.

In this section we detail these options to access metrics.

In [2]:
from datasets import list_metrics
print(list_metrics())

['accuracy', 'bertscore', 'bleu', 'bleurt', 'coval', 'f1', 'gleu', 'glue', 'indic_glue', 'meteor', 'precision', 'recall', 'rouge', 'sacrebleu', 'seqeval', 'squad', 'squad_v2', 'xnli']


To load a metric from the Hub, use the `datasets.load_metric()` command and give it the short name of the metric you would like to load, as listed above.

Let’s load the metric associated with the MRPC subset of the GLUE benchmark for Natural Language Understanding. You can explore this dataset and find more details about it on the online dataset viewer.

In [3]:
from datasets import load_metric

# As with datasets, some benchmark collections such as GLUE
# require the name of the specific subset as a second argument
metric = load_metric('glue', 'mrpc')


In [4]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
""", stored examples: 0)

Evaluating a model’s predictions with `datasets.Metric` involves just a couple of methods:

1. `datasets.Metric.add()` and `datasets.Metric.add_batch()` add prediction/reference pairs (or just predictions, if a metric doesn’t make use of references) to a temporary, memory-efficient cache table,
2. `datasets.Metric.compute()` then gathers all the cached predictions and references to compute the metric score.

Note:

* `datasets.Metric.add_batch()` requires named arguments, to avoid the silent error of mixing up predictions and references.

In [None]:
# Example of typical usage
# (`dataset` and `model` are placeholders for your own batches and model)
for batch in dataset:
    inputs, references = batch
    predictions = model(inputs)
    metric.add_batch(predictions=predictions, references=references)
score = metric.compute()
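
The loop above relies on a `dataset` and a `model` that are not defined in this notebook. To see the accumulate-then-compute flow in isolation, here is a minimal, self-contained sketch. `ToyAccuracy` is a hypothetical stand-in written for illustration only; it is not the `datasets.Metric` implementation (which, as noted above, uses a memory-efficient cache table rather than Python lists). It merely mirrors the call pattern, including keyword-only arguments, and the batches are made-up values.

```python
class ToyAccuracy:
    """Hypothetical stand-in mirroring the add / add_batch / compute call pattern."""

    def __init__(self):
        self._predictions = []
        self._references = []

    def add(self, *, prediction, reference):
        # Cache a single prediction/reference pair.
        self._predictions.append(prediction)
        self._references.append(reference)

    def add_batch(self, *, predictions, references):
        # The bare `*` makes both arguments keyword-only,
        # so predictions and references cannot be swapped silently.
        predictions, references = list(predictions), list(references)
        if len(predictions) != len(references):
            raise ValueError("predictions and references must have the same length")
        self._predictions.extend(predictions)
        self._references.extend(references)

    def compute(self):
        # Gather everything cached so far, score it, then empty the cache
        # (emptying the cache is a design choice of this toy class).
        if not self._references:
            raise ValueError("nothing to score: call add() or add_batch() first")
        correct = sum(p == r for p, r in zip(self._predictions, self._references))
        score = {"accuracy": correct / len(self._references)}
        self._predictions, self._references = [], []
        return score


toy_metric = ToyAccuracy()
# Two made-up batches of (predictions, references).
batches = [([1, 0, 1], [1, 1, 1]), ([0, 0], [0, 1])]
for predictions, references in batches:
    toy_metric.add_batch(predictions=predictions, references=references)
toy_score = toy_metric.compute()
print(toy_score)
```

Calling `toy_metric.add_batch([1], [1])` with positional arguments raises a `TypeError`, which is the kind of safeguard the note above describes.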

# Questions?

[Using datasets.Metric with Trainer()](https://github.com/huggingface/datasets/issues/1592):
    
I was quite surprised in the Metric documentation I don't see how it can be used with `Trainer()`. That would be the most intuitive use case instead of having to iterate the batches and add predictions and references to the metric, then compute the metric manually. Ideally, any pre-built metrics can be added to `compute_metrics` argument of `Trainer()` and they will be calculated at an interval specified by `TrainingArguments.evaluation_strategy`.
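
One possible way to bridge the two, sketched here under assumptions: `Trainer` accepts a `compute_metrics` callable that receives the evaluation predictions together with the label ids and returns a dictionary of scores, so a loaded metric's `compute()` can be called inside it. The snippet below uses only the standard library so it can run on its own; `toy_metric_compute` and `FakeEvalPrediction` are hypothetical stand-ins for `metric.compute` and for the object `Trainer` passes in, and the logits are made-up values.

```python
from collections import namedtuple

# Hypothetical stand-in for the object handed to `compute_metrics`;
# assumed to expose `predictions` (logits) and `label_ids`.
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])


def toy_metric_compute(*, predictions, references):
    # Stand-in for `metric.compute(predictions=..., references=...)`.
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


def compute_metrics(eval_pred):
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    # Turn each row of logits into a class id (argmax over the row).
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return toy_metric_compute(predictions=predictions, references=list(labels))


fake_eval = FakeEvalPrediction(
    predictions=[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]],
    label_ids=[1, 0, 0, 0],
)
trainer_scores = compute_metrics(fake_eval)
print(trainer_scores)
```

With the real libraries, the function would be passed as the `compute_metrics` argument of `Trainer()` and `toy_metric_compute` would be replaced by `metric.compute`. Logits are typically arrays in that setting, so the argmax line would need adapting.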