# evaluate
Examples of how to use the Hugging Face `evaluate` library to score NLP tasks with different metrics.
[website](https://huggingface.co/evaluate-metric)

In [1]:
!pip install datasets evaluate --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[?25h

## Example 1: Evaluating Accuracy for a Classification Task
Let's compute accuracy for a set of placeholder predictions against true labels from the GLUE MRPC validation split.

In [2]:
from datasets import load_dataset
from evaluate import load

# Load the dataset
dataset = load_dataset('glue', 'mrpc', split='validation[:10%]')

# Placeholder predictions (hard-coded for illustration, not produced by a model)
predictions = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
# Ground-truth labels of the first 10 loaded validation examples
references = dataset['label'][:10]

# Load the accuracy metric
accuracy_metric = load('accuracy')

# Compute the accuracy
accuracy = accuracy_metric.compute(predictions=predictions, references=references)

# Print the result
print("Accuracy:", accuracy)

Accuracy: {'accuracy': 0.5}
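
To make the reported number concrete, here is a minimal pure-Python sketch of what an accuracy score measures: the fraction of predictions that equal their reference label. The helper `accuracy` and the toy labels are assumptions made for illustration; they are not part of the `evaluate` API and are not taken from MRPC.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


# Hypothetical labels, invented for this sketch (not MRPC data)
toy_predictions = [1, 0, 1, 1]
toy_references = [1, 1, 1, 0]

print({"accuracy": accuracy(toy_predictions, toy_references)})  # {'accuracy': 0.5}
```

Two of the four toy predictions match, so the score is 0.5. A score of 0.5 on a binary task is what constant or random guessing would often achieve, which is worth keeping in mind when predictions are placeholders rather than model output.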


## Example 2: Evaluating BLEU Score for a Translation Task
Let's compute the BLEU score for a machine translation task using a single sample sentence pair.

In [3]:
from evaluate import load

# Assume these are the predicted translations and references
# (references is a list of lists: each prediction may have several references)
predictions = ["The quick brown fox jumps over the lazy dog."]
references = [["The quick brown fox jumps over the lazy dog."]]

# Load the BLEU metric
bleu_metric = load('bleu')

# Compute the BLEU score
bleu = bleu_metric.compute(predictions=predictions, references=references)

# Print the result
print("BLEU Score:", bleu)


BLEU Score: {'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 10, 'reference_length': 10}
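
The fields in this result can be illustrated with a simplified sentence-level BLEU: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The sketch below is an assumption-laden illustration, not the library's implementation: the function names are hypothetical, tokens are split on whitespace, and the shortest reference length is used for the brevity penalty. The reported `translation_length` of 10 is consistent with the final period being counted as its own token, so the example separates it by hand.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    # For each n-gram, the most times it occurs in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def sentence_bleu(candidate, references, max_order=4):
    """Simplified BLEU: geometric mean of precisions times brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_order + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    # Assumption of this sketch: compare against the shortest reference.
    ref_len = min(len(ref) for ref in references)
    if len(candidate) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - ref_len / len(candidate))
    return penalty * geo_mean


tokens = "The quick brown fox jumps over the lazy dog .".split()
print(len(tokens))                      # 10
print(sentence_bleu(tokens, [tokens]))  # 1.0

# Clipping: "the" appears 4 times in the candidate but only 2 times in the reference.
repeated = "the the the the".split()
reference = "the cat is on the mat".split()
print(modified_precision(repeated, [reference], 1))  # 0.5
```

An identical prediction and reference give every precision and the brevity penalty a value of 1.0, which is why the score above is a perfect 1.0. The clipping example shows why BLEU does not reward repeating a common word.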


## Example 3: Evaluating ROUGE Score for a Summarization Task
Let's compute the ROUGE score for a summarization task using sample summaries and references.

Note: To use `evaluate-metric/rouge`, install its `rouge_score` dependency first, for example with `pip install rouge_score`.

In [5]:
!pip install rouge_score --quiet


In [6]:
from evaluate import load

# Assume these are the generated summaries and references
predictions = ["The quick brown fox jumps over the lazy dog."]
references = ["A fast brown fox leaps over a sleepy dog."]

# Load the ROUGE metric
rouge_metric = load('rouge')

# Compute the ROUGE score
rouge = rouge_metric.compute(predictions=predictions, references=references)

# Print the result
print("ROUGE Score:", rouge)


ROUGE Score: {'rouge1': 0.4444444444444444, 'rouge2': 0.125, 'rougeL': 0.4444444444444444, 'rougeLsum': 0.4444444444444444}
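
These values can be checked by hand. The sketch below is a simplified re-implementation for a single prediction and reference, assuming lowercase alphanumeric tokenization and no stemming; the helper names are hypothetical and this is not the `rouge_score` code itself. ROUGE-1 and ROUGE-2 are computed as the F1 of overlapping unigrams and bigrams, and ROUGE-L as the F1 of the longest common subsequence.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep only alphanumeric runs (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall from raw counts."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(pred, ref, n):
    """F1 over n-grams shared by the prediction and the reference."""
    pred_grams = Counter(zip(*(pred[i:] for i in range(n))))
    ref_grams = Counter(zip(*(ref[i:] for i in range(n))))
    overlap = sum((pred_grams & ref_grams).values())
    return f1(overlap, sum(pred_grams.values()), sum(ref_grams.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(pred, ref):
    """F1 based on the longest common subsequence."""
    return f1(lcs_length(pred, ref), len(pred), len(ref))


pred = tokenize("The quick brown fox jumps over the lazy dog.")
ref = tokenize("A fast brown fox leaps over a sleepy dog.")

print(rouge_n(pred, ref, 1))  # 4 shared unigrams out of 9 -> about 0.4444
print(rouge_n(pred, ref, 2))  # 1 shared bigram out of 8 -> 0.125
print(rouge_l(pred, ref))     # LCS "brown fox over dog" -> about 0.4444
```

The shared unigrams are "brown", "fox", "over", and "dog", and the only shared bigram is "brown fox", which reproduces the 0.4444 and 0.125 figures in the output above.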


## Example 4: Evaluating BERTScore for Text Generation
Let's compute the BERTScore for a text generation task.

Note: To use `evaluate-metric/bertscore`, install its `bert_score` dependency first, for example with `pip install bert_score`.

In [9]:
!pip install bert_score --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.1/61.1 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.3/21.3 MB[0m [31m35.0 MB/s[0m eta [36m0:00:00[0m
[?25h

In [10]:
from evaluate import load

# Assume these are the generated texts and references
predictions = ["The quick brown fox jumps over the lazy dog."]
references = ["A quick brown fox leaps over a lazy dog."]

# Load the BERTScore metric
bertscore_metric = load('bertscore')

# Compute the BERTScore
bertscore = bertscore_metric.compute(predictions=predictions, references=references, lang="en")

# Print the result
print("BERTScore:", bertscore)



BERTScore: {'precision': [0.9852874279022217], 'recall': [0.9852874279022217], 'f1': [0.9852874279022217], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.41.1)'}
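
Each value in this result is a list with one score per prediction and reference pair. The precision, recall, and F1 come from greedy matching of token embeddings by cosine similarity: precision averages, over prediction tokens, the best similarity to any reference token, and recall does the same from the reference side. The sketch below shows only that matching step. The two-dimensional vectors are toy values invented for illustration; the real metric uses contextual embeddings from a model such as `roberta-large`, and the helper names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(pred_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors."""
    # Precision: each prediction token takes its best match in the reference.
    precision = sum(max(cosine(p, r) for r in ref_vecs)
                    for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token takes its best match in the prediction.
    recall = sum(max(cosine(r, p) for p in pred_vecs)
                 for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy embeddings (hypothetical): two prediction tokens, one reference token
pred_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0]]

precision, recall, f1_score = greedy_match_scores(pred_vecs, ref_vecs)
print(precision, recall)  # 0.5 1.0
print(round(f1_score, 4))  # 0.6667
```

In the toy case the extra prediction token has no counterpart in the reference, so precision drops to 0.5 while recall stays at 1.0.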
