# Use Metrics to Evaluate Your Model

In this part of our end-to-end series, we will evaluate the annotation results of our dataset using the `metrics` module. To review the previous steps, see the tutorials on [creating the dataset](./create-dataset-001.ipynb), [adding responses and suggestions](./add-resoponses), or [training your model](./train-model-006.ipynb). Feel free to check out the [practical guides](../../../../practical_guides/practical_guides.md) page for more in-depth information.

Once the annotators have labeled your dataset, we strongly recommend evaluating the annotation results, especially if you plan to use the dataset to train a model. The `metrics` module provides the tools to compare the annotated data against suggestions.

![workflow](../../../../_static/tutorials/end2end/base/workflow_metrics.png)

## Table of Contents

1. [Pull the Dataset](#Pull-the-Dataset)
    1. [From Argilla](#From-Argilla)
    2. [From HuggingFace Hub](#From-HuggingFace-Hub)
2. [Unify Responses](#Unify-Responses)
3. [Annotator Metrics](#Annotator-Metrics)
    1. [Metrics per Annotator](#Metrics-per-Annotator)
    2. [Metrics for Unified Responses](#Metrics-for-Unified-Responses)
4. [Conclusion](#Conclusion)


## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).

**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../../../getting_started/quickstart.md). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>

First, let's install our dependencies and import the necessary libraries:

In [None]:
!pip install argilla
!pip install datasets transformers

In [2]:
import argilla as rg
from argilla.client.feedback.metrics.annotator_metrics import AnnotatorMetric

To run this notebook, we need credentials to push and load datasets from `Argilla` and `🤗 Hub`. Let's set them in the following cell:

In [None]:
# Argilla credentials
api_url = "http://localhost:6900"  # "https://<YOUR-HF-SPACE>.hf.space"
api_key = "admin.apikey"  # replace with your own API key
# Huggingface credentials
hf_token = "hf_..."
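
If you prefer not to hard-code credentials in the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the helper `load_credentials` and the variable names `ARGILLA_API_URL`, `ARGILLA_API_KEY` and `HF_TOKEN` are assumptions chosen for this example, and the fallback values mirror the cell above.

```python
import os
from typing import Mapping, Tuple


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str, str]:
    """Read credentials from a mapping (the process environment by default).

    The variable names are assumptions made for this sketch.
    """
    api_url = env.get("ARGILLA_API_URL", "http://localhost:6900")
    api_key = env.get("ARGILLA_API_KEY", "admin.apikey")
    hf_token = env.get("HF_TOKEN", "")  # empty string when no token is configured
    return api_url, api_key, hf_token


api_url, api_key, hf_token = load_credentials()
```

Passing the environment as an argument keeps the helper easy to test with a plain dictionary.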

Log in to Argilla:

In [None]:
rg.init(api_url=api_url, api_key=api_key)

### Enable Telemetry
We gain valuable insights from how you interact with our tutorials. Running the following lines of code helps us understand whether this tutorial is serving you effectively, so that we can offer you more suitable content. The data is entirely anonymous, and you can skip this step if you prefer. For more info, please check out the Telemetry page.

```python
from argilla.utils.telemetry import tutorial_running

tutorial_running()
```

## Pull the Dataset

To employ metrics, we can pull a dataset that consists of multiple annotations per record. We can do this either from Argilla or HuggingFace Hub. Let us see how we can pull it from both sources.

### From Argilla

We can pull the dataset from Argilla by using the `from_argilla` method. 

In [None]:
# Assumes the dataset "go_emotions_raw" lives in a workspace named "argilla"
dataset = rg.FeedbackDataset.from_argilla("go_emotions_raw", workspace="argilla")

### From HuggingFace Hub

We can also pull the dataset from HuggingFace Hub. Similarly, we can use the `from_huggingface` method to pull the dataset.

In [None]:
dataset = rg.FeedbackDataset.from_huggingface("argilla/go_emotions_raw", split="train[:1000]")

<div class="alert alert-info">

Note 

The dataset pulled from HuggingFace Hub is an instance of `FeedbackDataset` whereas the dataset pulled from Argilla is an instance of `RemoteFeedbackDataset`. The difference between the two is that the former is a local one and the changes made on it stay locally. On the other hand, the latter is a remote one and the changes made on it are directly reflected on the dataset on the Argilla server, which can make your process faster.

</div>

Let us briefly examine what our dataset looks like. It is a dataset that consists of data items with the field `text`. For each record, we have multiple annotations that label the text with at least one sentiment. Let us see an example of a text and the given responses. In this example, the record has been annotated by 3 annotators and one of them has labeled the text with one sentiment while the other two have labeled it with two sentiments.

In [140]:
print("text:", dataset[5].fields["text"])
print("responses:", [dataset[5].responses[i].values["label"].value for i in range(len(dataset[5].responses))])

text:  And not all children's hospitals need the same stuff, so call and ask what they need. But I like your tip. You're correct. 
responses: [['neutral'], ['approval', 'desire'], ['approval', 'love']]
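
To get a quick feel for how much the annotators agree on a record, we can tally how often each label was chosen. This is a small sketch in plain Python that works on the label lists printed above; it does not use the Argilla API.

```python
from collections import Counter

# Responses copied from the example record above: one label list per annotator.
responses = [["neutral"], ["approval", "desire"], ["approval", "love"]]

# Count how many annotators selected each label.
label_counts = Counter(label for labels in responses for label in labels)

for label, count in label_counts.most_common():
    print(f"{label}: {count} of {len(responses)} annotators")
```

Here `approval` is the only label selected by more than one annotator.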


## Unify Responses

When you have multiple annotations per record in your project, it is good practice to unify the responses so that each record has a single response. This makes the dataset more consistent and easier to work with. Let us see how we can unify the responses with Argilla. First, we create a strategy to unify the responses. We go with the `majority` vote strategy, which means that we keep the labels chosen by the majority of the annotators.

In [146]:
strategy = rg.MultiLabelQuestionStrategy("majority")

In [None]:
dataset.compute_unified_responses(
    question=dataset.question_by_name("label"),
    strategy=strategy,
)

We can look at a record to see how the responses have been unified. In our case, the responses have been unified to `approval` as it is the majority vote among the responses.

In [148]:
dataset.records[5].unified_responses

{'label': [UnifiedValueSchema(value=['approval'], strategy=<RatingQuestionStrategy.MAJORITY: 'majority'>)]}
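
The idea behind a majority vote on multi-label responses can be sketched in a few lines of plain Python. This is an illustration under a simple assumption (a label is kept when more than half of the annotators chose it) and not Argilla's actual implementation, which may handle ties and records without a majority differently. The function name `majority_labels` is made up for this example.

```python
from collections import Counter
from typing import List


def majority_labels(responses: List[List[str]]) -> List[str]:
    """Return the labels chosen by more than half of the annotators."""
    # Use set() so that a label repeated by one annotator is counted once.
    counts = Counter(label for labels in responses for label in set(labels))
    threshold = len(responses) / 2
    # Sort the result so the output is deterministic.
    return sorted(label for label, n in counts.items() if n > threshold)


responses = [["neutral"], ["approval", "desire"], ["approval", "love"]]
print(majority_labels(responses))
```

For the example record, only `approval` passes the threshold, which matches the unified response shown above.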

## Annotator Metrics

Annotator metrics compare the responses from each annotator (or the unified response) against the suggestions we have added to the dataset. They show how each annotator individually, or all annotators collectively, perform on the dataset. Argilla offers various metrics, and the full list is available on the [metrics](#linkt-to-metrics) page. Alternatively, you can discover the available metrics through the `allowed_metrics` attribute.

The question type we have in the current dataset is `MultiLabelQuestion`. The `allowed_metrics` attribute lists the metrics below, which are available for this question type.

In [150]:
metric = AnnotatorMetric(dataset=dataset, question_name="label")
metric.allowed_metrics

['accuracy', 'f1-score', 'precision', 'recall', 'confusion-matrix']

### Metrics per Annotator

With `AnnotatorMetric`, we have two options for the ground truth against which the comparison is made: responses or suggestions. You can find more information on [this page](#annotator-metrics).
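
To make the effect of this choice concrete, the sketch below computes micro-averaged precision and recall on a few hypothetical records, first with the responses as ground truth and then with the suggestions as ground truth. The data and the helper `micro_precision_recall` are invented for illustration, and micro-averaging is an assumption; Argilla may average differently.

```python
from typing import List, Set, Tuple


def micro_precision_recall(pairs: List[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Micro-averaged precision and recall for (truth, prediction) label-set pairs."""
    true_positives = sum(len(truth & pred) for truth, pred in pairs)
    predicted = sum(len(pred) for _, pred in pairs)
    actual = sum(len(truth) for truth, _ in pairs)
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical toy data: (response_labels, suggestion_labels) per record.
records = [
    ({"approval", "love"}, {"approval"}),
    ({"neutral"}, {"neutral", "joy"}),
    ({"desire", "joy"}, {"desire"}),
]

# Responses as ground truth, suggestions as predictions.
p_resp, r_resp = micro_precision_recall(records)
# Suggestions as ground truth, responses as predictions.
p_sugg, r_sugg = micro_precision_recall([(sugg, resp) for resp, sugg in records])

print(f"responses as truth:   precision={p_resp:.2f} recall={r_resp:.2f}")
print(f"suggestions as truth: precision={p_sugg:.2f} recall={r_sugg:.2f}")
```

Exchanging the roles of ground truth and prediction swaps precision and recall, because the number of matching labels stays the same while the denominators trade places.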

Let us first select responses as our ground truths and compare the suggestions to them. 

To calculate the metrics per annotator, we can call the `compute_responses_metrics` method directly on our dataset, passing the question name and the names of the metrics we want to calculate.

In [None]:
responses_metrics = dataset.compute_responses_metrics(question_name="label", metric_names=["accuracy", "precision", "recall", "f1-score"])

We now have a dictionary whose keys are the IDs of annotators and values are the metrics at hand. We can look at the score of any annotator by giving the particular ID.

We see that the annotator below has a total of 182 annotations, with scores ranging from about 0.43 to 0.57.

In [159]:
responses_metrics["00000000-0000-0000-0000-000000000004"]

[AnnotatorMetricResult(metric_name='accuracy', count=182, result=0.5714285714285714),
 AnnotatorMetricResult(metric_name='precision', count=182, result=0.4427905213343358),
 AnnotatorMetricResult(metric_name='recall', count=182, result=0.5377066798941799),
 AnnotatorMetricResult(metric_name='f1-score', count=182, result=0.428750352375672)]
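
To illustrate what a per-annotator result contains, the following plain-Python sketch groups hypothetical annotations by annotator and scores each group, with the responses as ground truth and the suggestions as predictions. The annotator IDs, the data, and the `score` helper are made up for this example. Exact-match accuracy and micro-averaged precision, recall, and F1 are assumptions, so the numbers Argilla reports may be computed differently.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: (annotator_id, response_labels, suggestion_labels).
annotations: List[Tuple[str, Set[str], Set[str]]] = [
    ("annotator-1", {"approval"}, {"approval"}),
    ("annotator-1", {"approval", "love"}, {"approval"}),
    ("annotator-2", {"neutral"}, {"approval"}),
    ("annotator-2", {"joy"}, {"joy"}),
]


def score(pairs: List[Tuple[Set[str], Set[str]]]) -> Dict[str, float]:
    """Score (truth, prediction) pairs: exact-match accuracy and micro P/R/F1."""
    tp = sum(len(t & p) for t, p in pairs)  # labels in both truth and prediction
    fp = sum(len(p - t) for t, p in pairs)  # predicted but not in truth
    fn = sum(len(t - p) for t, p in pairs)  # in truth but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0
    return {"count": len(pairs), "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1-score": f1}


# Group the (response, suggestion) pairs by annotator.
by_annotator = defaultdict(list)
for annotator_id, response, suggestion in annotations:
    by_annotator[annotator_id].append((response, suggestion))

per_annotator = {name: score(pairs) for name, pairs in by_annotator.items()}
for name, result in per_annotator.items():
    print(name, result)
```

As in the Argilla output, each annotator gets a count of scored annotations alongside one value per metric.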

As stated above, we can alternatively use the suggestions as our ground truths. Let us see how the annotators perform in that case. This time, we calculate every available metric by passing `metric.allowed_metrics` to the `compute_suggestions_metrics` method.

In [None]:
suggestions_metrics = dataset.compute_suggestions_metrics(question_name="label", metric_names=metric.allowed_metrics)

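Again, the annotator below has a total of 182 annotations, with scores ranging from about 0.43 to 0.57. Accuracy and F1-score are unchanged, while the precision and recall values have swapped, since responses and suggestions have exchanged roles.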

In [162]:
suggestions_metrics["00000000-0000-0000-0000-000000000004"]

[AnnotatorMetricResult(metric_name='accuracy', count=182, result=0.5714285714285714),
 AnnotatorMetricResult(metric_name='f1-score', count=182, result=0.428750352375672),
 AnnotatorMetricResult(metric_name='precision', count=182, result=0.5377066798941799),
 AnnotatorMetricResult(metric_name='recall', count=182, result=0.4427905213343358),
 AnnotatorMetricResult(metric_name='confusion-matrix', count=182, result={'admiration':                             suggestions_admiration_true  \
 responses_admiration_true                           174   
 responses_admiration_false                            5   
 
                             suggestions_admiration_false  
 responses_admiration_true                              0  
 responses_admiration_false                             3  , 'amusement':                            suggestions_amusement_true  \
 responses_amusement_true                          176   
 responses_amusement_false                           3   
 
                    

### Metrics for Unified Responses

For a better understanding of the general performance of the annotators, we can unify the responses before calculating the metrics. To do so, we pass the extra `strategy` argument to the `compute_responses_metrics` or `compute_suggestions_metrics` method. When this argument is set, Argilla unifies the responses and then compares the unified responses against the suggestions, or vice versa.

In [None]:
responses_metrics_unified = dataset.compute_responses_metrics(question_name="label", metric_names=["accuracy", "precision", "recall", "f1-score"], strategy="majority")

In [176]:
responses_metrics_unified

[AnnotatorMetricResult(metric_name='accuracy', count=1000, result=0.803),
 AnnotatorMetricResult(metric_name='precision', count=1000, result=0.7865376025406677),
 AnnotatorMetricResult(metric_name='recall', count=1000, result=0.8002996311981446),
 AnnotatorMetricResult(metric_name='f1-score', count=1000, result=0.7836689489727956)]

Similarly, if we want to select suggestions as our ground truths and calculate the metrics against them, we can do it as follows.

In [None]:
suggestions_metrics_unified = dataset.compute_suggestions_metrics(question_name="label", metric_names=["accuracy", "precision", "recall", "f1-score"], strategy="majority")

In [178]:
suggestions_metrics_unified

[AnnotatorMetricResult(metric_name='accuracy', count=1000, result=0.799),
 AnnotatorMetricResult(metric_name='precision', count=1000, result=0.7653116748179684),
 AnnotatorMetricResult(metric_name='recall', count=1000, result=0.7669313732138586),
 AnnotatorMetricResult(metric_name='f1-score', count=1000, result=0.7480244903078532)]
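
Conceptually, computing metrics on unified responses is a two-step process: first reduce each record to a single response, then compare that response to the suggestion. The sketch below walks through both steps on hypothetical records using plain Python. The helper `unify_by_majority`, the data, and the use of exact-match accuracy are assumptions for illustration, not Argilla's implementation.

```python
from collections import Counter
from typing import FrozenSet, List


def unify_by_majority(responses: List[List[str]]) -> FrozenSet[str]:
    """Keep the labels chosen by more than half of the annotators."""
    counts = Counter(label for labels in responses for label in set(labels))
    return frozenset(label for label, n in counts.items() if n > len(responses) / 2)


# Hypothetical toy records: several annotator responses plus one suggestion each.
records = [
    {"responses": [["neutral"], ["approval", "desire"], ["approval", "love"]],
     "suggestion": ["approval"]},
    {"responses": [["joy"], ["joy"], ["amusement"]],
     "suggestion": ["joy", "amusement"]},
    {"responses": [["anger"], ["anger", "annoyance"], ["annoyance"]],
     "suggestion": ["anger", "annoyance"]},
]

# Step 1: one unified response per record.
unified = [unify_by_majority(record["responses"]) for record in records]
# Step 2: compare each unified response with the suggestion (exact match).
suggested = [frozenset(record["suggestion"]) for record in records]
accuracy = sum(u == s for u, s in zip(unified, suggested)) / len(records)

print(f"count={len(records)} accuracy={accuracy:.3f}")
```

Because there is one unified response per record, the count in the result equals the number of records rather than the number of individual annotations, which is why the outputs above report a count of 1000.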

## Conclusion

In this tutorial, we have seen how to evaluate the annotation results of our dataset using the `metrics` module. We first unified the responses to get a more comprehensive view of the annotations. Then, we calculated the metrics per annotator and for the unified responses. We have also seen how to select either responses or suggestions as the ground truths. If you feel that the annotations are not satisfactory, you can iterate on the annotation process by changing the structure of your project. You can refer to the [practical guides](../../../../practical_guides/practical_guides.md) to refine your structure or check out the [advanced tutorials](../../../../tutorials.md) to learn more about the advanced use cases of Argilla.