<div class="alert alert-info">

Note
    
This tutorial demonstrates a sample usage of `FeedbackDataset`, whose implementation differs from the older `TextClassificationDataset`, `Text2TextDataset` and `TokenClassificationDataset`. For information about the older datasets, see [this guide](https://docs.argilla.io/en/latest/getting_started/quickstart_workflow.html). Not sure which dataset to use? Check out our section on [choosing a dataset](https://docs.argilla.io/en/latest/practical_guides/choose_dataset.html).
    
</div>

# Workflow Feedback Dataset

Argilla Feedback is a tool designed to obtain and manage both the feedback data from annotators and the suggestions from small and large language models.


## Install Libraries

Install the latest version of Argilla in Colab, along with other libraries and models used in this notebook.

In [None]:
!pip install argilla datasets setfit evaluate seqeval

## Set Up Argilla

If you have already deployed Argilla Server, then you can skip this step. Otherwise, you can quickly deploy it in two different ways:

* You can deploy Argilla Server on [HF Spaces](https://huggingface.co/new-space?template=argilla/argilla-template-space).

* Alternatively, if you want to run Argilla locally on your own computer, the easiest way to get Argilla UI up and running is to deploy on Docker:

    ```
    docker run -d --name quickstart -p 6900:6900 argilla/argilla-quickstart:latest
    ```

For more information on installation, see the [deployment guide](../getting_started/installation/deployments/deployments.html).

## Connect to Argilla



To connect to the Argilla instance, import the Argilla library and call `rg.init()` with the following settings, which can also be supplied through environment variables.

* `ARGILLA_API_URL`: The URL of the Argilla Server.
  * If you're using Docker, it is `http://localhost:6900` by default.
  * If you're using HF Spaces, it is constructed as `https://[your-owner-name]-[your_space_name].hf.space`.
* `ARGILLA_API_KEY`: The API key of the Argilla Server. It is `owner` by default.
* `HF_TOKEN`: The Hugging Face API token. It is only needed if you're using a [private HF Space](https://docs.argilla.io/en/latest/getting_started/installation/deployments/huggingface-spaces.html#deploy-argilla-on-spaces). You can configure it in your profile: [Settings > Access Tokens](https://huggingface.co/settings/tokens).
* `workspace`: A “space” inside your Argilla instance where authorized users can collaborate. It's `admin` by default.

For more info about custom configurations like headers, workspace separation or access credentials, check our [config page](https://docs.argilla.io/en/latest/getting_started/installation/configurations/configurations.html).
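The settings listed above can be gathered in one place before calling `rg.init()`. The following is a minimal sketch using only the standard library; the helper names `space_url` and `load_connection_settings` are hypothetical, and `ARGILLA_WORKSPACE` is an assumed variable name chosen for illustration.

```python
import os
from typing import Mapping, Optional


def space_url(owner: str, space: str) -> str:
    """Build an HF Spaces URL following the pattern quoted above."""
    return f"https://{owner}-{space}.hf.space"


def load_connection_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect connection settings, falling back to the local Docker URL."""
    env = os.environ if env is None else env
    settings = {
        "api_url": env.get("ARGILLA_API_URL", "http://localhost:6900"),
        "api_key": env.get("ARGILLA_API_KEY"),  # no default here: set it explicitly
        "workspace": env.get("ARGILLA_WORKSPACE", "admin"),  # assumed variable name
    }
    hf_token = env.get("HF_TOKEN")
    if hf_token:
        # Only private HF Spaces need the bearer token header.
        settings["extra_headers"] = {"Authorization": f"Bearer {hf_token}"}
    return settings


# Example with an explicit mapping instead of the real environment.
settings = load_connection_settings({"ARGILLA_API_KEY": "my-key", "HF_TOKEN": "hf_example"})
print(settings["api_url"])
print(space_url("my-org", "my-space"))
```

The resulting dictionary uses the same keyword names as the `rg.init()` calls in this tutorial, so it could be passed as `rg.init(**settings)`.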

In [None]:
# Argilla credentials
api_url = "http://localhost:6900"  # "https://<YOUR-HF-SPACE>.hf.space"
api_key = "admin.apikey"  # replace with the API key of your own user
# Hugging Face credentials (only needed for a private HF Space)
hf_token = "hf_..."

In [None]:
import argilla as rg

rg.init(api_url=api_url, api_key=api_key, workspace="admin")

# If you want to use your private HF Space, pass the token defined above
# rg.init(api_url=api_url, api_key=api_key, workspace="admin", extra_headers={"Authorization": f"Bearer {hf_token}"})

## Create Dataset

FeedbackDataset is the container for Argilla Feedback structure. Argilla Feedback offers different components for FeedbackDatasets that you can employ for various aspects of your workflow. For a more detailed explanation, refer to the [documentation](https://docs.argilla.io/en/latest/practical_guides/practical_guides.html) and the [end-to-end tutorials](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/tutorials.html) for beginners.

To start, we need to configure the `FeedbackDataset`. There are two options: use a pre-defined template or create a custom one.

### Use a Task Template

Argilla offers a set of [pre-defined templates for different tasks](https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html#task-templates). You can use them to configure your dataset in a straightforward way. For instance, if you want to create a dataset for simple text classification, you can use the following code:

In [1]:
dataset = rg.FeedbackDataset.for_text_classification(
    labels=["yes", "no"],
    multi_label=False,
    use_markdown=True,
    guidelines=None,
    metadata_properties=None,
    vectors_settings=None,
)
dataset

FeedbackDataset(
   fields=[TextField(name='text', title='Text', required=True, type='text', use_markdown=True)]
   questions=[LabelQuestion(name='label', title='Label', description='Classify the text by selecting the correct label from the given list of labels.', required=True, type='label_selection', labels=['yes', 'no'], visible_labels=None)]
   guidelines=This is a text classification dataset that contains texts and labels. Given a set of texts and a predefined set of labels, the goal of text classification is to assign one label to each text based on its content. Please classify the texts by making the correct selection.)
   metadata_properties=[])
)

### Configure a Custom Dataset

If your dataset does not fit into one of the pre-defined templates, you [can create a custom dataset](https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html#define-questions) by defining the fields, questions, records, metadata properties and vectors settings.

#### Fields

In our example, `fields` will store the question and answer structure to be used for each sample. It has the following arguments: `name`, `title` (optional), `required` (optional) and `use_markdown` (optional).

In [3]:
fields = [
    rg.TextField(name="question", title="Question", required=True, use_markdown=False),
    rg.TextField(name="answer", title="Answer", required=True, use_markdown=True)
]

#### Questions

For the dataset, you need to define at least one question type. As of today, the different question types that Argilla offers are `RatingQuestion`, `TextQuestion`, `LabelQuestion`, `MultiLabelQuestion` and `RankingQuestion`.

Let's create a `LabelQuestion` for the current example with its `name`, the `title` shown in the Argilla UI, and the list of `labels`. It will be `required`, and the number of `visible_labels` in the UI is left at the default (20).

In [4]:
label_question = [
    rg.LabelQuestion(
        name="relevant",
        title="Relevancy",
        labels=["yes", "no"],
        required=True,
        visible_labels=None
    )
]

#### Metadata Properties

Metadata can optionally be added to the dataset to filter and sort the records. They can be of the following types: `TermsMetadataProperty`, `IntegerMetadataProperty` or `FloatMetadataProperty`.

Let's add a `TermsMetadataProperty` to the dataset with the `name`, `title` and `values` (optional) to be used for filtering and sorting.

In [5]:
metadata_properties = [
    rg.TermsMetadataProperty(
        name="groups",
        title="Annotation groups",
        values=["group-a", "group-b", "group-c"]
    )
]

#### Vectors Settings

Vectors can optionally be added to enable similarity search. You'll need to specify the `name`, `title` (optional) and `dimensions` of the vector.

In [9]:
vectors_settings = [
    rg.VectorSettings(
        name="my_vector",
        dimensions=5  # e.g. 768 for BERT
    )
]
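To illustrate what similarity search does with such vectors, here is a minimal sketch that ranks candidates by cosine similarity to a query vector. Cosine similarity is an assumption chosen for illustration; the metric and search backend that Argilla actually uses may differ, and the candidate names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Five dimensions, matching the `dimensions=5` setting above.
query = [0.1, 0.2, 0.3, 0.4, 0.5]
candidates = {
    "same_direction": [0.2, 0.4, 0.6, 0.8, 1.0],
    "reversed_order": [0.5, 0.4, 0.3, 0.2, 0.1],
}

# Most similar candidate first.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)
```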

#### Annotation Guidelines

To help annotators, we can also enrich our task with `guidelines`. Clear guidelines help them understand the task better and make more accurate annotations. There are two ways to provide guidelines: as an argument to the `FeedbackDataset`, or as the `description` argument of the question instances above. Depending on the specific task, one may work better than the other, so it is good practice to try both.

#### Create the Dataset

We can now create our FeedbackDataset instance with the features defined above. Do not forget to define `fields`, `questions`, `metadata_properties` and `vectors_settings` as a list, while `guidelines` expects a string.

In [10]:
dataset = rg.FeedbackDataset(
    guidelines="Annotations should be made according to the policy.",
    fields=fields,
    questions=label_question,
    metadata_properties=metadata_properties,
    vectors_settings=vectors_settings
)
dataset

FeedbackDataset(
   fields=[TextField(name='question', title='Question', required=True, type='text', use_markdown=False), TextField(name='answer', title='Answer', required=True, type='text', use_markdown=True)]
   questions=[LabelQuestion(name='relevant', title='Relevancy', description=None, required=True, type='label_selection', labels=['yes', 'no'], visible_labels=None)]
   guidelines=Annotations should be made according to the policy.)
   metadata_properties=[TermsMetadataProperty(name='groups', title='Annotation groups', visible_for_annotators=True, type='terms', values=['group-a', 'group-b', 'group-c'])])
)

## Upload data

### Records

A record is one of the data items that will be annotated by the annotator team. Records are the pieces of information shown to the user in the UI in order to complete the annotation task. In a single-label classification dataset, a record can consist of just a text to be labeled, whereas in instruction datasets it will be a prompt and output pair.

For Argilla Feedback, we can define a `FeedbackRecord` with the mandatory argument `fields` and optional arguments `metadata` and `vectors`.

In [11]:
# A sample FeedbackRecord
record = rg.FeedbackRecord(
    fields={
        "question": "Why can camels survive long without water?",
        "answer": "Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time."
    },
    metadata={"groups": "group-a"},
    vectors={"my_vector": [0.1, 0.2, 0.3, 0.4, 0.5]}
)
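Before building many records, it can help to check the raw data against the dataset configuration defined earlier. The sketch below uses only the standard library; `check_record` and the three constants are hypothetical helpers that mirror the fields, metadata values and vector dimensions used in this tutorial, not part of the Argilla API.

```python
from typing import Dict, List, Optional, Sequence

# Mirrors of the configuration defined earlier in this tutorial.
REQUIRED_FIELDS = ("question", "answer")
ALLOWED_GROUPS = {"group-a", "group-b", "group-c"}
VECTOR_DIMENSIONS = {"my_vector": 5}


def check_record(
    fields: Dict[str, str],
    metadata: Optional[Dict[str, str]] = None,
    vectors: Optional[Dict[str, Sequence[float]]] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the data looks consistent."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not fields.get(name):
            problems.append(f"missing required field: {name}")
    group = (metadata or {}).get("groups")
    if group is not None and group not in ALLOWED_GROUPS:
        problems.append(f"unknown group: {group}")
    for name, vector in (vectors or {}).items():
        expected = VECTOR_DIMENSIONS.get(name)
        if expected is None:
            problems.append(f"unknown vector: {name}")
        elif len(vector) != expected:
            problems.append(f"vector {name} has {len(vector)} dimensions, expected {expected}")
    return problems


print(check_record(
    fields={"question": "Is this valid?", "answer": "Yes."},
    metadata={"groups": "group-a"},
    vectors={"my_vector": [0.1, 0.2, 0.3, 0.4, 0.5]},
))
```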

### Suggestions

Argilla Feedback offers a way to use suggestions from LLMs and other models as a starting point for annotators. This way, annotators can save time and effort by correcting the predictions instead of annotating from scratch. Suggestions can also be added directly to a `FeedbackRecord` as `suggestions`.

In [12]:
record.suggestions = [
    {
        "question_name": "relevant",
        "value": "yes"
    }
]
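In practice, the suggestion value usually comes from a model. The following sketch shows one way to turn a prediction into a suggestion dictionary of the shape used above. The `toy_relevance_model` is a hypothetical stand-in for a real model, and the optional `score` and `agent` keys are assumed from the record output shown later in this tutorial.

```python
from typing import Optional, Tuple


def toy_relevance_model(question: str, answer: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a model: scores word overlap between the texts."""
    q_words = set(question.lower().replace("?", "").split())
    a_words = set(answer.lower().replace(".", "").split())
    overlap = len(q_words & a_words) / max(len(q_words), 1)
    return ("yes" if overlap > 0 else "no"), overlap


def make_suggestion(
    question_name: str,
    value: str,
    score: Optional[float] = None,
    agent: Optional[str] = None,
) -> dict:
    """Build a suggestion dictionary, leaving out optional keys that are unset."""
    suggestion = {"question_name": question_name, "value": value}
    if score is not None:
        suggestion["score"] = score
    if agent is not None:
        suggestion["agent"] = agent
    return suggestion


label, score = toy_relevance_model(
    "Why can camels survive long without water?",
    "Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.",
)
suggestion = make_suggestion("relevant", label, score=score, agent="toy-overlap-model")
print(suggestion["value"])
```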

### Responses

Argilla Feedback can handle multiple responses per record, one for each annotator. We can define a list of responses for each record. Each response is a dictionary whose `values` key maps each question name to the answer given for it. Responses can also be added directly to a `FeedbackRecord` as `responses`.

In [None]:
record.responses = [
    {
        "values":{
            "relevant":{
                "value": "yes"
            }
        }
    }
]
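When several annotators respond to the same record, the responses usually need to be unified before training. The sketch below applies a simple majority vote to response dictionaries of the shape shown above. Majority voting is one possible strategy chosen for illustration, and `majority_label` is a hypothetical helper, not an Argilla function.

```python
from collections import Counter
from typing import List, Optional


def majority_label(responses: List[dict], question_name: str) -> Optional[str]:
    """Return the most frequent answer, or None if there is no answer or a tie."""
    votes = Counter(
        response["values"][question_name]["value"]
        for response in responses
        if question_name in response.get("values", {})
    )
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent answers
    return ranked[0][0]


responses = [
    {"values": {"relevant": {"value": "yes"}}},
    {"values": {"relevant": {"value": "yes"}}},
    {"values": {"relevant": {"value": "no"}}},
]
print(majority_label(responses, "relevant"))
```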

#### Add the Records

Now, it is quite simple to add records to the FeedbackDataset we have previously created, in the form of a list.

In [15]:
dataset.add_records([record])
dataset[0]

FeedbackRecord(fields={'question': 'Why can camels survive long without water?', 'answer': 'Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.'}, metadata={'groups': 'group-a'}, vectors={'my_vector': [0.1, 0.2, 0.3, 0.4, 0.5]}, responses=[ResponseSchema(user_id=None, values={'relevant': ValueSchema(value='yes')}, status=<ResponseStatus.submitted: 'submitted'>)], suggestions=(SuggestionSchema(question_name='relevant', type=None, score=None, value='yes', agent=None),), external_id=None)
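Since `add_records` takes a list, larger datasets are typically built from tabular data. The sketch below converts plain row dictionaries into keyword arguments matching the `fields` and `metadata` arguments used above. The helper `rows_to_record_kwargs`, the row keys and the example rows are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def rows_to_record_kwargs(rows: Iterable[Dict[str, str]]) -> Iterator[dict]:
    """Yield one dictionary of record keyword arguments per input row."""
    for row in rows:
        kwargs = {"fields": {"question": row["question"], "answer": row["answer"]}}
        if "group" in row:
            # Metadata is optional, so attach it only when the row has a group.
            kwargs["metadata"] = {"groups": row["group"]}
        yield kwargs


rows = [
    {"question": "What is 2 + 2?", "answer": "4", "group": "group-a"},
    {"question": "What is 3 + 3?", "answer": "6"},
]
record_kwargs = list(rows_to_record_kwargs(rows))
print(len(record_kwargs))
```

With Argilla imported as in the cells above, `dataset.add_records([rg.FeedbackRecord(**kwargs) for kwargs in record_kwargs])` would then add all of them in one call.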

Now that we have our dataset with already annotated responses and suggestions as model predictions, we can push the dataset to the Argilla space.

<div class="alert alert-info">

Note
    
From Argilla 1.14.0, calling `push_to_argilla` will not just push the `FeedbackDataset` into Argilla, but will also return the remote `FeedbackDataset` instance, which implies that the additions, updates, and deletions of records will be pushed to Argilla as soon as they are made. This is a change from previous versions of Argilla, where you had to call `push_to_argilla` again to push the changes to Argilla.
    
</div>

In [None]:
remote_dataset = dataset.push_to_argilla(name="relevancy_dataset", workspace="admin")

## Train a model

As with other datasets, Feedback datasets also allow you to create a training pipeline and make inferences with the resulting model. After you gather responses with Argilla Feedback, you can easily fine-tune an LLM. In this example, we will work on a text classification task.

For fine-tuning, we will use the `setfit` library and the [Argilla Trainer](https://docs.argilla.io/en/latest/practical_guides/fine_tune.html#the-argillatrainer), which is a powerful wrapper around many of our favorite NLP libraries. It provides an intuitive abstraction that enables simple training workflows with sensible default configurations, without having to worry about any data transformations from Argilla.

Let us first create the dataset to train on. For this example, we will use the [emotion](https://huggingface.co/datasets/argilla/emotion) dataset, which was created using Argilla. Each text is labeled with one of six emotions: sadness, joy, love, anger, fear and surprise.

In [None]:
# Besides Argilla, it can also be imported with load_dataset from datasets
dataset_hf = rg.FeedbackDataset.from_huggingface("argilla/emotion")

We can then start to create a training pipeline by first defining `TrainingTask`, which is used to define how the data should be processed and formatted according to the associated task and framework. Each task has its own classmethod and the data formatting can always be customized via `formatting_func`. You can visit [this page](https://docs.argilla.io/en/latest/practical_guides/fine_tune.html#tasks) for more info. Simpler tasks like text classification can be defined using default definitions, as we do in this example.

In [None]:
from argilla.feedback import TrainingTask

task = TrainingTask.for_text_classification(
    text=dataset_hf.field_by_name("text"),
    label=dataset_hf.question_by_name("label")
)

We can then define our `ArgillaTrainer` with any of the supported frameworks. Here we use `setfit` and set `train_size=0.8`.

In [None]:
from argilla.feedback import ArgillaTrainer

trainer = ArgillaTrainer(
    dataset=dataset_hf,
    task=task,
    framework="setfit",
    train_size=0.8
)
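The `train_size=0.8` argument above asks for 80% of the data to be used for training and the rest for evaluation. As a rough illustration of that idea, here is a minimal shuffled split in plain Python. `split_train_test` is a hypothetical helper, and how `ArgillaTrainer` performs its split internally is not shown here and may differ.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(
    items: Sequence, train_size: float = 0.8, seed: int = 42
) -> Tuple[List, List]:
    """Shuffle a copy of the items and cut it into train and test parts."""
    if not 0.0 < train_size < 1.0:
        raise ValueError("train_size must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(range(10), train_size=0.8)
print(len(train), len(test))
```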

You can update the model config via `update_config`.

In [None]:
trainer.update_config(num_train_epochs=2)

We can now train the model with `train`

In [None]:
trainer.train(output_dir="setfit_model")

and make inferences with `predict`.

In [None]:
trainer.predict("This is just perfect!")

We have trained a model with FeedbackDataset in this tutorial. For more info about concepts in Argilla Feedback and LLMs, look [here](https://docs.argilla.io/en/latest/conceptual_guides/llm/llm.html). For a more detailed explanation, refer to the [documentation](https://docs.argilla.io/en/latest/practical_guides/practical_guides.html) and the [end-to-end tutorials](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/tutorials.html) for beginners.

-------------

