# How-to Guide

This guide will help you with all the practical aspects of setting up an annotation project for training and fine-tuning LLMs using Argilla's Feedback Task Datasets. It covers everything from defining your task to collecting, organizing, and using the feedback effectively.

![Feedback dataset snapshot](../../_static/images/llms/snapshot-feedback-demo.png)

In [1]:
import argilla as rg
import os

rg.init(
    api_url=os.environ.get("ARGILLA_API_URL_DEV"),
    api_key=os.environ.get("ARGILLA_API_KEY")
)

## Define the task
The Feedback Task Datasets allow you to combine multiple questions of different kinds, so the first step is to define the aim of your project and the kind of data and feedback you will need to get there.

### Format records
A record in Argilla refers to a data item that requires annotation and can consist of one or multiple fields. For example, your records can include a pair of a prompt and an output. Currently, we only support plain text fields, but we plan to introduce support for markdown and images in the future.

Take some time to explore and find data that fits the purpose of your project. If you are planning to use public data, the [Datasets page](https://huggingface.co/datasets) of the Hugging Face Hub is a good place to start.

```{hint}
Always check the licenses of the datasets to make sure you can legally use them for your specific use case.
```

Once you have a dataset, load it and inspect it to find the fields that you want to use in your Feedback dataset. A quick overview of the data will also help you formulate the right questions later.

In [None]:
from datasets import load_dataset

dataset = load_dataset('databricks/databricks-dolly-15k', split='train')
dataset

In [None]:
import pandas as pd

# turn it into a pandas dataframe to get a quick overview of a few examples
df = pd.DataFrame(dataset)
df

The next step is to create records following Argilla's Feedback Record format [link to Python reference].

The names of the fields must match the fields set up in the dataset configuration (see [below](#create-and-import-a-dataset)).
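A mismatch between record fields and configured fields is easy to miss, so a quick check before building the dataset can help. The sketch below works on plain dictionaries and uses a hypothetical helper, `find_field_mismatches`, which is not part of Argilla; it only compares key names.

```python
def find_field_mismatches(expected_fields, records_fields):
    """Return (index, missing, unexpected) for every record whose keys differ from the expected fields."""
    expected = set(expected_fields)
    mismatches = []
    for index, fields in enumerate(records_fields):
        keys = set(fields)
        if keys != expected:
            mismatches.append((index, sorted(expected - keys), sorted(keys - expected)))
    return mismatches

# hypothetical data: the second record still uses the original column names
expected_fields = ["question", "answer"]
records_fields = [
    {"question": "Q1", "answer": "A1"},
    {"instruction": "Q2", "response": "A2"},
]

mismatches = find_field_mismatches(expected_fields, records_fields)
print(mismatches)
```

An empty list means that every record uses exactly the configured field names.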

In [None]:
# as we create the records, we can rename the fields and optionally filter the original dataset
records = [rg.FeedbackRecord(fields={"question": record["instruction"], "answer": record["response"]}) for record in dataset if record["category"]=="open_qa"]

### Define questions
To collect feedback for your dataset, you need to formulate questions. The Feedback Task currently supports the following types of questions:

- Rating: These questions require annotators to select one option from a list of integer values. This type is useful for collecting numerical scores.
- Text: These questions offer annotators a free-text area where they can enter any text. This type is useful for collecting natural language data, such as corrections or explanations.

```{note}
We have plans to expand the range of supported question types in future releases of the Feedback Task.
```

You can define your questions using the Python SDK and set up the following configurations:
- `name`: A shortname for the question.
- `title`: The text displayed in the UI.
- `description` (optional): The text to be displayed in the question tooltip in the UI. You can use it to give more context or information to annotators.
- `required`: Set your question as required or optional. Annotators must answer all required questions to submit a response, but they have the choice to answer optional questions or not.
- `values`: In a RatingQuestion, these are the rating options represented as a list of integer values.

```{note}
The order of the questions in the UI follows the order in which these are added to the dataset in the Python SDK.
```

In [None]:
# list of questions to display in the feedback form
questions = [
    rg.RatingQuestion(
        name="rating",
        title="Rate the quality of the response:",
        description="1 = very bad - 5 = very good",
        required=True,
        values=[1, 2, 3, 4, 5]
    ),
    rg.TextQuestion(
        name="corrected-text",
        title="Provide a correction to the response:",
        required=False
    )
]


### Write guidelines
Once you have decided on the data to show and the questions to ask, it's important to provide clear guidelines to the annotators. These guidelines help them understand the task and answer the questions consistently. You can provide guidelines in two ways:
- In the dataset guidelines: this is added as an argument when you create your dataset in the Python SDK (see below). It will appear in the dataset settings in the UI.
- As question descriptions: these are added as an argument when you create questions in the Python SDK (see above). This text will appear in a tooltip next to the question in the UI.

It is good practice to use at least the dataset guidelines, if not both methods. In the guidelines, you can include a description of the project, details on how to answer each question with examples, instructions on when to discard a record, etc. Question descriptions should be short and provide context for a specific question. They can summarize the guidelines for that question, but often that is not sufficient to align the whole annotation team.

## Set up your annotation team
Depending on the nature of your project and the size of your annotation team, you may want to control annotation overlap, i.e., how many annotations each record receives. Decide on this before pushing your dataset to Argilla, as it affects how your dataset is set up. Let's explore a few overlap options.

### Full overlap
The Feedback Task supports having multiple annotations for your records. This means that all users with access to the dataset can give responses to all the records in the dataset. To get full overlap, push the dataset (as detailed in [Create and import a dataset](#create-and-import-a-dataset)) to a workspace that all team members can access. Learn more about managing user access to workspaces [here](../getting_started/installation/configurations/user_management.md#creating-an-annotator-user-assigned-to-a-workspace).

### Zero overlap
If you only want one annotation per record, we recommend that you split your records into chunks and assign these to a single annotator. Then, you can create several datasets, one in each annotator's personal workspace, and add the records assigned to each of them.

In [None]:

import httpx
import random
from collections import defaultdict

# make a request using your Argilla Client
rg_client = rg.active_client().client
auth_headers = {"X-Argilla-API-Key": rg_client.token}
http = httpx.Client(base_url=rg_client.base_url, headers=auth_headers)
users = http.get("/api/users").json()

# optional: filter users to get only those with annotator role
users = [u for u in users if u['role']=='annotator']

# optional: shuffle the records to get a random assignment
random.shuffle(records)

# build a dictionary where the key is the username and the value is the list of records assigned to them
assignments = defaultdict(list)

# divide your records in chunks of the same length as the users list and make the assignments
# you will need to follow the instructions to create and push a dataset for each of the key-value pairs in this dictionary
n = len(users)
chunked_records = [records[i:i + n] for i in range(0, len(records), n)]
for chunk in chunked_records:
    for idx, record in enumerate(chunk):
        assignments[users[idx]['username']].append(record)
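The chunk-based loop above gives the record at position `j` to user `j % n`. As a sanity check, here is a minimal sketch of the same round-robin split written with list slicing on toy data. The helper name `split_without_overlap` and the usernames are hypothetical, and the records are plain strings rather than `FeedbackRecord` objects.

```python
def split_without_overlap(usernames, records):
    """Assign record j to user j % len(usernames), so every record gets exactly one annotator."""
    n = len(usernames)
    return {username: records[i::n] for i, username in enumerate(usernames)}

# hypothetical annotators and records
usernames = ["annotator_a", "annotator_b", "annotator_c"]
toy_records = [f"record_{i}" for i in range(7)]

toy_assignments = split_without_overlap(usernames, toy_records)
for username, assigned in toy_assignments.items():
    print(username, assigned)
```

When the number of records is not a multiple of the number of annotators, the first annotators in the list receive one extra record each.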

### Controlled overlap
This option is optimal when you want to have annotation overlap, but up to a certain number and not with the whole team. This can be because you want your team to be more efficient or perhaps to calculate the agreement between pairs of annotators. In this case, you also need to create several datasets and push them to the annotators' personal workspaces with the difference that each record will appear in multiple datasets. 

Here's an example of how you can do those assignments:

In [None]:
# assign each record to `overlap` different annotators
import random

def assign_records(users, records, overlap):
    # with more assignments than users, a user would get the same record twice
    if overlap > len(users):
        raise ValueError("overlap cannot be larger than the number of users")

    assignments = {user['username']: [] for user in users}
    # note: this shuffles the list of records in place
    random.shuffle(records)
    num_users = len(users)

    for i, record in enumerate(records):
        # cycle through the users, taking `overlap` consecutive users per record
        for j in range(overlap):
            user = users[(i * overlap + j) % num_users]['username']
            assignments[user].append(record)

    return assignments

assignments = assign_records(users, records, overlap=3)
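Before creating one dataset per annotator, it can be worth verifying the assignments. The following sketch reimplements the cyclic scheme on plain lists (without shuffling) and adds a hypothetical `check_overlap` helper. It assumes hashable records, such as strings or ids, so it would need adapting for record objects.

```python
from collections import Counter

def assign_with_overlap(usernames, records, overlap):
    """Cyclic assignment on plain lists: each record goes to `overlap` consecutive users."""
    assignments = {username: [] for username in usernames}
    for i, record in enumerate(records):
        for j in range(overlap):
            assignments[usernames[(i * overlap + j) % len(usernames)]].append(record)
    return assignments

def check_overlap(assignments, overlap):
    """True if every record has exactly `overlap` annotators and no annotator sees a record twice."""
    per_record = Counter()
    for assigned in assignments.values():
        if len(assigned) != len(set(assigned)):
            return False
        per_record.update(assigned)
    return all(count == overlap for count in per_record.values())

# hypothetical annotators and records
usernames = ["ann_1", "ann_2", "ann_3", "ann_4"]
toy_records = [f"record_{i}" for i in range(6)]

toy_assignments = assign_with_overlap(usernames, toy_records, overlap=2)
load_per_user = {username: len(assigned) for username, assigned in toy_assignments.items()}
print(load_per_user, check_overlap(toy_assignments, overlap=2))
```

The check fails as soon as `overlap` exceeds the number of annotators, because the cycle then wraps around to a user who already has the record.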

## Create and import a dataset
Now we are ready to create our dataset. To do that, first you'll need to define the following configurations:
- `name`: The name of the dataset.
- `workspace`: The workspace where the dataset will be created. If you don't provide one, it will be placed in the default workspace attached to the API key used in `rg.init()`.
- `guidelines` (optional): A set of guidelines for the annotators. These will appear in the dataset settings in the UI.
- `fields`: The list of fields to show in the record card. The order in which the fields will appear matches the order of this list.
- `questions`: The list of questions to show in the form.

Once the dataset is created locally, add the records and, when you're happy with the result, push the dataset to Argilla. At that point, you will be able to see the dataset from the UI.

In [None]:
# create a dataset locally
dataset = rg.FeedbackDataset(
    guidelines="You will see a collection of records with a question and an answer.\nYou will be asked to rate the answer from 1 (very bad) to 5 (very good).\nIf your rating is below 5, please provide a correction to the output.",
    fields = [
        rg.TextField(name="question"),
        rg.TextField(name="answer")
    ],
    questions=questions
)

# add the records to the dataset
dataset.add_records(records)

# push the dataset and records to Argilla
dataset.push_to_argilla(name='my_dataset', workspace='my_workspace')

It is also possible to load and import datasets saved in the Hugging Face Hub that have Argilla's Feedback Dataset format:

In [None]:
# load a public dataset
dataset = rg.FeedbackDataset.from_huggingface("argilla/feedback-dataset")

# load a private dataset
dataset = rg.FeedbackDataset.from_huggingface("argilla/feedback-dataset", use_auth_token=True)

# push to Argilla
dataset.push_to_argilla(name="my_hub_dataset", workspace="my_workspace")

Or copy an existing dataset in your Argilla instance:

In [None]:
# load the dataset
dataset = rg.FeedbackDataset.from_argilla("demo_feedback", workspace="recognai")

# push the dataset with a different name / workspace
dataset.push_to_argilla(name="my_dataset", workspace="my_workspace")

## Annotating a Feedback Dataset
Once you open the dataset, you will see by default the records with `Pending` responses, i.e. records that still don't have a response, in a single-record view. On the left, you can find the record to annotate and on the right the form with all the questions to answer. 

We highly recommend that you read the annotation guidelines before starting the annotation. If there are any, you can find them in the dataset settings page. [describe how to get there] If any of the questions have a description, you will find an info icon next to them. Click it to read the description.

In the annotation view, you will be able to provide responses. Once all required questions have responses, the `Submit` button will be enabled and you will be able to submit your response. If you prefer not to give a response for a record, you can move to the next record or discard it using the `Discard` button.

If you need to review your submitted or discarded responses, you can select the queue you need. From there, you can modify, submit or discard responses. You can also use the `Clear` button to remove the response and send the record back to the `Pending` queue.

You can speed up the annotation process by using shortcuts:
|Action|Keys|
|------|----|
|Clear|&#8679; `Shift` + &blank; `Space`|
|Discard|&#x232B; `Backspace`|
|Discard (from text area)|&#8679; `Shift` + &#x232B; `Backspace`|
|Submit|&crarr; `Enter`|
|Submit (from text area)|&#8679; `Shift` + &crarr; `Enter`|
|Go to previous page|&larr; `Left arrow`|
|Go to next page|&rarr; `Right arrow`|

![Snapshot of the Submitted queue and the progress bar in a Feedback dataset](../../_static/images/llms/snapshot-feedback-submitted.png)

You can track your progress and the number of `Pending`, `Submitted` and `Discarded` responses by clicking the `Progress` icon in the sidebar.

## Collect responses

To collect the responses given by annotators, you can simply load the dataset in this way:

In [2]:
feedback = rg.FeedbackDataset.from_argilla("demo_feedback", workspace="recognai")

Fetching records from Argilla: 15it [00:02,  6.69it/s]


If your dataset doesn't have any annotation overlap i.e., all records have only one response, the post-processing stage will be quite simple. You just need to transform the dataset into whatever format you need for training.

```{note}
Remember to only take into account responses with the `submitted` status.
```
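As an illustration of that filter, here is a minimal sketch that keeps only `submitted` responses and flattens each record into a training row. It runs on hypothetical plain dictionaries that mirror the record layout used in this guide (`fields`, `responses`, `status`, `values`); the helper `submitted_responses` is not part of Argilla.

```python
def submitted_responses(record):
    """Return only the responses of a record whose status is 'submitted'."""
    return [r for r in record["responses"] if r["status"] == "submitted"]

# hypothetical records without annotation overlap
toy_feedback = [
    {
        "fields": {"question": "Q1", "answer": "A1"},
        "responses": [
            {"status": "submitted", "values": {"rating": {"value": 4}}},
            {"status": "discarded", "values": {}},
        ],
    },
    {"fields": {"question": "Q2", "answer": "A2"}, "responses": []},
]

# one row per record that has a submitted response
training_rows = []
for record in toy_feedback:
    submitted = submitted_responses(record)
    if submitted:
        rating = submitted[0]["values"]["rating"]["value"]
        training_rows.append({**record["fields"], "rating": rating})

print(training_rows)
```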

### Measure and solve disagreements

If your dataset does have records with more than one `submitted` response, you will need to unify the responses before using the data for training. 

For example, you can see how the responses from these two users differ:

In [3]:
# example of annotation results for a single record
print(f"Question: {feedback[4]['fields']['question']}")
print(f"Answer: {feedback[4]['fields']['answer']}")
for ix,response in enumerate(feedback[4]['responses']):
    print('-'*50)
    print(f"Annotator {ix+1}")
    print(f"Status: {response['status']}")
    print(f"Response: {response['values']}")

Question: Which episodes of season four of Game of Thrones did Michelle MacLaren direct?
Answer: She directed "Oathkeeper" and "First of His Name" the fourth and fifth episodes of season four, respectively.
--------------------------------------------------
Annotator 1
Status: submitted
Response: {'rating': {'value': 5}}
--------------------------------------------------
Annotator 2
Status: submitted
Response: {'rating': {'value': 3}, 'corrected-text': {'value': 'In Season 3 she directed the episodes "The Bear and the Maiden Fair" and "Second Sons". In Season 4 , she directed another two episodes: "Oathkeeper" and "First of His Name".'}}


Ratings often represent a subjective value, meaning that there is no right or wrong answer for these questions. However, since a `RatingQuestion` has a closed set of options, its results can help you visualize the disagreement between annotators. Texts, on the other hand, are unique and subjective, so it is almost impossible for two annotators to give the same answer to a `TextQuestion`. For this reason, we don't recommend using these responses to measure disagreement.

If you want to do an initial exploration of the responses, you can use your preferred plotting library. Here are some simple visualizations that can help you evaluate potential disagreement in the responses:

In [4]:
# plot 1: submitted responses per record
from collections import Counter,OrderedDict
import plotly.express as px

count_submitted = Counter()
for record in feedback:
    if record['responses'] != []:
        submitted = [r for r in record['responses'] if r['status']=='submitted']
        count_submitted[len(submitted)] += 1
count_submitted = OrderedDict(sorted(count_submitted.items()))
count_submitted = [{'submitted_responses': k, 'no_records': v} for k,v in count_submitted.items()]


fig = px.bar(count_submitted, x='submitted_responses', y='no_records')
fig.update_xaxes(title_text="No. of submitted responses", dtick=1)
fig.update_yaxes(title_text="No. of records")
fig.show()

In [6]:
# plot 2: distance between responses in rating question
list_values = []
for record_ix,record in enumerate(feedback):
    if record['responses'] != []:
        submitted = [r for r in record['responses'] if r['status']=='submitted']
        if len(submitted) > 1:
            for response_ix,response in enumerate(submitted):
                list_values.append({'record': str(record_ix+1), 'annotator': response_ix+1, 'value': response['values']['rating']['value']})

fig = px.histogram(list_values, x='record', y='value', color='annotator', barmode='group')
fig.show()

```{tip}
If you feel like the disagreement between annotators is too high, especially for questions that aren't as subjective, this is a good sign that you should review your annotation guidelines and/or the questions and options.
```
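If you prefer a number over a plot, one simple option (an assumption of this sketch, not an Argilla feature) is the spread between the highest and lowest submitted rating of each record. The example uses hypothetical dictionaries with the same response layout as the records shown earlier.

```python
def rating_spread(record, question="rating"):
    """Difference between the highest and lowest submitted rating, or None with fewer than two ratings."""
    values = [
        response["values"][question]["value"]
        for response in record["responses"]
        if response["status"] == "submitted" and question in response["values"]
    ]
    return max(values) - min(values) if len(values) > 1 else None

# hypothetical records
toy_feedback = [
    {"responses": [
        {"status": "submitted", "values": {"rating": {"value": 5}}},
        {"status": "submitted", "values": {"rating": {"value": 3}}},
    ]},
    {"responses": [
        {"status": "submitted", "values": {"rating": {"value": 4}}},
        {"status": "submitted", "values": {"rating": {"value": 4}}},
        {"status": "discarded", "values": {}},
    ]},
    {"responses": [{"status": "submitted", "values": {"rating": {"value": 2}}}]},
]

spreads = [rating_spread(record) for record in toy_feedback]
# share of records with at least two ratings where all annotators agree
comparable = [spread for spread in spreads if spread is not None]
full_agreement = sum(spread == 0 for spread in comparable) / len(comparable)
print(spreads, full_agreement)
```

A spread of 0 means full agreement. Records with a single rating are reported as `None` and left out of the agreement rate.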

In the following sections, we will explore some techniques you can use to solve disagreements in the responses. These are not the only possible techniques and you should choose them carefully according to the needs of your project and annotation team.

#### Unifying ratings
##### Majority vote
If a record has more than 2 submitted responses, you can take the most popular option as the final score. In the case of a tie, you can break it by choosing a random option or the lowest / highest score.
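A minimal sketch of this technique, assuming the ratings of one record have already been collected into a list. The function name `majority_vote` and the `tie_break` parameter are illustrative choices.

```python
from collections import Counter

def majority_vote(ratings, tie_break=min):
    """Return the most frequent rating; `tie_break` picks among equally frequent options."""
    counts = Counter(ratings)
    top_count = max(counts.values())
    winners = [value for value, count in counts.items() if count == top_count]
    return winners[0] if len(winners) == 1 else tie_break(winners)

print(majority_vote([5, 3, 5]))              # clear majority
print(majority_vote([2, 4]))                 # tie, resolved with the lowest score
print(majority_vote([2, 4], tie_break=max))  # tie, resolved with the highest score
```

Passing `tie_break=random.choice` (after `import random`) would break ties randomly instead.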

##### Mean score
For this technique, you can take all responses and calculate the mean score. That will be the final rating.
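As a small sketch, the mean can be computed with Python's `statistics` module. The helper name `mean_rating` is hypothetical.

```python
from statistics import mean

def mean_rating(ratings):
    """Average the submitted ratings of one record; returns None when there are none."""
    return mean(ratings) if ratings else None

print(mean_rating([5, 3, 4]))
print(mean_rating([5, 4]))
```

Keep in mind that the mean is not necessarily one of the original rating options, which matters if your training setup expects integer labels.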

##### Lowest / highest score
Depending on how the question is formulated, you can take the `max` or `min` value. That will be the final rating.
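Here is a hedged sketch of how this could look on a single record, using the dictionary layout of the responses shown earlier. The helper `unify_rating` and its `strategy` parameter are hypothetical names.

```python
def unify_rating(record, strategy=min, question="rating"):
    """Apply `strategy` (e.g. min or max) to the submitted ratings of a record."""
    values = [
        response["values"][question]["value"]
        for response in record["responses"]
        if response["status"] == "submitted" and question in response["values"]
    ]
    return strategy(values) if values else None

# hypothetical record with two submitted ratings
toy_record = {
    "responses": [
        {"status": "submitted", "values": {"rating": {"value": 5}}},
        {"status": "submitted", "values": {"rating": {"value": 3}}},
    ]
}

lowest = unify_rating(toy_record, strategy=min)
highest = unify_rating(toy_record, strategy=max)
print(lowest, highest)
```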

#### Unifying texts
##### Rate / rank the responses
Make a new dataset that includes the texts you have collected in the record fields and ask your annotation team to rate or rank the responses. Then choose the response with the highest score. If there is a tie, choose one of the options randomly or consider duplicating the record as explained [below](#duplicate-the-record).
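Once the second round of ratings is in, selecting the winner could look like the sketch below. It assumes you have gathered, for one original record, a mapping from each candidate text to the ratings it received; `top_rated_texts` is a hypothetical helper that returns all tied winners so that you can decide how to handle them.

```python
from statistics import mean

def top_rated_texts(ratings_by_text):
    """Return every text whose mean rating equals the best mean (more than one text means a tie)."""
    means = {text: mean(ratings) for text, ratings in ratings_by_text.items() if ratings}
    if not means:
        return []
    best = max(means.values())
    return [text for text, score in means.items() if score == best]

# hypothetical ratings collected for three candidate corrections
ratings_by_text = {
    "Correction A": [4, 5],
    "Correction B": [3, 4],
    "Correction C": [5, 4],
}

winners = top_rated_texts(ratings_by_text)
print(winners)
```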

##### Choose based on the annotator
Take a subset of the records (enough to get a good representation of responses from each annotator), and rate / rank them as explained in the section above. Then, give each annotator a score based on the preferences of the team. You can use this score to choose text responses over the whole dataset.
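The sketch below illustrates one way to do this. It assumes you can tell which annotator wrote each text, represented here by hypothetical annotator names, and that the rated subset is available as `(annotator, score)` pairs. Both helpers are illustrative and not part of Argilla.

```python
from statistics import mean

def annotator_scores(rated_sample):
    """Average the scores that each annotator's texts received in the rated subset."""
    by_annotator = {}
    for annotator, score in rated_sample:
        by_annotator.setdefault(annotator, []).append(score)
    return {annotator: mean(scores) for annotator, scores in by_annotator.items()}

def pick_text(texts_by_annotator, scores):
    """Pick the text written by the annotator with the highest score."""
    best = max(texts_by_annotator, key=lambda annotator: scores.get(annotator, float("-inf")))
    return texts_by_annotator[best]

# hypothetical rated subset and one record with two competing texts
rated_sample = [("ann_1", 4), ("ann_2", 5), ("ann_1", 3), ("ann_2", 4)]
scores = annotator_scores(rated_sample)
chosen = pick_text({"ann_1": "text one", "ann_2": "text two"}, scores)
print(scores, chosen)
```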

##### Choose based on answers to other questions
You can use the answers to other questions as quality markers. For example, you can assume that whoever gave the lowest score will make a more extensive correction and you may want to choose that as the final text. However, this method does not guarantee that the text will be of good quality.
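For the example in this guide, that heuristic could be sketched as follows, using the `rating` and `corrected-text` question names defined earlier and hypothetical response dictionaries. The helper `text_from_lowest_rating` is an illustrative name.

```python
def text_from_lowest_rating(responses):
    """Return the correction written by the annotator who gave the lowest rating, if any."""
    candidates = [
        response for response in responses
        if response["status"] == "submitted"
        and "rating" in response["values"]
        and "corrected-text" in response["values"]
    ]
    if not candidates:
        return None
    strictest = min(candidates, key=lambda response: response["values"]["rating"]["value"])
    return strictest["values"]["corrected-text"]["value"]

# hypothetical responses for a single record
toy_responses = [
    {"status": "submitted", "values": {"rating": {"value": 5}}},
    {"status": "submitted", "values": {"rating": {"value": 3}, "corrected-text": {"value": "long fix"}}},
    {"status": "submitted", "values": {"rating": {"value": 4}, "corrected-text": {"value": "short fix"}}},
]

chosen_text = text_from_lowest_rating(toy_responses)
print(chosen_text)
```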

##### Duplicate the record
You may consider that the different answers given by your annotation team are all valid options. In this case, you can duplicate the record to keep each answer. Again, this method does not guarantee the quality of the text, so it is recommended to check the quality of the text, for example using a rating question.
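A minimal sketch of the duplication step, assuming records are plain dictionaries with `fields` and `responses` as in the examples in this guide. The helper `duplicate_per_text` is hypothetical.

```python
def duplicate_per_text(record, question="corrected-text"):
    """Create one row per submitted text response, repeating the record fields in each row."""
    rows = []
    for response in record["responses"]:
        if response["status"] != "submitted":
            continue
        text = response["values"].get(question, {}).get("value")
        if text:
            rows.append({**record["fields"], question: text})
    return rows

# hypothetical record with two corrections and one rating-only response
toy_record = {
    "fields": {"question": "Q1", "answer": "A1"},
    "responses": [
        {"status": "submitted", "values": {"rating": {"value": 3}, "corrected-text": {"value": "fix one"}}},
        {"status": "submitted", "values": {"rating": {"value": 5}}},
        {"status": "submitted", "values": {"rating": {"value": 2}, "corrected-text": {"value": "fix two"}}},
    ],
}

rows = duplicate_per_text(toy_record)
print(rows)
```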

### Export or publish a dataset

If you need to export a dataset, load it using the `from_argilla` method as demonstrated [above](#collect-responses) and turn it into the desired format. Then, you can save it, for example, as a JSON file or push it to the Hugging Face Hub.

In [None]:
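import json

# save as json: json.dump needs the file handle as its second argument
# default=str is a fallback for values that json cannot serialize natively
with open("my_file.json", "w") as file:
    json.dump(list(feedback), file, indent=4, default=str)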

In [None]:
# push to a public dataset in the Hugging Face Hub
feedback.push_to_huggingface("argilla/feedback-dataset", generate_card=True)

# push to a private dataset in the Hugging Face Hub
feedback.push_to_huggingface("argilla/feedback-dataset", generate_card=True, private=True)

The Hugging Face Hub uses git for version control, meaning that the data will be updated every time you push using the same dataset name, but previous versions won't be lost. Feedback Datasets saved in the Hub that keep Argilla's format can be loaded and pushed to Argilla as explained [above](#create-and-import-a-dataset).

## Fine-tuning?