# Collect responses

To collect the responses given by annotators, load the dataset from Argilla:

In [None]:
import argilla as rg
feedback = rg.FeedbackDataset.from_argilla("demo_feedback", workspace="recognai")

If your dataset doesn't have any annotation overlap, i.e., every record has only one response, post-processing is simple: you only need to transform the dataset into the format you need for training.

Note

Remember to only take into account responses with the `submitted` status.
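
A minimal sketch of this filter, assuming each record is a dict with a `responses` list whose items carry a `status` key, as in the examples in this guide. The record below is made up for illustration and `submitted_responses` is a hypothetical helper, not part of Argilla:

```python
# Hypothetical record: a dict with a "responses" list,
# each response having a "status" and a "values" dict.
record = {
    "responses": [
        {"status": "submitted", "values": {"rating": {"value": 5}}},
        {"status": "discarded", "values": {}},
        {"status": "submitted", "values": {"rating": {"value": 3}}},
    ]
}


def submitted_responses(record):
    """Return only the responses whose status is 'submitted'."""
    return [r for r in record["responses"] if r["status"] == "submitted"]


submitted = submitted_responses(record)
print(len(submitted))
```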

## Measure and solve disagreements

If your dataset does have records with more than one `submitted` response, you will need to unify the responses before using the data for training. 

For example, you can see how the responses from these two users differ:

In [3]:
# example of annotation results for a single record
print(f"Question: {feedback[4]['fields']['question']}")
print(f"Answer: {feedback[4]['fields']['answer']}")
for ix,response in enumerate(feedback[4]['responses']):
    print('-'*50)
    print(f"Annotator {ix+1}")
    print(f"Status: {response['status']}")
    print(f"Response: {response['values']}")

Question: Which episodes of season four of Game of Thrones did Michelle MacLaren direct?
Answer: She directed "Oathkeeper" and "First of His Name" the fourth and fifth episodes of season four, respectively.
--------------------------------------------------
Annotator 1
Status: submitted
Response: {'rating': {'value': 5}}
--------------------------------------------------
Annotator 2
Status: submitted
Response: {'rating': {'value': 3}, 'corrected-text': {'value': 'In Season 3 she directed the episodes "The Bear and the Maiden Fair" and "Second Sons". In Season 4 , she directed another two episodes: "Oathkeeper" and "First of His Name".'}}


Ratings often represent a subjective value, meaning that there is no right or wrong answer to these questions. However, since a `RatingQuestion` has a closed set of options, its results can help you visualize the disagreement between annotators. Texts, on the other hand, are unique and subjective, so it is very unlikely that two annotators will give exactly the same answer to a `TextQuestion`. For this reason, we don't recommend using these responses to measure disagreements.

If you want to do an initial exploration of the responses, you can use your preferred plotting library. Here are some simple visualisations that you could use to evaluate the potential disagreement in the responses:

In [None]:
# plot 1: submitted responses per record
from collections import Counter,OrderedDict
import plotly.express as px

count_submitted = Counter()
for record in feedback:
    if record['responses'] != []:
        submitted = [r for r in record['responses'] if r['status']=='submitted']
        count_submitted[len(submitted)] += 1
count_submitted = OrderedDict(sorted(count_submitted.items()))
count_submitted = [{'submitted_responses': k, 'no_records': v} for k,v in count_submitted.items()]


fig = px.bar(count_submitted, x='submitted_responses', y='no_records')
fig.update_xaxes(title_text="No. of submitted responses", dtick=1)
fig.update_yaxes(title_text="No. of records")
fig.show()

![Plot 1: Submitted responses per record](../../../_static/images/llms/collect_responses_plot_1.png)

In [None]:
# plot 2: distance between responses in rating question
list_values = []
for record_ix,record in enumerate(feedback):
    if record['responses'] != []:
        submitted = [r for r in record['responses'] if r['status']=='submitted']
        if len(submitted) > 1:
            for response_ix,response in enumerate(submitted):
                list_values.append({'record': str(record_ix+1), 'annotator': response_ix+1, 'value': response['values']['rating']['value']})

fig = px.histogram(list_values, x='record', y='value', color='annotator', barmode='group')
fig.show()

![Plot 2: Distance between responses in the rating question](../../../_static/images/llms/collect_responses_plot_2.png)
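
The distance shown in plot 2 can also be expressed as a single number per record. The sketch below takes the gap between the highest and lowest submitted rating, which is only one possible measure. It assumes the dict-like record structure used in the cells above; `rating_spread` is a hypothetical helper name:

```python
def rating_spread(record, question="rating"):
    """Gap between the highest and lowest submitted rating, or None if fewer than two."""
    values = [
        r["values"][question]["value"]
        for r in record["responses"]
        if r["status"] == "submitted" and question in r["values"]
    ]
    if len(values) < 2:
        return None
    return max(values) - min(values)


# Made-up record with the same ratings as the example printed earlier (5 and 3).
example_record = {
    "responses": [
        {"status": "submitted", "values": {"rating": {"value": 5}}},
        {"status": "submitted", "values": {"rating": {"value": 3}}},
    ]
}
spread = rating_spread(example_record)
print(spread)
```

A spread of 0 means the annotators agreed; larger values point to records worth reviewing.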

Hint

If you feel that the disagreement between annotators is too high, especially for questions that aren't very subjective, this is a strong sign that you should review your annotation guidelines and/or the questions and options.

In the following sections, we will explore some techniques you can use to solve disagreements in the responses. These are not the only possible techniques and you should choose them carefully according to the needs of your project and annotation team.

### Unifying ratings
#### Majority vote
If a record has more than 2 submitted responses, you can take the most popular option as the final score. In the case of a tie, you can break it by choosing a random option or the lowest / highest score.
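
As a rough sketch, majority voting with a configurable tie-break could look like this. `majority_vote` and its `tie_break` parameter are hypothetical names, and the function assumes it receives a non-empty list of submitted rating values:

```python
from collections import Counter


def majority_vote(ratings, tie_break=min):
    """Return the most frequent rating; resolve ties with `tie_break`."""
    counts = Counter(ratings)
    top = max(counts.values())
    # All ratings that share the highest count.
    winners = [value for value, n in counts.items() if n == top]
    return tie_break(winners)


final = majority_vote([4, 5, 4])
tied_low = majority_vote([3, 5])                  # tie resolved with the lowest score
tied_high = majority_vote([3, 5], tie_break=max)  # tie resolved with the highest score
print(final, tied_low, tied_high)
```

Passing `random.choice` as `tie_break` would pick one of the tied options at random instead.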

#### Mean score
For this technique, take all the submitted responses and calculate their mean score. That will be the final rating.
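
A minimal sketch of the mean score, assuming responses are dicts shaped like the ones printed earlier in this guide. `mean_rating` is a hypothetical helper:

```python
from statistics import mean


def mean_rating(responses, question="rating"):
    """Average the submitted values for one rating question, or None if there are none."""
    values = [
        r["values"][question]["value"]
        for r in responses
        if r["status"] == "submitted" and question in r["values"]
    ]
    return mean(values) if values else None


# Made-up responses with the ratings from the earlier example (5 and 3).
responses = [
    {"status": "submitted", "values": {"rating": {"value": 5}}},
    {"status": "submitted", "values": {"rating": {"value": 3}}},
]
final_rating = mean_rating(responses)
print(final_rating)
```

Note that the mean can be fractional, so it may not correspond to one of the options of the original question.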

#### Lowest / highest score
Depending on how the question is formulated, you can take the `max` or `min` value. That will be the final rating.
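
For illustration only, this could be wrapped in a small helper. `extreme_rating` and its `keep` parameter are hypothetical names, and the input is assumed to be a non-empty list of submitted rating values:

```python
def extreme_rating(ratings, keep="lowest"):
    """Return the lowest or the highest rating, depending on `keep`."""
    if keep not in ("lowest", "highest"):
        raise ValueError("keep must be 'lowest' or 'highest'")
    return min(ratings) if keep == "lowest" else max(ratings)


ratings = [5, 3, 4]
strict_rating = extreme_rating(ratings, keep="lowest")
lenient_rating = extreme_rating(ratings, keep="highest")
print(strict_rating, lenient_rating)
```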

### Unifying texts
#### Rate / rank the responses
Make a new dataset that includes the texts you have collected in the record fields and ask your annotation team to rate or rank the responses. Then choose the response with the highest score. If there is a tie, choose one of the options randomly or consider duplicating the record as explained [below](#duplicate-the-record).

#### Choose based on the annotator
Take a subset of the records (enough to get a good representation of responses from each annotator), and rate / rank them as explained in the section above. Then, give each annotator a score based on the preferences of the team. You can use this score to choose text responses over the whole dataset.
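
One possible way to sketch this, assuming you can tell which annotator wrote each text. The annotator ids, the `sample_scores` data and both helper functions are made up for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical team ratings of sampled texts: (annotator id, score given by the team).
sample_scores = [("ann_a", 4), ("ann_b", 2), ("ann_a", 5), ("ann_b", 3)]


def annotator_scores(sample_scores):
    """Average the team's scores per annotator."""
    by_annotator = defaultdict(list)
    for annotator, score in sample_scores:
        by_annotator[annotator].append(score)
    return {annotator: mean(scores) for annotator, scores in by_annotator.items()}


def pick_text(candidates, scores):
    """Choose the text written by the annotator with the highest average score.

    `candidates` maps annotator id -> text; unscored annotators rank last.
    """
    best = max(candidates, key=lambda annotator: scores.get(annotator, float("-inf")))
    return candidates[best]


scores = annotator_scores(sample_scores)
chosen = pick_text({"ann_a": "Text from A", "ann_b": "Text from B"}, scores)
print(scores, chosen)
```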

#### Choose based on answers to other questions
You can use the answers to other questions as quality markers. For example, you can assume that whoever gave the lowest score will make a more extensive correction and you may want to choose that as the final text. However, this method does not guarantee that the text will be of good quality.
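
The heuristic from this example could be sketched as follows. It assumes the `rating` and `corrected-text` question names from the record printed earlier; `text_from_lowest_rater` is a hypothetical helper:

```python
def text_from_lowest_rater(responses, rating_q="rating", text_q="corrected-text"):
    """Return the text written by the annotator who gave the lowest rating.

    Only submitted responses that answer both questions are considered.
    Returns None when no response qualifies.
    """
    candidates = [
        r for r in responses
        if r["status"] == "submitted"
        and rating_q in r["values"]
        and text_q in r["values"]
    ]
    if not candidates:
        return None
    lowest = min(candidates, key=lambda r: r["values"][rating_q]["value"])
    return lowest["values"][text_q]["value"]


# Made-up responses shaped like the example record above.
responses = [
    {"status": "submitted", "values": {"rating": {"value": 5}}},
    {"status": "submitted", "values": {"rating": {"value": 3},
                                       "corrected-text": {"value": "A corrected answer."}}},
]
final_text = text_from_lowest_rater(responses)
print(final_text)
```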

#### Duplicate the record
You may consider that the different answers given by your annotation team are all valid options. In this case, you can duplicate the record to keep each answer. Again, this method does not guarantee the quality of the text, so it is recommended to check the quality of the text, for example using a rating question.
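
A minimal sketch of the duplication step, assuming dict-like records with `fields` and `responses` keys as in the example printed earlier. The output layout (a `text` key next to a copy of `fields`) and the helper name `duplicate_per_text` are illustrative choices, not an Argilla format:

```python
def duplicate_per_text(record, text_q="corrected-text"):
    """Create one copy of the record per distinct submitted text answer."""
    texts = []
    for response in record["responses"]:
        if response["status"] != "submitted":
            continue
        answer = response["values"].get(text_q)
        # Skip missing answers and exact duplicates.
        if answer and answer["value"] not in texts:
            texts.append(answer["value"])
    return [{"fields": dict(record["fields"]), "text": text} for text in texts]


# Made-up record with two distinct corrections and one repeated correction.
record = {
    "fields": {"question": "Q?", "answer": "A."},
    "responses": [
        {"status": "submitted", "values": {"corrected-text": {"value": "First version."}}},
        {"status": "submitted", "values": {"corrected-text": {"value": "Second version."}}},
        {"status": "submitted", "values": {"corrected-text": {"value": "First version."}}},
    ],
}
copies = duplicate_per_text(record)
print(len(copies))
```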

## Export or publish a dataset

If you need to export a dataset, load it using the `from_argilla` method as demonstrated [above](#collect-responses) and convert it into the desired format. Then you can save it, for example, as a JSON file, or push it to the Hugging Face Hub.

In [None]:
import json

# save as json
# assumption: each record is a JSON-serializable dict, as in the examples above
with open("my_file.json", "w") as file:
    json.dump(list(feedback), file, indent=4)

In [None]:
# push to a public dataset in the Hugging Face Hub
feedback.push_to_huggingface("argilla/feedback-dataset", generate_card=True)

# push to a private dataset in the Hugging Face Hub
feedback.push_to_huggingface("argilla/feedback-dataset", generate_card=True, private=True)

The Hugging Face Hub uses git for version control, so the data is updated every time you push using the same dataset name, but previous versions are not lost. Feedback Datasets saved in the Hub in Argilla's format can be loaded back and are ready to be pushed to Argilla, as explained [above](#create-and-import-a-dataset).