# 🖼️ Curate an instruction dataset for supervised fine-tuning

The internet is flooded with open-source datasets for fine-tuning LLMs, some created by humans and others generated with generative models. However, these datasets often contain many problematic or low-quality examples. By curating them, we can make the fine-tuning step more efficient.

In this example, we take [Databricks' Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset to show how you can build an instruction dataset for your fine-tuning projects by cleaning a public dataset using Argilla's Feedback Task.

Let's get started!

<img src="../../../_static/images/llms/curating-feedback-instructiondataset/snapshot_dolly_curation.png" alt="A Feedback Task setting for the curation of Databricks' Dolly dataset" style="width: 1100px;">

<div class="alert alert-info.tip">

Tip
    
This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>

## Setup

For this tutorial, you will need to have an Argilla server running. If you don't have one already, check out our [Quickstart](../../../getting_started/quickstart.md) or [Installation](../../../getting_started/installation/installation.md) pages. Once you do, complete the following steps:

1. Install the Argilla client and the required third-party libraries using `pip`:

In [None]:
%pip install argilla datasets pandas httpx plotly -qqq

2. Let's make the necessary imports:

In [3]:
import argilla as rg
from datasets import Dataset, load_dataset
import pandas as pd
import httpx
import random
from collections import defaultdict, Counter

3. If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the `URL` and `API_KEY`:

In [None]:
# Replace api_url with the URL of your HF Space if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900", 
    api_key="admin.apikey"
)

## Define the project 
As a first step, let's load the dataset and quickly explore the data:

In [None]:
data = load_dataset('databricks/databricks-dolly-15k', split='train')

In [4]:
df = data.to_pandas()
df

| | instruction | context | response | category |
|---|---|---|---|---|
| 0 | When did Virgin Australia start operating? | Virgin Australia, the trading name of Virgin A... | Virgin Australia commenced services on 31 Augu... | closed_qa |
| 1 | Which is a species of fish? Tope or Rope | | Tope | classification |
| 2 | Why can camels survive for long without water? | | Camels use the fat in their humps to keep them... | open_qa |
| 3 | Alice's parents have three daughters: Amy, Jes... | | The name of the third daughter is Alice | open_qa |
| 4 | When was Tomoaki Komorida born? | Komorida was born in Kumamoto Prefecture on Ju... | Tomoaki Komorida was born on July 10,1981. | closed_qa |
| ... | ... | ... | ... | ... |
| 15006 | How do i accept the change | | Embrace the change and see the difference | brainstorming |
| 15007 | What is a laser and who created it? | A laser is a device that emits light through a... | A laser is a device that emits light from an e... | summarization |
| 15008 | What is the difference between a road bike and... | | Road bikes are built to be ridden on asphalt a... | open_qa |
| 15009 | How does GIS help in the real estate investmen... | | Real estate investors depend on precise, accur... | general_qa |
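
Beyond eyeballing the preview, a couple of pandas calls can summarise the columns before deciding what to show annotators. The sketch below is only illustrative: it uses a tiny hand-built frame (`sample_df`, a hypothetical name) with three rows copied from the preview rather than the full dataset, so it runs without downloading anything. With the real data you would apply the same calls to `df`.

```python
import pandas as pd

# Hypothetical mini-sample mirroring the columns of the Dolly dataset.
# The truncated strings are copied as shown in the preview.
sample_df = pd.DataFrame(
    {
        "instruction": [
            "When did Virgin Australia start operating?",
            "Which is a species of fish? Tope or Rope",
            "Why can camels survive for long without water?",
        ],
        "context": ["Virgin Australia, the trading name of Virgin A...", "", ""],
        "response": [
            "Virgin Australia commenced services on 31 Augu...",
            "Tope",
            "Camels use the fat in their humps to keep them...",
        ],
        "category": ["closed_qa", "classification", "open_qa"],
    }
)

# Number of records per task category
category_counts = sample_df["category"].value_counts()

# Share of records that come without a context passage
empty_context_share = (sample_df["context"].str.strip() == "").mean()

print(category_counts)
print(f"Records without context: {empty_context_share:.0%}")
```

Knowing how many records lack a context passage helps set expectations for annotators, since those responses cannot be checked against a source text.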


For our project, we would like to make sure that the `instruction` and the `response` are formulated clearly and concisely, and that they provide correct information. We will add those two fields to our records. To help our annotation team, we will also include the `context` field.

In [17]:
# format the data as Argilla records
records = [rg.FeedbackRecord(fields={"instruction": record["instruction"], "response": record["response"], "context": record["context"]}) for record in data]

# list of fields that we will use later for our dataset settings
fields = [
    rg.TextField(name="instruction"),
    rg.TextField(name="context", title="Context / Input"),
    rg.TextField(name="response")
]

Now we can define the questions that we would like to ask about these records and write some guidelines for the annotators.

In [18]:
# list of questions to display in the feedback form
questions = [
    rg.RatingQuestion(
        name="changes-needed", 
        title="Does the instruction or response in this record need corrections?", 
        description="0 = no: choose if everything looks good and you won't provide any corrections\n1 = yes: choose if you are providing corrections to the instruction or response",
        required=True,
        values=[0,1]
    ),
    rg.TextQuestion(
        name="new-instruction",
        title="Provide a new version of the instruction:",
        required=False
    ),
    rg.TextQuestion(
        name="new-response",
        title="Provide a new version of the response:",
        required=False
    )
]

guidelines = "In this dataset, you will find a collection of records that show at least an instruction and a response to that instruction. The aim of the project is to correct the instructions and responses to make sure they are of the highest quality. Both texts should be clear and include real information. In addition, the response should be as complete but concise as possible. To help you with the responses, some records have another text field called Context. This field shows the text where the response was taken from.\n\nTo curate the dataset, you will need to answer the following questions:\n\n1 - Does the instruction or response in this record need corrections?\nThis question has a binary selection. Choose 0 if everything looks good and you won't provide any corrections. You can submit the response straightaway. Choose 1 if you are going to provide a corrected version of either the instruction or the response.\n\n2 - Provide a new version of the instruction:\nIf the instruction needs corrections, write a new version in this text area. If the instruction is ok, leave it empty.\n\n3 - Provide a new version of the response:\nIf the response needs corrections, write a new version in this text area. If the response is ok, leave it empty.\n\nIf you are not sure about a record and you prefer not to provide a response, click Discard."


## Split the workload and import to Argilla
For this specific project, we don't want any overlap between the members of our annotation team, as we only want one version of each record. We'll assume that the annotations of our team have the desired quality to work as demonstration data for our instruction-following model.

<div class="alert alert-info">

Tip

For extra quality assurance, you can create a new dataset where annotators rate the quality of the human-annotated dataset.
</div>

To avoid having multiple responses for a record, we will split the workload between all of our annotators and import the records assigned to them in a dataset in their personal workspace. 

First, let's get the list of users using the Argilla Client.

In [None]:
# make a request using your Argilla Client to get the list of users
rg_client = rg.active_client().client
auth_headers = {"X-Argilla-API-Key": rg_client.token}
http = httpx.Client(base_url=rg_client.base_url, headers=auth_headers)
users = http.get("/api/users").json()

# filter users to get only those with annotator role
users = [user for user in users if user['role']=='annotator']
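
Before moving on to the assignments, it can be worth guarding against an empty annotator list, because the chunking step that follows divides the records by the number of users. The sketch below assumes a payload shaped like the one filtered above; the usernames, the `admin` role value, and the helper name `select_annotators` are hypothetical and only serve the illustration.

```python
# Hypothetical payload shaped like the list of users filtered above;
# only the `username` and `role` keys are relied on.
users_payload = [
    {"username": "admin_1", "role": "admin"},
    {"username": "annotator_1", "role": "annotator"},
    {"username": "annotator_2", "role": "annotator"},
]


def select_annotators(users):
    """Return the users with the annotator role, failing early if there are none."""
    annotators = [user for user in users if user.get("role") == "annotator"]
    if not annotators:
        raise ValueError("No annotators found: create annotator users before assigning records.")
    return annotators


annotators = select_annotators(users_payload)
print([user["username"] for user in annotators])
```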

When we're happy with the list of users, we can move on to do the assignments:

In [20]:
# shuffle the records to get a random assignment
random.shuffle(records)

# build a dictionary where the key is the username and the value is the list of records assigned to them
assignments = defaultdict(list)

# divide your records in chunks of the same length as the users list and make the assignments
# you will need to follow the instructions to create and push a dataset for each of the key-value pairs in this dictionary
n = len(users)
chunked_records = [records[i:i + n] for i in range(0, len(records), n)]
for chunk in chunked_records:
    for idx, record in enumerate(chunk):
        assignments[users[idx]['username']].append(record)

# create a dataset for each annotator and push it to their personal workspace
for username, user_records in assignments.items():
    dataset = rg.FeedbackDataset(
        guidelines=guidelines,
        fields=fields,
        questions=questions
    )
    dataset.add_records(user_records)
    dataset.push_to_argilla(name='dolly_cleaning', workspace=username)
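
To see what the chunk-based assignment does to the workload, here is a small sketch of the same logic on toy data. It assumes integers in place of `FeedbackRecord` objects and three hypothetical usernames, so it runs without an Argilla server.

```python
import random
from collections import defaultdict

# Toy stand-ins: integers instead of FeedbackRecord objects, and
# hypothetical usernames instead of the users fetched from the server.
toy_records = list(range(10))
toy_users = [
    {"username": "annotator_1"},
    {"username": "annotator_2"},
    {"username": "annotator_3"},
]

random.seed(42)  # fixed seed so the shuffle is reproducible
random.shuffle(toy_records)

# Same chunking as above: each chunk holds at most one record per user
n = len(toy_users)
toy_assignments = defaultdict(list)
for start in range(0, len(toy_records), n):
    chunk = toy_records[start:start + n]
    for idx, record in enumerate(chunk):
        toy_assignments[toy_users[idx]["username"]].append(record)

workloads = {name: len(items) for name, items in toy_assignments.items()}
print(workloads)
# {'annotator_1': 4, 'annotator_2': 3, 'annotator_3': 3}
```

Every record lands with exactly one annotator, and the workloads differ by at most one record, with the first users in the list picking up the remainder.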

## Collect feedback and publish the results
At this point, the datasets are ready for annotation. Once the annotations are done, we will collect all the feedback from our team and combine it into a single collection.

In [None]:
feedback = []
for username in assignments.keys():
    feedback.extend(rg.FeedbackDataset.from_argilla('dolly_cleaning', workspace=username))

Let's explore the dataset a bit so we can draw some conclusions about it:

In [64]:
responses = []

for record in feedback:
    if record['responses'] == []:
        continue
    
    # we should only have 1 response per record, so we can safely use the first one only
    response = record['responses'][0]
    values = response['values']
    # an unanswered question may be missing or hold an empty string, so treat both as "not provided"
    has_instruction = values.get('new-instruction', {}).get('value') not in ['', None]
    has_response = values.get('new-response', {}).get('value') not in ['', None]
    if has_instruction and has_response:
        new_fields = 'both'
    elif has_instruction:
        new_fields = 'instruction'
    elif has_response:
        new_fields = 'response'
    else:
        new_fields = 'None'

    responses.append({'status': response['status'], 'changes-needed': values.get('changes-needed', {}).get('value'), 'new-fields': new_fields})

responses_df = pd.DataFrame(responses)
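
The classification into `both`, `instruction`, `response` and `None` has a few edge cases, mainly whether an unanswered text question shows up as an empty string or is missing altogether. The sketch below assumes the same `values` shape used in this tutorial (question name mapped to a dict with a `value` key) and treats both cases as "not provided". The helper name `changed_fields` and the example answers are hypothetical.

```python
def changed_fields(values):
    """Classify which corrected fields an annotator filled in.

    `values` maps a question name to {"value": ...}. Missing questions
    and empty strings both count as "not provided".
    """
    has_instruction = values.get("new-instruction", {}).get("value") not in ("", None)
    has_response = values.get("new-response", {}).get("value") not in ("", None)
    if has_instruction and has_response:
        return "both"
    if has_instruction:
        return "instruction"
    if has_response:
        return "response"
    return "None"


# Hypothetical answers covering the edge cases
examples = [
    # text questions left unanswered (keys missing)
    {"changes-needed": {"value": 0}},
    # empty instruction, corrected response
    {
        "changes-needed": {"value": 1},
        "new-instruction": {"value": ""},
        "new-response": {"value": "A corrected response"},
    },
    # both fields corrected
    {
        "changes-needed": {"value": 1},
        "new-instruction": {"value": "A corrected instruction"},
        "new-response": {"value": "A corrected response"},
    },
]

labels = [changed_fields(values) for values in examples]
print(labels)
# ['None', 'response', 'both']
```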

In [None]:
import plotly.express as px
fig = px.histogram(responses_df, x='status')
fig.show()

![Plot showing the number of submitted and discarded responses](../../../_static/images/llms/curating-feedback-instructiondataset/plot_submitted_discarded.png)

We can see that the majority of the records have submitted responses. That means that we are not losing too much data during the annotation project.
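
If you prefer a number to a plot, the same information can be read as proportions. This is a minimal sketch on a hypothetical `toy_status` series standing in for the `status` column of `responses_df`.

```python
import pandas as pd

# Hypothetical statuses standing in for the `status` column of `responses_df`
toy_status = pd.Series(
    ["submitted", "submitted", "submitted", "discarded"], name="status"
)

# Fraction of responses per status
status_share = toy_status.value_counts(normalize=True)
print(status_share)
```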

<div class="alert alert-info">

Tip
 
If a significant percentage of the records have a discarded response, you can collect all the discarded records and serve them to a different annotator, provided that your annotation team uses the `Discard` button as a way to skip records.
</div>
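
Collecting the discarded records for reassignment could look roughly like the sketch below. It assumes records shaped like the ones iterated over in this tutorial (a `fields` dict and a list of `responses` with a `status`). The toy data and the helper name `discarded_fields` are hypothetical.

```python
# Toy records shaped like the ones used in this tutorial
toy_feedback = [
    {
        "fields": {"instruction": "instruction 1", "context": "", "response": "response 1"},
        "responses": [{"status": "submitted", "values": {"changes-needed": {"value": 0}}}],
    },
    {
        "fields": {"instruction": "instruction 2", "context": "", "response": "response 2"},
        "responses": [{"status": "discarded", "values": {}}],
    },
    {
        "fields": {"instruction": "instruction 3", "context": "", "response": "response 3"},
        "responses": [],  # not annotated yet
    },
]


def discarded_fields(records):
    """Return the fields of records whose first response was discarded."""
    return [
        record["fields"]
        for record in records
        if record["responses"] and record["responses"][0]["status"] == "discarded"
    ]


to_reassign = discarded_fields(toy_feedback)
print(len(to_reassign), "record(s) to serve to another annotator")
```

The returned fields can then be wrapped in new records and pushed to another annotator's workspace in the same way as the original assignments.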

Now, let's check how many of our submitted responses proposed modifications to the original text:

In [None]:
fig = px.histogram(responses_df.loc[responses_df['status']=='submitted'], x='changes-needed', color='new-fields')
fig.update_xaxes(dtick=1)
fig.update_layout(bargap=0.2)
fig.show()

![Plot showing the fields that were modified](../../../_static/images/llms/curating-feedback-instructiondataset/plot_modified_fields.png)

As we can see here, a large share of the submitted responses indicated that the record needed a modification and proposed new versions for both fields. The graph also shows that our annotation team understood the task correctly: they didn't provide new text when they declared that the record needed no changes, and they didn't declare that changes were needed without providing any.
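
The consistency between the rating and the rewritten fields can also be checked programmatically rather than read off the plot. The sketch below is one possible way to do it, using a hypothetical `toy_responses_df` with the same columns as `responses_df`.

```python
import pandas as pd

# Hypothetical frame with the same columns as `responses_df`
toy_responses_df = pd.DataFrame(
    {
        "status": ["submitted"] * 5,
        "changes-needed": [0, 0, 1, 1, 1],
        "new-fields": ["None", "None", "both", "instruction", "response"],
    }
)

# Cross-tabulate the rating against the fields that were rewritten
table = pd.crosstab(toy_responses_df["changes-needed"], toy_responses_df["new-fields"])

# A response is inconsistent if it says "no changes" but provides new text,
# or says "changes needed" without providing any
no_change_but_text = (toy_responses_df["changes-needed"] == 0) & (
    toy_responses_df["new-fields"] != "None"
)
change_but_no_text = (toy_responses_df["changes-needed"] == 1) & (
    toy_responses_df["new-fields"] == "None"
)
inconsistent = toy_responses_df[no_change_but_text | change_but_no_text]

print(table)
print(f"Inconsistent responses: {len(inconsistent)}")
```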

We could publish the dataset as it is now, but for this example we'll do a little post-processing to simplify the fields and substitute the old instruction and response with the new version provided by our annotators. That way, we have a dataset that's fully ready for fine-tuning.

In [15]:
new_records = []
for record in feedback:
    if record['responses'] == []:
        continue
    # we should only have 1 response per record, so we can safely use the first one only
    response = record['responses'][0]
    # we will skip records where our annotators didn't submit their feedback
    if response['status'] != 'submitted':
        continue

    response_values = response['values']
    # if the annotator answered 0 in the first question, we will keep the original fields
    if response_values['changes-needed']['value'] == 0:
        new_records.append(record['fields'])
    # if not, we will substitute the instruction and response with the text submitted by the annotator
    else:
        new_instruction = response_values.get('new-instruction', {}).get('value')
        new_response = response_values.get('new-response', {}).get('value')

        if new_instruction not in ['', None]:
            record['fields']['instruction'] = new_instruction
        if new_response not in ['', None]:
            record['fields']['response'] = new_response
        new_records.append(record['fields'])
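
The loop above updates `record['fields']` in place. If you would rather keep the original records untouched, for instance to compare them with the curated text later, a variant that works on a copy might look like the sketch below. The helper name `curated_fields` and the toy record are hypothetical, and the record shape is assumed to match the one used in this tutorial.

```python
def curated_fields(record):
    """Return a copy of the record's fields with the annotator's corrections applied,
    or None if the record has no submitted response."""
    responses = record["responses"]
    if not responses or responses[0]["status"] != "submitted":
        return None
    values = responses[0]["values"]
    fields = dict(record["fields"])  # copy, so the original record is left untouched
    if values["changes-needed"]["value"] == 1:
        for question, field in (("new-instruction", "instruction"), ("new-response", "response")):
            new_text = values.get(question, {}).get("value")
            if new_text not in ("", None):
                fields[field] = new_text
    return fields


# Hypothetical record where only the response was corrected
toy_record = {
    "fields": {"instruction": "original instruction", "context": "", "response": "original response"},
    "responses": [
        {
            "status": "submitted",
            "values": {
                "changes-needed": {"value": 1},
                "new-response": {"value": "corrected response"},
            },
        }
    ],
}

curated = curated_fields(toy_record)
print(curated)
```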

Let's check how it looks:

In [16]:
new_df = pd.DataFrame(new_records)
new_df

| | instruction | context | response |
|---|---|---|---|
| 0 | The New Deal was a series of programs, public ... | The New Deal was a series of programs, public ... | The "3 R's" historians refer to are the follow... |
| 1 | The music genre is the categorisation of music... | | The music genre is the categorisation of music... |
| 2 | There are many reasons in history for Irish pu... | | There are many reasons in history for Irish pu... |
| 3 | Who was the winner of the International Booker... | | Marieke Lucas Rijneveld, a Dutch author, won t... |
| 4 | What is the national day of Germany? | | The National Day of Germany is called German U... |
| 5 | What is an idea? | | In common usage and in philosophy, ideas are t... |
| 6 | Classify each of the following as root or shoo... | | Tomato, brinjal, lady finger, cucumber are sho... |
| 7 | Given a reference text about multiple myeloma,... | Multiple myeloma (MM), also known as plasma ce... | Although the cause of multiple myeloma is not ... |
| 8 | What game was the animated series Tank Knights... | Fortress is a shooter video game developed by ... | Tank Knights Fortress was based on the video g... |
| 9 | What is cuban cuisine? | | Cuban cuisine refers to the food eaten in the ... |


Now that we're happy with the result, we can publish it on the Hugging Face Hub so that the whole open-source community can benefit from it.

In [None]:
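# push to hub
# `Dataset` cannot be built directly from a list of dicts, so we convert the DataFrame from the previous cell
new_dataset = Dataset.from_pandas(new_df)
new_dataset.push_to_hub(".../curated_databricks-dolly-15k")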

This dataset is ready to be used as a demonstration dataset to fine-tune instruction-following models.

## Summary

In this tutorial, we learned how to create an instruction dataset by curating a public dataset with a permissive license, in this case the Dolly dataset made by Databricks employees. Curating the data this way lets us fine-tune an instruction-following model on high-quality examples, which should lead to better results and more efficient training.