# 🖼️ Curate an instruction dataset for supervised fine-tuning

In this tutorial, we will show you how to curate a public instruction dataset, like [Databricks' Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k), so that you can use it to fine-tune an LLM on instruction tasks. We will walk through the following steps:

- Define the project 🤔
- Split the workload and import to Argilla 📫
- Collect feedback and publish the results 🧑‍💻

<img src="../../_static/tutorials/curating-feedback-instructiondataset/snapshot_dolly_curation.png" alt="A Feedback Task setting for the curation of Databricks' Dolly dataset" style="width: 1100px;">

## Introduction

The internet is flooded with open-source datasets for fine-tuning LLMs, some created by humans and others generated with generative models. However, these datasets often contain many problematic and low-quality examples. By curating them, we can make the fine-tuning step more efficient.

In this tutorial, we take the Dolly dataset as an example to show how you can build an instruction dataset for your fine-tuning projects by cleaning an existing dataset using Argilla's Feedback Task.

Let's get started!

## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

1. [Deploy Argilla on Hugging Face Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla): This is the fastest option and the recommended choice for connecting to external notebooks (e.g., Google Colab) if you have an account on Hugging Face.

2. [Launch Argilla using Argilla's quickstart Docker image](../../getting_started/quickstart.ipynb): This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip
    
This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>

## Setup

For this tutorial, you'll need to install the Argilla client and a few third-party libraries using `pip`:

In [None]:
%pip install argilla datasets pandas httpx -qqq


Let's import the Argilla module for reading and writing data:

In [1]:
import argilla as rg

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the `URL` and `API_KEY`:

In [None]:
# Replace api_url with the URL of your HF Space if you are using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900", 
    api_key="admin.apikey"
)

In [2]:
import os
rg.init(
    api_url=os.environ.get("ARGILLA_API_URL_DEV"),
    api_key=os.environ.get("ARGILLA_API_KEY")
)

Finally, let's include the imports we need:

In [4]:
from datasets import load_dataset



## Define the project 
As a first step, let's load the dataset and quickly explore the data.

In [5]:
data = load_dataset('databricks/databricks-dolly-15k', split='train')
data

Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 15011
})

In [6]:
import pandas as pd
df = pd.DataFrame(data)
df

| | instruction | context | response | category |
|---|---|---|---|---|
| 0 | When did Virgin Australia start operating? | Virgin Australia, the trading name of Virgin A... | Virgin Australia commenced services on 31 Augu... | closed_qa |
| 1 | Which is a species of fish? Tope or Rope | | Tope | classification |
| 2 | Why can camels survive for long without water? | | Camels use the fat in their humps to keep them... | open_qa |
| 3 | Alice's parents have three daughters: Amy, Jes... | | The name of the third daughter is Alice | open_qa |
| 4 | When was Tomoaki Komorida born? | Komorida was born in Kumamoto Prefecture on Ju... | Tomoaki Komorida was born on July 10,1981. | closed_qa |
| ... | ... | ... | ... | ... |
| 15006 | How do i accept the change | | Embrace the change and see the difference | brainstorming |
| 15007 | What is a laser and who created it? | A laser is a device that emits light through a... | A laser is a device that emits light from an e... | summarization |
| 15008 | What is the difference between a road bike and... | | Road bikes are built to be ridden on asphalt a... | open_qa |
| 15009 | How does GIS help in the real estate investmen... | | Real estate investors depend on precise, accur... | general_qa |
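
Before deciding which fields to show to annotators, it can help to check how the categories are distributed and how many records come without a context. The following is a minimal sketch, not part of the original notebook: `summarize` and `sample_rows` are hypothetical names, and the sample rows are copied from the preview above. It only assumes that each row can be read like a dictionary.

```python
from collections import Counter


def summarize(rows):
    """Count records per category and records that have an empty context."""
    categories = Counter(row["category"] for row in rows)
    # treat None, empty strings and whitespace-only strings as "no context"
    without_context = sum(1 for row in rows if not (row["context"] or "").strip())
    return categories, without_context


# Hypothetical sample shaped like the Dolly rows in the preview
sample_rows = [
    {
        "instruction": "When did Virgin Australia start operating?",
        "context": "Virgin Australia, the trading name of Virgin A...",
        "response": "Virgin Australia commenced services on 31 Augu...",
        "category": "closed_qa",
    },
    {
        "instruction": "Which is a species of fish? Tope or Rope",
        "context": "",
        "response": "Tope",
        "category": "classification",
    },
    {
        "instruction": "Why can camels survive for long without water?",
        "context": "",
        "response": "Camels use the fat in their humps to keep them...",
        "category": "open_qa",
    },
]

category_counts, without_context = summarize(sample_rows)
print(category_counts.most_common())
print(f"records without context: {without_context}")
```

Since iterating over the loaded dataset yields one dictionary per row, calling `summarize(data)` should work the same way on the full dataset.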


For our project, we want to make sure that the `instruction` and the `response` are clear and concise, use correct language and contain accurate information. We will add those two fields to our records. To help our annotation team, we will also include the `context` field.

In [17]:
# format the data as Argilla records
records = [
    rg.FeedbackRecord(
        fields={key: record[key] for key in ("instruction", "response", "context")}
    )
    for record in data
]

# list of fields that we will use later for our dataset settings
fields = [
    rg.TextField(name="instruction"),
    rg.TextField(name="context", title="Context / Input"),
    rg.TextField(name="response")
]

Next, we define the questions that we would like to ask about these records and write some guidelines for the annotators.

In [18]:
# list of questions to display in the feedback form
questions = [
    rg.RatingQuestion(
        name="changes-needed", 
        title="Does the instruction or response in this record need corrections?", 
        description="0 = no: choose if everything looks good and you won't provide any corrections\n1 = yes: choose if you are providing corrections to the instruction or response",
        required=True,
        values=[0,1]
    ),
    rg.TextQuestion(
        name="new-instruction",
        title="Provide a new version of the instruction:",
        required=False
    ),
    rg.TextQuestion(
        name="new-response",
        title="Provide a new version of the response:",
        required=False
    )
]

guidelines = "In this dataset, you will find a collection of records that show at least an instruction and a response to that instruction. The aim of the project is to correct the instructions and responses to make sure they are of the highest quality. Both texts should be clear and include real information. In addition, the response should be as complete but concise as possible. To help you with the responses, some records have another text field called Context. This field shows the text where the response was taken from.\n\nTo curate the dataset, you will need to answer the following questions:\n\n1 - Does the instruction or response in this record need corrections?\nThis question has a binary selection. Choose 0 if everything looks good and you won't provide any corrections. You can submit the response straightaway. Choose 1 if you are going to provide a corrected version of either the instruction or the response.\n\n2 - Provide a new version of the instruction:\nIf the instruction needs corrections, write a new version in this text area. If the instruction is ok, leave it empty.\n\n3 - Provide a new version of the response:\nIf the response needs corrections, write a new version in this text area. If the response is ok, leave it empty.\n\nIf you are not sure about a record and you prefer not to provide a response, click Discard."


## Split the workload and import to Argilla
For this specific project, we don't want any overlap among our annotators, as we only want one version of each record. We'll assume that the annotations of our team have the desired quality to work as demonstration data for our instruction-following model.

```{tip}
For extra quality assurance, you can create a new dataset where annotators rate the quality of the human-annotated dataset.
```

To avoid having multiple responses for a record, we will split the workload between all of our annotators and import the records assigned to each of them into a dataset in their personal workspace.

First, let's get the list of users using the Argilla Client.

In [9]:
import httpx
import random
from collections import defaultdict

# make a request using your Argilla Client to get the list of users
rg_client = rg.active_client().client
auth_headers = {"X-Argilla-API-Key": rg_client.token}
http = httpx.Client(base_url=rg_client.base_url, headers=auth_headers)
users = http.get("/api/users").json()

# filter users to get only those with annotator role
users = [user for user in users if user['role']=='annotator']



In [3]:
# alternatively, you can define the list of users manually
users = [
    {"username": "natalia-annotator"},
    {"username": "natalia-admin"},
    {"username": "recognai"},
]

When we're happy with the list of users, we can move on to do the assignments:

In [20]:
# shuffle the records to get a random assignment
random.shuffle(records)

# build a dictionary where the key is the username and the value is the list of records assigned to them
assignments = defaultdict(list)

# divide your records in chunks of the same length as the users list and make the assignments
# you will need to follow the instructions to create and push a dataset for each of the key-value pairs in this dictionary
n = len(users)
chunked_records = [records[i:i + n] for i in range(0, len(records), n)]
for chunk in chunked_records:
    for idx, record in enumerate(chunk):
        assignments[users[idx]['username']].append(record)

# create a dataset for each annotator and push it to their personal workspace
# (the loop variable is named user_records so it does not overwrite the full records list)
for username, user_records in assignments.items():
    dataset = rg.FeedbackDataset(
        guidelines=guidelines,
        fields=fields,
        questions=questions
    )
    dataset.add_records(user_records)
    dataset.push_to_argilla(name='dolly_cleaning', workspace=username)
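
To check that a split like this gives every annotator a similar share and never assigns a record twice, the assignment logic can be tried out on placeholder data first. The following is a minimal sketch rather than the notebook's own code: `split_workload`, the usernames and the integer stand-ins for records are all hypothetical, and it uses a modulo-based round-robin instead of chunking, which should yield the same kind of distribution.

```python
import random
from collections import defaultdict


def split_workload(items, usernames, seed=None):
    """Shuffle the items and deal them out to the usernames in round-robin order."""
    shuffled = list(items)
    # a dedicated Random instance keeps the shuffle reproducible when a seed is given
    random.Random(seed).shuffle(shuffled)
    assignments = defaultdict(list)
    for position, item in enumerate(shuffled):
        assignments[usernames[position % len(usernames)]].append(item)
    return dict(assignments)


# Hypothetical usernames and integer stand-ins for the records
demo = split_workload(range(10), ["annotator-a", "annotator-b", "annotator-c"], seed=42)
sizes = {name: len(items) for name, items in demo.items()}
print(sizes)
```

With this approach, the number of records per annotator differs by at most one, and each record ends up in exactly one workspace.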

## Collect feedback and publish the results
At this point, the datasets are ready for annotation. Once the annotation is done, we will collect all the feedback from our annotators and join it in a single dataset.

In [6]:
feedback = []
for username in assignments.keys():
    feedback.extend(rg.FeedbackDataset.from_argilla('dolly_cleaning', workspace=username))

In [7]:
feedback = []
for username in users:
    feedback.extend(rg.FeedbackDataset.from_argilla('dolly_cleaning', workspace=username['username']))

Fetching records from Argilla: 21it [00:03,  5.45it/s]
Fetching records from Argilla: 21it [00:03,  5.93it/s]
Fetching records from Argilla: 21it [00:03,  6.30it/s]


Let's explore the dataset a bit so we can draw some conclusions about it:

In [1]:
# count how many records were modified. How many had modifications in the instruction, how many in the response. Discarded records?
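
One way to approach this count is sketched here. It is a hedged example, not the notebook's own cell: `count_changes` and `sample_feedback` are hypothetical, and the sketch assumes that each record can be read with the same dictionary layout that the post-processing code in this tutorial uses (`responses`, `status`, `values`).

```python
from collections import Counter


def count_changes(records):
    """Tally response statuses and which fields the annotators rewrote."""
    counts = Counter()
    for record in records:
        if not record["responses"]:
            counts["no_response"] += 1
            continue
        # one response per record is expected, so only the first one is inspected
        response = record["responses"][0]
        counts[response["status"]] += 1
        if response["status"] != "submitted":
            continue
        values = response["values"]
        if values["changes-needed"]["value"] == 0:
            counts["unchanged"] += 1
            continue
        counts["modified"] += 1
        if "new-instruction" in values:
            counts["new_instruction"] += 1
        if "new-response" in values:
            counts["new_response"] += 1
    return counts


# Hypothetical feedback records, only for illustration
sample_feedback = [
    {"responses": [{"status": "submitted", "values": {"changes-needed": {"value": 0}}}]},
    {"responses": [{"status": "submitted", "values": {
        "changes-needed": {"value": 1},
        "new-response": {"value": "A corrected response"},
    }}]},
    {"responses": [{"status": "discarded", "values": {}}]},
    {"responses": []},
]

summary = count_changes(sample_feedback)
print(dict(summary))
```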

We could publish the dataset as it is now, but for this example we'll do a little post-processing to simplify the fields and substitute the old instruction and response with the new version provided by our annotators. That way we have a dataset that's fully ready for fine-tuning.

In [None]:
new_records = []
for record in feedback:
    # we should only have 1 response per record, so we can safely use the first one
    response = record['responses'][0]
    # we will skip records where our annotators didn't submit their feedback
    if response['status'] != 'submitted':
        continue

    response_values = response['values']
    # if the annotator answered 0 in the first question, we will keep the original fields
    if response_values['changes-needed']['value'] == 0:
        new_records.append(record['fields'])
    # if not, we will substitute the instruction and response with the text submitted by the annotator
    else:
        if 'new-instruction' in response_values:
            record['fields']['instruction'] = response_values['new-instruction']['value']
        if 'new-response' in response_values:
            record['fields']['response'] = response_values['new-response']['value']
        new_records.append(record['fields'])

Let's check how it looks:

In [None]:
new_df = pd.DataFrame(new_records)
new_df

Now that we're happy with the result, we can publish it on the Hugging Face Hub, so the whole open-source community can benefit from it.

In [None]:
from datasets import Dataset

# build a Dataset from the list of dictionaries and push it to the Hub
new_dataset = Dataset.from_list(new_records)
new_dataset.push_to_hub(".../curated_dolly")

This dataset is ready to be used as a demonstration dataset to fine-tune instruction-following models.

## Summary

In this tutorial, we learned how to create an instruction dataset by curating a public dataset with a permissive license, in this case the Dolly dataset made by Databricks employees. With this high-quality data, we can fine-tune an instruction-following model more efficiently and get better results.