# Creating a `FeedbackDataset` for `text-classification`

## 1. Push to `Argilla` and 🤗`hub`

This tutorial is the first in a series in which we will get to know the `FeedbackDataset` for `text-classification`. We will start by creating a basic dataset from [ag_news](https://huggingface.co/datasets/ag_news) and pushing it to `Argilla` and the 🤗`hub`.

## Table of Contents

1. [Create a `FeedbackDataset`](#create-feedbackdataset)

    1.1 [Create a `FeedbackDataset` from a Hugging Face `dataset`](#create-feedbackdataset-datasets)

    1.2 [Create a `FeedbackDataset` from a `pandas.DataFrame`](#create-feedbackdataset-pandas)

2. [Push our `FeedbackDataset` to `Argilla`](#push-to-argilla)
3. [Push our `FeedbackDataset` to the 🤗`hub`](#push-to-hf-hub)
4. [Conclusions](#conclusions)


## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).

**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.ipynb). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>

First let's install our dependencies and import the necessary libraries:

In [28]:
!pip install argilla==1.18.0
!pip install datasets

In [5]:
import argilla as rg
from datasets import load_dataset

In [3]:
rg.init(
    api_url="https://<YOUR-HF-SPACE>.hf.space",
    api_key="admin.apikey"
)
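
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the connection settings could instead be read from environment variables. The variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY`, the helper `argilla_credentials`, and the fallback values are assumptions of this example, not something this tutorial prescribes.

```python
import os


def argilla_credentials(
    default_url: str = "http://localhost:6900",
    default_key: str = "admin.apikey",
) -> dict:
    """Return keyword arguments for `rg.init`, preferring environment variables.

    Hypothetical helper: the variable names and the defaults are assumptions.
    """
    return {
        "api_url": os.environ.get("ARGILLA_API_URL", default_url),
        "api_key": os.environ.get("ARGILLA_API_KEY", default_key),
    }


credentials = argilla_credentials()
# With Argilla installed, these could then be passed on as: rg.init(**credentials)
```

The resulting dictionary has the same two keys as the `rg.init` call above, so it can be unpacked directly into it.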



## Create a `FeedbackDataset` <a class="anchor" id="create-feedbackdataset"></a>

For this tutorial we will use the [ag_news](https://huggingface.co/datasets/ag_news) dataset, which can be downloaded from the 🤗`hub`. We will load only the first 1000 items from the training split.

In [6]:
ds = load_dataset("ag_news", split="train[:1000]")
ds

Dataset({
    features: ['text', 'label'],
    num_rows: 1000
})

Loading only 1000 records keeps the tutorial fast, but feel free to try the full dataset.

This dataset contains a collection of news articles (the content is in the `text` column), each of which has been assigned one of the following classification labels in the `label` column: *World (0), Sports (1), Business (2), Sci/Tech (3)*.
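
Since `ag_news` stores its labels as integers, it can help to keep the mapping to label names explicit. The following is a small sketch built only from the mapping listed above; the names `ID2LABEL`, `LABEL2ID` and `label_name` are illustrative and not part of `Argilla` or `datasets`.

```python
# Mapping taken from the label list above: integer class id -> label name.
ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
# Reverse lookup: label name -> integer class id.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def label_name(label_id: int) -> str:
    """Translate an ag_news integer label into its readable name."""
    try:
        return ID2LABEL[label_id]
    except KeyError:
        raise ValueError(f"Unknown ag_news label id: {label_id!r}") from None


print(label_name(2))  # Business
```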

Let's use the [task templates](https://docs.argilla.io/en/latest/practical_guides/create_dataset.html#task-templates) to create a feedback dataset ready for `text-classification`.

In [7]:
feedback_dataset = rg.FeedbackDataset.for_text_classification(
    labels=["World", "Sports", "Business", "Sci/Tech"],
    guidelines="Classify the articles into one of the four categories.",
)
feedback_dataset

FeedbackDataset(
    fields=[TextField(name='text', title='Text', required=True, type='text', use_markdown=False)]
    questions=[LabelQuestion(name='label', title='Label', description='Classify the text by selecting the correct label from the given list of labels.', required=True, type='label_selection', labels=['World', 'Sports', 'Business', 'Sci/Tech'], visible_labels=None)]
    guidelines=Classify the articles into one of the four categories.)
)

For comparison, this is the custom configuration we would otherwise write by hand (see the [custom configuration](https://docs.argilla.io/en/latest/practical_guides/create_dataset.html#custom-configuration) guide for more on creating a `FeedbackDataset` when we want finer control):

In [13]:
feedback_dataset_long = rg.FeedbackDataset(
    guidelines="Classify the articles into one of the four categories.",
    fields=[
        rg.TextField(name="text", title="Text from the article"),
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            title="In which category does this article fit?",
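            # When `labels` is a dict, the keys are the label values and the values are the text shown in the UI.
            labels={"World": "0", "Sports": "1", "Business": "2", "Sci/Tech": "3"},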
            required=True,
            visible_labels=None
        )
    ]
)
feedback_dataset_long

FeedbackDataset(
    fields=[TextField(name='text', title='Text from the article', required=True, type='text', use_markdown=False)]
    questions=[LabelQuestion(name='label', title='In which category does this article fit?', description=None, required=True, type='label_selection', labels={'World': '0', 'Sports': '1', 'Business': '2', 'Sci/Tech': '3'}, visible_labels=None)]
    guidelines=Classify the articles into one of the four categories.)
)

### Create a `FeedbackDataset` from a Hugging Face `dataset` <a class="anchor" id="create-feedbackdataset-datasets"></a>

Once we have created our `FeedbackDataset`, the next step is to add [`FeedbackRecords`](https://docs.argilla.io/en/latest/getting_started/cheatsheet.html#create-records) to it.

In order to create our records we can just loop over the items in the `datasets.Dataset`.

In [8]:
records = []
for i, item in enumerate(ds):
    records.append(
        rg.FeedbackRecord(
            fields={
                "text": item["text"],
            },
            external_id=f"record-{i}"
        )
    )

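Each record gets an `external_id`: an identifier of our own choosing that lets us trace a record back to its row in the source data, whatever internal ID `Argilla` assigns to it.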
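
To illustrate why a stable `external_id` is useful, here is a minimal sketch in plain Python that builds the same `fields` and `external_id` values as the loop above and then recovers the source row from an ID. The helpers `build_record_kwargs` and `source_index` are hypothetical names for this example, and the sample rows are placeholders rather than real `ag_news` articles.

```python
from typing import Any, Dict, Iterable, List


def build_record_kwargs(
    items: Iterable[Dict[str, Any]], prefix: str = "record"
) -> List[Dict[str, Any]]:
    """Build one keyword-argument dict per item, numbering the IDs by position."""
    return [
        {"fields": {"text": item["text"]}, "external_id": f"{prefix}-{i}"}
        for i, item in enumerate(items)
    ]


def source_index(external_id: str) -> int:
    """Recover the position in the source data from an ID such as 'record-17'."""
    return int(external_id.rsplit("-", 1)[1])


# Placeholder rows with the same columns as the dataset (text, label).
sample_items = [
    {"text": "Placeholder business article", "label": 2},
    {"text": "Placeholder sci/tech article", "label": 3},
]
record_kwargs = build_record_kwargs(sample_items)

# Look up the original row, including its label, from the record's ID.
row = sample_items[source_index(record_kwargs[1]["external_id"])]
print(row["label"])  # 3
```

With Argilla installed, each dictionary could be turned into a record with `rg.FeedbackRecord(**kwargs)`.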

### Create a `FeedbackDataset` from a `pandas.DataFrame` <a class="anchor" id="create-feedbackdataset-pandas"></a>

If our data came in a different format, say a `csv` file, it might be more convenient to read it with pandas.

For this example we will convert our dataset to a `pandas.DataFrame`. Creating each `FeedbackRecord` then works just the same:

In [18]:
df_dataset = ds.to_pandas()
df_dataset.head()

| | text | label |
|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 |
| 4 | Oil prices soar to all-time record, posing new... | 2 |


In [20]:
df_dataset

| | text | label |
|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 |
| 4 | Oil prices soar to all-time record, posing new... | 2 |
| ... | ... | ... |
| 995 | U.S. Stocks Rebound as Oil Prices Ease NEW YO... | 2 |
| 996 | Dollar Rises Vs Euro After Asset Data NEW YOR... | 2 |
| 997 | Bikes Bring Internet to Indian Villagers (AP) ... | 3 |
| 998 | Celebrity Chefs Are Everywhere in Vegas By ADA... | 3 |


In [21]:
records_pandas = []
for i, item in df_dataset.iterrows():
    records_pandas.append(
        rg.FeedbackRecord(
            fields={
                "text": item["text"],
            },
            external_id=f"record-{i}"
        )
    )
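
If the data lives in a CSV file and pandas is not needed for anything else, the standard library's `csv` module is an alternative. This sketch swaps `pandas` for `csv.DictReader`. The column names `text` and `label` mirror the DataFrame above, while the in-memory sample and the helper `rows_to_record_kwargs` are assumptions made for illustration.

```python
import csv
import io

# Placeholder CSV content standing in for a real file on disk.
SAMPLE_CSV = (
    "text,label\n"
    '"First placeholder article, with a comma",2\n'
    "Second placeholder article,3\n"
)


def rows_to_record_kwargs(handle):
    """Read CSV rows and build one keyword-argument dict per row."""
    reader = csv.DictReader(handle)  # uses the header line for column names
    return [
        {"fields": {"text": row["text"]}, "external_id": f"record-{i}"}
        for i, row in enumerate(reader)
    ]


# For a real file, open it with: open(path, newline="", encoding="utf-8")
record_kwargs = rows_to_record_kwargs(io.StringIO(SAMPLE_CSV))
print(len(record_kwargs))  # 2
```

As in the loops above, each dictionary carries the `fields` and `external_id` values that a `FeedbackRecord` expects.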

Let's add our records to the dataset:

In [9]:
feedback_dataset.add_records(records)

Our dataset now holds the texts, ready to be labeled. Let's push it to `Argilla`.

## Push our `FeedbackDataset` to `Argilla` <a class="anchor" id="push-to-argilla"></a>

In [11]:
remote_dataset = feedback_dataset.push_to_argilla(name="end2end_textclassification", workspace="admin")

Pushing records to Argilla...: 100%|██████████| 32/32 [00:11<00:00,  2.87it/s]
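
The progress bar counts batches rather than individual records. Here is a quick check of the arithmetic, assuming the records are uploaded in batches of 32. That batch size is an assumption, although it is consistent with the 32 steps shown above for 1000 records. `num_batches` is an illustrative helper, not an Argilla function.

```python
import math


def num_batches(num_records: int, batch_size: int) -> int:
    """Number of upload batches needed; the last batch may be partial."""
    return math.ceil(num_records / batch_size)


# 1000 records in batches of 32: 31 full batches plus one batch of 8 records.
print(num_batches(1000, 32))  # 32
```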


If we go to our `Argilla` instance, we should see a screen similar to the following:

![feedback-dataset](../images/feedback-dataset-text-classification-1.png)

Here we can see the text of each article and the labels to choose from.

### Download our dataset from `Argilla`

We can now download the dataset from `Argilla` just to check it:

In [14]:
remote_dataset = rg.FeedbackDataset.from_argilla("end2end_textclassification", workspace="admin")
remote_dataset

RemoteFeedbackDataset(
   id=03710d63-0f92-4eef-b948-f90ed883af23
   name=end2end_textclassification
   workspace=Workspace(id=31b3ea30-fa1b-4f24-8181-ea5e78ae74e4, name=admin, inserted_at=2023-11-13 16:21:41.274922, updated_at=2023-11-13 16:21:41.274922)
   url=https://plaguss-argilla-tutorials.hf.space/dataset/03710d63-0f92-4eef-b948-f90ed883af23/annotation-mode
   fields=[RemoteTextField(id=UUID('a5425ba8-ae20-4a3a-90f3-efe8e911083a'), client=None, name='text', title='Text', required=True, type='text', use_markdown=False)]
   questions=[RemoteLabelQuestion(id=UUID('276c7734-c432-4459-bc8e-4e7c412c47a3'), client=None, name='label', title='Label', description=None, required=True, type='label_selection', labels=['World', 'Sports', 'Business', 'Sci/Tech'], visible_labels=None)]
   guidelines=Classify the articles into one of the four categories.)

## Push our `FeedbackDataset` to the 🤗`hub` <a class="anchor" id="push-to-hf-hub"></a>

If we want to share our dataset with the world, we can use the Hugging Face Hub.

First we need to log in to Hugging Face. The following snippet will ask for our HF token.

If we don't have one already, we can create one by following [this guide](https://huggingface.co/docs/hub/security-tokens) (remember to give it *write* access).

In [15]:
from huggingface_hub import notebook_login

notebook_login()

Now we can call `push_to_huggingface` on the dataset:

In [16]:
remote_dataset.push_to_huggingface("argilla/end2end_textclassification")



Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

## Conclusions <a class="anchor" id="conclusions"></a>

In this tutorial we created an `Argilla` `FeedbackDataset` for text classification with a `LabelQuestion`, starting from [ag_news](https://huggingface.co/datasets/ag_news) and building the records from both a `datasets.Dataset` and a `pandas.DataFrame`.

We pushed this dataset to `Argilla`, where we can curate and label the records, and then to the 🤗`hub`.

To learn more about how to work with the `FeedbackDataset` check the [cheatsheet](https://docs.argilla.io/en/latest/getting_started/cheatsheet.html#cheatsheet).