# Creating a `FeedbackDataset` for `text-classification`

## 1. Push to `Argilla` and 🤗`hub`

This tutorial is the first in a series in which we will get to know the `FeedbackDataset` for `text-classification`. We will start by creating a basic dataset, using [ag_news](https://huggingface.co/datasets/ag_news) as an example, and push it to `Argilla` and the 🤗`hub`.

## Table of Contents

- 1. [Create a `FeedbackDataset`](#create-feedbackdataset)
    - 1.1 [Create a `FeedbackDataset` from a huggingface `dataset`](#create-feedbackdataset-datasets)
    - 1.2 [Create a `FeedbackDataset` from a `pandas.DataFrame`](#create-feedbackdataset-pandas)
- 2. [Push our `FeedbackDataset` to `Argilla`](#push-to-argilla)
- 3. [Push our `FeedbackDataset` to the 🤗`hub`](#push-to-hf-hub)
- 4. [Conclusions](#conclusions)


## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).

**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.ipynb). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>

First, let's install our dependencies and import the necessary libraries:

In [None]:
!pip install argilla==1.18.0
!pip install datasets

In [2]:
import argilla as rg
from datasets import load_dataset

In [None]:
rg.init(
    api_url="https://<YOUR-HF-SPACE>.hf.space",
    api_key="admin.apikey"
)
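Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch, assuming the URL and key are stored in environment variables named `ARGILLA_API_URL` and `ARGILLA_API_KEY` (both the variable names and the helper `load_argilla_settings` are assumptions made for illustration), could read them like this:

```python
import os


def load_argilla_settings(env=None):
    """Return keyword arguments for rg.init, read from environment variables.

    The variable names and this helper are illustrative assumptions.
    """
    env = os.environ if env is None else env
    api_url = env.get("ARGILLA_API_URL")
    api_key = env.get("ARGILLA_API_KEY")
    if not api_url or not api_key:
        raise ValueError("Set ARGILLA_API_URL and ARGILLA_API_KEY before connecting.")
    return {"api_url": api_url, "api_key": api_key}


# Usage (requires a running Argilla server):
# rg.init(**load_argilla_settings())
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.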

## Create a `FeedbackDataset` <a class="anchor" id="create-feedbackdataset"></a>

For this tutorial we will use the [ag_news](https://huggingface.co/datasets/ag_news) dataset which can be downloaded from the 🤗`hub`. We will load only the first 1000 items from the training sample.

In [None]:
ds = load_dataset("ag_news", split="train[:1000]")

Feel free to try the full dataset instead by dropping the `[:1000]` slice from `split`.

This dataset contains a collection of news articles (we can see the content in the `text` column), each of which has been assigned one of the following classification `labels`: *World (0), Sports (1), Business (2), Sci/Tech (3)*.
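Because the categories are identified by integer ids, it can help to keep the mapping to category names in one place. A small sketch, using only the ids and names listed above (the helper `id_to_label` is a hypothetical name):

```python
# Category names indexed by their integer id, as listed above.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def id_to_label(label_id):
    """Translate an integer class id into its category name."""
    if not 0 <= label_id < len(AG_NEWS_LABELS):
        raise ValueError(f"Unknown label id: {label_id}")
    return AG_NEWS_LABELS[label_id]


print(id_to_label(2))  # Business
```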

So, to create our dataset, we will add a single `field` named `text` and a single question named `label`.
Given that each text belongs to exactly one category, we choose the `LabelQuestion`.
You can visit the Argilla [cheatsheet](https://docs.argilla.io/en/latest/getting_started/cheatsheet.html#configure-datasets) for more information on configuring datasets:

In [None]:
feedback_dataset = rg.FeedbackDataset(
    guidelines="This dataset contains a collection of news articles. Please label them on the category they belong.",
    fields=[
        rg.TextField(name="text", title="Text from the article"),
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            title="In which category does this article fit?",
            labels={"World": "0", "Sports": "1", "Business": "2", "Sci/Tech": "3"},
            required=True,
            visible_labels=None
        )
    ]
)

In [21]:
feedback_dataset

FeedbackDataset(
    fields=[TextField(name='text', title='Text from the article', required=True, type='text', use_markdown=False)]
    questions=[LabelQuestion(name='label', title='In which category does this article fit?', description=None, required=True, type='label_selection', labels={'World': '0', 'Sports': '1', 'Business': '2', 'Sci/Tech': '3'}, visible_labels=None)]
    guidelines=This dataset contains a collection of news articles. Please label them on the category they belong.)
)

### Create a `FeedbackDataset` from a huggingface `dataset` <a class="anchor" id="create-feedbackdataset-datasets"></a>

Once our `FeedbackDataset` is created, the next step is to add [`FeedbackRecords`](https://docs.argilla.io/en/latest/getting_started/cheatsheet.html#create-records) to it.

To create our records, we can simply loop over the items in the `datasets.Dataset`.

In [None]:
records = []
for i, item in enumerate(ds):
    records.append(
        rg.FeedbackRecord(
            fields={
                "text": item["text"],
            },
            external_id=f"record-{i}"
        )
    )
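The same loop can be factored into a small helper that is easy to check without a running server. This is only a sketch: `build_record_payloads` is a hypothetical name, the sample rows are made up, and the helper returns plain dictionaries holding the same `fields` and `external_id` arguments that are passed to `rg.FeedbackRecord` above.

```python
def build_record_payloads(items, prefix="record"):
    """Build one payload dict per item, keyed like the FeedbackRecord arguments."""
    payloads = []
    for i, item in enumerate(items):
        payloads.append(
            {
                "fields": {"text": item["text"]},
                "external_id": f"{prefix}-{i}",
            }
        )
    return payloads


# Made-up rows that mimic the shape of the dataset items.
sample = [
    {"text": "Stocks rally on earnings.", "label": 2},
    {"text": "Team wins the cup final.", "label": 1},
]
payloads = build_record_payloads(sample)
print(payloads[1]["external_id"])  # record-1

# With Argilla installed, the payloads can be turned into records:
# records = [rg.FeedbackRecord(**p) for p in payloads]
```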

### Create a `FeedbackDataset` from a `pandas.DataFrame` <a class="anchor" id="create-feedbackdataset-pandas"></a>

If our data were in a different format, say a `csv` file, it might be more direct to read it with pandas.

For this example we will convert our dataset to a pandas `DataFrame`; the `FeedbackRecord` creation stays the same:

In [None]:
df_dataset = ds.to_pandas()

In [None]:
records_pandas = []
for i, item in df_dataset.iterrows():
    records_pandas.append(
        rg.FeedbackRecord(
            fields={
                "text": item["text"],
            },
            external_id=f"record-{i}"
        )
    )
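For data that really lives in a `csv` file, the standard-library `csv` module is a dependency-free alternative to pandas. The sketch below swaps pandas for `csv.DictReader` and reads from an in-memory string so it runs anywhere; the column name `text` matches the dataset above, while the sample rows and the helper name `read_texts` are made up for illustration.

```python
import csv
import io


def read_texts(csv_file):
    """Return (external_id, text) pairs, one per row of an open csv file."""
    reader = csv.DictReader(csv_file)
    return [(f"record-{i}", row["text"]) for i, row in enumerate(reader)]


# Made-up file contents; a real file would be opened with open(path, newline="").
sample_csv = io.StringIO(
    'text,label\n'
    '"Markets close higher, led by tech.",2\n'
    '"Striker signs new contract.",1\n'
)
rows = read_texts(sample_csv)
print(rows[0])  # ('record-0', 'Markets close higher, led by tech.')
```

Each pair carries what is needed for the `fields` and `external_id` arguments used in the loops above.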

Let's add our records to the dataset (we use `records` here; `records_pandas` holds the same content):

In [None]:
feedback_dataset.add_records(records)

Our dataset now contains the texts, ready to be labeled. Let's push it to `Argilla`.

## Push our `FeedbackDataset` to `Argilla` <a class="anchor" id="push-to-argilla"></a>

In [None]:
remote_dataset = feedback_dataset.push_to_argilla(name="end2end_textclassification", workspace="admin")

If we go to our `Argilla` instance, we should see a screen similar to the following:

![argilla-dataset](../images/feedback-dataset-text-classification-1.png)

There we can see the *Text from the article* field we defined and the different labels to choose from.

#### Download our dataset from `Argilla`

We can now download the dataset from `Argilla` just to check it:

In [None]:
remote_dataset = rg.FeedbackDataset.from_argilla("end2end_textclassification", workspace="admin")

## Push our `FeedbackDataset` to the 🤗`hub` <a class="anchor" id="push-to-hf-hub"></a>

If we wanted to share our dataset with the world, we could use the Hugging Face hub.

First, we need to log in to Hugging Face. The following snippet will ask for our HF token.

If we don't have one already, we can create one by following [this guide](https://huggingface.co/docs/hub/security-tokens) (remember to grant it *write* access).

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Now we can call `push_to_huggingface` on the `FeedbackDataset`:

In [None]:
remote_dataset.push_to_huggingface("argilla/end2end_textclassification")

## Conclusions <a class="anchor" id="conclusions"></a>

In this tutorial we created an `Argilla` `FeedbackDataset` for text classification, starting from [ag_news](https://huggingface.co/datasets/ag_news).

We configured it with a `LabelQuestion` and built its records from data stored as a `datasets.Dataset` and as a `pandas.DataFrame`.
The dataset was then pushed to `Argilla`, where we can curate and label the records, and finally to the 🤗`hub`.

To learn more about how to work with the `FeedbackDataset` check the [cheatsheet](https://docs.argilla.io/en/latest/getting_started/cheatsheet.html#cheatsheet).