# Creating a `Dataset`

This tutorial is part of a series in which we will get to know the `Dataset`. Before starting, you should complete the tutorial on [configuring users and workspaces](./configure-users-and-workspaces-000.ipynb). In this step, we will show how to configure a `Dataset` and add `Records` to it.

We will start by creating a basic dataset using the [ag_news](https://huggingface.co/datasets/ag_news) dataset as an example and push it to `Argilla` and the Hugging Face `hub`.


## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).

**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../../../getting_started/quickstart.md). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>


First let's install our dependencies and import the necessary libraries:


In [None]:
!pip install "argilla-python"
!pip install datasets

In [3]:
import argilla as rg
from datasets import load_dataset



Connect to Argilla:


In [4]:
# papermill_description=logging-to-argilla
client = rg.Argilla()

## Configure a `Dataset`


For this tutorial we will use the [ag_news](https://huggingface.co/datasets/ag_news) dataset which can be downloaded from the 🤗`hub`. We will load only the first 1000 items from the training sample.


In [5]:
ds = load_dataset("ag_news", split="train[:1000]")
ds



Dataset({
    features: ['text', 'label'],
    num_rows: 1000
})

Feel free to try the full dataset instead of this 1000-record sample.


This dataset contains a collection of news articles (we can see the content in the `text` column), each of which has been assigned one of the following classification `labels`: _World (0), Sports (1), Business (2), Sci/Tech (3)_.
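As a minimal sketch of the mapping described above, the integer ids in the `label` column can be translated to readable names with a plain dictionary. The names `ID2LABEL`, `LABEL2ID`, and `label_name` are hypothetical helpers introduced only for illustration; they are not part of Argilla or `datasets`.

```python
# Label ids and names as listed above for ag_news.
# ID2LABEL, LABEL2ID and label_name are illustrative names, not library APIs.
ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def label_name(label_id: int) -> str:
    """Return the readable category name for an integer label id."""
    return ID2LABEL[label_id]


print(label_name(2))
```

A lookup like this can be handy later if we want to show the original label next to each text.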

Let's use the [task templates](https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html#task-templates) to create a feedback dataset ready for `text-classification`.


In [None]:
text_classification_dataset = rg.Dataset(
    template=rg.Template.for_text_classification(
        guidelines="Classify the articles into one of the four categories.",
        labels=["World", "Sports", "Business", "Sci/Tech"],
    )
)
client.datasets.create(text_classification_dataset)

For comparison, here is the custom configuration we would have used previously (see the [custom configuration](https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html#custom-configuration) guide for more information on creating a `Dataset` when we want finer control):


In [8]:
custom_text_classification_template = rg.Template(
    guidelines="Classify the articles into one of the four categories.",
    fields=[
        rg.Field(name="text", title="Text from the article"),
        rg.Field(name="context", title="Context from the article"),
    ],
    questions=[
        rg.Question(
            name="label",
            title="In which category does this article fit?",
            required=True,
            settings=rg.QuestionSettings.MultiLabel(
                options=rg.LabelOption.from_labels(["World", "Sports", "Business", "Sci/Tech"])
            ),
        )
    ],
)
custom_text_classification_dataset = rg.Dataset(template=custom_text_classification_template)
client.datasets.create(custom_text_classification_dataset)

FeedbackDataset(
   fields=[TextField(name='text', title='Text from the article', required=True, type='text', use_markdown=False)]
   questions=[LabelQuestion(name='label', title='In which category does this article fit?', description=None, required=True, type='label_selection', labels={'World': '0', 'Sports': '1', 'Business': '2', 'Sci/Tech': '3'}, visible_labels=None)]
   guidelines=Classify the articles into one of the four categories.)
   metadata_properties=[])
)

## Add `Records`

### From a Hugging Face `dataset`


Once our `Dataset` is created, the next step is to add the `FeedbackRecords` to it.


To create our records, we can simply loop over the items in the `datasets.Dataset`.


In [9]:
for i, item in enumerate(ds):
    text_classification_dataset.records.add(
        rg.Record(
            fields={
                "text": item["text"],
            },
            id=f"record-{i}",
        )
    )
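Before sending anything to the server, it can help to check what each record will contain. The sketch below mirrors the loop above with plain dictionaries, so it runs without an Argilla instance. The helper `build_record_payloads` and the two sample rows are assumptions made up for illustration; they are not part of the Argilla API.

```python
from typing import Any, Dict, Iterable, List


def build_record_payloads(
    items: Iterable[Dict[str, Any]], prefix: str = "record"
) -> List[Dict[str, Any]]:
    """Turn dataset rows into plain dicts holding the fields and a stable id.

    Hypothetical helper for inspection only; it does not call Argilla.
    """
    payloads = []
    for i, item in enumerate(items):
        payloads.append({"id": f"{prefix}-{i}", "fields": {"text": item["text"]}})
    return payloads


# Two made-up rows standing in for items of the Hugging Face dataset.
sample_rows = [
    {"text": "Example headline one", "label": 2},
    {"text": "Example headline two", "label": 1},
]
payloads = build_record_payloads(sample_rows)
print(payloads[0]["id"])
```

Because the ids are derived from the row position, running the same loop twice over the same data produces the same ids, which is worth keeping in mind when adding records from more than one source.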

### From a `pandas.DataFrame` <a class="anchor" id="create-feedbackdataset-pandas"></a>


If our data were in a different format, say a `csv` file, it might be more direct to read it with pandas.
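As a rough sketch of that case, `pandas.read_csv` accepts either a file path or a file-like object. The CSV content below is made up and held in memory so the snippet is self-contained; with a real file we would pass its path instead.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = "text,label\nExample headline one,2\nExample headline two,1\n"

# pd.read_csv accepts a path or any file-like object.
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv.shape)
```

The resulting `DataFrame` has the same `text` and `label` columns as our dataset, so the record creation shown in this section applies unchanged.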

For this example, we will convert our dataset to a pandas `DataFrame`; the `FeedbackRecord` creation stays the same:


In [10]:
df_dataset = ds.to_pandas()
df_dataset.head()

|   | text | label |
|---|------|-------|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 |
| 4 | Oil prices soar to all-time record, posing new... | 2 |


Let's add our records to the dataset:


In [None]:
for i, item in df_dataset.iterrows():
    text_classification_dataset.records.add(
        rg.Record(
            fields={
                "text": item["text"],
            },
            id=f"record-{i}",
        )
    )

## Publish our dataset to Argilla for Feedback


By now we have our dataset with the texts ready to be labeled. Let's push it to `Argilla`:


In [21]:
client.datasets.publish(text_classification_dataset)


If we go to our `Argilla` instance, we should see a screen similar to the following.


![feedback-dataset](../../../../_static/tutorials/end2end/text-classification/feedback-dataset-text-classification-1.png)


## Download our dataset for use


In [None]:
dataset = client.datasets.get(text_classification_dataset.id)
local_dataset = dataset.pull()
# We can now convert to our favorite format

df = dataset.to_pandas()
ds = dataset.to_datasets()

## Conclusion


In this tutorial, we created an `Argilla` `FeedbackDataset` for text classification with a `LabelQuestion`, starting from [ag_news](https://huggingface.co/datasets/ag_news) data stored as a `datasets.Dataset` and as a `pandas.DataFrame`.

This dataset was pushed to `Argilla`, where we can curate and label the records, and finally to the 🤗`hub`.

To learn more about how to work with the `FeedbackDataset`, check the [cheatsheet](https://docs.argilla.io/en/latest/getting_started/cheatsheet.html#cheatsheet). To continue with assigning records to annotators, refer to the [next tutorial](./assign-records-002.ipynb).
