# Datasets

This guide showcases some features of the `Dataset` classes in the Argilla client.
The Dataset classes are lightweight containers for Argilla records. These classes facilitate importing from and exporting to different formats (e.g., `pandas.DataFrame`, `datasets.Dataset`) as well as sharing and versioning Argilla datasets using the Hugging Face Hub.

For each record type there's a corresponding Dataset class called `DatasetFor<RecordType>`.
You can look up their API in the [reference section](../reference/python/python_client.rst#module-argilla.client.datasets).

## Creating a Dataset

Under the hood the Dataset classes store the records in a simple Python list.
Therefore, working with a Dataset class is not very different from working with a simple list of records:

In [None]:
import argilla as rg

# Start with a list of Argilla records
dataset_rg = rg.DatasetForTextClassification(my_records)

# Loop over the dataset
for record in dataset_rg:
    print(record)

# Index into the dataset
dataset_rg[0] = rg.TextClassificationRecord(text="replace record")

# log a dataset to the Argilla web app
rg.log(dataset_rg, "my_dataset")


The Dataset classes also perform some extra checks to make sure you do not mix record types when appending to or indexing into a dataset.
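The idea behind these checks can be illustrated with a minimal sketch in plain Python. This is not Argilla's implementation: `TypedDataset`, `TextRecord` and `TokenRecord` are hypothetical stand-ins, used only to show how a list-backed container can reject records of the wrong type.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """Hypothetical stand-in for a text classification record."""
    text: str


@dataclass
class TokenRecord:
    """Hypothetical stand-in for a token classification record."""
    text: str
    tokens: list


class TypedDataset:
    """List-backed container that only accepts one record type."""

    def __init__(self, record_type, records=None):
        self._record_type = record_type
        self._records = []
        for record in records or []:
            self.append(record)

    def _check(self, record):
        # Reject any record that is not of the expected type
        if not isinstance(record, self._record_type):
            raise TypeError(
                f"Expected {self._record_type.__name__}, got {type(record).__name__}"
            )

    def append(self, record):
        self._check(record)
        self._records.append(record)

    def __setitem__(self, index, record):
        self._check(record)
        self._records[index] = record

    def __getitem__(self, index):
        return self._records[index]

    def __iter__(self):
        return iter(self._records)

    def __len__(self):
        return len(self._records)


dataset = TypedDataset(TextRecord, [TextRecord(text="first record")])
dataset[0] = TextRecord(text="replace record")  # same type: accepted

try:
    dataset.append(TokenRecord(text="I live", tokens=["I", "live"]))
except TypeError as error:
    print(error)  # Expected TextRecord, got TokenRecord
```

Because the check runs in both `append` and `__setitem__`, a mixed dataset can never be built up by accident.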

## Importing a Dataset

When you have your data in a [_pandas DataFrame_](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) or a [_datasets Dataset_](https://huggingface.co/docs/datasets/access.html), we provide some neat shortcuts to import this data into an Argilla Dataset.
You have to make sure that the data follows the record model of the specific task, otherwise you will get validation errors.
Columns in your DataFrame/Dataset that are not supported or recognized will simply be ignored.

The record models of the tasks are explained in the [reference section](../reference/python/python_client.rst#module-argilla.client.models). 

<div class="alert alert-info">

Note

Due to its pyarrow nature, data in a `datasets.Dataset` has to follow a slightly different model, which you can look up in the examples of the `Dataset*.from_datasets` [docstrings](../reference/python/python_client.rst#argilla.client.datasets.DatasetForTokenClassification.from_datasets).
    
</div>

In [None]:
import argilla as rg

# import data from a pandas DataFrame
dataset_rg = rg.read_pandas(my_dataframe, task="TextClassification")
# or
dataset_rg = rg.DatasetForTextClassification.from_pandas(my_dataframe)

# import data from a datasets Dataset
dataset_rg = rg.read_datasets(my_dataset, task="TextClassification")
# or
dataset_rg = rg.DatasetForTextClassification.from_datasets(my_dataset)


We also provide helper arguments you can use to read almost arbitrary datasets for a given task from the [Hugging Face Hub](https://huggingface.co/datasets).
They map certain input arguments of the Argilla records to columns of the given dataset.
Let's have a look at a few examples:

In [None]:
import argilla as rg
from datasets import load_dataset

# the "poem_sentiment" dataset has columns "verse_text" and "label"
dataset_rg = rg.DatasetForTextClassification.from_datasets(
    dataset=load_dataset("poem_sentiment", split="test"),
    text="verse_text",
    annotation="label",
)

# the "snli" dataset has the columns "premise", "hypothesis" and "label"
dataset_rg = rg.DatasetForTextClassification.from_datasets(
    dataset=load_dataset("snli", split="test"),
    inputs=["premise", "hypothesis"],
    annotation="label",
)

# the "conll2003" dataset has the columns "id", "tokens", "pos_tags", "chunk_tags" and "ner_tags"
dataset_rg = rg.DatasetForTokenClassification.from_datasets(
    dataset=load_dataset("conll2003", split="test"),
    tags="ner_tags",
)

# the "xsum" dataset has the columns "id", "document" and "summary"
dataset_rg = rg.DatasetForText2Text.from_datasets(
    dataset=load_dataset("xsum", split="test"),
    text="document",
    annotation="summary",
)


You can also use the shortcut `rg.read_datasets(dataset=..., task=..., **kwargs)` where the keyword arguments are passed on to the corresponding `from_datasets()` method.
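
To make the column mapping concrete, here is a small plain-Python sketch of the idea. It does not reproduce what `from_datasets()` does internally. The helper `map_rows` and the sample row are hypothetical, and the sketch covers only the `text`, `inputs` and `annotation` arguments used in the examples above.

```python
def map_rows(rows, text=None, inputs=None, annotation=None):
    """Build record-like dicts from rows, using the given column names."""
    records = []
    for row in rows:
        record = {}
        if text is not None:
            # A single column becomes the record's text
            record["text"] = row[text]
        if inputs is not None:
            # Several columns become a dict of named inputs
            record["inputs"] = {column: row[column] for column in inputs}
        if annotation is not None:
            record["annotation"] = row[annotation]
        records.append(record)
    return records


# Hypothetical row shaped like the "snli" columns mentioned above
rows = [{"premise": "A man sleeps.", "hypothesis": "A man is awake.", "label": 2}]
records = map_rows(rows, inputs=["premise", "hypothesis"], annotation="label")
print(records[0])
```

Columns that are not named in any argument are simply left out of the resulting records.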

## Sharing on Hugging Face

You can easily share your Argilla dataset with your community via the Hugging Face Hub.
For this you just need to export your Argilla Dataset to a `datasets.Dataset` and [push it to the hub](https://huggingface.co/docs/datasets/upload_dataset.html?highlight=push_to_hub#upload-from-python):

In [None]:
import argilla as rg

# load your annotated dataset from the Argilla web app
dataset_rg = rg.load("my_dataset")

# export your Argilla Dataset to a datasets Dataset
dataset_ds = dataset_rg.to_datasets()

# push the dataset to the Hugging Face Hub
dataset_ds.push_to_hub("my_dataset")


Afterward, your community can easily access your annotated dataset and log it directly to the Argilla web app:

In [None]:
import argilla as rg
from datasets import load_dataset

# download the dataset from the Hugging Face Hub
dataset_ds = load_dataset("user/my_dataset", split="train")

# read in the dataset, assuming it's a dataset for text classification
dataset_rg = rg.read_datasets(dataset_ds, task="TextClassification")

# log the dataset to the Argilla web app
rg.log(dataset_rg, "dataset_by_user")


## Prepare dataset for training

If you want to train a Hugging Face transformer or a spaCy NER pipeline, we provide a handy method to prepare your dataset: `DatasetFor*.prepare_for_training()`.
It returns a Hugging Face dataset or a spaCy DocBin, optimized for training with the Hugging Face Trainer or the spaCy CLI.

### TextClassification

For text classification tasks, it flattens the inputs into separate columns of the returned dataset, converts the annotations of your records into integers, and writes them to a `label` column:

In [None]:
dataset_rg = rg.DatasetForTextClassification(
    [
        rg.TextClassificationRecord(
            inputs={"title": "My title", "content": "My content"}, annotation="news"
        )
    ]
)

dataset_rg.prepare_for_training()[0]
# Output:
# {'title': 'My title', 'content': 'My content', 'label': 0}
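
The transformation described above can be sketched in plain Python. This is an illustration only, not what `prepare_for_training()` does internally: `prepare_rows` is a hypothetical helper, the records are plain dicts, and assigning ids by sorted label name is an assumption of the sketch.

```python
def prepare_rows(records):
    """Flatten inputs into columns and encode annotations as integer ids."""
    # Sorted label names give a stable id assignment (an assumption of this sketch)
    labels = sorted({record["annotation"] for record in records})
    label2id = {label: index for index, label in enumerate(labels)}
    rows = [
        {**record["inputs"], "label": label2id[record["annotation"]]}
        for record in records
    ]
    return rows, label2id


records = [
    {"inputs": {"title": "My title", "content": "My content"}, "annotation": "news"},
    {"inputs": {"title": "Match report", "content": "Final score"}, "annotation": "sports"},
]
rows, label2id = prepare_rows(records)
print(rows[0])   # {'title': 'My title', 'content': 'My content', 'label': 0}
print(label2id)  # {'news': 0, 'sports': 1}
```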


### TokenClassification

For token classification tasks, it converts the annotations of a record into integers representing BIO tags and writes them in a `ner_tags` column: 

In [None]:
dataset_rg = rg.DatasetForTokenClassification(
    [
        rg.TokenClassificationRecord(
            text="I live in Madrid",
            tokens=["I", "live", "in", "Madrid"],
            annotation=[("LOC", 10, 16)],
        )
    ]
)

dataset_rg.prepare_for_training()[0]
# Output:
# {..., 'tokens': ['I', 'live', 'in', 'Madrid'], 'ner_tags': [0, 0, 0, 1], ...}
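
A minimal sketch of how character spans can be turned into BIO tags is shown below. It is not Argilla's implementation. The helper `spans_to_bio` is hypothetical, and two assumptions are made: span end offsets are exclusive (so "Madrid" covers characters 10 to 16), and the integer ids follow the order of the `tag_names` list.

```python
def spans_to_bio(text, tokens, spans):
    """Convert (label, start, end) character spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    # Locate each token's character offsets by scanning the text left to right
    offsets, cursor = [], 0
    for token in tokens:
        start = text.index(token, cursor)
        cursor = start + len(token)
        offsets.append((start, cursor))
    for label, span_start, span_end in spans:
        inside = False
        for i, (start, end) in enumerate(offsets):
            # A token belongs to the span if it lies fully inside it
            if start >= span_start and end <= span_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


text = "I live in Madrid"
tokens = ["I", "live", "in", "Madrid"]
tags = spans_to_bio(text, tokens, [("LOC", 10, 16)])

# Integer ids follow the order of this tag list (an assumption of the sketch)
tag_names = ["O", "B-LOC", "I-LOC"]
ner_tags = [tag_names.index(tag) for tag in tags]
print(tags)      # ['O', 'O', 'O', 'B-LOC']
print(ner_tags)  # [0, 0, 0, 1]
```

The first token of an entity gets a `B-` tag and any following tokens of the same entity get an `I-` tag, which is what distinguishes adjacent entities of the same type.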


## Dataset settings

Argilla datasets have certain *settings* that you can configure via the `rg.*Settings` classes, for example `rg.TextClassificationSettings`.

### Define a labeling schema

You can define a labeling schema for your Argilla dataset, which fixes the allowed labels for your predictions and annotations.
Once you set a labeling schema, Argilla validates the added predictions and annotations each time you log to the corresponding dataset, to make sure they comply with the schema.

In [None]:
import argilla as rg

# Define labeling schema
settings = rg.TextClassificationSettings(label_schema=["A", "B", "C"])

# Apply settings to a new or already existing dataset
rg.configure_dataset(name="my_dataset", settings=settings)

# Logging to the newly created dataset triggers the validation checks
rg.log(rg.TextClassificationRecord(text="text", annotation="D"), "my_dataset")
# BadRequestApiError: Argilla server returned an error with http status: 400
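
Conceptually, this kind of validation is a set-membership check on the labels of each record. The sketch below illustrates it in plain Python and is not the server's implementation. `validate_labels` and `LabelSchemaError` are hypothetical names, records are plain dicts, and predictions are assumed to be `(label, score)` pairs.

```python
class LabelSchemaError(ValueError):
    """Raised when a record uses a label outside the configured schema."""


def validate_labels(records, label_schema):
    """Raise LabelSchemaError if any record uses a label not in the schema."""
    allowed = set(label_schema)
    for record in records:
        # Predictions are (label, score) pairs in this sketch
        used = {label for label, _ in record.get("prediction", [])}
        if record.get("annotation") is not None:
            used.add(record["annotation"])
        unknown = used - allowed
        if unknown:
            raise LabelSchemaError(f"Labels not in schema: {sorted(unknown)}")


schema = ["A", "B", "C"]
validate_labels([{"text": "text", "annotation": "A"}], schema)  # passes silently

try:
    validate_labels([{"text": "text", "annotation": "D"}], schema)
except LabelSchemaError as error:
    print(error)  # Labels not in schema: ['D']
```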
