# Explore the generated datasets in Argilla

## Create an Argilla instance and upload the datasets

The following cells push the previously created datasets to Argilla so we can explore them. All the steps are covered in the
[argilla-quickstart](https://argilla-io.github.io/argilla/dev/getting_started/quickstart/) section of the documentation.

In [21]:
## Install argilla if you haven't yet
#!pip install argilla --pre

Instantiate the client, pointing it to the Hugging Face Space where Argilla is running.

In [2]:
import argilla as rg

client = rg.Argilla(
    api_url="https://plaguss-argilla-sdk-chatbot.hf.space",
    api_key="owner.apikey"
)
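Hard-coding the API key in a notebook makes it easy to leak. As a minimal sketch, the URL and key could be read from environment variables before building the client. The variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` and the helper `argilla_credentials` are assumptions chosen for illustration; they are not part of the original notebook.

```python
import os


def argilla_credentials(env=os.environ):
    """Return (api_url, api_key) read from environment variables.

    The variable names are assumptions for this sketch; adapt them to your setup.
    """
    names = ("ARGILLA_API_URL", "ARGILLA_API_KEY")
    values = [env.get(name) for name in names]
    # Collect every variable that is unset or empty so the error names them all.
    missing = [name for name, value in zip(names, values) if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values[0], values[1]


# Usage (requires the variables to be set and argilla to be installed):
# api_url, api_key = argilla_credentials()
# client = rg.Argilla(api_url=api_url, api_key=api_key)
```

Failing early with an explicit message is usually easier to debug than an authentication error raised later by the server.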



Download the first dataset from the Hugging Face Hub and select the columns we want to explore:

In [9]:
from datasets import load_dataset

data = load_dataset("plaguss/argilla_sdk_docs_raw_unstructured", split="train")

In [11]:
# Keep only the columns we are going to explore and convert them to a list of dicts
data = data.select_columns(["filename", "chunks"]).to_list()

## Dataset with raw chunks of documentation

Let's upload the raw chunks to Argilla to look at the data we generated using the `docs_dataset.py` script.

- Dataset in Hugging Face Hub: [plaguss/argilla_sdk_docs_raw_unstructured](https://huggingface.co/datasets/plaguss/argilla_sdk_docs_raw_unstructured)

Create the settings of the dataset and push it to Argilla to track it:

In [7]:
settings = rg.Settings(
    guidelines="Review the chunks of docs.",
    fields=[
        rg.TextField(
            name="filename",
            title="Filename where this chunk was extracted from",
            use_markdown=False,
        ),
        rg.TextField(
            name="chunk",
            title="Chunk from the documentation",
            use_markdown=False,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="good_chunk",
            title="Does this chunk contain relevant information?",
            labels=["yes", "no"],
        )
    ],
)

In [8]:
dataset = rg.Dataset(
    name="argilla_sdk_docs_raw_unstructured",
    settings=settings,
    client=client,
)
dataset.create()



Dataset(id=UUID('b5952697-daac-457b-aab6-7d2c0ff2cb6d') inserted_at=datetime.datetime(2024, 6, 24, 10, 29, 11, 467309) updated_at=datetime.datetime(2024, 6, 24, 10, 29, 14, 94152) name='argilla_sdk_docs_raw_unstructured' status='ready' guidelines='Review the chunks of docs.' allow_extra_metadata=False workspace_id=UUID('4fcd03e1-223d-4ad0-ac21-437193f75ea6') last_activity_at=datetime.datetime(2024, 6, 24, 10, 29, 14, 94152) url=None)

Add records to it:

In [14]:
dataset.records.log(records=data, mapping={"filename": "filename", "chunks": "chunk"})

Adding and updating records: 100%|██████████| 1/1 [00:01<00:00,  1.37s/batch]


DatasetRecords(Dataset(id=UUID('b5952697-daac-457b-aab6-7d2c0ff2cb6d') inserted_at=datetime.datetime(2024, 6, 24, 10, 29, 11, 467309) updated_at=datetime.datetime(2024, 6, 24, 10, 29, 14, 94152) name='argilla_sdk_docs_raw_unstructured' status='ready' guidelines='Review the chunks of docs.' allow_extra_metadata=False workspace_id=UUID('4fcd03e1-223d-4ad0-ac21-437193f75ea6') last_activity_at=datetime.datetime(2024, 6, 24, 10, 29, 14, 94152) url=None))
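The `mapping` argument is needed here because the source column is called `chunks` while the Argilla field is called `chunk`. The effect can be pictured as a key rename on each record. The sketch below is only an illustration in plain Python, not Argilla's implementation; `apply_mapping` and the sample record are hypothetical.

```python
def apply_mapping(records, mapping):
    """Rename the keys of each record according to `mapping`.

    Keys that do not appear in `mapping` are kept unchanged.
    """
    return [
        {mapping.get(key, key): value for key, value in record.items()}
        for record in records
    ]


# Hypothetical sample record with the same column names as the Hub dataset.
sample = [{"filename": "docs/index.md", "chunks": "Some text from the docs."}]
renamed = apply_mapping(sample, {"filename": "filename", "chunks": "chunk"})
print(renamed)
```

After the rename, every key matches a field name declared in `rg.Settings` (`filename` and `chunk`).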

## Dataset with generated queries

The following dataset contains the synthetic queries generated with distilabel. We will repeat the previous steps with the dataset used to fine-tune our embedding model.

- Dataset in Hugging Face Hub: [plaguss/argilla_sdk_docs_queries](https://huggingface.co/datasets/plaguss/argilla_sdk_docs_queries)

In [15]:
settings = rg.Settings(
    guidelines="Review the chunks of docs.",
    fields=[
        rg.TextField(
            name="anchor",
            title="Anchor (Chunk from the documentation).",
            use_markdown=False,
        ),
        rg.TextField(
            name="positive",
            title="Positive sentence that queries the anchor.",
            use_markdown=False,
        ),
        rg.TextField(
            name="negative",
            title="Negative sentence that may use similar words but has content unrelated to the anchor.",
            use_markdown=False,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="is_positive_relevant",
            title="Is the positive query relevant?",
            labels=["yes", "no"],
        ),
        rg.LabelQuestion(
            name="is_negative_irrelevant",
            title="Is the negative query irrelevant?",
            labels=["yes", "no"],
        )
    ],
)

In [16]:
dataset = rg.Dataset(
    name="argilla_sdk_docs_queries",
    settings=settings,
    client=client,
)
dataset.create()

Dataset(id=UUID('5e1c6c80-9e37-4b28-aed3-d098622e11db') inserted_at=datetime.datetime(2024, 6, 24, 10, 56, 10, 105584) updated_at=datetime.datetime(2024, 6, 24, 10, 56, 13, 94443) name='argilla_sdk_docs_queries' status='ready' guidelines='Review the chunks of docs.' allow_extra_metadata=False workspace_id=UUID('4fcd03e1-223d-4ad0-ac21-437193f75ea6') last_activity_at=datetime.datetime(2024, 6, 24, 10, 56, 13, 94443) url=None)

In [17]:
data = load_dataset("plaguss/argilla_sdk_docs_queries", split="train")

# Keep only the columns we are going to explore and convert them to a list of dicts
data = data.select_columns(["anchor", "positive", "negative"]).to_list()

Downloading readme: 100%|██████████| 4.10k/4.10k [00:00<00:00, 8.03MB/s]
Downloading data: 100%|██████████| 137k/137k [00:00<00:00, 247kB/s]
Generating train split: 100%|██████████| 980/980 [00:00<00:00, 47295.11 examples/s]


In [19]:
dataset.records.log(records=data)

Adding and updating records: 100%|██████████| 4/4 [00:02<00:00,  1.47batch/s]


DatasetRecords(Dataset(id=UUID('5e1c6c80-9e37-4b28-aed3-d098622e11db') inserted_at=datetime.datetime(2024, 6, 24, 10, 56, 10, 105584) updated_at=datetime.datetime(2024, 6, 24, 10, 56, 13, 94443) name='argilla_sdk_docs_queries' status='ready' guidelines='Review the chunks of docs.' allow_extra_metadata=False workspace_id=UUID('4fcd03e1-223d-4ad0-ac21-437193f75ea6') last_activity_at=datetime.datetime(2024, 6, 24, 10, 56, 13, 94443) url=None))
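No `mapping` was passed this time because the column names (`anchor`, `positive`, `negative`) already match the field names in the settings. Before logging without a mapping, it can help to confirm that every record carries all expected fields. The check below is a minimal sketch in plain Python; `missing_fields` and the sample records are hypothetical and not part of the Argilla SDK.

```python
def missing_fields(records, field_names):
    """Return {record_index: [missing field names]} for incomplete records."""
    problems = {}
    for index, record in enumerate(records):
        # Keep the declared field order so the report is stable.
        missing = [name for name in field_names if name not in record]
        if missing:
            problems[index] = missing
    return problems


# Hypothetical records: the second one lacks the "negative" column.
sample = [
    {"anchor": "a", "positive": "p", "negative": "n"},
    {"anchor": "a", "positive": "p"},
]
print(missing_fields(sample, ["anchor", "positive", "negative"]))
```

An empty dictionary means every record has a value for each field, so the records can be passed to `dataset.records.log` as they are.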