# Explore the generated datasets in Argilla

## Create an Argilla instance and upload the datasets

The following cells push the previously created datasets to Argilla so we can explore them there. All the steps are described in the
[argilla-quickstart](https://argilla-io.github.io/argilla/dev/getting_started/quickstart/) section of the documentation.

In [21]:
## Install argilla if you haven't yet
#!pip install argilla --pre

Instantiate the client pointing to the created space.

In [1]:
import argilla as rg

client = rg.Argilla(
    api_url="https://plaguss-argilla-sdk-chatbot.hf.space",
    api_key="YOUR_API_KEY"
)
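To avoid hardcoding the key in the notebook, the credentials could be read from the environment instead. This is only a minimal sketch: the helper `argilla_credentials` and the variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` are assumptions made for illustration, not something this notebook defines.

```python
import os


def argilla_credentials(env=os.environ):
    """Build the keyword arguments for rg.Argilla from environment variables.

    ARGILLA_API_URL and ARGILLA_API_KEY are assumed names for this sketch.
    """
    api_url = env.get("ARGILLA_API_URL", "https://plaguss-argilla-sdk-chatbot.hf.space")
    api_key = env.get("ARGILLA_API_KEY")
    if not api_key:
        # Fail early instead of sending requests with a placeholder key
        raise RuntimeError("Set ARGILLA_API_KEY before creating the client")
    return {"api_url": api_url, "api_key": api_key}


# Possible usage with the import from the cell above:
# client = rg.Argilla(**argilla_credentials())
```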

Download the first dataset from the Hugging Face Hub and select the columns we want to explore:

In [2]:
from datasets import load_dataset

data = load_dataset("plaguss/argilla_sdk_docs_raw_unstructured", split="train")

In [3]:
# Select only the columns we are going to explore and convert them to a list of dicts
data = data.select_columns(["filename", "chunks"]).to_list()
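After this cell, `data` is a plain list with one dict per row. The sketch below mimics that shape in pure Python; the rows are hypothetical stand-ins (the real dataset has its own contents and possibly other columns), and `select_columns` here is a local helper written for this illustration, not the `datasets` method.

```python
# Hypothetical rows standing in for the Hub dataset
rows = [
    {"filename": "docs/index.md", "chunks": "Welcome to the docs.", "other": 1},
    {"filename": "docs/guide.md", "chunks": "How to create a dataset.", "other": 2},
]


def select_columns(rows, columns):
    """Keep only the requested keys of every row."""
    return [{column: row[column] for column in columns} for row in rows]


selected = select_columns(rows, ["filename", "chunks"])
# Each element is now a dict with exactly the two keys we want to explore
print(selected[0])
```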

## Dataset with raw chunks of documentation

Let's upload the raw chunks to Argilla to inspect the data we generated with the `docs_dataset.py` script.

- Dataset in Hugging Face Hub: [plaguss/argilla_sdk_docs_raw_unstructured](https://huggingface.co/datasets/plaguss/argilla_sdk_docs_raw_unstructured)

Create the settings of the dataset and push it to Argilla to track it:

In [4]:
settings = rg.Settings(
    guidelines="Review the chunks of docs.",
    fields=[
        rg.TextField(
            name="filename",
            title="Filename where this chunk was extracted from",
            use_markdown=False,
        ),
        rg.TextField(
            name="chunk",
            title="Chunk from the documentation",
            use_markdown=False,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="good_chunk",
            title="Does this chunk contain relevant information?",
            labels=["yes", "no"],
        )
    ],
)

In [5]:
dataset = rg.Dataset(
    name="argilla_sdk_docs_raw_unstructured",
    settings=settings,
    client=client,
)
dataset.create()



Dataset(id=UUID('f869d3d1-8695-4819-ba56-c62bd0054c3d') inserted_at=datetime.datetime(2024, 6, 28, 7, 22, 12, 633904) updated_at=datetime.datetime(2024, 6, 28, 7, 22, 15, 275982) name='argilla_sdk_docs_raw_unstructured' status='ready' guidelines='Review the chunks of docs.' allow_extra_metadata=False workspace_id=UUID('91bc79aa-28e4-4ce7-a20f-af44afb0c7a1') last_activity_at=datetime.datetime(2024, 6, 28, 7, 22, 15, 275982) url=None)

Add records to it:

In [6]:
dataset.records.log(records=data, mapping={"filename": "filename", "chunks": "chunk"})

Adding and updating records: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.82s/batch]


DatasetRecords(Dataset(id=UUID('f869d3d1-8695-4819-ba56-c62bd0054c3d') inserted_at=datetime.datetime(2024, 6, 28, 7, 22, 12, 633904) updated_at=datetime.datetime(2024, 6, 28, 7, 22, 15, 275982) name='argilla_sdk_docs_raw_unstructured' status='ready' guidelines='Review the chunks of docs.' allow_extra_metadata=False workspace_id=UUID('91bc79aa-28e4-4ce7-a20f-af44afb0c7a1') last_activity_at=datetime.datetime(2024, 6, 28, 7, 22, 15, 275982) url=None))
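The `mapping` argument is needed here because the source column is called `chunks` while the field defined in the settings is called `chunk`. Conceptually, the effect is a key rename on every record. The helper below is only an illustration of that idea under this assumption, not how Argilla implements it internally.

```python
def apply_mapping(record, mapping):
    """Rename the keys of a record according to mapping (source key -> field name)."""
    # Keys that are absent from the mapping keep their original name
    return {mapping.get(key, key): value for key, value in record.items()}


record = {"filename": "docs/index.md", "chunks": "Welcome to the docs."}
mapped = apply_mapping(record, {"filename": "filename", "chunks": "chunk"})
print(mapped)
```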

## Dataset with generated queries

The following dataset contains the synthetic queries generated with distilabel. We will repeat the previous steps with the dataset used to fine-tune our embedding model.

- Dataset in Hugging Face Hub: [plaguss/argilla_sdk_docs_queries](https://huggingface.co/datasets/plaguss/argilla_sdk_docs_queries)

In [7]:
settings = rg.Settings(
    guidelines="Review the chunks of docs.",
    fields=[
        rg.TextField(
            name="anchor",
            title="Anchor (Chunk from the documentation).",
            use_markdown=False,
        ),
        rg.TextField(
            name="positive",
            title="Positive sentence that queries the anchor.",
            use_markdown=False,
        ),
        rg.TextField(
            name="negative",
            title="Negative sentence that may use similar words but has content unrelated to the anchor.",
            use_markdown=False,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="is_positive_relevant",
            title="Is the positive query relevant?",
            labels=["yes", "no"],
        ),
        rg.LabelQuestion(
            name="is_negative_irrelevant",
            title="Is the negative query irrelevant?",
            labels=["yes", "no"],
        )
    ],
)

In [8]:
dataset = rg.Dataset(
    name="argilla_sdk_docs_queries",
    settings=settings,
    client=client,
)
dataset.create()

Dataset(id=UUID('46c5e638-fb2b-4765-8a1e-901b09d8a0b5') inserted_at=datetime.datetime(2024, 6, 28, 7, 23, 7, 262351) updated_at=datetime.datetime(2024, 6, 28, 7, 23, 10, 950167) name='argilla_sdk_docs_queries' status='ready' guidelines='Review the chunks of docs.' allow_extra_metadata=False workspace_id=UUID('91bc79aa-28e4-4ce7-a20f-af44afb0c7a1') last_activity_at=datetime.datetime(2024, 6, 28, 7, 23, 10, 950167) url=None)

In [9]:
data = load_dataset("plaguss/argilla_sdk_docs_queries", split="train")

# Select only the columns we are going to explore and convert them to a list of dicts
data = data.select_columns(["anchor", "positive", "negative"]).to_list()

Downloading readme: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.10k/4.10k [00:00<00:00, 15.2MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 137k/137k [00:00<00:00, 253kB/s]
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 980/980 [00:00<00:00, 134984.66 examples/s]


In [10]:
dataset.records.log(records=data)

Adding and updating records: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.05batch/s]


DatasetRecords(Dataset(id=UUID('46c5e638-fb2b-4765-8a1e-901b09d8a0b5') inserted_at=datetime.datetime(2024, 6, 28, 7, 23, 7, 262351) updated_at=datetime.datetime(2024, 6, 28, 7, 23, 10, 950167) name='argilla_sdk_docs_queries' status='ready' guidelines='Review the chunks of docs.' allow_extra_metadata=False workspace_id=UUID('91bc79aa-28e4-4ce7-a20f-af44afb0c7a1') last_activity_at=datetime.datetime(2024, 6, 28, 7, 23, 10, 950167) url=None))
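No `mapping` is passed this time because the column names (`anchor`, `positive`, `negative`) already match the field names. Before logging, it can be useful to check that every record actually carries a value for each field. The helper `find_incomplete_records` below is a hypothetical sketch for that check and is not part of Argilla.

```python
def find_incomplete_records(records, field_names):
    """Return (index, missing field names) for every record lacking a value."""
    problems = []
    for index, record in enumerate(records):
        # A field counts as missing when the key is absent, None, or empty
        missing = [name for name in field_names if not record.get(name)]
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical sample rows; the real ones come from the Hub dataset
sample = [
    {"anchor": "chunk text", "positive": "a related query", "negative": "an unrelated query"},
    {"anchor": "chunk text", "positive": "", "negative": None},
]
print(find_incomplete_records(sample, ["anchor", "positive", "negative"]))
```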

## Dataset with chatbot interactions

This dataset will track the interactions with the chatbot, so we can review the responses and improve it.

In [49]:
settings_chatbot_interactions = rg.Settings(
    guidelines="Review the user interactions with the chatbot.",
    fields=[
        rg.TextField(
            name="instruction",
            title="User instruction",
            use_markdown=True,
        ),
        rg.TextField(
            name="response",
            title="Bot response",
            use_markdown=True,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="is_response_correct",
            title="Is the response correct?",
            labels=["yes", "no"],
        ),
        rg.LabelQuestion(
            name="out_of_guardrails",
            title="Did the model answer something out of the ordinary?",
            description="If the model answered something unrelated to Argilla SDK",
            labels=["yes", "no"],
        ),
        rg.TextQuestion(
            name="feedback",
            title="Leave any feedback here",
            description="This field should be used to report any feedback that can be useful",
            required=False
        ),
    ],
    metadata=[
        rg.TermsMetadataProperty(
            name="conv_id",
            title="Conversation ID",
        ),
        rg.IntegerMetadataProperty(
            name="turn",
            min=0,
            max=100,
            title="Conversation Turn",
        )
    ]
)

In [50]:
dataset_chatbot = rg.Dataset(
    name="chatbot_interactions",
    settings=settings_chatbot_interactions,
    client=client,
)
dataset_chatbot.create()

Dataset(id=UUID('102022cc-1197-4652-bdf8-77db56ecbe74') inserted_at=datetime.datetime(2024, 6, 28, 10, 44, 25, 739838) updated_at=datetime.datetime(2024, 6, 28, 10, 44, 31, 101443) name='chatbot_interactions' status='ready' guidelines='Review the user interactions with the chatbot.' allow_extra_metadata=False workspace_id=UUID('91bc79aa-28e4-4ce7-a20f-af44afb0c7a1') last_activity_at=datetime.datetime(2024, 6, 28, 10, 44, 31, 101443) url=None)
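Once the dataset exists, each answered turn of a conversation could become one record, with `conv_id` and `turn` filling the metadata properties defined above. The sketch below only builds the dicts. The helper `history_to_records` is hypothetical, and it assumes that flat dict keys are matched by name to fields and metadata properties when logging, which is worth verifying against the Argilla documentation.

```python
import uuid


def history_to_records(history, conv_id=None):
    """Turn a list of (user, assistant) pairs into one dict per answered turn."""
    conv_id = conv_id or str(uuid.uuid4())
    records = []
    for turn, (user, assistant) in enumerate(history):
        if assistant is None:
            continue  # skip turns the bot has not answered yet
        records.append(
            {
                "instruction": user,
                "response": assistant,
                "conv_id": conv_id,  # metadata: groups the turns of one conversation
                "turn": turn,  # metadata: position of the turn, starting at 0
            }
        )
    return records


records = history_to_records([("first query", "bot response"), ("second query", None)], conv_id="demo")
# Possible usage (assumption, see the note above):
# dataset_chatbot.records.log(records=records)
```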

Helper function to render the chat history as HTML:

In [45]:
def create_chat_html(history: list[tuple[str, str]]) -> str:
    """Render a conversation with the chatbot as HTML to display it in Argilla.

    Args:
        history: History of messages with the chatbot.

    Returns:
        HTML formatted conversation.
    """
    chat_html = ""
    alignments = ["right", "left"]
    colors = ["#c2e3f7", "#f5f5f5"]

    for turn in history:
        # Create the HTML message div with inline styles
        message_html = ""

        # Include messages that have not been answered yet
        (user, assistant) = turn
        if assistant is None:
            turn = (user, )

        for i, content in enumerate(turn):
            message_html += f'<div style="display: flex; justify-content: {alignments[i]}; margin: 10px;">'
            message_html += f'<div style="background-color: {colors[i]}; padding: 10px; border-radius: 10px; max-width: 70%; word-wrap: break-word;">{content}</div>'
            message_html += "</div>"

        # Add the message to the chat HTML
        chat_html += message_html

    return chat_html

html = create_chat_html([("user first query", "bot response"), ("second_query", "new response")])

from IPython.display import HTML, display
display(HTML(html))
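The helper interpolates each message into the markup as-is, so a message containing characters such as `<` or `&` would be interpreted as HTML. If that is not wanted, one option is to escape the content first. The variant below is a sketch under that assumption; `create_chat_html_escaped` is a name introduced here for illustration, and escaping may not be desirable if the messages are meant to carry their own markup.

```python
from html import escape

ALIGNMENTS = ["right", "left"]  # user on the right, bot on the left
COLORS = ["#c2e3f7", "#f5f5f5"]


def create_chat_html_escaped(history):
    """Same layout as create_chat_html, with message content HTML-escaped."""
    chat_html = ""
    for user, assistant in history:
        # Keep turns whose answer is still pending
        turn = (user,) if assistant is None else (user, assistant)
        for i, content in enumerate(turn):
            chat_html += (
                f'<div style="display: flex; justify-content: {ALIGNMENTS[i]}; margin: 10px;">'
                f'<div style="background-color: {COLORS[i]}; padding: 10px; border-radius: 10px; '
                f'max-width: 70%; word-wrap: break-word;">{escape(content)}</div>'
                "</div>"
            )
    return chat_html


escaped_html = create_chat_html_escaped([("How do I use <b>rg.Dataset</b>?", None)])
```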


