In [2]:
import argilla as rg
from argilla._constants import DEFAULT_API_KEY

In [3]:
# Argilla credentials
api_url = "http://localhost:6900" # "https://<YOUR-HF-SPACE>.hf.space"
api_key = DEFAULT_API_KEY # admin.apikey
# Huggingface credentials
hf_token = "hf_..."

In [4]:
rg.init(api_url=api_url, api_key=api_key)

# # If you want to use your private HF Space
# rg.init(extra_headers={"Authorization": f"Bearer {hf_token}"})



# Add Text Descriptives as Metadata

In this tutorial, we will add text descriptives as metadata to a FeedbackDataset using the `TextDescriptivesExtractor` integrated in Argilla.

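The steps are as follows:

- Run Argilla and set up the environment.
- Load the dataset.
- Initialize the `TextDescriptivesExtractor`.
- Add the text descriptives as metadata to a local or a remote FeedbackDataset.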


## Introduction

Text descriptives are methods for analyzing and describing features of a text. They range from simple metrics like word count to more complex ones such as sentiment analysis or topic modeling, and they convert unstructured text into structured data that is easier to understand. For annotation projects, they provide information not captured by annotators and, when added as metadata, they help with filtering records and creating dataset subsets.

To get the text descriptives, we will use the `TextDescriptivesExtractor` based on the [TextDescriptives](https://github.com/HLasse/TextDescriptives) library. Some of the basic metrics added by this extractor are:

* *n_tokens*: Number of tokens in the text.
* *n_unique_tokens*: Number of unique tokens in the text.
* *n_sentences*: Number of sentences in the text.
* *perplexity*: Measures the text complexity, vocabulary diversity and unpredictability. A lower score means the model finds the text more predictable, while a higher score means the model finds it less predictable.
* *entropy*: Indicates text randomness or uncertainty. Higher scores denote varied, unpredictable language use.
* *flesch_reading_ease*: A readability test designed to indicate how easy an English text is to understand, based on sentence length and syllable count per word. Higher scores mean the text is easier to read, while lower scores indicate complexity.
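
To make the simpler metrics concrete, here is a minimal sketch that approximates a few of them with the standard library only. It is not how TextDescriptives computes them: the regex tokenizer, the sentence splitter and the vowel-group syllable estimate are simplifying assumptions, and `simple_descriptives` is a hypothetical helper name. Only the Flesch reading ease formula itself is the standard one.

```python
import re


def simple_descriptives(text: str) -> dict:
    """Rough approximations of a few descriptive metrics (illustrative only)."""
    # Naive tokenization: runs of letters and apostrophes count as words
    words = re.findall(r"[A-Za-z']+", text)
    # Naive sentence split on '.', '!' and '?'
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive syllable estimate: number of vowel groups, at least one per word
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)

    n_tokens = len(words)
    n_sentences = len(sentences)
    flesch = None
    if n_tokens and n_sentences:
        # Standard Flesch reading ease formula
        flesch = 206.835 - 1.015 * (n_tokens / n_sentences) - 84.6 * (syllables / n_tokens)

    return {
        "n_tokens": n_tokens,
        "n_unique_tokens": len({w.lower() for w in words}),
        "n_sentences": n_sentences,
        "flesch_reading_ease": None if flesch is None else round(flesch, 2),
    }


print(simple_descriptives("This is a test."))
```

Short, simple sentences score high on reading ease because both the words-per-sentence and syllables-per-word terms stay small.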

## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:


**Deploy Argilla on Hugging Face Spaces**: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/login?next=%2Fnew-space%3Ftemplate%3Dargilla%2Fargilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).


**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.html). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip
    
This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter Notebook tool of your choice.
</div>

## Set up the Environment

To complete this tutorial, you will need to install the Argilla client and a few third-party libraries using `pip`:

In [None]:
# %pip install --upgrade pip
%pip install argilla -qqq
%pip install datasets
%pip install textdescriptives

Let's make the needed imports:

In [1]:
import argilla as rg

from datasets import load_dataset

If you are running Argilla using the Docker quickstart image or a public Hugging Face Space, you need to initialize the Argilla client with the `URL` and `API_KEY`:

In [None]:
# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
# Replace workspace with the name of your workspace
rg.init(
    api_url="http://localhost:6900", 
    api_key="owner.apikey",
    workspace="admin"
)

If you're running a private Hugging Face Space, you will also need to set the [HF_TOKEN](https://huggingface.co/settings/tokens) as follows:

In [None]:
# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with the url to your HF Spaces URL
# # Replace api_key if you configured a custom API key
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name].hf.space", 
#     api_key="admin.apikey",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )
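
The `extra_headers` argument above is a plain dictionary carrying a bearer token. As a small sketch, the helper below builds that dictionary from an explicit token or from the `HF_TOKEN` environment variable. `build_extra_headers` is a hypothetical name introduced here for illustration, and the token in the example is a placeholder.

```python
import os
from typing import Dict, Optional


def build_extra_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Return the headers for a private Space, or an empty dict if no token is available."""
    # Fall back to the HF_TOKEN environment variable when no token is passed
    token = token or os.environ.get("HF_TOKEN")
    if not token:
        return {}
    return {"Authorization": f"Bearer {token}"}


# Placeholder token for illustration only, not a real credential
print(build_extra_headers("hf_example_token"))
```

The resulting dictionary can be passed as `extra_headers` to `rg.init`, as in the commented cell above.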

### Enable Telemetry

We gain valuable insights from how you interact with our tutorials. Running the following lines of code helps us understand whether this tutorial is serving you effectively, so that we can keep offering you the most suitable content. This is entirely anonymous, and you can skip this step if you prefer. For more info, please check out the [Telemetry](../../reference/telemetry.md) page.

In [None]:
try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

## Load the Dataset

We're going to use the `oasst_response_quality` dataset available on Hugging Face. This dataset includes two fields: the prompt and the response from an open assistant. It also features a LabelQuestion, a MultiLabelQuestion, a RatingQuestion and a TextQuestion, and provides certain guidelines. However, there's no associated metadata in this dataset.

In [38]:
dataset = rg.FeedbackDataset.from_huggingface("argilla/oasst_response_quality", split="train[:100]")

Parsing records: 100%|██████████| 100/100 [00:00<00:00, 151.62it/s]


In [39]:
dataset

FeedbackDataset(
   fields=[TextField(name='prompt', title='Prompt', required=True, type=<FieldTypes.text: 'text'>, use_markdown=True), TextField(name='response', title='Response', required=True, type=<FieldTypes.text: 'text'>, use_markdown=True)]
   questions=[LabelQuestion(name='relevant', title='Is the response relevant for the given prompt?', description=None, required=True, type=<QuestionTypes.label_selection: 'label_selection'>, labels=['Yes', 'No'], visible_labels=None), MultiLabelQuestion(name='content_class', title='Does the response include any of the following?', description=None, required=False, type=<QuestionTypes.multi_label_selection: 'multi_label_selection'>, labels={'hate': 'Hate Speech', 'inappropriate': 'Inappropriate content', 'not_english': 'Not English', 'pii': 'Personal information', 'sexual': 'Sexual content', 'untruthful': 'Untruthful info', 'violent': 'Violent content'}, visible_labels=7), RatingQuestion(name='rating', title='Rate the quality of the response:'

In [40]:
try:
    remote_dataset = dataset.push_to_argilla(name="oasst_response_quality", workspace="argilla")
except Exception:
    # Most likely the dataset already exists: delete it and push a fresh copy
    rg.FeedbackDataset.from_argilla("oasst_response_quality", workspace="argilla").delete()
    remote_dataset = dataset.push_to_argilla(name="oasst_response_quality", workspace="argilla")

Output()

In [41]:
remote_dataset

RemoteFeedbackDataset(
   id=338783d5-8fbc-4c30-86fc-afc77ff92ec0
   name=oasst_response_quality
   workspace=Workspace(id=507a6ccf-f7e0-40e9-9384-5c8840abb505, name=argilla, inserted_at=2023-12-12 14:04:58.940990, updated_at=2023-12-12 14:04:58.940990)
   url=http://localhost:6900/dataset/338783d5-8fbc-4c30-86fc-afc77ff92ec0/annotation-mode
   fields=[RemoteTextField(id=UUID('9f0a29e8-689f-483f-ad2c-35054d01fa0a'), client=None, name='prompt', title='Prompt', required=True, type='text', use_markdown=True), RemoteTextField(id=UUID('8b0dea30-9949-4261-baa3-776f8b6e2473'), client=None, name='response', title='Response', required=True, type='text', use_markdown=True)]
   questions=[RemoteLabelQuestion(id=UUID('86c01a19-a5cf-4769-9f01-027cff3c3e28'), client=None, name='relevant', title='Is the response relevant for the given prompt?', description=None, required=True, type='label_selection', labels=['Yes', 'No'], visible_labels=None), RemoteMultiLabelQuestion(id=UUID('77bae5ef-8d77-4e1e-8d88

## Add Text Descriptives as Metadata

Our dataset currently lacks metadata. To address this, we will add the text descriptives as metadata using the `TextDescriptivesExtractor`, which has the following arguments:

* *model*: the language of the model.
* *metrics*: the metrics to be extracted.
* *fields*: the field names to extract metrics from.
* *visible_for_annotators*: whether the metadata is visible for annotators.
* *show_progress*: whether to show the progress bar.

So, first, we will use the default English model and metrics to get the text descriptives of the `prompt` field.

In [22]:
# Initialize the TextDescriptivesExtractor
tde = rg.TextDescriptivesExtractor(
    model="en",
    metrics=None,  # None selects the default basic metrics
    fields=["prompt"],  # must be a list of field names, not a bare string
    visible_for_annotators=True,
    show_progress=True,
)
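
The `fields` argument is typed as a list of field names in the extractor's signature (see the Code section), and the extractor loops over it. Passing a bare string would therefore be iterated character by character. The sketch below shows the pitfall and a small guard against it. `normalize_fields` is a hypothetical helper, not part of Argilla.

```python
from typing import List, Optional, Union


def normalize_fields(fields: Optional[Union[str, List[str]]]) -> Optional[List[str]]:
    """Wrap a single field name in a list; leave lists and None untouched."""
    if fields is None:
        return None
    if isinstance(fields, str):
        return [fields]
    return list(fields)


# Iterating a bare string yields characters, not field names
print([name for name in "prompt"])

# After normalization, each item is a full field name
print(normalize_fields("prompt"))
print(normalize_fields(["prompt", "response"]))
```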

### To a Local FeedbackDataset

In [15]:
# Retrieve your FeedbackRecords
records = [record for record in dataset]

In [None]:
# Extract the text descriptives for the selected fields of the records
updated_records = tde.update_records(records)

In [11]:
updated_records[5]

FeedbackRecord(fields={'prompt': 'I am using docker compose and i need to mount the docker socket - how would i do that?', 'response': "You can mount the Docker socket in a Docker Compose service by adding the following to your docker-compose.yml file:\n\njavascript\n\nversion: '3'\nservices:\n  your_service_name:\n    # ... other service configuration\n    volumes:\n      - /var/run/docker.sock:/var/run/docker.sock\n\nThis will mount the host's Docker socket at /var/run/docker.sock inside the service's container. This allows the service to interact with the host's Docker daemon and perform tasks such as starting and stopping containers.\n\nIt is important to note that mounting the Docker socket inside a container can potentially expose your host to security risks, so it should only be done in trusted environments or with proper security measures in place."}, metadata={'prompt_n_tokens': 18, 'prompt_n_unique_tokens': 15, 'prompt_n_sentences': 1, 'prompt_perplexity': 1.7, 'prompt_entrop
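
The way the metrics land in each record's metadata can be sketched with plain dictionaries: every metric is copied into the metadata unless its value is NaN, which mirrors the filtering done in `_add_text_descriptives_to_metadata` (see the Code section). The helper name `merge_metrics` and the sample values are made up for illustration.

```python
import math
from typing import Any, Dict


def merge_metrics(metadata: Dict[str, Any], metrics: Dict[str, Any]) -> Dict[str, Any]:
    """Update metadata in place with every metric whose value is not NaN."""
    for key, value in metrics.items():
        if isinstance(value, float) and math.isnan(value):
            continue  # skip metrics that could not be computed
        metadata[key] = value
    return metadata


record_metadata = {"source": "example"}  # made-up pre-existing metadata
metrics = {
    "prompt_n_tokens": 18,
    "prompt_n_sentences": 1,
    "prompt_first_order_coherence": float("nan"),
}
print(merge_metrics(record_metadata, metrics))
```

Because the update happens in place, existing metadata keys are kept and only overwritten when a metric has the same name.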

In [16]:
dataset.add_records(updated_records)

In [None]:
dataset.push_to_argilla("prueba1")

### To a Remote FeedbackDataset

In [None]:
# Update the dataset
updated_remote_dataset = tde.update_dataset(remote_dataset)

In [24]:
updated_remote_dataset
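
When `update_dataset` is used, the extractor also registers one metadata property per extracted column, choosing the property type from the column's data type (see `_create_metadata_properties` in the Code section). The sketch below reproduces that decision with plain Python values instead of pandas dtypes. `infer_property_kind`, the kind labels and the sample columns are assumptions made for illustration.

```python
from typing import Any, Dict, List


def infer_property_kind(values: List[Any]) -> str:
    """Map a column of Python values to 'terms', 'integer', 'float' or 'unhandled'."""
    # bool is a subclass of int, so booleans are checked before integers
    if all(isinstance(v, (bool, str)) for v in values):
        return "terms"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "integer"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "float"
    return "unhandled"


# Made-up sample columns
columns: Dict[str, List[Any]] = {
    "prompt_n_tokens": [18, 7, 42],
    "prompt_flesch_reading_ease": [118.18, 64.2, 80.0],
    "prompt_passed_quality_check": ["True", "False", "True"],
}
for name, values in columns.items():
    print(name, "->", infer_property_kind(values))
```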

# Code

In [74]:
#  coding=utf-8
#  Copyright 2021-present, the Recognai S.L. team.
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
import logging
import re
from typing import Dict, List, Optional, Tuple, Union

import pandas as pd
import textdescriptives as td
from rich.progress import Progress

from argilla.client.feedback.dataset.local.dataset import FeedbackDataset
from argilla.client.feedback.dataset.remote.dataset import RemoteFeedbackDataset
from argilla.client.feedback.schemas.metadata import (
    FloatMetadataProperty,
    IntegerMetadataProperty,
    TermsMetadataProperty,
)
from argilla.client.feedback.schemas.records import FeedbackRecord
from argilla.client.feedback.schemas.remote.records import RemoteFeedbackRecord

_LOGGER = logging.getLogger(__name__)
_LOGGER.setLevel(logging.INFO)


class TextDescriptivesExtractor:
    """This class extracts a number of basic text descriptives from FeedbackDataset
    records using the TextDescriptives library and adds them as record metadata."""

    def __init__(
        self,
        model: str = "en",
        metrics: Optional[List[str]] = None,
        fields: Optional[List[str]] = None,
        visible_for_annotators: bool = True,
        show_progress: bool = True,
    ):
        """
        Initialize a new TextDescriptivesExtractor object.

        Args:
            model (str): The language model to use for text descriptives.
            metrics (Optional[List[str]]): A list of metrics to extract. If None, all metrics will be extracted.
            fields (Optional[List[str]]): A list of field names to extract metrics from. If None, all fields will be used.
            visible_for_annotators (bool): Whether the extracted metrics should be visible to annotators.
            show_progress (bool): Whether to show a progress bar when extracting metrics.
        """
        self.model = model
        self.metrics = metrics
        print("self.metrics", self.metrics)
        self.fields = fields
        print("self.fields", self.fields)
        self.visible_for_annotators = visible_for_annotators
        self.show_progress = show_progress
        self.__basic_metrics = [
            "n_tokens",
            "n_unique_tokens",
            "n_sentences",
            "perplexity",
            "entropy",
            "flesch_reading_ease",
        ]

    def _extract_metrics_for_single_field(
        self,
        records: List[Union[FeedbackRecord, RemoteFeedbackRecord]],
        field: str,
        basic_metrics: Optional[List[str]] = None,
    ) -> Optional[pd.DataFrame]:
        """
        Extract text descriptives metrics for a single field from a list of feedback records
        using the TextDescriptives library.

        Args:
            records (List[Union[FeedbackRecord, RemoteFeedbackRecord]]): A list of FeedbackDataset or RemoteFeedbackDataset records.
            field (str): The name of the field to extract metrics for.
            basic_metrics (Optional[List[str]]): A list of basic metrics to extract. If None, all metrics will be extracted.

        Returns:
            Optional[pd.DataFrame]: A dataframe containing the text descriptives metrics for the field, or None if the field is empty.
        """
        # If the field is empty, skip it
        field_text = [record.fields[field] for record in records if record.fields[field]]
        if not field_text:
            return None
        # If language is english, the default spacy model is used (to avoid warning message)
        if self.model == "en":
            print("forsinglefield:self.metrics", self.metrics)
            field_metrics = td.extract_metrics(text=field_text, spacy_model="en_core_web_sm", metrics=self.metrics)
            print("forsinglefield:field_metrics", field_metrics)
        else:
            field_metrics = td.extract_metrics(text=field_text, lang=self.model, metrics=self.metrics)
        # Drop text column
        field_metrics = field_metrics.drop("text", axis=1)
        # Select all column names that contain ONLY NaNs
        nan_columns = field_metrics.columns[field_metrics.isnull().all()].tolist()
        print("forsinglefield:nan_columns", nan_columns)
        if nan_columns:
            _LOGGER.warning(f"The following columns contain only NaN values: {nan_columns}")
        # If basic metrics is None, use all basic metrics
        if basic_metrics is None and self.metrics is None:
            print("forsinglefield:basic_metrics", basic_metrics)
            print("forsinglefield:self.metrics", self.metrics)
            basic_metrics = self.__basic_metrics
            field_metrics = field_metrics.loc[:, basic_metrics]
        # Concatenate field name with the metric name
        field_metrics.columns = [f"{field}_{metric}" for metric in field_metrics.columns]
        return field_metrics

    def _extract_metrics_for_all_fields(
        self, records: List[Union[FeedbackRecord, RemoteFeedbackRecord]], fields: Optional[List[str]] = None
    ) -> pd.DataFrame:
        """
        Extract text descriptives metrics for all named fields from a list of feedback records
        using the TextDescriptives library.
        Args:
            records (List[Union[FeedbackRecord, RemoteFeedbackRecord]]): A list of FeedbackDataset or RemoteFeedbackDataset records.
            fields (List[str]): A list of fields to extract metrics for. If None, extract metrics for all fields.
        Returns:
            pd.DataFrame: A dataframe containing the text descriptives metrics for each record and field.
        """
        # Use the fields given at initialization; otherwise fall back to every field found in the records
        print("forallfields:before:fields", fields)
        if self.fields:
            fields = self.fields
        else:
            fields = list({key for record in records for key in record.fields.keys()})
        print("forallfields: after:fields", fields)
        # Extract all metrics for each field
        field_metrics = {
            field: self._extract_metrics_for_single_field(records=records, field=field) for field in fields
        }
        print("forallfields:field_metrics.items")
        for field, metrics in field_metrics.items():
            print(f"Field: {field}, Metrics: {metrics}")
        field_metrics = {field: metrics for field, metrics in field_metrics.items() if metrics is not None}
        # If there is only one field, return the metrics for that field directly
        print(len(field_metrics))
        if len(field_metrics) == 1:
            return list(field_metrics.values())[0]
        else:
            # If there are multiple fields, combine metrics for each field into a single dataframe
            final_metrics = pd.concat(field_metrics, axis=1, keys=field_metrics.keys())
            final_metrics.columns = final_metrics.columns.droplevel(0)
        return final_metrics

    def _cast_to_python_types(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Convert integer, boolean and float columns in a dataframe
        to Python native types.

        Args:
            df (pd.DataFrame): The text descriptives dataframe.

        Returns:
            pd.DataFrame: The text descriptives dataframe with integer and boolean columns cast to Python native types.
        """
        # Select columns by data type
        int_cols = df.select_dtypes(include=["int64"]).columns
        bool_cols = df.select_dtypes(include=["boolean"]).columns
        float_cols = df.select_dtypes(include=["float64"]).columns
        # Cast integer columns to Python's native int type
        df[int_cols] = df[int_cols].astype(int)
        # Cast boolean columns to Python's native str type
        df[bool_cols] = df[bool_cols].astype(str)
        # Cast float columns to Python's native float type and round to 2 decimal places
        df[float_cols] = df[float_cols].astype(float).round(2)
        return df

    def _clean_column_name(self, col_name: str) -> str:
        """
        Clean the column name of a dataframe to fit a specific regex pattern.
        Args:
            col_name (str): A column name.
        Returns:
            str: A column name that fits the regex pattern.
        """
        col_name = col_name.lower()  # Convert to lowercase
        col_name = re.sub(r"[^a-z0-9_]", "_", col_name)  # Replace non-alphanumeric characters with underscores
        return col_name

    def _create_metadata_properties(self, df: pd.DataFrame) -> List:
        """
        Generate metadata properties based on dataframe columns and data types.

        Args:
            df (pd.DataFrame): The text descriptives dataframe.

        Returns:
            List: A list of metadata properties.
        """
        properties = []
        for col, dtype in df.dtypes.items():
            name = col
            title = name.replace("_", " ").title()
            print("dtype", dtype)
            # Strings and booleans (cast to str beforehand) become terms properties
            if pd.api.types.is_object_dtype(dtype) or pd.api.types.is_bool_dtype(dtype):
                prop = TermsMetadataProperty(
                    name=name,
                    title=title,
                    visible_for_annotators=self.visible_for_annotators,
                    values=df[col].unique().tolist(),
                )
            # Accept any integer width: `astype(int)` can yield int32 or int64
            # depending on the platform and NumPy version
            elif pd.api.types.is_integer_dtype(dtype):
                prop = IntegerMetadataProperty(
                    name=name, title=title, visible_for_annotators=self.visible_for_annotators
                )
            elif pd.api.types.is_float_dtype(dtype):
                prop = FloatMetadataProperty(name=name, title=title, visible_for_annotators=self.visible_for_annotators)
            else:
                _LOGGER.warning(f"Unhandled data type for column {col}: {dtype}")
                prop = None
            if prop is not None:
                properties.append(prop)
            print(properties)
        return properties

    def _add_text_descriptives_to_metadata(
        self, records: List[Union[FeedbackRecord, RemoteFeedbackRecord]], df: pd.DataFrame
    ) -> List[Union[FeedbackRecord, RemoteFeedbackRecord]]:
        """
        Add the text descriptives metrics extracted previously as metadata
        to a list of FeedbackDataset records.

        Args:
            records (List[Union[FeedbackRecord, RemoteFeedbackRecord]]): A list of FeedbackDataset or RemoteFeedbackDataset records.
            df (pd.DataFrame): The text descriptives dataframe.

        Returns:
            List[Union[FeedbackRecord, RemoteFeedbackRecord]]: A list of FeedbackDataset or RemoteFeedbackDataset records with extracted metrics added as metadata.
        """
        modified_records = []
        with Progress() as progress_bar:
            task = progress_bar.add_task(
                "Adding text descriptives to metadata...", total=len(records), visible=self.show_progress
            )
            for record, metrics in zip(records, df.to_dict("records")):
                filtered_metrics = {key: value for key, value in metrics.items() if not pd.isna(value)}
                record.metadata.update(filtered_metrics)
                modified_records.append(record)
                progress_bar.update(task, advance=1)
        return modified_records

    def update_records(
        self, records: List[Union[FeedbackRecord, RemoteFeedbackRecord]]
    ) -> List[Union[FeedbackRecord, RemoteFeedbackRecord]]:
        """
        Extract text descriptives metrics from a list of FeedbackDataset or RemoteFeedbackDataset records,
        add them as metadata to the records and return the updated records.

        Args:
            records (List[Union[FeedbackRecord, RemoteFeedbackRecord]]): A list of FeedbackDataset or RemoteFeedbackDataset records.

        Returns:
            List[Union[FeedbackRecord, RemoteFeedbackRecord]]: A list of FeedbackDataset or RemoteFeedbackDataset records with text descriptives metrics added as metadata.

        >>> import argilla as rg
        >>> records = [rg.FeedbackRecord(fields={"text": "This is a test."})]
        >>> tde = rg.TextDescriptivesExtractor()
        >>> updated_records = tde.update_records(records)
        """
        # Extract text descriptives metrics from records
        extracted_metrics = self._extract_metrics_for_all_fields(records)
        print("extracted_metrics:type", type(extracted_metrics))
        # If the dataframe doesn't contain any columns, return the original records and log a warning
        if extracted_metrics.shape[1] == 0:
            _LOGGER.warning(
                "No text descriptives metrics were extracted. This could be because the metrics contained NaNs."
            )
            return records
        else:
            # Cast integer and boolean columns to Python native types
            extracted_metrics = self._cast_to_python_types(extracted_metrics)
            # Clean column names
            extracted_metrics.columns = [self._clean_column_name(col) for col in extracted_metrics.columns]
            # Add the metrics to the metadata of the records
            modified_records = self._add_text_descriptives_to_metadata(records, extracted_metrics)
            return modified_records

    def update_dataset(
        self, dataset: Union[FeedbackDataset, RemoteFeedbackDataset]
    ) -> Union[FeedbackDataset, RemoteFeedbackDataset]:
        """
        Extract text descriptives metrics from records in a FeedbackDataset
        or RemoteFeedbackDataset, add them as metadata to the records and
        return the updated dataset.

        Args:
            dataset (Union[FeedbackDataset, RemoteFeedbackDataset]): A FeedbackDataset or RemoteFeedbackDataset.

        Returns:
            Union[FeedbackDataset, RemoteFeedbackDataset]: A FeedbackDataset or RemoteFeedbackDataset with text descriptives metrics added as metadata.

        >>> import argilla as rg
        >>> rg.init(...)
        >>> dataset = rg.FeedbackDataset.from_argilla(name="my-dataset")
        >>> tde = rg.TextDescriptivesExtractor()
        >>> updated_dataset = tde.update_dataset(dataset)

        """
        if isinstance(dataset, (FeedbackDataset, RemoteFeedbackDataset)):
            records = dataset.records
        else:
            raise ValueError(
                f"Provided object is of `type={type(dataset)}` while only `type=FeedbackDataset` or `type=RemoteFeedbackDataset` are allowed."
            )
        # Extract text descriptives metrics from records
        extracted_metrics = self._extract_metrics_for_all_fields(records)
        # Cast integer and boolean columns to Python native types
        extracted_metrics = self._cast_to_python_types(extracted_metrics)
        # Clean column names
        extracted_metrics.columns = [self._clean_column_name(col) for col in extracted_metrics.columns]
        # Create metadata properties based on dataframe columns and data types
        metadata_properties = self._create_metadata_properties(extracted_metrics)
        # Add each metadata property iteratively to the dataset
        for prop in metadata_properties:
            dataset.add_metadata_property(prop)
        # Add the metrics to the metadata
        if isinstance(dataset, FeedbackDataset):
            with Progress() as progress_bar:
                task = progress_bar.add_task(
                    "Adding text descriptives to metadata...", total=len(records), visible=self.show_progress
                )
                for record, metrics in zip(records, extracted_metrics.to_dict("records")):
                    filtered_metrics = {key: value for key, value in metrics.items() if not pd.isna(value)}
                    record.metadata.update(filtered_metrics)
                    progress_bar.update(task, advance=1)
        elif isinstance(dataset, RemoteFeedbackDataset):
            modified_records = self._add_text_descriptives_to_metadata(records, extracted_metrics)
            # Push the updated records without rebinding `dataset`, so the dataset object is what gets returned
            dataset.update_records(modified_records)
        return dataset
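
One detail worth keeping in mind when reading the class above: `_extract_metrics_for_single_field` drops empty field values before extraction, while the resulting rows are later paired with the records by position. If an empty value sits in the middle of the list, rows and records could drift apart. The sketch below shows one possible way to keep the pairing stable by emitting a placeholder row for empty texts. It is an illustration under that assumption, not the library's behavior, and `metrics_aligned` and `count_tokens` are hypothetical helpers.

```python
from typing import Callable, Dict, List, Optional


def metrics_aligned(
    texts: List[Optional[str]], extract: Callable[[str], Dict[str, float]]
) -> List[Dict[str, float]]:
    """Return one metrics dict per input text; empty texts get an empty dict."""
    return [extract(text) if text else {} for text in texts]


def count_tokens(text: str) -> Dict[str, float]:
    # Stand-in extractor for illustration; the class above calls TextDescriptives instead
    return {"n_tokens": len(text.split())}


texts = ["This is a test.", "", "You went there"]
rows = metrics_aligned(texts, count_tokens)
for text, row in zip(texts, rows):
    print(repr(text), row)
```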


# Tests

In [5]:
ds = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="text"),
        rg.TextField(name="text2"),
        rg.TextField(name="text3"),
    ],
    questions=[
        rg.RatingQuestion(
            name="answer_quality",
            description="How would you rate the quality of the answer?",
            values=[1, 2, 3, 4, 5],
        ),
    ]
)
records = [
    rg.FeedbackRecord(fields={"text": "This is a test.", "text2": "This is a test. You wanna sing,sing if is your passion.", "text3": "This is a test? Are you sure? I dont think so."}),
    rg.FeedbackRecord(fields={"text": "This i", "text2": "your house.", "text3": "try this"}),
    rg.FeedbackRecord(fields={"text": "You went there", "text2": "This thing is tooo shrt, i should write it longer so that the metrics works", "text3": "obviously this doesn't work, why? I dont. know. Who KNows?"}),
]
ds.add_records(records)

In [7]:
try:
    remote_ds = ds.push_to_argilla(name="basic_one", workspace="argilla")
except Exception:
    # Most likely the dataset already exists: delete it and push a fresh copy
    rg.FeedbackDataset.from_argilla("basic_one", workspace="argilla").delete()
    remote_ds = ds.push_to_argilla(name="basic_one", workspace="argilla")
records = remote_ds.records

Output()

## Records (remote/local) with metrics and fields: OK

In [6]:
tde = rg.TextDescriptivesExtractor(metrics=["coherence"], fields=["text", "text2"])
updated_records = tde.update_records(records)

self.metrics ['coherence']
self.fields ['text', 'text2']
forallfields:before:fields None
forallfields: after:fields ['text', 'text2']
forsinglefield:self.metrics ['coherence']
forsinglefield:field_metrics               text  first_order_coherence  second_order_coherence
0  This is a test.                    NaN                     NaN
1           This i                    NaN                     NaN
2   You went there                    NaN                     NaN
forsinglefield:nan_columns ['first_order_coherence', 'second_order_coherence']


forsinglefield:self.metrics ['coherence']
forsinglefield:field_metrics                                                 text  first_order_coherence  \
0  This is a test. You wanna sing,sing if is your...               0.306215   
1                                        your house.                    NaN   
2  This thing is tooo shrt, i should write it lon...                    NaN   

   second_order_coherence  
0                     NaN  
1                     NaN  
2                     NaN  
forsinglefield:nan_columns ['second_order_coherence']


  similarities.append(sent.similarity(sents[i + order]))


Output()

forallfields:field_metrics.items
Field: text, Metrics:    text_first_order_coherence  text_second_order_coherence
0                         NaN                          NaN
1                         NaN                          NaN
2                         NaN                          NaN
Field: text2, Metrics:    text2_first_order_coherence  text2_second_order_coherence
0                     0.306215                           NaN
1                          NaN                           NaN
2                          NaN                           NaN
2
extracted_metrics:type <class 'pandas.core.frame.DataFrame'>


In [32]:
updated_records

[RemoteFeedbackRecord(id=UUID('563e0ff5-2aee-433c-b248-2483278dcc59'), client=<httpx.Client object at 0x0000021D00217880>, fields={'text': 'This is a test.', 'text2': 'This is a test. You wanna sing,sing if is your passion.', 'text3': 'This is a test? Are you sure? I dont think so.'}, metadata={'text_flesch_reading_ease': 118.18, 'text_flesch_kincaid_grade': -2.23, 'text_gunning_fog': 1.6, 'text_automated_readability_index': -6.48, 'text_coleman_liau_index': -7.03, 'text_lix': 4.0, 'text_rix': 0.0, 'text_token_length_mean': 2.75, 'text_token_length_median': 3.0, 'text_token_length_std': 1.3, 'text_sentence_length_mean': 4.0, 'text_sentence_length_median': 4.0, 'text_sentence_length_std': 0.0, 'text_syllables_per_token_mean': 1.0, 'text_syllables_per_token_median': 1.0, 'text_syllables_per_token_std': 0.0, 'text_n_tokens': 4, 'text_n_unique_tokens': 4, 'text_proportion_unique_tokens': 1.0, 'text_n_characters': 12, 'text_n_sentences': 1, 'text2_flesch_reading_ease': 109.1, 'text2_flesch_

In [138]:
updated_records[0].metadata

{'text2_first_order_coherence': 0.31}
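
The NaN pattern in the outputs above follows from how coherence is defined: first-order coherence compares each sentence with the next one, and second-order coherence compares each sentence with the one two positions later. A text therefore needs at least two sentences for the first and at least three for the second. The sketch below reproduces that pattern with a toy word-overlap similarity. The naive sentence splitter and the Jaccard similarity are stand-ins chosen for illustration, so the numeric values will differ from the model-based similarities shown above.

```python
import math
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive split on '.', '!' and '?'; a real pipeline would use a sentence segmenter
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def jaccard(a: str, b: str) -> float:
    # Toy similarity: word overlap between two sentences
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def n_order_coherence(text: str, order: int) -> float:
    """Mean similarity between each sentence and the one `order` positions later."""
    sents = split_sentences(text)
    sims = [jaccard(s, sents[i + order]) for i, s in enumerate(sents[:-order])]
    # No sentence pairs at this distance: the metric is undefined
    return sum(sims) / len(sims) if sims else math.nan


for text in [
    "This is a test.",
    "This is a test. You wanna sing.",
    "This is a test? Are you sure? I dont think so.",
]:
    print(text, "|", n_order_coherence(text, 1), "|", n_order_coherence(text, 2))
```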

## Records (remote/local) with metrics: OK

In [9]:
tde = rg.TextDescriptivesExtractor(metrics=["coherence"])
updated_records = tde.update_records(records)

self.metrics ['coherence']
self.fields None
forallfields:before:fields None
forallfields: after:fields ['text2', 'text', 'text3']
forsinglefield:self.metrics ['coherence']
forsinglefield:field_metrics                                                 text  first_order_coherence  \
0  This is a test. You wanna sing,sing if is your...               0.306215   
1                                        your house.                    NaN   
2  This thing is tooo shrt, i should write it lon...                    NaN   

   second_order_coherence  
0                     NaN  
1                     NaN  
2                     NaN  
forsinglefield:nan_columns ['second_order_coherence']


  similarities.append(sent.similarity(sents[i + order]))


forsinglefield:self.metrics ['coherence']
forsinglefield:field_metrics               text  first_order_coherence  second_order_coherence
0  This is a test.                    NaN                     NaN
1           This i                    NaN                     NaN
2   You went there                    NaN                     NaN
forsinglefield:nan_columns ['first_order_coherence', 'second_order_coherence']


forsinglefield:self.metrics ['coherence']


Output()

  similarities.append(sent.similarity(sents[i + order]))


forsinglefield:field_metrics                                                 text  first_order_coherence  \
0     This is a test? Are you sure? I dont think so.                0.40504   
1                                           try this                    NaN   
2  obviously this doesn't work, why? I dont. know...                0.47226   

   second_order_coherence  
0                0.159334  
1                     NaN  
2                0.387664  
forsinglefield:nan_columns []
forallfields:field_metrics.items
Field: text2, Metrics:    text2_first_order_coherence  text2_second_order_coherence
0                     0.306215                           NaN
1                          NaN                           NaN
2                          NaN                           NaN
Field: text, Metrics:    text_first_order_coherence  text_second_order_coherence
0                         NaN                          NaN
1                         NaN                          NaN
2            

In [44]:
updated_records

[RemoteFeedbackRecord(id=UUID('48d2733b-2a98-44be-8f7e-ee46119f0bf4'), client=<httpx.Client object at 0x0000021D00217880>, fields={'text': 'This is a test.', 'text2': 'This is a test. You wanna sing,sing if is your passion.', 'text3': 'This is a test? Are you sure? I dont think so.'}, metadata={'text3_entropy': 0.55, 'text3_perplexity': 1.74, 'text3_per_word_perplexity': 0.12, 'text_entropy': 0.28, 'text_perplexity': 1.32, 'text_per_word_perplexity': 0.26, 'text2_entropy': 0.63, 'text2_perplexity': 1.89, 'text2_per_word_perplexity': 0.13}, vectors={}, responses=[], suggestions=(), external_id=None),
 RemoteFeedbackRecord(id=UUID('705c9423-a657-41ec-bc36-e937323ead81'), client=<httpx.Client object at 0x0000021D00217880>, fields={'text': 'This i', 'text2': 'your house.', 'text3': 'try this'}, metadata={'text3_entropy': 0.03, 'text3_perplexity': 1.03, 'text3_per_word_perplexity': 0.51, 'text_entropy': 0.02, 'text_perplexity': 1.02, 'text_per_word_perplexity': 0.51, 'text2_entropy': 0.16, 

In [128]:
updated_records[2].metadata

{'text3_first_order_coherence': 0.41,
 'text3_second_order_coherence': 0.16,
 'text2_first_order_coherence': 0.31}
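
The NaN pattern in these runs is consistent with the usual definition of coherence of order k: the mean similarity between each sentence and the sentence k positions later. A single-sentence text has no such pair for any order, and a two-sentence text has none for order 2. The sketch below illustrates this with toy vectors; the `coherence` function and the vectors are assumptions for illustration, not the TextDescriptives implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (NaN for a zero vector)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return math.nan
    return sum(a * b for a, b in zip(u, v)) / norm


def coherence(sentence_vectors, order=1):
    """Mean similarity between sentence i and sentence i + order."""
    pairs = [
        (sentence_vectors[i], sentence_vectors[i + order])
        for i in range(len(sentence_vectors) - order)
    ]
    if not pairs:
        return math.nan  # too few sentences for this order
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


one_sentence = [[1.0, 0.0]]
two_sentences = [[1.0, 0.0], [2.0, 0.0]]

print(coherence(one_sentence))            # no pair at all
print(coherence(two_sentences, order=1))  # one pair
print(coherence(two_sentences, order=2))  # no pair two positions apart
```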

## records (remote/local) with two fields OK

In [29]:
tde = rg.TextDescriptivesExtractor(fields=["text", "text2"])
updated_records = tde.update_records(records)

self.metrics None
field_metrics               text   entropy  perplexity  per_word_perplexity  \
0  This is a test.  0.280120    1.323289             0.264658   
1           This i  0.016005    1.016133             0.508067   
2   You went there  0.027561    1.027945             0.342648   

   token_length_mean  token_length_median  token_length_std  \
0               2.75                  3.0          1.299038   
1               2.50                  2.5          1.500000   
2               4.00                  4.0          0.816497   

   sentence_length_mean  sentence_length_median  sentence_length_std  ...  \
0                   4.0                     4.0                  0.0  ...   
1                   2.0                     2.0                  0.0  ...   
2                   3.0                     3.0                  0.0  ...   

   pos_prop_PROPN  pos_prop_PUNCT  pos_prop_SCONJ  pos_prop_SYM  \
0             0.0             0.2             0.0           0.0   
1          

basic_metrics None
self.metrics None
self.metrics None
field_metrics                                                 text   entropy  perplexity  \
0  This is a test. You wanna sing,sing if is your...  0.634320    1.885739   
1                                        your house.  0.164152    1.178393   
2  This thing is tooo shrt, i should write it lon...  0.421429    1.524138   

   per_word_perplexity  token_length_mean  token_length_median  \
0             0.125716                3.5                  4.0   
1             0.392798                4.5                  4.5   
2             0.095259                4.0                  4.0   

   token_length_std  sentence_length_mean  sentence_length_median  \
0          1.554563                   6.0                     6.0   
1          0.500000                   2.0                     2.0   
2          1.673320                  15.0                    15.0   

   sentence_length_std  ...  pos_prop_PROPN  pos_prop_PUNCT  pos_prop_SCONJ 

  similarities.append(sent.similarity(sents[i + order]))


Output()

basic_metrics None
self.metrics None


In [30]:
updated_records

[RemoteFeedbackRecord(id=UUID('6bc3c089-c347-4d78-969c-2f5b1af0cc07'), client=<httpx.Client object at 0x000001DB9E22C160>, fields={'text': 'This is a test.', 'text2': 'This is a test. You wanna sing,sing if is your passion.', 'text3': 'This is a test? Are you sure? I dont think so.'}, metadata={'text_n_tokens': 4, 'text_n_unique_tokens': 4, 'text_n_sentences': 1, 'text_perplexity': 1.32, 'text_entropy': 0.28, 'text_flesch_reading_ease': 118.18, 'text2_n_tokens': 12, 'text2_n_unique_tokens': 10, 'text2_n_sentences': 2, 'text2_perplexity': 1.89, 'text2_entropy': 0.63, 'text2_flesch_reading_ease': 109.1}, vectors={}, responses=[], suggestions=(), external_id=None),
 RemoteFeedbackRecord(id=UUID('f0b3a8a4-2414-4eb1-b8a6-3bebae8edaf1'), client=<httpx.Client object at 0x000001DB9E22C160>, fields={'text': 'This i', 'text2': 'your house.', 'text3': 'try this'}, metadata={'text_n_tokens': 2, 'text_n_unique_tokens': 2, 'text_n_sentences': 1, 'text_perplexity': 1.02, 'text_entropy': 0.02, 'text_f

## records (remote/local) with 1 field OK

In [76]:
tde = rg.TextDescriptivesExtractor(fields=["text"])
updated_records = tde.update_records(records)

self.metrics None
self.fields ['text']
forallfields:before:fields None
forallfields: after:fields ['text']
forsinglefield:self.metrics None
forsinglefield:field_metrics               text  first_order_coherence  second_order_coherence  \
0  This is a test.                    NaN                     NaN   
1           This i                    NaN                     NaN   
2   You went there                    NaN                     NaN   

   flesch_reading_ease  flesch_kincaid_grade  smog  gunning_fog  \
0              118.175                 -2.23   NaN          1.6   
1              120.205                 -3.01   NaN          0.8   
2              119.190                 -2.62   NaN          1.2   

   automated_readability_index  coleman_liau_index  lix  ...  \
0                      -6.4775           -7.030000  4.0  ...   
1                      -8.6550          -15.900000  2.0  ...   
2                      -1.0900           -2.146667  3.0  ...   

   sentence_length_median  s

Output()

forsinglefield:basic_metrics None
forsinglefield:self.metrics None
forallfields:field_metrics.items
Field: text, Metrics:    text_n_tokens  text_n_unique_tokens  text_n_sentences  text_perplexity  \
0              4                     4                 1         1.323289   
1              2                     2                 1         1.016133   
2              3                     3                 1         1.027945   

   text_entropy  text_flesch_reading_ease  
0      0.280120                   118.175  
1      0.016005                   120.205  
2      0.027561                   119.190  
1
extracted_metrics:type <class 'pandas.core.frame.DataFrame'>


In [52]:
updated_records

[FeedbackRecord(fields={'text': 'This is a test.', 'text2': 'This is a test. You wanna sing,sing if is your passion.', 'text3': 'This is a test? Are you sure? I dont think so.'}, metadata={'text_n_tokens': 4, 'text_n_unique_tokens': 4, 'text_n_sentences': 1, 'text_perplexity': 1.32, 'text_entropy': 0.28, 'text_flesch_reading_ease': 118.18}, vectors={}, responses=[], suggestions=(), external_id=None),
 FeedbackRecord(fields={'text': 'This i', 'text2': 'your house.', 'text3': 'try this'}, metadata={'text_n_tokens': 2, 'text_n_unique_tokens': 2, 'text_n_sentences': 1, 'text_perplexity': 1.02, 'text_entropy': 0.02, 'text_flesch_reading_ease': 120.21}, vectors={}, responses=[], suggestions=(), external_id=None),
 FeedbackRecord(fields={'text': 'You went there', 'text2': 'This thing is tooo shrt, i should write it longer so that the metrics works', 'text3': "obviously this doesn't work, why? I dont. know. Who KNows?"}, metadata={'text_n_tokens': 3, 'text_n_unique_tokens': 3, 'text_n_sentence

In [53]:
updated_records[0].metadata

{'text_n_tokens': 4,
 'text_n_unique_tokens': 4,
 'text_n_sentences': 1,
 'text_perplexity': 1.32,
 'text_entropy': 0.28,
 'text_flesch_reading_ease': 118.18}

## local dataset with 1 field OK

In [77]:
tde = rg.TextDescriptivesExtractor(fields=["text"])
updated_ds = tde.update_dataset(ds)

self.metrics None
self.fields ['text']
forallfields:before:fields None
forallfields: after:fields ['text']
forsinglefield:self.metrics None
forsinglefield:field_metrics               text  first_order_coherence  second_order_coherence  \
0  This is a test.                    NaN                     NaN   
1           This i                    NaN                     NaN   
2   You went there                    NaN                     NaN   

   flesch_reading_ease  flesch_kincaid_grade  smog  gunning_fog  \
0              118.175                 -2.23   NaN          1.6   
1              120.205                 -3.01   NaN          0.8   
2              119.190                 -2.62   NaN          1.2   

   automated_readability_index  coleman_liau_index  lix  ...  \
0                      -6.4775           -7.030000  4.0  ...   
1                      -8.6550          -15.900000  2.0  ...   
2                      -1.0900           -2.146667  3.0  ...   

   sentence_length_median  s

Output()

forsinglefield:basic_metrics None
forsinglefield:self.metrics None
forallfields:field_metrics.items
Field: text, Metrics:    text_n_tokens  text_n_unique_tokens  text_n_sentences  text_perplexity  \
0              4                     4                 1         1.323289   
1              2                     2                 1         1.016133   
2              3                     3                 1         1.027945   

   text_entropy  text_flesch_reading_ease  
0      0.280120                   118.175  
1      0.016005                   120.205  
2      0.027561                   119.190  
1
dtype int32
[IntegerMetadataProperty(name='text_n_tokens', title='Text N Tokens', visible_for_annotators=True, type='integer', min=None, max=None)]
dtype int32
[IntegerMetadataProperty(name='text_n_tokens', title='Text N Tokens', visible_for_annotators=True, type='integer', min=None, max=None), IntegerMetadataProperty(name='text_n_unique_tokens', title='Text N Unique Tokens', visible_for_
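
The debug output shows integer-typed columns becoming `IntegerMetadataProperty` and float-typed ones becoming `FloatMetadataProperty`, with titles derived from the column names. The following sketch reproduces that mapping with plain Python types; `infer_property`, the returned dict layout and the `"terms"` fallback are hypothetical and only mirror what the output displays.

```python
def infer_property(name, values):
    """Describe a metadata property for a metric column (illustrative only)."""

    def is_int(v):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(v, int) and not isinstance(v, bool)

    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(is_int(v) for v in values):
        kind = "integer"
    elif all(is_number(v) for v in values):
        kind = "float"
    else:
        kind = "terms"  # assumed fallback for non-numeric values
    return {"name": name, "title": name.replace("_", " ").title(), "type": kind}


print(infer_property("text_n_tokens", [4, 2, 3]))
print(infer_property("text_perplexity", [1.323289, 1.016133, 1.027945]))
```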

In [78]:
updated_ds

FeedbackDataset(
   fields=[TextField(name='text', title='Text', required=True, type='text', use_markdown=False), TextField(name='text2', title='Text2', required=True, type='text', use_markdown=False), TextField(name='text3', title='Text3', required=True, type='text', use_markdown=False)]
   questions=[RatingQuestion(name='answer_quality', title='Answer_quality', description='How would you rate the quality of the answer?', required=True, type='rating', values=[1, 2, 3, 4, 5])]
   guidelines=None)
   metadata_properties=[IntegerMetadataProperty(name='text_n_tokens', title='Text N Tokens', visible_for_annotators=True, type='integer', min=None, max=None), IntegerMetadataProperty(name='text_n_unique_tokens', title='Text N Unique Tokens', visible_for_annotators=True, type='integer', min=None, max=None), IntegerMetadataProperty(name='text_n_sentences', title='Text N Sentences', visible_for_annotators=True, type='integer', min=None, max=None), FloatMetadataProperty(name='text_perplexity', tit

In [79]:
updated_ds.records[0]

FeedbackRecord(fields={'text': 'This is a test.', 'text2': 'This is a test. You wanna sing,sing if is your passion.', 'text3': 'This is a test? Are you sure? I dont think so.'}, metadata={'text_n_tokens': 4, 'text_n_unique_tokens': 4, 'text_n_sentences': 1, 'text_perplexity': 1.32, 'text_entropy': 0.28, 'text_flesch_reading_ease': 118.18}, vectors={}, responses=[], suggestions=(), external_id=None)

In [80]:
updated_ds.records[0].metadata

{'text_n_tokens': 4,
 'text_n_unique_tokens': 4,
 'text_n_sentences': 1,
 'text_perplexity': 1.32,
 'text_entropy': 0.28,
 'text_flesch_reading_ease': 118.18}

## remote dataset with 1 field OK

In [83]:
tde = rg.TextDescriptivesExtractor(fields=["text"])
updated_remote_ds = tde.update_dataset(remote_ds)

self.metrics None
self.fields ['text']
forallfields:before:fields None
forallfields: after:fields ['text']
forsinglefield:self.metrics None
forsinglefield:field_metrics               text  first_order_coherence  second_order_coherence  \
0  This is a test.                    NaN                     NaN   
1           This i                    NaN                     NaN   
2   You went there                    NaN                     NaN   

   flesch_reading_ease  flesch_kincaid_grade  smog  gunning_fog  \
0              118.175                 -2.23   NaN          1.6   
1              120.205                 -3.01   NaN          0.8   
2              119.190                 -2.62   NaN          1.2   

   automated_readability_index  coleman_liau_index  lix  ...  \
0                      -6.4775           -7.030000  4.0  ...   
1                      -8.6550          -15.900000  2.0  ...   
2                      -1.0900           -2.146667  3.0  ...   

   sentence_length_median  s

forsinglefield:basic_metrics None
forsinglefield:self.metrics None
forallfields:field_metrics.items
Field: text, Metrics:    text_n_tokens  text_n_unique_tokens  text_n_sentences  text_perplexity  \
0              4                     4                 1         1.323289   
1              2                     2                 1         1.016133   
2              3                     3                 1         1.027945   

   text_entropy  text_flesch_reading_ease  
0      0.280120                   118.175  
1      0.016005                   120.205  
2      0.027561                   119.190  
1
dtype int32
[IntegerMetadataProperty(name='text_n_tokens', title='Text N Tokens', visible_for_annotators=True, type='integer', min=None, max=None)]
dtype int32
[IntegerMetadataProperty(name='text_n_tokens', title='Text N Tokens', visible_for_annotators=True, type='integer', min=None, max=None), IntegerMetadataProperty(name='text_n_unique_tokens', title='Text N Unique Tokens', visible_for_

Output()

Output()

In [90]:
updated_remote_ds = remote_ds

In [91]:
updated_remote_ds

RemoteFeedbackDataset(
   id=c5b0b89a-f558-40b8-88d4-137d5194b425
   name=basic_one
   workspace=Workspace(id=507a6ccf-f7e0-40e9-9384-5c8840abb505, name=argilla, inserted_at=2023-12-12 14:04:58.940990, updated_at=2023-12-12 14:04:58.940990)
   url=http://localhost:6900/dataset/c5b0b89a-f558-40b8-88d4-137d5194b425/annotation-mode
   fields=[RemoteTextField(id=UUID('ac9268ac-7440-4195-bd94-8c7ebe06b389'), client=None, name='text', title='Text', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('dadeed79-c1d6-4cda-a422-d0ea9b64302a'), client=None, name='text2', title='Text2', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('d2f7ae46-1d99-4544-ba9a-d01a2eba3802'), client=None, name='text3', title='Text3', required=True, type='text', use_markdown=False)]
   questions=[RemoteRatingQuestion(id=UUID('24cab23e-77b5-4b8d-b0c5-b648f44fc6e8'), client=None, name='answer_quality', title='Answer_quality', description=None, required=True, type='rating', 

In [84]:
remote_ds.metadata_properties

[RemoteIntegerMetadataProperty(id=UUID('e7de7827-c136-410b-8a43-3ce71199ca4b'), client=<httpx.Client object at 0x0000021D00217880>, name='text_n_tokens', title='Text N Tokens', visible_for_annotators=True, type='integer', min=None, max=None),
 RemoteIntegerMetadataProperty(id=UUID('1b8ca655-42d6-428c-b0b5-faebc4b8f00f'), client=<httpx.Client object at 0x0000021D00217880>, name='text_n_unique_tokens', title='Text N Unique Tokens', visible_for_annotators=True, type='integer', min=None, max=None),
 RemoteIntegerMetadataProperty(id=UUID('0ca2ff3f-20ec-4341-be08-5096f054e79b'), client=<httpx.Client object at 0x0000021D00217880>, name='text_n_sentences', title='Text N Sentences', visible_for_annotators=True, type='integer', min=None, max=None),
 RemoteFloatMetadataProperty(id=UUID('66f4ff65-cdf5-48e6-b72c-5f309b337871'), client=<httpx.Client object at 0x0000021D00217880>, name='text_perplexity', title='Text Perplexity', visible_for_annotators=True, type='float', min=None, max=None),
 RemoteF

In [86]:
remote_ds.records[0].metadata

{'text_n_tokens': 4,
 'text_n_unique_tokens': 4,
 'text_n_sentences': 1,
 'text_perplexity': 1.32,
 'text_entropy': 0.28,
 'text_flesch_reading_ease': 118.18}

## local dataset fields and metrics OK

In [102]:
tde = rg.TextDescriptivesExtractor(fields=["text", "text2"], metrics=["information_theory"])
tde.update_dataset(ds)

self.metrics ['information_theory']
self.fields ['text', 'text2']
forallfields:before:fields None
forallfields: after:fields ['text', 'text2']
forsinglefield:self.metrics ['information_theory']
forsinglefield:field_metrics               text   entropy  perplexity  per_word_perplexity
0  This is a test.  0.280120    1.323289             0.264658
1           This i  0.016005    1.016133             0.508067
2   You went there  0.027561    1.027945             0.342648
forsinglefield:nan_columns []
forsinglefield:self.metrics ['information_theory']


Output()

forsinglefield:field_metrics                                                 text   entropy  perplexity  \
0  This is a test. You wanna sing,sing if is your...  0.634320    1.885739   
1                                        your house.  0.164152    1.178393   
2  This thing is tooo shrt, i should write it lon...  0.421429    1.524138   

   per_word_perplexity  
0             0.125716  
1             0.392798  
2             0.095259  
forsinglefield:nan_columns []
forallfields:field_metrics.items
Field: text, Metrics:    text_entropy  text_perplexity  text_per_word_perplexity
0      0.280120         1.323289                  0.264658
1      0.016005         1.016133                  0.508067
2      0.027561         1.027945                  0.342648
Field: text2, Metrics:    text2_entropy  text2_perplexity  text2_per_word_perplexity
0       0.634320          1.885739                   0.125716
1       0.164152          1.178393                   0.392798
2       0.421429          1.

FeedbackDataset(
   fields=[TextField(name='text', title='Text', required=True, type='text', use_markdown=False), TextField(name='text2', title='Text2', required=True, type='text', use_markdown=False), TextField(name='text3', title='Text3', required=True, type='text', use_markdown=False)]
   questions=[RatingQuestion(name='answer_quality', title='Answer_quality', description='How would you rate the quality of the answer?', required=True, type='rating', values=[1, 2, 3, 4, 5])]
   guidelines=None)
   metadata_properties=[FloatMetadataProperty(name='text_entropy', title='Text Entropy', visible_for_annotators=True, type='float', min=None, max=None), FloatMetadataProperty(name='text_perplexity', title='Text Perplexity', visible_for_annotators=True, type='float', min=None, max=None), FloatMetadataProperty(name='text_per_word_perplexity', title='Text Per Word Perplexity', visible_for_annotators=True, type='float', min=None, max=None), FloatMetadataProperty(name='text2_entropy', title='Text2 

In [103]:
ds

FeedbackDataset(
   fields=[TextField(name='text', title='Text', required=True, type='text', use_markdown=False), TextField(name='text2', title='Text2', required=True, type='text', use_markdown=False), TextField(name='text3', title='Text3', required=True, type='text', use_markdown=False)]
   questions=[RatingQuestion(name='answer_quality', title='Answer_quality', description='How would you rate the quality of the answer?', required=True, type='rating', values=[1, 2, 3, 4, 5])]
   guidelines=None)
   metadata_properties=[FloatMetadataProperty(name='text_entropy', title='Text Entropy', visible_for_annotators=True, type='float', min=None, max=None), FloatMetadataProperty(name='text_perplexity', title='Text Perplexity', visible_for_annotators=True, type='float', min=None, max=None), FloatMetadataProperty(name='text_per_word_perplexity', title='Text Per Word Perplexity', visible_for_annotators=True, type='float', min=None, max=None), FloatMetadataProperty(name='text2_entropy', title='Text2 

In [104]:
ds.records[0]

FeedbackRecord(fields={'text': 'This is a test.', 'text2': 'This is a test. You wanna sing,sing if is your passion.', 'text3': 'This is a test? Are you sure? I dont think so.'}, metadata={'text_entropy': 0.28, 'text_perplexity': 1.32, 'text_per_word_perplexity': 0.26, 'text2_entropy': 0.63, 'text2_perplexity': 1.89, 'text2_per_word_perplexity': 0.13}, vectors={}, responses=[], suggestions=(), external_id=None)
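
Record 0 above carries metrics for both requested fields, each key prefixed with its field name. Below is a small sketch of that merge step, using the values printed in the debug tables; `merge_field_metrics` is a hypothetical name, and the two-decimal rounding is inferred from the output rather than taken from the extractor's source.

```python
def merge_field_metrics(per_field, ndigits=2):
    """Flatten {field: {metric: value}} into {"<field>_<metric>": rounded value}."""
    merged = {}
    for field, metrics in per_field.items():
        for metric, value in metrics.items():
            merged[f"{field}_{metric}"] = round(value, ndigits)
    return merged


# Row 0 of the "text" and "text2" information_theory tables
record_0 = {
    "text": {"entropy": 0.280120, "perplexity": 1.323289, "per_word_perplexity": 0.264658},
    "text2": {"entropy": 0.634320, "perplexity": 1.885739, "per_word_perplexity": 0.125716},
}
print(merge_field_metrics(record_0))
```

Prefixing by field name keeps metrics from different fields from colliding in the flat metadata dict.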

## remote dataset fields and metrics OK

In [98]:
tde = rg.TextDescriptivesExtractor(fields=["text", "text2"], metrics=["information_theory"])
updated_remote_ds = tde.update_dataset(remote_ds)

self.metrics ['information_theory']
self.fields ['text', 'text2']
forallfields:before:fields None
forallfields: after:fields ['text', 'text2']
forsinglefield:self.metrics ['information_theory']
forsinglefield:field_metrics               text   entropy  perplexity  per_word_perplexity
0  This is a test.  0.280120    1.323289             0.264658
1           This i  0.016005    1.016133             0.508067
2   You went there  0.027561    1.027945             0.342648
forsinglefield:nan_columns []
forsinglefield:self.metrics ['information_theory']
forsinglefield:field_metrics                                                 text   entropy  perplexity  \
0  This is a test. You wanna sing,sing if is your...  0.634320    1.885739   
1                                        your house.  0.164152    1.178393   
2  This thing is tooo shrt, i should write it lon...  0.421429    1.524138   

   per_word_perplexity  
0             0.125716  
1             0.392798  
2             0.095259  
forsin

Output()

Output()

In [99]:
remote_ds

RemoteFeedbackDataset(
   id=6f750694-0841-4dd5-9b55-050789ec5acf
   name=basic_one
   workspace=Workspace(id=507a6ccf-f7e0-40e9-9384-5c8840abb505, name=argilla, inserted_at=2023-12-12 14:04:58.940990, updated_at=2023-12-12 14:04:58.940990)
   url=http://localhost:6900/dataset/6f750694-0841-4dd5-9b55-050789ec5acf/annotation-mode
   fields=[RemoteTextField(id=UUID('012b6e6d-1991-411f-a969-ef8e5694201d'), client=None, name='text', title='Text', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('3270e5db-d3c9-4c43-8d37-1dfe0643ad61'), client=None, name='text2', title='Text2', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('cdee2612-b143-4ec0-a0f6-4f02ded3ee19'), client=None, name='text3', title='Text3', required=True, type='text', use_markdown=False)]
   questions=[RemoteRatingQuestion(id=UUID('c8d7aa50-bb21-45a1-86b1-5ff4e5da6b1b'), client=None, name='answer_quality', title='Answer_quality', description=None, required=True, type='rating', 

## check NaN values

In [18]:
dataset = rg.FeedbackDataset.from_huggingface("argilla/oasst_response_quality", split="train[:100]")

Parsing records: 100%|██████████| 100/100 [00:00<00:00, 222.31it/s]


In [19]:
try:
    remote_dataset = dataset.push_to_argilla(name="oasst_response_quality", workspace="argilla")
except Exception:
    # The push failed, most likely because a dataset with this name already exists:
    # delete the existing one and push again.
    rg.FeedbackDataset.from_argilla("oasst_response_quality", workspace="argilla").delete()
    remote_dataset = dataset.push_to_argilla(name="oasst_response_quality", workspace="argilla")

Output()

In [20]:
tde = rg.TextDescriptivesExtractor(metrics=["coherence"])
tde.update_dataset(remote_dataset)

self.metrics ['coherence']
self.fields None
forallfields:before:fields None
forallfields: after:fields ['response', 'prompt']
forsinglefield:self.metrics ['coherence']


  similarities.append(sent.similarity(sents[i + order]))


forsinglefield:field_metrics                                                  text  first_order_coherence  \
0   Sure! Let's say you want to build a model whic...               0.406619   
1   Getting started in astrophotography can seem d...               0.552922   
2   Sure! Here's an example Python script that use...               0.421581   
3   Learning to optimize your webpage for search e...               0.260190   
4   If you enjoyed Dvorak's "New World" Symphony, ...               0.419909   
..                                                ...                    ...   
95  Some examples for a species for a super-powere...               0.197079   
96  While you do need an oven to bake the bread, w...               0.410734   
97  The chain rule is a fundamental rule in calcul...               0.551900   
98  Here is a brief outline of the history of Turk...               0.333601   
99  There is no concrete evidence to support the c...               0.546017   

    second

  similarities.append(sent.similarity(sents[i + order]))


forsinglefield:field_metrics                                                  text  first_order_coherence  \
0   Can you explain contrastive learning in machin...                    NaN   
1   I want to start doing astrophotography as a ho...                    NaN   
2   Can you give me an example of a python script ...                    NaN   
3   How can I learn to optimize my webpage for sea...                    NaN   
4   Listened to Dvorak's "The New World" symphony,...               0.351385   
..                                                ...                    ...   
95  I want to create a super-powered character wit...               0.431740   
96  Can you give me an easy recipe for homemade br...               0.515825   
97     Can you explain to me the calculus chain rule?                    NaN   
98      Generate an outline of the history of Turkey.                    NaN   
99  is it true that thomas edison stole the light ...               0.139828   

    second

Output()

Output()

In [23]:
remote_dataset.metadata_properties

[RemoteFloatMetadataProperty(id=UUID('4d4af5f2-f035-489a-b9bb-724151f62e9a'), client=<httpx.Client object at 0x000002C73AFD08E0>, name='response_first_order_coherence', title='Response First Order Coherence', visible_for_annotators=True, type='float', min=None, max=None),
 RemoteFloatMetadataProperty(id=UUID('20d1856f-9e03-4ee0-a15e-9bb070dfe6b9'), client=<httpx.Client object at 0x000002C73AFD08E0>, name='response_second_order_coherence', title='Response Second Order Coherence', visible_for_annotators=True, type='float', min=None, max=None),
 RemoteFloatMetadataProperty(id=UUID('fb91ef41-24af-4c81-8a37-f19fa2b9f90e'), client=<httpx.Client object at 0x000002C73AFD08E0>, name='prompt_first_order_coherence', title='Prompt First Order Coherence', visible_for_annotators=True, type='float', min=None, max=None),
 RemoteFloatMetadataProperty(id=UUID('f53790fe-54ae-4126-84ed-b43f028e051a'), client=<httpx.Client object at 0x000002C73AFD08E0>, name='prompt_second_order_coherence', title='Prompt S

In [25]:
remote_dataset.records[14].metadata

{'response_first_order_coherence': 0.42,
 'response_second_order_coherence': 0.26,
 'prompt_first_order_coherence': 0.4,
 'prompt_second_order_coherence': 0.4}

In [26]:
remote_dataset.records[4].metadata

{'response_first_order_coherence': 0.42,
 'response_second_order_coherence': 0.32}

## check `visible_for_annotators` (local/remote)

In [27]:
dataset = rg.FeedbackDataset.from_huggingface("argilla/oasst_response_quality", split="train[:100]")

Parsing records: 100%|██████████| 100/100 [00:00<00:00, 137.03it/s]


In [29]:
tde = rg.TextDescriptivesExtractor(fields=["prompt"], metrics=["information_theory"], visible_for_annotators=False)
tde.update_dataset(dataset)

self.metrics ['information_theory']
self.fields ['prompt']
forallfields:before:fields None
forallfields: after:fields ['prompt']
forsinglefield:self.metrics ['information_theory']


Output()

forsinglefield:field_metrics                                                  text   entropy  perplexity  \
0   Can you explain contrastive learning in machin...  0.476143    1.609854   
1   I want to start doing astrophotography as a ho...  0.490838    1.633685   
2   Can you give me an example of a python script ...  0.478009    1.612860   
3   How can I learn to optimize my webpage for sea...  0.281210    1.324732   
4   Listened to Dvorak's "The New World" symphony,...  1.687012    5.403310   
..                                                ...       ...         ...   
95  I want to create a super-powered character wit...  3.249339   25.773301   
96  Can you give me an easy recipe for homemade br...  0.931683    2.538778   
97     Can you explain to me the calculus chain rule?  0.292472    1.339736   
98      Generate an outline of the history of Turkey.  0.381030    1.463792   
99  is it true that thomas edison stole the light ...  0.489481    1.631470   

    per_word_perplexit

FeedbackDataset(
   fields=[TextField(name='prompt', title='Prompt', required=True, type=<FieldTypes.text: 'text'>, use_markdown=True), TextField(name='response', title='Response', required=True, type=<FieldTypes.text: 'text'>, use_markdown=True)]
   questions=[LabelQuestion(name='relevant', title='Is the response relevant for the given prompt?', description=None, required=True, type=<QuestionTypes.label_selection: 'label_selection'>, labels=['Yes', 'No'], visible_labels=None), MultiLabelQuestion(name='content_class', title='Does the response include any of the following?', description=None, required=False, type=<QuestionTypes.multi_label_selection: 'multi_label_selection'>, labels={'hate': 'Hate Speech', 'inappropriate': 'Inappropriate content', 'not_english': 'Not English', 'pii': 'Personal information', 'sexual': 'Sexual content', 'untruthful': 'Untruthful info', 'violent': 'Violent content'}, visible_labels=7), RatingQuestion(name='rating', title='Rate the quality of the response:'

In [30]:
dataset.push_to_argilla(name="oasst_response_quality_no_metadata", workspace="argilla")

Output()

RemoteFeedbackDataset(
   id=51f8b953-02fb-440c-8a66-19d3d3c935a0
   name=oasst_response_quality_no_metadata
   workspace=Workspace(id=507a6ccf-f7e0-40e9-9384-5c8840abb505, name=argilla, inserted_at=2023-12-12 14:04:58.940990, updated_at=2023-12-12 14:04:58.940990)
   url=http://localhost:6900/dataset/51f8b953-02fb-440c-8a66-19d3d3c935a0/annotation-mode
   fields=[RemoteTextField(id=UUID('236b9fe2-3886-438e-b549-925bc85626e8'), client=None, name='prompt', title='Prompt', required=True, type='text', use_markdown=True), RemoteTextField(id=UUID('d0bbf8ea-4a4a-46d8-8571-562d85e3020b'), client=None, name='response', title='Response', required=True, type='text', use_markdown=True)]
   questions=[RemoteLabelQuestion(id=UUID('a887386d-1840-4dd3-aa0e-644cab087f2a'), client=None, name='relevant', title='Is the response relevant for the given prompt?', description=None, required=True, type='label_selection', labels=['Yes', 'No'], visible_labels=None), RemoteMultiLabelQuestion(id=UUID('dbd1989a-58

## Check other languages

In [109]:
tde = TextDescriptivesExtractor(model="es")
tde.update_dataset(ds)

self.metrics None
self.fields None
forallfields:before:fields None
forallfields: after:fields ['text3', 'text', 'text2']
ℹ No spacy model provided. Inferring spacy model for es.
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_lg')
forsinglefield:nan_columns []
forsinglefield:basic_metrics None
forsinglefield:self.metrics None
ℹ No spacy model provided. Inferring spacy model for es.
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_lg')
forsinglefield:nan_columns ['second_order_coherence', 'smog']


forsinglefield:basic_metrics None
forsinglefield:self.metrics None
ℹ No spacy model provided. Inferring spacy model for es.
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_lg')
forsinglefield:nan_columns ['second_order_coherence', 'smog']


Output()

forsinglefield:basic_metrics None
forsinglefield:self.metrics None
forallfields:field_metrics.items
Field: text3, Metrics:    text3_n_tokens  text3_n_unique_tokens  text3_n_sentences  text3_perplexity  \
0              11                     11                  2          1.228007   
1               2                      2                  1          1.000130   
2               9                      9                  4          1.465586   

   text3_entropy  text3_flesch_reading_ease  
0       0.205392                 101.270682  
1       0.000130                 120.205000  
2       0.382255                  82.351250  
Field: text, Metrics:    text_n_tokens  text_n_unique_tokens  text_n_sentences  text_perplexity  \
0              4                     4                 1         1.192937   
1              2                     2                 2         1.000806   
2              3                     3                 1         1.000135   

   text_entropy  text_flesch_reading_

FeedbackDataset(
   fields=[TextField(name='text', title='Text', required=True, type='text', use_markdown=False), TextField(name='text2', title='Text2', required=True, type='text', use_markdown=False), TextField(name='text3', title='Text3', required=True, type='text', use_markdown=False)]
   questions=[RatingQuestion(name='answer_quality', title='Answer_quality', description='How would you rate the quality of the answer?', required=True, type='rating', values=[1, 2, 3, 4, 5])]
   guidelines=None)
   metadata_properties=[IntegerMetadataProperty(name='text3_n_tokens', title='Text3 N Tokens', visible_for_annotators=True, type='integer', min=None, max=None), IntegerMetadataProperty(name='text3_n_unique_tokens', title='Text3 N Unique Tokens', visible_for_annotators=True, type='integer', min=None, max=None), IntegerMetadataProperty(name='text3_n_sentences', title='Text3 N Sentences', visible_for_annotators=True, type='integer', min=None, max=None), FloatMetadataProperty(name='text3_perplexit
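
The output above shows metadata properties named `<field>_<metric>` and typed as integer or float, while the log lists some metrics as NaN columns. The plain-Python sketch below illustrates that naming and typing idea. It is not the `TextDescriptivesExtractor` implementation: the helper `build_metadata`, the field name `my_field`, the sample values, and the assumption that all-NaN metrics are dropped are all hypothetical.

```python
import math


def build_metadata(field, rows):
    """Prefix each metric with the field name; skip metrics that are NaN for every record."""
    metric_names = list(rows[0].keys())

    def is_nan(value):
        return isinstance(value, float) and math.isnan(value)

    # Assumption: a metric is kept only if at least one record has a real value for it.
    kept = [name for name in metric_names if not all(is_nan(r[name]) for r in rows)]

    # One metadata dict per record, keyed as "<field>_<metric>".
    records = [{f"{field}_{name}": r[name] for name in kept} for r in rows]

    # Metrics whose values are all ints map to "integer", everything else to "float".
    types = {
        f"{field}_{name}": "integer" if all(isinstance(r[name], int) for r in rows) else "float"
        for name in kept
    }
    return records, types


# Illustrative values only (hypothetical field and metrics).
rows = [
    {"n_tokens": 11, "entropy": 0.205392, "smog": float("nan")},
    {"n_tokens": 2, "entropy": 0.000130, "smog": float("nan")},
]
records, types = build_metadata("my_field", rows)
print(types)
print(records[0])
```

Prefixing with the field name keeps metrics from different text fields apart when a dataset has several of them, as in the `text`, `text2` and `text3` example here.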