diff --git a/docs/_static/getting_started/basics/coffee_reviews.png b/docs/_static/getting_started/basics/coffee_reviews.png new file mode 100644 index 0000000000..279297399f Binary files /dev/null and b/docs/_static/getting_started/basics/coffee_reviews.png differ diff --git a/docs/_static/getting_started/basics/ecdc_phrases.png b/docs/_static/getting_started/basics/ecdc_phrases.png new file mode 100644 index 0000000000..1b82aa39c1 Binary files /dev/null and b/docs/_static/getting_started/basics/ecdc_phrases.png differ diff --git a/docs/_static/getting_started/basics/first_record.png b/docs/_static/getting_started/basics/first_record.png new file mode 100644 index 0000000000..9b72624939 Binary files /dev/null and b/docs/_static/getting_started/basics/first_record.png differ diff --git a/docs/_static/getting_started/basics/manual_annotations.png b/docs/_static/getting_started/basics/manual_annotations.png new file mode 100644 index 0000000000..aa050335e4 Binary files /dev/null and b/docs/_static/getting_started/basics/manual_annotations.png differ diff --git a/docs/_static/getting_started/basics/snapchat_reviews.png b/docs/_static/getting_started/basics/snapchat_reviews.png new file mode 100644 index 0000000000..518f0f450b Binary files /dev/null and b/docs/_static/getting_started/basics/snapchat_reviews.png differ diff --git a/docs/getting_started/basics.ipynb b/docs/getting_started/basics.ipynb new file mode 100644 index 0000000000..0eb32a2ab4 --- /dev/null +++ b/docs/getting_started/basics.ipynb @@ -0,0 +1,933 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "5393fab0-2b6c-40d7-9fc5-419343e4ba26", + "metadata": { + "tags": [] + }, + "source": [ + "# Basics" + ] + }, + { + "cell_type": "markdown", + "id": "14d13ee2-ffeb-46fa-9c62-d77c8328e499", + "metadata": {}, + "source": [ + "Here you will find some basic guidelines on how to get started with Rubrix." 
+ ] + }, + { + "cell_type": "markdown", + "id": "ae0038d1-86b1-4eb9-ada4-ae561ad25aa3", + "metadata": { + "tags": [] + }, + "source": [ + "## How to upload data" + ] + }, + { + "cell_type": "markdown", + "id": "07fbba86-4f60-46a1-95a1-36c8a003731d", + "metadata": {}, + "source": [ + "The working units in Rubrix are so-called records. \n", + "No matter your data, you always have to put it in records for Rubrix to understand it. \n", + "A dataset in Rubrix is consequently a collection of these records. \n", + "Records can be of three different types depending on the [task supported](supported_tasks.rst) by Rubrix:\n", + "\n", + " 1. `TextClassificationRecord`: Records for [text classification tasks](supported_tasks.rst#text-classification);\n", + " 2. `TokenClassificationRecord`: Records for [token classification tasks](supported_tasks.rst#token-classification);\n", + " 3. `Text2TextRecord`: Records for [text-to-text tasks](supported_tasks.rst#text2text);\n", + " \n", + "The most critical attributes of a record that are common to all types are:\n", + "\n", + " - `text`: The input text of the record (Required);\n", + " - `annotation`: Annotate your record in a task-specific manner (Optional);\n", + " - `prediction`: Add task-specific model predictions to the record (Optional);\n", + " - `metadata`: Add some arbitrary metadata to the record (Optional);\n", + " \n", + "In Rubrix, records are created programmatically with its [client library](../reference/python/python_client.rst) via a Python script, a [Jupyter notebook](https://jupyter.org/), etc. 
\n", + "Let us see how to create and upload a basic record to the Rubrix web app (make sure Rubrix is already installed on your machine as described in the [setup guide](setup&installation.rst)):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "866426c8-b3af-4307-a3eb-3d50171e4b7f", + "metadata": {}, + "outputs": [], + "source": [ + "import rubrix as rb\n", + "\n", + "# Create a basic text classification record\n", + "record = rb.TextClassificationRecord(text=\"Hello world, this is me!\")\n", + "\n", + "# Upload (log) the record to the Rubrix web app\n", + "rb.log(record, \"my_first_record\")" + ] + }, + { + "cell_type": "markdown", + "id": "e6e23a98-ec42-4f20-8087-c5eda1918455", + "metadata": {}, + "source": [ + "Now you can access the *\"my_first_record\"* dataset in the Rubrix web app and look at your first record. " + ] + }, + { + "cell_type": "markdown", + "id": "b47ac105-78ba-4211-a29b-496e88797376", + "metadata": {}, + "source": [ + "![image](../_static/getting_started/basics/first_record.png)" + ] + }, + { + "cell_type": "markdown", + "id": "c84db0f9-a9ca-4799-9a26-635d2f3b94d4", + "metadata": {}, + "source": [ + "However, most of the time, you will have your data in some file format, like TXT, CSV, or JSON. \n", + "Rubrix relies on two well-known Python libraries to read these files: [pandas](https://pandas.pydata.org/) and [datasets](https://huggingface.co/docs/datasets/index). \n", + "After reading the files with one of those libraries, Rubrix provides handy shortcuts to create your records automatically.\n", + "\n", + "Let us look at a few examples for each of the record types.\n", + "**As mentioned earlier, you choose the record type depending on the task you want to tackle.**" + ] + }, + { + "cell_type": "markdown", + "id": "c4137fd2-cc98-4f59-a14e-31cd7489d59b", + "metadata": {}, + "source": [ + "### 1. 
Text classification" + ] + }, + { + "cell_type": "markdown", + "id": "c1004cfb-6fed-4281-950f-1f19495cd114", + "metadata": {}, + "source": [ + "In this example, we will read a [CSV file](https://www.kaggle.com/datasets/databar/10k-snapchat-reviews) from a Kaggle competition that contains reviews for the Snapchat app. \n", + "The underlying task here could be to classify the reviews by their sentiment. \n", + "\n", + "Let us read the file with [pandas](https://pandas.pydata.org/)\n", + "\n", + "
\n", + "\n", + "Note\n", + " \n", + "If the file is too big to fit in memory, try using the [datasets library](https://huggingface.co/docs/datasets/index), which can read the file without loading it completely into memory, as shown in the next section.\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4ae148b-4d91-49ef-a7d1-6073ce8f2077", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# Read the CSV file into a pandas DataFrame\n", + "dataframe = pd.read_csv(\"Snapchat_app_store_reviews.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "2ae293cb-4349-4d04-926f-e3abf0c3afad", + "metadata": {}, + "source": [ + "and have a quick look at the first three rows of the resulting [pandas DataFrame](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html):" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "3eb22d64-b15c-42d6-a43c-8a6f11c2bf5f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Unnamed: 0userNameratingreviewisEditeddatetitle
00Savvanananahhh4For the most part I quite enjoy Snapchat it’s ...False10/4/20 6:01Performance issues
11Idek 9-1011123I’m sorry to say it, but something is definite...False10/14/20 2:13What happened?
22William Quintana3Snapchat update ruined my story organization! ...False7/31/20 19:54STORY ORGANIZATION RUINED!
\n", + "
" + ], + "text/plain": [ + " Unnamed: 0 userName rating \\\n", + "0 0 Savvanananahhh 4 \n", + "1 1 Idek 9-101112 3 \n", + "2 2 William Quintana 3 \n", + "\n", + " review isEdited date \\\n", + "0 For the most part I quite enjoy Snapchat it’s ... False 10/4/20 6:01 \n", + "1 I’m sorry to say it, but something is definite... False 10/14/20 2:13 \n", + "2 Snapchat update ruined my story organization! ... False 7/31/20 19:54 \n", + "\n", + " title \n", + "0 Performance issues \n", + "1 What happened? \n", + "2 STORY ORGANIZATION RUINED! " + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataframe.head(3)" + ] + }, + { + "cell_type": "markdown", + "id": "24f7560f-0929-478a-ae4e-62274a1f04c5", + "metadata": {}, + "source": [ + "We will choose the _review_ column as input text for our records.\n", + "For Rubrix to know, we have to rename the corresponding column to _text_." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69fe9946-9e0f-4be9-ade2-9884d9bda998", + "metadata": {}, + "outputs": [], + "source": [ + "# Rename the 'review' column to 'text', \n", + "dataframe = dataframe.rename(columns={\"review\": \"text\"}) " + ] + }, + { + "cell_type": "markdown", + "id": "a191e8c9-e55a-41b2-ad24-b0860fd31445", + "metadata": {}, + "source": [ + "We can now read this `DataFrame` with Rubrix, which will automatically create the records and put them in a [Rubrix Dataset](../guides/datasets.ipynb)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2f63eda-d041-4849-8828-f3de0b25cb1a", + "metadata": {}, + "outputs": [], + "source": [ + "import rubrix as rb\n", + "\n", + "# Read DataFrame into a Rubrix Dataset\n", + "dataset_rb = rb.read_pandas(dataframe, task=\"TextClassification\") " + ] + }, + { + "cell_type": "markdown", + "id": "ebf825a9-7aaf-4b43-970d-6b4c2d493bb6", + "metadata": {}, + "source": [ + "We will upload this dataset to the web app and give it the name *snapchat_reviews*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea1c0cb7-f129-45e7-8784-88908d882104", + "metadata": {}, + "outputs": [], + "source": [ + "# Upload (log) the Dataset to the web app\n", + "rb.log(dataset_rb, \"snapchat_reviews\")" + ] + }, + { + "cell_type": "markdown", + "id": "930cb8c3-5dfc-4e5a-bdf4-3c19f3d5fb00", + "metadata": {}, + "source": [ + "![Screenshot of the uploaded snapchat reviews](../_static/getting_started/basics/snapchat_reviews.png)" + ] + }, + { + "cell_type": "markdown", + "id": "341abb81-2acd-411e-a1d3-7c54cfc257f8", + "metadata": {}, + "source": [ + "### 2. Token classification" + ] + }, + { + "cell_type": "markdown", + "id": "59944e44-4202-4890-9a45-f99fc3fb2dd1", + "metadata": { + "tags": [] + }, + "source": [ + "We will use German reviews of organic coffees in a [CSV file](https://www.kaggle.com/datasets/mldado/german-online-reviewsratings-of-organic-coffee) for this example. \n", + "The underlying task here could be to extract all attributes of an organic coffee.\n", + "\n", + "This time, let us read the file with [datasets](https://huggingface.co/docs/datasets/index)." 
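As an aside, token classification records will also need a `tokens` column, which we build further below with spaCy. If you only want a quick, dependency-free stand-in while trying things out, a naive whitespace split works, though it keeps punctuation glued to the words (a toy sketch, not a substitute for a proper tokenizer):

```python
# Naive whitespace tokenization; note that punctuation stays attached to the tokens
def whitespace_tokenize(row: dict) -> dict:
    return {"tokens": row["text"].split()}

print(whitespace_tokenize({"text": "Für mich ist dieser Kaffee ideal."}))
# {'tokens': ['Für', 'mich', 'ist', 'dieser', 'Kaffee', 'ideal.']}
```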
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "502febcb-26f1-4832-8218-4f029ebed697", + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import Dataset\n", + "\n", + "# Read the CSV file\n", + "dataset = Dataset.from_csv(\"kaffee_reviews.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "d29c2276-7cae-41e5-ba3b-157a8c1a6c6e", + "metadata": {}, + "source": [ + "and have a quick look at the first three rows of the resulting [datasets Dataset](https://huggingface.co/docs/datasets/access):" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "1b77947c-ed89-4dce-ba75-158264c8d384", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Unnamed: 0brandratingreview
00GEPA Kaffee5Wenn ich Bohnenkaffee trinke (auf Arbeit trink...
11GEPA Kaffee5Für mich ist dieser Kaffee ideal. Die Grundvor...
22GEPA Kaffee5Ich persönlich bin insbesondere von dem Geschm...
\n", + "
" + ], + "text/plain": [ + " Unnamed: 0 brand rating \\\n", + "0 0 GEPA Kaffee 5 \n", + "1 1 GEPA Kaffee 5 \n", + "2 2 GEPA Kaffee 5 \n", + "\n", + " review \n", + "0 Wenn ich Bohnenkaffee trinke (auf Arbeit trink... \n", + "1 Für mich ist dieser Kaffee ideal. Die Grundvor... \n", + "2 Ich persönlich bin insbesondere von dem Geschm... " + ] + }, + "execution_count": 94, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# The best way to visualize a Dataset is actually via pandas\n", + "dataset.select(range(3)).to_pandas() " + ] + }, + { + "cell_type": "markdown", + "id": "4f2d75b4-c9d7-40c0-b66c-c52e9de7ef1a", + "metadata": {}, + "source": [ + "We will choose the _review_ column as input text for our records.\n", + "For Rubrix to know, we have to rename the corresponding column to _text_." + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "32194771-66a3-4ecd-960e-59a8f8be8c2e", + "metadata": {}, + "outputs": [], + "source": [ + "dataset = dataset.rename_column(\"review\", \"text\")" + ] + }, + { + "cell_type": "markdown", + "id": "fb7d2f14-37a0-4a5a-8ae7-86f8e9304fa9", + "metadata": {}, + "source": [ + "In contrast to the other types, token classification records need the input text **and** the corresponding tokens. \n", + "So let us tokenize our input text in a small helper function and add the tokens to a new column called _tokens_. \n", + "\n", + "
\n", + "\n", + "Note\n", + "\n", + "We will use [spaCy](https://spacy.io/) to tokenize the text, but you can use whatever library you prefer.\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3a4664fe-0840-4768-b856-79bdbd1dc178", + "metadata": {}, + "outputs": [], + "source": [ + "import spacy\n", + "\n", + "# Load a german spaCy model to tokenize our text\n", + "nlp = spacy.load(\"de_core_news_sm\")\n", + "\n", + "# Define our tokenize function\n", + "def tokenize(row):\n", + " tokens = [token.text for token in nlp(row[\"text\"])]\n", + " return {\"tokens\": tokens}\n", + "\n", + "# Map the tokenize function to our dataset\n", + "dataset = dataset.map(tokenize)" + ] + }, + { + "cell_type": "markdown", + "id": "26e5ee02-3f26-4db2-9729-e664e9740e18", + "metadata": {}, + "source": [ + "Let us have a quick look at our extended `Dataset`:" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "id": "efcd39d2-0cb3-4ce1-b9ec-ca341d4bdb8f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Unnamed: 0brandratingtexttokens
00GEPA Kaffee5Wenn ich Bohnenkaffee trinke (auf Arbeit trink...[Wenn, ich, Bohnenkaffee, trinke, (, auf, Arbe...
11GEPA Kaffee5Für mich ist dieser Kaffee ideal. Die Grundvor...[Für, mich, ist, dieser, Kaffee, ideal, ., Die...
22GEPA Kaffee5Ich persönlich bin insbesondere von dem Geschm...[Ich, persönlich, bin, insbesondere, von, dem,...
\n", + "
" + ], + "text/plain": [ + " Unnamed: 0 brand rating \\\n", + "0 0 GEPA Kaffee 5 \n", + "1 1 GEPA Kaffee 5 \n", + "2 2 GEPA Kaffee 5 \n", + "\n", + " text \\\n", + "0 Wenn ich Bohnenkaffee trinke (auf Arbeit trink... \n", + "1 Für mich ist dieser Kaffee ideal. Die Grundvor... \n", + "2 Ich persönlich bin insbesondere von dem Geschm... \n", + "\n", + " tokens \n", + "0 [Wenn, ich, Bohnenkaffee, trinke, (, auf, Arbe... \n", + "1 [Für, mich, ist, dieser, Kaffee, ideal, ., Die... \n", + "2 [Ich, persönlich, bin, insbesondere, von, dem,... " + ] + }, + "execution_count": 97, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset.select(range(3)).to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "054e40cc-51f4-4321-b42a-2301775c0e9f", + "metadata": {}, + "source": [ + "We can now read this `Dataset` with Rubrix, which will automatically create the records and put them in a [Rubrix Dataset](../guides/datasets.ipynb)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e4ff337-7b12-48fc-b8a1-a829a7800d49", + "metadata": {}, + "outputs": [], + "source": [ + "import rubrix as rb\n", + "\n", + "# Read Dataset into a Rubrix Dataset\n", + "dataset_rb = rb.read_datasets(dataset, task=\"TokenClassification\") " + ] + }, + { + "cell_type": "markdown", + "id": "4cb220a7-cb2c-4e38-bd41-4cf96ed478c2", + "metadata": {}, + "source": [ + "We will upload this dataset to the web app and give it the name *coffee_reviews*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5dee85f1-1a37-4850-9bda-c4e54aa5db03", + "metadata": {}, + "outputs": [], + "source": [ + "# Log the datset to the Rubrix web app\n", + "rb.log(dataset_rb, \"coffee-reviews\")" + ] + }, + { + "cell_type": "markdown", + "id": "bd1c2d53-37c6-438a-b447-6dcae3369f55", + "metadata": {}, + "source": [ + "![Screenshot of the uploaded coffee reviews](../_static/getting_started/basics/coffee_reviews.png)" + ] + }, + { + "cell_type": "markdown", + 
"id": "d66683c8-9ed5-4ab9-9937-39eeac9ccab0", + "metadata": {}, + "source": [ + "### 3. Text2Text" + ] + }, + { + "cell_type": "markdown", + "id": "3c17d862-5e64-4c34-8aa5-3941506913c6", + "metadata": {}, + "source": [ + "In this example, we will use English sentences from the European Centre for Disease Prevention and Control available at the [Hugging Face Hub](https://huggingface.co/datasets/europa_ecdc_tm). \n", + "The underlying task here could be to translate the sentences into other European languages.\n", + "\n", + "Let us load the data with [datasets](https://huggingface.co/docs/datasets/index) from the [Hub](https://huggingface.co/datasets)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfbce85f-200e-4b54-9650-308395b81770", + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "# Load the Dataset from the Hugging Face Hub and extract the train split\n", + "dataset = load_dataset(\"europa_ecdc_tm\", \"en2fr\", split=\"train\")" + ] + }, + { + "cell_type": "markdown", + "id": "0de908f2-ff08-4a0e-87d9-a278f9ccd452", + "metadata": {}, + "source": [ + "and have a quick look at the first row of the resulting [datasets Dataset](https://huggingface.co/docs/datasets/access):" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "id": "16d386a5-14cb-46cd-afb2-e8405ddd5232", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'translation': {'en': 'Vaccination against hepatitis C is not yet available.',\n", + " 'fr': 'Aucune vaccination contre l’hépatite C n’est encore disponible.'}}" + ] + }, + "execution_count": 101, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset[0]" + ] + }, + { + "cell_type": "markdown", + "id": "67e41a33-11c2-422f-a5f4-411dc464f451", + "metadata": {}, + "source": [ + "We can see that the English sentences are nested in a dictionary inside the _translation_ column. 
\n", + "To extract the phrases into a new _text_ column, we will write a quick helper function and [map](https://huggingface.co/docs/datasets/process#map) the whole `Dataset` with it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b27d93ad-86f0-4d6c-a31f-5cc1d55235a6", + "metadata": {}, + "outputs": [], + "source": [ + "# Define our helper extract function\n", + "def extract(row):\n", + " return {\"text\": row[\"translation\"][\"en\"]}\n", + "\n", + "# Map the extract function to our dataset\n", + "dataset = dataset.map(extract)" + ] + }, + { + "cell_type": "markdown", + "id": "c9aa9293-64bb-4061-b686-b3115a7942fc", + "metadata": {}, + "source": [ + "Let us have a quick look at our extended `Dataset`:" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "id": "c7f51520-5d8c-45ef-aae5-f3f32cad8654", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'translation': {'en': 'Vaccination against hepatitis C is not yet available.',\n", + " 'fr': 'Aucune vaccination contre l’hépatite C n’est encore disponible.'},\n", + " 'text': 'Vaccination against hepatitis C is not yet available.'}" + ] + }, + "execution_count": 103, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset[0]" + ] + }, + { + "cell_type": "markdown", + "id": "138e0dac-7772-49ae-b57e-52619ebf5899", + "metadata": {}, + "source": [ + "We can now read this `Dataset` with Rubrix, which will automatically create the records and put them in a [Rubrix Dataset](../guides/datasets.ipynb)." 
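A `Text2TextRecord` can also carry an optional `prediction`, given as a list of candidate output texts. Purely for illustration, the extraction step could reuse the French reference as a mock prediction; in practice the predictions would come from a translation model, and this assumes `rb.read_datasets` picks up a `prediction` column:

```python
# Illustrative only: reuse the French reference as a mock "prediction"
def extract_with_prediction(row: dict) -> dict:
    return {
        "text": row["translation"]["en"],
        "prediction": [row["translation"]["fr"]],
    }

row = {
    "translation": {
        "en": "Vaccination against hepatitis C is not yet available.",
        "fr": "Aucune vaccination contre l’hépatite C n’est encore disponible.",
    }
}
out = extract_with_prediction(row)
print(out["text"])  # Vaccination against hepatitis C is not yet available.
```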
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c6f98a37-5fda-4e79-aff1-f3fb770498ea", + "metadata": {}, + "outputs": [], + "source": [ + "import rubrix as rb\n", + "\n", + "# Read Dataset into a Rubrix Dataset\n", + "dataset_rb = rb.read_datasets(dataset, task=\"Text2Text\") " + ] + }, + { + "cell_type": "markdown", + "id": "a5236bf1-f98c-411f-81d0-e31740f0fc10", + "metadata": {}, + "source": [ + "We will upload this dataset to the web app and give it the name *ecdc_en*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9da01d6f-2728-4ed0-b6aa-cd2e2531202c", + "metadata": {}, + "outputs": [], + "source": [ + "# Log the dataset to the Rubrix web app\n", + "rb.log(dataset_rb, \"ecdc_en\")" + ] + }, + { + "cell_type": "markdown", + "id": "f5a4969a-9132-459d-9f8d-e0006a1d52a0", + "metadata": {}, + "source": [ + "![Screenshot of the uploaded English phrases.](../_static/getting_started/basics/ecdc_phrases.png)" + ] + }, + { + "cell_type": "markdown", + "id": "e3139dd1-a939-4baa-8d26-bd1214c8cbcd", + "metadata": { + "tags": [] + }, + "source": [ + "## How to annotate datasets" + ] + }, + { + "cell_type": "markdown", + "id": "5bfeaf75-143e-4543-85f3-2a6e995dcf06", + "metadata": {}, + "source": [ + "Rubrix provides several ways to annotate your data. \n", + "With the intuitive Rubrix web app, you can choose between:\n", + "\n", + "1. Manually annotating each record using a dedicated interface for each task type;\n", + "2. Leveraging a user-provided model by validating its predictions;\n", + "3. Defining heuristic rules to produce \"noisy labels\", a technique also known as \"weak supervision\";\n", + "\n", + "Each way has its pros and cons, and the best match largely depends on your individual use case.\n" + ] + }, + { + "cell_type": "markdown", + "id": "cf0bccc4-05c7-4c27-920b-5b76eb4acd22", + "metadata": {}, + "source": [ + "### 1. 
Manual annotations" + ] + }, + { + "cell_type": "markdown", + "id": "3b085744-7c61-4614-b542-c4de2cea9181", + "metadata": {}, + "source": [ + "![Manual annotations of a sentiment classification task](../_static/getting_started/basics/manual_annotations.png)\n", + "\n", + "The straightforward approach of manual annotations might be necessary if you do not have a suitable model for your use case or cannot come up with good heuristic rules for your dataset. \n", + "It can also be a good approach if you have a large annotation workforce at your disposal or require few but unbiased and high-quality labels.\n", + "\n", + "Rubrix tries to make this relatively cumbersome approach as painless as possible. \n", + "Via an intuitive and adaptive UI, its exhaustive search and filter functionalities, and bulk annotation capabilities, Rubrix turns the manual annotation process into an efficient option. \n", + "\n", + "Look at our dedicated [feature reference](../reference/webapp/annotate_records.md) for a detailed and illustrative guide on manually annotating your dataset with Rubrix." + ] + }, + { + "cell_type": "markdown", + "id": "e631840b-9cf7-45e6-9dc4-5f6b24cf0e8b", + "metadata": {}, + "source": [ + "### 2. Validating predictions" + ] + }, + { + "cell_type": "markdown", + "id": "e3c46840-0991-4fc0-bbb0-15df33ee242b", + "metadata": {}, + "source": [ + "![Validate predictions for a token classification dataset](../_static/reference/webapp/annotation_ner.png)\n", + "\n", + "Nowadays, many pre-trained or zero-shot models are available online via model repositories like the Hugging Face Hub. \n", + "Most of the time, you will probably find a model that already suits your specific dataset task to some degree. 
\n", + "In Rubrix, you can pre-annotate your data by including predictions from these models in your records.\n", + "Assuming that the model works reasonably well on your dataset, you can filter for records with high prediction scores and validate the predictions.\n", + "In this way, you will rapidly annotate part of your data and alleviate the annotation process.\n", + "\n", + "One downside of this approach is that your annotations will be subject to the possible biases and mistakes of the pre-trained model.\n", + "When guided by pre-trained models, it is common to see human annotators get influenced by them.\n", + "Therefore, it is advisable to avoid pre-annotations when building a rigorous test set for the final model evaluation.\n", + "\n", + "Check the [introduction tutorial](../tutorials/01-labeling-finetuning.ipynb) to learn to add predictions to the records. \n", + "And our [feature reference](../reference/webapp/annotate_records.md#validate-predictions) includes a detailed guide on validating predictions in the Rubrix web app." + ] + }, + { + "cell_type": "markdown", + "id": "c2cbd593-e241-4f27-9a58-6932912ea9f1", + "metadata": {}, + "source": [ + "### 3. Define rules" + ] + }, + { + "cell_type": "markdown", + "id": "87f7ea92-d40f-4d09-a0e8-16b7d7867e6e", + "metadata": {}, + "source": [ + "![Defining a rule for a multi-label text classification task.](../_static/reference/webapp/define_rules_2.png)\n", + "\n", + "Another approach to annotating your data is to develop heuristic rules tailored to your dataset. \n", + "For example, let us assume you want to classify news articles into the categories of *Finance*, *Sports*, and *Culture*. \n", + "In this case, a reasonable rule would be to label all articles that include the word \"stock\" as *Finance*. \n", + "\n", + "It is easy to see how you can quickly annotate vast amounts of data in this way, which we refer to as *weak supervision*. 
\n", + "Rules can get arbitrarily complex and can also include the record's metadata. \n", + "The downsides of this approach are that it might be challenging to come up with working heuristic rules for some datasets. \n", + "Furthermore, rules are rarely 100% precise and often conflict with each other, which must be addressed by so-called label models. \n", + "It is usually a trade-off between the amount of annotated data and the quality of the labels.\n", + "\n", + "Check [our guide](../guides/weak-supervision.ipynb) for an extensive introduction to weak supervision with Rubrix. \n", + "Also, check the [feature reference](../reference/webapp/define_rules.md) for the Define rules mode of the web app and our [various tutorials](../tutorials/weak-supervision.md) to see practical examples of weak supervision workflows. " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/index.rst b/docs/index.rst index cd00ede1be..1fca1defb9 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -164,6 +164,7 @@ You can join the conversation on our Github page and our Github forum. 
getting_started/setup&installation getting_started/concepts + getting_started/basics getting_started/user-management getting_started/advanced_setup_guides diff --git a/docs/reference/webapp/annotate_records.md b/docs/reference/webapp/annotate_records.md index e9d6b2efd4..b6548461c3 100644 --- a/docs/reference/webapp/annotate_records.md +++ b/docs/reference/webapp/annotate_records.md @@ -7,34 +7,24 @@ Rubrix's powerful search and filter functionalities, together with potential mod You can access the _Annotate mode_ via the sidebar of the [Dataset page](dataset.md). -## Search and filter - -![Search and filter for annotation view](../../_static/reference/webapp/filters_all.png) - -The powerful search bar allows you to do simple, quick searches, as well as complex queries that take full advantage of Rubrix's [data models](../python/python_client.rst#module-rubrix.client.models). -In addition, the _filters_ provide you a quick and intuitive way to filter and sort your records with respect to various parameters, including the metadata of your records. -For example, you can use the **Status filter** to hide already annotated records (_Status: Default_), or to only show annotated records when revising previous annotations (_Status: Validated_). +## Create labels -You can find more information about how to use the search bar and the filters in our detailed [search guide](search_records.md) and [filter guide](filter_records.md). +![Create new label](../../_static/reference/webapp/create_newlabel.png) -```{note} -Not all filters are available for all [tasks](../../guides/task_examples.ipynb). -``` +For the text and token classification tasks, you can create new labels within the _Annotate mode_. +On the right side of the bulk validation bar, you will find a _"+ Create new label"_ button that lets you add new labels to your dataset. 
## Annotate To annotate the records, the Rubrix web app provides a simple and intuitive interface that tries to follow the same interaction pattern as in the [Explore mode](explore_records.md). -As the _Explore mode_, the record cards in the _Annotate mode_ are also customized depending on the [task](../../guides/task_examples.ipynb) of the dataset. +As in the _Explore mode_, the record cards in the _Annotate mode_ are also customized depending on the [task](../../guides/task_examples.ipynb) of the dataset. ### Text Classification ![Multilabel card, validated](../../_static/reference/webapp/textclassification_multilabel.png) When switching in the _Annotate mode_ for a text classification dataset, the labels in the record cards become clickable and you can annotate the records by simply clicking on them. -You can also validate the predictions shown in a slightly darker tone by pressing the _Validate_ button: - -- for a **single label** classification task, this will be the prediction with the highest percentage -- for a **multi label** classification task, this will be the predictions with a percentage above 50% +For multi-label classification tasks, you can also annotate a record with no labels by either validating an empty selection or deselecting all labels. Once a record is annotated, it will be marked as _Validated_ in the upper right corner of the record card. @@ -47,15 +37,13 @@ Under the hood, the highlighting takes advantage of the `tokens` information in You can also remove annotations by hovering over the highlights and pressing the _X_ button. After modifying a record, either by adding or removing annotations, its status will change to _Pending_ and a _Save_ button will appear. -You can also validate the predictions (or the absent of them) by pressing the _Validate_ button. -Once the record is saved or validated, its status will change to _Validated_. +Once a record is saved, its status will change to _Validated_. 
### Text2Text ![Text2Text View](../../_static/reference/webapp/text2text_annotation.png) For text2text datasets, you have a text box available, in which you can draft or edit an annotation. -You can also validate or edit a prediction, by first clicking on the _view predictions_ button, and then the _Edit_ or _Validate_ button. After editing or drafting your annotation, don't forget to save your changes. ## Bulk annotate @@ -68,12 +56,44 @@ Then you can either _Validate_ or _Discard_ the selected records. For the text classification task, you can additionally **bulk annotate** the selected records with a specific label, by simply selecting the label from the _"Annotate as ..."_ list. -## Create labels +## Validate predictions -![Create new label](../../_static/reference/webapp/create_newlabel.png) +In Rubrix you can pre-annotate your data by including model predictions in your records. +Assuming that the model works reasonably well on your dataset, you can filter for records with high prediction scores, and simply _validate_ their predictions to quickly annotate records. -For the text and token classification tasks, you can create new labels within the _Annotate mode_. -On the right side of the bulk validation bar, you will find a _"+ Create new label"_ button that lets you add new labels to your dataset. +### Text Classification + +For this task, model predictions are shown as percentages in the label tags. +You can validate the predictions shown in a slightly darker tone by pressing the _Validate_ button: + +- for a **single label** classification task, this will be the prediction with the highest percentage +- for a **multi label** classification task, this will be the predictions with a percentage above 50% + +### Token Classification + +For this task, predictions are shown as underlines. +You can also validate the predictions (or the absence of them) by pressing the _Validate_ button. 
+ +Once the record is saved or validated, its status will change to _Validated_. + +### Text2Text + +You can validate or edit a prediction by first clicking on the _view predictions_ button, and then the _Edit_ or _Validate_ button. +After editing or drafting your annotation, don't forget to save your changes. + +## Search and filter + +![Search and filter for annotation view](../../_static/reference/webapp/filters_all.png) + +The powerful search bar allows you to do simple, quick searches, as well as complex queries that take full advantage of Rubrix's [data models](../python/python_client.rst#module-rubrix.client.models). +In addition, the _filters_ provide you with a quick and intuitive way to filter and sort your records with respect to various parameters, including the metadata of your records. +For example, you can use the **Status filter** to hide already annotated records (_Status: Default_), or to only show annotated records when revising previous annotations (_Status: Validated_). + +You can find more information about how to use the search bar and the filters in our detailed [search guide](search_records.md) and [filter guide](filter_records.md). + +```{note} +Not all filters are available for all [tasks](../../guides/task_examples.ipynb). +``` ## Progress metric