diff --git a/docs/_static/getting_started/basics/coffee_reviews.png b/docs/_static/getting_started/basics/coffee_reviews.png
new file mode 100644
index 0000000000..279297399f
Binary files /dev/null and b/docs/_static/getting_started/basics/coffee_reviews.png differ
diff --git a/docs/_static/getting_started/basics/ecdc_phrases.png b/docs/_static/getting_started/basics/ecdc_phrases.png
new file mode 100644
index 0000000000..1b82aa39c1
Binary files /dev/null and b/docs/_static/getting_started/basics/ecdc_phrases.png differ
diff --git a/docs/_static/getting_started/basics/first_record.png b/docs/_static/getting_started/basics/first_record.png
new file mode 100644
index 0000000000..9b72624939
Binary files /dev/null and b/docs/_static/getting_started/basics/first_record.png differ
diff --git a/docs/_static/getting_started/basics/manual_annotations.png b/docs/_static/getting_started/basics/manual_annotations.png
new file mode 100644
index 0000000000..aa050335e4
Binary files /dev/null and b/docs/_static/getting_started/basics/manual_annotations.png differ
diff --git a/docs/_static/getting_started/basics/snapchat_reviews.png b/docs/_static/getting_started/basics/snapchat_reviews.png
new file mode 100644
index 0000000000..518f0f450b
Binary files /dev/null and b/docs/_static/getting_started/basics/snapchat_reviews.png differ
diff --git a/docs/getting_started/basics.ipynb b/docs/getting_started/basics.ipynb
new file mode 100644
index 0000000000..0eb32a2ab4
--- /dev/null
+++ b/docs/getting_started/basics.ipynb
@@ -0,0 +1,933 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "5393fab0-2b6c-40d7-9fc5-419343e4ba26",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Basics"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "14d13ee2-ffeb-46fa-9c62-d77c8328e499",
+ "metadata": {},
+ "source": [
+ "Here you will find some basic guidelines on how to get started with Rubrix."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ae0038d1-86b1-4eb9-ada4-ae561ad25aa3",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## How to upload data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07fbba86-4f60-46a1-95a1-36c8a003731d",
+ "metadata": {},
+ "source": [
+ "The working units in Rubrix are so-called records. \n",
+ "No matter your data, you always have to put it in records for Rubrix to understand it. \n",
+ "A dataset in Rubrix is consequently a collection of these records. \n",
+ "Records can be of three different types depending on the [task supported](supported_tasks.rst) by Rubrix:\n",
+ "\n",
+ " 1. `TextClassificationRecord`: Records for [text classification tasks](supported_tasks.rst#text-classification);\n",
+ " 2. `TokenClassificationRecord`: Records for [token classification tasks](supported_tasks.rst#token-classification);\n",
+ " 3. `Text2TextRecord`: Records for [text-to-text tasks](supported_tasks.rst#text2text);\n",
+ " \n",
+ "The most critical attributes of a record that are common to all types are:\n",
+ "\n",
+ " - `text`: The input text of the record (Required);\n",
+ " - `annotation`: A task-specific annotation of the record (Optional);\n",
+ " - `prediction`: Task-specific model predictions for the record (Optional);\n",
+ " - `metadata`: Arbitrary metadata attached to the record (Optional);\n",
+ " \n",
+ "In Rubrix, records are created programmatically with the [client library](../reference/python/python_client.rst), for example from a Python script or a [Jupyter notebook](https://jupyter.org/). \n",
+ "Let us see how to create and upload a basic record to the Rubrix web app (make sure Rubrix is already installed on your machine as described in the [setup guide](setup&installation.rst)):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "866426c8-b3af-4307-a3eb-3d50171e4b7f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import rubrix as rb\n",
+ "\n",
+ "# Create a basic text classification record\n",
+ "record = rb.TextClassificationRecord(text=\"Hello world, this is me!\")\n",
+ "\n",
+ "# Upload (log) the record to the Rubrix web app\n",
+ "rb.log(record, \"my_first_record\")"
+ ]
+ },
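+ {
+ "cell_type": "markdown",
+ "id": "f3a9c2d1-7b4e-4c1d-9e2a-0d5b6c7a8e9f",
+ "metadata": {},
+ "source": [
+ "Records can also carry the optional attributes listed above. As a small sketch (the labels, scores, and metadata here are made up for illustration), a fully equipped text classification record could look like this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9c8b7a6d-5e4f-4d3c-b2a1-0f9e8d7c6b5a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A record with an annotation, model predictions, and some metadata\n",
+ "record = rb.TextClassificationRecord(\n",
+ "    text=\"Hello world, this is me!\",\n",
+ "    annotation=\"greeting\",  # a task-specific annotation\n",
+ "    prediction=[(\"greeting\", 0.9), (\"farewell\", 0.1)],  # (label, score) tuples\n",
+ "    metadata={\"split\": \"train\"},  # arbitrary metadata\n",
+ ")\n",
+ "\n",
+ "rb.log(record, \"my_first_record\")"
+ ]
+ },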
+ {
+ "cell_type": "markdown",
+ "id": "e6e23a98-ec42-4f20-8087-c5eda1918455",
+ "metadata": {},
+ "source": [
+ "Now you can access the *\"my_first_record\"* dataset in the Rubrix web app and look at your first record. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b47ac105-78ba-4211-a29b-496e88797376",
+ "metadata": {},
+ "source": [
+ "![image](../_static/getting_started/basics/first_record.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c84db0f9-a9ca-4799-9a26-635d2f3b94d4",
+ "metadata": {},
+ "source": [
+ "However, most of the time, you will have your data in some file format, like TXT, CSV, or JSON. \n",
+ "Rubrix relies on two well-known Python libraries to read these files: [pandas](https://pandas.pydata.org/) and [datasets](https://huggingface.co/docs/datasets/index). \n",
+ "After reading the files with one of those libraries, Rubrix provides handy shortcuts to create your records automatically.\n",
+ "\n",
+ "Let us look at a few examples for each of the record types.\n",
+ "**As mentioned earlier, you choose the record type depending on the task you want to tackle.**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c4137fd2-cc98-4f59-a14e-31cd7489d59b",
+ "metadata": {},
+ "source": [
+ "### 1. Text classification"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1004cfb-6fed-4281-950f-1f19495cd114",
+ "metadata": {},
+ "source": [
+ "In this example, we will read a [CSV file](https://www.kaggle.com/datasets/databar/10k-snapchat-reviews) from a Kaggle competition that contains reviews for the Snapchat app. \n",
+ "The underlying task here could be to classify the reviews by their sentiment. \n",
+ "\n",
+ "Let us read the file with [pandas](https://pandas.pydata.org/)\n",
+ "\n",
+ "<div class=\"alert alert-info\">\n",
+ "\n",
+ "Note\n",
+ "    \n",
+ "If the file is too big to fit in memory, try using the [datasets library](https://huggingface.co/docs/datasets/index) instead, which memory-maps the data and avoids loading everything into RAM, as shown in the next section.\n",
+ "    \n",
+ "</div>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d4ae148b-4d91-49ef-a7d1-6073ce8f2077",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# Read the CSV file into a pandas DataFrame\n",
+ "dataframe = pd.read_csv(\"Snapchat_app_store_reviews.csv\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2ae293cb-4349-4d04-926f-e3abf0c3afad",
+ "metadata": {},
+ "source": [
+ "and have a quick look at the first three rows of the resulting [pandas DataFrame](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "id": "3eb22d64-b15c-42d6-a43c-8a6f11c2bf5f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Unnamed: 0 userName rating \\\n",
+ "0 0 Savvanananahhh 4 \n",
+ "1 1 Idek 9-101112 3 \n",
+ "2 2 William Quintana 3 \n",
+ "\n",
+ " review isEdited date \\\n",
+ "0 For the most part I quite enjoy Snapchat it’s ... False 10/4/20 6:01 \n",
+ "1 I’m sorry to say it, but something is definite... False 10/14/20 2:13 \n",
+ "2 Snapchat update ruined my story organization! ... False 7/31/20 19:54 \n",
+ "\n",
+ " title \n",
+ "0 Performance issues \n",
+ "1 What happened? \n",
+ "2 STORY ORGANIZATION RUINED! "
+ ]
+ },
+ "execution_count": 41,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "dataframe.head(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24f7560f-0929-478a-ae4e-62274a1f04c5",
+ "metadata": {},
+ "source": [
+ "We will choose the _review_ column as input text for our records.\n",
+ "For Rubrix to pick it up, we have to rename this column to _text_."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "69fe9946-9e0f-4be9-ade2-9884d9bda998",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Rename the 'review' column to 'text'\n",
+ "dataframe = dataframe.rename(columns={\"review\": \"text\"}) "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a191e8c9-e55a-41b2-ad24-b0860fd31445",
+ "metadata": {},
+ "source": [
+ "We can now read this `DataFrame` with Rubrix, which will automatically create the records and put them in a [Rubrix Dataset](../guides/datasets.ipynb)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2f63eda-d041-4849-8828-f3de0b25cb1a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import rubrix as rb\n",
+ "\n",
+ "# Read DataFrame into a Rubrix Dataset\n",
+ "dataset_rb = rb.read_pandas(dataframe, task=\"TextClassification\") "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ebf825a9-7aaf-4b43-970d-6b4c2d493bb6",
+ "metadata": {},
+ "source": [
+ "We will upload this dataset to the web app and give it the name *snapchat_reviews*:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ea1c0cb7-f129-45e7-8784-88908d882104",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Upload (log) the Dataset to the web app\n",
+ "rb.log(dataset_rb, \"snapchat_reviews\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "930cb8c3-5dfc-4e5a-bdf4-3c19f3d5fb00",
+ "metadata": {},
+ "source": [
+ "![Screenshot of the uploaded snapchat reviews](../_static/getting_started/basics/snapchat_reviews.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "341abb81-2acd-411e-a1d3-7c54cfc257f8",
+ "metadata": {},
+ "source": [
+ "### 2. Token classification"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "59944e44-4202-4890-9a45-f99fc3fb2dd1",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "We will use German reviews of organic coffees in a [CSV file](https://www.kaggle.com/datasets/mldado/german-online-reviewsratings-of-organic-coffee) for this example. \n",
+ "The underlying task here could be to extract all attributes of an organic coffee.\n",
+ "\n",
+ "This time, let us read the file with [datasets](https://huggingface.co/docs/datasets/index)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "502febcb-26f1-4832-8218-4f029ebed697",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from datasets import Dataset\n",
+ "\n",
+ "# Read the csv file\n",
+ "dataset = Dataset.from_csv(\"kaffee_reviews.csv\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d29c2276-7cae-41e5-ba3b-157a8c1a6c6e",
+ "metadata": {},
+ "source": [
+ "and have a quick look at the first three rows of the resulting [datasets Dataset](https://huggingface.co/docs/datasets/access):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 94,
+ "id": "1b77947c-ed89-4dce-ba75-158264c8d384",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Unnamed: 0 brand rating \\\n",
+ "0 0 GEPA Kaffee 5 \n",
+ "1 1 GEPA Kaffee 5 \n",
+ "2 2 GEPA Kaffee 5 \n",
+ "\n",
+ " review \n",
+ "0 Wenn ich Bohnenkaffee trinke (auf Arbeit trink... \n",
+ "1 Für mich ist dieser Kaffee ideal. Die Grundvor... \n",
+ "2 Ich persönlich bin insbesondere von dem Geschm... "
+ ]
+ },
+ "execution_count": 94,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# The best way to visualize a Dataset is actually via pandas\n",
+ "dataset.select(range(3)).to_pandas() "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4f2d75b4-c9d7-40c0-b66c-c52e9de7ef1a",
+ "metadata": {},
+ "source": [
+ "We will choose the _review_ column as input text for our records.\n",
+ "For Rubrix to pick it up, we have to rename this column to _text_."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 95,
+ "id": "32194771-66a3-4ecd-960e-59a8f8be8c2e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset = dataset.rename_column(\"review\", \"text\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fb7d2f14-37a0-4a5a-8ae7-86f8e9304fa9",
+ "metadata": {},
+ "source": [
+ "In contrast to the other types, token classification records need the input text **and** the corresponding tokens. \n",
+ "So let us tokenize our input text in a small helper function and add the tokens to a new column called _tokens_. \n",
+ "\n",
+ "<div class=\"alert alert-info\">\n",
+ "\n",
+ "Note\n",
+ "\n",
+ "We will use [spaCy](https://spacy.io/) to tokenize the text, but you can use whatever library you prefer.\n",
+ "    \n",
+ "</div>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3a4664fe-0840-4768-b856-79bdbd1dc178",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import spacy\n",
+ "\n",
+ "# Load a german spaCy model to tokenize our text\n",
+ "nlp = spacy.load(\"de_core_news_sm\")\n",
+ "\n",
+ "# Define our tokenize function\n",
+ "def tokenize(row):\n",
+ " tokens = [token.text for token in nlp(row[\"text\"])]\n",
+ " return {\"tokens\": tokens}\n",
+ "\n",
+ "# Map the tokenize function to our dataset\n",
+ "dataset = dataset.map(tokenize)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "26e5ee02-3f26-4db2-9729-e664e9740e18",
+ "metadata": {},
+ "source": [
+ "Let us have a quick look at our extended `Dataset`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 97,
+ "id": "efcd39d2-0cb3-4ce1-b9ec-ca341d4bdb8f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Unnamed: 0 brand rating \\\n",
+ "0 0 GEPA Kaffee 5 \n",
+ "1 1 GEPA Kaffee 5 \n",
+ "2 2 GEPA Kaffee 5 \n",
+ "\n",
+ " text \\\n",
+ "0 Wenn ich Bohnenkaffee trinke (auf Arbeit trink... \n",
+ "1 Für mich ist dieser Kaffee ideal. Die Grundvor... \n",
+ "2 Ich persönlich bin insbesondere von dem Geschm... \n",
+ "\n",
+ " tokens \n",
+ "0 [Wenn, ich, Bohnenkaffee, trinke, (, auf, Arbe... \n",
+ "1 [Für, mich, ist, dieser, Kaffee, ideal, ., Die... \n",
+ "2 [Ich, persönlich, bin, insbesondere, von, dem,... "
+ ]
+ },
+ "execution_count": 97,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "dataset.select(range(3)).to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "054e40cc-51f4-4321-b42a-2301775c0e9f",
+ "metadata": {},
+ "source": [
+ "We can now read this `Dataset` with Rubrix, which will automatically create the records and put them in a [Rubrix Dataset](../guides/datasets.ipynb)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0e4ff337-7b12-48fc-b8a1-a829a7800d49",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import rubrix as rb\n",
+ "\n",
+ "# Read Dataset into a Rubrix Dataset\n",
+ "dataset_rb = rb.read_datasets(dataset, task=\"TokenClassification\") "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4cb220a7-cb2c-4e38-bd41-4cf96ed478c2",
+ "metadata": {},
+ "source": [
+ "We will upload this dataset to the web app and give it the name *coffee_reviews*:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5dee85f1-1a37-4850-9bda-c4e54aa5db03",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Log the dataset to the Rubrix web app\n",
+ "rb.log(dataset_rb, \"coffee_reviews\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bd1c2d53-37c6-438a-b447-6dcae3369f55",
+ "metadata": {},
+ "source": [
+ "![Screenshot of the uploaded coffee reviews](../_static/getting_started/basics/coffee_reviews.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d66683c8-9ed5-4ab9-9937-39eeac9ccab0",
+ "metadata": {},
+ "source": [
+ "### 3. Text2Text"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3c17d862-5e64-4c34-8aa5-3941506913c6",
+ "metadata": {},
+ "source": [
+ "In this example, we will use English sentences from the European Center for Disease Prevention and Control available at the [Hugging Face Hub](https://huggingface.co/datasets/europa_ecdc_tm). \n",
+ "The underlying task here could be to translate the sentences into other European languages.\n",
+ "\n",
+ "Let us load the data with [datasets](https://huggingface.co/docs/datasets/index) from the [Hub](https://huggingface.co/datasets)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cfbce85f-200e-4b54-9650-308395b81770",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from datasets import load_dataset\n",
+ "\n",
+ "# Load the Dataset from the Hugging Face Hub and extract the train split\n",
+ "dataset = load_dataset(\"europa_ecdc_tm\", \"en2fr\", split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0de908f2-ff08-4a0e-87d9-a278f9ccd452",
+ "metadata": {},
+ "source": [
+ "and have a quick look at the first row of the resulting [datasets Dataset](https://huggingface.co/docs/datasets/access):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 101,
+ "id": "16d386a5-14cb-46cd-afb2-e8405ddd5232",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'translation': {'en': 'Vaccination against hepatitis C is not yet available.',\n",
+ " 'fr': 'Aucune vaccination contre l’hépatite C n’est encore disponible.'}}"
+ ]
+ },
+ "execution_count": 101,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "dataset[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "67e41a33-11c2-422f-a5f4-411dc464f451",
+ "metadata": {},
+ "source": [
+ "We can see that the English sentences are nested in a dictionary inside the _translation_ column. \n",
+ "To extract the phrases into a new _text_ column, we will write a quick helper function and [map](https://huggingface.co/docs/datasets/process#map) the whole `Dataset` with it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b27d93ad-86f0-4d6c-a31f-5cc1d55235a6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Define our helper extract function\n",
+ "def extract(row):\n",
+ " return {\"text\": row[\"translation\"][\"en\"]}\n",
+ "\n",
+ "# Map the extract function to our dataset\n",
+ "dataset = dataset.map(extract)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c9aa9293-64bb-4061-b686-b3115a7942fc",
+ "metadata": {},
+ "source": [
+ "Let us have a quick look at our extended `Dataset`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 103,
+ "id": "c7f51520-5d8c-45ef-aae5-f3f32cad8654",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'translation': {'en': 'Vaccination against hepatitis C is not yet available.',\n",
+ " 'fr': 'Aucune vaccination contre l’hépatite C n’est encore disponible.'},\n",
+ " 'text': 'Vaccination against hepatitis C is not yet available.'}"
+ ]
+ },
+ "execution_count": 103,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "dataset[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "138e0dac-7772-49ae-b57e-52619ebf5899",
+ "metadata": {},
+ "source": [
+ "We can now read this `Dataset` with Rubrix, which will automatically create the records and put them in a [Rubrix Dataset](../guides/datasets.ipynb)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c6f98a37-5fda-4e79-aff1-f3fb770498ea",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import rubrix as rb\n",
+ "\n",
+ "# Read Dataset into a Rubrix Dataset\n",
+ "dataset_rb = rb.read_datasets(dataset, task=\"Text2Text\") "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a5236bf1-f98c-411f-81d0-e31740f0fc10",
+ "metadata": {},
+ "source": [
+ "We will upload this dataset to the web app and give it the name *ecdc_en*:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9da01d6f-2728-4ed0-b6aa-cd2e2531202c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Log the dataset to the Rubrix web app\n",
+ "rb.log(dataset_rb, \"ecdc_en\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5a4969a-9132-459d-9f8d-e0006a1d52a0",
+ "metadata": {},
+ "source": [
+ "![Screenshot of the uploaded English phrases.](../_static/getting_started/basics/ecdc_phrases.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e3139dd1-a939-4baa-8d26-bd1214c8cbcd",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## How to annotate datasets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5bfeaf75-143e-4543-85f3-2a6e995dcf06",
+ "metadata": {},
+ "source": [
+ "Rubrix provides several ways to annotate your data. \n",
+ "With the intuitive Rubrix web app, you can choose between:\n",
+ "\n",
+ "1. Manually annotating each record using a dedicated interface for each task type;\n",
+ "2. Leveraging a user-provided model by validating its predictions;\n",
+ "3. Defining heuristic rules that produce \"noisy labels\", a technique also known as \"weak supervision\";\n",
+ "\n",
+ "Each way has its pros and cons, and the best match largely depends on your individual use case.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf0bccc4-05c7-4c27-920b-5b76eb4acd22",
+ "metadata": {},
+ "source": [
+ "### 1. Manual annotations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b085744-7c61-4614-b542-c4de2cea9181",
+ "metadata": {},
+ "source": [
+ "![Manual annotations of a sentiment classification task](../_static/getting_started/basics/manual_annotations.png)\n",
+ "\n",
+ "The straightforward approach of manual annotations might be necessary if you do not have a suitable model for your use case or cannot come up with good heuristic rules for your dataset. \n",
+ "It can also be a good approach if you have a large annotation workforce at your disposal, or if you require few but unbiased and high-quality labels.\n",
+ "\n",
+ "Rubrix tries to make this relatively cumbersome approach as painless as possible. \n",
+ "Via an intuitive and adaptive UI, its exhaustive search and filter functionalities, and bulk annotation capabilities, Rubrix turns the manual annotation process into an efficient option. \n",
+ "\n",
+ "Look at our dedicated [feature reference](../reference/webapp/annotate_records.md) for a detailed and illustrative guide on manually annotating your dataset with Rubrix."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e631840b-9cf7-45e6-9dc4-5f6b24cf0e8b",
+ "metadata": {},
+ "source": [
+ "### 2. Validating predictions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e3c46840-0991-4fc0-bbb0-15df33ee242b",
+ "metadata": {},
+ "source": [
+ "![Validate predictions for a token classification dataset](../_static/reference/webapp/annotation_ner.png)\n",
+ "\n",
+ "Nowadays, many pre-trained or zero-shot models are available online via model repositories like the Hugging Face Hub. \n",
+ "Most of the time, you will probably find a model that already suits your specific task to some degree. \n",
+ "In Rubrix, you can pre-annotate your data by including predictions from these models in your records.\n",
+ "Assuming that the model works reasonably well on your dataset, you can filter for records with high prediction scores and validate the predictions.\n",
+ "In this way, you will rapidly annotate part of your data and alleviate the annotation process.\n",
+ "\n",
+ "One downside of this approach is that your annotations will be subject to the possible biases and mistakes of the pre-trained model.\n",
+ "Human annotators who are guided by pre-trained models tend to be influenced by them.\n",
+ "Therefore, it is advisable to avoid pre-annotations when building a rigorous test set for the final model evaluation.\n",
+ "\n",
+ "Check the [introduction tutorial](../tutorials/01-labeling-finetuning.ipynb) to learn to add predictions to the records. \n",
+ "And our [feature reference](../reference/webapp/annotate_records.md#validate-predictions) includes a detailed guide on validating predictions in the Rubrix web app."
+ ]
+ },
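+ {
+ "cell_type": "markdown",
+ "id": "2d3e4f5a-6b7c-4d8e-9f0a-1b2c3d4e5f6a",
+ "metadata": {},
+ "source": [
+ "As a minimal sketch of this workflow (assuming the [transformers](https://huggingface.co/docs/transformers/index) library is installed; the example texts and dataset name are made up for illustration), you could attach predictions from a pre-trained sentiment model to your records like this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3e4f5a6b-7c8d-4e9f-a0b1-c2d3e4f5a6b7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import rubrix as rb\n",
+ "from transformers import pipeline\n",
+ "\n",
+ "# Load a pre-trained sentiment classifier\n",
+ "classifier = pipeline(\"sentiment-analysis\")\n",
+ "\n",
+ "texts = [\"I love this app!\", \"The latest update is disappointing.\"]\n",
+ "\n",
+ "# Attach each model prediction as a (label, score) tuple to its record\n",
+ "records = [\n",
+ "    rb.TextClassificationRecord(text=text, prediction=[(pred[\"label\"], pred[\"score\"])])\n",
+ "    for text, pred in zip(texts, classifier(texts))\n",
+ "]\n",
+ "\n",
+ "rb.log(records, \"pre_annotated_reviews\")"
+ ]
+ },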
+ {
+ "cell_type": "markdown",
+ "id": "c2cbd593-e241-4f27-9a58-6932912ea9f1",
+ "metadata": {},
+ "source": [
+ "### 3. Define rules"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "87f7ea92-d40f-4d09-a0e8-16b7d7867e6e",
+ "metadata": {},
+ "source": [
+ "![Defining a rule for a multi-label text classification task.](../_static/reference/webapp/define_rules_2.png)\n",
+ "\n",
+ "Another approach to annotating your data is to develop heuristic rules tailored to your dataset. \n",
+ "For example, let us assume you want to classify news articles into the categories of *Finance*, *Sports*, and *Culture*. \n",
+ "In this case, a reasonable rule would be to label all articles that include the word \"stock\" as *Finance*. \n",
+ "\n",
+ "It is easy to see how you can quickly annotate vast amounts of data in this way, which we refer to as *weak supervision*. \n",
+ "Rules can get arbitrarily complex and can also include the record's metadata. \n",
+ "One downside of this approach is that it might be challenging to come up with working heuristic rules for some datasets. \n",
+ "Furthermore, rules are rarely 100% precise and often conflict with each other, which must be addressed by so-called label models. \n",
+ "It is usually a trade-off between the amount of annotated data and the quality of the labels.\n",
+ "\n",
+ "Check [our guide](../guides/weak-supervision.ipynb) for an extensive introduction to weak supervision with Rubrix. \n",
+ "Also, check the [feature reference](../reference/webapp/define_rules.md) for the Define rules mode of the web app and our [various tutorials](../tutorials/weak-supervision.md) to see practical examples of weak supervision workflows. "
+ ]
+ }
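+ ,
+ {
+ "cell_type": "markdown",
+ "id": "4f5a6b7c-8d9e-4f0a-b1c2-d3e4f5a6b7c8",
+ "metadata": {},
+ "source": [
+ "As a small sketch (the dataset name is hypothetical), the \"stock\" heuristic from the example above could be defined programmatically with Rubrix's labeling module:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5a6b7c8d-9e0f-4a1b-c2d3-e4f5a6b7c8d9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from rubrix.labeling.text_classification import Rule, WeakLabels\n",
+ "\n",
+ "# Label all records whose text matches the query \"stock\" as *Finance*\n",
+ "rule = Rule(query=\"stock\", label=\"Finance\")\n",
+ "\n",
+ "# Apply the rule (together with any others) to a logged dataset\n",
+ "weak_labels = WeakLabels(rules=[rule], dataset=\"news_articles\")"
+ ]
+ }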
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/index.rst b/docs/index.rst
index cd00ede1be..1fca1defb9 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -164,6 +164,7 @@ You can join the conversation on our Github page and our Github forum.
getting_started/setup&installation
getting_started/concepts
+ getting_started/basics
getting_started/user-management
getting_started/advanced_setup_guides
diff --git a/docs/reference/webapp/annotate_records.md b/docs/reference/webapp/annotate_records.md
index e9d6b2efd4..b6548461c3 100644
--- a/docs/reference/webapp/annotate_records.md
+++ b/docs/reference/webapp/annotate_records.md
@@ -7,34 +7,24 @@ Rubrix's powerful search and filter functionalities, together with potential mod
You can access the _Annotate mode_ via the sidebar of the [Dataset page](dataset.md).
-## Search and filter
-
-![Search and filter for annotation view](../../_static/reference/webapp/filters_all.png)
-
-The powerful search bar allows you to do simple, quick searches, as well as complex queries that take full advantage of Rubrix's [data models](../python/python_client.rst#module-rubrix.client.models).
-In addition, the _filters_ provide you a quick and intuitive way to filter and sort your records with respect to various parameters, including the metadata of your records.
-For example, you can use the **Status filter** to hide already annotated records (_Status: Default_), or to only show annotated records when revising previous annotations (_Status: Validated_).
+## Create labels
-You can find more information about how to use the search bar and the filters in our detailed [search guide](search_records.md) and [filter guide](filter_records.md).
+![Create new label](../../_static/reference/webapp/create_newlabel.png)
-```{note}
-Not all filters are available for all [tasks](../../guides/task_examples.ipynb).
-```
+For the text and token classification tasks, you can create new labels within the _Annotate mode_.
+On the right side of the bulk validation bar, you will find a _"+ Create new label"_ button that lets you add new labels to your dataset.
## Annotate
To annotate the records, the Rubrix web app provides a simple and intuitive interface that tries to follow the same interaction pattern as in the [Explore mode](explore_records.md).
-As the _Explore mode_, the record cards in the _Annotate mode_ are also customized depending on the [task](../../guides/task_examples.ipynb) of the dataset.
+As in the _Explore mode_, the record cards in the _Annotate mode_ are also customized depending on the [task](../../guides/task_examples.ipynb) of the dataset.
### Text Classification
![Multilabel card, validated](../../_static/reference/webapp/textclassification_multilabel.png)
When switching in the _Annotate mode_ for a text classification dataset, the labels in the record cards become clickable and you can annotate the records by simply clicking on them.
-You can also validate the predictions shown in a slightly darker tone by pressing the _Validate_ button:
-
-- for a **single label** classification task, this will be the prediction with the highest percentage
-- for a **multi label** classification task, this will be the predictions with a percentage above 50%
+For multi-label classification tasks, you can also annotate a record with no labels by either validating an empty selection or deselecting all labels.
Once a record is annotated, it will be marked as _Validated_ in the upper right corner of the record card.
@@ -47,15 +37,13 @@ Under the hood, the highlighting takes advantage of the `tokens` information in
You can also remove annotations by hovering over the highlights and pressing the _X_ button.
After modifying a record, either by adding or removing annotations, its status will change to _Pending_ and a _Save_ button will appear.
-You can also validate the predictions (or the absent of them) by pressing the _Validate_ button.
-Once the record is saved or validated, its status will change to _Validated_.
+Once a record is saved, its status will change to _Validated_.
### Text2Text
![Text2Text View](../../_static/reference/webapp/text2text_annotation.png)
For text2text datasets, you have a text box available, in which you can draft or edit an annotation.
-You can also validate or edit a prediction, by first clicking on the _view predictions_ button, and then the _Edit_ or _Validate_ button.
After editing or drafting your annotation, don't forget to save your changes.
## Bulk annotate
@@ -68,12 +56,44 @@ Then you can either _Validate_ or _Discard_ the selected records.
For the text classification task, you can additionally **bulk annotate** the selected records with a specific label, by simply selecting the label from the _"Annotate as ..."_ list.
-## Create labels
+## Validate predictions
-![Create new label](../../_static/reference/webapp/create_newlabel.png)
+In Rubrix you can pre-annotate your data by including model predictions in your records.
+Assuming that the model works reasonably well on your dataset, you can filter for records with high prediction scores, and simply _validate_ their predictions to quickly annotate records.
-For the text and token classification tasks, you can create new labels within the _Annotate mode_.
-On the right side of the bulk validation bar, you will find a _"+ Create new label"_ button that lets you add new labels to your dataset.
+### Text Classification
+
+For this task, model predictions are shown as percentages in the label tags.
+You can validate the predictions shown in a slightly darker tone by pressing the _Validate_ button:
+
+- for a **single label** classification task, this will be the prediction with the highest percentage
+- for a **multi label** classification task, this will be the predictions with a percentage above 50%
+
+### Token Classification
+
+For this task, predictions are shown as underlines.
+You can also validate the predictions (or the absence of them) by pressing the _Validate_ button.
+
+Once the record is saved or validated, its status will change to _Validated_.
+
+### Text2Text
+
+You can validate or edit a prediction, by first clicking on the _view predictions_ button, and then the _Edit_ or _Validate_ button.
+After editing or drafting your annotation, don't forget to save your changes.
+
+## Search and filter
+
+![Search and filter for annotation view](../../_static/reference/webapp/filters_all.png)
+
+The powerful search bar allows you to do simple, quick searches, as well as complex queries that take full advantage of Rubrix's [data models](../python/python_client.rst#module-rubrix.client.models).
+In addition, the _filters_ provide you a quick and intuitive way to filter and sort your records with respect to various parameters, including the metadata of your records.
+For example, you can use the **Status filter** to hide already annotated records (_Status: Default_), or to only show annotated records when revising previous annotations (_Status: Validated_).
+
+You can find more information about how to use the search bar and the filters in our detailed [search guide](search_records.md) and [filter guide](filter_records.md).
+
+```{note}
+Not all filters are available for all [tasks](../../guides/task_examples.ipynb).
+```
## Progress metric