diff --git a/docs/getting_started/basics.ipynb b/docs/getting_started/basics.ipynb
index c7db6d73ba..a782d6add3 100644
--- a/docs/getting_started/basics.ipynb
+++ b/docs/getting_started/basics.ipynb
@@ -15,7 +15,7 @@
"id": "14d13ee2-ffeb-46fa-9c62-d77c8328e499",
"metadata": {},
"source": [
- "This guide will help you to get started with Rubrix to perform basic tasks such as uploading data or data annotation."
+ "This guide will help you get started with Rubrix to perform basic tasks such as uploading or annotating data."
]
},
{
@@ -904,6 +904,161 @@
"Check [our guide](../guides/weak-supervision.ipynb) for an extensive introduction to weak supervision with Rubrix. \n",
"Also, check the [feature reference](../reference/webapp/define_rules.md) for the Define rules mode of the web app and our [various tutorials](../tutorials/weak-supervision.md) to see practical examples of weak supervision workflows. "
]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f1144ce2-0fe2-48c0-8116-d57dc4429640",
+ "metadata": {},
+ "source": [
+ "## How to prepare your data for training"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c5437bcd-b42c-4f6b-a9f5-d4b45572c648",
+ "metadata": {},
+ "source": [
+ "Once you have uploaded and annotated your dataset in Rubrix, you are ready to prepare it for training a model. Most NLP models today are trained via [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) and need input-output pairs to serve as training examples for the model. The input part of such pairs is usually the text itself, while the output is the corresponding annotation. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a62573a8-54c8-4002-9686-3450ad90c7a3",
+ "metadata": {},
+ "source": [
+ "### Manual extraction"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "73b1923f-b100-4755-aba9-68d7de48d247",
+ "metadata": {},
+ "source": [
+    "The exact data format for training a model depends on your [training framework](#how-to-train-a-model) and the task you are tackling (text classification, token classification, etc.). Rubrix is framework agnostic; you can always manually extract what you need for training from the records. \n",
+    "\n",
+    "The extraction happens with the [client library](../reference/python/python_client.rst) inside a Python script, a Jupyter notebook, or another IDE. First, we load the annotated dataset from Rubrix:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "268ed86e-881d-4196-adc2-ebe01dacb306",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import rubrix as rb\n",
+ "\n",
+ "dataset = rb.load(\"my_annotated_dataset\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d061ca1a-98db-4c31-9362-f816e401c2b5",
+ "metadata": {},
+ "source": [
+    "<div class=\"alert alert-info\">\n",
+    "\n",
+    "Note\n",
+    " \n",
+    "If you follow a weak supervision approach, the steps are slightly different. \n",
+    "We refer you to our [weak supervision guide](../guides/weak-supervision.ipynb) for a complete workflow.\n",
+    " \n",
+    "</div>"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e6d94095-7cac-4810-97fe-c257b8f34a2c",
+ "metadata": {},
+ "source": [
+    "Then we can iterate over the records and extract our training examples. For example, let's assume you want to train a text classifier with a [sklearn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that takes a text as input and outputs a label. "
+ ]
+ },
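+   {
+    "cell_type": "markdown",
+    "id": "b3d1f7a2-4c6e-4d2a-8f1b-9e0c5a7d6e21",
+    "metadata": {},
+    "source": [
+     "As a minimal sketch, such a pipeline could combine a [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) with a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier (the exact components are up to you):"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "7c5e2d90-1a4b-4f3c-9d8e-2b6f1a0c4d35",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from sklearn.feature_extraction.text import TfidfVectorizer\n",
+     "from sklearn.linear_model import LogisticRegression\n",
+     "from sklearn.pipeline import Pipeline\n",
+     "\n",
+     "# A simple text classification pipeline: tf-idf features + logistic regression\n",
+     "sklearn_pipeline = Pipeline([\n",
+     "    (\"tfidf\", TfidfVectorizer()),\n",
+     "    (\"classifier\", LogisticRegression()),\n",
+     "])"
+    ]
+   },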
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d34397c3-5b0d-4151-9945-135f54520f7e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Save the inputs and labels in Python lists\n",
+ "inputs, labels = [], []\n",
+ "\n",
+ "# Iterate over the records in the dataset\n",
+ "for record in dataset:\n",
+ " \n",
+ " # We only want records with annotations\n",
+ " if record.annotation:\n",
+ " inputs.append(record.text)\n",
+ " labels.append(record.annotation)\n",
+ "\n",
+ "# Train the model\n",
+ "sklearn_pipeline.fit(inputs, labels)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dbd8be8e-dd83-4506-8ed8-2f32cf6b5835",
+ "metadata": {},
+ "source": [
+ "### Automatic extraction"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bbf82fc5-d3a5-4a56-b308-caf31a4d763b",
+ "metadata": {},
+ "source": [
+ "For a few frameworks and tasks, Rubrix provides a convenient method to automatically extract training examples in a suitable format from a dataset. \n",
+ "\n",
+    "For example, if you want to train a [transformers](https://huggingface.co/docs/transformers/index) model for text classification, you can load an annotated dataset for text classification and call the `prepare_for_training()` method:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "aa1c2aa8-603b-48a5-8da4-11c7e93f9772",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset = rb.load(\"my_annotated_dataset\")\n",
+ "\n",
+ "dataset_for_training = dataset.prepare_for_training()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c71ee9dc-93b8-4802-83bc-e63048a83073",
+ "metadata": {},
+ "source": [
+ "With the returned `dataset_for_training`, you can continue following the steps to [fine-tune a pre-trained model](https://huggingface.co/docs/transformers/training#finetune-a-pretrained-model) with the [transformers library](https://huggingface.co/docs/transformers/index). \n",
+ "\n",
+ "Check the dedicated [dataset guide](../guides/datasets.ipynb#prepare-dataset-for-training) for more examples of the `prepare_for_training()` method."
+ ]
+ },
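+   {
+    "cell_type": "markdown",
+    "id": "e4a8b6c1-2d3f-4a5b-8c7d-6e9f0a1b2c43",
+    "metadata": {},
+    "source": [
+     "A rough sketch of such a fine-tuning step with the [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) could look like this. Here we assume the returned dataset contains a \"text\" and a \"label\" column, and that the task has two labels; the model name and training arguments are just examples:"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "5f3c1d2e-8a7b-4c6d-9e0f-1a2b3c4d5e67",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments\n",
+     "\n",
+     "tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased\")\n",
+     "model = AutoModelForSequenceClassification.from_pretrained(\"distilbert-base-uncased\", num_labels=2)\n",
+     "\n",
+     "# Tokenize the texts in the prepared dataset\n",
+     "tokenized_dataset = dataset_for_training.map(\n",
+     "    lambda examples: tokenizer(examples[\"text\"], truncation=True, padding=True),\n",
+     "    batched=True,\n",
+     ")\n",
+     "\n",
+     "# Fine-tune the model on the annotated examples\n",
+     "trainer = Trainer(\n",
+     "    model=model,\n",
+     "    args=TrainingArguments(output_dir=\"my_model\"),\n",
+     "    train_dataset=tokenized_dataset,\n",
+     ")\n",
+     "trainer.train()"
+    ]
+   },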
+ {
+ "cell_type": "markdown",
+ "id": "90307acf-ba85-4f8c-86d3-ca398be7a496",
+ "metadata": {},
+ "source": [
+ "## How to train a model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "29cb1351-6324-4faa-9067-fd50785844f5",
+ "metadata": {},
+ "source": [
+    "Rubrix helps you create and curate training data. **It is not a framework for training a model.** Instead, you can use Rubrix alongside other excellent open-source frameworks that focus on developing and training NLP models.\n",
+    "\n",
+    "Here we list three of the most commonly used open-source libraries, but many more are available and may be better suited for your specific use case:\n",
+    "\n",
+    " - [transformers](https://huggingface.co/docs/transformers/index): This library provides thousands of pre-trained models for various NLP tasks and modalities. Its excellent documentation focuses on fine-tuning those models to your specific use case;\n",
+    " - [spaCy](https://spacy.io/): This library also comes with pre-trained models built into pipelines that tackle multiple tasks simultaneously. Since it is a dedicated NLP library, it offers many more NLP features than just model training;\n",
+    " - [scikit-learn](https://scikit-learn.org/stable/): This de facto standard library is a powerful Swiss army knife for machine learning, with some NLP support. Its NLP models usually cannot match the performance of transformers or spaCy, but give them a try if you want to train a lightweight model quickly;\n",
+ " \n",
+ "Check our [cookbook](../guides/cookbook.ipynb) for many examples of how to train models using these frameworks together with Rubrix."
+ ]
}
],
"metadata": {