diff --git a/docs/getting_started/basics.ipynb b/docs/getting_started/basics.ipynb
index c7db6d73ba..a782d6add3 100644
--- a/docs/getting_started/basics.ipynb
+++ b/docs/getting_started/basics.ipynb
@@ -15,7 +15,7 @@
    "id": "14d13ee2-ffeb-46fa-9c62-d77c8328e499",
    "metadata": {},
    "source": [
-    "This guide will help you to get started with Rubrix to perform basic tasks such as uploading data or data annotation."
+    "This guide will help you get started with Rubrix to perform basic tasks such as uploading or annotating data."
    ]
   },
   {
@@ -904,6 +904,161 @@
    "Check [our guide](../guides/weak-supervision.ipynb) for an extensive introduction to weak supervision with Rubrix. \n",
    "Also, check the [feature reference](../reference/webapp/define_rules.md) for the Define rules mode of the web app and our [various tutorials](../tutorials/weak-supervision.md) to see practical examples of weak supervision workflows. "
   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f1144ce2-0fe2-48c0-8116-d57dc4429640",
+   "metadata": {},
+   "source": [
+    "## How to prepare your data for training"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c5437bcd-b42c-4f6b-a9f5-d4b45572c648",
+   "metadata": {},
+   "source": [
+    "Once you have uploaded and annotated your dataset in Rubrix, you are ready to prepare it for training a model. Most NLP models today are trained via [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) and need input-output pairs as training examples. The input is usually the text itself, while the output is the corresponding annotation."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a62573a8-54c8-4002-9686-3450ad90c7a3",
+   "metadata": {},
+   "source": [
+    "### Manual extraction"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "73b1923f-b100-4755-aba9-68d7de48d247",
+   "metadata": {},
+   "source": [
+    "The exact data format for training a model depends on your [training framework](#how-to-train-a-model) and the task you are tackling (text classification, token classification, etc.). Rubrix is framework agnostic; you can always manually extract from the records what you need for training. \n",
+    "\n",
+    "The extraction happens using the [client library](../reference/python/python_client.rst) within a Python script, a Jupyter notebook, or another IDE. First, we have to load the annotated dataset from the Rubrix UI:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "268ed86e-881d-4196-adc2-ebe01dacb306",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import rubrix as rb\n",
+    "\n",
+    "dataset = rb.load(\"my_annotated_dataset\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d061ca1a-98db-4c31-9362-f816e401c2b5",
+   "metadata": {},
+   "source": [
+    "<div class=\"alert alert-info\">\n",
+    "\n",
+    "Note\n",
+    " \n",
+    "If you follow a weak supervision approach, the steps are slightly different. \n",
+    "We refer you to our [weak supervision guide](../guides/weak-supervision.ipynb) for a complete workflow.\n",
+    " \n",
+    "</div>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e6d94095-7cac-4810-97fe-c257b8f34a2c",
+   "metadata": {},
+   "source": [
+    "Then we can iterate over the records and extract our training examples. For example, let's assume you want to train a text classifier with a [sklearn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that takes a text as input and outputs a label. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d34397c3-5b0d-4151-9945-135f54520f7e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save the inputs and labels in Python lists\n",
+    "inputs, labels = [], []\n",
+    "\n",
+    "# Iterate over the records in the dataset\n",
+    "for record in dataset:\n",
+    "    \n",
+    "    # We only want records with annotations\n",
+    "    if record.annotation:\n",
+    "        inputs.append(record.text)\n",
+    "        labels.append(record.annotation)\n",
+    "\n",
+    "# Train the model\n",
+    "sklearn_pipeline.fit(inputs, labels)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dbd8be8e-dd83-4506-8ed8-2f32cf6b5835",
+   "metadata": {},
+   "source": [
+    "### Automatic extraction"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bbf82fc5-d3a5-4a56-b308-caf31a4d763b",
+   "metadata": {},
+   "source": [
+    "For a few frameworks and tasks, Rubrix provides a convenient method to automatically extract training examples in a suitable format from a dataset. \n",
+    "\n",
+    "For example, if you want to train a [transformers](https://huggingface.co/docs/transformers/index) model for text classification, you can load an annotated dataset for text classification and call the `prepare_for_training()` method:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aa1c2aa8-603b-48a5-8da4-11c7e93f9772",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = rb.load(\"my_annotated_dataset\")\n",
+    "\n",
+    "dataset_for_training = dataset.prepare_for_training()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c71ee9dc-93b8-4802-83bc-e63048a83073",
+   "metadata": {},
+   "source": [
+    "With the returned `dataset_for_training`, you can continue following the steps to [fine-tune a pre-trained model](https://huggingface.co/docs/transformers/training#finetune-a-pretrained-model) with the [transformers library](https://huggingface.co/docs/transformers/index). \n",
+    "\n",
+    "Check the dedicated [dataset guide](../guides/datasets.ipynb#prepare-dataset-for-training) for more examples of the `prepare_for_training()` method."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "90307acf-ba85-4f8c-86d3-ca398be7a496",
+   "metadata": {},
+   "source": [
+    "## How to train a model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29cb1351-6324-4faa-9067-fd50785844f5",
+   "metadata": {},
+   "source": [
+    "Rubrix helps you create and curate training data. **It is not a framework for training a model.** You can use Rubrix in combination with other excellent open-source frameworks that focus on developing and training NLP models.\n",
+    "\n",
+    "Here we list three of the most commonly used open-source libraries, but many more are available and may be better suited to your specific use case:\n",
+    "\n",
+    " - [transformers](https://huggingface.co/docs/transformers/index): This library provides thousands of pre-trained models for various NLP tasks and modalities. Its excellent documentation focuses on fine-tuning those models to your specific use case;\n",
+    " - [spaCy](https://spacy.io/): This library also comes with pre-trained models built into a pipeline that tackles multiple tasks simultaneously. Since it is a dedicated NLP library, it offers many more NLP features beyond model training;\n",
+    " - [scikit-learn](https://scikit-learn.org/stable/): This de facto standard library is a powerful Swiss Army knife for machine learning with some NLP support. Its NLP models usually cannot match the performance of transformers or spaCy, but give it a try if you want to train a lightweight model quickly.\n",
+    " \n",
+    "Check our [cookbook](../guides/cookbook.ipynb) for many examples of how to train models using these frameworks together with Rubrix."
+   ]
   }
  ],
 "metadata": {
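The manual-extraction cell in the diff above assumes that `dataset` yields records with `text` and `annotation` attributes, and that a fit-ready `sklearn_pipeline` already exists. As a self-contained sketch of just the filtering step, using a hypothetical `Record` stand-in instead of a live Rubrix server (the real records would come from `rb.load("my_annotated_dataset")`):

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for a Rubrix text classification record;
# only the two attributes the extraction loop touches are modeled.
@dataclass
class Record:
    text: str
    annotation: Optional[str] = None


dataset = [
    Record("great product, works as advertised", "positive"),
    Record("have not tried it yet"),  # unannotated -> skipped
    Record("stopped working after two days", "negative"),
]

# Keep only annotated records as input-output training pairs
inputs, labels = [], []
for record in dataset:
    if record.annotation:
        inputs.append(record.text)
        labels.append(record.annotation)

print(inputs)  # texts of the two annotated records
print(labels)  # ['positive', 'negative']
```

A scikit-learn `Pipeline` (e.g. a `TfidfVectorizer` followed by `LogisticRegression`) could then be fitted on `inputs` and `labels`; that pipeline is what the otherwise undefined `sklearn_pipeline` in the notebook cell stands for.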