# Basics

Here you will find some basic guidelines on how to get started with Rubrix.

## How to upload datasets

In **Rubrix**, a dataset is a [collection of records](../reference/webapp/dataset.md), each one containing an input text. 

This "collection of records" can differ depending on the **task** to be performed **(Text Classification, Token Classification and Text2Text)**, and can contain features such as:

- Annotations (the labels for each element of a dataset),
- Predictions (the results obtained when a model is applied to a dataset),
- Metadata (reference data to identify elements on a dataset). 
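
As a rough illustration of how these three features sit next to the input text, the sketch below describes one record as a plain Python dictionary. The key names `prediction`, `annotation` and `metadata`, as well as all the values, are assumptions made for illustration; please check the Rubrix client reference for the exact record arguments.

```python
# A plain-Python sketch of the information one text classification record can carry.
# The keys are illustrative assumptions, not a guaranteed Rubrix schema.
record_fields = {
    "text": "The app crashes every time I open the camera.",  # made-up input text
    # Prediction: labels proposed by a model, each with a confidence score
    "prediction": [("negative", 0.92), ("positive", 0.08)],
    # Annotation: the label a human assigned to this record
    "annotation": "negative",
    # Metadata: reference data that helps to identify or filter the record
    "metadata": {"source": "app_store", "rating": 1},
}

# The label with the highest score is the model's top prediction
top_label, top_score = max(record_fields["prediction"], key=lambda pair: pair[1])
print(top_label, top_score)  # negative 0.92
```

Keeping predictions and annotations as separate fields is what makes it possible to compare what a model proposed with what a person decided.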

---

First of all, you should understand how Rubrix works. Rubrix's working units are **records**, which are essentially texts.

These texts are part of the aforementioned **datasets**, which can come in many formats **(CSV, JSON, HuggingFace datasets, XML...)**.
Depending on the task and the format, there are different ways to upload these datasets, as we will see further on.
Besides, Rubrix is **compatible** with most NLP libraries, which makes the process even easier.

Let's see how you can upload a dataset to start working with **Rubrix**. 
After this, you can explore or annotate datasets, apply weak supervision rules, obtain predictions or even train a model.

Here is a very simple example.
A **Text Classification record** is created from a sentence and logged into Rubrix:

In [None]:
import rubrix as rb

# This record consists of one simple sentence
record = rb.TextClassificationRecord(text="hello world, this is me")

# Logging the record into rubrix.
rb.log(record, "my_first_record")

![image](../_static/getting_started/first_record.png)

### Text classification

These tasks focus on categorizing sentences or documents into one or more groups. When each text belongs to exactly one category, it is **single-label text classification**; when a text can belong to more than one, we are talking about **multi-label text classification**. In addition to handling different tasks, **Rubrix** also provides some interesting features, like the **Define rules mode** or the available **metrics** (see next section).
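
To make the difference between the two settings concrete, here is a minimal sketch in plain Python. The helper names `single_label` and `multi_label`, the category names, the scores and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def single_label(scores: Dict[str, float]) -> str:
    """Single-label: keep only the best-scoring category."""
    return max(scores, key=scores.get)


def multi_label(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label: keep every category whose score reaches the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Made-up category scores for one review
scores = {"performance": 0.81, "design": 0.64, "pricing": 0.12}

print(single_label(scores))  # performance
print(multi_label(scores))   # ['design', 'performance']
```

In the single-label case exactly one category survives, while in the multi-label case a text may end up with several categories or with none at all.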

In this example, the chosen [dataset](https://www.kaggle.com/datasets/databar/10k-snapchat-reviews) contains 10K reviews of the Snapchat app from the App Store. This dataset (available for download) could be used for tasks such as **sentiment analysis** or **text categorization**.

After retrieving the dataset from Kaggle and identifying the column that contains the **text input**, the dataset can be easily uploaded. After this, **100 records** will be available in the **Rubrix UI**.

In [9]:
import pandas as pd
import rubrix as rb

# Convert the CSV file into a pandas DataFrame, limited to the first 100 rows
dataset_txt = pd.read_csv("snapchat.csv")[:100]

dataset_txt.head(3)  # display the first three rows of the DataFrame

| Unnamed: 0.1 | Unnamed: 0 | userName | rating | review | isEdited | date | title |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Savvanananahhh | 4 | For the most part I quite enjoy Snapchat it’s ... | False | 10/4/20 6:01 | Performance issues |
| 1 | 1 | Idek 9-101112 | 3 | I’m sorry to say it, but something is definite... | False | 10/14/20 2:13 | What happened? |
| 2 | 2 | William Quintana | 3 | Snapchat update ruined my story organization! ... | False | 7/31/20 19:54 | STORY ORGANIZATION RUINED! |


In [None]:
# Rename the column that holds the input text:
# rb.read_pandas expects the input text in a column named "text"
data = dataset_txt.rename(columns={"review": "text"})

# Rubrix reads the DataFrame and maps its columns to record fields
record_txt = rb.read_pandas(data, task="TextClassification")

In [None]:
#logging the records
rb.log(record_txt, "snapchat_reviews")
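
If you want to keep other columns, such as the rating or the date, next to each text, one option is to gather them into a metadata dictionary before building the records. The sketch below uses only the standard library and two made-up rows; the helper `row_to_fields` is hypothetical, and the assumption is that the resulting `text` and `metadata` keys correspond to fields of a Rubrix record.

```python
import csv
import io

# Two made-up rows that mimic the layout of the reviews file
sample_csv = io.StringIO(
    "userName,rating,review,date\n"
    "user_a,4,Mostly enjoyable but a bit slow,10/4/20\n"
    "user_b,2,The last update broke my stories,7/31/20\n"
)


def row_to_fields(row: dict) -> dict:
    """Use the review as input text and keep the remaining columns as metadata."""
    metadata = {key: value for key, value in row.items() if key != "review"}
    return {"text": row["review"], "metadata": metadata}


records = [row_to_fields(row) for row in csv.DictReader(sample_csv)]
print(records[0]["text"])
print(records[0]["metadata"])
```

Note that `csv.DictReader` returns every value as a string, so numeric columns such as the rating would need an explicit conversion if you want to treat them as numbers.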

### Token Classification

The aim of **Token Classification tasks** is to divide the text into **tokens** and assign **labels** to them. This division is called **tokenization**, and the resulting tokens are **units of text**, such as words or punctuation marks. Rubrix can handle different **token classification tasks**, with **Named Entity Recognition (NER)** being one of the most notable, as its UI is particularly useful for this purpose.

This example shows how to tokenize the **input text** from this [Kaggle dataset](https://www.kaggle.com/datasets/mldado/german-online-reviewsratings-of-organic-coffee), which contains reviews of organic coffee in German. After this tokenization, the dataset is ready to be uploaded.

In this case, the **tokenization** has been done with **spaCy**. However, other libraries such as [NLTK](https://www.nltk.org/) or [HuggingFace](https://huggingface.co/docs/transformers/main_classes/tokenizer) also work for this process. The most important thing is to obtain a **tokenized text**.
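
To show what a "tokenized text" looks like without installing any library, here is a minimal sketch based on a regular expression. It is a simplified stand-in, not how spaCy tokenizes; the helpers `simple_tokenize` and `token_offsets` and the example sentence are hypothetical. The character offsets are included on the assumption that labeled spans are usually expressed as positions in the original text.

```python
import re
from typing import List, Tuple


def simple_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Locate each token in the original text as (start, end) character offsets."""
    offsets, cursor = [], 0
    for token in tokens:
        start = text.index(token, cursor)  # search from the end of the previous token
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


text = "Der Kaffee schmeckt gut."  # made-up German sentence
tokens = simple_tokenize(text)
print(tokens)                       # ['Der', 'Kaffee', 'schmeckt', 'gut', '.']
print(token_offsets(text, tokens))  # [(0, 3), (4, 10), (11, 19), (20, 23), (23, 24)]
```

A language-aware tokenizer handles many cases that this pattern does not, such as abbreviations or contractions, which is why a dedicated library is used in the cells that follow.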

In [10]:
import pandas as pd

# Read the csv file
dataframe = pd.read_csv("kaffee_reviews.csv")

# Display the first three rows of the dataset
dataframe.head(3) 

| Unnamed: 0.1 | Unnamed: 0 | brand | rating | review |
| --- | --- | --- | --- | --- |
| 0 | 0 | GEPA Kaffee | 5 | Wenn ich Bohnenkaffee trinke (auf Arbeit trink... |
| 1 | 1 | GEPA Kaffee | 5 | Für mich ist dieser Kaffee ideal. Die Grundvor... |
| 2 | 2 | GEPA Kaffee | 5 | Ich persönlich bin insbesondere von dem Geschm... |


Since Rubrix expects the input text to be in a column named *"text"*, let us simply rename the *"review"* column.

In [11]:
# We want to use the review column as input text, so simply rename it
dataframe = dataframe.rename(columns={"review": "text"}) 

In [12]:
import spacy

# Load a german spaCy model to tokenize our text
nlp = spacy.load("de_core_news_sm")

# Define our tokenize function and apply it to our text
def tokenize(text):
    return [token.text for token in nlp(text)]

dataframe["tokens"] = dataframe["text"].apply(tokenize)

In [8]:
import rubrix as rb

# Create a Rubrix dataset by reading a pandas DataFrame with required columns
dataset = rb.read_pandas(dataframe, task="TokenClassification") 



In [None]:
# Log the dataset to the Rubrix web app
rb.log(dataset, "coffee-reviews_de")

### Text2Text

These tasks are, basically, **text generation tasks**. They normally require a **text input** to provide an **output**, which can be a translation or a summary, for instance. 

To generate new text we need a **text input**, so identifying the text in the dataset is key. As this example uses a HuggingFace dataset, the process is slightly different from the previous ones. In this case, the text input will be extracted with the [map function](https://huggingface.co/docs/datasets/process#map).

This [dataset](https://huggingface.co/datasets/europa_ecdc_tm), intended for **Machine Translation tasks**, contains texts from the European Centre for Disease Prevention and Control (ECDC). Only the chosen **source language** (English) will be uploaded.
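
The effect of mapping a function over a dataset can be pictured with plain Python lists and dictionaries. The sketch below is only an analogy: the sentences are made up, the helper `add_source_text` is hypothetical, and the nested `translation` layout is assumed from the way the examples are accessed in this section.

```python
# Made-up examples, assuming each item stores its sentence pair under "translation"
examples = [
    {"translation": {"en": "Wash your hands.", "fr": "Lavez-vous les mains."}},
    {"translation": {"en": "Vaccines save lives.", "fr": "Les vaccins sauvent des vies."}},
]


def add_source_text(example: dict) -> dict:
    """Return a copy of the example with the English side copied into 'text'."""
    return {**example, "text": example["translation"]["en"]}


# Similar in spirit to mapping over a dataset: the function is called once per example
updated = [add_source_text(example) for example in examples]
print([item["text"] for item in updated])  # ['Wash your hands.', 'Vaccines save lives.']
```

The original examples stay untouched and each mapped example gains a top-level `text` entry, which is the field the upload step looks for.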

In [None]:
from datasets import load_dataset

# Retrieve the dataset from HuggingFace with its language configuration and the desired split
dataset = load_dataset("europa_ecdc_tm", "en2fr", split="train[0:100]")

dataset.to_pandas().head(3)  # convert the HF dataset into a DataFrame to inspect its content

In [9]:
# This function tells map how to extract the input text from each example
def extract_en_phrase(example):
    example['text'] = example['translation']['en']  # English as the source language
    return example

In [None]:
# Apply the function to every example with map

updated_dataset = dataset.map(extract_en_phrase)
updated_dataset['text'][:5]  # display the first 5 results

In [None]:
# now the read_datasets function (similar to read_pandas) is able to process the data
ecdc_en = rb.read_datasets(updated_dataset, task="Text2Text") 

#uploading the dataset
rb.log(ecdc_en, "ecdc_en")

## How to annotate datasets

Rubrix provides several ways to annotate your data. 
With the intuitive Rubrix web app, you can choose between:

1. Manually annotating each record using a dedicated interface for each task type;
2. Leveraging a user-provided model by validating its predictions;
3. Defining heuristic rules to produce "noisy labels", a technique also known as "weak supervision".

Each way has its pros and cons, and the best match largely depends on your individual use case.


### 1. Manual annotations

The straightforward approach of manual annotations might be necessary if you do not have a suitable model for your use case or cannot come up with good heuristic rules for your dataset. 
It can also be a good approach if you have a large annotation workforce at your disposal or require few but unbiased and high-quality labels.

Rubrix tries to make this relatively cumbersome approach as painless as possible. 
Via an intuitive and adaptive UI, its exhaustive search and filter functionalities, and bulk annotation capabilities, Rubrix turns the manual annotation process into an efficient option.  

Look at our dedicated [feature reference](../reference/webapp/annotate_records.md) for a detailed and illustrative guide on manually annotating your dataset with Rubrix.

### 2. Validating predictions

![Validate predictions for a token classification dataset](../_static/reference/webapp/annotation_ner.png)

Nowadays, many pre-trained or zero-shot models are available online via model repositories like the Hugging Face Hub. 
Most of the time, you will probably find a model that already suits your specific dataset task to some degree.
In Rubrix, you can pre-annotate your data by including predictions from these models in your records.
Assuming that the model works reasonably well on your dataset, you can filter for records with high prediction scores and validate the predictions.
In this way, you will rapidly annotate part of your data and speed up the annotation process.
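
The idea of filtering for high prediction scores can be sketched in a few lines of plain Python. In practice this filtering happens in the Rubrix web app; the records, the helper names `top_prediction` and `confident_records`, and the 0.9 cut-off below are made up for illustration.

```python
from typing import List, Tuple

# Each made-up record has a text and a list of (label, score) predictions
records = [
    {"text": "Great update, love it", "prediction": [("positive", 0.97), ("negative", 0.03)]},
    {"text": "It is fine I guess", "prediction": [("positive", 0.55), ("negative", 0.45)]},
    {"text": "Keeps crashing", "prediction": [("negative", 0.91), ("positive", 0.09)]},
]


def top_prediction(record: dict) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    return max(record["prediction"], key=lambda pair: pair[1])


def confident_records(records: List[dict], min_score: float = 0.9) -> List[dict]:
    """Keep only the records whose best prediction reaches min_score."""
    return [record for record in records if top_prediction(record)[1] >= min_score]


for record in confident_records(records):
    label, score = top_prediction(record)
    print(f"{score:.2f} {label:<8} {record['text']}")
```

The two confident records are quick to validate, while the borderline one in the middle is a better candidate for careful manual annotation.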

One downside of this approach is that your annotations will be subject to the possible biases and mistakes of the pre-trained model.
When guided by pre-trained models, it is common to see human annotators get influenced by them.
Therefore, it is advisable to avoid pre-annotations when building a rigorous test set for the final model evaluation.

Check the [introduction tutorial](../tutorials/01-labeling-finetuning.ipynb) to learn how to add predictions to the records.
Our [feature reference](../reference/webapp/annotate_records.md#validate-predictions) also includes a detailed guide on validating predictions in the Rubrix web app.

### 3. Define rules

![Defining a rule for a multi-label text classification task.](../_static/reference/webapp/define_rules_2.png)

Another approach to annotating your data is to develop heuristic rules tailored to your dataset. 
For example, let us assume you want to classify news articles into the categories of *Finance*, *Sports*, and *Culture*. 
In this case, a reasonable rule would be to label all articles that include the word "stock" as *Finance*. 

It is easy to see how you can quickly annotate vast amounts of data in this way, which we refer to as *weak supervision*. 
Rules can get arbitrarily complex and can also include the record's metadata. 
One downside of this approach is that it might be challenging to come up with working heuristic rules for some datasets.
Furthermore, rules are rarely 100% precise and often conflict with each other, which must be addressed by so-called label models. 
It is usually a trade-off between the amount of annotated data and the quality of the labels.
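
To illustrate how rules vote and how conflicts can arise, here is a minimal sketch in plain Python. It does not use the Rubrix rule API: the helpers `keyword_rule` and `weak_label` are hypothetical, and a simple majority vote stands in for a real label model, which would weigh the rules far more carefully.

```python
from collections import Counter
from typing import Callable, List, Optional

Rule = Callable[[str], Optional[str]]


def keyword_rule(keyword: str, label: str) -> Rule:
    """Build a rule that votes for `label` when `keyword` occurs in the text."""
    def rule(text: str) -> Optional[str]:
        return label if keyword in text.lower() else None  # None means "abstain"
    return rule


rules: List[Rule] = [
    keyword_rule("stock", "Finance"),
    keyword_rule("market", "Finance"),
    keyword_rule("match", "Sports"),
]


def weak_label(text: str) -> Optional[str]:
    """Majority vote over the rules that did not abstain (a stand-in for a label model)."""
    votes = [label for label in (rule(text) for rule in rules) if label is not None]
    if not votes:
        return None  # no rule matched, so the record stays unlabeled
    return Counter(votes).most_common(1)[0][0]


print(weak_label("Stock market rallies after the match"))  # Finance
print(weak_label("The match ended in a draw"))             # Sports
print(weak_label("New museum opens downtown"))             # None
```

The first headline triggers conflicting rules and is resolved by the vote, while the last one is not covered by any rule, which shows both the conflict and the coverage problem mentioned above.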

Check [our guide](../guides/weak-supervision.ipynb) for an extensive introduction to weak supervision with Rubrix. 
Also, check the [feature reference](../reference/webapp/define_rules.md) for the Define rules mode of the web app and our various tutorials to see practical examples of weak supervision workflows.