# Rubrix Basics

Here you will find some basic guidelines on how to get started with Rubrix.

## How to upload datasets

In **Rubrix**, a dataset is a [collection of records](https://rubrix.readthedocs.io/en/stable/reference/webapp/dataset.html), each one containing an input text. 

This "collection of records" can differ depending on the **task** to be performed **(Text Classification, Token Classification and Text2Text)**, and might contain features such as:

- Annotations (the labels for each element of a dataset),
- Predictions (the results obtained when a model is applied to a dataset), and/or
- Metadata (reference data to identify elements in a dataset).
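
As a rough illustration of how these pieces fit together, the sketch below models a single text classification record as a plain Python dataclass. The `RecordSketch` class, its field names and the example values are assumptions made up for this illustration; they mirror the concepts listed above but are not Rubrix's own record classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class RecordSketch:
    """Hypothetical stand-in for a record; not part of the Rubrix API."""

    text: str                         # the input text
    annotation: Optional[str] = None  # label assigned by a human
    # (label, score) pairs produced by a model
    prediction: List[Tuple[str, float]] = field(default_factory=list)
    # reference data that helps to identify the record
    metadata: Dict[str, object] = field(default_factory=dict)


# Invented example values, only meant to show the shape of a record.
record = RecordSketch(
    text="The new update keeps crashing.",
    annotation="negative",
    prediction=[("negative", 0.91), ("positive", 0.09)],
    metadata={"source": "app_store", "rating": 2},
)

# The most likely predicted label is the pair with the highest score.
top_label, top_score = max(record.prediction, key=lambda pair: pair[1])
print(top_label, top_score)
```

A record does not need all of these features: a plain dataset would only carry the text, while a preannotated one would also include annotations.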

Rubrix is not only **compatible** with most NLP libraries, but can also work with data in many common formats (CSV, JSON, HuggingFace datasets...).

Let's see how you can upload a dataset to start working with **Rubrix**. After this, you can explore or annotate datasets, apply weak supervision rules, obtain predictions or even train a model.

---

### Text classification

***Regular tasks**: Text Categorization, Sentiment Analysis, Semantic Textual Similarity, Natural Language Inference (NLI)...*

These tasks focus on categorizing sentences or documents into one or more groups. When each text is assigned exactly one category, it is **single-label text classification**; when a text can be assigned more than one, we are talking about **multi-label text classification**. In addition to handling these different tasks, **Rubrix** also provides some interesting features, like the **Define rules mode** or the available **metrics** (see next section).
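
The difference between single-label and multi-label classification can be sketched in a few lines of plain Python. The function names, the category names, the scores and the `0.5` threshold below are all assumptions chosen for this illustration, not something prescribed by Rubrix.

```python
from typing import Dict, List


def single_label(scores: Dict[str, float]) -> str:
    """Single-label: pick exactly one category, the one with the highest score."""
    return max(scores, key=scores.get)


def multi_label(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label: keep every category whose score reaches the threshold."""
    return [label for label, score in scores.items() if score >= threshold]


# Invented per-category scores for one review.
scores = {"performance": 0.8, "design": 0.6, "pricing": 0.1}

print(single_label(scores))
print(multi_label(scores))
```

With the same scores, the single-label decision keeps only one category, while the multi-label decision can keep several of them (or none at all).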

In this example, the chosen [dataset](https://www.kaggle.com/datasets/databar/10k-snapchat-reviews) contains 10K reviews about the Snapchat app from the App Store. This dataset could be used for tasks such as **sentiment analysis** or **text categorization**.

After retrieving the dataset from Kaggle and identifying the column that contains the **text input**, the dataset can be easily uploaded. After this, **100 records** will be available in the **Rubrix UI**.

In [9]:
import pandas as pd
import rubrix as rb

#loading the CSV file into a pandas DataFrame, limited to the first 100 rows
dataset_txt = pd.read_csv("snapchat.csv")[:100]

dataset_txt.head(3) #displaying the first three rows of the dataframe

| Unnamed: 0.1 | Unnamed: 0 | userName | rating | review | isEdited | date | title |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Savvanananahhh | 4 | For the most part I quite enjoy Snapchat it’s ... | False | 10/4/20 6:01 | Performance issues |
| 1 | 1 | Idek 9-101112 | 3 | I’m sorry to say it, but something is definite... | False | 10/14/20 2:13 | What happened? |
| 2 | 2 | William Quintana | 3 | Snapchat update ruined my story organization! ... | False | 7/31/20 19:54 | STORY ORGANIZATION RUINED! |


In [None]:
#renaming the column that contains the text input
data = dataset_txt.rename(columns={"review": "text"}) 
#to be processed by the rb.read_pandas function, the text column must be named "text"

#rubrix is able to read the dataframe and to identify the columns
record_txt = rb.read_pandas(data, task="TextClassification") 

In [None]:
#logging the records
rb.log(record_txt, "snapchat_reviews")

### Token classification

***Regular tasks**: Named Entity Recognition (NER), Part-of-speech tagging, Slot filling...*

The aim of **Token Classification tasks** is to split the text into **tokens** and assign **labels** to them. The splitting step is called **tokenization**, and the resulting tokens are **units of text** such as words or punctuation marks. Rubrix can handle different **token classification tasks**; **Named Entity Recognition (NER)** is one of the most notable, as the UI is particularly useful for this purpose.

This example shows how to tokenize the **input text** from this [Kaggle dataset](https://www.kaggle.com/datasets/mldado/german-online-reviewsratings-of-organic-coffee), which contains reviews of organic coffee in German. After this tokenization, the dataset is ready to be uploaded.

In this case, the **tokenization** is done with **[NLTK](https://www.nltk.org/)**. However, other libraries such as spaCy or [HuggingFace](https://huggingface.co/docs/transformers/main_classes/tokenizer) tokenizers also work for this step. The most important thing is to obtain a **tokenized text**.
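
To make the idea of a **tokenized text** more concrete, here is a minimal sketch that uses only Python's standard `re` module. The `simple_tokenize` function and its regular expression are assumptions made for this illustration; a dedicated tokenizer from one of the libraries above will handle many more cases (abbreviations, hyphenation, and so on).

```python
import re
from typing import List, Tuple


def simple_tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Split text into word and punctuation tokens, keeping character offsets."""
    # \w+ matches a run of letters or digits (Unicode-aware in Python 3);
    # [^\w\s] matches a single punctuation character.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


# First sentence of one of the reviews in the dataset.
text = "Für mich ist dieser Kaffee ideal."
tokens = simple_tokenize(text)

print([token for token, _, _ in tokens])
```

Each token is returned together with its start and end character positions, which is handy later on when labels have to be attached to specific parts of the text.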

In [13]:
import rubrix as rb
import pandas as pd

dataset_tok = pd.read_csv("kaffee_reviews.csv")[:50] 

dataset_tok.head(3) #displaying the dataset to see the columns

| Unnamed: 0.1 | Unnamed: 0 | brand | rating | review |
|---|---|---|---|---|
| 0 | 0 | GEPA Kaffee | 5 | Wenn ich Bohnenkaffee trinke (auf Arbeit trink... |
| 1 | 1 | GEPA Kaffee | 5 | Für mich ist dieser Kaffee ideal. Die Grundvor... |
| 2 | 2 | GEPA Kaffee | 5 | Ich persönlich bin insbesondere von dem Geschm... |


In [5]:
#it is better to leave just the text column
dataset_tok = dataset_tok.drop(['brand', 'rating'], axis=1) 

#renaming the text column in order to upload the dataset to Rubrix
dataset_tok = dataset_tok.rename(columns={"review": "text"}) 

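In [6]:
import nltk

#word_tokenize needs the Punkt tokenizer models (newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")

#creating a new column for saving the tokenized text; the reviews are in German
dataset_tok['tokens'] = dataset_tok['text'].apply(lambda text: nltk.word_tokenize(text, language="german"))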

In [None]:
#rubrix is able to read the dataframe and identify the columns

record_tok = rb.read_pandas(dataset_tok, task="TokenClassification") 

In [None]:
#now the records can be logged into Rubrix
rb.log(record_tok, "coffee-reviews_de")

### Text2Text

***Regular tasks**: Machine translation, Text summarization, Paraphrase generation...*

These tasks are, basically, **text generation tasks**. They normally require a **text input** to provide an **output**, which can be a translation or a summary, for instance.

To generate new text we need a **text input**, so identifying the text in the dataset is key. As this example uses a HuggingFace dataset, the process is slightly different from the previous ones. In this case, the text input will be retrieved with the [map function](https://huggingface.co/docs/datasets/process#map).

This [dataset](https://huggingface.co/datasets/europa_ecdc_tm), intended for **Machine Translation tasks**, contains texts from the European Centre for Disease Prevention and Control (ECDC). Only the chosen **source language** (English) will be uploaded.

In [10]:
import rubrix as rb
from datasets import load_dataset

#retrieving the dataset from HuggingFace, with its language configuration and the desired split
dataset = load_dataset("europa_ecdc_tm", 'en2fr', split="train[0:100]")

dataset.to_pandas().head(3) #converting the HF dataset into a dataframe to read its content



| Unnamed: 0 | translation |
|---|---|
| 0 | {'en': 'Vaccination against hepatitis C is not... |
| 1 | {'en': 'HIV infection', 'fr': 'Infection à VIH'} |
| 2 | {'en': 'The human immunodeficiency virus (HIV)... |


In [9]:
#this function will help the map function to retrieve the text input 
def extract_enphrase(example):
    example['text'] = example['translation']['en'] #English as the source language
    return example

In [10]:
#map applies the function to every example, adding a "text" column with the text input

updated_dataset = dataset.map(extract_enphrase)
updated_dataset['text'][:5] #displaying the first 5 results


['Vaccination against hepatitis C is not yet available.',
 'HIV infection',
 'The human immunodeficiency virus (HIV) remains one of the most important communicable diseases in Europe.',
 'It is an infection associated with serious disease, persistently high costs of treatment and care, significant number of deaths and shortened life expectancy.',
 'HIV is a virus, which attacks the immune system and causes a lifelong severe illness with a long incubation period.']
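
Conceptually, mapping a function over a dataset means applying it to each example in turn. The plain-Python sketch below mimics that idea with a list of dictionaries instead of a HuggingFace dataset; the `extract_source_text` helper is a hypothetical name used only here, and the single example row is taken from the output shown above.

```python
from typing import Dict, List


def extract_source_text(example: Dict) -> Dict:
    """Copy the English side of a translation pair into a new 'text' field."""
    updated = dict(example)  # shallow copy, so the original example keeps its keys
    updated["text"] = example["translation"]["en"]
    return updated


examples: List[Dict] = [
    {"translation": {"en": "HIV infection", "fr": "Infection à VIH"}},
]

# Applying the function example by example is what a per-example map does conceptually.
updated = [extract_source_text(example) for example in examples]
print(updated[0]["text"])
```

The original `translation` field is kept, and the new `text` field holds the source-language text.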

In [None]:
# now the read_datasets function (similar to read_pandas) is able to process the data
ecdc_en = rb.read_datasets(updated_dataset, task="Text2Text") 

#uploading the datasets
rb.log(ecdc_en, "ecdc_en")

## How to annotate datasets

When it comes to annotating records, **Rubrix** offers two ways to do this: manually in the UI, or via the client by uploading an annotated dataset. 
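
For the second option, the data needs to carry its labels before it is uploaded. The sketch below prepares such data in plain Python. The rows are invented, and it assumes that the label column should be named `annotation` in the same way the input column is named `text`; the commented upload call reuses the functions from the previous section with a placeholder dataset name.

```python
# Invented example rows: each review already carries a human-assigned label.
labelled_rows = [
    {"review": "Love the new filters!", "sentiment": "positive"},
    {"review": "The app crashes every time I open it.", "sentiment": "negative"},
]

# Rename the fields so that the input sits under "text" and the label under "annotation".
records = [
    {"text": row["review"], "annotation": row["sentiment"]}
    for row in labelled_rows
]

# With pandas and Rubrix installed, the upload could then look roughly like this
# ("my_annotated_dataset" is a placeholder name):
#   import pandas as pd
#   import rubrix as rb
#   rb.log(rb.read_pandas(pd.DataFrame(records), task="TextClassification"), "my_annotated_dataset")

print(records[0])
```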

### UI annotation

**Rubrix** allows users to **manually annotate** records through its intuitive UI. The annotation workflow is tailored to each task, and the resulting annotations can be used to train a model and obtain predictions. Click [here](https://rubrix.readthedocs.io/en/master/reference/webapp/annotate_records.html#annotate-records) to learn more about **annotation** with Rubrix.

### Text classification

#### Manual annotation

After uploading a dataset (plain or annotated) to **Rubrix**, users can use the UI to **manually annotate** records. 

Taking the previous Text Classification example, users could **create one or more labels** with the **"Create label"** button in order to manually annotate the dataset:

![image](../_static/reference/getting_started/createlabel_text.png)

Once the labels have been created, the dataset can be annotated. There are features like **bulk annotation**, which can ease the workload:

#VIDEO

#### Rules in UI

_Click [here](https://rubrix.readthedocs.io/en/master/reference/webapp/define_rules.html) to read more about the Define rules mode._

The **Define rules mode** is also a good way to quickly annotate this kind of dataset with noisy labels, as it is a "semiautomatic" system. Each rule applies a specific set of labels to the records that match a given query. In addition, some **metrics** are available for each created rule.

This is an easy example. The chosen dataset, available on [Kaggle](https://www.kaggle.com/datasets/ishantjuyal/emotions-in-text), is an annotated dataset for **multi-label text classification**, which deals with texts and different emotions.

#VIDEO
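
The idea behind a rule, a query plus the label it assigns, can be sketched in plain Python. This is only an approximation under stated assumptions: the `keyword_rule` function is hypothetical, it uses simple substring matching rather than the query syntax of the Rubrix UI, and the texts and keywords are invented.

```python
from typing import List, Optional


def keyword_rule(text: str, keywords: List[str], label: str) -> Optional[str]:
    """Return `label` if any keyword occurs in the text, otherwise None (no label)."""
    lowered = text.lower()
    return label if any(keyword in lowered for keyword in keywords) else None


# Invented texts in the spirit of an emotions dataset.
texts = [
    "I am so happy today",
    "This makes me really angry",
    "Nothing special happened",
]

labels = [keyword_rule(text, ["happy", "glad"], "joy") for text in texts]
print(labels)
```

Records that do not match the rule are left without a label, which is why several rules are usually combined and why the resulting labels are considered noisy.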

#### Weak supervision

_Click [here](https://rubrix.readthedocs.io/en/master/guides/weak-supervision.html) to read more about Weak Supervision with Rubrix._

This feature is related to the **Define rules mode**. After saving one or more rules, you can see their information by clicking on the **Manage rules** button.

![image](../_static/reference/getting_started/managerules_text.png)

By using this feature, it is possible to **annotate** quickly and efficiently, and to easily obtain information about the labels.
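
To give an idea of the kind of information that can be derived from a rule, the sketch below computes two common summary figures from a rule's output: coverage and precision. The `rule_metrics` function, the metric definitions and the example labels are assumptions made for this illustration and may differ from what Rubrix actually reports.

```python
from typing import Dict, List, Optional


def rule_metrics(rule_labels: List[Optional[str]], gold_labels: List[Optional[str]]) -> Dict:
    """Summarize one rule: how many records it labels and how often it is right."""
    total = len(rule_labels)
    # Records the rule assigned a label to.
    matched = [i for i, label in enumerate(rule_labels) if label is not None]
    # Only matched records that also have a human annotation can be judged.
    judged = [i for i in matched if gold_labels[i] is not None]
    correct = sum(rule_labels[i] == gold_labels[i] for i in judged)
    return {
        "coverage": len(matched) / total if total else 0.0,
        "correct": correct,
        "incorrect": len(judged) - correct,
        "precision": correct / len(judged) if judged else None,
    }


# Invented labels: None means "no label from the rule" or "not annotated yet".
rule_labels = ["joy", None, "joy", "joy"]
gold_labels = ["joy", "sadness", "anger", None]

metrics = rule_metrics(rule_labels, gold_labels)
print(metrics)
```

In this sketch, coverage is the share of records the rule labels, while precision is computed only over the labelled records that already have an annotation to compare against.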

### Token classification

#### Manual annotation

When dealing with **Token classification** datasets, manual annotation is particularly interesting. Besides **bulk annotating** and creating new labels, the UI makes annotating individual records easy, as the video shows:

<video width="100%" controls><source src=../_static/reference/getting_started/ner_annotation.mp4 type="video/mp4"></video> 

Note that these features are available for both annotated and plain datasets, and that the **Metrics sidebar** can also provide good insights into the data:

![image](../_static/reference/getting_started/token_annotation.png)
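
When token-level annotations are prepared outside the UI, a common way to represent them is as labelled character spans. The sketch below assumes a `(label, start, end)` span format and a hypothetical `PRODUCT` label, and checks whether a span lines up with token boundaries, using a simple regular-expression tokenizer as a stand-in for a real one.

```python
import re
from typing import List, Tuple

Span = Tuple[str, int, int]  # (label, start_char, end_char), an assumed format


def token_offsets(text: str) -> List[Tuple[int, int]]:
    """Character offsets of word and punctuation tokens (simple regex tokenizer)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def is_aligned(span: Span, offsets: List[Tuple[int, int]]) -> bool:
    """A span is aligned if it starts at a token start and ends at a token end."""
    _, start, end = span
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return start in starts and end in ends


text = "Für mich ist dieser Kaffee ideal."
offsets = token_offsets(text)

start = text.index("Kaffee")
good_span = ("PRODUCT", start, start + len("Kaffee"))    # covers the whole token
bad_span = ("PRODUCT", start + 1, start + len("Kaffee"))  # starts inside the token

print(is_aligned(good_span, offsets), is_aligned(bad_span, offsets))
```

Checking alignment before uploading helps to catch spans that cut through the middle of a token.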

### Text2Text

#### Manual annotation

Again, it is possible to annotate both **preannotated** and **plain** datasets for **Text2Text** tasks. In this case, the **annotation** appears in a text box that can be edited:

#VIDEO (TBD)
