# RUBRIX BASICS

Here you will find some basic guidelines on how to get started with Rubrix.

## HOW TO UPLOAD RECORDS

In **Rubrix**, a dataset is a [collection of records](https://rubrix.readthedocs.io/en/stable/reference/webapp/dataset.html), each of which contains an input text. Records may also carry annotations, predictions, and metadata.
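
To make the parts of a record concrete, here is a minimal sketch using a plain Python dataclass. `SketchRecord` and its field names are hypothetical and only mirror the description above. Rubrix ships its own record classes, and this stand-in is not one of them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class SketchRecord:
    """Hypothetical stand-in for a record; not a Rubrix class."""

    # the input text, always present
    text: str
    # label confirmed by a human, if any
    annotation: Optional[str] = None
    # (label, score) pairs produced by a model, if any
    prediction: List[Tuple[str, float]] = field(default_factory=list)
    # any extra information attached to the record
    metadata: Dict[str, Any] = field(default_factory=dict)


# a record that only has text, as in the upload examples of this guide
raw = SketchRecord(text="i didnt feel humiliated")

# the same kind of record once it carries an annotation and some metadata
labelled = SketchRecord(
    text="i am feeling grouchy",
    annotation="anger",
    metadata={"source": "Emotion_final.csv"},
)

print(raw.annotation)       # None
print(labelled.annotation)  # anger
```

The point of the sketch is that only the text is mandatory. Everything else can be added later, either in the UI or by uploading it with the records.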

Each dataset belongs to one of the **tasks** available in Rubrix (Text Classification, Token Classification, and Text2Text), and the structure of its records differs from task to task. Here is one example per task:


### TEXT CLASSIFICATION

***REGULAR TASKS**: Text Categorization, Sentiment Analysis, Semantic Textual Similarity, Natural Language Inference (NLI)...*

This example shows how to upload records for **Text Classification tasks**. We used a [dataset](https://www.kaggle.com/datasets/databar/10k-snapchat-reviews) from Kaggle, which contains 10K reviews of the Snapchat app from the App Store.

In [12]:
import pandas as pd
import rubrix as rb

#converting the CSV file into a Pandas Dataframe
dataset_txt = pd.read_csv("snapchat.csv") 

dataset_txt.head(3) #displaying the dataframe to see its columns

Unnamed: 0.1,Unnamed: 0,userName,rating,review,isEdited,date,title
0,0,Savvanananahhh,4,For the most part I quite enjoy Snapchat it’s ...,False,10/4/20 6:01,Performance issues
1,1,Idek 9-101112,3,"I’m sorry to say it, but something is definite...",False,10/14/20 2:13,What happened?
2,2,William Quintana,3,Snapchat update ruined my story organization! ...,False,7/31/20 19:54,STORY ORGANIZATION RUINED!


In [None]:
#renaming the column related to the text input
data = dataset_txt.rename(columns={"review": "text"}) 

#rubrix is able to read the dataframe and identify the columns
record_txt = rb.read_pandas(data, task="TextClassification") 

In [None]:
#logging the records
rb.log(record_txt, "snapchat_reviews")

### TOKEN CLASSIFICATION

***REGULAR TASKS**: Named Entity Recognition (NER), Part-of-speech tagging, Slot filling...*

This **Token Classification** example shows how to load a CSV file with German sentences into a dataframe, tokenize the text with the [NLTK library](https://www.nltk.org/), and save the tokens in a new column. We used this [dataset](https://www.kaggle.com/datasets/mldado/german-online-reviewsratings-of-organic-coffee), which contains reviews of organic coffee in German.

In [13]:
import rubrix as rb
import pandas as pd

#reading the CSV file and keeping only the first 50 rows
dataset_tok = pd.read_csv("kaffee_reviews.csv")[:50]

dataset_tok.head(3) #displaying the dataset to see the columns

Unnamed: 0.1,Unnamed: 0,brand,rating,review
0,0,GEPA Kaffee,5,Wenn ich Bohnenkaffee trinke (auf Arbeit trink...
1,1,GEPA Kaffee,5,Für mich ist dieser Kaffee ideal. Die Grundvor...
2,2,GEPA Kaffee,5,Ich persönlich bin insbesondere von dem Geschm...


In [5]:
#using this function to delete unnecessary columns for this task
dataset_tok = dataset_tok.drop(['brand', 'rating'], axis=1) 

#renaming the text column
dataset_tok = dataset_tok.rename(columns={"review": "text"}) 

In [6]:
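import nltk

#word_tokenize relies on NLTK's Punkt models, which have to be downloaded once (e.g. with nltk.download)
#creating a new column for saving the tokenized text; language="german" selects the German sentence splitter
dataset_tok['tokens'] = dataset_tok['text'].apply(lambda text: nltk.word_tokenize(text, language="german"))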
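
If NLTK is not at hand, the shape of the `tokens` column can still be illustrated with a rough stand-in. The sketch below uses a hypothetical helper, `simple_tokenize`, built on a regular expression. It is not equivalent to `nltk.word_tokenize` and only shows what a list of tokens per text looks like.

```python
import re


def simple_tokenize(text):
    """Hypothetical helper: split text into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# one row shaped like the dataframe above, with only the "text" column
rows = [{"text": "Für mich ist dieser Kaffee ideal."}]

# adding a "tokens" entry next to each text, as the notebook does with a new column
for row in rows:
    row["tokens"] = simple_tokenize(row["text"])

print(rows[0]["tokens"])
# ['Für', 'mich', 'ist', 'dieser', 'Kaffee', 'ideal', '.']
```

Whatever tokenizer is used, each record ends up with its text plus the list of tokens derived from it.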

In [None]:
#rubrix is able to read the dataframe and identify the columns

record_tok = rb.read_pandas(dataset_tok, task="TokenClassification") 

In [None]:
rb.log(record_tok, "coffee-reviews_de")

### TEXT2TEXT

***REGULAR TASKS**: Machine translation, Text summarization, Paraphrase generation...*

This example shows how to upload records for **Text2Text tasks**. It uses a [HuggingFace dataset](https://huggingface.co/datasets/europa_ecdc_tm) with texts from the European Centre for Disease Prevention and Control (ECDC), and the [map](https://huggingface.co/docs/datasets/process#map) function to build the `text` column.

In this case, only the chosen **source language** (English) is uploaded, as the **target language** (French) would be the annotations (or the predicted output, depending on the task).

In [None]:
from datasets import load_dataset

dataset = load_dataset("europa_ecdc_tm", 'en2fr', split="train[0:100]")

dataset.to_pandas().head(3)

In [9]:
#copying the English (source) sentence into the "text" column
def extract_enphrase(example):
    example['text'] = example['translation']['en']
    return example

In [10]:
updated_dataset = dataset.map(extract_enphrase)
updated_dataset['text'][:5]

['Vaccination against hepatitis C is not yet available.',
 'HIV infection',
 'The human immunodeficiency virus (HIV) remains one of the most important communicable diseases in Europe.',
 'It is an infection associated with serious disease, persistently high costs of treatment and care, significant number of deaths and shortened life expectancy.',
 'HIV is a virus, which attacks the immune system and causes a lifelong severe illness with a long incubation period.']

In [None]:
ecdc_en = rb.read_datasets(updated_dataset, task="Text2Text") 

rb.log(ecdc_en, "ecdc_en")

## HOW TO ANNOTATE RECORDS

**Rubrix** offers two ways to annotate records: manually in the UI, or by uploading datasets that already contain annotations.

### UI ANNOTATION

**Rubrix** allows users to **manually annotate** records through its UI. The annotation workflow is tailored to each task, and these annotations can be used to train a model and to obtain predictions. You can learn more about **annotation** with Rubrix [here](https://rubrix.readthedocs.io/en/master/reference/webapp/annotate_records.html#annotate-records).

### ANNOTATED DATASETS

If your data is already annotated, you can upload the annotations together with the records. Here you will find a simple example of how to upload annotated records for each task.

### TEXT CLASSIFICATION

In this example, the chosen dataset is available on [Kaggle](https://www.kaggle.com/datasets/ishantjuyal/emotions-in-text). It is an annotated dataset for **multiclass text classification**, in which each text is labelled with one emotion. Taking the emotions as the annotations, both **text** and **annotations** can be uploaded with the `rb.read_pandas` function.

In [13]:
import pandas as pd
import rubrix as rb

#converting the CSV file into a Pandas Dataframe
datasetxt = pd.read_csv("Emotion_final.csv") 

datasetxt.head(5) #displaying the dataframe to see its columns

Unnamed: 0,Text,Emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


In [14]:
#renaming the columns related to the text and the annotations to upload it with Rubrix
emotions = datasetxt.rename(columns={"Text": "text", 
                                   "Emotion": "annotation"}) 

In [None]:
#rubrix now identifies both columns
emotions = rb.read_pandas(emotions, task="TextClassification") 

rb.log(emotions, "emotions_dataset")
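
Before logging an annotated dataset, it can help to check which labels it actually contains. The following is a minimal sketch in plain Python, using three rows copied from the table above. `COLUMN_MAP` and the other variable names are hypothetical and only mirror the renaming step of the notebook.

```python
from collections import Counter

# three rows copied from the dataframe preview above
rows = [
    {"Text": "i didnt feel humiliated", "Emotion": "sadness"},
    {"Text": "im grabbing a minute to post i feel greedy wrong", "Emotion": "anger"},
    {"Text": "i am feeling grouchy", "Emotion": "anger"},
]

# the same renaming as in the notebook: "Text" -> "text", "Emotion" -> "annotation"
COLUMN_MAP = {"Text": "text", "Emotion": "annotation"}
renamed = [{COLUMN_MAP.get(key, key): value for key, value in row.items()} for row in rows]

# counting how often each annotation occurs
label_counts = Counter(row["annotation"] for row in renamed)

print(renamed[0])
# {'text': 'i didnt feel humiliated', 'annotation': 'sadness'}
print(label_counts)
# Counter({'anger': 2, 'sadness': 1})
```

A quick count like this makes misspelled or unexpected labels visible before they end up in the dataset.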

### TOKEN CLASSIFICATION

Here we use [GermaNER](https://huggingface.co/datasets/germaner), a **HuggingFace** dataset for Named Entity Recognition in German. The text is already tokenized, so we only need to tell Rubrix which column holds the **NER tags** to upload the annotated dataset.

In [None]:
from datasets import load_dataset

# a split is necessary to upload the records
dataset_de = load_dataset("germaner", split="train[0:100]")

In [55]:
# showing the first 3 rows
dataset_de.to_pandas().head(3)

Unnamed: 0,id,tokens,ner_tags
0,0,"[Schartau, sagte, dem, "", Tagesspiegel, "", vom...","[3, 8, 8, 8, 1, 8, 8, 8, 8, 3, 8, 8, 8, 8, 8, ..."
1,1,"[Firmengründer, Wolf, Peter, Bree, arbeitete, ...","[8, 3, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ..."
2,2,"[Ob, sie, dabei, nach, dem, Runden, Tisch, am,...","[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 8, 8, 8, ..."


In [None]:
#this mapping leaves both columns unchanged; it only makes explicit which columns read_datasets will use
def data_tokens(example):
    example['tokens'] = example['tokens']
    example['ner_tags'] = example['ner_tags']
    return example

datatok = dataset_de.map(data_tokens)

In [None]:
import rubrix as rb

# as we already have a tokens and a tag column, rubrix can easily read this information
datatok = rb.read_datasets(datatok, task="TokenClassification", tokens="tokens", tags="ner_tags") 

rb.log(datatok, "germa_ner")
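
The integers in `ner_tags` are indices into a list of tag names. The sketch below shows how such indices line up with the tokens, using the first five tokens and tags of the second row in the table above. The label order in `ASSUMED_TAG_NAMES` is an assumption and should be checked against the dataset itself (for instance by inspecting `dataset_de.features`) before relying on it.

```python
# ASSUMPTION: this label order is not taken from the notebook; verify it against the dataset
ASSUMED_TAG_NAMES = ["B-LOC", "B-ORG", "B-OTH", "B-PER", "I-LOC", "I-ORG", "I-OTH", "I-PER", "O"]

# first five tokens and tags of the second row shown above
tokens = ["Firmengründer", "Wolf", "Peter", "Bree", "arbeitete"]
ner_tags = [8, 3, 7, 7, 8]

# translating each integer tag into its (assumed) name
tag_names = [ASSUMED_TAG_NAMES[index] for index in ner_tags]

for token, tag in zip(tokens, tag_names):
    print(f"{token}\t{tag}")
```

Under this assumed order, the three tokens of the person name carry a `B-PER` tag followed by two `I-PER` tags, and the surrounding words are tagged `O`.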

### TEXT2TEXT 

For this example, we are using the same [dataset](https://huggingface.co/datasets/europa_ecdc_tm) as in the previous **Text2Text task**. 

Now, the annotations (which are the **target language**, French) can be easily uploaded by just modifying the previous function:

In [None]:
from datasets import load_dataset

dataset = load_dataset("europa_ecdc_tm", 'en2fr', split="train[0:100]")

In [2]:
# showing the first 3 rows
dataset.to_pandas().head(3)

Unnamed: 0,translation
0,{'en': 'Vaccination against hepatitis C is not...
1,"{'en': 'HIV infection', 'fr': 'Infection à VIH'}"
2,{'en': 'The human immunodeficiency virus (HIV)...


In [3]:
# now we add the column corresponding to the target language
def extract_phrase(example):
    example['text'] = example['translation']['en']
    example['annotation'] = example['translation']['fr']
    return example

In [None]:
updated_dataset = dataset.map(extract_phrase)

In [None]:
import rubrix as rb

ecdc_en_fr = rb.read_datasets(updated_dataset, task="Text2Text") 

rb.log(ecdc_en_fr, "ecdc_en_fr")
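
To see what `extract_phrase` does to a single row without downloading the dataset, here is a minimal sketch in plain Python. It uses the one translation pair that is fully visible in the table above. `apply_map` is a hypothetical stand-in for the `map` call and only imitates its row-by-row behaviour.

```python
# one row shaped like the "translation" column shown above
rows = [{"translation": {"en": "HIV infection", "fr": "Infection à VIH"}}]


def extract_phrase(example):
    # source language (English) becomes the text, target language (French) the annotation
    example["text"] = example["translation"]["en"]
    example["annotation"] = example["translation"]["fr"]
    return example


def apply_map(dataset_rows, function):
    """Hypothetical stand-in for map: apply the function to a copy of every row."""
    return [function(dict(row)) for row in dataset_rows]


updated_rows = apply_map(rows, extract_phrase)

print(updated_rows[0]["text"])        # HIV infection
print(updated_rows[0]["annotation"])  # Infection à VIH
```

Each row keeps its original `translation` field and gains the `text` and `annotation` fields used for the upload.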

## HOW TO ADD MODEL PREDICTIONS

TBD