# RUBRIX BASICS

Here you will find some basic guidelines on how to get started with Rubrix.

## UPLOADING RECORDS

### TEXT CLASSIFICATION

***REGULAR TASKS**: Text Categorization, Sentiment Analysis, Semantic Textual Similarity, Natural Language Inference (NLI)...*

This example shows how to upload records for **Text Classification tasks**. It uses a [dataset](https://www.kaggle.com/datasets/databar/10k-snapchat-reviews) from Kaggle that contains 10K reviews of the Snapchat app from the App Store.

In [1]:
import pandas as pd
import rubrix as rb

# Load the CSV file into a pandas DataFrame
dataset_txt = pd.read_csv("snapchat.csv")

dataset_txt  # Display the DataFrame to inspect its columns

| Unnamed: 0.1 | Unnamed: 0 | userName | rating | review | isEdited | date | title |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Savvanananahhh | 4 | For the most part I quite enjoy Snapchat it’s ... | False | 10/4/20 6:01 | Performance issues |
| 1 | 1 | Idek 9-101112 | 3 | I’m sorry to say it, but something is definite... | False | 10/14/20 2:13 | What happened? |
| 2 | 2 | William Quintana | 3 | Snapchat update ruined my story organization! ... | False | 7/31/20 19:54 | STORY ORGANIZATION RUINED! |
| 3 | 3 | an gonna be unkown😏 | 5 | I really love the app for how long i have been... | False | 4/22/21 14:10 | The app is great |
| 4 | 4 | gzhangziqi | 1 | This is super frustrating. I was in the middle... | False | 10/2/20 13:58 | Locked me out, customer service not helping |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9555 | 9555 | geekygirl17 | 1 | I used to love using Snapchat and now I hardly... | False | 6/24/19 0:58 | Major issue...not that it will get fixed |
| 9556 | 9556 | changemaker kkdd | 2 | Well, I did deleted it because there was some ... | False | 6/23/19 13:42 | I got then deleted it. |
| 9557 | 9557 | teekay2much | 4 | Every time I upload a photo or video to my sto... | False | 6/3/19 3:35 | Story problem |
| 9558 | 9558 | whoratheexplora | 4 | Love this app, but since he update I can’t upl... | False | 6/3/19 3:26 | Bugs |


In [2]:
# Rename the column that holds the input text to "text"
data = dataset_txt.rename(columns={"review": "text"})

# Rubrix reads the DataFrame and maps its columns to record fields
record_txt = rb.read_pandas(data, task="TextClassification")



In [3]:
# Log the records to the "snapchat_reviews" dataset
rb.log(record_txt, "snapchat_reviews")


9560 records logged to http://localhost:6900/datasets/rubrix/snapchat_reviews


BulkResponse(dataset='snapchat_reviews', processed=9560, failed=0)
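
The column renaming in cell `In [2]` can be pictured without pandas or Rubrix. The following is a minimal sketch using only the standard library, not a description of how `rb.read_pandas` works internally. The sample rows are made up, and `rows_to_record_dicts` is a hypothetical helper. It moves the review text into a `text` key and keeps the remaining columns as metadata.

```python
import csv
import io

# Two made-up rows with a subset of the review CSV's columns (illustrative data only)
SAMPLE_CSV = """userName,rating,review
alice,4,Works well most of the time
bob,1,Keeps crashing after the update
"""


def rows_to_record_dicts(csv_text, text_column="review"):
    """Rename `text_column` to "text" and keep the other columns as metadata.

    Hypothetical helper for illustration; it is not part of Rubrix.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = row.pop(text_column)  # remove the text column from the row...
        records.append({"text": text, "metadata": dict(row)})  # ...and keep the rest
    return records


record_dicts = rows_to_record_dicts(SAMPLE_CSV)
for record in record_dicts:
    print(record)
```

Assuming the installed Rubrix version accepts `text` and `metadata` as keyword arguments of `rb.TextClassificationRecord`, each dictionary could be turned into a record by hand, which is roughly what the DataFrame route spares you from doing.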

### TOKEN CLASSIFICATION

***REGULAR TASKS**: Named Entity Recognition (NER), Part-of-speech tagging, Slot filling...*

This **Token Classification** example shows how to build a DataFrame with sample German sentences, save it to a CSV file, tokenize the text with the [NLTK library](https://www.nltk.org/), and store the tokens in a new column.

In [4]:
import rubrix as rb
import pandas as pd

dataset_tok = pd.DataFrame({'text': ["Er war ein österreichischer Politiker, Bundeskanzler der Republik Österreich und wurde bekannt als 'Staatsvertragskanzler'", 
                                  "Diese deutsche Stadt ist die drittglücklichste der Welt", 
                                  "Eins, zwei, Polizei, drei, vier, Grenadier "],
                           })

# Save the sample sentences to a new CSV file. Without a path, to_csv() only
# returns the CSV as a string; the file name below is an arbitrary example.
dataset_tok.to_csv("german_sentences.csv")
dataset_tok

| Unnamed: 0 | text |
|---|---|
| 0 | Er war ein österreichischer Politiker, Bundesk... |
| 1 | Diese deutsche Stadt ist die drittglücklichste... |
| 2 | Eins, zwei, Polizei, drei, vier, Grenadier |


In [5]:
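import nltk

# word_tokenize relies on NLTK's Punkt tokenizer data, which must be
# downloaded once with nltk.download() before the first run.

# Create a new column that holds the tokenized text
dataset_tok['tokens'] = dataset_tok['text'].apply(nltk.word_tokenize)

dataset_tok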

| Unnamed: 0 | text | tokens |
|---|---|---|
| 0 | Er war ein österreichischer Politiker, Bundesk... | [Er, war, ein, österreichischer, Politiker, ,,... |
| 1 | Diese deutsche Stadt ist die drittglücklichste... | [Diese, deutsche, Stadt, ist, die, drittglückl... |
| 2 | Eins, zwei, Polizei, drei, vier, Grenadier | [Eins, ,, zwei, ,, Polizei, ,, drei, ,, vier, ... |


In [6]:
# Rubrix reads the DataFrame and maps its columns to record fields
record_tok = rb.read_pandas(dataset_tok, task="TokenClassification")

In [7]:
rb.log(record_tok, "deutsch_ner")


3 records logged to http://localhost:6900/datasets/rubrix/deutsch_ner


BulkResponse(dataset='deutsch_ner', processed=3, failed=0)
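
Token-level annotation generally relies on each token being locatable in the original text, so it can help to check this before logging. The following is a minimal, library-free sketch of such an alignment check. `simple_tokenize` and `token_offsets` are hypothetical helpers, and the regular expression is only a rough stand-in for `nltk.word_tokenize`, not a reproduction of it.

```python
import re


def simple_tokenize(text):
    """Split text into word and punctuation tokens.

    Rough stand-in for nltk.word_tokenize, used here only for illustration.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def token_offsets(text, tokens):
    """Locate each token in `text`, scanning left to right.

    Returns a list of (start, end) character offsets and raises ValueError
    if a token cannot be found after the previous one.
    """
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


sentence = "Eins, zwei, Polizei, drei, vier, Grenadier"
tokens = simple_tokenize(sentence)
offsets = token_offsets(sentence, tokens)

print(tokens)
print(offsets[:3])
```

If `token_offsets` raises for a sentence, the tokenizer has produced a token that does not occur verbatim in the text, and that row is worth inspecting before it is uploaded.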

### TEXT2TEXT

***REGULAR TASKS**: Machine translation, Text summarization, Paraphrase generation...*

This example shows how to upload records for **Text2Text tasks**. It uses a [HuggingFace dataset](https://huggingface.co/datasets/bible_para/viewer/en-fr/train) of biblical phrases in English and French, together with the [map](https://huggingface.co/docs/datasets/process#map) function, to extract the text to upload.

In this case, only the text in the chosen **source language** (French) is uploaded, because the text in the **target language** (English) would be the predicted output.

In [8]:
from datasets import load_dataset

# Load the first 100 English-French pairs of the training split
dataset = load_dataset("bible_para", 'en-fr', split="train[0:100]")



In [9]:
# Copy the French phrase of each translation pair into a new "text" field
def extract_frphrase(example):
    example['text'] = example['translation']['fr']
    return example

In [10]:
updated_dataset = dataset.map(extract_frphrase)
updated_dataset['text'][:5]


['Au commencement, Dieu créa les cieux et la terre.',
 'La terre était informe et vide: il y avait des ténèbres à la surface de l`abîme, et l`esprit de Dieu se mouvait au-dessus des eaux.',
 'Dieu dit: Que la lumière soit! Et la lumière fut.',
 'Dieu vit que la lumière était bonne; et Dieu sépara la lumière d`avec les ténèbres.',
 'Dieu appela la lumière jour, et il appela les ténèbres nuit. Ainsi, il y eut un soir, et il y eut un matin: ce fut le premier jour.']

In [13]:
# Convert the dataset into Rubrix Text2Text records
bible_fr_en = rb.read_datasets(updated_dataset, task="Text2Text")

# Log the records to the "bible_fr-en" dataset
rb.log(bible_fr_en, "bible_fr-en")




100 records logged to http://localhost:6900/datasets/rubrix/bible_fr-en


BulkResponse(dataset='bible_fr-en', processed=100, failed=0)
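
The `map` step in cell `In [10]` applies `extract_frphrase` to every row of the dataset. As a minimal sketch of that idea, the snippet below runs the same function over a plain Python list of dictionaries. The two rows are made up and only mimic the shape of the translation pairs, and the list comprehension is a stand-in for `dataset.map`, not its implementation.

```python
def extract_frphrase(example):
    """Copy the French side of a translation pair into a top-level "text" field."""
    example["text"] = example["translation"]["fr"]
    return example


# Made-up rows shaped like the translation pairs (illustrative data only)
rows = [
    {"translation": {"en": "Good morning.", "fr": "Bonjour."}},
    {"translation": {"en": "Thank you.", "fr": "Merci."}},
]

# Apply the function to a shallow copy of every row, leaving `rows` untouched
updated_rows = [extract_frphrase(dict(row)) for row in rows]

print([row["text"] for row in updated_rows])
```

Each updated row keeps its original `translation` field and gains a `text` field, which is the column the Text2Text upload reads from in this example.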