# Rubrix Cookbook

Yeah, you heard it right! Not a cheatsheet, but a cookbook. A notebook of recipes. 

In this quick guide, we are going to show you how easily Rubrix can be used side by side with some of the most popular AI Python libraries. Rubrix is *agnostic*: it can be used with any library or framework, with no need to implement any interface or modify your existing toolbox and workflows. With these few examples you will be able to start logging and exploring your data for any of these libraries at a glance, and maybe pick up some inspiration if your library of choice is not in this list.

If you miss an AI library in this list, tell us about it at [our GitHub forum](https://github.com/recognai/rubrix/discussions).

## HuggingFace Transformers

[HuggingFace](https://huggingface.co) has given the NLP community many useful tools, and with HuggingFace Transformers, producing cutting-edge models is easier than ever. With a few lines of code we can take a Transformer model from their hub, start making some predictions and then log them into Rubrix.

### Text Classification

Let's try a zero-shot classifier using SqueezeBERT for predicting the topic of a sentence.

In [37]:
import rubrix as rb
from transformers import pipeline

text_input = "I love watching rock climbing competitions!"

# We define our HuggingFace Pipeline
classifier = pipeline(
    "zero-shot-classification",
    model="typeform/squeezebert-mnli",
    framework="pt",
)

# Making the prediction
prediction = classifier(
    text_input,
    candidate_labels=[
        "politics",
        "sports",
        "technology",
    ],
    hypothesis_template="This text is about {}.",
)

# Creating a record object to log into Rubrix.
record = rb.TextClassificationRecord(
    inputs={"text": prediction["sequence"]},
    prediction=list(zip(prediction["labels"], prediction["scores"])),
    prediction_agent="https://huggingface.co/typeform/squeezebert-mnli",
)

# Logging into Rubrix
rb.log(records=record, name="zeroshot-topic-classifier")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


BulkResponse(dataset='zeroshot-topic-classifier', processed=1, failed=0)
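
The `zip` call in the record above relies on the shape of the pipeline output: a dict with parallel `labels` and `scores` lists. Here is a minimal sketch of that pairing step. It uses a hand-written stand-in dict, so the scores are made-up illustrative values rather than real model output, and `label_scores`, `top_label` and `top_score` are names chosen only for this example.

```python
# Hypothetical stand-in for the dict returned by the zero-shot pipeline.
# The scores are made-up illustrative values, not real model output.
prediction = {
    "sequence": "I love watching rock climbing competitions!",
    "labels": ["sports", "technology", "politics"],
    "scores": [0.9, 0.07, 0.03],
}

# Pair every candidate label with its score. This list of (label, score)
# tuples is what gets passed as the `prediction` argument of the record.
label_scores = list(zip(prediction["labels"], prediction["scores"]))

# The most likely label is the pair with the highest score.
top_label, top_score = max(label_scores, key=lambda pair: pair[1])

print(label_scores)
print(top_label, top_score)
```

Logging all the pairs, rather than only the top one, keeps the full score distribution available when exploring the records later.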

### Token Classification

We will explore a pretrained NER classifier for the English language.

In [38]:
import rubrix as rb
from transformers import pipeline

text_input = "My name is Sarah and I live in London"

# We define our HuggingFace Pipeline
classifier = pipeline(
    "ner",
    model="elastic/distilbert-base-cased-finetuned-conll03-english",
    framework="pt",
)

# Making the prediction
predictions = classifier(
    text_input,
)

# Creating a record object to log into Rubrix.
record = rb.TokenClassificationRecord(
    text=text_input,
    tokens=text_input.split(),
    prediction=[(pred["entity"], pred["start"], pred["end"]) for pred in predictions],
    prediction_agent="https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english",
)

# Logging into Rubrix
rb.log(records=record, name="zeroshot-ner")


BulkResponse(dataset='zeroshot-ner', processed=1, failed=0)
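
The record above logs one span per predicted token and keeps the raw tag as the label. If you would rather log one span per entity, the token-level predictions can be merged first. The sketch below assumes the `entity` field carries BIO-style tags such as `B-LOC` and `I-LOC`. The sample predictions are hand-written for a made-up sentence, and `merge_bio_predictions` is a hypothetical helper, not part of Transformers or Rubrix.

```python
def merge_bio_predictions(predictions):
    """Merge token-level BIO predictions into (label, start, end) spans."""
    spans = []
    for pred in predictions:
        prefix, _, label = pred["entity"].partition("-")
        if not label:
            # Tag without a B-/I- prefix: treat it as the start of an entity.
            prefix, label = "B", pred["entity"]
        if spans and prefix == "I" and spans[-1][0] == label:
            # Continuation of the previous entity: extend its end offset.
            prev_label, prev_start, _ = spans[-1]
            spans[-1] = (prev_label, prev_start, pred["end"])
        else:
            spans.append((label, pred["start"], pred["end"]))
    return spans


# Hand-written sample in the shape used above (only the fields we need).
text = "Sarah lives in New York"
predictions = [
    {"entity": "B-PER", "start": 0, "end": 5},
    {"entity": "B-LOC", "start": 15, "end": 18},
    {"entity": "I-LOC", "start": 19, "end": 23},
]

entities = merge_bio_predictions(predictions)
print(entities)
print([text[start:end] for _, start, end in entities])
```

The merged list can be passed as `prediction` in place of the list comprehension used in the record.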

## spaCy

One of the queens and kings of NLP libraries. [spaCy](https://spacy.io) offers industrial-strength Natural Language Processing, with support for 64+ languages, 55 trained pipelines for 17 languages, multi-task learning with pretrained transformers like BERT, pretrained word vectors and much more. Using spaCy together with Rubrix lets you pair these capabilities with the power to monitor your predictions, collect and iterate on ground truth, and build custom applications and dashboards.

### Text Classification

### Token Classification

In [39]:
import rubrix as rb
import spacy

input_text = "Paris a un enfant et la forêt a un oiseau ; l’oiseau s’appelle le moineau ; l’enfant s’appelle le gamin"

# Loading spaCy model
nlp = spacy.load("fr_core_news_sm")

# Creating spaCy doc
doc = nlp(input_text)

# Building TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[token.text for token in doc],
    prediction=[(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents],
    prediction_agent="spacy.fr_core_news_sm",
)

# Logging into Rubrix
rb.log(records=record, name="lesmiserables-ner")

BulkResponse(dataset='lesmiserables-ner', processed=1, failed=0)
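
Entity spans are given as character offsets, while tokens are given as strings, so it can be handy to check that both views agree before logging. The following is a small sanity check written in plain Python. `token_offsets` and `misaligned_spans` are hypothetical helpers, not part of spaCy or Rubrix, and the spans in the example are hand-written.

```python
def token_offsets(text, tokens):
    """Return the (start, end) character offsets of each token in `text`."""
    offsets, cursor = [], 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


def misaligned_spans(text, tokens, spans):
    """Return the (label, start, end) spans that do not fall on token boundaries."""
    offsets = token_offsets(text, tokens)
    starts = {start for start, _ in offsets}
    ends = {end for _, end in offsets}
    return [span for span in spans if span[1] not in starts or span[2] not in ends]


text = "Paris a un enfant"
tokens = ["Paris", "a", "un", "enfant"]

print(misaligned_spans(text, tokens, [("LOC", 0, 5)]))  # aligned span
print(misaligned_spans(text, tokens, [("LOC", 0, 3)]))  # cuts "Paris" in half
```

The check assumes every token appears in the text in order, which holds for the `token.text` values used above.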

## Flair

Developed by Humboldt University of Berlin and friends, Flair is a simple yet powerful state-of-the-art NLP framework. It provides an NLP library, a text embedding library and a PyTorch framework for NLP. [Flair](https://github.com/flairNLP/flair) offers sequence tagging language models in English, Spanish, Dutch, German and many more, and they are also hosted on the [HuggingFace Model Hub](https://huggingface.co/models).

### Text Classification

Flair offers some pretrained classifiers ready to be used, and we are going to use one of them to introduce logging `TextClassificationRecords` with Rubrix. Let's see how to integrate Rubrix with their German offensive language model (we promise not to get very explicit).

In [40]:
import rubrix as rb
from flair.models import TextClassifier
from flair.data import Sentence

input_text = "Du erzählst immer Quatsch."  # something like: "You are always narrating silliness."

# Load the pre-trained German offensive language classifier
classifier = TextClassifier.load("de-offensive-language")

# Creating Sentence object
sentence = Sentence(input_text)

# Make the prediction
classifier.predict(sentence, multi_class_prob=True)

# Creating a record object to log into Rubrix.
record = rb.TextClassificationRecord(
    inputs={"text": input_text},
    prediction=[(pred.value, pred.score) for pred in sentence.labels],
    prediction_agent="de-offensive-language",
)

# Logging into Rubrix
rb.log(records=record, name="german-offensive-language")

2021-05-29 21:50:30,492 loading file /Users/ignaciotalaveracepeda/.flair/models/germ-eval-2018-task-1-v0.5.pt


BulkResponse(dataset='german-offensive-language', processed=1, failed=0)

### Token Classification

Flair offers a lot of tools for Token Classification, supporting tasks like named entity recognition (NER) and part-of-speech tagging (POS), with special support for biomedical data and a growing number of supported languages. Let's see some examples for NER and POS tagging.

#### NER

In this example, we will try the pretrained Dutch NER model from Flair.

In [41]:
import rubrix as rb
from flair.data import Sentence
from flair.models import SequenceTagger

input_text = "De Nachtwacht is in het Rijksmuseum"

# Loading our NER model
tagger = SequenceTagger.load('flair/ner-dutch')

# Creating Sentence object
sentence = Sentence(input_text)

# run NER over sentence
tagger.predict(sentence)

# Building TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[token.text for token in sentence],
    prediction=[(entity.get_labels()[0].value, entity.start_pos, entity.end_pos) for entity in sentence.get_spans('ner')],
    prediction_agent="flair/ner-dutch",
)

# Logging into Rubrix
rb.log(records=record, name="dutch-flair-ner")

2021-05-29 21:50:34,951 loading file /Users/ignaciotalaveracepeda/.flair/models/ner-dutch/fd03fc5c7a02268a538432a010f4d09ec15e55fe70efd02dfea158916fa4cba8.04438768e42ba7d6599cea01fcabf77563c8c7e2b27a245618f0ed535ad8919c


BulkResponse(dataset='dutch-flair-ner', processed=1, failed=0)

#### POS tagging

In the following snippet we will use the multilingual POS tagging model from Flair.

In [42]:
import rubrix as rb
from flair.data import Sentence
from flair.models import SequenceTagger

input_text = "George Washington went to Washington. Dort kaufte er einen Hut."

# Loading our POS tagging model
tagger = SequenceTagger.load("flair/upos-multi")

# Creating Sentence object
sentence = Sentence(input_text)

# Run POS tagging over the sentence
tagger.predict(sentence)

# Building TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[token.text for token in sentence],
    prediction=[
        (entity.get_labels()[0].value, entity.start_pos, entity.end_pos)
        for entity in sentence.get_spans()
    ],
    prediction_agent="flair/upos-multi",
)

# Logging into Rubrix
rb.log(records=record, name="flair-pos-tagging")

2021-05-29 21:50:51,193 loading file /Users/ignaciotalaveracepeda/.flair/models/upos-multi/1a44f168663182024fd3ea6d7dcaeee47fe5bcb537cc737ad058b64ad4db9736.5f899f25846741510a6567b89027d988bd6f634b2776a7c3e834fea4629367cb


BulkResponse(dataset='flair-pos-tagging', processed=1, failed=0)

## Stanza

[Stanza](https://stanfordnlp.github.io/stanza/) is a collection of efficient tools for many NLP tasks and processes, all in one library. It was created and is maintained by the [Stanford NLP Group](https://nlp.stanford.edu). We are going to take a look at a few interactions that can be done with Rubrix.

### Text Classification

Let's start by using a Sentiment Analysis model to log some `TextClassificationRecords`.

In [19]:
import rubrix as rb
import stanza

input_text = (
    "There are so many NLP libraries available, I don't know which one to choose!"
)

# Downloading our model, in case we don't have it cached
stanza.download("en")

# Creating the pipeline
nlp = stanza.Pipeline(lang="en", processors="tokenize,sentiment")

# Analyzing the input text
doc = nlp(input_text)

# This model returns 0 for negative, 1 for neutral and 2 for positive outcome.
# We are going to log them into Rubrix using a dictionary to translate numbers to labels.
num_to_labels = {0: "negative", 1: "neutral", 2: "positive"}


# Build the list of predictions.
# Stanza, at the moment, only outputs the most likely label, without a probability.
# So we will assume Stanza predicts the most likely label with probability 1.0, and the rest with 0.
# Our input text is a single sentence, so we take the sentiment of the first one.
sentence = doc.sentences[0]

entities = []

for key in num_to_labels:
    if key == sentence.sentiment:
        entities.append((num_to_labels[key], 1))
    else:
        entities.append((num_to_labels[key], 0))

# Creating a record object to log into Rubrix.
record = rb.TextClassificationRecord(
    inputs={"text": input_text},
    prediction=entities,
    prediction_agent="stanza/en",
)

# Logging into Rubrix
rb.log(records=record, name="stanza-sentiment")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 28.3MB/s]                    
2021-05-30 10:43:52 INFO: Downloading default packages for language: en (English)...
2021-05-30 10:43:57 INFO: File exists: /Users/ignaciotalaveracepeda/stanza_resources/en/default.zip.
2021-05-30 10:44:05 INFO: Finished downloading models and saved to /Users/ignaciotalaveracepeda/stanza_resources.
2021-05-30 10:44:05 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| sentiment | sstplus  |

2021-05-30 10:44:05 INFO: Use device: cpu
2021-05-30 10:44:05 INFO: Loading: tokenize
2021-05-30 10:44:05 INFO: Loading: sentiment
2021-05-30 10:44:06 INFO: Done loading processors!


BulkResponse(dataset='stanza-sentiment', processed=1, failed=0)
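
The number-to-label translation above can be wrapped in a small reusable helper, which is convenient when a text contains several sentences. This is a minimal sketch in plain Python. `one_hot_prediction` is a hypothetical name, and the sentiment values in the example are hand-picked rather than real Stanza output.

```python
NUM_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def one_hot_prediction(sentiment, num_to_label=NUM_TO_LABEL):
    """Build (label, score) pairs giving 1.0 to the predicted class and 0.0 to the rest."""
    if sentiment not in num_to_label:
        raise ValueError(f"unknown sentiment value: {sentiment!r}")
    return [
        (label, 1.0 if key == sentiment else 0.0)
        for key, label in num_to_label.items()
    ]


# Hand-picked sentiment values, one per sentence of an imaginary document.
sentence_sentiments = [2, 0]
predictions = [one_hot_prediction(value) for value in sentence_sentiments]

for prediction in predictions:
    print(prediction)
```

Each inner list can then be passed as the `prediction` argument of its own `TextClassificationRecord`.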

### Token Classification

Stanza offers many different pretrained language models for Token Classification tasks, and the list keeps growing.

#### POS tagging

We can use one of the many UD models, used for POS tags, morphological features and syntactic relations. UD stands for [Universal Dependencies](https://universaldependencies.org), the framework in which these models have been trained. For this example, let's try to extract POS tags of some Catalan lyrics.

In [24]:
import rubrix as rb
import stanza

# Loading a cool Obrint Pas lyric
input_text = "Viure mantenint viva la flama a través del temps. La flama de tot un poble en moviment" 

# Downloading our model, in case we don't have it cached
stanza.download("ca")

# Creating the pipeline
nlp = stanza.Pipeline(lang="ca", processors="tokenize,mwt,pos")

# Analyzing the input text
doc = nlp(input_text)

# Building TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[word.text for sent in doc.sentences for word in sent.words],
    prediction=[
        (word.pos, token.start_char, token.end_char)
        for sent in doc.sentences
        for token in sent.tokens
        for word in token.words
    ],
    prediction_agent="stanza/ca",
)

# Logging into Rubrix
rb.log(records=record, name="stanza-catalan-pos")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 12.1MB/s]                    
2021-05-30 11:02:04 INFO: Downloading default packages for language: ca (Catalan)...
2021-05-30 11:02:06 INFO: File exists: /Users/ignaciotalaveracepeda/stanza_resources/ca/default.zip.
2021-05-30 11:02:16 INFO: Finished downloading models and saved to /Users/ignaciotalaveracepeda/stanza_resources.
2021-05-30 11:02:16 INFO: Loading these models for language: ca (Catalan):
| Processor | Package |
-----------------------
| tokenize  | ancora  |
| mwt       | ancora  |
| pos       | ancora  |

2021-05-30 11:02:16 INFO: Use device: cpu
2021-05-30 11:02:16 INFO: Loading: tokenize
2021-05-30 11:02:16 INFO: Loading: mwt
2021-05-30 11:02:16 INFO: Loading: pos
2021-05-30 11:02:17 INFO: Done loading processors!


BulkResponse(dataset='stanza-catalan-pos', processed=1, failed=0)
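
With the `mwt` processor, a single token such as the Catalan contraction "del" can expand into several words ("de" + "el") that share the same character offsets. If you prefer one span per surface token, the word-level tags can be joined per token. The sketch below uses plain dicts that mimic only the attributes used in the snippet above (`text`, `start_char`, `end_char`, `words`, `pos`). The data and tags are hand-written for illustration, and joining tags with `+` is just one possible convention, not something Stanza or Rubrix prescribes.

```python
# Plain dicts standing in for Stanza tokens and their words.
# Hypothetical data: the POS tags are illustrative, not real model output.
text = "la flama del poble"
tokens = [
    {"text": "la", "start_char": 0, "end_char": 2,
     "words": [{"text": "la", "pos": "DET"}]},
    {"text": "flama", "start_char": 3, "end_char": 8,
     "words": [{"text": "flama", "pos": "NOUN"}]},
    {"text": "del", "start_char": 9, "end_char": 12,
     "words": [{"text": "de", "pos": "ADP"}, {"text": "el", "pos": "DET"}]},
    {"text": "poble", "start_char": 13, "end_char": 18,
     "words": [{"text": "poble", "pos": "NOUN"}]},
]


def token_level_pos(tokens):
    """Return the token texts and one (tag, start, end) span per token."""
    token_texts = [token["text"] for token in tokens]
    spans = [
        (
            # Join the tags of all words inside the token, e.g. "ADP+DET".
            "+".join(word["pos"] for word in token["words"]),
            token["start_char"],
            token["end_char"],
        )
        for token in tokens
    ]
    return token_texts, spans


token_texts, spans = token_level_pos(tokens)
print(token_texts)
print(spans)
```

This way every entry in the token list is a substring of the original text, and every span covers exactly one token.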

#### NER

Stanza also offers a list of available pretrained models for NER tasks. So, let's try Russian.

In [27]:
import rubrix as rb
import stanza

input_text = "Война и мир - одна из моих любимых книг"  # War and Peace is one of my favourite books

# Downloading our model, in case we don't have it cached
stanza.download("ru")

# Creating the pipeline
nlp = stanza.Pipeline(lang="ru", processors="tokenize,ner")

# Analyzing the input text
doc = nlp(input_text)

# Building TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[word.text for sent in doc.sentences for word in sent.words],
    prediction=[
        (token.ner, token.start_char, token.end_char)
        for sent in doc.sentences
        for token in sent.tokens
    ],
    prediction_agent="stanza/ru",
)

# Logging into Rubrix
rb.log(records=record, name="stanza-russian-ner")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 2.98MB/s]                    
2021-05-30 11:24:03 INFO: Downloading default packages for language: ru (Russian)...
2021-05-30 11:24:06 INFO: File exists: /Users/ignaciotalaveracepeda/stanza_resources/ru/default.zip.
2021-05-30 11:24:13 INFO: Finished downloading models and saved to /Users/ignaciotalaveracepeda/stanza_resources.
2021-05-30 11:24:13 INFO: Loading these models for language: ru (Russian):
| Processor | Package   |
-------------------------
| tokenize  | syntagrus |
| ner       | wikiner   |

2021-05-30 11:24:13 INFO: Use device: cpu
2021-05-30 11:24:13 INFO: Loading: tokenize
2021-05-30 11:24:13 INFO: Loading: ner
2021-05-30 11:24:16 INFO: Done loading processors!


BulkResponse(dataset='stanza-russian-ner', processed=1, failed=0)
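
The record above logs one span per token, including tokens tagged `O`. To log only entity spans, token-level tags can be merged first. The sketch below assumes the tags follow the BIOES scheme (`B-`, `I-`, `E-`, `S-` prefixes plus `O`). The tagged tokens are hand-written for a made-up sentence, and `bioes_to_spans` is a hypothetical helper, not part of Stanza or Rubrix.

```python
def bioes_to_spans(tagged_tokens):
    """Turn (tag, start, end) token triples with BIOES tags into entity spans."""
    spans = []
    current = None  # (label, start) of the entity currently being built
    for tag, start, end in tagged_tokens:
        prefix, _, label = tag.partition("-")
        if prefix == "S" and label:
            # Single-token entity.
            spans.append((label, start, end))
            current = None
        elif prefix == "B" and label:
            # First token of a multi-token entity.
            current = (label, start)
        elif prefix in ("I", "E") and current and current[0] == label:
            if prefix == "E":
                # Last token: close the entity that started at current[1].
                spans.append((label, current[1], end))
                current = None
        else:
            # "O" or a malformed sequence: drop any entity in progress.
            current = None
    return spans


# Hand-written sample: "Leo Tolstoy lived in Russia".
text = "Лев Толстой жил в России"
tagged_tokens = [
    ("B-PER", 0, 3),
    ("E-PER", 4, 11),
    ("O", 12, 15),
    ("O", 16, 17),
    ("S-LOC", 18, 24),
]

entities = bioes_to_spans(tagged_tokens)
print(entities)
print([text[start:end] for _, start, end in entities])
```

With real Stanza output, the triples could be built from `token.ner`, `token.start_char` and `token.end_char`, exactly the attributes used in the record above.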