# 🤗 Using Rubrix to explore NLP data with Hugging Face datasets and transformers

In this tutorial, we will walk through the process of using Rubrix to explore NLP datasets in combination with the amazing `datasets` and `transformers` libraries from Hugging Face.


## Introduction

**Our goal is to show you how to store and explore NLP datasets using Rubrix** for use cases like training data management or model evaluation and debugging.

The tutorial is organized into three parts:

1. **Storing and exploring text classification data**: We will use the 🤗 datasets library and Rubrix to store text classification datasets.
2. **Storing and exploring token classification data**: We will use the 🤗 datasets library and Rubrix to store token classification data.
3. **Exploring predictions**: We will use a pretrained 🤗 transformers model and store its predictions in Rubrix to explore and evaluate our pretrained model.

## Install transformers and datasets

In [None]:
#!pip install transformers datasets -qqq
#!pip install tqdm  # for progress bars

## Setup Rubrix

[here we should point the user to the install and setup guide]

By default, Rubrix connects to a local instance (as shown in the setup guide). If you want to specify an API URL and key, you can pass that information via two environment variables: **RUBRIX_API_KEY** and **RUBRIX_API_URL**.
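As a rough illustration of how such environment variables are usually consumed, the sketch below reads both values with the standard library only. The helper name `read_rubrix_settings` and the example URL are hypothetical and not part of Rubrix; `None` simply stands for "not set, fall back to the local default".

```python
import os


def read_rubrix_settings(environ=None):
    """Hypothetical helper: collect the two Rubrix variables, if present."""
    environ = os.environ if environ is None else environ
    return {
        "api_url": environ.get("RUBRIX_API_URL"),  # None -> local setup
        "api_key": environ.get("RUBRIX_API_KEY"),  # None -> no key given
    }


# Example with an explicit mapping instead of the real environment
# (the URL is a made-up placeholder)
settings = read_rubrix_settings({"RUBRIX_API_URL": "https://rubrix.example.com"})
print(settings)
```

Passing a plain dictionary keeps the example independent of whatever is set in your shell.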

In [1]:
import rubrix as rb
from rubrix import TextClassificationRecord, TextClassificationAnnotation, ClassPrediction, TokenClassificationRecord, TokenClassificationAnnotation

# Optional: point the client at a remote instance instead of the local default.
# Never commit a real API key; the value below is a placeholder.
import os
os.environ["RUBRIX_API_URL"] = "https://rubrix-dev.biome.recogn.ai"
os.environ["RUBRIX_API_KEY"] = "<your-api-key>"

rb.init()

## 1. Storing and exploring text classification training data

Rubrix allows you to log and track data for different NLP tasks (such as `Token Classification` or `Text Classification`). 

With Rubrix you can track both training data and predictions from models. In this part, we will focus only on training data. Typically, training data is data which has been curated, supervised or annotated by a human with the goal of training a machine learning model. Other terms for this same concept are: ground-truth data, "gold-standard" data, or even "annotated" data.

In this part of the tutorial, you will learn how to use the 🤗 datasets library for quick exploration of Text Classification and Token Classification training data. This is useful during model development for getting a sense of the data, identifying potential issues, debugging, etc. Here we will use rather static "research" datasets, but Rubrix really shines when you are collecting and using training data in the wild, in other words in real data science projects.

Let's get started!

### Text classification with the `tweet_eval` dataset (Emoji classification)

Text classification is all about predicting which categories a text belongs to. Just as you can quickly tell whether an image shows a dog or a cat, we build NLP models to tell a Jane Austen novel from a Charlotte Brontë poem. We feed the model labeled examples and see how well it learns to predict those same labels.

In this first case, we are going to play with `tweet_eval`, a dataset of tweets from different authors and topics, each paired with the sentiment it conveys. This is, in fact, a very common NLP task called Sentiment Analysis, but with a cool tweak: the sentiments are represented by emojis. Each tweet comes with a number between 0 and 19, and each number stands for a different emoji. You can see them in a cell below or on the [tweet_eval page](https://huggingface.co/datasets/tweet_eval) of the 🤗 Hub.

First of all, we are going to load the dataset from the 🤗 Hub and visualize its content.

In [3]:
from datasets import load_dataset

dataset = load_dataset("tweet_eval", 'emoji', script_version="master")

Reusing dataset tweet_eval (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/tweet_eval/emoji/1.1.0/79e21f7659e902ea14f624232219492d972fe5e0f9d8c94363acc7f916a6be48)


In [5]:
labels = dataset['train'].features['label'].names; labels

['❤',
 '😍',
 '😂',
 '💕',
 '🔥',
 '😊',
 '😎',
 '✨',
 '💙',
 '😘',
 '📷',
 '🇺🇸',
 '☀',
 '💜',
 '😉',
 '💯',
 '😁',
 '🎄',
 '📸',
 '😜']

Usually, datasets are divided into train, validation and test splits, and each one of them is used in a certain part of the training. For now, we can stick to the training split, which usually contains the majority of the instances of a dataset. Let's see what's inside!

In [6]:
with dataset['train'].formatted_as("pandas"):
    print(dataset['train'][:5])

   label                                               text
0     12  Sunday afternoon walking through Venice in the...
1     19  Time for some BBQ and whiskey libations. Chomp...
2      0  Love love love all these people Ô∏è Ô∏è Ô∏è #friends...
3      0                                Ô∏è Ô∏è Ô∏è Ô∏è @ Toys"R"Us
4      2  Man these are the funniest kids ever!! That fa...


Now, we are going to create our records from this dataset and log them into Rubrix. Rubrix comes with `TextClassificationRecord` and `TokenClassificationRecord` classes, which can be created from a dictionary. These objects pass Rubrix information about the model input, the predictions obtained and the annotations made, plus a metadata field for other important details.

In our case, we haven't predicted anything, so we are only going to include the labels of each instance as annotations, since we know they are the ground truth. We will also include each tweet in `inputs`, and specify in the metadata that it belongs to the training split. Once `records` is populated, we can log it with `rb.log()`, specifying the name of our dataset.

In [7]:
records = []
for record in dataset['train']:
    item = TextClassificationRecord(
        inputs={"text": record["text"]},
        annotation=TextClassificationAnnotation(
            labels=[ClassPrediction(class_label=labels[record["label"]])], 
            agent="https://huggingface.co/datasets/tweet_eval"),
        metadata={"split": "train"})
    records.append(item) 


In [8]:
rb.log(records=records, name="tweet_eval_emojis")

BulkResponse(dataset='tweet_eval_emojis', processed=45000, failed=0)

Thanks to the metadata field of the Text Classification Record, we can go all the way and log the tweets from the validation and test splits too, so we can visualize all three splits in the Rubrix UI.
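Since the three splits only differ in the value stored under `split`, the record-building loop can be factored into one function. The sketch below illustrates the idea with plain dictionaries standing in for `TextClassificationRecord`; the helper `build_records`, the toy rows and the toy label names are made up for illustration and are not part of Rubrix or `tweet_eval`.

```python
def build_records(rows, label_names, split):
    """Build one plain-dict record per row (stand-in for TextClassificationRecord)."""
    return [
        {
            "inputs": {"text": row["text"]},
            "annotation": label_names[row["label"]],  # integer id -> label name
            "metadata": {"split": split},
        }
        for row in rows
    ]


# Toy data shaped like the dataset splits (invented for this example)
toy_dataset = {
    "train": [{"text": "Sunny walk", "label": 1}],
    "validation": [{"text": "So funny", "label": 0}],
    "test": [{"text": "Great BBQ", "label": 1}],
}
toy_labels = ["funny", "sunny"]

all_records = []
for split_name, rows in toy_dataset.items():
    all_records.extend(build_records(rows, toy_labels, split_name))

print([record["metadata"]["split"] for record in all_records])
```

The cells that follow keep the explicit per-split loops, which is easier to read in a notebook; a helper like this mainly pays off when the record layout changes often.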

In [9]:
records_validation = []
for record in dataset['validation']:
    item = TextClassificationRecord(
        inputs={"text": record["text"]},
        annotation=TextClassificationAnnotation(
            labels=[ClassPrediction(class_label=labels[record["label"]])], 
            agent="https://huggingface.co/datasets/tweet_eval"),
        metadata={"split": "validation"})
    records_validation.append(item) 

rb.log(records=records_validation, name="tweet_eval_emojis")

BulkResponse(dataset='tweet_eval_emojis', processed=5000, failed=0)

In [7]:
records_test = []
for record in dataset['test']:
    item = TextClassificationRecord(
        inputs={"text": record["text"]},
        annotation=TextClassificationAnnotation(
            labels=[ClassPrediction(class_label=labels[record["label"]])], 
            agent="https://huggingface.co/datasets/tweet_eval"),
        metadata={"split": "test"})
    records_test.append(item) 

rb.log(records=records_test, name="tweet_eval_emojis")

Exception: Connection error: API is not responding. The API answered with a 500 code: None

### Natural language inference with the `MRPC` dataset

Natural Language Inference (NLI) is also a very common NLP task, but a little bit different from regular Text Classification. In NLI, the model receives a premise and a hypothesis, and it must figure out whether the hypothesis is true given the premise. We have three categories: entailment (true), contradiction (false) or neutral (undetermined or unrelated). With the premise *"We live in a flat planet called Earth"*, the hypothesis *"The Earth is flat"* must be classified as entailment, as it is stated in the premise. NLI works under a sort of closed-world assumption: anything not stated in the premise cannot be assumed from real-world knowledge.

Another key difference from Text Classification is that the inputs come in pairs of sentences or texts rather than as a single text. Text Classification treats its input as one cohesive unit, while NLI treats its input as a pair and looks for the relation between the two.

To play around with NLI we are going to use the MRPC task of the [GLUE benchmark](https://huggingface.co/datasets/glue) from the 🤗 Hub. GLUE is a well-known benchmark for NLP, and it gives us direct access to the Microsoft Research Paraphrase Corpus, a corpus of sentence pairs taken from online news.


In [11]:
from datasets import load_dataset
dataset = load_dataset('glue', 'mrpc', split='train')

Reusing dataset glue (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)


In [12]:
dataset[0]

{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

We can see the two input sentences instead of one. In order to simplify the workflow, let's just test if they are equivalent or not.

In [14]:
labels = dataset.features['label'].names ; labels

['not_equivalent', 'equivalent']

Populating our record list follows the same procedure as in Text Classification, adapting our input to the new scenario of pairs.

In [14]:
records= []
for record in dataset:
    item = TextClassificationRecord(
        inputs={"sentence1": record["sentence1"], "sentence2": record["sentence2"]},
        annotation=TextClassificationAnnotation(
            labels=[ClassPrediction(class_label=labels[record["label"]])], 
            agent="https://huggingface.co/datasets/glue"),
        metadata={"split": "train"})
    records.append(item) 

In [15]:
rb.log(records=records[0:10], name="mrpc")

Exception: Connection error: API is not responding. The API answered with a 500 code: None

#### TODO: exploration 
[we need to define the best way to show the webapp using these use cases, show basic features, etc.: short videos? screenshots? demo instance?]

### Multilabel text classification with `go_emotions` dataset

Another task similar to Text Classification, yet a bit different, is Multilabel Text Classification. There is just one key difference: more than one label may be predicted. While in a regular Text Classification task we may decide that the tweet *"I can't wait to travel to Egypt and visit the pyramids"* fits the hashtag **#Travel**, which is accurate, in Multilabel Text Classification we can assign it more than one hashtag, like **#Travel #History #Africa #Sightseeing #Desert**.

In Text Classification, the category with the highest model score becomes the predicted category. In this task, instead, we need to establish a threshold, a value between 0 and 1, above which a label counts as predicted. If we set it to 0.5, only categories with a probability higher than 0.5 will be considered predictions.
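The difference between the two decision rules can be sketched in a few lines of plain Python. The scores below are invented for illustration, and `select_labels` is a hypothetical helper, not a Rubrix or transformers function.

```python
def select_labels(scores, threshold=0.5):
    """Return every label whose score is strictly above the threshold."""
    return [label for label, score in scores.items() if score > threshold]


# Made-up scores for illustration only
scores = {"Travel": 0.91, "History": 0.64, "Sports": 0.08}

# Multilabel: keep everything above the threshold
multilabel = select_labels(scores, threshold=0.5)

# Single-label: keep only the highest-scoring category
single_label = max(scores, key=scores.get)

print(multilabel)
print(single_label)
```

Raising the threshold makes the multilabel prediction stricter; with a threshold of 0.7 only `Travel` would remain in this example.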

To get used to this task and see how we can log data to Rubrix, we are going to use the [go_emotions dataset](https://huggingface.co/datasets/go_emotions) from the 🤗 Hub, with comments from different Reddit forums and their associated emotions (this experiment would also be considered Sentiment Analysis).

In [11]:
from datasets import load_dataset

dataset = load_dataset('go_emotions')

No config specified, defaulting to: go_emotions/simplified
Reusing dataset go_emotions (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e)


Here's an example of an instance of the dataset, followed by the ordered list of labels. Each label is represented in the dataset as a number, but we will translate it to its name before logging to Rubrix, to see things more clearly.

In [12]:
dataset['train'][0]

{'id': 'eebbqej',
 'labels': [27],
 'text': "My favourite food is anything I didn't have to cook myself."}

In [4]:
labels = dataset['train'].features['labels'].feature.names; labels

['admiration',
 'amusement',
 'anger',
 'annoyance',
 'approval',
 'caring',
 'confusion',
 'curiosity',
 'desire',
 'disappointment',
 'disapproval',
 'disgust',
 'embarrassment',
 'excitement',
 'fear',
 'gratitude',
 'grief',
 'joy',
 'love',
 'nervousness',
 'optimism',
 'pride',
 'realization',
 'relief',
 'remorse',
 'sadness',
 'surprise',
 'neutral']

Now, we need to add a confidence value to our annotation, from 0 to 1. As these are all ground truths, we consider they have the maximum probability.

In [13]:
records= []
for record in dataset['train']:
    item = TextClassificationRecord(
        inputs={"text": record["text"]},
        annotation=TextClassificationAnnotation(
            labels=[ClassPrediction(class_label=labels[cls], confidence= 1) for cls in record['labels']], 
            agent="https://huggingface.co/datasets/go_emotions"),
        metadata={"split": "train"})
    records.append(item) 

And logging is just as easy as before!

In [14]:
rb.log(records=records[0:10], name="go_emotions")

Exception: Connection error: API is not responding. The API answered with a 500 code: None

## 2. Storing and exploring token classification training data

In this second part, we will cover Token Classification while still using the 🤗 datasets library. This kind of NLP task aims to divide the input text into words or syllables and assign certain values to them. Think about giving each word in a sentence its grammatical category, or highlighting which parts of a medical report belong to a certain speciality.

We are going to cover a few cases using 🤗 datasets, and see how `TokenClassificationRecord` allows us to log data in Rubrix in a similar fashion.

### Named-Entity Recognition with `wnut17` dataset

Named-Entity Recognition (NER) seeks to locate and classify named entities mentioned in unstructured text into predefined categories. What's powerful about NER is that these predefined categories can be whatever we want: grammatical categories (to be the best at syntax analysis in our English class), person names, organizations, or even medical codes.

For this case, we are going to use the [WNUT 17 dataset](https://huggingface.co/datasets/wnut_17) from the 🤗 Hub, which is about rare entities in written text. Take for example the tweet “so.. kktny in 30 mins?”: even human experts find the entity kktny hard to detect and resolve. This task evaluates the ability to detect and classify novel, emerging, singleton named entities in written text.

As always, let's first dive into the data and see what it looks like.

In [70]:
from datasets import load_dataset

dataset = load_dataset("wnut_17", split="train")

Reusing dataset wnut_17 (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/wnut_17/wnut_17/1.0.0/983205ce50100b6e8fce4b3d402f36dce9b206f736e1a630c78fb25e1d23b9e8)


In [71]:
dataset[0]

{'id': '0',
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  8,
  8,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'tokens': ['@paulwalk',
  'It',
  "'s",
  'the',
  'view',
  'from',
  'where',
  'I',
  "'m",
  'living',
  'for',
  'two',
  'weeks',
  '.',
  'Empire',
  'State',
  'Building',
  '=',
  'ESB',
  '.',
  'Pretty',
  'bad',
  'storm',
  'here',
  'last',
  'evening',
  '.']}

We can see a list of tags and the tokens they refer to. This example contains the following rare entities.

In [72]:
for entity, token in zip(dataset[0]["ner_tags"], dataset[0]["tokens"]):
    if entity != 0:
        print(f"""{token}: {dataset.features["ner_tags"].feature.names[entity]}""")

Empire: B-location
State: I-location
Building: I-location
ESB: B-location


So, it makes a lot of sense to translate these integer tags into NER tag names, which are much more self-explanatory.

In [73]:
dataset = dataset.map(lambda instance: {"ner_tags_translated": [dataset.features["ner_tags"].feature.names[tag] for tag in instance["ner_tags"]]})

Loading cached processed dataset at /Users/ignaciotalaveracepeda/.cache/huggingface/datasets/wnut_17/wnut_17/1.0.0/983205ce50100b6e8fce4b3d402f36dce9b206f736e1a630c78fb25e1d23b9e8/cache-3ed11e136ed38c6e.arrow


What we did is apply a mapping function over the 🤗 dataset, which allows us to make changes to every instance of the dataset. The very same instance that we printed before is much more readable now.

In [74]:
dataset[0]

{'id': '0',
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  8,
  8,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'ner_tags_translated': ['O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-location',
  'I-location',
  'I-location',
  'O',
  'B-location',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O'],
 'tokens': ['@paulwalk',
  'It',
  "'s",
  'the',
  'view',
  'from',
  'where',
  'I',
  "'m",
  'living',
  'for',
  'two',
  'weeks',
  '.',
  'Empire',
  'State',
  'Building',
  '=',
  'ESB',
  '.',
  'Pretty',
  'bad',
  'storm',
  'here',
  'last',
  'evening',
  '.']}

Info about the meaning of the tags is available [here](https://huggingface.co/datasets/viewer/?dataset=wnut_17), but to sum up, *Empire* and *ESB* have been classified as **B-location**, the beginning of a location name, and *State* and *Building* as **I-location**, the continuation or end of a location name.

We need to transform this information a bit to provide an entity annotation, and for that we will use `TokenClassificationAnnotation` objects. We can pass them entities, agents, scores and additional information. In our case, we will focus on populating an entity list for each record in the dataset, with the following structure:

```python
# Example values taken from the first wnut_17 record ("Empire State Building")
{
    "start": 64,          # character offset where the entity starts in the text
    "end": 85,            # character offset just past the entity's last character (exclusive)
    "start_token": 14,    # index of the entity's first token in the token list
    "end_token": 17,      # index just past the entity's last token (exclusive)
    "label": "location",  # entity label, without the B-/I- prefix
}
```

Let's create a function that transforms our dataset records into entities. It looks a bit convoluted, but don't worry! All it does is collect the entity information shown above.

In [75]:
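def parse_entities(record):
    """Turn the BIO tags of a record into a list of entity spans.

    Offsets assume the raw text is " ".join(record["tokens"]);
    "end" and "end_token" are exclusive.
    """
    tags = record["ner_tags_translated"]
    tokens = record["tokens"]

    entities = []
    counter = 0  # character offset of token i in the joined text
    for i in range(len(tags)):

        if tags[i][0] == 'B':
            label = tags[i][2:]

            # Extend the span over the I- tags that follow the B- tag
            j = i + 1
            inner_counter = counter + len(tokens[i]) + 1
            while j < len(tags) and tags[j][0] == 'I':
                inner_counter += len(tokens[j]) + 1
                j += 1

            # Appending after the loop also keeps entities that end the sentence
            entities.append({
                "start": counter,
                "end": inner_counter - 1,
                "start_token": i,
                "end_token": j,
                "label": label})

        counter += len(tokens[i]) + 1

    return entities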


Let's proceed and create a list of records to log.

In [76]:
records = []

for record in dataset:

    entities = parse_entities(record)

    item = TokenClassificationRecord(
         annotation= TokenClassificationAnnotation(
             agent= "https://huggingface.co/datasets/wnut_17",
             entities= entities
         ),
        raw_text=" ".join(record["tokens"]),
        tokens = record["tokens"],
        metadata= {"split": "train"})

    records.append(item)

In [77]:
records[0]

TokenClassificationRecord(tokens=['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.'], raw_text="@paulwalk It 's the view from where I 'm living for two weeks . Empire State Building = ESB . Pretty bad storm here last evening .", prediction=None, annotation=TokenClassificationAnnotation(agent='https://huggingface.co/datasets/wnut_17', entities=[{'start': 64, 'end': 85, 'start_token': 14, 'end_token': 17, 'label': 'location'}, {'start': 88, 'end': 91, 'start_token': 18, 'end_token': 19, 'label': 'location'}], score=None), id=None, metadata={'split': 'train'}, status=None, event_timestamp=None)

In [78]:
rb.log(records=records[0:10], name="ner_wnut_17")

BulkResponse(dataset='ner_wnut_17', processed=10, failed=0)
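A quick way to sanity-check entity offsets before logging is to slice the raw text and the token list with them and confirm both views select the same span. The sketch below does this with the tokens and entities of the first record printed above; it runs without Rubrix and reads `end` and `end_token` as exclusive bounds, which is what those printed values imply.

```python
tokens = ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m",
          'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building',
          '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
raw_text = " ".join(tokens)

# Entities copied from the record output above
entities = [
    {"start": 64, "end": 85, "start_token": 14, "end_token": 17, "label": "location"},
    {"start": 88, "end": 91, "start_token": 18, "end_token": 19, "label": "location"},
]

spans = []
for entity in entities:
    by_chars = raw_text[entity["start"]:entity["end"]]
    by_tokens = " ".join(tokens[entity["start_token"]:entity["end_token"]])
    assert by_chars == by_tokens  # both views must select the same span
    spans.append(by_chars)
    print(f"{entity['label']}: {by_chars}")
```

If the assertion fails for some record, the character offsets and token indices have drifted apart, usually because the raw text was not built by joining the tokens with single spaces.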

### Part of speech tagging with `conll2003` dataset

Another cool token-level NLP task is Part-of-Speech tagging (POS tagging). Here we identify nouns, verbs, adverbs, adjectives and so on, based on the context and the meaning of the words. It is a little trickier than having a huge dictionary where we can look up that *drink* is a verb and *dog* is a noun. Many words change their grammatical category depending on the context of the sentence, and this is where AI comes to save the day.

With just our dictionary and a regular script, *dogs* in `The sailor dogs the hatch.` would be classified as a noun, because *dog* is a noun, right? A trained NLP model would step up and say *No! That is a very common example to illustrate the ambiguity of words. It is a verb!* Or maybe it would just say *verb*. That's up to you.

In this [dataset](https://huggingface.co/datasets/conll2003) from the 🤗 Hub, we will see how each sentence comes with POS and NER tags, and how we can log the POS tag information into Rubrix.

In [47]:
from datasets import load_dataset

dataset = load_dataset("conll2003", split="train")

Reusing dataset conll2003 (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63ba56944e35c1943434322a07ceefd79864672041b7834583709af4a5de4664)


In [48]:
dataset[0]

{'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'id': '0',
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.']}

Each POS and NER tag is represented by a number. In `dataset.features` we can see which tag each number refers to (this [link](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) may help you look up their meaning).

In [65]:
dataset.features


{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'pos_tags': Sequence(feature=ClassLabel(num_classes=47, names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], names_file=None, id=None), length=-1, id=None),
 'chunk_tags': Sequence(feature=ClassLabel(num_classes=23, names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], names_file=None, id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], 
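To make the lookup concrete, here is a small sketch that decodes the `pos_tags` of the first record by hand. The list of names is copied from the `dataset.features` output above, so the example runs without downloading anything.

```python
# POS tag names, copied from the dataset.features output
pos_names = ['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD',
             'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN',
             'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB',
             'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN',
             'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']

# First record of the train split, as printed earlier
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
pos_tags = [22, 42, 16, 21, 35, 37, 16, 21, 7]

# Pair every token with the name of its tag
decoded = [(token, pos_names[tag]) for token, tag in zip(tokens, pos_tags)]
for token, tag_name in decoded:
    print(f"{token}: {tag_name}")
```

In the notebook itself the same names are available as `dataset.features["pos_tags"].feature.names`, which is what the function below uses.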

The following function will help us create the entities.

In [63]:
def parse_entities_POS(record):

    entities = []
    counter = 0

    for i in range(len(record["pos_tags"])):
        
        entities.append({
            "start": counter,
            "end": counter + len(record["tokens"][i]),
            "label": dataset.features["pos_tags"].feature.names[record["pos_tags"][i]],
        })

        counter += len(record["tokens"][i]) + 1
        
    return entities

In [66]:
records = []

for record in dataset:

    entities = parse_entities_POS(record)

    item = TokenClassificationRecord(
         annotation= TokenClassificationAnnotation(
             agent= "https://huggingface.co/datasets/conll2003",
             entities= entities
         ),
        raw_text=" ".join(record["tokens"]),
        tokens = record["tokens"],
        metadata= {"split": "train"})

    records.append(item)

In [69]:
rb.log(records=records[0:10], name="conll2003")

BulkResponse(dataset='conll2003', processed=10, failed=0)

And so it is done! We have logged data from 5 different types of experiments, which can now be visualized in the Rubrix UI.

## 3. Exploring predictions

In this third part of the tutorial we are going to focus on loading a lot of predictions and annotations into Rubrix and visualizing them in the UI. This process is essential in any ML project, and Rubrix lets us play with the data in many different ways: filtering by predicted class, by annotated class, by split, selecting the records that were wrongly classified, etc.

### AG News and zero-shot classification

To explore some logged data in the Rubrix UI, we are going to predict the topic of some news articles with a zero-shot classifier (which we don't need to train), and compare the predicted category with the ground truth. The dataset we are going to use in this part is [ag_news](https://huggingface.co/datasets/ag_news), drawn from a collection of over 1 million news articles written in English.

First of all, as always, we are going to load the dataset from the 🤗 Hub and visualize its content.

In [2]:
from datasets import load_dataset  

dataset = load_dataset("ag_news", split='test[:20%]') # 20% is over 1500 records

Using custom data configuration default
Reusing dataset ag_news (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/ag_news/default/0.0.0/fb5c5e74a110037311ef5e904583ce9f8b9fbc1354290f97b4929f01b3f48b1a)


In [3]:
dataset[0]

{'label': 2,
 'text': "Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."}

In [4]:
dataset.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=4, names=['World', 'Sports', 'Business', 'Sci/Tech'], names_file=None, id=None)}

This dataset has articles from four different classes, so we can define a category list, which may come in handy.

In [5]:
categories = ['World', 'Sports', 'Business', 'Sci/Tech']

Now, it's time to load our zero-shot classification model. We present two options:

1. [DistilBart-MNLI](https://huggingface.co/valhalla/distilbart-mnli-12-1)
2. [BERT-tiny](https://huggingface.co/prajjwal1/bert-tiny)

With the first model, the obtained results are probably going to be better, but it is a larger model, which could take longer to use. We are going to stick with the first one, but feel free to change it, and even to compare them!

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
  
tokenizer = AutoTokenizer.from_pretrained("valhalla/distilbart-mnli-12-1")
model = AutoModelForSequenceClassification.from_pretrained("valhalla/distilbart-mnli-12-1")

#tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
#model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-tiny")

pl = pipeline('zero-shot-classification', model=model, tokenizer=tokenizer)

Let's try to make a quick prediction and take a look.

In [7]:
pl(dataset[0]['text'], ['World', 'Sports', 'Business', 'Sci/Tech'], hypothesis_template='This example is {}.',multi_class=False)

{'sequence': "Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
 'labels': ['Business', 'World', 'Sports', 'Sci/Tech'],
 'scores': [0.765126645565033,
  0.14847052097320557,
  0.04636916145682335,
  0.040033720433712006]}
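The pipeline returns every candidate label with its score. To compare against the ground truth we only need the highest-scoring one, as in this minimal sketch. The output dictionary and the label id are copied from the cells above, so no model is needed to run it.

```python
categories = ['World', 'Sports', 'Business', 'Sci/Tech']

# Pipeline output and ground-truth label id copied from the cells above
model_output = {
    'labels': ['Business', 'World', 'Sports', 'Sci/Tech'],
    'scores': [0.765126645565033, 0.14847052097320557,
               0.04636916145682335, 0.040033720433712006],
}
true_label = categories[2]  # dataset[0]['label'] is 2

# Pick the label with the highest score without relying on the output order
scores = model_output['scores']
best_index = max(range(len(scores)), key=scores.__getitem__)
predicted_label = model_output['labels'][best_index]

is_correct = predicted_label == true_label
print(predicted_label, is_correct)
```

For this first article, the prediction and the annotation agree.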

Knowing how to make a prediction, we can now apply this to the whole selected dataset. Here, we also present you with two options:

1. Traverse through all records in the dataset, predict each record and log it to Rubrix.
2. Apply a map function to make the predictions and add that field to each record, and then log it as a whole to Rubrix.

Each approach is presented in one of the following sections. Pick the one you like best, or even try both (be careful with the running time and with duplicated records, though!).

### First approach

In [8]:
from tqdm import tqdm

for record in tqdm(dataset):
    
    # Make the prediction
    model_output = pl(record['text'], categories, hypothesis_template='This example is {}.')
    
    # Create the prediction object
    prediction = TextClassificationAnnotation(
        labels=[ClassPrediction(class_label=model_output['labels'][i], confidence=model_output['scores'][i]) for i in range(len(model_output['labels']))], 
        agent="https://huggingface.co/valhalla/distilbart-mnli-12-1",
        )
    
    # Create the annotation object
    annotation = TextClassificationAnnotation(
        labels=[ClassPrediction(class_label=categories[record["label"]])], 
        agent="https://huggingface.co/datasets/ag_news",
        )
    
    # Create the item to log
    item = TextClassificationRecord(
        inputs={'text': record['text']},
        prediction=prediction,
        annotation=annotation,
        metadata={'split':'test'})
    
    # Log to rubrix
    rb.log(records=item, name="ag_news")
    

  0%|          | 0/1520 [00:00<?, ?it/s]


KeyboardInterrupt: 

### Second approach

In [17]:
def add_predictions(records):
    
    predictions = pl([record for record in records['text']], categories, hypothesis_template='This example is {}.')
    
    
    if isinstance(predictions, list):
        return {"labels_predicted": [pred["labels"] for pred in predictions], "probabilities_predicted": [pred["scores"] for pred in predictions]}
    else:
        return {"labels_predicted": predictions["labels"], "probabilities_predicted": predictions["scores"]}

In [18]:
dataset_predicted = dataset.map(add_predictions, batched=True, batch_size=4)

HBox(children=(FloatProgress(value=0.0, max=380.0), HTML(value='')))




In [19]:
dataset_predicted[0]

{'label': 2,
 'labels_predicted': ['Business', 'World', 'Sports', 'Sci/Tech'],
 'probabilities_predicted': [0.7651262879371643,
  0.1484706699848175,
  0.0463692881166935,
  0.040033772587776184],
 'text': "Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."}

In [22]:
from tqdm import tqdm

for record in tqdm(dataset_predicted):
    # Create the prediction object
    prediction = TextClassificationAnnotation(
        labels=[ClassPrediction(class_label=predicted_label, confidence=predicted_probability) 
                for predicted_label, predicted_probability in zip(record['labels_predicted'], record['probabilities_predicted'])], 
        agent="https://huggingface.co/valhalla/distilbart-mnli-12-1",
        )
    
    # Create the annotation object
    annotation = TextClassificationAnnotation(
        labels=[ClassPrediction(class_label=categories[record["label"]])], 
        agent="https://huggingface.co/datasets/ag_news",
        )
    
    # Create the item to log
    item = TextClassificationRecord(
        inputs={'text': record['text']},
        prediction=prediction,
        annotation=annotation,
        metadata={'split':'test'})
    
    # Log to rubrix
    rb.log(records=item, name="ag_news")

  2%|▏         | 32/1520 [00:40<31:42,  1.28s/it]


KeyboardInterrupt: 
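Both loops above call `rb.log` once per record. Since `rb.log` also accepts a list of records, as in the earlier parts of this tutorial, one possible variation is to collect records into chunks and log each chunk in a single call. The sketch below shows only the chunking logic: `chunked` and `fake_log` are hypothetical helpers, and `fake_log` stands in for `rb.log` so that the example runs without a Rubrix server.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


logged_batches = []


def fake_log(records, name):
    """Stand-in for rb.log: remember the dataset name and the batch size."""
    logged_batches.append((name, len(records)))


# Ten dummy records, logged in chunks of four
records = [{"inputs": {"text": f"article {i}"}} for i in range(10)]
for batch in chunked(records, size=4):
    fake_log(records=batch, name="ag_news")

print(logged_batches)
```

Whether batching actually shortens the run depends on where the time goes; in the progress bar above most of it is likely spent on model inference rather than on logging.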

### Pretrained NER with conll2003

## Summary

In this tutorial, we have learnt to:

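* store and explore text classification data (single-label, sentence-pair and multilabel) from 🤗 datasets in Rubrix
* store and explore token classification data (NER and POS tagging) in Rubrix
* log the predictions of a pretrained 🤗 transformers zero-shot classifier next to the ground-truth annotations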




## Next steps

- Invite the reader to join the commu

Point the reader to other materials.