# 🏷️ 🔫 Faster data annotation with a zero-shot text classifier

## TL;DR

1. This tutorial introduces a simple workflow for data annotation with Rubrix: **using a zero-shot classification model to pre-annotate and hand-label data more efficiently**.
2. The technique is demonstrated with a Spanish **zero-shot classifier from the Hugging Face Hub** on the **Spanish portion of the MLSum dataset**.
3. As an example, we perform two data annotation rounds: (1) **labeling random examples**, and (2) **bulk labeling high-score examples**.
4. Besides speeding up the labeling process, this workflow lets you **evaluate the performance of zero-shot classification for a specific use case**. In this example, the pre-trained zero-shot classifier gives reasonably good results, which might be enough for general news categorization.


<video width="100%" controls><source src="https://github.com/recognai/rubrix-materials/raw/main/tutorials/videos/zeroshot_selectra_news_data_annotation.mp4" type="video/mp4"></video>

## Setup Rubrix

Rubrix is a free and open-source tool to explore, annotate, and monitor data for NLP projects.

If you are new to Rubrix, check out the ⭐ [GitHub repository](https://github.com/recognai/rubrix).

If you have not installed and launched Rubrix, check the [Setup and Installation guide](../getting_started/setup&installation.rst).

Once installed, you only need to import Rubrix:

In [None]:
import rubrix as rb

## 1. Load the Spanish zero-shot classifier: `Selectra`

We will use the recently released Selectra zero-shot classifier: https://huggingface.co/Recognai/zeroshot_selectra_medium

In [None]:
from transformers import pipeline

# Load the Spanish zero-shot classifier from the Hugging Face Hub
classifier = pipeline("zero-shot-classification",
                      model="Recognai/zeroshot_selectra_medium")

# Quick sanity check on a single sentence
predictions = classifier(
    "El autor se perfila, a los 50 años de su muerte, como uno de los grandes de su siglo",
    candidate_labels=["cultura", "sociedad", "economia", "salud", "deportes"],
    hypothesis_template="Ejemplo de {}."
)

# Template and labels used for the news dataset in the following sections
hypothesis_template = "Esta noticia habla de {}."
candidate_labels = ["política", "cultura", "sociedad", "economia", "deportes", "ciencia y tecnología"]

In [3]:
predictions

{'sequence': 'El autor se perfila, a los 50 años de su muerte, como uno de los grandes de su siglo',
 'labels': ['sociedad', 'cultura', 'economia', 'salud', 'deportes'],
 'scores': [0.450505793094635,
  0.20290058851242065,
  0.18361839652061462,
  0.12826894223690033,
  0.03470633178949356]}
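
The returned dictionary pairs `labels` and `scores` by position, and in the output above they appear ordered from highest to lowest score. As a minimal sketch (the dictionary literal is copied from the output above with rounded scores, and `ranked` and `margin` are illustrative names, not part of the pipeline API), the top label and its lead over the runner-up could be read off like this:

```python
# Prediction copied from the output above, with scores rounded for readability.
predictions = {
    "sequence": "El autor se perfila, a los 50 años de su muerte, "
                "como uno de los grandes de su siglo",
    "labels": ["sociedad", "cultura", "economia", "salud", "deportes"],
    "scores": [0.4505, 0.2029, 0.1836, 0.1283, 0.0347],
}

# Pair each label with its score, then sort defensively by descending score.
ranked = sorted(
    zip(predictions["labels"], predictions["scores"]),
    key=lambda pair: pair[1],
    reverse=True,
)

top_label, top_score = ranked[0]
# Illustrative measure: how far the best label is ahead of the second best.
margin = top_score - ranked[1][1]

print(f"{top_label}: {top_score:.2f} (margin over runner-up: {margin:.2f})")
```

A small margin is one possible signal that a record deserves a closer look during hand-labeling.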

## 2. Load the `MLSum` dataset

In [None]:
from datasets import load_dataset

# Load the first 500 examples of the Spanish ("es") test split
mlsum = load_dataset("mlsum", "es", split="test[0:500]")

## 3. Log predictions in Rubrix

In [None]:
records = []
for article in mlsum:
    # Classify the article summary against the candidate labels
    predictions = classifier(
        article["summary"],
        candidate_labels=candidate_labels,
        hypothesis_template=hypothesis_template
    )
    # Store every (label, score) pair as the prediction and keep the
    # original MLSum topic as metadata
    records.append(
        rb.TextClassificationRecord(
            inputs=article["summary"],
            prediction=list(zip(predictions['labels'], predictions['scores'])),
            metadata={"topic": article["topic"]}
        )
    )

In [None]:
rb.log(records, name="zeroshot_noticias", metadata={"tags": "data-annotation"})

## 4. Hand-labeling session

### Label the first 20 random examples

<video width="100%" controls><source src="https://github.com/recognai/rubrix-materials/raw/main/tutorials/videos/zeroshot_selectra_news_data_annotation.mp4" type="video/mp4"></video>
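
Once a random sample has been hand-labeled, comparing each annotation with the top-scoring prediction gives a rough estimate of how well the zero-shot classifier does on this data. The sketch below uses made-up `(predicted, annotated)` pairs purely for illustration; exporting the annotated records from Rubrix is not shown here, and the helper name `agreement_rate` is hypothetical.

```python
from collections import Counter


def agreement_rate(pairs):
    """Return the fraction of (predicted, annotated) pairs that match."""
    if not pairs:
        raise ValueError("need at least one labeled example")
    hits = sum(1 for predicted, annotated in pairs if predicted == annotated)
    return hits / len(pairs)


# Made-up sample: (top predicted label, hand-assigned label).
sample = [
    ("política", "política"),
    ("deportes", "deportes"),
    ("sociedad", "cultura"),
    ("economia", "economia"),
    ("sociedad", "política"),
]

rate = agreement_rate(sample)

# Count which annotated labels the classifier missed, to spot weak categories.
missed = Counter(
    annotated for predicted, annotated in sample if predicted != annotated
)

print(f"agreement: {rate:.0%}")
print(missed.most_common())
```

With only a handful of labeled examples such a figure is a rough indication at best, so it is worth recomputing as more records are annotated.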

### Label records with high score predictions

<video width="100%" controls><source src="https://github.com/recognai/rubrix-materials/raw/main/tutorials/videos/zeroshot_high_confidence.mp4" type="video/mp4"></video>
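
Before bulk labeling, it can help to check how many records have a top prediction above a given score cut-off. The following sketch assumes predictions stored as lists of `(label, score)` tuples, the same shape passed to `prediction=` when the records were built; the example data, the `0.8` threshold, and the function name `high_score_records` are illustrative assumptions rather than values from this tutorial.

```python
def high_score_records(records, threshold):
    """Keep records whose best prediction scores at least `threshold`.

    Each record is a dict with a "text" and a "prediction" list of
    (label, score) tuples. Returns (text, label, score) triples.
    """
    selected = []
    for record in records:
        label, score = max(record["prediction"], key=lambda pair: pair[1])
        if score >= threshold:
            selected.append((record["text"], label, score))
    return selected


# Made-up records for illustration only.
records = [
    {"text": "noticia A", "prediction": [("deportes", 0.93), ("sociedad", 0.04)]},
    {"text": "noticia B", "prediction": [("cultura", 0.41), ("sociedad", 0.38)]},
    {"text": "noticia C", "prediction": [("política", 0.85), ("economia", 0.10)]},
]

confident = high_score_records(records, threshold=0.8)
for text, label, score in confident:
    print(f"{text}: {label} ({score:.2f})")
```

A higher threshold leaves fewer records to bulk-label but should reduce the number of wrong labels accepted by mistake; the right trade-off depends on the dataset.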

## Next steps

If you are interested in zero-shot models, check out the tutorial on using [Rubrix with Flair's zero-shot NER](07-zeroshot_ner).

### 📚 [Rubrix documentation](https://docs.rubrix.ml) for more guides and tutorials.

### 🙋‍♀️ Join the Rubrix community! A good place to start is the [discussion forum](https://github.com/recognai/rubrix/discussions).

### ⭐ Star the Rubrix [GitHub repo](https://github.com/recognai/rubrix) to stay updated.