# 🗂 Weak supervision in multi-label text classification tasks

In this tutorial we use Argilla and weak supervision to tackle two multi-label classification datasets:

- The first dataset is a curated version of [**GoEmotions**](https://huggingface.co/datasets/go_emotions), a dataset intended for **multi-label emotion classification**.
- We inspect the dataset in Argilla, come up with good heuristics, and combine them with a label model to train a **weakly supervised Hugging Face transformer**.
- In the second dataset, we [**categorize research papers**](https://www.kaggle.com/shivanandmn/multilabel-classification-dataset) by topic based on their titles, which is a **multi-label topic classification** problem.
- We repeat the process of finding good heuristics, combine them with a label model and train a **lightweight downstream model using sklearn** in the end.

![labelling-textclassification-sklearn-weaksupervision](../../_static/tutorials/labelling-textclassification-sklearn-weaksupervision/labelling-textclassification-sklearn-weaksupervision.png)

<div class="alert alert-info">

Note

The `Snorkel` and `FlyingSquid` label models do not support multi-label classification tasks.
    
</div>

## Setup

For this tutorial we also need some third-party libraries, which can be installed via pip:

In [1]:
%pip install datasets "transformers[torch]" scikit-multilearn ipywidgets -qqq

Note: you may need to restart the kernel to use updated packages.


## GoEmotions

The original [GoEmotions](https://huggingface.co/datasets/go_emotions) is a challenging dataset intended for multi-label emotion classification.
For this tutorial, we simplify it a bit by selecting only 6 out of the 28 emotions: *admiration, annoyance, approval, curiosity, gratitude, optimism*.
We also try to accentuate the multi-label part of the dataset by down-sampling the examples that are classified with only one label.
See Appendix A for all the details of this preprocessing step.

### Define rules

Let us start by downloading our curated version of the dataset from the Hugging Face Hub, and log it to Argilla:

In [2]:
import argilla as rg
from datasets import load_dataset

# Download preprocessed dataset
ds_rb = rg.read_datasets(
    load_dataset("argilla/go_emotions_multi-label", split="train"),
    task="TextClassification",
)




In [3]:
# Log dataset to Argilla to find good heuristics
rg.log(ds_rb, name="go_emotions")


  0%|          | 0/4208 [00:00<?, ?it/s]

4208 records logged to http://localhost:6900/datasets/argilla/go_emotions


BulkResponse(dataset='go_emotions', processed=4208, failed=0)

After uploading the dataset, we can explore and inspect it to find good heuristic rules.
For this we highly recommend the dedicated [*Define rules* mode](../reference/webapp/define_rules.md) of the Argilla web app, which allows you to quickly iterate over heuristic rules, compute their metrics and save them.

Here we copy the rules we found via the web app to the notebook, so that you can easily follow along with the tutorial.

In [4]:
from argilla.labeling.text_classification import Rule

# Define our heuristic rules, they can surely be improved
rules = [
    Rule("thank*", "gratitude"),
    Rule("appreciate", "gratitude"),
    Rule("text:(thanks AND good)", ["admiration", "gratitude"]),
    Rule("advice", "admiration"),
    Rule("amazing", "admiration"),
    Rule("awesome", "admiration"),
    Rule("impressed", "admiration"),
    Rule("text:(good AND (point OR call OR idea OR job))", "admiration"),
    Rule("legend", "admiration"),
    Rule("exactly", "approval"),
    Rule("agree", "approval"),
    Rule("yeah", "optimism"),
    Rule("suck", "annoyance"),
    Rule("pissed", "annoyance"),
    Rule("annoying", "annoyance"),
    Rule("ruined", "annoyance"),
    Rule("hoping", "optimism"),
    Rule("joking", ["optimism", "admiration"]),
    Rule('text:("good luck")', "optimism"),
    Rule('"nice day"', "optimism"),
    Rule('"what is"', "curiosity"),
    Rule('"can you"', "curiosity"),
    Rule('"would you"', "curiosity"),
    Rule('"do you"', ["curiosity", "admiration"]),
    Rule('"great"', ["annoyance"])
]

We go on and apply these heuristic rules to our dataset creating our weak label matrix.
Since we are dealing with a multi-label classification task, the weak label matrix will have 3 dimensions.

> Dimensions of the weak multi label matrix: *number of records* x *number of rules* x *number of labels* 

It will be filled with 0 and 1, depending on whether the rule voted for the respective label or not.
If the rule abstained for a given record, the matrix will be filled with -1.
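To make the shape and the values of this matrix concrete, here is a minimal NumPy sketch. It does not use Argilla: the labels, rules and matched record indices are made up for illustration, and the way the matrix is filled simply follows the description above.

```python
import numpy as np

# Hypothetical toy setup: 3 records, 2 rules, 3 labels (all values are made up)
labels = ["admiration", "gratitude", "optimism"]
rule_labels = [["gratitude"], ["admiration", "gratitude"]]
# Indices of the records each rule matched (stand-in for running the queries)
rule_matches = [[0, 2], [2]]
n_records = 3

# Start with -1 everywhere: every rule abstains on every record
matrix = np.full((n_records, len(rule_labels), len(labels)), -1, dtype=int)

for rule_idx, (matched, voted) in enumerate(zip(rule_matches, rule_labels)):
    for rec_idx in matched:
        # A matching rule votes 1 for its own labels and 0 for all other labels
        matrix[rec_idx, rule_idx] = [int(label in voted) for label in labels]

print(matrix.shape)  # records x rules x labels
print(matrix[2])     # votes of both rules for the third record
```

Record 1 is matched by no rule, so its slice of the matrix stays at -1 for every rule and label.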

In [5]:
from argilla.labeling.text_classification import WeakMultiLabels, add_rules, delete_rules, update_rules

# Add the rules to the dataset.
# Since the dataset then already contains the rules, we can omit the rules argument of WeakMultiLabels below.
add_rules(dataset="go_emotions", rules=rules)


In [6]:
weak_labels = WeakMultiLabels("go_emotions")

Preparing rules:   0%|          | 0/25 [00:00<?, ?it/s]

Applying rules:   0%|          | 0/4208 [00:00<?, ?it/s]

Filling weak label matrix:   0%|          | 0/4208 [00:00<?, ?it/s]

We can call the `weak_labels.summary()` method to check the precision of each rule as well as our total coverage of the dataset.

In [7]:
# Check coverage/precision of our rules
weak_labels.summary()


| rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `thank*` | {gratitude} | 0.199382 | 0.198925 | 0.048004 | 74 | 0 | 1.0 |
| `appreciate` | {gratitude} | 0.016397 | 0.021505 | 0.009981 | 7 | 1 | 0.875 |
| `text:(thanks AND good)` | {admiration, gratitude} | 0.007842 | 0.010753 | 0.007842 | 8 | 0 | 1.0 |
| `advice` | {admiration} | 0.008317 | 0.008065 | 0.007605 | 3 | 0 | 1.0 |
| `amazing` | {admiration} | 0.025428 | 0.021505 | 0.00499 | 8 | 0 | 1.0 |
| `awesome` | {admiration} | 0.02519 | 0.034946 | 0.007605 | 12 | 1 | 0.923077 |
| `impressed` | {admiration} | 0.002139 | 0.005376 | 0.0 | 2 | 0 | 1.0 |
| `text:(good AND (point OR call OR idea OR job))` | {admiration} | 0.008555 | 0.018817 | 0.003089 | 7 | 0 | 1.0 |
| `legend` | {admiration} | 0.001901 | 0.002688 | 0.000475 | 1 | 0 | 1.0 |
| `exactly` | {approval} | 0.007842 | 0.010753 | 0.002376 | 3 | 1 | 0.75 |


We can observe that the `joking` rule does not have any support, and that the `"do you"` rule is not informative because its ratio of correct to incorrect votes equals 1. We can delete these two rules from the dataset with the `delete_rules` method:

In [8]:
rules_to_delete = [
    Rule("joking", ["optimism", "admiration"]),
    Rule('"do you"', ["curiosity", "admiration"])]

delete_rules(dataset="go_emotions", rules=rules_to_delete)

# Let's apply weak labeling again

In [9]:
weak_labels = WeakMultiLabels("go_emotions")

Preparing rules:   0%|          | 0/23 [00:00<?, ?it/s]

Applying rules:   0%|          | 0/4208 [00:00<?, ?it/s]

Filling weak label matrix:   0%|          | 0/4208 [00:00<?, ?it/s]

In [10]:
weak_labels.summary()

| rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `thank*` | {gratitude} | 0.199382 | 0.198925 | 0.047766 | 74 | 0 | 1.0 |
| `appreciate` | {gratitude} | 0.016397 | 0.021505 | 0.009743 | 7 | 1 | 0.875 |
| `text:(thanks AND good)` | {admiration, gratitude} | 0.007842 | 0.010753 | 0.007842 | 8 | 0 | 1.0 |
| `advice` | {admiration} | 0.008317 | 0.008065 | 0.007367 | 3 | 0 | 1.0 |
| `amazing` | {admiration} | 0.025428 | 0.021505 | 0.00499 | 8 | 0 | 1.0 |
| `awesome` | {admiration} | 0.02519 | 0.034946 | 0.007129 | 12 | 1 | 0.923077 |
| `impressed` | {admiration} | 0.002139 | 0.005376 | 0.0 | 2 | 0 | 1.0 |
| `text:(good AND (point OR call OR idea OR job))` | {admiration} | 0.008555 | 0.018817 | 0.003089 | 7 | 0 | 1.0 |
| `legend` | {admiration} | 0.001901 | 0.002688 | 0.000475 | 1 | 0 | 1.0 |
| `exactly` | {approval} | 0.007842 | 0.010753 | 0.002139 | 3 | 1 | 0.75 |


We can observe that the following rules are not working well:

        Rule('"great"', ["annoyance"])

        Rule("yeah", "optimism")

Let's update these two rules as follows:

        Rule('"great"', ["admiration"])

        Rule("yeah", "approval")

In [10]:
rules_to_update = [
    Rule('"great"', ["admiration"]),
    Rule("yeah", "approval")]

In [11]:
update_rules(dataset="go_emotions", rules=rules_to_update)

Let's run weak labeling with the final rules of the dataset:

In [12]:
weak_labels = WeakMultiLabels(dataset="go_emotions")

Preparing rules:   0%|          | 0/23 [00:00<?, ?it/s]

Applying rules:   0%|          | 0/4208 [00:00<?, ?it/s]

Filling weak label matrix:   0%|          | 0/4208 [00:00<?, ?it/s]

In [13]:
weak_labels.summary()

| rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `thank*` | {gratitude} | 0.199382 | 0.198925 | 0.047766 | 74 | 0 | 1.0 |
| `appreciate` | {gratitude} | 0.016397 | 0.021505 | 0.009743 | 7 | 1 | 0.875 |
| `text:(thanks AND good)` | {admiration, gratitude} | 0.007842 | 0.010753 | 0.007842 | 8 | 0 | 1.0 |
| `advice` | {admiration} | 0.008317 | 0.008065 | 0.007367 | 3 | 0 | 1.0 |
| `amazing` | {admiration} | 0.025428 | 0.021505 | 0.00499 | 8 | 0 | 1.0 |
| `awesome` | {admiration} | 0.02519 | 0.034946 | 0.007129 | 12 | 1 | 0.923077 |
| `impressed` | {admiration} | 0.002139 | 0.005376 | 0.0 | 2 | 0 | 1.0 |
| `text:(good AND (point OR call OR idea OR job))` | {admiration} | 0.008555 | 0.018817 | 0.003089 | 7 | 0 | 1.0 |
| `legend` | {admiration} | 0.001901 | 0.002688 | 0.000475 | 1 | 0 | 1.0 |
| `exactly` | {approval} | 0.007842 | 0.010753 | 0.002139 | 3 | 1 | 0.75 |


Suppose we want to try out a new rule:

In [14]:
optimism_rule = Rule("wish*", "optimism")

In [15]:
optimism_rule.apply(dataset="go_emotions")

In [16]:
optimism_rule.metrics(dataset="go_emotions")

{'coverage': 0.006178707224334601,
 'annotated_coverage': 0.0,
 'correct': 0,
 'incorrect': 0,
 'precision': None}

The `optimism_rule` is not informative: it matches no annotated records, so its precision cannot be estimated. We therefore do not add it to the dataset.
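The following sketch illustrates how metrics of this kind can be derived, and why the precision is `None` when a rule only matches records without annotations. It is a toy re-implementation under stated assumptions, not Argilla's actual code: the function `rule_metrics` and its inputs are hypothetical, and it assumes that correct and incorrect votes are counted once per label.

```python
def rule_metrics(matched, annotations, rule_labels):
    """Toy computation of rule metrics (hypothetical helper, not Argilla's code).

    matched: one bool per record, True if the rule's query matches the record
    annotations: one entry per record, a set of labels or None if unannotated
    rule_labels: the labels the rule votes for
    """
    # Annotations of the records that are both matched and annotated
    annotated_hits = [
        ann for hit, ann in zip(matched, annotations) if hit and ann is not None
    ]
    n_annotated = sum(ann is not None for ann in annotations)
    # Assumption: votes are counted per label, so a two-label rule casts two votes
    correct = sum(label in ann for ann in annotated_hits for label in rule_labels)
    incorrect = len(annotated_hits) * len(rule_labels) - correct
    return {
        "coverage": sum(matched) / len(matched),
        "annotated_coverage": len(annotated_hits) / n_annotated if n_annotated else 0.0,
        "correct": correct,
        "incorrect": incorrect,
        # Without any annotated match there is nothing to estimate the precision from
        "precision": correct / (correct + incorrect) if correct + incorrect else None,
    }


# The rule matches two records, but neither of them carries an annotation
toy_metrics = rule_metrics(
    matched=[True, True, False, False],
    annotations=[None, None, {"optimism"}, {"gratitude"}],
    rule_labels=["optimism"],
)
print(toy_metrics)
```

In this toy case the rule covers half of the records, yet there is no annotated evidence for or against it, which mirrors the situation of the `wish*` rule.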

Let's try a rule for the __curiosity__ class:

In [18]:
curiosity_rule = Rule("could you", "curiosity")

In [19]:
curiosity_rule.apply("go_emotions")

In [20]:
curiosity_rule.metrics(dataset="go_emotions")

{'coverage': 0.005465779467680608,
 'annotated_coverage': 0.002688172043010753,
 'correct': 1,
 'incorrect': 0,
 'precision': 1.0}

The `curiosity_rule` has positive support, so we can add it to the dataset as follows:

In [24]:
curiosity_rule.add_to_dataset(dataset="go_emotions")

Let's apply weak labeling again with the final rule set:

In [26]:
weak_labels = WeakMultiLabels(dataset="go_emotions")

Preparing rules:   0%|          | 0/24 [00:00<?, ?it/s]

Applying rules:   0%|          | 0/4208 [00:00<?, ?it/s]

Filling weak label matrix:   0%|          | 0/4208 [00:00<?, ?it/s]

In [27]:
weak_labels.summary()

| rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `thank*` | {gratitude} | 0.199382 | 0.198925 | 0.048004 | 74 | 0 | 1.0 |
| `appreciate` | {gratitude} | 0.016397 | 0.021505 | 0.009743 | 7 | 1 | 0.875 |
| `text:(thanks AND good)` | {admiration, gratitude} | 0.007842 | 0.010753 | 0.007842 | 8 | 0 | 1.0 |
| `advice` | {admiration} | 0.008317 | 0.008065 | 0.007367 | 3 | 0 | 1.0 |
| `amazing` | {admiration} | 0.025428 | 0.021505 | 0.00499 | 8 | 0 | 1.0 |
| `awesome` | {admiration} | 0.02519 | 0.034946 | 0.007367 | 12 | 1 | 0.923077 |
| `impressed` | {admiration} | 0.002139 | 0.005376 | 0.0 | 2 | 0 | 1.0 |
| `text:(good AND (point OR call OR idea OR job))` | {admiration} | 0.008555 | 0.018817 | 0.003089 | 7 | 0 | 1.0 |
| `legend` | {admiration} | 0.001901 | 0.002688 | 0.000475 | 1 | 0 | 1.0 |
| `exactly` | {approval} | 0.007842 | 0.010753 | 0.002139 | 3 | 1 | 0.75 |


### Create training set

When we are happy with our heuristics, it is time to combine them and compute weak labels for the training of our downstream model.
For this we will use the `MajorityVoter`.
In the multi-label case, it sets the probability of a label to 0 or 1 depending on whether at least one non-abstaining rule voted for the respective label or not.

In [28]:
from argilla.labeling.text_classification import MajorityVoter

# Use the majority voter as the label model
label_model = MajorityVoter(weak_labels)
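To make the voting rule described above concrete, here is a minimal NumPy sketch. It only mirrors the description (a label gets probability 1 if at least one rule voted for it, otherwise 0) and is not Argilla's actual implementation; the toy matrix is made up.

```python
import numpy as np

# Toy weak label matrix: 2 records x 2 rules x 3 labels (values are made up)
toy_matrix = np.array(
    [
        [[0, 1, 0], [1, 1, 0]],     # both rules vote on the first record
        [[-1, -1, -1], [0, 0, 1]],  # the first rule abstains on the second record
    ]
)

# A label gets probability 1 if any rule voted 1 for it, otherwise 0.
# Abstentions (-1) never count as a vote.
probabilities = (toy_matrix == 1).any(axis=1).astype(float)
print(probabilities)
```

Each row of `probabilities` holds one value per label, so thresholding it at 0.5 directly yields the set of weak labels for that record.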


From our label model we get the training records together with their weak labels and probabilities.
We will use the weak labels with a probability greater than 0.5 as labels for our training, and hence copy them to the `annotation` property of our records.

In [29]:
# Get records with the predictions from the label model to train a down-stream model
train_rg = label_model.predict()

# Copy label model predictions to annotation with a threshold of 0.5
for rec in train_rg:
    rec.annotation = [pred[0] for pred in rec.prediction if pred[1] > 0.5]


We extract the test set with manual annotations from our `WeakMultiLabels` object:

In [30]:
# Get records with manual annotations to use as test set for the down-stream model
test_rg = rg.DatasetForTextClassification(weak_labels.records(has_annotation=True))


We will use the convenient `DatasetForTextClassification.prepare_for_training()` method to create datasets optimized for training with the Hugging Face transformers library:

In [31]:
from datasets import DatasetDict

# Create dataset dictionary and shuffle training set
ds = DatasetDict(
    train=train_rg.prepare_for_training().shuffle(seed=42),
    test=test_rg.prepare_for_training(),
)


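Let us push the dataset to the Hub to share it with our colleagues.
It is also an easy way to outsource the training of the model to an environment with an accelerator, such as Google Colab.
Note that pushing requires a Hugging Face token or a prior `huggingface-cli login`; otherwise the call fails with the `OSError` shown in the output of the next cell.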

In [32]:
# Push dataset for training our down-stream model to the HF hub
ds.push_to_hub("argilla/go_emotions_training")




OSError: You need to provide a `token` or be logged in to Hugging Face with `huggingface-cli login`.

### Train a transformer downstream model

The following steps are basically copied from the excellent documentation of the [Hugging Face transformers](https://huggingface.co/docs/transformers) library.

First, we will load the tokenizer corresponding to our model, which we choose to be the [distilled version](https://huggingface.co/distilbert-base-uncased) of the famous BERT.

<div class="alert alert-info">

Note

Since we will use a full-blown transformer as a downstream model (albeit a distilled one), we recommend executing the following code on a machine with a GPU, or in a Google Colab with a GPU backend enabled.
    
</div>

In [33]:
from transformers import AutoTokenizer

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


Afterward, we tokenize our data:

In [34]:
def tokenize_func(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# Tokenize the data
tokenized_ds = ds.map(tokenize_func, batched=True)


  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

The transformer model expects our labels in the common multi-label format of binary indicator vectors (one 0/1 entry per label), so let us use [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) for this transformation.

In [35]:
from sklearn.preprocessing import MultiLabelBinarizer

# Turn labels into multi-label format
mb = MultiLabelBinarizer()
mb.fit(ds["test"]["label"])


def binarize_labels(examples):
    return {"label": mb.transform(examples["label"])}


binarized_tokenized_ds = tokenized_ds.map(binarize_labels, batched=True)


  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]
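As a small standalone illustration of this transformation, the sketch below runs `MultiLabelBinarizer` on a few made-up label lists. The string labels are only for readability; the names `toy_labels`, `toy_mb` and `encoded` are hypothetical and not part of the tutorial's pipeline.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up label lists: two labels, one label, and no label at all
toy_labels = [["admiration", "gratitude"], ["optimism"], []]

toy_mb = MultiLabelBinarizer()
encoded = toy_mb.fit_transform(toy_labels)

print(toy_mb.classes_)  # column order of the indicator vectors
print(encoded)          # one 0/1 row per record
```

Each column corresponds to one entry of `classes_`, and a record without labels becomes an all-zero row.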

Before we start the training, it is important to define our metric for the evaluation.
Here we settle on the commonly used micro averaged *F1* metric, but we will also keep track of the *F1 per label*, for a more in-depth error analysis afterward.

In [36]:
from datasets import load_metric
import numpy as np

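# Define our metrics
# Note: `load_metric` is deprecated in recent releases of `datasets` in favor of the separate `evaluate` library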
metric = load_metric("f1", config_name="multilabel")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # apply sigmoid
    predictions = (1.0 / (1 + np.exp(-logits))) > 0.5

    # f1 micro averaged
    metrics = metric.compute(
        predictions=predictions, references=labels, average="micro"
    )
    # f1 per label
    per_label_metric = metric.compute(
        predictions=predictions, references=labels, average=None
    )
    for label, f1 in zip(
        ds["train"].features["label"][0].names, per_label_metric["f1"]
    ):
        metrics[f"f1_{label}"] = f1

    return metrics
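To show what the function above computes, here is a hand-rolled NumPy sketch of the same steps on made-up logits: sigmoid, threshold at 0.5, then the micro-averaged F1 and the F1 per label. It is an illustration only and does not go through the `datasets` metric; all values and variable names are hypothetical.

```python
import numpy as np

# Made-up logits and gold labels for 2 records and 3 labels
toy_logits = np.array([[2.0, -1.0, 0.5], [-0.5, 3.0, -2.0]])
toy_references = np.array([[1, 0, 0], [0, 1, 1]])

# Sigmoid followed by a 0.5 threshold (equivalent to checking logits > 0)
toy_predictions = (1.0 / (1.0 + np.exp(-toy_logits))) > 0.5

# True positives, false positives and false negatives per label
tp = (toy_predictions & (toy_references == 1)).sum(axis=0)
fp = (toy_predictions & (toy_references == 0)).sum(axis=0)
fn = (~toy_predictions & (toy_references == 1)).sum(axis=0)

# Micro average: pool the counts over all labels before computing F1
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
# Per label: compute F1 separately for each column (guarding against 0/0)
per_label_f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)

print(micro_f1)
print(per_label_f1)
```

Because the micro average pools the counts, frequent labels dominate it, which is why tracking the per-label values alongside it is useful for error analysis.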


Now we are ready to load our pretrained transformer model and prepare it for our task: multi-label text classification with 6 labels.

In [37]:
from transformers import AutoModelForSequenceClassification

# Init our down-stream model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", problem_type="multi_label_classification", num_labels=6
)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

The only thing missing for the training is the `Trainer` and its `TrainingArguments`.
To keep it simple, we mostly rely on the default arguments, which often work out of the box, but tweak the batch size a bit to train faster.
We also checked that 2 epochs are enough for our rather small dataset.

In [38]:
from transformers import TrainingArguments

# Set our training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)


In [39]:
from transformers import Trainer

# Init the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=binarized_tokenized_ds["train"],
    eval_dataset=binarized_tokenized_ds["test"],
    compute_metrics=compute_metrics,
)


In [40]:
# Train the down-stream model
trainer.train()


The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1417
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 178
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"

We achieved a micro-averaged *F1* of about 0.54, which is not perfect, but a good baseline for this challenging dataset.
When inspecting the *F1s per label*, we clearly see that the worst-performing labels are the ones with the poorest heuristics in terms of accuracy and coverage, which comes as no surprise.

## Research topic dataset

After covering a multi-label emotion classification task, we will try to do the same for a multi-label classification task related to topic modeling.
In this dataset, research papers were classified with 6 non-exclusive labels based on their title and abstract.

We will try to classify the papers only based on the title, which is considerably harder, but allows us to quickly scan through the data and come up with heuristics.
See Appendix B for all the details of the minimal data preprocessing.

### Define rules

Let us start by downloading our preprocessed dataset from the Hugging Face Hub, and log it to Argilla:

In [None]:
import argilla as rg
from datasets import load_dataset

# Download preprocessed dataset
ds_rb = rg.read_datasets(
    load_dataset("argilla/research_titles_multi-label", split="train"),
    task="TextClassification",
)




In [39]:
# Log dataset to Argilla to find good heuristics
rg.log(ds_rb, "research_titles")

  0%|          | 0/20972 [00:00<?, ?it/s]

20972 records logged to http://localhost:6900/datasets/argilla/research_titles


BulkResponse(dataset='research_titles', processed=20972, failed=0)

After uploading the dataset, we can explore and inspect it to find good heuristic rules.
For this we highly recommend the dedicated [*Define rules* mode](../../reference/webapp/features.html#weak-labelling) of the Argilla web app, which allows you to quickly iterate over heuristic rules, compute their metrics and save them.

Here we copy the rules we found via the web app to the notebook, so that you can easily follow along with the tutorial.

In [40]:
from argilla.labeling.text_classification import Rule

# Define our heuristic rules (can probably be improved)

rules = [
    Rule("stock*", "Quantitative Finance"),
    Rule("*asset*", "Quantitative Finance"),
    Rule("pric*", "Quantitative Finance"),
    Rule("economy", "Quantitative Finance"),
    Rule("deep AND neural AND network*", "Computer Science"),
    Rule("convolutional", "Computer Science"),
    Rule("allocat* AND *net*", "Computer Science"),
    Rule("program", "Computer Science"),
    Rule("classification* AND (label* OR deep)", "Computer Science"),
    Rule("scattering", "Physics"),
    Rule("astro*", "Physics"),
    Rule("optical", "Physics"),
    Rule("ray", "Physics"),
    Rule("entangle*", "Physics"),
    Rule("*algebra*", "Mathematics"),
    Rule("spaces", "Mathematics"),
    Rule("operators", "Mathematics"),
    Rule("estimation", "Statistics"),
    Rule("mixture", "Statistics"),
    Rule("gaussian", "Statistics"),
    Rule("gene", "Quantitative Biology"),
]


We go on and apply these heuristic rules to our dataset creating our weak label matrix.
As mentioned in the [GoEmotions](#goemotions) section, the weak label matrix will have 3 dimensions and values of -1, 0 and 1.

In [42]:
from argilla.labeling.text_classification import WeakMultiLabels, add_rules

# Add the rules to the dataset, then compute the weak labels.
# Since the dataset then already contains the rules, we can omit the rules argument.
add_rules(dataset="research_titles", rules=rules)
weak_labels = WeakMultiLabels("research_titles")


Preparing rules:   0%|          | 0/21 [00:00<?, ?it/s]

Applying rules:   0%|          | 0/20972 [00:00<?, ?it/s]

Filling weak label matrix:   0%|          | 0/20972 [00:00<?, ?it/s]

Let us get an overview of our heuristics and how they perform:

In [43]:
# Check coverage/precision of our rules
weak_labels.summary()


| rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `stock*` | {Quantitative Finance} | 0.000954 | 0.000715 | 0.000191 | 3 | 0 | 1.0 |
| `*asset*` | {Quantitative Finance} | 0.000477 | 0.000715 | 0.000238 | 3 | 0 | 1.0 |
| `pric*` | {Quantitative Finance} | 0.003433 | 0.003337 | 0.000668 | 9 | 5 | 0.642857 |
| `economy` | {Quantitative Finance} | 0.000238 | 0.000238 | 0.0 | 1 | 0 | 1.0 |
| `deep AND neural AND network*` | {Computer Science} | 0.009155 | 0.01025 | 0.002575 | 32 | 11 | 0.744186 |
| `convolutional` | {Computer Science} | 0.010109 | 0.009297 | 0.002003 | 32 | 7 | 0.820513 |
| `allocat* AND *net*` | {Computer Science} | 0.000763 | 0.000715 | 0.0 | 3 | 0 | 1.0 |
| `program` | {Computer Science} | 0.002623 | 0.003099 | 9.5e-05 | 11 | 2 | 0.846154 |
| `classification* AND (label* OR deep)` | {Computer Science} | 0.003338 | 0.004052 | 0.001287 | 14 | 3 | 0.823529 |
| `scattering` | {Physics} | 0.004053 | 0.002861 | 0.000572 | 10 | 2 | 0.833333 |


Suppose we have come up with new rules and want to add them to the dataset:

In [45]:
additional_rules = [
    Rule("trading", "Quantitative Finance"),
    Rule("finance", "Quantitative Finance"),
    Rule("memor* AND (design* OR network*)", "Computer Science"),
    Rule("system* AND design*", "Computer Science"),
    Rule("material*", "Physics"),
    Rule("spin", "Physics"),
    Rule("magnetic", "Physics"),
    Rule("manifold* AND (NOT learn*)", "Mathematics"),
    Rule("equation", "Mathematics"),
    Rule("regression", "Statistics"),
    Rule("bayes*", "Statistics"),
]

In [46]:
add_rules(dataset="research_titles", rules=additional_rules)

In [49]:
weak_labels = WeakMultiLabels("research_titles")

Preparing rules:   0%|          | 0/32 [00:00<?, ?it/s]

Applying rules:   0%|          | 0/20972 [00:00<?, ?it/s]

Filling weak label matrix:   0%|          | 0/20972 [00:00<?, ?it/s]

In [50]:
weak_labels.summary()

| rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `stock*` | {Quantitative Finance} | 0.000954 | 0.000715 | 0.000334 | 3 | 0 | 1.0 |
| `*asset*` | {Quantitative Finance} | 0.000477 | 0.000715 | 0.000286 | 3 | 0 | 1.0 |
| `pric*` | {Quantitative Finance} | 0.003433 | 0.003337 | 0.000715 | 9 | 5 | 0.642857 |
| `economy` | {Quantitative Finance} | 0.000238 | 0.000238 | 0.0 | 1 | 0 | 1.0 |
| `deep AND neural AND network*` | {Computer Science} | 0.009155 | 0.01025 | 0.002909 | 32 | 11 | 0.744186 |
| `convolutional` | {Computer Science} | 0.010109 | 0.009297 | 0.002241 | 32 | 7 | 0.820513 |
| `allocat* AND *net*` | {Computer Science} | 0.000763 | 0.000715 | 0.0 | 3 | 0 | 1.0 |
| `program` | {Computer Science} | 0.002623 | 0.003099 | 0.000143 | 11 | 2 | 0.846154 |
| `classification* AND (label* OR deep)` | {Computer Science} | 0.003338 | 0.004052 | 0.001335 | 14 | 3 | 0.823529 |
| `scattering` | {Physics} | 0.004053 | 0.002861 | 0.001001 | 10 | 2 | 0.833333 |


Let's create new rules and look at their effects. If they are informative enough, we can proceed by adding them to the dataset.

In [52]:
# create a statistics rule and get its metrics
statistics_rule = Rule("sample", "Statistics")
statistics_rule.apply("research_titles")
statistics_rule.metrics("research_titles")

{'coverage': 0.004672897196261682,
 'annotated_coverage': 0.004529201430274136,
 'correct': 17,
 'incorrect': 2,
 'precision': 0.8947368421052632}

In [60]:
finance_rule = Rule("risk", "Quantitative Finance")
finance_rule.apply("research_titles")
finance_rule.metrics("research_titles")


{'coverage': 0.004815945069616631,
 'annotated_coverage': 0.004290822407628129,
 'correct': 1,
 'incorrect': 17,
 'precision': 0.05555555555555555}

In [61]:
finance_rule.add_to_dataset("research_titles")

Our assumption does not seem to be correct, so let's update this rule:

In [64]:
finance_rule = Rule("risk", "Statistics")

In [63]:
finance_rule.metrics("research_titles")

{'coverage': 0.004815945069616631,
 'annotated_coverage': 0.004290822407628129,
 'correct': 11,
 'incorrect': 7,
 'precision': 0.6111111111111112}

In [65]:
finance_rule.update_at_dataset("research_titles")

In [66]:
statistics_rule.add_to_dataset("research_titles")

In [71]:
quantitative_biology_rule = Rule("dna", "Quantitative Biology")

In [72]:
quantitative_biology_rule.metrics("research_titles")

{'coverage': 0.0013351134846461949,
 'annotated_coverage': 0.0011918951132300357,
 'correct': 4,
 'incorrect': 1,
 'precision': 0.8}

In [73]:
quantitative_biology_rule.add_to_dataset("research_titles")

Let's see the final matrix with the newly added rules:

In [74]:
weak_labels = WeakMultiLabels("research_titles")

Preparing rules:   0%|          | 0/35 [00:00<?, ?it/s]

Applying rules:   0%|          | 0/20972 [00:00<?, ?it/s]

Filling weak label matrix:   0%|          | 0/20972 [00:00<?, ?it/s]

In [75]:
weak_labels.summary()

| rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `stock*` | {Quantitative Finance} | 0.000954 | 0.000715 | 0.000334 | 3 | 0 | 1.0 |
| `*asset*` | {Quantitative Finance} | 0.000477 | 0.000715 | 0.000334 | 3 | 0 | 1.0 |
| `pric*` | {Quantitative Finance} | 0.003433 | 0.003337 | 0.000811 | 9 | 5 | 0.642857 |
| `economy` | {Quantitative Finance} | 0.000238 | 0.000238 | 4.8e-05 | 1 | 0 | 1.0 |
| `deep AND neural AND network*` | {Computer Science} | 0.009155 | 0.01025 | 0.002956 | 32 | 11 | 0.744186 |
| `convolutional` | {Computer Science} | 0.010109 | 0.009297 | 0.002336 | 32 | 7 | 0.820513 |
| `allocat* AND *net*` | {Computer Science} | 0.000763 | 0.000715 | 4.8e-05 | 3 | 0 | 1.0 |
| `program` | {Computer Science} | 0.002623 | 0.003099 | 0.000191 | 11 | 2 | 0.846154 |
| `classification* AND (label* OR deep)` | {Computer Science} | 0.003338 | 0.004052 | 0.001335 | 14 | 3 | 0.823529 |
| `scattering` | {Physics} | 0.004053 | 0.002861 | 0.001049 | 10 | 2 | 0.833333 |


### Create training set

When we are happy with our heuristics, it is time to combine them and compute weak labels for the training of our downstream model.
As for the "GoEmotions" dataset, we will use the simple `MajorityVoter`.

In [76]:
from argilla.labeling.text_classification import MajorityVoter

# Use the majority voter as the label model
label_model = MajorityVoter(weak_labels)


From our label model we get the training records together with their weak labels and probabilities.
Since we are going to train an sklearn model, we will put the records in a pandas DataFrame, which integrates well with the sklearn ecosystem.

In [77]:
train_df = label_model.predict().to_pandas()


Before training our model, we need to extract the training labels from the label model predictions and transform them into a multi-label compatible format.

In [78]:
# Create labels in multi-label format, we will use a threshold of 0.5 for the probability
def multi_label_binarizer(predictions, threshold=0.5):
    predicted_labels = [label for label, prob in predictions if prob > threshold]
    binary_labels = [
        1 if label in predicted_labels else 0 for label in weak_labels.labels
    ]
    return binary_labels


train_df["label"] = train_df.prediction.map(multi_label_binarizer)
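As a quick check of what this conversion produces, here is a self-contained variant that takes the label list as an explicit argument instead of reading it from `weak_labels`. The function name `binarize` and the toy prediction are hypothetical and only serve as an illustration.

```python
def binarize(predictions, labels, threshold=0.5):
    """Turn (label, probability) pairs into a 0/1 vector ordered like `labels`."""
    predicted = {label for label, prob in predictions if prob > threshold}
    return [int(label in predicted) for label in labels]


# Made-up label order and a single made-up prediction
toy_labels = ["Computer Science", "Physics", "Statistics"]
toy_prediction = [("Physics", 1.0), ("Statistics", 0.0), ("Computer Science", 1.0)]

print(binarize(toy_prediction, toy_labels))  # [1, 1, 0]
```

The order of the output follows the label list, not the order of the prediction pairs, which is what keeps the columns consistent across records.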


Now, let us define our downstream model and train it.

We will use the [scikit-multilearn library](http://scikit.ml/) to wrap a multinomial **Naive Bayes classifier** that is suitable for classification with discrete features (e.g., word counts for text classification).
The `BinaryRelevance` class transforms the multi-label problem with L labels into L single-label binary classification problems, so in the end we will automatically fit L naive bayes classifiers to our data.

The features for our classifier will be word counts, which we extract with the `CountVectorizer`.
With its default settings, as used below, it counts single words (unigrams); passing `ngram_range=(1, 5)` would additionally count contiguous sequences of up to 5 words ([n-grams](https://en.wikipedia.org/wiki/N-gram)).

Finally, we will put our feature extractor and multi-label classifier in a sklearn pipeline that makes fitting and scoring the model a breeze.

In [79]:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Define our down-stream model
classifier = Pipeline(
    [("vect", CountVectorizer()), ("clf", BinaryRelevance(MultinomialNB()))]
)
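The binary relevance idea can be illustrated without scikit-multilearn. The sketch below swaps in sklearn's `OneVsRestClassifier`, which, given a binary indicator matrix as target, also fits one independent binary classifier per label. It is a stand-in for illustration and not the tutorial's `BinaryRelevance` pipeline; the titles and targets are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up titles with two non-exclusive labels: [Computer Science, Physics]
toy_titles = [
    "deep neural networks for image classification",
    "scattering of light in optical fibers",
    "neural networks for optical scattering simulations",
    "magnetic spin dynamics",
]
toy_targets = np.array([[1, 0], [0, 1], [1, 1], [0, 1]])

# One Naive Bayes classifier is fitted per label column
toy_classifier = Pipeline(
    [("vect", CountVectorizer()), ("clf", OneVsRestClassifier(MultinomialNB()))]
)
toy_classifier.fit(toy_titles, toy_targets)

toy_predictions = toy_classifier.predict(["deep networks for galaxy images"])
print(len(toy_classifier.named_steps["clf"].estimators_))  # one estimator per label
print(toy_predictions.shape)  # one row per title, one column per label
```

Since every label is handled by its own classifier, correlations between labels are ignored, which keeps the approach simple and fast.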


Training the model is as easy as calling the `fit` method on our pipeline and providing our training text and training labels.

In [80]:
import numpy as np

# Fit the down-stream classifier
classifier.fit(
    X=train_df.text,
    y=np.array(train_df.label.tolist()),
)


To score our trained model, we retrieve its predictions for the test set and use sklearn's `classification_report` to get all important classification metrics in a nicely formatted string.

In [81]:
# Get predictions for test set
predictions = classifier.predict(
    X=[rec.text for rec in weak_labels.records(has_annotation=True)]
)


In [82]:
from sklearn.metrics import classification_report

# Compute metrics
print(
    classification_report(
        weak_labels.annotation(), predictions, target_names=weak_labels.labels
    )
)


                      precision    recall  f1-score   support

    Computer Science       0.81      0.24      0.38      1740
         Mathematics       0.79      0.58      0.67      1141
             Physics       0.88      0.65      0.74      1186
Quantitative Biology       0.67      0.02      0.04       109
Quantitative Finance       0.46      0.13      0.21        45
          Statistics       0.52      0.69      0.60      1069

           micro avg       0.71      0.49      0.58      5290
           macro avg       0.69      0.39      0.44      5290
        weighted avg       0.76      0.49      0.56      5290
         samples avg       0.58      0.52      0.53      5290





We obtain a micro-averaged F1 score of around 0.58, which again is not perfect but can serve as a decent baseline for future improvements.
Looking at the scores per label, we see that the main problem is the recall of our heuristics (for example, 0.24 for *Computer Science* and 0.02 for *Quantitative Biology*), and we should either define more of them or try to find more general ones.
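
As a quick sanity check on how the averaged scores relate to the rest of the report, the sketch below recomputes them from the printed values. The numbers are copied from the report above and are already rounded to two decimals, so the results are approximate.

```python
# Values copied from the classification report above (rounded to 2 decimals)
micro_precision, micro_recall = 0.71, 0.49
per_label_f1 = {
    "Computer Science": 0.38,
    "Mathematics": 0.67,
    "Physics": 0.74,
    "Quantitative Biology": 0.04,
    "Quantitative Finance": 0.21,
    "Statistics": 0.60,
}

# Micro F1 is the harmonic mean of micro precision and micro recall
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

# Macro F1 is the unweighted mean of the per-label F1 scores
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

print(round(micro_f1, 2), round(macro_f1, 2))  # 0.58 0.44
```

The gap between the two averages comes from the rare labels: macro averaging gives *Quantitative Biology* and *Quantitative Finance* the same weight as the frequent labels, while micro averaging pools all individual label decisions.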

## Summary

In this tutorial we saw how you can use *Argilla* to tackle multi-label text classification problems with weak supervision.
We showed you how to train two downstream models on two different multi-label datasets using the discovered heuristics.

For the emotion classification task, we trained a full-blown transformer model with Hugging Face, while for the topic classification task, we relied on a more lightweight Naive Bayes classifier from sklearn.
Although the results are not perfect, they can serve as a good baseline for future improvements.

So the next time you encounter a multi-label classification problem, maybe try out weak supervision with *Argilla* and save some time for your annotation team 😀.

## Next steps

⭐ Star the Argilla [GitHub repo](https://github.com/argilla-io/argilla) to stay updated.

📚 [Argilla documentation](https://docs.argilla.io) for more guides and tutorials.

🙋‍♀️ Join the Argilla community! A good place to start is the [discussion forum](https://github.com/argilla-io/argilla/discussions).

## Appendix A

This appendix summarizes the preprocessing steps for our curated *GoEmotions* dataset.
The goal was to limit the labels, and down-sample single-label annotations to move the focus to multi-label outputs.

In [None]:
# load original dataset and check label frequencies

import pandas as pd
import datasets

go_emotions = datasets.load_dataset("go_emotions")
df = go_emotions["test"].to_pandas()


def int2str(i):
    # Map a label id to its label name
    return go_emotions["train"].features["labels"].feature.int2str(int(i))


# Count how often each label occurs in examples with more than one label
idx_multi = df.labels.map(lambda x: len(x) > 1)
df["is_single"] = df.labels.map(lambda x: 0 if len(x) > 1 else 1)
label_freq = [int(label) for labels in df[idx_multi].labels for label in labels]
pd.Series(label_freq).value_counts()


In [None]:
# limit labels, down-sample single-label annotations and create Argilla records

import argilla as rg


def create(split: str) -> pd.DataFrame:
    df = go_emotions[split].to_pandas()
    df["is_single"] = df.labels.map(lambda x: 0 if len(x) > 1 else 1)

    # Keep examples labeled only with: admiration (0), annoyance (3), approval (4),
    # curiosity (7), gratitude (15), optimism (20)
    idx_most_common = df.labels.map(
        lambda x: all(int(label) in {0, 3, 4, 7, 15, 20} for label in x)
    )
    df_multi = df[(df.is_single == 0) & idx_most_common]
    df_single = df[idx_most_common].sample(
        3 * len(df_multi), weights="is_single", axis=0, random_state=42
    )
    return pd.concat([df_multi, df_single]).sample(frac=1, random_state=42)


def make_records(row, is_train: bool) -> rg.TextClassificationRecord:
    annotation = [int2str(i) for i in row.labels] if not is_train else None
    return rg.TextClassificationRecord(
        inputs=row.text,
        annotation=annotation,
        multi_label=True,
        id=row.id,
    )


train_recs = create("train").apply(make_records, axis=1, is_train=True)
test_recs = create("test").apply(make_records, axis=1, is_train=False)

records = train_recs.tolist() + test_recs.tolist()
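
The filtering and down-sampling logic of `create` can be illustrated on toy data with the standard library alone. This is only a sketch under assumptions: the examples are made up, `random.Random.sample` stands in for the weighted pandas sampling used above, and the `min` guard is an addition so that the toy code never asks for more single-label examples than exist.

```python
import random

# Made-up examples: each entry is the list of label ids of one text
examples = [[0], [0, 4], [3], [27], [7], [0, 27], [20], [15]]
allowed = {0, 3, 4, 7, 15, 20}

# Keep only examples whose labels all belong to the allowed set
kept = [labels for labels in examples if all(l in allowed for l in labels)]
multi = [labels for labels in kept if len(labels) > 1]
single = [labels for labels in kept if len(labels) == 1]

# Down-sample single-label examples to at most 3x the multi-label count
rng = random.Random(42)
n_single = min(len(single), 3 * len(multi))
sampled_single = rng.sample(single, n_single)

# Combine both parts and shuffle them
subset = multi + sampled_single
rng.shuffle(subset)
print(len(multi), len(sampled_single), len(subset))  # 1 3 4
```

Capping the single-label examples at a fixed multiple of the multi-label ones is what shifts the focus of the curated dataset toward multi-label outputs.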


In [None]:
# publish dataset in the Hub

ds_rg = rg.DatasetForTextClassification(records).to_datasets()

ds_rg.push_to_hub("argilla/go_emotions_multi-label", private=True)


## Appendix B

This appendix summarizes the minimal preprocessing done to [this multi-label classification dataset](https://www.kaggle.com/shivanandmn/multilabel-classification-dataset) from Kaggle.
You can download the original data (`train.csv`) following the Kaggle link.

The preprocessing consists of extracting only the title of each research paper and splitting the data into a train and a test set.

In [None]:
# Extract the title and split the data

import pandas as pd
import argilla as rg
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")

_, test_id = train_test_split(df.ID, test_size=0.2, random_state=42)
# Use a set of ID values: `x in Series` checks the index, not the values
test_id = set(test_id)

labels = [
    "Computer Science",
    "Physics",
    "Mathematics",
    "Statistics",
    "Quantitative Biology",
    "Quantitative Finance",
]


def make_record(row):
    annotation = [label for label in labels if row[label] == 1]
    return rg.TextClassificationRecord(
        inputs=row.TITLE,
        # inputs={"title": row.TITLE, "abstract": row.ABSTRACT},
        annotation=annotation if row.ID in test_id else None,
        multi_label=True,
        id=row.ID,
    )


records = df.apply(make_record, axis=1)


In [None]:
# publish the dataset in the Hub

dataset_rg = rg.DatasetForTextClassification(records.tolist())

dataset_rg.to_datasets().push_to_hub("argilla/research_titles_multi-label")
