# Annotate new data to improve NLP models using Rubrix and biome.text

## Introduction

Hey there! In this guide, we will show you how to use Rubrix to annotate new data and then use that data to improve existing Deep Learning models. Our use case will be Automatic Misogyny Identification (AMI): Deep Learning models able to detect the underlying misogyny in a given text. Ground-breaking work is being done every year on this subject, with shared tasks and new models that push performance closer and closer to what is needed for deployment in apps, social networks, and other digital environments.

To train these NLP models we are going to use [biome.text](https://github.com/recognai/biome-text), an open-source library to train models with a simple workflow. Rubrix is compatible with almost any library or service, so we will work back and forth with both of them. 

The data used to feed the models and make the annotations comes from the [IberEval 2018](https://sites.google.com/view/ibereval-2…) shared task. It's a compilation of tweets, analyzed by experts and classified into 5 different misogyny categories. We are also making the specific datasets used in each step of this guide available, so you can reproduce the results as closely as possible.

## Dependencies

If you want to reproduce this code, make sure that all the libraries needed to run this guide are installed and imported.

In [None]:
%pip install -U git+https://github.com/recognai/biome-text
%pip install rubrix
%pip install pandas
exit(0)  # Force restart of the runtime

In [None]:
# TODO: remove this cell

import os

# Use your own Weights & Biases API key here; never publish a real key in a notebook
os.environ['WANDB_API_KEY'] = '<YOUR_WANDB_API_KEY>'

In [None]:
from biome.text import *
import pandas as pd
import rubrix as rb

# Utilities for hyperparameter optimization (HPO) and experiment tracking
from biome.text.hpo import TuneExperiment
from ray.tune.suggest.hyperopt import HyperOptSearch
from ray import tune
import math

import wandb 

## Loading datasets

Let's load some prepared datasets we've made to quickly train our first model.

In [None]:
# Loading the datasets
training_ds = Dataset.from_csv('annotation_data/training_full_df.csv')
test_ds = Dataset.from_csv('annotation_data/test_df.csv')

# Removing non-useful generated columns
training_ds = training_ds.map(remove_columns=["Unnamed: 0", "id"])
test_ds = test_ds.map(remove_columns=["Unnamed: 0", "id"])

Let's take a quick look at the columns and the number of rows of the dataset.

In [None]:
training_ds
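
Beyond the columns and the row count, it can help to check how the categories are distributed before training. The sketch below is only an illustration: the `rows` list is made-up data, and its `text` and `label` keys are an assumption that mirrors the columns used later in this guide. It relies only on the standard library.

```python
from collections import Counter

# Hypothetical rows standing in for the training dataset
rows = [
    {"text": "example tweet 1", "label": "0"},
    {"text": "example tweet 2", "label": "discredit"},
    {"text": "example tweet 3", "label": "0"},
    {"text": "example tweet 4", "label": "dominance"},
]


def label_distribution(rows):
    """Return the share of each label, most frequent first."""
    counts = Counter(row["label"] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


distribution = label_distribution(rows)
print(distribution)
```

A heavily skewed distribution is a hint that some categories may need extra annotated examples.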

## Training the first model

Creating NLP pipelines with biome.text is quick and convenient! We ran an HPO process in the background to find suitable hyperparameters for this domain, so let's use them to create our first AMI model. Note that we're building the pipeline on top of BETO, a Spanish Transformer model, with a text classification head. To learn more about what a Transformer is, please visit the [Transformer guide of biome.text](https://recognai.github.io/biome-text/v3.0.0/documentation/tutorials/4-Using_Transformers_in_biome_text.html).

In [None]:
pipeline_dict = {
    "name": "AMI_first_model",
    "features": {
        "transformers": {
            "model_name": "dccuchile/bert-base-spanish-wwm-cased", # BETO model
            "trainable": True,
            "max_length": 280,  # As we are working with data from Twitter, this is our max length
        }
    },
    "head": {
        "type": "TextClassification",
        
        # These are the possible misogyny categories. '0' indicates the text is non-sexist
        "labels": [
            'sexual_harassment',
            'dominance',
            'discredit',
            'stereotype',
            'derailing',
            'passive',
            'active',
            '0'
        ],
        "pooler": {
            "type": "lstm",
            "num_layers": 1,
            "hidden_size": 256,
            "bidirectional": True,
        },
    },
}

pl = Pipeline.from_config(pipeline_dict)

In [None]:
batch_size = 16
trainer_dict = {
    "optimizer": {
        "type": "adamw",
        "lr": tune.loguniform(1e-5, 1e-4),
        "weight_decay": tune.loguniform(2e-3, 6e-2 )
    },
    "learning_rate_scheduler": {
        "type": "linear_with_warmup",
        "num_epochs": 10,
        "num_steps_per_epoch": int(math.floor(len(training_ds)/batch_size)),
        "warmup_steps": 100,
    },
    "batch_size": batch_size,
    "num_epochs": 10,
    
}
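
To make the scheduler numbers above more tangible, here is a minimal sketch of how a linear schedule with warmup usually behaves. It assumes the common shape (a linear ramp from 0 to the base learning rate during the warmup steps, then a linear decay to 0); the actual implementation behind `linear_with_warmup` may differ in its details. The function name and the dataset size are hypothetical.

```python
import math


def linear_with_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step.

    Assumes the common shape: linear ramp-up, then linear decay to zero.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


num_examples = 3000  # hypothetical dataset size
batch_size = 16
num_epochs = 10
warmup_steps = 100

# Same computation as "num_steps_per_epoch" in the trainer dictionary
steps_per_epoch = math.floor(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

for step in (0, 50, 100, total_steps // 2, total_steps):
    factor = linear_with_warmup_factor(step, warmup_steps, total_steps)
    print(step, round(factor, 3))
```

With 100 warmup steps, the learning rate reaches its configured value early in the first epoch and then decreases for the rest of the training.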

In [None]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adamw",
        "lr": 0.000023636840436059507,
        "weight_decay": 0.01438297700463013,
    },
    batch_size=8,
    max_epochs=10,
)

In [None]:
trainer = Trainer(
    pipeline=pl,
    train_dataset=training_ds,
    valid_dataset=test_ds,
    trainer_config=trainer_config
)

In [None]:
trainer.fit()

After `trainer.fit()` finishes, the training results and the resulting model will be in the output folder. However, we know that this training can take a long time on non-dedicated machines, so we also provide the trained model for download. If you don't want to train the model yourself, run the cell below, which downloads the trained model and imports it into a biome pipeline.

In [None]:
# TODO: download and import code

In [None]:
# TODO: remove
pl = Pipeline.from_pretrained("model_annotation_guide.tar.gz")

We can make some predictions, and take a look at the performance of the model.

In [None]:
pl.predict("Las mujeres no deberían tener derecho a voto")
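
To read a prediction programmatically, we can pick the most probable category and ignore it when the model is not confident enough. The sketch below works on a hand-written dictionary: its shape (parallel `labels` and `probabilities` lists) and its values are assumptions for illustration, so check the actual output of `pl.predict` before reusing it. The `top_label` helper and its `threshold` parameter are hypothetical as well.

```python
# Hypothetical prediction; the real output of the pipeline may be shaped differently
prediction = {
    "labels": ["discredit", "0", "dominance"],
    "probabilities": [0.62, 0.30, 0.08],
}


def top_label(prediction, threshold=0.5):
    """Return (label, probability) for the best category.

    The label is None when the best probability is below the threshold.
    """
    best_label, best_prob = max(
        zip(prediction["labels"], prediction["probabilities"]),
        key=lambda pair: pair[1],
    )
    if best_prob < threshold:
        return None, best_prob
    return best_label, best_prob


print(top_label(prediction))
print(top_label(prediction, threshold=0.9))
```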

## Annotating as a single agent

When we said that we prepared some datasets with the tweets from IberEval 2018, we might have lied a little bit. We prepared the datasets with almost all the tweets from IberEval 2018, but we also set aside a small compilation of 15 instances for you to start annotating. Picture yourself, after training your first model, trying to push its performance a little further, or wanting to include some new data to cover as many different cases as possible. You come across new instances, and you want to annotate them and include them in a follow-up training. This is where Rubrix comes in.

In this chapter of the guide we will show you how to:
* Import datasets to Rubrix (in our case, from a csv file).
* Annotate datasets using Rubrix.
* Export the annotated datasets to use them in your pipelines.

We will cover the scenario of a single annotation agent. In the next chapter, we will give you some insight into how to annotate in teams.

### Logging a dataset into Rubrix

Let's start by logging the data into a Rubrix dataset. As these instances were initially annotated by the IberEval team, we can treat their labels as predictions. In the annotation process, therefore, we will decide whether we agree with those predictions or not. If we started from raw data, we wouldn't have these predictions to support our annotation process, but that's okay too!

The first step is to download the datasets. Then, we will iterate through all the instances, logging them into Rubrix.

In [None]:
#TODO: download to_annotate.csv

annotation_ds = Dataset.from_csv('annotation_data/to_annotate.csv')

In [None]:
records = []    # here we will store the TextClassificationRecord objects

# Possible labels, used to build the predictions
labels = ['sexual_harassment','dominance','discredit','stereotype','derailing','passive','active','0']

for record in annotation_ds:

    # Prediction list of each record in the dataset
    predictions = []

    # We build the prediction list with (label, score) tuples
    for label in labels:

        # If the label is the one assigned in the dataset, it has a score of 1;
        # otherwise, it has a score of 0
        score = 1 if label == record["label"] else 0
        predictions.append((label, score))

    # Appending the record into the list
    records.append(rb.TextClassificationRecord(
        inputs=record["text"],
        prediction=predictions,
        prediction_agent="IberEval 2018",
        metadata={'id': record["id"]},
        )
    )

# Logging the records into Rubrix
rb.log(records=records, name="annotation_misogyny")
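
The prediction-building step in the loop above can also be isolated in a small helper, which makes it easy to test without a running Rubrix instance. This is a minimal sketch using only the standard library; the `one_hot_predictions` name is hypothetical, and the cast to `str` is an assumption to guard against a CSV loader parsing the `'0'` label as an integer.

```python
LABELS = [
    "sexual_harassment", "dominance", "discredit", "stereotype",
    "derailing", "passive", "active", "0",
]


def one_hot_predictions(gold_label, labels=LABELS):
    """Build (label, score) tuples: 1.0 for the given label, 0.0 for the rest."""
    gold_label = str(gold_label)  # assumption: labels may arrive as int 0
    if gold_label not in labels:
        raise ValueError(f"Unknown label: {gold_label!r}")
    return [(label, 1.0 if label == gold_label else 0.0) for label in labels]


predictions = one_hot_predictions("discredit")
print(predictions)
```

Raising an error on unknown labels surfaces typos in the source file early, instead of silently logging a record whose scores are all 0.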

Once we've logged our annotation dataset into Rubrix, we can start annotating on the UI. We know that first times can be challenging, so here we have some instructions and a GIF to show you around.

1. Open Rubrix in your browser. If you're running it locally, it is usually running on [http://localhost:6900](http://localhost:6900).
2. Select the `annotation_misogyny` dataset.
3. In the upper-right corner, toggle the `Annotation mode`.
4. Start selecting the categories that you think fit the input text. If you don't know Spanish, don't worry! 15 instances are not going to change the final model that much, and you will still learn how to annotate.
5. For each instance you can annotate a category by clicking it, discard the record (if you think it does not fit the problem domain), or leave it without an annotation.

![Example of Annotation](https://imgur.com/cdZpkXp.gif)

If you're wondering why we annotated that instance as 'non-sexist', it is because we are trying to build a model capable of telling whether the input text is itself misogynistic or is talking about something misogynistic that happened. The second case is considered non-sexist.

And that's it! We have annotated a dataset as a single annotator, and these new data can be used to retrain and fine-tune our NLP model.
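
To use the annotations in a follow-up training, we need to turn the annotated records back into rows with a text and a label. The sketch below works on plain dictionaries so that it runs without a Rubrix server. The record layout (`inputs`, `annotation`, `status`) and the `"Validated"` status value are assumptions about what an export of the `annotation_misogyny` dataset looks like, and `to_training_rows` is a hypothetical helper; adapt it to the structure you actually get when loading the dataset back from Rubrix.

```python
# Hypothetical exported records; the real export may be structured differently
exported = [
    {"inputs": {"text": "tweet A"}, "annotation": "0", "status": "Validated"},
    {"inputs": {"text": "tweet B"}, "annotation": None, "status": "Discarded"},
    {"inputs": {"text": "tweet C"}, "annotation": None, "status": "Default"},
    {"inputs": {"text": "tweet D"}, "annotation": "stereotype", "status": "Validated"},
]


def to_training_rows(records):
    """Keep only validated, annotated records and flatten them to text/label rows."""
    return [
        {"text": record["inputs"]["text"], "label": record["annotation"]}
        for record in records
        if record["status"] == "Validated" and record["annotation"] is not None
    ]


new_rows = to_training_rows(exported)
print(new_rows)
```

Discarded and unannotated records are left out, so only the instances we explicitly validated end up in the new training data.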