# Annotate new data to improve NLP models using Rubrix and biome.text

## Introduction

Hey there! In this guide, we will show you how to use Rubrix to annotate new data, and how to use that data to improve existing Deep Learning models. Our use case is Automatic Misogyny Identification (AMI): Deep Learning models that detect underlying misogyny in a given text. Ground-breaking work is done on this subject every year, with shared tasks and new models that bring performance closer and closer to what is needed for deployment in apps, social networks and other digital environments.

To train these NLP models we are going to use [biome.text](https://github.com/recognai/biome-text), an open-source library to train models with a simple workflow. Rubrix is compatible with almost any library or service, so we will work back and forth with both of them. 

The data used to feed the models and make the annotations comes from the [IberEval 2018](https://sites.google.com/view/ibereval-2…) shared task. It's a compilation of tweets, analyzed by experts and classified into 5 different misogyny categories. We are also making the specific datasets used in each step of this guide available, so that you can reproduce it as closely as possible.

## Dependencies

If you want to reproduce this code, make sure that all the libraries needed to run this guide are installed and imported.

In [None]:
%pip install -U git+https://github.com/recognai/biome-text
%pip install rubrix
%pip install pandas
exit(0)  # Force restart of the runtime

In [None]:
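# Optional: log training runs to Weights & Biases.
# Never commit a real API key to a shared notebook; use your own key here or run `wandb login` instead.
import os
os.environ['WANDB_API_KEY'] = '<YOUR_WANDB_API_KEY>'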

In [1]:
from biome.text import *
import pandas as pd
import rubrix as rb

# Imports used for the hyperparameter search and experiment logging
from biome.text.hpo import TuneExperiment
from ray.tune.suggest.hyperopt import HyperOptSearch
from ray import tune
import math

import wandb



## Loading datasets

Let's load some prepared datasets we made to quickly train our first model.

In [2]:
# Loading the datasets
training_ds = Dataset.from_csv('annotation_data/training_full_df.csv')
test_ds = Dataset.from_csv('annotation_data/test_df.csv')

# Removing non-useful generated columns
training_ds = training_ds.map(remove_columns=["Unnamed: 0", "id"])
test_ds = test_ds.map(remove_columns=["Unnamed: 0", "id"])

Using custom data configuration default-408234f3167ff690
Reusing dataset csv (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/csv/default-408234f3167ff690/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
Using custom data configuration default-ae7b327d976b362b
Reusing dataset csv (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/csv/default-ae7b327d976b362b/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)


Let's take a quick look at the columns and the number of rows of the training dataset.

In [4]:
training_ds

Dataset({
    features: ['label', 'text'],
    num_rows: 3292
})
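The two steps above (dropping the generated columns and checking what is left) can also be sketched without biome.text. The snippet below uses only the standard library and a toy CSV whose rows are invented for illustration; only the column names `Unnamed: 0`, `id`, `label` and `text` come from this guide. It also counts the labels, which is a cheap way to spot class imbalance before training.

```python
import csv
import io
from collections import Counter

# Toy CSV standing in for the real files; these rows are invented for illustration.
RAW_CSV = """Unnamed: 0,id,label,text
0,101,0,example tweet one
1,102,stereotype,example tweet two
2,103,0,example tweet three
3,104,dominance,example tweet four
"""

# Same columns that the guide removes from the real datasets
DROP_COLUMNS = {"Unnamed: 0", "id"}


def load_rows(raw_csv, drop_columns):
    """Parse CSV text and drop the unwanted columns from every row."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [
        {key: value for key, value in row.items() if key not in drop_columns}
        for row in reader
    ]


rows = load_rows(RAW_CSV, DROP_COLUMNS)
label_counts = Counter(row["label"] for row in rows)

print("columns:", sorted(rows[0].keys()))
print("num_rows:", len(rows))
print("label distribution:", dict(label_counts))
```

On the real training file, the same kind of count would show how many tweets fall into each misogyny category versus the non-sexist `0` class.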

## Training the first model

Creating NLP pipelines with biome.text is quick and convenient! We ran an HPO (hyperparameter optimization) process in the background to find suitable hyperparameters for this domain, so let's use them to create our first AMI model. Note that we're building the pipeline on BETO, a Spanish Transformer model, with a text classification head on top. To learn more about what a Transformer is, please visit the [Transformer guide of biome.text](https://recognai.github.io/biome-text/v3.0.0/documentation/tutorials/4-Using_Transformers_in_biome_text.html).

In [5]:
pipeline_dict = {
    "name": "AMI_first_model",
    "features": {
        "transformers": {
            "model_name": "dccuchile/bert-base-spanish-wwm-cased", # BETO model
            "trainable": True,
            "max_length": 280,  # As we are working with data from Twitter, this is our max length
        }
    },
    "head": {
        "type": "TextClassification",
        # These are the possible misogyny categories. 0 indicates it is non-sexist
        "labels": [
            'sexual_harassment',
            'dominance',
            'discredit',
            'stereotype',
            'derailing',
            'passive',
            'active',
            '0',
        ],
        "pooler": {
            "type": "lstm",
            "num_layers": 1,
            "hidden_size": 256,
            "bidirectional": True,
        },
    },
}

pl = Pipeline.from_config(pipeline_dict)

Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
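Because the classification head only knows the labels listed in the pipeline configuration, it can help to check that every label in the data is covered before training. The following is a minimal sketch under stated assumptions: `find_unknown_labels` is a hypothetical helper (not part of biome.text), and `toy_labels` is invented data standing in for the `label` column.

```python
# Labels copied from the pipeline configuration above
HEAD_LABELS = [
    "sexual_harassment", "dominance", "discredit", "stereotype",
    "derailing", "passive", "active", "0",
]


def find_unknown_labels(dataset_labels, head_labels):
    """Return the sorted labels seen in the data but missing from the head.

    Labels are compared as strings, so an integer 0 matches the '0' label.
    """
    return sorted({str(label) for label in dataset_labels} - set(head_labels))


# Invented example: 'harassment' is deliberately not one of the head labels
toy_labels = ["0", "stereotype", 0, "dominance", "harassment"]
unknown = find_unknown_labels(toy_labels, HEAD_LABELS)
print(unknown)  # ['harassment']
```

An empty list means every label in the data has a matching output class in the head.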


In [6]:
# Search space explored during the hyperparameter optimization (HPO) run.
# It is kept here for reference; the tuned values are used in the next cell.
batch_size = 16
trainer_dict = {
    "optimizer": {
        "type": "adamw",
        "lr": tune.loguniform(1e-5, 1e-4),
        "weight_decay": tune.loguniform(2e-3, 6e-2),
    },
    "learning_rate_scheduler": {
        "type": "linear_with_warmup",
        "num_epochs": 10,
        "num_steps_per_epoch": len(training_ds) // batch_size,
        "warmup_steps": 100,
    },
    "batch_size": batch_size,
    "num_epochs": 10,
}
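To make the `linear_with_warmup` settings more concrete, here is a rough sketch of how such a schedule is commonly defined: the learning rate grows linearly from zero during the warmup steps and then decays linearly to zero at the last step. This is an assumption about the scheduler's shape, not a copy of the library's implementation, and `linear_with_warmup_factor` is a hypothetical helper name. The numbers come from this guide: 3292 training rows, a batch size of 16, 10 epochs and 100 warmup steps.

```python
def linear_with_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup phase: ramp up linearly from 0 to 1
        return step / max(1, warmup_steps)
    # Decay phase: go down linearly from 1 to 0 at total_steps
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


num_rows = 3292      # rows in the training dataset
batch_size = 16
num_epochs = 10
warmup_steps = 100

steps_per_epoch = num_rows // batch_size     # 205
total_steps = steps_per_epoch * num_epochs   # 2050

for step in (0, 50, 100, 1075, 2050):
    factor = linear_with_warmup_factor(step, warmup_steps, total_steps)
    print(f"step {step:>4}: lr multiplier = {factor:.2f}")
```

With these values the multiplier reaches 1.0 at step 100, which is roughly half of the first epoch, and falls back to 0.5 at step 1075.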

In [7]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adamw",
        "lr": 0.000023636840436059507,
        "weight_decay": 0.01438297700463013,
    },
    batch_size=8,
    max_epochs=10,
)

In [8]:
trainer = Trainer(
    pipeline=pl,
    train_dataset=training_ds,
    valid_dataset=test_ds,
    trainer_config=trainer_config
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores


In [None]:
trainer.fit()

[34m[1mwandb[0m: Currently logged in as: [33mignacioct[0m (use `wandb login --relogin` to force relogin)



  | Name  | Type               | Params
---------------------------------------------
0 | _head | TextClassification | 111 M 
---------------------------------------------
111 M     Trainable params
0         Non-trainable params
111 M     Total params
447.825   Total estimated model params size (MB)

