# Annotate new data to improve NLP models using Rubrix and biome.text

## Introduction

Hey there! In this guide, we will show you how to use Rubrix to annotate new data and then use that data to improve existing deep learning models. Our use case is Automatic Misogyny Identification (AMI): deep learning models that detect misogyny in a given text. Ground-breaking work is done on this subject every year, with shared tasks and new models that push performance closer and closer to the level needed for deployment in apps, social networks and other digital environments.

To train these NLP models we are going to use [biome.text](https://github.com/recognai/biome-text), an open-source library to train models with a simple workflow. Rubrix is compatible with almost any library or service, so we will work back and forth with both of them. 

The data used to train the models and make the annotations comes from the [IberEval 2018](https://sites.google.com/view/ibereval-2…) shared task. We are also making the specific datasets used in each step of this guide available, so that every step can be reproduced.

## Dependencies

If you want to reproduce this code, make sure that all the libraries needed to run this guide are installed and imported.

In [None]:
%pip install -U git+https://github.com/recognai/biome-text
%pip install rubrix
%pip install pandas
exit(0)  # Force restart of the runtime

In [None]:
import os

os.environ['WANDB_API_KEY'] = '<YOUR_WANDB_API_KEY>'  # placeholder: never commit a real key
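
Hard-coding an API key in a notebook makes it easy to leak. As a minimal sketch, the key could instead be read from an environment variable that is set outside the notebook. The helper name `require_env` is hypothetical and is not part of wandb or biome.text.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set {name} in your environment before running this notebook.")
    return value
```

With the variable exported in the shell before launching Jupyter, `require_env("WANDB_API_KEY")` returns the key and nothing sensitive is stored in the notebook itself.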

In [33]:
from biome.text import *
import pandas as pd
import rubrix as rb

# Only needed for the hyperparameter optimization (HPO) section
from biome.text.hpo import TuneExperiment
from ray.tune.suggest.hyperopt import HyperOptSearch
from ray import tune
import math

import wandb

## Training the first model


## HPO (won't be included)

In [41]:
training_ds = Dataset.from_csv('annotation_data/training_df.csv')
validation_ds = Dataset.from_csv('annotation_data/validation_df.csv')
test_ds = Dataset.from_csv('annotation_data/test_df.csv')
training_full_ds = Dataset.from_csv('annotation_data/training_full_df.csv')

Using custom data configuration default-dd9545aa755b36d8
Reusing dataset csv (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/csv/default-dd9545aa755b36d8/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
Using custom data configuration default-22ae5ced3fbbbbb1
Reusing dataset csv (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/csv/default-22ae5ced3fbbbbb1/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
Using custom data configuration default-ae7b327d976b362b
Reusing dataset csv (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/csv/default-ae7b327d976b362b/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
Using custom data configuration default-408234f3167ff690
Reusing dataset csv (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/csv/default-408234f3167ff690/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)


In [46]:
training_ds

Dataset({
    features: ['Unnamed: 0', 'id', 'text', 'label'],
    num_rows: 2810
})
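
The `Unnamed: 0` feature listed above is typically what appears when a CSV was written with a pandas index and then read back without declaring it. The sketch below reproduces this on a made-up two-row sample (the rows are illustrative, not taken from the guide's datasets) and shows two common ways to avoid the extra column.

```python
import io

import pandas as pd

# Hypothetical sample mimicking a CSV saved with DataFrame.to_csv() and its default index=True.
raw_csv = ",id,text,label\n0,101,first example,0\n1,102,second example,active\n"

# Reading it naively turns the unnamed index into a regular column.
naive = pd.read_csv(io.StringIO(raw_csv))
print(list(naive.columns))

# Option 1: declare the first column as the index when reading.
clean = pd.read_csv(io.StringIO(raw_csv), index_col=0)
print(list(clean.columns))

# Option 2: write with index=False so the column never reaches the file.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
roundtrip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(roundtrip.columns))
```

Cleaning the files once with either option keeps the stray column out of every dataset loaded from them afterwards.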

In [42]:
pipeline_dict = {
    "name": "rubrix_guide",
    "features": {
        "transformers": {
            "model_name": "dccuchile/bert-base-spanish-wwm-cased",
            "trainable": True,
            "max_length": 280,  # Twitter's character cap
        }
    },
    "head": {
        "type": "TextClassification",
        "multilabel": True,
        "labels": [
            'sexual_harassment',
            'dominance',
            'discredit',
            'stereotype',
            'derailing',
            'passive',
            'active',
             '0'
        ],
        "pooler": {
            "type": tune.choice(["gru", "lstm"]),
            "num_layers": 1,
            "hidden_size": tune.choice([32,64,128,256]),
            "bidirectional": tune.choice([True, False]),
        },
    },
}
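
The pooler entries above define a small discrete search space. As a rough illustration in plain Python (no Ray required), the combinations it contains can be enumerated as follows. The variable names are ours and only serve this example.

```python
from itertools import product

# Discrete options copied from the pooler section of the pipeline configuration.
pooler_types = ["gru", "lstm"]
hidden_sizes = [32, 64, 128, 256]
bidirectional_flags = [True, False]

# Every combination the search algorithm can pick for the pooler.
pooler_grid = [
    {"type": t, "num_layers": 1, "hidden_size": h, "bidirectional": b}
    for t, h, b in product(pooler_types, hidden_sizes, bidirectional_flags)
]

print(len(pooler_grid))  # 2 * 4 * 2 combinations
```

The optimizer settings are sampled from continuous ranges on top of these pooler combinations, so the full space cannot be enumerated this way and a search algorithm is used to explore it instead.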

In [50]:
batch_size = 16

trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adamw",
        "lr": tune.loguniform(1e-5, 1e-4),
        "weight_decay": tune.loguniform(2e-3, 6e-2)
    },
    max_epochs=10,
    batch_size=batch_size,
    monitor="validation_macro/fscore",
    monitor_mode="max"
)
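
`tune.loguniform(low, high)` samples values whose logarithm is uniformly distributed, so each order of magnitude in the range is equally likely. The following standard-library sketch mimics that behaviour for the learning-rate range above. `sample_loguniform` is a hypothetical helper for illustration, not Ray's implementation.

```python
import math
import random


def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
lr_samples = [sample_loguniform(1e-5, 1e-4, rng) for _ in range(1000)]

# About half of the samples should fall below the geometric midpoint of the range.
geometric_midpoint = math.sqrt(1e-5 * 1e-4)
below_midpoint = sum(s < geometric_midpoint for s in lr_samples)
print(below_midpoint / len(lr_samples))
```

Sampling on a log scale is a common choice for learning rates and weight decay, where a change from 1e-5 to 2e-5 tends to matter as much as a change from 5e-5 to 1e-4.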

In [51]:
search_alg = HyperOptSearch(metric="validation_macro/fscore", mode="max")

In [53]:
tune_exp = TuneExperiment(
    pipeline_config=pipeline_dict, 
    trainer_config=trainer_config,
    train_dataset=training_ds,
    valid_dataset=validation_ds,
    name="rubrix_guide",
    # parameters for tune.run
    num_samples=100,
    local_dir="tune_runs",
    resources_per_trial={"cpu": 2, "gpu": 1}
)

In [54]:
analysis_frozen = tune.run(
    tune_exp,
    scheduler=tune.schedulers.ASHAScheduler(), 
    config=tune_exp.config,
    metric="validation_macro/fscore",
    search_alg=search_alg,
    mode="max",
    progress_reporter=tune.JupyterNotebookReporter(overwrite=True),
)

| Field | Value |
| --- | --- |
| Trial name | `_default_trainable_15bf5a4a` |
| status | PENDING |
| loc | |
| metrics | |
| mlflow | True |
| mlflow_tracking_uri | `file:///Users/ignaciotalaveracepeda/Documents/RecognAI/rubrix/docs/guides/mlruns` |
| name | `rubrix_guide` |
| pipeline_config/features/transformers/max_length | 280 |
| pipeline_config/features/transformers/model_name | `dccuchile/bert-base-spanish-wwm-cased` |
| pipeline_config/features/transformers/trainable | True |
| pipeline_config/head/labels | `('sexual_harassment', 'dominance', 'discredit', 'stereotype', 'derailing', 'passive', 'active', '0')` |
| pipeline_config/head/multilabel | True |
| pipeline_config/head/pooler/bidirectional | False |
| pipeline_config/head/pooler/hidden_size | 256 |
| pipeline_config/head/pooler/num_layers | 1 |
| pipeline_config/head/pooler/type | gru |
| pipeline_config/head/type | TextClassification |
| pipeline_config/name | `rubrix_guide` |
| train_dataset_path | `/var/folders/mb/lvj4fyds5757cy_7swmlpt_00000gn/T/tmpz2x55gq7` |
| trainer_config/accumulate_grad_batches | 1 |
| trainer_config/add_csv_logger | True |
| trainer_config/add_early_stopping | True |
| trainer_config/add_lr_monitor | |
| trainer_config/add_tensorboard_logger | True |
| trainer_config/add_wandb_logger | True |
| trainer_config/auto_lr_find | False |
| trainer_config/auto_scale_batch_size | False |
| trainer_config/batch_size | 16 |
| trainer_config/callbacks | |
| trainer_config/check_val_every_n_epoch | 1 |
| trainer_config/checkpoint_callback | True |
| trainer_config/default_root_dir | `/Users/ignaciotalaveracepeda/Documents/RecognAI/rubrix/docs/guides/training_logs` |
| trainer_config/fast_dev_run | False |
| trainer_config/flush_logs_every_n_steps | 100 |
| trainer_config/gpus | |
| trainer_config/gradient_clip_val | 0 |
| trainer_config/limit_test_batches | 1 |
| trainer_config/limit_train_batches | 1 |
| trainer_config/limit_val_batches | 1 |
| trainer_config/log_every_n_steps | 50 |
| trainer_config/logger | True |
| trainer_config/lr_decay | |
| trainer_config/max_epochs | 10 |
| trainer_config/max_steps | |
| trainer_config/min_epochs | |
| trainer_config/min_steps | |
| trainer_config/monitor | `validation_macro/fscore` |
| trainer_config/monitor_mode | max |
| trainer_config/num_sanity_val_steps | 2 |
| trainer_config/num_workers_for_dataloader | 0 |
| trainer_config/optimizer/lr | 5.49823e-05 |
| trainer_config/optimizer/type | adamw |
| trainer_config/optimizer/weight_decay | 0.00597915 |
| trainer_config/overfit_batches | 0 |
| trainer_config/patience | 3 |
| trainer_config/precision | 32 |
| trainer_config/progress_bar_refresh_rate | |
| trainer_config/resume_from_checkpoint | |
| trainer_config/save_top_k_checkpoints | 1 |
| trainer_config/stochastic_weight_avg | False |
| trainer_config/terminate_on_nan | False |
| trainer_config/val_check_interval | 1 |
| trainer_config/warmup_steps | 0 |
| trainer_config/weights_save_path | |
| valid_dataset_path | `/var/folders/mb/lvj4fyds5757cy_7swmlpt_00000gn/T/tmpp148g_y0` |
| wandb | True |


2021-06-30 13:40:24,044	ERROR tune.py:545 -- Trials did not complete: [_default_trainable_15bf5a4a]
2021-06-30 13:40:24,047	INFO tune.py:549 -- Total run time: 374.91 seconds (374.46 seconds for the tuning loop).
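
The `ASHAScheduler` passed to `tune.run` stops poorly performing trials early. ASHA is an asynchronous variant of successive halving. The sketch below shows only the simpler synchronous idea, with hypothetical helper names and made-up scores, and is not how Ray implements the scheduler.

```python
def rung_milestones(grace_period: int, max_t: int, reduction_factor: int) -> list:
    """Return the iteration counts ("rungs") at which trials are compared."""
    if grace_period < 1 or reduction_factor < 2:
        raise ValueError("grace_period must be >= 1 and reduction_factor >= 2")
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


def keep_top_fraction(scores: dict, reduction_factor: int) -> list:
    """Keep the best 1/reduction_factor of the trials (at least one), best first."""
    keep = max(1, len(scores) // reduction_factor)
    return sorted(scores, key=scores.get, reverse=True)[:keep]


# Made-up validation scores for four hypothetical trials at the first rung.
first_rung_scores = {"trial_a": 0.41, "trial_b": 0.63, "trial_c": 0.55, "trial_d": 0.38}

print(rung_milestones(grace_period=1, max_t=10, reduction_factor=2))
print(keep_top_fraction(first_rung_scores, reduction_factor=2))
```

Only the trials returned by `keep_top_fraction` would continue to the next rung, which is how this family of schedulers spends most of the compute budget on the most promising configurations.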
