# Toxicity Type Detection

![SageMaker](https://img.shields.io/badge/SageMaker-%23FF9900.svg?style=for-the-badge&logo=amazon-aws&logoColor=white)

This notebook is part of the [ToChiquinho](https://dougtrajano.github.io/ToChiquinho/) project. It trains a model to detect toxicity types in Portuguese text using the [OLID-BR](https://dougtrajano.github.io/olid-br/) dataset.

The model is trained using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).

- [Setup](#Setup)
- [Prepare the data](#Prepare-the-data)
  - [Uploading the data to S3](#Uploading-the-data-to-S3)
- [Training process](#Training-process)
  - [Define the estimator](#Define-the-estimator)
  - [Hyperparameter tuning](#Hyperparameter-tuning)
  - [Training best model](#Training-best-model)
- [Documentation](#Documentation)

## Setup

In this section, we will import the necessary libraries and set up the environment.

In [1]:
from dotenv import load_dotenv
load_dotenv("../.env")

True

In [2]:
import os
from ml.arguments import NotebookArguments
from ml.utils import remove_checkpoints

params = NotebookArguments(
    mlflow_experiment_name="toxicity-type-detection",
    mlflow_tags={
        "project": "ToChiquinho",
        "dataset": "OLID-BR",
        "model_type": "bert",
        "problem_type": "multi_label_classification"
    },
    sagemaker_execution_role_arn=os.environ.get("SAGEMAKER_EXECUTION_ROLE_ARN"),
    sagemaker_tuning_job_name="pytorch-training-221118-2303",
    aws_profile_name=os.environ.get("AWS_PROFILE")
)

params

NotebookArguments(num_train_epochs=30, early_stopping_patience=2, batch_size=8, validation_split=0.2, seed=1993, mlflow_experiment_name='toxicity-type-detection', mlflow_tags='{"project": "ToChiquinho", "dataset": "OLID-BR", "model_type": "bert", "problem_type": "multi_label_classification"}', sagemaker_image_uri='215993976552.dkr.ecr.us-east-1.amazonaws.com/sagemaker-transformers:1.12.0-gpu-py38')

In [3]:
import boto3
import sagemaker

sagemaker_session = sagemaker.Session(
    boto_session=boto3.Session(profile_name=params.aws_profile_name)
)

bucket_name = sagemaker_session.default_bucket()
prefix = f"ToChiquinho/{params.mlflow_experiment_name}"

if params.sagemaker_execution_role_arn is None:
    params.sagemaker_execution_role_arn = sagemaker.get_execution_role(sagemaker_session)

## Prepare the data

In this section, we will prepare the data to be used in the training process.

We will download the OLID-BR dataset from [HuggingFace Datasets](https://huggingface.co/datasets/olidbr), process it, and upload it to S3 so the training jobs can read it.

In [4]:
from datasets import load_dataset

dataset = load_dataset("dougtrajano/olid-br")

Using custom data configuration dougtrajano--olid-br-f83aad8215e23434
Found cached dataset parquet (C:/Users/trajano/.cache/huggingface/datasets/dougtrajano___parquet/dougtrajano--olid-br-f83aad8215e23434/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)



In [5]:
import datasets
from typing import Union

def prepare_dataset(
    dataset: datasets.DatasetDict,
    test_size: float = 0.2,
    seed: int = 42
) -> datasets.DatasetDict:
    """Keep offensive rows with at least one toxicity label and add a validation split."""

    # Filter only rows with is_offensive = "OFF"
    dataset = dataset.filter(lambda example: example["is_offensive"] == "OFF")

    # Filter only offensive comments with at least one toxicity label
    labels = [
        "health",
        "ideology",
        "insult",
        "lgbtqphobia",
        "other_lifestyle",
        "physical_aspects",
        "profanity_obscene",
        "racism",
        "sexism",
        "xenophobia"
    ]

    dataset = dataset.filter(
        lambda example: any(example[label] for label in labels)
    )

    # Keep only toxicity labels columns and text column
    dataset = dataset.remove_columns(
        [
            col for col in dataset["train"].column_names if col not in labels + ["text"]
        ]
    )

    train_dataset = dataset["train"].train_test_split(
        test_size=test_size,
        shuffle=True,
        seed=seed
    )

    dataset["train"] = train_dataset["train"]
    dataset["validation"] = train_dataset["test"]

    return dataset


dataset = prepare_dataset(
    dataset,
    test_size=params.validation_split,
    seed=params.seed
)

dataset

Loading cached processed dataset at C:\Users\trajano\.cache\huggingface\datasets\dougtrajano___parquet\dougtrajano--olid-br-f83aad8215e23434\0.0.0\2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec\cache-33e930f77f9076e5.arrow
Loading cached processed dataset at C:\Users\trajano\.cache\huggingface\datasets\dougtrajano___parquet\dougtrajano--olid-br-f83aad8215e23434\0.0.0\2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec\cache-2aedd0dd5e9a3144.arrow



DatasetDict({
    test: Dataset({
        features: ['text', 'health', 'ideology', 'insult', 'lgbtqphobia', 'other_lifestyle', 'physical_aspects', 'profanity_obscene', 'racism', 'sexism', 'xenophobia'],
        num_rows: 1438
    })
    train: Dataset({
        features: ['text', 'health', 'ideology', 'insult', 'lgbtqphobia', 'other_lifestyle', 'physical_aspects', 'profanity_obscene', 'racism', 'sexism', 'xenophobia'],
        num_rows: 3417
    })
    validation: Dataset({
        features: ['text', 'health', 'ideology', 'insult', 'lgbtqphobia', 'other_lifestyle', 'physical_aspects', 'profanity_obscene', 'racism', 'sexism', 'xenophobia'],
        num_rows: 855
    })
})
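
The split sizes above can be checked by hand. The sketch below assumes the validation share is rounded up to a whole row, which is consistent with the counts printed above. `split_sizes` is a hypothetical helper written for this check, not part of the `datasets` library.

```python
import math
from typing import Tuple


def split_sizes(n_rows: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_validation), rounding the validation share up."""
    n_validation = math.ceil(n_rows * test_size)
    return n_rows - n_validation, n_validation


# The train and validation rows shown above both come from the filtered train split.
n_train, n_validation = split_sizes(3417 + 855, test_size=0.2)
print(n_train, n_validation)  # 3417 855
```

The test split (1438 rows) is untouched by this step. Only the original train split is divided.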

In [8]:
dataset.save_to_disk("data")


### Uploading the data to S3

We are going to use the `sagemaker.Session.upload_data` method to upload our datasets to an S3 location.

The return value, `inputs`, identifies that S3 location. We will pass it to the estimator later, when we start the training job.
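
As a rough illustration of what that location looks like, the sketch below builds the S3 URI from a bucket name and key prefix. `s3_uri` is a hypothetical helper, not a SageMaker API. It only mirrors the `s3://<bucket>/<key_prefix>` shape of the path used in this notebook.

```python
def s3_uri(bucket: str, key_prefix: str) -> str:
    """Compose an S3 URI from a bucket name and a key prefix."""
    return f"s3://{bucket}/{key_prefix.strip('/')}"


# Values taken from this notebook: the bucket name and the experiment prefix.
prefix = "ToChiquinho/toxicity-type-detection"
uri = s3_uri("sagemaker-us-east-1-215993976552", f"{prefix}/data")
print(uri)
```

With these values the result matches the hardcoded `inputs` path used in the next cell.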

In [4]:
# inputs = sagemaker_session.upload_data(
#     path="data",
#     bucket=bucket_name,
#     key_prefix=f"{prefix}/data"
# )

inputs = "s3://sagemaker-us-east-1-215993976552/ToChiquinho/toxicity-type-detection/data"

print("input spec (in this case, just an S3 path): {}".format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-east-1-215993976552/ToChiquinho/toxicity-type-detection/data


In [10]:
import shutil
shutil.rmtree("data")

## Training process

In this section, we will run the training process.

To use Amazon SageMaker to run Docker containers, we need to provide a Python script for the container to run. In our case, all the code is in the `ml` folder, including the `train.py` script.

We will start with a hyperparameter tuning job to find the best hyperparameters for our model.

Then, we will train the model using the best hyperparameters found.

In [5]:
import os
import mlflow

mlflow.start_run()

print(f"MLFlow run ID: {mlflow.active_run().info.run_id}")

MLFlow run ID: 481954e8af32432e83b2af1151d781a0


### Define the estimator

We will use the `sagemaker.huggingface.HuggingFace` class to define our estimator.

In [6]:
import logging
from sagemaker.huggingface import HuggingFace

checkpoint_s3_uri = f"s3://{bucket_name}/{prefix}/checkpoints"

instance_type = "ml.g4dn.xlarge" # 4 vCPUs, 16 GB RAM, 1 x NVIDIA T4 16GB GPU - $ 0.736 per hour
# instance_type = "ml.g4dn.2xlarge" # 8 vCPUs, 32 GB RAM, 1 x NVIDIA T4 16GB GPU - $ 0.94 per hour
# instance_type = "ml.g5.xlarge" # 4 vCPUs, 16 GB RAM, 1 x NVIDIA A10G 24GB GPU - $ 1.408 per hour
# instance_type = "ml.g5.2xlarge" # 8 vCPUs, 32 GB RAM, 1 x NVIDIA A10G 24GB GPU - $ 1.515 per hour
# instance_type = "ml.g5.4xlarge" # 16 vCPUs, 64 GB RAM, 1 x NVIDIA A10G 24GB GPU - $ 2.03 per hour
# instance_type = "ml.g5.8xlarge" # 32 vCPUs, 128 GB RAM, 1 x NVIDIA A10G 24GB GPU - $ 3.06 per hour

estimator = HuggingFace(
    entry_point="train.py",
    source_dir="ml",
    base_job_name=params.mlflow_experiment_name,
    container_log_level=logging.DEBUG,
    role=params.sagemaker_execution_role_arn,
    sagemaker_session=sagemaker_session,
    py_version="py38",
    pytorch_version="1.10.2",
    transformers_version="4.17.0",
    instance_count=1,
    instance_type=instance_type,
    use_spot_instances=True,
    max_wait=10800,
    max_run=10800,
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path="/opt/ml/checkpoints",
    environment={
        "MLFLOW_TRACKING_URI": params.mlflow_tracking_uri,
        "MLFLOW_EXPERIMENT_NAME": params.mlflow_experiment_name,
        "MLFLOW_TRACKING_USERNAME": params.mlflow_tracking_username,
        "MLFLOW_TRACKING_PASSWORD": params.mlflow_tracking_password,
        "MLFLOW_TAGS": params.mlflow_tags,
        "MLFLOW_RUN_ID": mlflow.active_run().info.run_id,
        "MLFLOW_FLATTEN_PARAMS": "True",
        "WANDB_DISABLED": "True"
    },
    hyperparameters={
        ## If you want to test the code, uncomment the following lines to use smaller datasets
        # "max_train_samples": 100,
        # "max_val_samples": 100,
        # "max_test_samples": 100,
        "num_train_epochs": params.num_train_epochs,
        "early_stopping_patience": params.early_stopping_patience,
        "batch_size": params.batch_size,
        "eval_dataset": "validation",
        "seed": params.seed
    },
)
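
The estimator enables managed spot training with `max_run` and `max_wait` both set to 10800 seconds. Managed spot training requires `max_wait` to be at least `max_run`, because `max_wait` covers both the time spent waiting for spot capacity and the training time. The sketch below is a hypothetical pre-flight check (`spot_wait_slack` is not a SageMaker function) that reports how much time is left for waiting.

```python
def spot_wait_slack(max_run: int, max_wait: int) -> int:
    """Return the seconds available for waiting on spot capacity.

    Raises ValueError when max_wait is shorter than max_run.
    """
    if max_wait < max_run:
        raise ValueError(
            f"max_wait ({max_wait}s) must be >= max_run ({max_run}s)"
        )
    return max_wait - max_run


# Values used by the estimator above.
slack = spot_wait_slack(max_run=10800, max_wait=10800)
print(f"Slack for waiting on spot capacity: {slack}s")
```

With equal values the slack is zero, so any time spent waiting for capacity or recovering from an interruption reduces the time left for training. Raising `max_wait` above `max_run` is one way to leave headroom.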

In [7]:
mlflow.log_params(
    {
        "instance_type": estimator.instance_type,
        "instance_count": estimator.instance_count,
        "early_stopping_patience": params.early_stopping_patience
    }
)

To test the training job before hyperparameter tuning, we can run it on a small number of samples by uncommenting the `max_*_samples` hyperparameters in the estimator definition.
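
The commented-out `max_train_samples`, `max_val_samples`, and `max_test_samples` hyperparameters in the estimator cap the dataset sizes. How `train.py` applies them is not shown in this notebook. The sketch below is one plausible reading, with a hypothetical helper `limit_samples` that keeps the first `max_samples` rows of a split.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def limit_samples(rows: Sequence[T], max_samples: Optional[int]) -> List[T]:
    """Keep at most max_samples rows; keep everything when the cap is None."""
    if max_samples is None:
        return list(rows)
    return list(rows[:max_samples])


# Stand-in for the 3417-row training split.
train_rows = list(range(3417))
print(len(limit_samples(train_rows, 100)))   # 100
print(len(limit_samples(train_rows, None)))  # 3417
```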

In [9]:
estimator.fit(inputs, wait=False)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2022-12-29-14-41-19-455


### Hyperparameter tuning

We will use the `sagemaker.tuner.HyperparameterTuner` class to run a hyperparameter tuning process.

We use MLflow to track the training process, so we can analyze the results through the MLflow UI.
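
Before launching the tuner, it can help to estimate its budget. The sketch below uses values from this notebook: 18 jobs, 3 in parallel, a `max_run` of 10800 seconds, and the on-demand price noted in the `ml.g4dn.xlarge` comment. It is a rough worst-case estimate that assumes every job uses its full `max_run`. It ignores spot discounts and time spent waiting for spot capacity.

```python
import math

MAX_JOBS = 18
MAX_PARALLEL_JOBS = 3
MAX_RUN_SECONDS = 10800
PRICE_PER_HOUR_USD = 0.736  # on-demand price from the instance_type comment

max_run_hours = MAX_RUN_SECONDS / 3600

# Number of sequential batches if jobs ran in full groups of MAX_PARALLEL_JOBS.
rounds = math.ceil(MAX_JOBS / MAX_PARALLEL_JOBS)
max_wall_clock_hours = rounds * max_run_hours

# Cost if every job were billed for its full max_run at the on-demand price.
max_cost_usd = MAX_JOBS * max_run_hours * PRICE_PER_HOUR_USD

print(f"rounds: {rounds}")
print(f"wall-clock estimate: {max_wall_clock_hours:.1f} h")
print(f"cost estimate: ${max_cost_usd:.2f}")
```

In practice, early stopping and spot pricing should bring both numbers down. For comparison, the best job reported later in this notebook ran for about 47 minutes.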

In [23]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

estimator._hyperparameters.pop("max_train_samples", None)
estimator._hyperparameters.pop("max_val_samples", None)
estimator._hyperparameters.pop("max_test_samples", None)

tuner = HyperparameterTuner(
    estimator,
    max_jobs=18,
    max_parallel_jobs=3,
    objective_type="Maximize",
    objective_metric_name="eval_f1",
    metric_definitions=[
        {
            "Name": "eval_f1",
            "Regex": "eval_f1_weighted: ([0-9\\.]+)"
        }
    ],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-3),
        "weight_decay": ContinuousParameter(0.0, 0.1),
        "adam_beta1": ContinuousParameter(0.8, 0.999),
        "adam_beta2": ContinuousParameter(0.8, 0.999),
        "adam_epsilon": ContinuousParameter(1e-8, 1e-6),
        "optim": CategoricalParameter(
            [
                "adamw_hf",
                "adamw_torch",
                "adamw_apex_fused",
                "adafactor"
            ]
        )
    }
)
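
SageMaker reads the objective metric by applying the `Regex` from `metric_definitions` to the training job logs. The sketch below applies the same pattern locally. Both sample log lines are hypothetical, since the exact format printed by `train.py` is not shown in this notebook.

```python
import re
from typing import Optional

# Same pattern as in metric_definitions above.
METRIC_REGEX = "eval_f1_weighted: ([0-9\\.]+)"


def extract_metric(log_line: str) -> Optional[float]:
    """Return the captured metric value, or None when the line does not match."""
    match = re.search(METRIC_REGEX, log_line)
    if match is None:
        return None
    return float(match.group(1))


# Hypothetical log line in "name: value" form: the pattern matches.
print(extract_metric("eval_f1_weighted: 0.7591"))

# Hypothetical dict-style log line: the quote between the name and the
# colon prevents a match, so no metric would be captured.
print(extract_metric("{'eval_f1_weighted': 0.7591}"))
```

If a tuning job reports no objective value, checking the log format against the pattern in this way is a reasonable first step.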

In [None]:
tuner.fit(inputs, wait=False)

params.sagemaker_tuning_job_name = tuner.latest_tuning_job.name

In [8]:
print(f"SageMaker tuning job name: {params.sagemaker_tuning_job_name}")

SageMaker tuning job name: pytorch-training-221118-2303


In [9]:
import pandas as pd

tuner_metrics: pd.DataFrame = sagemaker.HyperparameterTuningJobAnalytics(
    hyperparameter_tuning_job_name=params.sagemaker_tuning_job_name,
    sagemaker_session=sagemaker_session
).dataframe()

tuner_metrics["optim"] = tuner_metrics["optim"].apply(lambda x: x.strip('"'))

tuner_metrics.sort_values("FinalObjectiveValue", ascending=False, inplace=True)
tuner_metrics[["TrainingJobName", "FinalObjectiveValue", "TrainingJobStatus"]]

| | TrainingJobName | FinalObjectiveValue | TrainingJobStatus |
|---|---|---|---|
| 17 | pytorch-training-221118-2303-001-2c25be0c | 0.759114 | Completed |
| 12 | pytorch-training-221118-2303-006-49d87b23 | 0.75866 | Completed |
| 11 | pytorch-training-221118-2303-007-1182e099 | 0.725817 | Completed |
| 10 | pytorch-training-221118-2303-008-10c5fe8b | 0.719377 | Completed |
| 13 | pytorch-training-221118-2303-005-f1adc05a | 0.718727 | Completed |
| 14 | pytorch-training-221118-2303-004-1fdab0a9 | 0.714234 | Completed |
| 16 | pytorch-training-221118-2303-002-c40b0d07 | 0.70998 | Completed |
| 9 | pytorch-training-221118-2303-009-5743f9ce | 0.709527 | Completed |
| 6 | pytorch-training-221118-2303-012-c98d6e1c | 0.69105 | Completed |
| 7 | pytorch-training-221118-2303-011-cd4e57b8 | 0.675431 | Completed |


The results are already sorted by `FinalObjectiveValue` in descending order, so the first row holds the best hyperparameters found.

In [10]:
best_job = tuner_metrics.iloc[0]
best_job.to_dict()

{'adam_beta1': 0.9339215524915885,
 'adam_beta2': 0.9916979096990963,
 'adam_epsilon': 3.4435900142455904e-07,
 'learning_rate': 7.044186985160909e-05,
 'optim': 'adamw_apex_fused',
 'weight_decay': 0.02426675806866223,
 'TrainingJobName': 'pytorch-training-221118-2303-001-2c25be0c',
 'TrainingJobStatus': 'Completed',
 'FinalObjectiveValue': 0.7591144442558289,
 'TrainingStartTime': Timestamp('2022-11-18 23:04:37-0300', tz='tzlocal()'),
 'TrainingEndTime': Timestamp('2022-11-18 23:51:48-0300', tz='tzlocal()'),
 'TrainingElapsedTimeSeconds': 2831.0}
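
The same selection can be written without pandas. The sketch below works on plain records copied from the first three rows of the results table and adds an explicit filter on `TrainingJobStatus`. `best_completed_job` is a hypothetical helper, and the status filter is an extra safeguard that the cell above does not apply.

```python
from typing import Dict, List

# First three rows of the tuning results shown above.
jobs: List[Dict] = [
    {
        "TrainingJobName": "pytorch-training-221118-2303-006-49d87b23",
        "FinalObjectiveValue": 0.75866,
        "TrainingJobStatus": "Completed",
    },
    {
        "TrainingJobName": "pytorch-training-221118-2303-001-2c25be0c",
        "FinalObjectiveValue": 0.759114,
        "TrainingJobStatus": "Completed",
    },
    {
        "TrainingJobName": "pytorch-training-221118-2303-007-1182e099",
        "FinalObjectiveValue": 0.725817,
        "TrainingJobStatus": "Completed",
    },
]


def best_completed_job(records: List[Dict]) -> Dict:
    """Return the completed job with the highest objective value."""
    completed = [r for r in records if r["TrainingJobStatus"] == "Completed"]
    if not completed:
        raise ValueError("no completed training jobs to choose from")
    return max(completed, key=lambda r: r["FinalObjectiveValue"])


best = best_completed_job(jobs)
print(best["TrainingJobName"])  # pytorch-training-221118-2303-001-2c25be0c
```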

### Training best model

We will train the model using the best hyperparameters found.

We will concatenate the training and validation datasets to train the model with more data and evaluate it using the test dataset.

In [11]:
estimator.environment = {
    "MLFLOW_TRACKING_URI": params.mlflow_tracking_uri,
    "MLFLOW_EXPERIMENT_NAME": params.mlflow_experiment_name,
    "MLFLOW_TRACKING_USERNAME": params.mlflow_tracking_username,
    "MLFLOW_TRACKING_PASSWORD": params.mlflow_tracking_password,
    "MLFLOW_TAGS": params.mlflow_tags,
    "MLFLOW_RUN_ID": mlflow.active_run().info.run_id,
    "MLFLOW_FLATTEN_PARAMS": "True",
    "HF_MLFLOW_LOG_ARTIFACTS": "True",
    "HUGGINGFACE_HUB_TOKEN": params.huggingface_hub_token
}

estimator._hyperparameters = {
    "push_to_hub": "True",
    "hub_model_id": f"dougtrajano/{params.mlflow_experiment_name}",
    "num_train_epochs": params.num_train_epochs,
    "early_stopping_patience": params.early_stopping_patience,
    "batch_size": params.batch_size,
    "seed": params.seed,
    "concat_validation_set": "True",
    "eval_dataset": "test",
    "adam_beta1": best_job["adam_beta1"],
    "adam_beta2": best_job["adam_beta2"],
    "adam_epsilon": best_job["adam_epsilon"],
    "learning_rate": best_job["learning_rate"],
    "weight_decay": best_job["weight_decay"],
    "optim": best_job["optim"]
}

estimator.fit(inputs, wait=False)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: toxicity-type-detection-2023-02-26-21-42-23-182


In [13]:
remove_checkpoints(
    bucket_name=bucket_name,
    checkpoint_prefix=f"{prefix}/checkpoints",
    aws_profile_name=params.aws_profile_name
)

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


Deleted 85 checkpoints from sagemaker-us-east-1-215993976552/ToChiquinho/toxicity-type-detection/checkpoints.


In [None]:
mlflow.end_run()

## Documentation

- [Estimators — sagemaker documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)
- [HyperparameterTuner — sagemaker documentation](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html)
- [Configure and Launch a Hyperparameter Tuning Job - Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-tuning-job.html)
- [Managed Spot Training in Amazon SageMaker - Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html)