# Effortless model deployment with MLflow
## Fine-tuning a BERT-based text classification model to detect hate speech in Portuguese tweets

In [None]:
import mlflow
from jobtools.arguments import file2namespace

In [None]:
from transformers import Trainer, TrainingArguments

from hatedetection.text_datasets import ClassificationDataset
from hatedetection.hate_detection_classifier import HateDetectionClassifier
from hatedetection.evaluation import compute_classification_metrics
from hatedetection.text_preparation import load_examples

First, let's load some data to use in our training. We have two datasets available: train and eval. As you can imagine, we will use train for training and eval for evaluating the model's performance.

In [None]:
input_dataset_path = 'data/train/*.csv'
eval_dataset_path = 'data/eval/*.csv'

Let's load the examples and their labels from both datasets:

In [None]:
examples_train, labels_train = load_examples(input_dataset_path)
examples_eval, labels_eval = load_examples(eval_dataset_path)

Let's configure the experiment we will use with MLflow:

In [None]:
mlflow.set_experiment("hate-pt-speech-mlflow")

### Loading our training configuration

We will place all the configuration for our training routine in a `YAML` file. Let's load it:

In [None]:
params = file2namespace('train.params.yml')

> The method `file2namespace` parses a `YAML` file into a `SimpleNamespace` object, which makes the configuration of our model easy to access. It is always good practice to keep hyperparameters out of the training code.
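To make the idea concrete, here is a minimal sketch of how a nested mapping (such as the one a YAML parser returns) could be turned into nested `SimpleNamespace` objects. The helper name `dict_to_namespace` and the sample values are hypothetical; the actual `file2namespace` in `jobtools` may work differently.

```python
from types import SimpleNamespace


def dict_to_namespace(value):
    """Recursively convert dicts (and dicts inside lists) into SimpleNamespace objects."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: dict_to_namespace(v) for k, v in value.items()})
    if isinstance(value, list):
        return [dict_to_namespace(v) for v in value]
    return value


# Hypothetical configuration, shaped like what a YAML parser would return.
config = {
    "model": {"name": "hate-pt-speech", "baseline": "neuralmind/bert-base-portuguese-cased"},
    "trainer": {"num_train_epochs": 1},  # illustrative value only
}
example_params = dict_to_namespace(config)
print(example_params.model.baseline)
```

Because each level is a plain `SimpleNamespace`, `vars(example_params.trainer)` gives back an ordinary dictionary, which is what makes the `**vars(...)` unpacking pattern possible.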

The file contents look as follows:

In [None]:
!cat train.params.yml

We will create an instance of our `HateDetectionClassifier`. This class contains:

- A loader function `_load_pyfunc` that allows us to register this model as a `pyfunc` model in MLflow.
- A method `predict()` that conforms to the signature we will define later for our model.
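As a rough illustration of that contract, the sketch below shows the shape MLflow expects from a custom loader module: a module-level `_load_pyfunc(data_path)` function that returns an object exposing `predict()`. The `ToyClassifier` class and its keyword rule are invented for illustration and are not the real `HateDetectionClassifier`; real pyfunc models typically receive a pandas DataFrame rather than a list of dictionaries.

```python
class ToyClassifier:
    """Hypothetical stand-in for a model wrapper; not the real HateDetectionClassifier."""

    def __init__(self, data_path: str):
        # A real wrapper would load the tokenizer and weights from data_path.
        self.data_path = data_path

    def predict(self, model_input):
        # Return one row per input with an integer "hate" flag and a float "confidence".
        rows = []
        for record in model_input:
            flagged = "badword" in record["text"].lower()  # placeholder rule, not a model
            rows.append({"hate": int(flagged), "confidence": 1.0 if flagged else 0.5})
        return rows


def _load_pyfunc(data_path: str):
    """Entry point MLflow calls on the loader module when loading the model."""
    return ToyClassifier(data_path)


toy_model = _load_pyfunc("path/to/artifacts")  # hypothetical path
print(toy_model.predict([{"text": "Bom dia a todos"}]))
```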

In [None]:
classifier = HateDetectionClassifier()

We are going to fine-tune this model to solve the specific task at hand. The base model is a BERT architecture that has been trained on Portuguese text. Our baseline is the model `neuralmind/bert-base-portuguese-cased`, registered in HuggingFace.

In [None]:
classifier.build(baseline=params.model.baseline)


This method does the following:

- It loads the tokenizer for the given pretrained model.
- It loads the classifier for the given pretrained model. Since this base model has not been fine-tuned to perform any specific task, we will receive a warning about that. Don't worry, because that's what we are looking for. We will fine-tune this model with data specific to the task at hand.

```python
def build(self, baseline: str, tokenizer: str = None):
    self.tokenizer = AutoTokenizer.from_pretrained(tokenizer or baseline)
    self.model = AutoModelForSequenceClassification.from_pretrained(baseline)
```

### Handling long sequences of text

Transformers are well known for memory requirements that grow quickly with the length of the sequences they process: with standard self-attention, the growth is quadratic rather than linear. Even though some models can handle longer sequences than the one we are using, there will always be a limit because memory is finite. So we need a way to handle this. There are a couple of options here:

- Truncate the sequences. Easy but data loss will occur.
- Split long sequences into smaller sequences and then combine the results of the subsequences to make the final prediction.

We are going to do the latter here.
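A minimal sketch of that split-and-combine idea, under stated assumptions: text is split on whitespace into chunks of at most `seq_len` words, a scoring function returns a hate probability per chunk, and the chunk probabilities are averaged. The helper names, the averaging rule, and the 0.5 threshold are all assumptions for illustration; the `hatedetection` package may split and aggregate differently (for example, the `split_unique_words` setting is not modeled here).

```python
from typing import Callable, List, Tuple


def split_into_chunks(text: str, seq_len: int) -> List[str]:
    """Split text into consecutive chunks of at most seq_len whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + seq_len]) for i in range(0, len(words), seq_len)]


def predict_long_text(text: str, score_chunk: Callable[[str], float],
                      seq_len: int, threshold: float = 0.5) -> Tuple[int, float]:
    """Score each chunk, average the probabilities, and threshold the mean."""
    chunks = split_into_chunks(text, seq_len) or [""]  # keep one chunk for empty text
    mean_prob = sum(score_chunk(c) for c in chunks) / len(chunks)
    return int(mean_prob >= threshold), mean_prob


# Hypothetical scorer standing in for the fine-tuned model.
def dummy_scorer(chunk: str) -> float:
    return 0.9 if "x" in chunk else 0.1


print(split_into_chunks("a b c d e", seq_len=2))  # ['a b', 'c d', 'e']
```

Averaging is only one way to combine chunk-level results; taking the maximum probability would instead flag a text as soon as any single chunk looks hateful.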

In [None]:
classifier.split_unique_words = params.data.preprocessing.split_unique_words
classifier.split_seq_len = params.data.preprocessing.split_seq_len

Let's reload the datasets with these parameters:

In [None]:
examples_train, labels_train = load_examples(input_dataset_path,
                                             split_seq=True,
                                             unique_words=params.data.preprocessing.split_unique_words,
                                             seq_len=params.data.preprocessing.split_seq_len)
examples_eval, labels_eval = load_examples(eval_dataset_path,
                                           split_seq=True,
                                           unique_words=params.data.preprocessing.split_unique_words,
                                           seq_len=params.data.preprocessing.split_seq_len)

Let's build datasets to feed our transformer model:

In [None]:
train_dataset = ClassificationDataset(examples=examples_train,
                                      labels=labels_train,
                                      tokenizer=classifier.tokenizer)
eval_dataset = ClassificationDataset(examples=examples_eval,
                                     labels=labels_eval,
                                     tokenizer=classifier.tokenizer)

In [None]:
training_args = TrainingArguments(**vars(params.trainer))

In [None]:
trainer = Trainer(
    model=classifier.model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_classification_metrics,
)
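For reference, a `compute_metrics` callback receives the evaluation logits and the true labels and returns a dictionary of named metrics. The sketch below is a pure-Python guess at what a binary-classification version could compute; it is not the actual `compute_classification_metrics` from `hatedetection`, and the metric names are assumptions.

```python
def example_compute_metrics(eval_pred):
    """Hypothetical metrics callback: accuracy, precision, recall and F1 for class 1."""
    logits, labels = eval_pred  # Trainer passes (predictions, label_ids)
    # Predicted class is the index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(label) for label in labels]

    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "precision": precision,
            "recall": recall, "f1": f1}


# Toy logits for four examples (two classes) and their true labels.
toy_logits = [[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.2, 0.4]]
toy_labels = [0, 1, 0, 1]
toy_metrics = example_compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

For hate speech detection, precision and recall on the positive class are usually more informative than accuracy alone, since the classes tend to be imbalanced.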

In [None]:
history = trainer.train()

In [None]:
evaluation_metrics = trainer.evaluate()

In [None]:
saved_location = f"{params.model.output_dir}/{params.model.name}"
artifacts = classifier.save_pretrained(saved_location)

In [None]:
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import Schema, ColSpec
from mlflow.types import DataType

input_schema = Schema([
  ColSpec(DataType.string, "text"),
])
output_schema = Schema([
  ColSpec(DataType.integer, "hate"),
  ColSpec(DataType.double, "confidence"),
])
signature = ModelSignature(inputs=input_schema, outputs=output_schema)

In [None]:
mlflow.log_metrics(dict(filter(lambda item: item[1] is not None, evaluation_metrics.items())))
mlflow.log_params(history.metrics)

In [None]:
mlflow.pyfunc.log_model("classifier", 
                        data_path=saved_location, 
                        code_path=["hatedetection"], 
                        loader_module="hatedetection.hate_detection_classifier", 
                        registered_model_name="hate-pt-speech", 
                        signature=signature)

In [None]:
mlflow.end_run()