# Using Transformers in biome.text


<a target="_blank" href="https://recognai.github.io/biome-text/v3.1.0/documentation/tutorials/4-Using_Transformers_in_biome_text.html"><img class="icon" src="https://recognai.github.io/biome-text/v3.1.0/assets/img/biome-isotype.svg" width=24 /></a>
[View on recogn.ai](https://recognai.github.io/biome-text/v3.1.0/documentation/tutorials/4-Using_Transformers_in_biome_text.html)
    
<a target="_blank" href="https://colab.research.google.com/github/recognai/biome-text/blob/v3.1.0/docs/docs/documentation/tutorials/4-Using_Transformers_in_biome_text.ipynb"><img class="icon" src="https://www.tensorflow.org/images/colab_logo_32px.png" width=24 /></a>
[Run in Google Colab](https://colab.research.google.com/github/recognai/biome-text/blob/v3.1.0/docs/docs/documentation/tutorials/4-Using_Transformers_in_biome_text.ipynb)
        
<a target="_blank" href="https://github.com/recognai/biome-text/blob/v3.1.0/docs/docs/documentation/tutorials/4-Using_Transformers_in_biome_text.ipynb"><img class="icon" src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width=24 /></a>
[View source on GitHub](https://github.com/recognai/biome-text/blob/v3.1.0/docs/docs/documentation/tutorials/4-Using_Transformers_in_biome_text.ipynb)

When running this tutorial in Google Colab, make sure to install *biome.text* first:

In [None]:
!pip install -U pip
!pip install -U biome-text
exit(0)  # Force restart of the runtime

*If* you want to log your runs with [WandB](https://wandb.ai/home), don't forget to install its client and log in.

In [None]:
!pip install wandb
!wandb login

## Introduction

In recent years, NLP has shifted towards transfer learning as the standard approach to solving problems. Previously, models were usually trained entirely from scratch, utilizing at most pretrained word embeddings. Nowadays it is very common to start with a large pretrained language model as the backbone of a system and to put a task-specific head on top of it. This new paradigm has made it easier to find state-of-the-art architectures for a great variety of NLP tasks.

Almost all current language models are based on the transformer architecture. The awesome [Hugging Face Transformers](https://github.com/huggingface/transformers) library provides access to hundreds of such pretrained language models, including state-of-the-art models such as the famous [BERT](https://github.com/google-research/bert), as well as community-driven models that often cover a specific language or specific resource requirements.

In this tutorial, we are going to classify [arXiv](https://arxiv.org/) papers into [categories](https://arxiv.org/category_taxonomy), analyzing the title of the paper and its abstract. We will use Hugging Face's [distilled](https://medium.com/huggingface/distilbert-8cf3380435b5) implementation of [RoBERTa](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/) and explore how to easily include pretrained transformers in a *biome.text* pipeline.

### External links about transformers
If this is the first time you hear about "Transformers" not referring to giant robots, here is a small list of resources you might want to look at first:

* [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf): paper that introduced the architecture.
* [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/): 20-30 minute article covering how they work.
* [Illustrated Guide to Transformer Neural Network: a step by step explanation](https://youtu.be/4Bdc55j80l8): 15 minute long video covering how they work.
* [An Introduction To Transfer Learning In NLP and HuggingFace](https://www.youtube.com/watch?v=8Hg2UtQg6G4): 1 hour talk by Thomas Wolf

### Imports

Let us first import all the stuff we need for this tutorial:

In [None]:
from biome.text import Dataset, Pipeline, Trainer
from biome.text.configuration import VocabularyConfiguration, TrainerConfiguration
from biome.text.hpo import TuneExperiment
from ray import tune
import os

## Exploring and preparing the data

For this tutorial we are going to use the [arXiv dataset](https://www.kaggle.com/Cornell-University/arxiv) compiled by Cornell University, which consists of metadata of scientific papers stored in [arXiv](https://arxiv.org/).

We preprocessed the data in a separate [notebook](https://drive.google.com/file/d/1zUSz81x15RH5mL5GoN7i7xqiNGEqclU0/view?usp=sharing) producing three JSON files (train, validate and test datasets) that contain the title, the abstract and the category of the corresponding paper.

Our NLP task will be to classify the papers into the given categories based on the title and abstract. Below we download the preprocessed data and create our [Datasets](https://recognai.github.io/biome-text/v3.1.0/api/biome/text/dataset.html#dataset) with it.

In [None]:
# Downloading the datasets
!curl -O https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/transformers_arxiv-classifier/arxiv-dataset-train.json
!curl -O https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/transformers_arxiv-classifier/arxiv-dataset-validate.json
!curl -O https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/transformers_arxiv-classifier/arxiv-dataset-test.json

In [None]:
# Loading from local
train_ds = Dataset.from_json("arxiv-dataset-train.json")
valid_ds = Dataset.from_json("arxiv-dataset-validate.json")
test_ds = Dataset.from_json("arxiv-dataset-test.json")

Let's have a look at the first 10 examples of the train dataset.

In [None]:
train_ds.head()

Our pipeline defined in the next section, or to be more precise the `TextClassification` task [head](https://recognai.github.io/biome-text/v3.1.0/documentation/basics.html#head), will expect a *text* and *label* column to be present in our data.
Therefore, we need to map our input to these two columns:

In [None]:
# Renaming the 'categories' column into 'label'
train_ds.rename_column_("categories", "label")
valid_ds.rename_column_("categories", "label")
test_ds.rename_column_("categories", "label")

# Combining 'title' and 'abstract' into a 'text' column, and remove them afterwards
train_ds = train_ds.map(lambda x: {"text": x["title"] + " " + x["abstract"]}, remove_columns=["title", "abstract"])
valid_ds = valid_ds.map(lambda x: {"text": x["title"] + " " + x["abstract"]}, remove_columns=["title", "abstract"])
test_ds = test_ds.map(lambda x: {"text": x["title"] + " " + x["abstract"]}, remove_columns=["title", "abstract"])

## Configuring and training the pipeline

As we have seen in [previous tutorials](https://recognai.github.io/biome-text/v3.1.0/documentation/tutorials/1-Training_a_text_classifier.html#explore-the-training-data), a *biome.text* [`Pipeline`](https://recognai.github.io/biome-text/v3.1.0/documentation/basics.html#pipeline) consists of tokenizing the input, extracting text features, applying a language encoding (optionally) and executing a task-specific head in the end. In *biome.text* the pre-trained transformers by Hugging Face are treated as a text feature, just like the *word* and *char* feature.

In this section we will configure and train 3 different pipelines to showcase the usage of transformers in *biome.text*.

### Fine-tuning the transformer

In our first pipeline we follow the common approach to use pretrained transformers in classification tasks. It consists of fine-tuning the transformer weights and using a special token as pooler in the end. In our configuration the former step means setting the `trainable` parameter in the *transformers* features to `True`. The downside of fine-tuning is that most of the pre-trained transformers are relatively big and require dedicated hardware to be fine-tuned. For example, in this tutorial we will use `distilroberta-base`, a [distilled version](https://github.com/huggingface/transformers/tree/master/examples/distillation) of RoBERTa with a total of ~80M parameters.

We also need to specify the maximum number of input tokens `max_length` supported by the pretrained transformer. If you are sure that your input data does not exceed this limit, you can skip this parameter.

With BERT-like models, such as RoBERTa, a special [CLS] token is added as the first token of each input. It is pretrained to effectively represent the entire input and can be used as pooler in the head component. Many BERT-like models pass this token through a non-linear tanh activation layer that is part of the pretraining. If you want to use these pretrained weights, you have to use the `bert_pooler` together with the corresponding `pretrained_model`. We will fine-tune these weights as well (setting `requires_grad` to `True`) and add a little dropout.

::: tip Tip

You can also use the [CLS] token directly without passing it through the non-linear layer by using the `cls_pooler`.

:::

The `TextClassification` head automatically applies a linear layer with an output dimension corresponding to the number of labels in the end.
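
To make the difference between the two poolers more tangible, here is a minimal, framework-free sketch of what they compute. The function names, the toy vectors and the identity weight matrix are made up for illustration; the actual *biome.text* poolers work on batched tensors and load the real pretrained weights.

```python
import math


def cls_pooler(token_vectors):
    """Return the vector of the first ([CLS]) token unchanged."""
    return list(token_vectors[0])


def bert_style_pooler(token_vectors, weight, bias):
    """Pass the [CLS] vector through a dense layer followed by tanh."""
    cls = token_vectors[0]
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy input: 3 tokens with hidden size 2; the first row plays the role of [CLS].
tokens = [[0.5, -1.0], [0.1, 0.2], [0.3, 0.4]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical stand-in for pretrained weights
zeros = [0.0, 0.0]

print(cls_pooler(tokens))
print(bert_style_pooler(tokens, identity, zeros))
```

With the identity weights used here, the only difference between both outputs is the tanh squashing of the [CLS] vector.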


In [None]:
pipeline_dict_finetuning = {
    "name": "arxiv_categories_classification",
    "features": {
        "transformers": {
            "model_name": "distilroberta-base",
            "trainable": True,  # fine-tune the weights of the transformer
            "max_length": 512,
        },
    },
    "head": {
        "type": "TextClassification",
        "labels": train_ds.unique("label"),
        "pooler": {
            "type": "bert_pooler",
            "pretrained_model": "distilroberta-base",
            "requires_grad": True,
            "dropout": 0.1,
        },
        # If you do not want to use the pre-trained activation layer for the CLS token (see text) 
        # "pooler": {
        #     "type": "cls_pooler",
        # }
    },
}

In [None]:
pl = Pipeline.from_config(pipeline_dict_finetuning)

In our trainer configuration we will use canonical values for the `batch_size` and `lr` taken from the Hugging Face transformers library. We will also apply a linearly decaying learning rate scheduler with 50 warm-up steps, which is recommended when fine-tuning a pretrained model. For now we will stick to two epochs to allow for rapid iteration.
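
The shape of such a schedule can be sketched in a few lines of plain Python. This is an illustration that follows the usual linear warm-up and linear decay recipe, not necessarily the exact *biome.text* implementation; the function name `lr_at_step` and the total of 1000 training steps are assumptions.

```python
def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Learning rate with linear warm-up followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during the warm-up phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run with 1000 optimizer steps in total.
for step in (0, 25, 50, 500, 1000):
    print(step, lr_at_step(step, base_lr=5e-5, warmup_steps=50, total_steps=1000))
```

The learning rate peaks right after the warm-up phase and reaches zero at the last training step.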

In [None]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adamw",
        "lr": 5e-5
    },
    warmup_steps=50,
    lr_decay="linear",
    batch_size=8,
    max_epochs=2,
)

In [None]:
trainer = Trainer(
    pipeline=pl,
    train_dataset=train_ds,
    valid_dataset=valid_ds,
    trainer_config=trainer_config
)

In [None]:
trainer.fit("output/fine_tuning")

After two epochs we achieve an accuracy of about 0.65, which is competitive looking at the corresponding [Kaggle notebooks](https://www.kaggle.com/Cornell-University/arxiv/notebooks). Keep in mind that we did not optimize any of the training parameters.

### Training with a frozen transformer

In our second pipeline we keep the weights of the transformer frozen by setting `trainable: False` and only train the pooler in the head component. In this setup the training will be significantly faster and does not necessarily require dedicated hardware.

As pooler we will use a bidirectional [GRU](https://en.wikipedia.org/wiki/Gated_recurrent_unit) in the head.
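
A rough parameter count illustrates why this setup is so much cheaper to train. The sketch below assumes PyTorch's GRU parameterization (three gates, each with input weights, hidden weights and two bias vectors) and the hidden size of 768 of `distilroberta-base`. The helper `gru_parameter_count` is a hypothetical name, and the final linear classification layer is left out.

```python
def gru_parameter_count(input_size, hidden_size, num_layers=1, bidirectional=False):
    """Count the trainable parameters of a GRU, assuming PyTorch's layout."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # 3 gates, each with input weights, hidden weights and two bias vectors.
        per_direction = 3 * hidden_size * (layer_input + hidden_size + 2)
        total += directions * per_direction
        # Deeper layers consume the concatenated outputs of both directions.
        layer_input = hidden_size * directions
    return total


pooler_params = gru_parameter_count(768, 128, num_layers=1, bidirectional=True)
print(pooler_params)  # 689664
print(pooler_params / 80_000_000)  # fraction of the ~80M transformer parameters
```

Under these assumptions the GRU pooler accounts for less than one percent of the roughly 80M transformer parameters that stay frozen.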

In [None]:
pipeline_dict_frozen = {
    "name": "arxiv_categories_classification",
    "features": {
        "transformers": {
            "model_name": "distilroberta-base",
            "trainable": False,
            "max_length": 512,
        }
    },
    "head": {
        "type": "TextClassification",
        "labels": train_ds.unique("label"),
        "pooler": {
            "type": "gru",
            "num_layers": 1,
            "hidden_size": 128,
            "bidirectional": True,
        },
    },
}

In [None]:
pl = Pipeline.from_config(pipeline_dict_frozen)

In our training configuration we will use the same `batch_size` as in the previous configuration but increase the learning rate to PyTorch's default value for the AdamW optimizer, in order to work well with the GRU. We also remove the learning rate scheduler with its warmup steps, since we do not modify the pretrained transformer weights.

In [None]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adamw",
        "lr": 0.001,  # PyTorch's default learning rate for AdamW
    },
    batch_size=8,
    max_epochs=2,
)

In [None]:
trainer = Trainer(
    pipeline=pl,
    train_dataset=train_ds,
    valid_dataset=valid_ds,
    trainer_config=trainer_config,
)

In [None]:
trainer.fit("output/frozen_transformer")

The training is about 4 times faster compared with fine-tuning the transformer, and after two epochs we reach a respectable accuracy of about 0.60. Keep in mind that we did not optimize any of the training parameters.

### Combining text features

As mentioned earlier, the pretrained transformers are treated as a text feature in *biome.text*. We can easily combine them with other features, such as the *char* feature for example, which encodes word tokens based on their characters.

Keep in mind that the *char* feature provides a feature vector per word (spaCy) token, while the *transformers* feature provides a contextualized feature vector per word piece. Therefore, we simply sum up the word piece vectors that belong to the same word and concatenate the result with the other features, to end up with one concatenated feature vector per word token.

::: warning Note

This also means that special transformer tokens, such as BERT's [CLS] token, are ignored when combining text features.

:::
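
The aggregation described above can be sketched in plain Python as follows. This is a simplified illustration, not the *biome.text* implementation: the helper names, the `word_ids` alignment list (with `None` marking special tokens) and the toy vectors are all assumptions.

```python
def sum_word_pieces(piece_vectors, word_ids):
    """Sum the word-piece vectors that belong to the same word.

    word_ids[i] is the index of the word that piece i belongs to,
    or None for special tokens, which are dropped.
    """
    n_words = max((w for w in word_ids if w is not None), default=-1) + 1
    dim = len(piece_vectors[0]) if piece_vectors else 0
    word_vectors = [[0.0] * dim for _ in range(n_words)]
    for vector, word_id in zip(piece_vectors, word_ids):
        if word_id is None:
            continue  # special tokens such as [CLS] are ignored
        for k, value in enumerate(vector):
            word_vectors[word_id][k] += value
    return word_vectors


def concat_features(first, second):
    """Concatenate two feature vectors per word token."""
    return [a + b for a, b in zip(first, second)]


# Toy example: a special token, two pieces of word 0, one piece of word 1, a special token.
piece_vectors = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
word_ids = [None, 0, 0, 1, None]
char_vectors = [[0.1], [0.2]]  # hypothetical char feature, one vector per word

word_vectors = sum_word_pieces(piece_vectors, word_ids)
combined = concat_features(word_vectors, char_vectors)
print(combined)
```

Each word ends up with a single vector whose length is the sum of the dimensions of the individual features.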

As in the second configuration, we will pool the feature vectors with a *GRU* in the *head* component.

In [None]:
pipeline_dict_combining = {
    "name": "arxiv_categories_classification",
    "tokenizer": {},
    "features": {
        "char": {
            "embedding_dim": 32,
            "lowercase_characters": True,
            "encoder": {
                "type": "gru",
                "num_layers": 1,
                "hidden_size": 32,
                "bidirectional": True,
            },
            "dropout": 0.1,
        },
        "transformers": {
            "model_name": "distilroberta-base",
            "trainable": False,
            "max_length": 512,
        }
    },
    "head": {
        "type": "TextClassification",
        "labels": train_ds.unique("label"),
        "pooler": {
            "type": "gru",
            "num_layers": 1,
            "hidden_size": 128,
            "bidirectional": True,
        },
    },
}

In [None]:
pl = Pipeline.from_config(pipeline_dict_combining)

We will use the same training configuration as in the frozen transformer section.

In [None]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adamw",
        "lr": 0.001,
    },
    batch_size=8,
    max_epochs=2,
)

In [None]:
trainer = Trainer(
    pipeline=pl,
    train_dataset=train_ds,
    valid_dataset=valid_ds,
    trainer_config=trainer_config
)

In [None]:
trainer.fit("output/combined_features")

With an accuracy of 0.55, combining features in this case seems to be counterproductive. The main reason is the exclusion of the special transformers tokens and the usage of feature vectors per word instead of word-pieces. Even when fine-tuning the transformer, those differences seem to significantly affect the performance as shown in our [WandB report](https://wandb.ai/ignacioct/biome/reports/Exploring-Ways-to-use-Pretrained-Transformers-in-biome-text--VmlldzoyNzk2MTM).

### Compare performances with TensorBoard (optional)

In the logging folder of the training we automatically log the results with [TensorBoard](https://www.tensorflow.org/tensorboard/). This helps us to conveniently compare the three training runs from above. Alternatively, if you installed and logged in to WandB, the runs should have been logged automatically to the *biome* project of your account.

In [None]:
%load_ext tensorboard
%tensorboard --logdir=training_logs/tensorboard

## Optimizing the trainer configuration

As described in the [HPO tutorial](https://recognai.github.io/biome-text/v3.1.0/documentation/tutorials/3-Hyperparameter_optimization_with_Ray_Tune.html#imports), *biome.text* relies on the [Ray Tune library](https://docs.ray.io/en/latest/tune.html#tune-index) to perform hyperparameter optimization.
We recommend going through that tutorial first, as we will be skipping most of the implementation details here.

### Frozen transformer

In this section we will first try to improve the performance of the frozen-transformer configuration by conducting a random search for three training parameters:
- learning rate
- weight decay
- batch size
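
Learning rates and weight decays are typically sampled on a logarithmic scale, so that each order of magnitude is explored equally often. The sketch below shows the idea with the standard library only. It is not Ray Tune's implementation, and both the helper name `sample_loguniform` and the explicit check of the bounds are additions made for this example.

```python
import math
import random


def sample_loguniform(lower, upper, rng):
    """Draw a value whose logarithm is uniformly distributed in [log(lower), log(upper)]."""
    if not 0 < lower < upper:
        raise ValueError("expected 0 < lower < upper")
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


rng = random.Random(0)
samples = [sample_loguniform(5e-4, 5e-3, rng) for _ in range(1000)]

# The geometric mean of the bounds splits the samples roughly in half.
midpoint = math.sqrt(5e-4 * 5e-3)
below_midpoint = sum(s < midpoint for s in samples)
print(min(samples), max(samples), below_midpoint)
```

A plain uniform distribution over the same range would instead place most of the samples close to the upper bound.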

In [None]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adamw",
        "lr": tune.loguniform(5e-4, 5e-3),  # lower bound first, then upper bound
        "weight_decay": tune.loguniform(1e-3, 0.1)
    },
    batch_size=tune.choice([4, 8, 16]),
    max_epochs=2,
    monitor="validation_accuracy",
    monitor_mode="max"
)

Having defined the search space for our hyperparameters, we create a `TuneExperiment` where we specify the number of samples to be drawn from our search space, the `local_dir` for our HPO output, and the computing resources we want Ray Tune to have access to.

In [None]:
tune_exp = TuneExperiment(
    pipeline_config=pipeline_dict_frozen, 
    trainer_config=trainer_config,
    train_dataset=train_ds,
    valid_dataset=valid_ds,
    name="frozen_transformer_sweep",
    # parameters for tune.run
    num_samples=50,
    local_dir="tune_runs",
    resources_per_trial={"gpu": 1, "cpu":2},
)

With our TuneExperiment object at hand, we simply have to pass it on to the [`tune.run`](https://docs.ray.io/en/master/tune/api_docs/execution.html#tune-run) function to start our random search.

To speed things up we will use the [ASHA](https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/) trial scheduler that terminates low-performing trials early. In our case we take the *validation_accuracy* as a measure of the model's performance.
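
The idea of stopping weak trials early can be illustrated with a synchronous successive-halving loop. Note that this is a deliberate simplification: ASHA itself promotes trials asynchronously, and the code below is not Ray Tune's scheduler. The trial names, the made-up learning curves and the function `successive_halving` are hypothetical.

```python
def successive_halving(trials, score_at, max_epochs, reduction_factor=2):
    """Keep only the best fraction of trials at each rung (epochs 1, 2, 4, ...)."""
    survivors = list(trials)
    epoch = 1
    while epoch <= max_epochs and len(survivors) > 1:
        # Rank the surviving trials by their metric at the current rung.
        ranked = sorted(survivors, key=lambda t: score_at(t, epoch), reverse=True)
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]
        epoch *= reduction_factor
    return survivors


# Made-up final accuracies; the curve approaches them as training proceeds.
final_accuracy = {"a": 0.50, "b": 0.63, "c": 0.58, "d": 0.40}


def toy_score(trial, epoch):
    return final_accuracy[trial] * (1 - 0.5 ** epoch)


best = successive_halving(list(final_accuracy), toy_score, max_epochs=4)
print(best)
```

Only the trials that survive a rung consume further training epochs, which is where the savings in compute come from.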

In Google Colab with a GPU backend this random search should not take more than about 1.5 hours and we recommend following the progress via WandB. Alternatively, you could follow the progress via [TensorBoard](https://www.tensorflow.org/tensorboard/) by launching a TensorBoard instance before starting the random search, and pointing it to the *local_dir* output:

In [None]:
%tensorboard --logdir=tune_runs

In [None]:
analysis_frozen = tune.run(
    tune_exp,
    scheduler=tune.schedulers.ASHAScheduler(), 
    metric="validation_accuracy",
    mode="max",
    progress_reporter=tune.JupyterNotebookReporter(overwrite=True)
)

The best configuration in our random search achieved an accuracy of about 0.63 with the following parameters:

- learning rate: 0.002541
- batch size: 16
- weight decay: 0.04194

### Fine-tuning the transformer

We will also try to optimize the training parameters for a fine-tuning of the transformer. Since this is computationally much more expensive, we will take only a subset of our training data for the random search.

In [None]:
train_1000 = train_ds.shuffle(seed=43).select(range(1000))

The training parameters we are going to tune are the following:


*   learning rate
*   weight decay
*   warmup steps

In [None]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adamw",
        "lr": tune.loguniform(1e-5, 1e-4),
        "weight_decay": tune.loguniform(1e-3, 0.1)
    },
    warmup_steps=tune.randint(0, 100),
    lr_decay="linear",
    batch_size=8,
    max_epochs=2,
    monitor="validation_accuracy",
    monitor_mode="max"
)

After having defined the search space, we create a `TuneExperiment` providing this time the subset of the training data.

In [None]:
tune_exp = TuneExperiment(
    pipeline_config=pipeline_dict_finetuning, 
    trainer_config=trainer_config,
    train_dataset=train_1000,  # subset of the training data for a faster search
    valid_dataset=valid_ds,
    name="finetuning_sweep",
    # parameters for tune.run
    num_samples=50,
    local_dir="tune_runs",
    resources_per_trial={"gpu": 1, "cpu":2},
)

Again, we will use the [ASHA](https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/) trial scheduler and maximize the *validation_accuracy*.

In Google Colab with a GPU backend, this random search should not take longer than 1.5 hours.

In [None]:
analysis_finetuning = tune.run(
    tune_exp,
    scheduler=tune.schedulers.ASHAScheduler(),
    metric="validation_accuracy",
    mode="max",
    progress_reporter=tune.JupyterNotebookReporter(overwrite=True),
)

We now take the configuration that yielded the best *validation accuracy* and train the pipeline on the full training set. In our random search the best configuration was the following:

*   learning rate: 0.0000453
*   warmup steps: 45
*   weight decay: 0.003197



In [None]:
pl = Pipeline.from_config(pipeline_dict_finetuning)

In [None]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adamw",
        "lr": 0.0000453,
        "weight_decay": 0.003197,
    },
    warmup_steps=45,
    lr_decay="linear",
    batch_size=8,
    max_epochs=2,
)

In [None]:
trainer = Trainer(
    pipeline=pl,
    train_dataset=train_ds,
    valid_dataset=valid_ds,
    trainer_config=trainer_config
)

In [None]:
trainer.fit("output/transformer_final_model")

With the optimized training parameters we achieve an accuracy of about 0.67.

### Evaluating with a test set

Having optimized the training parameters of both models, we will now evaluate them on an independent test set.

For the frozen-transformer configuration we can use the `analysis_frozen` object of the random search to directly access the best performing model:

In [None]:
best_model_path = os.path.join(analysis_frozen.get_best_logdir(), "output", "model.tar.gz")

pl_frozen = Pipeline.from_pretrained(best_model_path)

With the best performing pipeline at hand we will call its evaluate method together with the test data set.
By default the evaluation will be done with a batch size of 16 and on a CUDA device if one is available.

In [None]:
pl_frozen.evaluate(test_ds)

On the test set we achieve an accuracy of about 0.65, which is a bit better than the 0.63 on our validation set.

Let us also quickly check the accuracy of our best fine-tuned model:

In [None]:
pl_finetuned = Pipeline.from_pretrained("output/transformer_final_model/model.tar.gz")

pl_finetuned.evaluate(test_ds)

Here we achieve roughly the same accuracy of 0.67 as with the validation data set. So it seems both models generalized well during the random search and there is no strong bias towards the validation data set.

## Making predictions

Let's quickly recap what we have learnt so far:

* Freezing the pretrained transformer and optimizing a GRU pooler in the head can be a valid option if computing resources are limited;
* However, fine-tuning the transformer at word-piece level and using the CLS token as "pooler" works best;
* A quick HPO of the training parameters improved the accuracies by about 0.02 to 0.03.


With our best model at hand we will finally make a simple prediction. We can call the `predict` method of our pipeline, which outputs a dictionary with a `labels` and a `probabilities` key containing a list of labels and their corresponding probabilities, ordered from most to least likely.

In [None]:
pl_finetuned.predict(text="This is a title of a super intelligent Natural Language Processing system")

The most likely category predicted is the "*cs.CL*" category, which seems fitting according to this [list of arxiv categories and their meanings](https://arxiv.org/category_taxonomy).
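
To illustrate how such an ordered output can be derived from raw model scores, here is a small standalone sketch. The logits are invented, the helper `rank_labels` is hypothetical, and the use of a softmax is an assumption about the head; only the `labels` and `probabilities` keys follow the description above.

```python
import math


def rank_labels(logits, labels):
    """Turn raw scores into probabilities and sort the labels by them."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(labels)), key=lambda i: probs[i], reverse=True)
    return {
        "labels": [labels[i] for i in order],
        "probabilities": [probs[i] for i in order],
    }


# Invented scores for three arXiv categories.
prediction = rank_labels([0.5, 2.0, -1.0], ["cs.AI", "cs.CL", "stat.ML"])
print(prediction["labels"][0], prediction["probabilities"][0])
```

The first entry of each list then corresponds to the most likely category.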