# Implementing RoBERTa with fastai and HuggingFace 🤗Transformers

## Acknowledgements:

This notebook is based off of this great tutorial kernel [and accompanying article](https://medium.com/p/fastai-with-transformers-bert-roberta-xlnet-xlm-distilbert-4f41ee18ecb2?source=email-29c8f5cf1dc4--writer.postDistributed&sk=119c3e5d748b2827af3ea863faae6376): <br>
https://www.kaggle.com/maroberti/fastai-with-transformers-bert-roberta

Here, I've implemented only RoBERTa, but go check out the original kernel to see how the same procedure can be used for BERT, RoBERTa, XLNet, XLM, and DistilBERT. I'd love it if you upvoted my kernel, but make sure to give the original your votes, too.

## Google Quest Q&A Overview

This challenge is about questions and answers. 

In [question answering (QA)](https://en.wikipedia.org/wiki/Question_answering), systems are built to automatically answer questions posed by humans in natural language. These computer systems excel at answering questions with single, verifiable answers. In contrast, humans are better at addressing subjective questions that require a deeper, multidimensional understanding of context.

For the [Google QUEST Q&A Labeling competition](https://www.kaggle.com/c/google-quest-challenge/overview), we're tasked with predicting different subjective aspects of question-answering. The data for this competition includes questions and answers, and the task is to predict target values of 30 labels for each question-answer pair. Target labels with the prefix `question_` relate to the `question_title` and/or `question_body` features in the data, and target labels with the prefix `answer_` relate to the `answer` feature.

This is not a binary prediction challenge. Target labels are aggregated from multiple raters, and can have continuous values in the range [0,1]. Submissions are evaluated with the mean [Spearman's rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).

## Transfer learning approach

One of the first **transfer learning** methods applied to Natural Language Processing (NLP) was the [Universal Language Model Fine-tuning for Text Classification](https://medium.com/r/?url=https%3A%2F%2Farxiv.org%2Fpdf%2F1801.06146.pdf) (ULMFiT) method. This method starts with a pre-trained language model (LM), for example one trained on the Wikitext 103 dataset, and fine-tunes it on a new dataset. The fine-tuned language model can then be used in a classification task on the new dataset. A demonstration is in the [fast.ai course](https://course.fast.ai/videos/?lesson=4), which incorporates other techniques like discriminative learning rates, gradual model unfreezing, and slanted triangular learning rates.

Recently, a new architecture called the **Transformer** (cf. [Attention is all you need](https://arxiv.org/abs/1706.03762)) has been shown to be powerful. Google (BERT, Transformer-XL, XLNet), Facebook (RoBERTa, XLM) or even OpenAI (GPT, GPT-2) have pre-trained their own models (that use architectures based on the Transformer) on very large corpora. 

These transformers are available through the [HuggingFace](https://huggingface.co/) 🤗 [transformers library](https://github.com/huggingface/transformers). Formerly known as ``pytorch-transformers`` or ``pytorch-pretrained-bert``, this library has both pre-trained NLP models and additional utilities like tokenizers, optimizers and schedulers.

This kernel uses the ``transformers`` library within the ``fastai`` framework. Specifically, I am using the [RoBERTa model](https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8). I've broken the process down into different parts:
1. Specifying Data Preprocessing
1. Loading and Processing Data
1. Creating the Model
1. Training the Model
1. Predictions and Submission

# Set Up and Data Loading

This kernel uses fastai and Huggingface transformers. fastai is already installed on Kaggle, and [here](https://www.kaggle.com/c/tensorflow2-question-answering/discussion/117716) is a discussion post that shows how to get Huggingface installed.

In [None]:
!pip install ../input/sacremoses/sacremoses-master 
!pip install ../input/transformers/transformers-master 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pathlib import Path 

import os

import torch
import torch.optim as optim

import random 

# fastai
from fastai import *
from fastai.text import *
from fastai.callbacks import *

# classification metric
from scipy.stats import spearmanr

# transformers
from transformers import PreTrainedModel, PreTrainedTokenizer, PretrainedConfig
from transformers import RobertaForSequenceClassification, RobertaTokenizer, RobertaConfig

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    print(dirname)

A utility function to set the seed for generating random numbers

In [None]:
def seed_all(seed_value):
    random.seed(seed_value) # Python
    np.random.seed(seed_value) # cpu vars
    torch.manual_seed(seed_value) # cpu  vars
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value) # gpu vars
        torch.backends.cudnn.deterministic = True  #needed
        torch.backends.cudnn.benchmark = False

In [None]:
seed=42
seed_all(seed)

In [None]:
DATA_ROOT = Path("../input/google-quest-challenge/")
MODEL_ROOT = Path("../input/robertabasepretrained")
train = pd.read_csv(DATA_ROOT / 'train.csv')
test = pd.read_csv(DATA_ROOT / 'test.csv')
sample_sub = pd.read_csv(DATA_ROOT / 'sample_submission.csv')
print(train.shape,test.shape)

The training data. In this kernel, I'll use the `question_title`, `question_body` and `answer` columns.

In [None]:
train.head()

The predicted labels are in the columns of the sample submission. Note that some labels are with respect to the question, and some are with respect to the answer.

In [None]:
labels = list(sample_sub.columns[1:].values)

In [None]:
for label in labels: print(label) 

# Specifying Data Preprocessing 

When using pretrained models, the current data needs to be preprocessed in the same way as the data that trained the model. In ``transformers``, each model architecture is associated with 3 main types of classes:
* A **model class** to load/store a particular pre-trained model.
* A **tokenizer class** to pre-process the data and make it compatible with a particular model.
* A **configuration class** to load/store the configuration of a particular model.

For the RoBERTa architecture, we use `RobertaForSequenceClassification` for the **model class**, `RobertaTokenizer` for the **tokenizer class**, and `RobertaConfig` for the **configuration class**. 

In [None]:
MODEL_CLASSES = {
    'roberta': (RobertaForSequenceClassification, RobertaTokenizer, RobertaConfig),
}

As you will see later, these classes share a common class method ``from_pretrained(pretrained_model_name, ...)``. In our case, the parameter ``pretrained_model_name`` is a string with the shortcut name of a pre-trained model/tokenizer/configuration to load, e.g. ``'bert-base-uncased'``. We can find all the shortcut names in the transformers documentation [here](https://huggingface.co/transformers/pretrained_models.html#pretrained-models).

In [None]:
# Parameters
seed = 42
use_fp16 = False
bs = 16

model_type = 'roberta'
pretrained_model_name = 'roberta-base' # 'roberta-base-openai-detector'

In [None]:
model_class, tokenizer_class, config_class = MODEL_CLASSES[model_type]

In [None]:
model_class.pretrained_model_archive_map.keys()

## Implementing the RoBERTa tokenizer and numericalizer in fastai

Text data is preprocessed through tokenization and numericalization. To match the pretrained models, we need to use the same tokenization and numericalization as the model. Fortunately, the **tokenizer class** from ``transformers`` provides the correct pre-process tools that correspond to each pre-trained model.

In ``fastai``, data pre-processing is performed during the creation of the ``DataBunch``. When creating a `DataBunch`, the tokenizer and numericalizer are passed in the processor argument.

Therefore, the first step is to create a customized tokenizer and numericalizer that use the correct transformer tokenizer classes.

### Custom Tokenizer

A tokenizer takes the text and transforms it into tokens. The ``fastai`` documentation notes that: 
1. The [``TokenizeProcessor`` object](https://docs.fast.ai/text.data.html#TokenizeProcessor) takes as ``tokenizer`` argument a ``Tokenizer`` object.
2. The [``Tokenizer`` object](https://docs.fast.ai/text.transform.html#Tokenizer) takes as ``tok_func`` argument a ``BaseTokenizer`` object.
3. The [``BaseTokenizer`` object](https://docs.fast.ai/text.transform.html#BaseTokenizer) implements the function ``tokenizer(t:str) → List[str]`` that takes a text ``t`` and returns the list of its tokens.

To use the RoBERTa tokenizer, we create a new class ``TransformersBaseTokenizer`` that inherits from ``BaseTokenizer`` and overrides its ``tokenizer`` function. Note that RoBERTa expects the input string to start with a space, so the tokenizing methods should be called with ``add_prefix_space`` set to ``True``. The output of the tokenizer should follow the pattern below. (Padding is added later, when the `DataBunch` is created.)

    roberta: [CLS] + prefix_space + tokens + [SEP] + padding
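
As a rough illustration of that pattern, here is a minimal sketch that uses a whitespace split as a stand-in for the real RoBERTa tokenizer. The helper name `wrap_and_truncate` is hypothetical, and the `<s>` / `</s>` defaults are how RoBERTa spells its CLS and SEP tokens. The only point is to show why the token list is cut to `max_seq_len - 2` before the special tokens are added.

```python
from typing import List

def wrap_and_truncate(text: str, max_seq_len: int,
                      cls_token: str = "<s>", sep_token: str = "</s>") -> List[str]:
    """Toy stand-in: split on whitespace, truncate, then add special tokens."""
    tokens = text.split()
    # Reserve two positions for the special tokens added below.
    tokens = tokens[:max_seq_len - 2]
    return [cls_token] + tokens + [sep_token]

example = wrap_and_truncate("how do I pad a batch of sentences", max_seq_len=6)
print(example)
```

With `max_seq_len=6`, only four of the eight words survive, so the result never exceeds the length limit once the two special tokens are counted.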

In [None]:
class TransformersBaseTokenizer(BaseTokenizer):
    """Wrapper around PreTrainedTokenizer to be compatible with fast.ai"""
    def __init__(self, pretrained_tokenizer: PreTrainedTokenizer, model_type = 'bert', **kwargs):
        self._pretrained_tokenizer = pretrained_tokenizer
        self.max_seq_len = pretrained_tokenizer.max_len
        self.model_type = model_type

    def __call__(self, *args, **kwargs): 
        return self

    def tokenizer(self, t:str) -> List[str]:
        """Limits the maximum sequence length and adds the special tokens"""
        CLS = self._pretrained_tokenizer.cls_token
        SEP = self._pretrained_tokenizer.sep_token
        if self.model_type in ['roberta']:
            tokens = self._pretrained_tokenizer.tokenize(t, add_prefix_space=True)[:self.max_seq_len - 2]
        else:
            tokens = self._pretrained_tokenizer.tokenize(t)[:self.max_seq_len - 2]
        return [CLS] + tokens + [SEP]

In [None]:
transformer_tokenizer = tokenizer_class.from_pretrained(MODEL_ROOT)
transformer_base_tokenizer = TransformersBaseTokenizer(pretrained_tokenizer = transformer_tokenizer, model_type = model_type)
fastai_tokenizer = Tokenizer(tok_func = transformer_base_tokenizer, pre_rules=[], post_rules=[])

### Custom Numericalizer

The numericalizer takes the tokens and turns them into numbers. The ``fastai`` documentation notes that:
1. The [``NumericalizeProcessor``  object](https://docs.fast.ai/text.data.html#NumericalizeProcessor) takes as ``vocab`` argument a [``Vocab`` object](https://docs.fast.ai/text.transform.html#Vocab)

To use the RoBERTa numericalizer, we create a new class ``TransformersVocab`` that inherits from ``Vocab`` and overrides the ``numericalize`` and ``textify`` functions.

In [None]:
class TransformersVocab(Vocab):
    def __init__(self, tokenizer: PreTrainedTokenizer):
        super(TransformersVocab, self).__init__(itos = [])
        self.tokenizer = tokenizer
    
    def numericalize(self, t:Collection[str]) -> List[int]:
        "Convert a list of tokens `t` to their ids."
        return self.tokenizer.convert_tokens_to_ids(t)
        #return self.tokenizer.encode(t)

    def textify(self, nums:Collection[int], sep=' ') -> List[str]:
        "Convert a list of `nums` to their tokens."
        nums = np.array(nums).tolist()
        return sep.join(self.tokenizer.convert_ids_to_tokens(nums)) if sep is not None else self.tokenizer.convert_ids_to_tokens(nums)

### Custom processor

Now that we have our custom **tokenizer** and **numericalizer**, we can create the custom **processor**. Notice we are passing the ``include_bos = False`` and ``include_eos = False`` options. This is because ``fastai`` adds its own special tokens by default which interferes with the ``[CLS]`` and ``[SEP]`` tokens added by our custom tokenizer.

In [None]:
transformer_vocab =  TransformersVocab(tokenizer = transformer_tokenizer)
numericalize_processor = NumericalizeProcessor(vocab=transformer_vocab)

tokenize_processor = TokenizeProcessor(tokenizer=fastai_tokenizer, include_bos=False, include_eos=False)

transformer_processor = [tokenize_processor, numericalize_processor]

# Loading and Processing Data

Now that we have a custom processor, which contains the custom tokenizer and numericalizer, we can create the `DataBunch`. During the `DataBunch` creation, we have to set the processor argument to our new custom processor ``transformer_processor`` and handle the padding correctly. For RoBERTa, it's usually advised to pad the inputs on the right rather than the left.

In [None]:
pad_first = bool(model_type in ['xlnet'])
pad_idx = transformer_tokenizer.pad_token_id
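
To make the effect of `pad_first` concrete, here is a minimal sketch of right versus left padding on a toy batch. The helper `pad_batch` is hypothetical (it is not the fastai collate function), the token ids are made up, and `pad_idx=1` is assumed to match RoBERTa's pad token id.

```python
from typing import List

def pad_batch(seqs: List[List[int]], pad_idx: int, pad_first: bool) -> List[List[int]]:
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        padding = [pad_idx] * (max_len - len(s))
        # pad_first=True puts padding on the left, False puts it on the right.
        padded.append(padding + s if pad_first else s + padding)
    return padded

batch = [[0, 713, 2], [0, 713, 16, 10, 2]]  # made-up token ids
right_padded = pad_batch(batch, pad_idx=1, pad_first=False)
left_padded = pad_batch(batch, pad_idx=1, pad_first=True)
print(right_padded)
print(left_padded)
```

Since `model_type` is `'roberta'`, `pad_first` evaluates to `False` in this kernel, which corresponds to the right-padded variant.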

This kernel uses [the data block API](https://docs.fast.ai/data_block.html#The-data-block-API), to create the `DataBunch`. 

In the `DataBunch` creation, I have specified to use the 'question_title','question_body', and 'answer' columns as the training data. Recall from the introduction that some of the target answers relate to the question (title + body) and some only to the answer. It's an open question as to whether it's a good choice to stick these all together. 

In [None]:
databunch = (TextList.from_df(train, cols=['question_title','question_body','answer'], processor=transformer_processor)
             .split_by_rand_pct(0.1,seed=seed)
             .label_from_df(cols=labels)
             .add_test(test)
             .databunch(bs=bs, pad_first=pad_first, pad_idx=pad_idx))

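Check the batch and tokenizer. You will see a lot of 'Ġ' characters (which can look like 'G') in the text column. They were also present in the original tutorial, and they are expected: RoBERTa's byte-level BPE tokenizer uses 'Ġ' to mark a token that is preceded by a space.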

In [None]:
print('[CLS] token :', transformer_tokenizer.cls_token)
print('[SEP] token :', transformer_tokenizer.sep_token)
print('[PAD] token :', transformer_tokenizer.pad_token)
databunch.show_batch()

Check batch and numericalizer :

In [None]:
print('[CLS] id :', transformer_tokenizer.cls_token_id)
print('[SEP] id :', transformer_tokenizer.sep_token_id)
print('[PAD] id :', pad_idx)
test_one_batch = databunch.one_batch()[0]
print('Batch shape : ',test_one_batch.shape)
print(test_one_batch)

# Creating the Model

As mentioned [here](https://github.com/huggingface/transformers#models-always-output-tuples), the RoBERTa model's forward method always outputs a ``tuple`` with various elements depending on the model and the configuration parameters. In our case, we only need the logits. One way to access them is to create a custom model.

In [None]:
# defining our model architecture 
class CustomTransformerModel(nn.Module):
    def __init__(self, transformer_model: PreTrainedModel):
        super(CustomTransformerModel,self).__init__()
        self.transformer = transformer_model
        
    def forward(self, input_ids, attention_mask=None):
            
        logits = self.transformer(input_ids,
                                attention_mask = attention_mask)[0]   
        return logits

To adapt the transformer to our task, which has 30 target labels, we need to specify the number of labels before loading the pre-trained model.

In [None]:
config = config_class.from_pretrained(MODEL_ROOT)
config.num_labels = 30
config.use_bfloat16 = use_fp16

In [None]:
transformer_model = model_class.from_pretrained(MODEL_ROOT, config = config)
custom_transformer_model = CustomTransformerModel(transformer_model = transformer_model)

### Fastai Learner with Custom Optimizer

In fastai, the `Learner` holds the data, model and other parameters, like the optimizer. Since we're using transformers, we want to use an optimizer designed for them: the AdamW optimizer. This optimizer matches the PyTorch Adam optimizer API, so it is straightforward to integrate within ``fastai``. To reproduce BertAdam-specific behavior, you have to set ``correct_bias = False``.

It's worth noting that the metric used here is accuracy, while the metric used for evaluation is the mean Spearman's rho. In addition, this task has multiple target labels, so I'm not quite sure what accuracy is doing. I want to create a custom metric to show these statistics and get a better idea of how the model might place on the leaderboard.
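
As a starting point for such a metric, here is a minimal pure-Python sketch of the mean column-wise Spearman's rho. It is not a fastai metric callback, only a function you could call on validation targets and predictions after training. The helper names (`average_ranks`, `pearson`, `mean_spearman`) are hypothetical, and the sketch assumes no column is constant (a constant column has zero variance and would raise a division error). In the notebook itself, `scipy.stats.spearmanr` (already imported above) could replace the two helpers for each column.

```python
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; Spearman's rho is Pearson applied to ranks."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def mean_spearman(targets: List[List[float]], preds: List[List[float]]) -> float:
    """Average Spearman's rho over label columns (rows are examples)."""
    n_cols = len(targets[0])
    rhos = []
    for c in range(n_cols):
        t_col = [row[c] for row in targets]
        p_col = [row[c] for row in preds]
        rhos.append(pearson(average_ranks(t_col), average_ranks(p_col)))
    return sum(rhos) / n_cols

# Toy data: 3 examples, 2 label columns.
toy_targets = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
toy_preds = [[0.1, 0.9], [0.4, 0.3], [0.8, 0.6]]
score = mean_spearman(toy_targets, toy_preds)
print(score)
```

Because the score depends only on ranks, it rewards getting the ordering of examples right within each label column rather than matching the exact target values.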

In [None]:
from fastai.callbacks import *
from transformers import AdamW

learner = Learner(databunch, 
                  custom_transformer_model, 
                  opt_func = lambda input: AdamW(input,correct_bias=False), 
                  metrics=[accuracy])

# Show graph of learner stats and metrics after each epoch.
learner.callbacks.append(ShowGraph(learner))

# Put learn in FP16 precision mode. --> Not working in the tutorial
if use_fp16: learner = learner.to_fp16()

# Training the Model

Now that we've created the Learner, we can train the model. During training, we are going to use techniques known to help in other classification tasks: **discriminative layer training**, **gradual unfreezing** and **slanted triangular learning rates**. The tutorial kernel's author noted that he didn't find any documentation about the influence of these techniques on transformers. I've used them because I think these techniques are probably domain general, and will therefore give a boost in this system.

To implement unfreezing, our model needs to be divided into different layer groups. ``fastai`` allows us to "split" the model structure into groups, [described here](https://docs.fast.ai/basic_train.html#Discriminative-layer-training).

Here, we'll look at the RoBERTa model:

In [None]:
print(learner.model)

Let's check how many layer groups we currently have:

In [None]:
num_groups = len(learner.layer_groups)
print('Learner split in',num_groups,'groups')

A single group won't allow us to unfreeze parts of the model. The tutorial kernel suggests dividing the RoBERTa model into 14 blocks:
* 1 Embedding
* 12 transformer
* 1 classifier

In [None]:
list_layers = [learner.model.transformer.roberta.embeddings,
              learner.model.transformer.roberta.encoder.layer[0],
              learner.model.transformer.roberta.encoder.layer[1],
              learner.model.transformer.roberta.encoder.layer[2],
              learner.model.transformer.roberta.encoder.layer[3],
              learner.model.transformer.roberta.encoder.layer[4],
              learner.model.transformer.roberta.encoder.layer[5],
              learner.model.transformer.roberta.encoder.layer[6],
              learner.model.transformer.roberta.encoder.layer[7],
              learner.model.transformer.roberta.encoder.layer[8],
              learner.model.transformer.roberta.encoder.layer[9],
              learner.model.transformer.roberta.encoder.layer[10],
              learner.model.transformer.roberta.encoder.layer[11],
              learner.model.transformer.roberta.pooler]

learner.split(list_layers);

Let's check that we now have 14 layer groups:

In [None]:
num_groups = len(learner.layer_groups)
print('Learner split in',num_groups,'groups')

### Model Training

To train the model we will:
1. Freeze all but the last layer group (-1) and train
2. Freeze all but the last three layer groups (-3) and train
3. Freeze all but the last five layer groups (-5) and train

Throughout training, we use the `.fit_one_cycle` command, described [here](https://docs.fast.ai/callbacks.one_cycle.html). It applies fastai's 1cycle schedule, which fills the role of the **Slanted Triangular Learning Rates** here. Originally, I wanted to unfreeze the entire model, but I kept running out of space. I'll troubleshoot in other versions.
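
To see what the negative indices mean, here is a minimal sketch of the slicing behind `freeze_to(n)`, assuming (as in fastai v1) that the layer groups before index `n` are frozen and the groups from `n` onward are trainable. The helper `trainable_groups` is hypothetical and works on group indices only, not on real modules. The values -1, -3 and -5 are the ones used in the cells below.

```python
from typing import List

def trainable_groups(n_groups: int, freeze_to: int) -> List[bool]:
    """Mimic freeze_to(n): groups[:n] are frozen, groups[n:] stay trainable."""
    groups = list(range(n_groups))
    frozen = set(groups[:freeze_to])
    return [g not in frozen for g in groups]

for n in (-1, -3, -5):
    flags = trainable_groups(14, n)
    print(n, sum(flags), "trainable group(s)")
```

With 14 layer groups, `freeze_to(-1)` leaves only the final group (pooler and classifier head) trainable, and each later step unfreezes two more encoder layers.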

#### Freeze to -1

In [None]:
seed_all(seed)
learner.freeze_to(-1)

We need to find a good learning rate for our model.

In [None]:
learner.lr_find()

In [None]:
learner.recorder.plot(skip_end=7,suggestion=True)

Due to randomness, there can be small differences in the suggested learning rate. Based on a few runs on my computer, I've chosen 2e-4 for the Kaggle submission. Since only one layer group is unfrozen, we give a single learning rate.

In [None]:
learner.fit_one_cycle(3,max_lr=2e-04,moms=(0.8,0.7))

In [None]:
seed_all(seed)

#### Freeze to -3

Now, we unfreeze the last three layer groups and repeat the process, this time with a smaller learning rate.

In [None]:
learner.freeze_to(-3)

In [None]:
lr = 1e-5

Note that we use `slice` to create a separate learning rate for each group.

In [None]:
learner.fit_one_cycle(4, max_lr=slice(lr*0.95**num_groups, lr), moms=(0.8, 0.9))
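
To give a feel for what that `slice` produces, here is a minimal sketch that assumes fastai v1 spreads `slice(start, stop)` across the layer groups with evenly spaced multipliers (a geometric progression). The helper `spread_learning_rates` is hypothetical and only mirrors that assumed behavior; it does not call fastai.

```python
from typing import List

def spread_learning_rates(start: float, stop: float, n_groups: int) -> List[float]:
    """Geometrically space n_groups learning rates from start to stop."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

lr = 1e-5
num_groups = 14
lrs = spread_learning_rates(lr * 0.95 ** num_groups, lr, num_groups)
print(["%.2e" % v for v in lrs])
```

The earliest group gets the smallest rate (a bit under half of `lr`, since `0.95**14` is about 0.49) and the last group gets `lr` itself, so the layers closest to the output move the most.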

In [None]:
seed_all(seed)

#### Freeze to -5

Same story, now with the last five layer groups unfrozen.

In [None]:
learner.freeze_to(-5)

In [None]:
learner.fit_one_cycle(5, max_lr=slice(lr*0.95**num_groups, lr), moms=(0.8, 0.9))

# Predictions and Submission

Now that the model is trained, we can generate our predictions from the test dataset. As [noted in other tutorials](https://mlexplained.com/2019/05/13/a-tutorial-to-fine-tuning-bert-with-fast-ai/), the function ``get_preds`` does not return elements in order by default. Therefore, we have to re-sort the test elements into their correct order.

In [None]:
def get_preds_as_nparray(ds_type) -> np.ndarray:
    """
    the get_preds method does not yield the elements in order by default
    we borrow the code from the RNNLearner to re-sort the elements into their correct order
    """
    preds = learner.get_preds(ds_type)[0].detach().cpu().numpy()
    sampler = [i for i in databunch.dl(ds_type).sampler]
    reverse_sampler = np.argsort(sampler)
    return preds[reverse_sampler, :]

test_preds = get_preds_as_nparray(DatasetType.Test)
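
The re-sorting trick can be checked on a toy example. The sketch below is plain Python rather than NumPy, and the helper `restore_order` is hypothetical. It assumes `sampler[k]` is the dataset index of the k-th prediction that came out, which is the situation the function above handles with `np.argsort`.

```python
from typing import List

def restore_order(sampled_preds: List[str], sampler: List[int]) -> List[str]:
    """Undo a sampler's shuffling: sampler[k] is the dataset index of the k-th prediction."""
    # Plain-Python equivalent of np.argsort(sampler).
    reverse_sampler = sorted(range(len(sampler)), key=lambda k: sampler[k])
    return [sampled_preds[k] for k in reverse_sampler]

sampler = [2, 0, 3, 1]
sampled_preds = ["pred_for_2", "pred_for_0", "pred_for_3", "pred_for_1"]
restored = restore_order(sampled_preds, sampler)
print(restored)
```

After re-sorting, position `i` holds the prediction for dataset row `i`, which is what the submission file needs.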

In [None]:
test_preds,test_preds.shape

In [None]:
sample_submission = pd.read_csv(DATA_ROOT / 'sample_submission.csv')
sample_submission[labels] = test_preds
sample_submission.to_csv("submission.csv", index=False)

We check the order

In [None]:
test.head()

In [None]:
sample_submission.head()

Thanks for looking through this kernel! I hope that it helps you understand transformers, and how to integrate Huggingface with fastai. 

Check out the original for some other cool architectures:
[Fastai with HuggingFace 🤗Transformers (BERT, RoBERTa, XLNet, XLM, DistilBERT)](https://www.kaggle.com/maroberti/fastai-with-transformers-bert-roberta)

## References
* [Fastai with HuggingFace 🤗Transformers (BERT, RoBERTa, XLNet, XLM, DistilBERT)](https://www.kaggle.com/maroberti/fastai-with-transformers-bert-roberta)
* Hugging Face, Transformers GitHub (Nov 2019), [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
* Fast.ai, Fastai documentation (Nov 2019), [https://docs.fast.ai/text.html](https://docs.fast.ai/text.html)
* Jeremy Howard & Sebastian Ruder, Universal Language Model Fine-tuning for Text Classification (May 2018), [https://arxiv.org/abs/1801.06146](https://arxiv.org/abs/1801.06146)
* Keita Kurita's article : [A Tutorial to Fine-Tuning BERT with Fast AI](https://mlexplained.com/2019/05/13/a-tutorial-to-fine-tuning-bert-with-fast-ai/) (May 2019)
* Dev Sharma's article : [Using RoBERTa with Fastai for NLP](https://medium.com/analytics-vidhya/using-roberta-with-fastai-for-nlp-7ed3fed21f6c) (Sep 2019)