# Retraining RoBERTa for MLM

Retraining roberta-base for masked language modeling (MLM) using the RoBERTa pre-training procedure.

In [1]:
# It is often desirable to retrain the LM so that it better captures the language characteristics of a downstream task.

# A recently published work, BERTweet (Nguyen et al., 2020), provides a BERT model pre-trained
# (using the RoBERTa procedure) on a large corpus of English tweets.
# The authors argue that BERTweet better models the characteristics of the language used on Twitter,
# outperforming previous SOTA models on Tweet NLP tasks.

# That is, performance on downstream tasks can be greatly influenced by what our LM captures!

### Create Virtual Environment

https://janakiev.com/blog/jupyter-virtual-envs/

In [2]:
# ! pip install --user virtualenv

In [3]:
# ! python -m venv venv_robarta

In [4]:
# ! source venv_robarta/bin/activate

In [5]:
# ! pip install --user ipykernel
# ! jupyter kernelspec uninstall myenv

In [6]:
# ! python -m ipykernel install --user --name=venv_robarta
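
Once the kernel is registered and selected, it can be worth confirming that the notebook really runs inside the virtual environment. A minimal sketch using only the standard library; the helper name `in_virtualenv` is made up for illustration:

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter:", sys.executable)
print("inside virtual environment:", in_virtualenv())
```

If the second line prints `False`, the notebook is probably still attached to the default kernel rather than `venv_robarta`.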

## 0. Get Data

Get the Hate Speech Detection dataset (Basile et al., 2019), made available through TweetEval (Barbieri et al., 2020).

In [7]:
# !git clone https://github.com/cardiffnlp/tweeteval /tmp/tweeteval
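
LineByLineTextDataset, used later in this notebook, expects a plain-text file with one example per line. A rough sketch of how several text files could be merged into such a file follows. The helper name `merge_text_files` and the paths in the usage comment are assumptions, since the notebook does not show how its training file was produced:

```python
from pathlib import Path
from typing import Iterable


def merge_text_files(sources: Iterable[str], destination: str) -> int:
    """Write all non-blank lines from `sources` to `destination`, one per line.

    Returns the number of lines written.
    """
    written = 0
    dest = Path(destination)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with dest.open("w", encoding="utf-8") as out:
        for source in sources:
            with open(source, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:  # skip blank lines
                        out.write(line + "\n")
                        written += 1
    return written


# Hypothetical usage (paths assume the layout of the cloned repository):
# merge_text_files(
#     ["/tmp/tweeteval/datasets/hate/train_text.txt"],
#     "./data/train_lines.txt",
# )
```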

## 1. Include required libraries

In [8]:
# ! pip install torch==1.4.0 torchvision==0.5.0
# ! pip install transformers==3.5.1
from transformers import RobertaTokenizer, RobertaForMaskedLM
from transformers import LineByLineTextDataset
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

In [9]:
# import transformers
# transformers.__version__

In [10]:
# import torch
# torch.__version__

In [11]:
# ! pip -V

## 2. Prepare Data

### 2.1 Create tokenizer and model object

In [12]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

### 2.2 LineByLineTextDataset class

Since our data is already in a single file with one example per line, we can use the LineByLineTextDataset class directly.

In [15]:
# The block_size argument sets the maximum sequence length, in tokens, to which each line is truncated.
# "roberta-base" supports sequences of up to 512 tokens, including special tokens such as <s> (start of sequence) and </s> (end of sequence).

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/messagesForX.txt",
    block_size=512,
)
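
Before training, it can help to check how many examples the file will yield. LineByLineTextDataset treats every non-blank line as one example, so a small standard-library sketch is enough; `count_examples` is a hypothetical helper, not part of transformers:

```python
def count_examples(file_path: str) -> int:
    """Count the lines that would become training examples.

    Follows the line-by-line convention: empty or whitespace-only
    lines are ignored, every other line is one example.
    """
    with open(file_path, encoding="utf-8") as f:
        return sum(
            1 for line in f.read().splitlines() if line and not line.isspace()
        )


# Hypothetical usage with the file from the cell above:
# print(count_examples("./data/messagesForX.txt"))
```

The result should match the `Num examples` value that the Trainer reports when training starts.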

### 2.3 Data collator

The data collator assembles examples into batches that the LM can be trained on. For example, it pads all examples in a batch to the same length. With mlm=True, it also randomly selects tokens for masking (here with probability 0.15) and builds the labels for the MLM objective.

In [16]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
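
To make the masking step concrete, here is a simplified, pure-Python sketch of the selection rule used for BERT-style MLM: each token is selected with probability `mlm_probability`; a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function name and the toy ids are illustrative only; the real collator works on tensors and additionally skips special tokens.

```python
import random
from typing import List, Optional, Tuple

IGNORE_INDEX = -100  # label value that the loss function ignores


def mask_tokens(
    token_ids: List[int],
    mask_id: int,
    vocab_size: int,
    mlm_probability: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """Return (inputs, labels) for one sequence under the 80/10/10 rule."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for token in token_ids:
        if rng.random() >= mlm_probability:
            # Not selected: keep the token and exclude it from the loss.
            inputs.append(token)
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token)  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs.append(mask_id)  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs.append(rng.randrange(vocab_size))  # 10%: random token
        else:
            inputs.append(token)  # 10%: keep the original token
    return inputs, labels


# Toy example: ids, mask id and vocabulary size are made up for illustration.
inputs, labels = mask_tokens(
    list(range(10, 30)), mask_id=4, vocab_size=100, rng=random.Random(1)
)
print(inputs)
print(labels)
```

Positions whose label is `-100` do not contribute to the loss, so the model is only trained to recover the selected tokens.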

## 3. Training Model

### 3.1 Training object

The TrainingArguments object holds the fields that define the training process. The Trainer then brings together all of the objects created so far to run the training.

seed=1: seeds the RNG for the Trainer so that the results can be reproduced when needed.

In [17]:
training_args = TrainingArguments(
    output_dir="./roberta-retrained",
    overwrite_output_dir=True,
    num_train_epochs=25,
    per_device_train_batch_size=48,
    save_steps=500,
    save_total_limit=2,
    seed=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

### 3.2 Run training

trainer.save_model(output_dir): saves the model to output_dir so that it can be reloaded later with from_pretrained (or through a pipeline, as done below).

In [19]:
trainer.train()

trainer.save_model("./roberta-retrained")

***** Running training *****
  Num examples = 107
  Num Epochs = 25
  Instantaneous batch size per device = 48
  Total train batch size (w. parallel, distributed & accumulation) = 48
  Gradient Accumulation steps = 1
  Total optimization steps = 75
  Number of trainable parameters = 124697433


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to ./roberta-retrained
Configuration saved in ./roberta-retrained/config.json


{'train_runtime': 1023.8414, 'train_samples_per_second': 2.613, 'train_steps_per_second': 0.073, 'train_loss': 1.7512172444661458, 'epoch': 25.0}


Model weights saved in ./roberta-retrained/pytorch_model.bin
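
The numbers in this log can be cross-checked with a little arithmetic. The sketch below recomputes the step count and throughput from the values reported above, assuming a single device and no gradient accumulation (as the log indicates); the variable names are only for illustration.

```python
import math

num_examples = 107         # "Num examples" in the log
batch_size = 48            # per_device_train_batch_size on a single device
num_epochs = 25            # num_train_epochs
save_steps = 500           # from TrainingArguments
train_runtime = 1023.8414  # seconds, from the final metrics

# The last, smaller batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 3
print(total_steps)      # 75
print(round(total_steps / train_runtime, 3))                # 0.073
print(round(num_examples * num_epochs / train_runtime, 3))  # 2.613
# Intermediate checkpoints are written every `save_steps` steps.
print(total_steps // save_steps)  # 0
```

Because 75 is below save_steps=500, this run writes no intermediate checkpoints, which matches the single "Saving model checkpoint" line in the log: only the explicit trainer.save_model call stores the model.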


In [70]:
from transformers import RobertaTokenizer, pipeline

# The Trainer was not given the tokenizer, so only the model was saved to ./roberta-retrained;
# load the unchanged roberta-base tokenizer separately.
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

MASK_TOKEN = tokenizer.mask_token

text = "10x Software engineer, java god, python expert, kubernetes. We have {}.".format(MASK_TOKEN)

# The fill-mask pipeline loads the retrained weights and returns the top candidates for the masked position.
clf = pipeline("fill-mask", model="./roberta-retrained", tokenizer=tokenizer)
answer = clf(text)

print(f'For position: {text}')
answer

For position: 10x Software engineer, java god, python expert, kubernetes. We have <mask>.


[{'score': 0.19786226749420166,
  'token': 960,
  'token_str': ' everything',
  'sequence': '10x Software engineer, java god, python expert, kubernetes. We have everything.'},
 {'score': 0.15671701729297638,
  'token': 24,
  'token_str': ' it',
  'sequence': '10x Software engineer, java god, python expert, kubernetes. We have it.'},
 {'score': 0.07473452389240265,
  'token': 47,
  'token_str': ' you',
  'sequence': '10x Software engineer, java god, python expert, kubernetes. We have you.'},
 {'score': 0.033374007791280746,
  'token': 9622,
  'token_str': ' suggestions',
  'sequence': '10x Software engineer, java god, python expert, kubernetes. We have suggestions.'},
 {'score': 0.027171382680535316,
  'token': 6401,
  'token_str': ' guidelines',
  'sequence': '10x Software engineer, java god, python expert, kubernetes. We have guidelines.'}]
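
The pipeline returns a list of dictionaries, which is verbose to read. A small sketch that condenses it into (token, score) pairs; `summarize_predictions` is a hypothetical helper, shown here on the first two entries of the output above:

```python
from typing import Dict, List, Tuple


def summarize_predictions(
    predictions: List[Dict], digits: int = 3
) -> List[Tuple[str, float]]:
    """Reduce fill-mask output to (token, rounded score) pairs."""
    return [
        (p["token_str"].strip(), round(p["score"], digits)) for p in predictions
    ]


# First two entries copied from the output above (the 'sequence' field is omitted).
sample = [
    {"score": 0.19786226749420166, "token": 960, "token_str": " everything"},
    {"score": 0.15671701729297638, "token": 24, "token_str": " it"},
]
print(summarize_predictions(sample))
# [('everything', 0.198), ('it', 0.157)]
```

In the notebook, the same helper could be applied directly to `answer`.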