# Retraining RoBERTa for MLM

Retraining roberta-base for masked language modeling (MLM) using the RoBERTa pre-training procedure.

In [None]:
# It is often desirable to retrain the LM so that it better captures the language characteristics of a downstream task.

# A recently published work, BERTweet (Nguyen et al., 2020), provides a BERT model pre-trained
# (using the RoBERTa procedure) on a vast corpus of English tweets.
# The authors argue that BERTweet better models the characteristics of the language used on Twitter,
# outperforming previous SOTA models on Tweet NLP tasks.

# That is, performance on downstream tasks can be greatly influenced by what our LM captures!

### Create Virtual Environment

https://janakiev.com/blog/jupyter-virtual-envs/

In [None]:
# ! pip install --user virtualenv

In [None]:
# ! python -m venv venv_robarta

In [None]:
# ! source venv_robarta/bin/activate

In [None]:
# ! pip install --user ipykernel
# ! jupyter kernelspec uninstall myenv

In [None]:
# ! python -m ipykernel install --user --name=venv_robarta

## 0. Get Data

Get the Hate Speech Detection dataset (Basile et al., 2019), made available through TweetEval (Barbieri et al., 2020).

In [None]:
# !git clone https://github.com/cardiffnlp/tweeteval /tmp/tweeteval
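
LineByLineTextDataset, used later in this notebook, expects a single text file with one example per line. A minimal sketch of producing such a file from the cloned repository follows. The directory `/tmp/tweeteval/datasets/hate`, the file name `train_text.txt`, the output name `hate_mlm_corpus.txt`, and the helper `build_line_file` are assumptions made for illustration and are not part of the original notebook.

```python
import os


def build_line_file(src_paths, dst_path):
    """Concatenate text files into one file with one non-empty example per line.

    Returns the number of lines written.
    """
    kept = 0
    with open(dst_path, "w", encoding="utf-8") as dst:
        for src in src_paths:
            with open(src, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:  # skip blank and whitespace-only lines
                        dst.write(line + "\n")
                        kept += 1
    return kept


# Assumed location of the hate-speech split inside the cloned repository.
hate_dir = "/tmp/tweeteval/datasets/hate"
sources = [os.path.join(hate_dir, "train_text.txt")]

# Only run when the clone is actually present.
if all(os.path.exists(p) for p in sources):
    n_lines = build_line_file(sources, "hate_mlm_corpus.txt")  # hypothetical output name
    print(f"wrote {n_lines} lines")
```

Only the raw text is needed for MLM retraining; the label files in the same directory are not used at this stage.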

## 1. Include required libraries

In [None]:
# ! pip install torch==1.4.0 torchvision==0.5.0
# ! pip install transformers==3.5.1
from transformers import RobertaTokenizer, RobertaForMaskedLM
from transformers import LineByLineTextDataset
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

In [None]:
# import transformers
# transformers.__version__

In [None]:
# import torch
# torch.__version__

In [None]:
# ! pip -V

## 2. Prepare Data

### 2.1 Create tokenizer and model object

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

### 2.2 LineByLineTextDataset class

Since our data is already in a single file, we can use the LineByLineTextDataset class, which treats each non-empty line of the file as one training example.

In [None]:
# The block_size argument sets the maximum length, in tokens, of each training example.
# "roberta-base" supports sequences of up to 512 tokens, including special tokens such as <s> (start of sequence) and </s> (end of sequence).

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/messagesForX.txt",
    block_size=512,
)
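
To make the effect of `block_size` concrete, here is a rough pure-Python sketch of what a line-by-line dataset does: skip blank lines, tokenize each remaining line, and truncate so that the example, including the two special tokens, fits in `block_size`. The whitespace tokenizer and the helper name `line_by_line_examples` are stand-ins chosen for illustration; the real class uses the RoBERTa tokenizer and stores token ids rather than strings.

```python
def line_by_line_examples(lines, tokenize, block_size, bos="<s>", eos="</s>"):
    """Turn raw lines into token lists of at most block_size items each."""
    examples = []
    for line in lines:
        if not line or line.isspace():
            continue  # blank lines do not become examples
        # Reserve two positions for the start and end tokens.
        tokens = tokenize(line)[: block_size - 2]
        examples.append([bos] + tokens + [eos])
    return examples


toy_lines = [
    "first short message",
    "   ",
    "a much longer message that will be cut off",
]

# str.split stands in for a real subword tokenizer here.
examples = line_by_line_examples(toy_lines, str.split, block_size=6)
for ex in examples:
    print(len(ex), ex)
```

With `block_size=6` the short line is kept whole (5 tokens with the special tokens) and the long line is cut to 6 tokens. Anything past the limit is dropped, not carried over into a new example.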

### 2.3. Data collator

The data collator assembles examples into batches in a form the LM can be trained on. For example, it pads all examples in a batch to the same length and, with mlm=True, randomly selects tokens to mask (here with probability 0.15) so that the model learns to predict them.

In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
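
The masking step can be sketched in plain Python. This is an illustration of the standard BERT-style scheme (each non-special token is selected with probability `mlm_probability`; a selected token becomes the mask token 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time), not the library's actual implementation. The token ids in the example batch are arbitrary, and the helper names `mask_tokens` and `pad_batch` are made up for this sketch. The special ids 0, 1, 2 and 50264 are assumed to be the roberta-base values for `<s>`, `<pad>`, `</s>` and `<mask>`.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(s) for s in sequences)
    return [s + [pad_id] * (width - len(s)) for s in sequences]


def mask_tokens(input_ids, special_ids, mask_id, vocab_size, rng, mlm_probability=0.15):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(input_ids):
        # Special tokens (start, end, padding) are never selected.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


rng = random.Random(1)
batch = pad_batch([[0, 713, 16, 10, 1296, 2], [0, 31414, 2]], pad_id=1)
for seq in batch:
    masked, labels = mask_tokens(seq, {0, 1, 2}, mask_id=50264, vocab_size=50265, rng=rng)
    print(masked, labels)
```

Only the selected positions carry a label, so the loss is computed on roughly 15% of the tokens in each batch.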

## 3. Training Model

### 3.1 Training object

The TrainingArguments object holds the fields that define the training process. The Trainer then brings together all of the objects created so far to run training.

seed=1: seeds the RNG for the Trainer so that the results can be replicated when needed.

In [None]:
training_args = TrainingArguments(
    output_dir="./roberta-retrained",
    overwrite_output_dir=True,
    num_train_epochs=25,
    per_device_train_batch_size=48,
    save_steps=500,
    save_total_limit=2,
    seed=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)
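
The interaction between `num_train_epochs`, `per_device_train_batch_size`, `save_steps` and `save_total_limit` is easy to check with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. The corpus size of 10,000 lines is a made-up figure, since the size of the training file is not stated here, and `training_schedule` is a hypothetical helper.

```python
import math


def training_schedule(num_examples, batch_size, epochs, save_steps, save_total_limit):
    """Return (steps_per_epoch, total_steps, checkpoint_steps_kept_at_the_end)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # A checkpoint is written every save_steps optimization steps.
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    # Older checkpoints are deleted so that at most save_total_limit remain.
    kept = checkpoints[-save_total_limit:]
    return steps_per_epoch, total_steps, kept


print(training_schedule(10_000, 48, 25, 500, 2))
# (209, 5225, [4500, 5000])
```

Under these assumptions the final model state at step 5225 is not captured by a periodic checkpoint, which is one reason to call `trainer.save_model` explicitly after training.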

### 3.2 Run training

trainer.save_model(output_dir): saves the model to output_dir so that it can later be loaded with from_pretrained (as done below).

In [None]:
trainer.train()

trainer.save_model("./roberta-retrained")

In [78]:
from transformers import RobertaTokenizer, TFRobertaModel
import tensorflow as tf

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

MASK_TOKEN = tokenizer.mask_token
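# from_pt=True converts the saved PyTorch weights to TensorFlow.
# Note: trainer.save_model above wrote to "./roberta-retrained"; this path assumes that directory was moved under "./models/".
model = TFRobertaModel.from_pretrained("./models/roberta-retrained", from_pt=True)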


text = f"Hey {MASK_TOKEN} would you like to {MASK_TOKEN} and make a living {MASK_TOKEN} ?"

chats = \
    f"Let’s talk about your {MASK_TOKEN}! I am looking for new {MASK_TOKEN} for the fintech company Blackrock to further increase its {MASK_TOKEN}.\
    Based on your {MASK_TOKEN}, I believe you, {MASK_TOKEN}, could be {MASK_TOKEN} in our {MASK_TOKEN}.\
    Find the details here: https://careers.blackrock.com/job/16845601/ {MASK_TOKEN}\
    If you're {MASK_TOKEN}, please let me know your {MASK_TOKEN} and your {MASK_TOKEN} for a {MASK_TOKEN}.\
    What {MASK_TOKEN} Blackrock {MASK_TOKEN} {MASK_TOKEN}?: {MASK_TOKEN}\
    {MASK_TOKEN}, {MASK_TOKEN}, and {MASK_TOKEN}."

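# Tokenized TensorFlow inputs for `model`; the pipeline below tokenizes the text itself and does not use them.
encoded_input = tokenizer(chats, return_tensors='tf')

# With several masks in the input, the fill-mask pipeline returns one list of candidates per mask.
from transformers import pipeline
clf = pipeline("fill-mask", model="./models/roberta-retrained", tokenizer=tokenizer)
answer = clf(chats)
answer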

loading file vocab.json from cache at /Users/adarmani/.cache/huggingface/hub/models--roberta-base/snapshots/bc2764f8af2e92b6eb5679868df33e224075ca68/vocab.json
loading file merges.txt from cache at /Users/adarmani/.cache/huggingface/hub/models--roberta-base/snapshots/bc2764f8af2e92b6eb5679868df33e224075ca68/merges.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /Users/adarmani/.cache/huggingface/hub/models--roberta-base/snapshots/bc2764f8af2e92b6eb5679868df33e224075ca68/config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermedia

[[{'score': 0.10631920397281647,
   'token': 1142,
   'token_str': ' questions',
   'sequence': "<s>Let’s talk about your questions! I am looking for new<mask> for the fintech company Blackrock to further increase its<mask>.    Based on your<mask>, I believe you,<mask>, could be<mask> in our<mask>.    Find the details here: https://careers.blackrock.com/job/16845601/<mask>    If you're<mask>, please let me know your<mask> and your<mask> for a<mask>.    What<mask> Blackrock<mask><mask>?:<mask><mask>,<mask>, and<mask>.</s>"},
  {'score': 0.07948160916566849,
   'token': 2502,
   'token_str': ' application',
   'sequence': "<s>Let’s talk about your application! I am looking for new<mask> for the fintech company Blackrock to further increase its<mask>.    Based on your<mask>, I believe you,<mask>, could be<mask> in our<mask>.    Find the details here: https://careers.blackrock.com/job/16845601/<mask>    If you're<mask>, please let me know your<mask> and your<mask> for a<mask>.    What<mask

In [90]:
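import random

new_chats = chats
toks = []
# For each mask, pick one of the returned candidates at random.
for ans in answer:
    toks.append(random.choice(ans)['token_str'])

# Replace the masks from left to right, one per chosen token.
# token_str keeps its leading space, hence the double spaces in the output.
for tok in toks:
    new_chats = new_chats.replace(MASK_TOKEN, tok, 1)
new_chats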

"Let’s talk about your  ideas! I am looking for new  applicants for the fintech company Blackrock to further increase its  presence.    Based on your  feedback, I believe you,  Adam, could be  participating in our  team.    Find the details here: https://careers.blackrock.com/job/16845601/ ...    If you're  interested, please let me know your  requirements and your  preferences for a  callback.    What  are Blackrock  look  for?:  My    mail,  Thanks, and  Developers."

In [None]:
answer[1][0]['sequence']

In [None]:
answer[2][0]['sequence']
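
As an alternative to the random choice used earlier, each mask can be filled with its highest-scoring candidate. The sketch below works on a small hand-written structure shaped like the pipeline output (one list of candidate dicts per mask). The scores and tokens are invented for illustration and `fill_masks` is a hypothetical helper, not part of the transformers API.

```python
def fill_masks(text, predictions, mask_token="<mask>"):
    """Fill masks left to right with the best-scoring candidate for each."""
    filled = text
    for candidates in predictions:
        best = max(candidates, key=lambda c: c["score"])
        # Strip the leading space of token_str to avoid double spaces.
        filled = filled.replace(mask_token, best["token_str"].strip(), 1)
    return filled


toy_text = "Hey <mask> would you like to <mask> ?"
toy_predictions = [
    [{"score": 0.6, "token_str": " Adam"}, {"score": 0.2, "token_str": " there"}],
    [{"score": 0.3, "token_str": " talk"}, {"score": 0.5, "token_str": " join"}],
]

print(fill_masks(toy_text, toy_predictions))
# Hey Adam would you like to join ?
```

Each mask is predicted independently while the other masks are still in place, so even the greedy choice can produce combinations that do not read coherently together, as the randomly filled text above illustrates.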