# Retraining RoBERTa on Your Own Corpus

This notebook retrains roberta-base on a custom corpus with the masked language modeling (MLM) objective, following the RoBERTa pre-training procedure.

It is often desirable to retrain the language model (LM) so that it better captures the language characteristics of the downstream task.

## 1. Include required libraries

In [1]:
from transformers import RobertaTokenizer, RobertaForMaskedLM
from transformers import LineByLineTextDataset
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

In [2]:
import transformers
transformers.__version__

'4.27.4'

In [3]:
# import torch
# torch.__version__

In [4]:
# ! pip -V

## 2. Prepare Data

### 2.1 Create tokenizer and model object

In [5]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

### 2.2 LineByLineTextDataset class

Since our data is already in a single text file, we can use the LineByLineTextDataset class, which treats every non-empty line of the file as one training example.

In [8]:
# block_size is the maximum number of tokens per example for the LM being trained.
# roberta-base supports sequences of up to 512 tokens, including the special tokens
# <s> (start of sequence) and </s> (end of sequence).

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/outreachMessages.txt",
    block_size=512,
)
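As a rough illustration of what happens to the file before tokenization, the sketch below keeps one example per non-blank line and shows how a cap of 512 tokens leaves room for 510 content tokens once `<s>` and `</s>` are added. The helper names and the sample file are hypothetical, and the sketch does not call the library; it only mirrors the behavior described above.

```python
import tempfile
from pathlib import Path

BOS_ID, EOS_ID = 0, 2  # ids of <s> and </s> in the roberta-base vocabulary


def read_nonempty_lines(path):
    """Return one training example per non-blank line of a text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line for line in text.splitlines() if line and not line.isspace()]


def add_special_and_truncate(token_ids, block_size):
    """Wrap ids in <s> ... </s> and cap the total length at block_size."""
    body = token_ids[: block_size - 2]  # reserve two slots for <s> and </s>
    return [BOS_ID] + body + [EOS_ID]


# Tiny demonstration on a throwaway file (the contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("first message\n\n   \nsecond message\n", encoding="utf-8")
    lines = read_nonempty_lines(sample)

print(lines)  # ['first message', 'second message']

# 600 placeholder ids stand in for an overly long tokenized line.
capped = add_special_and_truncate(list(range(100, 700)), block_size=512)
print(len(capped))  # 512
```

Blank and whitespace-only lines are dropped, so empty lines between messages in the corpus file do not create empty training examples.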

### 2.3 Data collator

The data collator assembles individual examples into batches on which the LM can be trained. It pads all examples in a batch to the same length and, with mlm=True, randomly selects tokens to mask (here with probability 0.15) and builds the matching labels.

In [9]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
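To make the effect of `mlm=True` and `mlm_probability=0.15` more concrete, here is a pure-Python sketch of the usual BERT/RoBERTa masking recipe: each non-special token is selected with probability 0.15, and a selected token is replaced by `<mask>` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function name `mask_tokens` and the toy ids are illustrative assumptions; the real collator works on tensors and whole batches.

```python
import random

MASK_ID = 50264      # id of <mask> in the roberta-base vocabulary
VOCAB_SIZE = 50265   # size of the roberta-base vocabulary
IGNORE_INDEX = -100  # label value that the loss function skips


def mask_tokens(input_ids, special_mask, mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one tokenized example."""
    if rng is None:
        rng = random.Random()
    inputs, labels = list(input_ids), []
    for i, (tok, is_special) in enumerate(zip(input_ids, special_mask)):
        # Special tokens and unselected tokens contribute nothing to the loss.
        if is_special or rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Toy example: <s>, twenty placeholder ids, </s>.
ids = [0] + list(range(1000, 1020)) + [2]
special = [True] + [False] * 20 + [True]
masked, labels = mask_tokens(ids, special, rng=random.Random(1))
print(masked)
print(labels)
```

Because the selection is redone every time a batch is built, the same sentence is masked differently across epochs.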

## 3. Training Model

### 3.1 Training object

The TrainingArguments object holds the settings that define the training process. The Trainer then brings together all the objects created so far (model, arguments, data collator, and dataset) to run training.

seed=1: seeds the random number generators used by the Trainer so that results can be reproduced when needed.

In [10]:
training_args = TrainingArguments(
    output_dir="./roberta-retrained",
    overwrite_output_dir=True,
    num_train_epochs=25,
    per_device_train_batch_size=48,
    save_steps=500,
    save_total_limit=2,
    seed=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)
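The interaction between `num_train_epochs`, `per_device_train_batch_size`, `save_steps`, and `save_total_limit` can be worked out with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; the corpus sizes are hypothetical, and the helper functions are not part of the library.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps for a full training run."""
    return math.ceil(num_examples / batch_size) * num_epochs


def checkpoints_kept(total_steps, save_steps, save_total_limit):
    """Steps whose checkpoints remain on disk after older ones are pruned."""
    written = list(range(save_steps, total_steps + 1, save_steps))
    return written[-save_total_limit:]


# Hypothetical large corpus: 10,000 lines.
large = total_optimizer_steps(10_000, batch_size=48, num_epochs=25)
print(large)                                                         # 5225
print(checkpoints_kept(large, save_steps=500, save_total_limit=2))  # [4500, 5000]

# Hypothetical small corpus: 40 lines fit in one batch per epoch.
small = total_optimizer_steps(40, batch_size=48, num_epochs=25)
print(small)                                                         # 25
print(checkpoints_kept(small, save_steps=500, save_total_limit=2))  # []
```

When the whole run is shorter than `save_steps`, no intermediate checkpoint is written, so the explicit `trainer.save_model(...)` call is what persists the result.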

### 3.2 Run training

trainer.save_model(output_dir): saves the model to output_dir so that it can be reloaded with from_pretrained, as done further below.

In [12]:
trainer.train()

trainer.save_model("./roberta-retrained")




{'train_runtime': 191.148, 'train_samples_per_second': 2.485, 'train_steps_per_second': 0.131, 'train_loss': 0.36373397827148435, 'epoch': 25.0}
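The summary line above also reveals how small this run was. The following sketch copies the logged values and derives the approximate number of training examples and optimizer steps; it assumes the throughput figures are averages over the whole run.

```python
# Values copied from the training summary printed above.
log = {
    "train_runtime": 191.148,
    "train_samples_per_second": 2.485,
    "train_steps_per_second": 0.131,
    "epoch": 25.0,
}

samples_seen = log["train_runtime"] * log["train_samples_per_second"]
examples_per_epoch = samples_seen / log["epoch"]
total_steps = log["train_runtime"] * log["train_steps_per_second"]
steps_per_epoch = total_steps / log["epoch"]

print(round(examples_per_epoch))  # 19
print(round(total_steps))         # 25
print(round(steps_per_epoch))     # 1
```

Roughly 19 examples per epoch fit into a single batch of 48, which gives one step per epoch and 25 steps in total. With so few steps, the `save_steps=500` threshold is never reached during this run.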


### 3.3 Test the retrained model

In [13]:
from transformers import RobertaTokenizer, pipeline

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
MASK_TOKEN = tokenizer.mask_token

text = f"Hey {MASK_TOKEN} would you like to {MASK_TOKEN} and make a living {MASK_TOKEN} ?"

chats = \
    f"Let’s talk about your {MASK_TOKEN}! I am looking for new {MASK_TOKEN} for the fintech company Blackrock to further increase its {MASK_TOKEN}.\
    Based on your {MASK_TOKEN}, I believe you, {MASK_TOKEN}, could be {MASK_TOKEN} in our {MASK_TOKEN}.\
    Find the details here: https://careers.blackrock.com/job/16845601/ {MASK_TOKEN}\
    If you're {MASK_TOKEN}, please let me know your {MASK_TOKEN} and your {MASK_TOKEN} for a {MASK_TOKEN}.\
    What {MASK_TOKEN} Blackrock {MASK_TOKEN} {MASK_TOKEN}?: {MASK_TOKEN}\
    {MASK_TOKEN}, {MASK_TOKEN}, and {MASK_TOKEN}."

# TensorFlow is not needed here: the fill-mask pipeline loads the PyTorch
# checkpoint written by trainer.save_model("./roberta-retrained") above.
clf = pipeline("fill-mask", model="./roberta-retrained", tokenizer=tokenizer)

# With several masks in the input, the result holds one list of candidates per mask.
answer = clf(chats)
answer

In [None]:
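import random

# Pick one of the returned candidates at random for each masked position.
toks = [random.choice(ans)['token_str'] for ans in answer]

# Substitute the picks into the text one mask at a time, from left to right.
new_chats = chats
for tok in toks:
    new_chats = new_chats.replace(MASK_TOKEN, tok, 1)
new_chats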

"Let’s talk about your  ideas! I am looking for new  applicants for the fintech company Blackrock to further increase its  presence.    Based on your  feedback, I believe you,  Adam, could be  participating in our  team.    Find the details here: https://careers.blackrock.com/job/16845601/ ...    If you're  interested, please let me know your  requirements and your  preferences for a  callback.    What  are Blackrock  look  for?:  My    mail,  Thanks, and  Developers."

In [None]:
answer[1][0]['sequence']

In [None]:
answer[2][0]['sequence']
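The substitution cell above picks a random candidate per mask, and its output contains doubled spaces, which appear to come from RoBERTa token strings carrying a leading space (for example " ideas"). As an alternative, the sketch below takes the highest-scoring candidate for each mask and strips that whitespace. The `mock_answer` structure is a hand-written stand-in with made-up tokens and scores, shaped like the multi-mask pipeline output (one list of candidates per mask); `fill_masks` is a hypothetical helper.

```python
MASK = "<mask>"

# Hand-written stand-in for the pipeline output; tokens and scores are made up.
mock_answer = [
    [{"token_str": " ideas", "score": 0.41}, {"token_str": " plans", "score": 0.22}],
    [{"token_str": " talent", "score": 0.30}, {"token_str": " applicants", "score": 0.35}],
]
template = f"Let's talk about your {MASK}! I am looking for new {MASK}."


def fill_masks(text, candidates_per_mask, mask_token=MASK):
    """Replace each mask, left to right, with its highest-scoring candidate."""
    for candidates in candidates_per_mask:
        best = max(candidates, key=lambda c: c["score"])
        # strip() drops the leading space that RoBERTa tokens usually carry.
        text = text.replace(mask_token, best["token_str"].strip(), 1)
    return text


filled = fill_masks(template, mock_answer)
print(filled)  # Let's talk about your ideas! I am looking for new applicants.
```

Note that with a single mask in the input, the pipeline returns a flat list of candidates rather than a list of lists, so the input would need to be wrapped in a list before calling such a helper.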