<a href="https://colab.research.google.com/github/arutraj/.githubcl/blob/main/Roberta_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to RoBERTa**

1. RoBERTa is a reimplementation of BERT with changes to the key
   hyperparameters and minor embedding tweaks. It uses a byte-level BPE tokenizer and a different pretraining scheme.
2. RoBERTa is trained for longer, i.e. the number of pretraining steps
   is increased from 100K to 300K and then further to 500K.
3. RoBERTa uses a larger byte-level BPE vocabulary with 50K subword units instead
   of the character-level BPE vocabulary of size 30K used in BERT.
4. In the Masked Language Model (MLM) training objective, RoBERTa employs dynamic masking: a new masking pattern is generated every time a sequence is fed to the model.
5. RoBERTa doesn’t use token_type_ids, so we don’t need to define which token
   belongs to which segment. Segments are simply separated with the separation token tokenizer.sep_token (or `</s>`).
6. Larger mini-batches and learning rates are used in RoBERTa’s training.
7. The next sentence prediction (NSP) task is removed from its objective.
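
Point 5 can be illustrated with a minimal sketch. It assumes the `<s>` and `</s>` special tokens of `roberta-base`; `build_roberta_input` is a hypothetical helper written for this illustration, and the tokens are plain words rather than real BPE pieces.

```python
# Special-token strings used by the roberta-base tokenizer.
BOS, SEP = "<s>", "</s>"

def build_roberta_input(tokens_a, tokens_b=None):
    """Hypothetical helper: lay out one or two token lists in RoBERTa's input format."""
    sequence = [BOS] + list(tokens_a) + [SEP]
    if tokens_b is not None:
        # Two separator tokens sit between the segments; no segment ids are needed.
        sequence += [SEP] + list(tokens_b) + [SEP]
    return sequence

single = build_roberta_input(["Hello", "world"])
pair = build_roberta_input(["How", "are", "you", "?"], ["Fine", "."])
print(" ".join(single))
print(" ".join(pair))
```

The second call shows that the only thing marking the boundary between the two segments is the pair of separator tokens.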

## Uncomment the lines below only if you need to mount the notebook to a given directory and persist the model in **Google** Drive

In [None]:
# mounting the drive
# from google.colab import drive
# drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Path to the directory where the model will be saved. To save it elsewhere, use another path inside the mounted Google Drive.
# %cd drive/MyDrive/Colab Notebooks

/content/drive/MyDrive/Colab Notebooks


In [None]:
# Installing the required libraries
!pip install transformers[torch]

Collecting transformers[torch]
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers[torch])
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers[torch])
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers[torch])
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
# Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration.
!pip install accelerate -U



In [None]:
# Cloning the dataset tweeteval repo here
!git clone https://github.com/cardiffnlp/tweeteval /tmp/tweeteval

Cloning into '/tmp/tweeteval'...
remote: Enumerating objects: 370, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 370 (delta 13), reused 3 (delta 1), pack-reused 354
Receiving objects: 100% (370/370), 8.49 MiB | 9.28 MiB/s, done.
Resolving deltas: 100% (122/122), done.


In [None]:
# defining the roberta base model here
from transformers import pipeline

roberta_base_model = pipeline(
                              "fill-mask",
                              model = "roberta-base",
                              tokenizer = "roberta-base"
                             )



In [None]:
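# function to print each predicted sentence for the masked token with its score
def predict_mask_token(model, sentence):
    predictions = model(sentence)
    for prediction in predictions:
        # Remove the exact "<s>" / "</s>" markers if present; str.strip("<s>") would
        # instead strip any '<', 's' or '>' characters from both ends of the sentence.
        text = prediction['sequence'].removeprefix('<s>').removesuffix('</s>').strip()
        print(text, end='\t--- ')
        print(f"{round(100 * prediction['score'], 2)}% confidence")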
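
One caveat when trimming marker strings such as `<s>` from a decoded sequence: `str.strip` treats its argument as a set of characters, not as a substring. The sketch below uses made-up example strings to show the difference; `removeprefix` and `removesuffix` require Python 3.9 or newer.

```python
sequence = "I hate watching fantasy sports"

# str.strip("<s>") removes any leading/trailing '<', 's' or '>' characters,
# so the final "s" of "sports" is lost even though no "<s>" marker is present.
stripped = sequence.strip("<s>").strip("</s>")

# removeprefix/removesuffix only remove the exact marker string.
trimmed = "<s>Send these pictures back!</s>".removeprefix("<s>").removesuffix("</s>")

print(stripped)
print(trimmed)
```

This is why a sentence ending in "sports" can be printed as "sport" when character-set stripping is used.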

In [None]:
# function call with parameters roberta base model and sentence with mask token
predict_mask_token(roberta_base_model, "Send these <mask> back!")

Send these pictures back!	--- 16.66% confidence
Send these photos back!	--- 10.79% confidence
Send these emails back!	--- 7.67% confidence
Send these images back!	--- 4.86% confidence
Send these letters back!	--- 4.84% confidence


In [None]:
predict_mask_token(roberta_base_model, "Elon Musk is the founder of <mask>")

Elon Musk is the founder of Tesla	--- 69.61% confidence
Elon Musk is the founder of SpaceX	--- 29.53% confidence
Elon Musk is the founder of PayPal	--- 0.73% confidence
Elon Musk is the founder of Twitter	--- 0.05% confidence
Elon Musk is the founder of Facebook	--- 0.02% confidence


**Configuring, tokenizing, training & saving the model**

Prepare the dataset for training: each line of the training data is converted to tokens, with a block size of 512 (the maximum number of tokens kept per line).

In [None]:
from transformers import RobertaTokenizer, RobertaForMaskedLM
# constructs a RoBERTa BPE tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# configuring the model and loading the model weights
model = RobertaForMaskedLM.from_pretrained('roberta-base')

In [None]:
from transformers import LineByLineTextDataset
# Preparing the dataset for training model
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, # for converting train data into tokens
    file_path="/tmp/tweeteval/datasets/hate/train_text.txt", # path where train data file exists
    block_size=512,
)
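
Roughly what a line-by-line dataset does can be sketched in plain Python: read the file, skip blank lines, tokenize each remaining line, and truncate it to `block_size` tokens. This is an illustration under stated assumptions, not the library's implementation: `toy_tokenize` and `load_line_by_line` are hypothetical helpers, and whitespace splitting stands in for the real BPE tokenizer.

```python
import os
import tempfile

def toy_tokenize(line):
    """Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word."""
    return line.split()

def load_line_by_line(file_path, block_size):
    """Read a text file, skip blank lines, and truncate each line to block_size tokens."""
    examples = []
    with open(file_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # ignore empty or whitespace-only lines
                examples.append(toy_tokenize(line)[:block_size])
    return examples

# Write a tiny throwaway corpus so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_text.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("first tweet goes here\n\nsecond tweet is a bit longer than the first\n")
    examples = load_line_by_line(path, block_size=4)

print(examples)
```

Each line becomes one training example, so short tweets stay short and only unusually long lines are cut at the block size.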

First, let us understand:

**What is Data Collator and what it does?**

Data collators are objects that form a batch from a list of dataset elements. These elements are of the same type as the elements of train_dataset or eval_dataset.
To build batches, data collators may apply some processing (like padding). **DataCollatorForLanguageModeling** also applies some random data augmentation (like random masking) to the formed batch.

Here we use the **DataCollatorForLanguageModeling** class to randomly mask 15% of the tokens for the MLM task. Its parameters are:

**tokenizer-** The tokenizer used for encoding the data,

**mlm-** Whether or not to use masked language modeling. If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.

**mlm_probability-** The probability with which to (randomly) mask tokens in the input, when mlm is set to True.
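
The masking behaviour described above can be sketched in plain Python. This is a simplified illustration, not the library's code: `collate_for_mlm` is a hypothetical function, the token ids in the example are made up, and the special-token ids and vocabulary size are the `roberta-base` values taken as assumptions. It pads the batch, selects non-special tokens with probability `mlm_probability`, sets the label to -100 everywhere else, and applies the usual 80% `<mask>` / 10% random token / 10% unchanged replacement rule.

```python
import random

MASK_ID, PAD_ID, VOCAB_SIZE = 50264, 1, 50265  # roberta-base values, assumed here
SPECIAL_IDS = {0, 1, 2}  # <s>, <pad>, </s> are never selected for masking
IGNORE_INDEX = -100      # label value that the loss function skips

def collate_for_mlm(examples, mlm_probability=0.15, rng=random):
    """Pad a list of token-id lists and apply random MLM masking."""
    max_len = max(len(ids) for ids in examples)
    batch_inputs, batch_labels = [], []
    for ids in examples:
        padded = ids + [PAD_ID] * (max_len - len(ids))
        inputs, labels = [], []
        for token in padded:
            if token in SPECIAL_IDS or rng.random() >= mlm_probability:
                inputs.append(token)         # token left untouched
                labels.append(IGNORE_INDEX)  # not part of the loss
                continue
            labels.append(token)             # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_ID)                    # 80%: replace with <mask>
            elif roll < 0.9:
                inputs.append(rng.randrange(VOCAB_SIZE))  # 10%: replace with a random token
            else:
                inputs.append(token)                      # 10%: keep the original token
        batch_inputs.append(inputs)
        batch_labels.append(labels)
    return batch_inputs, batch_labels

rng = random.Random(0)
examples = [[0, 713, 16, 10, 7728, 2], [0, 31414, 2]]  # made-up token ids
inputs_a, labels_a = collate_for_mlm(examples, rng=rng)
inputs_b, labels_b = collate_for_mlm(examples, rng=rng)  # a second pass draws a fresh mask
print(inputs_a, labels_a)
print(inputs_b, labels_b)
```

Because the mask is drawn when the batch is built rather than when the dataset is created, every pass over the data can see a different masking pattern, which is the dynamic masking RoBERTa relies on.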

In [None]:
# importing the library DataCollatorForLanguageModeling
from transformers import DataCollatorForLanguageModeling
# For the Masked Language Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15 # for masking randomly 15% of the tokens for MLM task
)

**Note-**

 **Difference between RoBERTa base model and RoBERTa retrained model?**

 The RoBERTa base model is a generic pretrained model. Once we continue training it on a new dataset, the result is what we call the RoBERTa retrained model.

In [None]:
# training the model
from transformers import Trainer, TrainingArguments

# initializing the training arguments
training_args = TrainingArguments(
    output_dir="./roberta-retrained",  # path where the retrained RoBERTa model will be saved
    overwrite_output_dir=True,  # allow overwriting the contents of the output directory
    num_train_epochs=1,  # number of full passes over the entire training dataset
    per_device_train_batch_size=48,  # batch size per GPU/TPU core/CPU for training
    seed=1  # random seed for initialization; optional, defaults to 42
)

# Passing the model, training arguments, data collator (for masking tokens), and dataset to the Trainer class for training the model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)
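
To get a feel for how long such a run is, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The sketch below is a back-of-the-envelope calculation, assuming no gradient accumulation; `optimizer_steps` is a hypothetical helper and the corpus size of 9000 is a placeholder, not a measured value.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, num_devices=1, num_epochs=1):
    """Estimate optimizer steps for a run (the last partial batch counts as a step)."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch * num_epochs

# 9000 is a placeholder corpus size; use len(dataset) in practice.
steps = optimizer_steps(num_examples=9000, per_device_batch_size=48)
print(steps)
```

TrainingArguments logs the training loss every `logging_steps` steps (500 by default), so a run with fewer steps than that finishes without reporting any intermediate loss values.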

In [None]:
# model is getting trained here
trainer.train()
# saving the model
trainer.save_model("roberta-retrained/model")





In [None]:
# loading the trained model
roberta_retrained_model = pipeline(
                                    "fill-mask",
                                    model="roberta-retrained/model",
                                    tokenizer="roberta-base"
                                )

In [None]:
# Predicting the masked token in sentence with trained model by function call
predict_mask_token(roberta_retrained_model, "Send these <mask> back!")

Send these guys back!	--- 10.9% confidence
Send these pics back!	--- 6.37% confidence
Send these pictures back!	--- 5.97% confidence
Send these people back!	--- 4.13% confidence
Send these photos back!	--- 3.98% confidence


In [None]:
predict_mask_token(roberta_retrained_model, "I hate watching <mask> sports")

I hate watching fantasy sport	--- 11.8% confidence
I hate watching stupid sport	--- 8.83% confidence
I hate watching college sport	--- 7.0% confidence
I hate watching live sport	--- 5.87% confidence
I hate watching American sport	--- 2.79% confidence


In [None]:
predict_mask_token(roberta_base_model, "I hate watching <mask> sports")

I hate watching fantasy sport	--- 24.12% confidence
I hate watching college sport	--- 10.72% confidence
I hate watching live sport	--- 9.0% confidence
I hate watching professional sport	--- 7.1% confidence
I hate watching pro sport	--- 3.46% confidence


In [None]:
predict_mask_token(roberta_retrained_model, "Hello I'm a <mask> model.")

Hello I'm a male model.	--- 17.01% confidence
Hello I'm a fashion model.	--- 10.6% confidence
Hello I'm a female model.	--- 5.1% confidence
Hello I'm a Russian model.	--- 2.07% confidence
Hello I'm a former model.	--- 1.96% confidence


In [None]:
predict_mask_token(roberta_base_model, "Hello I'm a <mask> model.")

Hello I'm a male model.	--- 33.07% confidence
Hello I'm a female model.	--- 4.66% confidence
Hello I'm a professional model.	--- 4.23% confidence
Hello I'm a fashion model.	--- 3.72% confidence
Hello I'm a Russian model.	--- 3.25% confidence


In [None]:
predict_mask_token(roberta_retrained_model, "The man worked as a <mask>.")

The man worked as a waiter.	--- 19.26% confidence
The man worked as a mechanic.	--- 7.8% confidence
The man worked as a cop.	--- 4.13% confidence
The man worked as a prostitute.	--- 3.98% confidence
The man worked as a slave.	--- 2.98% confidence


In [None]:
predict_mask_token(roberta_base_model, "The man worked as a <mask>.")

The man worked as a mechanic.	--- 8.7% confidence
The man worked as a waiter.	--- 8.2% confidence
The man worked as a butcher.	--- 7.33% confidence
The man worked as a miner.	--- 4.63% confidence
The man worked as a guard.	--- 4.02% confidence
