# BERT variant: RoBERTa

Robustly Optimized BERT Pretraining Approach (RoBERTa) is a popular variant of BERT. Its authors observed that BERT was significantly
undertrained and proposed several changes to how the model is pre-trained. RoBERTa is essentially BERT with the following changes in pre-training:

- Use dynamic masking instead of static masking in the MLM task.
- Remove the NSP task and train using only the MLM task.
- Train with a large batch size.
- Use byte-level BPE (BBPE) as a tokenizer.
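To make the difference between static and dynamic masking concrete, here is a minimal sketch in plain Python. It is not RoBERTa's actual data pipeline: the helper name mask_tokens and the example sentence are hypothetical, the 15% rate is the one used by BERT's MLM task, and always substituting the mask token (instead of BERT's 80/10/10 replacement scheme) is a simplifying assumption.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with a random subset replaced by MASK.

    Hypothetical helper: it always masks at least one token so that
    short examples still show a change.
    """
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = "we arrived at the airport in time".split()
rng = random.Random(0)

# Static masking: the mask is sampled once during preprocessing and the
# same masked sequence is reused in every epoch.
static = mask_tokens(tokens, rng)
static_epochs = [static for _ in range(4)]

# Dynamic masking: a new mask is sampled every time the sequence is fed
# to the model, so different epochs can hide different tokens.
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(4)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} | static: {' '.join(s)} | dynamic: {' '.join(d)}")
```

With static masking the model sees the same hidden positions every epoch, whereas dynamic masking lets it see different hidden positions for the same sentence over the course of training.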

BERT uses the WordPiece tokenizer. WordPiece works similarly to BPE, but it chooses which symbol pair to merge based on likelihood instead of
frequency. Unlike BERT, RoBERTa uses BBPE as a tokenizer.
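The two merge criteria can be compared on a toy corpus. The snippet below is an illustration, not the code of either tokenizer: the word counts are made up, WordPiece's continuation prefixes are left out, and the score count(pair) / (count(first) * count(second)) is a commonly cited formulation of its likelihood-based criterion, used here as an assumption.

```python
from collections import Counter

# Hypothetical toy corpus: each word is pre-split into symbols and has a count.
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

symbol_counts = Counter()
pair_counts = Counter()
for symbols, count in corpus.items():
    for symbol in symbols:
        symbol_counts[symbol] += count
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += count

# BPE criterion: merge the most frequent adjacent pair.
bpe_pair = max(pair_counts, key=pair_counts.get)

def wordpiece_score(pair):
    """Assumed WordPiece criterion: pair count normalized by its parts' counts."""
    first, second = pair
    return pair_counts[pair] / (symbol_counts[first] * symbol_counts[second])

wordpiece_pair = max(pair_counts, key=wordpiece_score)

print("BPE would merge:", bpe_pair)
print("WordPiece would merge:", wordpiece_pair)
```

On this corpus the two criteria disagree: the frequency criterion favors a pair built from very common symbols, while the normalized score favors a pair whose parts rarely occur apart.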

BBPE works much like BPE, but it operates on a byte-level sequence instead of a character-level sequence. BERT uses a vocabulary of about
30,000 tokens, whereas RoBERTa uses about 50,000 (the roberta-base configuration printed below reports a vocab_size of 50265). Let's explore the RoBERTa tokenizer further.
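To see what operating on bytes means, the sketch below uses only the Python standard library. It compares the character view and the UTF-8 byte view of a string, then rebuilds the byte-to-printable-character table used by GPT-2 style byte-level BPE tokenizers to display bytes. The function name bytes_to_unicode mirrors the GPT-2 reference code; the example string is hypothetical, and this is a sketch of the idea, not the tokenizer's full pipeline.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Byte values that are already printable keep their own character.
    printable = (
        list(range(33, 127))      # '!' .. '~'
        + list(range(161, 173))   # inverted '!' .. not sign
        + list(range(174, 256))   # registered sign .. y with diaeresis
    )
    printable_set = set(printable)
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable_set:
            # Remaining bytes (control bytes, the space, ...) are moved past 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

text = "na\u00efve day"  # "naive" written with i-diaeresis, plus a second word
as_chars = list(text)
as_bytes = list(text.encode("utf-8"))
print(len(as_chars), "characters:", as_chars)
print(len(as_bytes), "bytes:", as_bytes)

byte_map = bytes_to_unicode()
print("byte-level symbols:", "".join(byte_map[b] for b in as_bytes))
```

Because every string reduces to the same 256 byte values, a byte-level vocabulary never needs an unknown token. In this table the space byte is displayed as the character U+0120, which is why tokens in byte-level BPE vocabularies often appear with that prefix.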



In [1]:
# Suppressing "INFO" and "WARNING" messages by setting the verbosity of the Transformers library.
from transformers import logging
logging.set_verbosity_error()

# Import the necessary modules

In [2]:
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer

# Downloading and loading the model and the tokenizer

In [3]:
model = RobertaModel.from_pretrained('roberta-base')

In [4]:
model.config

RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.30.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
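One value in this configuration that often raises questions is max_position_embeddings, which is 514 and not 512. A likely explanation, stated here as an assumption about the Hugging Face implementation and not something this notebook shows, is that position IDs for real tokens start at pad_token_id + 1, so the lowest slots of the position table are reserved. The plain-Python sketch below mimics that numbering; position_ids_from_input_ids is a hypothetical helper and the token IDs are only illustrative.

```python
PAD_TOKEN_ID = 1               # "pad_token_id" in the config above
MAX_POSITION_EMBEDDINGS = 514  # "max_position_embeddings" in the config above

def position_ids_from_input_ids(input_ids, pad_id=PAD_TOKEN_ID):
    """Number non-padding tokens from pad_id + 1; padding keeps pad_id."""
    position_ids = []
    seen = 0
    for token_id in input_ids:
        if token_id == pad_id:
            position_ids.append(pad_id)
        else:
            seen += 1
            position_ids.append(pad_id + seen)
    return position_ids

# Illustrative sequence: <s>, five tokens, </s>, then two padding tokens.
input_ids = [0, 243, 21, 10, 372, 183, 2, 1, 1]
print(position_ids_from_input_ids(input_ids))

# Positions 0 and 1 are never assigned to real tokens under this scheme.
max_tokens = MAX_POSITION_EMBEDDINGS - (PAD_TOKEN_ID + 1)
print("usable sequence length:", max_tokens)
```

Under this assumption the model still accepts at most 512 tokens per sequence, the same limit as BERT.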

In [5]:
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Declare and tokenize the sentence

In [8]:
sentence = "It was a great day" 
inputs = tokenizer(sentence, return_tensors="pt")

In [9]:
print(inputs)

{'input_ids': tensor([[  0, 243,  21,  10, 372, 183,   2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
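The printed IDs can be read against the configuration shown earlier: the sequence starts with bos_token_id (0) and ends with eos_token_id (2), and the attention mask is all ones because nothing was padded. The plain-Python sketch below checks this and pads the sequence by hand to show how the mask would change; pad_to_length is a hypothetical helper, not a Transformers function.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # bos/pad/eos token IDs from model.config

# Values copied from the tokenizer output above.
input_ids = [0, 243, 21, 10, 372, 183, 2]
attention_mask = [1] * len(input_ids)

# The five word tokens are wrapped in the start and end special tokens.
assert input_ids[0] == BOS_ID and input_ids[-1] == EOS_ID

def pad_to_length(ids, mask, length, pad_id=PAD_ID):
    """Right-pad ids with pad_id and mark the padded positions with 0."""
    n_pad = max(0, length - len(ids))
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

padded_ids, padded_mask = pad_to_length(input_ids, attention_mask, 10)
print(padded_ids)
print(padded_mask)
```

With the real tokenizer, passing padding="max_length" and max_length=10 should produce the same layout: zeros in the attention mask tell the model which positions to ignore.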


# Generate the embeddings

In [10]:
# Run the model; last_hidden_state holds one vector per input token
outputs = model(**inputs)
hidden_rep = outputs.last_hidden_state
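last_hidden_state holds one vector per token. If a single vector for the whole sentence is needed, one common option (an addition here, not something this notebook does) is to average the token vectors while ignoring padded positions. The sketch below uses random stand-in tensors for a batch of one sentence with 7 tokens and a hidden size of 768, so it runs without downloading the model; mean_pool is a hypothetical helper.

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Average token vectors over the sequence axis, skipping padded positions."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [batch, seq, 1]
    summed = (hidden_states * mask).sum(dim=1)                   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1.0)                      # [batch, 1]
    return summed / counts

# Stand-ins for the model output and the tokenizer's attention mask.
torch.manual_seed(0)
hidden_rep = torch.randn(1, 7, 768)
attention_mask = torch.ones(1, 7, dtype=torch.long)

sentence_vec = mean_pool(hidden_rep, attention_mask)
print(sentence_vec.shape)
```

In the notebook, the same function could be called as mean_pool(hidden_rep, inputs["attention_mask"]). Whether mean pooling or the vector of the first token works better depends on the downstream task.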

In [11]:
print(hidden_rep.shape)
print(hidden_rep[0][0])

torch.Size([1, 7, 768])
tensor([-5.6060e-02,  9.9218e-02, -1.1224e-03, -1.2050e-01,  6.7013e-02,
        -8.7908e-02, -3.7936e-02,  1.2658e-02,  7.8730e-02, -5.4073e-02,
        -2.7654e-02,  5.3000e-02,  4.1000e-02, -2.2198e-02,  5.9687e-02,
         1.3421e-02, -8.0508e-02,  6.2014e-03,  1.6381e-02, -6.8445e-02,
        -1.0022e-01,  3.1296e-02, -1.8465e-02,  8.5712e-02,  9.7401e-03,
         2.2570e-02,  8.2694e-02,  9.4121e-02, -2.2577e-02, -4.4360e-04,
        -2.1319e-02, -3.7122e-02,  4.1177e-02, -3.1471e-02,  5.8183e-02,
         6.9783e-02,  6.4301e-02, -9.9231e-03, -9.6397e-02,  8.9104e-03,
        -1.0614e-02,  2.9335e-02,  1.2379e-02,  2.6416e-02,  9.2916e-02,
         4.3703e-02, -9.2723e-03,  2.2292e-02, -3.9757e-02,  1.1471e-02,
         5.1443e-03,  9.9927e-02, -4.7048e-02,  5.2403e-03, -8.9470e-02,
         1.1163e-03,  7.6182e-03,  7.9610e-02,  6.2181e-02, -3.9759e-02,
         8.8514e-03, -1.2961e-01, -1.2428e-01, -3.4741e-02,  1.3176e-02,
        -1.8243e-02, -1.255

In [12]:
print(hidden_rep.shape)
print(hidden_rep[0][1])

torch.Size([1, 7, 768])
tensor([-6.9859e-02,  6.7408e-02, -1.1274e-01, -1.5529e-01, -5.8559e-02,
         1.4286e-01,  6.1409e-03, -1.3168e-01,  1.2968e-01, -2.6003e-02,
        -6.9229e-02, -1.7551e-01, -3.6770e-03, -4.0581e-02,  2.0476e-02,
        -3.0628e-01, -9.8279e-02, -1.4351e-01, -1.5501e-02, -3.6197e-01,
        -4.2362e-02,  2.0551e-01, -1.3545e-01, -2.4461e-01,  1.1360e-01,
        -1.5792e-01,  8.7477e-02,  1.5481e-01,  1.8376e-01,  3.3387e-01,
        -6.0240e-02, -3.4335e-02,  9.0902e-02, -1.3554e-01, -8.0224e-02,
        -4.0956e-02,  4.3047e-01, -5.1422e-02,  2.1437e-01,  1.1635e-01,
        -2.1666e-01, -2.9332e-01, -1.2838e-01,  2.8084e-02,  2.8396e-02,
         3.0709e-02, -3.8857e-03, -1.5958e-01, -8.4159e-02, -8.1438e-02,
         8.1447e-02,  1.1410e-01,  1.4819e-01, -8.4121e-02,  4.7258e-02,
        -1.2093e-01,  1.4428e-02, -6.8630e-02, -3.0161e-03,  1.1640e-01,
        -2.0020e-02,  4.3074e-01, -3.9611e-01, -1.3073e-01, -1.4355e-02,
        -1.4098e-01,  1.862

In [13]:
print(hidden_rep.shape)
print(hidden_rep[0][2])

torch.Size([1, 7, 768])
tensor([ 9.9345e-02,  1.3705e-01,  1.0635e-01, -1.4131e-01,  5.8355e-01,
         4.2877e-01,  7.6867e-02, -3.2896e-01,  1.0036e-02, -4.0053e-04,
        -1.8976e-01,  6.0722e-01,  2.9647e-02,  1.1795e-01, -1.8351e-03,
        -6.4836e-01,  7.1196e-02, -1.1041e-01, -3.5760e-02, -1.9079e-01,
         1.3396e-01,  3.2353e-02, -2.6282e-01, -9.9392e-02,  3.3931e-01,
        -7.1115e-02,  2.3428e-01,  6.1475e-02, -8.2166e-03,  2.3488e-02,
        -1.1148e-01,  7.2489e-02,  5.7627e-02, -1.2557e-01, -2.0836e-01,
         4.8073e-02,  2.7461e-01, -9.6232e-02,  5.6473e-01,  7.1759e-02,
        -1.8905e-01, -2.9211e-01,  4.3209e-03,  1.5082e-01,  4.2688e-02,
         4.7611e-02, -3.1366e-01, -6.3374e-02, -1.5489e-01, -1.2204e-01,
        -1.7593e-01,  1.7197e-01,  8.1526e-02, -1.0637e-01,  1.2260e-01,
         3.4112e-02, -5.4786e-02,  2.3087e-02, -2.7058e-02,  3.4305e-03,
        -2.1147e-02, -9.5444e-01, -3.5223e-01,  1.1326e-01, -1.1177e-01,
        -7.3589e-02,  6.343