<a href="https://colab.research.google.com/github/docheem/NLP-Portfolio/blob/main/PR_NLP_(Pretraining_RoBERTa_Model_from_Scratch_ipynb).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pretraining RoBERTa Model from Scratch
We will build and pretrain a transformer model from scratch that performs masked language modeling (MLM), that is, predicting tokens that have been masked out of a sequence.

The model is named KantaiBERT. To build it, we will:
- Load a custom dataset
- Train a tokenizer
- Train the transformer model
- Save it
- Run it on an MLM example

In [None]:
!curl -L https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-2nd-Edition/master/Chapter04/kant.txt --output "kant.txt"
     

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.7M  100 10.7M    0     0  28.1M      0 --:--:-- --:--:-- --:--:-- 28.1M


Installing Hugging Face transformers

In [None]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow



Found existing installation: tensorflow 2.11.0
Uninstalling tensorflow-2.11.0:
  Successfully uninstalled tensorflow-2.11.0


In [None]:
# Install `transformers` from master

!pip install git+https://github.com/huggingface/transformers

# transformers version at notebook update --- 2.9.1
# tokenizers version at notebook update --- 0.7.0

!pip list | grep -E 'transformers|tokenizers'


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-_869cyw5
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-_869cyw5
  Resolved https://github.com/huggingface/transformers to commit b29e2dcaff114762e65eaea739ba1076fc5d1c84
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 90.1 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.w

# Training a tokenizer from scratch

We will use kant.txt to train Hugging Face's ByteLevelBPETokenizer(). A BPE (byte pair encoding) tokenizer splits a string or word into substrings, or subwords. This has two main advantages:
- The tokenizer can break words into minimal components and then merge those components into statistically frequent ones. For example, "smaller" and "smallest" can become "small," "er," and "est." The tokenizer can go further and produce, for example, "sm" and "all." In either case, words are broken into subword tokens and smaller subword parts, such as "sm" and "all," rather than kept whole as "small."
- Chunks of strings that would be classified as unknown (unk_token) under WordPiece-level encoding practically disappear.
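
To illustrate the second advantage, here is a minimal sketch in plain Python (it is not the `tokenizers` implementation). It shows why a byte-level base alphabet leaves little room for unknown tokens: any string, whatever its script, decomposes into UTF-8 bytes in the range 0 to 255, so 256 base symbols cover every input. The helper name `to_byte_symbols` is hypothetical.

```python
from typing import List


def to_byte_symbols(text: str) -> List[int]:
    """Return the UTF-8 byte values of `text` (hypothetical helper)."""
    return list(text.encode("utf-8"))


samples = ["Kant", "Königsberg", "純粋理性批判"]
for sample in samples:
    symbols = to_byte_symbols(sample)
    # Every symbol falls inside the 256-entry base alphabet.
    assert all(0 <= b <= 255 for b in symbols)
    print(sample, "->", len(symbols), "byte symbols")
```

Because the base alphabet is complete, the BPE merges only decide how bytes are grouped, not whether a string can be represented at all.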


We will train the tokenizer with the following parameters:

- files = paths is the list of paths to the dataset files
- vocab_size = 52_000 is the size of the tokenizer's vocabulary
- min_frequency = 2 is the minimum number of times a pair must occur to be merged
- special_tokens = [] is the list of special tokens



In [None]:
%%time

# The special tokens are:
# <s>: a start token
# <pad>: a padding token
# </s>: an end token
# <unk>: an unknown token
# <mask>: the mask token for language modeling

from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

# Collect every .txt file under the current directory (here, kant.txt)
paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files = paths,
                vocab_size = 52_000,
                min_frequency = 2,
                special_tokens = ["<s>",
                                  "<pad>",
                                  "</s>",
                                  "<unk>",
                                  "<mask>"])

CPU times: user 11.2 s, sys: 2.74 s, total: 14 s
Wall time: 1.63 s
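
To make the training step less opaque, here is a toy sketch of how BPE learns merges. It works on characters rather than bytes and is not the `tokenizers` implementation; the function name `learn_bpe` and the `num_merges` budget (standing in for the vocabulary size limit) are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges, min_frequency=2):
    """Learn BPE merges from a list of words (toy sketch)."""
    # Represent each distinct word as a tuple of symbols with its count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair wins; ties are broken alphabetically.
        best, freq = min(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
        if freq < min_frequency:
            break  # no remaining pair is frequent enough to merge
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe(["small", "smaller", "smallest"] * 2, num_merges=4)
print(merges)
print(list(corpus))
```

After four merges on this tiny corpus, "small" has become a single symbol while the suffixes of "smaller" and "smallest" remain split, which mirrors the subword behavior described earlier.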


Our tokenizer is trained and ready to be saved.


Saving the files to disk


The tokenizer will generate two files when trained:

- merges.txt, which contains the merged tokenized substrings

- vocab.json, which contains the indices of the tokenized substrings

In [None]:
import os

token_dir = "KantaiBERT"

# Create the output directory if it does not exist yet
os.makedirs(token_dir, exist_ok=True)

# Write vocab.json and merges.txt into the directory
tokenizer.save_model(token_dir)

['KantaiBERT/vocab.json', 'KantaiBERT/merges.txt']
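
The two files work together at encoding time: merges.txt lists the merges in priority order, and vocab.json maps each resulting token string to an id. The sketch below shows that interplay with toy stand-ins (`toy_merges`, `toy_vocab`) rather than the real files; the function name `bpe_encode` is hypothetical and the logic is a simplified version of the usual BPE encoding loop.

```python
def bpe_encode(word, merges, vocab):
    """Split `word` into subwords using ranked merges, then map to ids."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Apply the earliest-learned merge that is present in the word.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, [vocab[s] for s in symbols]


# Toy stand-ins for the contents of merges.txt and vocab.json.
toy_merges = [("s", "m"), ("a", "l"), ("al", "l"), ("sm", "all"), ("e", "r")]
toy_vocab = {"s": 0, "m": 1, "a": 2, "l": 3, "e": 4, "r": 5, "t": 6,
             "sm": 7, "al": 8, "all": 9, "small": 10, "er": 11}

tokens, ids = bpe_encode("smaller", toy_merges, toy_vocab)
print(tokens, ids)
```

With these toy tables, "smaller" is encoded as the two subwords "small" and "er".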

Loading the trained tokenizer files

In [None]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer("./KantaiBERT/vocab.json",
                                  "./KantaiBERT/merges.txt",)


In [None]:
# Our tokenizer is able to encode a sequence

#Encoding a sequence
tokenizer.encode("The Critique of Pure Reason.").tokens

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
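
The 'Ġ' at the start of several tokens above stands for a leading space. Byte-level BPE remaps every byte to a printable character so that spaces and control bytes can be stored as ordinary vocabulary entries. The sketch below reproduces the byte-to-character table as it is commonly implemented for GPT-2; we assume ByteLevelBPETokenizer follows the same table, which is consistent with the output above.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes are shifted past U+00FF, in increasing order.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))


byte_to_char = bytes_to_unicode()
print(repr(byte_to_char[ord(" ")]))   # the character used for a space
print(repr(byte_to_char[ord("\n")]))  # the character used for a newline
```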

In [None]:
# We can also ask to see the number of tokens in this sequence

tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

Next, we configure the tokenizer's post-processor to fit the input format of this BERT model variant: it adds a start token (<s>) and an end token (</s>) around each sequence.

In [None]:
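# BertProcessing takes the end (separator) token first, then the start token
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

# Truncate sequences longer than 512 tokens
tokenizer.enable_truncation(max_length=512)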

In [None]:
# Let’s encode a post-processed sequence

tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [None]:
# If we want to see what was added, 
# we can ask the tokenizer to encode 
# the post-processed sequence

tokenizer.encode("The Critique of Pure Reason.").tokens

['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']
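
The effect of the post-processor and of truncation can be sketched in a few lines of plain Python. This is only an illustration: `post_process` is a hypothetical helper, and we assume that truncation reserves room for the two special tokens so that the final sequence never exceeds `max_length`.

```python
def post_process(tokens, max_length=512, cls="<s>", sep="</s>"):
    """Wrap tokens in start/end markers, truncating so the total fits."""
    room = max_length - 2  # reserve two slots for the special tokens
    return [cls] + tokens[:room] + [sep]


raw = ["The", "ĠCritique", "Ġof", "ĠPure", "ĠReason", "."]
print(post_process(raw))
print(len(post_process(["x"] * 1000)))
```

The six raw tokens become eight after post-processing, which matches the change in num_tokens shown above.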

In [None]:
!nvidia-smi

import torch
torch.cuda.is_available()

Tue Feb 28 21:36:00 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    50W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

True

# Defining Model configuration 

In [None]:
from transformers import RobertaConfig



config = RobertaConfig(vocab_size = 52_000,
                       
                       max_position_embeddings = 514,

                       num_attention_heads = 12,

                       num_hidden_layers = 6,

                       type_vocab_size = 1)
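
The value max_position_embeddings = 514 rather than 512 comes from how RoBERTa numbers positions: real tokens are counted starting at padding_idx + 1, and padding tokens keep the index padding_idx. The sketch below is a plain-Python rendering of that rule, assuming padding_idx = 1 as in the default RoBERTa configuration; the function name is hypothetical.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Real tokens count up from padding_idx + 1; padding stays at padding_idx."""
    position_ids, seen = [], 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


# A full sequence of 512 non-padding tokens: <s>, 510 arbitrary ids, </s>.
full = roberta_position_ids([0] + [5] * 510 + [2])
print(max(full) + 1)  # number of position-embedding rows required
```

A 512-token sequence reaches position index 513, so the embedding table needs 514 rows.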
                      

Reloading the tokenizer in transformers

In [None]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT",
                                             max_length = 512)

# Initializing a model from scratch

In [None]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config = config)
print(model)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

Exploring model parameters

In [None]:
print('The model is small and contains',
      model.num_parameters(),
      'parameters')

The model is small and contains 83504416 parameters


In [None]:
# calculating the length of the list of parameters

LP = list(model.parameters())

lp=len(LP)
print(lp)

106


In [None]:
# Displaying the 106 matrices and vectors 
# in the tensors that contain them

#for p in range(0,lp):
  #print(LP[p])


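The total number of parameters is the sum of the sizes of all the tensors in the model; for example:
- The word embedding matrix: vocabulary size (52,000) x hidden dimensions (768)
- Vectors such as biases and layer-normalization weights: 1 x 768
- The remaining weight matrices, such as the 768 x 768 attention projections and the 3072 x 768 feed-forward layers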
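
The same total can be derived by hand from the configuration. The sketch below assumes a hidden size of 768 and a feed-forward size of 3072 (both visible in the printed model and tensor sizes), and assumes that the language-modeling head shares its output weight matrix with the word embeddings, so that matrix is counted once. Variable names are ours.

```python
vocab_size, max_positions, type_vocab = 52_000, 514, 1
hidden, feed_forward_size, num_layers = 768, 3072, 6


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # weight vector + bias vector

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output
feed_forward = (linear(hidden, feed_forward_size)
                + linear(feed_forward_size, hidden)
                + layer_norm)
encoder = num_layers * (attention + feed_forward)
# LM head: dense layer, layer norm, and one bias per vocabulary entry
# (the output weight matrix is assumed tied to the word embeddings).
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + encoder + lm_head
num_tensors = 5 + num_layers * 16 + 5
print(total, num_tensors)
```

Under these assumptions the arithmetic gives 83,504,416 parameters spread over 106 tensors, in agreement with the values reported by the model.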

In [None]:
# Counting the parameters

np = 0  # running total (a plain counter, not NumPy)

for p in range(0, lp):  # number of tensors

  PL2 = True

  try:
    L2 = len(LP[p][0])  # succeeds if the tensor is 2D

  except TypeError:
    L2 = 1              # not 2D but 1D
    PL2 = False

  L1 = len(LP[p])
  L3 = L1 * L2
  np += L3              # number of parameters per tensor

  if PL2:
    print(p, L1, L2, L3)  # displaying the sizes of the parameters
  else:
    print(p, L1, L3)      # displaying the sizes of the parameters


print(np)              # total number of parameters

0 52000 768 39936000
1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768
11 768 768 589824
12 768 768
13 768 768
14 768 768
15 3072 768 2359296
16 3072 3072
17 768 3072 2359296
18 768 768
19 768 768
20 768 768
21 768 768 589824
22 768 768
23 768 768 589824
24 768 768
25 768 768 589824
26 768 768
27 768 768 589824
28 768 768
29 768 768
30 768 768
31 3072 768 2359296
32 3072 3072
33 768 3072 2359296
34 768 768
35 768 768
36 768 768
37 768 768 589824
38 768 768
39 768 768 589824
40 768 768
41 768 768 589824
42 768 768
43 768 768 589824
44 768 768
45 768 768
46 768 768
47 3072 768 2359296
48 3072 3072
49 768 3072 2359296
50 768 768
51 768 768
52 768 768
53 768 768 589824
54 768 768
55 768 768 589824
56 768 768
57 768 768 589824
58 768 768
59 768 768 589824
60 768 768
61 768 768
62 768 768
63 3072 768 2359296
64 3072 3072
65 768 3072 2359296
66 768 768
67 768 768
68 768 768
69 768 768 589824
70 768 768
71 768 768

# Building the dataset

In [None]:
%%time
from transformers import LineByLineTextDataset


dataset = LineByLineTextDataset(tokenizer = tokenizer,
                                
                                file_path="./kant.txt",
                                
                                block_size = 128)

#from tokenizers import ByteLevelBPETokenizer

#tokenizer = ByteLevelBPETokenizer("path/to/vocab.json", "path/to/merges.txt")
#text = "This is an example text to tokenize."

# Tokenize the text
#encoding = tokenizer.encode(text)

# Print the token IDs and corresponding tokens
#for id, token in zip(encoding.ids, encoding.tokens):
    #print(f"{id}\t{token}")

     



CPU times: user 41.5 s, sys: 1.58 s, total: 43.1 s
Wall time: 41.9 s
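
Conceptually, a line-by-line dataset treats every non-blank line of the file as one training example and caps each example at block_size token ids. The sketch below mimics that behavior in plain Python with a stand-in encoder; `line_by_line` and `toy_encode` are hypothetical names, and slicing is a simplification of real truncation, which also accounts for special tokens.

```python
def line_by_line(text, encode, block_size=128):
    """Encode each non-blank line separately, capped at block_size ids."""
    lines = [ln for ln in text.splitlines() if ln and not ln.isspace()]
    return [encode(ln)[:block_size] for ln in lines]


def toy_encode(line):
    """Stand-in encoder: one id per whitespace-separated word (its length)."""
    return [len(word) for word in line.split()]


sample_text = "Pure reason.\n\n   \nPractical reason is free."
examples = line_by_line(sample_text, toy_encode, block_size=3)
print(examples)
```

Blank and whitespace-only lines are dropped, and the second example is cut to three ids.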


Defining a data collator

We now define a data collator, the object that builds the batches used for training and backpropagation.

A data collator takes samples from the dataset and collates them into batches.

We prepare batches for MLM by setting:
- mlm = True, which activates masked language modeling.
- mlm_probability = 0.15, the fraction of tokens masked during pretraining.


We will now initialize data_collator with our tokenizer, MLM activated, and the proportion of
masked tokens set to 15 percent.

In [None]:
# initialize data_collator with our tokenizer

from transformers import DataCollatorForLanguageModeling


data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer,
                                                
                                                mlm = True, 
                                                
                                                mlm_probability = 0.15)
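
To show what MLM masking does to a batch, here is a plain-Python sketch of the usual BERT-style recipe: each non-special token is selected with probability 0.15; of the selected tokens, about 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. Labels are -100 (ignored by the loss) everywhere except at selected positions. The function name and the special token ids (0 to 4, following the order in which we declared the special tokens) are assumptions.

```python
import random


def mask_tokens(ids, mask_id, vocab_size, special_ids, p=0.15, rng=random):
    """Return (inputs, labels) following the 80/10/10 masking recipe."""
    inputs, labels = list(ids), [-100] * len(ids)  # -100 = ignored by the loss
    for i, tok in enumerate(ids):
        if tok in special_ids or rng.random() >= p:
            continue                               # not selected for prediction
        labels[i] = tok                            # the model must recover this id
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


rng = random.Random(0)
ids = [0] + [rng.randrange(5, 52_000) for _ in range(126)] + [2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2}, rng=rng)
selected = sum(label != -100 for label in labels)
print(f"{selected} of {len(ids)} positions selected for prediction")
```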

# Initializing the trainer


In [None]:
from transformers import Trainer, TrainingArguments


training_args = TrainingArguments(output_dir="./KantaiBERT",
                                  
                                  overwrite_output_dir = True,

                                  num_train_epochs = 1,

                                  per_device_train_batch_size = 64,

                                  save_steps = 10_000,

                                  save_total_limit = 2)


In [None]:
trainer = Trainer(model = model,
                  
                  args = training_args,

                  data_collator = data_collator,
                  
                  train_dataset = dataset)
   

Pretraining the model

In [None]:
%%time
trainer.train()



Step,Training Loss
500,6.6098
1000,5.7436
1500,5.2647
2000,5.0066
2500,4.8541


CPU times: user 3min 5s, sys: 1.27 s, total: 3min 6s
Wall time: 3min 7s


TrainOutput(global_step=2672, training_loss=5.451993177037039, metrics={'train_runtime': 187.4363, 'train_samples_per_second': 912.118, 'train_steps_per_second': 14.256, 'total_flos': 873620128952064.0, 'train_loss': 5.451993177037039, 'epoch': 1.0})
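
The reported global_step can be cross-checked against the other metrics. The sketch below infers the number of training samples from the runtime and throughput in the TrainOutput above (the sample count itself is not logged, so it is an estimate) and assumes a single GPU, so that the effective batch size equals per_device_train_batch_size.

```python
import math

train_runtime = 187.4363      # seconds, from TrainOutput
samples_per_second = 912.118  # from TrainOutput
batch_size = 64               # per_device_train_batch_size, one GPU assumed

# Inferred from the metrics above, not logged directly.
num_samples = round(train_runtime * samples_per_second)

# One optimizer step per batch; the last batch may be partial.
steps = math.ceil(num_samples / batch_size)
print(num_samples, steps)
```

The estimate of about 170,964 samples gives 2,672 steps for one epoch, matching global_step.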

Saving the final model (+tokenizer + config) to disk

In [None]:
trainer.save_model("./KantaiBERT")

# Language modeling with FillMaskPipeline


In [None]:
from transformers import pipeline


fill_mask = pipeline("fill-mask",
                     
                     model="./KantaiBERT",
                     
                     tokenizer="./KantaiBERT")

In [None]:
fill_mask("Father, mother and <mask>")

[{'score': 0.06342872977256775,
  'token': 18,
  'token_str': '.',
  'sequence': 'Father, mother and.'},
 {'score': 0.014288828708231449,
  'token': 267,
  'token_str': ' the',
  'sequence': 'Father, mother and the'},
 {'score': 0.009417579509317875,
  'token': 396,
  'token_str': ' are',
  'sequence': 'Father, mother and are'},
 {'score': 0.007689234334975481,
  'token': 650,
  'token_str': ' without',
  'sequence': 'Father, mother and without'},
 {'score': 0.00540144881233573,
  'token': 288,
  'token_str': ' to',
  'sequence': 'Father, mother and to'}]
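
Each score in the output above is a probability. As a rough sketch of where such scores come from, the model produces one logit per vocabulary entry at the masked position, a softmax turns the logits into probabilities, and the highest-ranked entries are returned. The logits below are made-up values, not model output, and `top_k_softmax` is a hypothetical helper.

```python
import math


def top_k_softmax(logits, k=5):
    """Softmax over vocabulary logits, then return the k best (id, prob)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


toy_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0]  # made-up values
best = top_k_softmax(toy_logits, k=3)
for token_id, prob in best:
    print(token_id, round(prob, 4))
```

With a 52,000-entry vocabulary and only one epoch of training, the probability mass is spread thinly, which is why even the top prediction above scores only about 0.06.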

We built KantaiBERT, a RoBERTa-like transformer model, from scratch using the building blocks provided by Hugging Face.