## Pretraining a RoBERTa Model from Scratch

Transformer models are usually built in two steps:
* pretraining, and
* fine-tuning.

For most tasks, we load a pretrained model and fine-tune it for our use case. Here we instead pretrain a transformer from scratch on a masked language modeling (MLM) task, using the works of Immanuel Kant as the training corpus.

For this task, we will follow a 15-step process.

#### Step 1: Loading The Dataset

The dataset for this work is a compilation of the following three books by Immanuel Kant into a single text file named kant.txt:
 * The Critique of Pure Reason
 * The Critique of Practical Reason
 * Fundamental Principles of the Metaphysic of Morals
 
 It is available in the book's GitHub repository.

In [None]:
#@title Step 1: Loading the Dataset
!curl -L https://raw.githubusercontent.com/PacktPublishing/Transformers-for-Natural-Language-Processing/master/Chapter03/kant.txt --output "kant.txt"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.7M  100 10.7M    0     0  24.2M      0 --:--:-- --:--:-- --:--:-- 24.2M


#### Step 2: Installing Hugging Face transformers

In [None]:
#@title Step 2:Installing Hugging Face Transformers
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-wh43tzr5
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-wh43tzr5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 5.2 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 61.8 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.

#### Step 3: Training a tokenizer
Here we train a tokenizer from scratch on our own data instead of using one of Hugging Face's pretrained tokenizers. Hugging Face's ByteLevelBPETokenizer() is trained on kant.txt.

In [None]:
# Step 3: Training a tokenizer
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

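paths = [str(x) for x in Path(".").glob("**/*.txt")]   # every .txt file under the current directory (here, kant.txt)
tokenizer = ByteLevelBPETokenizer()  # initialize an untrained tokenizer
tokenizer.train(files=paths, vocab_size=52_000,   # upper bound on the vocabulary size
                min_frequency=2,                  # a pair must occur at least twice to be merged
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])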
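
To make the training step less opaque, here is a minimal pure-Python sketch of the core idea behind byte-pair encoding: repeatedly merge the most frequent adjacent pair of symbols. It is not the `tokenizers` implementation; the toy corpus and the helper names `most_frequent_pair` and `merge_pair` are hypothetical and exist only for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word (as a tuple of characters) -> frequency
toy_words = {tuple("reason"): 5, tuple("reasons"): 2, tuple("pure"): 3}

merges = []
for _ in range(3):                       # learn at most three merges
    best = most_frequent_pair(toy_words)
    if best is None or best[1] < 2:      # same idea as min_frequency=2
        break
    merges.append(best[0])
    toy_words = merge_pair(toy_words, best[0])

print(merges[0])   # the first learned merge
```

In this toy corpus the pair `('r', 'e')` occurs most often (in all three words), so it is merged first. The real tokenizer works on bytes rather than characters and keeps merging until the vocabulary reaches `vocab_size`.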

#### Step 4: Saving the files to disk

The tokenizer generates two files, vocab.json and merges.txt; we save them to disk.

In [None]:
## save tokenizer generated files to disk
import os
token_dir = '/content/KantaiBERT' # path to save files
os.makedirs(token_dir, exist_ok=True)  # create the directory if it does not exist yet
tokenizer.save_model(token_dir)

['/content/KantaiBERT/vocab.json', '/content/KantaiBERT/merges.txt']

#### Step 5: Loading the trained tokenizer files

In [None]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer('/content/KantaiBERT/vocab.json', # reload the tokenizer from the saved files;
                                  '/content/KantaiBERT/merges.txt')  # BERT-style post-processing is added in a later cell

In [None]:
## test the tokenizer 
print(tokenizer.encode('The Critique of Pure Reason.'))
print(tokenizer.encode('The Critique of Pure Reason.').tokens)

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
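
The `Ġ` in the tokens above is how byte-level BPE displays a leading space. The sketch below reproduces the byte-to-character table popularized by GPT-2, which, as far as the output above suggests, is the convention this tokenizer follows. The function name `bytes_to_unicode` is borrowed from the GPT-2 code base and is an assumption here, not something defined in this notebook.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every other byte (space, control characters, ...) is shifted past U+00FF.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

mapping = bytes_to_unicode()
visible = "".join(mapping[b] for b in "The Critique".encode("utf-8"))
print(visible)  # TheĠCritique
```

The space is byte 32, the 33rd non-printable byte, so it lands on code point 256 + 32 = 288, which is `Ġ`.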


In [None]:
## add BERT-style post-processing: wrap every sequence in start and end tokens
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),   # sep: marks the end of a sequence
    ("<s>", tokenizer.token_to_id("<s>")),)    # cls: marks the start of a sequence

tokenizer.enable_truncation(max_length=512)  # truncate sequences to at most 512 tokens
# test post-processing
tokenizer.encode("The Critique of Pure Reason.").tokens

['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']
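
As a rough mental model of the two settings above, the following sketch wraps a token list in `<s>` and `</s>` and truncates it. The helper `wrap_and_truncate` is hypothetical, and it assumes the length budget includes the two special tokens; it illustrates the effect, not the library's internal code.

```python
def wrap_and_truncate(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Truncate so the result, including both special tokens, fits max_length."""
    budget = max_length - 2   # assumption: two slots are reserved for <s> and </s>
    return [bos] + tokens[:budget] + [eos]

tokens = ['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
wrapped = wrap_and_truncate(tokens)               # short input: nothing is cut
short = wrap_and_truncate(tokens, max_length=5)   # forces truncation
print(wrapped)
print(short)
```

With `max_length=5` only the first three tokens survive, but the sequence still starts with `<s>` and ends with `</s>`.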

#### Step 6: Checking resource constraints: GPU and CUDA

In [None]:
import torch
torch.cuda.is_available()

True

#### Step 7: Defining the configuration of the model

We will pretrain a RoBERTa-type transformer with a DistilBERT-like configuration (6 layers, 12 attention heads).

In [None]:
from transformers import RobertaConfig
config = RobertaConfig(vocab_size=52_000,
                       max_position_embeddings=514,
                       num_attention_heads=12,
                       num_hidden_layers=6,
                       type_vocab_size=1,)
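
Two numbers in this configuration follow from simple arithmetic, sketched below. The hidden size of 768 is not set explicitly; it is taken from the model printout in Step 9, and the variable names are only illustrative.

```python
hidden_size = 768            # not set above; value shown in the printed model
num_attention_heads = 12
max_sequence_length = 512    # truncation length used for the tokenizer
padding_idx = 1              # id of <pad>, the second special token

# Each head works on an equal slice of the hidden vector.
assert hidden_size % num_attention_heads == 0
head_dim = hidden_size // num_attention_heads

# RoBERTa position ids start at padding_idx + 1, so the position table
# needs padding_idx + 1 more rows than the longest sequence.
position_rows = max_sequence_length + padding_idx + 1

print(head_dim, position_rows)  # 64 514
```

This is why `max_position_embeddings` is 514 rather than 512.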

#### Step 8: Reloading the tokenizer in transformers

In [None]:
## loading the tokenizer
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT", max_length=512)

#### Step 9: Initializing a model from scratch

In [None]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config) # initialize RoBERTa with the configuration above
print(model)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [None]:
## exploring the model parameters
print(model.num_parameters())
LP = list(model.parameters())  ## see the parameters matrices and vectors
lp = len(LP)
print(lp)
for p in range(0, lp):
  print(LP[p])

83504416
106
Parameter containing:
tensor([[-0.0268,  0.0231, -0.0161,  ..., -0.0100, -0.0145, -0.0016],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0063,  0.0428, -0.0214,  ..., -0.0373,  0.0047,  0.0108],
        ...,
        [-0.0010, -0.0481, -0.0087,  ..., -0.0037, -0.0063, -0.0204],
        [ 0.0026,  0.0083,  0.0341,  ..., -0.0131, -0.0028,  0.0409],
        [-0.0323, -0.0173, -0.0170,  ..., -0.0095, -0.0177, -0.0226]],
       requires_grad=True)
Parameter containing:
tensor([[ 0.0385,  0.0049,  0.0337,  ..., -0.0010,  0.0087, -0.0188],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0015,  0.0165,  0.0126,  ...,  0.0033,  0.0114,  0.0097],
        ...,
        [ 0.0276,  0.0122,  0.0071,  ..., -0.0103,  0.0180, -0.0052],
        [ 0.0302, -0.0234,  0.0140,  ..., -0.0240,  0.0036,  0.0318],
        [-0.0161, -0.0027, -0.0105,  ...,  0.0005, -0.0002,  0.0158]],
       requires_grad=True)
Parameter containing:

In [None]:
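# counting number of parameters and dimension
n_params = 0  # running total (avoids the name `np`, which usually refers to NumPy)
for p in range(0, lp):
  L1 = len(LP[p])        # size of the first dimension
  if LP[p].dim() > 1:    # 2-D weight matrix
    L2 = len(LP[p][0])   # size of the second dimension
    L3 = L1 * L2
    print(p, L1, L2, L3) # displaying the sizes of the parameters
  else:                  # 1-D bias or LayerNorm vector
    L3 = L1
    print(p, L1, L3)     # displaying the sizes of the parameters
  n_params += L3
print(n_params)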

0 52000 768 39936000
1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768
11 768 768 589824
12 768 768
13 768 768
14 768 768
15 3072 768 2359296
16 3072 3072
17 768 3072 2359296
18 768 768
19 768 768
20 768 768
21 768 768 589824
22 768 768
23 768 768 589824
24 768 768
25 768 768 589824
26 768 768
27 768 768 589824
28 768 768
29 768 768
30 768 768
31 3072 768 2359296
32 3072 3072
33 768 3072 2359296
34 768 768
35 768 768
36 768 768
37 768 768 589824
38 768 768
39 768 768 589824
40 768 768
41 768 768 589824
42 768 768
43 768 768 589824
44 768 768
45 768 768
46 768 768
47 3072 768 2359296
48 3072 3072
49 768 3072 2359296
50 768 768
51 768 768
52 768 768
53 768 768 589824
54 768 768
55 768 768 589824
56 768 768
57 768 768 589824
58 768 768
59 768 768 589824
60 768 768
61 768 768
62 768 768
63 3072 768 2359296
64 3072 3072
65 768 3072 2359296
66 768 768
67 768 768
68 768 768
69 768 768 589824
70 768 768
71 768 768
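
The total of 83,504,416 reported by `model.num_parameters()` can be re-derived by hand from the configuration. The sketch below does this with plain arithmetic; the helper `linear` and the grouping into blocks are illustrative, and it assumes the output projection of the language-model head shares its weights with the word embeddings, so only its bias is counted.

```python
vocab_size, hidden, max_pos, type_vocab = 52_000, 768, 514, 1
intermediate, num_layers = 3_072, 6

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden   # one scale and one shift vector

# Word, position and token-type embeddings, followed by a LayerNorm
embeddings = (vocab_size + max_pos + type_vocab) * hidden + layer_norm

# One encoder layer: query, key, value, output projections + feed-forward
attention = 4 * linear(hidden, hidden) + layer_norm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = num_layers * (attention + feed_forward)

# LM head: dense layer, LayerNorm, and a vocabulary-sized bias
# (assumption: decoder weights are tied to the word embeddings)
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + encoder + lm_head
print(total)  # 83504416
```

Almost half of the parameters (39,936,000) sit in the word-embedding matrix, which is a direct consequence of the 52,000-token vocabulary.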

#### Step 10: Building the dataset

In [None]:
## loading the data as a sequence line by line
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(tokenizer = tokenizer, file_path ='/content/kant.txt',
                                block_size = 128)
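
Conceptually, a line-by-line dataset treats every non-blank line of the file as one training example and truncates it to `block_size` tokens. The sketch below mimics that behavior with whitespace splitting as a stand-in for the real tokenizer; `line_by_line_examples` and the sample text are hypothetical.

```python
def line_by_line_examples(text, block_size, tokenize=str.split):
    """Turn each non-blank line into one example of at most block_size tokens."""
    examples = []
    for line in text.splitlines():
        if not line.strip():      # skip empty and whitespace-only lines
            continue
        examples.append(tokenize(line)[:block_size])
    return examples

sample = "Pure reason is a unity.\n\nThe moral law commands.\n"
examples = line_by_line_examples(sample, block_size=4)
print(examples)
```

The first line has five words and is cut to four; the blank line produces no example at all.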



#### Step 11: Defining a data collator

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer,
                                                mlm = True, mlm_probability=0.15) # initialize collator
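
The collator decides, batch by batch, which tokens the model has to predict. A minimal pure-Python sketch of that masking scheme follows, assuming the usual BERT-style split: of the selected tokens, about 80% become `<mask>`, 10% become a random token, and 10% stay unchanged. The function `mask_tokens` and the toy ids are hypothetical; the label value -100 is the index that the loss ignores.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)          # -100: position ignored by the loss
    for i, tok in enumerate(token_ids):
        if tok in special_ids:                # never mask <s>, </s>, <pad>, ...
            continue
        if rng.random() >= mlm_probability:   # token not selected
            continue
        labels[i] = tok                       # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id               # replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in place
    return inputs, labels

ids = [0] + list(range(100, 140)) + [2]       # toy sequence wrapped in <s> ... </s>
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2, 3, 4})
print(sum(label != -100 for label in labels), "of", len(ids), "tokens selected")
```

Because the selection is random, the model sees a different masking pattern for the same sentence in every epoch.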

#### Step 12: Initializing the trainer

In [None]:
# initialize trainer and train_args
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="./KantaiBERT",
                                  overwrite_output_dir=True,
                                  num_train_epochs=1,
                                  per_device_train_batch_size=64,
                                  save_steps=10_000,
                                  save_total_limit=2,)

trainer = Trainer(model=model,
                  args=training_args,
                  data_collator=data_collator,
                  train_dataset=dataset,
                  )

#### Step 13: Pretraining the model

In [None]:
%%time
trainer.train()

***** Running training *****
  Num examples = 170964
  Num Epochs = 1
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 2672


Step,Training Loss
500,6.5956
1000,5.747
1500,5.2719
2000,5.0058
2500,4.8537




Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 9min 26s, sys: 2.57 s, total: 9min 29s
Wall time: 9min 32s


TrainOutput(global_step=2672, training_loss=5.450884950375128, metrics={'train_runtime': 572.7854, 'train_samples_per_second': 298.478, 'train_steps_per_second': 4.665, 'total_flos': 873620128952064.0, 'train_loss': 5.450884950375128, 'epoch': 1.0})
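
The figures in the training log can be cross-checked with a few lines of arithmetic. The sketch below assumes a single device and no gradient accumulation, which matches the log above; the variable names are illustrative.

```python
import math

num_examples = 170_964       # "Num examples" in the log
batch_size = 64              # per_device_train_batch_size
num_epochs = 1
train_runtime_s = 572.7854   # "train_runtime" in TrainOutput

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(total_steps)                    # 2672
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
```

The step count agrees with "Total optimization steps = 2672", and the two throughput values agree with the reported `train_samples_per_second` and `train_steps_per_second`.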

#### Step 14: Saving the final model (+tokenizer +config) to disk

In [None]:
trainer.save_model("./KantaiBERT")

Saving model checkpoint to ./KantaiBERT
Configuration saved in ./KantaiBERT/config.json
Model weights saved in ./KantaiBERT/pytorch_model.bin


#### Step 15: Language modeling with FillMaskPipeline
Import the pipeline for masked language modeling and query the trained model.

In [None]:
from transformers import pipeline
fill_mask = pipeline("fill-mask",model="./KantaiBERT",
                     tokenizer="./KantaiBERT")
fill_mask("Human thinking involves human <mask>.")

loading configuration file ./KantaiBERT/config.json
Model config RobertaConfig {
  "_name_or_path": "./KantaiBERT",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.21.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

[{'score': 0.04251458868384361,
  'sequence': 'Human thinking involves human reason.',
  'token': 393,
  'token_str': ' reason'},
 {'score': 0.012001300230622292,
  'sequence': 'Human thinking involves human understanding.',
  'token': 600,
  'token_str': ' understanding'},
 {'score': 0.011809378862380981,
  'sequence': 'Human thinking involves human conceptions.',
  'token': 605,
  'token_str': ' conceptions'},
 {'score': 0.010143973864614964,
  'sequence': 'Human thinking involves human law.',
  'token': 446,
  'token_str': ' law'},
 {'score': 0.009305383078753948,
  'sequence': 'Human thinking involves human existence.',
  'token': 604,
  'token_str': ' existence'}]
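
Each `score` above is the probability the model assigns to a candidate token at the masked position. The sketch below shows the underlying computation, a softmax over logits followed by a top-k selection, on a tiny made-up vocabulary. The logit values and the helper `top_k_softmax` are hypothetical; they are not taken from the trained model.

```python
import math

def top_k_softmax(logits, k):
    """Convert logits to probabilities and return the k most likely tokens."""
    m = max(logits.values())                      # subtract the max for stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical logits for the <mask> position over a four-token vocabulary
toy_logits = {" reason": 2.0, " understanding": 0.7, " law": 0.5, " table": -1.0}
top = top_k_softmax(toy_logits, k=2)
for token, prob in top:
    print(repr(token), round(prob, 3))
```

In the real output the five reported scores sum to only about 0.086, so after one epoch the model still spreads most of its probability mass over other tokens in the 52,000-word vocabulary.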