# BERT from scratch

This content is loosely based on James Briggs' [tutorial](https://www.kdnuggets.com/2021/08/train-bert-model-scratch.html) "How to Train a BERT Model From Scratch".

Differently from the original tutorial, the latin language dataset is used - not the best choice for accuracy, but it is a small dataset and evaluations come easy.

## Getting the data

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.11.0-py3-none-any.whl (264 kB)
[K     |████████████████████████████████| 264 kB 8.3 MB/s 
Collecting huggingface-hub<0.1.0
  Downloading huggingface_hub-0.0.15-py3-none-any.whl (43 kB)
[K     |████████████████████████████████| 43 kB 2.4 MB/s 
[?25hCollecting fsspec>=2021.05.0
  Downloading fsspec-2021.7.0-py3-none-any.whl (118 kB)
[K     |████████████████████████████████| 118 kB 74.6 MB/s 
[?25hCollecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
[K     |████████████████████████████████| 243 kB 67.8 MB/s 
Installing collected packages: xxhash, huggingface-hub, fsspec, datasets
Successfully installed datasets-1.11.0 fsspec-2021.7.0 huggingface-hub-0.0.15 xxhash-2.0.2


In [2]:
import datasets

In [3]:
all_ds = datasets.list_datasets()
all_ds[:5] 

['acronym_identification',
 'ade_corpus_v2',
 'adversarial_qa',
 'aeslc',
 'afrikaans_ner_corpus']

In [4]:
'oscar' in all_ds

True

In [5]:
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_la')

Downloading:   0%|          | 0.00/5.58k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/359k [00:00<?, ?B/s]

Downloading and preparing dataset oscar/unshuffled_deduplicated_la (download: 3.26 MiB, generated: 8.46 MiB, post-processed: Unknown size, total: 11.72 MiB) to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_la/1.0.0/84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2...


Downloading:   0%|          | 0.00/81.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.42M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

Dataset oscar downloaded and prepared to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_la/1.0.0/84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2. Subsequent calls will reuse this data.


In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text'],
        num_rows: 18808
    })
})

In [7]:
dataset['train'][0] 

{'id': 0,
 'text': 'Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\nEcce ego adducam aquas diluvii super terram, ut interficiam omnem carnem, in qua spiritus vitæ est subter cælum: universa quæ in terra sunt, consumentur.\nTolles igitur tecum ex omnibus escis, quæ mandi possunt, et comportabis apud te: et erunt tam tibi, quam illis in cibum.'}

In [8]:
from tqdm.auto import tqdm

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 6_000:
        # once we git the 6K mark, save to file
        with open(f'oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1
# after saving in 6K chunks, we will have ~808 leftover samples, we save those now too
with open(f'oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))

  0%|          | 0/18808 [00:00<?, ?it/s]

## Building a tokenizer

In [9]:
from pathlib import Path
paths = [str(x) for x in Path('oscar_la').glob('**/*.txt')] 

In [10]:
paths

['oscar_la/text_2.txt',
 'oscar_la/text_0.txt',
 'oscar_la/text_1.txt',
 'oscar_la/text_3.txt']

In [11]:
!pip install transformers 

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
[K     |████████████████████████████████| 2.6 MB 6.9 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 25.0 MB/s 
[?25hCollecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
[K     |████████████████████████████████| 636 kB 58.1 MB/s 
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 62.1 MB/s 
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninsta

In [12]:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer() 

In [13]:
tokenizer.train(files=paths, 
                vocab_size=30_522,
                min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>']) 

In [14]:
import os

os.mkdir('./liberto')

tokenizer.save_model('liberto') 

['liberto/vocab.json', 'liberto/merges.txt']

## Initializing the tokenizer

In [15]:
from transformers import RobertaTokenizer

# initialize the tokenizer using the tokenizer we initialized and saved to file
tokenizer = RobertaTokenizer.from_pretrained('liberto', max_len=512) 

file liberto/config.json not found


In [16]:
# test our tokenizer on a simple sentence
tokens = tokenizer('quo vadis?') 

In [17]:
tokens

{'input_ids': [0, 3106, 14116, 35, 2], 'attention_mask': [1, 1, 1, 1, 1]}

In [18]:
tokens.input_ids

[0, 3106, 14116, 35, 2]

## Creating the Input Pipeline

## Preparing the data

In [19]:
with open('oscar_la/text_0.txt', 'r', encoding='utf-8') as fp:
    lines = fp.read().split('\n') 

In [20]:
lines[0] 

'Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.Ecce ego adducam aquas diluvii super terram, ut interficiam omnem carnem, in qua spiritus vitæ est subter cælum: universa quæ in terra sunt, consumentur.Tolles igitur tecum ex omnibus escis, quæ mandi possunt, et comportabis apud te: et erunt tam tibi, quam illis in cibum.'

In [21]:
batch = tokenizer(lines, max_length=512, padding='max_length', truncation=True)
len(batch) 

2

In [22]:
for x in batch['input_ids']:
    print(x)
    break

[0, 44, 836, 337, 7597, 21560, 30, 21560, 609, 14600, 545, 9976, 517, 285, 16827, 1490, 31, 342, 1149, 15969, 18, 5436, 636, 10902, 4973, 12302, 761, 1516, 16, 329, 10904, 1458, 5203, 16, 285, 503, 3658, 9917, 297, 9259, 19092, 30, 3481, 1673, 285, 1127, 337, 16, 16795, 18, 56, 20711, 796, 2007, 349, 837, 14882, 16, 1673, 16329, 884, 16, 290, 26857, 494, 486, 30, 290, 1933, 508, 591, 16, 350, 1144, 285, 4729, 18, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [23]:
import torch

labels = torch.tensor([x for x in batch['input_ids']])
mask = torch.tensor([x for x in batch['attention_mask']]) 

In [24]:
labels

tensor([[    0,    44,   836,  ...,     1,     1,     1],
        [    0, 10046,   411,  ...,     1,     1,     1],
        [    0, 14774,  1153,  ...,     1,     1,     1],
        ...,
        [    0,  1323,  3128,  ...,     1,     1,     1],
        [    0, 13530, 14249,  ...,     1,     1,     1],
        [    0,  2807,   411,  ...,     1,     1,     1]])

In [25]:
# make copy of labels tensor, this will be input_ids
input_ids = labels.detach().clone()
# create random array of floats with equal dims to input_ids
rand = torch.rand(input_ids.shape)
# mask random 15% where token is not 0 [PAD], 1 [CLS], or 2 [SEP]
# mask_arr = (rand < .15) * (input_ids != 0) * (input_ids != 1) * (input_ids != 2)
mask_arr = (rand < .15) * (input_ids > 2) 
# loop through each row in input_ids tensor (cannot do in parallel)
for i in range(input_ids.shape[0]):
    # get indices of mask positions from mask array
    selection = torch.flatten(mask_arr[i].nonzero()).tolist()
    # mask input_ids
    input_ids[i, selection] = 3  # our custom [MASK] token == 3 

In [26]:
input_ids.shape

torch.Size([6000, 512])

In [27]:
input_ids[0][:200] 

tensor([    0,    44,   836,   337,     3,     3,    30,     3,   609, 14600,
            3,  9976,   517,   285, 16827,     3,    31,     3,  1149, 15969,
           18,  5436,   636, 10902,  4973, 12302,   761,  1516,     3,   329,
        10904,  1458,  5203,    16,   285,   503,     3,  9917,     3,     3,
        19092,    30,  3481,  1673,   285,     3,     3,    16,     3,    18,
           56, 20711,   796,     3,   349,   837,     3,    16,     3, 16329,
          884,    16,   290, 26857,   494,   486,    30,   290,  1933,   508,
          591,    16,   350,  1144,   285,  4729,    18,     2,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1, 

In [57]:
encodings = {'input_ids': input_ids, 'attention_mask': mask, 'labels': labels} 

In [58]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # store encodings internally
        self.encodings = encodings

    def __len__(self):
        # return the number of samples
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        # return dictionary of input_ids, attention_mask, and labels for index i
        return {key: tensor[i] for key, tensor in self.encodings.items()}

In [59]:
dataset = Dataset(encodings) 

In [60]:
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True) 

## Training the model

## Initializing the model 

In [61]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=30_522,  # we align this to the tokenizer vocab_size
    max_position_embeddings=514,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1
    ) 

In [62]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config) 

## Training Preparation 

In [63]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# and move our model over to the selected device
model.to(device) 

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [64]:
from transformers import AdamW

# activate training mode
model.train()
# initialize optimizer
optim = AdamW(model.parameters(), lr=1e-4)

## Training 

In [65]:
epochs = 2

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # initialize calculated gradients (from prev step)
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # process
        outputs = model(input_ids, attention_mask=attention_mask,
                        labels=labels)
        # extract loss
        loss = outputs.loss
        # calculate loss for every parameter that needs grad update
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item()) 

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

In [66]:
model.save_pretrained('./liberto')  # and don't forget to save liBERTo!

## The Real Test 

In [67]:
from transformers import pipeline

In [68]:
fill = pipeline('fill-mask', model='liberto', tokenizer='liberto')

In [74]:
fill(f'abundans {fill.tokenizer.mask_token} non nocet ') # abundans cautela non nocet

[{'score': 0.0004939708742313087,
  'sequence': 'abundans qui non nocet ',
  'token': 365,
  'token_str': ' qui'},
 {'score': 0.00044498726492747664,
  'sequence': 'abundans, non nocet ',
  'token': 16,
  'token_str': ','},
 {'score': 0.0003394597733858973,
  'sequence': 'abundans vel non nocet ',
  'token': 449,
  'token_str': ' vel'},
 {'score': 0.00030346045969054103,
  'sequence': 'abundans. non nocet ',
  'token': 18,
  'token_str': '.'},
 {'score': 0.00024564063642174006,
  'sequence': 'abundans expressit non nocet ',
  'token': 29638,
  'token_str': ' expressit'}]

In [75]:
fill(f'quod {fill.tokenizer.mask_token} demonstrandum') # quod erat demonstrandum 

[{'score': 0.00040464798803441226,
  'sequence': 'quod qui demonstrandum',
  'token': 365,
  'token_str': ' qui'},
 {'score': 0.00034588476410135627,
  'sequence': 'quod, demonstrandum',
  'token': 16,
  'token_str': ','},
 {'score': 0.00031329740886576474,
  'sequence': 'quod vel demonstrandum',
  'token': 449,
  'token_str': ' vel'},
 {'score': 0.0002996937255375087,
  'sequence': 'quod deserv demonstrandum',
  'token': 18483,
  'token_str': ' deserv'},
 {'score': 0.0002783830277621746,
  'sequence': 'quod. demonstrandum',
  'token': 18,
  'token_str': '.'}]