### Nemanja Petrovic SIR 1 - SRBerta pre-training

## OSCAR Dataset for Serbian language

We download the Serbian portion of the OSCAR corpus from Hugging Face. Once the data is downloaded, we remove all newlines from each document and split the data into chunks of 10,000 documents, one file per chunk.

In [1]:
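import os

from datasets import load_dataset

# Read the Hugging Face access token from the environment instead of hard-coding it.
# HF_TOKEN is an assumed variable name; set it before running this cell.
hugging_face_token = os.environ["HF_TOKEN"]
dataset = load_dataset("oscar-corpus/OSCAR-2301",
                       cache_dir="dataset_cache",
                       use_auth_token=hugging_face_token,
                       language="sr",
                       streaming=False)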

print(dataset)

# Strip newlines from every document and write the documents to files of 10,000 lines each
print("STARTED WRITING DATA TO FILES")
from tqdm.auto import tqdm
text_data = []
file_count = 0

for sample in tqdm(dataset['train']):

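    # Remove newlines so that each document occupies exactly one line in the output file.
    # Note: the words on either side of a removed newline are joined without a space.
    text = sample['text'].replace('\n', '')
    text_data.append(text)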

    if len(text_data) == 10_000:
        with open(f'sr_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# After saving the full 10K chunks, write the leftover documents to a final file
with open(f'sr_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))

print("FINISHED WRITING DATA TO FILES")

Downloading and preparing dataset oscar-2301/sr to C:/Users/HP1/Documents/Nemanja/SRBerta-pretrain/dataset_cache/oscar-corpus___oscar-2301/sr-language=sr/0.0.0/156efb8ba9f439f881d8f41fd7fddd5e04604bc27505c46ddef015f2fc551a4a...


Dataset oscar-2301 downloaded and prepared to C:/Users/HP1/Documents/Nemanja/SRBerta-pretrain/dataset_cache/oscar-corpus___oscar-2301/sr-language=sr/0.0.0/156efb8ba9f439f881d8f41fd7fddd5e04604bc27505c46ddef015f2fc551a4a. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'meta'],
        num_rows: 838948
    })
})
STARTED WRITING DATA TO FILES


FINISHED WRITING DATA TO FILES
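
The chunking step above can also be sketched as a small reusable function. This is only an illustration, not the code that produced the files in this notebook: the name `write_chunks` and its parameters are hypothetical, it collapses whitespace to single spaces (instead of deleting newlines outright, which glues neighbouring words together), and it skips the final file when there are no leftovers.

```python
from pathlib import Path
from typing import Iterable, List


def write_chunks(texts: Iterable[str], out_dir: str,
                 chunk_size: int = 10_000, prefix: str = "sr") -> List[Path]:
    """Write documents one per line into files of at most `chunk_size` lines."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    buffer: List[str] = []

    def flush() -> None:
        # File index equals the number of files written so far: sr_0.txt, sr_1.txt, ...
        path = out / f"{prefix}_{len(written)}.txt"
        path.write_text("\n".join(buffer), encoding="utf-8")
        written.append(path)
        buffer.clear()

    for text in texts:
        # Collapse every whitespace run (newlines included) to a single space,
        # so each document stays on one line and words are not merged.
        buffer.append(" ".join(text.split()))
        if len(buffer) == chunk_size:
            flush()

    # Write the leftovers only if there are any, to avoid an empty last file.
    if buffer:
        flush()
    return written
```

With the dataset loaded as above, a call could look like `write_chunks((s['text'] for s in dataset['train']), '.')`.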


## Tokenizer

Before training the model, we need to train the tokenizer.

In [1]:
from pathlib import Path
import os
from transformers import RobertaTokenizerFast
from tokenizers.decoders import ByteLevel

paths = [str(x) for x in Path('./').glob('*.txt')]
# For testing, use only the first 40 files; for real training, remove this slice and use all the data
paths = paths[0:40]

from tokenizers import ByteLevelBPETokenizer

print("STARTING TOKENIZER TRAINING")
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=50265,
    min_frequency=2,
    show_progress=True,
    special_tokens=[
        '<s>', '<pad>', '</s>', '<unk>', '<mask>'
    ]
)
print("FINISHED TOKENIZER TRAINING")

# exist_ok=True lets the cell be re-run without failing on an existing directory
os.makedirs('./srberta_tokenizer', exist_ok=True)
tokenizer.save_model('srberta_tokenizer')
srberta_tokenizer = RobertaTokenizerFast.from_pretrained("srberta_tokenizer")

sample = srberta_tokenizer("Добар дан, како си данас ти човече", return_tensors='pt')
print("Shape of input ids in sample:")
print(str(sample.input_ids.shape))

# Test decoder
print("Testing decoder")
decoder = ByteLevel()
# decode() expects a list of tokens
decoder.decode(['ĠÐ´Ð°Ð½'])

STARTING TOKENIZER TRAINING


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizer'.


FINISHED TOKENIZER TRAINING


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.


Shape of input ids in sample:
torch.Size([1, 10])
Testing decoder


' дан'
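
The decoder test above turns the odd-looking token `ĠÐ´Ð°Ð½` back into ` дан`. The sketch below shows why, assuming the tokenizer uses the GPT-2 style byte-to-character table that byte-level BPE implementations commonly use: every UTF-8 byte is mapped to one printable character, so a space becomes `Ġ` and each two-byte Cyrillic letter becomes two Latin-1 characters. The function names are illustrative and are not part of the `tokenizers` API.

```python
def bytes_to_unicode():
    """Byte -> printable character table (GPT-2 style byte-level BPE)."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control characters, space, ...) are shifted
            # to code points starting at 256, in increasing byte order.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: printable character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def byte_level_decode(token: str) -> str:
    """Map each character back to its byte, then decode the bytes as UTF-8."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


print(repr(byte_level_decode("ĠÐ´Ð°Ð½")))  # ' дан'
```

Here `Ġ` maps back to byte 0x20 (space), and the pairs `Ð´`, `Ð°`, `Ð½` map back to the UTF-8 encodings of `д`, `а`, `н`.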

## Input pipeline

In [2]:
import torch
from pathlib import Path
from transformers import RobertaTokenizerFast
from tqdm.auto import tqdm
import os

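def mlm(tensor):
    # Special-token ids follow the order used when training the tokenizer:
    # <s>=0, <pad>=1, </s>=2, <unk>=3, <mask>=4
    rand = torch.rand(tensor.shape)  # uniform samples in [0, 1)
    # Select ~15% of the positions, never <s>, <pad>, or </s>
    mask_arr = (rand < 0.15) & (tensor != 0) & (tensor != 1) & (tensor != 2)
    # Replace the selected positions with <mask> (modifies the tensor in place)
    tensor[mask_arr] = 4

    return tensor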

paths = [str(x) for x in Path('./').glob('*.txt')]

tokenizer_srberta = RobertaTokenizerFast.from_pretrained("srberta_tokenizer")

input_ids = []
mask = [] # attention mask
labels = []

for path in tqdm(paths):
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')

    sample = tokenizer_srberta(lines, max_length=512, padding='max_length', truncation=True, return_tensors='pt')

    labels.append(sample.input_ids)
    mask.append(sample.attention_mask)
    input_ids.append(mlm(sample.input_ids.detach().clone()))

# sample['input_ids'].shape

input_ids = torch.cat(input_ids)
mask = torch.cat(mask)
labels = torch.cat(labels)
torch.save(input_ids, 'input_ids.pt')
torch.save(mask, 'mask.pt')
torch.save(labels, 'labels.pt')

input_ids = torch.load("input_ids.pt")
mask = torch.load("mask.pt")
labels = torch.load("labels.pt")
input_ids[0][:10]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.


  0%|          | 0/74 [00:00<?, ?it/s]

tensor([    0,  6909,  7665,  3528,  1602,   912,  5051, 19599,   365,   480])
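
In the pipeline above, `labels` is a full copy of the unmasked input ids, so the loss is computed at every position, including padding. A common alternative is to set the label to -100 wherever no token was masked: PyTorch's cross-entropy loss ignores that value by default, so only the masked positions contribute to the loss. The sketch below illustrates that idea in plain Python on a single list of ids. It is not what this notebook ran, and the function name and constants are hypothetical.

```python
import random

# Special-token ids, assuming the order used when training the tokenizer:
# <s>=0, <pad>=1, </s>=2, <unk>=3, <mask>=4
SPECIAL_IDS = {0, 1, 2}
MASK_ID = 4
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_sequence(ids, mask_prob=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in ids:
        if tok not in SPECIAL_IDS and rng.random() < mask_prob:
            masked.append(MASK_ID)       # the model sees <mask> here
            labels.append(tok)           # and must predict the original token
        else:
            masked.append(tok)           # token left unchanged
            labels.append(IGNORE_INDEX)  # position excluded from the loss
    return masked, labels


masked, labels = mask_sequence([0, 6909, 7665, 3528, 2, 1, 1], rng=random.Random(0))
print(masked)
print(labels)
```

The same logic carries over to tensors: build the boolean mask as in `mlm`, then set `labels[~mask_arr] = -100` before saving the labels.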

In [3]:
import torch
torch.cuda.get_device_capability()

(7, 5)

## Dataloader

In [5]:
import torch

input_ids = torch.load("input_ids.pt")
mask = torch.load("mask.pt")
labels = torch.load("labels.pt")

encodings = {
    'input_ids': input_ids,
    'attention_mask': mask,
    'labels': labels
}

class Dataset(torch.utils.data.Dataset):

    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        return {key: tensor[i] for key, tensor in self.encodings.items()}

dataset = Dataset(encodings)
BATCH_SIZE = 16
DO_SHUFFLE = True
dataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=DO_SHUFFLE)
for i, data in enumerate(dataloader, 0):
    print(i)
    print(data)
    break

print(len(dataloader.dataset))

0
{'input_ids': tensor([[    0, 46776,  6994,  ...,   966,    16,     2],
        [    0,  1165,  4644,  ...,   800,   786,     2],
        [    0,   424,     4,  ...,   280,  2126,     2],
        ...,
        [    0,    42, 40390,  ...,     1,     1,     1],
        [    0,   424,  1679,  ...,     1,     1,     1],
        [    0, 23701, 19769,  ..., 35977,  1040,     2]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([[    0, 46776,  6994,  ...,   966,    16,     2],
        [    0,  1165,  4644,  ...,   800,   786,     2],
        [    0,   424, 25258,  ...,   280,  2126,     2],
        ...,
        [    0,    42, 40390,  ...,     1,     1,     1],
        [    0,   424,  1679,  ...,     1,     1,     1],
        [    0, 23701, 19769,  ..., 35977,  1040,     2]])}
738948
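
As a quick sanity check, the number of batches per epoch follows directly from the dataset size printed above and the batch size. The helper below is a hypothetical illustration of that arithmetic. It mirrors how a `DataLoader` with the default `drop_last=False` counts batches, rounding up so that the last, smaller batch is included.

```python
import math


def batches_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches one pass over the data yields."""
    if drop_last:
        # The incomplete final batch is discarded.
        return num_samples // batch_size
    # The incomplete final batch is kept.
    return math.ceil(num_samples / batch_size)


print(batches_per_epoch(738_948, 16))  # 46185
print(batches_per_epoch(738_948, 12))  # 61579
```

With `BATCH_SIZE = 16` this gives 46185 steps per epoch. A batch size of 12 gives 61579, which matches the total in the training progress bar further down. That the training run used a batch size of 12 is an inference from the numbers, not something the notebook states.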


## Training

In [6]:
from transformers import RobertaConfig
from transformers import RobertaTokenizerFast
from transformers import RobertaForMaskedLM
import torch
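# transformers.AdamW is deprecated in favor of the PyTorch optimizer
# (note: torch.optim.AdamW defaults to weight_decay=0.01, transformers.AdamW to 0.0)
from torch.optim import AdamW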
from torch.utils.tensorboard import SummaryWriter
from tqdm.notebook import tqdm

tokenizer_srberta = RobertaTokenizerFast.from_pretrained("srberta_tokenizer")

config = RobertaConfig(
    vocab_size=tokenizer_srberta.vocab_size,
    max_position_embeddings=514,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)

model = RobertaForMaskedLM(config) #randomly initialized weights

torch.cuda.empty_cache()
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(str(device))
model.to(device)
model.train()

optim = AdamW(model.parameters(), lr=1e-5)
epochs=25

writer = SummaryWriter("./runs_v2")

for epoch in range(epochs):
    step=0
    # setup loop with TQDM and dataloader
    loop = tqdm(dataloader, leave=True)

    for batch in loop:

        optim.zero_grad()

        input_ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=mask, labels=labels)
        loss = outputs.loss

        loss.backward()
        optim.step()

        loop.set_description(f'Epoch: {epoch}')
        loop.set_postfix(loss=loss.item())

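        # Log against a global step so that epochs do not overwrite each other in TensorBoard
        writer.add_scalar("Loss/train", loss.item(), epoch * len(dataloader) + step)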
        writer.flush()

        if step % 25_000 == 0:
            torch.save({'optimizer_state_dict': optim.state_dict()}, str(step)+'_'+ str(epoch)+'_optimizer.pt')
            model.save_pretrained("./srberta_model_"+str(step)+'_'+ str(epoch))

        step+=1

    # Save after each epoch
    torch.save({'optimizer_state_dict': optim.state_dict()}, str(epoch)+'_optimizer.pt')
    model.save_pretrained("./srberta_model_"+ str(epoch))

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.


cuda




  0%|          | 0/61579 [00:00<?, ?it/s]

In [2]:
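import torch

# This cell needs `model` and `optim` from the training cell, so it must run in the
# same kernel session; otherwise it fails with the NameError shown below.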
torch.save({
    'optimizer_state_dict': optim.state_dict()
},'optimizer_3_epochs.pt')

model.save_pretrained("./srberta_model_3_epochs")

def save(model, optimizer):
    # save
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict()
    }, 'output_model.pt')

save(model, optim)

NameError: name 'optim' is not defined

## Testing model

In [13]:
from transformers import pipeline
from transformers import RobertaForMaskedLM

model_v2 = RobertaForMaskedLM.from_pretrained("./srberta_model_1")
model_v2.to('cpu')

fill = pipeline('fill-mask', model=model_v2, tokenizer=tokenizer_srberta)
fill(f'Добар дан како {fill.tokenizer.mask_token} ')

[{'score': 0.01705239899456501,
  'token': 18,
  'token_str': '.',
  'sequence': 'Добар дан како. '},
 {'score': 0.0163358673453331,
  'token': 341,
  'token_str': ' на',
  'sequence': 'Добар дан како на '},
 {'score': 0.01422953512519598,
  'token': 316,
  'token_str': ' је',
  'sequence': 'Добар дан како је '},
 {'score': 0.013286540284752846,
  'token': 16,
  'token_str': ',',
  'sequence': 'Добар дан како, '},
 {'score': 0.011648965999484062,
  'token': 280,
  'token_str': ' и',
  'sequence': 'Добар дан како и '}]
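
To put the output above in perspective, the short sketch below sums the five scores. The values are copied from the result above and the helper name is hypothetical. The top five candidates together cover only about 7% of the probability mass, and the best single guess is a full stop, which suggests that this checkpoint is still far from confident about the masked word.

```python
# Scores and tokens copied from the fill-mask output above.
predictions = [
    {"score": 0.01705239899456501, "token_str": "."},
    {"score": 0.0163358673453331, "token_str": " на"},
    {"score": 0.01422953512519598, "token_str": " је"},
    {"score": 0.013286540284752846, "token_str": ","},
    {"score": 0.011648965999484062, "token_str": " и"},
]


def summarize(preds):
    """Return the most likely token and the total probability of all candidates."""
    best = max(preds, key=lambda p: p["score"])
    total = sum(p["score"] for p in preds)
    return best["token_str"], total


best_token, total = summarize(predictions)
print(f"best token: {best_token!r}, top-5 probability mass: {total:.4f}")
```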