# roBERTo
This is a BERT based language model which is trained to learn Spanish, since most of the models are in English. 

## Gathering Dataset

As stated, to train the model we need a spanish corpus, therefore we will be using the spanish dataset Dahiana from huggingface.co/

In [142]:
!pip install datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [143]:
from datasets import load_dataset

In [144]:
dataset = load_dataset("mlsum", "es")

Reusing dataset mlsum (/Users/ernestomancebo/.cache/huggingface/datasets/mlsum/es/1.0.0/77f23eb185781f439927ac2569ab1da1083195d8b2dab2b2f6bbe52feb600688)


In [145]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 266367
    })
    validation: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 10358
    })
    test: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 13920
    })
})

In [146]:
train = dataset['train']
validation = dataset['validation']


In [147]:
train[25]['text']


'No habrá tregua para el consumidor en 2010. Ni la crisis, ni el paro, ni siquiera el previsible estancamiento de los precios impedirán una subida general de los servicios y suministros más básicos. El afán recaudatorio de las distintas administraciones (Estado, comunidades autónomas y ayuntamientos) para paliar los agujeros de las cuentas públicas han desatado una oleada de aumentos de impuestos, tasas y tarifas de servicios públicos en el año que comienza. El Gobierno está a la cabeza de esta política impositiva. A partir de julio, la práctica totalidad de los productos -exceptuando los de primera necesidad, como el pan- costarán más gracias a la subida de dos puntos del tipo general del IVA, del 16% al 18%. Pensionistas, parados y sueldos bajos mejorarán algo su renta Y si alguien pensaba que el aumento de los impuestos indirectos se va a compensar con una relajación de los directos, los que gravan la renta de cada ciudadano, nada más lejos de la realidad. Los 400 euros de desgravac

Observamos cuántos registros nos restan si procesamos el corpus de 10 mil en 10 mil entradas

In [148]:
266367 % 10_000

6367

In [149]:
import os

corpus_dir = os.path.join(os.getcwd(), 'corpus')
os.mkdir(corpus_dir)
corpus_dir


FileExistsError: [Errno 17] File exists: '/Users/ernestomancebo/projects/coloquial_bot/corpus'

In [None]:
from tqdm.auto import tqdm


def serialize_corpus(dataset, dest_path, max_entries=10_000):
    text_data = []
    file_count = 0

    for sample in tqdm(dataset):
        # Clean up a bit the text
        text = sample['text']
        text = text.replace("\n", ' ')
        text_data.append(text)

        if len(text_data) == max_entries:
            with open(os.path.join(dest_path, f'es_{file_count}.txt'), 'w', encoding='utf-8') as file:
                file.write('\n'.join(text_data))
                text_data = []
                file_count += 1

                file.close()

    # The last iteration may exceed the max_entries cap
    if len(text_data) > 0:
        with open(os.path.join(dest_path, f'es_{file_count}.txt'), 'w', encoding='utf-8') as file:
            file.write('\n'.join(text_data))

            file.close()


In [None]:
train_dir = os.path.join(corpus_dir, 'train')
val_dir = os.path.join(corpus_dir, 'validation')

!mkdir {train_dir}
!mkdir {val_dir}

serialize_corpus(train, train_dir)
serialize_corpus(validation, val_dir)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 266367/266367 [01:51<00:00, 2387.99it/s]
100%|██████████| 10358/10358 [00:05<00:00, 1933.35it/s]


## Tokenizing

In [None]:
from pathlib import Path

corpus_paths = [str(x) for x in Path('./corpus/train').glob('*.txt')]
corpus_paths[:3]


['corpus/train/es_2.txt', 'corpus/train/es_11.txt', 'corpus/train/es_10.txt']

In [None]:
from tokenizers import ByteLevelBPETokenizer

In [None]:
VOCAB_SIZE = 32_000

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=corpus_paths,
                vocab_size=VOCAB_SIZE,
                min_frequency=2,
                special_tokens=[
                    '<s>', '<pad>', '</s>', '<unk>', '<mask>'])







In [None]:
!mkdir roberto

In [None]:
tokenizer.save_model('roberto')

['roberto/vocab.json', 'roberto/merges.txt']

### Loading from pretrained model

First, we create a model configuration. This config is ok for a small model.

In [None]:
import json
config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.3,
    "hidden_size": 128,
    "initializer_range": 0.02,
    "num_attention_heads": 1,
    "num_hidden_layers": 1,
    "vocab_size": VOCAB_SIZE,
    "intermediate_size": 256,
    "max_position_embeddings": 256
}

with open("./roberto/config.json", 'w') as fp:
    json.dump(config, fp)


In [None]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberto')

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.


In [None]:
TOKEN_MAX_LEN= 512

tokenizer('hoy es un buen día', padding='max_length')


Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.


{'input_ids': [0, 17756, 317, 298, 1384, 975, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

## Input Pipeline and Training

In [None]:
import torch


def mlm(tensor, percent=0.15):
    """Mask randomly the given 

    Args:
        tensor ([type]): [description]

    Returns:
        [type]: [description]
    """
    rand = torch.rand(tensor.shape)  # [0,1]
    mask_arr = (rand < 0.15) * (tensor > 2)  # Special tokens: 0, 1, 2

    for i in range(tensor.shape[0]):
        selection = torch.flatten(mask_arr[i].nonzero())
        tensor[i, selection] = 4  # <mask> : 4 in vocab.json

    return tensor


### Create three tensors

- **Labels**: Are the ground truth of the given input sequence.
- **Input Ids**: Are the masked labels, ie. the labels + 0.15% of them masked.
- **Attention Maks**


In [None]:
from tqdm.auto import tqdm

input_ids = []
masks = []
labels = []

for path in tqdm(corpus_paths):
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')
    sample = tokenizer(lines,
                       max_length=TOKEN_MAX_LEN,
                       padding='max_length',
                       truncation=True,
                       return_tensors='pt')

    labels.append(sample.input_ids)
    masks.append(sample.attention_mask)
    input_ids.append(mlm(sample.input_ids.detach().clone()))


100%|██████████| 27/27 [22:25<00:00, 49.83s/it]


Cast the parsed list to tensors

In [None]:
input_ids = torch.cat(input_ids)
masks = torch.cat(masks)
labels = torch.cat(labels)


In [178]:
!mkdir tensors

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [179]:
torch.save(input_ids, './tensors/input_ids.pt')
torch.save(masks, './tensors/masks.pt')
torch.save(labels, './tensors/labels.pt')

Loading persisted Tensors

In [None]:
import torch

input_ids = torch.load( './tensors/input_ids.pt')
masks = torch.load( './tensors/masks.pt')
labels = torch.load( './tensors/labels.pt')

In [None]:

# Input encoding
encodings = {'input_ids': input_ids, 'attention_mask': masks, 'labels': labels}

In [None]:
class Dataset(torch.utils.data.Dataset):

    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        return {key: tensor[i] for key, tensor in self.encodings.items()}


In [None]:
dataset = Dataset(encodings)


In [None]:
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)


In [None]:
from transformers import RobertaConfig


In [170]:
config = RobertaConfig(
    vocab_size=VOCAB_SIZE,
    max_position_embeddings=(TOKEN_MAX_LEN + 2),
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1
)


In [None]:
from transformers import RobertaForMaskedLM

In [171]:
model = RobertaForMaskedLM(config)

In [172]:
device = torch.device(
    'cuda') if torch.cuda.is_available() else torch.device('cpu')


In [173]:
model.to(device)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [None]:
from transformers import AdamW

In [167]:
model.train() 
optimizer = AdamW(model.parameters(), lr=1e-4)

In [161]:
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

In [124]:
from tqdm.auto import tqdm

In [174]:

epochs = 3
step = 0

for epoch in range(epochs):

    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=mask, labels=labels)
        loss = outputs.loss

        writer.add_scalar('Loss/Train', loss, epoch)
        loss.backward()
        optim.step()

        loop.set_description(f'Epoch: {epoch}')
        loop.set_postfix(loss=loss.item())

model.save('./roberto')

  0%|          | 0/16648 [00:51<?, ?it/s]


AttributeError: 'Tensor' object has no attribute 'backwards'

## File-Maks testing

In [153]:
from transformers import pipeline

In [None]:
fill = pipeline('fill-mask', model='roberto', tokenizer='roberto')
