# I2A2 Challenge - Portuguese Embedding
## Hilquias de Paiva Araújo

This is the notebook used to prototype the Portuguese embedding model.

For evaluation, you only need to run the `Imports` section and the `Testing the model as embedding` section. You do not need to run any cell inside the `Development` section.

Choose two words at a time (cell 7, the third code cell) to compute the similarity between them.

-----

Disclaimer: I could not train the model for long, and I had to run it locally. As a result, the embedding does not work very well. It does show a tendency toward higher similarity for synonyms, but it does not produce a "negative" similarity for antonyms.

-----

**NOTE**

Install all packages listed in `requirements.txt` before running the imports.

### Imports

In [None]:
import os
import numpy as np
from pathlib import Path
import torch
from tqdm.auto import tqdm
from datasets import load_dataset
from transformers import RobertaTokenizer
from transformers import RobertaConfig
from transformers import RobertaForMaskedLM
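# transformers.AdamW is deprecated; torch's AdamW defaults to weight_decay=0.01 (transformers used 0.0)
from torch.optim import AdamW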
from tokenizers import ByteLevelBPETokenizer

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')


# Testing the model as embedding

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('pt_tokenizer', max_len=512)
embedding = RobertaForMaskedLM.from_pretrained('pt_model', output_hidden_states=True)
embedding.to(device)


def similarity(tokenizer, embedding, text1, text2):
    tok1 = tokenizer(text1, return_tensors='pt').to(device)
    tok2 = tokenizer(text2, return_tensors='pt').to(device)


    with torch.no_grad():
        out1 = embedding(**tok1)
        out2 = embedding(**tok2)

    # Only grab the last hidden state
    states1 = out1.hidden_states[-1].squeeze()
    states2 = out2.hidden_states[-1].squeeze()

    # Keep every token embedding (including <s> and </s>); no sub-selection is applied
    embs1 = states1
    embs2 = states2

    avg1 = embs1.mean(axis=0)
    avg2 = embs2.mean(axis=0)


    return torch.cosine_similarity(avg1.reshape(1,-1), avg2.reshape(1,-1)).cpu().numpy()[0]
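
The pooling and comparison steps inside `similarity` can be illustrated without loading a model. The sketch below is a framework-free illustration, assuming token embeddings are plain lists of floats; `mean_pool` and `cosine` are hypothetical helper names that do not exist in the notebook.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors, in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "hidden states": two tokens for the first text, three for the second.
states1 = [[1.0, 0.0], [0.0, 1.0]]
states2 = [[2.0, 2.0], [1.0, 1.0], [3.0, 3.0]]

score = cosine(mean_pool(states1), mean_pool(states2))
print(round(score, 6))
```

Because the two pooled vectors point in the same direction, the score is at the top of the range even though the texts have different token counts. Mean pooling makes the comparison independent of sequence length.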


Enter two words below to compare them with the embedding model:

In [None]:
text1 = 'ator'
text2 = 'protagonista'

similarity(tokenizer, embedding, text1, text2)
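
To compare several pairs in one pass, a small ranking helper may be convenient. In the sketch below, `rank_pairs` is a hypothetical helper and `score_fn` stands in for a call such as `lambda a, b: similarity(tokenizer, embedding, a, b)`. The demo uses a stand-in scorer (`char_overlap`, character-set overlap) only so that it runs without the model; its scores say nothing about meaning.

```python
def rank_pairs(pairs, score_fn):
    """Score each (text1, text2) pair and return them sorted, most similar first."""
    scored = [(a, b, float(score_fn(a, b))) for a, b in pairs]
    return sorted(scored, key=lambda item: item[2], reverse=True)

def char_overlap(a, b):
    """Stand-in scorer for the demo only: Jaccard overlap of character sets."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

pairs = [('ator', 'protagonista'), ('ator', 'atriz'), ('ator', 'mesa')]
ranking = rank_pairs(pairs, char_overlap)
for a, b, score in ranking:
    print(f'{a} / {b}: {score:.3f}')
```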

# Development

## Saving training data

In [None]:
dataset = load_dataset('oscar', 'unshuffled_deduplicated_pt', split='train[:1%]')


In [None]:
text_data = []
file_count = 0

# make sure the output directory exists before writing
os.makedirs('data', exist_ok=True)

for sample in tqdm(dataset['text']):
    sample = sample.replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # once we hit the 10K mark, save to file
        with open(f'data/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1
# after saving in 10K chunks, write any leftover samples (fewer than 10K) too
if text_data:
    with open(f'data/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
        fp.write('\n'.join(text_data))
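
The chunked-save pattern above can be tried on a small scale. The following is a minimal sketch using a temporary directory and a chunk size of 10; `write_in_chunks` is a hypothetical helper, and unlike the notebook it replaces inner newlines with a space instead of removing them.

```python
import tempfile
from pathlib import Path

def write_in_chunks(samples, out_dir, chunk_size):
    """Write samples to text_<n>.txt files, one sample per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk, paths = [], []

    def flush():
        # Number each file by how many files were written before it.
        path = out_dir / f'text_{len(paths)}.txt'
        path.write_text('\n'.join(chunk), encoding='utf-8')
        paths.append(path)
        chunk.clear()

    for sample in samples:
        # Newlines inside a sample would break the one-sample-per-line layout.
        chunk.append(sample.replace('\n', ' '))
        if len(chunk) == chunk_size:
            flush()
    if chunk:  # leftover samples that did not fill a whole chunk
        flush()
    return paths

with tempfile.TemporaryDirectory() as tmp:
    samples = [f'linha {i}\nresto' for i in range(25)]
    written = write_in_chunks(samples, tmp, chunk_size=10)
    sizes = [len(p.read_text(encoding='utf-8').split('\n')) for p in written]
    print(sizes)
```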


## Tokenizer

In [None]:
files_paths = [str(x) for x in Path('data/').glob('**/*.txt')]

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files=files_paths, vocab_size=30_522, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

os.makedirs('pt_tokenizer', exist_ok=True)

tokenizer.save_model('pt_tokenizer')
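
For intuition about what byte-pair-encoding training learns, the sketch below runs three merge steps on a toy corpus. It is a simplified illustration that works on characters rather than bytes, and it is not the `tokenizers` library implementation; `most_frequent_pair` and `merge_pair` are hypothetical helper names.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of symbols mapped to its frequency.
words = {tuple('casa'): 5, tuple('casas'): 3, tuple('caso'): 2}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

Each merge adds one entry to the vocabulary, so a `vocab_size` setting effectively caps how many merges are learned on top of the base symbols and special tokens.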


In [None]:
# load the tokenizer we trained and saved to file
tokenizer = RobertaTokenizer.from_pretrained('pt_tokenizer', max_len=512)
# test our tokenizer on a simple sentence
tokens = tokenizer('olá, como vai?')
print(tokens)
print(tokens.input_ids)

## Data preparation

In [None]:
all_lines = []

file = files_paths[0]
with open(file, 'r', encoding='utf-8') as fp:
    lines = fp.read().split('\n')
    all_lines.extend(lines)


In [None]:
batch = tokenizer(all_lines, max_length=512, padding='max_length', truncation=True)


We need a data loader that serves fully prepared data. Only 10k lines were used because no GPU was available to train the model.

The cells below pre-process the data, masking random tokens so the result can be passed to the model as training data.

In [None]:
labels = torch.tensor(batch['input_ids'])
mask = torch.tensor(batch['attention_mask'])

# make copy of labels tensor, this will be input_ids
input_ids = labels.detach().clone()
# create random array of floats with equal dims to input_ids
rand = torch.rand(input_ids.shape)
# mask a random 15% of tokens, skipping <s> (0), <pad> (1), and </s> (2)
mask_arr = (rand < .15) * (input_ids != 0) * (input_ids != 1) * (input_ids != 2)
# replace the selected positions with the <mask> token; its ID is read from the
# tokenizer because ID 3 is <unk> given the special_tokens order used in training
input_ids[mask_arr] = tokenizer.mask_token_id
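
The masking rule can also be sketched without tensors. The example below is a minimal pure-Python illustration, assuming the token IDs follow the order of `special_tokens` passed to the tokenizer trainer (`<s>`=0, `<pad>`=1, `</s>`=2, `<unk>`=3, `<mask>`=4); `mask_tokens`, `SPECIAL_IDS`, and `MASK_ID` are hypothetical names.

```python
import random

def mask_tokens(input_ids, mask_id, special_ids, prob=0.15, seed=None):
    """Return a masked copy of input_ids; special tokens are never masked."""
    rng = random.Random(seed)
    masked = []
    for token in input_ids:
        if token not in special_ids and rng.random() < prob:
            masked.append(mask_id)
        else:
            masked.append(token)
    return masked

# Assumed IDs: <s>=0, <pad>=1, </s>=2, <unk>=3, <mask>=4.
SPECIAL_IDS = {0, 1, 2}
MASK_ID = 4

# Layout: <s>, 100 regular tokens, </s>, then padding.
labels = [0] + list(range(10, 110)) + [2] + [1] * 10
masked_ids = mask_tokens(labels, MASK_ID, SPECIAL_IDS, seed=0)

changed = [i for i, (a, b) in enumerate(zip(labels, masked_ids)) if a != b]
print(len(changed), 'of 100 regular tokens were masked')
```

The original IDs stay in `labels`, so the model is trained to recover the token that was hidden at each masked position.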


In [None]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # store encodings internally
        self.encodings = encodings

    def __len__(self):
        # return the number of samples
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        # return dictionary of input_ids, attention_mask, and labels for index i
        return {key: tensor[i] for key, tensor in self.encodings.items()}


In [None]:
encodings = {'input_ids': input_ids, 'attention_mask': mask, 'labels': labels}

dataset = Dataset(encodings)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

## Defining the model

In [None]:
config = RobertaConfig(
    vocab_size=30_522,  # we align this to the tokenizer vocab_size
    max_position_embeddings=514,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1
)


model = RobertaForMaskedLM(config)
model.to(device)

# activate training mode
model.train()
# initialize optimizer
optim = AdamW(model.parameters(), lr=1e-4)
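
Two of the configuration values are tied to other settings, which the quick check below makes explicit. It is a sketch over plain numbers copied from the cells above (`config_values` and `tokenizer_max_len` are hypothetical names), and it relies on RoBERTa's convention of starting position IDs right after the padding index, which is 1 by default.

```python
config_values = {
    'hidden_size': 768,
    'num_attention_heads': 12,
    'max_position_embeddings': 514,
}
tokenizer_max_len = 512  # max_len used when loading the tokenizer
padding_idx = 1          # default pad_token_id for RoBERTa

# Each attention head works on an equal slice of the hidden vector.
head_dim, remainder = divmod(
    config_values['hidden_size'], config_values['num_attention_heads']
)
if remainder != 0:
    raise ValueError('hidden_size must be divisible by num_attention_heads')

# Position IDs run from padding_idx + 1 upward, so the embedding table
# needs padding_idx + 1 extra rows on top of the longest sequence.
required_positions = tokenizer_max_len + padding_idx + 1

print(head_dim, required_positions)
```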

## Training

In [None]:
epochs = 1

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # reset gradients accumulated in the previous step
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # process
        outputs = model(input_ids, attention_mask=attention_mask,
                        labels=labels)
        # extract loss
        loss = outputs.loss
        # backpropagate: compute gradients for every parameter that requires grad
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
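
The per-batch loss shown in the progress bar tends to be noisy. One optional way to read the trend is an exponential moving average, sketched below in plain Python; `smooth_losses` is a hypothetical helper, the loss values are made up for illustration, and none of this is part of the training loop above.

```python
def smooth_losses(losses, beta=0.9):
    """Exponential moving average with bias correction for the early steps."""
    avg, smoothed = 0.0, []
    for step, loss in enumerate(losses, start=1):
        avg = beta * avg + (1 - beta) * loss
        # Dividing by (1 - beta**step) removes the bias toward the zero start.
        smoothed.append(avg / (1 - beta ** step))
    return smoothed

raw = [8.0, 6.0, 7.5, 5.0, 6.5, 4.0]  # illustrative values only
smoothed = smooth_losses(raw)
for r, s in zip(raw, smoothed):
    print(f'raw={r:.2f} smoothed={s:.2f}')
```

Inside the loop, the same update could be applied to `loss.item()` and the smoothed value passed to `loop.set_postfix`.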


In [None]:
model.save_pretrained('./pt_model')