# Reference notebook

Name: Arthur

## Instructions

In this Colab notebook we will train a T5 model to translate from English to Portuguese, using the Paracrawl dataset.

- We will use the English-Portuguese Paracrawl dataset. The training set is truncated to only 100k pairs to speed up training. Feel free to train on more samples. If training takes too long, truncate the dataset further.

- We will use BLEU as the metric, computed with SacreBLEU, because it always applies the same preprocessing (tokenization, lowercasing). We will not use torchnlp.metrics.bleu, torchtext.data.metrics.bleu_score, etc. SacreBLEU is slow, so use few validation samples (e.g., 5k).


We will use the PTT5 model available on the Hugging Face model hub:

https://huggingface.co/unicamp-dl/ptt5-small-portuguese-vocab

This is a T5 pre-trained on Portuguese text, with a Portuguese tokenizer.

Saving the model weights and the optimizer state is recommended, because training takes a long time.
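
One possible way to save and restore both pieces of state is sketched below with plain `torch.save` and `torch.load`. The tiny linear model, the checkpoint path, and the dictionary keys are assumptions made for illustration; in the notebook the model would be the T5 and the path could point to a Google Drive folder.

```python
import os
import tempfile

import torch

# Tiny stand-in model; in the notebook this would be the T5 model.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Take one optimizer step so the optimizer has state worth saving.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Hypothetical checkpoint location; on Colab this could be a Google Drive folder.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(
    {
        "epoch": 0,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    checkpoint_path,
)

# To resume, rebuild the same objects and load the saved state into them.
resumed_model = torch.nn.Linear(4, 2)
resumed_optimizer = torch.optim.AdamW(resumed_model.parameters(), lr=1e-3)
checkpoint = torch.load(checkpoint_path)
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
```

Saving the optimizer state matters for AdamW because its moment estimates would otherwise restart from zero after a resume.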


In [7]:
# General settings
model_name = "unicamp-dl/ptt5-small-portuguese-vocab"
batch_size = 64
accumulate_grad_batches = 2
source_max_length = 128
target_max_length = 128
learning_rate = 1e-3

In [49]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
device

device(type='cuda')

In [1]:
! pip install sacrebleu
! pip install transformers
! pip install sentencepiece
! pip install accelerate

Collecting accelerate
  Downloading accelerate-0.13.1-py3-none-any.whl (148 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.13.1


In [24]:
# Import all packages in one place to avoid duplicates throughout the notebook.
import gzip
import os
import random
import sacrebleu
import torch
import torch.nn.functional as F

# from google.colab import drive

from transformers import T5ForConditionalGeneration, T5Tokenizer, get_scheduler

from torch.utils.data import DataLoader
from torch.utils.data import Dataset

from typing import Dict
from typing import List
from typing import Tuple

from tqdm import tqdm

from accelerate import Accelerator
from numpy import exp

In [3]:
# Important: Fix seeds so we can replicate results
seed = 123
random.seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)

We will save the checkpoints (model weights) to Google Drive so that training can resume from where it stopped.

In [None]:
# Only works on Google Colab; the import is needed before mounting.
from google.colab import drive
drive.mount('/content/drive')

## Preparing the Data

First, we download the dataset:

In [4]:
! wget -nc https://storage.googleapis.com/unicamp-dl/ia024a_2022s2/paracrawl_enpt_train.tsv.gz
! wget -nc https://storage.googleapis.com/unicamp-dl/ia024a_2022s2/paracrawl_enpt_test.tsv.gz

File ‘paracrawl_enpt_train.tsv.gz’ already there; not retrieving.

File ‘paracrawl_enpt_test.tsv.gz’ already there; not retrieving.



## Loading the dataset

We will artificially create a training split (100k pairs) and a validation split (5k pairs).

Note: Avoid looking at the test set as much as possible, so that you are not biased toward what will be tested. In real applications, the test data only becomes available in the future, that is, when users start trying your product.
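
The split described above can be sketched on toy data as follows. `split_train_val` is a hypothetical helper, and the pairs are made up. It shuffles a copy of the data once and cuts two disjoint slices, which is what keeps validation samples out of the training set.

```python
import random

def split_train_val(pairs, train_size, val_size, seed=123):
    """Shuffle a copy of `pairs` and cut disjoint train and validation slices."""
    if train_size + val_size > len(pairs):
        raise ValueError("not enough pairs for the requested split")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:train_size + val_size]

# Made-up pairs standing in for the Paracrawl data.
toy_pairs = [(f"source {i}", f"target {i}") for i in range(10)]
toy_train, toy_val = split_train_val(toy_pairs, train_size=7, val_size=2)

print(len(toy_train), len(toy_val))
```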

In [5]:
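def load_text_pairs(path):
    """Read a gzipped TSV file into a list of [source, target] pairs."""
    text_pairs = []
    # The context manager closes the file even if parsing fails.
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            text_pairs.append(line.strip().split('\t'))
    return text_pairs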

x_train = load_text_pairs('paracrawl_enpt_train.tsv.gz')
x_test = load_text_pairs('paracrawl_enpt_test.tsv.gz')

# Shuffle the training set before making the train/val split.
random.shuffle(x_train)

# Truncate the dataset to 100k training pairs and 5k validation pairs.
x_val = x_train[100000:105000]
x_train = x_train[:100000]

for set_name, x in [('train', x_train), ('validation', x_val), ('test', x_test)]:
    print(f'\n{len(x)} {set_name} samples')
    print(f'First 3 {set_name} samples:')
    for i, (source, target) in enumerate(x[:3]):
        print(f'{i}: source: {source}\n   target: {target}')


100000 train samples
First 3 train samples:
0: source: More Croatian words and phrases
   target: Mais palavras e frases em croata
1: source: Jerseys and pullovers, containing at least 50Â % by weight of wool and weighing 600Â g or more per article 6110 11 10 (PCE)
   target: Camisolas e pulôveres, com pelo menos 50 %, em peso, de lã e pesando 600g ou mais por unidade 6110 11 10 (PCE)
2: source: Atex Colombia SAS makes available its lead product, 100% natural liquid latex, excellent quality and price. ... Welding manizales caldas Colombia a DuckDuckGo
   target: Atex Colômbia SAS torna principal produto está disponível, látex líquido 100% natural, excelente qualidade e preço. ...

5000 validation samples
First 3 validation samples:
0: source: «You have hidden these things from the wise and the learned you have revealed them to the childlike»
   target: «Escondeste estas coisas aos sábios e entendidos e as revelaste aos pequenos»
1: source: Repair of computers, applic

## Creating the Dataset


In [10]:
tokenizer = T5Tokenizer.from_pretrained(model_name)

In [11]:
class MyDataset(Dataset):
    def __init__(
        self,
        text_pairs: List[Tuple[str, str]],
        tokenizer,
        source_max_length: int = 32,
        target_max_length: int = 32,
    ):
        self.original_source = [text[0] for text in text_pairs]
        self.original_target = [text[1] for text in text_pairs]

        inputs = tokenizer(
            self.original_source,
            padding=True,
            truncation=True,
            max_length=source_max_length,
            return_tensors="pt",
        )
        targets = tokenizer(
            self.original_target,
            padding=True,
            truncation=True,
            max_length=target_max_length,
            return_tensors="pt",
        )

        self.text_pairs = [
            {
                "source_tokens": input_ids,
                "source_attention_mask": input_attention_mask,
                "target_tokens": target_ids,
                "target_attention_mask": target_attention_mask,
            }
            for input_ids, input_attention_mask, target_ids, target_attention_mask in zip(
                inputs.input_ids,
                inputs.attention_mask,
                targets.input_ids,
                targets.attention_mask,
            )
        ]

    def __len__(self):
        return len(self.text_pairs)

    def __getitem__(self, idx):
        # Return the tokenized tensors together with the raw strings.
        data = self.text_pairs[idx]
        return (
            data["source_tokens"],
            data["source_attention_mask"],
            data["target_tokens"],
            data["target_attention_mask"],
            self.original_source[idx],
            self.original_target[idx],
        )
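
With `padding=True`, the tokenizer pads every sequence to the longest one in the call and returns an attention mask with 1 for real tokens and 0 for padding. Because the class tokenizes the whole list in a single call, all samples end up padded to the longest sequence in the dataset (capped at the maximum length). The pure-Python sketch below mimics that behavior. `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Right-pad each sequence to the longest one and build the matching masks."""
    max_len = max(len(ids) for ids in token_id_lists)
    padded, masks = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        padded.append(list(ids) + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks

# Made-up token ids; pad_id=0 is an assumption about the tokenizer.
padded, masks = pad_batch([[5, 6, 7, 1], [8, 1]])
print(padded)
print(masks)
```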


## Testing the DataLoader

In [12]:
text_pairs = [('we like pizza', 'eu gosto de pizza')]
dataset_debug = MyDataset(
    text_pairs=text_pairs,
    tokenizer=tokenizer,
    source_max_length=source_max_length,
    target_max_length=target_max_length)

dataloader_debug = DataLoader(dataset_debug, batch_size=10, shuffle=True, 
                              num_workers=0)

source_token_ids, source_mask, target_token_ids, target_mask, _, _ = next(iter(dataloader_debug))
print('source_token_ids:\n', source_token_ids)
print('source_mask:\n', source_mask)
print('target_token_ids:\n', target_token_ids)
print('target_mask:\n', target_mask)

print('source_token_ids.shape:', source_token_ids.shape)
print('source_mask.shape:', source_mask.shape)
print('target_token_ids.shape:', target_token_ids.shape)
print('target_mask.shape:', target_mask.shape)

source_token_ids:
 tensor([[  31, 1528, 1079,  634, 1241, 7531,    1]])
source_mask:
 tensor([[1, 1, 1, 1, 1, 1, 1]])
target_token_ids:
 tensor([[2077, 6618,    4, 1241, 7531,    1]])
target_mask:
 tensor([[1, 1, 1, 1, 1, 1]])
source_token_ids.shape: torch.Size([1, 7])
source_mask.shape: torch.Size([1, 7])
target_token_ids.shape: torch.Size([1, 6])
target_mask.shape: torch.Size([1, 6])


## Creating the Train/Val/Test DataLoaders

In [13]:
dataset_train = MyDataset(text_pairs=x_train,
                          tokenizer=tokenizer,
                          source_max_length=source_max_length,
                          target_max_length=target_max_length)

dataset_val = MyDataset(text_pairs=x_val,
                        tokenizer=tokenizer,
                        source_max_length=source_max_length,
                        target_max_length=target_max_length)

dataset_test = MyDataset(text_pairs=x_test,
                         tokenizer=tokenizer,
                         source_max_length=source_max_length,
                         target_max_length=target_max_length)

train_dataloader = DataLoader(dataset_train, batch_size=batch_size,
                              shuffle=True, num_workers=0)

val_dataloader = DataLoader(dataset_val, batch_size=batch_size, shuffle=False, 
                            num_workers=0)

test_dataloader = DataLoader(dataset_test, batch_size=batch_size,
                             shuffle=False, num_workers=0)

In [35]:
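def train(train_dataloader, accelerator, optimizer, lr_scheduler, model):
    model.train()
    train_loss = 0.0
    optimizer.zero_grad()
    progress_bar = tqdm(range(len(train_dataloader)))
    for step, batch in enumerate(train_dataloader):
        # The DataLoader yields tuples, so unpack them instead of using **batch.
        source_token_ids, source_mask, target_token_ids, target_mask, _, _ = batch
        # Padding positions get -100 so the loss ignores them.
        labels = target_token_ids.masked_fill(target_mask == 0, -100)
        outputs = model(input_ids=source_token_ids, attention_mask=source_mask, labels=labels)
        loss = outputs.loss
        train_loss += loss.item()
        # Scale the loss so the accumulated gradient averages over the micro-batches.
        accelerator.backward(loss / accumulate_grad_batches)

        # Update the weights only every `accumulate_grad_batches` micro-batches.
        if (step + 1) % accumulate_grad_batches == 0:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        progress_bar.update(1)

    return train_loss / len(train_dataloader)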
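
Gradient accumulation, which the `accumulate_grad_batches` setting refers to, sums the gradients of several micro-batches before a single optimizer update. The sketch below uses a made-up one-weight regression problem to show why each micro-batch gradient is scaled by `1 / accumulation_steps`: with equally sized micro-batches, the accumulated gradient then matches the gradient of the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Gradient computed on the full batch of 4 examples.
full_grad = mse_grad(w, xs, ys)

# Same examples split into 2 micro-batches of 2; each gradient is scaled by
# 1 / accumulation_steps before being added to the running total.
accumulation_steps = 2
accumulated_grad = 0.0
for i in range(accumulation_steps):
    micro_xs = xs[2 * i : 2 * i + 2]
    micro_ys = ys[2 * i : 2 * i + 2]
    accumulated_grad += mse_grad(w, micro_xs, micro_ys) / accumulation_steps

print(full_grad, accumulated_grad)
```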


In [25]:
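def evaluate_model(
    model,
    val_dataloader
):
    model.eval()
    val_loss = 0.0
    progress_bar = tqdm(range(len(val_dataloader)))
    for step, batch in enumerate(val_dataloader):
        # The DataLoader yields tuples, so unpack them instead of using **batch.
        source_token_ids, source_mask, target_token_ids, target_mask, _, _ = batch
        # Padding positions get -100 so the loss ignores them.
        labels = target_token_ids.masked_fill(target_mask == 0, -100)
        with torch.no_grad():
            outputs = model(input_ids=source_token_ids, attention_mask=source_mask, labels=labels)
            loss = outputs.loss
            val_loss += loss.item()
        progress_bar.update(1)

    return val_loss / len(val_dataloader)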

In [50]:
num_train_epochs = 1
model = T5ForConditionalGeneration.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)


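# With gradient accumulation, the optimizer updates once every `accumulate_grad_batches` batches.
num_update_steps_per_epoch = max(1, len(train_dataloader) // accumulate_grad_batches)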
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

# `mixed_precision="fp16"` replaces the deprecated `fp16=True` argument.
accelerator = Accelerator(mixed_precision="fp16")
device = accelerator.device

model, optimizer, lr_scheduler, train_dataloader, val_dataloader = accelerator.prepare(
    model, optimizer, lr_scheduler, train_dataloader, val_dataloader
)
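
For intuition, the `"linear"` schedule requested above is expected to ramp the learning rate up during warmup and then decay it linearly to zero over the training steps. The function below is a sketch of that shape under this assumption, not the library's code. `linear_schedule_factor` and the step counts are hypothetical.

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given update step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-3
total_steps = 1000  # hypothetical number of optimizer updates
for step in (0, 250, 500, 1000):
    print(step, base_lr * linear_schedule_factor(step, total_steps))
```

If the scheduler is stepped fewer times than `num_training_steps`, the learning rate never reaches zero, so the step count passed to the scheduler should match the number of optimizer updates.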


In [47]:
train_losses = []
valid_losses = []
perplexities = []
for epoch in range(num_train_epochs):
    # Pass the actual objects, not their names as strings.
    train_loss = train(train_dataloader, accelerator, optimizer, lr_scheduler, model)
    train_losses.append(train_loss)
    valid_loss = evaluate_model(model, val_dataloader)
    valid_losses.append(valid_loss)
    perplexities.append(exp(train_loss))
    print(
        f"Epoch: {epoch+1}; Train Loss: {train_loss:.3f}; Perplexity: {exp(train_loss):.3f}; Validation Loss: {valid_loss:.3f};"
    )

# Save and upload
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained('.', save_function=accelerator.save)
if accelerator.is_main_process:
    tokenizer.save_pretrained('.')
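
The training cell refers to perplexity as `exp(train_loss)`. A minimal sketch of that relation follows, assuming the loss is the mean per-token cross-entropy in nats. `perplexity` is a hypothetical helper.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity for a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_cross_entropy)

# A loss of 0 means the model is always certain and correct.
print(perplexity(0.0))
# A loss of log(4) is as uncertain as a uniform choice among 4 tokens.
print(perplexity(math.log(4)))
```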
