<a href="https://colab.research.google.com/github/finardi/IA376A/blob/master/T5-Paracrawl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# <span style="color:orange"> Paulo Finardi </span>
<span style="color:purple"> - Week 9 </span>

Colab notebook with a T5 model for English-to-Portuguese translation using the ParaCrawl dataset.

In [None]:
! nvidia-smi

Sat May 23 13:29:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [None]:
# General settings
model_name = "t5-small"
batch_size = 16
accumulate_grad_batches = 16
x_max_length = 256
y_max_length = 256

In [None]:
! pip install -q sacrebleu
! pip install -q pytorch-lightning
! pip install -q transformers



In [None]:
# Basics
import os
import gzip
import random
import nvidia_smi
import numpy as np
from google.colab import drive

# PyTorch
import torch 
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

# Evaluation metric and PyTorch Lightning
import sacrebleu
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Transformers
from transformers import T5ForConditionalGeneration, AdamW
from transformers import T5Tokenizer

# Typing
from typing import Dict, List, Tuple

In [None]:
manual_seed = 0
def deterministic(rep=True):
    if rep:
        random.seed(manual_seed)  # also seed Python's RNG, which random.shuffle uses
        np.random.seed(manual_seed)
        torch.manual_seed(manual_seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed(manual_seed)
            torch.cuda.manual_seed_all(manual_seed)
        torch.backends.cudnn.enabled = False 
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
        print(f'Deterministic experiment, seed: {manual_seed}')
    else:
        print('Random experiment')
deterministic()

Deterministic experiment, seed: 0


In [None]:
print(f"Pytorch Lightning Version: {pl.__version__}")
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
print(f"Device name: {nvidia_smi.nvmlDeviceGetName(handle)}")

def gpu_usage():
    global handle
    return str(nvidia_smi.nvmlDeviceGetUtilizationRates(handle).gpu) + '%'

Pytorch Lightning Version: 0.7.6
Device name: b'Tesla P100-PCIE-16GB'


In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Preparing the Data

In [None]:
! wget -nc https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2020s1/paracrawl_enpt_train.tsv.gz
! wget -nc https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2020s1/paracrawl_enpt_test.tsv.gz

--2020-05-23 13:29:47--  https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2020s1/paracrawl_enpt_train.tsv.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.203.128, 2607:f8b0:400c:c1a::80
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.203.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 106548256 (102M) [text/tab-separated-values]
Saving to: ‘paracrawl_enpt_train.tsv.gz’


2020-05-23 13:29:48 (116 MB/s) - ‘paracrawl_enpt_train.tsv.gz’ saved [106548256/106548256]

--2020-05-23 13:29:49--  https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2020s1/paracrawl_enpt_test.tsv.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.11.128, 2607:f8b0:400c:c16::80
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.11.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2139168 (2.0M) [text/tab-separated-values]
Saving to: ‘paracr

## Loading the dataset

We create the training split (20k pairs) and the validation split (2k pairs) artificially, by shuffling the training file and slicing it.



In [None]:
def load_text_pairs(path):
    text_pairs = []
    for line in gzip.open(path, mode='rt'):
        text_pairs.append(line.strip().split('\t'))
    return text_pairs

x_train_ = load_text_pairs('paracrawl_enpt_train.tsv.gz')
x_test  = load_text_pairs('paracrawl_enpt_test.tsv.gz')

# Shuffle the training set before splitting it into train/val.
random.shuffle(x_train_)
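
The loader above assumes every line holds exactly one tab separating the English and Portuguese sides. A minimal sketch of a stricter variant, assuming malformed lines should be skipped and counted rather than kept; `load_text_pairs_strict` is a hypothetical helper, not part of the original notebook:

```python
import gzip
from typing import List, Tuple


def load_text_pairs_strict(path: str) -> Tuple[List[Tuple[str, str]], int]:
    """Read a gzipped TSV and keep only lines with exactly two non-empty fields.

    Hypothetical helper for illustration; returns (pairs, number_of_skipped_lines).
    """
    pairs: List[Tuple[str, str]] = []
    skipped = 0
    # The context manager closes the file even if parsing fails midway.
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            fields = [field.strip() for field in line.rstrip('\n').split('\t')]
            if len(fields) != 2 or not all(fields):
                skipped += 1  # wrong number of columns, or one side is empty
                continue
            pairs.append((fields[0], fields[1]))
    return pairs, skipped
```

Returning the skipped count makes it easy to check whether the file is as clean as expected before training on it.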

In [None]:
# train set = 20k samples
# val set   = 2k samples

split = 20_000
x_train = x_train_[:split]
x_val   = x_train_[split: split+ 2_000]  
len(x_train), len(x_val), len(x_test)

(20000, 2000, 20000)
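
`random.shuffle` draws from Python's global generator, so the split changes between runs unless that generator is seeded. A minimal sketch of a reproducible split, assuming a local `random.Random` instance is acceptable; `split_pairs` and its parameters are hypothetical names, not from the original notebook:

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_pairs(pairs: Sequence[Pair], n_train: int, n_val: int,
                seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle a copy of `pairs` with a fixed seed and slice train/val from it."""
    if n_train + n_val > len(pairs):
        raise ValueError('not enough pairs for the requested split')
    rng = random.Random(seed)  # local generator: global random state is untouched
    shuffled = list(pairs)     # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val]


# Toy data standing in for the ParaCrawl pairs.
toy = [(f'en {i}', f'pt {i}') for i in range(10)]
train, val = split_pairs(toy, n_train=6, n_val=2)
print(len(train), len(val))  # 6 2
```

Because the two slices come from the same shuffled list, they cannot overlap by position.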

### Dataset


In [None]:
tokenizer = T5Tokenizer.from_pretrained(model_name)





In [None]:
class MyDataset(Dataset):
    def __init__(self, text_pairs: List[Tuple[str, str]], tokenizer = tokenizer,
                 x_max_length: int = 32, y_max_length: int = 32):
        self.tokenizer = tokenizer  
        self.text_pairs = text_pairs
        self.x_max_length = x_max_length
        self.y_max_length = y_max_length
        
    def __len__(self):
        return len(self.text_pairs)
    
    def __getitem__(self, idx):
        x, y = self.text_pairs[idx]
        tknzr_x = self.tokenizer.encode_plus(x, 
                    max_length=self.x_max_length,
                    pad_to_max_length=True,
                    return_token_type_ids=False,
                    return_tensors='pt')
        
        x_token_ids = tknzr_x['input_ids'][0]
        x_mask      = tknzr_x['attention_mask'][0]

        tknzr_y = self.tokenizer.encode_plus(y,
                    max_length=self.y_max_length,
                    pad_to_max_length=True,
                    return_token_type_ids=False,
                    return_tensors='pt')
        
        y_token_ids = tknzr_y['input_ids'][0]
        y_mask      = tknzr_y['attention_mask'][0]

        return (x_token_ids, x_mask, y_token_ids, y_mask, x, y)

## Testing the Dataset

In [None]:
t = tokenizer.tokenize('31/11/2020'); t

['▁31', '/11', '/', '2020']

In [None]:
tokenizer.encode(t)

[2664, 20223, 87, 22224]

In [None]:
tokenizer.decode(87)

'/'

In [None]:
text_pairs = [('trinta e um de março de dois mil e vinte', '31/03/2020')]

# text_pairs = [('we like pizza', 'eu gosto de pizza')]
dataset_debug = MyDataset(
    text_pairs=text_pairs,
    tokenizer=tokenizer,
    x_max_length=32, #x_max_length,
    y_max_length=32) #y_max_length)

dataloader_debug = DataLoader(dataset_debug, batch_size=10, shuffle=True, 
                              num_workers=0)

x_token_ids, x_mask, y_token_ids, y_mask, x, y = next(iter(dataloader_debug))
print('source_token_ids:\n', x_token_ids)
print('source_mask:\n', x_mask)
print('target_token_ids:\n', y_token_ids)
print('target_mask:\n', y_mask)

print('source_token_ids.shape:', x_token_ids.shape)
print('source_mask.shape:', x_mask.shape)
print('target_token_ids.shape:', y_token_ids.shape)
print('target_mask.shape:', y_mask.shape)

source_token_ids:
 tensor([[ 6467,    29,    17,     9,     3,    15,   561,    20,  3157, 24065,
            20,   103,   159, 15533,     3,    15,     3,   208,  2429,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]])
source_mask:
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])
target_token_ids:
 tensor([[ 2664, 31064, 22224,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]])
target_mask:
 tensor([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])
source_token_ids.shape: torch.Size([1, 32])
source_mask.shape: torch.Size([1, 32])
target_token_ids.shape: torch.Size([1, 32])
target_mask.shape: torch.Size([1, 32])
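
The tensors printed above follow a simple rule: real tokens keep their IDs and get mask value 1, and the remainder up to `max_length` is filled with the padding ID 0 and mask value 0. A minimal pure-Python sketch of that rule, for illustration only (the notebook itself relies on `encode_plus` for this); `pad_and_mask` is a hypothetical helper:

```python
from typing import List, Sequence, Tuple


def pad_and_mask(token_ids: Sequence[int], max_length: int,
                 pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Truncate to max_length, then right-pad with pad_id and build the mask."""
    ids = list(token_ids)[:max_length]
    n_pad = max_length - len(ids)
    mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, mask


# Target IDs for '31/03/2020', copied from the output above.
ids, mask = pad_and_mask([2664, 31064, 22224], max_length=32)
print(ids[:5], mask[:5])  # [2664, 31064, 22224, 0, 0] [1, 1, 1, 0, 0]
```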


### Datasets and Dataloaders

In [None]:
ds_train = MyDataset(text_pairs=x_train,
                     tokenizer=tokenizer,
                     x_max_length=x_max_length,
                     y_max_length=y_max_length)

ds_val =   MyDataset(text_pairs=x_val,
                     tokenizer=tokenizer,
                     x_max_length=x_max_length,
                     y_max_length=y_max_length)

ds_test =  MyDataset(text_pairs=x_test,
                     tokenizer=tokenizer,
                     x_max_length=x_max_length,
                     y_max_length=y_max_length)

dataloaders = {
    'train': DataLoader(ds_train,
                        batch_size=batch_size,
                        num_workers=4,
                        pin_memory=True),
    'val':   DataLoader(ds_val,
                        batch_size=batch_size,
                        num_workers=4,
                        pin_memory=False),
    'test':  DataLoader(ds_test,
                        batch_size=batch_size,
                        num_workers=4,
                        pin_memory=False),
               }

# sanity check
dl_sizes = {x: len(dataloaders[x]) for x in dataloaders.keys()}
dl_sizes 

{'test': 1250, 'train': 1250, 'val': 125}
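
The sizes above are the number of batches per split, not the number of samples. With the default `drop_last=False`, a `DataLoader` yields the ceiling of `samples / batch_size` batches. A short check of that arithmetic against the split sizes used in this notebook; `num_batches` is a hypothetical helper:

```python
import math


def num_batches(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a DataLoader yields for n_samples items."""
    if drop_last:
        return n_samples // batch_size      # incomplete final batch is discarded
    return math.ceil(n_samples / batch_size)


split_sizes = {'train': 20_000, 'val': 2_000, 'test': 20_000}
print({name: num_batches(n, batch_size=16) for name, n in split_sizes.items()})
# {'train': 1250, 'val': 125, 'test': 1250}
```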

## Building the T5 model with PyTorch Lightning

In [None]:
class T5Finetuner(pl.LightningModule):
    def __init__(self, tokenizer, dataloader):
        super(T5Finetuner, self).__init__()

        self.model      = T5ForConditionalGeneration.from_pretrained(model_name)
        self.dataloader = dataloader
        self.tokenizer  = tokenizer

    def forward(self, x_token_ids, x_mask, y_token_ids=None, y_mask=None):
        if self.training:
            outputs = self.model.forward(input_ids = x_token_ids, attention_mask = x_mask,
                                         lm_labels  = y_token_ids)
            return outputs[0] 
        else:
            predicted_token_ids = self.model.generate(input_ids = x_token_ids, attention_mask = x_mask,
                                                      max_length=128)
            return predicted_token_ids

    def configure_optimizers(self):
        return torch.optim.Adam([p for p in self.parameters() if p.requires_grad],lr=5e-3)

    def decode_token_ids(self, x_token_ids):
        translation = self.tokenizer.decode(x_token_ids,
                                            skip_special_tokens=True,
                                            clean_up_tokenization_spaces=False)
        return translation

    def training_step(self, batch, batch_nb):
        x_token_ids, x_mask, y_token_ids, y_mask, _, _ = batch
        loss = self(x_token_ids, x_mask, y_token_ids, y_mask)
        
        tensorboard_logs = {'train_loss': loss}
        progress_bar     = {'gpu_usage': gpu_usage()}
        return {'loss': loss, 'log': tensorboard_logs, 'progress_bar': progress_bar}

    def validation_step(self, batch, batch_nb):
        x_token_ids, x_mask, y_token_ids, y_mask, x, y = batch
        preds_token_ids  = self(x_token_ids, x_mask)
        preds = [self.decode_token_ids(token_ids) for token_ids in preds_token_ids]
        bleu_score       = sacrebleu.corpus_bleu(preds, [y]).score
        tensorboard_logs = {'val_bleu': bleu_score}
        progress_bar     = {'gpu_usage': gpu_usage()}
        return {'val_bleu': bleu_score, 'progress_bar': progress_bar, 'log':tensorboard_logs}

    def test_step(self, batch, batch_nb):
        x_token_ids, x_mask, y_token_ids, y_mask, x, y = batch
        preds_token_ids = self(x_token_ids, x_mask)
        preds = [self.decode_token_ids(token_ids) for token_ids in preds_token_ids]
        bleu_score   = sacrebleu.corpus_bleu(preds, [y]).score
        progress_bar = {'gpu_usage': gpu_usage()}
        return {'test_bleu': bleu_score, 'progress_bar': progress_bar}

    def validation_epoch_end(self, outputs):
        bleu_score       = sum([x['val_bleu'] for x in outputs]) / len(outputs)
        tensorboard_logs = {'avg_val_bleu': bleu_score}
        return {'avg_val_bleu': bleu_score, 'progress_bar': tensorboard_logs, 'log': tensorboard_logs}

    def training_epoch_end(self, outputs):
        avg_loss         = torch.stack([x['loss'] for x in outputs]).mean()
        tensorboard_logs = {'train_loss': avg_loss}
        return {'log': tensorboard_logs}
        
    def test_epoch_end(self, outputs):
        bleu_av          = sum([x['test_bleu'] for x in outputs]) / len(outputs)
        tensorboard_logs = {'avg_test_bleu': bleu_av}
        return {'avg_test_bleu': bleu_av, 'progress_bar': tensorboard_logs}
    
    def train_dataloader(self):
        return self.dataloader['train']
    
    def val_dataloader(self):
        return self.dataloader['val']
    
    def test_dataloader(self):
        return self.dataloader['test']

model = T5Finetuner(tokenizer, dataloaders)
del model

INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/t5-small-config.json from cache at /root/.cache/torch/transformers/26561bc9e840d8945f475d0d4c4b9df32025eadd79894b867b570cb1d09e67a9.3817cc1260a6b941b17af62b4f2a942b9825f209d8e2eed99e79e96f85f59aab
INFO:transformers.configuration_utils:Model config T5Config {
  "_num_labels": 2,
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "bad_words_ids": null,
  "bos_token_id": null,
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "do_sample": false,
  "dropout_rate": 0.1,
  "early_stopping": false,
  "eos_token_id": 1,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_factor": 1.0,
  "is_decoder": false,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_epsilon": 1e-06,
  "length_penalty": 1.0,
  "max_length": 20,
  "min_length": 0,
  "model_type
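
In `forward`, the padded `y_token_ids` are passed as labels unchanged, so padding positions also contribute to the loss, and `y_mask` goes unused. A common alternative is to replace padded label positions with the ignore index `-100`, which PyTorch's cross-entropy loss skips by default. The sketch below shows the idea on plain Python lists; `mask_padding_labels` is a hypothetical helper, and whether this improves results in this notebook is untested:

```python
from typing import List, Sequence

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding_labels(label_ids: Sequence[int], attention_mask: Sequence[int],
                        ignore_index: int = IGNORE_INDEX) -> List[int]:
    """Replace label IDs at padded positions (mask == 0) with ignore_index."""
    if len(label_ids) != len(attention_mask):
        raise ValueError('label_ids and attention_mask must have the same length')
    return [tok if m == 1 else ignore_index
            for tok, m in zip(label_ids, attention_mask)]


# First five target positions from the debug batch shown earlier.
labels = mask_padding_labels([2664, 31064, 22224, 0, 0], [1, 1, 1, 0, 0])
print(labels)  # [2664, 31064, 22224, -100, -100]
```

On tensors, an equivalent would likely be `y_token_ids.masked_fill(y_mask == 0, -100)` applied before the labels are passed to the model.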

## Overfitting on a few samples

In [None]:
trainer = pl.Trainer(gpus=1,
                     max_epochs=30,
                     check_val_every_n_epoch=10,
                     checkpoint_callback=False,  # Disable checkpoint saving
                     overfit_pct=0.005)

# Debug dataset built from the training pairs; overfit_pct above limits the run to a small fraction of it.
dataset_debug = MyDataset(text_pairs=x_train,
                          tokenizer=tokenizer,
                          x_max_length=x_max_length,
                          y_max_length=y_max_length)

debug_dataloader = DataLoader(dataset_debug, batch_size=batch_size,
                              shuffle=False, num_workers=4)

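# T5Finetuner looks up its dataloaders by split name, so wrap the debug loader in a dict.
model = T5Finetuner(tokenizer, {'train': debug_dataloader,
                                'val': debug_dataloader,
                                'test': debug_dataloader})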

trainer.fit(model)
del model  # Free the model to avoid running out of GPU memory

INFO:lightning:GPU available: True, used: True
INFO:lightning:CUDA_VISIBLE_DEVICES: [0]





## Training and validation on the full dataset

In [None]:
checkpoint_path = '/content/drive/My Drive/Colab Notebooks/Semana9/epoch=2.ckpt' 
checkpoint_dir = os.path.dirname(os.path.abspath(checkpoint_path))
print(f'Files in {checkpoint_dir}: {os.listdir(checkpoint_dir)}')
print(f'Saving checkpoints to {checkpoint_dir}')
checkpoint_callback = ModelCheckpoint(filepath=checkpoint_dir,
                                      save_top_k=-1)  # Keeps all checkpoints.

resume_from_checkpoint = None
if os.path.exists(checkpoint_path):
    print(f'Restoring checkpoint: {checkpoint_path}')
    resume_from_checkpoint = checkpoint_path

Files in /content/drive/My Drive/Colab Notebooks/Semana9: ['Paulo Finardi [Sem: 9].ipynb', 'Leitura Sem 9.pdf', 'Leitura Sem 9.gdoc', 'epoch=0.ckpt', 'epoch=0_v0.ckpt']
Saving checkpoints to /content/drive/My Drive/Colab Notebooks/Semana9
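
The cell above resumes only if the hard-coded `epoch=2.ckpt` exists, and the listing shows only `epoch=0.ckpt` and `epoch=0_v0.ckpt`, so this run starts from scratch. A minimal sketch of picking the most recent checkpoint from a directory listing instead; `latest_checkpoint` is a hypothetical helper, and it assumes the `epoch=N.ckpt` / `epoch=N_vK.ckpt` naming seen in the listing, with a higher `_vK` suffix taken to mean a later save of the same epoch:

```python
import re
from typing import List, Optional

# Matches 'epoch=3.ckpt' and 'epoch=3_v1.ckpt'; other files are ignored.
CKPT_RE = re.compile(r'^epoch=(\d+)(?:_v(\d+))?\.ckpt$')


def latest_checkpoint(filenames: List[str]) -> Optional[str]:
    """Return the checkpoint name with the highest (epoch, version), or None."""
    best_name, best_key = None, None
    for name in filenames:
        match = CKPT_RE.match(name)
        if match is None:
            continue
        epoch = int(match.group(1))
        # Assumption: an unsuffixed file predates any '_vK' file of the same epoch.
        version = int(match.group(2)) if match.group(2) is not None else -1
        key = (epoch, version)
        if best_key is None or key > best_key:
            best_name, best_key = name, key
    return best_name


# File names copied from the directory listing above.
files = ['Paulo Finardi [Sem: 9].ipynb', 'Leitura Sem 9.pdf', 'Leitura Sem 9.gdoc',
         'epoch=0.ckpt', 'epoch=0_v0.ckpt']
print(latest_checkpoint(files))  # epoch=0_v0.ckpt
```

The returned name could then be joined with `checkpoint_dir` and used as `resume_from_checkpoint` when it is not `None`.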


In [None]:
trainer = pl.Trainer(gpus=1,
                     max_epochs=2,
                     progress_bar_refresh_rate=60,
                     accumulate_grad_batches=8,
                     checkpoint_callback=checkpoint_callback,
                     resume_from_checkpoint=resume_from_checkpoint)

model = T5Finetuner(tokenizer, dataloaders)

trainer.fit(model)

INFO:lightning:GPU available: True, used: True
INFO:lightning:CUDA_VISIBLE_DEVICES: [0]





1
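
With gradient accumulation, the optimizer steps once every `accumulate_grad_batches` batches, so the effective batch size is the product of the two values. The arithmetic for the Trainer above, which passes 8 rather than the 16 defined in the general settings cell:

```python
batch_size = 16                 # from the general settings cell
accumulate_grad_batches = 8     # value actually passed to pl.Trainer above
train_batches_per_epoch = 1250  # len(dataloaders['train']) reported earlier

# Number of samples that contribute to each optimizer step.
effective_batch_size = batch_size * accumulate_grad_batches

# Complete accumulation windows per epoch; how the 2 leftover batches are
# handled depends on the PyTorch Lightning version, so it is not modeled here.
full_accumulation_windows = train_batches_per_epoch // accumulate_grad_batches

print(effective_batch_size)       # 128
print(full_accumulation_windows)  # 156
```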

## After training, we evaluate the model on the test dataset

This evaluation should be run only a few times, to avoid "manually overfitting" to the test dataset.

In [None]:
trainer.test(model)                         


--------------------------------------------------------------------------------
TEST RESULTS
{'avg_test_bleu': 19.118952502513597}
--------------------------------------------------------------------------------
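
`test_epoch_end` averages the BLEU of each batch, which is generally not the same as computing BLEU once over all test sentences, because BLEU is built from counts pooled over a corpus. The toy example below illustrates the gap with a simple unigram precision rather than BLEU itself; the function names and the sample sentences are made up for illustration:

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]  # (hypothesis, reference)


def unigram_counts(hyp: str, ref: str) -> Tuple[int, int]:
    """Return (clipped unigram matches, hypothesis length) for one pair."""
    hyp_tokens = hyp.split()
    ref_counts = Counter(ref.split())
    matched = sum(min(count, ref_counts[token])
                  for token, count in Counter(hyp_tokens).items())
    return matched, len(hyp_tokens)


def precision(pairs: List[Pair]) -> float:
    """Unigram precision with counts pooled over all pairs."""
    matched = total = 0
    for hyp, ref in pairs:
        m, t = unigram_counts(hyp, ref)
        matched += m
        total += t
    return matched / total if total else 0.0


batches = [
    [('eu gosto de pizza', 'eu gosto de pizza')],                          # 4 of 4 match
    [('o gato está em casa hoje à noite', 'o cachorro dorme no jardim')],  # 1 of 8 match
]

mean_of_batches = sum(precision(batch) for batch in batches) / len(batches)
pooled = precision([pair for batch in batches for pair in batch])

print(mean_of_batches)   # 0.5625
print(round(pooled, 4))  # 0.4167
```

The same effect applies to the reported `avg_test_bleu`: it is a mean of per-batch scores, so it may differ from a single corpus-level score over all test predictions.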



# <span style="color:purple">End of notebook</span>