<a href="https://colab.research.google.com/github/finardi/IA376A/blob/master/XLNET-IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# <span style="color:orange"> Paulo Finardi </span>
<span style="color:purple"> - Week 8 </span>

I used **XLNet**
---
- <span style="color:purple"> **XLNet** </span>(*Google AI Brain and Carnegie Mellon*) is reported to outperform BERT on 20 tasks. XLNet maximizes the expected log-likelihood of a sequence with respect to all possible permutations of the factorization order (not of the token positions). The factorization order is implemented through attention masks (equation 2 of the paper). With this permutation operation, the context for each position can include tokens from both the left and the right, so each position learns to use contextual information from all positions and thereby captures bidirectional context. Paper [here](https://arxiv.org/pdf/1906.08237.pdf)
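
To make the permutation idea concrete, here is a minimal sketch of how one sampled factorization order can be turned into a visibility mask: position `i` may look at position `j` only if `j` is predicted earlier than `i` in that order. This is an illustration under simplifying assumptions, not the paper's or the library's implementation; the name `factorization_mask` and the toy order are hypothetical.

```python
def factorization_mask(order):
    """Return mask[i][j] = 1 if position i may see position j under `order`.

    `order` is a permutation of range(n): order[k] is the position predicted
    at step k. A position sees only positions predicted at earlier steps.
    """
    n = len(order)
    step_of = {pos: step for step, pos in enumerate(order)}
    return [[1 if step_of[j] < step_of[i] else 0 for j in range(n)]
            for i in range(n)]

# Toy example: 4 tokens, factorization order 2 -> 0 -> 3 -> 1 (hypothetical)
mask = factorization_mask([2, 0, 3, 1])
for row in mask:
    print(row)
```

In this toy order, position 1 is predicted last, so its row lets it see position 0 on its left and positions 2 and 3 on its right, which is how a permuted order exposes context from both sides without changing the token positions.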



###  Installing PyTorch Lightning 

In [None]:
! pip install -q pytorch-lightning

# Pytorch Lightning 
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.callbacks import EarlyStopping
import pytorch_lightning as pl

In [None]:
# Hugging Face Transformers package
! pip install -q transformers

In [None]:
# Basics
import re
import os
import sys
import random
import functools, traceback
import pandas as pd
import numpy as np
from collections import OrderedDict
from multiprocessing import cpu_count

# PyTorch
import torch 
import torch.nn.functional as F
from torch.utils.data import (TensorDataset, DataLoader,
                              RandomSampler, SequentialSampler)

# Transformers
from transformers import get_linear_schedule_with_warmup, AdamW
from transformers import XLNetTokenizer, XLNetForSequenceClassification

# Sklearn
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

INFO:transformers.file_utils:PyTorch version 1.5.0+cu101 available.
INFO:transformers.file_utils:TensorFlow version 2.2.0-rc3 available.


In [None]:
manual_seed = 0
def deterministic(rep=True):
    if rep:
        np.random.seed(manual_seed)
        torch.manual_seed(manual_seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed(manual_seed)
            torch.cuda.manual_seed_all(manual_seed)
        torch.backends.cudnn.enabled = False 
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
        print(f'Deterministic experiment, seed: {manual_seed}')
    else:
        print('Random experiment')

deterministic()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
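# get_device_name raises without a GPU, so fall back to the device type
print(f'Using device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu"}')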
print(f'# CPU cores: {cpu_count()}')

Deterministic experiment, seed: 0
Using device: Tesla P100-PCIE-16GB
# CPU cores: 2


In [None]:
# https://docs.fast.ai/troubleshoot.html#memory-leakage-on-exception

def gpu_mem_restore(func):
    "Reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except:
            type, val, tb = sys.exc_info()
            traceback.clear_frames(tb)
            raise type(val).with_traceback(tb) from None
    return wrapper

## <span style="color:orange"> Preparing the data </span>
The dataset is stored on Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!tar -xzf '/content/drive/My Drive/Colab Notebooks/Bert/aclImdb.tgz'

In [None]:
# Loading the dataset

def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        # Explicit encoding so the result does not depend on the platform default
        with open(os.path.join(folder, path), encoding='utf-8') as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')

x_test_pos  = load_texts('aclImdb/test/pos')
x_test_neg  = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
y_train = [1] * len(x_train_pos) + [0] * len(x_train_neg)

x_test = x_test_pos + x_test_neg
y_test = [1] * len(x_test_pos) + [0] * len(x_test_neg)

In [None]:
# Building DataFrames with pandas

df_train = pd.DataFrame({'Review': x_train, 'label': y_train})
df_test  = pd.DataFrame({'Review' : x_test, 'label': y_test}) 
df_train.head()

|   | Review | label |
|---|--------|-------|
| 0 | Story of Ireland in the 70/s. This film is a b... | 1 |
| 1 | Melvyn Douglas once more gives a polished perf... | 1 |
| 2 | I have seen the movie Holes and say that it ha... | 1 |
| 3 | "Gunga Din": one of the greatest adventure sto... | 1 |
| 4 | I was not expecting the powerful filmmaking ex... | 1 |


## <span style="color:orange"> Function that tokenizes and builds the masks: tok_with_masks </span>


### Input layout

---

$~~~~~~~~~$ **BERT:**  $~~[CLS]~~$  + $~~$ *tokens* $~~$ + $~~[SEP]~~$ + $~~$ *padding*

---

$~~~~~~~~~$ **XLNet:** $~~$*padding*$~~$ +$~~$ *tokens*$~~$ + $~~[SEP]~~$ + $~~[CLS]$

--- 

BERT is not implemented in this notebook; its layout is shown only for comparison.
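
As a small illustration of the XLNet layout above, the sketch below left-pads a short sequence and builds the matching attention mask. The helper name `xlnet_layout` and all token ids are hypothetical values chosen for readability; a real run takes them from the tokenizer.

```python
def xlnet_layout(token_ids, max_len, sep_id, cls_id, pad_id):
    """Left-pad `token_ids` to `max_len` in the order: padding, tokens, SEP, CLS."""
    # Reserve two slots for the separator and classification tokens
    body = token_ids[:max_len - 2] + [sep_id, cls_id]
    n_pad = max_len - len(body)
    ids = [pad_id] * n_pad + body
    # 0.0 marks padding, 1.0 marks positions the model should attend to
    mask = [0.0] * n_pad + [1.0] * len(body)
    return ids, mask

# Hypothetical ids, used only for illustration
ids, mask = xlnet_layout([11, 12, 13], max_len=8, sep_id=4, cls_id=3, pad_id=5)
print(ids)   # [5, 5, 5, 11, 12, 13, 4, 3]
print(mask)  # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Keeping the classification token at the very end means the last position always carries the summary used for classification, regardless of how much padding precedes the text.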


In [None]:
MAX_LEN = 500

# Replace HTML line breaks, hyphens and slashes with a space
REP_HTML_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    return [REP_HTML_SPACE.sub(" ", r) for r in reviews]

def tok_with_masks(df):
    tok = XLNetTokenizer.from_pretrained('xlnet-base-cased')
    sentences = preprocess_reviews(df.Review.values)
    tokenized = [tok.tokenize(s) for s in sentences]
    # Truncate, then append the separator and classification tokens
    tokens = [t[:(MAX_LEN-2)] + [tok.sep_token] + [tok.cls_token] for t in tokenized] 
    ids = [tok.convert_tokens_to_ids(t) for t in tokens] 
    # Mask built from sequence lengths (not from ids), so real tokens are never masked:
    # 0.0 over the left padding, 1.0 over the tokens
    masks = [[0.0] * (MAX_LEN - len(i)) + [1.0] * len(i) for i in ids]
    # Left-pad with the tokenizer's own pad id
    ids = np.array([np.pad(i, (MAX_LEN-len(i), 0), mode='constant',
                           constant_values=tok.pad_token_id) for i in ids])

    return ids, masks, df.label.values

#-------------------------------#
# Tokenizing and creating masks #
#-------------------------------#
ids_train_xlnet, masks_train_xlnet, labels_train_xlnet = tok_with_masks(df_train)
ids_test_xlnet,  masks_test_xlnet,  labels_test_xlnet  = tok_with_masks(df_test)



## <span style="color:orange"> Function that builds the datasets (ds) </span>
- training dataset
- validation dataset
- test dataset
- debug dataset (a small dataset taken from the training set), used to debug the network
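
The split described above can be sketched in plain Python as follows. This is a simplified stand-in for the `train_test_split`-based code in the notebook, not a replacement for it: the helper `split_parallel` and the toy data are hypothetical, and the point is only that one shuffled index list is applied to ids, masks and labels so the three stay aligned.

```python
import random

def split_parallel(ids, masks, labels, val_frac=0.2, debug_size=64, seed=0):
    """Shuffle indices once and apply the same split to all three sequences."""
    idx = list(range(len(ids)))
    random.Random(seed).shuffle(idx)
    n_val = int(len(idx) * val_frac)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    debug_idx = train_idx[:debug_size]  # the debug set is a slice of the training set

    def take(indices):
        return ([ids[i] for i in indices],
                [masks[i] for i in indices],
                [labels[i] for i in indices])

    return take(train_idx), take(val_idx), take(debug_idx)

# Toy data: example i has ids [i] and label i % 2, so alignment is easy to check
n = 100
toy_ids = [[i] for i in range(n)]
toy_masks = [[1.0] for _ in range(n)]
toy_labels = [i % 2 for i in range(n)]
train, val, debug = split_parallel(toy_ids, toy_masks, toy_labels, debug_size=8)
print(len(train[0]), len(val[0]), len(debug[0]))  # 80 20 8
```

Because the debug examples are drawn from the training set, fitting them perfectly only shows that the model and training loop can converge; it says nothing about generalization.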

In [None]:
def make_ds(ids, masks, labels, val=True):
    if val:
        # A single call splits ids, masks and labels together, keeping them aligned
        (t_inputs, v_inputs,
         t_masks, v_masks,
         t_labels, v_labels) = train_test_split(
            ids,
            masks,
            labels,
            random_state=manual_seed,
            test_size=0.2
            )
        t_inputs = torch.tensor(t_inputs); v_inputs = torch.tensor(v_inputs)
        t_masks  = torch.tensor(t_masks);  v_masks  = torch.tensor(v_masks)
        t_labels = torch.tensor(t_labels); v_labels = torch.tensor(v_labels)

        # The debug set is a small slice of the training set
        d_inputs = t_inputs[:64]
        d_labels = t_labels[:64]
        d_masks  = t_masks[:64]

        train_ds = TensorDataset(t_inputs, t_masks, t_labels)
        val_ds   = TensorDataset(v_inputs, v_masks, v_labels)
        debug_ds = TensorDataset(d_inputs, d_masks, d_labels)

        return train_ds, val_ds, debug_ds

    test_inputs = torch.tensor(ids)
    test_labels = torch.tensor(labels)
    test_masks  = torch.tensor(masks)

    test_ds = TensorDataset(test_inputs, test_masks, test_labels)

    return test_ds

#-------------------#
# Creating datasets #
#-------------------#
train_ds_xlnet, val_ds_xlnet, debug_ds_xlnet = make_ds(ids_train_xlnet, masks_train_xlnet, labels_train_xlnet, val=True)
test_ds_xlnet = make_ds(ids_test_xlnet,  masks_test_xlnet,  labels_test_xlnet, val=False)



## <span style="color:orange"> Dataloaders
- training
- validation
- test
- debug

In [None]:
BATCH_SZ = 4

@gpu_mem_restore
def make_dl(ds_train, ds_val, ds_debug, ds_test, batch_sz=BATCH_SZ):
    train_sampler = RandomSampler(ds_train)
    val_sampler   = SequentialSampler(ds_val)
    test_sampler  = SequentialSampler(ds_test)

    dataloaders = {
    
        'train': DataLoader(ds_train,
                            sampler = train_sampler, 
                            batch_size=batch_sz,
                            num_workers=4,
                            pin_memory=True),
        'val':   DataLoader(ds_val,
                            sampler=val_sampler,
                            batch_size=batch_sz,
                            num_workers=4,
                            pin_memory=True),
       'debug':  DataLoader(ds_debug, 
                            shuffle= True, 
                            batch_size=batch_sz,
                            num_workers=1,
                            pin_memory=True),
        'test':  DataLoader(ds_test, 
                            sampler=test_sampler,
                            batch_size=batch_sz,
                            num_workers=4,
                            pin_memory=True)
                 }
    return dataloaders

#----------------------#
# Creating dataloaders #
#----------------------#
dataloaders_XLNET = make_dl(train_ds_xlnet, val_ds_xlnet, debug_ds_xlnet, test_ds_xlnet)

# sanity check
dl_sizes_xlnet = {x: len(dataloaders_XLNET[x]) for x in dataloaders_XLNET.keys()}
dl_sizes_xlnet 

{'debug': 16, 'test': 6250, 'train': 5000, 'val': 1250}
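
The sizes printed by the sanity check can be reproduced by hand. The sketch below assumes the standard aclImdb split of 25,000 training and 25,000 test reviews, the 80/20 train/validation split, the 64-example debug slice and a batch size of 4; the helper `n_batches` is a hypothetical name.

```python
import math

def n_batches(n_examples, batch_size):
    """Number of batches when the last partial batch is kept."""
    return math.ceil(n_examples / batch_size)

BATCH = 4
N_TRAIN_FULL, N_TEST, N_DEBUG = 25_000, 25_000, 64  # assumed dataset sizes
n_val = int(N_TRAIN_FULL * 0.2)   # 20% held out for validation
n_train = N_TRAIN_FULL - n_val

expected = {
    'debug': n_batches(N_DEBUG, BATCH),
    'test': n_batches(N_TEST, BATCH),
    'train': n_batches(n_train, BATCH),
    'val': n_batches(n_val, BATCH),
}
print(expected)  # {'debug': 16, 'test': 6250, 'train': 5000, 'val': 1250}
```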

# <span style="color:orange"> Model with PyTorch Lightning </span>

In [None]:
class Model(pl.LightningModule):
    def __init__(self, debug=False):
        super(Model, self).__init__()
        self.model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased')
        
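        if debug:
            # In debug mode every split reads from the small debug dataloader
            debug_dl = dataloaders_XLNET['debug']
            self.dataloader = {'train': debug_dl, 'val': debug_dl, 'test': debug_dl}
        else:
            self.dataloader = dataloaders_XLNET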
    
    def forward(self, input_ids, mask, labels):
        _, logits = self.model(input_ids,
                               attention_mask=mask, 
                               labels=labels)
        return logits
    
    def configure_optimizers(self):
        param_optimizer = list(self.model.named_parameters())
        # Parameters whose names match these substrings are excluded from weight decay
        no_decay = ["bias", "gamma", "beta"]
        # AdamW reads the "weight_decay" key of each group; other keys are ignored
        optimizer_grouped_parameters = [
        {
            "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
            "weight_decay": 0.01
            },
        {
            "params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)
        return optimizer
    
    def training_step(self, batch, batch_idx):
        input_ids, mask, labels = batch
        loss, _ = self.model(input_ids,
                             attention_mask=mask,
                             labels=labels
                             )
        tqdm_dict = {"train_loss": loss}
        output = OrderedDict({
            "loss": loss,
            "progress_bar": tqdm_dict,
            "log": tqdm_dict
            })
        return output
    
    def validation_step(self, batch, batch_idx):
        input_ids, mask, labels = batch
        loss, logits = self.model(
                input_ids,
                attention_mask=mask,
                labels=labels
                )
        labels_hat = torch.argmax(logits, dim=1)
        correct_count = torch.sum(labels == labels_hat)
        if self.on_gpu:
            correct_count = correct_count.cuda(loss.device.index)

        output = OrderedDict({
            "val_loss": loss,
            "correct_count": correct_count,
            "batch_size": len(labels)
            })
        return output
    
    def validation_epoch_end(self, outputs):
        val_acc  = sum([out["correct_count"] for out in outputs]).float()/sum(out["batch_size"] for out in outputs)
        val_loss = sum([out["val_loss"] for out in outputs]) / len(outputs)
        tqdm_dict = {
                "val_loss": val_loss,
                "val_acc": val_acc,
                }
        result = {"progress_bar": tqdm_dict, "log": tqdm_dict, "val_loss": val_loss}
        return result
    
    def test_step(self, batch, batch_idx):
        input_ids, mask, labels = batch
        loss, logits = self.model(
                input_ids,
                attention_mask=mask,
                labels=labels
                )
        labels_hat = torch.argmax(logits, dim=1)
        correct_count = torch.sum(labels == labels_hat)

        if self.on_gpu:
            correct_count = correct_count.cuda(loss.device.index)

        output = OrderedDict({
            "test_loss": loss,
            "correct_count": correct_count,
            "batch_size": len(labels)
            })
        return output
    
    def test_epoch_end(self, outputs):
        test_acc = sum([out["correct_count"] for out in outputs]).float() / sum(out["batch_size"] for out in outputs)
        test_loss = sum([out["test_loss"] for out in outputs]) / len(outputs)
        tqdm_dict = {
                "test_loss": test_loss,
                "test_acc": test_acc,
                    }
        result = {"progress_bar": tqdm_dict, "log": tqdm_dict}
        return result
    
    @gpu_mem_restore
    def train_dataloader(self):
        return self.dataloader['train']
    
    @gpu_mem_restore
    def val_dataloader(self):
        return self.dataloader['val']
    
    @gpu_mem_restore
    def test_dataloader(self):
        return self.dataloader['test']

model = Model()

INFO:transformers.modeling_utils:Weights of XLNetForSequenceClassification not initialized from pretrained model: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.weight', 'logits_proj.bias']
INFO:transformers.modeling_utils:Weights from pretrained model not used in XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
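
The `validation_epoch_end` and `test_epoch_end` hooks above combine per-batch results into epoch-level metrics. The sketch below mirrors that aggregation with plain numbers instead of tensors; the function name `aggregate` and the sample values are hypothetical.

```python
def aggregate(outputs):
    """Combine per-batch results into epoch-level accuracy and mean loss."""
    total_correct = sum(o["correct_count"] for o in outputs)
    total_seen = sum(o["batch_size"] for o in outputs)
    # Loss is averaged over batches, so a smaller final batch weighs as much as a full one
    mean_loss = sum(o["loss"] for o in outputs) / len(outputs)
    return total_correct / total_seen, mean_loss

# Made-up per-batch results, with a smaller final batch
outputs = [
    {"loss": 0.5, "correct_count": 3, "batch_size": 4},
    {"loss": 0.25, "correct_count": 4, "batch_size": 4},
    {"loss": 1.0, "correct_count": 1, "batch_size": 2},
]
acc, loss = aggregate(outputs)
print(acc)  # 0.8
```

Counting correct predictions and dividing by the number of examples seen gives an exact accuracy even when the last batch is smaller, whereas averaging per-batch accuracies would not.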


# <span style="color:orange"> XLNet model (Google/CMU) </span>


## <span style="color:orange"> Training on the debug dataloader: convergence check </span>

In [None]:
torch.cuda.empty_cache()
model = Model(debug=True)

trainer = pl.Trainer(gpus=1, 
                     checkpoint_callback=False, 
                     max_epochs=15)
trainer.fit(model)

In [None]:
trainer.test(model)

# free GPU memory
del model 
torch.cuda.empty_cache()






--------------------------------------------------------------------------------
TEST RESULTS
{'test_acc': 1.0, 'test_loss': 0.0016988813877105713}
--------------------------------------------------------------------------------



## <span style="color:orange"> Full XLNet training </span>


In [None]:
# del model
max_epochs = 2

#------- resuming from a checkpoint saved in a previous training run --------
path = '/content/drive/My Drive/Colab Notebooks/Bert/epoch=2.ckpt' 

checkpoint_path = path 
checkpoint_dir = os.path.dirname(os.path.abspath(checkpoint_path))
print(f'Files in {checkpoint_dir}: {os.listdir(checkpoint_dir)}')
print(f'Saving checkpoints to {checkpoint_dir}')
checkpoint_callback = ModelCheckpoint(filepath=checkpoint_dir,
                                      save_top_k=-1)  # Keeps all checkpoints.

resume_from_checkpoint = None  # train from scratch unless the checkpoint file exists
if os.path.exists(checkpoint_path):
    print(f'Restoring checkpoint: {checkpoint_path}')
    resume_from_checkpoint = checkpoint_path

trainer = pl.Trainer(gpus=1,
                     max_epochs=max_epochs,
                     check_val_every_n_epoch=1,
                     accumulate_grad_batches=4,
                     checkpoint_callback=checkpoint_callback,
                     resume_from_checkpoint=resume_from_checkpoint)

model = Model()

trainer.fit(model)
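
With `accumulate_grad_batches=4`, gradients from four consecutive batches are accumulated before each optimizer step. The arithmetic below sketches what that implies for this run, assuming the batch size of 4 and the 5000 training batches reported by the sanity check; the variable names are illustrative.

```python
import math

BATCH_SZ = 4            # examples per dataloader batch
ACCUMULATE = 4          # batches accumulated per optimizer step
TRAIN_BATCHES = 5000    # training batches per epoch

effective_batch = BATCH_SZ * ACCUMULATE
steps_per_epoch = math.ceil(TRAIN_BATCHES / ACCUMULATE)

print(effective_batch)   # 16
print(steps_per_epoch)   # 1250
```

Accumulation lets a small per-step batch fit in GPU memory while each weight update still reflects 16 examples.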

# Performance: test set

In [None]:
# XLNet on the test set
trainer.test(model)




--------------------------------------------------------------------------------
TEST RESULTS
{'test_acc': tensor(0.9503, device='cuda:0'),
 'test_loss': tensor(0.1414, device='cuda:0')}
--------------------------------------------------------------------------------



# <span style="color:purple">End of notebook </span>