<a href="https://colab.research.google.com/github/finardi/tutos/blob/master/banking77.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [158]:
!nvidia-smi

Thu Jun 24 19:37:21 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    41W / 300W |   7595MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [159]:
%%capture
!pip install transformers

In [160]:
import os
import gc
import numpy as np
import pandas as pd

import torch
from torch.utils.data import TensorDataset, DataLoader

from sklearn.metrics import f1_score, accuracy_score

from transformers import BertForSequenceClassification, BertTokenizer
from transformers import AdamW, get_linear_schedule_with_warmup

device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [161]:
path_base = '/content/drive/MyDrive/BACEN/data_translated/'

# TRAIN
data_train = pd.read_parquet(path_base+'banking77_ptbr_train')

# convert the label from str to int
data_train = data_train.assign(Label = data_train.Label.apply(lambda x: np.int64(x.replace('.', ''))))
data_train = data_train.sample(frac=1).reset_index(drop=True)

# TEST
data_test = pd.read_parquet(path_base+'banking77_ptbr_test')

# convert the label from str to int
data_test = data_test.assign(Label = data_test.Label.apply(lambda x: np.int64(x.replace('.', ''))))
data_test = data_test.sample(frac=1).reset_index(drop=True)

# - - - - -
print(data_train.shape, data_test.shape)
data_train

(10003, 2) (3080, 2)


,Data,Label
0,"Devolvi um item, mas não o vejo na minha conta?",51
1,Por que haveria uma cobrança extra no meu apli...,34
2,Preciso de um cartão de uso único para comprar...,37
3,A quantidade de dinheiro que recebi não corres...,75
4,Meu cartão ainda não chegou. O que eu faço?,11
...,...,...
9998,Quanto tempo até um cheque eu mandar autorização?,58
9999,Como faço com transferências bancárias para as...,65
10000,O que significa ter uma retirada pendente de d...,46
10001,Por que a transferência de dinheiro ainda não ...,66
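
The label cleanup in the loading cell removes every `.` from the label string and casts the result to an integer. The raw label format is not shown in this notebook, so the values below are hypothetical; this is only a sketch of what the rule does.

```python
# Hypothetical raw labels; the real raw format is not shown in this notebook.
raw_labels = ["51.", "34", "7."]

def to_int_label(raw: str) -> int:
    """Drop every '.' from the string, then parse what is left as an integer."""
    return int(raw.replace(".", ""))

clean_labels = [to_int_label(x) for x in raw_labels]
print(clean_labels)  # [51, 34, 7]
```

Because the rule strips every dot, a value such as `"5.1"` would also become `51`, so it is worth checking the raw column before relying on it.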


# Tokenize data

In [162]:
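def get_banking77_tokens(texts, labels, maxlen):
    """Tokenize each text to a fixed length; return (input_ids, labels) tensors.

    Relies on the global `tokenizer` defined in the next cell.
    """
    ids, targets = [], []
    for text, label in zip(texts, labels):
        tokens = tokenizer.encode_plus(
            text=text,
            truncation=True,
            max_length=maxlen,
            padding='max_length',
            return_tensors='pt'
        )
        targets.append(int(label))
        ids.append(tokens['input_ids'])  # shape (1, maxlen)

    ids = torch.vstack(ids)  # shape (n_texts, maxlen)
    return ids, torch.tensor(targets)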
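
The function above keeps only `input_ids`. The tokenizer also returns an `attention_mask` that marks which positions are real tokens and which are padding. The sketch below is a simplified, pure-Python illustration of what `padding='max_length'` with `truncation=True` produces for a single sentence; it is not the tokenizer's actual implementation, and `pad_or_truncate` is a hypothetical helper name. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`.

```python
from typing import List, Tuple

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def pad_or_truncate(token_ids: List[int], max_len: int) -> Tuple[List[int], List[int]]:
    """Wrap content ids in [CLS]/[SEP], cut to max_len, pad, and build the mask."""
    body = token_ids[: max_len - 2]        # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                  # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 = padding

ids, mask = pad_or_truncate([1051, 2072, 1010], max_len=8)
print(ids)   # [101, 1051, 2072, 1010, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

When no mask is passed to the model, BERT treats every position as a real token, so padded positions are attended to. Storing `tokens['attention_mask']` next to the ids and passing it as `attention_mask=` in the forward call is the usual way to avoid that.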

In [163]:
# paper: https://arxiv.org/pdf/2003.04807.pdf

BSIZE = 16
MAX_LEN = 32
path_model = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(path_model)

# Train
texts_train  = data_train.Data.to_list()
labels_train = data_train.Label.to_list()    
train_texts, train_labels = get_banking77_tokens(texts_train, labels_train, MAX_LEN)
ds_train = TensorDataset(train_texts, train_labels)

# Test
texts_test  = data_test.Data.to_list()
labels_test = data_test.Label.to_list() 
test_texts, test_labels = get_banking77_tokens(texts_test, labels_test, MAX_LEN)
ds_test = TensorDataset(test_texts, test_labels)

# Dataloaders
dataloader = {
    'train': DataLoader(
        dataset=ds_train,
        batch_size=BSIZE,
        shuffle=True,
        pin_memory=True,
        num_workers=os.cpu_count()
         ),
    'test': DataLoader(
        dataset=ds_test,
        batch_size=BSIZE,
        shuffle=False,  # evaluation does not need shuffling
        pin_memory=True,
        num_workers=os.cpu_count()
         )
}

# - - - - -
n_batches = {split: len(dl) for split, dl in dataloader.items()}
print(n_batches)
ids_batch, label_batch = next(iter(dataloader['train']))
ids_batch[0], label_batch[0]

{'train': 626, 'test': 193}


(tensor([  101,  1051,  2072,  1010,  9311,  7416,  2033, 22450,  2099, 21877,
          2721,  3539,  7895,  2310,  2480,  7570,  6460,  1006,  2061,  2226,
          8529, 24576,  7396,  2063,  1007,  1010,  1041,  1037,  9099, 19629,
          2080,   102]), tensor(47))
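
The batch counts printed above follow from the dataset sizes and `BSIZE`. A quick check, assuming the `DataLoader` default `drop_last=False`, which keeps the final partial batch:

```python
import math

BSIZE = 16
n_train, n_test = 10003, 3080  # row counts printed by the data-loading cell

# With drop_last=False the last, smaller batch is counted too, hence the ceiling.
n_batches = {
    "train": math.ceil(n_train / BSIZE),
    "test": math.ceil(n_test / BSIZE),
}
last_batch = {"train": n_train % BSIZE, "test": n_test % BSIZE}

print(n_batches)   # {'train': 626, 'test': 193}
print(last_batch)  # {'train': 3, 'test': 8}
```

The last training batch holds only 3 examples, which matters for any metric that is computed per batch and then averaged.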

In [164]:
model = BertForSequenceClassification.from_pretrained(
    path_model, 
    num_labels=data_train.Label.nunique(), 
    return_dict=True
    )

# - - - - -
outs = model(input_ids=ids_batch, labels=label_batch)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
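
Calling the model with `labels` returns a cross-entropy loss in `outs['loss']`. A useful reference point for that sanity check is the loss of a classifier that predicts all classes with equal probability. The sketch assumes 77 classes, as in the BANKING77 paper; the value of `data_train.Label.nunique()` is not printed in this notebook.

```python
import math

num_labels = 77  # assumption: BANKING77 has 77 intent classes

# Cross-entropy of a uniform prediction: -log(1 / num_labels) = log(num_labels)
uniform_loss = -math.log(1.0 / num_labels)
print(round(uniform_loss, 3))  # 4.344
```

A freshly initialized classification head should give a loss close to this value; a much larger number on the first batch usually points to a label or shape problem.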

In [165]:
# Free the model left over from the sanity-check cell before training on the GPU.
try:
    del model
    gc.collect()
    torch.cuda.empty_cache()
except NameError:
    pass

def train(model, loader, optim, scheduler):
    total_acc, total_f1, total_loss = 0,0,0
    model.train()
    for batch in loader:
        model.zero_grad()
        
        batch_ids, batch_label = (b.to(device) for b in batch)
        
        outs = model(input_ids=batch_ids, labels=batch_label)
        
        loss = outs['loss']
        total_loss += loss.item()
        
        loss.backward()
        optim.step()
        scheduler.step()
        
        # argmax over the logits; softmax is monotonic, so it would not change the prediction
        y_pred = outs['logits'].detach().argmax(dim=-1).cpu().numpy()
        total_f1  += f1_score(y_true=batch_label.cpu(), y_pred=y_pred, average='macro')
        total_acc += accuracy_score(y_true=batch_label.cpu(), y_pred=y_pred)
    
    avg_loss = total_loss/len(loader)
    avg_f1 = total_f1/len(loader)
    avg_acc = total_acc/len(loader)

    return avg_loss, avg_f1, avg_acc

def evaluate(model, loader):
    total_acc, total_f1, total_loss = 0,0,0
    model.eval()
    for batch in loader:
        batch_ids, batch_label = (b.to(device) for b in batch)
        
        with torch.no_grad():
            outs = model(input_ids=batch_ids, labels=batch_label)
        
        loss = outs['loss']
        total_loss += loss.item()
        
        # argmax over the logits; softmax is monotonic, so it would not change the prediction
        y_pred = outs['logits'].detach().argmax(dim=-1).cpu().numpy()
        total_f1  += f1_score(y_true=batch_label.cpu(), y_pred=y_pred, average='macro')
        total_acc += accuracy_score(y_true=batch_label.cpu(), y_pred=y_pred)
    
    avg_loss = total_loss/len(loader)
    avg_f1 = total_f1/len(loader)
    avg_acc = total_acc/len(loader)

    return avg_loss, avg_f1, avg_acc

# - - - - -
model = BertForSequenceClassification.from_pretrained(
    path_model, 
    num_labels=data_train.Label.nunique(), 
    return_dict=True
    ).to(device)

optim = AdamW(model.parameters(), lr=5e-5, eps=1e-8)

N_EPOCHS = 3

total_steps = len(dataloader['train']) * N_EPOCHS  # one scheduler step per training batch
scheduler = get_linear_schedule_with_warmup(
    optim, 
    num_warmup_steps = int(0.02 * total_steps), # 2% of warmup
    num_training_steps = total_steps
    )

training_stats = []
# ----------------------------------------------------------------------------------------
for idx in range(N_EPOCHS):
    loss_train, f1_train, acc_train = train(
        model, 
        dataloader['train'], 
        optim, 
        scheduler=scheduler
        )
    loss_eval,  f1_eval,  acc_eval = evaluate(
        model, 
        dataloader['test']
        )
    
    print(f'Epoch {idx}/{N_EPOCHS-1}:')
    print(f'\tTrain: Loss: {loss_train:<5.3} -- F1: {f1_train:<5.3} -- ACC: {acc_train:.3}')
    print(f'\tEval:  Loss: {loss_eval:<5.3} -- F1: {f1_eval:<5.3} -- ACC: {acc_eval:.3}')

    training_stats.append(
        {
            'epoch': idx,
            'Training Loss': loss_train,
            'Training Acc': acc_train,
            'Training F1': f1_train,
            'Eval Loss': loss_eval,
            'Eval Acc': acc_eval,
            'Eval F1': f1_eval,
        }
    )

Epoch 0/2:
	Train: Loss: 3.3   -- F1: 0.149 -- ACC: 0.226
	Eval:  Loss: 1.93  -- F1: 0.386 -- ACC: 0.529
Epoch 1/2:
	Train: Loss: 1.33  -- F1: 0.555 -- ACC: 0.693
	Eval:  Loss: 0.972 -- F1: 0.634 -- ACC: 0.76
Epoch 2/2:
	Train: Loss: 0.679 -- F1: 0.741 -- ACC: 0.84
	Eval:  Loss: 0.652 -- F1: 0.725 -- ACC: 0.828
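
The training cell steps a linear schedule with warmup once per batch. The sketch below re-implements the learning-rate multiplier in plain Python to show the shape of that schedule for this run. It is written from the documented behavior of `get_linear_schedule_with_warmup` (linear ramp from 0 to 1 during warmup, then linear decay to 0) and is not the library code; `linear_warmup_factor` is a hypothetical name.

```python
def linear_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear ramp 0 -> 1
    # linear decay 1 -> 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 5e-5
N_EPOCHS = 3
steps_per_epoch = 626                      # train batches reported earlier
total_steps = steps_per_epoch * N_EPOCHS   # 1878
warmup_steps = int(0.02 * total_steps)     # 37

for step in (0, warmup_steps, total_steps // 2, total_steps):
    lr = BASE_LR * linear_warmup_factor(step, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

With these settings the peak rate of 5e-5 is reached after 37 steps, so almost the whole run is spent in the decay phase.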


In [166]:
df_stats = pd.DataFrame(data=training_stats)
df_stats = df_stats.set_index('epoch')
pd.set_option('display.precision', 3)
df_stats

epoch,Training Loss,Training Acc,Training F1,Eval Loss,Eval Acc,Eval F1
0,3.299,0.226,0.149,1.928,0.529,0.386
1,1.326,0.693,0.555,0.972,0.76,0.634
2,0.679,0.84,0.741,0.652,0.828,0.725
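
The F1 columns above are averages of macro F1 computed separately on each batch of 16 examples. With many classes and small batches, a single mistake in a batch adds a class with F1 equal to 0, so the per-batch average tends to sit below the F1 computed once over all predictions of the epoch. The toy sketch below shows the difference with a small pure-Python macro F1 (averaging over every class seen in the labels or predictions); the data is made up and `macro_f1` is a hypothetical helper, not the scikit-learn function.

```python
from typing import List

def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: two batches of four predictions each (hypothetical values).
batches = [
    ([0, 1, 2, 3], [0, 1, 2, 0]),  # (y_true, y_pred): one mistake
    ([0, 1, 2, 3], [0, 1, 2, 3]),  # all correct
]

# Strategy used in the notebook: score each batch, then average the scores.
per_batch = sum(macro_f1(t, p) for t, p in batches) / len(batches)

# Alternative: pool all labels and predictions, then score once.
all_true = [t for ts, _ in batches for t in ts]
all_pred = [p for _, ps in batches for p in ps]
pooled = macro_f1(all_true, all_pred)

print(f"per-batch average: {per_batch:.3f}")  # 0.833
print(f"pooled over epoch: {pooled:.3f}")     # 0.867
```

In the notebook this would mean appending `y_pred` and the batch labels to two lists inside the loop and calling `f1_score` once after it.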
