# Building a hate speech model with pre-trained BERT


We will use a pre-trained BERT model (BERTimbau, pre-trained on Portuguese text) to classify hate speech.
More precisely, we aim to reproduce, in TensorFlow, the code by Diogo Cortiz (which uses PyTorch); our copy of that code is [here](https://colab.research.google.com/drive/18YXlk-ZIlAymoOYn5nJQE16I3SsguUwq).

In [1]:
# To run on Colab:
#!pip install transformers
#!pip install datasets

In [2]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
# Hugging Face:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from datasets import Dataset
from transformers import DefaultDataCollator



In [3]:
# To run on Colab:
# To download a file from Google Drive:
#!pip install -U -q PyDrive

#from pydrive.auth import GoogleAuth
#from pydrive.drive import GoogleDrive
#from google.colab import auth
#from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client:
##auth.authenticate_user()
#gauth = GoogleAuth()
#gauth.credentials = GoogleCredentials.get_application_default()
#drive = GoogleDrive(gauth)

## Functions

In [4]:
###########################################
### Splitting datasets into random sets ###
###########################################

def shuffled_pos(length, seed):
    """
    Return indices from 0 to `length` - 1 in a shuffled state, given random `seed`.
    """
    return np.random.RandomState(seed=seed).permutation(length)


def random_index_sets(size, set_fracs, seed):
    """
    Return sets of random indices (from 0 to `size` - 1) with lengths 
    given by ~ `size` * `set_fracs`.
    
    
    Input
    -----
    
    size : int
        The size of the index list to split into sets.
        
    set_fracs : iterable
        The fractions of the list of indices that each index set 
        should contain. 
    
    seed : int
        The seed for the random number generator.
        
        
    Returns
    -------
    
    indices : tuple of arrays
        The indices for each set.
    """
    
    assert np.isclose(np.sum(set_fracs), 1), '`set_fracs` should add up to one.'
    
    # Create randomized list of indices:
    shuffled_indices = shuffled_pos(size, seed)
    
    
    indices   = []
    set_start = [0]
    # Determine the sizes of the sets:
    set_sizes = [round(size * f) for f in set_fracs]
    set_sizes[0] = size - sum(set_sizes[1:])
    assert np.sum(set_sizes) == size, 'Set sizes should add up to total size.'
    
    for i in range(0, len(set_fracs) - 1):
        # Select indices for a set:
        set_start.append(set_start[i] + set_sizes[i])
        set_indices = shuffled_indices[set_start[i]:set_start[i + 1]]
        indices.append(set_indices)
        assert len(indices[i]) == len(set(indices[i])), 'There are repeating indices in a set.'
        
    # Select the indices for the last set:
    indices.append(shuffled_indices[set_start[-1]:])
    assert len(set(np.concatenate(indices))) == sum([len(i) for i in indices]), \
    'There are common indices between sets.'
    
    return tuple(indices)


def random_set_split(df, set_fracs, seed):
    """
    Split a DataFrame into randomly selected disjoint and complete sets.
    
    
    Input
    -----
    
    df : Pandas DataFrame
        The dataframe to split into a complete and disjoint set of sub-sets.
        
    set_fracs : array-like
        The fraction of `df` that should be put into each set. The length of 
        `set_fracs` determines the number of sub-sets to create.
    
    seed : int
        The seed for the random number generator used to split `df`.
        
    
    Returns
    -------
    
    A tuple of DataFrames, one for each fraction in `set_fracs`, in that order.
    """
    # Get positional indices for each set:
    sets_idx = random_index_sets(len(df), set_fracs, seed)
    
    return tuple(df.iloc[idx] for idx in sets_idx)
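
As a quick illustration of the splitting logic above, here is a minimal sketch on a toy DataFrame. It does not call the functions above; it repeats the same rounding rule and the same seeded permutation in a few lines, and the toy data and the name `toy_df` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 20 rows, to be split 70/15/15.
toy_df = pd.DataFrame({'text': [f'sentence {i}' for i in range(20)],
                       'label': [i % 2 for i in range(20)]})
fracs = [0.7, 0.15, 0.15]

# Same rounding rule as in `random_index_sets`: round every set size,
# then let the first set absorb whatever the rounding left over.
sizes = [round(len(toy_df) * f) for f in fracs]
sizes[0] = len(toy_df) - sum(sizes[1:])

# Shuffle the positions with a fixed seed and cut them at the cumulative sizes.
shuffled = np.random.RandomState(seed=1323).permutation(len(toy_df))
cuts = np.cumsum(sizes)[:-1]
parts = [toy_df.iloc[idx] for idx in np.split(shuffled, cuts)]

print(sizes, [len(p) for p in parts])
```

Because every position appears exactly once in the permutation, the resulting sets are disjoint and together cover the whole DataFrame.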


In [5]:
def process_pandas_to_tfdataset(df, tokenizer, max_length=80, shuffle=True, text_col='text', target_col='label', batch_size=8):
    """
    Prepare NLP data in a Pandas DataFrame to be used 
    in a TensorFlow transformer model.
    
    Parameters
    ----------
    df : DataFrame
        The corpus, containing the columns `text_col` 
        (the sentences) and `target_col` (the labels).
    tokenizer : HuggingFace AutoTokenizer
        A tokenizer loaded from 
        `transformers.AutoTokenizer.from_pretrained()`.
    max_length : int
        Maximum length of the sentences (smaller 
        sentences will be padded and longer ones
        will be truncated). This is required for 
        training, so batches have instances of the
        same shape.
    shuffle : bool
        Shuffle the dataset order when loading. 
        Recommended True for training, False for 
        validation/evaluation.
    text_col : str
        Name of `df` column containing the sentences.
    target_col : str
        Name of `df` column containing the labels of 
        the sentences.
    batch_size : int
        The size of the batch in the output 
        tensorflow dataset.
        
    Returns
    -------
    tf_dataset : TF dataset
        A dataset that can be fed into a transformer 
        model.
    """
    
    # Hugging Face expects the target column to be named 'labels':
    renamed_df = df.rename({target_col: 'labels'}, axis=1)
    
    # Function that runs the tokenizer over the data.
    # `padding='max_length'` pads every sentence to `max_length`, as the docstring states
    # (`padding=True` would only pad to the longest sentence in each mapped batch):
    def tokenize_function(examples):
        return tokenizer(examples[text_col], padding='max_length', max_length=max_length, truncation=True)
    
    # pandas -> Hugging Face:
    hugging_set = Dataset.from_pandas(renamed_df)
    # text -> sequence of IDs:
    encoded_set = hugging_set.map(tokenize_function, batched=True)
    # Hugging Face -> TensorFlow dataset:
    data_collator = DefaultDataCollator(return_tensors="tf")
    tf_dataset = encoded_set.to_tf_dataset(columns=["attention_mask", "input_ids", "token_type_ids"], label_cols=["labels"], shuffle=shuffle, collate_fn=data_collator, batch_size=batch_size)
    
    return tf_dataset
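
To make the role of `max_length` concrete, the sketch below pads or truncates a list of token IDs to a fixed length and builds the matching attention mask. It is a simplified stand-in for what the tokenizer does, not the Hugging Face implementation: the helper `pad_or_truncate`, the pad ID 0 and the example IDs are assumptions, and a real tokenizer also keeps the special end-of-sequence token when truncating.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both with exactly `max_length` entries."""
    ids = list(ids)[:max_length]            # truncate long sequences
    n_pad = max_length - len(ids)           # number of padding positions needed
    mask = [1] * len(ids) + [0] * n_pad     # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, mask

# Hypothetical token IDs, shorter and longer than the target length:
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(range(10), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With every instance brought to the same length, any group of instances can be stacked into a rectangular batch.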

In [6]:
def gen_tensorboard_callback(root_dir, run_name):
    """
    Return a tensorboard callback with log dir given 
    by `root_dir` + `run_name`. It avoids logging 
    to a pre-existing log inadvertently. 
    """
    
    # Root dir should exist. Check it:
    if not os.path.isdir(root_dir):
        raise Exception("`root_dir` {} is unknown.".format(root_dir))
    
    # Build path to log:
    fullpath = os.path.join(root_dir, run_name)
    
    # Check if log already exists:
    already_exists = os.path.isdir(fullpath)
    if already_exists:
        
        # If it exists, ask whether it should continue:
        go_on = input("Run log '{}' already exists. Continue (y/n)?".format(run_name))
        if go_on == 'y' or go_on == 'Y':
            return tf.keras.callbacks.TensorBoard(fullpath)
       
        else:
            raise Exception('Abort so not to mess with tensorboard log.')
    
    else:
        return tf.keras.callbacks.TensorBoard(fullpath)

In [7]:
def predict_proba(model, tf_dataset):
    """
    Use the provided model to compute the
    probability that each instance is 
    in the positive class (1 in a binary 
    classification).

    Parameters
    ----------
    model : TFBertForSequenceClassification
        A Hugging Face implementation of a 
        Tensorflow transformer model.
    tf_dataset : Tensorflow Dataset
        The data for which to make predictions.
    
    Returns
    -------
    probs : array
        Probability that the corresponding 
        instance falls in the positive class
        (y = 1).
    """

    tf_predict = model.predict(tf_dataset).logits
    probs = tf.sigmoid(tf_predict)[:,0].numpy()
    
    return probs


def predict_class(model, tf_dataset, threshold=0.5):
    """
    Use the provided model to predict
    the class of each instance.

    Parameters
    ----------
    model : TFBertForSequenceClassification
        A Hugging Face implementation of a 
        Tensorflow transformer model.
    tf_dataset : Tensorflow Dataset
        The data for which to make predictions.
    threshold : float
        Instances with probability above this 
        value are assigned to class 1.
    
    Returns
    -------
    preds : array
        Predicted class for the corresponding
        instances.
    """

    probs = predict_proba(model, tf_dataset)
    preds = (probs > threshold).astype(int)

    return preds
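
The two prediction functions boil down to a sigmoid followed by a threshold. The sketch below shows that step with NumPy alone; the logits are made-up values, shaped `(n_instances, 1)` as one would expect from a model with a single output unit.

```python
import numpy as np

# Hypothetical logits, one row per instance and a single output column.
logits = np.array([[-2.0], [0.0], [1.5]])

probs = 1.0 / (1.0 + np.exp(-logits[:, 0]))   # sigmoid maps logits to (0, 1)
preds = (probs > 0.5).astype(int)             # same rule as `predict_class`

print(probs.round(3), preds)
```

With the default threshold of 0.5, the rule is equivalent to checking whether the logit is positive; a logit of exactly 0 gives probability 0.5 and is assigned to class 0.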

## Loading BERTimbau

In [8]:
# The model to use:
model_name = 'neuralmind/bert-base-portuguese-cased'
# Loading:
tokenizer  = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
model      = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)


## Loading the data

Source: we merged the Fortuna and Pelle datasets (see the baseline model notebook).

In [9]:
# To run on Colab:
# To download the data from Google Drive to Colab:
#link = 'https://drive.google.com/file/d/1Xq_xPg-OA3q0pfIOf7oIvbc0xsOQ0sxa/view?usp=sharing'
#link_id = link.split('/')[-2]
#downloaded = drive.CreateFile({'id':link_id}) 
#downloaded.GetContentFile('hatespeech_fortuna3+offcombr2.csv')  

In [10]:
# Load the data:
mass_df = pd.read_csv('../../dados/processados/hatespeech_fortuna3+offcombr2.csv')
#mass_df = pd.read_csv('hatespeech_fortuna3+offcombr2.csv')

In [11]:
# Number of examples in each class:
mass_df['label'].value_counts()

0    4713
1     926
Name: label, dtype: int64

In [12]:
# Split the data into train, validation and test samples:
train_df, val_df, test_df = random_set_split(mass_df, [0.7, 0.15, 0.15], 1323)
#train_df, val_df = random_set_split(mass_df, [0.85, 0.15], 45998)

In [13]:
# Saving the data used to test the model:
#val_df.to_csv('../../dados/processados/hatespeech_fortuna3+offcombr2_val_seed1323.csv', index=False)
#test_df.to_csv('../../dados/processados/hatespeech_fortuna3+offcombr2_test_seed1323.csv', index=False)

In [14]:
# Tokenize the texts and put them in the TensorFlow Dataset format:
train_tfd = process_pandas_to_tfdataset(train_df, tokenizer, batch_size=32, shuffle=True)
val_tfd   = process_pandas_to_tfdataset(val_df, tokenizer, batch_size=32, shuffle=False)
# Uncomment to build `test_tfd`, which the test-set cell in the last section needs:
#test_tfd  = process_pandas_to_tfdataset(test_df, tokenizer, batch_size=32, shuffle=False)


In [15]:
# Minimum accuracy (always guessing the majority class):
(val_df['label'] == 0).mean()

0.8262411347517731
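
The same baseline can be worked out for the full dataset from the class counts printed above. This is plain arithmetic on those two numbers; the variable names are only for illustration.

```python
# Class counts from `mass_df['label'].value_counts()`:
n_neg, n_pos = 4713, 926

baseline_acc = n_neg / (n_neg + n_pos)   # accuracy of always predicting class 0
pos_rate = n_pos / (n_neg + n_pos)       # share of hate speech examples

print(f'majority-class accuracy: {baseline_acc:.3f}, positive rate: {pos_rate:.3f}')
```

Since a model that never flags anything already scores this high, accuracy alone says little here; precision, recall and F1 on the positive class are more informative.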

## Training the model

In [16]:
# Training parameters:
model_loss = tf.keras.losses.BinaryCrossentropy(from_logits=True) # Hugging Face does not put an activation function on the last layer, so the loss works on logits.
metrics = ['accuracy']
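
For reference, binary cross-entropy can be computed straight from logits without ever forming the probabilities, which avoids taking the log of numbers very close to 0 or 1. The NumPy sketch below is an illustration of that idea under the usual formula, not the Keras source code; the function names and the sample values are made up.

```python
import numpy as np

def bce_from_logits(y_true, logits):
    """Mean binary cross-entropy computed directly from logits."""
    y = np.asarray(y_true, dtype=float)
    x = np.asarray(logits, dtype=float)
    # Numerically stable form of -[y*log(s(x)) + (1-y)*log(1-s(x))], s = sigmoid:
    per_example = np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return per_example.mean()

def bce_naive(y_true, logits):
    """Same quantity via explicit probabilities (fine for moderate logits)."""
    y = np.asarray(y_true, dtype=float)
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

y = [0, 1, 1, 0]                 # hypothetical labels
z = [-1.2, 0.3, 2.0, 0.5]        # hypothetical logits
print(bce_from_logits(y, z), bce_naive(y, z))
```

The two functions agree on moderate logits; the first one stays finite even for very large positive or negative logits.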

### Start of training: coarse tuning of the last layer

At this stage we do not expect overfitting, because the model is very simple (essentially a logistic regression on the features produced by BERT). If anything, we should see underfitting, so we can train for as many epochs as we like.

In [17]:
# Preparing the model with BERT frozen:
optimizer  = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.get_layer('bert').trainable = False
model.compile(optimizer, model_loss, metrics)
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108923136 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  769       
                                                                 
Total params: 108,923,905
Trainable params: 769
Non-trainable params: 108,923,136
_________________________________________________________________
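
The 769 trainable parameters in the summary are what one would expect from a single dense unit on top of a 768-dimensional BERT output: 768 weights plus 1 bias. The check below is simple arithmetic; the hidden size of 768 is an assumption based on the BERT-base architecture, not something printed in the summary.

```python
hidden_size = 768          # assumed width of the BERT-base output fed to the classifier
num_labels = 1             # single output unit, as set when loading the model

classifier_params = hidden_size * num_labels + num_labels   # weights + biases
bert_params = 108_923_136                                   # from the summary above
total_params = bert_params + classifier_params

print(classifier_params, total_params)
```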


In [18]:
# Monitoring with TensorBoard:
# tensorboard --logdir=tensor_logs/
#board = gen_tensorboard_callback('tensor_logs/', 'first_try')

In [19]:
# Fitting the model:
model.fit(train_tfd, epochs=20, validation_data=val_tfd)

Epoch 1/20

KeyboardInterrupt: 

We expect to reach something close to:

    loss: 0.3111 - accuracy: 0.8610 - val_loss: 0.3074 - val_accuracy: 0.8617

### Fine tuning of the last layer

In [None]:
# Let's lower the learning rate:
optimizer  = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer, model_loss, metrics)

In [None]:
# Fitting the model:
model.fit(train_tfd, epochs=10, validation_data=val_tfd)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f57f7ee2510>

We expect to reach something close to:

    loss: 0.3077 - accuracy: 0.8621 - val_loss: 0.3071 - val_accuracy: 0.8593

### Unfreezing the whole model for training

Now it is important to track the loss function on both the training sample and the validation sample.

* A good learning rate should make the training loss fall gradually. To avoid wrecking the pre-trained weights, we lower the learning rate considerably.

* When the validation loss stops falling, we have entered the overfitting regime and training must stop.
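
The stopping rule in the second bullet can be sketched in plain Python. This is a simplified illustration of patience-based early stopping, not the Keras callback itself: the function `early_stop` and the validation losses are hypothetical, and details such as a minimum improvement threshold are ignored.

```python
def early_stop(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-based, for per-epoch validation losses."""
    best_loss, best_epoch, wait = float('inf'), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Validation loss improved: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            # No improvement: stop once `patience` such epochs have accumulated.
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Hypothetical validation losses: improving until epoch 4, then rising.
losses = [0.40, 0.35, 0.31, 0.30, 0.305, 0.31, 0.32, 0.33, 0.34]
best, stop = early_stop(losses, patience=4)
print(best, stop)
```

Training stops a few epochs after the best one, so restoring the weights from the best epoch is what makes the final model correspond to the lowest validation loss.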

In [None]:
# Preparing the model with BERT unfrozen for tuning (we lower the learning rate even further):
optimizer  = tf.keras.optimizers.Adam(learning_rate=5e-7)
model.get_layer('bert').trainable = True
model.compile(optimizer, model_loss, metrics)
model.summary()

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108923136 
                                                                 
 dropout_75 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  769       
                                                                 
Total params: 108,923,905
Trainable params: 108,923,905
Non-trainable params: 0
_________________________________________________________________


In [None]:
# Fitting the model:
early_stopping = tf.keras.callbacks.EarlyStopping('val_loss', patience=4, restore_best_weights=True)
model.fit(train_tfd, epochs=40, validation_data=val_tfd, callbacks=[early_stopping])

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40


<keras.callbacks.History at 0x7f57ec895790>

We expect to reach, in about 19 epochs, something close to:

    loss: 0.2074 - accuracy: 0.9090 - val_loss: 0.2857 - val_accuracy: 0.8806

In [None]:
# Save the trained model:
#model.save_pretrained('bertimbau-hatespeech-trained')

## Testing the model

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [None]:
saved_model = TFAutoModelForSequenceClassification.from_pretrained('bertimbau-hatespeech-trained')

Some layers from the model checkpoint at bertimbau-hatespeech-trained were not used when initializing TFBertForSequenceClassification: ['dropout_75']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at bertimbau-hatespeech-trained.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [None]:
# Predictions for validation set:
val_pred  = predict_class(saved_model, val_tfd)

# Metrics:
y_true, y_pred = val_df['label'], val_pred
for name, scorer in {'acc': accuracy_score, 'f1': f1_score, 'prec': precision_score, 'rec': recall_score}.items():
    s = scorer(y_true, y_pred)
    print('{}: {:.3f}'.format(name, s))

acc: 0.881
f1: 0.512
prec: 0.646
rec: 0.424


In [None]:
# Predictions for test set:
test_pred = predict_class(saved_model, test_tfd)

# Metrics:
y_true, y_pred = test_df['label'], test_pred
for name, scorer in {'acc': accuracy_score, 'f1': f1_score, 'prec': precision_score, 'rec': recall_score}.items():
    s = scorer(y_true, y_pred)
    print('{}: {:.3f}'.format(name, s))

acc: 0.901
f1: 0.596
prec: 0.689
rec: 0.525
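
As a sanity check on the printed metrics, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two numbers. The sketch below uses the rounded values shown above, so agreement is only expected up to rounding; the helper name `f1_from` is made up.

```python
def f1_from(prec, rec):
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)

val_f1 = f1_from(0.646, 0.424)    # validation precision and recall printed above
test_f1 = f1_from(0.689, 0.525)   # test precision and recall printed above

print(f'val f1: {val_f1:.3f}, test f1: {test_f1:.3f}')
```

Both values match the reported F1 scores to three decimals, and in both samples recall is the weaker component: the model misses roughly half of the hate speech examples.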
