# Creating a model to identify the target of a tweet with pre-trained BERT

[Back to Index](../00_indice.ipynb)

The goal of this notebook was to create a BERT-architecture model that identifies the target of a tweet (whether or not it is the mentioned candidate), as described in the notebook on the [ML model with the same purpose](30_modelo_baseline_objeto_do_tweet.ipynb).

**Result:** Performance on the test sample was very poor (F1 of 0.105, versus 0.529 on the validation sample), probably because of the small number of test instances. We will use the simpler [machine learning model](30_modelo_baseline_objeto_do_tweet.ipynb) instead.

**ATTENTION:** This notebook uses data that is not available in this project due to legal restrictions by the Brazilian Personal Data Protection Law ([LGPD](https://www.planalto.gov.br/ccivil_03/_ato2015-2018/2018/lei/l13709.htm)).

In [1]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
# Hugging Face:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from datasets import Dataset
from transformers import DefaultDataCollator

2023-02-16 09:41:47.591844: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
  from .autonotebook import tqdm as notebook_tqdm


## Functions

In [2]:
###########################################
### Splitting datasets into random sets ###
###########################################

def shuffled_pos(length, seed):
    """
    Return indices from 0 to `length` - 1 in a shuffled state, given random `seed`.
    """
    return np.random.RandomState(seed=seed).permutation(length)


def random_index_sets(size, set_fracs, seed):
    """
    Return sets of random indices (from 0 to `size` - 1) with lengths 
    given by ~ `size` * `set_fracs`.
    
    
    Input
    -----
    
    size : int
        The size of the index list to split into sets.
        
    set_fracs : iterable
        The fractions of the list of indices that each index set 
        should contain. 
    
    seed : int
        The seed for the random number generator.
        
        
    Returns
    -------
    
    indices : tuple of arrays
        The indices for each set.
    """
    
    assert np.isclose(np.sum(set_fracs), 1), '`set_fracs` should add up to one.'
    
    # Create randomized list of indices:
    shuffled_indices = shuffled_pos(size, seed)
    
    
    indices   = []
    set_start = [0]
    # Determine the sizes of the sets:
    set_sizes = [round(size * f) for f in set_fracs]
    set_sizes[0] = size - sum(set_sizes[1:])
    assert np.sum(set_sizes) == size, 'Set sizes should add up to total size.'
    
    for i in range(0, len(set_fracs) - 1):
        # Select indices for a set:
        set_start.append(set_start[i] + set_sizes[i])
        set_indices = shuffled_indices[set_start[i]:set_start[i + 1]]
        indices.append(set_indices)
        assert len(indices[i]) == len(set(indices[i])), 'There are repeating indices in a set.'
        
    # Select the indices for the last set:
    indices.append(shuffled_indices[set_start[-1]:])
    assert len(set(np.concatenate(indices))) == sum([len(i) for i in indices]), \
    'There are common indices between sets.'
    
    return tuple(indices)


def random_set_split(df, set_fracs, seed):
    """
    Split a DataFrame into randomly selected disjoint and complete sets.
    
    
    Input
    -----
    
    df : Pandas DataFrame
        The dataframe to split into a complete and disjoint set of sub-sets.
        
    set_fracs : array-like
        The fraction of `df` that should be put into each set. The length of 
        `set_fracs` determines the number of sub-sets to create.
    
    seed : int
        The seed for the random number generator used to split `df`.
        
    
    Returns
    -------
    
    A tuple of DataFrames, one for each fraction in `set_fracs`, in that order.
    """
    # Get positional indices for each set:
    sets_idx = random_index_sets(len(df), set_fracs, seed)
    
    return tuple(df.iloc[idx] for idx in sets_idx)
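
As a rough illustration of how `random_index_sets` sizes the sets, the sketch below reproduces only the size arithmetic (round each fraction, then let the first set absorb the rounding remainder) and slices a shuffled index list accordingly. It uses the standard library's `random` module as a stand-in for `np.random.RandomState`, and the helper name `split_sizes` and the sample size of 103 are illustrative assumptions, not part of the notebook.

```python
import random

def split_sizes(size, set_fracs):
    """Round each fraction of `size`; the first set absorbs the rounding remainder."""
    sizes = [round(size * f) for f in set_fracs]
    sizes[0] = size - sum(sizes[1:])
    return sizes

size = 103  # hypothetical number of rows
sizes = split_sizes(size, [0.7, 0.15, 0.15])

# Shuffle the positions 0..size-1 (stand-in for np.random.RandomState(seed).permutation):
shuffled = list(range(size))
random.Random(614).shuffle(shuffled)

# Cut the shuffled list into consecutive, non-overlapping slices:
starts = [sum(sizes[:i]) for i in range(len(sizes))]
index_sets = [shuffled[s:s + n] for s, n in zip(starts, sizes)]

print(sizes)  # [73, 15, 15]
```

Because the slices are consecutive pieces of one permutation, the sets are disjoint and together cover every row exactly once.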


In [3]:
def process_pandas_to_tfdataset(df, tokenizer, max_length=80, shuffle=True, text_col='text', target_col='label', batch_size=8):
    """
    Prepare NLP data in a Pandas DataFrame to be used 
    in a TensorFlow transformer model.
    
    Parameters
    ----------
    df : DataFrame
        The corpus, containing the columns `text_col` 
        (the sentences) and `target_col` (the labels).
    tokenizer : HuggingFace AutoTokenizer
        A tokenizer loaded from 
        `transformers.AutoTokenizer.from_pretrained()`.
    max_length : int
        Maximum length of the sentences (smaller 
        sentences will be padded and longer ones
        will be truncated). This is required for 
        training, so batches have instances of the
        same shape.
    shuffle : bool
        Shuffle the dataset order when loading. 
        Recommended True for training, False for 
        validation/evaluation.
    text_col : str
        Name of `df` column containing the sentences.
    target_col : str
        Name of `df` column containing the labels of 
        the sentences.
    batch_size : int
        The size of the batch in the output 
        tensorflow dataset.
        
    Returns
    -------
    tf_dataset : TF dataset
        A dataset that can be fed into a transformer 
        model.
    """
    
    # Rename the target column (Hugging Face expects the labels under the name 'labels'):
    renamed_df = df.rename({target_col:'labels'}, axis=1)
    
    # Define the function that processes the data with the tokenizer
    # (every sentence is padded or truncated to `max_length`, as the docstring states):
    def tokenize_function(examples):
        return tokenizer(examples[text_col], padding='max_length', max_length=max_length, truncation=True)
    
    # pandas -> hugging face:
    hugging_set = Dataset.from_pandas(renamed_df)
    # text -> sequence of IDs: 
    encoded_set = hugging_set.map(tokenize_function, batched=True)
    # hugging face -> tensorflow dataset:
    data_collator = DefaultDataCollator(return_tensors="tf")
    tf_dataset = encoded_set.to_tf_dataset(columns=["attention_mask", "input_ids", "token_type_ids"], label_cols=["labels"], shuffle=shuffle, collate_fn=data_collator, batch_size=batch_size)
    
    return tf_dataset
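
To make the role of `max_length` concrete, here is a minimal pure-Python sketch of what padding and truncating to a fixed length does to token ID sequences, together with the attention mask that marks the real tokens. The helper `pad_or_truncate`, the example IDs and `pad_id=0` are illustrative assumptions; the actual tokenizer also takes care of special tokens, which this sketch ignores.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length exactly `max_length`."""
    kept = ids[:max_length]                      # truncate long sequences
    n_pad = max_length - len(kept)               # number of pad positions needed
    input_ids = kept + [pad_id] * n_pad          # pad short sequences
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence ends up with the same length, which is what lets instances be stacked into batches of a single shape.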

In [4]:
def gen_tensorboard_callback(root_dir, run_name):
    """
    Return a tensorboard callback with log dir given 
    by `root_dir` + `run_name`. It avoids logging 
    to a pre-existing log inadvertently. 
    """
    
    # Root dir should exist. Check it:
    if not os.path.isdir(root_dir):
        raise Exception("`root_dir` {} is unknown.".format(root_dir))
    
    # Build path to log:
    fullpath = os.path.join(root_dir, run_name)
    
    # Check if log already exists:
    already_exists = os.path.isdir(fullpath)
    if already_exists:
        
        # If it exists, ask whether to continue:
        go_on = input("Run log '{}' already exists. Continue (y/n)?".format(run_name))
        if go_on.lower() == 'y':
            return tf.keras.callbacks.TensorBoard(fullpath)
       
        else:
            raise Exception('Abort so not to mess with tensorboard log.')
    
    else:
        return tf.keras.callbacks.TensorBoard(fullpath)

In [5]:
def predict_proba(model, tf_dataset):
    """
    Use the provided model to compute the
    probability that each instance is 
    in the positive class (1 in a binary 
    classification).

    Parameters
    ----------
    model : TFBertForSequenceClassification
        A Hugging Face implementation of a 
        Tensorflow transformer model.
    tf_dataset : Tensorflow Dataset
        The data for which to make predictions.
    
    Returns
    -------
    probs : array
        Probability that the corresponding 
        instance falls in the positive class
        (y = 1).
    """

    tf_predict = model.predict(tf_dataset).logits
    probs = tf.sigmoid(tf_predict)[:,0].numpy()
    
    return probs


def predict_class(model, tf_dataset, threshold=0.5):
    """
    Use the provided model to predict
    the class of each instance.

    Parameters
    ----------
    model : TFBertForSequenceClassification
        A Hugging Face implementation of a 
        Tensorflow transformer model.
    tf_dataset : Tensorflow Dataset
        The data for which to make predictions.
    threshold : float
        Probability above which an instance is 
        assigned to the positive class.
    
    Returns
    -------
    preds : array
        Predicted class for the corresponding
        instances.
    """

    probs = predict_proba(model, tf_dataset)
    preds = (probs > threshold).astype(int)

    return preds
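
Since the model has a single output unit, `predict_proba` turns each logit into a probability with the sigmoid function and `predict_class` compares it with the threshold. A minimal sketch of that arithmetic on made-up logits, using only the standard library (the names `sigmoid` and `example_logits` are illustrative):

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

example_logits = [-2.0, 0.0, 1.5]  # hypothetical model outputs
probs = [sigmoid(z) for z in example_logits]
preds = [int(p > 0.5) for p in probs]  # strict comparison, as in predict_class
print([round(p, 3) for p in probs], preds)
```

Because the comparison is strict, a probability of exactly 0.5 (a logit of 0) is assigned to class 0.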

In [6]:
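def load_annotations(annotated_files):
    """Load and concatenate the annotated CSV files, keeping only rows with a `not_the_target` label."""
    annotated_df = pd.concat([pd.read_csv(f) for f in annotated_files], ignore_index=True)
    # Drop rows that were not annotated (copy so the assignment below does not act on a view):
    annotated_df = annotated_df.loc[~annotated_df['not_the_target'].isnull()].copy()
    annotated_df['not_the_target'] = annotated_df['not_the_target'].astype(int)
    return annotated_df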

## Loading BERTimbau

In [7]:
# Define the model to use:
model_name = 'neuralmind/bert-base-portuguese-cased'
# Loading:
tokenizer  = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
model      = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
#model = TFAutoModelForSequenceClassification.from_pretrained('../../modelos/bertimbau-hatespeech-v01')

2023-02-16 09:42:06.684114: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 09:42:06.699801: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-02-16 09:42:06.699812: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-02-16 09:42:06.700540: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized wi

## Loading the data

In [8]:
# Load the data:
#annotated_files = ['../../dados/processados/tweets_classificados_por_objeto_anotados.csv',    # Unencrypted data.
#                   '../../dados/processados/tweets_classificados_por_objeto_2_anotados.csv',
#                   '../../dados/processados/tweets_classificados_por_objeto_3_anotados.csv']
annotated_files = ['../../dados/processados/tweets_classificados_por_objeto_anotados_encrypted.csv', 
                   '../../dados/processados/tweets_classificados_por_objeto_2_anotados_encrypted.csv',
                   '../../dados/processados/tweets_classificados_por_objeto_3_anotados_encrypted.csv']
annotated_df = load_annotations(annotated_files)

In [9]:
# Split the sample into training, validation and test sets:
train_df, val_df, test_df = random_set_split(annotated_df, [0.7, 0.15, 0.15], 614)

In [10]:
# Saving data for testing the model:
#val_df.to_csv('../../dados/processados/hatespeech_fortuna3+offcombr2_val_seed1323.csv', index=False)
#test_df.to_csv('../../dados/processados/hatespeech_fortuna3+offcombr2_test_seed1323.csv', index=False)

In [11]:
# Tokenize the texts and put them in the TensorFlow Dataset format:
train_tfd = process_pandas_to_tfdataset(train_df, tokenizer, batch_size=32, shuffle=True, target_col='not_the_target')
val_tfd   = process_pandas_to_tfdataset(val_df, tokenizer, batch_size=32, shuffle=False, target_col='not_the_target')
test_tfd  = process_pandas_to_tfdataset(test_df, tokenizer, batch_size=32, shuffle=False, target_col='not_the_target')

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 31.79ba/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 107.07ba/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 119.49ba/s]


In [12]:
# Minimum accuracy (always guessing the mode):
(val_df['not_the_target'] == 0).mean()

0.7422680412371134
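
The value above is the accuracy of a classifier that always predicts the most frequent class. It is consistent with 72 of 97 validation tweets having `not_the_target == 0`; these counts are inferred from the printed number (and from the metrics reported further down), not read from the data. A sketch with made-up labels in that proportion:

```python
from collections import Counter

labels = [0] * 72 + [1] * 25  # hypothetical labels with the inferred class counts

# Always guess the most common class and measure how often that is right:
mode_label, mode_count = Counter(labels).most_common(1)[0]
baseline_accuracy = mode_count / len(labels)
print(mode_label, round(baseline_accuracy, 3))
```

Any trained model has to beat this figure on accuracy to be doing better than ignoring the text altogether.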

## Training the model

In [13]:
# Training parameters:
model_loss = tf.keras.losses.BinaryCrossentropy(from_logits=True) # Hugging Face does not put an activation function on the last layer, so we work with logits.
metrics = ['accuracy']
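
With `from_logits=True`, the loss applies the sigmoid internally instead of expecting probabilities. The sketch below writes out the per-example binary cross-entropy both ways, from a probability and directly from a logit in a numerically stable form, to show that they agree. It is a plain-Python illustration of the formula, not the Keras implementation, and the function names are illustrative.

```python
import math

def bce_from_prob(y, p):
    """Binary cross-entropy for label y (0 or 1) and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logit(y, z):
    """Same loss computed from the logit z, avoiding overflow for large |z|."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

z = 1.2                           # hypothetical logit
p = 1.0 / (1.0 + math.exp(-z))    # sigmoid(z)
for y in (0, 1):
    print(y, round(bce_from_prob(y, p), 6), round(bce_from_logit(y, z), 6))
```

Working from the logit avoids taking the logarithm of a probability that has already saturated to 0 or 1.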

### Start of training: coarse tuning of the last layer

At this stage we do not expect overfitting, because the model is very simple (basically a logistic regression on the features created by BERT). If anything, we should see underfitting, so we can train for as many epochs as we like.

In [14]:
# Preparing the model with BERT frozen:
optimizer  = tf.keras.optimizers.Adam(learning_rate=1e-2)
model.get_layer('bert').trainable = False
model.compile(optimizer, model_loss, metrics)
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108923136 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  769       
                                                                 
Total params: 108,923,905
Trainable params: 769
Non-trainable params: 108,923,136
_________________________________________________________________
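
The 769 trainable parameters in the summary match a single dense unit on top of BERT's pooled output: one weight per feature plus a bias. The hidden size of 768 is the standard value for BERT-base models and is assumed here rather than read from the model configuration.

```python
hidden_size = 768   # assumed BERT-base hidden size
num_labels = 1      # single output logit, as in `num_labels=1`

classifier_params = hidden_size * num_labels + num_labels  # weights + biases
bert_params = 108_923_136                                  # from the summary above
total_params = bert_params + classifier_params
print(classifier_params, total_params)  # 769 108923905
```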


In [44]:
# Monitoring with TensorBoard 
# tensorboard --logdir=tensor_logs/
#board = gen_tensorboard_callback('tensor_logs/', 'first_try')

In [None]:
# Fitting the model:
model.fit(train_tfd, epochs=40, validation_data=val_tfd)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.History at 0x7f5b880c7190>

### Fine-tuning the last layer

In [15]:
# Let's lower the learning rate:
optimizer  = tf.keras.optimizers.Adam(learning_rate=5e-4)
model.compile(optimizer, model_loss, metrics)

In [16]:
# Fitting the model:
model.fit(train_tfd, epochs=10, validation_data=val_tfd)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f5b9c197280>

### Unfreezing the whole model for training

Now it is important to monitor the loss for both the training sample and the validation sample.

* A good learning rate should lead to a gradual decrease of the loss on the training sample. To avoid disrupting the pre-trained weights, we lower the learning rate considerably.

* When the loss stops decreasing on the validation sample, we have entered the overfitting regime and training must stop.
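
The stopping rule in the second bullet is what patience-based early stopping implements: remember the best validation loss so far and stop once it has not improved for `patience` consecutive epochs. The sketch below is a simplified pure-Python version of that rule on made-up losses, not the Keras `EarlyStopping` code; `early_stop_epoch` and the loss values are illustrative.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed."""
    best, best_epoch, wait = float('inf'), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        wait += 1
        if loss < best:                  # improvement: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        if wait >= patience:             # too many epochs without improvement
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical curve: the loss falls for 13 epochs, then rises slowly.
val_losses = [1.0 - 0.05 * i for i in range(13)] + [0.45 + 0.01 * i for i in range(27)]
stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=10)
print(stop_epoch, best_epoch)  # 23 13
```

Under this simplified rule, a run with a patience of 10 that stops after epoch 23 had its best validation loss at epoch 13.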

In [17]:
# Preparing the model with BERT unfrozen for tuning (we lower the learning rate even further):
optimizer  = tf.keras.optimizers.Adam(learning_rate=5e-7)
model.get_layer('bert').trainable = True
model.compile(optimizer, model_loss, metrics)
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108923136 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  769       
                                                                 
Total params: 108,923,905
Trainable params: 108,923,905
Non-trainable params: 0
_________________________________________________________________


In [18]:
# Fitting the model:
early_stopping = tf.keras.callbacks.EarlyStopping('val_loss', patience=10, restore_best_weights=True)
model.fit(train_tfd, epochs=40, validation_data=val_tfd, callbacks=[early_stopping])

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40


<keras.callbacks.History at 0x7f5b8806e7f0>

In [19]:
# Save the trained model:
#model.save_pretrained('bertimbau-hatespeech-trained')

In [20]:
saved_model = model

## Testing the model

In [21]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [22]:
#saved_model = TFAutoModelForSequenceClassification.from_pretrained('bertimbau-hatespeech-trained')

In [23]:
# Predictions for validation set:
val_pred  = predict_class(saved_model, val_tfd)

# Metrics:
y_true, y_pred = val_df['not_the_target'], val_pred
for name, scorer in {'acc': accuracy_score, 'f1': f1_score, 'prec': precision_score, 'rec': recall_score}.items():
    s = scorer(y_true, y_pred)
    print('{}: {:.3f}'.format(name, s))

acc: 0.835
f1: 0.529
prec: 1.000
rec: 0.360


In [28]:
# Predictions for test set:
test_pred = predict_class(saved_model, test_tfd)

# Metrics:
y_true, y_pred = test_df['not_the_target'], test_pred
for name, scorer in {'acc': accuracy_score, 'f1': f1_score, 'prec': precision_score, 'rec': recall_score}.items():
    s = scorer(y_true, y_pred)
    print('{}: {:.3f}'.format(name, s))

acc: 0.825
f1: 0.105
prec: 0.250
rec: 0.067
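
The printed metrics pin down the confusion-matrix counts fairly tightly. The counts below are the smallest ones consistent with the rounded metrics; they are inferred, not read from the data: 9 true positives, 0 false positives, 16 false negatives and 72 true negatives for validation, and 1, 3, 14 and 79 for test. Under that reading the test sample contains only 15 positive tweets, so a single hit or miss moves recall by almost 7 percentage points. The helper `metrics_from_counts` is illustrative.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, F1, precision and recall from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return {'acc': acc, 'f1': f1, 'prec': prec, 'rec': rec}

# Counts inferred from the rounded metrics printed above (an assumption):
inferred_counts = {
    'validation': dict(tp=9, fp=0, fn=16, tn=72),
    'test': dict(tp=1, fp=3, fn=14, tn=79),
}
for name, counts in inferred_counts.items():
    scores = metrics_from_counts(**counts)
    print(name, {k: '{:.3f}'.format(v) for k, v in scores.items()})
```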
