# 📚 Fine-Tuning a Portuguese BERT Model for Masked Language Modeling

This notebook walks through the complete process of **fine-tuning** a **BERT model pre-trained on Portuguese** on the **Masked Language Modeling (MLM)** task, using the **Hugging Face Transformers** library.

The goal is to train the model to **predict masked words** in Brazilian legislative texts, using the **UlyssesNER-Br** corpus (the target of the project as a whole), which contains bills and legislative consultations from the Câmara dos Deputados (Brazil's Chamber of Deputies).

This kind of domain-adaptive pre-training helps the model specialize in **legal and legislative vocabulary**, which should make it more effective on downstream tasks such as named entity recognition (NER).

Indeed, this model will serve as the base for training an NER model on the same corpus.

The notebook goes through the following main steps:
- **Imports and setup:** Setting up the environment and loading the tokenizer and the pre-trained BERT.
- **Downloading and preparing the data:** Fetching the legislative corpus, joining the tokens into continuous text, and building the dataset.
- **Tokenization and grouping:** Converting the texts to numeric IDs and arranging them into fixed-size blocks compatible with BERT.
- **Training configuration:** Defining the `DataCollator` that applies dynamic masking and setting the training hyperparameters.
- **Running the fine-tuning:** Training the model with the Hugging Face `Trainer` and saving the adjusted weights for later use.

The resulting model is a specialized version of Portuguese BERT, **better aligned with legislative text**, ready to serve as a base for NLP tasks specific to the legal domain.

In [1]:
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset
from pathlib import Path
import requests
import json

In [2]:
model_checkpoint = "neuralmind/bert-base-portuguese-cased"
path_to_save_lm = Path("./outputs/bert_masked_lm_ulysses")
path_to_save_lm.mkdir(parents=True, exist_ok=True)

In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
base_url = "https://raw.githubusercontent.com/Convenio-Camara-dos-Deputados/ulyssesner-br-propor/main/PL-corpus_v2/ulysses_categories/holdout/"

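def load_data(url):
    # Download one JSON split; fail loudly on HTTP errors instead of
    # trying to parse an error page as JSON.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()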

urls = {
    "train": base_url + "train.json",
    "dev": base_url + "dev.json",
    "test": base_url + "test.json"
}

train_data = load_data(urls["train"])
dev_data = load_data(urls["dev"])
test_data = load_data(urls["test"])


In [5]:
def join_tokens(data):
    return [" ".join(example["tokens"]) for example in data]

train_texts = join_tokens(train_data) + join_tokens(dev_data)
test_texts = join_tokens(test_data)

print(f"Training examples: {len(train_texts)}")
print(f"Test examples: {len(test_texts)}")

Training examples: 1900
Test examples: 592
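
As a small illustration of the joining step above, the sketch below applies the same `join_tokens` logic to one made-up record. The record content is hypothetical; only the `"tokens"` key is taken from the notebook. Note that joining with single spaces also puts a space before punctuation, so the text fed to the tokenizer is not identical to the original document layout.

```python
# Toy record shaped like the corpus entries used above (content is hypothetical).
example = {"tokens": ["Art", ".", "1º", "Esta", "lei", "entra", "em", "vigor", "."]}


def join_tokens(data):
    # Same logic as the notebook cell: one space-separated string per record.
    return [" ".join(record["tokens"]) for record in data]


texts = join_tokens([example])
print(texts[0])
```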


In [6]:
train_dataset = Dataset.from_dict({"text": train_texts})
test_dataset = Dataset.from_dict({"text": test_texts})

train_dataset = train_dataset.shuffle(seed=271828)
print(train_dataset)

Dataset({
    features: ['text'],
    num_rows: 1900
})


In [7]:
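def tokenize_function(examples):
    # No truncation here: the next cell concatenates the texts and
    # splits them into fixed-size chunks.
    result = tokenizer(examples["text"])
    # Fast tokenizers expose the token-to-word mapping; keep it with the IDs.
    if tokenizer.is_fast:
        result["word_ids"] = [
            result.word_ids(i) for i in range(len(result["input_ids"]))
        ]
    return result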

tokenized_train = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

tokenized_test = test_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

print(tokenized_train[0])

{'input_ids': [101, 2070, 18407, 16520, 122, 18656, 22281, 117, 1971, 180, 7913, 271, 171, 16176, 4015, 117, 11367, 10258, 260, 2380, 15092, 22281, 180, 1837, 733, 18982, 17512, 154, 1772, 122, 8051, 2924, 10278, 6617, 7791, 15173, 179, 598, 20692, 22287, 3545, 122, 4560, 353, 6350, 117, 11338, 259, 1867, 173, 5121, 125, 16087, 17238, 173, 327, 8972, 117, 625, 346, 125, 10276, 122, 2281, 8318, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'word_ids': [None, 0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 14, 15, 16, 16, 16, 17, 18, 19, 19, 19, 19, 20, 21, 22, 22, 23, 24, 25, 25, 26
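
The `word_ids` column in the output above maps each sub-word token back to the word it came from, with `None` for special tokens. The sketch below approximates that mapping for WordPiece-style tokens, assuming continuation pieces carry a `##` prefix. The token list is a hypothetical segmentation, not the real output of this tokenizer, and `approximate_word_ids` is an illustrative helper rather than a library function.

```python
def approximate_word_ids(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Rebuild a word index per token from WordPiece-style '##' prefixes."""
    word_ids = []
    current = -1
    for tok in tokens:
        if tok in special_tokens:
            # Special tokens do not belong to any word.
            word_ids.append(None)
        elif tok.startswith("##"):
            # Continuation piece: same word as the previous token.
            word_ids.append(current)
        else:
            # A token without the prefix starts a new word.
            current += 1
            word_ids.append(current)
    return word_ids


# Hypothetical segmentation, for illustration only.
tokens = ["[CLS]", "proj", "##eto", "de", "lei", ".", "[SEP]"]
ids = approximate_word_ids(tokens)
print(ids)  # [None, 0, 0, 1, 2, 3, None]
```

This mapping is what a whole-word masking scheme would need. The standard token-level collator used later in this notebook does not appear to consume it.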

In [8]:
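chunk_size = 512  # maximum sequence length supported by BERT

def group_texts(examples):
    # Concatenate every column of the batch into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail so the length is an exact multiple of chunk_size.
    total_length = (total_length // chunk_size) * chunk_size
    # Split each column into pieces of chunk_size tokens.
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start as a copy of the inputs; the data collator recomputes
    # them when it applies masking.
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_train = tokenized_train.map(group_texts, batched=True)
tokenized_test = tokenized_test.map(group_texts, batched=True)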

print(tokenized_train[0])


{'input_ids': [101, 2070, 18407, 16520, 122, 18656, 22281, 117, 1971, 180, 7913, 271, 171, 16176, 4015, 117, 11367, 10258, 260, 2380, 15092, 22281, 180, 1837, 733, 18982, 17512, 154, 1772, 122, 8051, 2924, 10278, 6617, 7791, 15173, 179, 598, 20692, 22287, 3545, 122, 4560, 353, 6350, 117, 11338, 259, 1867, 173, 5121, 125, 16087, 17238, 173, 327, 8972, 117, 625, 346, 125, 10276, 122, 2281, 8318, 119, 102, 101, 146, 16975, 1772, 9835, 154, 131, 1328, 119, 100, 146, 1328, 119, 100, 180, 2241, 100, 1193, 119, 17909, 22338, 117, 125, 2336, 125, 1512, 125, 5232, 1379, 9999, 171, 18043, 148, 310, 1379, 1425, 123, 8773, 159, 5103, 1859, 171, 1457, 204, 100, 117, 12963, 118, 176, 146, 20737, 6257, 2232, 171, 4319, 10764, 173, 204, 100, 131, 107, 204, 100, 2810, 12119, 251, 123, 19231, 171, 6330, 625, 20075, 6944, 2738, 12507, 125, 123, 4857, 10907, 22282, 2113, 4654, 5245, 117, 2557, 291, 173, 1028, 11055, 179, 6554, 8814, 146, 347, 1700, 119, 22354, 102, 101, 170, 860, 2630, 117, 15060, 2671, 1
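
The output above shows several documents packed back to back in one chunk (note the repeated `102, 101` boundaries). To make the grouping logic easier to follow, here is a minimal sketch of the same idea on a toy batch, with `chunk_size` passed as a parameter and a tiny value chosen only for readability. The token IDs are made up apart from 101 and 102, which mark the start and end of each sequence in the outputs above.

```python
def group_texts(examples, chunk_size):
    # Flatten each column of the batch into one long list.
    concatenated = {
        key: [tok for seq in column for tok in seq]
        for key, column in examples.items()
    }
    total_length = len(concatenated["input_ids"])
    # Keep only as many tokens as fill whole chunks; the remainder is dropped.
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        key: [flat[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for key, flat in concatenated.items()
    }
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Three toy "documents" with 4 + 3 + 5 = 12 tokens in total.
batch = {"input_ids": [[101, 5, 6, 102], [101, 7, 102], [101, 8, 9, 10, 102]]}
grouped = group_texts(batch, chunk_size=5)
print(grouped["input_ids"])  # [[101, 5, 6, 102, 101], [7, 102, 101, 8, 9]]
```

With 12 tokens and a chunk size of 5, two chunks are produced and the last two tokens are discarded. When `map` is called with `batched=True`, the function runs once per batch, so a remainder of this kind is dropped per batch rather than once for the whole dataset.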

In [9]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)
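
To make "dynamic masking" concrete, the following is a simplified pure-Python sketch of the BERT-style rule that this collator applies by default: each non-special token is selected with probability `mlm_probability`; a selected token is replaced by the mask token about 80% of the time, by a random token about 10% of the time, and left unchanged otherwise. Labels are `-100` (ignored by the loss) everywhere except at selected positions. The function `mask_tokens`, the mask ID 103, and the vocabulary size are assumptions for illustration; this is not the library's implementation.

```python
import random


def mask_tokens(input_ids, rng, mlm_probability=0.15,
                mask_id=103, vocab_size=1000, special_ids=(101, 102)):
    """Return (masked_ids, labels) following the 80/10/10 BERT recipe."""
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 means "do not score this position"
    for i, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in place
    return masked, labels


rng = random.Random(0)
input_ids = [101] + list(range(200, 260)) + [102]

# Calling the function twice on the same input draws two different masks,
# which is what happens when a collator masks at batch-construction time.
masked_a, labels_a = mask_tokens(input_ids, rng)
masked_b, labels_b = mask_tokens(input_ids, rng)
print(sum(l != -100 for l in labels_a), sum(l != -100 for l in labels_b))
```

Because the mask is drawn each time a batch is built, the model sees different masked positions for the same chunk across epochs, unlike a dataset that is masked once up front.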


In [10]:
training_batch_size = 8

model_name = model_checkpoint.split("/")[-1]

training_arguments = TrainingArguments(
    output_dir=str(path_to_save_lm / f"{model_name}-finetuned-ulysses"),
    learning_rate=3e-5,
    per_device_train_batch_size=training_batch_size,
    per_device_eval_batch_size=training_batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

train_output = trainer.train()

# Write the final weights and tokenizer to output_dir explicitly, so the
# NER stage can reload them regardless of the checkpoint schedule.
trainer.save_model()

train_output



TrainOutput(global_step=72, training_loss=1.8117237091064453, metrics={'train_runtime': 4459.6296, 'train_samples_per_second': 0.129, 'train_steps_per_second': 0.016, 'total_flos': 151604683997184.0, 'train_loss': 1.8117237091064453, 'epoch': 3.0})
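
The numbers in this `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept (the `TrainingArguments` defaults, as far as the cell above shows). It also converts the reported loss into a perplexity. That figure comes from the average training loss on masked tokens, so it is only a rough indicator and not a held-out evaluation.

```python
import math

# Values copied from the TrainOutput above and from the training cell.
global_step = 72
num_train_epochs = 3
per_device_batch_size = 8
training_loss = 1.8117237091064453
train_runtime = 4459.6296
train_samples_per_second = 0.129

# 72 optimizer steps over 3 epochs means 24 steps per epoch.
steps_per_epoch = global_step // num_train_epochs

# 24 batches of at most 8 chunks: between 185 and 192 training chunks.
max_chunks = steps_per_epoch * per_device_batch_size
min_chunks = (steps_per_epoch - 1) * per_device_batch_size + 1

# Independent estimate from throughput: samples processed per epoch.
chunks_from_throughput = train_samples_per_second * train_runtime / num_train_epochs

# Perplexity implied by the average training loss.
perplexity = math.exp(training_loss)

print(steps_per_epoch, min_chunks, max_chunks)
print(round(chunks_from_throughput), round(perplexity, 2))
```

Both estimates point to roughly 190 chunks of 512 tokens per epoch, which is consistent with the 1900 training texts being packed into far fewer, longer sequences by the grouping step.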