# EXPERIMENT 03 - TRAINING THE MODEL WITH THE HUGGINGFACE TRANSFORMERS API AND THE PRE-TRAINED MODEL => https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased

Model fine-tuned on legal terminology, based on the pre-trained model: https://huggingface.co/neuralmind/bert-base-portuguese-cased

**Microsoft Azure environment - dissertation training run**

**DEPENDENCY INSTALLATION**

In [None]:
!pip install transformers datasets torch tqdm numpy pandas py7zr rouge_score

**LIBRARY IMPORTS**

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader
from transformers import EncoderDecoderModel
from datasets import load_metric
import torch
from tqdm.notebook import tqdm
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

**BUILDING THE DATASET FROM THE SENTENCE PAIRS**

In [None]:
PATH = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/computerexperimento03/code/Users/alexandre.alves.net'

# Load the dataset from the local JSON file
dataset = load_dataset(PATH, data_files='amostra_sentences_stf_pt_13288.json', split='train', field="data")

ds = dataset.train_test_split(test_size=0.05)
train_data = ds['train'].shuffle(seed=42)
val_data = ds['test']

Downloading and preparing dataset json/alexandre.alves.net to /home/azureuser/.cache/huggingface/datasets/json/alexandre.alves.net-84014f0e3307bd0c/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5...


Using custom data configuration alexandre.alves.net-84014f0e3307bd0c


Dataset json downloaded and prepared to /home/azureuser/.cache/huggingface/datasets/json/alexandre.alves.net-84014f0e3307bd0c/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5. Subsequent calls will reuse this data.


**MODEL DEFINITION**

In [None]:
model_name = 'alfaneo/bertimbaulaw-base-portuguese-cased'
ds_col_in = 'original'
ds_col_out = 'simples'

tokenizer = AutoTokenizer.from_pretrained(model_name)
assert tokenizer.is_fast

encoder_max_length = 512
decoder_max_length = 512


def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels

    inputs = tokenizer(batch[ds_col_in], padding="max_length", truncation=True, max_length=encoder_max_length)
    outputs = tokenizer(batch[ds_col_out], padding="max_length", truncation=True, max_length=decoder_max_length)

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["labels"] = outputs.input_ids.copy()

    # We have to make sure that the PAD token is ignored by the loss function
    batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]

    return batch

v_batch_size = 8

train_data = train_data.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=v_batch_size,
    remove_columns=[ds_col_in, ds_col_out]
)
val_data = val_data.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=v_batch_size,
    remove_columns=[ds_col_in, ds_col_out]
)

train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"],
)

val_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"],
)



  0%|          | 0/1578 [00:00<?, ?ba/s]

  0%|          | 0/84 [00:00<?, ?ba/s]
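The two progress bars above report 1578 and 84 batches for the training and validation `map` calls. The sketch below checks that these counts are consistent with a 5% split at `v_batch_size = 8`. It assumes the file holds 13,288 sentence pairs (a guess based on its name, not confirmed by the notebook) and that the test share is rounded up to a whole example.

```python
import math

# Assumption: total number of sentence pairs, inferred from the file name
# 'amostra_sentences_stf_pt_13288.json'; the notebook never prints it.
N_EXAMPLES = 13288
TEST_SIZE = 0.05   # same value passed to train_test_split
BATCH_SIZE = 8     # v_batch_size in the notebook


def split_sizes(n_examples, test_size):
    """Return (n_train, n_test), rounding the test share up (assumed rounding)."""
    n_test = math.ceil(test_size * n_examples)
    return n_examples - n_test, n_test


def n_batches(n_examples, batch_size):
    """Number of batches needed to cover n_examples, last batch may be partial."""
    return math.ceil(n_examples / batch_size)


n_train, n_test = split_sizes(N_EXAMPLES, TEST_SIZE)
print("train examples:", n_train, "batches:", n_batches(n_train, BATCH_SIZE))
print("val examples:  ", n_test, "batches:", n_batches(n_test, BATCH_SIZE))
```

Under these assumptions the batch counts agree with the progress bars, which suggests the whole file was loaded and split as intended.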

**INITIAL TESTS**

In [None]:
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=v_batch_size)
val_dataloader = DataLoader(val_data, batch_size=v_batch_size)

batch = next(iter(train_dataloader))

for k,v in batch.items():
  print(k, v.shape)
print('---------------------------------------------------------')
print(tokenizer.decode(batch["input_ids"][0].tolist()))
print('---------------------------------------------------------')
labels = batch["labels"][0].tolist()
labels = [label for label in labels if label != -100 ]
tokenizer.decode(labels)

input_ids torch.Size([8, 512])
attention_mask torch.Size([8, 512])
labels torch.Size([8, 512])
---------------------------------------------------------
[CLS] O primeiro versa sobre mera questão de fato ; o segundo, ao revés, sobre questão de direito. [SEP] [PAD] [PAD] [PAD] [...]

'[CLS] O primeiro é sobre mera questão de fato ; o segundo é sobre questão de direito. [SEP]'
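The decoded label above contains no `[PAD]` tokens because padding positions in the labels were replaced with -100 during preprocessing and filtered out again before decoding. A minimal sketch of that round trip on plain Python lists follows; `PAD_ID` and the token ids are hypothetical values chosen for illustration, not the real vocabulary ids of the tokenizer.

```python
IGNORE_INDEX = -100  # value the loss function skips, as used in the notebook
PAD_ID = 0           # hypothetical pad token id, for illustration only


def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace pad ids with IGNORE_INDEX so padding does not contribute to the loss."""
    return [IGNORE_INDEX if token == pad_id else token for token in label_ids]


def drop_ignored(label_ids):
    """Remove IGNORE_INDEX entries so the remaining ids can be decoded."""
    return [token for token in label_ids if token != IGNORE_INDEX]


# Hypothetical padded label sequence: [CLS]=101, two word pieces, [SEP]=102, padding
padded = [101, 7, 8, 102, PAD_ID, PAD_ID]
masked = mask_padding(padded)
print(masked)                # padding positions become -100
print(drop_ignored(masked))  # only the real tokens remain
```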

**METRICS**

In [None]:
# Metrics
rouge = load_metric("rouge")


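def compute_rouge(pred):
    # The Trainer calls compute_metrics with a single EvalPrediction object
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # Restore the PAD token where the labels were masked with -100
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)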

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
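To make the reported numbers easier to interpret, here is a simplified sketch of what ROUGE-2 measures: the overlap of word bigrams between a prediction and a reference. It is not the implementation behind `load_metric("rouge")`; it assumes plain lowercased whitespace tokenization and scores a single pair, whereas the library applies its own tokenizer and aggregates over the whole evaluation set.

```python
from collections import Counter


def bigram_counts(text):
    """Count word bigrams using lowercased whitespace tokens (a simplification)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Bigram-overlap precision, recall and F-measure for one sentence pair."""
    pred, ref = bigram_counts(prediction), bigram_counts(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Sentence pair taken from the sample batch shown earlier in the notebook
source = "O primeiro versa sobre mera questão de fato ; o segundo, ao revés, sobre questão de direito."
target = "O primeiro é sobre mera questão de fato ; o segundo é sobre questão de direito."
print(rouge2(source, target))
```

Scoring the unmodified source sentence against its simplified target, as done here, gives a rough sense of the copy baseline that a trained model should improve on.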

**TRAINING AND PARAMETER DEFINITION**

In [None]:
# Training
model = EncoderDecoderModel.from_encoder_decoder_pretrained(model_name, model_name)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.encoder.vocab_size
# settings for the generate() method
model.config.max_length = 120
model.config.min_length = 40
model.config.no_repeat_ngram_size = 3
model.config.early_stopping = True
model.config.length_penalty = 0.8
model.config.num_beams = 3

training_arguments = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy='steps',
    num_train_epochs=10,
    per_device_train_batch_size=v_batch_size,
    per_device_eval_batch_size=v_batch_size,
    fp16=torch.cuda.is_available(),
    output_dir=PATH + '/output',
    logging_steps=100,
    save_steps=3000,
    eval_steps=10000,
    warmup_steps=2000,
    gradient_accumulation_steps=1,
    save_total_limit=3
)

trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_arguments,
    compute_metrics=compute_rouge,
    train_dataset=train_data,
    eval_dataset=val_data
)

trainer.train()
trainer.save_model(PATH + '/model')

Some weights of the model checkpoint at juridics/bertimbaulaw-base-portuguese-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at juridics/bertimbaulaw-base-portuguese-cased and are newly initialized: ['bert.poo

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
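The warning above suggests setting `TOKENIZERS_PARALLELISM` explicitly. One way to do this, sketched below, is to set the environment variable at the top of the notebook before the tokenizer is first used. Choosing `"false"` here is an assumption that favors avoiding the fork warning over tokenization speed.

```python
import os

# Set before the tokenizer is created or used, so the setting is already in
# place when the process later forks (for example, for data loading workers).
# "false" is an assumed choice; "true" keeps parallel tokenization enabled.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```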


Epoch,Training Loss,Learning Rate
0.06,10.3457,2.5e-06
0.13,5.804,5e-06
0.19,4.0619,7.5e-06
0.25,3.2184,1e-05
0.32,2.7232,1.25e-05
0.38,2.446,1.5e-05
0.44,2.2503,1.75e-05

Attempted to log
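The logged learning rates appear consistent with a linear warmup over `warmup_steps=2000`. The sketch below reconstructs that schedule and the resulting evaluation and checkpoint steps. Several values are assumptions not stated in the notebook: a peak learning rate of 5e-5 (the `Seq2SeqTrainingArguments` default, since none is set), a linear decay after warmup, 1578 optimizer steps per epoch (taken from the training progress bar, assuming a single device), and log entries at multiples of `logging_steps=100`.

```python
# Values set in the notebook
WARMUP_STEPS = 2000
NUM_EPOCHS = 10
LOGGING_STEPS = 100
SAVE_STEPS = 3000
EVAL_STEPS = 10000

# Assumptions (not stated in the notebook)
BASE_LR = 5e-5          # assumed peak learning rate (library default)
STEPS_PER_EPOCH = 1578  # assumed: one optimizer step per batch on a single device
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS


def linear_schedule_lr(step):
    """Linear warmup to BASE_LR, then linear decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Reconstruct the first logged rows: epoch fraction and learning rate
for step in range(LOGGING_STEPS, 8 * LOGGING_STEPS, LOGGING_STEPS):
    print(step, round(step / STEPS_PER_EPOCH, 2), linear_schedule_lr(step))

# Steps at which checkpoints are written and evaluation runs
checkpoint_steps = list(range(SAVE_STEPS, TOTAL_STEPS + 1, SAVE_STEPS))
evaluation_steps = list(range(EVAL_STEPS, TOTAL_STEPS + 1, EVAL_STEPS))
print("checkpoints:", checkpoint_steps)
print("evaluations:", evaluation_steps)
```

Under these assumptions, `eval_steps=10000` yields only one evaluation in the whole run, so a smaller value may be worth considering if a validation curve is needed.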
