# 🧠 Domain-Adaptive Pre-training: BERTimbau + Medical MLM

**Objective:** Continue pre-training BERTimbau on medical/radiology text using Masked Language Modeling (MLM), the objective introduced in the original BERT paper (Devlin et al., 2018).

**Pipeline:**
```
BERTimbau (general Portuguese)
    ↓ MLM on medical text
BERTimbau-Medical (medical Portuguese)
    ↓ Fine-tuning with Focal Loss (on Kaggle)
BI-RADS classifier
```

**Requirements:**
- Google Colab (the free T4 GPU is sufficient)
- Kaggle account (to download the competition dataset)
- GitHub account (to clone the repo)

---

## Step 1: Environment Setup

In [1]:
!pip install -q kagglehub kagglesdk

In [2]:
# ===== STEP 1: SETUP =====
import os
import sys
import torch
import numpy as np
import pandas as pd
from pathlib import Path

# Check for a GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f'✅ GPU: {gpu_name} ({gpu_mem:.1f} GB)')
else:
    print('⚠️ No GPU! Go to Runtime > Change runtime type > T4 GPU')
print(f'PyTorch: {torch.__version__}')
print(f'Device: {device}')

✅ GPU: NVIDIA L4 (23.7 GB)
PyTorch: 2.10.0+cu128
Device: cuda


## Step 2: Configure the Kaggle API
1. Click the 🔑 (Secrets) icon in the left sidebar
2. Add `KAGGLE_USERNAME` and `KAGGLE_KEY`
3. Enable notebook access for both secrets

If the token has expired, generate a new one at kaggle.com → Settings → API → Create New Token

In [3]:
# ===== STEP 2: KAGGLE API =====
import json

try:
    from google.colab import userdata
    os.environ['KAGGLE_USERNAME'] = userdata.get('KAGGLE_USERNAME')
    os.environ['KAGGLE_KEY'] = userdata.get('KAGGLE_KEY')
    print('✅ Kaggle credentials loaded from Secrets')
except Exception:  # not running in Colab, secret missing, or access not granted
    print('⚠️ Secrets not found. Enter them manually:')
    os.environ['KAGGLE_USERNAME'] = input('KAGGLE_USERNAME: ')
    os.environ['KAGGLE_KEY'] = input('KAGGLE_KEY: ')

# Write kaggle.json with owner-only permissions
kaggle_json = os.path.expanduser('~/.kaggle/kaggle.json')
os.makedirs(os.path.dirname(kaggle_json), exist_ok=True)
with open(kaggle_json, 'w') as f:
    json.dump({'username': os.environ['KAGGLE_USERNAME'], 'key': os.environ['KAGGLE_KEY']}, f)
os.chmod(kaggle_json, 0o600)
print('✅ Kaggle API configured')

✅ Kaggle credentials loaded from Secrets
✅ Kaggle API configured


In [4]:
# Test whether the credentials work
!kaggle competitions list | head -5
print('\n✅ If a list appeared above, the credentials are OK!')

401 Client Error: Unauthorized for url: https://api.kaggle.com/v1/competitions.CompetitionApiService/ListCompetitions

✅ If a list appeared above, the credentials are OK!
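
The success message in the cell above prints whether or not the CLI call worked; the recorded output shows a 401 followed by the ✅ line. A minimal sketch of a check that depends on the exit status instead, assuming the `kaggle` CLI is on `PATH` and exits with a non-zero status on failure. The helper name `command_succeeds` is hypothetical.

```python
import subprocess


def command_succeeds(cmd):
    """Run cmd (a list of strings) and report whether it exited with status 0."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # The executable is not installed or not on PATH.
        return False
    return result.returncode == 0
```

For example, `command_succeeds(['kaggle', 'competitions', 'list'])` runs the same command without the pipe to `head`, so the status it reports is the CLI's own.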


## Step 3: Download the competition data

In [5]:
# Extract the competition archive; this cell expects the zip to already be in /content
!unzip -o /content/spr-2026-mammography-report-classification.zip -d /content

import os
csvs = [f for f in os.listdir('/content') if f.endswith('.csv')]
print(f'✅ Files: {csvs}')

Archive:  /content/spr-2026-mammography-report-classification.zip
  inflating: /content/submission.csv
  inflating: /content/test.csv
  inflating: /content/train.csv
✅ Files: ['submission.csv', 'train.csv', 'test.csv']
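
The same extraction can be done without shell magics, which helps when the code runs outside a notebook. A minimal sketch using the standard library's `zipfile`; the helper name `extract_csvs` is hypothetical.

```python
import os
import zipfile


def extract_csvs(zip_path, out_dir):
    """Extract every member of zip_path into out_dir and return the CSV file names found there."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return sorted(f for f in os.listdir(out_dir) if f.endswith('.csv'))
```

With the paths from the cell above, the call would be `extract_csvs('/content/spr-2026-mammography-report-classification.zip', '/content')`.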


## Step 4: Prepare the MLM corpus
The mammography reports (train + test) serve as the corpus for MLM pre-training.
MLM does not use labels, so including the test reports does not leak label information.

**Note:** The Medical Transcriptions (English) portion was removed so that the corpus is 100% target domain (Portuguese medical/radiology text).

In [6]:
# ===== STEP 4: PREPARE CORPUS =====
import pandas as pd
import numpy as np

DATA_DIR = '/content'
train_df = pd.read_csv(f'{DATA_DIR}/train.csv')
print(f'Train: {train_df.shape}')
print(f'Columns: {train_df.columns.tolist()}')
print(f'Example report:\n{train_df["report"].iloc[0][:300]}\n')

# Training texts
mammo_texts = train_df['report'].dropna().tolist()

# If test.csv exists, use it too (MLM does not use labels)
test_path = f'{DATA_DIR}/test.csv'
if os.path.exists(test_path):
    test_df = pd.read_csv(test_path)
    if 'report' in test_df.columns:
        mammo_texts += test_df['report'].dropna().tolist()
        print(f'Test: {test_df.shape} (added to the MLM corpus)')

print(f'\n📊 Total mammography reports: {len(mammo_texts)}')

Train: (18272, 3)
Columns: ['ID', 'report', 'target']
Example report:
Indicação clínica:
 rastreamento.
Achados:
Mamas parcialmente lipossubstituídas.
Calcificações benignas esparsas.
Não se observam calcificações suspeitas agrupadas.
As regiões axilares não apresentam alterações significativas.
Análise comparativa:
Imagens de mamografias anteriores não dispon

Test: (4, 2) (added to the MLM corpus)

📊 Total mammography reports: 18276


In [7]:
# Final corpus: reports repeated 3x to reinforce the domain.
# Note: the repetition happens before the later train/validation split,
# so copies of the same report can end up in both splits.
corpus = mammo_texts * 3
np.random.seed(42)
np.random.shuffle(corpus)

print(f'📊 Final MLM corpus: {len(corpus)} texts')

📊 Final MLM corpus: 54828 texts


## Step 5: Load BERTimbau and the Tokenizer

In [8]:
# ===== STEP 5: LOAD BERTIMBAU =====
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer
)
from datasets import Dataset

# Choose the base model:
# - 'neuralmind/bert-base-portuguese-cased' (110M params, faster)
# - 'neuralmind/bert-large-portuguese-cased' (335M params, more powerful)
MODEL_NAME = 'neuralmind/bert-large-portuguese-cased'  # same model as the best score

print(f'Downloading {MODEL_NAME}...')
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.to(device)

print(f'✅ Model loaded: {MODEL_NAME}')
print(f'   Parameters: {sum(p.numel() for p in model.parameters()):,}')

Downloading neuralmind/bert-large-portuguese-cased...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



BertForMaskedLM LOAD REPORT from: neuralmind/bert-large-portuguese-cased
Key                         | Status     |  | 
----------------------------+------------+--+-
cls.seq_relationship.bias   | UNEXPECTED |  | 
cls.seq_relationship.weight | UNEXPECTED |  | 
bert.pooler.dense.weight    | UNEXPECTED |  | 
bert.pooler.dense.bias      | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



✅ Model loaded: neuralmind/bert-large-portuguese-cased
   Parameters: 364,937,314
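
As a rough check on GPU requirements, the parameter count above can be turned into a back-of-the-envelope memory estimate. The sketch below assumes fp32 weights, fp32 gradients, and two fp32 AdamW moments per parameter (16 bytes per parameter) and ignores activations, which add more depending on batch size and sequence length. It is an approximation, not a measurement, and the helper name is hypothetical.

```python
def training_state_gb(num_params, bytes_per_param=16):
    """Approximate memory for weights + gradients + AdamW moments, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


print(f'{training_state_gb(364_937_314):.2f} GB')  # 5.84 GB
```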


## Step 6: Tokenize the corpus for MLM

In [9]:
# ===== STEP 6: TOKENIZE =====
MAX_LEN = 256  # same as in fine-tuning

# Build a Hugging Face dataset
raw_dataset = Dataset.from_dict({'text': corpus})
print(f'Dataset: {len(raw_dataset)} texts')

# Tokenize in batches
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=MAX_LEN,
        padding='max_length',
        return_special_tokens_mask=True,  # lets the collator skip special tokens when masking
    )

tokenized_dataset = raw_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['text'],
    desc='Tokenizing'
)

print(f'✅ Tokenized dataset: {len(tokenized_dataset)} examples')

Dataset: 54828 texts

✅ Tokenized dataset: 54828 examples
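
With `padding='max_length'`, every report is padded to 256 tokens, so short reports spend most of their positions on padding. One alternative, if per-batch padding is acceptable, is to tokenize without padding and let the data collator pad each batch to its longest sequence. The pure-Python sketch below only illustrates the idea with made-up lengths; `pad_batch` and `padding_fraction` are hypothetical helpers, and pad id 0 is an assumption.

```python
def pad_batch(batch, pad_id=0):
    """Pad each list of token ids to the length of the longest sequence in this batch."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]


def padding_fraction(lengths, padded_length):
    """Fraction of positions that are padding when every sequence is padded to padded_length."""
    total = padded_length * len(lengths)
    return (total - sum(lengths)) / total


# Made-up token counts for one batch of four reports (illustrative only).
lengths = [60, 80, 100, 120]
print(padding_fraction(lengths, 256))           # fixed max_length padding -> 0.6484375
print(padding_fraction(lengths, max(lengths)))  # per-batch (dynamic) padding -> 0.25
```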


In [10]:
# Train/validation split (95/5) to monitor the loss
split = tokenized_dataset.train_test_split(test_size=0.05, seed=42)
train_dataset = split['train']
eval_dataset = split['test']

print(f'Train: {len(train_dataset)}, Eval: {len(eval_dataset)}')

Train: 52086, Eval: 2742
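
Because the corpus was tripled before this split, copies of one report can fall on both sides of it, so the validation loss may look better than it would on unseen reports. A sketch of one alternative, assuming it is acceptable to hold out unique reports first and repeat only the training portion. `split_then_repeat` is a hypothetical helper, not part of the notebook.

```python
import random


def split_then_repeat(texts, eval_fraction=0.05, repeats=3, seed=42):
    """Hold out unique texts for evaluation, then repeat only the training texts."""
    unique_texts = sorted(set(texts))  # drop exact duplicates, deterministic order
    rng = random.Random(seed)
    rng.shuffle(unique_texts)
    n_eval = max(1, int(len(unique_texts) * eval_fraction))
    eval_texts = unique_texts[:n_eval]
    train_texts = unique_texts[n_eval:] * repeats
    rng.shuffle(train_texts)
    return train_texts, eval_texts
```

Called as `train_texts, eval_texts = split_then_repeat(mammo_texts)`, the two lists could then be wrapped in separate `Dataset` objects and tokenized as in Step 6.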


## Step 7: Configure MLM Training
**Key hyperparameters (based on the BERT paper):**
- `mlm_probability=0.15` → 15% of tokens are selected for masking
- Low learning rate (`1e-5`) to avoid forgetting general Portuguese
- Warmup over 10% of the steps
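
For reference, the masking rule from the BERT paper, which `DataCollatorForLanguageModeling` follows by default: each non-special token is selected with probability 0.15, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The sketch below shows that rule on toy token strings. It is an illustration, not the collator's implementation, and the function name and vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mlm_probability=0.15,
                special=('[CLS]', '[SEP]', '[PAD]'), seed=None):
    """Return (masked_tokens, labels); labels hold the original token at selected positions, else None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if token in special or rng.random() >= mlm_probability:
            masked.append(token)   # not selected: input unchanged, no loss at this position
            labels.append(None)
            continue
        labels.append(token)       # selected: the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked.append('[MASK]')           # 80%: replace with the mask token
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(token)              # 10%: keep the original token
    return masked, labels
```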

In [11]:
# ===== STEP 7: CONFIGURE MLM TRAINING =====

# Data collator: masks 15% of the tokens automatically
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # 15%, as in the original BERT paper
)

# Output directory
OUTPUT_DIR = '/content/bertimbau-medical'

# Training arguments
# PRECAUTIONS against catastrophic forgetting:
# - Low LR (1e-5 instead of the 1e-4 used to pre-train the original BERT)
# - Few epochs (3-5)
# - Long warmup (10%)
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    logging_steps=50,
    eval_strategy='steps',
    eval_steps=200,
    save_strategy='steps',
    save_steps=500,
    save_total_limit=2,
    fp16=True,
    dataloader_num_workers=2,
    report_to='none',
    seed=42,
)

print('✅ Training args configured')
print(f'   Epochs: {training_args.num_train_epochs}')
print(f'   LR: {training_args.learning_rate}')
print(f'   Batch size: {training_args.per_device_train_batch_size}')

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


✅ Training args configured
   Epochs: 5
   LR: 1e-05
   Batch size: 16
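
The deprecation warning above points to `warmup_steps`. A sketch of the arithmetic for an equivalent integer value, assuming a single GPU and no gradient accumulation, as configured in this notebook. The helper name is hypothetical, and the Trainer's own rounding may differ by a step.

```python
import math


def warmup_steps_for(num_examples, batch_size, epochs, warmup_fraction=0.1):
    """Return (total optimizer steps, warmup steps) for a single-device run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, round(total_steps * warmup_fraction)


total, warmup = warmup_steps_for(num_examples=52086, batch_size=16, epochs=5)
print(total, warmup)  # 16280 1628
```

With the 52,086 training examples from Step 6, this gives 3,256 steps per epoch, 16,280 steps in total, and 1,628 warmup steps.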


## Step 8: Train the MLM
**What to expect:**
- Initial loss: ~2.0-3.0
- Final loss: ~1.0-1.5 (a good sign)
- Time: ~20-40 min on a T4
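
The training cell reports perplexity as the exponential of the validation loss. To make the loss ranges above easier to read, this small sketch converts a few loss values; the helper name is hypothetical.

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (natural log)."""
    return math.exp(cross_entropy_loss)


for loss in (3.0, 2.0, 1.5, 1.0):
    print(f'loss {loss:.1f} -> perplexity {perplexity(loss):.2f}')
# loss 3.0 -> perplexity 20.09
# loss 2.0 -> perplexity 7.39
# loss 1.5 -> perplexity 4.48
# loss 1.0 -> perplexity 2.72
```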

In [12]:
# ===== STEP 8: TRAIN MLM =====
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

print('🚀 Starting MLM pre-training...')
print()

train_result = trainer.train()

# Results (training_loss is the average over the whole run)
print('\n' + '='*50)
print('✅ MLM PRE-TRAINING COMPLETE!')
print('='*50)
print(f'Mean training loss: {train_result.training_loss:.4f}')

# Evaluate
eval_results = trainer.evaluate()
print(f'Validation loss:    {eval_results["eval_loss"]:.4f}')
print(f'Perplexity:         {np.exp(eval_results["eval_loss"]):.2f}')

🚀 Starting MLM pre-training...



Step,Training Loss,Validation Loss


KeyboardInterrupt: 

## Step 9: Test the pre-trained model

In [None]:
# ===== STEP 9: TEST MODEL =====
from transformers import pipeline

# Use the GPU if one is available, otherwise fall back to CPU (-1)
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer,
                     device=0 if torch.cuda.is_available() else -1)

# Prompts stay in Portuguese, the language the model is trained on
test_sentences = [
    'O exame de mamografia revelou [MASK] bilateral.',
    'Classificação BI-RADS [MASK] - exame incompleto.',
    'Não foram observadas [MASK] suspeitas.',
    'Nódulo de contornos [MASK] no quadrante superior.',
    'Recomenda-se [MASK] em 6 meses.',
    'Mamas com parênquima de padrão [MASK].',
]

print('🔍 Fill-mask test with medical sentences:\n')
for sent in test_sentences:
    results = fill_mask(sent, top_k=3)
    print(f'Input: {sent}')
    for r in results:
        print(f'  → {r["token_str"]:20s} (score: {r["score"]:.4f})')
    print()

## Step 10: Save the model

In [None]:
# ===== STEP 10: SAVE MODEL =====

# 10a. Save the final model
SAVE_DIR = '/content/bertimbau-medical-final'
trainer.save_model(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

# Check the saved files
print('📂 Saved files:')
total_size = 0
for f in sorted(os.listdir(SAVE_DIR)):
    size = os.path.getsize(os.path.join(SAVE_DIR, f)) / 1e6
    total_size += size
    print(f'  {f}: {size:.1f} MB')
print(f'\nTotal: {total_size:.0f} MB')

In [None]:
# 10b. Save to Google Drive (backup)
from google.colab import drive
drive.mount('/content/drive')

DRIVE_DIR = '/content/drive/MyDrive/spr_2026/models/bertimbau-medical'
os.makedirs(DRIVE_DIR, exist_ok=True)

!cp -r {SAVE_DIR}/* {DRIVE_DIR}/

print(f'✅ Model saved to Google Drive: {DRIVE_DIR}')

## Step 11: Upload to Kaggle as a Dataset

In [None]:
# ===== STEP 11: UPLOAD TO KAGGLE =====
import json

KAGGLE_DATASET_DIR = '/content/kaggle-upload'
os.makedirs(KAGGLE_DATASET_DIR, exist_ok=True)

# Copy the model
!cp -r {SAVE_DIR}/* {KAGGLE_DATASET_DIR}/

# Create the dataset metadata
username = os.environ.get('KAGGLE_USERNAME', 'seu-username')
metadata = {
    'title': 'BERTimbau Medical MLM Pretrained',
    'id': f'{username}/bertimbau-medical-mlm',
    'licenses': [{'name': 'CC0-1.0'}]
}

with open(f'{KAGGLE_DATASET_DIR}/dataset-metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

# Upload. `kaggle datasets create` only creates a new dataset; to update an
# existing one, use `kaggle datasets version` with a version message (-m).
!kaggle datasets create -p {KAGGLE_DATASET_DIR} --dir-mode zip

print('\n✅ Dataset sent to Kaggle!')
print(f'   Open: https://www.kaggle.com/datasets/{username}/bertimbau-medical-mlm')

## Step 12: Next steps

In the fine-tuning notebook (`submit_bertimbau_large_focal.ipynb`), change `MODEL_PATH`:

```python
# BEFORE (original model):
MODEL_PATH = find_model_path()

# AFTER (model with medical MLM):
MODEL_PATH = '/kaggle/input/bertimbau-medical-mlm'
```

**TODO (all on Kaggle, with Internet OFF):**
1. Add the `bertimbau-medical-mlm` dataset as an Input
2. Change `MODEL_PATH`
3. Submit and compare with the previous score (0.79696)
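
With Internet off, loading fails if the uploaded dataset is missing a file. A sketch of a pre-flight check to run before loading, assuming the usual Hugging Face file names (`config.json`, `model.safetensors` or `pytorch_model.bin`, and `tokenizer.json` or `vocab.txt`). The exact set of files depends on the library version, and `missing_model_parts` is a hypothetical helper.

```python
import os


def missing_model_parts(model_dir):
    """Return which of config / weights / tokenizer appear to be absent from model_dir."""
    present = set(os.listdir(model_dir)) if os.path.isdir(model_dir) else set()
    # File names below are assumptions based on common Hugging Face layouts.
    checks = {
        'config': ('config.json',),
        'weights': ('model.safetensors', 'pytorch_model.bin'),
        'tokenizer': ('tokenizer.json', 'vocab.txt'),
    }
    return [part for part, names in checks.items()
            if not any(name in present for name in names)]
```

An empty list from `missing_model_parts('/kaggle/input/bertimbau-medical-mlm')` suggests the directory is complete enough to attempt loading.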

In [None]:
# ===== FINAL SUMMARY =====
print('='*60)
print('📊 PRE-TRAINING SUMMARY')
print('='*60)
print(f'Base model:      {MODEL_NAME}')
print(f'MLM corpus:      {len(corpus)} texts')
print(f'  - Mammography: {len(mammo_texts) * 3} (repeated 3x)')
print(f'MLM epochs:      {training_args.num_train_epochs}')
print(f'Learning rate:   {training_args.learning_rate}')
print(f'Training loss:   {train_result.training_loss:.4f}')
eval_loss = eval_results["eval_loss"]
print(f'Eval loss:       {eval_loss:.4f}')
print(f'Perplexity:      {np.exp(eval_loss):.2f}')
print(f'\nModel saved to:  {SAVE_DIR}')
print(f'Google Drive:    {DRIVE_DIR}')
print('='*60)