# Validation: Zero/Few-shot LLMs

**Language Models for Classification**

## 📊 Models Tested
- Qwen 3 (4B) - Zero-shot and Few-shot
- MedGemma (4B) - Zero-shot with BI-RADS instruction
- Phi-3.5 (3.8B) - Comparison

## 🎯 Objective
Evaluate how LLMs perform on BI-RADS classification without fine-tuning.

---

In [None]:
import os
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

if os.path.exists('/kaggle/input'):
    DATA_DIR = '/kaggle/input/competitions/spr-2026-mammography-report-classification'
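    def find_model_path(base='/kaggle/input', max_depth=10):
        """Return the first directory under `base` that contains a config.json, or None."""
        def search_dir(directory, depth=0):
            if depth > max_depth:
                return None
            try:
                entries = sorted(os.listdir(directory))  # sorted for a deterministic result
            except OSError:
                return None  # unreadable directory: skip it
            for item in entries:
                path = os.path.join(directory, item)
                if not os.path.isdir(path):
                    continue
                if os.path.exists(os.path.join(path, 'config.json')):
                    return path
                result = search_dir(path, depth + 1)
                if result:
                    return result
            return None
        return search_dir(base)
    MODEL_PATH = find_model_path()
    if MODEL_PATH is None:
        raise FileNotFoundError('No directory with a config.json found under /kaggle/input')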
else:
    DATA_DIR = '../data'
    MODEL_PATH = 'Qwen/Qwen3-4B-Instruct'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Device: {device}')
print(f'Model: {MODEL_PATH}')

In [None]:
# ===== DATA =====
train_df = pd.read_csv(f'{DATA_DIR}/train.csv')

# Use a small stratified sample for validation (LLM inference is slow):
# at most 30 reports per class. Sampling each group explicitly keeps the
# 'target' column regardless of how groupby.apply treats grouping columns.
train_sample = pd.concat(
    [g.sample(min(30, len(g)), random_state=SEED) for _, g in train_df.groupby('target')]
).reset_index(drop=True)

train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_sample['report'].tolist(),
    train_sample['target'].tolist(),
    test_size=0.3,  # Larger validation share because the sample is small
    stratify=train_sample['target'],
    random_state=SEED
)

print(f'Train: {len(train_texts)}, Val: {len(val_texts)}')

In [None]:
# ===== PROMPTS =====

# Zero-shot system prompt, kept in PT-BR (the original prompt language)
SYSTEM_PROMPT_ZERO = """Você é um radiologista especialista em classificação BI-RADS de mamografias.

## Categorias BI-RADS:
- 0: Incompleto - necessita imagens adicionais
- 1: Negativo - mamografia normal
- 2: Benigno - achados definitivamente benignos
- 3: Provavelmente benigno - <2% malignidade, seguimento 6 meses
- 4: Suspeito - 2-95% malignidade, biópsia recomendada
- 5: Altamente sugestivo de malignidade - >95%
- 6: Malignidade comprovada por biópsia

Responda APENAS com o número da categoria (0-6)."""

# Few-shot: the same prompt followed by worked examples
FEW_SHOT_EXAMPLES = """
## Exemplos:

Relatório: "Exame realizado para controle. Imagens mostram parênquima mamário denso, sem nódulos, calcificações suspeitas ou distorções arquiteturais."
BI-RADS: 1

Relatório: "Presença de nódulo oval, circunscrito, paralelo à pele, no QSE da mama direita, medindo 8mm, com características benignas."
BI-RADS: 2

Relatório: "Nódulo irregular, de contornos microlobulados, com 15mm na JQQ da mama direita. Biópsia recomendada."
BI-RADS: 4

Relatório: "Lesão espiculada, densa, de 25mm na região retroareolar esquerda, associada a microcalcificações pleomórficas."
BI-RADS: 5
"""

SYSTEM_PROMPT_FEW = SYSTEM_PROMPT_ZERO + FEW_SHOT_EXAMPLES

USER_TEMPLATE = """Relatório:
{report}

BI-RADS:"""
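
The few-shot examples in this cell are hard-coded, while `train_texts` and `train_labels` from the split are otherwise unused. As a minimal sketch, the example block could instead be assembled from labelled training reports. The helper name `build_few_shot_block` and its `per_class` and `max_chars` parameters are hypothetical, not part of the notebook, and the toy reports below are placeholders. Examples should come only from the training split so that validation reports never leak into the prompt.

```python
def build_few_shot_block(texts, labels, per_class=1, max_chars=400):
    """Format up to `per_class` labelled reports per category as a few-shot block."""
    picked = {}
    for text, label in zip(texts, labels):
        bucket = picked.setdefault(label, [])
        if len(bucket) < per_class:
            bucket.append(text[:max_chars])  # clip long reports to keep the prompt short

    # Same layout as the hard-coded block: header, then report/label pairs
    lines = ["", "## Exemplos:", ""]
    for label in sorted(picked):
        for text in picked[label]:
            lines.append(f'Relatório: "{text}"')
            lines.append(f"BI-RADS: {label}")
            lines.append("")
    return "\n".join(lines)


# Toy inputs (placeholders, not real reports)
demo_block = build_few_shot_block(
    ["achados benignos", "mamografia normal", "outro exame normal"],
    [2, 1, 1],
)
print(demo_block)
```

In the notebook this would be called as `build_few_shot_block(train_texts, train_labels)` and appended to `SYSTEM_PROMPT_ZERO`.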

In [None]:
# ===== LOAD MODEL =====
print("Loading model...")

# On Kaggle the weights are read from /kaggle/input; elsewhere MODEL_PATH is a
# Hub id, so downloading must be allowed unless the model is already cached.
local_only = os.path.exists('/kaggle/input')

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, local_files_only=local_only)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, local_files_only=local_only,
    torch_dtype=torch.bfloat16, device_map="auto", low_cpu_mem_usage=True
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Model loaded: {model.config.architectures}")

In [None]:
# ===== CLASSIFICATION FUNCTION =====
def classify_report(report, system_prompt, max_tokens=10):
    """Generate a short completion for one report and return the first digit 0-6 in it."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": USER_TEMPLATE.format(report=report)}
    ]

    # Use the model's chat template when it defines one; otherwise fall back to plain text
    if getattr(tokenizer, 'chat_template', None):
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    else:
        text = f"{system_prompt}\n\n{USER_TEMPLATE.format(report=report)}"

    # Truncation cuts from the end, so a very long report can lose the final "BI-RADS:" cue
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_new_tokens=max_tokens, do_sample=False,
            pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

    # Extract the first digit in the valid range
    for char in response.strip():
        if char in '0123456':
            return int(char)
    return 2  # Fallback when no digit is found (silently counted as BI-RADS 2)
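
`classify_report` takes the first character in `0123456` and falls back to category 2 when none is found, so a reply such as "12" is read as 1 and unparseable replies are indistinguishable from genuine BI-RADS 2 predictions. A minimal sketch of a stricter parser follows. The name `parse_birads` is hypothetical, and returning `None` instead of a default is an assumption about how failures should be handled, not the notebook's behaviour.

```python
import re

# A digit 0-6 that is not part of a longer number
_BIRADS_DIGIT = re.compile(r"(?<!\d)[0-6](?!\d)")


def parse_birads(response):
    """Return the first standalone digit 0-6 in `response`, or None if there is none."""
    match = _BIRADS_DIGIT.search(response)
    return int(match.group()) if match else None


# Toy model replies (placeholders)
examples = ["4", " BI-RADS: 3", "Categoria 12", "sem resposta"]
parsed = [parse_birads(r) for r in examples]
n_failed = sum(p is None for p in parsed)
print(parsed, n_failed)
```

Counting `None` results separately shows how often the model fails to follow the output format, which the default of 2 would otherwise hide.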

In [None]:
# ===== TEST 1: ZERO-SHOT =====
print("\n" + "="*50)
print("Zero-shot Classification")
print("="*50)

zero_preds = [
    classify_report(text, SYSTEM_PROMPT_ZERO) 
    for text in tqdm(val_texts, desc='Zero-shot')
]

zero_f1 = f1_score(val_labels, zero_preds, average='macro')
print(f'F1-Macro: {zero_f1:.5f}')
print(classification_report(val_labels, zero_preds))
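
Macro-F1 averages the per-class F1 scores with equal weight, so a rare category counts as much as a common one. The sketch below spells out that computation in plain Python on toy labels. The function name `macro_f1` is illustrative, and it assumes the set of classes is the union of true and predicted labels, so a class that is predicted but never present scores 0 and lowers the average.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: class 0 has F1 = 2/3, class 1 has F1 = 0.8
toy_score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
print(f"{toy_score:.5f}")
```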

In [None]:
# ===== TEST 2: FEW-SHOT =====
print("\n" + "="*50)
print("Few-shot Classification (4 examples)")
print("="*50)

few_preds = [
    classify_report(text, SYSTEM_PROMPT_FEW) 
    for text in tqdm(val_texts, desc='Few-shot')
]

few_f1 = f1_score(val_labels, few_preds, average='macro')
print(f'F1-Macro: {few_f1:.5f}')
print(classification_report(val_labels, few_preds))

In [None]:
# ===== ERROR ANALYSIS =====
print("\n" + "="*50)
print("Error Analysis (Few-shot)")
print("="*50)

# Collect every validation report whose prediction differs from the true label
errors = []
for text, true, pred in zip(val_texts, val_labels, few_preds):
    if true != pred:
        errors.append({
            'text': text[:100] + '...',
            'true': true,
            'pred': pred
        })

print(f'Total errors: {len(errors)} / {len(val_labels)} ({100*len(errors)/len(val_labels):.1f}%)')
print('\nExample errors:')
for e in errors[:5]:
    print(f"  True: {e['true']}, Pred: {e['pred']} | {e['text']}")
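
Beyond listing individual errors, it can help to count which (true, predicted) pairs occur and how many errors fall on numerically adjacent categories. This is a rough sketch: the helper `summarize_errors` is hypothetical, the labels are toy values, and numeric adjacency is only an approximation of clinical closeness (category 0, for instance, is not "one step below" category 1).

```python
from collections import Counter


def summarize_errors(y_true, y_pred):
    """Count misclassified (true, pred) pairs and the share that differ by exactly 1."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    n_errors = sum(pairs.values())
    n_adjacent = sum(n for (t, p), n in pairs.items() if abs(t - p) == 1)
    adjacent_share = n_adjacent / n_errors if n_errors else 0.0
    return pairs, adjacent_share


# Toy labels (placeholders, not notebook results)
pairs, adjacent_share = summarize_errors([1, 2, 2, 4, 5, 0], [1, 1, 3, 5, 2, 0])
for (true, pred), count in pairs.most_common():
    print(f"true={true} pred={pred}: {count}")
print(f"adjacent share: {adjacent_share:.2f}")
```

In the notebook the call would be `summarize_errors(val_labels, few_preds)`.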

In [None]:
# ===== SUMMARY =====
print("\n" + "="*60)
print("📊 SUMMARY - LLM Validation")
print("="*60)

results = [
    ('Zero-shot', zero_f1),
    ('Few-shot (4 ex)', few_f1),
]

# Best configuration first
for name, f1 in sorted(results, key=lambda x: -x[1]):
    print(f"{name:<20} {f1:.5f}")

print("\n📝 Reference (TF-IDF): 0.77885")
print("📝 Reference (BERTimbau v4): 0.82073")
print("\n⚠️ LLMs are slower (~10x) than fine-tuned transformers")
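
Given the reference scores from trained models and the cost of LLM inference, one option is to call the LLM only when a trained classifier is unsure. The sketch below shows such a confidence fallback. The function `ensemble_predict`, the `threshold` value of 0.6 and the probability vectors are all hypothetical. It assumes the base model outputs one probability per BI-RADS category, indexed by category number.

```python
def ensemble_predict(base_probs, llm_pred, threshold=0.6):
    """Keep the base model's top class unless its confidence is below `threshold`.

    base_probs: per-class probabilities from a trained classifier (index = class).
    llm_pred:   class predicted by the LLM, or None if its reply could not be parsed.
    """
    best = max(range(len(base_probs)), key=base_probs.__getitem__)
    if base_probs[best] >= threshold or llm_pred is None:
        return best
    return llm_pred  # ambiguous case: defer to the LLM


# Toy cases (placeholder probabilities)
confident = ensemble_predict([0.1, 0.8, 0.1], llm_pred=2)
ambiguous = ensemble_predict([0.4, 0.35, 0.25], llm_pred=2)
unparsed = ensemble_predict([0.4, 0.35, 0.25], llm_pred=None)
print(confident, ambiguous, unparsed)
```

Restricting LLM calls to low-confidence reports also limits the extra inference time to a fraction of the dataset.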

In [None]:
# ===== INSIGHTS =====
print("""
📝 INSIGHTS - Zero/Few-shot LLMs
=================================

1. **Zero-shot vs few-shot:**
   - [FILL IN AFTER EXPERIMENTS]
   - Examples generally help

2. **BI-RADS specifics:**
   - A prompt with detailed category descriptions helps
   - The model may confuse adjacent categories

3. **Limitations:**
   - MUCH slower than a fine-tuned model
   - Does not learn dataset-specific patterns
   - The token limit can truncate long reports

4. **When to use:**
   - As a training-free baseline
   - For ambiguous cases (ensemble)
   - For explainability
""")