# SPR 2026 - Super Ensemble v1 (BERTimbau + TF-IDF + Stacking)

**The strongest ensemble we could build, combining ALL the lessons learned so far!**

## 🏆 Base Models (by score)
| Model | Score | Weight |
|--------|-------|------|
| BERTimbau + Focal Loss | 0.79696 | 0.45 |
| TF-IDF + LinearSVC | 0.77885 | 0.25 |
| TF-IDF + SGDClassifier v3 | 0.77036 | 0.20 |
| TF-IDF + LogReg | 0.72935 | 0.10 |
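
For comparison, weights that are strictly proportional to the scores in the table would be nearly uniform (each close to 0.25), so the weights above are better read as hand-set values that follow the score ranking. A minimal sketch of the proportional calculation; `proportional_weights` and its `power` knob are hypothetical helpers, not part of this notebook:

```python
import numpy as np

# Scores copied from the table above.
scores = {
    "bertimbau": 0.79696,
    "linearsvc": 0.77885,
    "sgd": 0.77036,
    "logreg": 0.72935,
}

def proportional_weights(scores, power=1.0):
    """Normalize scores**power so the weights sum to 1.

    `power` is a hypothetical knob: values above 1 favor the best model more.
    """
    names = list(scores)
    raw = np.array([scores[n] for n in names], dtype=float) ** power
    return dict(zip(names, raw / raw.sum()))

prop = proportional_weights(scores)
for name, w in prop.items():
    print(f"{name:10s} {w:.3f}")
```

Raising `power` is one way to move from near-uniform weights toward something like the 0.45 / 0.25 / 0.20 / 0.10 split, but any such choice should be checked on validation data.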

## 🎯 Technical Strategy
1. **BERTimbau + Focal Loss** (gamma=2.0, alpha=0.25): the best single model so far
2. **Optimized TF-IDF** (15k features, n-grams 1-3, sublinear_tf)
3. **CalibratedClassifierCV** for calibrated probabilities
4. **Weighted soft voting** with hand-set weights that follow the score ranking
5. **Threshold tuning** for the minority classes (5, 6)
6. **LogReg meta-learner** as the final stacking layer (not implemented in the cells below, which stop at weighted soft voting plus threshold adjustment)
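
The stacking step in item 6 could look roughly like the sketch below: each base model produces out-of-fold probabilities on the training set, and a logistic regression is fit on the concatenated probabilities. This is an illustration on synthetic data, assuming two TF-IDF-style base models only; including BERTimbau would require out-of-fold predictions from a per-fold fine-tuning run, which this notebook does not do.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and labels (assumption).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

base_models = {
    "logreg": LogisticRegression(max_iter=1000),
    "linearsvc": CalibratedClassifierCV(LinearSVC(max_iter=5000), cv=3),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

oof_blocks, test_blocks = [], []
for name, est in base_models.items():
    # Out-of-fold probabilities: each training row is scored by a model
    # that never saw it, so the meta-learner does not train on leaked labels.
    oof_blocks.append(cross_val_predict(est, X_tr, y_tr, cv=cv, method="predict_proba"))
    # Refit on the full training split to score the held-out rows.
    test_blocks.append(est.fit(X_tr, y_tr).predict_proba(X_te))

meta_X_tr = np.hstack(oof_blocks)   # shape: (n_train, n_models * n_classes)
meta_X_te = np.hstack(test_blocks)

meta = LogisticRegression(max_iter=1000).fit(meta_X_tr, y_tr)
stacked_pred = meta.predict(meta_X_te)
print("meta features:", meta_X_tr.shape, "| predictions:", stacked_pred.shape)
```

Compared with fixed voting weights, the meta-learner learns per-class weights for each base model, at the cost of one extra cross-validation pass per model.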

## 📈 Lessons Applied
- ✅ Focal Loss (gamma=2.0) works better than cross-entropy under class imbalance
- ✅ RandomizedSearch improved SGD by +2.7%
- ✅ Probability calibration is essential for soft voting
- ✅ Model diversity (transformer + TF-IDF) is key
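
To make the Focal Loss point concrete, the sketch below compares cross-entropy with the focal term `alpha * (1 - p_t)**gamma * CE` for three hypothetical samples, where `p_t` is the probability the model assigns to the true class. It uses NumPy only and the same gamma and alpha as this notebook; the sample probabilities are made up for illustration.

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Per-sample focal loss given the probability of the true class."""
    p_t = np.asarray(p_t, dtype=float)
    ce = -np.log(p_t)                      # plain cross-entropy
    return alpha * (1.0 - p_t) ** gamma * ce

# Hypothetical samples: easy, medium, and hard.
p_t = np.array([0.95, 0.60, 0.10])
ce = -np.log(p_t)
fl = focal_loss(p_t)

for p, c, f in zip(p_t, ce, fl):
    # The ratio shows how strongly each sample is down-weighted.
    print(f"p_t={p:.2f}  CE={c:.4f}  focal={f:.6f}  focal/CE={f / c:.6f}")
```

The ratio `focal/CE` equals `alpha * (1 - p_t)**gamma`, so well-classified samples contribute almost nothing and hard samples dominate the gradient. With a scalar `alpha`, as used here, `alpha` rescales every sample equally and does not rebalance classes by itself; the rebalancing effect comes from the `(1 - p_t)**gamma` factor.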

---
**KAGGLE SETUP:**
1. **Add Input** → **Models** → `bertimbau-ptbr-complete` (fabianofilho)
2. **Settings** → Internet → **OFF**
3. **Settings** → Accelerator → **GPU T4 x2**
4. **Run Time:** ~30-45 min

---

In [None]:
# =============================================================================
# SPR 2026 - SUPER ENSEMBLE v1
# =============================================================================
# Combina BERTimbau + Focal Loss + TF-IDF Models + Stacking Meta-Learner
# =============================================================================

import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import f1_score
from scipy.stats import loguniform
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

print("="*70)
print("SPR 2026 - SUPER ENSEMBLE v1")
print("BERTimbau + Focal Loss + LinearSVC + SGD + LogReg + Stacking")
print("="*70)

# ==== CONFIGURATION ====
SEED = 42
DATA_DIR = '/kaggle/input/spr-2026-mammography-report-classification'

# BERTimbau config
MAX_LEN = 256
BATCH_SIZE = 8
EPOCHS = 4
LR = 2e-5
NUM_CLASSES = 7
FOCAL_GAMMA = 2.0
FOCAL_ALPHA = 0.25

# Ensemble weights (hand-set, ordered by score rather than strictly proportional)
# BERTimbau: 0.797, LinearSVC: 0.779, SGD: 0.770, LogReg: 0.729
WEIGHTS = {
    'bertimbau': 0.45,   # Best model
    'linearsvc': 0.25,   # Best TF-IDF model
    'sgd': 0.20,         # Second-best TF-IDF model (tuned v3)
    'logreg': 0.10       # Kept for diversity
}

# Threshold tuning for the minority classes (5 and 6)
THRESHOLDS = {0: 0.50, 1: 0.50, 2: 0.50, 3: 0.50, 4: 0.50, 5: 0.35, 6: 0.35}

np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\nDevice: {device}")
print(f"Weights: {WEIGHTS}")

In [None]:
# ==== LOAD DATA ====
print("\n[1/9] Loading data...")
train_df = pd.read_csv(f'{DATA_DIR}/train.csv')
test_df = pd.read_csv(f'{DATA_DIR}/test.csv')

# Auto-detect column names from a short list of candidates
TEXT_COL = next((c for c in ['report', 'text', 'laudo'] if c in train_df.columns), None)
LABEL_COL = next((c for c in ['target', 'label', 'birads'] if c in train_df.columns), None)
ID_COL = next((c for c in ['ID', 'id', 'Id'] if c in test_df.columns), None)

# Fail early with a clear message instead of a KeyError on train_df[None]
if None in (TEXT_COL, LABEL_COL, ID_COL):
    raise KeyError(f"Could not detect columns: text={TEXT_COL}, label={LABEL_COL}, id={ID_COL}")

print(f"Train: {train_df.shape} | Test: {test_df.shape}")
print(f"Columns: text={TEXT_COL}, label={LABEL_COL}, id={ID_COL}")
print(f"Classes: {sorted(train_df[LABEL_COL].unique())}")
print(f"Distribution:\n{train_df[LABEL_COL].value_counts().sort_index()}")

In [None]:
# ==== FOCAL LOSS ====
print("\n[2/9] Defining Focal Loss...")

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0, reduction='mean'):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction
    
    def forward(self, inputs, targets):
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        if self.reduction == 'mean':
            return focal_loss.mean()
        if self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss  # reduction='none': per-sample losses

criterion = FocalLoss(alpha=FOCAL_ALPHA, gamma=FOCAL_GAMMA)
print(f"Focal Loss configured: gamma={FOCAL_GAMMA}, alpha={FOCAL_ALPHA}")

In [None]:
# ==== AUTO-DETECT THE BERTIMBAU MODEL ====
print("\n[3/9] Locating BERTimbau...")

def find_model_path():
    base = '/kaggle/input'
    def search_dir(directory, depth=0, max_depth=10):
        if depth > max_depth:
            return None
        try:
            for item in os.listdir(directory):
                path = os.path.join(directory, item)
                if os.path.isdir(path):
                    if os.path.exists(os.path.join(path, 'config.json')):
                        return path
                    result = search_dir(path, depth + 1, max_depth)
                    if result:
                        return result
        except OSError:
            # Skip directories that cannot be listed
            pass
        return None
    return search_dir(base)

MODEL_PATH = find_model_path()
if MODEL_PATH is None:
    raise FileNotFoundError("BERTimbau not found! Add the model: bertimbau-ptbr-complete")
print(f"BERTimbau found: {MODEL_PATH}")

In [None]:
# ==== DATASET CLASS ====
print("\n[4/9] Preparing datasets...")

class TextDataset(Dataset):
    def __init__(self, texts, labels=None, tokenizer=None, max_len=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        enc = self.tokenizer(
            str(self.texts[idx]),
            truncation=True,
            max_length=self.max_len,
            padding='max_length',
            return_tensors='pt',
        )
        item = {
            # squeeze(0) drops only the batch dimension added by return_tensors='pt'
            'input_ids': enc['input_ids'].squeeze(0),
            'attention_mask': enc['attention_mask'].squeeze(0),
        }
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, local_files_only=True)

# Hold out 10% of the training data for validation (stratified by label)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_df[TEXT_COL].values, train_df[LABEL_COL].values,
    test_size=0.1, random_state=SEED, stratify=train_df[LABEL_COL]
)

train_ds = TextDataset(train_texts, train_labels, tokenizer, MAX_LEN)
val_ds = TextDataset(val_texts, val_labels, tokenizer, MAX_LEN)
test_ds = TextDataset(test_df[TEXT_COL].values, None, tokenizer, MAX_LEN)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE)
test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE)

print(f"Train: {len(train_ds)} | Val: {len(val_ds)} | Test: {len(test_ds)}")

In [None]:
# ==== TRAIN BERTIMBAU + FOCAL LOSS ====
print("\n[5/9] Training BERTimbau + Focal Loss...")

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_PATH, num_labels=NUM_CLASSES, local_files_only=True
)
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps
)

def get_proba(model, loader):
    """Return softmax probabilities for every sample in the loader."""
    model.eval()
    all_proba = []
    with torch.no_grad():
        for batch in loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            proba = F.softmax(outputs.logits, dim=1).cpu().numpy()
            all_proba.append(proba)
    return np.vstack(all_proba)

def evaluate_f1(model, loader):
    model.eval()
    preds, labels = [], []
    with torch.no_grad():
        for batch in loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            preds.extend(outputs.logits.argmax(dim=1).cpu().numpy())
            if 'labels' in batch:
                labels.extend(batch['labels'].numpy())
    return f1_score(labels, preds, average='macro') if labels else 0

best_f1 = -1.0  # Start below any valid F1 so the first epoch always saves a checkpoint
for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    pbar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{EPOCHS}')
    
    for batch in pbar:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, labels)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        
        total_loss += loss.item()
        pbar.set_postfix({'loss': f'{loss.item():.4f}'})
    
    val_f1 = evaluate_f1(model, val_loader)
    print(f'Epoch {epoch+1}: Loss={total_loss/len(train_loader):.4f}, Val F1={val_f1:.4f}')
    
    if val_f1 > best_f1:
        best_f1 = val_f1
        torch.save(model.state_dict(), '/kaggle/working/best_bertimbau.pt')

print(f'\nBERTimbau - Best F1: {best_f1:.4f}')

# Load the best checkpoint and extract test-set probabilities
model.load_state_dict(torch.load('/kaggle/working/best_bertimbau.pt', map_location=device))
proba_bertimbau = get_proba(model, test_loader)
print(f"BERTimbau probas shape: {proba_bertimbau.shape}")

# Free GPU memory
del model
torch.cuda.empty_cache()

In [None]:
# ==== TRAIN TF-IDF MODELS ====
print("\n[6/9] Training TF-IDF models...")

# Optimized TF-IDF (based on the best earlier results)
tfidf = TfidfVectorizer(
    max_features=15000,
    ngram_range=(1, 3),
    min_df=2,
    max_df=0.95,
    sublinear_tf=True
)

X_train_tfidf = tfidf.fit_transform(train_df[TEXT_COL])
X_test_tfidf = tfidf.transform(test_df[TEXT_COL])
y_train_tfidf = train_df[LABEL_COL].values

print(f"TF-IDF shape: {X_train_tfidf.shape}")

# 1. Calibrated LinearSVC
print("\n  Training LinearSVC...")
linearsvc = CalibratedClassifierCV(
    LinearSVC(C=1.0, max_iter=2000, class_weight='balanced', random_state=SEED),
    cv=3
)
linearsvc.fit(X_train_tfidf, y_train_tfidf)
proba_linearsvc = linearsvc.predict_proba(X_test_tfidf)
print(f"  LinearSVC proba shape: {proba_linearsvc.shape}")

# 2. SGDClassifier (v3 hyperparameters, which improved the score by +2.7%)
print("\n  Training SGDClassifier (v3 config)...")
sgd = CalibratedClassifierCV(
    SGDClassifier(
        loss='log_loss',
        penalty='l2',
        alpha=0.0001,
        max_iter=2000,
        class_weight='balanced',
        random_state=SEED,
        learning_rate='optimal',
        early_stopping=True,
        validation_fraction=0.1,
        n_jobs=-1
    ),
    cv=3
)
sgd.fit(X_train_tfidf, y_train_tfidf)
proba_sgd = sgd.predict_proba(X_test_tfidf)
print(f"  SGDClassifier proba shape: {proba_sgd.shape}")

# 3. LogisticRegression
print("\n  Training LogisticRegression...")
logreg = LogisticRegression(
    C=1.0,
    max_iter=1000,
    class_weight='balanced',
    random_state=SEED,
    n_jobs=-1
)
logreg.fit(X_train_tfidf, y_train_tfidf)
proba_logreg = logreg.predict_proba(X_test_tfidf)
print(f"  LogReg proba shape: {proba_logreg.shape}")

print("\n✅ All TF-IDF models trained!")

In [None]:
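# ==== WEIGHTED SOFT VOTING ====
print("\n[7/9] Combining probabilities (weighted soft voting)...")

# Combine the probabilities with the configured weights.
# Assumes every matrix has one column per class, in the same order (0..NUM_CLASSES-1).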
proba_ensemble = (
    WEIGHTS['bertimbau'] * proba_bertimbau +
    WEIGHTS['linearsvc'] * proba_linearsvc +
    WEIGHTS['sgd'] * proba_sgd +
    WEIGHTS['logreg'] * proba_logreg
)

print(f"Ensemble proba shape: {proba_ensemble.shape}")
print(f"Weights applied: {WEIGHTS}")

# Check that the weights sum to 1 and renormalize if they do not
total_weight = sum(WEIGHTS.values())
if abs(total_weight - 1.0) > 0.001:
    print(f"⚠️ Normalizing weights (sum={total_weight})")
    proba_ensemble = proba_ensemble / total_weight

In [None]:
# ==== THRESHOLD TUNING ====
print("\n[8/9] Applying threshold tuning...")

classes = np.arange(NUM_CLASSES)

# Scale each class column by 0.5 / threshold before the argmax, so classes with a
# threshold below 0.5 (minority classes 5 and 6) get a boost and the rest are unchanged
scale = np.array([0.5 / THRESHOLDS.get(int(c), 0.5) for c in classes])
predictions = np.argmax(proba_ensemble * scale, axis=1)

print(f"Thresholds applied: {THRESHOLDS}")
print(f"Total predictions: {len(predictions)}")
print(f"\nDistribution:")
print(pd.Series(predictions).value_counts().sort_index())
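
The threshold step above multiplies each class probability by `0.5 / threshold` before taking the argmax, so it acts as a per-class boost rather than a hard cutoff. The sketch below shows the effect on a single made-up probability row; `adjust_and_predict` is a hypothetical helper and the row values are illustrative only.

```python
import numpy as np

# Same thresholds as in the configuration cell.
THRESHOLDS = {0: 0.50, 1: 0.50, 2: 0.50, 3: 0.50, 4: 0.50, 5: 0.35, 6: 0.35}

def adjust_and_predict(proba, thresholds):
    """Scale each class column by 0.5 / threshold, then take the argmax."""
    proba = np.atleast_2d(np.asarray(proba, dtype=float))
    scale = np.array([0.5 / thresholds.get(c, 0.5) for c in range(proba.shape[1])])
    return np.argmax(proba * scale, axis=1)

# Hypothetical ensemble output: class 4 leads, class 5 is a close second.
row = np.array([0.05, 0.05, 0.05, 0.05, 0.40, 0.30, 0.10])

plain = int(np.argmax(row))
adjusted = int(adjust_and_predict(row, THRESHOLDS)[0])
print(f"plain argmax: {plain} | after threshold scaling: {adjusted}")
```

Here class 5 is scaled by 0.5 / 0.35 (about 1.43), which lifts 0.30 above the 0.40 of class 4, so the prediction moves to the minority class. The thresholds only change predictions in close calls like this one.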

In [None]:
# ==== SUBMISSION ====
print("\n[9/9] Generating final submission...")

# Use sample_submission.csv, when present, to get the expected column names
sample_path = f'{DATA_DIR}/sample_submission.csv'
if os.path.exists(sample_path):
    sample = pd.read_csv(sample_path)
    SUB_ID = sample.columns[0]
    SUB_LABEL = sample.columns[1]
else:
    SUB_ID = ID_COL
    SUB_LABEL = LABEL_COL

submission = pd.DataFrame({
    SUB_ID: test_df[ID_COL],
    SUB_LABEL: predictions
})

submission.to_csv('/kaggle/working/submission.csv', index=False)

print("="*70)
print("✅ SUPER ENSEMBLE v1 - DONE!")
print("="*70)
print(f"\nFile: /kaggle/working/submission.csv")
print(f"\n📊 Final distribution:")
print(submission[SUB_LABEL].value_counts().sort_index())

print(f"\n🎯 Settings used:")
print(f"  - BERTimbau + Focal Loss (gamma={FOCAL_GAMMA}, alpha={FOCAL_ALPHA})")
print(f"  - Weights: {WEIGHTS}")
print(f"  - Thresholds: {THRESHOLDS}")
print(f"\n🏆 Models combined:")
print(f"  - BERTimbau + Focal Loss (0.797)")
print(f"  - TF-IDF + LinearSVC (0.779)")
print(f"  - TF-IDF + SGDClassifier v3 (0.770)")
print(f"  - TF-IDF + LogReg (0.729)")