<a href="https://colab.research.google.com/github/UFG-PPGCC-NLP-Final-Project/movie-recommender/blob/main/colab/sbert_movie_recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SBERT Movie Recommender System

**Adapted from the paper**: *BERT one-shot movie recommender system*, using Sentence-BERT

This notebook implements an end-to-end movie recommender system with SBERT (Sentence-BERT), designed to produce structured recommendations from unstructured queries.

## ⚠️ Problems Identified and Solutions

### 🔴 Problems in Experiments 1-4 (Original Trainer)

| Problem | Cause | Impact |
|---------|-------|--------|
| **Stagnation at nDCG@10 = 0.044** | Unweighted `BCEWithLogitsLoss()` combined with a 1:1200 class imbalance | The model learns to predict ~0 for every movie |
| **RNN has almost no effect** (+0.0019 vs. the expected +0.035) | The inadequate loss prevents the model from learning collaborative features | The RNN works, but the optimization ignores it |
| **Broken multi-task training** (train loss = 6.2) | The CE loss (≈6.0) dominates the BCE loss (≈0.003) | The model optimizes the tag task and ignores the movie task |

### ✅ Solutions Implemented (Weighted Trainer, Section 7.1)

1. **Weighted BCE loss**: `pos_weight=1200` to balance the positive and negative classes
2. **Balanced multi-task loss**: `loss = bce_loss + (0.001 * ce_loss)` to equalize the contributions
3. **Detailed monitoring**: track the BCE and CE losses separately

**Expectation**: nDCG@10 should rise from **0.044 to above 0.10** within 20 epochs.
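
To see why the unweighted loss stalls, it helps to work through the arithmetic. The sketch below re-implements the per-element binary cross-entropy with logits in plain Python, with an optional `pos_weight` on the positive term, and evaluates it for a model that emits the same strongly negative logit for every movie. The catalogue size of 6000 and the 5 positives per dialogue are hypothetical numbers chosen only to reproduce the 1:1200 ratio, and the helper names are illustrative.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bce_with_logits(logit, target, pos_weight=1.0):
    """Per-element loss: -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    return pos_weight * target * softplus(-logit) + (1.0 - target) * softplus(logit)

def mean_loss_constant_logit(logit, num_items, num_pos, pos_weight=1.0):
    """Mean loss when the model emits the same logit for every item."""
    pos_term = num_pos * bce_with_logits(logit, 1.0, pos_weight)
    neg_term = (num_items - num_pos) * bce_with_logits(logit, 0.0, pos_weight)
    return (pos_term + neg_term) / num_items

# Hypothetical sizes that give a 1:1200 positive-to-total ratio.
NUM_ITEMS, NUM_POS = 6000, 5

unweighted = mean_loss_constant_logit(-7.0, NUM_ITEMS, NUM_POS)
weighted = mean_loss_constant_logit(-7.0, NUM_ITEMS, NUM_POS, pos_weight=1200.0)

print(f"all-negative, unweighted:      {unweighted:.4f}")
print(f"all-negative, pos_weight=1200: {weighted:.4f}")
```

Without weighting, predicting "nothing" already drives the mean loss close to zero, so little gradient signal is left for ranking. With `pos_weight`, the missed positives dominate the loss instead.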

---

## Architecture
- **Baseline**: SBERT + FFN for multi-label classification
- **Extension 1**: SBERT + RNN for collaborative features
- **Extension 2**: Multi-task learning with user tag data

---

## 1. Environment Setup

In [None]:
# Check the available GPU
!nvidia-smi

In [None]:
# Install dependencies
!pip install -q sentence-transformers transformers datasets torch accelerate scikit-learn pandas numpy tqdm

In [None]:
import os
import re
import json
import random
import numpy as np
import pandas as pd
from collections import defaultdict
from typing import List, Dict, Tuple, Optional

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel, get_linear_schedule_with_warmup
from datasets import load_dataset
from sklearn.metrics import ndcg_score
from tqdm.auto import tqdm

# Set seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Select the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Configuration and Hyperparameters

In [None]:
class Config:
    """Model and training configuration."""

    # SBERT model
    sbert_model_name = 'sentence-transformers/all-MiniLM-L6-v2'  # Efficient SBERT model
    sbert_hidden_size = 384  # Embedding dimension of all-MiniLM-L6-v2

    # RNN for collaborative features
    rnn_embedding_size = 256
    rnn_hidden_size = 128

    # FFN
    ffn_hidden_size = 256
    dropout_prob = 0.1  # Reduced from 0.3 to 0.1 (a small dataset needs less regularization)

    # Training
    movies_batch_size = 32  # Raised to 32 (the GPU has capacity; more stable gradients)
    tags_batch_size = 64    # Unchanged (simple auxiliary task)
    learning_rate = 2e-5    # Raised from 1e-5 to 2e-5 (faster convergence)
    num_epochs = 20         # Reduced for demonstration (the paper uses 200)
    warmup_ratio = 0.1
    max_seq_length = 512

    # Dataset
    num_movies = None  # Set automatically from movie_mapper.get_num_movies()

    # Evaluation
    eval_k = 10            # nDCG@10

    # Checkpoints
    save_dir = './checkpoints'

config = Config()
os.makedirs(config.save_dir, exist_ok=True)

# Print the main settings
print("="*60)
print("CONFIGURED HYPERPARAMETERS")
print("="*60)
print(f"Learning Rate: {config.learning_rate}")
print(f"Dropout: {config.dropout_prob}")
print(f"Movies Batch Size: {config.movies_batch_size}")
print(f"Tags Batch Size: {config.tags_batch_size}")
print(f"Num Epochs: {config.num_epochs}")
print(f"Max Seq Length: {config.max_seq_length}")
print(f"Num Movies: {config.num_movies} (set automatically later)")
print("="*60)

### 2.1 Explanation of the Adjusted Hyperparameters

**Changes relative to the original configuration, made to improve results:**

#### 1. **Learning Rate: 1e-5 → 2e-5** (2x larger)
   - **What it does:** Controls the size of the steps taken when updating the network weights
   - **Why change it:** 1e-5 is too conservative for a small dataset
   - **Expected impact:** Faster convergence without losing stability
   - **Alternatives:** Try 5e-5 if 2e-5 does not help

#### 2. **Dropout: 0.3 → 0.1** (3x smaller)
   - **What it does:** Randomly zeroes 10% of the activations during training
   - **Why change it:** A small dataset (~9K examples) combined with 30% dropout leads to underfitting
   - **Expected impact:** The model can learn the patterns better
   - **Trade-off:** Less regularization, but a small dataset needs the capacity

#### 3. **Movies Batch Size: 8 → 16 → 32** (4x larger)
   - **What it does:** Number of examples processed together before each weight update
   - **Why 32:** The GPU has capacity, and run time improved
   - **Advantages:** More stable gradients, faster training, lower variance
   - **Difference from tags_batch_size:**
     * movies (32): Complex task (dialogues → multiple movies)
     * tags (64): Simple task (1 tag → 1 movie)
   - **Monitor:** If you hit "CUDA out of memory", reduce to 16

#### 4. **Num Epochs: 30 → 50 → 20** (adjusted for testing)
   - **What it does:** Number of full passes over the dataset
   - **Why 20:** Initial tests before a full training run
   - **When to evaluate:**
     * 8 epochs: Too early to draw conclusions
     * 20 epochs: First meaningful evaluation
     * 50+ epochs: Full training (the paper uses 200)
   - **Tip:** Watch nDCG@10 between epochs 10 and 20 to decide whether to continue

#### 5. **num_movies: 6924 → None (automatic)**
   - **Original problem:** The hardcoded value did not match the actual dataset
   - **Solution:** Computed automatically via `movie_mapper.get_num_movies()`
   - **Impact:** Avoids bugs and guarantees the correct output-layer dimension
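
The learning rate listed above is a peak value: the trainer pairs it with a linear warmup and decay schedule driven by `warmup_ratio`. As a rough sketch of that shape in plain Python (the 9,000-example dataset size is an assumption based on the "~9K examples" figure above, and `linear_warmup_lr` is a hypothetical helper, not a library function):

```python
import math

def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear ramp from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Assumed sizes: ~9K training examples, batch size 32, 20 epochs, 10% warmup.
num_examples, batch_size, num_epochs, warmup_ratio = 9000, 32, 20, 0.1

# Round up because a DataLoader keeps the last partial batch by default.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * warmup_ratio)

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{linear_warmup_lr(step, 2e-5, warmup_steps, total_steps):.2e}")
```

With only 20 epochs, about two of them are spent warming up, which is worth keeping in mind when reading the first few evaluation results.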

---

### 2.2 How to Evaluate the Results

**⏱️ Evaluation Timeline:**

| Epochs | What to expect | What to watch |
|--------|----------------|---------------|
| 1-5    | Basic learning | Loss falling quickly |
| 5-10   | First improvements | nDCG@10 starting to rise (0.01 → 0.03) |
| 10-20  | **Critical evaluation** | If nDCG@10 > 0.05, continue. If < 0.03, revise |
| 20-50  | Real convergence | nDCG@10 should reach 0.08-0.13 |
| 50+    | Fine-tuning | Marginal gains, risk of overfitting |

**📊 Signs of Success:**
- ✅ nDCG@10 growing consistently
- ✅ Train loss below eval loss, with a small gap (<0.01)
- ✅ Recall@10 tracking nDCG@10

**⚠️ Signs of Trouble:**
- ❌ nDCG@10 stuck below 0.03 after 15 epochs
- ❌ Train loss and eval loss both stay high (underfitting) → reduce dropout or increase the LR
- ❌ Train loss << eval loss (overfitting) → increase dropout or reduce the LR

**🎯 Realistic Target:**
- With SBERT/BERT + RNN + multi-task: **nDCG@10 ≈ 0.10-0.15** in 20-50 epochs
- The original paper reports **0.130-0.169** with 200 epochs
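
The "critical evaluation" rule from the timeline can be turned into a small check that runs after each evaluation. The helper below is a hypothetical sketch: the function name and return strings are illustrative, and the cut-offs (0.05 and 0.03, applied from epoch 10 onward) are simply the ones from the table.

```python
def assess_progress(epoch, ndcg_at_10):
    """Apply the rule of thumb from the evaluation timeline."""
    if epoch < 10:
        return "too early"      # the table draws no conclusion before epoch 10
    if ndcg_at_10 > 0.05:
        return "continue"
    if ndcg_at_10 < 0.03:
        return "revise"
    return "borderline"         # between the two thresholds: keep watching

for epoch, ndcg in [(5, 0.02), (15, 0.06), (15, 0.02), (20, 0.04)]:
    print(epoch, ndcg, assess_progress(epoch, ndcg))
```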

## 3. Data Loading and Processing

### 3.1 ReDial Dataset
The ReDial dataset contains movie recommendation dialogues between an initiator and a respondent.

#### 3.1.1 Accessing the dataset

In [None]:
# Load the ReDial dataset
print("Loading ReDial dataset...")
redial_dataset_raw = load_dataset('community-datasets/re_dial')
print(f"Train: {len(redial_dataset_raw['train'])} examples")
print(f"Test: {len(redial_dataset_raw['test'])} examples")

print("\n")

# Inspect the structure of one example
sample = redial_dataset_raw['train'][0]
print("Structure of one example:")
for key in sample.keys():
    print(f"  {key}: {type(sample[key])}")

In [None]:
class MovieIDMapper:
    """
    Maps ReDial movie IDs to contiguous class indices and back.
    Mentions such as '@123' are normalized to the ID '123' before lookup.
    """

    def __init__(self):
        self.movie_to_idx = {}
        self.idx_to_movie = {}
        self.movie_names = {}

    def build_from_dataset(self, dataset):
        """Build the mapping from the ReDial dataset."""
        all_movies = set()

        for split in ['train', 'test']:
            for original_content in dataset[split]:
                # Extract movie IDs from the message texts
                messages = original_content.get('messages', [])
                for msg in messages:
                    text = msg.get('text', '')
                    movie_ids = re.findall(r'@(\d+)', text)
                    all_movies.update(movie_ids)

                # Extract from movieMentions
                mentions = original_content.get('movieMentions', {})
                if isinstance(mentions, dict):
                    for movie_id, name in mentions.items():
                        all_movies.add(str(movie_id).replace('@', ''))
                        self.movie_names[str(movie_id).replace('@', '')] = name

        # Build the mapping in sorted order
        sorted_movies = sorted([int(m) for m in all_movies if m.isdigit()])

        for idx, movie_id in enumerate(sorted_movies):
            self.movie_to_idx[str(movie_id)] = idx
            self.idx_to_movie[idx] = str(movie_id)

        print(f"Total unique movies: {len(self.movie_to_idx)}")
        return self

    def get_num_movies(self):
        return len(self.movie_to_idx)

    def movie_id_to_idx(self, movie_id):
        movie_id = str(movie_id).replace('@', '')
        return self.movie_to_idx.get(movie_id, -1)

    def idx_to_movie_id(self, idx):
        return self.idx_to_movie.get(idx, None)

# Build the mapping
movie_mapper = MovieIDMapper().build_from_dataset(redial_dataset_raw)

#### 3.1.2 Dialogue Processing

Following the paper, we concatenate the initiator's utterances with [SEP] tokens and use the respondent's recommendations as labels.
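
A toy dialogue makes the transformation concrete. The messages, worker IDs, and movie IDs below are invented for illustration and only mimic the `text` and `senderWorkerId` fields used in this notebook:

```python
import re

# Invented dialogue: worker 1 is the initiator, worker 2 the respondent.
messages = [
    {"senderWorkerId": 1, "text": "Hi! I loved @111 and want something similar."},
    {"senderWorkerId": 2, "text": "Have you seen @222?"},
    {"senderWorkerId": 1, "text": "Yes, @222 was great."},
    {"senderWorkerId": 2, "text": "Then try @333."},
]

initiator_id = messages[0]["senderWorkerId"]
initiator_texts, mentioned, recommended = [], [], []

for msg in messages:
    movie_ids = re.findall(r"@(\d+)", msg["text"])
    if msg["senderWorkerId"] == initiator_id:
        # Initiator turns become model input; movie IDs are masked to a bare '@'.
        initiator_texts.append(re.sub(r"@\d+", "@", msg["text"]))
        mentioned.extend(movie_ids)
    else:
        # Respondent turns supply the labels.
        recommended.extend(movie_ids)

input_text = " [SEP] ".join(initiator_texts)
print(input_text)
print("mentioned:", mentioned)
print("recommended:", sorted(set(recommended)))
```

Note that a movie the initiator has already seen (`@222` here) still ends up in the label set whenever the respondent mentions it.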

In [None]:
def process_dialogue(example, movie_mapper):
    """
    Process a ReDial dialogue as described in the paper:
    - Input: the initiator's utterances joined with [SEP]
    - Output: IDs of the movies recommended by the respondent
    - Movies mentioned: movies mentioned by the initiator (for the RNN)
    """
    messages = example.get('messages', [])

    initiator_texts = []
    mentioned_movies = []  # Movies mentioned by the initiator
    recommended_movies = []  # Movies recommended by the respondent

    for msg in messages:
        text = msg.get('text', '')
        sender_id = msg.get('senderWorkerId', 0)

        # Extract movie IDs
        movie_ids = re.findall(r'@(\d+)', text)

        # Decide whether the sender is the initiator (first sender) or the respondent
        if sender_id == messages[0].get('senderWorkerId', 0):
            # Initiator: keep the text and the mentioned movies
            # Replace IDs with a placeholder for the variant without the RNN
            clean_text = re.sub(r'@\d+', '@', text)
            initiator_texts.append(clean_text)
            mentioned_movies.extend(movie_ids)
        else:
            # Respondent: collect recommendations
            recommended_movies.extend(movie_ids)

    # Join the initiator texts with [SEP]
    input_text = ' [SEP] '.join(initiator_texts)

    # Convert IDs to indices
    mentioned_indices = [movie_mapper.movie_id_to_idx(m) for m in mentioned_movies]
    mentioned_indices = [idx for idx in mentioned_indices if idx >= 0]

    recommended_indices = [movie_mapper.movie_id_to_idx(m) for m in recommended_movies]
    # Deduplicate; sort so the order is deterministic across runs
    recommended_indices = sorted({idx for idx in recommended_indices if idx >= 0})

    return {
        'input_text': input_text,
        'input_text_with_ids': ' [SEP] '.join([msg.get('text', '') for msg in messages
                                               if msg.get('senderWorkerId') == messages[0].get('senderWorkerId')]),
        'mentioned_movies': mentioned_indices,
        'recommended_movies': recommended_indices
    }

# Process the dataset
def process_split(dataset_split, movie_mapper):
    processed = []
    for example in tqdm(dataset_split, desc="Processing"):
        proc = process_dialogue(example, movie_mapper)
        # Drop examples without recommendations or without initiator text
        if proc['recommended_movies'] and proc['input_text'].strip():
            processed.append(proc)
    return processed

print("Processing training split...")
train_data = process_split(redial_dataset_raw['train'], movie_mapper)
print(f"Valid training examples: {len(train_data)}")

print("\nProcessing test split...")
test_data = process_split(redial_dataset_raw['test'], movie_mapper)
print(f"Valid test examples: {len(test_data)}")

In [None]:
# Inspect a processed example
print("Processed example:")
print(f"Input text: {train_data[0]['input_text'][:500]}...")
print(f"\nMentioned movies (indices): {train_data[0]['mentioned_movies'][:5]}")
print(f"Recommended movies (indices): {train_data[0]['recommended_movies']}")

### 3.2 MovieLens Dataset

For the multi-task learning experiment, we use user tags from MovieLens.

In [None]:
# Download MovieLens tags
!wget -q -nc http://files.grouplens.org/datasets/movielens/ml-latest.zip
!unzip -q -o ml-latest.zip

In [None]:
# Load the MovieLens data
tags_csv = pd.read_csv('ml-latest/tags.csv')
movies_csv = pd.read_csv('ml-latest/movies.csv')

print(f"Total tags: {len(tags_csv)}")
print(f"Total movies: {len(movies_csv)}")
print("\nSample tags:")
print(tags_csv.head())

In [None]:
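def create_tag_dataset(tags_csv, movie_mapper, max_tags_per_movie=50):
    """
    Build the tag dataset for multi-task learning.
    Input: tag text
    Output: movie index
    """
    tag_data = []

    # Group tags by movie.
    # NOTE: the lookup below assumes MovieLens `movieId` values and ReDial IDs
    # share one ID space. Verify this; if they differ, match by title instead.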
    for movie_id, group in tags_csv.groupby('movieId'):
        movie_idx = movie_mapper.movie_id_to_idx(str(movie_id))
        if movie_idx < 0:
            continue

        tags = group['tag'].tolist()[:max_tags_per_movie]
        for tag in tags:
            if isinstance(tag, str) and len(tag.strip()) > 2:
                tag_data.append({
                    'tag_text': tag.strip(),
                    'movie_idx': movie_idx
                })

    return tag_data

# Build the tag dataset (movie_mapper was built together with the ReDial dataset)
tag_data = create_tag_dataset(tags_csv, movie_mapper)
print(f"Tag examples: {len(tag_data)}")
print("\nSample tag:")
print(tag_data[0])
print("\n")

# Train/test split for tags
random.shuffle(tag_data)
split_idx = int(len(tag_data) * 0.9)
tag_train_data = tag_data[:split_idx]
tag_test_data = tag_data[split_idx:]

print(f"Tag train examples: {len(tag_train_data)}")
print(f"Tag test examples: {len(tag_test_data)}")

## 4. Dataset Classes

In [None]:
class MovieRecommendationDataset(Dataset):
    """Dataset for movie recommendation (main task)."""

    def __init__(self, data, tokenizer, num_movies, max_length=512):
        self.data = data
        self.tokenizer = tokenizer
        self.num_movies = num_movies
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        # Tokenize the input
        encoding = self.tokenizer(
            item['input_text'],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # Build the multi-hot label vector
        labels = torch.zeros(self.num_movies)
        for movie_idx in item['recommended_movies']:
            if 0 <= movie_idx < self.num_movies:
                labels[movie_idx] = 1.0

        # Mentioned movies (for the RNN)
        mentioned = item['mentioned_movies'][:20]  # Cap at 20 movies
        mentioned_tensor = torch.zeros(20, dtype=torch.long)
        mentioned_mask = torch.zeros(20, dtype=torch.bool)

        for i, m_idx in enumerate(mentioned):
            if i < 20 and 0 <= m_idx < self.num_movies:
                mentioned_tensor[i] = m_idx
                mentioned_mask[i] = True

        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': labels,
            'mentioned_movies': mentioned_tensor,
            'mentioned_mask': mentioned_mask
        }


class TagDataset(Dataset):
    """Dataset for predicting a movie from a tag (auxiliary task)."""

    def __init__(self, data, tokenizer, max_length=64):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        encoding = self.tokenizer(
            item['tag_text'],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'label': torch.tensor(item['movie_idx'], dtype=torch.long)
        }

In [None]:
# Initialize the tokenizer for SBERT
sbert_tokenizer = AutoTokenizer.from_pretrained(config.sbert_model_name)
print(f"Loaded tokenizer for model: {config.sbert_model_name}")

# Update the number of movies from the actual mapping
config.num_movies = movie_mapper.get_num_movies()
print(f"Number of movies: {config.num_movies}")

# Create datasets
train_dataset = MovieRecommendationDataset(
    train_data, sbert_tokenizer, config.num_movies, config.max_seq_length
)
test_dataset = MovieRecommendationDataset(
    test_data, sbert_tokenizer, config.num_movies, config.max_seq_length
)

tag_train_dataset = TagDataset(tag_train_data, sbert_tokenizer)
tag_test_dataset = TagDataset(tag_test_data, sbert_tokenizer)

print(f"\nTraining dataset: {len(train_dataset)} examples")
print(f"Test dataset: {len(test_dataset)} examples")
print(f"Tag training dataset: {len(tag_train_dataset)} examples")

# Configuration diagnostics (approximate: the last partial batch is not counted)
print("\n" + "="*60)
print("TRAINING DIAGNOSTICS")
print("="*60)
print(f"Batches per epoch (movies): {len(train_dataset) // config.movies_batch_size}")
print(f"Total steps (all epochs): {(len(train_dataset) // config.movies_batch_size) * config.num_epochs}")
print(f"Warmup steps ({config.warmup_ratio:.0%}): {int((len(train_dataset) // config.movies_batch_size) * config.num_epochs * config.warmup_ratio)}")
print(f"\nClassification space: {config.num_movies} possible movies")
print(f"Average positive labels per example: {sum(len(d['recommended_movies']) for d in train_data) / len(train_data):.2f}")
print(f"Imbalance ratio: {config.num_movies / (sum(len(d['recommended_movies']) for d in train_data) / len(train_data)):.1f}:1")
print("="*60)

## 5. Model Architecture

### 5.1 Baseline Model: SBERT + FFN
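
The baseline pools token embeddings with an attention-mask-aware mean, so padding tokens do not dilute the sentence vector. A dependency-free sketch of that operation on nested lists follows; the toy 2-dimensional embeddings are made up, and `masked_mean_pooling` is an illustrative name rather than part of the model code.

```python
def masked_mean_pooling(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; ignore padding."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                sums[j] += value
    count = max(count, 1)  # avoid division by zero for an all-padding row
    return [s / count for s in sums]

tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row plays the padding token
mask = [1, 1, 0]
print(masked_mean_pooling(tokens, mask))
```

A plain mean over all three rows would be pulled toward the padding vector, which is why the mask matters when inputs are padded to a fixed length.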

In [None]:
class SBERTMovieRecommender(nn.Module):
    """
    Baseline model: SBERT + FFN for multi-label classification.

    f(U) = FFN(SBERT_mean_pooling(U))

    where U is the input concatenated with [SEP] tokens.
    SBERT mean-pools the token embeddings instead of using only [CLS].
    """

    def __init__(self, config):
        super().__init__()

        # Load the base SBERT model
        self.sbert = AutoModel.from_pretrained(config.sbert_model_name)

        # FFN projection head
        self.classifier = nn.Sequential(
            nn.Dropout(config.dropout_prob),
            nn.Linear(config.sbert_hidden_size, config.ffn_hidden_size),
            nn.ReLU(),
            nn.Dropout(config.dropout_prob),
            nn.Linear(config.ffn_hidden_size, config.num_movies)
        )

    def mean_pooling(self, token_embeddings, attention_mask):
        """Mean pooling that uses the attention mask so padding is excluded from the average."""
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return sum_embeddings / sum_mask

    def forward(self, input_ids, attention_mask, **kwargs):
        # SBERT encoding
        outputs = self.sbert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # Mean pooling (characteristic of SBERT)
        sentence_embeddings = self.mean_pooling(outputs.last_hidden_state, attention_mask)

        # Project to logits
        logits = self.classifier(sentence_embeddings)

        return logits

    def get_sentence_embedding(self, input_ids, attention_mask):
        """Return the sentence embedding for the multi-task head."""
        outputs = self.sbert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        return self.mean_pooling(outputs.last_hidden_state, attention_mask)

### 5.2 Model with RNN for Collaborative Features

In [None]:
class SBERTRNNMovieRecommender(nn.Module):
    """
    Model with an RNN to learn collaborative features.

    f(U) = FFN(SBERT_mean_pooling(U), RNN(L(U)))

    where L(U) is the list of movies mentioned in U.
    """

    def __init__(self, config):
        super().__init__()

        self.sbert = AutoModel.from_pretrained(config.sbert_model_name)

        # Movie embeddings for the RNN
        self.movie_embedding = nn.Embedding(
            config.num_movies + 1,  # +1 for padding
            config.rnn_embedding_size,
            padding_idx=config.num_movies
        )

        # RNN over the sequence of mentioned movies
        self.rnn = nn.GRU(
            input_size=config.rnn_embedding_size,
            hidden_size=config.rnn_hidden_size,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )

        # Combined dimension: SBERT embedding + RNN output (bidirectional)
        combined_size = config.sbert_hidden_size + (config.rnn_hidden_size * 2)

        # FFN projection head
        self.classifier = nn.Sequential(
            nn.Dropout(config.dropout_prob),
            nn.Linear(combined_size, config.ffn_hidden_size),
            nn.ReLU(),
            nn.Dropout(config.dropout_prob),
            nn.Linear(config.ffn_hidden_size, config.num_movies)
        )

        self.num_movies = config.num_movies

    def mean_pooling(self, token_embeddings, attention_mask):
        """Mean pooling that uses the attention mask so padding is excluded from the average."""
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return sum_embeddings / sum_mask

    def forward(self, input_ids, attention_mask, mentioned_movies, mentioned_mask, **kwargs):
        batch_size = input_ids.size(0)

        # SBERT encoding
        sbert_outputs = self.sbert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        sentence_embeddings = self.mean_pooling(sbert_outputs.last_hidden_state, attention_mask)

        # Process the mentioned movies with the RNN
        # Replace invalid positions with the padding index
        mentioned_movies = mentioned_movies.clone()
        mentioned_movies[~mentioned_mask] = self.num_movies

        movie_embeds = self.movie_embedding(mentioned_movies)

        # RNN (note: padded positions are also fed through the GRU here)
        rnn_output, hidden = self.rnn(movie_embeds)

        # Use the final hidden state (both directions concatenated)
        rnn_features = hidden.transpose(0, 1).contiguous().view(batch_size, -1)

        # Combine features
        combined = torch.cat([sentence_embeddings, rnn_features], dim=-1)

        # Project to logits
        logits = self.classifier(combined)

        return logits

    def get_sentence_embedding(self, input_ids, attention_mask):
        """Return the sentence embedding for the multi-task head."""
        outputs = self.sbert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        return self.mean_pooling(outputs.last_hidden_state, attention_mask)

### 5.3 Wrapper for Multi-Task Learning
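
During multi-task training, each batch of the main task is paired with one batch of the auxiliary tag task, and the tag stream restarts whenever it runs out. A minimal sketch of that pairing, with plain lists standing in for data loaders (the batch names are placeholders, and the auxiliary list is assumed to be non-empty):

```python
def pair_with_auxiliary(main_batches, aux_batches):
    """Yield (main, aux) pairs, restarting the auxiliary stream when exhausted.

    Assumes aux_batches is non-empty.
    """
    aux_iter = iter(aux_batches)
    for main in main_batches:
        try:
            aux = next(aux_iter)
        except StopIteration:
            aux_iter = iter(aux_batches)  # start a new pass over the auxiliary data
            aux = next(aux_iter)
        yield main, aux

pairs = list(pair_with_auxiliary(["m0", "m1", "m2", "m3", "m4"], ["t0", "t1"]))
print(pairs)
```

One consequence of this design is that an "epoch" is defined by the main task only: the auxiliary data may be seen several times, or only partially, per epoch.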

In [None]:
class MultiTaskWrapper(nn.Module):
    """
    Wrapper for multi-task learning with an auxiliary tag task.

    Loss = BCE(f(U), y) + CE(g(T), z)

    where T is a tag text, g is the tag head on top of the shared SBERT
    encoder, and z is the correct movie for that tag.
    """

    def __init__(self, base_model, config):
        super().__init__()
        self.base_model = base_model

        # Separate head for the tag task (shares the SBERT pathway)
        self.tag_classifier = nn.Sequential(
            nn.Dropout(config.dropout_prob),
            nn.Linear(config.sbert_hidden_size, config.ffn_hidden_size),
            nn.ReLU(),
            nn.Dropout(config.dropout_prob),
            nn.Linear(config.ffn_hidden_size, config.num_movies)
        )

    def forward(self, input_ids, attention_mask, **kwargs):
        return self.base_model(input_ids, attention_mask, **kwargs)

    def forward_tags(self, input_ids, attention_mask):
        """Forward pass for the tag task."""
        # Get the sentence embedding from the base model
        # (both base models in this notebook define get_sentence_embedding)
        embeddings = self.base_model.get_sentence_embedding(input_ids, attention_mask)

        logits = self.tag_classifier(embeddings)
        return logits

## 6. Evaluation Metrics
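
For binary relevance, nDCG@k and Recall@k can be computed by hand, which is useful for sanity-checking the metric functions. The sketch below is a plain-Python reference using the standard log2 discount. Ties between scores are broken by index order here, which may differ from how a library resolves them, and the function names are illustrative.

```python
import math

def top_k_indices(scores, k):
    """Indices of the k highest scores, best first (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def ndcg_at_k(scores, labels, k):
    """nDCG@k for binary labels; position p (starting at 1) is discounted by 1 / log2(p + 1)."""
    dcg = sum(labels[i] / math.log2(rank + 2)
              for rank, i in enumerate(top_k_indices(scores, k)))
    ideal_hits = min(k, sum(labels))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(scores, labels, k):
    """Fraction of the relevant items that appear in the top k."""
    total = sum(labels)
    hits = sum(labels[i] for i in top_k_indices(scores, k))
    return hits / total if total > 0 else 0.0

scores = [0.9, 0.8, 0.1, 0.7]
labels = [0, 1, 0, 1]
print(f"nDCG@2   = {ndcg_at_k(scores, labels, 2):.4f}")
print(f"Recall@2 = {recall_at_k(scores, labels, 2):.2f}")
```

In this toy case one relevant item sits at rank 2 and the other falls just outside the cut-off, so nDCG penalizes both the position and the miss while Recall only sees the miss.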

In [None]:
def compute_ndcg_at_k(predictions, labels, k=10):
    """
    Compute nDCG@k as used in the paper.

    Args:
        predictions: tensor of logits (batch_size, num_movies)
        labels: multi-hot label tensor (batch_size, num_movies)
        k: number of items to consider

    Returns:
        Mean nDCG@k
    """
    predictions = predictions.detach().cpu().numpy()
    labels = labels.detach().cpu().numpy()

    ndcg_scores = []

    for pred, label in zip(predictions, labels):
        # Skip examples with no positive labels
        if label.sum() == 0:
            continue

        try:
            score = ndcg_score([label], [pred], k=k)
            ndcg_scores.append(score)
        except ValueError:
            # Skip inputs that ndcg_score rejects
            continue

    return np.mean(ndcg_scores) if ndcg_scores else 0.0


def compute_recall_at_k(predictions, labels, k=10):
    """Compute Recall@k."""
    predictions = predictions.detach().cpu().numpy()
    labels = labels.detach().cpu().numpy()

    recalls = []

    for pred, label in zip(predictions, labels):
        if label.sum() == 0:
            continue

        top_k_indices = np.argsort(pred)[-k:]
        relevant = label[top_k_indices].sum()
        total_relevant = label.sum()

        recalls.append(relevant / total_relevant)

    return np.mean(recalls) if recalls else 0.0

## 7. Training Loop

### 7.1 Weighted Trainer (Solution for Class Imbalance)

**Problems identified in the experiments:**

1. **Unweighted BCE loss** → the model learns to predict 0 for everything (nDCG@10 ≈ 0.044)
2. **Unbalanced multi-task loss** → the CE loss (≈6.2) dominates the BCE loss (≈0.003)

**Solutions implemented:**

✅ **Weighted BCE loss**: `pos_weight=1200` to compensate for the 1:1200 class imbalance  
✅ **Balanced multi-task loss**: `loss = bce_loss + (0.001 * ce_loss)` to equalize the contributions  
✅ **Separate monitoring**: track the BCE and CE losses individually
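
The effect of the 0.001 weight can be checked with the loss magnitudes reported for the earlier experiments (BCE ≈ 0.003, CE ≈ 6.0). The sketch below computes the share of the combined loss contributed by the tag task; `auxiliary_share` is a hypothetical helper. These magnitudes come from the unweighted runs, so once `pos_weight` is applied the BCE value changes and the weight may need re-tuning.

```python
def auxiliary_share(bce_loss, ce_loss, alpha):
    """Fraction of the combined loss that comes from the weighted CE term."""
    total = bce_loss + alpha * ce_loss
    return (alpha * ce_loss) / total

BCE, CE = 0.003, 6.0  # magnitudes reported for the unweighted experiments

for alpha in (1.0, 0.001):
    total = BCE + alpha * CE
    share = auxiliary_share(BCE, CE, alpha)
    print(f"alpha={alpha}: total={total:.4f}, CE share={share:.1%}")
```

Under these numbers the CE term still accounts for about two thirds of the total at `alpha=0.001`, so "balanced" here means the same order of magnitude rather than equal contributions.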

In [None]:
class WeightedTrainer:
    """
    Trainer with a balanced loss to address the class imbalance and the
    unbalanced multi-task objective identified in the experiments.
    """

    def __init__(self, model, config, train_loader, eval_loader,
                 tag_train_loader=None, tag_eval_loader=None,
                 use_multitask=False):
        self.model = model.to(device)
        self.config = config
        self.train_loader = train_loader
        self.eval_loader = eval_loader
        self.tag_train_loader = tag_train_loader
        self.tag_eval_loader = tag_eval_loader
        self.use_multitask = use_multitask

        # Optimizer
        self.optimizer = AdamW(
            model.parameters(),
            lr=config.learning_rate
        )

        # Scheduler
        total_steps = len(train_loader) * config.num_epochs
        warmup_steps = int(total_steps * config.warmup_ratio)

        self.scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=warmup_steps,
            num_training_steps=total_steps
        )

        # Solution 1: weighted BCE loss for class imbalance
        # Compute pos_weight from the actual imbalance
        # (relies on the notebook-level `train_data` list defined in Section 3)
        avg_labels = sum(len(d['recommended_movies']) for d in train_data) / len(train_data)
        pos_weight_value = config.num_movies / avg_labels  # ≈ 1200

        print(f"\n{'='*60}")
        print("WEIGHTED TRAINER - CONFIGURATION")
        print('='*60)
        print(f"Average positive labels: {avg_labels:.2f}")
        print(f"Total movies: {config.num_movies}")
        print(f"Imbalance: {pos_weight_value:.1f}:1")
        print(f"pos_weight applied: {pos_weight_value:.1f}")
        print('='*60)

        pos_weight = torch.full([config.num_movies], pos_weight_value, device=device)
        self.bce_loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        self.ce_loss = nn.CrossEntropyLoss()

        # Solution 2: balance the multi-task losses
        # The CE loss was ~2000x larger than the BCE loss in the unweighted runs, so use alpha=0.001
        self.multitask_alpha = 0.001  # Weight for the CE loss

        # Extended history to monitor the losses separately
        self.history = {
            'train_loss': [],
            'train_bce_loss': [],
            'train_ce_loss': [],
            'eval_loss': [],
            'ndcg': [],
            'recall': []
        }

    def train_epoch(self):
        self.model.train()
        total_loss = 0
        total_bce_loss = 0
        total_ce_loss = 0
        num_batches = 0

        # Iterator over tag batches when multi-task is enabled
        tag_iter = iter(self.tag_train_loader) if self.use_multitask and self.tag_train_loader else None

        progress_bar = tqdm(self.train_loader, desc="Training")

        for batch in progress_bar:
            # Move the batch to the device
            batch = {k: v.to(device) for k, v in batch.items()}

            self.optimizer.zero_grad()

            # Main forward pass
            logits = self.model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                mentioned_movies=batch.get('mentioned_movies'),
                mentioned_mask=batch.get('mentioned_mask')
            )

            # Main loss (multi-label BCE) with weights
            bce_loss = self.bce_loss(logits, batch['labels'])
            loss = bce_loss

            # Multi-task: add the tag loss, scaled for balance
            ce_loss_value = 0
            if self.use_multitask and tag_iter is not None:
                try:
                    tag_batch = next(tag_iter)
                except StopIteration:
                    tag_iter = iter(self.tag_train_loader)
                    tag_batch = next(tag_iter)

                tag_batch = {k: v.to(device) for k, v in tag_batch.items()}

                tag_logits = self.model.forward_tags(
                    input_ids=tag_batch['input_ids'],
                    attention_mask=tag_batch['attention_mask']
                )

                ce_loss = self.ce_loss(tag_logits, tag_batch['label'])
                ce_loss_value = ce_loss.item()

                # ✅ Balance: BCE loss + alpha * CE loss
                # With the unweighted magnitudes from experiments 1-4 and alpha=0.001:
                # 0.003 + 0.001 * 6.0 = 0.009, so both terms have the same order of magnitude
                loss = loss + (self.multitask_alpha * ce_loss)

            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)

            self.optimizer.step()
            self.scheduler.step()

            total_loss += loss.item()
            total_bce_loss += bce_loss.item()
            total_ce_loss += ce_loss_value
            num_batches += 1

            # Show each loss separately
            if self.use_multitask:
                progress_bar.set_postfix({
                    'loss': f'{loss.item():.4f}',
                    'bce': f'{bce_loss.item():.4f}',
                    'ce': f'{ce_loss_value:.4f}'
                })
            else:
                progress_bar.set_postfix({'loss': f'{loss.item():.4f}'})

        return {
            'total': total_loss / num_batches,
            'bce': total_bce_loss / num_batches,
            'ce': total_ce_loss / num_batches if self.use_multitask else 0
        }

    @torch.no_grad()
    def evaluate(self):
        self.model.eval()
        total_loss = 0
        all_predictions = []
        all_labels = []

        for batch in tqdm(self.eval_loader, desc="Evaluating"):
            batch = {k: v.to(device) for k, v in batch.items()}

            logits = self.model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                mentioned_movies=batch.get('mentioned_movies'),
                mentioned_mask=batch.get('mentioned_mask')
            )

            loss = self.bce_loss(logits, batch['labels'])
            total_loss += loss.item()

            all_predictions.append(logits)
            all_labels.append(batch['labels'])

        # Concatenate all predictions
        all_predictions = torch.cat(all_predictions, dim=0)
        all_labels = torch.cat(all_labels, dim=0)

        # Compute the metrics
        ndcg = compute_ndcg_at_k(all_predictions, all_labels, k=self.config.eval_k)
        recall = compute_recall_at_k(all_predictions, all_labels, k=self.config.eval_k)

        return {
            'loss': total_loss / len(self.eval_loader),
            'ndcg@10': ndcg,
            'recall@10': recall
        }

    def train(self, num_epochs=None):
        num_epochs = num_epochs or self.config.num_epochs
        best_ndcg = 0

        for epoch in range(num_epochs):
            print(f"\n{'='*50}")
            print(f"Epoch {epoch + 1}/{num_epochs}")
            print('='*50)

            # Train
            train_metrics = self.train_epoch()
            self.history['train_loss'].append(train_metrics['total'])
            self.history['train_bce_loss'].append(train_metrics['bce'])
            self.history['train_ce_loss'].append(train_metrics['ce'])

            # Evaluate
            eval_metrics = self.evaluate()
            self.history['eval_loss'].append(eval_metrics['loss'])
            self.history['ndcg'].append(eval_metrics['ndcg@10'])
            self.history['recall'].append(eval_metrics['recall@10'])

            # Print each loss separately
            print(f"\nTrain Loss (total): {train_metrics['total']:.4f}")
            print(f"  ├─ BCE Loss: {train_metrics['bce']:.4f}")
            if self.use_multitask:
                print(f"  └─ CE Loss (unweighted, alpha={self.multitask_alpha}): {train_metrics['ce']:.4f}")
            print(f"Eval Loss: {eval_metrics['loss']:.4f}")
            print(f"nDCG@10: {eval_metrics['ndcg@10']:.4f}")
            print(f"Recall@10: {eval_metrics['recall@10']:.4f}")

            # Save the best model
            if eval_metrics['ndcg@10'] > best_ndcg:
                best_ndcg = eval_metrics['ndcg@10']
                os.makedirs(self.config.save_dir, exist_ok=True)  # ensure the target directory exists
                torch.save(
                    self.model.state_dict(),
                    os.path.join(self.config.save_dir, 'best_weighted_model.pt')
                )
                print(f"✅ New best model saved! nDCG@10: {best_ndcg:.4f}")

        return self.history

In [None]:
class Trainer:
    """Trainer for the recommendation models."""

    def __init__(self, model, config, train_loader, eval_loader,
                 tag_train_loader=None, tag_eval_loader=None,
                 use_multitask=False):
        self.model = model.to(device)
        self.config = config
        self.train_loader = train_loader
        self.eval_loader = eval_loader
        self.tag_train_loader = tag_train_loader
        self.tag_eval_loader = tag_eval_loader
        self.use_multitask = use_multitask

        # Optimizer
        self.optimizer = AdamW(
            model.parameters(),
            lr=config.learning_rate
        )

        # Scheduler
        total_steps = len(train_loader) * config.num_epochs
        warmup_steps = int(total_steps * config.warmup_ratio)

        self.scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=warmup_steps,
            num_training_steps=total_steps
        )

        # Loss functions
        self.bce_loss = nn.BCEWithLogitsLoss()
        self.ce_loss = nn.CrossEntropyLoss()

        # History
        self.history = {
            'train_loss': [],
            'eval_loss': [],
            'ndcg': [],
            'recall': []
        }

    def train_epoch(self):
        self.model.train()
        total_loss = 0
        num_batches = 0

        # Tag iterator, used only when multi-task training is enabled
        tag_iter = iter(self.tag_train_loader) if self.use_multitask and self.tag_train_loader else None

        progress_bar = tqdm(self.train_loader, desc="Training")

        for batch in progress_bar:
            # Move the batch to the device
            batch = {k: v.to(device) for k, v in batch.items()}

            self.optimizer.zero_grad()

            # Main forward pass
            logits = self.model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                mentioned_movies=batch.get('mentioned_movies'),
                mentioned_mask=batch.get('mentioned_mask')
            )

            # Main loss (multi-label BCE)
            loss = self.bce_loss(logits, batch['labels'])

            # Multi-task: add the tag loss
            if self.use_multitask and tag_iter is not None:
                try:
                    tag_batch = next(tag_iter)
                except StopIteration:
                    tag_iter = iter(self.tag_train_loader)
                    tag_batch = next(tag_iter)

                tag_batch = {k: v.to(device) for k, v in tag_batch.items()}

                tag_logits = self.model.forward_tags(
                    input_ids=tag_batch['input_ids'],
                    attention_mask=tag_batch['attention_mask']
                )

                tag_loss = self.ce_loss(tag_logits, tag_batch['label'])
                loss = loss + tag_loss  # Equal weighting, as in the paper

            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)

            self.optimizer.step()
            self.scheduler.step()

            total_loss += loss.item()
            num_batches += 1

            progress_bar.set_postfix({'loss': f'{loss.item():.4f}'})

        return total_loss / num_batches

    @torch.no_grad()
    def evaluate(self):
        self.model.eval()
        total_loss = 0
        all_predictions = []
        all_labels = []

        for batch in tqdm(self.eval_loader, desc="Evaluating"):
            batch = {k: v.to(device) for k, v in batch.items()}

            logits = self.model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                mentioned_movies=batch.get('mentioned_movies'),
                mentioned_mask=batch.get('mentioned_mask')
            )

            loss = self.bce_loss(logits, batch['labels'])
            total_loss += loss.item()

            all_predictions.append(logits)
            all_labels.append(batch['labels'])

        # Concatenate all predictions
        all_predictions = torch.cat(all_predictions, dim=0)
        all_labels = torch.cat(all_labels, dim=0)

        # Compute the metrics
        ndcg = compute_ndcg_at_k(all_predictions, all_labels, k=self.config.eval_k)
        recall = compute_recall_at_k(all_predictions, all_labels, k=self.config.eval_k)

        return {
            'loss': total_loss / len(self.eval_loader),
            'ndcg@10': ndcg,
            'recall@10': recall
        }

    def train(self, num_epochs=None):
        num_epochs = num_epochs or self.config.num_epochs
        best_ndcg = 0

        for epoch in range(num_epochs):
            print(f"\n{'='*50}")
            print(f"Epoch {epoch + 1}/{num_epochs}")
            print('='*50)

            # Train
            train_loss = self.train_epoch()
            self.history['train_loss'].append(train_loss)

            # Evaluate
            eval_metrics = self.evaluate()
            self.history['eval_loss'].append(eval_metrics['loss'])
            self.history['ndcg'].append(eval_metrics['ndcg@10'])
            self.history['recall'].append(eval_metrics['recall@10'])

            print(f"\nTrain Loss: {train_loss:.4f}")
            print(f"Eval Loss: {eval_metrics['loss']:.4f}")
            print(f"nDCG@10: {eval_metrics['ndcg@10']:.4f}")
            print(f"Recall@10: {eval_metrics['recall@10']:.4f}")

            # Save the best model
            if eval_metrics['ndcg@10'] > best_ndcg:
                best_ndcg = eval_metrics['ndcg@10']
                os.makedirs(self.config.save_dir, exist_ok=True)  # ensure the target directory exists
                torch.save(
                    self.model.state_dict(),
                    os.path.join(self.config.save_dir, 'best_model.pt')
                )
                print(f"New best model saved! nDCG@10: {best_ndcg:.4f}")

        return self.history

## 8. Experiment 1: Baseline SBERT + FFN
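Both trainers build a linear warmup schedule from `len(train_loader) * config.num_epochs`. The sketch below re-implements the multiplier that such a schedule applies to the base learning rate, in plain Python, so its shape can be inspected without training. The function name and the batch counts are hypothetical and only for illustration; it is an approximation of the behaviour, not the library code itself.

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: linear ramp up during warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Hypothetical sizes: 100 batches per epoch, 20 epochs, 10% warmup.
total_steps = 100 * 20
warmup_steps = int(total_steps * 0.1)

for step in (0, 100, 200, 1100, 2000, 2500):
    print(step, round(linear_warmup_multiplier(step, warmup_steps, total_steps), 3))
```

Because the multiplier stays at zero once `total_steps` is reached, calling `train()` for more epochs than `config.num_epochs` on an existing trainer would likely make no further progress; recreating the trainer after changing `config.num_epochs` avoids that.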

In [None]:
# Create the DataLoaders
train_loader = DataLoader(
    train_dataset,
    batch_size=config.movies_batch_size,
    shuffle=True,
    num_workers=0,  # 0 avoids multiprocessing issues on Colab
    pin_memory=True
)

eval_loader = DataLoader(
    test_dataset,
    batch_size=config.movies_batch_size * 2,
    shuffle=False,
    num_workers=0,  # 0 avoids multiprocessing issues on Colab
    pin_memory=True
)

print(f"Training batches: {len(train_loader)}")
print(f"Evaluation batches: {len(eval_loader)}")

In [None]:
# Train the baseline model
print("="*60)
print("EXPERIMENT 1: SBERT Baseline (no RNN, no multi-task)")
print("="*60)
print(f"⚙️  Configuration: batch_size={config.movies_batch_size}, lr={config.learning_rate}, dropout={config.dropout_prob}")
print("📊 Note: nDCG@10 should start low (~0.01-0.02) and grow gradually")
print(f"🎯 Target after {config.num_epochs} epochs: nDCG@10 > 0.05")
print("="*60)

baseline_model = SBERTMovieRecommender(config)

baseline_trainer = Trainer(
    model=baseline_model,
    config=config,
    train_loader=train_loader,
    eval_loader=eval_loader,
    use_multitask=False
)

# Train with the tuned hyperparameters
baseline_history = baseline_trainer.train(config.num_epochs)

## 9. Experiment 2: SBERT + RNN for Collaborative Features

In [None]:
# Train the model with the RNN
print("\n" + "="*60)
print("EXPERIMENT 2: SBERT + RNN (collaborative features)")
print("="*60)

rnn_model = SBERTRNNMovieRecommender(config)

rnn_trainer = Trainer(
    model=rnn_model,
    config=config,
    train_loader=train_loader,
    eval_loader=eval_loader,
    use_multitask=False
)

rnn_history = rnn_trainer.train(config.num_epochs)

## 10. Experiment 3: Multi-Task Learning with Tags

In [None]:
# Create the DataLoaders for the tag task
tag_train_loader = DataLoader(
    tag_train_dataset,
    batch_size=config.tags_batch_size,
    shuffle=True,
    num_workers=0,  # 0 avoids multiprocessing issues on Colab
    pin_memory=True
)

tag_eval_loader = DataLoader(
    tag_test_dataset,
    batch_size=config.tags_batch_size,
    shuffle=False,
    num_workers=0,  # 0 avoids multiprocessing issues on Colab
    pin_memory=True
)

In [None]:
# Train the multi-task model (SBERT baseline + tags)
print("\n" + "="*60)
print("EXPERIMENT 3: SBERT + Multi-Task (user tags)")
print("="*60)

base_model_mt = SBERTMovieRecommender(config)
multitask_model = MultiTaskWrapper(base_model_mt, config)

multitask_trainer = Trainer(
    model=multitask_model,
    config=config,
    train_loader=train_loader,
    eval_loader=eval_loader,
    tag_train_loader=tag_train_loader,
    tag_eval_loader=tag_eval_loader,
    use_multitask=True
)

multitask_history = multitask_trainer.train(config.num_epochs)

## 11. Experiment 4: SBERT + RNN + Multi-Task (Full Model)

In [None]:
# Full model: RNN + multi-task
print("\n" + "="*60)
print("EXPERIMENT 4: SBERT + RNN + Multi-Task (full model)")
print("="*60)

rnn_base_model = SBERTRNNMovieRecommender(config)
full_model = MultiTaskWrapper(rnn_base_model, config)

full_trainer = Trainer(
    model=full_model,
    config=config,
    train_loader=train_loader,
    eval_loader=eval_loader,
    tag_train_loader=tag_train_loader,
    tag_eval_loader=tag_eval_loader,
    use_multitask=True
)

full_history = full_trainer.train(config.num_epochs)

## 12. Results Comparison
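The comparison in this section rests on nDCG@10 and Recall@10. As a reference point, here is a minimal pure-Python sketch of both metrics for a single query with binary relevance labels. It assumes the standard definitions (DCG with a log2 discount, recall as hits divided by the number of relevant items); the notebook's own `compute_ndcg_at_k` and `compute_recall_at_k` may differ in detail, and the function names below are hypothetical.

```python
import math


def ndcg_at_k(scores, labels, k=10):
    """Binary-relevance nDCG@k for one query (scores and labels are equal-length lists)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(ranked))
    n_relevant = int(sum(labels))
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, n_relevant)))
    return dcg / ideal if ideal > 0 else 0.0


def recall_at_k(scores, labels, k=10):
    """Fraction of relevant items that appear among the k highest-scored items."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    n_relevant = sum(labels)
    hits = sum(labels[i] for i in ranked)
    return hits / n_relevant if n_relevant > 0 else 0.0


# Toy query with four candidate movies; the single relevant one is ranked second.
scores = [0.1, 0.9, 0.3, 0.8]
labels = [0, 0, 0, 1]
print(ndcg_at_k(scores, labels, k=2), recall_at_k(scores, labels, k=2))
```

With one relevant item at rank 2, nDCG@2 is 1/log2(3), while Recall@2 is 1.0: nDCG penalises a hit for appearing lower in the list, recall does not.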

In [None]:
import matplotlib.pyplot as plt

def plot_results(histories, names):
    """Plot a comparison of the results."""
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    colors = ['#2ecc71', '#3498db', '#e74c3c', '#9b59b6']

    # Train Loss
    ax = axes[0, 0]
    for hist, name, color in zip(histories, names, colors):
        ax.plot(hist['train_loss'], label=name, color=color, linewidth=2)
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Loss')
    ax.set_title('Training Loss')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Eval Loss
    ax = axes[0, 1]
    for hist, name, color in zip(histories, names, colors):
        ax.plot(hist['eval_loss'], label=name, color=color, linewidth=2)
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Loss')
    ax.set_title('Evaluation Loss')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # nDCG@10
    ax = axes[1, 0]
    for hist, name, color in zip(histories, names, colors):
        ax.plot(hist['ndcg'], label=name, color=color, linewidth=2)
    ax.set_xlabel('Epoch')
    ax.set_ylabel('nDCG@10')
    ax.set_title('nDCG@10 (main metric in the paper)')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Recall@10
    ax = axes[1, 1]
    for hist, name, color in zip(histories, names, colors):
        ax.plot(hist['recall'], label=name, color=color, linewidth=2)
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Recall@10')
    ax.set_title('Recall@10')
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('training_results.png', dpi=150, bbox_inches='tight')
    plt.show()

# Plot the results
plot_results(
    [baseline_history, rnn_history, multitask_history, full_history],
    ['SBERT Baseline', 'SBERT + RNN', 'SBERT + Multi-Task', 'SBERT + RNN + Multi-Task']
)

In [None]:

results = {
    'Model': [
        'SBERT Baseline',
        'SBERT + RNN',
        'SBERT + Multi-Task',
        'SBERT + RNN + Multi-Task',
        '--- Original paper (BERT) ---',
        'Paper: no RNN, no tags',
        'Paper: with RNN, no tags',
        'Paper: no RNN, with tags',
        'Paper: with RNN, with tags'
    ],
    'nDCG@10': [
        max(baseline_history['ndcg']),
        max(rnn_history['ndcg']),
        max(multitask_history['ndcg']),
        max(full_history['ndcg']),
        '-',
        0.130,
        0.165,
        0.138,
        0.169
    ]
}

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

print("\n" + "="*70)
print("Note: the original paper uses BERT and reports nDCG@10 of 0.819 for")
print("full conversational models (a different task). This implementation")
print("uses SBERT and focuses on one-shot recommendation from queries.")
print("="*70)

## 13. Inference and Demonstration

In [None]:
class MovieRecommenderInference:
    """Inference helper for the trained model."""

    def __init__(self, model, tokenizer, movie_mapper, device, top_k=10):
        self.model = model.to(device)
        self.model.eval()
        self.tokenizer = tokenizer
        self.movie_mapper = movie_mapper
        self.device = device
        self.top_k = top_k

    def recommend(self, query, mentioned_movie_names=None):
        """
        Generate recommendations from a query.

        Args:
            query: text of the user's query
            mentioned_movie_names: list of mentioned movie names (optional;
                currently unused, the model receives an empty placeholder)

        Returns:
            List of recommended movies with scores
        """
        # Tokenize
        encoding = self.tokenizer(
            query,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        input_ids = encoding['input_ids'].to(self.device)
        attention_mask = encoding['attention_mask'].to(self.device)

        # Mentioned movies: empty placeholder (mentioned_movie_names is not used yet)
        mentioned_movies = torch.zeros(1, 20, dtype=torch.long, device=self.device)
        mentioned_mask = torch.zeros(1, 20, dtype=torch.bool, device=self.device)

        with torch.no_grad():
            logits = self.model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                mentioned_movies=mentioned_movies,
                mentioned_mask=mentioned_mask
            )

        # Take the top-k scores
        probs = torch.sigmoid(logits).squeeze(0)
        top_scores, top_indices = torch.topk(probs, self.top_k)

        recommendations = []
        for score, idx in zip(top_scores.cpu().numpy(), top_indices.cpu().numpy()):
            movie_id = self.movie_mapper.idx_to_movie_id(idx)
            movie_name = self.movie_mapper.movie_names.get(movie_id, f"Movie {movie_id}")
            recommendations.append({
                'movie_id': movie_id,
                'name': movie_name,
                'score': float(score)
            })

        return recommendations

# Create the inference engine from full_model (its weights are those of the last training epoch)
inference = MovieRecommenderInference(
    model=full_model,
    tokenizer=sbert_tokenizer,
    movie_mapper=movie_mapper,
    device=device
)

In [None]:
# Inference demonstration
print("="*60)
print("RECOMMENDATION DEMO")
print("="*60)

test_queries = [
    "I like animations and comedies. I enjoyed Toy Story and Finding Nemo.",
    "I'm looking for something dramatic and artistic. I love Christopher Nolan films.",
    "Can you recommend some action movies? I like Marvel superhero films.",
    "I want to watch something scary for Halloween. Horror movies please!"
]

for query in test_queries:
    print(f"\n{'─'*60}")
    print(f"Query: {query}")
    print(f"{'─'*60}")

    recommendations = inference.recommend(query)

    print("\nTop 5 Recommendations:")
    for i, rec in enumerate(recommendations[:5], 1):
        print(f"  {i}. {rec['name']} (score: {rec['score']:.4f})")

## 14. Error Analysis and Limitations

In [None]:
# Analysis following the discussion in the paper
print("="*70)
print("ANALYSIS OF LIMITATIONS (following the paper)")
print("="*70)

analysis = """
1. DATASET SIZE
   - Training: {} examples
   - Test: {} examples
   - The paper mentions ~8008 training and ~2002 evaluation examples
   - A small dataset leads to overfitting

2. DATA QUALITY
   - Sentences concatenated from dialogues may not be meaningful
   - Example: "Anything artistic [SEP] What's it about?" makes no sense in isolation

3. MOVIE COVERAGE
   - Total movies in the mapping: {}
   - Not all of them have user tags for multi-task training

4. COMPARISON WITH THE PAPER
   - The paper reports nDCG@10 between 0.130 and 0.169
   - Full conversational models reach 0.819
   - Our task is harder (one-shot vs conversational)

5. OBSERVED IMPROVEMENTS
   - RNN for collaborative features: +0.035 nDCG (paper)
   - Multi-task with tags: +0.004-0.008 nDCG (paper)
""".format(
    len(train_data),
    len(test_data),
    movie_mapper.get_num_movies()
)

print(analysis)

## 15. Save the Final Model
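One detail worth keeping in mind when a mapping such as `movie_to_idx` is written to JSON and read back later: JSON object keys are always strings, so integer keys do not survive a round trip unchanged. The sketch below shows the effect with a hypothetical two-movie mapping, assuming the movie IDs are Python integers (that is an assumption about the notebook's mapper, not something this section states).

```python
import json

# Hypothetical mapping from integer movie IDs to label indices.
movie_to_idx = {111: 0, 222: 1}

# Round trip through JSON: the integer keys come back as strings.
restored = json.loads(json.dumps({'movie_to_idx': movie_to_idx}))['movie_to_idx']
print(restored)

# Convert the keys back to integers before using the mapping for lookups.
movie_to_idx_int = {int(k): v for k, v in restored.items()}
print(movie_to_idx_int == movie_to_idx)
```

Converting the keys once right after loading keeps later lookups by integer ID from silently missing.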

In [None]:
# Save the full model and its configuration
import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (np.floating, np.integer)):
            return obj.item()
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        elif hasattr(obj, 'item'):  # PyTorch tensors (single-element only)
            return obj.item()
        return super().default(obj)

save_path = os.path.join(config.save_dir, 'final_model')
os.makedirs(save_path, exist_ok=True)

# Save the model weights
torch.save(full_model.state_dict(), os.path.join(save_path, 'model_weights.pt'))

# Save the configuration
config_dict = {k: v for k, v in vars(config).items() if not k.startswith('_')}
with open(os.path.join(save_path, 'config.json'), 'w') as f:
    json.dump(config_dict, f, indent=2, cls=NumpyEncoder)

# Save the movie mapping
with open(os.path.join(save_path, 'movie_mapping.json'), 'w') as f:
    json.dump({
        'movie_to_idx': movie_mapper.movie_to_idx,
        'movie_names': movie_mapper.movie_names
    }, f, indent=2, cls=NumpyEncoder)

# Save the training history
with open(os.path.join(save_path, 'training_history.json'), 'w') as f:
    json.dump({
        'baseline': baseline_history,
        'rnn': rnn_history,
        'multitask': multitask_history,
        'full': full_history
    }, f, indent=2, cls=NumpyEncoder)

print(f"Model saved to: {save_path}")
print("Saved files:")
for f in os.listdir(save_path):
    print(f"  - {f}")

In [None]:
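# Print a report with the values found
print("\n" + "="*60)
print("FINAL REPORT")
print("="*60)
for name, hist in [('Baseline', baseline_history), ('RNN', rnn_history),
                   ('Multi-Task', multitask_history), ('RNN + Multi-Task', full_history)]:
    print(f"{name}: best nDCG@10 = {max(hist['ndcg']):.4f}")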

## 16. Conclusion

This implementation adapts the paper "BERT one-shot movie recommender system" by Trung Nguyen (Stanford CS224N) to use SBERT (Sentence-BERT).

### Main Results:

| Configuration | nDCG@10 (paper, BERT) | nDCG@10 (SBERT) |
|---------------|-----------------------|-----------------|
| Baseline | 0.130 | See results above |
| + RNN | 0.165 | See results above |
| + Multi-Task | 0.138 | See results above |
| + RNN + Multi-Task | 0.169 | See results above |

### Insights:
1. The RNN for collaborative features gives the largest gain in the paper (+0.035 nDCG@10)
2. Multi-task learning with tags offers a marginal gain (+0.004 to +0.008 in the paper)
3. Combining both techniques produces the best result (0.169 in the paper)
4. SBERT uses mean pooling instead of only [CLS], potentially capturing context better
5. The small dataset and the concatenated nature of the data limit performance
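Insight 4 contrasts mean pooling with [CLS] pooling. The difference can be sketched in a few lines of plain Python: mean pooling averages the embeddings of all non-padding tokens, while [CLS] pooling keeps only the first token's embedding. The function names and the tiny 2-dimensional vectors are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vector[j]
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in total]


def cls_pool(token_embeddings):
    """Use only the first token's vector, as [CLS] pooling does."""
    return list(token_embeddings[0])


# Three token vectors; the last one is padding and must not affect the mean.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
print(cls_pool(tokens))
```

Here the mean-pooled vector reflects both real tokens, whereas the [CLS]-style vector depends entirely on what the first position has learned to summarise.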

### References:
- Nguyen, T. (2024). BERT one-shot movie recommender system. Stanford CS224N.
- Reimers & Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
- Li et al. (2018). Towards deep conversational recommendations. NeurIPS.
- Penha & Hauff (2020). What does BERT know about books, movies and music? RecSys.

## 17. Experiments with the Weighted Trainer (Fix)

**Goal:** Check whether the fixes resolve the problems identified:
- ✅ Weighted BCE loss: `pos_weight=1200` to balance the classes
- ✅ Balanced multi-task: `alpha=0.001` to bring the BCE and CE losses to the same scale

**Expectation:** nDCG@10 should rise from **0.044** to **> 0.10** in 20 epochs.
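To see why `pos_weight` matters under a roughly 1:1200 imbalance, the sketch below evaluates the per-element formula that weighted binary cross-entropy on logits uses, `-(pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x)))`, in plain Python. The helper names and the toy batch (one positive label, 1200 negatives, a model that outputs a logit of -6 everywhere) are hypothetical and only illustrate the arithmetic.

```python
import math


def bce_with_logits(logit, target, pos_weight=1.0):
    """Per-element binary cross-entropy on a logit, with an optional positive-class weight."""
    log_sigmoid = -math.log1p(math.exp(-logit))            # log(sigmoid(x))
    log_one_minus_sigmoid = -math.log1p(math.exp(logit))   # log(1 - sigmoid(x))
    return -(pos_weight * target * log_sigmoid + (1 - target) * log_one_minus_sigmoid)


def mean_loss(logits, targets, pos_weight=1.0):
    """Mean of the per-element losses over a flat list of logits and targets."""
    losses = [bce_with_logits(x, y, pos_weight) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


# Toy batch: 1 positive among 1200 negatives, and a model that predicts ~0 for everything.
targets = [1] + [0] * 1200
logits = [-6.0] * len(targets)

unweighted = mean_loss(logits, targets)
weighted = mean_loss(logits, targets, pos_weight=1200.0)
print(f"unweighted: {unweighted:.4f}  weighted: {weighted:.4f}")
```

Without weighting, predicting "no movie" for everything already yields a very small loss, so there is little gradient pressure to rank the single positive higher; with `pos_weight=1200` the missed positive dominates the mean instead.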

### 17.1 Quick Test: Baseline with Weighted BCE (5 epochs)

**Goal:** Quickly check whether weighted BCE fixes the main problem.  
**Expected:** nDCG@10 should be **> 0.06** after only 5 epochs (vs 0.044 without weights).

In [None]:
# Quick test: baseline with weighted BCE (5 epochs)
print("="*60)
print("TEST: SBERT Baseline with Weighted BCE Loss")
print("="*60)

# Create a new model (fresh weights)
weighted_baseline = SBERTMovieRecommender(config)

# Use WeightedTrainer
weighted_baseline_trainer = WeightedTrainer(
    model=weighted_baseline,
    config=config,
    train_loader=train_loader,
    eval_loader=eval_loader,
    use_multitask=False
)

# Train for only 5 epochs as a quick test
weighted_baseline_history = weighted_baseline_trainer.train(num_epochs=5)

print("\n" + "="*60)
print("TEST RESULT")
print("="*60)
print(f"Best nDCG@10: {max(weighted_baseline_history['ndcg']):.4f}")
print("Original baseline (20 epochs): 0.0440")
print(f"Difference: {(max(weighted_baseline_history['ndcg']) - 0.0440):+.4f}")
print("="*60)

### 17.2 Full Experiment: RNN + Multi-Task with the Weighted Trainer

**If the quick test works**, run the full model for 20 epochs:
- Weighted BCE loss (pos_weight=1200)
- Balanced multi-task (alpha=0.001)
- RNN for collaborative features

**Target:** nDCG@10 **> 0.10** (vs 0.0462 for the original Exp 4)
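The effect of `alpha` on the combined loss `bce + alpha * ce` can be checked with simple arithmetic. The sketch below uses the loss magnitudes reported for the original experiments (BCE ≈ 0.003, CE ≈ 6.0) and prints what fraction of the total comes from the tag task for a few values of `alpha`. The helper name is hypothetical, and the BCE magnitude will differ once `pos_weight` is applied, so the numbers are indicative only.

```python
def ce_share(bce: float, ce: float, alpha: float) -> float:
    """Fraction of the combined loss bce + alpha * ce contributed by the CE term."""
    total = bce + alpha * ce
    return (alpha * ce) / total if total > 0 else 0.0


bce, ce = 0.003, 6.0  # magnitudes reported for the unweighted experiments

for alpha in (1.0, 0.001, bce / ce):
    print(f"alpha={alpha:.4f}  total={bce + alpha * ce:.4f}  CE share={ce_share(bce, ce, alpha):.3f}")
```

With equal weighting the tag loss accounts for almost the entire objective; `alpha=0.001` brings it down to about two thirds, and `alpha = bce / ce` would split the objective evenly.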

In [None]:
# Full model: RNN + multi-task with the Weighted Trainer
print("\n" + "="*60)
print("FINAL EXPERIMENT: SBERT + RNN + Multi-Task (WEIGHTED)")
print("="*60)

# Create the full model
weighted_rnn_base = SBERTRNNMovieRecommender(config)
weighted_full_model = MultiTaskWrapper(weighted_rnn_base, config)

# Use WeightedTrainer with multi-task enabled
weighted_full_trainer = WeightedTrainer(
    model=weighted_full_model,
    config=config,
    train_loader=train_loader,
    eval_loader=eval_loader,
    tag_train_loader=tag_train_loader,
    tag_eval_loader=tag_eval_loader,
    use_multitask=True
)

# Train for the full config.num_epochs (20 in these experiments)
weighted_full_history = weighted_full_trainer.train(config.num_epochs)

### 17.3 Comparison: Original vs Weighted Trainer

In [None]:
# Comparison table
comparison_results = {
    'Configuration': [
        'Exp 1: Baseline (Original)',
        'Exp 2: +RNN (Original)',
        'Exp 3: +Multi-Task (Original)',
        'Exp 4: +RNN+Multi-Task (Original)',
        '─────────────────────',
        'Test: Baseline (Weighted) - 5 epochs',
        'Final: +RNN+Multi-Task (Weighted) - 20 epochs',
        '─────────────────────',
        'Target (paper): Baseline',
        'Target (paper): +RNN+Multi-Task'
    ],
    'Train Loss': [
        '0.0035',
        '0.0035',
        '6.2071 ❌',
        '6.2516 ❌',
        '─────',
        f"{weighted_baseline_history['train_loss'][-1]:.4f}",
        f"{weighted_full_history['train_loss'][-1]:.4f}",
        '─────',
        '~0.005',
        '~0.008'
    ],
    'nDCG@10': [
        '0.0440',
        '0.0459',
        '0.0443',
        '0.0462',
        '─────',
        f"{max(weighted_baseline_history['ndcg']):.4f}",
        f"{max(weighted_full_history['ndcg']):.4f}",
        '─────',
        '0.130',
        '0.169'
    ],
    'Status': [
        '❌ Stagnant',
        '❌ RNN has no effect',
        '❌ Multi-task broken',
        '❌ No real gain from combining',
        '─────',
        '✅ Fixed' if max(weighted_baseline_history['ndcg']) > 0.06 else '⚠️ Check',
        '✅ Fixed' if max(weighted_full_history['ndcg']) > 0.10 else '⚠️ Check',
        '─────',
        'Target',
        'Target'
    ]
}

comparison_df = pd.DataFrame(comparison_results)
print("\n" + "="*80)
print("COMPARISON: ORIGINAL TRAINER vs WEIGHTED TRAINER")
print("="*80)
print(comparison_df.to_string(index=False))

print("\n" + "="*80)
print("GAIN ANALYSIS")
print("="*80)
original_best = 0.0462  # Original Exp 4
weighted_best = max(weighted_full_history['ndcg'])
improvement = ((weighted_best - original_best) / original_best) * 100

print(f"Best original result: {original_best:.4f}")
print(f"Best weighted result: {weighted_best:.4f}")
# The '+' format flag prints the correct sign even if the weighted run is worse
print(f"Absolute gain: {(weighted_best - original_best):+.4f}")
print(f"Relative gain: {improvement:+.1f}%")
print("="*80)

## 18. Next Steps and Recommendations

### 🎯 How to Use This Notebook

#### **Option 1: Quick Test (recommended starting point)**
Run the cell in section **17.1** to validate the Weighted Trainer in 5 epochs:
```python
# Should take ~5 minutes on Colab with a GPU
weighted_baseline_trainer.train(num_epochs=5)
```

**Success criterion**: nDCG@10 **> 0.06** after only 5 epochs (vs 0.044 originally).

---

#### **Option 2: Full Experiment**
If the quick test works, run the cell in section **17.2** to train the full model:
```python
# Takes ~20 minutes on Colab with a GPU
weighted_full_trainer.train(num_epochs=20)
```

**Success criterion**: nDCG@10 **> 0.10** in 20 epochs (target: 0.13-0.17 from the paper).

---

### 🔧 Additional Adjustments (If Needed)

#### **If nDCG@10 is still low (< 0.08 in 20 epochs):**

1. **Increase `pos_weight`** (where `pos_weight_value` is computed in `WeightedTrainer`):
   ```python
   pos_weight_value = config.num_movies / avg_labels * 1.5  # Increase by 50%
   ```

2. **Adjust the multi-task alpha** (`self.multitask_alpha` in `WeightedTrainer.__init__`):
   ```python
   self.multitask_alpha = 0.0005  # Reduce the weight of the CE loss
   ```

3. **Implement Focal Loss** (an alternative to weighted BCE):
   ```python
   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   # Focuses on hard examples rather than only rebalancing the classes
   class FocalLoss(nn.Module):
       def __init__(self, alpha=1, gamma=2):
           super().__init__()
           self.alpha = alpha
           self.gamma = gamma

       def forward(self, inputs, targets):
           bce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
           pt = torch.exp(-bce_loss)  # probability assigned to the true class
           focal_loss = self.alpha * (1-pt)**self.gamma * bce_loss
           return focal_loss.mean()
   ```
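To get a feel for what the `(1 - pt) ** gamma` factor in the focal loss does, the following plain-Python sketch applies the same per-element formula to one easy negative and one badly missed positive. The helper names are hypothetical, and the logits are illustrative values rather than outputs of the notebook's model.

```python
import math


def bce_on_logit(logit, target):
    """Unweighted per-element binary cross-entropy on a logit."""
    log_sigmoid = -math.log1p(math.exp(-logit))
    log_one_minus_sigmoid = -math.log1p(math.exp(logit))
    return -(target * log_sigmoid + (1 - target) * log_one_minus_sigmoid)


def focal_on_logit(logit, target, alpha=1.0, gamma=2.0):
    """Per-element focal loss: BCE scaled by (1 - pt) ** gamma, with pt = exp(-BCE)."""
    bce = bce_on_logit(logit, target)
    pt = math.exp(-bce)  # probability the model assigns to the true class
    return alpha * (1 - pt) ** gamma * bce


easy_negative = focal_on_logit(-6.0, 0)   # confidently and correctly rejected
missed_positive = focal_on_logit(-6.0, 1)  # relevant movie scored as irrelevant
print(f"easy negative: {easy_negative:.2e}  missed positive: {missed_positive:.2f}")
```

The easy negative is scaled down by several orders of magnitude while the missed positive keeps almost its full BCE value, which is how focal loss shifts the gradient toward the rare positives without an explicit class weight.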

---

### 📊 Interpreting the Results

**Signs that it is working:**
- ✅ Train loss between 0.01 and 0.03 (no longer 6.2)
- ✅ BCE loss and CE loss balanced (roughly the same order of magnitude)
- ✅ nDCG@10 growing consistently
- ✅ Small gap between train loss and eval loss (<0.01)

**Signs of a persistent problem:**
- ❌ Train loss still very high (>1.0) after 10 epochs
- ❌ nDCG@10 stuck below 0.05 after 15 epochs
- ❌ BCE loss much larger than CE loss (the imbalance reversed)
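The failure signals above can be checked mechanically against the history dictionary returned by `train()`. The sketch below is one possible helper, assuming the history keys used by the weighted trainer (`train_loss`, `train_bce_loss`, `train_ce_loss`, `ndcg`); the function name and the factor of 10 used for "much larger" are hypothetical choices.

```python
def check_history(history, high_loss=1.0, stall_ndcg=0.05, imbalance_factor=10.0):
    """Return a list of warnings for the persistent-problem signals (thresholds are illustrative)."""
    warnings = []
    train_loss = history.get('train_loss', [])
    ndcg = history.get('ndcg', [])
    bce = history.get('train_bce_loss', [])
    ce = history.get('train_ce_loss', [])

    # Train loss still very high after 10 epochs
    if len(train_loss) >= 10 and train_loss[-1] > high_loss:
        warnings.append(f"train loss {train_loss[-1]:.4f} is still above {high_loss}")

    # nDCG@10 stuck below the threshold after 15 epochs
    if len(ndcg) >= 15 and max(ndcg) < stall_ndcg:
        warnings.append(f"nDCG@10 never exceeded {stall_ndcg}")

    # BCE much larger than CE (only meaningful when a non-zero CE loss was tracked)
    if bce and ce and ce[-1] > 0 and bce[-1] > imbalance_factor * ce[-1]:
        warnings.append("BCE loss is much larger than CE loss")

    return warnings


# Hypothetical history resembling the original Exp 4 (loss ~6.2, nDCG flat at ~0.046).
stalled = {'train_loss': [6.2] * 20, 'ndcg': [0.046] * 20}
print(check_history(stalled))
```

An empty list means none of the three signals fired; it does not by itself confirm that the target nDCG@10 was reached.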

---

### 📈 Expected Benchmarks

| Epochs | nDCG@10 (Weighted) | Status |
|--------|--------------------|--------|
| 5 | 0.06-0.08 | ✅ Initial validation |
| 10 | 0.08-0.10 | 🔄 Progress |
| 20 | 0.10-0.13 | 🎯 Minimum target |
| 50+ | 0.13-0.17 | 🏆 Paper's target |

---

### 🐛 Troubleshooting

**"CUDA out of memory":**
```python
config.movies_batch_size = 16  # Reduce from 32
```

**"Loss NaN":**
```python
# Reduce the learning rate
config.learning_rate = 1e-5  # From 2e-5
```

**"Overfitting (Train Loss << Eval Loss)":**
```python
# Increase dropout
config.dropout_prob = 0.2  # From 0.1
```

In [None]:
print("\n" + "="*60)
print("SBERT IMPLEMENTATION COMPLETE!")
print("="*60)
print("\nTo train for more epochs, set this before creating the trainer:")
print("  config.num_epochs = 200  # As in the original paper")
print("\nTo use the trained model:")
print("  inference = MovieRecommenderInference(full_model, sbert_tokenizer, movie_mapper, device)")
print("  recs = inference.recommend('I like action movies')")