# 🚀 IndoBERT 3-Class Sentiment Analysis - Kaggle Version

**Dataset**: gojek_reviews_3class_clean.csv  
**Model**: IndoBERT (indobenchmark/indobert-base-p1)  
**Target**: High accuracy with good generalization (no overfitting)

## 📋 Before Running:

1. **Upload the dataset** to Kaggle as a Dataset, or upload the file directly
2. **Enable the GPU**: Settings → Accelerator → GPU T4 x2 or P100
3. **Enable Internet**: Settings → Internet → On

---

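### Anti-Overfitting Techniques Used:
1. **Layer Freezing** - Freeze the embeddings and the first 9 of 12 BERT encoder layers
2. **Data Balancing** - Undersample every class down to the minority-class size
3. **High Dropout** - 0.6 on the classifier head, 0.3 in the attention blocks of the unfrozen layers
4. **Label Smoothing** - 0.2 for soft labels
5. **Early Stopping** - Patience 7, monitoring validation F1
6. **Weight Decay** - 0.05, applied through AdamW
7. **Learning Rate Warmup** - Linear increase over the first 15% of training steps
8. **Gradient Clipping** - Max norm 0.5, to prevent exploding gradients
9. **Data Augmentation** - Word dropout, adjacent-word swap, random deletion, middle-word shuffle
10. **R-Drop** - KL-divergence consistency loss between two dropout passes (weight 0.5)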

### Sentiment Classes:
- **0 = Negative** (Score 1-2)
- **1 = Neutral** (Score 3)
- **2 = Positive** (Score 4-5)
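
The mapping above can be written as a small helper. This is a minimal sketch: `score_to_label` is a hypothetical name, and it assumes the raw reviews carry an integer star rating from 1 to 5, a column this notebook does not show.

```python
def score_to_label(score: int) -> int:
    """Map a 1-5 star rating to a class id: 0=negative, 1=neutral, 2=positive."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    if score <= 2:
        return 0  # negative
    if score == 3:
        return 1  # neutral
    return 2  # positive


labels = [score_to_label(s) for s in [1, 2, 3, 4, 5]]
print(labels)  # [0, 0, 1, 2, 2]
```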

In [None]:
# ============================================
# SETUP KAGGLE
# ============================================

# Install dependencies
!pip install transformers -q

# Check GPU
import torch
print(f'PyTorch version: {torch.__version__}')
if torch.cuda.is_available():
    print(f'✓ GPU Available: {torch.cuda.get_device_name(0)}')
    print(f'✓ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')
else:
    print('⚠️ GPU not available, using CPU (this will be slower)')

# List input files
import os
print('\n📁 Input files:')
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from tqdm.auto import tqdm
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support, 
    classification_report, confusion_matrix, f1_score
)
from sklearn.utils import resample
import random
import os
import copy
import json
from datetime import datetime

warnings.filterwarnings('ignore')

# Reproducibility
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

# Device setup (GPU availability was already checked in the previous cell)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'🖥️  Device: {device}')
if torch.cuda.is_available():
    print(f'🎮 GPU: {torch.cuda.get_device_name(0)}')

## 📊 1. Load & Explore Data

In [None]:
# ============================================
# LOAD DATA - ADJUST THE PATH TO MATCH YOUR DATASET
# ============================================
# Recommended file: gojek_reviews_final_augmented.csv (15,000 samples, 5,000 per class)

# Find the file automatically (the augmented file takes priority)
DATA_PATH = None
priority_files = [
    'gojek_reviews_final_augmented',  # 15,000 samples - RECOMMENDED
    'gojek_reviews_3class_balanced',
    'gojek_reviews_3class_clean',
    'gojek_reviews'
]

# Collect every CSV first so the priority order holds across directories,
# not only within the first directory that happens to contain a match
csv_files = [
    os.path.join(dirname, filename)
    for dirname, _, filenames in os.walk('/kaggle/input')
    for filename in filenames
    if filename.endswith('.csv')
]

for priority in priority_files:
    matches = [path for path in csv_files if priority in os.path.basename(path)]
    if matches:
        DATA_PATH = matches[0]
        print(f'✓ Found data file: {DATA_PATH}')
        break

if DATA_PATH is None:
    print('❌ Data file not found!')
    print('\n📁 Available files:')
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(f'   {os.path.join(dirname, filename)}')
    print('\n💡 Upload gojek_reviews_final_augmented.csv to Kaggle')
else:
    df = pd.read_csv(DATA_PATH)
    
    print('=' * 60)
    print('📊 DATA OVERVIEW')
    print('=' * 60)
    print(f'Total samples: {len(df):,}')
    print(f'\nColumns: {df.columns.tolist()}')
    print(f'\n📈 Sentiment Distribution:')
    print(df['sentiment'].value_counts())
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Bar plot
    colors = {'negative': '#e74c3c', 'neutral': '#95a5a6', 'positive': '#2ecc71'}
    sentiment_counts = df['sentiment'].value_counts()
    axes[0].bar(sentiment_counts.index, sentiment_counts.values, 
                color=[colors[s] for s in sentiment_counts.index])
    axes[0].set_title('Sentiment Distribution')
    axes[0].set_ylabel('Count')
    
    # Pie chart
    axes[1].pie(sentiment_counts.values, labels=sentiment_counts.index, 
                autopct='%1.1f%%', colors=[colors[s] for s in sentiment_counts.index])
    axes[1].set_title('Sentiment Percentage')
    
    plt.tight_layout()
    plt.show()

## ⚖️ 2. Balance Data (Undersampling)

In [None]:
# Check if data is already balanced
counts = df['sentiment'].value_counts()
min_count = counts.min()
max_count = counts.max()

# If already balanced (difference < 10%), skip undersampling
if (max_count - min_count) / max_count < 0.1:
    print('✓ Data is already balanced. Skipping undersampling.')
    df_balanced = df.copy()
else:
    # Balance the data by undersampling every class to the minority-class size
    print('⚠️ Data is imbalanced. Undersampling...')
    print(f'Minority class: {min_count} samples')
    
    df_balanced = pd.DataFrame()
    for sentiment in ['negative', 'neutral', 'positive']:
        df_class = df[df['sentiment'] == sentiment]
        df_sampled = resample(df_class, replace=False, n_samples=min_count, random_state=42)
        df_balanced = pd.concat([df_balanced, df_sampled])
    
    # Shuffle
    df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print('\n' + '=' * 60)
print('⚖️  DATA FOR TRAINING')
print('=' * 60)
print(f'Total: {len(df_balanced):,}')
print(df_balanced['sentiment'].value_counts())

# Visualize balanced
plt.figure(figsize=(8, 4))
balanced_counts = df_balanced['sentiment'].value_counts()
colors = {'negative': '#e74c3c', 'neutral': '#95a5a6', 'positive': '#2ecc71'}
plt.bar(balanced_counts.index, balanced_counts.values, 
        color=[colors[s] for s in balanced_counts.index])
plt.title('Sentiment Distribution for Training')
plt.ylabel('Count')
for i, (label, count) in enumerate(balanced_counts.items()):
    plt.text(i, count + 50, str(count), ha='center', fontweight='bold')
plt.show()
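
The undersampling step can also be illustrated without pandas. The sketch below is a plain-Python stand-in for the `resample` call, not the notebook's actual implementation; `undersample` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_key="sentiment", seed=42):
    """Randomly keep min-class-size rows from every class, then shuffle."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    min_count = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Sampling without replacement, like resample(..., replace=False)
        balanced.extend(rng.sample(group, min_count))
    rng.shuffle(balanced)
    return balanced


# Toy, made-up data: 5 negative, 2 neutral, 9 positive rows
rows = (
    [{"sentiment": "negative"}] * 5
    + [{"sentiment": "neutral"}] * 2
    + [{"sentiment": "positive"}] * 9
)
balanced = undersample(rows)
print(Counter(r["sentiment"] for r in balanced))
```

Every class ends up with as many rows as the smallest class, so the majority classes lose data. That is the trade-off undersampling makes in exchange for balance.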

## 🏷️ 3. Prepare Labels & Split Data

In [None]:
# Label mapping
LABEL_MAP = {'negative': 0, 'neutral': 1, 'positive': 2}
LABEL_NAMES = ['negative', 'neutral', 'positive']
NUM_CLASSES = 3

df_balanced['label'] = df_balanced['sentiment'].map(LABEL_MAP)

# Split: 70% train, 15% validation, 15% test (stratified)
# Stratifying keeps the class distribution the same in every split
train_df, temp_df = train_test_split(
    df_balanced, test_size=0.3, random_state=42, 
    stratify=df_balanced['label']
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, random_state=42, 
    stratify=temp_df['label']
)

print('=' * 60)
print('📂 DATA SPLITS')
print('=' * 60)
print(f'Train: {len(train_df):,} samples ({len(train_df)/len(df_balanced)*100:.1f}%)')
print(f'Val:   {len(val_df):,} samples ({len(val_df)/len(df_balanced)*100:.1f}%)')
print(f'Test:  {len(test_df):,} samples ({len(test_df)/len(df_balanced)*100:.1f}%)')

print(f'\n📊 Train label distribution:')
print(train_df['sentiment'].value_counts())

## 🔧 4. Hyperparameters & Configuration

In [None]:
# === HYPERPARAMETERS ===
# Tuned to limit overfitting while keeping accuracy high
# Strategy: freeze more layers, regularize harder, use a smaller learning rate

CONFIG = {
    # Model
    'model_name': 'indobenchmark/indobert-base-p1',
    'max_length': 128,
    'num_classes': NUM_CLASSES,
    
    # Training - conservative to avoid overfitting
    'batch_size': 32,
    'epochs': 30,  # Upper bound; early stopping can end training sooner
    'learning_rate': 5e-6,  # Very small, to slow down overfitting
    
    # Anti-overfitting - strong regularization
    'dropout_rate': 0.6,  # High dropout on the classifier head
    'attention_dropout': 0.3,  # Dropout inside the unfrozen attention blocks
    'weight_decay': 0.05,  # Weight decay applied by AdamW
    'label_smoothing': 0.2,
    'warmup_ratio': 0.15,  # Fraction of total steps used for warmup
    'max_grad_norm': 0.5,  # Tight gradient clipping
    'early_stopping_patience': 7,
    
    # Data augmentation
    'word_dropout_prob': 0.2,
    'mixup_alpha': 0.2,  # Reserved for mixup; not used by the training code in this notebook
    
    # Layer freezing
    'freeze_layers': 9,  # Freeze 9 of 12 layers (only 3 remain trainable)
    
    # R-Drop regularization
    'rdrop_alpha': 0.5,  # KL divergence loss weight
}

print('=' * 60)
print('⚙️  ULTRA-OPTIMIZED CONFIGURATION')
print('=' * 60)
print('🎯 Strategy: Maximum regularization + Minimal trainable params')
print('-' * 60)
for key, value in CONFIG.items():
    print(f'{key}: {value}')
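
To make the `label_smoothing` value concrete, the sketch below computes the soft target distribution for one example. It assumes the usual formulation, a mix of the one-hot target and a uniform distribution, which is how the PyTorch documentation describes `CrossEntropyLoss(label_smoothing=...)`. `smoothed_targets` is a hypothetical helper, not part of the notebook.

```python
def smoothed_targets(true_class: int, num_classes: int, smoothing: float):
    """Return the soft target distribution produced by label smoothing.

    Every class receives smoothing / num_classes, and the true class
    additionally receives 1 - smoothing.
    """
    uniform = smoothing / num_classes
    targets = [uniform] * num_classes
    targets[true_class] += 1.0 - smoothing
    return targets


targets = smoothed_targets(true_class=2, num_classes=3, smoothing=0.2)
print([round(t, 4) for t in targets])  # [0.0667, 0.0667, 0.8667]
```

With three classes and a smoothing of 0.2, the model is trained toward roughly 87% confidence on the true class rather than 100%.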

## 📦 5. Dataset Class with Augmentation

In [None]:
# Load tokenizer
tokenizer = BertTokenizer.from_pretrained(CONFIG['model_name'])
print(f'✓ Tokenizer loaded: {CONFIG["model_name"]}')

class SentimentDataset(Dataset):
    """Dataset with optional on-the-fly text augmentation."""
    
    def __init__(self, texts, labels, tokenizer, max_length=128, 
                 augment=False, word_dropout_prob=0.2):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.augment = augment
        self.word_dropout_prob = word_dropout_prob
    
    def __len__(self):
        return len(self.texts)
    
    def _augment_text(self, text):
        """Apply at most one randomly chosen augmentation to the text."""
        if not self.augment:
            return text
            
        text = str(text)
        words = text.split()
        
        # Leave very short texts untouched
        if len(words) <= 3:
            return text
        
        # Randomly choose one augmentation technique
        aug_type = random.random()
        
        if aug_type < 0.3:
            # Word dropout - drop each word with probability word_dropout_prob
            words = [w for w in words if random.random() > self.word_dropout_prob]
        elif aug_type < 0.5:
            # Word swap - swap one random pair of adjacent words
            if len(words) > 2:
                idx = random.randint(0, len(words) - 2)
                words[idx], words[idx + 1] = words[idx + 1], words[idx]
        elif aug_type < 0.7:
            # Random deletion - remove one random word
            if len(words) > 4:
                del_idx = random.randint(0, len(words) - 1)
                words.pop(del_idx)
        elif aug_type < 0.85:
            # Shuffle middle words (keep first and last)
            if len(words) > 4:
                middle = words[1:-1]
                random.shuffle(middle)
                words = [words[0]] + middle + [words[-1]]
        # else: no augmentation (15% chance)
        
        # Fall back to the original text if every word was dropped
        return ' '.join(words) if words else text
    
    def __getitem__(self, idx):
        text = self._augment_text(self.texts[idx])
        
        # Calling the tokenizer directly replaces the legacy encode_plus method
        encoding = self.tokenizer(
            str(text),
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# Create datasets
train_dataset = SentimentDataset(
    train_df['content_clean'].values,
    train_df['label'].values,
    tokenizer,
    max_length=CONFIG['max_length'],
    augment=True,
    word_dropout_prob=CONFIG['word_dropout_prob']
)

val_dataset = SentimentDataset(
    val_df['content_clean'].values,
    val_df['label'].values,
    tokenizer,
    max_length=CONFIG['max_length'],
    augment=False
)

test_dataset = SentimentDataset(
    test_df['content_clean'].values,
    test_df['label'].values,
    tokenizer,
    max_length=CONFIG['max_length'],
    augment=False
)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=CONFIG['batch_size'], shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=CONFIG['batch_size'], shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=CONFIG['batch_size'], shuffle=False)

print(f'\n✓ Datasets created:')
print(f'  Train: {len(train_dataset)} samples, {len(train_loader)} batches')
print(f'  Val:   {len(val_dataset)} samples, {len(val_loader)} batches')
print(f'  Test:  {len(test_dataset)} samples, {len(test_loader)} batches')
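
The word-dropout branch of `_augment_text` can be tried in isolation. This is a standalone sketch with a hypothetical `word_dropout` function; it takes an explicit `random.Random` instance so runs are repeatable, whereas the dataset class uses the global `random` module.

```python
import random


def word_dropout(text: str, drop_prob: float, rng: random.Random) -> str:
    """Drop each word independently with probability drop_prob."""
    words = text.split()
    kept = [w for w in words if rng.random() > drop_prob]
    # Fall back to the original text if every word was dropped
    return " ".join(kept) if kept else text


rng = random.Random(0)
sample = "driver ramah dan cepat sekali"
print(word_dropout(sample, 0.2, rng))
```

Because the augmentation runs inside `__getitem__`, each epoch sees a different variant of the same review. Dropping a word such as a negation can flip the true sentiment, so the drop probability is best kept low.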

## 🧠 6. Model Architecture

In [None]:
class IndoBERTSentimentClassifier(nn.Module):
    """
    IndoBERT with heavy regularization to limit overfitting:
    - Freezes the embeddings and the first `freeze_layers` encoder layers
      (9 of 12 by default, leaving 3 trainable)
    - Multiple dropout layers on the pooled output
    - Attention dropout in the unfrozen layers
    - A single linear classifier (keeps the head from overfitting)
    """
    
    def __init__(self, model_name, num_classes, dropout_rate=0.6, 
                 attention_dropout=0.3, freeze_layers=9):
        super(IndoBERTSentimentClassifier, self).__init__()
        
        # Load pretrained BERT
        self.bert = BertModel.from_pretrained(model_name)
        self.hidden_size = self.bert.config.hidden_size
        
        # === FREEZE BERT LAYERS ===
        # Freeze embeddings
        for param in self.bert.embeddings.parameters():
            param.requires_grad = False
        
        # Freeze first N encoder layers
        for i in range(freeze_layers):
            for param in self.bert.encoder.layer[i].parameters():
                param.requires_grad = False
        
        # Raise dropout inside the attention blocks of the unfrozen layers.
        # Caveat: some attention backends (e.g. SDPA) may not route through
        # attention.self.dropout, in which case only the output dropout applies.
        num_layers = len(self.bert.encoder.layer)
        for i in range(freeze_layers, num_layers):
            self.bert.encoder.layer[i].attention.self.dropout = nn.Dropout(attention_dropout)
            self.bert.encoder.layer[i].attention.output.dropout = nn.Dropout(attention_dropout)
        
        print(f'✓ Froze embeddings and first {freeze_layers} encoder layers')
        print(f'  Only layers {freeze_layers}-{num_layers - 1} are trainable '
              f'({num_layers - freeze_layers} layers)')
        
        # Regularization
        self.dropout1 = nn.Dropout(dropout_rate)
        self.dropout2 = nn.Dropout(dropout_rate)
        self.dropout3 = nn.Dropout(dropout_rate * 0.5)  # Defined but not used in forward()
        self.layer_norm = nn.LayerNorm(self.hidden_size)
        
        # Simple classifier - a single linear layer straight to the logits,
        # with no hidden layer that could overfit
        self.fc = nn.Linear(self.hidden_size, num_classes)
        
        # Initialize weights
        nn.init.xavier_uniform_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)
    
    def forward(self, input_ids, attention_mask, return_hidden=False):
        # Get BERT output
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        
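        # Use the pooled output: the [CLS] hidden state passed through
        # BERT's pooler (dense layer + tanh)
        pooled_output = outputs.pooler_output
        
        # Regularization pipeline: layer norm followed by dropout
        x = self.layer_norm(pooled_output)
        x = self.dropout1(x)
        
        # Second dropout stacked on the first. nn.Dropout is already a no-op in
        # eval mode, so this check only makes that explicit. This is not Monte
        # Carlo Dropout, which would keep dropout active at inference time.
        if self.training:
            x = self.dropout2(x)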
        
        logits = self.fc(x)
        
        if return_hidden:
            return logits, pooled_output
        return logits

# Initialize model
model = IndoBERTSentimentClassifier(
    model_name=CONFIG['model_name'],
    num_classes=CONFIG['num_classes'],
    dropout_rate=CONFIG['dropout_rate'],
    attention_dropout=CONFIG['attention_dropout'],
    freeze_layers=CONFIG['freeze_layers']
).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen_params = total_params - trainable_params

print(f'\n✓ Model initialized')
print(f'  Total parameters: {total_params:,}')
print(f'  Trainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.1f}%)')
print(f'  Frozen parameters: {frozen_params:,} ({frozen_params/total_params*100:.1f}%)')
print(f'\n⚠️ Note: fewer trainable parameters make overfitting harder')
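
During training, `forward` applies `dropout1` and then `dropout2` to the same vector, so the two rates compound. The sketch below estimates the combined drop rate, assuming the two masks are independent; `effective_drop_rate` is a hypothetical helper.

```python
def effective_drop_rate(rates):
    """Probability that a unit is zeroed after independent dropout layers in sequence."""
    keep = 1.0
    for rate in rates:
        keep *= 1.0 - rate
    return 1.0 - keep


print(f"{effective_drop_rate([0.6, 0.6]):.2f}")  # 0.84
```

With `dropout_rate=0.6`, about 84% of the pooled features are zeroed on each training step. If training accuracy stays well below validation accuracy, this stacking is a likely contributor and a lower rate may be worth trying.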

## 📉 7. Loss Function, Optimizer & Scheduler

In [None]:
# Loss function with label smoothing
criterion = nn.CrossEntropyLoss(label_smoothing=CONFIG['label_smoothing'])

# Optimizer - ONLY for trainable parameters
# Separate the parameters that get weight decay from those that do not
no_decay = ['bias', 'LayerNorm.weight', 'layer_norm.weight']

# Keep only the parameters with requires_grad=True
trainable_params_list = [(n, p) for n, p in model.named_parameters() if p.requires_grad]

optimizer_grouped_parameters = [
    {
        'params': [p for n, p in trainable_params_list if not any(nd in n for nd in no_decay)],
        'weight_decay': CONFIG['weight_decay']
    },
    {
        'params': [p for n, p in trainable_params_list if any(nd in n for nd in no_decay)],
        'weight_decay': 0.0
    }
]

optimizer = AdamW(optimizer_grouped_parameters, lr=CONFIG['learning_rate'])

# Learning rate scheduler with warmup
total_steps = len(train_loader) * CONFIG['epochs']
warmup_steps = int(total_steps * CONFIG['warmup_ratio'])

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

print(f'✓ Optimizer: AdamW (lr={CONFIG["learning_rate"]}, weight_decay={CONFIG["weight_decay"]})')
print(f'✓ Scheduler: Linear warmup ({warmup_steps} warmup steps, {total_steps} total steps)')
print(f'✓ Loss: CrossEntropy with label_smoothing={CONFIG["label_smoothing"]}')
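
The shape of the warmup schedule is easier to see with numbers. The sketch below reimplements the learning-rate multiplier in plain Python, assuming a linear ramp followed by a linear decay to zero. `linear_warmup_decay` is a hypothetical name, and the step counts are illustrative rather than taken from a real run.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: linear ramp 0 -> 1, then linear decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 5e-6
total_steps, warmup_steps = 1000, 150  # illustrative values, warmup_ratio = 0.15
for step in (0, 75, 150, 575, 1000):
    print(step, base_lr * linear_warmup_decay(step, warmup_steps, total_steps))
```

`total_steps` is computed from the full 30 epochs. If early stopping ends training sooner, the decay never reaches zero, so the final learning rate depends on when training stops.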

## 🏋️ 8. Training Functions

In [None]:
import torch.nn.functional as F

def compute_kl_loss(p, q):
    """Compute KL divergence loss for R-Drop"""
    p_loss = F.kl_div(F.log_softmax(p, dim=-1), F.softmax(q, dim=-1), reduction='batchmean')
    q_loss = F.kl_div(F.log_softmax(q, dim=-1), F.softmax(p, dim=-1), reduction='batchmean')
    return (p_loss + q_loss) / 2

def train_epoch_rdrop(model, dataloader, criterion, optimizer, scheduler, device, 
                      max_grad_norm, rdrop_alpha=0.5):
    """Train for one epoch with R-Drop regularization to reduce overfitting."""
    model.train()
    total_loss = 0
    all_preds = []
    all_labels = []
    
    progress_bar = tqdm(dataloader, desc='Training', leave=False)
    
    for batch in progress_bar:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        optimizer.zero_grad()
        
        # R-Drop: two forward passes over the same batch with different dropout masks
        logits1 = model(input_ids, attention_mask)
        logits2 = model(input_ids, attention_mask)
        
        # Cross entropy loss
        ce_loss = (criterion(logits1, labels) + criterion(logits2, labels)) / 2
        
        # KL divergence loss (R-Drop)
        kl_loss = compute_kl_loss(logits1, logits2)
        
        # Total loss
        loss = ce_loss + rdrop_alpha * kl_loss
        
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        
        optimizer.step()
        scheduler.step()
        
        total_loss += loss.item()
        # Use average logits for prediction
        avg_logits = (logits1 + logits2) / 2
        preds = torch.argmax(avg_logits, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.cpu().numpy())
        
        progress_bar.set_postfix({'loss': f'{loss.item():.4f}', 'ce': f'{ce_loss.item():.4f}'})
    
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average='weighted')
    
    return avg_loss, accuracy, f1


def evaluate(model, dataloader, criterion, device):
    """Evaluate the model in eval mode (dropout disabled, one forward pass per batch)."""
    model.eval()
    total_loss = 0
    all_preds = []
    all_labels = []
    all_probs = []
    
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            logits = model(input_ids, attention_mask)
            loss = criterion(logits, labels)
            
            probs = F.softmax(logits, dim=1)
            
            total_loss += loss.item()
            preds = torch.argmax(logits, dim=1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels.cpu().numpy())
            all_probs.extend(probs.cpu().numpy())
    
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average='weighted')
    
    return avg_loss, accuracy, f1, all_preds, all_labels


class EarlyStopping:
    """Early stopping on the monitored score; also warns when the train-val gap exceeds max_gap."""
    
    def __init__(self, patience=7, min_delta=0.001, mode='max', max_gap=0.08):
        self.patience = patience
        self.min_delta = min_delta
        self.mode = mode
        self.max_gap = max_gap  # Maximum allowed train-val gap
        self.counter = 0
        self.best_score = None
        self.early_stop = False
        self.best_model = None
        self.best_gap = float('inf')
    
    def __call__(self, score, model, train_score=None):
        # Check if overfitting (gap too large)
        if train_score is not None:
            gap = train_score - score
            if gap > self.max_gap:
                print(f'   ⚠️ Gap {gap:.4f} > {self.max_gap} - potential overfitting')
        
        if self.mode == 'min':
            is_improvement = self.best_score is None or score < self.best_score - self.min_delta
        else:
            is_improvement = self.best_score is None or score > self.best_score + self.min_delta
        
        if is_improvement:
            self.best_score = score
            self.best_model = copy.deepcopy(model.state_dict())
            self.counter = 0
            if train_score is not None:
                self.best_gap = train_score - score
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        
        return self.early_stop

print('✓ Training functions with R-Drop regularization defined')
print('  R-Drop helps reduce overfitting by enforcing consistency between dropout samples')
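
The consistency term that `compute_kl_loss` adds can be checked by hand on a single example. The sketch below is a plain-Python version for one pair of logit vectors, with no batching; `softmax`, `kl_divergence`, and `symmetric_kl` are hypothetical helpers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(logits1, logits2):
    """Average of KL(p1 || p2) and KL(p2 || p1), the R-Drop consistency term."""
    p1, p2 = softmax(logits1), softmax(logits2)
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))


print(symmetric_kl([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
print(symmetric_kl([2.0, 0.5, -1.0], [1.0, 1.5, -0.5]))
```

The term is zero when both dropout passes agree and grows as they diverge, which pushes the model toward predictions that are stable under dropout. The cost is two forward passes per batch.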

## 🚀 9. Training Loop

In [None]:
# Training history
history = {
    'train_loss': [], 'train_acc': [], 'train_f1': [],
    'val_loss': [], 'val_acc': [], 'val_f1': [],
    'gap': []  # Train-val accuracy gap, tracked for monitoring
}

# Early stopping - monitors validation F1; max_gap only triggers a warning
early_stopping = EarlyStopping(
    patience=CONFIG['early_stopping_patience'], 
    mode='max',
    max_gap=0.08  # Warn (do not stop) when the train-val gap exceeds 8%
)

print('=' * 60)
print('🚀 TRAINING STARTED - ULTRA OPTIMIZED')
print('=' * 60)
print(f'Epochs: {CONFIG["epochs"]} | Early Stopping Patience: {CONFIG["early_stopping_patience"]}')
print(f'Learning Rate: {CONFIG["learning_rate"]} | Batch Size: {CONFIG["batch_size"]}')
print(f'Frozen Layers: {CONFIG["freeze_layers"]}/12 | Dropout: {CONFIG["dropout_rate"]}')
print(f'R-Drop Alpha: {CONFIG["rdrop_alpha"]} | Weight Decay: {CONFIG["weight_decay"]}')
print('-' * 60)

best_val_f1 = 0
best_epoch = 0
best_gap = float('inf')

for epoch in range(CONFIG['epochs']):
    print(f'\n📍 Epoch {epoch + 1}/{CONFIG["epochs"]}')
    
    # Train with R-Drop
    train_loss, train_acc, train_f1 = train_epoch_rdrop(
        model, train_loader, criterion, optimizer, scheduler, 
        device, CONFIG['max_grad_norm'], CONFIG['rdrop_alpha']
    )
    
    # Validate
    val_loss, val_acc, val_f1, _, _ = evaluate(
        model, val_loader, criterion, device
    )
    
    # Calculate gap
    gap = train_acc - val_acc
    
    # Save history
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['train_f1'].append(train_f1)
    history['val_loss'].append(val_loss)
    history['val_acc'].append(val_acc)
    history['val_f1'].append(val_f1)
    history['gap'].append(gap)
    
    # Print metrics
    print(f'  Train - Loss: {train_loss:.4f} | Acc: {train_acc:.4f} | F1: {train_f1:.4f}')
    print(f'  Val   - Loss: {val_loss:.4f} | Acc: {val_acc:.4f} | F1: {val_f1:.4f}')
    
    # Check overfitting status. Training accuracy is measured with dropout and
    # augmentation active, so the gap can be small or even negative.
    print(f'  📊 Train-Val Gap: {gap*100:.2f}%', end='')
    if gap > 0.10:
        print(' ⚠️ OVERFITTING!')
    elif gap > 0.05:
        print(' ⚡ Slight gap')
    else:
        print(' ✅ Good generalization')
    
    # Early stopping check. EarlyStopping snapshots the weights whenever
    # validation F1 improves by more than min_delta (its counter resets to 0),
    # so the best epoch is tracked from that same event. This keeps the report
    # consistent with the checkpoint that is loaded after the loop.
    should_stop = early_stopping(val_f1, model, train_f1)
    if early_stopping.counter == 0:
        best_val_f1 = val_f1
        best_epoch = epoch + 1
        best_gap = gap
        print(f'  ⭐ New best! F1: {val_f1:.4f}, Gap: {gap:.4f}')
    
    if should_stop:
        print(f'\n🛑 Early stopping triggered at epoch {epoch + 1}')
        print(f'   Best F1 was at epoch {best_epoch} with gap {best_gap*100:.2f}%')
        break

# Load best model
if early_stopping.best_model is not None:
    model.load_state_dict(early_stopping.best_model)
    print(f'\n✓ Loaded best model from epoch {best_epoch}')
    print(f'  Best Val F1: {best_val_f1:.4f}')
    print(f'  Best Gap: {best_gap*100:.2f}%')
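
The patience rule can be traced on a short list of scores. The sketch below mirrors the improvement test in `EarlyStopping` (mode `'max'` with `min_delta`) without any model; `simulate_early_stopping` is a hypothetical helper and the scores are made up.

```python
def simulate_early_stopping(scores, patience=7, min_delta=0.001):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation scores.

    Epochs are 1-based. stop_epoch is None if patience never runs out.
    """
    best_score, best_epoch, counter = None, None, 0
    for epoch, score in enumerate(scores, start=1):
        if best_score is None or score > best_score + min_delta:
            best_score, best_epoch, counter = score, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return best_epoch, epoch
    return best_epoch, None


val_f1 = [0.70, 0.74, 0.76, 0.7605, 0.75, 0.75]  # made-up scores
print(simulate_early_stopping(val_f1, patience=3))  # (3, 6)
```

Epoch 4 scores slightly higher than epoch 3, but the gain is below `min_delta`, so it does not reset the counter and the epoch 3 weights remain the saved checkpoint.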

## 📈 10. Training Visualization

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

epochs_range = range(1, len(history['train_loss']) + 1)

# Loss
axes[0].plot(epochs_range, history['train_loss'], 'b-', label='Train Loss', marker='o')
axes[0].plot(epochs_range, history['val_loss'], 'r-', label='Val Loss', marker='s')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training & Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(epochs_range, history['train_acc'], 'b-', label='Train Acc', marker='o')
axes[1].plot(epochs_range, history['val_acc'], 'r-', label='Val Acc', marker='s')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training & Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# F1 Score
axes[2].plot(epochs_range, history['train_f1'], 'b-', label='Train F1', marker='o')
axes[2].plot(epochs_range, history['val_f1'], 'r-', label='Val F1', marker='s')
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('F1 Score')
axes[2].set_title('Training & Validation F1 Score')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('training_history.png', dpi=150, bbox_inches='tight')
plt.show()

# Check for overfitting
final_gap = history['train_acc'][-1] - history['val_acc'][-1]
print(f'\n📊 Overfitting Analysis:')
print(f'  Final Train Accuracy: {history["train_acc"][-1]:.4f}')
print(f'  Final Val Accuracy:   {history["val_acc"][-1]:.4f}')
print(f'  Gap (Train - Val):    {final_gap:.4f}')

if final_gap < 0.03:
    print('  ✅ Model is NOT overfitting (gap < 3%)')
elif final_gap < 0.05:
    print('  ⚠️  Slight overfitting (gap 3-5%)')
else:
    print('  ❌ Model is overfitting (gap > 5%)')

## 🧪 11. Test Set Evaluation

In [None]:
# Evaluate on test set
print('=' * 60)
print('🧪 TEST SET EVALUATION')
print('=' * 60)

test_loss, test_acc, test_f1, test_preds, test_labels = evaluate(
    model, test_loader, criterion, device
)

print(f'\n📊 Test Results:')
print(f'  Loss:     {test_loss:.4f}')
print(f'  Accuracy: {test_acc:.4f} ({test_acc*100:.2f}%)')
print(f'  F1 Score: {test_f1:.4f}')

# Classification report
print('\n' + '=' * 60)
print('📋 CLASSIFICATION REPORT')
print('=' * 60)
print(classification_report(test_labels, test_preds, target_names=LABEL_NAMES, digits=4))

# Per-class metrics
precision, recall, f1, support = precision_recall_fscore_support(
    test_labels, test_preds, average=None, labels=[0, 1, 2]
)

print('\n📊 Per-Class Metrics:')
for i, label in enumerate(LABEL_NAMES):
    print(f'  {label.upper():10} - P: {precision[i]:.4f} | R: {recall[i]:.4f} | F1: {f1[i]:.4f} | N: {support[i]}')
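
The weighted F1 reported above is the support-weighted mean of the per-class F1 scores. The sketch below recomputes it from a confusion matrix in plain Python to show how the number is built; `weighted_f1` is a hypothetical helper and the matrix is made up.

```python
def weighted_f1(cm):
    """Support-weighted mean of per-class F1 from a confusion matrix (rows = actual)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for i in range(n):
        tp = cm[i][i]
        support = sum(cm[i])                       # actual members of class i
        predicted = sum(cm[r][i] for r in range(n))  # predictions of class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support / total
    return score


cm = [
    [1, 1],  # actual class 0 (made-up counts)
    [0, 2],  # actual class 1
]
print(round(weighted_f1(cm), 4))  # 0.7333
```

On a balanced test set, as produced by the stratified split here, weighted and macro F1 come out nearly identical.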

## 🔥 12. Confusion Matrix

In [None]:
# Confusion Matrix
cm = confusion_matrix(test_labels, test_preds)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Absolute numbers
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=LABEL_NAMES, yticklabels=LABEL_NAMES, ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix (Counts)')

# Normalized (percentages)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
sns.heatmap(cm_normalized, annot=True, fmt='.1f', cmap='Blues',
            xticklabels=LABEL_NAMES, yticklabels=LABEL_NAMES, ax=axes[1])
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title('Confusion Matrix (Percentages %)')

plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

# Analysis
print('\n📊 Confusion Matrix Analysis:')
for i, label in enumerate(LABEL_NAMES):
    correct = cm[i, i]
    total = cm[i].sum()
    print(f'  {label.upper():10} - Correct: {correct}/{total} ({correct/total*100:.1f}%)')
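
Row-normalizing the confusion matrix puts per-class recall on the diagonal. The sketch below does the same in plain Python and guards against empty rows, which the NumPy division above does not; `per_class_recall` is a hypothetical helper and the counts are made up.

```python
def per_class_recall(cm):
    """Return the diagonal of the row-normalized confusion matrix (rows = actual)."""
    recalls = []
    for i, row in enumerate(cm):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls


cm = [
    [80, 15, 5],   # actual negative (made-up counts)
    [20, 60, 20],  # actual neutral
    [5, 10, 85],   # actual positive
]
print(per_class_recall(cm))  # [0.8, 0.6, 0.85]
```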

## 💾 13. Save Model

In [None]:
# Create the models directory in the Kaggle output folder
OUTPUT_DIR = '/kaggle/working/models'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save model
model_path = f'{OUTPUT_DIR}/indobert_sentiment_3class.pt'
torch.save({
    'model_state_dict': model.state_dict(),
    'config': CONFIG,
    'label_map': LABEL_MAP,
    'label_names': LABEL_NAMES,
    'test_accuracy': test_acc,
    'test_f1': test_f1,
    'history': history,
}, model_path)
print(f'✓ Model saved to: {model_path}')

# Save training history
history_path = f'{OUTPUT_DIR}/training_history.json'
with open(history_path, 'w') as f:
    json.dump(history, f, indent=2)
print(f'✓ History saved to: {history_path}')

# Save tokenizer
tokenizer.save_pretrained(f'{OUTPUT_DIR}/tokenizer')
print(f'✓ Tokenizer saved to: {OUTPUT_DIR}/tokenizer/')

# List saved files
print(f'\n📁 Saved files:')
for f in os.listdir(OUTPUT_DIR):
    filepath = os.path.join(OUTPUT_DIR, f)
    if os.path.isfile(filepath):
        size = os.path.getsize(filepath) / (1024*1024)
        print(f'   • {f} ({size:.2f} MB)')
    else:
        print(f'   • {f}/')

print(f'\n✅ Files saved in /kaggle/working/models/')
print('💡 Download them from the "Output" tab after the notebook finishes')

## 🔮 14. Inference Demo

In [None]:
def predict_sentiment(text, model, tokenizer, device, label_names):
    """Predict the sentiment of a single text."""
    model.eval()
    
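    # Calling the tokenizer directly replaces the legacy encode_plus method
    encoding = tokenizer(
        text,
        add_special_tokens=True,
        max_length=CONFIG['max_length'],  # Keep in sync with the training max_length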
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    
    with torch.no_grad():
        logits = model(input_ids, attention_mask)
        probs = torch.softmax(logits, dim=1)
        pred = torch.argmax(probs, dim=1).item()
    
    return {
        'sentiment': label_names[pred],
        'confidence': probs[0][pred].item(),
        'probabilities': {
            label_names[i]: probs[0][i].item() 
            for i in range(len(label_names))
        }
    }

# Test with example reviews (kept in Indonesian, the language the model expects)
test_reviews = [
    "Aplikasi gojek sangat membantu, driver ramah dan cepat",
    "Driver nya lama banget, udah nunggu 1 jam gak datang datang",
    "Biasa aja sih aplikasinya",
    "Pelayanan buruk, driver tidak sopan, tidak akan pakai lagi",
    "Mantap, makanan sampai dengan selamat dan masih hangat",
    "Ongkirnya agak mahal tapi ya lumayan lah"
]

print('=' * 60)
print('🔮 INFERENCE DEMO')
print('=' * 60)

for review in test_reviews:
    result = predict_sentiment(review, model, tokenizer, device, LABEL_NAMES)
    emoji = {'negative': '😠', 'neutral': '😐', 'positive': '😊'}[result['sentiment']]
    print(f'\n📝 "{review[:50]}..."' if len(review) > 50 else f'\n📝 "{review}"')
    print(f'   {emoji} Sentiment: {result["sentiment"].upper()} (Confidence: {result["confidence"]*100:.1f}%)')
    print(f'   Probabilities: Neg={result["probabilities"]["negative"]*100:.1f}% | '
          f'Neu={result["probabilities"]["neutral"]*100:.1f}% | '
          f'Pos={result["probabilities"]["positive"]*100:.1f}%')
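The confidence returned by `predict_sentiment` is the softmax probability of the highest-scoring class. The plain-Python sketch below shows that step in isolation, without PyTorch. The helper names `softmax_probs` and `summarize_logits` and the logits in the demo are hypothetical and are not taken from the trained model.

```python
import math


def softmax_probs(logits):
    """Convert raw scores to probabilities (max is subtracted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_logits(logits, label_names):
    """Build the same result layout as predict_sentiment from a list of logits."""
    probs = softmax_probs(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return {
        'sentiment': label_names[pred],
        'confidence': probs[pred],
        'probabilities': dict(zip(label_names, probs)),
    }


# Hypothetical logits, for illustration only
demo_result = summarize_logits([2.0, 0.5, -1.0], ['negative', 'neutral', 'positive'])
print(demo_result['sentiment'], f"{demo_result['confidence']*100:.1f}%")
```

Because the probabilities always sum to one, a three-class confidence close to 33% means the model barely prefers one class, which is worth keeping in mind for short or ambiguous reviews.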

## 📊 15. Final Summary

In [None]:
# Final summary
print('=' * 60)
print('📊 FINAL SUMMARY')
print('=' * 60)

# Calculate metrics
final_gap = history['train_acc'][-1] - history['val_acc'][-1]
best_val_acc = max(history['val_acc'])
min_gap = min(history['gap'])
avg_gap = sum(history['gap']) / len(history['gap'])

print(f'''
🎯 MODEL PERFORMANCE:
   • Test Accuracy: {test_acc*100:.2f}%
   • Test F1 Score: {test_f1*100:.2f}%
   • Best Validation Accuracy: {best_val_acc*100:.2f}%

📈 OVERFITTING CHECK:
   • Final Train-Val Gap: {final_gap*100:.2f}%
   • Best Gap: {min_gap*100:.2f}%
   • Average Gap: {avg_gap*100:.2f}%
   • Status: {"✅ Good Generalization" if final_gap < 0.05 else "⚠️ Check Gap" if final_gap < 0.10 else "❌ Overfitting"}

⚙️ ULTRA ANTI-OVERFITTING TECHNIQUES:
   • Layer Freezing: {CONFIG['freeze_layers']}/12 layers frozen
   • Dropout Rate: {CONFIG['dropout_rate']}
   • Attention Dropout: {CONFIG['attention_dropout']}
   • Label Smoothing: {CONFIG['label_smoothing']}
   • Weight Decay: {CONFIG['weight_decay']}
   • R-Drop Alpha: {CONFIG['rdrop_alpha']}
   • Learning Rate: {CONFIG['learning_rate']} (very small)
   • Gradient Clipping: {CONFIG['max_grad_norm']}
   • Early Stopping: patience={CONFIG['early_stopping_patience']}
   • Data Augmentation: word dropout, swap, shuffle

💾 SAVED FILES:
   • Model: /kaggle/working/models/indobert_sentiment_3class.pt
   • Tokenizer: /kaggle/working/models/tokenizer/
   • History: /kaggle/working/models/training_history.json
''')

# Recommendation based on results
if test_acc >= 0.75 and final_gap < 0.05:
    print('🎉 EXCELLENT! Model has good accuracy and generalization!')
elif test_acc >= 0.75:
    print('⚠️ Good accuracy but check overfitting. Consider more regularization.')
elif final_gap < 0.05:
    print('✅ Good generalization but accuracy could be improved. Try unfreezing more layers.')
else:
    print('❌ Both accuracy and generalization need improvement.')

print('=' * 60)
print('✅ Training completed!')
print('=' * 60)
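The status line and the recommendation in the summary cell rest on the same thresholds: a train-validation gap below 0.05 counts as good, a gap below 0.10 as a warning, and 0.75 is the accuracy bar. As a sketch, those rules could be collected in two small functions so the thresholds live in one place. The names `gap_status` and `recommend` are hypothetical and do not exist in the notebook.

```python
def gap_status(gap, good=0.05, warn=0.10):
    """Map a train-validation accuracy gap to the status labels used in the summary."""
    if gap < good:
        return 'Good Generalization'
    if gap < warn:
        return 'Check Gap'
    return 'Overfitting'


def recommend(test_acc, gap, acc_target=0.75, good=0.05):
    """Return a short recommendation following the four summary branches."""
    accurate = test_acc >= acc_target
    generalizes = gap < good
    if accurate and generalizes:
        return 'Good accuracy and generalization'
    if accurate:
        return 'Good accuracy, consider more regularization'
    if generalizes:
        return 'Good generalization, try unfreezing more layers'
    return 'Both accuracy and generalization need improvement'


# Hypothetical values, for illustration only
print(gap_status(0.03), '|', recommend(0.78, 0.03))
print(gap_status(0.12), '|', recommend(0.70, 0.12))
```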

## 📥 16. Copy to Kaggle Output (Optional)

In [None]:
# Zip the models folder for easier download
import shutil

zip_path = '/kaggle/working/model_sentiment_3class'
shutil.make_archive(zip_path, 'zip', '/kaggle/working/models')

print('✓ Model zipped to: /kaggle/working/model_sentiment_3class.zip')
print('\n📥 How to download:')
print('   1. After the notebook finishes, click the "Output" tab on the right')
print('   2. Download the file model_sentiment_3class.zip')
print('   3. Extract it to get the model, tokenizer, and history')
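Before downloading, it can be worth confirming that the archive is readable and contains the expected files. The sketch below uses only the standard library and builds a small stand-in folder in a temporary directory, so the file names and contents are placeholders and not the real model artifacts. The helper `list_archive_files` is hypothetical.

```python
import os
import shutil
import tempfile
import zipfile


def list_archive_files(zip_file):
    """Return (sorted file names, first corrupt member or None) for a zip archive."""
    with zipfile.ZipFile(zip_file) as zf:
        corrupt = zf.testzip()  # None when every member passes its CRC check
        files = sorted(n for n in zf.namelist() if not n.endswith('/'))
    return files, corrupt


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for /kaggle/working/models with placeholder files
    models_dir = os.path.join(tmp_dir, 'models')
    os.makedirs(os.path.join(models_dir, 'tokenizer'))
    with open(os.path.join(models_dir, 'training_history.json'), 'w') as f:
        f.write('{}')
    with open(os.path.join(models_dir, 'tokenizer', 'vocab.txt'), 'w') as f:
        f.write('[PAD]\n')

    # Same call pattern as the zip cell: base name without the .zip extension
    archive = shutil.make_archive(
        os.path.join(tmp_dir, 'model_sentiment_3class'), 'zip', models_dir
    )
    archived_files, corrupt_member = list_archive_files(archive)

print(archived_files, corrupt_member)
```

On Kaggle the same helper could be pointed at `/kaggle/working/model_sentiment_3class.zip` after the zip cell has run.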