# 🤖 3. Sentiment Analysis with IndoBERT

This notebook predicts sentiment with the **IndoBERT** model (`w11wo/indobert-sentiment-classification`).

**Model:** Pre-trained IndoBERT for Indonesian sentiment analysis  
**Output:** 3 sentiment classes (`negatif`, `netral`, `positif`) plus per-class probabilities

**Note:** The first run downloads the model (~400MB), so make sure your internet connection is stable.

In [None]:
# Import libraries
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from pathlib import Path
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

# Check GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"🖥️ Device: {device}")
print(f"🔢 PyTorch version: {torch.__version__}")

if device == 'cuda':
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ Using CPU (inference will be slower)")

# Create output folders
Path('data/processed').mkdir(parents=True, exist_ok=True)
Path('models').mkdir(parents=True, exist_ok=True)

## Load Data

Load the cleaned data produced by the previous notebook.
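The later cells read the columns `clean_text`, `app`, `content`, `score`, and `label` from this file. A minimal sketch of an early check, assuming those are the only columns needed (the helper name `missing_columns` is hypothetical and not part of the notebook):

```python
# Columns the later cells in this notebook read from the cleaned CSV.
REQUIRED_COLUMNS = ["app", "content", "score", "label", "clean_text"]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Example with a plain list; in the notebook, pass df.columns instead.
print(missing_columns(["app", "content", "score"]))  # ['label', 'clean_text']
```

An empty list means the file has everything the downstream cells expect.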

In [None]:
# Load data
data_file = 'data/interim/all_apps_clean.csv'

try:
    df = pd.read_csv(data_file)
    print(f"✅ Loaded data: {len(df)} reviews")
    print(f"📊 Columns: {list(df.columns)}")
    print(f"\n📋 Preview:")
    print(df.head())
except FileNotFoundError:
    print(f"❌ File not found: {data_file}")
    print("⚠️ Run notebook 2_preprocessing.ipynb first!")
    raise

## Load Model IndoBERT

Download and load the model from Hugging Face. **The first run downloads ~400MB**.
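Hugging Face sequence-classification checkpoints usually ship their own index-to-label mapping in `model.config.id2label`, and the class order is not guaranteed to be negative, neutral, positive. The sketch below shows one way to derive the Indonesian label names from such a mapping instead of hard-coding the order. The English names in `EN_TO_ID` and the sample `id2label` value are assumptions for illustration, not values read from this checkpoint, and `build_label_map` is a hypothetical helper.

```python
# Assumed English-to-Indonesian label names; adjust to what the checkpoint reports.
EN_TO_ID = {"negative": "negatif", "neutral": "netral", "positive": "positif"}


def build_label_map(id2label, translation=EN_TO_ID):
    """Map class index -> Indonesian label, following the model's own order."""
    label_map = {}
    for idx, name in id2label.items():
        key = str(name).strip().lower()
        if key not in translation:
            raise ValueError(f"Unexpected label from model config: {name!r}")
        label_map[int(idx)] = translation[key]
    return label_map


# Hypothetical config value; in the notebook, pass model.config.id2label.
example_id2label = {0: "positive", 1: "neutral", 2: "negative"}
print(build_label_map(example_id2label))
```

If the derived map differs from the hard-coded `LABEL_MAP`, the `p_negatif`, `p_netral`, and `p_positif` columns filled in `predict_batch` need the same reordering.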

In [None]:
# Model configuration
MODEL_NAME = 'w11wo/indobert-sentiment-classification'
CACHE_DIR = 'models'

print(f"📥 Loading model: {MODEL_NAME}")
print("⏳ Please wait... (the first run downloads the model)")
print()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    cache_dir=CACHE_DIR
)
print("✅ Tokenizer loaded")

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    cache_dir=CACHE_DIR
)
model.to(device)
model.eval()
print(f"✅ Model loaded and moved to {device}")

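# Label mapping (assumed class order; compare with the checkpoint's own mapping printed below)
LABEL_MAP = {
    0: 'negatif',
    1: 'netral',
    2: 'positif'
}
print(f"\n🏷️ Label mapping: {LABEL_MAP}")
print(f"🔎 Model config id2label: {model.config.id2label}")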

## Test Prediction

Try the model on a few sample texts.
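`predict_text` turns the model's raw scores (logits) into probabilities with a softmax and then picks the class with the highest probability. A minimal pure-Python sketch of that step, using made-up logits rather than real model output:

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for (negatif, netral, positif); not real model output.
labels = ["negatif", "netral", "positif"]
probs = softmax([-1.0, 0.5, 2.0])
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], [round(p, 4) for p in probs])
```

The largest logit always gets the largest probability, so the argmax over probabilities equals the argmax over logits; the softmax is only needed to report confidence values.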

In [None]:
def predict_text(text, max_length=256):
    """
    Predict the sentiment of a single text.
    
    Returns:
        dict: {'label': str, 'probs': dict}
    """
    # Tokenize
    inputs = tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        max_length=max_length,
        padding=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Predict
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
    
    # Get prediction
    pred_idx = torch.argmax(probs, dim=-1).item()
    pred_label = LABEL_MAP[pred_idx]
    
    # Get probabilities
    probs_dict = {
        LABEL_MAP[i]: float(probs[0][i].cpu())
        for i in range(len(LABEL_MAP))
    }
    
    return {
        'label': pred_label,
        'probs': probs_dict
    }

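# Try a few examples (inputs stay in Indonesian because the model expects Indonesian text)
print("🧪 TEST PREDICTION")
print("=" * 80)

test_samples = [
    "aplikasi sangat bagus dan membantu",   # "the app is very good and helpful"
    "aplikasi sering error dan lambat",     # "the app often errors and is slow"
    "biasa saja tidak ada yang istimewa"    # "just average, nothing special"
]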

for text in test_samples:
    result = predict_text(text)
    print(f"\nText: {text}")
    print(f"Prediction: {result['label'].upper()}")
    print(f"Probabilities:")
    for label, prob in result['probs'].items():
        print(f"  {label}: {prob:.4f} ({prob*100:.2f}%)")
    print("-" * 80)

## Batch Prediction

Predict sentiment for the whole dataset. **Estimated time:**
- CPU: 30-45 minutes (for 5000+ reviews)
- GPU: 3-5 minutes

**Tip:** If this takes too long, test on a small sample first by switching `df_to_predict` to `df.head(100).copy()` in the configuration cell.
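`predict_batch` walks the list of texts in fixed-size slices, so the last batch may be smaller than `batch_size`. A small sketch of that slicing on dummy data (the helper name `iter_batches` is hypothetical):

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


dummy_texts = ["a", "b", "c", "d", "e"]
batches = list(iter_batches(dummy_texts, batch_size=2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

The number of forward passes is the item count divided by the batch size, rounded up; for example, 5,000 reviews at batch size 16 take 313 batches.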

In [None]:
def predict_batch(texts, batch_size=16, max_length=256):
    """
    Batch prediction with a progress bar.
    
    Args:
        texts: List of texts
        batch_size: Batch size (reduce if you run out of memory)
        max_length: Max sequence length
    
    Returns:
        DataFrame with predictions
    """
    results = {
        'pred_label': [],
        'p_negatif': [],
        'p_netral': [],
        'p_positif': []
    }
    
    # Process in batches
    for i in tqdm(range(0, len(texts), batch_size), desc="Predicting"):
        batch_texts = texts[i:i+batch_size]
        
        # Tokenize
        inputs = tokenizer(
            batch_texts,
            return_tensors='pt',
            truncation=True,
            max_length=max_length,
            padding=True
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits
            probs = torch.nn.functional.softmax(logits, dim=-1)
        
        # Store results
        pred_indices = torch.argmax(probs, dim=-1).cpu().numpy()
        probs_array = probs.cpu().numpy()
        
        for pred_idx, prob_array in zip(pred_indices, probs_array):
            results['pred_label'].append(LABEL_MAP[pred_idx])
            results['p_negatif'].append(float(prob_array[0]))
            results['p_netral'].append(float(prob_array[1]))
            results['p_positif'].append(float(prob_array[2]))
    
    return pd.DataFrame(results)

# Configuration
BATCH_SIZE = 16  # Reduce if you run out of memory (e.g. 8 or 4)
MAX_LENGTH = 256

print(f"⚙️ Configuration:")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Max length: {MAX_LENGTH}")
print(f"   Total reviews: {len(df)}")
print()

# OPTIONAL: test on a small sample first.
# Uncomment the two lines below AND comment out the full-dataset line,
# otherwise df_to_predict is overwritten with the full data.
# df_to_predict = df.head(100).copy()
# print("⚠️ TEST MODE: Using first 100 reviews only")

# For the full dataset, use this:
df_to_predict = df.copy()

print("🚀 Starting prediction...")
print(f"⏱️ Estimated time: {'3-5 minutes (GPU)' if device == 'cuda' else '30-45 minutes (CPU)'}")
print()

In [None]:
# Run prediction
texts = df_to_predict['clean_text'].fillna('').tolist()

predictions_df = predict_batch(
    texts, 
    batch_size=BATCH_SIZE,
    max_length=MAX_LENGTH
)

print("\n✅ Prediction complete!")

## Merge Results with the Original Data

In [None]:
# Merge predictions with the original data
df_result = pd.concat([
    df_to_predict.reset_index(drop=True),
    predictions_df
], axis=1)

print(f"✅ Combined data shape: {df_result.shape}")
print(f"\n📋 Columns: {list(df_result.columns)}")
print(f"\n🔍 Preview:")
print(df_result[['app', 'content', 'score', 'label', 'pred_label', 'p_positif', 'p_netral', 'p_negatif']].head(10))

## Analyze Prediction Results

Compare the automatic label (derived from the rating) with the model prediction.
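Agreement here means the share of rows where the rating-derived `label` equals the model's `pred_label`. Counting the (label, prediction) pairs as well shows where the two disagree. A minimal sketch on toy values, not real results:

```python
from collections import Counter

# Toy values for illustration only; not taken from the dataset.
labels = ["positif", "negatif", "netral", "positif", "negatif"]
preds = ["positif", "negatif", "positif", "positif", "netral"]

# Count each (label, prediction) combination.
pairs = Counter(zip(labels, preds))
agreement = sum(n for (lab, pred), n in pairs.items() if lab == pred)
agreement_pct = agreement / len(labels) * 100

print(f"Agreement: {agreement}/{len(labels)} ({agreement_pct:.1f}%)")
for (lab, pred), n in sorted(pairs.items()):
    print(f"label={lab:<8} pred={pred:<8} count={n}")
```

With pandas, `pd.crosstab(df_result['label'], df_result['pred_label'])` gives the same counts as a table.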

In [None]:
# Prediction distribution
print("📊 PREDICTION DISTRIBUTION")
print("=" * 60)

pred_counts = df_result['pred_label'].value_counts()
for label, count in pred_counts.items():
    percentage = count / len(df_result) * 100
    print(f"{label}: {count} ({percentage:.1f}%)")

# Distribution per app
print("\n📱 DISTRIBUTION PER APP")
print("=" * 60)

for app in df_result['app'].unique():
    app_data = df_result[df_result['app'] == app]
    print(f"\n{app.upper()}:")
    for label, count in app_data['pred_label'].value_counts().items():
        percentage = count / len(app_data) * 100
        print(f"  {label}: {count} ({percentage:.1f}%)")

# Compare label vs prediction
print("\n🔄 AGREEMENT: Label (Rating) vs Prediction")
print("=" * 60)

agreement = (df_result['label'] == df_result['pred_label']).sum()
total = len(df_result)
agreement_pct = agreement / total * 100

print(f"Agreement: {agreement}/{total} ({agreement_pct:.2f}%)")
print(f"Disagreement: {total-agreement}/{total} ({100-agreement_pct:.2f}%)")

## Prediction Examples

Look at sample predictions for each category.

In [None]:
print("📋 SAMPLE PREDICTIONS PER CATEGORY")
print("=" * 80)

for pred_label in ['positif', 'netral', 'negatif']:
    print(f"\n🏷️ PREDICTION: {pred_label.upper()}")
    print("-" * 80)
    
    samples = df_result[df_result['pred_label'] == pred_label].head(3)
    for idx, row in samples.iterrows():
        print(f"App: {row['app']} | Rating: {row['score']} | Label: {row['label']}")
        # str() guards against missing (NaN) review text
        print(f"Text: {str(row['content'])[:100]}...")
        print(f"Confidence: {row[f'p_{pred_label}']:.2%}")
        print()

## Save Results

Save the data with predictions to `data/processed/`.

In [None]:
# Save results
output_file = 'data/processed/all_apps_with_predictions.csv'
df_result.to_csv(output_file, index=False, encoding='utf-8')

print(f"💾 Saved results to: {output_file}")
print(f"📊 Total rows: {len(df_result)}")
print(f"📂 File size: {Path(output_file).stat().st_size / 1024 / 1024:.2f} MB")

## 🎉 Done!

Sentiment prediction completed successfully!

**Output:**
- File: `data/processed/all_apps_with_predictions.csv`
- Total: `len(df_result)` reviews (printed by the save cell above)
- Prediction columns: `pred_label`, `p_negatif`, `p_netral`, `p_positif`

**Next steps:**
- Run notebook `4_evaluation.ipynb` for in-depth evaluation and visualization

**Summary:**
- ✅ Model: IndoBERT sentiment classification
- ✅ Predictions: `len(df_result)` reviews
- ✅ Device: value of `device` (`cuda` or `cpu`)
- ✅ Distribution: see `pred_counts` in the analysis cell