# 🤖 3. Sentiment Analysis with IndoBERT

This notebook predicts review sentiment with the **IndoBERT** model (`mdhugol/indonesia-bert-sentiment-classification`).

**Model:** Pre-trained IndoBERT for Indonesian sentiment analysis  
**Output:** 3 sentiment classes (positive, neutral, negative) plus class probabilities

**Note:** The first run downloads the model (about 500 MB). Make sure your internet connection is stable.

In [1]:
# Import libraries
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from pathlib import Path
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

# Check GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"🖥️ Device: {device}")
print(f"🔢 PyTorch version: {torch.__version__}")

if device == 'cuda':
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ Using CPU (inference will be slower)")

# Create output folders
Path('data/processed').mkdir(parents=True, exist_ok=True)
Path('models').mkdir(parents=True, exist_ok=True)

🖥️ Device: cuda
🔢 PyTorch version: 2.7.1+cu118
✅ GPU available: NVIDIA GeForce RTX 2050


## Load Data

Load the cleaned data produced by the previous notebook.

In [2]:
# Load data
data_file = 'data/interim/all_apps_clean.csv'

try:
    df = pd.read_csv(data_file)
    print(f"✅ Loaded data: {len(df)} reviews")
    print(f"📊 Columns: {list(df.columns)}")
    print(f"\n📋 Preview:")
    print(df.head())
except FileNotFoundError:
    print(f"❌ File not found: {data_file}")
    print("⚠️ Run notebook 2_preprocessing.ipynb first!")
    raise

✅ Loaded data: 2767 reviews
📊 Columns: ['app', 'content', 'score', 'at', 'thumbsUpCount', 'clean_text', 'label']

📋 Preview:
     app                                            content  score  \
0  gojek  hati hati aplikasi ini sering error , jgn mau ...      1   
1  gojek                                    sangat membantu      5   
2  gojek  sangat membantu keperluan masyarakat untuk har...      5   
3  gojek  transfer ke bank lain memakai aplikasi ini tap...      1   
4  gojek                                  apk yang, bagusss      5   

                    at  thumbsUpCount  \
0  2025-10-15 19:02:10              0   
1  2025-10-15 19:01:39              0   
2  2025-10-15 18:43:17              0   
3  2025-10-15 18:37:34              0   
4  2025-10-15 18:30:43              0   

                                          clean_text    label  
0  hati hati aplikasi ini sering error , jangan m...  negatif  
1                                    sangat membantu  positif  
2  sang

## Load the IndoBERT Model

Download and load the model from Hugging Face. **The first run downloads about 500 MB.**

In [None]:
# Model configuration
MODEL_NAME = 'mdhugol/indonesia-bert-sentiment-classification'
CACHE_DIR = 'models'

print(f"📥 Loading model: {MODEL_NAME}")
print("⏳ Please wait... (the first run downloads the model)")
print()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    cache_dir=CACHE_DIR
)
print("✅ Tokenizer loaded")

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    cache_dir=CACHE_DIR
)
model.to(device)
model.eval()
print(f"✅ Model loaded and moved to {device}")

# Label mapping. IMPORTANT: this is the mapping the mdhugol model uses.
# Per the model documentation: LABEL_0=positive, LABEL_1=neutral, LABEL_2=negative
LABEL_MAP = {
    0: 'positif',    # LABEL_0 = positive
    1: 'netral',     # LABEL_1 = neutral
    2: 'negatif'     # LABEL_2 = negative
}
print(f"\n🏷️ Label mapping (per model documentation): {LABEL_MAP}")
print("   ⚠️ LABEL_0=positif, LABEL_1=netral, LABEL_2=negatif")

📥 Loading model: mdhugol/indonesia-bert-sentiment-classification
⏳ Please wait... (the first run downloads the model)



tokenizer_config.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

✅ Tokenizer loaded


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


pytorch_model.bin:   0%|          | 0.00/498M [00:00<?, ?B/s]

✅ Model loaded and moved to cuda

🏷️ Label mapping: {0: 'positif', 1: 'netral', 2: 'negatif'}



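A quick structural check on the mapping can catch simple mistakes before any prediction runs. The sketch below is a minimal illustration: `id2label` is a hypothetical stand-in for `model.config.id2label` (assumed here to have integer keys), and `check_label_map` is a helper name introduced only for this example. It confirms that the mapping covers exactly the model's class indices and that no two indices share a name. It cannot tell whether the names themselves are assigned to the right indices.

```python
# Hypothetical stand-in for `model.config.id2label`; the real dict comes from the loaded model.
id2label = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}

# Mapping used in this notebook (taken from the model documentation).
LABEL_MAP = {0: "positif", 1: "netral", 2: "negatif"}


def check_label_map(id2label, label_map):
    """Raise ValueError unless label_map covers exactly the model's class indices."""
    model_ids = set(id2label)
    mapped_ids = set(label_map)
    if model_ids != mapped_ids:
        raise ValueError(
            f"Index mismatch: model has {sorted(model_ids)}, map has {sorted(mapped_ids)}"
        )
    if len(set(label_map.values())) != len(label_map):
        raise ValueError("Two class indices map to the same label name")
    return True


print(check_label_map(id2label, LABEL_MAP))
```

Whether index 0 really means positive still has to be confirmed with example sentences.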

## Test Predictions

Try the model on a few sample texts.

In [None]:
def predict_text(text, max_length=256):
    """
    Predict the sentiment of a single text.
    
    Returns:
        dict: {'label': str, 'probs': dict, 'confidence': float}
    """
    # Tokenize
    inputs = tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        max_length=max_length,
        padding=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Predict
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
    
    # Get prediction
    pred_idx = torch.argmax(probs, dim=-1).item()
    pred_label = LABEL_MAP[pred_idx]
    
    # Get probabilities for all classes
    probs_dict = {
        LABEL_MAP[i]: float(probs[0][i].cpu())
        for i in range(len(LABEL_MAP))
    }
    
    # Confidence is the probability of predicted class
    confidence = float(probs[0][pred_idx].cpu())
    
    return {
        'label': pred_label,
        'probs': probs_dict,
        'confidence': confidence
    }

# Try a few samples
print("🧪 PREDICTION TEST")
print("=" * 80)

test_samples = [
    "aplikasi sangat bagus dan membantu sekali, pelayanan cepat",
    "aplikasi sering error dan lambat, sangat mengecewakan",
    "biasa saja tidak ada yang istimewa",
    "driver ramah tapi aplikasi kadang lemot",
    "mantap banget! recommended!"
]

for text in test_samples:
    result = predict_text(text)
    print(f"\nText: {text}")
    print(f"Prediction: {result['label'].upper()} (confidence: {result['confidence']:.2%})")
    print(f"Probabilities:")
    for label, prob in sorted(result['probs'].items(), key=lambda x: x[1], reverse=True):
        bar = '█' * int(prob * 50)
        print(f"  {label:8s}: {prob:.4f} ({prob*100:5.2f}%) {bar}")
    print("-" * 80)

🧪 PREDICTION TEST

Text: aplikasi sangat bagus dan membantu
Prediction: POSITIF
Probabilities:
  positif: 0.9942 (99.42%)
  netral: 0.0041 (0.41%)
  negatif: 0.0017 (0.17%)
--------------------------------------------------------------------------------

Text: aplikasi sering error dan lambat
Prediction: NEGATIF
Probabilities:
  positif: 0.0010 (0.10%)
  netral: 0.0046 (0.46%)
  negatif: 0.9944 (99.44%)
--------------------------------------------------------------------------------

Text: biasa saja tidak ada yang istimewa
Prediction: NEGATIF
Probabilities:
  positif: 0.0035 (0.35%)
  netral: 0.0012 (0.12%)
  negatif: 0.9952 (99.52%)
--------------------------------------------------------------------------------
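
The probabilities above come from applying softmax to the model's raw logits and taking the largest value as the confidence. The following is a minimal pure-Python sketch of that step, using made-up logits rather than real model output. The `softmax` helper is written out here for illustration; the notebook itself uses `torch.nn.functional.softmax`.

```python
import math

LABEL_MAP = {0: "positif", 1: "netral", 2: "negatif"}


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for illustration; real values come from `outputs.logits`.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# The predicted class is the index with the highest probability.
pred_idx = max(range(len(probs)), key=probs.__getitem__)

print(LABEL_MAP[pred_idx], round(probs[pred_idx], 4))
```

Because softmax preserves the ordering of the logits, the predicted class is the same whether argmax is taken over logits or probabilities. Only the confidence value needs the normalization.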


## Validate Label Mapping

Confirm that the label mapping is correct using unambiguous test cases.

In [None]:
# Validate with clear-cut test cases
print("🔍 LABEL MAPPING VALIDATION")
print("=" * 80)
print("Testing with clearly positive/negative sentences to confirm the mapping is correct")
print()

validation_tests = [
    ("Sangat bagus sekali! Mantap! Recommended!", "positif"),
    ("Jelek banget! Buruk sekali! Mengecewakan!", "negatif"),
    ("Biasa saja, tidak istimewa", "netral"),
]

all_correct = True
for text, expected in validation_tests:
    result = predict_text(text)
    is_correct = result['label'] == expected
    all_correct = all_correct and is_correct
    
    status = "✅" if is_correct else "❌"
    print(f"{status} Text: {text}")
    print(f"   Expected: {expected} | Got: {result['label']} | Confidence: {result['confidence']:.2%}")
    print()

if all_correct:
    print("✅ VALIDATION PASSED! The label mapping is correct.")
else:
    print("⚠️ WARNING: Some predictions did not match. This can be normal for neutral cases.")
print("=" * 80)

## Batch Prediction

Run the model over every review in batches.

In [None]:
def predict_batch(texts, batch_size=16, max_length=256):
    """
    Predict sentiment in batches, with a progress bar.
    
    Args:
        texts: List of texts
        batch_size: Batch size (reduce if you run out of memory)
        max_length: Max sequence length
    
    Returns:
        DataFrame with predictions
    """
    results = {
        'pred_label': [],
        'pred_confidence': [],
        'p_positif': [],
        'p_netral': [],
        'p_negatif': []
    }
    
    # Process in batches
    for i in tqdm(range(0, len(texts), batch_size), desc="Predicting"):
        batch_texts = texts[i:i+batch_size]
        
        # Tokenize
        inputs = tokenizer(
            batch_texts,
            return_tensors='pt',
            truncation=True,
            max_length=max_length,
            padding=True
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits
            probs = torch.nn.functional.softmax(logits, dim=-1)
        
        # Store results
        pred_indices = torch.argmax(probs, dim=-1).cpu().numpy()
        probs_array = probs.cpu().numpy()
        
        for pred_idx, prob_array in zip(pred_indices, probs_array):
            pred_label = LABEL_MAP[pred_idx]
            results['pred_label'].append(pred_label)
            results['pred_confidence'].append(float(prob_array[pred_idx]))
            
            # Store probabilities in correct order
            results['p_positif'].append(float(prob_array[0]))  # LABEL_0 = positif
            results['p_netral'].append(float(prob_array[1]))   # LABEL_1 = netral
            results['p_negatif'].append(float(prob_array[2]))  # LABEL_2 = negatif
    
    return pd.DataFrame(results)

# Configuration
BATCH_SIZE = 16  # Reduce if you run out of memory (e.g. 8 or 4)
MAX_LENGTH = 256

print(f"⚙️ Configuration:")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Max length: {MAX_LENGTH}")
print(f"   Total reviews: {len(df)}")
print()

# OPTIONAL: try a small sample first
# Uncomment the lines below to test on the first 100 rows
# df_to_predict = df.head(100).copy()
# print("⚠️ TEST MODE: Using first 100 reviews only")

# For the full dataset, use this:
df_to_predict = df.copy()

print("🚀 Starting prediction...")
print(f"⏱️ Estimated time: {'3-5 minutes (GPU)' if device == 'cuda' else '30-45 minutes (CPU)'}")
print()

# Prepare texts
texts = df_to_predict['clean_text'].fillna('').tolist()
print(f"📝 Prepared {len(texts)} texts for prediction")

⚙️ Configuration:
   Batch size: 16
   Max length: 256
   Total reviews: 2767

🚀 Starting prediction...
⏱️ Estimated time: 3-5 minutes (GPU)

📝 Prepared 2767 texts for prediction

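The tokenizer call in `predict_batch` passes `truncation=True`, `max_length`, and `padding=True`, so long reviews are cut off and short ones are padded until every sequence in a batch has the same length. The sketch below is a simplified illustration of that idea on made-up token ids. It is not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, `pad_id=0` is an assumption, and a real tokenizer also adds special tokens such as `[CLS]` and `[SEP]`.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# Made-up token ids for three "reviews" of different lengths.
batch = [[5, 6, 7], [8], [9, 10, 11, 12, 13, 14]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=4)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Padding only to the longest sequence in each batch, rather than always to `max_length`, keeps batches of short reviews small and cheap to process.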

In [8]:
# Run prediction
predictions_df = predict_batch(
    texts, 
    batch_size=BATCH_SIZE,
    max_length=MAX_LENGTH
)

print("\n✅ Prediction complete!")

Predicting:   0%|          | 0/173 [00:00<?, ?it/s]


✅ Prediction complete!

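The progress bar total of 173 follows from the batching arithmetic: 2767 reviews split into batches of 16 gives 172 full batches plus one partial batch. The sketch below reproduces that count with placeholder texts. `iter_batches` is a hypothetical helper that mirrors the `range(0, len(texts), batch_size)` loop in `predict_batch`.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


n_reviews, batch_size = 2767, 16
texts = [f"review {i}" for i in range(n_reviews)]  # placeholder texts
batches = list(iter_batches(texts, batch_size))

# Number of batches, the same number via ceil, and the size of the last batch.
print(len(batches), math.ceil(n_reviews / batch_size), len(batches[-1]))
```

Halving the batch size to cope with memory limits doubles the number of batches but does not change the predictions.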

## Combine Results with the Original Data

In [None]:
# Combine predictions with the original data
df_result = pd.concat([
    df_to_predict.reset_index(drop=True),
    predictions_df
], axis=1)

print(f"✅ Combined data shape: {df_result.shape}")
print(f"\n📋 Columns: {list(df_result.columns)}")

# Validate the prediction results
print(f"\n🔍 RESULT VALIDATION:")
print(f"   Total predictions: {len(df_result)}")
print(f"   Unique predicted labels: {df_result['pred_label'].unique()}")
print(f"   Confidence range: {df_result['pred_confidence'].min():.4f} - {df_result['pred_confidence'].max():.4f}")
print(f"   Mean confidence: {df_result['pred_confidence'].mean():.4f}")

# Display only the columns that exist
print(f"\n📋 Preview (Top 10):")
wanted_cols = ['app', 'content', 'score', 'label', 'pred_label', 'pred_confidence', 'p_positif', 'p_netral', 'p_negatif']
display_cols = [col for col in wanted_cols if col in df_result.columns]

print(df_result[display_cols].head(10))

# Check the prediction distribution
print(f"\n📊 PREDICTION DISTRIBUTION:")
for label in ['positif', 'netral', 'negatif']:
    count = len(df_result[df_result['pred_label'] == label])
    pct = count / len(df_result) * 100
    avg_conf = df_result[df_result['pred_label'] == label]['pred_confidence'].mean()
    print(f"   {label:8s}: {count:4d} ({pct:5.2f}%) | Avg confidence: {avg_conf:.4f}")

✅ Combined data shape: (2767, 11)

📋 Columns: ['app', 'content', 'score', 'at', 'thumbsUpCount', 'clean_text', 'label', 'pred_label', 'p_negatif', 'p_netral', 'p_positif']

🔍 Preview:
     app                                            content  score    label  \
0  gojek  hati hati aplikasi ini sering error , jgn mau ...      1  negatif   
1  gojek                                    sangat membantu      5  positif   
2  gojek  sangat membantu keperluan masyarakat untuk har...      5  positif   
3  gojek  transfer ke bank lain memakai aplikasi ini tap...      1  negatif   
4  gojek                                  apk yang, bagusss      5  positif   
5  gojek                                  mudah cepat bagus      5  positif   
6  gojek                                            mantapp      5  positif   
7  gojek                                             mantap      5  positif   
8  gojek  kenapa driver tidak segera mengabari apabila s...      1  negatif   
9  gojek  parah ma

## Analyze Prediction Results

Compare the automatic labels (derived from ratings) with the model's predictions.

In [10]:
# Prediction distribution
print("📊 PREDICTION DISTRIBUTION")
print("=" * 60)

pred_counts = df_result['pred_label'].value_counts()
for label, count in pred_counts.items():
    percentage = count / len(df_result) * 100
    print(f"{label}: {count} ({percentage:.1f}%)")

# Distribution per app
print("\n📱 DISTRIBUTION PER APP")
print("=" * 60)

for app in df_result['app'].unique():
    app_data = df_result[df_result['app'] == app]
    print(f"\n{app.upper()}:")
    for label, count in app_data['pred_label'].value_counts().items():
        percentage = count / len(app_data) * 100
        print(f"  {label}: {count} ({percentage:.1f}%)")

# Compare label vs prediction
print("\n🔄 AGREEMENT: Label (Rating) vs Prediction")
print("=" * 60)

agreement = (df_result['label'] == df_result['pred_label']).sum()
total = len(df_result)
agreement_pct = agreement / total * 100

print(f"Agreement: {agreement}/{total} ({agreement_pct:.2f}%)")
print(f"Disagreement: {total-agreement}/{total} ({100-agreement_pct:.2f}%)")

📊 PREDICTION DISTRIBUTION
positif: 1483 (53.6%)
negatif: 1044 (37.7%)
netral: 240 (8.7%)

📱 DISTRIBUTION PER APP

GOJEK:
  positif: 717 (47.5%)
  negatif: 648 (42.9%)
  netral: 145 (9.6%)

GRAB:
  positif: 766 (60.9%)
  negatif: 396 (31.5%)
  netral: 95 (7.6%)

🔄 AGREEMENT: Label (Rating) vs Prediction
Agreement: 2246/2767 (81.17%)
Disagreement: 521/2767 (18.83%)
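
A single agreement percentage hides which label pairs disagree most often. One way to break it down is to count every (label, prediction) pair, as in the sketch below. It uses toy lists rather than the real dataset, and `agreement_report` is a hypothetical helper introduced only for this example.

```python
from collections import Counter


def agreement_report(labels, preds):
    """Return the agreement rate and counts of each (label, prediction) pair."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    pairs = Counter(zip(labels, preds))
    agree = sum(n for (lab, pred), n in pairs.items() if lab == pred)
    return agree / len(labels), pairs


# Toy data for illustration only, not taken from the dataset.
labels = ["positif", "negatif", "positif", "netral", "negatif"]
preds = ["positif", "negatif", "netral", "netral", "positif"]

rate, pairs = agreement_report(labels, preds)
print(f"Agreement: {rate:.2%}")

# List only the pairs where the rating-based label and the model disagree.
for (lab, pred), n in sorted(pairs.items()):
    if lab != pred:
        print(f"  label={lab} pred={pred}: {n}")
```

On the real data, the same counts could be read from `df_result['label']` and `df_result['pred_label']`.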


## Example Predictions

Look at example predictions for each category.

In [11]:
print("📋 EXAMPLE PREDICTIONS PER CATEGORY")
print("=" * 80)

for pred_label in ['positif', 'netral', 'negatif']:
    print(f"\n🏷️ PREDICTION: {pred_label.upper()}")
    print("-" * 80)
    
    samples = df_result[df_result['pred_label'] == pred_label].head(3)
    for idx, row in samples.iterrows():
        print(f"App: {row['app']} | Rating: {row['score']} | Label: {row['label']}")
        print(f"Text: {row['content'][:100]}...")
        print(f"Confidence: {row[f'p_{pred_label}']:.2%}")
        print()

📋 EXAMPLE PREDICTIONS PER CATEGORY

🏷️ PREDICTION: POSITIF
--------------------------------------------------------------------------------
App: gojek | Rating: 5 | Label: positif
Text: sangat membantu...
Confidence: 0.20%

App: gojek | Rating: 5 | Label: positif
Text: sangat membantu keperluan masyarakat untuk harga bisa sedikit di minimalkan apalagi untuk ongkir mak...
Confidence: 1.70%

App: gojek | Rating: 5 | Label: positif
Text: apk yang, bagusss...
Confidence: 0.28%


🏷️ PREDICTION: NETRAL
--------------------------------------------------------------------------------
App: gojek | Rating: 1 | Label: negatif
Text: kenapa driver tidak segera mengabari apabila sudah sampai, dan kenapa harus marah²...
Confidence: 97.07%

App: gojek | Rating: 5 | Label: positif
Text: biarkan bintang yg berbicara...
Confidence: 90.32%

App: gojek | Rating: 5 | Label: positif
Text: gojek hrs menambah pengaturan pd fitur pesan unt bs memblokir kontak yg msk, krn oknum" DC skg msk k...
Confidence:

## Save Results

Save the data with predictions to `data/processed/`.

In [12]:
# Save results
output_file = 'data/processed/all_apps_with_predictions.csv'
df_result.to_csv(output_file, index=False, encoding='utf-8')

print(f"💾 Saved results to: {output_file}")
print(f"📊 Total rows: {len(df_result)}")
print(f"📂 File size: {Path(output_file).stat().st_size / 1024 / 1024:.2f} MB")

💾 Saved results to: data/processed/all_apps_with_predictions.csv
📊 Total rows: 2767
📂 File size: 0.70 MB

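Before handing the file to the next notebook, it can help to confirm that the saved CSV has the prediction columns the later steps will rely on. The sketch below is one possible check using only the standard library. `missing_columns` and the sample file name are hypothetical, and the sketch writes a tiny stand-in file so it runs on its own instead of reading the real output.

```python
import csv
import tempfile
from pathlib import Path

# Prediction columns produced by predict_batch in this notebook.
REQUIRED = ["pred_label", "pred_confidence", "p_positif", "p_netral", "p_negatif"]


def missing_columns(csv_path, required):
    """Return the required column names absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return [col for col in required if col not in header]


# Write a tiny stand-in file (deliberately missing one column) for the demo.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_predictions.csv"  # hypothetical file name
    path.write_text(
        "app,pred_label,pred_confidence,p_positif,p_netral\n", encoding="utf-8"
    )
    result = missing_columns(path, REQUIRED)

print(result)
```

An empty list means every expected column is present.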

## 🎉 Done!

Sentiment prediction completed successfully.

**Output:**
- File: `data/processed/all_apps_with_predictions.csv`
- Total: 2,767 reviews with sentiment predictions
- Prediction columns: `pred_label`, `pred_confidence`, `p_positif`, `p_netral`, `p_negatif`

**Next steps:**
- Run notebook `4_evaluation.ipynb` for in-depth evaluation and visualization

**Summary:**
- ✅ Model: IndoBERT sentiment classification
- ✅ Predictions: complete for all reviews
- ✅ Device: GPU/CPU
- ✅ File saved in data/processed/