# Fine-tuning BERT for GoEmotions (classifier)
This notebook shows an end-to-end pipeline for fine-tuning a BERT-family model on the GoEmotions dataset (multi-label).
Each step comes with a short explanation in the Markdown cell above the corresponding code cell.

## 1) Installing dependencies
If the environment does not yet have the required packages, run the following cell. If they are already installed, skip it.

In [1]:
# Install dependencies (run only if needed)
!pip install -q transformers datasets evaluate accelerate scikit-learn

/bin/bash: line 1: /home/apalah/Documents/uasdl/task1/goemotions/virtualenvdl/bin/pip: cannot execute: required file not found


## 2) Imports & initial configuration
Import the main libraries, set the seed for reproducibility, and choose the device (CPU/GPU).

In [2]:
import os
import numpy as np
import torch
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoConfig, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding)
from sklearn.metrics import f1_score

# Reproducibility & device
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)

Device: cuda


## 3) Loading the GoEmotions dataset
Use `datasets` to load `google-research-datasets/go_emotions`. We then inspect the label names and the number of labels.

In [3]:
# Load GoEmotions dataset from Hugging Face
dataset = load_dataset('google-research-datasets/go_emotions')
label_names = dataset['train'].features['labels'].feature.names
num_labels = len(label_names)
print('Num labels =', num_labels)
print('Example labels:', label_names[:10])

Num labels = 28
Example labels: ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment']
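
The label names loaded above can also be turned into the `id2label` / `label2id` mappings that Hugging Face configs store, so that a fine-tuned model reports emotion names instead of generic `LABEL_0`-style names. The sketch below is a minimal illustration using only the ten label names printed above; the helper name `build_label_maps` is hypothetical and not part of the notebook.

```python
def build_label_maps(names):
    """Return (id2label, label2id) dictionaries for a list of label names."""
    id2label = {i: name for i, name in enumerate(names)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id

# First ten GoEmotions labels, copied from the output above.
example_names = ['admiration', 'amusement', 'anger', 'annoyance', 'approval',
                 'caring', 'confusion', 'curiosity', 'desire', 'disappointment']

id2label, label2id = build_label_maps(example_names)
print(id2label[2])         # anger
print(label2id['caring'])  # 5
```

In the notebook, the full `label_names` list would be passed instead, and the two dictionaries could then be supplied as the `id2label` and `label2id` keyword arguments when the model config is created.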


## 4) Preprocessing & tokenization
Tokenize the text and convert each list of label indices into a multi-hot vector for multi-label classification. Limit `max_length` so that inputs have a consistent upper bound on length.

In [4]:
# Preprocessing: tokenizer + convert label lists -> multi-hot vectors
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess(batch):
    enc = tokenizer(batch['text'], truncation=True, max_length=128)
    multi = []
    for lab in batch['labels']:
        v = [0.0]*num_labels
        for i in lab:
            v[i] = 1.0
        multi.append(v)
    # Ensure labels are float (BCEWithLogits expects float targets)
    enc['labels'] = multi
    return enc

encoded_dataset = dataset.map(preprocess, batched=True, remove_columns=dataset['train'].column_names)
print(encoded_dataset)

Map: 100%|██████████| 5426/5426 [00:00<00:00, 13490.15 examples/s]

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 43410
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5426
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5427
    })
})
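
To make the multi-hot conversion concrete, the sketch below applies the same logic as the inner loop of `preprocess` to a few hand-written label lists. It assumes a toy label space of 5 labels rather than the 28 in GoEmotions, and `to_multi_hot` is a hypothetical helper name.

```python
def to_multi_hot(label_ids, num_labels):
    """Convert a list of label indices into a float multi-hot vector."""
    vec = [0.0] * num_labels
    for i in label_ids:
        vec[i] = 1.0
    return vec

# Toy examples: one label, two labels, and no label at all.
examples = [[2], [0, 4], []]
vectors = [to_multi_hot(ids, num_labels=5) for ids in examples]
for ids, vec in zip(examples, vectors):
    print(ids, '->', vec)
```

Each vector has one slot per label, so an example with several emotions simply has several slots set to 1.0.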





## 5) Preparing the model for multi-label classification
Create an `AutoConfig` with `problem_type='multi_label_classification'` and load `AutoModelForSequenceClassification`.

In [8]:
# Model setup (multi-label)
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels, problem_type='multi_label_classification')
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config).to(device)
data_collator = DataCollatorWithPadding(tokenizer)

# Ensure labels are float in batches: wrap collator to cast labels to float32 (BCEWithLogits needs float targets)
def float_data_collator(features):
    batch = data_collator(features)
    if 'labels' in batch:
        batch['labels'] = batch['labels'].to(torch.float)
    return batch
print(model.config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "21": "LABEL_21",
    "22": "LABEL_22",
    "23": "LABEL_23",
    "24": "LABEL_24",
    "25": "LABEL_25",
    "26": "LABEL_26",
    "27": "LABEL_27"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_
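
Setting `problem_type='multi_label_classification'` makes the sequence-classification head use a binary cross-entropy loss on logits (`BCEWithLogitsLoss`), which treats each label as an independent yes/no decision and needs float targets. The following is a minimal pure-Python sketch of that loss for a single example, written to show the arithmetic rather than to mirror the library's implementation; the function name and the toy numbers are illustrative assumptions.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Toy example with 3 labels; the first and third are the true labels.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Because every label contributes its own term, a confident wrong logit on any single label raises the loss regardless of how the other labels are predicted.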

## 6) Evaluation metrics (multi-label F1)
Apply a sigmoid to the logits, use a threshold of 0.5 to obtain binary predictions, then compute F1 (micro & macro).

In [9]:
# Metrics for multi-label: use sigmoid + threshold then compute F1 (micro & macro)
def sigmoid(x):
    return 1/(1+np.exp(-x))

def compute_metrics(pred):
    logits = pred.predictions
    probs = sigmoid(logits)
    y_pred = (probs >= 0.5).astype(int)
    # label_ids may be floats (multi-hot); convert to binary ints for metrics
    y_true = (pred.label_ids >= 0.5).astype(int)
    f1_micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
    f1_macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
    return {'f1_micro': f1_micro, 'f1_macro': f1_macro}
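
The gap between micro and macro F1 comes from how per-label counts are aggregated: micro F1 pools true positives, false positives, and false negatives across all labels, whereas macro F1 averages the per-label F1 scores, so rare labels weigh as much as frequent ones. The sketch below reimplements both in plain Python on a toy 3-label problem, purely for illustration; the notebook itself uses `sklearn.metrics.f1_score`, and the helper names here are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Compute (micro, macro) F1 for binary multi-label matrices."""
    num_labels = len(y_true[0])
    counts = []
    for j in range(num_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))
    # Micro: pool the counts over labels, then compute one F1.
    micro = f1_from_counts(*[sum(c[k] for c in counts) for k in range(3)])
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / num_labels
    return micro, macro

# Toy data: label 0 is frequent and well predicted, label 2 is rare and missed.
y_true = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f'micro={micro:.3f} macro={macro:.3f}')  # micro=0.909 macro=0.667
```

A single missed rare label barely moves micro F1 but removes a third of the macro score here, which is the same pattern that makes macro F1 lower than micro F1 on imbalanced label sets.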

## 7) Training configuration and running the Trainer
Define `TrainingArguments` (batch size, epochs, learning rate, evaluation strategy). Adjust them for quick experiments.

In [10]:
# Training arguments and Trainer
training_args = TrainingArguments(
    output_dir='./goemotions-bert',
    eval_strategy='epoch',
    save_strategy='epoch',
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    logging_steps=100,
    learning_rate=2e-5,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    tokenizer=tokenizer,
    data_collator=float_data_collator,
    compute_metrics=compute_metrics,
)

# Note: training can take a long time. Reduce the number of epochs or train on a subset for quick tests.
trainer.train()
trainer.evaluate()

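| Epoch | Training Loss | Validation Loss | F1 Micro | F1 Macro |
|-------|---------------|-----------------|----------|----------|
| 1 | 0.0934 | 0.093321 | 0.509136 | 0.248086 |
| 2 | 0.0832 | 0.085575 | 0.556098 | 0.367664 |
| 3 | 0.0741 | 0.085382 | 0.573374 | 0.394483 |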


{'eval_loss': 0.08538229018449783,
 'eval_f1_micro': 0.5733743819386137,
 'eval_f1_macro': 0.3944834265551541,
 'eval_runtime': 8.9752,
 'eval_samples_per_second': 604.556,
 'eval_steps_per_second': 18.941,
 'epoch': 3.0}
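
The reported throughput can be sanity-checked with a little arithmetic. The sketch below derives the number of steps from the dataset sizes and batch sizes shown earlier, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook); `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_rows, batch_size):
    """Number of batches needed to cover num_rows, keeping the last partial batch."""
    return math.ceil(num_rows / batch_size)

train_steps = steps_per_epoch(43410, 16)   # train split, per_device_train_batch_size
eval_steps = steps_per_epoch(5426, 32)     # validation split, per_device_eval_batch_size
total_train_steps = train_steps * 3        # num_train_epochs=3

print(train_steps, total_train_steps, eval_steps)  # 2714 8142 170
```

The 170 evaluation steps agree with the log above: 18.941 steps per second over 8.9752 seconds is roughly 170 steps.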

## 8) Example inference
Once the model is trained, run a prediction on new text: take the logits, apply a sigmoid, and threshold.

In [11]:
# Inference example
text = 'I feel joyful and excited today!'
inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(device)
model.eval()  # make sure dropout is disabled during inference
with torch.no_grad():
    logits = model(**inputs).logits.cpu().numpy()[0]
probs = 1/(1+np.exp(-logits))
preds = [label_names[i] for i,p in enumerate(probs) if p>0.5]
print('Predicted labels:', preds)
print('Top probabilities:')
for i in np.argsort(probs)[-5:][::-1]:
    print(label_names[i], f'{probs[i]:.3f}')

Predicted labels: ['joy']
Top probabilities:
joy 0.635
excitement 0.473
gratitude 0.052
neutral 0.039
admiration 0.035
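
The thresholding and top-k steps can be wrapped in a small helper so that inference output is easier to reuse. The sketch below is one possible shape for such a helper, using plain Python lists and the probabilities printed above as toy input; `decode_predictions` and its parameters are hypothetical and not part of the notebook.

```python
def decode_predictions(probs, names, threshold=0.5, top_k=3):
    """Return (labels above threshold, top_k (label, prob) pairs sorted by prob)."""
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    selected = [name for name, p in ranked if p > threshold]
    return selected, ranked[:top_k]

# Toy input: the five probabilities shown in the output above.
names = ['joy', 'excitement', 'gratitude', 'neutral', 'admiration']
probs = [0.635, 0.473, 0.052, 0.039, 0.035]

selected, top = decode_predictions(probs, names)
print(selected)  # ['joy']
print(top)       # [('joy', 0.635), ('excitement', 0.473), ('gratitude', 0.052)]
```

Lowering `threshold` to 0.4 would also return `excitement`, which illustrates how sensitive multi-label output is to the cutoff.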
