# Fine-tuning BERT for AG News Classification
This notebook walks through an end-to-end pipeline for fine-tuning a BERT-family model (a Transformer encoder)
on the AG News dataset (four classes). Each step has a short explanation followed by runnable code.

**Steps at a glance**:
- Install dependencies
- Import libraries and define helper functions
- Load the AG News dataset with `datasets`
- Tokenize and prepare the dataset for `Trainer`
- Configure `AutoModelForSequenceClassification` and `Trainer`
- Fine-tune, evaluate, and save the model

## 1) Environment setup and library installation
Install the required dependencies and make sure your environment (e.g. a virtualenv) is active.

In [1]:
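# 1) Install dependencies (run once)
# If they are already installed in your environment, you can skip this cell.
# scikit-learn is needed by the `f1` metric in `evaluate`; PyTorch is assumed to be installed already.
import sys
!{sys.executable} -m pip install -q transformers datasets evaluate accelerate scikit-learn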


## 2) Load and explore the dataset
Import the libraries and metric helpers, then load the AG News dataset and inspect its structure and a few examples.

In [2]:
# 2) Imports and helper functions
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)
import evaluate
import numpy as np

# Helper: compute metrics (accuracy + weighted F1)
accuracy = evaluate.load('accuracy')
f1 = evaluate.load('f1')

def compute_metrics(eval_pred):
    # Trainer passes a (logits, labels) pair; argmax over the class axis gives the predicted class ids
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy.compute(predictions=preds, references=labels)['accuracy']
    f1w = f1.compute(predictions=preds, references=labels, average='weighted')['f1']
    return {'accuracy': acc, 'f1_weighted': f1w}
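
To make the two metrics concrete, here is a small pure-Python sketch of what accuracy and weighted F1 compute. The toy `labels` and `preds` lists and the `toy_*` function names are illustrative only and not part of the notebook. It assumes the usual definition of weighted F1: per-class F1 averaged with weights proportional to each class's number of true examples.

```python
from collections import Counter

def toy_accuracy(labels, preds):
    # Fraction of positions where prediction and reference agree
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def toy_weighted_f1(labels, preds):
    # Per-class F1, weighted by the number of true examples (support) of each class
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1_cls = 2 * tp / denom if denom else 0.0
        total += f1_cls * n
    return total / len(labels)

# Made-up labels for four classes (0..3); one class-0 example is misread as class 1
labels = [0, 0, 1, 1, 2, 3]
preds = [0, 1, 1, 1, 2, 3]

acc = toy_accuracy(labels, preds)
f1w = toy_weighted_f1(labels, preds)
print(acc, f1w)
```

On a balanced dataset the two numbers tend to sit close together, which is consistent with the training log later in this notebook.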



In [3]:
# 3) Load AG News dataset and inspect
dataset = load_dataset('ag_news')
dataset

Generating train split: 100%|██████████| 120000/120000 [00:00<00:00, 1021634.64 examples/s]
Generating test split: 100%|██████████| 7600/7600 [00:00<00:00, 1252424.58 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [4]:
# Inspect the dataset features and one example
print(dataset['train'].features)
print('Sample:', dataset['train'][0])

{'text': Value('string'), 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'])}
Sample: {'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}
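
A quick class-balance check is a natural next exploration step, since accuracy is easier to interpret when classes are evenly represented. The sketch below counts labels per class name; `label_distribution` and `toy_labels` are hypothetical names, and the toy list stands in for the real labels so the snippet runs without downloading anything.

```python
from collections import Counter

class_names = ['World', 'Sports', 'Business', 'Sci/Tech']

def label_distribution(labels, names):
    # Count how often each class id occurs and report it by class name
    counts = Counter(labels)
    return {names[i]: counts.get(i, 0) for i in range(len(names))}

# Stand-in labels (assumption); in the notebook you would pass dataset['train']['label']
toy_labels = [2, 2, 0, 1, 3, 3, 1, 0]
dist = label_distribution(toy_labels, class_names)
print(dist)
```

In the notebook, the class names can be read from `dataset['train'].features['label'].names` instead of being hard-coded.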


## 3) Tokenization with the BERT tokenizer
Use Hugging Face's `AutoTokenizer` to turn the raw text into token inputs for the BERT model.

AG News has 4 labels (0..3). We use `AutoTokenizer` and `AutoModelForSequenceClassification`
to map text to one logit per class; a softmax over the logits gives class probabilities. Choose `model_name` to suit your needs (e.g. `bert-base-uncased`).
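
As a rough illustration of how the model's raw outputs become a class prediction, the sketch below applies a softmax to four made-up logits and picks the most probable class. The logit values and the `softmax` helper are assumptions for illustration; the real model produces one such vector per input text.

```python
import math

class_names = ['World', 'Sports', 'Business', 'Sci/Tech']

def softmax(logits):
    # Subtract the maximum first for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input text (not taken from the notebook's model)
logits = [-1.2, -0.8, 0.5, 3.1]
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)
print(class_names[pred], probs[pred])
```

The `score` reported by the inference pipeline at the end of the notebook is, by default for a single-label classifier, this kind of softmax probability for the predicted class.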

In [5]:
# 4) Tokenization and preprocessing
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function: uses the 'text' column of AG News.
# truncation=True cuts inputs at the model's maximum length; padding is left to the data collator.
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True)

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Check one tokenized example
print(tokenized_datasets['train'][0])

Map: 100%|██████████| 120000/120000 [00:07<00:00, 16626.76 examples/s]
Map: 100%|██████████| 7600/7600 [00:00<00:00, 13588.78 examples/s]

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2, 'input_ids': [101, 2813, 2358, 1012, 6468, 15020, 2067, 2046, 1996, 2304, 1006, 26665, 1007, 26665, 1011, 2460, 1011, 19041, 1010, 2813, 2395, 1005, 1055, 1040, 11101, 2989, 1032, 2316, 1997, 11087, 1011, 22330, 8713, 2015, 1010, 2024, 3773, 2665, 2153, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
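
Note that the tokenized example above is not padded: each example keeps its own length, and padding happens later, per batch. The sketch below shows the idea behind this dynamic padding in plain Python. `pad_batch` is a hypothetical helper, not the library's implementation, and it assumes right-padding with pad id 0, which matches `bert-base-uncased`.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    # Pad every sequence to the length of the longest one in this batch only
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Two short toy sequences framed by [CLS]=101 and [SEP]=102
batch = pad_batch([[101, 2813, 2358, 102], [101, 2304, 102]])
print(batch)
```

Padding per batch rather than to a fixed global length usually wastes less compute on short texts such as news headlines.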





## 4) Configure the model for fine-tuning
Set up `AutoModelForSequenceClassification`, the data collator, and `TrainingArguments`.

In [7]:
# 5) Prepare the model, data collator, and TrainingArguments
num_labels = 4
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads each batch to its longest sequence

training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='epoch',
    save_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    # AG News ships only train/test splits, so the test split doubles as the per-epoch evaluation set here
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of processing_class=tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
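
Because the configuration above selects the best checkpoint using the same split that is later reported as the test result, the reported score can be slightly optimistic. One common alternative is to hold out part of the training data for validation. The sketch below builds stratified hold-out indices in plain Python; `stratified_holdout`, the 10% fraction, and the toy labels are all assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, val_fraction=0.1, seed=42):
    # Group example indices by class, then hold out the same fraction of each class
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_val = max(1, round(len(idxs) * val_fraction))
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels: 40 examples, 10 per class (stand-in for the real training labels)
toy_labels = [i % 4 for i in range(40)]
train_idx, val_idx = stratified_holdout(toy_labels, val_fraction=0.1)
print(len(train_idx), len(val_idx))
```

In the notebook, such indices could be applied with `tokenized_datasets['train'].select(train_idx)`; `datasets` also offers `Dataset.train_test_split(test_size=..., seed=...)` as a one-call alternative. The held-out part would then serve as `eval_dataset`, keeping the test split for a single final evaluation.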


## 5) Training with the Trainer API
Start fine-tuning with `trainer.train()`. Use a GPU if one is available.

In [8]:
# 6) Fine-tune the model (run this cell to start training)
# Note: training can take a while depending on your hardware.
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Weighted |
|---|---|---|---|---|
| 1 | 0.1965 | 0.177072 | 0.945526 | 0.945473 |
| 2 | 0.1126 | 0.185685 | 0.948289 | 0.948348 |
| 3 | 0.0826 | 0.232479 | 0.947763 | 0.947801 |


TrainOutput(global_step=22500, training_loss=0.1499707086775038, metrics={'train_runtime': 5401.0464, 'train_samples_per_second': 66.654, 'train_steps_per_second': 4.166, 'total_flos': 1.7987934973367424e+16, 'train_loss': 0.1499707086775038, 'epoch': 3.0})
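
The numbers in `TrainOutput` can be sanity-checked with simple arithmetic. The sketch below reproduces `global_step` and the throughput figures from the dataset size and the training arguments. It assumes a single device and no gradient accumulation; neither is stated explicitly in the notebook, but both are consistent with the reported step count.

```python
import math

num_train_examples = 120_000        # size of the AG News train split
per_device_train_batch_size = 16    # from TrainingArguments
num_train_epochs = 3                # from TrainingArguments
num_devices = 1                     # assumption: single GPU
gradient_accumulation_steps = 1     # assumption: not overridden in the notebook

effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

train_runtime = 5401.0464           # seconds, from the TrainOutput above
samples_per_second = num_train_examples * num_train_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

The same arithmetic helps estimate run time before changing the batch size or the number of epochs.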

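## 6) Evaluation and inference
Evaluate the model on the test set, then run inference on a few examples to sanity-check its predictions. Because `load_best_model_at_end=True`, `trainer.evaluate()` scores the checkpoint with the best accuracy (epoch 2 in the run shown), not the final epoch.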

In [9]:
# 7) Evaluate the model on the test set
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.1856851726770401, 'eval_accuracy': 0.9482894736842106, 'eval_f1_weighted': 0.9483479302769415, 'eval_runtime': 42.5632, 'eval_samples_per_second': 178.558, 'eval_steps_per_second': 2.796, 'epoch': 3.0}
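
Aggregate accuracy hides which classes get confused with each other. A confusion matrix and per-class recall give a finer view. The sketch below computes both in plain Python on made-up labels; the helper names and toy data are illustrative assumptions.

```python
def confusion_matrix(labels, preds, num_classes):
    # Rows are true classes, columns are predicted classes
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_cls, pred_cls in zip(labels, preds):
        matrix[true_cls][pred_cls] += 1
    return matrix

def per_class_recall(matrix):
    # Recall of class i = correct predictions for i / all true examples of i
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Toy data: classes 2 (Business) and 3 (Sci/Tech) are sometimes swapped
toy_labels = [0, 0, 1, 1, 2, 2, 3, 3]
toy_preds = [0, 0, 1, 1, 2, 3, 3, 2]

cm = confusion_matrix(toy_labels, toy_preds, num_classes=4)
recall = per_class_recall(cm)
print(cm)
print(recall)
```

In the notebook, the inputs could come from `trainer.predict(tokenized_datasets['test'])`, taking the argmax of `.predictions` and using `.label_ids` as the references.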


In [11]:
# 8) Save the fine-tuned model locally
model_dir = './fine-tuned-bert-agnews'
trainer.save_model(model_dir)
print('Saved model to', model_dir)

Saved model to ./fine-tuned-bert-agnews


In [16]:
# 9) Test inference: reload the saved model and predict on sample texts
from transformers import pipeline
import torch
device = 0 if torch.cuda.is_available() else -1
# AG News label names by index: 0->World, 1->Sports, 2->Business, 3->Sci/Tech
class_names = ['World', 'Sports', 'Business', 'Sci/Tech']
classifier = pipeline('text-classification', model=model_dir, tokenizer=model_dir, device=device)
samples = [
    'Apple releases their latest iPhone model with new features.',
    'The government announced new policies affecting the economy.'
]
results = classifier(samples, truncation=True)
for s, r in zip(samples, results):
    # The pipeline returns labels like 'LABEL_3': extract the index and map it to a class name
    if isinstance(r.get('label'), str) and r['label'].startswith('LABEL_'):
        label_index = int(r['label'].split('_')[-1])
    else:
        try:
            label_index = int(r.get('label'))
        except Exception:
            label_index = None
    human_label = class_names[label_index] if (label_index is not None and 0 <= label_index < len(class_names)) else r.get('label')
    print('Text:', s)
    print('Prediction:', {'label': human_label, 'score': r.get('score')})
    print('---')

Device set to use cuda:0


Text: Apple releases their latest iPhone model with new features.
Prediction: {'label': 'Sci/Tech', 'score': 0.9823869466781616}
---
Text: The government announced new policies affecting the economy.
Prediction: {'label': 'Business', 'score': 0.9666765928268433}
---
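
The manual `LABEL_n` parsing above is needed because the saved model config has no class names. A possible simplification is to define `id2label` and `label2id` mappings once and reuse them. The sketch below builds the mappings and a small hypothetical helper, `to_class_name`, that converts pipeline labels; it is an illustration, not the notebook's original code.

```python
class_names = ['World', 'Sports', 'Business', 'Sci/Tech']

# Index -> name and name -> index mappings
id2label = dict(enumerate(class_names))
label2id = {name: idx for idx, name in id2label.items()}

def to_class_name(pipeline_label, mapping=id2label):
    # Convert 'LABEL_3' to 'Sci/Tech'; pass through anything that is already a name
    prefix = 'LABEL_'
    suffix = pipeline_label[len(prefix):]
    if pipeline_label.startswith(prefix) and suffix.isdigit():
        return mapping.get(int(suffix), pipeline_label)
    return pipeline_label

# Passing the mappings when loading the model stores them in its config, e.g.:
# AutoModelForSequenceClassification.from_pretrained(
#     model_name, num_labels=4, id2label=id2label, label2id=label2id)
print(to_class_name('LABEL_3'), to_class_name('Business'))
```

With the mappings stored in the model config before training and saving, the pipeline would be expected to return the class names directly, making the parsing step unnecessary.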


**Final notes & tips**:
- For faster training, run on a GPU (select devices with `CUDA_VISIBLE_DEVICES`, or launch with `accelerate`).
- You can switch `model_name` to another BERT variant or a lighter model (e.g. DistilBERT) if resources are limited.
- To deploy, load or export the model saved in `./fine-tuned-bert-agnews`.
- To push to the Hugging Face Hub, call `trainer.push_to_hub()` after authenticating with your Hub token and setting `hub_model_id` in `TrainingArguments`.
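
As a small illustration of the GPU tip, the sketch below restricts the process to one GPU through `CUDA_VISIBLE_DEVICES`. The device id `'0'` and the `visible_gpu_ids` helper are assumptions; the variable has to be set before any library initializes CUDA, so it belongs at the very top of the notebook or in the shell that launches it.

```python
import os

# Restrict this process to the first GPU (assumed id '0'); set before CUDA is initialized
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

def visible_gpu_ids():
    # Parse the comma-separated device list; an empty value means no GPU is exposed
    raw = os.environ.get('CUDA_VISIBLE_DEVICES', '')
    return [part.strip() for part in raw.split(',') if part.strip()]

gpu_ids = visible_gpu_ids()
print(gpu_ids)
```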