# Fine-tuning BERT on MNLI (end-to-end)

This notebook walks through a supervised pipeline for fine-tuning a BERT-family model on the MNLI dataset (GLUE). Each step has a short explanation before the corresponding code cell.

In [26]:
# Install dependencies (run once if needed)
!pip install -q transformers datasets evaluate accelerate

/bin/bash: line 1: /home/apalah/Documents/uasdl/task1/mnli/virtualenvdl/bin/pip: cannot execute: required file not found
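
The `cannot execute: required file not found` message usually means the `pip` launcher script exists but the interpreter named in its first line does not, which can happen after a virtual environment is moved or rebuilt. That diagnosis is an assumption, since the notebook does not show the cause. A minimal sketch of a workaround is to run pip as a module of the interpreter that is already running the kernel; the helper name `pip_install_command` is hypothetical.

```python
import subprocess
import sys

PACKAGES = ["transformers", "datasets", "evaluate", "accelerate"]


def pip_install_command(packages):
    """Build a pip command that runs pip as a module of the current interpreter."""
    return [sys.executable, "-m", "pip", "install", "-q", *packages]


command = pip_install_command(PACKAGES)
print(" ".join(command))

# Uncomment to actually install into the kernel's environment:
# subprocess.check_call(command)
```

Inside a notebook, the `%pip install ...` magic serves the same purpose, because it targets the environment of the running kernel rather than whatever `pip` is first on the shell path.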


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
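
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in the first cell of the notebook before any tokenizer is used; picking `"false"` rather than `"true"` is an assumption made here to keep forked subprocesses (such as `!pip`) quiet.

```python
import os

# Set this before any tokenizer is created, so that later forks
# (for example shell commands started with `!`) do not trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```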


## Stage 1: GPU detection
This cell checks whether a CUDA GPU is available and prints the device(s) that training will use.

In [27]:
import torch

# Check whether CUDA (GPU) is available
if torch.cuda.is_available():
    # Number of visible GPUs
    gpu_count = torch.cuda.device_count()
    print(f"Found {gpu_count} GPU(s) available for training.")
    # Print the name of each GPU
    for i in range(gpu_count):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No GPU found. Training will run on the CPU.")

Found 1 GPU(s) available for training.
GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU


## Imports and initial configuration
This cell imports the libraries used: `datasets` for MNLI, `transformers` for the tokenizer and model, and `evaluate` for metrics. It also sets the checkpoint name and the output directory.

In [28]:
import os
import numpy as np
from datasets import load_dataset
import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer
)

# BERT checkpoint (change this to try another BERT-family variant)
model_checkpoint = "bert-base-uncased"
output_dir = "./mnli-bert-finetuned"

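## Load the MNLI dataset from GLUE
`load_dataset` returns every split of the `mnli` configuration. This notebook trains on `train` and evaluates on `validation_matched`; `validation_mismatched` and the test splits are loaded but not used below.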

In [29]:
raw_datasets = load_dataset("glue", "mnli")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9847
    })
})

## Preprocessing and tokenization
Tokenize each (`premise`, `hypothesis`) pair and prepare the `labels` field that the Trainer requires.

In [30]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Tokenization function for premise/hypothesis pairs
def tokenize_fn(example):
    return tokenizer(example['premise'], example['hypothesis'], truncation=True)

# Apply the tokenizer in batches for efficiency
tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)

# Trainer expects a 'labels' column; GLUE MNLI stores it as 'label'
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')

# Drop unused columns and set the format to PyTorch tensors
keep_cols = ['input_ids', 'attention_mask', 'token_type_ids', 'labels']
cols_to_remove = [c for c in tokenized_datasets['train'].column_names if c not in keep_cols]
tokenized_datasets = tokenized_datasets.remove_columns(cols_to_remove)
tokenized_datasets.set_format('torch')

tokenized_datasets

Map: 100%|██████████| 9815/9815 [00:00<00:00, 24198.40 examples/s]


DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9847
    })
})
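
The three feature columns above follow the usual BERT sentence-pair layout: `[CLS] premise [SEP] hypothesis [SEP]`, with `token_type_ids` marking which segment each position belongs to. The sketch below illustrates that layout only. It splits on whitespace instead of using WordPiece, and the function name `encode_pair` is hypothetical, so it is not what `AutoTokenizer` actually executes.

```python
def encode_pair(premise_tokens, hypothesis_tokens):
    """Toy layout of a BERT-style sentence-pair input (not the real tokenizer)."""
    tokens = ["[CLS]", *premise_tokens, "[SEP]", *hypothesis_tokens, "[SEP]"]
    # Segment 0 covers [CLS], the premise, and the first [SEP];
    # segment 1 covers the hypothesis and the final [SEP].
    token_type_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)
    # Every position holds a real token here, so the mask is all ones.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair("two men in a bar".split(), "they are in a pub".split())
for token, segment in zip(encoded["tokens"], encoded["token_type_ids"]):
    print(f"{token:>6}  segment={segment}")
```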

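## Set up the model and TrainingArguments
Create a classification model with `num_labels=3` (entailment/neutral/contradiction), then define the data collator, the training arguments, and the metric function passed to the Trainer.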

In [31]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)
data_collator = DataCollatorWithPadding(tokenizer)

training_args = TrainingArguments(
    output_dir=output_dir,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# Metrics
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy.compute(predictions=predictions, references=labels)
    f1m = f1.compute(predictions=predictions, references=labels, average="macro")
    return {**acc, **{'f1_macro': f1m['f1']}}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation_matched'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
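
`compute_metrics` above reports accuracy and macro-averaged F1 through the `evaluate` library. To make the two numbers concrete, the sketch below recomputes them by hand in plain Python on a small made-up set of labels. The function name `accuracy_and_macro_f1` is hypothetical, and the sketch averages over a fixed `num_labels`, which is an assumption that matches the three MNLI classes.

```python
def accuracy_and_macro_f1(predictions, references, num_labels=3):
    """Accuracy and unweighted mean of per-class F1 scores."""
    pairs = list(zip(predictions, references))
    accuracy = sum(p == r for p, r in pairs) / len(pairs)

    f1_scores = []
    for label in range(num_labels):
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {"accuracy": accuracy, "f1_macro": sum(f1_scores) / num_labels}


# Made-up labels for illustration only (0=entailment, 1=neutral, 2=contradiction).
references = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
result = accuracy_and_macro_f1(predictions, references)
print(result)
```

Macro averaging weights each class equally, so it stays close to accuracy when the classes are roughly balanced, as the nearly identical accuracy and macro F1 values in this notebook's results suggest.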


## Training
Run `trainer.train()` to fine-tune the model. Adjust `num_train_epochs` and the batch sizes to fit the available hardware.

In [32]:
train_result = trainer.train()
trainer.save_model(output_dir)

# Save the tokenizer as well
tokenizer.save_pretrained(output_dir)

train_result

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro |
|---|---|---|---|---|
| 1 | 0.4322 | 0.45639 | 0.832298 | 0.831706 |


TrainOutput(global_step=49088, training_loss=0.5280754309745003, metrics={'train_runtime': 3249.7041, 'train_samples_per_second': 120.842, 'train_steps_per_second': 15.105, 'total_flos': 1.417814915563998e+16, 'train_loss': 0.5280754309745003, 'epoch': 1.0})
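
The `global_step` and throughput figures in `TrainOutput` can be cross-checked from values already shown in this notebook: the train split size, the per-device batch size, the single detected GPU, and the reported runtime. The sketch assumes no gradient accumulation, which is the `TrainingArguments` default and is not overridden above.

```python
import math

train_examples = 392_702     # rows in the train split
per_device_batch_size = 8    # per_device_train_batch_size
num_devices = 1              # one GPU was detected
num_epochs = 1               # num_train_epochs
train_runtime = 3249.7041    # seconds, from TrainOutput

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs

samples_per_second = train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)  # 49088
print(round(samples_per_second, 3), round(steps_per_second, 3))
```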

## Final evaluation and example inference
Evaluate on the matched validation split, then run example predictions on sentence pairs.

In [41]:
metrics = trainer.evaluate()
print("Validation metrics:", metrics)

# Quick prediction example using a pipeline built from the saved tokenizer and model
from transformers import pipeline

# Use the GPU if available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else -1
classifier = pipeline('text-classification', model=output_dir, tokenizer=output_dir, return_all_scores=True, device=device)

# Manual mapping from label ID to label name for MNLI
id2label = {0: 'entailment', 1: 'neutral', 2: 'contradiction'}

examples = [
    ("Two men in polo shirts standing in a bar.", "They are in a pub."),
    ("A man inspects the uniform of a figure in some East Asian country.", "A man is sleeping.")
]

for a,b in examples:
    # Pass the two sentences to the pipeline as a text/text_pair pair
    out = classifier({"text": a, "text_pair": b})
    print('Premise:', a)
    print('Hypothesis:', b)
    
    # Pick the prediction with the highest score
    best_prediction = max(out, key=lambda x: x['score'])
    predicted_label_id = int(best_prediction['label'].split('_')[-1])
    predicted_label_name = id2label[predicted_label_id]

    print('Scores:', out)
    print(f"Predicted Label: {predicted_label_name} (Score: {best_prediction['score']:.4f})")
    print('---')

Device set to use cuda:0


Validation metrics: {'eval_loss': 0.4563902020454407, 'eval_accuracy': 0.8322975038206827, 'eval_f1_macro': 0.8317062443100176, 'eval_runtime': 14.0998, 'eval_samples_per_second': 696.107, 'eval_steps_per_second': 43.547, 'epoch': 1.0}
Premise: Two men in polo shirts standing in a bar.
Hypothesis: They are in a pub.
Scores: [{'label': 'LABEL_0', 'score': 0.7811073660850525}, {'label': 'LABEL_1', 'score': 0.05497472733259201}, {'label': 'LABEL_2', 'score': 0.16391794383525848}]
Predicted Label: entailment (Score: 0.7811)
---
Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: A man is sleeping.
Scores: [{'label': 'LABEL_0', 'score': 0.004381849430501461}, {'label': 'LABEL_1', 'score': 0.026000216603279114}, {'label': 'LABEL_2', 'score': 0.9696179628372192}]
Predicted Label: contradiction (Score: 0.9696)
---
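
The scores printed above are class probabilities, and the `LABEL_k` suffix is the class index that `id2label` turns into a name. The sketch below shows the same chain in plain Python: softmax over three logits, argmax, then the name lookup. The logits are made-up values for illustration, not outputs of the fine-tuned model.

```python
import math

id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one premise/hypothesis pair.
logits = [2.0, -0.5, 0.3]
probabilities = softmax(logits)

# Mimic the pipeline's output format, then pick the best entry as the cell above does.
scores = [{"label": f"LABEL_{i}", "score": p} for i, p in enumerate(probabilities)]
best = max(scores, key=lambda item: item["score"])
predicted_name = id2label[int(best["label"].split("_")[-1])]

print(scores)
print(predicted_name)
```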


