# Task 3 — Fine-tuning Decoder-only LLM (Phi-2) untuk Summarization (XSum)

**Target UAS:** fine-tune model decoder-only (Phi-2) untuk membuat ringkasan abstraktif pada dataset XSum.

**Catatan resource:** Phi-2 relatif besar. Banyak orang memakai PEFT/LoRA + 4-bit quantization agar muat di GPU terbatas.
- Template ini menyediakan jalur **LoRA + 4-bit** (opsional).
- Jika kamu full fine-tune, kamu mungkin butuh GPU memory besar.

Tanggal template: 2026-01-05

## 0. Setup
**TODO:** pastikan `bitsandbytes` kompatibel dengan environment kamu (terutama di Windows/local).

Jika kamu tidak bisa memakai 4-bit, kamu bisa:
- pakai LoRA tanpa quantization (butuh VRAM lebih)
- atau pakai model lebih kecil (jika diizinkan)

In [1]:
from google.colab import drive
drive.mount("/content/drive")

import os
from pathlib import Path

PROJECT_DIR = "/content/drive/MyDrive/finetuning-phi-2-text-summarization"
PROJECT_DIR = Path(PROJECT_DIR)

OUTPUTS_DIR = PROJECT_DIR / "outputs"
REPORTS_DIR = PROJECT_DIR / "reports"
MODELS_DIR  = PROJECT_DIR / "models"

OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)
REPORTS_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# (opsional) simpan cache HF di Drive biar nggak download ulang
os.environ["HF_HOME"] = str(PROJECT_DIR / ".hf_home")
os.environ["HF_DATASETS_CACHE"] = str(PROJECT_DIR / ".hf_datasets_cache")
os.environ["TRANSFORMERS_CACHE"] = str(PROJECT_DIR / ".hf_transformers_cache")

print("PROJECT_DIR:", PROJECT_DIR)
print("OUTPUTS_DIR:", OUTPUTS_DIR)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
PROJECT_DIR: /content/drive/MyDrive/finetuning-phi-2-text-summarization
OUTPUTS_DIR: /content/drive/MyDrive/finetuning-phi-2-text-summarization/outputs


In [2]:
!pip -q install -U datasets evaluate transformers accelerate peft rouge-score
# bitsandbytes hanya diperlukan kalau mau 4-bit quantization
!pip -q install -U bitsandbytes


In [3]:
import os
import random
import numpy as np
import torch

from datasets import load_dataset
import evaluate

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    set_seed,
)

# BitsAndBytesConfig tidak selalu ada (tergantung versi transformers)
try:
    from transformers import BitsAndBytesConfig
    BNB_CONFIG_AVAILABLE = True
except Exception:
    BitsAndBytesConfig = None
    BNB_CONFIG_AVAILABLE = False

# Optional (PEFT)
try:
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    PEFT_AVAILABLE = True
except Exception:
    LoraConfig = None
    get_peft_model = None
    prepare_model_for_kbit_training = None
    PEFT_AVAILABLE = False

SEED = 42
set_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
print("device:", device)
print("BNB_CONFIG_AVAILABLE:", BNB_CONFIG_AVAILABLE)
print("PEFT_AVAILABLE:", PEFT_AVAILABLE)




device: cuda
BNB_CONFIG_AVAILABLE: True
PEFT_AVAILABLE: True


In [4]:
USE_4BIT = (device == "cuda") and BNB_CONFIG_AVAILABLE  # otomatis off kalau tidak tersedia
USE_LORA = PEFT_AVAILABLE  # otomatis off kalau peft tidak tersedia

print("USE_4BIT:", USE_4BIT)
print("USE_LORA:", USE_LORA)


USE_4BIT: True
USE_LORA: True


## 1. Load dataset & model

**TODO:** pastikan nama model Phi-2 benar sesuai HuggingFace Hub yang kamu pakai.

Dataset XSum fields umumnya:
- `document`
- `summary`

In [13]:
from datasets import load_dataset
import evaluate

DATASET_REPO = "EdinburghNLP/xsum"
REV = "refs%2Fconvert%2Fparquet"  # URL-encoded

dataset = load_dataset(
    "parquet",
    data_files={
        "train":      f"hf://datasets/{DATASET_REPO}@{REV}/default/train/*.parquet",
        "validation": f"hf://datasets/{DATASET_REPO}@{REV}/default/validation/*.parquet",
        "test":       f"hf://datasets/{DATASET_REPO}@{REV}/default/test/*.parquet",
    },
)

print(dataset)
metric = evaluate.load("rouge")


DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})


In [14]:
print(dataset["train"].column_names)
print(dataset["train"][0])


['document', 'summary', 'id']


## 2. Load tokenizer & model

Untuk model causal LM:
- kita membuat prompt "Summarize: {document}\nSummary:" lalu targetnya `summary`
- saat training, label biasanya sama dengan input_ids (shift internal oleh model)

**TODO:** cek `tokenizer.pad_token` (beberapa LLM tidak punya pad token).

In [15]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

if USE_4BIT_LORA:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
else:
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

added_tokens.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/735 [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

trainable params: 18,350,080 || all params: 2,798,033,920 || trainable%: 0.6558


In [None]:
tokenizer.padding_side = "right"

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


## 3. Preprocessing



In [16]:
def build_prompt(doc: str) -> str:
    doc = doc[:MAX_DOC_CHARS]
    return f"Summarize the following article in 1-2 sentences.\n\nArticle:\n{doc}\n\nSummary:"

def preprocess_batch(examples):
    docs = examples["document"]
    sums = examples["summary"]

    texts = []
    for d, s in zip(docs, sums):
        prompt = build_prompt(d)
        # Untuk causal LM training sederhana: gabungkan prompt + target + EOS
        full = prompt + " " + s + tokenizer.eos_token
        texts.append(full)

    tokenized = tokenizer(
        texts,
        truncation=True,
        max_length=MAX_LENGTH,
        padding=False,
    )
    # labels = input_ids (standard causal LM)
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized = dataset.map(preprocess_batch, batched=True, remove_columns=dataset["train"].column_names)
print(tokenized)

Map:   0%|          | 0/204045 [00:00<?, ? examples/s]

Map:   0%|          | 0/11332 [00:00<?, ? examples/s]

Map:   0%|          | 0/11334 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 11334
    })
})


## 4. Trainer

**TODO:** atur output_dir & logging.

In [17]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# NOTE: Some older transformers versions don't support evaluation_strategy.
# We build TrainingArguments in a version-compatible way.
import inspect
from transformers import TrainingArguments

use_fp16 = bool(torch.cuda.is_available())

base_kwargs = dict(
    output_dir=str(OUTPUTS_DIR),
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    num_train_epochs=EPOCHS,
    logging_steps=50,
    save_steps=500,
    save_total_limit=2,
    fp16=use_fp16,
    report_to="none",
)

sig = inspect.signature(TrainingArguments.__init__)
params = sig.parameters

# Newer versions
if "evaluation_strategy" in params:
    base_kwargs.update(dict(
        evaluation_strategy="steps",
        eval_steps=500,
    ))
else:
    # Older versions (fallback)
    if "evaluate_during_training" in params:
        base_kwargs["evaluate_during_training"] = True
    if "eval_steps" in params:
        base_kwargs["eval_steps"] = 500

# Keep only supported args for this transformers version
filtered_kwargs = {k: v for k, v in base_kwargs.items() if k in params}

args = TrainingArguments(**filtered_kwargs)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)


  trainer = Trainer(


## 5. Training

**TODO:** jalankan training.

In [18]:
# Train / Evaluate / Save artifacts to Google Drive
# If you already trained and only want to re-run evaluation, set DO_TRAIN = False.
import json
from datetime import datetime

DO_TRAIN = True

if DO_TRAIN:
    train_result = trainer.train()
    print(train_result)

# Save model + tokenizer
run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
save_dir = MODELS_DIR / f"phi2_xsum_{run_id}"
save_dir.mkdir(parents=True, exist_ok=True)

trainer.save_model(str(save_dir))
tokenizer.save_pretrained(str(save_dir))
print("Saved model to:", save_dir)

# Evaluate (loss on validation by default)
eval_metrics = trainer.evaluate()
print("Eval metrics:", eval_metrics)

# Save metrics
metrics_path = REPORTS_DIR / "metrics.json"
with open(metrics_path, "w") as f:
    json.dump(eval_metrics, f, indent=2)
print("Saved metrics to:", metrics_path)


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.


ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

## 6. Evaluasi ROUGE (setelah training)

Evaluasi summarization biasanya:
- generate summary dari prompt
- bandingkan dengan reference summary (ROUGE)

**TODO:** jalankan cell ini setelah training (dan mungkin pakai subset agar cepat).

In [None]:
# def generate_summary(doc: str, max_new_tokens: int = 64):
#     prompt = build_prompt(doc)
#     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
#     with torch.no_grad():
#         out = model.generate(
#             **inputs,
#             max_new_tokens=max_new_tokens,
#             do_sample=False,
#             num_beams=4,
#         )
#     text = tokenizer.decode(out[0], skip_special_tokens=True)
#     # Ambil teks setelah "Summary:"
#     if "Summary:" in text:
#         text = text.split("Summary:", 1)[-1].strip()
#     return text

# # Quick eval on small subset
# n = 200
# preds, refs = [], []
# for ex in dataset["validation"].select(range(n)):
#     preds.append(generate_summary(ex["document"]))
#     refs.append(ex["summary"])

# rouge = metric.compute(predictions=preds, references=refs)
# print(rouge)

## 7. Analisis

**TODO:** isi `reports/` dengan:
- ROUGE score
- contoh hasil summary bagus vs buruk
- diskusi abstractive vs extractive
- kendala truncation & panjang dokumen