# Task 3 — Fine-tuning Decoder-only LLM (Phi-2) untuk Summarization (XSum)

**Target UAS:** fine-tune model decoder-only (Phi-2) untuk membuat ringkasan abstraktif pada dataset XSum.

**Catatan resource:** Phi-2 relatif besar. Banyak orang memakai PEFT/LoRA + 4-bit quantization agar muat di GPU terbatas.
- Template ini menyediakan jalur **LoRA + 4-bit** (opsional).
- Jika kamu full fine-tune, kamu mungkin butuh GPU memory besar.

Tanggal template: 2026-01-05

## 0. Setup
**TODO:** pastikan `bitsandbytes` kompatibel dengan environment kamu (terutama di Windows/local).

Jika kamu tidak bisa memakai 4-bit, kamu bisa:
- pakai LoRA tanpa quantization (butuh VRAM lebih)
- atau pakai model lebih kecil (jika diizinkan)

In [2]:
from google.colab import drive
drive.mount("/content/drive")

import os
from pathlib import Path

PROJECT_DIR = "/content/drive/MyDrive/finetuning-phi-2-text-summarization"
PROJECT_DIR = Path(PROJECT_DIR)

OUTPUTS_DIR = PROJECT_DIR / "outputs"
REPORTS_DIR = PROJECT_DIR / "reports"
MODELS_DIR  = PROJECT_DIR / "models"

OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)
REPORTS_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# (opsional) simpan cache HF di Drive biar nggak download ulang
os.environ["HF_HOME"] = str(PROJECT_DIR / ".hf_home")
os.environ["HF_DATASETS_CACHE"] = str(PROJECT_DIR / ".hf_datasets_cache")
os.environ["TRANSFORMERS_CACHE"] = str(PROJECT_DIR / ".hf_transformers_cache")

print("PROJECT_DIR:", PROJECT_DIR)
print("OUTPUTS_DIR:", OUTPUTS_DIR)


Mounted at /content/drive
PROJECT_DIR: /content/drive/MyDrive/finetuning-phi-2-text-summarization
OUTPUTS_DIR: /content/drive/MyDrive/finetuning-phi-2-text-summarization/outputs


In [3]:
!pip -q install -U datasets evaluate transformers accelerate peft rouge-score bitsandbytes huggingface_hub


  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m512.3/512.3 kB[0m [31m27.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m557.0/557.0 kB[0m [31m36.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.1/59.1 MB[0m [31m46.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.7/47.7 MB[0m [31m56.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone


In [4]:
import os
import random
import numpy as np
import torch

from datasets import load_dataset
import evaluate

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    set_seed,
)

# Optional: BitsAndBytes (4-bit quantization). Availability depends on transformers + bitsandbytes.
try:
    from transformers import BitsAndBytesConfig
    BNB_CONFIG_AVAILABLE = True
except Exception:
    BitsAndBytesConfig = None
    BNB_CONFIG_AVAILABLE = False

# Optional: PEFT/LoRA
try:
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    PEFT_AVAILABLE = True
except Exception:
    LoraConfig = None
    get_peft_model = None
    prepare_model_for_kbit_training = None
    PEFT_AVAILABLE = False

SEED = 42
set_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
print("device:", device)
print("BNB_CONFIG_AVAILABLE:", BNB_CONFIG_AVAILABLE)
print("PEFT_AVAILABLE:", PEFT_AVAILABLE)




device: cuda
BNB_CONFIG_AVAILABLE: True
PEFT_AVAILABLE: True


In [5]:
USE_4BIT = (device == "cuda") and BNB_CONFIG_AVAILABLE  # otomatis off kalau tidak tersedia
USE_LORA = PEFT_AVAILABLE  # otomatis off kalau peft tidak tersedia

print("USE_4BIT:", USE_4BIT)
print("USE_LORA:", USE_LORA)


USE_4BIT: True
USE_LORA: True


## 1. Load dataset & model

**TODO:** pastikan nama model Phi-2 benar sesuai HuggingFace Hub yang kamu pakai.

Dataset XSum fields umumnya:
- `document`
- `summary`

In [6]:
from datasets import load_dataset
import evaluate

DATASET_REPO = "EdinburghNLP/xsum"
REV = "refs%2Fconvert%2Fparquet"  # URL-encoded

dataset = load_dataset(
    "parquet",
    data_files={
        "train":      f"hf://datasets/{DATASET_REPO}@{REV}/default/train/*.parquet",
        "validation": f"hf://datasets/{DATASET_REPO}@{REV}/default/validation/*.parquet",
        "test":       f"hf://datasets/{DATASET_REPO}@{REV}/default/test/*.parquet",
    },
)

print(dataset)
metric = evaluate.load("rouge")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})


In [7]:
print(dataset["train"].column_names)
print(dataset["train"][0])


['document', 'summary', 'id']


In [8]:
# =========================
# (Optional) Stratified sampling to reduce training time
# - We stratify by document length bins (word-count) so the subset keeps a similar distribution
# - Set USE_SUBSET=False if you really want the full dataset
from datasets import DatasetDict

USE_SUBSET = True
TRAIN_SUBSET_N = 5000
VALID_SUBSET_N = 500
TEST_SUBSET_N  = 500

def _len_bin(example):
    # word-count bins (simple + fast)
    wc = len(example["document"].split())
    if wc < 250:
        b = "short_<250"
    elif wc < 450:
        b = "medium_250-449"
    else:
        b = "long_450+"
    return {"len_bin": b}

def _sample_stratified(ds_split, n, seed=SEED):
    if n >= len(ds_split):
        return ds_split

    ds_binned = ds_split.map(_len_bin)

    frac = n / len(ds_binned)
    try:
        sampled = ds_binned.train_test_split(
            train_size=frac,
            seed=seed,
            stratify_by_column="len_bin",
        )["train"]
        sampled = sampled.remove_columns(["len_bin"])
        return sampled
    except Exception as e:
        # Fallback: random sample (if stratify fails due to rare bins)
        print("Stratified sampling failed, fallback to random sampling. Reason:", repr(e))
        return ds_split.shuffle(seed=seed).select(range(n))

if USE_SUBSET:
    dataset = DatasetDict({
        "train": _sample_stratified(dataset["train"], TRAIN_SUBSET_N),
        "validation": _sample_stratified(dataset["validation"], VALID_SUBSET_N),
        "test": _sample_stratified(dataset["test"], TEST_SUBSET_N),
    })
    print("Using subset:", {k: len(dataset[k]) for k in dataset})
else:
    print("Using full dataset:", {k: len(dataset[k]) for k in dataset})



Map:   0%|          | 0/204045 [00:00<?, ? examples/s]

Stratified sampling failed, fallback to random sampling. Reason: ValueError('Stratifying by column is only supported for ClassLabel column, and column len_bin is Value.')


Map:   0%|          | 0/11332 [00:00<?, ? examples/s]

Stratified sampling failed, fallback to random sampling. Reason: ValueError('Stratifying by column is only supported for ClassLabel column, and column len_bin is Value.')


Map:   0%|          | 0/11334 [00:00<?, ? examples/s]

Stratified sampling failed, fallback to random sampling. Reason: ValueError('Stratifying by column is only supported for ClassLabel column, and column len_bin is Value.')
Using subset: {'train': 5000, 'validation': 500, 'test': 500}


## 2. Load tokenizer & model

Untuk model causal LM:
- kita membuat prompt "Summarize: {document}\nSummary:" lalu targetnya `summary`
- saat training, label biasanya sama dengan input_ids (shift internal oleh model)

**TODO:** cek `tokenizer.pad_token` (beberapa LLM tidak punya pad token).

In [9]:
# =========================
# Configuration (edit here)
# =========================

# Model (Phi-2)
MODEL_NAME = "microsoft/phi-2"

# Sequence length for training examples (prompt + summary)
# If you get OOM, reduce to 384 or 256.
MAX_LENGTH = 384

# Optional: limit raw document chars before tokenization (helps keep prompts smaller)
MAX_DOC_CHARS = 4000

# Training hyperparameters (safe defaults for Colab GPU)
BATCH_SIZE = 2
GRAD_ACCUM = 4          # effective batch = BATCH_SIZE * GRAD_ACCUM
EPOCHS = 1
LR = 2e-4

# LoRA + 4-bit (optional)
# Auto-enable only if CUDA + BitsAndBytesConfig + PEFT are available.
USE_4BIT_LORA = bool(device == "cuda" and BNB_CONFIG_AVAILABLE and PEFT_AVAILABLE)

print("MODEL_NAME:", MODEL_NAME)
print("USE_4BIT_LORA:", USE_4BIT_LORA)



MODEL_NAME: microsoft/phi-2
USE_4BIT_LORA: True


In [10]:
# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

# Phi-2 tokenizer doesn't always define pad_token by default.
tokenizer.padding_side = "right"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

if USE_4BIT_LORA:
    # 4-bit quantization + LoRA (requires bitsandbytes + peft)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
    )

    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
else:
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

# Make sure model knows the pad token (important for generation & padding)
model.config.pad_token_id = tokenizer.pad_token_id


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

trainable params: 18,350,080 || all params: 2,798,033,920 || trainable%: 0.6558


## 3. Preprocessing



In [11]:
# Preprocessing / Tokenization
# We will train Phi-2 as a causal LM on (prompt + summary).
# IMPORTANT: do NOT create 'labels' here. Let the data collator handle labels + padding
# to avoid tensor shape errors during batching.

def build_prompt(doc: str) -> str:
    doc = doc[:MAX_DOC_CHARS]
    return (
        "Summarize the following article in 1-2 sentences.\n\n"
        f"Article:\n{doc}\n\n"
        "Summary:"
    )

def preprocess_batch(examples):
    docs = examples["document"]
    sums = examples["summary"]

    texts = []
    for d, s in zip(docs, sums):
        prompt = build_prompt(d)
        full = prompt + " " + s + tokenizer.eos_token
        texts.append(full)

    tokenized = tokenizer(
        texts,
        truncation=True,
        max_length=MAX_LENGTH,
        padding=False,  # padding will be done dynamically by the data collator
    )
    return tokenized

tokenized = dataset.map(preprocess_batch, batched=True, remove_columns=dataset["train"].column_names)
print(tokenized)


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 5000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 500
    })
})


## 4. Trainer

**TODO:** atur output_dir & logging.

In [12]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# NOTE: Some older transformers versions don't support evaluation_strategy.
# We build TrainingArguments in a version-compatible way.
import inspect
from transformers import TrainingArguments

use_fp16 = bool(torch.cuda.is_available())

base_kwargs = dict(
    output_dir=str(OUTPUTS_DIR),
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    num_train_epochs=EPOCHS,
    logging_steps=50,
    save_steps=500,
    save_total_limit=2,
    fp16=use_fp16,
    report_to="none",
)

sig = inspect.signature(TrainingArguments.__init__)
params = sig.parameters

# Newer versions
if "evaluation_strategy" in params:
    base_kwargs.update(dict(
        evaluation_strategy="steps",
        eval_steps=500,
    ))
else:
    # Older versions (fallback)
    if "evaluate_during_training" in params:
        base_kwargs["evaluate_during_training"] = True
    if "eval_steps" in params:
        base_kwargs["eval_steps"] = 500

# Keep only supported args for this transformers version
filtered_kwargs = {k: v for k, v in base_kwargs.items() if k in params}

args = TrainingArguments(**filtered_kwargs)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)


  trainer = Trainer(


## 5. Training

**TODO:** jalankan training.

In [13]:
# Train / Evaluate / Save artifacts to Google Drive
# If you already trained and only want to re-run evaluation, set DO_TRAIN = False.
import json
from datetime import datetime

DO_TRAIN = True

if DO_TRAIN:
    train_result = trainer.train()
    print(train_result)

# Save model + tokenizer
run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
save_dir = MODELS_DIR / f"phi2_xsum_{run_id}"
save_dir.mkdir(parents=True, exist_ok=True)

trainer.save_model(str(save_dir))
tokenizer.save_pretrained(str(save_dir))
print("Saved model to:", save_dir)

# Evaluate (loss on validation by default)
eval_metrics = trainer.evaluate()
print("Eval metrics:", eval_metrics)

# Save metrics
metrics_path = REPORTS_DIR / "metrics.json"
with open(metrics_path, "w") as f:
    json.dump(eval_metrics, f, indent=2)
print("Saved metrics to:", metrics_path)


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


Step,Training Loss
50,2.3702
100,2.2842
150,2.2954
200,2.2748
250,2.2842
300,2.2593
350,2.2689
400,2.2485
450,2.2783
500,2.259


  return fn(*args, **kwargs)


TrainOutput(global_step=625, training_loss=2.277365252685547, metrics={'train_runtime': 3352.7535, 'train_samples_per_second': 1.491, 'train_steps_per_second': 0.186, 'total_flos': 2.948031969964032e+16, 'train_loss': 2.277365252685547, 'epoch': 1.0})
Saved model to: /content/drive/MyDrive/finetuning-phi-2-text-summarization/models/phi2_xsum_20260111_095033


Eval metrics: {'eval_loss': 2.210211753845215, 'eval_runtime': 104.6011, 'eval_samples_per_second': 4.78, 'eval_steps_per_second': 2.39, 'epoch': 1.0}
Saved metrics to: /content/drive/MyDrive/finetuning-phi-2-text-summarization/reports/metrics.json


## 6. Evaluasi ROUGE (setelah training)

Evaluasi summarization biasanya:
- generate summary dari prompt
- bandingkan dengan reference summary (ROUGE)

**TODO:** jalankan cell ini setelah training (dan mungkin pakai subset agar cepat).

In [23]:
def generate_summary(doc: str, max_new_tokens: int = 64):
     prompt = build_prompt(doc)
     inputs = tokenizer(
         prompt,
         return_tensors="pt",
         truncation=True,
         max_length=MAX_LENGTH,
         padding=True,
     ).to(model.device)
     with torch.no_grad():
         out = model.generate(
             **inputs,
             max_new_tokens=max_new_tokens,
             do_sample=False,
             num_beams=4,
             pad_token_id=tokenizer.pad_token_id,
         )
     text = tokenizer.decode(out[0], skip_special_tokens=True)
     if "Summary:" in text:
         text = text.split("Summary:", 1)[-1].strip()
     return text

n = 200
preds, refs = [], []
for ex in dataset["validation"].select(range(n)):
     preds.append(generate_summary(ex["document"]))
     refs.append(ex["summary"])

rouge = metric.compute(predictions=preds, references=refs)
print(rouge)

{'rouge1': np.float64(0.19232826587276167), 'rouge2': np.float64(0.06826107755636898), 'rougeL': np.float64(0.16462380438176683), 'rougeLsum': np.float64(0.15845844568498543)}


## 7. Analisis

**TODO:** isi `reports/` dengan:
- ROUGE score
- contoh hasil summary bagus vs buruk
- diskusi abstractive vs extractive
- kendala truncation & panjang dokumen