## Finetuning Llama-3 8B for Related Work Generation

This notebook demonstrates the process of finetuning the Llama-3 8B model using Unsloth for the task of generating 'Related Work' sections in academic papers. The process includes data preparation, model loading, training, evaluation, and saving the finetuned model.

### 1. Setup Environment and Install Dependencies

Before we begin, we need to install the necessary libraries, including `unsloth` for efficient finetuning, `bitsandbytes` for 4-bit quantization, `transformers`, `trl`, `peft`, `accelerate` for model handling, and `evaluate`, `rouge_score`, `bert_score` for evaluation metrics.

In [45]:
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install -U bitsandbytes
# !pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate
# !pip install -U transformers accelerate
# !pip install evaluate rouge_score bert_score

In [46]:
!unzip "/content/processed_multixscience_data.zip"

Archive:  /content/processed_multixscience_data.zip
replace processed_multixscience_data/data-00000-of-00001.arrow? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [47]:
import os
from datasets import load_from_disk
from transformers import AutoTokenizer, TrainingArguments, AutoModelForCausalLM
import accelerate
from trl import SFTTrainer
from unsloth import FastLanguageModel
import torch
import evaluate
from tqdm import tqdm
import numpy as np

### 2. Load and Prepare Dataset

We will load a pre-processed dataset from a ZIP archive, which contains academic paper data structured for our 'Related Work' generation task. The dataset is expected to have 'input_text' (abstract + references) and 'related_work' (target summary) columns.

In [48]:
data_path = "/content/processed_multixscience_data"

print(f"Memuat dataset dari {data_path}...")
ds = load_from_disk(data_path)

Memuat dataset dari /content/processed_multixscience_data...


### 3. Define Prompt Formatting Functions

To prepare the dataset for instruction finetuning with Llama-3, we define a prompt template. This template structures the input (`input_text`) and output (`related_work`) into a conversational format suitable for the model, including system and user messages.

In [49]:
EOS_TOKEN = "<|eot_id|>"

def format_prompt_llama3(examples):
    # Ambil list data dari batch
    inputs = examples["input_text"]       # Context (Abstract + Refs)
    outputs = examples["related_work"]    # Ground Truth (Target)

    prompts = []

    # System Prompt: Instruksi peran untuk AI
    system_msg = (
        "You are an academic writing assistant. "
        "Write a 'Related Work' section based on the provided text. "
        "The input contains the Current Abstract followed by References (marked with @cite_n). "
        "Synthesize these references and highlight the novelty of the Current Abstract."
    )

    for input_text, output_text in zip(inputs, outputs):
        text = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_msg}<|eot_id|><|start_header_id|>user<|end_header_id|>

{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{output_text}{EOS_TOKEN}"""

        prompts.append(text)

    return { "text": prompts }

def format_prompt_llama3_val(examples):
    inputs = examples["input_text"]
    outputs = examples["related_work"]

    prompts = []

    system_msg = (
        "You are an academic writing assistant. "
        "Write a 'Related Work' section based on the provided text. "
        "The input contains the Current Abstract followed by References (marked with @cite_n). "
        "Synthesize these references and highlight the novelty of the Current Abstract."
    )

    for input_text, output_text in zip(inputs, outputs):
        text = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_msg}<|eot_id|><|start_header_id|>user<|end_header_id|>

{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
        prompts.append(text)

    return { "text": prompts }

ds_train, ds_val = ds.train_test_split(test_size=0.2, shuffle=True, seed=3407).values()

formatted_ds_train = ds_train.map(format_prompt_llama3, batched=True)
formatted_ds_val = ds_val.map(format_prompt_llama3, batched=True)

### 4. Load and Configure Llama-3 8B Model

We load the `unsloth/llama-3-8b-bnb-4bit` model using Unsloth's `FastLanguageModel`. This leverages 4-bit quantization to reduce memory usage and enables efficient LoRA (Low-Rank Adaptation) for finetuning, significantly speeding up the training process.

In [62]:
max_seq_length = 1024
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

target_modules = ["q_proj", "v_proj"]

# Tambahkan Adapter LoRA (Agar model bisa belajar)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = target_modules,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

print("Model Llama 3 8B (4-bit) siap di-train di Kaggle!")

==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model Llama 3 8B (4-bit) siap di-train di Kaggle!


### Cek saved model

In [63]:
# Define parameters needed for loading from_pretrained
max_seq_length = 1024 # This should match the value used during initial model setup
dtype = None # This should match the value used during initial model setup
load_in_4bit = True # This should match the value used during initial model setup

model_found = False
try:
  # Load save model path
  model_path = "/content/drive/MyDrive/related_works_generation_model"

  # Load the model using FastLanguageModel to ensure Unsloth's patches are applied
  loaded_model, loaded_tokenizer = FastLanguageModel.from_pretrained(
      model_name = model_path,
      max_seq_length = max_seq_length,
      dtype = dtype,
      load_in_4bit = load_in_4bit,
  )
  model_found = True
  print(f"Pre-trained model found and loaded from {model_path}.")
except Exception as e:
  print(f"Model not found or error loading: {e}. Proceeding to training a new model...")

==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Pre-trained model found and loaded from /content/drive/MyDrive/related_works_generation_model.


### Training Arguments

Di bagian ini, kita akan menentukan argumen pelatihan yang akan digunakan oleh `SFTTrainer`. Argumen ini mencakup ukuran batch, laju pembelajaran, jumlah epoch, strategi logging, dan lainnya. Tujuan utama adalah untuk mengkonfigurasi pelatihan yang efisien dan stabil untuk model bahasa besar (LLM).

In [64]:

# Define Training Arguments
training_args = TrainingArguments(
    per_device_train_batch_size = 4,      # A100 kuat
    gradient_accumulation_steps = 4,      # total effective batch = 16
    warmup_steps = 50,
    num_train_epochs = 1,
    learning_rate = 2e-4,
    fp16 = False,                         # jangan mix fp16 + bf16
    bf16 = True,                          # A100 support, lebih stabil
    logging_steps = 50,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    output_dir = "outputs",
    save_strategy="epoch",
    push_to_hub=False,
    report_to=[],
)


# Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset = formatted_ds_train if not model_found else formatted_ds_train.select(range(5)),
    eval_dataset  = formatted_ds_val   if not model_found else formatted_ds_val.select(range(5)),
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=training_args,
)

num_proc must be <= 5. Reducing num_proc to 5 for dataset of size 5.
num_proc must be <= 5. Reducing num_proc to 5 for dataset of size 5.


In [65]:
if model_found:
  # If a pre-trained model was found and loaded with FastLanguageModel, use it directly.
  # This means we skip the initial FastLanguageModel setup from cell rNYYbUwAEu6O
  # and directly use the loaded model and tokenizer for evaluation.
  print("Using loaded FastLanguageModel for evaluation. Skipping training.")
  model = loaded_model # Assign the loaded model to the 'model' variable for subsequent use
  tokenizer = loaded_tokenizer
else:
  # If no pre-trained model was found, proceed with training the current model (Unsloth PEFT model)
  # which was initialized in cell rNYYbUwAEu6O.
  print("No pre-trained model found. Starting training of the newly configured model.")
  trainer.train()
  # After training, the 'model' and 'tokenizer' variables from rNYYbUwAEu6O already hold the trained model.

Using loaded FastLanguageModel for evaluation. Skipping training.


### 5. Evaluate Model Performance

After training, we evaluate the model's performance on a held-out validation set. We generate predictions for a subset of the validation data and compare them against the original 'Related Work' sections using various metrics like ROUGE, BERTScore, and length analysis.

In [76]:
import torch

model.eval()
print("Model set to evaluation mode.")

eval_dataset = formatted_ds_val.select(range(200))
print(f"Selected {len(eval_dataset)} samples for evaluation.")

# 1. Definisikan Config "Obat Repetisi"
generation_kwargs = {
    "max_new_tokens": 150,
    "min_length": 60,
    "num_beams": 1,
    "do_sample": True,
    "temperature": 0.1,
    "top_p": 0.9,
    "no_repeat_ngram_size": 3,
    "repetition_penalty": 1.2,
}

# 3. Define the generation function
def generate_related_work(input_text):
    # Ensure pad_token_id is set for the tokenizer to prevent reorder_cache error
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

  # 2. Jalankan Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,                         # Unpack input_ids & attention_mask
            **generation_kwargs,              # Unpack config di atas
            use_cache=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id, # Gunakan pad token yang benar
        )

    generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    decoded_output = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    return decoded_output.strip()

predictions = []
original_texts = []
print("Generating predictions...")
for i, sample in enumerate(eval_dataset):
    generated_text = generate_related_work(sample["input_text"])
    predictions.append(generated_text)
    original_texts.append(sample["related_work"])
    if (i + 1) % 10 == 0:
        print(f"Generated prediction for {i + 1}/{len(eval_dataset)} samples.")

print("Prediction generation complete.")
print(f"Generated {len(predictions)} predictions.")

print("\n--- First Generated Prediction ---")
print(predictions[0])
print("\n--- First Original Related Work ---")
print(original_texts[0])

Model set to evaluation mode.
Selected 200 samples for evaluation.
Generating predictions...
Generated prediction for 10/200 samples.
Generated prediction for 20/200 samples.
Generated prediction for 30/200 samples.
Generated prediction for 40/200 samples.
Generated prediction for 50/200 samples.
Generated prediction for 60/200 samples.
Generated prediction for 70/200 samples.


Unsloth: Input IDs of shape torch.Size([1, 1078]) with length 1078 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.
Unsloth: Input IDs of shape torch.Size([1, 1045]) with length 1045 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.


Generated prediction for 80/200 samples.


Unsloth: Input IDs of shape torch.Size([1, 1062]) with length 1062 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.
Unsloth: Input IDs of shape torch.Size([1, 1315]) with length 1315 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.
Unsloth: Input IDs of shape torch.Size([1, 1380]) with length 1380 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.


Generated prediction for 90/200 samples.
Generated prediction for 100/200 samples.


Unsloth: Input IDs of shape torch.Size([1, 1172]) with length 1172 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.


Generated prediction for 110/200 samples.
Generated prediction for 120/200 samples.
Generated prediction for 130/200 samples.
Generated prediction for 140/200 samples.


Unsloth: Input IDs of shape torch.Size([1, 1505]) with length 1505 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.


Generated prediction for 150/200 samples.
Generated prediction for 160/200 samples.
Generated prediction for 170/200 samples.


Unsloth: Input IDs of shape torch.Size([1, 1059]) with length 1059 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.


Generated prediction for 180/200 samples.
Generated prediction for 190/200 samples.


Unsloth: Input IDs of shape torch.Size([1, 1339]) with length 1339 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.


Generated prediction for 200/200 samples.
Prediction generation complete.
Generated 200 predictions.

--- First Generated Prediction ---
<doc-sep> @article{2016arXiv160705690B, author = {Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas}, title = {{Enriching Word Vectors With Subword Information}}, year = 2016} @inproceedings{DBLP:journals/corr/ChenWY16a, author={Zichao Yang and Di Wang and Edward Y. Chang}, howpublished="{{\textcopyright}} Z.Yang, D.Wang, E.Chang, 2015.", journal="{\textbackslash }jmlr", month=jun #, note="Submitted on 15 Jun

--- First Original Related Work ---
Firstly, previous datasets in this area are not yet released or in their infancy for verification of their applicability as abuse ground truth gold standard. The authors of @cite_14 claim to outperform deep learning techniques to detect hate speech, derogatory language and profanity. They compare their results with a previous dataset from @cite_12 and assess the accuracy of detecting a

### 6. Calculate Comprehensive Evaluation Metrics

This section defines a function to compute ROUGE scores (for lexical similarity), BERTScore (for semantic similarity), and analyze the length of generated texts compared to the references. These metrics provide a holistic view of the model's generation quality.

In [77]:
def calculate_comprehensive_metrics(predictions, references):
    """
    Menghitung ROUGE, BERTScore, dan Rasio Panjang.

    Args:
        predictions (list): List string hasil output model.
        references (list): List string kunci jawaban asli (ground truth).

    Returns:
        dict: Dictionary berisi semua skor evaluasi.
    """

    print(f"üìä Memulai Evaluasi untuk {len(predictions)} sampel data...")
    results = {}

    # --- 1. ROUGE SCORE (Lexical / Kata) ---
    print("‚è≥ Menghitung ROUGE...")
    rouge_metric = evaluate.load("rouge")
    rouge_scores = rouge_metric.compute(
        predictions=predictions,
        references=references,
        use_stemmer=True # Penting untuk bahasa Inggris
    )
    # Konversi ke Persen (0-100)
    results['ROUGE-1'] = round(rouge_scores['rouge1'] * 100, 2)
    results['ROUGE-2'] = round(rouge_scores['rouge2'] * 100, 2)
    results['ROUGE-L'] = round(rouge_scores['rougeL'] * 100, 2)

    # --- 2. BERTSCORE (Semantic / Makna) ---
    print("‚è≥ Menghitung BERTScore (Mungkin butuh waktu & download model)...")
    bertscore_metric = evaluate.load("bertscore")
    # Gunakan batch_size agar tidak OOM
    bert_scores = bertscore_metric.compute(
        predictions=predictions,
        references=references,
        lang="en",
        batch_size=16
    )
    # Kita ambil rata-rata F1 Score dari semua data
    results['BERTScore-F1'] = round(np.mean(bert_scores['f1']) * 100, 2)
    results['BERTScore-Precision'] = round(np.mean(bert_scores['precision']) * 100, 2)
    results['BERTScore-Recall'] = round(np.mean(bert_scores['recall']) * 100, 2)

    # --- 3. LENGTH ANALYSIS (Analisis Panjang) ---
    print("‚è≥ Menghitung Statistik Panjang Teks...")
    pred_lens = [len(p.split()) for p in predictions]
    ref_lens = [len(r.split()) for r in references]

    avg_pred_len = np.mean(pred_lens)
    avg_ref_len = np.mean(ref_lens)
    length_ratio = (avg_pred_len / avg_ref_len) * 100

    results['Avg Gen Length'] = round(avg_pred_len, 1)
    results['Avg Ref Length'] = round(avg_ref_len, 1)
    results['Length Ratio (%)'] = round(length_ratio, 2)

    return results

# Jalankan Fungsi
final_metrics = calculate_comprehensive_metrics(predictions, original_texts)

# Tampilkan Hasil Rapih
print("\n" + "="*40)
print("      LAPORAN HASIL EVALUASI AKHIR      ")
print("="*40)
for metric, score in final_metrics.items():
    print(f"{metric:<20} : {score}")
print("="*40)

# Interpretasi Singkat
print("\n--- Interpretasi Cepat ---")
if final_metrics['BERTScore-F1'] > 85:
    print("Kualitas Makna SANGAT BAIK (Mirip manusia).")
elif final_metrics['BERTScore-F1'] > 80:
    print("Kualitas Makna CUKUP BAIK.")
else:
    print("Kualitas Makna KURANG (Model mungkin halusinasi/tidak nyambung).")

if final_metrics['Length Ratio (%)'] < 80:
    print("WARNING: Output model terlalu pendek dibanding referensi asli.")

üìä Memulai Evaluasi untuk 200 sampel data...
‚è≥ Menghitung ROUGE...
‚è≥ Menghitung BERTScore (Mungkin butuh waktu & download model)...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


‚è≥ Menghitung Statistik Panjang Teks...

      LAPORAN HASIL EVALUASI AKHIR      
ROUGE-1              : 16.93
ROUGE-2              : 1.31
ROUGE-L              : 9.1
BERTScore-F1         : 80.38
BERTScore-Precision  : 79.94
BERTScore-Recall     : 80.87
Avg Gen Length       : 93.2
Avg Ref Length       : 106.2
Length Ratio (%)     : 87.8

--- Interpretasi Cepat ---
Kualitas Makna CUKUP BAIK.


### 7. Perform Quality Check with Sample Outputs

To get a qualitative understanding of the model's performance, we randomly select a few samples from the validation set and display their generated 'Related Work' alongside the original reference text. This helps in visually inspecting the coherence and relevance of the generated content.

In [78]:
# Ambil 3 sampel acak
import random
indices = random.sample(range(len(predictions)), 3)

print("=== QUALITY CHECK ===")
for i in indices:
    print(f"\n[SAMPEL {i}]")
    print("GENERATED:", predictions[i])
    print("-" * 20)
    print("REFERENCE:", original_texts[i])
    print("=" * 40)

=== QUALITY CHECK ===

[SAMPEL 14]
GENERATED: <div> </div><div> Abstract‚ÄîIn recent years, there has been increasing interest in generating stylized illustrations using deep generative networks (DGNs). While DGN-based approaches have shown promising performance in synthesizing realistic-looking illustrations, they typically require massive amounts of training samples to achieve satisfactory quality. Moreover, most previous works focus only on single-image generation tasks without considering the spatial relationships among multiple objects or regions.</div></div>, <div><h3>References</h3><ul><li>A. Dosovitskiy, G. Larsson, E. Sohn, H. Wang, J.Y. Zhu, C.L. Zitnick, M. Fathy, B. Chandrasekhar,
--------------------
REFERENCE: Due to the popularity of comics, many related research topics, such as @cite_31 , @cite_25 and @cite_13 have drawn considerable research attention in computer graphics community. Particularly, several techniques have been studied to facilitate layout generation. For

### 8. Save the Finetuned Model

Finally, we merge the LoRA adapters with the base model and save the complete finetuned model and its tokenizer to a specified path. This allows us to reuse the model for inference without needing the LoRA configuration again.

In [79]:
presaved_model = model.merge_and_unload()

save_path = "/content/drive/MyDrive/related_works_generation_model_v2"

# Save full merged model
presaved_model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model saved to {save_path}")

AttributeError: 'LlamaForCausalLM' object has no attribute 'merge_and_unload'